So hello everyone. Today I'm going to talk to you about the challenges of deploying your network workloads at different levels of abstraction, particularly from the standpoint of CPU affinity.

First things first: I'm Hadi, and I'm currently part of the vector packet processing team at Cisco, alongside Nathan and Hedy, who contributed to this presentation. We're all contributors to the FD.io VPP project, and Nathan and Hedy are active contributors to the Calico VPP data plane.

First, I'd like to introduce the concept of a network function. It's basically any application or network service which processes packets, and you're probably familiar with physical network functions, which are appliances like routers or switches. This is where VPP comes in: it handles packet processing from L2 up to L4, and it can be used in many network functions as a very performant data plane component. Rather than using a scalar packet processing approach, handling one packet at a time, it goes for a vector packet processing approach, pushing batches of packets through an optimized graph of processing nodes.

I'd also like to talk about the Calico VPP data plane. Calico in itself is a CNI, a container network interface, for Kubernetes. It allows you to deploy Kubernetes clusters with additional network and security features, and it allows for seamless communication between Kubernetes workloads, be they VMs, containers, or legacy workloads. What's nice about Calico is that it enables the use of other data planes, in this case VPP. Calico VPP allows for the introduction of IPsec tunnels and also WireGuard traffic, and it's been in GA since December 2023, if you want to check it out.

First, I want to give an overview of the CPU pinning problematic. CPU pinning is by definition binding a process or thread to a particular CPU, or at least a set of CPUs. Within the scope of network workloads, this allows us to ensure stable and optimal performance. You may have some workloads which are single-threaded and others which are multi-threaded, so you want to avoid contention. Workloads also need to be quite performant: some of them may process hundreds of millions of packets per second and require the most out of your CPUs. So the question is: how do you select CPUs for pinning? And why is CPU pinning important in the first place?

Of course, there are some concerns regarding the memory architecture of the system. If, for example, one of your network workloads is pinned on one NUMA node and tries to access memory on another NUMA node, you'll have additional latency. So the best practice is to pin your network workloads on CPUs in the same NUMA node. The same distance concern holds for any network interface card you might have: if it's present on one NUMA node and you run a network workload using that device from another node, you'll have additional latency. In short, try to be NUMA-aware. There are tools to inspect your NUMA architecture, either directly from the terminal or as picture output; a short sketch follows below.

A short recommendation would be to avoid pinning on core 0, as some processes run there by default. And if you can change kernel boot parameters and you want to set up a system for maximum performance or benchmarking, you can try to modify settings such as isolcpus to isolate cores, change the affinity of some of the IRQs, or reduce kernel noise with nohz_full to remove system clock ticks.
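As a sketch of the terminal side of this, assuming the numactl and hwloc packages are installed, and with eth0, my_workload, and the core numbers as placeholders:

    # Inspect the NUMA architecture from the terminal; lstopo (from
    # hwloc) can also render it as a picture.
    lscpu | grep -i numa
    numactl --hardware

    # Check which NUMA node a NIC sits on (-1 means no NUMA locality).
    cat /sys/class/net/eth0/device/numa_node

    # Pin a new workload to CPUs 2-3, or re-pin a running process.
    taskset -c 2,3 ./my_workload
    taskset -pc 2,3 "$PID"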
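And a sketch of those boot parameters, assuming a Debian-style GRUB setup and that cores 2 to 7 are the ones being dedicated to the workload:

    # /etc/default/grub: isolcpus keeps the scheduler off cores 2-7,
    # irqaffinity steers interrupts to cores 0-1, and nohz_full
    # reduces system clock ticks on the isolated cores.
    GRUB_CMDLINE_LINUX="isolcpus=2-7 irqaffinity=0-1 nohz_full=2-7"

    # Regenerate the GRUB configuration, then reboot for it to apply.
    sudo update-grub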
So what we attempted was to see the impact of CPU pinning on one of our example workloads. We wanted to test the connection between two VPP instances that establish an IPsec tunnel. We ran the workload locally, using only virtual interfaces, with each VPP instance using one core only. The results were as expected: with proper pinning we managed to get the best performance, around 10 gigabits per second through the IPsec tunnel, and the more we went into scenarios like cross-NUMA pinning, or using only two cores for four processes (the two VPPs and the two iperfs), the more performance loss we observed.

So, moving on to abstraction challenges. First of all, the virtual machine. Virtual machines are popular to deploy network workloads in: they allow us to abstract our hardware and have multiple isolated systems where we can run our network workloads. Containers are another option. They give us, among other things, network namespaces and a separate file system, but they also come with a Linux kernel construct that allows us to manage the resources of our containers, which is cgroups. We're also directly able to pin the threads of a network workload running within a container using the taskset command we saw previously.

If we look in particular at one controller of cgroups, the cpuset controller, it allows us to limit the set of CPUs on the host machine which are available to a container. When we do that, we can move towards dedicating some cores to a specific container or a specific workload. Of course, this needs to go in tandem with isolation, on the host machine, of the CPUs that we're going to use. And you should watch out for the difference between cgroups v1 and cgroups v2: cgroups v2 has been around a long time, since 2016, but there are many systems, especially legacy systems, which still run cgroups v1, and this changes the location from which you can fetch the currently available CPUs.

Now I'd like to talk about one of the challenges we had with VPP concerning cpusets. If you run VPP on bare metal, rather than using taskset, you can provide VPP with a list of cores you'd like it to pin itself on, to spawn threads and pin them. On bare metal, this works pretty well: we ask it to pin itself on cores 0 to 3, and it pins itself properly and runs without any problem. But here's the challenge: we introduce an abstraction, containers, and the container in this case has a limited cpuset. We ask it to only use CPUs 4 to 7 on the host machine using the cgroups cpuset. What's going to happen? We're not using taskset, VPP is trying to pin itself on 0 to 3, and this is going to fail: the pthread affinity call errors out, and VPP has to take that into consideration in its core mapping.

So what we learned from this challenge with the cpuset is that we should be aware of the cores that are exposed by the environment. The environment exposes the currently available resources, and at the same time we need to be aware of how the application running within the container instance, in this case VPP, fetches the available cores. Both points are sketched below.
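For instance, a sketch of the cgroups v1/v2 difference, with mygroup and the Docker flag as illustrative placeholders:

    # Restrict a container to host CPUs 4-7, e.g. with Docker:
    docker run --cpuset-cpus="4-7" myimage

    # cgroups v1: the cpuset controller has its own hierarchy.
    cat /sys/fs/cgroup/cpuset/mygroup/cpuset.cpus

    # cgroups v2: unified hierarchy, same information, different path.
    cat /sys/fs/cgroup/mygroup/cpuset.cpus.effective

    # From inside the container, the process view works in both cases:
    grep Cpus_allowed_list /proc/self/status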
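And a sketch of the VPP side: the core list is given through the cpu section of VPP's startup configuration, which is what fails when it falls outside the container's cpuset.

    # /etc/vpp/startup.conf: ask VPP to pin its main thread on core 0
    # and spawn worker threads pinned on cores 1-3.
    cpu {
      main-core 0
      corelist-workers 1-3
    }

    # If the container only exposes CPUs 4-7 (cgroups cpuset), the
    # pthread affinity calls for cores 0-3 fail at startup, so the
    # configured core list must fit inside the exposed cpuset.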
Similarly to our previous use case, we tried to launch an IPsec tunnel between two VPP instances, but the twist was to introduce an additional abstraction layer, which is containers. In this case we used Calico VPP to route traffic between the two VPP instances, and we wanted to see if we could expect performance results similar to bare metal despite the abstraction. By adding an additional hop, we obtained around 9 Gbps of traffic through our IPsec tunnels. Compared to the 10 Gbps obtained on bare metal, this is pretty close: we lost 1 Gbps, but we gained the additional features introduced by Calico VPP, which are additional security and isolation.

So, to close it up: if you're thinking of switching to a performant virtual network function, think about VPP. If you're thinking of adding a performant data plane to your network function and you're using Kubernetes workloads, think Calico VPP. And of course, be aware of your architecture when configuring things, especially of the layer of abstraction you're currently running your workload on. And stay tuned if you're interested in seeing more of VPP: Pim will have a talk this afternoon about how he managed to get hundreds of millions of packets per second with MPLS on commodity hardware. This is the test spec we used for our machines. Thank you for your attention.

Right now we're assuming that a set of cores has already been made available to the container, and it's going to be the role of the administrator deploying those containers to place them on specific NUMA nodes. This is an assumption we need to take: if, for example, we have a container with some cores that are present on one NUMA node and some cores that are present on another NUMA node, there's not much that can be done with such a system. So there's a need for awareness at the different layers.

Yes, I was wondering, related to big.LITTLE-design CPUs: is this a problem as well for pinning, or did you look into this?

Sorry, big.LITTLE CPUs?

The ones with performance cores and efficiency cores.

You mean the new Intel architecture, with P-cores and E-cores. No, we are not considering this; we're trying to be as agnostic as possible. Thank you.

This is going to be the last question. There's the elephant in the room about hyper-threading, and the fact that VPP does not perform well if it's scheduled on two different hyper-threads that actually sit on the same physical core. How do you schedule a workload on one core but not on its twin core, let's say?

So one of the issues that might arise if you have two threads on the same physical core using hyper-threading is that, if they're going to deal with the same packets, they might share the same cached information, and this might create some contention. This is fine if you're only going to read the same cache line, but if there are writes, there's definitely going to be contention, locking, and slowdowns. Thank you very much.
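As a practical follow-up to that last question, a minimal sketch, assuming a Linux host: the kernel exposes hyper-thread siblings in sysfs, so you can build a pin list that keeps only one logical CPU per physical core.

    # Logical CPUs sharing a physical core with CPU 2, e.g. "2,34":
    cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list

    # Keep the first logical CPU of each physical core, giving a
    # sibling-free list you can feed to taskset or a VPP core list:
    lscpu -p=CPU,CORE | grep -v '^#' | sort -t, -k2,2n -u | cut -d, -f1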