Hi everyone. So, I'm the last speaker of the day, and I'm going to talk about Kubernetes pods connecting to multiple networks. Doug already spoke about this in a slightly different way; I'll take a slightly different approach.

First, a few things about myself. I'm a software engineer at Cisco working on container networking things, and I'm a maintainer of Calico VPP, which is going to be the topic of this talk. This talk is also a bit particular in that it's the result of a collaboration effort with many awesome people, mostly at Tigera, Intel and Cisco, and a direct collaboration with Mritika Ganguly, a PE at Intel. She sadly couldn't be here today because it's quite far from the US where she lives, but I'll do my best to present her work.

First, a bit of background for this work. In the world of deploying applications, Kubernetes has really become the solution of choice when it comes to deploying large-scale services in various environments, because it provides the primitives for scalability (so MetalLB, which we saw in a previous talk, services, health checks and so on). It also provides uniformity of deployment, and it's platform agnostic, so you don't need to know what you're running on.

But coming from CNF land, so trying to deploy a network function in this environment, the story is not the same. I'll define a bit more what I mean by CNF, because it's a bit different from the standard CNF use case, the 5G one. For the sake of this presentation I'll take an example: typically, what I mean by that is a WireGuard head-end. For example, you have a customer and you want to deploy a fleet of WireGuard head-ends to give that user access to a resource in a company network, typically a printer that everybody wants to access, because people like to print.

The particularity of this use case is that it's dynamic enough to benefit from the abstractions that Kubernetes brings (and I've lost my mouse), so typically load balancing, scheduling and those kinds of things. But it has a lot of specific needs. For example, ingress has to be done in a particular way, because you have WireGuard-encrypted traffic, so typically you want to see which IP it's coming from. You also have constraints on how you receive traffic, and that's where you need multiple interfaces going into your pod. And you also require high performance, because it's encrypted traffic, so typically you want those things to run fast, and you have a lot of users using them.
So, not for that one printer, but assuming a bigger use case, we tried to design a solution for that. There are lots of components at play; I'll try to go through them quickly.

At the top we have our application, here the WireGuard VPN head-end. We want to deploy it on top of Kubernetes, so we have to choose a CNI. We went with Calico, mainly because of the cuteness of the cat, but also because it provides a really nice interface for supporting multiple data planes, and a nice BGP integration that allows us to tweak the way we process packets. For carrying packets we use FD.io's VPP as the data plane, which gave us more control over how packets are processed, and so allowed us to go deeper into how the network is actually managed at a really low level. There are also other components that come into play, but more on this later.

I'm going to go quickly over Calico and VPP because they have been presented many times. In short, Calico is a Kubernetes CNI providing a lot of great features: policies, BGP, support for really huge clusters. The point that's important for this presentation is that it has a very well-defined control plane / data plane interface, allowing us to plug new performance-oriented software underneath it without much hassle, and that's what we are going to leverage.

We chose to slip VPP underneath Calico, first because we were originally contributors to this open-source user-space networking data plane, so it was the solution of choice, but also because it has a lot of cool functionality built in and it's extensible. So I am doing a bit of publicity for the software I'm coming from, but it was a good tool for this use case. And it's also quite fast, so it really fits the needs of this application.

So how did we bind the two together? We built an agent running in a DaemonSet on every node, so the deployment is the same as for a simple pod, just with more privileges. We register this agent in Calico as a Calico data plane, using the gRPC interface and the APIs Calico exposes to decouple control plane and data plane. The agent listens for Calico events and then programs VPP accordingly. We also built a series of custom plugins for handling NAT, services and so on, and we tweaked the configuration so that things behave nicely in a container-oriented environment. With all this, we have every brick needed to bring VPP into the cluster, and so to have real control over everything that happens inside the Kubernetes networking.
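To make that agent pattern a bit more concrete, here is a minimal Go sketch of the loop just described: receive Calico data-plane events and translate them into VPP programming calls. The event type and the VPP interface below are invented for illustration; the real calico-vpp-agent uses Calico's gRPC data-plane protocol and VPP's binary API, which are considerably richer.

```go
package main

import "log"

// Hypothetical Calico event, loosely modeled on what the Calico control
// plane hands to the data-plane agent (pod added, policy updated, ...).
type CalicoEvent struct {
	Kind      string // e.g. "pod-add"
	Namespace string
	Pod       string
	Address   string // pod IP
}

// Hypothetical, minimal view of the VPP programming surface the agent needs.
// The real agent drives VPP through its binary API socket.
type VPP interface {
	CreateTunTap(netns, ifName string) (swIfIndex uint32, err error)
	AddRoute(prefix string, swIfIndex uint32) error
}

// programVPP translates one Calico event into VPP configuration.
func programVPP(vpp VPP, ev CalicoEvent) error {
	switch ev.Kind {
	case "pod-add":
		// Create the pod-facing interface in the pod's network namespace...
		idx, err := vpp.CreateTunTap("/var/run/netns/"+ev.Pod, "eth0")
		if err != nil {
			return err
		}
		// ...and route the pod address to it.
		return vpp.AddRoute(ev.Address+"/32", idx)
	default:
		return nil // services, policies, etc. elided
	}
}

// run is the agent's main loop: receive Calico events, program VPP.
func run(vpp VPP, events <-chan CalicoEvent) {
	for ev := range events {
		if err := programVPP(vpp, ev); err != nil {
			log.Printf("applying %s for %s/%s failed: %v", ev.Kind, ev.Namespace, ev.Pod, err)
		}
	}
}

func main() {
	// In the real deployment the events come from Calico's gRPC data-plane
	// interface; here the channel is empty so the sketch just compiles and exits.
	events := make(chan CalicoEvent)
	close(events)
	run(nil, events)
}
```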
So what happens exactly under the hood? What we do is swap all the network logic that was happening in Linux over to VPP, so from this configuration to this one.

The thing is, as VPP is a user-space stack, we have to do a few things a bit differently compared to what was previously done in Linux. In order to insert VPP between the host and the network, we grab the host interface, the uplink, and consume it in VPP with the appropriate driver. Then we restore host connectivity by creating a tun interface in the host's root network namespace (that's the tun/tap here), and we replicate everything on that interface: the addresses, the routes. So basically we insert ourselves as a bump in the wire on the uplink (there is a small illustrative sketch of this step at the end of this part). It's not very network-ish, but it works pretty well in that configuration.

That way we restore pod connectivity as before, with tun/taps instead of veths: we create an interface in every pod. Then everything runs normally: the Calico control plane runs normally on the host and configures the data plane functions in VPP via the agent.

So now we have the green part covered, and all those components run neatly. What we achieve with that is that when we create a pod, Kubernetes calls Calico, Calico calls VPP, and we can provide an interface that we fully handle at the network layer, directly in VPP.

But for this specific WireGuard head-end application, we need a bit more than that. We need multiple interfaces, and we also potentially have overlapping addresses, since we don't really control where the IPs are going to land. For the multiple-interface part, the obvious choice was to go with Multus, which provides multiplexing. We also chose a dedicated IPAM, Whereabouts, which we patched because it was quite simple to patch, and we brought those two pieces in.

So when I say multiple interfaces, what does that exactly entail? A typical Kubernetes deployment looks like this: each pod has a single interface, and the CNI provides pod-to-pod connectivity, typically with an encapsulation from node to node. But in our application, we want to differentiate the encrypted traffic from the clear-text traffic, so before and after the head-end, while we still want the Kubernetes SDN to operate: we still want the nice things about Kubernetes, so service IPs and everything. So it's not only multiple interfaces, it's really multiple interfaces wired into Kubernetes.
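Going back to the bump-in-the-wire step, here is a small, simplified Go sketch of the "restore host connectivity" idea, assuming the uplink is eth0 and the tap VPP created in the host namespace is called vpptap0 (both names are assumptions). The real agent performs this through VPP and handles many more corner cases; this only shows the address-and-route move, using the vishvananda/netlink library.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

// moveHostConfig copies the uplink's addresses and routes onto the tap that
// VPP created in the host's root network namespace, so the host keeps its
// connectivity once VPP owns the physical NIC. This is only a sketch of the
// idea; the real agent also handles IPv6 link-local addresses, MTU, restarts
// and many other details.
func moveHostConfig(uplinkName, tapName string) error {
	uplink, err := netlink.LinkByName(uplinkName)
	if err != nil {
		return err
	}
	tap, err := netlink.LinkByName(tapName)
	if err != nil {
		return err
	}

	// Snapshot the routes first: deleting an address below also removes the
	// routes that depend on it.
	routes, err := netlink.RouteList(uplink, netlink.FAMILY_ALL)
	if err != nil {
		return err
	}
	addrs, err := netlink.AddrList(uplink, netlink.FAMILY_ALL)
	if err != nil {
		return err
	}

	// Move the addresses from the uplink to the tap.
	for _, a := range addrs {
		a := a
		if err := netlink.AddrDel(uplink, &a); err != nil {
			return err
		}
		a.Label = "" // address labels must match the interface name, so drop them
		if err := netlink.AddrAdd(tap, &a); err != nil {
			return err
		}
	}

	// Re-create the routes, now pointing at the tap.
	for _, r := range routes {
		r := r
		r.LinkIndex = tap.Attrs().Index
		if err := netlink.RouteAdd(&r); err != nil {
			log.Printf("route %v: %v (it may already exist)", r.Dst, err)
		}
	}
	return nil
}

func main() {
	// Interface names are assumptions for the sake of the example.
	if err := moveHostConfig("eth0", "vpptap0"); err != nil {
		log.Fatal(err)
	}
}
```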
So it's more like multiple isolated networks. Conceptually, what we needed was the ability to create multiple Kubernetes networks, each network behaving a bit like a standalone cluster, stacked on top of the others. With this, we want networks that provide complete isolation from each other, meaning that traffic cannot cross from one network to another without going through the outside world.

That means we have to bind the Calico/VPP integration and Multus together to create a model where everybody is aware of that definition of networks: have a catalog of isolated networks, specify the way they communicate from node to node via VXLAN encapsulation, and have a way for pods to attach to those networks with annotations, so that in the end Kubernetes is aware of these networks and we can still maintain the SDN part of the logic.

Quickly, the way this works is that Multus calls the Calico CNI once per pod interface. We in turn receive those calls in our agent, map them to the annotations, and do our magic to provide the logic. Having the patched IPAM also allows us to support multiple IPs and to have different realms where an IP lives and gets allocated from.

From a user's perspective, what we expose is a network catalog, where networks are defined as CRDs for now. We are starting a standardization effort to bring that into Kubernetes, but that will probably take time. So right now we kept it simple, just specifying a VNI, using VXLAN by default, and passing a range. We also keep a network attachment definition from Multus with a one-to-one mapping to networks, so that we don't change too many things at once.

Then we use those networks in pod definitions, referencing them the Multus way. We can reference them as well in services, with dedicated annotations, and that way we tell our agent to program VPP so that the service applies only in a specific network. Policies work the same way. This also gives pods the ability to tweak the parameters exposed on the interface a bit more: to specify the number of queues we want, the queue depth, and also to support multiple interface types. That gives a lot of flexibility, first to get the functionality, and then to get the performance right.
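As a rough illustration of what attaching a pod to one of these networks looks like from the Kubernetes API side, here is a Go sketch that builds a pod with the standard Multus networks annotation. The network name "encrypted-net" and the per-interface tuning annotation key are assumptions made up for the example; only the k8s.v1.cni.cncf.io/networks key is the standard Multus one, and the exact keys used by the Calico/VPP integration may differ.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headEndPod builds a WireGuard head-end pod attached to one extra, isolated
// network. "encrypted-net" is a hypothetical NetworkAttachmentDefinition name,
// and the tuning annotation key is invented only to illustrate the kind of
// per-interface knobs (queues, interface type, ...) mentioned in the talk.
func headEndPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "wireguard-headend",
			Annotations: map[string]string{
				// Standard Multus annotation: request a second interface
				// wired into the "encrypted-net" network.
				"k8s.v1.cni.cncf.io/networks": "encrypted-net",
				// Hypothetical per-interface tuning annotation.
				"example.org/interface-spec": `{"encrypted-net":{"type":"memif","queues":4}}`,
			},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "headend",
				Image: "example.org/wireguard-headend:latest",
			}},
		},
	}
}

func main() {
	// Print the resulting manifest, just to see what would be submitted.
	out, _ := json.MarshalIndent(headEndPod(), "", "  ")
	fmt.Println(string(out))
}
```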
So we have multiple interfaces, and we can also size them so that the performance is appropriate for the use case we want to achieve.

The last nice feature of this is that, as we have GoBGP support, we can peer those networks with the outside world, if we have a VXLAN fabric and GoBGP supports it. That part is still a bit of a work in progress and there are a lot of things to get right, but that's the end picture we want to get to.

If we put everything together, we would get something that looks like this. Basically, when a user wants to connect to this hypothetical VPN and that hypothetical printer, the traffic gets into the cluster via GoBGP peering, so it is attracted to the green network, hits a service IP in that network, and so gets load-balanced across the nodes. Then it's decrypted in a pod, which encapsulates the traffic and passes it, for example, to a NAT pod running in user space (here I put another type of interface that is more performance-oriented), and then it exits the cluster on a different VLAN, peered with the outside world. Some parts still need to be done, but the general internal logic of the cluster is something that works, and it brings the ability for container network functions to run unmodified, with their multiple interfaces, directly in a somewhat regular cluster.

So we spoke about improving the performance of the network, of the underlying interfaces, but we can also improve the performance with which the applications in the pods consume their own interfaces. The standard way applications consume packets within pods is via the socket APIs. It's really standard, but you have to go through the kernel, and it's a code path that wasn't designed for the performance levels of modern apps. That's why GSO came up as a network stack optimization. But here, with VPP running, it would be nice to be able to bypass the network stack and pass the packets directly from VPP, without touching the kernel.

Fortunately, VPP exposes two different ways to consume those interfaces. We'll mostly go into the first one, which is memif, the memory interface. Basically, it's a packet-oriented interface standard relying on a shared memory segment for speed, and it can be leveraged by an application via a simple library, either gomemif in Go, libmemif in C, or DPDK, or even VPP itself. That provides a really high-speed way of consuming that extra interface in the pod.
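To give an idea of what consuming such a memif interface looks like from inside the pod, here is a hedged Go sketch. The MemifConn interface and dial function below are not the real gomemif or libmemif API; they are a simplified stand-in that only mirrors the general shape of such libraries (attach to the memif control socket, then poll queues for bursts of packets that live in shared memory), and the socket path is an assumption.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// MemifConn is NOT the real gomemif/libmemif API; it is a hypothetical,
// simplified stand-in that mirrors its general shape: packets are exchanged
// as bursts over rx/tx rings backed by shared memory.
type MemifConn interface {
	RxBurst(queue int, max int) ([][]byte, error)  // read up to max packets
	TxBurst(queue int, pkts [][]byte) (int, error) // write packets back out
	Close() error
}

// dial stands in for the library call that attaches to the memif control
// socket exposed inside the pod (the path is an assumption).
func dial(socketPath string) (MemifConn, error) {
	return nil, errors.New("plug in a real memif library here")
}

// forward is the kind of loop a user-space CNF would run: poll the memif,
// process each packet entirely in user space, and send it back out.
func forward(conn MemifConn, process func([]byte) []byte) {
	for {
		pkts, err := conn.RxBurst(0, 256)
		if err != nil {
			log.Printf("rx: %v", err)
			return
		}
		if len(pkts) == 0 {
			time.Sleep(10 * time.Microsecond) // or busy-poll for lowest latency
			continue
		}
		out := make([][]byte, 0, len(pkts))
		for _, p := range pkts {
			out = append(out, process(p))
		}
		if _, err := conn.TxBurst(0, out); err != nil {
			log.Printf("tx: %v", err)
			return
		}
	}
}

func main() {
	conn, err := dial("/run/memif/memif.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Identity processing as a placeholder for NAT, WireGuard decryption, ...
	forward(conn, func(p []byte) []byte { return p })
}
```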
And the really nice thing about this is that it also brings the connection between the Kubernetes network, the Kubernetes SDN, and the pod into user space, meaning that this connection now lives in a regular C program, so it's easier to leverage CPU optimizations and new features. And that's where the silicon re-enters the picture, with the work from Mritika and her team at Intel.

They benchmarked this kind of setup, and also introduced an optimization that's coming in the fourth-generation Intel Xeon, called the Data Streaming Accelerator (DSA). Basically, it's a way to optimize copies between processes on those CPUs. So what they did is compare the performance that we get with a Kubernetes cluster, multiple interfaces, and a simple pod, so not bringing in the whole VPN logic, just doing plain L3 forwarding, and seeing how fast things could go between regular kernel interfaces (the tun), the memory interfaces, and the memory interfaces leveraging those optimizations in the CPU.

That gives these graphs, which have a lot of numbers in them, but I'll try to sum up quickly what they show. There are two MTUs here, 1,500 bytes and 9,000 bytes. The performance for tun interfaces is in dark blue, the regular memif is in blue, and the DSA-optimized memif is in yellow. Basically, it makes a really huge difference: throughput with DSA is 2.3 times higher than with a regular memif for 1,500-byte packets, and with DSA enabled it's 23 times faster than the tun/tap. With a 9,000-byte MTU, you get more than 60 times faster with the DSA-optimized memif. The number that's really interesting is that you basically get a single core doing 100 gigabits per second with that, and without too many modifications to the applications. You just spin up a regular cluster, and if the CPU supports it, you use a regular library and you're able to consume packets at really huge speeds without modifying the application too much.

There is another graph looking at the scaling with the number of cores, both with small MTUs and large MTUs. Basically it shows that we can spare cores: the tun/tap does not scale very well, the regular memif scales from 1 to 6 cores, and DSA achieves the same results with 2 to 3 times fewer cores than its regular memif counterpart.
So basically you achieve 100 gigabits per second, which was the limit of the setup, with a single core in the case of large MTUs and 3 cores in the case of smaller MTUs.

So that's all for the talk. Sorry I went through a variety of different subjects, but this topic goes in a lot of different directions. Basically, that was to give you another view of the direction we are trying to go: bringing all those pieces together in a framework that allows us to make those CNFs run in a Kubernetes environment. This work is open source; the details of the tests that were done are in the following slides. You can find us on GitHub, and there is also an open Slack channel where you can ask questions. And we have a new release coming up in beta, aiming for GA, that's going to come out soon. So thanks a lot for listening, here are the details, and I'm open for questions if you have any.

[Audience question] Just one question for the sake of it: have you ever thought about some shared memory between the different parts, to eliminate the need to copy the packets over?

[Answer] So we thought of this, and there are different ways to do that. There is VCL, which I haven't spoken about, which is a way of opening sockets directly in VPP: basically you do a listen in VPP for TCP, UDP or a given protocol, just like with the socket APIs. With that, the data never leaves VPP, and you can do direct transfers between processes without extra copies, because everything stays in VPP in the end. For memif, we don't support that out of the box, but nothing forbids you from spawning two pods and making them share a socket; it's only shared memory, so you can do it directly without having to spin up the whole thing. You could even do that in any cluster, or directly on bare metal. Memif is really a lightweight protocol, so you can do that with just a regular socket.

Okay, cool. Thank you very much.
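As a footnote to that last answer, here is a hedged Go sketch of the simplest variant of the idea: two containers in one pod exchanging packets over memif directly, by sharing the directory that holds the memif control socket (and the shared-memory regions negotiated over it) through an in-memory emptyDir volume. Image names and the mount path are assumptions; for two separate pods, a hostPath or another shared mount would play the same role.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sharedMemifPod sketches two containers talking memif to each other without
// VPP in the middle: they simply share the directory holding the memif control
// socket via an in-memory emptyDir. Image names and the mount path are
// assumptions for the sake of the example.
func sharedMemifPod() *corev1.Pod {
	shared := corev1.Volume{
		Name: "memif-sock",
		VolumeSource: corev1.VolumeSource{
			EmptyDir: &corev1.EmptyDirVolumeSource{Medium: corev1.StorageMediumMemory},
		},
	}
	mount := corev1.VolumeMount{Name: "memif-sock", MountPath: "/run/memif"}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "memif-pair"},
		Spec: corev1.PodSpec{
			Volumes: []corev1.Volume{shared},
			Containers: []corev1.Container{
				{Name: "server", Image: "example.org/memif-server", VolumeMounts: []corev1.VolumeMount{mount}},
				{Name: "client", Image: "example.org/memif-client", VolumeMounts: []corev1.VolumeMount{mount}},
			},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(sharedMemifPod(), "", "  ")
	fmt.Println(string(out))
}
```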