Hi, I'm here to talk about what we are doing at my work. Does this work? I'll just change slides here. Good.

We are an ISP in Copenhagen and we do fibre-to-the-building type networks, where the high concentration of customers means we can give some quite good rates. We currently have a little more than a hundred gig going through our network. If we simplify how the network looks, we have some external connections, PNIs, IXPs and transit connections, and they all go into a router that is DFZ, default free zone, so it doesn't have a default route; it only knows the million different routes to all the networks connected to the internet. From there, traffic going into the network goes on to a few switches where we have an OSPF cloud which knows all of the internal routing, we have end users connected to that, and we have some internal streaming servers so that we don't have to fetch every unicast stream from Amsterdam or elsewhere.

We used to do this with an old big Cisco chassis router back when we had 10 gig; it was a simpler time. Then we upgraded to something like this when we moved to 40 gig, and both of these operate as router on a stick, but I'll come back to that later. Then in 2020 we changed to just running it on a 2U server with four 10 gig ports bonded together, so still 40 gig, and in 2021 we upgraded to 100 gig with some of these cards. The router on a stick I mentioned is basically when you have a single port on your router and you just use VLAN tags to differentiate between the different external connections. You start doing that because router ports are expensive, so it's a lot cheaper to aggregate it with VLANs on a switch, and it's also an easy way to scale without spending too much money.

So now we get to what TC flower is. It's an API extension to the kernel, originally developed by Mellanox I think, and it's mainly marketed as a way of making Open vSwitch more efficient by bypassing the host system. However, apart from being used in that sense, it can also be used to do forwarding or other shenanigans. Here at the bottom you can see that on the network card there is an embedded switch chip, so that virtual machines can be represented as ports in the NIC itself, and that way the offload rules can forward traffic directly into a virtual machine. We don't use it for that, however, but this is what you will find marketing-wise from all the vendors; they generically just call it something with OVS offload.

TC flower is part of TC, Linux traffic control. It operates with chains and priorities, and there are different things you can do to packets: you can change them, drop them, redirect them, go to another chain and continue there, or trap them, whereby they go to the CPU. In the hardware offload part of TC there are a few other modules besides flower that can also do hardware offload, but generically you have the flags skip_sw and skip_hw, so it's kind of the other way around: if you want something installed only in hardware you say skip software, and vice versa. It's meant to be vendor agnostic, but you really have to read deep down into the drivers for the different devices to see what they support, or just test on the hardware. A lot of the hardware can do many of the operations we need for this project, but we have only been able to test with ConnectX cards at this point.
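As a rough sketch of how the chains, priorities, skip flags and actions fit together on the command line; the interface name enp3s0f0, the prefixes and the chain numbers here are made up for illustration:

  # tc flower filters hang off the ingress hook of the port
  tc qdisc add dev enp3s0f0 ingress

  # a hardware-only rule (skip_sw): trap traffic for one destination to the CPU
  tc filter add dev enp3s0f0 ingress chain 0 pref 10 protocol ip flower skip_sw dst_ip 192.0.2.1 action trap

  # the same kind of rule kept in software only would use skip_hw instead
  tc filter add dev enp3s0f0 ingress chain 0 pref 20 protocol ip flower skip_hw dst_ip 192.0.2.2 action drop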
The way these chains, priorities and rules work is basically that the packet goes into chain zero, it takes the first matching rule with the lowest priority, and then it does whatever is there. That could be a goto to chain one, where it continues, and you can use whatever chain numbering you can recognize, but there is a limit of 256 gotos at most. The rules can also drop the packet, trap it, or redirect it, and a redirect can go out any other port in the embedded switch chip we saw before. I'm going a bit quickly because it's a short time slot.

If we look at what it takes to actually forward a packet in hardware: we need the VLAN tag to be different when the packet goes back out of the port, we need the MAC addresses to be changed, and in order to prevent routing loops we need to decrement the TTL or hop limit. There is also the checksum update, but we don't actually need a rule for that in hardware, the hardware does it automatically; if we did it in software we would need that rule. And then we push the packet back out of the port it came in on. Even on dual-port NICs we cannot push it out of the other port: at least on the ConnectX cards they are just two separate NICs on the same PCB, so they cannot talk to each other.

If we do this with a plain tc command, it looks like this. What it says is that it's an ingress rule, it has a chain and a preference, it matches a VLAN packet, it's skip_sw, and it's IPv4, this one. Then as actions it modifies the VLAN, goes on to modify the MAC addresses, decrements the TTL, updates the checksum, and pushes the packet back out again with the redirect.

If we then show this rule with the show command afterwards, we get output like this; it is a bit edited to fit on the slide, so the action order here is actually a little longer. We can see that it sets these values with these masks when it changes the MAC addresses, and that the TTL decrement is just a masked overflow. If you then use the -s option you get statistics, and we can see that a few packets and bytes got pushed in hardware and never saw the software path.

So we used this for a few years. We just did it statically for all inbound traffic, because inbound traffic, as shown in the diagram earlier, always just goes to the layer 3 internal network, so that part is static: if the site has power it will always work, otherwise we wouldn't even advertise our address space. Therefore we could do it statically for some time, but now we also wanted it to work for some high-traffic outbound prefixes, and for that we needed it to work with BGP, but more on that a bit later.
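The slide with the command is not reproduced here, but a rule along those lines might look roughly like this; the interface, VLAN IDs, MAC addresses and prefix are invented for the sketch, and whether every action actually offloads depends on the NIC firmware and driver:

  # match VLAN 100, IPv4, destination in an offloaded prefix; install in hardware only
  tc filter add dev enp3s0f0 ingress chain 4 pref 10 protocol 802.1q flower skip_sw \
      vlan_id 100 vlan_ethtype ipv4 dst_ip 198.51.100.0/22 \
    action vlan modify id 200 pipe \
    action pedit ex munge eth src set 02:00:00:aa:00:01 \
                    munge eth dst set 02:00:00:bb:00:02 \
                    munge ip ttl add 0xff pipe \
    action csum ip4h pipe \
    action mirred egress redirect dev enp3s0f0

  # list the rule; with -s you also get packet and byte statistics, including what was handled in hardware
  tc filter show dev enp3s0f0 ingress
  tc -s filter show dev enp3s0f0 ingress

The pedit "add 0xff" on the TTL is the masked overflow mentioned above: adding 255 wraps the field around, which amounts to a decrement.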
In the static case we basically have this chain and priority design, where chain zero has two rules that just split traffic into IPv4 and IPv6. Then in chain one or two we first say that all packets with an expiring TTL need to go visit the CPU, so that the CPU can send back an ICMP packet saying that the packet expired. Next it matches some link nets, so that our BGP sessions don't die: we still need BGP to work, so for the link nets and the internal traffic in our own address space we need rules specifically for that, so that it actually reaches the CPU. And then we just match the inbound destinations and go to chain 4 or 6, where we have the big rule changing the MAC addresses and the VLAN tag and so on. We ran this for a few years, until we needed something more dynamic.

This is basically how the performance looks depending on how many rules there are and where the matching rule is placed. I wasn't able to generate enough packets in a small test setup to actually get the NIC to hit the limit in the beginning, but from 15 rules and up you can see that the more rules the hardware needs to check, the slower it gets. At the moment we have around four million packets per second going through each of these routers. This is the worst case, where every single packet needs to go through that many rules; the numbers are a lot better with a varied traffic pattern.

Then, in order to do this dynamically and get all these rules installed based on BGP changes, we have made some software that just has an event loop with some netlink sockets. It talks to the network stack in the kernel, gets all the routes, links and neighbours, and generates the TC rule set based on that, updating it dynamically. We have BIRD feed the rules into a separate routing table, and the software automatically gets notified because we have a monitoring session on that netlink socket, and it then updates the rule set dynamically, so that we can use the CPU for the long tail of all the non-offloaded prefixes. On the BIRD side we just have a kernel protocol for an extra kernel table, and some pipe protocols copying select prefixes from the full routing table: all of our own address space, but also some selected CDNs and other things that we then offload.

In the future we need to make the configuration a bit more flexible, so we can also handle directly connected paths, so that someone could use this for their home connection, for instance on Fiber7 or something like that where you get 25 gig connections. We also want to use this in our hackerspace, where we have a 10 gig connection: expand it a bit, make it more flexible, so it's not only our use case. Then we need some kernel support for MTU handling and ECMP, but it's easier to work on that after we've presented this. We also need to test whether we can do this with the ports bonded together, so that we can use both ports in a dual-port NIC and just install the rules in both of them. And we've tested how much power a small embedded setup needs: with a dual-port ConnectX-5 we can run it at around 15 watts, so it's quite an efficient solution, but that's more for the hackerspace scenario than for real-world ISP stuff.
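Going back to the static chain design described at the start of this part, a rough sketch of the layout could look like this; the interface and prefixes are again made up, and the VLAN matching from the earlier rule is left out to keep it short:

  # chain 0: split by address family
  tc filter add dev enp3s0f0 ingress chain 0 pref 1 protocol ip   flower skip_sw action goto chain 1
  tc filter add dev enp3s0f0 ingress chain 0 pref 2 protocol ipv6 flower skip_sw action goto chain 2

  # chain 1: packets with an expiring TTL go to the CPU so it can answer with ICMP time exceeded
  tc filter add dev enp3s0f0 ingress chain 1 pref 1 protocol ip flower skip_sw ip_ttl 1 action trap

  # chain 1: link nets carrying the BGP sessions must also reach the CPU
  tc filter add dev enp3s0f0 ingress chain 1 pref 2 protocol ip flower skip_sw dst_ip 192.0.2.0/24 action trap

  # chain 1: inbound destinations in our own address space jump to the rewrite chain
  tc filter add dev enp3s0f0 ingress chain 1 pref 3 protocol ip flower skip_sw dst_ip 198.51.100.0/22 action goto chain 4

  # chain 4 then holds the VLAN/MAC/TTL rewrite and redirect rule shown earlier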
That was the end of it; I think that was also the time. Yeah, we are on time. Any questions?

Hi there. So, you've shown us a performance graph of the worst-case scenario. Is it better addressed in software than in hardware in that case?
There are also some performance issues on the kernel side right now, but that will probably get fixed soon. The graph was the worst case in hardware only, because it had a number of non-matching rules followed by the one matching rule for the test traffic that I generated.
And if you would execute the same scenario in software only, would that lead to better performance?
No, at around a hundred rules it matches the software performance.
So equal performance?
Yeah, after a hundred rules. So it's way better performance than software until you have a hundred rules in hardware.
Okay, thanks.

Well, actually that kind of got answered by your last question. I was just wondering, do you do NetFlow analysis or something to choose the routes that you're going to use for those hundred rules?
Yes, we have some flow analysis to do that and select the prefixes.
And what's your normal level, how many rules do you populate?
At the moment we only have around 30 or something like that, but we need to do it a bit more dynamically and a bit more cleverly, so that we put related, overlapping routes out in a separate chain; then it can check on the largest prefix and only go into the subchain if the largest prefix matches. We need a little bit more of those kinds of optimizations, but currently all the ground functionality is there for the basic use case, plus all of the reference counting internally and the chaining of the kernel primitives together, so that the same destination is only mapped in as one target and multiple routes can go to the same next hop by sharing the chain that they all jump to.

Since you punt some traffic to the CPU, how do you protect your network from pathological traffic creating a denial of service?
We don't have enough of it to really do a lot right now. It does happen occasionally, but we have so much capacity that it's rare that it's anything significant enough to want a lot of extra time spent on it. We mostly have residential customers, where it's mostly gaming and stuff like that that leads to DDoS; we don't have hosting and the like, where web shops are more targeted.

Excuse me, I did not get how many rules you can offload to the hardware.
You can offload many rules to it, but the chain of rules it checks drops in performance when you are above 15 rules, and it gets worse from there. But that was in the worst-case scenario, where all the packets go through that many rules; half of our traffic is inbound, and all the inbound traffic to our own address space we can catch in the first few rules.
It looks like a very low number of rules.
Yes, it would be nice if the numbers were higher on the amount of rules, but it just means you need to be more clever in designing how the chains are constructed.
Okay. And were you able to experiment with other hardware?
We would like to test with other hardware as well. At the moment we also have some ConnectX-4, and ConnectX-4 might actually work, but I only discovered that after I left for here. ConnectX-4 is mentioned in some documents as possibly supporting it, but it also depends on the firmware version, because a lot of the support, for instance for the TTL decrement and things like that, depends on newer firmware releases.
Okay, but they have some documents saying it's ConnectX-5 onward?
It's a feature called ASAP², that is the feature here. So if you look in the data sheets for ASAP² from NVIDIA and Mellanox, it has this feature, but some of the initial documents on it say that ConnectX-4 should work as well.
Yeah, sure, thank you.
Okay, thank you.