Hi, I'm here to talk about what we are doing at my work. Does this work? I'll just change slides here. Good.

We are an ISP in Copenhagen and we do fibre-to-the-building type networks, where the high concentration of customers means we can give some quite good rates. We currently have a little more than a hundred gig going through our network. If we simplify how the network looks, we have some external connections, PNIs, IXPs and transit connections, and they all go into a router that is DFZ, default free zone, so it doesn't have a default route; it only knows the million different routes to all the networks connected to the internet. From there, traffic going into the network goes on to a few switches where we have an OSPF cloud which knows all of the internal routing, we have end users connected to that, and we have some internal streaming servers so that we don't have to fetch every unicast stream from Amsterdam or elsewhere.

We used to do this with an old big Cisco chassis router back when we had 10 gig; it was a simpler time. Then we upgraded to something like this when we moved to 40 gig, and both of these operate as router on a stick, but I'll come back to that later. Then in 2020 we changed to just running it on a 2U server with four 10 gig ports bonded together, so still 40 gig, and in 2021 we upgraded to 100 gig with some of these cards. The router on a stick I mentioned is basically when you have a single port on your router and you just use VLAN tags to differentiate between the different external connections. You start doing that because router ports are expensive, so it's a lot cheaper to aggregate it with VLANs on a switch, and it's also an easy way to scale without spending too much money.

So now we get to what TC flower is. It's an API extension to the kernel, originally developed by Mellanox I think, and it's mainly marketed as a way of making Open vSwitch more efficient by bypassing the host system. However, apart from being used in that sense, it can also be used to do forwarding or other shenanigans. Here at the bottom you can see that on the network card there is an embedded switch chip, so that virtual machines can be represented as ports in the NIC itself, and that way the offload rules can forward traffic directly into a virtual machine. We don't use it for that, however, but this is what you will find marketing-wise from all the vendors; they generically just call it something with OVS offload.

TC flower is part of TC, Linux traffic control. It operates with chains and priorities, and there are different things you can do to packets: you can change them, drop them, redirect them, go to another chain and continue there, or trap them, whereby they go to the CPU. In the hardware offload part of TC there are a few other modules besides flower that can also do hardware offload, but generically you have the flags skip_sw and skip_hw, so it's kind of the other way around: if you want something installed only in hardware you say skip software, and vice versa. It's meant to be vendor agnostic, but you really have to read deep down into the drivers for the different devices to see what they support, or just test on the hardware. A lot of the hardware can do many of the operations we need for this project, but we have only been able to test with ConnectX cards at this point.
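As a rough sketch of how the chains, priorities, skip flags and actions fit together on the command line; the interface name enp3s0f0, the prefixes and the chain numbers here are made up for illustration:

  # tc flower filters hang off the ingress hook of the port
  tc qdisc add dev enp3s0f0 ingress

  # a hardware-only rule (skip_sw): trap traffic for one destination to the CPU
  tc filter add dev enp3s0f0 ingress chain 0 pref 10 protocol ip flower skip_sw dst_ip 192.0.2.1 action trap

  # the same kind of rule kept in software only would use skip_hw instead
  tc filter add dev enp3s0f0 ingress chain 0 pref 20 protocol ip flower skip_hw dst_ip 192.0.2.2 action drop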
The way these chains, priorities and rules work is basically that the packet goes into chain zero, it takes the first matching rule with the lowest priority, and then it does whatever is there. That could be a goto to chain one, where it continues, and you can use whatever chain numbering you can recognize, but there is a limit of 256 gotos at most. The rules can also drop the packet, trap it, or redirect it, and a redirect can go out any other port in the embedded switch chip we saw before. I'm going a bit quickly because it's a short time slot.

If we look at what it takes to actually forward a packet in hardware: we need the VLAN tag to be different when the packet goes back out of the port, we need the MAC addresses to be changed, and in order to prevent routing loops we need to decrement the TTL or hop limit. There is also the checksum update, but we don't actually need a rule for that in hardware, the hardware does it automatically; if we did it in software we would need that rule. And then we push the packet back out of the port it came in on. Even on dual-port NICs we cannot push it out of the other port: at least on the ConnectX cards they are just two separate NICs on the same PCB, so they cannot talk to each other.

If we do this with a plain tc command, it looks like this. What it says is that it's an ingress rule, it has a chain and a preference, it matches a VLAN packet, it's skip_sw, and it's IPv4, this one. Then as actions it modifies the VLAN, goes on to modify the MAC addresses, decrements the TTL, updates the checksum, and pushes the packet back out again with the redirect.

If we then show this rule with the show command afterwards, we get output like this; it is a bit edited to fit on the slide, so the action order here is actually a little longer. We can see that it sets these values with these masks when it changes the MAC addresses, and that the TTL decrement is just a masked overflow. If you then use the -s option you get statistics, and we can see that a few packets and bytes got pushed in hardware and never saw the software path.

So we used this for a few years. We just did it statically for all inbound traffic, because inbound traffic, as shown in the diagram earlier, always just goes to the layer 3 internal network, so that part is static: if the site has power it will always work, otherwise we wouldn't even advertise our address space. Therefore we could do it statically for some time, but now we also wanted it to work for some high-traffic outbound prefixes, and for that we needed it to work with BGP, but more on that a bit later.
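The slide with the command is not reproduced here, but a rule along those lines might look roughly like this; the interface, VLAN IDs, MAC addresses and prefix are invented for the sketch, and whether every action actually offloads depends on the NIC firmware and driver:

  # match VLAN 100, IPv4, destination in an offloaded prefix; install in hardware only
  tc filter add dev enp3s0f0 ingress chain 4 pref 10 protocol 802.1q flower skip_sw \
      vlan_id 100 vlan_ethtype ipv4 dst_ip 198.51.100.0/22 \
    action vlan modify id 200 pipe \
    action pedit ex munge eth src set 02:00:00:aa:00:01 \
                    munge eth dst set 02:00:00:bb:00:02 \
                    munge ip ttl add 0xff pipe \
    action csum ip4h pipe \
    action mirred egress redirect dev enp3s0f0

  # list the rule; with -s you also get packet and byte statistics, including what was handled in hardware
  tc filter show dev enp3s0f0 ingress
  tc -s filter show dev enp3s0f0 ingress

The pedit "add 0xff" on the TTL is the masked overflow mentioned above: adding 255 wraps the field around, which amounts to a decrement.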
In the static case we basically have this chain and priority design, where chain zero has two rules that just split traffic into IPv4 and IPv6. Then in chain one or two we first say that all packets with an expiring TTL need to go visit the CPU, so that the CPU can send back an ICMP packet saying that the packet expired. Next it matches some link nets, so that our BGP sessions don't die: we still need BGP to work, so for the link nets and the internal traffic in our own address space we need rules specifically for that, so that it actually reaches the CPU. And then we just match the inbound destinations and go to chain 4 or 6, where we have the big rule changing the MAC addresses and the VLAN tag and so on. We ran this for a few years, until we needed something more dynamic.

This is basically how the performance looks depending on how many rules there are and where the matching rule is placed. I wasn't able to generate enough packets in a small test setup to actually get the NIC to hit the limit in the beginning, but from 15 rules and up you can see that the more rules the hardware needs to check, the slower it gets. At the moment we have around four million packets per second going through each of these routers. This is the worst case, where every single packet needs to go through that many rules; the numbers are a lot better with a varied traffic pattern.

Then, in order to do this dynamically and get all these rules installed based on BGP changes, we have made some software that just has an event loop with some netlink sockets. It talks to the network stack in the kernel, gets all the routes, links and neighbours, and generates the TC rule set based on that, updating it dynamically. We have BIRD feed the rules into a separate routing table, and the software automatically gets notified because we have a monitoring session on that netlink socket, and it then updates the rule set dynamically, so that we can use the CPU for the long tail of all the non-offloaded prefixes. On the BIRD side we just have a kernel protocol for an extra kernel table, and some pipe protocols copying select prefixes from the full routing table: all of our own address space, but also some selected CDNs and other things that we then offload.

In the future we need to make the configuration a bit more flexible, so we can also handle directly connected paths, so that someone could use this for their home connection, for instance on Fiber7 or something like that where you get 25 gig connections. We also want to use this in our hackerspace, where we have a 10 gig connection: expand it a bit, make it more flexible, so it's not only our use case. Then we need some kernel support for MTU handling and ECMP, but it's easier to work on that after we've presented this. We also need to test whether we can do this with the ports bonded together, so that we can use both ports in a dual-port NIC and just install the rules in both of them. And we've tested how much power a small embedded setup needs: with a dual-port ConnectX-5 we can run it at around 15 watts, so it's quite an efficient solution, but that's more for the hackerspace scenario than for real-world ISP stuff.
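Going back to the static chain design described at the start of this part, a rough sketch of the layout could look like this; the interface and prefixes are again made up, and the VLAN matching from the earlier rule is left out to keep it short:

  # chain 0: split by address family
  tc filter add dev enp3s0f0 ingress chain 0 pref 1 protocol ip   flower skip_sw action goto chain 1
  tc filter add dev enp3s0f0 ingress chain 0 pref 2 protocol ipv6 flower skip_sw action goto chain 2

  # chain 1: packets with an expiring TTL go to the CPU so it can answer with ICMP time exceeded
  tc filter add dev enp3s0f0 ingress chain 1 pref 1 protocol ip flower skip_sw ip_ttl 1 action trap

  # chain 1: link nets carrying the BGP sessions must also reach the CPU
  tc filter add dev enp3s0f0 ingress chain 1 pref 2 protocol ip flower skip_sw dst_ip 192.0.2.0/24 action trap

  # chain 1: inbound destinations in our own address space jump to the rewrite chain
  tc filter add dev enp3s0f0 ingress chain 1 pref 3 protocol ip flower skip_sw dst_ip 198.51.100.0/22 action goto chain 4

  # chain 4 then holds the VLAN/MAC/TTL rewrite and redirect rule shown earlier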
That was the end of it; I think that was also the time. Yeah, we are on time. Any questions?

Hi there. So, you've shown us a performance graph of the worst-case scenario. Is it better addressed in software than in hardware in that case?
There are also some performance issues on the kernel side right now, but that will probably get fixed soon. The graph was the worst case in hardware only, because it had a number of non-matching rules followed by the one matching rule for the test traffic that I generated.
And if you would execute the same scenario in software only, would that lead to better performance?
No, at around a hundred rules it matches the software performance.
So equal performance?
Yeah, after a hundred rules. So it's way better performance than software until you have a hundred rules in hardware.
Okay, thanks.

Well, actually that kind of got answered by your last question. I was just wondering, do you do NetFlow analysis or something to choose the routes that you're going to use for those hundred rules?
Yes, we have some flow analysis to do that and select the prefixes.
And what's your normal level, how many rules do you populate?
At the moment we only have around 30 or something like that, but we need to do it a bit more dynamically and a bit more cleverly, so that we put related, overlapping routes out in a separate chain; then it can check on the largest prefix and only go into the subchain if the largest prefix matches. We need a little bit more of those kinds of optimizations, but currently all the ground functionality is there for the basic use case, plus all of the reference counting internally and the chaining of the kernel primitives together, so that the same destination is only mapped in as one target and multiple routes can go to the same next hop by sharing the chain that they all jump to.

Since you punt some traffic to the CPU, how do you protect your network from pathological traffic creating a denial of service?
We don't have enough of it to really do a lot right now. It does happen occasionally, but we have so much capacity that it's rare that it's anything significant enough to want a lot of extra time spent on it. We mostly have residential customers, where it's mostly gaming and stuff like that that leads to DDoS; we don't have hosting and the like, where web shops are more targeted.

Excuse me, I did not get how many rules you can offload to the hardware.
You can offload many rules to it, but the chain of rules it checks drops in performance when you are above 15 rules, and it gets worse from there. But that was in the worst-case scenario, where all the packets go through that many rules; half of our traffic is inbound, and all the inbound traffic to our own address space we can catch in the first few rules.
It looks like a very low number of rules.
Yes, it would be nice if the numbers were higher on the amount of rules, but it just means you need to be more clever in designing how the chains are constructed.
Okay. And were you able to experiment with other hardware?
We would like to test with other hardware as well. At the moment we also have some ConnectX-4, and ConnectX-4 might actually work, but I only discovered that after I left for here. ConnectX-4 is mentioned in some documents as possibly supporting it, but it also depends on the firmware version, because a lot of the support, for instance for the TTL decrement and things like that, depends on newer firmware releases.
Okay, but they have some documents saying it's ConnectX-5 onward?
It's a feature called ASAP², that is the feature here. So if you look in the data sheets for ASAP² from NVIDIA and Mellanox, it has this feature, but some of the initial documents on it say that ConnectX-4 should work as well.
Yeah, sure, thank you.
Okay, thank you.