Yeah, so let's go right into the topic. I'm Alex, I work for SWITCH. SWITCH is the national research and education network in Switzerland; most countries have something like us, in Belgium it's Belnet, in Germany it's DFN, and in France it's Renater. We connect the Swiss universities and universities of applied sciences; we are the ISP of those institutions.

So, NetFlow. I'm not sure if everyone is familiar with NetFlow, so I'll just recap the central idea of what a network flow actually is. When you look at an IP packet, you extract the source and destination addresses, the IP protocol and, if the protocol is UDP or TCP, also the source and destination ports. Those five numbers identify a flow, and every packet with the same values is said to belong to the same flow. In the simplest possible scheme you just aggregate: you count the bytes and packets of all the packets that belong to a flow, and then you export this information to a collector where you can analyze the data.

This is an old thing. These days people talk about network telemetry, but back when this was developed that name didn't exist yet. I'm not sure when exactly Cisco came up with it, but it must have been the early or mid 90s. It was a de facto standard for a long time; people just figured out what Cisco did and then did the same thing, and it finally got properly standardized with the IETF IPFIX standard.

You can do this either in sampled or unsampled mode. Unsampled means you look at every single packet and account for it in a flow; with sampling you only look at every nth packet, and then you have to make certain assumptions to reconstruct the actual values.
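To make that concrete, here is a minimal Lua sketch of this kind of per-flow aggregation (Lua because that is what Snabb, discussed later, is written in). It is written for this text, not taken from Snabbflow, and the packet field names are assumptions for illustration.

```lua
-- Minimal sketch of NetFlow-style aggregation (illustrative only).
-- A "packet" here is assumed to be a table of already-parsed header fields.
local flows = {}

-- Build the 5-tuple key that identifies a flow.
local function flow_key (pkt)
  return table.concat({ pkt.src_ip, pkt.dst_ip, pkt.protocol,
                        pkt.src_port or 0, pkt.dst_port or 0 }, "|")
end

-- Account one packet towards its flow, creating the flow on first sight.
local function account (pkt)
  local key = flow_key(pkt)
  local flow = flows[key]
  if not flow then
    flow = { packets = 0, bytes = 0, first_seen = os.time() }
    flows[key] = flow
  end
  flow.packets = flow.packets + 1
  flow.bytes = flow.bytes + pkt.length
  flow.last_seen = os.time()
end

-- Example usage:
account{ src_ip = "192.0.2.1", dst_ip = "198.51.100.7", protocol = 6,
         src_port = 443, dst_port = 52311, length = 1500 }
```

With 1-in-N sampling you would update these counters only for the sampled packets and later multiply by N to estimate the true totals, which is the upscaling mentioned below.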
At SWITCH we've been using NetFlow for a very long time, since the mid 1990s, as the most important means to analyze our network data. It used to be provided by the routers themselves, which is reasonable: the packets pass through that device anyway, so the device has immediate access to them and can construct the flow data itself. Initially that was done in software, then it was done in hardware, and it was basically always unsampled. But with the advent of more powerful networking gear, and especially with the arrival of 100 gig ports, it became basically unfeasible to do this on the routers themselves, because of software restrictions and also hardware restrictions. If you want to do this in software, you usually can't, because the routers are not very powerful in terms of CPU, and in hardware it becomes very expensive. So the vendors started to implement only sampled NetFlow; these days, if you buy a Cisco or a Juniper box and you do NetFlow, you get sampling.

And sampling is fine, of course, if you're only interested in aggregate data anyway. For big aggregated traffic flows between networks, for instance, sampling is perfectly fine: you make certain assumptions about the traffic, you just upscale, and you get fairly reasonable numbers.

So why would you even want to do unsampled NetFlow? Well, there are a couple of use cases where it's really useful. In terms of security, for instance, one thing sampling is fine for is detecting DDoS, volumetric DDoS; that's very simple, you basically have a constant packet rate, and if you just look at every nth packet it's easy to scale this up. But if you want to detect a bot in your network, for instance, it's more difficult. Maybe you want to do this by looking at the communication with the command and control channels; those are short-lived flows, and if you do sampling you're probably going to miss them. But with unsampled NetFlow you see every single flow, so you can identify these things.

And we as a network operator use this fairly often to troubleshoot network problems. If a customer complains that they can't reach a certain IP address on the internet, we can actually go look in our flows for the outgoing TCP SYN packet and see whether there's a TCP SYN coming back in. We can do this because we see every single flow, so this is extremely useful.
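As a rough illustration of that troubleshooting use case, here is a small sketch of checking a set of flow records for an outgoing flow towards the problematic address and a returning flow from it. It is not the tooling SWITCH actually uses, and the record field names are assumptions.

```lua
-- Sketch: given a list of flow records, check whether traffic towards
-- `target` leaves the network and whether anything comes back from it.
local function check_reachability (records, customer_ip, target)
  local outgoing, returning = false, false
  for _, r in ipairs(records) do
    if r.protocol == 6 then                        -- TCP
      if r.src_ip == customer_ip and r.dst_ip == target then
        outgoing = true                             -- the SYN went out
      elseif r.src_ip == target and r.dst_ip == customer_ip then
        returning = true                            -- something came back in
      end
    end
  end
  return outgoing, returning
end
```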
But as I said, we can no longer do that on our big new core routers; they only give us sampled NetFlow. So we started to do this with an external box, and that's where the Snabbflow software implementation comes in. I mean, there are always ways to do that, but they might be very expensive, for instance if you have to buy dedicated hardware.

Just to give an idea of what type of traffic we're dealing with: Switzerland is a small country, we are a small network, and we only do NetFlow on our borders, on the traffic that we exchange with neighboring networks. The peak values these days are roughly 180 gigabits per second, something like that, about 20 million packets per second, and roughly 350,000 flows per second, unsampled. And the flow rate can actually be even much higher, because of the aggressive scanning that has been going on for the past couple of years: people perform very aggressive network scans, plain TCP SYN scans, as fast as they can, so sometimes a single host can easily generate 100,000 flows per second.

The actual IPFIX traffic that the exporter generates, the flow records themselves, is in the order of 200 to 300 megabits per second; this is all for the unsampled flows. The average flow rate is maybe just around 200,000 per second, and the actual NetFlow data it generates is roughly 1.5 terabytes per day, so the scaling problem is really more on the collector side than on the exporter side. We have 10 gig, 100 gig and 400 gig ports, so that's what our solution needs to support.

We used to do this historically on the routers themselves until a couple of years ago. Then we moved to a commercial NetFlow generator that did it in hardware, which was pretty expensive, maybe 100,000 euros for the whole solution for one PoP, something like that. And then we finally moved to Snabbflow and pure software.

So how do we do this? On the borders these are all fiber connections, so we have optical splitters and create a copy of all the traffic. The primary device these taps are connected to is what we call a packet broker; it's basically a switch that aggregates all the packets and sends them out on two 100 gig ports to our actual exporter box. It uses VLAN tags to identify the ports: in NetFlow we also want to keep track of the router ports where the traffic was sent or received, and because that information gets lost when you aggregate the traffic, we use VLANs to tag it.

The boxes we use are white-box switches based on the Tofino ASIC, the one that Intel unfortunately just decided to stop developing. These are very nice boxes: there's one with 32 100 gig ports for 5,000 euros, and another one with 32 400 gig ports that costs about 20,000 euros. The thing is, you have to program them yourself; when you buy them they're just plain hardware, and you can use the P4 language to do this. I link here to another project of mine where I actually developed the P4 program to do that, so that's also part of this entire architecture.
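For illustration, here is a small Lua sketch of how the exporter side could recover the original router port from the VLAN tag that the packet broker pushes onto each mirrored frame, as described above. This is not the actual Snabbflow code, and the VLAN IDs and interface names are made up.

```lua
-- Sketch: map the 802.1Q VLAN ID added by the packet broker back to the
-- router interface the packet was originally seen on (hypothetical IDs).
local vlan_to_interface = {
  [101] = { router = "border-1", port = "et-0/0/1" },
  [102] = { router = "border-1", port = "et-0/0/2" },
}

-- frame: an Ethernet frame as a Lua byte string; returns the VLAN ID or nil.
local function vlan_id (frame)
  local tpid = string.byte(frame, 13) * 256 + string.byte(frame, 14)
  if tpid == 0x8100 then                    -- 802.1Q tag present
    local tci = string.byte(frame, 15) * 256 + string.byte(frame, 16)
    return tci % 4096                       -- low 12 bits are the VLAN ID
  end
end

-- Usage: local ifinfo = vlan_to_interface[vlan_id(frame)]
```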
Then the traffic gets to the NetFlow exporter box, which is currently just a one-rack-unit, basic rack-mount server. We use AMD Epycs these days, mainly with a fairly large number of cores; that's the way we scale, with the number of cores. NetFlow always scales very well with cores, because you just have to make sure that you keep the packets belonging to a flow together. The exporter has a Mellanox ConnectX-4 100 gig card that's connected to the packet broker; that's where it receives the packets.

So in a picture, this is what it looks like. On the upper left is our border router, on the upper right is the border router of our neighboring network, and in the middle you have the optical splitter, which is a completely passive box, just an optical splitter. Then you have this packet broker switch in between that aggregates all the packets and distributes them by flow over, currently, these two links. These are two 100 gig ports between the broker and the exporter; we can easily add more ports if that's not sufficient, and on the Snabbflow exporter we can basically just add more cores to be able to scale.

So now let's hear Max talk about the actual software.

Hello, hello, does this work? Good, all right. Now that we know how Snabbflow is deployed, I want to talk about how it's built, how it scales, how you configure it, how you monitor the running application, and so on.

Snabbflow, as the name suggests, is built using Snabb. Snabb is a toolkit for writing high-performance networking applications. Snabb is written in Lua, using the amazing LuaJIT compiler, and it does packet I/O without going through the kernel; the Linux kernel networking stack is generally slow from an ISP perspective, so Snabb bypasses it and uses its own device drivers. This is often called kernel-bypass networking, and I think nowadays it's fairly common. Snabb is open source and independent; we're not sponsored by any vendor in particular.

Snabb is built with three core values in mind: we prefer simple designs over complex designs, we prefer our software to be small rather than large, and we are open: you can read the source, you can understand it, you can modify it, you can rewrite it, and so on.

Here I have a snippet of code taken directly from Snabbflow, unedited, just to give you an idea of what the Lua code that powers a typical Snabb application looks like. In this particular example we read a batch of packets from an incoming link, we extract some metadata that tells us which flow each packet belongs to, then we look up a matching flow in the flow table that we maintain; if we already have a flow we count that packet towards it, and if not we create a new entry in the flow table.
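The slide itself is not reproduced in this transcript. The following is a hedged sketch of what a loop like the one just described could look like; it is an approximation written for this text, not the actual Snabbflow source, and the extract_flow_key helper and the flow-table methods are assumptions.

```lua
-- Sketch of a per-packet processing loop in the style described above.
local link   = require("core.link")     -- Snabb's link API
local packet = require("core.packet")   -- Snabb's packet API

local FlowMeter = {}  -- a Snabb app; the engine fills in its input/output links

function FlowMeter:push ()
  local input = self.input.input
  while not link.empty(input) do
    local p = link.receive(input)              -- next packet from the batch
    local key  = self:extract_flow_key(p)      -- assumed helper: parse the 5-tuple
    local flow = self.flow_table:lookup(key)   -- assumed flow-table API
    if flow then
      flow.packets = flow.packets + 1          -- count towards the known flow
      flow.bytes   = flow.bytes + p.length
    else
      self.flow_table:add(key, { packets = 1, bytes = p.length })
    end
    packet.free(p)                             -- the packet is fully accounted for
  end
end
```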
I've got one more snippet. This function is called every now and then to actually export the flows: we walk over a section of the flow table, add flow aggregates from that flow table into the next export data record, and if it's time to export the data record we send it off to an IPFIX collector, which is a separate program.

From a bird's-eye view, Snabbflow works roughly like this: we read packets from a 100 gigabit NIC, the garden hose so to speak, we process those packets to extract flow information in a Snabb process, and then we send off data records over a tun/tap interface to the IPFIX collector. On the right side here you have a device driver that is part of Snabb and written in Lua; that's where the actual traffic, the bulk of it, comes in. On the left side you have an interface to the Linux network stack; since the flow export data is rather small in comparison, you can just send that over the regular Linux network stack, and that works. And on the very left you have the IPFIX collector, which is a different application, a separate program that we send the flow data to in the end.

Sadly, or I guess just obviously, a single CPU core is not enough to handle 100 gigabits of traffic. So instead we use receive-side scaling provided by the network device; this way we can process n different sets of flows in n different processes running on n different CPU cores. Every circle here is a CPU core.

We also support repeating basically the same trick in software, so we can do another round of receive-side scaling after filtering the traffic by protocol. This way we can process, for example, DNS traffic on a different set of cores than the rest of the IP traffic, and that way we can segregate the server resources between the workloads we actually care about. We might, for example, care more about having an accurate general IP flow profile to send to the collectors, and if we still have some time left we will also do some DNS analysis, but we don't want one to slow down the other.

Snabb programs are organized into independent apps. An app is a logical packet-processing component; it could be, for example, a device driver, or an app that implements the Address Resolution Protocol. These apps are combined into applications like Snabbflow using links. Links are unidirectional, they are really just ring buffers, and any app can have any number of them to use as input or output for packet data.
You use those links as shown here; that's basically the API: you call link.receive on an input link to receive a packet, and you call link.transmit on an output link to send a packet.

Now, to forward packets from one CPU core to another we have this thing called interlinks. These are really just like regular links, except that they span process and CPU core boundaries, and you can use them just like any other link; you have the same interface to operate on them. We use those to implement the software-based receive-side scaling that I talked about earlier.
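To make the app-and-link model concrete, here is a hedged sketch of a tiny Snabb app and the program graph wiring it up. It is written for this text under the assumption of Snabb's core.config, core.app and core.link modules and the bundled basic apps; it is not taken from Snabbflow.

```lua
-- Sketch of a minimal Snabb app and program graph (illustrative).
local config = require("core.config")
local link   = require("core.link")
local engine = require("core.app")
local basic  = require("apps.basic.basic_apps")   -- Source/Sink ship with Snabb

-- An app is a table with a new() constructor and a push() (or pull()) method.
local Counter = {}
function Counter:new (conf)
  return setmetatable({ seen = 0 }, { __index = Counter })
end
function Counter:push ()
  local input, output = self.input.input, self.output.output
  while not link.empty(input) do
    local p = link.receive(input)   -- take a packet from the input link
    self.seen = self.seen + 1       -- count it
    link.transmit(output, p)        -- pass it on unchanged
  end
end

-- Combine apps into an application using links.
local c = config.new()
config.app(c, "source",  basic.Source)   -- synthetic packet source
config.app(c, "counter", Counter)
config.app(c, "sink",    basic.Sink)     -- discards packets
config.link(c, "source.output -> counter.input")
config.link(c, "counter.output -> sink.input")
engine.configure(c)
engine.main({ duration = 1 })            -- run the graph for one second
```

An interlink would take the place of an in-process link where the consumer runs in a separate process, but the receive and transmit interface stays the same.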
We also have lib.ptree, which implements a very strict control plane / data plane segregation. I think for most networking folks the concept of control plane and data plane is pretty common, but just to recap: the control plane is the fancy and elaborate part, you expect it to be really nice, you want a nice interface to configure your application and monitor it. The data plane, on the other hand, you really just want it to work; it should preferably run at line rate, and you don't really have any time to mess around. Since these two parts of the application have very different requirements, it's nice to keep them separate, and that's what we do.

We also have lib.yang: both the configuration and the application state of Snabbflow are actually described by a YANG schema. So, for example, you can tell the control plane to load a new configuration of Snabbflow, or you can query it for some state counters while it's running, and on this slide I have some examples of how you would use the snabb command line interface to do those things.

Here we have a snippet of the Snabbflow YANG schema. YANG is one of these things where at the beginning you wonder if you're really going to need it, but once you have it you are usually really happy that you do. What I like specifically about YANG is that it's very expressive. If a configuration passes the control plane, if it doesn't get rejected as invalid, I'm pretty confident that this configuration will do something useful in the data plane and not just crash. For example, here we have a list of interfaces, and one of the fields is a device, which is a PCI address; the PCI address type is attached to a regular expression that makes sure it actually looks like a PCI address, so we can't just pass in any string and have it validated somewhere way down the line. If you don't put in something that at least looks like a PCI address, the configuration won't even be loaded.

Sadly, any piece of software has bugs, and in our case even suboptimal performance is often considered a bug, right? We deal with the second issue, performance, by shipping Snabb with a flight recorder. This flight recorder has minimal overhead, it's always on, you preferably even run it in production, and it stores useful data; part of that data is really useful to profile your application after the fact, or while it's running.

To analyze the collected data we have built a little UI that we use for that. It's usually running on one of our development servers where we test stuff, but you can really run it anywhere. Did I mention LuaJIT? I did, right? So we're dealing with a JIT compiler here. The UI shows you the stuff that you would expect from a profiler, basically where my program spends its time, but also some JIT-related stuff, like whether the compiler had issues generating efficient code for particular parts of my program. For example, there's a GC column here that shows when the garbage collector is invoked from the JIT-compiled code, and that's the kind of thing to look out for.

Another part of the flight recorder is a high-resolution event log. It can give you accurate latency measurements of the pieces that make up your software, and you can see here on the slide that the UI shows latency histograms for individual events. Some of these events are already defined in Snabb, but you can also define new events yourself. Here, for example, I could tell that processing a batch of packets and extracting the flow data, so this is the main loop of the IPFIX app, takes us about 35 microseconds per iteration per process. This is really useful if you want to debug tail latencies, right? And tail latencies basically translate to dropped packets in our world, so that's something that's really valuable.
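As a rough illustration of what such an event log provides, here is a generic sketch of bucketing per-event latencies into a histogram. It is not Snabb's flight recorder or timeline API; the clock used is a coarse stand-in.

```lua
-- Generic latency-histogram sketch (illustrative; not Snabb's flight recorder).
-- os.clock is only a coarse stand-in here; a real deployment would use a
-- high-resolution monotonic clock.
local now = os.clock
local buckets = {}

local function record (seconds)
  -- log2 bucketing of microseconds keeps the histogram small but preserves the tail
  local us = math.max(seconds * 1e6, 1)
  local b  = math.ceil(math.log(us) / math.log(2))
  buckets[b] = (buckets[b] or 0) + 1
end

-- Time one event, e.g. processing one batch of packets.
local function timed (event_fn)
  local start = now()
  event_fn()
  record(now() - start)
end
```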
So, to close things: if you were to write a new application based on Snabb today, you would have all these things and more ready at your disposal. And it is also possible to purchase consultancy services, commercial support for Snabb and for developing Snabb applications, from your friendly open source consultancy Igalia, which is my current employer. So yeah, that's all for now, thanks for your attention.

On the right there are some contact pointers: if you have questions or inquiries about Snabb or Snabbflow, you can email us there after the conference, or ask now. If you have any questions, please ask them.

Thank you.

Please come down, there are some seats available here in the middle. The next speaker is Peter Manev, who is one of the key people behind Suricata, a very popular open source IDS, and today he is going to talk about this open source platform. Please have a seat here in the middle.