Yeah, so let's go right into the topic. I'm Alex, I work for SWITCH. SWITCH is the national research and education network in Switzerland; most countries have something like us, in Belgium it's Belnet, in Germany it's DFN, and in France it's Renater. We connect the Swiss universities and universities of applied sciences; we are the ISP of those institutions.

So, NetFlow. I'm not sure if everyone is familiar with NetFlow, so I'll just recap the central idea of what a network flow actually is. When you look at an IP packet, you extract the source and destination addresses, the IP protocol and, if the protocol is UDP or TCP, also the source and destination ports. Those five numbers identify a flow, and every packet with the same values is said to belong to the same flow. In the simplest possible scheme you just aggregate: you count the bytes and packets of all the packets that belong to a flow, and then you export this information to a collector where you can analyze the data.

This is an old thing. These days people talk about network telemetry, but back when this was developed that name didn't exist yet. I'm not sure when exactly Cisco came up with it, but it must have been the early or mid 90s. It was a de facto standard for a long time; people just figured out what Cisco did and then did the same thing, and it finally got properly standardized with the IETF IPFIX standard.

You can do this either in sampled or unsampled mode. Unsampled means you look at every single packet and account for it in a flow; with sampling you only look at every nth packet, and then you have to make certain assumptions to reconstruct the actual values.
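To make that concrete, here is a minimal Lua sketch of this kind of per-flow aggregation (Lua because that is what Snabb, discussed later, is written in). It is written for this text, not taken from Snabbflow, and the packet field names are assumptions for illustration.

```lua
-- Minimal sketch of NetFlow-style aggregation (illustrative only).
-- A "packet" here is assumed to be a table of already-parsed header fields.
local flows = {}

-- Build the 5-tuple key that identifies a flow.
local function flow_key (pkt)
  return table.concat({ pkt.src_ip, pkt.dst_ip, pkt.protocol,
                        pkt.src_port or 0, pkt.dst_port or 0 }, "|")
end

-- Account one packet towards its flow, creating the flow on first sight.
local function account (pkt)
  local key = flow_key(pkt)
  local flow = flows[key]
  if not flow then
    flow = { packets = 0, bytes = 0, first_seen = os.time() }
    flows[key] = flow
  end
  flow.packets = flow.packets + 1
  flow.bytes = flow.bytes + pkt.length
  flow.last_seen = os.time()
end

-- Example usage:
account{ src_ip = "192.0.2.1", dst_ip = "198.51.100.7", protocol = 6,
         src_port = 443, dst_port = 52311, length = 1500 }
```

With 1-in-N sampling you would update these counters only for the sampled packets and later multiply by N to estimate the true totals, which is the upscaling mentioned below.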
At SWITCH we've been using NetFlow for a very long time, since the mid 1990s, as the most important means to analyze our network data. It used to be provided by the routers themselves, which is reasonable: the packets pass through that device anyway, so the device has immediate access to them and can construct the flow data itself. Initially that was done in software, then it was done in hardware, and it was basically always unsampled. But with the advent of more powerful networking gear, and especially with the arrival of 100 gig ports, it became basically unfeasible to do this on the routers themselves, because of software restrictions and also hardware restrictions. If you want to do this in software, you usually can't, because the routers are not very powerful in terms of CPU, and in hardware it becomes very expensive. So the vendors started to implement only sampled NetFlow; these days, if you buy a Cisco or a Juniper box and you do NetFlow, you get sampling.

And sampling is fine, of course, if you're only interested in aggregate data anyway. For big aggregated traffic flows between networks, for instance, sampling is perfectly fine: you make certain assumptions about the traffic, you just upscale, and you get fairly reasonable numbers.

So why would you even want to do unsampled NetFlow? Well, there are a couple of use cases where it's really useful. In terms of security, for instance, one thing sampling is fine for is detecting DDoS, volumetric DDoS; that's very simple, you basically have a constant packet rate, and if you just look at every nth packet it's easy to scale this up. But if you want to detect a bot in your network, for instance, it's more difficult. Maybe you want to do this by looking at the communication with the command and control channels; those are short-lived flows, and if you do sampling you're probably going to miss them. But with unsampled NetFlow you see every single flow, so you can identify these things.

And we as a network operator use this fairly often to troubleshoot network problems. If a customer complains that they can't reach a certain IP address on the internet, we can actually go look in our flows for the outgoing TCP SYN packet and see whether there's a TCP SYN coming back in. We can do this because we see every single flow, so this is extremely useful.
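As a rough illustration of that troubleshooting use case, here is a small sketch of checking a set of flow records for an outgoing flow towards the problematic address and a returning flow from it. It is not the tooling SWITCH actually uses, and the record field names are assumptions.

```lua
-- Sketch: given a list of flow records, check whether traffic towards
-- `target` leaves the network and whether anything comes back from it.
local function check_reachability (records, customer_ip, target)
  local outgoing, returning = false, false
  for _, r in ipairs(records) do
    if r.protocol == 6 then                        -- TCP
      if r.src_ip == customer_ip and r.dst_ip == target then
        outgoing = true                             -- the SYN went out
      elseif r.src_ip == target and r.dst_ip == customer_ip then
        returning = true                            -- something came back in
      end
    end
  end
  return outgoing, returning
end
```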
But as I said, we can no longer do that on our big new core routers; they only give us sampled NetFlow. So we started to do this with an external box, and that's where the Snabbflow software implementation comes in. I mean, there are always ways to do that, but they might be very expensive, for instance if you have to buy dedicated hardware.

Just to give an idea of what type of traffic we're dealing with: Switzerland is a small country, we are a small network, and we only do NetFlow on our borders, on the traffic that we exchange with neighboring networks. The peak values these days are roughly 180 gigabits per second, something like that, about 20 million packets per second, and roughly 350,000 flows per second, unsampled. And the flow rate can actually be even much higher, because of the aggressive scanning that has been going on for the past couple of years: people perform very aggressive network scans, plain TCP SYN scans, as fast as they can, so sometimes a single host can easily generate 100,000 flows per second.

The actual IPFIX traffic that the exporter generates, the flow records themselves, is in the order of 200 to 300 megabits per second; this is all for the unsampled flows. The average flow rate is maybe just around 200,000 per second, and the actual NetFlow data it generates is roughly 1.5 terabytes per day, so the scaling problem is really more on the collector side than on the exporter side. We have 10 gig, 100 gig and 400 gig ports, so that's what our solution needs to support.

We used to do this historically on the routers themselves until a couple of years ago. Then we moved to a commercial NetFlow generator that did it in hardware, which was pretty expensive, maybe 100,000 euros for the whole solution for one PoP, something like that. And then we finally moved to Snabbflow and pure software.

So how do we do this? On the borders these are all fiber connections, so we have optical splitters and create a copy of all the traffic. The primary device these taps are connected to is what we call a packet broker; it's basically a switch that aggregates all the packets and sends them out on two 100 gig ports to our actual exporter box. It uses VLAN tags to identify the ports: in NetFlow we also want to keep track of the router ports where the traffic was sent or received, and because that information gets lost when you aggregate the traffic, we use VLANs to tag it.

The boxes we use are white-box switches based on the Tofino ASIC, the one that Intel unfortunately just decided to stop developing. These are very nice boxes: there's one with 32 100 gig ports for 5,000 euros, and another one with 32 400 gig ports that costs about 20,000 euros. The thing is, you have to program them yourself; when you buy them they're just plain hardware, and you can use the P4 language to do this. I link here to another project of mine where I actually developed the P4 program to do that, so that's also part of this entire architecture.
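For illustration, here is a small Lua sketch of how the exporter side could recover the original router port from the VLAN tag that the packet broker pushes onto each mirrored frame, as described above. This is not the actual Snabbflow code, and the VLAN IDs and interface names are made up.

```lua
-- Sketch: map the 802.1Q VLAN ID added by the packet broker back to the
-- router interface the packet was originally seen on (hypothetical IDs).
local vlan_to_interface = {
  [101] = { router = "border-1", port = "et-0/0/1" },
  [102] = { router = "border-1", port = "et-0/0/2" },
}

-- frame: an Ethernet frame as a Lua byte string; returns the VLAN ID or nil.
local function vlan_id (frame)
  local tpid = string.byte(frame, 13) * 256 + string.byte(frame, 14)
  if tpid == 0x8100 then                    -- 802.1Q tag present
    local tci = string.byte(frame, 15) * 256 + string.byte(frame, 16)
    return tci % 4096                       -- low 12 bits are the VLAN ID
  end
end

-- Usage: local ifinfo = vlan_to_interface[vlan_id(frame)]
```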
Then the traffic gets to the NetFlow exporter box, which is currently just a one-rack-unit, basic rack-mount server. We use AMD Epycs these days, mainly with a fairly large number of cores; that's the way we scale, with the number of cores. NetFlow always scales very well with cores, because you just have to make sure that you keep the packets belonging to a flow together. The exporter has a Mellanox ConnectX-4 100 gig card that's connected to the packet broker; that's where it receives the packets.

So in a picture, this is what it looks like. On the upper left is our border router, on the upper right is the border router of our neighboring network, and in the middle you have the optical splitter, which is a completely passive box, just an optical splitter. Then you have this packet broker switch in between that aggregates all the packets and distributes them by flow over, currently, these two links. These are two 100 gig ports between the broker and the exporter; we can easily add more ports if that's not sufficient, and on the Snabbflow exporter we can basically just add more cores to be able to scale.

So now let's hear Max talk about the actual software.

Hello, hello, does this work? Good, all right. Now that we know how Snabbflow is deployed, I want to talk about how it's built, how it scales, how you configure it, how you monitor the running application, and so on.

Snabbflow, as the name suggests, is built using Snabb. Snabb is a toolkit for writing high-performance networking applications. Snabb is written in Lua, using the amazing LuaJIT compiler, and it does packet I/O without going through the kernel; the Linux kernel networking stack is generally slow from an ISP perspective, so Snabb bypasses it and uses its own device drivers. This is often called kernel-bypass networking, and I think nowadays it's fairly common. Snabb is open source and independent; we're not sponsored by any vendor in particular.

Snabb is built with three core values in mind: we prefer simple designs over complex designs, we prefer our software to be small rather than large, and we are open: you can read the source, you can understand it, you can modify it, you can rewrite it, and so on.

Here I have a snippet of code taken directly from Snabbflow, unedited, just to give you an idea of what the Lua code that powers a typical Snabb application looks like. In this particular example we read a batch of packets from an incoming link, we extract some metadata that tells us which flow each packet belongs to, then we look up a matching flow in the flow table that we maintain; if we already have a flow we count that packet towards it, and if not we create a new entry in the flow table.
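The slide itself is not reproduced in this transcript. The following is a hedged sketch of what a loop like the one just described could look like; it is an approximation written for this text, not the actual Snabbflow source, and the extract_flow_key helper and the flow-table methods are assumptions.

```lua
-- Sketch of a per-packet processing loop in the style described above.
local link   = require("core.link")     -- Snabb's link API
local packet = require("core.packet")   -- Snabb's packet API

local FlowMeter = {}  -- a Snabb app; the engine fills in its input/output links

function FlowMeter:push ()
  local input = self.input.input
  while not link.empty(input) do
    local p = link.receive(input)              -- next packet from the batch
    local key  = self:extract_flow_key(p)      -- assumed helper: parse the 5-tuple
    local flow = self.flow_table:lookup(key)   -- assumed flow-table API
    if flow then
      flow.packets = flow.packets + 1          -- count towards the known flow
      flow.bytes   = flow.bytes + p.length
    else
      self.flow_table:add(key, { packets = 1, bytes = p.length })
    end
    packet.free(p)                             -- the packet is fully accounted for
  end
end
```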
I've got one more snippet. This function is called every now and then to actually export the flows: we walk over a section of the flow table, add flow aggregates from that flow table into the next export data record, and if it's time to export the data record we send it off to an IPFIX collector, which is a separate program.

From a bird's-eye view, Snabbflow works roughly like this: we read packets from a 100 gigabit NIC, the garden hose so to speak, we process those packets to extract flow information in a Snabb process, and then we send off data records over a tun/tap interface to the IPFIX collector. On the right side here you have a device driver that is part of Snabb and written in Lua; that's where the actual traffic, the bulk of it, comes in. On the left side you have an interface to the Linux network stack; since the flow export data is rather small in comparison, you can just send that over the regular Linux network stack, and that works. And on the very left you have the IPFIX collector, which is a different application, a separate program that we send the flow data to in the end.

Sadly, or I guess just obviously, a single CPU core is not enough to handle 100 gigabits of traffic. So instead we use receive-side scaling provided by the network device; this way we can process n different sets of flows in n different processes running on n different CPU cores. Every circle here is a CPU core.

We also support repeating basically the same trick in software, so we can do another round of receive-side scaling after filtering the traffic by protocol. This way we can process, for example, DNS traffic on a different set of cores than the rest of the IP traffic, and that way we can segregate the server resources between the workloads we actually care about. We might, for example, care more about having an accurate general IP flow profile to send to the collectors, and if we still have some time left we will also do some DNS analysis, but we don't want one to slow down the other.

Snabb programs are organized into independent apps. An app is a logical packet-processing component; it could be, for example, a device driver, or an app that implements the Address Resolution Protocol. These apps are combined into applications like Snabbflow using links. Links are unidirectional, they are really just ring buffers, and any app can have any number of them to use as input or output for packet data.
You use those links as shown here; that's basically the API: you call link.receive on an input link to receive a packet, and you call link.transmit on an output link to send a packet.

Now, to forward packets from one CPU core to another we have this thing called interlinks. These are really just like regular links, except that they span process and CPU core boundaries, and you can use them just like any other link; you have the same interface to operate on them. We use those to implement the software-based receive-side scaling that I talked about earlier.
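To make the app-and-link model concrete, here is a hedged sketch of a tiny Snabb app and the program graph wiring it up. It is written for this text under the assumption of Snabb's core.config, core.app and core.link modules and the bundled basic apps; it is not taken from Snabbflow.

```lua
-- Sketch of a minimal Snabb app and program graph (illustrative).
local config = require("core.config")
local link   = require("core.link")
local engine = require("core.app")
local basic  = require("apps.basic.basic_apps")   -- Source/Sink ship with Snabb

-- An app is a table with a new() constructor and a push() (or pull()) method.
local Counter = {}
function Counter:new (conf)
  return setmetatable({ seen = 0 }, { __index = Counter })
end
function Counter:push ()
  local input, output = self.input.input, self.output.output
  while not link.empty(input) do
    local p = link.receive(input)   -- take a packet from the input link
    self.seen = self.seen + 1       -- count it
    link.transmit(output, p)        -- pass it on unchanged
  end
end

-- Combine apps into an application using links.
local c = config.new()
config.app(c, "source",  basic.Source)   -- synthetic packet source
config.app(c, "counter", Counter)
config.app(c, "sink",    basic.Sink)     -- discards packets
config.link(c, "source.output -> counter.input")
config.link(c, "counter.output -> sink.input")
engine.configure(c)
engine.main({ duration = 1 })            -- run the graph for one second
```

An interlink would take the place of an in-process link where the consumer runs in a separate process, but the receive and transmit interface stays the same.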
We also have lib.ptree, which implements a very strict control plane / data plane segregation. I think for most networking folks the concept of control plane and data plane is pretty common, but just to recap: the control plane is the fancy and elaborate part, you expect it to be really nice, you want a nice interface to configure your application and monitor it. The data plane, on the other hand, you really just want it to work; it should preferably run at line rate, and you don't really have any time to mess around. Since these two parts of the application have very different requirements, it's nice to keep them separate, and that's what we do.

We also have lib.yang: both the configuration and the application state of Snabbflow are actually described by a YANG schema. So, for example, you can tell the control plane to load a new configuration of Snabbflow, or you can query it for some state counters while it's running, and on this slide I have some examples of how you would use the snabb command line interface to do those things.

Here we have a snippet of the Snabbflow YANG schema. YANG is one of these things where at the beginning you wonder if you're really going to need it, but once you have it you are usually really happy that you do. What I like specifically about YANG is that it's very expressive. If a configuration passes the control plane, if it doesn't get rejected as invalid, I'm pretty confident that this configuration will do something useful in the data plane and not just crash. For example, here we have a list of interfaces, and one of the fields is a device, which is a PCI address; the PCI address type is attached to a regular expression that makes sure it actually looks like a PCI address, so we can't just pass in any string and have it validated somewhere way down the line. If you don't put in something that at least looks like a PCI address, the configuration won't even be loaded.

Sadly, any piece of software has bugs, and in our case even suboptimal performance is often considered a bug, right? We deal with the second issue, performance, by shipping Snabb with a flight recorder. This flight recorder has minimal overhead, it's always on, you preferably even run it in production, and it stores useful data; part of that data is really useful to profile your application after the fact, or while it's running.

To analyze the collected data we have built a little UI that we use for that. It's usually running on one of our development servers where we test stuff, but you can really run it anywhere. Did I mention LuaJIT? I did, right? So we're dealing with a JIT compiler here. The UI shows you the stuff that you would expect from a profiler, basically where my program spends its time, but also some JIT-related stuff, like whether the compiler had issues generating efficient code for particular parts of my program. For example, there's a GC column here that shows when the garbage collector is invoked from the JIT-compiled code, and that's the kind of thing to look out for.

Another part of the flight recorder is a high-resolution event log. It can give you accurate latency measurements of the pieces that make up your software, and you can see here on the slide that the UI shows latency histograms for individual events. Some of these events are already defined in Snabb, but you can also define new events yourself. Here, for example, I could tell that processing a batch of packets and extracting the flow data, so this is the main loop of the IPFIX app, takes us about 35 microseconds per iteration per process. This is really useful if you want to debug tail latencies, right? And tail latencies basically translate to dropped packets in our world, so that's something that's really valuable.
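As a rough illustration of what such an event log provides, here is a generic sketch of bucketing per-event latencies into a histogram. It is not Snabb's flight recorder or timeline API; the clock used is a coarse stand-in.

```lua
-- Generic latency-histogram sketch (illustrative; not Snabb's flight recorder).
-- os.clock is only a coarse stand-in here; a real deployment would use a
-- high-resolution monotonic clock.
local now = os.clock
local buckets = {}

local function record (seconds)
  -- log2 bucketing of microseconds keeps the histogram small but preserves the tail
  local us = math.max(seconds * 1e6, 1)
  local b  = math.ceil(math.log(us) / math.log(2))
  buckets[b] = (buckets[b] or 0) + 1
end

-- Time one event, e.g. processing one batch of packets.
local function timed (event_fn)
  local start = now()
  event_fn()
  record(now() - start)
end
```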
So, to close things: if you were to write a new application based on Snabb today, you would have all these things and more ready at your disposal. And it is also possible to purchase consultancy services, commercial support for Snabb and for developing Snabb applications, from your friendly open source consultancy Igalia, which is my current employer. So yeah, that's all for now, thanks for your attention.

On the right there are some contact pointers: if you have questions or inquiries about Snabb or Snabbflow, you can email us there after the conference, or ask now. If you have any questions, please ask them.

Thank you.

Please come down, there are some seats available here in the middle. The next speaker is Peter Manev, who is one of the key people behind Suricata, a very popular open source IDS, and today he is going to talk about this open source platform. Please have a seat here in the middle.