[00:00.000 --> 00:10.000] Hi everyone, I'm Jeroen, well I'm going to tell you something about parsing some files [00:10.000 --> 00:22.120] really fast and I worked for NL netlabs, oh yeah, so some numbers, there's some caveats [00:22.120 --> 00:27.480] here, so the fifth, I did not do measurements because I made finished the slides like today, [00:27.480 --> 00:33.080] so I did the measurements on the train so, but I think the 50 megabytes is actually slower, [00:33.080 --> 00:39.160] I'm pretty sure the 700 megabytes is correct and we will go beyond and I'm going to tell [00:39.160 --> 00:46.200] you how, so yeah, so basically the motivation is that currently the parser NSD isn't very [00:46.200 --> 00:51.680] fast and we have an operator where someone only takes the better part of an hour and [00:51.680 --> 01:02.000] at that point it stops being practical, so yeah, that's the motivation and I actually [01:02.000 --> 01:06.400] like to take you on a journey so that I went through, I will also show you the new algorithms [01:06.400 --> 01:12.360] but I also want to tell you why parsing some files currently in NSD at least is really [01:12.360 --> 01:19.440] slow and to do that we have to tell you a bit on parsing, so I've included an example [01:19.440 --> 01:25.080] with a whole world sea program and the NSD parser is based on lexing jak and that's really [01:25.080 --> 01:30.920] useful if you want to parse things like a computer language where each token has a meaning [01:30.920 --> 01:39.160] of in and of itself, for some files however that is definitely not a case, so if you look [01:39.160 --> 01:45.520] at the, also in this case I provide an example but everyone in the room will probably make [01:45.520 --> 01:50.520] out that on the last line I try to define an A record with a corresponding IP address [01:50.520 --> 01:55.800] and then what the zone parser actually does is it takes the A, it makes that the owner [01:55.800 --> 02:00.480] and then throws a syntax error because well an IP address is not really a valid record [02:00.480 --> 02:10.800] type obviously, so yeah lexing jak is really not a good choice but then there's also the [02:10.800 --> 02:19.480] fact that I think zone files are only more or less standardized, they're not really standardized [02:19.480 --> 02:29.560] and putting it modally and when you combine the two that just leads to a lot of trouble [02:29.560 --> 02:35.120] and well the first thing I did was analyze why the current parser is slow and the current [02:35.120 --> 02:41.000] parser is slow well, it's actually inherent to the tool because they're just not a good [02:41.000 --> 02:48.400] fit because what the lexer does is it gives you each token and then passes on to the parser [02:48.400 --> 02:54.880] but it does so by matching a whole bunch of inputs and then taking the longest one, the [02:54.880 --> 03:01.160] longest match and then executing corresponding action in which in the case of the NSD zone [03:01.160 --> 03:06.600] file parser means that you can, that it copies the token then tries to unescape it and for [03:06.600 --> 03:13.120] names it tries to, it needs the dots right because they have meaning in domain names [03:13.120 --> 03:18.680] and what it does there is that it actually splits the input and passes each label separately, [03:18.680 --> 03:24.160] that is of course copied and the parser then concatenates that all back together and that [03:24.160 --> 03:28.560] is not really a fast process. [03:28.560 --> 03:34.360] So my first thought was well what if I just change the process a bit and scrap lexen [03:34.360 --> 03:43.040] yak and cut all the memory allocations that gave us better numbers but they're not the [03:43.040 --> 03:48.000] numbers that I wanted to see right because under the 8 megabytes is, I mean you can express [03:48.000 --> 03:53.840] it in gigabytes a second but then it becomes even less impressive right than under megabytes [03:53.840 --> 03:59.560] a second so. [03:59.560 --> 04:06.520] So yeah I started looking into that and I will show you the algorithms that I used and [04:06.520 --> 04:12.280] I came up with in a minute but to make you understand why these algorithms work it's [04:12.280 --> 04:17.200] important that each and every one of you knows that your CPU is a pipeline CPU and all modern [04:17.200 --> 04:24.200] CPUs are pipeline CPUs and what that means is that each executing each instruction is [04:24.200 --> 04:28.640] not a single step it's actually a multi-stage process. [04:28.640 --> 04:35.280] So there's a fetch and decode step and there's a lot more to it in practice but this is a [04:35.280 --> 04:43.960] mechanism that was designed so that you to optimize performance in the CPU and the premise [04:43.960 --> 04:50.560] on getting fast codes is that you keep the pipeline nice and full. [04:50.560 --> 04:54.760] That does not always happen especially or one case where that doesn't happen is when [04:54.760 --> 05:01.080] you get a pipeline install and that happens essentially when there's when the next instruction [05:01.080 --> 05:08.040] that you want to execute depends on the result of the instruction that you're currently executing [05:08.040 --> 05:14.160] and in that case the CPU has installed a pipeline it has to wait until the result is printed [05:14.160 --> 05:21.680] back it can then go on to decode and then only then execute your instruction and you'll [05:21.680 --> 05:25.840] take a hit of a couple cycles. [05:25.840 --> 05:32.520] Then there's of course the well-known pipeline flashes and those happen essentially if there's [05:32.520 --> 05:39.320] a conditional jump for instance an F statement and the CPU goes on to load the instructions [05:39.320 --> 05:44.000] that come after that and if it turns out that it actually needs to execute other instructions [05:44.000 --> 05:50.440] then it needs to flush the pipeline and only then can it go on to execute your code. [05:50.440 --> 05:56.640] And there's bronze prediction that is used to improve the flow and modern CPUs actually [05:56.640 --> 06:03.880] do a pretty good job of that but well it's prediction so it's not it's not always right. [06:03.880 --> 06:11.960] And it turns out so if we look at that information then we can analyze why a parser is the process [06:11.960 --> 06:16.760] of parsing is just inherently slow because if we go over it byte by byte like the NSDE [06:16.760 --> 06:22.720] shown parser for example then before you can analyze the next byte you have to wait until [06:22.720 --> 06:28.080] you know you have to resolve the new state of your current byte. [06:28.080 --> 06:35.400] And also it turns out that as far as the CPU is concerned the zone files are just random [06:35.400 --> 06:43.560] right that anything can happen at any time so it's hard to predict branches in that case. [06:43.560 --> 06:51.040] Right so the base of the new instructions that I'm using at base is a thing called [06:51.040 --> 06:56.520] a CND or single instruction multiple data and my interest in all of this was really [06:56.520 --> 07:02.160] sparked by the CND JSON project and it caught my attention because they expressed their [07:02.160 --> 07:04.560] throughput in gigabytes a second. [07:04.560 --> 07:08.840] Now in the next slide I'm going to tell you something about the algorithms but I'm not [07:08.840 --> 07:13.800] going to go into them in great depth because there's not a whole lot of time so if you [07:13.800 --> 07:20.720] want to know more on that then I would advise you to watch the talk or just read the paper. [07:20.720 --> 07:27.840] CND what that is is actually an instruction set and what it does it adds factor in registers [07:27.840 --> 07:34.400] and instructions to operate on those registers and what that allows us to do is to classify [07:34.400 --> 07:42.680] blocks instead of just bytes and there's some trick re-involved and there's a super simple [07:42.680 --> 07:51.800] example on the slide but basically we can classify 16, 32 or 64 bytes in one go depending [07:51.800 --> 07:57.280] on your CPU and then we repeat that multiple times for each input that we actually want [07:57.280 --> 08:06.720] to know about and the idea is that we can cut branches and dependencies. [08:06.720 --> 08:14.000] So what's good to know about CND is that it's all vertical non-horizontal so it's really [08:14.000 --> 08:20.560] it's really an instruction that is executed for each of the inputs so you can actually [08:20.560 --> 08:28.840] do logic in CND and the way to work around that is to convert the inputs to a mask. [08:28.840 --> 08:38.040] So we would get a 64 bit mask for each of the inputs that we checked and with those [08:38.040 --> 08:42.200] bit masks in hand then the first thing that we are going to do is to classify all the [08:42.200 --> 08:49.800] escape bits because there's some files allow for escaping and this is actually an algorithm [08:49.800 --> 08:57.080] that CND JSON guys came up with and basically what we do is that for each uneven number [08:57.080 --> 09:04.680] of backslashes we take the next character and so that bit then represents the character [09:04.680 --> 09:14.000] that is actually escaped and we need that information so that we can identify the quoted [09:14.000 --> 09:24.320] sections or in the case of some zone files also the comment sections and this was actually [09:24.320 --> 09:29.000] kind of a hard problem because they don't have this problem in JSON documents but in [09:29.000 --> 09:35.520] zone files comments can cancel out quoted sections and quoted sections can contain semicolons [09:35.520 --> 09:43.160] and then new lines they limit comments but we only want the new lines that actually delimits [09:43.160 --> 09:51.840] the comment because what we really want to do is that we want to find out which of the [09:51.840 --> 09:56.760] characters that we identify as structural characters are contained in quoted sequences [09:56.760 --> 10:09.160] or in comments and there's a simple example in the bottom there so oh yeah I did a number [10:09.160 --> 10:16.680] of experiments but in the end it turned out that if there's a semicolon in the input we [10:16.680 --> 10:22.920] just branch so we have a slow path assuming that there's not too many comments in zone [10:22.920 --> 10:30.040] files which for generated zone files I guess it's okay and once we have that information [10:30.040 --> 10:38.920] all the bits that remain automatically belong to the non quoted strings and then and this [10:38.920 --> 10:43.640] is oversimplifying it but if we shift right and do an XOR then that would get us all the [10:43.640 --> 10:56.560] transitions and with that information we can then go on to create indexes of those because [10:56.560 --> 11:03.160] your CPU does not only provide SIMD instructions it also provides bit manipulation instructions [11:03.160 --> 11:08.800] really fast bit manipulation instructions so the first thing that it does is it takes [11:08.800 --> 11:14.920] the population count to find out how many transitions are actually in your input block [11:14.920 --> 11:20.320] and then we use the trailing zero count to find out the relative position of the bit [11:20.320 --> 11:26.280] and if we combine it with the index then that should give us the pointer to the exact input [11:26.280 --> 11:38.480] byte and there's some more trickery involved here because of course for zone files if there's [11:38.480 --> 11:43.280] an error we want to report that error and to do that we need a line count and quoted [11:43.280 --> 11:48.480] sections of course may contain just new lines but we don't want to worry about those in [11:48.480 --> 11:56.680] the parser because that would mean that each parse function would possibly need to update [11:56.680 --> 12:02.880] a line count and that would just not be very convenient so what we do there is we take an [12:02.880 --> 12:08.520] unlikely branch if there's new line in the quoted section which really doesn't happen [12:08.520 --> 12:15.400] it's an edge case in the case of zone files and we take the slope path to count all the [12:15.400 --> 12:22.240] new lines in the input or at least the one in the quoted sections and then once we generate [12:22.240 --> 12:27.880] a token for the actual, the limiting new line we add the number of new lines that we found [12:27.880 --> 12:37.280] in quoted sections yeah and that gives us basically that gives us a fast scanner in [12:37.280 --> 12:42.240] my initial measurements and I think it's a little bit fast now that would get me a scanning [12:42.240 --> 12:48.040] of two gigabytes a second for zone files at least with an older.com zone etc etc so there's [12:48.040 --> 12:54.800] caveats there too but it turns out that the rest of the DNS data because we of course [12:54.800 --> 12:59.120] we have to parse it we only now tokenize it we also have to parse it the rest of the DNS [12:59.120 --> 13:08.520] data allows for optimizations using cindy as well and of course we want to start with [13:08.520 --> 13:15.080] the data that occurs the most and that is of course the main names and with the cindy [13:15.080 --> 13:22.160] instruction we actually just repeat the scanning process we quickly identify all the dots we [13:22.160 --> 13:26.040] turn that into a bit mask and then use the bit manipulation instructions to go over the [13:26.040 --> 13:30.560] domain name because most of the time if we just fill in the length on the dot then that [13:30.560 --> 13:35.720] would give us a proper Y format and of course there's a slow path for edge cases as well [13:35.720 --> 13:47.400] there and next of course is the record type and normally I guess you would hash I'd initially [13:47.400 --> 13:55.720] just use binary search which is faster than just linear search of course but that took [13:55.720 --> 14:03.080] away quite some performance so we want a perfect actually we want a hash but then a hash table [14:03.080 --> 14:10.120] is pretty big and so I figured I want a perfect hash and it turns out we can do we can do [14:10.120 --> 14:14.480] that so if you take the first character of the records because there's not that many [14:14.480 --> 14:19.680] record types right and there's certainly if you take the first character there's never [14:19.680 --> 14:26.240] more than that many record never more than like eight or nine record types that start [14:26.240 --> 14:30.880] with the first letter so if you then take the last character and at length then it turns [14:30.880 --> 14:38.960] out that doesn't give me any collisions so we can also the hash of collisions occur but [14:38.960 --> 14:44.320] I mean for all the record types and what for 40 years it doesn't give me collisions so [14:44.320 --> 15:00.400] I guess we're good on that from with a number no for each record type someone asked if this [15:00.400 --> 15:07.640] only works for record types and then the number and answer is no it works for all record types [15:07.640 --> 15:11.840] because they're all closely they're all really close together right so they're and they're [15:11.840 --> 15:18.840] alphanumeric most of the time sometimes there's numbers so we just I think our uppercase or [15:18.840 --> 15:24.760] downcase it and then multiply together a good distribution and then just add length and that [15:24.760 --> 15:29.880] gives me that gives that gives me a unique key without collisions and from there I can [15:29.880 --> 15:37.160] just do a use in the instruction to do compare equal so I can do the exact right string compare [15:37.160 --> 15:46.040] and that gives me a really nice nice speed up yeah and it and the people who worked on [15:46.040 --> 15:54.480] some DJs and actually do a lot of did a lot of projects like using Cindy for for decoding [15:54.480 --> 16:04.400] base 64 so the plan is to incorporate all those things as well and then there's one [16:04.400 --> 16:11.200] tricky part your CPU actually supports multiple instructions that at least if you have modern [16:11.200 --> 16:16.880] CPU if you have like a pending for then you only get SSC 42 but we want our software to [16:16.880 --> 16:23.200] be able to run and all those devices without recompiling so we actually compile it four [16:23.200 --> 16:37.520] times in the case of xx86 then use the CPU ID instruction to pick the right one and then [16:37.520 --> 16:43.320] well it's still in progress projects I have hoped to be a little bit further along but [16:43.320 --> 16:47.760] unfortunately not it will be a standalone library because it might actually be useful [16:47.760 --> 16:54.080] to other people and that will make it easy to integrate into other projects it was initially [16:54.080 --> 17:00.560] just intended for NSD yeah the numbers are so far pretty good at least quite a bit better [17:00.560 --> 17:10.080] than what we have now I think it's possible to go to one gigabyte a second yeah so if [17:10.080 --> 17:15.600] you want to check it out there's a link in there and finally I want to there's slide [17:15.600 --> 17:19.760] with acknowledgement because these people help me a lot I just send them on unsolicited [17:19.760 --> 17:24.680] email at first and then I happen to get answers back and they help me and they even took a [17:24.680 --> 17:30.800] look at my presentation help me there as well so thanks to Jeff Daniel and all the sim [17:30.800 --> 17:54.280] DJs and people and with that I actually finished in time it's time for questions yes oh no but [17:54.280 --> 18:00.760] that's the slow path and exactly there the hash doesn't work but that's the slow path [18:00.760 --> 18:06.840] so we do a slow path sorry I should repeat the question what the person in the audience [18:06.840 --> 18:12.800] was actually referring to what happens if you we use a generic type notation where we start [18:12.800 --> 18:18.320] the type by type followed by a number and obviously it doesn't work there but it does [18:18.320 --> 18:27.840] the slow path so we have a slow path there is the person so complete that you can parse [18:27.840 --> 18:37.440] an output and you get the same output as a parsed in really good because I would so if [18:37.440 --> 18:45.040] the parser is good enough that to give you the exact same output as you put in no it [18:45.040 --> 18:56.000] does not do that no well you you I mean do you mean by access to white space exact or [18:56.000 --> 19:06.320] just yeah no it doesn't do that no and then but it's also not it does also strip comments [19:06.320 --> 19:16.800] yes yes yes yes yes yes yes how do you handle escape decimals because in the example you [19:16.800 --> 19:20.800] gave you strip and you have looked where the backslashes are yeah and then take the next [19:20.800 --> 19:25.400] character but if you have like backslash 003 then you need those four characters as a single [19:25.400 --> 19:34.920] unit to encode in the final I'm not actually I just do the I just really good yeah you're [19:34.920 --> 19:38.440] gonna have to I hope there's no more questions because you're gonna have to keep doing that [19:38.440 --> 19:54.680] but what I so what happens if I guess for certain type of input I record like a backslash [19:54.680 --> 20:00.840] 003 which encodes the byte with a value 3 single byte yeah how do you do that with your algorithm [20:00.840 --> 20:05.920] that just takes the next character when you have backslashes but that's just to that's [20:05.920 --> 20:12.680] just to so the question is what do I do with escape characters or with escape sequences [20:12.680 --> 20:18.240] and so what I explained on the what I do what happens with backslashes is just to tokenize [20:18.240 --> 20:23.560] so I don't strip any data I just mark out the starts in the ends of each string field [20:23.560 --> 20:45.400] and then the parsing comes after that so there's no data actually stripped yeah so the question [20:45.400 --> 20:52.240] is so the question is what's the output format and the output format in this case is just [20:52.240 --> 20:59.480] DNS wire format so the the idea here is that for each record it will invoke a callback [20:59.480 --> 21:04.400] and it will just give you wire format with pointers to where all the fields are in the [21:04.400 --> 21:09.040] like an internal description of the field so that you know the length and you know the [21:09.040 --> 21:28.200] type of the field yeah there is definitely value to large effectors because it takes [21:28.200 --> 21:36.200] less instructions so if you can do something so I did not look into using the GPU but yeah [21:36.200 --> 22:03.280] that might benefit so if you know I have not so the question is why do why does parsing [22:03.280 --> 22:10.160] zone files have to be fast well because they're they get reloaded quite a lot of times each [22:10.160 --> 22:18.840] time there's an update to a zone you need to reload you need to reload the zone so we [22:18.840 --> 22:26.440] want to yeah the so that happens multiple times like an hour it differs per zone right [22:26.440 --> 22:32.960] but in our case if it takes more than an hour and the operator actually or it takes the [22:32.960 --> 22:39.280] better part of an hour and the operator wants to lead more wants to reload more often then [22:39.280 --> 22:46.000] that becomes a problem right so we just need to be faster but then there's all the end [22:46.000 --> 22:51.520] so there's other benefits as well so NSD for instance we support zone verification where [22:51.520 --> 22:59.200] just before the zone goes live you can have a focus program to verify that your DNS stack [22:59.200 --> 23:09.000] data is correct and there you can use an AXFR or you can let the NSD feed you the zone data [23:09.000 --> 23:15.080] in which case you just get text representation and if the zone is big enough then you want [23:15.080 --> 23:28.840] that to be fast because it's in the critical path right yeah well actually if the question [23:28.840 --> 23:32.960] was if splitting the files and multi-threading is something that we consider well splitting [23:32.960 --> 23:44.840] the files no but split on them yeah well new lines yeah that can be tricky because [23:44.840 --> 23:50.600] zone files can contain parentheses which would then mean that the record continues on the [23:50.600 --> 23:59.880] next line but a colleague actually did do a parallel zone loading implementation and [23:59.880 --> 24:05.200] I guess we can even do that with with this implementation right because there the it [24:05.200 --> 24:12.040] was actually quite a bit faster but the scanning process still takes a long time because you [24:12.040 --> 24:16.760] go over it by by bite but now that we have a fast scanner there's no reason why we cannot [24:16.760 --> 24:35.080] also include like parallel parsing yeah that could work yeah so the question is if we did [24:35.080 --> 24:53.920] it