So, my name is Robin Geuze. I worked for TransIP, and then for team.blue after they merged with a bunch of other companies, for about a decade, until a month ago. During that time we transitioned from running our own closed source DNS server software to running open source DNS server software, and just like in the talk we just had, that happens to be PowerDNS. So I'll take you through the issues we had going from closed source to open source, which roughly took the entire time I was there, about nine years. So, yeah, let's start.

How it started for me: TransDNS, their homegrown DNS server software, was originally written around 2003, 2004, and DNSSEC support was added in 2012. When I started working at TransIP in 2013 as a PHP coder, I was asked to help debug a crash in the TransDNS code. It basically came down to a buffer overflow, because one of our customers had managed to put more than 16 kilobytes of TXT record data on one single label. The really quick fix was to increase the buffer to 32 kilobytes. One small disclaimer: I was involved in almost all of the work I mention here, but there are some things I didn't do myself or only consulted on, stuff like that. I'll try to make that distinction, but I might miss some things.

Back then it was a really basic setup. We had three servers, all running TransDNS, and there was no load balancing. The signing stack was built using DNSSEC-Tools, for those few people who still know what that is, with a lot of automation on top of it in PHP to make all of that work and ultimately upload stuff to the registry, because we were a registrar, so a lot of it was automated. All of the DNS propagation was done through cron jobs, which means it was very slow: it took roughly five minutes to propagate a DNS change. Back then that wasn't really a big problem, but as we went on it became more and more of an issue, especially once Let's Encrypt arrived and you needed to quickly update your DNS to get your certificate signed. At the peak, and I think still today, we had roughly one million zones in the setup, most of which, about 80 to 90 percent, are DNSSEC signed. There were very few people back then who actually knew the system and dared to work on it, maybe three or four, one of whom was me. It had very bad RFC compatibility, which I will get into a little bit later. And adding new record types, like the SSHFP example Kevin mentioned, was a lot of work, because TransDNS had its own record parser, which had to be written in C, and writing string parsers in C is not fun.
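To make that earlier buffer overflow a bit more concrete, here is a minimal C++ sketch, entirely my own and not the actual TransDNS code, of the kind of bounds check that was missing: an append helper that refuses to write past the end of a fixed-size packet buffer instead of trusting its caller. The 16 KiB size just mirrors the number from the story.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    // Hypothetical illustration only -- not the actual TransDNS code.
    struct PacketBuffer {
        uint8_t data[16 * 1024];   // fixed-size response buffer, mirroring the 16 KiB in the story
        size_t  used = 0;

        // Appends rdata only if it fits; returns false instead of overflowing.
        bool append(const void* rdata, size_t len) {
            if (len > sizeof(data) - used)    // the bounds check that was missing
                return false;                 // caller decides what to do (truncate, set TC, ...)
            std::memcpy(data + used, rdata, len);
            used += len;
            return true;
        }
    };

    int main() {
        PacketBuffer buf;
        uint8_t oversized[20 * 1024] = {};    // think: more than 16 KiB of TXT data on a single name
        std::printf("append ok? %d\n", buf.append(oversized, sizeof(oversized)));   // prints 0
    }

Doubling the buffer, as the quick fix did, only moves the cliff; a check like this removes it.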
And well, I fixed that initial buffer overflow bug, but the main problem was that there just wasn't a lot of bounds checking in the code. So yeah, there were a lot of hidden bugs that probably should be fixed as well.

So we took a few initial steps. Because we had those three servers and no load balancing, restarting TransDNS meant that one of the servers would stop responding until the restart was done. And the restart took roughly 15 minutes, because every single record would get loaded into memory, and since we had a million zones, I think something like 25 million records back then, that just took a lot of time, and the zone parser we used wasn't particularly quick either. So the first thing we did was implement load balancing. This was before dnsdist was a thing. What we tried initially was relayd, which some of the BSD folks might know. It did work, but we had a lot of weird issues and it was really hard to debug. So eventually we switched to using HAProxy for TCP, which works, nothing more to say about it, and I wrote something rather quickly in C, roughly based on the TransDNS code, to forward the UDP traffic. That worked quite well and actually enabled us to iterate on the TransDNS code, because we could do safe restarts without having to worry about queries being dropped. And that allowed me to fix the glaring issues, like there not being any bounds checking in the code, so we had less risk of buffer overflows, and I fixed a lot of the EDNS issues that were becoming a problem at that point. Eventually, when dnsdist was a little bit more mature, we switched to that, because otherwise I had to maintain yet another piece of software, and I really didn't feel like doing that.

In the meantime, I did improve the TCP stack in TransDNS a lot, because we noticed that especially SIDN, the .nl registry, did a lot of TCP queries, and the original implementation was basically "spawn a new thread for every TCP connection". Once you get to about a thousand threads, that's not a great solution. So I changed it to a poll-based model; it worked great, got pretty high performance, and we never had a problem with it after that. The only thing I changed later, when we moved to Linux, was switching to epoll.
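As a rough illustration of that change, here is a minimal epoll-based TCP loop in C++. It's a sketch of the general technique, not the TransDNS implementation: error handling, non-blocking sockets, partial reads of the two-byte length prefix and the actual DNS parsing are all omitted, and the port number is just an example.

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    int main() {
        // Listening socket for DNS-over-TCP (port 5353 here so it runs unprivileged).
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(5353);
        bind(lfd, (sockaddr*)&addr, sizeof(addr));
        listen(lfd, SOMAXCONN);

        // One epoll instance watches the listener and every client connection,
        // instead of one thread per connection.
        int efd = epoll_create1(0);
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = lfd;
        epoll_ctl(efd, EPOLL_CTL_ADD, lfd, &ev);

        epoll_event events[64];
        while (true) {
            int n = epoll_wait(efd, events, 64, -1);
            for (int i = 0; i < n; ++i) {
                int fd = events[i].data.fd;
                if (fd == lfd) {                          // new TCP connection
                    int cfd = accept(lfd, nullptr, nullptr);
                    epoll_event cev{};
                    cev.events = EPOLLIN;
                    cev.data.fd = cfd;
                    epoll_ctl(efd, EPOLL_CTL_ADD, cfd, &cev);
                } else {                                   // data on an existing connection
                    char buf[2 + 65535];                   // 2-byte length prefix + DNS message
                    ssize_t got = read(fd, buf, sizeof(buf));
                    if (got <= 0) {                        // EOF or error: drop the connection
                        epoll_ctl(efd, EPOLL_CTL_DEL, fd, nullptr);
                        close(fd);
                    } else {
                        // ... parse the length-prefixed query and write an answer back ...
                    }
                }
            }
        }
    }

The point of the model is that a single thread can happily sit on thousands of mostly idle registry connections, which is exactly where thread-per-connection fell over.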
Yeah, so SIDN had DNSSEC validation monitoring, and we kept getting reminded about the fact that we were doing a lot of stuff wrong. We actually had one specific cause that covered most of those errors, I think about 80 percent of them. It shows up as two issues, but they have the same root cause.

The first issue was the incorrect handling of wildcards. If you have a wildcard, for example *.nl, and you also have a record c.nl, and you then try to resolve a.c.nl, it should not hit the wildcard, because c.nl exists. That means you should return a no-data answer, or in this case an NXDOMAIN, but TransDNS didn't really care, so it would just return the data from the wildcard. Very convenient, it makes it a lot easier to configure DNS, but it causes some issues, especially with DNSSEC validation. The second issue was basically the same thing, only with empty non-terminals: if a.b.c exists and you try to resolve b.c, even though there's nothing specifically on b.c, you should say there's no data rather than that it's a non-existent domain. That also causes DNSSEC validation errors. Same basic cause.

The solution in TransDNS was to switch from an ordered map that used the type plus the domain name as the key, to a map that only used the domain name as the key, with an array of types inside it, which could also be empty, so we would immediately notice when there was a label in the way. That worked well.

Ah, I'm actually getting ahead of the next slide. The only problem was that we couldn't just deploy that change, because we might break stuff for our customers, and customers get a little bit difficult if you break stuff for them. So what I decided was: okay, for DNSSEC it's broken anyway, because DNSSEC-validating resolvers would just return errors when you have one of these labels. So I did the fix in two steps. I initially enabled the correct behavior only for DNSSEC queries, and kept the wrong behavior for non-DNSSEC queries. In between, we compared a large amount of queries: I think I did two days of tcpdumping, boiled that down to the actual unique queries, and compared what our name servers would respond for DNSSEC versus non-DNSSEC. For everything that had a difference, we contacted the customers and told them, hey, you need to fix this. I think it was only about 20 to 30 customers, actually not that many, which made it a lot easier. And then at some point I decided to just flip the switch. There were a few customers that didn't respond, but at some point you just have to decide not to give a fuck.

One other small issue we had with RFC compliance was the NSEC implementation. Almost all of our zones use NSEC3, so the NSEC implementation was not as well tested as the NSEC3 implementation, and it was wrong, like really wrong. I just rewrote it from scratch, and then it worked.
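As a toy illustration of that data-structure change, keying on the owner name only: the C++ below is my own sketch, not TransDNS code, and it skips wildcard synthesis and zone cuts entirely. The point is only that an existing name with no matching type now yields NODATA instead of looking like a name that doesn't exist.

    #include <map>
    #include <set>
    #include <string>
    #include <iostream>

    enum class Answer { Data, NoData, NxDomain };

    // Owner name -> set of record types present at that name. Empty
    // non-terminals get an entry with an empty set, so they still "exist".
    using Zone = std::map<std::string, std::set<std::string>>;

    Answer lookup(const Zone& zone, const std::string& qname, const std::string& qtype) {
        auto it = zone.find(qname);
        if (it != zone.end())                      // the name exists...
            return it->second.count(qtype) ? Answer::Data : Answer::NoData;
        return Answer::NxDomain;                   // ...only now may a wildcard be considered
    }

    int main() {
        Zone zone;
        zone["c.nl"]     = {"A"};   // c.nl exists, so *.nl must not match a.c.nl
        zone["a.b.c.nl"] = {"A"};
        zone["b.c.nl"]   = {};      // empty non-terminal created when a.b.c.nl was loaded

        std::cout << (lookup(zone, "c.nl",   "TXT") == Answer::NoData)   << "\n";  // 1
        std::cout << (lookup(zone, "b.c.nl", "A")   == Answer::NoData)   << "\n";  // 1
        std::cout << (lookup(zone, "x.c.nl", "A")   == Answer::NxDomain) << "\n";  // 1
    }

In the old map keyed on name plus type, a miss for b.c.nl looked exactly like a miss for a name that didn't exist at all, which is the shared root cause of both issues described above.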
So then we started to think about moving to PowerDNS, and the main reason was that SIDN announced we would no longer get the DNSSEC incentive for domains signed with algorithm 7, that is, RSASHA1 with NSEC3. That would cost us a bunch of money, and that's a very good way to stimulate people to do stuff. So at this point we decided to bite the bullet, start over from scratch, and build a new, more modern setup.

We picked PowerDNS, partially because we already had some experience with it, and partially because we didn't really want to deal with zone files: we had a million zones, and putting them all on a file system makes things annoying. PowerDNS was the only one where we thought, oh, this allows us to do changes via the API, so we don't need to worry about having separate zone files for every single zone.

Then we needed to pick a PowerDNS backend, because PowerDNS is one thing, but you still need something to put your data in. And there we sort of hit a problem. TransDNS is really fast, because it's literally just a hash map in memory, so it can give basically instant answers. The PowerDNS generic SQL backend is very nice and flexible, but it's not really fast, especially because we had a lot of zones that didn't get very frequent queries. They'd have a lot of non-active data, which means the query cache wouldn't really help a lot, which means we would continuously have a lot of SQL queries, because those zones do still get queried sometimes, just not a lot. The bind backend had the same problem as all the other name servers: no API support, and it would mean we needed to use a lot of zone files, which we didn't want to.

So, introducing the LMDB backend. It already existed at the point we started looking at it, because Bert Hubert had written it. It's very fast, and it has support for the API, which is really nice. It only had one major issue: because of the way it was implemented at the time, it didn't really allow records bigger than 512 bytes, and we had quite a lot of zones with records like that. So I decided to fix that. I wrote a pull request for the PowerDNS team, and I think it was accepted pretty quickly. It also included migration code, so an old LMDB database would automatically be migrated to the new LMDB database format. It also improved performance in some corner cases, but that was not really the goal of the patch.
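For those who haven't used it: LMDB is an embedded, memory-mapped key-value store, which is where the speed comes from. Below is a bare-bones C++ example against the plain LMDB C API, just to give a flavor; the key and value layout is made up and is not the actual lmdbbackend schema. Compile with -llmdb.

    #include <lmdb.h>
    #include <cstring>
    #include <cstdio>

    int main() {
        MDB_env* env;
        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 1UL << 30);               // 1 GiB map size
        mdb_env_open(env, "./example.lmdb", MDB_NOSUBDIR, 0600);

        // Write one key/value pair in a transaction. The layout is made up:
        // owner name plus type as the key, a serialized record as the value.
        MDB_txn* txn;
        mdb_txn_begin(env, nullptr, 0, &txn);
        MDB_dbi dbi;
        mdb_dbi_open(txn, nullptr, 0, &dbi);
        const char* key = "example.org|TXT";
        const char* val = "\"hello world\"";
        MDB_val k{strlen(key), (void*)key};
        MDB_val v{strlen(val), (void*)val};
        mdb_put(txn, dbi, &k, &v, 0);
        mdb_txn_commit(txn);

        // Read it back in a read-only transaction.
        mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn);
        MDB_val out;
        if (mdb_get(txn, dbi, &k, &out) == 0)
            std::printf("%.*s\n", (int)out.mv_size, (const char*)out.mv_data);
        mdb_txn_abort(txn);
        mdb_env_close(env);
    }

The appeal for a setup like ours was exactly this combination: fast local reads with no SQL server in the query path, while the zone data stays writable through the PowerDNS API.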
So then we started moving over. We built the new setup, which was really cool, with a lot of automation around it. It actually does all the zone transfers via XFR, even though Kevin just said that's a bad idea if you have a lot of zones. In practice it works quite well, except for one issue: every Thursday, our updates would take ages to go through. We traced it down to an enormous bump in the XFR queues; we would literally have 400,000 XFRs queued up. So that was a bit of a problem.

The reason this happens is that PowerDNS renews its signatures every Thursday. Very nice, we don't have to think about it. The problem is, if you have a million zones, that takes quite a while, especially because we were running our hidden primary on a VM, so it was also not that quick to answer queries. We could have just thrown more hardware at it, but we decided to look for a more sustainable solution, because, well, a fix that works with one million zones should keep working when you have ten million zones. So I discussed it with the PowerDNS guys, and I came up with a solution, which is XFR priority levels. Rather than treating all XFRs that need to be done at the same level, we gave more priority to things that are user-initiated. So if you initiate an XFR via pdns_control, it will be first in the queue: whatever else is in the queue, that one gets treated first. After that come the API, then notifies, then SOA refreshes, and signature refreshes have the lowest priority. That meant that, yes, we would still have quite a large queue, but we could still process our own updates very quickly. That work was included in PowerDNS, in 4.5. We still saw the huge queues, but zone updates would propagate pretty quickly, so for us that was fine. We never had a problem with it after that.
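The idea behind those priority levels is easy to sketch. Below is a toy C++ version; the ordering follows what I just described, but the enum names are my own, not PowerDNS's, and the real queue obviously tracks much more state per transfer.

    #include <queue>
    #include <string>
    #include <vector>
    #include <iostream>

    enum class XfrPriority {               // lower value = served first
        PdnsControl = 0,                   // operator triggered it via pdns_control
        Api         = 1,                   // change made through the HTTP API
        Notify      = 2,                   // NOTIFY received from the primary
        SoaRefresh  = 3,                   // periodic SOA serial check
        SignRefresh = 4,                   // weekly signature refresh, lowest priority
    };

    struct XfrRequest {
        XfrPriority prio;
        std::string zone;
    };

    struct ByPriority {
        bool operator()(const XfrRequest& a, const XfrRequest& b) const {
            return a.prio > b.prio;        // std::priority_queue pops the "largest", so invert
        }
    };

    int main() {
        std::priority_queue<XfrRequest, std::vector<XfrRequest>, ByPriority> q;
        q.push({XfrPriority::SignRefresh, "bulk-zone-1.example"});
        q.push({XfrPriority::SignRefresh, "bulk-zone-2.example"});
        q.push({XfrPriority::Api,         "customer-edit.example"});

        while (!q.empty()) {               // the customer edit comes out first
            std::cout << q.top().zone << "\n";
            q.pop();
        }
    }

With this ordering, a customer-triggered transfer jumps ahead even if hundreds of thousands of signature-refresh transfers are already queued, which is exactly the behavior we wanted.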
Yeah, and then we had some other issues, most of which were either minor, solved in the load balancer layer, or just fixed in PowerDNS updates. TCP performance is still something I want to look at in PowerDNS just for fun, as an open source developer; it's on my list of things I want to improve. We had various smaller bugs in the LMDB backend because it was quite new. We were not the first ones to run it at really large scale, but we were one of the first, and we did see some problems that nobody else had hit yet. One CVE we discovered literally within a day of rolling out a new version, so that was very fun for Peter, because he got to roll out a new release a day after he had released the previous one. We also had an issue with certain query patterns that were specifically designed to target a weakness in PowerDNS. TransDNS didn't care about them, but PowerDNS did get affected. We eventually resolved that by adding some detection at the load balancer layer that would just block queries for the affected domains. It meant that that customer's domain would have limited functionality, but at least it would still work, and all the other customers would not be affected, which for us was the most important thing.

So, some closing thoughts. Migrating a homegrown setup is really not for the faint of heart. However, running one is also not for the faint of heart. It is worth it, though. It just gives you a lot more flexibility, because now adding new record types is just a question of adding them in our front end and making them work, whereas before we had to add the new record type at every single step in the stack, and that just took a lot of time. We can even, in theory, add different brands of secondaries. Currently there are a few issues preventing that, but they're relatively easy to solve, so we could just run Knot as a secondary, or NSD, or even BIND, if we would want to do that for some reason.

What I really, really noticed is: don't try to do this in one go, because it's a lot of work and you'll make mistakes. If you do it in smaller steps, the mistakes will be smaller and easier to fix, and it also just feels a lot better if you can accomplish some things in between rather than trying to do it all at once. One thing I wanted to add: DNSSEC incentives work, both for getting people to use DNSSEC and for improving the quality of the DNSSEC deployments, because we've seen, especially in the .nl zone, and I've also been involved in that work a little bit, some very bad implementations that got fixed when the rules were made stricter, including ours initially, but there were even worse ones.

Yeah, that is it. For the people that would like to take a look: I open sourced TransDNS before I left the company, so I can still see it myself as well, which is fun. It's on GitHub. I've also put up the URLs for the two major pull requests I made; there are a bunch of other ones, but I haven't included all of them. And that's about it. So, questions?

[Audience comment] So that makes it a bit more of a concern in your case, and I used to be a customer. On the upside, most of these mistakes probably weren't noticed by the majority, but I think you should take it more seriously if you are a company that actually makes money out of hosting.

So, the comment is that we both made mistakes; it's a bit related to the talk Kevin did, so a lot of the things we said are related, and the comment is that Kevin was only doing it for himself, while we were doing it for paying customers. Yeah, I agree. When I started there were a lot of issues, and I've tried over the years to fix them as much as I could. To be clear, I wasn't hired to maintain TransDNS; that just happened to be something that got shoved into my lap because I knew some C and C++. I did become pretty passionate about it, and I got rolled into the PowerDNS community pretty quickly. I also contributed a lot to dnsdist when that was getting started. But I agree with the initial statement; I've tried to fix it as much as I could.

[Audience comment] Sometimes you set out with certain criteria.
You build something that can meet those criteria, and it scales to a certain point. Eventually you get to a million customers, a lot of customers. The company didn't start off with a million customers, and maybe at the time this was a good system, at least that's how it looks to me. But as the business grows and things grow, you have to do exactly what he did: you evaluate it and say, you know, it's time for something different. He identified that and made the changes accordingly.

A brief summary: he said that sometimes, due to scaling, you run into issues that you hadn't foreseen when you initially set something up, and just taking a step back to resolve them in the end can be a good thing.

There was a question over there. The question is how I got them to agree with open sourcing it. At the point that I open sourced it, I was sort of CTO slash head of R&D of the Dutch part of the organization. Also, I only open sourced it after we had taken it out of production completely, so it's mainly of historic interest.

[Audience question] Did you ever consider open sourcing TransDNS before switching?

So the question is: did we consider open sourcing it before switching? No. And I'll tell you why: we weren't very proud of, or at least I wasn't proud of, the code quality. I didn't write all of it myself, I only contributed to it later, and I tried to improve it as much as I could, but it's still not great. It's very focused on doing just one thing, and it's very good at that, but it's not very suitable for use by others. I think it's interesting now to see some of the tricks in the code that make things really fast, but beyond that, I would never use it in a production environment other than the one it was in, because that environment was built specifically to run around that code.

[Audience question] What was actually the motivation for implementing your own DNS hosting software, TransDNS? I think even around the time when you started as a company, wasn't there software available that could have been used?

So the question is: what was the motivation to implement our own DNS software? To be clear, TransDNS was implemented in 2003, so this was roughly when PowerDNS was starting to grow, but the problem was that there were already quite a lot of zones, and it just got a little bit cumbersome using BIND, because that was the primary name server software you'd use back then. That was the main motivation. BIND was getting annoying because you had to have a lot of zone files, and everything was running on FreeBSD using UFS, so there was a 32,000 files per directory limit at that point, which also didn't help. I mean, there are ways to solve that, it's not that complicated, but that was the main motivation, as well as, I think, some performance issues in BIND back then that were relatively easily resolved. The other alternative would have been djbdns, but that had its own things, like the guy who wrote it; not saying you should use it.

Anything else?