[00:00.000 --> 00:10.000] Next up, we have two speakers for the price of one. [00:10.000 --> 00:15.480] They are going to talk about everything to do with an open, open, more open source version [00:15.480 --> 00:16.480] of Tailscale. [00:16.480 --> 00:21.000] So let's give a round of applause to Kristoffer and Juan. [00:21.000 --> 00:22.000] Hello. [00:22.000 --> 00:23.000] Hello. [00:23.000 --> 00:24.000] Okay. [00:24.000 --> 00:25.000] This is cool. [00:25.000 --> 00:26.000] Hello. [00:26.000 --> 00:31.680] My name is Kristoffer, and I'm going to, together with Juan there, talk a bit about [00:31.680 --> 00:38.680] how we use integration testing to kind of reimplement the control panel, or the control [00:38.680 --> 00:40.800] server, of Tailscale. [00:40.800 --> 00:43.680] So first, a little bit about us. [00:43.680 --> 00:46.120] Juan Font Alonso is the creator of Headscale. [00:46.120 --> 00:50.600] He works for the European Space Agency on cloud, DevOps, and infrastructure. [00:50.600 --> 00:55.200] He claims to have been my first manager, but I think that's incorrect. [00:55.200 --> 00:58.680] And he has the attention span of a goldfish. [00:58.680 --> 01:03.160] Which makes the whole collaboration very fun, and I'm here with Kristoffer. [01:03.160 --> 01:09.800] He's a top contributor to Headscale and one of the other maintainers alongside me. [01:09.800 --> 01:15.320] He's part of the technical staff at Tailscale, and part of his time at Tailscale is to work [01:15.320 --> 01:17.120] on improving Headscale. [01:17.120 --> 01:21.960] I was his manager, at least from a hierarchical point of view. [01:21.960 --> 01:26.440] And one of the challenges we have is that he always finds these kinds of super niche [01:26.440 --> 01:31.360] languages, like OCaml or things like that, in which he tries to reimplement Headscale. [01:31.360 --> 01:38.800] But first of all, how many people here know Tailscale and Headscale? [01:38.800 --> 01:39.800] Good. [01:39.800 --> 01:43.640] That's pretty good. [01:43.640 --> 01:49.640] So for the people who don't know, we'll do a quick recap of what Tailscale is. [01:49.640 --> 01:56.080] So Tailscale tries to solve this problem where you basically sit there and you want to connect [01:56.080 --> 02:02.640] your organization or home or something like this, and you have an old-school or legacy [02:02.640 --> 02:09.160] VPN concentrator: you connect into your kind of perimeter, you have access to absolutely [02:09.160 --> 02:13.400] everything, there's a single point of failure, and a massive bottleneck. [02:13.400 --> 02:22.040] And it tries to solve this by creating a mesh VPN that uses direct WireGuard [02:22.040 --> 02:27.960] connections, facilitating this for you using techniques like NAT traversal, and it has a very, [02:27.960 --> 02:33.920] very powerful client that will make sure that you always reach what you're trying to get [02:33.920 --> 02:40.440] to. It offers a lot of different kinds of granular access, and you get a lot more [02:40.440 --> 02:46.880] power compared to your old-school, bottlenecked, single-point-of-failure VPN. [02:46.880 --> 02:51.000] And in Tailscale, the clients are open source, at least for the open platforms, and what [02:51.000 --> 02:52.800] they have is a closed SaaS. [02:52.800 --> 02:57.280] But still, they are quite open when it comes to explaining how the whole thing works.
[02:57.280 --> 03:02.600] And in March 2020, they published a blog post basically explaining how the whole thing worked, [03:02.600 --> 03:09.200] how they use these NAT traversal techniques so you don't have to open ports in your router. [03:09.200 --> 03:14.920] And there was a phrase in this blog post that caught my attention for a little bit, and [03:14.920 --> 03:18.960] it was basically talking about a coordination server: the clients talk to a coordination [03:18.960 --> 03:27.000] server, the core of this service offering, which is essentially a shared dropbox [03:27.000 --> 03:31.560] for WireGuard public keys. [03:31.560 --> 03:39.280] So I was hooked by that, and basically I took those open source clients and started reverse [03:39.280 --> 03:47.080] engineering: basically a lot of print statements to see what kind of payloads they were sending, [03:47.080 --> 03:52.360] what kind of endpoints and protocol they were using. [03:52.360 --> 04:00.280] And yeah, this was around April 2020. I had a lot of free time at that time, and [04:00.280 --> 04:02.440] in June I did the initial release. [04:02.440 --> 04:06.960] I talked to my friend Kristoffer about Tailscale, and he was very happy distributing [04:06.960 --> 04:13.320] WireGuard keys with Ansible, which, yeah... so I kept doing my own thing for a while. [04:13.320 --> 04:19.000] Headscale gained a little bit of traction, and around mid-2021 he joined, because he [04:19.000 --> 04:20.880] was quite curious about the whole thing. [04:20.880 --> 04:27.280] But he was afraid of breaking stuff, and that's kind of why we are here. Although [04:27.280 --> 04:30.800] he was not afraid of making a logo, and I think it's super nice. [04:30.800 --> 04:38.360] So what I learned doing this reverse engineering exercise is that the Tailscale clients talk [04:38.360 --> 04:41.680] to what is basically a web service. [04:41.680 --> 04:47.920] This web service receives metadata from the clients, like the endpoints or the WireGuard [04:47.920 --> 04:54.440] public keys that they use, and assigns them IP addresses, like you would have in a classic, [04:54.440 --> 04:56.480] traditional VPN service. [04:56.480 --> 05:04.880] And since everybody knows about everyone else, you can establish this mesh network across the clients [05:04.880 --> 05:09.600] without interference, because the data doesn't go through the web service. [05:09.600 --> 05:13.840] So we arrive at the initial stage of Headscale: the illusion that everything works. [05:13.840 --> 05:17.480] And it kind of worked, until it stopped doing so. [05:17.480 --> 05:23.240] So we had this web service; we implemented the series of endpoints [05:23.240 --> 05:28.400] that we found in the reverse engineering exercise, and we were assigning an IP address [05:28.400 --> 05:34.080] when a node arrived. And what happens when a second node arrives? [05:34.080 --> 05:40.080] Hey, I want to tell everyone that I'm here, and I want to find my friends and communicate [05:40.080 --> 05:41.080] with them. [05:41.080 --> 05:47.080] So in order to handle that, and to handle the metadata that you need to establish the [05:47.080 --> 05:51.640] connections, we developed a little bit of a state machine that would handle: a node [05:51.640 --> 05:58.680] has arrived, there's been a change in the map of the network, and we need to distribute [05:58.680 --> 06:02.640] the updated metadata that we have.
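To make the web service described above concrete, here is a minimal sketch of such a coordination service in Go. This is a toy under stated assumptions, not Headscale's actual code: the endpoint name, the JSON shape, and the naive IP allocator are all hypothetical, and the real Tailscale control protocol is considerably more involved.

```go
// A minimal sketch of the "shared dropbox for WireGuard public keys" idea:
// a web service that records each node's public key and endpoints, hands out
// an IP address, and returns the current peer list. Hypothetical names and
// shapes throughout; not the real control protocol.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"sync"
)

type Node struct {
	PublicKey string   `json:"public_key"` // WireGuard public key
	Endpoints []string `json:"endpoints"`  // host:port candidates for NAT traversal
	IP        string   `json:"ip"`         // address assigned by the server
}

type Coordination struct {
	mu    sync.Mutex
	nodes map[string]Node // keyed by public key
	next  int             // naive IP allocator
}

func (c *Coordination) register(w http.ResponseWriter, r *http.Request) {
	var n Node
	if err := json.NewDecoder(r.Body).Decode(&n); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	c.mu.Lock()
	if existing, ok := c.nodes[n.PublicKey]; ok {
		n.IP = existing.IP // a known node keeps its address
	} else {
		c.next++
		// 100.64.0.0/10 is the CGNAT range Tailscale assigns from.
		n.IP = fmt.Sprintf("100.64.0.%d", c.next)
	}
	c.nodes[n.PublicKey] = n

	// The response is the "network map": every peer this node may talk to.
	peers := make([]Node, 0, len(c.nodes))
	for _, p := range c.nodes {
		peers = append(peers, p)
	}
	c.mu.Unlock()

	json.NewEncoder(w).Encode(map[string]any{"self": n, "peers": peers})
}

func main() {
	c := &Coordination{nodes: map[string]Node{}}
	http.HandleFunc("/register", c.register)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The important property is visible even in the toy: the server only ever exchanges metadata, while actual traffic flows directly between peers over WireGuard.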
[06:02.640 --> 06:10.360] However, at that time I was kind of learning Go, and we followed a little bit of a weird [06:10.360 --> 06:18.440] approach when handling concurrency, which was basically adding another mutex every time [06:18.440 --> 06:19.440] we needed one. [06:19.440 --> 06:25.480] And this is a problem, because in the end we ended up with a giant mutex for this state [06:25.480 --> 06:30.280] machine, and that is a very big problem, because the Python track is tomorrow, so the GIL [06:30.280 --> 06:32.640] jokes belong over there. [06:32.640 --> 06:37.240] So what ended up happening inside the state machine, or what didn't end up happening: [06:37.240 --> 06:42.880] basically, one of the failure modes we saw was that a new node tried to register, [06:42.880 --> 06:46.920] and then we burned a couple of CPU cycles trying to calculate some stuff, and then we [06:46.920 --> 06:48.120] did nothing. [06:48.120 --> 06:51.280] So no updates were sent out or anything. [06:51.280 --> 06:56.320] Sometimes we would have a new node joining, and we would compute everything and send some [06:56.320 --> 07:01.080] network traffic, but we just omitted the new information that was kind of crucial for everyone to know, [07:01.080 --> 07:03.320] so it ended up not working. [07:03.320 --> 07:08.320] And sometimes a new node joined, nothing really happened, but then eventually something [07:08.320 --> 07:13.760] happened and it sent out an update to everyone, and that was, you know, useful. [07:13.760 --> 07:20.440] And sometimes, on the individual update channels for each node, some of these aforementioned [07:20.440 --> 07:25.520] mutexes kind of deadlocked the whole thread, or the goroutine, and then we just never sent [07:25.520 --> 07:29.880] updates to particular nodes. And sometimes we just deadlocked the whole server and you [07:29.880 --> 07:34.880] kind of had to kick it to make it come back to life. [07:34.880 --> 07:40.800] But still, there was kind of this notion that it did work pretty well, eventually, most of [07:40.800 --> 07:48.040] the time, and it gave this illusion of working, because what you often saw was that you had [07:48.040 --> 07:53.120] like three nodes, and only two of them actually talked together, and as long as those had [07:53.120 --> 07:58.000] received the updates they needed, you know, the user was happy and you're just like, ah, [07:58.000 --> 08:04.680] it works, so I'm going to press the star sign on GitHub and share it with my friends.
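As an illustration of that failure mode, here is a hedged sketch — hypothetical code, not Headscale's actual implementation — of how a blocking channel send performed while holding a shared mutex can wedge both the sender and every receiver:

```go
// Hypothetical sketch of the deadlock pattern: one big mutex plus per-node
// update channels. Not Headscale's actual code.
package main

import "sync"

type server struct {
	mu      sync.Mutex               // the "giant mutex" around all state
	updates map[string]chan struct{} // one unbuffered update channel per node
}

// notifyAll is called whenever the network map changes.
func (s *server) notifyAll() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.updates {
		// Blocking send while holding s.mu: if the goroutine that should
		// drain ch is itself waiting to take s.mu (for example, to read the
		// network map it is about to send out), neither side can make
		// progress, and every other caller piles up behind the lock.
		ch <- struct{}{}
	}
}
```

The rewrite described later essentially amounts to never doing this: don't perform a blocking send while holding a lock the receiver also needs.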
[08:04.680 --> 08:10.880] But we figured that eventually this would, like, catch up with us, and we're trying to [08:10.880 --> 08:15.800] get past this stage where, you know, it works most of the time. What we did have was a [08:15.800 --> 08:21.120] fair amount of unit tests, but the problem with unit tests is that we were reverse [08:21.120 --> 08:26.680] engineering something that we were also still learning how it works, and what we spent a lot of time [08:26.680 --> 08:31.800] on was misunderstanding how it was supposed to work — writing unit [08:31.800 --> 08:37.080] tests that would pass, but they were wrong. So you kind of have a passing test [08:37.080 --> 08:41.680] and an entirely wrong implementation. And 90% of what we were actually trying to [08:41.680 --> 08:48.120] do was integrate with third-party software, and this is where we get to actual integration [08:48.120 --> 08:53.240] tests. So what I started doing was, I found this dockertest framework, which basically [08:53.240 --> 08:59.800] allows you to programmatically create Docker containers. So we started making tests that [08:59.800 --> 09:07.200] spun up a Headscale container, created a bunch of Tailscale instances also running [09:07.200 --> 09:12.040] in Docker, associated them with a couple of users, and tried to, like, emulate the entire [09:12.040 --> 09:15.240] environment so you can test everyone-to-everyone. [09:15.240 --> 09:21.360] We had them join the Headscale server, and since it takes a little bit of time for everyone [09:21.360 --> 09:25.680] to catch up with each other and, you know, send the updates and stuff, we put a sleep [09:25.680 --> 09:33.480] of two minutes in front of the test, which is a terrible idea, but, you know, you learn. [09:33.480 --> 09:38.200] And then, after that sleep runs out, presumably everyone is now up to date and can talk to [09:38.200 --> 09:42.520] each other. So we had a test — the most basic test is: is my network working, can everyone [09:42.520 --> 09:43.720] ping everyone?
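A hedged sketch of what that first generation of tests could look like with ory/dockertest, including the infamous fixed sleep. The image names, the elided join step, and pinging peers by container name are all illustrative; the real Headscale harness is more elaborate.

```go
// Sketch of a first-generation integration test: one Headscale container,
// two Tailscale containers, a two-minute sleep, then all-to-all pings.
package integration

import (
	"testing"
	"time"

	"github.com/ory/dockertest/v3"
)

func TestPingAllToAll(t *testing.T) {
	pool, err := dockertest.NewPool("") // connect to the local Docker daemon
	if err != nil {
		t.Fatalf("connecting to Docker: %s", err)
	}

	// Image names and tags here are illustrative.
	headscale, err := pool.Run("headscale/headscale", "latest", nil)
	if err != nil {
		t.Fatalf("starting headscale: %s", err)
	}
	defer pool.Purge(headscale)

	var nodes []*dockertest.Resource
	for i := 0; i < 2; i++ {
		ts, err := pool.Run("tailscale/tailscale", "latest", nil)
		if err != nil {
			t.Fatalf("starting tailscale: %s", err)
		}
		defer pool.Purge(ts)
		nodes = append(nodes, ts)
		// Joining the Headscale server (login URL, auth key, `tailscale up`)
		// is elided here; the real tests drive the CLI via exec.
	}

	// The infamous fixed sleep: hope the network map has converged by now.
	time.Sleep(2 * time.Minute)

	// Everyone pings everyone. Pinging by container name is illustrative;
	// the real tests use tailnet IPs and MagicDNS hostnames.
	for _, from := range nodes {
		for _, to := range nodes {
			if from == to {
				continue
			}
			exitCode, err := from.Exec(
				[]string{"tailscale", "ping", to.Container.Name},
				dockertest.ExecOptions{},
			)
			if err != nil || exitCode != 0 {
				t.Errorf("ping failed: exit=%d err=%v", exitCode, err)
			}
		}
	}
}
```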
[09:43.720 --> 09:50.120] So we tried to do that, and of course that didn't work, because of all of the errors we [09:50.120 --> 09:56.160] actually had in the code. And I tried to make some initial statistics on my [09:56.160 --> 10:02.480] laptop, and out of like 100 test runs we had 70 failures. That's pretty bad, but at this [10:02.480 --> 10:06.800] point we're starting to get somewhere: we have an actual goal that we can measure, so [10:06.800 --> 10:13.960] we can improve on this. And quite rapidly we figured out that the two big blocks of [10:13.960 --> 10:19.240] problems that we had were associated with two things. One of them was being able to [10:19.240 --> 10:24.960] reliably send updates to all of our clients, which is the kind of deadlock problem where [10:24.960 --> 10:29.360] the update channels were just locking up and didn't really work. So we made a massive, [10:29.360 --> 10:33.840] massive rewrite PR that redid the whole logic and made sure that we were always able [10:33.840 --> 10:38.480] to send an update to a client as long as it was connected. And then the other problem [10:38.480 --> 10:43.400] was the state machine, which was very broken. We kind of figured out that we could [10:43.400 --> 10:50.160] make a global state, and we tried to simplify it initially and optimize later. So basically, [10:50.160 --> 10:57.440] with a global state, how can we determine if everyone is up to date? Make sure that we know [10:57.440 --> 11:01.720] when you last received a successful update, and if it's stale, we have to re-issue one to make [11:01.720 --> 11:03.800] sure that you know everything. [11:03.800 --> 11:09.320] However, changing the Rambo culture takes a little bit of time. [11:09.320 --> 11:15.520] We kept merging stuff without proper integration testing, but as Kristoffer said, we didn't [11:15.520 --> 11:20.800] have the incentive, we didn't have the pressure, because the thing mostly worked. [11:20.800 --> 11:26.040] It's not the same when you are in your home lab and you join one node as when you are [11:26.040 --> 11:32.120] joining 100 nodes within one second. So if you were slowly joining machines to your tailnet, [11:32.120 --> 11:34.720] things were working. [11:34.720 --> 11:42.160] However, the project was gaining popularity, and we were getting more and more contributions [11:42.160 --> 11:49.360] and external PRs. This was around August 2021, or September, something like that. [11:49.360 --> 11:56.880] So it was great: we were getting to a point where we could improve Headscale with confidence. [11:56.880 --> 12:08.000] Given that the project started as a reverse engineering effort, we [12:08.000 --> 12:13.880] had a lot of stuff that was not that great. We could improve and maintain the compatibility [12:13.880 --> 12:20.800] with these third-party external clients that we are using, and we could improve from a [12:20.800 --> 12:27.600] community point of view; I'm going to talk a little bit about this now. [12:27.600 --> 12:32.920] For starters, we could improve from a technical point of view: we could do massive refactorings [12:32.920 --> 12:39.680] within the project, or implement the second version of the Headscale protocol, [12:39.680 --> 12:45.960] without breaking the existing users. The only thing that breaks is probably the mental health [12:45.960 --> 12:49.560] of the reviewer that has to deal with 3,000 lines of code. [12:49.560 --> 12:53.080] But that's a different thing.
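The simplified global state described a moment ago can be pictured as two timestamps — a hedged sketch with hypothetical names, not the actual Headscale types: one global "last state change" and one "last successful update" per node, where a node whose timestamp lags the global one gets a full update re-issued.

```go
// Hypothetical sketch of the simplified global-state idea.
package main

import (
	"sync"
	"time"
)

type State struct {
	mu              sync.Mutex
	lastStateChange time.Time            // bumped on any change to the network map
	lastSeenUpdate  map[string]time.Time // per node: last successful update delivered
}

// Touch records that the network map changed (node joined, key rotated, ...).
func (s *State) Touch() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastStateChange = time.Now()
}

// NeedsUpdate reports whether a node must be sent a fresh network map.
func (s *State) NeedsUpdate(node string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.lastSeenUpdate[node].Before(s.lastStateChange)
}

// MarkUpdated records a successful update delivery to a node.
func (s *State) MarkUpdated(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastSeenUpdate[node] = time.Now()
}
```

The design choice is deliberately crude: a single global timestamp over-delivers updates, but it is easy to reason about, and it makes "is everyone up to date?" a question the server can actually answer.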
[12:53.080 --> 13:00.480] Then, as I said, we have this minor small detail that we completely depend on a third-party [13:00.480 --> 13:07.040] client, because we are using exactly the same official clients as Tailscale. However, [13:07.040 --> 13:12.400] we have a very good working relationship with them, and every time they change something, [13:12.400 --> 13:13.960] we get a heads-up. [13:13.960 --> 13:21.960] And within our integration tests, we keep quite a bit of commitment to supporting this [13:21.960 --> 13:22.960] client. [13:22.960 --> 13:28.280] So we target the head of the repository, we target the unstable releases, and we target [13:28.280 --> 13:35.040] nine minor releases of the client, to make sure that nothing breaks from their side or from [13:35.040 --> 13:38.360] ours, because, I mean, it can happen. [13:38.360 --> 13:48.040] And then I think integration testing also helps the community, because we as maintainers [13:48.040 --> 13:54.880] can better trust those random PRs from random, unknown people that appear on [13:54.880 --> 14:01.640] GitHub, which is something that is not a given. [14:01.640 --> 14:08.560] And in theory, or that's what one would think, by having integration tests, contributors — [14:08.560 --> 14:15.080] those external people that we don't know — should also feel more confident when submitting [14:15.080 --> 14:16.080] a PR. [14:16.080 --> 14:18.680] But that's a theory. [14:18.680 --> 14:21.480] So it does still come with some challenges. [14:21.480 --> 14:26.760] One of the things that we see occasionally is that a PR comes in and it doesn't have [14:26.760 --> 14:33.320] a test, and then we ask nicely if they can add a test, and then the contributor disappears. [14:33.320 --> 14:42.000] So some of the time we're trying to improve on this thing and kind of, like, always get [14:42.000 --> 14:44.000] them in. [14:44.000 --> 14:49.080] What we try to do is, if they truly disappear, we try to pick it up, if it's a feature that [14:49.080 --> 14:51.840] we really want and we have the bandwidth to do so. [14:51.840 --> 14:58.640] Or we try to reach out and kind of sit and help them write a test, and kind of onboard [14:58.640 --> 15:05.520] them in these kinds of things. For one of the tests, actually — there is an SSH feature, [15:05.520 --> 15:11.680] and for the test for that, I knew the developer and he was also in Norway, so once when I was [15:11.680 --> 15:16.440] dropping by Oslo, we sat down for an afternoon and we worked on them together and paired [15:16.440 --> 15:17.440] on them. [15:17.440 --> 15:19.360] That's not available for everyone, sadly. [15:19.360 --> 15:26.880] But, you know, we always try to kind of get this testing message out there in a way. [15:26.880 --> 15:32.080] But there are a couple of other challenges as well, and that is that adding tests [15:32.080 --> 15:33.840] raises some sort of learning curve. [15:33.840 --> 15:38.000] So, you know, you need to know go test, you need to understand our test framework, you [15:38.000 --> 15:41.200] need to have Docker and all of this kind of thing, whereas plain unit tests are [15:41.200 --> 15:43.000] a lot less code. [15:43.000 --> 15:48.000] And it's hard to convince people how awesome tests actually really are — that they're not [15:48.000 --> 15:54.680] really a chore, and that you really, really thank yourself later for writing them.
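Going back to the version matrix for a moment: expressed in Go, it might look something like this sketch (the concrete version numbers listed here are made up, not the real support matrix).

```go
// Illustrative sketch of a client-version matrix: repository HEAD, the
// latest unstable release, and nine minor releases.
package integration

var tailscaleVersions = []string{
	"head",     // built from the tip of the tailscale/tailscale repository
	"unstable", // the latest unstable release
	"1.36.0", "1.34.2", "1.32.3", "1.30.2", "1.28.0",
	"1.26.2", "1.24.2", "1.22.2", "1.20.4", // nine minor releases
}
```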
[15:54.680 --> 15:58.440] So some of the things we're trying to do to lower this barrier even further, since we're [15:58.440 --> 16:03.080] so heavily dependent on this for compatibility and everything, is that we're making, like, [16:03.080 --> 16:08.840] our own test framework, v2, because the old one depended on a lot of repeated and copied code, there [16:08.840 --> 16:13.920] was a really high bar for adding new tests, it was really hard to update and change, [16:13.920 --> 16:20.760] and it did depend on time.Sleep, which, yeah, haunted me so many times, and it couldn't [16:20.760 --> 16:27.120] really be run in parallel, for many of the previous reasons, and the documentation wasn't [16:27.120 --> 16:31.560] really good — like, I knew how to use it, Juan knew how to use it, and then that was about [16:31.560 --> 16:32.560] it. [16:32.560 --> 16:35.480] So a couple of other people figured it out. [16:35.480 --> 16:39.360] So what we're trying to do is abstract things a bit away. We have this concept [16:39.360 --> 16:45.960] called a control server, which is essentially what Headscale is, and also the Tailscale product, [16:45.960 --> 16:50.520] the software as a service. It's implemented as, like, Headscale-in-a-container, and it exposes [16:50.520 --> 16:56.240] convenience functions that now have GoDoc support and all of these things, to make it easier [16:56.240 --> 17:00.680] for developers to actually use it. And then we have the Tailscale client, which is implemented [17:00.680 --> 17:05.040] as Tailscale-in-a-container, and it has the same type of convenience functions. And what [17:05.040 --> 17:11.960] this allows us to do: previously, the two files on the right here — sorry, on the left — [17:11.960 --> 17:18.360] are two different versions of the setup code for the tests, because when you needed something [17:18.360 --> 17:23.800] that was slightly special, you had to copy the whole thing and then make a new file to [17:23.800 --> 17:28.280] be able to write a test case like you see on the other side here. But now, after abstracting [17:28.280 --> 17:33.480] that away and making it a lot more configurable, we allow people to write more or less regular [17:33.480 --> 17:38.520] test cases: you just set up what we call a scenario, which is a Headscale with a given [17:38.520 --> 17:44.560] number of Tailscale nodes, and then you let them ping each other or something like this. [17:44.560 --> 17:47.080] So what do we test right now? [17:47.080 --> 17:53.600] We kept all of the original tests, so basically we make all nodes join the network, [17:53.600 --> 17:57.600] and we make them ping each other to verify that we have a fully functioning network, both [17:57.600 --> 18:04.000] by IP and by MagicDNS — MagicDNS is Tailscale's DNS system. [18:04.000 --> 18:08.880] We test Taildrop, which is a file sharing feature, a bit like Apple's AirDrop, and we [18:08.880 --> 18:13.560] send files from every node to every node to make sure that it works. [18:13.560 --> 18:17.840] We test all our registration flows, because we've broken them a couple of times, so it was [18:17.840 --> 18:23.560] better to do it that way: pre-auth keys, and web plus a command-line flow, and [18:23.560 --> 18:26.800] we even currently have tests for OpenID.
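Stepping back to the control server and Tailscale client abstraction for a moment, it can be pictured as two small interfaces plus a scenario that wires them together — a hedged sketch with illustrative method sets, not the exact Headscale test framework API:

```go
// Hedged sketch of the v2 framework's abstraction. Anything that can hand
// out a network map is a ControlServer, whether it is Headscale-in-a-container
// or the hosted Tailscale service; anything that can join and ping is a
// TailscaleClient. Method sets here are illustrative.
package integration

type ControlServer interface {
	StartUp() error
	Shutdown() error
	Endpoint() string                         // login URL handed to clients
	CreateAuthKey(user string) (string, error)
}

type TailscaleClient interface {
	Login(controlURL, authKey string) error
	IPs() ([]string, error)
	Ping(target string) error
}

// A Scenario wires one control server to a set of clients.
type Scenario struct {
	Control ControlServer
	Clients []TailscaleClient
}

// PingAll checks all-to-all connectivity across the scenario.
func (s *Scenario) PingAll() error {
	for _, from := range s.Clients {
		for _, to := range s.Clients {
			if from == to {
				continue
			}
			ips, err := to.IPs()
			if err != nil {
				return err
			}
			for _, ip := range ips {
				if err := from.Ping(ip); err != nil {
					return err
				}
			}
		}
	}
	return nil
}
```

With that in place, a test reads as: build a scenario with one control server and a given number of clients, wait for convergence, then ping all — no copied setup files.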
[18:26.800 --> 18:31.520] We try to isolate all of our network from the internet and test with our own embedded [18:31.520 --> 18:36.760] relay server, because Tailscale depends on some relay (DERP) servers, which we also embed in [18:36.760 --> 18:45.000] our binary. And we have preliminary tests for the SSH features that we support — which is, [18:45.000 --> 18:50.800] like, SSH authenticated by Headscale, so you can SSH into your machines — and we test SSH all [18:50.800 --> 18:53.080] to all, and we try to do negative tests. [18:53.080 --> 18:58.760] And we also test our CLI, because if you change something, you don't want to sit and [18:58.760 --> 19:04.760] manually type in every single command in a structured way, because that's just painful. [19:04.760 --> 19:09.480] So in the future, we also want to kind of improve on the granular access control that [19:09.480 --> 19:11.120] Tailscale offers. [19:11.120 --> 19:17.480] Currently, this is a very good example of where we have added a lot of unit tests and they [19:17.480 --> 19:24.320] all pass, but they're all wrong — well, they're mostly wrong — so we have to kind of [19:24.320 --> 19:30.560] redo most of this as integration tests first, and then kind of backfill the unit tests once [19:30.560 --> 19:34.000] we know how the implementation is actually supposed to work. [19:34.000 --> 19:38.000] And one of the things we've been dabbling with, especially for this ACL feature, is [19:38.000 --> 19:43.200] to use that control server abstraction we had before and use the Tailscale product [19:43.200 --> 19:48.200] to test our tests, because if they pass on the public server, we know they're correct, [19:48.200 --> 19:50.840] and then we can use them to verify our thing. [19:50.840 --> 19:54.640] And then maybe run Tailscale in a VM instead of Docker to test it properly, but that's [19:54.640 --> 19:59.120] more of a benefit for Tailscale than it is for us. [19:59.120 --> 20:05.360] So if you're just here waiting for the next talk, a little bit of a TL;DR: I mean, [20:05.360 --> 20:09.880] we cannot overstate how important having this integration testing, when we depend on [20:09.880 --> 20:15.440] an external party, has been for the development of Headscale. [20:15.440 --> 20:20.880] I reckon the name, Headscale, is also excellent; Ponytailscale would have [20:20.880 --> 20:24.440] been worse. [20:24.440 --> 20:28.960] I mean, with integration testing, we are able to maintain this compatibility [20:28.960 --> 20:37.000] with the client, and we are able to take contributions from third-party developers — otherwise it's [20:37.000 --> 20:41.760] a little bit more difficult to develop this trust across the internet, right? [20:41.760 --> 20:46.880] And even though the tests are not perfect, and we still have to migrate unit tests towards [20:46.880 --> 20:56.280] integration tests, I think this is one of the keys to the success of the project. [20:56.280 --> 21:03.000] So, some extra things: Tailscale is hosting a happy hour at a BrewDog by the station. [21:03.000 --> 21:07.720] This QR code takes you to a sign-up form. I'll quickly switch back to this slide at [21:07.720 --> 21:12.320] the end, but I have, like, a questions slide as well, so, you know, we'll go through this.
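Returning to the "test our tests" idea for a moment, it falls out of the interface naturally — a hedged sketch with hypothetical constructors, building on the ControlServer sketch above: run the same scenario against both implementations, and treat the hosted Tailscale service as the oracle for correct behavior.

```go
// Hedged sketch of "testing the tests" via the ControlServer interface
// sketched earlier. Both constructors here are hypothetical names.
package integration

import "testing"

func TestACLsAgainstBothControlServers(t *testing.T) {
	servers := map[string]ControlServer{
		"headscale": NewHeadscaleInContainer(t), // hypothetical constructor
		"tailscale": NewHostedTailscale(t),      // the public SaaS, as oracle
	}

	for name, control := range servers {
		t.Run(name, func(t *testing.T) {
			scenario := &Scenario{Control: control}
			// ... set up clients and apply the ACL policy under test ...
			// If this passes against the hosted service, the test itself is
			// correct; a failure against Headscale then points at Headscale.
			if err := scenario.PingAll(); err != nil {
				t.Fatalf("connectivity under ACL policy: %s", err)
			}
		})
	}
}
```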
[21:12.320 --> 21:17.720] Basically, this is how to reach us: GitHub, we have a Discord community, and we're very [21:17.720 --> 21:21.600] happy to talk to anyone who wants to talk to us here at FOSDEM, so please feel free [21:21.600 --> 21:26.720] to reach out. And I'll leave it at this one, if anyone has any questions. [21:26.720 --> 21:30.240] We have some minutes, I think. [21:30.240 --> 21:37.560] Thank you. [21:37.560 --> 21:42.000] While I have your attention: someone here lost their wallet. Look to the left, [21:42.000 --> 21:46.320] look to the right, front and back; if you see a wallet that is not yours, please bring it [21:46.320 --> 21:49.200] right to the front, it will help this person a lot. [21:49.200 --> 21:50.200] Thank you. [21:50.200 --> 21:56.920] After you look for the wallet, if you have a question, raise your hand and I'll try [21:56.920 --> 22:07.040] to come over with this microphone. [22:07.040 --> 22:11.720] How come the Tailscale guys are not mad at you — and not only are they not mad at you, but [22:11.720 --> 22:16.320] they hired you afterwards? [22:16.320 --> 22:19.040] I mean, part of it — is it working? [22:19.040 --> 22:20.040] Yeah. [22:20.040 --> 22:21.040] No? [22:21.040 --> 22:22.040] Okay. [22:22.040 --> 22:25.440] I think part of it is that they are quite chill. I mean, they are quite [22:25.440 --> 22:31.640] chill, they could have taken this way worse than they have, and I don't think we are competition. [22:31.640 --> 22:38.080] We are focused on self-hosters, on home labs, perhaps a little bit on small companies. [22:38.080 --> 22:43.440] And what usually happens is that people use Headscale at home, then they go to their [22:43.440 --> 22:47.600] companies and they talk about Tailscale, and when you're in a company, you actually [22:47.600 --> 22:50.640] prefer to pay for the service. [22:50.640 --> 22:53.080] So it's like a way of... [22:53.080 --> 23:06.280] It's like a way of selling Headscale — sorry, Tailscale also. [23:06.280 --> 23:07.280] Okay, thank you very much. [23:07.280 --> 23:08.280] A last round of applause. [23:08.280 --> 23:24.040] If you have any questions, you can ask them in the hallway track.