[00:00.000 --> 00:13.160] So, hello everyone, my name is Florian, I work at Scaleway, I'm an engineering manager
[00:13.160 --> 00:20.440] and software developer there, and today we'll talk about how we use servers to handle our production
[00:20.440 --> 00:26.240] workload just as if they were containers.
[00:26.240 --> 00:32.200] So, a quick slide of context: Scaleway is a European cloud provider, we do a lot of things,
[00:32.200 --> 00:40.840] data centers, dedicated hosts, colocation, shared hosting, instances, databases, whatever, we
[00:40.840 --> 00:46.320] have physical locations in France, Amsterdam and in Poland, and I work in the storage team.
[00:46.320 --> 00:50.840] As I said, we are a team of ten people, we handle pretty much everything storage
[00:51.160 --> 00:58.600] related at Scaleway, that being the block and object storage products and also some older
[00:58.600 --> 01:04.200] systems like the RPN SAN and the data backup in the Online ecosystem, and we have around
[01:04.200 --> 01:13.320] a thousand servers in production and more than a hundred petabytes of storage. So when I joined
[01:13.320 --> 01:21.440] the team five years ago, the infra was what it was, it had grown organically over the years
[01:21.440 --> 01:32.160] and the versions were all over the place, everything was a bit custom for the team's needs at the time,
[01:32.160 --> 01:41.760] so some servers were locally installed, some were PXE-booted, but nothing was homogeneous, and we
[01:41.760 --> 01:48.320] had an old Perl-based automation system that suffered a lot because not that many people had
[01:48.320 --> 01:55.880] fluency in Perl, so it was pretty much a huge append-only script that did stuff. So we wanted
[01:55.880 --> 02:02.640] to start something fresh, something new, something we could work on for a few years, so we started
[02:02.640 --> 02:08.840] considering our options. Everybody at the time was starting to use containers, but we were not fans because
[02:08.840 --> 02:13.960] at that time they were not as mature, the tooling and ecosystem were not that great, and also we
[02:13.960 --> 02:20.000] wanted to try something different. We still use containers for development purposes, CI and whatnot,
[02:20.000 --> 02:27.400] but we decided to do PXE live-booting. It comes with some great advantages: first of all, you
[02:27.400 --> 02:32.000] just plug the server into the network, you boot your discovery image and you are pretty much set up,
[02:32.160 --> 02:40.400] nothing else to do. A reboot is just like taking down a container to update it, and when you
[02:40.400 --> 02:46.640] have thousands of servers in production, a reboot is not that big of a deal, it's just life. The only
[02:46.640 --> 02:52.800] downside is that you need to have a solid network and working DHCP. As a cloud provider, if you
[02:52.800 --> 03:01.560] don't have network, you can't do anything anyway, so not really a downside. So let's talk about automation.
[03:02.280 --> 03:10.760] After using the Perl stuff we started on Salt. It worked okay, but it was really hard to test
[03:10.760 --> 03:16.680] modifications before putting them in production, and at the time it was nobody's job, pretty much,
[03:16.680 --> 03:22.840] like no one had clear in-house responsibility for it, so it was not that well maintained.
[03:23.720 --> 03:29.400] So one day we decided to move to Ansible, mostly because everyone else at the time was moving to
[03:29.480 --> 03:37.080] Ansible in the company, so we wanted to share our efforts and small libraries so that everybody could
[03:38.920 --> 03:46.520] do things more cleanly and not have everyone doing their own thing on the side. It's
[03:46.520 --> 03:52.280] easier to write, easier to test, the learning curve is not as steep as with Salt, and you don't
[03:52.280 --> 03:55.880] have a central controller that has to be maintained by a central team, you
[03:55.880 --> 04:01.160] can just keep that in-house. But ultimately we wrote something along the lines of a controller,
[04:01.160 --> 04:11.160] just some really dumb stuff: around 250 lines of Go for the server and just shy of 100
[04:11.160 --> 04:17.160] for the client, which is written in Python. It's pretty dumb, just an API that a server calls when it boots,
[04:17.160 --> 04:21.400] and the API says: okay, here you are, here's your configuration.
[04:24.600 --> 04:30.600] Just to have something a bit cleaner that works well with PXE images, we have split the
[04:30.600 --> 04:36.360] automation into two parts. We have one part that is our deployment playbooks, whose only job in life
[04:36.360 --> 04:43.000] is to install and update all the software stacks, and that's pretty much it, and afterwards we have
[04:43.000 --> 04:48.760] the runtime playbooks, whose job is to set up networking, because basically when you boot
[04:48.760 --> 04:55.560] in the live-boot environment you just have a basic single-interface DHCP address and you need some network
[04:55.560 --> 05:00.120] configuration to make your service work; they assemble the RAIDs, mount filesystems, tune
[05:00.120 --> 05:06.200] the OS, like sysctl, the CPU governor or whatever, and after that just configure and start your services.
[05:06.600 --> 05:15.960] One quick note: the install playbooks are run on production servers and during image
[05:15.960 --> 05:21.240] creation, and some are there just to remove default packages that we do not want to have
[05:22.200 --> 05:30.200] in the built image. Now just a little bit about how we handle PXE image creation:
[05:31.080 --> 05:35.560] basically a PXE image is a squashfs that you download over the network just after
[05:35.560 --> 05:42.040] having booted your kernel, it's literally the rootfs of your server sitting inside the RAM.
[05:43.560 --> 05:49.080] The first ones that were used before I arrived were just snapshots of the rootfs of a VM,
[05:49.960 --> 05:56.200] afterwards there was docker-to-pxe.pl, basically a huge Perl script that was extracting the
[05:57.560 --> 06:03.240] filesystem of a container. It came with some limitations, because basically
[06:05.720 --> 06:10.520] the way container base images are made means they are not a fully functional OS.
[06:11.640 --> 06:19.080] Afterwards we used HashiCorp Packer, and now we use a small trick with sshd on a dedicated SSH port
[06:19.080 --> 06:24.600] that basically traps the Ansible connection inside the chroot on the build server.
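A minimal sketch of the boot-time "deploy me" handshake mentioned above, for illustration only: the talk says the real control plane is about 250 lines of Go with a Python client just shy of 100 lines, so everything here (endpoint URL, port, playbook name) is an assumption, not Scaleway's actual code.

```python
import json
import socket
import subprocess
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# --- client side: runs once at boot on every live-booted server -------------
def request_deployment(api_url="http://deploy.internal:8000/deploy"):
    """Announce ourselves to the deployment API so it can configure this host."""
    payload = json.dumps({"hostname": socket.gethostname()}).encode()
    req = urllib.request.Request(api_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)            # e.g. {"status": "scheduled"}

# --- server side: "reverse" Ansible run against whoever just booted ---------
class DeployHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length)) if length else {}
        target = self.client_address[0]   # IP of the server that called us
        print(f"deploy request from {body.get('hostname')} ({target})")
        # Push the runtime playbooks to the machine that just announced itself;
        # "target," is a one-host ad-hoc inventory for ansible-playbook.
        subprocess.Popen(["ansible-playbook", "-i", f"{target},", "runtime.yml"])
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"status": "scheduled",
                                     "target": target}).encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), DeployHandler).serve_forever()
```

The trailing comma in the inventory argument is what lets ansible-playbook treat a single IP as an ad-hoc inventory, which keeps the controller as dumb as the talk describes.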
[06:26.040 --> 06:31.640] The advantage of this is that the same playbooks for the deployment of software are used for both
[06:31.640 --> 06:38.200] the creation of the image and updating servers in production, and the build system is based
[06:38.200 --> 06:43.640] on the Ubuntu server image, nothing fancy, and we are using the default live-boot initramfs
[06:43.640 --> 06:51.240] packages that come with Debian and Ubuntu, with a few tweaks to allow for retries and to avoid
[06:51.240 --> 06:58.120] boot storms, like when you have 30 or 40 servers that reboot at the same time, they do not all fetch
[06:58.920 --> 07:03.480] the squashfs from the same machine, to avoid any networking issues.
[07:06.200 --> 07:11.560] Here's the fun part: there is a small playbook called pxe-magic.yml that handles a lot of
[07:11.560 --> 07:19.720] stuff, which is needed because the Ubuntu default packages are not all meant to be run inside
[07:19.720 --> 07:26.120] a live-boot environment. First we avoid triggering the package triggers during kernel install
[07:26.120 --> 07:31.000] because it saves a lot of time during the creation of an image, but afterwards you have to rebuild
[07:31.000 --> 07:36.440] all the DKMS and custom kernel modules to specifically target the kernel version that is
[07:36.440 --> 07:41.800] inside the chroot, because it's not necessarily the same as the host's. All the AppArmor profiles
[07:41.800 --> 07:51.640] have to be patched, because they limit what software can access which files, but as you are running
[07:51.640 --> 07:58.360] from RAM, you have some kind of overlayfs above it, so AppArmor, for ntpd for example,
[07:58.360 --> 08:07.800] is not targeting the NTP drift file in /var but somewhere like /var/lib/live/medium-
[08:07.800 --> 08:13.880] something on the overlayfs, so you have to patch all those. There's a flag, which took like two weeks to
[08:13.880 --> 08:21.160] find in a kernel mailing list, to get that support for
[08:21.240 --> 08:27.400] the overlayfs so you don't break the default configuration. And as the system
[08:27.400 --> 08:33.320] is amnesiac after each reboot, you have to take into account that we do not use any network
[08:33.320 --> 08:40.040] configuration utility, so no networkd, no ifupdown, nothing, so you have to take that into
[08:40.040 --> 08:49.480] account because some symlinks are broken by default.
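To make those image fixups a bit more concrete, here is a rough Python sketch of the two steps described: rebuilding DKMS modules against the kernel inside the build chroot, and duplicating AppArmor rules so the overlayfs copies of restricted paths are allowed too. The chroot location, overlay prefix and profile name are assumptions; the real fixups live in the pxe-magic Ansible playbook, not in a script like this.

```python
import re
import subprocess

CHROOT = "/srv/build/chroot"          # assumed location of the image chroot
OVERLAY_PREFIX = "/run/live/overlay"  # assumed live-boot overlay mount point

# 1. Rebuild DKMS modules for the kernel installed *inside the chroot*,
#    which is not necessarily the kernel running on the build host.
kernel = subprocess.run(
    ["chroot", CHROOT, "sh", "-c", "ls /lib/modules | sort -V | tail -n1"],
    capture_output=True, text=True, check=True).stdout.strip()
subprocess.run(["chroot", CHROOT, "dkms", "autoinstall", "-k", kernel], check=True)

# 2. Duplicate simple /var rules in an AppArmor profile (ntpd as an example)
#    so the same files are also allowed under the overlayfs path.
profile = f"{CHROOT}/etc/apparmor.d/usr.sbin.ntpd"
with open(profile) as f:
    lines = f.read().splitlines()
patched = []
for line in lines:
    patched.append(line)
    m = re.match(r"^(\s*)(/var/\S+)(\s+[rwk]+,)\s*$", line)  # naive rule match
    if m:
        patched.append(f"{m.group(1)}{OVERLAY_PREFIX}{m.group(2)}{m.group(3)}")
with open(profile, "w") as f:
    f.write("\n".join(patched) + "\n")
```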
So we've been doing that for five years, and
[08:49.720 --> 08:58.120] well, there's pretty much nothing to say, it just works. We have the convenience of containers, so we have
[08:58.920 --> 09:04.600] all the systems in production pretty much homogeneous, with pretty much the same version
[09:04.600 --> 09:10.840] everywhere, and we have the comfort of using just plain bare-metal servers without having any issues,
[09:11.560 --> 09:17.480] like a long time ago when you updated Docker, it had the bad habit of restarting every container
[09:17.560 --> 09:25.480] on the host without asking you anything. We can scale from 100 servers to a thousand
[09:26.680 --> 09:31.320] with only pretty much three to four people handling deployment and installation of servers,
[09:32.760 --> 09:39.560] and new servers are deployed quite fast, like it takes one hour maximum to deploy 10 new servers,
[09:39.560 --> 09:45.560] because you just have to collect the MACs, update the DHCP, and boot the machines, and it just
[09:45.560 --> 09:50.360] works, and if you want to update anything, you just have to reboot, it's pretty simple.
[09:53.000 --> 09:59.080] I think that's it, it was supposed to be fast. Does anyone have questions? Yes?
[09:59.080 --> 10:03.720] Does that mean that the operating system basically never writes to the disk on the client?
[10:04.440 --> 10:05.080] Excuse me?
[10:05.080 --> 10:08.520] Does that essentially mean that you have diskless servers there?
[10:08.520 --> 10:10.040] Yeah. Okay, cool.
[10:10.040 --> 10:17.240] So, the only data that's written on physical disk, that was the question.
[10:18.040 --> 10:22.520] So I was saying that we don't really install the servers, that there is no
[10:22.520 --> 10:28.600] OS installed on the servers, and yes, we do have a few ones, like the ones that handle the images,
[10:28.600 --> 10:34.920] the installation, the DHCP, and so on, but every other server inside our infrastructure is diskless
[10:35.640 --> 10:40.360] for the OS part, because we do have data from clients, and we keep the OS in memory,
[10:41.400 --> 10:44.360] so that pretty much every disk is used for clients.
[10:46.280 --> 10:47.280] Yes?
[10:47.280 --> 10:51.800] Did you consider using live-build for building your images instead of Packer?
[10:53.400 --> 10:59.160] So the question was, did we consider using live-build instead of Packer? We did use that,
[10:59.800 --> 11:05.720] but at the time we were running on public virtual machines, and the building of images was pretty
[11:05.720 --> 11:12.840] slow, so we moved to this solution using a chroot and a big-ass build server, and it's just really,
[11:12.840 --> 11:16.120] really fast, so that's why we use that now. Yes?
[11:16.120 --> 11:21.240] By swapping out the Salt master, that means that now Ansible needs to be run like ad hoc.
[11:21.240 --> 11:26.200] What kind of automation do you have on top of that? Because do you have to run Ansible every time
[11:26.200 --> 11:34.120] from your laptop? Yeah, so as I said, the playbooks are run at boot time on the servers.
[11:34.120 --> 11:38.360] Basically, on every server you have a small client written in Python that's like
[11:38.360 --> 11:44.200] 80 lines, but pretty much half of it is comments, and you just call an API saying,
[11:44.200 --> 11:49.960] hey, my IP is this, can you deploy me?
And we have multiple services everywhere that just
[11:49.960 --> 11:54.840] have a copy of all the deployment files and the configuration files, and it's a reverse
[11:54.920 --> 11:59.400] Ansible deployment from a server inside the infrastructure, so you don't have any human interaction to redeploy
[11:59.400 --> 12:20.760] a server. Any other questions? So thanks for your time.