[00:00.000 --> 00:12.000] Okay, hello everyone, thanks for being here so early on a Sunday morning.
[00:12.000 --> 00:18.400] My name is Laura, I work at Collabora, and today I'd like to share with you a war story
[00:18.400 --> 00:23.520] about how we built and grew our laboratory for upstream testing.
[00:23.520 --> 00:29.000] I'm going to share with you a little bit about our infrastructure, as well as some of
[00:29.000 --> 00:33.720] the challenges that we had to face while scaling up.
[00:33.720 --> 00:40.800] So our main goal was to build a big test bed for open source projects to use.
[00:40.800 --> 00:45.760] Of course we're going to need a diverse ecosystem of devices: many different devices
[00:45.760 --> 00:49.480] of different architectures and from different vendors.
[00:49.480 --> 00:55.120] Of course we're going to need software to automate the tests on the actual devices.
[00:55.120 --> 01:01.280] We need a monitoring system, so a way to monitor and assess the health of the devices that
[01:01.280 --> 01:07.000] we have in the lab, and we also need some recovery strategies.
[01:07.000 --> 01:12.320] Mainly, when devices start to misbehave or don't behave as expected, we need some
[01:12.320 --> 01:18.760] way of recovering them automatically, or putting them offline automatically if they're not
[01:18.760 --> 01:22.120] reliable enough to run tests.
[01:22.120 --> 01:29.120] So it all starts with a commit: the developer pushes the changes into a development branch,
[01:29.120 --> 01:37.120] the artifacts for the tests are all built automatically, a test job is submitted and run, the results
[01:37.120 --> 01:44.640] are gathered and parsed, and finally a report is generated and sent back to the developer.
[01:44.640 --> 01:49.960] From the lab perspective we're interested in the part that runs the test jobs and makes
[01:49.960 --> 01:56.800] the results available, and what we chose for our lab is LAVA, as we saw earlier; this is
[01:56.800 --> 02:00.640] the Linaro Automated Validation Architecture.
[02:00.640 --> 02:07.640] It automates the boot and deploy phases of the operating system on the device.
[02:07.640 --> 02:15.520] It has a really scalable scheduler: it allows running thousands of jobs on hundreds of devices
[02:15.520 --> 02:22.120] on a single instance, so that's really convenient for big labs.
[02:22.120 --> 02:29.120] It handles the power on the devices, so it switches the power on and off on the devices
[02:29.120 --> 02:35.000] when needed, and it also helps with monitoring the serial output.
[02:35.000 --> 02:40.200] Finally, it also makes the results of the tests available in many different formats,
[02:40.200 --> 02:44.000] which is again pretty convenient.
[02:44.000 --> 02:49.040] LAVA, again, just takes care of this part of the CI loop, while all the other phases need
[02:49.040 --> 02:54.320] to be implemented with different tools.
[02:54.320 --> 03:01.720] So in order to run devices in LAVA we need to fulfill a set of base requirements: of
[03:01.720 --> 03:07.960] course we're going to need to be able to turn the power on and off on the devices remotely,
[03:07.960 --> 03:16.320] we need access to a reliable serial console remotely, and finally we need some way of booting
[03:16.320 --> 03:22.680] an arbitrary combination of kernel, device tree and rootfs remotely.
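To make those base requirements more concrete, here is a minimal sketch of a LAVA job definition that deploys a kernel, device tree and ramdisk over TFTP and boots with U-Boot. The device type, artifact URLs and test repository are placeholders, and the exact keys depend on the device-type template in use, so treat this as an illustration rather than a recipe:

```yaml
# Minimal LAVA job definition (sketch). All names and URLs are placeholders;
# the deploy/boot parameters depend on the device-type template.
device_type: example-arm64-board
job_name: boot test sketch
visibility: public
priority: medium
timeouts:
  job:
    minutes: 30

actions:
- deploy:
    to: tftp                     # artifacts are served to the bootloader over TFTP
    kernel:
      url: https://example.com/artifacts/Image
    ramdisk:
      url: https://example.com/artifacts/ramdisk.cpio.gz
      compression: gz
    dtb:
      url: https://example.com/artifacts/board.dtb

- boot:
    method: u-boot               # bootloader-specific; could be grub, depthcharge, ...
    commands: ramdisk
    prompts:
    - '/ #'                      # shell prompt LAVA waits for after boot

- test:
    definitions:
    - repository: https://example.com/lava-tests.git   # placeholder test repo
      from: git
      path: boot/smoke.yaml
      name: smoke-test
```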
[03:22.680 --> 03:31.480] For all the devices that we have in the lab we rely on TFTP, so we need network connectivity
[03:31.480 --> 03:38.960] at the bootloader level, and that means that we often have to build our custom bootloaders
[03:38.960 --> 03:43.200] and enable all the features that we need for debugging.
[03:43.200 --> 03:50.480] So there are a few steps to prepare the devices before they enter the lab.
[03:50.480 --> 03:56.520] As far as the configuration of the devices itself in LAVA goes, you only
[03:56.520 --> 04:04.480] need to define a couple of Jinja2 and YAML templates. The device type template basically defines
[04:04.480 --> 04:08.960] the characteristics of the device type, so for example which kind of bootloader runs on
[04:08.960 --> 04:14.600] a certain device, or which kind of command line options are needed for booting, while the
[04:14.600 --> 04:21.400] device dictionary defines device-specific characteristics, so for example what command
[04:21.400 --> 04:28.080] we need to run to turn the power on and off or to access the serial console. And finally
[04:28.080 --> 04:33.800] we have the health check, which is a special kind of job associated with each device type,
[04:33.800 --> 04:42.000] and the aim of the health check is to assess the health status of each device. It's supposed
[04:42.000 --> 04:47.560] to be run on a fairly regular basis; we run a health check on every device that we have
[04:47.560 --> 04:53.560] in the lab every day. Examples of tests that you can fit in a health check are, for
[04:53.560 --> 04:58.640] example, a battery test, or you can check the temperature on the device to make sure it's
[04:58.640 --> 05:04.440] not overheating, or you can check the network connectivity. Basically, all the tests that
[05:04.440 --> 05:12.040] you need to make sure that the device is functional, you can fit them in a health check, and whenever
[05:12.040 --> 05:17.840] a device fails its health check, LAVA automatically puts it offline. So it's really useful just
[05:17.840 --> 05:26.080] to shut down all the devices that are not reliable at the moment. So Collabora maintains
[05:26.080 --> 05:34.520] a laboratory running LAVA, and we have, as of a couple of days ago, 217 devices of 38 different
[05:34.520 --> 05:41.440] types, spread across 16 racks. Each rack is controlled by its own server, and that's also
[05:41.440 --> 05:47.960] where the LAVA dispatcher runs. And of course, besides all the devices, we also
[05:47.960 --> 05:54.000] have a bunch of hardware equipment that we need to automate the boot and test phases
[05:54.000 --> 06:02.800] on our devices. This is what the device distribution looked like in January: the
[06:02.800 --> 06:12.080] vast majority of our devices are x86-64 and ARM64 platforms, and we also have some QEMU instances
[06:12.080 --> 06:19.600] that are mainly used by KernelCI. The vast majority of our devices are actually
[06:19.600 --> 06:29.840] Chromebook laptops, but we also have some embedded SBC devices as well. So what kind of hardware
[06:29.840 --> 06:37.520] do we have in the lab? Different devices usually have different requirements. For
[06:37.520 --> 06:43.880] embedded SBCs, what we use to control the power on them remotely are Ethernet-controlled relays
[06:43.880 --> 06:49.040] and PDUs; I left there some examples of the actual models that we currently have in the
[06:49.040 --> 06:57.640] lab. Chromebooks are kind of a different beast.
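The power and console commands for those relays and PDUs are exactly what ends up in the device dictionary described a moment ago. As a hedged sketch, assuming a PDUDaemon-style pduclient and a Conserver console, with made-up hostnames, port numbers and template names:

```jinja
{# Device dictionary sketch for one board; extends its device-type template. #}
{% extends 'example-arm64-board.jinja2' %}

{# Commands LAVA runs to switch power on this specific device #}
{% set power_on_command = 'pduclient --daemon pdus.lab.example.com --hostname rack03-pdu --port 4 --command on' %}
{% set power_off_command = 'pduclient --daemon pdus.lab.example.com --hostname rack03-pdu --port 4 --command off' %}
{% set hard_reset_command = 'pduclient --daemon pdus.lab.example.com --hostname rack03-pdu --port 4 --command reboot' %}

{# Command used to attach to the device's serial console #}
{% set connection_command = 'console -M conserver.lab.example.com rack03-device04' %}
```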
[06:57.640 --> 07:04.120] They have their own hardware debug interface: the Servo v4 and the SuzyQ cables. The Servo v4 allows you to control the power
[07:04.120 --> 07:10.960] on the device, to access the serial consoles on the device, and it also provides network connectivity
[07:10.960 --> 07:17.600] through an Ethernet port. So everything you need to automate the boot and test phases
[07:17.600 --> 07:25.320] on a Chromebook fits inside just one hardware box. As an alternative you also have SuzyQ
[07:25.320 --> 07:32.120] cables, which have pretty much the same functionality, except for the network connectivity, which you
[07:32.120 --> 07:38.120] have to provide yourself, usually through a USB-to-Ethernet adapter.
[07:38.120 --> 07:44.120] We have a couple of servers as well in the lab, and for those we use the standard IPMI
[07:44.120 --> 07:51.800] protocol just to control the power and access the serial consoles. And for all the devices,
[07:51.800 --> 07:59.720] of course, we're going to need a bunch of USB cables, with all their fragility, and we also
[07:59.720 --> 08:08.040] use USB hubs. We find the switchable hubs, such as the YKUSH, especially useful
[08:08.040 --> 08:13.520] for those devices that are controlled by just one USB connection, such as the Chromebooks;
[08:13.520 --> 08:19.120] that's really convenient for not having to manually intervene every time you need
[08:19.120 --> 08:22.480] to re-plug the USB connection.
[08:22.480 --> 08:26.720] As for the software, I left here a couple of links if you want to check them out.
[08:26.720 --> 08:34.520] We use PDUDaemon to execute commands on the PDUs, we use Conserver to access the serial
[08:34.520 --> 08:40.480] consoles and monitor the output, and hdctools is just for the Chromebooks: these
[08:40.480 --> 08:46.360] are the software tools that allow you to interact with the Servo v4 and with the SuzyQ
[08:46.360 --> 08:52.640] cable as well, to control the power and serial on the Chromebooks.
[08:52.640 --> 08:58.200] For the interaction with LAVA, we use lavacli; it's a command line interface, and it's
[08:58.200 --> 09:06.560] useful to run the tests on the devices and also to configure and push the templates.
[09:06.560 --> 09:11.640] Finally, we also have the LAVA GitLab runner, which serves as a bridge between GitLab and
[09:11.640 --> 09:14.640] LAVA.
[09:14.640 --> 09:19.160] That's pretty much it for the software side.
[09:19.160 --> 09:27.440] In our lab we have two major users. One is KernelCI, which is focused on continuous testing
[09:27.440 --> 09:32.960] of the Linux kernel; it's not only boot tests, there are a bunch of other test suites running
[09:32.960 --> 09:40.880] on it. The type of testing that KernelCI does is post-merge, so changes are tested
[09:40.880 --> 09:46.960] after they have landed on a set of monitored trees.
[09:46.960 --> 09:55.560] After the tests have run, KernelCI will generate some build reports, as well as regression
[09:55.560 --> 10:00.320] reports for every regression that is found.
[10:00.320 --> 10:08.040] The other major player in our lab is MesaCI; that's the CI for Mesa 3D, and it does conformance
[10:08.040 --> 10:11.480] testing and also performance tracking.
[10:11.480 --> 10:17.080] There are a bunch of test suites that are currently run by MesaCI; I left the list here, a bunch
[10:17.080 --> 10:25.080] of APIs and drivers are tested. And while KernelCI only does post-merge testing, MesaCI also
[10:25.080 --> 10:31.480] does pre-merge conformance tests, so that's a little bit of both.
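As a rough illustration of the lavacli workflow mentioned above (the server URL, identity name, device names and file names are placeholders; check `lavacli --help` for the exact options):

```sh
# Store credentials for the LAVA server under an identity called "lab"
lavacli identities add --uri https://lava.example.com/RPC2 \
        --username laura --token "$LAVA_TOKEN" lab

# Push an updated device-type template and a device dictionary
lavacli -i lab device-types template set example-arm64-board template.jinja2
lavacli -i lab devices dict set rack03-device04 dict.jinja2

# Submit a job definition and follow its log output
lavacli -i lab jobs submit job.yaml
lavacli -i lab jobs logs 123456
```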
[10:31.480 --> 10:39.880] In this diagram you can see the usage of KernelCI and MesaCI in our laboratory.
[10:39.880 --> 10:48.160] As you can see, both projects keep our lab pretty busy. KernelCI uses almost all the
[10:48.160 --> 10:59.000] architectures that we have in the lab, while MesaCI is focused more on x86-64 and ARM64.
[10:59.000 --> 11:06.000] And with so many jobs running every day in our lab, of course the impact of any error
[11:06.000 --> 11:12.480] or unreliability in the infrastructure can be quite big. For pre-merge tests,
[11:12.480 --> 11:17.520] merge requests from users can get blocked, definitely if a certain device
[11:17.520 --> 11:26.440] type is not available, and there's also a risk of merge requests getting rejected
[11:26.440 --> 11:31.960] if there are many errors in the lab. So what we need to make sure from the lab perspective
[11:31.960 --> 11:40.200] is that merge requests from users get rejected only because the changes they introduced
[11:40.200 --> 11:47.040] made the tests fail, and not because of any infrastructure error. For post-merge
[11:47.040 --> 11:54.960] tests, we have a risk of reporting false regressions. In this case we want to make sure, again, that
[11:54.960 --> 12:02.640] infrastructure errors are reported as such by LAVA. LAVA defines different types
[12:02.640 --> 12:08.080] of exceptions that you can raise based on the type of error that occurs; we just need
[12:08.080 --> 12:16.520] to make sure that the devices and LAVA itself are configured properly to do so.
[12:16.520 --> 12:21.920] This is just a minimal list of the common issues that we have seen over the years. Of
[12:21.920 --> 12:27.400] course there can be other kinds of degradation: you can have faulty cables at any time, or batteries
[12:27.400 --> 12:34.920] just failing, or power chargers not working properly. All kinds of network issues can happen at any
[12:34.920 --> 12:41.440] time, and they can have quite a big impact. We also saw some issues related to the rack
[12:41.440 --> 12:48.200] setup: for example, we had some laptops where the lid was slightly too closed because
[12:48.200 --> 12:53.680] of how it was set up in the rack, and it was causing the device to enter suspend unexpectedly.
[12:53.680 --> 13:00.280] So we have all kinds of different errors that can happen. Of course we can have firmware
[13:00.280 --> 13:06.080] bugs, either in the firmware running on the actual devices, or in the firmware running on
[13:06.080 --> 13:13.520] the hardware debug interface. So that's a lot of errors that can happen. I gathered a few
[13:13.520 --> 13:21.440] of my favorite pitfalls: these are tricky issues that we have found recently, and
[13:21.440 --> 13:29.480] we're still dealing with some of those. One of the things that we saw is that sometimes
[13:29.480 --> 13:35.720] it happens that the device will just stop outputting anything on the serial console,
[13:35.720 --> 13:44.000] and if this happens during the test phase, it's kind of hard to understand in an automated
[13:44.000 --> 13:51.600] way whether the kernel is hanging, or whether your USB cable connection has dropped, or if
[13:51.600 --> 14:00.080] it's just an unreliable serial connection. So that's usually a tricky one to deal with.
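One way those error types surface in practice is through the helpers available in the LAVA test shell. A hedged sketch of a health-check step, where the checks themselves are hypothetical and `lava-test-case` and `lava-test-raise` are the LAVA helpers for recording results and flagging infrastructure errors:

```sh
#!/bin/sh
# Sketch of a health-check step run inside the LAVA test shell.

# Record an ordinary pass/fail result for network connectivity
if ping -c 3 gateway.lab.example.com > /dev/null 2>&1; then
    lava-test-case network-connectivity --result pass
else
    lava-test-case network-connectivity --result fail
fi

# A missing battery is an infrastructure problem, not a kernel bug:
# flag it as such so it is not reported as a test regression
if [ ! -d /sys/class/power_supply/BAT0 ]; then
    lava-test-raise "battery not detected on DUT"
fi
```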
[14:00.080 --> 14:08.840] Another serial-related one is caused by interference. Not all devices can have multiple UART
[14:08.840 --> 14:15.320] connections for debug; most of our devices in the lab don't, so we have to share the
[14:15.320 --> 14:21.120] same serial connection between the kernel and the test shell, and this sometimes can
[14:21.120 --> 14:28.840] cause some interference, and of course it will confuse LAVA about the outcome. So we
[14:28.840 --> 14:36.000] are thinking of many solutions to deal with this kind of serial issue,
[14:36.000 --> 14:43.960] and one approach that we are looking into is actually using a Docker container: running
[14:43.960 --> 14:51.560] a Docker container, connecting to the device over SSH and running the tests on the SSH console.
[14:51.560 --> 14:56.960] This way we can probably work around some of these serial issues.
[14:56.960 --> 15:03.760] As I said, there are also network connectivity issues from time to time, and if the network
[15:03.760 --> 15:10.840] drops during the bootloader phase, that's usually something we can easily catch, because
[15:10.840 --> 15:19.120] LAVA of course monitors the serial output, and if our bootloader is nice enough to print
[15:19.120 --> 15:24.800] error messages, we can just catch the right patterns at the right time and raise
[15:24.800 --> 15:32.240] an infrastructure error, so it won't invalidate the outcome of the test.
[15:32.240 --> 15:37.680] When this happens we can also configure LAVA to retry the job if needed. So when this happens,
[15:37.680 --> 15:42.840] it's useful to catch error patterns.
[15:42.840 --> 15:49.120] If the network decides to drop during the test phase, that's usually worse, especially for
[15:49.120 --> 15:55.520] devices that rely on a network file system, so it's usually pretty hard to recover from
[15:55.520 --> 15:56.520] this.
[15:56.520 --> 16:02.520] We have seen occasional USB disconnections for whatever reason, and again, it's hard to recover
[16:02.520 --> 16:06.960] from these kinds of issues usually.
[16:06.960 --> 16:17.040] So these are some of the best practices we came upon while working on these issues.
[16:17.040 --> 16:25.120] The first one is about writing robust health checks. As I said, devices will be put offline
[16:25.120 --> 16:29.880] by LAVA if the health check fails, so we need to make sure that the health checks catch
[16:29.880 --> 16:36.240] as many issues as possible automatically.
[16:36.240 --> 16:42.800] We found it very useful to monitor the LAVA infrastructure error exceptions, and this is mainly to catch
[16:42.800 --> 16:48.160] issues with specific racks or specific device types.
[16:48.160 --> 16:55.200] We usually try to monitor the devices' health and the job queue as well, and this
[16:55.200 --> 17:01.320] is to make sure that we have enough devices of a certain device type to feed all the pipelines
[17:01.320 --> 17:10.400] for the projects, and also so that if a certain device type goes offline and
[17:10.400 --> 17:17.160] we have redundancy, we're able to recover from that. And last but not least,
[17:17.160 --> 17:22.800] as I said, one best practice is to try and isolate the test shell output and kernel
[17:22.800 --> 17:25.160] messages whenever possible.
[17:25.160 --> 17:30.640] If not, we try to work around some of these issues.
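For the Docker-over-SSH idea above, LAVA's docker test shell is one plausible shape for it. A sketch under the assumption that the named container image has network access to the DUT and that the referenced test definition drives the device over SSH rather than the UART (image name and repository are placeholders):

```yaml
# Sketch: run the test action from a Docker container rather than the
# DUT's serial console, to avoid kernel/test-shell interference on the UART.
- test:
    docker:
      image: registry.example.com/lab/ssh-test-runner:latest
    definitions:
    - repository: https://example.com/lava-tests.git
      from: git
      path: ssh/run-over-ssh.yaml   # hypothetical definition that reaches
      name: ssh-tests               # the DUT over SSH instead of the UART
```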
[17:30.640 --> 17:36.760] So the next steps for our lab are, of course, to increase the lab capacity and try to cover even more
[17:36.760 --> 17:41.600] platforms and different vendors as soon as they come out.
[17:41.600 --> 17:46.880] While doing this, we are continuing to improve our infrastructure and monitoring tools.
[17:46.880 --> 17:52.400] I haven't included in this presentation how we actually monitor things, but LAVA
[17:52.400 --> 18:00.560] has some APIs that you can use to monitor the status of each device and also of the server.
[18:00.560 --> 18:07.160] And of course, while continuing to add new lab devices, we also want to increase
[18:07.160 --> 18:11.920] the coverage of test suites, so we're working on adding even more test suites on KernelCI
[18:11.920 --> 18:17.560] and MesaCI as well. And that's it.
[18:17.560 --> 18:45.960] If you have any questions, I think I have time, right?
[Audience question, inaudible]
[18:45.960 --> 18:58.000] Pretty often, I'd say. I don't have data at hand with the actual failures, but it
[18:58.000 --> 18:59.000] happens pretty often.
[18:59.000 --> 19:05.640] I mean, we have so many jobs running every day, and we rely heavily on USB, which is kind
[19:05.640 --> 19:13.240] of not great; it breaks pretty often. I'd say the most common issues that we have are
[19:13.240 --> 19:20.120] usually due to the serial consoles being not too reliable.
[19:20.120 --> 19:24.560] The vast majority of devices that we have are Chromebooks, and we're using these
[19:24.560 --> 19:30.160] hardware debugging interfaces that were meant for debugging, so sometimes the serial
[19:30.160 --> 19:35.440] connection is not great, and that causes all kinds of issues. So we try at least to retry
[19:35.440 --> 19:55.560] the jobs when possible and catch the infrastructure errors as they come up.
[Audience question, inaudible]
[19:55.560 --> 20:01.320] I'd say we don't have to manually intervene every day; I'd go with every couple of days
[20:01.320 --> 20:06.720] we need to maybe re-plug some of the devices, because, as I said, we use the switchable
[20:06.720 --> 20:13.240] hubs to try and avoid having to reset the connection manually, but we don't have this
[20:13.240 --> 20:20.360] setup on each and every device that we have. We're working on it, but I'd say
[20:20.360 --> 20:26.280] at least a couple of times a week. I haven't really checked the frequency of it, but
[20:26.280 --> 20:41.800] of course there are people in the lab actually taking care of all the devices.
[Audience question, inaudible]
[20:41.800 --> 20:47.440] So from the lab perspective we don't really care about the test suites running; it's more
[20:47.440 --> 20:54.840] the responsibility of KernelCI and MesaCI. You can check out the links that I left;
[20:54.840 --> 20:58.520] everything is of course open source, so you can check out all the test suites and how
[20:58.520 --> 21:06.800] they work. Some tests just need a RAM disk, while some other tests, the
[21:06.800 --> 21:29.320] heavier ones, rely on a network file system.
[Audience question, inaudible]
[21:29.320 --> 21:33.800] I mean, it depends on the type of tests that you need to run.
[21:33.800 --> 22:02.400] We use a lot of Chromebooks because we need Chromebooks, so you cannot really emulate one.
[22:02.400 --> 22:08.680] Thank you.
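On the monitoring APIs mentioned in the closing remarks: a hedged sketch of what polling lab health can look like with lavacli, reusing the `lab` identity assumed earlier. Device names are placeholders, and the exact filter options should be checked against `lavacli --help`:

```sh
# List devices with their state and health, e.g. to spot boards that
# LAVA has taken offline after a failed health check
lavacli -i lab devices list

# Inspect one device in detail
lavacli -i lab devices show rack03-device04

# Watch the job queue to see whether a device type is starving pipelines
lavacli -i lab jobs list --state SUBMITTED
```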