Alright, so that's the last session for today, from Nícolas: rethinking device support for the long term.

Okay, so hi everyone, my name is Nícolas, and today I'm going to be presenting how we can rethink device support for the long term. First of all, I work at Collabora, and what I do there is upstream kernel support for some Chromebooks and improve the coverage of KernelCI by adding new tests.

So, first of all, why is upstream support relevant? There are many reasons, and most of you probably know them very well, but basically: when you have good upstream support for a device, you can count on continuous updates, because it's easier for the OEMs developing the device. If they base their work on upstream, it's easier to continuously rebase and provide more frequent updates for the device. For the same reason, you have less of a vendor lock-in problem, because you don't need to rely on the downstream kernel for the device; you can just install the mainline kernel and be happy with it. On the OEM side there's also a lower maintenance cost, so it's a good benefit for them as well. In the end, you get a longer lifespan for the device, and that's what we've been working on with the upstream support for these Chromebooks: as we focus on upstream support, these devices stay on the market longer and get a longer life, with more updates.

But of course, there are new devices every year, so the demand isn't going to diminish. As I see it, with these devices living longer, needing more updates, and more devices out there, it becomes a problem of scale, and that's why continuous integration is important: it's how we're going to be able to automatically detect regressions and keep up with this demand. It's also important to emphasize that enabling tests early in a device's life matters, because there's a cost to enabling tests, and the earlier you do it, the more you benefit from it over the device's lifetime in catching regressions.

So, a little bit about KernelCI itself. That's why we need a CI, and in the case of the kernel, we have KernelCI. The main instance is at linux.kernelci.org and tests the upstream kernel, but there are also other instances, like the one at chromeos.kernelci.org, which runs the Tast tests, not only on the downstream ChromeOS kernel but also on the upstream kernel.
So, the way KernelCI works is basically this pipeline: several Git branches from Git trees are monitored, and when a new revision is found, jobs are triggered that build the artifacts, meaning the kernel, the modules, the device trees, and the rootfs for the tests. Once you have that, the test itself is queued to run on a device in a LAVA lab, with pointers to those artifacts. After the test runs on the device, the results are pushed back to the KernelCI dashboard, where they're available for anyone to check out, and if any regressions are detected, they're reported to the KernelCI results mailing list.

KernelCI can be configured through several YAML files, and for most people that's the part you're most interested in. For instance, there's a build configuration where you set the branches and trees that will be tracked, the config fragments that will be used as part of the kernel build, compiler versions, and so on. Some maintainers, for instance, have a for-kernelci branch on their tree and register it in KernelCI, so that as they receive patches they can merge them into that for-kernelci branch and have them run through KernelCI; that way they can validate the changes before actually merging them into their main branches. There's also a lab configuration for the labs themselves, the LAVA labs that have all these devices on racks running the tests; there are currently around 11 of them. You can also set filters there, to choose which tests you do or don't want to run in your lab. There's a rootfs configuration where you define which rootfs will be used for the tests. You can add a custom rootfs there, which involves setting the base OS itself (the Debian version and so on), the architecture, the packages that will be installed, scripts (because you might want to do some extra tweaking before you run the tests, maybe compile something from source), and filesystem overlays if you need some special file in your rootfs. Finally, there's the actual test configuration, where you define the tests themselves: the test plans, which rootfs a test needs to run, which LAVA job template (the actual file that gets submitted to LAVA to run the job), any parameters that might be needed, and the device types, which are the actual devices the tests get run on. There are currently around 208 device types in KernelCI.
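To give a rough idea of what these YAML files contain, here is an illustrative sketch. The file layout, section names, and field names are simplified approximations rather than the exact kernelci-core schema, and the specific entry names (the fragment, rootfs, and plan names) are made up for this example:

```yaml
# Illustrative sketch only: field names are approximations, entry names are made up.

# Build configuration: which tree and branch to track, and how to build it.
build_configs:
  mainline:
    tree: mainline                      # a 'trees' section maps this name to a git URL
    branch: 'master'
    fragments: [example-fragment]       # hypothetical config fragment

# Rootfs configuration: base OS, architectures, extra packages, overlays.
rootfs_configs:
  debian-bullseye-example:              # hypothetical rootfs name
    rootfs_type: debos
    debian_release: bullseye
    arch_list: [arm64, amd64]
    extra_packages: [alsa-utils]

# Test configuration: test plan, the rootfs it needs, and the LAVA job template.
test_plans:
  baseline-example:                     # hypothetical plan name
    rootfs: debian-bullseye-example
    pattern: 'baseline/generic.jinja2'  # illustrative template path
```

Because all of this is declarative, each of those pieces is a small, reviewable change to a configuration file.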
So basically, it's simple for anyone interested in improving KernelCI's coverage to add a new test or a new rootfs, and if you have devices of your own that you want to run tests on, you can add your own lab.

These are the tests that are currently available in KernelCI. The baseline tests are very interesting ones: they're simple tests, but they do a lot. They use the bootrr suite, which has a generic part and a machine-specific part, and the point of the test is to make sure the basics are there. The generic part checks things like no devices being left deferring probe, so the machine is actually probing everything it should. The machine-specific part checks whether all the devices and drivers you expect to be there are actually present and up. Beyond that, we have lots of other tests: several kselftests, LTP, the decoder conformance tests run with Fluster, which verify that the output of a frame decoded by the hardware decoder matches what it should be, IGT, v4l2-compliance, libcamera, one I added for the ChromeOS embedded controller (cros-ec), the sleep test that exercises suspend and resume, and so on. So that's about it for the basics of KernelCI that I wanted to share.

During my work at Collabora, I was upstreaming support for the Acer Chromebook 514 (CB514), which uses the MT8192 SoC from MediaTek and the Asurada baseboard. While upstreaming support for this machine I obviously had to test all of its components, and if I found any issues I would fix them and send the fixes upstream. But since the kernel is a moving target, I had to redo this manually, constantly, on every rebase, and I did detect several issues during the upstreaming process. Of course, there are tests in KernelCI that could have caught these automatically, without me doing it all by hand while upstreaming. To give some examples, these are the commits that a colleague and I sent upstream to fix the issues. The first two commits fixed issues that prevented the display from probing at all, on the machine I worked on and on a few other MediaTek machines. That kind of issue, the display not working or not probing, can easily be detected by a baseline test, or by an IGT KMS test, which would fail if the display isn't probed.
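To make that concrete, here is a minimal sketch of what such a probe check can boil down to. This is not the actual bootrr implementation used by the baseline tests, just an illustration: it verifies that nothing is stuck in deferred probe (via debugfs, which needs to be mounted and usually requires root) and that a list of expected devices has a driver bound. The device paths in the list are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Illustrative baseline-style probe check (not the actual bootrr code)."""
import sys
from pathlib import Path

# Hypothetical list of sysfs device paths this machine is expected to probe.
EXPECTED_DEVICES = [
    "/sys/bus/platform/devices/14000000.example-display",  # placeholder
    "/sys/bus/i2c/devices/5-0010",                          # placeholder
]

def deferred_probe_pending() -> list[str]:
    """Devices still waiting on deferred probe, read from debugfs."""
    deferred = Path("/sys/kernel/debug/devices_deferred")
    if not deferred.exists():
        return []
    return [line.split()[0] for line in deferred.read_text().splitlines() if line.strip()]

def driver_bound(device: str) -> bool:
    """A device that probed successfully has a 'driver' symlink in sysfs."""
    return (Path(device) / "driver").exists()

def main() -> int:
    failures = [f"deferred probe pending: {dev}" for dev in deferred_probe_pending()]
    failures += [f"no driver bound: {dev}" for dev in EXPECTED_DEVICES if not driver_bound(dev)]
    for failure in failures:
        print(f"FAIL: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

A check like this is cheap to run on every kernel revision, which is why a display or encoder that silently stops probing is such a good fit for the baseline tests.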
Another issue was that the call to disable vblank had been moved to the wrong place, so the vblank interrupt was being disabled earlier than it should be, and that caused warnings during suspend; that can be detected by the sleep test, which does a suspend-resume cycle. There was also an issue where the encoders stopped probing because of platform_get_resource being deprecated on this platform; that too is something a baseline test could have detected.

So, where are we now for this machine in KernelCI? I worked on enabling the configs: I added a config fragment in KernelCI to get all devices probing and working on this machine, and also upstreamed that config, which is already queued for the next release. We also enabled the baseline tests I talked about, so the baseline tests for this machine are already running in KernelCI, and I added the device probe checks for the machine itself to baseline. But there's still a lot left to enable, like the kselftests, which I'm working on right now. This machine has more complex audio controls, so the UCM configuration needs to be applied before the tests are run, so that all the audio paths can actually be tested; a sketch of that step follows below. There are also other tests, like cros-ec, the camera, v4l2-compliance, and IGT KMS, to name a few. And there are some tests, like the decoder conformance one, that can't be enabled yet because the upstream support hasn't quite landed: it depends on the hardware decoder being described in the device tree, and that patch is still being reviewed on the mailing list. The same goes for tests that depend on cpufreq, since the cpufreq node isn't there yet, and GPU support also isn't there for this machine yet. Once those components are enabled upstream, the tests can be enabled, so we can catch those issues when they happen again.

This is a screenshot of the KernelCI dashboard with the results for this machine. As you can see, a few tests are failing, because when I added the baseline test I already added checks for all the devices, including the ones that haven't made it upstream yet. The one failing there, for example, is the external display for this machine, and it isn't probing because the upstream support isn't enabled yet. As that support gets merged, those tests will start passing, and if they ever start failing again, we'll quickly notice that something broke.
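For that audio case, here is a minimal sketch of what applying the UCM configuration before running the tests can look like. It assumes the board's ALSA UCM profile is installed; the card name is a placeholder (use whatever `aplay -l` reports for the machine), and the test-runner step is just a stub:

```python
#!/usr/bin/env python3
"""Illustrative: apply an ALSA UCM verb before running audio tests."""
import subprocess

CARD = "example-mt8192-card"  # placeholder card name; use the real one from `aplay -l`
VERB = "HiFi"                 # a commonly used UCM verb name

def apply_ucm(card: str, verb: str) -> None:
    # alsaucm applies the mixer/routing setup from the board's UCM profile,
    # so the audio paths under test are actually enabled before testing.
    subprocess.run(["alsaucm", "-c", card, "set", "_verb", verb], check=True)

def run_audio_tests() -> None:
    # Stub: invoke the real audio test suite (e.g. the relevant kselftests) here.
    print("run the audio tests here")

if __name__ == "__main__":
    apply_ucm(CARD, VERB)
    run_audio_tests()
```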
So, where can we grow KernelCI from here? I think more subsystems definitely need more coverage; IIO and input, for example, are subsystems for which I didn't find any tests to run. As we increase coverage of the subsystems, we'll start catching these issues; without the tests themselves, we can't detect the issues when they happen. And while the baseline tests already help with the base support, we really need subsystem-specific tests to detect issues in the actual usage of the hardware components. Besides that, more trees from maintainers: if there are any maintainers in this room who would like to have a branch tested by KernelCI, please open a request in the KernelCI repository or get in touch, so we can start testing more of the trees and catch issues earlier. Also, more labs, of course. We rely on the labs with the devices in them to run the tests, so the more labs we have, the better, with more diversity of devices. Maybe you're interested in some device that isn't present in any lab; in that case, it would be great if you could set up your own LAVA lab with the devices you're interested in and hook it up to KernelCI, so that you benefit from the tests that are already there and they get run in your lab, on the devices you care about. We could definitely also have more kselftests and LTP tests added to KernelCI, and support for KUnit: as it grows in the kernel, it would be great to be able to run those tests too.

So basically, I think there's still a lot to be gained from the open source model we have in the kernel, which we're all pretty familiar with on the development side, where everybody does a little bit of the work and everybody gets to benefit from it. We're only starting to see that on the Linux testing side of things. As we keep increasing the coverage of branches, of the code base, and of devices, everybody will start to benefit from it, and this use of KernelCI will allow us to cope with the quantity of devices out there, respond quickly to regressions as they happen, and, because of that, give actually reliable long-term support to all these many devices. Everybody will benefit from that. And that's about it for the presentation, if you have any questions. Thank you.

Are there any questions? Seems everybody is eager to get out. So, thank you for the talk, and thank you all for being here. This was the first iteration of the kernel dev room; we hope to make it a regular thing at FOSDEM, so spread the word.
We could have used a lot more submissions than we had, although we had a lot of great talks. I really want to thank my co-organizers; we organized this together, three of us: Stéphane Graber from Canonical, the LXD team leader, and Daniel Borkmann from Isovalent, one of the eBPF upstream maintainers; and I'm Christian Brauner. Thanks everyone.