Alright, so that's the last session for today, from Nícolas: rethinking device support for the long term.

Okay, so hi everyone, my name is Nícolas, and today I'm going to be presenting how we can rethink device support for the long term. First of all, I work at Collabora, and what I do there is upstream kernel support for some Chromebooks and improve the coverage of KernelCI by adding new tests.

So, first of all, why is upstream support relevant? There are many reasons, and most of you probably know them very well, but basically: when you have good upstream support for a device, you can count on continuous updates, because it's easier for the OEMs developing the device. If they base their work on upstream, it's easier to continuously rebase and provide more frequent updates for the device. For the same reason, you have less of a vendor lock-in problem, because you don't need to rely on the downstream kernel for the device; you can just install the mainline kernel and be happy with it. On the OEM side there's also a lower maintenance cost, so it's a good benefit for them as well. In the end, you get a longer lifespan for the device, and that's what we've been working on with the upstream support for these Chromebooks: as we focus on upstream support, these devices stay on the market longer and get a longer life, with more updates.

But of course, there are new devices every year, so the demand isn't going to diminish. As I see it, with these devices living longer, needing more updates, and more devices out there, it becomes a problem of scale, and that's why continuous integration is important: it's how we're going to be able to automatically detect regressions and keep up with this demand. It's also important to emphasize that enabling tests early in a device's life matters, because there's a cost to enabling tests, and the earlier you do it, the more you benefit from it over the device's lifetime in catching regressions.

So, a little bit about KernelCI itself. That's why we need a CI, and in the case of the kernel, we have KernelCI. The main instance is at linux.kernelci.org and tests the upstream kernel, but there are also other instances, like the one at chromeos.kernelci.org, which runs the Tast tests, not only on the downstream ChromeOS kernel but also on the upstream kernel.
So, the way KernelCI works is basically this pipeline: several Git branches from Git trees are monitored, and when a new revision is found, jobs are triggered that build the artifacts, meaning the kernel, the modules, the device trees, and the rootfs for the tests. Once you have that, the test itself is queued to run on a device in a LAVA lab, with pointers to those artifacts. After the test runs on the device, the results are pushed back to the KernelCI dashboard, where they're available for anyone to check out, and if any regressions are detected, they're reported to the KernelCI results mailing list.

KernelCI can be configured through several YAML files, and for most people that's the part you're most interested in. For instance, there's a build configuration where you set the branches and trees that will be tracked, the config fragments that will be used as part of the kernel build, compiler versions, and so on. Some maintainers, for instance, have a for-kernelci branch on their tree and register it in KernelCI, so that as they receive patches they can merge them into that for-kernelci branch and have them run through KernelCI; that way they can validate the changes before actually merging them into their main branches. There's also a lab configuration for the labs themselves, the LAVA labs that have all these devices on racks running the tests; there are currently around 11 of them. You can also set filters there, to choose which tests you do or don't want to run in your lab. There's a rootfs configuration where you define which rootfs will be used for the tests. You can add a custom rootfs there, which involves setting the base OS itself (the Debian version and so on), the architecture, the packages that will be installed, scripts (because you might want to do some extra tweaking before you run the tests, maybe compile something from source), and filesystem overlays if you need some special file in your rootfs. Finally, there's the actual test configuration, where you define the tests themselves: the test plans, which rootfs a test needs to run, which LAVA job template (the actual file that gets submitted to LAVA to run the job), any parameters that might be needed, and the device types, which are the actual devices the tests get run on. There are currently around 208 device types in KernelCI.
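To give a rough idea of what these YAML files contain, here is an illustrative sketch. The file layout, section names, and field names are simplified approximations rather than the exact kernelci-core schema, and the specific entry names (the fragment, rootfs, and plan names) are made up for this example:

```yaml
# Illustrative sketch only: field names are approximations, entry names are made up.

# Build configuration: which tree and branch to track, and how to build it.
build_configs:
  mainline:
    tree: mainline                      # a 'trees' section maps this name to a git URL
    branch: 'master'
    fragments: [example-fragment]       # hypothetical config fragment

# Rootfs configuration: base OS, architectures, extra packages, overlays.
rootfs_configs:
  debian-bullseye-example:              # hypothetical rootfs name
    rootfs_type: debos
    debian_release: bullseye
    arch_list: [arm64, amd64]
    extra_packages: [alsa-utils]

# Test configuration: test plan, the rootfs it needs, and the LAVA job template.
test_plans:
  baseline-example:                     # hypothetical plan name
    rootfs: debian-bullseye-example
    pattern: 'baseline/generic.jinja2'  # illustrative template path
```

Because all of this is declarative, each of those pieces is a small, reviewable change to a configuration file.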
So basically, it's simple for anyone interested in improving KernelCI's coverage to add a new test or a new rootfs, and if you have devices of your own that you want to run tests on, you can add your own lab.

These are the tests that are currently available in KernelCI. The baseline tests are very interesting ones: they're simple tests, but they do a lot. They use the bootrr suite, which has a generic part and a machine-specific part, and the point of the test is to make sure the basics are there. The generic part checks things like no devices being left deferring probe, so the machine is actually probing everything it should. The machine-specific part checks whether all the devices and drivers you expect to be there are actually present and up. Beyond that, we have lots of other tests: several kselftests, LTP, the decoder conformance tests run with Fluster, which verify that the output of a frame decoded by the hardware decoder matches what it should be, IGT, v4l2-compliance, libcamera, one I added for the ChromeOS embedded controller (cros-ec), the sleep test that exercises suspend and resume, and so on. So that's about it for the basics of KernelCI that I wanted to share.

During my work at Collabora, I was upstreaming support for the Acer Chromebook 514 (CB514), which uses the MT8192 SoC from MediaTek and the Asurada baseboard. While upstreaming support for this machine I obviously had to test all of its components, and if I found any issues I would fix them and send the fixes upstream. But since the kernel is a moving target, I had to redo this manually, constantly, on every rebase, and I did detect several issues during the upstreaming process. Of course, there are tests in KernelCI that could have caught these automatically, without me doing it all by hand while upstreaming. To give some examples, these are the commits that a colleague and I sent upstream to fix the issues. The first two commits fixed issues that prevented the display from probing at all, on the machine I worked on and on a few other MediaTek machines. That kind of issue, the display not working or not probing, can easily be detected by a baseline test, or by an IGT KMS test, which would fail if the display isn't probed.
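To make that concrete, here is a minimal sketch of what such a probe check can boil down to. This is not the actual bootrr implementation used by the baseline tests, just an illustration: it verifies that nothing is stuck in deferred probe (via debugfs, which needs to be mounted and usually requires root) and that a list of expected devices has a driver bound. The device paths in the list are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Illustrative baseline-style probe check (not the actual bootrr code)."""
import sys
from pathlib import Path

# Hypothetical list of sysfs device paths this machine is expected to probe.
EXPECTED_DEVICES = [
    "/sys/bus/platform/devices/14000000.example-display",  # placeholder
    "/sys/bus/i2c/devices/5-0010",                          # placeholder
]

def deferred_probe_pending() -> list[str]:
    """Devices still waiting on deferred probe, read from debugfs."""
    deferred = Path("/sys/kernel/debug/devices_deferred")
    if not deferred.exists():
        return []
    return [line.split()[0] for line in deferred.read_text().splitlines() if line.strip()]

def driver_bound(device: str) -> bool:
    """A device that probed successfully has a 'driver' symlink in sysfs."""
    return (Path(device) / "driver").exists()

def main() -> int:
    failures = [f"deferred probe pending: {dev}" for dev in deferred_probe_pending()]
    failures += [f"no driver bound: {dev}" for dev in EXPECTED_DEVICES if not driver_bound(dev)]
    for failure in failures:
        print(f"FAIL: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

A check like this is cheap to run on every kernel revision, which is why a display or encoder that silently stops probing is such a good fit for the baseline tests.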
Another issue was that the call to disable vblank had been moved to the wrong place, so the vblank interrupt was being disabled earlier than it should be, and that caused warnings during suspend; that can be detected by the sleep test, which does a suspend-resume cycle. There was also an issue where the encoders stopped probing because of platform_get_resource being deprecated on this platform; that too is something a baseline test could have detected.

So, where are we now for this machine in KernelCI? I worked on enabling the configs: I added a config fragment in KernelCI to get all devices probing and working on this machine, and also upstreamed that config, which is already queued for the next release. We also enabled the baseline tests I talked about, so the baseline tests for this machine are already running in KernelCI, and I added the device probe checks for the machine itself to baseline. But there's still a lot left to enable, like the kselftests, which I'm working on right now. This machine has more complex audio controls, so the UCM configuration needs to be applied before the tests are run, so that all the audio paths can actually be tested; a sketch of that step follows below. There are also other tests, like cros-ec, the camera, v4l2-compliance, and IGT KMS, to name a few. And there are some tests, like the decoder conformance one, that can't be enabled yet because the upstream support hasn't quite landed: it depends on the hardware decoder being described in the device tree, and that patch is still being reviewed on the mailing list. The same goes for tests that depend on cpufreq, since the cpufreq node isn't there yet, and GPU support also isn't there for this machine yet. Once those components are enabled upstream, the tests can be enabled, so we can catch those issues when they happen again.

This is a screenshot of the KernelCI dashboard with the results for this machine. As you can see, a few tests are failing, because when I added the baseline test I already added checks for all the devices, including the ones that haven't made it upstream yet. The one failing there, for example, is the external display for this machine, and it isn't probing because the upstream support isn't enabled yet. As that support gets merged, those tests will start passing, and if they ever start failing again, we'll quickly notice that something broke.
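For that audio case, here is a minimal sketch of what applying the UCM configuration before running the tests can look like. It assumes the board's ALSA UCM profile is installed; the card name is a placeholder (use whatever `aplay -l` reports for the machine), and the test-runner step is just a stub:

```python
#!/usr/bin/env python3
"""Illustrative: apply an ALSA UCM verb before running audio tests."""
import subprocess

CARD = "example-mt8192-card"  # placeholder card name; use the real one from `aplay -l`
VERB = "HiFi"                 # a commonly used UCM verb name

def apply_ucm(card: str, verb: str) -> None:
    # alsaucm applies the mixer/routing setup from the board's UCM profile,
    # so the audio paths under test are actually enabled before testing.
    subprocess.run(["alsaucm", "-c", card, "set", "_verb", verb], check=True)

def run_audio_tests() -> None:
    # Stub: invoke the real audio test suite (e.g. the relevant kselftests) here.
    print("run the audio tests here")

if __name__ == "__main__":
    apply_ucm(CARD, VERB)
    run_audio_tests()
```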
So, where can we grow KernelCI from here? I think more subsystems definitely need more coverage; IIO and input, for example, are subsystems for which I didn't find any tests to run. As we increase coverage of the subsystems, we'll start catching these issues; without the tests themselves, we can't detect the issues when they happen. And while the baseline tests already help with the base support, we really need subsystem-specific tests to detect issues in the actual usage of the hardware components. Besides that, more trees from maintainers: if there are any maintainers in this room who would like to have a branch tested by KernelCI, please open a request in the KernelCI repository or get in touch, so we can start testing more of the trees and catch issues earlier. Also, more labs, of course. We rely on the labs with the devices in them to run the tests, so the more labs we have, the better, with more diversity of devices. Maybe you're interested in some device that isn't present in any lab; in that case, it would be great if you could set up your own LAVA lab with the devices you're interested in and hook it up to KernelCI, so that you benefit from the tests that are already there and they get run in your lab, on the devices you care about. We could definitely also have more kselftests and LTP tests added to KernelCI, and support for KUnit: as it grows in the kernel, it would be great to be able to run those tests too.

So basically, I think there's still a lot to be gained from the open source model we have in the kernel, which we're all pretty familiar with on the development side, where everybody does a little bit of the work and everybody gets to benefit from it. We're only starting to see that on the Linux testing side of things. As we keep increasing the coverage of branches, of the code base, and of devices, everybody will start to benefit from it, and this use of KernelCI will allow us to cope with the quantity of devices out there, respond quickly to regressions as they happen, and, because of that, give actually reliable long-term support to all these many devices. Everybody will benefit from that. And that's about it for the presentation, if you have any questions. Thank you.

Are there any questions? Seems everybody is eager to get out. So, thank you for the talk, and thank you all for being here. This was the first iteration of the kernel dev room; we hope to make it a regular thing at FOSDEM, so spread the word.
We could have used a lot more submissions than we had, although we had a lot of great talks. I really want to thank my co-organizers; we organized this together, three of us: Stéphane Graber from Canonical, the LXD team leader, and Daniel Borkmann from Isovalent, one of the eBPF upstream maintainers; and I'm Christian Brauner. Thanks everyone.