So we're ready for our next talk, so we need to be a little bit more quiet.
Ahmad is going to be talking about Linux matchmaking,
helping devices and drivers find each other.
Okay, thank you.
Yeah, my name is Ahmad Fatoum,
and I will talk to you about how devices and drivers find each other.
I am an embedded Linux developer with Pengutronix.
We do kernel and bootloader porting, system integration, graphics stuff.
And a thing we often do is update kernels or bring up kernels on new devices
and run into the problems that happen when we do that.
Often we have some kernel patches because not everything is upstream yet.
Sometimes we have multiple topic branches.
We use a tool called umpf for that, but in the end we have a Git tree that we build.
Maybe if we do a kernel update we have an old config,
so we run make oldconfig to bring the old config to the new version.
Then we type make to build the kernel.
Then we deploy the kernel somehow and sometimes it works
and sometimes it doesn't and the kernel hangs at boot.
And then you need to debug why that happens.
So if you are doing a kernel update and you have a known good,
don't waste time, just do a git bisect if you have network boot
or something quick to test new versions.
And when you have the commits that cause the regression,
you can reach out to the author or you can discuss it on the mailing list
or you can read it and try to understand why this caused a problem for you.
Maybe it causes problems for others too or maybe it's just a problem in your configuration.
But if you are doing a new kernel bring-up or you are moving from a heavily diverged kernel,
for example a vendor fork, you often don't have a known good
that you could easily bisect between.
And yes, that's what my talk is about.
How would you debug early driver issues?
Here is an example breakage that a colleague ran into.
So he updated the kernel.
He ran make olddefconfig or menuconfig.
So make oldconfig will prompt you for all the new
configuration symbols.
Do you need that driver? Do you need that driver?
There are a lot of them, and if you do olddefconfig,
you are not prompted, it just takes the default.
And yeah, after he did that the kernel no longer booted
or some device was not functional.
And with a git bisect eventually you would have found this commit to be the cause.
And what this commit is doing is renaming a symbol.
So previously we had MFD_RK808.
This is a driver for the power management IC on that board
and that was renamed to MFD_RK8XX, which makes sense
because as you see the driver supports 805, 808, 809 and so on.
And so it's a bit confusing to have the driver called 808
when it supports much more than that.
But the problem is that Kconfig doesn't track such renames.
From Kconfig's point of view,
your old config has an RK808 that doesn't exist anymore.
So it's deleted.
And the new kernel tree has an RK8XX that's off by default.
So if you just do menuconfig this driver will get lost.
And because that's a power management IC, everything depends on it.
So if you need to drive a USB stick and output 5 volts,
it has a dependency on that PMIC.
If you want to use a higher speed mode on your SD card
and need to lower the voltage, it has a dependency on that PMIC.
And all these drivers that have a dependency on that one driver
will fail to probe and you might not even be able to boot your system.
And yes, you need to debug that somehow.
If you know the culprit commit like here, the resolution is evident.
You just open make menuconfig, you enable that one config option
and then you are on your merry way and you can probe your driver
and everything should work as before.
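One way to catch this kind of silent loss before booting is to compare the symbols in the old and the new .config. The following is a minimal Python sketch, not a kernel tool: the function names are made up for illustration, and the two sample configs are illustrative stand-ins rather than real kernel configurations. The kernel tree's own scripts/diffconfig does a more complete comparison.

```python
import re

# A .config line is either "CONFIG_FOO=value" or "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {symbol: value}; explicitly unset symbols map to 'n'."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            symbols[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            symbols[m.group(1)] = "n"
    return symbols


def dropped_symbols(old_text, new_text):
    """Symbols that were enabled (y or m) before and no longer exist at all."""
    old, new = parse_config(old_text), parse_config(new_text)
    return sorted(s for s, v in old.items() if v in ("y", "m") and s not in new)


# Illustrative sample data only (assumption, not taken from a real board).
OLD = "CONFIG_REGULATOR=y\nCONFIG_MFD_RK808=y\n"
NEW = "CONFIG_REGULATOR=y\n"

print(dropped_symbols(OLD, NEW))
```

A symbol that shows up in this list was enabled before and has vanished, which is exactly the situation where a renamed option silently falls back to its default.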
But what I want to talk about is what to do if you don't know what the problem is.
You have a system that's stuck at boot or you have reached user space
and graphics don't work and you want to know what happened.
There can always be, of course, kernel bugs, but a lot of people use the kernel
and if you run into an issue and you are not really the first one to try it out,
it might be something in your configuration, maybe just a driver that's not enabled
or maybe something that's specific to your board.
And to be able to debug that, you will need some insight about how Linux
does this early driver initialization step.
So I will start with that and then talk a bit about the problems that can happen that early.
So the Linux device driver model is what matches devices with drivers.
We have three main abstractions.
One is the bus type, which is what sits between your device driver executing on your CPU
and the device that you want to talk to.
A bus type can be something like PCI or USB or MDIO or something like that.
So it's meant to reflect an actual hardware bus and that's the software representation of it.
Then you have the struct device_driver, which is a data structure
that has the entry points into your driver.
And then you have the struct device, which is what the driver operates on.
So you can have the same driver operating on multiple devices because for example,
my laptop here has two USB controllers, one on that side and one on that side
and one of them has also internal connections.
So these are completely identical, but they are just at different places on the bus.
You can see that here on that diagram.
You have one PCI bus, there are three devices on it.
Then you have a second PCI bus that hangs off the first one through a PCI bus bridge.
And each of these devices optimally has a driver that will try to bind to it.
And this is done by a series of function pointers.
There are a lot of entry points for a device driver and so on,
but the three function pointers that are interesting to us are, first, match for the bus type.
This takes as arguments the device driver and the device itself
and has to determine in a bus-specific manner if they are compatible or not.
So for PCI, each device has a vendor ID and a device ID.
If you are a vendor, the PCI-SIG will give you a vendor ID
if you pay them, and then you can assign device IDs and then you can write drivers,
or people can write drivers for your hardware, that match on exactly that
vendor ID and device ID.
And here, once a bus determines a match,
it will return some positive value, a one, to the driver core,
and then the device driver probe will commence.
So the probe will take the device, will try to determine
if that device is really what it's after, initialize it,
and then you have a remove step to undo that probe.
The struct device also has function pointers,
but these are not interesting to us that early in the initialization stage.
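The interplay of match, probe and the ID pairs can be modeled in a few lines. This is a user-space Python sketch of the idea, not kernel code: the class names, the driver name and the vendor/device IDs (0x1234, 0x0001, 0x0002) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Driver:
    name: str
    id_table: list              # (vendor, device) pairs this driver handles

    def probe(self, dev):
        return 0                # 0 means success, as in the kernel

    def remove(self, dev):
        pass                    # would undo whatever probe set up


@dataclass
class Device:
    name: str
    vendor: int
    device: int
    driver: object = None       # set once a driver has bound


class Bus:
    def __init__(self):
        self.devices, self.drivers = [], []

    def match(self, dev, drv):
        # Bus-specific compatibility check: here, a PCI-style ID lookup.
        return (dev.vendor, dev.device) in drv.id_table

    def _try_bind(self, dev, drv):
        if dev.driver is None and self.match(dev, drv):
            if drv.probe(dev) == 0:
                dev.driver = drv

    def add_device(self, dev):
        self.devices.append(dev)
        for drv in self.drivers:
            self._try_bind(dev, drv)

    def add_driver(self, drv):
        self.drivers.append(drv)
        for dev in self.devices:
            self._try_bind(dev, drv)


bus = Bus()
bus.add_device(Device("0000:00:14.0", 0x1234, 0x0001))
bus.add_device(Device("0000:03:00.0", 0x1234, 0x0001))  # identical controller
bus.add_device(Device("0000:00:1f.3", 0x1234, 0x0002))  # nothing matches this
bus.add_driver(Driver("example_usb_hcd", [(0x1234, 0x0001)]))

for dev in bus.devices:
    print(dev.name, "->", dev.driver.name if dev.driver else "no driver")
```

The two identical controllers end up bound to the same driver object, which mirrors the point that one driver can operate on several devices that only differ in where they sit on the bus.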
Here is a short code example of how it looks for PCI.
So you have a struct pci_dev.
This inherits struct device by embedding it and then adds PCI-specific stuff.
So it has the vendor ID, 16 bits, it has the device ID, it has resources,
so PCI devices have I/O ports or memory-mapped regions that are associated with them.
That's all in struct pci_dev.
Then you have a struct pci_driver that similarly extends struct device_driver.
And additionally, it has this ID table, which is a list of all vendor and device pairs
that are matched by that driver.
You see that structure at the end.
And you can also add to it a driver_data unsigned long
where you can encode stuff so you have a...
...
Usually you have the platform bus, which is...
It is the catch-all bus on systems-on-chip, and it does a lot
because the bus itself doesn't do anything for you.
The bus itself is usually just memory mapped and it doesn't tell you what device is where.
It's not like a PCI bus where you can actually ask the bus to enumerate the devices and report them.
You have to have some sort of hardware description that tells you what is where
so you can actually use it.
What that hardware description looks like is on the right.
You will have this compatible, which tells you I have a device that's compatible with that.
It has that address and it has these resources, these interrupts, these clocks and so on.
And then the platform bus will take this description and try to match it with drivers
that also list these compatibles.
But that's just one thing the platform bus can do.
It can also match on ACPI.
It can match on strings.
It can match on the driver name itself.
It's where you throw everything in that doesn't have a proper bus.
That's a platform bus.
And yeah, once any bus, platform bus, PCI bus and so on finds a match,
it will call the driver probe function.
So the name of the function is a bit of a historical artifact.
Normally if you have already done the match, the device will probably be what you expect.
So you don't really need to probe the device if it's really yours.
But in the past, for example, with the Super I/O chips on x86,
they usually had the same I/O port.
And if you wrote the values appropriate to one of them
into the registers of another, you could break it.
So they had schemes where you need to enter a password into a magic register two times.
And then you read from another register whether it's really the device that you expect.
And if it's not, you can return an error code, -ENODEV or -ENXIO,
no such device or address.
And the driver core will try something else, but nowadays it's usually a bug if you return -ENODEV.
Usually you don't intend to return -ENODEV, but some other return value.
So the return values that are relevant: either the driver is happy,
it has got a device, it claims its resources, it returns zero
after registering it with some other kernel framework. For example, if it's an Ethernet interface,
it will call register_netdev to register a network interface
that can later be used to interface with the device.
And that's the uninteresting case because everything works here.
The interesting case is when it returns an error code.
And that's usually what happens when your boot is stuck.
Barring kernel bugs, of course, if your boot is stuck,
it's usually some driver that didn't want to probe.
And that's usually because of device dependencies, because a device is just one little part of a function,
but each device, especially on a system-on-chip, has a lot of dependencies on other devices.
And yeah, if you have like eight dependencies, maybe one of these dependencies is missing
and that propagates up and kills the possibility to probe anything that depends on it.
This was the case with that PMIC example that I had in an earlier slide
because that was not available. Everything that depended on it, for example,
USB or SD card controller, didn't manage to probe because the dependency couldn't be satisfied.
And there are a whole lot of these dependencies and they are used at very different places.
So there are generic dependencies like pin control.
A chip usually has more functions than pins available,
so it needs to mux the pins into the correct states.
The generic driver core will do that for you.
Also the DMA configuration.
Then the platform devices will also do initial clock assignment
if you need to ramp up the rate of a clock or you need to reparent a clock.
If you have a power domain, which are like power islands inside the chip,
it needs to be powered on so you can actually talk to the device.
And then there is a whole lot of stuff that's device specific
that is inside the probe function of your driver, which is requesting clocks,
multiple power domains, GPIOs, resets, PHYs,
or the supplies from the PMIC that we just saw in that problem.
And the problem is when the device driver probes, it expects these resources to be available.
And if they are not available, it just can't progress.
So if it's like a reset line, you really expect that reset line to get the device out of reset.
Often you can't usefully continue without having that resource.
So what the kernel tries to do is to probe the dependencies that you require first
before probing something that comes later.
So we want the PMIC to probe very early and then later on we want USB to probe, for example.
How that used to be done was statically in the build system using initcalls and Makefile ordering.
So I don't know if it's a bit too small.
No, it's okay.
So yeah, there are initcalls, which are the different stages that the kernel will run its initialization code from,
and these are synchronized using these sync stages between them.
And if you do something in a subsystem initcall,
you know that when something in a device initcall runs, it will be available.
And so you can place dependencies in a subsystem initcall, for example.
But the kernel uses a lot of these initcalls for itself.
There are not enough to represent everything the kernel needs.
So what was done instead was that the order in the Makefiles was used.
So the kernel build will walk all directories, collect object files,
and the order in which the object files are collected is the order in which they land in the linker list
of initcalls, and that's the order in which drivers are registered,
and if the devices are available, that's the order in which the devices are probed.
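The two-level ordering described here, first by initcall stage and then by link order within a stage, can be sketched as a stable sort. This is a simplified Python model under stated assumptions: the level list is a subset of the kernel's initcall levels, and all entry names are hypothetical.

```python
# A subset of the initcall levels, in the order the kernel runs them.
LEVELS = ["core", "postcore", "arch", "subsys", "fs", "device", "late"]


def registration_order(initcalls):
    """initcalls: list of (name, level) in link order, i.e. Makefile order.

    Returns the names in the order they would run: by level first and,
    within one level, in link order (Python's sort is stable).
    """
    ordered = sorted(initcalls, key=lambda ic: LEVELS.index(ic[1]))
    return [name for name, _level in ordered]


# Hypothetical link order: the power domain driver is linked before the
# regulator driver it depends on, and both use the same initcall level.
LINK_ORDER = [
    ("pmdomain_driver", "device"),
    ("usb_driver", "device"),
    ("regulator_driver", "device"),
    ("some_framework", "subsys"),
]

print(registration_order(LINK_ORDER))
```

Moving something to an earlier level pulls it to the front, but entries that share a level stay in link order, so the power domain driver in this model is still registered before the regulator driver it needs.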
And so you have Makefiles that still have comments like: regulators early,
since some subsystems rely on them. Or DMA:
DMA is very important, so do that extra early.
But that of course breaks down once you have a dependency that goes the other way.
Like here is a power domain driver that requires a power supply, requires a regulator,
and power domains are added before regulators.
So in this case it breaks down.
You can't have one order that is okay for everyone.
You can do that on a simple microcontroller or on a board that you know beforehand.
Yeah, I could do it in that order, but on any more complex system you will maybe even have cycles,
or you will have stuff that you can't generically say because you can have plug-and-play and so on.
And you can't have one fixed order that always works.
So you need a system that doesn't only try to avoid the problem of requiring a resource
that's not available. The kernel, since 2011 or something,
also detects that this happens and tries to re-probe at a later time.
This is done with a special return value.
It's called EPROBE_DEFER.
It has a value of 517, so a probe returns minus 517.
It's interesting because all other return values are smaller numbers
because they start counting from EPERM at minus one and then go to maybe, I don't know, minus 100 something.
But minus EPROBE_DEFER is over 500, so you can more easily spot it.
And this is never reported to user space.
It's only for internal use by the kernel and it's what a driver uses to tell the driver core,
please try me at a later time.
The driver core will go through the exact same motions.
It will clean up resources, but instead of marking that the driver probe has failed
and there is nothing we can do to fix it, it will add it to a list of deferred driver probes that will be retried later.
So if another device succeeds to probe,
then the kernel can try again to see if now every requirement for that device is there, so it can attempt a reprobe.
And once it runs through the whole list and no new devices appear and no devices in that list manage to probe,
it knows nothing more will happen.
And in that case, yeah, you can't boot, but in the case that you have drivers that will bind later
or you have maybe a cyclic dependency, this will help you because the stuff is being retried.
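The retry behaviour can be illustrated with a small simulation. This is a simplified Python model, not the driver core: it retries the deferred list in whole passes, whereas the real kernel retriggers deferred probes whenever a device binds, and the device names and dependency table are hypothetical.

```python
EPROBE_DEFER = 517  # kernel-internal error number mentioned in the talk


def probe(dev, deps, bound):
    """Return 0 on success, or -EPROBE_DEFER if a dependency is not bound."""
    missing = [d for d in deps.get(dev, []) if d not in bound]
    return -EPROBE_DEFER if missing else 0


def probe_all(devices, deps):
    """Probe devices in the given order and retry deferred ones until a
    whole pass makes no progress. Returns (bound, still_deferred, attempts)."""
    bound, pending, attempts = [], list(devices), 0
    progress = True
    while pending and progress:
        progress = False
        deferred = []
        for dev in pending:
            attempts += 1
            if probe(dev, deps, bound) == 0:
                bound.append(dev)
                progress = True
            else:
                deferred.append(dev)   # goes onto the deferred list
        pending = deferred
    return bound, pending, attempts


# Hypothetical board: USB and the SD card controller both need the PMIC.
DEPS = {"usb": ["pmic"], "sdhci": ["pmic"]}

print(probe_all(["usb", "sdhci", "pmic"], DEPS))  # PMIC driver available
print(probe_all(["usb", "sdhci"], DEPS))          # PMIC driver missing
```

In the first call everything eventually binds despite the unfavourable initial order. In the second call the list never shrinks, which corresponds to the situation where the kernel gives up and reports the devices that stayed deferred.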
And here's a small example of how that looks in driver code.
This one is getting an interconnect for USB: you say get me a regulator or get me an interconnect.
And you check if it's an error and then if it's an error, you return it.
So if it's an error that you can't recover from, you return it and then the driver core knows, okay, that won't work.
I won't try this again.
If it's an EPROBE_DEFER error, it will just be propagated and the driver core will try at a later time, and you are responsible for cleanup.
So the driver author must take care to clean up all resources so the driver probe can be attempted again.
What you often see is this check that the error code is not EPROBE_DEFER.
That's because EPROBE_DEFER is an expected result.
So if you had a driver that couldn't get its resource the first time and it said it couldn't get the interconnect
and then it worked on the second try, it would confuse a lot of people if you had error messages that are not really errors.
So often you check for EPROBE_DEFER and only write the error message if it's actually an error and not EPROBE_DEFER.
And you can see all this deferred stuff under /sys.
There is a file there, /sys/kernel/debug/devices_deferred, if you have debugfs support enabled, and it will list all devices whose probe is not done yet.
And in the case that you don't reach a shell because you have some dependency of your root file system missing,
after 10 seconds, if you have CONFIG_MODULES, the kernel will time out and print all devices that it couldn't satisfy,
that it couldn't probe because of missing dependencies.
And then here in this case I am missing an interconnect driver and everything that depends on that interconnect driver,
the power domains, the USB PHYs, the USB itself, will defer the probe.
But you don't actually know why it deferred the probe.
If you look at the slide before that, we had an error message that says couldn't get interconnect,
but we didn't want to print it the first time because we didn't know, maybe in the future it will be a probe that we can satisfy.
But you do want to print it on the last EPROBE_DEFER, when the kernel gives up.
But because we check here for EPROBE_DEFER, that error message is lost,
which is why for a few years now we have dev_err_probe as a function in the kernel.
It takes a device, an error code and an error message.
And if the error code is equal to EPROBE_DEFER, it will just store the message in the device.
And if it's not equal to EPROBE_DEFER, it will just print it directly.
And with that you actually get the reason why the deferred probe has happened.
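The store-or-print behaviour just described can be modeled as follows. This is a user-space Python sketch of the idea and not the kernel's dev_err_probe(): the Device class, the log list passed in as an argument and the message text are assumptions made for illustration. Like the kernel helper, the model returns the error code it was given so the caller can return it directly.

```python
EPROBE_DEFER = 517
EINVAL = 22


class Device:
    def __init__(self, name):
        self.name = name
        self.deferred_reason = None   # what a deferred-devices listing would show


def dev_err_probe(dev, err, msg, log):
    """Model of the helper: remember the reason on deferral, log otherwise."""
    if err == -EPROBE_DEFER:
        dev.deferred_reason = msg     # kept quiet, but not lost
    else:
        log.append(f"{dev.name}: error {err}: {msg}")
    return err                        # lets callers write: return dev_err_probe(...)


log = []
usb = Device("usb")

ret = dev_err_probe(usb, -EPROBE_DEFER, "failed to get interconnect", log)
print(ret, usb.deferred_reason, log)

ret = dev_err_probe(usb, -EINVAL, "invalid configuration", log)
print(ret, log)
```

The design point is that the decision whether a message is noise or a real error is moved out of every driver and into one helper, and the deferral reason survives until somebody asks for it.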
So here you see how it looks in debugfs.
It will tell you the block control (blk-ctrl) is not ready.
And then if you look for the block control, it will tell you failed to get noc entries.
I wouldn't know what to do with that error message, but at least I can search for it.
And then I will see in the kernel source, okay, it tries to get an ICC.
Yeah, I will search what's an ICC and then I will see it's an interconnect.
And then I will look in the device tree, see there is an interconnect.
And then I know, oh, maybe I need to enable an interconnect driver.
And this would be a lot more cumbersome if I didn't have that information.
And 6.8 is the kernel release that's currently being stabilized.
And since 6.8-rc1, these reasons are also printed to the kernel log.
So in the case that your system doesn't manage to boot, you get the same error messages.
Before that, yeah, you had to start an initrd or something and mount debugfs there.
But you don't have to do that anymore.
And yeah, if you have dev_err_probe, that's an easy thing.
If you don't have dev_err_probe, you need to trace a bit and try to find out what's the last call that's failing.
I pasted here some of the stuff that I add to the kernel command line from the bootloader.
I add some kernel options to try to zero in on what the problem is.
earlycon is a useful thing because many drivers have a separate earlycon implementation for outputting a character.
And that can be used even before the normal kernel driver is initialized.
And you want to use that because a serial driver, while it sounds very simple,
has resets as dependencies.
It has clocks as dependencies.
It might even have a power domain as a dependency.
And you don't want to wait for all that to initialize before you can actually see something on the console.
And earlycon sidesteps that because, with the assumption that the bootloader has set up your serial console,
the kernel can just keep using that already set-up serial port.
And yeah, later on when a real console with the driver model is registered, it can take over.
So add earlycon and set stdout-path in your device tree so the kernel knows what console to use earlycon on.
Then with ignore_loglevel you will get a lot of output when you enable debug stuff, but it's better than no output at all.
So I just say ignore_loglevel and then filter it on my side. With initcall_debug,
you can print out the initcalls as they happen.
This is useful if the kernel gets stuck.
Maybe then you can see what is the last initcall that the kernel called, or the last few initcalls if it's done in a multi-threaded manner.
With dynamic debugging, if your kernel is compiled with CONFIG_DYNAMIC_DEBUG,
your kernel will have all the debug strings built in, but it won't output them by default.
But with dyndbg, you can enable them, for example, for a file or for a line or for a module.
Here it takes dd.c, that's a device driver model file.
That's the main file that matches devices to drivers in the kernel, and it will also print debug messages like deferring a probe and so on.
So you can see how often the probe is deferred if that's useful to you.
And the +p means print.
So yeah, this line will, if you have CONFIG_DYNAMIC_DEBUG enabled, print out debug information from your driver core.
And for good measure, I always add clk_ignore_unused and pd_ignore_unused when I am debugging, because when the kernel has finished starting up,
it will disable clocks and power domains that it thinks it doesn't need anymore.
But if you happen to need them, the system will hang because there are still devices that require them.
So yeah, it's not really related to the other stuff, but because it's debug stuff that I copy and paste, I add that too.
Of course, you will want to remove it later once you have found where your bug is.
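Put together, the options from this part of the talk form a fragment that can be appended to the existing kernel command line. The sketch below only assembles that string in Python; the helper name and the base command line are placeholders, and the exact set of arguments on the speaker's slide is not reproduced in this transcript, so the list is reconstructed from what was said.

```python
# Debug options named in the talk, in the order they were discussed.
DEBUG_ARGS = [
    "earlycon",                 # early console, needs stdout-path in the device tree
    "ignore_loglevel",          # print everything, filter on the host side
    "initcall_debug",           # log initcalls as they run
    'dyndbg="file dd.c +p"',    # driver core debug messages, needs CONFIG_DYNAMIC_DEBUG
    "clk_ignore_unused",        # keep unused clocks on while debugging
    "pd_ignore_unused",         # keep unused power domains on while debugging
]


def debug_cmdline(base):
    """Append the debug options to an existing command line string."""
    return " ".join([base.strip(), *DEBUG_ARGS]).strip()


# Placeholder base command line (assumption, adapt to your board).
print(debug_cmdline("root=/dev/mmcblk0p2 rootwait"))
```

How the resulting string reaches the kernel depends on the bootloader, so that step is left out here.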
And then for that particular problem of understanding why that probe has deferred, you can use ftrace.
So the function graph tracer will print, assuming of course you have it enabled in the kernel and so on,
and you have enabled boot-time tracing.
It's a separate option.
Yeah, if you have that all enabled, you can say ftrace=function_graph on the kernel command line and that will enable the function graph tracer very early.
You can set an ftrace_graph_filter, which is a function during which ftrace should trace, and you can set a maximum depth.
And what the kernel will do is that once you enter this probe function, it will print a line to the trace buffer.
And if you enter a child function, it will add some indentation and print the next one.
And if it returns, it will go back out, and so on.
And then you have the flow of how the kernel walks through the probe function in front of you.
I limited it here to a depth of three, because usually the functions that you are interested in, the ones that claim resources, are at a depth of three or less.
You can increase it as you like, of course.
And then you can check out what was the last thing that the kernel called that might have failed.
So if you see at the very end, okay, it tries to get a GPIO and after that you only see cleanup,
then yes, it's probably the GPIO that's missing.
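The three boot-time tracing parameters mentioned above can be assembled the same way. In this Python sketch the helper name and the probe function name my_driver_probe are hypothetical; substitute the probe function of the driver you are investigating.

```python
def ftrace_probe_args(probe_function, max_depth=3):
    """Command-line fragment that traces one probe function at boot."""
    return " ".join([
        "ftrace=function_graph",                    # enable the tracer early
        f"ftrace_graph_filter={probe_function}",    # only trace inside this function
        f"ftrace_graph_max_depth={max_depth}",      # limit the call depth shown
    ])


# my_driver_probe is a made-up name used only for illustration.
print(ftrace_probe_args("my_driver_probe"))
```

A depth of three follows the speaker's rule of thumb; raising it produces more output per probe attempt.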
I wanted to automate this a bit because one nice thing would be to get the error code out.
So if you could just see which ones return an error code that is EPROBE_DEFER, you could just look for that and you don't have to look at what could plausibly be the cause.
This seems to be possible with bootconfig.
There is an ftrace function graph retval option that records return values, but you need to use bootconfig for that.
And I haven't had the chance to use bootconfig so far.
I tried yesterday for the first time.
And what I initially wanted to do, I wanted to dump the ftrace buffer during boot time.
So if the kernel gets stuck, I can't access tracefs to output my trace buffer to find out what was the last function that was called.
Yeah, I can help myself by adding an initrd.
I have an initrd for arm64 that has a small shell that I can use.
And then from the small shell in the initrd I can mount tracefs and read it out.
I would like something that works out of the box, where the kernel, if it can't continue to boot, would just dump out why it couldn't continue to boot.
It currently will dump out this deferred probe information, but if you don't have this dev_err_probe call, you won't know the reason why it didn't probe.
And so ftrace could help.
But I haven't managed to get it working.
There is an ongoing discussion about that.
And once I get it working, or someone tells me how I can get it working, that might be one of you too, you can reply to that mailing list thread and others will know too.
I will for sure try a bit on the train on the way back.
Yeah, and, ah, fw_devlink is something I wanted to talk a little bit about too.
The problem with EPROBE_DEFER is that it works, but you retry probes a lot of times when you don't really need to.
In the worst case, the next device that you want to probe is the last one in the list.
So you walk the whole list, you try all probes, and all will again return EPROBE_DEFER until you reach the last one, then you probe that, and then you start the list again.
And so on, you keep walking the whole list because the one you are interested in is at the very end.
And you could do better than that if the kernel could take note of the dependencies.
And yeah, if the kernel could take note of the dependencies, like read them out of the device tree, it could order probes, and that's what fw_devlink is doing.
And yeah, it doesn't replace EPROBE_DEFER completely, but it minimizes it a lot, which should improve your boot-up time because you don't need to redo probes so often.
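The saving from knowing the dependencies up front can be shown with a small comparison. This Python sketch is a model of the idea, not of the fw_devlink implementation: it uses the standard library's graphlib (Python 3.9 or newer) to order a hypothetical dependency chain and counts probe attempts with a simplified pass-based retry loop.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency chain: USB needs its PHY, the PHY needs the PMIC.
DEPS = {"usb": {"phy"}, "phy": {"pmic"}, "pmic": set()}


def probe_order(deps):
    """Order devices so that every dependency comes before its consumer."""
    return list(TopologicalSorter(deps).static_order())


def attempts_with_retry(order, deps):
    """Count probe attempts when deferred devices are retried pass by pass."""
    bound, pending, attempts = set(), list(order), 0
    while pending:
        still_deferred = []
        for dev in pending:
            attempts += 1
            if deps[dev] <= bound:          # all dependencies already bound
                bound.add(dev)
            else:
                still_deferred.append(dev)  # would return -EPROBE_DEFER
        if len(still_deferred) == len(pending):
            break                           # no progress, give up
        pending = still_deferred
    return attempts


worst_case = ["usb", "phy", "pmic"]         # consumers first, supplier last
print(attempts_with_retry(worst_case, DEPS))
print(probe_order(DEPS), attempts_with_retry(probe_order(DEPS), DEPS))
```

In the worst-case order the three devices need six probe attempts, while the dependency-sorted order needs three. The gap grows with longer chains, which is the boot-time cost the speaker refers to.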
And that's it.
So if you want to take one thing with you from this talk: if you debug such an issue and you find a place where you could add dev_err_probe to make the life of other people after you easier, please do so.
So the world will be a bit better after that.
Thank you for listening.
We have time for maybe one or two questions.
Hi, Ahmad. Thank you for your talk. I just figured out you can use magic SysRq to print the trace buffer.
Oh, okay. Yeah, that would be a way. If you add the magic SysRq, you can do that over the serial console too, then you could ask for the trace buffer to be dumped.
I was thinking about triggering an oops maybe, but that sounds a bit less severe.