Good afternoon, everyone. Next we have Aaron, speaking about documenting and fixing non-reproducible builds due to configuration options.

Thanks. Hello, everybody. My name is Aaron. I'm a PhD student at the University of Rennes, doing research in the DiverSE software engineering research team of Inria and IRISA, in Rennes, France. Today I'm going to talk about reproducible builds and software configurations.

So what is a reproducible build? I took this definition from the paper "Reproducible Builds: Increasing the Integrity of Software Supply Chains". It says that the build process of a software product is reproducible when, given a specific version of the source code and all of its dependencies, every build produces bit-by-bit identical artifacts, no matter the environment. And I think that last point is really important. To achieve reproducible builds, there is a set of guidelines on the Reproducible Builds website, such as how to have deterministic build systems, what not to ship in the binary, or even how to distribute an environment, which environment variables to set, and so on.

So let's take an example with Linux. I go into the source tree I've downloaded and I generate a configuration for the kernel; in this case I generated a tiny configuration. Then I build it, and once the build is done I have a binary called vmlinux that I keep in /tmp. Then I clean everything up and reproduce the process. Running tinyconfig twice produces the same configuration. Now, if I want to compare the products of these two builds, I run diffoscope, which is a tool provided by the Reproducible Builds initiative. What happened? Just because I built the two binaries a few seconds apart, I get two binaries that are different, not bit-by-bit identical.

So I follow the guidelines: I can set some environment variables of the build system, in this case Kbuild. I can give a fixed date, for instance the 1st of January of this year, and now I get a bit-by-bit identical binary.

The question is: in Linux we have a whole set of different configurations. We have the default configurations per architecture, allyesconfig, allmodconfig and so on, and especially randconfig, which sets configuration options randomly. So do I just need to fix all of the reproducibility issues of Linux with this one trick? We can look in the documentation. The Kbuild trick is of course written in the documentation, but the documentation also emphasizes configuration options; here we have six of them. Just as a reminder, in the kernel you can set options to yes, no, or module, to ship them or not. So here we have a list of six configuration options. But is that all? In the latest version of the kernel, I think there are more than 19,000 configuration options. Are there really only six configuration options that have an impact on the reproducibility of the kernel among all of these?

To answer this question, we basically take a brute-force approach. We generate a set of random configurations, as you can see here on the left, then we build them in the same environment: we have a fixed Dockerfile, and each build runs in a freshly built container. Then we compare the binaries. We don't compare all of the intermediate files of the build, just the final binary.
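(For readers following along, here is a minimal sketch of this double-build check, not the speaker's exact commands. It assumes it is run from the root of a Linux source tree with a working toolchain; KBUILD_BUILD_TIMESTAMP is the documented Kbuild environment variable the talk refers to, and the date value is just an example.)

```python
# Build the kernel twice from a tinyconfig and check whether vmlinux
# is bit-for-bit identical.
import hashlib
import os
import subprocess

def build_once(extra_env=None):
    env = dict(os.environ, **(extra_env or {}))
    subprocess.run(["make", "mrproper"], check=True, env=env)      # clean everything up
    subprocess.run(["make", "tinyconfig"], check=True, env=env)    # same .config every time
    subprocess.run(["make", "-j", str(os.cpu_count() or 1)], check=True, env=env)
    with open("vmlinux", "rb") as f:                               # hash the final binary
        return hashlib.sha256(f.read()).hexdigest()

# Two builds a few seconds apart: without pinning the timestamp they differ.
print(build_once() == build_once())          # expected: False

# Pinning the Kbuild timestamp makes the two builds match.
fixed = {"KBUILD_BUILD_TIMESTAMP": "2024-01-01"}
print(build_once(fixed) == build_once(fixed))  # expected: True
```

In the talk the actual comparison is done with diffoscope, which also explains where the binaries differ; a hash comparison only tells you whether they differ.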
Then you simply do a diff on the binaries and get all the results, as you can see here. There's a way to encode the configurations in a tabular representation: we have one row per configuration, with all the configuration options as columns, where zero means no, one means yes (enabled), and module gets its own value where it exists. Then we take all this data, put it into a classification algorithm, and we get the outlier options that are responsible for the non-reproducibility. From that list we have an exploration phase, which I will explain a little later, where we enrich the list we got from the classification algorithm. Then we have a fix phase, and the idea is, if the options are indeed responsible for the non-reproducibility, to add them to the documentation.

So this is the setup. We have 2,000 configurations for each system we study: the Linux kernel, but also BusyBox and toybox. We generate random configurations, with a preset for x86_64 for the kernel. For the environment, we derive from the TuxMake image, and we fix all of the environment variables so they don't vary during the build, like the timestamp and so on.

Here is one of our first results: for Linux, 47% of the builds were non-reproducible. For BusyBox we have two cases. In the first case we didn't vary the environment, that is, the build path; in the second case we varied the build path. We wanted to showcase that there is an interaction between two layers: the configuration and the build path. To solve it, you can choose either to fix the build path or to disable the debug configuration option; it's up to you. But if we enable the debug configuration option and vary the build path between two builds, we get 49% of non-reproducible builds. And toybox was 100% reproducible in our study.

So now, who is to blame? Let's look at the Linux case. Here we have an example of the decision tree we got from the process, with five configuration options. What we do is we don't keep the tree structure: the structure says that if I disable MODULE_SIG_SHA1, the next responsible option is GCOV_PROFILE_FTRACE, and so on. Instead we flatten everything and consider each configuration option as independent. So we have this list of five configuration options. For MODULE_SIG_SHA1, a similar configuration option is already in the documentation, but the rest of them are not in the Linux documentation.

Now we have an exploration phase, where the main idea is that we want to identify all the options of the same kind. In the documentation, we saw that we had some configuration options about module signing, MODULE_SIG_ALL, MODULE_SIG and so on. The idea here is to identify the siblings of an option: if I disable one option, there is another alternative of the same kind, and we explore all the alternatives. A great example is MODULE_SIG_SHA1: if I disable it, I have to enable SHA224 or SHA256 and so on. Once we have all of the siblings, we use the naming convention in Kconfig to get the parent, so we know that if I want to disable this specific option, I have to disable its parent. And now, on to fixing each configuration option.
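(A small sketch of the classification step described above. It assumes the build results have been collected into a CSV where each row is one random configuration, with one column per option encoded as 0 = no, 1 = yes, 2 = module, plus a "reproducible" label; the file name, column names, and the choice of scikit-learn's decision tree are illustrative, not the study's exact artifacts.)

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("builds.csv")               # one row per built configuration
X = df.drop(columns=["reproducible"])        # option values: 0 = n, 1 = y, 2 = m
y = df["reproducible"]                       # label: was the build reproducible?

tree = DecisionTreeClassifier().fit(X, y)

# The options the tree splits on are the candidate culprits; as in the talk,
# the tree is then flattened and each option is treated independently.
suspects = sorted({X.columns[i] for i in tree.tree_.feature if i >= 0})
print(suspects)
```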
So the idea is to remove all of the detected configuration options from the initial configuration. That is sometimes a hard task in the Linux kernel, because we have to get all of the dependencies of the configuration options. To detect the dependencies of each configuration option we want to modify, we use a tool called ConfigFix, a SAT-based solver presented in detail in this paper here. It gives a list of conditions to satisfy, that is, the configuration options and the values they must take in order to apply a change. Once the change is applied, the change being simply setting the option to no in the configuration, we build again and check for reproducibility (a rough sketch of this loop is given at the end).

From the list we got, we were able to make 96% of the non-reproducible builds reproducible. We had 31 configurations, so 3.5%, that were still not reproducible due to some dependency we couldn't identify; that's one of the limits of the approach. And for less than 0.5%, the tool we used couldn't find a diagnosis. But compared to the first result I showed, we went from 47% non-reproducibility down to 1%.

So now the summary. I think one of the takeaways is that options matter: we should explore more the impact of configuration options on the reproducibility of builds. The second takeaway is that there can be interactions across variability layers, as I showed for BusyBox; we also need to detect them, pinpoint them, and describe them precisely in the documentation. We have identified more configuration options that could be added to the documentation, so we'll send a patch soon. And we could remove some of them: 96% of non-reproducible builds were made reproducible. If you want more detail on the whole approach and the rest, it will be presented at the Mining Software Repositories conference, an academic conference that will happen in Portugal in April. And thank you for your attention.
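(A rough sketch of the fix loop referenced above, under the assumption that a suspect option is simply set to "n" in the .config and Kconfig is then asked to re-resolve the remaining constraints with "make olddefconfig"; the study itself computes the required dependency changes with ConfigFix, whose interface is not shown here, and the option names below are only illustrative.)

```python
import re
import subprocess

def disable_options(config_path, options):
    """Turn 'CONFIG_FOO=y/m' lines into '# CONFIG_FOO is not set'."""
    with open(config_path) as f:
        text = f.read()
    for opt in options:
        text = re.sub(rf"^{opt}=.*$", f"# {opt} is not set", text, flags=re.M)
    with open(config_path, "w") as f:
        f.write(text)

# Options flagged by the classification/exploration phases (illustrative names).
suspects = ["CONFIG_MODULE_SIG_SHA1", "CONFIG_GCOV_PROFILE_FTRACE"]
disable_options(".config", suspects)
subprocess.run(["make", "olddefconfig"], check=True)  # let Kconfig settle dependencies
# ...then rebuild twice, as in the first sketch, and compare the resulting vmlinux.
```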