Alright, hi, my name is Joshua Watt and I'm going to talk to you today about our migration from SPDX 2 to SPDX 3 in the Yocto Project. I have a lot to get through, so I'm going to go really fast.

A little bit about me: I've been working at Garmin since 2009, making primarily embedded Linux systems that run on boats, so that's exciting. We've been using OpenEmbedded and the Yocto Project since about 2016 to do that. I'm a member of the OpenEmbedded Technical Steering Committee, and there are all the different ways you can contact me if you're interested in doing that. So I primarily work on embedded and do some of the SPDX stuff on the side.

If you're not familiar with OpenEmbedded and the Yocto Project: OpenEmbedded is a community-driven project primarily focused on generating software for embedded systems. OpenEmbedded itself provides the core layers and the build system that you use to build embedded systems. Then there's the Yocto Project, which is a Linux Foundation project. It provides a reference distribution built on top of OpenEmbedded, runs a whole bunch of QA tests to make sure the project stays high quality, maintains the release schedule, provides funding, and writes really good documentation. So if you're ever confused about the difference between the two, that summarizes it a little bit.

I'm going to give you a very, very brief overview of how Yocto works. I do not have time to go into this in depth; I've given other talks about this, so if you're curious you can watch those and they go into much more detail. Basically, the way it works is that when users want to build something for an embedded system (it's not exclusively embedded, you can do other stuff too), they have source code they want to use, they have some metadata that says how to build that source code, and they've got some policy information. They shovel all of this into this magical tool we call BitBake, which does a whole bunch of stuff for them: compiles things, configures things, and so on. It spits out this thing we call a target image, you put that on your widget, and profit, right? At a very high level, that's what the build flow looks like. The important thing to note here is that we're building all of the native tools, which we then use to build whatever goes on your actual target. So we have a very comprehensive supply chain built into the way embedded builds work in the first place.

The way we generate SPDX is that as we go through this process, we're actually generating SPDX documents at each step along the way. Then at the end we have a final SPDX deliverable, which is basically all of those documents combined together, and that describes whatever target image you built.

This is what our SPDX 2 model looks like today. It's pretty complex and pretty comprehensive. We have a lot of really interesting things in here that I don't have time to get into, but you can see that we're generating a lot of relationships and things like that which are really useful.

So, a couple of problems we ran into with SPDX 2. One of them is that in our output we have this concept of a recipe element, and it's a little bit strange, because it's not actually describing a thing that exists, per se. It's really describing a process that happened, which is: we built software.
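To make that mismatch concrete, here's a hypothetical, heavily simplified sketch (in Python, not our actual implementation) of a per-recipe SPDX 2 style document: the recipe ends up shoehorned into a Package element even though what it really represents is the act of building. The names, versions, and namespace are made up for illustration.

```python
import json

# Hypothetical, heavily simplified per-recipe SPDX 2 style document.
# Field names follow the SPDX 2 JSON serialization, but the content is
# illustrative only, not Yocto's actual output.
recipe_doc = {
    "spdxVersion": "SPDX-2.2",
    "SPDXID": "SPDXRef-DOCUMENT",
    "name": "recipe-busybox",
    "documentNamespace": "http://spdx.example.com/recipe-busybox",  # made-up namespace
    "packages": [
        {
            # The "recipe" has to be expressed as a Package, even though it
            # really describes a process (building busybox), not an artifact.
            "SPDXID": "SPDXRef-Recipe-busybox",
            "name": "busybox",
            "versionInfo": "1.36.1",
            "downloadLocation": "NOASSERTION",
        },
        {
            # An actual artifact the build produced.
            "SPDXID": "SPDXRef-Package-busybox-bin",
            "name": "busybox-binary",
            "downloadLocation": "NOASSERTION",
        },
    ],
    "relationships": [
        {
            # About the best SPDX 2 lets us say: the binary was generated
            # from the "recipe" package.
            "spdxElementId": "SPDXRef-Package-busybox-bin",
            "relationshipType": "GENERATED_FROM",
            "relatedSpdxElement": "SPDXRef-Recipe-busybox",
        }
    ],
}

with open("recipe-busybox.spdx.json", "w") as f:
    json.dump(recipe_doc, f, indent=2)
```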
A process like that is something SPDX 2 doesn't really have a great way of describing, because it only has a concept of packages, not processes. SPDX 3 has made this much better by adding the concept of a build element, which describes a process that happened at a given point in time, and so we've transitioned to using that.

This is kind of what the build element looks like in SPDX 3. You can describe the inputs and the outputs of a build element, so the things it took in and the things it spits out. You can also have more abstract dependencies between build elements themselves, where you can say this step depends on that step. The other way to show dependencies is that the outputs of one step can be the inputs of another step. So builds are really useful for describing how data has flowed through your system and tracking how it's been changed along the way.

You can also do nested builds, and this is really useful for us because we have a top-level command called bitbake that the user actually invokes, but that then goes off and does hundreds of individual build steps. We can track that by using the ancestorOf relationship in SPDX 3 to say these builds were invoked as sub-builds, or whatever you want to call them, of the parent bitbake command the user actually typed in.

There's some other information associated with a build. There's host information, so you can say exactly where a build was done, like it was done on this VM or on this machine; if you have a complete SPDX document that describes that host, you can link it in here. You can also say who invoked the build, using two relationships. There's invokedBy, which is the user or agent that actually did the build, and there's delegatedTo, which is the user or agent that wanted the build done. So this would be the difference between, say, clicking the build button in GitHub: the user who wanted the build done is the person who clicked the button, and the agent that actually did it was GitHub. You can track both of those things, which is really useful.
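Here's a rough sketch of what that could look like for a nested bitbake build, written as plain Python dictionaries and dumped as JSON. The type and property names only approximate the SPDX 3 JSON-LD serialization (mandatory fields like creationInfo are omitted, and all IDs and build types are invented), so treat it as an illustration of the structure rather than the exact format.

```python
import json

# Rough sketch of SPDX 3 build elements and relationships for a nested
# bitbake build. Names approximate the SPDX 3 JSON-LD serialization;
# check the spec for the authoritative form.
BASE = "http://spdx.example.com"  # hypothetical namespace for the IDs

elements = [
    {
        # The top-level build: the bitbake command the user actually ran.
        "type": "build_Build",
        "spdxId": f"{BASE}/build/bitbake-run",
        "build_buildType": f"{BASE}/buildtypes/bitbake",
    },
    {
        # One of the hundreds of individual task builds bitbake invokes.
        "type": "build_Build",
        "spdxId": f"{BASE}/build/do_compile-busybox",
        "build_buildType": f"{BASE}/buildtypes/bitbake-task",
    },
    {
        # The top-level build is an ancestor of the per-task build.
        "type": "Relationship",
        "spdxId": f"{BASE}/rel/1",
        "relationshipType": "ancestorOf",
        "from": f"{BASE}/build/bitbake-run",
        "to": [f"{BASE}/build/do_compile-busybox"],
    },
    {
        # The task consumed the unpacked sources as an input...
        "type": "Relationship",
        "spdxId": f"{BASE}/rel/2",
        "relationshipType": "hasInput",
        "from": f"{BASE}/build/do_compile-busybox",
        "to": [f"{BASE}/file/busybox-source"],
    },
    {
        # ...and produced a binary, which another build step could in turn
        # list as one of its inputs.
        "type": "Relationship",
        "spdxId": f"{BASE}/rel/3",
        "relationshipType": "hasOutput",
        "from": f"{BASE}/build/do_compile-busybox",
        "to": [f"{BASE}/file/busybox-binary"],
    },
    {
        # The agent that actually performed the build (e.g. a CI system)...
        "type": "Relationship",
        "spdxId": f"{BASE}/rel/4",
        "relationshipType": "invokedBy",
        "from": f"{BASE}/build/bitbake-run",
        "to": [f"{BASE}/agent/ci-system"],
    },
    {
        # ...and the person who wanted the build done.
        "type": "Relationship",
        "spdxId": f"{BASE}/rel/5",
        "relationshipType": "delegatedTo",
        "from": f"{BASE}/build/bitbake-run",
        "to": [f"{BASE}/agent/developer"],
    },
]

print(json.dumps(elements, indent=2))
```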
Another problem we had, particularly with the way we generate our documents, is that SPDX IDs in SPDX 2 are really only valid within the context of a document, which is fine, but it means you can only reference an SPDX ID by referencing the document that contains it and then the ID within that document. And when you reference a document, you have to include its checksum. I completely understand why it was done this way; it makes a lot of sense. But it's really hard when you're doing things the way we were, generating all of these documents as we go along in the build, because once you have to reference a document and include its checksum, you can't ever go back and change a document you've already written. If you do, you've completely invalidated everything: all of your links, all of your SPDX IDs. If you ever change any document you've done before, they're no longer valid. In the wider ecosystem I understand why that's done, but it made things really hard for us.

Because of that, it's also very difficult to merge documents together. It's not impossible, just really difficult, because the SPDX IDs are scoped to the document they're contained in, and so when you merge documents together there might be duplicates and collisions, and then you have to go find all of them and fix them. So we just kind of gave up, stuck all of the documents we produced in a tarball, and called it good, because we couldn't figure out a better way to do it.

SPDX 3 fixes this by using linked data, where objects have a globally unique ID that can be referenced from anywhere. It's actually mandatory if the element can be referenced, because there's no other way to reference it. This makes things a lot easier, because you don't have to worry about ID conflicts; you can just say, I'm going to reference this ID, and it exists somewhere, and we don't have to try to figure all of that out ourselves. It also makes merging a lot easier, because the IDs aren't namespaced to a document; they're globally unique. So with SPDX 3 we can actually generate a single JSON-LD document at the end that contains everything, instead of having to do it with the tarball.

The third problem we had with SPDX 2, which will hopefully be a lot better here, is that validation was really hard, partly because, for some reason, we were putting all of our documents in a tarball, and people don't know how to feed a tarball full of SPDX documents into a validator. And the data was just huge, so a lot of the early validation tools for SPDX 2 in particular just couldn't handle the sheer amount of data we were producing. For a root filesystem image that we would generate, there was 100 megabytes of data for an embedded system, and we were generating 150 to 200 megabytes of SPDX. So there was more SPDX data than actual data; it turns out compilers are really good at compressing things. So we had to hand-validate most of our SPDX output, or at least parts of it, just to see whether it was even valid.

SPDX 3 actually has a formal SHACL model, which helps a ton with this, because there are tools that will just take the SHACL model, take your document, and tell you whether it's valid or not. I don't think it covers 100% of the constraints, but it's miles better than what we had for SPDX 2. And as a bonus, we're working on a process to automatically generate language bindings from that SHACL model, to make it easier for people to write tools.
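As a quick sketch of what that enables: with a SHACL shapes file for the SPDX 3 model, an off-the-shelf tool like pyshacl can check a generated JSON-LD document. The file names here are assumptions (you'd point it at your own SBOM and at the shapes file published with the SPDX 3 spec), and this isn't the project's actual validation step.

```python
# Minimal sketch: check an SPDX 3 JSON-LD document against the SHACL model.
from rdflib import Graph
from pyshacl import validate

# The SBOM we generated, serialized as JSON-LD (assumed filename).
data_graph = Graph()
data_graph.parse("image.spdx.json", format="json-ld")

# A local copy of the SPDX 3 SHACL shapes (assumed filename).
shacl_graph = Graph()
shacl_graph.parse("spdx-model.ttl", format="turtle")

conforms, _, results_text = validate(data_graph, shacl_graph=shacl_graph)
print("valid" if conforms else results_text)
```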
Another problem we had was CVE and vulnerability tracking. SPDX 2 didn't really have a mechanism for saying how vulnerabilities have been addressed. We were tracking the things we built, like their CPEs, but we were also patching things for our users, and there wasn't a way to express that in SPDX 2. Well, we did do it in SPDX 2, but it was very specific to the way we did it: the only way you would have known is if you already knew this came from the Yocto Project, so you could look in these special annotations and figure out that a CVE had been patched. It wasn't standardized at all. SPDX 3 has a VEX-compliant encoding for vulnerabilities, so you can actually say, yes, we know this CVE applies, but we've already fixed it for you, which is great. The VEX reporting in SPDX 3 is very powerful; this is, very quickly, what it looks like.

All right, so where are we today? This is what our SPDX 3 model looks like now. I'm sorry, there's a lot here; I did provide the link there if you want to go look at it in a different format. It's actually very similar to the SPDX 2 model: if you dig down, we're doing a lot of the exact same relationships, just better and more precisely, which is really nice. There's the link you can look at, and I will publish my slides.

It's not a free lunch, though. There are a couple of problems we see on the horizon with SPDX 3. The big one is that we have very strict requirements that we can't pull in any external dependencies to process our SPDX documents, and I don't think we're unique in that situation. So how do you successfully interchange SPDX data in a very minimal-dependency way? That's my main concern, because what I'd really like to do is not only generate our own SPDX documents, but also have our build process link in documents from upstream sources. The only way I can do that with no dependencies is if there's a standardized way of interchanging SPDX documents, so I really, really want that. That's also where my concern about @context comes from: I don't like @context in my documents. Just no @context; I want everything right there. But we're talking about that.

My closing thought is that SPDX 3 has a much higher ceiling. Right now we're just doing the same things we were doing with SPDX 2, but there's a lot more we'll be able to do with SPDX 3, things that would basically have been impossible with SPDX 2. And linking documents together is so much easier; I'm so much happier with that. No one likes tarballs full of text files.

These are other talks I've given about SBOM generation and OpenEmbedded; they go into way more detail, so check them out if you're interested. Any questions?