All right then. Thank you. Thanks for the holler. All right. So I think I did. Yeah.
Okay. So my name is Karen. I work for OSADL, and I'm going to take a step back
from everything we've been discussing here today, about what capabilities SBOMs have, what
more capabilities they need, and which tools create them,
and go back to what I think they were originally for, what we originally thought we
should use them for: license compliance. And there we are still
redoing the same work again and again, because creating SBOMs doesn't work automatically,
at least for most of the software that we're dealing with in embedded Linux systems. So there's
still a lot of manual work required, and that's where sharing and reusing work makes sense, and
this is where the OSSelot project comes in. So I think this is fairly obvious. I don't need to go
into a lot of detail on why reusing makes sense. We don't want to redo work that has been
done before and that is being done again and again. I mean, we still get these questions
every day: why do I have to extract copyrights from the Linux kernel source code? Someone must have
done that already. Why can't we reuse that? So why not do that, why hasn't it been done before,
and what exactly can we do? So, more or less, a compliance toolchain could look like this.
We can't share work everywhere, but we can share work where most manual effort is required: with
scanning and with curating data. Because as good as the scanners out there are, and we've
heard about ScanCode, we've heard about a lot of tools, ORT,
so all of the scanners, or all the tools that use the scanners,
they're really good, but they still make quite a lot of mistakes. So to actually do license
compliance properly, we still need to do manual curation of the data. And this is where the
OSSelot project comes in. You can find more information on the OSSelot website. The data
itself is available on GitHub, in the package-analysis repository of the Open-Source-Compliance
organization; there you actually find stuff that you can already use today.
License and copyright analysis results for various packages,
mainly from embedded Linux systems.
We have about 320 when I last checked.
That counts different versions, of course; it is about 200 unique packages,
with more than 1.5 million files that have been manually curated.
For each package, we have some metadata, so
where the package comes from, a package URL,
the download location, and so on.
Then there are the SBOMs in there.
So SPDX SBOMs are what we're focusing on for
license compliance, in different formats as well, with the license conclusions
and the copyright notices in there.
And, I think that is probably the most valuable part of this, with
comments on why a particular decision was made.
Because sometimes it's not clear what the information in a file means.
You know how licensing information is noted in some files.
It doesn't really follow any standards, especially in older software that's
still being used.
And then you have to make a decision, you have to do some kind of interpretation.
And this is explained as well as part of the SPDX files that are available there.
Also, the SPDX files themselves are explained, because what we find is that,
even though there is a standard, there is a specification,
people still understand it differently.
So someone might expect to get an SBOM from their suppliers, and
they have a certain expectation of what a particular SPDX file looks like.
But the supplier understands the different tags differently than the customer.
So we have a clear explanation of how we understand the SPDX tags, and
of course we try to be as close to the specification as possible there.
And then also, for convenience, there is a disclosure document. If you find
a particular package that you are reusing in exactly that way, unmodified and
in exactly that version, you might, for license compliance, just use the finished
disclosure document with all the license texts and copyrights and
acknowledgments and so on, aggregated.
So of course it's not yet big enough to immediately cover an entire system,
but it is definitely a start.
So as I said, the question why this hasn't been done before has been around
for quite a few years. So why hasn't it been done before?
I think two of the main reasons are liability and trust,
which are more or less two sides of the same coin.
So on the one hand, who was willing to supply such information,
which is legally relevant? If we're talking about license compliance, that is
legally relevant information; companies have gone to court over licenses.
So who was willing to provide this and say, look, you can use this,
we don't give you a guarantee, but we did our best to make this documentation as
sound as possible, so that hopefully you won't be taken to court if you use it?
And then on the other hand, you're a company and you're putting out products and
you reuse legal information that you found somewhere on the internet.
How can you trust this information?
And these are the thoughts that we had when starting this project.
So how can we limit liability, first of all, for ourselves and
for anyone who's contributing?
And of course we asked some lawyers about that.
And the idea was to license as liberally as possible.
So we went with CC0 1.0, which gives you as many rights as possible.
And it works well for documents as well.
In this case, the regulations for gifts apply, and liability applies only for
willful intent and gross negligence, which we try to avoid.
Also, I think the times have changed.
So maybe ten years ago there was a lot of worry,
especially from the US, that there's gonna be lawsuits in the open source area.
But there haven't been any, not with providing legal information,
with providing support with licensing, there haven't been any or
none that have been known.
And so I think people just got braver and said, okay,
maybe now's the time that we can do this.
And then on the other hand, we have trust.
So how can you establish trust in the information?
I think that's fairly straightforward: provide good quality.
So do the curation conscientiously, diligently,
only let people contribute who actually know what they're
doing, so train anyone who wants to contribute.
I mean, it's a bit of a bigger hurdle for contribution, but
it's really important as well to keep up the quality.
The same goes for review. The stuff is on GitHub, so
we can use that for the review process.
And yeah, we also stand behind it
with our name, to promise that we'll keep
the quality as high as it started out.
Let's wait around.
So what are the curation guidelines that we established to ensure this quality?
Well, we're working with FOSSology; I think that's just our preference.
You can use any other tool as well.
We're using ScanCode for scanning as well, integrated into FOSSology.
And we use the source code as upstream as possible,
so ideally directly from the project page,
to not go through any of the stages that we've seen on some slides before, where
stuff gets added by package managers.
So we try to start as upstream as possible at the moment.
And then I think the diffs that you get from what's added by package managers,
this is something that can be included as well, but we're not there yet.
So at the moment, we're still trying to go with the origin.
And then curating the licenses: as I said, there's manual work in there.
And I think that's the valuable part of this project.
So license findings and copyright findings that the scanners have created are curated
manually, of course with all the help that FOSSology can give with that.
So with our curation guidelines, I do have to check the time,
I don't think I'm going to go into too much detail on that.
I mean, if you have looked at scanner findings,
you know why there is still some manual work required.
So with copyrights, it means mainly that stuff that was incorrectly identified as
copyright is removed, and stuff that is added to a copyright notice that's not really
part of it, formatting signs, sometimes license notices, or
just part of the code that is identified as part of the copyright notice, is
removed from the copyright notice.
And then there might be references to external files as well,
like "copyright by the authors", "project authors", "see file AUTHORS".
And then this information has to be added as well.
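To make the "formatting signs" case concrete, here is a minimal sketch in Python. It is not taken from the talk or from any OSSelot or FOSSology tooling; the function name and the set of characters treated as formatting are assumptions chosen for illustration. It only trims comment markers around a finding, and deciding whether a finding is a copyright notice at all remains the manual step described above.

```python
import re


def strip_formatting(finding: str) -> str:
    """Hypothetical helper: trim comment markers around a copyright finding.

    Only leading and trailing comment or markup characters are removed;
    the text of the notice itself is left untouched.
    """
    text = finding.strip()
    # Leading markers such as "*", "//", "#", ";" or "<!--".
    text = re.sub(r"^[\s/*#;<!-]+", "", text)
    # Trailing markers such as "*/", "-->" or a stray backslash.
    text = re.sub(r"[\s*/\\>-]+$", "", text)
    return text


print(strip_formatting(" * Copyright (C) 2001 Jane Doe */"))
```

The names and years in the example strings are invented. A real curation step would also have to handle notices that span several lines, which this sketch does not attempt.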
With the licenses, again, the review is done on file level.
So every file of the source code tree is inspected
if the scanner has found anything, or if it is mentioned in some kind of
notice file or similar.
And this is done even if a package contains some kind of metadata on licensing.
Because it's been our experience, and
probably that of a lot of you as well, that metadata just gets outdated or
is incomplete and so can't really be trusted entirely.
And I think that might also be one of the reasons I can imagine that
this question might come up.
So why do we keep all this information in a separate place?
Why not upstream it into the upstream projects?
And I think there is some reluctance in upstream projects to provide legally
relevant information along with the source code.
And also because then we would have, again, we would have the same problem
that it just won't be updated.
It's just how people are.
And yeah, okay, so we do the curation on the file level.
We confirm or correct scanner findings, as you do.
We add individual license texts where you have them, especially with BSD licenses and so on.
So this is also something that's not usually done by scanners.
We only tag main licenses if there is a clear main license
given in the root directory of a package, so as not to
mislead anyone into thinking that this might be the only license that's in there.
And as I said before, the license comment tags of
the SPDX file explain any license decisions, any curating decisions that are not
obvious or that need some level of interpretation.
Yes, please.
What's your correction rate on average?
Do you mean for how many scanner findings?
For how many do you find that you have to step in?
Well, that differs.
Yeah, sorry.
The question was what our correction rate is.
Well, it differs heavily per package.
So there are some packages that are in really good order where, I don't know,
I don't have a number,
I would guess it's around 10%, and there are packages in horrible shape where
it's closer to 80% that needs manual work.
Yes.
So let's say I processed more than 3000 packages for sovereignty and
I would agree but maybe 20% in general.
Yeah, yeah, so that was just some agreement about the numbers with someone
who clearly has more experience than me.
I can't say I've done 3000.
Yeah.
That's because of the gray hair.
Maybe you're just talking in detail about the clearing process at Siemens.
Well, you might guess that there might be some connection there as well.
Yeah.
Okay, so what do these license comments look like?
They also follow some kind of heuristic, so
the comment usually says: the information in the file is, quote,
whatever the information in the file is.
And then we give a reason for whatever conclusion is made.
For example, we don't have a version of a license given.
We find "this file is GPL".
And then the license comment would be: as no version of the GPL is given,
GPL-1.0-or-later is concluded.
But this clearly is an interpretation.
So this is a legal step. When we find "this file is GPL",
one could also say, I think the most used GPL
is still version two,
so you could also go ahead and conclude they probably mean version two,
because it's the most heavily used.
But our interpretation here is: if they only say GPL,
the author wanted to give us the option to choose whatever version of
the GPL is available, so version one or later is concluded.
So this is a step of interpretation,
but it is explained in the data.
Or for example, a URL is given instead of a license text.
And then of course the URL is checked, and a date is given if anything was found. More
often than not the URL is dead as well,
and then maybe additional research is required.
And then of course that information and the date when it was
checked are given as well.
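The comment pattern described here (quote what the file says, then give the reason for the conclusion) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the exact wording of the generated comment and the example URL are hypothetical and are not taken from the OSSelot data or tooling. Only the GPL-1.0-or-later conclusion and the practice of recording the date of a URL check come from the talk.

```python
from datetime import date
from typing import Tuple


def license_comment(file_statement: str, reason: str) -> str:
    """Quote what the file says, then give the reason for the conclusion."""
    return f'The information in the file is: "{file_statement}". {reason}'


def conclude_unversioned_gpl(file_statement: str) -> Tuple[str, str]:
    """Hypothetical helper: a bare "GPL" without a version is read as any version."""
    concluded = "GPL-1.0-or-later"
    reason = f"As no version of the GPL is given, {concluded} is concluded."
    return concluded, license_comment(file_statement, reason)


def comment_for_url(file_statement: str, reachable: bool, checked_on: date) -> str:
    """Hypothetical helper: record the outcome and the date of a URL check."""
    status = "reachable" if reachable else "no longer reachable"
    reason = f"The URL was {status} when checked on {checked_on.isoformat()}."
    return license_comment(file_statement, reason)


concluded, comment = conclude_unversioned_gpl("This file is GPL")
print(concluded)
print(comment)
```

In an SPDX tag-value file, text like this would go into the per-file license comment field next to the concluded license, so that a reader can see both the raw statement and the interpretation that was applied.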
So what are, yes?
But in that last case, do you report it to the package itself?
Yes, yes.
So in case we do find problematic things, we report them back.
I mean, there are some licenses that have a URL in the license text that is dead.
And then people usually say this license is outdated, but
it's still valid for some files that are out there and that are being used.
So sometimes it's helpful and
projects react, but sometimes they don't.
But yeah, whenever possible, we try to report it back.
That was also the question, sorry for not repeating it.
The question was if we push it back into the projects. Yes, we do our best to do so.
And then going forward on what you need to comment.
If the upstream doesn't take it, what's the hit rate in terms of them ignoring it?
So the question is what the hit rate is in terms of them ignoring it.
It's not large.
Mostly they do take it,
because most projects are interested in being license compliant as well, or in
making it possible for users to be license compliant.
Because that's what we're trying to do.
We're trying to do what the project or the authors wanted us to do.
We're trying to make it possible for users to be license compliant.
So projects are usually keen to take it or to help.
So you asked about the rate before.
I couldn't give you exact numbers, but I can give you some example of what kind of
scanner findings we do have to correct.
Well, I think they're fairly typical and
if you've done any curation, you'll know most of them.
So I'll go over them fairly quickly as well.
So we have "not a license": something has been found that simply isn't a
license but a bit of code or whatever.
So that's removed, of course.
It might be "not the file's license".
So it might be some information that's content of the file but
isn't the file's license.
We have that in documentation quite a lot.
Then license text.
This is something that of course scanners get wrong, and
I don't think there's any way to fix it either.
If you have a LICENSE.txt file that contains the license text,
then of course the license of that file isn't the license in the text.
Most licenses don't have a license themselves.
But GNU licenses, for example, have a license.
So this is something that's corrected as well.
With generic license texts, I said that before:
individual texts, if they differ from the generic license text,
are of course provided. We have
imprecise findings, in particular those with respect to the version of a license.
Then dual licensing cases, especially if it's not
an easy dual license where you have this or that license, but
you have this or that and a third license, or this one license and
a second or a third license.
So these need some manual work as well.
We have license exceptions that we handle a bit differently than
FOSSology does, to bring them into one finding as well.
But that's maybe particular to FOSSology.
We have external references that need to be checked.
As I mentioned before, these might be URLs.
They might also just be references to other files within the package, though.
And there are a lot of problems there as well, because then you have
files that are integrated from a different project, and in their file
they say: look in the copyright file in the root directory.
But they mean the root directory of where they're originally from.
So then that information isn't true anymore, and
we need to do some research and then of course explain what research has been
done to find out where the file originally came from and
what license it is referencing.
Yeah, so that usually takes a bit of effort.
And then we have global license assignment or
partially global license assignment, which we don't usually use.
Again, for the same reasons I gave before: that meta information is usually wrong, or
that stuff is included from different projects.
So if there's a README file that says all files in this directory are licensed
under the following license, we usually don't go with that information,
unless a particular source code file
says: for license information, see the README file.
So this is something that, I think, just comes from experience.
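The rule about directory-wide statements can be written down as a small decision function. This is a simplified sketch in Python, not the project's actual procedure: the function name, its parameters and the licenses in the example are assumptions for illustration, and the real decision is made by a human curator looking at each file.

```python
from typing import List


def concluded_licenses(file_findings: List[str],
                       file_refers_to_readme: bool,
                       readme_license: str) -> List[str]:
    """Hypothetical sketch of the rule for directory-wide README statements.

    file_findings: curated license findings from the file itself.
    file_refers_to_readme: True if the file says something like
        "for license information, see README".
    readme_license: the license the README assigns to the whole directory.
    """
    if file_findings:
        # What is found in the file itself takes precedence.
        return list(file_findings)
    if file_refers_to_readme:
        # The file explicitly points to the README, so the README applies.
        return [readme_license]
    # A global statement alone is not used as a conclusion for this file.
    return []


print(concluded_licenses([], True, "MIT"))
print(concluded_licenses([], False, "MIT"))
```

The empty list in the last case stands for "no conclusion from the README alone". What a curator records for such a file is not specified in the talk.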
Yes.
There are package managers where there's a specific license field
that's filled out with the proper SPDX identifier. Do you apply that to the.
Okay.
The question was about package managers that have a license field or
that have a tag for what the license is.
So at the moment that hasn't come up yet, because we're fairly at the bottom;
we come from Linux-based embedded systems.
So so far we haven't gone into much that is managed by
any package managers.
But for the stuff that I have looked at, it depends. If that's the only information
that's there, we'll go with it.
But there might be different information again in the source code.
And if possible, we'll always go back to the source code.
But if we do have third-party or
meta information, we also add that to the information in the package;
the license comment, I think, is where we add that kind of information.
Yeah.
I think there was another question.
Yeah, it's fine.
It's fine.
Okay.
Okay.
Yes.
So the project seems to be mostly organized for collaboration among humans.
Yes.
And not really for consumption of the information by machines.
For instance, there is no API.
Yes, yes, there's a REST API.
But for instance, package naming seems to be quite vague.
So there are uppercase, lower case, and this is the.
Well, we tried to go with the, yeah, you pointed out something.
The question was about whether the project is made for human consumption or
machine consumption, automatic consumption.
And I said that there is a REST API to fetch the files,
which is not described in the repo, but on the OSSelot website,
if you go to osselot.org.
And then the question was about the naming schemes, and we tried to be as close to
the upstream naming as possible.
But then again, they're not consistent usually.
So we didn't make up our own schemes, but
we tried to stay with the upstream, and where there are inconsistencies,
well, we mirror that.
Yeah, that's right.
Do you know where the API is described?
Because even on the website, I cannot find it.
On osselot.org, oh, it might be under tools actually, sorry.
On the wiki, wiki.osselot.org.
Try that.
Yeah.
So, lately I found in some license listing that someone was using
libuuid, and that was listed as GPL only, and I was checking the licenses.
And then there's the README file, which tells me: various source code in this
package has different licenses.
And looking at the source code of the functions we are presumably using,
it says, oh, it's not strictly GPL.
So, not poison from a proprietary business point of view.
And how do you express that kind of, I mean, we had the discussion before regarding
vulnerabilities, that you need to get back to function level.
Do you foresee that necessity in your work also, or do you strictly handle packages because?
We handle files.
So, the question was about, as I said before, the meta information is imprecise, let's
say.
So, we go back to the source code file level, and what we find there, we believe.
So, you're right.
We might have to dig deeper, but then it's about snippet matching. And
we only assign something to a package if there is a clear main license, but we also warn: you
can have this information and take it, but don't take it as the only information that's
there.
Okay, there's more questions?
Yes.
Thank you.
What you're doing, I think it's great.
I was wondering about the upstreaming of the information that you gather. First you said,
well, upstream is often not interested in it, and then you made a statement like, they really
like to be license compliant.
Do you have statistics on that?
Like what is your gut feeling about this because my personal experience coming from the videos
and that you can list, many people have interest but they need help with it.
I would say these two license compliance we needed to recover have to help them but you
were having a great data set to help them actually.
Yeah, yeah.
So, well, our experience is that when, oh sorry, the repetition: the question was about
upstreaming, what our experience was with upstreaming, how the projects would react to that,
and you said your experience was that they're keen about license compliance. And, well, yeah,
most of them are. There are always exceptions, and we have that as well. With concrete
and particular cases, when we say: in this file we found this problem, can you fix it,
can you clarify?
then most of them, really the vast majority of them, are fine. But, well, I have to admit
also we haven't tried with super many projects. But if we say: we did a complete license
analysis of your entire package for this and this version release, here's the SPDX file, then
they're not as keen to provide that via their website, because that is legally relevant
information. So rather, I think we had one or two projects who were like, oh, that's cool,
we'll point to your site, but we're not going to provide it through our stuff. Because there is
interpretation in there, as I said before, which we explain in the comments.
There's some interpretation in there, so there's some wiggle room, and I don't know,
maybe we could reach more with more effort.
Okay.
There's a few more questions, do we have another minute or time's up?
Okay, so, well, contact me anyway. I'll skip to the last slide, so there's some, yep, sorry.
No, that's good though, I prefer discussions.
So contact me at info@osselot.org and we can chat anyway.
Okay.
Thanks.