As Julia is getting herself up and running, I'd like to introduce her to you. She is from Seattle, so she's another American here. She is a sociotechnical, I can never say that word, systems nerd and a huge fan of Lego, so that's important to know. She used to have a Scottish accent despite being an American who has never been to Scotland. Now that is fascinating, that truly is. And her favorite joke is just her own humor. Julia, take it away.

I think that one landed exactly how I intended it to, so thank you. Hi everyone, it's great to be here today to talk about what I am calling a principled component analysis of open source AI. So who am I? I'm Julia. I've been focused primarily on open source resilience and the software supply chain for the past five to ten years. I feel like a bit of an open source hipster, because I've been talking about some of these things since before they were cool. But I've been working in and around open source AI since undergrad, and I'll leave how long ago that was to your imagination. Some things will probably give it away, though.

So in case you were wondering, that isn't a typo in the title. It is a pun. It is a good slash bad pun, and I hope you are ready for more. I couldn't make a pun out of support vector machines, though, so if you have one, please come talk to me later. That would be good. It would be much appreciated.

So open source AI is not new. I've been doing open source AI for a while now, and I threw a bunch of stuff up on this slide, mostly copied and pasted from past poster presentations, but I have a chapter in this lovely book. I believe there is only one copy left in stock; they haven't had much call to reprint it. So you too could own Constrained Clustering, which is not about neural networks. In that chapter, we took a very interesting approach to exploring information with user feedback, using a variant of the K-means algorithm. That was the basis of our chapter, and we used this fantastic open source machine learning library called Weka. Is anyone familiar with Weka here? My people, hello. Excellent. Weka is wonderful. It has so many great machine learning algorithms in it.

When I first started using Weka, I went to this website called SourceForge and downloaded it, and I was just entranced. It was my first experience with open source. I was entranced by this idea that I could go and see the code, I could go and modify it. And in fact, some researchers at the University of Texas had done exactly that: modified it and redistributed it. Just this magical thing that I could then go and use in my own research, knowing that somebody who is much better at math than I am had validated all the algorithms.

I also built this lovely autonomous aerial robot, which won the dubious award of innovative hardware design from AAAI, which I think means "we don't know how you got this to lift off the ground." But lift it did, mostly because we made it a giant dodecahedron to lift all of the camera components. And a few years ago, I used machine learning to tackle one of the world's hardest problems: determining whether or not you should hug something. So I trained a little model that, when fed an image, would tell you whether it was a good idea to hug it or not. I am sad to say I am not huggable, apparently, as is now mathematically proven. So, you know, maybe I should take my picture again and try it out.

So that's me. The bad news here is that I don't have any answers for any of you in this presentation. I only have open questions.
We are not going into the deep specifics of models, algorithms, or approaches. This is probably going to be a little too high level for some of you, and a little too low level for others. We are exploring this new area of technology that has ballooned seemingly overnight. It feels like that to me; I don't know if it feels like that to you. And we are facing some really interesting challenges when talking about the advances that we have seen in artificial intelligence and how they intersect with open source, if you are not using Weka. If you are using Weka, you are set. It's great. They are not paying me for this, I promise.

So, to level set: AI draws from a lot of different fields. If you go back to your AI 101 course, you are probably going to get a survey overview of all of these different fields, from ethics to philosophy, which plays a big part in AI, to economics. My favorite part of AI is the formal logic side, but that's because statistics was never really my strong suit, which is why I love computers. So there are a lot of different considerations that go into building AI systems and AI technologies, and into looking at new approaches, both as practitioners and as researchers.

I'm hearing a lot of echo. Is that the same for everyone? Okay. It will just say everything twice. It's fine.

So at a very high level, this is one of the slides I show people when they ask me why I don't use the phrase AI. It's because, generally speaking, when people are talking about artificial intelligence, they are not talking about the entire field of artificial intelligence. They are talking about machine learning. So we can break it down into roughly two camps. And I call them camps because people are really settled in one or the other, and they usually don't switch back and forth. That's been my experience, anyway.

We've got symbolic artificial intelligence, also referred to as logical AI: how do people think, and how do we teach machines to think in ways that are similar to how people actually think? This is where cognitive science really comes into play. And then the much bigger circle up there, the one we're mostly concerned with these days, is machine learning: the math. This is what I tend to characterize as "thinking is hard, but we can probably build a model that comes close, we'll do some math, and we want to get stuff done, so we're just going to use the data that we've got and cross our fingers." You can argue with me about that opinion all you like. I'm cool with that. So while I do have, as I mentioned, a deep abiding love for symbolic AI, we're focusing primarily on machine learning here. Unless anybody wants to talk about slime. Not that slime.

So, some elements of AI, or rather machine learning, see? I'm also getting hit by the "AI means machine learning" bug. Some possible elements include: what data do we have that goes into training the system? How do we actually train the system? How do we evaluate it? Is there a model as an output? Is there a user interface as a way of interacting with the machine learning? Now, not all of these are going to be present in every machine learning system. It kind of blows some people's minds to realize that you can have machine learning without a model, or machine learning without a task or a prompt. But it's true. And we have to account for that when thinking about open source machine learning.
And when we're looking at all of these different components, it gets a little bit hard to reason about. But if we reduce the dimensionality, PCA pun intended, we see roughly four buckets emerge. We've got the data, which is pretty familiar to us: we know what data is, and we've got a good understanding of what might be training data, what might be validation data, et cetera. We've got code, also a well-worn path for open source. Then we have what I call the other stuff, because one of my skills is not naming things. And finally, we've got output. So by doing a rough grouping with K equals four, we wind up with these four buckets, and I think that thinking of them in this schematic makes it much easier to tackle the challenges we face one by one. Now, some elements might appear in multiple buckets, not on this slide, because simplicity, and some might not appear at all. But it's a starting point.

So let's first talk about data. When it comes to machine learning and data, we have some interesting problems. Some of them are known; some of them are unknown. We have a lot of data out there. Machine learning research has been going on since, what, the 50s? Right? Ish? And that means that a lot of the data that has been used in this research doesn't have known provenance. We don't actually know where some of the data came from. And, I'm not going to talk about licenses, by the way, I forgot to mention that. I'm not talking about licenses; I'm just going to talk about the challenges in building open machine learning systems. But when there's data without known provenance, we don't know if it is truly fair game for us to use. If we have data and we don't know how it was collected, we don't know if it's truly fair game for us to use, or if it suits our purpose. It could be an incredibly flawed dataset, and we would have no way of knowing.

In terms of privacy, we've got a few big challenges with de-identification and anonymization. There's a lot of really interesting work that can be done in machine learning but can't necessarily be done in an open way, and I don't actually know how to solve that particular beast, because I feel like I'm kind of split in two on it. On the one hand, I do think that having full access to the training data, validation data, test data, et cetera, is really important for building open systems. On the other, I don't think it can come at the expense of people's safety. So if you are training something on data that includes personally identifiable information, we kind of have to weigh that. There is that question of: should that be an open source system in the first place? I know that's a very controversial thing to say at FOSDEM, so I can leave now, if you like. For systems that also incorporate user feedback as part of the training data, because that's a fun place to get into, how do you build in, again, that de-identification and anonymization?

And then there are things like: okay, well, how are we actually going about splitting the corpus? If we are splitting the corpus into training and validation data, it's actually important to know what proportions we're using and how we're sampling in order to do that splitting. But again, some of these may not be applicable for all machine learning systems.

So one of the things that I kept asking myself is: are these things required to recompile, for some definition of recompile, a model? If I wanted to create the model from scratch and build it myself, what do I need? The entire dataset?
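Just to make that splitting point concrete before we get to the dataset question: here is a minimal sketch of the kind of disclosure I mean, using scikit-learn purely as an illustration. The library, the function, and the particular numbers are my choices for the example, not anything any project mandates.

```python
# A minimal sketch of disclosing how a corpus was split.
# scikit-learn is used purely for illustration; the library is not the point.
from sklearn.model_selection import train_test_split

def split_corpus(examples, labels):
    # The proportions, the sampling strategy (stratified here), and the seed
    # are exactly what someone rebuilding the system needs to know, so they
    # are spelled out rather than left as implicit defaults.
    return train_test_split(
        examples,
        labels,
        test_size=0.2,      # 80/20 train/validation split
        stratify=labels,    # preserve label proportions in both halves
        random_state=1337,  # fixed seed so the split is reproducible
    )
```

What matters is not the library; it's that the proportions, the sampling strategy, and the seed are written down somewhere that a person rebuilding the system can actually find them.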
If you want to show hands and tell me whether you think it's required, I'm fascinated to know, but there's no obligation to. So: is the entire dataset required to recompile a model? Would a description of the data suffice? How about a datasheet? Who thinks you need to know how the data was collected? Well, thank you. I appreciate you. I appreciate all of you.

So I kept coming back to this question, is it required to recompile a model, and you'll see that question as we go through the other sections. I do take the stance that you need the entire dataset and the methodology, but that introduces some big problems, and big as in dollar signs, because these corpora are not small. Hosting them is expensive. So if we want to make open source machine learning open to all of the people who are interested in participating in it, how do we break down the cost of doing so? How do we make it available? Publishing the methodology for how the data was collected helps with transparency, if you trust it. But we're open source; we try to trust each other, mostly. Except for a few of you. You know who you are. And there are some open questions about attribution, whether the data also needs attribution. I'm not talking about that in a legal sense or a license sense, just for transparency and for credit, because we appreciate giving credit where credit is due. And if somebody wants to opt out of having their data in a corpus, how do we handle that as well? Lots of unknown problems with data.

But code gets a little bit easier. This is going to be the second time I make a Jurassic Park joke this week, but we know how to do open source software. This is code. We know how to do this. And despite what we've been hearing, most of machine learning fits solely in this camp, the camp of open source software. Job well done. Great. We've got Weka. What else do we need? So it is governed by the same requirements as normal open source software; no special casing needed. Cool. One of the unique things, though, about this type of software is that it may actually produce one of those things we don't yet know how to deal with: the model. It also might involve an interface for interacting with whatever system it produces, which may or may not be a model. And it does intersect with the data, and some interesting problems go along with that.

So one of the things that we do when we process data is we clean it. Does that code need to be open source? Alongside cleaning come some interesting value judgments. We may say, okay, deprioritize this feature a little bit; let's increase the priority on this one. And in that way, we are actually making moral and ethical judgments, and we're encoding them. Now, the great thing about code is that it's very easily inspected, for some definition of very and some definition of easily; that's an exercise for the reader. But if we dig into it, we can see where those value judgments are made.

The other stuff. This is my favorite part, and if somebody has a better name for it, let me know. So: are hardware specifications required to recompile a model? How about disclosure of training time, how long it was trained for? A definition of correctness? All of these do impact what comes out of your machine learning algorithm. I was doing a one-day course, brushing up some knowledge, and there was a bit of a competition. Ordinarily I hate competitions in classwork.
But the idea was: here are all the concepts, do some fine-tuning, play around, and see who can get the highest accuracy on some random task. And I thought I did a pretty good job. The story of my life: I thought I did a pretty good job. But there was one person who achieved a nearly perfect score. So of course the question was, what did you do? And they said, oh, well, I just ran the training for, you know, two days. I'm like, oh, okay. Yeah, this was a one-day course; I didn't think it was a two-day investment. So the training time absolutely does affect the quality and the output of the resulting system, as does the hardware. So they are required to recompile a model. But, similar to data, we also have the question of access: access to equitable compute, access to the hardware itself. And again, we have a problem of attribution.

So finally, output. And since we are focusing on machine learning models, we've got matrices. I don't really know how to make that work in an open way. Yes, I can inspect a matrix. Can I make sense of it in isolation? Not so much.

So if we do a litmus test: if this is all we have, can we do arbitrary machine learning tasks with just this? Probably not. How about just the code? Probably not; you still need some data. Just the hardware? Maybe. I'd like to see that. Just the model? I'm going to say no. I'm putting my foot down there.

So what do we really need to make a transparent machine learning system? We need all of it. It all needs to be there. It all needs to be open and available. And that might mean that some things are not suitable for being open. So some other questions that I'd love for you to think about as you're thinking about open source and machine learning: what does contribution to a model look like? What does correctness of a contribution look like? And how do we actually verify the openness of these systems in a way that doesn't require a huge amount of investment that only a select few have?
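One small way to start making that last question tractable is to treat the "other stuff" bucket as a disclosure record that ships alongside the code and the data. The sketch below is purely an illustration: every field name and example value is invented for this talk, not an existing standard or any particular project's format.

```python
# A minimal sketch of a disclosure record for the "other stuff" bucket.
# All field names and example values are invented for illustration only.
from dataclasses import dataclass

@dataclass
class TrainingDisclosure:
    dataset_checksum: str        # which exact corpus was trained on
    split_description: str       # proportions, sampling strategy, seed
    hardware: str                # what the training ran on
    training_time_hours: float   # the two-day run versus the one-day run matters
    random_seed: int             # needed to reproduce the run
    correctness_definition: str  # how "good enough" was decided

# Hypothetical example of what one published record might look like.
example = TrainingDisclosure(
    dataset_checksum="sha256 of the released corpus",
    split_description="80/20 train/validation, stratified, seed 1337",
    hardware="8 GPUs, 80 GB memory each",
    training_time_hours=48.0,
    random_seed=1337,
    correctness_definition="top-1 accuracy of at least 0.9 on the held-out set",
)
```

None of this answers the open questions, but it is the kind of artifact you would need in hand before anyone could even begin to verify, or reproduce, the openness of a system.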