We also have the Matrix room up and running, which is great. And I have the pleasure of introducing you to Andrew. You live near Oxford, which I believe is pretty cool. I'm a Texan, I wouldn't know. You love the Oxford music scene. You once played croquet for Cambridge University against Oxford, and you have a great joke for us. What is brown and sticky? A stick. Take it away, Andrew.

My kids like that one. Happy FOSDEM, everyone. It's great to see everyone here. I am a lawyer, and I've been advising on AI for quite a long time. Can you hear me? That's better. Clearly, the emergence of large language models has meant that there's a lot more legal analysis going on in this area. So I thought I'd spend half an hour or so going through some of my thought processes when I'm analysing copyright law as it relates to AI, machine learning, and large language models.

You may have heard a number of myths. The first one is that models are essentially data; data are facts, and facts can't be covered by copyright. That's statement number one. Statement number two: you may be aware that in various jurisdictions throughout the world there are different exemptions in copyright law that enable you to gather and ingest data for machine learning and data mining purposes. So you may hear that gathering training data within one of these copyright exemptions solves any copyright problems you may have, and you really don't have to worry about any copyright issues. The third thing you may hear is that output generated by generative AI cannot be subject to copyright. And the fourth is that, as a result, AI-generated code cannot trigger copyleft obligations.

So let's look at some of these statements in slightly greater depth. Now, I'm going to do this in a slightly strange way, because I've only got half an hour or so and I'm not able to go through all of my thought processes in great depth.
What I propose to do is give you the conclusions of my thinking first, and then give you a flavour of some of the thought processes that led to them. So you will, by necessity, only be getting part of the picture, and I do apologize for that.

First of all, my belief is that a large language model is capable of "containing" copyright works, or derivative works of copyright works. I put "containing" in inverted commas there, because what that means is subject to a certain amount of interpretation. Secondly, I believe that if the training data was ingested under a copyright exemption, that doesn't automatically exempt any output from copyright as well. Third, the generated output is likely to be subject to copyright. This is a statement that will vary significantly from jurisdiction to jurisdiction. I'm licensed to practice in England and Wales, and it's certainly true there. It may not be quite as true in other jurisdictions, but certainly from my jurisdiction's perspective, generated output is likely to be subject to copyright. Also, under the law of England and Wales, the output generated by a prompt you have entered belongs to you, or possibly to your employer. But even having said that, AI-generated output can be infringing, and similarly it can also trigger copyleft effects.

So what I'm going to talk about now is a few things you need to bear in mind about copyright and about the process of generating models and generating outputs from generative AI. Again, I'm going through these fairly quickly, and I'm not necessarily going to link them together in detail. When analysing this, the key thing to bear in mind is that copyright can potentially impinge at three points. First of all, the point when the training data is ingested and the model is created in the first place.
Secondly, and this is the one that people tend to forget about, when the whole model is transferred or distributed from one place to another. This is particularly relevant when the model is distributed across jurisdictional boundaries. And the third is the point at which results are output: there's potential for copyright to impinge there as well.

Now, there are a few things you need to understand about copyright. All the developers I've met have a pretty good grasp of copyright in theory, but there are always some areas around the edges that they're potentially a little unsure about. Many lawyers are unsure about these as well. And a lot of the arguments we employ in the copyright analysis of large language models tend to involve these edge cases. So here are a few things to bear in mind when you're considering the application of copyright to AI; these are just characteristics of copyright in general.

First of all, it's possible for more than one copyright to exist in a work simultaneously. For example, if I write an opera based on The Lord of the Rings (and I can tell you now, it wouldn't be a very good one), then my creative input has gone into writing that opera. It will therefore be both a copyright work that I've created and a derivative work of The Lord of the Rings. So if you want to perform that opera, you're going to need a license from Middle-earth Enterprises, the organization that holds the relevant performing rights in The Lord of the Rings, but you're also going to need a license from me as well. There are at least two copyrights subsisting in this opera simultaneously, and you're going to need licenses from both of those copyright holders in order to perform the work.
And the classic example here is the Linux kernel, which will have many thousands, possibly tens of thousands, of different copyright holders simultaneously. That is the main reason we will never see it relicensed under any license other than GPL version 2 only: if you wanted to relicense it, you would basically need permission from all of those copyright holders, or you would need to extract their copyright works from the kernel. That's never going to happen.

Just stepping back a little: software is covered by copyright in just about the same way as literary works are. This is the legal fiction that was established back when legal systems were thinking about how software should be protected under copyright, and indeed whether it should be protected at all. One very key piece of the philosophy behind copyright is the distinction between an idea and its expression. This distinction is laid out very clearly in US copyright law. It's also present in European copyright law, though not quite so clearly; the Software Directive does make explicit reference to the distinction between ideas and facts on the one hand and the expression of those ideas on the other. Basically, the facts themselves cannot be subject to copyright, but the expression of those facts can. So that's one thing to bear in mind.

Another thing to bear in mind is that copyright infringement is not subject to intent. It doesn't matter whether you intended to infringe copyright or not. If you do an act which causes copyright to be breached, if you copy something believing that it was yours to copy, or believing that you had a license to copy it, then from a civil law perspective (we're not talking about the criminal law here) that still counts as copyright infringement.
And the third thing to realize is that if someone independently produces a work which is the same as an existing copyright work, then each copyright owner retains their own copyright in their work, and there's no infringement. So if I write a little melody, and somebody else independently comes up with an absolutely identical melody, they haven't copied me; they've had no opportunity to listen to my melody. They hold the copyright in their melody just as I hold the copyright in mine.

So those are three things that tend to be a little bit counterintuitive about copyright, and they're concepts that I draw on in the rest of my analysis here.

OK, so let's look at one issue. Take the premise that a model is essentially a set of statistical facts about the training material. Does that mean it's incapable of containing derivative works? Let's change the subject completely and look at a WAV file. You could equally argue that a WAV file is just a set of facts about how far a speaker cone is from a fixed point at a particular moment in time. And we know, obviously, that a WAV file can be infringing. A WAV file can contain a piece of copyrighted music, and the number of lawsuits covering copyrighted music encapsulated in electronic file formats is obviously huge. There's absolutely no doubt at all that music encapsulated in a WAV file, which you could argue is just a selection of facts, can be subject to copyright.

Turning to the concept of a model: people say that you can't easily reverse engineer a model to find out what is within it; it is just a set of information about statistical relationships. But just because you can't reverse engineer the model to determine its contents doesn't mean that it doesn't potentially contain derivative works. And I'll go into that in a little more detail later.
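To make the "a WAV file is just facts" point concrete, here is a minimal sketch using Python's standard `wave` module (the tone, sample rate, and filename are illustrative choices of mine, not anything from the talk). The resulting file is literally nothing but a header plus a list of sampled speaker-cone positions, yet exactly the same mechanism can encapsulate a copyrighted recording.

```python
import math
import struct
import wave

# A WAV file is "just" a list of numbers: each sample records how far
# the speaker cone should be from its rest position at one instant.
SAMPLE_RATE = 8000  # samples per second
FREQ = 440.0        # a one-second 440 Hz test tone

samples = [
    int(32767 * math.sin(2 * math.pi * FREQ * t / SAMPLE_RATE))
    for t in range(SAMPLE_RATE)
]

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit signed samples
    w.setframerate(SAMPLE_RATE)
    w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

Swap the synthesized sine wave for samples of a recorded song and nothing about the format changes: it is still a bare sequence of facts, and it is still plainly capable of being infringing.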
But if we go back to the audio example: if you look at a more complex audio file format like Ogg Vorbis, and you're just given the file, you're going to find it impossible, I would say, to reverse engineer it and get the music back out again unless you actually know how it was encoded in the first place. A WAV file is sufficiently simple that you probably could reverse engineer it and figure out what music was in there. But with an Ogg Vorbis file, you're just not going to be able to do that unless you know the encoding scheme. Yet nobody is going to argue that a Vorbis-encoded file cannot be infringing.

And there is, of course, a way you can get AI models to reveal whether they contain any derivative works: if the model is part of a generative AI, you can simply ask it. I'll give a few examples of that shortly.

Some of you may be familiar with this poem, Jabberwocky, which was written by Lewis Carroll. I'll just read the first lines: "'Twas brillig, and the slithy toves / Did gyre and gimble in the wabe." Now, for those of you who aren't native English speakers, the words I've highlighted in yellow up here are nonsense words; they were made up specifically for this poem. It's important to realize that Jabberwocky is no longer in copyright, which is partly why it's easy for me to talk about it here: I don't have to worry about that. But because these are nonsense words, and they only exist in this particular poem or in derivatives of it created later, they turn out to be a great test of whether an AI has had access to this particular work as part of its ingestion process: you see whether you can get it to disgorge these words later in some way. So I did a few experiments with ChatGPT. I asked it to write a poem entitled Jabberwocky, and it wrote Jabberwocky. The result was verbatim; it didn't even try to change anything at all.
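The nonsense-word probe described above can be sketched in a few lines of Python. This is a hypothetical illustration, not anything shown in the talk: `ask_model` is a stand-in for whatever LLM API you use, stubbed here with the poem's actual opening line, and the three-hit threshold is an arbitrary choice of mine.

```python
# Words invented by Lewis Carroll that exist only in Jabberwocky
# (or in later derivatives of it).
NONSENSE_WORDS = {"brillig", "slithy", "toves", "gyre", "gimble", "wabe"}

def ask_model(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return "'Twas brillig, and the slithy toves / Did gyre and gimble in the wabe"

def probe_for_ingestion(prompt: str) -> bool:
    """Return True if the reply contains words that exist only in the target work."""
    reply = ask_model(prompt).lower()
    hits = {w for w in NONSENSE_WORDS if w in reply}
    # One hit could be coincidence; several almost certainly is not.
    return len(hits) >= 3

print(probe_for_ingestion("Write a poem entitled Jabberwocky"))  # → True
```

The same idea generalizes to any work containing strings that are vanishingly unlikely to arise independently: distinctive misspellings, unusual variable names in code, and so on.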
So we know that ChatGPT has ingested Jabberwocky; the chances of it having been produced independently are infinitesimal. Did it know the poem was out of copyright? I'm not suggesting that there's any copyright infringement going on here, clearly, because Jabberwocky is out of copyright anyway. It may be that in choosing the training materials, great care was taken to ensure that no materials were ingested that were in copyright or that lacked an appropriate license.

We did quite a few other tests on this basis, and we found a number of works that actually are in copyright contained within various large language models. I have to stress, again, that this doesn't mean OpenAI is necessarily infringing; they might have obtained a license to those particular works. But it does demonstrate that it's possible for an LLM to contain copyright works. The argument that an LLM just contains facts doesn't hold a great deal of water when you analyse it along these lines. And indeed, there are plenty of studies now showing that copyright works do exist in various LLMs. There's one concerning Copilot itself: research showing that from time to time Copilot will disgorge verbatim copyright works.

So the conclusion here is that AI cannot be used to launder copyright works. You can't take a copyright work, feed it into an LLM, and then claim that because the same copyright work has come out the other side, it's no longer subject to copyright.

So is it possible to have AI extract the ideas and leave the expression behind? Remember that ideas themselves don't attract copyright, but the expression does. Now, there's a great video (I've put the QR code up there for it) which shows the similarity of two songs. One is called My Sweet Lord, by George Harrison, which you may be familiar with.
And the other one, called He's So Fine, by The Chiffons, you may not be quite so familiar with. In a nutshell: in a case some time ago, George Harrison was sued over releasing My Sweet Lord, which does, to my untrained musical ear, sound very, very similar to He's So Fine. The crux of the case was that it was never established that Harrison had consciously copied He's So Fine; and if you recall, we said earlier that intent has no part to play here, so whether he'd meant to or not was by the by. The fact is that if he had copied, then infringement would have occurred. The judge said that Harrison had had the opportunity to hear He's So Fine, because it was quite a popular song at the time; it would have been played in shops, on the radio, and so on. So there was a high probability that he had heard it, that it had somehow entered his mind subconsciously, and that it had become part of his thought process when he wrote My Sweet Lord. There's a reference to the case on the slide.

It seems to me quite logical that the courts are going to follow similar reasoning with AI: if a generative AI produces material which appears to be copyright-infringing, and it can be demonstrated that that material was part of the training data, then the courts are likely to conclude that infringement is happening, notwithstanding that we can't really work out how exactly it's "encoded" (if that's the correct word) within the AI model in the first place.

Now let's look at a different case, one which strains this idea-expression distinction. This is a photograph that was pretty popular in London a few years ago; it was in a lot of tourist places where you could buy souvenirs and so on. It's a pretty striking picture of a red London bus crossing Westminster Bridge. The picture I've just shown you, which was on a drinks coaster, is reproduced here on the left.
And on the right: another company, New English Teas Limited, decided it would be nice to have a similar sort of image, but they didn't want to pay a license fee to the holders of the first image. So they asked a photographer to take a picture of a London bus from roughly the same location, and then they had somebody retouch it so that the London bus was red and everything else was in monochrome. You would imagine that this is about as clear an example of idea versus expression as possible: the expressions are these two different photographs, but the idea is basically "a red double-decker London bus crossing Westminster Bridge, with everything else in monochrome". It's not a particularly strong precedent, because it was only a decision at a lower court, so it's not particularly binding; but nonetheless, it was determined that there was potential infringement going on here.

So I went onto DALL·E and used that description as a prompt, which seems to me a fairly reasonable distillation of the image into an idea. And you'll see that DALL·E, using that prompt, has produced something that, under that particular legal doctrine, is almost certainly infringing. But hopefully not in Belgium. So that's an example of how court cases do not always help this analysis a great deal.

So what can help us when we're using generative AI and trying to avoid infringement? There are different ways to approach this. First of all, you can filter what is ingested. If you limit your training data to things that are no longer in copyright, or things for which you have a specific license that is sufficiently broad to allow them to be used to generate materials with an AI, or things subject to an exemption that is, again, broad enough to enable you to do that, then that may assist you.
But the trouble is that this fairly dramatically reduces the pool of material you're able to use to generate the model in the first place, which of course means that your model is not going to be as good as it potentially could be. That's something to bear in mind.

The second thing is to review the algorithm. If it turns out that your algorithm produces a model that's only two megabytes in size, then there's not going to be much space to fit a whole bunch of copyright works inside it, derivative or otherwise. That's going to be a pretty strong argument that what comes out of it is unlikely to be infringing, because it would be pretty difficult to fit anything in there that could potentially be a derivative work in the first place.

And the third option is to filter what's output. You look at the output of the AI, determine at that point whether it is potentially a derivative work, and block its onward transmission. There's potentially some infringement happening at the point where the output is produced, but if you're not distributing it any further, that somewhat limits the issues. There are a number of technologies available that could potentially help with this: YouTube's Content ID, for example, and Getty Images has some software for plagiarism detection, and so on. It's very easy for me to say this without going into detail, but it's possible those sorts of techniques can be used to determine whether the output is potentially infringing. And indeed, this is already being done in certain cases: Copilot, for example, has a duplication-detection feature intended to do exactly that.
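The third option, output filtering, can be sketched very naively with the standard library's `difflib`. This is purely illustrative: real matching services use far more sophisticated, obfuscation-resistant algorithms, and the corpus, function names, and 0.9 threshold here are all my own invented assumptions.

```python
import difflib

# A hypothetical corpus of known copyright works; a real service would
# query a large indexed database rather than a Python list.
KNOWN_WORKS = [
    "def quicksort(xs):\n    if len(xs) <= 1:\n        return xs",
]

def looks_like_known_work(generated: str, threshold: float = 0.9) -> bool:
    """Return True if the generated text is suspiciously similar to a known work."""
    return any(
        difflib.SequenceMatcher(None, generated, known).ratio() >= threshold
        for known in KNOWN_WORKS
    )

# Anything flagged here would be blocked before onward distribution.
print(looks_like_known_work("print('hello world')"))  # → False
```

Plain string similarity like this is trivially defeated by renaming variables or reordering code, which is exactly why the commercial snippet-matching tools invest so heavily in smarter matching.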
Those of you who are involved in open source compliance are probably familiar with snippet-matching services like Black Duck, FOSSID, or SCANOSS, which use a database of existing code and quite sophisticated algorithms to make sure you can't defeat them simply by obfuscating the code. How effective those algorithms are is obviously variable, but I think it's likely that specialist products are going to be developed. I happened to be in touch with the founder of one of these scanning-software companies a couple of weeks ago, and I asked whether, to his knowledge, there were developments afoot to help with this situation of filtering the output of generative AI to see whether it was potentially infringing. He basically said that he couldn't tell me any more unless I signed an NDA. You can take that as you will.

So, a few final thoughts. I don't believe that a permissively licensed knowledge base, a permissively licensed corpus of materials used to create the model, is the answer. There are a number of models available whose creators say they only use permissively licensed code. That doesn't mean there are no compliance obligations: pretty much all of the licenses in question will have attribution requirements, so how do you follow those through onto the output? What you're really taking is a risk-based approach: you think that somebody who's licensed their code under Apache is less likely to get unhappy than somebody who's licensed it under GPL. But that's not a legal analysis, it's a risk analysis, so you've still got to be careful. It's not a magic bullet.

The other thing to be aware of is that different jurisdictions have very different rules about whether machine-generated code is subject to copyright. We touched on this earlier.
There's a specific provision in the UK Copyright Act that says that computer-generated works are subject to copyright, and that the copyright is owned by the person who made the arrangements for the work to be created. Now, it's a bit difficult to determine what "made the arrangements" means. Is it the person who created the model? Is it the person who created the software that uses the model to produce the output? Or is it the person who put the prompt in? My gut feeling is that it's probably the person who entered the prompt to generate the output, but this hasn't been determined judicially. There is one case, which has nothing to do with AI but does concern image generation, suggesting that the person who wrote the software (in that instance, some game software) is the person who made the arrangements, not the person who was playing the game. So that's a little bit problematic; but as I say, it's only one case, and it didn't go to the appeal courts.

One other thing to bear in mind: quite often, if you're looking at two pieces of copyright work that are long and extensive, that potentially have mistakes in them, and that are identical, including the mistakes, then you're going to assume that the only way Work B came into existence with all of those mistakes is that Work A was copied. Up until now, that's been a pretty reasonable assumption to make. But of course, with generative AI, it's entirely possible that two people could enter very similar or identical prompts, and that those prompts would generate identical output works. So we can no longer automatically assume that a long and complex work with mistakes in it is only going to be owned by one person from a copyright perspective. That's just a word of caution.

One thing that really does worry me here is the potential for AI to be used to automate a clean-room rewrite.
Again, I've done some analysis on this, but I won't share the details now because I don't have time. If you take a piece of code and ask an AI to analyse it and reduce it to a functional description, and you then feed that functional description into another AI and say "please write code to this functional description", does that mean you've developed an automated way of copyright-washing? After all, you've been through an automated process that has basically stripped out the expression, taking the code down to a functional description, which is purely an idea, and which we know does not attract copyright, and then reproduced a piece of software from it. An awful lot of money, I think several billion if not trillion dollars, says that is not going to be allowed to happen; but we need to be aware of it as a possibility.

So that's a whistle-stop tour through my various thoughts on the topic. Thank you very much for taking the time to listen to me. Do we have time for a couple of questions?

Two questions right over there. Fantastic. One question we had was: if the model outputs a copy of an image, who is infringing? The AI machine, or the human who entered the prompt?

Good question. People infringe. Potentially a legal person can infringe, so it could be a company as well. But ultimately it comes down to who was doing the act of copying. So it's almost certainly going to be a human; but if the human is employed by an organisation, then it could be the organisation that is infringing as well.

While we've still got time, another question we had: there seems to have been some confusion about whether this was under US copyright law or the law of England and Wales. Which one are you talking about here?
What I was saying was under the copyright law of England and Wales, which is the same as copyright law in the rest of the UK. But some of the references I made, the idea-expression dichotomy for example, concern things that are much clearer under US law than under English law, though they still subsist there. Thank you.