So thank you for your coming. It's quite early, 9.30. It's difficult to start, so I will
try to push energy to this session. So just before to get started, I would like to know
more about you. So with three very simple questions. First, who has ever locally run
LLM on his laptop using Lama CCP, VLLM or LLM Studio? Please raise your hand. Okay,
right. Second question, who has ever fine-tuned a model? Fine-tuned. Okay, let's find 10. Okay.
And the last one, who has, like me, dreaming to have one-dred open-source model in here? Not
only one, not only open-weight model, but really open-source models. Raise your hand. Okay, you are
in the right place. So we will do the job. Okay. Yes, my name is Michel Marie Modés. So I have a
co-founder company, a software company called Inagora. So we started in 2001. So as the first
time, we will be very close to our 25 years. So it will be for next year. And our mission with
Inagora is to invent and develop good tech for good. So what I can sum up as ethical open-source.
And for AI, we do the same. We do ethical AI. And to come up to achieve this goal, so we started
a community, a very large and a brand community called Open LLM France. So we started in June
2020. And we have two main goals. First, as well to build trusted sovereign and real open-source
generative AI technologies. And the second goal is to build a strong ecosystem around LLMs and
generative AI systems. So for the second objective, so I can say that we have success because the
community right now is more over 450 active members with a strong support from the academic and
public research in France. So it's very important because, for example, with the GenC, we can use
freely supercomputers like GenZ. And it's very useful for us to give freely GPUs to train our
models. So it's very important. And at the same time, so we have a lot of corporates, corporate
private company, who have are using AI technology or many to build with us AI solutions. So and all
this track for today. So I think there is a lot of important things to build ethical AI system. But
my talk, I will talk with you about three topics as well, what we could consider as open-source AI. So
this is my first part of my talk. The second part of my talk will be related to diversity and the
underrepresentation of our culture, our language in these models today. And the third part of my
talk this morning will be related to data quality and the evaluation of this model. Okay.
Under. Okay. Right. So to be very clear, and to start on the biggest problem, so the most popular
open model that you are using today are not open source. They are open wave model. So this afternoon,
Stefano Maffuli from the OZ, open source initiative, we have a talk to report to their progress to
this definition of what we could consider as open source AI. So I'm very proud because I'm part of
this small group and private group inside the experts from external from from the OZ to try to
define and to get this definition because it's important to clarify the situation because as you
know, and I'm not alone, but Stefano and probably some of you's have used as a published post to
raise the problem to the misuse of the open source term today by some players on the far work
ecosystem. And so and I put in this slide, you know, the OZ definition of open source. So to be very
clear, if you have limitation on the use, the license and the term of use of a lease license, or
if you don't have the artifact, the element to train again the model or to make a derefited work on
it, you can't say that you are doing open source. This is very clear. And today, the main part of the
popular model we have, you don't have a view and access to the data set used to train the model.
For us, for this community, what we open source AI means three things. First, as well, that we are
able to have the open source of the model. All the tooling system used, for example, to train the
models to evaluate the model of the pipeline to do the evaluation of the model. And so for different
things, it's not very easy to find this information on an open model today. The second point is related
to a license. So if you have for us our license, we don't have this license, we have to have, we thought
in the limitation of who and what we are doing with this model. And the most important is the third
point is related to that asset, open corpus, open corpora. But you know, it's very interesting because
probably if you follow the news related to AI, you saw during these past days some new models with
data sets published under open source license. So, and I think it's very important and I think that
for 2020, not only the year of open source AI, but also for data set publication, open source
license. So I changed my presentation last night, just after the talk of Joss, the co-founder of Next
Cloud, because he present an ethical rating system. And I'm very glad to see that we share the same point
of view. And it's very simple for also for the Next Cloud community. If all these conditions are met,
the three conditions, so you are in the green area. If you have only one, two conditions, so you are
in the yellow, only one orange. And if you are using, for example, open AI, in fact, ChabGPT from
open AI, zero condition are met. So you are in the red area. So if we have today this morning
developers from this beautiful Next Cloud community, thanks for your job. It's amazing and we love it. And so
for us, by the way, we are in the first green area and we try to do the job. The second part, the second
topic I would like to underline this morning, it's the problem that AI generative models are more and more
representation of a picture of what we are in terms of culture, in terms of society, in terms of language. So I
think that's figures talked by the by themselves. So in the left, you can see that since 2018, less than
8% of LLM has been created in Europe. And on the right, what you can see that it's the volume of
language used to train LAMATU model. So 0.16 for French and 0.17 for German. So percent. So I don't
know what do you think about that. So but in my point of view, we can say that we are not really well
represented as our culture values in this model today. So we have a problem, I think. And we have a
community we try to solve. So first, first try we did, it's to adopt a data first, drive an approach or
quite a quality first, drive an approach. And because the small also is beautiful. And we try to get the
proof that quality of the data set is more important than the quantity of data you have. And to demonstrate
this this point, we release a first model in October called Claire. So Claire like the woman, the
show name in France. So I'm not against I have nothing against a podcast, Albert, Alfred, Mr. But you
know, we prefer in our community to promote women because by fact, it's our little contribution to have
more women in our AI ecosystem and a global unity. So I will, I will not go deeply in Claire because
Julie, the real one. Yes. Julie will go deep and tell you all about Claire what we did. Oh, we did this
model. But just for very, very, very, we just gave the proof that it's we are able with a lot of amount of
French tokens to give a very, very conversational model. Conversational means that Claire is able to
understand dialogue between people with their realization. And the second part of Claire, the second
features, it's that Claire is able to talk like, like you, to make a dialogue, human like dialogue with
defluence, hesitation, because we train Claire with conversational data. So we continue to collect a lot of
data. And today, so we are around 140 billion of token in French. So and we I'm very glad and happy to
announce that we started to the training phase to train our new model called Lucy. So Lucy, the main goal of
Lucy is to fix or to yes, to improve the under representation of the French language in generally in
LLMs. But at the same time, we put in our data set some over European language, the German, Spanish, Italian,
some code to some some some source code to make our model to have a capacity of reasoning. And we try to
build some new features to make this model efficient, not only in French, but for over language. So probably you
will be interesting to follow this work and probably our custom tokenizer and so on. But the most important
things I would like to share with you this morning is that we are not the only one community involved in this
goal to build this sovereign LLM in Europe. So I'm sure that this list is not exhaustive. If anyone knows
new or other initiative, please call me just after the presentation. I will be very excited to discuss with you. But the
most important is that we are strongly believe that we have all the capacity, all the technology, all the GPUs in
Europe to build our models. And it's why I'm very delighted to announce you that today, during the first day, we
changed OpenLLM France to become OpenLLM Europe. So you can use this QR code to inboard yourself in this in our
Discord server. So we all the content we produce during the six months in French is still available, available. But we have
created the channel for each European language. So please welcome. And if someone want to be part of the community
management team, please contact us and we will be very pleased to inboard you in our initiative. So that's my tool for today.