Hi, y'all. I have the privilege of introducing you to Stefano. He is from Italy, from the middle of the Italian coast. You've been a Linux enthusiast for 20 years, got me on that one. And your focus is on VoIP, interestingly enough. This is his 10th FOSDEM, and your favorite animal is you after four beers. Very appropriate. Everyone, welcome Stefano.

Thank you. One of my hobbies was caving. I spent 10 years going into caves, descending pitches with ropes, crawling into mud, and doing those awful things. The reason for doing that is that the very few times I had the chance to be the first one in an unknown place, it was awesome. When you are in an unknown place, you face some dangers, but you also have infinite possibilities. Beyond the light of your headlamp, there could be anything: a river, a beach, kilometers of unexplored passages, who knows. And I feel the same about AI today. I'd really love to increase the power of your headlamp today, so I'm going to kick-start you into LangChain.

This is the GitHub page for the talk, where you can find the proof-of-concept code and the presentation itself. It's better if you look at the code during the presentation. We'll explore LangChain through one of its notable use cases, retrieval-augmented generation. For that, we will look at some of its components and concepts: document loaders, text splitters, embeddings, vector stores, retrievers, prompts and templates for generating prompts, large language models, of course, and finally we'll combine some of those together in a chain. Then I'll experience the adrenaline of a live demo, and maybe we will take a look at some other notable use cases.

Let's talk about our main quest first: retrieval-augmented generation. This cutting-edge technique involves giving additional data to the LLM to enhance its responses. It's interesting because when you give additional data to the LLM, the answers become more precise and relevant; it also allows citing sources and answering questions about data that are not in the training set, which could even be personal data or real-time data. It's a much-discussed topic, and it's an intriguing case for showcasing LangChain.

This is the scheme of what we want to obtain. Multiple use cases exist for retrieval-augmented generation. We will look at the simplest one, question answering over unstructured data. We will take some text, our unstructured data, and put it into storage. Then we will ask a question and use the data from the storage to help the LLM answer the question.

Let's look at it in more detail. We will take the transcript of a YouTube video and load it into a usable format. Then we will split it into smaller parts and compute a vector representation, also known as embeddings, of this data, and store it in a database. Then we will ask a question, compute the vector representation of the question, and use this vector representation to find similar documents. Finally, we will put the question and the retrieved documents into the prompt and give it to the large language model.

If you're thinking that this is complex, I assure you that it's not, and it fits in a few lines of code. If you're thinking that it's trivial or worthless, I assure you that's not the case either, because there are a lot of concepts behind it.

Why use LangChain? LangChain is a framework for developing LLM-powered applications.
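To make "a few lines of code" concrete, here is a minimal preview of the whole pipeline in LangChain. This is a sketch, not the repository's exact proof of concept: the package layout (langchain, langchain-openai, langchain-community, chromadb, youtube-transcript-api), the placeholder URL, and the prompt wording are assumptions, and each step is explained in the rest of the talk.

```python
# Preview: RAG over a YouTube transcript in a few lines (assumed package
# layout; the URL and prompt text are placeholders, not from the talk).
from langchain_community.document_loaders import YoutubeLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate

# 1. Load the transcript into Documents.
docs = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=XXXX").load()

# 2. Split it into smaller, meaningful chunks.
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 3. Embed the chunks and store them in a vector store.
vectorstore = Chroma.from_documents(splits, OpenAIEmbeddings())

# 4. Retrieve the chunks most similar to the question.
question = "What is the video about?"
retrieved = vectorstore.as_retriever().invoke(question)
context = "\n\n".join(d.page_content for d in retrieved)

# 5. Put question and context into a prompt and hand it to the LLM.
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(temperature=0)
print(llm.invoke(prompt.format_messages(context=context, question=question)).content)
```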
It offers us a lot of ready-to-use, off-the-shelf components and building blocks that make our life easier. Should we take our code to production, it also has components that make that easier for us, and it has a lot of examples to copy. It's fun because it improves at an extreme speed, and something interesting comes out of its community continuously. On the other hand, it's very young, and breaking changes may happen, but we like risk.

We are using Python. LangChain is also available in TypeScript, but that's not my cup of tea. Our main requirements are LangChain, of course, OpenAI, which we will use as the embeddings and LLM provider, and ChromaDB as the vector store. Since we're using OpenAI, we will provide an API key.

Okay. In this part, we prepare and store our data. We will use four components: a document loader to get our data and convert it into a usable format, a text splitter to divide the document into smaller meaningful units, an embedding function to compute the vector representations, and a vector store to store our vectors.

The document loader is an object that takes data from various sources and transforms it into a usable format: a Document. Many sources are available. For instance, we can have files like PDFs or text files, web pages, cloud storage such as Amazon S3 or Google Drive, social media like Reddit, Twitter, or GitHub, papers, and of course YouTube transcripts. It's also very easy to write your own if you don't find something that fits your needs: you can just extend the base loader class.

This is our document loader; we are using the YouTube loader from the LangChain community packages. It will take the transcript of our video and put it into the Document class. This is the Document class: it has a page_content string that will hold the transcript of our video, and a metadata dictionary that will have a source key with the URL of our video.

Now that we have our document, we want to split it into smaller meaningful units. Why do we want to split it? Well, for three reasons. The first is that the input size of our LLM is limited, so we want to give it smaller pieces. The second is that, like me, our LLM tends to be easily distracted, so we want to increase the signal-to-noise ratio as much as possible and avoid distracting it with useless information; we will choose only the pieces that matter for answering the question. The third reason is that we usually pay per token, so the more we give, the more we pay.

We can think of five levels of text splitting, from simple to complex. The simplest one is splitting by just counting characters or tokens. This is simple and easy, but it has a problem: we will probably end up splitting in the middle of a word or a phrase. The second level addresses this problem, and that is recursive splitting. It recursively tries to split the text on special characters like newlines or punctuation, then combines those pieces together until the specified maximum length is reached. The third level looks at the document structure, which works for HTML files, Markdown, or code. Then there are semantic chunkers, which are still experimental in LangChain; they are very interesting because they combine phrases together only if they are similar, using embeddings to compute similarity. The last one is highly experimental, and it's asking an LLM to split our text.
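Before weighing that last option, here is what the loader and the second, recursive level look like in code: a minimal sketch, assuming the langchain-community package (the loader also needs youtube-transcript-api) and a placeholder video URL rather than the one from the demo.

```python
from langchain_community.document_loaders import YoutubeLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the transcript; the URL is a placeholder, not the one from the demo.
loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=XXXX")
docs = loader.load()

# Each Document has a page_content string and a metadata dict with a "source" key.
print(docs[0].metadata["source"])
print(docs[0].page_content[:200])

# Level-two splitting: recursively split on separators (paragraphs, newlines,
# spaces) and merge the pieces back together up to the configured chunk size.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = splitter.split_documents(docs)
print(len(splits), "chunks")
```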
That last, LLM-based level is highly experimental and also very expensive. It probably makes sense only if you think the cost per token is going to zero. We are using the recursive character text splitter, the second level, which is a good default choice. We can specify the chunk length and whether we want some overlap. There's no golden rule about that, so you may want to try what works best for you.

Okay, now we have our documents, and we want to compute the embeddings. Embeddings are a vector representation in a high-dimensional space. That means we take our data and represent it as a vector, where each dimension reflects an aspect of the context or meaning of our data, and there are thousands of those dimensions. If two pieces of text are similar, they are next to each other in the embedding space. That means we can compute the similarity of two pieces of text just by measuring the distance between their vectors. It sounds complex, but for us it's very easy, because it's just a function that we pass when we create the vector store. We are using an external provider here, OpenAI. About privacy: obviously, if you use an external provider to compute embeddings, you are sending your data to that provider.

We now have our data split and its vector representations, and we want to store them in a vector store. A vector store is a database tailored for storing and searching embeddings. We are using ChromaDB here; it is open source and very easy to set up. This is the initialization, and as we said before, we pass the OpenAI embedding function to it when we initialize it. These are the most used vector stores according to the LangChain State of AI 2023 report: ChromaDB is in first place, FAISS, which is also open source, comes from Meta, and Pinecone is a very popular cloud vector store.

Okay, we now have our data in the vector store, and we want to use it. We will use four main components here: a retriever to search for documents similar to our question, a prompt that gives the LLM the instructions and the output we want, the LLM itself, which is the heart, lungs, and brain of our application, and finally we will combine those together in a chain.

The retriever is an object responsible for searching for documents that are relevant to answering our question. The simplest retriever does this by just computing the vector representation of our question and searching for documents that are near that vector in the embedding space. This is the simple retriever. LangChain also offers more advanced retrievers, like this one, the multi-query retriever. It uses the LLM component to formulate variations of our question and then uses the embeddings of those variations to search for similar, and hopefully relevant, documents to answer our question.

Now that we have similar documents, we can put them into the prompt and give the prompt to the LLM. This is the prompt we are using, and it's just a template with the instructions for our LLM and, in this case, two variables: the context, which will be our documents, and the question itself. I won't delve into the details, because it's just a template, and we can also take this prompt from the LangChain Hub: LangChain features a hub with prompts and other off-the-shelf objects that we can use.

We have the prompt, and we want to give it to the LLM. We are using OpenAI's LLM, and this is how we initialize it.
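A minimal sketch of that initialization, assuming the langchain-openai package and its chat model class (the exact code is in the talk's repository):

```python
from langchain_openai import ChatOpenAI

# streaming=True sends tokens back as they are generated; temperature=0 asks
# for deterministic, non-creative answers.
llm = ChatOpenAI(model="gpt-3.5-turbo", streaming=True, temperature=0)
```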
I use streaming, the first parameter, because it really improves the user experience, and temperature zero means that we don't want creativity or hallucination, we just want precise answers. Maybe you can argue that I should have used a different LLM provider, but nobody gets fired for buying OpenAI, so I chose that.

These are the most used LLM providers, again from the LangChain State of AI report. OpenAI is in first place. I'd like to rant a bit about that, because Claude, the third on that list, is available almost everywhere in the world except Europe. This week the Italian data protection authority is going after OpenAI over privacy issues again. I know there are a lot of privacy advocates here, and I also care about user privacy, but I think that defending users' rights shouldn't mean going to war against them. That's my two cents. Those are the most used open source providers. It's interesting because the first three have very different business models: the first one rents hardware, the second one charges per token, and the third one is for self-hosting.

We have now gathered all the components, and we want to put them together. This is all the components called one after another: we have our question, we pass it to the retriever, and we get a list of documents. The list of documents is joined together into the context variable, the context variable is used in the template to generate the prompt, and the prompt is given to the LLM. It works, nice and easy, but we can do better, and that is putting everything together using a chain.

A chain is a sequence of components that performs a function, and it's better than just calling the components one after another because it has several advantages: it offers sync and async support, which allows us to take our code directly to production without changing it, it helps with observability, and it integrates very well with the other LangChain components used to take code to production. This is the code put together using the LangChain Expression Language, LCEL, which is a new way of writing these chains. It's an acquired taste and it's quite new, from September, but I find it very useful once you get used to it.

Okay, let's see how this works. This is our code, and there are two examples: one uses the chain, one doesn't. This is the one that doesn't use it, and it's just a few lines of code, very easy. Okay, I forgot the OpenAI key. Of course it doesn't work. I'm not connected, you're right. Okay, I have a backup video. No, no. By the way, it's just to give you an idea of the pace of calling the various components; the part that takes the most time is computing the embeddings, and this is the streaming output. Okay, I have prepared some questions, those questions, and those go by too fast, sorry. I gave the question to the LLM, and this is the output of the LLM. Also, okay, this one is nice, because the retriever wasn't able to find the answer to this question, so it couldn't give us a response, and the LLM told us "I don't know". I'm not sure if I can move forward. Maybe I also have a recording for the LCEL version. The LCEL version uses the multi-query retriever, so you will see that each question is transformed into multiple questions. This is slow, I'm sorry. Okay, those are the questions, and this is the answer that came out. Okay.

There are also other interesting use cases of LangChain.
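Before looking at those, here is roughly what an LCEL chain for this pipeline looks like: a sketch rather than the repository's exact code, reusing the vectorstore and llm objects from the earlier sketches and assuming recent langchain-core import paths.

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# The multi-query retriever asks the LLM for variations of the question and
# retrieves documents for each of them.
retriever = MultiQueryRetriever.from_llm(retriever=vectorstore.as_retriever(), llm=llm)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n"
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

# Pipe syntax: question -> {context, question} -> prompt -> LLM -> plain string.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

for chunk in chain.stream("What is the video about?"):
    print(chunk, end="", flush=True)
```

The dictionary at the start fans the question out: one branch goes through the retriever and is formatted into the context, the other passes the question through unchanged, and both then feed the prompt template.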
We looked at the simplest one, question answering over unstructured data. Question answering over structured data is also very interesting: it uses the LLM component to convert our question into a SQL query, the query is executed, and the result is used to improve the answer of our LLM. Another one is data extraction: you just provide a JSON schema and some unstructured text, and the JSON schema is automatically filled in with data from the unstructured text; the LLM understands what to put into the schema. It's interesting because there are people paid to do that work. Summarization is very useful and has, let's say, a lot of problems; it's an open problem, and it's very interesting and useful. Then there is synthetic data generation, which is useful if you want to fine-tune a model or maybe anonymize some data. It works like data extraction backwards: you have a JSON schema, and the LLM generates unstructured text containing data that fits the schema. Finally, there are agents, which are a key concept of LangChain and are very fun: with agents, the LLM takes charge of choosing what action to do. It's worth studying; it's very interesting.

Okay, that's it. So, thank you. Do you have any questions? I saw his hand first.

Thank you. Very interesting. My question is: how does this scale? You showed an example in which we have just one transcript. What if we had billions of transcripts? I didn't see any mention of the ranking of the retrieved chunks. If you could elaborate a little bit on that, it would be very good. Thanks.

Okay, LangChain helps to take this to production. This was a proof of concept, so you can take it to production, but that's out of the scope of this talk: this was LangChain from zero to one, and that scaling is going from one to a hundred. You can find a lot of examples of how to take this to production. If you take a look at the GitHub repository, there is also a link to how the LangChain people run this in production, with the chatbot that helps with searching the LangChain documentation. You can find the code, and it's very interesting; if you want to take this to production, it's worth copying that code, it's the best practice. Did I answer your question?

I'm sure you saw this one coming. If I have some money to spend on hardware and I want to run an LLM: there is a lot of proprietary intelligence that you used, like the embeddings in particular, and also the other part on the query side at the end of the chain. How difficult is it to do this without using OpenAI?

It's really easy, because LangChain allows you to swap those components. I used OpenAI here because it's the easy way to get a result. But if you, for instance, use Ollama, you can self-host the LLM and ask your questions to that, or maybe with Hugging Face you can rent hardware and run your open source model on it. So it's easy, because those components are swappable.

All right, y'all. Let's give Stefano one more round of applause.