in the morning on Sunday. It's nice to see you all here, looking very bright and early. So we shall get straight into it. Let me welcome the first presenters of the day, Maria and Christian from CrateDB, who are going to be talking about privacy and generative AI. Thank you.

Good morning from our side. It's a pleasure to open the Dev Room today, and thanks for being here this early on a Sunday morning. We're going to talk about a very interesting topic: generative AI, how to use your own data, and how we can build such applications based on open source software. I think everyone is used to OpenAI and ChatGPT, but you never know what happens with your data in those cases.

So, a very brief overview. This is generative AI. I think everyone in the room has played around with it already, so just a quick summary of the basics. You have your source data of any kind: it can be text, code, images, audio, videos. Everything is transformed via encoders, and billions of parameters and a lot of input text are used to train the so-called foundation models. We as users formulate prompts against them, we ask the models questions, they do their job and generate output. A language model does nothing else than predict the most likely next token it should generate. That's all the magic behind it.

We see a very big potential. When I first tried ChatGPT more than a year ago, it was amazing. It started to write code for me, it started to generate articles. I even went to some tools out there, gave them 30 seconds of my video, and all of a sudden I can be a virtual speaker. Very impressive, super fast, but there's also a "but" attached to it. Obviously, there are quality issues. All of you have heard of hallucinations. Last week we had the example of "what color is the water": is it blue, or is it really transparent? Depending on your training data, if you use children's books, the water is obviously blue. If you use real-world training data, water should be transparent. Same with snowflakes: they are not white, they are transparent, technically. There are also a lot of ethical questions and governance questions. Government officials have talked to deepfakes without realizing it, which is a big threat we have to be aware of in the future, and there is also some environmental impact.

The key thing we want to talk about today is quality and reliability, and the importance of current, accurate, and also private data that is not available publicly. Because all of these foundation models have been trained on public data: what's on GitHub, what's on the internet, what's in the documentation. Yesterday I watched a presentation with a clear message to everyone writing docs: we are responsible for what these models tell us. If you write bad documentation, we get bad results from ChatGPT or other models, because they have been trained on not-so-good training data. Here, for example, Maria found a promo code on OpenAI's website: if you register there and enter the code, you get 20% off. But unfortunately it was not working. So she asked ChatGPT, "hey, how can I apply the promo code?", and the answer was "I'm sorry, I don't know about this promotion." That's something you don't want to happen with a company chatbot; you want to avoid this. So it's a perfect example of why we need current and accurate data, up to the minute, maybe even up to the second. And obviously also non-public data, private data: internal documents, confidential documents, documentation that is not public.
And also, one of the customers we work with takes legal documents and technical documentation, vectorizes them, feeds them to a language model, and then has an application ready for their maintenance workers. But this is information that also must not leak. And this brings us into a bit of a dilemma, because there are multiple options to bring this private data into the foundation models, or to enhance these foundation models.

The first option, and again I think everyone in the room has heard about it, is fine-tuning, where you provide input data and really change the parameters, the weights, in the foundation model, so that the knowledge gets incorporated into your fine-tuned LLM. Very good: you put the domain knowledge in there. But there are also challenges, right? You don't solve the freshness issue of the data; it's still static knowledge. There's research out there showing that a single wrong training record can hurt the overall performance: one person says the water is blue, and all of a sudden the chatbot's response is that all water is light blue, or something like this. And it doesn't solve the problem of hallucinations; you might still get a lot of them. Not to mention the resources that you need.

So, second option: retrieval augmented generation, which has developed into kind of a standard when you want to work with your own data. The first step is that you need to make the existing data, whether it's videos, data from internal databases, or documents, available to create the embeddings, to calculate the vectors, which is how this knowledge is represented internally. Then, as soon as a user asks a question in the knowledge assistant or the chatbot, a so-called retriever is asked, "hey, please give me the relevant context." This can be a similarity search in the vector database, or it can be a combination of various searches: a full-text search, a geospatial search, a regular SQL query to get information out of your databases. The context returned by the retriever is put into a special prompt, as additional information, and together with the question and this additional context, the large language model can now generate your answer. And you can put into the prompt, as we will see in the demo as well, "please use only this contextual data; if you don't know the answer, please say you don't know." That limits hallucinations a lot; it doesn't prevent them 100%.

Good, I think I talked about disadvantages and challenges already. One advantage I forgot to mention is access control. Now that you get this context from either the vector store or a different database, maybe CrateDB, you can put fine-grained privileges there. In the example application I mentioned before, some of the maintenance workers are not allowed to use the legal documents, for example, so they don't use the index with the embeddings of the legal documents, but they are obviously allowed to use the technical documentation. And someone from the legal department asking "what is the support contract with XYZ, are we now liable?", et cetera, obviously needs different search indexes.

How is this done? How are the semantics represented? The key is vectors, or embeddings. A vector is nothing else than a series, or an array, of decimal values, and there are a lot of different embedding models out there already. Every model has its strengths and weaknesses.
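As a small illustration of what such an embedding looks like in code, here is a minimal sketch using the OpenAI embedding model through LangChain; the model choice and the example sentence are just illustrative, and any other embedding model, for instance an open source Hugging Face model, returns the same kind of array.

    from langchain.embeddings import OpenAIEmbeddings

    # Turn a sentence into its vector representation.
    vector = OpenAIEmbeddings().embed_query("Water is transparent, not blue.")
    print(len(vector))   # dimensionality depends on the embedding model
    print(vector[:5])    # just an array of decimal values, e.g. [0.012, -0.045, ...]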
Some models are better if you use, for example, German text, Chinese text, or text in Indian languages, right? They come up with the semantics in very different ways, and the attention mechanisms work differently internally, because the sentences are built in very different ways, so you see different performance there. Or there are highly specialized models: you do image recognition, "oh, it's a sleeping cat," and this can then be vectorized as well, and you can search for this context in your vector store.

And now, if we take this one step further, how could an architecture look for such a knowledge assistant or chatbot? A prototype is always easy to build, but you need to think about a lot of additional topics. First of all, it starts with the data, right? The data that you want to vectorize and make available for your search. So we've shown here a landing zone fed from different sources; it can be the original sources, or you might copy the data, depending on the architecture you want to build. And the important thing is the processing layer: how do you chunk your data, how do you create the vectors? And obviously, you need to store these chunks of information together with the vectors and provide proper data access control.

The second part here is the LLM part; I've talked about it multiple times now. You need access to the embedding models, you need access to the large language models, and then there also needs to be some logging: which queries do you run, how much cost do they incur, is the performance okay? A lot of logging happens here as well. And we intentionally put an LLM gateway in front of it, because the models need to be exchangeable. The chatbot itself has a lot of functionality; I don't want to go into all the details, and obviously you also need monitoring and reporting. And the beauty of it: you can build all of that with open source tools nowadays, and the embedding and language models can be open source too; there are a lot of alternatives out there.

Now, why CrateDB and LangChain? You need robust data management. As we have seen, there are a lot of different data sources and data stores involved here, whether it's logging, whether it's semantics, or your agents communicating in JSON. You need to store all of this information, ideally in one store, not five or six different databases that you need to operate and whose languages you need to learn, et cetera. And besides LangChain, other options are out there too; think of Haystack and others that you could use. All of these frameworks give you a very good set of building blocks that you can just use. They're available in Python and JavaScript, there are also Java ports out there, and ports to other languages are becoming available. Everything you need is already in these libraries to come up with your overall architecture.

And that's now the point to hand it over to Maria. She will guide you through a demo where we try to simulate how you can use support tickets, internal data. Here we took some Twitter posts about Microsoft support, we will vectorize them, and we'll show how a support agent or a customer can then interact with this chatbot and ask certain questions. It will demonstrate that it's not such a big effort; you can get started right away. The link to the demo is here on the slide, and you'll also find it in the app or on the website for the talk. Thank you.

Do you hear me? Okay. Awesome. Thank you. So, you have heard a lot about the theoretical aspects of RAG and how it works.
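Before the demo code, here is a minimal sketch of the chunking step from the processing layer described above, assuming a recent LangChain version; the file name and the chunk sizes are arbitrary example values.

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Split a source document into overlapping chunks; each chunk is later
    # embedded and stored alongside its text in the vector store.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(open("internal_manual.txt").read())  # hypothetical file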
I have a little bit more than ten minutes to show you a practical example, but believe me, we could run an hours-long workshop on this topic. Essentially, the idea today is to show you how to augment an existing LLM with private data and how to use that data as context for specific questions this LLM has not seen so far. We use data that captures customer interactions on Twitter; these interactions involve different questions from users about Microsoft, Amazon, all these different products, and how the support teams from these big companies answer them. This is not something you usually find on the internet very easily. If you have a problem with some Microsoft product, yes, very often you can find the solution out there, but very specific questions get asked directly to customer support precisely because people didn't find the answer out of the box. And we will use CrateDB as a vector store to support this example; I think Christian already gave you a good overview of what CrateDB is.

What is LangChain? LangChain is an open source Python project that is used to facilitate the development of LLM applications. It's a pretty cool project that integrates a lot of large language models and a lot of models for calculating embeddings, and it helps you connect a data source with a language model without having to figure out from scratch what the full engineering pipeline should look like. You can do this in a couple of lines of code.

May I add one point here that I forgot to mention: LangChain is a very good starting point, but what we have also seen for very advanced purposes is that you want to interact directly with your data, with your source data, with your vector store, and all of that is available in standard SQL, no matter which data model you're using.

CrateDB is open source, and one of the easiest ways to run it is to use the Docker image. Vector support in CrateDB has been available since version 5.5, but if you always pull the latest image, you don't have to think about this. Once you run this docker run command, we have an instance of a CrateDB cluster and can access the admin UI on localhost. Currently, I think because of the resolution of this screen, not everything is visible, but in this admin UI you have a couple of tabs that you can use to monitor your cluster, run queries in the console, and get an overview of the tables and views that are available in your database.

So let's go back to the example, because time is flying very fast. As a first step, we need a couple of import statements to make sure that LangChain and all the libraries we use in this example are available. What is also important is that you import the CrateDB vector search interface that is available as part of the LangChain integration. And as a next step, because we need to interact with the CrateDB instance, we need to specify how we connect, and this is done with a connection string.
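As a rough sketch of that setup, assuming a plain local installation: the docker command appears as a comment, the connection string follows CrateDB's SQLAlchemy scheme, and the exact import path of the vector store class depends on the LangChain/CrateDB integration version, so treat it as an assumption.

    # Start a local CrateDB instance first, for example:
    #   docker run --rm -p 4200:4200 -p 5432:5432 crate
    # The admin UI is then reachable at http://localhost:4200.

    # Import path is an assumption; it depends on the LangChain/CrateDB
    # integration version you have installed.
    from langchain.vectorstores import CrateDBVectorSearch

    # SQLAlchemy-style connection string for the local, unauthenticated instance,
    # plus the collection name used throughout the notebook.
    CONNECTION_STRING = "crate://localhost:4200"
    COLLECTION_NAME = "customer_data"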
We are using the open source version running on localhost, but you also have the option, for example, to deploy a CrateDB Cloud cluster, and at this point we also give all users the option to deploy one cluster that is free forever, so you can just run it and use it for testing purposes. Finally, we need to specify the collection name that we are going to use in this notebook session. If we run this piece of code, the connection string is now available, and then we can start interacting with CrateDB.

For the purpose of this notebook, I rely on OpenAI models. Of course, LangChain supports many different models and you can integrate many of them, but if you choose to use OpenAI, make sure that you have your OpenAI key set as an environment variable.

Now let's take a look at how the dataset looks. This dataset is also available in our CrateDB datasets repository, which is also open source, and it contains the customer interactions about Microsoft products. Essentially, we narrow the scope of this notebook a bit, for illustration and time reasons. The dataset has information like who is the author of this tweet, whether it's an inbound or outbound message, when it was created, what the content of the question or the answer was, and whether this text was created in response to something else. The idea now is to feed all of this information to the large language model and to ask questions that could, for example, be answered from this dataset.

As a first step, if you remember the big RAG diagram, we create embeddings. Embeddings are the representation of your data that is suitable for machine learning and AI purposes. First we need to load the data from this dataset, and for this we use the CSVLoader interface that is available in LangChain; in these few lines of code we are already creating embeddings for all the entries in our dataset.

If I go back to the admin UI, I can see two tables. The first table gives me the collection of entries; the first collection we created is called customer data. But what is interesting now is to see the embeddings created for all the entries in this collection. For example, this is an instance of a document that we are using for context purposes, and you can see how the embeddings look. If you use OpenAI embeddings, the length of your vector is going to be 1,536 with the default embedding model, but you can also choose some other embedding algorithm, for example Hugging Face, as suggested here, which is open source and can easily be used out of the box in just two lines of code.

Now, once we have these embeddings, let's define our question. Our question today is: okay, I have an order in the Microsoft Store, but I want to update the shipping address; how do I do this? I also put alternative questions here, so when you play with this notebook you can put in your own questions and see whether this dataset has enough information to answer them.
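Condensed into code, the loading and embedding steps plus the question definition look roughly like this; the CSV file name is illustrative, and the vector store import path and parameter names are again assumptions based on the description above.

    import os
    from langchain.document_loaders import CSVLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import CrateDBVectorSearch  # import path: assumption

    assert "OPENAI_API_KEY" in os.environ  # needed by the OpenAI embedding model

    # File name is illustrative; the notebook loads the Microsoft customer
    # support tweets from the CrateDB datasets repository.
    documents = CSVLoader(file_path="twitter_support_microsoft.csv").load()

    # Calculate embeddings for every entry and store them, together with the
    # text, in the CrateDB collection defined earlier.
    store = CrateDBVectorSearch.from_documents(
        documents=documents,
        embedding=OpenAIEmbeddings(),
        collection_name=COLLECTION_NAME,
        connection_string=CONNECTION_STRING,
    )

    # The question we want the assistant to answer from this private data.
    question = "I have an order in the Microsoft Store. How do I update the shipping address?"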
Once the question is defined, we want to find the context that is relevant to it, and this is done by a similarity search comparing the vector representation of our question against the vectors we stored in the CrateDB instance; it takes just one line of code. As Christian mentioned, vector search is one way to find the relevant context; of course, CrateDB supports other types of searches, like full-text search, geospatial search, or just keyword search, so you can combine different types of searches to find the relevant context for your question.

Once we have done this, we are ready to ask our LLM to answer the question. How do we do this? First we need to create a prompt that explains to the LLM what its purpose is. Its purpose today is to be an expert on Microsoft products and services, and it should use the context that we are going to give it to answer the relevant questions; but if the answer is not found in the context, it should reply with "I don't know." This is a very simple way to create a prompt that gives the LLM instructions on how it should answer specific questions. Finally, we create a small chatbot by using one of the available models that are integrated with LangChain and passing this context together with the user question.

Once this is completed, we can access the answer, and in this case it says: to update the shipping address, you will need to cancel your current order and place a new one. Maybe that's still up to date, maybe it's not relevant anymore, but it's something we learned only from the dataset we provided. So this is how you actually use your private data to teach the LLM what the context for incoming questions should be.

I hope you liked this demo. You can play with this notebook; it's in our cratedb-examples repository, and you can also see there are other similar notebooks for different types of examples: different prompt engineering examples, how to create other kinds of chatbots, how to use other embedding algorithms. So please let us know what you think, give us feedback, open an issue on this repository, and we are looking forward to working with you on these topics.
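For reference, the retrieval and answering steps from the demo, condensed into a minimal sketch; it reuses the `store` and `question` variables from the sketch above, and the prompt wording and model choice are illustrative assumptions rather than the exact notebook code.

    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage, SystemMessage

    # Similarity search against the stored embeddings returns the relevant context.
    docs = store.similarity_search(question)
    context = "\n".join(doc.page_content for doc in docs)

    # Prompt: be an expert on Microsoft products, answer only from the context,
    # otherwise say "I don't know".
    system_prompt = (
        "You are an expert on Microsoft products and services. "
        "Answer the question using only the context below. "
        "If the answer is not contained in the context, reply with 'I don't know'.\n\n"
        f"Context:\n{context}"
    )

    llm = ChatOpenAI()  # default chat model; the model choice is an example
    answer = llm([SystemMessage(content=system_prompt), HumanMessage(content=question)])
    print(answer.content)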
So I think that is all from us. Thank you for being part of this session; maybe we have time for one question.

Okay, awesome. Do we have questions from anyone?

Thank you for the talk. I have a question about the embedding model: if you encode the prompt with the language model but use an external embedding model, they can end up in different spaces, and then you do the similarity search. Have you tested this, and do you see an effect from using different embeddings?

It's a very important question. The way you create these embeddings is super important, and you're usually limited to one embedding algorithm, because the vectors need to have the same length and, simplifying a bit, they need to capture the same semantics. This is also what I meant with the customers we work with: they were able to create different indexes, and then the retriever gets more and more complex. As you've seen on the architecture slide, this is a simplified example; maybe you need to query different indexes created by different embedding algorithms, so that you can search your images and you can search your textual data. Obviously you might use different models there, and then re-rank the results to come up with the really relevant context, maybe from different indexes. And maybe you also want to combine it with a full-text search, or limit it to customer support tickets from Europe, to come up with a good example, or to customer support tickets from the US with some geospatial restriction. But it is then the re-ranking of the results that really identifies the particular context that is relevant for the question.

Okay, thanks a lot. Any more questions? No? So thank you very much for the very nice talk. Thank you.