in the morning on Sunday. It's nice to see you all here, looking very bright and early. So we shall get straight into it. Let me welcome the first presenters of the day, Maria and Christian from CrateDB, who are going to be talking about privacy and generative AI. Thank you.

Good morning from our side. It's a pleasure to open the Dev Room today, and thanks for being here this early on a Sunday morning. We're going to talk about a very interesting topic: generative AI, how to use your own data, and how we can build such applications based on open source software. I think everyone is used to OpenAI and ChatGPT, but you never know what happens with your data in those cases.

So, a very brief overview. This is generative AI. I think everyone in the room has played around with it already, so just a quick summary of the basics. You have your source data of any kind: it can be text, code, images, audio, videos. Everything is transformed via encoders, and billions of parameters and a lot of input text are used to train the so-called foundation models. We as users formulate prompts against them, we ask the models questions, they do their job and generate output. A language model does nothing else than predict the most likely next token it should generate. That's all the magic behind it.

We see a very big potential. When I first tried ChatGPT more than a year ago, it was amazing. It started to write code for me, it started to generate articles. I even went to some tools out there, gave them 30 seconds of my video, and all of a sudden I can be a virtual speaker. Very impressive, super fast, but there's also a "but" attached to it. Obviously, there are quality issues. All of you have heard of hallucinations. Last week we had the example of "what color is the water": is it blue, or is it really transparent? Depending on your training data, if you use children's books, the water is obviously blue. If you use real-world training data, water should be transparent. Same with snowflakes: they are not white, they are transparent, technically. There are also a lot of ethical questions and governance questions. Government officials have talked to deepfakes without realizing it, which is a big threat we have to be aware of in the future, and there is also some environmental impact.

The key thing we want to talk about today is quality and reliability, and the importance of current, accurate, and also private data that is not available publicly. Because all of these foundation models have been trained on public data: what's on GitHub, what's on the internet, what's in the documentation. Yesterday I watched a presentation with a clear message to everyone writing docs: we are responsible for what these models tell us. If you write bad documentation, we get bad results from ChatGPT or other models, because they have been trained on not-so-good training data. Here, for example, Maria found a promo code on OpenAI's website: if you register there and enter the code, you get 20% off. But unfortunately it was not working. So she asked ChatGPT, "hey, how can I apply the promo code?", and the answer was "I'm sorry, I don't know about this promotion." That's something you don't want to happen with a company chatbot; you want to avoid this. So it's a perfect example of why we need current and accurate data, up to the minute, maybe even up to the second. And obviously also non-public data, private data: internal documents, confidential documents, documentation that is not public.
And also, one of the customers we work with takes legal documents and technical documentation, vectorizes them, feeds them to a language model, and then has an application ready for their maintenance workers. But this is information that also must not leak. And this brings us into a bit of a dilemma, because there are multiple options to bring this private data into the foundation models, or to enhance these foundation models.

The first option, and again I think everyone in the room has heard about it, is fine-tuning, where you provide input data and really change the parameters, the weights, in the foundation model, so that the knowledge gets incorporated into your fine-tuned LLM. Very good: you put the domain knowledge in there. But there are also challenges, right? You don't solve the freshness issue of the data; it's still static knowledge. There's research out there showing that a single wrong training record can hurt the overall performance: one person says the water is blue, and all of a sudden the chatbot's response is that all water is light blue, or something like this. And it doesn't solve the problem of hallucinations; you might still get a lot of them. Not to mention the resources that you need.

So, second option: retrieval augmented generation, which has developed into kind of a standard when you want to work with your own data. The first step is that you need to make the existing data, whether it's videos, data from internal databases, or documents, available to create the embeddings, to calculate the vectors, which is how this knowledge is represented internally. Then, as soon as a user asks a question in the knowledge assistant or the chatbot, a so-called retriever is asked, "hey, please give me the relevant context." This can be a similarity search in the vector database, or it can be a combination of various searches: a full-text search, a geospatial search, a regular SQL query to get information out of your databases. The context returned by the retriever is put into a special prompt, as additional information, and together with the question and this additional context, the large language model can now generate your answer. And you can put into the prompt, as we will see in the demo as well, "please use only this contextual data; if you don't know the answer, please say you don't know." That limits hallucinations a lot; it doesn't prevent them 100%.

Good, I think I talked about disadvantages and challenges already. One advantage I forgot to mention is access control. Now that you get this context from either the vector store or a different database, maybe CrateDB, you can put fine-grained privileges there. In the example application I mentioned before, some of the maintenance workers are not allowed to use the legal documents, for example, so they don't use the index with the embeddings of the legal documents, but they are obviously allowed to use the technical documentation. And someone from the legal department asking "what is the support contract with XYZ, are we now liable?", et cetera, obviously needs different search indexes.

How is this done? How are the semantics represented? The key is vectors, or embeddings. A vector is nothing else than a series, or an array, of decimal values, and there are a lot of different embedding models out there already. Every model has its strengths and weaknesses.
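As a small illustration of what such an embedding looks like in code, here is a minimal sketch using the OpenAI embedding model through LangChain; the model choice and the example sentence are just illustrative, and any other embedding model, for instance an open source Hugging Face model, returns the same kind of array.

    from langchain.embeddings import OpenAIEmbeddings

    # Turn a sentence into its vector representation.
    vector = OpenAIEmbeddings().embed_query("Water is transparent, not blue.")
    print(len(vector))   # dimensionality depends on the embedding model
    print(vector[:5])    # just an array of decimal values, e.g. [0.012, -0.045, ...]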
Some models are better if you use, for example, German text, Chinese text, or text in Indian languages, right? They come up with the semantics in very different ways, and the attention mechanisms work differently internally, because the sentences are built in very different ways, so you see different performance there. Or there are highly specialized models: you do image recognition, "oh, it's a sleeping cat," and this can then be vectorized as well, and you can search for this context in your vector store.

And now, if we take this one step further, how could an architecture look for such a knowledge assistant or chatbot? A prototype is always easy to build, but you need to think about a lot of additional topics. First of all, it starts with the data, right? The data that you want to vectorize and make available for your search. So we've shown here a landing zone fed from different sources; it can be the original sources, or you might copy the data, depending on the architecture you want to build. And the important thing is the processing layer: how do you chunk your data, how do you create the vectors? And obviously, you need to store these chunks of information together with the vectors and provide proper data access control.

The second part here is the LLM part; I've talked about it multiple times now. You need access to the embedding models, you need access to the large language models, and then there also needs to be some logging: which queries do you run, how much cost do they incur, is the performance okay? A lot of logging happens here as well. And we intentionally put an LLM gateway in front of it, because the models need to be exchangeable. The chatbot itself has a lot of functionality; I don't want to go into all the details, and obviously you also need monitoring and reporting. And the beauty of it: you can build all of that with open source tools nowadays, and the embedding and language models can be open source too; there are a lot of alternatives out there.

Now, why CrateDB and LangChain? You need robust data management. As we have seen, there are a lot of different data sources and data stores involved here, whether it's logging, whether it's semantics, or your agents communicating in JSON. You need to store all of this information, ideally in one store, not five or six different databases that you need to operate and whose languages you need to learn, et cetera. And besides LangChain, other options are out there too; think of Haystack and others that you could use. All of these frameworks give you a very good set of building blocks that you can just use. They're available in Python and JavaScript, there are also Java ports out there, and ports to other languages are becoming available. Everything you need is already in these libraries to come up with your overall architecture.

And that's now the point to hand it over to Maria. She will guide you through a demo where we try to simulate how you can use support tickets, internal data. Here we took some Twitter posts about Microsoft support, we will vectorize them, and we'll show how a support agent or a customer can then interact with this chatbot and ask certain questions. It will demonstrate that it's not such a big effort; you can get started right away. The link to the demo is here on the slide, and you'll also find it in the app or on the website for the talk. Thank you.

Do you hear me? Okay. Awesome. Thank you. So, you have heard a lot about the theoretical aspects of RAG and how it works.
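Before the demo code, here is a minimal sketch of the chunking step from the processing layer described above, assuming a recent LangChain version; the file name and the chunk sizes are arbitrary example values.

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Split a source document into overlapping chunks; each chunk is later
    # embedded and stored alongside its text in the vector store.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(open("internal_manual.txt").read())  # hypothetical file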
I have a little bit more than ten minutes to show you a practical example, but believe me, we could run an hours-long workshop on this topic. Essentially, the idea today is to show you how to augment an existing LLM with private data and how to use that data as context for specific questions this LLM has not seen so far. We use data that captures customer interactions on Twitter; these interactions involve different questions from users about Microsoft, Amazon, all these different products, and how the support teams from these big companies answer them. This is not something you usually find on the internet very easily. If you have a problem with some Microsoft product, yes, very often you can find the solution out there, but very specific questions get asked directly to customer support precisely because people didn't find the answer out of the box. And we will use CrateDB as a vector store to support this example; I think Christian already gave you a good overview of what CrateDB is.

What is LangChain? LangChain is an open source Python project that is used to facilitate the development of LLM applications. It's a pretty cool project that integrates a lot of large language models and a lot of models for calculating embeddings, and it helps you connect a data source with a language model without having to figure out from scratch what the full engineering pipeline should look like. You can do this in a couple of lines of code.

May I add one point here that I forgot to mention: LangChain is a very good starting point, but what we have also seen for very advanced purposes is that you want to interact directly with your data, with your source data, with your vector store, and all of that is available in standard SQL, no matter which data model you're using.

CrateDB is open source, and one of the easiest ways to run it is to use the Docker image. Vector support in CrateDB has been available since version 5.5, but if you always pull the latest image, you don't have to think about this. Once you run this docker run command, we have an instance of a CrateDB cluster and can access the admin UI on localhost. Currently, I think because of the resolution of this screen, not everything is visible, but in this admin UI you have a couple of tabs that you can use to monitor your cluster, run queries in the console, and get an overview of the tables and views that are available in your database.

So let's go back to the example, because time is flying very fast. As a first step, we need a couple of import statements to make sure that LangChain and all the libraries we use in this example are available. What is also important is that you import the CrateDB vector search interface that is available as part of the LangChain integration. And as a next step, because we need to interact with the CrateDB instance, we need to specify how we connect, and this is done with a connection string.
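As a rough sketch of that setup, assuming a plain local installation: the docker command appears as a comment, the connection string follows CrateDB's SQLAlchemy scheme, and the exact import path of the vector store class depends on the LangChain/CrateDB integration version, so treat it as an assumption.

    # Start a local CrateDB instance first, for example:
    #   docker run --rm -p 4200:4200 -p 5432:5432 crate
    # The admin UI is then reachable at http://localhost:4200.

    # Import path is an assumption; it depends on the LangChain/CrateDB
    # integration version you have installed.
    from langchain.vectorstores import CrateDBVectorSearch

    # SQLAlchemy-style connection string for the local, unauthenticated instance,
    # plus the collection name used throughout the notebook.
    CONNECTION_STRING = "crate://localhost:4200"
    COLLECTION_NAME = "customer_data"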
We are using the open source version running on localhost, but you also have the option, for example, to deploy a CrateDB Cloud cluster, and at this point we also give all users the option to deploy one cluster that is free forever, so you can just run it and use it for testing purposes. Finally, we need to specify the collection name that we are going to use in this notebook session. If we run this piece of code, the connection string is now available, and then we can start interacting with CrateDB.

For the purpose of this notebook, I rely on OpenAI models. Of course, LangChain supports many different models and you can integrate many of them, but if you choose to use OpenAI, make sure that you have your OpenAI key set as an environment variable.

Now let's take a look at how the dataset looks. This dataset is also available in our CrateDB datasets repository, which is also open source, and it contains the customer interactions about Microsoft products. Essentially, we narrow the scope of this notebook a bit, for illustration and time reasons. The dataset has information like who is the author of this tweet, whether it's an inbound or outbound message, when it was created, what the content of the question or the answer was, and whether this text was created in response to something else. The idea now is to feed all of this information to the large language model and to ask questions that could, for example, be answered from this dataset.

As a first step, if you remember the big RAG diagram, we create embeddings. Embeddings are the representation of your data that is suitable for machine learning and AI purposes. First we need to load the data from this dataset, and for this we use the CSVLoader interface that is available in LangChain; in these few lines of code we are already creating embeddings for all the entries in our dataset.

If I go back to the admin UI, I can see two tables. The first table gives me the collection of entries; the first collection we created is called customer data. But what is interesting now is to see the embeddings created for all the entries in this collection. For example, this is an instance of a document that we are using for context purposes, and you can see how the embeddings look. If you use OpenAI embeddings, the length of your vector is going to be 1,536 with the default embedding model, but you can also choose some other embedding algorithm, for example Hugging Face, as suggested here, which is open source and can easily be used out of the box in just two lines of code.

Now, once we have these embeddings, let's define our question. Our question today is: okay, I have an order in the Microsoft Store, but I want to update the shipping address; how do I do this? I also put alternative questions here, so when you play with this notebook you can put in your own questions and see whether this dataset has enough information to answer them.
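Condensed into code, the loading and embedding steps plus the question definition look roughly like this; the CSV file name is illustrative, and the vector store import path and parameter names are again assumptions based on the description above.

    import os
    from langchain.document_loaders import CSVLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import CrateDBVectorSearch  # import path: assumption

    assert "OPENAI_API_KEY" in os.environ  # needed by the OpenAI embedding model

    # File name is illustrative; the notebook loads the Microsoft customer
    # support tweets from the CrateDB datasets repository.
    documents = CSVLoader(file_path="twitter_support_microsoft.csv").load()

    # Calculate embeddings for every entry and store them, together with the
    # text, in the CrateDB collection defined earlier.
    store = CrateDBVectorSearch.from_documents(
        documents=documents,
        embedding=OpenAIEmbeddings(),
        collection_name=COLLECTION_NAME,
        connection_string=CONNECTION_STRING,
    )

    # The question we want the assistant to answer from this private data.
    question = "I have an order in the Microsoft Store. How do I update the shipping address?"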
Once the question is defined, we want to find the context that is relevant to it, and this is done by a similarity search comparing the vector representation of our question against the vectors we stored in the CrateDB instance; it takes just one line of code. As Christian mentioned, vector search is one way to find the relevant context; of course, CrateDB supports other types of searches, like full-text search, geospatial search, or just keyword search, so you can combine different types of searches to find the relevant context for your question.

Once we have done this, we are ready to ask our LLM to answer the question. How do we do this? First we need to create a prompt that explains to the LLM what its purpose is. Its purpose today is to be an expert on Microsoft products and services, and it should use the context that we are going to give it to answer the relevant questions; but if the answer is not found in the context, it should reply with "I don't know." This is a very simple way to create a prompt that gives the LLM instructions on how it should answer specific questions. Finally, we create a small chatbot by using one of the available models that are integrated with LangChain and passing this context together with the user question.

Once this is completed, we can access the answer, and in this case it says: to update the shipping address, you will need to cancel your current order and place a new one. Maybe that's still up to date, maybe it's not relevant anymore, but it's something we learned only from the dataset we provided. So this is how you actually use your private data to teach the LLM what the context for incoming questions should be.

I hope you liked this demo. You can play with this notebook; it's in our cratedb-examples repository, and you can also see there are other similar notebooks for different types of examples: different prompt engineering examples, how to create other kinds of chatbots, how to use other embedding algorithms. So please let us know what you think, give us feedback, open an issue on this repository, and we are looking forward to working with you on these topics.
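For reference, the retrieval and answering steps from the demo, condensed into a minimal sketch; it reuses the `store` and `question` variables from the sketch above, and the prompt wording and model choice are illustrative assumptions rather than the exact notebook code.

    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage, SystemMessage

    # Similarity search against the stored embeddings returns the relevant context.
    docs = store.similarity_search(question)
    context = "\n".join(doc.page_content for doc in docs)

    # Prompt: be an expert on Microsoft products, answer only from the context,
    # otherwise say "I don't know".
    system_prompt = (
        "You are an expert on Microsoft products and services. "
        "Answer the question using only the context below. "
        "If the answer is not contained in the context, reply with 'I don't know'.\n\n"
        f"Context:\n{context}"
    )

    llm = ChatOpenAI()  # default chat model; the model choice is an example
    answer = llm([SystemMessage(content=system_prompt), HumanMessage(content=question)])
    print(answer.content)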
So I think that is all from us. Thank you for being part of this session; maybe we have time for one question.

Okay, awesome. Do we have questions from anyone?

Thank you for the talk. I have a question about the embedding model: if you encode the prompt with the language model but use an external embedding model, they can end up in different spaces, and then you do the similarity search. Have you tested this, and do you see an effect from using different embeddings?

It's a very important question. The way you create these embeddings is super important, and you're usually limited to one embedding algorithm, because the vectors need to have the same length and, simplifying a bit, they need to capture the same semantics. This is also what I meant with the customers we work with: they were able to create different indexes, and then the retriever gets more and more complex. As you've seen on the architecture slide, this is a simplified example; maybe you need to query different indexes created by different embedding algorithms, so that you can search your images and you can search your textual data. Obviously you might use different models there, and then re-rank the results to come up with the really relevant context, maybe from different indexes. And maybe you also want to combine it with a full-text search, or limit it to customer support tickets from Europe, to come up with a good example, or to customer support tickets from the US with some geospatial restriction. But it is then the re-ranking of the results that really identifies the particular context that is relevant for the question.

Okay, thanks a lot. Any more questions? No? So thank you very much for the very nice talk. Thank you.