Okay, next up we have Peter telling us about vector search in modern databases.

Well, hello everyone. My name is Peter Zaitsev. I was supposed to speak here together with Sergey Nikolaev, who is actually a much better expert in this space, but unfortunately he couldn't get a visa. So, guess what, you are stuck with me. We are going to talk about vector search. How many of you are familiar with vector search? Oh, that's a good number of hands. Some are not, so that's a fun audience to have.

Let me start by ruining the suspense and showing the highlight: the state of vector search across a variety of databases. Two things are interesting here. First, a number of new databases have been started in the last few years that focus specifically on vector search and related applications. Second, at the same time, a lot of mainstream databases have added support for vector search, starting back in 2019, which is relatively recent, and they moved relatively quickly. I think that is very interesting, because databases are usually rather conservative, slow-moving products. It reminds me of what we saw with JSON: databases like MongoDB came along and really won developers' hearts and minds, and later on JSON support came to pretty much every major database out there and was even added to the SQL standard.

Now, who is unfortunately missing here? MySQL. You could say, well, MySQL is now owned by Oracle, which is a big, slow-moving corporation, so they are simply not doing it. But it is more nuanced than that: vector search does exist in HeatWave, Oracle's cloud-only MySQL variant, and it is not very clear to what extent it will come to MySQL proper. At the same time, MariaDB is working on a solution in the MySQL space, and PlanetScale has announced that they are going to implement vector search. So if not in MySQL Community itself, it will come in some of the variants. And if you look at PostgreSQL, that has always been a wonderful ecosystem with multiple ways of doing things: a number of vector search extensions have been created, but pgvector seems to be the one getting the most support and attention these days.

What is also interesting, with vector search being so hot because of AI, is that some databases, Elastic in this case, are now pretty much calling themselves a vector search database first and a full-text search database second. For me that was a very interesting change to see.

Anyway, with the big picture out of the way, let's talk a little about vectors and vector search, and why they suddenly became so important. If you remember your high school algebra, you know what vectors are: we can think about 2D or 3D space, and that is all very clear. But a vector can also represent something like a color. As software engineers know, colors are commonly encoded as red, green, and blue, one byte each, and that can be seen as a vector in color space.
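To make that concrete, here is a minimal sketch (my own illustration, not from the talk's slides) of treating colors as vectors and measuring how far apart they are:

```python
import math

# A color is a point in 3-dimensional RGB space, one byte per channel.
red    = (255, 0, 0)
orange = (255, 165, 0)
blue   = (0, 0, 255)

def euclidean(a, b):
    """Straight-line distance between two points in color space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Orange is much closer to red than blue is, matching visual intuition.
print(euclidean(red, orange))  # 165.0
print(euclidean(red, blue))    # ~360.6
```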
And we can think about the similarity of colors as similarity between those vectors. Now, there are a number of different definitions of what makes two vectors similar. There is Euclidean distance, for example, but the most common one in this space is cosine similarity. If you go and ask one of the thought leaders in the AI space, OpenAI, "hey, I am using your API, what do you suggest as a distance between vectors?", they will suggest cosine similarity.

Now, let's talk a little bit about history and how vectors have been used in databases, specifically in information retrieval. Vectors are not something that suddenly became new, as the term "vector search database" might make you think. In full-text search applications specifically, vectors have been used for a long time. One approach is a sparse vector: take a document, look at all the words in the dictionary, and store the frequency of each word in the vector. Then, if you want to compute the similarity between two documents, you essentially take the cosine similarity between those two massive sparse vectors. That has existed for a very, very long time; you can find it in Lucene, in Elastic, in various libraries, and so on.

What we use much more in vector search today are dense vectors, called vector embeddings, and they are different. In sparse vectors, also referred to as bag of words, every dimension means something: we can say this dimension tells us whether the word "cat" was seen in the document and how many times, or its relative frequency. When we compute a dense vector, an embedding, we take the document, run it through the model that generates the embeddings, and then we do not really understand very well what the individual dimensions mean. What we do know is that similar documents should end up close by; that is how the system works.

And it does not even have to be a document. Think about an image recognition application: I can train a system on the faces of a bunch of people (I know that is not exactly legal in Europe, but imagine we are in China and we are going to do it), then compute the embedding of somebody's face and look at which person in the database it looks most like. That should give us the closest match.

Here is something interesting. These are embeddings computed for single-word documents using the open source GloVe model, in a visualization by Jay Alammar. What you get is a bunch of words whose dimensions we do not really know the meaning of, and that is very common in AI: we know these things work, but we cannot really explain how exactly they work.
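To make the similarity machinery concrete, here is a minimal sketch (mine, with made-up toy vectors; real embedding models produce hundreds or thousands of dimensions) of cosine similarity and an exact, brute-force nearest-neighbor lookup:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for a few stored documents.
stored = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "water": np.array([0.10, 0.20, 0.95]),
}

query = np.array([0.88, 0.79, 0.12])  # embedding of the incoming query

# Exact nearest neighbor: scan everything, keep the highest similarity.
best = max(stored, key=lambda w: cosine_similarity(query, stored[w]))
print(best)  # "king": closest in direction to the query vector
```

This full scan is exact but linear in the number of stored vectors, which is precisely why the index structures discussed next exist.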
You can then try to rationalize what those dimensions could be. For example, all of the words shown have something to do with humans, except water, and you can see a straight line of cells that is blue for all of them; so you might say, well, maybe that dimension is somehow related to humans. Or look at king and queen: their embeddings have a lot of similarity, so you might say, AI, we do not exactly know how, but it may have learned something about royalty.

Okay. So, vector search in a nutshell. We have vectors, typically dense vectors that come from AI applications, that is, embeddings. Some database systems focus only on supporting operations over those embeddings. Postgres with pgvector, for example, essentially says: we do not know how you compute those vectors, that is not our problem; we are just going to help you operate on them. Some of the more advanced vector databases may also support creating the embeddings, maybe even from inside the database itself, especially cloud databases, which can call OpenAI's API in the background to compute the embeddings for you.

So anyway, let's talk more about the technology. What operations do we typically see in vector search applications? Typically we have a store of vectors, we get a vector as input, and we want to find the stored vectors closest to it, where the distance used is usually cosine distance. Of course, as with almost everything, you could just scan all the data and find the closest vectors. That approach is wonderful in that it is exact, but it is also very slow, which is why it is not used in practice. Instead, special index structures are used, and HNSW seems to be the most popular algorithm the industry is coalescing around. What is important to note, and different from many other things in databases, is that this index is not exact: it gives good accuracy, but it does not guarantee that it will always return the truly closest vector. If you come from the database world you may say, ooh, that looks strange. But given how those vectors are computed in AI applications, they are not really exact to begin with, so approximate results are quite usable.

So what are these solutions really used for? As I mentioned, the most common feature is finding nearest neighbors. There are also more global operations, such as clustering or classification of the data, which the more advanced systems support.

Okay, let me show a little example. Here we will use Manticore Search, which is interesting because it is one of the engines created for this kind of search, but we will connect to it through the MySQL protocol, which it supports.
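Roughly, the demo looks like the sketch below. This is my reconstruction, not the talk's exact slides: the table name and the toy 4-dimensional vectors are mine, the `float_vector`/`knn` syntax follows Manticore's documented KNN search feature, and since Manticore speaks the MySQL protocol, an ordinary MySQL client such as `pymysql` can drive it:

```python
import pymysql

# Manticore listens for the MySQL protocol (port 9306 by default).
conn = pymysql.connect(host="127.0.0.1", port=9306)
cur = conn.cursor()

# A float_vector column with an HNSW index behind it. Real image
# embeddings have far more than 4 dimensions; these are toy values.
cur.execute("""
    CREATE TABLE bags (
        title text,
        image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2'
    )
""")
cur.execute("""
    INSERT INTO bags VALUES
        (1, 'yellow bag', (0.653448, 0.192478, 0.017971, 0.339821)),
        (2, 'white bag',  (-0.148894, 0.748278, 0.091892, -0.095406))
""")

# Find stored vectors closest to the query embedding, nearest first.
cur.execute("""
    SELECT id, title, knn_dist() FROM bags
    WHERE knn(image_vector, 5, (0.286569, -0.031816, 0.066684, 0.032926))
""")
for row in cur.fetchall():
    print(row)  # 'yellow bag' comes back with the smaller distance
```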
And as you can see, we can go ahead and create a table, defining an `image_vector` column of a float vector type. That is not a traditional database type, but something this engine supports, and typically vector databases provide some sort of vector storage functionality like that. Then we can compute the distance between a given input vector and the vectors in the database, and get the results back. What we actually had in this case were different images that had been run through embedding creation. The query image was, I think, an image of a bag, and the result says it is much more similar to the yellow bag than to the white bag, which is a pretty typical kind of result.

I mentioned embedding computation. If you look especially at non-specialized databases like Postgres, the position is: you can use an external encoder, but computing embeddings is not seen as a database problem. You process the information out there, with your favorite external API or a local open source model, and that is fine. There are some libraries being added in this space, though, including one from Microsoft, and over time, given the desire to make things simple for developers, I think we will see more direct support.

Now, something else that I think is interesting about embeddings in information retrieval tasks specifically. If you look at search applications, for years a lot of time was spent on hand-engineering the structure of the language: defining synonyms, defining antonyms, and so on, in order to improve search quality. The other approach is to use AI for this instead: we match documents through embeddings, and that lets us avoid a lot of that manual work while keeping good quality. What is interesting, though, is that this new generation of AI search may not always give the best results, and it may not be the most efficient: if I have a lot of documents in my database, billions of them, then vector search over all of it can be relatively expensive. So one approach that is used is a dual approach: first get, say, the top thousand documents through legacy frequency-based search methods, and then use AI to rank those. That kind of combination, for example Vespa's hybrid search, shows better results on many benchmarks.
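A minimal sketch of that dual approach (my own illustration; `keyword_top_k` and `embed` are hypothetical placeholders for a full-text index query and an embedding model, not anything the talk named):

```python
import numpy as np

def keyword_top_k(query: str, k: int = 1000) -> list[str]:
    """Stage 1: cheap, legacy frequency-based retrieval (BM25 in
    Lucene/Elastic, for instance) narrows billions of documents
    down to k candidates. Placeholder for a real full-text query."""
    raise NotImplementedError

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for an embedding model (an external API or a
    local open source model), returning one row per input text."""
    raise NotImplementedError

def hybrid_search(query: str, k: int = 10) -> list[str]:
    candidates = keyword_top_k(query)      # fast recall over everything
    doc_vecs = embed(candidates)           # expensive, but only ~1000 docs
    q = embed([query])[0]
    # Stage 2: rerank candidates by cosine similarity to the query.
    sims = (doc_vecs @ q) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [candidates[i] for i in order[:k]]
```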
Okay, to wrap up: vector search is very usable, both in information retrieval tasks and in many other applications. Vector support in databases is still at a relatively early stage; it has only been implemented in the last few years, and I would say the current implementations cover the basics. We will see many more features as we figure out what we actually want, with continuing innovation in the data structures that power fast vector search, as well as improved accuracy.

Well, that's all I have. Yes, the gentleman was about to show me zero minutes. If there are any questions, I would be happy to try to answer them. Any questions? Yes.

Hi. You said that databases are going to transition from just supporting externally computed embeddings to generating them internally. But given that we see many advanced models that generate embeddings, for example GPT, which takes entire concepts and translates them very efficiently into embeddings, wouldn't the standard be to provide a single external, state-of-the-art embedding model that everyone can just benefit from?

Well, yes, absolutely. What I am saying is that it will be supported, maybe in different ways. I am not saying you should expect Postgres to incorporate all possible models inside itself. It is the same as when we say Python supports these things: it just means they are easier to do from Python. Right now, if I want to generate embeddings even for data I already have in the database, I have to do that externally. What I am saying is that we will get some convenient function for it instead of having to talk to something external ourselves. Does that make sense? Thank you. Any other questions?

Thanks for the talk. As a database developer, how do you deal with a very rapidly moving target in both algorithms and storage formats? In the sense: do you deprecate rapidly? Because if today you have a KNN function with a certain API, and a month later, ML being a very rapidly moving field, you have a better KNN function, and the same goes for the dense vector format.

Yeah, I think that is a good point, and it is what is interesting about an industry at such an early stage. On the one hand, things are changing quickly. On the other hand, people often implement something, put it in production, and it is good enough; the fact that a better state of the art exists out there does not mean they want to change. At least in the database world, you always end up saying: there are certain things you would love to kill, but some very big corporation has already deployed them in production and is not changing that in the next ten years. So in reality I think we will see evolution, but implementations will have to remain compatible. If you look at the pgvector extension for Postgres that I mentioned, it already supports a whole bunch of different options, and I think that is what we should expect.

Okay, hi. Can you go back to the embeddings slide where you showed the similarities between queen and king? Yes, this one. You mentioned you used Ada, I think, or ChatGPT embeddings. What did you use?

No, this one is actually the GloVe model, one of the open source models. Ah, okay, I see.
But in this case it is just an example. What I wanted to do here is visualize: when we say these generated embeddings do not particularly mean anything, can we still just plot them, the way we might plot, say, the DNA of a frog and a fish, and see something in them? That was the point: to visualize what a particular embedding generation model produces.

No, I find it really helpful, also for my future students, because it really captures the essence of embeddings and similarity, I think.

Yeah, again, in this case it is just a visualization to show people what these things look like. The point I was trying to make is that on the one hand we cannot really state what exactly those dimensions correspond to, but as humans we can try to rationalize over them; there seems to be something there.

Also because the OpenAI embeddings have something like fifteen hundred features, it is really difficult to...

That's right. In this case the vectors were specifically cut down to have fewer features.

Yes, because it is 1,500, or about 3,000 for the large model.

Yeah, that would be too many to display. Okay. Thank you. Thank you.