I'll just yell. So, this is effectively my talk. It started out as a really big thing, and then I realised 40 minutes isn't actually that much time, so we had to compress it down into a slightly smaller talk, hopefully still covering the most interesting points, in my opinion.

A bit about me: I'm Harrison, I come from London, I live in London, and I work for Quickwit, where, as Paul said, we build basically a distributed search engine for logs. I'm also the creator of LNX, which is a slightly different design of search engine, probably more akin to something like Elasticsearch or Algolia, for all your lovely e-commerce websites. You can contact me at harrison@quickwit.io.

A little bit about LNX, since this is basically the origin story of this talk. It's a search engine built on top of Tantivy, akin to Elasticsearch or Algolia, as I've said, and it's aimed at user-facing search: things like your e-commerce websites, your Netflix-style streaming platforms, things like that. It's not aimed at being your cost-effective log search engine; it doesn't really handle those hundreds-of-terabytes-a-day workloads, but it will handle thousands of queries a second per core. It's very easily configurable, it's designed to be really fast out of the box because it uses Tantivy, and it has an indexing throughput of about 30 to 60 megabytes a second on reasonable hardware. High availability is coming soon, which is the premise of this talk.

So what is user-facing search? I've stolen Crunchyroll's website and typed some bad spelling in there, and you can see that a lot of the top results actually account for the fact that I can't spell. That's basically the biggest principle with these user-facing search engines: you have this concept of typo tolerance. This is a really good thing for users, because users can't spell. The downside is that it costs a lot of CPU time to check those additional words, it makes things a lot more complicated, and often documents are mutable, among other things. On top of that, when you want these nice, low-latency search experiences, something called search-as-you-type has become more popular, and that means the amount of searching you're doing for a single user increases several times over: every keystroke you press is now a search, versus typing it all in one go, hitting enter, getting a bunch of results back, going "oh no, I've spelt something wrong, or I can't see what I want here", and typing it again. That is effectively the principle of these search engines. You can see we have Algolia at the bottom, which is a very common one that I think most people know, very popular for document searching.
But we decided: hey, we don't want to use one of these pre-built systems. We don't want to use Elasticsearch; that's big, that's scary, I don't like it. We don't want to use Algolia, because I don't have that much money; I'm just a lowly paid software developer, I can't be spending thousands of pounds on that. We look at some of the others, and no, we're just going to write it ourselves. And that's where we have a little look around, because we hear something about Tantivy, we hear something about Rust being blazingly fast, as all things must be, and we go: okay, I like this, I like what it says. It says "Apache Lucene", I think I've heard that before somewhere; "written in Rust", I've definitely heard that before.

So we take a little look at what it is, and it is effectively akin to Lucene, which, if you don't know what that is, is a full-text search engine, as it's called. Tantivy in particular supports things like BM25 scoring, which is just a fancy way of saying "which words are relevant to this query". It supports something called incremental indexing, which basically means you don't have to re-index all of your documents every time you change one thing. You have faceted search, you have range queries, and we have things like JSON fields, which allow for schemaless indexing of a sort. You can do aggregations, which have some limitations, in particular around JSON fields being a little bit limited. But the biggest thing is it has a cheesy logo with a horse, which I believe Paul drew himself, so I think that needs a clap on its own. There are more features which I just couldn't fit on this slide, and time is of the essence.

So you might be wondering what a basic implementation with Tantivy looks like, and because it's a library, it's actually really quite simple. We have a couple of core things. Starting at the top, we define what's called a schema. Since Tantivy was originally a schema-based system, and still is, we need some way of telling Tantivy what the structure of our documents is and what properties they have. We could use something like a JSON field to give the impression of a schemaless index, but schemas are good, we should use them; they come with lots of nice bells and whistles. In this case we've created a schema with a title field, and you can see we've added the TEXT and STORED flags, which really just say: I'm going to tokenize this field, and then I'm going to store it so we can retrieve it later on, once we've done the search.
The second thing we do is create our index writer, and in this case we're just letting Tantivy select the number of threads. By default, when you create this index writer and give it a memory buffer, in this case about 50 megabytes, Tantivy will allocate some number of threads, I think up to eight, depending on your system, so you don't really have to put much thought into multi-threaded indexing. Then we're just adding a document: we've created our document, we've added the text field, we've given it, in this case, "The Old Man and the Sea", and we push it to our indexer, which essentially just adds it to a queue for the threads to pull off, process, and spit out onto disk. If we want that document to actually be visible to our users for searching, we need to commit the index. In Tantivy you can either commit or roll back, and if you have a power failure midway through indexing, when you reload from disk it will be at the point of that last commit, which is very, very useful: you're not left with partial state and all those nasty things.

Once we've done that, we can actually search. You can either build queries using the query traits, which are very nice, and mash them all together with lots of boxing, or you can use the query parser, which basically parses a nice little query language. In this case we've got a very simple phrase query, as it's called; we throw that in and it spits out a query for us. We then pass that into our searcher, along with what are called collectors, which are effectively just simple things that process the documents which match. In this case I believe we've got the count collector and the top-docs collector. The count collector does, well, it counts, big surprise there, and the top-docs collector collects the top k documents up to a given limit. In this case we've selected 10; we only have one matching document, so it doesn't matter that much here, but if you have more you can limit your results, adjust how things are scored, and so on.
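Pulling those pieces together, here's a minimal sketch of that whole flow, assuming a reasonably recent Tantivy release; the slide code wasn't captured in the transcript, so the `title` field and the query string are mine:

```rust
use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index, IndexWriter};

fn main() -> tantivy::Result<()> {
    // 1. Define the schema: a tokenized ("TEXT") and stored ("STORED") title field.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    // 2. Create the index and a writer with a ~50 MB indexing buffer;
    //    Tantivy picks the number of indexing threads for us.
    let index = Index::create_in_ram(schema);
    let mut writer: IndexWriter = index.writer(50_000_000)?;

    // 3. Queue a document and commit so it becomes visible to searches.
    writer.add_document(doc!(title => "The Old Man and the Sea"))?;
    writer.commit()?;

    // 4. Search with the query parser plus two collectors: a total count
    //    and the top 10 documents by BM25 score.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let parser = QueryParser::for_index(&index, vec![title]);
    let query = parser.parse_query("\"old man\"")?;
    let (count, top_docs) = searcher.search(&query, &(Count, TopDocs::with_limit(10)))?;
    println!("{count} hit(s): {top_docs:?}");
    Ok(())
}
```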
Now, that's all well and good in this example, but it doesn't actually account for spelling, and as we discussed earlier, users aren't very good at spelling, or at least I'm not. So maybe we want a bit of typo tolerance, and Tantivy does provide us with a way of doing this in the form of the fuzzy term query. It uses something called Levenshtein distance, which is a very common way of working out how much modification you need to do to a word in order to actually get it to match, and we call that the edit distance. Typically you allow between one and two edits, so you're swapping characters around, removing one, adding a new one; a bit of magic there, really. As you can see at the bottom: with just regular full-text search, if we enter the term "hello" we'll only match the word "hello", and if we enter the term "hell" we'll only match the word "hell". If we use a fuzzy term query, we can actually match both "hell" and "hello", which is very useful, especially for prefix search. This is built on top of Tantivy's inverted index, which uses something called an FST, a finite state transducer, which is effectively a fancy way of saying "we threw state machines at it and made them return results"; that's about as well as I can describe how they work. The person who originally wrote the FST library in Rust, BurntSushi, has a blog post on this that goes into a lot of depth, really useful for that sort of thing, but I can't elaborate any more than that. All of this additional walking through our index and matching these additional words does come at the cost of some additional CPU.
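As a rough sketch of what that looks like with Tantivy's fuzzy term query, carrying over the `title` field from my earlier example:

```rust
use tantivy::query::FuzzyTermQuery;
use tantivy::schema::{Field, Term};

// Two typo-tolerant queries over the "title" field.
fn typo_tolerant_queries(title: Field) -> (FuzzyTermQuery, FuzzyTermQuery) {
    // "helo" matches "hello": the missing 'l' is one edit away.
    // The final `true` means a transposition counts as a single edit.
    let fuzzy = FuzzyTermQuery::new(Term::from_field_text(title, "helo"), 1, true);

    // The prefix variant makes "hell" match both "hell" and "hello",
    // which is what powers search-as-you-type experiences.
    let prefix = FuzzyTermQuery::new_prefix(Term::from_field_text(title, "hell"), 0, true);

    (fuzzy, prefix)
}
```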
Once we've got all that, what we're left with is this nice block of data on disk. We have some metadata files, in particular meta.json, which contains your schema along with a couple of other things, and we have our core files, which, if they look very similar to Lucene's, that's because they are. In particular we have our field norms, our terms, our store, which is effectively a row-oriented document store, our positions, our doc IDs, and our fast fields; and fast fields are, effectively, fast, a somewhat simple and equally vague name.

Now that we've got all this stuff on disk, if we wrap it up in an API, we've mostly got everything. Here we've got a demo of LNX working with, I think, about 27 million documents, and we're searching it with roughly millisecond latency. In total it's about 20 gigabytes on disk, compressed, which is pretty nice.

But there's a bit of an issue here, which is what happens when we deploy this to production. Our site is very nice, we get lots of traffic, things increase, and we go: hmm, search traffic has increased and our server isn't coping, so let's just scale up the server. We can repeat that for quite a while; things like AWS will sell you a stupid number of cores, so you can scale up very easily. But you keep going along like this, and eventually something happens, and in this case your data centre burns down. If anyone remembers, this happened in 2021: OVH basically caught fire, and that was the end of a lot of people's sleep. So your data centre is on fire, search isn't able to do anything, you're losing money, no one's buying anything, management's breathing down your neck for a fix, you're having to load from a backup. What are you going to do? And you think: ah, I should have made some replicas. I should have done something called high availability.

What that means is that instead of having one node on one server ready to burn down, we have three nodes available to burn down at any point in time, and hopefully we put them in different availability zones, which means that if one data centre burns down, there's a very small likelihood of another data centre burning down in the meantime. This allows us to keep operating even though one server is currently on fire, or lost to the ether, or the network has torn itself to pieces. It also means we can upgrade: if we want to tear a server down and restart it with newer hardware, we can do that without interrupting the existing system.

But this is a hard thing to do, because now we've got to work out a way of getting the same documents across all of our nodes. In this case it's a shared-nothing architecture; this is what Elasticsearch and basically most systems do. We're just replicating the documents, not all of that processed data we've just built; each node applies them itself. This approach makes things a bit simpler; in reality LNX and Quickwit do something a little bit different, but this is easier. I say easier, because the initial solution is "just spin up more nodes, add some RPC in there, what can go wrong?", and then deep down you work it out: oh, you mean networks aren't reliable? What's a Raft? At that point you go, okay, this is harder than I thought, and you realise the world is in fact a scary place outside your happy little data centre, and you need some way of organising state that doesn't depend on things not catching fire. This is a hard problem to solve. So you have a little look around, and you see that Rust is quite a young ecosystem, and we're quite limited in our options.
We can't necessarily pick a Paxos implementation off the shelf, but we do have something called Raft. That's a leader-based approach, which means we elect a leader and say "okay, leader, tell us what to do", and it says "you, handle these documents, go do things with them". It's a very well-known algorithm, very easy to understand, and it's probably the only consensus algorithm that's widely implemented in Rust. There are two implementations: one by the PingCAP group, called raft-rs, and the other by Datafuse Labs, called Openraft, with varying levels of completeness out of the box.

So you think: okay, I don't really know what I'm doing here, so maybe I shouldn't be managing my own Raft cluster. Then you hear something about eventual consistency: it's leaderless, any node can handle writes and then ship them off to the other nodes, as long as the operations are idempotent. That's a very key point: it means you can ship the same document over and over and over again, and it won't be duplicated, or at least it won't act duplicated. This realistically gives us a bit more freedom: if we want to change things later, we can. So we decide to go with eventual consistency, because I like an easy life and it seemed easier. Yes, the people laughing will agree that things that seem easier probably aren't.

Our diagram looks something like this, and I'm scared to cross the white line, so I'll try to point. Step one: a client sends the documents to any node; it doesn't really care which one. That node then goes "okay, I'm going to send this to some of my peers" and waits for them to tell it they've got the document and it's safe. Once we've got a majority, which is a very common approach in these systems, we can tell the client: okay, your document is safe; even if OVH burns down again, we're probably going to be okay. It doesn't need to wait for all of the nodes to respond, because otherwise you're not really highly available: if one node goes down, you can't make progress.

So this system is pretty good. There's just one small problem, which is: how in God's name do you do this? Many questions need to be answered, many things: how do you test this, and who's going to have the time to do it? Well, luckily, someone, aka me, spent the better part of six months of their free time dealing with this, and so I made a library, and it's called Datacake. Yes, Datacake. I originally was going to call it Datalake, but unfortunately that already exists, so we added "cake" at the end and called it a day.
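To make that idempotency point concrete, here's a hypothetical last-write-wins register in Rust, nothing Datacake-specific: because applying the same timestamped update twice leaves the state unchanged, replication messages are safe to retry.

```rust
use std::collections::HashMap;

// Stand-in for a hybrid logical clock value; orders updates across nodes.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Timestamp(u64);

// A hypothetical last-write-wins key-value store.
#[derive(Default)]
struct LwwStore {
    data: HashMap<String, (Timestamp, Vec<u8>)>,
}

impl LwwStore {
    // Applying an update is idempotent: duplicates and stale
    // re-deliveries compare as not-newer and become no-ops.
    fn apply(&mut self, key: String, ts: Timestamp, value: Vec<u8>) {
        let is_newer = match self.data.get(&key) {
            Some((existing, _)) => ts > *existing,
            None => true,
        };
        if is_newer {
            self.data.insert(key, (ts, value));
        }
    }
}
```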
Datacake is effectively tooling for creating your own distributed systems. They don't have to be eventually consistent; it's just designed to make your life a lot easier. It only took about six rewrites to get it to the stage it's at, because, yeah, things are hard, and trying to work out what you actually want from something like this is awkward.

Some of the features it includes: there's a zero-copy RPC framework, built on top of the popular rkyv framework, which is really, really useful if you're shipping a lot of data, because you don't have to deserialize and allocate everything all over again; you can just treat the initial buffer as if it is the data. If that sounds wildly unsafe, it is, but there are a lot of tests, and I didn't write that bit, so you're safe.

We also have membership and failure detection, done using chitchat, which is a library we made at Quickwit. It uses the same gossip approach as something like Cassandra or DynamoDB, and it allows the system to work out which nodes are actually its friends and what it can do. We've also implemented an eventually consistent store, in the form of a key-value system, which requires only one trait to implement. The reason I went with that is that if you require more than one trait, people seem to switch off; frankly, I did when I looked at the Raft implementations. So it's one storage trait, and that's all you need to get this to work. There are also some pre-built implementations: I particularly like abusing SQLite, so there's an SQLite implementation, and a memory version. It also gives you some CRDTs, conflict-free replicated data types I should say, and something called a hybrid logical clock, which is a clock shared across your cluster where the nodes stabilise themselves, saving you from having to deal directly with this concept of causality. Causality is definitely the biggest issue you will ever run into with distributed systems, because time is suddenly not reliable.

So we go back to our original problem: first, we actually need a cluster, and in this case it's really simple to do. All we need to do is create our node builder and tell Datacake: okay, your address is this, your peers are these, or you can start with one seed peer and they'll discover their neighbours themselves, and you give each one a node ID. Node IDs are integers, not strings, and the reason for that is there's a lot of bit-packing of certain data types going on, and strings do not do well there. We can also wait for nodes to come onto the system, so our cluster is stable and ready to go before we actually do anything else.
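In code, that setup looks roughly like this. I'm reconstructing the API from memory of Datacake's README, so treat names like `ConnectionConfig`, `DatacakeNodeBuilder`, `DCAwareSelector`, and `wait_for_nodes` as assumptions that may differ between releases; the addresses and node IDs are made up.

```rust
use std::net::SocketAddr;
use std::time::Duration;

use datacake::node::{ConnectionConfig, DCAwareSelector, DatacakeNodeBuilder};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let listen_addr: SocketAddr = "127.0.0.1:8000".parse()?;
    // One seed peer is enough; gossip discovers the rest of the cluster.
    let seeds = vec!["127.0.0.1:8001".to_string()];
    let cfg = ConnectionConfig::new(listen_addr, listen_addr, seeds);

    // Node IDs are integers because Datacake bit-packs them internally.
    let node = DatacakeNodeBuilder::<DCAwareSelector>::new(1, cfg)
        .connect()
        .await?;

    // Block until nodes 2 and 3 have joined, so the cluster is stable
    // before we attach any extensions.
    node.wait_for_nodes(&[2, 3], Duration::from_secs(30)).await?;
    Ok(())
}
```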
By the time we get to this point, our RPC systems are working, nodes are communicating, your clocks have mostly synchronised themselves, and you can actually start adding something called extensions.

Extensions essentially allow you to extend your existing cluster at runtime: they can be added and unloaded, all at runtime, with state cleanup and everything else, which makes life a lot easier, especially for testing. They have access to the running node on the local system, which gives you things like the cluster clock and the RPC network, as it's called, which is the set of pre-established RPC connections. You can make an extension as simple or as complex as you like, which is essentially what I've done here: I've created this nice little extension which does absolutely nothing other than print the current time, which realistically I could do without, but nonetheless I went with it.

And this is what the eventual-consistency store actually is under the hood: it's just an extension. Here you can see, and I can't point that far, that we pass in a mem store, which is our storage trait implementation, we create our eventual-consistency extension using it, and we pass it to the Datacake node and say: okay, go add this extension, give me the result back when you're ready. In this case the eventual-consistency extension returns us a storage handle, which allows us to do basically all of our lovely key-value operations should we wish, including delete, put, and get; that's about all there is to the key-value store, but there are also some bulk operations, which allow for much more efficient replication of data.
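Continuing the sketch from the node above, attaching the store and using the handle looks something like this; again this is my approximation of the API, and the `MemStore` import path, the keyspace name, and the document bytes are all placeholders.

```rust
use datacake::eventual_consistency::EventuallyConsistentStoreExtension;
use datacake::node::Consistency;

// `MemStore` stands in for the in-memory `Storage` implementation the
// talk mentions; its exact path varies between Datacake versions.
let store = node
    .add_extension(EventuallyConsistentStoreExtension::new(MemStore::default()))
    .await?;
let handle = store.handle();

// Replicated key-value operations, each taking a consistency level.
handle
    .put("books", 1, b"The Old Man and the Sea".to_vec(), Consistency::Quorum)
    .await?;
let _doc = handle.get("books", 1).await?;
handle.del("books", 1, Consistency::Quorum).await?;
```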
The only problem with this approach is that it's not suitable for billion-scale databases. If you're trying to make the next Cassandra or ScyllaDB, don't use this particular extension, because it keeps the keys, sorry, the keys, not the values, in memory, which it uses to work out which keys have and have not been processed. The reason for this is effectively that I didn't really trust users to implement this correctly on the storage side, which turned out to be a good choice, because the number of unit tests that implementations initially failed was a lot.

So now we've got this ability to replicate our key-value data, and our life is a lot easier. In particular, we can go as far as essentially saying: okay, we've established our data replication and our key-value layer, let's just use Tantivy as our persistence store, and that is effectively the simplest way to do it. I've made a little demo, which you can go to at that link. I basically abused and slightly ignored certain things, in particular correctness: this will replicate your data, but you may end up with duplicate documents, because I didn't handle de-duplication. But in this demo we can fetch, we can delete, and we can index documents, with Tantivy as our persistence store, and here you can see we're doing about 20,000 documents in 400 milliseconds on a local cluster. And that is effectively the end, so: are there any questions? How long do we have left? Fifteen minutes? Okay.

Q: Do you have a way to provide, from outside, to the Tantivy transaction, or the LNX transaction, an external ID that I can use to integrate with other storage? To put the question more simply: do you have a way to know which level of data has been indexed?

A: Yes. I've glossed over it a little, because in reality it's a bit more complicated when you implement it. In practice you would probably use the replication layer to replicate the initial documents, and then keep a checkpoint to work out which documents have and have not been indexed yet, or you would add some additional step like a write-ahead log, so that, as long as the documents are there, you can make sure your commit point is always updated to the latest entry. In LNX it's actually a little bit different again, because of the way it creates indexes: they are per checkpoint, so effectively a new index is created on every commit. But you don't have to do that, and in this demo I didn't. You can add a write-ahead log; you can do basically anything, as long as the trait is implemented.
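One concrete way to keep such a checkpoint with Tantivy is to record it in the commit metadata. This is a hypothetical helper, not what the demo does; I believe Tantivy lets you attach a payload to a commit via `prepare_commit`, but treat the exact calls as an assumption, and the numeric checkpoint encoding is made up.

```rust
use tantivy::{Index, IndexWriter};

// Hypothetical: commit while recording how far replication has been
// applied, so a restart can resume from the right place.
fn commit_with_checkpoint(writer: &mut IndexWriter, checkpoint: u64) -> tantivy::Result<()> {
    let mut prepared = writer.prepare_commit()?;
    prepared.set_payload(&checkpoint.to_string());
    prepared.commit()?;
    Ok(())
}

// On startup, read back the payload of the last successful commit.
fn last_checkpoint(index: &Index) -> tantivy::Result<Option<u64>> {
    Ok(index.load_metas()?.payload.and_then(|p| p.parse().ok()))
}
```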
Q: Congratulations on the presentation. [Question partly inaudible.]

A: Let me see if I got that question right; I think it's about going beyond something like BM25 or Levenshtein distance. Things like vector search or word-embedding search are still quite far away and would need quite a big push to do with Tantivy specifically. But if you want to add additional queries or additional functionality, that's quite easy to add with Tantivy, because a query is actually just a trait. One of the things LNX does is add another query mode, called fast fuzzy, which uses a different algorithm based on pre-computed dictionaries to do the edit-distance lookup, and that basically just involves creating another query type. You can customise effectively all of your query logic, all of your collecting logic, and things like that; provided you're within the scope of the API, Tantivy will allow you to implement it yourself. Things like word embeddings, which are a little bit more complicated and require a bit more on the storage side, would need an issue and a very motivated individual to implement, which currently we don't really have.

Q: A little question on your sketches: the network in the diagrams, was it fully connected, and is that important? On this one it does not look fully connected, but I'm not sure if the diagram depicts connectivity or just which messages have actually been dispatched.

A: I'm going to cross the forbidden white line here, because we're doing questions. These arrows are just indicating messages being sent and responses coming back. In a real system you could have a network partition here, and node one can no longer talk to node three; it's effectively lost to the ether, and maybe node two can't reach it either. In that case it doesn't actually really matter: all you need to do is achieve what's called a consistency level, which means that if you want to progress, you have to reach that level, and otherwise things are counted as not having happened. So if node three is down or can't be contacted, as long as node one can contact node two, and node two acknowledges the messages, things can still progress. This is the same with Raft as well, which operates on what's called a quorum: any one node in a three-node group can go down, and the other two nodes can still make progress, provided they have a majority.
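As a simplified sketch of that consistency-level idea; Datacake exposes a similar knob, but this enum and helper are mine for illustration, not its actual API:

```rust
/// How many acknowledgements a write needs before we report success.
enum Consistency {
    One,
    Quorum,
    All,
}

fn required_acks(level: Consistency, cluster_size: usize) -> usize {
    match level {
        Consistency::One => 1,
        // A majority: any two quorums overlap, so one node failing out
        // of three still lets the other two make progress.
        Consistency::Quorum => cluster_size / 2 + 1,
        Consistency::All => cluster_size,
    }
}
```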
Q: So I understand that full connectivity of the network is not an important factor here?

A: Right, it's not.

Q: Nice to know, thank you. And thank you for the talk. I see there's basically a consistency mechanism for indexing; do you check the data on other nodes when there is a search request as well?

A: Say that again, sorry, I didn't quite pick that up.

Q: Do you check the data on other nodes when there is a search request, not an indexing request?

A: In this case we have relaxed reads, essentially. We're not searching across several nodes and taking the most up-to-date version from among them; that's part of the trade-off you make with eventual consistency. You effectively have that with Raft as well: unless you contact the leader, you won't have the most up-to-date data when searching. But one of the things you do have to do, if you go with the eventual-consistency approach like we do here, is handle the possibility that you will have duplicate documents, because something may have been re-sent in the meantime. So you need to be able to deduplicate at search time, or have some other method of handling it and deleting stale copies from the index.

Q: Does that mean that effectively every node must have a copy of the data? I can't have five nodes with only three holding the data, like a Cassandra-style system?

A: If you've got a five-node cluster and three nodes respond, those three nodes, if they've got the data, can immediately be searched, effectively, if you want, but the other nodes may take a little bit of time to catch up. That's the principle of eventual consistency: they'll eventually align themselves, but they're not all immediately able to reflect changes.

Q: Hello, just a simple one: in hindsight, would you have taken the Raft path?

A: In hindsight, probably still not, and the reason is that the current state of the Rust ecosystem around it means there are a lot of black holes, effectively. You're either going with an implementation that is very, very stripped down, just the state-machine part, or with an implementation that is very trait-heavy and a little bit opaque about what you need to test, what you don't need to test, and how it behaves under failure. I like this approach more because it lets me implement things like network simulation, which the RPC framework supports, so we can actually simulate networks failing locally in tests and things like that. That makes me feel a little bit more confident than trying to take just the state machine and implement everything, and all the handling, correctly myself. In future, yes, you could use it, but it's just not quite at that state.
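Coming back to the earlier point about deduplicating at read time: a hypothetical read-side merge might look like this, keeping only the newest version of each document across the replicas that responded. The `Hit` shape and its version field are my own invention for illustration.

```rust
use std::collections::HashMap;

// Hypothetical search hit: a document ID, a version (e.g. an HLC
// timestamp shipped with the document), and a relevance score.
struct Hit {
    doc_id: u64,
    version: u64,
    score: f32,
}

// Merge hits from several replicas, dropping older duplicates, since
// under eventual consistency replicas may briefly disagree.
fn dedup_hits(per_node_hits: Vec<Vec<Hit>>) -> Vec<Hit> {
    let mut best: HashMap<u64, Hit> = HashMap::new();
    for hit in per_node_hits.into_iter().flatten() {
        let keep = match best.get(&hit.doc_id) {
            Some(existing) => hit.version > existing.version,
            None => true,
        };
        if keep {
            best.insert(hit.doc_id, hit);
        }
    }
    best.into_values().collect()
}
```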
Q: I'm not sure I quite got whether the engine actually does any data sharding?

A: In this approach, really for simplicity and for time, we're not actually doing any data sharding. Servers are really quite big nowadays, so even for your e-commerce website you can get a pretty huge single server, and the biggest issue tends to be replication and high availability. Data sharding is something Quickwit would be concerned about, because there you've got so much data that you need to spread it across machines when you're searching. But in e-commerce, at the point at which you're searching across multiple machines, you're probably going to be looking at higher latencies, so you'd be better off dedicating one machine per search rather than several machines per search, really.