[00:00.000 --> 00:10.000] So, before we go on with the next security-related topic, we're going to talk about [00:10.000 --> 00:15.440] something completely different, and that is Elasticsearch internals by Martin [00:15.440 --> 00:32.440] from the Bulgarian JUG, and maybe also about security in this context. [01:13.440 --> 01:22.440] Test, test, test. [01:22.440 --> 01:48.440] Thank you. [01:48.440 --> 01:53.440] So, people coming in, please move to the middle of your row so that there's space on the side [01:53.440 --> 02:06.440] so people can sit. [02:06.440 --> 02:18.440] We're working in Cisco together; there are still a lot of people coming in, so we can start. [02:18.440 --> 02:45.440] If you're standing along the side, please take a seat. [02:45.440 --> 02:50.440] So, hello, everyone. [02:50.440 --> 02:56.440] My name is Martin, and I'm a consulting architect at the European Patent Office. [02:56.440 --> 03:02.440] I've also been doing a lot of consultancy on Elasticsearch in the past two to three years. [03:02.440 --> 03:08.440] So, just before we start with this session, how many of you are using or have used Elasticsearch in a project? [03:08.440 --> 03:11.440] Okay, more than half of the people. [03:11.440 --> 03:15.440] So, why this talk at FOSDEM? [03:15.440 --> 03:16.440] Multiple reasons, in fact. [03:16.440 --> 03:21.440] When I worked with Elasticsearch, I realized that even though it has quite good documentation, [03:21.440 --> 03:26.440] in many cases you need to go into the public code base and see what's in there to [03:26.440 --> 03:28.440] understand how it works. [03:28.440 --> 03:32.440] I've had questions from many people about how some functionality works, or how they can achieve [03:32.440 --> 03:34.440] something with Elasticsearch. [03:34.440 --> 03:39.440] And it's not always clear from the documentation or blogs on the Internet what you can achieve [03:39.440 --> 03:40.440] with Elasticsearch. [03:40.440 --> 03:46.440] So, in this short session, I'll try to show you how Elasticsearch works internally, [03:46.440 --> 03:50.440] and I'll talk about the Elasticsearch architecture. [03:50.440 --> 03:56.440] So, first of all, we'll do a 360-degree overview of the Elastic Stack, which I believe [03:56.440 --> 03:58.440] most of you are familiar with. [03:58.440 --> 04:02.440] Then I'll go into the Elasticsearch architecture, and at the end of this short session, I'll [04:02.440 --> 04:06.440] show you how you can write a very simple Elasticsearch plugin. [04:06.440 --> 04:10.440] In most cases, you won't need to write an Elasticsearch plugin, because there is quite [04:10.440 --> 04:13.440] a rich ecosystem of Elasticsearch plugins that you can use. [04:13.440 --> 04:16.440] But many companies find that's not always the case. [04:16.440 --> 04:20.440] So, sometimes you need to either customize something in Elasticsearch or write your [04:20.440 --> 04:23.440] own plugin to achieve something. [04:23.440 --> 04:24.440] All right. [04:24.440 --> 04:27.440] So, let's talk briefly about the Elastic Stack. [04:27.440 --> 04:31.440] In the middle, we have Elasticsearch, which is a Java application. [04:31.440 --> 04:34.440] It's updated quite often.
[04:34.440 --> 04:38.440] There are a lot of features being implemented in Elasticsearch, especially in the latest [04:38.440 --> 04:39.440] few releases. [04:39.440 --> 04:44.440] And around the Elasticsearch server application, there are different applications being [04:44.440 --> 04:48.440] built to allow you to work more easily with Elasticsearch, such as Kibana. [04:48.440 --> 04:53.440] Kibana is a feature-rich user interface for Elasticsearch that allows you to achieve multiple [04:53.440 --> 04:58.440] things: not only querying Elasticsearch, but Kibana also allows you to visualize data [04:58.440 --> 05:03.440] that's already in Elasticsearch or build different dashboards that are quite nice, [05:03.440 --> 05:05.440] especially for management. [05:05.440 --> 05:09.440] Also, if you want to put data from a variety of sources into Elasticsearch, you [05:09.440 --> 05:11.440] can use Logstash. [05:11.440 --> 05:17.440] So, originally, Logstash was implemented to provide a way to aggregate logs into Elasticsearch. [05:17.440 --> 05:25.440] But over time, Logstash evolved into an application that is used to integrate data into Elasticsearch, not only log data, but any kind of data. [05:25.440 --> 05:32.440] So, you can think of Logstash as a log aggregation pipeline that allows you to put data into Elasticsearch. [05:32.440 --> 05:37.440] And on top of that, we also have a set of different so-called Beats applications, which are [05:37.440 --> 05:43.440] lightweight log shippers that allow you to collect data and put it either directly into [05:43.440 --> 05:50.440] Elasticsearch, or through Logstash into Elasticsearch, or into different other data sources. [05:50.440 --> 05:54.440] The specific thing about the Beats applications is that they are lightweight in nature, so [05:54.440 --> 05:59.440] they are supposed to not consume a lot of resources such as CPU and memory. [05:59.440 --> 06:07.440] And for that reason, they allow you to collect log data or other data and put it into Elasticsearch. [06:07.440 --> 06:12.440] Now, you can think of Elasticsearch as a web server built on top of the Apache Lucene [06:12.440 --> 06:13.440] library. [06:13.440 --> 06:20.440] So, the Apache Lucene library is an actively developed Java library that is used by different [06:20.440 --> 06:25.440] applications that want to implement some kind of search functionality. [06:25.440 --> 06:27.440] And Elasticsearch is one of them. [06:27.440 --> 06:31.440] So, I'll show briefly in a few slides how Elasticsearch interacts with the Apache Lucene [06:31.440 --> 06:32.440] library. [06:32.440 --> 06:36.440] And another way to describe Elasticsearch is as a document-oriented database. [06:36.440 --> 06:43.440] So, Elasticsearch is used by different projects not only for searching, but also as a NoSQL [06:43.440 --> 06:44.440] database. [06:44.440 --> 06:49.440] So, I had a few projects where Elasticsearch was used purely as a NoSQL database, not as [06:49.440 --> 06:50.440] a search engine. [06:50.440 --> 06:54.440] And one can think, okay, Elasticsearch is a Java application. [06:54.440 --> 06:57.440] Why can't I use Apache Lucene directly?
[06:57.440 --> 07:02.440] And the reason is that Elasticsearch provides a number of features that are missing in the [07:02.440 --> 07:09.440] Apache Lucene library and that allow you to implement search in your project way more easily than [07:09.440 --> 07:10.440] using Apache Lucene directly. [07:10.440 --> 07:16.440] Some of these features are, for example, the JSON-based REST API, which is quite easy to use: quite [07:16.440 --> 07:20.440] easy to write search queries, to index data into Elasticsearch, and so on. [07:20.440 --> 07:25.440] There is also a really nice clustering mechanism implemented in Elasticsearch that allows [07:25.440 --> 07:30.440] you to bring up and scale your Elasticsearch cluster quite easily, something that's not [07:30.440 --> 07:35.440] possible if you use Apache Lucene directly in your project. [07:35.440 --> 07:39.440] And it also has a number of other features, such as, for example, caching, that allow [07:39.440 --> 07:44.440] you to improve the performance of your search queries, and so on. [07:44.440 --> 07:50.440] Now, the basic data structure used by Elasticsearch is the so-called inverted index, and [07:50.440 --> 07:55.440] indexes are stored on disk in separate files, or Lucene segments. [07:55.440 --> 07:58.440] Search can be performed on multiple indexes at a time. [07:58.440 --> 08:01.440] That's one of the capabilities of Elasticsearch. [08:01.440 --> 08:06.440] And in earlier versions of Elasticsearch, documents were logically grouped by types. [08:06.440 --> 08:12.440] That was effectively deprecated as of version 7 of Elasticsearch, and it's expected to be [08:12.440 --> 08:15.440] dropped. [08:15.440 --> 08:20.440] In order to ensure relevancy when you search for some data in Elasticsearch, [08:20.440 --> 08:28.440] Elasticsearch uses a set of different algorithms to score result relevance. [08:28.440 --> 08:32.440] In the later versions of Elasticsearch, this algorithm is BM25. [08:32.440 --> 08:38.440] In earlier versions of Elasticsearch, this was a simpler algorithm which is called TF-IDF. [08:38.440 --> 08:44.440] And the basis of those algorithms is how many times a term occurs in a document, [08:44.440 --> 08:49.440] and how many times this term occurs across all documents that are currently indexed in Elasticsearch. [08:49.440 --> 08:55.440] Based on that, by default, Elasticsearch scores every result that gets returned by your search [08:55.440 --> 09:02.440] query, and by default, it returns results sorted by relevance score. [09:02.440 --> 09:07.440] Now, why would you use Elasticsearch in favor of, for example, a relational database? [09:07.440 --> 09:13.440] Well, it provides faster retrieval of documents in way more scenarios than a traditional [09:13.440 --> 09:15.440] relational database can. [09:15.440 --> 09:22.440] So, as you know, traditional relational databases provide faster searches through indexes. [09:22.440 --> 09:26.440] However, indexes in relational databases have many limitations depending on the type of [09:26.440 --> 09:28.440] SQL queries that you write. [09:28.440 --> 09:34.440] In Elasticsearch, the inverted index data structure provides the capability to cover [09:34.440 --> 09:38.440] way more scenarios for searching using more complex queries. [09:38.440 --> 09:45.440] And for that reason, many projects choose to use Elasticsearch as a search engine.
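For reference, the BM25 function mentioned above has roughly this shape (a simplified textbook form; Lucene's actual implementation differs in some details):

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
    \frac{f(q_i, D) \cdot (k_1 + 1)}
         {f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is how often query term q_i occurs in document D, |D| is the document length, avgdl is the average document length in the index, and k_1 and b are tuning parameters (commonly around 1.2 and 0.75). The IDF factor captures the second ingredient the speaker mentions: how rare the term is across all indexed documents.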
[09:45.440 --> 09:51.440] Now, documents in Elasticsearch also might not have an explicit schema, as you have in [09:51.440 --> 09:55.440] a relational database, and that's typical for many NoSQL databases. [09:55.440 --> 10:00.440] An explicit schema, however, can be defined on the fields, and certain fields can even [10:00.440 --> 10:04.440] have different types mapped to them. [10:04.440 --> 10:10.440] This is needed because sometimes you need to use different kinds of search queries based [10:10.440 --> 10:14.440] on the field type, and some field types pose limitations. [10:14.440 --> 10:19.440] So, that's why you might need to have multiple types on a single field in Elasticsearch. [10:19.440 --> 10:23.440] Now, this was a brief look at what Elasticsearch is and how it works. [10:23.440 --> 10:26.440] Now, let's see what the architecture of Elasticsearch looks like. [10:26.440 --> 10:31.440] Elasticsearch, as I mentioned to you, is designed with clustering in mind. [10:31.440 --> 10:37.440] By default, in later versions of Elasticsearch, if you create an index, it has [10:37.440 --> 10:41.440] one primary shard and one replica shard. [10:41.440 --> 10:43.440] So, what is a shard? [10:43.440 --> 10:49.440] Now, an Elasticsearch index contains one or more primary shards that distribute the data in the [10:49.440 --> 10:51.440] Elasticsearch cluster. [10:51.440 --> 10:57.440] Below that, an Elasticsearch shard is, in fact, a Lucene index, and a Lucene index is, [10:57.440 --> 11:02.440] in fact, the data structure that stores the data on disk in terms of Lucene segments. [11:02.440 --> 11:07.440] Lucene segments are the physical files that store data on the disk. [11:07.440 --> 11:12.440] Now, when you index data in Elasticsearch, you might also have replica shards. [11:12.440 --> 11:17.440] Replica shards provide you with the possibility to enable high availability and data replication [11:17.440 --> 11:20.440] at the level of the Elasticsearch cluster. [11:20.440 --> 11:27.440] So, two types of shards: primary and replica shards. [11:27.440 --> 11:32.440] The more nodes you add to the Elasticsearch cluster, the more data gets distributed among the [11:32.440 --> 11:33.440] shards. [11:33.440 --> 11:38.440] Now, it's very important that you plan up front the number of primary shards based on [11:38.440 --> 11:40.440] the data growth that you expect. [11:40.440 --> 11:45.440] It's very difficult to change the number of primary shards later in your project lifecycle; [11:45.440 --> 11:48.440] you would need to re-index the data. [11:48.440 --> 11:52.440] However, if you want to change the number of replica shards, that's easier to do later [11:52.440 --> 11:53.440] in time. [11:53.440 --> 11:56.440] So, it's very important that you plan up front the number of primary shards on an [11:56.440 --> 11:58.440] index that you create. [11:58.440 --> 12:02.440] Now, by default, Elasticsearch tries to balance the number of shards across the nodes that [12:02.440 --> 12:04.440] you have. [12:04.440 --> 12:08.440] And one of the other capabilities that Elasticsearch provides is that if a node fails, [12:08.440 --> 12:14.440] you still can get search results, or so-called partial results can be returned, even if some [12:14.440 --> 12:17.440] of the nodes in the cluster are not available. [12:17.440 --> 12:23.440] Now, by default, Elasticsearch determines the shard where a document is indexed based [12:23.440 --> 12:25.440] on a relatively simple formula.
[12:25.440 --> 12:29.440] You take the hash of the routing key of the document. [12:29.440 --> 12:33.440] By default, this is the document ID, which can be generated in different ways. [12:33.440 --> 12:40.440] It can be generated by Elasticsearch if you don't specify one, or your application can supply the document ID, and so on and so forth. [12:40.440 --> 12:44.440] And you take that modulo the number of primary shards that you have defined on the [12:44.440 --> 12:46.440] index where you index the document. [12:46.440 --> 12:50.440] Now, as I mentioned, by default the routing key is the document ID, but you can also [12:50.440 --> 12:53.440] use a different routing key. [12:53.440 --> 13:02.440] And one interesting technique that some people use to control the distribution of data in the [13:02.440 --> 13:09.440] Elasticsearch cluster is specifying a custom routing key, which enables [13:09.440 --> 13:11.440] so-called shard routing. [13:11.440 --> 13:15.440] This is a technique that allows you to specify to which particular shard you want to send [13:15.440 --> 13:17.440] the document to be indexed. [13:17.440 --> 13:20.440] But that's used only in some specific scenarios. [13:20.440 --> 13:27.440] In most cases, people rely on the default mechanism that Elasticsearch uses to distribute [13:27.440 --> 13:30.440] data in the cluster. [13:30.440 --> 13:35.440] Now, by default, new nodes are discovered via multicast. [13:35.440 --> 13:41.440] If a cluster is discovered, a new node joins the cluster if it has the same cluster name. [13:41.440 --> 13:46.440] If a node on the same instance already runs on a specified port, and you try to run [13:46.440 --> 13:50.440] another node on that instance, Elasticsearch automatically gives you the next available [13:50.440 --> 13:52.440] port. [13:52.440 --> 13:59.440] Now, however, in some companies, multicast addresses are disabled for security [13:59.440 --> 14:00.440] reasons. [14:00.440 --> 14:04.440] And that's why the preferred mechanism to join new nodes to an Elasticsearch cluster is [14:04.440 --> 14:06.440] by using unicast addresses. [14:06.440 --> 14:11.440] In the Elasticsearch YAML configuration, you just need to specify one or more existing [14:11.440 --> 14:17.440] nodes from the Elasticsearch cluster so that the new node can join that existing cluster. [14:17.440 --> 14:23.440] And in that list of unicast nodes, you don't need to specify all the nodes in the Elasticsearch cluster. [14:23.440 --> 14:31.440] You just need to specify at least one node that has already joined the cluster. [14:31.440 --> 14:36.440] Now, when you bring up an Elasticsearch cluster, there are some considerations that you need [14:36.440 --> 14:37.440] to take into account. [14:37.440 --> 14:42.440] First of all, as I mentioned, sharding: it's very important for you to consider what should [14:42.440 --> 14:46.440] be the number of primary shards that you define on the Elasticsearch index, and the number [14:46.440 --> 14:50.440] of replica shards, which is easier to change over time. [14:50.440 --> 14:55.440] You also need to consider how much data you store in an Elasticsearch index. [14:55.440 --> 15:00.440] Indexes with too little data are not good, because that implies a lot of management [15:00.440 --> 15:01.440] overhead. [15:01.440 --> 15:04.440] And the same goes for indexes with too much data.
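A minimal sketch of that default routing formula, to make it concrete. This is illustrative only: Elasticsearch actually applies a Murmur3 hash to the routing value rather than String.hashCode(), and the class and method names here are made up for the example.

```java
public class ShardRouter {
    // Default routing rule described above:
    //   shard = hash(routing_key) % number_of_primary_shards
    // where the routing key defaults to the document ID.
    static int selectShard(String routingKey, int numberOfPrimaryShards) {
        // floorMod keeps the result non-negative even for negative hash values.
        return Math.floorMod(routingKey.hashCode(), numberOfPrimaryShards);
    }
}
```

The modulus is also why the number of primary shards is so hard to change later: with a different shard count, the same routing key would map to a different shard, which is why re-indexing is required.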
[15:04.440 --> 15:08.440] I've seen some cases where people store, let's say, more than two or three hundred gigabytes [15:08.440 --> 15:10.440] of data in an Elasticsearch index. [15:10.440 --> 15:15.440] And that really slows down search operations and other operations on that index. [15:15.440 --> 15:18.440] And people start wondering, okay, why is my indexing slow? [15:18.440 --> 15:20.440] Why are my search queries slow? [15:20.440 --> 15:24.440] And in many cases, the reason is that data is not distributed properly in the Elastic [15:24.440 --> 15:26.440] search index. [15:26.440 --> 15:32.440] The preferred amount of data that you should keep in an Elasticsearch shard is between [15:32.440 --> 15:34.440] five and ten gigabytes, roughly speaking. [15:34.440 --> 15:39.440] So if you have more data than that to put on a shard, you should consider splitting [15:39.440 --> 15:40.440] that data. [15:40.440 --> 15:47.440] So you either use more shards in the cluster, or you split the data into so-called sequential [15:47.440 --> 15:48.440] indexes. [15:48.440 --> 15:56.440] So, for example, you might have daily, weekly, or monthly indexes. [15:56.440 --> 15:58.440] Now, this is what I mentioned. [15:58.440 --> 16:02.440] So you should avoid putting too little data in the Elasticsearch cluster. [16:02.440 --> 16:08.440] Also, if you have too many shards defined on an index, that also introduces performance [16:08.440 --> 16:10.440] and management overhead. [16:10.440 --> 16:14.440] So you should consider splitting the data in the index rather than putting too [16:14.440 --> 16:17.440] many shards on a single index. [16:17.440 --> 16:23.440] And determining the number of shards should be a matter of upfront planning. [16:23.440 --> 16:28.440] Now, apart from the fact that you need to avoid putting large amounts of data [16:28.440 --> 16:35.440] in a single index, the main strategy that people use is to use, for example, a prefix [16:35.440 --> 16:37.440] when they split data into indexes. [16:37.440 --> 16:41.440] For example, you can use prefixes for daily, weekly, or yearly indexes. [16:41.440 --> 16:45.440] And if you do that, it's good practice to use aliases to reference data, [16:45.440 --> 16:52.440] so that you don't directly reference a particular index in your application, but rather use aliases. [16:52.440 --> 16:57.440] In terms of concurrency control, Elasticsearch does not provide pessimistic locking, [16:57.440 --> 17:00.440] like, for example, you have in relational databases. [17:00.440 --> 17:04.440] If you want to establish some form of concurrency control in Elasticsearch, in order to make [17:04.440 --> 17:10.440] sure that you don't have unexpected race conditions, Elasticsearch uses optimistic [17:10.440 --> 17:13.440] locking for concurrency control. [17:13.440 --> 17:18.440] The way this works is that when you index a document, there is a version attribute that can be [17:18.440 --> 17:19.440] specified. [17:19.440 --> 17:24.440] And if there is already a document indexed with that version, then the operation is [17:24.440 --> 17:26.440] rejected by Elasticsearch. [17:26.440 --> 17:31.440] Concurrency control can also be achieved with two parameters that can be specified [17:31.440 --> 17:33.440] when you index the document: [17:33.440 --> 17:40.440] the if_seq_no and if_primary_term parameters. If they no longer match the document that's indexed, then the operation gets rejected.
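A sketch of what that looks like from a client, assuming the 7.x Java high-level REST client (the index name, document, and values are made up for this example):

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class OptimisticUpdate {
    // Re-index document "1" only if it is still at the sequence number and
    // primary term we observed when we read it; otherwise Elasticsearch
    // rejects the write with a version-conflict error (HTTP 409).
    static void update(RestHighLevelClient client,
                       long seenSeqNo, long seenPrimaryTerm) throws Exception {
        IndexRequest request = new IndexRequest("products")
                .id("1")
                .source("{\"price\": 42}", XContentType.JSON)
                .setIfSeqNo(seenSeqNo)              // the if_seq_no parameter
                .setIfPrimaryTerm(seenPrimaryTerm); // the if_primary_term parameter
        client.index(request, RequestOptions.DEFAULT);
    }
}
```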
[17:40.440 --> 17:44.440] So if you want to establish some form of concurrency control in Elasticsearch, you can [17:44.440 --> 17:49.440] use this optimistic locking provided by Elasticsearch. [17:49.440 --> 17:54.440] In terms of high availability, you can create one or more copies, or so-called replicas, [17:54.440 --> 17:56.440] of an existing index. [17:56.440 --> 18:01.440] The number of replica shards is specified when you define the index settings, or you [18:01.440 --> 18:03.440] can change it later. [18:03.440 --> 18:07.440] Once an index request is sent to a particular shard, determined based on the hash of the [18:07.440 --> 18:09.440] document ID, [18:09.440 --> 18:11.440] the document is also sent to the shard's replicas. [18:11.440 --> 18:16.440] And one interesting property of Elasticsearch is that the replicas are not used only for [18:16.440 --> 18:20.440] high availability, but are also used for searching purposes, to improve performance. [18:20.440 --> 18:25.440] So when you have replica shards, they also participate in the search requests that you [18:25.440 --> 18:30.440] send to Elasticsearch. [18:30.440 --> 18:36.440] Now, this mechanism for improving performance is really nice, but this doesn't mean that [18:36.440 --> 18:41.440] you simply need to increase the number of replicas, because, of course, that increases [18:41.440 --> 18:42.440] management overhead. [18:42.440 --> 18:46.440] So it's also a matter of determining up front how many replicas you would need. [18:46.440 --> 18:51.440] And later on, if you plan to scale your cluster, you can also increase the number of replicas. [18:51.440 --> 18:56.440] So you should not define a lot of replica shards at the beginning when you define your [18:56.440 --> 18:57.440] indexes. [18:57.440 --> 19:00.440] Now, how is an index request processed? [19:00.440 --> 19:03.440] If we want to index a document in Elasticsearch, what happens? [19:03.440 --> 19:06.440] We send the request to a coordinating node. [19:06.440 --> 19:09.440] This is one of the nodes in the Elasticsearch cluster. [19:09.440 --> 19:14.440] And this coordinating node sends the request to the shard, to the node in the cluster where [19:14.440 --> 19:18.440] the document needs to be indexed and stored in Lucene segments. [19:18.440 --> 19:24.440] When the document reaches the Elasticsearch node in the cluster, the particular shard, [19:24.440 --> 19:28.440] it gets sent not directly to the disk, but to two in-memory areas. [19:28.440 --> 19:31.440] These are the memory buffer and the transaction log. [19:31.440 --> 19:35.440] Now, the memory buffer gets flushed every second to the disk. [19:35.440 --> 19:39.440] So when you index a document in Elasticsearch, you cannot expect it to be available right [19:39.440 --> 19:41.440] away for searching purposes. [19:41.440 --> 19:45.440] But there is also a parameter that you can use to force it to be written to disk right [19:45.440 --> 19:48.440] away, without waiting this one second for it to be flushed to disk. [19:48.440 --> 19:53.440] There is also the other area, which is called the transaction log, and it gets flushed [19:53.440 --> 19:54.440] less often. [19:54.440 --> 19:58.440] It gets flushed every 30 minutes or when it gets full. [19:58.440 --> 20:03.440] So the important takeaway from this is that when you index a document, you should not [20:03.440 --> 20:09.440] expect it to be available right away for searching, but you can force it to be.
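The parameter mentioned above is the refresh option on write requests. A minimal sketch, again assuming the 7.x high-level REST client, with a made-up index and document:

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.support.WriteRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class ImmediateRefresh {
    static IndexRequest buildRequest() {
        // RefreshPolicy.IMMEDIATE forces a refresh of the affected shards so
        // the document is searchable as soon as the call returns, instead of
        // after the usual ~1 second refresh interval. Use sparingly: it
        // produces many small segments and hurts indexing throughput.
        return new IndexRequest("logs")
                .source("{\"message\": \"hello\"}", XContentType.JSON)
                .setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
    }
}
```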
[20:09.440 --> 20:15.440] What happens if you send a search query to Elasticsearch? [20:15.440 --> 20:19.440] First, the search request gets sent to one of the nodes in the Elasticsearch cluster, [20:19.440 --> 20:21.440] the so-called coordinating node. [20:21.440 --> 20:23.440] Then we have two phases. [20:23.440 --> 20:24.440] First is the query phase. [20:24.440 --> 20:29.440] It asks all the shards, primary and replica shards: hey, do you contain some data for that [20:29.440 --> 20:30.440] search query? [20:30.440 --> 20:33.440] And this information gets returned to the coordinating node. [20:33.440 --> 20:38.440] Based on that information, the coordinating node determines which nodes it needs to query. [20:38.440 --> 20:42.440] And in the second, fetch phase, it sends the request to the shards that have some data for [20:42.440 --> 20:47.440] that search query and returns the result back to the client. [20:47.440 --> 20:53.440] Now, in terms of how the Elasticsearch code base is structured, this is a snapshot [20:53.440 --> 20:57.440] from the public GitHub code base of Elasticsearch. [20:57.440 --> 21:01.440] Now, what I'm speaking about in this presentation applies to the public code base of Elastic [21:01.440 --> 21:05.440] search, because as of version 7.16 there was a licensing change, and there is a lot of [21:05.440 --> 21:10.440] controversy in the open source communities about whether Elasticsearch is still open source [21:10.440 --> 21:11.440] or not. [21:11.440 --> 21:13.440] So we can have a discussion about that after the session. [21:13.440 --> 21:17.440] I'm not going to go into the details, but the main point of this licensing change [21:17.440 --> 21:22.440] is to protect Elasticsearch from other vendors willing to provide Elasticsearch as a service, [21:22.440 --> 21:27.440] not from people willing to customize Elasticsearch or to use it for their in-house projects [21:27.440 --> 21:28.440] and so on. [21:28.440 --> 21:33.440] So this is the structure of the Elasticsearch code base, which has been like this since the [21:33.440 --> 21:35.440] Apache-licensed code base. [21:35.440 --> 21:39.440] So, Elasticsearch gets built with GitHub Actions. [21:39.440 --> 21:43.440] You can also see the definitions in the .github folder. [21:43.440 --> 21:46.440] The main server application is in the server folder. [21:46.440 --> 21:50.440] The documentation that gets generated on the official Elasticsearch website is in the docs [21:50.440 --> 21:51.440] folder. [21:51.440 --> 21:55.440] We have the main modules for the Elasticsearch server application in the modules folder and [21:55.440 --> 21:58.440] the internal plugins in the plugins folder. [21:58.440 --> 22:03.440] The implementation of the REST-based Java client for Elasticsearch, the high-level and the [22:03.440 --> 22:07.440] low-level REST clients, is in the client folder, and in the distribution folder you can find [22:07.440 --> 22:11.440] the Gradle scripts that allow you to build the different distributions of Elasticsearch [22:11.440 --> 22:14.440] for Linux, Windows, and so on. [22:14.440 --> 22:18.440] Now, I would say the structure of the code repository is very logical. [22:18.440 --> 22:20.440] It's easy to navigate. [22:20.440 --> 22:25.440] So you can just go into GitHub, and if you need to see, for example, how a particular [22:25.440 --> 22:30.440] plugin or module is implemented, you can just go to GitHub and check it out.
[22:30.440 --> 22:34.440] Now, internally Elasticsearch is comprised of different modules. [22:34.440 --> 22:39.440] And in earlier versions, Elasticsearch used a modified version of Google Guice for module [22:39.440 --> 22:44.440] binding, but they're slowly shifting away from Google Guice in favor of their own internal [22:44.440 --> 22:46.440] module system. [22:46.440 --> 22:51.440] So modules are loaded on startup, when the Elasticsearch server starts up. [22:51.440 --> 22:57.440] And in this simple example, I've shown how modules were bound internally [22:57.440 --> 22:59.440] when the node starts up. [22:59.440 --> 23:01.440] So we use a module binder. [23:01.440 --> 23:03.440] In the earlier versions, this was a Google Guice binder. [23:03.440 --> 23:08.440] And then we bind particular module classes to their implementations. [23:08.440 --> 23:13.440] And then, wherever you need them, you can reference them in the Elasticsearch code base. [23:13.440 --> 23:17.440] It's a very simple dependency injection mechanism. [23:17.440 --> 23:22.440] Now, when Elasticsearch starts up, you can imagine it's a simple Java application. [23:22.440 --> 23:26.440] The main class is org.elasticsearch.bootstrap.Elasticsearch. [23:26.440 --> 23:30.440] It boils down to calling the start method of the Node class. [23:30.440 --> 23:36.440] And the start method, in fact, loads up all the modules of the Elasticsearch node. [23:36.440 --> 23:47.440] Now, some of these core modules are, for example, the modules that provide the REST API of Elastic [23:47.440 --> 23:52.440] search, or the module that allows you to establish clustering in Elasticsearch, the so-called [23:52.440 --> 23:54.440] transport module. [23:54.440 --> 24:02.440] There is a module that allows you to build plugins for Elasticsearch, and so on and so forth. [24:02.440 --> 24:06.440] Now, how does Elasticsearch internally interact with Lucene? [24:06.440 --> 24:11.440] When you start up the node, the node also provides different services that are [24:11.440 --> 24:14.440] used by the modules of Elasticsearch. [24:14.440 --> 24:21.440] And, for example, when you start up a node, there is a createShard method [24:21.440 --> 24:27.440] that gets called, IndexService.createShard, to create and initialize the shards that are [24:27.440 --> 24:29.440] part of this Elasticsearch node. [24:29.440 --> 24:36.440] And then, if you want to index a new document, it boils down to calling IndexShard.applyIndexOperationOnPrimary [24:36.440 --> 24:38.440] on the primary shard. [24:38.440 --> 24:43.440] Then, this boils down to calling the index method on the IndexShard class. [24:43.440 --> 24:50.440] And the IndexShard class goes down to an InternalEngine class that calls indexIntoLucene. [24:50.440 --> 24:53.440] Then, that calls InternalEngine.addDocs. [24:53.440 --> 24:58.440] And at the end, we just call IndexWriter.addDocuments; IndexWriter is a class from the Apache Lucene library. [24:58.440 --> 25:02.440] So, it boils down to calling different methods from the Lucene API. [25:02.440 --> 25:07.440] And on top of that, we have a lot of initialization and services happening. [25:07.440 --> 25:12.440] So, in a way, you can think that, apart from all the functionality that Elasticsearch [25:12.440 --> 25:16.440] provides, the integration with the Apache Lucene library just boils down to calling the [25:16.440 --> 25:22.440] different APIs that Apache Lucene provides.
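To make the bottom of that call chain concrete, here is a minimal standalone sketch using the plain Lucene API, outside Elasticsearch; the index path and field name are made up. An Elasticsearch shard ultimately wraps exactly this kind of Lucene index:

```java
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexDemo {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Path.of("/tmp/demo-index"));
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("message", "hello lucene", Field.Store.YES));
            writer.addDocument(doc); // what InternalEngine ultimately calls
            writer.commit();         // persist the segments to disk
        }
    }
}
```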
[25:22.440 --> 25:28.440] And last but not least, I'll show how you can build a very simple Elasticsearch plugin. [25:28.440 --> 25:33.440] Now, if you look at the Elasticsearch code base, it already has some built-in plugins that [25:33.440 --> 25:34.440] you can use. [25:34.440 --> 25:38.440] And there is a very nice elasticsearch-plugin utility that you can use to manage plugins: [25:38.440 --> 25:41.440] to install them, remove them, and so on and so forth. [25:41.440 --> 25:45.440] If you build your own plugin, you can use the same utility to install the plugin, and [25:45.440 --> 25:48.440] it gets placed in a folder in your node installation. [25:48.440 --> 25:52.440] So, if you install a plugin, you need to make sure that it's installed on all the nodes in [25:52.440 --> 25:54.440] your cluster. [25:54.440 --> 26:01.440] Because many plugins are cluster-aware, they need to be installed on every node in the cluster. [26:01.440 --> 26:05.440] Elasticsearch plugins are bundled as ZIP archives, along with their dependencies, and [26:05.440 --> 26:10.440] all of them must have a class that extends the org.elasticsearch.plugins.Plugin class. [26:10.440 --> 26:16.440] There is a PluginService which is responsible for loading the plugins in Elasticsearch. [26:16.440 --> 26:22.440] Now, let's see how we can create a very simple ingest plugin that allows you to filter words [26:22.440 --> 26:24.440] from a field of an indexed document. [26:24.440 --> 26:28.440] So, when you index a document, you can specify from which field which words you want to [26:28.440 --> 26:29.440] filter out. [26:29.440 --> 26:33.440] This is a very common scenario, for example, if you want to implement something [26:33.440 --> 26:38.440] that allows you to clear certain content from documents, and so on and so forth. [26:38.440 --> 26:41.440] It's probably one of the simplest plugins you might have. [26:41.440 --> 26:45.440] So, first we have a FilterIngestPlugin class that extends the Plugin class and implements [26:45.440 --> 26:46.440] IngestPlugin. [26:46.440 --> 26:50.440] We have different interfaces for the different types of plugins you might have for Elastic [26:50.440 --> 26:54.440] search, and IngestPlugin is one of these interfaces. [26:54.440 --> 27:01.440] Then you implement the getProcessors method, because an ingest plugin needs to have [27:01.440 --> 27:06.440] processors that you can define, which do something with the documents before they are indexed. [27:06.440 --> 27:13.440] In the getProcessors method, what we do is get the filter word from the parameters that [27:13.440 --> 27:17.440] we supply on the ingest processor that we define in Elasticsearch. [27:17.440 --> 27:19.440] And then we get the filter field. [27:19.440 --> 27:23.440] So, we have two parameters: the word that we want to filter out, and from which field [27:23.440 --> 27:26.440] of the document we want to filter it out. [27:26.440 --> 27:33.440] Then we create a map of processors, and in that map we put the FilterWordProcessor [27:33.440 --> 27:36.440] that we create from this class and return it. [27:36.440 --> 27:40.440] You can also have multiple processors defined in that plugin. [27:40.440 --> 27:44.440] Now, what does the FilterWordProcessor look like? [27:44.440 --> 27:48.440] The FilterWordProcessor extends AbstractProcessor from Elasticsearch. [27:48.440 --> 27:51.440] It, again, comes from the core classes of Elasticsearch.
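A sketch of what the two classes just described (and narrated next) might look like, assuming roughly the 7.x plugin API. The exact Processor.Factory signature varies between Elasticsearch versions, and the processor type name "filter_word" and its "field"/"word" settings are made up for this example:

```java
import java.util.Map;
import org.elasticsearch.ingest.AbstractProcessor;
import org.elasticsearch.ingest.ConfigurationUtils;
import org.elasticsearch.ingest.IngestDocument;
import org.elasticsearch.ingest.Processor;
import org.elasticsearch.plugins.IngestPlugin;
import org.elasticsearch.plugins.Plugin;

public class FilterIngestPlugin extends Plugin implements IngestPlugin {

    @Override
    public Map<String, Processor.Factory> getProcessors(Processor.Parameters parameters) {
        // Register an ingest processor type named "filter_word" whose factory
        // reads the two pipeline parameters: the field to clean and the word
        // to remove from that field.
        return Map.of("filter_word",
                (factories, tag, description, config) -> new FilterWordProcessor(
                        tag, description,
                        ConfigurationUtils.readStringProperty("filter_word", tag, config, "field"),
                        ConfigurationUtils.readStringProperty("filter_word", tag, config, "word")));
    }
}

class FilterWordProcessor extends AbstractProcessor {

    private final String field;
    private final String word;

    FilterWordProcessor(String tag, String description, String field, String word) {
        super(tag, description);
        this.field = field;
        this.word = word;
    }

    @Override
    public IngestDocument execute(IngestDocument document) {
        // Read the field value, strip every occurrence of the configured word,
        // and write the cleaned value back before the document is indexed.
        String value = document.getFieldValue(field, String.class);
        document.setFieldValue(field, value.replace(word, ""));
        return document;
    }

    @Override
    public String getType() {
        return "filter_word";
    }
}
```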
[27:51.440 --> 27:53.440] And we have an execute method. [27:53.440 --> 27:57.440] In the execute method, we get the document that we want to index. [27:57.440 --> 27:58.440] This is the IngestDocument. [27:58.440 --> 28:03.440] We get the value of the particular field that we want to filter, and then we replace [28:03.440 --> 28:05.440] the filter word in that value with the empty string. [28:05.440 --> 28:10.440] And then we set the value back on that field and return the document. [28:10.440 --> 28:14.440] This, when you index a document and you specify that ingest processor, [28:14.440 --> 28:18.440] applies the filtering on that document before it gets indexed into Elasticsearch. [28:18.440 --> 28:22.440] Now, besides those two classes, if you want to build a plugin, you also need to supply some simple [28:22.440 --> 28:28.440] plugin metadata, then build it, for example, with Maven or with Gradle, [28:28.440 --> 28:31.440] and then you can install it with the elasticsearch-plugin utility. [28:31.440 --> 28:37.440] And in that manner, you can build any plugin you would like for Elasticsearch. [28:37.440 --> 28:41.440] And since we are running out of time, I'm not sure if we have some time for one [28:41.440 --> 28:42.440] or two questions, maybe. [28:42.440 --> 28:50.440] Do we have time for questions? [28:50.440 --> 28:51.440] Yes, of course. [28:51.440 --> 28:52.440] Okay, so, anybody? [28:52.440 --> 28:53.440] Yeah? [28:53.440 --> 28:55.440] Hey, thanks for your insights. [28:55.440 --> 28:59.440] We saw how too many shards can cause problems. [28:59.440 --> 29:00.440] Yeah. [29:00.440 --> 29:01.440] Yes? [29:01.440 --> 29:06.440] I was curious, how does one know how many shards are going to be enough? [29:06.440 --> 29:12.440] Well, I would say it depends on an upfront estimation of how much data you expect to put in that [29:12.440 --> 29:13.440] index. [29:13.440 --> 29:17.440] So you need to determine up front, okay, in the first phase of my project, [29:17.440 --> 29:21.440] how many, let's say, gigabytes of data will I have? [29:21.440 --> 29:25.440] And based on that, you determine the initial set of shards you put in, [29:25.440 --> 29:29.440] and if those shards still have a lot of data, then you consider partitioning the index. [29:29.440 --> 29:34.440] And it's a matter of upfront planning to determine that. [29:34.440 --> 29:35.440] Okay? [29:35.440 --> 29:36.440] Yeah? [29:36.440 --> 29:41.440] What is the data structure used to store indexes and data? [29:41.440 --> 29:42.440] It's the inverted index. [29:42.440 --> 29:43.440] This is the data structure. [29:43.440 --> 29:44.440] Yeah? [29:44.440 --> 29:45.440] Inverted index. [29:45.440 --> 29:46.440] Inverted index. [29:46.440 --> 29:51.440] It's called an inverted index because it's a map between terms [29:51.440 --> 29:56.440] and documents: for each term, you have pointers to the documents that contain that term. [29:56.440 --> 29:59.440] So it's called an inverted index.
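To illustrate that closing answer, here is a toy inverted index in Java: a map from each term to the IDs of the documents containing it. Lucene's real postings lists are far more compact and also store positions and frequencies, so this is purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyInvertedIndex {
    // term -> IDs of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void index(int docId, String text) {
        // Naive analysis: lowercase and split on whitespace.
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.index(1, "Elasticsearch is built on Lucene");
        idx.index(2, "Lucene stores segments on disk");
        System.out.println(idx.search("lucene")); // prints [1, 2]
    }
}
```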