[00:00.000 --> 00:11.440] Hi everybody. Let's begin. Thank you all for coming. My name is Roman Nozdrin, I'm from [00:11.440 --> 00:19.080] MariaDB Corporation, from the ColumnStore team, and today I will be sitting, just because the [00:19.080 --> 00:28.200] clicker doesn't work. So today I will tell you about new features that are coming in [00:28.200 --> 00:37.760] analytics, and maybe not so new ones, but something I find important. For those who don't know [00:37.760 --> 00:46.880] what MariaDB ColumnStore is: it's an OLAP storage engine for MariaDB server. It's column-oriented, [00:46.880 --> 00:53.000] it's MPP, and it has a fancy two-tier distributed storage, where the data is distributed [00:53.000 --> 00:58.360] across nodes and all the nodes have the possibility to distribute it even more using [00:58.360 --> 01:08.960] DBRoots, where DBRoots are basically Linux mounts.

And I will start with the most [01:08.960 --> 01:16.120] obscure topic, in my opinion, and that is the MariaDB ColumnStore version numbering. [01:16.120 --> 01:27.320] Why is it obscure? Because I still don't understand the reasoning that the server team uses to put this [01:27.320 --> 01:33.560] or that version of ColumnStore into the community server. But anyway, first of all, [01:33.560 --> 01:39.400] ColumnStore migrated to yet another version numbering scheme, to simplify understanding what [01:39.400 --> 01:48.080] the actual ColumnStore version in the community server is. And that is: there are three numbers, [01:48.080 --> 01:54.680] first goes the year, then the month, and then the patch number. And at the bottom of the slide [01:54.680 --> 02:04.320] you can see the actual mapping that shows which ColumnStore versions are published [02:04.320 --> 02:12.720] with the community server. The most notable thing here is the last row, which tells when [02:12.720 --> 02:21.040] the next stable, the most feature-rich release, will be published with the community server.

[02:21.040 --> 02:29.280] Let's first glance at the features available in the current stable that is shipped with [02:30.040 --> 02:39.480] community 10.5 and 10.6. Most notable is the filtering vectorization for x86. And guess what? [02:39.480 --> 02:46.800] SIMD processing is not the ultimate answer to life, the universe and everything. I mean, the benchmarks, [02:46.800 --> 02:54.600] when I first did them, when I first ran them, they showed me a 10x speedup compared with [02:54.600 --> 03:02.640] the non-vectorized code. But when I ran the full-blown query processing pipeline, [03:02.640 --> 03:14.560] I got just a 30 to 40% speedup. So this is a great speedup, but not the ultimate answer.

[03:14.720 --> 03:24.320] Anyway, the next feature is an external GROUP BY. And this is what it is: it allows a GROUP BY [03:24.320 --> 03:31.360] operation to crunch data sets that don't fit into memory. And as usual, [03:31.360 --> 03:38.120] it consists of two phases. The first phase calculates the partial aggregates, and when they don't fit [03:38.200 --> 03:45.920] into memory, it flushes them and stores them on disk. And the second phase sequentially goes over the [03:45.920 --> 03:53.760] stored pre-aggregates and merges them together to produce the resulting data set. As I have seen [03:53.760 --> 04:02.840] in other open-source OLAP engines, the second phase takes 2x the memory compared to the first [04:03.800 --> 04:12.160] phase, in the worst case. This feature is disabled by default, and one needs to explicitly enable [04:12.160 --> 04:18.480] it with the settings in Columnstore.xml. These settings are shown at the bottom of the slide.
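To illustrate the kind of query this targets, here is a minimal sketch; the table and column names are hypothetical, and the actual Columnstore.xml settings are the ones on the slide, not reproduced here.

```sql
-- Minimal sketch (hypothetical schema): a high-cardinality aggregation whose
-- intermediate hash table can outgrow memory. With disk-based aggregation
-- enabled in Columnstore.xml, phase one spills partial aggregates to disk and
-- phase two merges the spilled pre-aggregates into the final result set.
SELECT user_id,
       COUNT(*)    AS events,
       SUM(amount) AS total_amount
FROM   events_cs              -- hypothetical ColumnStore fact table
GROUP  BY user_id;
```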
[04:18.480 --> 04:30.320] The next feature is LZ4 compression for data files. We had heard a lot that LZ4 beats Snappy in [04:30.360 --> 04:38.120] terms of compression speed, so we decided to give it a try. And guess what? Snappy still [04:38.120 --> 04:47.200] delivers a better compression ratio; it is about 5% better compared to LZ4-compressed data files. [04:47.200 --> 04:54.360] So there is always a trade-off between space and the compression speed, and thus the query [04:54.400 --> 05:03.040] processing speed. This is also disabled by default in the current stable. And to enable it, [05:03.040 --> 05:09.680] you need to set a MariaDB client session variable. So you basically connect to MariaDB [05:09.680 --> 05:16.680] and set this variable just before you create a table, or you can set it in the configuration [05:16.680 --> 05:26.800] file of the MariaDB server (a small sketch of this follows below). And we decided to make this the new default starting with the new stable that is [05:26.800 --> 05:38.200] coming this April or May.

What's the benefit? It means that the compression and decompression [05:39.200 --> 05:47.360] speeds are fine. Good point, it might be obscure: the compression ratio for Snappy is still [05:47.360 --> 05:56.880] better, but the decompression speed is faster for LZ4. Not really, because ingestion speed [05:56.880 --> 06:02.480] is not the big question. I mean, everybody wants to ingest their data as fast as possible, [06:02.480 --> 06:09.400] but let's be honest, we are interested in SELECTs, not INSERTs, in the first place. [06:09.400 --> 06:24.880] Yeah, that is true for OLTP.

Okay, and now we come to the new upcoming stable of ColumnStore. [06:25.800 --> 06:33.240] And there are some features that I will not discuss in detail, but there are some interesting ones [06:33.240 --> 06:43.080] that I will look into later. So let's quickly go over these ones. We now officially [06:43.080 --> 06:54.360] support the ARM64 platform, with vectorization as well, using NEON. The builds for this platform produce a [06:54.360 --> 07:05.000] slightly faster runtime compared to our normal nightly CI runs. But I have seen that it [07:05.000 --> 07:13.000] consumes about 5% more RAM running DML-heavy workloads like updates or deletes.

[07:13.000 --> 07:25.000] The next tool is called mcsRebuildEM. And to explain why we decided to write this [07:25.000 --> 07:34.200] tool, I need to make a small detour. So how does ColumnStore store the data? There are two parts: [07:34.200 --> 07:39.520] the data itself and the metadata that describes where the data is. Like, when you have multiple [07:39.520 --> 07:46.560] nodes, you need to tell the query coordinator where to go for the data files. So the metadata [07:46.560 --> 07:56.920] describes the location of data, and ColumnStore tries its best to preserve this metadata [07:56.920 --> 08:03.760] intact, and it has multiple copies of the meta. So if one meta is corrupted for some reason, because [08:04.320 --> 08:14.560] life is not ideal, there are multiple copies that you can use as a backup to restore. However, [08:14.560 --> 08:22.240] if one loses the meta, there is no way to access the data itself properly, even if the data files [08:22.240 --> 08:30.360] are intact. So this tool allows one to produce the metadata from the data files. [08:30.360 --> 08:40.440] The caveat is that you have to have created all the tables with ColumnStore starting from 6.4.4 to allow this feature to work.
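Here is a minimal sketch of the session-variable approach mentioned above for LZ4. It assumes the columnstore_compression_type session variable and the value name LZ4; the exact variable name and accepted values can differ between ColumnStore versions, so treat this as illustrative rather than authoritative.

```sql
-- Minimal sketch, assuming columnstore_compression_type controls the compression
-- of newly created ColumnStore tables (check your version's documentation).
SET SESSION columnstore_compression_type = 'LZ4';

CREATE TABLE orders_lz4 (
    id     BIGINT,
    amount DECIMAL(10,2)
) ENGINE=ColumnStore;   -- data files for this table are written LZ4-compressed
```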
[08:40.440 --> 08:47.560] The next feature is distributed JSON function support. Technically speaking, [08:48.120 --> 08:52.760] there were JSON functions in ColumnStore from the very beginning, but unfortunately, [08:52.760 --> 08:58.440] to use them you needed to fall back to a very slow mode, we call it table mode. That is basically [08:59.240 --> 09:10.520] MariaDB crunching the data, but it asks for a full table scan from ColumnStore. So it's not very [09:10.520 --> 09:20.040] scalable, not fast. So we implemented these as distributed functions, and now you can use [09:20.280 --> 09:29.800] all but a couple of functions, and those are JSON_OBJECTAGG and JSON_TABLE (a small example follows below). The last one is very, [09:30.520 --> 09:35.720] how to say, MariaDB-specific: it produces a relation. [09:39.720 --> 09:48.200] That is also true. But by specific I mean that we cannot produce a MariaDB relation, [09:48.200 --> 09:55.880] to be honest. So we decided to postpone it, because nobody is very interested in this one. [09:57.400 --> 10:06.120] But we might implement it. Good suggestion. Is anybody interested, and maybe does anybody know what [10:06.120 --> 10:19.560] JSON_TABLE does? Okay, that's a good statistic.

So now, here are the features that I want to [10:19.560 --> 10:28.120] discuss in some more detail. First is the auxiliary column. What it does is basically speed up [10:28.120 --> 10:35.880] deletes from 3x up to 50x, depending on the SQL schema. And what does that mean? The more columns you [10:35.880 --> 10:42.280] have in the table, the bigger the speedup will be. It is disabled. It has an additional [10:42.280 --> 10:47.560] speed-up configuration option; I will explain it a bit later. But let's take a look at the bottom [10:47.560 --> 10:57.320] half of the slide. This is a very simplified data file layout to explain how the delete operation [10:57.320 --> 11:02.120] worked previously, before the patch, on the left, and after the patch, on the right. So let's concentrate [11:02.120 --> 11:10.120] on the left part. You can see the violet and blue empties. So when a delete comes into the table, [11:10.120 --> 11:16.440] it replaces the actual values in the columns, because we are columnar, with special magic [11:16.440 --> 11:24.760] values that are called empty, which are specific to the data type. And it does it in place. And [11:24.760 --> 11:32.360] moreover, the delete operation has to store the actual block for versioning before it changes [11:32.360 --> 11:41.960] the data in place. So there are two block copies to do, and there are four in-place changes that [11:41.960 --> 11:51.000] have to be flushed. So what does the auxiliary column change in this pattern? First, you can see [11:51.000 --> 11:57.400] on the right side there is an additional column. It is basically a flag. So the delete operation [11:57.400 --> 12:04.040] now goes over this auxiliary column and changes the flag over there. And this auxiliary column is [12:04.040 --> 12:10.920] only one byte. So you need to store only one block of data, and you don't need to do in-place [12:10.920 --> 12:18.440] changes in all the columns. That's where the speedup comes from. As I said, there is an [12:18.440 --> 12:24.680] additional opportunity to speed it up even more, and this is to enable [12:24.680 --> 12:32.040] fast delete. I will not discuss it in detail, but internally it doesn't update the metadata [12:32.040 --> 12:35.800] file, so it makes the delete operation even faster. [12:36.360 --> 12:42.120] Basically, you have a new column indicating if it's deleted or not, and you just update it. [12:42.120 --> 12:50.280] Yes, that's true. We have plans to use this auxiliary column even more, [12:52.840 --> 12:56.200] but I will discuss that later, when we finally implement those features. [12:56.520 --> 13:07.720] And to be honest, this fast delete gives an opportunity to implement an append-only [13:09.240 --> 13:16.120] update. That should also be a very big boost for the update operation, but stay tuned, not today.
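To make the distributed JSON support mentioned above a bit more concrete, here is a small sketch. The table and document shape are hypothetical; the point is only that ordinary MariaDB JSON functions in the projection and filter can now be evaluated by ColumnStore itself rather than through the table-mode fallback.

```sql
-- Hypothetical ColumnStore table holding JSON documents in a text column.
CREATE TABLE events_json (
    id  BIGINT,
    doc LONGTEXT
) ENGINE=ColumnStore;

-- JSON functions such as JSON_VALUE are now processed inside ColumnStore
-- (distributed), instead of falling back to the slow table mode.
SELECT id,
       JSON_VALUE(doc, '$.user.country') AS country
FROM   events_json
WHERE  JSON_VALUE(doc, '$.type') = 'click';
```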
[13:17.400 --> 13:23.720] The next feature is an extent map scalability improvement. And what is the extent map? If you recall, a couple of [13:23.720 --> 13:29.560] slides before I already mentioned the metadata that describes where the data is, and the extent map [13:29.560 --> 13:37.320] is the core of this metadata. It's basically an in-memory structure that allows mapping [13:37.320 --> 13:45.640] from a globally unique block number. A block is the minimal possible data unit in a cluster, [13:46.360 --> 13:54.040] and the extent map allows mapping from this block number to a tuple of OID, partition and segment, [13:54.040 --> 14:01.240] and vice versa, from the tuple to a globally unique block number. And these operations are used [14:01.240 --> 14:07.720] extensively in the cluster, because you always want to know where the data is and what OID [14:07.720 --> 14:15.640] a block belongs to. Originally, the extent map was basically an array with O(N) lookup [14:16.200 --> 14:24.280] complexity, and this array was replaced with... oh, I need to mention why this is a problem. Imagine an [14:24.280 --> 14:35.560] extent map that is 200 megabytes. So, going over 200 megabytes, where one entry is only about 100 bytes, [14:35.560 --> 14:45.400] imagine how many entries there are in these 200 megabytes. It takes a lot of time to look something up in such an enormous array (rough numbers are sketched below). [14:46.200 --> 14:53.080] This array was replaced with a red-black tree. This [14:54.280 --> 14:59.560] changes the complexity of the block-to-tuple mapping [15:00.520 --> 15:11.480] to O(log n). And to facilitate the other conversion, the mapping from tuple to block, [15:11.480 --> 15:18.360] we implemented the extent map index. It is basically a burger of a couple of hash maps on top [15:18.360 --> 15:30.920] and very tiny arrays at the bottom. So this gives us a mapping operation complexity of O(1). [15:32.600 --> 15:42.280] And here at the bottom you can see the results. These are excerpts from the cpimport logs. [15:42.280 --> 15:49.000] One demonstrates that the preprocessing step originally took roughly 30 seconds, and after the [15:49.000 --> 15:57.400] patch you can see that it decreases to four seconds. So if you have a huge extent [15:57.400 --> 16:07.320] map, it will give you even faster operations.

The next feature is the PrimProc and ExeMgr [16:07.640 --> 16:13.240] processes merge. If anybody here uses ColumnStore, they know there are a bunch of processes, [16:13.240 --> 16:18.200] and the central ones are PrimProc and ExeMgr, where ExeMgr is the coordinator for the query [16:18.200 --> 16:29.000] processing and PrimProc is the actual worker. So there was a lot of additional headache when [16:29.000 --> 16:36.840] the local processes had to communicate with each other in the same node. [16:36.840 --> 16:43.080] So, same-node communication goes over loopback, and this traffic is not compressed, compared to [16:43.080 --> 16:50.840] the inter-node communication, which is compressed. So, combining these two runtimes, [16:50.840 --> 16:57.640] we get a four to seven percent overall speedup. And this gives us other opportunities for [16:57.640 --> 17:09.320] optimizations that I will mention later.
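To put rough numbers on the extent map lookup costs mentioned above, here is back-of-the-envelope arithmetic; it uses only the figures quoted in the talk (a roughly 200 MB extent map with entries of about 100 bytes) and is mine, not from the slides.

```latex
\[
N \approx \frac{200\ \mathrm{MB}}{100\ \mathrm{B\ per\ entry}} \approx 2 \times 10^{6}\ \text{entries}
\]
\[
\text{linear array scan: } O(N) \approx 2 \times 10^{6}\ \text{entries per lookup}, \qquad
\text{red-black tree: } O(\log_{2} N) \approx 21\ \text{comparisons}, \qquad
\text{extent map index: } O(1)
\]
```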
Union pushdown. Previously, there was no way to [17:09.320 --> 17:17.400] directly run queries like the one you can see at the very bottom of the slide, in the last line, [17:18.120 --> 17:24.680] where the UNION clause is at the top level. Because the MariaDB processing path for unions and for simple [17:24.680 --> 17:33.880] selects differs. And ColumnStore works using the notion of a select handler, which allows the MariaDB [17:33.880 --> 17:40.200] server to push the whole query down to ColumnStore. So when there was a union, we could not use this [17:40.200 --> 17:49.240] path, and we now do what we call a subquery wrap. So we wrap the union into a subquery, and [17:49.240 --> 17:54.760] this allows us to use the select handler, as opposed to table mode, which is slower.

[17:57.320 --> 18:05.720] The next big feature is full TPC-H support. And why is it big? Because it is mostly concerned [18:05.720 --> 18:13.320] with the optimizer, which, let's face it, ColumnStore lacked previously. There are two queries [18:14.280 --> 18:24.520] with two features that ColumnStore lacked, and the first of these is a correlated subquery with aggregates. [18:24.520 --> 18:31.880] It's basically a scalar subquery, but with an aggregate; it returns a single row that you can use in [18:31.880 --> 18:39.480] a comparison. And guess what? To enable this feature, we just disabled the error message [18:39.480 --> 18:47.800] in the code. So it was a very, very tiny change, and ColumnStore naturally supports such queries. [18:47.800 --> 18:56.360] However, the last query type, presented at the bottom half, is way more elaborate to handle. [18:57.240 --> 19:07.640] And that is the common conjunction detection and rewriting. As you can see, there are two [19:07.640 --> 19:13.400] join conditions at the bottom half of the slide. They are marked in violet. [19:14.040 --> 19:19.160] And these conditions are common. However, ColumnStore cannot handle [19:20.040 --> 19:28.200] join conditions if they are combined with a disjunction, an OR, basically. So to handle this query, [19:28.200 --> 19:36.200] we need to go over all the conditions that are ORed, get the common one, and put it at the top. [19:38.600 --> 19:46.440] For the general case, it's very complex to find such patterns and apply them in a way that [19:46.440 --> 19:53.960] doesn't break the semantics of the query itself. However, we managed this, and this [19:53.960 --> 19:58.920] feature will benefit not only the TPC-H query that didn't work previously, but all the others (schematic examples of both query shapes follow below).

[19:59.640 --> 20:12.040] External DISTINCT. As you might know, ColumnStore was not great at doing this. To be honest, [20:12.040 --> 20:19.240] it was hash-map based, so it could not do an external operation, and it was tightly coupled [20:19.240 --> 20:26.040] with ORDER BY. So, to allow this thing to become external, we needed to [20:26.040 --> 20:33.400] untie them, to uncouple them, and we just applied the existing GROUP BY facility that already has [20:33.400 --> 20:42.600] external operation support. So this gives us future optimization opportunities, [20:42.600 --> 20:50.600] and it also allows DISTINCT to scale now. So it will be fast.
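To make the two TPC-H-related query shapes above more concrete, here are schematic sketches; they are illustrative approximations written in the spirit of TPC-H (lineitem/part and customer/orders tables), not the exact benchmark queries or the queries on the slide.

```sql
-- 1) Correlated subquery with an aggregate: the inner query returns a single
--    value per outer row, used in a comparison (roughly the shape of TPC-H Q17).
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem, part
WHERE  p_partkey = l_partkey
  AND  l_quantity < (SELECT 0.2 * AVG(l_quantity)
                     FROM   lineitem
                     WHERE  l_partkey = p_partkey);

-- 2) Common conjunction detection and rewriting: the join condition
--    c_custkey = o_custkey is common to both OR branches ...
SELECT COUNT(*)
FROM   customer, orders
WHERE  (c_custkey = o_custkey AND o_orderpriority = '1-URGENT')
   OR  (c_custkey = o_custkey AND o_orderpriority = '2-HIGH');

-- ... so it can be factored out, leaving a plain join condition plus an OR filter:
SELECT COUNT(*)
FROM   customer, orders
WHERE  c_custkey = o_custkey
  AND  (o_orderpriority = '1-URGENT' OR o_orderpriority = '2-HIGH');
```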
The most notable change, maybe [20:50.600 --> 21:00.760] because I was the author, is the ORDER BY rewrite. The current implementation of the ORDER BY facility [21:00.760 --> 21:09.240] is based on a priority queue. That is great for top-K queries, like the one mentioned in the second bullet, [21:10.120 --> 21:17.240] maybe the third. However, the priority queue timings are just terrifying when you run a [21:17.240 --> 21:24.680] query on a huge data set without a limit, or when the limit is relatively big. So I replaced the [21:25.560 --> 21:33.240] priority queue algorithm with a different approach. It has 2.5 phases. It first calculates the sorted [21:33.240 --> 21:40.600] runs in parallel, completely in parallel. And the new middle phase calculates non-overlapping [21:40.600 --> 21:49.080] ranges for the second phase. So the second phase effectively merges non-overlapping [21:49.080 --> 21:55.960] ranges in parallel. So you have the first phase and the second phase, and these two phases are completely [21:55.960 --> 22:05.320] parallel; they don't overlap each other. So in the end, ColumnStore now has a choice between the [22:05.400 --> 22:11.240] priority-queue-based sorting and this new algorithm, and it will pick one by looking at the [22:12.840 --> 22:23.000] ORDER BY usage pattern of the query.

And these are the results. As you can see, the most interesting [22:23.000 --> 22:32.360] are the last two bullets. There is a comparison with the previous code, the previous version that is [22:32.360 --> 22:44.200] based on a priority queue, with the data taken from a TPC-DS generator. And the first query uses [22:44.200 --> 22:54.040] integer sorting key columns, and the next, the second query, uses character key columns. So, [22:54.040 --> 23:02.760] as you can see, for integers it is like four times faster. But I need to confess this is only [23:03.320 --> 23:11.640] scale factor 10, so it's not a big dataset anyway. For a bigger dataset, [23:11.640 --> 23:21.400] you can get like 20x or maybe even bigger. And to be honest, I don't compare with the [23:21.480 --> 23:28.760] other open-source engines just yet, because I'm going to come up with a separate story about this [23:28.760 --> 23:35.960] sorting.

So, these are the links. If you're interested in ColumnStore, the first is the code itself and [23:35.960 --> 23:42.600] the second is JIRA. So if you find bugs, please post them, we will appreciate it. Thank you all for [23:42.600 --> 23:49.400] coming again.