First, because the last time I did a presentation I forgot to say this and people asked about it afterwards: I am an engineer working at ProxySQL, my name is Xavier, and for any more information you can follow us on GitHub and GitLab, which is where I usually am.

But maybe a more interesting question is how I got here, and that is just because of this conference: I found this job at this same conference a couple of years ago. Yes, the first one was in 2020, and we have still been hiring since then. I am speeding up a little bit here. So, if you are a developer and you find yourself meeting any of those conditions... You'll tell me when I'm out of time, okay.

So, a brief introduction to ProxySQL, and I am going to be very brief: it is a high-performance, protocol-aware proxy for MySQL, with a focus on scalability and flexibility, and this is the basic topology, right, like any other proxy offers.

These slides are normally three or four, but in this presentation we are only going to care about the first three features: load balancing, query routing, and cluster monitoring. We are not going to see the others today, the SQL firewall, statistics, and mirroring, even though you can do a lot with the statistics part. We are mostly going to care about load balancing and query routing, and cluster monitoring will probably also be important if you are using what we are talking about today; we have done a lot of work there recently.

But let's move on to what Galera Cluster is, since that is the cluster we are experimenting with today. It is a multi-primary cluster with virtually synchronous replication, an easy-to-use high-availability solution, so it has all the goodies of a multi-primary cluster, right?
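A quick aside, not from the talk itself: if you want to sanity-check a Galera cluster like this before experimenting with it, the standard wsrep status variables tell you whether every node is synced and part of the primary component.

```sql
-- Run on any Galera node; these are standard Galera status variables.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- number of nodes currently in the cluster
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- should be 'Synced' on a usable node
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- 'Primary' means the node is in the quorum component
```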
So in essence it is multi-primary, or active-active, replication, and it uses virtually synchronous replication. I put a disclaimer here because this is how they officially announce it: it is virtually synchronous, meaning logically synchronous, but the actual writing and committing happens independently on each node. That is very important, the fact that it is virtually synchronous, because what we are going to care about is the definition of the isolation levels, per cluster node and for the cluster itself.

So, the typical isolation levels that we have: READ COMMITTED and REPEATABLE READ. Galera says these isolation levels are available for you at cluster level, the default one being REPEATABLE READ, and also SERIALIZABLE. Obviously SERIALIZABLE solves all of the problems in this talk, but with caveats that make it pointless for the purpose of the talk, mostly performance.

So REPEATABLE READ offers non-dirty reads, and reads remain repeatable during transactions, but it suffers from the lost update problem. I am making this distinction here because, going back through the isolation levels, we have them per cluster and per node, and having REPEATABLE READ at node level solves some issues that you are going to run into a lot more at cluster level. Because of that, Galera has another kind of isolation during replication, which is a form of protection against the lost update.

The lost update, for anyone who is not familiar, is just this: during one transaction you perform a read, then another transaction performs a write and commits, and then the first transaction performs a write, and that previous update from the second transaction may be lost. It is something that can happen a lot more easily at cluster level. So in Galera they have a form of snapshot isolation, an enhancement on top of REPEATABLE READ, which is essentially "first committer wins". That leads to deadlock errors, but it is a protection against the lost update problem.

I have said all of this to say: okay, we have a lot of consistency across the cluster nodes, and these isolation levels are respected as they would be on one node itself, but there is something missing. The next question is: what about the semantics? Are the semantics of these isolation levels preserved at cluster level? They are not. For that we have wsrep_sync_wait, which enforces read-committed semantics at cluster level, so once you have that you do not have any potential stale read after a write.
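To make the lost update problem concrete, here is a minimal interleaving, sketched with a hypothetical accounts table (the schema and values are illustrative, not from the talk):

```sql
-- Two concurrent transactions under REPEATABLE READ, hypothetical schema:
-- CREATE TABLE accounts (id INT PRIMARY KEY, balance INT);
-- Starting state: balance = 100 for id = 1.

-- T1: BEGIN;
-- T1: SELECT balance FROM accounts WHERE id = 1;               -- T1 reads 100
-- T2: BEGIN;
-- T2: UPDATE accounts SET balance = balance + 50 WHERE id = 1;
-- T2: COMMIT;                                                  -- balance is now 150
-- T1: UPDATE accounts SET balance = 110 WHERE id = 1;          -- computed as 100 + 10 from the stale read
-- T1: COMMIT;                                                  -- T2's +50 is silently lost

-- Under Galera's "first committer wins" certification, if T1 and T2 ran on
-- different nodes, T1 would instead fail with a deadlock error on commit,
-- which is exactly the protection described above.
```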
But at what cost? You have now elevated the semantics, at cluster level, to the semantics you probably wanted from the beginning, but what is the cost of that? We are going to provide some numbers. And, as with every benchmarking exercise, there are a lot of things we could discuss here; I am just going to provide some numbers that I think are representative of what we have tested, but there can always be discussion.

So this is the system, and I am showing it just for the setup you are about to see: if you have a system like this, you will see that CPU is not the bottleneck, memory is not the bottleneck, and disk is not the bottleneck. What we are measuring is the performance of the cluster when there is no constraint on more than one resource at the same time. These are the versions I am using; the infra will probably be made available if someone wants to try it themselves. The network is a Docker-based infra with a 500-microsecond delay imposed, for a one-millisecond RTT between the nodes, because you cannot benchmark a cluster with zero network latency; you would be killing the whole point of the benchmark.

So let's start talking about some numbers: single-primary writes against multi-primary writes. The first time you benchmark like this, you see that multi-primary writes outperform single-primary writes, and you think: what? That's wonderful, that's what I wanted. More nodes, more writes, around 70% more writes. That's incredible, right?

Okay, this is not the truth. And why is it not the truth? Because there are a lot of lies we tell ourselves when we do benchmarking, and this is pretty much the super-ideal scenario for multi-primary replication: write-only transactions, where I do not care about any of the synchronization problems that arise from the isolation levels we talked about before, and all nodes are equally busy because I have decided the load is evenly distributed and the queries are all the same and all take the same time. Of course you get perfect throughput, because everything takes the same time all the time. The same with reads: all reads, and you get a crazy amount of throughput yourself, but it is also not outperforming like before.

Now the same ideal setup, but with a mixed read-write load; now we start seeing some more real stuff. What happens when I mix the load? Well, obviously, the numbers go down very hard.
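An aside not from the talk: when a mixed multi-writer load drops like this, Galera's own counters can tell you how much of the drop comes from certification conflicts and flow control; these are standard wsrep status variables.

```sql
-- Run on each writer node to see where a mixed multi-writer load hurts:
SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';  -- write-sets that failed certification (conflicts)
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';      -- local transactions aborted by replicated write-sets
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';  -- fraction of time replication was paused by flow control
```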
These are the cluster writes and cluster reads. We have dropped from 30,000 to around 10,000 reads, and we have dropped from, I think it was around 5,000 or 6,000 writes, to what looks like between 2,000 and 3,000. So writes are roughly half, and reads probably one third. This is an equally distributed load.

Now we compare the cluster with just one node of the cluster: the same mixed load, but against one node. And now we have 15% more writes and 19% more reads; it reverses. Now the single node is outperforming the whole cluster. This makes sense, because with a mixed load like that against all the nodes you have a lot of collisions, a lot of problems, and all of that is gone on one single node. But this is nothing you were not suspecting; this is just how it is supposed to be.

What we want is to increase the reads, because we had 30,000 reads and now we have 10,000, so let's try to improve that. Let's make a read-write split over the cluster and see what happens. This is HAProxy, by the way; this first part of the benchmarks is going to be HAProxy, because this is the dumbest possible read-write split: we have a port for writes, we have a port for reads, and that's it.

So the cluster writes, we see, stay more or less the same as before, but then we have the split reads, which go insanely up, while our writes have gone down a little bit. We are compromising our writes for our reads, and that also makes sense: by doing that amount of reads across the whole cluster, we are creating a lot of pressure on the other cluster nodes, because this is close to the maximum cluster throughput on reads. So you are compromising a lot, and you are losing some writes. Okay, that's fine.

But we are on this journey of talking about semantics and synchronization, and I have just done the dumbest read-write split, in which I do not care about any read-after-write or anything, right? The semantics. So what happens if I now enable the synchronization for the readers, and enable the full cluster read-committed semantics? What happens is that we go back to basically the same performance we were getting against the whole cluster, which makes sense, because now, instead of writing against everyone, you are writing against one node and reading from the others, but waiting for the replication.
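For reference, "enabling the synchronization for the readers" maps to Galera's wsrep_sync_wait variable. A minimal sketch of what that looks like on a reader node follows; the variable and its bitmask are standard Galera, while setting it globally rather than per session, and the sbtest1 table name, are assumptions for illustration.

```sql
-- On the reader nodes (or per session), enable causality checks for reads:
SET GLOBAL wsrep_sync_wait = 1;  -- bitmask; 1 = check before READ statements such as SELECT

-- Every subsequent SELECT now waits until this node has applied all
-- write-sets replicated before the read was issued. That is the
-- cluster-level read-committed semantics discussed above, paid for
-- with the extra read latency seen in these benchmarks.
SELECT COUNT(*) FROM sbtest1;  -- sbtest1 is the usual sysbench table, assumed here
```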
So we can see that the split reads stay within the same range; split reads still win, but it is marginal, and split writes still win a little, but it is also marginal. So now we have the whole cluster with the nice semantics we wanted and so on, but we have not fixed our throughput. What can we do in this scenario? It looks like we need slightly more complicated logic, because we are back at square one of the problem: reads are barely above the original read-write numbers, writes are below the original read-write numbers, and we still need protection for our critical reads.

An alternative for avoiding this is using a single writer on a multi-primary cluster, which, from what we saw before, should not be a very bad performance trade-off in terms of whole-cluster writes, and then redirecting the critical reads only to the primary, while routing all the non-critical reads to the other replicas. So: critical reads to the primary. And now we bring ProxySQL into the picture, because that is something we can do with ProxySQL very easily. Contrary to the previous setup, it offers both of the things that we need, which are redirection and throughput control.

The testing scenario: we are going to have a 10/90 ratio of writes versus reads, with 5% critical reads and 95% regular reads. I would say that changing this into a more balanced ratio would not impact the performance that much; the point is that the total throughput is what you care about, so the ratio is not as important as how much you are going to stress the cluster with the throughput you want for that ratio.

So now, with ProxySQL in the picture: in this scenario, these are the non-critical reads, these are the critical reads, and these are the writes. We have improved almost 50% on non-critical reads, we are retaining more or less the write load, and we have an extra 1,000 reads in the critical reads. So we have been able to preserve the write throughput and increase the read throughput by around 50%, and this is compared against the whole-cluster read load with the synchronization enabled.
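A sketch of how that routing could look in ProxySQL, assuming hostgroup 0 is the single writer and hostgroup 1 holds the reader nodes, and assuming the application tags its critical reads with a marker comment. The hostgroup numbers and the "critical" tag are assumptions for illustration; mysql_query_rules and the LOAD/SAVE statements are ProxySQL's actual admin interface.

```sql
-- On the ProxySQL admin interface (assumed hostgroups: 0 = writer, 1 = readers).
-- Rule 1: reads the application has tagged as critical go to the primary.
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (1, 1, 'critical', 0, 1);
-- Rule 2: every other SELECT goes to the reader hostgroup.
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT', 1, 1);
-- Anything unmatched (the writes) follows the user's default hostgroup, the writer.
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```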
And this is the most important part of it: the graph looks like this because I decided it was going to look like this. If I went back here and decided not to limit the throughput on the reads, reads would go up to almost the whole cluster throughput, and I would compromise the writes again, which would go below anything you have seen at any point in the other benchmarks. And why is that? Because the total cluster throughput is what it is. What you need is to control what you want to do with that throughput; you cannot expect to just get more reads for free.

So, the conclusions. Multi-primary clusters can bring you a lot of benefit, depending on your setup. And it is just like this: systems measurement and analysis is really hard. Really, really hard. And benchmarking especially is very hard, because most of the things I have said here are right, but they are right in this scenario; if you change the scenario slightly, they may not be. Adapting the system to your workload is what you always want to do; a different workload will change everything.

The final conclusion is that control is everything. Being able to control the throughput, and being able to control what you want to do with your load, is what is going to decide the performance of the cluster, more than anything else. And for that, ProxySQL is a great tool. Thank you a lot for your attention.

[Host] You have five minutes for questions.

[Audience] How do you know the maximum throughput of a cluster?

You just measure. You create an artificial environment, and I would say that this is where it gets very hard, because it is probably not going to be the typical load that you are going to have. You try to replicate it, and then you measure, because otherwise you are not going to know which is the limiting factor until you try.

[Audience] What did you use to measure the workload?

sysbench. I was using sysbench, the Lua one with all the scripts, and the old, very old one, because, even though you can do the same things with the Lua one, you can create your own scripts and so on, I also wanted to benchmark what happens if I am selecting from tables different from the one I am writing to, and random table selection is not something you have in the Lua one by default. You also do not have throughput limiting there by default, and that is something the old one has in its options. So, for convenience, I used a mix. Which, by the way, does impact performance, depending on what you are measuring: if you are measuring ProxySQL, or measuring against another proxy, it is different depending even on the tool you choose.
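Returning to "you just measure" for a moment, one hedged suggestion not from the talk: if ProxySQL is already in front of the cluster, its stats tables give you per-hostgroup throughput without extra tooling. The tables below are part of ProxySQL's admin interface.

```sql
-- On the ProxySQL admin interface (port 6032 by default):
-- Top query digests, with how often they ran and total time spent, per hostgroup.
SELECT hostgroup, digest_text, count_star, sum_time
FROM stats_mysql_query_digest
ORDER BY count_star DESC
LIMIT 10;

-- Per-backend traffic counters, to see how load is spread across nodes.
SELECT hostgroup, srv_host, Queries, Bytes_data_sent, Bytes_data_recv
FROM stats_mysql_connection_pool;
```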
[Audience] You said that you used sysbench. I wanted to ask because, in my tests, when I was trying to do multi-writer loads, I noticed that the drop in write performance was because of write conflicts. So it is not clear to me how you are achieving better throughput for multi-writer, because you need to certify all the write-sets. From what I saw, you had better performance writing to multiple nodes than to a single node.

How I achieved better throughput writing to multiple nodes than to a single node? Probably because I was having very, very few conflicts, because the sizes of the tables I chose were very big, and so I was getting very, very few conflicts during that testing. It was a very favorable scenario for writing; that is why I said it was super-ideal. When I said it was ideal, it is because it was super, super ideal.

[Host] Thank you very much.

Thank you.