[00:00.000 --> 03:10.600] Now, we already spoke here a little bit about developers, and especially front-end developers. One purpose of this talk, for me, is to bridge the gap I often see between the people who have the database at the center of their professional life, and the people who are writing an application and for whom the database is just a thing. It's like a toilet: you do your business and you move on with your life. Something like that.

For those people the database is typically a black box. There is this black box, and what I want is to connect to the service endpoint provided to me, run my queries, and that's all I care about. All that change buffer kumbaya? Never heard of it.

What about queries? What would you, as a developer, think about queries? Well, these are actually pretty simple things. You connect to a service endpoint and run your queries. You want them to complete with no errors. You want them to give you the correct result set, because if you didn't care about that, we could alter all our MySQL tables to the BLACKHOLE storage engine and get fantastic performance, with no errors either. And you want them to complete within the response time your application expects.

I think that is a very important thing to understand. From the standpoint of an individual developer writing an application: the response time of my queries is all I care about, and how that internal database kitchen works is somebody else's problem.

Now, if you think about response time from the database point of view, it is often seen as: well, I see that the response time for a query is such-and-such, on average or with whatever distribution (we'll talk about that later). But that is different from what the business cares about. From the business point of view, the question is: do my users have an outstanding experience, in terms of performance, across all their application interactions? That means search should work, placing an order should work, and so on. The database is an important part of that, of course, but not the complete part.

What is interesting here is that as database engineers we often treat bad performance and downtime as different kinds of events, and say: well, no, we weren't down, it just took 15 minutes to run my very basic query.
[03:10.600 --> 06:25.920] Well, from the user's standpoint, very bad performance is indistinguishable from downtime. Because, A, we don't have patience, and even if people are very patient, the browser or some other timeout will kick in, and nobody gives a shit about that query, which may still continue to run.

Another thing to understand about query performance is that you do not want to focus on averages. I like the saying about the statistician who drowned trying to cross a river that was, on average, one meter deep. The same applies to queries: if your average query time is X, that means pretty much nothing. You need to understand more than that. In this case I like to look at percentiles, and even better, to make sure you can look at the actual distribution of your query response time. If you have that, it gives you a lot more insight.

Now, one thing to understand about percentiles: you may be looking and saying, well, great, my queries have this decent 99th percentile. But that does not mean, on the business side, that 99 percent of your users have a good or acceptable experience. Why is that? Well, because guess what: a single user interaction can correspond to a lot of queries executed sequentially, which all add up, and during a session the user typically has a number of those interactions. So even a 99th percentile may, depending on your application, correspond to only something like 50 percent of user sessions. That is why, if you look at complicated, large environments, they really focus on either relatively short SLAs or rather high percentiles.

Another thing I would encourage you to pay attention to is errors, and make sure you measure response time for those as well, because errors come in two kinds: fast errors and slow errors. In certain cases, say your table doesn't exist, you get the error back straight away, and if you put all your error queries and your normal queries in the same bucket you may say, oh my gosh, my response times are doing so well. On the other hand, if your query fails with, for example, a lock wait timeout, that is a slow error: it will actually have a higher response time than the normal case. That is why I always suggest measuring response time for normal queries and for queries with errors separately.
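As a minimal sketch of that, assuming MySQL 8.0 or later (where the statement digest table carries 95th and 99th percentile columns, with timer values in picoseconds), you can put averages, percentiles, and error counts side by side per normalized query roughly like this:

    -- A sketch, assuming MySQL 8.0+: average vs. percentile latency and error
    -- counts per normalized query (digest). Timer columns are in picoseconds.
    SELECT DIGEST_TEXT,
           COUNT_STAR            AS executions,
           AVG_TIMER_WAIT / 1e12 AS avg_s,
           QUANTILE_95    / 1e12 AS p95_s,
           QUANTILE_99    / 1e12 AS p99_s,
           SUM_ERRORS            AS errors
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 10;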
[06:25.920 --> 09:43.200] Another thing which is very important is looking at response time over time, because traffic changes and a lot of things are going on in the system; just saying "I have a response time of X over some long period of time" is not very helpful. Also, what you see in many cases is that problems start out as small performance issues, maybe SLA violations, which, if left unfixed, turn into downtime. For example, in the MySQL world you may have a forgotten runaway query, and the history keeps accumulating; it will slowly increase and increase your response time. If you measure that over time and notice that something is not trending in the right direction, you can probably fix it before your users see it as downtime. If you are not measuring it, not so much.

Here is an example of what you often see as well: all the queries have a spike in response time at the same moment, which often corresponds to something external happening in the environment. What I think is very interesting here, especially for those of us running in the cloud, is that we only have limited observability into that environment. If there is some shit going on on the Amazon backend, they are not going to tell us: oh, you know what, three hard drives backing your EBS volume failed and we had to do some rebalancing, yada yada.

The other question I would ask is where we want to measure query response time from. In my opinion, the application view and the database view in combination are very helpful. The application sees the real thing: if your network, for example, is adding some latency, you will see that from the application, but not so much in the database, because the database only sees "I got the request, then I sent the data back." The database view, on the other hand, often lets you see a lot more about what was going on inside, whereas from the application side we can often only capture the query time and maybe some very basic additional parameters.

So, we spoke about the business view, right? We already said we want all users to have an outstanding performance experience across all their application interactions. Let's now try to break down a little more what that may mean. Here I want to introduce, or plug, a little project for you: the SQLCommenter project by Google, which is pretty cool in terms of what it allows you to do, passing metadata which you understand as a developer all the way down to the SQL query.
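As a rough sketch of what that looks like on the wire: the tag names below are invented for illustration (the actual SQLCommenter spec defines its own keys, such as controller, route, and framework), but the idea is simply a normal statement with a trailing key='value' comment appended by the framework.

    -- Illustrative only: the tag names and values below are invented for this
    -- example. SQLCommenter-style libraries append such a comment automatically,
    -- so the database side can attribute a query to a tenant, team, and release.
    SELECT o.id, o.total
    FROM orders AS o
    WHERE o.customer_id = 42
    /*application='checkout',team='payments',app_version='2024.11.3',tenant='acme'*/;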
[09:43.200 --> 12:57.520] They implemented that support for a number of frameworks, and it is also supported in the Google Cloud monitoring environment. I would very much like to see it developed further, and to see us at least come to some sort of shared standard between databases for how we can augment query information with tags and values which users care about.

So what are the possibilities which can be quite helpful in this regard? Well, finding, for example, which actual user or tenant a query corresponds to, because we often have performance issues that affect only some of them. Finding the application, or some subset of application functionality, when many of them may be hitting the same database. Version information. Maybe information about the engineer or the team who is responsible. I often see DBA or SRE teams having this problem: I see this nasty query which was shipped today (I know it was shipped today because it wasn't there yesterday), but finding out who the hell introduced that stupid query can be a problem in a large environment.

Now, a lot of the focus, and I think the core of query-based observability, is the query itself. By "the query" I mean, obviously, queries which are the same except for different parameters, grouped together. That is very helpful because different queries obviously have different complexity, different expected SLAs, and so on and so forth.

The next way to break things down, for me, would be to look at the schema or database. Why is that interesting? Oh, I just noticed right now that the slide got cut off a bit. Well, anyway, life is life; I'm just not going to be lucky in this room. We can blame Windows: at this conference we can and should blame Windows.

Okay. Schema and database are also good because in multi-tenant applications we often separate different tenants by schema, and in that case it gives us a good performance profile of the different schemas, like we can see here in the example from the PMM tool.

Another thing I found very helpful is to find a way to separate the data by table. In many cases you want to ask how the queries hitting a given table are affected, especially if you made some change which relates to that table.
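For the tenant-per-schema case, a minimal sketch of that profile straight from the digest table, where SCHEMA_NAME is the default schema the statement ran against:

    -- A sketch: statement load per schema, useful when tenants map to schemas.
    SELECT SCHEMA_NAME,
           SUM(COUNT_STAR)            AS statements,
           SUM(SUM_TIMER_WAIT) / 1e12 AS total_time_s
    FROM performance_schema.events_statements_summary_by_digest
    WHERE SCHEMA_NAME IS NOT NULL
    GROUP BY SCHEMA_NAME
    ORDER BY total_time_s DESC;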
[12:57.520 --> 14:40.080] Hey, I changed the indexing on this table, let me see how all the queries hitting this table are impacted. Very helpful, because there may be some surprising differences.

Database users are another thing which is quite helpful, because they often allow us to identify the service or application. If you follow good security practices you do not let all your applications just use one username (not a good idea), and they also help you find the human troublemakers who have direct access. So many times you find somebody running a query and saying: okay, well, yeah, it's slow, but whatever, I'll go for lunch, I have time. Well, you may have time, but your database may not. Here is an example of how we present that.

I also mentioned database hosts and instances. In many cases that is very helpful, because even if you think your different database instances should perform the same, the world is a messy place, and the world in the cloud is an even messier place. They may not have exactly the same performance, due to some strange configuration difference, or having a bad day, or maybe even carrying a different load, and it is good to be able to break that down when you see some of your queries not performing very well.
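On the database-user point above, a minimal sketch using the per-user summary that performance_schema already keeps (timers again in picoseconds):

    -- A sketch: total statement time by database user, to spot which service
    -- account (or which human) generates the load. Timers are in picoseconds.
    SELECT u.USER,
           SUM(u.COUNT_STAR)            AS statements,
           SUM(u.SUM_TIMER_WAIT) / 1e12 AS total_time_s
    FROM performance_schema.events_statements_summary_by_user_by_event_name AS u
    WHERE u.USER IS NOT NULL
    GROUP BY u.USER
    ORDER BY total_time_s DESC;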
[14:40.080 --> 17:35.160] I would also look at the same stuff per web server or application server instance, because, again, if you have, say, a hundred nodes deployed with the same application, you may think: hey, they are all going to hit the database the same way. Well, that is not always the case. I have seen cases where one of them is misconfigured, or for some reason cannot connect to the cache, so it is hitting the database with ten times more queries than it should be, or where an application rollout didn't go well and you eliminated the nasty query on 99 of the application instances but not the others. It is very good to be able to validate that, because, again, from a DBA standpoint, developers, sysadmins, storage people are going to tell you stuff, and they are going to lie. Maybe not intentionally, maybe because of ignorance or the limitations of their tools, but as a DBA or SRE you want to point at their stuff and say: look, I have evidence. Evidence is good.

Client hosts and custom tags are also very helpful, if you can extend things; that is what we spoke about with SQLCommenter.

Something else I find very helpful, which we cannot easily get with MySQL, is being able to separate a query by its query plans. Often you have a query which looks the same but may take different execution plans, and that is often correlated with its performance. In certain cases that is totally fine (genuinely different situations), but sometimes the MySQL optimizer gets a little crazy and you see optimizer plan drift for no good reason, which may not be very easy to catch, and it would be helpful to be able to do that.

What I also want to highlight: when you find a specific query and say, hey, this query has nasty performance, we often want to understand where that query's response time comes from. Here are some of the places it can come from. Certain of them are relatively easy to find out; certain others are not. For example, whether a query waited on available CPU because the system was already saturated: you can't really see that on a per-query basis, you can only see that your CPU was completely packed during that period of time.

Okay, here are a couple of other things to consider when you're looking at queries.
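A quick aside on the plan-drift point before those: MySQL has no built-in plan identifier to group by, but as a rough workaround you can capture a suspect query's plan explicitly and diff it against an earlier capture. EXPLAIN FORMAT=TREE needs MySQL 8.0.16 or later, and the table and predicate below are made up:

    -- A sketch: capture the current plan of a suspect query so it can be diffed
    -- against an earlier capture. The table and predicate are made up.
    EXPLAIN FORMAT=TREE
    SELECT o.id, o.total
    FROM orders AS o
    WHERE o.customer_id = 42
    ORDER BY o.created_at DESC
    LIMIT 20;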
[17:35.160 --> 20:30.120] One: you really want to look separately at the bad queries versus the victims, because sometimes you will see queries getting slower, and it's not because of them, it's because of some other nasty query. Maybe that is your Java developer who thought: well, to solve my problem I will just launch it with 200 threads and make sure I am good. And everything else slows down. That can be tricky.

Another thing: you should not forget the currently running queries. If you look at the performance_schema statement statistics by digest, that gives you what happened in the past. But believe me, if you start 50 instances of some very bad query which continue to run, that may be the reason for your problem, not the past. Connected to that, I think this is less of a problem in MySQL now if you are using query timeouts, which is a very good practice. If you say that for all my interactive queries I set a timeout of, let's say, 15 seconds by default, then you don't need to care too much about those runaway queries, because everything gets killed after 15 seconds.

Also, you should not ignore the stuff which is invisible from a query standpoint. Databases do a lot of shit in the background, and you or your operations team may also do things: backups, provisioning another node by cloning, or whatever your VM or cloud system may need to do in the background. It may not be directly visible, but it can impact query performance. So sometimes, when you observe a query impact and you can't really see what is causing it, that is a possibility.

I would also encourage you to avoid what I would call biased sampling. I sometimes see people say: hey, we will set long_query_time to one second and only look at the queries which take more than one second. Well, then you may be focusing only on the outliers, missing the opportunity to optimize other queries, or even missing the queries which are actually responsible for the bad experience your users are getting.

Okay, one more thing in the last minute or so I have: what I would like to see MySQL do better. Where is Kenny? No Kenny? Yes, he's always hiding; he probably went to get another sandwich, damn it. Okay, so here are some things that I would like to see.
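Before that wishlist, a minimal sketch of the two practical points above: seeing what is running right now, and capping interactive queries at the 15 seconds from the example. Note that max_execution_time is given in milliseconds and applies to SELECT statements only.

    -- A sketch: what is running right now, longest first.
    SELECT id, user, db, time AS run_seconds, LEFT(info, 80) AS query
    FROM information_schema.processlist
    WHERE command = 'Query'
    ORDER BY time DESC
    LIMIT 20;

    -- Cap this session's SELECTs at 15 seconds (value in milliseconds).
    SET SESSION max_execution_time = 15000;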
[20:30.120 --> 23:39.280] First on that list: better support for prepared statements. Right now they are not instrumented in the same way, which I think is a problem.

Next, consider grouping data by time. Right now you get all the statements in one table, and if you have a lot of statement variety that table tends to overflow, which is not really helpful; and if you have to keep resetting your statistics all the time, that is not a very good practice in my opinion.

Provide the list of tables a query touches. That is very helpful, because the MySQL parser already knows which tables a query touches, but it is very hard to parse them out of the query text yourself, especially if you consider views: I cannot tell, by looking at the query alone, whether something is a table or a view.

Information about the plan ID. I would like to see, for each query, some sort of plan hash or something, so I know which plan the query is using. And also what I would call a top wait summary. Right now we have information about waits in MySQL's performance schema, and information about queries, but I cannot connect them and say: oh, that query was slow because it spent an XYZ amount of time waiting on something, at least for some class of queries. I don't think that is convenient right now.

Well, with that, that's all I had to say. I hope it will help you avoid tuning your indexes by credit card. And yes, do I have time for questions? You told me: Peter, five minutes. So I have time for questions. Any questions?

Question from the audience: what is the difference, or the advantage, of this SQLCommenter thing compared to the open tracing standards, where people trace the whole request end to end?

Well, what I would say in this case: yes, there is obviously the open tracing framework which you can use. This gets specifically to the database, and specifically into every query. If you look at the open tracing framework, tracing every single query may be a lot of volume. And the other good thing about SQLCommenter is that it does this automatically, if you will; it does not require you to do extra integration work.
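For context on the prepared-statement item: performance_schema does already expose per-prepared-statement statistics in MySQL 5.7 and later, just in a separate table from the digest summaries; a quick sketch of looking at it:

    -- A sketch (MySQL 5.7+): execution statistics for currently allocated
    -- prepared statements. Timers are in picoseconds.
    SELECT STATEMENT_NAME,
           LEFT(SQL_TEXT, 60)       AS sql_text,
           COUNT_EXECUTE,
           SUM_TIMER_EXECUTE / 1e12 AS total_exec_s
    FROM performance_schema.prepared_statements_instances
    ORDER BY SUM_TIMER_EXECUTE DESC
    LIMIT 10;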
[23:39.280 --> 24:46.360] Okay, anybody else? [Audience question, inaudible.] Yeah, it works with MariaDB as well, yes.

[Another audience question, inaudible.] Well, there are no good practices there. There are a lot of optimizer hints you can use, so you can actually force the query to take one particular plan, but that also prevents the optimizer from choosing a different plan if a better plan becomes available. Yeah: never use FORCE INDEX, always use IGNORE INDEX.

Okay, well, then thank you, folks.
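To make that last piece of index advice concrete (the table and index names below are hypothetical), the difference looks like this:

    -- Illustrative only; table and index names are made up.
    -- Rigid: the optimizer can never switch to a better plan, even if one appears.
    SELECT * FROM orders FORCE INDEX (idx_customer_created)
    WHERE customer_id = 42
    ORDER BY created_at DESC;

    -- More flexible: rule out only the known-bad index and leave the rest
    -- of the choices to the optimizer.
    SELECT * FROM orders IGNORE INDEX (idx_created_at)
    WHERE customer_id = 42
    ORDER BY created_at DESC;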