[00:00.000 --> 03:10.600] Now, we already spoke here a little bit about developers, and especially front-end developers. One purpose of this talk, for me, is to bridge the gap I often see between the people who have the database at the center of their professional life, and the people who are writing an application and for whom the database is just a thing. It's like a toilet: you do your business and you move on with your life. Something like that.

For those people the database is typically a black box. There is this black box, and what I want is to connect to the service endpoint provided to me, run my queries, and that's all I care about. All that change buffer kumbaya? Never heard of it.

What about queries? What would you, as a developer, think about queries? Well, these are actually pretty simple things. You connect to a service endpoint and run your queries. You want them to complete with no errors. You want them to give you the correct result set, because if you didn't care about that, we could alter all our MySQL tables to the BLACKHOLE storage engine and get fantastic performance, with no errors either. And you want them to complete within the response time your application expects.

I think that is a very important thing to understand. From the standpoint of an individual developer writing an application: the response time of my queries is all I care about, and how that internal database kitchen works is somebody else's problem.

Now, if you think about response time from the database point of view, it is often seen as: well, I see that the response time for a query is such-and-such, on average or with whatever distribution (we'll talk about that later). But that is different from what the business cares about. From the business point of view, the question is: do my users have an outstanding experience, in terms of performance, across all their application interactions? That means search should work, placing an order should work, and so on. The database is an important part of that, of course, but not the complete part.

What is interesting here is that as database engineers we often treat bad performance and downtime as different kinds of events, and say: well, no, we weren't down, it just took 15 minutes to run my very basic query.
[03:10.600 --> 06:25.920] Well, from the user's standpoint, very bad performance is indistinguishable from downtime. Because, A, we don't have patience, and even if people are very patient, the browser or some other timeout will kick in, and nobody gives a shit about that query, which may still continue to run.

Another thing to understand about query performance is that you do not want to focus on averages. I like the saying about the statistician who drowned trying to cross a river that was, on average, one meter deep. The same applies to queries: if your average query time is X, that means pretty much nothing. You need to understand more than that. In this case I like to look at percentiles, and even better, to make sure you can look at the actual distribution of your query response time. If you have that, it gives you a lot more insight.

Now, one thing to understand about percentiles: you may be looking and saying, well, great, my queries have this decent 99th percentile. But that does not mean, on the business side, that 99 percent of your users have a good or acceptable experience. Why is that? Well, because guess what: a single user interaction can correspond to a lot of queries executed sequentially, which all add up, and during a session the user typically has a number of those interactions. So even a 99th percentile may, depending on your application, correspond to only something like 50 percent of user sessions. That is why, if you look at complicated, large environments, they really focus on either relatively short SLAs or rather high percentiles.

Another thing I would encourage you to pay attention to is errors, and make sure you measure response time for those as well, because errors come in two kinds: fast errors and slow errors. In certain cases, say your table doesn't exist, you get the error back straight away, and if you put all your error queries and your normal queries in the same bucket you may say, oh my gosh, my response times are doing so well. On the other hand, if your query fails with, for example, a lock wait timeout, that is a slow error: it will actually have a higher response time than the normal case. That is why I always suggest measuring response time for normal queries and for queries with errors separately.
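As a minimal sketch of that, assuming MySQL 8.0 or later (where the statement digest table carries 95th and 99th percentile columns, with timer values in picoseconds), you can put averages, percentiles, and error counts side by side per normalized query roughly like this:

    -- A sketch, assuming MySQL 8.0+: average vs. percentile latency and error
    -- counts per normalized query (digest). Timer columns are in picoseconds.
    SELECT DIGEST_TEXT,
           COUNT_STAR            AS executions,
           AVG_TIMER_WAIT / 1e12 AS avg_s,
           QUANTILE_95    / 1e12 AS p95_s,
           QUANTILE_99    / 1e12 AS p99_s,
           SUM_ERRORS            AS errors
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 10;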
[06:25.920 --> 09:43.200] Another thing which is very important is looking at response time over time, because traffic changes and a lot of things are going on in the system; just saying "I have a response time of X over some long period of time" is not very helpful. Also, what you see in many cases is that problems start out as small performance issues, maybe SLA violations, which, if left unfixed, turn into downtime. For example, in the MySQL world you may have a forgotten runaway query, and the history keeps accumulating; it will slowly increase and increase your response time. If you measure that over time and notice that something is not trending in the right direction, you can probably fix it before your users see it as downtime. If you are not measuring it, not so much.

Here is an example of what you often see as well: all the queries have a spike in response time at the same moment, which often corresponds to something external happening in the environment. What I think is very interesting here, especially for those of us running in the cloud, is that we only have limited observability into that environment. If there is some shit going on on the Amazon backend, they are not going to tell us: oh, you know what, three hard drives backing your EBS volume failed and we had to do some rebalancing, yada yada.

The other question I would ask is where we want to measure query response time from. In my opinion, the application view and the database view in combination are very helpful. The application sees the real thing: if your network, for example, is adding some latency, you will see that from the application, but not so much in the database, because the database only sees "I got the request, then I sent the data back." The database view, on the other hand, often lets you see a lot more about what was going on inside, whereas from the application side we can often only capture the query time and maybe some very basic additional parameters.

So, we spoke about the business view, right? We already said we want all users to have an outstanding performance experience across all their application interactions. Let's now try to break down a little more what that may mean. Here I want to introduce, or plug, a little project for you: the SQLCommenter project by Google, which is pretty cool in terms of what it allows you to do, passing metadata which you understand as a developer all the way down to the SQL query.
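As a rough sketch of what that looks like on the wire: the tag names below are invented for illustration (the actual SQLCommenter spec defines its own keys, such as controller, route, and framework), but the idea is simply a normal statement with a trailing key='value' comment appended by the framework.

    -- Illustrative only: the tag names and values below are invented for this
    -- example. SQLCommenter-style libraries append such a comment automatically,
    -- so the database side can attribute a query to a tenant, team, and release.
    SELECT o.id, o.total
    FROM orders AS o
    WHERE o.customer_id = 42
    /*application='checkout',team='payments',app_version='2024.11.3',tenant='acme'*/;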
[09:43.200 --> 12:57.520] They implemented that support for a number of frameworks, and it is also supported in the Google Cloud monitoring environment. I would very much like to see it developed further, and to see us at least come to some sort of shared standard between databases for how we can augment query information with tags and values which users care about.

So what are the possibilities which can be quite helpful in this regard? Well, finding, for example, which actual user or tenant a query corresponds to, because we often have performance issues that affect only some of them. Finding the application, or some subset of application functionality, when many of them may be hitting the same database. Version information. Maybe information about the engineer or the team who is responsible. I often see DBA or SRE teams having this problem: I see this nasty query which was shipped today (I know it was shipped today because it wasn't there yesterday), but finding out who the hell introduced that stupid query can be a problem in a large environment.

Now, a lot of the focus, and I think the core of query-based observability, is the query itself. By "the query" I mean, obviously, queries which are the same except for different parameters, grouped together. That is very helpful because different queries obviously have different complexity, different expected SLAs, and so on and so forth.

The next way to break things down, for me, would be to look at the schema or database. Why is that interesting? Oh, I just noticed right now that the slide got cut off a bit. Well, anyway, life is life; I'm just not going to be lucky in this room. We can blame Windows: at this conference we can and should blame Windows.

Okay. Schema and database are also good because in multi-tenant applications we often separate different tenants by schema, and in that case it gives us a good performance profile of the different schemas, like we can see here in the example from the PMM tool.

Another thing I found very helpful is to find a way to separate the data by table. In many cases you want to ask how the queries hitting a given table are affected, especially if you made some change which relates to that table.
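For the tenant-per-schema case, a minimal sketch of that profile straight from the digest table, where SCHEMA_NAME is the default schema the statement ran against:

    -- A sketch: statement load per schema, useful when tenants map to schemas.
    SELECT SCHEMA_NAME,
           SUM(COUNT_STAR)            AS statements,
           SUM(SUM_TIMER_WAIT) / 1e12 AS total_time_s
    FROM performance_schema.events_statements_summary_by_digest
    WHERE SCHEMA_NAME IS NOT NULL
    GROUP BY SCHEMA_NAME
    ORDER BY total_time_s DESC;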
[12:57.520 --> 14:40.080] Hey, I changed the indexing on this table, let me see how all the queries hitting this table are impacted. Very helpful, because there may be some surprising differences.

Database users are another thing which is quite helpful, because they often allow us to identify the service or application. If you follow good security practices you do not let all your applications just use one username (not a good idea), and they also help you find the human troublemakers who have direct access. So many times you find somebody running a query and saying: okay, well, yeah, it's slow, but whatever, I'll go for lunch, I have time. Well, you may have time, but your database may not. Here is an example of how we present that.

I also mentioned database hosts and instances. In many cases that is very helpful, because even if you think your different database instances should perform the same, the world is a messy place, and the world in the cloud is an even messier place. They may not have exactly the same performance, due to some strange configuration difference, or having a bad day, or maybe even carrying a different load, and it is good to be able to break that down when you see some of your queries not performing very well.
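On the database-user point above, a minimal sketch using the per-user summary that performance_schema already keeps (timers again in picoseconds):

    -- A sketch: total statement time by database user, to spot which service
    -- account (or which human) generates the load. Timers are in picoseconds.
    SELECT u.USER,
           SUM(u.COUNT_STAR)            AS statements,
           SUM(u.SUM_TIMER_WAIT) / 1e12 AS total_time_s
    FROM performance_schema.events_statements_summary_by_user_by_event_name AS u
    WHERE u.USER IS NOT NULL
    GROUP BY u.USER
    ORDER BY total_time_s DESC;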
[14:40.080 --> 17:35.160] I would also look at the same stuff per web server or application server instance, because, again, if you have, say, a hundred nodes deployed with the same application, you may think: hey, they are all going to hit the database the same way. Well, that is not always the case. I have seen cases where one of them is misconfigured, or for some reason cannot connect to the cache, so it is hitting the database with ten times more queries than it should be, or where an application rollout didn't go well and you eliminated the nasty query on 99 of the application instances but not the others. It is very good to be able to validate that, because, again, from a DBA standpoint, developers, sysadmins, storage people are going to tell you stuff, and they are going to lie. Maybe not intentionally, maybe because of ignorance or the limitations of their tools, but as a DBA or SRE you want to point at their stuff and say: look, I have evidence. Evidence is good.

Client hosts and custom tags are also very helpful, if you can extend things; that is what we spoke about with SQLCommenter.

Something else I find very helpful, which we cannot easily get with MySQL, is being able to separate a query by its query plans. Often you have a query which looks the same but may take different execution plans, and that is often correlated with its performance. In certain cases that is totally fine (genuinely different situations), but sometimes the MySQL optimizer gets a little crazy and you see optimizer plan drift for no good reason, which may not be very easy to catch, and it would be helpful to be able to do that.

What I also want to highlight: when you find a specific query and say, hey, this query has nasty performance, we often want to understand where that query's response time comes from. Here are some of the places it can come from. Certain of them are relatively easy to find out; certain others are not. For example, whether a query waited on available CPU because the system was already saturated: you can't really see that on a per-query basis, you can only see that your CPU was completely packed during that period of time.

Okay, here are a couple of other things to consider when you're looking at queries.
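A quick aside on the plan-drift point before those: MySQL has no built-in plan identifier to group by, but as a rough workaround you can capture a suspect query's plan explicitly and diff it against an earlier capture. EXPLAIN FORMAT=TREE needs MySQL 8.0.16 or later, and the table and predicate below are made up:

    -- A sketch: capture the current plan of a suspect query so it can be diffed
    -- against an earlier capture. The table and predicate are made up.
    EXPLAIN FORMAT=TREE
    SELECT o.id, o.total
    FROM orders AS o
    WHERE o.customer_id = 42
    ORDER BY o.created_at DESC
    LIMIT 20;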
[17:35.160 --> 20:30.120] One: you really want to look separately at the bad queries versus the victims, because sometimes you will see queries getting slower, and it's not because of them, it's because of some other nasty query. Maybe that is your Java developer who thought: well, to solve my problem I will just launch it with 200 threads and make sure I am good. And everything else slows down. That can be tricky.

Another thing: you should not forget the currently running queries. If you look at the performance_schema statement statistics by digest, that gives you what happened in the past. But believe me, if you start 50 instances of some very bad query which continue to run, that may be the reason for your problem, not the past. Connected to that, I think this is less of a problem in MySQL now if you are using query timeouts, which is a very good practice. If you say that for all my interactive queries I set a timeout of, let's say, 15 seconds by default, then you don't need to care too much about those runaway queries, because everything gets killed after 15 seconds.

Also, you should not ignore the stuff which is invisible from a query standpoint. Databases do a lot of shit in the background, and you or your operations team may also do things: backups, provisioning another node by cloning, or whatever your VM or cloud system may need to do in the background. It may not be directly visible, but it can impact query performance. So sometimes, when you observe a query impact and you can't really see what is causing it, that is a possibility.

I would also encourage you to avoid what I would call biased sampling. I sometimes see people say: hey, we will set long_query_time to one second and only look at the queries which take more than one second. Well, then you may be focusing only on the outliers, missing the opportunity to optimize other queries, or even missing the queries which are actually responsible for the bad experience your users are getting.

Okay, one more thing in the last minute or so I have: what I would like to see MySQL do better. Where is Kenny? No Kenny? Yes, he's always hiding; he probably went to get another sandwich, damn it. Okay, so here are some things that I would like to see.
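Before that wishlist, a minimal sketch of the two practical points above: seeing what is running right now, and capping interactive queries at the 15 seconds from the example. Note that max_execution_time is given in milliseconds and applies to SELECT statements only.

    -- A sketch: what is running right now, longest first.
    SELECT id, user, db, time AS run_seconds, LEFT(info, 80) AS query
    FROM information_schema.processlist
    WHERE command = 'Query'
    ORDER BY time DESC
    LIMIT 20;

    -- Cap this session's SELECTs at 15 seconds (value in milliseconds).
    SET SESSION max_execution_time = 15000;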
[20:30.120 --> 23:39.280] First on that list: better support for prepared statements. Right now they are not instrumented in the same way, which I think is a problem.

Next, consider grouping data by time. Right now you get all the statements in one table, and if you have a lot of statement variety that table tends to overflow, which is not really helpful; and if you have to keep resetting your statistics all the time, that is not a very good practice in my opinion.

Provide the list of tables a query touches. That is very helpful, because the MySQL parser already knows which tables a query touches, but it is very hard to parse them out of the query text yourself, especially if you consider views: I cannot tell, by looking at the query alone, whether something is a table or a view.

Information about the plan ID. I would like to see, for each query, some sort of plan hash or something, so I know which plan the query is using. And also what I would call a top wait summary. Right now we have information about waits in MySQL's performance schema, and information about queries, but I cannot connect them and say: oh, that query was slow because it spent an XYZ amount of time waiting on something, at least for some class of queries. I don't think that is convenient right now.

Well, with that, that's all I had to say. I hope it will help you avoid tuning your indexes by credit card. And yes, do I have time for questions? You told me: Peter, five minutes. So I have time for questions. Any questions?

Question from the audience: what is the difference, or the advantage, of this SQLCommenter thing compared to the open tracing standards, where people trace the whole request end to end?

Well, what I would say in this case: yes, there is obviously the open tracing framework which you can use. This gets specifically to the database, and specifically into every query. If you look at the open tracing framework, tracing every single query may be a lot of volume. And the other good thing about SQLCommenter is that it does this automatically, if you will; it does not require you to do extra integration work.
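For context on the prepared-statement item: performance_schema does already expose per-prepared-statement statistics in MySQL 5.7 and later, just in a separate table from the digest summaries; a quick sketch of looking at it:

    -- A sketch (MySQL 5.7+): execution statistics for currently allocated
    -- prepared statements. Timers are in picoseconds.
    SELECT STATEMENT_NAME,
           LEFT(SQL_TEXT, 60)       AS sql_text,
           COUNT_EXECUTE,
           SUM_TIMER_EXECUTE / 1e12 AS total_exec_s
    FROM performance_schema.prepared_statements_instances
    ORDER BY SUM_TIMER_EXECUTE DESC
    LIMIT 10;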
[23:39.280 --> 24:46.360] Okay, anybody else? [Audience question, inaudible.] Yeah, it works with MariaDB as well, yes.

[Another audience question, inaudible.] Well, there are no good practices there. There are a lot of optimizer hints you can use, so you can actually force the query to take one particular plan, but that also prevents the optimizer from choosing a different plan if a better plan becomes available. Yeah: never use FORCE INDEX, always use IGNORE INDEX.

Okay, well, then thank you, folks.
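To make that last piece of index advice concrete (the table and index names below are hypothetical), the difference looks like this:

    -- Illustrative only; table and index names are made up.
    -- Rigid: the optimizer can never switch to a better plan, even if one appears.
    SELECT * FROM orders FORCE INDEX (idx_customer_created)
    WHERE customer_id = 42
    ORDER BY created_at DESC;

    -- More flexible: rule out only the known-bad index and leave the rest
    -- of the choices to the optimizer.
    SELECT * FROM orders IGNORE INDEX (idx_created_at)
    WHERE customer_id = 42
    ORDER BY created_at DESC;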