[00:00.000 --> 00:55.000] Hello, everyone, can you hear me well? How do you feel today, Sunday, the fourth day? Did you wake up with energy? Or are you more like... We do have energy, right? Okay, that's good. I'm Edith Puclla, from Peru. I traveled to Europe for the first time this year, and this is also the first time I'm giving a talk in English, so if I make some mistakes, I hope you will bear with me. I am a Technology Evangelist at Percona; I started six months ago, and it's a really great company focused on databases. If you want to follow me on Twitter or connect on LinkedIn, I will be really happy. I usually publish things about open source, containers, and Kubernetes.

[00:55.000 --> 01:08.000] About me: I am from Peru, I am part of Google Women Techmakers, I was nominated as a Docker Captain last year, and I am a container and database enthusiast.

[01:08.000 --> 01:51.000] This talk is about monitoring your databases with open source tools, and we are going to focus on PMM. How many of you have heard about PMM before? Percona Monitoring and Management. Okay, so it's something new for this room. This is from a beginner's perspective, not something advanced; I will look at this monitoring tool from that point of view, because I am still learning about databases at the company where I work, Percona.

[01:51.000 --> 02:13.000] We are going to look at why monitoring databases is important and what value it brings. We will also look at PMM, Percona Monitoring and Management, and at how we can use the dashboards and graphs it gives us to effectively monitor and manage our databases.

[02:13.000 --> 03:04.000] In the time I have been working at a database company, I asked myself about this, and I realized why monitoring databases matters: we may not have just one database running for us, we can have several of them, in the cloud or on our own infrastructure. So it is very important, and that is why we should ask why we should care about these databases, why we should care about monitoring them. We can start with questions like: is my database performing well? As we work with databases, we run a lot of queries, and a query may not execute in the time we expect; it can take longer and become a bottleneck while it is running.
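To make that first question concrete, here is a minimal sketch of the kind of check a monitoring tool automates: listing statements that have been running longer than some threshold. It is written in Python against MySQL; the PyMySQL driver, the credentials, and the 30-second threshold are illustrative assumptions, not anything PMM-specific.

```python
"""Minimal sketch: list queries that have been running longer than a threshold."""
import pymysql  # pip install pymysql

THRESHOLD_SECONDS = 30  # what counts as "too long" is a per-business decision

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
with conn.cursor() as cur:
    # information_schema.processlist lists every running statement and its age
    cur.execute(
        """
        SELECT id, user, db, time, info
        FROM information_schema.processlist
        WHERE command = 'Query' AND time > %s
        ORDER BY time DESC
        """,
        (THRESHOLD_SECONDS,),
    )
    for thread_id, user, db, seconds, statement in cur.fetchall():
        print(f"[{seconds}s] thread {thread_id} ({user} on {db}): {statement}")
conn.close()
```

A dashboard like PMM's Query Analytics runs this kind of collection continuously, aggregates it, and keeps the history, which is what makes the trend visible.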
[03:04.000 --> 03:18.000] So we should care about query performance, and know that there are metrics that help us detect this kind of problem in our databases.

[03:18.000 --> 04:18.000] Another question we can ask ourselves is: are my databases available and accepting connections? We can have several databases, and many connections can be made to them. If we don't put a limit on those connections, the database can simply crash; but if we do set a limit, we should know when we are approaching it, so we can raise it in time or stop accepting new connections. If we go past the limit, that is going to be a problem for our databases, right? And if this is an e-commerce company, the user is only going to wait one or two seconds for the next page; if it takes three or five seconds, the user goes somewhere else and we lose them as a business.

[04:18.000 --> 04:47.000] Is my system stable? We should also monitor the infrastructure where our database is running, not just the database itself, because it runs on top of infrastructure: CPUs, memory, disks. If we cannot provision those resources in time, our databases will be in trouble.

[04:47.000 --> 05:28.000] Am I having avoidable downtime? If our hardware is not enough, our application can crash, and we can also have hardware failures or network outages. We should watch the right metrics to avoid these problems. There can also be human errors behind these crashes, and there are metrics we can look at to spot them earlier.

[05:28.000 --> 06:15.000] But we don't have to wait until the problems are already happening to see them; we can also prevent them, by asking: am I minimizing performance issues that can impact my business? Am I able to identify these issues before they happen? Because there are ways to prevent them. As in the earlier example, if we are approaching the connection limit, we can see that before we actually reach it, so we can provision more resources, or go and check the query we already saw taking too much time to execute. So there are ways we can prevent these problems.
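As an example of that kind of preventive check, here is a minimal Python sketch that compares the current number of MySQL connections against the server's limit and warns while there is still headroom. PyMySQL, the credentials, and the 80% warning level are illustrative assumptions, not PMM defaults.

```python
"""Minimal sketch: warn before a MySQL server reaches its connection limit."""
import pymysql  # pip install pymysql

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
    max_connections = int(cur.fetchone()[1])
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    connected = int(cur.fetchone()[1])
conn.close()

usage = connected / max_connections
print(f"{connected}/{max_connections} connections in use ({usage:.0%})")
if usage >= 0.8:  # warn while there is headroom, before clients start being refused
    print("WARNING: approaching max_connections; raise the limit or shed load")
```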
[06:15.000 --> 06:39.000] Nowadays there are also challenges when we think about monitoring databases. For example, data volume growth: we don't talk about gigabytes anymore, we are talking about terabytes or maybe more, because databases hold a lot of data, and that is a challenge to monitor these days.

[06:39.000 --> 07:03.000] Then there is the complexity of modern databases. Right now we have SQL databases and NoSQL databases; databases follow different models and can run on different cloud providers. So the complexity just grows; it is different from before.

[07:03.000 --> 07:10.000] Downtime and data loss are another risk we should monitor for and try to prevent.

[07:10.000 --> 07:42.000] Lack of visibility. If the metrics are not ready to be checked, if we have to do something extra, like writing Linux or bash scripts to collect them, we don't have that data on time, we don't have those metrics on time. But if we do have that visibility, it is much easier to detect problems and monitor our databases, and even better if we have everything in one single dashboard where we can visualize it. (A tiny example of that kind of hand-rolled collection script is sketched right after this list of challenges.)

[07:42.000 --> 08:26.000] Alert fatigue. We can monitor many things, and we may have different databases: MySQL, other kinds of databases. What we monitor, and which metrics we collect, depends on the business. If we don't narrow that down, we end up collecting a lot of things we may not need and creating alerts for things we really don't need, instead of focusing on the business. We should focus our monitoring on the business: which metrics are the important ones for us?

[08:26.000 --> 08:48.000] Integration with other tools. These days we do continuous integration and continuous delivery, and that also increases the complexity of monitoring our databases, because the database is not in a single place; it goes through a delivery process, and these things are being automated.
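For the lack-of-visibility point, this is roughly what the hand-rolled approach looks like: a small script that samples one host metric and appends it to a file. It is a minimal sketch with made-up placeholders for the output path and interval, and even then you still have no graphs, retention, or alerting on top of it, which is exactly what a dashboarding tool adds.

```python
"""Minimal sketch: sample disk usage periodically and append it to a CSV file."""
import csv
import shutil
import time
from datetime import datetime, timezone

DATA_FILE = "disk_usage.csv"   # hypothetical output file
INTERVAL_SECONDS = 60          # hypothetical sampling interval

while True:
    usage = shutil.disk_usage("/")  # total, used, free, in bytes
    with open(DATA_FILE, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), usage.used, usage.total]
        )
    time.sleep(INTERVAL_SECONDS)
```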
[08:48.000 --> 09:38.000] Now that we know why we should care about monitoring databases, let's talk about one of the solutions that can help us do it: Percona Monitoring and Management, PMM. It is an open source, free tool, itself built on top of other open source tools. It lets us monitor databases like MySQL, MariaDB, PostgreSQL, and MongoDB, but not just that: as I said before, it also lets us monitor the infrastructure where our databases are running, which is important to know about. And it helps us improve the performance of our databases, simplify their management, and strengthen their security.

[09:38.000 --> 10:41.000] Percona Monitoring and Management is built on top of other open source tools, like Grafana; I know many of you use Grafana, right? Who uses Grafana? Okay, a lot. It uses VictoriaMetrics to store the metrics we collect over time, ClickHouse to build reports in real time from all the metrics collected over time, and PostgreSQL to store the metadata and the other important data that PMM keeps; everything we visualize is saved in these databases. And we use Docker to install PMM: we can containerize the installation of PMM, client and server, and use it on different platforms (a rough sketch of that container-based setup appears at the end of this section). There are also Kubernetes operators for scaling our databases.

[10:41.000 --> 11:18.000] There are three levels of depth when we talk about the PMM interface. The top level is the dashboard, which we all know. We can go deeper and look at the graphs, which are a graphical representation of the metrics over a long period of time. And then we have the metrics themselves, each one a countable number that represents some important value of our database or of its infrastructure.

[11:18.000 --> 11:59.000] As I told you before, what we want to monitor is going to depend on our business; we are not going to monitor the same metrics as another business, it depends a lot on that. We should also make sure the alerts we create are focused on what we do as a business, and set up those alerts and notifications so that we are notified when we need to be. They can be integrated with Slack and many other tools we have at hand, so we know exactly when a problem is happening.

[11:59.000 --> 12:24.000] Some of the important metrics we can check with PMM are query performance, high CPU usage, high memory usage, high disk usage on the infrastructure side, and the number of user connections. We can also see when the data grows, among other kinds of metrics.

[12:24.000 --> 12:42.000] Could somebody tell me what other kinds of metrics we can check with PMM, or with any monitoring tool? Apart from the infrastructure, or... Okay, we'll check them.
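Here is a rough sketch of the container-based PMM Server setup mentioned above, driven from Python to stay consistent with the other examples. The image name, volume mount point, and published port follow Percona's public documentation for PMM 2 as I recall it; treat them as assumptions and verify them against the current PMM docs.

```python
"""Rough sketch: start a PMM Server container with a persistent data volume."""
import subprocess

# Persistent volume so collected metrics survive container restarts
subprocess.run(["docker", "volume", "create", "pmm-data"], check=True)

# Run PMM Server and expose its web UI over HTTPS on port 443
subprocess.run(
    [
        "docker", "run", "--detach", "--restart", "always",
        "--publish", "443:443",
        "--volume", "pmm-data:/srv",
        "--name", "pmm-server",
        "percona/pmm-server:2",
    ],
    check=True,
)
```

After the server is up, a PMM client on each database host registers the services to be monitored; the exact commands for that are in the PMM documentation.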
[12:42.000 --> 13:50.000] If we are seeing long query response times, because as I said some queries can take a while, PMM has a very good dashboard for that: QAN, the Query Analytics dashboard. There we can see the queries running against our databases. In this case I am looking at the top 10 queries running across all of our databases, but there is also an option to filter and look only at MySQL, PostgreSQL, or MongoDB, and we can see, for example, how many queries per second we are running and how long they are taking. So when we open this dashboard while working on the databases, the first thing we are going to notice is: okay, this query is taking too much time. In this case, no, but say we have a query that is taking too much time, or that is being run very many times per second; we can pick the top one and start troubleshooting from that point.

[13:50.000 --> 14:57.000] High CPU utilization is on the infrastructure side, and it is also important to know how that is going. In this dashboard in PMM, Percona Monitoring and Management, we can look at a specific node, because we can have several nodes running in our infrastructure. In this case, for example, 25% of our disk is in use. It is maybe not a great example because I only checked the last 12 hours, but we can look at our disk usage over six months or more, and that is when we can draw conclusions and make decisions. Say that over six months we are only using 25% of our disk: that can be a problem too, because we have a lot of infrastructure we are not using, and we are wasting money on space that has sat unused for six months. We could reduce that capacity and save money.

[14:57.000 --> 15:31.000] For high memory usage there is this dashboard, where we can see the amount of memory available for my databases, how much is being used for cache, how much has been used, and how much is about to be freed. This is useful because we can also see when we are approaching the memory limit and take action, for example provisioning more memory or another disk.
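As a minimal illustration of that "are we approaching the limit?" check, here is a small Python sketch that compares current memory and disk usage against a warning threshold. It needs the psutil package, and the 85% threshold is an arbitrary example, not a PMM default.

```python
"""Minimal sketch: warn when memory or disk usage gets close to its limit."""
import psutil  # pip install psutil

WARN_AT = 85.0  # percent; pick a level that leaves time to provision more capacity

mem = psutil.virtual_memory()
disk = psutil.disk_usage("/")

print(f"memory: {mem.percent:.1f}% used, disk: {disk.percent:.1f}% used")
for name, pct in (("memory", mem.percent), ("disk", disk.percent)):
    if pct >= WARN_AT:
        print(f"WARNING: {name} usage at {pct:.1f}%, plan extra capacity now")
```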
[15:31.000 --> 16:02.000] We can say this is very easy when we are working in the cloud, because we just click a button and say, hey, provision another disk, or increase the memory. But if we are working on our own infrastructure, maybe in a private cloud, it is harder: we have to handle the logistics of getting another disk or more memory, and that takes time. Having this kind of visualization helps.

[16:02.000 --> 16:25.000] The amount of input and output we do on our disks can also be checked on this graph. We can see the latency here; in this case it is stable, but we could see peaks that show us where the problems are.

[16:25.000 --> 17:22.000] User connections, as I said: it helps to monitor the number of active database connections, size them appropriately, and also put limits on the connections to our databases. Here we have MySQL connections; there are similar graphs for other databases too, and in this case it is also stable. We will see peaks when our databases are handling a lot of transactions, and we can act on that. The maximum number of connections allowed here is 151, and we are at 150; if it reaches 151, this should alert us: you have to check this.

[17:22.000 --> 18:02.000] Then there is data growth. There is a dashboard where we can see when our data grows a lot, when we are inserting a lot of data into our databases, or when we are just removing things. In this case it shows the point where my databases started, and it is not so easy to see it here, but if I have time, and I think I still have time, right? I will show the live dashboard. (A small sketch of where a data-growth number can come from follows this section.)

[18:02.000 --> 18:51.000] If you want to try it right now, you can check out the public PMM demo at pmmdemo.percona.com; you can just go in. In a moment we are going to look at the dashboard, but what we have covered so far are some of the aspects that make me think about what we should keep an eye on when monitoring our databases, and also how to explore PMM, which is an open source tool that is available out there. You just have to dive in, start checking it and exploring it, and it is easy to visualize things. So let's look at PMM now; you can also open this link yourself, it is free to experiment with.
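As a rough idea of where a data-growth graph comes from, here is a minimal Python sketch that asks MySQL for the approximate size of each schema; sampled periodically, these are the numbers such a graph is built from. The driver and credentials are placeholder assumptions.

```python
"""Minimal sketch: report approximate on-disk size per schema from MySQL."""
import pymysql  # pip install pymysql

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
with conn.cursor() as cur:
    # data_length + index_length approximates the on-disk size of each schema
    cur.execute(
        """
        SELECT table_schema,
               ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY size_mb DESC
        """
    )
    for schema, size_mb in cur.fetchall():
        print(f"{schema}: {size_mb} MB")
conn.close()
```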
[18:51.000 --> 19:37.000] Let's see if this is going to work. Yes, this is the dashboard we have. We have several nodes we can choose from here, a lot of them, depending on the database: we have nodes for MongoDB, MySQL, PostgreSQL, and ProxySQL as well. We can check the details for a specific time range; 12 hours is maybe not enough for some things, but with three or six months we have a lot to look at here. We can also check things about the system itself; I don't have access to show that right now, but we can also register alerts for all of this.

[19:37.000 --> 20:11.000] In this case we have three PostgreSQL databases being monitored, and one of them is the database for PMM itself. We have nine databases in MongoDB and 15 databases in MySQL. This is a nice thing about PMM: you can see everything in one single dashboard, and then go deeper into each node or each database as you want.

[20:11.000 --> 20:26.000] Yeah, that is all. Thank you so much. If you have some questions...

Moderator: Thanks a lot for the interesting talk. Does anyone have questions?

[20:33.000 --> 21:17.000] Audience: Hi, hello, good morning. Does the query monitoring dashboard have some advice on how to improve the bad queries, the slow queries, how to rewrite them?

Speaker: If you go deeper into that query, it opens another dashboard where you are going to be able to see suggestions: you are doing something wrong here, and you can fix it like this.

[21:17.000 --> 22:11.000] Audience: Hi, thank you for the talk, first of all. One question regarding PMM Query Analytics: is it possible to filter by connection and not by database?

Speaker: Say it again, please.

Audience: I want all queries from one connection instead of... Is it possible?

Speaker: Yeah, you can have just one connection.

Audience: Okay, then we have to talk.

Speaker: Yeah, okay, thank you.