[00:00.000 --> 00:10.680] Thanks, guys, for holding on until the last presentation. We've been through three days, for those who [00:10.680 --> 00:18.360] were at the pre-FOSDEM days. I know it's hard, a lot of information, my brain is half melted, [00:18.360 --> 00:24.560] but yes, I hope this won't be the 20 longest minutes of your life. [00:24.560 --> 00:30.760] LeFred was talking about components; here I will talk about plugins, [00:30.760 --> 00:34.880] but I think the good thing about FOSDEM is that you leave with some insights on what to [00:34.880 --> 00:38.160] do next, so let's see. [00:38.160 --> 00:45.000] And the objective of this talk is to present the PROCFS plugin that Percona [00:45.000 --> 00:53.320] developed so that you can monitor Linux metrics without an agent on the server. [00:53.320 --> 00:59.680] I know I already saw some of you at the pre-FOSDEM day, but I'm Vinicius, I've worked for Percona [00:59.680 --> 01:08.440] for almost six years, I've been working with databases for quite some time, and I'm also the co-author [01:08.440 --> 01:14.960] of the book Learning MySQL; my colleague, the other author, Sergey, is here. [01:14.960 --> 01:23.120] And this is our agenda for the day: what to monitor in a database, let's get some insights, [01:24.040 --> 01:31.480] and why monitoring; what the difference is between agent and agentless monitoring, [01:31.480 --> 01:38.560] what the pros and cons of each option are, and we will go into the details of them; then we will [01:38.560 --> 01:47.600] go through the PROCFS plugin, and you will see it's very easy to install and how to use it; we can [01:47.600 --> 01:59.880] use it natively or with any Prometheus setup, and an example of how to integrate it with PMM from [01:59.880 --> 02:07.160] Percona. For those who don't know, PMM is a monitoring tool, a bundle of several [02:07.160 --> 02:16.240] open source projects and components glued together to provide monitoring and observability. [02:16.480 --> 02:20.440] But if you want, you can ask me anything at any time. [02:20.440 --> 02:25.160] So, what to monitor in a database? Usually we like to monitor, for example, customer [02:25.160 --> 02:34.360] response time: if I'm on a booking site searching for hotels, I want my search [02:34.360 --> 02:44.080] to return in, let's say, less than 10 milliseconds. Also QPS: not only do I want my search to return in [02:44.160 --> 02:51.280] less than 10 milliseconds, I also want the searches of 10,000 people at the same time [02:51.280 --> 02:59.120] to return in under 10 milliseconds. Understanding workload behavior: what is the busiest [02:59.120 --> 03:06.400] day, what is the busiest hour of my workload, is there something different, some anomaly? [03:07.360 --> 03:14.160] That can indicate a security breach. Also the most basic infrastructure components: [03:16.160 --> 03:22.720] we need to verify that our equipment is working correctly, so whether the network has a lot of [03:22.720 --> 03:30.720] retransmissions, files, storage, whether my power supply is failing, all the kinds of things [03:31.440 --> 03:34.960] that need to work, because otherwise we won't have the service. [03:36.480 --> 03:42.480] And, as I said, resource utilization, not only to see what is going on now [03:42.480 --> 03:51.200] but also to predict the future, spot security breaches and so on. Maybe each company has metrics more relevant [03:51.200 --> 03:58.880] than the ones that I'm showing here, but this is just an overall example of what is most [03:58.960 --> 04:09.280] commonly monitored around databases.
So why monitoring? It helps diagnose issues that happened, for example, [04:09.280 --> 04:13.600] over the weekend: something happened and I don't know what, so you need a monitoring tool, [04:14.400 --> 04:19.360] because if you open the database now, you won't see the metrics from the time the [04:19.360 --> 04:24.880] problem was going on. We can understand issues that are happening right now, [04:25.840 --> 04:31.120] and we can also work proactively: we can see, look, my load is rising and rising, [04:31.120 --> 04:35.600] I think I'm going to have an issue, maybe I need to increase the number of servers, [04:35.600 --> 04:46.400] or I need to optimize some queries, and so on. For those who use the cloud, it is very important, [04:46.480 --> 04:55.040] because each CPU cycle and each byte costs money, so if you optimize a query or optimize a table, [04:55.040 --> 05:03.200] you're saving CPU, disk storage, backup space and so on. And my favorite one: it helps [05:03.200 --> 05:09.680] you sleep at night, because you don't need to keep dealing with disasters; you can predict them [05:09.680 --> 05:19.760] or handle them in a better way. Continuing, these are some metrics from Grafana, so we can see memory [05:19.760 --> 05:32.000] utilization and disk space, and also an estimation: it will take me 1.86 years to run [05:32.080 --> 05:40.320] out of space, so I can plan my budget ahead, buy disks and everything, because when the database [05:40.320 --> 05:46.800] runs out of disk, it crashes, and then we need to decide who we are going to sacrifice. [05:49.840 --> 05:56.240] And as I said, we can understand what is going on inside the database, what kind of queries; [05:56.240 --> 06:04.240] here you can see inserts, selects, updates, the commands, so these are the command counters [06:04.240 --> 06:12.000] from MySQL, and you can understand how they fluctuate. In this case, it's a database where [06:12.000 --> 06:17.760] this green line, the bigger one, is inserts, so it's a write-heavy database. [06:17.760 --> 06:30.160] And, as I said, to understand behavior: during normal business days I have peaks of almost [06:30.160 --> 06:38.080] 40k QPS, but on weekends I don't get more than 10k, so if it's a weekend and I'm seeing 30k, [06:38.720 --> 06:44.160] probably I'm under attack, or someone is running a promotion that I don't know about, [06:44.880 --> 06:55.840] or some important person died, whatever. And there are two ways to monitor: one is using an agent, [06:55.840 --> 07:00.640] and the other one is letting the monitoring system monitor the database remotely. [07:03.040 --> 07:09.040] When you have an agent, you have it installed locally, so it's gathering information and sending [07:09.040 --> 07:17.360] it to the server, and you can get a lot of detail, like Linux metrics; you can get all of them [07:17.360 --> 07:25.840] with the node_exporter from Prometheus.
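A minimal sketch of the agent-style setup mentioned above, for those following along at home: running the Prometheus node_exporter and asking Prometheus for the kind of disk-space projection shown on that Grafana panel. The docker flags follow the node_exporter README; the Prometheus host, mountpoint label, and look-ahead window are assumptions for illustration, not details from the talk.

    # Run node_exporter on the database host (agent-style Linux metrics).
    docker run -d --name node_exporter --net=host --pid=host \
      -v "/:/host:ro,rslave" \
      quay.io/prometheus/node-exporter --path.rootfs=/host

    # Check that Linux metrics are being exposed.
    curl -s http://localhost:9100/metrics | grep node_load1

    # Once Prometheus scrapes it, an estimate like "1.86 years until the
    # disk is full" comes from a linear projection such as this one.
    curl -s -G 'http://prometheus.example.com:9090/api/v1/query' \
      --data-urlencode \
      'query=predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 30*24*3600)'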
Sometimes you can run custom scripts, so maybe you can embed, [07:25.840 --> 07:33.440] for example, backup or restore routines, something like that, in the agent. But the cons [07:33.440 --> 07:39.840] are more organizational; for example, enterprise companies may suffer more, because you need [07:39.840 --> 07:45.520] authorization to install the agent on the machine, so maybe you have multiple teams, [07:45.520 --> 07:50.080] so you need to ask security for authorization, then the sysadmin needs to install it, [07:51.440 --> 07:58.160] some other team needs to configure it, and so on. This is one of the bad things, and it happens a lot [07:58.160 --> 08:05.520] with enterprise companies. Here's an example from PMM: we have the agents running on the [08:05.520 --> 08:11.920] database server, sending information to the central server, where the data is analyzed. [08:14.160 --> 08:23.120] This is an example from Datadog, the same way: an agent running on the same server, sending to the back end. [08:23.920 --> 08:33.440] And now agentless. The pros are basically the opposite of what we just saw: we can reduce the overhead [08:33.440 --> 08:41.120] of administrative tasks, because we only need the monitoring server to reach the database, [08:41.120 --> 08:46.240] and data will be fetched from there, for example from performance_schema [08:47.200 --> 08:58.000] or the global status variables. Again, it's easier to get approved; we are talking more about bureaucracy here. [08:59.360 --> 09:05.520] The cons: you still have some work to do. For example, I can't just say I'm going to monitor [09:05.520 --> 09:10.560] LeFred's database; if he doesn't give me a user and password, of course I won't have access, [09:11.120 --> 09:21.520] so you still have some work to do. And the problem that we will try to solve with PROCFS is that you [09:21.520 --> 09:28.960] have a limited scope to analyze, because you are not on the server; for example, you may need some [09:28.960 --> 09:32.800] special Linux permission to gather certain metrics, and you won't have it. [09:33.120 --> 09:39.040] And one last thing: if you are monitoring, let's say, 1,000 hosts remotely, you are putting a lot of stress [09:39.040 --> 09:43.760] on the central server. It's natural, someone needs to do the job; if it's not the database [09:43.760 --> 09:52.720] server, it will be the monitoring server. This is an example from another company, EGI: it [09:53.120 --> 10:06.000] works by the monitoring appliances collecting data from these hosts and sending the information on. [10:08.080 --> 10:13.760] And another example from PMM, in this case without an agent: we can connect to anything. [10:13.760 --> 10:18.880] The advantage is with, for example, RDS or OCI, where you don't have access to [10:19.840 --> 10:26.400] system metrics unless you are using, for example, the CloudWatch metrics or the metrics from Oracle, [10:26.400 --> 10:30.480] but you can still get the MySQL information by simply connecting to them. [10:32.960 --> 10:40.240] So, the PROCFS plugin: it provides access to Linux performance data, basically [10:41.200 --> 10:48.400] the information from /proc and /sys. When you run vmstat, [10:49.200 --> 10:55.520] it is basically collecting information from /proc, and with the plugin this information is gathered and [10:55.520 --> 11:00.720] exposed in a view that the plugin creates, the PROCFS view.
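For readers who want to see what the plugin is wrapping: the PROCFS view simply surfaces the text of ordinary files under /proc, the same counters that tools like vmstat read. A trivial sketch on any Linux host:

    cat /proc/loadavg                  # load averages and task counts
    cat /proc/version                  # kernel version string
    head -n 5 /proc/stat               # aggregate CPU time counters
    grep MemAvailable /proc/meminfo    # memory available for new work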
[11:00.880 --> 11:10.880] For those for whom security is a concern, there is a parameter, procfs_files_spec, [11:11.520 --> 11:18.640] where you can specify exactly what you want to gather. Now, for the problems, [11:19.680 --> 11:26.000] the sad face: currently it works only with Percona Server; maybe if we move to components, [11:26.000 --> 11:32.720] we can make it work for MySQL. If you try to just copy the library over, and I did it, of course it crashed: [11:32.720 --> 11:36.800] you will crash the database with signal 11, so don't try this. [11:40.720 --> 11:47.520] The other caveat, the other con, is that it only works with the same version, so even if we are talking [11:47.520 --> 11:54.800] about MySQL and Percona Server, if you copy the library from 8.0.10 and put it on 8.0.30, it will fail. [11:55.680 --> 11:57.840] So it always needs to be the same version. [12:03.920 --> 12:09.120] Installing the plugin is like installing any plugin in MySQL, just a command line. [12:10.240 --> 12:16.240] When you install Percona Server, the library is already in the plugin directory; [12:16.240 --> 12:21.840] as LeFred showed, not only the components but also the plugins will be there. [12:21.840 --> 12:28.640] You just install it, and it works. You need to use the particular privilege that is created for [12:28.640 --> 12:35.920] this specific plugin, which is ACCESS_PROCFS. If you only have SUPER, which I thought was a bit weird, [12:35.920 --> 12:43.040] it doesn't work. I thought that, okay, SUPER should override it, but it's just a matter of granting [12:43.760 --> 12:51.520] the privilege. And as I said, this is the variable that controls what you can collect; [12:52.800 --> 12:59.920] I highlighted a few, such as the OS version and the load average. This is a string, [12:59.920 --> 13:06.560] so you just remove the ones that you don't want, and it requires a restart: this is a static [13:06.560 --> 13:16.960] variable, so you cannot change it at runtime. And using the plugin, this is the raw data [13:16.960 --> 13:23.200] you can get from MySQL: you run the command and get exactly what you want to collect. In this case, [13:23.200 --> 13:32.320] I was running on Ubuntu 22, I don't remember exactly where, oh, it's on AWS. It's a bit weird, [13:32.320 --> 13:37.920] because there is no binary; I was using the generic packages, because there is no official [13:38.640 --> 13:49.120] package for it. And here we have the raw data, so we would need to keep running selects all the [13:49.120 --> 13:55.680] time, and, for example, when you get the CPU counters, you see a bunch of numbers, which makes it hard to [13:55.680 --> 14:02.720] figure out what's going on that way. So we can use Prometheus. Prometheus [14:02.800 --> 14:10.320] is open source, and it works with exporters. You have the node exporter, and I think there was a talk here [14:10.320 --> 14:17.200] about ProxySQL, and there is a ProxySQL exporter, so you have plenty of options. In this [14:17.200 --> 14:26.640] case, we have the MySQL ProcFS collector running as a Prometheus exporter. So you can integrate [14:26.640 --> 14:33.040] it with any tool; it doesn't need to be anything specific. I know that AWS, Oracle and Google have [14:34.880 --> 14:42.720] features that let you use Prometheus. This is an example that I took from pure Grafana, [14:42.720 --> 14:50.000] so if you want, you just fill it in here and start using it. In this case I will show my example [14:50.000 --> 14:58.400] with PMM, because I think it's easier.
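For those trying this at home, a hedged sketch of the install-and-query flow described above on Percona Server. The names are as I recall them from the Percona documentation (the procfs plugin and library, the ACCESS_PROCFS privilege, the INFORMATION_SCHEMA.PROCFS view, the procfs_files_spec variable), and the monitor account is a placeholder, so verify against the docs for your exact server version.

    # The shared library ships in Percona Server's plugin directory,
    # so installation is a single statement.
    mysql -e "INSTALL PLUGIN procfs SONAME 'procfs.so';"

    # SUPER alone is not enough; grant the dedicated privilege
    # (assuming a 'monitor' account already exists).
    mysql -e "GRANT ACCESS_PROCFS ON *.* TO 'monitor'@'%';"

    # Raw /proc data is then available through a view.
    mysql -e "SELECT * FROM INFORMATION_SCHEMA.PROCFS WHERE FILE = '/proc/loadavg';"

    # procfs_files_spec restricts what can be collected; it is read-only
    # at runtime, so set it in my.cnf and restart the server, e.g.:
    #   [mysqld]
    #   procfs_files_spec=/proc/version;/proc/loadavg;/proc/stat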
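And two hedged sketches of the integration side: a plain Prometheus scrape job for whatever exporter publishes the PROCFS data, and the pmm-admin registration that the next part walks through. Job names, hosts, ports, service names, and credentials are placeholders, and pmm-admin flag spellings can differ between PMM 2 releases.

    # Option 1: plain Prometheus -- add a scrape job to prometheus.yml
    # pointing at the exporter that reads the PROCFS view, for example:
    #
    #   scrape_configs:
    #     - job_name: mysql_procfs
    #       static_configs:
    #         - targets: ['db1.example.com:9104']

    # Option 2: PMM -- register the database as a MySQL service, add the
    # exporter container as an external service, then sanity-check it.
    pmm-admin add mysql --username=monitor --password='********' \
      --host=db1.example.com --port=3306 --service-name=mysql-db1
    pmm-admin add external --listen-port=9104 --service-name=db1-procfs
    pmm-admin list
    curl -s http://db1.example.com:9104/metrics | head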
This part I'm going to leave more for the record, [14:58.400 --> 15:05.120] for those who want to try this at home, because these are basically the commands. In this case, I have a [15:05.120 --> 15:13.280] container running the node exporter. As you can see, my username and password, and by the way, [15:13.280 --> 15:17.760] this host is still running, so if you want to connect, please don't drop the database. [15:20.400 --> 15:24.880] And if you want to test whether the exporter is running, you can do a simple curl; [15:24.880 --> 15:31.120] it will work and you'll see the metrics. These slides are about integrating with [15:31.840 --> 15:39.600] PMM, basically what you have to do. First, you need to add the database to the monitoring server. [15:39.600 --> 15:46.880] So we add it as a MySQL service; here it will ask you for a user and password. And later on, [15:46.880 --> 15:53.440] since we have this Docker exporter running, this container running, we will add it as an external service. [15:55.120 --> 16:02.080] This is how you can visualize the services that are running there. So you can see [16:02.080 --> 16:08.640] I have the database first, and this one is my container that is gathering the metrics from the [16:08.640 --> 16:18.320] OS. Those are the commands: how you list the services and how you add the external service. [16:19.120 --> 16:23.680] Don't worry, as I said, it's here more for later on, if you want to try it at home, [16:24.480 --> 16:30.560] so you can see where the ideas are coming from and understand what's going on. [16:30.880 --> 16:42.480] So that's adding the agent, and the last slide here is in Grafana, as here I was using [16:44.640 --> 16:52.960] a container, and then I can see that I'm having some issues, because my user CPU is at 30% [16:52.960 --> 17:01.360] and my I/O wait is at 30%. So for Docker instances, or for those who are using, for example, Kubernetes, [17:01.360 --> 17:07.200] where it's hard to collect OS metrics, if you have the plugin installed, it can give you [17:08.240 --> 17:13.760] a very nice idea of what's going on. In my experience, a lot of the time people come in [17:13.760 --> 17:18.960] saying MySQL is not working, and then you go and look at the Kubernetes node, [17:18.960 --> 17:25.520] the worker node where the service is hosted, and there are like 3,000 containers running, the worker is [17:25.520 --> 17:31.520] completely saturated and dying, and we are blaming MySQL, which is just a small piece of the whole [17:31.520 --> 17:43.120] thing. I think I went too fast, but I had to skip some of the commands. Do you guys have any questions [17:43.760 --> 17:48.080] or any curiosity about this? Yes, Muka? [18:05.200 --> 18:11.120] That is a good question. I'm not sure what the frequency of the collection is, because [18:11.120 --> 18:17.120] this is still more of an experimental tool, so it probably doesn't have a configurable frequency. It should be hard coded, [18:17.120 --> 18:24.000] for example, five seconds or something. It was Nikolai from Percona who developed it, and [18:27.360 --> 18:33.840] I tested that when we restart the database the information is lost, so it's only a view that is populated [18:33.840 --> 18:39.840] over time. It's not stored, so you don't have 30 days of monitoring or anything like that. [18:42.080 --> 18:47.840] So, you don't know how it works, where the data is? No, no, I don't. It's collected when you ask for it. [18:47.840 --> 18:57.840] No, he says it's lost after a restart. It's when you query the table that it checks the value. [18:57.840 --> 19:07.840] If that is the case, there should be data. No, it's lost, he said.
He said it's lost after a restart. [19:08.160 --> 19:16.160] So, when you query that table, that view, it goes to /proc to get the information and returns it. [19:16.160 --> 19:24.160] Yes, but my next question is, how do you know it is lost? Because you would have to query it specifically, [19:24.160 --> 19:31.840] or get some signals or something like that. No, that's fine, because I think it's not [19:32.080 --> 19:35.600] there when I checked. Do you mean that when you restart, there is nothing? [19:35.600 --> 19:39.200] I didn't see it, you know. Say something, Sergey. [19:39.200 --> 19:43.040] Yeah, I want to ask: the implication of having no cache, the implication of this plugin, is that it will [19:43.040 --> 19:47.840] actually go ahead and take data from procfs, load it into memory, parse it, [19:47.840 --> 19:52.720] use CPU. Are there any controls on the amount of memory being used, a hard limit on the number [19:52.720 --> 19:58.080] of nodes in procfs? There can be situations where there is a furious [19:58.080 --> 20:05.920] procfs expansion. It is documented, there is a limitation. I think there is a limit on the number of lines [20:07.680 --> 20:12.720] that the procfs plugin will try to collect, and it's kind of similar to PT's talk: when there [20:12.720 --> 20:17.360] are more than 1,000 or 10,000 tables, it will ignore them because of the overhead. [20:18.000 --> 20:28.240] Yeah, is it the same work which was done in 2018 by Nikolai, or is it something else? [20:28.240 --> 20:32.800] Probably it's a continuation, because I saw that the project on GitHub is old. [20:37.120 --> 20:41.360] What is it? Is it a new plugin written by somebody else? [20:41.360 --> 20:44.000] No, no, it's by Nikolai, by Nikolai, yes. [20:45.200 --> 20:49.840] Okay, can that plugin be compiled for Oracle MySQL? [20:50.480 --> 20:58.480] So, you can, but we saw in the source code that, because there is that particular privilege, [20:58.480 --> 21:03.520] and Marcelo helped me with this, you need to change parts of the MySQL code. [21:05.680 --> 21:09.600] Yes, yes, and talking to our engineering team, they have [21:09.920 --> 21:11.200] been looking at a component for this. [21:11.200 --> 21:19.040] Yeah, they were talking about it and probably they will make something, because right now this is more like [21:20.800 --> 21:26.720] an adventure, a college project. It's written in the documentation that it's an [21:26.720 --> 21:33.440] experimental feature, but the engineering team, well, because it was crashing during my presentation, [21:33.440 --> 21:39.120] I said, guys, I'm opening a Jira ticket, I want this next week. They didn't agree, but yeah, [21:39.120 --> 21:46.320] I hope this can become a real part of MySQL. As LeFred said, [21:46.320 --> 21:48.800] if it helps the community, why not? [21:53.600 --> 21:57.600] I want to say thanks to LeFred and the whole FOSDEM organization, and to you guys who [21:57.600 --> 22:03.760] survived until the end. It was a pleasure to be here again, and I hope to see you next year. [22:03.760 --> 22:06.400] Thanks.