[00:00.000 --> 00:12.920] Hi, everyone, and welcome to our talk about operator monitoring and how to do it correctly. [00:12.920 --> 00:14.000] My name is Shirley. [00:14.000 --> 00:17.400] I work at Red Hat. [00:17.400 --> 00:18.400] I'm João Vilaça. [00:18.400 --> 00:25.160] I also work at Red Hat, for about a year and a half now. [00:25.160 --> 00:33.080] So today we're going to talk about observability for Kubernetes operators, and we're [00:33.080 --> 00:43.800] going to cover when to start, the maturity levels of metrics, why we want to monitor, [00:43.800 --> 00:54.280] what we want to monitor, and the best practices and code examples that we created for it. [00:54.280 --> 01:05.600] So, when should we start to think about observability for operators? [01:05.600 --> 01:13.720] You can see here in the chart the life cycle of creating an operator, which starts [01:13.720 --> 01:20.600] at Basic Install, and whose most mature step is Auto Pilot. [01:20.600 --> 01:27.440] So when do you think we should start thinking about observability for a new operator? [01:27.440 --> 01:29.440] Anyone? When? [01:29.440 --> 01:33.560] From the start. [01:33.560 --> 01:34.560] From the start. [01:34.560 --> 01:37.600] That's correct. [01:37.600 --> 01:52.400] Only the Deep Insights level explicitly talks about metrics and alerts, which is being able to monitor your operator fully. [01:52.400 --> 01:57.320] So people think maybe we should start thinking about it at the Full Lifecycle level. [01:57.320 --> 01:58.320] Maybe that's the case. [01:58.320 --> 02:06.520] But you should pretty much start at the beginning, because the metrics that you add first [02:06.520 --> 02:11.400] are usually not metrics for your users. [02:11.400 --> 02:12.400] They are internal. [02:12.400 --> 02:16.760] There are a few steps in the maturity of metrics. [02:16.760 --> 02:18.280] The first step is Initial. [02:18.280 --> 02:23.880] You start with your operator, and you want to understand how it works and whether it works correctly. [02:23.880 --> 02:30.960] So the developers start to add ad hoc metrics. [02:30.960 --> 02:37.720] I've been working for a few years on an operator at Red Hat called KubeVirt. [02:37.720 --> 02:47.120] When I joined the project, it was already at the Full Lifecycle phase, [02:47.120 --> 02:52.080] and a lot of metrics were already implemented in this operator. [02:52.080 --> 03:02.600] The problem was that the developers who added the metrics didn't follow best practices, [03:02.600 --> 03:09.160] and for a lot of the metrics it was hard to understand which ones were ours. [03:09.160 --> 03:17.040] It's important to understand that your operator is not the only one inside the Kubernetes system. [03:17.040 --> 03:22.880] So when a user, or even other developers, want to understand which metrics [03:22.880 --> 03:30.160] your operator is exposing, it should be easy for them to identify your metrics. [03:30.160 --> 03:33.840] So the first step, as I said, is Initial. [03:33.840 --> 03:36.880] The second step is Basic Monitoring. [03:36.880 --> 03:45.800] You start adding your monitoring, and you start thinking about your users and what they want to understand about your operator.
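Even at the Initial level it pays to register those first internal metrics with a proper, operator-specific prefix and in one central place. A minimal sketch in Go for a controller-runtime based operator — the `myoperator_` prefix and metric name are hypothetical examples, not from the talk:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"

	runtimemetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// reconcileTotal is the kind of internal, developer-facing metric an
// operator usually starts with at the Initial maturity level.
var reconcileTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "myoperator_reconcile_total", // operator-specific prefix from day one
	Help: "Total number of reconcile loops executed by my-operator.",
})

func init() {
	// controller-runtime serves this global registry on the manager's
	// /metrics endpoint.
	runtimemetrics.Registry.MustRegister(reconcileTotal)
}
```

The controller would then simply call `reconcileTotal.Inc()` at the start of every reconcile.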
[03:45.800 --> 03:53.480] The third step is that you have a process for implementing new metrics, and you [03:53.480 --> 03:57.120] are focused on the health and performance of your operator. [03:57.120 --> 04:01.040] And the last step is Auto Pilot: [04:01.040 --> 04:08.120] taking those metrics and doing smart actions with them, in order to do things like auto-healing [04:08.120 --> 04:10.840] and auto-scaling for your operator. [04:10.840 --> 04:19.840] And this is the stage that we are actually at with our operator. [04:19.840 --> 04:26.600] So as Shirley said, when we first start, we look mostly at internal metrics for the [04:26.600 --> 04:28.040] operators themselves. [04:28.040 --> 04:32.880] At this point we might start, for example, looking at the health of the operator. [04:32.880 --> 04:38.600] For example, can it connect to the Kubernetes API? Or, if it's using external resources, [04:38.600 --> 04:42.240] can it connect to those providers' APIs? [04:42.240 --> 04:44.760] Is it experiencing any errors? [04:44.760 --> 04:49.840] We can also start looking at its behavior. [04:49.840 --> 04:52.280] How often is the operator reconciling? [04:52.280 --> 04:54.440] What actions is the operator performing? [04:54.440 --> 04:59.080] This is the kind of stuff that, as we are developing, we are very interested in. [04:59.080 --> 05:07.360] But we should, as Shirley said, think ahead about having these good [05:07.360 --> 05:17.520] standards, because later we will not only be tracking these; there could also be, for example, resource metrics. [05:17.520 --> 05:25.400] So why operator observability, and what are the steps that we'll be taking? [05:25.400 --> 05:32.080] Starting with performance and health: here we want to detect issues early, as they come up. [05:32.080 --> 05:39.320] We try, obviously, to reduce both operator and application downtime, and to detect [05:39.320 --> 05:42.240] regressions that might happen. [05:42.240 --> 05:49.680] We can also start looking at planning and billing: to improve capacity planning, [05:49.680 --> 05:54.400] to improve profitability, or to bill users. [05:54.400 --> 05:59.720] At this point, we start looking more at infrastructure metrics as well. [05:59.720 --> 06:02.800] For example, we want to track resource utilization. [06:02.800 --> 06:09.360] This might be CPU, memory, and so on, and we can also start looking at the health of [06:09.360 --> 06:14.800] the infrastructure itself — maybe hardware failures, or trying to detect network [06:14.800 --> 06:16.160] issues. [06:16.160 --> 06:23.680] Then we also start using these metrics to create alerts, to send notifications [06:23.680 --> 06:27.000] about problems that come up as early as possible. [06:27.000 --> 06:32.080] We obviously want to take appropriate action and not let them linger. [06:32.080 --> 06:37.000] And after this, we go into more detail about metrics. [06:37.000 --> 06:39.320] Maybe we start looking at application metrics. [06:39.320 --> 06:42.120] What's the availability of our application? [06:42.120 --> 06:44.680] What's its latency? What are its error rates? [06:44.680 --> 06:46.760] And also its behavior: [06:46.760 --> 06:49.920] what types of requests is the application receiving? [06:49.920 --> 06:51.760] What types of responses is it sending?
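To make the operator-health and operator-behavior part concrete, here is a small sketch of how those signals might be instrumented in Go; the metric names and the helper function are illustrative, not from the talk:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"

	runtimemetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Behavior: how often is the operator reconciling, and how long does it take?
	reconcileDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "myoperator_reconcile_duration_seconds",
		Help:    "Duration of my-operator reconcile loops, in seconds.",
		Buckets: prometheus.DefBuckets,
	})

	// Health: is it experiencing any errors, e.g. talking to the Kubernetes API?
	reconcileErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "myoperator_reconcile_errors_total",
		Help: "Total number of my-operator reconcile loops that returned an error.",
	})
)

func init() {
	runtimemetrics.Registry.MustRegister(reconcileDuration, reconcileErrors)
}

// ObserveReconcile records one reconcile outcome. A controller with a named
// error return can call it via defer at the top of its Reconcile method:
//
//	defer func() { metrics.ObserveReconcile(start, err) }()
func ObserveReconcile(start time.Time, err error) {
	reconcileDuration.Observe(time.Since(start).Seconds())
	if err != nil {
		reconcileErrors.Inc()
	}
}
```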
[06:51.760 --> 06:55.080] And it's important to monitor all of these things. [06:55.080 --> 07:01.600] And when we build up all this information, then at a certain point in time, as Shirley [07:01.600 --> 07:10.960] said, we'll be able to give this new life to the operator by having the Auto Pilot [07:10.960 --> 07:16.560] capabilities, such as auto-scaling and auto-healing. [07:16.560 --> 07:21.640] Because at this point, if we did everything correctly, we'll be able to know almost [07:21.640 --> 07:24.000] every state that it is in. [07:24.000 --> 07:27.760] We also start looking at functionality metrics. [07:27.760 --> 07:32.760] Are we providing the expected functionality to users? [07:32.760 --> 07:36.840] For example, checking that application features are working correctly. [07:36.840 --> 07:42.120] We want to see if there are any performance or reliability issues by checking service [07:42.120 --> 07:48.920] levels, and that everything is working in the expected way by checking the [07:48.920 --> 07:56.000] error responses and the data that is being returned. [07:56.000 --> 07:57.000] Okay. [07:57.000 --> 08:00.400] So I hope you are convinced that observability is important. [08:00.400 --> 08:04.200] If you are in this room, I guess you are. [08:04.200 --> 08:09.800] For the past three years, we've been working on observability for our operator. [08:09.800 --> 08:14.200] What's important to understand is that our operator is considerably complex. [08:14.200 --> 08:19.480] It has a few sub-operators that it's managing, [08:19.480 --> 08:28.000] and each sub-operator has its own dedicated team that maintains it. [08:28.000 --> 08:36.200] Watching those teams implement observability, [08:36.200 --> 08:44.320] each team separately, gave us a higher-level view of the pitfalls [08:44.320 --> 08:49.640] they all share when implementing monitoring. [08:49.640 --> 08:56.640] So we decided to contribute our knowledge of how to do this correctly, so that others [08:56.640 --> 09:02.960] don't fall into the same pitfalls as we did. [09:02.960 --> 09:10.080] We decided to create best practices and to share our findings with the community. [09:10.080 --> 09:17.680] We hope to shorten the onboarding time for others, to create better documentation, [09:17.680 --> 09:25.400] and to create reusable code for others to be able to use, saving time and money, of [09:25.400 --> 09:28.600] course. [09:28.600 --> 09:37.400] So we reached out to the Operator Framework SDK team to collaborate with them and to publish [09:37.400 --> 09:40.080] our best practices there. [09:40.080 --> 09:47.400] As you can see here, this is the operator observability best practices document. [09:47.400 --> 09:53.160] The Operator SDK itself is the first stop when someone wants to create a new operator. [09:53.160 --> 10:00.320] It gives them tools to create it easily, to build, test, and package it, and it provides [10:00.320 --> 10:05.120] best practices for all steps of the operator life cycle. [10:05.120 --> 10:13.200] So we found that this was the best place for others to also go for monitoring. [10:13.200 --> 10:17.120] From these best practices, I will now share a few examples with you.
[10:17.120 --> 10:24.960] It may sound simple, but simple things have a big impact, both on the users that are using [10:24.960 --> 10:30.720] the system and on the developers that are trying to work with the metrics. [10:30.720 --> 10:37.040] So, for example, a naming convention for metrics. [10:37.040 --> 10:42.760] One of the things that is mentioned in the document is having a name prefix for your metrics. [10:42.760 --> 10:48.640] This is a very simple action that will help developers [10:48.640 --> 10:54.960] and users identify that the metrics are coming from a specific operator or company. [10:54.960 --> 11:00.600] In this case, you can see that all of the metrics here have a kubevirt prefix. KubeVirt, [11:00.600 --> 11:04.600] as I said, has sub-operators. [11:04.600 --> 11:18.520] So under this prefix, we also have a sub-prefix for each individual operator — cdi, network, and so on. [11:18.520 --> 11:24.760] And this is another example, which does not have such a prefix. [11:24.760 --> 11:30.840] We can see here a container_cpu prefix, for example, but we can't tell where it's coming from. [11:30.840 --> 11:32.800] In this case, it's cAdvisor. [11:32.800 --> 11:36.800] But if you're a user and you're trying to understand where this metric came from, it's [11:36.800 --> 11:44.400] very hard, and you also cannot search in Grafana, for example, for all of the cAdvisor metrics [11:44.400 --> 11:45.400] together. [11:45.400 --> 11:49.840] So that's a problem. [11:49.840 --> 11:55.240] Another thing that is mentioned in the best practices is help text. [11:55.240 --> 12:04.120] Each metric has a dedicated place to add the help text for that metric. [12:04.120 --> 12:10.680] As you can see, in Grafana and other visualization tools the user will be able to see the description [12:10.680 --> 12:13.960] of a metric when hovering over it. [12:13.960 --> 12:20.240] This is very important, because otherwise you need to go somewhere else to search for it. [12:20.240 --> 12:25.640] It also gives you the ability to auto-generate documentation for all of the [12:25.640 --> 12:31.720] metrics on your site. [12:31.720 --> 12:34.040] Another example is base units. [12:34.040 --> 12:39.120] Prometheus recommends using base units for metrics. [12:39.120 --> 12:47.480] For example, as you can see here, for time use seconds, not milliseconds; for temperature, [12:47.480 --> 12:54.600] Celsius, not Fahrenheit. This gives users a fluent experience when they are using the [12:54.600 --> 13:03.040] metrics: they don't need to do conversions on the data. And the guidance is that [13:03.040 --> 13:07.920] if you need millisecond precision, you should use a floating-point number of seconds. [13:07.920 --> 13:14.520] This removes any concern about the magnitude of the number — Grafana can handle it and [13:14.520 --> 13:21.160] will still show you the same precision — while the consistency of the UI and of how the [13:21.160 --> 13:25.760] metrics are used stays the same. [13:25.760 --> 13:31.920] Here you can see an example of metrics that are using seconds. [13:31.920 --> 13:35.920] And here we can see that etcd is not using them. [13:35.920 --> 13:45.720] This is not what is recommended, and we would actually recommend switching it, but they started with milliseconds.
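Pulling those three conventions together — an identifying prefix, help text, and base units — a hypothetical metric definition in Go could look like this. client_golang joins Namespace, Subsystem, and Name with underscores, which maps nicely onto the operator-prefix and sub-operator-prefix idea; the names here are illustrative, not KubeVirt's real metrics:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Rendered name: myoperator_migrations_duration_seconds
var migrationDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "myoperator",       // operator/company prefix
	Subsystem: "migrations",       // sub-operator or component prefix
	Name:      "duration_seconds", // base unit (seconds) spelled out in the name
	Help: "Time a virtual machine migration took to complete, in seconds. " +
		"Shown on hover in Grafana and usable for auto-generated docs.",
})
```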
[13:45.720 --> 13:52.040] And now, making that change would cause issues with the UIs that are built on top of it, and everything else. [13:52.040 --> 13:58.560] So it's a problem to change the names of metrics once they are created. [13:58.560 --> 14:03.640] When I joined the operator, we didn't have name prefixes. [14:03.640 --> 14:08.640] I tried to understand which metrics were ours and which were not, and it was very hard. [14:08.640 --> 14:14.000] So we needed to go and make breaking changes to the metrics — add those prefixes, change [14:14.000 --> 14:24.240] the units — and this duplicated work is what we want others to be able to avoid. [14:24.240 --> 14:28.360] The next piece of information in the best practices is about alerts. [14:28.360 --> 14:31.000] This is an example of an alert. [14:31.000 --> 14:34.600] You can see here that we have the alert name. [14:34.600 --> 14:42.240] We have an expression, which is based on a metric; once the expression is met, the [14:42.240 --> 14:49.720] alert either starts firing, or is in pending state until the evaluation time passes. [14:49.720 --> 14:50.720] There is a description. [14:50.720 --> 14:53.880] There is also the possibility to add a summary. [14:53.880 --> 14:55.680] This is the evaluation time. [14:55.680 --> 14:58.720] It has a severity, [14:58.720 --> 15:01.000] and a link to a runbook URL. [15:01.000 --> 15:10.360] There is other information that you can add to it, but this is the basis. [15:10.360 --> 15:14.360] And what we're saying in the best practices is that, for example, [15:14.360 --> 15:21.280] for the severity label there should be only three valid values: critical, warning, [15:21.280 --> 15:23.000] and info. [15:23.000 --> 15:27.240] If you're using something else, it will be problematic. [15:27.240 --> 15:32.680] You can see here in this example — I don't know if you can see it, but this [15:32.680 --> 15:35.400] is an example from our cluster — [15:35.400 --> 15:41.800] we have info, warning, and critical, and we have one alert with no severity, which is the Watchdog. [15:41.800 --> 15:43.880] It's part of the Prometheus alerts. [15:43.880 --> 15:47.240] It just makes sure that alerting is working as expected, [15:47.240 --> 15:49.080] so it should always stay at one. [15:49.080 --> 15:55.840] Other than that, there should never be alerts that don't have a severity. [15:55.840 --> 15:59.080] And this is a bad example of using the severity label. [15:59.080 --> 16:03.040] In this case, they are using major instead of critical. [16:03.040 --> 16:11.040] The impact of that is that if someone sets up Alertmanager to notify the support team [16:11.040 --> 16:16.720] when something critical happens to the system — say they get notified by Slack or by [16:16.720 --> 16:23.120] pager — they will miss out on this alert, because it doesn't follow the convention [16:23.120 --> 16:29.720] for severity values. [16:29.720 --> 16:37.640] So what we have at the moment in the best practices: a naming convention for metrics; [16:37.640 --> 16:46.040] how to create documentation for metrics; and alerts, with information about alert labels and runbooks. [16:46.040 --> 16:52.040] Runbooks, by the way, are a way to provide more information about an alert: [16:52.040 --> 17:01.680] you have a link in the alert that sends the user to more details — what the alert is about, what its impact is, how to diagnose it, and how to mitigate the issue.
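As a sketch of how an alert with all those pieces can be declared in Go with the prometheus-operator API — the alert name, expression, and runbook URL are hypothetical, and note that the Rule field types vary between prometheus-operator versions (in recent versions `For` is a `*Duration` rather than a plain string; the older string form is assumed here):

```go
package alerts

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// A hypothetical alert: an expression based on a metric, an evaluation time,
// one of the three valid severity values, and summary/description/runbook
// annotations.
var reconcileErrorsAlert = monitoringv1.Rule{
	Alert: "MyOperatorReconcileErrors",
	Expr:  intstr.FromString("rate(myoperator_reconcile_errors_total[5m]) > 0"),
	For:   "10m", // stays in pending state for 10 minutes before it starts firing
	Labels: map[string]string{
		"severity": "warning", // only critical, warning, or info
	},
	Annotations: map[string]string{
		"summary":     "my-operator is failing to reconcile.",
		"description": "Reconcile loops have been returning errors for more than 10 minutes.",
		"runbook_url": "https://example.com/runbooks/MyOperatorReconcileErrors",
	},
}
```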
[17:01.680 --> 17:11.560] Then there is additional information about how to test metrics and how to test alerts. [17:11.560 --> 17:22.040] We plan to enrich this information and add information about dashboards, logging, events, and tracing in the future. [17:22.040 --> 17:28.200] So, Shirley gave a high-level overview of metrics and alerts. [17:28.200 --> 17:33.160] But how do we translate some of these best practices into code? [17:33.160 --> 17:38.320] One of the problems that we faced was that logic code and monitoring code were becoming [17:38.320 --> 17:41.160] very intertwined. [17:41.160 --> 17:44.480] Code like this becomes harder to maintain. [17:44.480 --> 17:51.040] It becomes more difficult to understand what the code does and to modify it. [17:51.040 --> 17:56.400] This leads, obviously, to longer development times and potential bugs, and it's also more [17:56.400 --> 18:03.120] challenging to onboard new team members or to contribute to one of these projects. [18:03.120 --> 18:12.480] In this specific snippet, about 16.4% of the code was monitoring code intertwined with migration logic code. [18:12.480 --> 18:22.720] So what we did was refactor this code to separate these concerns from each other. [18:22.720 --> 18:29.920] In this specific case, we used a Prometheus collector that just iterates over the existing [18:29.920 --> 18:36.560] virtual machine migrations, and then pushes the metrics according to the [18:36.560 --> 18:41.760] status of the virtual machines — whether they were successful or not, or the counts of [18:41.760 --> 18:45.480] pending, scheduled, and running migrations; a sketch of this pattern follows below. [18:45.480 --> 18:50.800] This snippet makes it obviously much easier to understand how the monitoring is being [18:50.800 --> 18:56.600] done, and we take all of it out of the migration logic code. [18:56.600 --> 19:05.120] And to help other developers who are starting out avoid the same mistakes we had to solve, [19:05.120 --> 19:10.440] we are creating a monitoring example in the memcached operator. [19:10.440 --> 19:18.960] We already have an initial example that takes all these concerns into account [19:18.960 --> 19:24.720] in the separation between logic code and monitoring code. [19:24.720 --> 19:32.160] Our idea with this example is to make things as clear as possible — this is especially [19:32.160 --> 19:39.880] important when we are working with large and complex code bases — and also to make the code more modular. [19:39.880 --> 19:44.920] It's easier to understand both the logic code and the monitoring code when neither affects [19:44.920 --> 19:52.200] the other's functionality or the application in general. It also makes the code more reusable, [19:52.200 --> 19:57.760] since, for example, the way we do monitoring in different operators will always [19:57.760 --> 19:59.520] be more or less the same. [19:59.520 --> 20:07.240] So if we find a more or less common way to do this, it will make it easier to reuse this [20:07.240 --> 20:13.480] code in other applications and projects, which will save time and effort. [20:13.480 --> 20:18.200] And it also becomes more performant.
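Here is the minimal sketch of that collector pattern, as referenced above — the lister interface and the metric name are hypothetical stand-ins for however the operator actually looks up its migrations:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// migrationLister abstracts how the operator looks up its migrations;
// a hypothetical interface for this sketch.
type migrationLister interface {
	CountByPhase() map[string]int // e.g. "Pending", "Running", "Succeeded", "Failed"
}

var migrationsDesc = prometheus.NewDesc(
	"myoperator_migrations",
	"Number of virtual machine migrations, by phase.",
	[]string{"phase"},
	nil,
)

// migrationCollector keeps monitoring concerns out of the migration logic:
// it is only consulted when Prometheus scrapes the /metrics endpoint.
type migrationCollector struct {
	lister migrationLister
}

func (c migrationCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- migrationsDesc
}

func (c migrationCollector) Collect(ch chan<- prometheus.Metric) {
	// Iterate the existing migrations and push one sample per phase.
	for phase, count := range c.lister.CountByPhase() {
		ch <- prometheus.MustNewConstMetric(
			migrationsDesc, prometheus.GaugeValue, float64(count), phase,
		)
	}
}

// RegisterMigrationCollector wires the collector into the default registry.
func RegisterMigrationCollector(l migrationLister) {
	prometheus.MustRegister(migrationCollector{lister: l})
}
```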
[20:18.200 --> 20:25.960] If we mix all the monitoring concerns into the migration code, it's obvious that the [20:25.960 --> 20:31.840] time it takes to perform a migration will be longer, because we are calculating metric [20:31.840 --> 20:38.520] values and doing Prometheus operations while we are trying to compute the state [20:38.520 --> 20:39.640] of a migration. [20:39.640 --> 20:46.280] So having this separation helps here as well. [20:46.280 --> 20:53.760] Our idea for the structure of the code is to create a package. [20:53.760 --> 21:00.240] For example, here you can see a migration example: a central place where we [21:00.240 --> 21:08.080] register all metrics, and then [21:08.080 --> 21:14.320] files that separate those metrics by their types. [21:14.320 --> 21:19.120] In this example, you can see one operator metrics file, which has [21:19.120 --> 21:26.160] all the operator-related metrics, as we discussed in the beginning, and then we could have one [21:26.160 --> 21:34.520] specific file only for the migration metrics, and register them all in one place; a sketch of this layout follows below. [21:34.520 --> 21:41.440] So why did we settle on this structure, and what benefits could it bring? [21:41.440 --> 21:46.960] The first one is automating the metric and alert code generation. [21:46.960 --> 21:55.680] As we saw, much of the work a developer needs to do is creating a file [21:55.680 --> 22:02.480] with a specific name, then going to the metrics.go file and registering that file there. [22:02.480 --> 22:07.920] This is very structured and always the same, [22:07.920 --> 22:13.240] so it will be easy to automate, allowing developers to use a command-line tool to generate [22:13.240 --> 22:17.760] new metrics and new alerts more easily. [22:17.760 --> 22:23.000] We are also looking forward to creating a linter for metric names. [22:23.000 --> 22:30.520] As Shirley said, one of the recurring issues as operators become more advanced [22:30.520 --> 22:35.920] is looking back at the metrics and seeing everything we did wrong with their naming. [22:35.920 --> 22:41.200] And even though, as she said, it's a simple change, it can have a lot of impact. [22:41.200 --> 22:47.480] So a linter that enforces all these conventions will also be important. [22:47.480 --> 22:51.440] Also, automated metrics documentation — we are already doing this. [22:51.440 --> 22:58.360] One thing that we faced was that a lot of metrics were scattered around the code, [22:58.360 --> 23:03.520] so it was not easy to automate finding all of them. [23:03.520 --> 23:10.040] With a structure like the previous one, it will be much easier to create a full [23:10.040 --> 23:18.200] list of metrics with their descriptions, which will help developers, newcomers, and users alike. [23:18.200 --> 23:24.840] And lastly, an easier structure for both unit and end-to-end testing: if we [23:24.840 --> 23:33.560] have this clear structure for where the metrics are, we can test there — test [23:33.560 --> 23:42.960] exactly those functions — and not code intertwined with logic code.
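A sketch of that package layout in Go; in a real project the three sections below would live in the separate files named in the comments, and all names are illustrative:

```go
// pkg/monitoring/metrics/metrics.go — the central registration point.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"

	runtimemetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// SetupMetrics registers every metric the operator exposes, in one place.
func SetupMetrics() {
	for _, group := range [][]prometheus.Collector{
		operatorMetrics,  // from operator_metrics.go
		migrationMetrics, // from migration_metrics.go
	} {
		runtimemetrics.Registry.MustRegister(group...)
	}
}

// operator_metrics.go — operator-related metrics, as discussed at the start.
var operatorMetrics = []prometheus.Collector{
	prometheus.NewCounter(prometheus.CounterOpts{
		Name: "myoperator_reconcile_total",
		Help: "Total number of reconcile loops executed.",
	}),
}

// migration_metrics.go — metrics for one specific feature area.
var migrationMetrics = []prometheus.Collector{
	prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "myoperator_migrations_running",
		Help: "Number of virtual machine migrations currently running.",
	}),
}
```

With this shape, adding a new feature area means adding one file with a metrics slice and one line in SetupMetrics — exactly the kind of repetitive structure a generator or linter can automate and check.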
[23:42.960 --> 23:47.800] And just to conclude: if you are starting to create an operator, or if you already have [23:47.800 --> 23:53.600] an operator, we invite you to go and look at the Operator SDK and at the best practices, [23:53.600 --> 23:56.400] to try to avoid the pitfalls that we hit. [23:56.400 --> 23:58.600] I really hope it will help you. [23:58.600 --> 24:04.040] And you should really consider, when you're creating a new operator, that it starts [24:04.040 --> 24:07.760] small but can become really robust, [24:07.760 --> 24:09.760] and you cannot tell that at the beginning. [24:09.760 --> 24:13.920] So think ahead and try to build it correctly from the beginning. [24:13.920 --> 24:15.840] I hope this will be helpful for you. [24:15.840 --> 24:16.840] And thank you. [24:16.840 --> 24:17.840] Thank you. [24:17.840 --> 24:41.320] Do you have any recommendations on how you would log the decision points within your operator? [24:41.320 --> 25:01.520] So if you wanted to retrospectively see why it did certain things — the decision points, how it decided which Kubernetes API calls to make — [25:01.520 --> 25:07.320] if your operator did something crazy and you wanted to look back and see why it did that, is there anything you would [25:07.320 --> 25:13.360] do in advance with the logging? [25:13.360 --> 25:18.320] I think the summary of what we've learned is in these documents. [25:18.320 --> 25:25.160] Because, as I said, the developers that started this project didn't have [25:25.160 --> 25:29.240] anywhere to go for best practices on how to name a metric. [25:29.240 --> 25:32.240] So they just named it how they thought best. [25:32.240 --> 25:39.960] And they did follow the Prometheus recommendations, but having a prefix for the operator has a big [25:39.960 --> 25:43.320] impact for the users. [25:43.320 --> 25:44.760] And not only the users: [25:44.760 --> 25:51.120] when we are trying to understand how to use internal metrics for our own purposes, we also [25:51.120 --> 25:55.440] struggle to understand where a metric came from and where the code for it is. [25:55.440 --> 26:01.440] So all of the summary of what we've learned is in those documents, and we plan to enrich [26:01.440 --> 26:03.440] it even further. [26:03.440 --> 26:12.760] Thank you for your talk. It was very interesting. [26:12.760 --> 26:17.880] You mentioned code generation for the metrics package. [26:17.880 --> 26:24.560] My question is, do you plan on adding that to Kubebuilder and the Operator SDK? [26:24.560 --> 26:33.720] Yeah. Basically, we are working on the Operator SDK right now, because we want to build all [26:33.720 --> 26:38.200] these tools, and we are thinking about them, but obviously this needs a lot of help from [26:38.200 --> 26:39.400] the community. [26:39.400 --> 26:47.080] And I'm saying this because I'll add a personal note here, and an idea. [26:47.080 --> 26:53.640] Because the way I see it is that in Kubebuilder and in the Operator SDK, you would [26:53.640 --> 26:58.120] just go and say that you want to generate a project with monitoring, and it creates [26:58.120 --> 26:59.840] the monitoring package.
[26:59.840 --> 27:05.920] Or, if the operator already exists, you have a command to generate the monitoring package. [27:05.920 --> 27:12.280] And then in Kubebuilder, just like you use it to create an API or a controller, you'd have [27:12.280 --> 27:14.480] a similar command to create a new metric, [27:14.480 --> 27:18.920] where you pass the type of the metric and the help text — and the same for alerts. [27:18.920 --> 27:20.520] At least that's the way I see it. [27:20.520 --> 27:23.800] And to me it makes sense. [27:23.800 --> 27:24.800] I agree. [27:24.800 --> 27:25.800] Thank you. [27:25.800 --> 27:36.840] Hey, thank you for the talk. [27:36.840 --> 27:42.480] How much of the conventions that you talked about are aligned with OpenTelemetry, in your opinion? [27:42.480 --> 27:44.480] How much of... what? [27:44.480 --> 27:47.000] Aligned with OpenTelemetry. [27:47.000 --> 27:50.920] Most of them are aligned with OpenTelemetry, actually. [27:50.920 --> 27:53.560] But these are specific to operators. [27:53.560 --> 27:54.560] That's the idea. [27:54.560 --> 27:58.120] The idea is that you have a central place where you can get the information. [27:58.120 --> 28:03.600] And by the way, if someone is creating a new operator and has an insight, we encourage [28:03.600 --> 28:09.640] them to contribute to the documentation, to teach others and share the information. [28:09.640 --> 28:10.640] So, yeah. [28:10.640 --> 28:18.800] Basically, I think we align with the OpenTelemetry conventions, but we add more ideas on top of them, [28:18.800 --> 28:29.400] specific to operators. [28:29.400 --> 28:30.400] I think that's it. [28:30.400 --> 28:58.560] Thank you.