So, I hope this will be fun enough to wake you up at the end of the day. I'm very excited to be here at FOSDEM, and specifically in the CI/CD dev room. Today I'd like to share how we gained observability into our CI/CD pipeline, and how you can too.

Let's start with a day in the life of a developer on duty, a DoD, at least in my company. It goes like this. The first thing the DoD does in the morning, or at least did before we went through this exercise, is go into Jenkins. We work with Jenkins, but the takeaways apply to pretty much any other system you work with, so nothing here is too specific.

So you go into Jenkins at the beginning of the morning, look at the status of the pipelines from the last few hours overnight, and check whether anything is red, and most importantly, whether there's a red master. Based on that, you can either finish your coffee or jump straight into an investigation. To be honest, sometimes people actually forgot to go into Jenkins and check at all, so that's another topic we may touch on.

Say you go in and see a failure, something red. You then need to go one by one over the different runs and start figuring out what failed, where it failed, why it failed, and so on. And you really did need to go run by run: we have several of them, the backend, the app, smoke tests, several of these, and you have to piece the picture together, spot the pattern across runs and across branches, to understand what's going on. On top of all that, it was very difficult to compare against historical behavior, to tell what's an anomaly and what's the steady state these days.

To give you a few examples of questions we found difficult or time-consuming to answer: did all runs fail on the same step? Did all runs fail for the same reason? Is it on a specific branch? On a specific machine? If something is taking longer, is that normal or anomalous, and what's the benchmark? These sorts of questions took us too long to answer, and we realized we needed to improve.

A word about myself: my name is Dotan Horvits, and I'm the Principal Developer Advocate at a company called Logz.io. Logz.io provides a cloud-native observability platform built on popular open-source tools you probably know: Prometheus, OpenSearch, OpenTelemetry, Jaeger, and others.
I come from a background as a developer, a solutions architect, even a product manager, and most importantly, I'm an advocate of open source and communities. I run a podcast called OpenObservability Talks about open-source DevOps and observability, so if you're interested in these topics and you like podcasts, do check it out. I also organize or co-organize several communities: the Tel Aviv chapter of the CNCF, the Cloud Native Computing Foundation, Kubernetes Community Days, DevOpsDays, et cetera. You can find me everywhere as @horovits, so if you tweet something interesting, feel free to tag me.

Before I get into how we improved our CI/CD pipeline and capabilities, let's first understand what we wanted to improve on. I very often see people jump into solving before really understanding the metric, the KPI, they want to improve. Very basically, there are four primary metrics for DevOps performance, and you can see them on the screen: deployment frequency, lead time for changes, change failure rate, and MTTR, mean time to recovery. I don't have time to go over all of these, but they're very important, so if you're new to this and want to read a bit more, I left a QR code and a short link at the bottom for a 101 on the DORA metrics. Do check it out; I think it's priceless.

In our case, we needed to improve the lead time for changes, sometimes called cycle time, which is the amount of time it takes a commit to get into production. In our case that time was too long, too high, and it was holding us back.

Our engineering team is made up of observability experts; that's what we do for a living. So it was very clear to us that what we were missing was observability into our CI/CD pipeline.

To be fair to Jenkins, and there are lots of things to complain about with Jenkins, it does have some capabilities here. You can go into a specific pipeline run, see the different steps, and see how much time an individual step took. Using some plugins you can also visualize the pipeline graph, and we even wired Jenkins to send alerts to Slack. But that wasn't good enough for us. We wanted a way to monitor aggregated and filtered information on our own terms: our own time windows, our own filters, views across branches and across runs, and comparisons against historical data. That's what we aimed at.

So we launched an internal project with four requirements.
First and foremost, we needed dashboards with aggregated views, to see the aggregated data across pipelines, across runs, and across branches, as we discussed. Second, we wanted access to historical data, to be able to compare, understand trends, and identify patterns, anomalies, and so on. Third, we wanted reports and alerts, to automate as much as possible. And lastly, we wanted the ability to view flaky tests and test performance, and to understand their impact on the pipeline.

Those were the project requirements. How did we do it? Essentially, it takes four steps: collect, store, visualize, and report. I'll show you exactly how it's done and what each step entails.

In terms of the tech stack, we were very well versed in the ELK Stack, Elasticsearch and Kibana, and we later switched over to OpenSearch and OpenSearch Dashboards after Elastic relicensed and it was no longer open source. So that was the natural starting point for our observability journey, and I'll show you how we did these four steps with this stack.

The first step is collect. We instrumented the pipeline to collect all the relevant information and put it in environment variables. Which information? You can see some examples on the screen: the branch, the commit hash, the machine IP, the run type (whether it's scheduled, triggered by a merge to master, or something else), the failed step, step durations, the build number; essentially anything you'll find useful for investigation later. My recommendation: collect it and persist it.

That's the collect phase, and after collect comes store. For that, we created a new summary step at the end of the pipeline, where we ran a command that gathers all the information collected in the first step, creates a JSON document, and persists it to Elasticsearch, and later, as I mentioned, to OpenSearch.

It's important to say, again in fairness to Jenkins and for the Jenkins experts here: Jenkins does have some built-in persistence capabilities, and we tried them out, but they weren't good enough for us. The reason is that by default Jenkins keeps all the builds and stores them on the Jenkins machines, which of course burdens those machines, and then you have to start limiting the number of builds and the retention, how many days, and so on and so forth. That wasn't good enough for us. We needed more powerful access to historical data.
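To make the collect-and-store idea concrete, here is a minimal sketch of what such a summary step might run. It is an illustration, not our exact implementation: the index name and cluster URL are placeholders, and apart from GIT_BRANCH, GIT_COMMIT, and BUILD_NUMBER (which Jenkins sets for you), the environment variable names are hypothetical.

```python
"""Minimal sketch of an end-of-pipeline summary step: read the values
earlier steps exported as environment variables, build one flat JSON
document, and persist it to Elasticsearch/OpenSearch."""
import json
import os
from datetime import datetime, timezone

import requests

doc = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "branch": os.environ.get("GIT_BRANCH"),          # set by Jenkins
    "commit": os.environ.get("GIT_COMMIT"),          # set by Jenkins
    "build_number": os.environ.get("BUILD_NUMBER"),  # set by Jenkins
    "machine_ip": os.environ.get("NODE_IP"),         # illustrative name
    "run_type": os.environ.get("RUN_TYPE"),          # scheduled / merge / manual
    "failed_step": os.environ.get("FAILED_STEP"),    # empty if the run passed
    "step_durations": json.loads(os.environ.get("STEP_DURATIONS", "{}")),
    "status": os.environ.get("RUN_STATUS"),
}

# A date-based index makes retention easy to manage off the Jenkins box.
index = f"ci-runs-{datetime.now(timezone.utc):%Y.%m.%d}"
resp = requests.post(
    f"https://search.example.com/{index}/_doc",  # placeholder cluster URL
    json=doc,
    timeout=10,
)
resp.raise_for_status()
```

The point is simply that the summary is a flat JSON document, so any indexing client, or a one-line curl in a shell step, works just as well.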
As noted, we wanted to persist historical data under our own control: the duration, the retention, and most importantly, off of the Jenkins servers, so as not to risk overloading the critical path.

So that's store. Once we have all the data in Elasticsearch or OpenSearch, it's very easy to build Kibana or OpenSearch Dashboards visualizations on top of it. Then comes the question: which visualizations should I build? Here's a tip, take it with you: go back to the pains, go back to the questions you found hard to answer, and use those as your starting point. If you remember, earlier we mentioned questions such as: did all runs fail on the same step? Did all runs fail for the same reason? How many failed? Is it a specific branch, a specific machine? These are the questions that guide you to the right visualizations for your dashboard.

Let me give you some examples. Start with the top-line view: you want to understand how healthy and stable your pipeline is, so visualize the success and failure rates. You can do that overall or for a specific time window on a graph, and at first glance it's very easy to see the health status of your pipeline.

You want to find problematic steps? Visualize failures segmented by pipeline step; again, it's very easy to spot the spiking step. You want to detect problematic build machines? Visualize failures segmented by machine. That one, by the way, saved us a lot of time otherwise wasted chasing bugs in the release code. When we saw a machine-specific pattern, we'd just kill the machine, let the autoscaler spin up a new instance, and start clean, and in many cases that solved the problem. In general, this aspect of code-based versus environment-based issues is a challenge, I assume not just for me, so I'll get back to it soon.

Another example is duration per step: again, very easy to see where the time is spent. So that's the visualize part.
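As a concrete illustration, here is a hedged sketch of the kind of aggregation that could power a "failures segmented by step" panel. The index and field names match the earlier summary-step sketch and are illustrative, and it assumes failed_step is mapped as a keyword field; the same query works against Elasticsearch or OpenSearch.

```python
"""Sketch: count last-24h failures bucketed by pipeline step."""
import requests

query = {
    "size": 0,  # we only want the aggregation, not the documents
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "failure"}},
                {"range": {"timestamp": {"gte": "now-24h"}}},
            ]
        }
    },
    # terms aggregation requires a keyword-mapped field
    "aggs": {"by_step": {"terms": {"field": "failed_step", "size": 20}}},
}

resp = requests.post(
    "https://search.example.com/ci-runs-*/_search",  # placeholder URL
    json=query,
    timeout=10,
)
for bucket in resp.json()["aggregations"]["by_step"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} failures')
```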
After visualize comes the reporting and alerting phase. If you remember, before, the DoD, the developer on duty, had to go check Jenkins manually and do the health check. Now the DoD gets a start-of-day report directly in Slack, and as you can see, the report contains a link to the dashboard and even a snapshot of the dashboard embedded in the Slack message. At first glance, even without opening the dashboard, you can see whether you can finish your coffee or whether something is alarming and you need to click that link and start investigating.

And it doesn't have to be a scheduled report. You can also define triggered alerts on any of the fields, the data we gathered in the collect phase, with whatever complex queries or conditions you want: if the sum of failures goes above X, or the average duration goes above Y, trigger an alert. Essentially, anything you can formalize as a Lucene query, you can automate as an alert; that's an alerting layer we built on top of Elasticsearch and OpenSearch.

One last note: I'm giving the examples in Slack because that's what we use in our environment, but you're obviously not limited to Slack. There's support for many notification endpoints depending on your systems: PagerDuty, VictorOps, Opsgenie, MS Teams, whatever. We personally work with Slack, so the examples are in Slack.

So that's how we built observability into the Jenkins pipelines. But as we all know, especially here in the CI/CD dev room, CI/CD is much more than just Jenkins. So what else? If you remember the original requirements, we wanted to analyze flaky tests and test performance. We followed the same process: collecting all the relevant information from the test runs, storing it in Elasticsearch or OpenSearch, and then creating a Kibana or OpenSearch dashboard. As you can see, it has all the usual suspects you'd expect: test durations, failed tests, flaky tests, failure counts and rate moving averages, failed tests by branch over time; everything you need to analyze and understand the impact of your tests and flaky tests on your system. And similarly, after visualize you can report: we created reports to Slack, with a dedicated Slack channel for it, following the same pattern.

One important point is openness: once you have the data in OpenSearch or Elasticsearch, it's very easy for different teams to create different visualizations on top of that same data.
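Returning for a moment to the alerting layer described above, here is a minimal sketch of the pattern: anything expressible as a Lucene query becomes an alert. It is a standalone illustration rather than our actual alerting layer; the index and field names, the threshold, and the Slack webhook URL are all placeholders.

```python
"""Sketch: turn a Lucene query plus a threshold into a Slack alert."""
import requests

LUCENE_QUERY = "status:failure AND branch:master"  # any Lucene query works
THRESHOLD = 3  # alert if more than 3 matching failures in the last hour

resp = requests.post(
    "https://search.example.com/ci-runs-*/_count",  # placeholder URL
    json={"query": {"bool": {"filter": [
        {"query_string": {"query": LUCENE_QUERY}},
        {"range": {"timestamp": {"gte": "now-1h"}}},
    ]}}},
    timeout=10,
)
failures = resp.json()["count"]

if failures > THRESHOLD:
    requests.post(
        "https://hooks.slack.com/services/T000/B000/XXXX",  # placeholder webhook
        json={"text": f":rotating_light: {failures} master failures in the last hour"},
        timeout=10,
    )
```

Run on a schedule (cron, a Jenkins job, whatever you have), this is the whole idea: query, compare, notify.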
On that openness point, to take another extreme: a different team didn't like the graphs and preferred table views and counters to visualize, very similarly, test stats and so on. And that's the beauty of it.

Just to summarize so far: we instrumented the Jenkins pipeline to collect relevant data and put it in environment variables; at the end of the pipeline we created a JSON document with all this data and persisted it to Elasticsearch/OpenSearch; then we created Kibana dashboards on top of that data; and lastly we created reports and alerts on that data. Four steps: collect, store, visualize, and report.

That was our first step in the journey, but we didn't stop there. The next step: we asked ourselves what we could do to investigate the performance of specific pipeline runs. You have a run that takes a lot of time and you want to optimize, but where is the problem? That's exactly what distributed tracing is ideal for. How many people know what distributed tracing is, with a show of hands? Okay, I see most of us do, and a few don't, so I'll say a word about it soon.

Very importantly, Jenkins has the capability to emit trace data, spans, just like it does for logs; it's already built in. So we decided to visualize jobs and pipeline executions as distributed traces. That was the next step.

For those who don't know, distributed tracing essentially helps pinpoint where issues occur and where latency lies in production environments, in distributed systems; it's not specific to CI/CD. Think about a microservices architecture, with a request coming in and flowing through a chain of interacting microservices. When something goes wrong and you get an error on that request, you want to know where in the chain the error is; if there's latency, you want to know where the latency is. That's distributed tracing in a nutshell.

The way it works is that each step in the call chain, or in our case each step in the pipeline, creates and emits a span. You can think of a span as a structured log that also contains the trace ID, the start time, the duration, and some other context. A backend then collects all these spans, reconstructs the trace, and visualizes it, typically in a timeline view or Gantt chart like the one you can see on the right-hand side.

Now that we understand distributed tracing, let's see how we add this kind of pipeline-performance tracing to a CI/CD pipeline, following the same process.
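To make the span model concrete before we wire up Jenkins: this is not what the Jenkins OpenTelemetry plugin does internally, just a minimal sketch of emitting a pipeline-shaped trace with the OpenTelemetry Python SDK, assuming an OTLP/gRPC endpoint (such as an OpenTelemetry Collector) listening on localhost:4317.

```python
"""Sketch: one root span per pipeline run, one child span per step."""
import time

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "ci-pipeline"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ci.pipeline")

with tracer.start_as_current_span("pipeline-run") as run:
    run.set_attribute("git.branch", "master")  # context for later filtering
    with tracer.start_as_current_span("build"):
        time.sleep(0.1)  # stand-in for the real build step
    with tracer.start_as_current_span("smoke-tests"):
        time.sleep(0.1)  # stand-in for the real test step

provider.shutdown()  # flush spans before the process exits
```

Each `with` block becomes one line in the Gantt view; nesting is what lets the backend reconstruct the parent-child structure of the run.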
For the collect step, we decided to use an OpenTelemetry Collector. Who doesn't know about OpenTelemetry, the project, just so I have a sense of the background? Okay, a few, so I'll say a word about it. In any case, I've added a QR code and a link in the lower corner for a beginner's guide to OpenTelemetry that I wrote, and I gave a talk about OpenTelemetry at KubeCon Europe, so you may find those useful. Very briefly, it's an observability framework for collecting logs, metrics, and traces, so it's not specific to traces, in an open, unified, standard manner. It's an open-source project under the CNCF, the Cloud Native Computing Foundation. It was a fairly young project at the time, but the tracing piece of OpenTelemetry was already GA, generally available, so we decided to go with it. Today, by the way, metrics is also close to GA, already at release candidate, and logging is still not there.

So what do you need to do if you choose OpenTelemetry? You set up the OpenTelemetry Collector, which acts as the agent you send through. You install the Jenkins OpenTelemetry plugin, which is very easy to do in the UI. And then you configure the Jenkins OpenTelemetry plugin to send to the OpenTelemetry Collector endpoint, over the OTLP over gRPC protocol.

That's the collect phase, and after collect comes store. For the backend, we used Jaeger. Jaeger is also a very popular open-source project under the CNCF, specifically for distributed tracing. We use Jaeger to monitor our own production environment, so it was the natural choice here too. We also have a Jaeger-based service, so we just used that, but everything I show here works with any Jaeger distribution, managed or self-hosted. And if you do run your own, by the way, I've added a short link on how to deploy Jaeger on Kubernetes in production, a very useful guide. What you need to do is configure the OpenTelemetry Collector to export, in OpenTelemetry Collector terms, to Jaeger in the right format.

Once you have that, the visualize part is much easier in this case, because Jaeger comes with a UI and predefined views; you don't need to start composing visuals. Essentially, what you can see here on the left-hand side is the indented tree structure, and on the right, the Gantt chart.
Each line here is a span, and it's very easy to see the pipeline sequence. The text is a bit small, but for each step of the pipeline you can see the duration, how long it took, and which steps ran in parallel versus sequentially. If the overall run has very long latency, you can see where most of the time is being spent, where the critical path is, and where you'd best optimize. By the way, Jaeger also offers other views, like the recently added flame graph, trace statistics, the graph view, and so on, but the timeline view is what people are used to, so that's what I'm showing.

So that's Jaeger. And of course, as we said before, CI/CD is more than just Jenkins, so what can you do beyond Jenkins? You can instrument additional pieces like Maven, Ansible, and other elements to get finer granularity in your traces and steps. For example, here, the spans in yellow are Maven build steps. What used to be one black-box span in the trace, you can now click open and see the individual build steps, each with its own duration and its own context.

So that's, in a nutshell, how we added tracing to our CI/CD pipeline. The next step: as I mentioned before, many of the pipeline failures happened not because of the released code but because of the CI/CD environment. So we decided to monitor metrics from the Jenkins servers and the environment: the system, the containers, the JVM, essentially anything that could break irrespective of the released code. And we followed the same flow.

For the first step, collect, we used Telegraf; we use it in production, so we used it here as well. It's an open-source agent by InfluxData. Essentially you need two things. First, enable and configure Jenkins to expose metrics in Prometheus format; we work a lot with Prometheus for metrics, so that was our natural choice, and it's a simple configuration in the Jenkins web UI. Then install Telegraf, if you don't already have it, and make sure it's configured to scrape the metrics off of the Jenkins server using the Prometheus input plugin. So that's the collect step.
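Before wiring up Telegraf, a quick hedged sanity check that Jenkins is actually exposing metrics in Prometheus format can save some debugging. The Jenkins URL is a placeholder; the /prometheus/ path is the Jenkins Prometheus plugin's default, so verify it against your own installation.

```python
"""Sketch: scrape the Jenkins Prometheus endpoint and print a sample."""
import requests
from prometheus_client.parser import text_string_to_metric_families

body = requests.get("http://jenkins.example.com/prometheus/", timeout=10).text

# Parse the exposition-format text and show anything Jenkins-related.
for family in text_string_to_metric_families(body):
    if "jenkins" in family.name:
        for sample in family.samples[:3]:  # first few samples per family
            print(sample.name, sample.labels, sample.value)
```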
The second step is the store side. As I mentioned, we use Prometheus for metrics, so we used it here as well; we even have our own managed Prometheus, so we used that. But everything I show here is identical whether you use Prometheus or any Prometheus-compatible backend.

Essentially, you need to configure Telegraf to send the metrics to Prometheus, and there are two ways to do that: pull mode or push mode. Pull mode is the default for Prometheus: you configure Telegraf to expose a /metrics endpoint for Prometheus to scrape, using the Prometheus Client output plugin. If you want push mode instead, you use the HTTP output plugin, and one important note there: make sure you set the data format to Prometheus remote write.

That's the store phase. Once you have all the data in Prometheus, it's very easy to create Grafana dashboards on top of it, and I've given some examples here. You can filter, of course, by build type, branch, machine ID, build number, and so on. This example is system monitoring: CPU, memory, disk usage, load, and so on. You can monitor the Docker containers: CPU, I/O, inbound and outbound traffic, disk usage, and obviously the running, stopped, and paused containers per Jenkins machine, everything you'd expect. There are JVM metrics, Jenkins being a Java implementation: thread count, heap memory, garbage collection duration, things like that. You can even monitor the Jenkins nodes, queues, and executors themselves; in this example dashboard you can see the queue size, a status breakdown, the Jenkins jobs, the count executed over time with a breakdown by job status, and so on. There are obviously lots of other visualizations you can create, and you can also create alerts, but I won't show that for lack of time.
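To close the metrics loop, here is a hedged sketch of pulling one of those scraped Jenkins metrics back out of Prometheus via its HTTP API, for example to sanity-check what a Grafana panel would show. The Prometheus URL is a placeholder, and the metric name jenkins_queue_size_value is my assumption of what the Jenkins Prometheus plugin exposes; check your own endpoint for the exact names.

```python
"""Sketch: query the build-queue size over the last six hours."""
import time

import requests

end = time.time()
start = end - 6 * 3600  # six hours ago, as a Unix timestamp

resp = requests.get(
    "http://prometheus.example.com:9090/api/v1/query_range",  # placeholder URL
    params={
        "query": "avg_over_time(jenkins_queue_size_value[5m])",
        "start": start,
        "end": end,
        "step": "5m",
    },
    timeout=10,
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][:3], "...")
```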
So, just to summarize what we've seen: treat your CI/CD the same as you treat your production. For your production you use Elasticsearch, OpenSearch, Grafana, whatever, to create observability; do the same for your CI/CD pipeline, and preferably leverage the same stack, the same toolchain, and don't reinvent the wheel.

That was our journey. As I mentioned, we wanted dashboards and aggregated views to see several pipelines across different runs and branches over time. We wanted historical data with controlled persistence off of the Jenkins servers, so we determine the duration and retention of that data. We wanted reports and alerts to automate as much as possible. And lastly, we wanted test performance, flaky tests, and so on. You saw how we achieved all of that, in four steps.

If there's one thing to take away from this talk, take this: collect, store, visualize, and report and alert.

And what did we gain? To summarize: a significant improvement in our lead time for changes, our cycle time, if you remember the DORA metrics from the beginning. Along the way we also got an improved developer-on-duty experience, with much less suffering there. And it's based on open source, which is very important, as we're here at FOSDEM: OpenSearch, OpenTelemetry, Jaeger, Prometheus, Telegraf; you saw the stack. If you want more information, there's a QR code here for a guide to CI/CD observability that I wrote; you're welcome to take the short link and read more. But that was it, very much in a nutshell.

Thank you very much for listening. I'm Dotan Horvits, and enjoy the rest of the conference. I don't know if we have time for questions. No? So I'm here if you have questions or want a sticker, and may the open source be with you. Thank you.

[Host] We do have time for questions, if there are any, so we can take a few minutes. Who wants to be the first to ask?

Q: Thanks. Have you considered persistence? How long do you store your metrics and your traces?

A: We have. That was part of the original challenge when we used Jenkins persistence: when you persist on the nodes themselves, you're obviously very limited, and there's a plugin you can configure per days or per number of builds and so on. When you store it off of that critical path, you have much more room to maneuver, and then it depends on the amount of data you collect. We started small, so we kept data for longer periods, but as the appetite grew, people wanted more and more types of metrics and time-series data, so we needed to be a bit more conservative. It's very much dependent on your data practices.

Q: The question was more about the process, so iterative; you explained it, it starts small.

A: Yeah, exactly.
And iterative is best, because it really depends: you need to learn the patterns of your data consumption, your telemetry, and then you can optimize the balance between having the observability and not overloading things or driving up costs.

Q: Right. Thank you, very interesting.

[Host] There was another question in the back.

Q: Thank you. What was the most surprising insight you learned, good or bad, and how did you react to it?

A: I think I was personally most surprised by the number of failures that occur because of the environment, what kinds of things they are, and how simple it is to just kill the machine, kill the instance, let the autoscaler spin it back up, and save yourself a lot of hassle and a lot of waking people up at night. That was astonishing: how many things are irrespective of the code and purely environmental. We took a lot of learnings from that to make the environment more robust, to get people to clean up after themselves, to automate the cleanups, things like that. That, to me, was insightful.

[Host] Thank you. Any other questions? Then I have one last one, sorry. My question is: who are usually the people looking at the dashboards? I've maintained a lot of dashboards in the past, and sometimes I had the feeling I was the only one looking at them, so I'm wondering whether you've identified the type of people who really benefit from those dashboards.

A: That's a very interesting question, because we also learned and changed the org structure several times, so it moved between Dev and DevOps. We now have a release engineering team, so they are the main stakeholders, but the dashboard's goal, as I said, is the developer on duty, so everyone who is on call needs to see it, that's for sure, along with the tier-two and tier-three people in that chain. It's also used at a high level by the team leads on the developer side. So those are the main stakeholders, depending on whether it's the critical path of the developer on duty and the tiers, or the overall health state in general for the release engineers. Thank you.

[Host] Thank you very much, everyone.