[00:00.000 --> 00:16.040] Welcome our next speakers and give them a round of applause. [00:16.040 --> 00:17.040] Can you hear me? [00:17.040 --> 00:19.280] I guess you can. [00:19.280 --> 00:23.720] So yeah, hello, everyone, and welcome to the session about how we can use OpenTelemetry [00:23.720 --> 00:29.360] on Kubernetes to collect traces, metrics, and logs. [00:29.360 --> 00:34.880] My name is Pavel, I'm a software engineer at Red Hat, and I'm a contributor [00:34.880 --> 00:38.360] and maintainer of the OpenTelemetry Operator and the Jaeger project. [00:38.360 --> 00:44.640] Yeah, my name is Bine, and I'm also working on the OpenTelemetry Operator and spend most [00:44.640 --> 00:47.920] of my time on OpenTelemetry. [00:47.920 --> 00:52.680] So, as I mentioned, on today's agenda there is the OpenTelemetry Operator. [00:52.680 --> 00:56.840] We will show how you can use it to deploy the collector, and how you can use it [00:56.840 --> 00:59.960] to instrument your workloads on Kubernetes. [00:59.960 --> 01:03.560] And after this brief introduction, we will walk you through three use cases: how you can use [01:03.560 --> 01:06.240] it to collect traces, metrics, and logs. [01:06.240 --> 01:12.040] However, I will start with the history of open source observability. [01:12.040 --> 01:16.640] I'm doing this because I believe that if we understand the history, maybe we will better [01:16.640 --> 01:22.280] understand where we as an industry are going. [01:22.280 --> 01:27.640] So on this slide, you essentially see a timeline of open source projects. [01:27.640 --> 01:31.080] It's divided into an upper and a bottom part. [01:31.080 --> 01:36.800] In the bottom, you see the open source projects or platforms that you can deploy, and they [01:36.800 --> 01:44.240] provide you with storage and visualization capabilities for the observability data. [01:44.240 --> 01:50.080] Most of them work with distributed traces; however, some of them, like Apache SkyWalking, [01:50.080 --> 01:56.160] Hypertrace, or SigNoz, are more like end-to-end platforms that can show traces, [01:56.160 --> 01:57.640] metrics, and logs. [01:57.640 --> 02:02.520] I would like to focus on the upper part, which shows the open source data collection [02:02.520 --> 02:05.360] frameworks. [02:05.360 --> 02:10.680] And what we see there, especially with OpenCensus and OpenTelemetry, is that it's [02:10.680 --> 02:16.480] becoming more important that these frameworks work with all the signals. [02:16.480 --> 02:23.600] For me, data collection, especially for tracing, started with the Zipkin project. [02:23.600 --> 02:28.440] It gave us a stable data model that we, as developers, could use to export traces into [02:28.440 --> 02:34.440] Zipkin, but also to many other platforms that adopted the Zipkin project. [02:34.440 --> 02:40.840] As developers, when we wanted to use Zipkin clients, because the ecosystem hosted client [02:40.840 --> 02:44.960] libraries as well, it was a bit problematic in polyglot environments, because those clients [02:44.960 --> 02:51.560] were using inconsistent APIs; there was no standardization. [02:51.560 --> 02:56.120] That problem was then partially solved with OpenTracing.
[02:56.120 --> 03:01.240] The scope of the project was a bit wider: there was a specification, a document [03:01.240 --> 03:08.680] that defines which data should be collected and how the API in those languages [03:08.680 --> 03:10.440] should look. [03:10.440 --> 03:14.920] This enabled us to build reusable instrumentation libraries. [03:14.920 --> 03:19.920] And then, even later, the OpenCensus project started with a slightly different approach. [03:19.920 --> 03:26.080] There was no specification and no API, but there was an SDK that everybody could use, [03:26.080 --> 03:28.640] and a collector. [03:28.640 --> 03:34.680] So with OpenTracing, the approach was that developers would use the API and then at [03:34.680 --> 03:38.200] build time provide the SDK from a vendor. [03:38.200 --> 03:43.400] With OpenCensus, everybody would use the SDK and then decide in the collector where [03:43.400 --> 03:45.520] the data should be sent. [03:45.520 --> 03:49.760] Those two projects were kind of competing, and finally they merged into OpenTelemetry [03:49.760 --> 03:51.840] in 2019. [03:51.840 --> 03:58.400] So OTel adopted all the pieces from OpenTracing and OpenCensus, but [03:58.400 --> 04:03.320] the biggest innovation in OTel, at least in my view, is the auto-instrumentation [04:03.320 --> 04:06.600] libraries, or the agents. [04:06.600 --> 04:11.000] Those agents are production ready, most of them, because they were donated by one of [04:11.000 --> 04:17.400] the observability vendors, so they are, you know, production tested. [04:17.400 --> 04:21.480] So to summarize what happened: we started with some instrumentation [04:21.480 --> 04:27.120] libraries, you know, with the Zipkin project; then, once we had some standardization, [04:27.120 --> 04:32.680] we could build reusable instrumentation libraries and create more sophisticated instrumentations [04:32.680 --> 04:34.280] for runtimes. [04:34.280 --> 04:39.800] And now we are in an age where we have agents or auto-instrumentation [04:39.800 --> 04:44.960] libraries available in open source that we can just grab, put into our platforms, and we will get telemetry data [04:44.960 --> 04:46.880] almost for free. [04:46.880 --> 04:50.640] And I think, you know, so where are we going? [04:50.640 --> 04:56.400] I think we are going into an era where we, as developers, won't have to care about [04:56.400 --> 04:58.400] how the telemetry is created for us. [04:58.400 --> 05:05.640] The instrumentation will maybe become a feature of the platform where we [05:05.640 --> 05:07.640] deploy the application. [05:07.640 --> 05:08.800] So this is one way to look at it. [05:08.800 --> 05:14.160] The other way might be that observability will shift left, and since we have this data, [05:14.160 --> 05:22.520] we will start utilizing it for other use cases, probably like testing and security. [05:22.520 --> 05:29.000] So with that, I would like to move to OpenTelemetry. It's obviously an open source [05:29.000 --> 05:33.800] project hosted in the Cloud Native Computing Foundation, and its main goal is to provide [05:33.800 --> 05:37.000] vendor-neutral telemetry data collection. [05:37.000 --> 05:43.480] It's the second most active project in the CNCF after Kubernetes, so it's quite large. [05:43.480 --> 05:46.560] And there are several independent components that we can use.
[05:46.560 --> 05:52.440] There is a specification that defines what data should be collected and how the API should [05:52.440 --> 05:58.440] look, and then obviously there is the implementation of the API, the SDK, and the [05:58.440 --> 06:02.880] standard data model called OTLP, or OpenTelemetry Protocol. [06:02.880 --> 06:08.320] These four pieces are meant to be used primarily by instrumentation authors or the people that [06:08.320 --> 06:11.560] work on observability systems. [06:11.560 --> 06:17.280] And the last two components, the auto-instrumentation or agent and the collector, are meant to be used [06:17.280 --> 06:22.600] by end users to roll out observability in their organization. [06:22.600 --> 06:26.480] To facilitate OpenTelemetry deployment on Kubernetes, there is a Helm chart and a Kubernetes [06:26.480 --> 06:31.000] operator. [06:31.000 --> 06:36.160] What I would like to stress is that OpenTelemetry is only about how we collect and create telemetry [06:36.160 --> 06:37.160] data. [06:37.160 --> 06:45.880] It's not a platform that you can deploy; it doesn't provide any storage or query APIs. [06:45.880 --> 06:49.560] So now let's go to the main part, the Kubernetes operator. [06:49.560 --> 06:56.360] The operator itself is a Golang application, it uses Kubebuilder and the Operator SDK, and it [06:56.360 --> 06:58.880] has three primary use cases. [06:58.880 --> 07:04.200] It can deploy the OpenTelemetry collector as a Deployment, DaemonSet, or StatefulSet. [07:04.200 --> 07:09.200] It can as well inject the collector as a sidecar into your workload. [07:09.200 --> 07:14.760] The second use case is that it can instrument your workloads running on Kubernetes by using [07:14.760 --> 07:19.400] those instrumentation libraries or agents from OpenTelemetry. [07:19.400 --> 07:23.680] And last but not least, it integrates with the Prometheus ecosystem. [07:23.680 --> 07:29.280] It can read the ServiceMonitors and PodMonitors, get the scrape targets, and split them across [07:29.280 --> 07:33.360] the collector instances that you have deployed. [07:33.360 --> 07:38.480] To enable this functionality, the operator provides two CRDs: one for the collector, which [07:38.480 --> 07:42.200] is used to deploy the collector and integrate with Prometheus, [07:42.200 --> 07:47.240] and the second one is the Instrumentation CRD, where you define how the applications [07:47.240 --> 07:49.920] should be instrumented. [07:49.920 --> 07:58.960] The operator itself can then be deployed through manifest files, a Helm chart, or OLM. [07:58.960 --> 08:01.880] So what we see here is a Kubernetes cluster. [08:01.880 --> 08:06.160] There are three workloads: pod one, pod two, and pod three. [08:06.160 --> 08:11.600] The first workload is instrumented with the OTel SDK directly, so when we were building [08:11.600 --> 08:18.440] this application, we pulled in the OTel dependency, compiled against it, and used those [08:18.440 --> 08:24.240] APIs directly in our business code and in the middlewares that we are using. [08:24.240 --> 08:30.360] The second pod is using the auto-instrumentation libraries that were injected by the operator [08:30.360 --> 08:34.520] through the mutating admission webhook. [08:34.520 --> 08:39.960] And the third pod is using Zipkin instrumentation and Prometheus instrumentation libraries, [08:39.960 --> 08:47.440] and it has the collector sidecar, injected as well by the operator.
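(Editor's sketch of the sidecar setup just described, assuming the operator's v1alpha1 API; the resource name and the logging exporter are illustrative, and field names may differ between operator releases.)

```yaml
# Collector CR in sidecar mode: the operator creates no standalone
# workload for it, but injects the container into annotated pods.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: sidecar
spec:
  mode: sidecar
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    exporters:
      logging:
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [logging]
```

```yaml
# Pod template metadata of the workload: this annotation asks the
# operator's webhook to inject the sidecar defined above.
metadata:
  annotations:
    sidecar.opentelemetry.io/inject: "true"
```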
[08:47.440 --> 08:55.840] So essentially the operator there reconciles three OpenTelemetry CRs: two for the collector [08:55.840 --> 08:58.120] and one Instrumentation. [08:58.120 --> 09:02.760] And then all these workloads send data to the deployed collector, probably a DaemonSet, [09:02.760 --> 09:08.440] and this collector then does some data normalization and finally sends the data [09:08.440 --> 09:17.600] into the platform of your choice, which can be Prometheus for metrics or Jaeger for traces. [09:17.600 --> 09:22.960] With that, I would like to move to the second part, explaining the CRDs in more detail. [09:22.960 --> 09:23.960] Yep. [09:23.960 --> 09:26.360] The microphone should work. [09:26.360 --> 09:32.960] Yeah, so with the CRDs, for today we wanted to show both of them, and we start with the [09:32.960 --> 09:34.460] collector one. [09:34.460 --> 09:40.200] The collector CRD is a bit loaded, so we picked a few things here which I would [09:40.200 --> 09:45.120] say are the most used or important. [09:45.120 --> 09:48.640] So as Pavel mentioned, there are different deployment modes, different use cases for [09:48.640 --> 09:56.640] the OpenTelemetry collector, and in the specification we can go to the mode and just specify it there. [09:56.640 --> 10:01.600] There's a handy thing, which is the sidecar; we will see it afterwards. [10:01.600 --> 10:06.480] And if we want to use it, we only go to the pod definition of our deployment and add [10:06.480 --> 10:10.280] the annotation we see on the top right. [10:10.280 --> 10:16.280] And if we go with the deployment mode or something like this, and we want to expose it for collecting [10:16.280 --> 10:21.080] metrics, logs, and traces from a different system, for example, we can use the Ingress [10:21.080 --> 10:27.480] type. We can configure a lot more there, like the annotations [10:27.480 --> 10:30.000] and your Ingress class. [10:30.000 --> 10:34.840] But yeah, mainly the operator takes care of everything, creating services, and is also able [10:34.840 --> 10:37.560] to balance your load there. [10:37.560 --> 10:43.920] And yeah, the last thing here is the image section, which is also important. [10:43.920 --> 10:48.960] The OpenTelemetry operator usually ships the core distribution of OpenTelemetry [10:48.960 --> 10:49.960] by default. [10:49.960 --> 10:55.640] So in OpenTelemetry, the collector is split into two repositories when you go and look [10:55.640 --> 10:57.640] at GitHub. [10:57.640 --> 11:04.240] In core, you will find OTLP, a logging exporter, so some basic stuff. [11:04.240 --> 11:07.280] And in contrib, you find basically everything. [11:07.280 --> 11:14.120] So if you want to send your traces to some proprietary vendor or to Jaeger, you probably [11:14.120 --> 11:16.520] need to look there. [11:16.520 --> 11:19.520] Okay, the next thing is the configuration. [11:19.520 --> 11:24.720] The configuration for the OpenTelemetry collector is provided here just like it's usually done for [11:24.720 --> 11:26.040] the collector itself. [11:26.040 --> 11:28.560] So it's passed directly through. [11:28.560 --> 11:30.600] It's split into three parts here. [11:30.600 --> 11:31.960] We see the receiving part there. [11:31.960 --> 11:34.800] We specify our OTLP receiver. [11:34.800 --> 11:37.640] Here it accepts gRPC on a specific port.
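(A sketch of the collector CR being walked through here, assuming the v1alpha1 API; the resource name, image tag, Ingress settings, and port are illustrative.)

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: deployment        # deployment | daemonset | statefulset | sidecar
  # image: otel/opentelemetry-collector-contrib:0.70.0  # switch to contrib when core is not enough
  ingress:
    type: ingress         # let the operator expose the receiver endpoints
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
    exporters:
      logging:
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging]
```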
[11:37.640 --> 11:43.400] It could also be that we specify a Prometheus receiver there, which is then scraping something. [11:43.400 --> 11:47.800] Then the optional part is basically the processing part. [11:47.800 --> 11:53.320] We might want to save some resources, so we batch our telemetry data. [11:53.320 --> 11:56.920] And yeah, there are other useful things. [11:56.920 --> 12:00.520] And in the exporter section, here we use the logging exporter, which is part of the [12:00.520 --> 12:04.280] core distribution, but you can configure whatever you like. [12:04.280 --> 12:08.760] You can also have multiple exporters for one resource. [12:08.760 --> 12:09.760] There is one thing. [12:09.760 --> 12:11.440] On the right side, we see the extensions. [12:11.440 --> 12:15.440] It didn't fit on the slide, so it's there in this box. [12:15.440 --> 12:19.920] This is used if you have, for example, an exporter which needs some additional [12:19.920 --> 12:20.920] headers. [12:20.920 --> 12:24.280] Yeah, if you want to set a bearer token or something else, [12:24.280 --> 12:26.200] you can do it there. [12:26.200 --> 12:30.120] And then finally, we go to the service section, where we have different pipelines for each [12:30.120 --> 12:31.600] signal. [12:31.600 --> 12:41.240] And there we can then wire up receivers, processors, and exporters in the way we want. [12:41.240 --> 12:46.560] So then there is another CRD, which is used for the auto-instrumentation. [12:46.560 --> 12:49.280] And it looks slightly different. [12:49.280 --> 12:53.080] So here, in the specification, we also have the exporter. [12:53.080 --> 13:01.320] And this exporter only exports OTLP, which means if we want to export to some, yeah, [13:01.320 --> 13:06.400] backend of our choice, we usually instrument our application directly and then forward the [13:06.400 --> 13:10.600] traces to a collector instance which is running next to it. [13:10.600 --> 13:15.000] And yeah, there we can use the power of those processors. [13:15.000 --> 13:20.840] Yeah, then we can configure some other useful things, like how the context is propagated [13:20.840 --> 13:22.760] and the sampling rate. [13:22.760 --> 13:25.080] And to use it, it's also quite easy. [13:25.080 --> 13:26.360] So we have our deployment. [13:26.360 --> 13:31.240] In this case, we can choose from the list of supported languages. [13:31.240 --> 13:37.880] We might use Java, and we only set this annotation on the pod level, and it will take care of [13:37.880 --> 13:44.280] adding the SDK and also setting and configuring the environment variables. [13:44.280 --> 13:51.640] If we use something like Rust, we can also use the inject-sdk annotation to configure [13:51.640 --> 13:57.840] then, yeah, just the destination, because the SDK should already be there. [13:57.840 --> 14:03.600] And if we have a setup where there is, let's say, some proxy in front, like Envoy, we can [14:03.600 --> 14:12.800] then just skip adding the auto-instrumentation there by only configuring the container [14:12.800 --> 14:16.360] names we want to instrument. [14:16.360 --> 14:20.440] And we will see this in a minute, a bit more in detail. [14:20.440 --> 14:23.120] So this is then basically what we would need to do. [14:23.120 --> 14:27.680] So we create this Instrumentation, we add this annotation; on the left [14:27.680 --> 14:31.160] we see the pod, there is our application. [14:31.160 --> 14:36.440] And in this gray box, you see what is added automatically.
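(Again an editor's sketch, assuming the v1alpha1 API; the resource name, collector endpoint, sampling argument, and container name are illustrative.)

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317  # OTLP only; typically a collector running next to the app
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
```

```yaml
# On the pod template of the workload:
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"       # inject the Java agent and env vars
    # instrumentation.opentelemetry.io/inject-sdk: "true"      # app already bundles an SDK (e.g. Rust)
    # instrumentation.opentelemetry.io/container-names: "app"  # skip sidecars such as Envoy
```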
[14:36.440 --> 14:42.000] And this is then forwarded, in this example, to a collector. [14:42.000 --> 14:43.640] And yeah, how does this work? [14:43.640 --> 14:50.240] So the operator, in the admission webhook, will add an init container. [14:50.240 --> 14:52.600] On the top left, we see how the container looks before. [14:52.600 --> 14:54.400] So there are no environment variables. [14:54.400 --> 14:58.480] It's just a plain application. [14:58.480 --> 15:03.320] And in the command section, there is then the copy, which copies the Java agent to our [15:03.320 --> 15:04.600] original container. [15:04.600 --> 15:07.000] And on the right side, we see the final result. [15:07.000 --> 15:13.960] We see the JAVA_TOOL_OPTIONS where the Java agent is loaded, and then we see all these environment [15:13.960 --> 15:18.600] variables to configure our SDK. [15:18.600 --> 15:24.080] And finally, what we have also seen in the presentation from Nicholas previously, we [15:24.080 --> 15:28.200] have here the Jaeger output. [15:28.200 --> 15:35.340] So we can see the resource attributes and all the beautiful stuff that comes with it. [15:35.340 --> 15:39.560] So next, we can have a look at metrics. [15:39.560 --> 15:43.360] So there is the OpenTelemetry SDK, [15:43.360 --> 15:48.840] if you want to go with OpenTelemetry metrics, but I assume a lot of people already have [15:48.840 --> 15:51.360] some Prometheus stuff in place. [15:51.360 --> 15:56.600] And the OpenTelemetry operator also helps us with this. [15:56.600 --> 16:02.280] So maybe we look first at the receiver part on the bottom. [16:02.280 --> 16:06.920] We see there that we configure the Prometheus receiver, which has a scrape configuration, [16:06.920 --> 16:10.240] and there we can, for example, add some static targets. [16:10.240 --> 16:16.720] So assume we add three different scrape endpoints there; then afterwards, if the target [16:16.720 --> 16:23.480] allocator is enabled, it will take these scrape targets and divide, well, spread [16:23.480 --> 16:29.440] these targets across our replicas, which are then responsible for getting the metrics. [16:29.440 --> 16:32.840] And yeah, that's basically how it works. [16:32.840 --> 16:38.800] There's also an option to enable Prometheus CRs, so we can then forward to this one. [16:38.800 --> 16:42.720] And the target allocator, which is an extra instance created by the OpenTelemetry [16:42.720 --> 16:49.160] operator, will then, yeah, get the targets from there. [16:49.160 --> 16:52.400] So we see this here in this graphic quite well. [16:52.400 --> 16:57.400] On the left side, we see pod one, which is using OpenTelemetry, and it's pushing the [16:57.400 --> 17:02.000] telemetry data it gets directly to a collector. [17:02.000 --> 17:08.640] And in this gray box, we see two instances instrumented [17:08.640 --> 17:13.840] with Prometheus, and collector one and collector two are pulling the information [17:13.840 --> 17:14.840] from there. [17:14.840 --> 17:17.160] So this is all managed by the operator. [17:17.160 --> 17:21.200] We have seen the replicas, which are basically collector one and collector two, and since [17:21.200 --> 17:25.640] we enabled the target allocator, they get the targets from there, which are then coming [17:25.640 --> 17:27.760] from the PodMonitor. [17:27.760 --> 17:32.440] And finally, we send the information somewhere.
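(A sketch of such a target-allocator setup, assuming the v1alpha1 API; the scrape targets, interval, and remote-write endpoint are illustrative, and the prometheusremotewrite exporter requires the contrib distribution.)

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-metrics
spec:
  mode: statefulset        # the target allocator works with statefulset mode
  replicas: 2              # "collector one" and "collector two"
  targetAllocator:
    enabled: true          # spread the scrape targets across the replicas
    prometheusCR:
      enabled: true        # also pick up ServiceMonitor/PodMonitor CRs
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: static-apps
              scrape_interval: 30s
              static_configs:
                - targets: ["app-1:8080", "app-2:8080", "app-3:8080"]
    exporters:
      prometheusremotewrite:
        endpoint: http://prometheus:9090/api/v1/write
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]
```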
[17:32.440 --> 17:36.680] So the last thing here, the last signal, is logs. [17:36.680 --> 17:38.680] So for logs, there are different options. [17:38.680 --> 17:46.800] The first one would be to use the OpenTelemetry SDK, which we might not want right [17:46.800 --> 17:50.920] now because we would need to do some work. But if we want to go ahead directly, there is the [17:50.920 --> 17:57.400] filelog receiver; we can configure it to get the information from disk, and yeah, it's [17:57.400 --> 18:02.040] available in the contrib repository, and we have different parsers there which help [18:02.040 --> 18:08.160] us to move the logs into the OTLP format. [18:08.160 --> 18:11.280] We will see in a minute how this looks. [18:11.280 --> 18:17.920] And there are other options if you want to integrate with Fluent Bit; there is a forward [18:17.920 --> 18:22.240] receiver, so you can use the collector as a kind of gateway then. [18:22.240 --> 18:29.120] And yeah, the only thing we need to do then is configure the collector as a DaemonSet. [18:29.120 --> 18:34.840] We need to mount our log files there, and the filelog receiver, for example, can then [18:34.840 --> 18:38.520] get all the logs. [18:38.520 --> 18:40.640] And how does this look at the end? [18:40.640 --> 18:47.640] So this is when we exported the logs to the logging output, so standard out. [18:47.640 --> 18:53.520] We see that we have the resource attributes, which are added automatically, and yeah, we [18:53.520 --> 18:58.560] see the log information, and at the bottom the trace ID and span ID, which are [18:58.560 --> 19:03.080] not given if we read it just from disk, but that's it. [19:03.080 --> 19:09.160] Yeah, then we are almost at the end. [19:09.160 --> 19:19.320] Yeah, thanks a lot for the interesting talk. [19:19.320 --> 19:23.200] Does anyone have questions? [19:23.200 --> 19:24.200] Any questions? [19:24.200 --> 19:26.200] Raise your hand. [19:26.200 --> 19:28.200] Question? [19:28.200 --> 19:36.080] Yeah, there we go. [19:36.080 --> 19:42.320] For the logging part, would you suggest replacing any kind of cluster logging, like with [19:42.320 --> 19:51.760] Fluent Bit sending it off to Loki or something, with OpenTelemetry [19:51.760 --> 19:59.280] log scraping, or is that complementary? [19:59.280 --> 20:00.880] I'm not sure if I fully got it. [20:00.880 --> 20:16.480] So you want to, yeah, in this case it's just another way, but the useful thing is, if you [20:16.480 --> 20:24.920] have the OpenTelemetry SDK, it will automatically add the trace ID to it, and then you [20:24.920 --> 20:31.240] can correlate your signals. [20:31.240 --> 20:35.240] Sorry. [20:35.240 --> 20:44.040] So I'm a super newbie to this, so I failed to understand whether OpenTelemetry [20:44.040 --> 20:50.960] is trying to replace, for example, log parsers like Telegraf, which [20:50.960 --> 20:58.880] is able to generate Prometheus metrics by log scraping, or also how Zipkin, which is the [20:58.880 --> 21:03.400] tracing thing, fits into the metric collection in this whole picture. [21:03.400 --> 21:10.920] So I'm trying to understand how you cobble together all these sources, and how OpenTelemetry [21:10.920 --> 21:17.000] either replaces all these technologies or makes it easier to use them together. [21:17.000 --> 21:21.840] Thank you.
[21:21.840 --> 21:29.880] So maybe on this slide you see that the third pod is using the Zipkin and Prometheus libraries, and [21:29.880 --> 21:37.320] the collector can receive data in the Zipkin format, it can scrape Prometheus metrics, then transform [21:37.320 --> 21:45.640] this data into OTLP, or keep Zipkin as well, and then send it to the other collector. [21:45.640 --> 21:50.280] So the collector essentially can receive data in multiple formats, transform it to the [21:50.280 --> 22:06.080] format of your choice, and then use that format to send it to other systems. [22:06.080 --> 22:08.560] Hello, and thanks for the talk. [22:08.560 --> 22:15.680] I'm just wondering, what's your strategy for filtering health check requests, for example, [22:15.680 --> 22:20.280] or the liveness probe requests that you get in the pod? [22:20.280 --> 22:24.720] Health checks, like to avoid generating traces for health checks? [22:24.720 --> 22:25.720] Sorry? [22:25.720 --> 22:28.880] To avoid generating traces for health check endpoints? [22:28.880 --> 22:29.880] Yeah. [22:29.880 --> 22:34.240] That's a very good question. [22:34.240 --> 22:42.720] So you could maybe configure the collector to drop the data, but I think the best way [22:42.720 --> 22:49.080] would be to tell the instrumentation to skip instrumenting those endpoints. [22:49.080 --> 22:56.000] To be honest, I'm not sure if this is implemented in the OTel agents, but I saw a lot of discussions [22:56.000 --> 23:05.320] around this problem, so probably there will be some solution. [23:05.320 --> 23:09.120] We have time for one last question, if there is any. [23:09.120 --> 23:10.120] No? [23:10.120 --> 23:11.120] Okay. [23:11.120 --> 23:12.320] Oh, no. [23:12.320 --> 23:28.360] And thanks a lot.
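(To make the answer about multi-format translation concrete, an editor's sketch of a collector configuration that accepts Zipkin spans and scrapes Prometheus metrics, then forwards everything as OTLP; the endpoints and names are illustrative.)

```yaml
receivers:
  zipkin: {}               # accepts spans in Zipkin format (default port 9411)
  prometheus:
    config:
      scrape_configs:
        - job_name: app
          static_configs:
            - targets: ["app:8080"]
exporters:
  otlp:
    endpoint: other-collector:4317   # e.g. the DaemonSet/gateway collector
    tls:
      insecure: true                 # plain gRPC, for the sketch only
service:
  pipelines:
    traces:
      receivers: [zipkin]
      exporters: [otlp]
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```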