[00:00.000 --> 00:15.000] Okay. Hello, everyone. Welcome to the talk on OpenTelemetry with Grafana. Microphone [00:15.000 --> 00:19.120] broke, so I need to do it with this microphone now. Let's see how it goes with typing and [00:19.120 --> 00:26.840] live demo. Few words about me, so who am I, why am I here talking about Grafana and [00:26.840 --> 00:32.720] OpenTelemetry, so I work at Grafana Labs, I'm an engineering manager, I'm also a manager [00:32.720 --> 00:37.840] for our OpenTelemetry squad, and I'm also active in open source, so I'm a member of [00:37.840 --> 00:44.020] the Prometheus team where I maintain the Java metrics library. So what are we going to do [00:44.020 --> 00:50.320] in this talk in the next 25 minutes or so, so it will almost exclusively be a live demo, [00:50.320 --> 00:55.480] so basically the idea is I have a little example application running on my laptop, and it is [00:55.480 --> 01:00.640] instrumented with OpenTelemetry, I will show you in a minute what it does and how I instrumented [01:00.640 --> 01:06.560] it, and I also have an open source monitoring backend running, right, it consists of three [01:06.560 --> 01:13.920] databases, one is Loki, which is an open source logs database, one is Tempo, which is an [01:13.920 --> 01:19.880] open source trace database, and one is Mimir, which is an open source metrics database, so [01:19.880 --> 01:25.680] Mimir is compatible with Prometheus, so I could have shown the exact same demo using [01:25.680 --> 01:30.920] Prometheus instead of Mimir, so it doesn't really matter for now. And of course I also [01:30.920 --> 01:35.840] have Grafana, I have those databases configured as data sources, and what we are going to [01:35.840 --> 01:40.200] do, we are going to start up Grafana, you know, have a look at metrics, have a look [01:40.200 --> 01:45.400] at traces, have a look at logs, and basically the idea is that at the end of the talk you [01:45.400 --> 01:50.000] kind of have seen all the signals that come out of OpenTelemetry, you know, explore a [01:50.000 --> 01:54.400] bit what you can do with this type of data, and so you should have a good overview of what [01:54.400 --> 02:01.400] open source monitoring with OpenTelemetry looks like, right? So last slide before we [02:01.400 --> 02:06.320] jump into the live demo, so this is just a quick overview of what the example application [02:06.320 --> 02:12.400] does so that you know what we are going to look at. It's a simple Hello World REST service [02:12.400 --> 02:19.600] written in Java using Spring Boot, and so basically you can send a request to port 8080 [02:19.600 --> 02:24.560] and it will respond with Hello World, and in order to make it a bit more interesting, [02:24.560 --> 02:28.920] I made it a distributed Hello World service, so it doesn't respond directly, but when it [02:28.920 --> 02:34.600] receives a request, it reaches out to a greeting service running on port 8081, the greeting [02:34.600 --> 02:39.080] service responds with the greeting, which is Hello World, and then the response is forwarded [02:39.080 --> 02:45.200] to the client, right? And there are random errors to have some error rates as well, so [02:45.200 --> 02:51.320] basically a Hello World microservice architecture or whatever, right?
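(For readers following along, a minimal sketch of what such a distributed Hello World setup could look like; the class names, endpoints and the roughly 10% error rate are illustrative assumptions, not the speaker's actual code, and in the real demo the two controllers run as two separate Spring Boot applications on ports 8080 and 8081.)

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

// Hello World service (port 8080): forwards each request to the greeting service.
@RestController
class HelloController {
    private final RestTemplate http = new RestTemplate();

    @GetMapping("/hello")
    public String hello() {
        // Outgoing call; the OpenTelemetry agent records this as a client span.
        return http.getForObject("http://localhost:8081/greeting", String.class);
    }
}

// Greeting service (port 8081): returns the greeting, failing randomly now and then.
@RestController
class GreetingController {
    @GetMapping("/greeting")
    public String greeting() throws IOException {
        // Roughly 10% of requests fail; the uncaught exception becomes an HTTP 500.
        if (ThreadLocalRandom.current().nextInt(10) == 0) {
            throw new IOException("simulated random error");
        }
        return "Hello, World!";
    }
}
```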
And in order to instrument [02:51.320 --> 02:57.880] this with OpenTelemetry, I use the Java instrumentation agent that's provided by the OpenTelemetry [02:57.880 --> 03:02.760] community, that's something you can download on GitHub, and the thing is, you [03:02.760 --> 03:07.680] basically attach it to the Java virtual machine at startup time with a special command line [03:07.680 --> 03:13.560] parameter, so I didn't modify any source code, I didn't use any SDK or introduce any custom [03:13.560 --> 03:19.680] stuff, all we are going to look at in this demo is just data produced by attaching the [03:19.680 --> 03:27.160] OpenTelemetry instrumentation to a standard Spring Boot application, right? Cool. So let's [03:27.160 --> 03:33.880] get started. As said, I have my data sources configured here, so Prometheus and Mimir are [03:33.880 --> 03:40.280] compatible, so it doesn't really matter which one we choose. There are a lot of, so I want [03:40.280 --> 03:54.480] to start with metrics, and yeah, so... Yeah? Can we turn the lights down a bit? I don't [03:54.480 --> 04:14.880] know. Okay. Maybe the other way around. Okay. I will just continue, come on. So there are [04:14.880 --> 04:20.760] lots of metrics that you get from the OpenTelemetry instrumentation, so kind of JVM-related [04:20.760 --> 04:27.960] stuff like garbage collection activity and so forth, but the one I want to look at, oh, [04:27.960 --> 04:46.400] no, it's getting brighter and brighter. Yeah. Okay. Great. I think there is also a light [04:46.400 --> 04:53.320] mode in Grafana. Maybe that would have been a better choice. But no, I'm not going to [04:53.320 --> 05:00.480] use light mode. So let's figure out how to do the demo while I have a microphone that [05:00.480 --> 05:14.480] I should hold in my hands. Let's just put it here. Okay. Thank you. Cool. So the metric [05:14.480 --> 05:20.000] that we are going to look at for the demo, it's a metric named HTTP server duration. [05:20.000 --> 05:24.880] This is a metric of type histogram. So histograms have a couple of different numbers attached [05:24.880 --> 05:30.920] to them, so there are histogram buckets with the distribution data and so forth, and there's [05:30.920 --> 05:37.840] also a count. The count is the simplest one, so we are going to use this in our example. [05:37.840 --> 05:41.800] I actually got it twice. I got it once for my greeting service here and once for [05:41.800 --> 05:49.840] the Hello World application. And if we are just, you know, running this query, maybe [05:49.840 --> 05:55.320] take a little bit of a shorter time window here, then we basically see two request counters, [05:55.320 --> 06:01.440] right? One is the green line, which is counting the requests resulting in HTTP status 200. [06:01.440 --> 06:05.960] So the successful requests, and basically we see that since I started the application [06:05.960 --> 06:11.520] on my laptop, I got a little more than 400 successful requests, and the yellow line [06:11.520 --> 06:18.400] is, you know, requests resulting in HTTP status 500, and we got around 50 of them, right? [06:18.400 --> 06:24.760] And obviously, raw counter values are not very useful, right?
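(Coming back to the setup for a moment: attaching the agent at JVM startup, as described at the start of this section, is just a command line flag plus a bit of configuration. A rough sketch follows, where the jar names and the OTLP endpoint are assumptions; the agent itself is the one published on the opentelemetry-java-instrumentation releases page on GitHub.)

```bash
# Attach the OpenTelemetry Java agent at JVM startup -- no source code changes needed.
# The service name and exporter endpoint below are example values, not from the talk.
export OTEL_SERVICE_NAME=greeting-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

java -javaagent:./opentelemetry-javaagent.jar -jar greeting-service.jar
```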
Nobody is interested in [06:24.760 --> 06:30.160] how often my service was called since I started the application, and the way, you know, metric [06:30.160 --> 06:36.000] monitoring works with Prometheus, as probably most of you know, is that you use the Prometheus [06:36.000 --> 06:42.280] query language to get some useful information out of that kind of data, right? And I guess [06:42.280 --> 06:47.200] most of you have run some Prometheus queries, but I'm still going to show maybe a couple [06:47.200 --> 06:53.320] of examples. So for those of you who are not very familiar with that, does this one work [06:53.320 --> 07:00.320] again? Hey, nice. It's even better. The lights work, the microphone works. Wow. Now let's [07:00.320 --> 07:07.760] hope the demo works. So I'm going to run just a couple of quick, you know, Prometheus queries [07:07.760 --> 07:11.320] so that, for those of you who are not very familiar with it, you get an idea [07:11.320 --> 07:16.960] of what it is, right? And the most important function in the Prometheus query language [07:16.960 --> 07:21.400] is called the rate function. And what the rate function does is, it takes a counter like [07:21.400 --> 07:26.720] this and a time interval like five minutes, and then it calculates a per-second rate, [07:26.720 --> 07:32.920] right? So based on a five-minute time interval, we now see that we have about 0.6 requests [07:32.920 --> 07:39.720] per second resulting in HTTP status 200, and we have about 0.1 requests per second resulting [07:39.720 --> 07:46.200] in HTTP status 500. And this is already quite some useful information, right? So typically [07:46.200 --> 07:52.040] you want to know the total load on your system, not by status code or something. So you basically [07:52.040 --> 07:58.240] want to sum these two values up, and obviously there's also a sum function to sum values [07:58.240 --> 08:03.000] up, and if you call that, you get the total load on your system, which is just one line [08:03.000 --> 08:09.680] now and it's just, you know, around 0.7 requests per second, right? And this is, yeah, this [08:09.680 --> 08:14.320] is basically how Prometheus queries work. If you're not familiar with the syntax, there's [08:14.320 --> 08:18.840] also kind of a graphical query builder where you can, you know, use a bit of drag and drop [08:18.840 --> 08:23.680] and get a bit more help and so forth, right? And so eventually, you know, when you've got [08:23.680 --> 08:28.480] your queries and got your metrics, what you want to do is create a metrics dashboard, [08:28.480 --> 08:34.160] and for monitoring HTTP services, there are a couple of best practices for what [08:34.160 --> 08:40.080] type of data you want to visualize on a dashboard for monitoring HTTP services. And the most [08:40.080 --> 08:47.720] simple and straightforward thing is to visualize three things. One is the request rate, so [08:47.720 --> 08:52.560] the current load on the system, which is exactly the query that we are seeing here. [08:52.560 --> 08:57.000] The next thing you want to see is the error rate, so the percentage of calls that fail. [08:57.000 --> 09:02.720] And the third thing is duration. How long does it take, right? And I created a simple [09:02.720 --> 09:08.040] example dashboard just to show you what this looks like. So I put the name of the service [09:08.040 --> 09:14.760] as a parameter up here so we can reuse the same dashboard for both services.
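(The queries from this part of the demo look roughly as follows in PromQL. The exact metric and label names are assumptions, since they depend on how the OpenTelemetry metrics are exported to Mimir or Prometheus; here the histogram is assumed to arrive as http_server_duration with a _count series.)

```promql
# Raw counter: requests since startup, one series per HTTP status code
http_server_duration_count

# Per-second request rate over a 5 minute window
rate(http_server_duration_count[5m])

# Total load on the service, summed across status codes
sum(rate(http_server_duration_count[5m]))
```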
Maybe let's [09:14.760 --> 09:19.600] use a 15-minute time window, so here I started the application. The first is the request [09:19.600 --> 09:24.520] rate, that's the exact same query that we just saw. Second thing here is the error rate, [09:24.520 --> 09:30.560] so we have about, I don't know, around 10% errors in my example application. And then [09:30.560 --> 09:35.720] for duration, there are a couple of different ways to visualize that. So what we see [09:35.720 --> 09:41.520] here is basically the raw histogram, right? The histogram buckets. And this representation [09:41.520 --> 09:46.560] is actually quite useful because it shows you the shape of the distribution. So what [09:46.560 --> 09:54.200] we see here is two spikes, one around 600 milliseconds and one around 1.8 seconds. And [09:54.200 --> 09:59.800] this is a typical shape that you would see if your application uses a cache, right? Because [09:59.800 --> 10:03.760] then you have a couple of requests that are answered quite quickly. Those are the cache [10:03.760 --> 10:09.760] hits. A couple of requests are slow; those are the cache misses. And visualizing the shape [10:09.760 --> 10:14.600] of the histogram helps you understand kind of the latency behavior of your application, [10:14.600 --> 10:20.840] right? The other and most popular way to visualize durations is this one here. These [10:20.840 --> 10:27.640] are percentiles. So the green line is the 95th percentile, so it tells us 95% of the [10:27.640 --> 10:33.280] calls have been faster than 1.7 seconds and 5% slower than that. The yellow line is the [10:33.280 --> 10:38.360] 50th, so half of the calls faster than that, half of the calls slower than that. And this [10:38.360 --> 10:43.480] doesn't really tell you the shape of the distribution, but it shows you the development over time, [10:43.480 --> 10:48.400] which is useful as well. So if your service becomes slower, those lines will go up, right? [10:48.400 --> 10:53.480] And it's also a good indicator if you want to do alerting and so forth. You can define [10:53.480 --> 10:57.920] a threshold and say, if it's above a certain threshold, I want to be notified [10:57.920 --> 11:03.720] and stuff like that. And there are other more, you know, experimental things like this heat [11:03.720 --> 11:09.320] map showing basically the development of histograms over time and stuff like that. So it's pretty [11:09.320 --> 11:14.440] cool to play with all the different visualizations in Grafana and, you know, see what you can [11:14.440 --> 11:21.920] get. So this is a, you know, quick example of a so-called RED dashboard: Request rate, [11:21.920 --> 11:27.880] Error rate, Duration, based on OpenTelemetry data. And the cool thing about [11:27.880 --> 11:34.400] it is that all that we are seeing here is just based on that single histogram [11:34.400 --> 11:41.720] metric HTTP server duration. And the fact that this metric is there is not a coincidence. [11:41.720 --> 11:46.760] The metric HTTP server duration is actually defined in the OpenTelemetry standard as [11:46.760 --> 11:55.000] part of the semantic conventions for HTTP services. So whenever you monitor an HTTP server with [11:55.000 --> 12:01.440] OpenTelemetry, then you will find a histogram named HTTP server duration. It will have the [12:01.440 --> 12:07.800] HTTP status as an attribute. It will contain the latencies in milliseconds. That's all [12:07.800 --> 12:13.800] part of the standard.
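(Hedged sketches of the other two RED queries, again assuming the metric name above and an http_status_code label carrying the status attribute from the semantic conventions.)

```promql
# Error rate: fraction of requests ending in HTTP status 500
  sum(rate(http_server_duration_count{http_status_code="500"}[5m]))
/
  sum(rate(http_server_duration_count[5m]))

# 95th percentile latency, computed from the histogram buckets
histogram_quantile(0.95, sum by (le) (rate(http_server_duration_bucket[5m])))
```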
So it doesn't matter what programming language your service uses, [12:13.800 --> 12:19.000] what framework, whatever. If it's being monitored with OpenTelemetry and it's compatible, you [12:19.000 --> 12:23.720] will find that metric and you can create a similar dashboard. And this is kind [12:23.720 --> 12:29.000] of one of the things that make application monitoring with OpenTelemetry a lot easier [12:29.000 --> 12:37.920] than it used to be before this standardization. Cool. So that was a quick look at metrics, [12:37.920 --> 12:42.660] but of course we want to look at the other signals as well. So let's switch data sources [12:42.660 --> 12:52.040] for now and have a look at traces. So tracing, again, there's a kind of search, like graphical [12:52.040 --> 12:57.000] search where you can create your search criteria with drag and drop. There's a relatively new [12:57.000 --> 13:03.000] feature which is a query language for traces. So I'm going to use that for now. And one [13:03.000 --> 13:08.840] thing you can do is to just search by labels. So I can, for example, say I'm interested [13:08.840 --> 13:17.080] in the service name greeting service and then I could basically just open a random trace [13:17.080 --> 13:25.560] here. Let's take this as an example. Can I, I need to zoom out a little bit to be able [13:25.560 --> 13:33.440] to close the search window here. Okay. So this is what a distributed trace looks like. [13:33.440 --> 13:38.160] And if you see it for the first time, it might be a bit hard to understand, but it's actually [13:38.160 --> 13:42.840] fairly easy. So you just need like two minutes of introduction and then you will understand [13:42.840 --> 13:48.640] traces forever. And to give you that introduction, I actually have one more slide. So just to [13:48.640 --> 13:55.080] help you understand what we are seeing here. And the thing is distributed traces consist [13:55.080 --> 13:59.880] of spans, right? And spans are time spans. So a span is something that has a point in [13:59.880 --> 14:05.440] time where it starts and a point in time where it ends, right? And in OpenTelemetry, there [14:05.440 --> 14:11.600] are three different kinds of spans. One is server spans. The second is internal spans [14:11.600 --> 14:18.000] and the third is client spans. Okay. So what happens when my Hello World application receives [14:18.000 --> 14:23.200] a request? So the first thing that happens when a server receives a request is that a server span [14:23.200 --> 14:28.800] is created. So that's the first line here. It's started as soon as the request is received. [14:28.800 --> 14:36.240] It remains open until the request is responded to, right? Then I said in the introduction that [14:36.240 --> 14:41.800] I used Spring Boot for implementing the example application. And the way Spring Boot works [14:41.800 --> 14:46.480] is that it takes the request and passes it to the corresponding Spring controller that [14:46.480 --> 14:53.240] would handle the request. And OpenTelemetry's Java instrumentation agent is nice for Java [14:53.240 --> 14:57.920] developers because it just creates internal spans for each Spring controller that is involved, [14:57.920 --> 15:02.600] right? And that is the second line that we are seeing here.
It's basically opened as [15:02.600 --> 15:07.520] soon as the Spring controller takes over and remains open until the Spring controller [15:07.520 --> 15:13.120] is done handling the request, which might not seem too useful if I have just a single [15:13.120 --> 15:18.040] Spring controller anyway, but if you have kind of a larger, you know, application and [15:18.040 --> 15:22.200] if you have multiple controllers involved, it gives you quite some interesting insights [15:22.200 --> 15:26.440] into what's happening inside your application. Like you would see immediately, like which [15:26.440 --> 15:32.000] controller do I spend most time in and so forth, right? And then eventually my Hello [15:32.000 --> 15:37.440] World application reaches out to the greeting service and outgoing requests are represented [15:37.440 --> 15:43.360] by client spans. So the client span is basically opened as soon as my HTTP request goes out [15:43.360 --> 15:47.600] and remains open until the response is received. And then in the greeting service, the same [15:47.600 --> 15:52.800] thing starts again, you know, request is received, which creates a server span and then I have [15:52.800 --> 15:57.400] a Spring controller as well, which is an internal span and that's the end of my distributed [15:57.400 --> 16:02.680] application here. And this is exactly what we are seeing here. And each of those span [16:02.680 --> 16:08.120] types has corresponding metadata attached to it. So if you look at one of the internal [16:08.120 --> 16:12.560] spans here, we see the name of the Spring controller and the name of the controller [16:12.560 --> 16:19.120] method and a couple of JVM-related attributes, whatever. And if we look at an HTTP span, [16:19.120 --> 16:24.360] for example, we see, of course, HTTP attributes like the status code, method and so forth, [16:24.360 --> 16:32.320] right? So of course, you do not want to just look at random spans. So usually you're looking [16:32.320 --> 16:37.560] for something. There are standard attributes in OpenTelemetry that you can use for searching. [16:37.560 --> 16:44.760] So we already had the service name greeting service, for example. But the most important [16:44.760 --> 16:53.400] or one of the most important attributes is http.status, no, http.status_code, this one here. [16:53.400 --> 16:59.240] And if we, for example, search for spans with HTTP status code 500, then we should find [16:59.240 --> 17:05.040] an example of a request that failed. So let's close the search window again. Yes, that's [17:05.040 --> 17:10.300] an example of a failed request. You see it indicated by those red exclamation [17:10.300 --> 17:15.960] marks at the bottom here. So this is where the thing failed, right? So the root cause [17:15.960 --> 17:21.520] of the error is the internal span, something in my Spring controller in the greeting service. [17:21.520 --> 17:27.120] If I look at the metadata attached to that, I actually see that the instrumentation attached [17:27.120 --> 17:32.920] the event that caused the error, and this even includes the stack trace. So you can [17:32.920 --> 17:38.280] basically immediately navigate to the exact line of code that is the root cause of this [17:38.280 --> 17:43.880] error, right? And this is quite cool.
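(The trace searches from this part of the demo, written as TraceQL, the query language for traces mentioned earlier. The service name value is an assumption; the attribute names follow the standard attributes discussed above.)

```traceql
# All spans coming from the greeting service
{ resource.service.name = "greeting-service" }

# Spans of failed requests, via the standard http.status_code attribute
{ span.http.status_code = 500 }

# Both combined: failed requests in the greeting service
{ resource.service.name = "greeting-service" && span.http.status_code = 500 }
```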
So if you have a distributed application and you [17:43.880 --> 17:50.160] get an unexpected response from your Hello World application, without distributed tracing, [17:50.160 --> 17:55.640] it's pretty hard to find out that there's actually an exception in the greeting service that, [17:55.640 --> 18:00.840] you know, propagated through your distributed landscape and then eventually caused the unexpected [18:00.840 --> 18:06.680] response. And with distributed tracing, finding these kinds of things becomes pretty easy because [18:06.680 --> 18:12.440] you get all the related calls grouped together, you get the failed ones marked with an exclamation [18:12.440 --> 18:18.920] mark, and you can pretty easily navigate to the root cause of your error, okay? [18:18.920 --> 18:25.120] Cool. So that was a quick look at traces. There are a lot of interesting things about [18:25.120 --> 18:30.240] tracing. Maybe one thing I would like to show you, because I find it particularly cool, [18:30.240 --> 18:36.320] so if you have all your services instrumented with tracing in your backend, then basically [18:36.320 --> 18:41.800] those traces give you metadata about all the network calls happening in your system, [18:41.800 --> 18:46.680] and you can do something with that type of data, right? So for example, you can calculate [18:46.680 --> 18:53.360] something that we call the service graph. So it looks like this. It's maybe not too impressive [18:53.360 --> 18:58.160] if you just have two services calling each other, right? So, but if you imagine, you [18:58.160 --> 19:02.600] know, a larger one, you know, dozens or hundreds of services, it will generate a map of [19:02.600 --> 19:08.280] all the services and indicate which service calls which other service, and this is quite [19:08.280 --> 19:13.440] useful. For example, if you intend to deploy a breaking change in your greeting service [19:13.440 --> 19:17.680] and you want to know who's using the greeting service, what would I break? Then looking [19:17.680 --> 19:22.520] at the service graph, you basically get this information right away. Traditionally, if [19:22.520 --> 19:27.120] you don't have that, you basically have a PDF with your architecture diagram, and then [19:27.120 --> 19:32.960] you look it up there, and also traditionally, there's at least one team that deployed something [19:32.960 --> 19:37.120] and forgot to update the diagram, and then you miss that, and with the service graph [19:37.120 --> 19:41.640] that won't happen, right? This is the actual truth. This is based on what's actually happening [19:41.640 --> 19:46.880] in your backend, and this is pretty useful in these situations, right? And you can do [19:46.880 --> 19:52.520] other things as well, like, you know, have some statistics like the most frequently called [19:52.520 --> 19:59.280] endpoint or the endpoint with the most errors and stuff like that. So, that was a [19:59.280 --> 20:05.000] quick look at traces. So we covered metrics, we covered traces. One thing I want to show [20:05.000 --> 20:10.880] you is that metrics and traces are actually related to each other, right? And so in order [20:10.880 --> 20:16.520] to show that, I'm going to go back to our dashboard, because, let's take a 15 [20:16.520 --> 20:22.200] minute window, then we get a bit more examples. So if you look at the latency data here, [20:22.200 --> 20:27.000] you notice these little green dots.
These are called exemplars, and this is something [20:27.000 --> 20:32.920] that's provided by the OpenTelemetry auto-instrumentation. So whenever it generates [20:32.920 --> 20:40.480] latency data, it basically attaches trace IDs of example traces to the latency data, [20:40.480 --> 20:44.840] and this is visualized by these little green dots, right? And so you see some examples [20:44.840 --> 20:49.440] of particularly fast calls, some examples of particularly slow calls and so forth. [20:49.440 --> 20:54.720] And if you, for example, take this dot up here, which is kind of slower than anything [20:54.720 --> 20:59.880] else, it's almost two seconds, right? Then you have the trace ID here, and you can navigate [20:59.880 --> 21:05.200] to Tempo and have a look at the trace and start figuring out why I had an example [21:05.200 --> 21:11.000] of such a slow call in my system, right? And in that case, you would immediately see that [21:11.000 --> 21:15.720] most of the time is spent in the greeting service. So if you're looking for the performance bottleneck, [21:15.720 --> 21:22.400] then this is the most likely thing. Yeah, four minutes, that's fine. Cool. So if I have [21:22.400 --> 21:29.000] four minutes, it's high time to jump to logs, the third signal that we didn't look at yet. [21:29.000 --> 21:37.680] So let's select Loki, our open source logs database, as a data source. So again, there's [21:37.680 --> 21:42.880] a query language, there's a graphical query builder and so forth. So let's just open random [21:42.880 --> 21:49.440] logs coming from the greeting service. It looks a bit like this. So it's even, I don't [21:49.440 --> 21:53.800] know, I didn't even log anything explicitly. I just turned on some, whatever, Spring request [21:53.800 --> 21:58.520] logging so that I get some log data. And from time to time, I throw an exception, which [21:58.520 --> 22:03.880] is an IOException, to simulate these errors. Looks a bit broken, but that's just because [22:03.880 --> 22:10.560] of the resolution that I have here. Yeah, so what you can do, of course, you can do some [22:10.560 --> 22:19.120] full text search, for example, I can say I'm interested in these IOExceptions. And then [22:19.120 --> 22:28.800] you would basically get, well, if you spell it correctly, like that, then you would get [22:28.800 --> 22:33.080] the list of all IOExceptions, which in my case are just the random errors I'm throwing [22:33.080 --> 22:37.440] here. And this query language is actually quite powerful. So you can, this is kind of [22:37.440 --> 22:41.600] filtering by a label and filtering by full text search, but you can do totally different [22:41.600 --> 22:46.400] things as well. For example, you can have queries that, you know, derive metrics based [22:46.400 --> 22:52.280] on log data. There's a function pretty similar to what we have seen in the metrics demo, [22:52.280 --> 22:58.640] which is called the rate function. So the rate function, again, takes a time interval [22:58.640 --> 23:03.680] and then calculates the per-second increase rate. So it basically tells you that we have [23:03.680 --> 23:11.280] almost 0.1 of these IOExceptions per second in our log data, which is also kind of useful [23:11.280 --> 23:18.760] information to have. And the last thing to show you, because it's particularly interesting, [23:18.760 --> 23:25.960] is that these logs and traces and metrics are, again, not independent of each other.
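(The log queries from this part of the demo, sketched in LogQL. The label selector is an assumption; which labels exist depends on how the logs are shipped into Loki.)

```logql
# Full text search: greeting service logs that mention an IOException
{service_name="greeting-service"} |= "IOException"

# Per-second rate of such log lines over a 5 minute window
rate({service_name="greeting-service"} |= "IOException" [5m])
```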
[23:25.960 --> 23:31.160] They are related to each other. And so if we look at an example here, let's just open [23:31.160 --> 23:37.440] a random log line. So what we see here, there's a trace ID. And this is interesting. So how [23:37.440 --> 23:44.680] does a trace ID end up in my log line? So this is actually also a feature of the Java [23:44.680 --> 23:50.240] instrumentation that's provided by the OpenTelemetry community. So the way logging in general works [23:50.240 --> 23:56.720] in Java is that there's a global thing with key-value pairs called the log context. And [23:56.720 --> 24:01.120] applications can put arbitrary key-value pairs into that context. And when you configure [24:01.120 --> 24:06.600] your log format, you can define which of those values you want to include in your log data. [24:06.600 --> 24:12.560] And if you have this OpenTelemetry agent attached, then as soon as a log line is written in the [24:12.560 --> 24:17.800] context of serving an HTTP request, then the corresponding trace ID is put into that log [24:17.800 --> 24:23.200] context. And you can configure your log format to include the trace ID in your log data. [24:23.200 --> 24:28.040] And that's what I did. And so each of my log lines actually has a trace ID. And so if I [24:28.040 --> 24:32.560] see something fancy and I want to know if maybe somewhere down my distributed stack something [24:32.560 --> 24:38.800] went wrong, I can just query that in Tempo, navigate to the corresponding trace, close [24:38.800 --> 24:43.920] that here, yeah, and then basically maybe get some information about what happened. And then [24:43.920 --> 24:48.280] the same navigation works the other way around as well. So of course, there's a little, you [24:48.280 --> 24:53.680] know, log button here. So if I see something fancy going on in my greeting service thing [24:53.680 --> 24:59.960] here, and maybe the logs have more information, I can click on that, navigate to the logs. [24:59.960 --> 25:05.080] And then it basically just generates a query, right? I click on the greeting service with [25:05.080 --> 25:10.000] that trace ID. So it's basically just a full text search for that trace ID. And so I will [25:10.000 --> 25:15.280] find all my corresponding log lines. In that case, just one line. But if you have a bit [25:15.280 --> 25:20.360] better logging, then maybe it would give you some indication of what happened there. Okay. [25:20.360 --> 25:27.800] So that was a very quick 25-minute overview of, you know, looking a bit into metrics, [25:27.800 --> 25:33.280] looking a bit into tracing, looking a bit into logs. I hope it gave you some impression, [25:33.280 --> 25:38.680] you know, of what the type of data that you get out of OpenTelemetry looks like. All [25:38.680 --> 25:43.360] of what we did was really, you know, without even modifying the application. I didn't, [25:43.360 --> 25:48.400] you know, even start with custom metrics, custom traces and so forth. But it's already [25:48.400 --> 25:53.680] quite some useful data that we get out of that. If you like the demo, if you want to [25:53.680 --> 25:58.840] explore it a bit more, want to try it at home, I pushed it to my GitHub and there's [25:58.840 --> 26:06.560] a README telling you how to run it. So you can do that. And yeah, next up, we have a [26:06.560 --> 26:12.040] talk that goes a bit more into detail on the tracing part of this.
And then after that, [26:12.040 --> 26:17.920] we have a talk that goes a bit more into detail on how to run OpenTelemetry in Kubernetes. So [26:17.920 --> 26:28.040] stay here and thanks for listening. Please remain seated during Q&A. Otherwise, we can't [26:28.040 --> 26:36.600] do a real Q&A. So please remain seated. Are there any questions? Yes. [26:36.600 --> 26:49.120] Hi. Thank you for this. One quick question. You mentioned you just need to add some parameters [26:49.120 --> 26:56.400] to the Java virtual machine to run the telemetry. What happens to my application if, for example, [26:56.400 --> 27:02.560] the telemetry backend is down? Is my application failing or impacted in any way? [27:02.560 --> 27:08.240] If the monitoring backend is down. Yes. Say the monitoring is down, but I started my application [27:08.240 --> 27:15.280] with these parameters. Is it impacting the application? No. I mean, you won't see metrics, [27:15.280 --> 27:19.080] of course, if your monitoring backend is down, but the application would just continue [27:19.080 --> 27:26.360] running. So typically, in like production setups, the applications wouldn't send telemetry [27:26.360 --> 27:31.160] data directly to the monitoring backend. But what you usually have is something in [27:31.160 --> 27:36.440] the middle. There are alternatives. There's the Grafana Agent that you can use for that. [27:36.440 --> 27:39.960] There's the OpenTelemetry Collector that you can use for that. And it's basically a [27:39.960 --> 27:46.480] thing that runs close to the application, takes the telemetry data off the application [27:46.480 --> 27:52.520] very quickly, and then, you know, can buffer stuff and process stuff and send it over to [27:52.520 --> 27:56.920] the monitoring backend. And that's used for decoupling that a little bit, right? And if [27:56.920 --> 28:02.200] you have such an architecture, the application shouldn't be affected at all by that. [28:02.200 --> 28:05.640] Two more. Two more. [28:05.640 --> 28:14.200] So I really like being able to link from your metrics to traces. But what I'm actually [28:14.200 --> 28:20.040] really curious to be able to do, and as far as I know, doesn't exist, or I guess that's [28:20.040 --> 28:24.080] my question, is like, is there any thought towards doing this, is being able to go the [28:24.080 --> 28:32.520] other direction, where what I'd like to be able to answer is, here's all my trace data, [28:32.520 --> 28:37.200] and this node of the trace incremented these counters by this much. So I could ask things [28:37.200 --> 28:45.280] like how much network IO or disk IOPS did this complete request do, and where in the [28:45.280 --> 28:47.320] tree would that occur? [28:47.320 --> 28:57.800] Yeah, that's a good question. I mean, linking from traces to metrics, it's not so straightforward, [28:57.800 --> 29:03.480] because I think the thing you can do to relate this is to use the service name. So if you [29:03.480 --> 29:08.880] have the service name as part of the resource attributes of the metrics, and consistently [29:08.880 --> 29:13.480] you have the same service name in your trace data, then you can at least, you know, navigate [29:13.480 --> 29:19.000] to all metrics coming from the same service. Maybe you have some [29:19.000 --> 29:24.400] more, you know, related attributes, like, whatever, instance ID and so forth. But it's [29:24.400 --> 29:26.960] not like really a one-to-one relationship, so.
[29:26.960 --> 29:32.640] That specific. For what request, how much of the IOPS came from this request? [29:32.640 --> 29:36.960] Yeah, no, I don't think that's possible. [29:36.960 --> 29:43.680] So in this example, you've shown that the Grafana world and Prometheus work great with server-side [29:43.680 --> 29:52.040] applications. Have you had examples of client-side applications, mobile or desktop applications, that [29:52.040 --> 30:00.640] use Prometheus metrics and then ship their metrics and traces to the metrics [30:00.640 --> 30:01.640] backend? [30:01.640 --> 30:07.960] Did I hear it correctly? You're asking about starting your traces on the client side, in [30:07.960 --> 30:10.320] the web browser and stuff? [30:10.320 --> 30:15.480] You have tracing on the server side, but what about having traces and metrics on the client side [30:15.480 --> 30:20.480] and, for example, for an embedded or mobile application so that you could actually see [30:20.480 --> 30:27.360] the trace from when the customer clicked a thing and see the full customer journey? [30:27.360 --> 30:32.680] Yeah, that's a great question. That's actually an area where there's currently a lot of research [30:32.680 --> 30:38.360] and new projects and so forth. So there is a group called real-user monitoring, RUM, [30:38.360 --> 30:45.240] in OpenTelemetry that deals with client-side applications. There's also a project by Grafana. [30:45.240 --> 30:50.840] It's called Faro. It's kind of, you know, JavaScript that you can include in your front [30:50.840 --> 30:57.320] end, in your HTML page, and then it gives you traces and metrics [30:57.320 --> 31:04.280] coming from the web browser. And this is currently a pretty active area, so lots of, you know, [31:04.280 --> 31:05.280] movement there. [31:05.280 --> 31:11.560] And so there are things to explore. So if you like, check out Faro. It's a nice new project, [31:11.560 --> 31:18.320] and standardization is also currently being discussed, but it's newer than the rest of [31:18.320 --> 31:23.320] what I showed you, right? So there's no, you know, clear standard yet [31:23.320 --> 31:25.320] and nothing decided yet. [31:25.320 --> 31:28.320] Cool. Okay. Thanks, everyone, again.