[00:00.000 --> 00:12.000] Hi, everybody. Thanks for being here for this talk. That's a lot of people. I'm Nicolas [00:12.000 --> 00:18.160] Frankel. I've been a developer for a long time, and I would like to ask how many of [00:18.160 --> 00:28.400] you are developers in this room? Quite a lot. Who are ops? Just as many, and who are DevOps, [00:28.400 --> 00:40.040] whatever you mean by it. So this talk is actually intended for developers, because I was, or [00:40.040 --> 00:47.920] I still think I am, a developer. So if you are an ops person, and this talk is not [00:47.920 --> 00:53.000] that super interesting for you, at least you can direct your developer colleagues to the talk, so [00:53.000 --> 01:03.120] that they can understand how they can ease your work. Well, perhaps you've never seen [01:03.120 --> 01:11.760] that, but I'm old, or experienced, depending on how you see it. And when I was starting [01:11.760 --> 01:17.160] my career, monitoring was a bunch of people sitting in front of screens the whole [01:17.160 --> 01:23.600] day. And actually, I was lucky. Once, in the south of France, I was told, hey, this is [01:23.600 --> 01:30.400] the biggest monitoring site in all of France. And actually, it really looked like this. [01:30.400 --> 01:37.000] And of course, there were people watching it. And that was the easy way. Now, I hope that [01:37.000 --> 01:45.880] you don't have that anymore, that it has become a bit more modern. Actually, there is a lot [01:45.880 --> 01:54.040] of talk now about microservices, right? Who here is doing microservices? Yeah. Yeah, because [01:54.040 --> 02:00.000] if you don't do microservices, you are not a real developer. But even if you don't do [02:00.000 --> 02:05.040] microservices, so you are not a real developer, and I encourage you not to be a real developer, [02:05.040 --> 02:11.200] in that case, you are probably doing some kind of distributed work. It has become increasingly [02:11.200 --> 02:18.200] difficult to just handle everything locally. And the problem becomes: if something [02:18.200 --> 02:24.560] bad happens, how can you locate where it went wrong? Or even if something works as expected, how [02:24.560 --> 02:34.600] can you understand the flow of your request across the network? I love Wikipedia. And [02:34.600 --> 02:43.280] here is the observability definition from Wikipedia, which is long and, in this case, not that interesting. [02:43.280 --> 02:55.800] So I have a better one afterwards for tracing. So basically, tracing helps you understand [02:55.800 --> 03:04.800] the flow of a business request across all your components. Fabian, where is Fabian? [03:04.800 --> 03:10.240] Fabian is here, so he talked a lot about the metrics and the logging. So in this talk, [03:10.240 --> 03:19.000] I will really focus on tracing, because my opinion is that, well, metrics are easy. We [03:19.000 --> 03:24.240] have been doing metrics for ages: we take the CPU, the memory, whatever. Now we are trying to [03:24.240 --> 03:32.280] get more business-related metrics, but it's still the same concept. Logging also. Now [03:32.280 --> 03:40.080] we do aggregated logging. Again, nothing mind-blowing. Tracing is, I think, the hardest part. [03:40.080 --> 03:48.320] So in the past, there were already some tracing pioneers. Perhaps you've used some of them. [03:48.320 --> 03:55.200] And well, now we are at the stage where we want to have something more standardized. [03:55.200 --> 04:10.800] So it starts with the Trace Context specification from the W3C.
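[Editor's note: to make the Trace Context idea concrete, here is a minimal sketch. The header value is the example from the W3C specification, not output from the demo: a request carries a `traceparent` header, the trace id stays constant for the whole business request, and each service forwards its own span id as the parent for downstream calls.]

```python
# Hedged illustration of the W3C Trace Context "traceparent" header.
# The value below is the sample from the specification, not demo output.
header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"

version, trace_id, parent_span_id, trace_flags = header.split("-")
# trace_id stays the same across all services handling the request;
# each service creates its own span and forwards the header with
# its span id in the parent-id position.
print(trace_id, parent_span_id, trace_flags)
```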
And the idea is that you start a trace [04:10.800 --> 04:19.360] and then other components will get the trace and will append their own spans to it. So [04:19.360 --> 04:28.040] it works very well in a web context. And it defines two important concepts that Fabian [04:28.040 --> 04:44.000] already described. So now I am done. So I have the same stupid stuff. So here you [04:44.000 --> 04:48.920] have, oh, sorry. Yes. It reminds me of a story. I did the same to my colleagues. They [04:48.920 --> 04:59.280] didn't care about the presentation. They only remember that. Okay. So here you have a trace [04:59.280 --> 05:06.720] and here you have the different spans. So here the X1 is the parent one. And then the [05:06.720 --> 05:15.400] Y1 and the Z1 will take this X span as their parent span. And so this is a single trace. [05:15.400 --> 05:21.480] This is a single request across your services. Web stuff is good, but it's definitely not [05:21.480 --> 05:31.160] enough. And so for that we have OpenTelemetry. OpenTelemetry is just a big bag of [05:31.160 --> 05:40.840] miracles all set into a specific project. So it's basically APIs, SDKs, tools, whatever, [05:40.840 --> 05:51.600] under the OpenTelemetry label. It implements the W3C Trace Context. If you have been doing [05:51.600 --> 05:57.080] some kind of tracing before, you might know it, because it's the merging of OpenTracing [05:57.080 --> 06:03.280] and OpenCensus. The good thing is that it's a CNCF project. So basically there is some hope that it will [06:03.280 --> 06:11.080] last for a couple of years. The architecture is pretty simple. Basically you've got sources, [06:11.080 --> 06:17.240] you've got the OpenTelemetry protocol, and as Fabian mentioned, you dump everything into [06:17.240 --> 06:25.280] a collector. The collector should be as close as possible to your sources. And then some [06:25.280 --> 06:31.840] tools are able to read the data from it and to display it in the way that we expect [06:31.840 --> 06:44.480] to see it. What happens after the OpenTelemetry collector is not a problem of OpenTelemetry. [06:44.480 --> 06:50.400] There are just collectors that are compatible, and for example you can use Jaeger or Zipkin [06:50.400 --> 06:57.760] in a way that allows you to dump your data, your OpenTelemetry data, into Jaeger or Zipkin [06:57.760 --> 07:02.200] in the OpenTelemetry format. So you can reuse, and that is very important, you can [07:02.200 --> 07:07.160] reuse your infrastructure if you're already using those tools, while just switching to Open [07:07.160 --> 07:12.880] Telemetry. And then you are using a standard, and you can switch your Open [07:12.880 --> 07:24.080] Telemetry backend with fewer issues. Now comes the fun developer part. If you are a developer, [07:24.080 --> 07:33.600] you probably are lazy. I know, I'm a developer. So the idea is that OpenTelemetry should make [07:33.600 --> 07:42.680] your life as a developer as easy as possible while helping your ops colleagues diagnose [07:42.680 --> 07:51.960] your problems. And the easiest path is auto-instrumentation. Auto-instrumentation [07:51.960 --> 07:58.040] is only possible in cases where you have a platform, where you have a runtime. Fabian [07:58.040 --> 08:05.760] mentioned Java; Java has a runtime, which is the JVM. Python has a runtime. Now if [08:05.760 --> 08:16.040] you have Rust, it's not as easy. So in that case, you are stuck.
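[Editor's note: as a rough sketch of that architecture, this is what pointing an application at a collector looks like with the OpenTelemetry Python SDK. The service name and the endpoint (a collector, or Jaeger's OTLP port) are illustrative assumptions, not taken from the demo repository.]

```python
# Minimal sketch: export spans over OTLP to a collector assumed to listen on 4317.
# Requires the opentelemetry-sdk and opentelemetry-exporter-otlp packages.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "catalog"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)  # everything traced from now on goes to the collector
```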
My advice: if you are [08:16.040 --> 08:23.280] using a runtime, and probably most of you are using such runtimes, whether Java or whatever, [08:23.280 --> 08:29.920] use it. It's basically free. It's low-hanging fruit, and there is no coupling. So basically [08:29.920 --> 08:37.080] you don't need extra dependencies as a developer in your projects. So since it's called a practical [08:37.080 --> 08:43.800] introduction, let's do some practice. So here I have something a bit better than hello world: [08:43.800 --> 08:51.680] I have tried to model an e-commerce shop with very simple stuff. It starts with just [08:51.680 --> 08:56.960] asking for products. I will go through an API gateway, which will forward the product request [08:56.960 --> 09:01.840] to the catalog, and the catalog doesn't know about the prices, so it will ask for the prices [09:01.840 --> 09:14.000] from the pricing service, and it will ask for the stocks from the stock service. The entry [09:14.000 --> 09:20.120] point is the most important thing, because it starts the parent trace. Everything will [09:20.120 --> 09:26.200] stem from that. So in general, you have a reverse proxy or an API gateway, depending on your [09:26.200 --> 09:33.840] use case. I work on the Apache APISIX project. It uses the NGINX reverse proxy. On top you [09:33.840 --> 09:39.840] have OpenResty, because you want to have Lua to script it and to auto-reload the configuration. [09:39.840 --> 09:49.000] Then you have lots of out-of-the-box plugins. Let's see how it works. Now I have the code [09:49.000 --> 10:00.880] here. Is it big enough? Good. So I might be very old, because for me it wouldn't be. Okay, [10:00.880 --> 10:06.240] here, that's my architecture. I'm using Docker Compose, because I'm super lazy. I don't want [10:06.240 --> 10:12.560] to use Kubernetes. So I have Jaeger. As I mentioned, I have the all-in-one. I'm using the [10:12.560 --> 10:18.960] all-included image, so I don't need to think about having the OpenTelemetry collector and the web UI [10:18.960 --> 10:28.720] to check the traces. I have only one single image. Then I have APISIX. Then I have the catalog, [10:28.720 --> 10:37.680] which I showed you. Of course I have a couple of variables to configure everything. I wanted [10:37.680 --> 10:45.760] to focus on tracing, so no metrics, no logs. I'm sending everything to Jaeger, and then [10:45.760 --> 10:53.240] I do the same for pricing, and I do the same for the stock. And normally at this point, [10:53.240 --> 11:00.800] I have already started everything, because in general I have issues with the Java stuff. So here I'm doing [11:00.800 --> 11:08.240] a simple curl to the product. I've got the data, which is not that important. And I can [11:08.240 --> 11:15.520] check on the web app how it works. So here I will go to the Jaeger UI. I see all my services. [11:15.520 --> 11:22.440] I can find the traces. Here you can find the latest one. And here is the thing. If I click [11:22.440 --> 11:32.960] on it, it might be a bit small, right? I cannot do much better. You can already see everything [11:32.960 --> 11:39.160] that I've shown you. So I start with the product request from the API gateway. It forwards the [11:39.160 --> 11:47.160] product request to the catalog. Then I have the internal calls, and I will show you how they work. Then [11:47.160 --> 11:54.040] I have the GET request made from inside the application. And then I have the stock service that [11:54.040 --> 12:02.600] responds here. Same here.
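[Editor's note: the Docker Compose setup he walks through would look roughly like the sketch below. Service names, images and ports are assumptions for illustration rather than the repository's exact contents.]

```yaml
# Hedged sketch of the demo topology, not the exact compose file from the repo.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest   # collector and web UI in one single image
    ports:
      - "16686:16686"                        # Jaeger web UI
  apisix:
    image: apache/apisix:latest              # API gateway, the entry point of every trace
  catalog:
    build: ./catalog                         # Spring Boot / Kotlin service
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317   # each service points at Jaeger
  pricing:
    build: ./pricing                         # Python service
  stock:
    build: ./stock                           # Rust service
```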
And here we see something that was not mentioned on the component [12:02.600 --> 12:10.040] diagram. From the catalog to the stock, I go directly. But from the catalog to the pricing, [12:10.040 --> 12:16.280] I go back through the API gateway, which is also a way to do it, for whatever reason. And [12:16.280 --> 12:22.280] so this is something that was not mentioned on the PDF, but you cannot cheat with Open [12:22.280 --> 12:29.280] Telemetry. It tells you exactly what happens and what the flow is. And the rest is the same. So [12:29.280 --> 12:40.080] regarding the code itself, I told you that I don't want anything to trouble the developer. [12:40.080 --> 12:49.760] So here I have nothing regarding OpenTelemetry. If I search for "otel", you see nothing. If I search for [12:49.760 --> 12:58.000] "telemetry", you see nothing. I have no dependency. The only thing that I have is my Dockerfile, [12:58.000 --> 13:09.400] and in my Dockerfile, I get the latest OpenTelemetry agent. So you can have your developers [13:09.400 --> 13:16.520] completely oblivious, and you just provide them with this snippet, and then when you [13:16.520 --> 13:24.040] run the Java application, you just tell them, hey, run with the Java agent. Low-hanging fruit, [13:24.040 --> 13:39.320] zero trouble. Any Java developers here? Not that many. Python? OK, so it will be Python. [13:39.320 --> 13:49.080] It's just the same here. Well, here it's a bit different: I add dependencies, but actually I do nothing [13:49.080 --> 13:57.160] with them. So here I have no dependency on anything in the code. Here I'm using a SQL database because, again, [13:57.160 --> 14:04.840] I'm lazy. I don't care that much. But here I have no dependency, no API call to OpenTelemetry. [14:04.840 --> 14:14.000] The only thing that I have is in the Dockerfile again. I have this. Again, I'm using [14:14.000 --> 14:20.840] a runtime. It's super easy. I let the runtime intercept the calls and send everything to [14:20.840 --> 14:31.880] OpenTelemetry. And the last fun stuff is Rust. Any Rust developers? Please don't look [14:31.880 --> 14:42.080] at my code too much. I'm not a Rust developer, so I hope it won't be too horrible. And Rust [14:42.080 --> 14:47.480] is actually, well, not that standardized. So here I don't have any runtime, so I need [14:47.480 --> 14:55.920] to make the calls by myself. The hardest part is to find which library to use, depending [14:55.920 --> 15:03.440] on which framework you use. So in this case, I found one, and perhaps there are better options. [15:03.440 --> 15:13.080] But I found this OpenTelemetry OTLP crate. And here, because I'm using axum, I'm [15:13.080 --> 15:19.920] using this library. And so far, it works for me. I don't need to do a lot of stuff. I just [15:19.920 --> 15:30.040] copy-pasted this stuff. Copy-paste developer. And afterwards, in my main function, I just [15:30.040 --> 15:38.920] need to say this and this. So I added two layers. So if you don't have any platform, [15:38.920 --> 15:44.440] any runtime, you actually need your developers to care about OpenTelemetry. Otherwise, it's [15:44.440 --> 15:54.400] fine. Now, we already have pretty good results, but we want to do better. So we can [15:54.400 --> 16:02.880] also ask the developers, once they are more comfortable, to do manual instrumentation, [16:02.880 --> 16:15.640] even in the case where there is a platform. Now, I will docker compose down. And it takes [16:15.640 --> 16:39.720] a bit of time. I will prepare this.
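[Editor's note: for the Java service, the auto-instrumentation he describes boils down to downloading the agent in the image and attaching it at startup. This is a hedged sketch of such a Dockerfile; the base image, paths and jar names are assumptions, not the repository's exact content.]

```dockerfile
# Hedged sketch: attach the OpenTelemetry Java agent without touching application code.
FROM eclipse-temurin:17-jre
# Grab the published agent (the demo may pin a specific release; this is illustrative).
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /opt/opentelemetry-javaagent.jar
COPY target/catalog.jar /opt/catalog.jar
# The agent instruments Spring, JDBC, HTTP clients, etc. at class-loading time.
ENTRYPOINT ["java", "-javaagent:/opt/opentelemetry-javaagent.jar", "-jar", "/opt/catalog.jar"]
```

The Python service follows the same pattern: install the opentelemetry-distro package in the image and wrap the start command with `opentelemetry-instrument`, so the application code itself stays untouched.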
And on the catalog side, now I can have some additional [16:39.720 --> 16:52.400] code. So this is a Spring Boot application. What I can do is add annotations. I [16:52.400 --> 16:57.240] noticed there were a couple of Java developers; it's the same with Kotlin. It's still on [16:57.240 --> 17:02.440] the JVM. So basically, I'm adding annotations. And because Spring Boot can read the annotations [17:02.440 --> 17:08.720] at runtime, it can add those calls. So I don't have to call the API explicitly. I just add [17:08.720 --> 17:23.200] some annotations, and it should be done. On the Python side, I import this trace stuff, [17:23.200 --> 17:32.960] and then I can, with the tracer, add some explicit, internal spans. [17:32.960 --> 17:37.520] And from the Rust point of view, because I already did it explicitly, it works. [17:37.520 --> 17:41.760] And now you can see that I am in deep trouble, because it happens a lot of the time: the Java [17:41.760 --> 17:47.520] application doesn't start during a demo, and that's really, really fun. So I will try to [17:47.520 --> 17:59.320] docker compose down the catalog. And docker compose, hey, what happens? Dash? Are you [17:59.320 --> 18:11.400] sure? No, no, no, no, no, no, no. Not with the new versions. Yes. That's fine. We are [18:11.400 --> 18:37.360] only here to learn. What? Stop. Thanks. The stress, the stress. Yeah. Honestly, if there [18:37.360 --> 18:47.000] is any person here able to tell me why this Java application sometimes has issues [18:47.000 --> 19:02.000] starting, because I've given it one gig at the beginning, and it always gets stuck here. So [19:02.000 --> 19:12.880] I can tell you what you should see normally. If I'm lucky, I made a screenshot. Yes, here, [19:12.880 --> 19:21.600] but it's the beginning, it's the Rust one. So here, this is what you can have in Python. [19:21.600 --> 19:25.600] This is what I added explicitly. I have five minutes. Well, if the demo doesn't work, it [19:25.600 --> 19:32.040] will be much better. Then I won't have any problems with the timing. Here, you can see [19:32.040 --> 19:42.200] that this is the trace that, yeah, this is a span that I added manually in Python. And [19:42.200 --> 19:53.160] here we can see that I filled the ID attribute with the value. And on the Java side, again, nope, [19:53.160 --> 20:07.400] nope. I think it will be here. This is not the manual stuff that I added. Yes, it is, [20:07.400 --> 20:14.680] you have the fetch here. You have the fetch here. So this is the span that I added manually. [20:14.680 --> 20:22.680] I'm afraid that at this point, the demo just refuses to work. Yes, it's still stuck. I [20:22.680 --> 20:32.120] will stop there. I won't humiliate myself further. When it's done, it's done. Perhaps, [20:32.120 --> 20:37.720] if you are interested, you can follow me on Twitter. You can follow me on Mastodon. [20:37.720 --> 20:43.520] I don't know what the ratio is. More importantly, if you are interested, there is the GitHub repo, [20:43.520 --> 20:48.040] so you can try that by yourself; perhaps with a better configuration of the Docker Compose file, with the [20:48.040 --> 20:53.520] right memory, it would work. And though the talk was not about Apache APISIX, well, have [20:53.520 --> 21:00.000] a look at Apache APISIX. It's an API gateway, the Apache way. Great. Are there some questions [21:00.000 --> 21:11.880] now? I never got so much applause with a failing demo. [21:11.880 --> 21:18.760] Please remain seated so we can have a Q&A. Who had a question?
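[Editor's note: for reference before the Q&A, the manual instrumentation described above looks roughly like this on the Python side; the span and attribute names are illustrative guesses, not the exact ones from the demo. On the JVM side the equivalent is annotating a method with @WithSpan from the opentelemetry-instrumentation-annotations dependency, which the agent picks up at runtime.]

```python
from opentelemetry import trace

# The auto-instrumentation already creates the incoming-request span;
# this adds an explicit, nested span carrying a business attribute.
tracer = trace.get_tracer("pricing")  # tracer name is an assumption

def fetch_price(product_id: int) -> float:
    with tracer.start_as_current_span("fetch-price") as span:
        span.set_attribute("product.id", product_id)
        return lookup_price_in_database(product_id)  # hypothetical helper
```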
Thank you. Very good talk. I have two questions. So one is about this. [21:23.200 --> 21:29.200] Let's start with the first one. Right. Yes, yes, yes. How much overhead does [21:29.200 --> 21:34.240] this bring in Python and Java or Rust? How heavy is this instrumentation? [21:34.240 --> 21:40.960] That's a very good question. And the overhead of each request depends on your own infrastructure. [21:40.960 --> 21:46.640] But I always have an answer to that. Is it better to go fast when you don't know where [21:46.640 --> 21:53.120] you are going, or to go a bit slower and to know where you are going? [21:53.120 --> 22:01.280] I think that whatever the cost, it's always easy to add additional resources, and it doesn't [22:01.280 --> 22:04.880] cost you that much. Whereas debugging an incident across a distributed [22:04.880 --> 22:11.280] system can cost you days or even weeks in engineering costs. And you are very, very [22:11.280 --> 22:16.040] expensive, right? Okay. Thank you. And the second one is, have [22:16.040 --> 22:22.760] you encountered any funny issues with multi-threading or multi-processing? Something like when your [22:22.760 --> 22:25.760] server just now... Can you come closer to your... [22:25.760 --> 22:32.160] Your server just now was not starting. So some software, when you have multi-threading [22:32.160 --> 22:38.680] or multi-processing, have you encountered any issues where the instrumentation causes [22:38.680 --> 22:42.400] you trouble? This is not production stuff. This is just [22:42.400 --> 22:46.520] better than hello world. So I cannot tell you about production issues. [22:46.520 --> 22:51.120] You should find people who have these issues. As I mentioned, it's a developer-oriented [22:51.120 --> 22:55.680] talk. So it's more about pushing the developers to help [22:55.680 --> 23:02.440] ops do their job. For production issues, I must admit I have no clue. [23:02.440 --> 23:09.120] Hi. In the case of a runtime, does it always work, [23:09.120 --> 23:14.760] also with a badly written application? I mean, how bad can an application be before it stops [23:14.760 --> 23:21.320] working? I'm not sure I understood the question. [23:21.320 --> 23:25.840] So how often do you need to do it before it stops working? [23:25.840 --> 23:34.200] No, no. I mean, let's say I use deprecated libraries, bad clients, something that doesn't [23:34.200 --> 23:40.040] work as it's supposed to from the instrumentation perspective. I mean, I do requests to the network [23:40.040 --> 23:49.560] using UDP clients, something I've written myself, some custom stuff that... [23:49.560 --> 23:59.360] I'm imagining that the instrumentation sits between some layer of the network, which is [23:59.360 --> 24:06.840] going to the Internet, for example. And so how bad can I be before it stops recognizing [24:06.840 --> 24:11.840] a request from junk? You cannot be bad. [24:11.840 --> 24:21.760] OK. Well, it's a moral issue first. But then on the platform side, the auto-instrumentations [24:21.760 --> 24:29.080] work with specific frameworks and tools. It's those frameworks and tools that know [24:29.080 --> 24:35.880] how to check what happens and to send the data to OpenTelemetry. [24:35.880 --> 24:42.560] So if you don't play this game, nothing will be sent. [24:42.560 --> 24:49.080] On the manual instrumentation side, it's an explicit call. So it depends what you want [24:49.080 --> 24:55.160] to send. Yeah. I was thinking of auto-instrumentation.
So let's say I do the DNS resolution by myself and then I just throw a request at an IP. [25:03.040 --> 25:14.240] Let me show the Python stuff here. This is what I showed you in the screenshot. [25:14.240 --> 25:21.000] This is what I write. And these are the attributes that I want to have. [25:21.000 --> 25:29.400] So basically, if here you have something that is completely unrelated, it's up to you. [25:29.400 --> 25:33.440] That's why it's easier to start with auto-instrumentation. [25:33.440 --> 25:39.200] And then once you get a general overview of what you have, and your ops start saying, [25:39.200 --> 25:47.000] hey, perhaps we want to have more details here, then you can come with manual instrumentation. [25:47.000 --> 25:54.360] But start with the less expensive stuff. I didn't really answer the question. [25:54.360 --> 26:00.240] I understand it. But that's the best I can do regarding it. [26:00.240 --> 26:05.840] Sorry. Okay. Thanks for the talk. [26:05.840 --> 26:11.480] For the agent you use in the Dockerfile, how can you configure it, for example, to send [26:11.480 --> 26:18.800] the tracing to Jaeger or other tools? Regarding the Dockerfile, sorry? [26:18.800 --> 26:25.600] Yeah. How can you configure the agent to send the tracing to Jaeger or other tools? [26:25.600 --> 26:28.560] The Dockerfile doesn't mention where you send it. [26:28.560 --> 26:32.880] The Dockerfile just says, hey, I will use OpenTelemetry. [26:32.880 --> 26:41.080] And it's the deployment configuration, like in the Docker Compose file, where I'm using [26:41.080 --> 26:47.040] the agreed-upon environment variables, where I'm saying you should send it here or there, [26:47.040 --> 26:50.560] or you should use logging or tracing or metrics or whatever. [26:50.560 --> 26:55.080] So it's very important to separate those concerns. [26:55.080 --> 27:00.200] On one side, in the Dockerfile, in the image, you say, hey, I'm ready for OpenTelemetry. [27:00.200 --> 27:05.520] And when you actually deploy it, you say, okay, OpenTelemetry will go there for the metrics [27:05.520 --> 27:12.160] and there for the tracing, and for logging, I will disable it, or whatever. [27:12.160 --> 27:13.160] Thank you for... Oh, sorry. [27:13.160 --> 27:14.160] Sorry. Go ahead. [27:14.160 --> 27:20.800] Sorry. And then you have a Docker image that can be reusable. [27:20.800 --> 27:24.000] Thank you for being good citizens and remaining seated. [27:24.000 --> 27:26.200] Next question. [27:26.200 --> 27:27.960] Thank you for your presentation. [27:27.960 --> 27:37.240] So my question is, does OpenTelemetry support error handling like Sentry? [27:37.240 --> 27:40.800] If not, are there any plans to do that? [27:40.800 --> 27:46.760] It's really useful to catch crashes and capture the context of the crash. [27:46.760 --> 27:49.760] So that's it. Thank you. [27:49.760 --> 27:56.280] When it happens, do you mean crashes of OpenTelemetry itself, or of the components that [27:56.280 --> 27:58.800] are under watch? [27:58.800 --> 28:02.360] Yeah, of the application that's monitored, yeah. [28:02.360 --> 28:12.120] Well, Fabian showed you how you could log and bind your traces and your logs. [28:12.120 --> 28:14.000] So you could have both here. [28:14.000 --> 28:21.320] My focus was just on tracing, but you can reuse the same Docker setup, the same GitHub repo, [28:21.320 --> 28:32.680] and just, here, put the logs somewhere, in, I don't know, Elasticsearch or whatever.
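[Editor's note: to make the answer about agent configuration concrete, the deployment-time settings he refers to would look roughly like this in a Compose file. The variable names are the standard OpenTelemetry ones; the service name and the endpoint are assumptions, and the image itself stays generic.]

```yaml
# Hedged sketch of deployment-time configuration for the auto-instrumentation agent.
environment:
  OTEL_SERVICE_NAME: catalog                       # assumed service name
  OTEL_TRACES_EXPORTER: otlp                       # tracing on
  OTEL_METRICS_EXPORTER: none                      # metrics and logs disabled, as in the demo
  OTEL_LOGS_EXPORTER: none
  OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317  # where the OTLP data is sent
```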
[28:32.680 --> 28:38.680] No, because it's not a sponsored room. [28:38.680 --> 28:43.720] And then you can check: you introduce some errors, and then you can check how the two are [28:43.720 --> 28:48.080] bound, and you can drill down to where it failed. [28:48.080 --> 28:55.080] Okay, thank you.