[00:00.000 --> 00:12.000] Hi, everybody. Thanks for being here for this talk. That's a lot of people. I'm Nicolas [00:12.000 --> 00:18.160] Frankel. I've been a developer for a long time, and I would like to ask how many of [00:18.160 --> 00:28.400] you are developers in this room? Quite a lot. Who are ops? Just as many, and who are DevOps, [00:28.400 --> 00:40.040] whatever you mean by it. So this talk is actually intended for developers, because I was, or [00:40.040 --> 00:47.920] I still think I am, a developer. So if you are an ops person, and this talk is not [00:47.920 --> 00:53.000] that super interesting for you, at least you can direct your developer colleagues to the talk, so [00:53.000 --> 01:03.120] that they can understand how they can ease your work. Well, perhaps you've never seen [01:03.120 --> 01:11.760] that, but I'm old, or experienced, depending on how you see it. And when I was starting [01:11.760 --> 01:17.160] my career, monitoring was a bunch of people sitting in front of screens the whole [01:17.160 --> 01:23.600] day. And actually, I was lucky. Once, in the south of France, I was told, hey, this is [01:23.600 --> 01:30.400] the biggest monitoring site in all of France. And actually, it really looked like this. [01:30.400 --> 01:37.000] And of course, there were people watching it. And that was the easy way. Now, I hope that [01:37.000 --> 01:45.880] you don't have that anymore, that it has become a bit more modern. Actually, there is a lot [01:45.880 --> 01:54.040] of talk now about microservices, right? Who here is doing microservices? Yeah. Yeah, because [01:54.040 --> 02:00.000] if you don't do microservices, you are not a real developer. But even if you don't do [02:00.000 --> 02:05.040] microservices, so you are not a real developer, and I encourage you not to be a real developer, [02:05.040 --> 02:11.200] in that case, you are probably doing some kind of distributed work. It has become increasingly [02:11.200 --> 02:18.200] difficult to just handle everything locally. And the problem becomes: if something [02:18.200 --> 02:24.560] bad happens, how can you locate where it went wrong? Or even if something works as expected, how [02:24.560 --> 02:34.600] can you understand the flow of your request across the network? I love Wikipedia. And [02:34.600 --> 02:43.280] here is the observability definition from Wikipedia, which is long and, in this case, not that interesting. [02:43.280 --> 02:55.800] So I have a better one afterwards for tracing. So basically, tracing helps you understand [02:55.800 --> 03:04.800] the flow of a business request across all your components. Fabian, where is Fabian? [03:04.800 --> 03:10.240] Fabian is here, so he talked a lot about the metrics and the logging. So in this talk, [03:10.240 --> 03:19.000] I will really focus on tracing, because my opinion is that, well, metrics are easy. We [03:19.000 --> 03:24.240] have been doing metrics for ages: we take the CPU, the memory, whatever. Now we are trying to [03:24.240 --> 03:32.280] get more business-related metrics, but it's still the same concept. Logging also. Now [03:32.280 --> 03:40.080] we do aggregated logging. Again, nothing mind-blowing. Tracing is, I think, the hardest part. [03:40.080 --> 03:48.320] So in the past, there were already some tracing pioneers. Perhaps you've used some of them. [03:48.320 --> 03:55.200] And well, now we are at the stage where we want to have something more standardized. [03:55.200 --> 04:10.800] So it starts with the Trace Context specification from the W3C.
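[Editor's note: to make the Trace Context idea concrete, here is a minimal sketch. The header value is the example from the W3C specification, not output from the demo: a request carries a `traceparent` header, the trace id stays constant for the whole business request, and each service forwards its own span id as the parent for downstream calls.]

```python
# Hedged illustration of the W3C Trace Context "traceparent" header.
# The value below is the sample from the specification, not demo output.
header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"

version, trace_id, parent_span_id, trace_flags = header.split("-")
# trace_id stays the same across all services handling the request;
# each service creates its own span and forwards the header with
# its span id in the parent-id position.
print(trace_id, parent_span_id, trace_flags)
```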
And the idea is that you start a trace [04:10.800 --> 04:19.360] and then other components will get the trace and will append their own spans to it. So [04:19.360 --> 04:28.040] it works very well in a web context. And it defines two important concepts that Fabian [04:28.040 --> 04:44.000] already described. So now I am done. So I have the same stupid stuff. So here you [04:44.000 --> 04:48.920] have, oh, sorry. Yes. It reminds me of a story. I did the same to my colleagues. They [04:48.920 --> 04:59.280] didn't care about the presentation. They only remember that. Okay. So here you have a trace [04:59.280 --> 05:06.720] and here you have the different spans. So here the X1 is the parent one. And then the [05:06.720 --> 05:15.400] Y1 and the Z1 will take this X span as their parent span. And so this is a single trace. [05:15.400 --> 05:21.480] This is a single request across your services. Web stuff is good, but it's definitely not [05:21.480 --> 05:31.160] enough. And so for that we have OpenTelemetry. OpenTelemetry is just a big bag of [05:31.160 --> 05:40.840] miracles all set into a specific project. So it's basically APIs, SDKs, tools, whatever, [05:40.840 --> 05:51.600] under the OpenTelemetry label. It implements the W3C Trace Context. If you have been doing [05:51.600 --> 05:57.080] some kind of tracing before, you might know it, because it's the merging of OpenTracing [05:57.080 --> 06:03.280] and OpenCensus. The good thing is that it's a CNCF project. So basically there is some hope that it will [06:03.280 --> 06:11.080] last for a couple of years. The architecture is pretty simple. Basically you've got sources, [06:11.080 --> 06:17.240] you've got the OpenTelemetry protocol, and as Fabian mentioned, you dump everything into [06:17.240 --> 06:25.280] a collector. The collector should be as close as possible to your sources. And then some [06:25.280 --> 06:31.840] tools are able to read the data from it and to display it in the way that we expect [06:31.840 --> 06:44.480] to see it. What happens after the OpenTelemetry collector is not a problem of OpenTelemetry. [06:44.480 --> 06:50.400] There are just collectors that are compatible, and for example you can use Jaeger or Zipkin [06:50.400 --> 06:57.760] in a way that allows you to dump your data, your OpenTelemetry data, into Jaeger or Zipkin [06:57.760 --> 07:02.200] in the OpenTelemetry format. So you can reuse, and that is very important, you can [07:02.200 --> 07:07.160] reuse your infrastructure if you're already using those tools, while just switching to Open [07:07.160 --> 07:12.880] Telemetry. And then you are using a standard, and you can switch your Open [07:12.880 --> 07:24.080] Telemetry backend with fewer issues. Now comes the fun developer part. If you are a developer, [07:24.080 --> 07:33.600] you probably are lazy. I know, I'm a developer. So the idea is that OpenTelemetry should make [07:33.600 --> 07:42.680] your life as a developer as easy as possible while helping your ops colleagues diagnose [07:42.680 --> 07:51.960] your problems. And the easiest path is auto-instrumentation. Auto-instrumentation [07:51.960 --> 07:58.040] is only possible in cases where you have a platform, where you have a runtime. Fabian [07:58.040 --> 08:05.760] mentioned Java; Java has a runtime, which is the JVM. Python has a runtime. Now if [08:05.760 --> 08:16.040] you have Rust, it's not as easy. So in that case, you are stuck.
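[Editor's note: as a rough sketch of that architecture, this is what pointing an application at a collector looks like with the OpenTelemetry Python SDK. The service name and the endpoint (a collector, or Jaeger's OTLP port) are illustrative assumptions, not taken from the demo repository.]

```python
# Minimal sketch: export spans over OTLP to a collector assumed to listen on 4317.
# Requires the opentelemetry-sdk and opentelemetry-exporter-otlp packages.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "catalog"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)  # everything traced from now on goes to the collector
```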
My advice: if you are [08:16.040 --> 08:23.280] using a runtime, and probably most of you are using such runtimes, whether Java or whatever, [08:23.280 --> 08:29.920] use it. It's basically free. It's low-hanging fruit, and there is no coupling. So basically [08:29.920 --> 08:37.080] you don't need extra dependencies as a developer in your projects. So since it's called a practical [08:37.080 --> 08:43.800] introduction, let's do some practice. So here I have something a bit better than hello world: [08:43.800 --> 08:51.680] I have tried to model an e-commerce shop with very simple stuff. It starts with just [08:51.680 --> 08:56.960] asking for products. I will go through an API gateway, which will forward the product request [08:56.960 --> 09:01.840] to the catalog, and the catalog doesn't know about the prices, so it will ask for the prices [09:01.840 --> 09:14.000] from the pricing service, and it will ask for the stocks from the stock service. The entry [09:14.000 --> 09:20.120] point is the most important thing, because it starts the parent trace. Everything will [09:20.120 --> 09:26.200] stem from that. So in general, you have a reverse proxy or an API gateway, depending on your [09:26.200 --> 09:33.840] use case. I work on the Apache APISIX project. It uses the NGINX reverse proxy. On top you [09:33.840 --> 09:39.840] have OpenResty, because you want to have Lua to script it and to auto-reload the configuration. [09:39.840 --> 09:49.000] Then you have lots of out-of-the-box plugins. Let's see how it works. Now I have the code [09:49.000 --> 10:00.880] here. Is it big enough? Good. So I might be very old, because for me it wouldn't be. Okay, [10:00.880 --> 10:06.240] here, that's my architecture. I'm using Docker Compose, because I'm super lazy. I don't want [10:06.240 --> 10:12.560] to use Kubernetes. So I have Jaeger. As I mentioned, I have the all-in-one. I'm using the [10:12.560 --> 10:18.960] all-included image, so I don't need to think about having the OpenTelemetry collector and the web UI [10:18.960 --> 10:28.720] to check the traces. I have only one single image. Then I have APISIX. Then I have the catalog, [10:28.720 --> 10:37.680] which I showed you. Of course I have a couple of variables to configure everything. I wanted [10:37.680 --> 10:45.760] to focus on tracing, so no metrics, no logs. I'm sending everything to Jaeger, and then [10:45.760 --> 10:53.240] I do the same for pricing, and I do the same for the stock. And normally at this point, [10:53.240 --> 11:00.800] I have already started everything, because in general I have issues with the Java stuff. So here I'm doing [11:00.800 --> 11:08.240] a simple curl to the product. I've got the data, which is not that important. And I can [11:08.240 --> 11:15.520] check on the web app how it works. So here I will go to the Jaeger UI. I see all my services. [11:15.520 --> 11:22.440] I can find the traces. Here you can find the latest one. And here is the thing. If I click [11:22.440 --> 11:32.960] on it, it might be a bit small, right? I cannot do much better. You can already see everything [11:32.960 --> 11:39.160] that I've shown you. So I start with the product request from the API gateway. It forwards the [11:39.160 --> 11:47.160] product request to the catalog. Then I have the internal calls, and I will show you how they work. Then [11:47.160 --> 11:54.040] I have the GET request made from inside the application. And then I have the stock service that [11:54.040 --> 12:02.600] responds here. Same here.
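[Editor's note: the Docker Compose setup he walks through would look roughly like the sketch below. Service names, images and ports are assumptions for illustration rather than the repository's exact contents.]

```yaml
# Hedged sketch of the demo topology, not the exact compose file from the repo.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest   # collector and web UI in one single image
    ports:
      - "16686:16686"                        # Jaeger web UI
  apisix:
    image: apache/apisix:latest              # API gateway, the entry point of every trace
  catalog:
    build: ./catalog                         # Spring Boot / Kotlin service
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317   # each service points at Jaeger
  pricing:
    build: ./pricing                         # Python service
  stock:
    build: ./stock                           # Rust service
```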
And here we see something that was not mentioned on the component [12:02.600 --> 12:10.040] diagram. From the catalog to the stock, I go directly. But from the catalog to the pricing, [12:10.040 --> 12:16.280] I go back through the API gateway, which is also a way to do it, for whatever reason. And [12:16.280 --> 12:22.280] so this is something that was not mentioned on the PDF, but you cannot cheat with Open [12:22.280 --> 12:29.280] Telemetry. It tells you exactly what happens and what the flow is. And the rest is the same. So [12:29.280 --> 12:40.080] regarding the code itself, I told you that I don't want anything to trouble the developer. [12:40.080 --> 12:49.760] So here I have nothing regarding OpenTelemetry. If I search for "otel", you see nothing. If I search for [12:49.760 --> 12:58.000] "telemetry", you see nothing. I have no dependency. The only thing that I have is my Dockerfile, [12:58.000 --> 13:09.400] and in my Dockerfile, I get the latest OpenTelemetry agent. So you can have your developers [13:09.400 --> 13:16.520] completely oblivious, and you just provide them with this snippet, and then when you [13:16.520 --> 13:24.040] run the Java application, you just tell them, hey, run with the Java agent. Low-hanging fruit, [13:24.040 --> 13:39.320] zero trouble. Any Java developers here? Not that many. Python? OK, so it will be Python. [13:39.320 --> 13:49.080] It's just the same here. Well, here it's a bit different: I add dependencies, but actually I do nothing [13:49.080 --> 13:57.160] with them. So here I have no dependency on anything in the code. Here I'm using a SQL database because, again, [13:57.160 --> 14:04.840] I'm lazy. I don't care that much. But here I have no dependency, no API call to OpenTelemetry. [14:04.840 --> 14:14.000] The only thing that I have is in the Dockerfile again. I have this. Again, I'm using [14:14.000 --> 14:20.840] a runtime. It's super easy. I let the runtime intercept the calls and send everything to [14:20.840 --> 14:31.880] OpenTelemetry. And the last fun stuff is Rust. Any Rust developers? Please don't look [14:31.880 --> 14:42.080] at my code too much. I'm not a Rust developer, so I hope it won't be too horrible. And Rust [14:42.080 --> 14:47.480] is actually, well, not that standardized. So here I don't have any runtime, so I need [14:47.480 --> 14:55.920] to make the calls by myself. The hardest part is to find which library to use, depending [14:55.920 --> 15:03.440] on which framework you use. So in this case, I found one, and perhaps there are better options. [15:03.440 --> 15:13.080] But I found this OpenTelemetry OTLP crate. And here, because I'm using axum, I'm [15:13.080 --> 15:19.920] using this library. And so far, it works for me. I don't need to do a lot of stuff. I just [15:19.920 --> 15:30.040] copy-pasted this stuff. Copy-paste developer. And afterwards, in my main function, I just [15:30.040 --> 15:38.920] need to say this and this. So I added two layers. So if you don't have any platform, [15:38.920 --> 15:44.440] any runtime, you actually need your developers to care about OpenTelemetry. Otherwise, it's [15:44.440 --> 15:54.400] fine. Now, we already have pretty good results, but we want to do better. So we can [15:54.400 --> 16:02.880] also ask the developers, once they are more comfortable, to do manual instrumentation, [16:02.880 --> 16:15.640] even in the case where there is a platform. Now, I will docker compose down. And it takes [16:15.640 --> 16:39.720] a bit of time. I will prepare this.
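[Editor's note: for the Java service, the auto-instrumentation he describes boils down to downloading the agent in the image and attaching it at startup. This is a hedged sketch of such a Dockerfile; the base image, paths and jar names are assumptions, not the repository's exact content.]

```dockerfile
# Hedged sketch: attach the OpenTelemetry Java agent without touching application code.
FROM eclipse-temurin:17-jre
# Grab the published agent (the demo may pin a specific release; this is illustrative).
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /opt/opentelemetry-javaagent.jar
COPY target/catalog.jar /opt/catalog.jar
# The agent instruments Spring, JDBC, HTTP clients, etc. at class-loading time.
ENTRYPOINT ["java", "-javaagent:/opt/opentelemetry-javaagent.jar", "-jar", "/opt/catalog.jar"]
```

The Python service follows the same pattern: install the opentelemetry-distro package in the image and wrap the start command with `opentelemetry-instrument`, so the application code itself stays untouched.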
And on the catalog side, now I can have some additional [16:39.720 --> 16:52.400] code. So this is a Spring Boot application. What I can do is add annotations. I [16:52.400 --> 16:57.240] noticed there were a couple of Java developers; it's the same with Kotlin. It's still on [16:57.240 --> 17:02.440] the JVM. So basically, I'm adding annotations. And because Spring Boot can read the annotations [17:02.440 --> 17:08.720] at runtime, it can add those calls. So I don't have to call the API explicitly. I just add [17:08.720 --> 17:23.200] some annotations, and it should be done. On the Python side, I import this trace stuff, [17:23.200 --> 17:32.960] and then I can, with the tracer, add some explicit, internal spans. [17:32.960 --> 17:37.520] And from the Rust point of view, because I already did it explicitly, it works. [17:37.520 --> 17:41.760] And now you can see that I am in deep trouble, because it happens a lot of the time: the Java [17:41.760 --> 17:47.520] application doesn't start during a demo, and that's really, really fun. So I will try to [17:47.520 --> 17:59.320] docker compose down the catalog. And docker compose, hey, what happens? Dash? Are you [17:59.320 --> 18:11.400] sure? No, no, no, no, no, no, no. Not with the new versions. Yes. That's fine. We are [18:11.400 --> 18:37.360] only here to learn. What? Stop. Thanks. The stress, the stress. Yeah. Honestly, if there [18:37.360 --> 18:47.000] is any person here able to tell me why this Java application sometimes has issues [18:47.000 --> 19:02.000] starting, because I've given it one gig at the beginning, and it always gets stuck here. So [19:02.000 --> 19:12.880] I can tell you what you should see normally. If I'm lucky, I made a screenshot. Yes, here, [19:12.880 --> 19:21.600] but it's the beginning, it's the Rust one. So here, this is what you can have in Python. [19:21.600 --> 19:25.600] This is what I added explicitly. I have five minutes. Well, if the demo doesn't work, it [19:25.600 --> 19:32.040] will be much better. Then I won't have any problems with the timing. Here, you can see [19:32.040 --> 19:42.200] that this is the trace that, yeah, this is a span that I added manually in Python. And [19:42.200 --> 19:53.160] here we can see that I filled the ID attribute with the value. And on the Java side, again, nope, [19:53.160 --> 20:07.400] nope. I think it will be here. This is not the manual stuff that I added. Yes, it is, [20:07.400 --> 20:14.680] you have the fetch here. You have the fetch here. So this is the span that I added manually. [20:14.680 --> 20:22.680] I'm afraid that at this point, the demo just refuses to work. Yes, it's still stuck. I [20:22.680 --> 20:32.120] will stop there. I won't humiliate myself further. When it's done, it's done. Perhaps, [20:32.120 --> 20:37.720] if you are interested, you can follow me on Twitter. You can follow me on Mastodon. [20:37.720 --> 20:43.520] I don't know what the ratio is. More importantly, if you are interested, there is the GitHub repo, [20:43.520 --> 20:48.040] so you can try that by yourself; perhaps with a better configuration of the Docker Compose file, with the [20:48.040 --> 20:53.520] right memory, it would work. And though the talk was not about Apache APISIX, well, have [20:53.520 --> 21:00.000] a look at Apache APISIX. It's an API gateway, the Apache way. Great. Are there some questions [21:00.000 --> 21:11.880] now? I never got so much applause with a failing demo. [21:11.880 --> 21:18.760] Please remain seated so we can have a Q&A. Who had a question?
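[Editor's note: for reference before the Q&A, the manual instrumentation described above looks roughly like this on the Python side; the span and attribute names are illustrative guesses, not the exact ones from the demo. On the JVM side the equivalent is annotating a method with @WithSpan from the opentelemetry-instrumentation-annotations dependency, which the agent picks up at runtime.]

```python
from opentelemetry import trace

# The auto-instrumentation already creates the incoming-request span;
# this adds an explicit, nested span carrying a business attribute.
tracer = trace.get_tracer("pricing")  # tracer name is an assumption

def fetch_price(product_id: int) -> float:
    with tracer.start_as_current_span("fetch-price") as span:
        span.set_attribute("product.id", product_id)
        return lookup_price_in_database(product_id)  # hypothetical helper
```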
Thank you. Very good talk. I have two questions. So one is about this. [21:23.200 --> 21:29.200] Let's start with the first one. Right. Yes, yes, yes. How much overhead does [21:29.200 --> 21:34.240] this bring in Python and Java or Rust? How heavy is this instrumentation? [21:34.240 --> 21:40.960] That's a very good question. And the overhead of each request depends on your own infrastructure. [21:40.960 --> 21:46.640] But I always have an answer to that. Is it better to go fast when you don't know where [21:46.640 --> 21:53.120] you are going, or to go a bit slower and to know where you are going? [21:53.120 --> 22:01.280] I think that whatever the cost, it's always easy to add additional resources, and it doesn't [22:01.280 --> 22:04.880] cost you that much. Whereas debugging an incident across a distributed [22:04.880 --> 22:11.280] system can cost you days or even weeks in engineering costs. And you are very, very [22:11.280 --> 22:16.040] expensive, right? Okay. Thank you. And the second one is, have [22:16.040 --> 22:22.760] you encountered any funny issues with multi-threading or multi-processing? Something like when your [22:22.760 --> 22:25.760] server just now... Can you come closer to your... [22:25.760 --> 22:32.160] Your server just now was not starting. So some software, when you have multi-threading [22:32.160 --> 22:38.680] or multi-processing, have you encountered any issues where the instrumentation causes [22:38.680 --> 22:42.400] you trouble? This is not production stuff. This is just [22:42.400 --> 22:46.520] better than hello world. So I cannot tell you about production issues. [22:46.520 --> 22:51.120] You should find people who have these issues. As I mentioned, it's a developer-oriented [22:51.120 --> 22:55.680] talk. So it's more about pushing the developers to help [22:55.680 --> 23:02.440] ops do their job. For production issues, I must admit I have no clue. [23:02.440 --> 23:09.120] Hi. In the case of a runtime, does it always work, [23:09.120 --> 23:14.760] also with a badly written application? I mean, how bad can an application be before it stops [23:14.760 --> 23:21.320] working? I'm not sure I understood the question. [23:21.320 --> 23:25.840] So how often do you need to do it before it stops working? [23:25.840 --> 23:34.200] No, no. I mean, let's say I use deprecated libraries, bad clients, something that doesn't [23:34.200 --> 23:40.040] work as it's supposed to from the instrumentation perspective. I mean, I do requests to the network [23:40.040 --> 23:49.560] using UDP clients, something I've written myself, some custom stuff that... [23:49.560 --> 23:59.360] I'm imagining that the instrumentation sits between some layer of the network, which is [23:59.360 --> 24:06.840] going to the Internet, for example. And so how bad can I be before it stops recognizing [24:06.840 --> 24:11.840] a request from junk? You cannot be bad. [24:11.840 --> 24:21.760] OK. Well, it's a moral issue first. But then on the platform side, the auto-instrumentations [24:21.760 --> 24:29.080] work with specific frameworks and tools. It's those frameworks and tools that know [24:29.080 --> 24:35.880] how to check what happens and to send the data to OpenTelemetry. [24:35.880 --> 24:42.560] So if you don't play this game, nothing will be sent. [24:42.560 --> 24:49.080] On the manual instrumentation side, it's an explicit call. So it depends what you want [24:49.080 --> 24:55.160] to send. Yeah. I was thinking of auto-instrumentation.
So let's say I do the DNS resolution by myself and then I just throw a request at an IP. [25:03.040 --> 25:14.240] Let me show the Python stuff here. This is what I showed you in the screenshot. [25:14.240 --> 25:21.000] This is what I write. And these are the attributes that I want to have. [25:21.000 --> 25:29.400] So basically, if here you have something that is completely unrelated, it's up to you. [25:29.400 --> 25:33.440] That's why it's easier to start with auto-instrumentation. [25:33.440 --> 25:39.200] And then once you get a general overview of what you have, and your ops start saying, [25:39.200 --> 25:47.000] hey, perhaps we want to have more details here, then you can come with manual instrumentation. [25:47.000 --> 25:54.360] But start with the less expensive stuff. I didn't really answer the question. [25:54.360 --> 26:00.240] I understand it. But that's the best I can do regarding it. [26:00.240 --> 26:05.840] Sorry. Okay. Thanks for the talk. [26:05.840 --> 26:11.480] For the agent you use in the Dockerfile, how can you configure it, for example, to send [26:11.480 --> 26:18.800] the tracing to Jaeger or other tools? Regarding the Dockerfile, sorry? [26:18.800 --> 26:25.600] Yeah. How can you configure the agent to send the tracing to Jaeger or other tools? [26:25.600 --> 26:28.560] The Dockerfile doesn't mention where you send it. [26:28.560 --> 26:32.880] The Dockerfile just says, hey, I will use OpenTelemetry. [26:32.880 --> 26:41.080] And it's the deployment configuration, like in the Docker Compose file, where I'm using [26:41.080 --> 26:47.040] the agreed-upon environment variables, where I'm saying you should send it here or there, [26:47.040 --> 26:50.560] or you should use logging or tracing or metrics or whatever. [26:50.560 --> 26:55.080] So it's very important to separate those concerns. [26:55.080 --> 27:00.200] On one side, in the Dockerfile, in the image, you say, hey, I'm ready for OpenTelemetry. [27:00.200 --> 27:05.520] And when you actually deploy it, you say, okay, OpenTelemetry will go there for the metrics [27:05.520 --> 27:12.160] and there for the tracing, and for logging, I will disable it, or whatever. [27:12.160 --> 27:13.160] Thank you for... Oh, sorry. [27:13.160 --> 27:14.160] Sorry. Go ahead. [27:14.160 --> 27:20.800] Sorry. And then you have a Docker image that can be reusable. [27:20.800 --> 27:24.000] Thank you for being good citizens and remaining seated. [27:24.000 --> 27:26.200] Next question. [27:26.200 --> 27:27.960] Thank you for your presentation. [27:27.960 --> 27:37.240] So my question is, does OpenTelemetry support error handling like Sentry? [27:37.240 --> 27:40.800] If not, are there any plans to do that? [27:40.800 --> 27:46.760] It's really useful to catch crashes and capture the context of the crash. [27:46.760 --> 27:49.760] So that's it. Thank you. [27:49.760 --> 27:56.280] When it happens, do you mean crashes of OpenTelemetry itself, or of the components that [27:56.280 --> 27:58.800] are under watch? [27:58.800 --> 28:02.360] Yeah, of the application that's monitored, yeah. [28:02.360 --> 28:12.120] Well, Fabian showed you how you could log and bind your traces and your logs. [28:12.120 --> 28:14.000] So you could have both here. [28:14.000 --> 28:21.320] My focus was just on tracing, but you can reuse the same Docker setup, the same GitHub repo, [28:21.320 --> 28:32.680] and just, here, put the logs somewhere, in, I don't know, Elasticsearch or whatever.
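[Editor's note: to make the answer about agent configuration concrete, the deployment-time settings he refers to would look roughly like this in a Compose file. The variable names are the standard OpenTelemetry ones; the service name and the endpoint are assumptions, and the image itself stays generic.]

```yaml
# Hedged sketch of deployment-time configuration for the auto-instrumentation agent.
environment:
  OTEL_SERVICE_NAME: catalog                       # assumed service name
  OTEL_TRACES_EXPORTER: otlp                       # tracing on
  OTEL_METRICS_EXPORTER: none                      # metrics and logs disabled, as in the demo
  OTEL_LOGS_EXPORTER: none
  OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317  # where the OTLP data is sent
```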
[28:32.680 --> 28:38.680] No, because it's not a sponsored room. [28:38.680 --> 28:43.720] And then you can check: you introduce some errors, and then you can check how the two are [28:43.720 --> 28:48.080] bound, and you can drill down to where it failed. [28:48.080 --> 28:55.080] Okay, thank you.