[00:00.000 --> 03:44.160] Yeah, hi everyone. My name is Christian Simon and I'm going to be talking about continuous profiling. We have heard a lot about observability today already, and I want to introduce maybe an additional signal there. Quickly about me: I'm a software engineer at Grafana Labs and I've worked on our databases for observability. I worked on Cortex, now Mimir, I worked on Loki, and most recently I switched to the Phlare team. I'm 50% of the Phlare team, and we work on a continuous profiling database. There's not going to be a particular focus on Phlare today, though. What I want to do today is introduce how profiles are measured and what you can achieve with them, and then maybe next time, as I learn more, I can go into more detail and look at very specific languages.

So when we talk about observability, what are our common goals? Obviously we want to ensure that the journeys of our users are successful, and that we can maybe even find problems proactively before a user notices them. When problems do happen, we want to resolve them as quickly as possible, so that fewer of those user journeys are disrupted. Observability gives us an objective way of getting insight into the state of our system in production. And even after an incident has happened, when we've found the right fix, rebooted and everything is up again, it can help us understand what exactly happened when we want to figure out the root cause.

One of the easiest and probably oldest observability signals is logs. It starts with a print "hello" somewhere in your code, and I probably don't need a show of hands for who's using it; everyone somehow uses logs, or is asleep if they don't raise a hand. Your application doesn't need any specific SDK, it can probably just log using the standard library of your language. One of the challenges is that the format is usually quite varied; even just for timestamps, it can be really hard to get a common understanding of your log lines. And when you then want to aggregate them it gets quite costly, because the values are strings that you need to convert to integers or floats and so on. Another thing that can happen is that you have so many logs that you can't find the ones you're actually interested in. Grepping for "error", for example, might be helpful, but there might just be too many errors and you lose the important ones. If you want to learn more about logs, my colleagues Owen and Kavi are going to speak about Loki.
[03:44.160 --> 06:53.040] So definitely stay in the room. I'm going to move on to the next signal: metrics. I'm also assuming pretty much everyone has come across them and is using them. In this case you avoid the problem I mentioned before: you have actual numbers exposed, and you know which of those numbers you care about. To get a metric you usually have to define the list of metrics you care about up front, and then you can collect them. So you might be having an outage and find you don't have the metric you care about, and you need to constantly improve the exposure of your metrics. Prometheus is obviously the main tool in that space. Very often we are thinking about web services with these applications, and the RED method — get the rate of your requests, the error rate of your requests, and the duration (latency) of your requests — already covers quite a lot of cases. And since metrics are just integers or floats, you can aggregate them quite efficiently across a multitude of pods or a really huge set of services.

Then, if you get into the kind of microservices architecture that has evolved over the last couple of years, you will find yourself with a really complex composition of services involved in answering a request, and you might struggle to understand what is slowing you down, where an error is coming from, or why you have a timeout here. Distributed tracing can help you a lot with understanding what your service is doing. It might also show that a service is actually doing way too much, calculating things over and over again. So it is super helpful for seeing the flow of data through your system. The challenge is that you might have a lot of requests, and while tracing is somewhat cheap, you might not cover all of them. For example, in our production system — someone correct me if I'm wrong — when we receive logs and metric data in Grafana Cloud we only cover about 1% of those requests with traces, while we cover 100% of our queries. You basically need to make a selective decision about where it's worth investing. Log and metric ingestion looks pretty much the same every second, so we see more value in tracing all of the queries, where there is complex caching and all sorts of systems interacting; that allows us to look a bit deeper and find the one service that is maybe the bottleneck.
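[Editor's note: as a concrete illustration of the RED method mentioned above, here is a minimal sketch using the Prometheus Go client. The metric names, the /checkout handler and the port are my own hypothetical choices, not something from the talk.]

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Rate and Errors: one counter, labelled by status code (hypothetical name).
	requests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "checkout_requests_total",
		Help: "HTTP requests handled, by status code.",
	}, []string{"code"})

	// Duration: a histogram of request latencies (hypothetical name).
	latency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "checkout_request_duration_seconds",
		Help:    "Request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	})
)

func checkout(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() { latency.Observe(time.Since(start).Seconds()) }()

	// ... the real checkout work would happen here ...
	requests.WithLabelValues("200").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/checkout", checkout)
	http.Handle("/metrics", promhttp.Handler()) // endpoint scraped by Prometheus
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A PromQL query such as `rate(checkout_requests_total[5m])` would then give the request and error rates the speaker describes, aggregated cheaply across many pods.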
[06:53.040 --> 10:08.840] Let's look at a bit of a real problem. I'm running an online shop, I'm selling socks, and a user is complaining about getting a timeout when trying to check out. That's obviously not great, because I'm not selling the socks, but at least the user got a trace ID and is complaining to our customer service. Starting from there, I figure out that it was actually the location service that caused the timeout in the end. Looking at the metrics of the location service, I might find that 5% of its requests are timing out, so maybe 5% of my users are not able to buy their monthly socks, or whatever. So what are the next steps? Scaling up is always good; maybe the service is just overloaded, the person who wrote it left years ago, so we have no idea, so we just scale it up. Obviously that comes with a cost. The first thing should always be stopping the bleeding and making sure there are no more timeouts, so scaling up is definitely the right option here. But if you keep doing that over years, you might suddenly find yourself with a lot of extra cost attached to that location service. And that's where I think we need another signal, and I think that signal should be profiling.

I guess most people have come across profiling. It basically measures your code: how long it executes, for example, or how many bytes it allocates in memory. It helps you understand your own program even better, or someone else's program, as in the location service case. That can eventually translate into cost savings: if you find out where the problem lies, you can maybe fix it or at least get some ideas, and you can also look at the fix afterwards and see whether it actually got better or worse. It gives you a better understanding of how your code behaves.

So now the question is: what is actually measured in a profile? I created a small program — I hope everyone can see it. It has a main function which then calls other functions: there's a doALot and a doLittle function, and both of them call prepare. In the comments there's some work going on, and that work could be allocating memory, using the CPU, something like that. Let's say it's CPU. When the program starts, we first do something within main; let's say we spend three CPU cycles, which is not a lot, but that gets recorded: it took us three CPU cycles in main. We then go into the prepare method through doALot and spend another five CPU cycles there.
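[Editor's note: the program itself is not reproduced in the transcript. Here is a minimal reconstruction based purely on the call structure described above — main calling doALot and doLittle, both calling prepare — with placeholder bodies; the exact names and work are assumptions.]

```go
package main

// Reconstruction of the talk's example: main calls doALot and doLittle,
// and both of them call prepare. The comments stand in for whatever
// burns CPU cycles or allocates memory.

func prepare() {
	// ... some shared setup work (a few CPU cycles) ...
}

func doALot() {
	prepare()
	// ... a lot of work (where most of the CPU time goes) ...
}

func doLittle() {
	prepare()
	// ... a little work ...
}

func main() {
	// ... a bit of work directly in main ...
	doALot()
	doLittle()
}
```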
[10:08.840 --> 13:21.000] Those stack traces are then recorded in the profile, and going through the program we end up with that kind of measurement of stack traces. While this works with ten lines of code — you can probably spot where the problem is, there's the 20 in doALot — it definitely breaks down when you're talking about a lot of services, or a large code base that happens to be hot and actively used. So there are a couple of ways of visualizing profiles.

One of the first things you'll find is a top table. In such a table you can order by different values. This example is from pprof, the Go tool, and you can clearly see that doALot is the method that comes out on top. There are different ways to look at the value: you have the flat value, which is the function itself only — the 20 CPU cycles we had before — and you have the cumulative value, which also includes prepare, since it gets called from doALot. So we can already see that we spend about 52% of our program in doALot. Maybe we can stop looking at the rest of the table and just look at doALot, because if we fix it or get rid of it, we need half the CPU. That is represented by the sum column, which changes depending on how you order the table; in this particular example we reach 100% at row number four, because we only have four functions.

To get a more visual sense of what's going on, there are the so-called flame graphs. The most confusing thing about them, for me, was the coloring. Red is usually not great — should we go look at main? No, we shouldn't. The coloring is random, or uses some kind of hashing; it's only meant to make the graph look like a flame, so the red here doesn't mean anything. If you're colorblind, that's perfect for flame graphs. What we actually want to look at is the leaf end: that's where the program actually spends its time. You can see the three CPU cycles of main here, with nothing below it — main accounts for 100% through the methods it calls, but this small block with nothing beyond it is time spent in main itself. In the same way you can see the five in doLittle, the 20 in doALot, which is quite wide, and the prepares with five each. Across a really huge program, this lets you spot quite quickly what's going on.
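[Editor's note: for readers who have not seen a pprof top table, the output for a program like the one above would look roughly like this. The numbers are illustrative, reconstructed from the cycle counts mentioned in the talk, not copied from the speaker's slide.]

```
(pprof) top
Showing nodes accounting for 38, 100% of 38 total
      flat  flat%   sum%        cum   cum%
        20 52.63% 52.63%         25 65.79%  main.doALot
        10 26.32% 78.95%         10 26.32%  main.prepare
         5 13.16% 92.11%         10 26.32%  main.doLittle
         3  7.89%   100%         38   100%  main.main
```

Here "flat" is time in the function itself, "cum" includes everything it calls (doALot's 20 plus prepare's 5), and "sum%" is the running total of flat% in the current ordering.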
[13:21.000 --> 16:45.200] And if you have something like root and main at the bottom, you can basically ignore those, but the graph helps you locate which component of your program you want to look at. Maybe you're not good at naming and you call everything prepare in a util package; the flame graph would still tell you roughly where it gets called and how the program gets there.

So how do we get that profile? That can come with quite a lot of challenges. I would say there are roughly two ways. Either your ecosystem supports profiling fairly natively, and you instrument the application — maybe you add a library or an SDK — and the runtime in your environment exposes the information. That's not available for all languages, though there's a lot of work going on to make it more and more available. The other approach is more agent-based, and eBPF has been quite hyped there. I'm not very familiar with eBPF myself — I've used the agents but haven't written any code — but basically it takes an outside view: you don't need to change the running binary at all, you just look at the information you get from the Linux kernel. You hook in often enough while the CPU is running to find out what is currently executing. Different languages behave differently here. In a compiled language you have a lot more information: the memory addresses are stable and you can use the symbol table to figure out where your program is and what is currently running. In an interpreted language like Ruby or Python this is a bit harder, and that information might not be accessible to the kernel without further work. Also, when you compile you might strip those symbol tables, so you really need to prepare your application a bit for this.

Now I want to look at the prime example I'm most familiar with. I've been mostly a Go developer over the last couple of years, and Go has quite a mature set of tools in this area. The standard library allows you to expose that information, and it supports CPU and memory profiles. Especially in garbage-collected languages — and in non-garbage-collected languages too — memory is really important to understand. I have a quick example of a program where you basically just expose an HTTP port from which you can download a profile whenever you want, and then you have the pprof tool that you can point at it.
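[Editor's note: the example program is not reproduced in the transcript. The usual way to do what is described — exposing profiles over HTTP with the Go standard library — is a couple of lines with net/http/pprof, roughly as below; port 6060 is just a common convention, not something stated in the talk.]

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// Profiles (CPU, heap, goroutines, ...) become downloadable from
		// http://localhost:6060/debug/pprof/ while the application runs.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the actual application work would go here ...
	select {} // block forever so the demo server keeps serving
}
```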
[16:45.200 --> 19:58.880] The first pprof command in my example just takes a two-second-long CPU profile: it looks at the CPU for two seconds, records whatever is running and for how long, and then you get a file, and pprof lets you visualize it, through that top table for example. Something I forgot to mention: you can also go to that URL later and look at the profile I used as an example. In the same way you can get the memory allocations, and pprof also allows you to launch an HTTP server to be a bit more interactive and select certain code paths. There is quite a lot about profiling in the Go docs on go.dev, so I'll leave it at that; if you're a Go developer, use it and play around yourself.

Now I want to talk about why profiling can actually be quite difficult. In my example I had three CPU cycles, and if you think about it, that's not very much. Just recording what the program was doing in those three CPU cycles probably takes, I have no idea, thousands of CPU cycles. So you really want to be careful about what you record. If you really recorded all of it, your program would have a massive overhead, it would slow down, it would behave completely differently because of the profiling, and you would also have a lot more data to store and analyze. And if you think about microservices with a replica count of 500, you'd get a lot of data that isn't actually that useful to you — do you really care about three CPU cycles? Probably not.

Because of that, to allow continuous profiling — doing this in production across a wide set of deployments — I think Google were the first to do it, and they started to sample the profiles. Instead of looking at every piece of code that runs, Go for example looks 100 times a second at what is currently running on the CPU and records that. A tiny integer addition will probably never be on the CPU at a sample point unless you run it all the time, so you still get an accurate representation of what is really taking your CPU time. You also need to be aware that your program might not be on the CPU at all because it is waiting for I/O, so when you collect a profile and it doesn't account for that many seconds, you really need to ask whether this is what you actually want to optimize, or whether you're simply not seeing what you want to see.
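[Editor's note: the concrete pprof invocations are not in the transcript; against the kind of endpoint sketched earlier they would look roughly like this. The host and port are assumptions carried over from the previous sketch.]

```
# two-second CPU profile, explored in the interactive console ("top", "list", ...)
go tool pprof -seconds 2 http://localhost:6060/debug/pprof/profile

# memory allocations (heap profile)
go tool pprof http://localhost:6060/debug/pprof/heap

# launch a local web UI with flame graph, top table and source view
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap
```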
[19:58.880 --> 23:23.960] With that kind of statistical approach — I don't have a source to cite, but people generally say it adds around two to three percent overhead on top of your program's execution — it's a lot more reasonable than the full approach of recording everything. So what you would generally do: first you ship your application somewhere and run it, and then you can look at the profiles and think about them. Maybe you're the owner of that code and have a bit more understanding, and those profiles give you a better idea of what you're actually doing there, or how the system reacts to your code. For that green box, multiple solutions exist. I'm obviously a bit biased, and I also have to say our project is fairly young and evolving. For example there's CNCF Pixie, which is eBPF-based, there's Polar Signals' Parca — people from that project are in the room — there's Pyroscope, and there's our solution. I think they're all great; you can start using any of them and exploring, maybe just with your benchmarks for a start, and as you get more familiar you'll discover more and more of the value.

I'm going to use Phlare now for my quick demo. Let me just see. I guess most of you are familiar with Grafana and Explore. (Why is it so huge?) This is the entry point you see in Explore: you have the typical time range selection — let's say we want the last 15 minutes — and here we can see the profile types collected. This is just a Docker Compose setup running locally on my laptop, hopefully still running since I started the talk. For example, we can look at the CPU here, the nanoseconds spent on CPU, and you can see the flame graph from earlier — and maybe some bug, I don't know, it looks a bit bigger than it usually should be. We can see the top table, and we see the aggregation across all of the services; I'm running five pods or so, in different languages. For example, this one here is a Python main module that is computing some prime numbers. The first thing I want to do here is break down by label, and that's really the only functionality we have in terms of querying: here we look at the different instances and see the CPU time spent by each. There's a Rust pod in there, and they're both blue so I don't know which is which, but I guess Phlare is doing more, so that's probably the Phlare one.
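[Editor's note: as an aside to the sampling discussion above, Go's sampling profiler can also be driven programmatically, which makes the roughly-100-samples-per-second behaviour and the memory sampling rate visible in code. A minimal sketch; the file name is my own choice.]

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// The CPU profiler samples the running goroutines roughly 100 times per
	// second by default, rather than tracing every instruction.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// Heap allocations are sampled too: by default about one sample per
	// 512 KiB allocated, controlled by runtime.MemProfileRate.
	fmt.Println("memory profile rate (bytes):", runtime.MemProfileRate)

	// ... the work you actually want to profile goes here ...
}
```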
[23:23.960 --> 27:03.480] For my example I now want to look at a small program I wrote to show this aspect. Here we can see the timeline: a profile gets collected every 15 seconds, I think, and each of those is a dot, and the flame graph and the top table below just aggregate all of that — there is no time component in them, which is important to understand. Now I'm going to switch to memory, the allocated space. (And... oh no.) Here we have the label selection that you might be familiar with, and for this random pod you can see the allocations: the amount of memory allocated is around six megabytes, but roughly every five minutes you can see a peak. If you look at the flame graph, there's already a big red box — and again, the colors don't mean anything — and you can see that this piece of code is doing the majority of the allocations. You could even zoom in here if you really want to figure it out, and then it gets bigger and you can see more of what's going on. And if you actually want to look at the code for this: once Phlare is maybe at version 0.6 we could even see the exact line of code to look at; for now, it basically points us at the allocations on line 21. I guess most of you can see what this program is doing: every five minutes it has a peak of allocations. And you only see that because you have the time component you can select, and then the flame graph aggregation. Cool.

Yeah, that was almost my talk; I have one more slide I quickly want to mention. In the latest Go version, 1.20, there is profile-guided optimization, and I think that might become a really big topic. It takes your profile — which can come from production, from benchmarks, from wherever — and tries to make better decisions at compile time about what to do with your code. I think the only thing it does right now is inlining decisions: if it sees that a method is called a lot and is in the hot path, it will decide to inline it; if it's a bit colder, it won't. You can be a lot more accurate as a compiler if you have that kind of data and know whether a method is in the hot path or not. Okay, that was it. Thank you.

Thanks a lot, that was awesome. Questions?

Thank you for the talk. I'm just wondering how profiling would work with very multi-threaded code. Is there the ability to drill down to that level?
[27:03.480 --> 30:26.880] Yeah — in terms of multi-threading: in that example we only have the main method, so you see root and then main at 100%. If it's multi-threaded you would maybe have more stacks, but it's really only the stack trace that gets recorded; you would not see the connections, where a thread is spawned off and things like that. You just get the call stacks.

Have you looked into any profiling ingestion formats other than pprof? I know OpenTelemetry has been doing some work on a profiling format that people can standardize on, but I don't know if you've looked at that at all.

Sorry, can you repeat that? I struggled to hear.

I was wondering if you've looked at any profiling ingestion formats other than pprof.

No — right now we use the pprof format with Phlare. I think there are a lot of improvements to be had over that format, and as far as I know there is active work around OpenTelemetry to come to a better format, in the sense of not sending symbols over and over again and reducing ingest. But "no" is the accurate and short answer.

Okay, thank you for the talk. My question is: looking at the Phlare architecture, it's currently a pull model, so the Phlare agent is scraping the profiling data from the applications you configure it to scrape. Is there an eventual plan to also add a push gateway or a similar API for applications where that might be more suitable?

Yeah, definitely. I can see the push use case, for example if you want to get your micro-benchmarks from CI/CD in. The API in theory allows it, but the tooling is missing. I definitely think push is a valid use case as well; in terms of scalability I think pull will be better, but yeah, I agree.

Thanks for the talk, I have a small question. Did you try to use this tooling at the end of the CI/CD pipeline, for continuous optimization?

No, we're not using it for that yet. I think it's definitely a super useful thing, because you want to see how a pull request behaves, maybe how your application allocates more or less in different parts and whether the trade-offs are right. I think it definitely can and should be used for that, but there's no tooling right now. Yeah, I fully agree.

Hello, thank you. If I understand correctly, profiles are something like traces combined with OS metrics, right? So at a concrete, specific time you can see how much CPU you used, and so on?
[30:26.880 --> 32:02.680] Yeah — I'd say it looks a bit more at the actual line of code. I haven't used tracing that automatically finds the function, maybe that also tells you the line of code, but yes, it definitely adds those measurements without you doing much, other than making sure it can read the symbol tables and the function names.

So I just had a dumb question, or a dumb idea: couldn't you just combine things you already have? You have node exporter, which exposes OS metrics at all times, and you have traces, for example. Couldn't you have some kind of integration in Grafana, or somewhere else, that just combines traces with metrics?

Yeah — I think people who have worked longer on continuous profiling software have tried to reuse Prometheus for this, and where you end up is just very high cardinality: there are too many lines of code, and that's where it stops. In theory, most PromQL constructs and functions are something we'd need to implement on top of profiles in a similar way, because in the end you just get metrics out of it. But basically the problem was too many lines of code, too much change over time — you just get too much series churn through that.

So, thanks a lot. Yeah, thank you for coming.