[00:00.000 --> 00:14.000] I'm going to talk about walking native stacks in BPF without frame pointers. [00:17.500 --> 00:18.600] My name is Vaishali. [00:18.600 --> 00:24.560] I work at Polar Signals as a software engineer, mostly on profiling and eBPF-related work, [00:24.560 --> 00:29.560] and before that I worked on different kernel subsystems as part of my job. [00:29.560 --> 00:34.920] My name is Javier, and I've been working at Polar Signals for a year, mostly on native [00:34.920 --> 00:41.640] stack unwinding; before that I was working on web reliability and developer tooling at Facebook. [00:41.640 --> 00:44.840] Before we get into the talk, let's go over the agenda. [00:44.840 --> 00:49.920] We'll first address the question that always gets asked: why is there a [00:49.920 --> 00:55.800] need for a DWARF-based stack walker in eBPF? Then we will briefly go through the design [00:55.800 --> 01:01.000] of our stack walker and how we went from a prototype to making it [01:01.000 --> 01:06.320] production-ready, then a bunch of the learnings so far, especially from [01:06.320 --> 01:13.000] interacting with the eBPF subsystem of the kernel, and then our future plans. [01:13.000 --> 01:16.520] As I said, we work on production profilers. [01:16.520 --> 01:22.240] Sampling profilers collect stack traces at particular intervals and attach [01:22.240 --> 01:24.080] values to them. [01:24.080 --> 01:29.480] For that, profilers generally need both user (application) stacks and kernel stacks. [01:29.480 --> 01:33.120] Stack walking is the part of the process that collects the stack traces. [01:33.120 --> 01:37.600] In simple words, it involves iterating over all the stack frames and collecting [01:37.600 --> 01:39.400] the return addresses.
[01:39.400 --> 01:45.720] Historically there has been a dedicated register to store this frame information [01:45.720 --> 01:53.960] on both x86 and ARM, although it has fallen victim to compiler optimizations, [01:54.520 --> 01:58.000] so most toolchains actually omit it. [01:58.000 --> 02:02.440] It's called the frame pointer, as many of you have heard. [02:02.440 --> 02:09.520] And when we don't have the frame pointer, walking the stack becomes a much [02:09.520 --> 02:12.120] lengthier process. [02:12.120 --> 02:16.520] Instead of involving a couple of memory accesses per frame, which is quite fast, [02:16.520 --> 02:21.400] we have to do a lot more work in the stack walker. Note that this kind of stack walking is [02:21.440 --> 02:24.040] also common practice in debuggers. [02:24.040 --> 02:30.120] So what's the current state of the world with respect to frame pointers? [02:30.120 --> 02:35.280] It's not a problem for the hyperscalers in particular: as you may know, in production they [02:35.280 --> 02:39.920] are always running applications with frame pointers enabled, because whenever [02:39.920 --> 02:45.600] they have to inspect incidents, getting fast and reliable stack traces [02:45.680 --> 02:47.880] is a must. [02:47.880 --> 02:59.080] The Go runtime has enabled frame pointers since Go 1.7 on x86 and 1.12 on arm64. [02:59.080 --> 03:04.360] macOS is always compiled with frame pointers. [03:04.360 --> 03:10.000] There's also amazing work going on on a compact frame format, [03:10.080 --> 03:17.480] called SFrame ("simple frame"); support has been added to the tool- [03:17.480 --> 03:24.120] chains, and there is also a mailing list discussion going on in the kernel about [03:24.120 --> 03:31.600] having a stack walker for unwinding user stacks.
[03:31.600 --> 03:37.440] But the thing is that we want it now; we want to support all the distros and [03:37.440 --> 03:45.000] all the runtimes, and the one thing that is common across a lot of these is DWARF, and that's why [03:45.000 --> 03:46.400] we are using it. [03:46.400 --> 03:48.760] So where does it come from? [03:48.760 --> 03:56.400] Some of you might be wondering about, say, exceptions in C++, or, for example, the Rust [03:56.400 --> 04:00.720] toolchain, which always disables frame pointers, yet when a [04:00.720 --> 04:04.680] panic happens you always get full backtraces. [04:04.680 --> 04:13.480] The reason is the .eh_frame section, which is used for that, or .debug_frame. [04:13.480 --> 04:21.240] Most of the time, toolchains emit one or the other of these sections. Another idea is that [04:21.240 --> 04:26.560] you can also synthesize the unwind tables from the object code. [04:26.560 --> 04:31.680] This is the approach used by ORC, the kernel's stack unwinder, which [04:32.200 --> 04:35.960] was added, I guess, five or six years ago. [04:35.960 --> 04:42.720] We'll talk about .eh_frame in detail in a minute, but before that let's see who [04:42.720 --> 04:45.000] else is using .eh_frame. [04:45.000 --> 04:51.640] Of course we are not the first ones to use it: perf does. [04:51.640 --> 04:59.480] Since the perf_event_open syscall gained this support, I think in 3.4, [04:59.520 --> 05:03.240] it collects the registers for the profiled processes as well as a copy of the stack [05:03.240 --> 05:04.760] for every sample.
[05:04.760 --> 05:09.680] While this approach has been proven to work, it has a bunch of drawbacks which we wanted to [05:09.680 --> 05:14.040] avoid. One is that the kernel copies the user stack for every sample, and this can [05:14.040 --> 05:19.040] be quite a bit of data, especially for CPU-intensive applications. [05:19.040 --> 05:23.320] Another is that when we copy the data into user space, the implications [05:23.320 --> 05:30.200] of one process having another process's data can be complicated. [05:30.200 --> 05:36.560] Those are some of the things we wanted to avoid, and stack walking using BPF makes [05:36.560 --> 05:41.760] sense to us because we don't have to copy the whole stack; instead, a lot of the [05:41.760 --> 05:47.880] information stays in the kernel, especially in the stack walking mechanism itself. [05:47.880 --> 05:52.200] Once it is implemented, we can leverage the perf subsystem to get samples [05:52.200 --> 05:57.440] on CPU cycles, instructions, cache misses, etc. [05:57.440 --> 06:02.960] It can also help us develop other tools, like allocation tracers and runtime-specific [06:02.960 --> 06:07.880] profilers, for the JVM or Ruby, etc. [06:07.880 --> 06:13.480] Now some of you might be wondering why we want to implement something new when [06:13.480 --> 06:17.280] we already have bpf_get_stackid. [06:17.280 --> 06:25.520] The reason is that it also uses frame pointers to unwind, and having a fully [06:25.520 --> 06:31.560] featured DWARF unwinder in the kernel is unlikely to happen; there is a mailing list discussion [06:31.560 --> 06:36.400] you can go and check out to see why. [06:36.400 --> 06:41.560] Now, before we dive into the design of our stack walker, I want to give some information [06:41.560 --> 06:45.840] on what .eh_frame contains and how we can use it.
[06:45.840 --> 06:54.640] The .eh_frame section contains one or more call frame information (CFI) records. [06:54.640 --> 06:59.280] The main goal of the call frame information records is to provide answers on how to restore [06:59.280 --> 07:03.600] the registers for the previous frame, and their locations, such as whether they have been [07:03.600 --> 07:09.120] pushed to the stack or not. Spelling all of this out would generate huge unwind tables, [07:09.560 --> 07:14.480] and for this reason the format attempts to be compact and only contains the information [07:14.480 --> 07:17.040] that is needed. [07:17.040 --> 07:24.720] The unwind tables encoded in the CFI format are in the form of opcodes: we basically [07:24.720 --> 07:29.880] have to evaluate them, and in the case of stack walking, once they have been evaluated, [07:29.880 --> 07:34.960] we generate a table that contains, for each instruction boundary, how to restore the [07:34.960 --> 07:36.680] values of the previous frame's registers. [07:36.720 --> 07:38.800] It has sort of two layers to it. [07:38.800 --> 07:45.240] The first layer helps with repetitive patterns, compressing them, and allows for [07:45.240 --> 07:48.400] a more compact representation of some data, [07:48.400 --> 07:53.600] in that there are specialized opcodes that consume one, two, or four bytes, [07:53.600 --> 07:58.400] so an operand doesn't have to be four bytes all the time. [07:58.400 --> 08:03.640] The second layer is a special opcode that contains DWARF expressions, which [08:03.680 --> 08:10.000] are arbitrary expressions that need to be evaluated. [08:10.000 --> 08:15.480] This would mean that we would have to implement [08:15.480 --> 08:24.200] a full-blown VM in eBPF to evaluate any expression, which is not practical.
[08:24.200 --> 08:31.200] We are also going to mention what we are doing to overcome those challenges. [08:31.240 --> 08:36.480] For those who are not aware of the general flow of eBPF applications, [08:36.480 --> 08:42.400] this is how it looks at a very high level. [08:42.400 --> 08:47.600] In user space we have the driver program, which is written in Go [08:47.600 --> 08:54.600] using libbpfgo. It creates the maps, attaches the BPF program to the [08:54.600 --> 09:03.480] CPU-cycles perf event, and then reads, parses, and evaluates the .eh_frame section of the process. [09:03.480 --> 09:08.320] And in the BPF program we fetch the table for the current PID and then have an unwinding [09:08.320 --> 09:12.840] algorithm which processes the raw information. [09:12.840 --> 09:20.360] We will go in depth into each component, but let's see what the algorithm looks like. [09:20.360 --> 09:24.120] First, we read three registers: [09:24.160 --> 09:31.160] the instruction pointer, RIP; the stack pointer, RSP; and RBP, which is commonly [09:31.160 --> 09:36.080] used as the frame pointer when frame pointers are enabled. [09:36.080 --> 09:44.080] Next, while the unwound frame count is less than the maximum depth, [09:44.080 --> 09:48.440] we find the unwind table row for the program counter, add the instruction [09:48.440 --> 09:53.080] pointer to the stack trace, calculate the previous frame's stack pointer, update the registers, [09:53.080 --> 09:56.920] and continue with the next frame. [09:56.920 --> 10:04.680] The core of this is a very simple binary search, but when it has to scale we also need to think [10:04.680 --> 10:11.320] about storing the unwind information and how it can work with all the runtimes, etc. [10:11.320 --> 10:13.760] So Javier will now talk about that.
[10:13.760 --> 10:20.160] Cool. So, as Vaishali said, we need somewhere to store the unwind information. [10:20.160 --> 10:23.920] We are going to look later at what this table looks like, [10:23.920 --> 10:26.720] but first let's see what the possibilities are. [10:26.720 --> 10:31.000] One possibility, for example, would be to store the unwind information in-process. [10:31.000 --> 10:36.400] We could do this using a combination of ptrace, mmap, and mlock. This would require us to [10:36.400 --> 10:42.040] basically hijack the process's execution, introduce a new memory mapping inside of it, [10:42.040 --> 10:47.040] and then lock the memory, because in our type of BPF programs page [10:47.040 --> 10:48.960] faults are not allowed. [10:48.960 --> 10:53.920] The problem with this approach, of course, is altering the execution flow of applications, [10:53.920 --> 10:56.640] which is something we never want to do. [10:56.640 --> 11:00.960] This complicates things a lot; for example, one of the biggest problems is the lifecycle. [11:01.960 --> 11:07.040] If our profiler dies before we finish cleaning up, who is going to clean [11:07.040 --> 11:11.720] up that memory segment? And how is this going to be perceived by developers if they see [11:11.720 --> 11:16.640] that the memory of their application has increased behind their backs, just because some observability [11:16.640 --> 11:20.400] tool is doing something that is not great? [11:20.400 --> 11:23.560] There's also another problem: sharing memory is harder. [11:23.560 --> 11:30.440] There is the kernel's same-page merging optimization, but if you don't account for that, it's a problem [11:30.440 --> 11:34.800] to have the same information generated over and over, for example for libc, for every single [11:34.800 --> 11:37.040] process on your machine.
[11:37.040 --> 11:41.760] So that's why we ended up using another solution, which is pretty typical in the BPF space: [11:41.760 --> 11:43.960] using BPF maps. [11:43.960 --> 11:48.960] In case you're not familiar, BPF maps are data structures that can be written or read from [11:48.960 --> 11:51.000] both user and kernel space. [11:51.000 --> 11:55.480] We're using hash tables everywhere, which in the case of BPF are basically a mapping [11:55.480 --> 12:00.600] of bytes to bytes that stores arbitrary information. [12:00.600 --> 12:05.960] Some BPF maps, and some BPF programs as well, are allowed to lazily allocate memory for [12:05.960 --> 12:10.400] their data structures, but in the case of our tracing program we cannot do that, and this [12:10.400 --> 12:11.400] has some implications. [12:11.400 --> 12:16.480] We need to memlock that memory from user space, [12:16.480 --> 12:19.960] as otherwise our program wouldn't be able to run. [12:19.960 --> 12:25.440] By using this approach we are also able to reuse these memory mappings, which is great [12:25.440 --> 12:31.080] because it means we don't have to do the same work over and over, and we use less space. [12:31.080 --> 12:34.640] So let's take a look at the logical representation of the unwind tables. [12:34.640 --> 12:38.800] This is not how the layout is in memory, but think about, for example, how the unwind tables [12:38.800 --> 12:44.080] for libc, mysql, zlib, and systemd would be laid out in memory if we could allocate [12:44.080 --> 12:47.000] one large chunk of memory. [12:47.000 --> 12:53.080] In reality there are limits everywhere, obviously, and in BPF we have done some tests, [12:53.080 --> 12:57.240] and on the kernels we want to support we can allocate [12:57.240 --> 13:03.880] up to 250,000 unwind entries per value of the hash map.
[13:03.880 --> 13:08.280] Obviously this was a problem for us, because in some cases we have customers that [13:08.280 --> 13:14.840] have applications with unwind tables of three or four million unwind rows, which is quite ridiculous. [13:14.840 --> 13:21.400] Just to give you an idea, libc has something like 60k entries, so having a couple million [13:21.400 --> 13:23.400] is significant. [13:23.400 --> 13:28.520] But yeah, we came up with the same solution that you would use in any other data-intensive [13:28.520 --> 13:32.800] application, which is to partition, or shard, the data. [13:32.800 --> 13:37.800] The way we're doing this is we have multiple shards that are allocated when our profiler [13:37.840 --> 13:39.240] starts running. [13:39.240 --> 13:42.840] We allocate a different number depending on the available memory on the system and the [13:42.840 --> 13:49.920] overhead that you're willing to pay, and depending on how many shards you have, you [13:49.920 --> 13:55.120] have a different CPU-to-memory trade-off: the more memory you use, the more has to be locked [13:55.120 --> 14:00.760] in memory and can't be swapped out, which for some applications is not ideal, but at the same [14:00.760 --> 14:04.520] time it means that you don't have to regenerate the tables when they are full and you want [14:04.520 --> 14:08.920] to give other processes a fair chance to be profiled. [14:08.920 --> 14:13.640] The way this works, for example for a process like systemd, is that this would be [14:13.640 --> 14:19.480] the representation of the size of its unwind tables, and because it's bigger than the size [14:19.480 --> 14:22.920] of a shard, it has to be chunked. [14:22.920 --> 14:25.440] So here we can see how this is chunked in two.
[14:25.440 --> 14:32.360] The first chunk goes in shard zero, and the unwind entries from the [14:32.400 --> 14:39.900] tail go to shard one. Of course, because we have this new layer of indirection, [14:39.900 --> 14:44.720] we need to somehow keep track of all this bookkeeping and know what the [14:44.720 --> 14:49.680] state of the world is, and we're doing this, of course, with more BPF maps. [14:49.680 --> 14:57.000] So a process has multiple mappings, each mapping has one or more chunks, and each chunk [14:57.000 --> 15:02.320] maps to exactly one shard, and in particular a region within that shard, because a chunk can [15:02.320 --> 15:08.280] have from one unwind entry up to 250k. [15:08.280 --> 15:11.400] Of course this has the benefit I was mentioning before: because we are [15:11.400 --> 15:16.560] sharing the unwind tables, we actually don't spend that many CPU cycles [15:16.560 --> 15:18.680] doing all the work Vaishali was mentioning before. [15:18.680 --> 15:23.440] We need to find the ELF section where the DWARF CFI information is, but we also need [15:23.440 --> 15:24.640] to parse and evaluate it. [15:24.640 --> 15:29.760] There are two levels of VM that have to run, which is not hugely CPU-consuming, [15:29.760 --> 15:30.760] but it still has to happen. [15:30.760 --> 15:35.120] We have to process a bunch of information and generate these unwind tables in our custom [15:35.120 --> 15:36.560] format. [15:36.560 --> 15:42.960] By sharing this, libc, for example, will be shared across all processes, so that means [15:42.960 --> 15:48.160] we only need to add the bookkeeping data structures, which are really cheap to generate.
[15:48.160 --> 15:52.560] In some of the tests that we've been running, generating the unwind tables uses less than 0.9% of the total [15:52.560 --> 15:58.800] CPU cycles our profiler spends. Of course there are a lot [15:58.800 --> 16:02.160] of things we need to take into account, like, for example, what happens if we run out [16:02.160 --> 16:03.760] of space, right? [16:03.760 --> 16:08.440] What we do is adaptively decide what to do in the moment. [16:08.440 --> 16:14.720] We might wait a little bit before resetting the state, or we might decide to give other [16:14.720 --> 16:18.760] processes a chance to be profiled, so we wipe the whole thing and start again. As [16:18.760 --> 16:20.960] you can see, this is very similar to a bump allocator; [16:20.960 --> 16:25.880] it is basically a bump allocator that has been chunked up. [16:25.880 --> 16:33.080] So the process of unwinding with this is: we start with a PID, we check if it has unwind information, [16:33.080 --> 16:40.520] then we need to find the mapping. For each mapping we know the minimum and the maximum [16:40.520 --> 16:45.000] program counter, so we do a linear search to find it. [16:45.000 --> 16:49.800] Then we find the chunk. With the chunk we already have the shard information, and once [16:49.800 --> 16:54.480] we have the shard information, we have to traverse up to 250,000 items. [16:54.480 --> 16:59.720] We do this with just a simple binary search. [16:59.720 --> 17:06.040] This takes between seven and eight iterations, and once we have the unwind action [17:06.040 --> 17:11.200] that tells us how to restore the previous frame's registers, we do those operations and [17:11.200 --> 17:15.480] we are ready to go to the next frame. [17:15.480 --> 17:18.200] We are pretty much done for that frame.
[17:18.200 --> 17:23.880] We know if the stack trace is correct because the bottom of the stack is well defined: [17:23.880 --> 17:29.280] in applications with frame pointers, you have reached it when you [17:29.280 --> 17:32.400] reach RBP equals zero; this is defined by the ABI. [17:32.400 --> 17:37.360] When you have unwind tables, it is defined by not having the program counter covered [17:37.360 --> 17:42.720] by any unwind table and having RBP be zero. These are requirements of the ABI, so if some application [17:42.720 --> 17:45.960] doesn't respect them, it is completely broken. [17:45.960 --> 17:50.560] Once we verify that the stack is correct, that we have reached the bottom of the stack, [17:50.640 --> 17:55.960] we hash the addresses, add the hash to a map, and bump a counter. [17:55.960 --> 18:02.440] We do this, I think, 19 times a second for every single CPU in your box, and every [18:02.440 --> 18:06.560] couple of seconds we collect all this information, generate a profile, and send it to some [18:06.560 --> 18:10.240] server for inspection. [18:10.240 --> 18:15.400] So of course BPF is an interesting environment to work with; it is amazing and we really, really [18:15.400 --> 18:18.120] like it, but we need to be aware of some things. [18:18.160 --> 18:24.240] First of all, because we cannot page in or page out the pages that contain the unwind tables, [18:24.240 --> 18:28.000] they have to be locked in memory, so we need to be very careful with how we organize our [18:28.000 --> 18:33.200] data and its layout to make it as small as possible, so we basically pack [18:33.200 --> 18:37.520] every single thing that can be packed. Then there are some interesting BPF things; [18:37.520 --> 18:40.640] for most people that have written BPF programs this is well known, but I just want [18:40.640 --> 18:44.920] to talk a little bit about how we are dealing with some of the BPF challenges.
[18:44.920 --> 18:50.200] One of them is the stack size, which is easy to work around. If I am not mistaken, we have [18:50.200 --> 18:55.320] 512 bytes, which is not a lot, so we use another BPF map as a sort of global data [18:55.320 --> 19:02.560] structure that basically acts as our heap, if you will. Then, for the [19:02.560 --> 19:07.440] program size, this is a limitation that comes in two ways. First, there is a limit [19:07.440 --> 19:12.240] on how many opcodes you can load into the kernel, but there is also the BPF verifier, [19:12.320 --> 19:17.880] which ensures that the BPF code is safe to load, for example that you don't do any dereference [19:17.880 --> 19:25.280] that could go wrong and that your program terminates. It has some limits; it could theoretically [19:25.280 --> 19:29.520] run forever trying to verify your program, so it bounds the work it does, and we hit those limits [19:29.520 --> 19:34.440] everywhere in our code. For example, we hit them when running our binary search: the verifier complains, [19:34.440 --> 19:39.960] saying that the program is too complex to analyze. So what we do here is that not only have [19:40.000 --> 19:44.880] we sharded our data, we have sharded our code; code and data, same thing, right? We [19:44.880 --> 19:49.440] basically have our program split into many sub-programs, we keep the state, and we [19:49.440 --> 19:55.480] execute one program after the other, continuing from that state. One of the techniques [19:55.480 --> 20:01.000] we use is BPF tail calls. Two other things that are way more modern and amazing [20:01.000 --> 20:06.320] are bounded loops and bpf_loop, which is a wonderful helper. The problem is that while [20:06.360 --> 20:11.600] we use bounded loops right now, we don't use bpf_loop, because it's only supported on modern [20:11.600 --> 20:18.480] kernels, but it's great, and if you can use it, you should. Now, because we're a profiler and we [20:18.480 --> 20:22.000]
want to minimize the impact we have on the machines we run on, I want to talk a little bit about [20:22.000 --> 20:26.600] performance in user space. Our profiler is written in Go, and many [20:26.600 --> 20:30.200] Go applications and APIs aren't designed with performance in mind, and this is something that [20:30.200 --> 20:36.960] can be seen in the DWARF and ELF libraries that Go ships with in the standard library, as well as in [20:36.960 --> 20:40.360] binary.Read and binary.Write, which we use everywhere, because we're dealing with raw bytes and we [20:40.360 --> 20:47.600] read them and write them all the time to the BPF maps. So it is interesting to know that both of these [20:47.600 --> 20:53.800] low-level APIs, binary.Read and binary.Write, actually allocate in the fast path, which can be [20:53.800 --> 20:57.160] problematic, so there's a lot that in the future we're going to be reimplementing to make it [20:57.160 --> 21:01.400] faster. And then we profile our profiler a lot in production; we have found a lot of opportunities, [21:01.400 --> 21:05.960] and there's a lot more work to do. Because there's not much time, I'm going to quickly skip through [21:05.960 --> 21:10.240] testing, but the main idea here is that we try to be pragmatic: we have a lot of unit tests [21:10.240 --> 21:14.920] for the core functions, and then we use snapshot testing for the unwind tables. We have a [21:14.920 --> 21:20.280] Git sub-repository with a visual representation of the unwind tables, and we [21:20.280 --> 21:24.840] regenerate them every time, on CI and locally, and verify that there are no changes compared to [21:24.840 --> 21:31.880] last time. I think there are only two or three minutes left, so let me talk about [21:31.880 --> 21:35.840] different environments and some of the interesting things that we found. While we were profiling [21:35.840 --> 21:40.240] our profiler in production, we realized that we were spending a ridiculous amount
of CPU cycles [21:40.240 --> 21:45.440] reading files from disk. This is just a part of the flame graph, but [21:45.440 --> 21:50.720] I think it was around 20% of the CPU cycles. It turns out this was because our cloud provider has very [21:50.760 --> 21:57.960] slow disks, orders of magnitude slower than the fast NVMe drives we have in the team. Another thing [21:57.960 --> 22:04.200] that is very interesting, which is not a new fact and everybody knows about, is that [22:04.200 --> 22:09.800] differing configurations are the biggest source of trouble, and we saw this the other day. If you're [22:09.800 --> 22:14.720] interested you can check out the pull request; the whole project is open source. It was [22:14.720 --> 22:21.240] an interaction between signals and BPF. What happened, basically: Go has an embedded profiler, [22:21.240 --> 22:27.200] and we only use it in production, for reasons, but it triggers SIGPROF a couple of times [22:27.200 --> 22:32.040] a second, which was interrupting the process's execution. At that time our process was booting [22:32.040 --> 22:37.200] up and loading our BPF program; because it's quite long and complex, the verifier takes a [22:37.200 --> 22:42.560] couple of milliseconds to load it, but it was getting interrupted all the time. Whenever the BPF loader detects [22:42.560 --> 22:47.320] that the verifier has been interrupted, it retries the process, basically wasting all the previous [22:47.320 --> 22:52.080] CPU cycles, because it starts from scratch. It retries up to five times and then gives up, [22:52.080 --> 22:56.760] and of course when we cannot load the BPF program we are completely useless, so we [22:56.760 --> 23:02.280] just crash. And there are many other considerations, such as what you do with short-lived processes, [23:02.280 --> 23:06.680] because you have to generate the data. But even though we haven't optimized for this, we are not [23:06.840 -->
23:12.760] that bad: we can profile processes that run for even one second on your box. Then the [23:12.760 --> 23:16.520] important thing here is that this is the format of our custom unwind table, but the details don't matter; [23:16.520 --> 23:22.760] the important bit is that it mostly fits in L2 cache, so we basically incur two L2 misses, [23:22.760 --> 23:30.520] and it is pretty fast: on a machine with a bunch of processes, with up to 90 frames, we can do the full [23:30.520 --> 23:39.800] unwind processing in 0.5 milliseconds on a CPU that is five years old. Cool. So next we are going to [23:39.800 --> 23:44.600] do a mixed unwind mode, being able to unwind JIT sections; we're planning arm64 support by the end [23:44.600 --> 23:48.920] of the year; and this feature is going to be enabled by default in a few weeks, because right now it's [23:48.920 --> 23:52.960] behind a feature flag. We have many other things that we want to support, including high-level [23:52.960 --> 23:58.160] languages, and we are open source, so if you want to contribute or have anything you want to [23:58.160 --> 24:03.360] discuss, we meet biweekly on Mondays as part of the Parca project. There's a bunch of links [24:03.360 --> 24:08.480] that we're going to upload with the presentation on the FOSDEM website. And yeah, thank you. [24:15.480 --> 24:19.560] I think we have time for a maximum of one short question.