[00:00.000 --> 00:41.000] Welcome everyone, today we are going to talk about walking native stacks in BPF without frame pointers. My name is Vaishali, I work at Polar Signals as a software engineer; I mostly work on profilers and eBPF-related things, and before that I worked on different kernel subsystems as part of my job. My name is Javier, and I have been working at Polar Signals for a year, mostly on native stack unwinding, which is the work we are going to introduce today; before that I was working on web reliability, tooling for developers, and performance at Facebook.

[00:41.000 --> 01:08.080] Before we dive into the topic, I wanted to go through the agenda. First we want to talk about why there is a need for a DWARF-based stack walker in eBPF, because that is the most asked question. Then we will go into the design of our stack walker, then how we went from a prototype to making it production ready, and finally a bunch of learnings so far and some future plans.

[01:08.080 --> 02:15.560] As we said, we work on production profilers. Sampling profilers collect stack traces at particular intervals and attach values to them. Profilers generally need the user and application-level stacks as well as the kernel stacks, and collecting them involves iterating over all the stack frames and gathering the return addresses. Historically there has been a dedicated register for this, the frame pointer, on both x86-64 and ARM64, but in recent times, because of compiler optimizations, it has been disabled by default in most runtimes as well as in most distros. Stack walking becomes really hard when you don't have frame pointers: instead of a couple of memory accesses per frame, which is quite fast, the stack walker needs to do a lot more work. Note that stack walking is also a common practice in debuggers, as you all know.

[02:15.560 --> 03:08.240] So what's the current state of the world? Well, it's not a problem for the hyperscalers, because hyperscalers already run their applications with frame pointers enabled in production. This is important because when things break and you want to inspect them, it's really helpful to have the whole stack available when it's needed. There's also a recent discussion on the Fedora mailing list, so feel free to go through it; the TL;DR is that frame pointers are going to be enabled. Since Fedora 38, I think, which is about to be released in late April, most of the software will ship binaries with frame pointers enabled.
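To make the "couple of memory accesses per frame" point above concrete, here is a minimal sketch of frame-pointer based unwinding. It is illustrative only (not parca-agent code) and assumes a binary that keeps frame pointers, for example one built with -O0 or -fno-omit-frame-pointer.

```c
#include <stdio.h>

/*
 * Minimal sketch of frame-pointer based unwinding.
 * With frame pointers, each frame costs two memory reads:
 * the saved caller RBP and the return address stored next to it.
 */
struct frame {
    struct frame *caller_rbp;   /* saved frame pointer of the caller */
    void         *return_addr;  /* address execution returns to */
};

int main(void) {
    /* __builtin_frame_address(0) is the current frame pointer (GCC/Clang). */
    struct frame *fp = __builtin_frame_address(0);

    for (int depth = 0; fp != NULL && depth < 16; depth++) {
        printf("#%d return address: %p\n", depth, fp->return_addr);
        fp = fp->caller_rbp;    /* hop to the caller's frame */
    }
    return 0;
}
```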
[03:08.240 --> 03:21.840] There's also amazing work going on by Oracle engineers to define a simpler unwind format than DWARF, and we hope that work goes through and helps the ecosystem.

[03:21.840 --> 03:50.160] So that's roughly the current status, but what we want is that support right now, for all the distros as well as all the runtimes, where the situation is scattered: for example, the Go runtime only enables frame pointers since Go 1.7 on x86-64 and since Go 1.12 on ARM64.

[03:50.160 --> 04:30.460] Now some of you might be wondering: if not frame pointers, what do we have? Take Rust, where frame pointers are disabled by default, yet when you hit a panic you still get all the addresses in a backtrace, so how is that happening? Well, compilers always have this information. We generally need to know the value of the stack pointer in the previous frame, and without a frame pointer it can live at any offset. With that we can always find the value of the return address and continue unwinding the stack.

[04:30.460 --> 05:09.080] This information is generally encoded in the .eh_frame or .debug_frame sections in DWARF. There is also another approach: unwind tables can be synthesized from the object file, which is what the ORC format does, and that has been used in the kernel for a while now. We will talk in detail about .eh_frame in a minute, but first, let's see if anybody else is already using .eh_frame, and of course they are.

[05:09.080 --> 06:09.600] The profiler we have developed is not the first one to use .eh_frame. Perf, the popular profiler from the Linux kernel, added DWARF support when this capability was added to the perf_event_open syscall, which was around Linux 3.4; it can collect the registers of the profiled processes as well as a copy of the stack for every sample. While this approach has been proven to work, it has a bunch of drawbacks. One of them is that perf collects all the stacks and copies them into user space. Another is that the implications of one process having another process's data in user space can be quite complicated, and it is also a lot of data, especially for CPU-intensive applications.

[06:09.600 --> 06:31.320] So why BPF? Stack walking in BPF makes a lot of sense for our profiler because we don't have to copy the whole stack. A lot of the information stays in the kernel, which provides higher safety guarantees, especially for a stack-walking mechanism.
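For contrast, this is roughly how the perf-style approach described above is requested from the kernel: user registers plus a chunk of the user stack are attached to every sample, and unwinding happens later in user space (this is what `perf record --call-graph dwarf` relies on). A minimal sketch; the sampling frequency, register mask, and stack size are made-up illustrative values.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a cycles sampling event that ships user registers and a copy of the
 * user stack with every sample, so a DWARF unwinder can run in user space. */
static int open_dwarf_sampling_event(pid_t pid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.freq = 1;
    attr.sample_freq = 99;                     /* samples per second */
    attr.sample_type = PERF_SAMPLE_IP |
                       PERF_SAMPLE_REGS_USER | /* registers at sample time */
                       PERF_SAMPLE_STACK_USER; /* copy of the user stack   */
    attr.sample_regs_user = (1ULL << 16) - 1;  /* arch-specific register bitmask */
    attr.sample_stack_user = 8192;             /* bytes of stack copied per sample */

    return syscall(SYS_perf_event_open, &attr, pid, /*cpu=*/-1,
                   /*group_fd=*/-1, /*flags=*/0);
}
```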
[06:31.320 --> 06:56.960] Once the BPF stack walker is implemented, we can leverage the perf subsystem to sample on CPU cycles, instructions, cache misses, and so on. It can then also help us build other tools, like allocation tracers and runtime-specific profilers, for example for the JVM or Ruby.

[06:56.960 --> 07:29.760] Some of you who are familiar with BPF may know that there is already a bpf_get_stackid helper, so why implement something different? Well, as it turns out, the problem with that helper is that it also requires frame pointers; it uses them to walk the stack. And for historical reasons, a fully featured DWARF unwinder is unlikely to ever be part of the Linux kernel.

[07:29.760 --> 08:25.960] So before we dive into how we use .eh_frame with BPF, let's look at what .eh_frame actually has to offer. The .eh_frame section contains one or more call frame information (CFI) records. The main goal of the call frame information is to answer how to restore every register of the previous frame at any point of the program's execution. Directly storing a table that contains, for each program counter, every register and its location, such as whether it has been pushed to the stack, would generate huge unwind tables. For that reason DWARF CFI is quite compact and very space efficient.

[08:25.960 --> 09:48.600] The unwind tables encoded in the CFI format are a series of opcodes, and those opcodes need to be evaluated. For stack walking, once they have been evaluated, we generate a table that says, for each instruction boundary, how to restore the value of each register. There are two main layers to the encoding. The first helps with repetitive patterns that compress well and allows for a more compact representation of the data; in some cases there is a specialized opcode that consumes, say, 1 or 2 bytes rather than 4 bytes every time. The second layer is a special set of opcodes that contain arbitrary expressions, which need to be evaluated, and that's a lot of work. The main difference between the two is that for the first layer we just need a couple of values, but for the second we would actually need to evaluate arbitrary, Turing-complete expressions. For that we would need a full-blown virtual machine implemented in BPF itself, which is not really practical.
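To make "a table that says, for each instruction boundary, how to restore the registers" concrete, here is a sketch of what one evaluated row can boil down to on x86-64. This is illustrative only; the field names and layout are ours, not the format parca-agent actually stores.

```c
#include <stdint.h>

/* Which register the CFA (canonical frame address) is computed from. */
enum cfa_rule {
    CFA_RULE_RSP_OFFSET,
    CFA_RULE_RBP_OFFSET,
    CFA_RULE_EXPRESSION,   /* the rare, arbitrary-expression case */
};

struct unwind_row {
    uint64_t pc;           /* first program counter this row applies to */
    uint8_t  rule;         /* one of enum cfa_rule */
    int16_t  cfa_offset;   /* CFA = RSP (or RBP) + cfa_offset */
    int16_t  rbp_offset;   /* caller's RBP saved at CFA + rbp_offset (usually negative) */
};

/*
 * With the registers captured at sample time and the matching row:
 *   cfa          = (RSP or RBP, per the rule) + cfa_offset
 *   return_addr  = *(cfa - 8)      // pushed by `call` on x86-64
 *   previous_rbp = *(cfa + rbp_offset)
 *   previous_rsp = cfa
 */
```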
[09:48.600 --> 10:49.480] For those who don't know how a BPF application generally flows, this is how it looks at a very high level. In user space you have the driver program, written in Go; that's our stack, and we are using libbpfgo there. It creates the maps, attaches the BPF program to a CPU-cycles perf event, and reads, parses, and evaluates the .eh_frame section of the process and of all its dynamic libraries. In the BPF program, using the PID, we fetch the unwind table, and then an unwind algorithm processes that DWARF-derived information. We will go in depth on each component, but first let's see what the algorithm looks like.

[10:49.480 --> 11:38.000] The algorithm itself is really simple. We read three important registers: the instruction pointer (RIP), the stack pointer (RSP), and of course the frame pointer (RBP). Then, while the frame count is less than or equal to the maximum stack depth we have defined, we find the unwind table row for the current program counter, add the instruction pointer to the stack trace, calculate the previous frame's stack pointer, update the registers with the values calculated for the previous frame, and continue with the next frame. That, in a nutshell, is the algorithm in BPF; a rough sketch of the loop follows.
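A minimal sketch of that loop in BPF C, purely illustrative and not the actual parca-agent unwinder. It reuses a row shaped like the earlier sketch; find_unwind_row() is a hypothetical stand-in for the table lookup (its sharded layout and binary search are covered below), and a real unwinder needs far more care around truncated or corrupt stacks.

```c
/* Illustrative sketch only, not parca-agent's unwinder. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_STACK_DEPTH 128

enum cfa_rule { CFA_RULE_RSP_OFFSET, CFA_RULE_RBP_OFFSET };

struct unwind_row {            /* same shape as the earlier sketch */
    u64 pc;
    u8  rule;
    s16 cfa_offset;
    s16 rbp_offset;
};

struct stack_trace {
    u32 len;
    u64 addrs[MAX_STACK_DEPTH];
};

/* Too big for BPF's 512-byte stack, so keep the scratch buffer in a map. */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, struct stack_trace);
} scratch SEC(".maps");

/* Hypothetical lookup into the sharded unwind tables (sketched later). */
static struct unwind_row *find_unwind_row(u64 pc);

SEC("perf_event")
int walk_user_stack(struct bpf_perf_event_data *ctx)
{
    u64 ip = ctx->regs.ip;   /* RIP (x86-64 field names) */
    u64 sp = ctx->regs.sp;   /* RSP */
    u64 bp = ctx->regs.bp;   /* RBP */

    u32 zero = 0;
    struct stack_trace *trace = bpf_map_lookup_elem(&scratch, &zero);
    if (!trace)
        return 0;
    trace->len = 0;

    for (int frame = 0; frame < MAX_STACK_DEPTH; frame++) {
        struct unwind_row *row = find_unwind_row(ip);
        if (!row)
            break;             /* no unwind info: done, or the stack is broken */

        trace->addrs[frame & (MAX_STACK_DEPTH - 1)] = ip;  /* mask keeps the verifier happy */
        trace->len = frame + 1;

        /* CFA = RSP or RBP plus the row's offset. */
        u64 cfa = (row->rule == CFA_RULE_RBP_OFFSET ? bp : sp) + row->cfa_offset;

        u64 ret_addr = 0, prev_bp = 0;
        bpf_probe_read_user(&ret_addr, sizeof(ret_addr), (void *)(cfa - 8));
        bpf_probe_read_user(&prev_bp, sizeof(prev_bp), (void *)(cfa + row->rbp_offset));

        if (bp == 0)           /* RBP == 0 marks the outermost frame (x86-64 psABI) */
            break;

        ip = ret_addr;         /* continue in the caller's frame */
        sp = cfa;
        bp = prev_bp;
    }

    /* ...hash trace->addrs, bump a per-stack counter in another map... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```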
[11:38.000 --> 12:50.520] But now the important part is how we store that unwind information and what we have done to make it scalable, so now Javier will talk about that. Cool. So we have some unwind information, in a format we will describe later, and we need to put it somewhere, right? One possibility would be to store this unwind info in the memory space of the applications we are trying to profile. We could do this, for example, with a combination of ptrace, mmap, and mlock: use ptrace to hijack the process execution and allocate a new memory segment, and then lock it, because in BPF we need to make sure that the pages we access are never going to be swapped out. But this has a big problem: we would be altering the execution flow of the program, and that is something we never want to do. One of the reasons is that this memory would live in the process, which means that accounting for it would be odd, and developers would find a new memory segment there that appeared out of the blue.

[12:50.520 --> 13:38.360] So in their metrics there would be something that changed behind their backs, which is not great. The other reason is that the lifecycle of managing this memory chunk is very complicated. For example, what do we do if our profiler dies before the processes we are introspecting? How do we clean this up? That's a lot of problems, and solving them would require some crazy engineering, which we were not planning to tackle because it would overcomplicate the project a lot. Another problem is that sharing memory is way harder, and accounting for our own overhead is also very hard. If you think about it, libc is probably linked into most of the applications on your machine, so why have the same information duplicated for every single process?

[13:38.360 --> 14:18.720] So how do we actually store this information? We use BPF maps, data structures that, as Vaishali said, can be written and read from both kernel and user space. In particular we use hash maps, which in BPF are basically a mapping of bytes to bytes where you can put arbitrary information. This memory is always locked: BPF has a flag to not lock the memory, but for the type of BPF program we use, locking is mandatory. Otherwise, as we said before, these pages could be swapped out, and for our type of BPF program page faults are forbidden.

[14:18.720 --> 14:48.340] And in the end we can reuse these mappings, because they live in this global BPF map that we control. So we can store, for example, libc's table in one particular area, and then add metadata, for every single process that has that mapping, saying where it is. This is a logical representation of the unwind information for different executable segments: imagine this is libc, MySQL, zlib, systemd, and then some space that isn't used yet.

[14:48.340 --> 15:20.320] This logical view assumes one contiguous chunk of memory, but in reality we cannot allocate an arbitrary chunk of memory in BPF. We cannot say we want 200 megabytes and it needs to be contiguous; we cannot do a malloc. So we've run some experiments, and on the systems we have tested and the kernels we want to support, we can fit up to about 250,000 unwind entries in each value of a BPF map.
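A sketch of what holding those rows in BPF maps can look like, using libbpf's map declaration syntax. The map name, the number of shards, and the row struct (the same illustrative one from before) are ours, not parca-agent's actual definitions.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define ROWS_PER_SHARD 250000          /* roughly the per-value limit mentioned in the talk */

struct unwind_row {                    /* same illustrative shape as before */
    u64 pc;
    u8  rule;
    s16 cfa_offset;
    s16 rbp_offset;
};

/* One shard: a sorted, fixed-capacity slice of unwind rows. */
struct unwind_shard {
    u32 len;
    struct unwind_row rows[ROWS_PER_SHARD];
};

/* Shards are pre-allocated and filled from user space (Go), then looked up
 * from the BPF unwinder by shard id. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);           /* how many shards to pre-allocate */
    __type(key, u32);                  /* shard id */
    __type(value, struct unwind_shard);
} unwind_shards SEC(".maps");
```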
[15:20.320 --> 15:54.280] Because we want to be able to fit larger unwind tables, we use the same solution you would use in any other data-intensive application: partitioning, or sharding. We split the unwind table into different shards, and depending on how much memory your machine has, we allocate more or fewer shards ahead of time. That results in a CPU-to-memory trade-off, because when they get full we regenerate them, but we'll talk about that later.

[15:54.280 --> 16:45.400] For example, take systemd's unwind table and suppose it's a bit bigger than 250,000 entries, so it doesn't fit in a single shard. Because it doesn't, we have to chunk it up: we store the first chunk in the first shard, and then there's a little bit that is stored in the second shard. Since we've added all these layers of indirection, there is some bookkeeping to do, and this metadata is also stored in BPF maps. A process can have many mappings; each mapping can have one or more chunks; and each chunk maps to a particular shard, because a shard might hold anywhere from one unwind entry up to 250,000. Some additional metadata tells us exactly where that information lives.

[16:45.400 --> 17:25.680] Thanks to this way of organizing the data we are able, as we said before, to share these executable mappings, and thanks to that we skip table generation for most of the executables on your box. That's why we only spend about 0.9% of CPU cycles on the processing Vaishali was talking about before. It's not the most complex process in the universe, but it does consume CPU cycles: we need to read some information from your ELF file, find the section, read it, and then parse and evaluate the DWARF information. Now we do it far fewer times: on your machine we're only going to do it once per unique executable section.

[17:25.680 --> 17:47.240] So what happens if we run out of space? What we have implemented is basically a bump allocator: we keep appending information to the shards, which logically you can see as one big chunk of memory. Once it's full, at some point we decide to wipe the whole thing and start again, but we do it in such a way that we give every single process a fair chance of being profiled.
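To make that bookkeeping concrete, here is one way the process, mapping, chunk, and shard metadata could be laid out. Every name, size, and field here is illustrative; it is not parca-agent's actual metadata, just the shape of the indirection being described.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_MAPPINGS_PER_PROCESS 64
#define MAX_CHUNKS_PER_MAPPING   16

/* A chunk: a contiguous slice of one mapping's unwind rows, stored in one shard. */
struct chunk_info {
    u64 low_pc;        /* lowest program counter covered by this chunk  */
    u64 high_pc;       /* highest program counter covered by this chunk */
    u32 shard_id;      /* which shard the rows live in                  */
    u32 low_index;     /* first row index inside that shard             */
    u32 high_index;    /* last row index inside that shard              */
};

/* An executable mapping (e.g. libc), possibly shared by many processes. */
struct mapping_info {
    u64 load_address;
    u64 begin, end;    /* virtual address range of the mapping */
    u32 chunk_count;
    struct chunk_info chunks[MAX_CHUNKS_PER_MAPPING];
};

struct process_info {
    u32 mapping_count;
    struct mapping_info mappings[MAX_MAPPINGS_PER_PROCESS];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);                  /* PID */
    __type(value, struct process_info);
} process_metadata SEC(".maps");
```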
[17:47.240 --> 18:26.480] So let's take a look at how we do this. We start with the PID of some process that happens to be running on-CPU at that moment, and the first thing we do is check whether it has unwind information. If it does, we need to find the mapping that contains the current instruction pointer for that frame. Then we need to find the chunk; we can do that with a linear search, because we know the minimum and maximum program counter covered by each chunk. Once we have the chunk, we have the shard information, and once we have the shard information, we have the unwind rows.

[18:26.480 --> 19:25.840] But the unwind information, as we said before, is basically an array, and we need to search it; it can hold up to 250,000 items. If you have done anything in BPF, you know that you basically have to build whatever you need yourself: there is no binary search that gets executed in kernel space for you, so you implement it yourself, which in general is not a big deal. Implementing binary search is not rocket science; the hard part is usually getting the off-by-ones right. The real problem is that in kernel space we have a lot of limitations, which we'll talk about later, along with how we overcame them, because this produces quite a bit of code that has to be split logically: not only are the data structures sharded, the code is sharded too. Anyway, we binary search this with up to seven iterations, or let's say eight if you're feeling pessimistic, and then we get the unwind action.

[19:25.840 --> 20:11.000] The unwind action is the operation we need to apply to the current registers to recover the previous frame's registers; we add that frame to the stack trace and continue with the next frame, as Vaishali explained before. If the stack is correct, and we have the luxury of knowing that, because having no unwind information while RBP is zero is specified by the x86-64 ABI to be the end, the bottom, of the stack, then we hash the addresses, add the hash to a map, and bump a counter. That is reasonably cheap, and I will show you some data on it later. Then every few seconds, I think it's every 10 seconds or so, we collect all this information from user space and generate the actual profiles that we send to a server.
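The search itself is a plain binary search over the rows of one chunk; the BPF-specific twist is that the loop bound has to be a compile-time constant the verifier can reason about, and every index has to be provably in range. A sketch, reusing the illustrative shard struct from before and not parca-agent's actual code:

```c
/* Find the last row in rows[left, right) whose pc is <= the sampled pc.
 * Assumes the unwind_shard/unwind_row structs sketched above. */
#define MAX_BINARY_SEARCH_STEPS 20     /* generous: 2^20 > 250,000; one chunk needs far fewer */

static __always_inline u32 find_row_index(struct unwind_shard *shard,
                                          u32 left, u32 right, u64 pc)
{
    u32 found = left;

    for (int step = 0; step < MAX_BINARY_SEARCH_STEPS; step++) {
        if (left >= right)
            break;

        u32 mid = left + (right - left) / 2;
        if (mid >= ROWS_PER_SHARD)     /* bounds check to keep the verifier happy */
            break;

        if (shard->rows[mid].pc <= pc) {
            found = mid;               /* candidate row: keep looking to the right */
            left = mid + 1;
        } else {
            right = mid;
        }
    }

    /* Callers still have to confirm rows[found] actually covers pc. */
    return found;
}
```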
[20:11.000 --> 20:56.000] As I said before, BPF has some interesting challenges for us. I think it's the closest I've been to coding like it's the 90s or the 80s, because we have very little stack space: 512 bytes, if I am not mistaken. To work around that, we use BPF maps as a sort of heap. Then there's the problem I mentioned before about memory locking: that memory can never be swapped out, and it lives in kernel space. So we want to make sure we allocate the minimum space needed, and we need to do it properly, because every environment has a different cgroup configuration and, as some talks explained yesterday, it's quite tricky to know how much memory your machine actually has available.

[20:56.000 --> 22:00.680] For program size there are two main issues. One is that there's a limit on the number of instructions you can load into the kernel. The other is the BPF verifier, the machinery that makes sure your program is safe, that it is going to finish, that you don't dereference any null pointers, and that in general you're not going to crash the kernel; it has a limit on the number of instructions it will analyze internally. This is a problem for us, because doing a full binary search already hits those limits. So we need techniques like BPF tail calls, which are similar to Lisp tail calls, and bounded loops. If you're lucky you can use the new bpf_loop helper, which is basically a function you can call that runs a callback multiple times, creating a sort of loop in BPF; but we are not that lucky, because we want to support older kernels and that's a pretty new feature.
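For completeness, this is what the tail-call mechanism looks like. The program and map names here are made up, and the actual splitting of the unwinder across programs is more involved than this; the point is only that a tail call jumps to another BPF program without returning, and each program gets its own instruction and verifier-complexity budget.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Program array: user space fills slot 0 with the "continue unwinding"
 * program after loading. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 2);
    __type(key, u32);
    __type(value, u32);
} programs SEC(".maps");

#define PROG_IDX_CONTINUE_UNWIND 0

SEC("perf_event")
int walk_some_frames(struct bpf_perf_event_data *ctx)
{
    /* ...unwind a bounded number of frames, saving progress in a map... */

    /* Not finished? Jump to the next program and keep going. */
    bpf_tail_call(ctx, &programs, PROG_IDX_CONTINUE_UNWIND);

    /* Only reached if the tail call failed (e.g. the slot is empty). */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```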
[22:00.680 --> 22:56.640] So let's switch to something else. Our user-space application is written in Go, and we are a profiler, so we want the overhead on your machine to be as small as possible. Unfortunately, many of the Go APIs aren't designed with performance in mind. I am new to Go; I didn't know it was like this, and every time I profiled our profiler and found these things, I wondered how it was possible. Go has DWARF and ELF parsing packages in the standard library, which is great, but they are not designed for performance-sensitive environments, let's say. There are also two functions, binary.Read and binary.Write, which we use all the time because we deal with bytes back and forth, and they allocate in the fast path. Anyway, we profile our profiler all the time; we have found lots of opportunities that we keep fixing, and of course there's more work to do.

[22:56.640 --> 23:13.960] One of the areas where we try to be pretty comprehensive is testing. We have thorough unit tests for most of the core functions to ensure that we don't regress, but what has helped us the most, in my opinion, is snapshot testing.

[23:13.960 --> 24:27.440] If you're not familiar with this technique, it's very simple: you generate some textual representation of your data structures, write it to disk or keep it in memory, it doesn't matter, then generate it again after you make changes to your code, and compare the two. This is how it looks in our case: we have a Git sub-repository called testdata with textual representations of the unwind tables. You don't have to understand it all, but the idea is that this covers a full function, whose program counter range starts at one address over there and ends at another, and then we have the information for every single program counter, and it tells you what to do at each one. The first row says CFA type two, which I know means RBP: you take the current RBP, add eight, and that gives you the previous frame's stack pointer. The interesting thing is that this is very easy to implement, as you can see from our very advanced setup in our Makefile: we build the binary, dump these tables to disk, and ask Git for the changes; if anything has changed, the test fails. Thanks to this we have found a lot of bugs, and it has allowed us to iterate with confidence.

[24:27.440 --> 25:31.520] One of the important things in this project has been de-risking it; it's been quite complex. When I started working on it, I had no idea about DWARF unwinding, or about unwinding without frame pointers at all. So we had to make sure all these avenues were properly covered: that the DWARF parser was properly implemented, that all the interactions with BPF were covered, and that the BPF unwinder itself worked well. For this we always tried to have a plan B at every stage of the project, and we tried to go in depth as well as in breadth. Anyway, I apparently have five minutes left. We also have a lot of automated testing, and one of the things we did was add kernel tests, which is super important especially for BPF programs, because the BPF subsystem changes a lot over time and there are features we want to make sure we don't use, because otherwise it wouldn't work on other kernels. So we have a kernel testing setup that basically runs our application on multiple kernels and reports the status. And one of the things I want to talk about is that production, as usual, brings a lot of interesting challenges.
[25:31.520 --> 26:10.080] By deploying our profiler to production we found a lot of things we didn't know about, and we were able to find some of them thanks to continuous profiling: running our own profiler on our profiler. As you know, different hardware and different configurations are the biggest sources of performance differences, as well as of incidents, in production. I want to show you two things we have found recently. One of them: we were spending almost 30% of CPU time opening files in our production environments, something that never showed up on my NVMe drive. The reason, it turns out, is that cloud disks are very slow. We have fixed this.

[26:10.080 --> 27:01.440] Another very interesting thing we fixed the other day happened when we rolled the profiler out to production and it started crashing. If you are interested, we will upload the slides, so feel free to check the pull request, because everything is open source. Basically what happened was that, for reasons, Go has a signal-based profiler, and we have it enabled for even more reasons, and only in production. So SIGPROF was interrupting our program's execution while we were trying to load the BPF program. The BPF program takes a little while to load, because the verifier has to run a bunch of algorithms to actually ensure that everything is safe, and it was getting interrupted all the time. The BPF library we use to load the program was retrying this up to five times until it basically said: I tried, this didn't work, sorry. And obviously we need the BPF program to be loaded for the profiler to work.

[27:01.440 --> 27:17.800] There are many other considerations in this project, like short-lived processes, which we haven't optimized for, though we are still pretty decent at them: if your program runs for one second, we're probably going to catch it. But if this is something you care about, feel free to message us; it's something we will optimize.

[27:17.800 --> 27:54.120] And then, yeah, this is our current format; I probably have one minute left or something like that. You don't have to understand it all, but the point is that we represent every single row with two 64-bit words, though we are making it a bit smaller, and this is how our size compares to DWARF. We are bigger, because DWARF is optimized for disk space while we are optimized for raw speed. For example, our whole table for one shard pretty much fits in L2 cache. I guess, do I have any more time, or probably not, right?
[27:54.120 --> 28:15.400] Two minutes? Oh, okay, sorry, maybe I sped up too much. So: we would need to support parsing every single DWARF CFI opcode, because otherwise we cannot make progress, and right now we cannot unwind from every single program counter, which sucks. But this is not a problem in practice.

[28:15.400 --> 28:50.000] The reason is that the most typical way to recover the previous frame's stack pointer, which is called the CFA in DWARF but that doesn't matter, is that you are told which register to apply some offset to, and that gives you the previous frame's stack pointer. We support that, but the problem is that it could be any arbitrary register, and right now we only support RBP or RSP offsets, which is what happens 99% of the time. This is something we're going to work on soon. The other problem, as Vaishali said before, is that DWARF has a VM that you need to implement, which has to be Turing-complete and can express arbitrary expressions.

[28:50.000 --> 29:20.720] It's not Turing-complete. The second level, yeah. The DWARF one? No; that's why in the Infinity project they had to add a new opcode in a draft. Okay, it's not exactly Turing-complete, it's almost there. But you still need to implement a VM that basically has a set of virtual registers. Yeah. Well, we're going to talk about those later; the first level is 100% a stack machine, but the second level, I can show you our code, it's messed up.

[29:20.720 --> 30:12.240] But anyway, we are very lucky here, and you can read more about this in this PR. There are two DWARF expressions that account for about 50% of all the expressions we have seen in most distributions; they are the expressions the dynamic linker needs, basically the ones for procedure linkage tables, or PLTs. The other good news, as I said before, is that offsets from registers other than RBP and RSP rarely occur, and all the other possibilities I haven't talked about almost never occur; we've seen them very, very few times. [Audience question.] Oh, good question. We're playing with ARM64, because GCC's AArch64 backend generates CFA expressions; that's what I was talking about, yeah. But right now we only support x86-64, and I'm also going to talk about this later, sorry. But anyway. We don't know. Oh.
[30:12.240 --> 30:21.240] Okay. Done? Okay, well, we have the minutes of buffer before the next one, right? Five minutes. Five minutes, okay. I have two more slides.

[30:21.240 --> 30:54.240] Well, anyway, we tried to make our BPF program as fast as possible. This was measured on my machine with a bunch of applications that have 90 frames or more, and even the maximum time it takes is 0.5 milliseconds, which is not terrible on my CPU, which is from late 2017. That is in large part because we have optimized everything for memory: everything is aligned properly, and we try to fit as much as possible in the CPU caches.

[30:54.240 --> 31:20.160] What about higher-level languages? There's a project that I happen to have worked on called rbperf. This is something we're going to be adding in the future for dynamic languages: you encode knowledge of the ABI of every interpreter version, and the stack walker is also implemented in BPF, but instead of collecting return addresses (there are no return addresses there that are meaningful to you), you directly extract the function names and other information from the process heap.

[31:20.160 --> 32:30.920] Our project is called Parca. There are a couple of things we're going to be doing, like a mixed unwinding mode, which as far as we know nobody else does in profiling (debuggers do it for sure, but profilers don't): the idea is that different sections are unwound using different techniques. For example, if you have a JIT like Node.js that keeps frame pointers, you unwind that part with frame pointers, but once you reach the actual code of your interpreter, which is compiled and has unwind information, we use the DWARF unwind information. ARM64 support is coming late this year. This feature is currently disabled by default, but it is stable enough that we're going to enable it in a month. Then we're going to add other runtimes such as the JVM or Ruby. And just to say that we are open source: the user space is Apache 2.0, the BPF is GPL. And yeah, all the links are here. Thanks, yeah, thank you so much.