[00:00.000 --> 00:17.320]  All right. Let's get started again. So, welcome back, everyone. The next talk is from Dylan
[00:17.320 --> 00:20.200]  about eBPF loader deep dive.
[00:20.200 --> 00:28.400]  Yes. Hello, everyone. Thank you for attending. Before we start, I have to make a quick confession.
[00:28.400 --> 00:39.920]  I'm only 80% done with my talk. No, but really, today I'm going to talk about eBPF loaders
[00:39.920 --> 00:44.880]  and while I'll do my best to go as deep as I can within the time constraints, there is
[00:44.880 --> 00:54.400]  of course so much more to go through. So, let's start with what is a loader for those
[00:54.400 --> 01:03.960]  of you who are not in the know. So, the term can be used in multiple contexts, but for
[01:03.960 --> 01:09.280]  the purpose of this talk, I will refer to a loader as any program that interacts with
[01:09.280 --> 01:18.600]  the kernel via syscalls. Or, what you more commonly see, a program that uses an eBPF loader
[01:18.600 --> 01:26.760]  library to do most of that work for it. So, examples of loaders are ip and tc, which can
[01:26.760 --> 01:32.480]  be used to load XDP programs or TC programs, for example, but also bpftool, which can
[01:32.480 --> 01:40.560]  do the same, or bpftrace, or even your own app if you decide to use a loader library
[01:40.560 --> 01:50.840]  and make something great. Loader libraries are basically abstractions over the eBPF syscalls
[01:50.840 --> 01:59.600]  to make them easier to use, kind of like libc, but for BPF, which is where the name
[01:59.600 --> 02:05.000]  for the first example comes from, libbpf. But of course, there are many others like
[02:05.000 --> 02:11.960]  Aya, which we had a talk about earlier today, or BCC, or cilium/ebpf, all examples
[02:11.960 --> 02:22.800]  of loader libraries, libraries that load BPF programs into the kernel. So, why do we need
[02:22.800 --> 02:32.320]  loaders? This is the example program we're working with today. It's
[02:32.320 --> 02:38.960]  quite simple. On the left side, I declare a map, which we will be using to
[02:38.960 --> 02:46.080]  store flow data, so packets and bytes per second, for each combination of source address
[02:46.080 --> 02:52.760]  and destination address, and on the right is a bit of logic that checks that we have
[02:52.760 --> 03:01.280]  enough data to interpret the packet as IPv4. Now, there's a handle IPv4 function
[03:01.280 --> 03:06.720]  mentioned here, but it doesn't fit on a slide, so we'll get to that later. When I compile
[03:06.720 --> 03:17.000]  my program, I get what's called an ELF, an Executable and Linkable Format file.
[03:17.000 --> 03:23.960]  With a normal C program, if I were to pull any
[03:23.960 --> 03:30.360]  random Hello World C program from the internet and compile it like I showed in the above command,
[03:30.360 --> 03:36.360]  I get out an executable, and I can use it out of the box, no need for trickery or
[03:36.360 --> 03:41.880]  things. You make it executable, you execute it, and you get Hello World on the command
[03:41.880 --> 03:50.040]  line. If you take an eBPF program, and you try to compile it with commands you
[03:50.040 --> 03:55.560]  found on the internet, you'll get a relocatable. Now, if you try to execute it, you'll get
[03:55.560 --> 04:04.200]  an error, so it doesn't work. What you need is a loader. The executable that we have is
[04:04.200 --> 04:11.440]  like pre-assembled IKEA furniture, but the relocatable we get for eBPF is two
[04:11.440 --> 04:17.320]  pieces, and perhaps if you're lucky, a guide on how to put them together. And this is the
[04:17.320 --> 04:23.720]  job of the loader: putting the pieces together and providing the guide to
[04:23.720 --> 04:34.080]  make it easy for you to use. Now, an ELF as we generated has the following structure.
[04:34.080 --> 04:38.920]  So we have this large file. We start with an ELF header, which contains information like,
[04:38.920 --> 04:48.120]  this contains eBPF, it's this many bits, this machine, and it has a bunch of sections.
[04:48.120 --> 04:58.160]  These sections have names, and each of them can have a different format.
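As a rough illustration of the ELF header fields just mentioned, here is a minimal Python sketch that decodes only what a loader would check first. It assumes a 64-bit little-endian object; the helper names and the fabricated header are made up for this example and are not part of any loader library.

```python
import struct

ET_REL = 1    # e_type value for a relocatable object file
EM_BPF = 247  # e_machine value assigned to eBPF
HEADER_FMT = "<HHIQQQIHHHHHH"  # ELF64 fields that follow the 16-byte e_ident


def parse_elf64_header(data: bytes) -> dict:
    """Decode the few ELF64 header fields relevant to a BPF loader."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    fields = struct.unpack_from(HEADER_FMT, data, 16)
    e_type, e_machine, e_shnum = fields[0], fields[1], fields[11]
    return {
        "bits": 64,
        "relocatable": e_type == ET_REL,
        "is_bpf": e_machine == EM_BPF,
        "section_count": e_shnum,
    }


def make_fake_header(section_count: int) -> bytes:
    """Hypothetical helper: fabricate a header like the one clang emits for BPF."""
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # class, endianness, version, padding
    rest = struct.pack(HEADER_FMT, ET_REL, EM_BPF, 1, 0, 0, 64, 0,
                       64, 0, 0, 64, section_count, 1)
    return ident + rest


header = parse_elf64_header(make_fake_header(9))
print(header)
```

A real loader would go on to walk the section header table; this sketch stops at the header on purpose.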
[04:58.160 --> 05:03.320]  So the string table has a bunch of strings. Our programs have a bunch of program code
[05:03.320 --> 05:10.680]  in them, et cetera, et cetera. But they also refer to each other. So you have all the arrows,
[05:10.680 --> 05:18.000]  and they point to each other, and they link to each other. But in this form,
[05:18.000 --> 05:27.360]  it's not that usable, because the kernel only understands syscalls and eBPF programs; it
[05:27.360 --> 05:33.680]  doesn't know how to handle such an ELF. So the BPF syscall looks
[05:33.680 --> 05:38.640]  like this: if you pull up the man page, we have a bunch of commands,
[05:38.640 --> 05:45.280]  each command has attributes, and in the kernel they're defined in a very big union,
[05:45.280 --> 05:51.680]  and every command has its own set of attributes that you can use to
[05:51.680 --> 05:57.000]  ask the kernel to do something for you. I can't go over all of them because of
[05:57.000 --> 06:01.840]  time constraints, but the most important ones are loading your program, creating a map,
[06:01.840 --> 06:08.360]  loading BTF, and of course, interacting with that map, attaching your program somewhere, et cetera.
[06:08.360 --> 06:14.920]  There are quite a few commands, each of them does slightly different things, and
[06:14.920 --> 06:21.600]  the loaders, in most cases, provide functions that either call multiple of these to do
[06:21.600 --> 06:26.200]  a big, high-level operation, or they provide small wrappers for
[06:26.200 --> 06:33.640]  you to do your low-level operations yourself. There are also links, which is a newer concept,
[06:33.640 --> 06:40.080]  and you can pin your objects to the file system, so they live longer than your program.
[06:40.080 --> 06:45.920]  And we have a few other miscellaneous functions for doing measurement statistics, iteration,
[06:45.920 --> 06:51.000]  et cetera, but I can't go into that in this talk, unfortunately.
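To make the idea of "one high-level call, several syscall commands" concrete, here is a small Python sketch. The command numbers are the values of the corresponding `BPF_*` constants in the kernel's `enum bpf_cmd`; the `plan_load` function and the object names passed to it are purely hypothetical and only model the order in which a loader might issue commands, without calling the kernel.

```python
from enum import IntEnum


class BpfCmd(IntEnum):
    """A small subset of the commands accepted by the bpf() syscall."""
    MAP_CREATE = 0
    MAP_LOOKUP_ELEM = 1
    MAP_UPDATE_ELEM = 2
    PROG_LOAD = 5
    OBJ_PIN = 6
    BTF_LOAD = 18
    LINK_CREATE = 28


def plan_load(map_names, prog_names, pin=False):
    """Hypothetical high-level operation: list the low-level commands it implies."""
    plan = [(BpfCmd.BTF_LOAD, "type information")]
    plan += [(BpfCmd.MAP_CREATE, name) for name in map_names]
    plan += [(BpfCmd.PROG_LOAD, name) for name in prog_names]
    if pin:
        # Pinning keeps the objects alive after the loader process exits.
        plan += [(BpfCmd.OBJ_PIN, name) for name in map_names + prog_names]
    return plan


for cmd, target in plan_load(["flow_map"], ["xdp_prog"], pin=True):
    print(f"bpf(cmd={int(cmd):2d} {cmd.name}) -> {target}")
```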
[06:51.000 --> 06:59.760]  So back to our program. When we write a program, we have a macro here that says SEC. That's
[06:59.760 --> 07:07.560]  quite unique to BPF. Every BPF program needs to have this section tag there, and this tells
[07:07.560 --> 07:13.920]  the compiler to put all of the program code in the specific section that we named. And
[07:13.920 --> 07:20.760]  the name of this section also follows a convention, which can be used by the loader to inform
[07:20.760 --> 07:28.000]  it that this is an XDP program, so it should be interpreted as such.
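The section-name convention can be sketched as a simple lookup. The table below is a deliberately small, simplified subset of the names that loader libraries recognize, and the matching rule (exact name, or name followed by a slash) is an approximation for illustration; `guess_prog_type` is a made-up helper, not a libbpf function. The numeric values are those of the kernel's `BPF_PROG_TYPE_*` constants.

```python
from enum import IntEnum


class ProgType(IntEnum):
    """A few kernel program types (values of BPF_PROG_TYPE_*)."""
    SOCKET_FILTER = 1
    KPROBE = 2
    SCHED_CLS = 3
    TRACEPOINT = 5
    XDP = 6


# Simplified subset of section-name conventions; real loaders know many more.
SECTION_CONVENTIONS = [
    ("xdp", ProgType.XDP),
    ("tc", ProgType.SCHED_CLS),
    ("classifier", ProgType.SCHED_CLS),
    ("kprobe", ProgType.KPROBE),
    ("tracepoint", ProgType.TRACEPOINT),
    ("socket", ProgType.SOCKET_FILTER),
]


def guess_prog_type(section: str):
    """Return the program type implied by an ELF section name, or None."""
    for prefix, prog_type in SECTION_CONVENTIONS:
        if section == prefix or section.startswith(prefix + "/"):
            return prog_type
    return None


for name in ("xdp", "kprobe/do_sys_open", ".text"):
    print(name, "->", guess_prog_type(name))
```

Sections that do not match, such as `.text`, are not entry points by themselves, which becomes relevant for the function calls discussed next.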
[07:28.000 --> 07:34.800]  Now we can dump this section. If we dump this section with llvm-objdump, then we
[07:34.800 --> 07:43.440]  get out this, which is hard to read if it's not annotated, but it's a bunch of BPF instructions
[07:43.440 --> 07:51.520]  starting with the opcode, so the actual opcode that tells it if it's add, subtract, whatever.
[07:51.520 --> 07:57.600]  Then the source and destination registers that these opcodes act on, offsets for jumps, which
[07:57.600 --> 08:05.600]  are relative, and immediate data to, say, load some data into a register, like
[08:05.600 --> 08:11.800]  a constant value. And sometimes we can use two of them together to represent a 64-bit
[08:11.800 --> 08:18.640]  number, but we'll get to that later. We can also ask objdump to disassemble this for
[08:18.640 --> 08:25.840]  us, and we'll get the disassembled BPF program. So the bytes are on the left side and the actual
[08:25.840 --> 08:31.200]  program on the right side, but you'll notice that there's a call here. So one thing that
[08:31.200 --> 08:36.640]  I didn't tell you before is that the handle IPv4 function that we have is marked in such
[08:36.640 --> 08:44.560]  a way that it won't be inlined, so it's a separate function, and BPF can do BPF-to-BPF function
[08:44.560 --> 08:51.760]  calls, and if you do that, the compiler puts out this instruction, a function call instruction, but
[08:51.760 --> 08:58.200]  with minus one. Where do we call to? Well, currently nowhere, because we haven't assembled
[08:58.200 --> 09:07.200]  the pieces of our furniture yet. What also happens is that the compiler will
[09:07.200 --> 09:13.400]  emit relocation information, which we can, again, visualize, and it says, all right,
[09:13.400 --> 09:19.320]  we have a certain instruction at a given offset, and you should put the relative
[09:19.320 --> 09:26.480]  address of this other function in there. Then we can go to the symbol table and we can look
[09:26.480 --> 09:34.880]  up this name, and it says, oh, that function lives in the .text section, where, for BPF programs,
[09:34.880 --> 09:42.560]  all of the functions used in function-to-function calls live together. So we have
[09:42.560 --> 09:51.080]  these two separate pieces of the puzzle, and they refer to each other. But the kernel
[09:51.080 --> 09:57.280]  only has one pointer for our instructions. It expects that every program we give it is
[09:57.280 --> 10:03.600]  one contiguous piece of memory with instructions, and it all should work. So we have some work
[10:03.600 --> 10:08.520]  to do. We, or rather the loader, need to figure out how to
[10:08.520 --> 10:13.480]  lay out our programs, so piece all of the puzzle pieces together, find all of these references,
[10:13.480 --> 10:19.880]  and then put in the correct offset. All of this happens in user space before we even
[10:19.880 --> 10:27.880]  go to the kernel. Now, the second fun thing is that we can define our map. So again, we have
[10:27.880 --> 10:37.560]  the SEC part, .maps, which puts it in the .maps section. And this is the function, this
[10:37.560 --> 10:43.680]  is the part that I have been hiding from you until now. It's also quite simple in terms
[10:43.680 --> 10:55.160]  of BPF programs. We get an IPv4 header, check that we can use it, and we get
[10:55.160 --> 10:59.880]  a value from the map, and if it doesn't exist, we write a new one, and we increment the values
[10:59.880 --> 11:06.480]  every time this happens to account for some information. So keep this program in mind,
[11:06.480 --> 11:11.080]  and then if we go look at the instructions again, the disassembled version this time,
[11:11.080 --> 11:17.320]  we see that we have two of these long lines which are zero at the end. So these are the
[11:17.320 --> 11:24.240]  64-bit immediate values that I was talking about, and they are just long to
[11:24.240 --> 11:32.440]  pre-allocate room for actual memory addresses later, instead of relative jumps. But they
[11:32.440 --> 11:38.040]  are zero, and these should be references to our map, and later on these will become pointers
[11:38.040 --> 11:45.320]  when the kernel has its way with it. And in our case, we again need to figure out what
[11:45.320 --> 11:51.480]  to put in here. So same routine. We have relocation information. The relocation information points
[11:51.480 --> 11:59.080]  to the instructions that we had. It says you need to plug in a flow map here. We go to
[11:59.080 --> 12:06.960]  the symbol table, and there it says we have a .maps section, and there lives a flow map.
[12:06.960 --> 12:13.320]  In this case, we handle it slightly differently, so we then have to go load this flow map first,
[12:13.320 --> 12:18.040]  get a file descriptor, which is our unique identifier for the map, and we need to actually
[12:18.040 --> 12:27.440]  put that file descriptor into these empty values, so the kernel knows where to go.
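The map relocation step described here can be sketched in a few lines of Python. The sketch decodes the 8-byte BPF instruction layout (opcode, registers, 16-bit offset, 32-bit immediate) and writes a file descriptor into a 64-bit immediate load, marking it as a map reference via the source register field. The function names and the file descriptor value 7 are made up for the example; a real loader would get the descriptor back from the map create command.

```python
import struct

INSN_FMT = "<BBhi"     # opcode, dst/src registers, 16-bit offset, 32-bit immediate
INSN_SIZE = 8
OP_LD_IMM64 = 0x18     # 64-bit immediate load, occupies two instruction slots
BPF_PSEUDO_MAP_FD = 1  # src register value meaning "the immediate is a map fd"


def decode(code, index: int) -> dict:
    """Split one 8-byte BPF instruction into its fields."""
    opcode, regs, off, imm = struct.unpack_from(INSN_FMT, code, index * INSN_SIZE)
    return {"opcode": opcode, "dst": regs & 0x0F, "src": regs >> 4,
            "off": off, "imm": imm}


def patch_map_fd(code: bytearray, reloc_offset: int, map_fd: int) -> None:
    """Apply a map relocation: reloc_offset is a byte offset into the section."""
    index = reloc_offset // INSN_SIZE
    insn = decode(code, index)
    if insn["opcode"] != OP_LD_IMM64:
        raise ValueError("map relocation must point at a 64-bit immediate load")
    regs = (BPF_PSEUDO_MAP_FD << 4) | insn["dst"]
    # The first slot carries the fd; the second slot (upper 32 bits) stays zero.
    struct.pack_into(INSN_FMT, code, index * INSN_SIZE,
                     OP_LD_IMM64, regs, 0, map_fd)


# "r1 = 0 ll" as emitted by the compiler: two slots, immediates still zero.
code = bytearray(bytes([0x18, 0x01, 0, 0, 0, 0, 0, 0]) + bytes(8))
patch_map_fd(code, reloc_offset=0, map_fd=7)
print(decode(code, 0))
```

After loading, the kernel replaces the descriptor with the actual map address, which is why the instruction reserves 64 bits.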
[12:27.440 --> 12:35.880]  Creating maps is also a command, so we have the map create command, and it takes
[12:35.880 --> 12:40.960]  these arguments. I cut out a bit of the later ones, but these are the essentials: the type, how
[12:40.960 --> 12:49.120]  big my values are, et cetera. Give it a nice name. And there are two ways to define these.
[12:49.120 --> 12:56.600]  We have the new way of doing it, which are called BTF maps, colloquially, on the left.
[12:56.600 --> 13:03.000]  But there's also the old way of doing it, using a bpf_map_def definition, on the right. Don't
[13:03.000 --> 13:11.440]  use it. If you go into libbpf, in the part of libbpf which is used during eBPF program construction,
[13:11.440 --> 13:17.240]  it will warn you that you shouldn't use it and should go for the left side. But the odd thing
[13:17.240 --> 13:22.280]  is that if you use these newer BTF maps on the left, and you go look at what's actually
[13:22.280 --> 13:27.520]  then written to your .maps section, it's all zero. There's no information.
[13:27.520 --> 13:32.640]  It still allocates room for your map, but it will all be zero.
[13:32.640 --> 13:41.800]  All information, instead, is in the type information of the flow map.
[13:41.800 --> 13:50.440]  So we have to get into what BTF is. BTF stands for BPF Type Format. It's derived from the
[13:50.440 --> 13:58.920]  actual DWARF debug symbols that already are used for normal C programs, but it's a way more compact,
[13:58.920 --> 14:04.400]  smaller version of it, which only really is concerned with type information and not
[14:04.400 --> 14:10.560]  with where and at which moment a variable lives. And it is used because eBPF itself
[14:10.560 --> 14:18.200]  is just too limiting, and we want to do more, especially in the verifier. So we have, for
[14:18.200 --> 14:25.040]  example, features like spinlocks, which should only be used on maps that have spinlock values
[14:25.040 --> 14:33.680]  in them. Or we have callback functions, so we can define these BPF functions, but instead
[14:33.680 --> 14:39.760]  give them to a helper function. But this helper then needs to know that it has the correct number
[14:39.760 --> 14:44.600]  of arguments and the correct types. So all of this type information we can give to the
[14:44.600 --> 14:50.160]  kernel. And that's why, especially if you want to use these new fancy features,
[14:50.160 --> 14:56.520]  it's important to use the BTF information. It also allows for flexible map arguments.
[14:56.520 --> 15:01.800]  So for example, if I go back, we have the definition. And one of the things you'll notice
[15:01.800 --> 15:09.520]  is that we have pinning as an attribute here. But you will not find it in the syscall attributes.
[15:09.520 --> 15:16.800]  This is purely something that we
[15:16.800 --> 15:26.160]  communicate to the loader library, not just libbpf, but that's the name that it has currently.
[15:26.160 --> 15:30.640]  And we can do a lot of different cool things with that. It also provides debug information
[15:30.640 --> 15:36.720]  for us. So if we go look at loaded programs, they will be annotated with the line information
[15:36.720 --> 15:42.800]  and from which file it came. And perhaps one of the coolest features is Compile Once,
[15:42.800 --> 15:52.240]  Run Everywhere, which allows the loader and/or the kernel to modify our program slightly,
[15:52.240 --> 15:59.000]  so it will run on multiple versions of the kernel, even if the internals have changed.
[15:59.000 --> 16:05.520]  So if we dump this BTF that we have from our example program, it looks like this. Features
[16:05.520 --> 16:12.800]  to note are the numbers on the left in square brackets. Those are the type IDs. Beside each is its
[16:12.800 --> 16:19.080]  actual type. So we have pointers, integers, arrays. You can basically represent every C
[16:19.080 --> 16:26.240]  type in BTF this way. There's an optional name and then there's a lot of information
[16:26.240 --> 16:31.880]  about the specific type. And they refer to each other. So you'll notice a lot of type
[16:31.880 --> 16:39.360]  ID equals something else. So you can also visualize it by nesting it. I've done this manually.
[16:39.360 --> 16:45.440]  By the way, there's no command for this, but this is how you can do it yourself. So we
[16:45.440 --> 16:50.840]  have, for example, a .maps section with a flow map in it. And you can see that we have the
[16:50.840 --> 16:56.000]  type, the key, the value. And we have this very detailed description of exactly how it's
[16:56.000 --> 17:03.320]  structured, at which offsets which things live, and names for them, which are used to check
[17:03.320 --> 17:10.800]  all of these things. And also, to create the map, a loader will use this to infer the actual
[17:10.800 --> 17:17.400]  value and key sizes to give to the kernel. This BTF lives in
[17:17.400 --> 17:23.880]  the .BTF section. And it's sort of structured like this. So we have this header, then types
[17:23.880 --> 17:29.840]  and a lot of strings. And each type starts with the same three fields. So we have a name
[17:29.840 --> 17:36.120]  offset, so an offset into the strings. We have information and a size or type depending
[17:36.120 --> 17:42.640]  on what the information says. This translates into the name and the type of the BTF information
[17:42.640 --> 17:47.440]  and then the last part is specific to that type. So encoding for ints or a list of fields
[17:47.440 --> 17:57.920]  for a structure, et cetera. We also have the .BTF.ext section, the extended version of it.
[17:57.920 --> 18:04.360]  And this contains function information, line information, and optionally CO-RE relocations.
[18:04.360 --> 18:11.720]  So the line information contains a bunch of lines. So it will annotate this instruction
[18:11.720 --> 18:18.040]  as part of this line of your original source program, and function information labels every one of
[18:18.040 --> 18:27.400]  these BPF functions that you have defined. Loading the BTF itself is quite simple. You
[18:27.400 --> 18:33.120]  use the BTF load command in the BPF syscall and give it the blob that
[18:33.120 --> 18:39.720]  we have. It needs to be slightly changed, especially the sizes in the data section
[18:39.720 --> 18:49.280]  types, but explaining exactly why is too much detail. You also give it a bunch of logging information.
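The header, types, strings layout of the BTF blob described a moment ago can be illustrated with a tiny hand-built example. This Python sketch fabricates a blob holding a single 32-bit signed `int` type and then parses the header and the three common fields of the first type. The helper names are made up for illustration, the sketch assumes little-endian data, and it ignores everything a real BTF parser must handle (multiple types, kind-specific trailing data beyond this one case).

```python
import struct

BTF_MAGIC = 0xEB9F
BTF_KIND_INT = 1
HEADER_FMT = "<HBBIIIII"  # magic, version, flags, hdr_len, type_off, type_len, str_off, str_len


def build_blob() -> bytes:
    """Fabricate a minimal BTF blob containing one type: a signed 32-bit int."""
    strings = b"\x00int\x00"                 # string section starts with an empty string
    info = BTF_KIND_INT << 24                # kind lives in bits 24-28 of 'info'
    int_extra = (1 << 24) | 32               # kind-specific word: signed encoding, 32 bits
    types = struct.pack("<IIII", 1, info, 4, int_extra)  # name_off=1, size=4
    header = struct.pack(HEADER_FMT, BTF_MAGIC, 1, 0, 24,
                         0, len(types), len(types), len(strings))
    return header + types + strings


def parse_first_type(blob: bytes) -> dict:
    """Read the header, then the three fields every type starts with."""
    magic, _ver, _flags, hdr_len, type_off, type_len, str_off, str_len = \
        struct.unpack_from(HEADER_FMT, blob, 0)
    if magic != BTF_MAGIC:
        raise ValueError("not a BTF blob")
    # Section offsets are relative to the end of the header.
    types = blob[hdr_len + type_off: hdr_len + type_off + type_len]
    strings = blob[hdr_len + str_off: hdr_len + str_off + str_len]
    name_off, info, size_or_type = struct.unpack_from("<III", types, 0)
    name = strings[name_off: strings.index(b"\x00", name_off)].decode()
    return {"id": 1,  # type IDs start at 1; ID 0 is reserved for void
            "name": name,
            "kind": (info >> 24) & 0x1F,
            "vlen": info & 0xFFFF,
            "size": size_or_type}


print(parse_first_type(build_blob()))
```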
[18:49.280 --> 18:53.400]  Once you have it, we get a file descriptor of the BTF object, and of course we have
[18:53.400 --> 19:00.480]  all of these type IDs. So when we are loading our map again, there are these fields where
[19:00.480 --> 19:05.160]  you can say, this is my BTF object, which contains all of my types, and this is the
[19:05.160 --> 19:11.680]  type of my key, this is the type of my value, that's how we wire everything together.
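How those fields sit next to the basic map arguments can be sketched by packing the start of the map create attributes by hand. The field order below follows the map create part of the kernel's `union bpf_attr` as I understand it; all concrete values (key and value sizes, entry count, file descriptor, type IDs, map name) are hypothetical placeholders, and the sketch only builds the bytes without issuing the syscall.

```python
import struct

BPF_MAP_TYPE_HASH = 1
# map_type, key_size, value_size, max_entries, map_flags, inner_map_fd, numa_node,
# map_name[16], map_ifindex, btf_fd, btf_key_type_id, btf_value_type_id,
# btf_vmlinux_value_type_id
MAP_CREATE_FMT = "<7I16s5I"


def map_create_attr(map_type, key_size, value_size, max_entries,
                    name, btf_fd, key_type_id, value_type_id) -> bytes:
    """Pack the leading fields of the BPF_MAP_CREATE attributes."""
    if len(name) > 15:
        raise ValueError("map names are limited to 15 characters plus NUL")
    return struct.pack(MAP_CREATE_FMT,
                       map_type, key_size, value_size, max_entries,
                       0, 0, 0,                  # flags, inner map fd, NUMA node unused
                       name.encode(),            # padded with NUL bytes to 16
                       0,                        # map_ifindex unused
                       btf_fd, key_type_id, value_type_id,
                       0)                        # vmlinux value type unused


# Hypothetical values: 8-byte key (two IPv4 addresses), 16-byte value (two counters),
# BTF object on fd 3, key described by type ID 5, value by type ID 9.
attr = map_create_attr(BPF_MAP_TYPE_HASH, 8, 16, 1024, "flow_map", 3, 5, 9)
print(len(attr), attr.hex())
```

The key and value sizes here are exactly the numbers a loader would infer from the BTF description of the map rather than read from the zeroed .maps section.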
[19:11.680 --> 19:16.600]  The same goes for programs. So we give it the program, the BTF the program uses,
[19:16.600 --> 19:24.320]  and then we give it these func information and line information blobs, which
[19:24.320 --> 19:30.240]  will make sure that everything is nice and annotated in the kernel. So we end up with
[19:30.240 --> 19:36.160]  a sort of hierarchy that looks like this. So we start by loading the BTF, we can then
[19:36.160 --> 19:40.760]  load our maps, which use it, and then once we have our map file descriptors, we can load
[19:40.760 --> 19:46.920]  our programs, after we have of course assembled all of the pieces of our program. And that
[19:46.920 --> 19:52.080]  can all happen within one call to a loader library.
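The "assembling the pieces" step for BPF-to-BPF calls, covered earlier in the talk, might look roughly like the following Python sketch. It appends the functions from .text behind the entry program so the kernel gets one contiguous instruction array, then rewrites each call immediate to a relative offset counted from the instruction after the call. The `link` function and its simplified relocation format (a pair of instruction indices) are assumptions made for this example; real relocations go through the ELF symbol table.

```python
import struct

INSN_FMT = "<BBhi"   # opcode, dst/src registers, 16-bit offset, 32-bit immediate
INSN_SIZE = 8
OP_CALL = 0x85
BPF_PSEUDO_CALL = 1  # src register value marking a BPF-to-BPF call


def link(entry: bytes, text: bytes, call_relocs) -> bytes:
    """Concatenate entry program and .text, then fix up call targets.

    call_relocs: list of (call instruction index in entry,
                          callee instruction index within .text).
    """
    out = bytearray(entry + text)
    text_start = len(entry) // INSN_SIZE
    for call_index, callee_index in call_relocs:
        opcode, regs, off, _imm = struct.unpack_from(
            INSN_FMT, out, call_index * INSN_SIZE)
        if opcode != OP_CALL or regs >> 4 != BPF_PSEUDO_CALL:
            raise ValueError("relocation does not point at a BPF-to-BPF call")
        # The target is relative to the instruction following the call.
        relative = (text_start + callee_index) - (call_index + 1)
        struct.pack_into(INSN_FMT, out, call_index * INSN_SIZE,
                         opcode, regs, off, relative)
    return bytes(out)


entry = bytes([0x85, 0x10, 0, 0, 0xFF, 0xFF, 0xFF, 0xFF,   # call -1 (unresolved)
               0x95, 0, 0, 0, 0, 0, 0, 0])                 # exit
text = bytes([0xB7, 0, 0, 0, 2, 0, 0, 0,                   # r0 = 2
              0x95, 0, 0, 0, 0, 0, 0, 0])                  # exit
linked = link(entry, text, [(0, 0)])
print(struct.unpack_from(INSN_FMT, linked, 0))
```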
[19:52.080 --> 19:59.000]  Now for the last part, CO-RE, which I touched on a little bit earlier. Like I said: Compile
[19:59.000 --> 20:05.320]  Once, Run Everywhere. There's this really good blog post, which I encourage everyone
[20:05.320 --> 20:12.960]  who wants to use the feature to read, which contains information on how to actually use it. But
[20:12.960 --> 20:20.840]  what it boils down to is that in libbpf there are these macros to make your life easier,
[20:20.840 --> 20:26.880]  and they boil down to a bunch of compiler built-ins. And they're basically
[20:26.880 --> 20:34.720]  questions to ask the loader or the kernel just before, or while, loading
[20:34.720 --> 20:39.800]  the program. Like, what is the offset of this field? Does this
[20:39.800 --> 20:48.000]  type even exist? Do I have this enum value? I have this small program that
[20:48.000 --> 20:54.640]  captures the cookie value of a socket when it closes.
[20:54.640 --> 21:02.760]  Not useful at all, but it does help us to illustrate the point. When this macro resolves,
[21:02.760 --> 21:09.960]  it looks like this, and the important part to notice here is that we do a helper call,
[21:09.960 --> 21:15.680]  and where the arrow starts, we have the socket pointer, and we add
[21:15.680 --> 21:23.960]  an offset, which we get from this built-in function. This offset then gets
[21:23.960 --> 21:29.080]  encoded in the 104 that we see here. This is the offset that we add to the pointer in
[21:29.080 --> 21:35.400]  the actual code. But the compiler will also emit this relocation, which will tell us that
[21:35.400 --> 21:41.600]  this might be a piece of the code that we want to tweak, depending on if the structure changes.
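What a loader might do with such a relocation can be sketched as follows. This Python example uses a heavily simplified model: struct layouts are plain dictionaries keyed by field name instead of BTF types, and the field path is given by name. The compile-time offset 104 is the one from the slide; the target offset 112, the field names, and the function names are assumptions chosen only to show the patching step.

```python
import struct

INSN_FMT = "<BBhi"  # opcode, dst/src registers, 16-bit offset, 32-bit immediate
INSN_SIZE = 8


def resolve_offset(layouts: dict, root: str, path: list) -> int:
    """Sum field offsets along a path of field names, starting at struct 'root'."""
    offset, current = 0, root
    for field in path:
        # A KeyError here means the field does not exist in this layout.
        field_offset, field_type = layouts[current][field]
        offset += field_offset
        current = field_type
    return offset


def apply_field_reloc(code: bytearray, insn_index: int,
                      local: dict, target: dict, root: str, path: list) -> int:
    """Replace the compile-time field offset in an instruction's immediate."""
    opcode, regs, off, imm = struct.unpack_from(
        INSN_FMT, code, insn_index * INSN_SIZE)
    if imm != resolve_offset(local, root, path):
        raise ValueError("instruction does not hold the compile-time offset")
    new_imm = resolve_offset(target, root, path)
    struct.pack_into(INSN_FMT, code, insn_index * INSN_SIZE,
                     opcode, regs, off, new_imm)
    return new_imm


# Layout known at compile time versus a hypothetical running kernel.
local = {"sock": {"__sk_common": (0, "sock_common")},
         "sock_common": {"skc_cookie": (104, None)}}
target = {"sock": {"__sk_common": (0, "sock_common")},
          "sock_common": {"skc_cookie": (112, None)}}

code = bytearray(bytes([0x07, 0x01, 0, 0, 104, 0, 0, 0]))  # r1 += 104
apply_field_reloc(code, 0, local, target, "sock", ["__sk_common", "skc_cookie"])
print(struct.unpack_from(INSN_FMT, code, 0))
```

Real relocations identify the fields by index within BTF types rather than by name, as the decoded example that follows shows.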
[21:41.600 --> 21:46.720]  So if we again look at this relocation, there is unfortunately, as far as I'm aware, not
[21:46.720 --> 21:54.600]  a good command line tool to visualize or to decode this, so I decoded one manually. It
[21:54.600 --> 21:58.160]  looks like this, so it says, okay, instruction number two, which is the instruction that
[21:58.160 --> 22:04.720]  we were at. Instruction number two refers to type ID 18, and it has this
[22:04.720 --> 22:09.000]  accessor string. And this accessor string is a bunch of numbers, which are basically
[22:09.000 --> 22:15.440]  the indices of the fields that it tries to access. So the socket, then the second
[22:15.440 --> 22:24.880]  field would be __sk_common, and then the cookie, and so forth. Now, this type information that
[22:24.880 --> 22:30.680]  we knew when we created the program is included in the .BTF section. But the kernel also has
[22:30.680 --> 22:40.560]  BTF for all of the types it has. So we can do a comparison and see that, for example,
[22:40.560 --> 22:48.040]  a field changed position, or we can't find a certain field. And our loader can
[22:48.040 --> 22:55.200]  see this, resolve it, and then patch our code, changing this offset value right before we actually
[22:55.200 --> 23:00.640]  load it, which makes it possible to use it on so many different kernel versions. I'm
[23:00.640 --> 23:08.360]  out of time. That's everything I can offer you for now. Are there any questions? And
[23:08.360 --> 23:24.120]  thank you. Thank you. Any questions? There's one in the back. All right, okay. It's difficult
[23:24.120 --> 23:38.400]  now. Can you pass this on? Hey, thanks for the great talk. So I haven't
[23:38.400 --> 23:44.640]  dealt that much with BTF, but since we have those binaries that we cannot really launch
[23:44.640 --> 23:50.600]  because we have to load them in another ELF, right? At least as I understand. Would it make
[23:50.600 --> 23:58.760]  any sense to make either a loader that would just work out of the box for those binaries
[23:58.760 --> 24:09.120]  or use the binfmt_misc feature from the kernel to be able to load those BPF ELF files and
[24:09.120 --> 24:15.120]  use some kind of generic or general interface and just load them and run them?
[24:15.120 --> 24:24.640]  Yeah, I think it does make sense to some extent. For example, the ip tool doesn't have
[24:24.640 --> 24:31.800]  anything additional, so it takes this ELF and just loads it as best as it can. And there
[24:31.800 --> 24:38.120]  is probably some way to use the interpreter in the ELF itself, just like we do for dynamically
[24:38.120 --> 24:43.840]  loaded executables. As far as I know, no one has tried it so far, but I think it could
[24:43.840 --> 24:50.600]  work at least for a limited use case where you would only load
[24:50.600 --> 24:56.800]  something and pin it and then allow some other application to actually work with it afterwards.
[24:56.800 --> 25:00.160]  Thank you. All right, thanks. We are out of time.
[25:00.160 --> 25:15.040]  If you have more questions, you can find Dylan in the hallway. And yeah, thanks again.