[00:00.000 --> 00:22.040] All right, so let's get started again. The next talk is by Anton, about optimizing the BPF hash map and friends.

[00:22.040 --> 00:35.920] Hello, I'm Anton. I work on the datapath team at Isovalent, and this is a talk about how simple it is to optimize the BPF hash map.

[00:35.920 --> 01:04.120] About a year ago, a little less, Andrii Nakryiko suggested trying new hash functions for the BPF hash map and the BPF stack trace map. In Cilium and in Tetragon we use hash maps extensively, so for us this is a big deal, and I decided to give it a try.

[01:04.120 --> 01:24.960] I will briefly go through a benchmarking 101, then we will take a look at different hash functions, and then we will see benchmarks for the maps that use the old and the new hash functions.

[01:24.960 --> 01:50.160] The first thing to do when you benchmark is to reduce noise, because modern CPUs do everything to ruin your benchmarking: they run at different frequencies, hyperthreading gets in the way. The best case is when you can benchmark inside the kernel, because there you can disable preemption and interrupts.

[01:50.160 --> 02:15.160] The benchmark itself is simple. If you want to benchmark some function, you first read some time source or clock source, then you execute the function, maybe in a loop, then you read the time once again, and you divide the elapsed time by the number of loop iterations to get your result.

[02:15.160 --> 02:40.760] In some cases you can't execute the function in a loop. For example, if it's not an abstract hash function but part of the kernel, then n is obviously equal to 1, and you need to take into account the time it takes to get the time itself. One obvious way to do that is to benchmark an empty loop, which tells you roughly how long the get-time function takes.

[02:40.760 --> 03:19.120] Typically people benchmark with something like the rdtsc instruction, and if you don't reduce noise, rdtsc is pretty unstable. Here my CPU is obviously running at two different frequencies, and if the function I want to benchmark runs in 30 or 40 cycles, the deviation is bigger than the function itself, so I can't rely on it.
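As a concrete illustration of that scheme, here is a minimal userspace sketch: read the timestamp counter, run the function in a loop, read it again, and subtract the cost of an empty measurement. The names and counts in it (`hash_under_test`, the iteration counts) are illustrative, not from the talk.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() on gcc/clang */

/* Illustrative stand-in for the function under test. */
static uint32_t hash_under_test(uint32_t x)
{
	return x * 2654435761u; /* placeholder multiplicative hash */
}

/* Cost of the time source itself: time an "empty" measurement,
 * i.e. two back-to-back reads with nothing in between. */
static uint64_t timing_overhead(void)
{
	uint64_t best = UINT64_MAX;
	for (int i = 0; i < 1000; i++) {
		uint64_t t0 = __rdtsc();
		uint64_t t1 = __rdtsc();
		if (t1 - t0 < best)
			best = t1 - t0;
	}
	return best;
}

int main(void)
{
	enum { N = 1000000 };
	volatile uint32_t sink = 0; /* keeps the loop from being optimized away */
	uint64_t overhead = timing_overhead();

	uint64_t t0 = __rdtsc();
	for (int i = 0; i < N; i++)
		sink = hash_under_test(sink + i);
	uint64_t t1 = __rdtsc();

	printf("%.2f cycles/op (timing overhead: %llu cycles)\n",
	       (double)(t1 - t0 - overhead) / N,
	       (unsigned long long)overhead);
	return 0;
}
```

Pinning the process to one core and disabling frequency scaling, the noise reduction described above, is what actually makes the numbers from such a loop stable.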
[03:19.120 --> 04:03.600] If we reduce the noise, as I said, the results become much more stable. It may look scary here, but it's actually 37 plus or minus one cycle, so even for very small functions this makes benchmarking reliable. But in this form rdtsc still doesn't work, because it's not a serializing instruction: the code you insert between the two reads can be executed in the middle of them, or even after them, and this can differ from execution to execution. So we need some way to serialize it.

[04:03.600 --> 04:19.820] Luckily there are known ways to do this. There is a white paper, written maybe ten years ago, back in the Itanium days, and it's still valid, with some changes from architecture to architecture.

[04:19.820 --> 04:57.960] With serialization, the offset of the measurement itself is a bit bigger: it's not 37 cycles anymore, it's 70. But again it's very stable, and we can use this number to offset our measurements. In fact, I did this kind of benchmarking for the cases where I don't run in a loop, and then switched to dumber benchmarks that loop over BPF maps, because the serialized setup is harder to port: if you want someone else to try these benchmarks, they would have to patch their kernel, and that is not simple.
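The fencing that white paper (presumably Gabriele Paoloni's "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures") recommends looks roughly like this in gcc/clang inline assembly on x86-64. This is a sketch following the paper's recipe, not code from the talk.

```c
#include <stdint.h>

/* Serialized timestamp reads: cpuid is a serializing instruction, so
 * nothing before tsc_begin() can leak into the measured region; rdtscp
 * waits for all preceding instructions to finish, and the trailing
 * cpuid keeps later code from being reordered into the region. */
static inline uint64_t tsc_begin(void)
{
	uint32_t hi, lo;
	asm volatile("cpuid\n\t"
		     "rdtsc\n\t"
		     "mov %%edx, %0\n\t"
		     "mov %%eax, %1"
		     : "=r"(hi), "=r"(lo)
		     :: "%rax", "%rbx", "%rcx", "%rdx");
	return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t tsc_end(void)
{
	uint32_t hi, lo;
	asm volatile("rdtscp\n\t"
		     "mov %%edx, %0\n\t"
		     "mov %%eax, %1\n\t"
		     "cpuid"
		     : "=r"(hi), "=r"(lo)
		     :: "%rax", "%rbx", "%rcx", "%rdx");
	return ((uint64_t)hi << 32) | lo;
}

/* Usage: t0 = tsc_begin(); <code under test>; t1 = tsc_end();
 * then subtract the empty-body offset (the ~70 cycles above) from t1 - t0. */
```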
[04:57.960 --> 05:45.160] Now let's take a look at several hash functions of interest. jhash is the hash function currently used in BPF; it's Bob Jenkins' hash, and it was probably developed some 30 years ago. SpookyHash is another one: not a newer version of jhash, but a newer hash, also from Bob Jenkins. Then there are xxh32 and xxh64, the previous generation of the xxHash family; they are already available in the kernel, so we can try them as well. And xxh3 is the state-of-the-art hash function.

[05:45.160 --> 07:12.000] Looking at this plot: the orange line is jhash, which is currently used. The green line, which looks to be winning here, is the previous-generation xxh64. The blue line is the newest generation, xxh3. While it may look like xxh3 doesn't perform as well as xxh64, for small keys it actually outperforms it, and for BPF hash maps we are primarily interested in small keys; I have actually never used huge keys. In any case, xxh3 is faster than the Jenkins hash, and later I will show it can be made even faster. One interesting thing is SpookyHash: it performs pretty badly for small keys, because it has a lot of setup that it does in any case, but eventually it starts to outperform every other hash function. I was curious where that happens: at a key size of about 9000 bytes. Cool, but not a key size of interest for us.

[07:12.000 --> 07:45.360] If we compare xxh3 and jhash directly, the blue line, xxh3, outperforms jhash for all key sizes. There is also a green line, jhash2: an optimized version of jhash that can be used when the key size is a multiple of 4. It's actually used in bloom filters, but for some reason not in hash maps.

[07:45.360 --> 08:16.960] Taking a closer look at small key sizes, we see that yes, xxh3 outperforms jhash. For me that's reason enough to benchmark the maps with it. So let's now look at the BPF maps that use hash functions: first the stack trace map, then the hash map and bloom filters.

[08:16.960 --> 09:17.840] The stack trace map was actually the main reason for Andrii to propose xxh3. What it does is take a stack trace, hash it, and maintain a map of IDs that refer to these traces. If there are hash collisions, old stack traces are lost and we get an incorrect picture of the system. And stack traces are not very random, so if your hash function doesn't have good avalanche properties, it will produce more collisions on less random data. xxh3 behaves much better than jhash in terms of avalanche properties, and that is the main reason to use it for the stack trace map.

[09:17.840 --> 09:44.840] The other reason is that, as a benchmark shows, for the stack trace map xxh3 also runs about twice as fast for typical key sizes: a typical key is 8 bytes per stack frame, so usually something like 60, 80, or 100 bytes. So xxh3 is a very good candidate here.
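To make the collision problem concrete, here is a toy userspace model of that ID scheme, loosely modeled on __bpf_get_stackid() in kernel/bpf/stackmap.c. The bucket layout and error handling are simplified, and hash_fn stands in for jhash2() as used today, or xxh3 in the proposal.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the stack trace map's ID scheme; the real logic lives in
 * __bpf_get_stackid() in kernel/bpf/stackmap.c. */
#define N_BUCKETS 1024		/* power of two, as in the real map */

struct bucket { uint32_t hash, nr; uint64_t ips[127]; };
static struct bucket *buckets[N_BUCKETS];

static int get_stackid(const uint64_t *ips, uint32_t nr, int reuse,
		       uint32_t (*hash_fn)(const void *, size_t))
{
	uint32_t hash = hash_fn(ips, nr * sizeof(*ips));
	uint32_t id = hash & (N_BUCKETS - 1);
	struct bucket *b = buckets[id];

	if (b && b->hash == hash && b->nr == nr &&
	    !memcmp(b->ips, ips, nr * sizeof(*ips)))
		return id;	/* same trace seen before: reuse its ID */

	if (b && !reuse)
		return -1;	/* collision: new trace dropped (-EEXIST upstream) */

	/* Collision with BPF_F_REUSE_STACKID: the old trace is evicted.
	 * Either way a collision loses a stack trace, which is why the
	 * hash's avalanche quality matters here, not just its speed. */
	if (!b)
		b = buckets[id] = malloc(sizeof(*b));
	b->hash = hash;
	b->nr = nr;
	memcpy(b->ips, ips, nr * sizeof(*ips));
	return id;
}
```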
[09:44.840 --> 10:28.760] For the hash map benchmarks I focused primarily on lookups, because that is what we do most, and it's the easiest thing to measure compared to more complex scenarios. There are links at the end to the benchmarks I used and to the scripts that run them and plot the results, because for every change I had to draw some 100 or 150 plots, for different key sizes and different levels of hash map fullness; it's impossible to do this any other way. I only have a few of those pictures here.

[10:28.760 --> 11:37.960] If we just switch to xxh3, the new map, in orange, outperforms the original map, in blue. The vertical axis is lookup time in cycles and the horizontal axis is the key size; the bigger the key, the bigger the gain from the new hash function. But if we take a bigger map, I see that xxh3 as-is degrades for key size 4, and for me that is already a blocker: I can't propose a change that degrades existing applications. Then I moved to a different microarchitecture and saw that there it degrades for a different key size, 12, and with a bigger map it degrades at key size 24.

[11:37.960 --> 12:23.800] So I thought about how to fix this. Since xxh3 works well for bigger keys, maybe I can utilize that and handle small keys differently, the same thing the bloom filter currently does: the bloom filter executes jhash2 for key sizes divisible by 4 and uses jhash in the other cases. I did the same, using jhash2 for small keys whose size is divisible by 4 and xxh3 otherwise. The key_len_32 value here is just key_len divided by 4, but it's computed once during map initialization, and we can decide for which key sizes we take this path.
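A sketch of that dispatch, modeled on how kernel/bpf/bloom_filter.c already chooses between jhash2() and jhash(). The signature is modified from the upstream htab_map_hash() to carry the precomputed key_len_32, the cutoff value is illustrative, and xxh3_64() stands in for a scalar in-kernel xxh3 port, which mainline does not currently ship.

```c
#include <linux/jhash.h>

/* Hypothetical: mainline has xxh32()/xxh64() in <linux/xxhash.h>, but
 * no scalar xxh3 port yet; this declaration stands in for one. */
u64 xxh3_64(const void *data, size_t len, u64 seed);

/* Illustrative cutoff: the talk only says "we can decide for which key
 * sizes we do this"; 24 covers the sizes (4, 12, 24) where plain xxh3
 * degraded in the benchmarks. */
#define JHASH2_MAX_KEY	24

/* key_len_32 is key_len / 4 when key_len is a multiple of 4, else 0;
 * precomputed once at map creation, as kernel/bpf/bloom_filter.c does. */
static inline u32 htab_map_hash(const void *key, u32 key_len,
				u32 key_len_32, u32 hashrnd)
{
	if (key_len_32 && key_len <= JHASH2_MAX_KEY)
		return jhash2(key, key_len_32, hashrnd); /* 4-byte-aligned fast path */
	return (u32)xxh3_64(key, key_len, hashrnd);      /* scalar xxh3 otherwise */
}
```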
[12:23.800 --> 12:33.080] With this combined hash function I finally see no degradation anywhere. This is for 10k and 100k entries, 100% full, which is the worst case.

[12:33.080 --> 13:07.040] If we take another slice, a 100k map with key size 8, almost empty on the left side and 100% full on the right, the bigger the key size, the bigger the gain for a given key size: for key size 64 the map with the new hash function runs about 50% faster, and for key size 128 it runs almost twice as fast.

[13:07.040 --> 13:39.640] Bloom filters, as I mentioned, already use jhash2 for keys divisible by 4, so I don't expect any gain there, at least for small keys. And that is how it looks. This is an extreme case, a bloom filter with 9 hashes, set up so that it reproduces the plot of the hash function itself: for small keys it is the same, and for bigger keys we have a gain.

[13:39.640 --> 14:14.440] There is also the key size of 240 bytes, where the original xxh3 starts to use vector instructions, and we obviously can't use vector instructions in BPF maps. There is a scalar implementation as well, which still works faster than jhash, but it degrades at that point.

[14:14.440 --> 14:48.720] Another thing to mention: the old hash functions, jhash and xxh64, were designed and optimized with -O2 in mind, so if we switch to -O3 they behave the same. But xxh3 actually runs 50-60% faster, so it performs much better with -O3.

[14:48.720 --> 15:37.040] I know that -O3 is a no-go for the kernel; there were several attempts to introduce it, and the objection was that there are no candidates that clearly benefit from -O3. But this is one particular candidate. If we could use -O3 for the hash map, and inline the hash, because it should be inlined, then we would also be able to get rid of the composite hash that mixes hashes, and just use xxh3 as-is.

[15:37.040 --> 16:21.400] So, as I said, for the stack trace map it definitely makes sense to use xxh3. There is a speed benefit too, a small one (for the stack trace map the bottleneck for speed is not the hash), but the bottleneck for hash collisions is the hash. For the hash map it's still a question; maybe someone can advise me on what to do about -O3. And after I run the benchmarks on a slightly bigger set of architectures, I think it is also a good candidate for the hash map.

[16:21.400 --> 16:26.840] Here are some links to the benchmarks and the paper I used, for those who will be reading this later. Thank you.

[16:26.840 --> 16:44.840] All right, thanks a lot. Any questions?

[16:44.840 --> 17:08.440] Q: For the -O3 thing, could you compile only the hash map file with -O3?

A: Yes. Currently it is disabled for every file in the kernel; for a custom build we can enable it, but generally we just pass -O2 everywhere.

Q: So if it's possible to compile just the BPF maps with -O3, that would solve it?

A: Yes. It's not such a big change, so you don't have to compile the whole kernel with -O3; it's local code.

[17:08.440 --> 17:40.500] OK, any other questions? No? Then thanks a lot.
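For reference, kbuild already supports per-object compiler flags, so the local change discussed in that answer could hypothetically look like the following. These lines do not exist upstream; kernel/bpf/hashtab.c and kernel/bpf/stackmap.c are the files implementing the two maps.

```make
# kernel/bpf/Makefile: hypothetical per-object overrides.
# Only these objects would be built with -O3; the rest of the
# kernel stays on -O2.
CFLAGS_hashtab.o  += -O3
CFLAGS_stackmap.o += -O3
```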