Okay, I think we're ready to start. Oh, excellent, this time it worked perfectly. Thank you so much. Chris is going to talk about cgroup v2, seven years of cgroup v2 in the kernel, a very exciting time, and the future of Linux resource control. Take it away.

Hello, everybody. Oh, yes, please, go on. Thank you. That's it, I'm done, goodbye. Hello, I'm Chris Down. I work as a kernel engineer at Meta, on the kernel's memory management subsystem; in particular I'm a contributor to cgroups, which is one of the things that underpins our model of containers. I'm also a maintainer of the systemd project, so there are two things on this slide you can hate me for. Most of the time I'm thinking about how we can make Linux just a little bit more reliable and a little bit more usable at scale. We have a million-plus machines, and we can't just buy more RAM; it's not really a thing we can do. So we need to extract the absolute maximum from every single machine, otherwise there's a huge loss of capacity. That's what I want to talk to you about today: how we've done this at Meta over the last seven years, how we've improved reliability and capacity and extracted more efficiency.

At Meta, and in the industry generally, we are increasingly facing the problem that we can't effectively solve scaling problems just by throwing hardware at them. We can't construct data centers fast enough, we can't source clean power fast enough, and with hundreds of thousands of machines we just can't afford to waste capacity, because any small loss of capacity on a single machine translates to a very large amount at scale. Ultimately, we need to use resources more efficiently, and we need to build the kernel infrastructure to do that. Another challenge is that many huge site incidents for companies like us, and companies of our size, are caused by lacking resource control. Not being able to control things like CPU, IO and memory is one of the most pervasive causes of incidents and outages across our industry, and we need to sustain an industry-wide initiative to fix it.

So how does all of this relate to this cgroups thing in the title? cgroups are a kernel mechanism to balance, control and isolate things like memory, CPU and IO, the things that processes share across a machine. If you've operated containers before, which I'm going to assume you have, judging by the fact that you're in this room (otherwise you may be lost and looking for the AI room), you know that every single modern container runtime uses this. Docker uses it, CoreOS uses it, Kubernetes uses it, systemd uses it.
The reason they use it is that it's the most mature platform for doing this work, and it solves a lot of the long-standing problems we had with classic resource control in the form of ulimits and the like. cgroups have existed for about 14 years now, and they have changed a lot in that time. Most notably, seven years ago, in kernel 4.5, we released cgroup v2. I gave a whole talk around the time that happened about why we were moving to a totally new interface rather than just iterating on the old one, and if you're interested in a really in-depth look at that, that's a talk you can go and watch. But the most fundamental change is that in cgroup v2 you enable or disable resources in the context of a particular cgroup. In cgroup v1 you have a hierarchy for memory, a hierarchy for CPU, and the two never meet; those two things are completely independent. When systemd creates things in cgroup v1 it names them the same, they get called something.slice or something.service, but they have no relation to each other across resources. In cgroup v2 you have just a single cgroup, and you enable or disable resources in the context of that particular cgroup, so you can enable, say, memory control and IO control together.

That might seem like an aesthetic concern, but it's really not. Without this major API change we simply cannot use cgroups to do complex resource control. Take the following scenario: memory starts to run out on your machine. When we start to run out of memory on pretty much any modern operating system, what do we do? We try to free some up. So we start to reclaim page cache, and maybe some anonymous pages if we have swap. That results in disk IO. And if we're particularly memory bound, and it's really hard to free pages, and we're having to walk the pages over and over to find things to free, then it's going to cost a non-trivial number of CPU cycles to do so. Looking through available memory to find pages that can be freed can be extremely expensive on memory-bound workloads; on some highly loaded or memory-bound systems it can take a double-digit percentage of the machine's CPU just to do this walking. It's a highly expensive process. And without a single resource hierarchy we cannot take these transfers between resources into account, how one leads to another, because they're all completely independent.
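As a concrete illustration of that single-hierarchy model, here is a minimal sketch in Python. It assumes a cgroup2 mount at /sys/fs/cgroup, root privileges, and a made-up cgroup called "demo"; on a systemd machine you would normally let systemd manage this through unit properties rather than poking the files directly.

```python
# Minimal sketch: enabling memory and IO control together for one cgroup on the
# unified (v2) hierarchy. Assumes a cgroup2 mount at /sys/fs/cgroup and root
# privileges; the "demo" cgroup name is made up for this example.
import os

CGROOT = "/sys/fs/cgroup"

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# Let children of the root cgroup use the memory and io controllers together.
write(os.path.join(CGROOT, "cgroup.subtree_control"), "+memory +io")

# Create one cgroup and move this process into it; from now on its memory and
# its IO are accounted and controlled in the same place.
os.makedirs(os.path.join(CGROOT, "demo"), exist_ok=True)
write(os.path.join(CGROOT, "demo", "cgroup.procs"), str(os.getpid()))
```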
If you've been in the containers devroom before, you're probably thinking: I've seen this guy before, and I think he gave this exact talk about three years ago. I'm sure some of you are thinking that already. Well, the company name isn't the only thing that has changed since 2020. Some cgroups things have also changed since 2020, and obviously I don't want to rehash the same material over and over, I don't want to bore you. So this talk will mostly be about the changes since the last time I was here in 2020, with just a little bit of context setting. This talk is really about the process of getting resource isolation working at scale, about what needs to happen for it to work in production, not just as a theoretical concern.

The elephant in the room, of course, is COVID. The last three years have seen pretty significant changes in behavior due to COVID, especially for a platform like Facebook, which we own, of course. Usage was up by about 27% over what you would usually expect, and this came at a time when not only were we seeing increased demand, we literally couldn't go out and buy memory. You couldn't go out and buy more CPUs, you couldn't go out and buy more disks, because there was a shortage, because of COVID. So what we really needed was to make more efficient use of the existing resources on the machines; we needed to accelerate our existing efforts around resource control to make things more efficient.

Now, and this is a bit of a personal point of concern: almost every time I give this talk, somebody on Hacker News comments, "why don't you just get some more memory?" I don't know how trivial people in this room think that is when you've got several million servers, but it is slightly difficult sometimes. There's a huge amount of cost involved, and not just the money, which is indeed substantial and I'm very glad it's not coming out of my bank account, but also things like power draw, thermals, and hardware design trade-offs. Not to mention that during COVID you just couldn't get this kind of hardware: you couldn't get a hard drive, you couldn't get memory. You could go down to your local Best Buy and buy some, but that's about it. So, not really an option.

So here's a simple little proposition for you, for anyone in the room who wants to be brave: how do you view memory usage for a process in Linux?
Oh, come on. Free! My man said free. Oh lord, this was a trap. I appreciate it though, big up for that. So, yeah, free and the like really only measure one type of memory; they do show caches and buffers off to the side, but here's the thing. With free, or with ps, which was shouted from the back, you see something like the resident set size and some other details, and you might think, hey, that's fine, I don't really care about the other stuff, that's the bit my application is really using. We don't necessarily think our programs rely on caches and buffers to operate in any sustained way, but for any sufficiently complex system the answer is almost certainly that a lot of those caches and buffers are not optional. They are basically essential.

Let's take Chrome, just as a facile example. The Chrome binary's code segment is over 130 megs. He's a chunky boy. He is, he's a big boy. We load this code into memory, and we do it gradually, we're not maniacs, but we do it as part of the page cache. And if you want to execute some particular part of Chrome, the cache holding the code for that part isn't just nice to have: we literally cannot make any forward progress without that part of the cache. The same goes for the caches for the files you're loading, and for something like Chrome you probably do have a lot of caches, so eventually those pages are going to have to make their way into the working set; they're going to have to make their way into main memory.

Another particularly egregious case: we have a daemon at Meta that aggregates metrics across a machine and sends them to centralized storage, and as part of this it runs a whole bunch of janky scripts that go and collect things across the machine. I mean, we've all got one. We've all got this kind of daemon where you collect all kinds of janky stuff, and you don't really know what it does, but it sends some nice metrics and it looks nice. One of the things we were able to demonstrate is that while the team that owned this daemon thought it took about 100 to 150 megabytes to run, using the things we'll talk about in this talk it was actually more like two gigabytes. So the difference can be quite substantial, and you could be seriously misunderstanding what is taking up memory on your machine.

So in cgroup v2 we have this file called memory.current, which measures the current memory usage of the cgroup, including everything: caches, buffers, kernel objects and so on.
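As a rough idea of what "everything" looks like in practice, here is a small sketch that reads memory.current alongside its breakdown in memory.stat, using the hypothetical "demo" cgroup from earlier; the fields shown are just a small selection of what the kernel reports.

```python
# Minimal sketch: memory.current is one number, and memory.stat breaks it down.
# Assumes the hypothetical /sys/fs/cgroup/demo group from the earlier sketch.
def show_cgroup_memory(cg="/sys/fs/cgroup/demo"):
    with open(f"{cg}/memory.current") as f:
        current = int(f.read())
    stat = {}
    with open(f"{cg}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stat[key] = int(value)
    print(f"memory.current: {current / 2**20:.1f} MiB")
    # A few of the components that RSS-style accounting never shows you.
    for key in ("anon", "file", "slab", "sock", "kernel_stack"):
        print(f"  {key:>12}: {stat.get(key, 0) / 2**20:.1f} MiB")

show_cgroup_memory()
```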
So, job done, right? Well, no. The problem is that whenever somebody comes to one of these talks and I say something like "don't use RSS to measure your application", they go away, see that we've added a new thing called memory.current that measures everything, think "great", and just base their metrics on that. But it's quite important to understand what it actually means to have everything in there. The very fact that we are not talking about just the resident set size anymore means the ramifications are fundamentally different. We have caches, buffers, socket memory, TCP memory, kernel objects, all kinds of stuff in here, and that's exactly how it should be, because we need that to prevent abuse of these resources, which are valid resources across the system; they are things we actually need in order to run.

Understanding why reasoning about memory.current might be more complicated than it seems comes down to why, as an industry, we tended to gravitate towards measuring RSS in the first place. We don't measure RSS because it measures anything useful; we measure it because it's really fucking easy to measure. That's the reason we measure RSS, there's no other reason. It doesn't measure anything very useful. It kind of vaguely tells you what your application might be doing, but it doesn't tell you anything about the actually interesting parts of your application, only the bits you pretty much already knew about. memory.current suffers from pretty much exactly the opposite problem: it tells you the truth, and we don't really know how to deal with that. We don't really know how to deal with being told how much memory an application is using. For example, if you set an 8 gigabyte memory limit in cgroup v2, how big is memory.current going to get on a machine with nothing else running on it? Probably 8 gigabytes, because we've decided we're going to fill it with all kinds of nice stuff. There's no reason to evict that; there's no reason to take away those nice kmem caches, no reason to take away those slabs, because we have free memory, so why not keep them around? So if there is no pressure from any outside scope for this to shrink, the slack is just going to expand until it reaches your limit.

So what should we do? How should we know what the real needed amount of memory is at a given time? Let's take an example Linux kernel build, which with no limits has a peak memory.current of just over 800 megabytes. In cgroup v2 we have a tunable called memory.high. This tunable reclaims memory from the cgroup until it goes back under some threshold; it just keeps reclaiming and reclaiming and throttling until you get back under. Right now things take about four minutes with no limits; that's about how long it takes to build the kernel. When I apply a reclaim threshold of 600 megabytes, the job actually finishes in roughly the same amount of time, maybe a second more, with about 25 percent less memory used at peak. And the same happens when we go down to 400 megabytes: now we're using half the memory we originally used, with only a few seconds more wall time. It's a pretty good trade-off.
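For reference, here is a rough sketch of how an experiment like this might be run by hand. The "kbuild" cgroup path and the 600 MiB figure are just placeholders matching the example above, and you need a delegated cgroup or root to write these files; systemd-run with the MemoryHigh= property is the more usual way to do the same thing.

```python
# Rough sketch: run a build under a memory.high reclaim threshold.
# The cgroup path and the 600 MiB value are only example placeholders.
import os
import subprocess

CG = "/sys/fs/cgroup/kbuild"
os.makedirs(CG, exist_ok=True)

with open(f"{CG}/memory.high", "w") as f:
    f.write(str(600 * 2**20))   # start reclaiming once usage goes above ~600 MiB

def enter_cgroup():
    # Runs in the child just before exec: move it into the cgroup.
    with open(f"{CG}/cgroup.procs", "w") as f:
        f.write(str(os.getpid()))

subprocess.run(["make", "-j8"], preexec_fn=enter_cgroup)
```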
However, if we push the limit just a little bit further, down below 400 megabytes, things just never complete. We have to Ctrl-C the build; this is nine minutes in and it still isn't done. So we know that the process needs somewhere between 300 and 400 megabytes of memory, but it's pretty error-prone to try to work out the exact value. And to get an accurate number for services at scale, which are even harder than this because they dynamically shrink and expand depending on load, we need a better, automated way to do it.

Determining the exact amount of memory required by an application is a really, really difficult and error-prone task. So senpai is a simple, self-contained tool that continually polls what's called pressure stall information, or PSI. Pressure stall information is essentially a new thing we've added with cgroup v2 to determine whether a particular resource is oversaturated, and we've never really had a metric like this in the Linux kernel before. We've had many related metrics; for memory, for example, we have things like page cache and buffer usage, but we don't really have a way to tell pressure, or oversubscription, apart from efficient use of the system. Those two are very difficult to tell apart, even using things like page scan counts; it's pretty difficult.

So in senpai what we do is use these PSI pressure stall metrics to measure the amount of time that threads in a particular cgroup were stuck doing, in this case, memory work. The "pressure = 0.16" about halfway down the slide means that 0.16 percent of the time I could have been doing more productive work, but instead I was stuck doing memory work. That could be waiting for a kernel memory lock, being throttled, or waiting for reclaim to finish; even more than that, it could be memory-related IO, which can also dominate, things like refaulting file content into the page cache, or swapping pages back in. Pressure is essentially saying: if I had a bit more memory, I would be able to run this much faster, 0.16 percent faster. So, using PSI and memory.high, what senpai does is apply just enough memory pressure on a cgroup to evict the cold memory pages that aren't essential for workload performance.
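The real tool is considerably more careful than this, but as a toy sketch of the shape of that feedback loop, again using the hypothetical "demo" cgroup: watch the PSI numbers in memory.pressure and nudge memory.high until the measured stall time sits near a small target.

```python
# Toy sketch of the senpai idea: keep memory PSI near a small target by
# adjusting memory.high. The cgroup path, target and step are only examples.
import time

CG = "/sys/fs/cgroup/demo"
TARGET = 0.1          # aim for roughly 0.1% of time stalled on memory
STEP = 16 * 2**20     # move the limit 16 MiB at a time

def some_avg10(cg):
    # First line of memory.pressure looks like:
    #   some avg10=0.16 avg60=0.08 avg300=0.02 total=123456
    with open(f"{cg}/memory.pressure") as f:
        fields = f.readline().split()[1:]
    return float(dict(kv.split("=") for kv in fields)["avg10"])

def current_high(cg):
    with open(f"{cg}/memory.high") as f:
        val = f.read().strip()
    if val == "max":                       # no limit yet: start from current usage
        with open(f"{cg}/memory.current") as f:
            return int(f.read())
    return int(val)

while True:
    pressure = some_avg10(CG)
    high = current_high(CG)
    # Too much pressure: back off (raise the limit). Too little: squeeze harder.
    high += STEP if pressure > TARGET else -STEP
    with open(f"{CG}/memory.high", "w") as f:
        f.write(str(max(high, STEP)))
    time.sleep(5)
```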
It's an integral controller that dynamically adapts to these memory peaks and troughs. An example case is something like a web server, which is one of the places we've used it: when more requests come in, we see that the pressure is growing and we expand the memory.high limit; when fewer requests are coming in, we see that too and start to shrink the working set we allow again. So it can be used to answer the question: how much memory does my application actually use over time? In this case we find that for the compile job the answer is about 340 megabytes or so.

You might be asking yourself what the benefit of this shrinking is, why it even matters. Surely, when you're starting to run out of memory, Linux is going to do it anyway? And you're not wrong, that's true, but what we really need here is to get ahead of memory shortages, which could be bad, and amortize the work ahead of time. When your machine is already highly contended, already being driven into the ground and heading towards the OOM killer, it's pretty hard to say, "hey, could you just give me some pages right now?" That's not exactly what's on its mind; it's probably desperately trying to keep the atomic pool going. There's another benefit as well: it's pretty good for detecting regressions, which is what a lot of people use RSS for. This is how we found out that that daemon was using two gigabytes of memory instead of 150 megabytes; it's pretty good for finding out how much your application actually needs to run. The combination of these things means that senpai is an essential part of how we do workload stacking at Meta: it not only gives us an accurate read on what the demand is right now, but allows us to adjust stacking expectations depending on what the workload is doing.

This feeds into another one of our efforts around efficiency, which is improving memory offloading. Traditionally, on most operating systems, you have only one real memory offloading location, which is your disk. That's true even if you don't have swap, because you do things like demand paging, you page things in gradually, and you also have to evict and re-read things in the file cache. So we're also talking here about a lot of granular, intermediate areas that could be considered for offloading pages that are accessed infrequently, but not that infrequently. Getting this data back into main memory, though, can be very different in terms of how difficult it is, depending on how far up the triangle you go. For example, it's much easier on an SSD than on a hard drive, because hard drives are slow and don't tolerate random head seeking very well.
But there are more granular, gradual things we can do as well. For example, we can start looking at strategies outside of the hardware itself. One of the problems with the duality of being either in RAM or on disk is that even your disk, even if it's quite fast, even if it's flash, tends to be quite a few orders of magnitude slower than your main memory. So one area we have been heavily invested in is looking at what we might term warm pages. In Linux we have talked a lot about hot pages and cold pages, if you look at the memory management code, but there is also this part of the working set which, yes, I do need relatively frequently, but I don't need it to make forward progress all the time. zswap is one of the things we can use for that. It's essentially a feature of the Linux kernel that compresses pages which look like they will compress well and are not too hot into a separate pool in main memory. We do have to page fault them back in if we actually want to use them, of course, but that's several orders of magnitude faster than trying to get them off the disk. We still have disk swap for infrequently accessed pages, because there tends to be quite a bit of cold working set as well. So it becomes this tiered hierarchy: hot pages in main memory, warm pages in zswap, and cold pages in swap.

One problem we had here was that even when we configured the kernel to swap as aggressively as possible, it still wouldn't do it. If you've actually looked at the swap code, and I've had the unfortunate misery of working on it, you'll learn that it was implemented a very long time ago by people who knew what swap did and how things worked, but none of them are around to tell us what the hell anything means anymore, and it's very confusing. I can't even describe to you how the old algorithm works, because it has about 500 heuristics and I don't know why any of them are there. So for this reason we tried to think about how we could make this a little more efficient. We are using non-rotational disks now, we have zswap, we have flash disks, we have SSDs, and we want an algorithm that can handle this better. So since kernel 5.8 we have been working on a new algorithm, which has already landed. First, we have code to track all swap-ins and cache misses across the system, so for every cache page that we keep having to fault in and evict, and fault in and evict, over and over again, what we want to do is try to page out a heap page instead.
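If you want to experiment with the zswap-plus-swap tiering described above on an ordinary machine, these are roughly the knobs involved; a minimal sketch, where the values are only examples rather than recommendations.

```python
# Minimal sketch: standard zswap module parameters and the swappiness sysctl.
# Values here are illustrative examples only, not tuning advice.
def write(path, value):
    with open(path, "w") as f:
        f.write(value)

write("/sys/module/zswap/parameters/enabled", "Y")             # compress warm pages in RAM
write("/sys/module/zswap/parameters/max_pool_percent", "20")   # cap the compressed pool
write("/proc/sys/vm/swappiness", "100")                        # let the kernel swap more readily
```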
turns [19:26.320 --> 19:30.960] out to be hot then you know no biggie like we we've made a mistake but we'll try a different one [19:30.960 --> 19:34.800] next time we do have some heuristics to try and work out which one is hot and which one is not but [19:34.800 --> 19:41.440] they are kind of expensive so we don't use a lot of them um however you know if if we are lucky and [19:41.440 --> 19:46.480] the heat page does stay swapped out then that's one more page which we can use for file caches and [19:46.480 --> 19:52.000] we can use it for other processes and this means that we can engage swap a lot more readily in [19:52.000 --> 19:57.360] most scenarios importantly though we are not adding ioload this doesn't increase ioload or [19:57.360 --> 20:01.920] decrease endurance of the disk um we are just more intentional about in choosing how to apply [20:01.920 --> 20:07.440] the i it doesn't double up um we only trade one type of paging for another and our goal here is to [20:07.440 --> 20:13.200] reach an optimal state where the optimal state is doing the minimum amount of i o in order to sustain [20:13.200 --> 20:17.760] workload performance um so ideally what we do is have this tiered model of you know like I said main [20:17.760 --> 20:24.720] memory z swap and swap on disk this is super simple idea compared to the old model although the old [20:24.720 --> 20:29.280] algorithm has a lot of kind of weird heuristics as I mentioned a lot of penalties a lot of kind of [20:29.280 --> 20:34.880] strange things um in general it was not really written for an era where SSDs exist or where [20:34.880 --> 20:38.640] z swap exists so it's understandable that it needed some some care and attention [20:40.000 --> 20:45.120] so what were the effects of this change in prod like what what actually happened so on web servers [20:45.120 --> 20:50.160] we not only noticed like an increase in performance but we also noticed a decrease in heat memory by [20:50.160 --> 20:56.720] about two gigabytes or so out of about 16 gigabytes total the cache grew to fill this newly freed [20:56.720 --> 21:01.600] space and it grew by about two gigabytes from about uh two gigabytes of cache to four gigabytes of [21:01.600 --> 21:06.320] cache we also observed a measurable increase in web server performance from this change which is [21:06.320 --> 21:10.880] deeply encouraging and these are all indications that you know we are now starting to reclaim the [21:10.880 --> 21:13.840] right things actually we are making better decisions because things are looking pretty [21:13.840 --> 21:18.400] positive here so not only that but you see a decrease in disk i o because we are actually [21:18.400 --> 21:23.440] doing things correctly we are making the correct decisions and it's not really that often that [21:23.440 --> 21:29.120] you get a benefit in performance disk i o memory usage instead of having to trade off between them [21:29.120 --> 21:34.320] right so it probably indicates that this is the better solution for this kind of era this also [21:34.320 --> 21:39.280] meant that on some workloads uh we now had opportunities to stack where we did not have [21:39.280 --> 21:43.360] opportunities to stack before like running say multiple kinds of ads jobs or multiple kinds [21:43.360 --> 21:47.760] of web servers on top of each other uh many machines don't use up all of their resources [21:47.760 --> 21:52.720] but they use up just enough that it's pretty hard to stack something else on top of it [21:52.720 --> 21:56.960] 
So this is another place where we've managed to push the needle just a little bit, so that you can get quite a bit more use and efficiency out of the servers that already exist.

The combination of the changes to the swap algorithm, using zswap, and squeezing workloads with senpai was a huge part of our operation during COVID. All of these things acting together we termed TMO, which stands for transparent memory offloading, and you can see some of the results we've had in production here. In some cases we were able to save up to 20 percent of the memory of critical fleet-wide workloads, with either neutral or, in some cases, even positive effects on workload performance. This obviously opens up a lot of opportunities in terms of reliability, stacking and future growth. There's a huge amount to cover on this topic; I really could do an entire talk just on this. If you want to learn more, I recommend the post linked at the bottom: my colleagues Johannes and Dan wrote an article with a lot more depth on how we achieved what we achieved, and on things like CXL memory as well.

So let's come back to this slide from earlier. We briefly touched on the fact that, if bounded, one resource can just turn into another, a particularly egregious case being memory turning into IO when it gets bounded. For this reason, and it might seem counterintuitive, we always need controls on IO whenever we have controls on memory; otherwise memory pressure will just translate directly into disk IO.

Probably the most intuitive way to solve this is to try to limit disk bandwidth or disk IOPS. However, this doesn't usually work out very well in reality. Any modern storage device tends to be quite complex: they're queued devices, you can throw a lot of commands at them in parallel, and when you do that you often find that, magically, the device can do more. It's the same reason we have IO schedulers, because we can optimize what we do inside the disk. The mixture of IO also really matters, reads versus writes, sequential versus random; even on SSDs these things tend to matter. And it's really hard to determine a single metric for how loaded a storage device is, because the cost of one IO operation, or of one block of data, is extremely variable depending on the wider context. It's also really punitive to just put a limit on how much you can write or how many IOPS you can do, because even if nobody else is using the disk, you're still slowed down to that level. There's no opportunity to make the most of the disk when nobody else is doing anything, so it's not really good for the kind of best-effort, bursty work on a machine that we would like to be able to do.
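For reference, this kind of hard cap is what the io.max interface in cgroup v2 expresses; a minimal sketch, with a made-up device and made-up numbers.

```python
# Minimal sketch: a hard bandwidth/IOPS cap with io.max (the approach being
# critiqued above). "8:0" is a block device's major:minor; values are examples.
with open("/sys/fs/cgroup/demo/io.max", "w") as f:
    f.write("8:0 rbps=67108864 wbps=33554432 riops=2000 wiops=1000")
```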
So the first way we try to avoid this problem is by using latency as a metric for workload health. What we do is apply a maximum target latency for IO completions on the main workload, and if we exceed that, we start dialing other cgroups with looser latency requirements back to their own configured thresholds. What this does is prevent an application from thrashing on memory so badly that it kills IO across the whole system. This actually works really well for systems where there's only one main workload, but the problem comes when you have a multi-workload, stacked case like this one. Here we have two high-priority workloads stacked on a single machine: one has an io.latency of 10 milliseconds, the other 30 milliseconds. The problem is that as soon as workload one gets into trouble, everyone else is going to suffer, and there's no way around that. We're just going to penalize them, and there's no way to say how bad the situation really is, or whether it's really them causing the problem. That's fine if the thing you're throttling is just best-effort, but here we have two important workloads. So how can we solve this?

Our solution is this thing called io.cost, which might look very similar at first, but notice the omission of the units. These are not values in milliseconds, they are weights, in a similar way to how we do CPU scheduling. So how do we know what 40, 60 and 100 mean in this context? Well, they add up to 200. So the idea is that if you are saturating your disk, best-effort work outside gets 20 percent of the capacity, workload one gets 50 percent, and workload two gets 30 percent. It balances out based on this kind of shares-or-weights model.

How do we know when we've reached 100 percent of saturation, though? What io.cost does is build a linear model of your disk over time. It passively watches how the disk responds to variable load, and it works based on things like whether the IO is a read or a write, whether it's random or sequential, and the size of the IO. So it boils the quite complex question of how much your disk can actually do down into a linear model that it maintains itself. There's a QoS model you can configure, but there's also a basic on-the-fly model using queue depth. You can read more about it in the links at the bottom, I won't waffle on too much, but it is something you can use to do effective IO control.
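As a rough sketch of what these two interfaces look like on the filesystem: the cgroup names, the "8:0" device and the numbers below are only examples, and note that on current kernels the per-cgroup weights that io.cost uses are written through io.weight.

```python
# Rough sketch of io.latency and io.cost-style weights; names and values are
# illustrative only.
def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# io.latency: protect the main workload with a completion-latency target.
# (Check your kernel's cgroup-v2 documentation for the exact unit and format.)
write("/sys/fs/cgroup/workload1/io.latency", "8:0 target=10")

# io.cost-style proportional control: weights rather than absolute targets.
write("/sys/fs/cgroup/workload1/io.weight", "default 100")
write("/sys/fs/cgroup/workload2/io.weight", "default 60")
write("/sys/fs/cgroup/best-effort/io.weight", "default 40")
```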
In the old days I'd come to this room and talk about cgroup v2, and the historical response was basically: that's nice, Docker doesn't support it though, so please leave. Well, I've had some nice chats with the Docker folks, and the Docker people are very nice, and so are all the other container people, and what's happened is that we now have cgroup v2 almost everywhere. We have quite a diversity of container runtimes, and it's basically supported everywhere. So even if nothing changes on your side, moving to cgroup v2 means you get significantly more reliable accounting for free. We spent quite a while working with the Docker and systemd folks and so on to get things working, and we're also really thankful to Fedora for making cgroup v2 the default since Fedora 32, as well as making things more reliable behind the scenes for users. That also got some people's asses into gear when there was an issue on their GitHub saying it doesn't work on Fedora, so cheers, Fedora people: it was a good signal that this is what we, as an industry, as a technology community, are actually doing, and that was quite helpful.

The KDE and GNOME folks have also been busy using cgroups to give better management of that kind of desktop handling. David Edmundson and Henri Chain from KDE in particular gave a talk at KDE Akademy titled "Using cgroups to make everything amazing". Now, I'm not brazen enough to title my talk that, but I'll let it speak for itself for theirs. It basically goes over the use of cgroups and cgroup v2 for resource control and for interactive responsiveness on the desktop. This is definitely a developing space; obviously a lot of the work so far has been on the server side, but if you're interested, I definitely recommend giving that talk a watch. It really goes into the challenges they had and the unique features cgroup v2 has to solve them.

Finally, Android is also using the metrics exported by the PSI project to detect and prevent memory pressure events that affect the user experience. As you can imagine, on Android interactive latency is extremely important: it would really suck if you were about to click a button, clicking it required allocating memory, and the whole phone froze. I mean, it does still happen sometimes, but obviously this is something they're trying to work on, and we've been working quite closely with them to integrate the PSI project into Android.

Hopefully this talk gave you some ideas about things you'd like to try out for yourself. We're still very actively improving kernel resource control. It might have been seven years since we started, but we still have plenty of things we want to do, and what we really need is your feedback. What we really need is more examples of how the community is using cgroup v2, and of the problems and issues you've encountered.
Obviously, everyone's needs are quite different, and I, and others, are quite eager to know what we could be doing to help you, what we could be doing to make things better, what we could be doing to make things more intuitive, because there's definitely work to be done there. I'll be around after the talk if you want to chat, but also feel free to drop me an email or message me on Mastodon; I'm always happy to hear feedback or suggestions. I've been Chris Down, and this has been Seven Years of cgroup v2: The Future of Linux Resource Control. Thank you very much.