[00:00.000 --> 00:19.920] We're starting. [00:19.920 --> 00:24.680] My name is Andy and I'm here to say we're talking garbage collectors in a really great [00:24.680 --> 00:25.680] way. [00:25.680 --> 00:30.200] This is my talk about Whippet, which is a new garbage collector for Guile. [00:30.200 --> 00:33.560] So Guile is an implementation of Scheme, as most people know. [00:33.560 --> 00:37.480] But if you looked at it, and you tried to determine its composition, you would notice [00:37.480 --> 00:41.120] that there's a big C library that's part of it. [00:41.120 --> 00:46.920] And it has an API, like we show here: there's a cons function, which is defined as cons, [00:46.920 --> 00:50.120] and it takes some arguments, and it returns a value. [00:50.120 --> 00:54.560] And there's a lot of code inside Guile that uses this API, and a lot of code in external [00:54.560 --> 00:57.760] projects and files that also uses this API. [00:57.760 --> 01:03.240] So it's exposed to third-party users. [01:03.240 --> 01:07.720] And Guile is a garbage-collected language. [01:07.720 --> 01:11.820] Data is allocated by the garbage collector, and the garbage collector takes responsibility [01:11.820 --> 01:13.160] for freeing it. [01:13.160 --> 01:14.720] And how is this going to work? [01:14.720 --> 01:19.240] So let's say I cons a value: I'm making a new object, I need to include it in the set [01:19.240 --> 01:21.160] of live data, right? [01:21.160 --> 01:22.840] So what's a live object? [01:22.840 --> 01:27.600] A live object is one of the roots, or anything referred to by a live object. [01:27.600 --> 01:28.920] So it's a circular definition. [01:28.920 --> 01:33.680] You compute the fixed point of this computation. [01:33.680 --> 01:36.000] And how are we going to do this? [01:36.000 --> 01:38.120] I'm sorry, I'm getting on to the next slide. [01:38.120 --> 01:40.360] So there are actually three strategies we can use here. [01:40.360 --> 01:45.000] One, we can reference-count values, and you know, we used to laugh at this, but it's coming [01:45.000 --> 01:47.480] back in style, actually. [01:47.480 --> 01:54.440] Two, you could register the location of this value with the runtime, and unregister it [01:54.440 --> 01:56.040] at some point when it goes out of scope. [01:56.040 --> 02:02.080] And the third way we could find this value would be what is called conservative root scanning. [02:02.080 --> 02:06.240] And that's what Guile has done for many, many years now. [02:06.240 --> 02:10.200] And the idea, I don't know, if this is the first time you're hearing this, this is going [02:10.200 --> 02:11.200] to be wild. [02:11.200 --> 02:14.760] You know, like your brain's just going to go poof, because you take the stack, right? [02:14.760 --> 02:16.120] The machine stack. [02:16.120 --> 02:19.040] And you treat every word on it as if it's an integer, you know? [02:19.040 --> 02:23.640] But if that integer is within the range of the objects managed by the heap, [02:23.640 --> 02:25.880] then we consider it maybe a pointer. [02:25.880 --> 02:27.400] And then we keep those objects alive. [02:27.400 --> 02:32.400] So it's conservative in the sense that it doesn't compute the minimal set of live objects. [02:32.400 --> 02:35.040] It's an over-approximation of the live objects. [02:35.040 --> 02:37.920] It seems to work, though, historically. [02:37.920 --> 02:39.640] It's not one of those things you have guarantees on. [02:39.640 --> 02:41.400] It's very strange.
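For illustration, here is a minimal C sketch of what the conservative root scanning just described could look like. This is not Guile's or BDW-GC's actual code; heap_base, heap_limit, and mark_possible_object are invented names standing in for whatever a real collector uses.

```c
#include <stdint.h>

extern uintptr_t heap_base, heap_limit;           /* assumed heap bounds (hypothetical) */
extern void mark_possible_object(uintptr_t addr); /* hypothetical marking entry point */

/* Treat every word between two stack bounds as if it were an integer; any
   value that falls within the heap's address range is kept alive as a
   possible pointer, over-approximating the live set. */
void scan_stack_conservatively(uintptr_t *lo, uintptr_t *hi) {
  for (uintptr_t *p = lo; p < hi; p++) {
    uintptr_t word = *p;
    if (word >= heap_base && word < heap_limit)
      mark_possible_object(word);  /* maybe a pointer: keep the object alive */
  }
}
```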
[02:41.400 --> 02:45.920] And Guile's very old, 30 years old, I think, today, or not today, but like this year, I [02:45.920 --> 02:48.880] think, something like that. [02:48.880 --> 02:52.400] We're getting older, also. [02:52.400 --> 02:57.240] And since its very beginning, it had a custom GC, which we inherited from a previous implementation [02:57.240 --> 03:00.040] that Guile was based on, SCM. [03:00.040 --> 03:03.240] And then in the mid-2000s, we added support for proper pthreads. [03:03.240 --> 03:04.880] We had other things before. [03:04.880 --> 03:10.120] It was a kind of buggy time, because threads and garbage collectors, it's a very tricky [03:10.120 --> 03:11.120] thing to get right. [03:11.120 --> 03:14.280] And if you just haphazardly add them together without understanding what you're doing, you [03:14.280 --> 03:15.520] can make some bugs. [03:15.520 --> 03:20.120] When we switched to a third-party collector called the Boehm-Demers-Weiser collector, I should [03:20.120 --> 03:26.720] have spelled it out here, a lot of these bugs went away, actually, because it takes threads [03:26.720 --> 03:27.720] more into account. [03:27.720 --> 03:30.320] It's better designed in some ways. [03:30.320 --> 03:34.720] And a nice thing when we switched to the Boehm collector is it scans not only stacks, but [03:34.720 --> 03:38.320] also static data segments, pthread keys. [03:38.320 --> 03:41.760] It tries to find all the roots that it might possibly find. [03:41.760 --> 03:45.640] It grovels your system for special magic integers. [03:45.640 --> 03:51.120] And actually, with conservative collection, there are some advantages, and some real advantages. [03:51.120 --> 03:54.360] It is very nice to program with a conservative garbage collector. [03:54.360 --> 04:00.560] I work on web browsers; they all have, well, two of the three major ones have precise roots, [04:00.560 --> 04:02.480] and it's a pain getting the handles right. [04:02.480 --> 04:06.560] And I've had bugs, you know, where you forget to register the location of a value, and everything [04:06.560 --> 04:10.840] blows up, but only sometimes; it depends on when the garbage collector runs. [04:10.840 --> 04:14.120] And it doesn't constrain the compiler, because the compiler doesn't have to keep track; you [04:14.120 --> 04:20.720] don't have to make the compiler tell the system about where the values are. [04:20.720 --> 04:25.560] And yeah, but on the negative side, you might leak values. [04:25.560 --> 04:28.840] We don't know to what extent this is a thing. [04:28.840 --> 04:31.280] It appears to be fine in practice. [04:31.280 --> 04:33.040] We actually don't have a lot of data there. [04:33.040 --> 04:38.240] With the advent of 64-bit address spaces, I think it is less of a problem, though. [04:38.240 --> 04:41.240] Another issue is we can't move values. [04:41.240 --> 04:47.600] If any integer that we ever find during the whole trace of the heap might be a pointer [04:47.600 --> 04:50.400] to a value, we can never compact the heap. [04:50.400 --> 04:56.640] And this is actually a real, it's a real limitation for us in the sense that we can't [04:56.640 --> 05:01.800] use some of the newer, better-performing garbage collection algorithms. [05:01.800 --> 05:09.640] And as a technical constraint, it also prevents the garbage collector from changing.
[05:09.640 --> 05:12.640] It's very difficult to change to one of those garbage collector algorithms now, because [05:12.640 --> 05:16.600] we have so much user code, we have so much implementation, and it'll be hard. [05:16.600 --> 05:21.680] But whatever I told you, there is actually a better way. [05:21.680 --> 05:23.680] Because we thought we were at a local maximum. [05:23.680 --> 05:27.800] We couldn't get any better without getting worse for a while. [05:27.800 --> 05:31.880] We wouldn't reach that mountaintop without having to descend into the valley. [05:31.880 --> 05:38.400] But it turns out that you can have conservative roots and move objects and compact the heap. [05:38.400 --> 05:43.360] You can have conservative roots and do fast bump-pointer allocation, which we'll get to [05:43.360 --> 05:44.360] in a minute. [05:44.360 --> 05:50.800] And you can have conservative roots and eventually possibly add more precision to your scan. [05:50.800 --> 05:55.080] And the thing that came along that allowed me to know this was something called Immix. [05:55.080 --> 06:02.120] This is a paper that was published in 2008 by Steve Blackburn and his group. [06:02.120 --> 06:09.560] And it is a new, well, he characterizes in that paper a new class of fundamental GC algorithms. [06:09.560 --> 06:14.440] So you have basically four things you can do when you're doing a GC. [06:14.440 --> 06:18.880] You can have what's called mark-compact, meaning find the live objects, and then slide them [06:18.880 --> 06:21.400] all to one side of the same space. [06:21.400 --> 06:25.440] So within the space that you found the objects in, you slide them all to one side. [06:25.440 --> 06:31.120] You have mark-sweep: find all the objects, and then collect all the holes into, these [06:31.120 --> 06:34.160] are the holes that are two words long, and these are the holes that are three words long, [06:34.160 --> 06:36.160] and these are the holes, like that, into free lists. [06:36.160 --> 06:37.160] This is what it's called. [06:37.160 --> 06:38.560] You sweep to a free list. [06:38.560 --> 06:39.560] Mark-sweep. [06:39.560 --> 06:43.840] Evacuation: find all the live objects, and as you find them, you copy them somewhere else. [06:43.840 --> 06:47.920] So instead of sliding to part of one space, you get them out of the space entirely. [06:47.920 --> 06:54.600] And that's semi-space, for example; that's a number of different Java collection algorithms. [06:54.600 --> 06:57.720] And this other new algorithm is mark-region. [06:57.720 --> 07:02.600] Find all the holes and bump-pointer allocate into them. [07:02.600 --> 07:07.440] As you allocate, you sort of sweep across the space, and you allocate in a bump-pointer [07:07.440 --> 07:11.000] fashion into this hole, and then to that hole, and then to that hole, instead of collecting [07:11.000 --> 07:12.200] free lists. [07:12.200 --> 07:14.520] And Immix is one of these new kinds of collectors. [07:14.520 --> 07:18.400] This is a diagram from the paper, the 2008 paper. [07:18.400 --> 07:22.880] Immix organizes the heap into blocks and lines. [07:22.880 --> 07:27.600] Blocks are about 64 kilobytes in size, should be a multiple of the page size, and lines [07:27.600 --> 07:31.920] for Immix are 128 bytes. [07:31.920 --> 07:37.040] And as you allocate, here in this diagram, we can see that there are some blocks that [07:37.040 --> 07:38.040] are all full. [07:38.040 --> 07:40.560] A full block, we don't have to do anything about it.
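As a concrete picture of the bump-pointer allocation that mark-region collectors do, here is a hedged sketch: the allocator holds a cursor and a limit for the current hole, and allocation is just an addition until the hole runs out. The names (struct allocator, find_next_hole) are invented for illustration, and alignment handling is omitted.

```c
#include <stddef.h>
#include <stdint.h>

struct allocator {
  uintptr_t hole_cursor;  /* next free address in the current hole */
  uintptr_t hole_limit;   /* one past the end of the current hole  */
};

/* Hypothetical: advances to the next run of unmarked lines, sweeping
   lazily; returns 0 when no hole is left and a collection is needed. */
extern int find_next_hole(struct allocator *a);

void *allocate(struct allocator *a, size_t bytes) {
  while (a->hole_cursor + bytes > a->hole_limit) {
    if (!find_next_hole(a))
      return NULL;              /* caller triggers a GC */
  }
  void *result = (void *)a->hole_cursor;
  a->hole_cursor += bytes;      /* the bump */
  return result;
}
```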
[07:40.560 --> 07:45.680] There are some blocks that have some lines which were marked in the previous collection, [07:45.680 --> 07:48.400] and some lines that were not marked in the previous collection. [07:48.400 --> 07:51.520] The lines that are not marked, a set of contiguous lines, is a hole. [07:51.520 --> 07:54.160] You can bump-pointer allocate into the holes. [07:54.160 --> 07:58.380] Objects can be part of a line, in which case maybe many objects fit in a line. [07:58.380 --> 08:03.880] They can span multiple lines, but they can't span blocks, okay? [08:03.880 --> 08:07.120] When you allocate, you bump-pointer allocate, and you sweep through all the blocks in the [08:07.120 --> 08:10.040] system in the course of that GC cycle. [08:10.040 --> 08:14.520] When you trace, you mark objects in the same way as a mark-sweep collector, so there's [08:14.520 --> 08:19.800] a mark bit associated with every object, possibly in an object's header, possibly in a side table. [08:19.800 --> 08:25.560] But as you mark them, you also mark lines, the lines that they're on, using address math. [08:25.560 --> 08:30.480] Typically the way this is implemented is all these blocks are allocated as part of aligned [08:30.480 --> 08:37.240] 2-megabyte slabs, basically, and you can use address arithmetic to get to the side table [08:37.240 --> 08:41.760] of mark bytes for the line. [08:41.760 --> 08:49.200] When you sweep, you do, at the end of collection, there is an eager sweep over all of the line [08:49.200 --> 08:54.800] mark bytes, so the contiguous array of mark bytes for lines, to identify which blocks [08:54.800 --> 08:59.960] are full, which are completely empty, and which are recycled, containing some old data, [08:59.960 --> 09:04.160] and those you would bump-pointer allocate into the holes. [09:04.160 --> 09:11.040] The cool thing about it is that Immix does opportunistic evacuation, so it's not simply [09:11.040 --> 09:12.960] leaving these objects in place. [09:12.960 --> 09:17.640] If it determines that your system needs to be defragmented, then it can choose some set [09:17.640 --> 09:22.120] of blocks to evacuate, and choose some other set of blocks which are already empty to be [09:22.120 --> 09:23.800] evacuation targets. [09:23.800 --> 09:27.840] So it's still a one-pass algorithm over the heap, but instead of marking objects in place, [09:27.840 --> 09:30.800] it tries to put them into an empty block. [09:30.800 --> 09:35.760] And if you do this a couple of times, you'll completely defragment the heap. [09:35.760 --> 09:42.880] And it can fail because of parallel markers, and ordering, and alignment issues, and that's [09:42.880 --> 09:46.240] okay; if the evacuation fails, you just mark in place. [09:46.240 --> 09:51.400] It's always okay to mark in place, and it's always okay to try to evacuate; evacuation [09:51.400 --> 09:53.040] may or may not succeed. [09:53.040 --> 09:58.360] So when I realized this, that you can mark in place or evacuate, this is something that [09:58.360 --> 10:01.200] is compatible with Guile, right? [10:01.200 --> 10:05.360] We can do bump-pointer allocation now instead of allocating from free lists, which would [10:05.360 --> 10:08.440] improve throughput in Guile programs. [10:08.440 --> 10:16.920] We can compact the heap, which is, I mean, I know there are many Guix users here, and python-xyz.scm [10:16.920 --> 10:21.080] is one of the files you have, yes. [10:21.080 --> 10:22.280] I say no more.
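To make the address math concrete, here is a sketch under assumed numbers (2 MB aligned slabs, 128-byte Immix lines), with the simplifying and hypothetical assumption that the line mark bytes sit at the start of each slab; a real implementation reserves dedicated metadata space so the table doesn't overlap objects.

```c
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE (2UL * 1024 * 1024)  /* blocks live in aligned 2 MB slabs */
#define LINE_SIZE 128UL                /* Immix line size */

/* Find the mark byte for the line containing addr, by pure arithmetic:
   round down to the slab base, then index by line number. */
static unsigned char *line_mark_byte(uintptr_t addr) {
  uintptr_t slab = addr & ~(uintptr_t)(SLAB_SIZE - 1); /* slab base via alignment */
  size_t line = (addr - slab) / LINE_SIZE;             /* line index within slab  */
  return (unsigned char *)slab + line;                 /* its byte in the table   */
}

/* Marking an object also marks every line it touches. */
void mark_lines_for_object(uintptr_t obj, size_t size) {
  for (uintptr_t addr = obj; addr < obj + size; addr += LINE_SIZE)
    *line_mark_byte(addr) = 1;
}
```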
[10:22.280 --> 10:28.600] So I started a year ago on this, on this work-in-progress GC implementation, hence where the name [10:28.600 --> 10:29.600] comes from. [10:29.600 --> 10:31.080] There are a couple of differences from Immix. [10:31.080 --> 10:38.000] Immix has these 128-byte lines, and if just one object on a line is left over, then the [10:38.000 --> 10:40.760] line is kept live, right? [10:40.760 --> 10:46.800] In the next collection, nobody will allocate, nobody will put an object in that line. [10:46.800 --> 10:49.600] It's not a hole, basically. [10:49.600 --> 10:55.920] And for various reasons, that didn't make sense to me, so instead in Whippet, we have 16-byte [10:55.920 --> 11:01.440] lines, so effectively the line mark table is the object mark table. [11:01.440 --> 11:09.040] You only have one mark byte; it's a byte because of parallel markers, and it's a bit more [11:09.040 --> 11:13.960] overhead in terms of space, but maybe it's a bit more parsimonious with memory; we'll [11:13.960 --> 11:14.960] see how it works out. [11:14.960 --> 11:17.400] It's an open question here. [11:17.400 --> 11:22.960] And additionally, with these line mark bytes being more fine-grained, it's a loss to do [11:22.960 --> 11:27.320] an eager sweep over the heap, so we do lazy sweeping: as you allocate, you just sweep [11:27.320 --> 11:31.200] one block, and then sweep another block, and then sweep another block, like that. [11:31.200 --> 11:34.360] And the good thing about that is that it parallelizes things. [11:34.360 --> 11:37.800] The bad thing is that you don't know how much data was live at the previous collection right [11:37.800 --> 11:40.240] after your collection, because you haven't swept yet. [11:40.240 --> 11:42.920] Yeah, okay. [11:42.920 --> 11:48.480] So some comparisons of Whippet with the Boehm collector, and there are a number [11:48.480 --> 11:49.920] of different points here. [11:49.920 --> 11:51.840] So one of them is you can move values. [11:51.840 --> 11:59.080] If every edge in your graph is potentially conservative, then you can't move anything, [11:59.080 --> 12:05.520] because you could find an edge that keeps an object live and doesn't allow moving late [12:05.520 --> 12:07.280] in the trace. [12:07.280 --> 12:11.000] But if you can partition your edges into a set that's conservative and a set that's not [12:11.000 --> 12:15.120] conservative, a set that's precise, you do the conservative ones first, and any object [12:15.120 --> 12:18.480] which isn't reached in that conservative trace is then movable. [12:18.480 --> 12:22.480] So what happens is you mark the stack first, and you mark in place; you don't evacuate. [12:22.480 --> 12:26.280] That is an implicit pin on every object that you mark. [12:26.280 --> 12:28.880] And then you go and you mark the heap, and if you find another object there, you can [12:28.880 --> 12:32.800] evacuate at that point. [12:32.800 --> 12:38.600] And then in Whippet, if we see that the heap is fragmented, we can turn evacuation on, [12:38.600 --> 12:41.560] and if we see the heap is not fragmented, we can always mark in place and [12:41.560 --> 12:47.800] not incur the overhead of copying. [12:47.800 --> 12:51.360] There is also explicit pinning for various reasons.
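A sketch of the two-phase trace just described, with invented function names: conservative roots are visited first and marked in place, which implicitly pins everything they reach, and only objects first reached by the later precise trace are candidates for evacuation.

```c
/* All names are hypothetical, standing in for the collector's internals. */
extern void scan_conservative_roots(void (*visit)(void *));
extern void trace_heap_precisely(void (*visit)(void *));
extern void mark_in_place(void *obj);          /* marks without moving: a pin */
extern void mark_maybe_evacuating(void *obj);  /* may copy to an empty block  */

void collect(void) {
  /* Phase 1: conservative edges first; every object found here is pinned. */
  scan_conservative_roots(mark_in_place);
  /* Phase 2: precise trace; objects first reached here are movable.       */
  trace_heap_precisely(mark_maybe_evacuating);
}
```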
[12:51.360 --> 12:59.000] We can shrink the heap, which is nice, because these blocks are multiples of the OS page size; [12:59.000 --> 13:03.600] they're easy to return to the OS whenever we find that a block is empty, and you can [13:03.600 --> 13:08.880] just mark it as being empty and madvise it, MADV_DONTNEED, and if you ever need it [13:08.880 --> 13:13.640] again, you can pull it right back in; it's zeroed by the OS. [13:13.640 --> 13:20.080] And additionally, there's a possibility to use adaptive heap sizing techniques, such [13:20.080 --> 13:27.200] as the one that I link here; it's an online algorithm that depends on what your current [13:27.200 --> 13:29.680] cost of GC is and how fast you are allocating. [13:29.680 --> 13:34.440] So a process which sort of stops and goes quiet gets its memory slowly reduced to the [13:34.440 --> 13:35.440] minimum. [13:35.440 --> 13:37.040] You can fit more on a system. [13:37.040 --> 13:42.840] And we can also do generational collection, if we want to, using the sticky mark-byte [13:42.840 --> 13:48.120] algorithm, which I link to here; it's described more in that post. [13:48.120 --> 13:52.880] For some programs, it doesn't make a difference, because some data isn't very generation-friendly. [13:52.880 --> 13:59.040] This is the case of the first mt-gcbench pair over there, where the first bar is Whippet [13:59.040 --> 14:02.080] without generational collection, and the second is with. [14:02.080 --> 14:07.440] But in some cases, it's very effective, like in this one: I'm making a bunch of quadtrees, [14:07.440 --> 14:12.960] and it pretty much doubles the throughput of the system. [14:12.960 --> 14:18.200] Additionally, with Whippet, we scale a lot better for multiple allocator threads. [14:18.200 --> 14:23.320] In BDW, you have these size-segregated free lists, the free lists of size two, three, [14:23.320 --> 14:28.160] four, and that sort of thing, and you need to lock the heap to sweep and find more and [14:28.160 --> 14:29.720] fill those free lists. [14:29.720 --> 14:36.120] In Whippet, you use uncontended atomic ops to obtain the next block, just basically incrementing [14:36.120 --> 14:43.320] a counter, because the blocks are contiguous in these two-megabyte slabs, and you sweep [14:43.320 --> 14:44.480] without contention. [14:44.480 --> 14:49.800] So these are two graphs showing the time it takes as problem size increases and the number [14:49.800 --> 14:51.360] of mutator threads increases. [14:51.360 --> 14:58.120] So at each step, I'm adding an additional mutator, an additional thread, doing the same [14:58.120 --> 14:59.120] amount of work. [14:59.120 --> 15:04.280] So with two mutator threads, the heap is twice as big as it was with one, and with eight, [15:04.280 --> 15:06.160] it's eight times as big as it was with one. [15:06.160 --> 15:07.960] So we do expect to see some increase. [15:07.960 --> 15:16.680] What we see is that BDW takes more time, ultimately; like, it's at nine seconds with eight [15:16.680 --> 15:21.760] mutator threads, whereas we're only at three and a half with Whippet; it scales much better when [15:21.760 --> 15:22.760] you're adding allocators. [15:22.760 --> 15:27.440] And this is with a single marker thread, so we expect to see some increase as the problem [15:27.440 --> 15:30.280] size gets larger. [15:30.280 --> 15:32.520] This is, what do you call that?
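Two small sketches of the mechanics described above, with invented names and Linux-specific behavior assumed. First, an empty block can be handed back to the OS with madvise(MADV_DONTNEED) and comes back zeroed if it is ever touched again. Second, because blocks are contiguous within a slab, an allocator thread can claim the next block to sweep with one uncontended atomic increment rather than taking a heap lock (bounds handling omitted for brevity).

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>

#define BLOCK_SIZE (64UL * 1024)   /* 64 kB blocks, a multiple of the page size */

/* Return an empty block's pages to the OS; they are reclaimed lazily and
   re-fault as zeroed memory if the block is used again. */
void release_empty_block(void *block) {
  madvise(block, BLOCK_SIZE, MADV_DONTNEED);
}

extern uintptr_t slab_base;        /* hypothetical: base of a 2 MB slab  */
extern atomic_size_t next_block;   /* hypothetical: shared block cursor  */

/* Claim the next block without contention: just bump a shared counter. */
void *acquire_block(void) {
  size_t idx = atomic_fetch_add(&next_block, 1);
  return (void *)(slab_base + idx * BLOCK_SIZE);  /* caller sweeps it lazily */
}
```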
[15:32.520 --> 15:36.640] It's like when you make a quilt, apparently you're supposed to put a part in it that's [15:36.640 --> 15:41.160] incorrect because you don't want to show too much pride in the face of God, right? [15:41.160 --> 15:43.480] It's like a touch-of-the-hand sort of thing. [15:43.480 --> 15:49.000] This is my humility slide, showing Whippet being slower than BDW-GC on this one machine. [15:49.000 --> 15:52.480] I have no idea what's going on with this, because I remeasured it on my other machine. [15:52.480 --> 15:53.720] It looks much better. [15:53.720 --> 15:58.480] But it does show that as you add marker threads, things improve, although I don't [15:58.480 --> 16:03.320] understand the relative BDW-versus-Whippet result right there, so that's a question. [16:03.320 --> 16:09.440] So with the heap with twice as much memory as the problem takes, as we add markers, things [16:09.440 --> 16:15.480] get better for both BDW and Whippet, but a little bit better for Whippet. [16:15.480 --> 16:17.200] So, ephemerons. [16:17.200 --> 16:21.600] This is weak maps like you have in JavaScript, where you have keys associated with values, [16:21.600 --> 16:23.800] but what if the value references the key? [16:23.800 --> 16:26.840] Can you have a circular reference? [16:26.840 --> 16:30.080] Could the weak reference, does it leak memory? [16:30.080 --> 16:31.080] I don't know. [16:31.080 --> 16:33.560] You people have heard about ephemerons, I would imagine. [16:33.560 --> 16:35.960] You cannot do them in the Boehm collector. [16:35.960 --> 16:37.600] It's impossible, right? [16:37.600 --> 16:41.400] I've tried a lot and thought about this, but with Whippet we have them. [16:41.400 --> 16:45.920] You really need deep GC integration to implement ephemerons. [16:45.920 --> 16:47.600] Right, and precision. [16:47.600 --> 16:50.000] So with BDW, you're always stack-conservative. [16:50.000 --> 16:57.400] You're always scanning the stack for things that smell like pointers, right, smelly integers, [16:57.400 --> 17:00.120] integers that could point to the heap. [17:00.120 --> 17:05.880] And it's often configured in such a way that every edge on the heap is also conservative. [17:05.880 --> 17:08.920] And with Whippet we can configure it in a number of different ways. [17:08.920 --> 17:14.960] And where we're probably heading in the near-to-mid term is a conservative scan of the C stack, [17:14.960 --> 17:19.640] a precise scan of the Scheme stack, and a precise scan of the heap. [17:19.640 --> 17:23.360] So we will be able to get the advantages of motion and compaction and all that. [17:23.360 --> 17:27.600] But we could move to a fully precise stack as well. [17:27.600 --> 17:29.120] And potentially things get better. [17:29.120 --> 17:30.680] BDW-GC is terrible to hack on. [17:30.680 --> 17:35.680] I just counted: it's like 15 or 16% C preprocessor directives. [17:35.680 --> 17:39.360] You can imagine it's probably 90% of the code is covered by ifdefs. [17:39.360 --> 17:41.640] It's really, really hard. [17:41.640 --> 17:42.640] Right. [17:42.640 --> 17:51.680] So some more words about how it is that we are, we are, that royal we, right, okay, [17:51.680 --> 17:57.400] working on getting Whippet implemented in such a way that it could land in Guile and [17:57.400 --> 18:05.600] not break the world, because I'm going to make a confession. [18:05.600 --> 18:11.960] I don't maintain software; I develop software, and I throw it over the wall, and I forget about [18:11.960 --> 18:12.960] it.
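Since ephemerons come up above, a conceptual sketch of their semantics may help: the value is traced only once the key is proven live by some other path, which is why a plain weak pointer can't express them and why deep GC integration is needed. All names here are invented.

```c
struct ephemeron {
  void *key;    /* held weakly: does not by itself keep the key alive */
  void *value;  /* traced only if the key is independently live       */
};

extern int  is_marked(void *obj);  /* hypothetical mark-bit query */
extern void trace(void *obj);      /* hypothetical tracer         */

/* Called repeatedly over pending ephemerons during marking, until a
   fixed point is reached; returns 1 once the value has been traced. */
int maybe_trace_ephemeron(struct ephemeron *e) {
  if (e->key && is_marked(e->key)) {
    trace(e->value);  /* key proven live elsewhere: the value survives */
    return 1;
  }
  return 0;           /* not yet: the key may still be marked later    */
}
```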
[18:12.960 --> 18:16.600] So if I'm going to get bugs in the garbage collector, I'd better not start [18:16.600 --> 18:21.800] because, you know, I'm not going to fix them. [18:21.800 --> 18:30.400] So the repository is here; it is designed to be an embed-only library, kind of like an [18:30.400 --> 18:34.720] include-style library, but you actually do separate compilation. [18:34.720 --> 18:39.040] But it's something that you include in your source tree because it needs to be specialized [18:39.040 --> 18:41.800] with respect to the program that's using it. [18:41.800 --> 18:47.720] In the case of Guile, Guile will tell Whippet how to put a forwarding pointer in an object, [18:47.720 --> 18:52.800] for example, how to do a precise trace of the heap. [18:52.800 --> 18:57.400] And then we also specialize Whippet with respect to the domain. [18:57.400 --> 19:04.000] So what should we scan conservatively, what should we scan precisely, that sort of thing. [19:04.000 --> 19:12.040] We use LTO, and it appears to remove the overhead of the separate compilation — [19:12.040 --> 19:14.560] link-time optimization. [19:14.560 --> 19:20.480] I'm actually suspecting LTO for that other graph that I showed you. [19:20.480 --> 19:25.960] So we actually managed to get performance and abstraction at the same time by being [19:25.960 --> 19:26.960] inspired by MMTk. [19:26.960 --> 19:31.360] MMTk is the Memory Management Toolkit; it's fantastic. [19:31.360 --> 19:36.980] It's a library of garbage collectors and techniques and experience and knowledge, currently [19:36.980 --> 19:46.240] written in Rust, formerly part of the Jikes research JVM, but now retargeting to OpenJDK [19:46.240 --> 19:48.960] and V8 and a number of other systems. [19:48.960 --> 19:53.320] We could actually slot this into Guile if we wanted to at some point. [19:53.320 --> 19:58.960] But we have enough information exposed in the API to allow a JIT to use that exposed [19:58.960 --> 20:05.400] information and generate machine code for the fast path for allocation, for example. [20:05.400 --> 20:10.320] And by having a real abstraction barrier between the two sides, we allow both sides [20:10.320 --> 20:14.720] to evolve at their own pace. [20:14.720 --> 20:17.920] And when we think about migrating Guile to Whippet, which is kind of where I want to [20:17.920 --> 20:25.000] go here, I know the talk description kind of oversold it, right; it's like, [20:25.000 --> 20:29.320] now we have a new production garbage collector in Guile — it's not there yet. [20:29.320 --> 20:33.640] So this abstract API can be implemented by the current garbage collector being used by [20:33.640 --> 20:36.920] Guile, by the Boehm collector, the BDW collector. [20:36.920 --> 20:40.880] And so that's going to be the first step: to switch Guile over to use the new API [20:40.880 --> 20:43.480] but still use the old collector implementation. [20:43.480 --> 20:47.320] And then we can look at switching to Whippet, and that wouldn't, ideally, require any code changes [20:47.320 --> 20:51.480] in Guile. [20:51.480 --> 20:56.320] I mean, so you have the Whippet API, but then you have the Whippet garbage collector [20:56.320 --> 20:58.000] implementation, the algorithm that we were talking about.
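To give a flavor of what "the embedder tells Whippet how to forward and trace objects" could look like, here is a hypothetical sketch of such an embedder interface. These declarations are illustrative only and are not Whippet's actual API; the struct-wrapped word is explained further below.

```c
#include <stdint.h>

/* A struct-wrapped word rather than a void*, per the project's style. */
struct gc_ref { uintptr_t value; };

/* Functions the embedder (here, Guile) would supply and the collector
   would call during evacuation and precise tracing: */
void gc_object_forward(struct gc_ref obj, struct gc_ref new_location);
int  gc_object_is_forwarded(struct gc_ref obj, struct gc_ref *new_location);
void gc_trace_object(struct gc_ref obj,
                     void (*visit_edge)(struct gc_ref *edge, void *data),
                     void *data);
```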
[20:58.000 --> 21:02.760] There are a lot of variants on the algorithm. So now, these are different [21:02.760 --> 21:07.920] ways you can configure Whippet on two different tests: one is mt-gcbench, and one is [21:07.920 --> 21:09.320] quads here. [21:09.320 --> 21:17.120] And going across, we can first see serial Whippet: one marker, one marking thread; it's not going [21:17.120 --> 21:18.440] to be parallel marking. [21:18.440 --> 21:21.680] That's the first light blue bar on both of those graphs. [21:21.680 --> 21:27.960] And then we have parallel Whippet; four markers in this case is what I was measuring. [21:27.960 --> 21:33.120] It improves things in some cases, a little bit in other cases, minor improvements. [21:33.120 --> 21:38.800] Generational Whippet: collect more recently allocated objects more frequently than older [21:38.800 --> 21:39.960] objects. [21:39.960 --> 21:43.080] Parallel generational Whippet: four markers and generational. [21:43.080 --> 21:47.720] And then after that there are four more bars which are the same thing, but collecting stack [21:47.720 --> 21:50.440] roots conservatively. [21:50.440 --> 21:54.720] The previous four bars are a precise scan of the stack, and then the [21:54.720 --> 22:01.320] next four bars are a conservative scan, and as you'll note, it actually performs better. [22:01.320 --> 22:07.180] And there are two reasons for this. One, conservative scanning can actually reduce the lifetime [22:07.180 --> 22:12.880] of objects: if the compiler determines that an object isn't needed at any given point, [22:12.880 --> 22:18.160] it can reuse its register or stack slot or what have you, whereas you have to wait for [22:18.160 --> 22:24.040] the unregister part of a registration/deregistration API if you're using precise roots. [22:24.040 --> 22:29.880] And the other thing is that when using this API from C, I don't actually have cooperation [22:29.880 --> 22:34.480] from the compiler where it's going to write out a table of where all the values are. [22:34.480 --> 22:38.400] I have to explicitly say, and now remember this one, okay, now forget it. [22:38.400 --> 22:40.080] And now remember this one, and now forget it. [22:40.080 --> 22:41.360] And that's overhead, right? [22:41.360 --> 22:46.400] And by doing a conservative scan, I remove that overhead. [22:46.400 --> 22:48.800] And then the final two bars are a fully heap-conservative scan; I didn't include generational [22:48.800 --> 22:53.120] because it doesn't really make sense in this context. [22:53.120 --> 22:57.600] We increase a lot on this mt-gcbench benchmark because it allocates a very big array, and [22:57.600 --> 23:03.560] I don't have the equivalent of pointerless allocation that the BDW API gives you. [23:03.560 --> 23:07.120] So we end up tracing all the elements of that really big array, which gives a big spike [23:07.120 --> 23:08.360] over there. [23:08.360 --> 23:11.760] And in the case of quads, we never have large objects; we're always tracing everything anyway, [23:11.760 --> 23:13.440] and it doesn't really matter. [23:13.440 --> 23:17.360] But heap-conservative does slow you down relative to just having stack-conservative. [23:17.360 --> 23:18.360] Right.
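The register/deregister overhead described above looks roughly like this in hand-written C; gc_register_root and friends are invented names standing in for whatever precise-root API a runtime exposes. With a conservative stack scan, both calls simply disappear.

```c
extern void gc_register_root(void **slot);    /* hypothetical */
extern void gc_unregister_root(void **slot);  /* hypothetical */
extern void *allocate_something(void);
extern void do_something_that_may_gc(void);
extern void use(void *obj);

void example(void) {
  void *obj = allocate_something();
  gc_register_root(&obj);    /* "now remember this one" */
  do_something_that_may_gc();
  use(obj);
  gc_unregister_root(&obj);  /* "okay, now forget it": explicit overhead,
                                and the object stays live until this call */
}
```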
[23:18.360 --> 23:26.840] And then as a project, it's written in C, which I know is a sin, but Guile has this sort of [23:26.840 --> 23:31.840] odd place in the supply chain of Guix, and it's useful to depend on a more minimal set [23:31.840 --> 23:34.360] of things rather than using Rust, for example. [23:34.360 --> 23:41.840] But it's a relatively modern C: it uses stdatomic, it uses things in a way that are constexpr-ish, [23:41.840 --> 23:46.000] in a way that you know that the compiler is going to reduce them down. [23:46.000 --> 23:51.640] It avoids void pointers completely, using instead structs containing a single number, [23:51.640 --> 23:55.760] which get boiled away by the compiler as well, and which can't be cast to each other; you need [23:55.760 --> 23:59.720] explicit conversions; that way you won't confuse a conservative reference with a precise reference, [23:59.720 --> 24:01.360] and things like that. [24:01.360 --> 24:05.760] And we don't actually have any API or ABI concern at all, because it's an embed-only [24:05.760 --> 24:06.760] library. [24:06.760 --> 24:11.320] If something breaks, don't update it. [24:11.320 --> 24:15.080] And it does have a bit of an abstraction for how you find conservative roots on whatever [24:15.080 --> 24:16.560] your platform is. [24:16.560 --> 24:18.960] It's not so bad, it turns out. [24:18.960 --> 24:26.040] So if we think about when it is that this might reach Guile, then we are, it's when [24:26.040 --> 24:29.000] we can, right, you know, in the end. [24:29.000 --> 24:31.800] This is kind of a side project for me. [24:31.800 --> 24:41.400] I have other side projects, children, you know, so I can't really give an ETA here, [24:41.400 --> 24:47.720] but I would mention that there are a few things to do, and what we might end up with is that [24:47.720 --> 24:51.440] we could get a new release series for Guile, which I think is what would be required [24:51.440 --> 24:57.320] for this, maybe starting in six months or so, just switching over to the API and staying [24:57.320 --> 25:02.480] with the Boehm collector, and maybe we could release a new stable version in another six [25:02.480 --> 25:06.120] months or really a little bit more. [25:06.120 --> 25:07.760] But we'd have to do a few things to get there. [25:07.760 --> 25:14.200] Whippet is mostly done, with the exception of actually growing and shrinking the heap, implementing [25:14.200 --> 25:22.560] finalizers, and having an API for checking in with Whippet, checking in with the GC as [25:22.560 --> 25:26.880] to when a mutator should stop, because that's one other thing that the BDW collector does: [25:26.880 --> 25:32.360] it uses signals to stop all the threads, whereas Whippet relies on periodic safepoints. [25:32.360 --> 25:34.880] There are trade-offs. [25:34.880 --> 25:39.440] In Guile we'd have to switch over to these safepoints; I think it's possible. [25:39.440 --> 25:43.600] And I think we would start with a heap-conservative Whippet, just because it's the same thing that [25:43.600 --> 25:52.120] we do with the BDW collector, and then we'd move over to a precise scan of the heap. [25:52.120 --> 25:57.720] When we get to a precise scan of the heap, we have to implement a few things on the Guile [25:57.720 --> 25:58.720] side. [25:58.720 --> 26:03.200] There are some hazards about current uses of the API.
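The single-member-struct idiom mentioned a moment ago is easy to show; this sketch uses names in the spirit of the talk, but treat them as illustrative. The wrapper compiles down to a bare machine word, and mixing the two reference types without an explicit conversion is a compile-time error.

```c
#include <stdint.h>

struct gc_ref              { uintptr_t value; };  /* a precise reference */
struct gc_conservative_ref { uintptr_t value; };  /* a maybe-pointer from a scan */

/* Struct tags and function names live in separate namespaces in C, so a
   constructor can share the name of its type. */
static inline struct gc_ref gc_ref(uintptr_t value) {
  return (struct gc_ref){ value };
}

/* Conversions must be spelled out; passing one type where the other is
   expected simply doesn't compile, so a conservative reference can't be
   silently confused with a precise one. */
static inline struct gc_ref
gc_ref_from_conservative(struct gc_conservative_ref ref) {
  return gc_ref(ref.value);
}
```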
[26:03.200 --> 26:10.400] In particular, if a third-party user ever allocates an object and then stuffs something [26:10.400 --> 26:15.880] in it that Guile doesn't know about, is it an integer or is it a pointer to the heap? [26:15.880 --> 26:22.680] And there are a couple of places where people can do that that are unclear. [26:22.680 --> 26:28.440] And we can't allow this if we want to trace the heap precisely and move objects. [26:28.440 --> 26:37.040] So this might require some small API changes and API breaks, because it's a new series, [26:37.040 --> 26:38.040] around this area. [26:38.040 --> 26:41.680] It might actually be time to remove smobs entirely, possibly. [26:41.680 --> 26:47.200] So that's what's actually pushing us to a new major release. [26:47.200 --> 26:53.960] So in summary, Whippet is a new GC; it's a replacement for BDW-GC. [26:53.960 --> 27:00.880] It has the potential to reach a new local maximum, better than BDW. [27:00.880 --> 27:02.720] And I think we can get it into Guile 3.2. [27:02.720 --> 27:09.320] I would like to thank particularly the MMTk people for inspiration and discussions, because [27:09.320 --> 27:13.800] it's been really helpful to be able to properly learn about garbage collection over the last [27:13.800 --> 27:14.800] year or so. [27:14.800 --> 27:16.400] I'll leave you with one slide. [27:16.400 --> 27:21.400] When you evaluate a GC, you need to do so with a space-time diagram, because GC is a [27:21.400 --> 27:23.720] function, a trade-off, of space and time. [27:23.720 --> 27:28.440] So on the x-axis, you should have your heap size as a function of what the minimum [27:28.440 --> 27:29.440] heap size is. [27:29.440 --> 27:36.320] Here, I measured some algorithms at 1.3x, 1.5x, 1.75x, 2, 2.5, 3, 4, 5, and 6, or just [27:36.320 --> 27:37.320] to 5, [27:37.320 --> 27:41.400] I don't know. On the y-axis, you should have whatever you're measuring, be it instructions [27:41.400 --> 27:47.440] retired or wall-clock time or memory or something like that, because the heap size, and [27:47.440 --> 27:52.280] the response to heap size, is one of the fundamental trade-offs in GC. [27:52.280 --> 27:57.400] Here we show, actually, the BDW collector, a semi-space collector, which is [27:57.400 --> 28:05.480] also implemented behind the Whippet API, and the Whippet algorithm: serial, one marker, one [28:05.480 --> 28:07.200] mutator on this benchmark. [28:07.200 --> 28:10.520] We see performance as we change heap size. [28:10.520 --> 28:13.960] Whippet is the only one that gets to 1.3x. [28:13.960 --> 28:17.960] This is an analytical calculation of how big the heap should be. [28:17.960 --> 28:21.960] It's not measured as to how small I can get anything to run, but it's like what I think [28:21.960 --> 28:24.400] the heap should take. [28:24.400 --> 28:30.640] So it might not precisely be 1.3; it might be 1, you know, it's a number in that range. [28:30.640 --> 28:31.640] It can get to the smallest. [28:31.640 --> 28:33.440] It takes a bit of effort to do so. [28:33.440 --> 28:38.520] As you become more parsimonious with your heap, you end up tracing it more. [28:38.520 --> 28:42.120] So the curve goes up on that side, but it's the only one that actually gets to that point [28:42.120 --> 28:44.360] on the x-axis. [28:44.360 --> 28:49.920] And then it quickly passes, and you want these numbers to be low. [28:49.920 --> 28:50.920] That's what you want.
[28:50.920 --> 28:57.840] It quickly passes BDW-GC; there's only one point where it takes more time than BDW-GC, and [28:57.840 --> 28:58.840] that's concerning. [28:58.840 --> 28:59.840] I need to fix that one. [28:59.840 --> 29:00.840] Then we see the green line. [29:00.840 --> 29:02.160] This is a semi-space collector. [29:02.160 --> 29:07.080] Semi-space, as you add memory, it gets easier and easier and easier, right, because it depends [29:07.080 --> 29:09.440] only on the size of the live data. [29:09.440 --> 29:13.240] Whereas Whippet and BDW need to sweep the heap. [29:13.240 --> 29:16.800] So as you add memory, it sort of plateaus. [29:16.800 --> 29:18.320] It doesn't keep on going down. [29:18.320 --> 29:19.640] I don't know why it goes up at the end. [29:19.640 --> 29:21.600] This is my other little touch of the hand. [29:21.600 --> 29:22.600] I don't know. [29:22.600 --> 29:25.400] That looks like a bug to me. [29:25.400 --> 29:27.160] So that's something to fix. [29:27.160 --> 29:28.880] Anyway, there's Whippet. [29:28.880 --> 29:31.520] Thank you for enduring this blathering. [29:31.520 --> 29:36.400] And good luck, everybody, in about 18 months when this starts rolling out to Guix. [29:36.400 --> 29:38.960] Just joking, because I won't be around. [29:38.960 --> 29:39.960] Good. [29:39.960 --> 29:46.960] So I'll take any questions. [29:46.960 --> 29:51.520] Even dumb questions. [29:51.520 --> 29:52.520] That's okay. [29:52.520 --> 29:53.520] Yes, sir? [29:53.520 --> 30:01.840] It seems to me like conservative stack scanning is incompatible with AddressSanitizer from [30:01.840 --> 30:02.840] LLVM or GCC. [30:02.840 --> 30:09.960] So how do you debug address bugs in the GC? [30:09.960 --> 30:15.240] So the question is, conservative stack scanning seems to be incompatible with AddressSanitizer [30:15.240 --> 30:16.240] from LLVM or GCC. [30:16.240 --> 30:23.280] I'm a professional C++ developer, and I work on web browsers. [30:23.280 --> 30:26.160] I don't know what AddressSanitizer does. [30:26.160 --> 30:29.720] I know it gives me bugs sometimes and tells me things I have to fix, but I don't know [30:29.720 --> 30:30.720] what's going on there. [30:30.720 --> 30:31.720] I should know. [30:31.720 --> 30:33.720] Can you tell us, why is it incompatible? [30:33.720 --> 30:40.720] Basically, every time you access something that wasn't registered properly via malloc, [30:40.720 --> 30:46.280] for example, or alloca, it tells you you're in the red zone or you're in something that [30:46.280 --> 30:47.280] doesn't work. [30:47.280 --> 30:54.360] So to scan your whole stack, only part of it is actually valid. [30:54.360 --> 31:03.360] So the answer is that it only signals warnings if you ever access a value after it's been [31:03.360 --> 31:04.360] freed. [31:04.360 --> 31:05.360] Is that right? [31:05.360 --> 31:09.920] For example, you are in a function and you access something that wasn't registered. [31:09.920 --> 31:14.320] I think it's actually not a problem, because we don't trigger the malloc/free detection [31:14.320 --> 31:15.320] at all. [31:15.320 --> 31:20.520] We're a completely third-party allocator: we mmap the pages, and we're just reading [31:20.520 --> 31:25.160] values from those pages, and so it doesn't trigger the particular logic there, which [31:25.160 --> 31:28.320] also means you have no tool support. [31:28.320 --> 31:34.760] You're in the wild west, with the bugs that go with it, so yeah, I guess that's the answer [31:34.760 --> 31:35.760] there.
[31:35.760 --> 31:36.760] Yes, so the question: [31:36.760 --> 31:37.760] how will this affect Guix users? [31:37.760 --> 31:45.280] Well, this will affect Guix users in the sense that, one, I hope that when you rebuild [31:45.280 --> 31:51.040] the system, Guix launches multiple threads to compile things. [31:51.040 --> 31:54.000] And as we see, there is contention in BDW-GC. [31:54.000 --> 31:57.280] It doesn't actually scale very well as you add threads if you have an allocation-heavy [31:57.280 --> 31:58.280] workload. [31:58.280 --> 32:04.280] And so I think that when Guile incorporates Whippet, Guix with multiple threads should scale [32:04.280 --> 32:05.280] better. [32:05.280 --> 32:12.400] In addition, we will be able to have better tooling for understanding the heap and [32:12.400 --> 32:19.400] heap usage, and ideally, be able to place ourselves better on the kind of space-time [32:19.400 --> 32:25.360] trade-off: if you need more throughput, give it a bigger heap, but also let it shrink. [32:25.360 --> 32:29.040] And that can also affect longer-running daemons like the Shepherd and things like that. [32:29.040 --> 32:31.880] So it should yield a more robust system. [32:31.880 --> 32:32.880] Yes? [32:32.880 --> 32:42.880] There are some architectures which use 64-kilobyte pages. [32:42.880 --> 32:45.040] Would that be a problem with using 16K blocks? [32:45.040 --> 32:48.440] They're actually 64-kilobyte blocks. [32:48.440 --> 32:54.600] So I think I chose the least common multiple, or whatever. [32:54.600 --> 32:59.080] It's configurable, but I think the default size is such that they are large enough for [32:59.080 --> 33:01.320] any common architecture. [33:01.320 --> 33:07.320] The question was about page size, whether 16 kilobytes is big enough for blocks, but it's actually [33:07.320 --> 33:08.320] 64 kilobytes. [33:08.320 --> 33:09.320] Yes? [33:09.320 --> 33:15.320] With the collection, would it ever have to stop all threads simultaneously, or do the [33:15.320 --> 33:16.320] threads stop at different locations? [33:16.320 --> 33:17.320] Basically, is there a stop-the-world? [33:17.320 --> 33:18.320] Yeah. [33:18.320 --> 33:19.320] Yeah, that's a very good question. [33:19.320 --> 33:20.320] I didn't mention this. [33:20.320 --> 33:22.520] So this is a stop-the-world collector. [33:22.520 --> 33:28.840] It's not a concurrent collector, with the exception that threads mark their own stacks while other [33:28.840 --> 33:29.840] threads are running. [33:29.840 --> 33:32.400] There's a little bit of concurrency there. [33:32.400 --> 33:39.520] We may add concurrent marking at some point, but you need write barriers for that to work. [33:39.520 --> 33:43.520] And so that would be something to add once generational collection is working, because [33:43.520 --> 33:46.920] then you've proven that you have all the write barriers in the right place. [33:46.920 --> 33:53.160] A write barrier is just a little piece of code that runs whenever you store [33:53.160 --> 33:54.400] a pointer. [33:54.400 --> 33:59.800] And if write barriers can be used to indicate pointers from old objects to new objects, [33:59.800 --> 34:05.320] helping you do generational collection, they can also be used to mark an object as being [34:05.320 --> 34:11.240] allocated after the concurrent marker has already marked it in that cycle. [34:11.240 --> 34:13.040] I'm not explaining myself very well.
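A minimal sketch of the generational kind of write barrier described above, with invented helper names: a small piece of code runs at every pointer store and logs old-to-new edges into a remembered set, so that a minor collection can find pointers into the young generation without tracing the whole heap.

```c
extern int  object_is_old(void *obj);          /* hypothetical */
extern int  object_is_young(void *obj);        /* hypothetical */
extern void remembered_set_add(void **field);  /* hypothetical */

/* Every pointer store goes through this instead of a bare assignment. */
void write_field(void *obj, void **field, void *value) {
  if (object_is_old(obj) && value && object_is_young(value))
    remembered_set_add(field);  /* log the old-to-young edge for minor GC */
  *field = value;               /* the actual store */
}
```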
[34:13.040 --> 34:19.440] But basically, you need write barriers to be able to have, to be able to minimize the [34:19.440 --> 34:23.640] stop-the-world component of the mark phase. [34:23.640 --> 34:26.640] Does that answer the question? [34:26.640 --> 34:27.640] Yes? [34:27.640 --> 34:33.080] Does this simplify Guile or complicate Guile with respect to WebAssembly? [34:33.080 --> 34:34.080] Oh, yeah. [34:34.080 --> 34:35.080] It's a good question. [34:35.080 --> 34:38.120] So there's a project to compile Guile to WebAssembly. [34:38.120 --> 34:44.640] I think initially this will probably start by having a Guile library produce WebAssembly [34:44.640 --> 34:47.160] that has its own runtime. [34:47.160 --> 34:53.240] And this could grow to a whole-program standalone compiler, in which Guile has a library that [34:53.240 --> 34:58.000] takes your Guile program and spits out a native binary. [34:58.000 --> 35:01.960] And in that case, that native binary would include Whippet, embedded in it, instead of [35:01.960 --> 35:06.880] having that native binary then link to the BDW collector. [35:06.880 --> 35:08.880] So the goal would be to produce one binary that's all finished. [35:08.880 --> 35:09.880] Is that it? [35:09.880 --> 35:10.880] Thank you. [35:10.880 --> 35:11.880] Thank you very much. [35:11.880 --> 35:12.880] Thank you very much. [35:12.880 --> 35:39.400] Thank you very much.