Hello, hello friends. I'm Roman. Let's start with an overview and the motivation for why we are doing all this. What is Project Lilliput? It's an OpenJDK project, a community project: we get contributions from Red Hat, Oracle, SAP, Huawei, Alibaba, and Amazon, which is me. The goal of the project is to reduce the memory footprint of Java object headers, of which we have a lot in the Java heap. As a side effect of that goal, we will also see potential CPU and latency improvements.

Specifically, we want to achieve these heap memory reductions by reducing the size of Java object headers. To illustrate this, let's look at a hypothetical Java heap. Each of these squares represents a Java object of a different size, and everything that is yellow here is object headers. This is really just metadata for the VM, class information and such; I will get into this. It means we waste quite a lot of heap memory on nothing but metadata. With Project Lilliput, if we consider each half of those squares one word, the goal is to cut these headers in half. The heap might then end up looking like this, with much less metadata per object, and we free up a lot of space at the bottom. That is the savings.

Another way to look at this is the breakdown of metadata versus actual payload. I did some statistics when I started this project: around 20% of live data on the heap is actually just metadata in the object headers. This depends a lot on the workload; some workloads have a much higher ratio, some much lower, but this is roughly the average. That is the current situation, where headers are 12 bytes large. With Project Lilliput, in the first step, we want to go to 8-byte headers, which brings the metadata ratio down to about 13% on average; that gives average savings of around 7 to 10%, up to maybe 30% or even a bit more. The long-term goal with Lilliput is headers that are only 4 bytes large, and if we achieve this, only about 6% of live data is object headers, with savings of up to 50% compared to the current situation.

We have done most of the work already, and we already run some services at Amazon with Lilliput. This chart shows the heap usage of such a service after GC: when Lilliput was deployed, it dropped by about 30%. I must admit this is a bit of a sweet-spot case, not all workloads look like this, but it shows quite well how this works. As a side effect, the CPU utilization also dropped by about 25% when Lilliput got deployed, which was quite significant. It had an effect on latency too; latency dropped by about 30%.

So much for the motivation part. Helpfully, Aleksey built the JOL tooling for this. If you have a workload and want to know what it would look like running on Lilliput, you can generate a heap dump of your workload and run it through the JOL tool, and you get some nice estimates of your heap utilization under different configurations, including Lilliput. It will tell you, you would save such-and-such percent with Lilliput, and you can even look into the crystal ball and get estimates for Lilliput 2.
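As a side note, you can also see today's 12-byte header for yourself on individual objects. Here is a minimal sketch, assuming the JOL library (org.openjdk:jol-core) is on the classpath; the exact output depends on your JVM version and flags, and recent jol-cli builds expose the heap-dump estimates mentioned above as a separate operation.

```java
// Minimal sketch using the OpenJDK JOL library (org.openjdk:jol-core)
// to inspect object header overhead on a stock JVM.
import org.openjdk.jol.info.ClassLayout;

public class HeaderDemo {
    public static void main(String[] args) {
        // On a typical 64-bit JVM this shows a 12-byte header:
        // an 8-byte mark word plus a 4-byte compressed class pointer.
        System.out.println(ClassLayout.parseInstance(new Object()).toPrintable());

        // For arrays, the array length sits right after the header.
        System.out.println(ClassLayout.parseInstance(new int[4]).toPrintable());
    }
}
```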
Those estimates are necessary because heap dumps don't actually include the header metadata, so the JOL tool has to do a lot of calculation and estimation to come up with a number, but it's pretty useful. At the end of the day, what does all this do? It reduces your hardware or cloud cost: you can save on instance cost, or keep the same hardware and drive more load on it. It can also help reduce your energy bill, which is good for the climate too.

So let's look at what's going on there, in the object headers. This is a breakdown of the current situation. We have one word at the beginning of each object which we call the mark word, for historical reasons; it's really just a header that contains stuff. Of those 64 bits, two bits are tag bits, or lock bits: they indicate what the rest of the bits mean. Usually we have four GC age bits, counting 0 to 15, and then the hash code bits: when you call the identity hash code of an object, the VM generates some number and sticks it here. And then there are some unused bits up at the top.

The second word of each object contains the so-called class pointer, which points to the internal structure in HotSpot that tells you all about the class of this object. What you usually get with modern JVMs is the compressed class pointer, which is 32 bits; I don't want to talk about the uncompressed one. Thomas is going to cover in a lot of detail how we deal with that. This means the first two words of each object are taken by metadata. For many objects we can already stick actual payload into the other 32 bits of that second word; for arrays, we put the array length there. And what we want to do, you can probably see it already: we have the class pointer here and free space in the mark word there, so we can probably stick it in there and then see how to deal with the remaining bits.

The problem is that we also have so-called displaced, or overloaded, headers. This happens when locking happens, or when the GC does some things. The tag bits then indicate what the rest of the bits mean, and usually it's a native pointer to some structure. This means those upper bits are not actually always unused; they are only sometimes unused, and that is a major problem. I'll get to how we want to deal with it. We also still have to compress the class pointer further. Examples of when this happens: for a stack-locked object, the tag bits are 00 and the rest of the word is a pointer onto the stack. For an object locked by an object monitor, the tag bits are 10 and the word points to an ObjectMonitor structure that lives somewhere. Or the GC can use the word to store forwarding information; then the tag bits are 11 and the word points to the forwarded copy of the object somewhere else in the heap.

The object monitor case is exactly the situation I just talked about. If that happens, where is the original mark word? Say we had a hash code in there; we need to preserve it somewhere. The answer is that it is displaced: in the object monitor case, it is stored at the beginning of the ObjectMonitor structure. This is fiddly stuff.
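To make the bit assignments concrete, here is an illustrative Java sketch of the 64-bit mark word just described. The masks follow HotSpot's conventional layout (tag bits at the bottom, four age bits, a 31-bit identity hash), but the class and method names are hypothetical, not VM source.

```java
// Illustrative decoding of the current 64-bit mark word. The constants
// mirror HotSpot's conventional layout; this is a sketch, not VM code.
final class MarkWord {
    static final long TAG_MASK     = 0b11;  // low tag/lock bits
    static final int  UNLOCKED     = 0b01;  // normal object
    static final int  STACK_LOCKED = 0b00;  // rest = pointer into a thread stack
    static final int  MONITOR      = 0b10;  // rest = pointer to an ObjectMonitor
    static final int  FORWARDED    = 0b11;  // rest = GC forwarding pointer

    static int tag(long mark)   { return (int) (mark & TAG_MASK); }
    static int age(long mark)   { return (int) ((mark >>> 3) & 0xF); }          // 4 GC age bits
    static int iHash(long mark) { return (int) ((mark >>> 8) & 0x7FFF_FFFFL); } // 31 identity hash bits

    // When stack- or monitor-locked, the upper bits are a native pointer,
    // so age and hash must be read from the displaced header instead.
    static boolean displaced(long mark) {
        int t = tag(mark);
        return t == STACK_LOCKED || t == MONITOR;
    }
}
```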
So here is the plan for the first step, which we are currently working on. We still have the tag bits here, or maybe not; then another bit that indicates self-forwarding; then the GC age bits, some unused bits, the identity hash code, and a few bits for the class pointer. If you count correctly, those are not 32 bits like we had before; those are fewer bits, and Thomas is going to talk about how we do that. The long-term plan for Lilliput, we call it Lilliput 2 already, is to squeeze everything into 32 bits. We would still have the tag bits, the self-forwarding bit, and the GC age, unchanged. The hash code would only fit into two bits, and that would be a very bad hash code, so no, we leave it out of the header; and the class pointer sits up at the top.

What's the problem with all this? In the current situation we have two words. The first word rarely carries any interesting information: it may carry hash code information, but usually only for a few objects; it may carry locking information, but only very few objects on the heap ever get locked. So we basically waste this word on stuff that is rarely used. The second word holds the class pointer, which is crucial information for any object, because we need it for all sorts of things; without it, the object has no type identity, and we cannot allow that. In the new world, the class pointer is part of the header, which means this header suddenly carries crucial information that we must not lose and that we always need, for things like figuring out the object size. So we must never lose that class pointer. And the header displacement I talked about earlier raises the question of how we access this information when we have to follow through to the actual mark word. Those are the three problems: how to fit everything into fewer bits, how to safely access the mark word, and how to avoid losing the class pointer to begin with. I don't have all that much time, so I can only scratch the surface of these problems and give a very high-level overview.

The first topic is locking. Stack locking is the most lightweight locking in HotSpot: it's really just a compare-and-swap on the object header; that's how it coordinates threads. It doesn't support wait/notify and it doesn't support JNI, and as soon as contention happens, or any of those things, we inflate the stack lock to what we call a full object monitor. That's why we have those two different locking modes. The way it works now: the JVM does a compare-and-swap on the object header, exchanging the current header for a pointer into the stack, and it sticks the original mark word somewhere on the stack. That's how we can find the original mark word and restore it when we unlock; then we do the opposite and swap it back. This answers the question, is this object locked by this thread? It cannot answer the question, which thread is currently locking this object, because that's not really necessary.

In Lilliput, the way we solved that, and I have only one slide on this, which is quite amazing because I basically rewrote the locking implementation for it, is this: instead of putting the original mark word on the stack and putting a pointer to that stack location into the mark word, we turn it around. We keep a small per-thread structure we call the lock stack, and we only push the object onto that lock stack; we still do a compare-and-swap, but only to flip the tag bits, nothing else. The rest of the mark word stays intact. It still answers the same question, is this thread T locking this object, and nothing else.
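Here is a toy model of that idea, a hedged sketch in plain Java: an AtomicLong stands in for the object's header word, and all names are hypothetical. The point is that the mark word keeps its contents and only the tag bits flip, while lock ownership is recorded on a small per-thread lock stack.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the new lightweight locking: CAS only the tag bits,
// record ownership on a per-thread lock stack.
final class LockStackModel {
    private static final ThreadLocal<ArrayDeque<Object>> LOCK_STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    static boolean lock(Object obj, AtomicLong header) {
        long mark = header.get();
        if ((mark & 0b11) != 0b01) return false;     // not unlocked: would inflate
        long locked = mark & ~0b11L;                 // tag 01 -> 00, all other bits kept
        if (!header.compareAndSet(mark, locked)) return false;
        LOCK_STACK.get().push(obj);                  // "does T lock obj?" = obj on T's lock stack
        return true;
    }

    static void unlock(Object obj, AtomicLong header) {
        LOCK_STACK.get().pop();                      // balanced locking assumed
        long mark = header.get();
        header.compareAndSet(mark, mark | 0b01);     // restore the unlocked tag
    }
}
```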
That is the very scratching-the-surface overview of stack locking; I cannot go deeper because time is running out. Monitor locking is a very similar situation, and Oracle engineers are currently working on a solution for that. The basic idea is, instead of mapping from an object to its object monitor through the header, to keep a side table and do the mapping there; then we don't have this header displacement anymore. I'll show a toy sketch of that idea after the GC part.

I'm also only scratching the surface on GC forwarding. Some GCs need to store forwarding information in the object header. This is used by the full GCs: the Serial GC, G1, and Shenandoah use the header to store forwarding information, and when that happens we would lose the class pointer, this crucial information, which is why it's a big problem. For normal operation it's not such a big deal, because there we have copying GCs that first create a copy of the object and stick the forwarding pointer only into the old copy; the new copy still has all the information, and we can follow through to the new copy to get to the class pointer. The full GC, because it has no space to copy objects into, slides the objects down to the bottom of the heap, and in the process it would lose the class information. The Parallel GC has a different solution here, based on the idea: let's not store the forwarding information in the header at all, let's always recompute it in some clever ways. That is basically the plan for what we want to do for the other GCs too. ZGC doesn't have this problem, because it doesn't do full GCs.
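As promised, here is the toy sketch of the monitor side-table idea. A real VM keys this table by object identity and has to cooperate with a moving GC; this sketch, with a hypothetical Monitor stand-in, ignores both concerns and only shows the shape of the mapping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the side-table idea: map objects to their monitors in a
// separate table instead of displacing the object header.
final class MonitorTable {
    // Hypothetical stand-in for HotSpot's ObjectMonitor.
    static final class Monitor { /* owner, entry list, wait set ... */ }

    private static final Map<Object, Monitor> TABLE = new ConcurrentHashMap<>();

    static Monitor monitorFor(Object obj) {
        // computeIfAbsent gives inflate-once behavior: the first contended
        // locker installs the monitor, everyone else reuses it.
        return TABLE.computeIfAbsent(obj, o -> new Monitor());
    }
}
```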
And with that: we have JEP 450, the new lightweight locking is already in JDK 21, and there is a flag to enable all of this when you grab a build from Aleksey or some other place: -XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders. With this, I'm handing over to Thomas, who wants to talk about the class pointers.

Okay, I'm Thomas, I work at Red Hat, and I'm going to talk about what we do for class pointers in Lilliput. This will be a very quick dive, because I only have 10 minutes, and that is a bit of a challenge. This is just one of many moving parts in the Lilliput project. Lilliput is a real community project in the sense that many companies contribute, and this is the part we decided to tackle at Red Hat. With that, let's start, time's ticking.

The class pointer in the Lilliput project is currently 32 bits, and that's way too much; we need it to be smaller. It takes up half of the whole 64-bit Lilliput object header. That's pretty self-explanatory. So, some background first. When we load a Java class, we build up a whole range of companion data structures in native memory, and the centerpiece of this group, kind of the big boy among them, is the Klass structure, written with a K. It's a variable-sized structure ranging from about 400 bytes up to, I don't know, we saw five-megabyte monsters at Amazon, but that's rare. Every object's header refers to the Klass structure of its class. That's why the shape of this reference really matters: for one, obviously, footprint, but also because we need to be able to dereference it very quickly; going from object to class is a hot path.

We could just store the native Klass pointer in the object header, but we usually don't, because at 64 bits it's way too big. So, since a long time ago, we employ an optimization where we split the native pointer into two parts: a 32-bit offset relative to a 64-bit runtime-constant base, and we store only the offset. That's what we call the narrow class pointer. That trick only works if we can confine all Klass structures within a four-gigabyte range, obviously, and that's exactly what the class space is for; it's the only reason it exists. Then we also have CDS, class data sharing: CDS archives contain pre-baked Klass structures, so what we usually do is map the CDS archive very close to the class space, such that both are engulfed by the class encoding range, and every Klass structure in either region can be reached by a narrow class pointer.

Decoding is then basically just an addition; I'm simplifying a bit for time reasons. The nice thing is that, from the point of view of the JIT, the encoding base is a runtime constant. It's obviously not a true constant, we determine it at VM startup, it's subject to ASLR and such, but the JIT can encode it as a 64-bit immediate. That's nice, we already save a load. And then we have a ton of optimizations that all depend on the base looking good, whatever looking good means for the specific CPU, such that we can load it with just one move; I won't get into details. One simple example: if we manage to place the class space below four gigabytes, we can set the base to zero. This is what we call unscaled encoding, and then every narrow class pointer simply is the class pointer, so we don't have to do anything at all. And we are now very good at this: one effect of the Lilliput work is that, unless the address space is really populated or the operating system flatly refuses, this is likely to happen.

For Lilliput, 32 bits is still too much, so we shrink it. There are some side goals. We still want to be able to address enough classes, and what enough means is a complicated question. We want to keep using Metaspace and CDS, because both give us a ton of features we would otherwise need to reinvent. And we also decided, for now, to keep the class space layout as it is.
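Before moving on, here is the encoding arithmetic just described as a hedged Java sketch; the names are illustrative, and base stands for the runtime constant chosen at VM startup.

```java
// Sketch of narrow class pointer encoding: a 64-bit Klass address becomes
// a 32-bit offset from a runtime-constant base.
final class NarrowKlassCodec {
    final long base;                      // chosen at VM startup, then constant

    NarrowKlassCodec(long base) { this.base = base; }

    int encode(long klassAddress) {
        // Only valid if all Klass structures live within base .. base + 4 GB,
        // which is exactly what the class space guarantees.
        return (int) (klassAddress - base);
    }

    long decode(int narrow) {
        // Decoding is just an addition; with "unscaled" encoding (base == 0)
        // the narrow value simply is the pointer.
        return base + Integer.toUnsignedLong(narrow);
    }
}
```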
As a kind of baseline: we can load about five million classes today, give or take; the class space is artificially capped at three gigabytes. I personally believe that if you manage to load five million classes, you are either very patient or probably not really aware of doing it, because it's a leak. This number is really high. The question remains how many classes we actually need to address, and that is a very complicated question I don't have time for, so we decided to sidestep it. We don't reduce the class encoding range; it stays four gigabytes, we leave it at that, and we say anything in the multi-million range is probably fine. There is still room to reduce it further, but we don't for now. What we do instead: if we manage to store all Klass structures at certain aligned addresses, we can use the alignment shadow to save other bits, and that's what we do. We decided on a 10-bit alignment, one kilobyte, based on statistics: even though Klass is variable-sized, the vast majority of Klass structures are below one kilobyte and usually larger than 512 bytes; it's a bell distribution, and outliers exist but are really rare. Ten bits of alignment give us 22-bit class pointers for now and let us address three million classes, which is still way enough, I think.

Now something interesting happens. Where we used to store Klass structures in the class space back to back, we now store them at aligned boundaries. This means the narrow class pointer behaves more like a class ID, because the class space morphs into a table of one-kilobyte slots, where every Klass structure occupies one or, very rarely, multiple slots. We use the value range of the narrow class pointer much more efficiently, because basically every value now means another class: a hypothetical 16-bit class pointer could address 65,000 classes. And that's good. Obviously, alignment hurts: we have alignment waste now. For time reasons I can't go into this; the gist of the slide is that we made Metaspace very good at allocating at aligned boundaries without footprint loss, so we don't pay for that, and we retain the allocation performance we had before. That actually cost quite a bit of work. I skip the supporting statistics.

This is the mark word layout now. Where we had a 32-bit class pointer before, we have now reduced it to 22 bits, which allowed us to inflate the identity hash back to its former 31-bit size. The story behind this: when we started with Lilliput, we had to reduce the identity hash to 25 bits, which of course has negative consequences for applications with large data sets, hash collisions and so on. And the nice thing is that we now have four free bits; I'm sure we'll find a use for them.

There are some kinks still to be ironed out. Very quickly: aligning class structures to sizes larger than the cache line size may have detrimental effects on cache efficiency. We need to look into this, and we have mitigations planned should it happen. There is also 32-bit: we are now in the weird situation that class pointers on 32-bit platforms are larger than on 64-bit. We can deal with this; I kind of hope 32-bit goes away before I have to. For the future, 16 bits is possible. If we just naively reduce the narrow class pointer to 16 bits, that would be a severe reduction in the number of loadable classes and probably not acceptable. What we can do, however, is switch to a model with a variable-sized header, where objects of the first 65,000 classes benefit from Lilliput headers, and objects beyond that get the narrow class pointer appended to the mark word. That is of course a lot more complex than what we do today, but it's possible, and maybe we'll have to do it should we ever do the 32-bit Lilliput headers as planned. This is basically not my idea; it's John Rose's idea.
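To make the new decoding concrete before the summary: with one-kilobyte slots, the 22-bit narrow class pointer is effectively a slot index, and decoding gains a shift on top of the addition. A hedged sketch, with illustrative names:

```java
// Sketch of the Lilliput scheme: Klass structures sit in 1 KB-aligned
// slots, so the narrow class pointer acts like a class ID.
final class ShiftedNarrowKlassCodec {
    static final int SHIFT = 10;          // log2 of the 1 KB slot size
    final long base;

    ShiftedNarrowKlassCodec(long base) { this.base = base; }

    long decode(int classId) {
        // Addition plus shift instead of a plain addition.
        return base + ((long) classId << SHIFT);
    }

    int encode(long klassAddress) {
        // 22 bits of ID span 4 GB of 1 KB slots; with the class space
        // capped at 3 GB that is about three million addressable classes.
        return (int) ((klassAddress - base) >>> SHIFT);
    }
}
```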
Okay, as a summary of where we are now: we freed 10 bits and restored the identity hash to 31 bits, which is nice; that had been a blemish in the current Lilliput implementation. The cost is a reduction of the number of addressable classes from five to three million, still a completely fantastic number, and the decoding is now more complex, because in addition to the addition we also need to shift. There are some side effects trickling down: we improved the class pointer setup for the stock JVM, improvements that are already rolling out with JDK 22. How are we time-wise? Oh, okay. And that basically was it. Thank you very much. We'll take questions.

Thank you, very nice job. Just a question: is there any gain possible from using some compression for these external maps or side tables? Maybe objects are not moving too much, or maybe you can compress these addresses or hash maps. If you analyze the pointers, they may be big, so you may have a large remembered set of pointers or flags somewhere, and compression might save something, though of course it's a trade-off between performance and memory, so it depends on the use case.

So the question was: do we have any plans to optimize such hash map structures toward a more compact representation? That is not in the scope of this project, but there may be other efforts; I don't know. Thank you.

I'll answer this one quickly. The question was: when can we expect to see this in an available release? Honestly, I don't know yet. We have most of the stuff lined up already; I'm saying JDK 24, but don't sue me on that. Just upstreaming all of this is a bit tricky; it's a whole separate effort.

I had a question whether it is possible, or whether there is any plan, to make those sizes configurable. My thinking is, and maybe it's naive, but in some applications I worked on, classes are really simple and you don't use many of them, so you would benefit from an extremely small address space and even smaller pointers.

So the question is: is it possible to have configuration ergonomics for different class pointer sizes, or is it fixed by the JVM? We debated that at some point in time. There are some advantages to keeping them constant, because you get more efficient code. I'm not even sure we planned this as a development switch; it's undecided yet.

Okay. No more questions? Okay then, thank you. Thank you.