[00:00.000 --> 00:14.320] Just two classical optimizations that will help a modern but mature virtual machine, the one [00:14.320 --> 00:19.920] we have that powers native images, and why is it important? [00:19.920 --> 00:22.920] Well, and who am I? [00:22.920 --> 00:23.920] My name is Dmitry Chuyko. [00:23.920 --> 00:29.760] I work at a company named BellSoft, which actively participates in the OpenJDK community, [00:29.760 --> 00:37.320] and we release our own JDK distribution, which you have probably met if you have ever built a [00:37.320 --> 00:40.320] Spring Boot container with the default buildpack. [00:40.320 --> 00:42.560] So it's in there. [00:42.560 --> 00:50.280] And now Spring Boot, since version 3, supports containers with native images. [00:50.280 --> 00:56.280] It can be built as a native image, and if you do that, the compiler being used is the Liberica [00:56.280 --> 01:02.400] Native Image Kit, which is a BellSoft distribution of GraalVM. [01:02.400 --> 01:15.560] So that's another project that we participate in, and GraalVM itself can be seen as different [01:15.560 --> 01:20.160] things, with at least two major modes that we can observe. [01:20.160 --> 01:29.440] It can run as a JIT, where the compiler is Graal, or we can build a native image with static [01:29.440 --> 01:41.880] compilation, and it will utilize a special virtual machine, Substrate VM, and here it's [01:41.880 --> 01:48.480] different from traditional Java, the traditional way of how we run it. [01:48.480 --> 01:55.480] Well, another interesting and peculiar point here is that it is written in Java. [01:55.480 --> 02:04.480] So it is a complex project, but most of the code is Java, and this is beautiful. [02:04.480 --> 02:12.520] So you have a virtual machine and a compiler for JVM languages, and Java in particular, written [02:12.520 --> 02:14.240] in Java. [02:14.240 --> 02:21.080] So if you look at Java itself, why is it so beautiful? 
[02:21.080 --> 02:24.680] Well, not so beautiful compared to Kotlin, as we know, right? [02:24.680 --> 02:30.480] But still, both Java and Kotlin share those concepts. [02:30.480 --> 02:36.840] So from the very beginning, there is a way to write correct parallel programs. [02:36.840 --> 02:41.280] So to write parallel programs, we need some means of synchronization to orchestrate [02:41.280 --> 02:48.840] our threads if we share data, and most typically we do that. [02:48.840 --> 02:55.920] And also it's a managed runtime, where we don't have to worry that much about freeing memory, [02:55.920 --> 03:04.520] because we have garbage collection, and garbage is collected for us, and our programs, [03:04.520 --> 03:10.480] they can have a memory leak, but you have to work hard to get one. [03:10.480 --> 03:21.560] And having that native image implementation makes our final binaries very, sometimes makes [03:21.560 --> 03:23.680] them very performant. [03:23.680 --> 03:25.880] Of course, we have an instant startup. [03:25.880 --> 03:29.280] It was mentioned today several times. [03:29.280 --> 03:31.560] But we can also have a very good peak performance. [03:31.560 --> 03:40.120] In certain cases, that's not a rule, but it can happen, like it happens here on this plot. [03:40.120 --> 03:47.440] That's just a simple Spring Boot application, and we just ping the same endpoint. [03:47.440 --> 03:54.400] And here the native image works better, and it also warms up instantly, and it has very [03:54.400 --> 03:55.960] good latency. [03:55.960 --> 04:01.640] So for the small amount of memory that it takes, so this is a small service, it takes [04:01.640 --> 04:06.880] a small amount of memory, a very small heap, and it also has low latency. [04:06.880 --> 04:15.520] And under the hood, it uses, well, Serial GC, and we'll talk about that later. [04:15.520 --> 04:21.040] Well, what about the relationship between GraalVM and OpenJDK? 
[04:21.040 --> 04:31.160] Well, we're here in the Friends of OpenJDK room, and Graal was integrated as an additional [04:31.160 --> 04:35.000] experimental compiler in JDK 9. [04:35.000 --> 04:41.360] Well, it has been removed from recent JDKs, but what's left over? [04:41.360 --> 04:43.800] It's an interface to plug it in. [04:43.800 --> 04:49.360] So now there's going to be a second attempt to do that. [04:49.360 --> 04:54.480] So here on the slides it's mentioned that there is a discussion about a project, a new project, [04:54.480 --> 05:02.560] Galahad, but last week it was already called for votes in OpenJDK to start the project [05:02.560 --> 05:15.080] of bringing the sweetest parts of this technology into OpenJDK, back into OpenJDK. [05:15.080 --> 05:18.040] It's something that happens right now. [05:18.040 --> 05:24.200] So, that default garbage collector that sometimes shows very good latency, even compared to Parallel GC [05:24.200 --> 05:29.200] or G1 in HotSpot, well, on small heaps. [05:29.200 --> 05:35.240] Well, it's a kind of garbage collector we can easily understand. [05:35.240 --> 05:37.880] And it's a generational stop-the-world collector. [05:37.880 --> 05:47.120] So here there's only one survivor space, but actually it's 16 by default. [05:47.120 --> 05:55.760] But anyway, so we stop all our application threads and we collect garbage in a single [05:55.760 --> 06:03.000] thread, so this is a kind of basic garbage collector, right, but on the other hand [06:03.000 --> 06:10.280] it's reliable and it's very effective, especially if you have only a single core available. [06:10.280 --> 06:12.800] So you see the problem. [06:12.800 --> 06:20.280] We have a CPU which may be able to run many threads, but we run only one, at least [06:20.280 --> 06:21.280] for garbage collection. [06:21.280 --> 06:26.920] Now, garbage collection can take significant time during our application execution, well, [06:26.920 --> 06:28.920] that's obvious. 
[06:28.920 --> 06:31.640] Well, what would we do? [06:31.640 --> 06:39.800] Of course, we would like to do exactly the same thing, but in parallel, to decrease the [06:39.800 --> 06:46.360] time garbage collection takes, to reduce the garbage collection pause, because it's still [06:46.360 --> 06:51.880] a stop-the-world pause, but we reduce it because we process data with multiple threads. [06:51.880 --> 06:54.280] So that's the idea of parallel garbage collection. [06:54.280 --> 07:00.960] The idea is not new, but surprisingly, this modern runtime doesn't have it yet. [07:00.960 --> 07:08.880] Well, we decided to implement it, and it's still under review, and some implementation [07:08.880 --> 07:15.560] details, well, they change, but the idea is very simple. [07:15.560 --> 07:22.680] You just pass the garbage collector selection during the creation of your native [07:22.680 --> 07:23.680] image. [07:23.680 --> 07:30.000] For instance, if you use some Maven or Gradle configuration for your Spring Boot container, [07:30.000 --> 07:32.040] you can also do that. [07:32.040 --> 07:42.200] And then you have some knobs at runtime, which you can also twist when you run your application. [07:42.200 --> 07:46.680] And well, you enable that implementation. [07:46.680 --> 07:53.280] I'll show some performance results later, but basically the implementation itself, well, [07:53.280 --> 08:00.200] it can be analyzed as a change in a big Java program, which GraalVM is. [08:00.200 --> 08:07.440] And there are now two GC interfaces and implementations. [08:07.440 --> 08:18.480] And this functionality just reuses existing things in a very, I would say, smart way, just [08:18.480 --> 08:25.960] to keep as new code only what is about the parallelization. [08:25.960 --> 08:33.480] So everything else is reused from Serial GC. [08:33.480 --> 08:39.840] Basically, there's a problem of how we synchronize and share the work. 
[08:39.840 --> 08:47.920] Because parallel threads for garbage collection, they also have the same problem, because they [08:47.920 --> 08:54.760] work on the same data, so they have contention, or may have contention. [08:54.760 --> 08:59.120] So we need to share in some smart manner. [08:59.120 --> 09:06.360] Well, it's implemented with the work divided by volume. [09:06.360 --> 09:14.800] So every thread operates on its local memory, and it's a chunk of memory of one megabyte. [09:14.800 --> 09:21.960] So if we need extra memory, like when we scan objects and we fill up some set of data that [09:21.960 --> 09:23.800] we operate on. [09:23.800 --> 09:25.280] And then we have an extra chunk. [09:25.280 --> 09:31.120] We can just put it aside so someone else can pick it up. [09:31.120 --> 09:36.480] So that's the stack that contains the chunks of work. [09:36.480 --> 09:44.360] And when the work is finished, the thread just takes the next chunk of work. [09:44.360 --> 09:54.280] There may be a situation when several threads try to copy, to promote, the same object. [09:54.280 --> 09:56.400] And this is actually solved very simply. [09:56.400 --> 10:02.040] They just reserve some space for the object and then try to install a forwarding pointer [10:02.040 --> 10:05.480] using an atomic operation. [10:05.480 --> 10:11.160] And as this is an atomic operation, only one thread succeeds, so the others just roll back, [10:11.160 --> 10:16.280] and this is a lightweight operation. [10:16.280 --> 10:18.280] Again, this is Java. [10:18.280 --> 10:27.440] This is not a strict AML, sorry, but still all existing places that manage memory were [10:27.440 --> 10:31.400] reused without changing the architecture of Graal itself. [10:31.400 --> 10:35.840] So there are already possibilities to add garbage collectors. [10:35.840 --> 10:40.560] So if you want to implement one, it's not that complex. [10:40.560 --> 10:47.440] The major problem is to be correct when you deal with memory. 
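Editor's sketch: the forwarding-pointer race described above (several threads copy the same object, one atomic install wins, the others roll back) can be modeled in plain Java. This is only an illustration under stated assumptions, not the GraalVM code: `ForwardingSketch`, `Obj`, `promote`, and `race` are hypothetical names, and an `AtomicReference` stands in for the forwarding slot in a real object header.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model, not GraalVM code: an object with a forwarding slot.
final class ForwardingSketch {
    static final class Obj {
        final int payload;
        // null means "not copied yet"; otherwise it points at the single winning copy.
        final AtomicReference<Obj> forward = new AtomicReference<>();
        Obj(int payload) { this.payload = payload; }
    }

    // Counts speculative copies that lost the race and were discarded.
    static final AtomicInteger rolledBack = new AtomicInteger();

    // Returns the one promoted copy of 'original', whichever thread created it.
    static Obj promote(Obj original) {
        Obj existing = original.forward.get();
        if (existing != null) {
            return existing;                       // already promoted by someone else
        }
        Obj copy = new Obj(original.payload);      // reserve space and copy speculatively
        if (original.forward.compareAndSet(null, copy)) {
            return copy;                           // this thread installed the pointer
        }
        rolledBack.incrementAndGet();              // lost the race: drop our copy
        return original.forward.get();             // and use the winner's copy
    }

    // Starts 'threads' workers that all try to promote the same object at once.
    static Obj[] race(Obj original, int threads) {
        Obj[] seen = new Obj[threads];
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                seen[id] = promote(original);
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread t : workers) {
                t.join();                          // join makes the writes to 'seen' visible
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        Obj[] seen = race(new Obj(42), 4);
        System.out.println("threads agree on one copy: " + (seen[0] == seen[3]));
    }
}
```

The losing threads do nothing more than discard a local copy, which is why the talk calls the rollback lightweight; a real collector would return the reserved space to its thread-local chunk instead of leaving it to a garbage collector.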
[10:47.440 --> 10:55.400] When you deal with concurrency, and when you inject your code into this virtual machine, [10:55.400 --> 11:02.400] because it's all declarative magic that requires you to be careful. [11:02.400 --> 11:06.440] Well, some performance results. [11:06.440 --> 11:15.600] With relatively large heaps, with Serial GC, you can have pauses of several seconds, which [11:15.600 --> 11:18.200] is long, of course. [11:18.200 --> 11:24.200] And there's a big difference if you have a two or three or four second pause or if you [11:24.200 --> 11:26.200] decrease it by one second. [11:26.200 --> 11:32.000] So that's possible with this implementation already. [11:32.000 --> 11:36.320] So that's the order of this improvement. [11:36.320 --> 11:43.280] With another benchmark, HyperAlloc, you see that latency here, the latency of pauses, can be [11:43.280 --> 11:47.280] decreased like two times. [11:47.280 --> 11:55.040] Those pauses are not that big, and we have frequent collections here, so the x-axis is the epoch, [11:55.040 --> 12:06.040] so each point is a garbage collection, and the y-axis is time in, I believe, milliseconds. [12:06.040 --> 12:09.800] Well, that's Parallel GC. [12:09.800 --> 12:18.120] So we can obviously improve many applications and many installations where we have an option [12:18.120 --> 12:20.880] to use several CPUs. [12:20.880 --> 12:25.880] If we use one CPU, of course, we won't see much difference. [12:25.880 --> 12:32.920] There is some increase in memory used for service needs, but that's kind of moderate. [12:32.920 --> 12:37.240] So, other parts of this complex system. [12:37.240 --> 12:45.360] I mentioned synchronization, and, well, synchronization is useful, but it has tradeoffs. [12:45.360 --> 12:51.200] Because if we implement synchronization, we need to spend our CPU resources to put aside [12:51.200 --> 12:54.560] threads that won't get the resource. 
[12:54.560 --> 13:02.720] We need to stop them, to queue them, to manage those queues, to wake them up, to involve the operating [13:02.720 --> 13:05.040] system in that process. [13:05.040 --> 13:12.280] So that's not cheap, but there are situations that, that's another queue, right? [13:12.280 --> 13:19.680] And that even influences the design of the standard library, because, like, we all know StringBuffer [13:19.680 --> 13:22.480] and StringBuilder, right? [13:22.480 --> 13:28.880] One class appeared because, well, the other one wasn't very pleasant in terms of performance. [13:28.880 --> 13:35.280] Yeah, we need it sometimes, but in many cases, we need a non-synchronized implementation, [13:35.280 --> 13:40.120] same thing, like, with Hashtable and HashMap, whoever uses Hashtable, right? [13:40.120 --> 13:43.320] But it's very well synchronized. [13:43.320 --> 13:50.600] But not all classes that have any synchronization in them have their twins without synchronization. [13:50.600 --> 13:52.800] That makes no sense, right? [13:52.800 --> 14:01.840] So there's a well-known technique for how to deal with the case where accesses to our data [14:01.840 --> 14:09.320] structures, to our classes, are mostly sequential, when at any point in time only a single thread [14:09.320 --> 14:11.320] owns and operates on an object. [14:11.320 --> 14:15.640] And it's called biased locking or thin locking. [14:15.640 --> 14:22.760] Well, why is it simpler and more lightweight? [14:22.760 --> 14:28.080] Because we don't want to manage all the complex cases. [14:28.080 --> 14:31.520] We know that we are in a good situation. [14:31.520 --> 14:38.000] And if we're not, yes, we can fall back, and that's called inflating our monitor. [14:38.000 --> 14:45.960] Well, it existed in OpenJDK for ages, and it has been removed from OpenJDK. 
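Editor's sketch: the sequential-access pattern mentioned above, where a synchronized class is only ever used by one thread, looks like this with the two standard-library twins. `SequentialSyncSketch` and its method names are hypothetical; `StringBuffer` and `StringBuilder` are the real JDK classes named in the talk.

```java
// Hypothetical demo class; StringBuffer and StringBuilder are the real JDK twins.
final class SequentialSyncSketch {
    // StringBuffer.append is synchronized: every call acquires the buffer's
    // monitor, even though only one thread ever touches it here.
    static String withBuffer(int n) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n; i++) {
            sb.append(i % 10);
        }
        return sb.toString();
    }

    // StringBuilder offers the same append API without any synchronization.
    static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i % 10);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withBuffer(12).equals(withBuilder(12)));
    }
}
```

Both methods build the same string; the only difference is the uncontended monitor acquisition on each `StringBuffer.append`, which is exactly the cost a lightweight locking scheme tries to make cheap.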
[14:45.960 --> 14:54.280] If it was deprecated, then no one noticed, I believe, because still, are there that many [14:54.280 --> 14:58.640] people using something newer than JDK 11? [14:58.640 --> 15:06.080] Well, some consequences were noticed probably too late. [15:06.080 --> 15:08.960] Well, what are the reasons, first of all? [15:08.960 --> 15:14.560] What are the reasons to remove biased locking from OpenJDK, from the HotSpot JVM? [15:14.560 --> 15:22.800] Well, to ease the implementation of virtual threads, to deliver Project Loom, to decrease [15:22.800 --> 15:25.480] the amount of work there. [15:25.480 --> 15:32.480] So, some consequences here, initially discovered. [15:32.480 --> 15:42.440] In certain cases, things like input streams can be slowed down, like here it's 8x or something. [15:42.440 --> 15:45.480] That's enormously slow. [15:45.480 --> 15:54.920] And for GraalVM, there is a mode where you say during static compilation, OK, this native [15:54.920 --> 16:00.200] image doesn't try to work with many cores. [16:00.200 --> 16:01.720] It's a single-threaded program. [16:01.720 --> 16:07.840] So it's simple, and it works really better in these circumstances. [16:07.840 --> 16:10.000] So there is an optimization for that. [16:10.000 --> 16:14.880] But you have to know it in advance, when you compile your program. [16:14.880 --> 16:19.520] Well, and there is, of course, a runtime option that supports all kinds of situations, and [16:19.520 --> 16:21.040] it's complex. [16:21.040 --> 16:24.680] So the missing part is in the lower left corner. [16:24.680 --> 16:34.000] Well, to be able to dynamically handle the situation of a sequential access pattern. [16:34.000 --> 16:42.200] So we've implemented quite a classical approach to this problem. [16:42.200 --> 16:50.680] That helps to, that brings that thin locking to GraalVM. [16:50.680 --> 16:57.320] The initial idea was operating with the object header. 
[16:57.320 --> 17:05.920] So, where it already contains a pointer to a fat monitor object. [17:05.920 --> 17:10.120] But it can be treated as well as just a word. [17:10.120 --> 17:15.920] We can atomically access it and put some information there. [17:15.920 --> 17:21.400] The probably close to final implementation that we have right now still, or again, uses a pointer, [17:21.400 --> 17:29.960] because it turned out to be not so easy to keep correctness across the whole VM with some [17:29.960 --> 17:37.320] memory that you treat as a pointer or as a word depending on the situation. [17:37.320 --> 17:46.160] Well, anyway, inside that part of the header, or inside that special object, we can have 64 [17:46.160 --> 17:48.400] bits of information. [17:48.400 --> 17:55.520] And we can mark it as a thin lock, this is a flag, and we can do it atomically. [17:55.520 --> 18:05.960] We can keep the ID of the owner thread, which we can obtain when we work with threads. [18:05.960 --> 18:11.520] And a count of recursive locks that we currently hold. [18:11.520 --> 18:19.920] That, by the way, means that after a certain number of recursive locks, we have to inflate [18:19.920 --> 18:27.000] the monitor, because we can't store more information in that part of this word. [18:27.000 --> 18:28.480] Yeah. [18:28.480 --> 18:39.800] So again, it's a pure Java implementation, where we work with some atomic magic and we [18:39.800 --> 18:41.480] update this information. [18:41.480 --> 18:46.120] Here is what we've got, and the most recent numbers are even better. [18:46.120 --> 18:51.440] So we see that effect on exactly that example, the streams. [18:51.440 --> 18:53.600] We can speed them up. [18:53.600 --> 19:02.440] And even in a very kind of nano-benchmark kind of measurement, you also see the improvement. [19:02.440 --> 19:13.600] And even in the multi-threaded case, there is now no difference from the original.
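Editor's sketch: the 64-bit thin-lock word described above (a thin flag, the owner thread ID, and a recursion count that forces inflation when it overflows) can be modeled with an `AtomicLong`. Everything here is an assumption made for illustration: the bit layout, the 8-bit counter, and the names `ThinLockSketch`, `pack`, `lock`, `unlock`, and `MUST_INFLATE` are hypothetical and are not taken from the GraalVM sources. The caller passes a positive thread ID explicitly.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical layout, not the GraalVM one:
// [ owner thread id : 55 bits | recursion count : 8 bits | thin flag : 1 bit ]
final class ThinLockSketch {
    static final long THIN_FLAG = 1L;
    static final int COUNT_SHIFT = 1;
    static final int COUNT_BITS = 8;
    static final long COUNT_MAX = (1L << COUNT_BITS) - 1;
    static final int OWNER_SHIFT = COUNT_SHIFT + COUNT_BITS;

    enum Result { ACQUIRED, MUST_INFLATE }

    // 0 means unlocked; any thin-locked value has THIN_FLAG set.
    private final AtomicLong word = new AtomicLong(0L);

    static long pack(long ownerId, long count) {
        return (ownerId << OWNER_SHIFT) | (count << COUNT_SHIFT) | THIN_FLAG;
    }

    static long ownerOf(long w) { return w >>> OWNER_SHIFT; }

    static long countOf(long w) { return (w >>> COUNT_SHIFT) & COUNT_MAX; }

    long word() { return word.get(); }

    // Tries the cheap path; MUST_INFLATE tells the caller to fall back to a fat monitor.
    Result lock(long threadId) {
        long w = word.get();
        if (w == 0L) {
            // Unlocked: a single compare-and-set takes the lock.
            return word.compareAndSet(0L, pack(threadId, 1))
                    ? Result.ACQUIRED : Result.MUST_INFLATE;
        }
        // Recursive acquisition by the owner, as long as the counter still fits.
        if (ownerOf(w) == threadId && countOf(w) < COUNT_MAX
                && word.compareAndSet(w, pack(threadId, countOf(w) + 1))) {
            return Result.ACQUIRED;
        }
        // Held by another thread, or the counter would overflow.
        return Result.MUST_INFLATE;
    }

    void unlock(long threadId) {
        long w = word.get();
        if ((w & THIN_FLAG) == 0L || ownerOf(w) != threadId) {
            throw new IllegalMonitorStateException("not the thin-lock owner");
        }
        long count = countOf(w);
        // Only the owner writes while the lock is held, so a plain set is enough here.
        word.set(count == 1 ? 0L : pack(threadId, count - 1));
    }
}
```

In this model a second thread never spins or queues on the thin word; it immediately reports that inflation is needed, which matches the idea that the thin path only covers the good, sequential case.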