[00:00.000 --> 00:13.760] Hello. I'll get started. Okay. My talk is entitled "The Next Frontier in Open Source [00:13.760 --> 00:21.200] Java Compilers: Just-in-Time Compilation as a Service." Whoops, this isn't working. [00:21.200 --> 00:25.280] My name is Rich Agarty. I've been a software engineer for way too many years. I'm currently [00:25.280 --> 00:34.640] a developer advocate at IBM. So, we're all Java developers. We understand what a JVM [00:34.640 --> 00:42.240] and a JIT are. The JVM executes your Java application at runtime, and it sends the [00:42.240 --> 00:47.000] hot methods to the JIT to be compiled. With that in mind, we're going to talk about JIT [00:47.000 --> 00:50.920] as a service today. And we're going to break it down into three parts. First, I'm going [00:50.920 --> 00:56.680] to talk about the problem, which is Java running on the cloud, specifically in distributed, [00:56.680 --> 01:03.280] dynamic environments like microservices. Then we're going to talk about the reason, which [01:03.280 --> 01:08.040] is going to take us back to the JVM and the JIT. It's [01:08.040 --> 01:13.200] great technology, but it does have some issues. And then the solution, which is JIT as [01:13.200 --> 01:24.120] a service. So, is Java a good fit on the cloud? For context, we'll talk about legacy [01:24.120 --> 01:29.000] Java apps, enterprise apps. They were all monoliths running on dedicated servers [01:29.000 --> 01:38.800] or VMs to ensure great performance, loaded with a lot of memory and a lot of CPUs. They [01:38.800 --> 01:42.920] took forever to start, but it didn't matter because they never went down. We have clients [01:42.920 --> 01:48.280] that have been running Java applications for years. If they did upgrade, it would be every six months to [01:48.280 --> 01:57.200] a year, with some simple refreshes. That was the world of legacy Java enterprise apps. [01:57.200 --> 02:05.000] Now we move to the cloud. That same monolith is now a bunch of microservices talking to each [02:05.000 --> 02:11.800] other. They're all running in containers, managed by some cloud provider with a Kubernetes [02:11.800 --> 02:23.120] implementation to orchestrate them. And we have auto-scaling up and down to meet demand. [02:23.120 --> 02:28.120] The main motivators behind this, obviously, are flexibility and scalability. It's easier to [02:28.120 --> 02:32.800] roll out new releases. You can have teams assigned to specific microservices who never [02:32.800 --> 02:38.400] touch other microservices. Once you're on the cloud, you can take advantage of the [02:38.400 --> 02:44.160] latest and greatest cloud technologies, like serverless, as they come out. Obviously, you'd have less infrastructure [02:44.160 --> 02:52.400] to maintain and manage. And the ultimate goal is saving money. [02:52.400 --> 02:58.680] So before we start counting all our money, we've got to think about performance. [02:58.680 --> 03:04.760] There are two variables that impact cost and performance: container size and [03:04.760 --> 03:11.240] the number of instances of your application you're running. Here's a graph showing all [03:11.240 --> 03:17.600] the ways we can get these variables wrong. Starting down here, the containers are way too [03:17.600 --> 03:22.960] small and we're not running enough instances. It's pretty cheap, but the performance is [03:22.960 --> 03:29.760] unacceptable. On the opposite side, our containers are too big.
Way too many instances [03:29.760 --> 03:35.440] running. Great performance, but we're wasting money. So we need to get over here. This is the sweet [03:35.440 --> 03:43.440] spot: we've got our container size just right, and we have just enough instances for the demand. [03:43.440 --> 03:48.000] That's what we want to get to. It's very hard to do. In fact, most conferences have a lot [03:48.000 --> 03:54.000] of talks about how to get here, or fixes for this problem. So before we can figure [03:54.000 --> 03:58.520] out how to fix it, we've got to figure out why it's so hard. And in order to do that, [03:58.520 --> 04:08.760] we've got to talk about the JVM and the JIT. So first, the good. Device independence: Java became [04:08.760 --> 04:14.880] so popular because you write once and run anywhere, in theory. There have been 25 years of constant improvement, [04:14.880 --> 04:21.840] with a lot of involvement from the community. The JIT itself produces optimized code that runs [04:21.840 --> 04:29.080] great. It uses a profiler, so it can make optimizations you can't get by compiling statically. [04:29.080 --> 04:33.760] The JVM has very efficient garbage collection. And as the JVM collects more profile data and [04:33.760 --> 04:38.000] the JIT compiles more methods, your code gets better and better. So the longer your [04:38.000 --> 04:46.600] Java application runs, the better it gets. Now, the bad. That initial execution of [04:46.600 --> 04:53.520] your code is interpreted, so it's relatively slow. Those hot methods compiled by the [04:53.520 --> 05:03.240] JIT can create CPU and memory spikes. CPU spikes cause lower quality of service, meaning worse performance. [05:03.240 --> 05:07.560] And the memory spikes cause out-of-memory issues, including crashes. In fact, a [05:07.560 --> 05:13.440] main reason JVMs crash is out-of-memory issues. And we have [05:13.440 --> 05:18.480] slow startup and slow ramp-up times. We want to distinguish between the two. Startup [05:18.480 --> 05:23.760] time is the time it takes for the application to process its first request, which usually happens while the code is still being [05:23.760 --> 05:27.760] interpreted. And ramp-up time is the time it takes the JIT to compile everything [05:27.760 --> 05:36.120] it wants to compile to get to that optimized version of your code. So here we have some [05:36.120 --> 05:41.600] graphs to back that up. Here we take a Java enterprise application, and you can see on [05:41.600 --> 05:48.840] the left we get CPU spikes happening initially, all because of JIT compilations. [05:48.840 --> 05:59.520] Same thing on the memory side: we get these large spikes that we have to account for. [05:59.520 --> 06:03.240] So let's go back to that graph about finding the sweet spot. Now we have a little more [06:03.240 --> 06:09.280] information, but we still need to figure out a way to right-size those provisioned containers. [06:09.280 --> 06:15.400] And we've got to make our auto-scaling efficient. We have very little control over scaling. [06:15.400 --> 06:19.160] We control the size of our containers, but as far as scaling goes, we just have to set [06:19.160 --> 06:30.280] the environment up correctly so that auto-scaling is efficient.
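As a quick aside: you can watch this ramp-up behavior yourself, because most JVMs can log each method as the JIT compiles it. A minimal sketch, assuming a HotSpot-based JVM and an OpenJ9-based JVM respectively; the application jar is a placeholder, and the exact option spellings should be checked against your JVM's documentation.

    # HotSpot: print each method as it gets JIT-compiled
    java -XX:+PrintCompilation -jar myapp.jar

    # OpenJ9: write JIT compilation activity to a verbose log file
    java -Xjit:verbose,vlog=jit.log -jar myapp.jar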
[06:30.280 --> 06:36.320] So on the container size portion of it, the main issue is that we need to over-provision to [06:36.320 --> 06:45.960] handle those memory spikes, which is very hard to do, because JVMs have non-deterministic [06:45.960 --> 06:50.080] behavior, meaning you can run the same application over and over, and you're going to get different [06:50.080 --> 06:54.280] spikes at different times. So you've got to run a series of load tests to figure [06:54.280 --> 07:02.680] out how to get that number roughly right. And on the auto-scaling part of things, again, [07:02.680 --> 07:06.960] we talked about the slow start-up and ramp-up times. The slower those are, the less effective [07:06.960 --> 07:14.760] your auto-scaling is going to be. And the CPU spikes can cause other issues. For a lot of auto-scalers, [07:14.760 --> 07:20.360] the threshold for starting new instances is CPU load. So if you start a new instance [07:20.360 --> 07:28.040] and it's spinning doing JIT compiles, your auto-scaler may detect that as a false positive and [07:28.040 --> 07:32.640] say, oh, demand is going up, you need more instances, when in this case [07:32.640 --> 07:42.160] you really didn't. So it makes it very inefficient. So the solution to this problem is that we need [07:42.160 --> 07:47.000] to minimize or eliminate those CPU spikes and memory spikes, and we've got to improve [07:47.000 --> 07:55.640] that start-up and ramp-up time. So what we're proposing here, what we're going to talk about, is [07:55.640 --> 08:01.920] JIT as a service, which is going to solve these issues, or help solve these issues. [08:01.920 --> 08:06.480] The theory behind it is that we're going to decouple the JIT compiler from the JVM and [08:06.480 --> 08:12.920] let it run as an independent process. Then we're going to offload those JIT compilations [08:12.920 --> 08:19.440] from the client JVMs to that remote process. As you can see here, we have two client JVMs [08:19.440 --> 08:29.520] talking to two remote JITs over here. We still have the JIT locally in the JVM, and it can [08:29.520 --> 08:37.000] be used if these become unavailable for some reason. And since everything is in containers, [08:37.000 --> 08:44.520] it's all automatically managed by the orchestrator to make sure it's scaled correctly. [08:44.520 --> 08:49.560] This is actually a monolith-to-microservices solution: we're taking the monolith, in this case [08:49.560 --> 08:55.320] the JVM, and splitting it up into the JIT in one microservice and everything left over in the other. [08:55.320 --> 09:06.000] And again, like I mentioned, the local JIT is still available if this service goes down. [09:06.000 --> 09:11.840] So this technology actually does exist today. It's called the JIT server, and it's [09:11.840 --> 09:19.680] part of the Eclipse OpenJ9 JVM. It's also called the Semeru Cloud Compiler [09:19.680 --> 09:24.440] when used with Semeru Runtimes, and I'll get to that in a minute. And I'm sure everyone [09:24.440 --> 09:30.160] here knows OpenJ9 combines with OpenJDK to form a full JDK, totally open source and [09:30.160 --> 09:40.160] free to download. And here's the GitHub repo. A little history of OpenJ9: it started [09:40.160 --> 09:47.600] life as the J9 JVM at IBM over 25 years ago. And the reason IBM developed it was because [09:47.600 --> 09:52.160] they had a whole range of devices they needed to support, and they wanted to make sure Java [09:52.160 --> 09:58.400] ran on all of them.
That's all the way from handheld scanners to mainframes. So it was [09:58.400 --> 10:03.040] designed to go from small to large, in environments where you have a lot [10:03.040 --> 10:08.240] of memory or very, very little. And about five years ago, IBM decided to open-source [10:08.240 --> 10:14.320] it through the Eclipse Foundation. OpenJ9 is renowned for its small footprint and fast start-up [10:14.320 --> 10:18.480] and ramp-up times, which we'll get to in a minute. And again, even though it's got a [10:18.480 --> 10:26.280] new name as OpenJ9, all of IBM's enterprise clients have been running their applications [10:26.280 --> 10:37.680] on this JVM for years. So there's a long history of success with it. Here's some OpenJ9 [10:37.680 --> 10:43.960] performance compared to HotSpot. This doesn't take the JIT server into account; [10:43.960 --> 10:51.360] this is just the JVMs themselves, going left to right here. OpenJ9 is in green, HotSpot [10:51.360 --> 10:58.000] in orange. So in certain circumstances, we see 51% faster start-up time and a 50% smaller [10:58.000 --> 11:05.560] footprint after start-up. And it ramps up quicker than HotSpot. And at the very end, [11:05.560 --> 11:18.600] after full load, we have a 33% smaller footprint with OpenJ9. So, Semeru Runtimes. [11:18.600 --> 11:24.640] That is IBM's OpenJDK distribution. As someone just mentioned, there's [11:24.640 --> 11:29.120] a ton of distributions out there. This is IBM's, and it's the only one that comes with the [11:29.120 --> 11:36.320] Eclipse OpenJ9 JVM. It's available at no cost. It's stable. IBM puts their name behind it. [11:36.320 --> 11:43.080] It comes in two editions, open and certified, the only differences being the licensing [11:43.080 --> 11:48.680] and which platforms are supported. And if you're wondering where Semeru [11:48.680 --> 11:56.560] comes from: Mount Semeru is the tallest mountain on the island of... anyone know? Java, there [11:56.560 --> 12:02.760] you go. See how that makes sense? If I had a t-shirt, I would have given you one. Alright, [12:02.760 --> 12:09.080] from the perspective of the client JVM talking to this new JIT server, these [12:09.080 --> 12:14.560] are the advantages it's going to get. From a provisioning aspect, it's now going to be [12:14.560 --> 12:19.120] very easy to size our containers, right? We don't have to worry about those spikes anymore. [12:19.120 --> 12:28.080] So now we just level-set based on the demand, the needs of the application itself. [12:28.080 --> 12:31.960] Performance-wise, we're going to see improved ramp-up time, basically because [12:31.960 --> 12:38.240] we're going to offload all the compilations and their CPU cycles [12:38.240 --> 12:44.720] to the JIT server. And there's also a feature in this JIT server called the AOT cache. It's [12:44.720 --> 12:51.840] going to store any method it compiles, so when another instance of the same containerized application [12:51.840 --> 12:59.160] asks for that method, it'll just return it. No compilation needed. [12:59.160 --> 13:04.400] Then from a cost standpoint, obviously any time you reduce your [13:04.400 --> 13:09.640] resource amounts, you're going to get savings in cost. And with the efficient [13:09.640 --> 13:19.960] auto-scaling I mentioned earlier, you're only going to pay for what you need.
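To make the moving parts concrete, here is a minimal sketch, not from the talk itself, of how a client JVM gets pointed at a JIT server with Eclipse OpenJ9 / IBM Semeru Runtimes. The host name and jar are placeholders, and the option names should be verified against the OpenJ9 documentation for your release.

    # 1. Start the JIT server process (OpenJ9-based JDKs ship a 'jitserver'
    #    launcher); by default it listens on port 38400.
    jitserver

    # 2. Run the application with its JIT compilations offloaded to that server.
    java -XX:+UseJITServer \
         -XX:JITServerAddress=jitserver.example.internal \
         -XX:JITServerPort=38400 \
         -jar myapp.jar

    # If the server becomes unreachable, the client JVM falls back to its
    # local JIT compiler, as described above.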
Resiliency: remember, the JVM [13:19.960 --> 13:29.520] still has its local JIT. So if the JIT server goes down, it can still keep going. [13:29.520 --> 13:33.440] So this is kind of an interesting chart. This is pretty big. We're going to talk about [13:33.440 --> 13:39.520] some examples of where we see savings. This is an experiment where we took four [13:39.520 --> 13:47.080] – let me see if my pointer works – we took four Java applications and we sized [13:47.080 --> 13:52.920] them correctly for the amount of memory and CPU they needed, doing all those load tests [13:52.920 --> 13:58.240] to figure out what that amount should be. And we have multiple instances of them. [13:58.240 --> 14:03.280] The color indicates the application, and you can see all the different replicas. The relative [14:03.280 --> 14:10.000] size is shown by the scale of the square. And in this case, we used OpenShift to lay [14:10.000 --> 14:15.840] it out for us, and it came out using three nodes to handle all of these applications [14:15.840 --> 14:21.080] and their instances. Then we introduced the JIT server and ran the same test. Here's our [14:21.080 --> 14:26.760] JIT server here, the brown one. It's the biggest container in the nodes. But you'll notice the [14:26.760 --> 14:31.840] size of all of our application containers goes way down. So we have the same number of [14:31.840 --> 14:38.440] instances in both cases, but we've just saved 33% of the resources. And if you're [14:38.440 --> 14:46.760] wondering how they perform – whoops, went too far – you see no difference. The orange [14:46.760 --> 14:54.240] is the baseline, the blue is the JIT server. And in the steady state, once they've [14:54.240 --> 15:01.960] ramped up, they perform exactly the same. But we're, again, saving 33% of the resources. [15:01.960 --> 15:07.000] Now we'll take a look at some of the effects on auto-scaling in Kubernetes. Here we're [15:07.000 --> 15:12.880] running an application and we're setting our threshold, I think it's up there, at [15:12.880 --> 15:20.520] 50% of CPU. And you can see here that all these plateaus are where the auto-scaler is going [15:20.520 --> 15:29.120] to launch another pod. And you can see how the JIT server, in blue, responds better: shorter [15:29.120 --> 15:36.080] dips, and they recover faster. So overall, your performance is going to be better with [15:36.080 --> 15:43.680] a JIT server. Also, that other thing I talked about, the false positives: again, the [15:43.680 --> 15:48.880] auto-scaler is not going to be tricked into thinking that the CPU load from JIT compiles [15:48.880 --> 15:54.520] means demand is going up. So you're going to get better behavior in auto-scaling. Two [15:54.520 --> 16:01.240] minutes. All right. When to use it? Obviously when we're in a memory- and [16:01.240 --> 16:08.040] CPU-constrained environment. As a recommendation, use 10 to 20 client JVMs when you're [16:08.040 --> 16:14.680] talking to a JIT server, because remember, that JIT server does take its own container. [16:14.680 --> 16:20.000] And it is communication over the network, so only add encryption if you absolutely [16:20.000 --> 16:27.200] need it. So, some final thoughts. We talked about how the JIT provides a great advantage, that [16:27.200 --> 16:34.600] optimized code, but compilations do add overhead. So we disaggregated the JIT from the JVM and [16:34.600 --> 16:40.880] came up with JIT compilation as a service.
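For reference, a 50% CPU target like the one described above is what a standard Kubernetes Horizontal Pod Autoscaler can be given, and OpenJ9 documents options for encrypting the JIT server connection. A minimal sketch, with hypothetical deployment name, replica counts, and certificate files; check the option names against your OpenJ9 release.

    # Scale the application between 2 and 10 pods, adding pods when average
    # CPU utilization goes above 50%
    kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10

    # If the JIT server traffic must be encrypted (only if you really need it):
    jitserver -XX:JITServerSSLKey=key.pem -XX:JITServerSSLCert=cert.pem
    java -XX:+UseJITServer -XX:JITServerSSLRootCerts=cert.pem -jar myapp.jar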
It's available in Eclipse OpenJ9, where it's called [16:40.880 --> 16:45.920] the Eclipse OpenJ9 JITServer, that's the technology, [16:45.920 --> 16:52.360] and it's also called the Semeru Cloud Compiler. It's available on Linux for Java 8, 11, and 17. [16:52.360 --> 16:55.240] It's really good with micro-containers. In fact, that's the only reason I'm bringing it up [16:55.240 --> 17:03.160] today. It's Kubernetes ready. You can improve your ramp-up time and auto-scaling. And here's [17:03.160 --> 17:09.960] the key point I'll end with: this is a Java solution to a Java problem. Initially [17:09.960 --> 17:14.720] I talked about that sweet-spot space. There are a lot of companies, a lot of vendors, trying [17:14.720 --> 17:20.240] to figure out how to make that work better. And a lot of them involve doing other things [17:20.240 --> 17:28.000] besides what Java's all about, running the JVM, running the JIT. So it is a Java solution [17:28.000 --> 17:37.200] to your Java problem. That's it for me today. That QR code will take you to a page I have [17:37.200 --> 17:42.360] with a bunch of articles on how to use it, plus the slides and other good materials [17:42.360 --> 17:46.000] about it. That's it for me. Thank you very much. [17:46.000 --> 18:15.120] It sounds amazing. It is amazing. It really is amazing. Well, why wouldn't you? OpenJ9 [18:15.120 --> 18:21.320] is a perfectly viable JVM. It's nothing special, right? And there's nothing [18:21.320 --> 18:26.840] unique about it that makes you change your code. It's an OpenJDK build that just uses the [18:26.840 --> 18:53.320] OpenJ9 JVM. Okay, here it comes. I think so, because I've seen examples [18:53.320 --> 19:05.320] of using those apps in tests. Check that, yeah. Yeah, okay. That may be a problem. I'd [19:05.320 --> 19:25.200] go out and check the latest coverage of that. Well, the way the AOT cache works in this [19:25.200 --> 19:32.000] case for the JIT server, it's going to keep all that information, and the profile from the [19:32.000 --> 19:38.280] requesting JVM has to match, right? So if it matches, it'll use it, right? Because [19:38.280 --> 19:42.740] the clients also have their own cache. They'll keep it, but it goes away [19:42.740 --> 19:47.080] once they go away, right? When you start a new instance of that app, you have a brand [19:47.080 --> 20:07.720] new, flushed cache. I'm sorry. Yeah, so that's what we were talking about. If you want to go [20:07.720 --> 20:12.480] static, you're going to get a smaller image running statically, but you lose all the benefits [20:12.480 --> 20:19.840] of the JIT. Over time, yes. So that may be a great solution for short-lived apps, right? [20:19.840 --> 20:23.040] But the longer your Java app runs, the more you're going to benefit from that optimized [20:23.040 --> 20:33.440] code, right? Yes? So Eclipse OpenJ9 itself is not TCK certified, but Semeru is [20:33.440 --> 20:40.280] certified, and the open edition today has available binaries. But for Eclipse, [20:40.280 --> 20:47.280] they are not able to actually release the binaries because they cannot actually access [20:47.280 --> 20:55.720] the TCK certification process. So that whole TCK issue is, I don't know. Well, I guess [20:55.720 --> 21:02.000] I could say it seems to be an issue more between IBM and Oracle, right? Our own tests [21:02.000 --> 21:18.960] are going to encompass all the TCK stuff.
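Picking up the AOT cache exchange above: recent OpenJ9 releases document options for enabling that server-side cache. A minimal sketch; the cache name is a placeholder, and depending on the release the client may also need its shared classes cache enabled, so check the OpenJ9 JITServer documentation.

    # Server side: keep an AOT cache of compiled methods across client requests
    jitserver -XX:+JITServerUseAOTCache

    # Client side: request methods from (and populate) a named AOT cache
    java -XX:+UseJITServer \
         -XX:+JITServerUseAOTCache \
         -XX:JITServerAOTCacheName=myAppCache \
         -jar myapp.jar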
OpenJ9 is managed by Eclipse, [21:18.960 --> 21:23.600] but 99% of the contributions are from IBM. It's a big part of their business. It's not [21:23.600 --> 21:28.000] going to go anywhere. If you have to do open source, this is like the best of both [21:28.000 --> 21:32.520] worlds, I think. It's available, it's open, you can see it, but you know you have a vendor [21:32.520 --> 21:36.000] whose business is based on it, so it's not going to go anywhere, and they're going [21:36.000 --> 21:40.360] to put a lot of resources into making it better. So, you know, I'm just telling you right [21:40.360 --> 21:44.920] now that we just came out with the JIT server, and we're going into beta on InstantOn. I don't [21:44.920 --> 21:49.280] know if you've heard of that. It's based on CRIU. So we're going to be able to take snapshots [21:49.280 --> 21:52.760] of those images, and you can put those in your containers. Those are going to start [21:52.760 --> 21:58.920] up in milliseconds. So the JIT server basically handles the ramp-up time, [21:58.920 --> 22:03.320] and InstantOn will handle the start-up time. So we're talking milliseconds. That's coming [22:03.320 --> 22:27.400] out in the next couple of months or so. Anyway, thank you. Well, if you don't have the JIT, [22:27.400 --> 22:33.920] then you're going to be running interpreted. That's like the worst of everything. Oh, well, [22:33.920 --> 22:41.640] it won't be. But you still want to use the JIT remotely. Oh, you're talking about locally. [22:41.640 --> 22:47.560] It will not be used. It will not be used. By the way, yeah. And by the way, the JIT server [22:47.560 --> 22:53.160] is just another persona of the JVM. It's just running under a different persona. No, it [22:53.160 --> 22:58.360] won't do that. Okay. Thank you very much. Okay. Thank you.