[00:00.000 --> 00:12.260] So, hi, I will start right away because my talk is quite packed, so I'm focusing on [00:12.260 --> 00:17.440] this, working for Amazon in the Amazon Corretto team. [00:17.440 --> 00:21.600] My slides and the examples are on GitHub. I will show this link one more time at the [00:21.600 --> 00:24.960] end of the talk, so you don't have to take a copy. [00:24.960 --> 00:29.280] I am a principal engineer in the Amazon Corretto team, working in the OpenJDK for more than [00:29.280 --> 00:39.600] 15 years. I was with SAP before, that was also for more than 15 years, and I have various duties [00:39.600 --> 00:42.080] in the OpenJDK and the JCP. [00:42.080 --> 00:46.080] So let's get started with Firecracker. Firecracker is a minimalistic virtual machine [00:46.080 --> 00:52.360] monitor. It's KVM-backed and it only supports a limited set of devices, basically block [00:52.360 --> 00:57.240] and network devices, which are virtualized through virtio, and a vsock and a serial device. That [00:57.240 --> 01:03.440] makes it very fast and also very secure, because it doesn't support any exotic devices like, [01:03.440 --> 01:08.720] for example, QEMU does. It has a REST-based configuration and it's completely written in Rust, which also [01:08.720 --> 01:17.240] makes it kind of safe. It was forked from Google's crosvm and it's nowadays [01:17.240 --> 01:22.880] based on the rust-vmm library, which is like a base library for virtual machine monitors, [01:22.880 --> 01:26.240] and I think that's also used by crosvm meanwhile.
[01:26.240 --> 01:31.200] It supports a microVM metadata service, which is basically a JSON store where you can [01:31.200 --> 01:35.840] share data between guest and host, because with full virtualization it's not easy to [01:35.840 --> 01:40.960] exchange data between guest and host: all the guest applications run on their own [01:40.960 --> 01:46.280] kernel. With this metadata service, for example, you don't need a network connection between [01:46.280 --> 01:51.520] host and guest. And then the Firecracker process itself supports, in addition to the security [01:51.520 --> 01:59.080] provided by KVM, sandboxing: there is a jailer utility which basically places the Firecracker process [01:59.080 --> 02:04.080] on the host into an additional cgroup, chroot and seccomp environment. And it's all open [02:04.080 --> 02:11.240] source, Apache 2 licensed, and it's the technology behind AWS Lambda. [02:11.240 --> 02:16.880] So every Lambda runs in its own Firecracker virtualized container. [02:16.880 --> 02:21.520] So here's just a picture of what I've just told you.
[02:21.520 --> 02:26.680] So we have the kernel with KVM at the bottom, and then we have the Firecracker process, which [02:26.680 --> 02:33.360] has a thread for each vCPU which you configure in your guest, and then it has a special thread [02:33.360 --> 02:39.480] to handle I/O and an API thread, which is low priority, to handle the REST requests. And then [02:39.480 --> 02:46.360] it boots the guest kernel, which has the virtio devices, and the VMM thread handles these virtio [02:46.360 --> 02:52.920] queues and maps them, for network, to tap devices on the host, and for the block devices [02:52.920 --> 02:57.920] either onto a native block device on the host or onto a file system which is exported as [02:57.920 --> 03:03.000] a block device to the guest. And then you can run your application on the guest, and [03:03.000 --> 03:10.000] you can run as many guests as you want; it's only limited by your amount of memory, basically, [03:10.000 --> 03:18.320] and the overhead of Firecracker is just about 50 megabytes per... I know it's less, we will [03:18.320 --> 03:21.520] see, it's very small. [03:21.520 --> 03:28.120] So let's go to a demo. [03:28.120 --> 03:30.160] So I have to truncate the file. [03:30.160 --> 03:38.160] So here we just start Firecracker. We specify the API socket where we communicate, we have [03:38.160 --> 03:43.080] a log file, the log level info, and the boot timer to see the boot time. [03:43.080 --> 03:51.040] And now from another terminal we start to configure this with JSON data, as I told you before, [03:51.040 --> 04:00.360] so we configure two vCPUs and 512 megabytes of memory.
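The configuration step just described can be sketched in Python. This is a minimal illustration, not the script used in the demo: it builds the JSON body for the `PUT /machine-config` request of the Firecracker API with two vCPUs and 512 MiB, and renders the raw HTTP request that would be written to the API socket. The helper names and the socket path `/tmp/firecracker.socket` are assumptions made for the example.

```python
import json
import socket

# Hypothetical path; use whatever was passed to Firecracker as its API socket.
API_SOCKET = "/tmp/firecracker.socket"


def machine_config(vcpu_count: int, mem_size_mib: int) -> dict:
    """Build the JSON body for `PUT /machine-config`."""
    if vcpu_count < 1 or mem_size_mib < 1:
        raise ValueError("vcpu_count and mem_size_mib must be positive")
    return {"vcpu_count": vcpu_count, "mem_size_mib": mem_size_mib}


def render_put(path: str, body: dict) -> bytes:
    """Render a raw HTTP/1.1 PUT request carrying `body` as JSON."""
    payload = json.dumps(body).encode("utf-8")
    head = (
        f"PUT {path} HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Accept: application/json\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + payload


def send(request: bytes, socket_path: str = API_SOCKET) -> bytes:
    """Write the request to the API's Unix socket and return the raw reply.

    Only works while a Firecracker process is listening on `socket_path`.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
        conn.connect(socket_path)
        conn.sendall(request)
        return conn.recv(65536)


# Two vCPUs and 512 MiB, as in the demo.
request = render_put("/machine-config", machine_config(2, 512))
print(request.decode("ascii"))
```

Calling `send(request)` would submit the configuration to a running Firecracker process; the sketch stops at printing the request so that it runs without one.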
[04:00.360 --> 04:07.400] I have here a root file system, an ext4 root file system, and a freshly compiled Linux [04:07.400 --> 04:17.640] kernel, so I will now use another REST command to configure the Linux image which will be [04:17.640 --> 04:23.320] booted. I pass quite a lot of kernel arguments; it's mostly to switch off devices which we [04:23.320 --> 04:28.520] don't need anyway and which are unsupported, and we define the init script to just run bash, [04:28.520 --> 04:38.320] so the init script will be just a shell. And then we finally have to define a root file system, [04:38.320 --> 04:44.480] that's our ext4 file which I showed you before. And now that we've configured everything, we [04:44.480 --> 04:52.080] can just start the virtual machine, again with a JSON request, and when we go back into our [04:52.080 --> 04:57.640] window we see that now the virtual machine has been started. It took about 200 milliseconds [04:57.640 --> 05:12.960] to start bash, and it's a fully configured Linux. The image was assembled from an Ubuntu 22 image [05:12.960 --> 05:29.600] and the kernel I've compiled myself. You see we have two CPUs and about 512 megabytes [05:29.600 --> 05:35.200] of memory. If we exit the shell it will just reboot, because it was our init [05:35.200 --> 05:43.480] process. Of these 200 milliseconds which it takes to boot, the serial device alone took [05:43.480 --> 05:48.320] about 100 milliseconds, so if you take that away (usually in production you don't need [05:48.320 --> 05:59.840] the serial device) it boots in 100 milliseconds, and that's on my laptop. Okay, so a very quick [05:59.840 --> 06:05.160] comparison of Firecracker and Docker. Firecracker is fully KVM virtualized, Docker has only [06:05.160 --> 06:11.200] cgroup and namespace isolation. The good thing about having only cgroup and namespace isolation is [06:11.200 --> 06:16.080] that Docker images run on the same kernel, so they can do copy-on-write, page cache, memory [06:16.080 --> 06:24.160] 
sharing, so if you run many of them they are denser, whereas if you run several Firecracker [06:24.160 --> 06:28.600] images they cannot directly share memory, so you have to use ballooning devices, for example, [06:28.600 --> 06:33.280] in the guest to give back memory to the host. On the other side, that's much more secure, [06:33.280 --> 06:41.680] because every container has its own memory, its own kernel. Firecracker has snapshot [06:41.680 --> 06:46.560] support to checkpoint the whole container, with the kernel, everything together, and [06:46.560 --> 06:53.600] Docker can use CRIU, checkpoint/restore in userspace, to do the same thing, basically serialize [06:53.600 --> 07:01.600] a Docker container with all its processes to a file. We will see examples for that now. So now, what [07:01.600 --> 07:07.200] is CRaC and CRIU? As was mentioned before, CRaC is Coordinated Restore at Checkpoint; [07:07.200 --> 07:14.320] that's a new project in the OpenJDK. It has basically three points which are important. [07:14.320 --> 07:21.120] The first one is to create a standard checkpoint/restore notification API, because many applications [07:21.120 --> 07:29.600] are not aware of being cloned, and there is state, security, time, all this kind of stuff [07:29.600 --> 07:35.920] which an application might want to react upon, not only [07:35.920 --> 07:40.400] when checkpointing and restoring but especially when cloning the application. Think, for example, [07:40.400 --> 07:46.240] of an application which logs to a file, and then you checkpoint it and restart two clones [07:46.240 --> 07:49.840] and they both write to the same file; they will usually corrupt the file, so you have [07:49.840 --> 07:55.120] to take some measures if you run many things in parallel and the application is not prepared [07:55.120 --> 08:00.640] for that. CRaC is currently not part of an official OpenJDK release, it's still [08:01.840 --> 
08:06.400] mostly a research project in the OpenJDK, but you can already now make your application ready [08:06.400 --> 08:13.040] for CRaC by using the org.crac API that's available on Maven Central. That basically wraps the [08:13.040 --> 08:19.600] jdk.crac namespace, which is currently in the CRaC repository in OpenJDK, but if it finds [08:19.600 --> 08:25.200] javax.crac, once that should become available, it will switch to that. It also offers the [08:25.920 --> 08:33.840] possibility to pass a custom implementation via a system property. And then finally, what [08:33.840 --> 08:38.720] makes CRaC interesting for many people to experiment with is that it basically integrates [08:38.720 --> 08:46.480] with CRIU, so it has a copy of CRIU packed with the CRaC distribution, so you can easily [08:46.480 --> 08:54.000] checkpoint your Java process and restart it. And then, as I mentioned before, CRIU is checkpoint [08:54.000 --> 08:59.200] and restore in userspace; that's an old Linux functionality which allows you to serialize a single [08:59.200 --> 09:04.960] process to the file system. It uses the kernel's cgroup freezer to freeze the process or process [09:04.960 --> 09:12.560] tree and then writes all the memory to the disk, and so on. Still, CRIU has some issues, because [09:12.560 --> 09:20.560] it has to look at all the open file descriptors, shared memory segments, stuff like [09:20.560 --> 09:25.520] that, which might not be available again when you restore the image, whereas Firecracker, as I said [09:25.520 --> 09:29.200] before, restores the whole kernel with all the file system, everything in place, so it's much, [09:29.200 --> 09:36.800] much simpler from that perspective. So let's take a quick demo of CRaC.
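The log-file problem mentioned above can be modelled in a few lines. The sketch below is plain Python, not the org.crac Java API: `before_checkpoint` and `after_restore` are hypothetical stand-ins for the notification callbacks, and the two "clones" are simulated by copying an object rather than by restoring a real process image. It only illustrates why a resource wants to be told about checkpoint and restore: the file is closed before the checkpoint, and each clone reopens its own file afterwards.

```python
import copy
import os
import tempfile


class LogFile:
    """A log writer that reacts to (simulated) checkpoint/restore events."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.path = os.path.join(directory, "app.log")
        self.handle = open(self.path, "a")

    def write(self, line: str) -> None:
        self.handle.write(line + "\n")
        self.handle.flush()

    def before_checkpoint(self) -> None:
        # Release the file so the frozen state holds no open descriptor.
        self.handle.close()
        self.handle = None

    def after_restore(self, clone_id: str) -> None:
        # Every clone switches to its own file instead of sharing app.log.
        self.path = os.path.join(self.directory, f"app-{clone_id}.log")
        self.handle = open(self.path, "a")


workdir = tempfile.mkdtemp()
log = LogFile(workdir)
log.write("started")

log.before_checkpoint()            # "checkpoint": the state is frozen here
clones = {}
for clone_id in ("a", "b"):        # "restore" the same frozen state twice
    clone = copy.copy(log)
    clone.after_restore(clone_id)
    clone.write(f"hello from clone {clone_id}")
    clone.handle.close()
    clones[clone_id] = clone

for clone in clones.values():
    with open(clone.path) as f:
        print(clone.path, "->", f.read().strip())
```

Without the two callbacks, both clones would inherit the same open `app.log` and interleave their writes, which is the corruption scenario described in the talk.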
[09:36.800 --> 09:47.920] So I have here OpenJDK 17 with the CRaC extensions, and then you simply pass the option CRaCCheckpointTo, [09:47.920 --> 09:55.840] that's a file, and this is just Petclinic, the Spring Boot Petclinic example application, [09:56.640 --> 10:05.280] and I modified it to register with the org.crac callbacks, as I said. You can see here [10:05.280 --> 10:16.880] it's registered to org.crac, and now that I've started it I can use jcmd to checkpoint it, [10:16.880 --> 10:23.440] so I send it a checkpoint command. And you see, just out of the box it didn't work. It shows [10:23.440 --> 10:30.080] some exception, because it found, for example, that the port 8080 is open, and this uses a vanilla [10:30.080 --> 10:39.360] version of Tomcat which is not implementing the CRaC callbacks. But that's not that bad: [10:40.560 --> 10:48.640] it has a developer option which is to ignore exceptions, so for this simple case it will probably [10:48.640 --> 10:58.000] work. So let's try it: start it one more time, prepare the checkpoint here, and let's wait [10:58.000 --> 11:09.120] until it becomes ready. And now checkpoint it, and you see we also logged the resources, so [11:09.120 --> 11:13.920] you see there were about 10 file descriptors, and most of them were okay, because the CRaC [11:13.920 --> 11:21.200] modified VM already knows a lot of the file descriptors the VM is using, for example for [11:21.200 --> 11:26.800] the jar files it has opened or for the module files, and it closes them by itself without the [11:26.800 --> 11:35.920] need to register anything. So the checkpoint worked, and what's interesting here is that [11:35.920 --> 11:42.000] before checkpointing it calls the listener, the handler I installed in my Petclinic [11:42.000 --> 11:50.000] application, so I could do additional stuff before checkpointing. And now we can just [11:50.000 --> 12:02.000] restore this frozen process, and you see it starts instantly. It calls the
afterRestore hook [12:02.000 --> 12:11.440] I have registered, and we can send a curl request on 8080 and, yeah, it basically still works. [12:12.480 --> 12:18.720] So that's nice, let's go further. So now FireCRaCer: that's basically a combination of [12:18.720 --> 12:23.840] Firecracker and CRaC. I found it somehow funny that the words are so similar, so [12:23.840 --> 12:28.640] it's a play on words, and in my opinion it's the best of two worlds to combine these two. [12:28.640 --> 12:34.800] Currently, as I said, the CRaC project is based on CRIU, but I think it might be interesting to [12:34.800 --> 12:41.120] add support for Firecracker as well, and I'm currently working on that. With Firecracker you can [12:41.120 --> 12:51.840] basically checkpoint a plain JDK, even if it's not modified by CRaC, because, as I said, there is no need [12:51.840 --> 12:59.040] to worry about file descriptors. One issue with Firecracker, as I said before: you [12:59.040 --> 13:04.800] cannot trigger the checkpoint from Java. The CRaC implementation in OpenJDK can checkpoint [13:04.800 --> 13:12.000] itself because CRIU is running on the same kernel as the Java application, so Java just [13:12.000 --> 13:17.680] calls CRIU via JNI and checkpoints itself. That's obviously not possible in Firecracker, because [13:17.680 --> 13:22.560] you cannot escape from the guest; that's the whole point of running it in a [13:22.560 --> 13:28.160] fully virtualized guest. So we need another means of communication, but that's not that complicated. [13:29.440 --> 13:34.720] It offers maximum security and speed, and, as I said before, no copy-on-write memory sharing, but [13:34.720 --> 13:40.640] you can use ballooning and same-page merging, kernel features which also have their pluses and [13:40.640 --> 13:47.600] their drawbacks, but things to investigate. So let's do a Firecracker demo with Java now. [13:51.280 --> 13:54.960] To not bore you more with all these [13:54.960 --> 14:04.960] 
JSON requests, I've written a shell script which basically does all that in one script, and [14:04.960 --> 14:19.040] instead of calling bash it just starts Java as the init process. We can now submit the request [14:19.040 --> 14:27.040] and you see it's working, here is the request. I have still registered [14:27.040 --> 14:32.400] these callbacks, although I'm running on a vanilla JDK, by using the org.crac library, so [14:32.400 --> 14:40.400] they are empty, they won't do anything. And I can now snapshot Firecracker; you see that's [14:40.400 --> 14:49.520] also quite quick. Firecracker is not resumed automatically, so I have to kill [14:49.520 --> 15:01.520] it manually, and now if I restart from the snapshot you will see it also takes just a few milliseconds [15:01.520 --> 15:13.760] to restart the whole image, and again I can curl into it and it works. You see the [15:13.760 --> 15:19.440] hooks are not being called, because there is no real CRaC implementation in the back in this case, [15:19.440 --> 15:27.440] but checkpointing for Java itself works, and it's also easy to run a second clone now. [15:27.440 --> 15:34.720] Obviously we cannot run it in the same namespace, because it will use the same IP address as [15:34.720 --> 15:42.800] the first version, so we start it in a network namespace. So minus n zero is just [15:42.800 --> 15:52.160] to create a new namespace for the clone, and you see it uses ip netns exec to [15:52.160 --> 16:02.480] execute Firecracker, but it restores quite as quickly, and the initial IP address of the [16:02.480 --> 16:09.840] process in this namespace is now mapped to a different IP address on [16:09.840 --> 16:16.640] the host, but you see it's still working. So the guest still has the same IP address [16:16.640 --> 16:23.120] it had in the first place, it's just running in its own namespace, 
and inside the guest, again, [16:23.120 --> 16:35.040] Tomcat is running on the same port, all no problem. So we just kill the first instance and we kill the [16:35.040 --> 16:52.080] second instance. How much time do I have? Oh, okay, okay, so just a few words. I realized [16:52.080 --> 16:58.400] that talks which are rated highest usually show some animation, so I decided to do an animation, [16:58.400 --> 17:05.680] because usually I only show console demos. So, a quick introduction: a userfaultfd daemon is a [17:05.680 --> 17:12.880] possibility to handle page faults from user space, and Firecracker offers the possibility, [17:12.880 --> 17:19.920] instead of mapping the image file right into Firecracker's memory, to use an external [17:19.920 --> 17:26.080] userfaultfd daemon. If we write the userfaultfd daemon ourselves, we have the possibility to follow, [17:26.080 --> 17:32.400] page by page, which addresses get loaded at restore, and I found that interesting, so [17:34.960 --> 17:42.640] I created an animation for that. And for that [17:45.200 --> 17:53.280] we restart our Firecracker service with native memory tracking enabled, [17:53.280 --> 18:04.400] and from the host we now ssh into our Firecracker guest, where Tomcat is running, [18:04.400 --> 18:11.520] and just call jcmd native memory details and put that into a file, and we do the same thing [18:14.320 --> 18:20.480] with the pmap information. This is just a shell script inside the guest which basically [18:20.480 --> 18:28.000] prints all the virtual-to-physical mappings for all processes into a file. [18:30.480 --> 18:41.120] And now we can start the visualizer, and it takes the logs, [18:41.120 --> 18:56.080] oops, it takes the logs of the userfaultfd daemon and the NMT and the native mappings. So what you [18:56.080 --> 19:01.600] see here is basically the physical memory layout of the guest, so it's memory page zero and at the 
[19:01.600 --> 19:09.360] end it's memory page one gigabyte, and every square is a four-kilobyte page. And if you go, [19:09.360 --> 19:14.960] for example, on the Java process, you see the dark ones; these are the pages, the RSS of the Java [19:14.960 --> 19:21.360] process. The blue ones are occupied by the Java process but they are also in the page cache, so that's [19:21.360 --> 19:31.600] probably a file, for example, or a shared class, for example. [19:31.600 --> 19:36.800] When you look at the NMT output, we see that for example for the classes we use about [19:36.800 --> 19:45.520] 66... I probably cannot read it. It says virtually 69 megabytes, RSS is 60 megabytes, and the userfaultfd [19:45.520 --> 19:53.120] daemon loaded about 10 megabytes of it. And here's the animation I promised you. [19:53.120 --> 19:59.120] So this is how the pages got loaded when we did the first curl request on a resumed image, [19:59.120 --> 20:04.960] and the yellow ones are all the pages which got loaded, and the orange ones, I don't know, [20:04.960 --> 20:11.840] yeah, the ones that are orange belong to the virtual memory region I have selected here. So for [20:11.840 --> 20:18.000] example all the orange pages are the parts of the class space which got loaded for the first [20:18.000 --> 20:26.960] request. So there is a lot of room for more investigation. It would be nice to compact this [20:26.960 --> 20:32.640] more, physically, because you want to prefetch the things which get loaded, especially if you [20:32.640 --> 20:39.040] download your images from the network, for example. But the problem is that although the virtual [20:39.040 --> 20:46.720] address space is contiguous, the physical pages are not, and I try to look [20:46.720 --> 20:56.480] into possibilities to do that. So that's it, thank you, thank you very much. [20:56.480 --> 21:02.880] There's about 30 seconds for questions. Has anyone got a question 
[unclear] [21:02.880 --> 21:10.880] I have a question regarding... when you showed the CRaC implementation, there was an [21:10.880 --> 21:14.880] implementation that put into the, uh [21:14.880 --> 21:18.320] so [21:23.200 --> 21:26.320] yeah [21:30.720 --> 21:36.560] Yes, unfortunately there is no time in 20 minutes to show that, but you can obviously use the current [21:36.560 --> 21:43.760] CRaC implementation inside Firecracker. Use jcmd, and instead of CRIU there is a backend [21:43.760 --> 21:49.680] called pause engine; that's just a small program which, instead of calling CRIU, just [21:49.680 --> 21:56.000] suspends the whole process, and then you can send it a signal to restore it. So with Firecracker you [21:56.000 --> 22:03.040] basically checkpoint with the pause engine, then do the Firecracker snapshot, then restore Firecracker, [22:03.040 --> 22:07.920] and then just do an ssh with a kill signal on the process and it will restart. That's one [22:07.920 --> 22:14.800] possibility. Another one is, I wrote a JVMTI agent which basically does the same thing even without CRIU: [22:14.800 --> 22:22.800] it suspends all threads, it calls System.gc() and then waits on a port, so you just [22:22.800 --> 22:30.720] ping it with telnet or whatsoever, and it even calls the hooks by implementing [22:30.720 --> 22:38.320] this custom possibility with the property, so I tell org.crac to use my [22:38.320 --> 22:45.760] CRaC implementation to call the hooks. So that all works; it's in the repository, which is... [22:45.760 --> 23:01.680] I had a resource slide which I didn't show, it has all the links, so
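The host-side part of the sequence from this answer (snapshot Firecracker, then restore it) maps onto a small number of Firecracker API requests. The sketch below only builds the request bodies, using the field names documented for Firecracker's snapshot API; they should be checked against the Firecracker version in use. The file paths and helper names are hypothetical, and the guest-side steps (suspending the JVM before the snapshot and signalling it after the restore) are not shown.

```python
import json

# Hypothetical locations for the snapshot artifacts on the host.
SNAPSHOT_FILE = "/tmp/java.snap"
MEMORY_FILE = "/tmp/java.mem"


def pause_vm():
    """The microVM has to be paused before a snapshot can be taken."""
    return "PATCH", "/vm", {"state": "Paused"}


def create_snapshot(snapshot_path, mem_file_path):
    """Write the VM state and the guest memory to two files on the host."""
    body = {
        "snapshot_type": "Full",
        "snapshot_path": snapshot_path,
        "mem_file_path": mem_file_path,
    }
    return "PUT", "/snapshot/create", body


def load_snapshot(snapshot_path, mem_file_path, resume=True):
    """Sent to a freshly started Firecracker process to restore the VM."""
    body = {
        "snapshot_path": snapshot_path,
        "mem_backend": {"backend_type": "File", "backend_path": mem_file_path},
        "resume_vm": resume,
    }
    return "PUT", "/snapshot/load", body


steps = [
    pause_vm(),
    create_snapshot(SNAPSHOT_FILE, MEMORY_FILE),
    load_snapshot(SNAPSHOT_FILE, MEMORY_FILE),
]
for method, path, body in steps:
    print(method, path, json.dumps(body))
```

The first two requests go to the running Firecracker process; the third goes to a new process, optionally started inside its own network namespace as in the clone demo.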