[00:00.000 --> 00:14.100] We want to start, and that means I need to once again ask you to quiet down, please, [00:14.100 --> 00:16.760] so that we can hear our speaker. [00:16.760 --> 00:22.000] Our next talk is by Adrian Reber, and he's going to talk about Kubernetes and Checkpoint [00:22.000 --> 00:25.000] Restore. [00:25.000 --> 00:33.000] Hello, mic is on. So welcome everyone to my talk about Kubernetes and Checkpoint/Restore. [00:33.000 --> 00:35.000] Please quiet down. [00:35.000 --> 00:40.000] So I actually gave a talk about container migration here in 2020. [00:40.000 --> 00:43.000] That was using Podman; in the last three years [00:43.000 --> 00:47.000] I was able to move it into Kubernetes. [00:47.000 --> 00:49.000] It's not on. [00:49.000 --> 00:52.000] It's green. [00:52.000 --> 00:58.000] Better now? No? [00:58.000 --> 01:01.000] Too soft? [01:01.000 --> 01:03.000] What's too soft? [01:03.000 --> 01:07.000] I think the only thing that you can do is move it slightly down. [01:07.000 --> 01:09.000] Down? [01:09.000 --> 01:11.000] Better now? No. [01:11.000 --> 01:13.000] Can you turn it up? [01:13.000 --> 01:15.000] Oh, there was... [01:15.000 --> 01:17.000] That's for the... [01:17.000 --> 01:23.000] Green is good. [01:23.000 --> 01:25.000] Better now? Better now? [01:25.000 --> 01:28.000] Not too good? [01:28.000 --> 01:34.000] Is it better now? No? No? No? [01:34.000 --> 01:39.000] Okay, we've got to make do with what we have for now, but please, if you all quiet down, [01:39.000 --> 01:46.000] we can hear our speaker a lot better. [01:46.000 --> 01:51.000] Okay, so I've been working on process migration for at least 13 years now. [01:51.000 --> 01:59.000] I'm involved in CRIU, which is the basis for the container migration we are doing here today, [01:59.000 --> 02:01.000] I don't know, [02:01.000 --> 02:08.000] since around 2012, and I've been focusing mainly on container migration since 2015. [02:08.000 --> 02:13.000] So the agenda for today's session here is... [02:13.000 --> 02:19.000] Can we turn something down? I get feedback. [02:19.000 --> 02:22.000] Okay, so the agenda is something like this: [02:22.000 --> 02:25.000] I'm going to talk a bit about the background of checkpoint/restore, [02:25.000 --> 02:29.000] especially how CRIU is integrated into different things. [02:29.000 --> 02:35.000] Then I will present use cases for container checkpoint/restore and container migration. [02:35.000 --> 02:38.000] Then I want to talk about a few technical details of CRIU. [02:38.000 --> 02:41.000] I might make this very short depending on the time. [02:41.000 --> 02:45.000] And then I want to talk about the future of checkpoint/restore, [02:45.000 --> 02:50.000] especially in Kubernetes, and what we are thinking about on that topic right now. [02:50.000 --> 02:54.000] So Checkpoint/Restore In Userspace is the name of the tool, CRIU. [02:54.000 --> 02:58.000] The reason for the name is that checkpointing and restoring has been a thing [02:58.000 --> 03:01.000] for over 20 years now in Linux, maybe even longer. [03:01.000 --> 03:02.000] And there were different implementations. [03:02.000 --> 03:04.000] There were ones using an external kernel module. [03:04.000 --> 03:07.000] There were ones doing LD_PRELOAD. [03:07.000 --> 03:11.000] And around 2006 or 2008 there was a patch set [03:11.000 --> 03:13.000] for the Linux kernel to do it in the kernel. [03:13.000 --> 03:15.000] It was over 100 patches. [03:15.000 --> 03:19.000] It was never merged because it was really huge and complicated.
[03:19.000 --> 03:22.000] And because the in-kernel approach didn't work out, [03:22.000 --> 03:29.000] CRIU was named "in userspace", because it's not in the kernel, it's in user space. [03:29.000 --> 03:34.000] There are multiple integrations of CRIU in different container engines, [03:34.000 --> 03:38.000] sometimes orchestration. The first one to mention here is OpenVZ. [03:38.000 --> 03:42.000] They invented CRIU for their container product many years ago [03:42.000 --> 03:46.000] to live-migrate containers from one node to another. [03:46.000 --> 03:51.000] So the thing about CRIU is that it has been developed with containers in mind. [03:51.000 --> 03:55.000] At that time it was probably for different containers, [03:55.000 --> 04:02.000] but it's for containers, and that's why it works as well as it does today. [04:02.000 --> 04:08.000] Then we know that Google uses CRIU in their internal container engine called Borg. [04:08.000 --> 04:13.000] I have no details about it except the things I've heard from them at conferences. [04:13.000 --> 04:22.000] So what they told us at CRIU upstream is that they use container migration [04:22.000 --> 04:26.000] for low-priority jobs on nodes. [04:26.000 --> 04:31.000] And if there are not enough resources, the container will be migrated. [04:31.000 --> 04:37.000] They said that previously they killed the container and restarted it somewhere else, [04:37.000 --> 04:42.000] so all the work was lost, and now they can just migrate it to another node, [04:42.000 --> 04:48.000] and they say they use it for background tasks; the example they gave [04:48.000 --> 04:51.000] is YouTube re-encoding of videos, which happens in the background, [04:51.000 --> 04:56.000] it's not time-critical, so that's why they use checkpoint/restore for it. [04:56.000 --> 05:02.000] There's an integration in LXD which enables you to migrate containers from one host to another, [05:02.000 --> 05:06.000] then it's integrated in Docker, it's integrated in Podman, [05:06.000 --> 05:09.000] which is what I've been working on mainly for the last five years, [05:09.000 --> 05:17.000] and the thing I've been working on in the last three years to get it into Kubernetes [05:17.000 --> 05:21.000] is to integrate CRIU support into CRI-O. [05:21.000 --> 05:27.000] This is one of the existing container engines which Kubernetes can use. [05:27.000 --> 05:31.000] Interestingly enough, there's a ticket about container live migration [05:31.000 --> 05:35.000] in Kubernetes open since 2015, and since then nothing had happened [05:35.000 --> 05:40.000] until now, where we kind of can migrate containers; [05:40.000 --> 05:46.000] we can definitely checkpoint them, and we introduced this into Kubernetes [05:46.000 --> 05:49.000] under the label "forensic container checkpointing".
[05:49.000 --> 05:54.000] This was an interesting experience for me because I was not aware [05:54.000 --> 05:58.000] of how Kubernetes' processes work for getting something new in there, [05:58.000 --> 06:02.000] so I wrote some code, I submitted the patches, and then nothing happened, [06:02.000 --> 06:07.000] and at some point people told me you have to write something called a Kubernetes enhancement proposal; [06:07.000 --> 06:10.000] it's a document where you describe what you want to do, [06:10.000 --> 06:13.000] so I did this. These are the links to the documents [06:13.000 --> 06:19.000] I wrote for this; the third link is the pull request for the actual code changes, which are marginal, [06:19.000 --> 06:25.000] and the last link is a blog post which describes [06:25.000 --> 06:31.000] how to use forensic container checkpointing in combination with Kubernetes today. [06:31.000 --> 06:35.000] The reason for the name "forensic container checkpointing" is that [06:35.000 --> 06:39.000] we were looking for a way to introduce checkpointing into Kubernetes [06:39.000 --> 06:43.000] with minimal impact on Kubernetes. [06:43.000 --> 06:48.000] The thing is, it's a more or less completely new concept for containers, [06:48.000 --> 06:52.000] because Kubernetes thinks about containers as: you start them, you stop them, [06:52.000 --> 06:55.000] they're done, you don't care about anything else, [06:55.000 --> 06:58.000] and now there's something new there which says, [06:58.000 --> 07:02.000] okay, but I can still move my container from one node to another node [07:02.000 --> 07:08.000] and keep all the state, and so it was a long discussion to get it into Kubernetes. [07:08.000 --> 07:11.000] The idea behind forensic container checkpointing is: [07:11.000 --> 07:17.000] you have a container running somewhere and you suspect there might be something wrong; [07:17.000 --> 07:21.000] you don't want to stop it immediately, because maybe the attacker can detect if you stop it [07:21.000 --> 07:25.000] and remove traces, so instead you can take a checkpoint of the container; [07:25.000 --> 07:27.000] the container never knows it was checkpointed; [07:27.000 --> 07:31.000] you can analyze it in a sandboxed environment somewhere else, [07:31.000 --> 07:36.000] you can look at all the memory pages offline without the container running, [07:36.000 --> 07:39.000] or you can restore it as many times as you want. [07:39.000 --> 07:42.000] So that's the idea behind forensic container checkpointing [07:42.000 --> 07:46.000] and the label under which it's currently available in Kubernetes. [07:46.000 --> 07:50.000] So, use cases for checkpoint/restore and container migration: [07:50.000 --> 07:54.000] I have a couple of them, and one has a demo which relies on the network, [07:54.000 --> 07:56.000] so we will see if this works.
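To make the forensic idea concrete, here is a minimal sketch of what an offline inspection of such a checkpoint archive could look like. It assumes the archive layout CRI-O writes today (a checkpoint/ directory with CRIU image files plus a rootfs-diff.tar containing changed files); the archive name and exact contents are illustrative, so adapt them to what you actually find.

```bash
# Copy the checkpoint archive from the node to an isolated analysis machine,
# then unpack it (the file name is illustrative; the kubelet adds a timestamp).
mkdir analysis
tar -xf checkpoint-counters_default-counter-<timestamp>.tar -C analysis

# The container is never resumed; everything below is offline inspection.
ls analysis                          # CRIU images, rootfs diff, config/spec dumps
tar -tf analysis/rootfs-diff.tar     # files the workload changed inside the container
strings analysis/checkpoint/pages-*.img | grep -i -e secret -e passwd   # scan dumped memory
```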
[07:56.000 --> 08:00.000] So the first and maybe simplest use case for checkpoint/restore [08:00.000 --> 08:03.000] for containers is reboot and save state: [08:03.000 --> 08:07.000] I have a host with a blue kernel running on it, [08:07.000 --> 08:09.000] the kernel is getting out of date, [08:09.000 --> 08:12.000] I have to update it, and I have a stateful container, [08:12.000 --> 08:15.000] because for stateless containers it doesn't make sense, [08:15.000 --> 08:18.000] but the stateful container takes some time to start. [08:18.000 --> 08:21.000] So what can you do with checkpoint/restore? [08:21.000 --> 08:25.000] You can take a copy of the container and write it to disk with all the state, [08:25.000 --> 08:29.000] with all memory pages saved exactly as they were just before; [08:29.000 --> 08:33.000] you update the kernel, you reboot the system, and it comes up with a green kernel [08:33.000 --> 08:35.000] with all security holes fixed, [08:35.000 --> 08:38.000] but the container you can restore without waiting a long time; [08:38.000 --> 08:42.000] it's immediately there on your rebooted host. [08:42.000 --> 08:46.000] Another use case, which is similar to this, is the quick-startup use case. [08:46.000 --> 08:49.000] People were talking to me about this, [08:49.000 --> 08:53.000] so this is what people actually use in production, from what I've been told: [08:53.000 --> 08:57.000] they have a container that takes forever to start, [08:57.000 --> 08:59.000] it takes like eight minutes to initialize, [08:59.000 --> 09:03.000] and they have some software-as-a-service thing [09:03.000 --> 09:06.000] where they want customers to have a container immediately, [09:06.000 --> 09:09.000] so what they do is they don't initialize it from scratch, [09:09.000 --> 09:12.000] they take a checkpoint once it's initialized, [09:12.000 --> 09:16.000] and then they can create multiple copies of the container really fast, [09:16.000 --> 09:18.000] in a matter of seconds, [09:18.000 --> 09:22.000] and so the customers can have their containers faster, [09:22.000 --> 09:24.000] and maybe they are happier. [09:24.000 --> 09:28.000] The next thing, the combination of those things, is container live migration: [09:28.000 --> 09:31.000] I have a source node, I have a destination node, [09:31.000 --> 09:35.000] and I want to move my container from one system to the other system [09:35.000 --> 09:37.000] without losing the state of the container, [09:37.000 --> 09:41.000] so I take a checkpoint and then I can restore the container [09:41.000 --> 09:45.000] on the destination system, once or multiple times, [09:45.000 --> 09:49.000] and this is the place where I want to do my demo, [09:49.000 --> 09:57.000] so let's see. So I have a Kubernetes thing running here [09:57.000 --> 10:02.000] and I have a small YAML file with two containers, [10:02.000 --> 10:05.000] let's have a look at the YAML file. [10:12.000 --> 10:15.000] So it's a pod with two containers, [10:15.000 --> 10:19.000] one is called wildfly, this is a WildFly-based Java application, [10:19.000 --> 10:21.000] and the other one is counter; [10:21.000 --> 10:23.000] both are really simple, stateful containers: [10:23.000 --> 10:26.000] if I do a request to the container I get back a number, [10:26.000 --> 10:30.000] the number is incremented, and the second time I get the incremented number, [10:30.000 --> 10:34.000] so let's talk to the container, hopefully it works.
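The YAML file is hard to read on the recording, so here is a rough reconstruction of what a pod definition with those two containers could look like; only the pod name (counters) and the container names (wildfly, counter) come from the demo, the image references are placeholders.

```bash
# Rough reconstruction of the demo pod definition (image names are placeholders).
cat > counters.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: counters
spec:
  containers:
  - name: wildfly
    image: quay.io/example/wildfly-app:latest   # placeholder
  - name: counter
    image: quay.io/example/counter:latest       # placeholder
EOF
kubectl apply -f counters.yaml
```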
[10:38.000 --> 10:43.000] Okay, this is hard to read, but I think I need this ID, [10:43.000 --> 10:46.000] so I'll just do a curl to the container here, [10:46.000 --> 10:51.000] and then I need to replace the ID to figure out the IP address [10:51.000 --> 10:55.000] of my, where's my mouse here, container, [10:55.000 --> 11:00.000] and it returns counter: 0, counter: 1, counter: 2, [11:00.000 --> 11:03.000] so it's stateful, but it's simple. [11:03.000 --> 11:08.000] So, to use checkpoint/restore in Kubernetes: [11:08.000 --> 11:11.000] this is currently a kubelet-only interface, [11:11.000 --> 11:13.000] because we still don't know what [11:13.000 --> 11:16.000] the best way is to integrate it into Kubernetes, [11:16.000 --> 11:20.000] so it's not straightforward yet to use it, but it's there. [11:20.000 --> 11:23.000] So I'm also doing a curl, [11:23.000 --> 11:26.000] now let's find my command in the history, no, that's the wrong one, [11:26.000 --> 11:29.000] oh, there it was, missed it, [11:29.000 --> 11:35.000] sorry, almost have it, [11:35.000 --> 11:37.000] okay, so this is the command. [11:37.000 --> 11:40.000] So what I'm doing here, I'm just talking to the kubelet, [11:40.000 --> 11:45.000] you see the HTTPS address at the end of the long line, [11:45.000 --> 11:49.000] and it says I'm using the checkpoint API endpoint, [11:49.000 --> 11:52.000] and I'm trying to checkpoint a container [11:52.000 --> 11:55.000] in the default Kubernetes namespace, in the pod counters, [11:55.000 --> 11:58.000] and the container counter. So I'm doing this, [11:58.000 --> 12:01.000] and now it's creating the checkpoint in the background, [12:01.000 --> 12:03.000] and if I look at what it says, [12:03.000 --> 12:05.000] it has just created a file somewhere [12:05.000 --> 12:09.000] which contains all file system changes, all memory pages, [12:09.000 --> 12:11.000] the complete state of the container. [12:11.000 --> 12:14.000] And now I want to migrate it to another host, [12:14.000 --> 12:17.000] and for that I have to create [12:17.000 --> 12:21.000] kind of an OCI image out of it. I'm using Buildah here, [12:21.000 --> 12:25.000] and then I'm saying, [12:25.000 --> 12:28.000] I'll give it an annotation so that the destination knows [12:28.000 --> 12:31.000] this is a checkpoint image, [12:31.000 --> 12:35.000] then I'm going to include the checkpoint archive [12:35.000 --> 12:38.000] into the container, into the image, [12:38.000 --> 12:42.000] and then I will say commit, that's the wrong one, [12:42.000 --> 12:46.000] commit, and I'm going to call it checkpoint-image:latest, [12:46.000 --> 12:51.000] so now I have an OCI-type container image locally [12:51.000 --> 12:53.000] which contains the checkpoint, [12:53.000 --> 12:57.000] and now I will push it to a registry, [12:57.000 --> 13:02.000] here it was, and I will call it tech39, [13:02.000 --> 13:05.000] and now it's getting pushed to a registry. [13:05.000 --> 13:08.000] So this works pretty well, but this VM is not local, [13:08.000 --> 13:12.000] and now I want to restore the container on my local VM, [13:12.000 --> 13:15.000] and that's happening here. [13:15.000 --> 13:21.000] Right, crictl ps, so nothing is running, [13:21.000 --> 13:25.000] then I have to edit my YAML file, [13:28.000 --> 13:31.000] and so it's pretty similar to the one I had before: [13:31.000 --> 13:35.000] I have a pod called counters, [13:35.000 --> 13:37.000] and I have a container wildfly, [13:37.000 --> 13:40.000] which is started from a normal image, [13:40.000 --> 13:44.000] and the other container, called counter, [13:44.000 --> 13:49.000] is started from the checkpoint image.
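Since the terminal is hard to read on the recording, here is a sketch of the commands from this part of the demo, closely following the forensic container checkpointing blog post linked earlier. The certificate paths, the checkpoint archive name, and the registry are assumptions, and the annotation name should be checked against your CRI-O version.

```bash
# 1. Checkpoint the container "counter" in pod "counters" (default namespace)
#    via the kubelet-only checkpoint endpoint on port 10250.
curl -sk -X POST \
  --cert /path/to/kubelet-client.crt --key /path/to/kubelet-client.key \
  "https://localhost:10250/checkpoint/default/counters/counter"
# The kubelet writes a tar archive under /var/lib/kubelet/checkpoints/.

# 2. Wrap the checkpoint archive into an OCI image with Buildah and add the
#    annotation that tells the destination CRI-O this is a checkpoint image.
newcontainer=$(buildah from scratch)
buildah add "$newcontainer" \
  /var/lib/kubelet/checkpoints/checkpoint-counters_default-counter-<timestamp>.tar /
buildah config \
  --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=counter "$newcontainer"
buildah commit "$newcontainer" checkpoint-image:latest
buildah rm "$newcontainer"

# 3. Push the image to a registry the destination node can pull from (placeholder name).
buildah push localhost/checkpoint-image:latest registry.example.com/checkpoint-image:latest
```

On the destination side, the only change to the pod definition is that the counter container's image now points at this checkpoint image in the registry.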
[13:49.000 --> 13:53.000] And now I say apply, [13:53.000 --> 13:59.000] and now let's see what the network says, [13:59.000 --> 14:02.000] if it likes me. [14:02.000 --> 14:05.000] So it now says, it's really hard to read [14:05.000 --> 14:07.000] because it's a large font, but it said [14:07.000 --> 14:09.000] pulling the initial image, so that's already there, [14:09.000 --> 14:13.000] so it doesn't need to pull; it created the container wildfly, [14:13.000 --> 14:15.000] it started the container wildfly, [14:15.000 --> 14:18.000] and now it's actually pulling the checkpoint archive [14:18.000 --> 14:21.000] from the registry, oh, and it has created a container [14:21.000 --> 14:25.000] and started a container, so now we have a restored container [14:25.000 --> 14:28.000] hopefully running here. Let's get the ID, [14:28.000 --> 14:30.000] the ID of the container, [14:30.000 --> 14:34.000] and let's talk to the container again, [14:34.000 --> 14:37.000] and so now we shouldn't see counter: 0 [14:37.000 --> 14:40.000] but counter, I don't know, three or four, [14:40.000 --> 14:42.000] I don't remember what it was last time, [14:42.000 --> 14:45.000] this is the right ID, I hope, [14:45.000 --> 14:48.000] yeah, and there it is. So now we have a stateful migration [14:48.000 --> 14:51.000] of a container from one Kubernetes host [14:51.000 --> 14:53.000] to another Kubernetes host, [14:53.000 --> 14:56.000] by creating a checkpoint, pushing it to a registry, [14:56.000 --> 14:59.000] and then kind of tricking Kubernetes [14:59.000 --> 15:02.000] into starting a container, [15:02.000 --> 15:06.000] but in the background we used a checkpoint image, [15:06.000 --> 15:09.000] so Kubernetes thinks it started a normal container, [15:09.000 --> 15:12.000] but there was a checkpoint behind it, [15:12.000 --> 15:16.000] so the restoring of the checkpoint [15:16.000 --> 15:19.000] all happens in the container engine below it, [15:19.000 --> 15:22.000] in CRI-O, and for Kubernetes it's just a normal container [15:22.000 --> 15:27.000] that it has restored. So, back to my slides. [15:27.000 --> 15:30.000] So, another use case [15:30.000 --> 15:33.000] people are interested in a lot, [15:33.000 --> 15:36.000] which I had never thought about, is spot instances, [15:36.000 --> 15:39.000] which AWS and Google have: [15:39.000 --> 15:42.000] cheap machines which you can get, [15:42.000 --> 15:45.000] but the deal is they can take them away anytime they want, [15:45.000 --> 15:48.000] like you have two minutes before they take it away. [15:48.000 --> 15:51.000] And so if you have checkpointing, [15:51.000 --> 15:54.000] and this is independent of Kubernetes or not, [15:54.000 --> 15:57.000] but if you have Kubernetes on your spot instances, [15:57.000 --> 16:00.000] you can checkpoint your containers right into some storage [16:00.000 --> 16:03.000] and then restore the container on another system, [16:03.000 --> 16:07.000] and still use spot instances without losing any of your [16:07.000 --> 16:11.000] calculation work, whatever it was doing. [16:11.000 --> 16:15.000] So, something about CRIU. [16:15.000 --> 16:19.000] I mentioned everything we are doing here is using CRIU, [16:19.000 --> 16:22.000] so the call stack is basically: [16:22.000 --> 16:25.000] the kubelet talks to CRI-O, CRI-O talks to runc, runc [16:25.000 --> 16:28.000] talks to CRIU, and CRIU does the checkpoint.
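As a rough illustration of the lower layers of that call stack, this is roughly what the same operation looks like if you drive the OCI runtime and CRIU by hand, outside of Kubernetes; the container ID, PID, and directories are placeholders, and a real container checkpoint needs more options than shown here.

```bash
# What CRI-O asks the OCI runtime to do (run from the container's bundle directory):
runc checkpoint --image-path /tmp/ckpt --leave-running mycontainer

# And, roughly, what runc invokes underneath, at the bottom of the stack:
criu dump -t <container-init-pid> -D /tmp/ckpt --leave-running

# Restoring goes the same way in reverse:
runc restore --image-path /tmp/ckpt mycontainer-restored
```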
[16:28.000 --> 16:31.000] And then each layer adds some metadata to it, [16:31.000 --> 16:34.000] and so that's how we have it, but all the main work [16:34.000 --> 16:37.000] of checkpointing a process is done by CRIU. [16:37.000 --> 16:40.000] So, some details about CRIU. Of course, the first step [16:40.000 --> 16:43.000] is checkpointing the container: [16:43.000 --> 16:46.000] CRIU uses ptrace or the cgroup freezer to stop [16:46.000 --> 16:49.000] all processes in the container, [16:49.000 --> 16:52.000] and then we look at /proc/<pid> [16:52.000 --> 16:55.000] to collect information about the processes. [16:55.000 --> 16:58.000] That's also one of the reasons why it's called "in userspace", [16:58.000 --> 17:02.000] because we use existing user-space interfaces. [17:02.000 --> 17:05.000] CRIU over the years added additional interfaces [17:05.000 --> 17:08.000] to the kernel, but they've never been checkpoint-only; [17:08.000 --> 17:12.000] they usually just expose additional information [17:12.000 --> 17:15.000] you can get about a running process. [17:15.000 --> 17:18.000] So once all the information in /proc/<pid> has been collected [17:18.000 --> 17:21.000] by CRIU, another part of CRIU comes in, [17:21.000 --> 17:24.000] which is called the parasite code. [17:24.000 --> 17:27.000] The parasite code is injected into the process, [17:27.000 --> 17:30.000] and it's now running as a daemon in the address space [17:30.000 --> 17:33.000] of the process, and this way [17:33.000 --> 17:36.000] CRIU can talk to this parasite code [17:36.000 --> 17:39.000] and get information about the process [17:39.000 --> 17:42.000] from inside the address space of the process, [17:42.000 --> 17:45.000] for example to dump all the memory pages really fast. [17:45.000 --> 17:48.000] So a lot of steps are done [17:48.000 --> 17:51.000] by the parasite code which is injected into the [17:51.000 --> 17:54.000] target process we want to checkpoint. [17:54.000 --> 17:57.000] The parasite code is removed after usage, and the process never knows [17:57.000 --> 18:00.000] it was under the control of the parasite code. [18:00.000 --> 18:03.000] I have a diagram which tries to [18:03.000 --> 18:06.000] show how this could look: so we have [18:06.000 --> 18:09.000] the original process code to be checkpointed, [18:09.000 --> 18:12.000] we take some of the code out, [18:12.000 --> 18:15.000] it's not perfectly accurate, but we put the parasite code [18:15.000 --> 18:18.000] into the original process; now the parasite code is running, [18:18.000 --> 18:21.000] doing the things it has to do, [18:21.000 --> 18:24.000] and then we remove it, and the program looks the same as it did before, [18:24.000 --> 18:27.000] and at this point all checkpointing [18:27.000 --> 18:30.000] information has been written to disk, [18:30.000 --> 18:33.000] and the process is killed or continues running; [18:33.000 --> 18:36.000] this really depends on what you want to do, [18:36.000 --> 18:39.000] and no, [18:39.000 --> 18:42.000] we are not aware of any effect on the process [18:42.000 --> 18:45.000] if it continues to run after checkpointing. [18:45.000 --> 18:48.000] And to migrate the process, the last step [18:48.000 --> 18:51.000] is restoring. What CRIU does is it reads all the checkpoint images, [18:51.000 --> 18:54.000] then it recreates the process tree of the container [18:54.000 --> 18:57.000] by doing a clone() or clone3() for each PID [18:57.000 --> 19:00.000] and thread ID, and then the process tree is recreated [19:00.000 --> 19:03.000] as before.
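Both sides of this lean on ordinary kernel interfaces: at checkpoint time CRIU reads most of what it needs from /proc, and at restore time it replays that data onto the freshly cloned process tree. As a quick illustration of the kind of data involved (the PID and the file descriptor number are placeholders):

```bash
PID=1234                  # any process you own
ls -l /proc/$PID/fd       # open file descriptors and what they point to
cat /proc/$PID/fdinfo/3   # per-fd offset ("pos") and flags that have to be replayed
cat /proc/$PID/maps       # memory mappings that have to be recreated on restore
cat /proc/$PID/status     # credentials, signal masks, and other process state
```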
Then CRIU kind of morphs [19:03.000 --> 19:06.000] all the processes in the process tree [19:06.000 --> 19:09.000] into the original processes, and a good [19:09.000 --> 19:12.000] example is file descriptors, [19:12.000 --> 19:15.000] because this one is easy: what CRIU does during checkpointing [19:15.000 --> 19:18.000] is it looks at all the file descriptors, [19:18.000 --> 19:21.000] it records the file descriptor number and the file name, [19:21.000 --> 19:24.000] the path, and the file offset where it currently is, [19:24.000 --> 19:27.000] and during restore it just does that again: [19:27.000 --> 19:30.000] it opens the same file with the same file descriptor number [19:30.000 --> 19:33.000] and then it sets the file offset to the same location, [19:33.000 --> 19:36.000] and then the process can continue to run, [19:36.000 --> 19:39.000] and the file descriptor is the same as it was before [19:39.000 --> 19:42.000] the checkpoint. [19:42.000 --> 19:45.000] Then all the memory pages are loaded back into memory [19:45.000 --> 19:48.000] and mapped to the right location; [19:48.000 --> 19:51.000] we load all the security settings, like AppArmor, SELinux, seccomp; [19:51.000 --> 19:54.000] we do this really late in CRIU [19:54.000 --> 19:57.000] because some of these things make [19:57.000 --> 20:00.000] things very difficult, but it happens late, [20:00.000 --> 20:03.000] so it's working well, [20:03.000 --> 20:06.000] and then, when everything, [20:06.000 --> 20:09.000] all the resources, are restored and all the memory pages are back, [20:09.000 --> 20:12.000] CRIU tells the process to continue to run, [20:12.000 --> 20:15.000] and then you have a restored process. [20:15.000 --> 20:18.000] So now to what's next [20:18.000 --> 20:21.000] in Kubernetes. So we can kind of migrate [20:21.000 --> 20:24.000] a container like I have shown, [20:24.000 --> 20:27.000] but we are only at the start of the whole thing, [20:27.000 --> 20:30.000] so the next thing would maybe be [20:30.000 --> 20:33.000] kubectl checkpoint, so that you don't have to talk directly to the kubelet. [20:33.000 --> 20:36.000] For kubectl checkpoint, [20:36.000 --> 20:39.000] one of the things which is currently under discussion [20:39.000 --> 20:42.000] is that if you do a checkpoint, all of a sudden [20:42.000 --> 20:45.000] you have all the memory pages on disk, with all secrets, [20:45.000 --> 20:48.000] private keys, random numbers, whatever, [20:48.000 --> 20:51.000] and so what we do for the current Kubernetes setup is [20:51.000 --> 20:54.000] it's only readable by root, because if you are root [20:54.000 --> 20:57.000] you can easily access the memory of all the processes anyway, [20:57.000 --> 21:00.000] so if the checkpoint archive [21:00.000 --> 21:03.000] is also only readable by root, [21:03.000 --> 21:06.000] it's the same problem you already have. [21:06.000 --> 21:09.000] The thing is, you can take the checkpoint archive, [21:09.000 --> 21:12.000] move it to another machine, and then maybe somebody else can read it, [21:12.000 --> 21:15.000] so there's still a problem that you can leak information [21:15.000 --> 21:18.000] you don't want to leak. So the thing [21:18.000 --> 21:21.000] we are thinking about is to maybe encrypt the image; [21:21.000 --> 21:24.000] we don't know yet if we do it at the OCI [21:24.000 --> 21:27.000] image level or at the CRIU level, [21:27.000 --> 21:30.000] we're talking about it, so it's not yet clear what we want to do, [21:30.000 --> 21:33.000] but at some point the goal is definitely to have something [21:33.000 --> 21:36.000] like kubectl checkpoint to make it easy.
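Until such a mechanism exists, the practical consequence is to treat a checkpoint archive like a memory dump. A minimal sketch, assuming the default kubelet checkpoint location, and using plain gpg purely as an ad-hoc stopgap before copying an archive off the node (this is not the OCI-level or CRIU-level encryption being discussed upstream):

```bash
# The kubelet writes checkpoint archives here, readable by root only.
ls -l /var/lib/kubelet/checkpoints/

# Ad-hoc protection before moving an archive off the node (illustrative only).
gpg --symmetric --cipher-algo AES256 \
  /var/lib/kubelet/checkpoints/checkpoint-counters_default-counter-<timestamp>.tar
```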
[21:36.000 --> 21:39.000] Then, I've only shown how I can [21:39.000 --> 21:42.000] checkpoint a container out of a pod and restore it into another [21:42.000 --> 21:45.000] pod, so the other thing would be to do a complete [21:45.000 --> 21:48.000] pod checkpoint and restore. [21:48.000 --> 21:51.000] I've done proofs of concept of this, so it's not really a technical [21:51.000 --> 21:54.000] challenge, but you have to figure out [21:54.000 --> 21:57.000] how the interface in Kubernetes should look to implement this. [21:57.000 --> 22:00.000] Then, if all of this works, maybe you can do a [22:00.000 --> 22:03.000] kubectl migrate to just tell Kubernetes: [22:03.000 --> 22:06.000] please migrate this container to some other node, to some other host, [22:06.000 --> 22:09.000] and if this works, then maybe we could also [22:09.000 --> 22:12.000] have scheduler integration, so that if certain resources [22:12.000 --> 22:15.000] are getting low, low-priority containers [22:15.000 --> 22:18.000] can be moved to another place. [22:18.000 --> 22:21.000] Another thing we're also discussing concerning this is: [22:21.000 --> 22:24.000] I've shown you that I've migrated a container with my [22:24.000 --> 22:27.000] own private OCI image "standard", [22:27.000 --> 22:30.000] which is the thing which I came up with; [22:30.000 --> 22:33.000] it's a tar file with some metadata in it, [22:33.000 --> 22:36.000] but we would like to have it standardized, so that [22:36.000 --> 22:39.000] other container engines can use [22:39.000 --> 22:42.000] that information, the standard, and not the thing I came up with, [22:42.000 --> 22:45.000] which just felt like the right thing to do. [22:45.000 --> 22:48.000] So this is the place where the standardization discussion is going on; [22:48.000 --> 22:51.000] it's not going on really fast [22:51.000 --> 22:54.000] or anything like that, but yeah, I guess that's how [22:54.000 --> 22:57.000] creating a standard works. And with this [22:57.000 --> 23:00.000] I'm at the end of my talk. The summary is basically: [23:00.000 --> 23:03.000] CRIU can checkpoint and restore containers, it's integrated [23:03.000 --> 23:06.000] into different container engines, it's used in production; [23:06.000 --> 23:09.000] use cases are things like reboot into a new kernel [23:09.000 --> 23:12.000] without losing container state, start multiple copies [23:12.000 --> 23:15.000] quickly, migrate running containers, the new spot instances use case [23:15.000 --> 23:18.000] I've been asked about; this has all been done under the [23:18.000 --> 23:21.000] forensic container checkpointing Kubernetes enhancement proposal, [23:21.000 --> 23:24.000] and currently we [23:24.000 --> 23:27.000] trick Kubernetes into restoring a container [23:27.000 --> 23:30.000] by using create and start without letting [23:30.000 --> 23:33.000] Kubernetes know that it's a checkpoint. [23:33.000 --> 23:36.000] And with this I'm at the end, thanks for your time, [23:36.000 --> 23:39.000] and I guess, questions. [23:44.000 --> 23:47.000] We have time for questions. [23:48.000 --> 23:51.000] I have two questions, the first one is, [23:51.000 --> 23:54.000] how are... [23:54.000 --> 23:58.000] One second please, stay quiet until the talk is over. [23:58.000 --> 24:01.000] So, two questions: how are network connections [24:01.000 --> 24:04.000] handled when the containers are restored, [24:04.000 --> 24:07.000] and the other question is, does CRIU support some kind of optimization [24:07.000 --> 24:10.000] like incremental checkpoints?
[24:10.000 --> 24:13.000] So the first question was about network connections. [24:13.000 --> 24:16.000] CRIU can checkpoint and restore [24:16.000 --> 24:19.000] established TCP connections; [24:19.000 --> 24:22.000] established is the interesting case, because if they're just open [24:22.000 --> 24:25.000] and listening it's not really a difficult thing to do, [24:25.000 --> 24:28.000] but it can restore established TCP connections. [24:28.000 --> 24:31.000] I'm not sure how important that is [24:31.000 --> 24:34.000] in the case of Kubernetes, because [24:34.000 --> 24:37.000] if you migrate, maybe you migrate [24:37.000 --> 24:40.000] to some other cluster or somewhere else, [24:40.000 --> 24:43.000] maybe the network is set up differently, [24:43.000 --> 24:46.000] and you can only restore a TCP connection if [24:46.000 --> 24:49.000] both IP addresses of the connection are the same, [24:49.000 --> 24:52.000] and it only really makes sense for live migration, [24:52.000 --> 24:55.000] because at some point the TCP timers will time out anyway. [24:55.000 --> 24:58.000] But I think maybe it would make sense [24:58.000 --> 25:01.000] if you migrate a pod and keep the TCP connections [25:01.000 --> 25:04.000] between the containers in the pod alive; [25:04.000 --> 25:07.000] then it would make sense. It's technically possible, [25:07.000 --> 25:10.000] I'm just not sure how important it is [25:10.000 --> 25:13.000] for external connections, [25:13.000 --> 25:16.000] but for internal connections it makes sense. [25:16.000 --> 25:19.000] The other question was about optimization: [25:19.000 --> 25:22.000] CRIU itself supports pre-copy and post-copy [25:22.000 --> 25:25.000] migration techniques, just like VMs, [25:25.000 --> 25:28.000] so you can take a copy of the memory, [25:28.000 --> 25:31.000] move it to the destination, and then just transfer the [25:31.000 --> 25:34.000] diff at the end, or you can restore immediately and take page faults [25:34.000 --> 25:37.000] on missing pages, and the missing pages [25:37.000 --> 25:40.000] are then fetched during runtime, [25:40.000 --> 25:43.000] so this is all just like QEMU does it, [25:43.000 --> 25:46.000] the technology is the same, [25:46.000 --> 25:49.000] but it's not integrated into Kubernetes at all.
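For reference, a rough sketch of what the pre-copy flow looks like when driven by hand with Podman; the option names are to the best of my knowledge and should be checked against your Podman and CRIU versions, and the container name and paths are placeholders.

```bash
# First pass: dump only the memory pages and leave the container running (pre-copy).
podman container checkpoint --pre-checkpoint --export=/tmp/pre.tar.gz counter

# Final pass: dump everything, reusing the earlier memory dump as a baseline.
podman container checkpoint --with-previous --export=/tmp/final.tar.gz counter

# On the destination, restore from both archives.
podman container restore --import=/tmp/final.tar.gz --import-previous=/tmp/pre.tar.gz
```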
[25:52.000 --> 25:55.000] Technically it's possible; [25:55.000 --> 25:58.000] in Podman we can do this. [25:58.000 --> 26:01.000] The only thing is you have to decide [26:01.000 --> 26:04.000] whether a given checkpoint is an incremental checkpoint [26:04.000 --> 26:07.000] or not, because the checkpoint looks different: [26:07.000 --> 26:10.000] if we know it's an incremental checkpoint, [26:10.000 --> 26:13.000] only the memory pages are dumped, [26:13.000 --> 26:16.000] and if it's the final checkpoint [26:16.000 --> 26:19.000] we have to dump everything, [26:19.000 --> 26:22.000] and if at the first checkpoint you say [26:22.000 --> 26:25.000] it's the final checkpoint, you cannot do an incremental checkpoint [26:25.000 --> 26:28.000] on top of that one. [26:28.000 --> 26:31.000] Very impressive thing. [26:31.000 --> 26:34.000] Except the network, what else do you know [26:34.000 --> 26:37.000] will not be possible to migrate? [26:37.000 --> 26:40.000] I'm impressed by this thing; [26:40.000 --> 26:43.000] except the network, you mentioned, [26:43.000 --> 26:46.000] is there something else that cannot be checkpointed? [26:46.000 --> 26:49.000] So the main problem is external hardware [26:49.000 --> 26:52.000] like InfiniBand, GPUs, FPGAs, [26:52.000 --> 26:55.000] because there's state in the hardware [26:55.000 --> 26:58.000] and we cannot get it out. [26:58.000 --> 27:01.000] Two years ago AMD actually provided a plugin [27:01.000 --> 27:04.000] for CRIU to get the state out of their [27:04.000 --> 27:07.000] GPGPUs, so CRIU [27:07.000 --> 27:10.000] should be able to checkpoint [27:10.000 --> 27:13.000] and restore processes using AMD GPUs. [27:13.000 --> 27:16.000] I never used it myself, [27:16.000 --> 27:19.000] I don't have one, but they implemented it, [27:19.000 --> 27:22.000] so I assume it's working. [27:22.000 --> 27:25.000] So everything that is external hardware, [27:25.000 --> 27:28.000] where you don't have the state in the kernel, [27:28.000 --> 27:31.000] that's the main limitation. [27:31.000 --> 27:34.000] Hi, thank you for this. [27:34.000 --> 27:37.000] You said there's parasite code; does that mean it changes the container hash, [27:37.000 --> 27:40.000] and how do you propose to secure them again [27:40.000 --> 27:43.000] and make sure that it's your parasite code [27:43.000 --> 27:46.000] and not somebody else's? [27:46.000 --> 27:49.000] I didn't get it 100%, [27:49.000 --> 27:52.000] something about container hashes and making sure it's... [27:52.000 --> 27:55.000] I think the worry is that if you inject parasite code, [27:55.000 --> 27:58.000] the container hash has changed somehow. [27:58.000 --> 28:01.000] It doesn't. [28:01.000 --> 28:04.000] It doesn't. [28:04.000 --> 28:07.000] It doesn't change the container hash; [28:07.000 --> 28:10.000] the parasite code is removed afterwards, so it's... [28:10.000 --> 28:13.000] Okay, thank you. [28:16.000 --> 28:19.000] Thank you, excellent talk. [28:19.000 --> 28:22.000] How big are the images — the size of the process memory used, [28:22.000 --> 28:26.000] or the total memory allocated by the process? [28:26.000 --> 28:29.000] I can't hear anything here at the front. [28:29.000 --> 28:34.000] How big are the images that you restore? [28:34.000 --> 28:37.000] Exactly, so the size of the checkpoint [28:37.000 --> 28:40.000] is basically the size of all the memory pages [28:40.000 --> 28:43.000] which we dump; all the additional information [28:43.000 --> 28:46.000] which CRIU is dumping is really small compared to that, [28:46.000 --> 28:49.000] and then it depends:
[28:49.000 --> 28:52.000] in Podman or Docker, if you do a diff, [28:52.000 --> 28:55.000] you usually see which files changed in the container compared to your base image, [28:55.000 --> 28:58.000] and this comes on top of it: all files which changed [28:58.000 --> 29:02.000] we include completely in the checkpoint; [29:02.000 --> 29:05.000] the rest we don't include. [29:05.000 --> 29:08.000] While I'm bringing the mic over there: [29:08.000 --> 29:11.000] has anything changed in terms of how complex [29:11.000 --> 29:14.000] the process trees you can restore are? Because we're thinking about, [29:14.000 --> 29:18.000] we discussed using it for systemd services, for example. [29:18.000 --> 29:21.000] For you, [29:21.000 --> 29:24.000] one of the limitations that you usually had is, as soon as you run something [29:24.000 --> 29:27.000] fairly complex inside of the container and you try to [29:27.000 --> 29:30.000] checkpoint/restore it with CRIU, it would just fail [29:30.000 --> 29:33.000] because it would use kernel features that it wouldn't support. [29:33.000 --> 29:36.000] So the biggest problem we're currently seeing [29:36.000 --> 29:39.000] is containers using systemd, because systemd is very advanced; [29:39.000 --> 29:42.000] it uses things nobody else uses, [29:42.000 --> 29:45.000] so this is the point where CRIU might fail, [29:45.000 --> 29:48.000] because it seems like, at least from my point of view [29:48.000 --> 29:51.000] or from what I've seen, nobody uses as many [29:51.000 --> 29:54.000] new kernel features as systemd does, [29:54.000 --> 29:57.000] so it sometimes fails [29:57.000 --> 30:00.000] if systemd is running there, [30:00.000 --> 30:03.000] but I don't often see people in the OCI container [30:03.000 --> 30:06.000] world using systemd. [30:06.000 --> 30:08.000] I guess it would be a good idea to have a real [30:08.000 --> 30:11.000] init system even in your container, but it's not something people do, [30:11.000 --> 30:14.000] so it's not something we get [30:14.000 --> 30:17.000] complaints about at all. [30:17.000 --> 30:21.000] I also thought this talk was very interesting. [30:21.000 --> 30:24.000] So I saw that you [30:24.000 --> 30:27.000] talked about having these [30:27.000 --> 30:30.000] kubectl migrate and kubectl [30:30.000 --> 30:33.000] checkpoint commands, [30:33.000 --> 30:36.000] because I'm thinking that mostly what you want to migrate [30:36.000 --> 30:39.000] might be a stateful application, [30:39.000 --> 30:42.000] for example a StatefulSet or whatever it's called, [30:42.000 --> 30:45.000] so I was thinking maybe you could have [30:45.000 --> 30:48.000] something in the StatefulSet, [30:48.000 --> 30:51.000] the stateful deployment, whatever it's called, instead of, [30:51.000 --> 30:54.000] say, when you want to drain a node. [30:54.000 --> 30:57.000] Actually, one of the first implementations I did [30:57.000 --> 31:00.000] was using drain: [31:00.000 --> 31:03.000] I added an option to kubectl drain which does checkpoints, [31:03.000 --> 31:06.000] so all containers were checkpointed during drain [31:06.000 --> 31:09.000] and then they were restored during [31:09.000 --> 31:12.000] boot-up. [31:12.000 --> 31:15.000] Okay. [31:15.000 --> 31:18.000] Sorry for being the buzzkill, but we're out of time. [31:18.000 --> 31:21.000] Thank you for the talk, that was really interesting, [31:21.000 --> 31:46.000] and thank you everyone for attending and being so quiet during the questions.