Okay, we are ready to go. Thanks, everybody. Thanks for sticking around till the end today, and a special shout out to those of you on the live stream as well. My name is Adam Litke and this is Nir Soffer, and today we're going to get our money's worth out of this laptop. Something is not right here, I keep flipping on and off. Let's see, I'll do my best here.

So we've come a long way since Linus introduced Linux to the world back in 1991. What started off on his personal computer is deployed pretty much everywhere these days, in increasingly complex scenarios. Take Kubernetes, for example: everyone's favorite clustered container orchestrator, which runs open source up and down the entire stack, from the cluster nodes to the kubelet to the container runtime itself. And developers haven't stopped building on top of it. KubeVirt is a Kubernetes add-on that allows you to seamlessly run your virtual machines in Kubernetes. And since the VMs are running in pods, like any other workload on Kubernetes, they integrate really well with whatever else is deployed there, be it your applications, storage, networking, monitoring, et cetera.

As people continue to deploy Kubernetes and KubeVirt to host their mission-critical workloads, naturally they wonder what will happen when disaster strikes. Disaster recovery software exists to protect your data and return to normal operations as quickly as possible, and this is typically achieved using redundancy. Data can be replicated from primary environments to secondary environments, and applications, including virtual machines, can be started on the secondary environment at a moment's notice, should that be required. In this particular scenario, we have a primary data center DR1 in the west, a secondary data center DR2 in the east, and a hub cluster located somewhere in between. Now, we prefer to run our applications on the primary environment because it's closer to where our customers are, but thanks to continuous data replication we can rest easy knowing we can start the application up on DR2 when required.

Ramen DR is software that enables disaster recovery for multi-cluster Kubernetes environments. It does this by working with the storage to enable data replication according to a DR policy set by the administrator, and it talks with Open Cluster Management to manage application placement, failover, and relocation flows. Today we're going to simulate this disaster for you. We're going to start by disabling the primary environment. We can then fail over our virtual machine to the secondary environment. And I just want to note here that failover is different from live migration, because live migration would require both environments to be up, and in this specific scenario we obviously don't have access to DR1. So failover is going to take a couple of minutes, but we can be confident that the app can start back up on the secondary environment with minimal data loss.

So I've been introducing a bunch of different components here; that's quite the menu of open source ingredients. KubeVirt is an operator-managed Kubernetes deployment which packages libvirt and QEMU into a container image, allowing you to run your virtual machines inside of a pod. It also comes with other utilities to help you manage your virtual machine storage and networking requirements. Rook is software that manages and integrates Ceph storage into the Kubernetes platform.
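For those curious what "a DR policy set by the administrator" actually looks like, it is a small resource on the hub that pairs the two managed clusters and sets the replication interval. A rough sketch follows; the resource name, cluster names, and one-minute interval are just examples, and the API group and field names are from memory, so check the Ramen documentation before relying on them:

  # Minimal sketch of a Ramen DRPolicy applied to the hub cluster.
  # Names, interval, and schema details are illustrative.
  cat <<'EOF' | kubectl apply --context hub -f -
  apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPolicy
  metadata:
    name: dr-policy
  spec:
    drClusters: ["dr1", "dr2"]
    schedulingInterval: 1m
  EOF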
Open Cluster Management stitches together multiple Kubernetes clusters and provides application management, placement, and scheduling. And then Ramen DR adds those DR flows on top of Open Cluster Management. So when we're considering a realistic multi-cluster DR environment, it's a beautiful thing, kind of like this bowl of ramen here to tempt you at dinner time. However, it's also complicated and expensive to operate, especially when we consider the single-developer use case. So the question we're trying to answer here is: how can we enable development on this open source software stack without huge cloud budgets? And our answer is to scale down that environment so that it can run inside the kind of laptop that most of us are carrying around with us each day. And Nir has prepared a live demo right on this laptop that you're looking at that's going to show all this stuff working together, and we're going to simulate that disaster for you. So take it away.

Yep. And I'm going to mute it so we don't annoy the live stream people. Okay. Put that in your pocket. Yep. Okay. So this is our stack, right? Three clusters: two identical managed clusters, and the hub running Ramen. And we are going to put it inside this laptop, to see that we can do it, because laptops are small and cheap. So what we want to do today is take three data centers, with Ramen and KubeVirt and a lot of other components, spanning a large part of Europe, and stuff everything inside this laptop. Now, a note about this environment: the clusters are all in the same laptop, but they behave like remote clusters in different regions, and each cluster is standalone with its own storage.

So how can we prepare this laptop for the demo? I have a pack of instant Ramen DR, which is very easy to use. You run one command, drenv start, with the environment of your flavor, in this case a KubeVirt environment, and then you let it cook for 10 minutes until everything is ready. We are not going to wait 10 minutes now because it's a little boring; I prepared the environment before the talk and we'll just use it. The other thing we need is a Git repo, because we're going to use GitOps. We will give OCM a Git repo to pull the VM resources from and deploy the application. So we use Adam's repo, ocm-kubevirt-samples, and I forked it to customize the VM with my SSH public key.

So let's jump into the demo and increase the font size a bit. I'm using a little tool to spare you my typing errors and make everything more colorful. First, look at what we have in this laptop. We have three clusters. DR1 is the primary cluster where we run the VM. DR2 is the secondary cluster, in case something bad happens to DR1. And something bad will happen, don't tell anyone. And the hub is orchestrating everything and controlling the other clusters. Now, each of these is a libvirt VM inside the laptop. So let's look inside one of the clusters. We can use kubectl; these are just normal Kubernetes clusters. And we see that we have a lot of stuff that drenv installed for us. The most important parts for the demo are the KubeVirt components that will run the VM, and CDI, which will provision the VM disk from a container image. Of course, we also need storage, so we have a complete Rook Ceph system inside, using the local disk of the cluster, and this will provide storage for the VM disk and volume replication between the clusters. And to protect the VM, we have the Ramen DR cluster operator, which orchestrates the DR flows. And finally, we need the Open Cluster Management components that let Ramen control the clusters.
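For those following along at home, the whole preparation boils down to roughly the following; the environment file path is my best recollection of the ramen repository layout, so treat it as illustrative:

  # from a clone of the ramen repository, inside its test directory
  # (environment file name is illustrative; pick the KubeVirt flavor you want)
  drenv start envs/regional-dr.yaml

  # after about 10 minutes the three clusters show up as ordinary kubectl contexts
  kubectl get nodes --context dr1
  kubectl get deploy -A --context dr1   # KubeVirt, CDI, Rook Ceph, Ramen, OCM agents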
Because Ramen extends Open Cluster Management and depends on it. So let's look inside the Git repo. I'm running this inside a clone of the Git repo from earlier. The important parts in this repo for the demo are the VM directory, vm-standalone-pvc, which is a VM optimized for this environment; the subscription directory, which has the OCM resources for the VM; and the dr directory, which has the Ramen DR resources.

So let's look first at the VM. We have a quick look to see what we have there; we will not look inside the YAMLs, you can check the Git repo later. We have a VM configuration. This VM is using a PVC, because we are demonstrating a PVC-based VM, so we have this PVC here. And we need to provision the PVC somehow, so we have the source YAML, which tells CDI how to provision the PVC disk. We could apply this kustomization to cluster DR1, and this would start the VM on cluster DR1, but we are not going to do it, because nobody would protect this VM. It's just like a pod that you start: if it goes down, nobody brings it back.

So what we want to do is create an OCM application. We will use a subscription-based application. These resources tell OCM how to create and manage the application: which cluster set to use, where the Git repo is, the namespace the VM runs in, where to place the VM, and the subscription ties everything together. So to start the VM, we apply this kustomization to the hub. Everything is done on the hub, and then OCM, and Ramen later, will do the right thing. At this point OCM is starting the VM on cluster DR1 and we can watch it, using kubectl to get the VM, VMI, pod, and PVC. We can see that the PVC is already bound, and virt-launcher is running and we have an IP address, so our VM is running.

But let's inspect the VM a little bit more to see where our disk is. We can use the rook-ceph kubectl plugin to look at the Ceph layer, and we can run the rbd du command, in this case on cluster DR1. And we see that we have an RBD image created for our VM. If we look inside the PVC, we will find this volume there. So if something bad happened to cluster DR1, we would lose the running VM and the disk. This image would be gone and we would lose all the data. So how can we prevent that? We want to protect this VM with Ramen.

To do this, we must first tell OCM that Ramen is going to take over this VM and OCM should not change it. We do this by annotating the placement with a special annotation, and at this point Ramen can take over. So how do we protect this with Ramen? We need to apply the Ramen resources. Basically it's one resource, a DRPC. The DRPC tells Ramen how to find the application, how to protect it, which PVCs should be protected, and what the policy is. We are not going to look inside now; you can check the Git repo later. So to protect the VM, we apply this kustomization, again on the hub, and then Ramen will do the right thing on the right cluster.

Once we did it, our VM is protected by Ramen and we can watch it again. This time I'm also watching the VRG and VR resources. The VRG is the volume replication group; we have one such resource per protected application. And the VR is the volume replication resource that controls the replication for each PVC, so we have one volume replication for every PVC. Now both of them are ready and primary. Primary means this is the primary cluster, replicating data to the secondary cluster. So what does it mean that we replicate data to the other cluster? If you look again at the RBD images, if you remember, we have seen that we have an RBD image on the primary cluster.
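Just to recap what we typed so far, the deploy-and-protect flow is only a handful of commands. The namespace and placement name below are placeholders, the subscription and dr directory names come from the sample repo, and the annotation key is from memory, so double-check against the OCM and Ramen docs:

  # deploy the VM as an OCM subscription application (everything goes to the hub)
  kubectl apply -k subscription --context hub

  # watch the VM come up on the primary cluster
  kubectl get vm,vmi,pod,pvc -n <namespace> --context dr1 --watch

  # hand placement control over to Ramen (annotation key as I recall it)
  kubectl annotate placement <placement-name> -n <namespace> --context hub \
      cluster.open-cluster-management.io/experimental-scheduling-disable=true

  # protect the VM with Ramen and watch the replication resources
  kubectl apply -k dr --context hub
  kubectl get vrg,vr -n <namespace> --context dr1 --watch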
If we run the same command on the secondary cluster, this time on context DR2, we will find that we have an image on the primary cluster and the same image on the secondary cluster. So what's going on under the cover is that when Ramen enables volume replication, a secondary replica of the image is created on the secondary cluster, and the RBD mirror daemon starts replicating writes from the image on the primary cluster to the image on the secondary cluster. So if something bad happens to cluster DR1, we can use the secondary image to start the VM at the time of the last replication.

The last thing to show about the VM is that we have a logger inside, updating a log file every 10 seconds. We can access the log file using virtctl ssh. We just run this command to see the start of the log, and we see the line where the service was started, and then we see one line every 10 seconds. This will help us verify later, when we recover from a disaster, that we got the right data from the disk.

So now we are ready for everything. Let's try to create a disaster. One thing that is easy to do on the laptop is to suspend cluster DR1. If you remember, this is a libvirt VM, so we can just suspend it. Now everything running there is stopped. So let's try to access the VM again with virtctl ssh. Let's try to tail the log and see if it works. Well, it does not seem to work, because of course we suspended the cluster, so nothing there is accessible. If we had an important service on this VM, our users would not be happy now. So how can we fix this? Adam, do we have any idea? I was hoping you would tell us.

Yes, so because our VM is replicated, we can just fail over to the other cluster quickly. How do we fail over? If you remember that we installed the DRPC, we can patch the DRPC: we set the action to failover and we set the failover cluster. And once we did it, Ramen starts the failover, and we can start watching the failover on the other cluster. I'm running this on the DR2 cluster because DR1 is not accessible. And we see that we have a PVC, and it's pending; we have a volume replication group; we have a volume replication; but the volume replication group is not primary yet. It will take a while until the VM is started.

So while we wait for it, let's understand what's going on under the cover. The RBD image on the secondary cluster was a replica, pulling data from the image on the primary cluster. Ramen has to stop this replication and promote it to a primary image that will replicate to the other cluster. Once this is done, the VRG will be marked as primary, and it should happen any second. At that point Ramen will change the application placement. It just became primary. So now Ramen changed the placement of the application, and OCM will see the change and redeploy the VM on the secondary cluster using the subscription. And this should happen any second now. When OCM deploys the application, it will reuse the PVC that Ramen has restored and connected to the right RBD image. And it just happened. We see that virt-launcher is running, the VM is up, and we have an IP address. So if we had that important service on the VM, the service should be up again, and our users can use it again.
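To recap that part, the disaster and the failover were again only a couple of commands. The libvirt domain name, the namespace, and the DRPC name below are placeholders, and the DRPC field names are from memory, so check the Ramen docs before relying on them:

  # "create" the disaster: suspend the libvirt VM backing the primary cluster
  virsh -c qemu:///system suspend dr1   # domain name assumed to match the cluster name

  # ask Ramen to fail the application over to DR2 (patched on the hub)
  kubectl patch drpc <drpc-name> -n <namespace> --context hub --type merge \
      -p '{"spec":{"action":"Failover","failoverCluster":"dr2"}}'

  # watch the VM come back up on the secondary cluster
  kubectl get vm,vmi,pod,pvc,vrg,vr -n <namespace> --context dr2 --watch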
But how do we know that we got the right data? Maybe we just got a new application with an empty disk. Let's check the disk. Again, we can use the logger. We just dump the entire log using virtctl ssh, this time connecting to the cluster DR2. And we see all the logs from the VM that ran on cluster DR1 until we created the disaster, and we see the new logs from when the VM started to run on cluster DR2. Note that we have a gap here between the last line logged when the VM was running on DR1 and the first line logged on DR2, which is about three minutes in this case. This gap depends on how fast we detect that there was an issue with the cluster and start the failover. So we had a short downtime, the VM is running, and we got all the data. Looks like a successful failover.

So what's next? In a real cluster, you would try to fix the failed cluster and recover it; maybe you need to reinstall it. At that point you can relocate the VM, or maybe later, during some maintenance window, you will relocate the VM back to the primary cluster. In this demo, we are done, and it's time for questions. The first three questions will get instant ramen. Go ahead.

The question is: what about the IP address of the virtual machine? You were paying attention and noted that it changed. So what would you suggest? So, Ramen does not do anything about the IP address. I think in a real application you will have some load balancer to make sure that you can switch from one cluster to the other cluster, probably using DNS, because you have a nice name in DNS. But basically, you will do what virtctl is doing when you connect to the VM: you use the VM name and the namespace, and you have to find the actual address.

Next question: very nice demo. What do I need to run it at home? I don't have any cloud budget. Is 16G enough? So the question is what we need to run it at home. First, you can try it at home, but you need a big enough laptop. I think for the KubeVirt environment you need something like 32G, because we have two clusters with 8GB each; maybe you can trim them down a little bit, but 16 will be too low. And maybe you need a Linux laptop. Fedora would be easier, because this is what we use, but it should work on anything.

A follow-up question: can I use two laptops, one for the primary and one for disaster recovery, let's say an old laptop? Can you repeat it? I didn't hear the question exactly. Your presentation runs on one laptop; can I use the solution with two laptops? So the question is, can we use different machines for this? You can, but it will be harder, because you need to set up the network between them. On the same laptop it's much easier, because minikube handles most of the stuff for you. If you use different machines, you need to make sure that the clusters are accessible to each other, so it will be harder.

I've got one over here. Is it required to use Ceph, or can you use another storage system? The question is, do we have to use Ceph? Currently we work with Ceph, so it's optimized for Ceph and it works, and we have a complete tool that sets it up and configures it. If you want to use something else, we have support for other storage, basically, but it doesn't work on plain Kubernetes yet; it mostly works on OpenShift. It needs more work.

Next question: the primary site is down. Is there any mechanism to prevent the machine from starting again by mistake, by itself? The question was, once the cluster is down, do we have any protection that the virtual machine will not start again on the same cluster? So we don't have any protection at the Ramen level, because Ceph is protecting us.
If the same VM starts again, let's say we resume the cluster and the application is still running there, then it will continue to run and try to write to the disk, which is fine, because the disk is not replicated at this point: the image on the destination cluster was promoted, so it is no longer pulling data from the old primary. Usually the application will just fail or go into some error state, and in a real deployment, when Ramen detects that it should not run, it will shut it down. So it's safe.

There is one more question, just because it's the end of the day. Just one more question: what happens when the hub that was controlling the two data centers goes down? The question was, what happens when the hub goes down? Very good question. In a real setup, in OpenShift, you have a hub recovery setup, so actually you need two hubs, one passive hub and one active hub, and there is a lot of setup to back up the hub and restore it. But for testing it doesn't matter. And also, hopefully you're not running customer-visible or end-user-visible workloads on the hub, so if it goes down you can repair it, and it won't be quite as urgent a disaster, hopefully the other sites don't fail at the same time. All right, thanks everybody for coming. What a good question.