Good morning everyone. Thank you for joining this talk. My name is Niels de Vos. I am presenting instead of Ria; Ria was supposed to travel here, but her trip got cancelled, so she gave me her slides. I am one of the organizers of the devroom together with Jan, so instead of having an empty slot, I am presenting her talk. Hopefully I know enough about the topic to present it well enough. If you have questions, don't hesitate to interrupt; I don't mind answering questions immediately.

Ria and I are actually colleagues. We both work mostly on Ceph-CSI, but also on anything that touches it. That includes Rook, and it includes other Ceph components where we need it. We make sure that Kubernetes or OpenShift works well with Ceph, that it is performant enough, and, in this case, that your data integrity is guaranteed as best as we can.

Today we focus on CephFS. CephFS is a scale-out file system built on top of Ceph, and I hope most of you are familiar with Ceph. It is quite a nice network file system. We use it for lots of different deployments, and our customers use it for whatever applications they have. For example, the Kubernetes image registry, if you want to host it locally on your Kubernetes cluster, is a very common use case for CephFS, but it is not limited to container images. You can run basically any workload on top of it.

As software engineers we always face a lot of challenges, and for those challenges we hopefully find nice approaches. One of these challenges is that in a Kubernetes cluster a node can go down: either a virtual machine that hosts your Kubernetes worker node, or a bare-metal machine that runs all the Kubernetes services. It can go down. That just happens; it is a fact of life. You can't prevent it. What you can do is accommodate your environment or your infrastructure as best as possible to work around these kinds of hiccups.

When a node goes down, it isn't guaranteed that the whole node goes down; there are also partial failures. One of the easiest failures to imagine is on a worker node that runs the kubelet as part of Kubernetes. The kubelet daemon is responsible for mounting volumes, mounting file systems, starting containers and whatnot. If this kubelet daemon has a bug of some kind and fails to respond to any requests from the Kubernetes control plane, the node is perceived as down as far as the Kubernetes infrastructure is concerned. However, that doesn't mean the node actually is down; it's only the kubelet service that is down. If this is the case, then all your containers might still be running. If you have a database running in a container, this database might be happily running, reading and writing, accepting connections and whatnot. It might still function and write your data to a file system, CephFS in this case.

But if Kubernetes thinks the node is down, Kubernetes will schedule this workload, your database, on a different worker node. With CephFS it is possible to mount your file system on multiple nodes at the same time; it's a network file system, so it's prepared to do that. For the database that now runs on the second node this looks very nice: Kubernetes thinks the service is up and running and everything is fine. But now you suddenly have two databases running on the same file system, with the same directories and the same files.
One on the old node that is perceived as down or broken because the kubelet isn't running there, and the other on a node where the kubelet is running happily, where your container has been started anew and is running as well.

Now this can cause a lot of issues. In the case of a database, you can almost assume that the application uses file locking correctly and really doesn't want to corrupt your data if another database instance is running on the same file system. In that case you are relatively safe. However, the database instance that was just started on the second node will not get your file system locks. So that database will not be able to write; Kubernetes might announce that this database is the master database and should be able to write, but it can't, because the old one is still running. In that case you have a partial outage. That's the good case.

The bad case is that your application doesn't use locks. Then the old instance is still writing data, the newly started second instance is writing data as well, and they start to overwrite each other's data. That would be horrible, because the application doesn't immediately notice and your data is now destroyed. It's corrupted because it's not in sync anymore. You don't know the state of your data, and this is a really bad situation.

This is not a problem inherent to CephFS. It's network file systems in general, combined with applications that don't use locks, that can corrupt data in this way. It's not even restricted to file systems: if you use block storage, like an iSCSI disk attached to multiple servers at the same time, and you run a virtual machine on top of it, you actually have the same kind of issues.

So that's the challenge. The resolution, or one of the resolutions that we have with CephFS as a network file system, is that we can fence off broken nodes: network fencing. On the storage layer you have storage fencing; if you have a SAN, you can fence servers at the HBA level in your storage fabric. With network file systems you can do network fencing. You can see it as setting up a firewall between the broken server and the actual Ceph cluster.

For Kubernetes, we wrote a NetworkFence. It's part of an operator. NetworkFence is a CRD, so it's a Kubernetes object that describes what you want to fence: the IP addresses, in this case CIDRs, so the networks that you want to fence. You can have a list of networks in case you have multiple nodes, or multiple networks on a node, and whatnot. These can be fenced by creating this custom resource. Once you create the custom resource, our operator will see that it exists, will see the state that you requested, and will start doing the fencing. That's the eventual goal.

This is one of the CRs, just an example. CSI-Addons is a project that we use for enhancing the CSI specification. CSI is the Container Storage Interface specification; that's the specification used by storage drivers that lets Kubernetes mount a particular volume, create new volumes, and so on. This storage interface is rather limited, unfortunately, and it's not trivial to extend. So we have an add-on project, basically, that adds more storage operations that are not necessarily basic storage operations but more advanced or management operations, like fencing nodes from the network. What you do is you create the NetworkFence CR. It needs to have a name; that's just the standard thing.
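To make this concrete, here is a minimal sketch of what such a NetworkFence CR could look like. This is an assumption-heavy example, not the exact slide: the apiVersion, driver name, secret name, and clusterID all depend on your deployment and on the installed CSI-Addons version, so check the CRD and examples shipped with your operator.

```sh
# Minimal sketch of a NetworkFence CR (field names follow the CSI-Addons
# NetworkFence API; the driver/secret/clusterID values below are assumptions
# that depend on how Rook deployed the CephFS CSI driver in your cluster).
kubectl apply -f - <<'EOF'
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: fence-broken-worker
spec:
  driver: rook-ceph.cephfs.csi.ceph.com   # CSI driver whose volumes must be fenced
  fenceState: Fenced                      # change to Unfenced later to lift the fence
  cidrs:
    - 10.90.89.66/32                      # a single worker node
    - 10.90.90.0/24                       # or a whole network, e.g. one data center
  secret:
    name: rook-csi-cephfs-provisioner     # Ceph credentials used to talk to the cluster
    namespace: rook-ceph
  parameters:
    clusterID: rook-ceph
EOF
```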
You specify which driver you want to fence, or rather which driver is used for the volumes; in this case it's CephFS. We use Rook Ceph in many deployments because it's just extremely easy to deploy your Ceph cluster with Rook. You list which IP addresses you want: this can be a single IP address, or a whole network that you want to fence. Possibly you can even use it in multi-cluster environments. If you have a single Ceph cluster and one data center and another data center, and you want to fence off one whole data center, you can also use it this way: you could fence your whole Kubernetes cluster in data center one and then make sure that data center number two is the only one accessing your storage. And you pass it a secret, because if you want to talk to Ceph, you need your credentials. That's it, mostly.

Now, the previous one is the manual example. It works nicely, but you still have to do it as an administrator. As an administrator, you need to know that a pod or a system started to fail, and then you have to figure out all these kinds of details, which is rather annoying.

[Audience] It's already broken before you can actually fence it, if you can just reschedule it in a couple of seconds.

Yes, exactly. The remark is that it's already broken if you have to fence it. You might have a little bit of time, depending on your application and so on, but yes, you are in a stressful situation because something broke in your environment, and you don't want to hastily put some YAML together and hope that you have the right IP addresses. If you use the wrong IP addresses, you might break your cluster even more. So this is something you can do manually, but it's tricky. Definitely, if you're in a stressful situation with an outage or a partial outage, you really don't want to bother with this.

So the enhancement for the network fencing that we have is that we include support in Rook. Rook already supports a form of network fencing for Ceph RBD, which uses the same kind of YAML. For CephFS it's new, and it requires a little bit of a twist, slightly different commands than, for example, RBD. RBD is the image storage, so a block device, similar to iSCSI, and RBD mainly talks to the OSDs. There is no metadata server, so the OSDs are what you want to communicate with, and you have only a single service type that you need to block access to. With CephFS you also have the MDS, the metadata server, and this is the server that keeps, for example, the file system locks. So in the first example, where you have two databases running and the database is smart enough to use file system locks: even if you network fence the broken node, that broken node still held the file system locks. The second node doesn't immediately get those locks when the first node goes down completely; there will be a timeout, and depending on the kind of issue that you had, it might take a while. So for CephFS we actually evict CephFS clients, which is a little bit different from blocklisting a client on the OSDs. That's the background information.

For Rook it is relatively simple to detect if a node is out. Rook doesn't have its own logic to verify whether a node is working correctly; there are dedicated projects for that. A project like Medik8s is specialized in detecting the health of a node, and if something goes wrong, Medik8s can set a taint on the node. You can also set a taint manually as an administrator; in this case you put the out-of-service taint on the node, for example with a command like the one shown below.
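As a sketch, this is roughly what applying that taint by hand could look like. The node name is a placeholder, and the taint value and effect shown here follow the usual Kubernetes out-of-service convention, so check what your Rook version expects.

```sh
# Mark a broken worker node as out of service so the fencing automation can react.
# "worker-2" is a placeholder node name.
kubectl taint nodes worker-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remove the taint again once the node has been repaired.
kubectl taint nodes worker-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```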
When Rook notices that a node is tainted as out of service, it will start the network fencing for the drivers of the volumes that are used on that particular node. So if your node is not using a CephFS volume, Rook is not going to network fence it for CephFS, because that storage might still be in use elsewhere; ideally, another node will be running the second database instance that should work correctly.

This is then what you would see. If you added the taint, or if you manually created this YAML, this is what is shown when you check the state of the NetworkFence objects: this is the host name and whatnot, this is the IP address that is fenced, in this case just a single one, and it was fenced 20 seconds ago. In this case the address was blocklisted on the OSDs and the client was evicted from the MDSs. So another client can resume operation with the MDS: it can obtain file locks, and it can talk to the OSDs as usual. If you recover the node, you can edit the NetworkFence CR and mark it as unfenced.

Yes, sure? [Audience] Is the network fence always a blocklist? So the question is whether the network fence is always a blocklist, and that is right. Usually everything is allowed, so the whole cluster, the whole Kubernetes cluster in this case, is allowed to communicate with the Ceph cluster. We call it a blocklist now, but yes, it's a blocklist. You can have an allow list, but that's less practical. If you want to secure an environment, then that's probably something you want to do, but for network fencing we only do the blocklisting.

[Audience] So there are basically two operations at the same time: evicting the client and also doing the network fence? Yes, so the question is whether, when network fencing is done with CephFS, there are two operations, and that's true. The first operation is to evict the client. If you evict on the Ceph level, it actually already does the blocklisting, but we only evict the client if the client is actually connected. If you want to network fence a client that does not have a CephFS file system mounted at that time, we still blocklist it additionally, just to make sure that nothing happens in the future. So we try to be as safe as possible and make sure that it's consistent: if the output here says it's fenced, then it really is fenced. Even if you did not have CephFS mounted, it will still fence your node, or your whole subnet, or whatnot.

Yes. The fencing is done on the Ceph cluster, and that means that any client on that particular node won't be able to access the cluster anymore. That's the whole goal. It doesn't matter what kind of client it is: it can be a kernel mount of the file system, it can be a FUSE mount, or it can be libcephfs, which is also used by the FUSE client. None of these clients will be able to use the file system anymore, and that's just the security measure to prevent any data inconsistencies.

This is in fact what we do: we evict the client. The client that is connected from the worker node has a particular client ID, and we figure out this client ID for this particular IP address. It might be more than one, and the network fencing will evict all of these clients. Then it will also blocklist the clients, and you can check with the OSD blocklist command whether everything was fenced correctly or not.
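Under the hood, the CephFS part of the fencing boils down to Ceph commands along these lines. This is only a rough sketch of the idea; the MDS name, client ID, and IP address are placeholders, and the operator resolves all of them automatically.

```sh
# 1. List CephFS client sessions on the MDS daemons and find the session(s)
#    whose address matches the IP you want to fence.
ceph tell mds.* client ls

# 2. Evict the matching session by its client ID (placeholder values below);
#    eviction also blocklists that client instance.
ceph tell mds.myfs-a client evict id=24234

# 3. Additionally blocklist the node's address at the OSD level, which also
#    covers clients that were not connected at fencing time.
ceph osd blocklist add 10.90.89.66:0/0

# 4. Verify that the address shows up in the blocklist.
ceph osd blocklist ls
```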
And yes, so this is the summary of how Rook protects you, or can protect you better, from data inconsistencies if you have multiple, or potentially multiple, workloads running on the same CephFS file system. It's a very important feature, not only for CephFS but also for other storage environments, because silently corrupting your data is not what you want. This can be further automated with additional operators like Medik8s that do the health checking of nodes: if they figure out that a node is not working, they can taint the node, Rook then notices the taint and creates the NetworkFence CR, the NetworkFence CR is executed by the CSI-Addons operator talking to the Ceph cluster, and the broken worker node will not be able to communicate with the Ceph cluster anymore.

There's a question. [Audience] It sounds like it has to be the case that Ceph is in the final position to control whether the next pod is going to be able to use the same CephFS file system, because otherwise there's no way to guarantee that the proper ordering occurs. Is that correct?

So the question is whether Ceph is the final decision maker in fencing. It is, but you ideally also want to fence on different levels. In order to have your data consistent, and not necessarily only your workloads, that needs to be done on the storage level, as low as possible. So Ceph is the one that decides who is allowed to use your data, and if there is a broken worker node, that broken worker node should not be allowed to use it. The best way to prevent the broken worker node from using or modifying the data is by blocking it on the storage level, on Ceph.

[Audience] Right, but my question was also whether the next pod should be able to be scheduled at all, because already at step one a new pod could come up.

Let me rephrase it. Step zero is that the broken node breaks. Step one is that Kubernetes detects some breakage and starts the pod on another node. [Audience] Right, and that's what I'm getting at. Right, in that case you're already late, because the next pod might start to disrupt your data and write into the same files, and that is correct. So yes, in this case there's still a window where things might not go right, and that is a Kubernetes issue. Hopefully, by automating more things, like enabling Medik8s to detect a broken worker node, immediately taint it, and then make sure that it fails over, that window gets smaller. Kubernetes also has a timeout for detecting that, for example, the kubelet doesn't respond, so it doesn't immediately reschedule; you still have time to actually do some operations before the rescheduling happens. It depends on the failure behavior that you see, but you still need to be very fast, so having automation around it is best, and you should try to have a large enough timeout for the rescheduling after a failure.

[Audience] So you should check your timeouts, and then the ordering should be sufficient.

Exactly, the comment is that you should check your timeouts, and then the ordering should be sufficient. If you have long enough timeouts, you can be very safe; if you have very short timeouts, the chance of problems is just bigger. It's not always good to have your application up at five nines; sometimes a little bit lower, but with more safety, is encouraged.
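As a sketch of what a large enough timeout can mean in practice: the tolerations for the not-ready and unreachable node taints control how quickly Kubernetes evicts and reschedules a pod from a failed node, and they can be lengthened per workload. The deployment name and the 600-second value below are only placeholders.

```sh
# Give the workload a longer toleration for node failures so it is not
# rescheduled before the broken node has been fenced ("db" and 600s are
# placeholder values; note that a merge patch replaces the whole list).
kubectl patch deployment db --type merge -p '
spec:
  template:
    spec:
      tolerations:
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 600
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 600
'
```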
This was it for my talk. I'm not sure if we still have time; we have two minutes for questions if there's anything, and otherwise I thank you all for listening.