Okay, so I am just continuing with the second talk from CERN, so for people who are already in the room some of this will be familiar. I am here to talk about CephFS at CERN, primarily in the context of the disaster recovery requirements we have at the organization. CERN was already introduced in the previous presentation, so I am not going to go into it much; this slide is mostly there for dramatic effect because it looks nice. This is the accelerator we have. We have various detector points; this is one of the experiment sites, ALICE, which does lead-lead collisions, and this was one of the collisions that happened last year.

Moving on, this is a broader view of the whole ring. What was not mentioned in the previous talk is that we of course have an existing data centre that has been serving us since the 70s, but there is also a new data centre under construction, primarily for the backup and disaster recovery needs of the organization, and this talk is mainly focused on that BCDR aspect. The existing data centre looked like this in the 70s; it no longer does, but it has been around since the mid-70s. The new data centre, being newly built, is of course more energy efficient, and it is expected to go into operation in a couple of months, hopefully.

The main purpose of this talk is to cover CephFS at CERN: how we are looking into snapshots, what we found while experimenting with them, and mainly to ask the community for advice, whether you have run CephFS with snapshots and whether you are facing some of the problems that we are facing.

At CERN we do not have a single cluster serving all needs. We have multiple smaller clusters dedicated to particular purposes, which helps when use cases do not align with each other: you do not take down a Ceph cluster because somebody else is doing something that is not a normal workload for your cluster. When it comes to RBD we have hard-drive clusters of around 25 petabytes, and we also have a full-flash erasure-coded cluster of about half a petabyte, mainly for HPC use cases. For CephFS it is also a jungle of different cluster types: a full production hard-drive cluster, a full-flash cluster which is also used for analysis workloads, and hyperconverged clusters where we co-locate compute. For the CERN Tape Archive that Richard talked about in the previous presentation, we have a small Ceph cluster that exclusively serves some of the scheduler components of tape archiving. Besides this we also have a large RGW cluster serving the S3 service at CERN, and we are newly building multi-site clusters that would also act as a second backup region for RGW.

Our CephFS journey began in 2013, with one physical cluster back then. The primary need was of course a shared file system; we use OpenStack heavily for compute, and it was mainly serving OpenStack Manila and some HPC use cases. We have 8 MDS servers, four active and four standby, with no standby-replay. Metadata pools are still on SSDs, we have no snapshots, and it is a single file system.
Since this predated many of the more exotic pinning options you now have with CephFS, we have a script that tries to pin subdirectories to a random metadata server. Now that we have ephemeral pinning, we should re-evaluate and use that instead. We have multiple Ceph file system clusters: two general-purpose clusters that serve general production workloads, and a few clusters that only serve specialized use cases, for example one for monitoring and one for pure HPC workloads. Last year we also moved one of the clusters that is on full flash from regular power to diesel-backed battery and generator power, and this was done with zero downtime using virtual racks in CRUSH; there are some details in the additional slides, and if we have extra time I will go over them.

When we talk about business continuity and disaster recovery, the requirement is largely to keep your data safe during faults, human errors, ransomware, and all of these things that matter nowadays. There are various strategies to achieve this: you can go for an active-active setup, you can go for a warm standby, and you always have backups and cold storage. We are not that focused on the active-active part, at least in the context of file systems here, because you are more likely to benefit from snapshots and backups; so the talk is mostly focused on the warm standby and cold storage use cases.

I suppose everybody knows what snapshots are; anyone in the room who doesn't know what a snapshot is? You have a frozen point-in-time state of the system, which makes it easy to roll back, do soft deletes, and build various operations on top. They are usually quite cheap to create and have much less overhead compared to full backups.

When it comes to CephFS, snapshots are enabled on new clusters by default from the Quincy release onwards; it is a configurable flag called allow_new_snaps, a boolean that you set on the file system. Just enabling snapshots does not give clients the ability to take them: they need an access key with a particular flag, an auth permission that allows snapshots. Snapshots are copy-on-write in CephFS; Patrick already covered how this works under the hood two talks earlier, for those who attended. There is a hidden .snap directory, and creating a snapshot is just an act of creating a folder as far as the end user is concerned. But snapshots are not synchronous; it is a lazy flush operation, so between issuing a snap create and CephFS coming back with the message that the snap is created, you can have I/O that is not being tracked in the snapshot. As an administrator you can also create snapshots at the subvolume level, which is roughly a Manila share or a subvolume that you export to the end user, and that makes some use cases easier. Our own focus is primarily on user snapshots, because users know best what their workloads are and what a safe point is for taking a snapshot.

One thing we definitely want is that metadata-intensive workloads, like CI build jobs and the many things that do git clones for example, do not suffer a major performance penalty when snapshots are used. Many of our CephFS clusters have very heterogeneous use cases, so we need to understand the impact of snapshots because we have both interactive and non-interactive use cases on the same clusters.
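To make the snapshot setup just described concrete, here is a minimal sketch, assuming the standard ceph CLI on an admin node, of enabling snapshots on a file system, authorizing a client whose caps carry the 's' (snapshot) flag, and taking an administrator-side snapshot of a subvolume. The file system, client, path and subvolume names are hypothetical, not our actual configuration.

```python
#!/usr/bin/env python3
"""Minimal sketch, assuming the `ceph` CLI on an admin node: enable
snapshots on a file system, authorize a client with the 's' flag, and
take an admin-side snapshot of a subvolume. Names are hypothetical."""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

FS = "cephfs"  # hypothetical file system name

# Boolean flag that allows new snapshots on this file system.
run(["ceph", "fs", "set", FS, "allow_new_snaps", "true"])

# Client caps need the 's' permission to create/delete snapshots
# under their path ('rws' = read, write, snapshot).
run(["ceph", "fs", "authorize", FS, "client.snapuser", "/volumes/snaptest", "rws"])

# Administrator-side snapshot of a subvolume (roughly a Manila share).
run(["ceph", "fs", "subvolume", "snapshot", "create", FS, "mysubvol", "snap-2024-01-01"])
```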
Coming back to our use cases: we have Kubernetes and OpenShift clusters that use CephFS as a backing store, and we also have interactive users using a service called SWAN, which is basically Jupyter notebooks under the hood and which physicists use a lot for analysis. These workloads should not suffer if somebody else is running snapshot workloads, and that would check off most of our BCDR service offering checklist if we were to provide snapshots as a service. But we are just evaluating snapshots right now, and if we want to provide this as a service to users, we need to understand what kind of operations can potentially impact the functionality of the entire CephFS cluster, and it would be nice to know whether the penalty is restricted to the people actually using snapshots or whether it is cluster-wide. Another important question is whether our tiny three-person team can run the service offering successfully without too much operational effort, and whether there are mitigations for the problems and the operational impact. To a large extent, many of our users are not even aware that they are using a shared file system, and we would like to keep it that way. That is the motivation for the experiments on CephFS snapshots in this talk.

Looking at the evaluation goals, first we should understand the baseline behavior of the system under normal circumstances: if a client is within their limits and not stressing the system, does the system with snapshots behave much worse, is there a performance degradation when snapshots are used, and what kinds of workloads can trigger it. The second goal is of course to understand how the system reacts under stress: if we have badly behaved clients with quite metadata-intensive workloads, which we do have for some of the HPC use cases, how does the system cope, can the people not using snapshots avoid suffering a marked impact, and how bad is the stability and performance hit we see. Last year at Cephalocon we presented a longer version of this talk, and one of the main items on our checklist was to evaluate this in a much larger context, and also to find out whether pinning directories and that sort of thing makes some of the problems we saw last year go away.

For the testing itself we primarily designed two client workloads. One is the standard IO500 benchmark, a standard set of benchmarks used in HPC contexts to measure I/O performance. Under the hood it mainly uses two tools, one called IOR and the other called mdtest. If you run the simple configuration, most of the benchmarks self-clean the test data at the end of the run, and there are various injection points where you can build in functionality such as injecting snapshots. During one of these post phases you can add a script that creates snapshots, so you can have workloads that do a write, a snapshot and a read (a small sketch of such a hook is shown below), to understand whether a single run sees an impact due to snapshots. Generally, in our experience with other file systems at CERN, downtimes are usually due to aggressive metadata workloads, and we need to see whether those sorts of workloads trigger something bad. All the tests come with an easy and a hard variant: for IOR, for example, the easy variant is every process writing its own giant file, which is very easy for bandwidth-oriented use cases.
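Here is the snapshot-injection hook sketch mentioned above: after the write phase it creates a snapshot simply by making a directory inside the hidden .snap folder of the benchmark working directory, so the read phase then runs against data with a live snapshot. This is a minimal sketch, not our actual harness; the path and snapshot naming are assumptions, and it presumes snapshots are enabled and the client key carries the 's' permission.

```python
#!/usr/bin/env python3
"""Minimal sketch of a post-write snapshot hook for a benchmark run.
Assumes a CephFS mount with snapshots enabled and a client key that
carries the 's' permission; the path and names are hypothetical."""
import os
import time

WORKDIR = "/cephfs/snaptest/io500-workdir"  # hypothetical benchmark directory

def create_snapshot(workdir: str) -> str:
    # In CephFS, taking a snapshot is just a mkdir inside the hidden
    # .snap directory of the directory you want to snapshot.
    snap_name = "bench-" + time.strftime("%Y%m%dT%H%M%S")
    os.mkdir(os.path.join(workdir, ".snap", snap_name))
    # Note: snapshot creation is a lazy flush, so in-flight I/O at this
    # instant is not guaranteed to be captured.
    return snap_name

if __name__ == "__main__":
    # Called from the benchmark's post-write injection point.
    name = create_snapshot(WORKDIR)
    print("created snapshot " + name + " under " + WORKDIR + "/.snap")
```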
The IOR hard test, on the other hand, has many clients writing into a single shared file, which is more IOPS-intensive than bandwidth-intensive, although the result is still reported as bandwidth, which is a bit unfortunate. mdtest covers the file and directory creation workloads. The easy variant doesn't actually write any content into the files, so it is purely a create/stat/delete operation, whereas the hard variant writes some data into small files at unaligned block sizes, to really stress the metadata side of the system. In addition to this, we had a small workload that just kept tarring and untarring the Linux kernel; this was to keep a base level of activity, and this kind of workload easily shows when things go really south. On an average run, an untar takes about 3 minutes for the Linux 6.2 kernel, and an rm -rf takes about 4 minutes on the cluster we configured. It is not bandwidth-intensive, of course, just a few megabytes per second, but it is metadata-intensive.

We first started evaluating this on a virtual cluster. The test cluster had 3 monitors and 4 MDSs, no standby-replay configured, and everything was on virtual OSD servers backed by RBD. When we ran the IO500 benchmarks on this, we knew roughly what the theoretical performance should be, and it consistently delivered it; what you see as waves in the graphs are essentially the read and write tests, followed by the deletes, which are metadata-only I/O. As far as the client setup goes, we had one client node for each path: we created two hierarchies, one where snapshots are enabled and another one, called just "client", without snapshots. This time we also pinned all the snapshot workloads to a single MDS: all the work directories are statically pinned to mds.1, so that mds.0 can take the regular traffic without any impact.

We initially ran the tests over the Christmas break, and at first adding snapshots did not cause a marked performance degradation. However, we did not have any monitoring on the cluster, and what is important in Ceph land is always monitoring your PG states. What happened is that placement groups sat in the snaptrim_wait state forever; they never caught up with trimming the snapshots because the cluster was simply too small to pull this off. Eventually the cluster reached fullness as the snaptrims never caught up, and the cluster went into a bad state. It got worse after that: when we removed the workload and started removing files, things deteriorated further, and when we started unmounting clients we saw the MDS go into a crash loop. There are a couple of trackers that refer to this, and primarily, I guess, you should never bring your system to a state of fullness; unmounting clients actually made the problems worse. We eventually looked at the tracker and found out that this is mainly because of the session-tracking metadata object, so we manually wiped the MDS session table and that brought the cluster back up, but doing this in a production scenario is of course a huge operational nightmare.
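Since one of the lessons here was to always watch your PG states, below is a minimal monitoring sketch, assuming the ceph CLI is available on an admin host, that counts placement groups sitting in snaptrim or snaptrim_wait. The thresholds and any alerting wiring are left out and would be deployment-specific.

```python
#!/usr/bin/env python3
"""Minimal sketch, assuming the `ceph` CLI on an admin host: count
placement groups currently in a snaptrim-related state, so that PGs
stuck in snaptrim_wait (as on our virtual cluster) get noticed."""
import subprocess

def snaptrim_pg_counts() -> dict:
    # Plain-text brief PG dump: one line per PG, with the PG state column.
    out = subprocess.run(
        ["ceph", "pg", "dump", "pgs_brief"],
        check=True, capture_output=True, text=True,
    ).stdout
    counts = {"snaptrim": 0, "snaptrim_wait": 0}
    for line in out.splitlines():
        if "snaptrim_wait" in line:
            counts["snaptrim_wait"] += 1
        elif "snaptrim" in line:
            counts["snaptrim"] += 1
    return counts

if __name__ == "__main__":
    print(snaptrim_pg_counts())
    # A reasonable alert (deployment-specific): warn if snaptrim_wait
    # stays non-zero for a long time, since that is what slowly filled
    # up our test cluster.
```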
Lessons learned: we decided it did not make sense to continue on a virtual cluster, so we moved the OSDs to physical hardware, three nodes with 48 7200 rpm hard drives each as the CephFS OSD nodes, and we also added two more clients, so we had two clients on the snapshot path and two on the regular path. And of course the IO500 benchmarks again gave roughly the expected performance.

To establish a baseline, these are the stats we got out of the cluster: an OSD bench on a random OSD gives around 240 MB/s and 55 IOPS, and the higher-than-expected bandwidth for a single OSD is mainly because of NVMe journaling; RADOS bench delivers close to a gigabyte per second of bandwidth and around 250 IOPS. When we ran the IO500 benchmark with 16 worker processes, we could extract the roughly 1 GB/s the cluster can deliver (the roughly 300 MB/s a single node can deliver, multiplied by 3), and we hit around 1.5k stat IOPS and write IOPS in the range of 900. So whatever this tiny 3-node cluster was configured to do, it delivered in terms of baseline statistics.

What we observed is that when you run the IO500 benchmarks on a path with snapshots and on one without, the bandwidth results are more or less always in line. We do see a small degradation on the read workloads when snapshots were taken, but that sort of thing is to be expected: when you take a snapshot there is a small performance penalty associated with it, and that is a cost you can pay. We were running this test with only two client processes, so with more client processes you could probably deliver better bandwidth, but essentially the bandwidth benchmarks show the same performance whether you use snapshots or not.

For IOR hard, which writes a single file from many, many clients, we saw a much wider spread in the results with snapshots than without. There are two reasons for the spread: we have a lot more data for the workloads with snapshots, because those ran over a period of three or four weeks compared to the smaller two-week window we had without snapshots. Another factor to note is that in the IOR hard benchmark a single file is used for a very large number of writes, so the MDS is very unlikely to become the bottleneck, since you are still dealing with one metadata object rather than thousands; so you expect roughly the same performance, and apart from the variance, the mean of IOR hard is more or less in line.

For the metadata read and write tests with snapshots and two clients, we again don't see much of a difference; both are more or less similar. We see slightly better performance on the metadata stat workloads, probably because things get cached: doing a snapshot operation probably brings things into cache, clients do the caching, and that gives a slightly better stat result. For the mdtest easy deletion workloads, this is where the snapshot workloads did start to suffer slightly; however, the delete workload is extremely susceptible to the cluster being pushed under load.
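As a side note on how baseline numbers of this kind can be collected with the stock tools, here is a minimal sketch assuming an admin node with the ceph and rados CLIs and a throwaway benchmark pool; the OSD id, pool name and durations are assumptions, not our exact procedure.

```python
#!/usr/bin/env python3
"""Minimal sketch, assuming the `ceph` and `rados` CLIs and a scratch
pool, of collecting per-OSD and cluster-wide baseline numbers of the
kind quoted above. OSD id, pool name and durations are made up."""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Single-OSD write bench (by default 1 GiB written in 4 MiB chunks).
run(["ceph", "tell", "osd.0", "bench"])

# Cluster-wide write bench for 60 seconds against a scratch pool,
# then a sequential read bench of the objects just written.
run(["rados", "bench", "-p", "bench-scratch", "60", "write", "--no-cleanup"])
run(["rados", "bench", "-p", "bench-scratch", "60", "seq"])
run(["rados", "-p", "bench-scratch", "cleanup"])
```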
The tar and untar workloads also look similar whether you are on a snapshot mount or a non-snapshot mount; the slight variation is more because we have more data for one than for the other, rather than a real spread. So when run in isolation, snapshots did not show much of a performance impact, and if that were the only thing we used to evaluate whether this is good, it would be good to go. But our requirements are primarily driven by heavier metadata workloads, and we need to understand how the system reacts when those happen.

So what we started doing is what HPC workloads sometimes do: use a deep hierarchical directory structure. In mdtest you can specify the depth of the directory tree to be evaluated. When we increased the directory tree depth, we saw that on the non-snapshot client there is of course a degradation of removal operations, but it stays localized to that client; when you do the same on the snapshot client, it affects performance cluster-wide. The overall system metrics take a dive when you start deleting on clients with snapshots. Especially when we snapshotted after creating a deep directory tree, the system started showing latencies we had never seen before: a normal CephFS operation latency is on the order of 50 milliseconds, and we were hitting six or seven minutes for a CephFS stat latency.

We also observed that as we stressed the system, the pinned MDS showed the increasing latency from the previous graph and eventually stopped responding to heartbeats. A standby MDS does take over after a few minutes once this condition is detected, but what happens afterwards is even more interesting: traffic is no longer distributed across both MDSs; all the traffic gets rerouted to the pinned MDS, the one usually handling the snapshot client workloads. We also saw something that was reported in an upstream tracker a year ago, which is that when you have unlink operations you can get very high CephFS MDS latency. The tracker is linked in the slides and is still open. What the tracker mentions, and what we see on the MDS side, is that the MDS spins in one particular function that tries to track the parents and ancestors of inodes, with 100% of a CPU being used in just that one function. What we were more interested in is, when this sort of workload keeps a single MDS spinning, whether the other MDS can still serve normal client workloads, and this is where we saw that normal client workloads can no longer be served, because everything gets rerouted to the same MDS for some reason, and workloads never seem to reach completion. If you manually restart a workload, some of them catch up, but an existing workload mostly ends up in a very stuck state.
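For anyone who wants to reproduce the trigger, here is a minimal sketch of the deep-directory-tree stress described above, assuming mpirun and mdtest are available on a client of the snapshot-enabled hierarchy; the depth, branching factor, item count, process count and path are assumptions rather than our exact parameters.

```python
#!/usr/bin/env python3
"""Minimal sketch, assuming mpirun and mdtest on the client, of the
deep-directory-tree stress described above: create a deep tree under a
snapshot-enabled path, snapshot it, then run the remove phase. Depth,
branching, item count, process count and path are assumptions."""
import os
import subprocess
import time

WORKDIR = "/cephfs/snaptest/deep-tree"  # hypothetical snapshot-enabled path

def mdtest(*phase_flags):
    cmd = [
        "mpirun", "-np", "8", "mdtest",
        "-d", WORKDIR,  # working directory on the CephFS mount
        "-z", "8",      # depth of the hierarchical directory tree
        "-b", "4",      # branching factor at each level
        "-I", "16",     # items per directory in the tree
        *phase_flags,
    ]
    subprocess.run(cmd, check=True)

# Phase 1: create only (-C), building the deep tree.
mdtest("-C")

# Snapshot the freshly created tree (a mkdir in the hidden .snap directory).
os.mkdir(os.path.join(WORKDIR, ".snap", "deep-" + str(int(time.time()))))

# Phase 2: remove only (-r); this is where we saw cluster-wide impact.
mdtest("-r")
```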
In the worst case we saw the tar-and-remove workload, which usually takes on the order of 3 or 4 minutes, going up to 4-plus hours, and worst-case IOR and mdtest benchmarks, which deliver close to 600 MB/s even for a single client, going down to the level of 25 MB/s or 25 IOPS. After stressing the system (we did not run the stress workloads all the time, just twice a day for a couple of days), we saw a marked degradation in the times reported by the operations. The blue graphs basically show the variance of the data: untarring the workload used to take on the order of 300 seconds, with a smaller tail going up to about 20 minutes in the worst case, whereas afterwards there is a marked shift in the mean itself, almost doubling, and a very long tail latency that goes up to hours, which we cut off from the graph because it no longer makes sense to show.

What we observed is a sort of systemic degradation, which makes it difficult to productize this kind of offering. The non-trivial effect on non-snapshot directory trees is a general concern for us, and it is not easy for us to determine from monitoring whether we have hit any of the potential triggers that send the MDS into this inode spin loop. We also need further investigation into why a single MDS seems to take all the I/O traffic after the failover, and the line item we wrote down at last year's Cephalocon, that pinning snapshots away should help because the problem seems localized to one MDS, does not seem to work based on our experiments this time.

In conclusion, we are generally happy with our CephFS clusters, but we are still not ready to enable snapshots on our general-purpose CephFS clusters, primarily because we don't want a fraction of users doing this sort of activity to take down the entire cluster. In this context, if there are people here running CephFS clusters in production, we would like to hear whether you have snapshot workloads, how you are monitoring these things, whether you have seen some of the issues we saw, and any feedback or future direction on how to improve snapshots for everybody is very much appreciated. Maybe there are monitoring insights on deep directory trees that we could get from the MDS and do not know about, or maybe they are easy to implement; we do not know. One important step, of course, is educating users on how to use shared file systems in a good way; the CephFS best-practices document in the upstream documentation is a pretty good starting point here. Another takeaway, which we should file bug reports for in the future, is documenting the various snaptrim parameters of Ceph clusters: the defaults seem sane, but if you do need to modify them it is unclear what sane values to configure (a small example of inspecting these settings follows at the end). And that brings me to the end of the talk. We would like to hear from you, and as Hugo already mentioned in the previous talk, we have a Tech Week Storage at CERN coming up in mid-March, so if you are in the Geneva area feel free to pass by. And yeah, that concludes my talk.
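On the snaptrim parameters mentioned above, here is the promised minimal sketch, assuming the ceph CLI on an admin node, that only reads the current values of a few snaptrim-related OSD options; which values are actually sane for a given cluster is exactly the open question, and the option list is illustrative, not exhaustive.

```python
#!/usr/bin/env python3
"""Minimal sketch, assuming the `ceph` CLI: read the current values of a
few snaptrim-related OSD options. This only inspects settings; what
values are sane for a given cluster is the open question."""
import subprocess

OPTIONS = [
    "osd_snap_trim_sleep",               # sleep between snap trim work
    "osd_pg_max_concurrent_snap_trims",  # concurrent trim ops per PG
    "osd_max_trimming_pgs",              # PGs an OSD will trim at once
]

for opt in OPTIONS:
    value = subprocess.run(
        ["ceph", "config", "get", "osd", opt],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    print(opt + " = " + value)
```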