[00:00.000 --> 00:14.200] So, I'm João. I work at SUSE, on the storage team. I used to work on Ceph. Our storage team [00:14.200 --> 00:22.000] used to work on Ceph, but due to restructurings, that's no longer the case. I'm going to talk [00:22.000 --> 00:29.200] to you about one of our latest projects, the S3 Gateway, and I'm going to tell you, mostly, why [00:29.200 --> 00:37.640] we're doing this, what the S3 Gateway is, and how we're doing it. Hopefully there will be [00:37.640 --> 00:48.800] a demo, and then the next steps and what's ahead of us. So, why are we doing this? [00:48.800 --> 01:01.320] Essentially, after our product was restructured, we needed to find something else to work on, [01:01.320 --> 01:20.520] and one of the ideas we had was, on the one hand, to find a way to provide [01:20.520 --> 01:29.520] something that was lacking in the SUSE Rancher portfolio, [01:29.520 --> 01:38.400] which is basically an S3 service for local clusters, for local storage within a Kubernetes [01:38.400 --> 01:46.720] cluster. What we aimed at was something as easy to deploy and as easy to forget about as possible, [01:46.720 --> 01:56.000] something that just works, ideally for local workloads, not something that is complex to [01:56.000 --> 02:03.720] manage. We wanted something as light as possible, [02:03.720 --> 02:15.360] and that's what eventually became the S3 Gateway. It's an open source project, as usual. It's driven [02:15.360 --> 02:27.720] by our team at SUSE, and ideally it will be an easy-to-use project that you just deploy [02:27.720 --> 02:33.640] on a Kubernetes cluster, and it will just provide you an S3 service within your cluster. I say [02:33.640 --> 02:40.600] "ideally" because this has six months' worth of development, and there are a lot of things still [02:40.600 --> 02:50.080] lacking, far more than I would actually like, but that's just how life is. It complements the [02:50.080 --> 02:56.320] Rancher portfolio, as I mentioned, but this was not necessarily the main driver when developing [02:56.320 --> 03:07.240] the S3 Gateway; it just happens to fit nicely within the stack. It helps with backups of local [03:07.280 --> 03:18.680] Longhorn volumes, and backups for other stuff within the stack. One of our main criteria initially [03:18.680 --> 03:26.720] was that we would serve our storage from any PVC that a Kubernetes cluster could provide, [03:26.720 --> 03:34.040] and this just happens to be nice because, given that Longhorn allows us to put the data on a Longhorn [03:34.480 --> 03:40.400] persistent volume, Longhorn will deal with all the nasty things like replication and whatnot, [03:40.400 --> 03:47.880] so that we don't have to deal with that. Of course, we wanted a pretty UI for all of the [03:47.880 --> 03:57.640] operations and management; that is still ongoing, but we'll get there. How we're doing this, [03:57.960 --> 04:08.000] though, is basically by using the RADOS Gateway from Ceph. We didn't want to start something from scratch, [04:08.000 --> 04:16.160] because we thought it would be a waste of time and resources, so we decided to leverage the [04:16.240 --> 04:29.200] RADOS Gateway from Ceph, which is quite amazing because it can be run standalone.
Given the already [04:29.200 --> 04:38.680] built-in Zipper layer, about which there is a talk next, I think, we basically just had to create a new, [04:38.720 --> 04:51.480] file-based backend, so our data is stored on that file-based backend instead [04:51.480 --> 05:00.200] of, say, the RADOS store. Hence, we don't need a whole Ceph cluster; we just need the binary [05:00.240 --> 05:12.760] running standalone. Essentially, as I was saying, we have the RADOS Gateway consuming a file system, [05:12.760 --> 05:20.320] whatever that file system is on. We keep a SQLite database for the metadata of the [05:20.360 --> 05:34.400] objects, and a directory hierarchy for the data. We decided to do this so that essentially everything [05:34.400 --> 05:47.760] that can be indexed could be kept as metadata in SQLite, [05:47.840 --> 05:54.960] which allows us to search and index things more easily, instead of having to go through [05:56.880 --> 06:03.760] a directory hierarchy to find, for instance, buckets and whatnot. So buckets are essentially a mapping [06:03.760 --> 06:12.720] of a name to a UUID, and objects, as well, end up being entries in the database that associate [06:12.800 --> 06:19.840] the bucket name and the object name, and are mapped to a UUID. The data for the objects, though, [06:20.640 --> 06:27.680] is placed based on the UUID: we grab the UUID and create a directory hierarchy based on the [06:28.720 --> 06:40.640] first bytes of the UUID. The reasoning behind this was mostly that, given that typically some [06:40.640 --> 06:46.960] buckets tend to grow larger than other buckets, if we were creating a directory hierarchy [06:46.960 --> 06:53.520] per bucket name, we could end up with very large directories. This way, we kind [06:53.520 --> 07:03.840] of spread the objects around, and even if we end up with larger directories, we don't have to list [07:03.840 --> 07:09.680] the directories themselves to find where the objects are or which objects are within those [07:09.680 --> 07:16.800] directories. We just have that stuff in the metadata tables within SQLite. [07:18.880 --> 07:26.960] Now, this is not pretty, I admit it, but it's the best I could do. This is roughly what the S3 Gateway [07:26.960 --> 07:34.720] translates to when deployed on a Kubernetes cluster. We are essentially [07:34.720 --> 07:42.160] deploying two containers, one for the backend, which is the RADOS Gateway, and another one for the [07:42.160 --> 08:01.200] UI. We deploy our store on whatever is supplying us with storage. In this case, [08:01.280 --> 08:09.760] this slide has Longhorn on it, which will deal with all the replication and availability and [08:09.760 --> 08:15.280] whatnot for us, so that we don't have to care about this, but this runs just as well on something local: [08:15.280 --> 08:23.680] I'm running this on my Synology NAS. I just have a volume that is exported to the backend. [08:24.400 --> 08:34.800] It really doesn't care about what is providing the file system to it. The UI speaks [08:34.800 --> 08:44.800] directly with the backend, and the user just consumes the S3 Gateway: if outside the cluster, [08:44.800 --> 08:50.240] through an ingress; if inside the cluster, through magic that I don't understand very well. [08:50.400 --> 09:04.880] I promised a demo, which might or might not work.
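Before the demo, to make that layout concrete, here is a minimal sketch of the idea just described: bucket and object names live in SQLite and map to UUIDs, while object data lands in a directory tree derived from the first bytes of the UUID. The schema, the two-level fan-out, and the helper names are illustrative assumptions, not the project's actual on-disk format.

    import sqlite3
    import uuid
    from pathlib import Path

    # Hypothetical, simplified schema; the real metadata store is richer.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS buckets (name TEXT PRIMARY KEY, bucket_id TEXT NOT NULL);
    CREATE TABLE IF NOT EXISTS objects (
        bucket_name TEXT NOT NULL,
        object_name TEXT NOT NULL,
        object_id   TEXT NOT NULL,
        PRIMARY KEY (bucket_name, object_name)
    );
    """

    def object_path(root: Path, object_id: str) -> Path:
        # Spread objects across directories keyed on the UUID's first bytes,
        # so no single bucket piles all its objects into one huge directory.
        # (Two levels of two hex characters each is an assumption for illustration.)
        return root / object_id[0:2] / object_id[2:4] / object_id

    def put_object(db: sqlite3.Connection, root: Path, bucket: str, name: str, data: bytes) -> None:
        oid = str(uuid.uuid4())
        path = object_path(root, oid)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)                      # data goes to the file system
        db.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?)", (bucket, name, oid))
        db.commit()                                 # lookup metadata stays in SQLite

    if __name__ == "__main__":
        store = Path("/tmp/s3gw-sketch")
        store.mkdir(exist_ok=True)
        conn = sqlite3.connect(str(store / "metadata.db"))
        conn.executescript(SCHEMA)
        conn.execute("INSERT OR REPLACE INTO buckets VALUES (?, ?)", ("foo", str(uuid.uuid4())))
        put_object(conn, store, "foo", "hello.txt", b"hello world")
        # Listing a bucket is a metadata query, not a directory walk:
        print(conn.execute("SELECT object_name, object_id FROM objects WHERE bucket_name = ?",
                           ("foo",)).fetchall())

With that layout in mind, on to the demo.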
It has been working 50% of the time on my computer, [09:06.560 --> 09:10.720] so let me see if I can get this working. [09:11.440 --> 09:20.880] So, you're not seeing anything, of course not. How do I... Jan, how do I... [09:20.880 --> 09:49.120] So, ideally what we'll see [09:50.880 --> 10:17.040] is an already deployed K3s cluster, which is running things, but it's not running [10:17.040 --> 10:31.440] the S3 Gateway yet. What I'm going to do is deploy it with our chart, which usually works [10:31.440 --> 10:43.200] for people. My laptop is being difficult, so let me just see if I remember where it lives. [10:43.760 --> 10:50.320] So, basically we install this using the default values, [10:53.040 --> 11:02.640] and supposedly we will have the UI available at this URL here, not this one. [11:03.520 --> 11:14.960] Now, there are still a lot of kinks to figure out with this stuff. We have the UI running [11:14.960 --> 11:34.800] here, but the UI right now is unable to talk with the backend, and this is because of [11:35.360 --> 11:46.080] certificates. We are using a self-signed certificate that the browser does not accept, [11:46.080 --> 11:52.160] and the browser talks directly to the backend from the UI; there is no [11:53.920 --> 12:03.840] actual daemon in between the UI and the backend. What happens is that we have to [12:05.360 --> 12:19.680] go to the backend and accept the certificate, which is hilarious, [12:22.080 --> 12:31.120] and once this is done, we can log in to the UI. So, right now the UI is still under heavy [12:31.120 --> 12:38.240] development, and the UI cannot do a lot more than the backend does. We are still lacking [12:38.880 --> 12:46.240] a bunch of things in the backend driver, but I just want to show you that if we [12:46.240 --> 12:59.120] do things against the backend, they will actually show up in the front end. So, let me see. [13:00.160 --> 13:10.560] This is where this stuff is. I have three commands here. If I put a one-gigabyte object [13:11.520 --> 13:24.080] onto a bucket that I actually need to create first: if I create a bucket foo on the backend, [13:24.720 --> 13:33.280] it will actually show on the front end, which is expected, and it would suck if it didn't work. [13:33.440 --> 13:42.240] Putting an object there will also allow us to do some exploration. [13:43.920 --> 13:50.480] This is a big object, so you can see that multipart upload actually works. I'm very proud of this part. [13:51.200 --> 14:06.480] It should be done by now, to some extent. So, if we explore bucket foo, we have a one-gigabyte [14:06.480 --> 14:15.600] object here that we could also download, and hopefully that works, or maybe it doesn't because of my... [14:16.560 --> 14:22.720] but it should be downloadable. I think something is blocking my requests. [14:24.720 --> 14:32.960] Anyway, that was about it for the demo, but if we turn on the administration view, [14:34.400 --> 14:37.120] we could also manage the users, [14:38.000 --> 14:52.480] I think. It's just being difficult. Oh yeah, and I got the object downloading now. Amazing. [14:54.720 --> 15:02.080] But yeah, we could create new users. We still don't support user quotas, bucket quotas, stuff like that. [15:03.040 --> 15:05.680] All of this is still very much work in progress. [15:10.080 --> 15:19.120] Creating buckets can also be done via the UI. [15:20.080 --> 15:29.120] We could enable versioning; versioning is already supported.
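For reference, the demo steps look roughly like this from a client: create a bucket, multipart-upload a large object, list buckets, and finally enable versioning on a bucket. This is a sketch, not the demo script itself; the endpoint, credentials, and file name are placeholders, not the actual values of the deployment shown here.

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Placeholder endpoint and credentials; substitute whatever your deployment exposes.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3gw.local",   # ingress URL, or the in-cluster service
        aws_access_key_id="test",
        aws_secret_access_key="test",
        verify=False,                        # the demo uses a self-signed certificate
    )

    s3.create_bucket(Bucket="foo")

    # Lower the multipart threshold so even a modest file exercises multipart upload,
    # mirroring the one-gigabyte multipart upload in the demo.
    cfg = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                         multipart_chunksize=8 * 1024 * 1024)
    s3.upload_file("big.bin", "foo", "big.bin", Config=cfg)   # "big.bin" is any local file

    print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # bucket foo should be listed

    # Versioning is a bucket property; enabling it is a single call.
    s3.put_bucket_versioning(Bucket="foo",
                             VersioningConfiguration={"Status": "Enabled"})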
[15:31.520 --> 15:36.480] I'm just not doing that because I don't remember how to demo that part. [15:39.280 --> 15:46.240] We have tests for that stuff. Okay, so this is as far as the demo goes. [15:46.800 --> 15:55.920] Let me just list the buckets here so that you see that a bucket bar has been created, [15:56.480 --> 16:05.520] and we have a bucket bar over there. That's thrilling. Okay, let me just go back to [16:05.680 --> 16:10.800] the other thing [16:15.120 --> 16:16.960] and go back to the presentation. [16:23.200 --> 16:34.560] Slideshow, from the current slide. Okay, so next steps. So for now our roadmap is to actually [16:34.640 --> 16:38.240] increase the number of operations that we support, because [16:40.800 --> 16:47.840] RGW basically supports every operation that exists, but the driver [16:47.840 --> 16:57.840] behind it needs to comply with the expected semantics: you request [16:57.840 --> 17:03.920] data from the backend, and the backend should return the appropriate data so that [17:03.920 --> 17:13.360] the client is able to perform the operations. And we've been doing this gradually. There have been [17:13.360 --> 17:22.880] some challenges there. We are in the process of implementing lifecycle management [17:23.440 --> 17:29.440] for buckets and retention policies. The performance currently is far from ideal, [17:29.440 --> 17:38.800] but we're working on that. And I really want statistics in the UI. I mean, exposing as much [17:38.800 --> 17:47.600] information as is useful to the user through the UI is something that is very much on my [17:49.760 --> 17:55.680] to-do list, not necessarily on the to-do list for the project, but that's another thing. [17:56.160 --> 18:02.240] In terms of challenges that we currently face, there is semantic compliance with the S3 API. As I was [18:02.240 --> 18:14.880] saying, we've been having some challenges ensuring that our driver provides the [18:15.840 --> 18:31.120] right information when processing the operations that are requested by RGW. Fortunately, there is an [18:31.120 --> 18:38.560] amazing project called s3-tests within the Ceph repository, which covers pretty much all of the [18:38.880 --> 18:49.520] API, as far as we understand, and it's very good for actually ensuring that we are in [18:49.520 --> 18:56.960] compliance with the API. Then we have performance. This has been a big learning curve, especially [18:56.960 --> 19:06.560] with SQLite. There have been decisions that were made during the implementation of the project, [19:07.280 --> 19:18.160] especially surrounding mutexes, that bit us eventually. But fortunately, we have Marcel [19:18.160 --> 19:29.440] actually looking into this stuff. And that's one of the things that we have here: a [19:29.520 --> 19:37.040] comparison of our previous performance with the performance that comes with [19:37.040 --> 19:43.680] some of the work Marcel has been doing. It may not look like a lot, especially when we compare [19:43.680 --> 19:55.440] with FIO. But I mean, he fiddled with a few mutexes and we got a significant speedup, on a machine [19:55.920 --> 20:05.600] that is, let's say, far from current. So we also believe that we are CPU bound to [20:06.240 --> 20:12.000] some extent and not taking full advantage of the IO path. But that's neither here nor there.
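For anyone who wants rough numbers against their own deployment, a simple latency probe could look something like the sketch below. This is not the benchmark behind the comparison shown in the talk; the endpoint, credentials, object size, and iteration count are all assumptions.

    import time
    import statistics
    import boto3

    # Placeholder endpoint and credentials; point this at your own gateway.
    s3 = boto3.client("s3", endpoint_url="https://s3gw.local",
                      aws_access_key_id="test", aws_secret_access_key="test",
                      verify=False)
    s3.create_bucket(Bucket="bench")

    payload = b"x" * (4 * 1024 * 1024)   # 4 MiB objects, an arbitrary choice
    put_lat, get_lat = [], []

    for i in range(100):
        t0 = time.perf_counter()
        s3.put_object(Bucket="bench", Key=f"obj-{i}", Body=payload)
        put_lat.append(time.perf_counter() - t0)

        t0 = time.perf_counter()
        s3.get_object(Bucket="bench", Key=f"obj-{i}")["Body"].read()
        get_lat.append(time.perf_counter() - t0)

    for name, lat in (("PUT", put_lat), ("GET", get_lat)):
        cuts = statistics.quantiles(lat, n=100)          # 99 percentile cut points
        print(f"{name}: p50={cuts[49]*1000:.1f} ms  p95={cuts[94]*1000:.1f} ms")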
[20:12.640 --> 20:20.560] What matters is that we are making gradual performance improvements that will eventually [20:21.520 --> 20:32.560] pay off. This is just the latency distribution. The YOLO branch is what Marcel called that [20:32.560 --> 20:41.280] branch, mostly because it removed a bunch of mutexes. What we see here is that to some extent [20:42.160 --> 20:48.880] the latencies dropped a bit for put, even though they increased significantly for get. [20:48.880 --> 20:56.240] But we also believe that, given that we now have more operations in flight, [20:58.160 --> 21:05.040] we may actually be CPU bound, and the operations may not be finishing because of [21:07.040 --> 21:16.000] concurrency issues and whatnot. After that, eventually world domination, I think, but probably [21:16.000 --> 21:24.720] not. But that's the hope. And that's it. If you want to find us, we are at s3gw.io. [21:26.480 --> 21:37.600] And that's about it. Thank you for enduring my presentation. Thank you. Any questions? [21:37.600 --> 21:50.320] No? Awesome sauce. Great. Thank you.