[00:00.000 --> 00:13.320] Okay, everybody, let's give it a hand for Calib and another of our moderators. [00:13.320 --> 00:14.320] Thank you. [00:14.320 --> 00:18.680] So, yeah, my name is Caleb, or Calib, as the case may be. [00:18.680 --> 00:22.240] My colleague, Daniel Gryniewicz, could not be here. [00:22.240 --> 00:26.400] He was going to be a co-presenter, but with a big reorganization at work and budget cuts, [00:26.400 --> 00:29.000] we couldn't get the budget for everybody to come. [00:29.560 --> 00:33.080] Juan Luis gave me a great introduction in the previous talk. [00:33.080 --> 00:40.480] Now I'm going to dive in a lot deeper on the rest of what's in the RADOS Gateway, RGW. [00:40.480 --> 00:44.680] So I'll start with a brief overview of what Ceph is and what the pieces are, [00:44.680 --> 00:48.920] and then I'll drill down deeper into Ceph. [00:48.920 --> 00:51.200] I know the acoustics in this room are pretty bad. [00:51.200 --> 00:53.720] Can you guys all the way in the back hear me? [00:53.720 --> 00:55.080] Good. [00:55.080 --> 00:58.600] I tend to be pretty loud anyway, so, but. [00:58.600 --> 00:59.600] So what is Ceph? [00:59.600 --> 01:06.160] Ceph is fundamentally an object storage system. [01:06.160 --> 01:15.120] The basis of Ceph is something called RADOS, which stands for Reliable Autonomic Distributed [01:15.120 --> 01:16.520] Object Store. [01:16.520 --> 01:18.920] I think that's a backronym. [01:18.920 --> 01:26.360] I don't think it's very good, but it rolls off the tongue, so we'll go with it. [01:26.400 --> 01:30.640] The RADOS layer talks to object storage daemons, or OSDs. [01:30.640 --> 01:35.240] I don't show these on this slide, but they would be down here, and the object storage daemons would [01:35.240 --> 01:46.120] have some kind of media, hard disk drives, SSDs, or NVMe storage, to persist the data. [01:46.120 --> 01:54.960] There's a bunch of layers on top, so applications can talk to RADOS directly using the librados [01:54.960 --> 01:55.960] API. [01:55.960 --> 02:04.240] It's not a standard API; it's the Ceph object API. [02:04.240 --> 02:07.280] I'm not aware of how many people actually use librados directly. [02:07.280 --> 02:12.760] We'll come back to the RADOS Gateway in a minute. [02:12.760 --> 02:23.800] Consumers of block storage, like virtual machines and virtualization, would use RBD, which talks to [02:23.800 --> 02:27.880] librados and then RADOS. [02:27.880 --> 02:33.640] CephFS is the native POSIX-like file storage API. [02:33.640 --> 02:42.040] There's native kernel support for CephFS, or you can use a FUSE mount. [02:42.040 --> 02:46.320] But the piece I'm really here to talk about is the RADOS Gateway, or RGW, as we like to call [02:46.320 --> 02:47.320] it. [02:47.320 --> 02:49.040] This diagram is not really accurate. [02:49.040 --> 02:51.240] We could split this in half. [02:51.240 --> 02:58.960] On one side we have the RGW daemon, the RADOS Gateway daemon, which is listening on its own [02:58.960 --> 03:04.720] port and speaks the S3 and Swift protocols. [03:04.720 --> 03:12.880] The RADOS Gateway then talks to librados and down to the storage. [03:12.880 --> 03:19.800] The other piece that is RADOS Gateway is something called librgw. [03:19.800 --> 03:28.640] This is actually an in-process server in a library that you can link into an application [03:28.640 --> 03:36.400] and then initialize, and it will run its own RADOS Gateway instance, talking down through [03:36.400 --> 03:41.320] the stack and ultimately to your Ceph storage.
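To make the librados piece concrete, here is a minimal sketch of what talking to RADOS directly with the C++ API looks like. It assumes a reachable cluster; the pool name "mypool" and the client id "admin" are placeholders, and error handling is abbreviated. Build against librados (e.g. g++ example.cc -lrados).

#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  // Connect as client.admin, reading the default ceph.conf.
  if (cluster.init("admin") < 0) { std::cerr << "init failed\n"; return 1; }
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0) { std::cerr << "connect failed\n"; return 1; }

  // Open an I/O context on an existing pool (placeholder name "mypool").
  librados::IoCtx io;
  if (cluster.ioctx_create("mypool", io) < 0) { std::cerr << "no such pool\n"; return 1; }

  // Write a small object, then read it back.
  librados::bufferlist out;
  out.append("hello from librados");
  io.write_full("greeting", out);

  librados::bufferlist in;
  io.read("greeting", in, 128, 0);   // read up to 128 bytes from offset 0
  std::cout << in.to_str() << std::endl;

  io.close();
  cluster.shutdown();
  return 0;
}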
[03:42.320 --> 03:47.240] If you have questions, feel free to raise them immediately. [03:47.240 --> 03:51.400] We'll also do a Q&A at the end. [03:51.400 --> 04:00.440] So the RADOS Gateway right now has two basic front-ends, Swift and S3, S3 being the one that most people [04:00.440 --> 04:01.920] are using. [04:01.920 --> 04:08.720] This is some slideware that I stole off a Ceph slide deck. [04:08.720 --> 04:14.640] My boss said this is a terrible slide when he reviewed this. [04:14.640 --> 04:21.600] So for the last three years we've been engaged in a side project called Zipper. [04:21.600 --> 04:28.960] The idea was that we were going to unzip the implementation and break it out. [04:28.960 --> 04:30.800] So what does that really mean? [04:30.800 --> 04:42.200] We have the ability to provide S3 protocol capabilities on top of Ceph storage, but [04:42.200 --> 04:47.120] Ceph is just one example of storage that people might be using, and we thought there would be [04:47.120 --> 04:54.960] a good deal of value in being able to plug in [04:54.960 --> 05:00.280] different storage underneath besides just Ceph RADOS storage. [05:00.280 --> 05:13.520] So we have some proof-of-concept and some preliminary work using Intel's DAOS and Seagate's [05:13.520 --> 05:19.840] Motr, which I think is the other name for CORTX. [05:19.840 --> 05:28.880] We also have a proof of concept plugging in a MySQL database. [05:29.880 --> 05:42.440] In order to do this we wanted to provide a flexible API on the inside of RGW that anybody [05:42.440 --> 05:46.060] could use to write their own back-end storage. [05:46.060 --> 05:55.760] So as Juan Luis was saying, Rancher at SUSE is using this for their product, and, [05:55.760 --> 06:03.080] like I said, we have Intel and Seagate doing some more things. [06:03.080 --> 06:04.400] Why do we want to do this? [06:04.400 --> 06:15.440] As I already alluded to... well, I didn't actually allude to this yet. [06:15.440 --> 06:23.440] The Ceph RGW implementation, we think, is one of the most complete S3 implementations [06:23.440 --> 06:29.560] outside of Amazon AWS. [06:29.560 --> 06:35.760] I don't know if that's right or not, we can debate that, but we do have people telling [06:35.760 --> 06:43.920] us that it's one of the best S3 implementations, one of the most complete. [06:43.920 --> 06:52.840] We have other companies, other interested parties, telling us that they want to leverage [06:52.840 --> 06:58.840] our implementation somehow, and we wanted to provide the ability for developers to do [06:58.840 --> 07:03.760] that and plug in their own storage. [07:03.760 --> 07:08.480] What I did allude to, I called it unzipping: we were going to basically unwrap or [07:08.480 --> 07:14.880] unzip our implementation and make it consumable for other consumers. [07:14.880 --> 07:20.280] We've developed a plug-in model, or a zipper model, as we call it. [07:20.600 --> 07:24.400] For anybody that's familiar with the Linux kernel, the VFS layer in the kernel is a similar [07:24.400 --> 07:31.720] concept, or if you've seen NFS-Ganesha, the FSAL, which stands for File System Abstraction [07:31.720 --> 07:41.240] Layer, has plug-ins that let you plug in different backend storage for the NFS server.
[07:41.240 --> 07:46.240] So we're doing the same thing: we're basically putting a plug-in architecture at the bottom [07:46.240 --> 07:53.120] of the RADOS Gateway, and we're collaborating with other people. [07:53.120 --> 07:59.960] One of the things we're going to do with this is actually add the ability to [07:59.960 --> 08:07.680] stack these plug-ins, so the bottom of the stack would be something like a RADOS [08:07.680 --> 08:15.280] or a DAOS backend, but above that you could stack filters that could do [08:15.280 --> 08:28.000] fairly arbitrary things, like send your data off to be indexed or compressed, or [08:28.000 --> 08:36.880] put it into a faster look-aside cache, and possibly do some cache-tiering sorts [08:36.880 --> 08:38.760] of things. [08:38.760 --> 08:47.680] So the API is intended to allow the stacking of filters above the backend storage. [08:47.680 --> 08:53.640] And what this looks like in pictures: in the existing RGW, up through [08:53.640 --> 09:01.920] what's currently shipping, which is Quincy, or Ceph 17, we have a monolithic (not modular, [09:02.560 --> 09:11.280] monolithic) implementation of RGW that speaks strictly librados. [09:11.280 --> 09:15.920] And what we're doing is breaking it up. We've spent the last three years refactoring it, [09:15.920 --> 09:24.160] cleaning up the internal APIs, and adding the ability to write individual plug-ins that [09:24.160 --> 09:31.480] are loaded as shared libraries based on what a configuration file tells it. [09:31.520 --> 09:40.560] So, as I said, our first plug-in, of course, the natural one, would be a RADOS [09:40.560 --> 09:42.760] driver. [09:42.760 --> 09:52.560] We also have a real proof-of-concept DBStore that goes to a MySQL database. [09:52.560 --> 10:02.360] We're also working with the Mass Open Cloud. I'm looking at Niels like he knows; he doesn't [10:02.360 --> 10:03.360] know. [10:03.360 --> 10:17.280] Boston University and Northeastern University are collaborating with Red Hat on something [10:17.280 --> 10:23.240] called the Mass Open... something initiative; if you really want to know, you can ask me [10:23.240 --> 10:26.280] later and I'll figure out what it is. [10:26.280 --> 10:33.080] And they have been taking something called D3N and expanding it into a new feature called [10:33.080 --> 10:38.840] D4N, which is a look-aside cache that helps improve performance. [10:38.840 --> 10:43.080] So these are all real and in there right now, although they're not in there as plug-ins; [10:43.120 --> 10:51.600] they're still in there in that monolithic implementation, but that's what it would look like. [10:51.600 --> 11:00.160] Another way to look at this: we have the protocol interface at the top, where we have [11:00.160 --> 11:08.560] S3, we have Swift, and something else we have in the pipeline is Apache Arrow Flight as [11:08.560 --> 11:17.240] a front-end. These front-ends are not pluggable; I'm just mentioning them here for completeness. [11:17.240 --> 11:24.760] We have the operational core implementation, we have APIs for bucket, object, user, store, [11:24.760 --> 11:32.560] and lifecycle, these get routed through the plug-ins, and then at the bottom we have these [11:32.560 --> 11:42.560] plug-in layers that talk to the back-end storage.
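The talk says the plug-in to load is chosen by a configuration file. As a purely illustrative sketch, the selection might look something like the ceph.conf fragment below; the section name is a placeholder, and the option names (rgw backend store, rgw filter) are assumptions based on the dbstore and D4N proof-of-concept documentation, so check the docs for your Ceph release before relying on them.

[client.rgw.zipper-test]        ; placeholder instance name
    ; choose the store plug-in at the bottom of the stack, e.g. "rados" or "dbstore"
    rgw backend store = dbstore
    ; optionally stack a filter above the store, e.g. the D4N look-aside cache
    ; rgw filter = d4n

Everything above the store, the S3/Swift front-ends and the operational core, stays the same regardless of which backend is configured.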
[11:42.560 --> 11:47.840] So I'm going to show you some code. I'm not going to dive too deep into it, but I did [11:47.840 --> 11:55.520] want to at least show you the APIs. The implementation is written in C++. We start [11:55.520 --> 12:05.040] with a pure virtual base class, which has the definition [12:05.040 --> 12:08.080] of the interface. [12:08.080 --> 12:16.440] We go from there to a small, lightweight stub class with [12:16.440 --> 12:24.520] some basic implementation. And before I get ahead of myself, here's what the pure virtual [12:24.520 --> 12:32.840] base class starts to look like. We call this the Driver class; it has a bunch of [12:32.840 --> 12:40.280] virtual methods that get filled in by the stubs and the actual implementation. Drilling [12:40.280 --> 12:50.600] down deeper, our stub class is this StoreDriver subclass, based on Driver; that [12:50.600 --> 12:58.120] is literally, I think, the whole class declaration. [12:58.120 --> 13:08.320] And then finally there's the RadosStore class, where we plug in the actual implementations [13:08.320 --> 13:17.640] that are needed to talk to the RADOS store. The DAOS and the Motr implementations [13:17.680 --> 13:24.920] look very similar, but obviously they have different class names and they each have their [13:24.920 --> 13:36.680] own methods to fill in for the virtual methods. [13:36.680 --> 13:44.800] This is just a summary of all the different things that are in the API. There's basically [13:44.800 --> 13:55.800] a main entry point: the RADOS Gateway reads a config file to [13:55.800 --> 14:05.000] determine which class it's going to instantiate; it does a dlopen and a [14:05.000 --> 14:12.920] dlsym to get the entry point to the library; this creates an instance of the driver [14:13.040 --> 14:17.840] class, the RadosStore class for example; and once we have a pointer to that [14:17.840 --> 14:24.360] instance, then we can start invoking the virtual [14:24.360 --> 14:26.640] methods in the APIs. [14:26.640 --> 14:39.120] So we have APIs for user, bucket, object, multi-part upload, et cetera, et cetera. [14:39.320 --> 14:47.320] The store, as I already alluded to, is a back-end storage implementation, so it sits at the [14:47.320 --> 14:54.040] bottom of the stack, and we have a base implementation defined. [14:54.040 --> 14:59.080] A filter, as I was mentioning earlier, is an optional partial implementation that can [14:59.080 --> 15:10.600] do some kind of filtering or caching or other modifications on the data as it goes [15:10.600 --> 15:16.040] by on its way to the store. [15:16.040 --> 15:24.800] What Zipper is and isn't: Zipper is a way to provide S3 or Swift object storage on top [15:24.880 --> 15:28.360] of whatever your storage is. [15:28.360 --> 15:34.320] So your storage could be a database, your storage could be RADOS, it could be anything [15:34.320 --> 15:36.160] you can think of. [15:36.160 --> 15:43.200] It's a way to transform and modify things on the way by, so caching, namespace [15:43.200 --> 15:49.640] division, tracing. It's a strict layering system: calls move down the stack and returns [15:49.640 --> 15:58.760] move back up; it's not an acyclic graph. [15:58.760 --> 16:01.200] What is Zipper not?
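Since the slides with the class definitions are not reproduced in the transcript, here is a condensed sketch of the shape being described. This is not the actual rgw::sal code: the class names, method signatures, library path, and factory symbol below are simplified stand-ins, meant only to illustrate the layering (pure virtual base, thin stub, concrete store) and the dlopen/dlsym-style loading the talk walks through.

#include <dlfcn.h>
#include <iostream>
#include <string>

// Pure virtual base class: the interface every Zipper driver must provide.
// (Illustrative subset; the real interface also covers multipart uploads,
// lifecycle, and much more.)
class Driver {
public:
  virtual ~Driver() = default;
  virtual int initialize(const std::string& conf) = 0;
  virtual int get_user(const std::string& id, std::string& out) = 0;
  virtual int get_bucket(const std::string& name, std::string& out) = 0;
};

// Lightweight stub layer: boilerplate shared by concrete stores, so each
// backend only has to fill in what it actually needs.
class StoreDriver : public Driver {
public:
  int initialize(const std::string&) override { return 0; }
};

// A concrete store at the bottom of the stack. A DAOS or Motr driver would
// look the same, just with its own implementations of the virtual methods.
class RadosStore : public StoreDriver {
public:
  int get_user(const std::string& id, std::string& out) override {
    out = "user:" + id;      // a real driver would call into librados here
    return 0;
  }
  int get_bucket(const std::string& name, std::string& out) override {
    out = "bucket:" + name;  // a real driver would call into librados here
    return 0;
  }
};

// Each plug-in shared library exports a C entry point that the gateway
// resolves with dlsym after loading the library named in the configuration.
extern "C" Driver* create_driver() { return new RadosStore; }

// Roughly what the loading side does: dlopen the configured plug-in, dlsym
// the factory symbol, instantiate the driver, then call its virtual methods.
int main() {
  const char* plugin = "./libzipper_rados.so";   // hypothetical path from the config file
  void* handle = dlopen(plugin, RTLD_NOW);
  Driver* driver = nullptr;
  if (handle) {
    auto factory = reinterpret_cast<Driver* (*)()>(dlsym(handle, "create_driver"));
    if (factory) driver = factory();
  }
  if (!driver) driver = create_driver();         // fall back to the built-in driver here

  driver->initialize("");
  std::string info;
  driver->get_bucket("demo-bucket", info);
  std::cout << info << std::endl;

  delete driver;
  if (handle) dlclose(handle);
  return 0;
}

In the real tree the interface lives in the rgw::sal namespace and is far larger, but the shape is the one described in the talk: a pure virtual base, a thin stub layer, and one concrete store class per backend.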
[16:01.200 --> 16:06.040] It's not a communication bus; you can't communicate between these components. [16:06.040 --> 16:15.680] It's not a graph; there are no sideways calls across the bottom of these. And, I'm not sure [16:15.680 --> 16:18.240] what I meant by this one, it's not the front end, or part of the front end. [16:18.240 --> 16:25.600] Zipper isn't part of the front end, so when we talk about S3 or Swift or Arrow Flight, those [16:25.600 --> 16:29.240] really aren't part of Zipper; those are going on in parallel. [16:29.240 --> 16:38.600] The Arrow Flight implementation should land in Ceph 18, Reef, but that's not [16:38.600 --> 16:46.360] really what Zipper is. There's also Lua scripting coming in Reef. [16:46.360 --> 16:48.000] So where are we? [16:48.000 --> 16:52.480] We've actually been working on this for the better part of three years. Mostly this has [16:52.480 --> 16:59.480] consisted of refactoring the current implementation so that we can actually accomplish [16:59.480 --> 17:01.280] this. [17:01.280 --> 17:09.880] The existing monolithic implementation was really, well, I won't say anything bad about [17:09.880 --> 17:10.880] it. [17:10.880 --> 17:15.000] It was not very good; it has needed a lot of cleanup. [17:15.000 --> 17:21.440] As I said, there's a prototype proof-of-concept DBStore that we actually [17:21.440 --> 17:25.600] have working in the monolithic implementation. [17:25.600 --> 17:35.560] We have a trace driver, we have ongoing work in this D4N project, which will ultimately [17:35.560 --> 17:50.640] be a filter plugin, we have object-on-file, and there's a Lua scripting layer, so we could have a Lua [17:50.640 --> 17:58.080] filter plugin that would allow you to write your own Lua scripts to do things to the data [17:58.080 --> 18:13.280] as it transitions up and/or down the stack. [18:13.280 --> 18:23.840] This is ongoing within Ceph; it's part of upstream Ceph, as is the rest of RGW. RGW is [18:23.840 --> 18:32.560] in the Ceph GitHub repository. You can reach Daniel or myself at these addresses, and yeah, I [18:32.560 --> 18:37.080] think that's the gist of it. [18:37.080 --> 18:40.120] Any questions? [18:40.120 --> 18:51.880] I was allotted 30 minutes and I just did it in 15. [18:51.880 --> 18:52.880] No questions? [18:52.880 --> 18:53.880] Yeah, yes. [18:53.880 --> 19:00.600] With the file backend or even the DB backend, is it possible to run it locally on a laptop [19:00.600 --> 19:01.600] too? [19:01.600 --> 19:04.880] You could run it locally on a laptop, yeah. [19:04.880 --> 19:15.800] The DB is a file. We've talked about doing a separate, pure file backend as a Zipper plugin [19:15.800 --> 19:24.640] instead of a database, but the database actually gives us everything we want and has some good [19:24.640 --> 19:25.640] semantics. [19:25.640 --> 19:29.640] I guess I could ask you that as well. [19:29.640 --> 19:34.080] Oh, I'm supposed to repeat the question. So the first question was: could [19:34.080 --> 19:45.200] you do a file backend, and I just answered that. And sorry, the second question was about the database. [19:45.200 --> 19:51.880] The plugin is currently written against MySQL, but you could probably pretty easily refactor [19:51.880 --> 19:59.440] it to use any database you want. [19:59.440 --> 20:00.440] No other questions? [20:00.440 --> 20:01.440] Thanks so much. [20:01.440 --> 20:02.440] Thank you.
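As a postscript to the filter discussion above: a filter is just another driver that wraps the layer below it and forwards calls, doing its own work on the way through. Reusing the simplified stand-in names from the earlier sketch (again, not the real rgw::sal types), a minimal tracing filter stacked above a store might look like this.

#include <iostream>
#include <memory>
#include <string>

// Same illustrative interface as in the earlier sketch, reduced to one call.
class Driver {
public:
  virtual ~Driver() = default;
  virtual int get_object(const std::string& key, std::string& out) = 0;
};

// A store at the bottom of the stack (stand-in for a RADOS or DB store).
class DemoStore : public Driver {
public:
  int get_object(const std::string& key, std::string& out) override {
    out = "data-for-" + key;
    return 0;
  }
};

// A filter: also a Driver, but it holds the next layer down and forwards
// calls, doing its own work (tracing, caching, compression, ...) on the way.
class TraceFilter : public Driver {
  std::unique_ptr<Driver> next;
public:
  explicit TraceFilter(std::unique_ptr<Driver> n) : next(std::move(n)) {}
  int get_object(const std::string& key, std::string& out) override {
    std::cerr << "trace: get_object(" << key << ")\n";  // work on the way down
    int rc = next->get_object(key, out);                // strict layering: call down...
    std::cerr << "trace: rc=" << rc << "\n";            // ...and return back up
    return rc;
  }
};

int main() {
  // Stack a filter above a store, as described in the talk.
  TraceFilter stack(std::make_unique<DemoStore>());
  std::string data;
  stack.get_object("demo-key", data);
  std::cout << data << std::endl;
  return 0;
}

The same wrapping pattern is what a D4N look-aside cache or a Lua filter plugin would follow: intercept the call, do the extra work, then hand it on to the layer below.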