[00:00.000 --> 03:37.440] Please welcome Sanju and Pranith. Enjoy.

Thank you, guys. Good morning, I am Sanju and this is Pranith; we work at PhonePe. Today we are going to discuss the lessons we learnt while managing a GlusterFS cluster at scale, some of the problems we have faced, and the solutions we came up with.

PhonePe is the leading Indian digital payments and technology company, headquartered in Bangalore, India. It uses the Unified Payments Interface (UPI), which was introduced by the Government of India, so in India, if you are thinking of making any payment, you can do it using the PhonePe app. This is how the PhonePe app home screen looks.

We see 800k RPS on our edge layer every day and we do 130 million daily transactions. This generates lots of records and documents that we have to store, and as per the regulations in India we have to store all of them within India. PhonePe has a private cloud where we store all of this, and we need a service to store and retrieve files from that cloud. We have developed a service called DocStore which writes data to GlusterFS and fetches data from GlusterFS.

Coming to the question of why we chose GlusterFS: we did not want a metadata server, because we have lots of small files and we did not want to store all of that metadata separately. GlusterFS has no metadata server, so we went ahead with it, and our team had earlier success with the GlusterFS project, so they were confident that GlusterFS would work for our use case.

This is the data flow to and from GlusterFS. All traffic is fronted by a CDN, and requests are forwarded to nginx; nginx sends the request to the API gateway, and the API gateway can choose to store or retrieve a file directly, or it can send the request to any backend service. If a backend service wants to store or fetch a file (it can be a POST or a GET request), it sends the request to DocStore. DocStore then stores data to, or retrieves data from, the GlusterFS servers. DocStore also uses Elasticsearch to store some of the metadata, Aerospike to store auth-related info and for some rate-limiting features, and RMQ for asynchronous jobs like deletions and batch operations. And this is our team.

Today's agenda is an introduction to GlusterFS, then we will discuss the different problems we have faced and the solutions we are using, and we have some proposals as a roadmap.
[03:37.440 --> 07:42.280] What is GlusterFS? GlusterFS is a distributed file system: whenever you do a write, the data is distributed across multiple servers. These servers have directories which we call bricks, and this is where the data actually gets stored. This is a typical GlusterFS server: each server can have multiple bricks, each brick has an underlying file system where the data is stored, and in the root partition we store the GlusterFS configuration.

This is how a 3 x 3 GlusterFS volume looks. We can mount a GlusterFS volume on any machine over the network and read and write from that machine. When a write comes in from the client where the volume is mounted, it is distributed across 3 sub-volumes based on the hash range allocation; we will talk more about the hash range in the coming slides. The other 3 means the data is replicated 3 times: each write chooses one of the sub-volumes, and a sub-volume is a replica set, here of size 3, so the data is replicated thrice. Over to Pranith.

Hello. Let us look at some numbers that we see at PhonePe for the DocStore service and, in turn, for GlusterFS. In a day we see about 4.3 million uploads and 9 million downloads, with a peak upload RPS of 200 and a peak download RPS of 800. The aggregate upload size per day is just 150 GB, not a lot, but the download size is 2.5 TB, so it is a completely read-heavy workload, and this is after a CDN is fronting it: only when a file is not available in the CDN does the call come to GlusterFS; the file is then pulled onto the CDN and served from there. This is how the RPS (requests per second) is distributed through the day: uploads are reasonably uniform from 6 am to 5 in the evening, then taper off for the rest of the day, whereas downloads follow a bimodal distribution with one peak around 12 pm and another around 7 pm. The latencies are a function of the file size: POST (upload) latencies have a mean of about 50 ms and a p99 of around 250 ms; similarly, for GETs the mean is around 10 ms and the p99 is around 100 ms.

Let us look at the configuration we use at PhonePe for GlusterFS. We have 30 nodes in the cluster, each node contributes 2 bricks, and one brick corresponds to 10 TB backed by a ZFS pool, so 30 x 20 TB gives 600 TB of raw capacity. We use replica 3, so the usable size is 200 TB, out of which 130 TB is in use at the moment.
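To make the 3 x 3 layout above concrete, here is a minimal sketch of how such a distributed-replicated volume could be created and FUSE-mounted; the host names, brick paths and the volume name docvol are hypothetical, not the PhonePe production setup.

```sh
# Build the trusted storage pool (run from server1).
gluster peer probe server2
gluster peer probe server3
# ... probe the remaining servers ...

# 9 bricks with replica 3 => 3 replica sets distributed over 3 sub-volumes.
gluster volume create docvol replica 3 \
  server1:/bricks/brick1 server2:/bricks/brick1 server3:/bricks/brick1 \
  server4:/bricks/brick1 server5:/bricks/brick1 server6:/bricks/brick1 \
  server7:/bricks/brick1 server8:/bricks/brick1 server9:/bricks/brick1
gluster volume start docvol

# Mount the volume on any client machine over the network (FUSE mount).
mount -t glusterfs server1:/docvol /mnt/docvol
```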
Let us now go to the problems that we faced and how we solved them. I will start off with the capacity expansion problem that we solved, then Sanju will take over and talk about the data migration problem. I will then talk about how we debug performance issues and how we solved some problems using that method, and Sanju will finish with the maintenance activities that we do to prevent problems.

Before we talk about the capacity expansion problem, let us try to understand a bit about the distribution. Data is distributed across the servers based on hashes. In this diagram we have 3 distribute sub-volumes, and each sub-volume is a replica 3. When you create a directory, the copy of that directory on each of the 3 distribute sub-volumes gets a hash range, and whenever you create or read a file, GlusterFS computes the hash of the file name, figures out which of these directories owns that hash range, and stores or fetches the file on that sub-volume. For folks who are well versed with databases, this is more like sharding, but the entity being sharded here is the directory, based on the file names.

Files can have varying sizes; for example, in our setup the minimum size is less than a KB but the maximum is about 26 GB. So you run into the problem where some of the shards, i.e. distribute sub-volumes, fill up before the others, and you need to handle that as well. There is a feature in GlusterFS called min-free-disk: once a sub-volume hits that level, the next time you create the directory no hash range is allocated to the sub-volumes that have met the threshold. For example here, even though there are 3 distribute sub-volumes, data is going to only 2, because the middle one has met the threshold, so the hash range is split only between the other 2, 50% and 50%, instead of the one third each that you would normally expect.

So let's talk about the actual process of increasing capacity and why it didn't work for us. When you want to increase capacity, that is, bring in more distribute sub-volumes or shards, you first do a gluster peer probe, which brings the new machines into the cluster; then you do an add-brick operation, which adds the new bricks to your volume; then you have to run a gluster volume rebalance to redistribute the data equally among the nodes.

So what problems did we face? When we did the benchmark, the rebalance had an application latency impact of up to 25 seconds in some cases, and as I mentioned, most of our p99 latencies are just milliseconds, so this would be a partial timeout, a partial outage for us; this is not going to work for us.
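For reference, the standard expansion workflow described above maps roughly to the commands below (hypothetical volume, host and brick names); the last line only illustrates the min-free-disk threshold mentioned earlier, not a value the speakers recommend.

```sh
# 1. Bring the new machines into the trusted storage pool.
gluster peer probe server10
gluster peer probe server11
gluster peer probe server12

# 2. Add the new bricks as one more replica-3 distribute sub-volume.
gluster volume add-brick docvol replica 3 \
  server10:/bricks/brick1 server11:/bricks/brick1 server12:/bricks/brick1

# 3. Redistribute existing data across the enlarged hash space.
gluster volume rebalance docvol start
gluster volume rebalance docvol status

# The threshold that stops new hash ranges landing on nearly-full
# sub-volumes is a volume option, e.g. keep at least 10% free:
gluster volume set docvol cluster.min-free-disk 10%
```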
The other thing we noticed is that for large volumes the rebalance may take up to months, and at the moment GlusterFS rebalance does not have pause and resume, so we cannot restrict the maintenance activity to off-peak hours; that is one more problem. The third one we have seen concerns the amount of data migrated: when you go from one distribute sub-volume (shard) to two, you would expect 50 percent of the data to be transferred, and that is fine; but when you go from 9 shards, or distribute sub-volumes, to 10, you want to migrate only about 10 percent of the data, yet GlusterFS still transfers about 30 to 40 percent, irrespective of the number of sub-volumes. So with our workload the rebalance itself may take so long that by the time we want to do the next capacity expansion, the previous rebalance may not even have completed. That is also not going to work for us. These are the three main problems we have seen.

This is the solution we are using now, and then there is a proposal as well. Since we know that the hash range allocation depends both on the number of sub-volumes and on the number of sub-volumes with free space, what we do in our DocStore application is create new directories every night. The directory structure is the namespace that the clients are going to use, slash year, slash month, slash day, so each day we create new directories, and based on the space available, only the sub-volumes that have space get a hash range allocation. So you never really get into a situation where you have to rebalance that much. We have seen that with our workload reads are distributed uniformly, it is a read-heavy workload and writes are comparatively few, so we were okay with this solution in the interim.
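A minimal sketch of what such a nightly job could look like, assuming bash and GNU date; the mount point and the namespace names are made up for illustration, not taken from DocStore.

```sh
# Pre-create tomorrow's dated directory under each client-facing namespace,
# so only sub-volumes below the min-free-disk threshold receive a hash
# range for the new directory.
MOUNT=/mnt/docvol
TOMORROW=$(date -d tomorrow +%Y/%m/%d)
for ns in invoices statements kyc; do   # namespace names are illustrative
    mkdir -p "$MOUNT/$ns/$TOMORROW"
done
```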
[14:27.600 --> 18:20.240] Long term, the solution we have proposed, and this is something that is yet to be accepted, though we did some PoCs, is to use a jump consistent hash instead of the current scheme; with it, when you go from 9 to 10 sub-volumes, only about 10 percent of the data gets rebalanced. That is what we want to get to, and it is something we are focusing on this year. Alright, over to you, Sanju.

So let us look at the problems we faced while migrating data. We had a use case where we wanted to move the complete data present on one server to another server. In GlusterFS the standard way of doing this is the replace-brick operation. When you do a replace-brick, a process called the self-heal daemon copies all the data present on the old server to the new server, and to copy 10 TB of data it takes around 2 to 3 weeks. That is a huge amount of time, and we wanted to reduce it, so we came up with a new approach. Let us first understand a few aspects of GlusterFS before we jump to the solution, so that we understand our approach better.

The write flow in GlusterFS is like this: whenever a write comes in, based on the hash range allocation Pranith just spoke about, it chooses one of the sub-volumes, and the data goes to all the servers in that sub-volume. Say we have chosen replica set 0; the write goes to all the machines in that sub-volume. It is client-side replication, so the client sends the write to all the machines and waits for success responses to come back, and the client considers the write successful only when a quorum of success responses has arrived. Now suppose one of the nodes is down, in our case server 2, either because the node itself is down or because the brick process is unhealthy; it can be unresponsive at times. The write came to one of the sub-volumes and went to all three replica servers, but server 2 did not respond with a success response, while server 1 and server 3 did, so the client considers the write successful. When server 2 comes back up, to keep the data consistent it should receive the data it missed while it was down. Who takes care of this job? It is the SHD, the self-heal daemon, a daemon process which reads the pending-heal data, i.e. whatever data was missed, which we call a pending heal. It reads from one of the good copies, in our case server 1 and server 3 are good copies and server 2 is the bad copy, and writes it to server 2, so server 2 has all the data once the self-heal finishes. We use this mechanism as part of our approach as well.

Our approach is this: we kill the brick we want to migrate, say we want to migrate from server 3 to server 4. We have to copy all the data, and self-heal would take 2 to 3 weeks, so instead we kill the brick and, since we are using the ZFS file system, take a ZFS snapshot and transfer that snapshot from server 3 to server 4, that is, from the old server to the new server. Then we perform the replace-brick operation. By the time we perform the replace-brick, server 4, the new server, already has all the data that server 3 had. Once the replace-brick is performed, server 4 becomes part of the sub-volume, and the heals then take place from server 1 and server 2 to server 4.
So now we have greatly reduced the amount of data we are healing: previously we were copying all the data, around 10 TB, from server 3 to server 4, but here we heal only the data that came in between killing the brick and performing the replace-brick operation. With this approach it now takes only about 50 hours end to end: with spinning disks it takes 48 hours to transfer the 10 TB snapshot and 2 hours to heal the remaining data, but it is only 8 to 9 hours with SSDs, about 8 hours to transfer the snapshot and around 40 minutes to complete the heals. So we went from 2 to 3 weeks down to 1 or 2 days, or about 9 hours, we can say. We are using the netcat utility for the transfer, which gave us very good performance, roughly a 60% improvement, and we compute an in-flight checksum at both ends, on the old server and on the new server, to check that the snapshot has been transferred correctly and we are not losing any data. The exact commands we used are listed at this link.

We also have a rollback plan. Say we have started this activity but have not performed the replace-brick yet; once the replace-brick is performed, the sub-volume already has server 4 as part of it, but before that point, if we decide we do not want to go ahead, all we need to do is start the volume with force so that the brick process we killed comes back up. Once it is up, the self-heal daemon copies the data from the good copies to the bad copy, the old server, so we again have consistent data across all of our replica servers. It is that easy, and we want to popularize this method so that it helps the community.
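The outline below sketches the snapshot-based migration and the rollback as described; it assumes bash (for the process-substitution checksums) and uses hypothetical host, pool and brick names. The speakers' exact commands are in the link they mention, so treat this only as the shape of the procedure.

```sh
# On the old server (server3): stop the brick process for this brick first
# (e.g. find its PID via `gluster volume status docvol` and kill it),
# then snapshot the brick dataset.
zfs snapshot tank/brick1@migrate

# Stream the snapshot to the new server, checksumming in flight.
zfs send tank/brick1@migrate \
  | tee >(sha256sum > /tmp/send.sha256) \
  | nc server4 9000                 # netcat flags vary by implementation

# On the new server (server4): receive and checksum, then compare the sums.
# (Assumes the dataset does not already exist on server4.)
nc -l -p 9000 \
  | tee >(sha256sum > /tmp/recv.sha256) \
  | zfs receive tank/brick1

# Swap the old brick for the new one; heals then copy only the writes that
# arrived after the brick was killed.
gluster volume replace-brick docvol \
  server3:/bricks/brick1 server4:/bricks/brick1 commit force
gluster volume heal docvol info     # watch the pending heals drain

# Rollback (only if replace-brick has NOT been run yet): restart the killed
# brick and let the self-heal daemon catch it up from the good copies.
gluster volume start docvol force
```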
[22:01.680 --> 23:59.280] Over to Pranith. We will now talk about the performance issues we faced and how we solved them. This is a graph we saw in our prod setup while doing this migration, when something happened that we had not accounted for: latencies shot up to 1 minute, and I said earlier they are supposed to be only milliseconds, so this was horrible; there was about 2 hours of partial outage because of it. Let us see how these things can be debugged and fixed.

We have a method in GlusterFS called gluster volume profile. You start profiling on the volume, then you run your benchmark or whatever your workload is, and then you keep executing gluster volume profile info incremental, which keeps giving you stats about what is happening on the volume during that time. For each brick in the volume you get an output like this, where for that interval, in this case interval 9, you see the number of reads and writes per block size, and for every internal file operation on the volume you get the number of calls and the latency distribution: min, max and average latency, and the percentage of total latency taken by each file operation.

What we saw when this ZFS issue happened is that the lookup call was taking more than a second, which is not what we generally see, so we knew something was happening during the lookup operation. We ran strace on the brick and found that there is an internal directory, .glusterfs/indices/xattrop, where listing three entries was taking 0.35 seconds. Imagine that: you run ls, it shows you three entries, but it takes 0.35 seconds, sometimes even a second. After looking at this we found that ZFS has a behaviour where, if you create a lot of files in one directory, millions of them, then delete most of them and then do an ls, it can take up to a second. That bug has been open for more than two years, I think, and we did not know whether ZFS would fix it anytime soon, so we patched it in GlusterFS by caching this information, so that we do not have to keep doing this operation. You will not see this issue if you are using any of the latest GlusterFS releases, but this is one issue that we found and fixed.
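The profiling workflow described above looks roughly like this; the volume name is hypothetical.

```sh
# Start collecting per-brick stats (block-size histograms of reads/writes,
# plus call counts and min/max/avg latency for each file operation).
gluster volume profile docvol start

# ... run the benchmark, or let production traffic flow ...

# Each invocation prints stats for the interval since the previous call.
gluster volume profile docvol info incremental

# Stop profiling when done.
gluster volume profile docvol stop
```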
[23:59.280 --> 27:36.560] The second one is about increasing the RPS we can serve from our volume. There was a new application getting launched at the time, and the RPS it needed was more than what we were giving: they wanted something like 300 to 360 RPS, but when we did the benchmark we were getting only about 250 RPS, so we wanted to figure out what was happening. We ran benchmarks on the prod Gluster itself and saw that one of the threads was getting saturated. There is a feature in GlusterFS called client-io-threads, where multiple threads take the responsibility of sending requests over the network, so we thought we would just enable it and it would solve all our problems. We enabled it and it made things worse: from 250 the RPS went down. We realised there is a contention problem on the client side that we are yet to fix, so for now, on the DocStore containers, where we previously had only one mount, we now do three mounts and distribute the uploads and downloads across them. (To answer the question from the audience: which GlusterFS client are we using? It is the FUSE client, and the thread that was saturating is the FUSE thread.) So we created multiple mounts on the container and distribute the load in the application itself; uploads go to all three mounts and downloads go to all three as well. That is one thing we did to solve the CPU saturation problem.

The other thing we noticed, again from the gluster volume profile output, which tells you the number of reads and writes per block size, is that most of the writes were coming in as 8 KB. When we looked at the Java application, DocStore, we saw that the default I/O buffer size Java uses is 8 KB, so we simply increased it to 128 KB. These two changes combined gave us 2x to 3x the numbers, and we also increased the number of VMs we use to mount the client, so all together we got roughly a 10x performance improvement compared to before; we are probably set for the next 2 or 3 years.
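A sketch of the multiple-mount workaround, with hypothetical mount points and volume name; the 8 KB to 128 KB buffer change mentioned above lives in the Java application itself and is not shown here.

```sh
# Several FUSE mounts of the same volume inside the DocStore container, so
# requests spread across multiple fuse threads instead of saturating one.
# The application then round-robins uploads/downloads across the mounts.
for i in 0 1 2; do
    mkdir -p /mnt/docvol-$i
    mount -t glusterfs server1:/docvol /mnt/docvol-$i
done
```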
[27:36.560 --> 31:06.440] Alright, let us now move on to health checks. For any production cluster some health checks are needed, so I will talk about the minimal health checks needed for a GlusterFS cluster. GlusterFS already provides POSIX health checks: a health-checker thread does a 1 KB write every 30 seconds by default, and there is an option to set the interval you want; setting it to 0 disables the health check, or you can set it to something like 10 seconds. It sends a write and checks whether the disk is responsive enough and the brick is healthy; if it does not get a response within a particular time, it kills the brick process, so we get to know that something is wrong with that brick.

The rest of the checks are a script and some config that we keep externally; the POSIX health checks are the ones that come with the GlusterFS project. Our cluster health checks work like this: we have a config that specifies the expected number of nodes in the cluster, and using the gluster peer status or gluster pool list command we check the number of nodes actually present and whether the two are equal; if not, we raise an alert saying something unexpected is happening. We also check whether each node is in the connected state: in a GlusterFS cluster a node can be in different states, connected, rejected or disconnected, depending on how the GlusterFS management daemon is behaving. The expectation is that all nodes are connected, so we check that, and if a node is not connected we get an alert saying one of the nodes is not in the connected state.

We have some health checks for the bricks as well. The config has the number of bricks expected in each volume, the gluster volume info output tells you how many bricks are actually present in that volume, and we check that they match. Another check on the bricks: if a brick is not online, we find out from the gluster volume status command and raise an alert saying one of the bricks is down. Whenever a server or a brick is down there will be some pending heals; you can check them using the gluster volume heal info command, and if there are pending heals you will see entries there. If the entry count is non-zero, we raise an alert saying there are pending heals in the cluster, which means something unwanted is going on, such as a brick or a node being down. And as part of the health check we always log gluster volume profile info incremental to our debug logs, because, as Pranith just showed, some issues can be solved by looking at the profile info output; in such cases this output is very helpful, so we always ship it to our log backup servers.
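A minimal sketch of external checks in this spirit, not PhonePe's actual script; the volume name, expected node count, log path and output parsing are assumptions.

```sh
#!/bin/bash
# Alert-style checks run from cron against a GlusterFS cluster.
EXPECTED_NODES=30      # would normally come from a config file
VOL=docvol

# 1. All expected peers present and in Connected state?
connected=$(gluster pool list | grep -cw Connected)
[ "$connected" -ne "$EXPECTED_NODES" ] && \
    echo "ALERT: only $connected/$EXPECTED_NODES peers connected"

# 2. All bricks online?
gluster volume status "$VOL" detail | grep -q "Online.*: N" && \
    echo "ALERT: a brick of $VOL is offline"

# 3. Any pending heals?
pending=$(gluster volume heal "$VOL" info |
          awk '/Number of entries:/ {sum += $NF} END {print sum + 0}')
[ "$pending" -gt 0 ] && echo "ALERT: $pending entries pending heal on $VOL"

# 4. Keep profile output around for later debugging.
gluster volume profile "$VOL" info incremental >> /var/log/gluster-profile.log
```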
[31:06.440 --> 33:33.880] The exact commands we are using are listed at this link.

We also have some maintenance activities, because things can go bad sometimes. We have a replica 3 setup in production, and at any point in time a quorum of brick processes should be up so that reads and writes can go on smoothly. So whenever we are doing something that might take a brick process down for some time, or that can put load on a particular server, we do it on only one server from each replica set. That way, even if that server goes down, or the brick process running on it goes down, we won't have a problem, because there are two other replica servers which can serve all the reads and writes. We do a few activities this way. One is ZFS scrubbing: scrubbing verifies the checksums of the data and tells you whether the data is in a proper condition. We also do migrations this way, on one server from each replica set at a time, so that even if it is down for a while or something doesn't work out, we are still in a good place, and we do upgrades in the same manner.
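A sketch of the staggered ZFS scrub, assuming one node per replica set is scrubbed at a time; host and pool names are hypothetical.

```sh
# Scrub the ZFS pools on only one server of each replica set, so the other
# two replicas keep serving reads and writes while checksums are verified.
for host in server1 server4 server7; do   # one node per replica set
    ssh "$host" zpool scrub tank
    ssh "$host" zpool status tank         # check scrub progress/result
done
```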
[33:33.880 --> 35:58.640] We have made some contributions. The data migration approach I spoke about is production ready; we have used it in our production. Pranith has given some developer sessions covering many of the internals of GlusterFS, which are very useful for any GlusterFS developer who wants to learn about the many translators in GlusterFS. Recently we fixed one of the single points of failure in the geo-replication feature, which was merged upstream very recently, just last week. And this year we are looking at the hashing strategy Pranith has proposed; once it is accepted by the community we will take it up and develop it. That's all we had, folks, thank you.

Just to let you know, about it being production ready: we actually migrated 375 TB in total using the method Sanju talked about, so it is ready and you can use it. It should work even with Btrfs, basically with any file system that has a snapshot feature. Thank you. I think we have a few minutes for questions if you have any, otherwise you can catch us outside.

(Question: how do you handle a disk failure?) The problem I showed you, the ZFS issue where latencies went up to minutes, was the first time that happened in production for us. Initially we were waiting for the machine itself to be fixed so it would come back, and that went on for a week or so; the amount of data that needed to be healed became too large, and the healing coincided with our peak hours. The standard operating procedure we came up with after this issue is: if a machine or a disk goes down, we can bring a replacement online in about 9 hours, so why wait? We consider that node dead, we get a new machine, we do what Sanju described using the ZFS snapshot migration, and we bring it up.

(Question: do you have a ZFS backup somewhere?) The answer is no; the ZFS data is on the active bricks, so you take a snapshot on one of the good active bricks and do the snapshot transfer. Any other questions? I think that's it. Thank you, guys, thanks a lot. Thank you.