Hello. Thank you for coming to my talk. It's not a TED talk, it's just my talk: continuous delivery to many Kubernetes clusters. My name is Carlos Sanchez and I'm here to talk to you about our live, real-world experience. I'm not here to sell you anything, so if I have time I'll also try to tell you about some of the mistakes we made. It's not all beautiful and wonderful.

I'm a principal scientist at Adobe Experience Manager Cloud Service; I'll talk a little bit about the product. On the open source side, I started the Jenkins Kubernetes plugin. Anybody heard about Jenkins? Yes, some people probably, yeah. Okay. And Kubernetes, anybody heard about Kubernetes? Yeah? Okay. Anybody using Kubernetes in production? I'm a long-time contributor to open source, to multiple projects in Jenkins, the Apache Foundation, and so on.

A quick intro to what Adobe Experience Manager is, because every time I say Adobe, people say Photoshop, PDF, Flash. It's not any of those. It's a content management system that you have probably never heard of, but it's powering 80% of the Fortune 100 and it's very, very enterprise. I'm not expecting people to know it, but it is widely used, and it's based on a lot of open source. It's a distributed OSGi application that was started many years ago; it uses a lot of open source components from the Apache Foundation, and we contribute back to those components, like Apache Felix, Apache Sling, and a few other things around content management. It also has a huge market of extension developers, people writing their own Java code that then runs on Adobe Experience Manager, AEM.

When I joined Adobe, the goal was: let's move this into a cloud service. That means running AEM on Kubernetes. We are currently running on Azure, with 35 clusters and growing very quickly. Because this is a content management system, we run it in multiple regions, right now 11: several in the US, Europe, Australia, Singapore, Japan, and so on, because people want low latency between their users and the content.

Another interesting fact is that we don't run the Kubernetes clusters directly. We build on top of them, and a different team at Adobe manages Kubernetes for us. One particularity is that customers can run their own code: we are running the service for them, and we take their code and run it inside our processes. So we have to limit cluster permissions for security, and we have several security concerns because this is a very multi-tenant setup.
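To give a flavour of what limiting permissions looks like in practice, here is an illustrative, namespace-scoped RBAC Role; the role name, namespace, resources, and verbs are hypothetical and not Adobe's actual policy:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: env-deployer              # hypothetical role for a single tenant namespace
  namespace: customer-a-dev       # hypothetical tenant namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
# No cluster-scoped permissions and no access to other namespaces.
```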
Each customer can have multiple environments, multiple copies, and they can self-service: they can deploy new environments whenever they want, update them, and do a few other things. So it's not just us controlling what is running, it's also the customers. Each customer can have three or more Kubernetes namespaces where these environments run. I like to call this a micro-monolith: we don't run one big service that spans thousands of instances, we run slightly different versions of the same service over a thousand, ten thousand times. So micro-monolith describes it very well. We use Kubernetes namespaces to provide the scoping for network isolation, quotas, permissions, and so on.

Internally we have multiple teams building services. Different services have different requirements, people can use different languages, and we follow a "you build it, you run it" philosophy. Each service is basically exposed as an API, or it follows the Kubernetes operator pattern.

To split the monolith we also use a lot of init containers and sidecars. In Kubernetes you can run multiple containers in the same pod, so the main application runs in one container and then we have many sidecars that handle different concerns. It's an easy way to separate concerns without having to rewrite your whole architecture into a fully network-based, microservice-oriented one.
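As a minimal sketch of that pattern, this is roughly what such a pod definition looks like; the container names and images are placeholders, not the actual AEM deployment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aem-example             # hypothetical name, for illustration only
spec:
  initContainers:
    - name: content-init        # runs once before the app, e.g. to prepare content
      image: registry.example.com/content-init:1.0
      volumeMounts:
        - name: repo
          mountPath: /data
  containers:
    - name: app                 # the main application container
      image: registry.example.com/aem:1.0
      volumeMounts:
        - name: repo
          mountPath: /data
    - name: log-forwarder       # a sidecar owning a separate concern
      image: registry.example.com/log-forwarder:2.3
  volumes:
    - name: repo
      emptyDir: {}
```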
On the continuous delivery side, which is probably what you are interested in here, we are moving from general releases to pushing changes multiple times a day. Not just the application, which may move more slowly, but the operational side: all the services, operators, microservices. Any of them, at any point in time, on any day, can receive changes.

We use Jenkins for CI/CD in some places. We have Tekton, which you heard about in one of the earlier talks; it's another open source project for running workflows on Kubernetes, and we use it to orchestrate some pipelines. We also started using Argo CD for some new microservices. We follow a GitOps process, where most of the configuration is stored in Git and reconciled on each commit, and we use a pull rather than push model to scale. I'll go through this in a bit.

We have a combination of things being deployed to the clusters. There is the AEM application, which is deployed with a Helm chart. Then there are the operational services: the operators, services, and everything else that is not the application. These are deployed as Kubernetes manifests, but templatized. And we are also using Kustomize and Argo CD for some new microservices.

On the Helm side we use the Helm operator. In each namespace we use the Helm operator's CRD to do a more state-based installation: we create the custom resource, and the Helm operator installs the application based on the parameters in it. A word of advice: don't mix application and infrastructure configuration in the same package, because you cannot enforce the same Helm chart for all tenants. As I mentioned before, customers decide when to update, so some customers are on older releases and some on newer ones. This is something we want to change, but in the meantime, if we want to update a specific version of something in an old release, that's hard when it's already packaged in the Helm chart.

So we built a solution for this: from the platform level we can manipulate the Helm release with overrides, which is easy to do when you have the Helm operator. Whenever there's a request to install a Helm chart, we change the parameters. We can change Helm values, which is easy: instead of passing some values, you pass different ones. Or we can use Kustomize patches, which the Helm operator also supports. Kustomize patches are very interesting because they let you patch any Kubernetes resource, even if no Helm value was ever defined for it. So if we want to change a sidecar container image version across the whole fleet, we just change the patch, and it gets applied to all clusters and all namespaces; every Helm chart that was installed gets reinstalled with the version we want. So we use this combination: the Helm chart on one side, and the operational values and patches on the other.
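The talk doesn't name the exact Helm operator, so purely as an illustration, here is how both mechanisms look with the Flux helm-controller's HelmRelease resource, which supports value overrides and Kustomize patches as post-renderers; the chart, names, and image are hypothetical:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: aem-env                     # hypothetical release, one per environment
  namespace: customer-a-dev
spec:
  interval: 10m
  chart:
    spec:
      chart: aem                    # hypothetical chart name
      version: "1.0.0"
      sourceRef:
        kind: HelmRepository
        name: internal-charts
        namespace: flux-system
  values:                           # platform-level override of Helm values
    replicaCount: 2
  postRenderers:                    # Kustomize patch: no Helm value needed for this field
    - kustomize:
        patches:
          - target:
              kind: Deployment
              name: aem-publish
            patch: |
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: aem-publish
              spec:
                template:
                  spec:
                    containers:
                      - name: log-forwarder
                        image: registry.example.com/log-forwarder:2.4  # fleet-wide sidecar bump
```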
Very important for us was the shift-left mentality: detecting problems as soon as possible, not waiting for developers to push things to production, because the cost increases the later you find them. So we run checks as early as we can, on pull requests, while the change is still fresh in your memory; when something is broken, you want to catch it as soon as possible. We do this by generating all the templates and then running multiple tests on them.

The most basic check you can run is a kubectl apply dry run. It will tell you if the manifest is wrong in some very obvious way: whether it's valid or not.

Kubeconform is a tool that validates manifests against the Kubernetes schemas; it's the successor of Kubeval. Anybody heard about Kubeval or Kubeconform? Okay. It's very useful if you have custom CRDs, or just to catch the typical problems: you get the YAML indentation wrong, it's not valid anymore, and you catch it on the PR. You just run it and it tells you this property is missing, or this property is in the wrong place. Because everybody loves YAML, right?

Conftest is another tool, for Open Policy Agent. Anybody familiar with Open Policy Agent, OPA? OPA lets you write policies that can check pretty much anything in any structured file. In the case of Kubernetes you could say: don't run the pod as root; make sure you don't mount secrets as environment variables or as files; enforce that all the pods have certain labels; don't pull from Docker Hub, pull from the internal registry. Any rule you can think of, you can write with Conftest and OPA policies. The only problem is that it uses the Rego language, which, if you haven't heard of it, is quite painful to work with, but it works great once you figure it out.

We added another tool called Pluto. Pluto is just a CLI that tells you which API versions have been deprecated or removed. So if you are thinking about upgrading Kubernetes, you run Pluto and it tells you: this API is deprecated, it's going to be removed in this version, and so on. And you can enforce that.
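Pulled together, a pull request check along these lines might look like the following; the directory layout and policy paths are assumptions for the sake of the example:

```sh
#!/usr/bin/env bash
set -euo pipefail

# Assumes the templated definitions have already been rendered into rendered/.

# 1. Most basic check: a server-side dry run against a cluster.
kubectl apply --dry-run=server -f rendered/

# 2. Validate against the Kubernetes (and CRD) schemas.
kubeconform -strict -summary rendered/*.yaml

# 3. Enforce OPA policies written in Rego (no root pods, required labels, no Docker Hub images...).
conftest test --policy policy/ rendered/

# 4. Flag deprecated or removed API versions ahead of a Kubernetes upgrade.
pluto detect-files -d rendered/
```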
We also built a tool that we call Git init, which is our own version of a GitOps pull. We have the Kubernetes definitions stored in Git, and we deploy them to blob stores across regions so they can be pulled from each cluster. Git init is a deployment that runs continuously in each namespace, and we have around 10,000 namespaces in our fleet; it basically pulls the blob, applies the changes, and does this every so often. An example of why we do pull rather than push: we have a job that pushes to all the clusters, it runs in parallel with something like 20 threads, and it still takes around five hours to run. So we cannot push changes whenever we want.

On Argo CD, we have an internal platform for Argo CD-based microservices: it creates a new Git repo with some templates, and that gets deployed with Argo CD to the cluster. We are thinking about moving this way, where each team has their own Git repo, because right now we have mostly centralized operators and everything in one place. This is good for the "you go in your own direction, you build it, you run it" model. On the other hand it's a bit tricky, because when we decide or figure out that something is problematic, we cannot just look centrally at one Git repo, see who is doing it, and change it. But we are moving in that direction.

Let me skip ahead and talk a bit about progressive delivery. Progressive delivery is a name for things you've probably heard of: canary rollouts, percentage-based rollouts, feature flags, blue-green. Basically, don't update everybody at the same time, because you can break everybody.

We can do rollouts to different customer groups in separate waves, and we can also do rollouts to a percentage of customers. By default we have a time-based rollout that goes from dev to stage to prod candidate after a period of time. This runs on Jenkins and ensures things have been running on dev and stage before we merge them to prod. That part is very basic.

What we built on top is feature flags at the namespace level, across those 10,000 namespaces, using the templated Kubernetes definitions. For each change, developers can decide: I want to roll this out to a given environment (dev, stage, or prod), or to a specific cluster, or by type of namespace, or to a percentage. And this is just templating on Kubernetes objects. An example: a Kubernetes definition with a template that has a foo version and a bar version, or a sidecar container that can be enabled or disabled, and at the bottom you can see the rules. By default we want the foo version to be 1.0, but for all the namespaces in the dev environment we want 1.1. This lets us roll out changes quickly, but progressively. We can also do it by percentage: for example, all namespaces in dev and stage get foo version 1.1 with the feature enabled, but in prod only 5% do. So I roll out a change to 5% of prod, and then I can continue from there. This has proven really useful for developers to test things safely; it increases development speed, PRs move much faster, so it's all great.
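The actual in-house template format isn't shown in this transcript, so as a purely hypothetical sketch, such a definition with rollout rules could look something like this (all field names and values are invented for illustration):

```yaml
# Hypothetical templated definition with rollout rules; not Adobe's real format.
template:
  fooVersion: "1.0"           # fleet-wide default
  barVersion: "2.0"
  sidecarEnabled: false
rules:
  - match:
      environment: dev        # every dev namespace gets the new values
    set:
      fooVersion: "1.1"
      sidecarEnabled: true
  - match:
      environment: stage
    set:
      fooVersion: "1.1"
      sidecarEnabled: true
  - match:
      environment: prod
    percentage: 5             # only 5% of prod namespaces, then expand over time
    set:
      fooVersion: "1.1"
      sidecarEnabled: true
```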
We are also working on adopting Argo Rollouts at the deployment level. Argo Rollouts lets you do blue-green and canary rollouts, where you progress the number of pods over a period of time: instead of changing, say, 10 pods at the same time, you go one by one. And if you have a service mesh you can get even more fine-grained and say: I want 5% of the traffic to go to the new version and everything else to the old version, then keep progressing that, and do automatic rollbacks. With a service mesh you can fine-tune the traffic percentages; with plain Kubernetes services you can still do it, you're just limited by the number of pods.

To sum up: shift left and add guardrails, keeping people safe in what they are doing. This increases development speed and reduces the issues you will have in production. You are never going to prevent all issues in production; what you can control is how many customers are affected and how fast you can fix them. For us, the progressive delivery techniques, canaries, percentage rollouts, automated rollbacks, were very useful, and the automation to do controlled, progressive rollouts pays off over time.

I think we have one minute for questions, or you can find me afterwards. Thank you.