Thank you for being here.
I'm going to talk about progressive delivery and hopefully by the end of this talk you're going to know how to easily do canary deployments on Kubernetes.
Who is using Kubernetes today?
Raise your hand, please. Everybody.
I'm not asking if everybody knows what Kubernetes is because you're in the wrong place.
I'm a principal scientist on the Adobe Experience Manager Cloud Service. This is a content management system.
I'm a long time open source contributor to Maven, Jenkins, Puppet, a few other things.
I'm also part of the Google Developer Experts Program.
But probably most of you know me because of what I did with Jenkins on Kubernetes.
Some people will love it. Some people will hate me. We'll talk about that later.
Actually, I just, before this talk, 15 minutes before, I realized, oh, this is 10 years ago.
Time flies when people didn't know what Kubernetes was.
So, what is progressive delivery?
This also came, this was August 2018.
This is when the term was coined at the LaunchDarly blog.
And also was picked up by Red Monk.
And I said, this is a great name for these things that everybody knows about.
But the name kind of sums up very well what we're trying to do.
So, I said, I'm going to steal this. That's the gist of it.
So, it includes deployment strategies that avoid this.
I'm going to push this to all my nodes, all my containers, all my files, whatever it is that you're running.
I'm going to push this new version to all of them.
And if it breaks, it breaks for everybody, we want to avoid that.
So, you, with progressive delivery, you have new versions that do not replace the system versions.
And you have both old and new versions running in parallel for an amount of time.
But the interesting part is that this is happening in production.
And you can evaluate both old version, new version during a period of time that you figure out what was the best time for you.
And before saying that this is a successful thing that I need to roll to everybody in all my customers.
So, continuous delivery is hard. I used to say, like, progressive delivery makes it continuous delivery easier to adopt.
Because it reduces a lot of the risk associated with continuous delivery.
Yeah, it's great that you commit something to main and it gets pushed to everybody.
But what if that's breaking in production? Then you have these methods behind progressive delivery that will prevent you from breaking things.
And give you these guardrails that will protect your users.
The key points, avoiding downtime, limit the blast radius.
You deploy something, it only affects a subset of your users, not all of them.
And also shorten the time from your idea to production.
So, from the time you create a commit until you push it to production, you can use these techniques.
So, you can shorten really as much as possible that time.
And it's not affecting your life customers, it could affect your maybe internal customers, employees, something like that.
So, you can confidently push things to production.
The name is great, but all the techniques already existed for a long time.
We have rolling updates on Kubernetes. This is the standard way when you change something on your deployments.
You just get a new pod with a new version. When that pod comes up, the old pods start going away.
And you can configure that easily on Kubernetes.
You can configure how many pods you want to come up, if you want them to come up a little by a little.
If you want all of them to come up at once, and they will start rolling.
So, Kubernetes has been around.
Blue-green deployments, same thing. It's been around forever.
Well, defined for some definition of ever.
And you have green, what you consider the old version, which is green, the new version, which is blue, or the other one, we're on, I don't know.
And you have both running at the same time.
You evaluate or you start sending traffic to the new version.
And if something happens, you just have to flip the switch to put back to the old version.
So, this is a variation.
The difference is, in a couple of days, you don't need to have all the machines running at the same time, all the containers.
With blue-green, you need to have room for both versions running at the same time.
Canary deployment is one of the most interesting ones, where you send a percentage of the traffic or a percentage of your users to the new version,
and a percentage, a small percentage to the new version, and you keep growing.
I mean, you could just do a small percentage, or you could keep growing that canary percentage.
A lot of companies do this.
First, this gets deployed to internal employees only, then some countries, some like New Zealand, or a percentage of users, depending on some characteristics of them.
And they keep growing this canary pool over time until you reach 100%.
Feature Flags, another interesting one, where it allows you to push things to production behind Feature Flags, so you can test them in production.
And also, disable them after you deploy them.
You push something, you realize, either it breaks for a lot of users, or it breaks for a percentage of users, you can switch that feature off using some tools,
or using something as simple as environment variables.
But yeah, there's tools that allow you to manage Feature Flags, so you don't have to deal with environment variables, things like that.
Monitoring is the new testing, so you know the goal is to know when the users are experiencing issues in production,
and the other characteristic, I think, is they react to the issues automatically.
So if you deploy something that is bad, how you can automatically roll it back before some human has to go and figure out what happened.
So did you know that 90% of the outages could be solved?
There's a study that said 90% of the outages could be solved with progressive delivery.
Did you know that? No? Because I just made that up.
And one thing they need is, yeah, some requirement is having a good amount of metrics,
or you need to know what's happening in your production system before you can react,
knowing what users are seeing the new version, what users are breaking with the new version, what's happening here.
So you need to have this visibility.
And I always love to plug Devos Barat, which disappeared from the Twitter server,
to make your resumant to prepogator or in-law server in automatic way, that's what DevOps is.
Raise your hand if you have broken a lot of servers by doing this automatically.
So yeah, what I love to say is if you're breaking something automatically,
is that you haven't automated enough.
I think that's the...
When you get there, it's like, okay, maybe I should step back a little bit.
Until you get there, you keep automating things.
Now, more to the practical side, how can I do this in Kubernetes?
Introduction, who's familiar with Ingress?
Ingress in Kubernetes, okay, yes.
Then, 10 years ago, this was not like this.
So on Kubernetes, you have the load balancer, and you can have services,
and from the load balancer, Kubernetes will send you to one service or another.
So your load balancer would send traffic to one service or another.
But this was kind of the old way.
The new way is you have Ingress controllers on Kubernetes,
where the Ingress controller is running on Kubernetes.
I typically domain names, but you could also do headers.
For each of these traffic, it's easy.
You can do headers, you can do all sorts of things.
And that Ingress is sending the traffic to whatever service you're running.
So you can have one service A, one service B, and with their pods.
And the Ingress is the one that's, okay, you configure this domain to go to the service,
you configure this domain to go to this other server.
And there's a lot of Ingress controllers out there.
If you run on a cloud provider, you're going to have the AWS, the GC, whatever.
And then you can have your own NGINX, ambassador, STO, traffic.
That's a lot of it.
And ARGO rollouts, anybody using ARGO?
Wow, okay.
What are you doing here?
I mean, we already know this.
So provides advanced deployment capabilities.
All the things that I mentioned, blue, green, canary,
category analysis, experimentation, there are variations over the same thing.
ARGO rollouts provides that to you and makes it very easy to do it.
And the good thing is you don't need to use ARGO CD, for it to use ARGO rollouts.
You don't actually need to use anything else to use ARGO rollouts.
You can run ARGO rollouts just with Kubernetes, nothing else.
You don't need external dependencies.
And, yeah, it allows you to do this very easily.
I'm a bit on the architecture of ARGO rollouts.
So we have the controller that is watching a new object called a rollout.
So ARGO rollouts has this object that can replace or complement your existing deployments.
And I think I'll go down there in a bit.
So this rollout manages the replica sets.
So these replica sets, typically, you would have your deployments with the replica sets,
and now they become part of the rollout.
And it has the concept of analysis run that will check metrics or any other external source
that this analysis will decide is this rollout successful or not.
And based on that, it's going to cancel the rollout or keep it going.
So you get the traffic coming from the ingress into your services,
and you can tell ARGO, OK, send me, send traffic to this new canary replica set
or send it to the old one.
The percentage base one, for that, you need a service match.
So if you need to do something fancy like, oh, I want to send 1% of the traffic
or I want some traffic that matches this header,
then you need something like a service match
or the integration with the ARGO rollouts integration with the ingress controller.
But if you use bare Kubernetes without integration between rollouts, you can still do it.
Basically, it will use the number of pods in the replica sets.
So if you have 10 pods, you can tell ARGO, OK, one new pod is going to go to the new version,
and now you have a 10%, 90% sort of split, more or less.
You cannot do things fancy that require support from the ingress controller or a service match,
but you can still do things.
The rollout object, you have two ways of defining the rollout.
One is you replace the deployment with the rollout and add extra fields,
or you create a rollout that points to a deployment.
I don't like a lot the way of replacing the deployment
because then people that are not aware of the rollouts objects,
they may go and see, oh, there's no deployments.
What's going on here?
So it requires you changing things.
And for us, it requires also you have to change rambles,
you have to change commands that people need to secure documentation and all that.
So I don't know why the decision was made that way,
but it's not something that I'm too happy about it.
Of course, you require all the Jamel tools to write these things.
And let's go to the demo now.
So I have here...
So I'm running the Argo Rollouts demo.
This is hitting the backend and it's returning one color or another,
depending on what is running on the backend.
So right now I have the blue one.
Let me see how can I do this easier.
So what I'm going to do is to change,
update my deployment to use a new image that is going to be green.
And ring.
I lost the terminal.
Okay, so it updated the image and let me fit here in big
To show what this is doing.
Okay, I think I pushed twice and now I see two
Rollouts happening at the same time.
Otherwise it's not working.
Here it is.
Okay, so I have the green.
The one that shows green is the one that was running and it's stable.
So I think I have five pots running and I push a new change,
which is the canary.
And this should be using the green.
Okay, there it is.
So like 20% of the traffic is getting green.
Right?
And how I define this rollout, this one is at the bottom is
just the standard deployment configuration.
So what image do I want?
What ports do I want to expose and so on.
But at the top I have the strategy configuration from
Marco rollouts.
So I can say point to this analysis template.
This is what defines what is successful, what is not successful.
And I'll show you that in a bit.
And it's, I have several steps.
So set weight 20 and then do a pause, set weight 40,
pause for 10 seconds, set weight 60.
So this is percentage.
Pause for 10 seconds, set weight 80.
So this is my definition of a rollout.
20% wait for me to manually do something.
I only do that for demos in real life.
That's a bit harder to do as you could still do it.
But this is my definition of what the rollout is.
So right now it's waiting because I set a pause and it's waiting
for me to give you a key.
I look at it, it says it looks okay.
So I can do the promote.
And this is going to continue through the rest of the steps.
So hopefully we'll see this in like 60 seconds.
It should continue the progression until everybody receives a green.
A green color when they call the API.
So this shows you just by creating a rollout object with this
small section defining what your rollout is, you can do this.
There's nothing else you need.
Well, you need to install Argo.
And what else can you do?
Oh, yes, you can also have a preview version.
So you can have another ingress pointing to your preview version.
So you can even if I said I want zero traffic to go to the new version.
All the existing traffic I wanted to go to the old version,
but I want to see the new version in a new place.
I can do that too.
So that's very useful for preview environments sort of thing.
So if I go back, okay.
So while this continues running, this is running on Google Kubernetes
and sending autopilot clusters, but you can run it in any Kubernetes.
And the autopilot is pretty cool because you only pay for what you use.
So if you scale things to zero, then you don't pay anything.
What does it says here?
Okay.
So now green is the stable one.
It says here, stable here.
What if I want to do...
I was talking about how does this protect me, right?
What if I want to do a rollout that is broken?
So...
Let's see.
This works.
Right.
Okay.
So now I push an image that is bad.
So I'm changing the deployment.
Of course, you would do this with the GitOps.
You would never push the production, but YOLO.
And so I'm pushing the red image, but this red image is returning in 500 errors.
And now Argo realized, oh, this is giving errors based on my analysis template that I'll show you.
And this is in the graded status.
And it went down and the scale it down, and my canary was set as failed.
And you see that only a few percentage of traffic got the red dots,
and then it was automatically rolled back.
So I think this is the power of doing progressive delivery.
Of course, this is very easy if your application is exploding.
It's very easy to see.
It's like, what if people ask me, oh, can we do this if a button doesn't work?
Can we do this?
Well, it depends what button.
If it's the button that adds, imagine you're in Amazon, you break the button that adds things to the cart,
and you get a metric that says nobody's adding things to the cart,
maybe you're like, oh, something is really bad.
Right.
So let me show you the analysis template.
Is this one?
Yeah.
In my case, my analysis template is a very complicated call that fails if this fails,
if this doesn't return a 200.
But again, you can integrate this with whatever you want, metrics.
Argo rollouts also gives you a nice dashboard.
If you are not into the command line, you can come here.
And here.
So where I can see the status of my rollout, what is strategy.
As I said, Argo rollout supports multiple strategies on some of the more complex drivers.
I can see my steps that I showed you before in the Jamel, 20, 40, 50, 80.
And I can say what was the last image that I pushed,
and I could click here and do the clickity clock instead of doing Jamel.
Okay.
So, yeah, what I mentioned before was if you're using service mesh,
like Istio, then it integrates with a bunch of service mesh
ingress providers.
So you could go and say, I want 1% of traffic because Istio supports
doing those things instead of saying more, because when you are using
only pods, you don't have anyone here.
Pod is going to receive the traffic or not.
So it's more of an approximation.
But with Istio and other advanced things, you can do more complex.
We hook it up with Prometheus, also the support for multiple things to get metrics from.
And, yeah, hopefully you'll learn how to do a progressive delivery
canary deployment very easily.
Just you need to do some Jamel here and there.
Let me see.
On here, this one.
So you can have the other labels to the existing version,
to the stable version as labels to the new version.
So you can do other things with services on Kubernetes.
You can pass what analysis you want to run and you set what steps to run.
And everything else is just the template.
And if in the deployment template.
If you don't want to put the deployment template in the rollout object,
you just point here, there's another option that says points to existing deployment.
The only problem with that is that rollouts is not when you're migrating,
rollouts is not going to scale down the deployment.
A colleague of mine, she submitted a PR to Argus,
which is going to be in the next version.
So it will automatically, if you have like thousands of deployments,
when you spin out a rollout with a deployment,
I pointed to a deployment, when the rollout is successful,
it's going to scale down the deployment.
So that's how it will actually exist.
Okay, so, yeah, and what's that thing?
I lost my...
Did I close it?
Yeah.
Okay, so, just a quick summary, you saw everything?
And I hope that this helped you and you can try it and do it at home if you like it.
And I have time for asking me two questions.
Two questions.
No questions. One question.
I was wondering if you've been testing using the gateway API and some fingers in the waiting?
So the question is if I tested using the gateway API instead of fingers,
no, I have not been using the API yet,
but I'm guessing that if there's no support already, there will be.
Because...
We did not.
Yeah.
Hello. So my question is that for, in case of buggy rollout,
the particular traffic which is forward to the buggy instances,
is it possible to automatically replicate it and send it to the stable versions after the fail?
To ensure that even the traffic which hits the buggy rollout instances
is served later by stable versions?
So if it's...
It's possible to run it back automatically, but also...
Yeah, the individual traffic, individual request.
So you don't want any user to see the spot?
Yes.
The other thing you could do, if you use a service mess,
probably is send a clone the traffic and send a clone to the new version,
but the actual traffic is going to the old version.
And you could see if the new version is breaking or not.
But also that's tricky because you need to make sure that it's not changing your state.
If you are getting gets, it's fine if you are changing status.
That's my point. Don't do the duplication in advance because it will go to the parallel execution,
but do it only when the first execution failed because it's go to the canary instance.
Yeah, I think you can do that.
Send traffic to the new version, but it's a copy of the traffic that is not seen by any user.
And then at some point you could say, okay, this is good. I'm promoting this.
I think it's doable.
Yeah, thank you.
Okay.
Thank you.
Thank you.