So, hello everyone. Today we have prepared a presentation about chaos engineering in action. I am Marj Orsak and this is Henrik Srenching; we both work as software engineers at Red Hat. We have also prepared a quiz. You can see the QR code: scan it with your phone, and if you are quick enough and get the correct answers, you can win a prize. So, over to you, Henrik. Yeah, so the content of the presentation is as follows. We will begin with a brief explanation of chaos engineering. Then we will describe what the target systems may actually look like in production. Then we will turn our focus to designing the chaos. Afterwards there will be two brief demonstrations, and then a quick conclusion on how to actually work with chaos. When we think about system or application resilience, we have to think about all the components our application depends upon, meaning other components and other services. There is also a big dependency on the network and the infrastructure. All of these things are most visible in distributed systems. There are many well-known fallacies of distributed systems, mostly concerned with the network and bandwidth. When we then look at a system from the viewpoint of the many instances and services which have to communicate with each other for the system to work well, we come to the problem of complexity and the fact that there is possibly no single person who understands the system completely and every state the system can get into. So what can happen, and will probably inevitably happen, in a system of such magnitude is that one or more instances will crash. This is the story of Chaos Monkey, which I guess some of you may be familiar with; all we need to know for now is that it was one of the first chaos tools, it randomly killed instances in production, and it forced engineers to take proactive action to make the system more resilient. We can take this a step further and bring down not just a few instances but an availability zone or a cluster, or break some kind of network traffic, and get the system into a state we are not so comfortable with in a production environment. So we get to the definition of chaos engineering: it is experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. This may sound weird, because why would anyone want to bring chaos into production; isn't that something we should actually avoid? The real reason for doing so is the time difference: it is much easier to solve problems at 4 p.m. than at 4 a.m., when you are under high pressure from customers to fix them. There are many principles which we have to abide by, or should abide by, in chaos engineering. The first and most important one is a minimal blast radius for each experiment you conduct. We should imagine a red button for each experiment which can stop it in case anything goes wrong. The other principles are mostly focused on testing things the way they happen in real life: we want to focus on how the system actually behaves in production, we want to make sure it works correctly, and we want to introduce the problems that may happen in the real world. The last principle is continuous runs, which is basically about running these experiments as often and as effortlessly as possible.
Now over to the target system. This all started with the monolith architecture, where you get one box: one backend, one database and one UI. In terms of complexity it was quite low. You simply get some user connections and the server is not particularly overwhelmed. Then after some time you add more and more customers, let's say four or five thousand, the load gets pretty high and the server may simply crash. Such an architecture is really hard to scale horizontally; one way to tackle this problem is to scale vertically, but you cannot keep scaling vertically forever. The second point is that the fault tolerance of such an architecture is really bad. You just target that one node, the server immediately crashes, and the users will be really sad because they don't get any response. Then Docker came along with the microservice architecture, where all of this improved: we got portability and isolation, and we somehow got better horizontal scaling, but when you have, say, thousands of instances, it is quite hard to manage all of those containers. On the other hand, the complexity also increased, because of the network traffic and more. And so Kubernetes came to solve scalability in the horizontal sense: in Kubernetes, if you want to have one replica of the system, you just type it into the YAML file, apply it, and Kubernetes will do it. Then if you see your server crashing or being overloaded with requests, you simply set it to three and Kubernetes will do it (a rough sketch of such a YAML file follows at the end of this part). The same goes for fault tolerance: if you, I don't know, inject some disruptions or something else into the pods, one will still be up if you only target two of them. But still, complexity increased again. And so we are in the operator stage, where no one can entirely grasp the system in terms of its behavior. I want to present one such operator, which is Strimzi. Strimzi is basically Apache Kafka at its core, encapsulated in the Kubernetes system. On top of that you get some operators which simplify things like upgrades and dynamic configuration. There is tracing, more security involved, and also Grafana dashboards. And it is part of the Cloud Native Computing Foundation. But that's quite a lot of unknowns, right? So let's break this down. Apache Kafka has a lot of buzzwords, as you can see: publish-subscribe model, it is a messaging system, and so on, but this still doesn't help, right? So let's move to the basics of Kafka. We have some producers, not these ones, but some clients, right? These clients send messages to the broker. They are happy because the connection is up. We could also scale the system: we could create another Kafka broker, set up some listeners, and another one. We have a second set of clients, which are called consumers, and they simply receive this data. So we have this really simple example of a system where you have producers and consumers, but we also need some representation of the data, which is Kafka topics. Each Kafka broker also has its own configuration; you can basically set versions and set up in-sync replicas, but this is not important for this talk. So we have a lot of buzzwords, as you can see, but unfortunately, or maybe fortunately, we don't have time for all of them. We can stick with this model now: we have the producers, we have the consumers, we have some brokers, which are the servers. And what if we encapsulate the system in Kubernetes?
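Before we do, here is a rough illustration of the "just set the replicas in a YAML file and apply it" step mentioned a moment ago. This is a minimal, hypothetical sketch of a plain Kubernetes Deployment; the names and image are made up for the example:

```yaml
# Hypothetical Deployment, only to illustrate declarative horizontal scaling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-backend                # made-up service name
spec:
  replicas: 3                     # change 1 -> 3, re-apply, and Kubernetes reconciles
  selector:
    matchLabels:
      app: my-backend
  template:
    metadata:
      labels:
        app: my-backend
    spec:
      containers:
        - name: backend
          image: quay.io/example/backend:latest   # placeholder image
          ports:
            - containerPort: 8080
```

After `kubectl apply -f backend.yaml`, the control loop keeps three replicas running, and if one pod crashes it is simply recreated, which is the fault-tolerance point from above. For Kafka, Strimzi takes this encapsulation one step further, as described next.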
Now, on top of that, we add some operators managing the Kafka ecosystem, and on top of those the operators themselves, and this is basically Strimzi. Really complex, right? So here we can see an example deployment of Strimzi, where you have a lot of connections. These components are not really important now; the main idea here is that even with this small deployment you get a lot of places where you can inject chaos. Now I want to say that when we go to production, one such production environment is the scale job, and before I dig into it I want to thank these guys, because without them we would be unable to run such chaos at such a massive scale. So, as I said, the scale job is the production environment for Strimzi and other projects, and there are a lot of technologies involved, such as, I don't know, Tekton pipelines, Prometheus, Grafana, Loki for logs, and more. And here you can see a basic example of how we induce the chaos. We have some Kafka clients, Kafka Streams, producers and consumers, with some databases communicating through Kafka Connect. We have a MirrorMaker which transfers data from Kafka A to Kafka B, but that is not the point of this slide: there are a lot of connections. So I think, over to you, Henrik. Thanks. So the point of these slides was actually to show, or somehow explain, that when we come to the system and take a first look, it may look quite messy and quite complicated. We may not understand the whole underlying technology stack or every single component, and we are in a position where we want to talk about how the system behaves when we introduce chaos, whereas we are not even sure how it should behave normally. That is made even harder by the fact that the system doesn't behave the way it does on paper; in reality there are countless instances and connections, operators, clients, network traffic. We need to have some sort of observability and some intuition about the system. As in other presentations before ours, there were already some mentions of Prometheus and Grafana; they are quite famous for this purpose, so we will be using them as well. As mentioned, we need to have some intuition about the system and how it behaves; without that it is just a mess. So if we actually want to introduce some chaos into the system, we start by searching for the problematic parts of the system, or for what we actually want to focus on. It is a simple process when we take a basic look at the system: what is a critical component, where are possible bottlenecks, is some part of the network really critical here, are there some real-world events that can make my system vulnerable for some time, like rolling updates or node restarts in the cloud, and things like that. What would be really helpful is to collaborate with all the people involved in the system. We definitely need some input from the devs; we need at least some basic information about the architectural components. What we may come up with is a simple document describing all the important parts, the things that may occur there, and the protocols that are involved, and we will naturally come to the important configuration parameters (for Strimzi, a sketch of where these live follows after this part) and maybe even some proposals for simple chaos that could be introduced. So the output of this, in reality, is a first look at the parts of the system which may actually be targeted for simple chaos.
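To ground the idea of "important configuration parameters", here is a hedged sketch of what the Kafka cluster itself looks like as a Strimzi custom resource. The cluster name, storage type and values are illustrative, and the exact fields depend on the Strimzi version (this sketch uses the older ZooKeeper-based layout):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                     # illustrative cluster name
spec:
  kafka:
    replicas: 3                        # broker count; the later demo runs seven
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3    # how many copies of each partition exist
      min.insync.replicas: 2           # together these decide whether losing brokers is survivable
    storage:
      type: ephemeral                  # illustrative; production would use persistent storage
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

Parameters like the replication factor and min.insync.replicas are exactly the kind of thing that should end up in that shared document, because they determine how many broker failures the cluster can absorb.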
Now that we at least have some first insight into what could be our first guess for starting with chaos, we can focus on concrete chaos and start with some simple experiments. So how do we actually formulate some kind of hypothesis, or some sort of experiment, when we look at a specific thing? We will look at just part of the system, a few components. Say we decide to make sure that the core part of the system is actually capable of withstanding some instances being lost or failing. Because production is still production, and even though running in production is one of the main principles of chaos engineering, we don't want to start with chaos in the production environment. I guess everyone here knows why: the first intern who tries to introduce some chaos will bring down all the instances, the service will not be available for the whole day, and good luck explaining that to your boss. So we will probably start on a smaller scale, in a stage environment with much smaller traffic and much smaller stakes; let's say there will be some clients, maybe just a random fraction of them, and we will have a few instances and a few controllers. We start by making sure the system is in a steady state: our instances are up and running. When we are sure about that, we can introduce the chaos. When we introduce the chaos, instances go down, and afterwards the system stabilizes by bringing the instances back up again. During all of this we are observing all the important metrics and parameters of the system; for example, it could be messages per second. Now that all of that is set and done, we can actually implement our chaos. What can be really helpful here are chaos tools. We will not describe all of them, but to simply mention a few, there is Chaos Mesh, Kraken, Litmus, and other choices. They help with the definition, evaluation, execution and all the other stuff. We end up with very simple YAML files to be executed, like the sketch shown after this part. Now we can actually execute our chaos and see that everything went as expected: there was a small decrease in the traffic, but overall the system got to the desired state after a while. Okay, this was the first experiment in stage. Everything went great, and we got a good feeling of resilience being confirmed in our system, but what we are supposed to do now is repeat the experiment, scale it up a bit, and go into production, because it is the production environment where we will really get the confidence. What may happen is that it does not go according to plan at all; it may fail miserably, and this is also the reason why we should scale these experiments up more slowly, and why we eventually want to run them in production: we want to really make sure that this environment, which is so important for us, is actually able to handle the problem. So, as I said, no reason to despair; just keep at it, try it in stage, and definitely start slowly. So, on to the demos. Okay, today we have prepared two demos for you. The first one is a broker failure. Here we will target the Kafka pods; we have seven replicas of Kafka and we will be targeting three of them. The observability, or the metrics we gather, would be things like throughput, CPU, memory, and the traffic in the Kafka pods.
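As a concrete example of those "simple YAML files", here is a hedged sketch of roughly how this first experiment can be expressed as a Chaos Mesh PodChaos resource. The namespaces, cluster name and label selector are assumptions for illustration, not the exact files used in the demo:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kafka-broker-failure
  namespace: chaos-testing                # assumed namespace for the chaos resources
spec:
  action: pod-failure                     # make the selected pods unable to run
  mode: fixed                             # pick a fixed number of pods from the selection
  value: "3"                              # three of the seven brokers
  duration: "3m"                          # keep them down for three minutes
  selector:
    namespaces:
      - kafka                             # assumed namespace of the Kafka cluster
    labelSelectors:
      strimzi.io/name: my-cluster-kafka   # assumed label on the broker pods
```

Applying this resource starts the experiment, and deleting it early is effectively the "red button" from the principles: the affected pods are allowed to run again.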
Then we also define a steady state, which is basically that all broker and client replicas are okay and the communication throughput stays stable even when we inject the chaos. And if we define the hypothesis, it would be: we will eliminate three of the Kafka pods, this will not eventually cause a cascading failure, and we will be okay, meaning users will not be affected by it. And we will also have some checks on the producers and the Kafka pods. So let's move on to the demo and hopefully it will somehow work. Okay, so here we have the setup. We have a Kafka cluster, we have some nodes, we have producers. Here the pod chaos is defined: we have mode fixed, targeting a value of three. Three means we will be targeting three pods, which will be unable to run, and the duration for this will be three minutes. So let's try to inject the chaos with our script. Yeah, so now we are injecting the chaos, and we can see that three of the pods are not running. We move to the Grafana dashboards, where we have some metrics. Here is a really simple, not production-ready, messages-per-second panel, as you can see. Now you can see an immediate decrease of connections. The average number of messages also decreases, but Kafka recovers even while the pods are down. So here is the decrease, but after a while we see that it eventually recovers somehow. And as we can also see, the number of brokers online is four, which is correct now. There are also some under-replicated partitions. Yeah, Kafka is okay now, and after this the experiment will be done. I think it could be done. Yeah, so now we do the checks. We are checking the StrimziPodSets and the Kafka resource, which are just custom resources of Strimzi, and now the Kafka pods are ready. We're done, and in the Grafana dashboards we will also see the brokers go back online and the under-replicated partitions all go back to zero. I think so, yeah, and here it is. Okay, so this was the first demo, and we also have a second one. This is basically a worker node crash, and to quickly describe it, the topology is that we have a producer, we have Kafka A and Kafka B with some consumers, and in the middle there is Kafka MirrorMaker, which basically just transfers data from Kafka A to Kafka B. The steady state is again that all services are fully available and ready to accept traffic. We made the hypothesis that eliminating one of the Kubernetes worker nodes will not bring down any services, and also that the producers and consumers will not be affected; they will simply keep sending messages without any harm. So let's move to demo two. I will show you the important things. So we have the source Kafka cluster, the target Kafka cluster, MirrorMaker, we have some worker nodes, and we inject the chaos. We also create continuous clients, a producer and a consumer, to check correctness: that all messages are sent and also received without any harm, with no connection refused or anything like that. So now we reset, or crash, the worker node. We will see the worker node move from the ready state to not ready. Here it is, it's not ready, but the clients are successfully and happily sending and receiving messages. The script is just checking that the worker node is still not ready, and we are waiting for recovery. It will take some time; I think it should be back soon. And now the worker node moves back to the ready state.
We can see that all containers which were affected on that specific node are being created again, and the producers and consumers are still sending and receiving messages. We do some checks again on the StatefulSets; yes, this is okay. The target cluster has recovered as well. We are also doing checks for MirrorMaker. And the script just finishes successfully and we are happy. Okay, so I think that was the two demos, and for the last words, over to Henrik. Yeah. So as you could see in the demonstrations, the benefit of the chaos, or the execution of the chaos, is a bit different from the testing we are used to. There was quite a big hype about chaos engineering and the possible benefits it can bring to our organizations. Yes, it can definitely reveal bugs in production. You can drastically improve the actual performance there, or the situation in the cluster regarding the resilience of the system. But the main benefit of doing such a thing is getting confidence in the system and finding misconfigurations. Those of you who have tried running an application in Kubernetes know how important it is to have all the volumes and all the liveness and readiness checks set correctly, and the overall infrastructure set in place. The actual greatest benefit is, in fact, getting experience and new knowledge about the system and really understanding how it is supposed to work. This is not a holy grail, as I said, and it can be a bit disappointing for some. But if we think about chaos engineering as a natural step above the other kinds of testing, and not their replacement, we can see a great benefit in it. So how can we actually embrace it in our organization? A very well-known concept is game days, when we put together a lot of roles and a lot of people from our organization, introduce some kind of chaos, and let them handle it in some reasonable manner, where they can all communicate, all contribute, and fix the problem in a reasonable time. That is, I would say, a friendly way to start with it. Know your tools. I know it can be overwhelming; you could see even in the demo that we had to introduce quite a lot of tools in order to run even simple experiments. But once you know the basics and have some confidence in them, you can really start to create some kind of chaos. We can really recommend some great books about chaos engineering, and about Kafka if you want, but there are still a lot of tools, and what is most important is to definitely start small: don't be afraid to set up some stage environment where you can actually practice and confirm your hypotheses before you actually go into production and start doing mayhem. Thank you for your attention, we really appreciate it. Questions? No time? One question. Yeah? Yes, there are. It actually depends. In practical terms, it mostly starts to make sense only when we are talking not about some kind of monolithic application, but about something actually deployed in a cloud, some kind of microservices architecture. I would say that it does not depend as much on the size of the system as on how much you depend on the customer experience, in a sense: when will it really be detrimental for your system to get into a chaotic condition. But yeah, thank you as well.