So, hello everyone. Today we have prepared a presentation about chaos engineering in action. I am Marj Orsak and this is Henrik Srenching; we both work as software engineers at Red Hat. We have also prepared a quiz. You can see the QR code: scan it with your phone, and if you are quick enough and get the correct answers, you can win a prize. So, over to you, Henrik. Yeah, so the content of the presentation is as follows. We will begin with a brief explanation of chaos engineering. Then we will describe what the target systems may actually look like in production. Then we will turn our focus to designing the chaos. Afterwards there will be two brief demonstrations, and then a quick conclusion on how to actually work with chaos. When we think about system or application resilience, we have to think about all the components our application depends upon, meaning other components and other services. There is also a big dependency on the network and the infrastructure. All of these things are most visible in distributed systems. There are many well-known fallacies of distributed systems, mostly concerned with the network and bandwidth. When we then look at a system from the viewpoint of the many instances and services which have to communicate with each other for the system to work well, we come to the problem of complexity and the fact that there is possibly no single person who understands the system completely and every state the system can get into. So what can happen, and will probably inevitably happen, in a system of such magnitude is that one or more instances will crash. This is the story of Chaos Monkey, which I guess some of you may be familiar with; all we need to know for now is that it was one of the first chaos tools, it randomly killed instances in production, and it forced engineers to take proactive action to make the system more resilient. We can take this a step further and bring down not just a few instances but an availability zone or a cluster, or break some kind of network traffic, and get the system into a state we are not so comfortable with in a production environment. So we get to the definition of chaos engineering: it is experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. This may sound weird, because why would anyone want to bring chaos into production; isn't that something we should actually avoid? The real reason for doing so is the time difference: it is much easier to solve problems at 4 p.m. than at 4 a.m., when you are under high pressure from customers to fix them. There are many principles which we have to abide by, or should abide by, in chaos engineering. The first and most important one is a minimal blast radius for each experiment you conduct. We should imagine a red button for each experiment which can stop it in case anything goes wrong. The other principles are mostly focused on testing things the way they happen in real life: we want to focus on how the system actually behaves in production, we want to make sure it works correctly, and we want to introduce the problems that may happen in the real world. The last principle is continuous runs, which is basically about running these experiments as often and as effortlessly as possible.
Now over to the target system. This all started with the monolith architecture, where you get one box: one backend, one database and one UI. In terms of complexity it was quite low. You simply get some user connections and the server is not particularly overwhelmed. Then after some time you add more and more customers, let's say four or five thousand, the load gets pretty high and the server may simply crash. Such an architecture is really hard to scale horizontally; one way to tackle this problem is to scale vertically, but you cannot keep scaling vertically forever. The second point is that the fault tolerance of such an architecture is really bad. You just target that one node, the server immediately crashes, and the users will be really sad because they don't get any response. Then Docker came along with the microservice architecture, where all of this improved: we got portability and isolation, and we somehow got better horizontal scaling, but when you have, say, thousands of instances, it is quite hard to manage all of those containers. On the other hand, the complexity also increased, because of the network traffic and more. And so Kubernetes came to solve scalability in the horizontal sense: in Kubernetes, if you want to have one replica of the system, you just type it into the YAML file, apply it, and Kubernetes will do it. Then if you see your server crashing or being overloaded with requests, you simply set it to three and Kubernetes will do it (a rough sketch of such a YAML file follows at the end of this part). The same goes for fault tolerance: if you, I don't know, inject some disruptions or something else into the pods, one will still be up if you only target two of them. But still, complexity increased again. And so we are in the operator stage, where no one can entirely grasp the system in terms of its behavior. I want to present one such operator, which is Strimzi. Strimzi is basically Apache Kafka at its core, encapsulated in the Kubernetes system. On top of that you get some operators which simplify things like upgrades and dynamic configuration. There is tracing, more security involved, and also Grafana dashboards. And it is part of the Cloud Native Computing Foundation. But that's quite a lot of unknowns, right? So let's break this down. Apache Kafka has a lot of buzzwords, as you can see: publish-subscribe model, it is a messaging system, and so on, but this still doesn't help, right? So let's move to the basics of Kafka. We have some producers, not these ones, but some clients, right? These clients send messages to the broker. They are happy because the connection is up. We could also scale the system: we could create another Kafka broker, set up some listeners, and another one. We have a second set of clients, which are called consumers, and they simply receive this data. So we have this really simple example of a system where you have producers and consumers, but we also need some representation of the data, which is Kafka topics. Each Kafka broker also has its own configuration; you can basically set versions and set up in-sync replicas, but this is not important for this talk. So we have a lot of buzzwords, as you can see, but unfortunately, or maybe fortunately, we don't have time for all of them. We can stick with this model now: we have the producers, we have the consumers, we have some brokers, which are the servers. And what if we encapsulate the system in Kubernetes?
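Before we do, here is a rough illustration of the "just set the replicas in a YAML file and apply it" step mentioned a moment ago. This is a minimal, hypothetical sketch of a plain Kubernetes Deployment; the names and image are made up for the example:

```yaml
# Hypothetical Deployment, only to illustrate declarative horizontal scaling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-backend                # made-up service name
spec:
  replicas: 3                     # change 1 -> 3, re-apply, and Kubernetes reconciles
  selector:
    matchLabels:
      app: my-backend
  template:
    metadata:
      labels:
        app: my-backend
    spec:
      containers:
        - name: backend
          image: quay.io/example/backend:latest   # placeholder image
          ports:
            - containerPort: 8080
```

After `kubectl apply -f backend.yaml`, the control loop keeps three replicas running, and if one pod crashes it is simply recreated, which is the fault-tolerance point from above. For Kafka, Strimzi takes this encapsulation one step further, as described next.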
Now, on top of that, we add some operators managing the Kafka ecosystem, and on top of those the operators themselves, and this is basically Strimzi. Really complex, right? So here we can see an example deployment of Strimzi, where you have a lot of connections. These components are not really important now; the main idea here is that even with this small deployment you get a lot of places where you can inject chaos. Now I want to say that when we go to production, one such production environment is the scale job, and before I dig into it I want to thank these guys, because without them we would be unable to run such chaos at such a massive scale. So, as I said, the scale job is the production environment for Strimzi and other projects, and there are a lot of technologies involved, such as, I don't know, Tekton pipelines, Prometheus, Grafana, Loki for logs, and more. And here you can see a basic example of how we induce the chaos. We have some Kafka clients, Kafka Streams, producers and consumers, with some databases communicating through Kafka Connect. We have a MirrorMaker which transfers data from Kafka A to Kafka B, but that is not the point of this slide: there are a lot of connections. So I think, over to you, Henrik. Thanks. So the point of these slides was actually to show, or somehow explain, that when we come to the system and take a first look, it may look quite messy and quite complicated. We may not understand the whole underlying technology stack or every single component, and we are in a position where we want to talk about how the system behaves when we introduce chaos, whereas we are not even sure how it should behave normally. That is made even harder by the fact that the system doesn't behave the way it does on paper; in reality there are countless instances and connections, operators, clients, network traffic. We need to have some sort of observability and some intuition about the system. As in other presentations before ours, there were already some mentions of Prometheus and Grafana; they are quite famous for this purpose, so we will be using them as well. As mentioned, we need to have some intuition about the system and how it behaves; without that it is just a mess. So if we actually want to introduce some chaos into the system, we start by searching for the problematic parts of the system, or for what we actually want to focus on. It is a simple process when we take a basic look at the system: what is a critical component, where are possible bottlenecks, is some part of the network really critical here, are there some real-world events that can make my system vulnerable for some time, like rolling updates or node restarts in the cloud, and things like that. What would be really helpful is to collaborate with all the people involved in the system. We definitely need some input from the devs; we need at least some basic information about the architectural components. What we may come up with is a simple document describing all the important parts, the things that may occur there, and the protocols that are involved, and we will naturally come to the important configuration parameters (for Strimzi, a sketch of where these live follows after this part) and maybe even some proposals for simple chaos that could be introduced. So the output of this, in reality, is a first look at the parts of the system which may actually be targeted for simple chaos.
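To ground the idea of "important configuration parameters", here is a hedged sketch of what the Kafka cluster itself looks like as a Strimzi custom resource. The cluster name, storage type and values are illustrative, and the exact fields depend on the Strimzi version (this sketch uses the older ZooKeeper-based layout):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                     # illustrative cluster name
spec:
  kafka:
    replicas: 3                        # broker count; the later demo runs seven
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3    # how many copies of each partition exist
      min.insync.replicas: 2           # together these decide whether losing brokers is survivable
    storage:
      type: ephemeral                  # illustrative; production would use persistent storage
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

Parameters like the replication factor and min.insync.replicas are exactly the kind of thing that should end up in that shared document, because they determine how many broker failures the cluster can absorb.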
Now that we at least have some first insight into what could be our first guess for starting with chaos, we can focus on concrete chaos and start with some simple experiments. So how do we actually formulate some kind of hypothesis, or some sort of experiment, when we look at a specific thing? We will look at just part of the system, a few components. Say we decide to make sure that the core part of the system is actually capable of withstanding some instances being lost or failing. Because production is still production, and even though running in production is one of the main principles of chaos engineering, we don't want to start with chaos in the production environment. I guess everyone here knows why: the first intern who tries to introduce some chaos will bring down all the instances, the service will not be available for the whole day, and good luck explaining that to your boss. So we will probably start on a smaller scale, in a stage environment with much smaller traffic and much smaller stakes; let's say there will be some clients, maybe just a random fraction of them, and we will have a few instances and a few controllers. We start by making sure the system is in a steady state: our instances are up and running. When we are sure about that, we can introduce the chaos. When we introduce the chaos, instances go down, and afterwards the system stabilizes by bringing the instances back up again. During all of this we are observing all the important metrics and parameters of the system; for example, it could be messages per second. Now that all of that is set and done, we can actually implement our chaos. What can be really helpful here are chaos tools. We will not describe all of them, but to simply mention a few, there is Chaos Mesh, Kraken, Litmus, and other choices. They help with the definition, evaluation, execution and all the other stuff. We end up with very simple YAML files to be executed, like the sketch shown after this part. Now we can actually execute our chaos and see that everything went as expected: there was a small decrease in the traffic, but overall the system got to the desired state after a while. Okay, this was the first experiment in stage. Everything went great, and we got a good feeling of resilience being confirmed in our system, but what we are supposed to do now is repeat the experiment, scale it up a bit, and go into production, because it is the production environment where we will really get the confidence. What may happen is that it does not go according to plan at all; it may fail miserably, and this is also the reason why we should scale these experiments up more slowly, and why we eventually want to run them in production: we want to really make sure that this environment, which is so important for us, is actually able to handle the problem. So, as I said, no reason to despair; just keep at it, try it in stage, and definitely start slowly. So, on to the demos. Okay, today we have prepared two demos for you. The first one is a broker failure. Here we will target the Kafka pods; we have seven replicas of Kafka and we will be targeting three of them. The observability, or the metrics we gather, would be things like throughput, CPU, memory, and the traffic in the Kafka pods.
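As a concrete example of those "simple YAML files", here is a hedged sketch of roughly how this first experiment can be expressed as a Chaos Mesh PodChaos resource. The namespaces, cluster name and label selector are assumptions for illustration, not the exact files used in the demo:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kafka-broker-failure
  namespace: chaos-testing                # assumed namespace for the chaos resources
spec:
  action: pod-failure                     # make the selected pods unable to run
  mode: fixed                             # pick a fixed number of pods from the selection
  value: "3"                              # three of the seven brokers
  duration: "3m"                          # keep them down for three minutes
  selector:
    namespaces:
      - kafka                             # assumed namespace of the Kafka cluster
    labelSelectors:
      strimzi.io/name: my-cluster-kafka   # assumed label on the broker pods
```

Applying this resource starts the experiment, and deleting it early is effectively the "red button" from the principles: the affected pods are allowed to run again.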
Then we also define a steady state, which is basically that all broker and client replicas are okay and the communication throughput stays stable even when we inject the chaos. And if we define the hypothesis, it would be: we will eliminate three of the Kafka pods, this will not eventually cause a cascading failure, and we will be okay, meaning users will not be affected by it. And we will also have some checks on the producers and the Kafka pods. So let's move on to the demo and hopefully it will somehow work. Okay, so here we have the setup. We have a Kafka cluster, we have some nodes, we have producers. Here the pod chaos is defined: we have mode fixed, targeting a value of three. Three means we will be targeting three pods, which will be unable to run, and the duration for this will be three minutes. So let's try to inject the chaos with our script. Yeah, so now we are injecting the chaos, and we can see that three of the pods are not running. We move to the Grafana dashboards, where we have some metrics. Here is a really simple, not production-ready, messages-per-second panel, as you can see. Now you can see an immediate decrease of connections. The average number of messages also decreases, but Kafka recovers even while the pods are down. So here is the decrease, but after a while we see that it eventually recovers somehow. And as we can also see, the number of brokers online is four, which is correct now. There are also some under-replicated partitions. Yeah, Kafka is okay now, and after this the experiment will be done. I think it could be done. Yeah, so now we do the checks. We are checking the StrimziPodSets and the Kafka resource, which are just custom resources of Strimzi, and now the Kafka pods are ready. We're done, and in the Grafana dashboards we will also see the brokers go back online and the under-replicated partitions all go back to zero. I think so, yeah, and here it is. Okay, so this was the first demo, and we also have a second one. This is basically a worker node crash, and to quickly describe it, the topology is that we have a producer, we have Kafka A and Kafka B with some consumers, and in the middle there is Kafka MirrorMaker, which basically just transfers data from Kafka A to Kafka B. The steady state is again that all services are fully available and ready to accept traffic. We made the hypothesis that eliminating one of the Kubernetes worker nodes will not bring down any services, and also that the producers and consumers will not be affected; they will simply keep sending messages without any harm. So let's move to demo two. I will show you the important things. So we have the source Kafka cluster, the target Kafka cluster, MirrorMaker, we have some worker nodes, and we inject the chaos. We also create continuous clients, a producer and a consumer, to check correctness: that all messages are sent and also received without any harm, with no connection refused or anything like that. So now we reset, or crash, the worker node. We will see the worker node move from the ready state to not ready. Here it is, it's not ready, but the clients are successfully and happily sending and receiving messages. The script is just checking that the worker node is still not ready, and we are waiting for recovery. It will take some time; I think it should be back soon. And now the worker node moves back to the ready state.
We can see that all containers which were affected on that specific node are being created again, and the producers and consumers are still sending and receiving messages. We do some checks again on the StatefulSets; yes, this is okay. The target cluster has recovered as well. We are also doing checks for MirrorMaker. And the script just finishes successfully and we are happy. Okay, so I think that was the two demos, and for the last words, over to Henrik. Yeah. So as you could see in the demonstrations, the benefit of the chaos, or the execution of the chaos, is a bit different from the testing we are used to. There was quite a big hype about chaos engineering and the possible benefits it can bring to our organizations. Yes, it can definitely reveal bugs in production. You can drastically improve the actual performance there, or the situation in the cluster regarding the resilience of the system. But the main benefit of doing such a thing is getting confidence in the system and finding misconfigurations. Those of you who have tried running an application in Kubernetes know how important it is to have all the volumes and all the liveness and readiness checks set correctly, and the overall infrastructure set in place. The actual greatest benefit is, in fact, getting experience and new knowledge about the system and really understanding how it is supposed to work. This is not a holy grail, as I said, and it can be a bit disappointing for some. But if we think about chaos engineering as a natural step above the other kinds of testing, and not their replacement, we can see a great benefit in it. So how can we actually embrace it in our organization? A very well-known concept is game days, when we put together a lot of roles and a lot of people from our organization, introduce some kind of chaos, and let them handle it in some reasonable manner, where they can all communicate, all contribute, and fix the problem in a reasonable time. That is, I would say, a friendly way to start with it. Know your tools. I know it can be overwhelming; you could see even in the demo that we had to introduce quite a lot of tools in order to run even simple experiments. But once you know the basics and have some confidence in them, you can really start to create some kind of chaos. We can really recommend some great books about chaos engineering, and about Kafka if you want, but there are still a lot of tools, and what is most important is to definitely start small: don't be afraid to set up some stage environment where you can actually practice and confirm your hypotheses before you actually go into production and start doing mayhem. Thank you for your attention, we really appreciate it. Questions? No time? One question. Yeah? Yes, there are. It actually depends. In practical terms, it mostly starts to make sense only when we are talking not about some kind of monolithic application, but about something actually deployed in a cloud, some kind of microservices architecture. I would say that it does not depend as much on the size of the system as on how much you depend on the customer experience, in a sense: when will it really be detrimental for your system to get into a chaotic condition. But yeah, thank you as well.