Come on, people, let's cheer for Daniel, because it's his first time speaking and everything's failing. And it's off to a good start. Come on, big applause. Thank you. You're doing awesome.

You know, the only certain thing about technology is that it's going to fail exactly when it mustn't. And like I think I said already, flakiness obviously doesn't only happen in tests.

So while we're waiting for this thing to come up, let me ask a question: who actually has an idea what a flake in testing would be? Okay, I should just repeat what you're going to say. Yeah, go ahead. So you have an idea, but you don't want to tell me? Exactly.

To me, and I think to most people who agree on this topic, a flaky test is a test that fails and passes in successive runs without any code being changed, neither the test code nor the underlying production code. So yes, this talk will be about flaky tests. And of course flaky behavior is not only caused by the test itself but also by the software under test, but I would divide those two into different categories, and they are handled differently.

But let's wait. So, I'm going to start with an introduction. My name is Daniel Hiller, I'm working at Red Hat on the upstream KubeVirt project, and there I'm maintaining the KubeVirt CI system. This talk is about flaky tests and how we should, or how we actually do, handle them in our community for the KubeVirt contributors. I don't claim to have the silver bullet for handling this; I would be happy to get any input from you folks on how we can improve, and I would actually also like to have some kind of extended Q&A session if there is time left, so that we can talk about what you have experienced and how you handle it.

Just as a quick outline of how I think this should go: I'm going to start with what a flake is, but you described that perfectly already, so that's fine. Then what the impact of flakes is, then how we actually find flakes, then how our flake process works and what tools we have that support it, and in the end I want to describe what we're aiming to do in the future to improve this.

I just don't have internet for some reason. Oh no. My email, okay. Yeah, I think this is going really terribly wrong. Sorry for all that, by the way. A packed room, I didn't expect that, to be honest, so thank you all for coming, really great.

I'm going to help you out, don't worry. So tell me a little bit more while we wait for the slides. Can you give us a hint as to what you wanted to show us and just tell us the story about it? Pretend I'm stupid and I have no idea what's flaky, and just tell it to me.

Yeah, without the slides I'm just going to open it up a bit. So I told you already about the agenda, and the question of what flakes are was already answered, so I have two other questions. The first one is a little bit suggestive, I guess: who thinks handling flakes is important? Put your hand up. A few of you don't. Yeah, of course, everyone thinks handling flakes is important. Okay, I thought so.

You saved my day. Do you have a USB port? I hope so. Once again, you need to put it in presentation mode; on the right there should be "presentation".
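To make the definition above concrete, here is a small illustrative Go test, not taken from the talk or from KubeVirt, that is flaky in exactly the sense described: nothing changes between runs, yet it sometimes passes and sometimes fails because it races a hard-coded deadline against variable timing.

```go
package flaky

import (
	"math/rand"
	"testing"
	"time"
)

// rng is explicitly seeded so the example varies between runs on all Go versions.
var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

// doWork simulates an operation whose duration varies between runs,
// for example because it depends on network latency or scheduler timing.
func doWork() time.Duration {
	d := time.Duration(rng.Intn(150)) * time.Millisecond
	time.Sleep(d)
	return d
}

// TestFinishesQuickly is flaky: neither the test code nor the code under
// test changes between runs, yet the assertion sometimes fails because a
// hard-coded deadline races against the variable duration of doWork.
func TestFinishesQuickly(t *testing.T) {
	if elapsed := doWork(); elapsed > 100*time.Millisecond {
		t.Fatalf("operation took %v, want < 100ms", elapsed)
	}
}
```

Rerunning this test a few times is usually enough to see both outcomes, which is the same observation the rest of the talk builds on.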
Yeah, that should be okay. Okay, so those questions we already had. And another question: who has to deal with flakes on a regular basis? Wow, okay. Yeah, I expected something like that.

So, like you correctly said already, flakes are caused either by production code, which is a bug of course, or by flaky test code. That is also a bug, but it is handled differently, like I already said.

We are using Prow for our CI system, which comes from the Kubernetes ecosystem. I'm not sure whether you're familiar with it, but it's pretty flexible and it can start jobs from GitHub events, which is exactly what we want and what we need. This picture shows the PR history for one of our PRs inside the KubeVirt CI. At the top, I can't even see it from here, but there is the commit ID, and these are the job runs that are defined, i.e. the jobs on the CI system. This one of course is a failed job and these are successful jobs. What you can see here is that all the jobs ran on the same commit ID, but some failed and some succeeded, and that is exactly how we see where we have our flakiness.

Oh, wait a second, that's the wrong direction. Okay. So, there is a really interesting paper, a major survey about flakiness in tests, which is simply called "A Survey of Flaky Tests". Not a really impressive title, but great stuff inside. There you can read that 79% of the flaky tests already failed within about five reruns, and that more than 50% of flakes could not be reproduced in isolation. Which of course leads us to the conclusion that ignoring flaky tests is okay, right? Of course it isn't.

When we're talking about CI, we want to have a reliable signal of stability, because of course we want to know whether we can ship our product or not. So any failed test run signals to us, the CI maintainers, that the product is unstable and that we cannot ship it. If we have flakes in our CI, they give us a wrong signal, namely that the product is unstable and that we cannot ship, and we then have to go through the test results to see what exactly went wrong, only to notice that it's a flaky test. That of course wastes a lot of time.

Not only does it waste the time of the developers themselves, who have to look at the test results and determine whether this is a flaky test or not. When you have a CI that determines via the tests whether a PR can get merged, and you have a failed test result, the merge will not go through. This causes friction for the developers, who then have to re-issue another test run. If they see it's flaky and there is nothing to fix, they just retest. Sometimes you just think, okay, that was flakiness, I'm just going to retry, without even looking at the test results. That is what I would call the retest trap, and we have actually had retests like that; the highest number I've seen is 25 retests on the same commit.

Do I have to... oh, I have to stay here. Okay.

And also a very bad thing: I guess any CI system has something like a merge acceleration mechanism, where it tests multiple git commits at once, as a batch, so that it can merge them all together. And of course, if there is a flaky test, this acceleration effect is simply reversed; it will not be effective.
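As a hedged illustration of the "same commit, different outcomes" signal just described, here is a small Go sketch that groups CI results by commit and job and flags jobs that both passed and failed on the same commit. The `JobResult` type and its fields are simplified assumptions for the example; the real Prow result data looks different.

```go
package main

import "fmt"

// JobResult is a hypothetical, simplified record of one CI job run;
// the real Prow result objects are richer than this.
type JobResult struct {
	CommitSHA string
	JobName   string
	Passed    bool
}

// suspectedFlakes returns jobs that both passed and failed on the same
// commit, i.e. the "same commit, different outcome" pattern that the
// talk uses as the basic signal for flakiness.
func suspectedFlakes(results []JobResult) []string {
	type key struct{ sha, job string }
	passed := map[key]bool{}
	failed := map[key]bool{}
	for _, r := range results {
		k := key{r.CommitSHA, r.JobName}
		if r.Passed {
			passed[k] = true
		} else {
			failed[k] = true
		}
	}
	var out []string
	for k := range passed {
		if failed[k] {
			out = append(out, fmt.Sprintf("%s @ %s", k.job, k.sha))
		}
	}
	return out
}

func main() {
	results := []JobResult{
		{"abc123", "e2e-network", false},
		{"abc123", "e2e-network", true}, // retest on the same commit passed
		{"abc123", "unit", true},
	}
	fmt.Println(suspectedFlakes(results)) // [e2e-network @ abc123]
}
```

A report like the heat map shown later in the talk aggregates essentially this kind of information over many runs.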
Yeah, like I said, more wasted time. Flaky tests also create trust issues with the developers themselves, because they lose trust in automated testing. Which is really sad, because that's all we want: we want to be able to trust the tests. If we can't, then we just start ignoring test results, which is not a good idea.

So we want to minimize the impact on our CI so that people don't experience that much friction. Time flies, so: what we do is quarantine those tests. We take them out of the set of stable tests and put them in another set, so that they are not run during pull request runs. We want to do that as early as possible, as soon as we detect the flakiness, but only for as long as necessary, because the tests themselves of course have value, otherwise they wouldn't be there.

What do we need for that? We need some mechanism to move a test from the set of stable tests to the set of quarantined tests. We also need a report on the flakiness, so that we can triage which flaky tests we need to act upon first; if you have a lot of flaky tests, that matters, because the higher the flakiness of a test, the higher its impact. And we need lots of data, because you need to analyze whether a test is even flaky or not.

As I already described, this is the latest commit on a merged PR where we have some failing test runs which later turned green on the same commit, so no changes to the code. This of course does not tell us for certain that a test is flaky, but it might be, and like you said, the cause could either be in the production code or in the test code itself. In the end that doesn't matter; the problem we have is the friction in CI and the wasted resources.

Our flake process is pretty rough, I'd say, pretty easy. We have regular meetings where we look at the results and at the flakes, and then we decide what we want to do with those flakes. First of all, you have to know whether a test is flaky or not, so you look at the test results and decide whom you should contact so that they fix it, because we don't fix the tests ourselves; we let the developers do that, because they created the mess, they should clean it up. A problem, of course, is when people have left the project; then someone else has to take care of it. So we hand the flaky tests over to the developers, and once a test has been fixed we bring it back in.

The tool we have for deciding whether a test is run for a pull request is just an annotation on the test itself: there is a quarantine keyword in the test name, which makes the test get ignored for the pull request runs. We still run those tests to keep the stability signal, but not in the presubmits, which are required for the pull request merges; they run in the periodic runs instead, which run, I think, three times a day. That way we still have a signal that tells us when we can take a test back in, so that it adds value again.
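KubeVirt filters on that quarantine keyword in the test name at the CI level; here is a minimal sketch of the same idea using only the standard Go testing package. The `QUARANTINE` suffix and the `SKIP_QUARANTINE` environment variable are illustrative assumptions, not the project's actual configuration.

```go
package quarantine

import (
	"os"
	"strings"
	"testing"
)

// skipIfQuarantined skips tests whose name carries the QUARANTINE marker
// when the suite runs in presubmit mode. The SKIP_QUARANTINE environment
// variable is an illustrative assumption, not KubeVirt's real setup.
func skipIfQuarantined(t *testing.T) {
	t.Helper()
	if os.Getenv("SKIP_QUARANTINE") != "" && strings.Contains(t.Name(), "QUARANTINE") {
		t.Skip("test is quarantined; it only runs in the periodic lanes")
	}
}

// A quarantined test: excluded from presubmits, still exercised in
// periodics so we can see when it has stabilized again.
func TestMigrationSucceeds_QUARANTINE(t *testing.T) {
	skipIfQuarantined(t)
	// ... actual test body ...
}

// A stable test: always runs.
func TestStartVM(t *testing.T) {
	// ... actual test body ...
}
```

A presubmit lane would set the variable to exclude quarantined tests, while the periodic lanes would leave it unset so the quarantined tests keep producing a stability signal.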
Another thing is of course that you need a report. This is a not really nice looking but efficient thing, a heat map, where you see where the action is going on: the more reddish the colors get, the worse the problem is. This is... oh no, I can't go there. So, on the top you can see on which day how many failures occurred, and on the other axis you have the failures per lane, so that we can pretty much see which lane is flaky and where the biggest impact is. This is the first time I'm using this, sorry, I keep switching the directions.

Okay, this is the detailed report about how flaky a test is, or how flaky those tests are, ordered by the number of failures occurring per test. It's a bit overwhelming, I think, but in the left column you just see the test name, and across the top you see the test lanes, the lanes for the latest three versions. We have a lot of test lanes that are maintained by different SIGs, and this obviously creates a matrix of at least twelve really important lanes which absolutely have to be stable. This helps us find which tests we should look at and quarantine, and which we shouldn't.

We also have long-term metrics where we can see how we were doing in the past, because everyone of course wants to know whether they are improving or getting worse at handling flakes. For example, how many merges per day, or how many merged PRs with zero retests, which is the number we currently measure against the most, because obviously that number should be like 28 out of 28, but we seldom reach that because of flakiness.

We also have a small report of the tests that are currently quarantined, so that we can find them quickly. Grepping over the code base is of course doable, but it is easier to have a report that we can look at straight away during our meetings.

And then finally we have Testgrid, which also collects all the periodic results, so that we can deduce whether the tests have been stable or not. I guess the folks from the Kubernetes ecosystem know this tool, because Kubernetes also uses Testgrid for collecting all the test results so that you can quickly drill down.

We have also established another lane that checks the tests for stability, which makes things like test dependencies visible. I guess you know what a test dependency is: some test hasn't cleaned up and left its mess behind for other tests, influencing them so that they might fail, or the other way around, a test might only pass because what a previous test left behind was already sufficient for the following test. If you randomize the test order, you catch those cases, because you have to have isolated test cases, right?
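Here is a hedged, self-contained Go sketch of such a test dependency, not taken from KubeVirt: the second test only passes because the first one happens to run before it and leaves shared state behind.

```go
package orderdep

import "testing"

// registry is package-level state shared between tests, the kind of
// leftover "mess" the talk describes.
var registry = map[string]string{}

// TestRegisterDefault seeds the shared registry but never cleans it up.
func TestRegisterDefault(t *testing.T) {
	registry["default"] = "value"
}

// TestLookupDefault only passes because TestRegisterDefault happened to
// run first and left state behind. With a randomized or reversed order
// it fails, which is exactly what the stability-check lane surfaces.
func TestLookupDefault(t *testing.T) {
	if _, ok := registry["default"]; !ok {
		t.Fatal("expected 'default' entry; this test depends on another test's leftovers")
	}
}
```

Running the package with `go test -shuffle=on` (available since Go 1.17) randomizes the execution order and makes the hidden dependency fail, which is the kind of signal such a lane is after.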
And that lane also tries to run each test five times, because, like I said before, according to that survey a bit more than 80% of the flaky tests have failed after about five runs. It's not that you catch all of them that way, but you catch the majority. And that's just the CI search tool.

So, in a nutshell, we hold meetings at regular intervals where we look over the data, like I described before.

What we want to do next is of course collect even more data. We want to run the majority of tests in the same way as we do in the flake lane: running them five times in a row and always randomizing the order, so that we get a better picture of how flaky our code base is. And of course we want to avoid the retest problem, where people blindly just retest their changes, so we are looking for ways to detect that case directly.

Yeah, so I've been running through this pretty quickly. Any questions?

You've been talking about the responsibility of devs to fix the flakiness. This kind of assumes that the flakiness is introduced either by new tests, by changes to tests, or by changes to the code base. But what about flakiness that is introduced by your infrastructure, like network latency or things like that? Do you have those problems, or is it something that you...

I didn't get the last sentence, could you repeat it? Sorry.

So you imply that flakiness can either be introduced by new tests, changes in tests, or changes in the code base, but have you ever been confronted with flakiness introduced by your infrastructure, like network latency or something like that, and how do you detect that?

Of course, of course, that is also a problem. But when you have flakiness in your test infrastructure, or even failures in the test infrastructure, that's an entirely different problem. What we have observed there is that a lot of tests tend to fail at once, so as a rough estimate, when we have more than 20 tests failing in one run, that is most likely because the test infrastructure is failing. We then just quickly verify that there is something going on in the infrastructure and disregard that run. In earlier days we had that problem pretty often, but recently it hasn't been happening anymore, or much less, let's put it like that.

Of course, of course we look at it, because what we are testing are e2e tests. KubeVirt is a complex system; it's an add-on to Kubernetes so that you can run virtual machines, and for testing that end to end you need a full Kubernetes cluster on which you deploy KubeVirt. That's what we do in the CI: we spin up what I would call a frozen cluster, virtualized nodes that have been prepared in advance and are spun up on demand. It takes around one and a half minutes to spin up such a cluster, and then you run all those tests. And we always have three versions of the...

Thank you very much, we are running out of time. Yeah, you can continue afterwards. Thank you.
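To close the loop on the "run several times in randomized order" idea described above, here is a minimal sketch of a flake-hunting runner, assuming the tests are ordinary `go test` packages. The `go test` flags used are standard tooling; everything else, including the program itself, is illustrative and not KubeVirt's actual flake lane.

```go
// flakehunt is a tiny illustrative runner that executes a test package
// several times with a shuffled test order and reports how often it
// failed. It sketches the idea behind the flake lane, nothing more.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	const attempts = 5 // the survey suggests most flakes show up within about five runs
	pkg := "./..."
	if len(os.Args) > 1 {
		pkg = os.Args[1]
	}

	failures := 0
	for i := 1; i <= attempts; i++ {
		// -count=1 disables test caching; -shuffle=on (Go 1.17+) randomizes
		// the test order, which also surfaces order dependencies.
		cmd := exec.Command("go", "test", "-count=1", "-shuffle=on", pkg)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			failures++
		}
	}

	fmt.Printf("%d/%d runs failed", failures, attempts)
	if failures > 0 && failures < attempts {
		fmt.Print(" (mixed outcomes without code changes: likely flaky)")
	}
	fmt.Println()
}
```

A run that shows mixed outcomes without any code change is exactly the flake signal defined at the start of the talk.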