Come on, people, let's cheer for Daniel, because it's his first time speaking and everything's failing. And it's off to a good start. Come on, big applause. Thank you. You're doing awesome.

You know, the only certain thing about technology is that it's going to fail exactly when it mustn't. And like I think I said already, flakiness obviously doesn't only happen in tests.

So while we're waiting for this thing to come up, let me ask a question: who actually has an idea what a flake in testing would be? Okay, I should just repeat what you're going to say. Yeah, go ahead. So you have an idea, but you don't want to tell me? Exactly.

To me, and I think to most people who agree on this topic, a flaky test is a test that fails and passes in successive runs without any code being changed, neither the test code nor the underlying production code. So yes, this talk will be about flaky tests. And of course flaky behavior is not only caused by the test itself but also by the software under test, but I would divide those two into different categories, and they are handled differently.

But let's wait. So, I'm going to start with an introduction. My name is Daniel Hiller, I'm working at Red Hat on the upstream KubeVirt project, and there I'm maintaining the KubeVirt CI system. This talk is about flaky tests and how we should, or how we actually do, handle them in our community for the KubeVirt contributors. I don't claim to have the silver bullet for handling this; I would be happy to get any input from you folks on how we can improve, and I would actually also like to have some kind of extended Q&A session if there is time left, so that we can talk about what you have experienced and how you handle it.

Just as a quick outline of how I think this should go: I'm going to start with what a flake is, but you described that perfectly already, so that's fine. Then what the impact of flakes is, then how we actually find flakes, then how our flake process works and what tools we have that support it, and in the end I want to describe what we're aiming to do in the future to improve this.

I just don't have internet for some reason. Oh no. My email, okay. Yeah, I think this is going really terribly wrong. Sorry for all that, by the way. A packed room, I didn't expect that, to be honest, so thank you all for coming, really great.

I'm going to help you out, don't worry. So tell me a little bit more while we wait for the slides. Can you give us a hint as to what you wanted to show us and just tell us the story about it? Pretend I'm stupid and I have no idea what's flaky, and just tell it to me.

Yeah, without the slides I'm just going to open it up a bit. So I told you already about the agenda, and the question of what flakes are was already answered, so I have two other questions. The first one is a little bit suggestive, I guess: who thinks handling flakes is important? Put your hand up. A few of you don't. Yeah, of course, everyone thinks handling flakes is important. Okay, I thought so.

You saved my day. Do you have a USB port? I hope so. Once again, you need to put it in presentation mode; on the right there should be "presentation".
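To make the definition above concrete, here is a small illustrative Go test, not taken from the talk or from KubeVirt, that is flaky in exactly the sense described: nothing changes between runs, yet it sometimes passes and sometimes fails because it races a hard-coded deadline against variable timing.

```go
package flaky

import (
	"math/rand"
	"testing"
	"time"
)

// rng is explicitly seeded so the example varies between runs on all Go versions.
var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

// doWork simulates an operation whose duration varies between runs,
// for example because it depends on network latency or scheduler timing.
func doWork() time.Duration {
	d := time.Duration(rng.Intn(150)) * time.Millisecond
	time.Sleep(d)
	return d
}

// TestFinishesQuickly is flaky: neither the test code nor the code under
// test changes between runs, yet the assertion sometimes fails because a
// hard-coded deadline races against the variable duration of doWork.
func TestFinishesQuickly(t *testing.T) {
	if elapsed := doWork(); elapsed > 100*time.Millisecond {
		t.Fatalf("operation took %v, want < 100ms", elapsed)
	}
}
```

Rerunning this test a few times is usually enough to see both outcomes, which is the same observation the rest of the talk builds on.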
Yeah, that should be okay. Okay, so those questions we already had. And another question: who has to deal with flakes on a regular basis? Wow, okay. Yeah, I expected something like that.

So, like you correctly said already, flakes are caused either by production code, which is a bug of course, or by flaky test code. That is also a bug, but it is handled differently, like I already said.

We are using Prow for our CI system, which comes from the Kubernetes ecosystem. I'm not sure whether you're familiar with it, but it's pretty flexible and it can start jobs from GitHub events, which is exactly what we want and what we need. This picture shows the PR history for one of our PRs inside the KubeVirt CI. At the top, I can't even see it from here, but there is the commit ID, and these are the job runs that are defined, i.e. the jobs on the CI system. This one of course is a failed job and these are successful jobs. What you can see here is that all the jobs ran on the same commit ID, but some failed and some succeeded, and that is exactly how we see where we have our flakiness.

Oh, wait a second, that's the wrong direction. Okay. So, there is a really interesting paper, a major survey about flakiness in tests, which is simply called "A Survey of Flaky Tests". Not a really impressive title, but great stuff inside. There you can read that 79% of the flaky tests already failed within about five reruns, and that more than 50% of flakes could not be reproduced in isolation. Which of course leads us to the conclusion that ignoring flaky tests is okay, right? Of course it isn't.

When we're talking about CI, we want to have a reliable signal of stability, because of course we want to know whether we can ship our product or not. So any failed test run signals to us, the CI maintainers, that the product is unstable and that we cannot ship it. If we have flakes in our CI, they give us a wrong signal, namely that the product is unstable and that we cannot ship, and we then have to go through the test results to see what exactly went wrong, only to notice that it's a flaky test. That of course wastes a lot of time.

Not only does it waste the time of the developers themselves, who have to look at the test results and determine whether this is a flaky test or not. When you have a CI that determines via the tests whether a PR can get merged, and you have a failed test result, the merge will not go through. This causes friction for the developers, who then have to re-issue another test run. If they see it's flaky and there is nothing to fix, they just retest. Sometimes you just think, okay, that was flakiness, I'm just going to retry, without even looking at the test results. That is what I would call the retest trap, and we have actually had retests like that; the highest number I've seen is 25 retests on the same commit.

Do I have to... oh, I have to stay here. Okay.

And also a very bad thing: I guess any CI system has something like a merge acceleration mechanism, where it tests multiple git commits at once, as a batch, so that it can merge them all together. And of course, if there is a flaky test, this acceleration effect is simply reversed; it will not be effective.
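As a hedged illustration of the "same commit, different outcomes" signal just described, here is a small Go sketch that groups CI results by commit and job and flags jobs that both passed and failed on the same commit. The `JobResult` type and its fields are simplified assumptions for the example; the real Prow result data looks different.

```go
package main

import "fmt"

// JobResult is a hypothetical, simplified record of one CI job run;
// the real Prow result objects are richer than this.
type JobResult struct {
	CommitSHA string
	JobName   string
	Passed    bool
}

// suspectedFlakes returns jobs that both passed and failed on the same
// commit, i.e. the "same commit, different outcome" pattern that the
// talk uses as the basic signal for flakiness.
func suspectedFlakes(results []JobResult) []string {
	type key struct{ sha, job string }
	passed := map[key]bool{}
	failed := map[key]bool{}
	for _, r := range results {
		k := key{r.CommitSHA, r.JobName}
		if r.Passed {
			passed[k] = true
		} else {
			failed[k] = true
		}
	}
	var out []string
	for k := range passed {
		if failed[k] {
			out = append(out, fmt.Sprintf("%s @ %s", k.job, k.sha))
		}
	}
	return out
}

func main() {
	results := []JobResult{
		{"abc123", "e2e-network", false},
		{"abc123", "e2e-network", true}, // retest on the same commit passed
		{"abc123", "unit", true},
	}
	fmt.Println(suspectedFlakes(results)) // [e2e-network @ abc123]
}
```

A report like the heat map shown later in the talk aggregates essentially this kind of information over many runs.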
Yeah, like I said, more wasted time. Flaky tests also create trust issues with the developers themselves, because they lose trust in automated testing. Which is really sad, because that's all we want: we want to be able to trust the tests. If we can't, then we just start ignoring test results, which is not a good idea.

So we want to minimize the impact on our CI so that people don't experience that much friction. Time flies, so: what we do is quarantine those tests. We take them out of the set of stable tests and put them in another set, so that they are not run during pull request runs. We want to do that as early as possible, as soon as we detect the flakiness, but only for as long as necessary, because the tests themselves of course have value, otherwise they wouldn't be there.

What do we need for that? We need some mechanism to move a test from the set of stable tests to the set of quarantined tests. We also need a report on the flakiness, so that we can triage which flaky tests we need to act upon first; if you have a lot of flaky tests, that matters, because the higher the flakiness of a test, the higher its impact. And we need lots of data, because you need to analyze whether a test is even flaky or not.

As I already described, this is the latest commit on a merged PR where we have some failing test runs which later turned green on the same commit, so no changes to the code. This of course does not tell us for certain that a test is flaky, but it might be, and like you said, the cause could either be in the production code or in the test code itself. In the end that doesn't matter; the problem we have is the friction in CI and the wasted resources.

Our flake process is pretty rough, I'd say, pretty easy. We have regular meetings where we look at the results and at the flakes, and then we decide what we want to do with those flakes. First of all, you have to know whether a test is flaky or not, so you look at the test results and decide whom you should contact so that they fix it, because we don't fix the tests ourselves; we let the developers do that, because they created the mess, they should clean it up. A problem, of course, is when people have left the project; then someone else has to take care of it. So we hand the flaky tests over to the developers, and once a test has been fixed we bring it back in.

The tool we have for deciding whether a test is run for a pull request is just an annotation on the test itself: there is a quarantine keyword in the test name, which makes the test get ignored for the pull request runs. We still run those tests to keep the stability signal, but not in the presubmits, which are required for the pull request merges; they run in the periodic runs instead, which run, I think, three times a day. That way we still have a signal that tells us when we can take a test back in, so that it adds value again.
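KubeVirt filters on that quarantine keyword in the test name at the CI level; here is a minimal sketch of the same idea using only the standard Go testing package. The `QUARANTINE` suffix and the `SKIP_QUARANTINE` environment variable are illustrative assumptions, not the project's actual configuration.

```go
package quarantine

import (
	"os"
	"strings"
	"testing"
)

// skipIfQuarantined skips tests whose name carries the QUARANTINE marker
// when the suite runs in presubmit mode. The SKIP_QUARANTINE environment
// variable is an illustrative assumption, not KubeVirt's real setup.
func skipIfQuarantined(t *testing.T) {
	t.Helper()
	if os.Getenv("SKIP_QUARANTINE") != "" && strings.Contains(t.Name(), "QUARANTINE") {
		t.Skip("test is quarantined; it only runs in the periodic lanes")
	}
}

// A quarantined test: excluded from presubmits, still exercised in
// periodics so we can see when it has stabilized again.
func TestMigrationSucceeds_QUARANTINE(t *testing.T) {
	skipIfQuarantined(t)
	// ... actual test body ...
}

// A stable test: always runs.
func TestStartVM(t *testing.T) {
	// ... actual test body ...
}
```

A presubmit lane would set the variable to exclude quarantined tests, while the periodic lanes would leave it unset so the quarantined tests keep producing a stability signal.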
Another thing is of course that you need a report. This is a not really nice looking but efficient thing, a heat map, where you see where the action is going on: the more reddish the colors get, the worse the problem is. This is... oh no, I can't go there. So, on the top you can see on which day how many failures occurred, and on the other axis you have the failures per lane, so that we can pretty much see which lane is flaky and where the biggest impact is. This is the first time I'm using this, sorry, I keep switching the directions.

Okay, this is the detailed report about how flaky a test is, or how flaky those tests are, ordered by the number of failures occurring per test. It's a bit overwhelming, I think, but in the left column you just see the test name, and across the top you see the test lanes, the lanes for the latest three versions. We have a lot of test lanes that are maintained by different SIGs, and this obviously creates a matrix of at least twelve really important lanes which absolutely have to be stable. This helps us find which tests we should look at and quarantine, and which we shouldn't.

We also have long-term metrics where we can see how we were doing in the past, because everyone of course wants to know whether they are improving or getting worse at handling flakes. For example, how many merges per day, or how many merged PRs with zero retests, which is the number we currently measure against the most, because obviously that number should be like 28 out of 28, but we seldom reach that because of flakiness.

We also have a small report of the tests that are currently quarantined, so that we can find them quickly. Grepping over the code base is of course doable, but it is easier to have a report that we can look at straight away during our meetings.

And then finally we have Testgrid, which also collects all the periodic results, so that we can deduce whether the tests have been stable or not. I guess the folks from the Kubernetes ecosystem know this tool, because Kubernetes also uses Testgrid for collecting all the test results so that you can quickly drill down.

We have also established another lane that checks the tests for stability, which makes things like test dependencies visible. I guess you know what a test dependency is: some test hasn't cleaned up and left its mess behind for other tests, influencing them so that they might fail, or the other way around, a test might only pass because what a previous test left behind was already sufficient for the following test. If you randomize the test order, you catch those cases, because you have to have isolated test cases, right?
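Here is a hedged, self-contained Go sketch of such a test dependency, not taken from KubeVirt: the second test only passes because the first one happens to run before it and leaves shared state behind.

```go
package orderdep

import "testing"

// registry is package-level state shared between tests, the kind of
// leftover "mess" the talk describes.
var registry = map[string]string{}

// TestRegisterDefault seeds the shared registry but never cleans it up.
func TestRegisterDefault(t *testing.T) {
	registry["default"] = "value"
}

// TestLookupDefault only passes because TestRegisterDefault happened to
// run first and left state behind. With a randomized or reversed order
// it fails, which is exactly what the stability-check lane surfaces.
func TestLookupDefault(t *testing.T) {
	if _, ok := registry["default"]; !ok {
		t.Fatal("expected 'default' entry; this test depends on another test's leftovers")
	}
}
```

Running the package with `go test -shuffle=on` (available since Go 1.17) randomizes the execution order and makes the hidden dependency fail, which is the kind of signal such a lane is after.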
And that lane also tries to run each test five times, because, like I said before, according to that survey a bit more than 80% of the flaky tests have failed after about five runs. It's not that you catch all of them that way, but you catch the majority. And that's just the CI search tool.

So, in a nutshell, we hold meetings at regular intervals where we look over the data, like I described before.

What we want to do next is of course collect even more data. We want to run the majority of tests in the same way as we do in the flake lane: running them five times in a row and always randomizing the order, so that we get a better picture of how flaky our code base is. And of course we want to avoid the retest problem, where people blindly just retest their changes, so we are looking for ways to detect that case directly.

Yeah, so I've been running through this pretty quickly. Any questions?

You've been talking about the responsibility of devs to fix the flakiness. This kind of assumes that the flakiness is introduced either by new tests, by changes to tests, or by changes to the code base. But what about flakiness that is introduced by your infrastructure, like network latency or things like that? Do you have those problems, or is it something that you...

I didn't get the last sentence, could you repeat it? Sorry.

So you imply that flakiness can either be introduced by new tests, changes in tests, or changes in the code base, but have you ever been confronted with flakiness introduced by your infrastructure, like network latency or something like that, and how do you detect that?

Of course, of course, that is also a problem. But when you have flakiness in your test infrastructure, or even failures in the test infrastructure, that's an entirely different problem. What we have observed there is that a lot of tests tend to fail at once, so as a rough estimate, when we have more than 20 tests failing in one run, that is most likely because the test infrastructure is failing. We then just quickly verify that there is something going on in the infrastructure and disregard that run. In earlier days we had that problem pretty often, but recently it hasn't been happening anymore, or much less, let's put it like that.

Of course, of course we look at it, because what we are testing are e2e tests. KubeVirt is a complex system; it's an add-on to Kubernetes so that you can run virtual machines, and for testing that end to end you need a full Kubernetes cluster on which you deploy KubeVirt. That's what we do in the CI: we spin up what I would call a frozen cluster, virtualized nodes that have been prepared in advance and are spun up on demand. It takes around one and a half minutes to spin up such a cluster, and then you run all those tests. And we always have three versions of the...

Thank you very much, we are running out of time. Yeah, you can continue afterwards. Thank you.
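To close the loop on the "run several times in randomized order" idea described above, here is a minimal sketch of a flake-hunting runner, assuming the tests are ordinary `go test` packages. The `go test` flags used are standard tooling; everything else, including the program itself, is illustrative and not KubeVirt's actual flake lane.

```go
// flakehunt is a tiny illustrative runner that executes a test package
// several times with a shuffled test order and reports how often it
// failed. It sketches the idea behind the flake lane, nothing more.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	const attempts = 5 // the survey suggests most flakes show up within about five runs
	pkg := "./..."
	if len(os.Args) > 1 {
		pkg = os.Args[1]
	}

	failures := 0
	for i := 1; i <= attempts; i++ {
		// -count=1 disables test caching; -shuffle=on (Go 1.17+) randomizes
		// the test order, which also surfaces order dependencies.
		cmd := exec.Command("go", "test", "-count=1", "-shuffle=on", pkg)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			failures++
		}
	}

	fmt.Printf("%d/%d runs failed", failures, attempts)
	if failures > 0 && failures < attempts {
		fmt.Print(" (mixed outcomes without code changes: likely flaky)")
	}
	fmt.Println()
}
```

A run that shows mixed outcomes without any code change is exactly the flake signal defined at the start of the talk.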