Hello everyone.
You have two more minutes.
My name is Andre.
I work for...
Okay.
I've worked for Red Hat for several years now as quality engineering.
And today's talk is really about performance testing,
but it's not about the testing itself.
It's more than why we should do it.
You have six minutes in class.
Okay.
You're starting early, just so you know.
Yeah.
Okay.
So it's more why we should do it and why there are benefits in it,
even if you do it wrongly or imperfectly.
So that's the main point of the talk for today.
So first thing first, why should we do it?
What are the benefits for us doing the performance testing,
even if we don't have isolated environment and all this kind of stuff?
Well, for me, the main benefit is that even if you don't have the environment you would want,
you can still find the bottlenecks in your application or whatever you are testing.
And you still can optimize it even if you don't have everything ready
because the truth is that the performance testing is quite expensive.
And for the good one, I don't think that there are the companies
that will give you the resources that you need to do it perfectly.
So that's for me the main option or the main reason why to do it.
And for me, the second most important is that you will gain knowledge that you need
or you will obtain the knowledge about the product itself
because you will suddenly see things that you cannot see,
even if you normally deploy things and everything.
You will see the little thingies that are happening here and there.
And the information you gain are quite nice to get.
So that's probably the most, the points that you should look for
if you are thinking about the performance testing.
This is actually what it will gain from it.
So this is only in my opinion, like you will see a lot of papers about the performance testing
and all the things that you have to take care of.
Like on the GitHub, I know about two or three papers that have like 40 pages
about the performance testing and all the criteria that you have to fill.
In my opinion, there are like two variants of the performance testing.
And first is measurement and second is testing.
For me, the measurement is really the thing that you are looking for numbers.
And you need those numbers for, I would say, legal reasons
or anything what you have to declare for your customer.
You will say that, for example, for us, I work on the division.
Like if we would want to say that this connector is actually able to do 30k per second,
we would need some kind of a proof that we can do it.
And getting this proof is like, it's very complicated
and you have to do it in very specific ways.
And even if you have everything, it's not quite, it's not always acceptable.
But the second part or second variation is just testing.
And for me, the testing, the testing is really just finding the bottlenecks in your product.
And I think that's even more, like the testing is even more important
because there you will find all the bottlenecks and you can optimize your application really.
And you can see the flaws in your call because you cannot see these things
when you run it regularly and you don't have everything around the application tuned up
so you don't throttle your application to the maximum.
These things usually happen when you go over the top or near the maximum, near the max.
So, yeah, this is in my opinion two ways how to do the performance testing
or two variants of the performance testing.
What is not really optimal about the testing, which I was talking about,
just finding the bottlenecks, not the number, you need massive monitoring
and I will say more about it in the talk, but that's like main disadvantage of it
that most of the time you will go around tracing, monitoring, metrics
and you will find some stuff that will really give you hard times figuring it out
because you are going for performance and you are speaking in milliseconds, yeah?
But most of the things that are used for monitoring are not really prepared to handle you for one minute second.
They think that it's okay to scrape, for example, metrics by 10 seconds
and like this will give you massive headaches during the time.
I think I have already somehow gathered those, but the goal of the performance testing is,
as I said, find the bottlenecks, but they are much more to it because, for example, the load types, you know,
even if someone of you had crossed the performance testing, the main point all of the guys are talking about
is what kind of load we are going to do, how we are going to do it to make it reproducible, you know?
Because that's like a problem because for some application you put constant loads to them.
Let's say you are going 10k per second, some requests for the API for one hour.
And like that could be fine, but we all saw that some websites or, for example, the systems that you are buying tickets for concerts,
these kind of things need peak loads. You know, you are going low for 5k per second
and suddenly you spin it up for 100k per second or something like that.
So you really need something that will generate the load for you and do it reproducible.
You need to have the same load so you can repeat the testing for a couple of times
and be consistent in that because otherwise you will find all the other things except the flux in your cold.
So the main problems during the performance that you will find.
I have said that you don't need the isolated environment to do it.
And that's true. We don't have isolated environment in our team when we do performance testing.
But if you don't have entirely isolated environment, you need to know your environment pretty well.
You have to know your latencies, you have to know all the hardware specs and these kind of things.
You really need all of the information because if you see some very specific things happen during the test which are not common,
you then can somehow put the puzzle together with all the information you get from the environment
and you can somehow at least, I would say, decrease the number of stupid mistakes that you will have to gather around.
The next thing that is very important is to have the monitoring which I have said already.
You need all the metrics you can get because before that you don't even know what kind of the information would be valuable for you when you start doing it,
but you will need all of it.
If you can get rid and you will surely use it, if you don't gather those metrics and then you will figure out you need it,
you cannot get them from the past.
So really the thing is gather everything you can and it will be fine.
And the last thing for this is you need to tune up all the systems that you are dependent on.
For us, we are working on databases and we cannot really throttle up our product if the database isn't optimized to the hardware it's running
because if the database is not throttling, we are not throttling.
So you basically need to have everything on the high spec there to not bottleneck your application.
So that's one of the main points because on some things it's quite problematic to tune it up.
I quite like this quote because it's all about the metrics and if you have them, it's fine and it's nice.
If you don't have them, it's massive problems.
So I think that's really the quote that you should be looking for.
So again, monitoring.
I have already talked about the problem with the scraping.
So let's say we have used Prometheus mostly and the maximum what you can get from Prometheus is one second scraping.
So that's fine for information causes but not for the performance metrics because the things happen during the milliseconds.
Maybe 10 milliseconds would be enough but one second is really you are losing a lot of information and later on I have the example of what you could see when the scrapers are not fast enough.
And there's like massive problem because not every scraper is or I would say there are no scrapers that can do it really fast.
So you probably have to implement it and we are working on that actually.
So that's the main problem.
And the second problem is that you will end up with having a lot of the systems in the field because you need hardware metrics.
You need JMX metrics and I don't know what other.
It really depends on your application.
But for us, we needed hardware metrics, JMX metrics and some metrics from our test suite.
And these three things actually all the different outputs.
You know, we have used net data for the hardware metrics.
It's like really nice tool open source fast everything nice.
But you cannot import JMX metrics to net data.
And you net data has also one problem that you cannot import like anything would happen in the past because it's strictly hard coded for now.
So that's the problem, you know, and then you will say yes.
Okay.
So I cannot have JMX metrics to the net data.
So I'll add Prometheus, you know, that's fine.
Okay. So you have now net data and Prometheus.
Sorry.
Well, and then you will continue because you still have all at least in our expertise.
We still need that someplace to store the metrics from our test suite and you can just import it everywhere.
So then you happen that you will deploy the Postgres because you can use Postgres as backup for the Prometheus for the data storage.
So now we have net data, Postgres and also Prometheus.
And last but not least, you will add the Grafana because you need to visualize it, you know, and getting all those things in shape that you have like massive monitoring.
You know, everything can go wrong.
So if you can use the least amount of the tools that you can use, it's better because once you have too many, it's nightmare to somehow get that in shape for the whole time.
Yeah.
So with the performance testing, you are not really looking for the numbers.
Numbers are not important in this case because you don't want to see the throughput is like this or like that.
You need to see the trends in the graphs because there you can see if you are constantly slowing down or if you are going like optimizing your way.
So really you have to look for the patterns in graphs and the trends in the graphs.
I have the example for our testing that I will show you the patterns which we have found.
But before that, just our system under test was the BZM.
I don't know if you guys know the BZM, but it's effectively changed that I capture streaming, which means that we sit on the top of the database and we scan the transaction logs and sense all the events that happen there to the Kafka.
We are effectively running in Kafka Connect runtime, which makes the performance testing even more juicy, I would say, because the runtime is not ours.
So it's a little tricky.
So that's our system under test.
And this is the first example that I have put on.
The image on the graph on the top is basically our process duration.
And there are two things that you can see on the graph.
We are effectively most of the time we are oscillating between some values around 200 or 170 to 220.
And that's entirely fine.
That's actually what we want to see if you are looking for some responses.
You need to oscillate around some value like a sinus graph or something like that.
But what is not actually nice is this on the star.
Where we are constantly getting slower and slower and slower.
And we have some peaks there, which have these peaks are don't have a reason to be there because the data are the same all the time.
So this is most likely the flaw in the code.
There is something happening, what shouldn't be happening.
And it can be the database flashing out to the score.
It can be basically anything, but you know that there might be a problem.
You have all other metrics.
You can have metrics from the databases that will show you that flashing was happening or anything.
But this is what you have to look for.
These, it will certainly be different for your application than ours.
You will have to define what you are looking for on the star.
But that's the main thing.
And the more funny example is this.
No, here.
This is Jmx metrics from our, from Libyseum.
And it shows you the size of the queue.
Internal queue of the Bizm.
And that basically means that from the database you are reading to internal queue.
So once the queue is zero, you are not reading.
You know, but we are still processing, right?
So there should, there is some mistake.
And this is actually the problem with the scrapers.
Because if the scraper takes each second, the database is pretty fast and it can empty up the,
it can empty the queue during that time.
And if the scraper hits the right time, it will give you zero.
You know, so from this until the end, the graph is all wrong.
It's not true.
And it's all because of the speed of the scraper.
Because it basically hit the wrong time.
So that is the thing that you have to be worried about because this will happen surely.
And these are some other graphs.
These are, I would say, more wild.
I would say it's from the start of our playing with the performance.
But the top one is also, it's pretty cool.
It's not that constant as it was for the previous one.
But it's still in some borders, you know, we are somehow oscillating, but there is not really clear way.
But the queue size is okay now.
You can see because you have some data there, but not zero.
So there's an issue with the scrapers, as I said.
And this is actually the thing that you will have to look for the patterns.
Really important in graphs.
You can see all the different ones and there are a lot of papers on the Internet that you can find.
What to look for on your specific application.
So, yeah, but not look to the numbers.
Numbers don't say you anything.
You can usually get the higher numbers if you change up the hardware that you are running on.
But if you can optimize it on the some hardware that you have, you will surely get the big numbers on almost anywhere.
So some tips and tricks for me.
During the way that we have started playing with the performance, we have developed a lot of tools.
First is the database manipulation tool, which is effectively giving you a Json API.
And just with the Json API, you can create DML for almost any databases.
We have now probably MySQL, Postgres, and Oracle there.
So it's just you don't need to have a lot of different JDBC connectors in your code.
We'll just deploy this and it will take care of it.
We have also implemented some kind of the load generator that can generate you.
Load for, I would say, constant load, P-codes and all this kind of stuff.
We have also some automation and the other, like MySQL auto-tune, that's it.
We are pretty proud of that because it can basically tune your MySQL to the whole VM that you are running it or physical machine.
And you would say that it's easy, but it's not.
You know it's hard when you look on the seven or eight page of the Google.
You know, in this phase, you know, we are probably not on the good shape and this is one of the things.
So please take a look if you are working on MySQL.
We have some counting of the parameters for the database there.
And it will save you a lot of time if you want to tune up your database.
We have spent the time for you.
And secondly, we have implemented the metadata to Prometheus Creper.
So you can get rid of one, one struggle point in your monitoring environment.
And we are also starting working on the fast Creper for the, for your monitoring, for our monitoring stack.
But it's not done yet because it's quite more complicated than before.
So, but yeah, please take a look.
I will have the links on our GitHub and everything.
It's all open source. So you can just also add some code there if you want.
Yeah, so, so I have started quite early on than I should.
So I have some time now then.
But okay, we can just summarize everything now.
And I hope for some discussion before you guys who done performance testing.
So for me, don't be scared around the performance testing.
He's not like some, some monster.
People are mostly just like creating the monster from it.
But if you don't need that for some legal options or anything, it's fine.
You can play with it. It's funny.
You will gain a lot of knowledge about your, about your product on that.
And especially if you are QE, I mean, a lot of QE folks don't have the
necessary knowledge about the product itself.
And this helps a lot to get through everything because in the end,
you will, you will just go through the code and look for, for the mistakes or
something like that. So that really helps a lot.
So, gather all the metrics you can.
And well, we are also writing our blog and all the repositories will be on the
other side before the two, before the two links.
So I would be happy to hear from you, you know, like repository or organization
before the two links.
And yeah, that's probably it for, for my talk.
As I said, I have started a lot earlier.
So thank you very much for listening to me.
Please do have some questions.
Yeah.
So my question would be, so what kind of experience do you have in your
complex system? And then you see something happens there.
And say, okay, here, here's the latest EP or something like that.
Which experience do you have with, let's say, find the cause of the problem?
Cause, so when something happens randomly, you will see it with wrap and say,
okay, something happens there. And that's maybe annoying, especially when it
happens, happens randomly.
So what kind of strategies you are using then when you know, okay, there is
something to find cause of the problem in the complex chain.
Yeah, so this is, okay.
So, so the question was actually, if there are some, some changes in the
environment, some like latency things or everything, how we can deal with that
and how we can find the causes of the problems.
Yeah, so surely this is the main problem of the whole performance testing
outside of the like isolating and, you know, well, you need the metrics
from everything because then you somehow at least it will help you to, to get
all the things in the right timeline and you can see the need and picks what
could happen. But if it's like something that is really bad, you can find it
usually because it will mostly, it will just disappear in all the logs because
it can be something like if you have the smaller machines, I have to write
once on the, some microchips, you know, it was funny thing that you fill up your
TCP queue. You cannot find that anywhere in the world.
So at that case, you will just repeat the testing, even if it's take long
and you will see if same thing happens or not. I don't have any other like
recommendation for that because this is like really main problem if you are
doing it outside of the ideal environment. This surely will face this, but
mostly it's not happening that often, I would say, because you can have
observation for and tracing for a lot of things and most of the times you can
like colorate those things together. So you exactly know what is wrong,
especially for the network. You can get a lot of network traffic, like, how is
the word? You can see all the traffic and what is going on, especially on one
line. So then you can usually put those graphs together and you know at the time.
If that is okay, answer for you.
Yeah, yeah, yeah.
Thanks for the talk. I was just wondering that how to use the traces,
analyzing the traces, because I've seen that you mentioned metrics and traces.
Sorry, can you speak more louder?
Oh, yeah. Can you hear me now?
Yes.
Yeah, I was wondering how do you use the traces for performance testing,
because when you collect the traces, how do you deal with the sampling of the
traces? And if you miss something because the sampling is bad or you are not
sampling everything, maybe you have to infer something from the metrics and the
traces, I was wondering how do you deal with the traces? And if you use distributed
tracing in a large project like collecting all this kind of stuff.
I'm not sure. I understand the question.
The question actually is, I've seen that you're using collecting the metrics and then
you are analyzing the metrics.
Yeah.
And what about the traces?
Yeah, so, well, the business does not have really that amount of traces that we could
get from it. We have mostly like JMX metrics, you know, from the Java environment.
So that's for us what we analyze. And I'm not sure how can I answer more for the
questions. So I'm sorry. We can discuss it later.
I'm coming to you.
Okay, so my question is about the long running tests. Sometimes the performance
validation is visible only after long run, for example, one week, couple of weeks and
sometimes even more. So how do you address this in your process or how do you recommend
to address this problem?
Yes. So for this, actually colleagues of mine as part of our open source organization,
they are also developing the long running cluster environment, something like that is
like, because, you know, having a long running thing is complicated on itself because you
have to manage it a lot, especially on OpenShift or Kubernetes or these kind of things is like
little problematic before the upgrades and this kind of stuff.
So we have not dealt with that yet, but we are planning that once we are okay, that we
manage that we have everything prepared for like databases and everything, we want to
get the up running on the long running clusters and like regularly doing the performance tests
over, I would say a month or something like that or a week, it usually is enough, even
especially when you put all the numbers to the low ranges for the retention for the memory
and all this kind of stuff. It doesn't take too long to fill everything when you will
start to see the retention and flashes and everything. So yeah, that's our plan, but we
haven't done it yet. So yeah, but if you are interested in that, you should definitely
look in the hub that we have in the repository because it could be useful.
Do you have any tips for running performance tests in the cloud? Because for me that's quite
the opposite of dedicated test runners. But when the software runs finally in the cloud
you should probably also performance test it there.
It's a problem. A big one. We have tried it and it is so inconsistent. The results are
so all over the place. If you run, you have two same clusters, Kubernetes, OpenShifting
doesn't matter actually, and you run the tests on the same clusters at the same time,
clusters are in different zones on AWS, you will get entirely different results because
all the load balancers and these kind of stuff, it really, you know, if you have only
internal communication on the cluster, we found anything from the outside. It could be
actually doable, I think, but if you have any communication during the test that is going
outside the cluster and it has to go through the load balancers and these kind of stuff,
I think that's not doable in any way because it's like you don't know what latency we will
have for this kind of the request and travels. So I think that that would be like really
problematic, but if you could mock up the external communication with just some internal
endpoint, it should be quite okay-ish, I would say, but you will not get like really good
results from that, I think, even if you try more and more. But I think there are some
Kubernetes, some special Kubernetes builds that should be used for these kind of measurements,
but I have never actually tried it, so I cannot recommend it when I try it, when I
try it, but yeah, this is definitely a problem. Okay, so I have more questions?
No, we have a few more minutes for questions. Come on. Otherwise, I'm going to ask you to, you know,
to move your seats. Wait, wait.
You said you would want to have a very small statement developed, so in the milliseconds
ago. Yeah. So doesn't that create problems on their own, something like noisy neighbors and so on?
Yes, it does. It does. Right, but...
How big of a problem that is?
Well, that's the thing. We are really thinking about writing some scraper that is fast enough for this,
and yes, you will probably generate some problems during the way, especially if you would
like to send the metrics directly to the Prometheus every millisecond. You will probably fill up the
network line or the TCP stack or whatever, because it's really fast. So you will...
It will strongly depend on the machine that you are running and if you have space there, if you have a
lot of RAM, you could actually batch all the metrics and send it like one package after the test is done.
But yes, that's actually what we are now fighting with and we are trying to figure out how we are going to
aim for this. But mostly we are thinking that we will do somehow a configurable scraper that will either batch the
request or send them or something like that. I cannot say you because we haven't tried it, actually, what
problems it makes, especially with batching, because I have counted it up and metrics aren't small, actually.
So it will take a lot of place in the memory. So we will have to try it and somehow figure it out.
But before the fast scraper, it will give you really headaches because you will try to find something and
fight something and you will spend 20 times debugging it and then you will find that the scraper hit it the wrong
time every time. So we have to deal with this in some way. But it will be hard and problematic.
I think we have time for one last question.
No one?
Tough crowd.
Thank you very much.