We'll look at something very basic: the load average, the number you see at the top of top when
you look at the performance of your server.
It is very basic, but there is a lot of misunderstanding around it, and the goal is really to understand whether it's
useful or not, or at least how it works.
I usually do that as a live demo, but I'm not sure about the Wi-Fi.
I think I've lost the connection, but I have some recordings.
Basically, what we will do, we will look at what we have in top.
So this is not moving because I lost the connection, but we will see later on recordings.
You can start to think about it.
I have run something that you can see in the processes there.
I have two CPUs.
I have a load average of 32 for a long time.
I don't know if you care, but I have 99% of wait IO.
Basically, my question to you is, do I have a problem or not?
Am I bound on a resource or not?
If I'm bound on a resource, am I bound on CPU, or IO, or memory, or whatever?
It's a simple question, but I see a lot of people who cannot really answer it.
The goal of the presentation is to tell you that you can mostly ignore the numbers
at the top of top, because those are about the system and the processors. What you care
about for your application performance is the tasks that are running, and looking at those is
probably more useful.
Going back to the slides where I have the recordings of all the demos, so we will not
try to reconnect to the Wi-Fi.
Also, besides that screenshot of what we have just seen: for people using the cloud, cloud providers like
to provide nice graphs about performance, and usually they put the load average and the
CPU usage first.
Typically, I have two processors.
I have a load average of 30 and my CPU is doing nothing.
Memory is 100%.
What do they want to tell us with that? Most systems will have memory usage at 100%, and that's
probably cool.
We will look at that in the next 20 minutes.
First, this is the recording of what I wanted to show you.
It is exactly the same thing that was running.
You see the load average, the number of CPUs, the wait IO, there.
What do you think about it?
Who thinks I'm bound on CPU?
Who thinks I'm bound on IO?
Who thinks I'm bound on IO because I have wait IO?
Fewer people.
That's already good.
Here, we see a high wait IO, but maybe I can advance in the recording.
What I show in this case, when people think that I have a problem with IO, is just to
run something else.
Let me check where it is in the recording.
If I have the wrong recording, I will just explain what I show usually.
Sorry, maybe it's in the next recording.
What we see is the load average and a high wait IO, but the most important thing, what I really care
about, is this: the state of the tasks.
Who thinks I am bound on IO because of the D state?
For me, this D state gives me a clue that most of my processes are waiting on IO.
Probably.
We will see that it's not an exact science, but it is something that can give some clues.
I'm lost in my slides.
This is the next one.
I'm running yes.
You know the yes command?
It displays yes.
I'm still running the same IO there, with the same throughput.
I'm doing exactly the same, and my wait IO has decreased.
This is how to solve wait IO: just run something else.
I show that to explain that wait IO is not about what your tasks are doing.
It's about the CPUs.
When you do IO, you don't need CPU, so you wait.
If no one else wants to do something on the CPU, then the CPU state just remembers that,
okay, I'm idle because someone is doing some IO.
Now I'm running something else that uses this CPU.
This CPU is not idle.
Wait IO just means idle, and idle because the last task on that CPU did some IO.
The only information I get from wait IO is that the CPU could be used for something
more useful than waiting. It doesn't really tell me that I have a lot
of IO, because depending on the other workload, I will see it there or not.
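To make the point about wait IO concrete, here is a minimal Python sketch that computes CPU percentages from two samples of the aggregate cpu line of /proc/stat. The field order comes from the /proc/stat format; the helper names and the three sample lines are made up for illustration and are not taken from the demo.

```python
# Field order of the aggregate "cpu" line in /proc/stat (values are clock ticks).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in delta.items()}


# Hypothetical samples: the same IO workload, first alone,
# then next to a CPU-bound task such as `yes`.
START = "cpu 1000 0 500 2000 8000 0 0 0 0 0"
IO_ONLY = "cpu 1010 0 510 2000 8180 0 0 0 0 0"
IO_PLUS_YES = "cpu 1180 0 520 2000 8000 0 0 0 0 0"

if __name__ == "__main__":
    print("IO only:    ", cpu_percentages(START, IO_ONLY))
    print("IO plus yes:", cpu_percentages(START, IO_PLUS_YES))
```

In the made-up second interval the IO is unchanged, but the ticks that were accounted as iowait are now accounted as user time, which is the effect described above.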
The state doesn't lie. If my processes are all in the D state,
at least they are not in the R state, the runnable state, so they are not using CPU.
In the next one, to better understand the kind of IO I'm doing, the kind of system
call that puts the process in D state, I just run strace on my processes. I ran
strace -c to count the calls, and you see that most of the system calls are pwrite.
That's actually what I'm running there.
I'm doing writes with the pwrite system call, with direct IO.
That's basically what I have there.
If I really want to understand what is behind a state that is not the R state, the runnable
state, I can trace the system calls to know exactly why.
I will explain why I'm looking at that: even if D looks like disk, you can do some
IO that is not in D state, and you can have D state that has nothing to do with IO.
So it can be misleading.
The D state is uninterruptible sleep.
So your process has something to do that is not on CPU, and does it in an uninterruptible
state.
Depending on the system call, the wait can be uninterruptible or not.
Often IO like pwrite uses this, but there are other kinds of IO.
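As an illustration of looking at task states rather than at the summary numbers, here is a small Python sketch that counts processes per state from /proc/<pid>/stat. The function names and the sample lines are hypothetical. It only looks at processes, not at individual threads, and it falls back to the sample lines when there is no /proc.

```python
import os
from collections import Counter


def task_state(stat_line):
    """Extract the one-letter state from the content of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last closing parenthesis.
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def count_states(stat_lines):
    """Count tasks per state (R, S, D, T, Z, ...)."""
    return Counter(task_state(line) for line in stat_lines)


def read_proc_stat_lines(proc="/proc"):
    """Yield the content of /proc/<pid>/stat for every process (Linux only)."""
    for entry in os.listdir(proc):
        if entry.isdigit():
            try:
                with open(os.path.join(proc, entry, "stat")) as f:
                    yield f.read()
            except OSError:
                continue  # the process exited while we were scanning


# Made-up sample lines, truncated to the first few fields.
SAMPLE = [
    "101 (fio) D 1 101 101 0",
    "102 (fio) D 1 102 102 0",
    "103 (yes) R 1 103 103 0",
    "104 (my (odd) name) S 1 104 104 0",
]

if __name__ == "__main__":
    lines = list(read_proc_stat_lines()) if os.path.isdir("/proc") else SAMPLE
    print(count_states(lines))
```

A count like this only says which state the tasks are in. To know which system call put them there, tracing is still needed.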
Any questions so far?
Any remarks?
Okay.
So next one.
I will run something else, if I remember exactly what I'm doing here.
I will run fio.
The difference is that I'm not calling the pwrite system call.
I'm calling libaio, the asynchronous IO library.
Basically I'm doing the same writes to the disk with direct IO, and you can see the throughput
is mostly the same.
However, I'm not in D state anymore.
So some IO puts the process in D state, but some IO just puts it in the sleep
state, which is not uninterruptible.
So it's very misleading when you see those things and try to guess what happens.
If you strace, there is no guessing.
You know exactly the system call.
And I think this is what I do just after.
If I strace, I see that most of the IO calls here are io_getevents, and there is some
io_submit.
This is how asynchronous IO works.
pwrite just asks the kernel, I want these blocks, and waits to get those blocks.
With asynchronous IO, you tell the kernel, I will need those blocks.
That's the submit. Then you can work on something else and come back and say, oh,
do you have my IO?
If not, I will wait.
The submit goes in D state, but it's very short because it's just a submit.
The get events, if it waits, goes in the sleep state, the S state, and not the D state.
Depending on the kind of IO, you will see it in D state or not.
And the wait IO there depends on the state, but more important, I don't know if I can
go back.
Well, I'm sure I can go back if I replay it.
I guess that the load average was lower when I was running that, because the D state counts
in the load average and the S state doesn't.
That means that some IO counts in the load average and some IO doesn't.
It means that with load average, you don't really know what happens.
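The difference between the two demos can be sketched in a few lines of Python: only tasks in R or D state are counted as active for the load average. The figure of 32 writers per snapshot is an assumption made for the example, not a number read from the recordings.

```python
# Task states that Linux counts as "active" for the load average:
# R (running or runnable) and D (uninterruptible sleep).
COUNTED_IN_LOAD = {"R", "D"}


def active_tasks(states):
    """Number of tasks that would contribute to the load average."""
    return sum(1 for state in states if state in COUNTED_IN_LOAD)


# Hypothetical snapshots: 32 writers doing the same IO through two interfaces.
SYNC_WRITERS = ["D"] * 32   # blocked in pwrite, uninterruptible
ASYNC_WRITERS = ["S"] * 32  # sleeping in io_getevents, interruptible

if __name__ == "__main__":
    print("sync writers  ->", active_tasks(SYNC_WRITERS))
    print("async writers ->", active_tasks(ASYNC_WRITERS))
```

Same throughput, same disk, but one workload contributes to the load average and the other does not.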
Okay.
The next one, I'm running something else.
So those were direct writes, bypassing the buffer cache.
And here I'm running reads, and I set direct=0 in fio.
fio just simulates different kinds of IO.
Typically I work with databases.
I'm a developer advocate for YugabyteDB, a distributed SQL database compatible
with Postgres.
I've also been working a lot with Oracle.
They do these kinds of IO. Postgres does not do direct IO.
It goes through the buffer cache.
With Oracle, you have the choice.
So it really depends.
Here, what I would like to show you, I don't see it from here, but I'm probably in the
running state.
Yeah, it was not sorted.
But here, I'm mostly reading from memory, from the cache, from buffers.
And this is why you see it is much faster.
And one difference: I'm using more CPU there.
You access memory more than you access the disks.
And this shows up in CPU usage, in the kernel part of the read.
I mean, my application is doing the same.
Just an IO call.
So the user space CPU is still low.
But on the system, on the kernel, what Linux does is read from memory.
And this is where you have some system CPU there.
That counts in the load average also.
I just, okay, in the meantime, I ran strace to see the reads there.
So I have preads, the same kind of system call.
What is different is what is behind.
It reads from the buffer cache.
And I don't know if you have seen it.
When I was attaching with strace, the state here was T. That's the state when you attach.
And of course, it has a little overhead.
You do that to troubleshoot.
The important thing is the runnable state.
It says that either I'm running on CPU or I want to run on CPU.
And I don't know which one from those metrics.
That's the point.
I have only two CPUs.
So I know that I cannot have more than two tasks running on CPU.
The others are runnable.
They are waiting in the run queue to be able to run on the CPU.
Top will not show that figure.
Load average adds up those waiting and those running.
If you want to see the difference, you need to look at the statistics from the scheduler
in /proc/schedstat, or at vmstat, which shows you the run queue.
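One simple way to separate "running" from "waiting for a CPU" is sketched below in Python. It reads the procs_running and procs_blocked counters of /proc/stat, which, as far as I know, are the same counters behind the r and b columns of vmstat. The function name and the sample text are invented for the example.

```python
def run_queue_summary(proc_stat_text, cpu_count):
    """Summarize runnable and blocked tasks from the content of /proc/stat."""
    counters = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("procs_running", "procs_blocked"):
            counters[fields[0]] = int(fields[1])
    runnable = counters.get("procs_running", 0)
    return {
        "runnable": runnable,  # running on a CPU or waiting for one
        "blocked": counters.get("procs_blocked", 0),  # waiting for IO to complete
        # At most cpu_count tasks can really be on CPU, the rest are queued.
        "waiting_for_cpu": max(0, runnable - cpu_count),
    }


# Made-up content of /proc/stat for a machine with two CPUs.
SAMPLE_PROC_STAT = """\
cpu  1180 0 520 2000 8000 0 0 0 0 0
cpu0 600 0 260 1000 4000 0 0 0 0 0
cpu1 580 0 260 1000 4000 0 0 0 0 0
procs_running 10
procs_blocked 3
"""

if __name__ == "__main__":
    print(run_queue_summary(SAMPLE_PROC_STAT, cpu_count=2))
```

On a real system the text would come from reading /proc/stat and the CPU count from os.cpu_count().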
I'm saying that because I've seen a lot of people comparing the load average with the
number of CPUs.
Like, if the load average is higher than the number of CPUs, I have a problem.
Maybe not, because if the load average is due to IO, you don't really care about comparing
it with the number of CPUs.
But if the load average is high because you have a lot of processes in the run queue,
then you probably have a problem, because you have tasks that need to run something on the
CPU and just cannot, and are waiting behind others.
So we have seen different kinds of IO, and they look different.
Many times, especially with databases, I've seen different teams:
the Linux team looking at the system and the DBA team looking at the database.
And in many companies, they don't really talk to each other.
So one is guessing what the other is doing, and there is a lot of misinterpretation of all that.
It's very important if you look at the numbers from the system to understand what the database
is doing.
And it's also very important for the database administrator to look at the system, because
many things in the database metrics will be different if the system is overloaded.
To give a quick example: in Oracle, you have wait events where you can know exactly how
much time you spend on IO.
But it's not exactly how much time you spend on IO.
It's the time between the timestamp it takes before the IO and the one it takes after the IO.
If your process is in the run queue, the database thinks that it is doing IO, but maybe the
IO is done and it's just waiting to get back on the CPU to take the
timestamp.
So that's also the message.
I say that for database administrators, but it applies to any application: if you run on a system that
is overloaded on CPU, then probably all of its metrics are wrong, because they require CPU cycles
to get the numbers.
So why did I call that silly metrics?
I didn't come up with this.
If you want to understand what load average measures, Linux is open source, so just look
at the source.
And you can look at the source, but more interesting are the comments, which can explain the intention
of the function.
And so in Linux, the load average is described like this: this file, the source for load average,
contains the magic bits required to compute the global load average figure.
It is a silly metric, but people think it is important.
So you see why you see that first in top?
It is silly, but some people think it is important, so let's give them something.
And we go through great pains to make it work on big machines and tickless kernels.
So the load average idea comes from Unix systems, where it was really measuring the load on
CPU, and where it was easier to measure because you just counted the ticks in the
scheduler.
Linux works differently, which means that it is difficult to measure, and maybe it doesn't make
much sense.
So yeah, it's good to know why this metric is there: just because people coming from Unix
were used to having this single graph showing the load and comparing it with the application
and what is done in the application. But if you don't look at the state of the processes,
it can be misleading.
It's easy to understand exactly why we see the D state, these IO calls, in the load average:
it's just the way it is calculated.
There are two things that are interesting in the way it is calculated.
First, it is an average, and that's also a problem.
If you look at the load average, you will not see a peak of activity of five seconds,
because it is averaged.
The other thing is that it counts the number of active tasks, so the running state, which is
really the runnable state, because if you are in the run queue you are not really running. And it also counts
the uninterruptible calls, just because they thought: if we show only the CPU load,
is it really the load of the machine?
For example, you run a database doing a lot of IO calls.
Then we would say that the load is low while everyone is waiting on the disk.
So let's add the uninterruptible state, because in many cases those IO calls are
uninterruptible calls. But they are not always, so it can be quite misleading.
It doesn't mean that you don't have to look at it, but if you look at it and know what
is behind it, then it can give you some clues, like the clue about IO calls, together with other
things. But more interesting is the process state.
A process can have something to run on the CPU, and then you look at the scheduler statistics
to know if it waits for the CPU or if there is CPU available. And when it has some calls
to do, they can be done in D state or S state, and they will be accounted differently
by the load average.
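The averaging itself can be sketched as an exponentially damped moving average of the active task count (R plus D), sampled about every five seconds. This is a floating-point approximation written for illustration: the kernel uses fixed-point arithmetic, and the function names here are made up.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between two samples of the active task count


def update_load(load, active, window_seconds):
    """One step of the exponentially damped moving average."""
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    return load * decay + active * (1.0 - decay)


def simulate(active_per_sample, window_seconds=60.0, load=0.0):
    """Feed a sequence of active task counts and return the final load."""
    for active in active_per_sample:
        load = update_load(load, active, window_seconds)
    return load


if __name__ == "__main__":
    # A 5-second burst of 32 active tasks on an idle machine: one sample.
    print("after a 5s burst: %.2f" % simulate([32]))
    # The same 32 tasks staying active for 10 minutes: 120 samples.
    print("after 10 minutes: %.2f" % simulate([32] * 120))
```

With a one-minute window, a single sample of 32 active tasks moves the average by less than a tenth of that count, which is why short peaks are hardly visible in the load average.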
Any questions so far?
Okay, the next one is more about memory, just because it's another thing that is misleading
in some cases.
I think it is quite clear in top that you can look at the available memory, but I see
cloud providers showing the used memory or the free memory, and here I just want to explain,
for those who don't know, what happens if you do buffered IO like I did with direct=0.
Okay, I thought we have five minutes now.
Okay, perfect.
So I will finish quickly on that.
Do not look at the free memory.
I'm just showing that if I do some IO, it will take some free memory, but that memory is easily
freed if needed. Look at the available memory.
That's the memory that is available to your processes, but also think about what available means.
You can use it, but if you use it, then another process doing buffered IO may not find its
data in the cache.
So if it is available, it doesn't mean that using it is free from any impact on the others.
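To see the gap between free and available memory, here is a short Python sketch that parses /proc/meminfo. MemFree and MemAvailable are real field names on reasonably recent kernels; the helper names and the sample values are invented to mimic a machine that has been doing a lot of buffered IO.

```python
def parse_meminfo(text):
    """Parse the content of /proc/meminfo into a dict of integers (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            values[key.strip()] = int(rest.split()[0])
    return values


def memory_summary(meminfo):
    """Free and available memory as a percentage of total memory."""
    total = meminfo["MemTotal"]
    return {
        "free_pct": 100.0 * meminfo["MemFree"] / total,
        "available_pct": 100.0 * meminfo["MemAvailable"] / total,
    }


# Made-up values: almost nothing is free, but most of the memory
# is page cache that can be reclaimed.
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:          200000 kB
MemAvailable:    6400000 kB
Buffers:          100000 kB
Cached:          6300000 kB
"""

if __name__ == "__main__":
    print(memory_summary(parse_meminfo(SAMPLE_MEMINFO)))
```

A dashboard built on the free percentage would show this sample machine as nearly full, while most of its memory could still be handed to a process that asks for it.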
Okay, I'll just play the last one while I'm talking and taking questions.
The idea there was just to show a really silly program doing vfork, which has nothing to do
with the data, just to show that it will go to D state and increase the load
average. That's a case I've seen on some systems where the load average was in the thousands,
on a database having its files on NFS, with network issues. Those uninterruptible calls
increased the load average, but without any consequence, because they were doing nothing.
The only thing is that it's ugly when you look at the load average, and the other thing is
that they are uninterruptible.
You cannot kill them.
So you want to restart the system to have nicer numbers, but of course you wait for it.
So just be careful: load average counts some IO and counts some CPU, and you have some IO
that you do not see there.
Okay, do you have any questions, remarks?
Thank you.
What about pressure stall information?
Very good question.
If you looked at the first screenshot, I was running with pressure stall information, which in my opinion
gives a better picture.
Pressure stall information is a counter telling you, during the last 10 seconds for example, not how many,
but whether there were some processes under pressure on CPU, so waiting to run on CPU, or to get IO,
or to get some memory.
So it really gives you an idea about the pressure itself.
The only concern I have about pressure stall information is that in most of the kernels, the distributions
I've seen, it is compiled into the kernel but not enabled by default.
And because it's not enabled by default, I've not seen it a lot.
But I think it's a good idea.
Each time I used pressure stall information, it gave me the right idea, but that's just a subset
of the systems I've seen, because it's not the default.
So maybe there are some cases that I don't know about where it's not perfect, but I try to encourage people
to enable pressure stall information, where instead of looking at all that, you just see that you have some
processes that could be faster if they were not under pressure on RAM, IO, or CPU.
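For reference, when pressure stall information is enabled, the kernel exposes it in /proc/pressure/cpu, /proc/pressure/io and /proc/pressure/memory. The Python sketch below parses that format. The function name and the sample values are made up; "some" is the share of time during which at least one task was stalled on the resource, and "full" the share during which all non-idle tasks were stalled.

```python
def parse_pressure(text):
    """Parse the content of a /proc/pressure/{cpu,io,memory} file."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]  # "some" or "full"
        result[kind] = {}
        for item in fields[1:]:
            name, _, value = item.partition("=")
            # avg10, avg60, avg300 are percentages; total is in microseconds.
            result[kind][name] = float(value)
    return result


# Made-up content of /proc/pressure/io on a system that is stalled on IO.
SAMPLE_IO_PRESSURE = """\
some avg10=85.00 avg60=60.50 avg300=20.25 total=123456789
full avg10=80.00 avg60=55.00 avg300=18.00 total=100000000
"""

if __name__ == "__main__":
    pressure = parse_pressure(SAMPLE_IO_PRESSURE)
    print("IO pressure over the last 10 seconds: %.1f%%" % pressure["some"]["avg10"])
```

Unlike the load average, these numbers are reported per resource, so CPU, IO and memory pressure are not mixed into a single figure.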
Okay, I think we are just...
Another question? If it's okay?
So looking at a very generic use case, if you were to redesign the cloud provider's graphs,
would you change it? What would you change it to?
Could you list maybe the five most important metrics for a generic use case that you would put on a dashboard?
On a dashboard, I think pressure stall information can be really nice, because you can show that to users.
Users running in the cloud, for example, want to know if they are under pressure on CPU or on IO,
because they pay for that.
So I would put those there.
Load average, maybe, with a clear description that it is CPU plus some IO.
And memory: available memory, not used memory, because a system doing some IO, some buffered IO,
will always use all the memory in Linux.
Maybe we have...