So the next presentation is from Costa, about Netdata, an open source distributed observability pipeline: journey and challenges. [Applause]

Hello guys. So, how many people in this room use Netdata? Oh, a lot. Okay. So guys, I have to admit something: when I get pissed off, I write code. I know, others go to the gym, but for me this is how this thing was born, actually. I was migrating some infrastructure from on-prem to the cloud, to one provider, it doesn't matter which, and we had a lot of problems. They turned out to be cloud provider bugs, not our bugs. But after spending a couple of million euros on observability, building a team of eight people and installing every tool that exists out there, and this was back in 2014, when I was a C-level executive at a FinTech company, I concluded that everything I had built was an illusion. Every observability tool that I tried was an illusion. Did it make me happy? Did it actually help me troubleshoot anything? No.

So I started thinking: okay, why did they build it like that? Why is it not real time? Why are cardinality and granularity such a problem? Why do you have to set everything up bit by bit, needing months and months of preparation to actually build something useful? And I started experimenting, writing code to see whether it could be done differently, and what that "differently" would look like. The first commit was in 2014, it is on GitHub. After a couple of years of experimentation I released it and, boom, people loved it.

So what is Netdata today? Today, Netdata is a monitoring tool that you install on a server, and it is monitoring in a box. It has all the functions of a monitoring pipeline: it discovers metrics, collects metrics, stores metrics, runs machine learning on metrics to learn their behavior, runs alerts and health checks on metrics, exports metrics to other time-series databases, streams and replicates metrics between Netdata agents, and of course it visualizes them, it has query engines, everything. It's one application: you install it on all your servers and each one is a complete monitoring system.

So if you install Netdata today on an empty VM, just go to AWS, get an empty VM and install Netdata, what are you going to get? You are going to get this: 200 dashboard charts, 2,000 unique time series, alerts monitoring more than 100 components, a logs explorer for systemd-journal, a network explorer for all your connections, and unsupervised anomaly detection for everything, so it will learn the behavior, flag anomalies and detect outliers. You are going to get about two months of retention using half a gigabyte of disk space, and all of this runs with one percent CPU of a single core, 120 megabytes of RAM and almost no disk I/O.

Now, when you install an agent, you understand that since each one is standalone, you have to go node by node to see what's happening. But then we built this: the same software can become a parent. When it becomes a parent, it's easy: you go to the other agents and say "stream there", you give them the IP of the parent, and that's it. The moment this happens, the parent can provide alarms, machine learning, dashboards and everything for all the infrastructure behind it. And we didn't stop there: you can have as many layers of this as you want.
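To make the parent/child layering a bit more concrete, here is a toy Python sketch of the idea: each node keeps its own metrics and, if it has a parent configured, forwards every sample upwards, so a parent (or grandparent) ends up with the data of everything behind it. This is only an illustration of the topology; it is not Netdata's streaming protocol or its stream.conf configuration.

```python
# Toy sketch of the parent/child streaming idea described above -- NOT
# Netdata's real protocol or configuration, just an illustration of how
# each node keeps its own metrics while optionally forwarding every
# sample to a parent, and parents can themselves forward to grandparents.

from collections import defaultdict

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent              # another Node, or None if standalone
        self.metrics = defaultdict(list)  # metric name -> list of (ts, value)

    def collect(self, metric, ts, value):
        """Store a locally collected sample, then stream it upwards."""
        self.metrics[metric].append((ts, value))
        if self.parent is not None:
            # The parent stores the child's data under the child's name,
            # so it can alert on and visualize everything behind it.
            self.parent.receive(self.name, metric, ts, value)

    def receive(self, child, metric, ts, value):
        self.metrics[f"{child}/{metric}"].append((ts, value))
        if self.parent is not None:       # parents can have parents (regions, etc.)
            self.parent.receive(f"{self.name}/{child}", metric, ts, value)

# Example: a web server streams to a data-center parent, which streams
# to a region-level grandparent.
region = Node("eu")
dc1 = Node("dc1", parent=region)
web1 = Node("web1", parent=dc1)
web1.collect("cpu.user", ts=1, value=12.5)
print(dict(region.metrics))   # {'dc1/web1/cpu.user': [(1, 12.5)]}
```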
So you can have data centers, each with a parent cluster, and then regions with another cluster of parents on top, et cetera, et cetera. This scales well. We tested a Netdata parent with 500 nodes and 40,000 containers against Prometheus; we ran the same setup with Prometheus in parallel. Netdata used one third less CPU, half the memory, 10% less bandwidth and almost no disk I/O. And in the same storage footprint, we gave both three terabytes, we managed to keep more than a year of retention. More than a year of retention. Of course, it is in tiers.

Now, in December, the University of Amsterdam did research on the most energy efficient monitoring tool. They tested Prometheus, Netdata and others, and they found Netdata to be the most energy efficient tool. And what happens in the CNCF? Netdata actually leads the observability category of the CNCF landscape in terms of users' love. Actually, we don't want to incubate in the CNCF because of their policies; we believe Netdata should be something different, so we never applied.

So, to build this we had a number of challenges. The first challenge is that we wanted everyone using Netdata to go from zero to hero today. You don't have observability? That's okay: you install Netdata and you have observability today. How did we do that? The first thing is an observation: we all have the same kind of infrastructure. It's a finite set. We use virtual or physical servers, we use applications, databases, web servers, message brokers. All these are standard components. We use similar disks, similar interfaces. Everything is similar. Of course, the Lego we build is different, each company is unique, but the components are pretty much the same. So what would happen if we started monitoring the components? What would happen if the monitoring solution knew when a Postgres is healthy, when a Kafka is healthy, when an Nginx or an HAProxy is healthy? What if the monitoring system itself knew whether the networking stack of this Linux box is okay, and how to monitor it, how to present it, how to visualize it? So the observation that we have a lot in common allowed us to think a little differently.

So, how many of you work with Prometheus? Okay. So you know the Prometheus OpenMetrics format: you have a metric name, then some labels, and a value that changes over time. That's the thing. What we did there is say: wait a moment, not all labels are the same. The labels that depict, for example, that this is reads or writes on a disk's I/O are not labels. They should not be labels; they should be something different. And we came up with the NIDL framework. The NIDL framework says there is a context: the context is the metric name, exactly the same as in Prometheus, the name that comes before the labels. Then we have instances, which do not exist in Prometheus. Instances are all the labels except the ones that get time-series values. So if we're talking about disks, it's the make of the disk, the type of the disk, the number of the disk; all these are labels that enrich an instance. The reads and writes of its performance are the time series.

So what does this allow us to do? The first thing is that it allows us to configure dashboards at the context level. We don't care about the individual metrics anymore; to provide a meaningful dashboard, we only care about the context.
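Here is a rough Python sketch of that context/instance split, compared with a flat Prometheus-style sample. The field names and the `Instance` class are purely illustrative, not Netdata's internal data structures.

```python
# Rough sketch of the context/instance split described above, compared to a
# flat Prometheus-style sample. Names here are illustrative only.

from dataclasses import dataclass, field

# Prometheus/OpenMetrics view: one metric name, all labels treated equally.
prom_sample = {
    "name": "disk_io_bytes_total",
    "labels": {"device": "nvme0n1", "model": "Samsung 980", "direction": "read"},
    "value": 123456.0,
}

@dataclass
class Instance:
    context: str                                     # e.g. "disk.io" -- the metric family
    name: str                                        # e.g. "nvme0n1" -- which disk this is
    labels: dict = field(default_factory=dict)       # descriptive only: model, type...
    dimensions: dict = field(default_factory=dict)   # the actual time series: read, write

# The same data in this shape: "direction" is no longer a label, it is a
# dimension of the instance, while model/type stay as enrichment labels.
nvme0n1 = Instance(
    context="disk.io",
    name="nvme0n1",
    labels={"model": "Samsung 980"},
    dimensions={"read": 123456.0, "write": 98765.0},
)

# Dashboards and alerts can now target the context ("disk.io") and apply
# automatically to every instance, whatever its labels are.
```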
Then it allows us to configure alerts per context, which are applied to all instances. So we can say: hey, apply this alert to all NVMe disks. Apply this alert to all Postgres databases. Apply this alert to all Postgres databases before version X or Y. So the first thing is that this allows us to have fully automated visualization. You don't need to configure visualization with Netdata; it just comes up. The same with alerts: they come up, there are 344 of them. The moment something happens, boom, an alert from Netdata will fire. This slide is not that important; it shows how the labels are matched, whether you look at it from the disk perspective, the container perspective or the database perspective, and it's internal information for us. So this allowed us to have all of this fully automated, as I said, and Netdata is the only monitoring tool that you can install mid-crisis. You have an incident happening now, you don't have observability: you install Netdata, alarms will fire up, and the dashboard will be there for you to investigate.

The second challenge was: can we get rid of the query language for slicing and dicing the data? To understand this, remember that you come to a Netdata dashboard and suddenly you see 2,000 metrics. You are not aware of them, you just installed it. So what can we do so that you can make sense of all this, so that you can look at the dashboard and say: okay, I understand, I understand what I need to do?

So this is a Netdata chart. On the charts, we did the following. Of course, there is an info button: what is this? But then we added a number of controls on the chart that allow you to slice and dice, see anomaly rates, make sense of the thing. Each menu that you see here has an analysis, and this analysis allows you to see, for the nodes for example, all the information described below, from the same view. This is not another query. The biggest change we made is that Netdata calculates this for every query. Think of your data as a cube: you can look at it in different ways. What we do is provide the time series for one of the views, one of the sides of the cube, but we also calculate summary numbers for every possible pivot you could do, so that just by browsing the menus you can grasp what the data says. The same goes for group-by. If you want to slice the data, we do exactly the same: we give you all the possible options you can slice by in a drop-down menu. Two clicks, boom, a different view on the cube.

And this is how I found the initial problem, do you remember, the migration I mentioned? The bug at the cloud provider was that they were pausing the VMs to perform some of their updates, but they were pausing the VMs in little increments. So suddenly, one Friday, all the VMs were running in slow motion, because they were being stopped and started, stopped and started, stopped and started like this. With Netdata I was able to find it, because every data collection that is missed is a gap in Netdata; it is simply not in the chart. Grafana and the other tools smooth it out, because there is no beat. In Netdata there is a beat: I have to collect this every second, and if I miss one, it's a gap, I didn't collect it.
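To make the "summary numbers for every pivot" idea above more concrete, here is a toy Python sketch: one pass over labeled points produces a total for every value of every label key, so a menu can show per-node, per-disk or per-dimension numbers without issuing a new query. This is just the concept, not Netdata's query engine.

```python
# Toy sketch of the "cube" idea: serve one view as time series, but also
# pre-compute a summary number for every other pivot (label key), so the
# menus can show e.g. per-node or per-disk totals without a new query.

from collections import defaultdict

# Each point: (timestamp, value, labels)
points = [
    (1, 10.0, {"node": "web1", "disk": "nvme0n1", "dimension": "read"}),
    (1,  4.0, {"node": "web1", "disk": "nvme0n1", "dimension": "write"}),
    (1,  7.0, {"node": "web2", "disk": "sda",     "dimension": "read"}),
    (2, 12.0, {"node": "web1", "disk": "nvme0n1", "dimension": "read"}),
]

def pivot_summaries(points):
    """For every label key, sum the values per label value, in one pass."""
    summaries = defaultdict(lambda: defaultdict(float))
    for _ts, value, labels in points:
        for key, val in labels.items():
            summaries[key][val] += value
    return summaries

for key, totals in pivot_summaries(points).items():
    print(key, dict(totals))
# node {'web1': 26.0, 'web2': 7.0}
# disk {'nvme0n1': 26.0, 'sda': 7.0}
# dimension {'read': 29.0, 'write': 4.0}
```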
So, another one of the challenges, and I am going fast because I only have five minutes, guys: machine learning. Google announced this in 2019. They said: come on, we cannot make machine learning work for observability, all our ideas are bad and, as Todd says here, we should feel bad about it. So Netdata does the following: for every metric collected, it trains 18 different models. These models cover the last few days of the metric's behavior, so for the last few days Netdata knows how the metric behaves, with machine learning. Now, every time we get a new sample, we check it against these 18 models, and only if all of them agree that it is anomalous do we say: okay guys, this is an outlier, this is an anomaly. Okay? Five minutes, yeah, I know.

What does this allow? I'm going fast. There is a scoring engine inside Netdata. With the scoring engine you don't query for time series; you say: hey, I want to find similarities in the metrics, I want to know the trends in the metrics, can you score them all for me, please? All of them; it doesn't matter how many I have. To give you an example, this is a Netdata dashboard. There is a menu where we categorize everything, and it's an infinitely scrolling dashboard, one chart below the other; of course there are many tabs with different views. You press the AR button on the top right and there is an anomaly rate on every section, so you can see where your anomalies are. And we have a host-level anomaly rate. Look what happens here: when anomalies happen in our systems, they happen in clusters. Look there, it's five nodes, it's ten nodes, anomalous together; every line here is a node. Look, they happen together. And not only that: this shows the percentage of metrics in a node that are anomalous concurrently. So you understand that Netdata can detect not only whether a single metric is anomalous; it can also tell you the spread of the anomaly across your systems and inside each system. And then you highlight the area of interest, and Netdata gives you a sorted list of all the metrics that are anomalous within that region, that time frame. The idea here is that your "aha" moment, the "ah, DNS did that", "storage did that", should be in the top 20 or 30 items. Another mission accomplished. Two minutes, I know.

Logs, I will pass through quickly. In Netdata we base everything on systemd-journald. We had built our own logs management solution, like Loki and Elastic and the likes, and we scrapped it. We don't use it anymore. What we do is rely on systemd journals. Now, systemd-journal has a lot of unique features, guys, especially if you are developers: the ability to index all fields independently of their cardinality. Independently. It doesn't matter: a thousand fields, different values on every log line, all indexed. Okay? This is unique. You can make troubleshooting amazing with this. We have a UI similar to Grafana, similar to Kibana, and what you see on the screen here is 15 million entries, it's in the title, 15 million log entries. All the other tools would sample 5,000 of them at the end; we stop sampling at a million. It's fast. We stop sampling at a million. We submitted patches to systemd to make it faster, and we managed to bypass all the deficiencies of systemd on older systems. Will you give me two minutes? Okay, that was 50 seconds, I have another one. So, okay. We did a lot of work. We provide a tool, log2journal, that can be used to feed your HAProxy logs, your Nginx logs, your logfmt logs, whatever, into the systemd journal.
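Going back to the anomaly detection for a moment, here is a toy Python sketch of the "all models must agree" consensus idea: a handful of simple models, each trained on a different recent window, vote on each new sample. The z-score "models" and window sizes are invented for illustration; Netdata actually trains 18 unsupervised models per metric.

```python
# Toy illustration of the consensus idea: several models, each trained on a
# different recent window of the metric, and a sample is flagged as an
# anomaly only if ALL of them agree. The z-score "models" and window sizes
# are made up; they only stand in for the 18 models mentioned in the talk.

import random
import statistics

class WindowModel:
    def __init__(self, window):
        self.window = window            # how many recent samples to train on

    def train(self, history):
        data = history[-self.window:]
        self.mean = statistics.fmean(data)
        self.std = statistics.pstdev(data) or 1e-9

    def is_anomalous(self, value, threshold=4.0):
        return abs(value - self.mean) / self.std > threshold

history = [random.gauss(100, 5) for _ in range(3000)]   # recent samples of one metric
models = [WindowModel(w) for w in (300, 600, 1200, 2400)]
for m in models:
    m.train(history)

def consensus_anomaly(value):
    # Only an anomaly if every model agrees -- this keeps false positives low.
    return all(m.is_anomalous(value) for m in models)

print(consensus_anomaly(102.0))   # normal sample -> False
print(consensus_anomaly(500.0))   # wildly off    -> True
```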
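And on the logs side, the value of journald's field indexing is easy to see from an application's point of view: any structured fields attached to a log entry become independently queryable. This is a small example using the python-systemd bindings (assuming the systemd-python package is installed; log2journal itself is a separate command-line tool for existing log streams, and the custom field names below are arbitrary examples, not ones Netdata defines).

```python
# Example of pushing structured, fully indexed fields into systemd-journal
# from an application, using the python-systemd bindings. Every custom field
# below becomes independently queryable, e.g.:
#   journalctl MY_CUSTOMER_ID=42 -o verbose

from systemd import journal

journal.send(
    "payment request failed",
    PRIORITY=3,                     # err
    SYSLOG_IDENTIFIER="payments-api",
    MY_CUSTOMER_ID="42",            # custom fields: uppercase, any cardinality
    MY_REQUEST_ID="d3adb33f",
    MY_UPSTREAM="postgres",
)
```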
If you have a Go application, it looks something like this: boom, push everything in. And of course there is a weakness to systemd-journal, and it's mainly because it is reliable and fault tolerant, that's what it has been designed for, and it's secure, it seals the logs so they cannot be tampered with, et cetera. The key problem is the storage space that it needs. It needs a lot more compared to Loki; I don't know about Elastic, but compared to the Loki setup we tested, ten times more. You can make it a bit better, about four times more, if you use a compressed file system. So, another mission accomplished.

That's the last one. Wait, I didn't see what it was. Yes, and that's another beauty that Netdata provides. Observability, guys, is more than metrics, logs and traces. The moment you run your own infrastructure and there is an issue, metrics, logs and traces are usually not enough. You may want to see the slow queries a database has, you may want to see the sockets an application is currently connected to, you may want to see different kinds of data. Most observability solutions give up there. They say: okay, open a console. What can we do? Go to the console and figure it out. What we did is this: in Netdata there are plugins, data collection plugins, and every data collection plugin exposes functions. It says: hey, I am the Postgres collector and I can give you slow queries. Hey, I am the network viewer and I can give you the list of sockets. And we built the protocol in such a way that you go to the grand-grandparent and ask for this function to be run on that node, and it runs. Actually, the whole logging system that we have, the systemd-journal one, is built like this: there is a systemd-journal plugin there that exposes the logs.

And this is the network viewer. What we did here is very simple. You have sockets; as you can see, it's 30,000, 31,000 TCP4 sockets. The key problem was how to make sense of them. So we said: okay, on that side is the internet, on this side are the private IPs; above are clients, below are servers. The position plays a role, so you can immediately see what connects to the internet. And this is live: if this was not a screenshot and you SSH'd into the server, you would see the SSHD socket, boom, moving. This is live. Okay.

Our monetization strategy: Netdata is open source, and this is what we sell, just for clarity. We sell infinite horizontal scalability, role-based access, access from anywhere, because otherwise it's an on-prem solution, a mobile app for notifications, dynamic configuration and persistent customizations. We have a SaaS offering that does all this, and of course the same offering is also available on-prem, for a price. That's it. Thank you very much. [Applause] Questions?

Thanks, Costa. Do we have any questions here? I think this is fantastic. I've been using it for a while and I really, really enjoyed the experience. A while ago, I was working on a massive experiment: I had 700 machines that had to throw out gigabytes of information. So I started to look at Netdata and basically went with the grandparent solution. And then the central monitoring part was something that wasn't there; I mean, it's a SaaS service, it's not open source, that part. Do you think you're going to open source that part as well? Wait. The first thing is that the open source agent and the SaaS offering we have share the same UI. It's the same thing; it's not a different UI.
What Netdata Cloud can do is this: instead of collecting all the data, like if you go to Datadog you push all your data to Datadog, we don't do that. Your data stays inside Netdata, inside your servers. What Netdata Cloud does is provide you a unified view by querying all the relevant Netdata agents. It queries them, gets the data, aggregates it, and gives you the same view a parent would give you, without having the data itself. We rely on your agents being alive for this, and this is why we have the streaming. So when you have ephemeral nodes, a Kubernetes cluster for example, a node may vanish. What you do there is you need a node outside the Kubernetes cluster to be the parent for your data; this is where your data will eventually live. This also allows you to have very small retention on every server, even disable machine learning and health checks and so on on your production systems, and have a cluster of parents. Parents can be clustered, so you can have two or three or four or as many as you want in a chain, so that you can loop them and replicate one to another. And if something is missing, they will find out by themselves. Okay.

We have time for one more question. Thank you. I would like to ask which issues you have experienced when scaling up to several systems or centralizing things. The question is which issues we have experienced when we scale up to several systems or to centralized setups. As with every piece of software, there is a maximum vertical scale it can sustain. Netdata, as I showed, is better than Prometheus: in some cases almost two times better in terms of memory, one third of the CPU, et cetera. But still, there is a cap: the power of the machine you can give to it. Now, when you need to scale out, what we did is split the query. A query has to do a number of calculations, a number of things, in order to give you that little chart with the drop-downs and the like. What we do is have the agents do part of the work, and then we have a central component that does the rest of the work. So we managed to split the query engine into two parts: one that runs at the edge and one that runs centrally. Still, there are limits. If you put, I don't know, 10,000 Netdata agents in one room and you want one chart across them, you understand that 10,000 queries are going to happen, so it may take some time. But what we found out is that the combined power of your infrastructure is, in most cases, better than having one very big server. So even if you query 10,000 servers, there is some latency due to the round-trip delays, et cetera, but each server does just a tiny piece of the work: "oh, this is my part, taken; this is my part, taken." This means the heavy lifting of the query is spread across 10,000 servers. You get it? So it becomes more scalable, and faster, at the end of the day. All right. Thank you.
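To illustrate the edge/central split described in that last answer, here is a hedged Python sketch of the general scatter-gather pattern: each agent reduces its own data to a small partial result and the central component only merges the partials. The class and method names are made up; this is not Netdata's actual query engine.

```python
# Sketch of the edge/central query split: each agent reduces its own data to
# a small partial result, and the central component only merges partials.
# This is the generic scatter-gather pattern, not Netdata's query engine.

from concurrent.futures import ThreadPoolExecutor

class Agent:
    """Stands in for one agent holding its own samples locally."""
    def __init__(self, name, samples):
        self.name = name
        self.samples = samples

    def partial_average(self, metric):
        values = self.samples.get(metric, [])
        # Return (sum, count) so averages can be merged correctly centrally.
        return (sum(values), len(values))

def central_average(agents, metric):
    # "Scatter": ask every agent for its partial result, in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda a: a.partial_average(metric), agents))
    # "Gather": merging (sum, count) pairs is cheap, whatever the fleet size.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

agents = [
    Agent("web1", {"cpu.user": [10, 20, 30]}),
    Agent("web2", {"cpu.user": [40, 50]}),
]
print(central_average(agents, "cpu.user"))   # 30.0
```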
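And going back to the "functions" mechanism mentioned before the Q&A, where collectors expose callable functions such as slow queries or socket lists and a parent routes the call to the right node, here is a minimal sketch of that routing idea. The names and the in-process routing are invented for illustration; this is not Netdata's plugin protocol.

```python
# Minimal sketch of the "functions" routing idea: collectors register named
# functions, and a parent routes a ("node", "function") request to the right
# child. Function and node names are made up.

class Collector:
    def __init__(self, name):
        self.name = name
        self.functions = {}           # function name -> callable

    def expose(self, fname, fn):
        self.functions[fname] = fn

class Node:
    def __init__(self, name):
        self.name = name
        self.collectors = {}

    def add_collector(self, collector):
        self.collectors[collector.name] = collector

    def run_function(self, fname):
        for c in self.collectors.values():
            if fname in c.functions:
                return c.functions[fname]()
        raise KeyError(f"{self.name} has no function {fname!r}")

class Parent:
    """Routes a function request to the node that should execute it."""
    def __init__(self, children):
        self.children = {n.name: n for n in children}

    def call(self, node_name, fname):
        return self.children[node_name].run_function(fname)

# Example: a postgres collector on "db1" exposes "slow-queries".
pg = Collector("postgres")
pg.expose("slow-queries", lambda: ["SELECT ... 4.2s", "UPDATE ... 2.9s"])
db1 = Node("db1")
db1.add_collector(pg)

parent = Parent([db1])
print(parent.call("db1", "slow-queries"))
```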