Hi, I'm Javier. Is this too loud? Okay, sorry. You can find me on Mastodon or Twitter; I don't really use Mastodon, but it's FOSDEM, so I had to put it there. Twitter isn't better, it's just where I hang around. I don't have any slides, just this gist that I'll be scrolling through, because this is going to be mostly a demo-driven talk.

I'm going to be speaking today about a template I've created so you can start doing streaming analytics using different open source components. I'm a developer advocate, and I've been working with data for a long time. Ten years ago was actually my first FOSDEM, and I have the T-shirt to prove it. I was speaking at the time about fast data, and today I'm going to be speaking again about fast data. Last year I organized the Fast and Streaming Data devroom, so I really like data, and I want to share with you some of the things I've learned about working with it.

Ten years ago it was difficult to work with streaming data. I put in the talk description that you can work with millions of events per second, and we do have some users doing that. I work for a company that develops an open source database, but I want to talk about the whole picture today. Some people really need to ingest millions of events per second; most often you don't, and you're happy with a few thousand, hundreds, or tens of events per second. But ten years ago it was really not that easy to work with streaming data, because many of the technologies you associate with fast data didn't exist at the time. Ten years ago you had Cassandra and Redis; they were already available, but they were just fast databases. Even Apache Kafka, which I'll be talking about today in case you don't know it, was only three years old. And pretty much every other technology that you would consider fast or real-time today either didn't exist in 2014 or was just being born: things like Spark Streaming or Apache Flink, which are super cool today; Grafana, which I'm also going to be presenting today; QuestDB, the database where I work, which at the time had a different name. Even the large proprietary platforms like Google Cloud or Amazon Kinesis were barely offering streaming services at the time.

What I'm trying to say is that ten years ago, working with streaming data was not really a thing. Some people were doing it: Twitter, for example, was doing streaming data at the time with very interesting technologies, and so was Facebook. But it was not so usual to work with streaming data ten years ago. You may be thinking: that was ten years ago, now we are in 2024, it should be easier this year. Well, it should be, but it's not always the case.

Ten years ago, if you were doing streaming data, you would do micro-batches, and the streaming data platform would be a database: Postgres, MariaDB, or Cassandra. So the data platform was a database and a CSV file. That's it. But that's not really a pipeline, and a CSV file was not super cool. So then we had some innovation: we added an extra step, which was basically adding Excel.
So maybe OpenOffice or LibreOffice, whatever, but basically the data platform was a database and a spreadsheet. That was the data platform. I was doing reports for, you know, important people, and that's how you would do things. It was not really a data platform at all.

But then, at some point, and this is the first thing I wanted to talk about today, we decided that sending all your data directly to your database might not be the best idea. We decided it could be a good idea to decouple the ingestion of the data from the storage and analysis of the data, for many reasons. For example, if you are sending data to your database all the time and you need to restart the server for any reason, you have a problem: you cannot really stop the database if data keeps arriving. Or maybe your database is not super fast, or maybe it is, but either way you can have something in between. Or maybe you want to send your data to your database and to somewhere else as well, and it would be cool to have something in between to fan the data out. So we started seeing things like Apache Kafka, which I'll present today. That was the first step: sending data first to some buffer and then to the database.

Then we started seeing dashboards, not Excel anymore, but dashboards like Tableau, Grafana, and other tools to present the data. That was already interesting, and we spent a few years with this architecture: something like Apache Kafka or RabbitMQ for data ingestion, a database, which could be PostgreSQL or a database specialized in working with faster data like Cassandra, and then some business dashboards. And that was cool.

But then it was the time of more real time, of more advanced things: now that I can analyze data, I also want to predict the future. I don't want to understand only what happened; I want to understand why it happened and what's going to happen next. Spoiler: it doesn't work. But anyway, we pretend it works, and today I will show you some examples of time series forecasting. So analytics alone was no longer enough; it became analytics plus some machine learning.

And by the way, the dashboards got real time. You didn't have to keep reloading with F5. At some point, and this is true, I had a browser extension whose only job was reloading the page every five seconds, so we could have "real time" dashboards, because they didn't reload on their own. Then came tools like Grafana, or maybe Looker, but forget Looker because it's not open source, and you started seeing real-time analytics. And that was pretty cool; it started to look like something interesting.

And then, of course, we have observability and monitoring, because if you want to go to production you need to make sure things are working, not only on your machine, but in real time. And that's kind of the thing: we kept adding more and more pieces. If you want to start today ingesting streaming data and analyzing streaming data, you have all the components, but there are a lot of components, they have to work together, and if you have never done it before, it can feel overwhelming.
So that's what I want to talk about today: how we can build a platform that can do all of these things. This is what it looks like: everything everywhere all at once, like the movie. Today, at the very least, you very likely have a data ingestion layer. The data from that layer needs to go somehow into your analytics database. You will probably have some listeners and applications; you might also want to send data directly, without the buffer, because your application doesn't support it, whatever. Then you have your database, your machine learning and data science workflow, your monitoring workflow, your dashboards. So, as I said, I call this the "everything everywhere all at once" architecture, and if it's the first time you are working with streaming data, it might look to you the way that movie does: what is this? It doesn't look natural, it looks very weird.

That's why, after working with streaming data for a while and suffering all of this, I decided it was time to build an easy example that you could just deploy. If you already know about real-time analytics, don't use this. But if you have no idea about real-time analytics, if you've never done it before, this template gives you everything you need: a set of components you can use to do real-time analytics end to end, from ingestion to dashboards, with monitoring, with data science, and everything in the template is open source. So this is what I want to talk about today: what is in the template. And I mostly want to do this as a demo; I've been speaking for ten minutes already, so from now on it's going to be with things moving, with code and everything.

This is what the template looks like, even if the diagram is a bit blurry. Basically, the template gives you Apache Kafka, which is the ingestion layer; I'll tell you about Kafka and why I chose it in a moment. It gives you Kafka Connect, which is a component for getting data out of Kafka and into other places. The data goes into QuestDB, which is a fast time series database and also my employer, so I'm biased here, because they pay me. But it's Apache 2.0, so it's open source; you can use it for free, you don't have to pay. If you pay, it's better, because then I get... anyway. Then I'm using Jupyter notebooks, which are Python in the browser, among other things, which is pretty cool for data science, and I'll do some interactive exploration and some forecasting models. We'll have real-time dashboards with Grafana. And on top of that, I'm using Telegraf to collect metrics and store them, again, in the database. So that's everything in the template.

The template lives here. Until yesterday it was under my personal GitHub user, but as of yesterday we moved it to my employer's organization, because they pay me, and we are open source and we are cool, because FOSDEM. I'm still the single contributor. So it lives here: the time series streaming analytics template. The entry point is a Docker Compose file. I know some people prefer other things, but Docker Compose is cool.
If you want to see what this Docker Compose file is doing, feel free; but basically it starts all the things I've told you about, and I'm going to show you what that looks like. So, docker compose up. You all probably know Docker Compose better than me, but in case anyone doesn't: Docker Compose lets you start several containers in one go, and since networking in Docker is a mess, the cool thing is that all the containers in the same Compose file can talk to each other, which is convenient.

So if I do docker compose up, look: things moving. Thank you for coming to my talk. Things are starting now. On my laptop it starts fast because all the images are already here; starting from scratch it needs to download about one gigabyte of images. But I don't have any custom image: all the images in the Compose file are the standard ones, the standard Jupyter notebook image, standard Kafka, I didn't add anything. On a wired connection it takes about one minute to download everything and start up from scratch.

Once this is ready, we already have something up and running. It doesn't look like much, so let me show you what we have. The first thing, let me open the other browser. Not you, Chrome, not today. It's my default browser, sorry about that. And I feel slightly bad because Firefox was giving away free cookies earlier and I took one, so thank you Firefox, I'm using Firefox for the demo.

The first thing we have here is Jupyter notebooks, which let you run Python, among other things, from the browser. To make things easy to see, I've created a script that reads public data from GitHub, because open source, and sends it to Kafka, goes through all the steps, and gives me a nice dashboard. If I execute this from the browser, it's going to call the GitHub API, and every 10 seconds it gets new data. I do it every 10 seconds because there are rate limits and I wanted to make sure I stay within them. So every 10 seconds I get data from GitHub, it goes through all the components, and eventually it reaches the dashboard. I'll explain all the pieces in a moment.

And here in this dashboard, if everything went fine... I cannot see any data. That's not looking good. Let me check for a second. We still have data, yes? Table doesn't exist, table doesn't exist. Not you, Chrome, sorry; that's why I don't like it, it's always there, always trying. Okay, let me just check for one second whether I have data coming into Kafka. It's funny because I was testing it right before the talk. Let me see what I have; I will tell you what I'm doing once I know what I'm doing myself. What are you doing here, Javier? Topics: can I see which topics I have created? I will tell you in one second what I mean.
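For reference, this kind of sanity check, listing topics and peeking at a few messages, can also be done from the notebook. A minimal sketch using the kafka-python client; the template may use a different client, and the broker address and topic name here are assumptions:

```python
import json
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"  # assumed port mapping from the compose file

# List the topics the broker knows about
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
print(admin.list_topics())

# Peek at a few messages to confirm data is actually flowing
consumer = KafkaConsumer(
    "github_events",                      # assumed topic name
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,             # stop iterating if nothing arrives
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for i, msg in enumerate(consumer):
    print(msg.timestamp, msg.value.get("type"))
    if i >= 4:
        break
consumer.close()
```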
The moment it works, I will tell you everything about it. But okay, the topic is here. So now I'm checking whether data is entering my Kafka cluster or not. Yes, data is coming in. So let me go back to the dashboard. Oh, yeah. I don't know exactly what went wrong when I started the script from the command line, but this is what should happen from the beginning: I have data going into Kafka, and I have a dashboard that is refreshing, no hands here, and every five seconds it has new data. That's the idea. This is the high-level overview; what I want to do now is walk you through the different steps and how we can see what's happening at each point.

The first thing, as I said, is a script which is sending GitHub events to Kafka. Kafka is a message broker: you send messages, events, and a message can be anything. In this case I'm sending JSON because it's easier, but you could send Avro, plain strings, whatever. You send events to Kafka on one end, and then different consumers can read those events. It sounds super simple, but it's very powerful. First, because Kafka is very scalable: you can use Kafka at any scale, and if you have a lot of data you can add more servers; it scales horizontally and works pretty well.

In Kafka you don't have tables; you have something called topics. When you send a message, you publish it into a topic, and then any number of consumers can read messages from that topic, and they can read them in different ways. You can choose to have every consumer see all the messages from the topic, or you can create a consumer group so they read collaboratively, in parallel, across different partitions, sharing the work. You can have some consumers reading from the beginning and others reading only the newest data. You define a retention period for each topic, and you can replay the stream of events from any point within that retention period; if you want to replay what happened, you just replay it. So it gives you a lot of possibilities, and it's a very good way of decoupling the ingestion layer.

When you work with Kafka, you basically use any Kafka client library, and as I showed you here, all you have to do is create a producer and send your events, in this case in JSON, to a topic through that producer. In this notebook I only have the Python code, but in the template, if you prefer other languages, you have the ingestion in multiple languages. I did this in all the languages ChatGPT could help me with: I did it in Python and then told ChatGPT, do this in Node.js; it didn't work; hey, I get this error; and so on. But in the end it works. So you have the same code in different languages: you can do the ingestion using Node, using Java, using Rust, which is what's actually running right now from the command line, sending data from Rust into the topic. So right now we are ingesting data into a Kafka topic, but Kafka by itself doesn't do anything else with it.
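The producer side of that notebook boils down to a loop like this. A rough sketch with kafka-python and the public GitHub events endpoint; the topic name, polling interval, and field selection are assumptions, and a real version would handle rate limits and duplicates more carefully:

```python
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Public GitHub events feed; unauthenticated calls are rate limited,
    # hence the 10-second pause between polls
    resp = requests.get("https://api.github.com/events", timeout=10)
    for event in resp.json():
        producer.send("github_events", {
            "type": event["type"],
            "repo": event["repo"]["name"],
            "created_at": event["created_at"],
        })
    producer.flush()
    time.sleep(10)
```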
This is just the first step of the pipeline: from GitHub we send to Kafka. But from Kafka, the messages don't go anywhere on their own. Kafka, by design, does not push messages. Some message brokers use a push model; Kafka uses a pull model, which means that if you want to read messages, you have to ask the broker: hey, give me the next batch of messages. The Kafka developers decided to do it that way because, when you have multiple consumers, it's a better way to work: if you keep pushing data to a slow consumer, the consumer eventually gets overwhelmed and it doesn't work. So in Kafka you need to be pulling data, which means you need some application calling Kafka and saying: if there are new messages in the topic, give them to me, I want to store them somewhere.

For the first years, that was how you worked with Kafka, but it was annoying, so the Kafka project created something called Kafka Connect, which is also part of open source Kafka. In Kafka, like in many projects, there is a part that is fully open source and a part that is proprietary; the things I'm showing today are only the open source parts, of course. Kafka Connect is part of open source Kafka, and what it does is give you connectors to read data from Kafka and write it into multiple destinations. In the audience today we have my colleague Jaromir, I don't know exactly where he is, but he actually wrote the connector that writes data from Kafka Connect into QuestDB. I wouldn't know how to do that, but you know.

So what I have here is a connector sending data from Kafka into the database. Kafka Connect doesn't have a web interface, just an API, but I can talk to it from the notebook. I can ask Kafka Connect which plugins it has available. By default you only have a few plugins; in this Docker Compose I've included the JAR file for the connector that sends data to QuestDB. You could be sending data to Amazon S3, to the Hadoop file system, to ClickHouse, to Postgres, to another Kafka, to whatever. In this case I have the QuestDB connector available.

Then, if I want to send data, I have to configure the connector: I have to tell Connect where to get the data from and where to send it. In this case I have a configuration that says: read from a topic called github_events, which is the topic I'm sending data into, and write to this host and port. Then each connector has its own parameters. At the very least you'll want to configure the format of the messages; in this case I'm saying that we're getting messages in JSON and I want to output strings. Different connectors have different options; in this case I'm also telling it which field to use as the timestamp, the created_at one. Those kinds of things, basically. But in the end, what I'm doing here is configuring how Kafka Connect reads data from a topic and how it writes the data at the destination. Are you still with me here? Yes? Cool. So the destination is a database, and that database is QuestDB, as I told you earlier.
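Registering that connector from the notebook is just a couple of HTTP calls against the Kafka Connect REST API. Here is a sketch of the idea; the connector class name and configuration keys are written from memory, so treat them as assumptions and check the QuestDB Kafka connector documentation for the exact names your version expects:

```python
import requests

CONNECT_URL = "http://localhost:8083"  # default Kafka Connect REST port

# Which connector plugins are installed in this Connect worker?
print(requests.get(f"{CONNECT_URL}/connector-plugins").json())

# Register a sink that reads the github_events topic and writes to QuestDB
connector = {
    "name": "questdb-github-events",
    "config": {
        "connector.class": "io.questdb.kafka.QuestDBSinkConnector",  # assumed class name
        "topics": "github_events",
        "host": "questdb:9009",                # assumed QuestDB ILP host:port inside the compose network
        "table": "github_events",
        "timestamp.field.name": "created_at",  # which JSON field becomes the designated timestamp
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector)
print(resp.status_code, resp.json())
```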
QuestDB is my employer, so of course I only have good things to say about it, but it's actually pretty cool. This is the instance running in Docker, and you can see it has multiple tables, because I'm also writing monitoring data into it. The table where we are ingesting the GitHub data is this one, github_events. If I do something like select count from github_events, we only have about a thousand events, which is not much, but it's what we have right now. And this is what it looks like: push events from different repositories, from different people, at different timestamps.

This database, QuestDB, is an open source database specialized in time series. Time series basically means I have data points with a timestamp and I want to do aggregations over time. We speak SQL, but a cool thing, I believe, in QuestDB is that you can do things like this: I can ask for the count of how many messages I'm getting in intervals of, say, five seconds. And what I get is, for each five-second interval, the result of that aggregation. It's basically a GROUP BY, but instead of grouping by a dimension, I'm grouping into time intervals. That's the cool thing about a time series database: it gives you a lot of interesting tools for working with time data. Or I could ask for the same thing, but per repository, in intervals of five minutes; then for each repository and each five-minute interval I get the number of events for that repository. So you can start to see how this gets interesting.

You can also join two different tables by approximate time. Maybe in one table I'm getting data every few microseconds and in the other every two hours, but I want to join each row with the closest event in time. Doing that in plain SQL can be done, but it's a big query; here it's just an ASOF JOIN. So we have those kinds of extensions. And the main thing about QuestDB is that it's built for speed, because time series data tends to be fast.

In this case I have fewer than 2,000 events, which sounds like nothing, so let me go to our public demo site, and then I'll stop talking about QuestDB in a second. I cannot even type the name of my own database. Okay, that's better. See... the Wi-Fi here is not really great; when it comes back I'll show you, and this is totally why I wasn't getting any data earlier. On this public site we have some demo tables, and one of them is a bit bigger than the one I just showed you. This table is not huge, but it's already 1.6 billion records, which is not too bad. I could do something like: give me the average... let me just find a numeric column in this table... give me the average fare amount, for example, from this table with 1.6 billion records. How long would you expect a full scan over 1.6 billion records plus an average to take? There's no cache here. How fast would you say? Twenty seconds? Twenty seconds, that's good, I like it. It took 400 milliseconds.
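Those time-oriented queries are regular SQL with QuestDB extensions, and you can run them against the REST endpoint as well as from the web console. A sketch, assuming the table and column names match the demo and that a second table exists for the join:

```python
import requests

QUESTDB = "http://localhost:9000/exec"  # QuestDB REST query endpoint

def run(sql: str):
    r = requests.get(QUESTDB, params={"query": sql})
    r.raise_for_status()
    return r.json()["dataset"]

# Events per 5-second bucket: a GROUP BY over time intervals
print(run("SELECT timestamp, count() FROM github_events SAMPLE BY 5s"))

# Events per repository per 5-minute bucket
print(run("""
    SELECT timestamp, repo, count()
    FROM github_events
    SAMPLE BY 5m
"""))

# Join each row with the closest-in-time row of another (hypothetical) table
print(run("""
    SELECT *
    FROM github_events
    ASOF JOIN some_other_events
"""))
```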
That was not too bad, but it's actually still on the slow side for QuestDB, because QuestDB is optimized to be very fast when you work on chunks of time. If you only select data within one particular year, for example, this will be much faster. Oh, that's better: 200 milliseconds, which is not too bad considering how much data we're scanning. Oops, count from trips, okay, that's not it. Let's pick a year that has data. So in this case I can take the average trip distance, for example, over about 170 million records, and execution is still around 50 milliseconds. So that's the kind of thing.

Time series databases: we're not the only one, and we're the fastest... well, if you ask any other database vendor, they'll tell you they are the fastest, and they're right, because it really depends on the query. We are super fast for time series queries; for other types of queries, we are not the fastest. But for time series, everything is optimized for that.

Okay, but I don't have 1.6 billion records, and you might think that's unrealistic. Something you may not have considered, though, is that getting to a billion records is easier than you think. Say you have 500 time series, maybe 500 cars or scooters, 500 machines, 500 users with a phone, each sending a data point every second. One data point per second doesn't sound like much, no? But how many seconds are in a day? In a week? In a typical month of about 30.44 days, because some months are longer than others? Take the number of seconds in a month times 500 devices, and you get roughly 1.3 billion records in just one month. And we see users generating this amount of data every day, or even several times that every day. What I'm trying to say is that when you work with streaming data, it's pretty easy to reach the point where you have a lot of data to process, so it's quite useful to have a database that can help you with that.

But enough about QuestDB; I want to cover all the pieces today. I already showed you how to get data from Kafka into QuestDB. By the way, before I move on: I told you I wanted to talk about millions of events per second, and this demo is doing just 30 events every 10 seconds, so I have another notebook which is not reading from GitHub but generating synthetic IoT data, to send events a bit faster. How much should I send? 10,000 events per second? 15,000 per second? 15,000 events every 100 milliseconds? Will you be happy with that? Okay, so now I'm sending 15,000 events every 100 milliseconds, which is still not super fast, and I have a second dashboard, just so you can see that these technologies, all together, really can handle data that moves fast. So right now we have a dashboard refreshing every 100 milliseconds.
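The synthetic-data notebook is conceptually the same producer loop, just without GitHub. A sketch of the idea; the device count, field names, topic name, and batch size here are made up for illustration, not the template's exact values:

```python
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A fixed fleet of simulated devices
DEVICES = [str(uuid.uuid4())[:8] for _ in range(500)]

while True:
    # One batch of synthetic IoT readings every 100 ms
    for _ in range(15_000):
        producer.send("iot_data", {              # assumed topic name
            "device_id": random.choice(DEVICES),
            "temperature": round(random.gauss(25, 5), 2),
            "battery": round(random.uniform(0, 100), 1),
            "ts": time.time_ns(),
        })
    producer.flush()
    time.sleep(0.1)
```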
And this is how fast we are seeing data. It's not super fast, but you get the idea, yeah? Kafka can ingest literally hundreds of millions of events per second. In QuestDB we can ingest up to 4.2 million events per second, if you have enough CPUs, something like 12 to 16 cores. But that's the point: this already feels like a real-time dashboard where data is genuinely moving fast. I wanted to show you this because otherwise it feels like cheating; this is not real data, but if you had real data at this speed, this is how it would look. I just wanted to show that the stack really supports this kind of workload. So let me move on to the next part.

The next part: I told you it's cool to send data, and it's cool to be able to run SQL queries and analyze it, but I also want some way of doing data science; I want some way of predicting the future. For data science there are many different tools, but Jupyter notebooks, which is where I'm running the whole demo today in the browser, are a very popular way of doing interactive exploration. I created a notebook in which I'm using two, actually three, different tools for data exploration.

Even if you are not a Python developer, you have probably heard about pandas. Maybe not, maybe yes. Pandas is one of the most popular libraries for data science. Some people say it's slow; the latest version is faster, and until recently pandas wouldn't parallelize, so with a large dataset you would feel it. But we forgive pandas for being slow, because first, it's not that slow, and second, it gives you a huge number of things to do with your data. It's probably the most popular tool for interactive data exploration, and this is the kind of thing you can do with it.

The first thing, of course, is connecting to where your data lives. In pandas you could read a CSV or whatever; in my case the data is in QuestDB, and QuestDB speaks the Postgres protocol for querying, so I just connect to the database over the Postgres protocol and tell pandas: read this query and give me the results. Nothing too interesting yet. In pandas everything goes into what is called a DataFrame, and DataFrames have a lot of methods to do interesting things; if you use Spark or other tools, you'll also recognize the DataFrame abstraction. So: give me the info; it shows me the basic data types and so on, not too bad. Describe the dataset; that's already interesting, because if I have no idea about my data, this already tells me: you only have one numerical column, and it's the timestamp in seconds since the epoch, so this is anticlimactic, but it tells me the minimum value, the maximum value, the percentiles, so I can get a feel for the data. Or I can say: for all the columns that are not a number, tell me how many distinct values I have for each different type of event.
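The pandas part of the notebook is essentially the following. A sketch assuming QuestDB's default Postgres-wire settings (port 8812, user admin), which the compose file may override, and an assumed column name for the event type:

```python
import pandas as pd
import psycopg2

# QuestDB speaks the Postgres wire protocol, so a plain Postgres driver works.
# pandas may warn about non-SQLAlchemy connections; the query still runs.
conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)

df = pd.read_sql("SELECT * FROM github_events", conn)

df.info()                           # column names and dtypes
print(df.describe())                # min / max / percentiles for numeric columns
print(df["type"].value_counts())    # how many events of each type (assumed column name)
```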
And I can see here that in this dataset I have almost no comments on commits; of the roughly 2,000 events I've seen so far, most are push events on GitHub, which makes sense. But if I try to do any kind of forecasting, I can already see that my data is heavily biased: a model will train pretty well on push events because they are very common, but I only have two commit-comment events, and you cannot learn from two events. You see the idea, yeah? So that's the point: without doing almost anything, you start getting familiar with your dataset. You just say "describe", "tell me the things", and it's very powerful because you don't have to worry much about the mechanics; it's just "show me what this looks like".

Then I can say: I'd rather see the distribution for each event type. Again, pandas gives you very simple ways of grouping and representing things; basically here, for each type of event, I see the count of events per minute. I don't want to go into the details of the code, because it's actually very simple, but for simple things like this, with no effort, you get interesting statistics. Of course, if you go deeper, you get more interesting things, but that's the idea. So pandas is pretty cool.

But, you know, people have no heart, and it was like: pandas is slow, let's do something new in Rust, because Python is slow and Rust is faster, blah blah blah. You can go to the other building and see the Python-versus-Rust debate for yourselves; I'm not getting into it. But basically Polars is the new kid on the block: it's the library that is making data science faster, and in response pandas is saying, now we are going to be faster too. Which is pretty cool, because now they are both competing, but many people are now using Polars rather than pandas. And the names: a panda, a polar bear, next a koala, whatever. Anyway, with Polars it's the same thing: you connect to the database and you can do literally the same operations: give me the count of events, give me the distinct values, you get the idea.

I also used another library, Facets, which I like for some things, because with very little code I can get, again, a lot of insight from my data. For example, here I can see, from the data I already have... oh, my Wi-Fi is super slow, I don't know if you have the same issue. This particular visualization loads an external JavaScript file, and since the Wi-Fi is unreliable, it's not displaying, which is what happened with the demo earlier and also when connecting to the public QuestDB instance; everything else is running locally. So sadly I cannot show it to you live, but if I reload the notebook, I should have the cached version from the previous execution. So this one is not live. A cool thing about Jupyter notebooks is that you can save the notebook with the results of the last execution and share it with your team. So this is what it looked like when I had internet, okay?
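The Polars version looks almost identical. A sketch assuming a recent Polars with the connectorx engine installed (otherwise, pl.from_pandas(df) on the pandas DataFrame gets you to the same place), and assumed column names:

```python
import polars as pl

# Same Postgres-wire endpoint as before, just a different client library
uri = "postgresql://admin:quest@localhost:8812/qdb"

df = pl.read_database_uri(
    query="SELECT timestamp, type, repo FROM github_events",  # assumed column names
    uri=uri,
)

print(df.describe())                 # summary statistics per column
print(df["type"].value_counts())     # count of events per event type
```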
Sorry about the internet, but you get the idea: there are interesting things in there. So that's exploring the data, but I promised we would predict the future; we wanted forecasting on time series. Again, in Python you have a lot of interesting ways of working with time series data, and when you have time series data, the logical next step is trying to predict what's going to happen. How much stock do I need for the people who are going to buy my product? How much energy do I have to produce? What's the price of Bitcoin going to be two weeks from now? Those kinds of things are interesting to predict; if you figure out that last one, let me know. That's the holy grail of time series prediction. It is, of course, not trivial, but there are a number of algorithms, like Prophet or ARIMA and others, that let you do time series forecasting, better or worse.

In this case I'm going to use a model called Prophet, originally developed by Facebook, which lets you do time series forecasting in a very simple way. I connect to my database, and then I ask QuestDB for all the events, sampled per minute, as I showed you earlier. So what I have is the number of events per minute; some minutes have more events, some have fewer, and this is my training data. And this is all it takes to train the model. You don't need a fancy NVIDIA GPU; I chose Prophet for this notebook precisely because it's very, very lightweight. We don't have much data, only about 2,000 events, so training is super fast, but the model has already learned, and with this I can already make predictions. Since I only have data for the past 20 minutes or so, I'm only going to try to predict the next 10 minutes. Predicting 10 minutes into the future is not much; with more data in the dataset I could predict the next week, the next month, whatever. But in this case I'll predict the next 10 minutes, and this is what it looks like: according to the model, we are going to see fewer and fewer events over time.

Something I can do: right now I'm sending data from only one script, so I can open that script and start sending events every two seconds instead of every 10 seconds. So if I execute now... yeah, it's the one reading events from GitHub. Since the internet is not working again, I'm going to disconnect and reconnect; actually, let me tether from my phone for a second. Don't connect to my hotspot now. Let's see if this is any better. No. Okay, my phone is also useless, so I don't have internet. Basically, what I was trying to do was send more data, so you would see that if more data comes in, the prediction goes higher, because that's how it works.
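The Prophet part of the notebook is roughly the following. A sketch that samples the events per minute in QuestDB and asks Prophet for the next 10 minutes; the query, connection settings, and column names are assumptions:

```python
import pandas as pd
import psycopg2
from prophet import Prophet

conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)

# Training data: events per minute, straight from the database
train = pd.read_sql(
    "SELECT timestamp, count() AS y FROM github_events SAMPLE BY 1m", conn
)
train = train.rename(columns={"timestamp": "ds"})  # Prophet expects columns 'ds' and 'y'

model = Prophet()
model.fit(train)

# Predict 10 minutes beyond the training window
future = model.make_future_dataframe(periods=10, freq="min")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(10))
```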
But anyway, I have another model here, a linear regression. You can use linear regression to predict many things; in this case I'm using it to predict the time series as well. Same idea: I train the model and then predict the future, and as you can see, this model is also very pessimistic. It says we are not getting enough data: this is what we've been seeing, and the prediction keeps going down. But that's cool: we have data, and we have ways of predicting what's going to happen based on the past. I don't know if you thought it would be this simple; these models are very simplistic, but they give you an orientation, a trend, which is interesting. So we have that already.

Next step, and I'm almost running out of time: the real-time dashboard. I already showed you the dashboard; what I'm using is Grafana, and I have a couple of dashboards here. Grafana is a tool for dashboards and alerts, and it has a lot of plugins to connect to different data sources. In our case, I created a data source using the Postgres connector, because QuestDB is compatible with Postgres at the protocol level, so I'm connecting to my QuestDB instance in this Docker container. Once you have a connection, you can create dashboards and alerts on top of it. And if I go into any of these panels, the way you build them is just SQL: each panel looks very fancy, but behind the scenes it's just a SQL query. It has some filters; for example, this timestamp filter macro means the query uses whatever time range you have selected on the dashboard. So you have some macros so everything works together, but in the end, creating a dashboard in Grafana is just writing the SQL, picking the chart type you want, choosing the colors, whether you want multiple series, how the scale should behave, and that's all you need.

The last component in the template is monitoring. For monitoring you usually need some kind of server agent; if you're on Kubernetes, you have your own agents. In my case I'm not running Kubernetes or anything like that, so I'm running an agent called Telegraf, created by InfluxData. Telegraf can collect metrics from many sources and write them to many destinations, and one of those destinations is QuestDB. We are a time series database, so it's pretty convenient for monitoring. So I'm collecting monitoring metrics from QuestDB itself and from Kafka; the Telegraf agent gathers the metrics, does some transformations in the agent itself, and then writes the data back into my database.

For example, if I go to my QuestDB installation, this is what the metrics endpoint looks like: QuestDB exposes metrics such as how many items are missing from the cache, the number of queries, the number of commits to the file system, all those things. Kafka also exposes a lot of different metrics we can explore. Telegraf pulls all those metrics and sends them to QuestDB. So if I go to my local instance here, you'll see I have a few tables: Kafka cluster, Kafka runtime, and so on. If I select all the data from the Kafka Java runtime table, you can see we've been collecting data; this one is actually quite small.
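By the way, the linear-regression forecast mentioned a moment ago needs even less machinery. A sketch with scikit-learn over the same per-minute counts, using the minute index as the only feature; as before, the query and connection settings are assumptions:

```python
import numpy as np
import pandas as pd
import psycopg2
from sklearn.linear_model import LinearRegression

conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)
train = pd.read_sql(
    "SELECT timestamp, count() AS y FROM github_events SAMPLE BY 1m", conn
)

# Fit a straight trend line: events per minute as a function of the minute index
X = np.arange(len(train)).reshape(-1, 1)
y = train["y"].values
model = LinearRegression().fit(X, y)

# Extrapolate the fitted trend 10 minutes into the future
future_X = np.arange(len(train), len(train) + 10).reshape(-1, 1)
print(model.predict(future_X))
```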
If I go to the Kafka topics table, I should hopefully see more activity in this one. Yes: for the time I've been running this demo, we've been collecting data about the different topics: how many events, how many bytes per second were going out of the topic, how much data was coming in, the offsets at which the different consumers are reading. So we have all kinds of statistics and monitoring data in the database, and you could build your own dashboards on top of it. That's the idea.

So, what I wanted to tell you today is that working with streaming data is hard: it can get very big; it never stops, so at some point you have to decide when to run the analytics; its pace is sometimes faster, sometimes slower; sometimes data arrives late; sometimes new data arrives that updates results you already produced. And at some point the individual data points you are collecting, which are very valuable for immediate analytics, lose value, while the aggregated data becomes more interesting. So working with this kind of data can be hard, but with the right tools you can actually start working with streaming data at pretty much any speed in a simple enough way, at least to get started. Then, of course, everything gets harder.

There are many things I didn't cover and that the template doesn't have yet. The template is a starting point, so you can get familiar with streaming data with a set of interesting tools. If you want to go to production, you will need to support more data formats; JSON is not very efficient for moving fast data. You will want data life cycle policies: at which point do I delete the data? If I'm getting data every few milliseconds, maybe after one month I only want to keep aggregates, or maybe I want to move older data to some cheap cold storage; those are things you have to decide. Data quality, data governance, replication: in this template each component runs as a single server; it's easy enough to add more replicas for each component, but you have to do it. So there are many things I didn't cover, but hopefully this was interesting for you.

These are the links to all the tools I've been using in the template. The template itself is Apache 2.0, so feel free to do anything you want with it. For anything else, you can contact me on Mastodon, but actually I will not see it, so contact me on Twitter, which is easier. Thank you very much. Any questions you have, I'm here.

APPLAUSE