Hi, I'm Javier. Is this too loud? Okay, sorry. You can find me on Mastodon or Twitter; I don't really use Mastodon, but it's FOSDEM, so I had to put it there. Twitter isn't better, it's just where I hang around. I don't have any slides, just this gist that I'll be scrolling through, because this is going to be mostly a demo-driven talk.

I'm going to be speaking today about a template I've created so you can start doing streaming analytics using different open source components. I'm a developer advocate, and I've been working with data for a long time. Ten years ago was actually my first FOSDEM, and I have the T-shirt to prove it. I was speaking at the time about fast data, and today I'm going to be speaking again about fast data. Last year I organized the Fast and Streaming Data devroom, so I really like data, and I want to share with you some of the things I've learned about working with it.

Ten years ago it was difficult to work with streaming data. I put in the talk description that you can work with millions of events per second, and we do have some users doing that. I work for a company that develops an open source database, but I want to talk about the whole picture today. Some people really need to ingest millions of events per second; most often you don't, and you're happy with a few thousand, hundreds, or tens of events per second. But ten years ago it was really not that easy to work with streaming data, because many of the technologies you associate with fast data didn't exist at the time. Ten years ago you had Cassandra and Redis; they were already available, but they were just fast databases. Even Apache Kafka, which I'll be talking about today in case you don't know it, was only three years old. And pretty much every other technology that you would consider fast or real-time today either didn't exist in 2014 or was just being born: things like Spark Streaming or Apache Flink, which are super cool today; Grafana, which I'm also going to be presenting today; QuestDB, the database where I work, which at the time had a different name. Even the large proprietary platforms like Google Cloud or Amazon Kinesis were barely offering streaming services at the time.

What I'm trying to say is that ten years ago, working with streaming data was not really a thing. Some people were doing it: Twitter, for example, was doing streaming data at the time with very interesting technologies, and so was Facebook. But it was not so usual to work with streaming data ten years ago. You may be thinking: that was ten years ago, now we are in 2024, it should be easier this year. Well, it should be, but it's not always the case.

Ten years ago, if you were doing streaming data, you would do micro-batches, and the streaming data platform would be a database: Postgres, MariaDB, or Cassandra. So the data platform was a database and a CSV file. That's it. But that's not really a pipeline, and a CSV file was not super cool. So then we had some innovation: we added an extra step, which was basically adding Excel.
So maybe OpenOffice or LibreOffice, whatever, but basically the data platform was a database and a spreadsheet. That was the data platform. I was doing reports for, you know, important people, and that's how you would do things. It was not really a data platform at all.

But then, at some point, and this is the first thing I wanted to talk about today, we decided that sending all your data directly to your database might not be the best idea. We decided it could be a good idea to decouple the ingestion of the data from the storage and analysis of the data, for many reasons. For example, if you are sending data to your database all the time and you need to restart the server for any reason, you have a problem: you cannot really stop the database if data keeps arriving. Or maybe your database is not super fast, or maybe it is, but either way you can have something in between. Or maybe you want to send your data to your database and to somewhere else as well, and it would be cool to have something in between to fan the data out. So we started seeing things like Apache Kafka, which I'll present today. That was the first step: sending data first to some buffer and then to the database.

Then we started seeing dashboards, not Excel anymore, but dashboards like Tableau, Grafana, and other tools to present the data. That was already interesting, and we spent a few years with this architecture: something like Apache Kafka or RabbitMQ for data ingestion, a database, which could be PostgreSQL or a database specialized in working with faster data like Cassandra, and then some business dashboards. And that was cool.

But then it was the time of more real time, of more advanced things: now that I can analyze data, I also want to predict the future. I don't want to understand only what happened; I want to understand why it happened and what's going to happen next. Spoiler: it doesn't work. But anyway, we pretend it works, and today I will show you some examples of time series forecasting. So analytics alone was no longer enough; it became analytics plus some machine learning.

And by the way, the dashboards got real time. You didn't have to keep reloading with F5. At some point, and this is true, I had a browser extension whose only job was reloading the page every five seconds, so we could have "real time" dashboards, because they didn't reload on their own. Then came tools like Grafana, or maybe Looker, but forget Looker because it's not open source, and you started seeing real-time analytics. And that was pretty cool; it started to look like something interesting.

And then, of course, we have observability and monitoring, because if you want to go to production you need to make sure things are working, not only on your machine, but in real time. And that's kind of the thing: we kept adding more and more pieces. If you want to start today ingesting streaming data and analyzing streaming data, you have all the components, but there are a lot of components, they have to work together, and if you have never done it before, it can feel overwhelming.
So that's what I want to talk about today: how we can build a platform that can do all of these things. This is what it looks like: everything everywhere all at once, like the movie. Today, at the very least, you very likely have a data ingestion layer. The data from that layer needs to go somehow into your analytics database. You will probably have some listeners and applications; you might also want to send data directly, without the buffer, because your application doesn't support it, whatever. Then you have your database, your machine learning and data science workflow, your monitoring workflow, your dashboards. So, as I said, I call this the "everything everywhere all at once" architecture, and if it's the first time you are working with streaming data, it might look to you the way that movie does: what is this? It doesn't look natural, it looks very weird.

That's why, after working with streaming data for a while and suffering all of this, I decided it was time to build an easy example that you could just deploy. If you already know about real-time analytics, don't use this. But if you have no idea about real-time analytics, if you've never done it before, this template gives you everything you need: a set of components you can use to do real-time analytics end to end, from ingestion to dashboards, with monitoring, with data science, and everything in the template is open source. So this is what I want to talk about today: what is in the template. And I mostly want to do this as a demo; I've been speaking for ten minutes already, so from now on it's going to be with things moving, with code and everything.

This is what the template looks like, even if the diagram is a bit blurry. Basically, the template gives you Apache Kafka, which is the ingestion layer; I'll tell you about Kafka and why I chose it in a moment. It gives you Kafka Connect, which is a component for getting data out of Kafka and into other places. The data goes into QuestDB, which is a fast time series database and also my employer, so I'm biased here, because they pay me. But it's Apache 2.0, so it's open source; you can use it for free, you don't have to pay. If you pay, it's better, because then I get... anyway. Then I'm using Jupyter notebooks, which are Python in the browser, among other things, which is pretty cool for data science, and I'll do some interactive exploration and some forecasting models. We'll have real-time dashboards with Grafana. And on top of that, I'm using Telegraf to collect metrics and store them, again, in the database. So that's everything in the template.

The template lives here. Until yesterday it was under my personal GitHub user, but as of yesterday we moved it to my employer's organization, because they pay me, and we are open source and we are cool, because FOSDEM. I'm still the single contributor. So it lives here: the time series streaming analytics template. The entry point is a Docker Compose file. I know some people prefer other things, but Docker Compose is cool.
If you want to see what this Docker Compose file is doing, feel free; but basically it starts all the things I've told you about, and I'm going to show you what that looks like. So, docker compose up. You all probably know Docker Compose better than me, but in case anyone doesn't: Docker Compose lets you start several containers in one go, and since networking in Docker is a mess, the cool thing is that all the containers in the same Compose file can talk to each other, which is convenient.

So if I do docker compose up, look: things moving. Thank you for coming to my talk. Things are starting now. On my laptop it starts fast because all the images are already here; starting from scratch it needs to download about one gigabyte of images. But I don't have any custom image: all the images in the Compose file are the standard ones, the standard Jupyter notebook image, standard Kafka, I didn't add anything. On a wired connection it takes about one minute to download everything and start up from scratch.

Once this is ready, we already have something up and running. It doesn't look like much, so let me show you what we have. The first thing, let me open the other browser. Not you, Chrome, not today. It's my default browser, sorry about that. And I feel slightly bad because Firefox was giving away free cookies earlier and I took one, so thank you Firefox, I'm using Firefox for the demo.

The first thing we have here is Jupyter notebooks, which let you run Python, among other things, from the browser. To make things easy to see, I've created a script that reads public data from GitHub, because open source, and sends it to Kafka, goes through all the steps, and gives me a nice dashboard. If I execute this from the browser, it's going to call the GitHub API, and every 10 seconds it gets new data. I do it every 10 seconds because there are rate limits and I wanted to make sure I stay within them. So every 10 seconds I get data from GitHub, it goes through all the components, and eventually it reaches the dashboard. I'll explain all the pieces in a moment.

And here in this dashboard, if everything went fine... I cannot see any data. That's not looking good. Let me check for a second. We still have data, yes? Table doesn't exist, table doesn't exist. Not you, Chrome, sorry; that's why I don't like it, it's always there, always trying. Okay, let me just check for one second whether I have data coming into Kafka. It's funny because I was testing it right before the talk. Let me see what I have; I will tell you what I'm doing once I know what I'm doing myself. What are you doing here, Javier? Topics: can I see which topics I have created? I will tell you in one second what I mean.
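For reference, this kind of sanity check, listing topics and peeking at a few messages, can also be done from the notebook. A minimal sketch using the kafka-python client; the template may use a different client, and the broker address and topic name here are assumptions:

```python
import json
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"  # assumed port mapping from the compose file

# List the topics the broker knows about
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
print(admin.list_topics())

# Peek at a few messages to confirm data is actually flowing
consumer = KafkaConsumer(
    "github_events",                      # assumed topic name
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,             # stop iterating if nothing arrives
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for i, msg in enumerate(consumer):
    print(msg.timestamp, msg.value.get("type"))
    if i >= 4:
        break
consumer.close()
```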
The moment it works, I will tell you everything about it. But okay, the topic is here. So now I'm checking whether data is entering my Kafka cluster or not. Yes, data is coming in. So let me go back to the dashboard. Oh, yeah. I don't know exactly what went wrong when I started the script from the command line, but this is what should happen from the beginning: I have data going into Kafka, and I have a dashboard that is refreshing, no hands here, and every five seconds it has new data. That's the idea. This is the high-level overview; what I want to do now is walk you through the different steps and how we can see what's happening at each point.

The first thing, as I said, is a script which is sending GitHub events to Kafka. Kafka is a message broker: you send messages, events, and a message can be anything. In this case I'm sending JSON because it's easier, but you could send Avro, plain strings, whatever. You send events to Kafka on one end, and then different consumers can read those events. It sounds super simple, but it's very powerful. First, because Kafka is very scalable: you can use Kafka at any scale, and if you have a lot of data you can add more servers; it scales horizontally and works pretty well.

In Kafka you don't have tables; you have something called topics. When you send a message, you publish it into a topic, and then any number of consumers can read messages from that topic, and they can read them in different ways. You can choose to have every consumer see all the messages from the topic, or you can create a consumer group so they read collaboratively, in parallel, across different partitions, sharing the work. You can have some consumers reading from the beginning and others reading only the newest data. You define a retention period for each topic, and you can replay the stream of events from any point within that retention period; if you want to replay what happened, you just replay it. So it gives you a lot of possibilities, and it's a very good way of decoupling the ingestion layer.

When you work with Kafka, you basically use any Kafka client library, and as I showed you here, all you have to do is create a producer and send your events, in this case in JSON, to a topic through that producer. In this notebook I only have the Python code, but in the template, if you prefer other languages, you have the ingestion in multiple languages. I did this in all the languages ChatGPT could help me with: I did it in Python and then told ChatGPT, do this in Node.js; it didn't work; hey, I get this error; and so on. But in the end it works. So you have the same code in different languages: you can do the ingestion using Node, using Java, using Rust, which is what's actually running right now from the command line, sending data from Rust into the topic. So right now we are ingesting data into a Kafka topic, but Kafka by itself doesn't do anything else with it.
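The producer side of that notebook boils down to a loop like this. A rough sketch with kafka-python and the public GitHub events endpoint; the topic name, polling interval, and field selection are assumptions, and a real version would handle rate limits and duplicates more carefully:

```python
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Public GitHub events feed; unauthenticated calls are rate limited,
    # hence the 10-second pause between polls
    resp = requests.get("https://api.github.com/events", timeout=10)
    for event in resp.json():
        producer.send("github_events", {
            "type": event["type"],
            "repo": event["repo"]["name"],
            "created_at": event["created_at"],
        })
    producer.flush()
    time.sleep(10)
```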
This is just the first step of the pipeline: from GitHub we send to Kafka. But from Kafka, the messages don't go anywhere on their own. Kafka, by design, does not push messages. Some message brokers use a push model; Kafka uses a pull model, which means that if you want to read messages, you have to ask the broker: hey, give me the next batch of messages. The Kafka developers decided to do it that way because, when you have multiple consumers, it's a better way to work: if you keep pushing data to a slow consumer, the consumer eventually gets overwhelmed and it doesn't work. So in Kafka you need to be pulling data, which means you need some application calling Kafka and saying: if there are new messages in the topic, give them to me, I want to store them somewhere.

For the first years, that was how you worked with Kafka, but it was annoying, so the Kafka project created something called Kafka Connect, which is also part of open source Kafka. In Kafka, like in many projects, there is a part that is fully open source and a part that is proprietary; the things I'm showing today are only the open source parts, of course. Kafka Connect is part of open source Kafka, and what it does is give you connectors to read data from Kafka and write it into multiple destinations. In the audience today we have my colleague Jaromir, I don't know exactly where he is, but he actually wrote the connector that writes data from Kafka Connect into QuestDB. I wouldn't know how to do that, but you know.

So what I have here is a connector sending data from Kafka into the database. Kafka Connect doesn't have a web interface, just an API, but I can talk to it from the notebook. I can ask Kafka Connect which plugins it has available. By default you only have a few plugins; in this Docker Compose I've included the JAR file for the connector that sends data to QuestDB. You could be sending data to Amazon S3, to the Hadoop file system, to ClickHouse, to Postgres, to another Kafka, to whatever. In this case I have the QuestDB connector available.

Then, if I want to send data, I have to configure the connector: I have to tell Connect where to get the data from and where to send it. In this case I have a configuration that says: read from a topic called github_events, which is the topic I'm sending data into, and write to this host and port. Then each connector has its own parameters. At the very least you'll want to configure the format of the messages; in this case I'm saying that we're getting messages in JSON and I want to output strings. Different connectors have different options; in this case I'm also telling it which field to use as the timestamp, the created_at one. Those kinds of things, basically. But in the end, what I'm doing here is configuring how Kafka Connect reads data from a topic and how it writes the data at the destination. Are you still with me here? Yes? Cool. So the destination is a database, and that database is QuestDB, as I told you earlier.
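Registering that connector from the notebook is just a couple of HTTP calls against the Kafka Connect REST API. Here is a sketch of the idea; the connector class name and configuration keys are written from memory, so treat them as assumptions and check the QuestDB Kafka connector documentation for the exact names your version expects:

```python
import requests

CONNECT_URL = "http://localhost:8083"  # default Kafka Connect REST port

# Which connector plugins are installed in this Connect worker?
print(requests.get(f"{CONNECT_URL}/connector-plugins").json())

# Register a sink that reads the github_events topic and writes to QuestDB
connector = {
    "name": "questdb-github-events",
    "config": {
        "connector.class": "io.questdb.kafka.QuestDBSinkConnector",  # assumed class name
        "topics": "github_events",
        "host": "questdb:9009",                # assumed QuestDB ILP host:port inside the compose network
        "table": "github_events",
        "timestamp.field.name": "created_at",  # which JSON field becomes the designated timestamp
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector)
print(resp.status_code, resp.json())
```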
QuestDB is my employer, so of course I only have good things to say about it, but it's actually pretty cool. This is the instance running in Docker, and you can see it has multiple tables, because I'm also writing monitoring data into it. The table where we are ingesting the GitHub data is this one, github_events. If I do something like select count from github_events, we only have about a thousand events, which is not much, but it's what we have right now. And this is what it looks like: push events from different repositories, from different people, at different timestamps.

This database, QuestDB, is an open source database specialized in time series. Time series basically means I have data points with a timestamp and I want to do aggregations over time. We speak SQL, but a cool thing, I believe, in QuestDB is that you can do things like this: I can ask for the count of how many messages I'm getting in intervals of, say, five seconds. And what I get is, for each five-second interval, the result of that aggregation. It's basically a GROUP BY, but instead of grouping by a dimension, I'm grouping into time intervals. That's the cool thing about a time series database: it gives you a lot of interesting tools for working with time data. Or I could ask for the same thing, but per repository, in intervals of five minutes; then for each repository and each five-minute interval I get the number of events for that repository. So you can start to see how this gets interesting.

You can also join two different tables by approximate time. Maybe in one table I'm getting data every few microseconds and in the other every two hours, but I want to join each row with the closest event in time. Doing that in plain SQL can be done, but it's a big query; here it's just an ASOF JOIN. So we have those kinds of extensions. And the main thing about QuestDB is that it's built for speed, because time series data tends to be fast.

In this case I have fewer than 2,000 events, which sounds like nothing, so let me go to our public demo site, and then I'll stop talking about QuestDB in a second. I cannot even type the name of my own database. Okay, that's better. See... the Wi-Fi here is not really great; when it comes back I'll show you, and this is totally why I wasn't getting any data earlier. On this public site we have some demo tables, and one of them is a bit bigger than the one I just showed you. This table is not huge, but it's already 1.6 billion records, which is not too bad. I could do something like: give me the average... let me just find a numeric column in this table... give me the average fare amount, for example, from this table with 1.6 billion records. How long would you expect a full scan over 1.6 billion records plus an average to take? There's no cache here. How fast would you say? Twenty seconds? Twenty seconds, that's good, I like it. It took 400 milliseconds.
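Those time-oriented queries are regular SQL with QuestDB extensions, and you can run them against the REST endpoint as well as from the web console. A sketch, assuming the table and column names match the demo and that a second table exists for the join:

```python
import requests

QUESTDB = "http://localhost:9000/exec"  # QuestDB REST query endpoint

def run(sql: str):
    r = requests.get(QUESTDB, params={"query": sql})
    r.raise_for_status()
    return r.json()["dataset"]

# Events per 5-second bucket: a GROUP BY over time intervals
print(run("SELECT timestamp, count() FROM github_events SAMPLE BY 5s"))

# Events per repository per 5-minute bucket
print(run("""
    SELECT timestamp, repo, count()
    FROM github_events
    SAMPLE BY 5m
"""))

# Join each row with the closest-in-time row of another (hypothetical) table
print(run("""
    SELECT *
    FROM github_events
    ASOF JOIN some_other_events
"""))
```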
That was not too bad, but it's actually still on the slow side for QuestDB, because QuestDB is optimized to be very fast when you work on chunks of time. If you only select data within one particular year, for example, this will be much faster. Oh, that's better: 200 milliseconds, which is not too bad considering how much data we're scanning. Oops, count from trips, okay, that's not it. Let's pick a year that has data. So in this case I can take the average trip distance, for example, over about 170 million records, and execution is still around 50 milliseconds. So that's the kind of thing.

Time series databases: we're not the only one, and we're the fastest... well, if you ask any other database vendor, they'll tell you they are the fastest, and they're right, because it really depends on the query. We are super fast for time series queries; for other types of queries, we are not the fastest. But for time series, everything is optimized for that.

Okay, but I don't have 1.6 billion records, and you might think that's unrealistic. Something you may not have considered, though, is that getting to a billion records is easier than you think. Say you have 500 time series, maybe 500 cars or scooters, 500 machines, 500 users with a phone, each sending a data point every second. One data point per second doesn't sound like much, no? But how many seconds are in a day? In a week? In a typical month of about 30.44 days, because some months are longer than others? Take the number of seconds in a month times 500 devices, and you get roughly 1.3 billion records in just one month. And we see users generating this amount of data every day, or even several times that every day. What I'm trying to say is that when you work with streaming data, it's pretty easy to reach the point where you have a lot of data to process, so it's quite useful to have a database that can help you with that.

But enough about QuestDB; I want to cover all the pieces today. I already showed you how to get data from Kafka into QuestDB. By the way, before I move on: I told you I wanted to talk about millions of events per second, and this demo is doing just 30 events every 10 seconds, so I have another notebook which is not reading from GitHub but generating synthetic IoT data, to send events a bit faster. How much should I send? 10,000 events per second? 15,000 per second? 15,000 events every 100 milliseconds? Will you be happy with that? Okay, so now I'm sending 15,000 events every 100 milliseconds, which is still not super fast, and I have a second dashboard, just so you can see that these technologies, all together, really can handle data that moves fast. So right now we have a dashboard refreshing every 100 milliseconds.
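The synthetic-data notebook is conceptually the same producer loop, just without GitHub. A sketch of the idea; the device count, field names, topic name, and batch size here are made up for illustration, not the template's exact values:

```python
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A fixed fleet of simulated devices
DEVICES = [str(uuid.uuid4())[:8] for _ in range(500)]

while True:
    # One batch of synthetic IoT readings every 100 ms
    for _ in range(15_000):
        producer.send("iot_data", {              # assumed topic name
            "device_id": random.choice(DEVICES),
            "temperature": round(random.gauss(25, 5), 2),
            "battery": round(random.uniform(0, 100), 1),
            "ts": time.time_ns(),
        })
    producer.flush()
    time.sleep(0.1)
```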
And this is how fast we are seeing data. It's not super fast, but you get the idea, yeah? Kafka can ingest literally hundreds of millions of events per second. In QuestDB we can ingest up to 4.2 million events per second, if you have enough CPUs, something like 12 to 16 cores. But that's the point: this already feels like a real-time dashboard where data is genuinely moving fast. I wanted to show you this because otherwise it feels like cheating; this is not real data, but if you had real data at this speed, this is how it would look. I just wanted to show that the stack really supports this kind of workload. So let me move on to the next part.

The next part: I told you it's cool to send data, and it's cool to be able to run SQL queries and analyze it, but I also want some way of doing data science; I want some way of predicting the future. For data science there are many different tools, but Jupyter notebooks, which is where I'm running the whole demo today in the browser, are a very popular way of doing interactive exploration. I created a notebook in which I'm using two, actually three, different tools for data exploration.

Even if you are not a Python developer, you have probably heard about pandas. Maybe not, maybe yes. Pandas is one of the most popular libraries for data science. Some people say it's slow; the latest version is faster, and until recently pandas wouldn't parallelize, so with a large dataset you would feel it. But we forgive pandas for being slow, because first, it's not that slow, and second, it gives you a huge number of things to do with your data. It's probably the most popular tool for interactive data exploration, and this is the kind of thing you can do with it.

The first thing, of course, is connecting to where your data lives. In pandas you could read a CSV or whatever; in my case the data is in QuestDB, and QuestDB speaks the Postgres protocol for querying, so I just connect to the database over the Postgres protocol and tell pandas: read this query and give me the results. Nothing too interesting yet. In pandas everything goes into what is called a DataFrame, and DataFrames have a lot of methods to do interesting things; if you use Spark or other tools, you'll also recognize the DataFrame abstraction. So: give me the info; it shows me the basic data types and so on, not too bad. Describe the dataset; that's already interesting, because if I have no idea about my data, this already tells me: you only have one numerical column, and it's the timestamp in seconds since the epoch, so this is anticlimactic, but it tells me the minimum value, the maximum value, the percentiles, so I can get a feel for the data. Or I can say: for all the columns that are not a number, tell me how many distinct values I have for each different type of event.
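The pandas part of the notebook is essentially the following. A sketch assuming QuestDB's default Postgres-wire settings (port 8812, user admin), which the compose file may override, and an assumed column name for the event type:

```python
import pandas as pd
import psycopg2

# QuestDB speaks the Postgres wire protocol, so a plain Postgres driver works.
# pandas may warn about non-SQLAlchemy connections; the query still runs.
conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)

df = pd.read_sql("SELECT * FROM github_events", conn)

df.info()                           # column names and dtypes
print(df.describe())                # min / max / percentiles for numeric columns
print(df["type"].value_counts())    # how many events of each type (assumed column name)
```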
And I can see here that in this dataset I have almost no comments on commits; of the roughly 2,000 events I've seen so far, most are push events on GitHub, which makes sense. But if I try to do any kind of forecasting, I can already see that my data is heavily biased: a model will train pretty well on push events because they are very common, but I only have two commit-comment events, and you cannot learn from two events. You see the idea, yeah? So that's the point: without doing almost anything, you start getting familiar with your dataset. You just say "describe", "tell me the things", and it's very powerful because you don't have to worry much about the mechanics; it's just "show me what this looks like".

Then I can say: I'd rather see the distribution for each event type. Again, pandas gives you very simple ways of grouping and representing things; basically here, for each type of event, I see the count of events per minute. I don't want to go into the details of the code, because it's actually very simple, but for simple things like this, with no effort, you get interesting statistics. Of course, if you go deeper, you get more interesting things, but that's the idea. So pandas is pretty cool.

But, you know, people have no heart, and it was like: pandas is slow, let's do something new in Rust, because Python is slow and Rust is faster, blah blah blah. You can go to the other building and see the Python-versus-Rust debate for yourselves; I'm not getting into it. But basically Polars is the new kid on the block: it's the library that is making data science faster, and in response pandas is saying, now we are going to be faster too. Which is pretty cool, because now they are both competing, but many people are now using Polars rather than pandas. And the names: a panda, a polar bear, next a koala, whatever. Anyway, with Polars it's the same thing: you connect to the database and you can do literally the same operations: give me the count of events, give me the distinct values, you get the idea.

I also used another library, Facets, which I like for some things, because with very little code I can get, again, a lot of insight from my data. For example, here I can see, from the data I already have... oh, my Wi-Fi is super slow, I don't know if you have the same issue. This particular visualization loads an external JavaScript file, and since the Wi-Fi is unreliable, it's not displaying, which is what happened with the demo earlier and also when connecting to the public QuestDB instance; everything else is running locally. So sadly I cannot show it to you live, but if I reload the notebook, I should have the cached version from the previous execution. So this one is not live. A cool thing about Jupyter notebooks is that you can save the notebook with the results of the last execution and share it with your team. So this is what it looked like when I had internet, okay?
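The Polars version looks almost identical. A sketch assuming a recent Polars with the connectorx engine installed (otherwise, pl.from_pandas(df) on the pandas DataFrame gets you to the same place), and assumed column names:

```python
import polars as pl

# Same Postgres-wire endpoint as before, just a different client library
uri = "postgresql://admin:quest@localhost:8812/qdb"

df = pl.read_database_uri(
    query="SELECT timestamp, type, repo FROM github_events",  # assumed column names
    uri=uri,
)

print(df.describe())                 # summary statistics per column
print(df["type"].value_counts())     # count of events per event type
```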
Sorry about the internet, but you get the idea: there are interesting things in there. So that's exploring the data, but I promised we would predict the future; we wanted forecasting on time series. Again, in Python you have a lot of interesting ways of working with time series data, and when you have time series data, the logical next step is trying to predict what's going to happen. How much stock do I need for the people who are going to buy my product? How much energy do I have to produce? What's the price of Bitcoin going to be two weeks from now? Those kinds of things are interesting to predict; if you figure out that last one, let me know. That's the holy grail of time series prediction. It is, of course, not trivial, but there are a number of algorithms, like Prophet or ARIMA and others, that let you do time series forecasting, better or worse.

In this case I'm going to use a model called Prophet, originally developed by Facebook, which lets you do time series forecasting in a very simple way. I connect to my database, and then I ask QuestDB for all the events, sampled per minute, as I showed you earlier. So what I have is the number of events per minute; some minutes have more events, some have fewer, and this is my training data. And this is all it takes to train the model. You don't need a fancy NVIDIA GPU; I chose Prophet for this notebook precisely because it's very, very lightweight. We don't have much data, only about 2,000 events, so training is super fast, but the model has already learned, and with this I can already make predictions. Since I only have data for the past 20 minutes or so, I'm only going to try to predict the next 10 minutes. Predicting 10 minutes into the future is not much; with more data in the dataset I could predict the next week, the next month, whatever. But in this case I'll predict the next 10 minutes, and this is what it looks like: according to the model, we are going to see fewer and fewer events over time.

Something I can do: right now I'm sending data from only one script, so I can open that script and start sending events every two seconds instead of every 10 seconds. So if I execute now... yeah, it's the one reading events from GitHub. Since the internet is not working again, I'm going to disconnect and reconnect; actually, let me tether from my phone for a second. Don't connect to my hotspot now. Let's see if this is any better. No. Okay, my phone is also useless, so I don't have internet. Basically, what I was trying to do was send more data, so you would see that if more data comes in, the prediction goes higher, because that's how it works.
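The Prophet part of the notebook is roughly the following. A sketch that samples the events per minute in QuestDB and asks Prophet for the next 10 minutes; the query, connection settings, and column names are assumptions:

```python
import pandas as pd
import psycopg2
from prophet import Prophet

conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)

# Training data: events per minute, straight from the database
train = pd.read_sql(
    "SELECT timestamp, count() AS y FROM github_events SAMPLE BY 1m", conn
)
train = train.rename(columns={"timestamp": "ds"})  # Prophet expects columns 'ds' and 'y'

model = Prophet()
model.fit(train)

# Predict 10 minutes beyond the training window
future = model.make_future_dataframe(periods=10, freq="min")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(10))
```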
But anyway, I have another model here, a linear regression. You can use linear regression to predict many things; in this case I'm using it to predict the time series as well. Same idea: I train the model and then predict the future, and as you can see, this model is also very pessimistic. It says we are not getting enough data: this is what we've been seeing, and the prediction keeps going down. But that's cool: we have data, and we have ways of predicting what's going to happen based on the past. I don't know if you thought it would be this simple; these models are very simplistic, but they give you an orientation, a trend, which is interesting. So we have that already.

Next step, and I'm almost running out of time: the real-time dashboard. I already showed you the dashboard; what I'm using is Grafana, and I have a couple of dashboards here. Grafana is a tool for dashboards and alerts, and it has a lot of plugins to connect to different data sources. In our case, I created a data source using the Postgres connector, because QuestDB is compatible with Postgres at the protocol level, so I'm connecting to my QuestDB instance in this Docker container. Once you have a connection, you can create dashboards and alerts on top of it. And if I go into any of these panels, the way you build them is just SQL: each panel looks very fancy, but behind the scenes it's just a SQL query. It has some filters; for example, this timestamp filter macro means the query uses whatever time range you have selected on the dashboard. So you have some macros so everything works together, but in the end, creating a dashboard in Grafana is just writing the SQL, picking the chart type you want, choosing the colors, whether you want multiple series, how the scale should behave, and that's all you need.

The last component in the template is monitoring. For monitoring you usually need some kind of server agent; if you're on Kubernetes, you have your own agents. In my case I'm not running Kubernetes or anything like that, so I'm running an agent called Telegraf, created by InfluxData. Telegraf can collect metrics from many sources and write them to many destinations, and one of those destinations is QuestDB. We are a time series database, so it's pretty convenient for monitoring. So I'm collecting monitoring metrics from QuestDB itself and from Kafka; the Telegraf agent gathers the metrics, does some transformations in the agent itself, and then writes the data back into my database.

For example, if I go to my QuestDB installation, this is what the metrics endpoint looks like: QuestDB exposes metrics such as how many items are missing from the cache, the number of queries, the number of commits to the file system, all those things. Kafka also exposes a lot of different metrics we can explore. Telegraf pulls all those metrics and sends them to QuestDB. So if I go to my local instance here, you'll see I have a few tables: Kafka cluster, Kafka runtime, and so on. If I select all the data from the Kafka Java runtime table, you can see we've been collecting data; this one is actually quite small.
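By the way, the linear-regression forecast mentioned a moment ago needs even less machinery. A sketch with scikit-learn over the same per-minute counts, using the minute index as the only feature; as before, the query and connection settings are assumptions:

```python
import numpy as np
import pandas as pd
import psycopg2
from sklearn.linear_model import LinearRegression

conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)
train = pd.read_sql(
    "SELECT timestamp, count() AS y FROM github_events SAMPLE BY 1m", conn
)

# Fit a straight trend line: events per minute as a function of the minute index
X = np.arange(len(train)).reshape(-1, 1)
y = train["y"].values
model = LinearRegression().fit(X, y)

# Extrapolate the fitted trend 10 minutes into the future
future_X = np.arange(len(train), len(train) + 10).reshape(-1, 1)
print(model.predict(future_X))
```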
If I go to the Kafka topics table, I should hopefully see more activity in this one. Yes: for the time I've been running this demo, we've been collecting data about the different topics: how many events, how many bytes per second were going out of the topic, how much data was coming in, the offsets at which the different consumers are reading. So we have all kinds of statistics and monitoring data in the database, and you could build your own dashboards on top of it. That's the idea.

So, what I wanted to tell you today is that working with streaming data is hard: it can get very big; it never stops, so at some point you have to decide when to run the analytics; its pace is sometimes faster, sometimes slower; sometimes data arrives late; sometimes new data arrives that updates results you already produced. And at some point the individual data points you are collecting, which are very valuable for immediate analytics, lose value, while the aggregated data becomes more interesting. So working with this kind of data can be hard, but with the right tools you can actually start working with streaming data at pretty much any speed in a simple enough way, at least to get started. Then, of course, everything gets harder.

There are many things I didn't cover and that the template doesn't have yet. The template is a starting point, so you can get familiar with streaming data with a set of interesting tools. If you want to go to production, you will need to support more data formats; JSON is not very efficient for moving fast data. You will want data life cycle policies: at which point do I delete the data? If I'm getting data every few milliseconds, maybe after one month I only want to keep aggregates, or maybe I want to move older data to some cheap cold storage; those are things you have to decide. Data quality, data governance, replication: in this template each component runs as a single server; it's easy enough to add more replicas for each component, but you have to do it. So there are many things I didn't cover, but hopefully this was interesting for you.

These are the links to all the tools I've been using in the template. The template itself is Apache 2.0, so feel free to do anything you want with it. For anything else, you can contact me on Mastodon, but actually I will not see it, so contact me on Twitter, which is easier. Thank you very much. Any questions you have, I'm here.

APPLAUSE