[00:00.000 --> 00:10.000] The next session is a very important one, around streaming and Java.
[00:10.000 --> 00:15.000] Of course, streaming has been an increasingly important and popular topic for years.
[00:15.000 --> 00:17.000] And Thibautz is going to tell us more about it.
[00:17.000 --> 00:18.000] Yes.
[00:18.000 --> 00:19.000] Thank you.
[00:19.000 --> 00:20.000] You have to talk loudly.
[00:20.000 --> 00:21.000] Yes.
[00:21.000 --> 00:22.000] Yeah.
[00:22.000 --> 00:25.000] So, welcome everyone.
[00:25.000 --> 00:29.000] So, this session is mainly about real-time stream processing.
[00:29.000 --> 00:33.000] So, what I'm planning to do today, because it's Sunday and early morning,
[00:33.000 --> 00:35.000] is to make it as easy as possible.
[00:35.000 --> 00:38.000] And the fact is, I don't know your background,
[00:38.000 --> 00:42.000] so I'm not sure how much you know about real-time stream processing.
[00:42.000 --> 00:47.000] So, I will take it from scratch, basically, to get everyone up to speed.
[00:47.000 --> 00:54.000] And I will also demo how you can basically use real-time stream processing in your work as well.
[00:54.000 --> 00:58.000] So, before we start, does anyone recognize these guys on the screen here?
[00:59.000 --> 01:01.000] On the left.
[01:05.000 --> 01:06.000] The Beatles.
[01:06.000 --> 01:07.000] Yes, that's correct.
[01:07.000 --> 01:09.000] So, these are the Beatles.
[01:09.000 --> 01:12.000] On the right side is the Liverpool Football Club.
[01:12.000 --> 01:14.000] That's where I come from.
[01:14.000 --> 01:19.000] So, I wanted to highlight these two images here, because I wanted to say,
[01:19.000 --> 01:22.000] you know, real-time stream processing is not domain specific.
[01:22.000 --> 01:24.000] So, it could be anywhere.
[01:24.000 --> 01:28.000] So, it doesn't have to be, like, for example, in financial institutions,
[01:28.000 --> 01:30.000] or machine learning, or IT, or IoT.
[01:30.000 --> 01:34.000] It could be, for example, in sports, or music, or any domain, basically.
[01:34.000 --> 01:38.000] The fact is, you're using real-time stream processing every single day.
[01:38.000 --> 01:41.000] So, just to give you an idea of what real-time means,
[01:41.000 --> 01:44.000] and how you can actually approach it.
[01:44.000 --> 01:47.000] So, can anyone guess how long it takes for an eye to blink?
[01:48.000 --> 01:50.000] The guess is a second.
[01:50.000 --> 01:53.000] So, yeah, it's sub-second, roughly one-third of a second.
[01:53.000 --> 01:54.000] So, that's pretty fast.
[01:54.000 --> 01:57.000] So, the same thing applies if you want to clap your hands,
[01:57.000 --> 01:59.000] or if you want to take a photo as well.
[01:59.000 --> 02:01.000] So, we're not talking about minutes here.
[02:01.000 --> 02:05.000] We're not talking about days or weeks, which is what batch systems are all about.
[02:05.000 --> 02:08.000] We're talking about, like, sub-milliseconds:
[02:08.000 --> 02:10.000] how you can process it in real time.
[02:10.000 --> 02:12.000] So, as you can see, it's everywhere.
[02:12.000 --> 02:15.000] So, it's not domain specific.
[02:15.000 --> 02:19.000] And some of you who already work with real-time know that, basically,
[02:19.000 --> 02:22.000] you have some kind of events coming in at this moment,
[02:22.000 --> 02:25.000] and you try to make sense out of them.
[02:25.000 --> 02:28.000] So, looking at it from a user perspective,
[02:28.000 --> 02:32.000] what you want is to make sure that you have some kind of secret sauce,
[02:32.000 --> 02:35.000] or key element, when it comes to real-time stream processing.
[02:35.000 --> 02:37.000] So, I've seen it so many times:
[02:37.000 --> 02:40.000] people approach it from the wrong angle.
[02:40.000 --> 02:44.000] So, they try, basically, to read the data in real time,
[02:44.000 --> 02:48.000] and they try, basically, to derive some kind of meaning from this data.
[02:48.000 --> 02:50.000] So, I'll give you a demo today for logs,
[02:50.000 --> 02:53.000] so this should be easy to follow.
[02:53.000 --> 02:57.000] But the secret sauce here is to combine new data,
[02:57.000 --> 02:59.000] real-time data, with historical data.
[02:59.000 --> 03:03.000] So, what we mean by real-time data is data that is arriving this moment,
[03:03.000 --> 03:05.000] and you read it in this moment.
[03:05.000 --> 03:08.000] Obviously, you want to make sense out of it.
[03:08.000 --> 03:10.000] You want to understand what's going on here.
[03:10.000 --> 03:13.000] And the historical data is normal data
[03:13.000 --> 03:16.000] we know about, like, stored somewhere on a physical drive,
[03:16.000 --> 03:18.000] for example, or a database, whatever.
[03:18.000 --> 03:20.000] So, you want to make sure, basically,
[03:20.000 --> 03:23.000] to have these two types of data at the same speed.
[03:23.000 --> 03:25.000] So, now we're talking about two types of data,
[03:25.000 --> 03:28.000] but how much, you know, is too much, basically?
[03:28.000 --> 03:31.000] What size are we talking about here?
[03:31.000 --> 03:33.000] So, for some it might be, like, a few thousand;
[03:33.000 --> 03:36.000] for others it might be millions, or billions.
[03:36.000 --> 03:39.000] So, essentially, what we're trying to do here
[03:39.000 --> 03:42.000] is not taking, like, a small data set and trying to process it,
[03:42.000 --> 03:44.000] because that's, you know, easy to do.
[03:44.000 --> 03:46.000] But we're talking about, like, over a billion,
[03:46.000 --> 03:50.000] or even over ten billion, transactions per second.
[03:50.000 --> 03:54.000] So, the idea here is to take a huge amount of data
[03:54.000 --> 03:57.000] and, you know, try to find some kind of trends
[03:57.000 --> 03:59.000] and alerts from this data.
[03:59.000 --> 04:01.000] So, I'll give you an example here,
[04:01.000 --> 04:04.000] so you can start now to work out what's going on here.
[04:04.000 --> 04:07.000] So, imagine, basically, you write a Java program,
[04:07.000 --> 04:11.000] and you obviously have some kind of logging mechanism
[04:11.000 --> 04:12.000] in your application.
[04:12.000 --> 04:16.000] So, in order to understand what's going on with your log system,
[04:16.000 --> 04:20.000] essentially, what you want is to have some kind of platform
[04:20.000 --> 04:23.000] to allow you to actually, you know, analyze it.
[04:23.000 --> 04:25.000] But also, at the same time,
[04:25.000 --> 04:27.000] we're not talking about logs from yesterday.
[04:27.000 --> 04:29.000] So, it's not about events that happened yesterday:
[04:29.000 --> 04:32.000] you want to make sure that, for example,
[04:32.000 --> 04:35.000] you actually raise alerts, or spot trends, in the same moment.
[04:35.000 --> 04:38.000] So, same thing for trends as well.
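As an illustration of the point above about combining live data with historical data, here is a minimal Hazelcast pipeline sketch that enriches a stream by looking up context stored in an IMap. It assumes Hazelcast 5.x; the test source, the map name "history", and the key derivation are hypothetical stand-ins for a real event stream and real reference data.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.test.TestSources;

public class EnrichmentSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.bootstrappedInstance();

        Pipeline p = Pipeline.create();
        p.readFrom(TestSources.itemStream(10))        // stand-in for a real event source
         .withIngestionTimestamps()
         // Enrich each live event with historical context from the "history" IMap
         // (a missing key simply yields null here).
         .mapUsingIMap("history",
                 event -> event.sequence() % 100,      // hypothetical lookup key
                 (event, historical) -> event + " / " + historical)
         .writeTo(Sinks.logger());

        hz.getJet().newJob(p).join();
    }
}
```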
[04:38.000 --> 04:41.000] So, if you want to know if your application
[04:41.000 --> 04:43.000] is going to crash or not,
[04:43.000 --> 04:47.000] what you want is to kind of have a platform or solution
[04:47.000 --> 04:51.000] where it is easy for you to actually look at the data
[04:51.000 --> 04:54.000] in this moment and say, hey, something is going wrong here.
[04:54.000 --> 04:58.000] I need to basically derive trends out of it
[04:58.000 --> 05:00.000] and raise some alerts.
[05:00.000 --> 05:03.000] Now, doing this manually is kind of painful,
[05:03.000 --> 05:06.000] because you need to loop through the data, for example,
[05:06.000 --> 05:09.000] and you want to make sure that you know how to scale it,
[05:09.000 --> 05:11.000] and also, kind of, you know,
[05:11.000 --> 05:14.000] know exactly where your data is stored,
[05:14.000 --> 05:17.000] because your enemy, when it comes to real-time stream processing,
[05:17.000 --> 05:18.000] is latency.
[05:18.000 --> 05:22.000] So, you want to make sure your application has as low
[05:22.000 --> 05:26.000] a latency as possible when it comes to delay, basically.
[05:26.000 --> 05:28.000] Obviously, scaling is a bottleneck too.
[05:28.000 --> 05:30.000] And now, if you look at platforms,
[05:30.000 --> 05:33.000] you might have heard of some of these.
[05:33.000 --> 05:36.000] So, the easiest way is to split these platforms
[05:36.000 --> 05:38.000] into various categories.
[05:38.000 --> 05:42.000] So, on the vertical axis here, you can see,
[05:42.000 --> 05:44.000] you can have open source solutions,
[05:44.000 --> 05:46.000] or you can have hybrid,
[05:46.000 --> 05:50.000] which is a mix between open source and a managed service.
[05:50.000 --> 05:54.000] And on the horizontal axis, as you can see,
[05:54.000 --> 05:56.000] you need to capture your data.
[05:56.000 --> 05:59.000] Obviously, in real time, you need to do some kind of transport
[05:59.000 --> 06:01.000] as well, as some kind of transformation,
[06:01.000 --> 06:03.000] and processing as well.
[06:03.000 --> 06:06.000] So, you can split it into 12 squares,
[06:06.000 --> 06:08.000] and it becomes, like, you know,
[06:08.000 --> 06:11.000] easier to understand which tool you need to use.
[06:11.000 --> 06:13.000] But obviously, this is still a bit complex,
[06:13.000 --> 06:16.000] because the area of real-time stream processing
[06:16.000 --> 06:19.000] is mainly about two different subjects.
[06:19.000 --> 06:22.000] So, it's not only, you know, capture or transport,
[06:22.000 --> 06:25.000] because you have to do all of these at the same time.
[06:25.000 --> 06:27.000] It's kind of like, if you want,
[06:27.000 --> 06:29.000] you need basically to decide if you're going to use
[06:29.000 --> 06:32.000] stream processing engines on one side,
[06:32.000 --> 06:34.000] or you want to have some kind of fast data storage
[06:34.000 --> 06:36.000] on the other side.
[06:36.000 --> 06:38.000] So, stream processing engines are pretty good
[06:38.000 --> 06:41.000] at handling data coming in real time,
[06:41.000 --> 06:45.000] which is, like, for example, Kafka and so on.
[06:45.000 --> 06:48.000] Or, on the far right side, you can see fast data storage,
[06:48.000 --> 06:51.000] which is, essentially, caching solutions
[06:51.000 --> 06:53.000] for your application.
[06:53.000 --> 06:55.000] So, for example, MongoDB, Redis, and so on.
[06:55.000 --> 06:57.000] So, if you want to apply this solution,
[06:57.000 --> 07:00.000] essentially you need one tool from the left
[07:00.000 --> 07:02.000] and one tool from the right,
[07:02.000 --> 07:04.000] which means that it adds more work on your side.
[07:04.000 --> 07:07.000] So, what you want is to kind of look at it in this way
[07:07.000 --> 07:10.000] and say, hey, I want one solution,
[07:10.000 --> 07:13.000] and that's where Hazelcast comes into place.
[07:13.000 --> 07:15.000] Obviously, I work for Hazelcast, and Hazelcast
[07:15.000 --> 07:19.000] itself is built on top of the Java virtual machine.
[07:19.000 --> 07:21.000] So, it's Java based, and it's open source.
[07:21.000 --> 07:24.000] So, this is the platform here.
[07:25.000 --> 07:29.000] It's kind of like an A-to-Z solution.
[07:29.000 --> 07:32.000] So, what you want is to catch your data,
[07:32.000 --> 07:34.000] capture your data.
[07:34.000 --> 07:36.000] It could be coming from Apache, for example,
[07:36.000 --> 07:39.000] Apache Kafka, and from IoT devices.
[07:39.000 --> 07:41.000] It could be coming from some kind of custom connectors,
[07:41.000 --> 07:43.000] because it's open source,
[07:43.000 --> 07:45.000] which means feel free to contribute to this project,
[07:45.000 --> 07:48.000] or if it comes from a file watcher,
[07:48.000 --> 07:50.000] for example, or from WebSockets.
[07:50.000 --> 07:52.000] So, once you have this data,
[07:52.000 --> 07:55.000] real-time data, ingested into the platform,
[07:55.000 --> 07:58.000] the platform itself has two main components.
[07:58.000 --> 08:00.000] So, the first one is the Jet engine.
[08:00.000 --> 08:02.000] So, this is the engine for stream processing,
[08:02.000 --> 08:05.000] and in the demo I will show you how to use it.
[08:05.000 --> 08:08.000] And also the fast data storage, or fast data management.
[08:08.000 --> 08:10.000] So, this is essentially a component
[08:10.000 --> 08:13.000] which allows you to load your data from external sources,
[08:13.000 --> 08:16.000] and it's optional, obviously.
[08:16.000 --> 08:18.000] So, it's kind of like, I don't know,
[08:18.000 --> 08:21.000] some kind of file system, or database, or stored on the cloud,
[08:21.000 --> 08:23.000] and you load it into memory.
[08:23.000 --> 08:25.000] Why do you need to load it into memory?
[08:25.000 --> 08:27.000] Simply because you want to make sure
[08:27.000 --> 08:29.000] your application is as fast as possible.
[08:29.000 --> 08:31.000] So, we're talking about, for example,
[08:31.000 --> 08:34.000] here, speed where it's sub-milliseconds
[08:34.000 --> 08:35.000] or fractions of a second.
[08:35.000 --> 08:37.000] So, this is very important.
[08:37.000 --> 08:40.000] For example, in a fraud detection scenario,
[08:40.000 --> 08:42.000] if, for example, you're using your card
[08:42.000 --> 08:45.000] and someone else is using your card somewhere else,
[08:45.000 --> 08:48.000] you want to get an alert in this specific moment.
[08:48.000 --> 08:50.000] It doesn't make sense to get an alert,
[08:50.000 --> 08:53.000] I don't know, in the afternoon, for example, or the next day.
[08:53.000 --> 08:56.000] So, for this type of mission-critical solution,
[08:56.000 --> 08:59.000] what you want is to make sure that your data is stored
[08:59.000 --> 09:03.000] in memory, which allows you to access it in really good time.
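A sketch of the capture step described above, reading a Kafka topic into the platform. It assumes Hazelcast 5.x with the hazelcast-jet-kafka module on the classpath; the broker address and topic name are placeholders.

```java
import java.util.Properties;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;

public class KafkaIngestSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        HazelcastInstance hz = Hazelcast.bootstrappedInstance();

        Pipeline p = Pipeline.create();
        // Capture: stream records from a Kafka topic into the platform.
        p.readFrom(KafkaSources.<String, String>kafka(props, "transactions")) // placeholder topic
         .withoutTimestamps()
         .writeTo(Sinks.logger()); // replace with a map sink or transforms as needed

        hz.getJet().newJob(p).join();
    }
}
```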
[09:03.000 --> 09:05.000] So, once you have this data,
[09:05.000 --> 09:07.000] you can do some kind of transformation,
[09:07.000 --> 09:09.000] for example, on the data, because remember,
[09:09.000 --> 09:11.000] we're not talking about one single data source.
[09:11.000 --> 09:14.000] For stream processing, we're talking about multiple sources.
[09:14.000 --> 09:16.000] So, it could be, for example, I don't know,
[09:16.000 --> 09:18.000] some transactions coming in on a Kafka topic,
[09:18.000 --> 09:20.000] and some IoT device for,
[09:20.000 --> 09:23.000] say, a weather forecast, for example, coming from another topic,
[09:23.000 --> 09:25.000] sorry, from another source.
[09:25.000 --> 09:28.000] So, you want to make sure that you can also combine them.
[09:28.000 --> 09:31.000] Obviously, because here we're in the Java room,
[09:31.000 --> 09:34.000] you can use the Java client for it, obviously.
[09:34.000 --> 09:38.000] So, essentially, what you need is kind of like a Java jar.
[09:38.000 --> 09:41.000] You need to download it and plug it into your POM file.
[09:41.000 --> 09:44.000] But, for example, if you're a data scientist
[09:44.000 --> 09:46.000] and you're not sure, you know, about programming languages,
[09:46.000 --> 09:48.000] maybe you use a little bit of Python,
[09:48.000 --> 09:50.000] but, you know, programming languages are not
[09:50.000 --> 09:52.000] something you want to invest in,
[09:52.000 --> 09:54.000] what you can do is do
[09:54.000 --> 09:56.000] everything I mentioned today in SQL.
[09:56.000 --> 09:59.000] So, which means you do everything for
[09:59.000 --> 10:01.000] real-time stream processing in terms of alerts,
[10:01.000 --> 10:03.000] for example, or defining trends,
[10:03.000 --> 10:06.000] or even querying your data, using SQL.
[10:06.000 --> 10:08.000] So, once you do this processing,
[10:08.000 --> 10:10.000] you can actually output it to some kind of,
[10:10.000 --> 10:13.000] I don't know, same thing as your input.
[10:13.000 --> 10:15.000] So, a Kafka topic, for example;
[10:15.000 --> 10:17.000] you can use a WebSocket; or you can create your own
[10:17.000 --> 10:20.000] Java application to do some kind of visualization
[10:20.000 --> 10:22.000] for predictions, for example.
[10:22.000 --> 10:25.000] Now, the cool thing about Hazelcast is not only
[10:25.000 --> 10:28.000] the platform and the ease of use,
[10:28.000 --> 10:29.000] but also how it scales.
[10:29.000 --> 10:31.000] Remember what we're talking about here?
[10:31.000 --> 10:33.000] We're not talking about a few thousand.
[10:33.000 --> 10:35.000] We're talking maybe a few million,
[10:35.000 --> 10:37.000] or even billions, of transactions.
[10:37.000 --> 10:39.000] So, when it comes to scaling,
[10:39.000 --> 10:42.000] you want to distinguish between two different topics.
[10:42.000 --> 10:45.000] So, for some, scaling might refer to, I don't know,
[10:45.000 --> 10:46.000] your data.
[10:46.000 --> 10:48.000] So, you want to scale this data, for example.
[10:48.000 --> 10:50.000] And others, who work, for example,
[10:50.000 --> 10:52.000] in programming or development,
[10:52.000 --> 10:54.000] focus mainly on the compute.
[10:54.000 --> 10:56.000] So, you want to make sure, basically,
[10:56.000 --> 11:00.000] to combine data and compute when you want to scale.
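The SQL route mentioned above might look like this from the Java client. A minimal sketch, assuming Hazelcast 5.x and a cluster reachable with default settings; the mapping options mirror the VARCHAR-versus-JSON choice the talk describes.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.sql.SqlResult;
import com.hazelcast.sql.SqlRow;

public class SqlSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = HazelcastClient.newHazelcastClient();

        // Declare how the "logs" map should be exposed to SQL
        // ('varchar' for speed; 'json' if you want structured fields later).
        hz.getSql().execute(
            "CREATE MAPPING IF NOT EXISTS logs "
          + "TYPE IMap "
          + "OPTIONS ('keyFormat' = 'varchar', 'valueFormat' = 'varchar')");

        // Query the live map just like a table.
        try (SqlResult result = hz.getSql().execute(
                "SELECT __key, this FROM logs WHERE this LIKE '%ERROR%'")) {
            for (SqlRow row : result) {
                System.out.println(row.getObject(0) + " -> " + row.getObject(1));
            }
        }
        hz.shutdown();
    }
}
```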
[11:00.000 --> 11:03.000] And the cool thing about it is it's partition aware,
[11:03.000 --> 11:07.000] which means if you have your data stored in multiple places
[11:07.000 --> 11:09.000] around the world, for example, in different data centers,
[11:09.000 --> 11:12.000] what you want is for your compute, or your process,
[11:12.000 --> 11:16.000] or your application, to be placed as close
[11:16.000 --> 11:18.000] as possible to your data.
[11:18.000 --> 11:21.000] So, this will give you some kind of speed,
[11:21.000 --> 11:25.000] and lower latency when it comes to transactions.
[11:25.000 --> 11:27.000] Now, we talked about transactions,
[11:27.000 --> 11:30.000] but how many are we talking about today?
[11:30.000 --> 11:33.000] So, we've done this kind of benchmark, obviously.
[11:33.000 --> 11:35.000] It's a bit outdated now.
[11:35.000 --> 11:39.000] It's probably worth trying to add one more zero to it.
[11:39.000 --> 11:42.000] So, it's one billion transactions per second on 45 nodes.
[11:42.000 --> 11:45.000] And the cool thing about it is not only the latency,
[11:45.000 --> 11:47.000] which is, like, 30 milliseconds,
[11:47.000 --> 11:49.000] but also the linear scaling,
[11:49.000 --> 11:52.000] which means you just need to add more nodes to your application.
[11:52.000 --> 11:54.000] So, with that being said,
[11:54.000 --> 11:57.000] let's just move directly to the demo,
[11:57.000 --> 12:00.000] and I'll show you how you can use Hazelcast
[12:00.000 --> 12:02.000] within your application.
[12:03.000 --> 12:07.000] So, for this demo, what I wanted is, kind of, you know,
[12:07.000 --> 12:09.000] you're writing your application,
[12:09.000 --> 12:11.000] a Java application, obviously,
[12:11.000 --> 12:13.000] and you have some kind of logging mechanism
[12:13.000 --> 12:15.000] within your application.
[12:15.000 --> 12:17.000] And your boss comes the next day and says,
[12:17.000 --> 12:21.000] hey, your log messages, or your solution,
[12:21.000 --> 12:23.000] we need to upgrade it in a way
[12:23.000 --> 12:25.000] where we'll provide some kind of alerts,
[12:25.000 --> 12:28.000] or predict what's going to happen next.
[12:28.000 --> 12:31.000] So, your task, essentially:
[12:31.000 --> 12:36.000] what you want is to kind of take exactly this application
[12:36.000 --> 12:39.000] and make sure you have some kind of scanning mechanism
[12:39.000 --> 12:42.000] and real-time stream processing in it.
[12:42.000 --> 12:44.000] So, obviously, you have two options here.
[12:44.000 --> 12:47.000] So, the first option is to download this jar,
[12:47.000 --> 12:49.000] plug it into your application, and run it.
[12:49.000 --> 12:51.000] So, that's good.
[12:51.000 --> 12:54.000] But the problem with this is that usually logs are stored
[12:54.000 --> 12:56.000] in different places.
[12:56.000 --> 13:00.000] So, what you want is for every machine
[13:00.000 --> 13:03.000] to send its logs to some, you know, central place.
[13:03.000 --> 13:05.000] So, usually, it's a cloud.
[13:05.000 --> 13:08.000] So, the idea of sending logs to the cloud
[13:08.000 --> 13:12.000] is to kind of provide one place
[13:12.000 --> 13:15.000] for every single machine to send logs to.
[13:15.000 --> 13:18.000] Hazelcast runs Hazelcast Viridian,
[13:18.000 --> 13:21.000] which is exactly what we're talking about, on the cloud.
[13:21.000 --> 13:24.000] And this is kind of like what you need to do.
[13:24.000 --> 13:28.000] So, you write your log message in some way.
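A sketch of what connecting a Java client to a managed cloud cluster can look like. The cluster name and discovery token are placeholders you would get from the cloud console, and a real Viridian connection typically also needs TLS settings, omitted here for brevity.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

public class CloudClientSketch {
    public static void main(String[] args) {
        ClientConfig config = new ClientConfig();
        config.setClusterName("YOUR_CLUSTER_NAME");        // placeholder
        config.getNetworkConfig().getCloudConfig()
              .setEnabled(true)
              .setDiscoveryToken("YOUR_DISCOVERY_TOKEN");  // placeholder

        // Every machine runs a client like this and ships its logs to the
        // same central, cloud-hosted cluster.
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
        System.out.println("Connected: " + client.getName());
    }
}
```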
[13:28.000 --> 13:32.000] So, obviously, you need to have context for your message.
[13:32.000 --> 13:37.000] And the idea is to store all logs in memory.
[13:37.000 --> 13:41.000] So, we're not going to store them in a database or file system,
[13:41.000 --> 13:44.000] because that means you're adding input-output latency
[13:44.000 --> 13:45.000] to your application.
[13:45.000 --> 13:47.000] So, we need to minimize this.
[13:47.000 --> 13:50.000] So, the idea is to use some kind of map structure.
[13:50.000 --> 13:52.000] And this map has a key,
[13:52.000 --> 13:54.000] which is, like, where is this data coming from,
[13:54.000 --> 13:56.000] from an IP address and port number,
[13:56.000 --> 13:58.000] and the message, which is the value.
[13:58.000 --> 13:59.000] So, it's your choice now
[13:59.000 --> 14:02.000] whether you use VARCHAR, for example, a string.
[14:02.000 --> 14:03.000] Obviously, it's faster.
[14:03.000 --> 14:06.000] Or you can use some kind of JSON format, for example,
[14:06.000 --> 14:09.000] if you're trying to do some kind of machine learning
[14:09.000 --> 14:12.000] and do, I don't know, maybe some classification on your logs.
[14:12.000 --> 14:14.000] So, once you have your key and value,
[14:14.000 --> 14:17.000] what you can do is proceed to save it into Hazelcast.
[14:17.000 --> 14:19.000] So, remember what we're talking about here?
[14:19.000 --> 14:21.000] So, what you want is to get this map
[14:21.000 --> 14:23.000] into Hazelcast.
[14:23.000 --> 14:27.000] So, the first step is to send your logs to the cloud, obviously,
[14:27.000 --> 14:30.000] because different logs are coming from different machines.
[14:30.000 --> 14:33.000] And the second step is to store them in memory.
[14:33.000 --> 14:37.000] So, this will allow you to access your logs in a much faster way.
[14:37.000 --> 14:39.000] And obviously, you need to do this mapping.
[14:39.000 --> 14:42.000] So, what you see here is kind of like SQL.
[14:42.000 --> 14:44.000] You can write it in SQL.
[14:44.000 --> 14:46.000] And once you have it in SQL,
[14:46.000 --> 14:49.000] you can proceed to create this instance.
[14:49.000 --> 14:51.000] So, in order to run Hazelcast,
[14:51.000 --> 14:54.000] what you need is to have some kind of, I don't know,
[14:54.000 --> 14:58.000] Hazelcast instance, and plug in your pipeline.
[14:58.000 --> 15:02.000] So, the pipeline is passed to the Jet engine, for example, for your processing.
[15:02.000 --> 15:05.000] And in here, what you say is, basically,
[15:05.000 --> 15:08.000] I'm defining: this is the IP address I want to run it on,
[15:08.000 --> 15:11.000] this is my port, and this is my data.
[15:11.000 --> 15:13.000] Now, I mentioned SQL.
[15:13.000 --> 15:16.000] So, the platform itself has Management Center.
[15:17.000 --> 15:22.000] So, this is really cool; it allows you to query your data.
[15:22.000 --> 15:26.000] So, what you see here is, I'm trying to query the data
[15:26.000 --> 15:29.000] of my logs, and really trying to understand what's going on.
[15:29.000 --> 15:31.000] So, on the left, you see the key,
[15:31.000 --> 15:33.000] which is, like, the IP address and port.
[15:33.000 --> 15:36.000] And on the far right side, or bottom right side,
[15:36.000 --> 15:38.000] you see the values.
[15:38.000 --> 15:42.000] So, essentially, what we're doing with logs
[15:42.000 --> 15:45.000] is we're giving a score to each log message.
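A sketch of the map write just described: the key says where the log came from (IP address and port), and the value is the message plus its score, stored as JSON so it stays queryable and usable for machine learning later. The address and payload are invented.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastJsonValue;
import com.hazelcast.map.IMap;

public class LogWriterSketch {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();

        // Key: "ip:port" of the machine producing the log;
        // value: the message and a score, as JSON.
        IMap<String, HazelcastJsonValue> logs = client.getMap("logs");
        logs.put("192.168.1.10:5701",                 // hypothetical source address
                new HazelcastJsonValue(
                        "{\"message\": \"Connection timed out\", \"score\": 80}"));

        client.shutdown();
    }
}
```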
[15:45.000 --> 15:47.000] So, if you're trying to define it, for example,
[15:47.000 --> 15:51.000] I don't know, for example, you have specific categories for logs,
[15:51.000 --> 15:54.000] whether it's information, or warning, or error.
[15:54.000 --> 15:56.000] So, that's good, but if you're trying to predict
[15:56.000 --> 15:58.000] what's going to happen next,
[15:58.000 --> 16:01.000] you probably need to have some kind of linear scale for it.
[16:01.000 --> 16:03.000] So, for example, you run it from 1 to 100,
[16:03.000 --> 16:06.000] and you give a value to your log message,
[16:06.000 --> 16:09.000] and this is the value you see there.
[16:09.000 --> 16:12.000] So, we take this data and we ingest it into memory.
[16:12.000 --> 16:14.000] So, we have the IP address and port number,
[16:14.000 --> 16:17.000] and we have a score for each log message.
[16:17.000 --> 16:20.000] And once we import it, we define a trend.
[16:20.000 --> 16:23.000] So, in my case, I'm taking a window
[16:23.000 --> 16:25.000] on this real-time stream processing,
[16:25.000 --> 16:29.000] and, in my case, it's two minutes, but it could be anything.
[16:29.000 --> 16:32.000] And I'm defining this trend based on the score value
[16:32.000 --> 16:34.000] for each log message.
[16:34.000 --> 16:37.000] And based on this, I try to create a predictive,
[16:37.000 --> 16:40.000] or prediction, map, so another map,
[16:40.000 --> 16:43.000] in order to say: send an alert,
[16:43.000 --> 16:48.000] which has a value of one, or zero, don't send an alert.
[16:48.000 --> 16:52.000] Obviously, you need to do some kind of programming for it.
[16:52.000 --> 16:55.000] So, the idea is to group messages based on the score
[16:55.000 --> 16:57.000] to define the trend, and from there,
[16:57.000 --> 16:59.000] you can use some kind of machine learning,
[16:59.000 --> 17:02.000] so linear regression or classification, for example,
[17:02.000 --> 17:04.000] to do some kind of prediction.
[17:04.000 --> 17:06.000] And you also need to output it,
[17:06.000 --> 17:10.000] which means you send it back to the user.
[17:10.000 --> 17:12.000] Now, this is good. So, this is how it works.
[17:12.000 --> 17:16.000] So, this is your actual code for building the prediction map.
[17:16.000 --> 17:19.000] So, we take the score for each log message,
[17:19.000 --> 17:23.000] we do the linear regression based on the window I defined,
[17:23.000 --> 17:27.000] and I'm simply checking: if the value is greater than 100,
[17:27.000 --> 17:31.000] send an alert; otherwise, don't send an alert.
[17:31.000 --> 17:34.000] And in here, what you need to focus on
[17:34.000 --> 17:37.000] is, kind of, the three ways to proceed from here.
[17:37.000 --> 17:41.000] So, there are three ways to do stream processing,
[17:41.000 --> 17:44.000] and the first one is to use SQL, for example.
[17:44.000 --> 17:47.000] So, you take it from the logs map
[17:47.000 --> 17:51.000] and you store it in another log map, with some filtering.
[17:51.000 --> 17:55.000] And the second option is to use a processor, or pipeline.
[17:55.000 --> 17:58.000] So, you read it from the map, and, as well,
[17:58.000 --> 18:01.000] you do some kind of alerts, for example, or trends.
[18:01.000 --> 18:05.000] So, the first two ways are called batch stream processing,
[18:05.000 --> 18:07.000] so it's not real time.
[18:07.000 --> 18:10.000] And the third option, so this is the main thing you want to do
[18:10.000 --> 18:14.000] if you want to provide some kind of real-time messaging,
[18:14.000 --> 18:17.000] is to look at, kind of, the map journal,
[18:17.000 --> 18:20.000] which is also similar to a map,
[18:20.000 --> 18:22.000] but it's like a ring buffer structure,
[18:22.000 --> 18:25.000] which allows you to start processing your logs
[18:25.000 --> 18:28.000] either from the start or from the end, depending on what you want.
[18:28.000 --> 18:32.000] And the pipeline itself takes this in-memory map,
[18:32.000 --> 18:35.000] the logs, and does some filtering.
[18:35.000 --> 18:38.000] So, if you see some value in this specific moment,
[18:38.000 --> 18:41.000] you can do these alerts, for example,
[18:41.000 --> 18:44.000] or you can send it as a message.
[18:44.000 --> 18:48.000] So, I think, kind of, I wanted to cover everything,
[18:48.000 --> 18:51.000] but because it's just 20 minutes here,
[18:51.000 --> 18:54.000] and that's not enough time to talk about everything,
[18:54.000 --> 18:57.000] let me just give you a summary of what you need to do
[18:57.000 --> 19:00.000] and open the stage for questions.
[19:00.000 --> 19:04.000] So, first of all, you need to store your logs in the cloud.
[19:04.000 --> 19:06.000] So, in my case, I'm using Hazelcast,
[19:06.000 --> 19:09.000] but obviously you can use any other cloud provider,
[19:09.000 --> 19:14.000] but you need to import Hazelcast into that cloud provider.
[19:14.000 --> 19:17.000] So, once you upload all the logs,
[19:17.000 --> 19:21.000] what you want is kind of to have some kind of cloud solution
[19:21.000 --> 19:25.000] for it, where you can use, for example, I don't know,
[19:26.000 --> 19:29.000] VARCHAR, if you want speed, or JSON,
[19:29.000 --> 19:33.000] for example, if you want to apply some kind of machine learning,
[19:33.000 --> 19:36.000] and you need to use some kind of map structure.
[19:36.000 --> 19:39.000] So, the idea is to store logs in memory
[19:39.000 --> 19:43.000] in order to have some kind of random access and rebalancing.
[19:43.000 --> 19:46.000] And obviously you need to configure this map,
[19:46.000 --> 19:49.000] because you have specific size limits on it,
[19:49.000 --> 19:54.000] so you need to have some kind of eviction policy on your data (see the configuration sketch below).
[19:54.000 --> 19:56.000] And obviously you need to consider security.
[19:56.000 --> 19:59.000] So, whatever you send to the cloud, obviously,
[19:59.000 --> 20:02.000] you need to have some kind of security mechanism,
[20:02.000 --> 20:05.000] just to make sure that you don't send sensitive data.
[20:05.000 --> 20:09.000] So, if you're interested in this topic of stream processing,
[20:09.000 --> 20:13.000] we're running an unconference in March, next month,
[20:13.000 --> 20:17.000] and it's free to join, and there is a training workshop as well.
[20:17.000 --> 20:19.000] So, all you need to do is just scan this code
[20:19.000 --> 20:22.000] and register for the real-time stream processing unconference.
[20:22.000 --> 20:24.000] It's community-based, so it's open source.
[20:24.000 --> 20:27.000] You get the training, you get a badge on top of it,
[20:27.000 --> 20:30.000] as well as we have a round table where you can hear from
[20:30.000 --> 20:33.000] industry experts as well as community users
[20:33.000 --> 20:35.000] on how they can contribute to open source projects.
[20:35.000 --> 20:39.000] And, you know, you can basically ask questions if you have any.
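A hedged sketch of the real-time option just described: a pipeline reading the map journal of the logs map, applying a two-minute window per log source, and writing 1 (alert) or 0 (no alert) into a predictions map. For brevity it uses a windowed average of the score as the trend signal rather than the full linear regression from the talk, and it assumes the event journal is enabled for the logs map in the cluster configuration.

```java
import java.util.Map.Entry;
import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.JournalInitialPosition;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import com.hazelcast.jet.pipeline.WindowDefinition;

public class TrendAlertSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.bootstrappedInstance();

        Pipeline p = Pipeline.create();
        // The map journal streams every put to the "logs" map as it happens,
        // starting from the oldest or the newest entry, as the talk describes.
        p.readFrom(Sources.<String, Long>mapJournal("logs",
                        JournalInitialPosition.START_FROM_OLDEST))
         .withIngestionTimestamps()
         .groupingKey(Entry::getKey)                            // one trend per log source
         .window(WindowDefinition.tumbling(TimeUnit.MINUTES.toMillis(2)))
         .aggregate(AggregateOperations.averagingLong(Entry::getValue))
         // 1 = raise an alert for this source, 0 = stay quiet.
         .writeTo(Sinks.map("predictions",
                 result -> result.getKey(),
                 result -> result.getValue() > 100 ? 1 : 0));

        hz.getJet().newJob(p).join();
    }
}
```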
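And a sketch of the map configuration the summary calls for: a per-node size cap with an eviction policy, plus a time-to-live, so the in-memory log map cannot grow without bound. The concrete numbers are arbitrary.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.EvictionPolicy;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MaxSizePolicy;
import com.hazelcast.core.Hazelcast;

public class LogMapConfigSketch {
    public static void main(String[] args) {
        Config config = new Config();

        MapConfig logs = new MapConfig("logs");
        logs.setTimeToLiveSeconds(3600);               // keep each log entry one hour
        logs.getEvictionConfig()
            .setEvictionPolicy(EvictionPolicy.LRU)     // evict least-recently-used first
            .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
            .setSize(100_000);                         // hypothetical per-node cap
        config.addMapConfig(logs);

        Hazelcast.newHazelcastInstance(config);
    }
}
```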
[20:39.000 --> 20:40.000] So, with that being said,
[20:40.000 --> 20:43.000] thanks very much for listening, and I'll open it up for questions.
[20:43.000 --> 20:44.000] Thank you.
[20:44.000 --> 20:45.000] Thank you.
[21:14.000 --> 21:40.000] Yeah, so it is possible to do, like, multiple streams,
[21:40.000 --> 21:42.000] joining multiple sources.
[21:43.000 --> 21:46.000] So, this is possible to do.
[21:46.000 --> 21:49.000] Obviously, you just need to find the configuration
[21:49.000 --> 21:51.000] for this specific case.
[21:51.000 --> 21:52.000] So, I mentioned sources here.
[21:52.000 --> 21:55.000] So, the sources are not only a single source;
[21:55.000 --> 21:56.000] it could be multiple sources,
[21:56.000 --> 22:00.000] and that's where you do join multiple sources.
[22:00.000 --> 22:06.000] So, it is possible to do it with the data.
[22:06.000 --> 22:08.000] Any other questions?
[22:08.000 --> 22:09.000] Yeah?
[22:10.000 --> 22:11.000] Yeah.
[22:11.000 --> 22:13.000] Please take a seat.
[22:13.000 --> 22:16.000] What is the difference?
[22:16.000 --> 22:18.000] So, the question is, what is the difference
[22:18.000 --> 22:20.000] between Hazelcast and Apache Flink?
[22:20.000 --> 22:21.000] So, there was a slide,
[22:21.000 --> 22:23.000] but I decided to remove it from here.
[22:23.000 --> 22:26.000] So, essentially, what you want, kind of,
[22:26.000 --> 22:30.000] for real time, is to look at minimizing latency
[22:30.000 --> 22:31.000] within your application.
[22:31.000 --> 22:35.000] So, we've done a benchmark between Hazelcast and Flink.
[22:35.000 --> 22:37.000] So, this is where things can make a difference.
[22:37.000 --> 22:40.000] So, if you're trying to basically write an application
[22:40.000 --> 22:41.000] for real-time stream processing,
[22:41.000 --> 22:43.000] you want to minimize latency.
[22:43.000 --> 22:44.000] So, the lower, the better,
[22:44.000 --> 22:49.000] and this is where Hazelcast has outperformed Flink
[22:49.000 --> 22:50.000] or Apache Spark.
[22:50.000 --> 22:53.000] So, the latency is the key difference between these two.
[22:53.000 --> 22:55.000] The results are online,
[22:55.000 --> 22:58.000] but I decided not to include them here, just for this session.
[22:58.000 --> 23:02.000] So, basically, it's the latency between different platforms.
[23:02.000 --> 23:03.000] So, that's what you want to focus on,
[23:03.000 --> 23:04.000] whether it's Hazelcast, Flink,
[23:04.000 --> 23:07.000] or any other platform which offers real-time stream processing.
[23:10.000 --> 23:11.000] Thank you.
[23:14.000 --> 23:15.000] Thank you very much, Thomas.
[23:15.000 --> 23:16.000] Thank you.
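On the first audience question, joining multiple streams: a minimal sketch of combining two sources in one pipeline with a simple merge. The test sources stand in for, say, a Kafka topic and an IoT feed; keyed operations such as hashJoin or co-group would be the alternatives when events must be correlated by key.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.StreamStage;
import com.hazelcast.jet.pipeline.test.TestSources;

public class MultiSourceSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.bootstrappedInstance();

        Pipeline p = Pipeline.create();
        // Two independent streams, stand-ins for two real sources.
        StreamStage<String> transactions = p.readFrom(TestSources.itemStream(5))
                .withIngestionTimestamps()
                .map(e -> "txn-" + e.sequence());
        StreamStage<String> weather = p.readFrom(TestSources.itemStream(1))
                .withIngestionTimestamps()
                .map(e -> "weather-" + e.sequence());

        // Combine them into a single stream for downstream processing.
        transactions.merge(weather).writeTo(Sinks.logger());

        hz.getJet().newJob(p).join();
    }
}
```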