[00:00.000 --> 00:09.640] Okay. Hello, everyone, and welcome to our talk today. So today we're going to [00:09.640 --> 00:15.880] talk about building a real-time analytics application with Apache Pulsar and Apache Pinot. [00:15.880 --> 00:21.640] Here today is myself. I'm Mary Grygleski. I'm a streaming developer advocate at DataStax, [00:21.640 --> 00:26.960] a company that up to this point has primarily been doing Apache Cassandra. Now [00:26.960 --> 00:31.480] we're also doing Apache Pulsar, which is an event streaming platform [00:31.480 --> 00:36.080] that's optimized for cloud-native environments. I'm based in Chicago. I'm also a [00:36.080 --> 00:40.800] Java Champion and president of the Chicago Java Users Group, among other things. [00:40.800 --> 00:45.560] I was a developer before too, just so you know, mostly in Java. And now we [00:45.560 --> 00:50.600] have Mark to introduce himself. Hello. And I do realize, Javier, that we've stolen [00:50.600 --> 00:55.840] your intro section. So I'll go straight in and say: hi, I'm Mark. I work [00:56.040 --> 01:01.000] at StarTree. We do Apache Pinot. I'm a developer advocate there. As we had on the first slide, we're going to be showing you [01:01.000 --> 01:04.200] how to [01:04.200 --> 01:10.200] build a real-time analytics dashboard with Pulsar, Pinot, and [01:10.200 --> 01:14.520] a Python dashboard library called Streamlit. Half the talk will be [01:14.520 --> 01:18.560] that, and we're going to see how well the Wi-Fi survives our attempts to use live [01:18.560 --> 01:24.920] data. So let's hope the demo gods are in the room. I guess the first thing to [01:24.960 --> 01:28.760] start with is to define what exactly this means. What is real-time analytics?
[01:28.760 --> 01:32.080] So we've seen lots of talks showing streaming data, so obviously that's a [01:32.080 --> 01:37.280] big part of it. But the real-time analytics bit is the part highlighted here. [01:37.280 --> 01:42.160] The goal is that we're trying to provide insights to make decisions quickly. And I guess the [01:42.160 --> 01:45.760] most important bit is that last part: acting on it. It's cool that we've got the data. We can [01:45.760 --> 01:49.560] capture our IoT data, we can capture orders coming from a website, we can capture [01:49.560 --> 01:54.440] the logs, but we want to do something with it. And if we move on from [01:54.480 --> 01:58.480] the definition: all the talks we've seen so far focus on events. We've got events [01:58.480 --> 02:03.120] representing things. Someone is purchasing something. Someone runs a search query. [02:03.120 --> 02:07.760] Someone is taking a taxi ride, like we saw in Javier's talk. Those events are cool [02:07.760 --> 02:10.880] on their own, right? But we really want to get some insights from them. What can [02:10.880 --> 02:14.240] we do? Eventually that leads to: okay, we've got some insight, and we do something [02:14.240 --> 02:18.000] as a result of knowing that this is happening. Say we know someone is searching [02:18.000 --> 02:21.800] for a pizza. Okay, let's get a pizza shop in front of them if we're doing Google Ads. [02:21.800 --> 02:25.160] If someone's then going in and buying the pizza, okay, let's show them in real time [02:25.160 --> 02:28.640] the things that people have also been buying along with that pizza or that [02:28.640 --> 02:33.360] hamburger, so we can make suggestions in the flow. We want to react to those [02:33.360 --> 02:37.840] events coming in. So we've seen lots of tools. We've seen Beam. We've seen Flink.
[02:37.840 --> 02:42.160] We've seen lots of tools for getting the streaming data, but we want to do something [02:42.160 --> 02:45.960] with that data. That's the whole purpose, I guess, of all the applications we're [02:45.960 --> 02:51.000] building. And in the real-time analytics space, if you imagine the value of [02:51.040 --> 02:56.080] data over time in this world where we're trying to do something with that streaming data, [02:56.080 --> 03:01.080] the value of that data goes down over time. So if I know today that, hey, you made [03:01.080 --> 03:04.880] an order and something's gone wrong with it, I can try to do something to make you [03:04.880 --> 03:08.640] happy, like give you a voucher or call you up and try to fix it. If I find out [03:08.640 --> 03:13.320] when I batch process that tomorrow, it's like, oh, you already hate me. It's too late. [03:13.320 --> 03:17.200] So we're focused on the left side of this diagram. In terms of real-time [03:17.240 --> 03:21.560] analytics, we want the data close to when it's coming in. Maybe not [03:21.560 --> 03:24.960] exactly when it comes in, but in the time just after that. That's the [03:24.960 --> 03:28.520] region we're living in. And there can be lots of people who are interested in this [03:28.520 --> 03:33.040] data. It could be the analysts inside a company, so maybe the people in our [03:33.040 --> 03:37.080] imaginary pizza shop who are actually running the operations [03:37.080 --> 03:39.560] of the pizza shop. It might be the management: hey, I want to know what's going [03:39.560 --> 03:42.680] on now. What's the revenue we're seeing right now, [03:42.680 --> 03:46.160] in the last 10 minutes or whatever it is. And it could be the users.
And so that's [03:46.200 --> 03:49.840] the interesting thing that has sort of changed, I guess, from doing [03:49.840 --> 03:53.760] traditional analytics to doing this real-time stuff: the data is almost coming back around. [03:53.760 --> 03:57.920] Users are creating the data, and then we're feeding it back to them in terms of products [03:57.920 --> 04:04.520] that they can then use. And that's more or less what I wanted to say there. So when [04:04.520 --> 04:07.520] we're building these applications, there are, I mean, they're not strictly [04:07.520 --> 04:11.200] like this, but there are sort of four obvious quadrants of applications that people build. [04:11.200 --> 04:15.480] They go along, oh yes, I should show you, they go along two axes: human facing [04:15.480 --> 04:20.760] and machine facing along the Y, and then internal and external on the X. If we go to the top [04:20.760 --> 04:24.680] side, that would be the observability area, and actually a lot of the time-series [04:24.680 --> 04:28.120] querying would be in there. This would be the area of something like Datadog. So, [04:28.120 --> 04:31.640] hey, I've got all this telemetry data coming in and [04:31.640 --> 04:35.520] I want to know what's going on in my AWS cluster. What's [04:35.520 --> 04:38.440] happening with all my Lambda functions? Are there any that are suddenly really [04:38.440 --> 04:43.400] slow? Can I figure that out? And maybe there's a machine that's actually [04:43.400 --> 04:46.440] interpreting that data rather than a human looking at it and going, oh yeah, it's [04:46.440 --> 04:51.560] that one Lambda function there that's really slow. It would probably be feeding into [04:51.560 --> 04:55.040] some other tool that's figuring it out.
If we come down here, this is where we're [04:55.040 --> 04:59.680] going to be today. Imagine this is a dashboard. I'm sure you've seen loads of [04:59.680 --> 05:04.280] dashboards; you've probably seen Tableau, you've seen loads of BI tools, and with a lot of them [05:04.280 --> 05:07.480] the data you're using is maybe yesterday's data. What we're going [05:07.520 --> 05:11.640] to show you today is how you could build one that updates as new data is coming [05:11.640 --> 05:16.440] in. If we come up to the top left, this is where the machine is processing the [05:16.440 --> 05:20.800] data, but it's for the users. This would be like what Javier was talking about: [05:20.800 --> 05:23.600] the fraud detection system. The data's coming in, it's being processed by something, [05:23.600 --> 05:27.040] and then I'm seeing the result, but maybe I'm not working it out [05:27.040 --> 05:30.560] with my own query. Oh, look, I wrote this really clever query, here's the [05:30.560 --> 05:34.600] thing. Maybe there's some pre-processing happening by some sort of machine learning [05:34.640 --> 05:38.960] algorithm. And then if we come down into the bottom corner, it would be some sort of external service. [05:38.960 --> 05:42.360] That could be, in our pizza example, an order tracking service. [05:42.360 --> 05:46.280] You know how on your phone you order something from, I don't know, whatever the food delivery [05:46.280 --> 05:50.600] service here is, Just Eat? Maybe you see: hey, look, I can see exactly where it is. How [05:50.600 --> 05:55.080] far away is it? Why on earth has the driver gone the wrong way to my house? You can see [05:55.080 --> 05:58.360] all that sort of information. Maybe too much. Maybe they should just show [05:58.360 --> 06:01.960] you that it's always coming towards your house.
It never went the absolute opposite way [06:02.040 --> 06:07.440] and got to your house an hour later. So just to show you some real-world [06:07.440 --> 06:12.280] examples of where this is used. LinkedIn is one: the "who viewed [06:12.280 --> 06:16.280] your profile" feature. I guess most of you have probably seen this, [06:16.280 --> 06:19.800] people spying on you. You can see all the people that [06:19.800 --> 06:23.040] looked at you, and it's very up to date, right? If I went and viewed [06:23.040 --> 06:25.720] one of your profile pages, you would see it straight away: hey, look, Mark looked [06:25.720 --> 06:30.240] at it. A use for that would be, maybe someone, I guess it's often [06:30.280 --> 06:32.760] recruiters, right? You're like, oh, I wonder why that person is following me. I'm going [06:32.760 --> 06:36.480] to contact them. So it's almost a real-time way of interacting [06:36.480 --> 06:39.720] with someone, whether you wanted to collaborate with them on something or, [06:39.720 --> 06:44.200] I guess, they've got a job that's available. They use it in the news feed as well. I guess [06:44.200 --> 06:48.560] this is similar to what you would see in Facebook, or even in TikTok [06:48.560 --> 06:53.080] and those sorts of tools. The goal is to make you interact with the product [06:53.080 --> 06:56.680] more. They want you to stay on it, so they need to show you what is happening now [06:56.840 --> 07:00.520] so that you're going to stay on it and not go away and do [07:00.520 --> 07:04.040] something else, which potentially is more useful, but they want you to just stay on [07:04.040 --> 07:07.840] there.
And then, for LinkedIn, this one's, I guess, a little bit [07:07.840 --> 07:12.440] of a smaller user base, but for the recruiters: I can see the [07:12.440 --> 07:17.040] trends of what is happening in terms of what jobs are available, what places in the world those [07:17.040 --> 07:20.840] are, and so on. And then Uber Eats, let's do one more example. [07:20.840 --> 07:23.360] This one's kind of what I was saying: [07:23.400 --> 07:27.320] it's very much a dashboard approach, and this is for a restaurant manager. If [07:27.320 --> 07:30.680] they were hosting their restaurant on there, the things that are [07:30.680 --> 07:33.600] interesting are things like missed orders. Hey, we've made a mess of this [07:33.600 --> 07:37.400] order. Can we fix it now rather than waiting till tomorrow? We've got this order [07:37.400 --> 07:40.920] that's gone wrong. Can we go and fix it? It's almost like you're able to achieve [07:40.920 --> 07:43.760] the customer service that you have in a restaurant, where you can see [07:43.760 --> 07:47.240] with your own eyes, okay, these people look really angry with me. The data [07:47.240 --> 07:51.120] being shown is giving you the equivalent of the in-the-restaurant experience without [07:51.160 --> 07:55.520] being in the restaurant. Okay, so those are some examples of [07:55.680 --> 07:58.760] people who have built these things, and there are way more of them. Those are just [07:58.760 --> 08:02.000] some that I picked out. So how do we build that? There are some [08:02.000 --> 08:06.640] properties that we need to achieve, and some of these Javier was talking about [08:06.640 --> 08:10.520] in his talk.
The first one is that we want to be able to get the data in quickly. In [08:10.520 --> 08:14.160] these applications it's generally coming from a streaming data platform of some [08:14.160 --> 08:18.320] sort. In our talk it's going to be Pulsar, and we need to be able to get the data into [08:18.360 --> 08:22.720] Pulsar and then into wherever we want to query it from, in this case into [08:22.720 --> 08:27.080] Pinot, and we need to get it in there very quickly. Once it's in there, we want to be [08:27.080 --> 08:32.040] able to query it very quickly as well. One way of thinking about it is that in these applications [08:32.040 --> 08:37.360] we want OLTP-type query speeds on OLAP data. So we want to be querying [08:37.360 --> 08:41.240] everything, but getting the results in, imagine, the time of a refresh on a web page [08:41.240 --> 08:44.640] or a page on a mobile app. I don't want to be sitting [08:44.720 --> 08:49.080] there waiting for five or ten seconds for the results, right? And that particular [08:49.080 --> 08:53.920] requirement is much more the case when it's an external user. If it's inside [08:53.920 --> 08:56.640] a company it's fine, you can just go and get a coffee and wait for the results. [08:56.640 --> 08:58.840] But if it's outside, they're not going to do that. They're going [08:58.840 --> 09:02.880] to use another application instead. And then finally, we want to be able to scale it, [09:02.880 --> 09:06.960] right? It could be one dashboard doing loads of different queries, [09:06.960 --> 09:11.720] aggregating and bringing everything together into one view, or maybe [09:11.760 --> 09:16.120] lots of users doing it concurrently. But the end result is that lots of queries are coming [09:16.120 --> 09:19.280] in.
We need to be able to handle those, and it can't affect those other two things either. [09:19.280 --> 09:23.640] So we need to still be doing lots of those concurrent queries while ingesting [09:23.640 --> 09:28.400] big amounts of data very quickly. So how do we go about building one of those? [09:28.400 --> 09:31.560] Around the outside here we've got some of the properties that you would need, and [09:31.560 --> 09:35.280] in the middle we've got a couple of tools that can achieve this: in this case, [09:35.280 --> 09:38.960] Pulsar and Pinot. You can see we want to achieve real-time ingestion. The [09:38.960 --> 09:42.680] data will often be very wide: lots of columns, lots of properties, potentially [09:42.680 --> 09:45.240] nested. Then we've got to do something with it to figure out how we're going to get [09:45.240 --> 09:49.080] it into a structure that we can query. And you can see some of the other properties: [09:49.080 --> 09:53.000] the data needs to be fresh, we want to do thousands of queries [09:53.000 --> 09:57.880] a second, and the latency needs to be OLTP style. So I'm going to hand the [09:57.880 --> 09:58.880] microphone back to Mary now. [09:58.880 --> 10:04.240] Okay. Thank you, Mark. So now we're going to focus for just a couple of [10:04.240 --> 10:09.320] minutes on Apache Pulsar. So Pulsar, right? How many of you, real quick, have [10:09.320 --> 10:14.000] heard of Pulsar or worked with it? Oh, working with it. Okay, cool. So some of you have, and then of course [10:14.000 --> 10:18.800] there are also folks using Kafka too. But here I want to tell you why we [10:18.800 --> 10:23.360] want to use Pulsar, right?
So for those of you who are new [10:23.360 --> 10:28.560] to Pulsar, there are a couple of components to it. Primarily there are the brokers, [10:28.560 --> 10:33.800] which are stateless Java runtimes, and it's very flexible too. For [10:33.840 --> 10:38.480] clients, you write your producer and consumer; it takes on a pub-sub [10:38.480 --> 10:43.240] type of architectural pattern, right? I won't have time to get into all the details, [10:43.240 --> 10:46.480] but just to give you a highlight: the producer and consumer are what you write. You can write them [10:46.480 --> 10:51.920] in Java, in Go, in Python, in any of the languages supported by the community. [10:51.920 --> 10:57.400] Then there are multiple brokers running, and it's optimized to run in a cloud-native [10:57.400 --> 11:04.520] environment. Also, instead of managing all of the huge amounts of log messages itself, [11:04.520 --> 11:09.080] it leverages BookKeeper, the Apache BookKeeper project, which is [11:09.080 --> 11:16.440] a high-availability, fast-read, fast-write distributed [11:16.440 --> 11:22.960] logging system. It also makes use of ZooKeeper to help it manage the cluster [11:23.000 --> 11:28.640] aspect of things. Now, really quickly, to give you an introduction: Pulsar was developed [11:28.640 --> 11:34.760] first by Yahoo back in 2013 or so, basically out of the recognition that we need an event streaming [11:34.760 --> 11:40.760] platform that can run very effectively in a cloud-native environment. They contributed it [11:40.760 --> 11:46.720] to the Apache Software Foundation in 2016, and it very quickly became a top-level project [11:46.720 --> 11:53.440] in 2018. Again, it's very cloud native in nature. It's already cluster based. Multitenancy [11:53.440 --> 11:57.760] is supported too.
Again, I've already talked about the simple client APIs that you can [11:57.760 --> 12:02.560] write in many different languages. And one of the strengths of Pulsar [12:02.560 --> 12:07.520] is that it separates out compute and storage. So Pulsar manages all of the messaging [12:07.520 --> 12:12.840] and has BookKeeper manage all of the log messages. Also, [12:12.920 --> 12:18.240] Pulsar has guaranteed message delivery, and it has the Pulsar Functions framework, which is [12:18.240 --> 12:23.960] very lightweight, so you don't need to rely on any external libraries or vendors [12:23.960 --> 12:30.160] to do message transformation as you are constructing your data pipeline. [12:30.160 --> 12:35.760] Another feature is tiered storage offload: if [12:35.760 --> 12:42.040] messages become cold, [12:42.040 --> 12:46.600] they get moved off to offline storage such as S3 buckets and [12:46.600 --> 12:50.880] things like that. Okay, just a real quick slide about streaming [12:50.880 --> 12:55.600] versus not streaming. In modern-day streaming, as you can see, we're ingesting [12:55.600 --> 13:01.960] data, and this is what we'll be ingesting for analytics processing, [13:01.960 --> 13:06.840] so something like Pinot. You ingest data without writing to disk the traditional [13:06.840 --> 13:11.320] way, which speeds up the whole process, and it processes the [13:11.320 --> 13:15.760] data in memory. When you're done with processing, you can use a Pulsar Function to [13:15.760 --> 13:20.200] transform your data however you need, and then you output your data to a sink. There [13:20.200 --> 13:24.720] are connectors that help you do that.
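The talk doesn't show a Pulsar Function on screen, but a minimal sketch gives a feel for how lightweight the framework is: a Function is just a class with a `process` method that Pulsar calls once per message. The `FlattenWikiEvent` name and the choice of fields here are my own illustration, not the speakers' code; the transform itself is kept as a plain function so it works outside the Pulsar runtime too.

```python
import json

def flatten_event(raw):
    """Pull a few fields out of the nested recent-changes JSON and flatten them."""
    event = json.loads(raw)
    return json.dumps({
        "id": event.get("meta", {}).get("id"),
        "wiki": event.get("wiki"),
        "user": event.get("user"),
        "ts": event.get("timestamp", 0) * 1000,   # epoch seconds -> epoch millis
    })

try:
    # The Function base class is only importable where pulsar-client is installed.
    from pulsar import Function

    class FlattenWikiEvent(Function):
        # Pulsar invokes process() once per message on the input topic and
        # publishes the return value to the configured output topic.
        def process(self, input, context):
            return flatten_event(input)
except ImportError:
    pass  # not running inside a Pulsar Functions environment
```

Deploying a function like this between the raw topic and the topic Pinot reads from is the "cleaner state" option Mark mentions later, where the JSON arrives in Pinot already flattened.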
So the whole pipeline is designed [13:24.720 --> 13:30.040] to be very efficient. That's what it is. And now we're back to Pinot a little more, and [13:30.040 --> 13:35.120] then the demo too. All right, so what is Pinot? This diagram more or less [13:35.120 --> 13:41.240] explains it. As I say, we've got data coming in from sources. In our case it's [13:41.240 --> 13:44.480] going to be Pulsar, but you can see there are lots of other ones. What's kind [13:44.480 --> 13:48.040] of interesting is that, in theory, you could load streaming and batch data sources into the same tables [13:48.040 --> 13:52.440] and query them both together. Once the data comes [13:52.440 --> 13:57.560] in, it's a column store. Then you can do some aggregation; [13:57.560 --> 14:01.100] there are all the different types of indexes on top of that, and you can do some pre-materialization [14:01.100 --> 14:05.240] as well. On the right-hand side, we've got a couple of the use cases. So just [14:05.240 --> 14:10.480] to quickly show you the architecture of how this works, from the far side: [14:10.480 --> 14:14.720] the data is coming in from Pulsar here. We're going to be using [14:14.720 --> 14:17.240] three components: we've got a controller, we've got a server, and then we'll have a broker, [14:17.240 --> 14:21.080] which will come up here. The controller is the manager of the cluster, taking [14:21.080 --> 14:26.160] care of: hey, where does this data need to go? And Pinot uses [14:26.160 --> 14:32.280] a tool called Helix, which sits on top of ZooKeeper. It then breaks [14:32.280 --> 14:38.080] data out into segments. The segments, in this particular case, will map to the partitions [14:38.080 --> 14:42.800] coming from Pulsar.
So each partition will go to a segment. If we had, [14:42.800 --> 14:46.160] for example, four partitions, we'd get four different segments. And if you were doing [14:46.160 --> 14:48.960] it in a cluster (in our case we'll just do it on my machine), but if you were doing it [14:48.960 --> 14:53.480] in a cluster, it would then decide where to replicate the data coming [14:53.480 --> 14:57.680] from each of the partitions. It might put segment one on servers one and four, segment two [14:57.680 --> 15:01.680] on servers two and three, and so on; the servers hold the data. So that's where [15:01.680 --> 15:05.480] the data is going. Remember, the controller manages everything. Then we have the broker, [15:05.520 --> 15:09.240] which takes care of the querying. The query comes in here. For example, here we're [15:09.240 --> 15:14.240] counting from a table how many rows there are for the country US. The broker sends [15:14.240 --> 15:18.200] it out in a scatter-gather type pattern to the servers that it knows have the data [15:18.200 --> 15:22.400] that might be appropriate. They send their results [15:22.400 --> 15:28.040] back, and the broker takes care of aggregating them and sending the result back to the client. Okay, [15:28.040 --> 15:32.720] now let's have a look at a demo. Let's see if I can get this to work. In case you want to look at it, [15:32.760 --> 15:38.440] this is where the demo lives. Hopefully that QR [15:38.440 --> 15:46.200] code still works. I'll just let you take a picture of it. And what it is, you can kind of [15:46.200 --> 15:51.640] tell from the name. It's a wiki demo. As I said before, it's going to sit [15:51.640 --> 15:56.000] in this bottom corner: a real-time dashboard. And this is the architecture of it. So we've got a [15:56.000 --> 15:59.680] streaming API.
And the data is going to come in, we've got a Python application that's going to [15:59.680 --> 16:03.760] process it and put it into Pulsar, from Pulsar into Pinot, and then we're going to have a Streamlit [16:03.760 --> 16:09.440] dashboard on the end. We've got 15 minutes, so let's see how we go. So Wikimedia has a really [16:09.440 --> 16:17.440] nice recent changes feed. This is capturing all the changes that are being made to different [16:17.440 --> 16:22.560] properties in Wikimedia. They actually store it internally in Kafka, but then expose it using [16:22.560 --> 16:26.800] the server-sent events protocol. This is an HTTP protocol; it basically just streams out loads [16:26.800 --> 16:32.000] and loads of these events, effectively an infinite stream. And this is what it looks [16:32.000 --> 16:36.400] like. So we get three properties: an event, an ID, and a data. The main bit that's [16:36.400 --> 16:40.560] interesting is data. You can see it's a sort of nested JSON structure of stuff. [16:40.560 --> 16:46.560] You can sort of see, we've got the URL that's been changed, [16:46.560 --> 16:51.440] you've got a request ID, you've got an ID. It says somewhere, yeah, what was the title of the [16:51.440 --> 16:57.920] page? What was the timestamp? What was the revision? It's got lots of kind of interesting [16:57.920 --> 17:02.880] stuff that we can pull out. Nope, not done yet. Okay, so now that would have been, [17:02.880 --> 17:09.760] that would have been amazing. So we have got, if I come, can you see? Yes. All right, great. Let me [17:12.720 --> 17:17.840] see if I can get this to stay on my, hopefully, is that good enough? Is that good enough? Yeah. [17:17.840 --> 17:24.800] All right, perfect. So we have, let's see if I can type, pygmentize. Ah. [17:28.160 --> 17:33.040] Yeah, there we go. So we've got a script called wiki.py.
So this is what it looks like. [17:34.400 --> 17:38.080] I guess the red is not entirely readable. But this is [17:38.960 --> 17:42.800] the URL that we're going to be working with. You can see, [17:43.760 --> 17:49.280] if I paste that into my browser. Oh, I don't know. What's that done? It's going to the wrong one. [17:49.280 --> 17:55.600] Come here. Hang on. I'll escape that. So if we paste that in here, this is what [17:55.600 --> 17:59.360] it looks like. You get loads and loads of messages. Chrome will eventually get very, [17:59.360 --> 18:02.080] very angry with you if you leave this running forever. But you can see the messages [18:02.080 --> 18:06.880] coming through. And so we're going to be processing that. You can see we're [18:06.880 --> 18:11.680] just using the requests library in Python to make a wrapper around it. That's this bit here: [18:11.920 --> 18:15.920] wrap around that particular endpoint, and then it's going to stream the events. And we've [18:15.920 --> 18:21.040] got this SSE, the server-sent events Python client. Wrap that round and we get an [18:21.040 --> 18:27.440] infinite stream of messages. So if we were to run this, let's just go here, and we'll pipe it [18:27.440 --> 18:31.920] into jq just so it's a bit more readable. You can see the messages coming through. [18:33.200 --> 18:37.280] If I stop it and scroll up, you can see what it looks like. You've got [18:38.240 --> 18:44.000] the schema, meta, ID, type. You can kind of see, oh, this is, [18:44.000 --> 18:47.440] oh, this is Russian. Someone's changing Russian Wikipedia at the moment, apparently. [18:48.080 --> 18:52.880] There you go. Thanks. So we've got lots of different stuff. And that's the first [18:52.880 --> 18:57.040] bit, right?
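A minimal sketch of what wiki.py is doing, assuming the public Wikimedia recent-changes endpoint. The actual demo uses the requests and sseclient libraries; this sketch sticks to the standard library with a hand-rolled server-sent-events parser so it's self-contained.

```python
import json

def parse_sse(lines):
    """Yield one dict per server-sent event from an iterable of text lines.

    Each event in the wiki feed carries 'event:', 'id:' and 'data:' fields,
    and a blank line terminates the event.
    """
    fields = {}
    for line in lines:
        line = line.rstrip("\n")
        if line == "":                    # blank line ends the current event
            if "data" in fields:
                yield fields
            fields = {}
        elif not line.startswith(":"):    # lines starting with ':' are comments
            key, _, value = line.partition(":")
            fields[key] = value.lstrip(" ")

if __name__ == "__main__":
    # Streams the live feed, so this part needs network access.
    from urllib.request import urlopen
    URL = "https://stream.wikimedia.org/v2/stream/recentchange"
    with urlopen(URL) as resp:
        for event in parse_sse(raw.decode("utf-8") for raw in resp):
            change = json.loads(event["data"])
            print(change.get("wiki"), change.get("title"))
```

Piping the output through jq, as in the demo, is just a readability convenience; the `data` field is the nested JSON payload that everything downstream cares about.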
We wanted to get the stream into a fashion where we've got hold of it, right? So we've done [18:57.040 --> 19:01.120] that bit. The next bit is that we want to get it into Pulsar. So let's have a look at our second script. [19:01.680 --> 19:06.240] This one gets it into Pulsar. Oh, hang on. It's a bit longer, so let's just pipe it into [19:06.240 --> 19:11.600] less first. So we're going to be using the Pulsar client. And the first bit is the same, right? [19:11.600 --> 19:18.480] We still build our streaming client here for the wiki. But then we also build a [19:18.480 --> 19:23.680] Pulsar client here. So we're creating a Pulsar client and we point it at localhost:6650. This is [19:23.680 --> 19:27.120] the port. We're actually running all this in Docker, but it exposes the ports out to the [19:27.120 --> 19:33.360] host OS. We create a producer, and our topic is going to be called wiki-events. [19:33.360 --> 19:37.360] Then if you scroll down here, this is still the same: we're still looping through that [19:37.360 --> 19:43.120] stream of stuff. But this time we call the Pulsar producer and we send it a message, and [19:43.120 --> 19:49.280] we're sending it async, which means we're not going to wait for the response, right? And then [19:49.280 --> 19:55.040] finally we flush it every 100 messages. Anything else to add on the Pulsar [19:55.040 --> 20:01.120] production side? That's good. All right, cool. So now let's call that one. If we call python [20:02.080 --> 20:08.720] wiki_to_pulsar.py, it runs and you can see it will sort of run away and everything [20:08.720 --> 20:12.960] is happy. I can never remember quite exactly what the commands are, but we can [20:12.960 --> 20:18.320] then use the Pulsar client to check that these messages are making their way in. So we can call [20:18.320 --> 20:21.280] the Pulsar client and we say, actually, do you want to explain it?
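The producer loop described here could be sketched like this with the pulsar-client package. `send_async` takes the payload plus a completion callback and returns immediately; the topic name, port, and flush-every-100 cadence follow the talk, while the `flush_batches` helper name is my own.

```python
import json

def flush_batches(events, producer, batch_size=100):
    """Send each event to Pulsar without waiting for acks, flushing every batch_size."""
    sent = 0
    for event in events:
        # send_async returns immediately; the callback fires when the broker acks.
        producer.send_async(json.dumps(event).encode("utf-8"),
                            lambda result, msg_id: None)
        sent += 1
        if sent % batch_size == 0:
            producer.flush()   # block until this batch is acknowledged
    producer.flush()           # drain whatever is left over
    return sent

if __name__ == "__main__":
    import pulsar
    # 6650 is the broker port exposed out of Docker to the host OS.
    client = pulsar.Client("pulsar://localhost:6650")
    producer = client.create_producer("wiki-events")
    flush_batches([{"hello": "world"}], producer)
    client.close()
```

Sending async and flushing periodically is the throughput trade-off Mark alludes to: you avoid a round trip per message but still get a bound on how many unacknowledged messages are in flight.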
You're [20:21.280 --> 20:27.440] probably better at it than me. Yeah, we're consuming the events, and then there's also the subscription; you give [20:27.520 --> 20:32.480] it a name. Okay. So it's going to pick up wherever it left off the last time I did it. Let's [20:32.480 --> 20:36.480] see. So there we go. I mean, this is going way faster. I guess it's catching up on the ones [20:36.480 --> 20:40.320] we've done before, and eventually it will come down to the same sort of speed. [20:40.320 --> 20:44.800] So we can see these are the messages coming through Pulsar. We've put them in JSON [20:45.520 --> 20:50.000] format, but it can handle multiple formats if you wanted. So yeah, all [20:50.000 --> 20:53.360] the typical ones you would expect. I guess most people would be doing Avro with some sort of [20:54.000 --> 20:57.120] schema attached to it, but for simplicity in the demo, [20:57.120 --> 21:01.680] JSON is quite a nice tool. Okay. So we've got it that far. The next thing is that we want to get it [21:01.680 --> 21:09.360] into a form where we can query it from, let's see, where's my, right? So as I say, [21:09.360 --> 21:15.440] Pinot's model is the table. You create a table and you can query it. And the first thing is that [21:15.440 --> 21:21.520] we need to have a schema for that table. So this is what the Wikipedia one looks like. We're [21:21.520 --> 21:24.880] pulling out some of the fields: we've got id, wiki, user, title, [21:24.880 --> 21:30.720] comment, stream, type, bot, and then we've specified a timestamp down at the bottom. You'll notice maybe [21:31.600 --> 21:35.440] that this is using the language of data warehousing. So these are dimension
We don't have any metric fields because there isn't really anything we can count [21:38.880 --> 21:42.960] in this dataset, but if you had something you were counting, you could [21:42.960 --> 21:48.400] put that in as a metric field. And then finally we've got a date-time field as well. Now in general, [21:49.040 --> 21:54.000] by default, whatever fields you put in here map exactly: if there is a value in your [21:54.000 --> 21:58.560] source with this name, it maps directly. So in an ideal world everything is flat and we just [21:58.560 --> 22:04.960] go map, map, map. In this case it's not quite as simple as that, so [22:04.960 --> 22:08.240] I'll just show you the table config that goes with it. The first thing [22:08.240 --> 22:13.200] to notice is that we can use these transformation functions to get the data into our columns. The [22:13.200 --> 22:17.840] id is under meta.id, so we're using this JSONPATH function to pull it out. If we had [22:17.840 --> 22:21.920] cleaned the data up beforehand, we could have used Pulsar's [22:21.920 --> 22:25.680] serverless functions to do the same thing; then we'd have it in a cleaner state and it would [22:25.680 --> 22:30.720] go straight into Pinot. So you can choose between those options. This is not [22:30.720 --> 22:36.000] a replacement for a stream processor, right? This is just for very tiny adjustments to the data [22:36.000 --> 22:40.320] if it's slightly wrong. And then the other thing was that the timestamps [22:40.960 --> 22:46.720] the wiki gives you are in epoch seconds, and I need epoch milliseconds for [22:46.800 --> 22:52.160] Pinot, so I multiply by a thousand. Now let's... oh, sorry, I forgot the top bit. The [22:52.160 --> 22:56.080] top bit: the name of the table needs to match the name of the schema.
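A Pinot schema along the lines Mark describes might look like this; the field names are the ones listed in the talk, while the data types and the `ts` column name are assumptions:

```json
{
  "schemaName": "wikipedia",
  "dimensionFieldSpecs": [
    {"name": "id", "dataType": "STRING"},
    {"name": "wiki", "dataType": "STRING"},
    {"name": "user", "dataType": "STRING"},
    {"name": "title", "dataType": "STRING"},
    {"name": "comment", "dataType": "STRING"},
    {"name": "stream", "dataType": "STRING"},
    {"name": "type", "dataType": "STRING"},
    {"name": "bot", "dataType": "BOOLEAN"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```

Note there is no `metricFieldSpecs` section, matching the point that this dataset has nothing to count; a counting use case would add its columns there.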
So it's wikipedia. [22:56.800 --> 23:00.560] We need to specify what the timestamp column is. And then this is the config that's telling it, hey, I need [23:00.560 --> 23:05.360] to... actually, I'm looking at the wrong one; this should be the Pulsar [23:06.720 --> 23:14.560] table config. There we go. Sorry. So you see here, we say: I'm going to be pulling the [23:14.560 --> 23:21.040] data from the Pulsar stream, and this is my Pulsar connection string. I then need to tell it [23:21.040 --> 23:24.560] which decoder to use, and I need to tell it the [23:24.560 --> 23:29.280] Pulsar consumer factory. And this last bit is not really necessary for now. [23:29.280 --> 23:33.360] What we're going to do now is add our table. This is going [23:33.360 --> 23:36.160] to create the table, and then immediately it's going to start consuming. If you didn't have any [23:36.160 --> 23:40.880] messages in there, it obviously wouldn't consume anything. But since we do, we should be able [23:40.880 --> 23:46.160] to see our table here, wikipedia. And you can see the messages are coming in: [23:46.160 --> 23:50.960] at the moment, 9,000 messages. We could write a query, since it's SQL [23:51.600 --> 23:56.480] on top of it. Thanks. So we could say, okay, let's have a look at which user is doing [23:56.480 --> 24:05.520] the most stuff. Hang on. Oh, I forgot the GROUP BY: GROUP BY user. ORDER BY [24:05.520 --> 24:09.200] COUNT(*) descending. I can't type. So we could do something like seeing [27:05.520 --> 27:13.120] where people are changing stuff. And then finally, let me just [27:13.120 --> 27:21.040] quickly show you exactly what we're doing. Okay, they're back again. Should I [27:21.040 --> 27:29.200] put it back? Yeah.
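Pieced together, the realtime table config described above might look something like this. It is a sketch: the `stream.pulsar.*` keys and class names follow the shape of Pinot's Pulsar plugin, but exact property names vary by Pinot version, and the source field name `timestamp` for the seconds-to-milliseconds transform is assumed:

```json
{
  "tableName": "wikipedia",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "wikipedia",
    "timeColumnName": "ts",
    "replication": "1"
  },
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "pulsar",
      "stream.pulsar.topic.name": "wiki-events",
      "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650",
      "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
      "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder"
    }
  },
  "ingestionConfig": {
    "transformConfigs": [
      {"columnName": "id", "transformFunction": "JSONPATH(meta, '$.id')"},
      {"columnName": "ts", "transformFunction": "\"timestamp\" * 1000"}
    ]
  },
  "tenants": {},
  "metadata": {}
}
```

The `transformConfigs` entries are the two adjustments from the talk: pulling `id` out of the nested `meta` object with `JSONPATH`, and multiplying the epoch-seconds timestamp by a thousand.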
Okay. All right, easy one, easy one. Sorry. So what we're doing [27:29.280 --> 27:34.560] is all in Python. We're using the Pinot Python driver, and we're connecting to Pinot here. [27:35.440 --> 27:38.240] And then we're basically just running some queries. The one that we're showing you, [27:38.240 --> 27:42.320] the table on the top (this red is not entirely readable), is aggregation plus [27:42.320 --> 27:47.040] filtering: capturing what happened in the last minute versus what happened [27:47.040 --> 27:51.200] the minute before. And then we just run the query and stick the result into pandas. [27:51.200 --> 27:54.800] We do the same for the numbers on the top there: Streamlit has a [27:54.800 --> 27:58.560] metric component, so we can pass in a value and then we can build [27:58.560 --> 28:01.840] the deltas. The delta is this minute minus the previous minute, so you can see the [28:01.840 --> 28:07.280] change. And yeah, I guess that's probably enough on the code. But [28:07.280 --> 28:11.520] if you want to have a look at it, it's all in the repository. And in theory, you've [28:11.520 --> 28:15.360] seen me literally just running the commands in there, so it should just work. [28:15.360 --> 28:19.760] Hopefully you can follow along. But I just came back here to conclude. [28:21.440 --> 28:26.320] So yeah, lots of people are doing stuff with Pinot, and some of those users [28:27.200 --> 28:33.120] use Pulsar as well. So lots of different people using it. Just a conclusion [28:33.120 --> 28:38.000] and then one more slide. I hope that you can see that by combining [28:38.000 --> 28:43.360] these different tools together, you can build some quite cool applications.
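The dashboard pieces Mark walks through can be sketched like this. It is a hedged sketch: `metric_delta` mirrors the value-plus-delta pair that Streamlit's `st.metric` takes, and `changes_query` shows the shape of a last-minute-versus-previous-minute Pinot query. The table and column names follow the demo, `ago()` is Pinot's now-relative epoch-millis function, and the helper names are made up; the real app runs these through the `pinotdb` driver and loads results into pandas:

```python
def metric_delta(current, previous):
    """Return (value, delta) in the shape st.metric(value=..., delta=...) expects."""
    return current, current - previous


def changes_query(minutes_back):
    """Pinot SQL counting changes in the one-minute window `minutes_back` minutes ago.

    ago('PT1M') evaluates to "now minus one minute" in epoch milliseconds,
    matching the ts column's units.
    """
    return (
        "SELECT count(*) FROM wikipedia "
        f"WHERE ts > ago('PT{minutes_back + 1}M') "
        f"AND ts <= ago('PT{minutes_back}M')"
    )
```

In the app, `changes_query(0)` and `changes_query(1)` would be executed against Pinot and the two counts passed to `st.metric(value=..., delta=...)`, giving the this-minute-minus-previous-minute change shown on the dashboard.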
In this [28:43.360 --> 28:46.480] case, I've just shown you the sort of application that could be built around what the users are doing. But [28:46.480 --> 28:51.440] you could imagine, if you were looking at real wiki users, you might want [28:51.440 --> 28:54.880] to try and encourage the ones you see coming in new. We've built something fresh [28:54.960 --> 29:00.560] there, and fast, at scale. As for me, I'm writing a book with [29:00.560 --> 29:04.000] an ex-colleague of mine showing how to build these types of things. If you're interested [29:04.000 --> 29:10.320] in that, there's my contact on your left. And I'll let Mary conclude. Okay. Here's just [29:10.320 --> 29:15.280] how you can connect with me and also with Apache Pulsar. If you're interested, we have a Slack group [29:15.280 --> 29:20.480] and also a wiki page, the Apache Pulsar neighborhood team's page, where you can [29:21.040 --> 29:27.040] build up on more stuff. So, okay, I think this is pretty much it. And if you need more [29:27.920 --> 29:32.400] links to Pulsar, this is the page, and we can share the slides [29:32.400 --> 29:37.760] with you so you can have more. We also have DataStax Developers on YouTube, with [29:37.760 --> 29:43.680] five-minute videos about Pulsar if you're interested in that. Also check out DataStax [29:43.680 --> 29:48.080] Developers, where you can find examples, because today we couldn't get into a lot of these things. [29:48.080 --> 29:51.760] And myself, I actually have a Twitch stream every Wednesday afternoon, [29:52.480 --> 29:56.400] Central time, Chicago time, so that's evening time here. So if you're interested, [29:56.400 --> 30:03.120] you can follow me on Twitch as well. And yeah, I think that's it. [30:03.120 --> 30:10.240] Thank you. Thank you.
[30:10.240 --> 30:16.400] Backup slides, in case the demo had gone disastrously. [30:16.400 --> 30:18.400] Two questions. [30:18.400 --> 30:22.400] Here's one. [30:22.400 --> 30:27.840] Yeah, a quick one. As you know, with Kafka you can have a cluster of Kafka [30:27.840 --> 30:31.280] brokers and so on. Do you have that same problem where, when one goes down, [30:31.280 --> 30:34.320] another one starts going down? ZooKeeper kind of causes that problem, [30:34.320 --> 30:38.480] because with Kafka there always seems to be a problem with the brokers speaking to each other. [30:38.480 --> 30:46.880] So, I can take that offline with you, and if folks are interested too, [30:46.880 --> 30:51.760] I actually also have another talk that goes deeper into Pulsar. [30:51.760 --> 30:55.440] Tomorrow I have a talk in the OpenJDK room as well, but that's [30:55.440 --> 31:00.960] more focused on JMS. The OpenJDK room is in building H, if you want to go to that. [31:00.960 --> 31:06.080] But as far as Pulsar is concerned, it is very cloud-native by itself. A lot of things [31:06.080 --> 31:09.840] that in fact I didn't get to talk about: it's very infrastructure-aware. [31:09.840 --> 31:13.280] So for things like offsets, which you have to worry about [31:13.280 --> 31:16.640] in Kafka, in Pulsar you don't have such a thing. [31:16.640 --> 31:20.640] And there are other things too; we just don't have enough time. [31:20.640 --> 31:24.480] I think I was told time's up, so we can take it offline. Follow me on [31:24.480 --> 31:27.920] my Discord, actually, I didn't even get to show my Discord, but follow me somewhere, [31:28.480 --> 31:32.880] Twitter, and I'll be happy to answer any questions you have. Thank you. [31:32.880 --> 31:39.680] Okay. Thank you. Thank you.