All right, it's two o'clock, so we'll get started. Hello, my name is Vojta Juranek. I work as a developer at Red Hat, and in this short talk I would like to discuss one approach to ingesting data from your databases into a machine learning pipeline in real time.

First, this is how a machine learning pipeline can look. I will be speaking about the beginning of this pipeline, basically here: how to get the data from your databases into the machine learning pipeline. It can be as complex as this, but I will simplify. Just imagine you have your application running, you insert data into one or several databases, and now you want to take advantage of some machine learning, or maybe AI, and use your data to do, for example, predictions. You want to solve the question of how to actually feed the data from your databases into a model, and especially, when new events are coming in, how to do, for example, online predictions.

It might sound simple: you just run some selects against the database. But when you have quite heavy traffic and you run those selects in a loop, you will quickly find out that you overload the database with them. And if you add more conditions, like you don't want to miss any data, you want a consistent view of your data, you don't want to slow down writes, and so on, you will shortly find out that it's a pretty hard problem.

There are several ways to tackle this problem, but the one I would suggest, in my opinion the best one, is change data capture. What does it mean? As the name suggests, change data capture is a process of observing some resource for changes; if there is any change, you extract it and propagate it further, typically into some messaging system. What does it mean in terms of databases? It typically means observing the transaction log of the database, because the transaction log is usually the source of truth for the whole database, and all the changes that happen to any data in the database are recorded there. So basically you observe the transaction log of your database, and if there is any change to data you are interested in, it can be one table, the whole database, whatever, you extract this change from the transaction log and send it into, for example, some messaging system.

It probably sounds good, but maybe you are asking yourself whether it's easy to implement. Fortunately, you don't have to solve these questions yourself. You can use Debezium, an open-source platform for change data capture. It's pretty mature and has connectors for most of the popular databases, and it currently comes in two flavors. It originated as a set of Kafka Connect source connectors: you deploy the Debezium connectors into Kafka Connect, they connect to one or several databases you are using, extract the changes, and send them to Kafka, where you can use sink connectors to do whatever you want, like, for example, updating a search index or invalidating a cache. You can, for example, also use it for replicating one database into another, but in our case you will probably want to push the data into some data lake, data warehouse, feature store, or maybe even directly into a machine learning model.
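To make the Kafka Connect flavor a bit more concrete, here is a minimal sketch of what registering a Debezium Postgres connector could look like. It is only an illustration under assumptions, not the configuration from the demo: the host names, credentials, database, table and topic prefix are placeholders, and older Debezium releases use database.server.name instead of topic.prefix.

```python
# Hypothetical sketch only: register a Debezium Postgres source connector with
# Kafka Connect through its REST API (default port 8083). Host names,
# credentials, database and table names are placeholders, not the demo values.
# Requires: pip install requests
import json

import requests

connector = {
    "name": "images-connector",  # placeholder connector name
    "config": {
        # Debezium connector class that reads the Postgres transaction log (WAL)
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",      # placeholder host
        "database.port": "5432",
        "database.user": "postgres",          # placeholder credentials
        "database.password": "postgres",
        "database.dbname": "inventory",       # placeholder database
        # Prefix for Kafka topic names; older Debezium releases call this
        # "database.server.name" instead of "topic.prefix".
        "topic.prefix": "app",
        # Capture changes only from this table
        "table.include.list": "public.images",
    },
}

# Kafka Connect creates and starts the connector in response to this POST.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```

With a setup like this, every insert into the included table shows up as an event on a topic named after the prefix and the table, for example app.public.images, which the rest of the pipeline can consume.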
If you don't want to use Kafka for whatever reason, you can use Debezium Server, which is a standalone process that basically does the same thing but allows you to push the events into whatever system you like, such as Apache Pulsar or Google Pub/Sub, and it can even be, for example, an HTTP endpoint. And if you are missing some sink for Debezium Server, it's pretty easy to implement your own. Debezium provides some other features as well: for example, it's capable of taking a snapshot of your database at any time, it can transform the records before sending them out to the messaging system, and so on.

So back to our problem. Unfortunately, this talk is too short to go through the whole example. On the FOSDEM page of this talk there is a link to our blog post where we describe in detail a use case in which you store images of handwritten digits in a database. Whenever an image is stored there, you extract the change using Debezium, send it to Kafka, and from Kafka send it to TensorFlow, which recognizes what number is written in the picture. So it's a well-known example, and everything works in real time. As I said, we also have the full example on GitHub.

What I would like to show you here is that it's really simple. It basically consists of deploying Debezium and configuring it, and it's really just one page of JSON config. Here you just provide the credentials, and the more interesting part is here, the transformations. Here is one predefined transformation, where I extract only the content of the newly inserted image. And because there is a caveat when you use TensorFlow with Kafka, it cannot interpret the raw bytes correctly, I'm transforming the image into a string, which is later parsed in TensorFlow. I would have to do that in TensorFlow anyway, so there is no overhead. But I can also define my own transformation here, and it's just a couple of lines of code that convert the image into a string.

On the TensorFlow side it's similarly easy; it's again one page. Here I define the decoding function, which decodes the string it retrieves from Kafka. Most of the code is just defining the Kafka endpoint, and it's about three lines to push the data into the model, which recognizes what number is on the image and produces a result you can consume further. So, as I said, if you are interested in it, please go to our website and GitHub and take a look.
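As a rough illustration of the consuming side just described, here is a hedged sketch of a small Kafka consumer that decodes each change event and feeds it to a digit-recognition model. The details are assumptions, not taken from the talk's repository: it presumes the Debezium transformation emits the image as a base64 string of raw 28x28 grayscale pixels, it uses the kafka-python client rather than whatever the original example uses, and digits_model.keras is a hypothetical pre-trained model file.

```python
# Hedged sketch of the consuming side, not the code from the talk's repository.
# Assumptions: the Debezium transformation emits each new image as a base64
# string of raw 28x28 grayscale pixels, the Kafka topic is the one produced by
# the connector sketched earlier, and "digits_model.keras" is a hypothetical
# pre-trained Keras model for digit recognition.
# Requires: pip install kafka-python tensorflow numpy
import base64

import numpy as np
import tensorflow as tf
from kafka import KafkaConsumer

model = tf.keras.models.load_model("digits_model.keras")  # hypothetical model file

consumer = KafkaConsumer(
    "app.public.images",                    # topic written by the Debezium connector
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: raw.decode("utf-8"),
)


def decode_image(encoded: str) -> np.ndarray:
    """Turn the string produced by the transformation back into a pixel tensor."""
    pixels = np.frombuffer(base64.b64decode(encoded), dtype=np.uint8)
    return pixels.reshape(1, 28, 28).astype("float32") / 255.0


# Each change event becomes one prediction, so newly inserted rows are scored
# in real time as they arrive from the database.
for message in consumer:
    image = decode_image(message.value)
    prediction = model.predict(image, verbose=0)
    print("recognized digit:", int(np.argmax(prediction)))
```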
And basically that's it. So to sum up, Debezium is able to take a snapshot of your database and load the existing data into a messaging system or directly into TensorFlow, and it can capture any change: once anything is stored in the database, it can immediately extract that change and send it further into your pipeline, so you can do real-time analysis of your data. And it works with many databases, so you can deploy it against several databases and do more things with it. So that's all. If it sounds interesting to you, try out Debezium, and please share your feedback with us on Zulip or the mailing list. We have a pretty large, vibrant community and we will appreciate your feedback. Thank you so much for your attention.

Thank you very much, Vojta. We have time for one question. Is this working? One question. Someone come up with one question. Come on. So if no one has a question now and anything pops up later, just catch me in the corridor or elsewhere at the conference.

Thanks for the presentation. My question is: is there any database that already provides by default what Debezium does, any change tracking?

Could you repeat? I can't hear.

Yeah, sure. Is there any data... Can you please stop moving so we can hear the question? Thanks for the presentation. My question is: is there any database that natively does what Debezium does, change tracking, already?

What do you mean, natively?

Without any external tool like Debezium, is there any database that does this already?

Well, it again boils down to what natively means, because we typically leverage some native feature of the database. For example, for Postgres we use a replication slot, and we just read the replication slot from Postgres; for MongoDB we read the change stream. So you always need something which reads from the database and translates it into something usable, something which parses, for example, the data you get from the Postgres replication slot, and so on. The database usually provides this natively, but you need something which reads it.

So yeah, that was the question actually: is there any database that does this on its own, without using Debezium? But then, I think, there is no competitor.

Pardon? Are there any competitors?

Yes, like, is any database already doing natively what Debezium does?

I'm not aware of any database that does this by itself. I know that some databases provide this, but several of them use Debezium under the hood, as far as I know.

Okay, perfect. Thank you very much, Vojta. Thank you.
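As a footnote to that answer, the native Postgres feature mentioned can be illustrated with plain logical decoding: the database records the changes, but something still has to read and parse them, which is what a CDC tool does for you. This is only a hedged sketch with placeholder connection details and slot name; Debezium itself builds on a logical decoding plugin such as pgoutput and does much more on top.

```python
# Hedged illustration (not from the talk) of Postgres logical decoding: the
# database exposes changes through a replication slot, but a client still has
# to read and parse them. Connection details and the slot name are
# placeholders; the server must run with wal_level=logical and the user needs
# replication privileges.
# Requires: pip install psycopg2-binary
import psycopg2

conn = psycopg2.connect("host=localhost dbname=inventory user=postgres password=postgres")
conn.autocommit = True

with conn.cursor() as cur:
    # Create a logical replication slot with the built-in test_decoding plugin.
    cur.execute(
        "SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding')"
    )
    # Peek at changes recorded in the WAL since the slot was created; each row
    # is a textual description of an INSERT/UPDATE/DELETE that still needs to
    # be parsed and turned into a usable event.
    cur.execute("SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL)")
    for lsn, xid, data in cur.fetchall():
        print(lsn, xid, data)
```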