We're trying to get some fresh air into the room as well; we have to be creative. Okay, we'll get started with the next speaker now. Antoni is going to talk about productionizing Jupyter notebooks. Please welcome the speaker.

Hello, nice to meet you everyone. My name is Antoni. I'm a software engineer, and I've been working on data engineering and data processing projects for the past 10 years at VMware. I'm glad that VMware decided to invest in open sourcing Versatile Data Kit, and for the past three years I've been focusing on maintaining and developing open source projects. Today I'm going to talk about the challenges in productionizing Jupyter notebooks and show some possible solutions to those challenges using Versatile Data Kit. So let's get started.

If you have been using Jupyter notebooks, can you raise your hand? That's about 20%. So, Jupyter notebooks are a really versatile tool. They help a lot with experiments, exploratory data analysis, POCs, things like that. With Jupyter you can do a lot, because it allows you to mix documentation in Markdown with visualizations, and also code in different languages. Still, there's quite a bit of a struggle: you don't really deploy your notebooks directly to production. You can correct me, but in my experience you'll most likely be redoing the work as Python scripts or in some other type of application framework, not deploying the notebook directly. That is often double work, because you do experiments in notebooks and then you do the actual productionizing separately.

The other tool I'm going to talk about is Versatile Data Kit, so let me quickly introduce it. Versatile Data Kit is an ETL framework, or ELT framework, depending on how you want to use it. It provides an SDK that lets you write steps, and in those steps you can ingest data or process data; there are abstractions that make this a lot easier. It's just Python, so you can install it with pip. Separately, there is a control plane with an operations UI, the optional server part, which can be installed on top of Kubernetes.

So let's now dive into Jupyter and the challenges of productionizing. I've listed five challenges here. That's not an exhaustive list, but they are some of the most prevalent ones in productionizing and using notebooks in production. I'm going to go over each one: I'll explain how I understand the challenge, and then I'll show what a possible solution to it is.

Let's start with the first one: reproducibility. In this example, say you're developing a notebook with three cells. In the first cell you set some variable to zero, then you increment it, and then you print it. What result would you expect in the third cell? I imagine one, and that's what you would get if this notebook were run in production in an automated, self-sufficient way. But that's not necessarily the result you get during development. It's quite possible that the user executes the second cell twice; in that case, the result would be two. It's also possible to change the second cell after you've executed it.
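To make that concrete, here is a minimal sketch in plain Python, simulating the cell contents from the slide and what out-of-order execution does to them:

```python
# Cell 1: initialize
counter = 0

# Cell 2: increment
counter = counter + 1

# Cell 3: print the result
print(counter)  # a clean top-to-bottom run prints 1

# But the kernel keeps state between executions. If the user
# re-runs Cell 2 once more before running Cell 3...
counter = counter + 1

print(counter)  # ...Cell 3 now prints 2, diverging from what an
                # automated top-to-bottom production run would produce
```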
In that case, if you deploy this job to run automatically in production, the result wouldn't be one, as it currently appears. It's also possible to remove the second cell; but because notebooks keep state, the variable is still one, and every cell after it assumes it's one, even though in a scheduled, automated run that would no longer be so. This means notebooks are not really reproducible, and that during development you can end up in a state that diverges from the one you'll have in production. That is clear trouble, because things may seem to work locally, and then when you deploy your job or notebook to production, all of a sudden things break.

So what can we do? One thing we've thought of with VDK is this: say we have this kind of notebook, and we want a predictable order in which execution happens. In VDK, you can mark certain cells as production-ready, and they are executed in a fixed order, which is top-down. We've created a visual way to show that the first cell is executed first, the second one second, the third one third. If the current execution state, shown on the left side, differs in order from the expected one, we can issue warnings. This also lets you smoke-test your notebooks before deployment: it shows exactly the order in which you'd expect the cells to be executed. There may eventually be different ways to do this, but for now it's done by setting a tag called "vdk" on each cell that you want to run in production; the order is always top-down. This helps solve part of the problem by providing a deterministic execution order for the cells, it can detect divergence, and, as we'll see later, it makes testing easier.

The second challenge I want to talk about is code organization. In a notebook, you can expect to have quite a bit of irrelevant or debugging code. That might be useful during development, but it's not something you want to run in production in an automated, scheduled manner. Take this very simple example: the first two cells import pandas and read a CSV, and the third visualizes it. We can say the first two are most likely relevant to your production workflow, while the visualization is something you want during development, or if you want to share the notebook with a colleague. This, again, can be helped with VDK tags. You tag only the cells that are relevant and that you think will need to be deployed and run in a scheduled manner in a production system; all the other cells are completely ignored when the notebook is executed as a job. In this example, the cells on the right side, covered in blue, would be executed in production, while the debugging code that simply inspects the data frame or visualizes it is skipped.
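As a sketch of what that tagging amounts to (the tag name "vdk" is taken from the talk; the file name is illustrative), here is how you could mark a cell programmatically with the standard nbformat library rather than through the JupyterLab UI:

```python
import nbformat

# Load the notebook (nbformat is the standard library for .ipynb files)
nb = nbformat.read("ingest.ipynb", as_version=4)

# Mark the first code cell as production-ready by adding the "vdk" tag
# to its metadata; untagged cells stay available for development but
# would be skipped when the job runs in production.
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.metadata.setdefault("tags", []).append("vdk")
        break

nbformat.write(nb, "ingest.ipynb")
```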
The third challenge I'm going to talk about is the execution model. Notebooks generally have a much more complex execution model. That's necessary because of the way a notebook needs to keep state and let you use multiple languages, but that kind of execution model is really bad for automation, or for using notebooks as part of a workflow. It adds a lot of extra machinery: to execute your Python code, you need to go through a notebook server to the IPython kernel, if it's Python, and so on.

Usually, the way you want things in production is much simpler. You have a Python script and it executes on top of Python, or a SQL script and it executes on top of some SQL engine, and that's it; you don't have a lot in the middle. With VDK, we can extract exactly those Python pieces and construct your Python script, or the SQL pieces, and that's what VDK does: when it executes a notebook, it extracts the Python and the SQL parts and executes them directly. This enables things like reusing another notebook as a template, almost as a function. Say we have some kind of job or Python script, and we execute another notebook, a processing notebook, almost as a function, with arguments and so on. You can also execute it within a workflow, and you can run automated tests, which I'll show how to do in a little bit.

That brings us to the fourth challenge I want to talk about: automated testing, as in CI/CD. There is no doubt that automated testing, and having a CI/CD pipeline overall, is the cornerstone of reliable software nowadays. It is really vital if you want to push code to production and make sure that any changes don't break anything and things work as expected. But Jupyter notebooks do not easily lend themselves to traditional testing paradigms. There have been some attempts to solve this, and it has been quite a challenge; with VDK, we are attempting it as well.

One of the things you can do, because of the deterministic order that VDK tagging provides, the fact that you can mark which cells need to be executed, and the fact that VDK skips the kernel and the extra layers of the execution model, is to use the command vdk run, which is provided inside a Jupyter notebook by the VDK plugin. It executes the job exactly as it is supposed to be executed in production: each tagged cell, one by one, in the order you expect. You can, of course, also do this from the command-line interface, using the CLI. Beyond that, if you end up using the control plane, the part with which you deploy jobs to production, its integration with the notebook UI makes sure that when you create a new deployment, you are prompted to run an end-to-end smoke test of the data job, as it's called: the notebook files are executed to make sure they run correctly.

Finally, because a notebook can now be used practically as a function, you can test it with pytest. You can write pytest tests, and there are VDK test utilities that help with that, in which you specify your dependencies through plugins. For example, you can specify dependencies like SQL databases or HTTP servers, and mock them: say, using pytest-httpserver if you want to mock an HTTP API. You then verify the results after the notebook has executed. The link over here shows different cases of using pytest with notebooks. This is pretty powerful, because it allows you to run automated tests over all the notebook code that you want to productionize.
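As an illustration of that idea, here is a sketch of such a pytest test. It mocks an HTTP API with pytest-httpserver and then runs the job the way the talk describes, via the vdk run CLI; the job directory name and the API_URL and OUTPUT_DIR environment variables the notebook is assumed to read are hypothetical:

```python
import os
import subprocess

# Requires: pip install pytest pytest-httpserver
# pytest-httpserver provides the `httpserver` fixture used below.


def test_ingest_job(httpserver, tmp_path):
    # Mock the HTTP API the notebook ingests from
    httpserver.expect_request("/users").respond_with_json(
        [{"id": 1, "name": "alice"}]
    )

    # Point the job at the mock server and a temporary output location.
    # API_URL and OUTPUT_DIR are assumed names for this sketch; a real
    # job would define its own configuration.
    env = dict(os.environ)
    env["API_URL"] = httpserver.url_for("/users")
    env["OUTPUT_DIR"] = str(tmp_path)

    # Execute the tagged notebook cells top-down, as in production
    result = subprocess.run(
        ["vdk", "run", "ingest-job"],
        env=env, capture_output=True, text=True,
    )
    assert result.returncode == 0, result.stderr

    # Verify the output the notebook is expected to produce
    assert (tmp_path / "users.csv").exists()
```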
Finally, the fifth challenge: another potential issue with notebooks is version control. A notebook file is this kind of JSON structure, and when you put it in version control it contains all kinds of extra fields, including outputs and so on, which you generally want to clean. This is something Versatile Data Kit handles on deployment: when you deploy a data job to a managed environment, it can strip all the unnecessary parts. Instead of this kind of diff, where the only really relevant information is the source, the last three lines, and where there are actually no real changes despite how it appears, you can have this other diff, which shows just the run(job_input) code.

Those are the five challenges and potential solutions I wanted to share with you. There's a self-paced tutorial showing some of the things I've presented and how they can be used, so you can try it out yourselves. Overall, if you'd like to discuss whether these challenges, reproducibility, code organization, the execution model, automated testing, are really relevant for your use of notebooks, whether the solutions make sense, or whether you see other important challenges, I urge you to contact me; I'll be happy to talk about it. You can do this through LinkedIn. If you want, I'd appreciate it if you take the survey, which simply asks what you liked, whether you have any comments or other issues, and lets you leave contact details if you want to talk more. Or you can just contact me directly through LinkedIn. And yeah, that's everything I wanted to share. Thank you very much for listening to me today.

Thank you. This is going to be interesting. Okay, we have time for a couple of questions. Do we have any questions? If you can, please wait until the Q&A is done; that will be helpful. Yeah, there's a question there. Let me run the microphone around.

I was wondering, if I want to deploy VDK, does that replace my JupyterLab, my Jupyter notebooks? Does VDK replace my existing Jupyter notebook server, or is it an add-on that goes with it?

So, does VDK replace the existing Jupyter server, is the question? No. It's actually a plugin, both an IPython plugin and a JupyterLab plugin, that provides this functionality. You'll be running your notebooks, or the VDK data jobs, as they're called, which are directories of notebook files or other scripts, with VDK, using either vdk run, as I showed, or the UI, or you can just run them as a notebook. The plugin provides some extra variables on top, and there's also a programmatic way to run them in Python. But it doesn't really replace the Jupyter server; it coexists on top of it.

Thank you very much, it was very interesting. Say I want to export my Python, for example respecting the ordering rules in VDK. If I wanted to export that from Jupyter, will that impact the Python that's produced by doing a pure .py export?

So, you want to export the Python that's marked with the vdk tag, for example, into a script? Yes, it's possible. Assuming the script does what vdk run does, it's supposed to run in almost the same way as a plain Python script. VDK provides some extra libraries, like job_input; if you're using those, you might need to initialize them yourself, but other than that, there's no reason for it not to work.
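Since that question came up, here is a rough sketch of what such an export could look like, concatenating only the cells tagged "vdk" into a plain .py file with nbformat. This is an illustration, not a built-in VDK command, and the file names are hypothetical:

```python
import nbformat

# Read the notebook and keep only code cells carrying the "vdk" tag,
# preserving their top-down order.
nb = nbformat.read("process.ipynb", as_version=4)
production_cells = [
    cell.source
    for cell in nb.cells
    if cell.cell_type == "code" and "vdk" in cell.metadata.get("tags", [])
]

# Write them out as one plain Python script. As noted in the answer
# above, if the cells use VDK-provided objects such as job_input, the
# exported script would need to initialize those itself.
with open("process.py", "w") as f:
    f.write("\n\n".join(production_cells))
```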
Hi, thanks for the nice presentation. My question is: what was the original requirement that made you need to productionize Jupyter notebooks? What was the purpose of productionizing them instead of using regular Python scripts?

So, what is the purpose of productionizing notebooks instead of using Python scripts? Well, the idea is to prevent double work, so that you can reuse the same environment you're developing in, without needing to redo the same things as Python scripts in a separate environment. It also makes it easier for people to productionize things without needing to know a lot of Python internals and software engineering practices. There is a point, I think, where this might break down: if you have a very complex application, you probably still want to switch to using an IDE like IntelliJ, plain Python, and some kind of framework. But up to that point, it should be much easier for other people, I hope.

Thanks for the talk, first of all. I wanted to ask a couple of things regarding dependencies. You know how Jupyter notebooks typically don't have the versions of specific dependencies stated at the top; they just import specific packages here and there, and don't do anything about pip-installing them unless you specifically put a shell command at the top. So how does VDK process those dependencies? Does it automatically interpret them, or do we still need a separate requirements file, or a pyproject, or something else that specifies those dependencies?

So, how are dependencies specified in VDK? A basic VDK data job is a directory with a couple of special files. You can have Python files, SQL files, and notebook files like this one, but you also have a requirements.txt file where you specify your dependencies. You either install them locally, or, when the job is deployed, they are automatically installed in the environment. That's how it's handled.

Okay, thank you very much. Thank you.
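To make that last answer concrete, a minimal data job directory along the lines described might look like this; the file names are illustrative, and requirements.txt is the piece that declares the dependencies:

```
ingest-job/
├── 10_ingest.ipynb      # notebook step with vdk-tagged cells
├── 20_transform.sql     # plain SQL step
└── requirements.txt     # e.g. pandas==2.1.0, requests>=2.31
```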