Okay, it's 10:30, time for the next talk. We're going to have another talk about Apache Airflow, this time by Tatiana Al-Chueyr. She works for Astronomer and she's going to tell us about dbt, which is a tool that basically lets you write SQL and have it executed in a templated way, and how to integrate that with Apache Airflow. Thank you.

Thank you. Hi, good morning everyone. I'm really glad Jarek spoke before me, so I don't need to get into the details of what Airflow is. How many of you know what dbt is? Could you raise your hands? Amazing.

Initially I didn't have any clue what dbt was. I was working for the BBC, where we had very good software engineering principles in place. One day I went to support a SQL analyst who was analyzing the results of an A/B test, and there was a bug: the results of the A/B test were not very consistent. We were trying to figure out which machine learning models were performing better than the others. So I said, okay, let's just look at how you're doing things. I checked on his laptop, and he had a Word document with a bunch of SQL statements. The procedure he was using was: write the SQL statements in this Word document, copy and paste them into, say, Snowflake and other data warehouses, export the results to spreadsheets, and then try to join the information. The process altogether looked extremely error-prone. The principles we had elsewhere, like testing, versioning, a repository, none of that was in place. We eventually figured out the issue, but I really thought: we should be able to apply software engineering practices to any process like this, and the tools should be accessible to people of any skill level. Some time after, I came across dbt.

The idea of dbt is really to empower people who may not have software development experience to use good software development practices while they write SQL transformations. There is dbt Core, which is a quite stable project. It has around 250 contributors and 6,000 commits, and it's quite popular on GitHub, with over 6,000 stars. The focus is on transformations: you write your transformations as plain-text SQL files, it encourages you to version those in Git, and it allows you to define tests. So let's say you would like to check that certain columns never contain null values; dbt makes all of this practice easy. It's an amazing tool which has really helped improve on, and avoid, the kind of process I had seen in the past.

Then many people may ask: okay, but what is the relation between dbt and Airflow? Aren't analysts happy enough running those scripts locally? Why use both of them? Jarek already explained what Airflow is, but in short it's a very mature orchestrator, which lets you run things in parallel and gives you lots of flexibility in where you actually run things. So each tool has its trade-offs.

On one hand, we have Airflow, where you write pipelines in Python. It is flexible: as Jarek showed, there are hundreds of operators from multiple providers to integrate with countless data warehouses, tools, LLM services and so on. But it's quite complex. The interface can be a bit overwhelming, with lots of colors and lots of boxes, and it can be hard to troubleshoot and to understand what is going on. If you want to run Airflow locally, there is Airflow standalone, but you will still be running a web server, a scheduler and eventually a worker node. So there is complexity to it.
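As a point of reference for the comparison that follows: an Airflow pipeline is just a Python file. The sketch below is a minimal, hypothetical two-task example; the DAG id, task ids and commands are made up for illustration, and `schedule=` assumes Airflow 2.4+ (older versions use `schedule_interval=`).

```python
# Minimal, hypothetical Airflow DAG: two Bash tasks that run in sequence.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",         # made-up name for illustration
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # run once a day (Airflow 2.4+ syntax)
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pretend we extract data here'",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'pretend we transform data here'",
    )

    extract >> transform  # "transform" runs only after "extract" succeeds
```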
On the other hand, with dbt, people can write transformation workflows using just SQL. It is quite specialized for SQL data warehouses, but it has a very simple interface. dbt is quite good at letting you specify tests in a simple way, and it is very good at dependency management: you can use Jinja templates to reference tables created by other models in dbt, directly in the SQL files. It is quite easy to define schemas. But it only focuses on transformations, the T side of ETL. Airflow, on the other hand, gives you the flexibility to do anything you can do with Python, which is a lot, but it is quite complex to run. You could achieve anything you do with dbt using Airflow alone, but you may have to write more code. So I don't think we need to compare them. What many companies decide to do is simply to use both tools, and what this presentation is about is how we can use these tools together in an efficient way.

This is what a dbt pipeline looks like. On one side you have several files; inside the models directory, each file represents a table and contains a transformation. dbt then allows you to render the pipeline. So let's say you have some raw data: customers, orders, payments. You transform those, and then you aggregate them to be able to generate reports and send them to Tableau or something similar. This is what a dbt project looks like. Similar to Airflow, since dbt interacts with databases, it has a way of defining what those connections look like and which secrets and credentials to use. That is done via dbt profiles, defined in YAML files.

Then the question is: okay, there is similarity, dbt and Airflow can both generate DAGs and let you create workflows, but how do we connect them? So we thought, what are the options, as if we had a translation tool to convert from one to the other? If you check the dbt documentation, what it says is: if you're using dbt Cloud, which is the managed version of dbt, you can just deploy things there, and there is an official Airflow provider to trigger jobs there. But they recently changed the pricing model, and if you have a team, you'll probably be spending proportionally to the number of developers on your team, so it can get quite expensive. Another strategy dbt suggests is to use the Airflow Bash operator, which allows you to run any Bash command; the same way you run dbt commands on the command line, you can trigger them from Airflow. Or you could use the Kubernetes Pod operator, delegating from the Airflow worker to a Kubernetes pod to run those commands.

This is an example of what the DAG would look like if you were using dbt Cloud. In this case, you just have one operator to declare the beginning and one for the end. This is an old pattern many people use; you can check the code in the link below. With the recent setup and teardown features of Airflow, you don't really need those dummy operators any more. But anyway, in this case, the DAG simply triggers the job to run in dbt Cloud. The challenge with this is: can anyone tell, by looking at this pipeline, what the actual models and transformations are? Let's say you had a project with a thousand models: how could you spot where the problem was? Worse than that, let's say you have a pipeline that takes 12 hours to run, and one of those tasks, which takes five hours, passed, but a few others didn't. How would you re-trigger only the jobs after the failed one?
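As a hedged illustration of that dbt Cloud pattern, the sketch below uses the official dbt Cloud provider's DbtCloudRunJobOperator. The connection id, job id and DAG name are placeholders, and parameter defaults may vary between provider versions; the point is that the whole dbt project runs as a single opaque task, which is exactly the visibility problem just described.

```python
# Sketch of triggering a dbt Cloud job from Airflow
# (requires the apache-airflow-providers-dbt-cloud package).
from datetime import datetime

from airflow import DAG
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

with DAG(
    dag_id="dbt_cloud_example",                 # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    trigger_dbt_cloud_job = DbtCloudRunJobOperator(
        task_id="trigger_dbt_cloud_job",
        dbt_cloud_conn_id="dbt_cloud_default",  # Airflow connection to dbt Cloud
        job_id=12345,                           # placeholder dbt Cloud job id
        check_interval=60,                      # poll the job status every minute
        timeout=60 * 60 * 12,                   # give up after 12 hours
    )
```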
So this approach is quite limited and doesn't give much visibility, from Airflow, of what is going on in dbt.

Another approach people use is to trigger dbt commands via Bash operators. In this case, you define, say, a task for dbt seed, where you load CSV files into the database, then a task to run the transformations, and then a task to run the dbt tests. Again, you are grouping the dbt nodes by type and running one Airflow operator, one Airflow task, for each of those steps. You don't have much visibility and control of what is actually going on, but it can do the job. Another approach many people in the industry have taken is: I will write my own parser of dbt projects and render the result somehow in Airflow, so that we can parse the nodes of the original graph and render them in Airflow with some granularity over the process. There is code for all of these examples I'm sharing.

Each of these approaches has its own pros and cons. On one hand, with the first two approaches the DAG is quite trivial to parse, so they are cheap to process every time Airflow parses DAGs and triggers tasks. On the other hand, they can be harder to troubleshoot and to retry. In the last case, you can trigger independent tasks, which is quite powerful.

We've seen many people in the community implementing their own solutions for this conversion, expanding the dbt DAG into an Airflow DAG. During the Airflow Summit last year, dbt was one of the most discussed topics, and there were several approaches. Some people use the dbt manifest, which is an artifact that represents the topology of the dbt DAG. Some people use dynamic task mapping, which Jarek also showed in his presentation, where you can parallelize a sort of map-reduce approach within Airflow. Some people just generate a static DAG and don't actually use dbt to run the transformations: they convert the models into native Airflow operators, which can run asynchronously, so the Airflow worker node doesn't need to be blocked while the transformations execute. And many people decide to delegate the jobs, so the Airflow worker node isn't necessarily running the dbt commands or the transformations itself, but delegates them to Kubernetes. So those are some of the approaches.

Then at Astronomer, since we have several Airflow users and customers trying to do this integration, some team members developed Astronomer Cosmos during a hackathon, a tool to help with this conversion. The idea was really to have a sort of Rosetta Stone which could help and simplify everyone's lives. It is under the Apache license, so it's open source.

This is how Cosmos translates the DAG. You can see the original dbt project and how it looks in Airflow. You really have a one-to-one mapping of the DAG, and the names are quite close to the original names as well. And this is the DAG code: that's all someone with a dbt project would have to write to have their project fully translated into Airflow. So it's quite simple. There is some configuration: the first lines relate to the ProfileConfig. They are optional. It's a feature of Cosmos that allows you to define your database credentials only once, as Airflow connections, and then we convert those into a dbt profile. So you don't have to define things in both places.
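A rough sketch of what that looks like, assuming Cosmos's DbtDag / ProjectConfig / ProfileConfig API and a Postgres profile mapping; the paths, connection ids and names below are placeholders, and argument names can differ between Cosmos versions.

```python
# Sketch of a Cosmos DbtDag that expands a dbt project into one Airflow task
# (or task group) per dbt model, reusing an Airflow connection as the dbt profile.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="my_profile",        # placeholder dbt profile name
    target_name="dev",
    # Build the dbt profile from an existing Airflow connection,
    # so credentials are defined only once, in Airflow.
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="postgres_default",              # placeholder Airflow connection id
        profile_args={"schema": "public"},
    ),
)

my_dbt_dag = DbtDag(
    dag_id="jaffle_shop_dag",                    # placeholder name
    project_config=ProjectConfig(
        "/usr/local/airflow/dags/dbt/jaffle_shop"  # placeholder path to the dbt project
    ),
    profile_config=profile_config,
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```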
But if you prefer, you can always provide your own profiles.yml, and that's it. So the code of your DAG would be pretty much those few lines. It uses task groups, which were also mentioned earlier. We allow several ways of running tests, and one of them is to group them with the model. Usually you define tests per model, and you can have task groups containing both the execution of the model and then its tests, because we assume, depending on your configuration, that if the tests for a model don't pass, it wouldn't make sense to continue processing the next transformations.

Then there is a demo. Let's see. Okay, we're okay on time. This is the Airflow UI; I'm running this Airflow instance locally. We have a few DAGs, some of them passed, some failed. And this is an example of what a dbt DAG converted with Cosmos looks like, with the task groups and the tasks. You can see the code for it here; it's super simple. And then, like I said, we would like to trigger this workflow so we can run all these things. We can click here on the trigger button; you could use the API as well, or the command line. Then you can see the tasks being scheduled. Since it's on my local computer with a single worker, it's not particularly quick, but there you are. You can see the tasks running and executing. The ones that are gray are waiting to be scheduled, the green ones succeeded, this one is queued, these are running. And there you are. One of the things Cosmos allows is for you to check, for a given task, what was the actual SQL statement that was executed. You can see the rendered template and understand: oh, this was the transformation. That can help with troubleshooting. So that's the demo.

Some of the key features of Cosmos. It easily allows you to bring dbt projects into Airflow: people who are used to writing their workflows in SQL can keep writing them in SQL. You can render the project either as a task group or as a DAG, depending on the granularity you want. It gives you the flexibility to run tasks in Kubernetes, in Docker, or in the Airflow worker node itself, and recently there is even a PR to delegate to, I think, an Azure container service. So you can plug in your own way of executing the dbt tasks with Cosmos. You can override how we convert the different dbt node types into Airflow tasks, and there are a few different ways we support parsing the dbt project into Airflow as well.

You can also use Datasets. A while ago, Airflow introduced Datasets for data-aware scheduling. In the past, you could only schedule pipelines in Airflow using cron expressions, presets like daily, and so on. The cool thing with data-aware scheduling is: let's say you have a machine learning pipeline where you're processing some video metadata in one part and user activity in another, and you would like to aggregate those two pipelines to be able to train your model, fine-tune it and so on. Since Airflow introduced data-aware scheduling, your DAGs can declare as outputs that those datasets are ready, say the program metadata was processed, and then another DAG can be triggered when that data becomes available. And with Cosmos, you also get this.
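As a hedged sketch of the data-aware scheduling idea just described: the dataset URI and DAG/task names below are made up, and while Cosmos can also emit datasets for dbt models, the exact URIs it uses depend on the Cosmos version.

```python
# Sketch of Airflow data-aware scheduling (Airflow 2.4+):
# one DAG declares a Dataset as its output, another DAG is triggered
# whenever that Dataset is updated, with no time-based schedule involved.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

program_metadata = Dataset("postgres://analytics/public/program_metadata")  # placeholder URI

with DAG(
    dag_id="process_program_metadata",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
):
    BashOperator(
        task_id="transform_metadata",
        bash_command="echo 'pretend we transform program metadata here'",
        outlets=[program_metadata],   # marks the dataset as updated on success
    )

with DAG(
    dag_id="train_model",
    start_date=datetime(2024, 1, 1),
    schedule=[program_metadata],      # runs whenever the dataset above is updated
    catchup=False,
):
    BashOperator(
        task_id="train",
        bash_command="echo 'pretend we train the model here'",
    )
```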
So you can convert dbt DAGs and make sure that after a transformation is executed, other pipelines that depend on that transformation's output are triggered, without having to rely on a time-based schedule. And since Airflow and dbt Core are open source, you can have as many developers as you want working with this, without paying proportionally to the number of developers. And we have a growing, active open source community.

Here are some more details on how you can configure the types of operators used within Cosmos: the local Python operator, the virtualenv operator, the Docker operator and the Kubernetes operator. Initially, since both tools are written in Python, our first strategy was: why don't we use dbt's own code to do this parsing and conversion of the dbt project? But then we realized there are many dependency conflicts between versions of Airflow and dbt, so we ended up not using dbt as a library, and used other approaches for the parsing. One approach is to use the manifest file, which is one of the artifacts dbt outputs; assuming you have a CI/CD pipeline, you can just generate it there, and you have the DAG topology in a JSON file. However, this method doesn't give you all the selecting and excluding flexibility dbt offers when you want to run just a subset of nodes. Then we implemented support for dbt ls, which is a way of listing selected nodes using dbt itself, but its performance isn't particularly good, so we implemented a version with caching. We also have our own custom parser, and by default we try to automatically pick a way of parsing the dbt project based on the user's configuration. We support select and exclude, and there are several different ways of rendering the test nodes as well. We allow users to convert Airflow connections into dbt profiles, or to provide their own profiles.yml. And we try to give as much information as possible for users to troubleshoot dbt pipelines within Airflow itself. (A rough sketch of how these options fit together is included at the end of this section.)

As for adoption: in the last month we had around 242,000 downloads and a growing number of stars on GitHub. Some of the next steps are exposing the dbt documentation from the project within Airflow as well, improving performance, improving the OpenLineage integration, and a few other things. We also noticed many people keep their dbt projects in a different repository than their Airflow one, so we're looking into ways of optimizing the synchronization between them. And one user asked for dbt Cloud integration, so that's something that may come. This is the result of a PR from the community where we render the dbt documentation within Airflow using Cosmos.

We don't have as many committers or contributions as Airflow, but I think we're in quite a good spot. As an example, in November we had 20 authors contributing, and only three of those were from Astronomer. We have a growing number of contributors and we're promoting community members to committers. We know there's a lot of work to be done and we really appreciate the community's support. For discussions, we currently use the Airflow dbt Slack channel in the Airflow Slack workspace, and we have lots of daily interactions. The community keeps growing and supporting each other, which is super exciting to see. There are a few references there; the slides are on the FOSDEM website, so you can just click through if you'd like to see more information, more detailed material and examples of how to run things. And that's it.
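As referenced above, here is a rough sketch of how these parsing and execution options might be wired together, assuming Cosmos's RenderConfig / ExecutionConfig API. The enum values shown (LoadMode, ExecutionMode, TestBehavior) exist in recent Cosmos releases, but names and defaults can differ between versions, and the paths and selectors are placeholders.

```python
# Sketch: configuring how Cosmos parses the dbt project (manifest vs dbt ls vs
# custom parser, plus select/exclude) and where the dbt commands actually run.
from datetime import datetime

from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig, RenderConfig
from cosmos.constants import ExecutionMode, LoadMode, TestBehavior

dag = DbtDag(
    dag_id="selective_dbt_dag",                                    # placeholder name
    project_config=ProjectConfig(
        "/usr/local/airflow/dags/dbt/jaffle_shop"                  # placeholder path
    ),
    profile_config=ProfileConfig(
        profile_name="my_profile",
        target_name="dev",
        # Bring-your-own profiles.yml instead of mapping an Airflow connection.
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/profiles.yml",
    ),
    render_config=RenderConfig(
        load_method=LoadMode.DBT_LS,                 # or DBT_MANIFEST / CUSTOM / AUTOMATIC
        select=["tag:daily"],                        # placeholder dbt selectors
        exclude=["tag:deprecated"],
        test_behavior=TestBehavior.AFTER_EACH,       # group each model with its tests
    ),
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.KUBERNETES,     # or LOCAL / VIRTUALENV / DOCKER
        # Kubernetes mode typically also needs container image details
        # passed via the DbtDag's operator_args; omitted in this sketch.
    ),
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```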
Thank you very much, and I think we have four minutes for questions. Thank you very much, that was a very interesting talk and it looks like a very good project. Any questions? No questions? Then I have one question. Are you only doing the integration from dbt into Airflow, or also from Airflow back into dbt?

At the moment we're only doing dbt into Airflow. The tricky thing is that the features dbt offers are a subset of what Airflow offers, so the conversion in the other direction may not be feasible at all, depending on which operators and tasks you define in Airflow. And also, Astronomer has a managed Airflow, right? So our interest is in bringing people into Airflow, not necessarily sending them away from Airflow.

And maybe a follow-up question to that: is it possible to continuously migrate from dbt to Airflow, so that people can continue working in dbt and you automatically get the changes into Airflow?

Yes. The current version of Cosmos expects the dbt files to be available to Airflow somehow, but this can be done in multiple ways. If you're deploying Airflow using a Docker container, you can make sure that, as part of your CI/CD, you fetch the project during the image build. We also saw, I think it was British Airways who are using this, that they uploaded the dbt project into GCS, and the first step of their DAG was to download those files and then use them. Some people may want to have an NFS share or a volume mounted with the dbt project and make sure it is synchronized with the latest version. So there are several ways, and depending on the parsing mode you use with Cosmos, those changes will be picked up in real time.

Okay, excellent. More questions? We have time for maybe one or two more. No? Well then, thank you very much again. We have a short break now.