Okay, it's 10 a.m., so we can start with the next talk. The next talk is going to be about Apache Airflow. We have Jarek Potiuk to tell us about the new features in Apache Airflow 2.0. Jarek is a PMC member at the Apache Software Foundation. He's working on this project and he's going to tell us all the details. It's probably going to be a very interesting talk. Thank you.

Thank you for the introduction. Hello everyone. So I'm going to talk about workflows. Who here knows about Airflow? Quite a lot of people. Who uses Airflow on a daily basis? A lot of people. Good. So my talk will mostly present Airflow, what it does, and what the new Airflow, the new workflow orchestrator, which is Airflow 2.8 right now, provides to you as users and as people who want to write workflows in Python. That's why we are on the Python track. It provides you with a very modern way of interacting with the modern data stack, processing your data for all the different kinds of users, including new users working with LLMs, models, and artificial intelligence.

But first, a few words about Airflow. I always refer to my past: I was a choir singer, so I know a lot about music, orchestras and choirs. If you imagine what Airflow does, because lots of people ask what Airflow does: Airflow doesn't do much on its own, because Airflow is mostly a conductor, an orchestrator, something that tells others what to do. As you know, in modern data processing workflows you usually use a lot of different data processing engines, so to speak, and you need something that actually tells them what to do, when, and how to pass data between those different processors of different kinds. Airflow does exactly this. On its own, Airflow doesn't do much; it just says: you do this, you do that, and send that data here and there. That's basically what Airflow does.

And the main thing Airflow does, and this is why we are here on the Python track, is that it allows you to define a DAG, a directed acyclic graph of tasks processing the data and the dependencies between them, in Python code. This is very different from many other orchestrators, where you usually define workflows in YAML or in some declarative way of declaring tasks and dependencies. In Airflow, everything is Python. From beginning to end you do everything in Python, including extending Airflow itself, which I'm going to talk about a bit later. Just a few examples of how it works: you can define a DAG with a decorator, in a nice Pythonic way. You can also use the classic way, which I will show in a moment. Then you can define a task, and finally you can use the task that you defined, link tasks together and declare the dependencies between them. Once you define them, Airflow does everything to schedule the tasks. As you can see at the top, there is a schedule on which the DAG of task dependencies is executed. So once you define it in Python, Airflow does everything for you and executes the tasks that you defined, nicely, in Python.
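To give a feel for what that looks like in code, here is a minimal sketch of a decorator-based DAG with one classic operator mixed in. The DAG name, task names, schedule and bash command are invented for illustration; this is a sketch of the pattern, not a snippet from the talk.

```python
from __future__ import annotations

import pendulum

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator


# A DAG defined with the @dag decorator: a directed acyclic graph of tasks
# plus a schedule on which Airflow will run it.
@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract() -> dict:
        # Pretend we pulled something from an external system.
        return {"rows": 42}

    @task
    def load(payload: dict) -> None:
        print(f"loaded {payload['rows']} rows")

    # A "classic" operator can be mixed freely with decorated tasks.
    notify = BashOperator(task_id="notify", bash_command="echo done")

    # Calling the decorated functions wires up the data dependency;
    # >> adds an ordering dependency to the classic operator.
    load(extract()) >> notify


example_pipeline()
```

Once a file like this lands in the DAGs folder, the scheduler picks it up and runs it on the declared schedule, and the UI shows the graph and its runs.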
There are two ways of defining DAGs. There is the classic way, where you define operators and use them, and we have about three or four thousand different kinds of operators to talk to all the different kinds of external data processing engines, which can be databases, cloud services, or pretty much everything you can imagine. I will show the list later; it's pretty impressive. Or you can define the task in a more Pythonic way, where you just decorate a Python function as a task and that function gets executed.

That doesn't seem like much, and probably you thought, like myself — I was developing different kinds of workflows and tools to run workflows over my whole career — that defining these kinds of workflows and making them easy to run is all the work, or at least a lot of it. Airflow is one of the most popular orchestrators out there because it does exactly that. I think it hit the right sweet spot: how workflows should be defined in Python, but also nicely executed and managed. So you can be a Python developer and develop your tasks, and on the other hand you immediately give whoever operates Airflow a very nice UI and a way to manage all those workflows you defined.

I will show a few things that many people who already use Airflow might not even know are possible. These are things that appeared in the last few versions of Airflow. For example, task groups. A task group allows you to group tasks together, execute them together, even rerun them together, or make a dependency on the whole group, so something downstream waits until the whole group finishes. That's a nice, relatively new feature that we have and that we are using in Airflow. Airflow also has this very nice feature of being able to dynamically create many instances of the same task, so you get a kind of map-reduce workflow: you expand the task to execute hundreds or thousands of copies and then gather the results. The typical map-reduce case, and it is very useful in a number of situations. As of recently you can also expand task groups in a very similar way, so you can run multiple groups of tasks in parallel. Very nice, because you can very easily parallelize complex workflows — because, by the way, Airflow not only lets you define the tasks in one place, it can also distribute the whole execution of those tasks over a fleet of computers, with Kubernetes, with Celery, with different mechanisms. We will not talk about that, there is no time to go into details, but this allows you to parallelize very complex task workflows pretty massively.

So that's how dynamic task mapping looks: you take a task, you expand it over an array of values, and then you have multiple task instances running. We have a very nice UI where you can browse through the list of those instances; I just wanted to show it for reference. A very recent feature is setup and teardown. When you have a task that requires complex infrastructure to be set up and torn down at the end, we have a whole mechanism to manage the edge cases that come up when one of the tasks in between fails, and when there are multiple tasks between the setup and the teardown. This is now handled nicely in Airflow.
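To make the task-group and dynamic-mapping part concrete, here is a minimal sketch using the TaskFlow API. The task names and the numbers being processed are invented; treat it as an illustration of the pattern under those assumptions.

```python
from __future__ import annotations

import pendulum

from airflow.decorators import dag, task, task_group


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def mapping_example():
    @task
    def make_batches() -> list[int]:
        # In a real DAG this list would come from an upstream system.
        return list(range(10))

    @task
    def process(batch: int) -> int:
        return batch * 2

    @task
    def summarize(results: list[int]) -> None:
        print(f"total = {sum(results)}")

    # Dynamic task mapping: one `process` task instance per batch,
    # fanned out at runtime (the map-reduce style described above).
    results = process.expand(batch=make_batches())
    summarize(results)

    # Task groups can be expanded the same way in recent Airflow versions,
    # running whole groups of tasks in parallel.
    @task_group
    def per_batch_group(batch: int):
        process(batch)

    per_batch_group.expand(batch=[1, 2, 3])


mapping_example()
```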
A very recent addition, in the latest version of Airflow, is integration with fsspec. Who knows fsspec? Who uses fsspec? A few people, but you should if you haven't used it, because it allows you to access object storage from different storage providers. It allows you to access it the same way you would access a local file system: using pathlib-style paths, the slash operator, and open(). It also lets you integrate with many existing tools that already use fsspec, like pandas, Polars, Spark, DuckDB, Iceberg and PyArrow — those are just a few that already support fsspec. This means you can very easily plug a file that you define on object storage straight into, say, a pandas DataFrame read, and it will just read it. And it integrates very nicely with the Airflow way of managing credentials to access that data. This is a really recent feature that we have added; I will show a small sketch of it in a moment.

Now, this is also connected with the data-aware scheduling that we have. Airflow in its original version was only task-based, but right now you can also define datasets which link tasks together: one task can produce a dataset and another can consume it. You define it, you use it, and those dependencies are automatically taken into account — whenever the first task produces the dataset, the second one is run. We have a lot more features planned to support that, including the fsspec integration and a few others.

A little shout-out to Tatiana from Astronomer, who will be speaking next. One of the things that we recently added are LLM operators donated by Astronomer, which is one of the stakeholders in Airflow. So we have a whole set of LLM-related operators that you can use to build your LLM workflows, because training is just one thing: you have to prepare the data, you have to process it, then you have to take the results and maybe do some inference, and all the other things that are not just training and learning. And this is not a theoretical set of operators; it has been used in a real implementation of a bot that Astronomer developed.

A few more words. That was DAG authoring — how you prepare the tasks, which is the most important part for people like you, Python developers who want to implement those tasks in Airflow. But another important thing is that Airflow provides a modern UI out of the box. There is a nice graph view showing you all the dependencies, the status of the tasks and how they are doing. You can rerun and clear tasks from there and see their status. This is a great way for your operations people, those who watch the workflows being executed, to see what's going on with your DAGs. There is a nice Gantt chart showing how the DAG is progressing over time, and the grid view on the left side shows the history of task execution, so you can see how the same task behaved over days or hours or whatever your frequency is. There is a very nice integration with logs, so you can see what happened when a DAG fails, when there is a problem, and diagnose it. The Gantt view shows in a bit more detail how the tasks are progressing while the DAG is being executed. And a very recent addition: you can see an overview of your whole cluster — as I mentioned, Airflow can distribute the workload massively among multiple nodes, and you can now see all of that from a single place in Airflow as well.
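Coming back to the object-storage and dataset features mentioned a moment ago, here is a minimal sketch of how they can be combined. The bucket, connection id ("aws_default") and file names are made up, and it assumes a provider with object-storage support (for example the Amazon provider) is installed alongside pandas.

```python
from __future__ import annotations

import pandas as pd
import pendulum

from airflow.datasets import Dataset
from airflow.decorators import dag, task
from airflow.io.path import ObjectStoragePath

# A path on object storage, addressed through an Airflow connection,
# behaving much like a pathlib path thanks to the fsspec integration.
base = ObjectStoragePath("s3://aws_default@example-bucket/reports/")
daily_report = Dataset(str(base / "daily.parquet"))


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def produce_report():
    @task(outlets=[daily_report])
    def write_report() -> None:
        df = pd.DataFrame({"value": [1, 2, 3]})
        with (base / "daily.parquet").open("wb") as f:
            df.to_parquet(f)

    write_report()


# A second DAG scheduled on the dataset: it runs whenever the report is produced.
@dag(schedule=[daily_report], start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def consume_report():
    @task
    def read_report() -> None:
        with (base / "daily.parquet").open("rb") as f:
            print(pd.read_parquet(f).head())

    read_report()


produce_report()
consume_report()
```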
But I think the most powerful part of Airflow is that it not only lets you define tasks in Python and execute them. That's a rather simple thing. Airflow is a platform, and this is how we think about it for the future, when we will be developing the next versions of Airflow: rather than implementing everything on our own, we extend the capabilities of Airflow by allowing other dedicated tools and solutions to use whatever Airflow produces.

An example of that is OpenLineage. OpenLineage is a standard for tracking the provenance of your data across all your flows. So you can know that, for example, this column was a private column that was obfuscated or aggregated because it was privacy-related, then it was joined with other data, and you can track all of that information. OpenLineage allows that, Airflow produces the lineage data, and it is fully integrated. Another thing we have integrated is OpenTelemetry, which means that Airflow produces telemetry data that you can use to monitor whatever happens with Airflow and with the execution of those tasks, using your favourite tools like Datadog or Google Cloud's monitoring systems. All these tools support OpenTelemetry, and Airflow produces the data in an OpenTelemetry-compatible way. Still early days, but we already have it. We also have integrations like the dbt integration, which Tatiana is going to talk about in the next slot, so I will not stop on it here, but dbt is one of the most common ways people define what data Airflow processes and what tasks are executed, and there is a nice integration where dbt models can be mapped into Airflow DAGs and tasks. This comes from outside: it is not something inside Airflow, but Airflow has the extensibility that allows you to do that, and others, like Astronomer, did it.

We have a full-fledged REST API that allows you to build extensions, and those extensions can be of different kinds, because you can access all the data about Airflow. Those extensions can even be UIs. We had a discussion yesterday at dinner: you can build your own UI using those APIs if the UI of Airflow is not enough for you, because we cannot satisfy everyone. You can do all of that with the full-fledged REST API we have.
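As a small illustration of that REST API, here is a hedged sketch of an external script listing DAGs and triggering a run through the stable API. The URL, credentials and DAG id (reusing the hypothetical example_pipeline from the earlier sketch) are placeholders, and it assumes a webserver with the basic-auth API backend enabled.

```python
import requests

# Assumed local webserver and credentials; adjust for your deployment.
BASE_URL = "http://localhost:8080/api/v1"
session = requests.Session()
session.auth = ("admin", "admin")

# List a few DAGs known to Airflow.
resp = session.get(f"{BASE_URL}/dags", params={"limit": 5})
resp.raise_for_status()
for dag in resp.json()["dags"]:
    status = "paused" if dag["is_paused"] else "active"
    print(dag["dag_id"], status)

# Trigger a new run of a (hypothetical) DAG by id.
resp = session.post(f"{BASE_URL}/dags/example_pipeline/dagRuns", json={"conf": {}})
resp.raise_for_status()
print("triggered run:", resp.json()["dag_run_id"])
```

The same API is what external UIs, dashboards and other extensions can build on.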
A few words also on why we are at FOSDEM: Airflow is fully open source. Airflow has a ten-year history; this year is our tenth anniversary. Airflow Summit 2024 is planned for September in the Bay Area, and we expect a big turnout. Airflow is pretty popular: the last Airflow Summit we had, in Toronto, drew 500 people. You can see that it has been steadily developed over the years. It was initially donated by Airbnb, entered Apache Software Foundation incubation in 2016, and became a top-level project in 2019. We released Airflow 2.0, the new generation of Airflow, in 2020, and since then you can see that we are steadily releasing new versions with new features. What is important about Airflow is that it is open source, with a permissive license, the Apache Software Foundation behind it, and strong governance. We have really strong stakeholders like Astronomer, Google, Amazon, and Microsoft as well. All of them provide Airflow as a service; you can use it in their clouds. We have a very well-defined security process, release process and maintenance certainty. You can be quite sure that Airflow is going to be there with the same license, and that the license is not going to change, as we have heard happening with a number of open source projects. It is under the Apache Software Foundation umbrella, it is going to be maintained, and you can rely on it being released in the future. You can rely on it, pretty much.

Here are some community numbers, and it is not a vanity metric: we have the biggest number of contributors of any Apache Software Foundation project, more than 2,700, with 61 committers and 32 PMC members, and ten years of history. These are the different kinds of tools that you get when you integrate with Airflow — that slide should probably have come a bit earlier. You can see this is the community and the integrations: all the things I mentioned before about the extensibility of Airflow allow people to build all kinds of extensions. This is our community page of tools integrating with Airflow, and there are a lot of them; a lot of people are developing for Airflow, extending Airflow, adding new capabilities to it. One thing I wanted to mention, because this is mostly what I am working on: we have very solid foundations for Airflow right now. We defined a public interface that you can rely on when you are working with Airflow, so you can safely build on the REST APIs and the other things that Airflow exposes. And this is one of the most impressive things in Airflow, which I mentioned before: we have many, many different integrations built in. These are all the integrations we have — I am not sure you can read the names, some of them probably — more than 90 different providers that let you immediately connect to those external services and run Airflow with them; I will show a small example of using one of them in a moment.

My big focus is on security, and this is something that is coming in the next two years for every software package near you, because we have the CRA, the Cyber Resilience Act, in Europe, and others. If you haven't seen the public policy and compliance talks here, you should realize that it's coming: in two or three years we will all have to follow security very rigidly. We have a very well-functioning security team that handles security issues according to the Apache Software Foundation process. We are part of a bug bounty: if you find a problem in Apache Airflow, just report it and you can get money for that. I highly recommend it, because we fix issues fairly quickly. We also have features like SBOM and reproducible builds built into Airflow; we have been working on that for quite some time, to make sure that whatever we deliver is not only useful and nice for developers, but also secure to deploy and use in your production workflows. I think that is very important, because it means that you can rely on the software.
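As a small, hedged example of what using one of those provider packages looks like, here is a sketch assuming the Amazon provider is installed and an S3 connection named "aws_default" exists; the bucket and prefix are invented.

```python
# pip install apache-airflow-providers-amazon
from __future__ import annotations

import pendulum

from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def provider_example():
    @task
    def list_keys() -> list[str]:
        # The hook resolves credentials from the Airflow connection,
        # so no secrets live in the DAG file.
        hook = S3Hook(aws_conn_id="aws_default")
        return hook.list_keys(bucket_name="example-bucket", prefix="reports/") or []

    list_keys()


provider_example()
```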
So, to summarize, what I wanted you to remember from this talk is that Airflow is a modern and solid data orchestrator with really strong foundations, something that is going to be developed for the next 10, 20 or 30 years, who knows. The future is pretty bright. We have new, slick ways of interacting with the modern data stack: even though we were created 10 years ago, we now have all these modern ways of writing your workflows and interacting with external systems, which is pretty cool. It's true open source. And we have a huge and supportive community, both of the people who develop Airflow and of those who integrate things with Airflow, and that's a sign that this is a really great project to work on. So I would say, whether you want to contribute or to use it, both options are very good with Airflow, and we continuously evolve. That would be my very short overview. I didn't have time to go into many details; I just wanted to touch on what Airflow does. If there are any questions, I'm happy to take them.

Very nice talk, thank you. My question is: in my group, we haven't been able to agree on whether the Python-based DAGs are declarative or not. What do you think? And do you think the distinction matters at all?

So the question was about declarative or not. I personally think the declarative way of defining DAGs, of defining workflows, is very much impaired by the fact that whenever you want to do any kind of complex workflow, those declarative definitions become unmanageable. In the past I've developed a number of those workflow systems, as I mentioned; Airflow is just the last of them. Usually what happens with declarative workflows is that when they become complex enough, you start to write Python code to generate the declarative definitions, because they are too complex, and then the benefit of having them declarative is gone. I would say it's probably better to start the other way around: you use Python as the way to define your workflows, and, counter-intuitively, you take your declaration and generate the Python code from it. This is actually very easy, and our users are doing exactly that: a lot of users have their own declarative way of defining workflows and generate the DAGs from it. Given that Airflow is so flexible, that you can do anything with Python code and define arbitrarily complex workflows, mapping a declaration to Python code that does it is usually much simpler than trying to get declarative workflows to do something complex. So declarative workflows are great to start with, but when they become complex, the Python way is much better, in my opinion. Thank you.

We have time for one more question. Is there any interest in Airflow to support longer-running workflows with events and things like that? With events, blocking operations: you wait for an event, you wait for a time.

Yes, absolutely. This is one of the things I removed from this presentation for clarity — a previous version of the presentation did cover it. Airflow has so-called deferrable operators, which means that when you have long-running workflows waiting for something, you can defer the execution of that operator to a component we call the triggerer, which is basically an asyncio-based Python event loop. The task will wait on this asyncio loop while taking almost no resources. So if you have any workflow where you can check asynchronously and trigger when it finishes, this is the best way, and it is absolutely supported by Airflow to have workflows waiting for hours or days without taking too many resources. That's a relatively new feature, maybe two years old or so. Thank you.

Okay, thank you very much, Jarek. That was a very interesting talk, and it's certainly a very, very good project to look into. We'll have another five-minute break now, and then continue with the next talk. Thank you.
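For reference, here is a minimal sketch of the deferrable pattern described in that last answer, using a core deferrable sensor. The DAG name and the one-day wait are arbitrary choices for illustration.

```python
from __future__ import annotations

from datetime import timedelta

import pendulum

from airflow.decorators import dag, task
from airflow.sensors.time_delta import TimeDeltaSensorAsync


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def long_wait_example():
    # While this sensor is deferred, it hands the wait over to the triggerer's
    # asyncio event loop and frees its worker slot instead of blocking it.
    wait = TimeDeltaSensorAsync(task_id="wait_one_day", delta=timedelta(days=1))

    @task
    def after_wait() -> None:
        print("resumed after the long wait")

    wait >> after_wait()


long_wait_example()
```

Many provider sensors and operators in recent versions expose a similar deferrable mode as well, switching them to the same triggerer-based mechanism.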