So people can hurry and sit down for the next speaker please.
Okay, thanks for our next talk. We have met in talking about semantically driven data management solution for IO intensive HPC workflows.
Thank you. My name is Metin Chakrachalov. I work at the European Center for Medium Range Weather Forecasts Department, Forecasts and Services Department at ECMWF.
I will talk about the semantically driven data management solution for IO intensive HPC workflows, which the work was funded by the EUR HPC project called IOC.
It is work done by many people.
So a little bit background on the ECMWF, European Center for Medium Range Weather Forecasts.
It is established in 1975 by 23 member states and 12 cooperating states as an intergovernmental organization.
There are three base duty stations with more than 450 people, Redding, Great Britain, in Germany, Bonn and Bologna, Italy.
So ECMWF is both a research institution and 24-7 operational services, producing numerical weather predictions and other data to member states.
There are two big projects that ECMWF is a key player. One is Copernicus. It is the Earth Observation component of the EU's space program.
We provide climate change information, atmospheric composition information and also flooding and fire danger information.
The other big initiative, EU initiative, is the Destination Earth project, which is prototyping digital twins of the Earth.
So the ECMWF's production workflow looks like this.
There are per day 200 million observations, collected acquisitions and fed into the Earth system model.
Those observations and the output from the weather predictions are archived.
Also, these data are used to generate products, which are 300 terabytes of data per day, which then accounts for 65 terabytes of data per day as products disseminated to around 350 destinations to member states and other customers.
So the information system, the data is central. It provides access to data, models and workflows, and the data management is very critical for the operations.
We need transformation of data into information, insights and decisions.
So, semantically driven data management, we have been doing this for a long time.
It means managing data based on its meaningful logical description, rather than just storing data.
We also abstract the backend technologies. We also abstract where and how the data is stored from the users.
So instead of, we try to avoid nested folder structures or UIDs, such as this home user projects ECMWF and blah, blah, blah, or some cryptic UIDs that doesn't make much sense to the user.
Instead, we want to use meaningful, scientifically meaningful metadata to describe the data.
For example, in this case, this project is ECMWF experiment number 42.
The data is 224 parameter pressure and level.
So for that, as part of the IOC project, we developed DAISY, data access and storage interface.
So we provide, we index and identify data using its meaningful description.
And for that also, that allows us to implement optimized algorithms to retrieve archive and retrieve data.
And this is based on the ECMWF, ECMWF object store called FTB, which is also free and open source on GitHub.
And this abstracts, we also abstract the storage technologies behind POSIX.
We support POSIX, DAOS, Moto, and Ceph.
And we provide interfaces and tools as well as CEC and Python APIs.
So the schema, the main complexity is the schema which describes the database.
And it is a collection of rules and each rule is a tree of attributes.
In this example, I have a schema file and inside that I have two rules and each rule is consisting of multiple parameters.
For example, here project experiment date parameter level would translate to a key project ECMWF experiment 42 and so on.
The other rule is event city year and this could translate into event for stem, city is Brussels and year is 2024.
So the rules are, the rules have, the rules are blueprints of the database, how to construct the database.
And they have three levels and they, each level can have multiple attributes.
And to make a rule, it has to be unique and complete to describe the data so that we can identify data from other data.
And we also need to think about the locality where data, where different data is related to it, we would like to store them together.
So each level here, we can think of the first level as directory, the second level file level, and the third level as the indexes in the file.
So the locality would be increased when we go deeper in the level.
The other, we can set daisy, we can set up daisy by the configuration file in YAML configuration file.
We can point to the schema file to find the schema file and we can set the backend storage technology by saying file in this case is reference.
We can also have different parts to the databases. We can have multiple databases. It's called roots.
And we can set multiple behavior to individual roots.
So aside from data, we also need key and we also have query.
The keys would refer to single objects while queries can be any number of objects.
In this case, key defines, identifies, and single objects on the right, I have level as a list of values, 0, 1, and 3.
So it means I make a query for three different data where the differences, the levels are 0, 1, and 3.
So we provide multiple interfaces, command line tools, C, C++, and Python APIs.
But here I present an example for Python API because it's simplest.
So I need, for storing a data by key, I need a key and data.
So data can be anything in this case. I just put a string here, but it can be PNG file or PDF file or any other type of data.
Then I make a key. User is met in project is IOC.
Date is 2023 and city is born.
And I pass this key and data to Daisy and Daisy would archive it.
Then the other main feature is list, searching for data in the database.
I need to make a query, in this case, user met in project IOC.
And in this case, I just want two data objects for two different dates.
And I pass this query to Daisy and it returns me the keys that I need for retrieving.
And in the next example, I have the retrieve getting a key, getting data by a key.
I make a key, user is met in project IOC and so on.
And I pass this key to Daisy and retrieve the data.
So it's very simple.
So to sum it up, we describe data semantically instead of your IDs and nest directories.
And we index and identify data by its meaningful semantic information.
And this also allows us fast and efficient retrieve and search and archive algorithms.
Also, we abstract where how data is stored from the user.
And we make blueprints called rules.
And we make keys to attach to the data and pass it to Daisy.
And Daisy would store and manage the data using multiple different storage technologies.
So more about Daisy, we have, we published, Daisy is free and open source.
We published on GitHub.
We have example C API and Python API.
We also provide binary packages on GitHub for C, C++ as well as Python.
We also have Python packages for Linux, RPM and the beam packages are available.
We also have documentation on read to docs.
And that's all. Thank you for your attention.
APPLAUSE
Thank you. Do you have any questions for Metin?
No? Oh, there is one.
Thank you.
Hey, next presentation. I was wondering if you can specify the type of the values in your SEMA.
You mean integer?
Yeah, yeah, like that. To facilitate the queries.
Yes, attributes can have types. You can set integer, date, string, they can have multiple types.
Okay, thanks.
Hi, thank you for your talk. I was interested in indexes because you mentioned that you index and identify data.
Did some standard type of indexes or you have some format on your own, optimized for this three type of data?
Yes, indexing is based on the rules.
So the rules here, we have three structures which has three levels.
So it has to have three levels which translates to a directory file and data.
And we have in-house mechanism algorithm to that indexes that translates this three into identifying data.
So is it something like gene indexes?
I couldn't hear.
If it is something like gene indexes?
I'm not sure if I understand. Gene index?
Yeah, I'm not the right person, I think, because we use the FTB which has been developed since long time.
And it's a big library.
I cannot answer the question because I haven't worked with that level.
No problem, thank you.
Thank you for the talk.
I would like to know where are these keys stored because you need to query this kind of index and where are they stored for the user?
Yeah, so the indexes would be stored separately but together with the data.
And they would go, for example, in this case, inside the roots.
So each root would be different database.
And if you would look inside the root one, output one, for example, here, you would have index keys inside as well as the data, so together.
Okay, so it's a file system or a database stored inside these two directories?
Yeah, for POSIX, it would be directory, but for object storage, it would be not directly contained or something like that.
Okay, so the way how the index is stored depends on the type of the storage you describe here.
Yes, but we can also have, we have two different abstractions.
One is indexing and we call it catalog and we have a bulk data.
So we can have indexing catalog inside POSIX directory and bulk data on an object store.
Okay, thank you.
Any more questions?
Is the next speaker in the room?
Okay, thank you again, Metin.
Thank you.