[00:00.000 --> 00:26.240] Okay, so the aim is to control the sources of variation. From a scientific point of [00:26.240 --> 00:32.320] view, when you publish a paper, an independent observer should observe the same [00:32.320 --> 00:40.520] results and reach the same conclusion, and this observation must be sustainable, so it [00:40.520 --> 00:45.920] doesn't depend on when you observe it, and not on where either. So it's also [00:45.920 --> 00:54.120] collective. So the whole question in this scientific framework is: how can we redo [00:54.120 --> 01:01.880] later and elsewhere? I'm doing something on my laptop, and another person will try to [01:01.880 --> 01:10.280] redo the same thing, say, two years later. So how can we redo, later and elsewhere, [01:10.280 --> 01:17.720] what I've done here and today? This is a big question and the challenge in reproducible [01:17.720 --> 01:25.520] research and in science, and I think Guix answers this question. So what does a [01:25.520 --> 01:31.880] computational environment mean? For example, Alice says: using this data, you need that [01:31.880 --> 01:40.840] C file and GCC 11 to run my analysis. Okay, but what is the source code of GCC 11? And [01:40.840 --> 01:46.200] GCC 11 requires some tools for building, and these tools also require [01:46.200 --> 01:54.960] a runtime, and this is recursive. So answering all these questions is controlling the sources [01:54.960 --> 01:59.680] of variation, so you are controlling your computational environment. This question [01:59.680 --> 02:05.600] is not new in computing. We have solutions. The solutions [02:05.600 --> 02:13.600] are package managers and so on. So we have package managers like APT, but there are some [02:13.600 --> 02:17.880] issues: for example, it's difficult to have several versions, difficult to roll back.
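Editorial sketch: the recursive question raised above (GCC 11 needs build tools, which need a runtime, and so on) can be illustrated with a toy dependency graph in Python. The package names and versions below are illustrative only, and the hashing is a deliberate simplification rather than the scheme Guix actually uses. It shows two things: the full set of packages an analysis depends on, and that changing one deep dependency changes the identity of everything built on top of it.

```python
import hashlib

# Toy dependency graph: package -> (version, direct dependencies).
# Names and versions are illustrative only, not real package metadata.
GRAPH = {
    "analysis": ("1.0", ["gcc"]),
    "gcc": ("11", ["binutils", "mpc"]),
    "binutils": ("2.38", []),
    "mpc": ("1.2", ["mpfr"]),
    "mpfr": ("4.1", []),
}


def closure(graph, root):
    """Return every package reachable from root, root included."""
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(graph[name][1])
    return seen


def identity(graph, name):
    """Hash a package together with the identities of all its dependencies."""
    version, deps = graph[name]
    digest = hashlib.sha256(f"{name}@{version}".encode())
    for dep in sorted(deps):
        digest.update(identity(graph, dep).encode())
    return digest.hexdigest()


if __name__ == "__main__":
    # Everything the analysis depends on, directly or indirectly.
    print(sorted(closure(GRAPH, "analysis")))
    # Same graph, but with a different version of one deep dependency.
    other = dict(GRAPH, mpfr=("4.0", []))
    print(identity(GRAPH, "gcc") == identity(other, "gcc"))
```

In this toy model, "controlling the computational environment" amounts to fixing the whole graph, not just the version of the top-level tool.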
[02:17.880 --> 02:23.440] We have environment managers like Conda, pip, modulefiles, but there is a kind of issue [02:23.440 --> 02:30.880] with transparency: who knows what is inside a pip install torch? There are modulefiles, [02:30.880 --> 02:37.080] but how are they maintained? Do you use the same ones on your laptop and [02:37.080 --> 02:44.000] on another machine? There is the Dockerfile, but Docker and Dockerfiles are based [02:44.000 --> 02:55.280] on the previous solutions, so their drawbacks apply too. Guix is in fact all these solutions plugged [02:55.280 --> 03:03.440] together, and it fixes the annoyances of each. This is what Guix is. So Guix is [03:03.440 --> 03:11.320] a package manager like APT, etc.; it is transactional and declarative, and it produces shareable packs [03:11.320 --> 03:17.280] like Docker images. You can produce virtual machines and you can deploy them, for example [03:17.280 --> 03:24.240] like Ansible or Packer, and you can build a whole Linux distribution, and it's also a [03:24.320 --> 03:32.520] Scheme library. Guix is really awesome. Okay, we have 20 minutes, so I won't speak about [03:32.520 --> 03:39.960] all that because it's too much. I'll just explain how Guix is helping me in my daily job. [03:39.960 --> 03:47.760] You can run Guix on top of any Linux distribution, so it's really easy to try. If [03:47.840 --> 03:57.840] you haven't, you should. So Guix is just another package manager. So yeah, you can install and remove [03:57.840 --> 04:04.080] without any privilege. This is more than APT from Debian, for example. You have [04:04.080 --> 04:09.640] declarative management; declarative management means that you can have a configuration file [04:09.640 --> 04:14.440] and so on. It is transactional, so you don't have a broken state, you can roll back, and so on, and [04:14.480 --> 04:19.080] you have binary substitutes, so you are not compiling everything from scratch every time.
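Editorial sketch: "transactional, so you don't have a broken state, you can roll back" can be pictured with a small Python model. This is a toy under stated assumptions, not Guix's implementation: the `Profile` class and its methods are hypothetical names, and the only idea carried over is that each change produces a new immutable generation while older generations stay available.

```python
class Profile:
    """Toy model of a profile as a sequence of immutable generations."""

    def __init__(self):
        self._generations = [frozenset()]  # generation 0: empty profile
        self._current = 0

    @property
    def packages(self):
        """Packages visible in the current generation."""
        return self._generations[self._current]

    def install(self, *names):
        """Create a new generation, or change nothing at all on failure."""
        for name in names:
            if not name:
                raise ValueError("empty package name")
        new = self.packages | frozenset(names)
        # Toy design choice: generations after the current one are dropped.
        self._generations = self._generations[: self._current + 1] + [new]
        self._current += 1

    def roll_back(self):
        """Switch back to the previous generation, if there is one."""
        if self._current > 0:
            self._current -= 1


if __name__ == "__main__":
    profile = Profile()
    profile.install("python", "numpy")
    profile.install("scipy")
    profile.roll_back()
    print(sorted(profile.packages))
```

Validation happens before any state is touched, so a failed install leaves the current generation exactly as it was, which is the property the word "transactional" refers to here.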
[04:19.080 --> 04:25.240] Okay, this is a kind of classical package manager, but you have more. You have isolated [04:25.240 --> 04:33.000] environments on the fly, so you can create an isolated environment with Linux namespaces on [04:33.000 --> 04:39.440] the fly, and this is really helpful to check the dependencies of your scientific analysis. You [04:39.440 --> 04:47.160] can also use Guix to produce images like Docker images, but without using the Dockerfile machinery. [04:47.160 --> 04:57.320] So you are saying: okay, nice, but the issue in science is reproducibility. So you have a [04:57.320 --> 05:03.120] package manager that has all these features, but why is it reproducible? The answer is about [05:03.200 --> 05:11.240] versions. For example, Alice says: I use GCC at version 11. So you have the GCC toolchain, but you [05:11.240 --> 05:18.240] need a linker like ld, you need Binutils. GCC is a compiler, but the compiler needs MPC, and [05:18.240 --> 05:29.480] MPC needs MPFR, and the big question is: is it the same GCC if we replace this MPFR 4.1 by [05:29.520 --> 05:38.800] MPFR 4.0? Is it the same GCC or not? This is the issue that we can have when we are [05:38.800 --> 05:46.000] running an analysis: we are not controlling this graph in enough detail, and [05:46.000 --> 05:52.200] maybe a difference in the version of this package has a big influence on GCC; in the end, we [05:52.440 --> 06:03.760] cannot know. Okay, so with Guix, the state of Guix is [06:03.760 --> 06:10.840] provided by guix describe, and this fixes the graph: it fixes the complete collection of [06:10.880 --> 06:23.000] packages and Guix itself. In fact, each node specifies a recipe, and each node specifies the source [06:23.000 --> 06:31.080] code, the upstream source, but also all the tools that you need to build the package: the compiler, [06:31.080 --> 06:39.080] the build automation (CMake, Make, etc.), the
configuration flags, and so on, and the dependencies that you need [06:39.320 --> 06:44.480] to do that. So you have a kind of recursive graph, and this graph can be really, really huge: for [06:44.480 --> 06:52.760] example, for SciPy it's more than 1000 nodes, so it's not manageable by hand, and Guix gives [06:52.760 --> 07:04.720] you fine control over this graph. So, collaboration in action: this is how Guix is helping me [07:04.960 --> 07:12.160] concretely every day. I write a manifest; a manifest is just a file where I describe my tools, for [07:12.160 --> 07:20.960] example Python, SciPy, NumPy, and so on, and I create my environment with guix shell. So [07:20.960 --> 07:30.600] this is nice. Then I can pin this graph: I apply guix describe and I pin the graph, so now [07:30.600 --> 07:38.880] I have another file, state-alice, which pins this graph. But collaboration is about sharing the [07:38.880 --> 07:45.920] computational environment, so somehow it is sharing one specific graph. In fact, if I share these [07:45.920 --> 07:53.600] two files, the manifest, which describes the list of the tools, and state-alice, which describes the [07:53.600 --> 08:06.360] state of the graph, Blake, my collaborator, can spawn exactly the same computational environment [08:06.360 --> 08:14.320] using guix time-machine. So you see that Blake and Alice are running exactly the same [08:14.360 --> 08:23.120] computational environment, and if Carol also has these two files, Carol can run the exact [08:23.120 --> 08:30.640] same environment as Blake and Alice. So here we have something that is really easy: on [08:30.640 --> 08:39.960] my laptop, I specify the tools that I need, Python, etc., and I specify the state I'm [08:40.000 --> 08:45.960] running on the laptop. Then I deploy on the cluster: I just transfer the two files and I run [08:45.960 --> 08:52.040] this guix time-machine command, and on the cluster I have the exact same
environment that I have on [08:52.040 --> 08:56.640] my laptop. So there is no question about what can go wrong between my laptop and the cluster, because [08:56.640 --> 09:03.400] it's exactly the same computational environment, and this is a game changer when you are running [09:03.400 --> 09:07.960] analyses on different machines and your colleagues are running them in different places. [09:07.960 --> 09:19.720] So guix time-machine provides a way to jump to different states temporarily. I [09:19.720 --> 09:27.360] mean, if you imagine a timeline, Alice, Blake, and Carol are not at the same state [09:28.200 --> 09:35.480] along that timeline, but they can jump artificially and temporarily to the same point in time to have [09:35.480 --> 09:45.520] the exact same computational environment. This is not possible with any other tool, except maybe Nix, so [09:45.520 --> 09:52.040] this is kind of game-changing for me. To have that working well, you need to preserve all [09:52.040 --> 09:57.920] the source code, and for that you need Software Heritage, which I talk about just after, and you need [09:57.920 --> 10:06.160] backward compatibility of the Linux kernel, and you also need some [10:06.160 --> 10:13.280] compatibility of hardware. For example, in five years, if the hardware is not x86, maybe [10:13.280 --> 10:17.680] we cannot jump back in time, but if we have hardware compatibility, this will [10:17.720 --> 10:25.840] work. The question is: what is the size of the window where these three conditions are satisfied? Right now [10:25.840 --> 10:32.160] these conditions hold, for example, so what is the size of this window? From my point of view, [10:32.160 --> 10:44.280] Guix has been running a quasi-unique experiment, I mean a real-world experiment at large, since 2019. So [10:44.320 --> 10:50.280] we will see what this size is. Maybe in five years we will try to redo something from now and we [10:50.280
--> 10:56.520] will fail because of something, but we have this mechanism that is able to jump to different points in [10:56.520 --> 11:06.280] time. So Software Heritage is a long-term source code archive: it collects and preserves all the [11:06.280 --> 11:13.640] source code of the world, I mean the open source code of the world, so all of GitHub, GitLab, Debian, and so on. [11:14.360 --> 11:20.080] And Guix is able to save the source code of a Guix package definition. For example, the source [11:20.080 --> 11:28.800] code of GCC: Guix uses the source code of GCC coming from the internet, but you can save it [11:28.800 --> 11:34.480] directly to Software Heritage, and the package definition itself too. What is really nice is [11:34.480 --> 11:40.360] that Guix is able to fall back if the source disappears. For example, you have a project; [11:40.920 --> 11:49.400] if tomorrow GitHub is down for whatever reason, all the papers published with the line "my script [11:49.400 --> 11:59.160] is on GitHub", and the package is on GitHub, and GitHub is down, like [unclear], I don't remember [11:59.160 --> 12:07.400] the name, there are many popular platforms that went down, and everything breaks. With this mechanism, [12:08.200 --> 12:26.040] it doesn't break. So I have five minutes, right? Okay. So Guix [12:26.040 --> 12:33.720] is able to pack everything, so you can produce a Docker image with Guix: using the manifest, [12:33.800 --> 12:39.880] you use guix pack, and guix pack generates the Docker image. Then, if Blake doesn't run Guix, [12:41.080 --> 12:47.080] she can run Docker, and the binaries inside the Docker image are exactly the same as the binaries [12:47.080 --> 12:52.680] inside the computational environment of Alice, and because of the time machine this is reproducible [12:52.680 --> 13:01.800] over time. This is also a kind of game changer in science. In fact, a container is [13:01.800 --> 13:08.840] just an archive format, and
Guix is agnostic about this container format, so you can [13:08.840 --> 13:15.400] generate a tarball, Docker, Singularity; there is an experimental Debian binary package; and yesterday [13:15.400 --> 13:23.000] evening there was a patch about supporting RPM packages. So this is flexible for every [13:23.720 --> 13:30.040] context. The key point is to fully control the binaries going inside the container, and Guix [13:30.040 --> 13:33.480] does this job, so in this way it's a factory for creating images. [13:35.720 --> 13:41.640] So Guix is helping me because there are three commands and two files, so this is really easy [13:41.640 --> 13:49.160] to explain to, for example, medical doctors and so on, and it's a packing factory, so I can deploy [13:49.160 --> 13:53.800] on infrastructure where Guix is not running, for sharing computational environments. For me, these [13:53.800 --> 14:04.040] are two keys for open science, research, and so on. And if you want more [14:05.880 --> 14:12.120] information, there is this group, Guix-HPC, but like many things in Guix, the name is not [14:12.120 --> 14:17.480] good, because it's more like Guix for science, not specific to HPC. [14:17.800 --> 14:27.240] Okay, and it's running in production, so don't be afraid to install Guix. I mean, there are [14:27.240 --> 14:33.880] all these clusters already running Guix, plus all the laptops and desktops, so now it's in production. [14:35.720 --> 14:44.760] So yeah, this is, I mean, the picture of Guix and science that I would like to have in [14:44.840 --> 14:47.560] the future. So thanks. [15:01.320 --> 15:06.200] So, yeah, you're the first one. [15:15.000 --> 15:21.960] scientists, but for example the scientists who are trying to reuse it can use, for example, [15:21.960 --> 15:31.160] a different CPU or a different architecture. How much some kind of optimizations [15:31.160 -->
15:37.880] for a particular CPU can have an impact on having different results? Can it just, like, [15:38.680 --> 15:45.640] ruin the whole idea? Do you have an idea of how much impact? [15:47.160 --> 15:58.120] So I have to repeat the question. So the question is: in an HPC context, you have [16:01.080 --> 16:05.160] performance that depends on the architecture, and you can have micro-optimizations [16:05.160 --> 16:10.680] for a specific architecture. So how does Guix deal with that is the first question, and how do [16:10.680 --> 16:15.000] we do that for reproducibility is the second question. From my understanding, there are two [16:15.000 --> 16:27.800] questions. I don't know the difference in performance from the micro-optimizations; [16:27.800 --> 16:34.360] it is the job of the researcher in the field to say that this micro-optimization provides this [16:34.360 --> 16:41.480] performance improvement. What I can say is what Guix does to manage these micro-optimizations: [16:42.120 --> 16:50.360] Ludo is giving a talk in the HPC devroom about the tune package transformation. [16:50.360 --> 17:06.680] And I can ask, when we are speaking about reproducibility, whether these [17:06.680 --> 17:13.160] micro-optimizations fit with reproducibility and the scientific method. [17:13.160 --> 17:16.440] I don't have the answer, but at some point I think: [17:17.320 --> 17:23.720] is it worth having a micro-optimization, I mean something like a couple of percent [17:23.720 --> 17:29.960] of improvement, if we lose all the reproducibility, if we lose the way to check that the computation is [17:29.960 --> 17:34.600] correct? So I don't know; I mean, this is a collective question for [17:34.600 --> 17:38.760] researchers in general. So there is a question. [17:57.800 --> 18:03.160] So the question is: is it enough or not to have all this machinery to have reproducible science?
[18:03.880 --> 18:11.560] So the answer is: I don't know, because, I mean, to have the answer, everybody would have to [18:11.560 --> 18:20.200] run this to be sure whether it's the case or not. So I don't know, and I don't remember what I wanted [18:20.200 --> 18:21.480] to say about that. [18:22.040 --> 18:40.600] I mean, I don't show it here, but there is a paper published with this method, and yeah, there is more [18:40.600 --> 18:49.320] reproducibility than with the others. But at some point, the software is just one part [18:49.320 --> 18:55.800] of the big picture of the reproducibility issue in science, and in fact it is a [18:55.800 --> 19:04.360] collective practice. For example, I'm trying to reproduce a paper from [19:04.360 --> 19:11.480] this August, and some data are missing, the scripts are missing, the packages that [19:11.480 --> 19:18.840] were used are missing, so I cannot reproduce it. And this is not about Guix; it's just because [19:19.320 --> 19:23.320] the publisher didn't do a good job. Thanks, everybody.
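Editorial sketch: the "three commands and two files" workflow described in the talk can be written down as the command lines each person would run. The Python helpers below only build and print those command lines; they do not invoke Guix. The file names `manifest.scm` and `state-alice.scm` are assumptions (the talk names the second file only as "state-alice"), and the helper function names are hypothetical.

```python
import shlex

MANIFEST = "manifest.scm"   # assumed file name: the list of tools
STATE = "state-alice.scm"   # assumed file name: the pinned state of the graph


def shell_cmd(manifest=MANIFEST):
    """Alice: create the environment described by the manifest."""
    return ["guix", "shell", "-m", manifest]


def describe_cmd():
    """Alice: print the current state as channels; redirect it into STATE."""
    return ["guix", "describe", "-f", "channels"]


def time_machine_cmd(state=STATE, manifest=MANIFEST):
    """Blake or Carol: re-create the same environment from the two files."""
    return ["guix", "time-machine", "-C", state, "--", "shell", "-m", manifest]


def pack_cmd(manifest=MANIFEST, fmt="docker"):
    """Optional: export the environment as an image for hosts without Guix."""
    return ["guix", "pack", "-f", fmt, "-m", manifest]


if __name__ == "__main__":
    for cmd in (shell_cmd(), describe_cmd(), time_machine_cmd(), pack_cmd()):
        print(shlex.join(cmd))
```

Only the two files need to travel between Alice, Blake, Carol, and the cluster; the commands themselves stay the same on every machine.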