[00:00.000 --> 00:15.640] So, sorry for the mess. It's a bit impressing all these people and so on. I'm Simon. I'm [00:15.640 --> 00:23.080] working as a research engineer in the University of Paris and I'm going here to present you [00:23.080 --> 00:35.840] Geeks to be able to do some reversible research and there is a group Geeks HPC which tried [00:35.840 --> 00:47.440] to apply Geeks tooling for scientific context. So, currently we are in a replication and [00:47.440 --> 00:56.600] reproducibility crisis. So, more than 70% of researchers are enabled to reproduce the [00:56.600 --> 01:05.800] results of peers or more than half are enabled to reproduce their own results. So, we have [01:05.800 --> 01:13.480] a big issue. So, there is many problems of this replication crisis and maybe one solution [01:13.480 --> 01:20.480] is open science. So, what does it mean open science? So, what does it mean science? Science [01:20.480 --> 01:27.320] means being transparent and collective activity. And what is a scientific result? Scientific [01:27.320 --> 01:34.960] result is some experiment. So, producing experimental data and then we have some numerical processing. [01:34.960 --> 01:42.120] So, to do that in today, we have different way because we need to communicate so we need [01:42.120 --> 01:46.600] to write results. So, we need open article to be able to read the results. We need to [01:46.600 --> 01:52.240] share the data. So, we have open data. We need to share the source code. But there is [01:52.240 --> 01:57.080] something that we never discuss is that all that need to be glued together because there [01:57.080 --> 02:01.680] is a numerical processing. So, we need to glue everything together. So, we need another [02:01.680 --> 02:10.200] one. We need a computational environment and this is really mean is one of the issue [02:10.200 --> 02:19.400] is that if this is not open, all the other stack is failing. So, that is the topic of [02:19.400 --> 02:28.280] today. How do we manage this computational environment? So, again, a result is a paper, [02:28.280 --> 02:35.120] some data and an analysis. And there is some parts which are mean possible to audit. For [02:35.120 --> 02:41.120] example, a paper, you can read it. A data, you can read the protocol that generates the [02:41.120 --> 02:46.400] data. You have analysis. You can read the script. But there is some part that are opaque. [02:46.400 --> 02:51.560] For example, the instrument, a telescope, a microscope. This is opaque. We don't know [02:51.560 --> 02:59.440] how it works. But there is something that is depend on our collective practice as researcher. [02:59.440 --> 03:06.160] And this is something that we can act on to do a better research. So, the question is [03:06.160 --> 03:17.320] to be able to eliminate at least this dependent and turn this as an auditable task to be really [03:17.320 --> 03:25.640] transparent. So, yeah, from my point of view, a computation and computing is just similar [03:25.640 --> 03:32.920] to an instrument. So, we should apply the same strategy that experimental people are [03:32.920 --> 03:42.200] applying for any instrument. And computing is just an experiment, in fact. So, the challenge [03:42.200 --> 03:48.800] about reputable science. From my point of view, there is two kinds. The first one is [03:48.800 --> 03:55.800] controlling the source of variation. What is different between this and that? So, between [03:55.800 --> 04:01.520] this computational environment and this computational environment. Because as with a telescope, [04:01.520 --> 04:06.800] for example, we want to know what is different between this telescope and this telescope [04:06.800 --> 04:12.160] to be sure that what we are observing is correct. So, from a scientific method, we [04:12.160 --> 04:17.200] need that the computational environment is transparent. And from a scientific knowledge [04:17.200 --> 04:23.320] viewpoint, what we are building together need to be independent. So, what I'm observing, [04:23.320 --> 04:30.320] you should observe the same. And this observation should be sustainable when the time is passing. [04:30.320 --> 04:35.720] We should be able to observe the same thing. Otherwise, it means that maybe we miss something. [04:35.720 --> 04:45.920] So, the big question today is with this kind of context, how do we redo later and elsewhere? [04:45.920 --> 04:51.120] So, I did something on my machine and you have to do this thing on your machine, for [04:51.120 --> 04:56.520] example, six months or one year or five years later, with the computer. And this is a big [04:56.520 --> 05:04.720] issue and is part of the reputable crisis in science from my point of view. So, what [05:04.720 --> 05:11.120] is a computational environment? Computational environment implies various points. For example, [05:11.120 --> 05:17.280] what is a source code? But, for example, if, say, I use Python and this script, okay, [05:17.280 --> 05:25.200] we have the source code of Python is in C and we have the source code of this Python [05:25.200 --> 05:33.280] script, okay. But the Python interpreter requires a C compiler. So, we need tools for building. [05:33.280 --> 05:39.640] And my script, for example, needs some Python library. So, we need also tools for running [05:39.640 --> 05:45.280] at runtime. So, and each tool has the same issue. What is the source code? What is the [05:45.280 --> 05:53.120] tools for building? And so, this is really reclusive. So, this is a big issue. And answering [05:53.120 --> 06:00.640] all these questions is controlling the source of variation. So, the question is, so, how [06:00.640 --> 06:06.280] do we capture the answer of all these questions? So, the question is not new. We have already [06:06.280 --> 06:12.240] tools, package manager, modified container. So, for example, with package manager, like [06:12.240 --> 06:18.120] APT for Debian, you can control this computational environment. But there is some issue. For [06:18.120 --> 06:23.640] example, how do you have several versions of open blasts on the same machine? It doesn't [06:23.640 --> 06:29.120] work really easily with Debian or with you and so on. So, there is fixes, but it's not [06:29.120 --> 06:37.640] really, I mean, practically, sometimes it's difficult. So, you have, to fix this issue, [06:37.640 --> 06:42.880] you have an environment manager, like Conda, PIP, Modifies, and so on. But this is really [06:42.880 --> 06:50.760] difficult because, for example, in Conda, how do you know how it is built? What is inside [06:50.760 --> 06:59.920] what you install? So, this is for transparency in science. Modifies, how do you use Modifies [06:59.920 --> 07:07.200] on the laptop? I think no one. And Docker is for container, Docker, Singularity, or whatever, [07:07.200 --> 07:11.960] is a strategy which generally based on the previous solution. So, in fact, you have exactly [07:11.960 --> 07:16.880] the same problems as the previous solution. It just helps to move stuff from one place [07:16.880 --> 07:22.280] to the other one, but it doesn't help to be able to have the correct thing in the first [07:22.280 --> 07:30.400] time. Geeks, in fact, is all these three solutions glued together. So, it tries to fix all the [07:30.400 --> 07:38.160] annoyance from each to have something, I mean, working, fixing all the issues of everything. [07:38.160 --> 07:44.280] So, Geeks is a package manager, like APT, UME, etc. It's transactional and declarative. [07:44.280 --> 07:50.080] It means that you can roll back, you can have a concurrent version, and so on. You can produce [07:50.080 --> 07:58.720] a pack, which is Docker images, for example. You can produce virtual machines, like Ansible [07:58.720 --> 08:05.960] for deploying on some machine. You can build a complete distribution, and it's also a [08:05.960 --> 08:14.520] self-came library, so you can extend Geeks. So, okay, the talk is 25 minutes. So, it's [08:14.520 --> 08:19.080] just a kind of aperitif before lunch. So, I don't speak about all that, because it's [08:19.080 --> 08:23.800] a little too much. So, I just speak about how Geeks help in open research from my point [08:23.800 --> 08:31.960] of view. So, I think it's really easy to try. You have just a script, and give a look before [08:31.960 --> 08:37.520] installing it. It's just a bar script, but check it. And you can install Geeks on any [08:37.520 --> 08:42.480] recent distribution. So, it's really easy to try. You are running Debian. You can try [08:42.480 --> 08:49.120] Geeks without installing the complete distribution. You can use Geeks on the top of any distribution, [08:49.120 --> 09:00.680] and it's really easy to try. Give a try. So, now, Geeks is just another package manager. [09:00.680 --> 09:06.520] So, you have the same command that you have in any package manager, for sharing packages, [09:06.520 --> 09:12.000] showing packages, installing packages, removing packages, and so on. It's exactly the same [09:12.000 --> 09:20.320] as any package manager. But you have some more functionality, like transactional. So, everything, [09:20.320 --> 09:25.360] you can do two actions in the same time. So, for example, removing and installing in the [09:25.360 --> 09:30.920] same transaction. Or you can roll back. So, for example, you install something, and you [09:30.920 --> 09:37.720] want to roll back to uninstall this thing without breaking nothing. So, okay, this is [09:37.720 --> 09:43.280] another package manager. But is it really another package manager? So, yeah, we can have, [09:43.280 --> 09:49.720] it's a command line. It's a, we install, remove without special privilege. So, this is nice. [09:49.720 --> 09:54.240] It's transactional. So, there is no broken state. We have been able to substitute. So, [09:54.240 --> 10:00.800] we don't have to wait hours and hours to have our binary. But this is nice. But what is [10:00.800 --> 10:06.200] really, really nice is decorative management. It means that everything is a configuration [10:06.200 --> 10:14.360] file with scheme. But you can declare everything. And you can produce isolated environment on [10:14.360 --> 10:22.000] the fly. This is something that's really helpful. And you can also see Geeks as a factory for [10:22.000 --> 10:31.520] the Docker images, for example. So, okay, this is all interesting feature. But why Geeks [10:31.520 --> 10:38.080] is reproducible? Or what does it mean it's reproducible? For reproducibility, we need [10:38.080 --> 10:45.040] to talk about what is a version. So, what is a version? Alice say, for example, I use [10:45.040 --> 10:52.880] GCC Adversion 11. Okay, nice. But what does it mean, concretely, I use GCC Adversion 11. [10:52.880 --> 10:58.640] It means that you need GCC, the compiler. But you also need AD, which is the linker. [10:58.640 --> 11:05.080] And you know, Binitils, for example. And the Jelitsi library. But the compiler GCC, it [11:05.080 --> 11:12.040] needs, for example, MPC, which is a package that does, I don't know what exactly. Anyway. [11:12.040 --> 11:18.720] And you need also MPFR and so on. And you have this kind of graph. And we can ask the [11:18.720 --> 11:30.000] question, is it the same GCC Adversion 11 if we replace this MPFR Adversion 4.1 by MPFR [11:30.000 --> 11:36.840] Adversion 4.0? Is it the same GCC or not? And maybe not. And if it is not the same, maybe [11:36.840 --> 11:42.840] you are feeling a difference. How can we be sure that we are using the exact same GCC? [11:42.840 --> 11:49.480] So this is just an extract of the graph because the graph have roots. And yeah, it can be [11:49.480 --> 11:56.360] really large. And maybe we can also talk about what are the roots of this graph. But this [11:56.360 --> 12:04.920] is another talk. So when you say that, okay, but I need to have a version. So what is my [12:04.920 --> 12:13.840] version in GICS? So GICS describe the state of GICS. So in fact, GICS describe is a version [12:13.840 --> 12:21.160] of GICS. And what it does, in fact, it pins the complete collection of all the packages [12:21.160 --> 12:28.640] and GICS itself. And because of that, we are able to freeze the complete graph. We can [12:28.640 --> 12:38.200] move this graph from one place to the other. So, okay. So this graph, in fact, describe [12:38.200 --> 12:46.200] the nodes of each, each, each node in this graph specify a receipt. And this receipt [12:46.200 --> 12:52.720] defines the code source, the build time tombs, and the dependency. So for me, yeah. And this [12:52.720 --> 12:58.240] graph can be really, really large. For example, for Skypy, which is a scientific Python library, [12:58.240 --> 13:11.880] there is more than 1,000 nodes. So, yeah, it can be really large. So for when I say GCC [13:11.880 --> 13:20.480] at version 11, it means one fixed graph. And providing the state which describe, this capture [13:20.480 --> 13:26.840] this complete graph. And I can reproduce this complete graph on another machine. So this [13:26.840 --> 13:32.680] is collaboration in action. So Alice describes the list of the tools in a manifest, declarative [13:32.680 --> 13:40.760] way. She generates the environment, GICS shell, and providing the tools. So this creates an [13:40.760 --> 13:48.720] environment containing the tools that are listed in the manifest file. Okay. This is [13:48.720 --> 13:56.080] nice. But now she describes the revision of GICS. So she writes GICS describe and this [13:56.080 --> 14:03.680] fix the state of Alice. So, okay, this Alice is working on her laptop. But collaboration [14:03.680 --> 14:09.760] is share this computational environment. So it's about sharing the state. To share this [14:09.760 --> 14:15.720] state, you need to share one specific graph. To share this graph, you need to only share [14:15.720 --> 14:27.120] these two files. And if, sorry, if Blake has these two files, Blake can create the exact [14:27.120 --> 14:32.840] same computational environment as Alice. So you have the GICS time machine. You specify [14:32.840 --> 14:40.640] the state of Alice shell and specify the tools that Alice used. And Blake and Alice are running [14:40.640 --> 14:48.040] the exact same computational environment. And for example, if you have Carol, who knows [14:48.040 --> 14:54.000] these two files, she also can reproduce the exact same that Alice and Blake. So, in fact, [14:54.000 --> 14:59.840] you only need two files. And with these two files, you can reproduce everything from one [14:59.840 --> 15:07.840] place to the other. So, in fact, you have this kind of picture. Alice, Blake, Carol are [15:07.840 --> 15:14.840] in different time frame. But they can jump from this time frame, virtually time different [15:14.840 --> 15:20.480] time frame, to the same place. Because their machine are in different state, but they can [15:20.480 --> 15:28.080] temporarily go to another state to create the computational environment. To make this work, [15:28.080 --> 15:34.400] when the time is passing, you need to preserve all the source code. And this is not straightforward. [15:34.400 --> 15:39.280] It is not trivial to preserve all the source code. And you also need some backward compatibility [15:39.280 --> 15:44.960] of the Linux kernel and some compatibility of the hardware. But, okay. And when these [15:44.960 --> 15:49.360] three conditions are satisfied, you have the reproducibility. But what is the size of the [15:49.360 --> 15:54.200] window, of the time window, where these three conditions are satisfied? And this is, from [15:54.200 --> 16:00.360] my point of view, unknown. And GICS is, to my knowledge, a case unique by experimenting [16:00.360 --> 16:06.320] to be able, because we have the tooling to do all that. And now we can know what is the [16:06.320 --> 16:15.000] size that we are able to reproduce the past in the future. So what is software heritage? [16:15.000 --> 16:20.920] So software heritage is an archive. It collects preserved software in source code form from [16:20.920 --> 16:29.240] a very long term. And GICS is able to save the source code of the package and the receipt [16:29.240 --> 16:36.000] of the package itself. And GICS itself is also saved in software heritage. And GICS is [16:36.000 --> 16:41.400] able to use software heritage archive to fall back if a swim disappears. So you have the [16:41.400 --> 16:46.400] postdoc working on some GitLab and Stance. And the account is closed because the postdoc [16:46.400 --> 16:52.800] is moving to other place and so on. And now you have this paper with this URL of, with [16:52.800 --> 17:00.640] the GitLab package and say, oh, no, it doesn't work because the account is closed. If you [17:00.640 --> 17:07.800] were using GICS transparently, you can check if the source code is on software heritage. [17:07.800 --> 17:14.000] And this asks really good question about how to see the software and do you notice it only [17:14.000 --> 17:18.600] the source or, and what about the dependency and the build time options and so on. How [17:18.600 --> 17:24.520] do you see the software? And how, I mean, how do you see this? Do you see it with intrinsic [17:24.520 --> 17:31.400] identifier like checksum or with intrinsic identifier like version label? This is easy. [17:31.400 --> 17:39.000] So in summary, there is three commands. I'm almost done, right? Yeah. So in summary, you [17:39.000 --> 17:44.480] have three commands. And these three commands, which are GICS shell, GICS time machine and [17:44.480 --> 17:56.280] GICS subscribe, they help you to have a computational environment that you can, I mean, inspect and [17:56.280 --> 18:03.880] collectively share. So if you have this and two files, manifest and channel files, you [18:03.880 --> 18:15.360] are reproducible over the time. So okay, for offline, when you are, because I hope I convince [18:15.360 --> 18:22.280] you that is cool. So here is some resources that to, to, to, to, to read offline. So GICS [18:22.280 --> 18:31.640] HPC is a group of people trying to apply this GICS tooling to, to, to, to, to scientific [18:31.640 --> 18:38.960] research. And we are organizing coffee gigs where we, we drink coffee and speak about GICS. [18:38.960 --> 18:49.120] There is a, an article trying to explain this kind of vision of what GICS could provide [18:49.120 --> 18:58.320] for, for open research. And for French speaker, there is a one hour tutorial. So yeah. And [18:58.320 --> 19:08.560] there is a, now GICS is tenure, so it's kind of ready. So they, we organize the ten years [19:08.560 --> 19:14.320] events where there is some really nice materials about, about GICS. And GICS is not new at [19:14.320 --> 19:21.360] first them. So yeah, there is, all the number are, are linked to, to the previous presentation. [19:21.360 --> 19:25.440] So as you see, there is a 31 presentation about GICS in first them. So you have a lot [19:25.440 --> 19:34.480] of material about what GICS can do for, for your job, for your task. So you run in production [19:34.480 --> 19:40.440] on big cluster, but also in a lot of laptop and desktop. And here, for example, is to [19:40.440 --> 19:49.800] paper in completely, I mean, medical and biomedical stuff using GICS as a, as, as tooling with, [19:49.800 --> 19:55.440] as, as I presented about GICS shell, time machine and so on. So, okay, open science [19:55.440 --> 20:02.440] means to be able to trace and transparent because is to be able to, to collectively [20:02.440 --> 20:06.680] study bug to bug, to be what is different from one thing to the other thing. And this [20:06.680 --> 20:11.280] is a scientific method and we have to apply the scientific method to the computational [20:11.280 --> 20:16.720] environment. This is my, my opinion and the message that I, I would like you bring back [20:16.720 --> 20:21.400] to home. And if you, if, if we have GICS, we can do that by controlling the environment [20:21.400 --> 20:27.480] and compare two different environment to know what is different. So, okay, this is, [20:27.480 --> 20:34.720] yeah, the kind of, what we are trying to, to do with the GICS project. So, thank you. [20:34.720 --> 20:36.720] And I'm ready for your question. [20:36.720 --> 20:49.720] Yeah. Yeah. So, we have five minutes for questions and switching speakers. Please take question [20:49.720 --> 20:53.720] and do repeat them for the stream. Thanks. I will try to do my best. Yeah. [20:53.720 --> 21:17.080] Yeah. Ah. Lobing. So, the question is, okay, I, I, I don't have the, the, the root privilege [21:17.080 --> 21:21.760] to install GICS on the cluster because once GICS is installed on any cluster, you can, [21:21.760 --> 21:25.760] you can run it without privilege, but you need to install, the first time you need to [21:25.760 --> 21:31.640] install GICS, you need root privilege. And the system administrator of my cluster doesn't [21:31.640 --> 21:38.360] mean, yeah, I need to convince him. So, maybe the, the answer is to say other people are, [21:38.360 --> 21:43.360] are already doing that. So, it's, it's not, I mean, to reduce the scare to, to, to provide [21:43.360 --> 21:51.000] a new tool. This is what I, I, I would like to try to say, okay, these people are doing, [21:51.000 --> 21:59.000] they're doing it. So, maybe it's not so scary. Uh, I think it was after, yeah. So, yeah. [21:59.000 --> 22:03.000] Yeah. You mentioned that you're not sure how, how big the time window is. Yeah. [22:03.000 --> 22:09.000] If you look today, how far can you back and still reproduce? So, five years, ten years? [22:09.000 --> 22:15.000] No. So, the, the question is, okay, what is the size of the window and can we go back [22:15.000 --> 22:21.000] five years, uh, from now in the past? The issue is that the, the mechanism to, to bring [22:21.000 --> 22:28.000] back in time or to, to, to travel in time in GICS, uh, had been introduced in, uh, 2019. [22:28.000 --> 22:34.000] So, in fact, with GICS, we don't have the tooling to go back earlier. So, now, I mean, [22:34.000 --> 22:43.000] the, the, the zero for GICS is, uh, is version one. So, it's, uh, 2019. Yeah. [22:43.000 --> 22:50.000] Um, a lot of, um, scientists are using macOS and not Linux. Is there, is it possible to [22:50.000 --> 22:54.000] use all this stuff even though GICS can't really run on macOS? [22:54.000 --> 23:00.000] So, GICS cannot run on macOS. But we can ask the question, is it transparent if we are [23:00.000 --> 23:06.000] running on macOS? So, is it, are we are playing scientific method if we are running on macOS? [23:06.000 --> 23:15.000] So, I mean, I, I, I, I, I have not the question. It's, it's a collective decision. Yeah. [23:15.000 --> 23:23.000] My name is Alain. Um, as far as I understand, uh, GICS, uh, or GICS, uh, provides the same [23:23.000 --> 23:33.000] approach as the NICS. Yeah. So, um, I've never used, uh, GICS before, but I, uh, I have some [23:33.000 --> 23:38.000] experience with NICS. Uh, is there any crucial difference? [23:38.000 --> 23:45.000] So, from my point of view, oops. Ah, sorry, anyway. Um, in the slides, there, there is [23:45.000 --> 23:51.000] a, some, uh, appendix. So, there is extra slides. And there is one extra slide trying [23:51.000 --> 23:55.000] to, to explain what, from my point of view, the difference with NICS. So, the question [23:55.000 --> 24:00.000] is, uh, what is the difference between NICS and, and GICS? Because NICS, you, I mean, [24:00.000 --> 24:05.000] GICS use exactly the same, uh, functional strategy, package management, functional [24:05.000 --> 24:09.000] strategy. So, what is the difference? From my point of view, the difference is that you [24:09.000 --> 24:15.000] have a continuum in GICS in the language. The package are, are, are, are wrote in scheme [24:15.000 --> 24:21.000] and, and, and, and the, the code of, uh, GICS itself is also wrote in scheme. The [24:21.000 --> 24:26.000] configuration file are wrote in scheme. So, you have a, a, a big continuation with [24:26.000 --> 24:31.000] everything. And because of that, you can extend GICS for your own, uh, uh, stuff. [24:31.000 --> 24:37.000] So, for example, you can write a package transformation on the fly using, I mean, [24:37.000 --> 24:41.000] GICS as a library. You cannot do that with, with NICS because you have a lot of [24:41.000 --> 24:47.000] different tooling in C++ and some, uh, from my point of view, is this unity of, [24:47.000 --> 24:51.000] of, of the, the, the, the continuum of the language. [24:51.000 --> 24:56.000] Yeah. Yeah. But scheme allow you to, to write kind of domain specific language. [24:56.000 --> 25:01.000] It's, it's, uh, it's, uh, it's, uh, yeah. It's, it's a, it's a good language to, to [25:01.000 --> 25:05.000] write domain specific language. So, in fact, you have the both of, of the two worlds. [25:05.000 --> 25:07.000] From my point of view. Thank you. [25:07.000 --> 25:09.000] Oh yeah. Sorry. Last question. Yeah. [25:09.000 --> 25:13.000] Uh, that's good. It's, it's, it's, it's a good language to, to write domain specific [25:13.000 --> 25:17.000] language. So, in fact, you have the both of, of the two worlds. From my point of [25:17.000 --> 25:22.000] view. Yeah. This is, uh, so, when you are running GICS, for example, on the top of [25:22.000 --> 25:29.000] Debian, so, uh, how do we manage the graph and can we cut the graph to reuse a part [25:29.000 --> 25:35.000] of the Debian part? I mean, a part of the graph from Debian. So, the question is, [25:35.000 --> 25:41.000] uh, maybe it could be, maybe it could be, maybe it could be, maybe it could be, [25:41.000 --> 25:47.000] a part of the graph from Debian. So, the question is, uh, maybe it could be [25:47.000 --> 25:55.000] helpful for some packages. But, but when you do that, you are not able to, to [25:55.000 --> 25:59.000] manage the computational environment. Because if you have, for example, if I [25:59.000 --> 26:05.000] cut the graph on Debian, so I have a, a state in, in Debian with some packages, [26:05.000 --> 26:10.600] I cut the graph at some place to use these Debian packages. [26:10.600 --> 26:14.040] If I do that, how my collaborator can cut the graph [26:14.040 --> 26:16.520] in the same place with the same Debian packages? [26:16.520 --> 26:19.480] So this is kind of issue of replicability. [26:19.480 --> 26:20.760] So from a practical point of view, [26:20.760 --> 26:22.920] it could be nice because, for example, [26:22.920 --> 26:26.840] Debian has some machine learning packages that are not [26:26.840 --> 26:30.000] yet in Geek, so maybe we can reuse some part. [26:30.000 --> 26:32.600] But from a replicability point of view, [26:32.600 --> 26:36.760] you lose the property to move from one place to the other.