[00:00.000 --> 00:09.500] So, thank you Paul, welcome everybody. [00:09.500 --> 00:17.960] So we will today present to you the Cortex platform, which is an open platform for research [00:17.960 --> 00:20.200] in social sciences. [00:20.200 --> 00:26.280] And it's a collective presentation where all the team is here, but we have three of us [00:26.280 --> 00:31.240] will be presenting today out of the team that you see on the picture. [00:31.240 --> 00:33.680] So what is the Cortex? [00:33.680 --> 00:42.280] It's an online platform built on top of an architecture and its main goal is to help [00:42.280 --> 00:49.360] the social scientists with a research question to actually find the benefit of a computational [00:49.360 --> 00:55.240] method to fit a specific question. [00:55.240 --> 01:04.360] So it has been founded in 2009 and it's driven by social science research and is supported [01:04.360 --> 01:12.280] by several French national institutes at the beginning, in particular INRA, and then later [01:12.280 --> 01:18.520] on some funding, some research funding and some European project called RIS, European [01:18.520 --> 01:22.880] project of infrastructure. [01:22.880 --> 01:27.600] We are now at our second version of this online platform called Cortex Manager. [01:27.600 --> 01:31.760] So this is the part that you can actually go online. [01:31.760 --> 01:36.920] It has been designed from the very beginning, as I said, for the social scientists. [01:36.920 --> 01:48.800] So we took the point of departure was that the social scientist doesn't have any IT capabilities, [01:48.800 --> 01:57.160] doesn't have any, on average, doesn't have any resource on the computer, especially 15 [01:57.160 --> 01:58.160] years ago. [01:58.160 --> 02:06.120] So one of the main benefits of the platform would have to be empowering him with the power [02:06.120 --> 02:10.480] of computing and of methods. 
[02:10.480 --> 02:17.840] It's a collaborative platform, so a lot of projects can be shared between scientists. [02:17.840 --> 02:23.680] And as there is a web interface connected to a workspace online, you don't have to, [02:23.680 --> 02:30.160] of course, have it on your computer, which was, again, quite new at the time. [02:30.160 --> 02:33.520] And it's very important to note that it's very permissive. [02:33.520 --> 02:39.720] So it's a bunch of methods that you can apply on all kinds of data that we will detail later. [02:39.720 --> 02:46.080] And this was one of the main foundations of our platform in the beginning, that all the [02:46.080 --> 02:53.880] methods are runnable on all the variables that you have at your disposal. [02:53.880 --> 02:59.120] So how is it going ten, twelve years after that? [02:59.120 --> 03:06.840] So we have over 250 peer-reviewed publications that we have counted up to now. [03:06.840 --> 03:10.640] And of them, 75 documents in 2022. [03:10.640 --> 03:15.880] So it's growing, and it's growing quite fast the last few years. [03:15.880 --> 03:24.200] So 50,000 analyses, so 50,000 jobs have been done on the platform by scientists last year, [03:24.200 --> 03:28.800] by over 1,000 active users, 400 institutions, and in 50 countries. [03:28.800 --> 03:36.840] So it's pretty much worldwide. [03:36.840 --> 03:43.680] The team we have now in 2022-23 consists of eight technical persons, so engineers mainly, [03:43.680 --> 03:50.360] two researchers, one trainee, and two close collaborators in companies, or independent. [03:50.360 --> 03:55.760] Very important: it is located at University Gustave Eiffel in Paris, near, in the greater Paris, [03:55.760 --> 03:56.760] so near Marne-la-Vallée. [03:56.760 --> 04:01.560] So you have Disneyland not far away. [04:01.560 --> 04:10.360] And this infrastructure is built as a classic web-oriented application and infrastructure. 
[04:10.360 --> 04:16.960] So obviously we have the web interface that you can go to as a user, regular user, that [04:16.960 --> 04:25.680] encloses a lot of web services, so the main interface, but also an API that you can go [04:25.680 --> 04:28.800] and act upon from outside. [04:28.800 --> 04:36.120] So this allows us to open all our back-end services to external applications, and specifically [04:36.120 --> 04:39.840] external projects that we are part of. [04:39.840 --> 04:46.520] We are built on an infrastructure of servers that are hosted at Marne-la-Vallée, so in [04:46.520 --> 04:48.320] our university. [04:48.320 --> 04:49.960] So everything is local. [04:49.960 --> 04:57.800] We don't have anything hosted online. It is 300 CPUs, 3.5 terabytes of RAM, and 40 terabytes [04:57.800 --> 05:03.360] of storage, which is big and not big at the same time nowadays, but it's enough for our [05:03.360 --> 05:05.440] needs. [05:05.440 --> 05:10.600] We have two main services that we are providing. [05:10.600 --> 05:19.480] First is the storage of big databases that we are collecting and curating, like for example [05:19.480 --> 05:22.640] the patents, the European patents. [05:22.640 --> 05:33.200] We have several other databases that we are putting inside some projects, so we can help [05:33.200 --> 05:40.080] specific projects with this data that we have stored in the infrastructure. [05:40.080 --> 05:47.280] And the second part is the processing that we offer, so the methods, the scientific methods [05:47.280 --> 05:55.600] that I will talk about later, that are run inside of projects on the platform directly [05:55.600 --> 05:56.600] by the researchers. [05:56.600 --> 06:02.040] So the researchers are actually launching those methods directly from the web interface [06:02.040 --> 06:08.480] with the parameters and their data. [06:08.480 --> 06:15.280] And we of course have monitoring of the whole infrastructure, so it stays online. 
[06:15.280 --> 06:19.560] So once you have an infrastructure, it has to do something for scientists, and that's [06:19.560 --> 06:22.160] the goal of the platform. [06:22.160 --> 06:24.680] So how does it do, and what does it do for science? [06:24.680 --> 06:26.840] First what? [06:26.840 --> 06:30.800] So this is the right part of the previous diagram. [06:30.800 --> 06:35.240] So we have a bunch of heterogeneous methods. [06:35.240 --> 06:39.560] So from scientometrics, bibliometrics, the study of basically publications, scientific [06:39.560 --> 06:47.080] publications and tracks, natural language processing, so term extraction, named-entity [06:47.080 --> 06:56.560] recognition and stuff like that, social network analysis, so the study of all the interactions [06:56.560 --> 07:03.840] inside publications and texts in between actors, keywords, et cetera. [07:03.840 --> 07:10.520] Stochastic block models, that [inaudible], somewhere in the room, [07:10.520 --> 07:12.240] will talk to you about later. [07:12.240 --> 07:19.720] Knowledge dynamics, so through time, the study of the knowledge through time, and the spatial [07:19.720 --> 07:24.960] analysis, which is a big part of our infrastructure now, because we have geocoding, geolocating [07:24.960 --> 07:32.960] inside publications, and a geo-mapping framework. [07:32.960 --> 07:41.920] Okay, next I will hand over the mic, to talk to you about how to cite [07:41.920 --> 08:00.440] Cortex, because this is one of the big, I will try to note. [08:00.440 --> 08:11.800] Then, as Cortex is research software used mainly by researchers, it is supposed [08:11.800 --> 08:22.400] to be cited; despite the fact that there is not yet a strong culture of citing software properly [08:22.400 --> 08:26.840] inside academic works, it's expected to be done. 
[08:26.840 --> 08:37.120] Then according to that, in 2022, we documented how Cortex must be cited inside academic works, [08:37.120 --> 08:43.280] and here we have just an example, for example, if I am writing a paper, and if I cite Cortex, [08:43.280 --> 08:50.240] this is how Cortex will appear at the end of the paper in the reference section, for [08:50.240 --> 08:57.560] example, and with this, we can give, for example, the credit to the developers, like, for example, [08:57.560 --> 09:04.400] Philippe, Ale, and many others, who have been contributing to this software for a long time, and must [09:04.400 --> 09:11.400] be recognized for this work academically speaking, then this is important, then that's what [09:11.400 --> 09:23.000] we did, and nowadays, we are lucky because a few years ago, we were missing the infrastructure [09:23.000 --> 09:32.680] to define, to document, and to cite software, and now we have, for example, [09:32.680 --> 09:42.800] the CITATION.cff, the Citation File Format, which has been adopted a lot, and [09:42.800 --> 09:51.840] I think it will be the standard way to do it, and on top of it, we have many tools. [09:51.840 --> 09:58.640] Here is one example, the cffconvert tool, where we can convert the CFF file format, for example, [09:58.640 --> 10:05.760] to BibTeX, or to APA format, and many other formats, then we can keep the [10:05.760 --> 10:10.840] metadata about the software that is important for citation in one single file, and from [10:10.840 --> 10:22.040] this file, we can derive many others, and it's really an easy way to do it. [10:22.040 --> 10:31.520] And then in Cortex, it's a solved problem, but we still have an issue about how to identify [10:31.520 --> 10:35.640] permanently this object. 
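[Editor's sketch] The single-file idea described above, where citation metadata is kept in one place and other formats are derived from it, could look roughly like the following. This is a hand-rolled illustration in Python, not the cffconvert tool and not its API. The field names follow the required keys of the Citation File Format as far as we know them; every value, the function name `cff_to_bibtex` and the citation key are hypothetical placeholders, not the platform's real citation record.

```python
# Hypothetical metadata, shaped like the required fields of a CITATION.cff file.
# All values are placeholders (assumptions), not a real citation record.
cff = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it as below.",
    "title": "Example Research Platform",
    "version": "2.0",
    "date-released": "2022-01-01",
    "authors": [
        {"family-names": "Doe", "given-names": "Jane"},
        {"family-names": "Roe", "given-names": "Richard"},
    ],
}


def cff_to_bibtex(meta: dict, key: str = "example_software") -> str:
    """Derive a minimal BibTeX @misc entry from CFF-like metadata."""
    # BibTeX separates authors with " and ", each written "Family, Given".
    authors = " and ".join(
        f"{a['family-names']}, {a['given-names']}" for a in meta["authors"]
    )
    # The release date is an ISO string; its first four characters are the year.
    year = meta["date-released"][:4]
    fields = [
        ("author", authors),
        ("title", meta["title"]),
        ("version", meta["version"]),
        ("year", year),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


print(cff_to_bibtex(cff))
```

The point of the design is that the dictionary (or the file it stands for) is the only thing maintained by hand; each output format is just another small function over the same metadata.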
[10:35.640 --> 10:41.280] I say object as a digital object, okay, it's a software, but it's a digital object from [10:41.280 --> 10:46.960] the point of view of science, and how can we identify it permanently? For example, for papers, [10:46.960 --> 10:53.400] we have the DOI, which is very well known, and it works very well for papers, PDFs, but [10:53.400 --> 11:01.720] for software, we have a problem: we may need to cite software at many different levels [11:01.720 --> 11:04.600] of granularity, for example. [11:04.600 --> 11:09.480] We can cite the software, just the software name, or a specific version of the software, [11:09.480 --> 11:15.000] or a specific line of a specific file inside the source code of the software. [11:15.000 --> 11:20.480] Again, this is still an open question, because, for example, we have propositions from the [11:20.480 --> 11:30.440] community to solve it, but there is no single kind of permanent ID that covers all [11:30.440 --> 11:33.560] the levels of granularity, for example. [11:33.560 --> 11:38.400] We have the proposition from the Software Heritage project, which is the SWHID, the [11:38.400 --> 11:45.120] Software Heritage ID, that we can use to cite software from a very specific code fragment [11:45.120 --> 11:52.080] to a snapshot of the full source code, but, for example, we can't use the SWHID to cite [11:52.080 --> 11:57.520] software at a high level, like the software version or just the project name. [11:57.520 --> 12:00.720] For that, we should use another kind of ID, that's it. [12:00.720 --> 12:07.560] And it's an open question that we should work on this year or next year, and that's it. [12:07.560 --> 12:10.560] Now it's the time for you, Juan. [12:10.560 --> 12:11.560] You? [12:11.560 --> 12:14.560] I don't know. [12:14.560 --> 12:25.240] So one question we've been asking ourselves is how to open the platform fully. 
[12:25.240 --> 12:30.960] What did we do to do that until now, and what to do in the future? [12:30.960 --> 12:36.480] So there is no platform, web platform, without free open source software today, and we are [12:36.480 --> 12:43.600] using a lot already, and I just put some of them on this page. [12:43.600 --> 12:52.920] You can see here, from any layer of the infrastructure, from the virtual machine, the containers, [12:52.920 --> 13:02.000] the scripts, the operating systems, et cetera, et cetera, until the actual methods and interface [13:02.000 --> 13:04.680] that you produce to the user. [13:04.680 --> 13:09.120] We are using free open source software, or at least open source software, because I don't [13:09.120 --> 13:13.120] want to start a debate here, but that's the general idea. [13:13.120 --> 13:14.920] Now this is not Cortex producing that. [13:14.920 --> 13:22.480] The Cortex is wrapping around this open source software, and it's open access for now. [13:22.480 --> 13:27.960] So it's free and open to everybody to use as an interface, but the code is not entirely [13:27.960 --> 13:28.960] free. [13:28.960 --> 13:34.160] We've been trying to do that a little bit more than the last few years, and hopefully [13:34.160 --> 13:36.680] in the next few years, even more. [13:36.680 --> 13:42.680] And Ale will talk to you about how to do that, and what are the challenges to actually produce [13:42.680 --> 13:46.680] a good open source software and methods. [13:46.680 --> 13:52.520] Hello, everybody, I'm Ale. [13:52.520 --> 13:57.120] I work with the guys and go in Cortex. [13:57.120 --> 14:02.200] So this is not really a talk about Cortex, I'm not going to do a demo or something. 
[14:02.200 --> 14:07.800] Because we thought of this talk as a moment to discuss what's going on with us, what is [14:07.800 --> 14:12.880] going on in the platform, because up to now, like Philippe just explained, everybody can [14:12.880 --> 14:19.000] come and use, but most of the software in Cortex has been developed without, well, with [14:19.000 --> 14:23.800] open source in mind, but not in practice, for practical reasons. [14:23.800 --> 14:28.720] So are we missing one slide? [14:28.720 --> 14:33.680] No, yeah, so everything from the methods, the interfaces, the back ends, everything [14:33.680 --> 14:38.520] has, so even our documentation is on Wordpress or Q&A also. [14:38.520 --> 14:45.160] So everything is based on open source, but up to now, we didn't have real open source [14:45.160 --> 14:46.160] practices. [14:46.160 --> 14:57.000] So how are we managing this transition from a project that is very aligned with open [14:57.000 --> 15:02.200] source community, but that doesn't manage to do open source, to start doing open source? [15:02.200 --> 15:08.840] So there was really a big part was just getting even more open source into the project. [15:08.840 --> 15:17.120] So we're starting to use project management tools that we were using, like some of them, [15:17.120 --> 15:19.840] but not as much, not as systematically. [15:19.840 --> 15:25.000] So this is a real common problem with research software, where researchers, they don't always [15:25.000 --> 15:35.320] have the reflex to go into GitLab or GitHub or whatever and use issues and use dashboards [15:35.320 --> 15:41.040] and project management, sometimes not even a thing in their head. 
[15:41.040 --> 15:46.360] And that's kind of the background that we were facing, like years and years of research [15:46.360 --> 15:56.120] and developing stuff, using open source software, but using version control because, well, it [15:56.120 --> 16:01.400] was still the infrastructure guys really needed that to work with, but it was really like [16:01.400 --> 16:02.700] the bare minimum. [16:02.700 --> 16:06.240] So how do you start to get people more involved? [16:06.240 --> 16:14.400] So we just, the first step was really pushing everybody to start using the usual tools, [16:14.400 --> 16:21.040] and sometimes first moving some projects and moving the discussion into issues on GitLab [16:21.040 --> 16:22.160] and all that. [16:22.160 --> 16:29.040] And then, okay, so, and then, and also adopting for our discussions more open source software, [16:29.040 --> 16:36.480] like Matrix, so it's like, I don't know if anybody knows Matrix here, maybe today everybody [16:36.480 --> 16:40.920] already knows Matrix, it wasn't really a given last year. [16:40.920 --> 16:51.680] And then, once we had like adopted the general workflow, the projects were still, most of [16:51.680 --> 16:57.160] the projects were still private, but they had the workflow of an open source project, [16:57.160 --> 17:03.440] so it helped get people into a better mindset and the people who would arrive and start working [17:03.440 --> 17:12.040] with us would like more easily recognize the pattern of working with open source. [17:12.040 --> 17:19.280] Then because we had this system that runs the jobs and everything, and usually in the [17:19.280 --> 17:25.360] old scripts, they would just use the structure that was in the system directly, and it was [17:25.360 --> 17:27.960] pretty hard to adapt and evolve them. 
[17:27.960 --> 17:34.560] So then we developed some kind of intermediate libraries so that it gets easier to, so making [17:34.560 --> 17:40.880] it easier to develop new methods in the platform. [17:40.880 --> 17:44.960] So we had to develop this intermediate, I won't have time to go into show the details, [17:44.960 --> 17:51.800] but just getting the idea that we have to get people into the workflow, develop libraries [17:51.800 --> 17:59.800] and interfaces and intermediate layers that help and automate some of the processes, give [17:59.800 --> 18:06.040] people more freedom to work, actually we also started integrating containers in the job [18:06.040 --> 18:10.840] manager and everything so that we could end with this intermediate layer so that people [18:10.840 --> 18:21.840] could have more freedom to make it easier for them to keep doing their non-standard, [18:21.840 --> 18:27.720] non-advisable, non-best practices stuff, but that works for them because their researchers [18:27.720 --> 18:33.840] are not coders, they're not always coders. [18:33.840 --> 18:40.360] And finally, and even for the people in the team that are more skilled programmers, of [18:40.360 --> 18:44.920] course, they want to have the possibility to do all this stuff. [18:44.920 --> 18:50.360] And finally, one thing is very important is at the management level, because this is [18:50.360 --> 18:56.600] like a kind of already intermediate kind of big-ish thing where we have hundreds of users [18:56.600 --> 19:03.440] everywhere, so we had to discuss this in the strategic planning so that our institutional [19:03.440 --> 19:05.800] partners could get it. [19:05.800 --> 19:13.200] So basically, just get it everywhere as soon as possible and then start, once you can show [19:13.200 --> 19:22.520] some of the benefits, also work at the institutional level. [19:22.520 --> 19:27.600] So these are some of the challenges. 
[19:27.600 --> 19:32.400] Most part of the team didn't have the know-how, part of the researchers we work with. [19:32.400 --> 19:38.840] People have limited resources, they have priorities, and I'm just reading out of there, but that's [19:38.840 --> 19:39.840] it. [19:39.840 --> 19:41.340] It's not very surprising. [19:41.340 --> 19:45.840] There's this learning curve of open source and everything. [19:45.840 --> 19:50.320] And there's always doubts about where is this going to lead us. [19:50.320 --> 19:55.040] There's licensing doubts, so there's part of, like, next thing I'm going to quickly show [19:55.040 --> 20:00.160] is that we're starting to open up stuff, and there's lots of stuff that hasn't been open [20:00.160 --> 20:06.920] yet simply because somewhere, we know that somewhere in some file, somebody used an extract [20:06.920 --> 20:12.640] from a software library that is not clearly documented, where it has been used, and we [20:12.640 --> 20:19.360] need to find it and check if the license is compatible. [20:19.360 --> 20:22.240] So we're always like, okay, where are we going? [20:22.240 --> 20:29.840] But getting things to as close as possible on every dimension is what's keeping us moving. [20:29.840 --> 20:35.280] It's keeping the feeling that something is moving, even if we're not seeing everything. [20:35.280 --> 20:41.200] But then we, well, through this effort, eventually, we managed to open source already a good part [20:41.200 --> 20:52.200] of specifically the science part, so the infrastructure, the dashboard, the interfaces, the front-end [20:52.200 --> 20:58.160] of the web application, and the job-managing part, or this part is not necessarily, it's [20:58.160 --> 21:05.520] taking longer because it's also another part of the infrastructure. [21:05.520 --> 21:13.400] But the main, most important, short-term part is to let the scientific part of it be available. 
[21:13.400 --> 21:20.480] So here's just some examples, I'm not going to, but yeah, so there's two parsers, an [21:20.480 --> 21:29.840] importer for a big database of scientific publications, a kind of generic parser that [21:29.840 --> 21:37.960] we use for, we're in a sociology of science lab, so even though the tools are more like [21:37.960 --> 21:41.560] general humanities and social sciences, we have a lot of interesting things that come [21:41.560 --> 21:47.600] from PubMed, data that comes from PubMed, and similar databases, so this is like also [21:47.600 --> 21:54.680] a parser for that, and here's one example of a project that's in progress because even [21:54.680 --> 22:00.440] though we reworked the project, parts of it still use some libraries that we're not sure, [22:00.440 --> 22:08.600] so we have to check, take the time to check if we need to change anything or replace. [22:08.600 --> 22:14.960] Just quickly show what some one of these things look like, well, this is just the repository [22:14.960 --> 22:24.760] for one of the projects, you have the source code, some documentation, it makes pretty [22:24.760 --> 22:30.600] graphs, and you have to use it like this, like this. [22:30.600 --> 22:37.120] So just getting things into what we all know about, what you know as an open source format, [22:37.120 --> 22:53.520] how do I go back to the presentation, F, no, no, there you go. [22:53.520 --> 23:02.560] What else there's to say, that's it, so thank you, I think, I mean, if you, if you, if you [23:02.560 --> 23:10.480] have any of you work in a similar institution where open source is not an evidence but it's [23:10.480 --> 23:17.000] something that you have to struggle with, we're very happy to exchange after, during [23:17.000 --> 23:29.880] or after the conference, did you skip the, no, no, you have no more time, thank you. [23:29.880 --> 23:36.880] Thank you. [23:36.880 --> 23:39.880] Any questions? 
[23:39.880 --> 24:01.880] Yeah, thank you a lot, I like your time because there is a lot of nice to choose, and you just [24:01.880 --> 24:15.200] wanted to share with us, and I'm very happy to hear that you are bringing up the gauging. [24:15.200 --> 24:22.160] Yeah, so the question was, what was preventing us to opening it from the beginning, and to [24:22.160 --> 24:24.280] make open source software from the beginning. [24:24.280 --> 24:30.280] So as Ale was saying, the first thing I think is just the lack of know-how, especially 15 [24:30.280 --> 24:36.720] years ago when we started, the second thing I would say is that we, we basically didn't [24:36.720 --> 24:44.480] have a precise idea of what we were building, we just went basically script by script project [24:44.480 --> 24:49.160] by project, and this, this is how we built slowly the, the platform, then it became a [24:49.160 --> 24:53.240] platform, a proper platform, I mean, with the, with the resources and everything, and [24:53.240 --> 24:58.480] then, then it was already used, and, and we trained people on those methods already, [24:58.480 --> 25:04.360] so the problem of opening, of doing open source software is you have to think about it from [25:04.360 --> 25:09.480] the beginning, or at least refactor enough the, the code, so it's big, it's, it has no [25:09.480 --> 25:14.880] problem of, of license, and stuff like that, so this is, this is just a, you know, an ongoing [25:14.880 --> 25:15.880] streak. [25:15.880 --> 25:32.880] Yeah, speak loud, sorry, I can't, I can't hear you. 
[25:32.880 --> 25:52.400] Yeah, it's about, yeah, some people upload software to Zenodo or upload documentation [25:52.400 --> 25:58.760] or, that creates one kind of identifier, which you could also make a, some people just [25:58.760 --> 26:03.040] make a publication related to the software, so people can cite a DOI, Zenodo is going [26:03.040 --> 26:05.960] to give a DOI, so it's the same, pretty much the same thing. [26:05.960 --> 26:10.480] The problem is that, I think like Jenner was explaining, sometimes you want to cite [26:10.480 --> 26:15.680] a specific version of the software or a specific file in a specific version, or you want to [26:15.680 --> 26:21.360] cite the software without mentioning a specific version, and you don't want to upload [26:21.360 --> 26:34.680] one version to Zenodo for every possible thing. [26:34.680 --> 26:43.400] Yeah, yeah, yeah, I don't know in depth exactly what Zenodo offers, but I know that Zenodo [26:43.400 --> 26:50.080] offers a way to manage many different kinds of digital objects, data, images, software, [26:50.080 --> 26:53.000] etc., and not only Zenodo. [26:53.000 --> 26:58.400] In our case, what is missing is to study all the options to see which fits better [26:58.400 --> 27:06.520] in our case, you know, it is more to understand, because in fact, inside the software [27:06.520 --> 27:12.040] engineering community discussing this topic, it is still an open question, exactly how to [27:12.040 --> 27:13.040] do it. [27:13.040 --> 27:17.880] And when we search about this topic, it's quite difficult to find tutorials or someone [27:17.880 --> 27:23.760] teaching you what kind of permanent identifier you should use. [27:23.760 --> 27:30.320] For example, the Software Heritage ID, for example, here is an example of the type of persistent [27:30.320 --> 27:36.040] identifier that is called an intrinsic persistent identifier. 
[27:36.040 --> 27:42.560] Which means it doesn't depend on a service registry where you need to [27:42.560 --> 27:45.520] ask for a new ID to use it. [27:45.520 --> 27:53.520] It's like a hash, a hash ID of a commit is this kind of intrinsic persistent identifier [27:53.520 --> 27:57.440] that we can generate from the source code itself. [27:57.440 --> 28:03.240] Another option is to use maybe Zenodo or maybe any other thing to generate a DOI. [28:03.240 --> 28:08.760] Then for that, we are going to ask the registry, the DOI server, to generate [28:08.760 --> 28:10.440] a DOI for us. [28:10.440 --> 28:18.720] And we have many options, we don't know them in depth yet, and maybe Zenodo could be really interesting. [28:18.720 --> 28:26.000] I would be happy to learn from you, from your experience, because it's a lot of things to [28:26.000 --> 28:27.000] learn. [28:27.000 --> 28:28.000] Sorry. [28:28.000 --> 28:41.200] Okay, thanks for coming; we have to stop here.
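[Editor's sketch] The notion of an intrinsic identifier mentioned in the last answer, one that is computed from the content itself rather than requested from a registry, can be illustrated with the way Git hashes a file. The Python below is a minimal sketch under the assumption that a Software Heritage content identifier is the prefix `swh:1:cnt:` followed by the Git blob hash of the file; the function names are ours, and the authoritative definition is the SWHID specification, not this snippet.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash file content the way Git does for a blob object."""
    # Git prepends a header "blob <size in bytes>\0" before hashing with SHA-1.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def content_swhid(content: bytes) -> str:
    """Build a content-level identifier (assumed form: swh:1:cnt:<blob hash>)."""
    return "swh:1:cnt:" + git_blob_sha1(content)


source = b"print('hello')\n"
print(content_swhid(source))
# Anyone holding the same bytes computes the same identifier, with no registry involved.
print(content_swhid(source) == content_swhid(b"print('hello')\n"))
```

By contrast, a DOI is extrinsic: nothing in the software determines it, so it has to be minted by a service such as the one behind Zenodo, which is the trade-off the speakers describe.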