Hello, everybody. I'm Anne, and I'm a data engineer and software craftswoman. I currently work for the French Ministry of Higher Education and Research, which matters for this presentation, because I work on the French Open Science Monitor.

First of all, I have to say that in France we have no CRIS system, the kind of system that references all the publications and works of each university, or at the national level. Back in 2018, policymakers decided to create a national plan for open science, to promote open science in France. And the first question was: how open is French science? To answer that question, we developed a sovereign tool to measure open science. A further requirement was not to use any proprietary database, so everything had to be open and transparent.

We started with publications, which was the first part of the project. As the graph on the right shows, proprietary bibliographic databases are fairly complete: they carry a lot of metadata, but they miss part of the publications, because the vendors decide which publications enter their databases. The open bibliographic databases, on the other hand, are more extensive, but they miss some metadata, or sometimes the quality of the metadata is not perfect. So we decided to build our own tool on top of them.

First we collected the metadata available on the web, with Crossref specifically, and we aggregated other sources like OpenAPC, PubMed, and web crawling to complete the metadata. From PubMed, Crossref, and HAL we just collect and extract metadata; the element highlighted here is the tool that we made. On top of that metadata we had to build country detection: our affiliation matcher, which detects the country of an affiliation, just to draw the perimeter of the French publications.

Here is one of the examples. A first, naive approach would be to say: if I detect "France" in the affiliation, the publication is French. But then we hit a problem with the Hôtel-Dieu de France, which is a hospital in Beirut. So, a wrong result. We then improved our affiliation matcher using public institution databases such as ROR. That was our first challenge. And we kept consolidating the metadata by adding the open access status from Unpaywall: for each publication we collected, we used its DOI to attach its open status.
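To make the affiliation-matching step concrete, here is a minimal Python sketch of the idea. This is not the actual matcher (that one is open source on our GitHub); it assumes a tiny in-memory extract of ROR records, and the names `ROR_RECORDS` and `detect_country` are purely illustrative.

```python
# Toy sketch of affiliation -> country detection, NOT the real matcher.
# Assumes a tiny in-memory extract of ROR records; the real tool
# matches against the full registry.

ROR_RECORDS = [
    # (institution name, ISO country code) -- illustrative sample
    ("Hôtel-Dieu de France", "LB"),
    ("Université Paris-Saclay", "FR"),
]

COUNTRY_KEYWORDS = {"france": "FR", "lebanon": "LB"}

def detect_country(affiliation: str) -> str | None:
    text = affiliation.lower()
    # 1. Try known institution names first: an institution match must
    #    override a naive keyword hit, which is exactly what fixes the
    #    "Hôtel-Dieu de France" false positive.
    for name, country in ROR_RECORDS:
        if name.lower() in text:
            return country
    # 2. Fall back to country-name keywords.
    for keyword, country in COUNTRY_KEYWORDS.items():
        if keyword in text:
            return country
    return None

print(detect_country("Hôtel-Dieu de France, Beirut"))  # LB, not FR
print(detect_country("CNRS, Paris, France"))           # FR
```

The design point is simply that institution-level matches take precedence over country-keyword matches; that ordering is what turns the Beirut hospital from a false "France" hit into a correct one.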
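The Unpaywall enrichment step can be sketched the same way. At scale the pipeline works from bulk data, but the idea shows through the public Unpaywall REST API, which takes a DOI plus your email address and returns, among other fields, `is_oa` and `oa_status`. A minimal sketch, assuming the `requests` library and a hypothetical DOI:

```python
# Sketch of the open-access enrichment step via the public Unpaywall
# API. The real pipeline processes data in bulk; this shows the idea
# for a single DOI. The email parameter is required by the API.
import requests

def open_access_status(doi: str, email: str) -> dict:
    resp = requests.get(
        f"https://api.unpaywall.org/v2/{doi}",
        params={"email": email},
        timeout=30,
    )
    resp.raise_for_status()
    record = resp.json()
    return {
        "doi": doi,
        "is_oa": record["is_oa"],          # open or closed
        "oa_status": record["oa_status"],  # gold / green / hybrid / bronze / closed
    }

# Example call (the DOI is hypothetical):
# print(open_access_status("10.1234/example.doi", "you@example.org"))
```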
After that, we redid the classification: not only whether a publication is open, but whether it is green, bronze, or hybrid; there are different levels of openness, depending on the journal and on the APCs. So we developed a new open access type classification. We also added a classifier, a tool that assigns each publication to a discipline, so that we know which scientific field it belongs to.

With all this metadata we were able to build indicators. Our first problem was to measure open science in France, and here are the results according to this methodology; let me show you the website. Based on all the computed metadata, we know that 67% of French publications are open. And thanks to the categorization by discipline, we have other graphs, like the openness by discipline: mathematics is the most open scientific field in France. You can go to the website, where there are more results.

With all these results, we asked the universities in France whether they were interested in having their own results. The only condition was that they send us their own perimeter, that is, their own list of the publications of their university. We were then able to send them exactly the same graphs, adapted to the smaller perimeter. We now have more or less 200 local variations of this monitor: mostly university perimeters, but some labs too.

After that, we did a second round, to detect and measure the openness of data and code in French publications. So we had to invent another methodology. First, we needed to collect the data. Starting from the whole set of French publications, we tried to download all the PDFs freely available on the internet; for those there was almost no problem, as long as the PDFs were still available. For the closed ones, we found an agreement with our partners in the project, who gave us tokens to be able to download them.

Once again, we then needed to consolidate the metadata. We first chose GROBID, which you probably know: a tool that takes a PDF as input and gives you a structured TEI XML document as output, with the authors, paragraphs, keywords, affiliations, and so on.

After that, we used two tools developed for the project, called DataStet and Softcite. These tools detect mentions of data or software in the PDF: first the mention itself, and then, in a second layer, they qualify the mention, whether it is a mention of usage, of production, or of sharing. Both tools were developed by Patrice Lopez from ScienceMiner, the person behind GROBID too. And once again, we were able to build indicators about the share of publications in France that use data.
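For the GROBID step, here is a minimal sketch of how a PDF goes in and TEI XML comes out, assuming a GROBID server running locally on its default port (the PDF path is illustrative):

```python
# Minimal sketch: send a PDF to a locally running GROBID service and
# get structured TEI XML back. Assumes GROBID is up on its default
# port (8070); the PDF path is illustrative.
import requests

def pdf_to_tei(pdf_path: str, grobid_url: str = "http://localhost:8070") -> str:
    with open(pdf_path, "rb") as pdf:
        resp = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": pdf},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.text  # TEI XML: authors, affiliations, paragraphs, ...

# tei = pdf_to_tei("publication.pdf")
```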
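DataStet and Softcite are machine-learning tools, so the following is only a toy illustration of the two-layer idea: detect a data or software mention, then qualify it as usage, production, or sharing. The keyword rules below are invented for the example; the real tools do not work this way.

```python
# Toy illustration of the two-layer mention analysis: detect a
# data/software mention in a sentence, then qualify it. The real
# DataStet/Softcite tools use ML models, not these keyword rules.
import re

CUES = {
    "sharing": re.compile(r"\b(available at|deposited|shared|zenodo|github)\b", re.I),
    "production": re.compile(r"\b(we collected|we generated|we produced|new dataset)\b", re.I),
    "usage": re.compile(r"\b(we used|based on|obtained from|dataset|software)\b", re.I),
}

def qualify_mention(sentence: str) -> str | None:
    """Return 'sharing', 'production', 'usage', or None (no mention)."""
    # Check the most specific category first: sharing implies production,
    # which implies usage, mirroring the funnel reading of the indicators.
    for category in ("sharing", "production", "usage"):
        if CUES[category].search(sentence):
            return category
    return None

print(qualify_mention("The data we collected are deposited on Zenodo."))  # sharing
print(qualify_mention("We used the dataset from a previous survey."))     # usage
```

Read this way, the funnel indicators that come next simply divide the resulting counts: publications with any data mention, then among them those with a production mention, then among them those with a sharing mention.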
The way to read this graphic: among the whole set of French publications in 2021, 60% mentioned the use of data. You really need all the words to make sense of these graphs. And this one shows which share of the publications say they produced data, among those that mention using data. It's a funnel approach: first the publications that use data, then among them the ones that produced data, then among them the ones that shared data. So, once again reading this graphic, in France, 22% of the publications that mention data also mention sharing their own data.

As a project about open science, we had to be fully open. All the code is open, there is a link to our GitHub here; the whole data set is open; we published our methodology; and even this talk is open.

This project started in 2018, and since then the open world has really moved. OpenAlex released data about the world's research under CC0. If you don't know it, you should have a look: the whole set of publications across the world, plus institutions, funders, and authors, everything in CC0. It's quite early, and not everything is perfect, but the good point is that it exists now, and it's improving from day to day. Building on that, the people from COKI in Australia developed the Open Access Dashboard, a website with an open access rate for each country in the world. It's not perfect for France, because detecting affiliations is always difficult, but it's a good approximation. And finally, last year, CWTS released the Leiden Ranking, but in an Open Edition: everything there is open, both the data they use, since they build on OpenAlex, and their own methodology. So yes, that's it. Thanks for listening.

Questions?

Thank you very much for this; I use it a lot to monitor open science outputs for Wikimedia purposes. I have a few questions, and maybe you have already answered some of them. The first is about timing: how many years after the publication year do you consider the data reliable? It sometimes takes years for authors to deposit. The second question was about OpenAlex, but you already answered it. The third question: what's the best way for people to build on this data? For example, do you send new records to HAL? And the final question was about GROBID: do you contribute upstream?

Thank you, I'll try to remember all your questions. So, OK, GROBID: we don't contribute to GROBID directly, but the two other tools were developed, not from scratch, but from previous work by the same person, Patrice Lopez. We paid him for DataStet and Softcite, which were adapted and improved for this project. But everything is open source, so if you need them, you can use them. The second one was about HAL records: if a publication is missing from HAL, that's not in the perimeter of this project. But when a university or a laboratory asks us for a monitor of their own science, they send us their list of publications.
And in return, because we have the data, we send them the full list of their publications with the metadata that we computed. That metadata includes the HAL ID, when we matched one, so if needed they have a way to find which part of their publications is in HAL and which part is not. But you're right, it might be an interesting indicator to add.

And I probably forgot one of your questions. The delay, yes, you're right, the delay is always a question. There is a delay because we grab data from multiple services, and it takes time for the truth to propagate between those platforms. Plus, there is the delay of when we decide to collect the data, because the whole process is quite long. We collect the data four times a year, but we publish a new version of the data only once a year, and that will happen at the end of February this year. The data displayed here were collected in 2022, which is the reason why we only display publications up to 2021. So in 2023 we published the monitor based on data collected in 2022, and only publications up to 2021 are taken into consideration. So yes, it's two years.

And that's it. Sorry, I spoke too long. Yes, we've run out of time. Thank you.