Thank you. Good morning. How many of you have read a scientific paper in the last month? Okay, quite a number of you, which is probably why you are here. It turns out that we are churning out ever more scientific papers; I will show figures later on. And for this reason we conduct studies on the papers themselves, to see how they accumulate. Some of the types of studies you see here: systematic studies, that is, studies of previous research conducted with a specific objective in a way that can be reproduced, not just picking a few papers out of a pile; scientometric or bibliometric studies, where we measure things such as a scientist's output or the work done in a specific field; meta-analyses, where we use statistical techniques to combine existing studies, say on how cancer is related to smoking, in order to obtain more reliable results; and other secondary studies. As you can see here, such studies have been rising from the 1970s onward at an exponential rate. Look at the scale on the left: it is logarithmic, something we will see many times in this presentation. They have now reached tens of thousands of studies every year.

We have been conducting such studies in our group as well, for instance looking at how open-data papers are used by others, or how software engineering research (the thousands of papers published on software engineering every year) is actually used in practice, and lately also at how machine learning is used in software engineering. There have been so many papers in that area that we conducted what is called a tertiary study: we did not look at all those thousands of papers, but at the few dozen papers that summarize those studies.

Research that uses existing publication data and builds on it demands quantitative data, and two scientists are famous for establishing the field. The one you see on the left is Eugene Garfield, a linguist and businessman. He founded the Institute for Scientific Information, which produced the Science Citation Index, was later bought by Thomson Reuters, and is now a firm called Clarivate. On the right is another famous scientist, Derek de Solla Price, who also worked in this area and, together with Garfield, established the disciplines of scientometrics and bibliometrics.

We can measure and study scientific output using a variety of services such as the ones you see here. How many of you have used any of those in the past month, say? Wow, a fair number. However, there are problems associated with them. First of all, there is a lack of transparency, repeatability, and reproducibility: a query you run today on, say, Google Scholar will give completely different results in a year, and we have no idea how the results are produced or why they appear in a specific order. Latency can be high and bandwidth low: if you want to run a query over tens of thousands or millions of publications, good luck going back and forth over an API. And if this wasn't bad enough, there are also rate limits; I have had the privilege of being kicked out of various digital libraries for 25 years now. The query languages are proprietary, so each service has its own, and they can be restricted in what they allow: maybe you can add terms, but you cannot order them, or you cannot search in specific fields. The coverage may also be limited, with some services containing only a subset of what we want to search.
And finally, there is the issue of availability and cost. Some of us are lucky enough to have subscriptions to some of these services; others don't, and even if you have a subscription, getting access to the full dataset may be difficult.

Thankfully, two developments are changing the status quo. The first is the rise in the computing power we have in our hands. I don't know if you have seen this picture: a group of people delivering an Elliott computer to the Norwich City Council, and, a few decades later, somebody photographing a Raspberry Pi Zero in front of the same building. I have compared the two machines (interestingly, both are European endeavours), and there are three to six orders of magnitude of increase in power: what we have today is a thousand to a million times more capable than what was available in the late 1950s. We can obviously use this power; remember, scientometrics was established in those past decades, so we can now do a lot more. The second development is the open availability of datasets. We are here celebrating openness in software, data, and hardware, and a large number of datasets associated with research have become available, such as Crossref, ORCID, the United States patents, PubMed, and the Research Organization Registry. I will go over them.

So, alluding, with some lack of modesty I admit, to the famous Library of Alexandria and how it should be in the third millennium, I have developed a system called Alexandria 3K that allows us to perform publication metadata analytics on our desktop. What it does is provide relational access to about two terabytes of data without actually needing that amount of space on your computer. Who has two terabytes available on their laptop? Exactly. With Alexandria 3K you don't need it. It gives access to about four billion records organized in 74 tables. You install it as a single Python module; there is no graph database or cluster to install and maintain. I know that is not a problem for you, but it is a problem for people who are not in computing. It is also quite efficient: a query that runs over the whole dataset, but samples it, can finish in minutes; building slices of the data to study can take between five hours and a couple of days, but you can then run queries on them that finish in seconds. The space requirements start at about 160 gigabytes for the downloaded data, which you can process in compressed form without having to decompress it.

In the next half hour or so I will show the data model you get access to and the types of data you can use, explain how it can be used in practice, go deep into how it is implemented, perhaps giving you ideas on how to do similar things, and finish with some issues, limitations, and ways to move forward.

What you see here is the schema of all 74 tables, colored by the dataset they come from. At the top you see the United States patents; the various yellow ones are Crossref, the main dataset, which contains details about publications and their authors. There is a similar set from PubMed for the health sciences, plus information about researchers, open-access journals, research organizations, and some other things I will explain. As I said, the main dataset is Crossref; you see it here. It mainly contains works, and these in turn refer to their authors, updates, subjects, funders, and licenses, as well as the affiliations of the authors and the references in each work.
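To give a feel for this relational model, here is a small sketch of the kind of query one can run on a populated database. The table and column names (works, work_authors, works.id, work_authors.work_id, published_year) follow the schema just described but are my assumptions, not values quoted in the talk, so check them against the tool's schema documentation.

# Sketch of a query over the relational schema described above; the table
# and column names are assumptions about the populated database's schema.
import sqlite3

db = sqlite3.connect("crossref_slice.db")   # a previously populated slice

# For each publication year, count the works and the author records listed on them.
query = """
    SELECT works.published_year AS year,
           Count(DISTINCT works.id) AS works,
           Count(work_authors.id)   AS author_records
      FROM works
      LEFT JOIN work_authors ON work_authors.work_id = works.id
     GROUP BY year
     ORDER BY year
"""
for year, works, author_records in db.execute(query):
    print(year, works, author_records)
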
Here are the numbers. You get about 135 million publications; not all of them carry a subject or references, and you see those counts diminishing, but they have been going up over the years. There are about 360 million author records and about 1.7 billion references: each work carries references at the end, and that is how many you get. Many of the works are also associated with subjects, telling us which area they belong to. Most of the works are journal articles; then come book chapters and conference proceedings, with other elements, such as books, posted content, and dissertations, far, far smaller in number.

If we look at the publications appearing in the dataset each year (all of these charts, by the way, have been derived with Alexandria 3K), you see that they have been increasing at a very high rate over the years. If you think it is exponential, you are indeed right: plotted on a logarithmic scale the rise is linear, which means the numbers grow exponentially. You can even see two dips in the rise. Any idea why these are there? What happened? Wars. Wars, exactly. The world wars, apart from the extreme human tragedy, also affected the world's scientific output. Regarding availability, the various lines show which works have an abstract, a subject, a funder, a researcher identifier, or awards associated with them. These numbers have been rising in most areas; the subject is a special case.

Another dataset covered by Alexandria 3K is ORCID, the Open Researcher and Contributor ID. How many of you have such an identifier? Good. If you don't have one, go and get one: if you publish, it helps all of us scientists associate you with what you have been publishing. The records hold some basic details about each person and then further details on distinctions, education, invited positions, memberships, peer reviews conducted, and other resources associated with researchers, such as who has access to a specific large telescope. Again, the completeness of this data is not uniform: most people have works associated with them, but fewer have associated employment, education, or personal data.

Similar to the Crossref works is the dataset of publications made available through PubMed, a United States government effort, which covers publications in the health sciences, that is, health and biomedicine. It is similar to Crossref, but it also contains more specialized fields, such as pathogens, a very complete taxonomy of where something belongs, or the chemical substances mentioned in a specific paper, which allows you to do more focused and specific research. Also available are the United States patents for the past 20 years, containing about 5 million records, and a registry of about 600,000 research organizations: if the organization you are associated with conducts research, it should appear there. It is a taxonomy, so it contains the parent organization of your organization, in many cases up to the very top, maybe the government. What else is there? Some smaller datasets: journal names, so that you can directly associate them with their ISSNs; funder names; and the Directory of Open Access Journals metadata, about 19,000 records. All of these are tied together through identifiers, such as digital object identifiers (DOIs), researcher identifiers, ISSNs, URLs, and many others, in the way you see here: these are just some representative tables from the diverse datasets, linked together. How are we using Alexandria 3K in practice?
You can use it as a command-line tool, with the pattern that is typical for large tools nowadays: you run the a3k command followed by sub-commands such as populate, process, or query. Here is an example: I am running the a3k command, asking it to populate a database called covid.db from the Crossref dataset found in this directory, selecting only the rows whose title or abstract matches COVID. I will show you later on how this can be useful. You can also use it through Python. Here is an example: you import it, you create an instance for the specific data source you are interested in, and then you call a method that performs a similar function.

Typically, the way to use A3K is to first download the Crossref data, about 160 gigabytes, which can be fetched through a torrent in about three hours; by the way, this gives you a plausible-deniability argument for why you are using torrents. You can then run various exploratory data analytics queries directly on a sample; this can finish in about two minutes for 1% of the records, with no need to uncompress the data to the terabytes it would otherwise require. Or you can populate an SQLite database; this can take from four to twenty hours, and the database can be some four to two hundred gigabytes in size, depending on how selective you are about what you store in it. On that database you can then define, test, and refine analysis queries, which run in minutes, or in hours if they are very complicated.

In short, you can run ad hoc SQL queries directly on the compressed data, or you can populate SQLite databases, selecting elements either horizontally (only the records that match a specific condition, say works published in the last two years) or vertically (only specific columns; for instance, I may not be interested in the abstract, because it takes up a lot of space and I am not going to use it). Once you index the SQLite database, many queries finish in seconds, even on the complete dataset.

Here is an example of a query run directly on the dataset, without creating an intermediate database, measuring how many publications appear each year, the chart I showed you earlier. Here is another example of a query that performs sampling: it calls a Python random function and keeps a record when the value is less than 0.01, so I get a 1% sample of the dataset, to find out how many works contain abstracts and how many don't. This is the answer, and it is a chance measurement on the complete but sampled dataset, finished in about two minutes. Another example here populates a database in order to extract the metrics I showed you previously: how many works have authors, subjects, or abstracts associated with them. Similarly, another population covers the ORCID dataset, with a corresponding query showing how many elements it contains.
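To make the Python interface concrete, here is a minimal sketch following the pattern just described: import, instantiate a data source, query or populate. The module path, class name, method signatures, sampling callback, and column and condition syntax are my assumptions about the API rather than details given in the talk, so check them against the tool's documentation before relying on them.

# Illustrative sketch only: the names below are assumed, not quoted from the talk.
import random

from alexandria3k.data_sources import crossref

# Open the downloaded, still-compressed Crossref containers, sampling
# roughly 1% of them (the sampling callback shown here is an assumption).
source = crossref.Crossref(
    "Crossref-April-2023",
    lambda _data: random.random() < 0.01,
)

# Exploratory query run directly on the compressed sample.
for (works_with_abstract,) in source.query(
    "SELECT Count(*) FROM works WHERE abstract IS NOT NULL"
):
    print("Sampled works with an abstract:", works_with_abstract)

# Populate a horizontal and vertical slice into SQLite: only COVID-related
# works, and only the columns we intend to use later.
source.populate(
    "covid.db",
    columns=["works.doi", "works.title", "works.published_year"],
    condition="works.title LIKE '%COVID%' OR works.abstract LIKE '%COVID%'",
)

The a3k populate invocation shown on the slide corresponds to the last call, expressed through the command-line interface instead of Python.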
Let's see some more advanced examples. Here are two papers, both written by Nobel-prize-winning authors. On the left is the paper by Kohn and Sham, Nobel-prize-winning work that uses established theorems to develop a method for calculating the structure of electrons; on the right is another famous paper you are probably aware of, by Watson and Crick, also Nobel winners, introducing a model of the structure of DNA. However, the way these papers relate to the rest of science turns out to be very different.

If we look at the papers that cite them (the red ones here; the two blue ones are this one and this one), we see that, on the left, papers citing the electron-structure paper also cite previous work, that is, works cited by it, whereas papers that cite the DNA paper do not cite the work published before it. So people have said that the paper on the left consolidates existing research and advances it, whereas the paper on the right, the DNA paper, is a disruptive paper that changes everything, and therefore people no longer cite earlier works; they cite only this paper and the ones that follow it. There is a measure you can calculate here. The paper that first established it, published in Nature, gave these numbers, and my own calculation with Alexandria 3K gives similar numbers, with a highly significant statistical correlation showing that the two methods are equivalent. The one at the top, the published one, is opaque: you cannot reproduce it, because the data is not openly available. The one at the bottom can be run on your laptop.

Here are some other measures of the evolution of scientific publishing after the Second World War; before that, things were completely different, which is why I am not looking at them. We see many interesting changes: the number of authors per work rose from 1.5 to 4; works per author fell from 1.99 to 1.59; references per work rose from 13 to 46; pages doubled to 12; and the consolidation-disruption index fell, so if you feel that science is becoming less disruptive these days, that is true. The number of citations, works published, journals, works cited at least once, and impact factors have all risen exponentially. All of these were calculated with queries you can find in the software, which you can download from GitHub.

Here is another interesting chart, showing how the volume of publications has changed in specific fields. You can see the big rise in computer science, the purple area that has increased substantially over the years, and also the relative fall of publications in the arts and humanities, the other purple area, which has diminished in this way. The absolute number has risen (don't be fooled, because publications overall have risen exponentially), but the field occupies a lot less of the total than it did in 1945. Here are examples from two other datasets. One shows the evolution of US patent applicants by year and country: you see a fall in the number of patents associated with the United States and Japan, and a rise, from a low base, of patents associated with China, the blue line at the bottom. The other replicates a paper looking at the statistical software used in health-science research: in green you see the original paper's results, and in orange the results replicated with Alexandria 3K. This was completed by a TU Delft student in a couple of weeks.

Let me show you a more substantial example of what can be done with Alexandria 3K, a proof-of-concept study on a specific topic, namely COVID-19. What you are seeing, which I found while checking the dataset, is a publication in the American Journal of Ethics, "COVID care in color", whose author is a nurse who had been working in a Bronx emergency room team for six years and painting for 25 years. She captioned it: the fear of contagion is dreadful, especially without proper protective equipment; at the beginning of the pandemic, when we were getting conflicting information on how and when to use our PPE, we relied on each other. I created the dataset for the study in the way you see here.
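Once populated, such a slice is an ordinary SQLite file, so the follow-up steps described next, indexing it and counting what it contains, can be sketched roughly as follows. The table and column names (works, work_authors, work_id) are my assumptions about the populated schema, not values quoted in the talk.

# Sketch of post-processing the populated slice; table and column names are assumed.
import sqlite3

db = sqlite3.connect("covid.db")

# Indexing the join columns is what later makes the analysis queries fast.
db.execute(
    "CREATE INDEX IF NOT EXISTS work_authors_work_id_idx "
    "ON work_authors(work_id)"
)

(n_works,) = db.execute("SELECT Count(*) FROM works").fetchone()
(n_author_records,) = db.execute("SELECT Count(*) FROM work_authors").fetchone()
print(f"{n_works} works, {n_author_records} author records")
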
I populated a vertical and horizontal slice into an SQLite database, selecting only some columns and only the works whose title or abstract matched COVID. This ran in about nine hours and produced about three gigabytes of data, and once I indexed it, it grew to 3.6 gigabytes. We can see some numbers: half a million published articles, 2.6 million author records (imagine the amount of effort that went into this) and eight million references on top of that.

What are the topics associated with this research? Everything you can imagine. You can see education and engineering featuring very high, after general medicine of course, but even strategy, management, law, cultural studies, pollution, anthropology, AI, waste management, ocean engineering: it is all over the place. Who funded that research? This is the query I ran: the National Natural Science Foundation of China has the highest number of publications, then come the National Institutes of Health and the National Science Foundation of the United States, followed by various trusts and councils. However, if we look at the affiliations associated with COVID publications (these are the queries that established this), we see that first comes the government of the US, then the University of California system, the University of Toronto, and so on. Here I used the Research Organization Registry and rolled organizations up to their parent organization, so campuses such as Berkeley and UCSD, for instance, rose up into the University of California system.

Another question is how quickly scientists could build on each other's work: how quickly could those publishing COVID research cite each other? Was it taking too long, because things were advancing too fast? What I have plotted here is the number of publications citing other COVID publications, month by month. You see that fairly quickly, even by April 2020, there were thousands of citations to other COVID publications, rising to hundreds of thousands late in 2020 and early in 2021. The number you see at the very beginning is an artifact: journal issues carrying a January date although they appeared a lot later. This shows us we shouldn't blindly trust our data.

I also looked at collaboration, and I found some amazing things: there were articles authored by thousands of people. You see articles with 2,000 or 1,700 authors. I thought this could not be true, so I looked into it, and indeed most articles had something like five authors, but there were articles with 7,000 authors. I looked, could not believe my eyes, I had not seen such a thing before, and it was indeed published here: the authors appear in a footnote spanning pages 20 to 28. And this is not an isolated case; through other queries I found many articles with thousands of authors, showing an amazing way of collaborating. People probably contributed data from hundreds or thousands of hospitals, and all of this was collected into papers.

Let me switch subject. How many of you have heard of the impact factor? Yeah. What is the impact factor, for those who haven't? It is a measure that tries to capture how important a journal is: the citations received in a given year by the articles the venue published in the previous two years, divided by the number of articles it published in those two years. It has been severely criticized, especially when it is used for measuring the scientific worthiness or productivity of authors. But another problem is that it is opaque: Clarivate publishes it, but we have no idea how exactly it derives those numbers.
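In symbols, using the standard definition of the two-year impact factor (this formula is a textbook statement, not one shown in the talk), a journal's impact factor for year $y$ is

\[
\mathrm{IF}_y = \frac{C_y}{N_{y-1} + N_{y-2}}
\]

where $N_{y-1}$ and $N_{y-2}$ are the numbers of citable items the journal published in the two preceding years, and $C_y$ is the number of citations those items received during year $y$. The opacity lies mainly in deciding what counts as a citable item and which citations are counted.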
Clarivate publishes the method, but we cannot replicate the numbers. With Alexandria 3K we can, using queries and database populations such as the ones you see here, and we get results that rank journals by impact factor with a very highly significant correlation with the rankings published by Clarivate. Through this we can also run queries such as: find the most cited article of the last two years. It is this article, establishing a method with which you can calculate the properties of matter based just on facts associated with its atomic structure. I found that very strange, a very specialized subject, so I ran a query to find which publications cite this article, and I got results with titles such as these, at a rate of about eight records per second over the entire dataset. If you cannot understand the titles, this is how people outside computing view us when we talk shop. I also looked at the most cited articles of the last two years that were also published in that period, and predictably this was "Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China", the first article reporting on COVID-19.

Another metric of author productivity is the so-called h5-index: how many papers you have published in the past five years that have each been cited at least that number of times. I found an author with an index of 76, which amounts to about 15 papers a year, many authors with an index of more than 60, and a hundred with more than 38. This was too large to believe, because it implies not only hyperproductive authors, who, as has been established previously, publish one paper every five days or so, but also papers that get cited a lot. So how could this happen? Using A3K again, I looked at the papers those authors cited and the papers citing them, built the corresponding graphs, and found that those papers cite each other a lot more than the papers of other authors with high, but not that high, productivity. This seems to suggest that some sort of clique is at work: authors citing each other and thereby artificially elevating their h5-index. Whatever number you use, it can be gamed, as you can understand. Finally, here is another interesting chart, where I look at which topics cite which other topics. These are the 50 strongest relationships; you can see, for instance, that cancer research heavily cites organic and inorganic chemistry.

I will finish with some details regarding the implementation of Alexandria 3K; I hope you will pick up a few tricks you can use. A3K is based on a plugin architecture. At the top you have the command-line interface, which uses a Python API that you can also use directly, and below it there are two sets of plugins. The first are data sources, the ones I showed you; you can create new ones by writing a file that establishes a new data source, say arXiv publications. The second are processes, things that manipulate the data in new ways, for example matching authors with affiliations, or disambiguating authors with the same name based on what they publish. And at the bottom there is a plugin API that these plugins use in order to function. The main idea behind the Crossref implementation, and the other databases, is SQLite and virtual tables. Virtual tables are a feature of SQLite that is magical: you can create tables that do not exist as real tables in a database, yet appear as tables you can access with SELECT and other SQL statements.
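To give a flavor of the mechanism, here is a heavily simplified sketch of a virtual table written in Python with the apsw package, which exposes SQLite's virtual-table interface. The data here is a made-up in-memory list, and the method names follow apsw's older camel-case spelling, which may differ between apsw versions; treat it as an illustration of the concept, not as Alexandria 3K's actual code.

# Made-up data standing in for one decompressed data container.
import apsw

WORKS = [("10.1000/1", "A title", 2021), ("10.1000/2", "Another title", 2022)]

class WorksModule:
    # Called for CREATE VIRTUAL TABLE; returns the table's SQL schema and a table object.
    def Create(self, connection, modulename, dbname, tablename, *args):
        return "CREATE TABLE works(doi, title, published_year)", WorksTable()
    Connect = Create

class WorksTable:
    def BestIndex(self, constraints, orderbys):
        return None          # No index support in this sketch: full scans only.
    def Open(self):
        return WorksCursor()
    def Disconnect(self):
        pass
    Destroy = Disconnect

class WorksCursor:
    def Filter(self, indexnum, indexname, constraintargs):
        self.pos = 0         # Start a new full scan.
    def Eof(self):
        return self.pos >= len(WORKS)
    def Rowid(self):
        return self.pos
    def Column(self, col):
        return WORKS[self.pos][col]
    def Next(self):
        self.pos += 1
    def Close(self):
        pass

connection = apsw.Connection(":memory:")
connection.createmodule("works_source", WorksModule())   # newer apsw: create_module
cursor = connection.cursor()
cursor.execute("CREATE VIRTUAL TABLE works USING works_source()")
for row in cursor.execute("SELECT doi, title FROM works WHERE published_year = 2022"):
    print(row)

Alexandria 3K's real implementation adds partitioning and index handling on top of this mechanism, as described next.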
The implementation partitions the data, because the large datasets come as many, many files, about 26,000 of them in the case of Crossref. I use the number of each file as a partition index, so that the virtual-table implementation on top of SQLite does not jump from one partition to the next, because decompressing partitions is expensive. Another trick: once you have written a query, or a selection of the data you are interested in, you need to understand which tables, and which fields of each table, it requires. I did not want to parse SQL, especially with all the implementation-specific details of SQLite, so instead I ask SQLite to trace its analysis of the query, and from the trace I can understand which columns and tables it touches. I also create vertical slices of the partitions for the queries, so that they run faster, and I use queries that only look within a single partition in order to populate records. The Crossref data comes in JSON format, and a self-contained file, one of the 26,000, contains all the references of each work in it, so I do not need to go to other partitions.

Here is an example. When you run the query at the top, what happens underneath is that a virtual table is created and the query runs on this very simple table. When, however, you also do joins, the tables are created, but, as you see here, each is restricted to one container id in turn, one, two, three, up to 26,000, so that each partition is decompressed in turn rather than all of them being processed together, and the query then runs on tables that are actually materialized. For population, similar things happen. If you populate with a condition, say I want only the subjects associated with library and information sciences, and you want only some columns, then tracing first establishes the table names and fields you are interested in; then, partition by partition, tables are populated, and queries run to fill in the data you want from the populated tables. I found this to be faster than using virtual tables, because of the various joins involved. A topological sort is used to establish the order in which the joins must happen, based on the names of the tables being joined. Similar ideas are used for populating ORCID, the United States patents, and so on; because these come as XML records, we can skip the parsing of the XML records we are not interested in and thereby gain an additional efficiency advantage.

Let me finish with some issues and limitations of A3K. The coverage of authors is fairly low: about 17 million of the 360 million author records have an associated ORCID. Keep in mind that the data goes back to the Second World War, and ORCID was not a thing then; but even now, not all authors have an ORCID. This is improving, because many institutions ask them to get one, and we are also investigating ways to disambiguate authors who lack an ORCID through machine-learning methods. Affiliations are another issue: either they are missing, or they appear in diverse forms, so the same university, say the one here, can appear as ULB or under its full name. Abstracts are also not always available: only a small share of works have one, but many publications carry a text-mining link that you can use to obtain the full text for data-mining purposes. The subjects of publications are based on an identifier established by Scopus, which is associated with complete journals.
So if something appears in a zoology journal, we assume it has to do with zoology, even if it actually has to do with, say, biology or informatics. Again, we are working on using machine-learning methods to obtain better results here, looking specifically at the impact factor calculation, which many people are interested in. Establishing what counts as a citable item is tricky: Clarivate uses a proprietary method where, for instance, an editorial or a letter is not considered citable. It is difficult to do this automatically; I assume they have people working on it.

As for the way forward, and how we can work as a community to improve A3K: these are early days, so I would be very happy to help the community conduct studies. If you have an idea for a way to use A3K, please contact me. I would also like to integrate more open-access data. Some ideas: arXiv; DBLP, which is a database of computer science papers; the taxonomy of medical research called MeSH, which is extremely interesting; and the wider one used by the Public Library of Science, which I also think would be worth integrating and associating works with. Related to that is improving the various processes, the ways in which we can process the data: disambiguating authors (there are many John Smiths, or Zhous in China), finding out which of them wrote a specific article, and classifying the topics of publications. And finally, and this is relevant to us here at FOSDEM, evangelizing more and better data availability, more use of ORCID, and improvements to the published metadata. With this, I thank you for your interest and attendance. I think we should have time for one or two questions. Do we have questions?

Thank you for your talk. About ten years ago I participated in a Kaggle contest about disambiguating authors, that is, linking papers written by the same author. In the work you have done, have you observed this kind of ambiguity in its different forms? Sorry, can you repeat? I am not hearing very well. I was saying that about ten years ago I participated in a Kaggle contest whose topic was finding the different papers written by the same author, because the names had different variations and formats and so on. In your work, do you also observe that the names of authors are written in different ways, and does that make it harder for you to link papers together?

Right. The question is about a contest run ten years ago, and whether there are authors whose names appear in different forms. Absolutely. First names often get abbreviated to the initial; middle names appear and disappear at random. So it is indeed a problem; ORCID helps, but efforts such as the one you took part in, developing ways to uniquely link an author with all their works, are also helpful, and they can be integrated as processes, either with a pull request on A3K or by using the API and doing it on your own.

Okay, so it's a two-parter. First, how often do you update the dataset? And is there a way to download the delta, to just get the new stuff? Two things. How often do I update the dataset: the answer is never, in contrast to, say, OpenAlex. A3K is a tool for working with existing data, so whenever you want, you go and fetch the data. It does not come with the data; you use the data sources from their primary source. I do not pretend to curate the data; A3K allows you to use existing open data sources.
Can you work with incremental updates? If the data source provides incremental parts, you can populate the database with those increments as well, provided you maintain a database that you populate incrementally. Thanks.

Thanks. So, I was at a university more than ten years ago, and at that time our articles were published mostly as unstructured text. Is that still a thing? Are you aware of any efforts to make articles structured? Because unstructured text is difficult to analyze.

If I understand the question correctly, it is whether there are efforts to structure articles in a way that lets us analyze them better. There are tools such as GROBID, which we heard about yesterday in another dev room on open science, that do this and create XML associated with an article's text. Of course this cannot always be perfect. Ideally we would want the complete pipeline, from authors to publishers to publication, to carry this structured account along, rather than reverse-engineering the structure after something has been published.

Well, looking at the time, we should stop the Q&A here in this room; maybe he has some more time afterwards. So, some applause, please, for anyone who wants to say thank you. Thank you very much.