[00:00.000 --> 00:11.000] Hello everyone, I'm Martin Gerlach, a senior research scientist in the research [00:11.000 --> 00:14.120] team at the Wikimedia Foundation. [00:14.120 --> 00:19.520] First of all, I want to thank the organizers for the opportunity to present here today. [00:19.520 --> 00:25.840] I'm very excited to share some of our recent work around building open tools to support [00:25.840 --> 00:29.000] research around Wikimedia projects. [00:29.000 --> 00:35.400] Before going into the details, I want to provide some background on what Wikimedia is [00:35.400 --> 00:37.400] and its research team. [00:37.400 --> 00:42.920] I want to start with something that most of you are probably familiar with, Wikipedia, [00:42.920 --> 00:49.200] which is by now the largest encyclopedia in the history of humankind. [00:49.200 --> 00:54.800] Wikipedia, together with its sister projects like Wikimedia Commons or Wiktionary, is [00:54.800 --> 00:58.480] operated by the Wikimedia Foundation. [00:58.480 --> 01:07.280] The Wikimedia Foundation is a nonprofit organization and has a staff of around 600 employees. [01:07.280 --> 01:13.320] It provides support to the communities and the projects in different ways, but it's [01:13.320 --> 01:19.880] important to know that it does not create or modify the content, and it does not define [01:19.880 --> 01:23.960] or enforce policies on the projects. [01:23.960 --> 01:30.960] One of the teams at the Wikimedia Foundation is the research team. We are a small team [01:30.960 --> 01:37.480] of eight scientists, engineers, and community officers, and we work with collaborators from [01:37.480 --> 01:43.360] different universities to do research around Wikimedia projects. [01:43.360 --> 01:48.840] These activities can be grouped into roughly three main areas. [01:48.840 --> 01:54.920] The first one is to address knowledge gaps: what content is missing or underrepresented? [01:54.920 --> 01:59.120] One example of this is the gender gap. [01:59.120 --> 02:05.840] Second is to improve knowledge integrity, that is, making sure the content on the projects [02:05.840 --> 02:12.120] is accurate; think of vandalism, misinformation, or disinformation. [02:12.120 --> 02:19.720] The third aspect is growing the research community, that is, empowering others to do research around [02:19.720 --> 02:22.480] the projects. [02:22.480 --> 02:30.080] Today I want to focus on the activities in this last area. Specifically, I want to present [02:30.080 --> 02:37.880] three facets in which we have been contributing towards this goal, that is, around datasets, [02:37.880 --> 02:42.280] tools for data processing, and machine learning APIs. [02:42.280 --> 02:50.120] Finally, I want to conclude with how developers or interested researchers can contribute to [02:50.120 --> 02:52.240] these three areas. [02:52.240 --> 02:55.320] So let's go. [02:55.320 --> 03:01.200] The Wikimedia Foundation already provides many different datasets, most notably the Wikimedia [03:01.200 --> 03:09.200] dumps, which cover the content but also contain information about edits and page views of articles. [03:09.200 --> 03:16.040] These are public and openly available, and they are used by many researchers as well as developers [03:16.040 --> 03:19.200] to build dashboards or tools for editors.
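To make this concrete, a minimal sketch of querying the public Pageviews API for one article might look as follows; the endpoint layout and response fields here follow the publicly documented REST API as I understand it, so please verify them against the current documentation.

```python
import requests

# Minimal sketch: daily page views for one article from the public Wikimedia
# Pageviews REST API. Endpoint layout and response fields follow the public
# documentation as I understand it; verify before relying on this.
def daily_pageviews(article, start="20240101", end="20240131",
                    project="en.wikipedia"):
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/all-agents/{article}/daily/{start}/{end}"
    )
    # The API asks clients to identify themselves with a User-Agent header.
    resp = requests.get(url, headers={"User-Agent": "example-dashboard/0.1"})
    resp.raise_for_status()
    # One item per day, each with a view count for the article.
    return {item["timestamp"]: item["views"] for item in resp.json()["items"]}

views = daily_pageviews("Coffee")
print(f"{sum(views.values())} views of 'Coffee' in January 2024")
```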
[03:19.200 --> 03:27.320] However, working with this data can still prove very challenging for people [03:27.320 --> 03:34.960] who might not identify as Wikimedia researchers, or for someone lacking expertise about [03:34.960 --> 03:40.520] database schemas, which data is where, or how to filter it. [03:40.520 --> 03:49.120] Therefore, we try to release clean and pre-processed datasets to facilitate that. [03:49.120 --> 03:53.920] One such example is the Wikipedia image caption dataset. [03:53.920 --> 03:59.600] This is a clean and processed dataset of millions of examples of images from Wikimedia [03:59.600 --> 04:08.200] Commons with their captions, extracted from more than 100 language versions of Wikipedia. [04:08.200 --> 04:14.680] The background is that many articles on Wikipedia are still lacking visual content, which we [04:14.680 --> 04:18.520] know is crucial for learning. [04:18.520 --> 04:26.600] Adding captions to these images increases accessibility and enables better search. [04:26.600 --> 04:31.800] So with the release of this data, we hope to enable other researchers to build better [04:31.800 --> 04:38.240] machine learning models to assist editors in writing image captions. [04:38.240 --> 04:44.080] In this case, we did not just release the data, but provided it in a more structured [04:44.080 --> 04:49.680] form as part of a competition with a very specific task. [04:49.680 --> 05:00.800] The idea was also to attract new contributors through this structure: researchers [05:00.800 --> 05:07.760] could find examples of the types of tools that could be useful for the community, experienced [05:07.760 --> 05:13.720] researchers outside of Wikimedia could easily contribute their expertise, [05:13.720 --> 05:22.280] and for new researchers it is an easy way to become familiar with Wikimedia data. [05:22.280 --> 05:28.800] The outcome of this was a Kaggle competition with more than 100 participants and many [05:28.800 --> 05:32.960] open-source solutions for how to approach this problem. [05:32.960 --> 05:39.480] This was just one example of the datasets we release, and I want to highlight that there are [05:39.480 --> 05:45.640] other cleaned and processed datasets we are releasing, around quality scores of Wikipedia articles [05:45.640 --> 05:53.760] and readability of Wikipedia articles, and there are also upcoming releases using [05:53.760 --> 06:00.880] differential privacy around the geography of readers.
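To give a feel for the shape of such a cleaned dataset, here is a minimal sketch of loading image-caption pairs; the file name and column names are hypothetical placeholders for illustration, not the actual schema of the released data.

```python
import pandas as pd

# Minimal sketch of exploring an image-caption dataset. The file name and the
# columns ("language", "image_url", "caption") are hypothetical placeholders,
# not the actual schema of the released dataset.
df = pd.read_csv("image_captions.tsv", sep="\t")

# Keep one language version and drop rows without a caption.
english = df[(df["language"] == "en") & df["caption"].notna()]

# A few image/caption pairs, e.g. as candidate training examples for a
# captioning model that could assist editors.
print(english[["image_url", "caption"]].head())
```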
[06:00.880 --> 06:06.200] In the next part, I want to talk about how to work with all this data. [06:06.200 --> 06:11.720] We always aim to make as much of the data as possible publicly available. [06:11.720 --> 06:19.280] However, that doesn't necessarily mean it is accessible, because it might still require [06:19.280 --> 06:24.000] a lot of technical expertise to effectively work with this data. [06:24.000 --> 06:30.400] Therefore, we try to build tools to lower the technical barriers. [06:30.400 --> 06:37.920] And here I want to present one such example related to the HTML dumps dataset. [06:37.920 --> 06:38.920] What is this? [06:38.920 --> 06:46.040] This is a new dump dataset available since October 2021, and it's now published and [06:46.040 --> 06:48.760] updated at regular intervals. [06:48.760 --> 06:55.000] It contains the HTML version of all articles of Wikipedia. [06:55.000 --> 06:58.240] Why is this so exciting? [06:58.240 --> 07:04.480] When we are using the traditional dumps, the content of the articles is only [07:04.480 --> 07:08.480] available in the wikitext markup. [07:08.480 --> 07:12.480] This is what you see when you edit the source of an article. [07:12.480 --> 07:19.320] However, what you see as a reader when browsing is not the wikitext markup; the wikitext [07:19.320 --> 07:22.360] gets parsed into HTML. [07:22.360 --> 07:30.040] The problem is that the wikitext does not explicitly contain all the elements that are visible [07:30.040 --> 07:31.600] in the HTML. [07:31.600 --> 07:36.040] This comes mainly from the parsing of templates or infoboxes. [07:36.040 --> 07:41.200] This becomes an issue for researchers studying the content of articles because they will [07:41.200 --> 07:46.120] miss many of the elements when looking only at the wikitext. [07:46.120 --> 07:50.800] One example of this is when looking for hyperlinks in articles. [07:50.800 --> 07:58.600] One study by Mitrevski looked at counting the number of links in articles and found [07:58.600 --> 08:05.640] that the wikitext contains less than half of the links that are visible to the reader in the HTML [08:05.640 --> 08:07.280] version. [08:07.280 --> 08:13.440] So we can conclude that researchers should use the HTML dumps because they capture more [08:13.440 --> 08:16.440] accurately the content of the article. [08:16.720 --> 08:23.720] However, the challenge is how to parse the articles in the HTML version [08:23.720 --> 08:24.720] of the dumps. [08:24.720 --> 08:31.160] This is not just about knowing HTML; it also requires very specific knowledge about [08:31.160 --> 08:38.400] how the MediaWiki software translates different wiki elements and how they will appear in [08:38.400 --> 08:40.840] the HTML version. [08:40.840 --> 08:46.880] Packages exist for parsing wikitext, but not for HTML. [08:46.880 --> 08:54.800] Therefore, this is a very high barrier for practitioners to switch their existing pipelines [08:54.800 --> 08:57.400] to use this new dataset. [08:57.400 --> 09:04.840] Our solution was to build a Python library to make working with these dumps very easy. [09:04.840 --> 09:11.920] We called it mwparserfromhtml, and it parses the HTML and extracts elements of an article [09:11.920 --> 09:19.640] such as links, references, templates, or the plain text, without the user having to know [09:19.640 --> 09:25.480] anything about HTML and the way wiki elements appear in it. [09:25.480 --> 09:28.760] We recently released the first version of this. [09:28.760 --> 09:30.320] This is work in progress. [09:30.320 --> 09:32.040] There are tons of open issues. [09:32.040 --> 09:39.280] So if you're interested, contributions from anyone are very welcome to improve this [09:39.280 --> 09:40.600] in the future. [09:40.600 --> 09:44.720] Check out the repo on GitLab for more information.
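To give a feel for why this requires MediaWiki-specific knowledge, here is a small stand-in sketch that counts the reader-visible links in one article's HTML using a generic HTML parser; it is not the mwparserfromhtml API, and the link heuristic is an assumption about how the dumps render internal links, which is exactly the kind of detail the library is meant to encapsulate for you.

```python
# Illustrative stand-in only (not the mwparserfromhtml API): count the links a
# reader actually sees in the parsed HTML of one article. The "./Title" href
# heuristic is an assumption about how MediaWiki renders internal links in the
# HTML dumps; mwparserfromhtml exists so users don't have to hard-code such
# rendering details themselves.
from bs4 import BeautifulSoup

with open("article.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

internal_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a["href"].startswith("./")  # assumed convention for internal links
]
print(len(internal_links), "reader-visible internal links")
```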
[09:44.720 --> 09:52.640] As a third step, I want to present how we use these datasets in practice. [09:52.640 --> 10:01.200] I want to show one example in the context of knowledge integrity, that is, ensuring the quality [10:01.200 --> 10:02.880] of articles in Wikipedia. [10:02.880 --> 10:09.120] There are many editors who review the edits that are made to articles in Wikipedia [10:09.120 --> 10:14.360] and check whether these edits are okay or whether they're not okay and what should [10:14.360 --> 10:15.560] be reverted. [10:15.560 --> 10:19.400] The problem is there are a lot of edits happening. [10:19.400 --> 10:26.680] Just in English Wikipedia, there are around 100,000 edits per day to work through. [10:26.680 --> 10:33.280] And the aim is: can we build a tool to support editors in dealing with the large volume of [10:33.280 --> 10:34.280] edits? [10:34.280 --> 10:40.720] Can we help them identify the very bad edits more easily? [10:40.720 --> 10:46.320] This is what we do with a so-called revert risk model. [10:46.320 --> 10:47.320] What is this? [10:47.320 --> 10:53.320] We look at an edit by comparing the old version of an article with its new version. [10:53.320 --> 10:59.440] And we would like to make a prediction whether the change is good or whether it is a very [10:59.440 --> 11:03.680] bad edit and should be reverted. [11:03.680 --> 11:07.520] How we do this is we extract different features from this edit: [11:07.520 --> 11:12.520] which text was changed, were there links that were removed, were there images that [11:12.520 --> 11:15.880] were removed, and so on. [11:15.880 --> 11:25.480] We then built a model by looking into the history of all Wikipedia edits, extracting [11:25.480 --> 11:33.400] those edits which have been reverted by editors, and using them as ground truth of bad edits [11:33.400 --> 11:35.200] for our model. [11:35.200 --> 11:43.920] The resulting output is that for each of these edits we can calculate a so-called [11:43.920 --> 11:45.240] revert risk: [11:45.240 --> 11:53.000] a very bad edit will have a very high probability, a very high risk, of being [11:53.000 --> 11:54.000] reverted. [11:54.000 --> 11:56.720] And this is what our model will output. [11:56.720 --> 12:01.120] And our model performs fairly well. [12:01.120 --> 12:05.680] It has an accuracy between 70 and 80%. [12:05.680 --> 12:09.440] I want to mention that we consider this OK. [12:09.440 --> 12:11.720] It does not need to be perfect. [12:11.720 --> 12:18.480] The way our model is used is that these scores are surfaced to help [12:18.480 --> 12:27.200] editors identify which edits they should take a closer look at. [12:27.200 --> 12:34.080] Similar models for annotating the content of articles exist, [12:34.080 --> 12:37.080] and we have been developing these types of models. [12:37.080 --> 12:43.200] In addition to knowledge integrity, which is what I presented, we have been trying to build models [12:43.200 --> 12:51.040] for easily finding similar articles, for automatically identifying the topic of an article, for assessing [12:51.040 --> 12:59.960] its readability or geography, or for identifying related images, et cetera. [12:59.960 --> 13:06.400] I only want to briefly highlight that the development of these models is rooted in some [13:06.400 --> 13:10.080] core principles to which we are committed. [13:10.080 --> 13:15.080] This can create additional challenges in developing these models. Specifically, in this [13:15.080 --> 13:22.760] context I want to highlight the multilingual aspect: we always try to prefer [13:22.760 --> 13:31.080] language-agnostic approaches in order to support as many as possible of the 300 different language [13:31.080 --> 13:35.040] versions of Wikipedia.
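To make the revert-risk idea above more concrete, here is a minimal sketch of how such a classifier could be trained; the feature set, the file of historical edits, and the model choice are illustrative assumptions, not the actual production model.

```python
# Minimal sketch of a revert-risk classifier: simple features comparing the
# old and new revision of an edit, with past reverts as ground-truth labels.
# The feature names, the input file, and the model choice are illustrative
# assumptions, not the production model.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical table of historical edits: one row per edit, with old-vs-new
# features and whether the edit was later reverted by editors.
edits = pd.read_csv("edit_history.csv")
features = ["chars_added", "chars_removed", "links_removed", "images_removed"]
X, y = edits[features], edits["was_reverted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = GradientBoostingClassifier().fit(X_train, y_train)

# The model outputs a revert risk for each edit: the probability that it will
# be reverted, which tools can surface to editors who patrol recent changes.
revert_risk = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```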
[13:35.240 --> 13:47.080] I want to conclude with potential ways in which you can contribute in any of the three areas that I mentioned previously. [13:47.080 --> 13:57.040] Generally, one can contribute as a developer to MediaWiki or other aspects of the Wikimedia ecosystem. [13:57.040 --> 14:02.080] And there, the place to get started is the so-called Developer Portal, which is a centralized [14:02.080 --> 14:09.200] entry point for finding technical documentation and community resources. [14:09.200 --> 14:14.040] Not going into more detail here, I want to give a shout-out and refer to the talk by [14:14.040 --> 14:21.240] my colleague Slavina Stefanova from the Developer Advocacy team. [14:21.240 --> 14:27.000] But specifically in the area of research, I want to highlight a few entry points depending [14:27.000 --> 14:29.080] on your interest. [14:29.080 --> 14:35.840] In case you would like to build a specific tool, there is the Wikimedia Foundation's Toolforge [14:35.840 --> 14:43.920] infrastructure. That is a hosting environment that allows you to run bots or different APIs [14:43.920 --> 14:50.880] in case you would like to provide that tool to the public. [14:50.880 --> 14:58.080] If you want to work with us on improving tools or algorithms, you can check out the different [14:58.080 --> 15:03.280] packages that we have been releasing in the past months. [15:03.280 --> 15:06.120] These are all work in progress. [15:06.120 --> 15:15.200] There are many open issues, and we are happy about any contributions, whether improving the packages, fixing [15:15.200 --> 15:19.560] existing issues, or even finding new bugs. [15:19.560 --> 15:24.680] So please check out our repositories too. [15:24.680 --> 15:30.320] If you are interested in getting funding, there are different opportunities. [15:30.320 --> 15:36.720] There is an existing program to fund research around Wikimedia projects. [15:36.720 --> 15:43.000] This covers many different disciplines, humanities, social science, computer science, education, [15:43.000 --> 15:51.360] law, et cetera, and is around work that has potential for direct positive impact on the [15:51.360 --> 15:53.080] local communities. [15:53.080 --> 15:59.520] In addition, I want to mention that, coming in the future, there are plans for a similar [15:59.520 --> 16:05.040] program to improve Wikimedia's technology and tools. [16:05.040 --> 16:13.120] If you want to learn about the projects we are working on, I want to mention that we [16:13.120 --> 16:19.440] publish a research report, a summary of our ongoing research projects, every six months, [16:19.440 --> 16:25.680] and there you can find more details about some of the projects that I have mentioned. [16:25.680 --> 16:33.360] Finally, if you would like to engage with the research community, you can join us at [16:33.360 --> 16:34.640] Wiki Workshop. [16:34.640 --> 16:39.040] This is the primary meeting venue of the Wikimedia research community. [16:39.040 --> 16:44.600] This year will be the 10th edition of Wiki Workshop, and it is expected to be held in [16:44.600 --> 16:45.760] May. [16:45.760 --> 16:48.120] You can submit your work there, [16:48.120 --> 16:50.800] and I invite you to make submissions. [16:50.800 --> 17:00.080] We highly encourage submitting ongoing or preliminary work as extended abstracts. [17:00.080 --> 17:04.960] In this edition, there will also be a new track for Wikimedia developers. [17:04.960 --> 17:11.120] If you are a developer of a tool or a system or an algorithm that could be of interest [17:11.120 --> 17:16.960] to research on Wikimedia, please check it out and make a submission.
[17:16.960 --> 17:23.760] Even if you do not plan to make a submission, you are welcome to participate. [17:23.760 --> 17:30.440] As in the last three editions, Wiki Workshop will be fully virtual and attendance [17:30.440 --> 17:33.160] will be free. [17:33.160 --> 17:35.920] With this, I want to conclude. [17:35.920 --> 17:38.960] I want to thank you very much for your attention. [17:38.960 --> 17:43.480] I am looking forward to your questions in the Q&A. [17:43.480 --> 17:49.280] If you want to stay in touch, feel free to reach out to me personally by email or [17:49.280 --> 17:54.200] through any of the other channels that I am listing here, such as office hours, mailing lists, [17:54.200 --> 18:01.560] or IRC, et cetera, and with this, thank you very much. [18:08.960 --> 18:09.960] Thank you. [18:09.960 --> 18:09.960] Bye.