[00:00.000 --> 00:11.000] Hello everyone, I'm Martin Gerlach, a senior research scientist in the research [00:11.000 --> 00:14.120] team at the Wikimedia Foundation. [00:14.120 --> 00:19.520] First of all, I want to thank the organizers for the opportunity to present here today. [00:19.520 --> 00:25.840] I'm very excited to share some of our recent work around building open tools to support [00:25.840 --> 00:29.000] research around Wikimedia projects. [00:29.000 --> 00:35.400] Before going into the details, I want to provide some background on what Wikimedia is [00:35.400 --> 00:37.400] and its research team. [00:37.400 --> 00:42.920] I want to start with something that most of you are probably familiar with, Wikipedia, [00:42.920 --> 00:49.200] which is by now the largest encyclopedia in the history of humankind. [00:49.200 --> 00:54.800] Wikipedia, together with its sister projects like Wikimedia Commons or Wiktionary, is [00:54.800 --> 00:58.480] operated by the Wikimedia Foundation. [00:58.480 --> 01:07.280] The Wikimedia Foundation is a nonprofit organization and has a staff of around 600 employees. [01:07.280 --> 01:13.320] It provides support to the communities and the projects in different ways, but it's [01:13.320 --> 01:19.880] important to know that it does not create or modify the content, and it does not define [01:19.880 --> 01:23.960] or enforce policies on the projects. [01:23.960 --> 01:30.960] One of the teams at the Wikimedia Foundation is the research team. We are a small team [01:30.960 --> 01:37.480] of eight scientists, engineers, and community officers, and we work with collaborators from [01:37.480 --> 01:43.360] different universities to do research around Wikimedia projects. [01:43.360 --> 01:48.840] These activities can be grouped into roughly three main areas. [01:48.840 --> 01:54.920] The first one is to address knowledge gaps: what content is missing or underrepresented? [01:54.920 --> 01:59.120] One example of this is the gender gap. [01:59.120 --> 02:05.840] Second is to improve knowledge integrity, that is, making sure the content on the projects [02:05.840 --> 02:12.120] is accurate; think of vandalism, misinformation, or disinformation. [02:12.120 --> 02:19.720] The third aspect is growing the research community, that is, empowering others to do research around [02:19.720 --> 02:22.480] the projects. [02:22.480 --> 02:30.080] Today I want to focus on the activities in this last area. Specifically, I want to present [02:30.080 --> 02:37.880] three facets in which we have been contributing towards this goal, that is, around datasets, [02:37.880 --> 02:42.280] tools for data processing, and machine learning APIs. [02:42.280 --> 02:50.120] Finally, I want to conclude with how developers or interested researchers can contribute to [02:50.120 --> 02:52.240] these three areas. [02:52.240 --> 02:55.320] So let's go. [02:55.320 --> 03:01.200] The Wikimedia Foundation already provides many different datasets, most notably the Wikimedia [03:01.200 --> 03:09.200] dumps, which cover the content but also contain information about edits and page views of articles. [03:09.200 --> 03:16.040] These are public and openly available, and they are used by many researchers as well as developers [03:16.040 --> 03:19.200] to build dashboards or tools for editors.
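To make this concrete, a minimal sketch of querying the public Pageviews API for one article might look as follows; the endpoint layout and response fields here follow the publicly documented REST API as I understand it, so please verify them against the current documentation.

```python
import requests

# Minimal sketch: daily page views for one article from the public Wikimedia
# Pageviews REST API. Endpoint layout and response fields follow the public
# documentation as I understand it; verify before relying on this.
def daily_pageviews(article, start="20240101", end="20240131",
                    project="en.wikipedia"):
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/all-agents/{article}/daily/{start}/{end}"
    )
    # The API asks clients to identify themselves with a User-Agent header.
    resp = requests.get(url, headers={"User-Agent": "example-dashboard/0.1"})
    resp.raise_for_status()
    # One item per day, each with a view count for the article.
    return {item["timestamp"]: item["views"] for item in resp.json()["items"]}

views = daily_pageviews("Coffee")
print(f"{sum(views.values())} views of 'Coffee' in January 2024")
```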
[03:19.200 --> 03:27.320] However, working with this data can still prove very challenging for people [03:27.320 --> 03:34.960] who might not identify as Wikimedia researchers, or for someone lacking expertise about [03:34.960 --> 03:40.520] database schemas, which data is where, or how to filter it. [03:40.520 --> 03:49.120] Therefore, we try to release clean and pre-processed datasets to facilitate that. [03:49.120 --> 03:53.920] One such example is the Wikipedia image caption dataset. [03:53.920 --> 03:59.600] This is a clean and processed dataset of millions of examples of images from Wikimedia [03:59.600 --> 04:08.200] Commons with their captions, extracted from more than 100 language versions of Wikipedia. [04:08.200 --> 04:14.680] The background is that many articles on Wikipedia are still lacking visual content, which we [04:14.680 --> 04:18.520] know is crucial for learning. [04:18.520 --> 04:26.600] Adding captions to these images increases accessibility and enables better search. [04:26.600 --> 04:31.800] So with the release of this data, we hope to enable other researchers to build better [04:31.800 --> 04:38.240] machine learning models to assist editors in writing image captions. [04:38.240 --> 04:44.080] In this case, we did not just release the data, but provided it in a more structured [04:44.080 --> 04:49.680] form as part of a competition with a very specific task. [04:49.680 --> 05:00.800] The idea was also to attract new contributors through this structure: researchers [05:00.800 --> 05:07.760] could find examples of the types of tools that could be useful for the community, experienced [05:07.760 --> 05:13.720] researchers outside of Wikimedia could easily contribute their expertise, [05:13.720 --> 05:22.280] and for new researchers it is an easy way to become familiar with Wikimedia data. [05:22.280 --> 05:28.800] The outcome of this was a Kaggle competition with more than 100 participants and many [05:28.800 --> 05:32.960] open-source solutions for how to approach this problem. [05:32.960 --> 05:39.480] This was just one example of the datasets we release, and I want to highlight that there are [05:39.480 --> 05:45.640] other cleaned and processed datasets we are releasing, around quality scores of Wikipedia articles [05:45.640 --> 05:53.760] and readability of Wikipedia articles, and there are also upcoming releases using [05:53.760 --> 06:00.880] differential privacy around the geography of readers.
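To give a feel for the shape of such a cleaned dataset, here is a minimal sketch of loading image-caption pairs; the file name and column names are hypothetical placeholders for illustration, not the actual schema of the released data.

```python
import pandas as pd

# Minimal sketch of exploring an image-caption dataset. The file name and the
# columns ("language", "image_url", "caption") are hypothetical placeholders,
# not the actual schema of the released dataset.
df = pd.read_csv("image_captions.tsv", sep="\t")

# Keep one language version and drop rows without a caption.
english = df[(df["language"] == "en") & df["caption"].notna()]

# A few image/caption pairs, e.g. as candidate training examples for a
# captioning model that could assist editors.
print(english[["image_url", "caption"]].head())
```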
[06:00.880 --> 06:06.200] In the next part, I want to talk about how to work with all this data. [06:06.200 --> 06:11.720] We always aim to make as much of the data as possible publicly available. [06:11.720 --> 06:19.280] However, that doesn't necessarily mean it is accessible, because it might still require [06:19.280 --> 06:24.000] a lot of technical expertise to effectively work with this data. [06:24.000 --> 06:30.400] Therefore, we try to build tools to lower the technical barriers. [06:30.400 --> 06:37.920] And here I want to present one such example related to the HTML dumps dataset. [06:37.920 --> 06:38.920] What is this? [06:38.920 --> 06:46.040] This is a new dump dataset available since October 2021, and it's now published and [06:46.040 --> 06:48.760] updated at regular intervals. [06:48.760 --> 06:55.000] It contains the HTML version of all articles of Wikipedia. [06:55.000 --> 06:58.240] Why is this so exciting? [06:58.240 --> 07:04.480] When we are using the traditional dumps, the content of the articles is only [07:04.480 --> 07:08.480] available in the wikitext markup. [07:08.480 --> 07:12.480] This is what you see when you edit the source of an article. [07:12.480 --> 07:19.320] However, what you see as a reader when browsing is not the wikitext markup; the wikitext [07:19.320 --> 07:22.360] gets parsed into HTML. [07:22.360 --> 07:30.040] The problem is that the wikitext does not explicitly contain all the elements that are visible [07:30.040 --> 07:31.600] in the HTML. [07:31.600 --> 07:36.040] This comes mainly from the parsing of templates or infoboxes. [07:36.040 --> 07:41.200] This becomes an issue for researchers studying the content of articles because they will [07:41.200 --> 07:46.120] miss many of the elements when looking only at the wikitext. [07:46.120 --> 07:50.800] One example of this is when looking for hyperlinks in articles. [07:50.800 --> 07:58.600] One study by Mitrevski looked at counting the number of links in articles and found [07:58.600 --> 08:05.640] that the wikitext contains less than half of the links that are visible to the reader in the HTML [08:05.640 --> 08:07.280] version. [08:07.280 --> 08:13.440] So we can conclude that researchers should use the HTML dumps because they capture more [08:13.440 --> 08:16.440] accurately the content of the article. [08:16.720 --> 08:23.720] However, the challenge is how to parse the articles in the HTML version [08:23.720 --> 08:24.720] of the dumps. [08:24.720 --> 08:31.160] This is not just about knowing HTML; it also requires very specific knowledge about [08:31.160 --> 08:38.400] how the MediaWiki software translates different wiki elements and how they will appear in [08:38.400 --> 08:40.840] the HTML version. [08:40.840 --> 08:46.880] Packages exist for parsing wikitext, but not for HTML. [08:46.880 --> 08:54.800] Therefore, this is a very high barrier for practitioners to switch their existing pipelines [08:54.800 --> 08:57.400] to use this new dataset. [08:57.400 --> 09:04.840] Our solution was to build a Python library to make working with these dumps very easy. [09:04.840 --> 09:11.920] We called it mwparserfromhtml, and it parses the HTML and extracts elements of an article [09:11.920 --> 09:19.640] such as links, references, templates, or the plain text, without the user having to know [09:19.640 --> 09:25.480] anything about HTML and the way wiki elements appear in it. [09:25.480 --> 09:28.760] We recently released the first version of this. [09:28.760 --> 09:30.320] This is work in progress. [09:30.320 --> 09:32.040] There are tons of open issues. [09:32.040 --> 09:39.280] So if you're interested, contributions from anyone are very welcome to improve this [09:39.280 --> 09:40.600] in the future. [09:40.600 --> 09:44.720] Check out the repo on GitLab for more information.
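To give a feel for why this requires MediaWiki-specific knowledge, here is a small stand-in sketch that counts the reader-visible links in one article's HTML using a generic HTML parser; it is not the mwparserfromhtml API, and the link heuristic is an assumption about how the dumps render internal links, which is exactly the kind of detail the library is meant to encapsulate for you.

```python
# Illustrative stand-in only (not the mwparserfromhtml API): count the links a
# reader actually sees in the parsed HTML of one article. The "./Title" href
# heuristic is an assumption about how MediaWiki renders internal links in the
# HTML dumps; mwparserfromhtml exists so users don't have to hard-code such
# rendering details themselves.
from bs4 import BeautifulSoup

with open("article.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

internal_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a["href"].startswith("./")  # assumed convention for internal links
]
print(len(internal_links), "reader-visible internal links")
```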
[09:44.720 --> 09:52.640] As a third step, I want to present how we use these datasets in practice. [09:52.640 --> 10:01.200] I want to show one example in the context of knowledge integrity, that is, ensuring the quality [10:01.200 --> 10:02.880] of articles in Wikipedia. [10:02.880 --> 10:09.120] There are many editors who review the edits that are made to articles in Wikipedia [10:09.120 --> 10:14.360] and check whether these edits are okay or whether they're not okay and what should [10:14.360 --> 10:15.560] be reverted. [10:15.560 --> 10:19.400] The problem is there are a lot of edits happening. [10:19.400 --> 10:26.680] Just in English Wikipedia, there are around 100,000 edits per day to work through. [10:26.680 --> 10:33.280] And the aim is: can we build a tool to support editors in dealing with the large volume of [10:33.280 --> 10:34.280] edits? [10:34.280 --> 10:40.720] Can we help them identify the very bad edits more easily? [10:40.720 --> 10:46.320] This is what we do with a so-called revert risk model. [10:46.320 --> 10:47.320] What is this? [10:47.320 --> 10:53.320] We look at an edit by comparing the old version of an article with its new version. [10:53.320 --> 10:59.440] And we would like to make a prediction whether the change is good or whether it is a very [10:59.440 --> 11:03.680] bad edit and should be reverted. [11:03.680 --> 11:07.520] How we do this is we extract different features from this edit: [11:07.520 --> 11:12.520] which text was changed, were there links that were removed, were there images that [11:12.520 --> 11:15.880] were removed, and so on. [11:15.880 --> 11:25.480] We then built a model by looking into the history of all Wikipedia edits, extracting [11:25.480 --> 11:33.400] those edits which have been reverted by editors, and using them as ground truth of bad edits [11:33.400 --> 11:35.200] for our model. [11:35.200 --> 11:43.920] The resulting output is that for each of these edits we can calculate a so-called [11:43.920 --> 11:45.240] revert risk: [11:45.240 --> 11:53.000] a very bad edit will have a very high probability, a very high risk, of being [11:53.000 --> 11:54.000] reverted. [11:54.000 --> 11:56.720] And this is what our model will output. [11:56.720 --> 12:01.120] And our model performs fairly well. [12:01.120 --> 12:05.680] It has an accuracy between 70 and 80%. [12:05.680 --> 12:09.440] I want to mention that we consider this OK. [12:09.440 --> 12:11.720] It does not need to be perfect. [12:11.720 --> 12:18.480] The way our model is used is that these scores are surfaced to help [12:18.480 --> 12:27.200] editors identify which edits they should take a closer look at. [12:27.200 --> 12:34.080] Similar models for annotating the content of articles exist, [12:34.080 --> 12:37.080] and we have been developing these types of models. [12:37.080 --> 12:43.200] In addition to knowledge integrity, which is what I presented, we have been trying to build models [12:43.200 --> 12:51.040] for easily finding similar articles, for automatically identifying the topic of an article, for assessing [12:51.040 --> 12:59.960] its readability or geography, or for identifying related images, et cetera. [12:59.960 --> 13:06.400] I only want to briefly highlight that the development of these models is rooted in some [13:06.400 --> 13:10.080] core principles to which we are committed. [13:10.080 --> 13:15.080] This can create additional challenges in developing these models. Specifically, in this [13:15.080 --> 13:22.760] context I want to highlight the multilingual aspect: we always try to prefer [13:22.760 --> 13:31.080] language-agnostic approaches in order to support as many as possible of the 300 different language [13:31.080 --> 13:35.040] versions of Wikipedia.
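To make the revert-risk idea above more concrete, here is a minimal sketch of how such a classifier could be trained; the feature set, the file of historical edits, and the model choice are illustrative assumptions, not the actual production model.

```python
# Minimal sketch of a revert-risk classifier: simple features comparing the
# old and new revision of an edit, with past reverts as ground-truth labels.
# The feature names, the input file, and the model choice are illustrative
# assumptions, not the production model.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical table of historical edits: one row per edit, with old-vs-new
# features and whether the edit was later reverted by editors.
edits = pd.read_csv("edit_history.csv")
features = ["chars_added", "chars_removed", "links_removed", "images_removed"]
X, y = edits[features], edits["was_reverted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = GradientBoostingClassifier().fit(X_train, y_train)

# The model outputs a revert risk for each edit: the probability that it will
# be reverted, which tools can surface to editors who patrol recent changes.
revert_risk = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```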
[13:35.240 --> 13:47.080] I want to conclude with potential ways in which you can contribute in any of the three areas that I mentioned previously. [13:47.080 --> 13:57.040] Generally, one can contribute as a developer to MediaWiki or other aspects of the Wikimedia ecosystem. [13:57.040 --> 14:02.080] And there, the place to get started is the so-called Developer Portal, which is a centralized [14:02.080 --> 14:09.200] entry point for finding technical documentation and community resources. [14:09.200 --> 14:14.040] Not going into more detail here, I want to give a shout-out and refer to the talk by [14:14.040 --> 14:21.240] my colleague Slavina Stefanova from the Developer Advocacy team. [14:21.240 --> 14:27.000] But specifically in the area of research, I want to highlight a few entry points depending [14:27.000 --> 14:29.080] on your interest. [14:29.080 --> 14:35.840] In case you would like to build a specific tool, there is the Wikimedia Foundation's Toolforge [14:35.840 --> 14:43.920] infrastructure. That is a hosting environment that allows you to run bots or different APIs [14:43.920 --> 14:50.880] in case you would like to provide that tool to the public. [14:50.880 --> 14:58.080] If you want to work with us on improving tools or algorithms, you can check out the different [14:58.080 --> 15:03.280] packages that we have been releasing in the past months. [15:03.280 --> 15:06.120] These are all work in progress. [15:06.120 --> 15:15.200] There are many open issues, and we are happy about any contributions, whether improving the packages, fixing [15:15.200 --> 15:19.560] existing issues, or even finding new bugs. [15:19.560 --> 15:24.680] So please check out our repositories too. [15:24.680 --> 15:30.320] If you are interested in getting funding, there are different opportunities. [15:30.320 --> 15:36.720] There is an existing program to fund research around Wikimedia projects. [15:36.720 --> 15:43.000] This covers many different disciplines, humanities, social science, computer science, education, [15:43.000 --> 15:51.360] law, et cetera, and is around work that has potential for direct positive impact on the [15:51.360 --> 15:53.080] local communities. [15:53.080 --> 15:59.520] In addition, I want to mention that, coming in the future, there are plans for a similar [15:59.520 --> 16:05.040] program to improve Wikimedia's technology and tools. [16:05.040 --> 16:13.120] If you want to learn about the projects we are working on, I want to mention that we [16:13.120 --> 16:19.440] publish a research report, a summary of our ongoing research projects, every six months, [16:19.440 --> 16:25.680] and there you can find more details about some of the projects that I have mentioned. [16:25.680 --> 16:33.360] Finally, if you would like to engage with the research community, you can join us at [16:33.360 --> 16:34.640] Wiki Workshop. [16:34.640 --> 16:39.040] This is the primary meeting venue of the Wikimedia research community. [16:39.040 --> 16:44.600] This year will be the 10th edition of Wiki Workshop, and it is expected to be held in [16:44.600 --> 16:45.760] May. [16:45.760 --> 16:48.120] You can submit your work there, [16:48.120 --> 16:50.800] and I invite you to make submissions. [16:50.800 --> 17:00.080] We highly encourage submitting ongoing or preliminary work as extended abstracts. [17:00.080 --> 17:04.960] In this edition, there will also be a new track for Wikimedia developers. [17:04.960 --> 17:11.120] If you are a developer of a tool or a system or an algorithm that could be of interest [17:11.120 --> 17:16.960] to research on Wikimedia, please check it out and make a submission.
[17:16.960 --> 17:23.760] Even if you do not plan to make a submission, you are welcome to participate. [17:23.760 --> 17:30.440] As in the last three editions, Wiki Workshop will be fully virtual and attendance [17:30.440 --> 17:33.160] will be free. [17:33.160 --> 17:35.920] With this, I want to conclude. [17:35.920 --> 17:38.960] I want to thank you very much for your attention. [17:38.960 --> 17:43.480] I am looking forward to your questions in the Q&A. [17:43.480 --> 17:49.280] If you want to stay in touch, feel free to reach out to me personally by email or [17:49.280 --> 17:54.200] through any of the other channels that I am listing here, such as office hours, mailing lists, [17:54.200 --> 18:01.560] or IRC, et cetera, and with this, thank you very much. [18:08.960 --> 18:09.960] Thank you. [18:09.960 --> 18:09.960] Bye.