[00:00.000 --> 00:13.120] Hello, everyone. I'm Marco. Thank you for being here to listen to my talk. I'm an engineering [00:13.120 --> 00:21.720] manager at Mozilla. I've been at Mozilla for almost 10 years now. I started as a contributor, [00:21.720 --> 00:27.920] then an intern, then I was hired, and I've been here for almost 10 years. I started working [00:27.920 --> 00:34.840] on some funny projects, like writing a Java VM in JavaScript, and then more recently I [00:34.840 --> 00:41.240] started focusing on using machine learning and data mining techniques to improve developer [00:41.240 --> 00:53.400] efficiency, which has also been the subject of my PhD. During this talk, I will show you [00:53.400 --> 01:01.120] how we will all be out of a job in a few years. Joking. I will just take you through our journey [01:01.120 --> 01:09.720] of how we incrementally built features based on machine learning for improving software [01:09.720 --> 01:25.080] engineering, one on top of the other. I'm the father of two, Luna and Nika. Before we start [01:25.080 --> 01:31.640] with the presentation, I wanted to explain a little why we need to do all these complex [01:31.640 --> 01:41.760] machine learning things on top of bugs, CI, patches, et cetera, et cetera. Firefox is a very [01:41.760 --> 01:47.840] complex piece of software. It's a browser. We have hundreds of bug reports and feature requests [01:47.840 --> 01:56.240] opened per day. We have 1.8 million bug reports at this time, which is almost the price of [01:56.240 --> 02:04.240] a one-bedroom apartment in London. We release every four weeks with thousands of changes, [02:04.240 --> 02:12.320] and during 2022 we had 13 major releases and 45 minor releases. As you can see, [02:12.320 --> 02:24.640] we even sometimes party when we reach a certain number of bugs. As I said, Firefox is one [02:24.640 --> 02:31.320] of the biggest software projects in the world. We have a lot of legacy. Netscape was open sourced [02:31.320 --> 02:41.160] 25 years ago. A few days ago we celebrated its 25th birthday. Over time we have had 800,000 [02:41.160 --> 02:49.960] commits made by 9,000 unique contributors, representing 25 million lines of code. We had 37,000 commits [02:49.960 --> 02:57.240] last year alone by 1,000 unique contributors. Not all of them are paid. Many of them are [02:57.240 --> 03:04.040] volunteers. And this is a list of the languages that we use. As you can see, we use many of [03:04.040 --> 03:11.200] them. We have C++ and Rust for low-level things. Rust is gaining ground and is probably going [03:11.200 --> 03:21.800] to overtake C soon. We use JavaScript for the front-end and for tests. And we use Python for CI and [03:21.800 --> 03:28.000] the build system. But we have many more. So if anybody is interested in contributing, you [03:28.000 --> 03:39.920] have many options to choose from. But let's move on. As I said, the complexity is really [03:39.920 --> 03:47.120] large. We have thousands and thousands of bugs, and we need some way to control the [03:47.120 --> 03:53.800] quality, to increase the visibility into the quality of the software. And we cannot do [03:53.800 --> 04:00.440] that if the bugs are left uncontrolled. One of the first problems that we had was that [04:00.440 --> 04:06.640] there was no way to differentiate between defects and feature requests. We call them bugs on [04:06.640 --> 04:12.520] Bugzilla, but they are actually just reports. Many of them are defects. Many of them are [04:12.520 --> 04:20.000] actually just feature requests.
And so at the time, we had no way to measure quality. [04:20.000 --> 04:26.360] We had no way to tell: in this release we have 100 bugs, in the previous release we had 50, [04:26.360 --> 04:32.640] so this release is better than the previous one. So we needed a way to make this differentiation [04:32.640 --> 04:36.960] in order to measure quality. And it was also hard to improve workflows if we had no [04:36.960 --> 04:42.960] way to differentiate between them. So we thought of introducing a new type field. This might [04:42.960 --> 04:50.160] seem simple. It's just a choice between defect, enhancement, and task. But in practice, when [04:50.160 --> 04:57.320] you have 9,000 unique contributors, some of them not paid, it's not easy to enforce a [04:57.320 --> 05:05.600] change like this. And we also had another problem. We have 1.8 million bugs. If we just [05:05.600 --> 05:13.280] introduce this type, it's not going to help us at all until we reach a critical mass of bugs [05:13.280 --> 05:20.320] with the type set. So if we just introduced it now, it would only start to be useful six [05:20.320 --> 05:26.240] months from now. So we thought, how do we set the field for existing bugs so that this [05:26.240 --> 05:33.160] actually becomes useful from day one? And we thought of using machine learning. So we [05:33.160 --> 05:42.560] collected a dataset. I'm not sure it can be considered large nowadays. It had 2,000 manually [05:42.560 --> 05:53.080] labelled bugs: a few of us labelled them independently, and then we compared the labels so that we [05:53.080 --> 05:58.960] were consistent. And we had 9,000 bugs labelled with some heuristics based on fields that [05:58.960 --> 06:05.920] were already present in Bugzilla. Then, using the fields from Bugzilla and the title [06:05.920 --> 06:13.840] and comments fed through an NLP pipeline, we trained an XGBoost model. And we achieved accuracy [06:13.840 --> 06:25.440] that we deemed good enough to be used in production. And this is how the bugbug project started. [06:25.440 --> 06:32.000] It was just a way to differentiate between defects and non-defects on Bugzilla. We saw [06:32.000 --> 06:39.920] it worked, and then we thought, what if we extend this to something else? [06:39.920 --> 06:47.600] What is the next big problem that we have on Bugzilla? And it was assigning components. [06:47.600 --> 06:54.960] Again, we have lots of bugs, hundreds of thousands of bugs. We need a way [06:54.960 --> 07:01.280] to split them into groups so that the right team sees them, so that the right people see them. [07:01.280 --> 07:05.880] And the faster we do it, the faster we can fix them. At the time, it was done manually [07:05.880 --> 07:11.080] by volunteers and developers. You can see a screenshot here: product and component, [07:11.080 --> 07:22.400] PDF Viewer. In this case, we didn't need to manually create a data set, because all of [07:22.400 --> 07:30.760] the one million bugs had already been manually split into groups by volunteers and developers [07:30.760 --> 07:39.760] in the past. So we had, in this case, a very large data set: two decades' worth of bugs. [07:39.760 --> 07:48.240] The problem here was that we had to roll back the bugs to their initial state, because otherwise, [07:48.240 --> 07:53.880] by training the model on the final state of the bugs, we would have used future data to [07:53.880 --> 07:58.720] predict the past. And that, of course, would not work.
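To make this concrete, here is a minimal sketch of what such a classifier could look like in Python, assuming scikit-learn and XGBoost are available. The example texts and labels are invented, and the real bugbug pipeline uses more features than plain TF-IDF over the text, so treat this as an outline rather than the actual implementation.

```python
# Minimal sketch of a defect / enhancement / task classifier, roughly in the
# spirit of the bugbug "type" model. The training data here is invented;
# the real pipeline combines Bugzilla fields with the NLP features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Hypothetical labelled bugs: title + first comment, and a type label
# (0 = defect, 1 = enhancement, 2 = task).
texts = [
    "Crash when opening a PDF attachment",
    "Add an option to mute tabs from the toolbar",
    "Update the vendored copy of a third-party library",
    # ... ~2,000 manually labelled + ~9,000 heuristically labelled bugs
]
labels = [0, 1, 2]

# A simple NLP pipeline: TF-IDF over the text, then gradient-boosted trees.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    XGBClassifier(n_estimators=100),
)
model.fit(texts, labels)

# Predict the type of an incoming, untyped bug report.
print(model.predict(["Firefox hangs when printing a large document"]))
```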
So we rolled back the history of [07:58.720 --> 08:04.240] each bug to the beginning. We also reduced the number of components, because, again, at [08:04.240 --> 08:09.440] the Firefox scale, we have hundreds of components. Many of them are no longer [08:09.440 --> 08:15.240] actually maintained and no longer relevant. So we reduced them to a smaller subset. And [08:15.240 --> 08:23.720] again, we used the same kind of architecture to train the model, with a small tweak: we [08:23.720 --> 08:33.600] didn't have perfect accuracy, and so we needed a way to choose between precision and recall. So [08:33.600 --> 08:41.360] either pay the price of lower quality but catch more bugs, or catch fewer bugs but be precise [08:41.360 --> 08:47.880] more often. We can control this easily with a confidence level that is output by the model, [08:47.880 --> 08:53.280] which allows us to sometimes be more aggressive and sometimes less aggressive. But at least [08:53.280 --> 09:00.160] we can enforce a minimum level of quality. The average time to assign a bug [09:00.160 --> 09:08.600] then went from one week to a few seconds. Over time, we auto-classified 20,000 bugs. [09:08.600 --> 09:15.960] And since it worked, we also extended it to webcompat.com, which is yet another bug reporting [09:15.960 --> 09:21.480] system that we have at Mozilla. So if you find web compatibility bugs, please go there [09:21.480 --> 09:25.640] and file them, because it's pretty important. And you can see here the bot in action, [09:25.640 --> 09:31.800] moving a bug, again, to the Firefox PDF Viewer component. Maybe I should have used [09:31.800 --> 09:41.240] another example just for fun. Now we had something working, and it was starting to become promising. [09:41.240 --> 09:46.400] But we needed to make it better. We needed to have a better architecture for the machine [09:46.400 --> 09:51.560] learning side of things. We needed to retrain the models. We needed to collect new data. [09:51.560 --> 09:56.880] We needed to make sure that whenever a new component comes in, we retrain the model with [09:56.880 --> 10:03.760] the new components. If a component stops being used, we need to remove it from the dataset, [10:03.760 --> 10:10.880] and things like that. So we built, over time, a fairly complex architecture. I won't go into [10:10.880 --> 10:17.080] too many details, because it would take too long, but maybe if somebody has questions later, [10:17.080 --> 10:30.800] we can go into that. With this architecture in place, it was easier to build new models. So [10:30.800 --> 10:41.080] we even had contributors building models all by themselves. In particular, there was [10:41.080 --> 10:49.840] a contributor, Ayush, who helped us build a model to root out spam from Bugzilla. It [10:49.840 --> 10:55.480] seems weird, but yes, we do have spam on Bugzilla as well. People are trying to get [10:55.480 --> 11:02.120] links to their websites into Bugzilla because they think search engines will index them. [11:02.120 --> 11:08.080] That's not actually the case. We tell them all the time, but they keep doing it anyway. We [11:08.080 --> 11:18.320] also have university students. Bugzilla is probably the most studied bug tracking system in research. [11:18.320 --> 11:29.440] And we have many university students from many countries that use Bugzilla as a playing field.
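The precision/recall knob described above is essentially a confidence threshold on the model's output. Here is a minimal sketch, assuming a model that outputs class probabilities; the threshold value is illustrative, not the one used in production.

```python
# Sketch: only auto-assign a component when the model is confident enough.
# A higher threshold means fewer bugs get auto-classified (lower recall),
# but the assignments we do make are more often correct (higher precision).
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # illustrative value; tuned per model in practice

def maybe_autoassign(probabilities, components):
    """Return a component if confidence clears the bar, else defer to humans."""
    best = int(np.argmax(probabilities))
    if probabilities[best] >= CONFIDENCE_THRESHOLD:
        return components[best]
    return None  # leave the bug untriaged for a human to route

probs = np.array([0.05, 0.90, 0.05])
print(maybe_autoassign(probs, ["Graphics", "PDF Viewer", "Networking"]))
# -> "PDF Viewer"
```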
[11:29.440 --> 11:35.640] Many times we even contact the universities and professors, asking them if we can help [11:35.640 --> 11:45.480] them give more relevant topics to students, et cetera, but they keep filing bugs. And [11:45.480 --> 11:50.120] this contributor, who was maybe from one of these schools, was tired of it and helped us build [11:50.120 --> 11:57.840] a model. And the results were pretty good. I'll show you a few examples of bugs that [11:57.840 --> 12:06.760] were caught by the model. So for this one, if you look just at the first comment of the [12:06.760 --> 12:12.640] bug, it looks like a legit bug. But then the person created a second comment with a link [12:12.640 --> 12:23.000] to their website, and it was pretty clear that it was spam. This one is another example. This [12:23.000 --> 12:31.160] is actually a legit bug. It's not spam. Maybe it's not so usable as a bug report, but it [12:31.160 --> 12:39.320] was not spam. And then somebody else, a spammer, took exactly the same contents and created a [12:39.320 --> 12:46.360] new bug, injecting the link to their website into the bug report. And somehow, I don't know [12:46.360 --> 12:53.120] how, the model was able to detect that it was spam. It's funny because, [12:53.120 --> 12:59.560] when you file a bug on Bugzilla, Bugzilla will automatically insert a user agent so [12:59.560 --> 13:05.880] that we have as much information as possible to fix bugs. But in this case, they filed [13:05.880 --> 13:11.680] the bug copying the contents of the other bug, so we have two user agents. And they're [13:11.680 --> 13:24.240] even on different platforms, one on Mac and one on Chrome, actually. Okay. So we were [13:24.240 --> 13:30.280] done with bugs. Well, we are never done with bugs. We will have plenty of things to do in the [13:30.280 --> 13:40.400] future, forever. But we were happy enough with bugs, and we thought, what can we improve next? [13:40.400 --> 13:48.600] One of the topics that we were focusing on at the time was testing and the cost associated [13:48.600 --> 13:55.480] with testing. We were experimenting with code coverage, trying to collect coverage to select [13:55.480 --> 14:07.000] relevant tests to run on a given patch. But it was pretty complex for various reasons. [14:07.000 --> 14:13.680] So we thought maybe we can apply machine learning here as well. But before we go into that, [14:13.680 --> 14:19.680] let me explain a bit about our CI, because it's a little complex. So we have three branches, [14:19.680 --> 14:25.760] three repositories, which all kind of share the same code, Firefox. We have Try, which [14:25.760 --> 14:35.480] is on-demand CI. We have Autoland, which is the repository where patches land after they've [14:35.480 --> 14:42.080] been reviewed and approved. And we have Mozilla Central, which is the actual repository where [14:42.080 --> 14:51.440] the Firefox source code lives and from which we build Firefox Nightly. On Try, we run whatever [14:51.440 --> 14:58.080] the user wants. On Autoland, we run a subset of tests. At the time, what we decided to run [14:58.080 --> 15:05.200] was kind of arbitrary. And on Mozilla Central, we run everything. To give you an idea: on [15:05.200 --> 15:10.480] Try, we have hundreds of pushes per day. On Autoland, the same. And on Mozilla Central, [15:10.480 --> 15:21.000] we have only three or four. And it's restricted only to certain people that have the necessary [15:21.000 --> 15:27.000] permissions, since you can build Firefox Nightly from there.
And it's going to be shipped to [15:27.000 --> 15:38.520] everyone. The scale here is similar to the bug case. We have 100,000 unique test files. [15:38.520 --> 15:47.280] We have around 150 unique test configurations: combinations of operating systems and high-level [15:47.280 --> 15:53.960] Firefox configurations, so old style engine versus new style engine, a certain graphics [15:53.960 --> 16:01.760] engine versus another graphics engine, et cetera, et cetera. We have debug builds versus optimized [16:01.760 --> 16:09.160] builds. We have ASan, code coverage, et cetera, et cetera. Of course, the matrix is huge, and [16:09.160 --> 16:15.480] you get to 150 configurations. We have more than 300 pushes per day by developers. And [16:15.480 --> 16:23.160] the average push takes 1,500 hours if you were to run it all one after the other. It [16:23.160 --> 16:33.160] takes 300 machine years per month, and we run around 100 million test files per month to run [16:33.160 --> 16:40.640] these tests. If you were to run all of the tests [16:40.640 --> 16:46.480] in all of the configurations on every push, you would need to run around 2.3 billion test files per day, [16:46.480 --> 16:56.800] which is, of course, unfeasible. And this is a view of our Treeherder, which is the user [16:56.800 --> 17:04.200] interface for Mozilla test results. You can see that it is almost unreadable. The green [17:04.200 --> 17:12.120] stuff is good. The orange stuff is probably not good. You can see that we have lots of [17:12.120 --> 17:20.560] tests, and we spend a lot of money to run these tests. So here is what we wanted to do. We wanted [17:20.560 --> 17:25.320] to reduce the machine time spent running the tests. We wanted to reduce the end-to-end [17:25.320 --> 17:31.640] time so that developers, when they push, quickly get a result: yes or no, your patch is good [17:31.640 --> 17:37.880] or not. And we also wanted to reduce the cognitive overload for developers. Looking [17:37.880 --> 17:48.280] at a page like this, what is it? It's impossible to understand. Also, to give you an obvious [17:48.280 --> 17:58.160] example, if you're changing the Linux version of Firefox, say you're touching [17:58.160 --> 18:05.040] GTK, you don't need to run Windows tests. At the time, we were doing that. At the time, [18:05.040 --> 18:13.840] if you touched GTK code, we were running Android, Windows, Mac; that was totally useless. And [18:13.840 --> 18:20.400] the traditional way of running tests on browsers doesn't really work. You cannot run everything [18:20.400 --> 18:28.080] on all of the pushes. Otherwise, you will have a huge bill from the cloud provider. [18:28.080 --> 18:33.400] We couldn't use coverage because of some technical reasons. So we thought, what if we [18:33.400 --> 18:44.440] use machine learning? What if we extend bugbug to also learn about patches and tests? So the [18:44.440 --> 18:54.560] first part was to use machines to try to parse this information and try to understand what [18:54.560 --> 19:01.040] exactly failed. It might seem like an easy task if you have 100 tests or 10 tests, but [19:01.040 --> 19:07.800] when you have two billion tests, you have lots of intermittently failing tests. These [19:07.800 --> 19:15.840] tests fail randomly. They are not always the same ones. Every week, we see 150 new intermittent [19:15.840 --> 19:23.360] tests coming in.
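To make the scale concrete, here is a back-of-envelope check using the numbers above. The daily push count used below is an assumption chosen to illustrate the order of magnitude, not a figure from the talk.

```python
# Back-of-envelope: why running every test on every push is unfeasible.
# test_files and configurations are the figures quoted above; the push
# count is an assumed round number for illustration.
test_files = 100_000      # unique test files
configurations = 150      # OS x build type x engine flags, etc.
pushes_per_day = 150      # assumed share of daily pushes needing full runs

runs_per_day = test_files * configurations * pushes_per_day
print(f"{runs_per_day:,} test-file runs per day")  # 2,250,000,000, ~2.3 billion
```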
It's not easy to automatically say whether a failure [19:23.360 --> 19:30.920] is actually a failure or whether it is an intermittent. Sometimes not even developers are able to do that. [19:30.920 --> 19:37.560] Also, not all of the tests are run on all of the pushes. So if I push my patch and a test [19:37.560 --> 19:46.200] doesn't run, but it runs later on another push and fails, I don't know if it was my fault [19:46.200 --> 19:57.080] or somebody else's fault. And so we have sheriffs, people whose main focus [19:57.080 --> 20:03.600] is watching the CI, and they are pretty experienced at doing that, probably [20:03.600 --> 20:11.760] better than most developers. But human errors still exist. Even if we have their annotations, [20:11.760 --> 20:20.720] it's pretty hard to be sure about the results. You can see a meme that some sheriff created. [20:20.720 --> 20:29.200] "Flaky" tests are the infamous intermittently failing tests. So the second [20:29.200 --> 20:36.360] step, after we implemented some heuristics to try to understand the failures due to a [20:36.360 --> 20:46.080] given patch, was to analyze patches. We didn't have readily available tools, at least not [20:46.080 --> 20:53.280] fast enough for the amount of data that we are talking about. We just used Mercurial [20:53.280 --> 21:01.800] for authorship info. So who's the author of the push? Who's the reviewer? When was it [21:01.800 --> 21:08.680] pushed? Et cetera, et cetera. And we created a couple of projects written in Rust to parse [21:08.680 --> 21:14.200] patches efficiently and to analyze source code. The second one was actually a research partnership [21:14.200 --> 21:27.520] with the Politecnico di Torino. And the machine learning model itself is not a multi-label [21:27.520 --> 21:37.080] model, as one might think, where each test is a label. It would be too large with the [21:37.080 --> 21:44.040] number of tests that we have. The model is simplified. The input is the tuple of test [21:44.040 --> 21:52.160] and patch, and the label is just fail or not fail. The features actually come from the [21:52.160 --> 21:57.680] test, the patch, and the link between the test and the patch. So, for example, the [21:57.680 --> 22:04.200] past failures when the same files were touched, the distance from the source files to the [22:04.200 --> 22:11.040] test files in the tree, and how often source files were modified together with test files. Of [22:11.040 --> 22:17.040] course, if they're modified together, probably they are somehow linked. Maybe you need to [22:17.040 --> 22:23.440] fix the test, and so when you push your patch, you also fix the test. This is a clear link. [22:23.440 --> 22:31.760] But even then, we have lots of test redundancies. So we used frequent itemset mining to try [22:31.760 --> 22:39.880] to understand which tests are redundant and remove them from the set of tests that are [22:39.880 --> 22:52.360] selected to run. And this was pretty successful as well. So now we had an architecture to train [22:52.360 --> 23:01.000] models on bugs, and to train models on patches and tests. The next step was to reuse what [23:01.000 --> 23:10.520] we built for patches to also try to predict defects. This is actually still in an experimental [23:10.520 --> 23:16.320] phase. It's kind of a research project. So if anybody is interested in collaborating [23:16.320 --> 23:25.240] with us on this topic, we will be happy to do so.
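Going back to the test selection model for a moment, a minimal sketch of the (test, patch) formulation might look like the following. The feature values and rows are invented for illustration, and the real bugbug model uses many more signals than these three.

```python
# Sketch of the simplified test selection model: one row per (test, patch)
# pair, labelled fail / not fail. Each row holds three stand-in features:
#   [past failures when the same source files were touched,
#    tree distance between the source files and the test file,
#    how often the source files were modified together with the test file]
from xgboost import XGBClassifier

X = [
    [12, 1, 0.60],  # test often failed on similar patches, lives nearby
    [ 0, 9, 0.00],  # unrelated test, far away in the tree
    [ 3, 2, 0.20],
    [ 1, 7, 0.05],
]
y = [1, 0, 1, 0]  # 1 = the test failed on that push, 0 = it did not

model = XGBClassifier(n_estimators=50)
model.fit(X, y)

# At push time, score every candidate (test, patch) pair and select the
# tests most likely to fail, then prune redundant ones (itemset mining).
print(model.predict_proba([[8, 2, 0.5]])[:, 1])  # probability of failure
```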
I will just show you a few things that [23:25.240 --> 23:32.920] we have done in this space for now. So the goals are to reduce regressions by detecting [23:32.920 --> 23:39.120] the patches that reviewers should focus on more than others, to reduce the time spent [23:39.120 --> 23:46.680] by reviewers on less risky patches, and, when we detect that a patch is risky, to trigger [23:46.680 --> 23:54.000] some risk control operations. For example, running fuzzing more comprehensively [23:54.000 --> 23:59.520] on these patches, and things like this. Of course, the model is just an evaluation of [23:59.520 --> 24:06.600] the risk. It's not actually going to tell us if there is a bug or not. And it will never [24:06.600 --> 24:17.080] replace a real reviewer, who can actually review the patch more precisely. [24:17.080 --> 24:24.520] The first step was, again, to build a data set. It is not easy to know which patches cause [24:24.520 --> 24:29.920] regressions. It's actually impossible at this time. There are some algorithms that are used [24:29.920 --> 24:38.320] in research. The most famous one is SZZ. But we had some hints that it was not so good. [24:38.320 --> 24:44.640] So we started here, again, by introducing a change in our process. We introduced [24:44.640 --> 24:54.520] a new field, which is called "regressed by", so that developers, QA, and users can specify [24:54.520 --> 25:00.760] what caused a given regression. So when they file a bug, if they know what caused it, they [25:00.760 --> 25:06.840] can specify it here. If they don't know what caused it, we have a few tools that we built [25:06.840 --> 25:16.400] over time to automatically download builds from the CI that I showed earlier: automatically [25:16.400 --> 25:23.320] download builds from the past and run a bisection to try to find the cause of the given [25:23.320 --> 25:31.360] bug. With this, we managed to build a pretty large data set: 5,000 links between bug-introducing [25:31.360 --> 25:43.400] and bug-fixing commits. Actually, commit sets. This amounts to 24,000 commits. And [25:43.400 --> 25:48.600] then we were able, with this data set, to evaluate the current algorithms that are presented [25:48.600 --> 25:54.880] in the literature. And as we thought, they do not work well at all. So this is one [25:54.880 --> 26:06.240] of the areas of improvement for research. One of the improvements that we tried to apply [26:06.240 --> 26:15.920] to SZZ was to improve the blame algorithm (or, if you're more familiar with Mercurial, the annotate [26:15.920 --> 26:26.360] algorithm): instead of looking at lines, split changes by words and tokens, [26:26.360 --> 26:33.080] so you can see past changes by token instead of by line. This is a visualization [26:33.080 --> 26:39.640] from the Linux kernel. This is going to give you a much more precise view of what changed [26:39.640 --> 26:47.960] in the past. For example, it will skip over tab-only changes, whitespace-only changes, [26:47.960 --> 26:55.320] and things like that. If you add an if, your code will be indented more, but you're not [26:55.320 --> 27:01.640] actually changing everything inside. You're changing only the if. This actually improved [27:01.640 --> 27:08.640] the results, but it was not enough to get to an acceptable level of accuracy. But it's [27:08.640 --> 27:15.280] nice, and we can actually use it in the IDE.
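Here is a toy illustration of that token-level idea, using Python's difflib on whitespace-separated tokens. The real implementation works on proper language tokens, but this shows the gist: re-indenting a block no longer shows up as a change to every line inside it.

```python
# Toy token-level diff: wrapping two statements in an "if" re-indents them,
# so a line diff marks everything as changed, while a token diff reports
# only the tokens that were actually added.
import difflib

old = "x = compute()\nsave(x)"
new = "if ready:\n    x = compute()\n    save(x)"

# A line-based diff flags every re-indented line as changed...
for line in difflib.ndiff(old.splitlines(), new.splitlines()):
    if line.startswith(("+", "-")):
        print("line diff:", line)

# ...while a token-based diff reports only the "if ready:" insertion.
old_tokens, new_tokens = old.split(), new.split()
matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print("token diff:", op, old_tokens[i1:i2], "->", new_tokens[j1:j2])
```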
We're not doing it yet, but we will, to give [27:15.280 --> 27:24.160] more information to users, because developers use annotate and git blame a lot. And this [27:24.160 --> 27:31.080] is a UI, still a work in progress, for analyzing the risk of a patch. This is a screenshot [27:31.080 --> 27:38.000] from our code review tool. We are showing the result of the algorithm with its confidence. [27:38.000 --> 27:44.920] So in this case, it was a risky patch with 79% confidence. And we give a few explanations [27:44.920 --> 27:50.640] to the developers. This is one of the most important things. Developers, [27:50.640 --> 27:57.560] like any other users, do not always trust results from machine learning. [27:57.560 --> 28:08.160] And so you need to give them an explanation. And this is another part of the output of [28:08.160 --> 28:15.160] our tool, again in our code review tool. We're showing, for the functions that [28:15.160 --> 28:23.120] are being changed by the patch, whether the function is risky or not, and which bugs in the past [28:23.120 --> 28:30.920] involved this function. So developers can try to see if the patch is reintroducing [28:30.920 --> 28:38.920] a previously fixed bug. And they can also know what kind of side effects there are when [28:38.920 --> 28:45.840] you make changes to a given area of the code. [28:45.840 --> 28:56.200] Now, we have done a lot of stuff for developers. We trained models for bugs. We trained models [28:56.200 --> 29:02.720] for patches. We trained models for tests. We trained models to predict defects. Now [29:02.720 --> 29:10.640] I'm going to go to a slightly different topic, even though it's connected: privacy-friendly [29:10.640 --> 29:20.560] translations. So we're working on introducing translations in Firefox. The subtitle was [29:20.560 --> 29:30.280] actually translated automatically using Firefox Translate, which you can use nowadays. The [29:30.280 --> 29:36.600] idea is that translation models have improved a lot in recent times, but current cloud-based [29:36.600 --> 29:42.480] services do not offer the privacy guarantees that we like to offer in Firefox. They are [29:42.480 --> 29:49.800] closed source. They are not privacy-preserving. So we started a project, funded by [29:49.800 --> 29:56.960] the European Union, to investigate client-side private translation capabilities in Firefox [29:56.960 --> 30:03.960] itself. It is currently available as an add-on that you can install in Firefox. We support [30:03.960 --> 30:09.840] many European languages, and we're working on supporting even more. We're also going to [30:09.840 --> 30:20.680] work on supporting non-European languages like Chinese, Korean, Japanese, etc. [30:20.680 --> 30:26.560] And in this case, we use machine learning on the client side to perform the translation. [30:26.560 --> 30:34.160] So your data never leaves your Firefox. The models are downloaded from our servers, but [30:34.160 --> 30:41.280] they run locally on your machine. So the contents of the web page that you're looking at will [30:41.280 --> 30:47.400] never go to Google, Bing, or whatever. They will be translated locally on your machine. [30:47.400 --> 30:54.880] We use a few open data sets. Luckily, we have lots of them from past research. Not all of [30:54.880 --> 31:01.000] them are of good quality, but many of them are. And we are looking for more. So if you have [31:01.000 --> 31:07.600] suggestions for data sets that we can use, please let us know.
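Going back to those per-function explanations for a moment, the index behind "which past bugs involved this function" can be sketched like this. The function names and bug IDs below are entirely hypothetical, and this is not the actual implementation.

```python
# Sketch: map each function touched by a patch to the past bugs whose fixes
# touched the same function, so reviewers can spot likely side effects.
from collections import defaultdict

# Built offline from fix commits: function -> bugs whose fixes touched it.
# The bug IDs and function names are made up for illustration.
past_bugs_by_function = defaultdict(list)
for bug_id, touched_functions in [
    (1523456, ["PDFViewer::Render", "PDFViewer::Zoom"]),
    (1601234, ["PDFViewer::Render"]),
]:
    for fn in touched_functions:
        past_bugs_by_function[fn].append(bug_id)

def explain_risk(changed_functions):
    """For each function a patch changes, list the past bugs involving it."""
    return {fn: past_bugs_by_function.get(fn, []) for fn in changed_functions}

print(explain_risk(["PDFViewer::Render", "PDFViewer::Print"]))
# {'PDFViewer::Render': [1523456, 1601234], 'PDFViewer::Print': []}
```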
[31:07.600 --> 31:14.560] On the data sets, we perform some basic data cleaning. And we use machine learning-based [31:14.560 --> 31:23.240] techniques to clean the data, to remove sentence pairs that we believe are bad. Of [31:23.240 --> 31:27.920] course, the data sets that I showed before are open, but sometimes they are just crawled [31:27.920 --> 31:38.680] from the web, so they contain all sorts of bad sentences. Also, HTML tags and stuff [31:38.680 --> 31:44.040] like that: we need to clean them up, otherwise the models will learn to translate HTML [31:44.040 --> 31:50.120] tags. And we use some techniques to increase the size of the data set automatically, like [31:50.120 --> 31:56.120] back-translation: translating sentences from one language to the other and back-translating [31:56.120 --> 32:04.280] them, in order to increase the size of the data sets. [32:04.280 --> 32:16.480] So we trained a large model on cloud machines, and it is pretty large. You can see it's around [32:16.480 --> 32:22.400] 800 megabytes, so for every language pair you would need to download 800 megabytes, and [32:22.400 --> 32:30.640] it is super slow, so we can only use it in the cloud. [32:30.640 --> 32:35.480] So we use some techniques to reduce the size of these models and to make them [32:35.480 --> 32:45.360] faster. We use knowledge distillation, basically using the large model that we trained [32:45.360 --> 32:51.800] as a teacher for a student model, which is much smaller. So you can see that from 800 [32:51.800 --> 32:56.880] megabytes we got to 16. I think now we're around 5 or 6, something like that. So it's [32:56.880 --> 33:01.920] much smaller, and you can actually download it on demand from our servers. And we use [33:01.920 --> 33:08.080] quantization for further compression and performance improvements, moving the data in the [33:08.080 --> 33:14.120] model from float32 to int8. [33:14.120 --> 33:21.840] Then we compiled the machine translation engine to WebAssembly in order to be able to use it [33:21.840 --> 33:29.360] inside Firefox. We introduced some SIMD extensions into WebAssembly and into Firefox in order [33:29.360 --> 33:36.920] to be even faster when translating, and we translate a bit at a time, [33:36.920 --> 33:47.640] so it's pretty fast. And the engines are downloaded and updated on demand. [33:47.640 --> 34:15.320] Let me show you a demo. [34:15.320 --> 34:25.920] So, you can see my Firefox is in Italian, but it automatically detected [34:25.920 --> 34:33.560] that the page is in French, and it is suggesting that I translate it to Italian. I will change [34:33.560 --> 35:00.920] it to English. Oh, fuck. [35:00.920 --> 35:09.240] So it is downloading the model. Now it's translating. While it was translating, you already [35:09.240 --> 35:13.360] saw that the contents of the first part of the page were already translated, so it's super [35:13.360 --> 35:20.240] quick in the end. And the translation seems to be pretty good. I don't speak French, but [35:20.240 --> 35:34.800] I think it makes sense. You can also use it from the toolbar, so you can choose a language [35:34.800 --> 35:59.800] and translate to another. Let's do Italian to French. It works. [35:59.800 --> 36:23.040] All right.
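As an aside on the quantization step mentioned before the demo, here is a toy sketch of symmetric float32-to-int8 quantization with NumPy. The engine's real scheme is more sophisticated, but the 4x size reduction comes from the same idea.

```python
# Toy symmetric quantization: store weights as int8 plus one float32 scale,
# cutting memory 4x versus float32 (real schemes are more elaborate).
import numpy as np

weights = np.random.randn(1000).astype(np.float32)  # stand-in for one layer

scale = np.abs(weights).max() / 127.0        # map the largest weight to 127
q = np.round(weights / scale).astype(np.int8)

dequantized = q.astype(np.float32) * scale   # approximate reconstruction
print(weights.nbytes, "->", q.nbytes)        # 4000 -> 1000 bytes
print("max error:", float(np.abs(weights - dequantized).max()))
```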
So if you know any data set that we can use, in addition to the ones that we [36:23.040 --> 36:28.640] already use, or if you're interested in building a great new feature in Firefox, or if you [36:28.640 --> 36:33.360] want to add support for your language or improve support for your language, come and talk to [36:33.360 --> 36:39.840] us at our booth. We would be really happy if you could help us. And before we come to [36:39.840 --> 36:46.920] an end, let me show you how far we've come. The dogs have grown, and we have learned that [36:46.920 --> 36:54.040] it is possible to tame the complexity of large-scale software. It is possible to use [36:54.040 --> 37:01.400] the past history of development to support future development, and it is possible [37:01.400 --> 37:08.760] to use machine learning in a privacy-friendly way and in the open. What else could we do [37:08.760 --> 37:14.280] with the data and the tools that we have at our disposal? I don't know. I'm looking forward [37:14.280 --> 37:21.720] to finding out. I'm looking forward to seeing what other wild ideas you and we at Mozilla can come [37:21.720 --> 37:24.920] up with. Thank you. [37:24.920 --> 37:36.680] Thank you very much, Marco, for the amazing talk. Now we're open for questions. If anyone [37:36.680 --> 37:43.240] would like to ask a question, please raise your hand so I can bring over the microphone. Questions, [37:43.240 --> 37:55.480] questions, hands up. There. Okay, okay. I'm sorry, I'm learning. I'm new to this. I'm [37:55.480 --> 38:21.400] coming over. Hello. I actually have two questions. The first question is, have you thought [38:21.400 --> 38:28.600] about the idea of using this mechanism to automatically translate the interface of Mozilla products? [38:28.600 --> 38:38.760] Sorry? This thing? Yes. Yeah. So the question is, have you thought about a mechanism for automatically [38:38.760 --> 38:44.920] translating the interface of Mozilla Firefox products, or maybe documentation you already [38:44.920 --> 38:49.000] have, like MDN, because there is still a demand to translate this stuff? [38:49.000 --> 39:10.360] I'm sorry. I'm not hearing well. Can you maybe come closer? [39:10.360 --> 39:16.720] From here? Okay. Is it better now? Yes. Okay. So my question is, have you tried to use [39:16.720 --> 39:23.840] this mechanism of automatic translation for the existing interface [39:23.840 --> 39:28.720] you have in the products, and especially also for the documentation? Because it's kind of a vital [39:28.720 --> 39:33.360] part: when you need to translate new functionality, or you have to translate something new in [39:33.360 --> 39:37.680] the interface, you need the help of a translator. But if you already know how to [39:37.680 --> 39:41.520] translate this stuff, that means you already have a data set, so you could actually automatically [39:41.520 --> 39:47.560] translate new parts of the interface without a translator? Yes. So it is definitely something [39:47.560 --> 39:54.160] that could be used to help translators do their job. We could translate parts of the [39:54.160 --> 40:00.000] interface automatically. And of course, there would always be some review from an actual translator [40:00.000 --> 40:04.400] to make sure that the translation makes sense in context, especially because in the Firefox [40:04.400 --> 40:10.160] UI you sometimes have very short text and it needs to make sense. But yeah, it's definitely [40:10.160 --> 40:15.040] something that we have considered.
And actually, one of the data sets that we use from the [40:15.040 --> 40:22.080] list (it's not possible to see it on the slide) is called Mozilla L10N, and it contains [40:22.080 --> 40:31.600] sentence pairs from our browser UI. People are actually using it in research for automating [40:31.600 --> 40:39.760] translations. Does anyone have any other questions? Please raise your hands. Any [40:39.760 --> 40:50.960] other questions? No? Okay. If not, thank you very much again, Marco. [40:50.960 --> 41:10.240] Thank you.