[00:00.000 --> 00:16.120] Hello, everyone. Do you hear me well? Thanks, pretty large audience. If I may ask a quick [00:16.120 --> 00:22.200] show of hands, who among you has some experience, just any level of experience, with machine [00:22.200 --> 00:32.800] learning? Okay, cool. Awesome. So, I'll be talking today about how to run testing on [00:32.800 --> 00:38.960] machine learning systems. There are different keywords: CI/CD, quality assurance. A few words [00:38.960 --> 00:46.480] about us. I'm one of the founders of Giskard. We are building a collaborative and open [00:46.480 --> 00:52.760] source software platform to precisely ensure the quality of AI models. And I'll be explaining [00:52.760 --> 01:01.080] in this presentation a bit how it works. In terms of agenda, I prepared roughly two sections [01:01.080 --> 01:07.120] on the why: why a project on testing machine learning systems is needed, and why I [01:07.120 --> 01:15.840] personally decided to work on that problem. Some of the risks, and why classical [01:15.840 --> 01:22.760] software testing methods don't quite work on AI. And then I'll give some more concrete [01:22.760 --> 01:30.840] examples on two important quality criteria that you might want to test for machine learning. [01:30.840 --> 01:36.440] One is robustness and the other is fairness. And if we have the time, it's just 30 minutes, [01:36.440 --> 01:44.280] I hope that we can do a quick demo of an example use case where we run the full CI/CD pipeline [01:44.280 --> 01:53.120] on a machine learning model. So, to start off easy, I put together a series of [01:53.120 --> 02:03.000] memes to explain my personal story of why I came to create a company, and a project, [02:03.000 --> 02:09.280] around this machine learning testing thing. About 10 years ago, I started in machine [02:09.280 --> 02:16.480] learning, statistics, data science, and you know, you start using the scikit-learn [02:16.480 --> 02:21.960] API, and you're like, yeah, it's super easy, right? Anybody can be a data scientist: [02:21.960 --> 02:29.960] you just dot fit, dot predict, and that's it. You're a data scientist. And probably if [02:29.960 --> 02:37.880] you're here today, you're like, yeah, have you tested your model? Yeah, sure. Train/test split, [02:37.880 --> 02:49.320] yeah. Reality, if you've deployed in production, is quite different. If you've deployed [02:49.320 --> 02:58.200] to production, often you'll have this painful discovery where you have your product [02:58.200 --> 03:04.160] manager and business stakeholders to whom you said, look, I worked really hard on the fine-tuning [03:04.160 --> 03:12.680] and the grid search to get to 85% accuracy, and you push your first version to production, [03:12.680 --> 03:21.160] and things don't quite work out. You don't reproduce these good accuracy numbers. [03:21.160 --> 03:30.800] Well, this was me. I hope it's not you. My first experience deploying [03:30.800 --> 03:37.360] machine learning to production was on a fraud detection system. Fraud is notoriously [03:37.360 --> 03:42.360] difficult as a use case for machine learning, because what you're trying to detect doesn't [03:42.360 --> 03:49.440] quite want to be detected. There are people behind it who have a vested interest in not [03:49.440 --> 03:56.560] having your machine learning system detect them.
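To make that naive baseline concrete, here is a minimal sketch of the "dot fit, dot predict, train/test split" workflow described above; the dataset, the model choice, and the accuracy check are illustrative assumptions, not something shown in the talk.

```python
# Minimal sketch of the naive "dot fit, dot predict" workflow described above.
# The dataset, model, and accuracy check are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)                                 # "dot fit"
accuracy = accuracy_score(y_test, model.predict(X_test))    # "dot predict"
print(f"Held-out accuracy: {accuracy:.2%}")                 # looks great... until production
```

The rest of the talk is precisely about why a held-out accuracy number like this is not, by itself, a test of the system.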
So, in terms of performance, what I ended up doing, at least, was [03:56.560 --> 04:07.680] a lot of hot fixes in production. It's bad. So, about [04:07.680 --> 04:16.080] five years ago, this was my stance on machine learning in production: a very painful, [04:16.080 --> 04:23.520] grueling experience where you never know when you're going to get a complaint, when you're [04:23.520 --> 04:31.360] going to be on call to solve something in production. So, that's when I decided to buff [04:31.360 --> 04:43.240] up and switch roles to join a software engineering team. I was at Dataiku back then, so I moved [04:43.240 --> 04:53.240] internally from data science to the product team, and to summarize, [04:53.240 --> 05:01.320] as someone with a machine learning background but no real software engineering experience, [05:01.320 --> 05:08.840] this is what I was told: you must learn the ways of CI/CD, otherwise [05:08.840 --> 05:15.880] your project will not come to production. And for context, I was specifically at [05:15.880 --> 05:25.160] that time in charge of creating an open source layer to federate all the NLP and computer [05:25.160 --> 05:33.400] vision APIs that vendors in the cloud provide, and then to do the same for pre-trained NLP [05:33.400 --> 05:41.400] and time series models. What was difficult in this context is that I was not even the one [05:41.400 --> 05:46.480] in charge of the models, and the models would be retrained and fine-tuned, so guaranteeing [05:46.480 --> 05:55.160] the properties of that system as an engineer is more difficult. There are some elements [05:55.160 --> 06:03.960] in the stack that you don't have control of. So, yeah, this is a bit of a repeat of a previous [06:03.960 --> 06:12.520] meme, and I really wanted to say: one does not simply ship an ML product without tests. [06:12.520 --> 06:21.360] The challenge I had then is that from an engineering management standpoint, I was told, yeah, but [06:21.360 --> 06:27.040] you know, it's easy, engineers all write their test cases, so you do machine [06:27.040 --> 06:34.720] learning, just write them all, just write all the test cases. So, this was me back [06:34.720 --> 06:40.760] at square one. It's like, okay, so you're telling me I just need to write unit [06:40.760 --> 06:49.000] tests? Okay, that will not really solve the issue. And that's the beginning of [06:49.000 --> 06:56.240] the quest that set me on building something to close that gap between, okay, I want to [06:56.240 --> 07:02.800] test my models, I need to test my models, and how can I do that? Because clearly, [07:02.800 --> 07:12.720] and I'll explain why, unit testing your model is really not enough. So, a different angle [07:12.720 --> 07:19.120] on the why: I'll try to take a step back and talk about quality in general. I think in [07:19.120 --> 07:29.280] this track, we all agree that quality matters, and if you look at AI, it's an industry, [07:29.280 --> 07:35.920] an engineering practice, that is far younger than software engineering or civil engineering, [07:35.920 --> 07:42.880] and it's just riddled with incidents.
I encourage you, if you don't know that resource already: [07:42.880 --> 07:52.040] it's an open source database, incidentdatabase.ai, and it's a public collection of reports, [07:52.040 --> 07:59.760] mostly in the media, of AI models that have not worked properly. It's really great [07:59.760 --> 08:07.240] work that has been going on for about two and a half years. It's a global initiative, [08:07.240 --> 08:15.200] and just in this time, they have collected more than 2,000 incidents. Since these are public [08:15.200 --> 08:21.440] reports, think of it as the tip of the iceberg, of course. There are a lot of incidents internal [08:21.440 --> 08:28.800] to companies that are not necessarily spoken about in the media. The incident database has [08:28.800 --> 08:37.000] a very interesting taxonomy of the different types of incidents. It's very multifaceted. [08:37.000 --> 08:46.120] I took the liberty of simplifying it into three big categories of incidents. One is ethics, the [08:46.120 --> 08:53.160] other is business and economic impact, and the third one is security. If they happen at [08:53.160 --> 09:01.360] the scale of a global company, we're really talking about incidents that are very, very severe. [09:01.360 --> 09:10.200] On ethics, you can have a sexist credit scoring algorithm that exposes the company to lawsuits, [09:10.200 --> 09:19.400] to brand image damage, et cetera. These are notoriously hard to identify. In a way, machine [09:19.400 --> 09:25.000] learning is precisely about discrimination. It's hard to tell a machine that is learning [09:25.000 --> 09:30.200] to discriminate not to discriminate against certain sensitive groups. I'll speak on some [09:30.200 --> 09:35.160] methods that can be used precisely on this problem. Apple was working at the [09:35.160 --> 09:41.880] time with Goldman Sachs on deploying such an algorithm, and probably some tests and safeguards were [09:41.880 --> 09:50.800] unfortunately skipped. It was actually discovered on Twitter that, in a simple case, a male [09:50.800 --> 10:00.960] applicant would get 10 times the credit limit of his wife. That sparked a huge controversy [10:00.960 --> 10:10.000] that probably exposed Apple to some lawsuits. In another area, one that does not involve sensitive [10:10.000 --> 10:20.440] features such as gender, there was a huge catastrophe a year and a half ago that happened [10:20.440 --> 10:30.600] to Zillow, a real estate company, where there was a small bias that was overestimating the [10:30.600 --> 10:40.960] prices of homes. They decided to put this algorithm live to buy and sell houses. It [10:40.960 --> 10:47.600] turned out that this tiny bias, which was left unchecked, was exploited by the real [10:47.600 --> 10:59.120] estate agents in the US. Literally, this created a loss of nearly half a billion dollars. Again, [10:59.120 --> 11:06.480] going back to testing, maybe this could have been anticipated and avoided. Now, on the [11:06.480 --> 11:13.080] cybersecurity side, there's a lot of good research from cybersecurity labs showing [11:13.080 --> 11:20.080] that you can hack, for example, a computer vision system in an autonomous driving context. [11:20.080 --> 11:26.680] Here you put a special tape on the road and you can crash a Tesla.
We don't quite know [11:26.680 --> 11:31.920] if these types of vulnerabilities have been exploited in real life yet, but AI is becoming [11:31.920 --> 11:36.760] super ubiquitous, and obviously there are some bad actors out there that might want [11:36.760 --> 11:43.880] to hack these systems, which introduces a new type of attack vector. That's also something [11:43.880 --> 11:54.840] we need to care about. Both from the standpoint of AI practitioners and from a regulatory standpoint, testing [11:54.840 --> 12:02.680] just makes sense. Yann LeCun, chief AI scientist at Meta, actually took a stance at the [12:02.680 --> 12:10.320] beginning of last year on Twitter, saying that if you want to trust a system, you need tests. [12:10.320 --> 12:15.600] He was also making a slight criticism of some of the explainability methods, because two [12:15.600 --> 12:20.680] years ago, if you've followed that realm, people were saying, oh, you just need explainability [12:20.680 --> 12:28.520] and then your problems will go away. Well, that's just part of the answer. Lastly, and [12:28.520 --> 12:33.320] this was covered in some of the talks this morning in the big auditorium, there's a [12:33.320 --> 12:40.400] growing regulatory requirement to put some checks and balances in place. It also says [12:40.400 --> 12:47.000] that, specifically in case your AI system is high-risk, you need to put quality [12:47.000 --> 12:52.720] measures in place. The definition of high-risk AI systems is pretty broad. Obviously, [12:52.720 --> 12:59.240] you have anything related to infrastructure, like critical infrastructure, defense, et cetera, [12:59.240 --> 13:06.440] but you also have all AI systems that are involved in human resources, public service, [13:06.440 --> 13:10.880] and financial services, because these are considered, obviously, critical components [13:10.880 --> 13:22.760] of society. Now that we have kind of agreed that it's an important problem, here are some [13:22.760 --> 13:31.000] of the challenges, because if you've encountered some of these issues, you have probably looked [13:31.000 --> 13:39.720] at some easy solutions, drawing analogies from what you might do elsewhere. There are [13:39.720 --> 13:46.360] three points that make this problem of testing machine learning a bit special, meaning it's [13:46.360 --> 13:53.920] still very much a work in progress. Point one is that it is really not enough to check the [13:53.920 --> 14:01.600] data quality to guarantee the quality of a machine learning system. One of my co-founders, [14:01.600 --> 14:08.360] during his PhD, proved experimentally that you can have really clean [14:08.360 --> 14:16.680] data and a bad model. So you cannot just say it's an upstream problem; it's really [14:16.680 --> 14:22.440] a systems engineering problem, where you have to take the data, the machine learning model, and [14:22.440 --> 14:31.000] the user in context to analyze the system's properties. Moreover, the errors of machine learning [14:31.000 --> 14:40.160] systems are often caused by pieces of data that did not exist when the model was created: [14:40.160 --> 14:51.680] they were clean, but they did not exist. Second point: it's pretty hard to just copy-paste [14:51.680 --> 15:00.200] some of the testing methods from software into AI. One: yes, you can do some [15:00.200 --> 15:07.720] unit tests on machine learning models, but they won't prove much.
Because unit tests are built on the premise [15:07.720 --> 15:14.160] of a transactional system, and here things are moving quite a lot. Still, that's a good baseline. [15:14.160 --> 15:17.760] If you have a machine learning system and you have some unit tests, that's really step [15:17.760 --> 15:24.520] one. It's better to have that than to have nothing. But you have to embrace the fact [15:24.520 --> 15:32.880] that there has got to be a large number of test cases. You cannot just test on three, [15:32.880 --> 15:39.640] five, or a hundred cases; even a thousand cases will not be enough. The models themselves are probabilistic, [15:39.640 --> 15:47.000] so you have to take into account statistical testing methods. And lastly, and I think [15:47.000 --> 15:53.080] this is specific to AI: there have been systems before that were heavily dependent on [15:53.080 --> 16:01.200] data, but AI also came with the fact that you increase the number of data inputs [16:01.200 --> 16:07.640] compared to traditional systems. So you very quickly run into the issue that, well, it's a [16:07.640 --> 16:15.320] combinatorial problem, and it's practically impossible to generate all the combinations. [16:15.320 --> 16:25.120] A very simple example of that: how can you test an NLP system? Lastly, AI touches a lot [16:25.120 --> 16:30.160] of different points. If you want to have complete test coverage, you really need to [16:30.160 --> 16:36.560] take into account multiple criteria: performance of a system, but also robustness, robustness [16:36.560 --> 16:45.920] to errors, fairness, privacy, security, reliability. And also, and that's becoming an increasingly [16:45.920 --> 16:51.320] important topic with green AI: what is the carbon impact of this AI? Do you really [16:51.320 --> 16:59.040] need that many GPUs? Can you make your system a bit more energy efficient? So today, [16:59.040 --> 17:04.760] because I see we have 10 more minutes, I'll focus on two aspects: robustness [17:04.760 --> 17:17.000] and fairness. I'll start with robustness. Who has read or heard about this paper? Quick [17:17.000 --> 17:27.320] show of hands. Okay, one. So who has heard of behavioral testing? Because that's not [17:27.320 --> 17:36.440] machine learning specific. Yeah, cool. So Ribeiro, three years ago, along with the other co-authors [17:36.440 --> 17:46.480] of this paper, did what I think is a fantastic job of adapting behavioral testing, which [17:46.480 --> 17:53.120] is a really good practice from software engineering, to the context of machine learning, and specifically [17:53.120 --> 18:07.720] wrote something for NLP models. The main problem that this research paper aimed to solve was [18:07.720 --> 18:14.680] test case generation. Because with NLP, natural language [18:14.680 --> 18:26.640] processing, the input is by essence just raw text, and you need to test this. But [18:26.640 --> 18:39.880] what you can do is generate test cases that map [18:39.880 --> 18:48.640] changes in the input text to expectations. I'll give three examples, from very, very simple [18:48.640 --> 18:57.640] to a bit more complex. One is the principle of minimum functionality.
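As a rough illustration of the "statistical testing methods" point above, here is a sketch of a test that checks accuracy on a whole slice of data with a binomial test, instead of asserting on a single example; the model interface, threshold, and significance level are assumptions for the example, not from the talk.

```python
# A sketch of a statistical test rather than a single example-based unit test:
# we check that accuracy on a whole slice of data is above a threshold with
# statistical significance. Threshold and significance level are illustrative.
from scipy.stats import binomtest  # requires scipy >= 1.7

def test_accuracy_on_slice(model, X_slice, y_slice, threshold=0.80, alpha=0.05):
    correct = int((model.predict(X_slice) == y_slice).sum())
    n = len(y_slice)
    # H0: true accuracy on this slice <= threshold; reject if p-value < alpha
    result = binomtest(correct, n, p=threshold, alternative="greater")
    assert result.pvalue < alpha, (
        f"Accuracy {correct / n:.2%} on {n} cases is not significantly above {threshold:.0%}"
    )
```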
For example, if [18:57.640 --> 19:04.200] you are building a sentiment prediction machine learning system, you could just have a test [19:04.200 --> 19:12.960] that says: if the word "extraordinary" is in the sentence, the [19:12.960 --> 19:20.160] model should always predict a positive message. Now you will probably tell me, yeah, but what [19:20.160 --> 19:30.120] if the user has written "it's not extraordinary" or "absolutely not extraordinary"? And that [19:30.120 --> 19:37.920] actually brings me to the concept of test templates. Probably, for NLP, [19:37.920 --> 19:42.480] what you need to do, and this is obviously language specific, is start to have templates [19:42.480 --> 19:49.800] where you change the text by, for example, adding negations. And then you might want [19:49.800 --> 19:57.080] to test whether, when you add a negation, the prediction moves in a certain direction. Because normally, [19:57.080 --> 20:03.040] if the machine learning model has understood, it should, if it's about sentiment, understand [20:03.040 --> 20:10.760] that putting "not extraordinary" or "not good" should move the prediction in a given [20:10.760 --> 20:20.040] direction. So either you want your system to move in a certain direction, [20:20.040 --> 20:26.120] or there are cases where you want the opposite behavior: you want robustness. [20:26.120 --> 20:35.120] That's called invariance. So, for instance, you will want a system that is robust to typos, [20:35.120 --> 20:46.120] to just changing a location name, to putting in synonyms, et cetera, et cetera. [20:46.120 --> 20:54.280] So we've created this diagram to explain it. And it's a really thriving field in research. [20:54.280 --> 20:59.640] There is a lot of research going on these days about testing machine learning systems, [20:59.640 --> 21:07.680] and metamorphic testing is one of the leading methods to do that. The principle, if I [21:07.680 --> 21:14.880] take an analogy, and if you've worked in finance or have some friends [21:14.880 --> 21:24.720] who work there, is very similar to the principle of backtesting an investment strategy. You simulate different [21:24.720 --> 21:30.920] changes in the market conditions and you see how your strategy, your algorithm, behaves, [21:30.920 --> 21:42.960] what the variance of that strategy is. This concept applies very well to machine learning. [21:42.960 --> 21:55.400] So you need two things. You need, one, to define a perturbation. In NLP, as I was explaining earlier, [21:55.400 --> 22:02.800] a perturbation might be adding typos or adding negation. In another context, [22:02.800 --> 22:10.760] say a more industrial use case, it might be about doubling the values of some [22:10.760 --> 22:21.320] sensors or adding noise to an image. And then, pretty simply, you define a test expectation [22:21.320 --> 22:27.800] in terms of the metamorphic relation between the output distribution of the machine learning model and [22:27.800 --> 22:34.800] the distribution of the output after perturbation. And once you have that, and if you have enough [22:34.800 --> 22:40.000] data, then you can do actual statistical tests, see whether there's [22:40.000 --> 22:47.440] a difference in distribution, et cetera.
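Here is a minimal sketch of such a metamorphic invariance test for an NLP sentiment model: a typo perturbation plus a statistical comparison of the output distributions before and after. The perturbation function, the predict_proba interface on a list of texts, and the significance level are assumptions for illustration, not a specific library's API.

```python
# A minimal sketch of a metamorphic (invariance) test: apply a perturbation
# (simple adjacent-character swaps as "typos") and check that the output
# distribution does not shift significantly. All names here are illustrative.
import random
from scipy.stats import ks_2samp

def add_typos(text: str, n_typos: int = 1, seed: int = 0) -> str:
    """Swap a few adjacent characters to simulate typos."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_typos):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def test_invariance_to_typos(model, texts, alpha=0.05):
    # Positive-class probabilities before and after the perturbation
    original = model.predict_proba(texts)[:, 1]
    perturbed = model.predict_proba([add_typos(t) for t in texts])[:, 1]
    # Metamorphic relation: the two output distributions should be similar,
    # so a two-sample Kolmogorov-Smirnov test should NOT reject equality.
    result = ks_2samp(original, perturbed)
    assert result.pvalue > alpha, f"Output distribution shifted under typos (p={result.pvalue:.3f})"
```

A directional test (for example, adding a negation should lower the positive-class probability) would follow the same pattern, with the expectation pointing in one direction instead of requiring invariance.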
So I won't have too much time to dive into all [22:47.440 --> 22:54.000] the details of this, but we have written a technical guide on this topic, and you have a link in the [22:54.000 --> 23:03.680] QR code up there. Next, I'll talk a bit about a really tricky topic, which is AI fairness. [23:03.680 --> 23:08.760] And I want to emphasize that our recommendation, at least, is not to come at the [23:08.760 --> 23:18.040] problem of AI ethics with a closed mind or a top-down definition of "this is an ethical [23:18.040 --> 23:26.640] system" or "no, this is an unethical system." My co-founder did his PhD precisely on [23:26.640 --> 23:35.080] this topic and wrote a full paper on it, looking at the philosophical and sociological [23:35.080 --> 23:45.880] implications. And the gist of it is that, yes, to a certain extent, you can adopt [23:45.880 --> 23:53.280] a top-down approach to AI fairness, saying, well, for instance, as an organization, we [23:53.280 --> 24:01.560] want to test fairness on a few explicitly defined sensitive categories. You can say, well, we [24:01.560 --> 24:09.680] want to check for gender balance. We want to check for race balance. That is, if the [24:09.680 --> 24:15.480] country where you deploy machine learning allows you to collect this data, which is not always [24:15.480 --> 24:26.200] the case. But the challenge with these approaches is that, A, you might not have the data to [24:26.200 --> 24:33.560] measure this, and B, you may miss things, because often, when this exercise of defining the quality [24:33.560 --> 24:45.800] criteria for fairness and for balance is done, you only have a limited sample. So, [24:45.800 --> 24:53.680] drawing on some sociological analysis, it's really important to have this kind of top-down [24:53.680 --> 25:01.200] definition of AI ethics meet the reality on the ground, and to confront the actual users [25:01.200 --> 25:08.280] and the makers of the systems and get them to contribute to the definition of ethics, rather [25:08.280 --> 25:15.680] than a big organization that, to caricature a bit, says: AI ethics? Yeah, we [25:15.680 --> 25:22.000] wrote a charter about this. You read it, you sign it, and then, oops, you're [25:22.000 --> 25:33.440] ethical. Having said that, there are some good top-down metrics to adopt that are kind [25:33.440 --> 25:40.160] of a baseline, and I'll explain one of them, which is disparate impact. Disparate impact [25:40.160 --> 25:49.040] is actually a metric from the human resources management industry from at least 40 years [25:49.040 --> 25:59.480] ago, so it's not new. It's expressed in probabilities, but essentially it's about setting a rule [25:59.480 --> 26:09.320] of 80%: you define a positive outcome, you measure the probability of that positive outcome [26:09.320 --> 26:18.240] within a given protected population, and you say, well, I want the ratio of that probability [26:18.240 --> 26:26.360] to the probability of a positive outcome in the unprotected population [26:26.360 --> 26:40.520] to be above 80%. To make it more concrete: [26:40.520 --> 26:48.840] if, say, you're building a model to predict customer churn, [26:48.840 --> 26:56.320] and you want to check whether your model is biased or not for each class, this formula [26:56.320 --> 27:08.480] allows you to really define this metric and write a concrete test case.
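A small sketch of the disparate impact test described above, as it might be written for, say, a churn model. The 80% threshold is the rule from the talk; the function names and the way the protected group is encoded are illustrative assumptions.

```python
# Sketch of the disparate impact (80% rule) test described above.
# The column/group encoding is an illustrative assumption; only the
# ratio-above-0.8 rule comes from the talk.
import numpy as np

def disparate_impact(y_pred: np.ndarray, protected: np.ndarray) -> float:
    """Ratio of positive-outcome rates: P(y=1 | protected) / P(y=1 | unprotected)."""
    p_protected = y_pred[protected].mean()
    p_unprotected = y_pred[~protected].mean()
    return p_protected / p_unprotected

def test_disparate_impact(y_pred, protected, threshold=0.8):
    di = disparate_impact(np.asarray(y_pred), np.asarray(protected, dtype=bool))
    assert di >= threshold, f"Disparate impact {di:.2f} is below the {threshold:.0%} rule"

# Hypothetical usage: test_disparate_impact(model.predict(X_test), X_test["gender"] == "female")
```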
Right, so I just [27:08.480 --> 27:15.600] have three minutes, so I'll highlight one of the features that our project enables: [27:15.600 --> 27:22.080] putting in human feedback. It's really about having an interface where users, and not only data [27:22.080 --> 27:29.000] scientists, can change the parameters, so there's a link to metamorphic testing, and actually [27:29.000 --> 27:35.040] give human feedback to point out where the biases may be. And the benefit of this approach [27:35.040 --> 27:45.960] is that it allows the community to precisely define what they think the risks are. So sadly, [27:45.960 --> 27:56.240] we won't have time to do a demo, but this phase, which in our project we call the inspection [27:56.240 --> 28:03.240] phase, happens before you test, and this is super important. Again, it's one of [28:03.240 --> 28:09.480] the things where it's different from traditional software testing: before you even test, you [28:09.480 --> 28:14.960] need to confront yourself with the data and the model. That's actually where we think [28:14.960 --> 28:20.680] explainability methods really shine, because they allow you to debug and to identify [28:20.680 --> 28:26.440] the zones of risk, and this is precisely what helps, once you have qualified feedback, [28:26.440 --> 28:32.240] to know where you should put your testing effort. So, in a nutshell, what I'm saying for [28:32.240 --> 28:38.240] testing machine learning systems is that it's not a matter of creating hundreds of tests, of [28:38.240 --> 28:43.440] automating everything, but rather of having a good idea, from a fairness standpoint [28:43.440 --> 28:51.760] and from a performance standpoint, of the 10, 15, maybe at most 20 tests that you [28:51.760 --> 29:00.480] want in your platform. If you actually want to get started, this is our GitHub, [29:00.480 --> 29:04.560] and if you have a machine learning system to test, we're interested in your feedback.
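To give a feel for what such a small, curated suite of 10 to 20 tests might look like when wired into CI, here is a plain pytest sketch with a synthetic model; the data, thresholds, and test choices are illustrative assumptions and not the project's own API.

```python
# Sketch of a small, curated ML test suite run in CI (plain pytest, not any
# specific platform). Dummy data, model, and thresholds are illustrative; the
# point is the shape: a handful of targeted tests, not hundreds.
import numpy as np
import pytest
from sklearn.linear_model import LogisticRegression

@pytest.fixture(scope="session")
def model_and_data():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)
    model = LogisticRegression().fit(X[:400], y[:400])
    return model, X[400:], y[400:]

def test_performance(model_and_data):
    model, X, y = model_and_data
    assert (model.predict(X) == y).mean() >= 0.85      # performance criterion

def test_robustness_to_noise(model_and_data):
    model, X, _ = model_and_data
    X_noisy = X + np.random.default_rng(1).normal(scale=0.01, size=X.shape)
    flips = (model.predict(X) != model.predict(X_noisy)).mean()
    assert flips <= 0.05                                # robustness (invariance) criterion
```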