All right, everyone. I guess this is the last session for today. What I'm going to present is our project on whether we test enough for automated dependency updates. Before I delve into the presentation, does anyone actually have an answer to that question? Who wants to attempt an answer? "It depends", yeah, that's a great answer, and I think it points in the right direction.

A little bit about myself: my name is Joseph Hejderup. I'm a member of technical staff at Endor Labs, a startup, or by now more of a scale-up, based in Palo Alto, California. Before that, and in fact still, I am a PhD candidate at Delft University of Technology in the Netherlands, so quite close to Brussels. For the last six, seven, eight years of my life I have been quite involved in security research, and in developing techniques that apply program analysis to, for example, package repositories, to better understand what is going on within dependencies and dependency trees.

Let me first explain what I mean by automated dependency updates; I guess most of you already know. Whenever there is a new release on Maven Central, RubyGems, Cargo, or npm, a tool, and I only listed a couple of them, such as Dependabot or Renovate, opens a pull request in your repository. It creates a branch, tries to build it, and if that goes fine it usually moves to the next stage, which is running the tests, if you have it configured that way. If everything is fine (the slide shows an X mark, but imagine everything passes) you merge it, and in some cases, if you know it is not a problem, you merge it anyway. I think many of us have seen PRs like this one, updating a dependency from version 2.2 to 2.4. That is the essential workflow I am focusing on when I say automated dependency updates.

An interesting thing about automated dependency updates is the implicit promise that if you just run your tests, you are able to catch any kind of regression error, any problem the update might introduce into your code. As a researcher with a somewhat questioning nature, I felt that the test suites we usually have in projects are focused on the project's own code, and not so much on the third-party dependencies or libraries used in that code. That raised three questions. First: do we even write tests against dependencies in the first place? Second: do project tests even cover the usages of dependencies in the source code? And third: are tests alone sufficient to detect the bad updates you might encounter when relying on these tools for automated dependency updates? And of course there is a prior question: should we even write tests for dependencies at all? If we like reusing components from open-source package repositories, why should we write tests for them? Reuse gives us the ergonomics of just dropping anything into our code and moving on.
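To make that last question a bit more concrete, here is a small, hypothetical example of what a test written directly against a dependency could look like. It is not from the talk or the study; it just assumes a Java project that uses Apache Commons Lang and JUnit 5.

```java
// Hypothetical example (not from the study): a project test that pins down the
// behaviour we rely on in a third-party library (Apache Commons Lang, JUnit 5).
// If a dependency update changes how blank strings are treated, this test fails.
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.apache.commons.lang3.StringUtils;
import org.junit.jupiter.api.Test;

class DependencyContractTest {

    @Test
    void blankDetectionMatchesOurAssumptions() {
        // Our code treats whitespace-only input as "missing"; we depend on
        // StringUtils.isBlank behaving that way across dependency updates.
        assertTrue(StringUtils.isBlank("   "));
        assertTrue(StringUtils.isBlank(null));
        assertFalse(StringUtils.isBlank(" x "));
    }
}
```

A test like this encodes the behaviour the project relies on, so an update that changes that behaviour fails the build instead of slipping through unnoticed.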
To study these questions, I ran an empirical study on open-source projects, and that is what this talk is primarily centered around. The first thing I looked at in the study is the statement coverage of function calls to dependencies, which is similar to the line coverage you would get from a tool like JaCoCo, but aimed at dependency code. The second thing we focus on in the study is how effective tests are at detecting updates with regression errors. What we do there is find, or actually inject, regression errors in existing libraries, and then directly validate whether the project's test suite can detect them or not; that is essentially mutation testing analysis, and I think there was a talk about that earlier. The last part of the study starts from the observation that the current state of practice is to rely on test suites alone: could we use another technique to find problems, or to detect earlier the issues that might come with updating a dependency?

So, the first question: how can we do some kind of statement coverage, or at least get an idea of what exactly we are using in third-party libraries? We did this in two ways, and all of it in Java. First, we extracted all call sites we could find in the projects, and if a call site points to a third-party library in the bytecode, we count it as a usage. For transitive dependencies it works differently, because you are no longer in your own source code, where the call sites to direct dependencies live; you also need to go into the transitive ones, and there we can only approximate, it is not an exact measurement: we build static call graphs to get an idea of what could be used transitively by the project. Second, we did some instrumentation: we ran the tests of each project and recorded which functions in the dependencies were invoked. That tells us what is actually exercised and what is not touched at all. So first we statically derive what all the usages are, and then by running the tests we know which of those functions were covered or not, quite similar to code coverage.

We did this for around 521 GitHub projects. What we found very interesting is that for the direct dependencies of a project, about 60% of the used functions are, let's say, covered when running the tests. But when we go to transitive dependencies, the median was only 20%, which means that a lot of the transitive functions that may be used are not even reachable by tests. That should ring some alarm bells, because it means that if a dependency update touches an area that no test covers, it will still give you a green tick and you might merge it. I don't think many would knowingly do that, but it is exactly the kind of blind spot that raises questions about how effective tests are for automated updates.

And does this matter at all? A very telling case is Log4Shell, because I don't think many of us write tests that specifically target logging libraries. It is an instance of something we would not normally test, yet if an update of such a library introduces a breaking change, it becomes a real problem that nothing in the test suite will flag.
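To make the first measurement step a bit more concrete, here is a minimal sketch of extracting dependency call sites from project bytecode with the ASM library. This is a simplification for illustration, not the study's actual tooling, and the project package prefix is a made-up assumption.

```java
// Sketch: scan a compiled class and record every call site whose target class
// lies outside the project's own package, i.e. a candidate dependency usage.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class DependencyCallSiteScanner {

    private static final String PROJECT_PREFIX = "com/example/myapp/"; // assumption

    public static Set<String> scan(Path classFile) throws IOException {
        Set<String> dependencyCallSites = new HashSet<>();
        ClassReader reader = new ClassReader(Files.readAllBytes(classFile));
        reader.accept(new ClassVisitor(Opcodes.ASM9) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String signature, String[] exceptions) {
                return new MethodVisitor(Opcodes.ASM9) {
                    @Override
                    public void visitMethodInsn(int opcode, String owner, String name,
                                                String desc, boolean isInterface) {
                        // Ignore calls into java.*; anything else outside the
                        // project's own packages counts as a dependency usage.
                        if (!owner.startsWith(PROJECT_PREFIX) && !owner.startsWith("java/")) {
                            dependencyCallSites.add(owner + "." + name + desc);
                        }
                    }
                };
            }
        }, ClassReader.SKIP_DEBUG);
        return dependencyCallSites;
    }
}
```

Running something like this over all class files of a project yields the direct call sites into dependencies; the transitive side then needs a static call graph on top, as described above.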
Moving on to the second part of the study, which was about test effectiveness, measured through mutation testing. The underlying framework we used was Pitest (PIT), but we modified it to do things a little differently. To give a quick idea of what mutation testing is: you have a function, for example one that returns x + y, and you apply a mutation operator that swaps, say, the plus for a minus. You would then expect your test suite to catch this, because the behaviour has completely changed; it is no longer an addition. Normally with mutation testing you give it your whole project source code, it modifies that code, and it checks whether the test suite catches the change or not. What we did differently is that we mutated functions in the dependency code, not the project code at all, and we only mutated functions that were reachable by tests. As I said earlier, we ran the tests to know which dependency functions were executed, and those are the functions we applied the mutation operators to. From there we can see whether the test suite catches the change or not.

Before I go further: the alternative technique we investigated is called change impact analysis. Here we leverage static analysis, specifically call graphs. It works roughly like this. We have, say, version 1.0.2 and version 1.0.3 of a dependency. We compute a diff, and from the diff we find out which functions changed. In the example, we can see that in the bar and baz functions there is an arithmetic change, y-- became y++, and in baz there is also a call to a new function. What do we do next? We build a call graph of the application and its dependencies and then run a reachability analysis: we know that bar and baz changed, and we can see there is a reachable path from the project down to bar, and likewise a path to baz, where the new function call was introduced. Using this we can directly figure out, when there is a code change in a dependency, whether that change is reachable from your project in the first place. Why this is a nice complement to dynamic tests is that we derive from the source code what we actually use, so in the areas that tests do not cover we can still tell directly whether a change might affect your project.

Then comes the trickier part, which is semantic changes. It is nice to detect that a method changed, but sometimes it is just a simple refactoring, say a huge method split into a couple of smaller methods. The truth is that it is extremely difficult to know what exactly constitutes a semantic change, because a lot of factors play into it. So the only thing we did was focus on what we consider behavioural changes: we looked only at data-flow and control-flow changes. For example, a newly added method call counts as an interesting change, and so does a bigger change to the if-statements that introduces new logic in how the control flow works; those are the changes we consider worth following.
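Going back to the mutation side for a moment, here is a toy illustration of the operator swap described above. It is deliberately simplistic and only a sketch; it is not how Pitest or the modified pipeline works internally.

```java
// Toy illustration of mutation testing: the "dependency" function add is mutated
// from x + y to x - y, and a project test that exercises the call kills the mutant.
public class MutationDemo {

    // Original dependency function.
    static int add(int x, int y) {
        return x + y;
    }

    // Mutant produced by an arithmetic-operator-replacement operator.
    static int addMutated(int x, int y) {
        return x - y;
    }

    public static void main(String[] args) {
        // Stand-in for a project test: it passes against the original...
        assertEquals(5, add(2, 3));
        // ...and fails against the mutant, i.e. the test "kills" it.
        assertEquals(5, addMutated(2, 3)); // throws AssertionError
    }

    static void assertEquals(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }
}
```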
Putting this together, I implemented a tool called Uppdatera, which means "update" in Swedish, and applied it to real projects. Its report essentially shows which function changed, for example an RxJava subscriber's onError, whether it is reachable from the project, and exactly how it is reachable through the code. In the second section of the report it shows the major changes inside that function, which gives you some context about what actually changed, rather than just telling you that the tests passed or failed.

Using the mutation pipeline I was explaining, we generated around one million artificial updates by introducing such regressions, and we did this on 262 GitHub projects. What we found is that project test suites on average detect 37% of those changes, which means a lot of changes go unnoticed in general. If you use static analysis, now that you have the whole-program context, we were able to detect 72% of all those changes. What we find most interesting is that, in the context of this study, there are basically no guarantees that tests can prevent bad updates, and neither of the two techniques alone is good enough to ensure that updates are safe.

Of course, static analysis is not perfect either; it has its own problems, mainly over-approximation, and that shows up in two places. One is the call graphs themselves: with dynamic dispatch, if there are, say, 200 implementations behind an interface call, we have to link to all of them, and that can generate false positives. The other is the semantic changes we detect, because we do not know exactly what kind of semantic change each one really is.

To see how this works in practice, we also analysed and applied this to 22 Dependabot PRs. In general, using static analysis we were able to detect three unused dependencies: cases where the tests would happily pass whatever came in, but the dependencies turned out not to be used at all. We were also able to prevent three breaking updates, one of which was confirmed by a developer, where the tests were not able to detect the problem. And of course there were false positives: as I mentioned, many cases were refactorings, plus the over-approximated call paths. So a tool like this, built on static analysis, can help prevent bad updates, but you also get a fair amount of noise as a result.

Coming towards the end of the study: what are the recommendations I have after looking at how GitHub projects test and update their dependencies?
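Before the recommendations, here is a small sketch of the reachability check that the change-impact side above relies on. It is a simplification over a call graph stored as an adjacency map, not Uppdatera's implementation, and all names in it are illustrative.

```java
// Sketch: given a call graph and the set of functions changed between two
// dependency versions, a breadth-first search from the project's entry points
// reports which changed functions are actually reachable.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ChangeImpactCheck {

    public static Set<String> reachableChanges(Map<String, Set<String>> callGraph,
                                               Set<String> entryPoints,
                                               Set<String> changedFunctions) {
        Set<String> visited = new HashSet<>(entryPoints);
        Deque<String> worklist = new ArrayDeque<>(entryPoints);
        Set<String> impacted = new HashSet<>();

        while (!worklist.isEmpty()) {
            String current = worklist.poll();
            if (changedFunctions.contains(current)) {
                impacted.add(current); // a changed dependency function is reachable
            }
            for (String callee : callGraph.getOrDefault(current, Set.of())) {
                if (visited.add(callee)) {
                    worklist.add(callee);
                }
            }
        }
        return impacted; // empty set: the diff does not touch anything we reach
    }
}
```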
One thing I found missing when it comes to updating with test suites is some form of confidence score. What I mean by that is, for example: if we start measuring test coverage of dependencies, we can check whether a changed function in a third-party library is even reached by our tests, and that directly indicates whether my test suite is able to catch a problem there. Another interesting signal could be how tightly a library is integrated with your project; that also says something about whether you have enough tests to cover that usage at all. Having such a score would give you an indication of how well your test suite is able to catch problems in third-party libraries, and this is something I would like to see in tooling in general.

Then there are the gaps in test coverage, which relates to the results on statement coverage and effectiveness. I believe in a hybrid solution: where tests, or dynamic analysis, are able to capture changes, we should use them, because they are more precise. But in the areas of the code where we have no coverage, think back to the Log4j example, where I would not normally expect much test coverage, it would be nice to complement them with static analysis, so you get a bit of the best of both worlds. Another advantage of static analysis over running tests is that we could flag potential incompatibilities much earlier, rather than pushing every update through the build system and consuming extra resources on builds and tests. Those are the two main things I find important to address.

Then, for users like myself of these automated dependency updating tools: although reuse is free in the sense that we can easily pull in a library, we often forget the operational and maintenance costs, and those are not free. Trying to automate everything away with tooling is not always the solution. Once we adopt a library, we also need to think about how we maintain that relationship and understand what risks might come with it. It could be, for example, that the maintainers handle security vulnerabilities very differently from what you expect, or it could be the release protocol: there can be disagreement about what counts as a breaking change for clients. Having that awareness is one important thing. The other is, of course, not blindly trusting automated dependency updates, and I guess no one really does that. And then there is a more debatable point: writing tests for critical dependencies, meaning libraries that are critical to your project. Having such tests could help catch issues in dependencies early, so they do not come back as an unwanted breaking change after you have merged the automated PR.
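To make the confidence-score idea a little more concrete, here is a rough sketch of what such a score could compute: the fraction of changed dependency functions that the project's tests actually reach. This is purely an illustration of the recommendation, not an existing tool, and the function names are made up.

```java
// Sketch of a "confidence score" for an update: how many of the dependency
// functions changed by the update are exercised by the project's test suite.
import java.util.HashSet;
import java.util.Set;

public class UpdateConfidence {

    public static double score(Set<String> changedFunctions,
                               Set<String> testExercisedFunctions) {
        if (changedFunctions.isEmpty()) {
            return 1.0; // nothing we use has changed, so the update looks safe
        }
        Set<String> covered = new HashSet<>(changedFunctions);
        covered.retainAll(testExercisedFunctions);
        return (double) covered.size() / changedFunctions.size();
    }

    public static void main(String[] args) {
        Set<String> changed = Set.of("lib.Json.parse", "lib.Json.escape");
        Set<String> exercised = Set.of("lib.Json.parse");
        // Prints 0.5: only half of the changed functions are reached by our tests.
        System.out.println(score(changed, exercised));
    }
}
```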
If you want to know more about this work, there is a paper, and I have also uploaded the slides to the FOSDEM website, so you can follow the link; the paper is open access. That more or less concludes my talk, and I am happy to take any questions.

Q: Do you know whether any of these bots, like Dependabot or Renovate, are working on such a score, so that the merge request gets a warning like "hey, your tests only cover 10 percent of this dependency's usage"? Is there any work on that?

A: What I am aware of is a compatibility score, which looks, for a particular dependency version update, at how it went for other projects: if, say, 100 out of 200 PRs for that update were successful elsewhere, it tells you there is a 50 percent chance the update will succeed for you. The thing I find problematic is that every project has its own specific use case and context for how it uses the dependency, so that can be misleading. But I have not heard of anything that looks specifically into your own test suite to see how well it can catch problems with the update.

Q: Thank you. You mentioned the number of 60 percent for tests covering direct dependencies, and I believe it was a lot less for transitive dependencies. Do you have any numbers on the amount of transitive dependencies in those chains? I can imagine that the 60 percent is cumulative.

A: Do you mean for the statement coverage part?

Q: Yeah, the first one.

A: For the first part, the 60 percent was on direct dependencies and the 20 percent on the transitive ones.

Q: Do you have numbers on the amount of transitive dependencies, so you can relate them to that 60 percent?

A: I did this on around 500 projects, and the more specific numbers should be in the paper.

Q: You have been looking at detecting errors. Have you looked at the other side? You could use it in a hybrid mode where your tool tells me "you can make this update for sure, because none of the code that changed is code you care about." For example, with low-level libraries like Apache Commons you only use a part of the library, but you want to keep up to date, and some updates are more or less completely safe because you do not touch any code that has changed, only new features have been added. It would also help to just know: yes, that one is safe.

A: Yeah, that's a great question. This is part of the idea we had with introducing call graphs, because with call graphs you can learn what exactly is used. Even if you depend on a big library and only use, say, two utility classes from it, and even if you move to a new major version, you might not be affected, and that is something the call graph should show: we would see that there are no changes in those utility classes, while the rest of the package has a lot of changes that you are unaffected by.

Q: Thank you. Did you check how the call graphs work with dynamic dependency injection?

A: If I understood the question right: we did generate dynamic call graphs by running the tests, and we used those to guide the mutation testing framework to only make changes in functions that the tests actually touch, because otherwise we would not know whether the tests were able to detect the changes or not.