All right, everyone. I guess this is the last session for today. What I'm going to present is our project on whether we test enough for automated dependency updates. Before I delve into the presentation, does anyone actually have an answer to that question? Who wants to attempt an answer? "It depends", yeah, that's a great answer, and I think it points in the right direction.

A little bit about myself: my name is Joseph Hejderup. I'm a member of technical staff at Endor Labs, a startup, or by now more of a scale-up, based in Palo Alto, California. Before that, and in fact still, I am a PhD candidate at Delft University of Technology in the Netherlands, so quite close to Brussels. For the last six, seven, eight years of my life I have been quite involved in security research, and in developing techniques that apply program analysis to, for example, package repositories, to better understand what is going on within dependencies and dependency trees.

Let me first explain what I mean by automated dependency updates; I guess most of you already know. Whenever there is a new release on Maven Central, RubyGems, Cargo, or npm, a tool, and I only listed a couple of them, such as Dependabot or Renovate, opens a pull request in your repository. It creates a branch, tries to build it, and if that goes fine it usually moves to the next stage, which is running the tests, if you have it configured that way. If everything is fine (the slide shows an X mark, but imagine everything passes) you merge it, and in some cases, if you know it is not a problem, you merge it anyway. I think many of us have seen PRs like this one, updating a dependency from version 2.2 to 2.4. That is the essential workflow I am focusing on when I say automated dependency updates.

An interesting thing about automated dependency updates is the implicit promise that if you just run your tests, you are able to catch any kind of regression error, any problem the update might introduce into your code. As a researcher with a somewhat questioning nature, I felt that the test suites we usually have in projects are focused on the project's own code, and not so much on the third-party dependencies or libraries used in that code. That raised three questions. First: do we even write tests against dependencies in the first place? Second: do project tests even cover the usages of dependencies in the source code? And third: are tests alone sufficient to detect the bad updates you might encounter when relying on these tools for automated dependency updates? And of course there is a prior question: should we even write tests for dependencies at all? If we like reusing components from open-source package repositories, why should we write tests for them? Reuse gives us the ergonomics of just dropping anything into our code and moving on.
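To make that last question a bit more concrete, here is a small, hypothetical example of what a test written directly against a dependency could look like. It is not from the talk or the study; it just assumes a Java project that uses Apache Commons Lang and JUnit 5.

```java
// Hypothetical example (not from the study): a project test that pins down the
// behaviour we rely on in a third-party library (Apache Commons Lang, JUnit 5).
// If a dependency update changes how blank strings are treated, this test fails.
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.apache.commons.lang3.StringUtils;
import org.junit.jupiter.api.Test;

class DependencyContractTest {

    @Test
    void blankDetectionMatchesOurAssumptions() {
        // Our code treats whitespace-only input as "missing"; we depend on
        // StringUtils.isBlank behaving that way across dependency updates.
        assertTrue(StringUtils.isBlank("   "));
        assertTrue(StringUtils.isBlank(null));
        assertFalse(StringUtils.isBlank(" x "));
    }
}
```

A test like this encodes the behaviour the project relies on, so an update that changes that behaviour fails the build instead of slipping through unnoticed.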
To study these questions, I ran an empirical study on open-source projects, and that is what this talk is primarily centered around. The first thing I looked at in the study is the statement coverage of function calls to dependencies, which is similar to the line coverage you would get from a tool like JaCoCo, but aimed at dependency code. The second thing we focus on in the study is how effective tests are at detecting updates with regression errors. What we do there is find, or actually inject, regression errors in existing libraries, and then directly validate whether the project's test suite can detect them or not; that is essentially mutation testing analysis, and I think there was a talk about that earlier. The last part of the study starts from the observation that the current state of practice is to rely on test suites alone: could we use another technique to find problems, or to detect earlier the issues that might come with updating a dependency?

So, the first question: how can we do some kind of statement coverage, or at least get an idea of what exactly we are using in third-party libraries? We did this in two ways, and all of it in Java. First, we extracted all call sites we could find in the projects, and if a call site points to a third-party library in the bytecode, we count it as a usage. For transitive dependencies it works differently, because you are no longer in your own source code, where the call sites to direct dependencies live; you also need to go into the transitive ones, and there we can only approximate, it is not an exact measurement: we build static call graphs to get an idea of what could be used transitively by the project. Second, we did some instrumentation: we ran the tests of each project and recorded which functions in the dependencies were invoked. That tells us what is actually exercised and what is not touched at all. So first we statically derive what all the usages are, and then by running the tests we know which of those functions were covered or not, quite similar to code coverage.

We did this for around 521 GitHub projects. What we found very interesting is that for the direct dependencies of a project, about 60% of the used functions are, let's say, covered when running the tests. But when we go to transitive dependencies, the median was only 20%, which means that a lot of the transitive functions that may be used are not even reachable by tests. That should ring some alarm bells, because it means that if a dependency update touches an area that no test covers, it will still give you a green tick and you might merge it. I don't think many would knowingly do that, but it is exactly the kind of blind spot that raises questions about how effective tests are for automated updates.

And does this matter at all? A very telling case is Log4Shell, because I don't think many of us write tests that specifically target logging libraries. It is an instance of something we would not normally test, yet if an update of such a library introduces a breaking change, it becomes a real problem that nothing in the test suite will flag.
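To make the first measurement step a bit more concrete, here is a minimal sketch of extracting dependency call sites from project bytecode with the ASM library. This is a simplification for illustration, not the study's actual tooling, and the project package prefix is a made-up assumption.

```java
// Sketch: scan a compiled class and record every call site whose target class
// lies outside the project's own package, i.e. a candidate dependency usage.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class DependencyCallSiteScanner {

    private static final String PROJECT_PREFIX = "com/example/myapp/"; // assumption

    public static Set<String> scan(Path classFile) throws IOException {
        Set<String> dependencyCallSites = new HashSet<>();
        ClassReader reader = new ClassReader(Files.readAllBytes(classFile));
        reader.accept(new ClassVisitor(Opcodes.ASM9) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String signature, String[] exceptions) {
                return new MethodVisitor(Opcodes.ASM9) {
                    @Override
                    public void visitMethodInsn(int opcode, String owner, String name,
                                                String desc, boolean isInterface) {
                        // Ignore calls into java.*; anything else outside the
                        // project's own packages counts as a dependency usage.
                        if (!owner.startsWith(PROJECT_PREFIX) && !owner.startsWith("java/")) {
                            dependencyCallSites.add(owner + "." + name + desc);
                        }
                    }
                };
            }
        }, ClassReader.SKIP_DEBUG);
        return dependencyCallSites;
    }
}
```

Running something like this over all class files of a project yields the direct call sites into dependencies; the transitive side then needs a static call graph on top, as described above.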
Moving on to the second part of the study, which was about test effectiveness, measured through mutation testing. The underlying framework we used was Pitest (PIT), but we modified it to do things a little differently. To give a quick idea of what mutation testing is: you have a function, for example one that returns x + y, and you apply a mutation operator that swaps, say, the plus for a minus. You would then expect your test suite to catch this, because the behaviour has completely changed; it is no longer an addition. Normally with mutation testing you give it your whole project source code, it modifies that code, and it checks whether the test suite catches the change or not. What we did differently is that we mutated functions in the dependency code, not the project code at all, and we only mutated functions that were reachable by tests. As I said earlier, we ran the tests to know which dependency functions were executed, and those are the functions we applied the mutation operators to. From there we can see whether the test suite catches the change or not.

Before I go further: the alternative technique we investigated is called change impact analysis. Here we leverage static analysis, specifically call graphs. It works roughly like this. We have, say, version 1.0.2 and version 1.0.3 of a dependency. We compute a diff, and from the diff we find out which functions changed. In the example, we can see that in the bar and baz functions there is an arithmetic change, y-- became y++, and in baz there is also a call to a new function. What do we do next? We build a call graph of the application and its dependencies and then run a reachability analysis: we know that bar and baz changed, and we can see there is a reachable path from the project down to bar, and likewise a path to baz, where the new function call was introduced. Using this we can directly figure out, when there is a code change in a dependency, whether that change is reachable from your project in the first place. Why this is a nice complement to dynamic tests is that we derive from the source code what we actually use, so in the areas that tests do not cover we can still tell directly whether a change might affect your project.

Then comes the trickier part, which is semantic changes. It is nice to detect that a method changed, but sometimes it is just a simple refactoring, say a huge method split into a couple of smaller methods. The truth is that it is extremely difficult to know what exactly constitutes a semantic change, because a lot of factors play into it. So the only thing we did was focus on what we consider behavioural changes: we looked only at data-flow and control-flow changes. For example, a newly added method call counts as an interesting change, and so does a bigger change to the if-statements that introduces new logic in how the control flow works; those are the changes we consider worth following.
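Going back to the mutation side for a moment, here is a toy illustration of the operator swap described above. It is deliberately simplistic and only a sketch; it is not how Pitest or the modified pipeline works internally.

```java
// Toy illustration of mutation testing: the "dependency" function add is mutated
// from x + y to x - y, and a project test that exercises the call kills the mutant.
public class MutationDemo {

    // Original dependency function.
    static int add(int x, int y) {
        return x + y;
    }

    // Mutant produced by an arithmetic-operator-replacement operator.
    static int addMutated(int x, int y) {
        return x - y;
    }

    public static void main(String[] args) {
        // Stand-in for a project test: it passes against the original...
        assertEquals(5, add(2, 3));
        // ...and fails against the mutant, i.e. the test "kills" it.
        assertEquals(5, addMutated(2, 3)); // throws AssertionError
    }

    static void assertEquals(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }
}
```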
Putting this together, I implemented a tool called Uppdatera, which means "update" in Swedish, and applied it to real projects. Its report essentially shows which function changed, for example an RxJava subscriber's onError, whether it is reachable from the project, and exactly how it is reachable through the code. In the second section of the report it shows the major changes inside that function, which gives you some context about what actually changed, rather than just telling you that the tests passed or failed.

Using the mutation pipeline I was explaining, we generated around one million artificial updates by introducing such regressions, and we did this on 262 GitHub projects. What we found is that project test suites on average detect 37% of those changes, which means a lot of changes go unnoticed in general. If you use static analysis, now that you have the whole-program context, we were able to detect 72% of all those changes. What we find most interesting is that, in the context of this study, there are basically no guarantees that tests can prevent bad updates, and neither of the two techniques alone is good enough to ensure that updates are safe.

Of course, static analysis is not perfect either; it has its own problems, mainly over-approximation, and that shows up in two places. One is the call graphs themselves: with dynamic dispatch, if there are, say, 200 implementations behind an interface call, we have to link to all of them, and that can generate false positives. The other is the semantic changes we detect, because we do not know exactly what kind of semantic change each one really is.

To see how this works in practice, we also analysed and applied this to 22 Dependabot PRs. In general, using static analysis we were able to detect three unused dependencies: cases where the tests would happily pass whatever came in, but the dependencies turned out not to be used at all. We were also able to prevent three breaking updates, one of which was confirmed by a developer, where the tests were not able to detect the problem. And of course there were false positives: as I mentioned, many cases were refactorings, plus the over-approximated call paths. So a tool like this, built on static analysis, can help prevent bad updates, but you also get a fair amount of noise as a result.

Coming towards the end of the study: what are the recommendations I have after looking at how GitHub projects test and update their dependencies?
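Before the recommendations, here is a small sketch of the reachability check that the change-impact side above relies on. It is a simplification over a call graph stored as an adjacency map, not Uppdatera's implementation, and all names in it are illustrative.

```java
// Sketch: given a call graph and the set of functions changed between two
// dependency versions, a breadth-first search from the project's entry points
// reports which changed functions are actually reachable.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ChangeImpactCheck {

    public static Set<String> reachableChanges(Map<String, Set<String>> callGraph,
                                               Set<String> entryPoints,
                                               Set<String> changedFunctions) {
        Set<String> visited = new HashSet<>(entryPoints);
        Deque<String> worklist = new ArrayDeque<>(entryPoints);
        Set<String> impacted = new HashSet<>();

        while (!worklist.isEmpty()) {
            String current = worklist.poll();
            if (changedFunctions.contains(current)) {
                impacted.add(current); // a changed dependency function is reachable
            }
            for (String callee : callGraph.getOrDefault(current, Set.of())) {
                if (visited.add(callee)) {
                    worklist.add(callee);
                }
            }
        }
        return impacted; // empty set: the diff does not touch anything we reach
    }
}
```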
One thing I found missing when it comes to updating with test suites is some form of confidence score. What I mean by that is, for example: if we start measuring test coverage of dependencies, we can check whether a changed function in a third-party library is even reached by our tests, and that directly indicates whether my test suite is able to catch a problem there. Another interesting signal could be how tightly a library is integrated with your project; that also says something about whether you have enough tests to cover that usage at all. Having such a score would give you an indication of how well your test suite is able to catch problems in third-party libraries, and this is something I would like to see in tooling in general.

Then there are the gaps in test coverage, which relates to the results on statement coverage and effectiveness. I believe in a hybrid solution: where tests, or dynamic analysis, are able to capture changes, we should use them, because they are more precise. But in the areas of the code where we have no coverage, think back to the Log4j example, where I would not normally expect much test coverage, it would be nice to complement them with static analysis, so you get a bit of the best of both worlds. Another advantage of static analysis over running tests is that we could flag potential incompatibilities much earlier, rather than pushing every update through the build system and consuming extra resources on builds and tests. Those are the two main things I find important to address.

Then, for users like myself of these automated dependency updating tools: although reuse is free in the sense that we can easily pull in a library, we often forget the operational and maintenance costs, and those are not free. Trying to automate everything away with tooling is not always the solution. Once we adopt a library, we also need to think about how we maintain that relationship and understand what risks might come with it. It could be, for example, that the maintainers handle security vulnerabilities very differently from what you expect, or it could be the release protocol: there can be disagreement about what counts as a breaking change for clients. Having that awareness is one important thing. The other is, of course, not blindly trusting automated dependency updates, and I guess no one really does that. And then there is a more debatable point: writing tests for critical dependencies, meaning libraries that are critical to your project. Having such tests could help catch issues in dependencies early, so they do not come back as an unwanted breaking change after you have merged the automated PR.
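To make the confidence-score idea a little more concrete, here is a rough sketch of what such a score could compute: the fraction of changed dependency functions that the project's tests actually reach. This is purely an illustration of the recommendation, not an existing tool, and the function names are made up.

```java
// Sketch of a "confidence score" for an update: how many of the dependency
// functions changed by the update are exercised by the project's test suite.
import java.util.HashSet;
import java.util.Set;

public class UpdateConfidence {

    public static double score(Set<String> changedFunctions,
                               Set<String> testExercisedFunctions) {
        if (changedFunctions.isEmpty()) {
            return 1.0; // nothing we use has changed, so the update looks safe
        }
        Set<String> covered = new HashSet<>(changedFunctions);
        covered.retainAll(testExercisedFunctions);
        return (double) covered.size() / changedFunctions.size();
    }

    public static void main(String[] args) {
        Set<String> changed = Set.of("lib.Json.parse", "lib.Json.escape");
        Set<String> exercised = Set.of("lib.Json.parse");
        // Prints 0.5: only half of the changed functions are reached by our tests.
        System.out.println(score(changed, exercised));
    }
}
```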
If you want to know more about this work, there is a paper, and I have also uploaded the slides to the FOSDEM website, so you can follow the link; the paper is open access. That more or less concludes my talk, and I am happy to take any questions.

Q: Do you know whether any of these bots, like Dependabot or Renovate, are working on such a score, so that the merge request gets a warning like "hey, your tests only cover 10 percent of this dependency's usage"? Is there any work on that?

A: What I am aware of is a compatibility score, which looks, for a particular dependency version update, at how it went for other projects: if, say, 100 out of 200 PRs for that update were successful elsewhere, it tells you there is a 50 percent chance the update will succeed for you. The thing I find problematic is that every project has its own specific use case and context for how it uses the dependency, so that can be misleading. But I have not heard of anything that looks specifically into your own test suite to see how well it can catch problems with the update.

Q: Thank you. You mentioned the number of 60 percent for tests covering direct dependencies, and I believe it was a lot less for transitive dependencies. Do you have any numbers on the amount of transitive dependencies in those chains? I can imagine that the 60 percent is cumulative.

A: Do you mean for the statement coverage part?

Q: Yeah, the first one.

A: For the first part, the 60 percent was on direct dependencies and the 20 percent on the transitive ones.

Q: Do you have numbers on the amount of transitive dependencies, so you can relate them to that 60 percent?

A: I did this on around 500 projects, and the more specific numbers should be in the paper.

Q: You have been looking at detecting errors. Have you looked at the other side? You could use it in a hybrid mode where your tool tells me "you can make this update for sure, because none of the code that changed is code you care about." For example, with low-level libraries like Apache Commons you only use a part of the library, but you want to keep up to date, and some updates are more or less completely safe because you do not touch any code that has changed, only new features have been added. It would also help to just know: yes, that one is safe.

A: Yeah, that's a great question. This is part of the idea we had with introducing call graphs, because with call graphs you can learn what exactly is used. Even if you depend on a big library and only use, say, two utility classes from it, and even if you move to a new major version, you might not be affected, and that is something the call graph should show: we would see that there are no changes in those utility classes, while the rest of the package has a lot of changes that you are unaffected by.

Q: Thank you. Did you check how the call graphs work with dynamic dependency injection?

A: If I understood the question right: we did generate dynamic call graphs by running the tests, and we used those to guide the mutation testing framework to only make changes in functions that the tests actually touch, because otherwise we would not know whether the tests were able to detect the changes or not.