Okay, we'll get started. Thank you, everybody, for being here today. We're so pleased and honored to join you and talk about a project that the research team we both work on has been thinking about for a long time, which includes questions like: what information is missing when you study open source by looking only at the repository? I will do my best. Something happened last night, I can't explain it. Yes.

So in this talk we are going to explore some stories about what can go wrong, and how we as researchers and practitioners in the community can work to find best practices for open source ecosystems research. In case it helps you, our slides and speaker notes can be found at, and I will spell it out, bit.ly/beyond-the-repo-fosdem24.

Hi, I'm Julia Ferraioli. I work at AWS as their open source AI/ML strategist, but that's not what I'm here for today. I have had one foot in research and one foot in practice for basically my entire career, and I'm especially interested in what motivates people within open source, as well as the potential for an understanding of those motivations to increase the resiliency of our digital infrastructure.

And hello, I'm Amanda Casari. Please keep reminding me to speak up if I forget. I work in Google's open source programs office as an engineer. I also continue to work around data and AI, and I'm also a researcher, with an emphasis on looking at open source through an intersectional and feminist complexity lens.

And our co-author who could not be here today is Dr. Juniper Lovato. She is a multidisciplinary complex systems scientist who is a research assistant professor of computer science at the University of Vermont, and her work explores the ethical and governance issues related to socio-technical systems.

Through our work, we were struck repeatedly by how much of the existing research around open source and digital infrastructure lacked context from those working within the ecosystems. We also saw how these research findings were making their way into open source ecosystems and infrastructure, even when they weren't necessarily applicable. Those observations led us to establish at least the start of some best practices for research in open source, which, combined with others that we as a community develop together, can help us better study open source ecosystems as the socio-technical systems that they are.

So the basic, fundamental assumption here is that open source is, and has always been, much more than a repository. It's a complex, multi-level ecosystem of human contributors who collaborate and cooperate to achieve shared creative endeavors. And we are collectively also part of open source communities in our work and our passions. We're collectively a dynamic socio-technical system, both people and technology, that is always in production and evolving towards a distributed goal.

We're also, unfortunately, and I say unfortunately just because of some problems that come from it, a very attractive research space for scientists, especially scientists who study socio-technical systems, because open source ecosystems are so data rich, have such a long history, and have many exciting applications for understanding society as it is, whether that's governance, cybersecurity, or team dynamics. However, we frequently see science that focuses only on repository data, and that gives a limited snapshot of the wider ecosystem and ignores many of the explanatory variables of social systems.
And that's just one missing data point that we're concerned about and have done some work on. When we talk here about open source ecosystems, the reason we stress the ecosystems piece is because we're referring to the collection of repositories, technology, infrastructure, communities, interactions, incentives, behavioral norms, and culture. Studying these as a whole requires community cooperation and participation to understand all the interacting and interdependent parts of the system. At the heart of all of our ecosystems are humans; it's us. And our collaborations and outputs reflect the social, emotional, and technical labor of a group of individuals moving towards, again, the shared and distributed goal that we all have.

One still unsolved problem, and I want to stress both of these groups, is when both industry researchers and academic researchers, two separate groups with different outcomes and incentives, overlook data from open source ecosystems as part of a review process. And this is because data from open source, if you're not familiar with institutional review boards, is usually obtained through scraping or APIs, and it falls into a category called secondary data. And secondary data, by some rubrics, is not centered on humans and doesn't require consent from research subjects. But when you are using data about people, that is inherently data about someone, and it transforms them into a research subject. When this happens, we as a community, we as individuals, are unaware that we're being studied. And then a paper comes out talking about the open source project that you worked on, and you're like, well, first of all, I didn't realize any of this was going on, but also, do I agree with this? And you never had the opportunity to give consent as a research subject in that case.

Juniper actually just recently finished her PhD, and we're super excited and happy for her. She has been working in the field of data ethics and looking at this intersection. And she shared with us the advice from her PhD advisor, which is: just because something is permissible does not mean it is ethical. So for example, just because open source repositories are public, and sometimes permissible to scrape, it doesn't always mean they're ethically fair game for any use without the community's consent.

But I would like to give a positive example of this, by the way. There was a group of researchers in 2022 who did work with the community to learn more about what was possible with repository data. Their research questions did center around the repos themselves. This is Kuutila et al., and they published an excellent paper looking at open source and maintainer well-being. They took a mixed methods approach: they looked at quantitative signals, they did a diary study, they did interviews. And they actually determined that it was not possible to determine maintainer well-being from any of the signals they studied purely off the quantitative work. Now, this usually doesn't get published, right? Normally there's a hypothesis, they find that it didn't work out, and then it goes into the drawer, it goes somewhere else. But this time they actually published that, no, these things are not correlated, you cannot find these. There are too many individual confounders for us to be able to say these signals will work, without working with the community.
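To make that point about so-called secondary data concrete: even the most routine repository API query returns information about identifiable people, not just code. Here is a minimal sketch in Python, assuming a hypothetical public repository and the public GitHub commits API (unauthenticated requests are rate-limited, so this is illustrative only):

```python
# Minimal sketch: "secondary" repository data is still data about people.
# The repository name below is hypothetical; the endpoint is the public
# GitHub REST API for listing commits.
import requests

OWNER = "example-org"      # hypothetical owner
REPO = "example-project"   # hypothetical repository

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/commits",
    params={"per_page": 5},
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
resp.raise_for_status()

for item in resp.json():
    author = item["commit"]["author"]
    # Every record names a person: a name, an email address, and a
    # timestamp showing when they were active.
    print(author["name"], author["email"], author["date"])
```

Every one of those records describes a person and their behavior; calling the dataset secondary does not change that.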
And that Kuutila study is honestly the kind of research I would love to see continue, to encourage, and for us to participate in, because it's breaking down mental models: not only the ones that exist about this rich community, but also the ones that might hinder us, or be used out of context in a way that does not serve us. And that allows us to build trust with folks who are trying to understand us, and for ourselves to understand ourselves better in a way that, again, is not just being observed, but is participatory. And that's because open source data is not just code; it's the collaborative labor of a group of people. For all of this, the best practice we want to stress as researchers is: don't throw away the socio element, don't ignore the fact that there are humans as part of these problems. That also helps us remember that we should be treating each other with care and respect, because when we do this work, we are ultimately part of the same system.

And now, as researchers and practitioners, we're at a really critical moment. I'm sure everybody's been talking about critical moments all day; open source is in a critical moment. Research as well is in some very interesting critical moments, especially around questions of open data: how data should be open, whether it should be open, what its use cases are. And community members are themselves subject matter experts. We hold a wealth of knowledge and lived experience; we know these systems best. As collaborators, we lend much more experience to these projects than when we are silently studied or left out. So involving communities through participatory methods will help researchers better understand the systems they're studying, what's missing, and what is truly available for purposes of research.

Another point we'd like to make, and I talked a little bit about context already: Dr. Helen Nissenbaum puts forth the concept of contextual integrity. That is the idea that the protection of privacy is tied to the norms of the specific context from which the information is gathered. There are memes around about things overheard out of context, things taken out of context. But it applies differently here when we're talking about research, which now fundamentally impacts things like funding, people's well-being, your job, your ability to advance in your career, even being recognized for having done that work as opposed to being invisible in history. And so we want to emphasize Dr. Nissenbaum's idea that a central tenet of contextual integrity is that there are no arenas of life that are not governed by norms of information flow; there are no information flows or spheres of life in which anything truly goes.

We see this breach of contextual integrity in cases where data is taken outside of its intended environment and used for another purpose. The phrase that sometimes gets thrown around is: but this data is already public, it's already out there, can't I use it for anything I want? That leaves just enough room for circumventing pretty much every ethical issue related to data that is found online. To give a specific example, in 2016 the open data project known as GHTorrent, which was one of the few community projects hosting a structured history of GitHub's activity data, had a lengthy discussion in an issue about sharing aggregated data containing GitHub users' email addresses.
So they collected everything together and they shared it all out. And as part of those commit messages, as part of that metadata, were people's regular email addresses, not hashed, not changed. And that aggregation and sharing was being used without consent from those individuals, because they found themselves the targets of mailing lists. People would take the dataset, scrape all of the emails off of it, and use them for something like surveys, even researchers asking people, hey, we'd like you to be part of this study. But those people didn't put their email out there on GitHub because they wanted to be contacted for a study.

This was a very long discussion. The link here, thank you so much to Julia for finding this, will take you to an Internet Archive copy, because that issue is no longer part of the original Git repository; it's been taken down. And I would like to emphasize that GHTorrent is actually no longer being maintained, and the previous website is no longer in use. You can still see the repo as it exists, archived on GitHub. So I'm grateful to the Internet Archive for saving that information so we can understand it better. But I just want to add a caveat, because we're referencing this: there is a hyperlink to ghtorrent.org in there. Do not click on that; it will take you somewhere you do not want to go. We're all about transparency.

That issue around email addresses and mailing lists is an excellent example of why, and this may be controversial to say here, openness in and of itself is not necessarily always a good. The release of raw data is of course good in the strict sense of reproducibility, of transparency, and of working collectively in a community. But in other contexts, openness may harm people, and harm the trust that exists between researchers and the community itself. So we always need to strive for that balance and ask questions around openness, ethics, and privacy, especially in consultation with the people who are part of those communities.

Thank you, Amanda. And I think that's a great segue into the idea that researching open source software is ultimately research about the people behind it. And yet the data about the software are far more readily available than the data about the people, and sometimes that's for very good reasons. But when doing research in and around open source ecosystems, we need to make sure that we're not exacerbating or reinforcing inequalities in the existing system by failing to question what is absent from the data.

Plenty of research has already been done into who contributes to open source, and why, and what benefits they see from it. And these benefits are not insignificant. A fair number of people have jobs because of their work in open source. A fair number of folks get sponsorship for their work in open source. And a lot of the way that people decide who to sponsor, or who to recruit for jobs, is through what is visible in the data. So if you do work that doesn't get captured in the data, then you don't receive those same benefits. One of the things that I think about a lot is how, if you are not getting paid for your work in open source, you are basically paying to do work in open source. And that isn't because you're literally handing out money; it's because you're spending your free time. And free time is the currency.
In 2017, Lawson wrote a fantastic post about time as currency, saying: "I've already told my partner that if and when we decide to start having kids, I will probably quit open source for good. I can't see how I'll be able to make the time for both." And in 2019, this was reflected in a paper by Miller et al. with a delightful title, "Why Do People Give Up FLOSSing?", which I just love from a pun perspective. I think it's great. They found that for all contributors, occupational reasons such as major life changes were the most cited reasons for leaving open source, significantly more than lacking peer support or losing interest, which are more commonly discussed in the literature. When we are looking at who is present in the data and who isn't, we need to keep at the forefront of our minds that the economic incentives, and the availability, which is also a bit of an economic incentive, of the people who keep the lights on are not evenly distributed. One of the papers that I want to see, free dissertation idea, is: why do people never start FLOSSing? We don't have research on that. So if we think about who leaves open source and why, what are the barriers for the people who never come in?

We're at an open source conference, right? Okay, so open source is everywhere. It powers mission critical systems. We know this. It's everywhere from space exploration to social networks to insulin pumps, which still terrifies me. Every person on the planet is affected by open source software, whether they know it or not. And that's critical to keep in mind when we're thinking about research and ecosystem integrity.

This example may have crossed your radar. In 2021, there was a retracted paper where researchers submitted known flaws to the Linux kernel. They had absolutely no intention of allowing these flaws to be merged upstream, but there was a lot of pushback. Greg Kroah-Hartman was quoted as saying: these researchers crossed a line that they shouldn't have crossed. Nobody asked them to do this. A whole lot of people wasted a whole lot of time evaluating their patches. And I think this is a really interesting example, because from the perspective of the researchers, they did not see an issue with their approach, because they weren't going to let anything affect the technical system. Nothing was going to be merged; no flaws were going to be incorporated into the Linux kernel. They designed it to be an encapsulated experiment. But they failed to realize and take into account that there are people in that system. They focused on the integrity of the system without considering the people or processes involved. We don't know how this would have worked if they had figured out a way to get consent for this experiment. But we do know that the way they went about it made a whole lot of people really angry, and it did get them banned: the entire university was banned from contributing to the Linux kernel. Which, I mean, actually I would put that on my resume, that's pretty impressive. Hopefully that ban will be lifted maybe in four or five years' time.

But these ecosystems are always in production. It is impossible to know where your software is being used, because as open source folks, we tend to hate telemetry. And so we just have to rely on the systems that we have established and make sure that we are treating them with respect.
So running behavioral experiments, which is what that wound up being, even if it didn't intend to be, or technical experiments in open source ecosystems may impact the world's infrastructure in unknown and immeasurable ways. It's difficult to know the scope of your research in an open source ecosystem. Small changes to one part may be what breaks something extremely important that you just had no idea about. So we need to treat open source ecosystems as systems that are perpetually in production.

So what do we do? Here are a few best practices; what do we actually take away from this? Well, as researchers and practitioners, we need to work together to provide practical context for research approaches, results, and recommendations. We need to consider the ramifications of research upon the ecosystems being studied, as well as their culture and individuals. And finally, we need to look beyond the repository for factors that may influence methodologies and findings.

And we acknowledge, again, wearing these many hats, sitting in these many seats, that this is a learning process. Science is all about understanding and finding new knowledge. But science is also an ecosystem that is always in production, because the ramifications of what you learn can have impacts down the road. We're all trying to figure out: how do we use data that is online? How do we put our own data online in a way that we can opt in and opt out of? How do we get control? And how do we be responsible about it in terms of the others we're working with, especially within a community? In these transition periods, where things seem confusing, we want to encourage that communication should increase, not decrease. The worst thing that we can do is to start to shut things down and silo ourselves, as opposed to coming together and working more closely with each other to strengthen that humanity as part of our shared experience.

We have an amazing opportunity to bridge the gap between people who want to understand what is happening in software, science, and technology, and to be the ones who participate as part of that, and that's really cool. It can open up opportunities to welcome more people into open source ecosystems, because those are more scientists who are then contributing their own code, contributing their own data, contributing their knowledge to lift us all up.

So thank you so much for having us here today. We wanted to make sure that on the last slide you get all of the references we have talked about today, and then of course these slides as well, in case you want to click through the hyperlinks in there. I do want to point out that the second bullet point is also a link to the full paper that this presentation is based on, so you are more than welcome to read the additional best practices in slightly more formal language than we've used today. Thank you.

Thank you. We do have time for questions. We've got a minute, if you want. There are a lot of them, so please, just one. Just one.

That was an excellent presentation. You were talking about mixed data. Is there also an element of jurisdiction to this? Because I've worked with researchers in exactly the situation you described, where we told them you can't do that because of GDPR. So I'm wondering if there's a jurisdiction versus ethics question here. And was the case with the Greg quote the one at the University of Minnesota? Yes, that was.
So the question was: are there jurisdictional implications for web scraping and the data that you obtain through web scraping? And I think this is where we both say: we are not lawyers. No, no. So this is a place where, yes, there are most likely jurisdictional considerations; what they are is where we would go talk to our lawyers. Yeah, and I also think you bring up an interesting point, because we did try to differentiate, and we acknowledge that. There's industry research, there's academic research at institutions, there's government research funded by government agencies. "Research" becomes a catch-all, but each group does have its own ethics boards, legal processes, and requirements for sharing and working with information. So yes, that absolutely does exist; there's not one universal standard. But it's also why we're trying to talk about this in terms of practices, as opposed to here is our resolute commandment, here are the rules that you should be following. Thank you for bringing that up; that's absolutely a great point.

Yeah, thank you very much for the presentation. I want to make a point and then you can comment on it. It's super important that we have all this research into open source. But speaking as a researcher, it was literally a gold rush, GitHub to mine, and we've got lots of silly papers. It would be better if researchers focused on what's really important, and that's not necessarily always open source. So anyone who can provide access to open source data, for example by being a contributor who works for a company, could also potentially provide access to the company side. We need as much research on how companies develop software, even if it's proprietary work; in my opinion, we need it equally. And because researchers mostly have access to open data, most of them study open source developers. So I would encourage anyone who is willing to work with researchers to find ways to maybe also get a company on board; that will improve the world for us.

So I will briefly try to summarize the comment there, which was that research into open source is good, there was a bit of a gold rush when we first had access to GitHub data, but we also need to do research into the proprietary side of software and software development, because that's important as well. And I would agree with that. Speaking as somebody who has worked in big tech for quite some time now, there are some significant challenges that I personally am not sure how to overcome to do that. But it's on the wish list, I think.

Yeah, that's an excellent point: when you need to do studies and you only have what is conveniently available, or what is available at all, sometimes you're limited in what you can find. I also think it limits the questions you can ask. So I would agree; hopefully we'll see more of those collaborations between researchers and all kinds of developers, whether you're within a corporation or not. My concern comes in when, and I was a reviewer for the Mining Software Repositories conference this year, the challenge I saw with more papers than I'm used to was just a complete mismatch between the data, the research questions, and the hypothesis, whether or not that was something you could even ask using that data.
And so that's a challenge as well: having people who understand what it is possible to ask of a given dataset. There's a connection to open source which is really nice for in-company collaborative development, which is called inner source. Sure, yeah. I do believe we're out of time. Out of time, yeah. But we will be outside for additional questions if you want.