Okay, we'll get started. Thank you, everybody, for being here today. We're so pleased and honored to join you and talk about a project that the research team we both work on has been thinking about for a long time, which includes questions like: what information is missing when you study open source by looking only at the repository? I will do my best. Something happened last night, I can't explain it. Yes.

So in this talk we are going to explore some stories about what can go wrong, and how we as researchers and practitioners in the community can work to find best practices for open source ecosystems research. In case it helps you, our slides and speaker notes can be found at, and I will spell it out, bit.ly/beyond-the-repo-fosdem24.

Hi, I'm Julia Ferraioli. I work at AWS as their open source AI/ML strategist, but that's not what I'm here for today. I have had one foot in research and one foot in practice for basically my entire career, and I'm especially interested in what motivates people within open source, as well as the potential for an understanding of those motivations to increase the resiliency of our digital infrastructure.

And hello, I'm Amanda Casari. Please keep reminding me to speak up if I forget. I work in Google's open source programs office as an engineer. I also continue to work around data and AI, and I'm also a researcher, with an emphasis on looking at open source through an intersectional and feminist complexity lens.

And our co-author who could not be here today is Dr. Juniper Lovato. She is a multidisciplinary complex systems scientist who is a research assistant professor of computer science at the University of Vermont, and her work explores the ethical and governance issues related to socio-technical systems.

Through our work, we were struck repeatedly by how much of the existing research around open source and digital infrastructure lacked context from those working within the ecosystems. We also saw how these research findings were making their way into open source ecosystems and infrastructure, even when they weren't necessarily applicable. Those observations led us to establish at least the start of some best practices for research in open source, which, combined with others that we as a community develop together, can help us better study open source ecosystems as the socio-technical systems that they are.

So the basic, fundamental assumption here is that open source is, and has always been, much more than a repository. It's a complex, multi-level ecosystem of human contributors who collaborate and cooperate to achieve shared creative endeavors. And we are collectively also part of open source communities in our work and our passions. We're collectively a dynamic socio-technical system, both people and technology, that is always in production and evolving towards a distributed goal.

We're also, unfortunately, and I say unfortunately just because of some problems that come from it, a very attractive research space for scientists, especially scientists who study socio-technical systems, because open source ecosystems are so data rich, have such a long history, and have many exciting applications for understanding society as it is, whether that's governance, cybersecurity, or team dynamics. However, we frequently see science that focuses only on repository data, and that gives a limited snapshot of the wider ecosystem and ignores many of the explanatory variables of social systems.
And that's just one missing data point that we're concerned about and have done some work on. When we talk here about open source ecosystems, the reason we stress the ecosystems piece is because we're referring to the collection of repositories, technology, infrastructure, communities, interactions, incentives, behavioral norms, and culture. Studying these as a whole requires community cooperation and participation to understand all the interacting and interdependent parts of the system. At the heart of all of our ecosystems are humans; it's us. And our collaborations and outputs reflect the social, emotional, and technical labor of a group of individuals moving towards, again, the shared and distributed goal that we all have.

One still unsolved problem, and I want to stress both of these groups, is when both industry researchers and academic researchers, two separate groups with different outcomes and incentives, overlook data from open source ecosystems as part of a review process. And this is because data from open source, if you're not familiar with institutional review boards, is usually obtained through scraping or APIs, and it falls into a category called secondary data. And secondary data, by some rubrics, is not centered on humans and doesn't require consent from research subjects. But when you are using data about people, that is inherently data about someone, and it transforms them into a research subject. When this happens, we as a community, we as individuals, are unaware that we're being studied. And then a paper comes out talking about the open source project that you worked on, and you're like, well, first of all, I didn't realize any of this was going on, but also, do I agree with this? And you never had the opportunity to give consent as a research subject in that case.

Juniper actually just recently finished her PhD, and we're super excited and happy for her. She has been working in the field of data ethics and looking at this intersection. And she shared with us the advice from her PhD advisor, which is: just because something is permissible does not mean it is ethical. So for example, just because open source repositories are public, and sometimes permissible to scrape, it doesn't always mean they're ethically fair game for any use without the community's consent.

But I would like to give a positive example of this, by the way. There was a group of researchers in 2022 who did work with the community to learn more about what was possible with repository data. Their research questions did center around the repos themselves. This is Kuutila et al., and they published an excellent paper looking at open source and maintainer well-being. They took a mixed methods approach: they looked at quantitative signals, they did a diary study, they did interviews. And they actually determined that it was not possible to determine maintainer well-being from any of the signals they studied purely off the quantitative work. Now, this usually doesn't get published, right? Normally there's a hypothesis, they find that it didn't work out, and then it goes into the drawer, it goes somewhere else. But this time they actually published that, no, these things are not correlated, you cannot find these. There are too many individual confounders for us to be able to say these signals will work, without working with the community.
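To make that point about so-called secondary data concrete: even the most routine repository API query returns information about identifiable people, not just code. Here is a minimal sketch in Python, assuming a hypothetical public repository and the public GitHub commits API (unauthenticated requests are rate-limited, so this is illustrative only):

```python
# Minimal sketch: "secondary" repository data is still data about people.
# The repository name below is hypothetical; the endpoint is the public
# GitHub REST API for listing commits.
import requests

OWNER = "example-org"      # hypothetical owner
REPO = "example-project"   # hypothetical repository

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/commits",
    params={"per_page": 5},
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
resp.raise_for_status()

for item in resp.json():
    author = item["commit"]["author"]
    # Every record names a person: a name, an email address, and a
    # timestamp showing when they were active.
    print(author["name"], author["email"], author["date"])
```

Every one of those records describes a person and their behavior; calling the dataset secondary does not change that.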
And that Kuutila study is honestly the kind of research I would love to see continue, to encourage, and for us to participate in, because it's breaking down mental models: not only the ones that exist about this rich community, but also the ones that might hinder us, or be used out of context in a way that does not serve us. And that allows us to build trust with folks who are trying to understand us, and for ourselves to understand ourselves better in a way that, again, is not just being observed, but is participatory. And that's because open source data is not just code; it's the collaborative labor of a group of people. For all of this, the best practice we want to stress as researchers is: don't throw away the socio element, don't ignore the fact that there are humans as part of these problems. That also helps us remember that we should be treating each other with care and respect, because when we do this work, we are ultimately part of the same system.

And now, as researchers and practitioners, we're at a really critical moment. I'm sure everybody's been talking about critical moments all day; open source is in a critical moment. Research as well is in some very interesting critical moments, especially around questions of open data: how data should be open, whether it should be open, what its use cases are. And community members are themselves subject matter experts. We hold a wealth of knowledge and lived experience; we know these systems best. As collaborators, we lend much more experience to these projects than when we are silently studied or left out. So involving communities through participatory methods will help researchers better understand the systems they're studying, what's missing, and what is truly available for purposes of research.

Another point we'd like to make, and I talked a little bit about context already: Dr. Helen Nissenbaum puts forth the concept of contextual integrity. That is the idea that the protection of privacy is tied to the norms of the specific context from which the information is gathered. There are memes around about things overheard out of context, things taken out of context. But it applies differently here when we're talking about research, which now fundamentally impacts things like funding, people's well-being, your job, your ability to advance in your career, even being recognized for having done that work as opposed to being invisible in history. And so we want to emphasize Dr. Nissenbaum's idea that a central tenet of contextual integrity is that there are no arenas of life that are not governed by norms of information flow; there are no information flows or spheres of life in which anything truly goes.

We see this breach of contextual integrity in cases where data is taken outside of its intended environment and used for another purpose. The phrase that sometimes gets thrown around is: but this data is already public, it's already out there, can't I use it for anything I want? That leaves just enough room for circumventing pretty much every ethical issue related to data that is found online. To give a specific example, in 2016 the open data project known as GHTorrent, which was one of the few community projects hosting a structured history of GitHub's activity data, had a lengthy discussion in an issue about sharing aggregated data containing GitHub users' email addresses.
So they collected everything together and they shared it all out. And as part of those commit messages, as part of that metadata, were people's regular email addresses, not hashed, not changed. And that aggregation and sharing was being used without consent from those individuals, because they found themselves the targets of mailing lists. People would take the dataset, scrape all of the emails off of it, and use them for something like surveys, even researchers asking people, hey, we'd like you to be part of this study. But those people didn't put their email out there on GitHub because they wanted to be contacted for a study.

This was a very long discussion. The link here, thank you so much to Julia for finding this, will take you to an Internet Archive copy, because that issue is no longer part of the original Git repository; it's been taken down. And I would like to emphasize that GHTorrent is actually no longer being maintained, and the previous website is no longer in use. You can still see the repo as it exists, archived on GitHub. So I'm grateful to the Internet Archive for saving that information so we can understand it better. But I just want to add a caveat, because we're referencing this: there is a hyperlink to ghtorrent.org in there. Do not click on that; it will take you somewhere you do not want to go. We're all about transparency.

That issue around email addresses and mailing lists is an excellent example of why, and this may be controversial to say here, openness in and of itself is not necessarily always a good. The release of raw data is of course good in the strict sense of reproducibility, of transparency, and of working collectively in a community. But in other contexts, openness may harm people, and harm the trust that exists between researchers and the community itself. So we always need to strive for that balance and ask questions around openness, ethics, and privacy, especially in consultation with the people who are part of those communities.

Thank you, Amanda. And I think that's a great segue into the idea that researching open source software is ultimately research about the people behind it. And yet the data about the software are far more readily available than the data about the people, and sometimes that's for very good reasons. But when doing research in and around open source ecosystems, we need to make sure that we're not exacerbating or reinforcing inequalities in the existing system by failing to question what is absent from the data.

Plenty of research has already been done into who contributes to open source, and why, and what benefits they see from it. And these benefits are not insignificant. A fair number of people have jobs because of their work in open source. A fair number of folks get sponsorship for their work in open source. And a lot of the way that people decide who to sponsor, or who to recruit for jobs, is through what is visible in the data. So if you do work that doesn't get captured in the data, then you don't receive those same benefits. One of the things that I think about a lot is how, if you are not getting paid for your work in open source, you are basically paying to do work in open source. And that isn't because you're literally handing out money; it's because you're spending your free time. And free time is the currency.
In 2017, Lawson wrote a fantastic post about time as currency, saying: "I've already told my partner that if and when we decide to start having kids, I will probably quit open source for good. I can't see how I'll be able to make the time for both." And in 2019, this was reflected in a paper by Miller et al. with a delightful title, "Why Do People Give Up FLOSSing?", which I just love from a pun perspective. I think it's great. They found that for all contributors, occupational reasons such as major life changes were the most cited reasons for leaving open source, significantly more than lacking peer support or losing interest, which are more commonly discussed in the literature. When we are looking at who is present in the data and who isn't, we need to keep at the forefront of our minds that the economic incentives, and the availability, which is also a bit of an economic incentive, of the people who keep the lights on are not evenly distributed. One of the papers that I want to see, free dissertation idea, is: why do people never start FLOSSing? We don't have research on that. So if we think about who leaves open source and why, what are the barriers for the people who never come in?

We're at an open source conference, right? Okay, so open source is everywhere. It powers mission critical systems. We know this. It's everywhere from space exploration to social networks to insulin pumps, which still terrifies me. Every person on the planet is affected by open source software, whether they know it or not. And that's critical to keep in mind when we're thinking about research and ecosystem integrity.

This example may have crossed your radar. In 2021, there was a retracted paper where researchers submitted known flaws to the Linux kernel. They had absolutely no intention of allowing these flaws to be merged upstream, but there was a lot of pushback. Greg Kroah-Hartman was quoted as saying: these researchers crossed a line that they shouldn't have crossed. Nobody asked them to do this. A whole lot of people wasted a whole lot of time evaluating their patches. And I think this is a really interesting example, because from the perspective of the researchers, they did not see an issue with their approach, because they weren't going to let anything affect the technical system. Nothing was going to be merged; no flaws were going to be incorporated into the Linux kernel. They designed it to be an encapsulated experiment. But they failed to realize and take into account that there are people in that system. They focused on the integrity of the system without considering the people or processes involved. We don't know how this would have worked if they had figured out a way to get consent for this experiment. But we do know that the way they went about it made a whole lot of people really angry, and it did get them banned: the entire university was banned from contributing to the Linux kernel. Which, I mean, actually I would put that on my resume, that's pretty impressive. Hopefully that ban will be lifted maybe in four or five years' time.

But these ecosystems are always in production. It is impossible to know where your software is being used, because as open source folks, we tend to hate telemetry. And so we just have to rely on the systems that we have established and make sure that we are treating them with respect.
So running behavioral experiments, which is what that wound up being, even if it didn't intend to be, or technical experiments in open source ecosystems may impact the world's infrastructure in unknown and immeasurable ways. It's difficult to know the scope of your research in an open source ecosystem. Small changes to one part may be what breaks something extremely important that you just had no idea about. So we need to treat open source ecosystems as systems that are perpetually in production.

So what do we do? Here are a few best practices; what do we actually take away from this? Well, as researchers and practitioners, we need to work together to provide practical context for research approaches, results, and recommendations. We need to consider the ramifications of research upon the ecosystems being studied, as well as their culture and individuals. And finally, we need to look beyond the repository for factors that may influence methodologies and findings.

And we acknowledge, again, wearing these many hats, sitting in these many seats, that this is a learning process. Science is all about understanding and finding new knowledge. But science is also an ecosystem that is always in production, because the ramifications of what you learn can have impacts down the road. We're all trying to figure out: how do we use data that is online? How do we put our own data online in a way that we can opt in and opt out of? How do we get control? And how do we be responsible about it in terms of the others we're working with, especially within a community? In these transition periods, where things seem confusing, we want to encourage that communication should increase, not decrease. The worst thing that we can do is to start to shut things down and silo ourselves, as opposed to coming together and working more closely with each other to strengthen that humanity as part of our shared experience.

We have an amazing opportunity to bridge the gap between people who want to understand what is happening in software, science, and technology, and to be the ones who participate as part of that, and that's really cool. It can open up opportunities to welcome more people into open source ecosystems, because those are more scientists who are then contributing their own code, contributing their own data, contributing their knowledge to lift us all up.

So thank you so much for having us here today. We wanted to make sure that on the last slide you get all of the references we have talked about today, and then of course these slides as well, in case you want to click through the hyperlinks in there. I do want to point out that the second bullet point is also a link to the full paper that this presentation is based on, so you are more than welcome to read the additional best practices in slightly more formal language than we've used today. Thank you.

Thank you. We do have time for questions. We've got a minute, if you want. There are a lot of them, so please, just one. Just one.

That was an excellent presentation. You were talking about mixed data. Is there also an element of jurisdiction to this? Because I've worked with researchers in exactly the situation you described, where we told them you can't do that because of GDPR. So I'm wondering if there's a jurisdiction versus ethics question here. And was the case with the Greg quote the one at the University of Minnesota? Yes, that was.
So the question was: are there jurisdictional implications for web scraping and the data that you obtain through web scraping? And I think this is where we both say: we are not lawyers. No, no. So this is a place where, yes, there are most likely jurisdictional considerations; what they are is where we would go talk to our lawyers. Yeah, and I also think you bring up an interesting point, because we did try to differentiate, and we acknowledge that. There's industry research, there's academic research at institutions, there's government research funded by government agencies. "Research" becomes a catch-all, but each group does have its own ethics boards, legal processes, and requirements for sharing and working with information. So yes, that absolutely does exist; there's not one universal standard. But it's also why we're trying to talk about this in terms of practices, as opposed to here is our resolute commandment, here are the rules that you should be following. Thank you for bringing that up; that's absolutely a great point.

Yeah, thank you very much for the presentation. I want to make a point and then you can comment on it. It's super important that we have all this research into open source. But speaking as a researcher, it was literally a gold rush, GitHub to mine, and we've got lots of silly papers. It would be better if researchers focused on what's really important, and that's not necessarily always open source. So anyone who can provide access to open source data, for example by being a contributor who works for a company, could also potentially provide access to the company side. We need as much research on how companies develop software, even if it's proprietary work; in my opinion, we need it equally. And because researchers mostly have access to open data, most of them study open source developers. So I would encourage anyone who is willing to work with researchers to find ways to maybe also get a company on board; that will improve the world for us.

So I will briefly try to summarize the comment there, which was that research into open source is good, there was a bit of a gold rush when we first had access to GitHub data, but we also need to do research into the proprietary side of software and software development, because that's important as well. And I would agree with that. Speaking as somebody who has worked in big tech for quite some time now, there are some significant challenges that I personally am not sure how to overcome to do that. But it's on the wish list, I think.

Yeah, that's an excellent point: when you need to do studies and you only have what is conveniently available, or what is available at all, sometimes you're limited in what you can find. I also think it limits the questions you can ask. So I would agree; hopefully we'll see more of those collaborations between researchers and all kinds of developers, whether you're within a corporation or not. My concern comes in when, and I was a reviewer for the Mining Software Repositories conference this year, the challenge I saw with more papers than I'm used to was just a complete mismatch between the data, the research questions, and the hypothesis, whether or not that was something you could even ask using that data.
And so that's a challenge as well: having people who understand what it is possible to ask of a given dataset. There's a connection to open source which is really nice for in-company collaborative development, which is called inner source. Sure, yeah. I do believe we're out of time. Out of time, yeah. But we will be outside for additional questions if you want.