[00:00.000 --> 00:12.320] Okay. Thank you. Thank you for coming to this presentation. I'm Nicolas Roland from [00:12.320 --> 00:20.560] the Gustave Eiffel University. Thank you. Please. And I will be presenting some research we [00:20.560 --> 00:28.480] did on crowdsourced noise data, some analysis we did with free and open source software. I will [00:28.480 --> 00:36.640] be presenting the work we did, myself, Pierre Romon and Ludovic Moison from the Gustave Eiffel [00:36.640 --> 00:45.840] University. So traffic noise is a major health concern. In Europe, in western Europe, it is [00:45.840 --> 00:56.840] estimated by the World Health Organization that we lose 1 million healthy life years each year. [00:56.840 --> 01:10.960] In France, the social cost, so the cost to the community, has been estimated at 147 billion euros [01:10.960 --> 01:23.760] per year. So it has a monetary cost, but also a cost on people and their health. So the [01:23.760 --> 01:33.360] big question is how we can find where noise is problematic. And, of course, we can't [01:33.360 --> 01:42.840] measure directly everywhere. We can't put microphones everywhere. It would be a cost [01:42.840 --> 01:50.800] nightmare, a logistics nightmare, and a privacy nightmare. Of course, it's not possible. So the [01:50.800 --> 02:00.000] traditional way is to simulate the noise from traffic counts. So we put counters on the roads [02:00.000 --> 02:08.800] and count the vehicles and estimate the traffic. We do that for trains, we do that for planes and [02:08.800 --> 02:18.000] aircraft, and we simulate that traffic and we produce this kind of maps with, for example, [02:18.000 --> 02:28.280] NoiseModelling, which is an application we developed with the UMRAE laboratory that can [02:28.280 --> 02:36.920] compute noise maps from these counts. And this is a legal requirement of the European Commission. [02:36.920 --> 02:46.520] Another way, explored by UMRAE, which is a lab working on environmental [02:46.520 --> 02:56.720] acoustics, is not to simulate, but to get actual data, real data, from contributors using a smartphone [02:56.720 --> 03:05.800] application you can install. It works on regular smartphones. It's available on F-Droid, so it's [03:05.800 --> 03:14.640] also free and open source software. And it measures several things, like your position and the sound [03:14.640 --> 03:22.000] spectrum, or not the full spectrum, just the third-octave bands. So you can't understand what [03:22.000 --> 03:30.280] people are saying if someone is speaking, but you can detect that someone is speaking. [03:30.280 --> 03:39.440] You also have the sound level and some other information. So it's part of a bigger project, [03:39.440 --> 03:46.520] the Noise-Planet project. So we have the NoiseModelling application that generates [03:46.520 --> 03:53.560] noise maps from open geodata, mostly French geodata and OpenStreetMap. So when you use [03:53.560 --> 04:01.200] StreetComplete to say, okay, this is grass and this is macadam, we use that data to generate [04:01.200 --> 04:08.960] more precise sound maps; NoiseCapture to measure and share sound environments. And all [04:08.960 --> 04:15.160] of this data goes into a spatial data infrastructure called OnoM@p. And there are also [04:15.160 --> 04:25.000] some community maps made by the users. This is a map of all the recordings we have from nearly [04:25.000 --> 04:32.200] five years. So you can see it's worldwide.
It's not just France or Europe, it's worldwide. [04:32.200 --> 04:44.160] So the question was, what can we do with all this data we collect? So there was an [04:44.160 --> 04:53.560] extraction in 2021 of the first three years of data collection. So the application is still collecting [04:53.560 --> 05:05.560] data, but there was an extract that contains 260,000 tracks. So a track is a recording, and they come from worldwide, [05:05.560 --> 05:12.200] with the sound spectrum, like I said, the GPS location, and also some tags that the contributor can provide. [05:12.200 --> 05:25.600] It's under an Open Database License, so it's free to use. So the question is how we can characterize the [05:25.600 --> 05:29.920] user environment, the sound environment of the user at the moment of the recording, with the [05:29.920 --> 05:40.080] collected data. We thought of two possibilities. One is from the sound spectrum we record. So [05:40.080 --> 05:46.960] it's an ongoing analysis. It's the hardest way to do it because we have to [05:46.960 --> 05:57.000] find patterns in the recordings, and we have to use machine learning to detect these patterns [05:57.000 --> 06:04.800] in all of this data. So it's still ongoing, but there is also an easier way, and this is the way [06:04.800 --> 06:15.720] I used: by using the tags that are provided by the contributors. So in the subset, like I [06:15.720 --> 06:29.960] said, of 260,000 tracks, half of them have tags, so we can use just that half. About 50,000 of them are [06:29.960 --> 06:38.840] outdoors and are not tests. We want to work on the outdoor sound environment, so we discard indoor- and test- [06:38.840 --> 06:51.960] tagged tracks. We also remove the very, very short ones, less than five seconds, so we remove [06:51.960 --> 07:02.760] tracks that might be accidental. And for this [07:02.760 --> 07:08.080] preliminary work we also restrict to France, because we are French and it's easier for us to understand what's [07:08.080 --> 07:19.360] happening. And that is nearly 12,000 tracks. And like I said, road noise is a [07:19.360 --> 07:28.720] major concern. And it appears directly in our data, because the most frequent tag is road. So [07:28.720 --> 07:41.800] in maybe a third of our subset there is road noise in the recording. The second one is chatting. [07:41.800 --> 07:50.640] And we also have things like wind, animal sounds, works. So there are 12 different [07:50.640 --> 08:02.920] tags the user can provide. So we use a quite simple toolkit to analyze the data. First is [08:02.920 --> 08:13.600] a PostgreSQL and PostGIS database, because the data is provided as a PostgreSQL dump. So in order [08:13.600 --> 08:21.760] to access it, you have to rebuild the database. And the tool we use is R, because in [08:21.760 --> 08:28.240] the team we are mostly R users. We also have Python, but we are more familiar with R. So [08:28.240 --> 08:37.920] two tools, simple? Yes. Actually, not really, because in R we also use a lot of packages, [08:37.920 --> 08:48.640] like the tidyverse, the sf package for geospatial data, geojson and stats packages and so on. And [08:48.640 --> 08:55.880] all of these packages use dependencies like Pandoc, Markdown, reveal.js. This presentation, [08:55.880 --> 09:03.440] actually, is made with R and reveal.js. We also use geospatial libraries, like PROJ, [09:03.440 --> 09:14.000] GEOS, GDAL. And those are dependencies that are not handled by R directly; you just call them.
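As an illustration of that toolchain, a minimal R sketch of rebuilding access to the dump and applying the filters just described could look like the following; the table and column names are assumptions for illustration, not the actual schema of the extract.

    # Connect to the rebuilt PostgreSQL/PostGIS database and apply the filters
    # described above (hypothetical table and column names).
    library(DBI)
    library(RPostgres)
    library(sf)
    library(dplyr)

    con <- dbConnect(Postgres(), dbname = "noisecapture", host = "localhost")

    # Read the tracks with their geometry (schema assumed for illustration)
    tracks <- st_read(con, query = "SELECT pk_track, time_length, tags, the_geom FROM noisecapture_track")

    dbDisconnect(con)

    subset_tagged <- tracks |>
      filter(!is.na(tags), tags != "",        # keep only tagged tracks
             !grepl("indoor|test", tags),     # discard indoor- and test-tagged tracks
             time_length >= 5)                # drop very short, likely accidental, tracks
    # A final spatial filter against a France polygon (e.g. st_filter()) leaves roughly 12,000 tracks.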
[09:14.000 --> 09:22.120] So what did we find in this dataset? Let's talk about results. We got some interesting [09:22.120 --> 09:32.200] things. The first thing we looked at was the animal tags, because we know [09:32.200 --> 09:46.320] that bird songs can be heard mostly in the first hour before dawn. This is a well-known [09:46.320 --> 09:56.520] dynamic in ornithology, and in the sound environment we can hear it. And we actually [09:56.520 --> 10:04.400] find it too. So in this graph, what you can see on the left part is the time before the sunrise [10:04.400 --> 10:19.080] on the day of recording. So we find this actual dynamic of birds singing one hour before dawn. [10:19.080 --> 10:32.160] So it was a good sign. And we also find peaks of road noise between 8 and 10 a.m. and 6 and 8 p.m. [10:32.160 --> 10:45.240] And we can say it looks very much like commuter behavior. But we can't directly [10:45.240 --> 10:52.160] link it to that; we can only say it's very similar. So we looked at physical [10:52.160 --> 11:04.120] events in the environment of the contributor. And we find a very good correlation between the wind [11:04.120 --> 11:13.720] force and the presence of wind tags in the dataset. So it works very well. [11:13.720 --> 11:24.160] We also did that with rainfall, and the correlation is not as strong. It might be [11:24.160 --> 11:35.040] a user bias: maybe if the rainfall is too light, the user doesn't hear the rain or doesn't think [11:35.040 --> 11:43.920] to add the tag about it. And it might also be a spatial issue, because the mean nearest [11:43.920 --> 11:49.920] weather station distance is 16 kilometers, so the local conditions might be different [11:49.920 --> 11:58.080] between the weather station and the user at the moment of the recording. So it's not as strong, [11:58.080 --> 12:06.320] but we actually find it in the data. I'm not the first one to speak about reproducible science here, [12:06.320 --> 12:14.520] actually. And it's an issue, a real issue. So for this study, we have some good points. Like, [12:14.520 --> 12:22.320] the data is already available. The source code, we made it available: all the SQL scripts to [12:22.320 --> 12:31.800] rebuild the database and the tables we used are available. The R notebooks we made are also [12:31.800 --> 12:41.320] available. The setup, broadly, is available. But there are also bad points to assess. Some [12:41.320 --> 12:50.760] notebooks grew very large, and we went very deep in the analysis and the exploratory phase. But [12:50.760 --> 12:58.800] at the end, it was very hard to reproduce, even within our team. We actually were able to do it. But [12:58.800 --> 13:07.360] for someone coming from outside, it might be difficult to get into it. So we need some code [13:07.360 --> 13:18.400] refactoring and a little bit more commenting, more explanation. And there is also a lack [13:18.400 --> 13:29.640] of information on the software environment, so it makes it very hard to reuse and reproduce. So what [13:29.640 --> 13:42.200] could we have used to have better tooling? Since we use R, you can use renv, which is an R [13:42.200 --> 13:51.080] package for reproducibility. It's like a virtual environment. It works well, but it works well just [13:51.080 --> 14:03.480] for R. And we use other software like PostGIS. We use GEOS, PROJ, GDAL. So it's not perfect. Docker [14:03.480 --> 14:14.800] might be something that can be helpful.
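For reference, a minimal renv workflow looks like the sketch below; it pins the R package versions used by the analysis, but not system libraries such as GEOS, PROJ or GDAL, which is exactly the limitation mentioned above.

    # Record and restore the R package environment of a project with renv
    install.packages("renv")
    renv::init()       # set up a project-local library and create renv.lock
    renv::snapshot()   # write the exact package versions used by the analysis
    # Later, on another machine:
    renv::restore()    # reinstall the recorded versions from renv.lock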
But Docker, like Simon said before, is not perfect for [14:14.800 --> 14:24.960] reproducibility. And Guix has been on my mind for a year now, actually, telling myself, [14:24.960 --> 14:30.160] okay, I need to work on that. And I think it would be a good solution. I won't talk too much about it [14:30.160 --> 14:38.320] because there was a talk by Simon Tournier just two talks before, so go watch it. I think it [14:38.320 --> 14:47.840] might be a very good solution. In conclusion: we can use crowdsourced data. Oh, perfect. We can [14:47.840 --> 14:54.880] use crowdsourced data for science. Even for something quirky like the sound environment, [14:54.880 --> 15:01.880] we can use it for science. This particular dataset is usable. So you can access it and find new [15:01.880 --> 15:08.080] things. We don't yet have every question we can answer with this [15:08.080 --> 15:17.840] dataset. So it's quite fun to play with it and find things like, oh, we can find birds. I do believe that [15:17.840 --> 15:24.600] free and open source software is key for reproducible science. We can't make reproducible science with [15:24.600 --> 15:31.880] proprietary software. It's not possible. Reproducible science is hard to achieve. You have to think about it [15:31.880 --> 15:38.920] as soon as possible, before starting your project, because when you are too far along, you have to refactor [15:38.920 --> 15:54.240] things and it can be very tricky. And, maybe, this is more a sound and [15:54.240 --> 16:02.480] physics-related study, but sometimes I work with economists, I work with geographers, and they are [16:02.480 --> 16:15.720] often not very keen on technology and computers in general. So sometimes you need someone, maybe [16:15.720 --> 16:23.040] an engineer or someone in the team, who can handle this reproducibility part. And so you need to get [16:23.040 --> 16:29.960] the skills: either you get them yourself or you have to bring someone into the team who can do that for [16:29.960 --> 16:40.680] you. And notebooks are not enough. Notebooks are great to communicate and explore things, but they [16:40.680 --> 16:48.880] are not good enough for reproducible science. So there's a link for the dataset, the actual [16:48.880 --> 16:55.440] dataset. Please go check noise-planet.org. You can navigate the map. You can see actual [16:55.440 --> 17:04.240] tracks and click on things to see what was recorded. Thank you for your attention. You [17:04.240 --> 17:13.160] can reach me by email or on Mastodon. This presentation is available here and everything is [17:13.160 --> 17:27.200] accessible on GitHub. Thank you very much. Thanks. That leaves us a bit of time for questions. So [17:27.200 --> 17:55.000] please feel free to take them, repeat them, and then answer them. Yeah. In the graph with the birds. Yeah. You had a sort of a dip at zero. Yeah. Yeah. So the [17:55.000 --> 18:05.720] question was about this particular graph: why there is a low point at zero while the peak is just above zero. [18:05.720 --> 18:15.320] Because it's smoothed a little bit. And you can see there is a peak just before, and the line is just [18:15.320 --> 18:29.840] moving and there is a little shifting. But you ask why there is a low point? I don't know. I'm not sure. Yeah, please. [18:29.840 --> 18:39.680] Yeah. Because we are doing crowdsourced data, it's obviously influenced by the users that are [18:39.680 --> 18:47.120] collecting the data for us. Yeah.
How do you factor in, or how do you eliminate, this source of variance where it could be underlying [18:47.120 --> 18:58.520] behavior of humans that is affecting the results of the data? For example, sunrise time. People who get woken up by birds [18:58.520 --> 19:06.600] before sunrise will be very annoyed and they record more. People who wake up at normal times are too busy to even make the [19:06.600 --> 19:22.080] recording. Okay. So the question was about this being crowdsourced data. So there is data provided by people [19:22.080 --> 19:32.600] willing to provide it. And there is a bias, of course, because you may be angry at birds waking you up in the morning [19:32.600 --> 19:42.800] and you may be angry at traffic noise. And actually, we don't assess the data. We take it as it is. I'm not [19:42.800 --> 19:54.840] part of that part of the project, but maybe there will be some work on it. And we hope that there is so much data that the bias will [19:54.840 --> 20:07.080] be smoothed out. But of course, it's biased, like OpenStreetMap data. There is someone making a decision to say, okay, I will record this, for a good [20:07.080 --> 20:23.400] or bad reason, or to prove a point: okay, where I live it's too noisy, I make a recording. And that's okay. But we can't really assess [20:23.400 --> 20:33.920] this kind of information. We don't know why people record tracks, because maybe it's a pleasant environment and they want to share it; it's not necessarily something bad. [20:33.920 --> 20:40.360] And that's okay for us. I hope that answers your question. Yeah, please. [20:40.360 --> 20:53.920] Yeah. So I just wanted to ask, because I think wind is pretty hard to incorporate. Because when somebody records, they probably record without a pop filter, which makes the sound of the wind really loud [20:53.920 --> 21:03.520] despite it not being so loud, because somebody comes up with the phone and records, and the wind blows straight into it and it's really, really loud, kind of [21:03.520 --> 21:25.640] distorted. Okay, the question was about the wind recordings and the fact that a smartphone doesn't have a pop shield to protect the microphone from the wind. And so, [21:25.640 --> 21:43.640] actually, I'm not an acoustician. I'm more a GIS engineer. So I don't have the exact answer for that. But I do believe that when you are using your microphone, when you're [21:43.640 --> 21:58.640] talking, smartphones nowadays can protect you a little bit from that noise. But I'm not sure. Yeah, please. [21:58.640 --> 22:04.640] My question is kind of connected to the bias one. But when you were building the data capture, like... [22:04.640 --> 22:17.640] Did you build the data capture tool where people are inputting data, right? Or how was that built? And did you make sure that people could use it in a way that... [22:17.640 --> 22:25.640] Like, how did you make sure that people were comfortable using it in the situations that you needed recordings for? [22:25.640 --> 22:46.640] So the question was... Can you simplify it? So I'm interested in what choices you made in order to have the tool look and function how it did to capture the data and, like, again, the bias. [22:46.640 --> 22:55.640] So if people are, like, not able to use it or don't like using it, does that also bias the data? [22:55.640 --> 23:22.640] So you're not speaking...
So the question was about how we built the analysis and how we built it. If you are not able to use R to build the... Sorry. Actually, we had to make a choice, and we are more comfortable with R. [23:22.640 --> 23:33.640] So there is a bias, of course, and we also have some libraries like suncalc, for example, that make life much simpler for us: [23:33.640 --> 23:52.640] you give it a date and a position and it gives you the sunrise and sunset, thank you, sunset time, for example, so it makes life easier for us. [23:52.640 --> 24:07.640] But, of course, there is a bias. Even when we built the application, there is, of course, a bias. But I wasn't part of the team that built this application. [24:07.640 --> 24:25.640] And it's more focused on what we want to get, but it's available for everyone, so people can do whatever they want with it. Thank you. We have more time, maybe? [24:25.640 --> 24:42.640] On your first slide, you had a really big number in terms of the social costs, only in France. It seems quite egregiously big. Do you know anything about what's included in the social costs? [24:42.640 --> 24:46.640] What are the costs that are incorporated into this number? [24:46.640 --> 24:59.640] It comes from ADEME, and it's a huge report. ADEME is a French agency, an environmental agency, so it works on noise pollution, but also air pollution and things like that. [24:59.640 --> 25:08.640] Sorry, I didn't repeat the question, but the question is about the social cost, the amount, and how it is constructed. [25:08.640 --> 25:27.640] So I just read the report quickly. And the social cost is mostly about health issues, lack of sleep and stress related to noise and things like that: [25:27.640 --> 25:44.640] how it affects people and how it affects their health, and how worse health is a cost for society, because you have more anxiety. [25:44.640 --> 26:00.640] Thank you very much.
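For reference, the suncalc usage mentioned in that answer can be sketched as follows; the example tracks and their column names are hypothetical, while getSunlightTimes() is the actual function that returns sunrise and sunset times for a given date and position.

    # Compute, for each (hypothetical) recording, its time relative to local sunrise
    library(suncalc)

    tracks <- data.frame(
      record_utc = as.POSIXct(c("2020-05-10 04:30:00", "2020-05-10 08:15:00"), tz = "UTC"),
      lat = c(47.2, 48.8),
      lon = c(-1.5, 2.3))

    sun <- getSunlightTimes(data = data.frame(date = as.Date(tracks$record_utc),
                                              lat = tracks$lat, lon = tracks$lon),
                            keep = "sunrise", tz = "UTC")

    # Negative values are recordings made before sunrise, where animal tags peak
    tracks$hours_from_sunrise <- as.numeric(difftime(tracks$record_utc, sun$sunrise, units = "hours"))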