Okay, I'll start, because the ten minutes apply to me as well, even though I wear this nice blue shirt. So please sit down and I'll start right away.

I'll be talking about social audio applications that you may want to re-implement with Janus, if you want. A quick slide about me: nobody cares.

But what is social audio? It's basically whenever you have something that is primarily audio, and not strictly video, as the main form of communication. Whether it's voice messages or podcasts or virtual audio rooms or things like that: you may have heard about things like Clubhouse, Twitter Spaces, Reddit Talk, they are all examples of social audio. So people talking with each other, maybe they take turns, and then they broadcast to a very large audience. And of course it does seem like a very good fit for WebRTC, especially for the real-time kind of participation. And you didn't hear it from me, because I don't know if there are any secrets about that, but actually Twitter Spaces uses Janus for the live part, and then they distribute it some other way.

And how do they usually work? As I said, they are typically live conversations, so we have a limited number of people that talk to each other, exchange ideas, they take turns, so it's not always the same people talking for two hours like a podcast, for instance. And then you may have possibly thousands of attendees: for instance, any time Elon Musk speaks in a Twitter Space there's a million people listening, let's say, things like that.

And there are of course different challenges to tackle, because the live conversation part needs to be real time, something that happens as fast as possible. For the distribution to the audience a bit of latency may be okay, and this is why, for instance, they take advantage of CDNs or things like that most of the time. But of course there's a problem: the more latency you have for the audience, the more delay there may be if somebody from the audience needs to come into the conversation, and that's something that needs to be taken into account. So you may want to use WebRTC for everything, but there are scalability issues at play there.

So I wanted to check whether or not Janus, which is the WebRTC server that I work on for a living, could be used for the job. I came up with a few potential ideas, and one of those may be relying on the AudioBridge plugin.
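As a rough sketch of what "relying on the AudioBridge plugin" can look like in practice (not part of the talk), this is roughly how an application could create an AudioBridge mixing room over the Janus HTTP API. It assumes the default HTTP transport on localhost:8088; the room id and sampling rate are arbitrary placeholder values.

    import random
    import requests  # third-party HTTP client; assumes Janus's default HTTP transport

    JANUS = "http://localhost:8088/janus"

    def tx():
        # every Janus request needs a random transaction identifier
        return "%012x" % random.getrandbits(48)

    # create a Janus session
    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]

    # attach a handle to the AudioBridge plugin
    handle = requests.post(f"{JANUS}/{session}",
                           json={"janus": "attach", "plugin": "janus.plugin.audiobridge",
                                 "transaction": tx()}).json()["data"]["id"]

    # create a mixing room (room id 1234 and 48 kHz sampling are example values)
    print(requests.post(f"{JANUS}/{session}/{handle}",
                        json={"janus": "message", "transaction": tx(),
                              "body": {"request": "create", "room": 1234,
                                       "sampling_rate": 48000}}).json())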
The AudioBridge is basically an audio mixer that lives within Janus. You have multiple people connected to the AudioBridge plugin; they each create a single PeerConnection, and the AudioBridge mixes all the audio streams, so that you send one stream and you receive one stream that contains the audio of everybody involved except you. Which is really nice, because for instance it's easy to bring SIP endpoints in, if you want, using the plain RTP functionality. You can play jingles, for instance if you have your own show and you want to play something in there, or maybe a snippet from another conversation. And if you do stereo mixing, which is supported, you can use spatial positioning of participants to make it easier for people to understand who is speaking.

Of course this takes care of the live conversation, but we want to make it available to other people as well, to a wider audience. So what you can do is take advantage of RTP forwarders, which are basically an easy way by which the AudioBridge plugin sends a plain RTP stream, containing the mix that is being created there, towards an address that you specify. A nice feature in the AudioBridge plugin is that you can also tag participants, so that you may say: don't send me a mix of all participants, only the ones that I tag in a specific group. For instance, this one may be a technician, so those two need to hear the technician who gives tips, but all the attendees only need to hear those two. That's basically the main idea.

Whatever receives that forwarded stream is handling a mixed stream, so there may be a script there that sends this mix to Icecast, to make a very simple example, or to YouTube Live for audio, or to whatever platform you want to use as a CDN for distributing the audio, if it's not WebRTC. If you want to use WebRTC, you can use something like this: you have your active participants connected to the AudioBridge, talking to each other, and you RTP-forward to the Streaming plugin, which is the plugin in Janus that takes care of broadcasting RTP to a wider audience. The Streaming plugin is then what distributes the audio, and the greatest advantage is that you don't have to perform any specific mixing for those attendees: they are already receiving a mixed stream. All the people connected to the AudioBridge instead have a dedicated mixing context, because they need to receive everybody except themselves, so it's not the same audio for all of them.
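A hedged sketch of the forwarding step just described: a Streaming plugin mountpoint that listens for plain RTP, and an AudioBridge rtp_forward request pointing the room mix at it. The plugin names and request names are real Janus APIs, but the mountpoint syntax shown is the older one (recent multistream versions describe mountpoints with a "media" array instead), and all ids, hosts and ports are placeholders.

    import random, requests

    JANUS = "http://localhost:8088/janus"
    tx = lambda: "%012x" % random.getrandbits(48)

    def attach(session, plugin):
        return requests.post(f"{JANUS}/{session}",
                             json={"janus": "attach", "plugin": plugin,
                                   "transaction": tx()}).json()["data"]["id"]

    def message(session, handle, body):
        return requests.post(f"{JANUS}/{session}/{handle}",
                             json={"janus": "message", "transaction": tx(), "body": body}).json()

    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]

    # 1. on the Streaming plugin, create a mountpoint that listens for plain RTP audio
    #    (legacy mountpoint syntax; placeholder mountpoint id and port)
    streaming = attach(session, "janus.plugin.streaming")
    message(session, streaming, {"request": "create", "type": "rtp", "id": 99,
                                 "audio": True, "audioport": 5002,
                                 "audiopt": 100, "audiortpmap": "opus/48000/2"})

    # 2. on the AudioBridge, forward the room mix to that mountpoint as plain RTP
    audiobridge = attach(session, "janus.plugin.audiobridge")
    message(session, audiobridge, {"request": "rtp_forward", "room": 1234,
                                   "host": "127.0.0.1", "port": 5002,
                                   "ptype": 100, "codec": "opus"})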
And whenever you want somebody from the listeners to join the conversation, they mute the streaming part, they join the AudioBridge temporarily, and they become active participants that everybody else can listen to, because they are now mixed in the AudioBridge. And of course, for scalability purposes, you can just RTP-forward to multiple Streaming plugin instances on multiple different Janus instances; how you distribute it is entirely up to you. You can use a tree-based distribution or whatever you want, and you can also take advantage of multicast: since it's just a plain RTP stream that you are forwarding, if you forward it to a multicast group, then multiple Janus instances can all pull that same mixed audio from that multicast group and distribute it more efficiently.

One other added value is that, using this approach, you can also do something like interpreter services. You have two different AudioBridge rooms for the different languages, you have the speakers join the room of their language and an interpreter in the other room, and then you distribute those two streams separately, so you allow the audience to listen to, say, the English channel or the French channel; depending on the language being spoken, they will hear either the translator or the actual speaker on each one. Which maybe makes little sense for an actual social audio application, it's maybe more for a conferencing scenario, but it's still a good side effect.

If instead you don't want to mix in Janus, for a few reasons, because you don't want to terminate audio there, mixing is more CPU-intensive, or whatever, you may want to use the SFU approach instead. That means that participants in the conversation still establish maybe one single PeerConnection, not necessarily more than one, but they are exchanging multiple audio streams: they are sending their own, and they are receiving as many as there are other participants in the room. You can still externalize this conversation via RTP forwarders as before, but now the audio is not mixed, so you have different audio streams for each of the participants there. Each participant in the conversation is sending one stream and receiving two, in this example, and you have a separate component that is receiving the three different audio streams from the different participants. So if you want to distribute something via a regular CDN, which requires a single audio stream to distribute, that component receiving the RTP forwarders needs to act a bit like a mixer, working live, basically.
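For the SFU flavour, a similarly hedged sketch: the VideoRoom plugin also has an rtp_forward request, used here to push each publisher's unmixed audio to an external mixer component. The room id, publisher ids, host and ports are all placeholders, and newer Janus versions express the same request with a "streams" array rather than audio_port.

    import random, requests

    JANUS = "http://localhost:8088/janus"
    tx = lambda: "%012x" % random.getrandbits(48)

    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]
    videoroom = requests.post(f"{JANUS}/{session}",
                              json={"janus": "attach", "plugin": "janus.plugin.videoroom",
                                    "transaction": tx()}).json()["data"]["id"]

    # forward the (unmixed) audio of each publisher to an external mixer/transcoder;
    # that component then produces the single mixed stream a CDN expects
    MIXER_HOST = "192.0.2.10"  # placeholder address of the external mixer
    for publisher_id, port in [(1, 6000), (2, 6002), (3, 6004)]:
        requests.post(f"{JANUS}/{session}/{videoroom}",
                      json={"janus": "message", "transaction": tx(),
                            "body": {"request": "rtp_forward", "room": 1234,
                                     "publisher_id": publisher_id,
                                     "host": MIXER_HOST, "audio_port": port}})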
And once this happens, so once you have a mix there, everything is pretty much like the example I made before: you have a mixed stream, and you can distribute it via CDN or via Janus, as we've said before. If you don't want to mix for the attendees either, and you want something closer to a regular webinar or something like that, you can still do that, but then you have to use the approach I was talking about, of forwarding to the Streaming plugin for each of the different participants. So you have the presenters that are contributing audio to the VideoRoom, each of them becomes an audio broadcast for that specific presenter in the Streaming plugin, and people listen to that participant over there. You can again involve multiple Streaming plugin instances if needed, so that you can widen the audience if you want. But if you have multiple participants speaking, you have to do the same for each of them, because otherwise, since the audio is not mixed, you would only listen to one single participant. Which means that the audience needs to create subscriptions to more than one participant at any given time, and of course you have to make this dynamic in case there are presenters that come and go, which is what is expected in a social audio kind of application.

Which means that it's probably easier to do something like this, where you still keep the audio conversation in an SFU for the WebRTC participants, because it maybe gives better audio quality between them, but then for distributing the conversation it's okay to mix it, and so even mix it for WebRTC usage, so that you distribute a single audio stream instead, which makes sense. But again, if you want to do the other thing, that works too: for instance, this is what we do for our virtual event platform for meetings, so that definitely works anyway. And again, you can also do this sort of multicast distribution, if you want to take advantage of a wider distribution of the media.

If I spoke too fast, which is very likely, I did write a blog post about this, which goes a bit more into detail and explains things a bit more precisely than I did right now. And I think I managed to stay on time. These are some references: you can find me mainly on Mastodon, I'm still on Twitter but who knows for how long, and that's the blog post I was mentioning before. So that's all, thank you.
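To illustrate the per-presenter mountpoint approach described above, here is one more hedged sketch of what a listener's subscription against the Streaming plugin could look like, again over the HTTP transport. The mountpoint id is a placeholder, a listener following two presenters would repeat this on a second handle, and a real client would of course hand the SDP to an actual WebRTC stack.

    import random, requests

    JANUS = "http://localhost:8088/janus"
    tx = lambda: "%012x" % random.getrandbits(48)

    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]
    streaming = requests.post(f"{JANUS}/{session}",
                              json={"janus": "attach", "plugin": "janus.plugin.streaming",
                                    "transaction": tx()}).json()["data"]["id"]

    # ask to watch the mountpoint of one presenter (id 99 is a placeholder)
    requests.post(f"{JANUS}/{session}/{streaming}",
                  json={"janus": "message", "transaction": tx(),
                        "body": {"request": "watch", "id": 99}})

    # with the HTTP transport, the SDP offer for the broadcast arrives asynchronously
    # on the session's long-poll channel; a real client would reply with a "start"
    # request carrying its SDP answer
    event = requests.get(f"{JANUS}/{session}?maxev=1").json()
    print(event.get("jsep", {}).get("sdp", "no offer yet"))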
Okay, there's time maybe for one or two questions, if anybody is curious. So I don't know if you have any.

[Inaudible audience question.]

Not specifically in the AudioBridge, but this is something that you can enforce at the application level if you want. So for instance you may decide that some users always need to be there, and some others not, so for instance you may have the concept of the actual presenters, and panellists that come and go. But this is more of an application-level concept than a mixing concept; as far as mixing is concerned, you just know... yeah, exactly.

So, any other question, or can we move on? Okay, then. Okay, thank you very much for that one question.