Okay, I'll start, because the ten minutes apply to me as well, even though I wear this nice blue shirt. So please sit down and I'll start right away.

I'll be talking about social audio applications that you may want to re-implement with Janus, if you want. A quick slide about me: nobody cares.

But what is social audio? It's basically whenever you have something that is primarily audio, and not strictly video, as the main form of communication. Whether it's voice messages or podcasts or virtual audio rooms or things like that: you may have heard about things like Clubhouse, Twitter Spaces, Reddit Talk, they are all examples of social audio. So people talking with each other, maybe they take turns, and then they broadcast to a very large audience. And of course it does seem like a very good fit for WebRTC, especially for the real-time kind of participation. And you didn't hear it from me, because I don't know if there are any secrets about that, but actually Twitter Spaces uses Janus for the live part, and then they distribute it some other way.

And how do they usually work? As I said, they are typically live conversations, so we have a limited number of people that talk to each other, exchange ideas, they take turns, so it's not always the same people talking for two hours like a podcast, for instance. And then you may have possibly thousands of attendees: for instance, any time Elon Musk speaks in a Twitter Space there's a million people listening, let's say, things like that.

And there are of course different challenges to tackle, because the live conversation part needs to be real time, something that happens as fast as possible. For the distribution to the audience a bit of latency may be okay, and this is why, for instance, they take advantage of CDNs or things like that most of the time. But of course there's a problem: the more latency you have for the audience, the more delay there may be if somebody from the audience needs to come into the conversation, and that's something that needs to be taken into account. So you may want to use WebRTC for everything, but there are scalability issues at play there.

So I wanted to check whether or not Janus, which is the WebRTC server that I work on for a living, could be used for the job. I came up with a few potential ideas, and one of those may be relying on the AudioBridge plugin.
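As a rough sketch of what "relying on the AudioBridge plugin" can look like in practice (not part of the talk), this is roughly how an application could create an AudioBridge mixing room over the Janus HTTP API. It assumes the default HTTP transport on localhost:8088; the room id and sampling rate are arbitrary placeholder values.

    import random
    import requests  # third-party HTTP client; assumes Janus's default HTTP transport

    JANUS = "http://localhost:8088/janus"

    def tx():
        # every Janus request needs a random transaction identifier
        return "%012x" % random.getrandbits(48)

    # create a Janus session
    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]

    # attach a handle to the AudioBridge plugin
    handle = requests.post(f"{JANUS}/{session}",
                           json={"janus": "attach", "plugin": "janus.plugin.audiobridge",
                                 "transaction": tx()}).json()["data"]["id"]

    # create a mixing room (room id 1234 and 48 kHz sampling are example values)
    print(requests.post(f"{JANUS}/{session}/{handle}",
                        json={"janus": "message", "transaction": tx(),
                              "body": {"request": "create", "room": 1234,
                                       "sampling_rate": 48000}}).json())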
The AudioBridge is basically an audio mixer that lives within Janus. You have multiple people connected to the AudioBridge plugin; they each create a single PeerConnection, and the AudioBridge mixes all the audio streams, so that you send one stream and you receive one stream that contains the audio of everybody involved except you. Which is really nice, because for instance it's easy to bring SIP endpoints in, if you want, using the plain RTP functionality. You can play jingles, for instance if you have your own show and you want to play something in there, or maybe a snippet from another conversation. And if you do stereo mixing, which is supported, you can use spatial positioning of participants to make it easier for people to understand who is speaking.

Of course this takes care of the live conversation, but we want to make it available to other people as well, to a wider audience. So what you can do is take advantage of RTP forwarders, which are basically an easy way by which the AudioBridge plugin sends a plain RTP stream, containing the mix that is being created there, towards an address that you specify. A nice feature in the AudioBridge plugin is that you can also tag participants, so that you may say: don't send me a mix of all participants, only the ones that I tag in a specific group. For instance, this one may be a technician, so those two need to hear the technician who gives tips, but all the attendees only need to hear those two. That's basically the main idea.

Whatever receives that forwarded stream is handling a mixed stream, so there may be a script there that sends this mix to Icecast, to make a very simple example, or to YouTube Live for audio, or to whatever platform you want to use as a CDN for distributing the audio, if it's not WebRTC. If you want to use WebRTC, you can use something like this: you have your active participants connected to the AudioBridge, talking to each other, and you RTP-forward to the Streaming plugin, which is the plugin in Janus that takes care of broadcasting RTP to a wider audience. The Streaming plugin is then what distributes the audio, and the greatest advantage is that you don't have to perform any specific mixing for those attendees: they are already receiving a mixed stream. All the people connected to the AudioBridge instead have a dedicated mixing context, because they need to receive everybody except themselves, so it's not the same audio for all of them.
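A hedged sketch of the forwarding step just described: a Streaming plugin mountpoint that listens for plain RTP, and an AudioBridge rtp_forward request pointing the room mix at it. The plugin names and request names are real Janus APIs, but the mountpoint syntax shown is the older one (recent multistream versions describe mountpoints with a "media" array instead), and all ids, hosts and ports are placeholders.

    import random, requests

    JANUS = "http://localhost:8088/janus"
    tx = lambda: "%012x" % random.getrandbits(48)

    def attach(session, plugin):
        return requests.post(f"{JANUS}/{session}",
                             json={"janus": "attach", "plugin": plugin,
                                   "transaction": tx()}).json()["data"]["id"]

    def message(session, handle, body):
        return requests.post(f"{JANUS}/{session}/{handle}",
                             json={"janus": "message", "transaction": tx(), "body": body}).json()

    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]

    # 1. on the Streaming plugin, create a mountpoint that listens for plain RTP audio
    #    (legacy mountpoint syntax; placeholder mountpoint id and port)
    streaming = attach(session, "janus.plugin.streaming")
    message(session, streaming, {"request": "create", "type": "rtp", "id": 99,
                                 "audio": True, "audioport": 5002,
                                 "audiopt": 100, "audiortpmap": "opus/48000/2"})

    # 2. on the AudioBridge, forward the room mix to that mountpoint as plain RTP
    audiobridge = attach(session, "janus.plugin.audiobridge")
    message(session, audiobridge, {"request": "rtp_forward", "room": 1234,
                                   "host": "127.0.0.1", "port": 5002,
                                   "ptype": 100, "codec": "opus"})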
And whenever you want somebody from the listeners to join the conversation, they mute the streaming part, they join the AudioBridge temporarily, and they become active participants that everybody else can listen to, because they are now mixed in the AudioBridge. And of course, for scalability purposes, you can just RTP-forward to multiple Streaming plugin instances on multiple different Janus instances; how you distribute it is entirely up to you. You can use a tree-based distribution or whatever you want, and you can also take advantage of multicast: since it's just a plain RTP stream that you are forwarding, if you forward it to a multicast group, then multiple Janus instances can all pull that same mixed audio from that multicast group and distribute it more efficiently.

One other added value is that, using this approach, you can also do something like interpreter services. You have two different AudioBridge rooms for the different languages, you have the speakers join the room of their language and an interpreter in the other room, and then you distribute those two streams separately, so you allow the audience to listen to, say, the English channel or the French channel; depending on the language being spoken, they will hear either the translator or the actual speaker on each one. Which maybe makes little sense for an actual social audio application, it's maybe more for a conferencing scenario, but it's still a good side effect.

If instead you don't want to mix in Janus, for a few reasons, because you don't want to terminate audio there, mixing is more CPU-intensive, or whatever, you may want to use the SFU approach instead. That means that participants in the conversation still establish maybe one single PeerConnection, not necessarily more than one, but they are exchanging multiple audio streams: they are sending their own, and they are receiving as many as there are other participants in the room. You can still externalize this conversation via RTP forwarders as before, but now the audio is not mixed, so you have different audio streams for each of the participants there. Each participant in the conversation is sending one stream and receiving two, in this example, and you have a separate component that is receiving the three different audio streams from the different participants. So if you want to distribute something via a regular CDN, which requires a single audio stream to distribute, that component receiving the RTP forwarders needs to act a bit like a mixer, working live, basically.
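For the SFU flavour, a similarly hedged sketch: the VideoRoom plugin also has an rtp_forward request, used here to push each publisher's unmixed audio to an external mixer component. The room id, publisher ids, host and ports are all placeholders, and newer Janus versions express the same request with a "streams" array rather than audio_port.

    import random, requests

    JANUS = "http://localhost:8088/janus"
    tx = lambda: "%012x" % random.getrandbits(48)

    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]
    videoroom = requests.post(f"{JANUS}/{session}",
                              json={"janus": "attach", "plugin": "janus.plugin.videoroom",
                                    "transaction": tx()}).json()["data"]["id"]

    # forward the (unmixed) audio of each publisher to an external mixer/transcoder;
    # that component then produces the single mixed stream a CDN expects
    MIXER_HOST = "192.0.2.10"  # placeholder address of the external mixer
    for publisher_id, port in [(1, 6000), (2, 6002), (3, 6004)]:
        requests.post(f"{JANUS}/{session}/{videoroom}",
                      json={"janus": "message", "transaction": tx(),
                            "body": {"request": "rtp_forward", "room": 1234,
                                     "publisher_id": publisher_id,
                                     "host": MIXER_HOST, "audio_port": port}})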
And once this happens, so once you have a mix there, everything is pretty much like the example I made before: you have a mixed stream, and you can distribute it via CDN or via Janus, as we've said before. If you don't want to mix for the attendees either, and you want something closer to a regular webinar or something like that, you can still do that, but then you have to use the approach I was talking about, of forwarding to the Streaming plugin for each of the different participants. So you have the presenters that are contributing audio to the VideoRoom, each of them becomes an audio broadcast for that specific presenter in the Streaming plugin, and people listen to that participant over there. You can again involve multiple Streaming plugin instances if needed, so that you can widen the audience if you want. But if you have multiple participants speaking, you have to do the same for each of them, because otherwise, since the audio is not mixed, you would only listen to one single participant. Which means that the audience needs to create subscriptions to more than one participant at any given time, and of course you have to make this dynamic in case there are presenters that come and go, which is what is expected in a social audio kind of application.

Which means that it's probably easier to do something like this, where you still keep the audio conversation in an SFU for the WebRTC participants, because it maybe gives better audio quality between them, but then for distributing the conversation it's okay to mix it, and so even mix it for WebRTC usage, so that you distribute a single audio stream instead, which makes sense. But again, if you want to do the other thing, that works too: for instance, this is what we do for our virtual event platform for meetings, so that definitely works anyway. And again, you can also do this sort of multicast distribution, if you want to take advantage of a wider distribution of the media.

If I spoke too fast, which is very likely, I did write a blog post about this, which goes a bit more into detail and explains things a bit more precisely than I did right now. And I think I managed to stay on time. These are some references: you can find me mainly on Mastodon, I'm still on Twitter but who knows for how long, and that's the blog post I was mentioning before. So that's all, thank you.
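To illustrate the per-presenter mountpoint approach described above, here is one more hedged sketch of what a listener's subscription against the Streaming plugin could look like, again over the HTTP transport. The mountpoint id is a placeholder, a listener following two presenters would repeat this on a second handle, and a real client would of course hand the SDP to an actual WebRTC stack.

    import random, requests

    JANUS = "http://localhost:8088/janus"
    tx = lambda: "%012x" % random.getrandbits(48)

    session = requests.post(JANUS, json={"janus": "create", "transaction": tx()}).json()["data"]["id"]
    streaming = requests.post(f"{JANUS}/{session}",
                              json={"janus": "attach", "plugin": "janus.plugin.streaming",
                                    "transaction": tx()}).json()["data"]["id"]

    # ask to watch the mountpoint of one presenter (id 99 is a placeholder)
    requests.post(f"{JANUS}/{session}/{streaming}",
                  json={"janus": "message", "transaction": tx(),
                        "body": {"request": "watch", "id": 99}})

    # with the HTTP transport, the SDP offer for the broadcast arrives asynchronously
    # on the session's long-poll channel; a real client would reply with a "start"
    # request carrying its SDP answer
    event = requests.get(f"{JANUS}/{session}?maxev=1").json()
    print(event.get("jsep", {}).get("sdp", "no offer yet"))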
Okay, there's time maybe for one or two questions, if anybody is curious. So I don't know if you have any.

[Inaudible audience question.]

Not specifically in the AudioBridge, but this is something that you can enforce at the application level if you want. So for instance you may decide that some users always need to be there, and some others not, so for instance you may have the concept of the actual presenters, and panellists that come and go. But this is more of an application-level concept than a mixing concept; as far as mixing is concerned, you just know... yeah, exactly.

So, any other question, or can we move on? Okay, then. Okay, thank you very much for that one question.