[00:00.000 --> 00:15.720] Okay, we can start. Thank you for attending this meeting, which is about Linphone and [00:15.720 --> 00:23.320] video conferencing. My name is Jehan Monnier. I have been involved in the Linphone project since [00:23.320 --> 00:32.000] 2010, and I'm also part of the company which has been backing the Linphone project for almost [00:32.000 --> 00:42.240] ten years. So first, I'm going to provide you with a quick introduction to Linphone, and then [00:42.240 --> 00:52.720] say a couple of words about video conferencing with SIP, followed by an introduction to the [00:52.720 --> 01:02.440] selective forwarding unit, which is at the heart of modern video conferencing systems. Later I will [01:02.440 --> 01:09.040] talk about what is required on the SIP client side to be able to interoperate with this kind of [01:09.040 --> 01:18.440] video conferencing system. And finally, the conclusion. Okay, so just a couple of words [01:18.440 --> 01:28.400] about Linphone. Linphone is a voice over IP client implementing the SIP protocol, which started [01:28.400 --> 01:40.720] in the early 2000s. It's available on Linux, Android, iOS, Windows and Mac. It uses SIP as the base [01:40.720 --> 01:49.720] standard for almost everything, including audio and video calls, instant messaging and presence, everything [01:49.720 --> 01:58.040] which is required for real-time communication. It also provides end-to-end encryption [01:58.040 --> 02:07.440] for messaging, based more or less on the Signal protocol. The Linphone team develops the [02:07.440 --> 02:13.160] Linphone software but also Flexisip, which is basically a SIP proxy. And if you want to use a [02:13.160 --> 02:23.240] SIP account, it's possible to create one on our website, mainly for testing purposes. Okay, [02:23.240 --> 02:31.520] video conferencing with SIP in a couple of words. It revolves around a few standards. The first one [02:31.520 --> 02:40.000] is SIP itself: basic SIP with INVITE, REFER and BYE to create a conference, join the conference, and be [02:40.000 --> 02:50.840] able to invite other participants to the conference. It's mostly based on RFC 4579, which [02:50.840 --> 02:59.880] defines how to create a conference and how to join it. There is also another [02:59.880 --> 03:08.440] interesting standard, RFC 5366, which is about defining the list of participants [03:08.440 --> 03:18.120] of the conference. So that is for the establishment of the conference. The next important standard [03:18.120 --> 03:27.000] is what we call the conference event package, which is based on the SUBSCRIBE/NOTIFY RFCs. [03:27.000 --> 03:34.600] The idea is that when a participant joins the conference, it initiates a SIP SUBSCRIBE to the [03:34.600 --> 03:41.760] server. The server then notifies every participant of the conference about the [03:41.760 --> 03:49.000] state of the conference: who has joined, the status of their audio and video, everything which is [03:49.000 --> 04:01.800] related to the status of the conference.
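To make the conference event package more concrete, here is a rough sketch of the kind of NOTIFY the conference server sends, following RFC 4575 (Event: conference, body of type application/conference-info+xml); most SIP headers are omitted, and the conference URI, participant and media details are invented for illustration:

    NOTIFY sip:alice@192.0.2.1 SIP/2.0
    Event: conference
    Subscription-State: active
    Content-Type: application/conference-info+xml

    <?xml version="1.0" encoding="UTF-8"?>
    <conference-info xmlns="urn:ietf:params:xml:ns:conference-info"
                     entity="sip:conf-1234@sip.example.org"
                     state="full" version="1">
      <users>
        <user entity="sip:alice@sip.example.org" state="full">
          <endpoint entity="sip:alice@192.0.2.1">
            <status>connected</status>
            <media id="1"><type>audio</type><status>sendrecv</status></media>
            <media id="2"><type>video</type><status>sendrecv</status></media>
          </endpoint>
        </user>
      </users>
    </conference-info>

Each time a participant joins or leaves, or their audio or video status changes, the server pushes an updated (full or partial) document of this kind to every subscribed participant.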
On the media side, it's regular RTP. [04:01.800 --> 04:09.640] And for this video conferencing project, we added the support of two important RFCs, which are about bundling all the [04:09.640 --> 04:18.040] media streams into the same socket, in order to avoid having too many RTP sockets and RTP streams [04:18.040 --> 04:28.480] per SIP client. And regarding the security, it's regular SIP over TLS for the signaling. And [04:28.480 --> 04:38.320] for the media path, it's either SDES, where the symmetric key is set in the SDP, or ZRTP, or [04:38.320 --> 04:52.040] even DTLS-SRTP. And for the SRTP itself, it's standard AES. Okay, now let's introduce what [04:52.040 --> 05:00.360] the selective forwarding unit is. And I'm going to start with a description of what we used to [05:00.360 --> 05:11.920] have as a conferencing server. So in the past, the media mixer received the video from every [05:11.920 --> 05:22.440] user, decoded the video streams, performed the mixing, which could be a mosaic or any other layout, and then [05:22.440 --> 05:33.720] re-encoded the resulting stream to be sent to every participant. This kind of software has existed for [05:33.720 --> 05:50.760] 30 years, maybe 20. Here, I just want to show you a figure taken from RFC 7667, which is [05:50.760 --> 05:59.920] about the RTP topology of this former, legacy kind of conferencing system. So for each client, A, B, C, here it's [05:59.920 --> 06:05.720] audio, but it could be the same for video. You have one RTP stream going to the media server and [06:05.720 --> 06:15.520] one RTP stream which comes from the media server to each client. And it's on the server side that everything [06:15.520 --> 06:25.400] is decoded, mixed, and sent back to the clients. The advantage of this approach was that it was [06:25.400 --> 06:32.560] very simple from the client side, as calling the conferencing server was almost the same as calling a [06:32.560 --> 06:43.640] regular user agent. The drawbacks of this approach were that the video layout was defined [06:43.640 --> 06:52.320] server side, so you could only have one or two different layouts. It requires a lot of CPU [06:52.320 --> 07:01.120] resources server side, as every video stream has to be decoded and then re-encoded. And end-to-end [07:01.120 --> 07:12.720] encryption was not possible, due to the fact that the video was decoded. Now, if we go to [07:12.720 --> 07:22.000] the selective forwarding unit, the idea is that the media server is no longer decoding and then [07:22.000 --> 07:29.600] re-encoding every video stream, but rather switching the video coming from every device to every [07:29.600 --> 07:40.320] other device. And it can be done according to several policies, like active speaker or mosaic. [07:40.320 --> 07:47.680] And for that, we also need some information coming from each client, like the volume of the [07:47.680 --> 07:53.920] audio stream, in order to be able to know who is talking without having to decode the audio stream [07:53.920 --> 08:08.800] as well. If I go back to the same kind of diagram, still from RFC 7667, now you can see that from the RTP [08:08.800 --> 08:18.240] standpoint, you still have one RTP stream for each client going to the media server. But [08:18.240 --> 08:28.400] now you also have one incoming video stream per participant of the conference. So if we [08:28.400 --> 08:37.760] follow the video stream from client A, you can see that it is copied to client B, [08:37.760 --> 08:48.400] but as well to client F. So it's no longer a media mixer, but more or less a switching matrix. [08:48.400 --> 09:01.120] The advantages of this architecture are that the video layout is no longer defined on [09:01.120 --> 09:08.080] the server side; the client can decide where to display every participant of the conference. [09:08.080 --> 09:17.040] It's an application choice, no longer a server choice. It scales very well, as there are no [09:18.240 --> 09:23.280] resources used on the server side to decode or encode the video streams. [09:23.840 --> 09:29.840] And finally, it opens the door to end-to-end encryption, as the media server no longer [09:29.840 --> 09:38.400] has to know the content of a video stream. The drawback of this approach is that it requires [09:38.400 --> 09:46.800] the SIP client to be able to manage multiple streams, which was not the case for a standard one-to-one [09:46.800 --> 10:01.920] call.
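A side note on the audio-level information mentioned above: a standard way to carry it is the client-to-mixer audio-level RTP header extension defined in RFC 6464 (most likely what the SDP example later in the talk refers to), negotiated with an SDP line such as the following, where the extension id 1 is arbitrary:

    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level

Each RTP audio packet then carries a one-byte header extension with the packet's audio level in -dBov, so the forwarding unit can rank speakers and pick the active one without having to decode the audio payload.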
So for the SIP user agent, what we had to change is the following. It's mainly [10:01.920 --> 10:10.000] about multi-stream requirements. In the past, the SIP client was able to send one audio stream [10:10.000 --> 10:21.440] plus one video stream. Now, it requires the client to be able to send one, but most of the time [10:21.440 --> 10:29.440] two, video streams, one for the high resolution video and a second one for the thumbnail, [10:29.440 --> 10:39.600] as well as being able to receive one video stream per participant of the video conference. [10:41.600 --> 10:53.120] Here is just an example of the SDP, to show what is involved. So bundle mode, as I said, [10:53.120 --> 11:02.080] which is required, and rtcp-mux as well; this is to limit the number of sockets used for the media. [11:03.040 --> 11:10.800] This extension is related to the audio level, in order to be able to display who is talking, and also for [11:10.800 --> 11:21.520] the server to be able to select the video stream of the participant who is talking. It still uses ICE, to be able to limit [11:21.520 --> 11:31.840] the usage of media relays. And for the video part, what you can see is that there are two [11:32.560 --> 11:40.880] video streams in receive-only, one for the high resolution of the camera and another for the [11:40.880 --> 11:50.480] thumbnail. So it means that we encode the video twice. This could be replaced by a video [11:50.480 --> 12:00.400] encoder like H.264 SVC, which supports multi-layer functionality. But if you want to be able to do [12:00.400 --> 12:12.000] that with simple VP8, it's better to encode the video twice. And on the receiving side, [12:12.000 --> 12:22.160] there is one video stream, because in this example there is only one participant in the video [12:22.160 --> 12:36.000] conference. But this part would be multiplied by the number of participants of the conference.
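The SDP being described is on a slide rather than in the audio, so here is a loosely reconstructed fragment, as seen from the server side and with made-up addresses, ports, payload types and mid identifiers (ICE and SRTP key or fingerprint attributes omitted for brevity):

    v=0
    o=- 123456 2 IN IP4 198.51.100.10
    s=video conference
    c=IN IP4 198.51.100.10
    t=0 0
    a=group:BUNDLE a1 v1 v2 v3
    m=audio 9078 RTP/SAVPF 96
    a=mid:a1
    a=rtpmap:96 opus/48000/2
    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
    a=rtcp-mux
    a=sendrecv
    m=video 9078 RTP/SAVPF 97
    a=mid:v1
    a=rtpmap:97 VP8/90000
    a=rtcp-mux
    a=recvonly
    m=video 9078 RTP/SAVPF 97
    a=mid:v2
    a=rtpmap:97 VP8/90000
    a=rtcp-mux
    a=recvonly
    m=video 9078 RTP/SAVPF 97
    a=mid:v3
    a=rtpmap:97 VP8/90000
    a=rtcp-mux
    a=sendonly

Here the two recvonly video sections are the high-resolution stream and the thumbnail that the server accepts from this participant, and the sendonly section is the single stream it sends back; with more participants in the conference, that last part would be repeated once per remote participant.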
[12:36.000 --> 12:48.880] Okay. So this is what we have done on the Linphone project for this feature. [12:49.920 --> 12:58.080] It can be tested with the Flexisip server, which is currently running on our infrastructure. [12:58.080 --> 13:07.920] So you can create a video conference thanks to this conference factory SIP URI. And using a Linphone [13:07.920 --> 13:16.160] client with a version above 5.0, it's possible to join a video conference. Okay. Thank you. [13:18.080 --> 13:27.680] Conclusion. Okay. So now Linphone is capable of joining a video [13:27.680 --> 13:35.440] conference in two modes, mosaic and active speaker, using a selective forwarding unit, [13:35.440 --> 13:42.560] which allows it to scale. A possible evolution that we have in mind is to implement the XCON [13:43.600 --> 13:48.480] conferencing server, in order to be able to manage conferences from a website or to have [13:48.480 --> 13:53.600] something more advanced. We are also thinking about adding end-to-end encryption to this [13:53.600 --> 13:59.360] video conferencing server, and why not provide compatibility with WebRTC, [13:59.360 --> 14:04.880] knowing that the media protocols that we use are very close to those of WebRTC. [14:07.440 --> 14:11.600] Useful links: if you want to have more information about this work, [14:11.600 --> 14:15.440] you can go to the Linphone website and have a look at our GitHub. [14:15.440 --> 14:22.800] Okay. That's it. If you have any questions... Thank you. [14:32.800 --> 14:39.040] Are you aware of any other SIP client that implements multi-party video with an SFU? [14:39.040 --> 14:50.080] Not yet, because the work to move from a regular SIP phone supporting only one audio stream [14:50.080 --> 14:58.800] and one video stream to supporting this multi-stream model is very significant, and I'm not aware of any [15:00.080 --> 15:06.960] work in progress so far. So if you want to use it, you have to go with Linphone, [15:06.960 --> 15:11.280] even though it's fully standardized and we are following the standards. [15:13.840 --> 15:14.160] Thank you. [15:23.360 --> 15:29.200] Not yet. Not yet, but we are quite confident that it's going to scale, as we have removed [15:29.200 --> 15:36.480] all the need for audio or video encoding on the server side. So it's really about switching [15:36.480 --> 15:41.840] packets. Maybe the question might be more around the network on the client side. [15:42.560 --> 15:51.360] Regarding the network on the client side, we are sending two resolutions [15:52.000 --> 15:58.080] from the client, a high resolution and a low resolution. And in the case of active speaker, [15:58.080 --> 16:04.480] we only send back to every client the high resolution of the person who is currently [16:04.480 --> 16:12.400] talking and the low resolution of every other participant. So it greatly limits the bandwidth needed. [16:14.480 --> 16:14.720] Yes. [16:15.840 --> 16:21.440] On the client side, you now decode more than one stream. [16:21.440 --> 16:21.840] Correct. [16:26.480 --> 16:33.520] It's almost the same answer. We decode one high resolution and many low resolutions, [16:33.520 --> 16:41.600] and the CPU resources depend on the resolution of the video that you have to decode. [16:44.320 --> 16:50.800] Just one question about the SDP that you showed before. So there were two receive-only streams for [16:50.800 --> 16:55.520] the client? Was that from the client? It was from the server. [16:55.520 --> 16:59.760] Okay, because that was my question. Because it looked like it was from the client. [16:59.760 --> 17:07.760] The server received two videos from the client, one in high resolution and one in low resolution, [17:07.760 --> 17:14.880] and sent one video to this client. There is only one participant in this conference. [17:14.880 --> 17:19.680] From the client perspective, when you switch from the big resolution to the low resolution, [17:19.680 --> 17:25.040] do you still use the same m-line that you have to send to the client? [17:25.040 --> 17:30.080] Yes, exactly. [17:30.080 --> 17:54.160] Okay, thank you very much.