[00:00.000 --> 00:15.720] Okay, we can start. Thank you for attending this meeting, which is about Linphone and [00:15.720 --> 00:23.320] video conferencing. My name is Jehan Monnier. I have been involved in the Linphone project since [00:23.320 --> 00:32.000] 2010, and I'm also part of the company which has been backing the Linphone project for almost [00:32.000 --> 00:42.240] ten years. So first, I'm going to provide you with a quick introduction to Linphone, and then [00:42.240 --> 00:52.720] say a couple of words about video conferencing with SIP, followed by an introduction to the [00:52.720 --> 01:02.440] selective forwarding unit, which is at the heart of modern video conferencing systems. Later I will [01:02.440 --> 01:09.040] talk about what is required on the SIP client side to be able to interoperate with this kind of [01:09.040 --> 01:18.440] video conferencing system. And finally, the conclusion. Okay, so just a couple of words [01:18.440 --> 01:28.400] about Linphone. Linphone is a voice over IP client implementing the SIP protocol, which started [01:28.400 --> 01:40.720] in the early 2000s. It's available on Linux, Android, iOS, Windows and Mac. It uses SIP as the base [01:40.720 --> 01:49.720] standard for almost everything, including audio and video calls, instant messaging and presence, everything [01:49.720 --> 01:58.040] which is required for real-time communication. It also provides end-to-end encryption [01:58.040 --> 02:07.440] for messaging, based more or less on the Signal protocol. The Linphone team develops the [02:07.440 --> 02:13.160] Linphone software but also Flexisip, which is basically a SIP proxy. And if you want to use a [02:13.160 --> 02:23.240] SIP account, it's possible to create one on our website, mainly for testing purposes. Okay, [02:23.240 --> 02:31.520] video conferencing with SIP in a couple of words. It revolves around a few standards. The first one [02:31.520 --> 02:40.000] is SIP itself: basic SIP with INVITE, REFER and BYE to create a conference, join the conference, and be [02:40.000 --> 02:50.840] able to invite other participants to the conference. It's mostly based on RFC 4579, which [02:50.840 --> 02:59.880] defines how to create a conference and how to join it. There is also another [02:59.880 --> 03:08.440] interesting standard, RFC 5366, which is about defining the list of participants [03:08.440 --> 03:18.120] of the conference. So that is for the establishment of the conference. The next important standard [03:18.120 --> 03:27.000] is what we call the conference event package, which is based on the SUBSCRIBE/NOTIFY RFCs. [03:27.000 --> 03:34.600] The idea is that when a participant joins the conference, it initiates a SIP SUBSCRIBE to the [03:34.600 --> 03:41.760] server. The server then notifies every participant of the conference about the [03:41.760 --> 03:49.000] state of the conference: who has joined, the status of their audio and video, everything which is [03:49.000 --> 04:01.800] related to the status of the conference.
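To make the conference event package more concrete, here is a rough sketch of the kind of NOTIFY the conference server sends, following RFC 4575 (Event: conference, body of type application/conference-info+xml); most SIP headers are omitted, and the conference URI, participant and media details are invented for illustration:

    NOTIFY sip:alice@192.0.2.1 SIP/2.0
    Event: conference
    Subscription-State: active
    Content-Type: application/conference-info+xml

    <?xml version="1.0" encoding="UTF-8"?>
    <conference-info xmlns="urn:ietf:params:xml:ns:conference-info"
                     entity="sip:conf-1234@sip.example.org"
                     state="full" version="1">
      <users>
        <user entity="sip:alice@sip.example.org" state="full">
          <endpoint entity="sip:alice@192.0.2.1">
            <status>connected</status>
            <media id="1"><type>audio</type><status>sendrecv</status></media>
            <media id="2"><type>video</type><status>sendrecv</status></media>
          </endpoint>
        </user>
      </users>
    </conference-info>

Each time a participant joins or leaves, or their audio or video status changes, the server pushes an updated (full or partial) document of this kind to every subscribed participant.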
On the media side, it's regular RTP. [04:01.800 --> 04:09.640] And for this video conferencing project, we added the support of two important RFCs, which are about bundling all the [04:09.640 --> 04:18.040] media streams into the same socket, in order to avoid having too many RTP sockets and RTP streams [04:18.040 --> 04:28.480] per SIP client. And regarding the security, it's regular SIP over TLS for the signaling. And [04:28.480 --> 04:38.320] for the media path, it's either SDES, where the symmetric key is set in the SDP, or ZRTP, or [04:38.320 --> 04:52.040] even DTLS-SRTP. And for the SRTP itself, it's standard AES. Okay, now let's introduce what [04:52.040 --> 05:00.360] the selective forwarding unit is. And I'm going to start with a description of what we used to [05:00.360 --> 05:11.920] have as a conferencing server. So in the past, the media mixer received the video from every [05:11.920 --> 05:22.440] user, decoded the video streams, performed the mixing, which could be a mosaic or any other layout, and then [05:22.440 --> 05:33.720] re-encoded the resulting stream to be sent to every participant. This kind of software has existed for [05:33.720 --> 05:50.760] 30 years, maybe 20. Here, I just want to show you a figure taken from RFC 7667, which is [05:50.760 --> 05:59.920] about the RTP topology of this former, legacy kind of conferencing system. So for each client, A, B, C, here it's [05:59.920 --> 06:05.720] audio, but it could be the same for video. You have one RTP stream going to the media server and [06:05.720 --> 06:15.520] one RTP stream which comes from the media server to each client. And it's on the server side that everything [06:15.520 --> 06:25.400] is decoded, mixed, and sent back to the clients. The advantage of this approach was that it was [06:25.400 --> 06:32.560] very simple from the client side, as calling the conferencing server was almost the same as calling a [06:32.560 --> 06:43.640] regular user agent. The drawbacks of this approach were that the video layout was defined [06:43.640 --> 06:52.320] server side, so you could only have one or two different layouts. It requires a lot of CPU [06:52.320 --> 07:01.120] resources server side, as every video stream has to be decoded and then re-encoded. And end-to-end [07:01.120 --> 07:12.720] encryption was not possible, due to the fact that the video was decoded. Now, if we go to [07:12.720 --> 07:22.000] the selective forwarding unit, the idea is that the media server is no longer decoding and then [07:22.000 --> 07:29.600] re-encoding every video stream, but rather switching the video coming from every device to every [07:29.600 --> 07:40.320] other device. And it can be done according to several policies, like active speaker or mosaic. [07:40.320 --> 07:47.680] And for that, we also need some information coming from each client, like the volume of the [07:47.680 --> 07:53.920] audio stream, in order to be able to know who is talking without having to decode the audio stream [07:53.920 --> 08:08.800] as well. If I go back to the same kind of diagram, still from RFC 7667, now you can see that from the RTP [08:08.800 --> 08:18.240] standpoint, you still have one RTP stream for each client going to the media server. But [08:18.240 --> 08:28.400] now you also have one incoming video stream per participant of the conference. So if we [08:28.400 --> 08:37.760] follow the video stream from client A, you can see that it is copied to client B, [08:37.760 --> 08:48.400] but as well to client F. So it's no longer a media mixer, but more or less a switching matrix. [08:48.400 --> 09:01.120] The advantages of this architecture are that the video layout is no longer defined on [09:01.120 --> 09:08.080] the server side; the client can decide where to display every participant of the conference. [09:08.080 --> 09:17.040] It's an application choice, no longer a server choice. It scales very well, as there are no [09:18.240 --> 09:23.280] resources used on the server side to decode or encode the video streams. [09:23.840 --> 09:29.840] And finally, it opens the door to end-to-end encryption, as the media server no longer [09:29.840 --> 09:38.400] has to know the content of a video stream. The drawback of this approach is that it requires [09:38.400 --> 09:46.800] the SIP client to be able to manage multiple streams, which was not the case for a standard one-to-one [09:46.800 --> 10:01.920] call.
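A side note on the audio-level information mentioned above: a standard way to carry it is the client-to-mixer audio-level RTP header extension defined in RFC 6464 (most likely what the SDP example later in the talk refers to), negotiated with an SDP line such as the following, where the extension id 1 is arbitrary:

    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level

Each RTP audio packet then carries a one-byte header extension with the packet's audio level in -dBov, so the forwarding unit can rank speakers and pick the active one without having to decode the audio payload.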
So for the SIP user agent, what we had to change is the following. It's mainly [10:01.920 --> 10:10.000] about multi-stream requirements. In the past, the SIP client was able to send one audio stream [10:10.000 --> 10:21.440] plus one video stream. Now, it requires the client to be able to send one, but most of the time [10:21.440 --> 10:29.440] two, video streams, one for the high resolution video and a second one for the thumbnail, [10:29.440 --> 10:39.600] as well as being able to receive one video stream per participant of the video conference. [10:41.600 --> 10:53.120] Here is just an example of the SDP, to show what is involved. So bundle mode, as I said, [10:53.120 --> 11:02.080] which is required, and rtcp-mux as well; this is to limit the number of sockets used for the media. [11:03.040 --> 11:10.800] This extension is related to the audio level, in order to be able to display who is talking, and also for [11:10.800 --> 11:21.520] the server to be able to select the video stream of the participant who is talking. It still uses ICE, to be able to limit [11:21.520 --> 11:31.840] the usage of media relays. And for the video part, what you can see is that there are two [11:32.560 --> 11:40.880] video streams in receive-only, one for the high resolution of the camera and another for the [11:40.880 --> 11:50.480] thumbnail. So it means that we encode the video twice. This could be replaced by a video [11:50.480 --> 12:00.400] encoder like H.264 SVC, which supports multi-layer functionality. But if you want to be able to do [12:00.400 --> 12:12.000] that with simple VP8, it's better to encode the video twice. And on the receiving side, [12:12.000 --> 12:22.160] there is one video stream, because in this example there is only one participant in the video [12:22.160 --> 12:36.000] conference. But this part would be multiplied by the number of participants of the conference.
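The SDP being described is on a slide rather than in the audio, so here is a loosely reconstructed fragment, as seen from the server side and with made-up addresses, ports, payload types and mid identifiers (ICE and SRTP key or fingerprint attributes omitted for brevity):

    v=0
    o=- 123456 2 IN IP4 198.51.100.10
    s=video conference
    c=IN IP4 198.51.100.10
    t=0 0
    a=group:BUNDLE a1 v1 v2 v3
    m=audio 9078 RTP/SAVPF 96
    a=mid:a1
    a=rtpmap:96 opus/48000/2
    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
    a=rtcp-mux
    a=sendrecv
    m=video 9078 RTP/SAVPF 97
    a=mid:v1
    a=rtpmap:97 VP8/90000
    a=rtcp-mux
    a=recvonly
    m=video 9078 RTP/SAVPF 97
    a=mid:v2
    a=rtpmap:97 VP8/90000
    a=rtcp-mux
    a=recvonly
    m=video 9078 RTP/SAVPF 97
    a=mid:v3
    a=rtpmap:97 VP8/90000
    a=rtcp-mux
    a=sendonly

Here the two recvonly video sections are the high-resolution stream and the thumbnail that the server accepts from this participant, and the sendonly section is the single stream it sends back; with more participants in the conference, that last part would be repeated once per remote participant.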
[12:36.000 --> 12:48.880] Okay. So this is what we have done on the Linphone project for this feature. [12:49.920 --> 12:58.080] It can be tested with the Flexisip server, which is currently running on our infrastructure. [12:58.080 --> 13:07.920] So you can create a video conference thanks to this conference factory SIP URI. And using a Linphone [13:07.920 --> 13:16.160] client with a version above 5.0, it's possible to join a video conference. Okay. Thank you. [13:18.080 --> 13:27.680] Conclusion. Okay. So now Linphone is capable of joining a video [13:27.680 --> 13:35.440] conference in two modes, mosaic and active speaker, using a selective forwarding unit, [13:35.440 --> 13:42.560] which allows it to scale. A possible evolution that we have in mind is to implement the XCON [13:43.600 --> 13:48.480] conferencing server, in order to be able to manage conferences from a website or to have [13:48.480 --> 13:53.600] something more advanced. We are also thinking about adding end-to-end encryption to this [13:53.600 --> 13:59.360] video conferencing server, and why not provide compatibility with WebRTC, [13:59.360 --> 14:04.880] knowing that the media protocols that we use are very close to those of WebRTC. [14:07.440 --> 14:11.600] Useful links: if you want to have more information about this work, [14:11.600 --> 14:15.440] you can go to the Linphone website and have a look at our GitHub. [14:15.440 --> 14:22.800] Okay. That's it. If you have any questions... Thank you. [14:32.800 --> 14:39.040] Are you aware of any other SIP client that implements multi-party video with an SFU? [14:39.040 --> 14:50.080] Not yet, because the work to move from a regular SIP phone supporting only one audio stream [14:50.080 --> 14:58.800] and one video stream to supporting this multi-stream model is very significant, and I'm not aware of any [15:00.080 --> 15:06.960] work in progress so far. So if you want to use it, you have to go with Linphone, [15:06.960 --> 15:11.280] even though it's fully standardized and we are following the standards. [15:13.840 --> 15:14.160] Thank you. [15:23.360 --> 15:29.200] Not yet. Not yet, but we are quite confident that it's going to scale, as we have removed [15:29.200 --> 15:36.480] all the need for audio or video encoding on the server side. So it's really about switching [15:36.480 --> 15:41.840] packets. Maybe the question might be more around the network on the client side. [15:42.560 --> 15:51.360] Regarding the network on the client side, we are sending two resolutions [15:52.000 --> 15:58.080] from the client, a high resolution and a low resolution. And in the case of active speaker, [15:58.080 --> 16:04.480] we only send back to every client the high resolution of the person who is currently [16:04.480 --> 16:12.400] talking and the low resolution of every other participant. So it greatly limits the bandwidth needed. [16:14.480 --> 16:14.720] Yes. [16:15.840 --> 16:21.440] On the client side, you now decode more than one stream. [16:21.440 --> 16:21.840] Correct. [16:26.480 --> 16:33.520] It's almost the same answer. We decode one high resolution and many low resolutions, [16:33.520 --> 16:41.600] and the CPU resources depend on the resolution of the video that you have to decode. [16:44.320 --> 16:50.800] Just one question about the SDP that you showed before. So there were two receive-only streams for [16:50.800 --> 16:55.520] the client? Was that from the client? It was from the server. [16:55.520 --> 16:59.760] Okay, because that was my question. Because it looked like it was from the client. [16:59.760 --> 17:07.760] The server received two videos from the client, one in high resolution and one in low resolution, [17:07.760 --> 17:14.880] and sent one video to this client. There is only one participant in this conference. [17:14.880 --> 17:19.680] From the client perspective, when you switch from the big resolution to the low resolution, [17:19.680 --> 17:25.040] do you still use the same m-line that you have to send to the client? [17:25.040 --> 17:30.080] Yes, exactly. [17:30.080 --> 17:54.160] Okay, thank you very much.