Okay, now it's time to introduce the guy who needs no introduction. We all know Saúl is one of the key members of the Jitsi team, and today he's going to talk about Skynet and AI summaries in Jitsi Meet, so thank you.

Thanks, Lorenzo, and thanks everybody for being here. Thanks for the intro, so I don't need to do that myself. Many of you probably know Jitsi already. Jitsi, where is my cursor, there we go, is a video conferencing platform; it's a toolkit to build your own; it's really a set of open source projects that we combine together to deliver these end-to-end video conferencing capabilities. It's also a set of APIs and SDKs that you can mix and match: host it yourself, or pay us some money and we have a service running, or just go to town with it. And it's also a community of people who build plugins for our platform and help each other; during the pandemic, for instance, we saw lots of people spinning up Jitsi instances to help other people communicate. It became a lot bigger than the way it started, because it's a project that has been around for a while: Jitsi is 20 years old. It started out as SIP Communicator, a SIP client; then, when XMPP Jingle became a thing, it pivoted to that; and then video was a big focus, so multi-party video was an area where a lot of effort was put in. That came to fruition when WebRTC came out, because the client was pushed to the browser and we could run the same software that used to power the client on the server this time, and do the multi-party video on the server. That's the Jitsi we have today, where last year I presented how we did 10,000 participants. So it went through many transformations over the years, and I think arguably the biggest transformation in all this time was WebRTC, and how the desktop client was in a way left behind and everything was moved into the browser.

Some say that AI is the next gold rush, or the next revolution in this space that will change things a little bit. As the old joke goes, in the gold rush era it's not the gold diggers that make the money, it's those selling shovels, so I'm hoping to show you some shovels today. Now, in 2023 and beyond, what's the state of things? AI became huge. It had already been around, right, we've all played video games with AI characters, but in November 2022 something changed. Can anybody guess? OpenAI, OpenAI indeed. OpenAI released ChatGPT. Now, the way I think of it in my head, the most important part of ChatGPT for the end user, and this is vocabulary that is now second nature, people use these words even though you don't need to know about transformers, the more important part to me is the chat, because it's the first time that we could interact with an AI in that way.
Before, it was always hidden in some backend server: oh, there is AI that enhances these pictures, there is this thing, but you couldn't directly interact with it, you couldn't ask it questions and get answers back, you did not have that ability to interact with it. I think that is what made these new developments special, more than the fact that you can host them yourself; of course that's a plus, and, in Jitsi style, that is the path we would try to follow.

So, as I mentioned, Jitsi is a collection of open source projects. Jitsi Meet is a platform built of different components, and this is the basic setup: we have the web server, the signaling server, which is still based on XMPP, the Jitsi Videobridge in charge of routing, and Jicofo doing the conference signaling. Now, another component not depicted here is Jigasi, which allows us to connect to the PSTN, and in this very room, back in 2017, we presented transcriptions with Jigasi. It was a project that started out with Google Summer of Code, so we had this building block already in place. And since all of these LLM technologies are text-to-text kinds of operations, where you need to feed them text to get some other text back, you really need transcriptions to start working with them. So, with that building block in place, we internally started to prototype: how do we want to leverage these tools, are they of use to us? We conducted an experiment where we built a bots framework. Our original idea was that, instead of us building something directly, we would build a framework so these technologies could be integrated externally into a meeting. Our rough idea, and what we built, was to use Puppeteer to run Chrome, because it's the richest WebRTC endpoint we can have, and I'm going to show why running Chrome on the backend, I guess you could call it, was a good idea. We integrated our low-level library, lib-jitsi-meet, and you could pretend to be a participant, talk to an LLM, get the transcript, and talk back to the user. With a script, this is what we prototyped, and I'd like to show you a couple of videos of what we built.

"Hey Tudor, I've been thinking about the architecture of our to-do web app, have you given it any thought?" "Yes, I've been reading up on it. I think we should go with a RESTful architecture, for scalability and flexibility." "That's a good idea. How about the database, should we use SQL or NoSQL?" "I think NoSQL is better for our needs, it's more flexible and can handle unstructured data." "Yeah, agreed. And what about the front-end framework?" (We had a robot.) "I was thinking of using React, it's fast." "Sounds pretty good. How about we get started on building the application?" "Sure thing, let's do it." All right, now we're going to bring John Doe into the meeting. He is arriving late, the bot just greeted him here in the chat, and now he'll get a quick transcript of what was said in this meeting so far, and he can just continue the conversation.

So that was the first thing we wanted to try: can we use this technology to build something like that? We had seen many others, you know, start with "oh, I get a transcript, I do some stuff", but because we had access to real-time transcriptions we wanted to try a little twist: you just arrive late and you get the summary of what was said already. OK, and this was done with Chrome running on the backend, in a container.
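To make that bot idea a bit more concrete, here is a rough sketch of the "headless browser as a meeting participant" trick. Our prototype used Puppeteer and lib-jitsi-meet in Node; this sketch swaps in Playwright for Python purely for illustration, and the room URL is just an example, not a real deployment.

```python
# Sketch only: the real prototype used Puppeteer + lib-jitsi-meet; this swaps
# in Playwright for Python to illustrate the same "Chrome in a container" idea.
from playwright.sync_api import sync_playwright

ROOM_URL = "https://meet.jit.si/SkynetBotPlayground"  # example room name

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            # Fake media devices so the bot can "hear" and "speak" without hardware.
            "--use-fake-ui-for-media-stream",
            "--use-fake-device-for-media-stream",
        ],
    )
    context = browser.new_context(permissions=["microphone", "camera"])
    page = context.new_page()
    page.goto(ROOM_URL)

    # In the real prototype the page ran a script on top of lib-jitsi-meet that
    # listened for transcription events, talked to an LLM, and posted the result
    # back into the meeting chat. Here we just idle for a while.
    page.wait_for_timeout(30_000)
    browser.close()
```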
Now, why would you want to run Chromium on the backend, in a container? Because then you can do cool stuff like this. No, not that... hello? I don't know if it's playing or not; I'll show you when I switch to the files. But basically it's the fact that we can play audio and use WebGL, and we make use of that; the browser is really good at this sort of thing.

Now, what did we learn in this exercise of attempting to use it this way? Well, first, that JavaScript might not have been the best choice, because most of the AI libraries live in a different camp. Also, that for our specific application, instead of going the general route of "you can ask it any question", more specific tasks work better; in this case meeting summaries are something very well defined and very well understood, and that also allows us to give some more value to our users and customers. We can only help our users when they are in a meeting in our software, but if we do something like this we can help them even when they're not there: if you can go and check the notes of a meeting you were not part of, and it turns out they are useful, so you didn't really need to be there, well, that is in and of itself something helpful. Now, in terms of running this, our idea is to run the most modest model that fulfills the task, because running these things, and I'll talk about that in a little bit, can be taxing. So you want something as simple as possible which still meets the criteria, because of course cost can be a problem: yes, we run the model ourselves, but that also costs money, so you need to balance those things out. And, as I said, one thing we realized is that writing all of this logic in a bot felt kind of wrong, because you might want to apply the same logic in different places. So we thought maybe we should move all that logic, to do summaries or other interactions with the meeting, into its own dedicated framework, and then we can reuse it here or maybe in other places. That's where the idea of Skynet came from, and we started prototyping it right after.

So Skynet is our AI core for Jitsi meetings. It is designed to support different but specific AI operations, with the ability to scale horizontally, so you can run multiples of each of the parts that compose Skynet. We currently implement, like, three, but really two, AI services: summarization and transcription; we get an OpenAI-compatible API for free, but at the moment we're not making extensive use of it per se, focusing instead on the specific tasks of summarization and transcription. Our initial focus, as with kind of everything we do at Jitsi, was on running it locally, using local LLMs, so you can run it on your own servers and you don't rely on external cloud services. This is how we always build things; authentication is also a URL you plug in and connect elsewhere, so it fits our DNA. And personally I was excited because we were using Python again, and I hadn't used it in a few years, so it was cool to go back to it. I totally stole this way of framing things from someone's talk at CommCon, I can't remember his name, but it really resonates with me: in the current AI landscape you have tools that can do speech-to-text, text-to-speech, text-to-text, so you have all these transformations, and then you can combine them, because if you want to do a summary you probably have voice first, so you need speech-to-text first, then some text-to-text, and then you can summarize that. So we've got our summaries application, which sits on top of LangChain, and then we run our LLM underneath it.
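As a rough illustration of that text-to-text step, this is the shape of it, assuming a local Llama 2 GGUF file served through llama-cpp-python; the model path and the prompt are placeholders, and Skynet's real prompts and wiring differ.

```python
# Minimal sketch of the summarization step, assuming a local Llama 2 GGUF
# model via llama-cpp-python. Skynet's actual prompts and plumbing differ.
from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate

llm = LlamaCpp(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # example path
    n_ctx=4096,        # the context window has to fit the transcript
    n_gpu_layers=-1,   # offload to the GPU if one is available
    temperature=0.1,
)

prompt = PromptTemplate.from_template(
    "Summarize the following meeting transcript in a few short paragraphs, "
    "focusing on decisions and action items.\n\nTranscript:\n{transcript}\n\nSummary:"
)

chain = prompt | llm  # transcript text goes in, summary text comes out

transcript = "Tudor: I think we should go with a RESTful architecture...\nSaul: Agreed..."
print(chain.invoke({"transcript": transcript}))
```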
And then for transcriptions we're currently running Whisper; I'm going to go into a bit of detail on how we run it. We divided the Skynet architecture internally to show this divide, because I think it helps build a mental model of how the data flows: it originates as speech, goes to text, and ends up being text again.

So, for example, the summaries module has two parts to it: one is the dispatcher, as we call it, and the other one is the executor. The only one that needs access to the LLM, to actually run inference and produce an output, is the executor. The idea is that we can have multiple dispatchers that handle the requests, prepare them, and store the content that needs summarizing in a cache; as a worker becomes available it will pick up the work, do the work, publish a result, and then the frontend can get the result back to the user. This allows us to throw in however many executors we want, based of course on how much money it costs us to run them, because we need GPUs to do the inference, and on how much capacity we need to serve. How many of them we run is a matter of measuring how much you want to spend, how much you need to serve at a given time, and how long the answer can take: it's possible that getting a summary two minutes after a meeting is acceptable, or maybe you need it within a minute, it all depends, and depending on how you want to go about it you can play with how many you run. But we built it so we could run it this way and scale horizontally based on the load we need.

Now, as we started playing with this for real, one thing quickly became obvious: when is a summary a good summary? Well, first of all, when it has a good input. The transcription is really critical, because if you have a bad transcript there is no way for you to get a good summary; with a good transcript you can still get a bad summary, but with a bad transcript you're definitely going to get a bad one. Jigasi, as I mentioned before, already had transcription capabilities; today it can connect to Google Cloud, to Vosk, and now to Skynet. Now, Google Cloud has changed the models they offer, and the one we were using was not really great, it didn't give very accurate transcriptions, and that then showed in the summaries we were getting. We have not yet played with the other models they have, but then again, sending our audio samples to Google is not something we're looking forward to. So we started building the equivalent on top of Whisper, because it gave better results. The end result is a Skynet module where Jigasi opens a WebSocket connection towards Skynet, sends audio frames in PCM format, and gets back the transcript. In this module we run inference leveraging the faster-whisper project, which combines voice activity detection with an alternate implementation of Whisper to give you these transcripts, and you can have near real-time transcriptions; in fact, in Jigasi you can use this to also show subtitles, so it is definitely fast enough for the application we're interested in. And what faster-whisper allows us to do, that the quote-unquote OG Whisper doesn't, is to work in this streaming manner, in real time.
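A stripped-down sketch of that kind of streaming endpoint is below, assuming 16 kHz mono 16-bit PCM frames arrive as binary WebSocket messages and a made-up JSON reply format; the real Skynet module does smarter buffering, VAD handling, and per-participant bookkeeping than this.

```python
# Sketch of a streaming transcription endpoint. Assumes 16 kHz mono 16-bit PCM
# frames as binary WebSocket messages; the JSON reply shape is a placeholder.
import asyncio
import json

import numpy as np
import websockets
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
SAMPLE_RATE = 16_000

async def handle(ws):
    buffer = np.zeros(0, dtype=np.float32)
    async for message in ws:
        # Convert raw PCM to float32 in [-1, 1] and accumulate a chunk.
        pcm = np.frombuffer(message, dtype=np.int16).astype(np.float32) / 32768.0
        buffer = np.concatenate([buffer, pcm])
        if len(buffer) < SAMPLE_RATE * 2:  # wait for roughly 2 seconds of audio
            continue
        segments, _ = model.transcribe(buffer, vad_filter=True, beam_size=1)
        text = " ".join(seg.text.strip() for seg in segments)
        if text:
            await ws.send(json.dumps({"type": "final", "text": text}))
        buffer = np.zeros(0, dtype=np.float32)

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```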
Sending WAV files back and forth is not something we can use for this application, because we want to get the transcripts in real time, and this is the way we accomplish it. One advantage we also get from doing it this way is that Jigasi receives the streams from each participant individually, so we already have each participant identified by their stream. You can of course always transcribe a recording, but if the recording has all the audio mixed together then you also need to separate it, and there can be mistakes in that. This way we don't run into that problem, because you can clearly identify whose audio it is and send that as part of the metadata that comes back on the WebSocket channel.

So, once we had all of these things glued together, we had to face reality, which is this kind of AI ops. You can't just throw this on a server and it runs and it's all great, because models are kind of big, and that can be a bit of a problem. Deploying this to actual production is a new headache you need to worry about, because, as I mentioned, if you're horizontally scaling you want the new servers that you put into the pool to boot fast, you want them to be able to start working right off the bat. And then you might need multiple container images, because, oh, this needs this version of CUDA: the version of CUDA that faster-whisper works with is different from the one we need to run, you know, Llama, for example, so you need to think about that. Currently the way we're doing it is we run OCI virtual machines, and inside each of them we run Nomad and Skynet in a container, and the VMs have the models baked into their image. This way the models are readily available, because the container images are already pretty big, and adding the models to them would make them unbearably big and you end up with timeouts and things you don't really like.

I'd like to show you how this looks a little bit, and since I'm doing pretty well on time, let's see, where is my mouse. So first I would like to show you the video that I couldn't show you before. It is this guy, this was the video of the bot that joins from the container. Well, maybe that's why it didn't play... ah, there we go: "Hello, I'm a friendly bot. This is just an example of what I can do. Check out other examples for more." So this looks very simple, and in a way it is, but the way you get here: this little robot that moves its mouth is a 3D model made with Blender, animated with WebGL, and the lips animate as the audio is being played back. The audio is played back directly in the browser; we used the service Play.ht, just as a test. Going forward, if we end up needing this, I would love to tinker with Microsoft's SpeechT5 or Mozilla's Coqui, I think it's called, because you can self-host those yourself. So using a browser as your runtime for bots is kind of nice, because it does allow you to do these sorts of things: it would be very hard to create a 3D robot that moves its mouth in something that's not really a browser, and you can animate it so easily with the same library you use for the rest of the stuff. So now I want to show you how this thing looks when you run it.
First, let's look at, for instance, the real-time demo of the transcriber. Let's see... oh, that's pretty good. Or is it? Are we running? That's the other one. So thank you, Răzvan, you can tell who built it. So as we connect and we wait for the interim results... hello, mister robot... let's look at what's going on. Are we running? I didn't do anything; let's reboot that, I don't like you. OK, I think we should be good now. Are you there yet? What mic is it using? Let's see... it is using my mic. Well, that's unfortunate, I'll try to show you later. This demo is part of the Skynet project itself, and the way it works is: it uses an audio worklet, it sends the audio frames to the Skynet server, and then it renders them here in the browser, and when it does, it does so in a close to real-time manner. The idea behind it is that you don't need the whole Jigasi and Jitsi shebang to test how it works, so it's a self-contained demo. I'm not sure if it's getting confused by the network or something else, because when I tried yesterday it was working fine, and in fact I do see data being received in this thing, but I don't see it being rendered here. We're going to try one more time: hello, are you there? And if not, we're going to move on. So we're going to move on. Yeah, well, you know, experimental technology, what can you do? That's what I was thinking, because this network is fun. Let's see if the other one works.

So here I'm now going to load Skynet. We built this thing to be modular, I think of it as a modular monolith, if you will: it's one fat thing, but you can disable parts of it. The reason behind it is that otherwise you end up duplicating things: you want to have common configuration, you want consistent logging, a lot of things that are shared, and this way you can select what you want to run. Do you want to run only the transcripts, do you also want to run the summaries? All of that is separated and you can decide whether to run it or not. So I put some text in here of a fictitious conversation that a guy named Tudor and I had; ChatGPT is very good at coming up with interesting conversations for you to see how they get summarized. We're not necessarily interested in the content, and I'm pretty sure you can't read it anyway, but I want to comment on the API a little bit. I'm going to copy this, in case this thing goes to shit, because we don't know. Now we send a POST with the data we want summarized to our summary endpoint; hopefully it got the response, yeah, and we get back an ID. Then we can query that ID on the job endpoint, and there we go: we get the result of our job, with a status of success, the type is summary, it took 11 seconds, and this is a Mac with an M1, so it's running accelerated on this very GPU. And then, yeah, "Tudor and Saúl are discussing the design of the backend of the web app", yada yada yada.

We built this API to follow a bit of a polling mechanism, because we found in practice that, even within our own workplace, there are these stingy proxies, and summarizing a long conversation can take a little bit of time; if the request lived for too long, some proxy in the middle would decide to cut it. And also, in order to make it resilient to things dying in the middle and another of these machines taking over and running the job again, we decided to build it this way. So the idea is: you post the summary, you get back an ID, and then with the job polling API you can get the job's result once it's done.
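In client code, the flow from that demo looks roughly like this; note that the endpoint paths, port, and JSON field names below are placeholders for illustration, so check the Skynet repository for the actual routes.

```python
# Rough sketch of the post-then-poll flow shown in the demo. The endpoint
# paths, port and JSON field names are placeholders, not Skynet's real API.
import time
import requests

SKYNET = "http://localhost:8000"  # wherever your Skynet instance runs

transcript = "Tudor: Let's use a RESTful backend...\nSaul: Agreed, and NoSQL for storage..."

# 1. Submit the text to be summarized; we immediately get back a job ID.
job_id = requests.post(f"{SKYNET}/summaries/v1/summary",
                       json={"text": transcript}).json()["id"]

# 2. Poll the job endpoint until a worker has picked it up and finished.
while True:
    job = requests.get(f"{SKYNET}/summaries/v1/job/{job_id}").json()
    if job["status"] in ("success", "error"):
        break
    time.sleep(2)  # short-lived requests keep picky proxies happy

print(job["status"], job.get("result"))
```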
I was also looking into using, for example, EventSource as another option, so you could have an ongoing stream; that's a possibility we'll probably look into adding, to make it a bit more palatable, but we found that in practice this has been working well. We have processed on the order of hundreds of thousands of little summaries within the company, and so far we're happy with how the architecture is working, so we can focus on what we're going to do next.

Some of the things we think we should do next: supporting multiple model runners, or backends, or however we want to call them. As I said, we started out hosting our own LLMs, but that has different trade-offs, and there are different reasons why people may or may not want to do that. This example was running on top of a 7-billion-parameter Llama 2, but someone may want to talk to OpenAI directly, maybe they're not worried about sending their data to OpenAI, or they have a different deal, or whatever; or maybe Claude works better for you; or maybe you don't want to host the model yourself but offload that responsibility to another company like OpenRouter, which hosts open source models. What does Skynet do in that case? Well, the nice thing is that it can shield you from all of those changes: we're thinking that we can change what engine you're actually using to run the inference, but you don't need to change your API, you don't need to modify any of your applications; if we switch to a model that works better, you're just going to get better results. That is the path we're trying to follow.

We're also going to work on integrating late-arrival summaries in Jitsi Meet. As I said, we focus on building blocks so we can then plug them together. Right now you can get a transcript with Jigasi and you can use Skynet to summarize it; we are going to build a way to glue that directly into Jitsi Meet, so you get the late-arrival summary straight in the meeting without needing this whole bot thing, which I'd be very happy if we went back to, but it may take a little while. Of course, more prompt tweaking: I think Rob did a good job talking about prompting, and the huge prompt he had; I'm not sure that ever ends, you always start with something and then keep tweaking it. It also depends on the context, right: it's different to summarize an article than to summarize a conversation, and you might want to take the model on a little journey, or steer it in a given direction, so it does a better job at what it needs to do. And lastly, just this past week there was a new release of LLaVA, the open source vision model, and in the context of a meeting, which is our main focus, it would make sense, for instance, to summarize slides, because they do help you capture the context of a meeting: if someone is sharing slides, they're probably important, probably the central part of the meeting. So if we could use LLaVA to get a glimpse of what's on the slides, plus the transcript, we think combining all of that would give us a pretty good view of what happened in that meeting, and this way, again, help our users, those who were not necessarily in the meeting.
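Coming back to the pluggable model runners for a second: the idea is that the HTTP API stays fixed while the engine behind it can change. A toy sketch of what that separation could look like is below; it is not Skynet's actual abstraction, the class and method names are made up, and it assumes llama-cpp-python for the local case and the OpenAI client for the hosted case.

```python
# Toy illustration of swapping model runners behind one summarize() call.
# Not Skynet's real abstraction; class and method names are made up.
from abc import ABC, abstractmethod

PROMPT = "Summarize this meeting transcript, focusing on decisions:\n\n{text}\n\nSummary:"

class SummaryRunner(ABC):
    @abstractmethod
    def summarize(self, transcript: str) -> str: ...

class LocalLlamaRunner(SummaryRunner):
    """Runs a local GGUF model via llama-cpp-python."""
    def __init__(self, model_path: str):
        from llama_cpp import Llama
        self.llm = Llama(model_path=model_path, n_ctx=4096)

    def summarize(self, transcript: str) -> str:
        out = self.llm(PROMPT.format(text=transcript), max_tokens=512)
        return out["choices"][0]["text"].strip()

class OpenAIRunner(SummaryRunner):
    """Sends the transcript to a hosted model instead (OpenAI, OpenRouter, ...)."""
    def __init__(self, model: str = "gpt-3.5-turbo", base_url: str | None = None):
        from openai import OpenAI
        self.client = OpenAI(base_url=base_url)
        self.model = model

    def summarize(self, transcript: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": PROMPT.format(text=transcript)}],
        )
        return resp.choices[0].message.content

# The HTTP layer only ever calls runner.summarize(...), so swapping engines
# does not change the API that applications see.
```

Since OpenRouter exposes an OpenAI-compatible API, pointing the same client at a different base URL is one way to offload hosting without touching anything above the runner.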
All of this we open sourced: the first Skynet version went up yesterday and is now available with, yeah, the full git history, so you'll see our mistakes and everything we learned along the way. We're not, you know, machine learning scientists or mad scientists here, we are learning ourselves, it's a brand new world. And I have two people to thank, and they actually came all the way here: Răzvan and Tudor, thank you for joining me on this interesting journey. I would definitely not be here telling you this if it wasn't for them, so I'm very thankful, and it was a very exciting project to get started, and now we can take it further. I'm not sure how I am on time, because this thing reset, but that's all I've got to tell you today. If there are any questions, I'm here.

Right there at the back: you mentioned that you are happy with the architecture of your setup right now; how happy are you with the actual summaries? Is there much hallucination going on? Would it be different from a human who is not necessarily fully aware of all of the context?

Good question. So Ralph was asking: we're happy with the architecture, but are we happy with the summaries? So, you know, we're trying to support two APIs, summaries and action items. Now, a nice thing about this particular application is that, if it fits in the context window, the LLM does not need to invent anything; everything it needs to know is within the data that you give it: this is the conversation, this is what was said. And it can come up with decent summaries at the small model sizes, but where smaller models sort of miss the mark is in capturing what's actually important in a conversation: what are the key points we should focus on when summarizing this thing. We're trying to find the right balance, how high up we go in model size, so that we get a better summary and also in a timely manner, because of course going bigger will mean slower inference and it may also mean higher cost. We didn't spend a huge amount of time improving that yet; we have something that works OK, but it can definitely be improved, so yes, we could be happier. We are focusing on making that better, and our next step in that direction is making this more available, for example by sharing it here and also by sharing it within the company, so that people can run their meetings and get them summarized, and this way we're going to tweak, improve, and rinse and repeat all the time.

Question there: have you guys looked into running the language model in the browser itself? There is, for example, a Rust implementation with WebGPU support, which means you can use the browser to interact with the language model; I was just wondering if there's been any experimentation with not doing it centrally on a server.

So the question is that there are implementations that allow you to run the model directly in the browser, with WebGPU, and whether we have thought of running it in the browser. No, we have not. That said, it's a very interesting idea, and it's one that actually fits what I said at the beginning about the bots, because the browser is such a competent beast, it can do everything. You could actually test it that way, because essentially what we had was a script that the browser ran in a container, and it had access to everything, so I think that would be a cool place to test it.
Basically, what our first test did was use JavaScript to tell the transcriber, hey, give me a transcript in real time, and then we would, in real time, talk to, let's say, OpenAI and get back results. So in that sense, running it locally would be completely attainable; very interesting, I think it would be worth giving it a go. One of the advantages of running a centralized thing, though, is the fault tolerance: the fact that you send a transcript from the dispatcher to store it in the cache, and even if whatever node is running inference crashes, another one will pick it up; we have a mechanism so that another node picks up work that has not been updated in a while. If you push that all the way to the client, well, first, everybody would need to run their own inference, so it feels a bit more wasteful overall if you sum it all up, and second, you have the problem that if your browser crashes you're kind of left out in the cold, whereas in the other case the server can take care of it and send it to you, you know, by email when it's ready. But I think it's a very interesting idea.

Right there, behind you: is it possible to connect to your partners' Webexes and Zooms and so on? There are companies doing that, right?

So the question is: sometimes you talk with external organizations, would you be able to put the bot there? The bot was the experiment that took us to where we are, but right now that effort is paused. That bot was a bot for Jitsi meetings specifically, so the idea is that the moment anyone from any organization joins one of your meetings, you would be able to transcribe it; but if you join their meetings, yeah, we don't have that. I know there are companies that sell proprietary products that integrate with different meeting providers and can do this, so there's a bit of a split: there are companies doing it externally, and then each vendor is sort of building it into their own product. I'm not sure what the right answer is. One of the hardest parts of working in this space has been filtering the signal from the noise, there's just so much shit going on, but I think these particular features really blend well with a meetings product, so we're making them built in: whenever anyone, wherever they come from, ends up in one of your meetings and you record it... in fact, this is a change that is coming in the next stable release: making a recording will also make a transcription, unless you opt out, because at the end of the day the recording contains everything that's also in the transcription; the transcription just makes it more palatable, so you can then operate on it. And that's sort of the direction we're taking at the moment.

Question here. Yes, that is a good question. So that is a problem we have not solved yet. We do have the capability of doing translations, but at the moment the only entity we use to do translations is Google Cloud, so that's sort of out of this picture, and there's also a limitation on the source language: you can get subtitles in the language that you want, but only as long as the source language is English. What we are looking at, which is a bit further in the future, is, well, first of all, identifying the languages on the fly, so the source language doesn't matter,
and then having the ability for the source language to not be fixed, so it can be different per participant: we can both be in a meeting, I can be talking to you in English, you can be replying in French, and I will see subtitles in English, you will see subtitles in French, and then we'll get the summary in, I don't know, Mandarin, because it's going to be fun. You get the idea, that's where we're going, but more work is needed in that area. And that's all; you can find me in the hall. Thank you very much.