Okay, now it's time to introduce the guy who needs no introduction. We all know Saúl is one of the key members of the Jitsi team, and today he's going to talk about Skynet and AI summaries in Jitsi Meet, so thank you.

Thanks, Lorenzo, and thanks everybody for being here. Thanks for the intro, so I don't need to do that myself. Many of you probably know Jitsi already. Jitsi, where is my cursor, there we go, is a video conferencing platform; it's a toolkit to build your own; it's really a set of open source projects that we combine together to deliver these end-to-end video conferencing capabilities. It's also a set of APIs and SDKs that you can mix and match: host it yourself, or pay us some money and we have a service running, or just go to town with it. And it's also a community of people who build plugins for our platform and help each other; during the pandemic, for instance, we saw lots of people spinning up Jitsi instances to help other people communicate. It became a lot bigger than the way it started, because it's a project that has been around for a while: Jitsi is 20 years old. It started out as SIP Communicator, a SIP client; then, when XMPP Jingle became a thing, it pivoted to that; and then video was a big focus, so multi-party video was an area where a lot of effort was put in. That came to fruition when WebRTC came out, because the client was pushed to the browser and we could run the same software that used to power the client on the server this time, and do the multi-party video on the server. That's the Jitsi we have today, where last year I presented how we did 10,000 participants. So it went through many transformations over the years, and I think arguably the biggest transformation in all this time was WebRTC, and how the desktop client was in a way left behind and everything was moved into the browser.

Some say that AI is the next gold rush, or the next revolution in this space that will change things a little bit. As the old joke goes, in the gold rush era it's not the gold diggers that make the money, it's those selling shovels, so I'm hoping to show you some shovels today. Now, in 2023 and beyond, what's the state of things? AI became huge. It had already been around, right, we've all played video games with AI characters, but in November 2022 something changed. Can anybody guess? OpenAI, OpenAI indeed. OpenAI released ChatGPT. Now, the way I think of it in my head, the most important part of ChatGPT for the end user, and this is vocabulary that is now second nature, people use these words even though you don't need to know about transformers, the more important part to me is the chat, because it's the first time that we could interact with an AI in that way.
Before, it was always hidden in some backend server: oh, there is AI that enhances these pictures, there is this thing, but you couldn't directly interact with it, you couldn't ask it questions and get answers back, you did not have that ability to interact with it. I think that is what made these new developments special, more than the fact that you can host them yourself; of course that's a plus, and, in Jitsi style, that is the path we would try to follow.

So, as I mentioned, Jitsi is a collection of open source projects. Jitsi Meet is a platform built of different components, and this is the basic setup: we have the web server, the signaling server, which is still based on XMPP, the Jitsi Videobridge in charge of routing, and Jicofo doing the conference signaling. Now, another component not depicted here is Jigasi, which allows us to connect to the PSTN, and in this very room, back in 2017, we presented transcriptions with Jigasi. It was a project that started out with Google Summer of Code, so we had this building block already in place. And since all of these LLM technologies are text-to-text kinds of operations, where you need to feed them text to get some other text back, you really need transcriptions to start working with them. So, with that building block in place, we internally started to prototype: how do we want to leverage these tools, are they of use to us? We conducted an experiment where we built a bots framework. Our original idea was that, instead of us building something directly, we would build a framework so these technologies could be integrated externally into a meeting. Our rough idea, and what we built, was to use Puppeteer to run Chrome, because it's the richest WebRTC endpoint we can have, and I'm going to show why running Chrome on the backend, I guess you could call it, was a good idea. We integrated our low-level library, lib-jitsi-meet, and you could pretend to be a participant, talk to an LLM, get the transcript, and talk back to the user. With a script, this is what we prototyped, and I'd like to show you a couple of videos of what we built.

"Hey Tudor, I've been thinking about the architecture of our to-do web app, have you given it any thought?" "Yes, I've been reading up on it. I think we should go with a RESTful architecture, for scalability and flexibility." "That's a good idea. How about the database, should we use SQL or NoSQL?" "I think NoSQL is better for our needs, it's more flexible and can handle unstructured data." "Yeah, agreed. And what about the front-end framework?" (We had a robot.) "I was thinking of using React, it's fast." "Sounds pretty good. How about we get started on building the application?" "Sure thing, let's do it." All right, now we're going to bring John Doe into the meeting. He is arriving late, the bot just greeted him here in the chat, and now he'll get a quick transcript of what was said in this meeting so far, and he can just continue the conversation.

So that was the first thing we wanted to try: can we use this technology to build something like that? We had seen many others, you know, start with "oh, I get a transcript, I do some stuff", but because we had access to real-time transcriptions we wanted to try a little twist: you just arrive late and you get the summary of what was said already. OK, and this was done with Chrome running on the backend, in a container.
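To make that bot idea a bit more concrete, here is a rough sketch of the "headless browser as a meeting participant" trick. Our prototype used Puppeteer and lib-jitsi-meet in Node; this sketch swaps in Playwright for Python purely for illustration, and the room URL is just an example, not a real deployment.

```python
# Sketch only: the real prototype used Puppeteer + lib-jitsi-meet; this swaps
# in Playwright for Python to illustrate the same "Chrome in a container" idea.
from playwright.sync_api import sync_playwright

ROOM_URL = "https://meet.jit.si/SkynetBotPlayground"  # example room name

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            # Fake media devices so the bot can "hear" and "speak" without hardware.
            "--use-fake-ui-for-media-stream",
            "--use-fake-device-for-media-stream",
        ],
    )
    context = browser.new_context(permissions=["microphone", "camera"])
    page = context.new_page()
    page.goto(ROOM_URL)

    # In the real prototype the page ran a script on top of lib-jitsi-meet that
    # listened for transcription events, talked to an LLM, and posted the result
    # back into the meeting chat. Here we just idle for a while.
    page.wait_for_timeout(30_000)
    browser.close()
```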
Now, why would you want to run Chromium on the backend, in a container? Because then you can do cool stuff like this. No, not that... hello? I don't know if it's playing or not; I'll show you when I switch to the files. But basically it's the fact that we can play audio and use WebGL, and we make use of that; the browser is really good at this sort of thing.

Now, what did we learn in this exercise of attempting to use it this way? Well, first, that JavaScript might not have been the best choice, because most of the AI libraries live in a different camp. Also, that for our specific application, instead of going the general route of "you can ask it any question", more specific tasks work better; in this case meeting summaries are something very well defined and very well understood, and that also allows us to give some more value to our users and customers. We can only help our users when they are in a meeting in our software, but if we do something like this we can help them even when they're not there: if you can go and check the notes of a meeting you were not part of, and it turns out they are useful, so you didn't really need to be there, well, that is in and of itself something helpful. Now, in terms of running this, our idea is to run the most modest model that fulfills the task, because running these things, and I'll talk about that in a little bit, can be taxing. So you want something as simple as possible which still meets the criteria, because of course cost can be a problem: yes, we run the model ourselves, but that also costs money, so you need to balance those things out. And, as I said, one thing we realized is that writing all of this logic in a bot felt kind of wrong, because you might want to apply the same logic in different places. So we thought maybe we should move all that logic, to do summaries or other interactions with the meeting, into its own dedicated framework, and then we can reuse it here or maybe in other places. That's where the idea of Skynet came from, and we started prototyping it right after.

So Skynet is our AI core for Jitsi meetings. It is designed to support different but specific AI operations, with the ability to scale horizontally, so you can run multiples of each of the parts that compose Skynet. We currently implement, like, three, but really two, AI services: summarization and transcription; we get an OpenAI-compatible API for free, but at the moment we're not making extensive use of it per se, focusing instead on the specific tasks of summarization and transcription. Our initial focus, as with kind of everything we do at Jitsi, was on running it locally, using local LLMs, so you can run it on your own servers and you don't rely on external cloud services. This is how we always build things; authentication is also a URL you plug in and connect elsewhere, so it fits our DNA. And personally I was excited because we were using Python again, and I hadn't used it in a few years, so it was cool to go back to it. I totally stole this way of framing things from someone's talk at CommCon, I can't remember his name, but it really resonates with me: in the current AI landscape you have tools that can do speech-to-text, text-to-speech, text-to-text, so you have all these transformations, and then you can combine them, because if you want to do a summary you probably have voice first, so you need speech-to-text first, then some text-to-text, and then you can summarize that. So we've got our summaries application, which sits on top of LangChain, and then we run our LLM underneath it.
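As a rough illustration of that text-to-text step, this is the shape of it, assuming a local Llama 2 GGUF file served through llama-cpp-python; the model path and the prompt are placeholders, and Skynet's real prompts and wiring differ.

```python
# Minimal sketch of the summarization step, assuming a local Llama 2 GGUF
# model via llama-cpp-python. Skynet's actual prompts and plumbing differ.
from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate

llm = LlamaCpp(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # example path
    n_ctx=4096,        # the context window has to fit the transcript
    n_gpu_layers=-1,   # offload to the GPU if one is available
    temperature=0.1,
)

prompt = PromptTemplate.from_template(
    "Summarize the following meeting transcript in a few short paragraphs, "
    "focusing on decisions and action items.\n\nTranscript:\n{transcript}\n\nSummary:"
)

chain = prompt | llm  # transcript text goes in, summary text comes out

transcript = "Tudor: I think we should go with a RESTful architecture...\nSaul: Agreed..."
print(chain.invoke({"transcript": transcript}))
```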
And then for transcriptions we're currently running Whisper; I'm going to go into a bit of detail on how we run it. We divided the Skynet architecture internally to show this divide, because I think it helps build a mental model of how the data flows: it originates as speech, goes to text, and ends up being text again.

So, for example, the summaries module has two parts to it: one is the dispatcher, as we call it, and the other one is the executor. The only one that needs access to the LLM, to actually run inference and produce an output, is the executor. The idea is that we can have multiple dispatchers that handle the requests, prepare them, and store the content that needs summarizing in a cache; as a worker becomes available it will pick up the work, do the work, publish a result, and then the frontend can get the result back to the user. This allows us to throw in however many executors we want, based of course on how much money it costs us to run them, because we need GPUs to do the inference, and on how much capacity we need to serve. How many of them we run is a matter of measuring how much you want to spend, how much you need to serve at a given time, and how long the answer can take: it's possible that getting a summary two minutes after a meeting is acceptable, or maybe you need it within a minute, it all depends, and depending on how you want to go about it you can play with how many you run. But we built it so we could run it this way and scale horizontally based on the load we need.

Now, as we started playing with this for real, one thing quickly became obvious: when is a summary a good summary? Well, first of all, when it has a good input. The transcription is really critical, because if you have a bad transcript there is no way for you to get a good summary; with a good transcript you can still get a bad summary, but with a bad transcript you're definitely going to get a bad one. Jigasi, as I mentioned before, already had transcription capabilities; today it can connect to Google Cloud, to Vosk, and now to Skynet. Now, Google Cloud has changed the models they offer, and the one we were using was not really great, it didn't give very accurate transcriptions, and that then showed in the summaries we were getting. We have not yet played with the other models they have, but then again, sending our audio samples to Google is not something we're looking forward to. So we started building the equivalent on top of Whisper, because it gave better results. The end result is a Skynet module where Jigasi opens a WebSocket connection towards Skynet, sends audio frames in PCM format, and gets back the transcript. In this module we run inference leveraging the faster-whisper project, which combines voice activity detection with an alternate implementation of Whisper to give you these transcripts, and you can have near real-time transcriptions; in fact, in Jigasi you can use this to also show subtitles, so it is definitely fast enough for the application we're interested in. And what faster-whisper allows us to do, that the quote-unquote OG Whisper doesn't, is to work in this streaming manner, in real time.
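A stripped-down sketch of that kind of streaming endpoint is below, assuming 16 kHz mono 16-bit PCM frames arrive as binary WebSocket messages and a made-up JSON reply format; the real Skynet module does smarter buffering, VAD handling, and per-participant bookkeeping than this.

```python
# Sketch of a streaming transcription endpoint. Assumes 16 kHz mono 16-bit PCM
# frames as binary WebSocket messages; the JSON reply shape is a placeholder.
import asyncio
import json

import numpy as np
import websockets
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
SAMPLE_RATE = 16_000

async def handle(ws):
    buffer = np.zeros(0, dtype=np.float32)
    async for message in ws:
        # Convert raw PCM to float32 in [-1, 1] and accumulate a chunk.
        pcm = np.frombuffer(message, dtype=np.int16).astype(np.float32) / 32768.0
        buffer = np.concatenate([buffer, pcm])
        if len(buffer) < SAMPLE_RATE * 2:  # wait for roughly 2 seconds of audio
            continue
        segments, _ = model.transcribe(buffer, vad_filter=True, beam_size=1)
        text = " ".join(seg.text.strip() for seg in segments)
        if text:
            await ws.send(json.dumps({"type": "final", "text": text}))
        buffer = np.zeros(0, dtype=np.float32)

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```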
Sending WAV files back and forth is not something we can use for this application, because we want to get the transcripts in real time, and this is the way we accomplish it. One advantage we also get from doing it this way is that Jigasi receives the streams from each participant individually, so we already have each participant identified by their stream. You can of course always transcribe a recording, but if the recording has all the audio mixed together then you also need to separate it, and there can be mistakes in that. This way we don't run into that problem, because you can clearly identify whose audio it is and send that as part of the metadata that comes back on the WebSocket channel.

So, once we had all of these things glued together, we had to face reality, which is this kind of AI ops. You can't just throw this on a server and it runs and it's all great, because models are kind of big, and that can be a bit of a problem. Deploying this to actual production is a new headache you need to worry about, because, as I mentioned, if you're horizontally scaling you want the new servers that you put into the pool to boot fast, you want them to be able to start working right off the bat. And then you might need multiple container images, because, oh, this needs this version of CUDA: the version of CUDA that faster-whisper works with is different from the one we need to run, you know, Llama, for example, so you need to think about that. Currently the way we're doing it is we run OCI virtual machines, and inside each of them we run Nomad and Skynet in a container, and the VMs have the models baked into their image. This way the models are readily available, because the container images are already pretty big, and adding the models to them would make them unbearably big and you end up with timeouts and things you don't really like.

I'd like to show you how this looks a little bit, and since I'm doing pretty well on time, let's see, where is my mouse. So first I would like to show you the video that I couldn't show you before. It is this guy, this was the video of the bot that joins from the container. Well, maybe that's why it didn't play... ah, there we go: "Hello, I'm a friendly bot. This is just an example of what I can do. Check out other examples for more." So this looks very simple, and in a way it is, but the way you get here: this little robot that moves its mouth is a 3D model made with Blender, animated with WebGL, and the lips animate as the audio is being played back. The audio is played back directly in the browser; we used the service Play.ht, just as a test. Going forward, if we end up needing this, I would love to tinker with Microsoft's SpeechT5 or Mozilla's Coqui, I think it's called, because you can self-host those yourself. So using a browser as your runtime for bots is kind of nice, because it does allow you to do these sorts of things: it would be very hard to create a 3D robot that moves its mouth in something that's not really a browser, and you can animate it so easily with the same library you use for the rest of the stuff. So now I want to show you how this thing looks when you run it.
First, let's look at, for instance, the real-time demo of the transcriber. Let's see... oh, that's pretty good. Or is it? Are we running? That's the other one. So thank you, Răzvan, you can tell who built it. So as we connect and we wait for the interim results... hello, mister robot... let's look at what's going on. Are we running? I didn't do anything; let's reboot that, I don't like you. OK, I think we should be good now. Are you there yet? What mic is it using? Let's see... it is using my mic. Well, that's unfortunate, I'll try to show you later. This demo is part of the Skynet project itself, and the way it works is: it uses an audio worklet, it sends the audio frames to the Skynet server, and then it renders them here in the browser, and when it does, it does so in a close to real-time manner. The idea behind it is that you don't need the whole Jigasi and Jitsi shebang to test how it works, so it's a self-contained demo. I'm not sure if it's getting confused by the network or something else, because when I tried yesterday it was working fine, and in fact I do see data being received in this thing, but I don't see it being rendered here. We're going to try one more time: hello, are you there? And if not, we're going to move on. So we're going to move on. Yeah, well, you know, experimental technology, what can you do? That's what I was thinking, because this network is fun. Let's see if the other one works.

So here I'm now going to load Skynet. We built this thing to be modular, I think of it as a modular monolith, if you will: it's one fat thing, but you can disable parts of it. The reason behind it is that otherwise you end up duplicating things: you want to have common configuration, you want consistent logging, a lot of things that are shared, and this way you can select what you want to run. Do you want to run only the transcripts, do you also want to run the summaries? All of that is separated and you can decide whether to run it or not. So I put some text in here of a fictitious conversation that a guy named Tudor and I had; ChatGPT is very good at coming up with interesting conversations for you to see how they get summarized. We're not necessarily interested in the content, and I'm pretty sure you can't read it anyway, but I want to comment on the API a little bit. I'm going to copy this, in case this thing goes to shit, because we don't know. Now we send a POST with the data we want summarized to our summary endpoint; hopefully it got the response, yeah, and we get back an ID. Then we can query that ID on the job endpoint, and there we go: we get the result of our job, with a status of success, the type is summary, it took 11 seconds, and this is a Mac with an M1, so it's running accelerated on this very GPU. And then, yeah, "Tudor and Saúl are discussing the design of the backend of the web app", yada yada yada.

We built this API to follow a bit of a polling mechanism, because we found in practice that, even within our own workplace, there are these stingy proxies, and summarizing a long conversation can take a little bit of time; if the request lived for too long, some proxy in the middle would decide to cut it. And also, in order to make it resilient to things dying in the middle and another of these machines taking over and running the job again, we decided to build it this way. So the idea is: you post the summary, you get back an ID, and then with the job polling API you can get the job's result once it's done.
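In client code, the flow from that demo looks roughly like this; note that the endpoint paths, port, and JSON field names below are placeholders for illustration, so check the Skynet repository for the actual routes.

```python
# Rough sketch of the post-then-poll flow shown in the demo. The endpoint
# paths, port and JSON field names are placeholders, not Skynet's real API.
import time
import requests

SKYNET = "http://localhost:8000"  # wherever your Skynet instance runs

transcript = "Tudor: Let's use a RESTful backend...\nSaul: Agreed, and NoSQL for storage..."

# 1. Submit the text to be summarized; we immediately get back a job ID.
job_id = requests.post(f"{SKYNET}/summaries/v1/summary",
                       json={"text": transcript}).json()["id"]

# 2. Poll the job endpoint until a worker has picked it up and finished.
while True:
    job = requests.get(f"{SKYNET}/summaries/v1/job/{job_id}").json()
    if job["status"] in ("success", "error"):
        break
    time.sleep(2)  # short-lived requests keep picky proxies happy

print(job["status"], job.get("result"))
```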
I was also looking into using, for example, EventSource as another option, so you could have an ongoing stream; that's a possibility we'll probably look into adding, to make it a bit more palatable, but we found that in practice this has been working well. We have processed on the order of hundreds of thousands of little summaries within the company, and so far we're happy with how the architecture is working, so we can focus on what we're going to do next.

Some of the things we think we should do next: supporting multiple model runners, or backends, or however we want to call them. As I said, we started out hosting our own LLMs, but that has different trade-offs, and there are different reasons why people may or may not want to do that. This example was running on top of a 7-billion-parameter Llama 2, but someone may want to talk to OpenAI directly, maybe they're not worried about sending their data to OpenAI, or they have a different deal, or whatever; or maybe Claude works better for you; or maybe you don't want to host the model yourself but offload that responsibility to another company like OpenRouter, which hosts open source models. What does Skynet do in that case? Well, the nice thing is that it can shield you from all of those changes: we're thinking that we can change what engine you're actually using to run the inference, but you don't need to change your API, you don't need to modify any of your applications; if we switch to a model that works better, you're just going to get better results. That is the path we're trying to follow.

We're also going to work on integrating late-arrival summaries in Jitsi Meet. As I said, we focus on building blocks so we can then plug them together. Right now you can get a transcript with Jigasi and you can use Skynet to summarize it; we are going to build a way to glue that directly into Jitsi Meet, so you get the late-arrival summary straight in the meeting without needing this whole bot thing, which I'd be very happy if we went back to, but it may take a little while. Of course, more prompt tweaking: I think Rob did a good job talking about prompting, and the huge prompt he had; I'm not sure that ever ends, you always start with something and then keep tweaking it. It also depends on the context, right: it's different to summarize an article than to summarize a conversation, and you might want to take the model on a little journey, or steer it in a given direction, so it does a better job at what it needs to do. And lastly, just this past week there was a new release of LLaVA, the open source vision model, and in the context of a meeting, which is our main focus, it would make sense, for instance, to summarize slides, because they do help you capture the context of a meeting: if someone is sharing slides, they're probably important, probably the central part of the meeting. So if we could use LLaVA to get a glimpse of what's on the slides, plus the transcript, we think combining all of that would give us a pretty good view of what happened in that meeting, and this way, again, help our users, those who were not necessarily in the meeting.
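Coming back to the pluggable model runners for a second: the idea is that the HTTP API stays fixed while the engine behind it can change. A toy sketch of what that separation could look like is below; it is not Skynet's actual abstraction, the class and method names are made up, and it assumes llama-cpp-python for the local case and the OpenAI client for the hosted case.

```python
# Toy illustration of swapping model runners behind one summarize() call.
# Not Skynet's real abstraction; class and method names are made up.
from abc import ABC, abstractmethod

PROMPT = "Summarize this meeting transcript, focusing on decisions:\n\n{text}\n\nSummary:"

class SummaryRunner(ABC):
    @abstractmethod
    def summarize(self, transcript: str) -> str: ...

class LocalLlamaRunner(SummaryRunner):
    """Runs a local GGUF model via llama-cpp-python."""
    def __init__(self, model_path: str):
        from llama_cpp import Llama
        self.llm = Llama(model_path=model_path, n_ctx=4096)

    def summarize(self, transcript: str) -> str:
        out = self.llm(PROMPT.format(text=transcript), max_tokens=512)
        return out["choices"][0]["text"].strip()

class OpenAIRunner(SummaryRunner):
    """Sends the transcript to a hosted model instead (OpenAI, OpenRouter, ...)."""
    def __init__(self, model: str = "gpt-3.5-turbo", base_url: str | None = None):
        from openai import OpenAI
        self.client = OpenAI(base_url=base_url)
        self.model = model

    def summarize(self, transcript: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": PROMPT.format(text=transcript)}],
        )
        return resp.choices[0].message.content

# The HTTP layer only ever calls runner.summarize(...), so swapping engines
# does not change the API that applications see.
```

Since OpenRouter exposes an OpenAI-compatible API, pointing the same client at a different base URL is one way to offload hosting without touching anything above the runner.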
All of this we open sourced: the first Skynet version went up yesterday and is now available with, yeah, the full git history, so you'll see our mistakes and everything we learned along the way. We're not, you know, machine learning scientists or mad scientists here, we are learning ourselves, it's a brand new world. And I have two people to thank, and they actually came all the way here: Răzvan and Tudor, thank you for joining me on this interesting journey. I would definitely not be here telling you this if it wasn't for them, so I'm very thankful, and it was a very exciting project to get started, and now we can take it further. I'm not sure how I am on time, because this thing reset, but that's all I've got to tell you today. If there are any questions, I'm here.

Right there at the back: you mentioned that you are happy with the architecture of your setup right now; how happy are you with the actual summaries? Is there much hallucination going on? Would it be different from a human who is not necessarily fully aware of all of the context?

Good question. So Ralph was asking: we're happy with the architecture, but are we happy with the summaries? So, you know, we're trying to support two APIs, summaries and action items. Now, a nice thing about this particular application is that, if it fits in the context window, the LLM does not need to invent anything; everything it needs to know is within the data that you give it: this is the conversation, this is what was said. And it can come up with decent summaries at the small model sizes, but where smaller models sort of miss the mark is in capturing what's actually important in a conversation: what are the key points we should focus on when summarizing this thing. We're trying to find the right balance, how high up we go in model size, so that we get a better summary and also in a timely manner, because of course going bigger will mean slower inference and it may also mean higher cost. We didn't spend a huge amount of time improving that yet; we have something that works OK, but it can definitely be improved, so yes, we could be happier. We are focusing on making that better, and our next step in that direction is making this more available, for example by sharing it here and also by sharing it within the company, so that people can run their meetings and get them summarized, and this way we're going to tweak, improve, and rinse and repeat all the time.

Question there: have you guys looked into running the language model in the browser itself? There is, for example, a Rust implementation with WebGPU support, which means you can use the browser to interact with the language model; I was just wondering if there's been any experimentation with not doing it centrally on a server.

So the question is that there are implementations that allow you to run the model directly in the browser, with WebGPU, and whether we have thought of running it in the browser. No, we have not. That said, it's a very interesting idea, and it's one that actually fits what I said at the beginning about the bots, because the browser is such a competent beast, it can do everything. You could actually test it that way, because essentially what we had was a script that the browser ran in a container, and it had access to everything, so I think that would be a cool place to test it.
Basically, what our first test did was use JavaScript to tell the transcriber, hey, give me a transcript in real time, and then we would, in real time, talk to, let's say, OpenAI and get back results. So in that sense, running it locally would be completely attainable; very interesting, I think it would be worth giving it a go. One of the advantages of running a centralized thing, though, is the fault tolerance: the fact that you send a transcript from the dispatcher to store it in the cache, and even if whatever node is running inference crashes, another one will pick it up; we have a mechanism so that another node picks up work that has not been updated in a while. If you push that all the way to the client, well, first, everybody would need to run their own inference, so it feels a bit more wasteful overall if you sum it all up, and second, you have the problem that if your browser crashes you're kind of left out in the cold, whereas in the other case the server can take care of it and send it to you, you know, by email when it's ready. But I think it's a very interesting idea.

Right there, behind you: is it possible to connect to your partners' Webexes and Zooms and so on? There are companies doing that, right?

So the question is: sometimes you talk with external organizations, would you be able to put the bot there? The bot was the experiment that took us to where we are, but right now that effort is paused. That bot was a bot for Jitsi meetings specifically, so the idea is that the moment anyone from any organization joins one of your meetings, you would be able to transcribe it; but if you join their meetings, yeah, we don't have that. I know there are companies that sell proprietary products that integrate with different meeting providers and can do this, so there's a bit of a split: there are companies doing it externally, and then each vendor is sort of building it into their own product. I'm not sure what the right answer is. One of the hardest parts of working in this space has been filtering the signal from the noise, there's just so much shit going on, but I think these particular features really blend well with a meetings product, so we're making them built in: whenever anyone, wherever they come from, ends up in one of your meetings and you record it... in fact, this is a change that is coming in the next stable release: making a recording will also make a transcription, unless you opt out, because at the end of the day the recording contains everything that's also in the transcription; the transcription just makes it more palatable, so you can then operate on it. And that's sort of the direction we're taking at the moment.

Question here. Yes, that is a good question. So that is a problem we have not solved yet. We do have the capability of doing translations, but at the moment the only entity we use to do translations is Google Cloud, so that's sort of out of this picture, and there's also a limitation on the source language: you can get subtitles in the language that you want, but only as long as the source language is English. What we are looking at, which is a bit further in the future, is, well, first of all, identifying the languages on the fly, so the source language doesn't matter,
and then having the ability for the source language to not be fixed, so it can be different per participant: we can both be in a meeting, I can be talking to you in English, you can be replying in French, and I will see subtitles in English, you will see subtitles in French, and then we'll get the summary in, I don't know, Mandarin, because it's going to be fun. You get the idea, that's where we're going, but more work is needed in that area. And that's all; you can find me in the hall. Thank you very much.