Hi everybody, thanks for the wait, and sorry about the technical challenges, but now we can start. This is the last AI-related talk, and I think it's also a nice continuation of Saou's first presentation. We have Maxime, who has come all the way from Vancouver, probably the farthest participant here today, so please go ahead, thank you.

Okay, thank you, Lorenzo, and thanks everyone for attending my presentation. Let's get started. Today I'm going to talk a little bit about some of our newest work on the AI side of things. First, a little bit about myself: I was born and raised in Ukraine, I have a master's in physics and radiophysics from Kyiv State University, I've lived in Vancouver for about 20 years now, I'm a father of three, and I've been involved in SIP in various forms and projects since about 2003, so basically since the early days.

A little background on me and open source: I discovered FreeBSD at university, around my second year, got very curious, started exploring, eventually submitted patches, and got accepted as a developer. Then I found SIP Express Router, which later became OpenSIPS; I added some modules to it, and we use it extensively. I also created RTPproxy, which people who use SIP probably know about, and I've kept busy with various open source projects to this day.

Speaking of machine learning, I started around 2010. I read a really nice book, actually a free book, fairly old by now, but it surprisingly gives a pretty good basic introduction to neural networks. I got a little curious about it, and as time went on and the technology became more accessible, I trained a little toy model just to see if I could detect DTMF from G.729-encoded frames. G.729 frames are essentially just a bunch of floating-point coefficients, so you can feed them to a model and get a sort of DTMF detector that way; it was just a weekend fun project.

I also installed a little AI-powered ADAS called the Comma Two. It's an open source project that uses an AI model to drive your car; you can really install it yourself, you can hack it, and it has a good community, and I participated in some events there. I also played with DeepFaceLab, a deepfake framework that lets you swap faces and also involves some training, so that was interesting. More recently, in San Diego, we built from scratch a model that drives a little robot across the office, which was also quite fun. Lately I've also been playing with MuZero, first for the game of Digger, which I maintain the Linux port of; it was pretty fun, so I'm highly interested in it. This is the device that drives my car, and it runs an open source AI model inside that little chip.

Anyway, back to the main point of this talk. We're looking at this from our perspective: we have a lot of customers who route calls with us, so we're trying to build something that can scale to provider level, not a model itself but software which can run those models.
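As a brief aside on that DTMF-from-G.729 weekend experiment: the idea was simply to treat the per-frame codec coefficients as a feature vector and feed them to a small classifier. Here is a minimal sketch of that kind of toy model in PyTorch; the number of coefficients per frame and the class count are illustrative assumptions, not details from the talk.

```python
# Toy classifier over per-frame codec coefficient vectors, in the spirit of the
# DTMF-from-G.729 experiment.  The feature size (10 coefficients per frame) and
# the 17 classes (16 DTMF digits plus "no tone") are illustrative assumptions.
import torch
import torch.nn as nn

class DTMFNet(nn.Module):
    def __init__(self, n_coeffs=10, n_classes=17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_coeffs, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, frames):           # frames: (batch, n_coeffs)
        return self.net(frames)

model = DTMFNet()
logits = model(torch.randn(4, 10))       # 4 random "frames" stand in for decoded G.729 data
print(logits.argmax(dim=1))              # predicted DTMF class per frame
```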
So basically the idea behind this project is to figure out how to scale those models on reasonably priced hardware. The problem with the field right now is that we are in what I call the vacuum-tube era of machine learning: the hardware is very expensive. To run something like Mixtral you need something like 64 gigabytes of GPU memory, and that is not easy; you need several cards, or one very expensive one. Eventually all this stuff will get more affordable, so we're trying to work towards that goal.

Right now there are two major frameworks people use, PyTorch and TensorFlow, and they are pretty heavy as well; it takes hundreds of megabytes, if not gigabytes, to get anything running. But at least from my perspective, in a few years we'll see some changes: people are already working on alternative frameworks that are expected to be more lightweight and more flexible in terms of environment, because Python is not very easy to scale and integrate into something like a C application, although it is doable.

So we started this project earlier this year, and the original idea was, for starters, to implement text-to-speech. We took our SIP stack, which is conveniently Python based, so it was pretty much 20 lines of code to implement a SIP endpoint with an RTP generation thread, and then we tried to run those models. We started with this guy here, a four-gigabyte NVIDIA 1050 card, and I was able to run essentially one channel of text-to-speech on it. The obvious next step was how to scale it, but hold on, first a little bit about how this works.

Text-to-speech, at least with transformers, works like this: you take your text, you send it to your model, and it runs multiple iterations, pushing everything through the model, and then it spits out something like a mel spectrogram, which you put through a vocoder to get the audio out. The first problem we ran into is that the run had to cover the whole duration of the audio, so on my small GPU it would take quite a lot of time to actually produce any audio. So I had to modify the model; it's SpeechT5, not the latest, but one of the pretty good ones Microsoft released a few years ago. Well, not the model itself, but the pipeline: instead of processing the whole audio in one go, I made it spit out smaller chunks. The unfortunate problem that came with that is that the output started clicking, because the vocoder was probably not trained for this mode. I tried to retrain the vocoder, and that did not go very well and did not produce a good result, so I had to build a little post-vocoder stage which smooths things out and fixes those clicks. Now it sounds pretty good; I will maybe play some examples when I'm finished.

Then I tried to scale it. I got a more or less normal-size GPU, a 16-gigabyte card, and looking at the spec I expected maybe 10 or 20 times the performance, but to my surprise I only got about two times more, so with this model I can only run two real-time threads of TTS on the bigger card.
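To make the text-to-mel-to-vocoder flow above a bit more concrete, here is a minimal sketch using the stock, non-streaming SpeechT5 pipeline from the Hugging Face transformers package, not the modified chunked pipeline from the talk. The placeholder speaker embedding and the little crossfade helper are illustrative assumptions only, meant to show the kind of boundary smoothing a post-vocoder stage might perform.

```python
# Stock SpeechT5 flow: text -> model -> mel spectrogram -> HiFi-GAN vocoder -> audio.
# The streaming variant described in the talk required changing the generation loop
# itself; this sketch only shows the one-shot path plus an illustrative crossfade.
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello from the SIP side.", return_tensors="pt")
# A real deployment would use a proper 512-dim x-vector speaker embedding;
# zeros are just a placeholder so the sketch runs end to end.
speaker_embeddings = torch.zeros((1, 512))

# Returns a 1-D waveform tensor at 16 kHz; the mel spectrogram is produced
# internally and handed to the vocoder.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

def crossfade(chunks, overlap=240):
    """Join audio chunks with a short linear crossfade to hide boundary clicks.
    Purely illustrative; the talk's post-vocoder smoothing stage is more involved."""
    out = chunks[0]
    fade = torch.linspace(0.0, 1.0, overlap)
    for nxt in chunks[1:]:
        out = torch.cat([out[:-overlap],
                         out[-overlap:] * (1 - fade) + nxt[:overlap] * fade,
                         nxt[overlap:]])
    return out
```

The chunked version described in the talk has to yield partial mel spectrograms from inside the generation loop, which is exactly where the clicking problem came from and why some smoothing at the chunk boundaries is needed.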
So I started looking into why this was happening and how to improve it, because theoretically the card has much more performance. It turns out that to get good performance out of these models you need to use batch inference: instead of generating each prompt, each piece of audio, in its own session, you batch the prompts you need to voice and submit them to the model together, and it generates all those streams at roughly the same cost. The main problem with GPUs is that they are computationally very wide, but sending data to them is expensive; it's like operating a very fast device over a very slow network, so you need to load several jobs onto it at once.

I considered several ways to batch. My first idea was to vary the size of the batch: run the model continuously and, as sessions come and go, add or remove them from the running batch. Unfortunately that does not work with a sequence-to-sequence model, because internally it clocks itself, so you cannot add another session mid-run; they all have to run at the same time. So essentially you need to do something like this: you batch up a bunch of sentences that need to be generated, pump them through, and wait for all of them to finish; in the meantime you collect new requests, then batch those up and repeat. Obviously, if you have a pretty powerful GPU you can probably run a few of these loops, or if you have several GPUs you can improve latency by running on multiple of them. That part works.

The next thing I'm working on right now is the other direction: we need something like Whisper for speech-to-text, and that one already supports batching, so it should be pretty scalable as well. Right now, on that $300 card, I can run 50 sessions of text-to-speech in real time at the same time, which is a pretty good result, because this is all running locally: I'm not using any external services, and I can run it on a reasonably small device.

The last thing I played with recently is a framework called Ray. It lets you build a little cluster of machines, with the same or different GPUs and hardware, and distribute your training or inference work over them. What you see here is just me running, I don't know, probably 20 games of Digger at once; all of this is a model doing inference, just looking at the screen and deciding where the player should go, and training itself to win at some point. This is the training, improving a little bit, maybe. Anyway, the interesting part is that I figured out how to use Ray; it's a useful open source framework that you can use to scale up your AI project, so I'll probably use some of it to distribute our workloads.
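A rough sketch of the batch-and-wait loop described earlier in this section: gather whatever requests have queued up, push them through the model as one batch, hand the results back, and repeat. The queue layout, the synthesize_batch() stub, and the callbacks are hypothetical placeholders, not the project's actual code.

```python
# Sketch of the "collect, batch, generate, repeat" loop for TTS requests.
# In the real system this sits behind the SIP/RTP front end.
import queue
import torch

requests = queue.Queue()   # items: (text, callback)

def synthesize_batch(texts):
    # Placeholder: a real implementation would tokenize all texts with padding
    # and run a single batched generate() call on the GPU, one waveform per text.
    return [torch.zeros(16000) for _ in texts]

def run_one_batch(max_batch=16):
    batch = [requests.get()]                  # block until at least one request
    while len(batch) < max_batch:
        try:
            batch.append(requests.get_nowait())
        except queue.Empty:
            break                             # take whatever has queued up so far
    audios = synthesize_batch([text for text, _ in batch])   # one GPU round trip
    for (_, callback), audio in zip(batch, audios):
        callback(audio)                       # hand audio back to the RTP sender

# Example: three sessions queue sentences, one batched pass serves them all.
for i in range(3):
    requests.put((f"sentence {i}", lambda audio, i=i: print(i, audio.shape)))
run_one_batch()
```

In practice you would run one such loop per GPU, or a few of them on a powerful card as mentioned above, trading throughput against latency.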
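For the Ray part, here is a minimal sketch of how inference work might be spread over a small cluster using Ray actors; the TTSWorker class and its synthesize() method are illustrative stand-ins, not the actual setup used for the Digger experiment.

```python
# Minimal sketch of distributing inference workers with Ray actors.
import ray

ray.init()  # or ray.init(address="auto") to join an existing Ray cluster

@ray.remote(num_gpus=1)     # drop num_gpus=1 to try this on a CPU-only machine
class TTSWorker:
    def __init__(self):
        # Load the model once per worker/GPU here.
        pass

    def synthesize(self, text):
        # Run inference and return audio; empty bytes as a placeholder result.
        return b""

workers = [TTSWorker.remote() for _ in range(2)]
# Round-robin requests across workers; ray.get blocks until all results are ready.
futures = [workers[i % len(workers)].synthesize.remote(f"sentence {i}") for i in range(8)]
audios = ray.get(futures)
```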
There are some links here, and I guess I have a few minutes for questions. Yeah, we can do video technically, because as soon as we have the whole mechanism set up, we can do video as well. Okay, the question is what kind of models we can run. Right now, with the existing code, I'm using PyTorch, but I also played with TinyGrad, so I might use some of that as well because, as I said, it's very lightweight; the whole goal of the guy who wrote it is to keep a usable framework within about 5,000 lines of Python code, which is quite interesting from that perspective. But it's not really limited, it could use anything, right? I think we have time for another question, no? If there are no other questions, please give a round of applause.