Hi everybody, thanks for the wait, and sorry about the technical challenges, but now we can start. This is the last AI-related talk, and I think it's also a nice continuation of Saou's first presentation. We have Maxime, who has come all the way from Vancouver, probably the farthest participant here today, so please go ahead, thank you.

Okay, thank you, Lorenzo, and thanks everyone for attending my presentation. Let's get started. Today I'm going to talk a little bit about some of our newest work on the AI side of things. First, a little bit about myself: I was born and raised in Ukraine, I have a master's in physics and radiophysics from Kyiv State University, I've lived in Vancouver for about 20 years now, I'm a father of three, and I've been involved in SIP in various forms and projects since about 2003, so basically since the early days.

A little background on me and open source: I discovered FreeBSD at university, around my second year, got very curious, started exploring, eventually submitted patches, and got accepted as a developer. Then I found SIP Express Router, which later became OpenSIPS; I added some modules to it, and we use it extensively. I also created RTPproxy, which people who use SIP probably know about, and I've kept busy with various open source projects to this day.

Speaking of machine learning, I started around 2010. I read a really nice book, actually a free book, fairly old by now, but it surprisingly gives a pretty good basic introduction to neural networks. I got a little curious about it, and as time went on and the technology became more accessible, I trained a little toy model just to see if I could detect DTMF from G.729-encoded frames. G.729 frames are essentially just a bunch of floating-point coefficients, so you can feed them to a model and get a sort of DTMF detector that way; it was just a weekend fun project.

I also installed a little AI-powered ADAS called the Comma Two. It's an open source project that uses an AI model to drive your car; you can really install it yourself, you can hack it, and it has a good community, and I participated in some events there. I also played with DeepFaceLab, a deepfake framework that lets you swap faces and also involves some training, so that was interesting. More recently, in San Diego, we built from scratch a model that drives a little robot across the office, which was also quite fun. Lately I've also been playing with MuZero, first for the game of Digger, which I maintain the Linux port of; it was pretty fun, so I'm highly interested in it. This is the device that drives my car, and it runs an open source AI model inside that little chip.

Anyway, back to the main point of this talk. We're looking at this from our perspective: we have a lot of customers who route calls with us, so we're trying to build something that can scale to provider level, not a model itself but software which can run those models.
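As a brief aside on that DTMF-from-G.729 weekend experiment: the idea was simply to treat the per-frame codec coefficients as a feature vector and feed them to a small classifier. Here is a minimal sketch of that kind of toy model in PyTorch; the number of coefficients per frame and the class count are illustrative assumptions, not details from the talk.

```python
# Toy classifier over per-frame codec coefficient vectors, in the spirit of the
# DTMF-from-G.729 experiment.  The feature size (10 coefficients per frame) and
# the 17 classes (16 DTMF digits plus "no tone") are illustrative assumptions.
import torch
import torch.nn as nn

class DTMFNet(nn.Module):
    def __init__(self, n_coeffs=10, n_classes=17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_coeffs, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, frames):           # frames: (batch, n_coeffs)
        return self.net(frames)

model = DTMFNet()
logits = model(torch.randn(4, 10))       # 4 random "frames" stand in for decoded G.729 data
print(logits.argmax(dim=1))              # predicted DTMF class per frame
```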
So basically the idea behind this project is to figure out how to scale those models on reasonably priced hardware. The problem with the field right now is that we are in what I call the vacuum-tube era of machine learning: the hardware is very expensive. To run something like Mixtral you need something like 64 gigabytes of GPU memory, and that is not easy; you need several cards, or one very expensive one. Eventually all this stuff will get more affordable, so we're trying to work towards that goal.

Right now there are two major frameworks people use, PyTorch and TensorFlow, and they are pretty heavy as well; it takes hundreds of megabytes, if not gigabytes, to get anything running. But at least from my perspective, in a few years we'll see some changes: people are already working on alternative frameworks that are expected to be more lightweight and more flexible in terms of environment, because Python is not very easy to scale and integrate into something like a C application, although it is doable.

So we started this project earlier this year, and the original idea was, for starters, to implement text-to-speech. We took our SIP stack, which is conveniently Python based, so it was pretty much 20 lines of code to implement a SIP endpoint with an RTP generation thread, and then we tried to run those models. We started with this guy here, a four-gigabyte NVIDIA 1050 card, and I was able to run essentially one channel of text-to-speech on it. The obvious next step was how to scale it, but hold on, first a little bit about how this works.

Text-to-speech, at least with transformers, works like this: you take your text, you send it to your model, and it runs multiple iterations, pushing everything through the model, and then it spits out something like a mel spectrogram, which you put through a vocoder to get the audio out. The first problem we ran into is that the run had to cover the whole duration of the audio, so on my small GPU it would take quite a lot of time to actually produce any audio. So I had to modify the model; it's SpeechT5, not the latest, but one of the pretty good ones Microsoft released a few years ago. Well, not the model itself, but the pipeline: instead of processing the whole audio in one go, I made it spit out smaller chunks. The unfortunate problem that came with that is that the output started clicking, because the vocoder was probably not trained for this mode. I tried to retrain the vocoder, and that did not go very well and did not produce a good result, so I had to build a little post-vocoder stage which smooths things out and fixes those clicks. Now it sounds pretty good; I will maybe play some examples when I'm finished.

Then I tried to scale it. I got a more or less normal-size GPU, a 16-gigabyte card, and looking at the spec I expected maybe 10 or 20 times the performance, but to my surprise I only got about two times more, so with this model I can only run two real-time threads of TTS on the bigger card.
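To make the text-to-mel-to-vocoder flow above a bit more concrete, here is a minimal sketch using the stock, non-streaming SpeechT5 pipeline from the Hugging Face transformers package, not the modified chunked pipeline from the talk. The placeholder speaker embedding and the little crossfade helper are illustrative assumptions only, meant to show the kind of boundary smoothing a post-vocoder stage might perform.

```python
# Stock SpeechT5 flow: text -> model -> mel spectrogram -> HiFi-GAN vocoder -> audio.
# The streaming variant described in the talk required changing the generation loop
# itself; this sketch only shows the one-shot path plus an illustrative crossfade.
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello from the SIP side.", return_tensors="pt")
# A real deployment would use a proper 512-dim x-vector speaker embedding;
# zeros are just a placeholder so the sketch runs end to end.
speaker_embeddings = torch.zeros((1, 512))

# Returns a 1-D waveform tensor at 16 kHz; the mel spectrogram is produced
# internally and handed to the vocoder.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

def crossfade(chunks, overlap=240):
    """Join audio chunks with a short linear crossfade to hide boundary clicks.
    Purely illustrative; the talk's post-vocoder smoothing stage is more involved."""
    out = chunks[0]
    fade = torch.linspace(0.0, 1.0, overlap)
    for nxt in chunks[1:]:
        out = torch.cat([out[:-overlap],
                         out[-overlap:] * (1 - fade) + nxt[:overlap] * fade,
                         nxt[overlap:]])
    return out
```

The chunked version described in the talk has to yield partial mel spectrograms from inside the generation loop, which is exactly where the clicking problem came from and why some smoothing at the chunk boundaries is needed.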
So I started looking into why this was happening and how to improve it, because theoretically the card has much more performance. It turns out that to get good performance out of these models you need to use batch inference: instead of generating each prompt, each piece of audio, in its own session, you batch the prompts you need to voice and submit them to the model together, and it generates all those streams at roughly the same cost. The main problem with GPUs is that they are computationally very wide, but sending data to them is expensive; it's like operating a very fast device over a very slow network, so you need to load several jobs onto it at once.

I considered several ways to batch. My first idea was to vary the size of the batch: run the model continuously and, as sessions come and go, add or remove them from the running batch. Unfortunately that does not work with a sequence-to-sequence model, because internally it clocks itself, so you cannot add another session mid-run; they all have to run at the same time. So essentially you need to do something like this: you batch up a bunch of sentences that need to be generated, pump them through, and wait for all of them to finish; in the meantime you collect new requests, then batch those up and repeat. Obviously, if you have a pretty powerful GPU you can probably run a few of these loops, or if you have several GPUs you can improve latency by running on multiple of them. That part works.

The next thing I'm working on right now is the other direction: we need something like Whisper for speech-to-text, and that one already supports batching, so it should be pretty scalable as well. Right now, on that $300 card, I can run 50 sessions of text-to-speech in real time at the same time, which is a pretty good result, because this is all running locally: I'm not using any external services, and I can run it on a reasonably small device.

The last thing I played with recently is a framework called Ray. It lets you build a little cluster of machines, with the same or different GPUs and hardware, and distribute your training or inference work over them. What you see here is just me running, I don't know, probably 20 games of Digger at once; all of this is a model doing inference, just looking at the screen and deciding where the player should go, and training itself to win at some point. This is the training, improving a little bit, maybe. Anyway, the interesting part is that I figured out how to use Ray; it's a useful open source framework that you can use to scale up your AI project, so I'll probably use some of it to distribute our workloads.
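A rough sketch of the batch-and-wait loop described earlier in this section: gather whatever requests have queued up, push them through the model as one batch, hand the results back, and repeat. The queue layout, the synthesize_batch() stub, and the callbacks are hypothetical placeholders, not the project's actual code.

```python
# Sketch of the "collect, batch, generate, repeat" loop for TTS requests.
# In the real system this sits behind the SIP/RTP front end.
import queue
import torch

requests = queue.Queue()   # items: (text, callback)

def synthesize_batch(texts):
    # Placeholder: a real implementation would tokenize all texts with padding
    # and run a single batched generate() call on the GPU, one waveform per text.
    return [torch.zeros(16000) for _ in texts]

def run_one_batch(max_batch=16):
    batch = [requests.get()]                  # block until at least one request
    while len(batch) < max_batch:
        try:
            batch.append(requests.get_nowait())
        except queue.Empty:
            break                             # take whatever has queued up so far
    audios = synthesize_batch([text for text, _ in batch])   # one GPU round trip
    for (_, callback), audio in zip(batch, audios):
        callback(audio)                       # hand audio back to the RTP sender

# Example: three sessions queue sentences, one batched pass serves them all.
for i in range(3):
    requests.put((f"sentence {i}", lambda audio, i=i: print(i, audio.shape)))
run_one_batch()
```

In practice you would run one such loop per GPU, or a few of them on a powerful card as mentioned above, trading throughput against latency.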
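For the Ray part, here is a minimal sketch of how inference work might be spread over a small cluster using Ray actors; the TTSWorker class and its synthesize() method are illustrative stand-ins, not the actual setup used for the Digger experiment.

```python
# Minimal sketch of distributing inference workers with Ray actors.
import ray

ray.init()  # or ray.init(address="auto") to join an existing Ray cluster

@ray.remote(num_gpus=1)     # drop num_gpus=1 to try this on a CPU-only machine
class TTSWorker:
    def __init__(self):
        # Load the model once per worker/GPU here.
        pass

    def synthesize(self, text):
        # Run inference and return audio; empty bytes as a placeholder result.
        return b""

workers = [TTSWorker.remote() for _ in range(2)]
# Round-robin requests across workers; ray.get blocks until all results are ready.
futures = [workers[i % len(workers)].synthesize.remote(f"sentence {i}") for i in range(8)]
audios = ray.get(futures)
```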
There are some links here, and I guess I have a few minutes for questions. Yeah, we can do video technically, because as soon as we have the whole mechanism set up, we can do video as well. Okay, the question is what kind of models we can run. Right now, with the existing code, I'm using PyTorch, but I also played with TinyGrad, so I might use some of that as well because, as I said, it's very lightweight; the whole goal of the guy who wrote it is to keep a usable framework within about 5,000 lines of Python code, which is quite interesting from that perspective. But it's not really limited, it could use anything, right? I think we have time for another question, no? If there are no other questions, please give a round of applause.