So, hello all.
Thank you for being here till the end. I'm a first-time
presenter, so if I get a bit jittery, I'm sorry.
So, the topic that I'm taking is open source AI for localization and accessibility.
Well, the main idea is to use open source AI tools to elevate the
content that you are actually receiving and to enhance the localizations that you can benefit from.
So, okay, so sorry for that.
So, essentially coming back to what I was saying, we can use open source AI to enhance the
subtitling potential and to have voice-to-voice conversion of a lot of videos and audio content
in addition to the text-to-text conversions that we already have.
So, you might be wondering what is the actual problem?
So, well, I've seen that most of the time when I'm trying to access a doc,
I'll be having this language issue.
For example, I was working with the technology in the augmented reality realm
and all the documentation was actually in Japanese.
I tried reaching out to the developers over there, but unfortunately the language barrier
still hit hard. Another case: the same guys actually had a few tutorials available
on YouTube, but it was the same problem.
I don't speak Japanese and I'm actually unable to convey my ideas to them in the language that I know.
So, this was actually an issue that we were all facing.
And then there's the, so you might be thinking, why can't you use something like Google Translate?
Well, the obvious case is actually data safety.
If I'm working on a cutting-edge technology, I don't want that to be leaked
to other people without my consent.
Or say I want to release that to the public myself.
I don't want someone else to just take my data and then release it without my approval.
And yeah, when it comes to usability, that's another factor and financials,
when I was actually working as an independent developer,
financials, the financial side was a big issue for me.
I didn't have the money to bankroll something like $1,000 into a translation subscription for every month.
So, let's actually elaborate on that with a bit more of a user story.
Suppose I'm a research student and you can actually take the case of augmented reality right now.
I'm trying to work in this very niche case and people actually know about it,
but I can't really converse with them.
I'll be having a few issues like that.
And one of the main problems that I'm facing is actually the lack of resources.
And there could be resources, but they are present in another language that I'm not really
able to understand or converse in.
So I would actually require the resources to be converted to a language that I speak.
And I want the conversions to have a decent level of accuracy.
So that's it.
And for one of the solutions, I can obviously ask people who actually speak the language
and request their services, but it is expensive and it is time-consuming.
So a stopgap would be to use an AI solution.
A similar case would be in the case of the docs manager.
So before I go there, what do we actually have right now?
We have a few text-to-text conversion engines like Indonesia or if you're from India,
there are about 128 languages that are actually spoken and we actually have coverage for two.
So if they are from such cases, you will require more coverage and you will require more assistance.
So to sum up with what I was actually talking about so far, we don't have an all-in-one solution
where we can actually use all these, which can actually fulfill all these requirements,
be it from text-to-text, text-to-video or the other way around.
So that is some of the things that I would like to talk about.
And yeah, if we are actually looking for an open source library, I would like to have one that
focuses on the audio and the video translation side because enhancement and accessibility
for the audio and the video side is what is actually helping us to improve the language models
and is helping us to reach a wider audience.
So solutions which can actually help me with the automatic translations can be a good choice.
So just to recap it again, I think I'll just skip this part.
So what I actually require is an open source model that can be executed locally
and actually gives me a decent accuracy or a decent amount of execution time
and helps me enhance the quality of the content that I'm delivering.
So the one to watch right now is called SeamlessM4T. It's a model from Meta, and
it's under the MIT and the Creative Commons licenses.
So it actually gives you speech-to-speech translation,
speech-to-text, text-to-speech, text-to-text and automatic speech recognition.
So that's a pretty good one.
And as we all have been trying to highlight for the last 10 minutes,
we require that because we need an all-in-one solution.
And I was just, I think I highlighted about all these parts like the super informative video
or the precarious conversations that you are having.
Like if I'm trying to have a conversation with, can I have a name, sir?
Samath. So if I have to have a conversation with Samath, and for a moment let's just
think Samath doesn't speak English, he speaks French, I speak English, and it's hard.
So that's the conversation or that's the moment that I would require and that's the moment where I
require a tool like this. But what if it's not French, but some language that's actually not well
documented? So say I'm just going to go with Swahili. So yeah.
Okay. Okay. That was a random guess. So yeah. If I'm going with a language like Swahili and I'm
speaking in English or if I'm speaking in my mother tongue called Malayalam, I'm just going to
sit here and I'm trying to explain the concept to him, but he doesn't understand what I'm saying.
I don't understand what he's saying. And that is the moment where we require such a tool.
He might have cutting edge research in the domain, but unfortunately,
it's only accessible to the native speakers.
So as I said, that is where the benefits come in. It's with the universalization of
the resources. Anyone from a large org, a creator, a student, a developer, and basically anyone
who can make use of these technologies and come out with this. As for the technology that I
mentioned, SeamlessM4T is actually under an MIT and a Creative Commons license. So it can only be used
for non-commercial purposes. And I believe it should remain like that. So that's the summary. But before we
go, I think there's something else that I can show. So, okay. Excuse me. Okay. Yeah.
Okay. So just suppose I'm this really famous creator and I hope the mic has an arm.
Okay. Sorry. That just played out in a way. So suppose I'm actually a really famous creator
and I'm doing something about AI. So I want this content. I only speak English and I want this
content to reach you guys everywhere. So let's watch the individual video first.
Okay. So let's pause it over there. And how good would it be if you could actually hear the
same thing in another language? Okay. Wait a second.
So how many of you guys over here know Spanish? Okay. Just tell me whether this even remotely
makes sense to you.
So that's it. Yeah. I'm going to take, yeah. Yeah. It sounds good, right? Yeah. I'm just going to
take it from those two guys. And yeah. The same thing can be actually in French, in Dutch,
in Italian, and in Hindi. I do speak a bit of Hindi, so I can validate that. So yeah.
So would you guys like to hear it in another language? It's going to be the same audio, but
if, okay, for the French speaking guys, I'm just going to play this as well.
So you can see the text, the audio and all this. The model that I just mentioned, called
SeamlessM4T, is actually doing all of this in a single model. And I feel it's about time we actually have
a few of these solutions coming into open source as well. That's a totally open source model.
Like this is what I dream of. And maybe I'll be back next year with something that's remotely
close to this. So thank you. Yeah. So any questions?
So the model we're talking about, it's not open source, but it's like usable for non-commercial.
Or is it open source? The model's open source, but you cannot use it for commercial use.
It's having the MIT and the Creative Commons license. Yeah.
Is the training data made public? No, the training data is not public.
It's the classic Facebook thing. Yeah.
So it's speech recognition because it's a problem, the touch problem and now it must for use.
Yeah. It runs on speech recognition, and you're converting it to a
base language and from there you're converting it to the target language. So suppose I'm speaking
in Hindi and you want the conversation to be in French, it's going to convert the
Hindi speech to English and from English it's going to convert it to French.
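The two-hop flow described in this answer (source language to a base language, then base language to the target) can be sketched in a few lines. This is only an illustration of the idea as the speaker describes it, not the model's actual internals. The function name `translate_via_pivot`, the `engines` lookup table, and the toy one-word dictionaries are all hypothetical stand-ins for real translation back ends.

```python
from typing import Callable, Dict, Tuple

# A translator is any function that maps text in one language to text in another.
Translator = Callable[[str], str]


def translate_via_pivot(
    text: str,
    src: str,
    tgt: str,
    engines: Dict[Tuple[str, str], Translator],
    pivot: str = "eng",
) -> str:
    """Translate src -> tgt, going through the pivot language when needed."""
    if src == tgt:
        return text
    if src == pivot or tgt == pivot:
        # Only one hop is needed when either side already is the pivot.
        return engines[(src, tgt)](text)
    intermediate = engines[(src, pivot)](text)   # e.g. Hindi -> English
    return engines[(pivot, tgt)](intermediate)   # e.g. English -> French


# Toy stand-ins for real translation engines (assumption: one-word dictionaries).
HIN_TO_ENG = {"namaste": "hello"}
ENG_TO_FRA = {"hello": "bonjour"}

engines: Dict[Tuple[str, str], Translator] = {
    ("hin", "eng"): lambda s: HIN_TO_ENG.get(s, s),
    ("eng", "fra"): lambda s: ENG_TO_FRA.get(s, s),
}

print(translate_via_pivot("namaste", "hin", "fra", engines))  # prints "bonjour"
```

One consequence of this design is that any error made in the first hop is carried into the second, which is one reason accuracy on less-documented languages matters so much.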
Can you talk about the latency between the speaker speaking and understanding all this, and also
the way. Okay. There are models in the same family that can actually offer latency
near 100 milliseconds, but that's a lightweight model and you won't have the accuracy of the
heavyweight one. So we actually have to trade off between accuracy and speed. So the heavier
the model is, or the more parameters it has, the more accurate results you'll be getting, but
unfortunately the speed will be coming down. And I saw a couple of other hands. Yeah.
The model is a 2.7 billion parameter one, and if you're talking about the actual size of the
models in gigabyte, it's about eight gigabytes. Yeah. So just to clarify a few things that have
already kind of been answered. The licensing is MIT and Creative Commons? Not "or",
"and", correct? Yeah. Oh, okay. And then the other thing is when you are translating from one language
to another, you said that some of the models have a latency of about 100 milliseconds.
And those audio samples you showed us, how long did they take to run?
It took me like three seconds, but I'm running it on a Colab T4 GPU. So
it depends on the GPU. Right. Yeah, I'm sorry. Sure.
So
the proposed solution that I use over here is the same as with LLMs.
So you can actually use RAG, or you can just split the text up into smaller
bits and then combine the results. So for example, this model actually performs
better if you have something like 20 seconds of audio. So what I do is, if I have one minute
of audio, I'm going to split it into three chunks of 20 seconds each and it's going to go on from there.
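The chunking approach described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's actual code: the audio is assumed to be a plain sequence of samples, `translate_chunk` is a hypothetical callback standing in for whatever model call handles one chunk, and the 20-second default comes from the figure mentioned in the talk.

```python
from typing import Callable, List, Sequence


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int,
    chunk_seconds: float = 20.0,
) -> List[Sequence[float]]:
    """Cut audio samples into consecutive chunks of at most chunk_seconds each."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * chunk_seconds must be at least one sample")
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]


def translate_long_audio(
    samples: Sequence[float],
    sample_rate: int,
    translate_chunk: Callable[[Sequence[float]], str],
    chunk_seconds: float = 20.0,
) -> str:
    """Translate each chunk separately and join the results in order."""
    pieces = [translate_chunk(chunk)
              for chunk in split_into_chunks(samples, sample_rate, chunk_seconds)]
    return " ".join(piece for piece in pieces if piece)


if __name__ == "__main__":
    # One minute of silent "audio" at a toy sample rate of 100 samples per second.
    one_minute = [0.0] * (60 * 100)
    chunks = split_into_chunks(one_minute, sample_rate=100)
    print(len(chunks))  # prints 3
```

Fixed-length cuts can land in the middle of a word. Cutting at pauses instead is a common refinement, but it is outside what the talk describes.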
Sorry, I don't have a solution to that.
Yeah. And I think you have a comment.
Yeah, I think.
Okay, sure.
Okay.
So I'm going to go ahead and ask you a question.
Okay.
You can run it locally, but it's going to depend on your GPU speed. That's it.
Is it possible to have some practical information, a link and so on?
Oh, sure.
Okay. So yeah, I just closed that a bit early, one moment.
Okay.
You can hit me up over here.
So my name is Nevin Koshy Daniel and that's the same for the Gmail ID. You can just text me over here.
And if you're looking for the particular model, you can get it from Seamless Communication on the
Facebook Research page. Yeah. And someone did ask me a question about latency, right? So if you
guys have a moment, then we can try something called SeamlessExpressive. I am not 100% sure how
this will work, but okay, let's try the demo. Can you come over here please?
Tate? Sure.
Let's have a try with this.
Do I have to say something in English?
Yeah, I think so.
Or do I have to speak French?
Let's try it with French, I guess. No, I can't speak French. It's not going to work.
Okay.
Yeah, let's speak English and let's translate that to French.
French? Sure.
Oh, you have to allow audio permissions.
Yeah.
Don't worry, I'll just re-bump this out.
Yeah, yeah, yeah, it's okay.
Yeah, right.
Can I use Linux and run this on my server?
I hope that works.
So yeah, it's going to take some time to generate the translation.
Oh, wow.
It's pretty quick.
And...
I don't speak French, so for someone who knows the language, is it correct?
It's correct.
That's very cool.
Okay, is the model doing both the translation and the text-to-speech?
Yes.
This model can do all the four things, text-to-text, text-to-speech, speech-to-text, and text-to-another language conversion.
So all the four things, yeah, and automatic speech recognition as well.
So the five things.
To give you guys this...
I am.
Okay.
That's pretty much everything, and thank you.
Thank you.
If you're gonna...
Okay.
Oh.
Yeah, no, I can't see it very well either.