[00:00.000 --> 00:07.440] So let's dive right into it. [00:07.440 --> 00:11.840] LibreTranslate is software that's a bit like Google Translate, [00:11.840 --> 00:15.200] but open-source. It is licensed under the AGPLv3, [00:15.200 --> 00:17.040] so it's strongly open-source. [00:17.040 --> 00:19.960] In fact, we're going to keep it that way forever, [00:19.960 --> 00:22.840] and it lets you do natural language translation. [00:22.840 --> 00:25.440] It runs on your computer. [00:25.440 --> 00:27.240] This is one of the goals of the project. [00:27.240 --> 00:29.440] There are several other projects in [00:29.440 --> 00:31.520] the open-source realm that have aimed to [00:31.520 --> 00:34.800] provide natural language translation, [00:34.800 --> 00:37.040] except that sometimes they require [00:37.040 --> 00:39.160] very large servers or a lot of memory, [00:39.160 --> 00:41.440] and our goal is to have this running on [00:41.440 --> 00:42.840] something as small as a Raspberry Pi. [00:42.840 --> 00:45.560] So that is very important to the project. [00:45.560 --> 00:49.440] The program has lots of clients and integrations. [00:49.440 --> 00:52.920] We'll cover some of those in the upcoming slides. [00:52.920 --> 00:55.840] Like many projects, it's available on GitHub, [00:55.840 --> 00:58.080] so you can go and check it out. [00:58.080 --> 01:00.640] But today we're going to give you a brief overview of [01:00.640 --> 01:03.920] how to get started and start using it. [01:03.920 --> 01:07.600] Let's talk briefly about why we decided to create it, [01:07.600 --> 01:10.400] and why there was a need for the project to exist. [01:10.400 --> 01:13.560] We could not find a project that had [01:13.560 --> 01:16.400] all the features that LibreTranslate can offer. 
[01:16.400 --> 01:19.480] These are: a simple and open REST API that you can [01:19.480 --> 01:21.960] use to programmatically do translations, [01:21.960 --> 01:24.000] so it helps automate part of [01:24.000 --> 01:27.760] the translation work that we need for our work. [01:27.760 --> 01:32.600] It offers pre-trained and openly licensed language models. [01:32.600 --> 01:35.720] There are other projects that do machine translation, [01:35.720 --> 01:37.600] but again, sometimes they do not make [01:37.600 --> 01:41.880] the AI models for the translation openly available, [01:41.880 --> 01:43.200] and we have that. [01:43.200 --> 01:45.920] Again, it runs on commodity hardware, [01:45.920 --> 01:48.320] so it does not require server-scale power [01:48.320 --> 01:50.120] to make the software work, [01:50.120 --> 01:53.920] and finally, it is very easy to get started, as you will see. [01:53.920 --> 01:56.600] So talking about getting started, [01:56.600 --> 01:58.600] there are primarily two ways that you can [01:58.600 --> 02:00.920] get LibreTranslate to work on your computer. [02:00.920 --> 02:03.200] The first one is if you have Python, [02:03.200 --> 02:06.600] you can simply run the command [02:06.600 --> 02:11.360] pip install libretranslate, and afterwards you run the program, and that's it. [02:11.360 --> 02:14.840] If you have Docker, which many developers like to use, [02:14.840 --> 02:16.080] we also have an option for that. [02:16.080 --> 02:20.120] We pre-build images for LibreTranslate that you can use, [02:20.120 --> 02:24.680] and we have a convenient script that will run it for you and take [02:24.680 --> 02:27.680] care of a few details that let you have things like [02:27.680 --> 02:31.640] persistent volumes for downloading language models and some technical stuff. [02:31.640 --> 02:34.800] But to get started, all you need to do is go on GitHub, [02:34.800 --> 02:39.240] get a copy of our source code, and press Run. 
[02:39.240 --> 02:43.920] We also have scripts for Windows and Mac OS and Linux, [02:43.920 --> 02:46.560] so we try to support all major platforms. [02:46.560 --> 02:49.240] We're hoping to get other platforms in there as well, [02:49.240 --> 02:53.960] so things like FreeBSD and others are on the to-do list. [02:53.960 --> 02:55.680] So we'll get there. [02:55.680 --> 02:58.280] So let's actually try to run it, [02:58.280 --> 03:01.360] and I'm always a little scared of doing live demos, [03:01.360 --> 03:03.920] but bear with me, we're going to try it. [03:03.920 --> 03:05.800] What could go wrong? [03:05.800 --> 03:08.920] So here is a console. [03:08.920 --> 03:11.240] I'm going to quickly activate [03:11.240 --> 03:14.760] a Python environment where I have LibreTranslate already installed, [03:14.760 --> 03:16.720] and I'm going to try to run it. [03:16.720 --> 03:18.640] On Mac OS, I have to specify [03:18.640 --> 03:21.840] a different port than the default of 5,000. [03:21.840 --> 03:23.200] I'm going to try to run it. [03:23.200 --> 03:25.000] Okay, it seems to be working. [03:25.000 --> 03:29.400] So I'm going to jump back right into Chrome, [03:29.400 --> 03:32.280] and if I refresh the page, [03:32.280 --> 03:35.880] you will be presented with a friendly user interface that you can use [03:35.880 --> 03:38.920] to test the system and even use it. [03:38.920 --> 03:43.800] It allows programmatic access to the software via an API, [03:43.800 --> 03:45.280] but you can also use it as [03:45.280 --> 03:47.680] an alternative to Google Translate if you want to. [03:47.680 --> 03:51.720] So we're going to try to say something. [03:51.720 --> 03:55.720] Okay, so obviously English to English is not going to be helpful. [03:55.720 --> 03:57.360] How about French? [03:57.360 --> 04:00.760] Okay. So we translated Hello World, [04:00.760 --> 04:04.880] Bonjour Le Monde, and it worked. [04:04.880 --> 04:07.760] But that's not too impressive, right? 
[04:07.760 --> 04:09.320] I'm like, okay, Hello World. [04:09.320 --> 04:14.320] Let's try to look at something a little more realistic. [04:16.640 --> 04:19.040] Before looking at something more realistic, [04:19.040 --> 04:22.960] you can also, of course, use it from an API. [04:22.960 --> 04:26.360] In this case, I can invoke a curl command and [04:26.360 --> 04:29.520] ask LibreTranslate to perform a translation. [04:29.520 --> 04:32.840] I want it to automatically detect the language [04:32.840 --> 04:34.720] that the translation is coming from, [04:34.720 --> 04:38.080] and finally, I want to translate [04:38.080 --> 04:42.800] into the target language. [04:42.800 --> 04:44.800] I get a JSON response. [04:44.800 --> 04:47.240] Everything in the API is JSON-based. [04:47.240 --> 04:50.320] So that will be familiar to many developers. [04:50.320 --> 04:53.360] But let's look at a more realistic example. [04:53.360 --> 04:55.920] In this case, we have a longer piece of text, [04:55.920 --> 04:59.040] and it also contains HTML. [04:59.040 --> 05:02.080] The software is capable of translating [05:02.080 --> 05:03.720] the parts that need translation while [05:03.720 --> 05:05.840] leaving the HTML part intact. [05:05.840 --> 05:10.160] So things like hyperlinks do not get mistakenly translated, [05:10.160 --> 05:13.200] which would be really bad. [05:13.200 --> 05:16.760] This code that we saw here roughly gets [05:16.760 --> 05:21.280] represented as this piece of HTML in a browser, [05:21.280 --> 05:25.360] and the translation is pretty good. [05:25.360 --> 05:28.960] Kind of. This word should have been filidae. [05:28.960 --> 05:32.880] It decided to keep the translation in French. [05:32.880 --> 05:36.080] We will improve that with time. [05:36.080 --> 05:38.960] But otherwise, the context and [05:38.960 --> 05:41.680] the meaning of the sentence are pretty darn good. [05:41.680 --> 05:45.240] We will look at accuracy in the upcoming slides. 
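The curl call described in this part of the talk can be sketched in Python with only the standard library. This is a minimal sketch, not the project's own client: the endpoint path `/translate`, the JSON field names (`q`, `source`, `target`, `format`, `api_key`), the `translatedText` response key, and the example port are not stated in the talk and are assumptions based on LibreTranslate's public API documentation, so check them against your instance.

```python
import json
import urllib.request


def build_translate_request(base_url, text, target, source="auto", fmt="text", api_key=None):
    """Build a POST request for a LibreTranslate-style /translate endpoint.

    Field names are assumptions taken from the public API docs, not from the talk.
    source="auto" asks the server to detect the source language.
    fmt="html" asks the server to leave markup such as hyperlinks intact.
    """
    payload = {"q": text, "source": source, "target": target, "format": fmt}
    if api_key:
        payload["api_key"] = api_key
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(base_url, text, target, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_translate_request(base_url, text, target, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a locally running instance (the port is a placeholder):
#   result = translate("http://localhost:5001", "<p>Hello world</p>", "fr", fmt="html")
#   print(result.get("translatedText"))
```

Keeping the request construction separate from the network call makes the payload easy to inspect without a running server.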
[05:45.240 --> 05:49.320] So as an overview of the list of features, [05:49.320 --> 05:50.920] it can do text translation, [05:50.920 --> 05:54.120] it can do markup translation that includes HTML, [05:54.120 --> 05:57.480] XML, and other formats that use a markup. [05:57.480 --> 06:01.520] It can do several formats for file translation. [06:01.520 --> 06:03.760] So you can upload things like [06:03.760 --> 06:06.000] open office, LibreOffice, [06:06.000 --> 06:10.240] Word documents, and PowerPoint slides, [06:10.240 --> 06:13.080] and able to translate those as well. [06:13.080 --> 06:16.080] It can perform a language detection. [06:16.080 --> 06:18.200] So you give it a piece of text, [06:18.200 --> 06:20.000] and it will give you [06:20.000 --> 06:23.920] an estimate of which language the program thinks it is. [06:23.920 --> 06:28.400] It also has a built-in system for doing rate limiting. [06:28.400 --> 06:30.840] If you're planning to host this on a public server, [06:30.840 --> 06:32.920] you will find out that it's a very useful feature, [06:32.920 --> 06:37.400] because people really like free resources, [06:37.400 --> 06:42.480] and it's difficult to give everything for free without [06:42.480 --> 06:46.840] some limits. So if your translation instance up in [06:46.840 --> 06:48.360] the Cloud gets really popular, [06:48.360 --> 06:50.840] having some limit by saying, [06:50.840 --> 06:53.960] do a maximum of 60 translation per minute, [06:53.960 --> 06:56.800] will come really handy and it's all built in into the software. [06:56.800 --> 07:02.360] You can further issue API keys to give to people that can change those limits. [07:02.360 --> 07:05.280] So you can set up the system in a way where you [07:05.280 --> 07:10.560] allow anonymous users to translate up to 20 translations per minute, [07:10.560 --> 07:13.800] and you can allow a subset of people that you've issued [07:13.800 --> 07:17.160] API keys to have however many they want. 
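The rate-limiting policy described above (a per-minute cap for anonymous users, with API keys that can raise or remove the cap) can be illustrated with a small sliding-window limiter. This is a hypothetical sketch of the idea only; it is not LibreTranslate's implementation, and the class name, parameters, and the convention that a limit of `None` means unlimited are all invented for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-client limiter: N requests per window, overridable per API key."""

    def __init__(self, anonymous_limit=20, window_seconds=60.0, key_limits=None, clock=time.monotonic):
        self.anonymous_limit = anonymous_limit
        self.window_seconds = window_seconds
        # Maps an API key to its own limit; None means "no limit" for that key.
        self.key_limits = dict(key_limits or {})
        self.clock = clock
        self._hits = defaultdict(deque)  # bucket -> timestamps of accepted requests

    def allow(self, client_id, api_key=None):
        """Return True if the request may proceed, False if it is over the limit."""
        if api_key in self.key_limits:
            limit = self.key_limits[api_key]
            bucket = ("key", api_key)
        else:
            limit = self.anonymous_limit
            bucket = ("anon", client_id)
        if limit is None:
            return True
        now = self.clock()
        hits = self._hits[bucket]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= limit:
            return False
        hits.append(now)
        return True
```

With `anonymous_limit=20` this matches the example in the talk of 20 translations per minute for anonymous users, while a key listed in `key_limits` gets whatever cap the operator chooses.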
[07:17.160 --> 07:18.720] You decide those limits. [07:18.720 --> 07:21.720] It also has a localized UI. [07:21.720 --> 07:23.640] We're using Weblate to do that, [07:23.640 --> 07:27.160] which is awesome, and it has [07:27.160 --> 07:29.800] currently been translated into four languages, [07:29.800 --> 07:32.400] and we're looking to verify and add more. [07:32.400 --> 07:34.560] One neat feature is that [07:34.560 --> 07:40.400] LibreTranslate has the ability to translate itself, roughly, [07:40.400 --> 07:42.920] so we have done that of course, [07:42.920 --> 07:48.080] but we haven't displayed all the languages that it has tried to translate itself into. [07:48.080 --> 07:51.360] We are waiting for native speakers to review [07:51.360 --> 07:53.920] the actual translations and correct them. [07:53.920 --> 07:55.520] But if you run in debug mode, [07:55.520 --> 07:58.280] you will see all the work that it has done, which is neat. [07:58.280 --> 08:02.240] So it translates itself, or at least it helps. [08:02.240 --> 08:04.640] Finally, it has the ability to monitor itself. [08:04.640 --> 08:07.960] It can generate usage metrics, [08:07.960 --> 08:13.520] so you can monitor the usage of the server using Prometheus and Grafana. [08:13.520 --> 08:17.520] These are tools to do monitoring that are very popular. [08:17.520 --> 08:21.920] Inside the software, there are really just a few packages, [08:21.920 --> 08:23.680] so it's very lightweight. [08:23.680 --> 08:27.360] Most of the translation work is done by [08:27.360 --> 08:29.880] another package called Argos Translate. [08:29.880 --> 08:32.360] This is really the core engine that [08:32.360 --> 08:35.160] performs the hard work in the translation, [08:35.160 --> 08:36.920] which is an awesome project, [08:36.920 --> 08:40.480] and we collaborate with them on LibreTranslate. 
[08:40.480 --> 08:42.280] Inside Argos Translate, [08:42.280 --> 08:47.080] there is also other software, so it is built on the shoulders of giants. [08:47.080 --> 08:49.840] CTranslate2, which is an inference engine that [08:49.840 --> 08:53.200] does neural translation using transformer models, [08:53.200 --> 08:54.600] which are state of the art. [08:54.600 --> 08:58.560] It's the same type of architecture that ChatGPT uses. [08:58.560 --> 09:00.440] There is SentencePiece, [09:00.440 --> 09:03.200] which is a piece of code from Google that does [09:03.200 --> 09:08.040] the word tokenization, and Stanza, which comes out of Stanford, [09:08.040 --> 09:10.480] which does sentence analysis. [09:10.480 --> 09:15.360] Argos Translate uses all three of these to perform the translation work. [09:15.360 --> 09:17.240] Now, that's not all it does. [09:17.240 --> 09:18.960] Argos Translate also takes care of [09:18.960 --> 09:23.240] the very important Argos package manager index. [09:23.240 --> 09:27.040] This is where all the language models are handled, [09:27.040 --> 09:28.520] installed, and distributed. [09:28.520 --> 09:31.880] So the first time that you run LibreTranslate, [09:31.880 --> 09:34.080] Argos Translate will take care of [09:34.080 --> 09:36.760] querying the Argos package manager index, [09:36.760 --> 09:39.280] and will download the languages that you need. [09:39.280 --> 09:42.840] This allows us to also create instances where, [09:42.840 --> 09:46.560] say, you only need to translate between French and English. [09:46.560 --> 09:51.720] You do not need to download the entire 26 gigabytes of models. [09:51.720 --> 09:52.840] You can simply say, [09:52.840 --> 09:54.560] I just need those two models, [09:54.560 --> 09:58.520] and the program will download only those two models. [09:58.520 --> 10:01.200] We also have a small module that does [10:01.200 --> 10:05.320] the file translation, which again connects to Argos Translate. 
[10:05.320 --> 10:08.080] That's the Argos Translate files package, [10:08.080 --> 10:11.280] and then some common Python packages that allow us to put up [10:11.280 --> 10:15.560] the web interface and coordinate the application as a whole. [10:15.560 --> 10:21.440] So it's really an ecosystem that's built with other open source software, [10:21.440 --> 10:27.320] and together it creates this complete translation solution. [10:27.320 --> 10:29.320] Talking about language models, [10:29.320 --> 10:32.080] we have 58 of them, which gives you [10:32.080 --> 10:34.560] translation support for about 30 languages. [10:34.560 --> 10:38.720] It does automatic pivoting via English. [10:38.720 --> 10:43.440] We are currently looking to transition to using multi-language models. [10:43.440 --> 10:45.800] But for the moment when you translate, [10:45.800 --> 10:47.640] say, from Italian to French, [10:47.640 --> 10:51.200] the program will automatically do the pivoting via English. [10:51.200 --> 10:55.360] So it will translate Italian to English and English to French. [10:55.360 --> 10:57.600] If there is a language missing, [10:57.600 --> 11:01.560] there is a very cool repository [11:01.560 --> 11:04.520] under the Argos Open Tech organization, [11:04.520 --> 11:07.840] which builds Argos Translate, called Argos Train. [11:07.840 --> 11:09.800] That is a repository that has [11:09.800 --> 11:13.520] very good instructions on how you can train your own models. [11:13.520 --> 11:17.520] So if a language is missing, go check it out. [11:17.520 --> 11:20.280] It has very clear instructions, [11:20.280 --> 11:22.880] and you could contribute a language that is [11:22.880 --> 11:26.480] missing and that you want to see integrated into the software. [11:26.480 --> 11:29.920] Speaking of the models, [11:29.920 --> 11:32.040] when a model is downloaded, [11:32.040 --> 11:34.440] it has an .argosmodel extension, [11:34.440 --> 11:36.880] and these are simply zip files. 
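The pivoting via English described above can be sketched as a small planning function. This is an illustrative sketch, not the Argos Translate implementation: the function name and the representation of installed models as a set of (from, to) language-code pairs are assumptions made for the example.

```python
def plan_translation(source, target, installed_pairs, pivot="en"):
    """Return the list of (from, to) model hops needed, or None if no route exists.

    installed_pairs is a set of (from_code, to_code) tuples, one per installed model.
    """
    if source == target:
        return []  # nothing to translate
    if (source, target) in installed_pairs:
        return [(source, target)]  # a direct model is available
    if (source, pivot) in installed_pairs and (pivot, target) in installed_pairs:
        # Chain two models through the pivot language, e.g. it -> en -> fr.
        return [(source, pivot), (pivot, target)]
    return None


# Example: Italian to French with only English-centred models installed.
pairs = {("it", "en"), ("en", "it"), ("fr", "en"), ("en", "fr")}
print(plan_translation("it", "fr", pairs))
```

If every non-English language had one model into English and one out of it, 29 such languages would need 58 models and cover 30 languages including English, which would be consistent with the figures quoted in the talk; the talk itself does not spell out this breakdown.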
[11:36.880 --> 11:39.440] It's a zip file, and inside [11:39.440 --> 11:42.440] it has a little bit of metadata. [11:42.440 --> 11:46.480] It has a folder that contains the CTranslate2 model. [11:46.480 --> 11:49.960] It has the SentencePiece model and finally the Stanza model. [11:49.960 --> 11:52.560] So it has the information for all three packages that we [11:52.560 --> 11:55.200] discussed earlier to perform the translation. [11:55.200 --> 11:58.000] It's very interesting to check it out. [11:58.000 --> 12:00.360] Let's talk a little bit about accuracy, right? [12:00.360 --> 12:01.880] The question is, okay, [12:01.880 --> 12:05.320] it's translation software, but how good is it really? [12:05.320 --> 12:09.560] For that, there is a metric that can be used to [12:09.560 --> 12:13.800] assess roughly the accuracy of the translation. [12:13.800 --> 12:15.920] It's called the BLEU score, an acronym [12:15.920 --> 12:18.720] for bilingual evaluation understudy. [12:18.720 --> 12:25.160] It measures the similarity of text to a reference corpus. [12:25.160 --> 12:28.160] It has values that go from 0 to 1, [12:28.160 --> 12:31.960] or if you express it as a percentage, from 0 to 100. [12:31.960 --> 12:34.240] The best translators in the world, [12:34.240 --> 12:37.840] human translators, do not get a score of 100, ever. [12:37.840 --> 12:41.400] So anything that is above [12:41.400 --> 12:45.040] 40 is considered understandable to good. [12:45.040 --> 12:49.720] Something that is above 50 tends to be very high quality. [12:49.720 --> 12:55.120] Sorry, up to 50 is high quality and above 60 is very high. [12:55.120 --> 12:58.240] We had a community contributor [12:58.240 --> 13:00.840] actually go, a few weeks ago, [13:00.840 --> 13:06.080] and run the evaluation on our different models, [13:06.080 --> 13:08.840] and we found that 83 percent of [13:08.840 --> 13:10.480] the models currently in [13:10.480 --> 13:13.720] LibreTranslate are scoring above 40 percent. 
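To make the BLEU idea concrete, here is a simplified sentence-level sketch: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization, a single reference, and no smoothing, so it is only an illustration; real evaluations are normally run at corpus level with an established tool, and the talk does not say which tool the contributor used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, unsmoothed BLEU in the range 0..1."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


# Multiply by 100 to get the percentage scale used in the talk.
print(round(100 * bleu("the cat sat on a mat", "the cat sat on the mat"), 1))
```

A single changed word already costs a short sentence a large share of its score under this simplified version, which gives some intuition for why even human translators do not reach 100.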
[13:13.720 --> 13:16.880] So 83 percent of them are good. [13:16.880 --> 13:19.840] Now, to put it into perspective, [13:19.840 --> 13:23.280] when people ask me directly how good LibreTranslate is, [13:23.280 --> 13:26.000] I like to tell them that it's roughly as good as [13:26.000 --> 13:28.600] Google Translate was four years ago. [13:28.600 --> 13:31.440] So I want to make the expectations clear at this stage in [13:31.440 --> 13:33.640] the project that it is not as good as some of [13:33.640 --> 13:36.000] the proprietary alternatives. [13:36.000 --> 13:40.400] But we are improving and we will continue to improve. [13:41.600 --> 13:47.320] The way to improve it lies mostly in getting better training data. [13:47.320 --> 13:50.200] So as we find more and more sources of [13:50.200 --> 13:52.480] open data that can be used for translation, [13:52.480 --> 13:54.560] we include those in the training of [13:54.560 --> 13:58.480] the models, and that results in better models. [13:58.480 --> 14:03.080] Another interesting point to note [14:03.080 --> 14:08.560] is that because the project is open source and we have a way to train models, [14:08.560 --> 14:12.440] you can also train models that are specific to a certain domain. [14:12.440 --> 14:15.040] For example, in the context of software translation, [14:15.040 --> 14:18.960] you could imagine the case where instead of training the model on [14:18.960 --> 14:27.480] a general corpus like Wikipedia or the EU Parliament translation documents, [14:27.480 --> 14:31.320] you could train a model that is specific to software. [14:31.320 --> 14:36.400] For example, you could take a set of existing translations from [14:36.400 --> 14:39.320] existing software that has licensed [14:39.320 --> 14:43.760] the translation work under an open, permissive license and train [14:43.760 --> 14:47.240] a model on those existing translations. 
[14:47.240 --> 14:49.640] Because we have the knowledge, [14:49.640 --> 14:51.960] a lot of software has commonalities in terms. [14:51.960 --> 14:53.720] When you have a file menu, [14:53.720 --> 14:56.280] it's always called file and then edit. [14:56.280 --> 15:00.440] So those menus are specific to a context. [15:00.440 --> 15:04.120] By training models that are specific to a context, [15:04.120 --> 15:06.960] you could get, for example, [15:06.960 --> 15:12.200] software translation model that is more accurate in the context of software, [15:12.200 --> 15:13.840] rather than say poetry. [15:13.840 --> 15:18.880] So it's a very interesting thing to think about. [15:18.880 --> 15:21.000] One more thing about accuracy, [15:21.000 --> 15:23.560] we do have the occasional rare quirk. [15:23.560 --> 15:27.400] This is something that we are aware of and we are working to fix it. [15:27.400 --> 15:31.400] We like to call it the salad issue. [15:31.400 --> 15:38.240] We joke, I will demonstrate this slide because it always sparks a little bit of a giggle. [15:40.800 --> 15:43.360] It's a little bit rare, but it happens. [15:43.360 --> 15:46.880] So in Spanish, the word for salad is ensalada. [15:46.880 --> 15:51.520] Now, let's try to translate the word for salads plural. [15:51.520 --> 15:55.080] So I'm going to type ensaladas. [15:55.080 --> 15:58.360] So in French, that's saladas. [15:58.360 --> 16:00.080] Is that correct? Any French people in the room? [16:00.080 --> 16:02.320] Fantastic. [16:02.320 --> 16:05.320] Now let's try the singular form. [16:05.320 --> 16:10.040] I'm going to remove the S and it crunches for a little bit. [16:10.040 --> 16:18.840] In a second, it really likes salad. [16:18.840 --> 16:21.840] Salad, salad, salad, salad, salad. [16:21.840 --> 16:24.560] This is a quirk. We are aware of it. [16:24.560 --> 16:29.640] It's very rare, but we've found a few reports here and there and we're working to fix it. 
[16:29.640 --> 16:31.960] Just something to be aware of. [16:33.160 --> 16:35.640] Yes, it really likes salad. [16:35.640 --> 16:40.000] Me too. Let's talk a little bit about integrations. [16:40.000 --> 16:47.360] You can find client libraries for about 11 programming languages, including the most common ones like Java and Python; [16:47.360 --> 16:52.400] whatever your favorite language is, it's probably in the list of bindings. [16:52.400 --> 16:57.200] And if it's not there, adding new bindings for LibreTranslate is fairly easy. [16:57.200 --> 17:00.000] So we welcome contributions, of course. [17:00.000 --> 17:07.000] As far as software goes, LibreTranslate has found adoption in several existing open-source projects that you may recognize. [17:07.000 --> 17:11.760] Mastodon recently added support for translating toots using LibreTranslate. [17:11.760 --> 17:21.160] Weblate has the ability to use LibreTranslate to suggest and help translators perform translations as an alternative to using proprietary software. [17:21.160 --> 17:30.720] The forum software Discourse has a plugin that lets you make your forum accessible from different locales [17:30.720 --> 17:34.520] and lets you translate the posts on the fly. [17:34.520 --> 17:40.960] LibreOffice, I found, has an extension. I didn't know this until a week ago when I was looking at who has integrated stuff with LibreTranslate. [17:40.960 --> 17:45.960] But somebody wrote an extension to LibreOffice where you can translate documents on the fly using LibreTranslate. [17:45.960 --> 17:49.360] There is an add-on for the multimedia software Kodi. [17:49.360 --> 17:51.960] There is also an add-on for Firefox. [17:51.960 --> 17:55.560] And there are probably a lot of other things that I haven't found myself. [17:55.560 --> 18:01.760] But a lot of people seem to be finding the API useful and they're doing integration work, which is fantastic. 
[18:01.760 --> 18:08.320] And there's finally client applications that you can use LibreTranslate with without using the web UI. [18:08.320 --> 18:12.320] And we found we have clients for Android, iOS and desktop. [18:12.320 --> 18:17.320] And there's more being built by the week. [18:17.320 --> 18:25.760] As far as comparison to proprietary alternatives, you can see that there is a clear monetary advantage, [18:25.760 --> 18:32.160] aside from the philosophical reason for why you might want to use open-source software, of course. [18:32.160 --> 18:40.160] But it could also be a really sustainable way to perform translations in that people often ask me, [18:40.160 --> 18:43.360] why should I use LibreTranslate? I can use Google Translate for free. [18:43.360 --> 18:48.160] I just go on translate.google.com and it doesn't charge me anything. So why should I care? [18:48.160 --> 18:52.160] Google Translate is free so long as you're using it by hand. [18:52.160 --> 18:57.760] If you want to do any automation work and you have to tap into their API, you're going to pay dearly. [18:57.760 --> 19:03.960] And you can see here a list of the prices and I can assure you that one million characters seem like a lot, [19:03.960 --> 19:12.960] that's six zeros, but they actually run pretty fast and so could the bill on your credit card. [19:12.960 --> 19:20.760] So if you have a lot of text to translate, LibreTranslate could really help in that regard. [19:20.760 --> 19:26.560] As far as funding goes, the project is on the path to become fully self-funded. [19:26.560 --> 19:31.360] And we really care about this because we want the project to continue living on. [19:31.360 --> 19:36.160] We, of course, accept sponsorships and donations, but honestly, [19:36.160 --> 19:41.360] we would rather prefer that you get something back if you decide to contribute financially to the project. 
[19:41.360 --> 19:48.360] This is why if you are in the position where you say, I have some finances to spare and want to help support the project, [19:48.360 --> 19:58.160] you also get something back, and we do that in the form of offering you an API key to use a hosted instance at libretranslate.com. [19:58.160 --> 20:04.960] So you are free to run the infrastructure on your own server, on your Raspberry Pi, on any machine that you'd like. [20:04.960 --> 20:12.360] If you don't want to handle that, you can just get an API key and you can support the project at the same time. [20:12.360 --> 20:15.560] So it's really a good way to contribute back. [20:15.560 --> 20:20.960] And we found that that model has been helping us grow and sustain the project. [20:20.960 --> 20:26.160] So we hope to continue growing as much next year. [20:26.160 --> 20:29.960] Again, to get involved, I'll give you a few quick numbers. [20:29.960 --> 20:34.160] We've had about 70 people contribute to the code base over the last few years. [20:34.160 --> 20:40.960] The project is still very young, but it has really received a lot of attention, so we're very excited about that. [20:40.960 --> 20:48.760] You can help with code. If you're a Python programmer, if you know HTML, CSS, any of the technologies that we use, you're welcome to contribute. [20:48.760 --> 20:51.860] We are open to everybody and all ideas. [20:51.860 --> 20:54.560] You can also help us translate. [20:54.560 --> 21:05.160] If you understand English and you don't see your language in the list of languages that we currently support for the user interface, you are welcome to contribute. [21:05.160 --> 21:12.360] It's on Weblate; you can simply translate and it will get included into the project every 24 hours. [21:12.360 --> 21:14.960] So that is really amazing. [21:14.960 --> 21:18.160] You can also help us train more language models. 
[21:18.160 --> 21:28.460] If your language is not available or a language that you care about is not available, you can yourself create a new model for a language and add that into the list. [21:28.460 --> 21:31.060] So that is also another way that people can help. [21:31.060 --> 21:37.560] You can report bugs, of course. If you don't report salad, we are aware of it. [21:37.560 --> 21:38.760] Or just come say hi. [21:38.760 --> 21:46.360] We have a community forum that is quickly growing and we love to hear what you're building with it, what you're using, or if you have any questions. [21:46.360 --> 21:51.760] So we're very, very open and we're excited to hear what you will do with it. [21:51.760 --> 21:54.360] That said, this was the last slide. [21:54.360 --> 21:57.160] I think we have some time left over, right? [21:57.160 --> 22:02.160] So I will. [22:02.160 --> 22:03.760] So thank you very much. [22:03.760 --> 22:05.960] I will open the floor for questions and discussion. [22:05.960 --> 22:08.360] So yes. [22:08.360 --> 22:20.760] Hi, my best friend cannot miss the Vice President of the Austrian Society for Artificial Intelligence sitting with Foster to find volunteers to do exactly what you're doing. [22:20.760 --> 22:27.760] Thank you. We're glad to be able to help. [22:27.760 --> 22:28.760] You're welcome. [22:28.760 --> 22:34.560] How do we find, well, I just named the thing Open Language Model Training Army. [22:34.560 --> 22:36.960] How do we find more volunteers? [22:36.960 --> 22:47.260] Unemployed people, maybe have the government fund people running training models, maybe suggest that to all politicians, everybody to their member of parliament. [22:47.260 --> 22:51.660] How many people do we have here? [22:51.660 --> 22:55.060] Should be all of Europe at least, maybe South America. [22:55.060 --> 22:58.760] I think if we multiply this, it can go viral. [22:58.760 --> 22:59.960] Thank you very much. [22:59.960 --> 23:01.060] This is awesome work. 
[23:01.060 --> 23:01.360] Thank you. [23:01.360 --> 23:03.360] I appreciate it. [23:03.360 --> 23:10.260] Yes, but you speak of our language that have the same link, the same structure of the language. [23:10.260 --> 23:15.460] We have French, English, Spanish, Portuguese, maybe Russian and Ukrainian. [23:15.460 --> 23:20.560] I do not have the same structure, but the language not far away from here. [23:20.560 --> 23:22.460] That's the difference. [23:22.460 --> 23:24.660] German also. [23:24.660 --> 23:26.160] Correct. [23:26.160 --> 23:32.560] And so taking this into account, it's not easy for a translator, for them to translate it. [23:32.560 --> 23:33.160] It is not. [23:33.160 --> 23:38.360] There's also maybe a problem with Chinese or Japanese. [23:38.360 --> 23:39.260] Correct. [23:39.260 --> 23:50.760] There was a problem, there was a thing I went to say, it's a dictionary in line or in the program to have the good word because it's not translated every time the good word. [23:50.760 --> 23:57.160] And so I thought also the most efficient people's language is Esperanto, not English. [23:57.160 --> 23:58.860] Oh, that is very interesting. [23:58.860 --> 23:59.460] Yes. [23:59.460 --> 24:00.160] Okay. [24:00.160 --> 24:02.360] Yeah, that's a great insight. [24:02.360 --> 24:04.760] Yeah, thank you for sharing that. [24:04.760 --> 24:13.560] And you're completely right, some languages don't share the same semantic structure, and Dutch, for example, currently doesn't score super high. [24:13.560 --> 24:18.160] It's actually in the bottom 17% of the language models by BLEU score. [24:18.160 --> 24:20.660] Dutch scored around 38%. [24:20.660 --> 24:28.860] So it's almost good, but we've had some Dutch-speaking people come to us and say, you know, it could use improvement. [24:28.860 --> 24:31.460] So Dutch, yes, it is a language that needs improvement. 
[24:31.460 --> 24:40.060] And I talked to the maintainer of Argos Translate about the languages that need improvement. [24:40.060 --> 24:46.360] And he pretty much suggested that better training data will help greatly. [24:46.360 --> 24:51.660] So it is mainly a problem, not of the architecture of the AI. [24:51.660 --> 25:01.760] It's a matter of not having sufficient high-quality data between, say, English and Dutch to get above 38% currently. [25:01.760 --> 25:04.960] But again, nobody has really focused on Dutch as a language. [25:04.960 --> 25:10.860] If anybody has an interest in improving Dutch, we can do better. [25:10.860 --> 25:16.460] Surprisingly, fantastic. [25:16.460 --> 25:21.960] But as far as, for example, languages like German, LibreTranslate currently does very well with German. [25:21.960 --> 25:24.460] It's above 50, if I remember correctly. [25:24.460 --> 25:27.360] Is it because German is a similar language to Dutch? [25:27.360 --> 25:28.360] It is. [25:28.360 --> 25:35.560] That is because I believe, and I think PJ, that's the name of the maintainer of Argos Translate, [25:35.560 --> 25:45.460] because the German model has had a larger amount of training data, and so it tends to perform better. [25:45.460 --> 25:46.460] Yes? [25:46.460 --> 25:51.460] Yeah, just a quick question around the translation process, I suppose, touched on the structure. [25:51.460 --> 25:54.460] But how does it work with different dialects? [25:54.460 --> 25:57.460] So if you write in dialects, will you write in slang? [25:57.460 --> 25:58.960] That's a very good question. [25:58.960 --> 26:05.460] Dialects would probably, and that's my guess, because I've never looked into this myself, [26:05.460 --> 26:12.460] but I believe that for a dialect to perform well as a target or source language for translation, [26:12.460 --> 26:16.460] it would also need its fair amount of training data. [26:16.460 --> 26:18.460] And that is the problem with dialects. 
[26:18.460 --> 26:22.460] I actually speak a local Italian dialect, that is my first language, [26:22.460 --> 26:26.460] and I wanted to make a model for my dialect. [26:26.460 --> 26:31.460] And I started looking online for sources of data that I could use to create a model for my dialect, [26:31.460 --> 26:33.460] because it would be cool. [26:33.460 --> 26:36.460] And it was really challenging. [26:36.460 --> 26:40.460] Not being an official language, it really lacks the status of official languages, [26:40.460 --> 26:44.460] and finding training data is extremely difficult. [26:44.460 --> 26:46.460] But it could be possible, right? [26:46.460 --> 26:54.460] If you gather enough people that can create a ground truth data set of examples in the dialect [26:54.460 --> 26:58.460] with sufficient samples, you could get good results, I believe. [26:58.460 --> 27:02.460] So it's a matter, again, of training data. [27:02.460 --> 27:03.460] Yes? [27:03.460 --> 27:06.460] How much does it cost to get a model to a good level? [27:06.460 --> 27:12.460] In terms of computing power or in terms of computing power? [27:12.460 --> 27:19.460] So if I remember correctly what PJ told me about the cost of training the models, [27:19.460 --> 27:26.460] it costs maybe a few, between $12 and $30. [27:26.460 --> 27:29.460] You can rent instances on several cloud providers. [27:29.460 --> 27:31.460] You do need a GPU to train these models, [27:31.460 --> 27:37.460] and it might take a few days for it to crunch and get a sufficient number of iterations to train the model. [27:37.460 --> 27:40.460] But it's absolutely affordable. [27:40.460 --> 27:45.460] Anybody can do it, and if you are willing to wait and you just have a gaming laptop sitting at home, [27:45.460 --> 27:50.460] if you're OK waiting 20 days for it to finish, it will train the model for you.
[27:50.460 --> 27:55.460] So I guess it could be free to you if you're willing to wait a sufficient amount of time, [27:55.460 --> 27:59.460] and if you have a gaming laptop lying around. [27:59.460 --> 28:00.460] Yes? [28:00.460 --> 28:06.460] So you mentioned the need for data availability for building the model, right? [28:06.460 --> 28:13.460] Well, do you need the data to be available under a certain license? [28:13.460 --> 28:15.460] What's your problem? [28:15.460 --> 28:19.460] The world's full of things, right? [28:19.460 --> 28:20.460] Yes. [28:20.460 --> 28:21.460] What's the requirement you have? [28:21.460 --> 28:24.460] You want it to be public domain? [28:24.460 --> 28:34.460] It has to be licensed under a permissive license, so Creative Commons that also allows commercial use. [28:34.460 --> 28:42.460] And we give references and we give attribution to all the sources that we use. [28:42.460 --> 28:50.460] If you go into the Argos Package Manager repository, where all the models are hosted, [28:50.460 --> 28:53.460] we do give the appropriate licensing credits to all those. [28:53.460 --> 28:58.460] But yes, we cannot go on, say, the Internet and start scraping results, [28:58.460 --> 29:04.460] because everything, you just have to assume that everything is covered by copyright [29:04.460 --> 29:06.460] until they tell you that you can use it freely. [29:06.460 --> 29:12.460] So it's only trained on openly available and freely licensed sources. [29:12.460 --> 29:17.460] Do you need it to be translated as well or just a single language? [29:17.460 --> 29:19.460] It has to be translated. [29:19.460 --> 29:27.460] So very briefly, the format of the input that goes into the training is a file that has, [29:27.460 --> 29:33.460] say, the English sentences and a separate file that has the translation on the same line. [29:33.460 --> 29:36.460] So it's very basic. [29:36.460 --> 29:39.460] And somebody could do the work by hand, right?
[29:39.460 --> 29:43.460] You start from the English sentences and you start doing the translation. [29:43.460 --> 29:49.460] So it will take a lot of work, but it's doable, especially in a crowd-sourced... [29:49.460 --> 29:51.460] Are we out of time? [29:51.460 --> 29:52.460] Okay. [29:52.460 --> 29:53.460] I'll be around if you have other questions. [29:53.460 --> 29:55.460] Our time is up, unfortunately. [29:55.460 --> 29:56.460] They're kicking me out. [29:56.460 --> 30:00.460] But the next speaker will deliver something awesome as well in the next talk. [30:00.460 --> 30:15.460] Thank you again.
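The training input format described in the answers above (one file with, say, the English sentences and a separate file with the translation on the same line) can be sketched as follows. This is only an illustration of the line-aligned layout, not the actual Argos Translate training code: the file names `source.en` and `target.it`, the helper names, and the sample sentences are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def read_parallel_corpus(source_path, target_path):
    """Pair line i of the source file with line i of the target file."""
    source_lines = Path(source_path).read_text(encoding="utf-8").splitlines()
    target_lines = Path(target_path).read_text(encoding="utf-8").splitlines()
    # The two files only make sense together if they have the same
    # number of lines, since alignment is purely by line position.
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line count mismatch: {len(source_lines)} source lines "
            f"vs {len(target_lines)} target lines"
        )
    return list(zip(source_lines, target_lines))


def build_demo_corpus():
    """Write two tiny hypothetical files and read them back as pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        source_file = Path(tmp) / "source.en"  # hypothetical file name
        target_file = Path(tmp) / "target.it"  # hypothetical file name
        source_file.write_text("Hello.\nThank you.\n", encoding="utf-8")
        target_file.write_text("Ciao.\nGrazie.\n", encoding="utf-8")
        return read_parallel_corpus(source_file, target_file)


pairs = build_demo_corpus()
for english, italian in pairs:
    print(f"{english} -> {italian}")
```

Checking the line counts up front is a deliberate choice in this sketch: because the pairing depends only on line position, a single missing line would silently shift every later translation against the wrong sentence.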