Okay, so unique code support for GCC Rust.
Okay, it's here.
Sorry, wait a moment please.
Okay, let me start my presentation,
Unicode Support for GCC Rust frontend.
And here's today's outline.
First, I'll explain about my project and then how Unicode can be used in Rust.
And then I'll explain about how we implement Unicode support in GCC Rust.
And then I briefly explain about,
introduce about two mongering schemes in Rust and then summary.
So first, let me introduce a little bit about myself.
My name is Raiki Tamura,
and I'm an undergraduate student at Kyoto University in Japan.
And I participated at Google Summer of Code 2023,
and I worked with GCC organization.
And other program and my main interests are compilers and low level program such as emulators.
So next, I'll explain about my project.
I worked on Unicode support for GCC Rust as a Google Summer of Code 2023 project last summer.
And Google Summer of Code is a global online mentoring program where students work with open source organization,
and they write some code and contribute to the organization.
And now, I'm continuously working on supporting the new Rust manga in GCC Rust.
So next, Unicode in Rust.
You can use Unicode characters in Rust program,
and first, you can know as key new lines and white spaces in Rust program.
And next, you can create name attribute to specify the name of your Rust program
and the values of this attribute accepts Unicode alphabetic and numeric characters
and which includes also known as key characters such as alphabet from various languages.
And last, you can use more known as key characters for identity.
For example, below, you can find Germany characters and Japanese characters and Korean characters.
And you can also use all varieties identifiers.
Next, let's more deeply dig into Rust identifiers.
Rust adapt the syntax of identifiers defining Uax31 which is a part of Unicode standard.
And Uax31 is also adapted by ECMAScript, that is JavaScript, C++, Python, and other many languages.
So the syntax of this Uax31 identifier are shown below,
and this is something like a generalization of typical Novoski identifiers in programming languages.
And in Rust, after identifiers being talked about, their normalization to special form is called normalization form C,
shortly NFC, so that compiler can compare the same identifiers but with different encodings.
So identifiers are normalized to some normalization form.
So next, implementation.
So before my project starts, there are already other front-end, GCC front-end, which supports Unicode.
For example, libcpp is a C-pre-processing DCC, which implements lexer.
And also, C++, as you remember, C++ adapts the same syntax of Unicode identifiers as Rust.
So I took a look at it first, and next are also go front-end, go language supports Unicode,
but go adapts different syntax for identifiers, so...
but I read the implementation of DCC go.
So my implementation is divided into three parts.
First part is lexer part.
In the first part, we modify the lexer to accept Unicode characters, and second part is the great-name attribute part.
We added validation for values of the great-name attribute.
And we checked if the values of the great-name only has Unicode alphabetical numeric characters,
and last, the mangler part.
We modify the mangler to handle Unicode identifiers.
So the first part, the lexer part, in order to look up character properties, for example, identifiers...
So we have to tell the compiler which characters can be used as identifiers,
and which characters cannot be used for identifiers.
So we have to tell the compiler some Unicode properties.
So in order to look up such Unicode properties, we'll use some functions already in libcpp,
and for other missing properties, we generated a header file from Unicode data files.
Unicode data files are distributed by Unicode.org at the Unicode official site.
And to achieve this, we wrote a Python script, and we pass...
which passes Unicode data files and then generate C++ header files.
So this is the part of the generated header file, which contains such boring table.
So the next part is the great-name implementing, great-name attribute.
So this is quite a simple part because all we have to do is to use a generated header file
and add it to validation to the values of the identifiers.
And the last part is the mangler part.
First of all, we have to modify the Rust default mangler to handle Unicode characters.
So the default mangler is called legacy, and legacy mangling schemes escapes non-asky characters as their code point.
And also, we have to implement a new Rust mangler scheme, which is called v0.
And in v0, identifiers are encoded as punicode, which is used in WebBrowser or something like that.
And implementing v0.mangler to DCCRS is now still in progress.
So here, I briefly explain about mangling schemes.
There are two mangling schemes in Rust, legacy and v0.
You can pass options to switch this mangler scheme.
So in Rust C, you can use C symbol mangling version option, and in case of DCC, you can use f Rust mangling option.
And v0 was introduced to Rust C on 2019, and it is used in the Rust for Linux project for some reason.
I'll explain it later.
So implementing v0 is so important for DCCRS project.
So here, let me compare the two mangling schemes.
In legacy mangling scheme, legacy symbols start with underscore z prefix, which conflict Italian ABI, which is used in C++.
And v0 uses underscore r, which is unique to Rust.
And next, used characters are different in legacy mangling schemes.
Mangled symbol uses ASCII alphabet and ASCII number, and also uses Dara signs and period.
And on the other hand, in v0, mangled symbol uses ASCII alphabet and number and underscore.
So speaking of Dara signs, Dara signs are vendor-specific characters in mangled symbols.
So typically, it is preferable that we avoid using these symbols.
So next, type information.
Basically, legacy symbols doesn't contain any type information.
On the other hand, v0 has rich type information such as generic types and inherent implementation in Rust, or tried implementation in Rust, and more.
And for example, these are contents namespaces like modules in Rust.
And Rust, speaking of Unicode identifiers, you know, as you remember, legacy escapes Unicode characters as code points.
On the other hand, v0 uses punicode to encode Unicode identifiers.
So let's look at a simple example of a two-mangling scheme.
If you define this function in Rust, you can see two-mangled symbols.
And highlighted part is corresponding to the name of the function.
And you can find Dara signs in legacy mangled scheme, which is vendor-specific characters.
And you can also find a invisible symbol.
You can also find a punicode encoded symbols.
This is a hero part of this slide.
So in summary, as a result of GSOC-223, GCCRS supports Unicode.
And Rust compiler uses Unicode normalization and punicode encoding,
and implementing the new-visor-mangled to GCCRS is now in progress.
Thank you for free software foundation for supporting me to attend this conference.
And I would like also thank to my mentors, Philippa and Arthur and other GCCRS team
and another GSOC student, Mahat. And that's all. Thank you for listening.
Thank you.
So we have a few minutes to questions. Four minutes for questions.
If people have questions.
Hi.
You showed the example for like the new and the old mangling.
And how I can say, I can't understand how you would look at that
and know if it's Unicode or if it's someone actually wrote U7 blah blah blah blah as a name of function.
So how can you identify that?
Oh, okay. So if I understand your question correctly,
are there questions that you cannot find part corresponding to the name of function, right?
Yeah, I cannot look at this and see, right, from U7 onwards, this is Unicode encoding
and not something that was actually written by the user.
Sorry, one more please.
It's just like, imagine some user decided to write a function
and they decided to call the function U7 KED blah blah blah blah.
How can I tell if this encoding is saved, the user wrote that exactly, or if they used the Unicode?
Okay, so your question is that you cannot tell this symbol is encoded using Unicode or low ASCII data.
Oh, if it is verbating.
Okay, so in visual symbol, you can find the first U character, right?
So this means Unicode encoding and then you can find 7, which is a length of character string from K to H,
which is an encoded part.
So, yeah, thank you.
One minute left for questions.
Yeah, so my question is how much effort would it be for you now, after you know all this,
to improve the existing lecture to also accept Unicode?
Is it just 3 for things and an HR or?
Are you speaking now?
Current situation?
Yeah, now I would add to an existing lecture the option to pass Unicode.
How much work would it be?
I think many developers don't use NoaSki Identifier,
but what is so wide so many developers want to use NoaSki Identifier.
So in terms of that, I think it is meaningful.
Yeah, how much work would it be just if you edit now?
Sorry?
Would it take you a week or weeks?
Maybe three days or so.
Okay, all right.
Time's up.
Thank you.