So up next is Rui. I hope that's reasonably correct. Yeah, that sounds right. Of linker fame, let's just say that. Now talking about whether the mold linker can actually be used as a system linker.
Yes. So thank you for coming to this talk. My name is Rui Ueyama. I'm the creator of the mold linker as well as the LLVM linker, LLD. I wonder if you guys are using my linkers. Raise your hand if you are using the mold linker. And what about LLD? OK, maybe almost everyone is using my linker, so it makes me very comfortable to be here.
Anyway, the mold linker is my latest attempt to create the best linker for developers. And that really matters, because in most builds the link time dominates, especially if you are in a quick edit, compile, debug cycle. You edit a single file and build the thing. The compiler finishes pretty soon because it compiles just a single file, but the entire executable needs to be built from scratch. So the link time matters.
I've been developing the mold linker since September 2020, so it's been almost three years, or a little under. It's relatively new. It's available under the MIT license now. It used to be under a different license because I was trying to commercialize it, but it turned out that didn't work out, so I decided to go with a permissive license.
The main purpose is to offer the fastest linker to developers. It's an order of magnitude faster than the GNU linker, and it's also faster than my previous one, LLD, as well as the GNU gold linker. To give you a rough idea, on a decent multi-core machine mold can produce about one gigabyte of output per second. So if your executable is two gigabytes, it takes about two seconds on your machine. That's pretty fast, but modern executables are gigantic as well. For example, if you build LLVM with debug info, the output is about one and a half gigabytes, and it can be linked in about one and a half seconds.
And the mold linker supports almost all major targets except MIPS. The reason is that the MIPS ABI has diverged too much from the other ABIs. The other ABIs have evolved since 2000, but the MIPS ABI has stagnated since the collapse of SGI, because SGI was the de facto player that set the standard in that field, and no one has made any effort to improve the ABI since then. So MIPS has diverged. At this point I'm not sure if we want to continue working on MIPS support, because it seems like no one is really making a serious effort to refresh the architecture. But anyway, mold supports a lot of architectures, even including LoongArch, which is a newcomer in this field.
And despite being pretty new, I think that the linker is production ready, and I think that many people are actually using it in production. I will talk later about how I tested the linker.
So this slide explains what the mold linker is from the developer's perspective. It's written in C++, specifically with C++20 features, and with Intel TBB as the threading library. The one thing that you would notice immediately if you take a look at the source code of the mold linker is that almost all functions and data structures are templates rather than plain functions or structures, and the templates are specialized for each target. I also care about source code quality, and ideally the source code should be readable, so I put a lot of effort into making it readable.
So this is an example of how you write target-specific code in mold. It uses if constexpr in the source code. If you are not familiar with modern C++, this is a relatively new feature (it arrived in C++17). And the beauty of this feature is that if constexpr is evaluated at compile time rather than at runtime, so this if constexpr statement will be compiled to nothing if this function is not being specialized for PowerPC64 V1.
So as long as you guard your new code in this way, your new code cannot do anything harmful to other targets, and it cannot slow down other targets.
So this is another example of how we use C++20 features in mold. This is a data structure representing the on-disk format of relocations. But there are many types of relocations, because we at least have big-endian and little-endian, 32-bit and 64-bit versions. So in combination we already have four different versions. And the beauty of C++20 is that you can use a requires clause after the template keyword to specify what kind of type parameters you want to specialize for. In this case, this data structure is specialized for little-endian and RELA, which is very technical stuff. So we have two different versions of the relocation data structure, and below this definition we have different versions of the data structure with the same name. We even have a completely different version of the data structure specifically for SPARC64, because SPARC64 has this weird field that doesn't exist in any other architecture. But we can just define this data structure only for SPARC64, and as long as you guard the code that accesses this field with if constexpr, your code will not cause a compile error saying that you are using a field missing from the data structure. So this is a very beautiful way to compile your code for a specific target.
So, it's not loading. Okay, so this is a machine description of a specific target. In this case, it's the machine description for x86-64. We have a bunch of constexpr static variables as parameters, and they define whether it's a little-endian or big-endian architecture, or whether it's 32-bit or 64-bit. So basically, if you want to port the mold linker to a new target, you define this kind of data structure, basically by copy and paste, and then make modifications as needed. It's just as simple as that.
And since these fields are compile-time constants, the compiler knows what the values are at compile time, so it can optimize the code based on these values instead of dispatching at runtime.
So this is a comparison of the number of lines that you need to port the linker to a new target. On the left-hand side, we have gold. It is not a really precise comparison, because lines of code is not a direct indicator of how easy or how hard it is to port a linker to a new target, but it gives you enough of an idea about the amount of work that you have to do. Apparently, for gold, you have to write tens of thousands of lines of code for each target. But the reality is that most of the target-specific code in gold is just copy and paste. For example, if you want to port GNU gold to SPARC or LoongArch or whatever, you would start by copying an entire file as loongarch.cc or whatever and then make modifications. So you end up with a lot of copies of the code, and that's not a really good way to port the thing to a new target. On the other hand, we have very little code in mold for porting to a new architecture. We have some amount of target-specific code outside of these files, but overall the amount of code is very, very small, like only a few hundred lines.
So, testing. Testing is the most important and the most difficult part of writing a linker. As you know, writing a simple linker is not really hard, because it's just a program that takes object files and combines them into a single executable or shared object file. But the thing is, there are so many edge cases, and there are hundreds of thousands of programs that use the linker. Essentially every program uses the linker. So for every corner case, there is some program out there that depends on it. So testing is very hard.
So we have two kinds of tests for mold, to ensure that I find a bug before you notice it in production use. The first is a shell-script-based test, which is very simple. I have a slide for this. So this is just a very simple test case. We actually compile code, try to link the object file with mold, and then actually execute it on the machine. And as you can see, if you have a cross compiler and QEMU, you can run this test for an architecture that's different from the one you are running on. For example, you can test SPARC64 on an x86 machine.
But apparently this test is not enough for real use cases, right? So the other test that I'm doing is to try to build all Gentoo packages with mold in a Docker container to find any bugs. The beauty of using Gentoo is that you can use the exact same command to build any package, and it can also run the unit tests that come with the package. So it's a very easy way to test whether you can build the program and whether the built program works or not. So I did that, and it takes a few days on a 64-core machine. But it works.
The thing is, it is sometimes extremely hard to debug when something goes wrong. But somehow I managed to fix all the bugs that I found this way. Well, yeah, it was a fantastic experience to fix all these bugs. But my point is that it is very important to fix all bugs before you would notice them in the wild. Because if mold didn't work out of the box for your project, the next thing you would do is just switch back to the original linker, and you would never try the mold linker again, right?
So why is mold so fast? Well, we have used multithreading, multithreaded parallelization, from the beginning. That's essentially why mold is so fast.
But the other thing is that mold is sometimes simply faster than the other linkers even in the single-threaded case, because we are using optimized data structures and code. Actually, the data structures are more important than the code. As Rob Pike once said, you write code around data structures and not the other way around. So designing the right data structures is important for making a faster program.
So here is, I think, a good visualization of how good the mold linker is at using all the cores available on the machine. On the left-hand side, LLD fails to use all the cores, but mold finishes very quickly using all the cores.
So the question would be, why do we want another linker even though we have LLD? My answer is that LLD is not as fast, first of all. The other thing is that LLD does not support GCC LTO. LLD is actually tightly coupled to a specific version of LLVM. So LLD version 15, for example, can do LTO only for LLVM 15, and of course it cannot handle any GCC LTO object files. So if you want to do GCC LTO with a faster linker, then mold is the only viable option.
So what about GNU gold? I think the problem with GNU gold is the lack of clear ownership. It looks like it's not really maintained well anymore. And the original creator of GNU gold, which is Google, has lost interest in maintaining it because they have now switched to LLD. So I think the future of GNU gold is not clear. And gold is not as fast as my linker either.
So can we improve GNU ld so that GNU ld gets as fast as my linker? My answer is no. I think that it's almost impossible to make it faster unless you rewrite everything from scratch. And if you rewrite it from scratch, that would be the same thing as what I did. And in my opinion, the source code of GNU ld is not very easy to read. The source code was written more than 30 years ago and has been maintained since then.
But people are still adding new features to GNU ld first and then porting them to other linkers, even though what they are actually using is the other linkers. I think that the situation is silly, because people do not really use GNU ld anymore for their real-world projects. So I think that it needs changing. And my question is, do we want to stay with the current GNU ld forever? My answer would be, I don't think so, since we have a good replacement. So I'm open to donating mold to the GNU project so that we can call it GNU mold, if that accelerates adoption. It's not something that I alone can decide, because it means a lot, but I'm open to that option if it makes sense.
So the missing piece for using mold as the standard linker is support for kernels and embedded programming. Userland programs are mostly fine. If you install mold as the system linker, you wouldn't notice any difference other than speed. But kernels and embedded programs need more special care about memory layout, because the hardware, for example, forces you to put some data structure or code at a very specific location in memory. And if you are programming for an MMU-less computer, then you want to lay things out the way the hardware memory is. That kind of stuff is usually handled by a linker script, as you know.
But the linker script, in my opinion, has many issues. The first is that the language doesn't have any formal specification. It only has the manual and the implementation, so other linkers are trying to mimic the behavior of GNU ld, and that of course causes compatibility issues. The other thing is that the linker script predates the ELF file format, so not all linker script commands translate directly to ELF terminology, and it causes more confusion than necessary. And I think that it is almost impossible to add linker script support without slowing down the linker. So I think that we need something better.
So this is my current approach to supporting embedded programming and kernels. I added a very simple command line option, which is called --section-order, and it specifies how to lay things out. I think that this option alone can satisfy more than 90% of the usage, but I'm pretty sure that it doesn't cover all the uses of linker scripts. So I need help from you guys. Especially in the embedded programming world, the programs are not open source, they are not available on GitHub, and they tend to be in-house programs. So I don't know what the real usage is for embedded programs. If you can tell me "I want to do this with the mold linker", then I can implement that for you. So I would appreciate it if you could give me a hint.
All right. So this is the end of my slides. Thank you very much.
So you mentioned that it's possible to do link-time optimization, with GCC for example. But in general, how easy is it to do link-time optimization inside the linker? Is it possible for the linker to disassemble some instructions and try to put something else there?
Okay, so the question is how easy it is to do something like link-time optimization, but not quite that. I don't know if I understand your question correctly, but...
It's basically optimizations during the linking.
Yeah, of course, but the thing is...
It's like LTO, but it's not done by the compiler.
So this is how LTO works in the linker. From the user's perspective, all you have to do is add -flto to the command line options for the compiler and the linker, and everything works automatically. But behind the scenes, the compiler emits intermediate code instead of actual machine code to the object file, and then the linker recognizes that intermediate code.
And then it calls the compiler back end to compile everything at once into a single object file, and then the link continues as if that gigantic single object file had been passed to the linker. So in that sense, you can do anything with the intermediate files inside the compiler back end, because the linker doesn't really care what is going on behind the scenes. So, well, does that answer your question?
Yeah. So you said that you tested mold against building all of the packages in Gentoo Linux. How long did that take? How long does one run take?
So how long does it take to test all Gentoo packages against the mold linker? It takes, if I remember correctly, three or four days on my 64-core machine with 256 gigabytes of memory. And yeah, it's a very long time, but it's definitely doable on a single beefy machine.
One target?
Only for x86-64, because in order to cross-compile everything for different architectures and run the tests, you have to do that on QEMU, which is like 100 times slower than the real performance of the computer.
Yeah. Yeah, I can't. Yeah, sorry.
What kind of mistakes did you make in LLD that you're fixing in mold? And are there any mistakes in mold that you think are interesting?
So the question is, what mistakes did I make in LLD that I fixed in mold? And did I make any other mistakes in mold? That's a good question. The first thing is that the relocation processing in LLD wasn't as good as in mold. It's complicated, it's hard to maintain, and it's slower than mold's. So I fixed it. The other thing is that LLD uses templates to support ELF64 and ELF32, big-endian and little-endian, but that's just four instances. It doesn't instantiate the templates for each target. So you cannot use the technique that I used for SPARC64, the one I showed you on the slide, for example. And did I make any mistakes in mold? Maybe not. I am pretty satisfied with the quality of mold. I'm personally enthusiastic about the readability of the code.
So I tried to make the source code as readable as a book. I don't know if I achieved that goal, but the point is that, well, yeah, it's definitely readable.
One last question.
Are there any plans to ever support any file format other than ELF?
Oh, so the question is, can you support other file formats?
Are you planning to ever do that?
Oh, do I have any plan to support anything other than ELF? Well, I did for macOS, which is a Unix-like environment, but it uses a different file format, which is called Mach-O. And I succeeded in creating a faster linker for macOS, which was much, much faster than the Apple linker. But the thing is, last year in September, they released Xcode 14 with their own new linker. So there was an ongoing effort within Apple that I wasn't aware of, and their new linker is as fast as mine. Maybe they read my source code as well, because it's available online.
But it was also GPLv3 then?
Oh, my linker is now available under the MIT license. So, yeah.
So maybe you only helped Apple.
Well, Apple haven't released their source code yet. So, okay, we have to stop. So thank you again.