Alright, so time has flown. This is already the last talk for the emulator development room today. Thanks everybody for showing up, it's a crazy turnout. Today we've got Peter, who's going to talk about a really interesting feature from Microsoft Windows. Is there a question already at the start? Let's see what's happening here. Oh, okay. Alright, so Peter has a lot of C++ experience and he can talk more about what he's going to do. So let's give him a hand.

Okay, so first of all, why am I here at FOSDEM talking about some closed source Microsoft tech? That's what you're all thinking, right? So let's address that question first. If you don't know me, one of my hobbies is hacking on LuaJIT, which is a free and open source JIT compiler for Lua. And LuaJIT recently gained support for Windows on ARM64. Or at least I thought it did, until this guy came along and was like, so do you support this *other* Windows on ARM64? And we're like, wait, what? You did what? So first I was horrified, then I was intrigued, and now I'm here speaking to all of you about what it is. So that went well. Hopefully I'm going to take you all on the same journey that I went through: figuring out what this thing is, what it does, why it does it, and whether it does what it says it should do.

Before we get into any of that: I'm talking about some Microsoft tech, and I do not work for Microsoft. I'm talking about emulating Intel on ARM, and I don't work for either Intel or ARM. If you know about LuaJIT, you might have heard of Mike Pall; that's not me. Any views herein are my own, bugs are my own. If I'm wrong, that's my fault.

Right then, let's get into things. We're going to do three broad chapters here. First, we're going to have a general look at doing emulation of Intel code on ARM. I'm going to get really bored saying 64 all the time.
So when I say Intel, I mean x64 code, and when I say ARM, I mean ARM64 code, because otherwise all the 64s are going to get way too much. Then we'll look at this ARM64EC thing in particular, and then a bit of time on how LuaJIT ported to this thing, whether that worked, and how it worked.

So, emulation 101. You take Intel instructions like that one there, you turn them into ARM instructions like those three there, and you just do this for every single instruction that you find. How hard can this be? Well, we've got this entire room to talk about doing this. One Intel instruction may become several, because Intel instructions are often more complex than ARM ones. If you're not familiar with assembly code, the square brackets here are memory loads or memory stores; in this case they're all loads, but they could also be stores. I mention this because memory is complicated. Memory is what makes this more complex than it might look.

Here are some of the things that I forgot to mention on the first slide. Let's start with memory ordering. If you have several threads that are all trying to work with memory at the same time, you can do cross-thread communication through memory, which on Intel mostly just works. Intel gives you very nice memory ordering properties, so you don't need memory barriers all that often. Whereas on ARM, if you want to do cross-thread memory stuff, you will want some barriers in there to make it work. So if you are trying to emulate Intel code on ARM, you need to insert extra barriers that weren't there, otherwise you're going to introduce concurrency bugs that weren't there. The annoying part is that most memory operations aren't doing cross-thread stuff, but if you're an emulator, you don't know which instructions need the barriers and which ones don't, so you have to throw in the expensive memory barriers for almost every load and store, which is going to slow you right down.
The middle question mark is saying: memory is not just a big array of bytes. Memory is carved up into pages, and those pages can have protections on them and other stuff, or they might be mapped to a PCIe device rather than going to RAM. So you've got a question like: do you emulate an MMU and a bunch of devices on it, or do you just pass it off to the host and let the host do whatever it would do?

The final question mark here is flags. If that doesn't yet mean anything to you, good for you. Let's get to the flags next, because flags are a pain. Most Intel instructions, when you run them, will give you the main result that you're trying to get, but they will also give you this set of six flags. Meanwhile, on ARM, some instructions will give you flags, and those that do only give you a set of four flags. I'm not a mathematician, but four is less than six, right? We've got a slight problem here. The question is, can we emulate the two that we don't have?

Let's just run through all the flags; this is a quick summary of what they are. We've got Z or ZF, just telling you whether the result of your main computation was zero or not. SF or N is telling you whether it was negative or not. Then we get to PF. Now, PF is great. Intel added PF in 1972 to give you the parity of the low eight bits of whatever it is that you were computing, because back in 1972 you wanted a one-bit checksum for doing modems and stuff. Intel being Intel, they've kept it ever since. You can emulate this thing on ARM; you just need to do a popcount of the low eight bits of whatever it is that you computed. If you know ARM assembly, you'll be like, wait a minute, there is no popcount instruction for general purpose registers. We'll just gloss over that one. Then you've got the overflow flag, OF or V, which tells you whether any overflow happened during your computation. Useful for doing checked arithmetic and stuff.
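A small sketch of emulating PF (my own illustration, not any particular emulator's code): PF is set when the low 8 bits of a result contain an even number of 1 bits. Since ARM has no integer popcount, a real emulator would fold the bits together with XORs (or bounce through a NEON register); both variants are shown.

```c
#include <stdint.h>

/* x86 PF: set when the low 8 bits of the result have EVEN parity.
 * Variant 1 uses the compiler builtin for clarity. */
static int x86_parity_flag(uint64_t result) {
    return (__builtin_popcount((unsigned)(result & 0xFF)) & 1) == 0;
}

/* Variant 2: the XOR-fold that maps directly onto plain ARM integer
 * instructions, no popcount needed. Each step halves the bits that
 * still matter; the final low bit is the parity. */
static int x86_parity_flag_fold(uint64_t result) {
    uint32_t p = (uint32_t)(result & 0xFF);
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    return (~p) & 1; /* 1 = even parity, matching PF */
}
```

For example, a result of 0x03 has two set bits in its low byte, so PF is 1; a result of 0x01 has one set bit, so PF is 0.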
Then we've got CF and C, which is an extra carry bit in or out of your addition or subtraction. The fun point here is that there are two possible meanings for this flag in subtraction, and guess what: Intel chose one meaning, ARM chose the other. So if you're trying to emulate one on the other, you often have to flip the value of this flag to make them match up. Thankfully, ARM in ARMv8.4 added an instruction called CFINV for flipping the value of that flag, added precisely to make this kind of emulation easier.

The final flag ARM doesn't have is AF, on the right there. AF is for if you're doing binary coded decimal arithmetic. If you've never done any of that, again, good for you. Intel thought, back when they made these chips in the 70s, that BCD was a thing people did, and to make it fast they added this extra flag that gives you the carry bit out of the low four bits of your computation, because BCD uses groups of four bits. You can emulate the AF flag if you need to.

So we're doing a bunch of extra work to compute these things that we'd rather not do. A good emulator will try to work out when it doesn't have to compute anything at all, or whether it can defer the problem and hope that you don't actually need the answers. But if you do have to compute them, there's extra work to do here, which will slow you down. That's flags, quickly.

Next up, there are a bunch of existing solutions for doing emulation of Intel on ARM. QEMU we've heard quite a lot about here. There are two flavors of QEMU, system mode and user mode, which boils down to: system mode will emulate an MMU and a bunch of devices on it, whereas user mode won't, that's pushed off to the host. Therefore QEMU user is much faster, but can't emulate as many things. There are a bunch of other open source solutions in the middle here, starting with Justine Tunney's Blink, which, if you've not seen it, is part of a portable executable project for emulating Intel on anything.
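Here's a tiny illustration (again my own, for an 8-bit subtraction `a - b`) of the two quirks just described: the inverted carry meaning that CFINV exists to fix, and AF as the carry out of the low four bits, using the standard XOR identity.

```c
#include <stdint.h>

/* For a - b:
 *  - x86 CF means "a borrow happened" (a < b),
 *  - ARM  C  means "no borrow happened" (a >= b),
 * so the two are always inverses; CFINV (ARMv8.4) flips C in one
 * instruction to bridge the gap. */
static int x86_sub_cf(uint8_t a, uint8_t b) { return a < b; }
static int arm_sub_c (uint8_t a, uint8_t b) { return a >= b; }

/* x86 AF: carry out of bit 3, i.e. out of the low BCD nibble.
 * The XOR of the operands and the result exposes exactly the
 * positions where a carry/borrow crossed. */
static int x86_sub_af(uint8_t a, uint8_t b) {
    uint8_t r = (uint8_t)(a - b);
    return ((a ^ b ^ r) & 0x10) != 0;
}
```

For instance, 0x10 - 0x01 borrows out of the low nibble (AF = 1), while 0x12 - 0x01 does not (AF = 0).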
Her take is like, you know, we don't need a JVM with portable bytecode, just use Intel code as the portable... anyway, it does actually work. There's FEX-Emu, which I'm not overly familiar with, but I think they're trying to be like QEMU user, but faster, by only doing emulation of certain things on certain other things, basically only Intel on ARM, whereas QEMU does everything on everything. Box64 I wanted to mention because they pull this cute trick of spotting when you're trying to emulate a library that they've heard of. They're like, yeah, that's libc, that's SDL, we won't emulate it, we'll just swap it out for our native version of it, which makes things faster because you don't have to emulate as much code.

Obviously, the other big one that's not open source is Apple's Rosetta 2, which cheats by solving things in hardware. So, you know, this slide again: Apple solved this problem in hardware, this problem in hardware, this problem in hardware. They cheat by adding extra hardware to their chips, and that makes their emulation extremely fast. Good for them, less good for other people. So Apple can make a very appealing pitch to their developers, which is: you can keep your Intel code and it'll still run fast on our custom hardware, or you can port it to ARM code and it'll run even faster. Apple will port all of their first party code, and the programmers in their ecosystem will do what they are told: Apple says you port your code, and they will port their code. You know, the trade-off of working in an Apple-type ecosystem.

Meanwhile, Microsoft have a far harder time. Like, you can target Intel, but it'll be dog-slow. Okay, not good. Or you can port your code to ARM, but you can't if you've got something like a closed source library or plugins as part of your program.
And this being Microsoft's ecosystem, of course there are closed source libraries and plugins. So, yeah, Microsoft are in a really hard place here. And when I say slow, I mean slow. To give you an idea, I took the LuaJIT benchmark suite and ran it on this Mac here as ARM code: 33 seconds. Fine. Compiled it as Intel code and ran it under Rosetta 2: 44 seconds. I mean, not great, but it's not a massive slowdown; you can live with that. Then I ran a Windows VM on this thing, and the ARM version took 37 seconds, which is a little bit slower. I'm not sure whether that's the VM or Windows slowing it down, or because Windows is running with 4K pages rather than 16K pages, but it's the same kind of ballpark. Then take the Intel version and run it under Windows' emulation: 106 seconds. Yeah, this is not good.

So, you know, you are someone at Microsoft. Option one, emulate the Intel code: too slow. Option two, port it to ARM: possibly impossible. At this point, I like to imagine some mad scientist at Microsoft going: so, can we take two bad options, blend them together, and get a good option? Which, when you put it like that, seems implausible, but it turns out to actually work, surprisingly.

So, that gets us to part two: what is ARM64EC? It's this crazy idea to get out of this awkward spot. Which is: let's let you port *part* of your application to ARM code. If there are any Intel bits that you can't port, because they're plugins or closed source, sure, leave them as Intel code, but the stuff that you can port, you should port, and you can mix them all up together and cheaply interop between the two parts. And, you know, this is ARM64EC. The ARM code is compatible with the emulated Intel code in a way that should hopefully work. Hopefully. So, that's the big plan. But what does this mean? How do we actually do this?
We're going to have to share the virtual address space between the Intel parts and the ARM parts, okay? We're going to need to share data structure layouts between the Intel parts and the ARM parts, okay? We're going to need to share call stacks between the two, fair enough. We're going to make things a little bit simpler by saying we can only switch between Intel and ARM when you make a function call, or you return from a function call, or when you throw from a function and catch it higher up, which is, you know, painful. And we're going to have to adjust how we do function calls a little bit to make this work, but ideally not too much. So, we're going to delve into each of these points in turn.

A shared virtual address space means you have all of your address space, and there's executable code in there, and you have to know, for any piece of executable code, is it ARM code or is it Intel code? So we need an extra bit on every page to tell you which one it is. I mentioned cross-thread communication earlier. Obviously, our address space can have several threads in it, all trying to talk to each other, and any Intel code running under emulation still needs all of those extra barriers to be put in by the emulator, which will keep it slow but keep it correct. Meanwhile, for the ARM code, Microsoft thought: just let the programmer that's doing the port put in the barriers where they have to be. Which solves that problem at the cost of, you know, the programmer has to actually think. You can't just recompile without changing your code and hope it works.

So, all of that is kind of fine. Let's go to shared data structure layouts. Now, this starts off looking fairly simple. We say, here is some kind of data structure, let's make it compatible between Intel and ARM.
Obviously, we can't change the Intel code; the whole point is we're running pre-existing Intel code under emulation. So, in ARM64EC mode, all of these types need the same size and alignment as on Intel. Longs are four bytes, because Windows; doubles are eight bytes; pointers are eight bytes, fine; function pointers, again, eight bytes. And this is why we needed an extra bit on every page to tell you whether it's Intel code or ARM code: you might think, just put it in the function pointer, but there's no space. You'd have to make function pointers one bit bigger to say whether they are Intel or ARM pointers, which we can't do. So we have to put that bit on the page. But, you know, this all looks fine so far.

Things get more interesting, though. If you're a C programmer, you'll know about setjmp and longjmp, which are C's equivalents to throw and catch. And there's this structure called jmp_buf that tells you, when I catch, this is the CPU state to go back to. And you can pass the jmp_buf around: you'll set it over here and use it over there. And in particular, in ARM64EC, you can do a longjmp from Intel code to ARM code, or from ARM code to Intel code. So this jmp_buf guy has to be compatible between the two. As I said, jmp_buf contains the CPU state that you want to go back to. But Intel CPUs and ARM CPUs have different amounts of CPU state. So that's going to be fun. To make it even worse, there's this Windows structure called CONTEXT, in all caps, that contains the entire CPU state for a particular thread, which again you can pass around and do things with. And, yeah, that guy has to be the same size on ARM64EC as it is on Intel, despite there being a different amount of CPU state. So, this is starting to look a little hairy.
So, what is all of the CPU state that we have to fit in to make these data structures compatible? We want a quick table here of the user-visible CPU state on Intel and on ARM: ARM in one column, Intel in the other. I'm going to go through row by row, quickly.

General purpose registers to start with: Intel has 16 of them, ARM has 32. You will notice we can't fit 32 into 16. This is going to be a slight problem. The next row is not so bad; it's a bunch of weird edge cases. We've got RIP and PC, which are basically the same thing, and the flags registers, same deal. The two floating point control and status registers on ARM we can fit into MXCSR on Intel, so that much is fine. And we've got the spare GS thing, which we'll come back to later. The next row is our floating point or vector registers. Again, Intel has 16, ARM has 32, and 32 is more than 16, so again, a problem there. The bright sparks in the audience might say: doesn't modern Intel, with AVX2 and AVX-512, have far more registers, and far larger ones? Yes, but emulators can't use AVX or AVX2 or AVX-512 because of patents. So we're stuck with the old 16 of them, at only 128 bits wide. The final row is interesting, because Intel way back added the x87 stack, which is eight 80-bit floating point registers. ARM has no such thing, because it's this old, weird legacy feature, but it's actually really good for us.

So, our question is: how do we fit all of the ARM column into the Intel column? Let's start with the floating point registers, and we'll say: okay, let's pretend ARM only has 16 of them. Problem solved, right? If you're writing ARM64EC code, you cannot use the high 16 of these guys. It'll come at a performance cost, but it'll make things work. The other row we had a problem with was the first row, where we'd like to be not quite as extreme as throwing away half of them.
So, we've got 16 that we can fit over here, one can fit in GS, and then 10 can fit down there in the x87 area. So 16 plus 1 plus 10 means we can fit 27 of these guys in here somewhere. It works! But we are still down 5. So there'll be 5 general purpose registers that you cannot use. Microsoft said: okay, you just can't use x13, x14, x23, x24 or x28, in addition to the 16 vector registers that you cannot use. So this is the cost of making your data structures compatible between the two, and it seems like a fairly high cost, but, you know, such is life.

Moving on, we are sharing our call stacks between Intel and ARM. If you're familiar with Intel and ARM, your first point will be: doesn't ARM put the return address in a register, whereas Intel puts it on the stack? Yes, we're going to have to fix that one up. The problem that you might not have noticed is that ARM requires your stack pointer to be 16-byte aligned when you use it for a load or a store, whereas Intel merely recommends this very strongly, but doesn't actually check for it. So you can very happily run with an only 8-byte-aligned stack for a very long time and not notice that you've done anything wrong. We're going to have to fix that one up too. A bit of work required to make these things work, but we can understand what that work is.

And then we get to the actual meaty part: how do we switch between these two modes? We've made these things compatible-ish, or we've understood how to make them compatible, but how do we actually switch between Intel and ARM? If you're used to assembly, you'll know what a calling convention is, which is: when we make a function call, where do we put the arguments for that call? Which registers contain what, when, and what do you put on the stack, when? And you can read these long documents from ARM, or from other people, about how to do this.
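As an aside, the stack-alignment fix mentioned above is mechanically simple. This toy C model (my own illustration; the struct and names are invented, though x4 is the register the scheme really uses for this) shows the idea: round the stack pointer down to 16 bytes on entry to ARM code, keeping the original value so stack-passed arguments can still be found.

```c
#include <stdint.h>

/* Toy model of "forcibly realign the stack pointer": Intel code may have
 * left SP only 8-byte aligned, which ARM loads/stores through SP won't
 * tolerate. On the way into ARM code, the old SP is stashed (in x4 in
 * this scheme) so stack-passed arguments can still be located, and SP
 * itself is rounded down to a 16-byte boundary. */
typedef struct { uint64_t sp, x4; } cpu_sketch;

static void realign_sp(cpu_sketch *cpu) {
    cpu->x4 = cpu->sp;          /* where the arguments really live */
    cpu->sp &= ~(uint64_t)15;   /* round down to 16-byte alignment */
}
```

An 8-byte-aligned value like 0x7fff1238 becomes 0x7fff1230, while an already-aligned SP is untouched; either way x4 remembers the original.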
And there's a set of these rules for ARM, and a set of these rules for Intel. We don't want to change those rules too much, because they mostly work, but they're not the same rules: you have to put things in different places between Intel and ARM. So we have to do some work to fix that up. And the work that you have to do depends on the types of the arguments and the type of the return value of your function. So we're going to need some kind of code for doing this work, and this code has to live on the ARM side of things, because, again, we're trying to run Intel code that doesn't know it's running under emulation, so we can't change it; we can't add extra stuff in there. We have to add the extra stuff on the ARM side.

Which means, if you're writing assembly code on ARM64EC and asking, how do I do a function call? This is how you do a function call. Step one, as you would normally, put the arguments where they would be for a normal ARM call. And then: am I calling Intel, or am I calling ARM? On the left-hand side, we've got the "am I calling ARM" column, and it again works like a normal ARM call, other than this mystery box about exit thunks in x10 that I'm going to gloss over for now and come back to later. But other than that box, the left-hand column is a fairly normal function call on ARM: you put the arguments where they're meant to be, you call the function, you get the results back from where they normally are.

The weird case is the Intel case on the right here, where we do some other things. We put the function that we want to call in the x9 register, it has to be x9, and then we call an exit thunk. You're going to say, Peter, what is an exit thunk? And I will get to that in just a bit, but I want to address a different point first, which is: this code has a branch in it, right? Everyone prefers straight-line code to branchy code. But we can get rid of these branches, mostly.
You know, we hoist both of these two steps up there, and then combine both of the calls. Because in this row, we're going to do a function call; we just don't know where to yet. So we do some kind of conditional move to pick where we want to call, and then we can make this whole lot straight-line code. At which point, it'll look like this. The first box is the same. The second box is: we've pulled up both of the previous steps and just done both of them. The middle step is calling this magical mystery function from Microsoft. Then you do a call to somewhere. And this last box is the same as previously. And if you're wondering what this magical mystery function from Microsoft does: it turns this side back into this side. So, if you're reading assembly code, this is what you will see, but that is what it does.

Now I'll get to the earlier question: what are these exit thunks? They fill in the gap; they do the extra bits needed to transition out of ARM mode. We have to take the arguments that we carefully put in their ARM places, take them out of their ARM places, and put them where they should be for an Intel-style call. Which is a bunch of work, but it's fine. Then we ensure that the function that we want to call is still in x9, and then we call the next magical mystery function from Microsoft. And we have to do it in a special way: we have to put the address of this function in x16 and then call x16. Which is going to seem weird, but we'll see why in a bit. And once the magical mystery function comes back, we take the results from where they would live in the Intel world, pull them out of there, put them where they would be for the ARM world, and then we return as normal.

So, okay. Next up, let's look at this magical mystery function. Which is this guy. First box in the top left.
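The argument-shuffling part of an exit thunk can be modeled in a few lines of C. This is a toy for one specific signature, `int f(int, int)` — the structs and function names are invented, and a real thunk is assembly that ends by calling the dispatcher with the target in x9 — but the moves themselves match the two calling conventions: first and second integer arguments live in x0/x1 on ARM and rcx/rdx in the Microsoft x64 convention, and the result comes back in rax versus x0.

```c
#include <stdint.h>

/* Toy register files; a real thunk manipulates the actual registers. */
typedef struct { uint64_t x0, x1, x9; } arm_regs;
typedef struct { uint64_t rcx, rdx, rax; } x64_regs;

/* Outbound half of an exit thunk for `int f(int, int)`: move arguments
 * from their ARM homes (x0, x1) to their x64 homes (rcx, rdx). */
static void exit_thunk_args_int_int(const arm_regs *in, x64_regs *out) {
    out->rcx = in->x0;
    out->rdx = in->x1;
}

/* Inbound half: after the emulated call returns, move the result from
 * its x64 home (rax) back to its ARM home (x0). */
static void exit_thunk_result(arm_regs *out, const x64_regs *in) {
    out->x0 = in->rax;
}
```

Note how nothing here depends on what the callee *does*, only on its signature — which is exactly why, as the talk says later, thunks can be shared between functions of the same type.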
I mentioned that ARM puts the return address in a register, whereas Intel puts it on the stack; this first box is fixing up that problem. Then the rest of the left-hand column is your usual loop for emulating a CPU: we get the next instruction, we do it somehow, then we move to the next instruction and do that one. In practice, there's going to be far more complex logic in here, optimized stuff: a JIT compiler or AOT compiler or all sorts of clever things. But as far as we're concerned, this describes what it does.

At some point, it'll say: wait a minute, you're now asking me to go back to ARM mode, because I've found code that's no longer Intel code. We're doing some kind of mode switch. Now, I said earlier that mode switches only happen at function call or function return. Okay, so if we've now gone from Intel to ARM, this is either a call or a return. How do we know which? The cheeky part is that we look at the four bytes just before where we're about to start running and ask: is this a `call x16` instruction? Why is that the question? Because we had to call this magical mystery function with a call through x16, which means, if we've just found that, we've just come back from the call that we were doing: we're in a function-return situation. We set the return pointer to the code we want to run, and we go to it.

The final column is when we're doing a function call, because the four bytes before where we're going are anything else. Then we need to fix up the opposite of the earlier problem of where your return address wants to be: on the stack or in a register? So we fix that up. And then we set x4 to be the stack pointer. Why do we do that? Because the next step is to forcibly realign the stack pointer.
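The call-versus-return check described above can be sketched in C. The one hard fact here is the instruction encoding: `blr x16` encodes as the 32-bit word 0xD63F0200 on ARM64. The helper name and the way the target is passed around are my own illustration.

```c
#include <stdint.h>

/* `blr x16` on ARM64: base BLR opcode 0xD63F0000 with Rn=16 (16 << 5). */
#define BLR_X16 0xD63F0200u

/* Sketch of the dispatcher's heuristic when emulated Intel code branches
 * to ARM code: peek at the instruction immediately BEFORE the branch
 * target. If it is `blr x16`, the target is the instruction after a call
 * into the dispatcher, so the Intel side must be RETURNING from a call we
 * made earlier; anything else means it is CALLING a new function. */
static int looks_like_return(const uint32_t *target) {
    return target[-1] == BLR_X16;
}
```

This works precisely because the ABI forces every transition into the emulator to go through a `blr x16`: the convention is what makes the four bytes before a return target predictable.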
Because, remember that point where I said Intel code doesn't care about your alignment the way ARM does? This is where we fix that problem. And then we tail-call x9's alternative entry point. Remember, x9 being the thing that we've inferred is the function call target. So we do a call to *almost* that function. Again, you're going to say, Peter, what are these alternative entry points? So that's next.

Every ARM function that could be called from Intel needs a so-called alternative entry point for handling when it is called from Intel, and it does all of the gubbins that have to be done to make this transition work. The only question is how you find this alternative entry point, which is: you put the offset of it in the four bytes before the start of the function. Which is handy, because we already had to read those four bytes to check whether they were that `call x16` guy. So if they're not that guy, then they are the offset of this guy.

And what is in one of these guys? Ignore the right-hand column for now and look at just the left-hand column, which is mostly the opposite of what we saw earlier. We have to take the arguments from where they are in Intel land, pull them out of there, put them where they should be for ARM land, call the real body of the function, then take the results out and put them where they should be for Intel, and then call the next magical mystery function. The only interesting part here is the first box, where we're saying: if there are arguments that come off the stack, we can't read them from the stack pointer, we have to read them from x4. Why? Because of that forcible realigning of SP: you can no longer read your arguments relative to SP, because we might have changed it to realign it, but x4 tells you where they used to be. So that's fine. The interesting point is that the logic on this slide only depends upon the types of the arguments and the type of the return.
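The entry-point lookup itself is tiny. This sketch follows the talk's description only: the four bytes before the function hold the offset of the alternative entry point. I'm assuming for illustration that the offset is relative to the function's own address, and the real scheme also packs flag bits into this word, so treat this as a simplified picture.

```c
#include <stdint.h>
#include <string.h>

/* Simplified sketch: given a pointer to an ARM64EC function's first
 * instruction, read the signed 32-bit offset stored in the 4 bytes just
 * before it (the dispatcher has already ruled out those bytes being a
 * `blr x16`) and compute the alternative entry point from it.
 * Assumption for illustration: offset is relative to the function. */
static void *alt_entry(uint8_t *func) {
    int32_t off;
    memcpy(&off, func - 4, sizeof off); /* dword before the function */
    return func + off;
}
```

The neatness the talk points out: the dispatcher already had to read these four bytes for the call-versus-return check, so finding the alternative entry point costs nothing extra.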
It doesn't actually care about what the function actually does, and therefore you can share these guys between multiple functions if those functions have the same type, which is good: code sharing keeps your memory usage down, your icache is happier, and so on. But if you want to, you could write one of these per function, at which point the right-hand column becomes an option you could take, and then you can skip the calling and the other bits and just put the body of your function in there, if you so wish.

Okay, so the next magical mystery function is this guy. Don't worry, you've seen most of this slide previously. It's all the same slide except for the first box in the top left, for which you're going to ask: what does this box in the top left do? What is the value of LR that we have up there? If you trace all the stuff through, you'll see it's the same LR that we popped over on this side, which was the return address that we popped off the stack because we think the Intel code just made a function call. At which point, what we're putting back in the top left is the return address to go back into Intel mode. So it all works out. And you've seen the rest of this slide previously, at which point we have run out of magical mystery functions. The hard part is over; it all kind of works out.

So, that tells you, roughly and very quickly, what ARM64EC is and what the code for it looks like. Next up, the LuaJIT part: how did I make LuaJIT work with this thing? If you know LuaJIT, it's written in a mixture of assembly and C, and notably the interpreter is several thousand lines of assembly code, which is, you know, fun. Porting that assembly code to ARM64EC means that we can no longer use the registers that we said we couldn't use; they don't fit in the CONTEXT structure. So we lose v16 to v31. That's fine, we didn't use them to start with.
x13, x14: again, didn't use them to start with, not a problem. Unfortunately we did use x23 and x24 for various things, but because of what they were used for, they could be reworked away with some almost zero cost tricks, so that wasn't too much of a pain. Losing x28 was more annoying; that required extra loads and stores to spill things. In this regard, the JIT compiler was actually easier to port than the interpreter, because the JIT compiler can already be told not to use certain registers, so you just add some entries to the list of what it can't use, and it'll then just not use them. Again, there will be some perf cost to not using them, but it wasn't hard on the porting side.

Next up is handling these mode switches. The C compiler will do most of the work for all of the C parts of LuaJIT, but it won't handle the assembly parts. There are three parts, therefore, that it doesn't really handle. One is the interpreter opcode for calling Lua API C functions. That one's fairly simple: it's only one place, it can only call one type of function, and the type of that function is super simple, so that one's fine. Harder is the FFI. If you're not familiar with it, the FFI is LuaJIT's foreign function interface, and it lets Lua code call C functions of any type. Whatever type you want, it'll call it for you and just make it work. And you can also JIT-compile your FFI calls, in most cases. I say with simple types, but most types are simple, so you can JIT-compile most of them. We also support FFI callbacks, where you can take a Lua function and make it into a C function, and then C code can call you. So again, that's Intel code trying to call ARM code that you created from your Lua code; you have to make that work too. That one's actually not too bad. Ten minutes? Great, thank you.
So the hard part is these two, just because they involve arbitrary types of function. This is what I made LuaJIT do for interpreted FFI calls. The good thing about FFI calls is that they are one-shot calls: you give the FFI a pointer to call, and the type to use to call it, and it'll go and do it, once. Because you're doing it once, you can look at the thing that you're trying to call and ask: is this ARM code or is it Intel code? And just do the right thing for whatever it is. Which gives you this nice, simple diagram. There is a slight problem with this diagram, though. This is what we're meant to do, this is our slide from previously, and this is what we're actually doing. The right-hand side is fine, we've just inlined the exit thunk. The left-hand side has a slight problem. I skipped over this box a while ago, saying we'll just forget about that box, and you'll notice it's missing on this side, which means certain things won't work. The question is: what doesn't work?

So this is why I now have to tell you why this weird box is there on the left, about putting a thing in x10 when there's no obvious reason why you need to do it. Let's answer that question. If we are making a function call, and we are ARM code, and we might call Intel code, then we will need an exit thunk. Hopefully we've now covered what those do and why you need them, and things of that kind. If you want to do a function call, you need an exit thunk, and to know which one to use, you have to know the type of the function that you want to call. Now, there's a particular subset of functions that don't know the type of the thing that they want to call. You might say that's kind of weird, but let's just run with it for a while.
And furthermore, these functions don't know their own type — also weird — but what they do know is that their own type matches the type of the thing that they want to call. Now this may sound like a somewhat contrived set of properties, but it does actually crop up enough in practice that it's worth caring about. So to let these weird typeless functions, that don't know their own type and don't know what they're calling, work, we give them an exit thunk in x10. So if they are ARM code, they can just run and say: well, whoever called us put the appropriate thing in x10, and that will let us do the call that we want to do. So that means that if we end up calling one of those functions, and it then wants to call an Intel function, then this isn't actually going to work. But in practice it's actually fine; it's not yet been a problem. It could be fixed, but it's going to be a pain to fix. You might ask: why is it going to be a pain to fix? That's because the FFI can call any type of function, so we can't just pre-prepare an appropriate exit thunk for every single type of function — that's going to be way too many functions. We'd have to JIT-compile the thunk that we want to use. And, I mean, can we just not do that? Yeah, I'd just rather not do that yet. So I've skipped it. But it works, so, you know, great. That was interpreted FFI calls. Then we've got JIT-compiled FFI calls. They're different because you will JIT-compile your call once, but then run it multiple times. So if we are JIT-compiling a call through a function pointer, we don't know whether that function pointer will be Intel or will be ARM. So we have to do what we're meant to do more closely — or almost do what we're meant to do. So we prepare the arguments as if it were an ARM call. If it ends up going to Intel, then we'll use an exit thunk to fix it up. We do the prep work that we're meant to do for the magical mystery function.
But again, I didn't want to JIT-compile an exit thunk for every possible type of thing that we might call, because we're already JIT-compiling a function; we don't want to JIT-compile another function in addition to that at the same time. It just gets kind of hairy. So again, I cheated a bit and said: let's just write one thunk that can handle every case that can get JIT-compiled, pass it the signature that it has to pretend to be, and just put that in some other register. And again, this will work fine in practice, unless we hit the case of calling one of these typeless functions that doesn't know its own type, and it wants to call Intel, and it happens to trash x15, which I've used to stash this extra piece of state in. So again, not quite following the rules, but it works fine in practice. And then the slide that you've possibly all been waiting for: does this whole thing work? You'll recall the first two lines from earlier. We said native ARM code ran in 37 seconds, and the Intel code running under emulation took 106, whereas the ARM64EC code takes 38, which is pretty good, right? So what we're saying here is: it's native ARM code, so it should be close to 37 seconds, but it's making accommodations such that it could call Intel code as and when it needs to. And making those accommodations will slow you down by a few percentage points, but you're in a much better place than you would otherwise be. And yeah, this crazy idea of Microsoft's actually works. I can do one more slide, or questions. Which do you want? One more slide? Okay, great. So: problems you didn't know that you had. Linux has LD_PRELOAD, which, if you've used it, lets you say: change the malloc that I call, or make fsync not slow. LD_PRELOAD — great. macOS has DYLD_INSERT_LIBRARIES — same thing, not quite the same details, but the same idea. Windows doesn't have such a thing.
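The "one generic exit thunk plus a signature in a register" trick can be made concrete with a toy model. This is illustrative only, not LuaJIT's code: the signature is a descriptor string ('i' = integer argument, 'f' = floating-point argument), and the generic thunk uses it to work out how each argument must be re-marshalled for the emulated x64 callee — on x64 Windows the first four arguments travel in registers by position, and everything after that spills to the stack.

```c
#include <assert.h>
#include <stddef.h>

/* What the single generic exit thunk needs to know to re-marshal a call:
   how many arguments land in integer registers, in SIMD/FP registers,
   and on the stack under the x64 Windows convention. */
typedef struct {
    int int_regs;
    int float_regs;
    int stack_args;
} marshal_plan;

static marshal_plan plan_for_signature(const char *sig)
{
    marshal_plan p = {0, 0, 0};
    for (const char *c = sig; *c; c++) {
        /* x64 Windows is positional: args 0..3 go in registers
           (RCX/RDX/R8/R9 or XMM0..XMM3), the rest on the stack. */
        if ((size_t)(c - sig) >= 4)
            p.stack_args++;
        else if (*c == 'f')
            p.float_regs++;
        else
            p.int_regs++;
    }
    return p;
}
```

The payoff is that one routine plus a small descriptor replaces a JIT-compiled thunk per signature — exactly the trade described above, at the cost of interpreting the descriptor on each cross-mode call.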
It has ad-hoc machine code patching. Yeah. And as a bonus point, Microsoft Research used to sell a product called Detours for doing this — possibly Microsoft Research's only consumer-facing product, unsure. They made that open source on GitHub in about 2016, so you can go and find Detours on GitHub and it will do this for you. So you may have code lying around in your Intel code that expects to be able to go into other functions and patch them up. To make this work, we have to take our functions and wrap them in a small Intel shim, so that if you look at the shim, you go: yeah, that's Intel code, I'll just patch that for you. And that's fun, right? One of these magical mystery functions can spot these shims and skip right over them. But yeah, those shims have to be there to make this thing work. This shouldn't be a problem in the first place, but it is, because Windows doesn't have any of those mechanisms. Bonus fun. Let's get back to a world where we don't have to worry about the bonus problems. Okay. Great. Thank you, Peter. We have time — all right, let's do some questions. I'm going to start with one from online, so that we don't forget it. Can Intel code call ARM code? Oh yes, quick yes. Yes. Hands? This is the loop: am I now trying to call ARM code? No — so I'm going to call Intel code. No, I am trying to call ARM code — so we go over here and we go through the path for calling ARM code. Yep, it all works. All right, one more time, hands, because I wasn't paying attention. I'll start here. How do you decide which code you can compile to ARM and which parts of the code you cannot and have to leave as Intel? For the LuaJIT case it's fairly simple, because there's already an ARM version of LuaJIT. If you're going to port your own program, the advice is: start with the hot parts and port those first. If that works, then you can slowly port more and more.
You can get incremental speed improvements as you port more and more code. Over. Next question — close by. Okay. Hi, very nice presentation, thank you. Hello. Okay, yeah, thank you very much, it was a very nice presentation. I was just curious what your experience is with the tooling support for this. What support? Like tooling support for this ABI — the debuggers, compilers — what the support is like, if it's easy to use. So yeah, the Microsoft C compiler can handle all of this fine. I think Clang and LLVM are getting a few patches slowly, but it's going to be a while. The Visual Studio debugger for this stuff is great. You can single-step from ARM code into Intel code and not even notice that you've done a mode switch, which was kind of scary. Like: okay, single step, single step, single step — wait, what? I'm now in ARM code. Okay, fine. So yeah, the Microsoft tooling is very good; the open source tooling is not really there yet. So — maybe I've missed it, but what I don't quite understand is: what I see here is that the ARM64 ABI has been changed to match the Intel ABI a little bit more, right, to make this work? Yep. So how does that work when calling ARM64 Windows API functions? Do they have ARM64EC versions of all of them? Yep. Wow. Yep. Yes, I have another question. It's a bit related to the question that was just asked, about toolchains. Do you know other open source toolchains that support ARM64EC, like GCC, or maybe other JIT compilers? That's my first question. Yeah, I've seen some patches land in Wine and in Clang and LLVM, but I suspect they're all kind of starting to do things rather than full support. Okay, another question. Maybe I'm not sure I understood, but: so you have LuaJIT users that want to call — do FFI, basically — with x64 code? So that's basically why you implemented this? Yeah, yes — most of your program is in Lua. Thank you. Any more questions?
Oh yeah, of course. Just — I think I didn't get it: so you reduced the number of ARM registers, but wouldn't it be possible to spill them to memory when you do the mode switch? Here's my cue — I'm going to run around. Here you go. Yeah, so you can't spill them, because you don't have anywhere to spill them to. If it was only the operating system that did mode switches between threads, you'd be fine. But you can call setjmp and longjmp, and there's not space in the jmp_buf to put the extra things. Or, if you're really adventurous, you can do user-space scheduling on Windows: you can call SuspendThread and then ResumeThread and move your contexts between threads. And you could have Intel threads doing this to your ARM threads — the Intel threads don't know that they're doing this to ARM threads. So you don't have any extra space to put the ARM state, because they didn't know that they'd need that extra space. Yeah, I'm going to be running. We have somebody all the way in the back who's been waiting for a long time. Sorry, I didn't see you — I'm going to have to run back. I sent a question to you earlier: how do you deal with the red zone? I was like, why is he not answering? So, short answer: Windows doesn't have a red zone on either Intel or ARM, so that's mostly fine. There is a related concept of home space for the first four integer registers in an Intel call, and yeah, you have to handle that. So when you're doing your marshalling and re-marshalling of arguments, you need to leave space for the home space, as you would for a normal Intel call. So yeah, there is no red zone, but the closest equivalent thing — yes, you have to handle. Are there more questions? Oh, great. How long did this take you to figure out? Probably not very long. I mean, the documentation is pretty good on the Microsoft side. So possibly a week or two, probably. One more over there.
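The home-space answer above comes down to a little arithmetic. On x64 Windows the caller always reserves 32 bytes (4 × 8) of home space below the return address, even for functions with fewer than four arguments, and any argument past the fourth goes on the stack above it in 8-byte slots. A sketch of the computation a marshaller would do — illustrative only, ignoring the 16-byte alignment of the final stack pointer and the return address itself:

```c
#include <assert.h>

/* Bytes of stack a caller must reserve for an x64 Windows call:
   32 bytes of home (shadow) space always, plus 8 bytes for each
   argument past the fourth. */
static unsigned callee_stack_bytes(unsigned nargs)
{
    unsigned stack_args = nargs > 4 ? nargs - 4 : 0;
    return 32 + 8 * stack_args;
}
```

So even a zero-argument call reserves 32 bytes, which is exactly the space an ARM-side marshaller has to remember to leave when it lays out arguments for an emulated Intel callee.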
So is there any way to call, let's say, regular closed-source ARM64 Windows components, or is it complete separation? Completely separate. Any more questions? Oh, were you going to elaborate on the answer? Of course, yeah — I thought that was a really short answer; I'm just trying to save myself from running. Yeah, completely separate. So any ARM-only DLLs you can't call into; you have to have these special ARM64EC DLLs. Thankfully, Microsoft have already done that for all of the system libraries, so anything from Microsoft you can already call. But yeah, other code has to be in this weird mode to make it work. Any more questions? Yeah — really making me work this year. Where were you? I was wondering: are there already examples of software that use it? Because you can find them quite easily — the executable is a different type. I hadn't heard of this feature before, but is there any major software already using it? Yeah, so the person that opened the issue on the LuaJIT project is apparently using this thing. And I'm told that most of Microsoft Office and similar are running in this mode, so that your Intel-type plugins work. But yeah, apparently there's a user for this stuff via the LuaJIT thing, using it. The last question — or was that the last question? Can we pass it? Sorry, what does the EC stand for? Emulation compatible. Am I stealing it from you? Emulation compatibility — that's what it stands for. If that was the last question, then let's thank Peter one more time. And with that, that closes up the Emulator Development Room this year. I want to thank you all for coming.