Okay, so thank you for joining. The next talk is the second Collabora talk of the day, and it's about Collabora Online performance and usability optimisation. We still have Caolán from the previous talk, and Michael is joining us as well. Thank you.

This is Caolán, this is Michael. Good. This is what I'm going to say — you'll see it as we get there. Caolán did a very good spiel earlier on how this thing works, so if you were in the previous talk you saw something similar to this: you have your browser, and then a WebSocket talking to a C++ server on the back end. That talks to LibreOfficeKit over a Unix domain socket, which does all sorts of beautiful interoperability and tiled-rendering goodness. And this fetches data from an ownCloud, an oCIS, a Nextcloud, a Seafile, lots of things — any kind of WOPI host; even SharePoint, I think, we can use. For the good guys, right? So this gets the file, pushes it in here, it gets rendered, and it comes back out to the browser, and we do all sorts of things to try and cache that. JavaScript here, good stuff over there. Anything else on there? Nope. Seems pretty simple.

I just want to talk a little bit about latencies. This is an interactive presentation — I'm not going to ask you to put your hands up just yet — but here are some timings, and the one I want to time is the human eye blink: 100 milliseconds for a human eye blink, okay? Right, so here we are: how good are you at blinking? I'm going to press a button and we'll start blinking, and when you see red, stop. But you need to count at the same time — silently. Ready? Go. How many did you get? Do you want to try again?

Okay, so here is reciprocation for beginners — an advanced topic in maths, if you need help. If you're a falcon, you've got about 7.7 milliseconds per blink, so that's pretty good. Me, I'm more about here, I don't know about you. Six, seven, eight — how many did you get? We're going to try again; you've got the idea now, right? I'm going to click and it's going to go green: start blinking, count the blinks you're doing, blink as fast as you can. I want a high score here — we're going for the peregrine falcon's 153 in a second. Ready? Three, two — you've not started yet, have you? — three, two, one, blink. Okay, that was a second. How many did you get? Five, six, seven, eight. Fair enough. So this tells you your score.

Interestingly, in the UK they say a blink takes between 100 and 150 milliseconds; at Harvard it takes between 100 and 400, which tells you something about Americans. Maybe — I don't know — a slower pace of life is good for people generally. Anyway. The very interesting thing is that when you start looking at some of these numbers — on a log scale, so they're a bit more friendly — blinking is really quite slow. You can go from Frankfurt to the US east coast and back again in the same time, right? So that's pretty good. And the 60 Hz frame time, about 16 milliseconds, is also quite long.
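(The "reciprocation for beginners" slide amounts to this arithmetic, using the numbers quoted above:)

```latex
\text{blinks per second} = \frac{1}{t_{\text{blink}}}:\qquad
\frac{1}{100\,\text{ms}} = 10/\text{s (human)},\quad
\frac{1}{7.7\,\text{ms}} \approx 130/\text{s (falcon)},\quad
\frac{1}{60\,\text{Hz}} \approx 16.7\,\text{ms per frame}.
```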
Frankfurt to Milan, or Frankfurt to London, is a similar time to the time it takes to get something onto the screen, particularly when you add the monitor latency — the round trip is done before you've finished blinking. Lots of people are very worried about latency without having a good feeling for how long things actually take, so it's quite interesting to see some of these numbers. And in terms of typing: the average typist is supposed to manage about three characters a second, a pro about 6.6 — the human eye blink is quicker. Even me typing, not very accurately, is quite a bit faster. And if you mash the keyboard, it turns out you're massively faster — about ten times faster than the average typist. Not good for the keyboard, but there we go. Anyway, I'm going to hand over to Caolán, unless you have anything to add?

No, nothing to add on blinking. But the fundamental point is right: networking is really, really fast, and stuff goes from one end to the other and back in a very, very short period of time — so you generally don't have to worry too much about that part of things.

So, what we do: we have a bunch of demo servers that are generally publicly accessible, and what we've started doing recently is using perf to sample once a second and record for an entire week what happens on those public servers. At the end of the week we generate a single flame graph from all of that, to see where our time is spent over the week generally.

That's the demo servers. Then there's multi-user testing: we have a call once a week — some of the people present in this room join us, along with people from other organisations and community members — and we get a general feel, in that 10-to-20-person call, for whether the applications are still responsive, and whatever issues arise in testing can be checked at that point. That is also profiled and flame graphs generated — typically one for Writer and one for Calc in recent tests — which are all stuck up on GitHub, so you can look at them yourselves if you're interested to see the change over time in what we're looking at. And we use it internally in Collabora, of course, with the deployment that is used daily there; the same week-long profile I mentioned for the demo servers is run on the internal one now as well. So that's the tooling.

Then there's interactive debugging, which you can do yourself with Collabora Online: you go to Help → About and triple-click on the dialog, and that brings up the debugging display we're looking at here. There's loads of information in it. Down the far right side there are tick boxes; as you check them, certain ones display information in the bottom-left corner to tell you things. But maybe more interesting is the one we're calling the tile overlays: when you type in the document, you get these flashing areas, and that's the part of the document that had to be redrawn because of your interaction. What you're really hoping to see, when you watch people typing, is a small rectangle around the area of change they're actually making.
If the entire screen starts flashing, it means a whole pile of other things have been invalidated — marked to be repainted later — for no good reason, and that's what we want to avoid.

These are the kind of flame graphs we look at each week. Just for the purposes of reading them: the colours don't matter, in these or most flame graphs. What matters is the width of a bar — the wider the bar, the more time has proportionally been spent there. So you take a quick look, you find the widest bar, and you ask: can I make the wide ones narrower? There's nothing more to the profiling, really — just make the wide ones narrow.

In this particular one, the widest bar is this whole gigantic pile of boost::spirit::classic, which is all being used to detect whether the PDF that somebody is opening is a particular type of PDF — the hybrid PDF that LibreOffice can produce, where the original LibreOffice document is embedded inside the PDF, so when you open the PDF you also have the original document. It just takes a ludicrous amount of time, especially over the course of a week, to collect that information, when it can be done in many orders of magnitude less. So it's good to see that sort of stuff disappear off the profile. You should never optimise before profiling, obviously.

Cool, thanks. So — storing previous tiles. We've done a whole lot of work to improve our tile rendering performance. We store the previous tiles that have been rendered, so we can see what the difference is and send just the difference. That saves a lot of bandwidth and reduces latency too, and we've completely rewritten how this is done in the last six months to a year. The deltas were already compressed with a simple run-length encoding, but because we're extremely modern, instead of doing stupid stuff like using byte run lengths, we use bit masks — and you'll see why in a second. The bit mask essentially says: is this pixel the same as the previous pixel? Our tiles are 256 pixels square, so in four 64-bit numbers we can have the whole bit mask for a row. It's pretty easy, and it removes a whole load of things. Previously we stored the tiles uncompressed and compared them uncompressed, which turns out to be massively slower — it touches much more memory and uses much more space — and we also did clever things like hashing each row while we were copying it. It turns out it's far better to just use the bit mask.
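As a rough illustration, the bit-mask run-length scheme just described might look like this in plain scalar code — a minimal sketch with assumed names, not the production implementation:

```cpp
#include <cstddef>
#include <cstdint>

// One 256-pixel tile row of 32-bit RGBA pixels: a bit is set when the pixel
// repeats its left neighbour, so four 64-bit words describe the whole row
// and only the distinct pixels need to be stored after the mask.
void rleRow(const uint32_t* row, uint64_t mask[4],
            uint32_t* unique, size_t& nUnique)
{
    uint32_t last = ~row[0]; // guarantee the first pixel is emitted
    for (int w = 0; w < 4; ++w)
    {
        uint64_t bits = 0;
        for (int b = 0; b < 64; ++b)
        {
            const uint32_t px = row[w * 64 + b];
            if (px == last)
                bits |= uint64_t(1) << b;   // run continues
            else
                unique[nUnique++] = px;     // new colour: store it once
            last = px;
        }
        mask[w] = bits;
    }
}
```

The SIMD version does the same comparison eight pixels at a time.

Then Caolán and I did this fun thing with AVX2 — why not? You hear about these processor-accelerated things, and after shrinking our inner loop down to almost nothing, it still wasn't as quick as it could be on the CPU. So this is how we do it: we load eight pixels at a time into a single AVX register, which is kind of nice. The problem is we need to compare each pixel with the previous one, so we shift a pixel off the end and shove the previous one in — although really it's a crossbar switch that you permute to move things; there is no shift across AVX registers that does that. Then we just compare these guys, and that gives you, for each pixel, either all ones or all zeroes. And then comes Caolán's magic trick. In AVX there's AVX2, which is practically available everywhere.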
But AVX-512, which is not practically available, has a particular instruction that will compare the two registers for you and give you that bit mask — and it's missing from AVX2. If you look at what is available, though, you can see that if the comparison were done on floats, the operation is basically there for you. So you cast to floats, and this movemask thing pulls the top bits in and gives you what you were hoping for in the first place: one individual result bit for each pixel you've compared, same or not. You can pull the bits you're looking for out in no time — which is pretty awesome. You convert to a "floating-point number", take the sign bit out, and that's your run-length bit mask. The nice thing about this is there's no branch, no compare, nothing — a simple flat loop of about five instructions. At the end of that we have to work out how many pixels to copy, because it's all very well saying "these are the same", but you need the differing pixels one after another. A popcount counts the bits in the mask, and then with a clever lookup table and the shuffle instructions we can shuffle out exactly the pixels we need, copy them out, and stack them up. Bingo — twice as fast, which is nice. And hopefully AVX-512 will make it even faster... if you believe that, you'll believe anything.

So, yes, here we go — this is a real problem here, if only we could find the idiot responsible. No need to suggest anyone. What's sometimes interesting is that, while I said earlier that narrower is better, sometimes wider is better, in this sense: in a flame graph, individual threads are positioned separately — they aren't combined with the main thread. So if you're not seeing work that you expect to be happening in a thread — over on the left-hand side of the flame graph, basically — then the threading isn't being used. It became apparent that while there's code that attempts to thread this delta work, the threads didn't actually exist; there was a flaw that needed to be sorted. When you fix the flaw and bring the threading back, you then see all that work on the far left-hand side, rooted in the threading area, separate in the flame graph. And while it's wider, it's now running in a separate thread, and you've made progress. It's nice to get twice as fast and then four times as fast on top of it — that's the right sort of approach.

I think we're going to skip through some of these because we're running out of time. But: working out where to do the work, in the browser or not; and premultiplication, and the stupidity of the web in having an RGBA, un-premultiplied-alpha API when it's almost certainly premultiplied under the hood — all the hardware, everything, premultiplies because it's so much quicker. You can see the complaints online from people pushing RGBA into a canvas and getting something out that isn't the same, because it's been premultiplied and then un-premultiplied. Anyway, there you go — the web APIs are awesome. What else?
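Rewinding to the delta loop for a second: here's a minimal intrinsics sketch of the compare / cast-to-float / movemask step described above — hypothetical names, simplified to producing the "same as previous pixel" mask for one group of eight pixels, without the popcount-and-shuffle compaction:

```cpp
#include <immintrin.h>
#include <cstdint>

// Build a "same as previous pixel" bitmask for eight 32-bit RGBA pixels.
// cmpeq gives all-ones / all-zeroes per 32-bit lane; movemask_ps grabs the
// top (sign) bit of each lane, i.e. one result bit per pixel.
uint32_t samePixelMask8(const uint32_t* row, uint32_t carry /* pixel left of row[0] */)
{
    __m256i cur = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(row));

    // There is no cross-lane "shift by one pixel" in AVX2, so it's a lane
    // permute (a crossbar) plus blending the carried pixel into lane 0.
    const __m256i rotIdx = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    __m256i prev = _mm256_permutevar8x32_epi32(cur, rotIdx);
    prev = _mm256_blend_epi32(prev, _mm256_set1_epi32(static_cast<int>(carry)), 0x01);

    __m256i eq = _mm256_cmpeq_epi32(cur, prev);         // per-pixel all-1s/all-0s
    return static_cast<uint32_t>(
        _mm256_movemask_ps(_mm256_castsi256_ps(eq)));   // 8 sign bits -> 8 flags
}
```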
What should be on your profile? Well, it's very hard to know. This one could be okay — there's a whole lot of un-premultiplication in here; it's a very old profile. There's a lot of rendering on it, not very much painting, lots of delta-ing — so we fixed that. But honestly it's very hard to know whether it's good or bad just by looking at it. With lots of bogus invalidations you start to see lots of rendering, and that's not what you want. So everything should shrink, and you end up with a profile that looks much the same — but everything feels much quicker. So we've done lots of work to shrink things, I guess. Caolán, do you want to pick a couple of these?

Yeah. As you mentioned, with the multi-user document tests we basically monitor what's happening. People joining a document triggered that full-document invalidation we mentioned; clicking in headers and footers caused the same thing. Fundamentally, I think, it's because invalidating and redrawing on the desktop has become so cheap: in the very distant past we might have been pretty good at keeping invalidations down, but we've become slack in recent decades, treated redrawing as free, and that has affected things. So we're having a look at that again, bringing things down to smaller rendering areas and fewer invalidations. And the good news is that improves LibreOffice as well, of course — it's more efficient and cleaner on your PC underneath too.

We've done lots of better latency hiding, in terms of more aggressive prefetching — so the next slide is there before you switch to it, and it's absolutely instant. Hiding latency in those ways is quite fun: enlarging the area around the view that we maintain as tiles, and storing and managing much more compressed tile data in the client, which we handle much better now.

This is a fun one, though we don't have much time for it. Classically, std::list in C++ was a genuine linked list, and if you wanted its size you had to walk the entire list from start to finish. That was sorted out decades ago — but, for compatibility purposes, if you build with the particular Red Hat developer toolset, you seem to get the classic behaviour back. So where we were assuming it was cheap and cheerful to get the length of a std::list, that turned out not to be the case here, and you have to go back to a different approach. It appears in your profile, but it looks normal: it's normal for drawing to take some time, and it's normal to have a cache to speed that up. But if the cache has 20,000 items in it and you're walking that list, pointer-chasing... anyway, gone.

Oh, fun stuff. Like: why not have a massive virtual device in the background that you render the whole document to, every time you do something? Not great. Or another one: why not run a benchmark every time you open a document, to see how fast rendering is, allocating a whole load of memory and dirtying it? Great. Then caching images: we didn't bother caching compressed images, because they're compressed, right? So why bother — they're small, they're fine to keep in memory. Except TIFF is not so much compressed, so eventually you have a whole massive chunk of memory sitting there. And we're using the glibc trimming functions on idle to reduce memory usage.
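The glibc idle-trim mentioned above is essentially a one-liner; a minimal sketch, with a hypothetical hook name:

```cpp
#include <malloc.h>

// When the document has been idle for a while, ask glibc to hand freed heap
// pages back to the kernel. malloc_trim(0) releases whatever slack the
// allocator can give up without keeping any extra padding around.
void onIdle()
{
    malloc_trim(0); // return free heap to the OS
}
```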
Yeah, trying to get better measurements of various things. This is a fun one — the smaps story. We read /proc/<pid>/smaps to see how much memory we're using, and classic smaps has multiple entries, one per mapping of your process, so you read lots and lots of lines. There's a relatively new file that has it all pre-aggregated for you — /proc/<pid>/smaps_rollup — which is exactly what we want, and the same code that read the old one should work with the new one. Then apparently we're running out of memory — or it's being reported that we're running out of memory — and it's all very, very bizarre: you cat smaps_rollup yourself and the numbers are good. Something very odd. It turns out that if you seek back to the beginning and read it again, the numbers double every time you do it. There's an actual bug in the original kernel implementation — it's not there in my version 6 kernel, but it is there in the 4.18-era kernels the servers were deployed on, so you had to be on just the right version for it to appear. Linus fixed it, thank God. Caolán found it — well, it was fixed before we found it. But it's always nice to know you have to check your kernel is a quality kernel before you start asking it how much memory it's using.

Hunspell: the spell-checking loop was almost entirely dominated not by actually spelling things, but by looking at the time. I'm sure in a bad talk it's quite similar. A little bit unfortunate — so, some improvements there. And lots of other things; graphs showing speedups.

We've got to get to usability in the last minute, so let me whizz through this. Here we go: accessibility, dark modes, pretty pictures — this is going to be fast. Keyboard accelerators — all the good stuff for people. Screen reading, and all sorts of nice things; videos of that. A better page navigator at the side, so you can see where you're going. And lots of little bits of usability polish: nice font previews; making it easier to insert page numbers, so people can see what's going on easily — was that your page-number thing? I forget who did that. Better change tracking and showing of changes; AI stuff. And the good news is there's more opportunity for performance improvement, so we're still having fun. Hey, come join us — there are some cool profiles to read.

Right. At the moment in Calc, when you're typing, the entire row gets invalidated beyond the right-hand side of where you're actually typing. We've brought that down to the cell in the most generic case — but it's not done for Writer: in Writer, if you're typing, we invalidate all the way to the right-hand edge of the screen, so we'll shrink that back down again. We have some new metrics included in that debugging overlay that give you an indication of how much of the update traffic coming through is the same data as before the update — and the numbers are staggeringly high. So there's plenty of room for improvement: invalidate less, send less data down.

And one thing that's always been troublesome in LibreOffice is the treatment of the alpha layer. We picked the opposite direction to everybody else: everybody else picks transparency and we picked opacity, or vice versa — either way, we have the opposite direction.
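Concretely, that mismatch means every boundary with the outside world pays for a channel flip along these lines — an illustrative sketch, not the actual VCL code:

```cpp
#include <cstdint>

// Illustrative only: with 0 = opaque on one side (a transparency value) and
// 0 = fully transparent on the other (an alpha value), every import/export
// path has to invert the channel for each pixel.
inline uint8_t transparencyToAlpha(uint8_t t) { return 255 - t; }
```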
Everybody else who wants to actually output something into the real world handles transparency, so we have to reverse ours — and that's problematic. That's now fixed. But we've also kept our alpha layer in a separate buffer — a separate bitmap from the actual bitmap — and if we put them together someday, that would make things a lot easier, I believe. It's the Windows 16-bit API decisions that are still with us. But anyway, we're getting rid of them quickly, which is great.

Performance regression testing with Valgrind, and pipelined loading. Oh, we've got five minutes? Fantastic — I went too quickly. No, you're doing fine. Okay, right. I think we're nearly at the end.

So, pipelined loading. At the moment we fetch a web page that passes all the credentials we need to check ourselves, we load lots of JavaScript, we open a WebSocket — and only then do we actually see if we can load the document and start checking who the user is. This is really foolish. On first contact we could be checking the user, downloading the document, even loading the document ready for when the WebSocket arrives, and then have a pre-rendered version waiting. That very substantially reduces startup time and makes it incredibly quick. You already have a huge advantage in having a real server at the back end: you're not having to JIT millions of lines of JavaScript or WebAssembly into something in your browser, so it should be amazingly fast, and this is a great way to speed it up even further. And with a real server — you may be time-sharing it, but when you arrive, your server is probably not doing much; in fact the CPU load on most of our servers is extremely low. So suddenly there are all these threads ready to render your document and get stuff to you quickly. Say some good things.

And Valgrind: we've done a whole lot of work to get this to run nicely under Valgrind — with our privilege model and container model that's a bit of a problem — so we now have some code that turns everything into a single process. You can load and collaborate on one document, automate that, and run it under Valgrind. Why would you want to do performance profiling in Valgrind? It seems like a retro thing to do, right? But the beautiful thing about Valgrind is the simulated CPU: anybody can run the same workload on their machine, and between two runs it's the same thing. Valgrind luckily doesn't have a simulated thermal-management system that randomly throttles your CPU, and it luckily doesn't have people screwing with your cache, running cron jobs in the background, thermally recalibrating your disk, and all this other stuff. So what you discover is that between two identical commits you get small fractions of a percent difference in the Valgrind numbers — which is beautiful, because performance tends not to go away in big jumps. It can, but it tends to go slowly downhill, and if the noise is bigger than the slow downhill, you've no idea where the problem is. Much better to have a little series of steps going down, half a percent at a time, and go: hey, we can get rid of that, and that.
So this is really vital. LibreOffice uses this in its perf automation — it has beautiful web pages with graphs — and we'll be applying it to Collabora Online to try and avoid regressions. Yeah, someday soon. Neil, we think, probably. Anyway — anything else? No, I think we've covered plenty.

And yes, of course, we can't do anything without our partners and customers that pay for it all — blah, blah, blah, commercial plug. Good. Job done.

And conclusions. Computers are unbelievably fast — this is something you should take home. The quarter of a nanosecond your 4 GHz processor takes per cycle is just unbelievable against the scale of the hundred-plus milliseconds it takes you to blink your eye. It's fantastically speedy in a way you can't explain. And the network latency to almost anywhere — you can go London to Frankfurt and back three times in the time it takes you to blink. It's unbelievably fast. In fact, you can go Frankfurt to Milan faster than your monitor can refresh. It's quite amazing when you start looking at the timings of things.

Our architecture is really a bet on CPUs and networks getting faster and cheaper — has anyone noticed a trend there? I think there might be something in that. And we're basically racing the hardware guys: we do stupid stuff, obviously, and then we remove it later, but the hardware people are also trying to beat us by running our stupid stuff quicker. That's their mission.

And it's extremely smooth — don't get the feeling that it's bad. Try it. Most of these problems you'll only start to see when you have twenty-plus people collaboratively editing a document. So give it a try, try the latest version, give us some feedback, get involved — there's lots of fun to get involved with. I'd point you at two things: as I mentioned earlier, the profiles we have for Calc and Writer are uploaded to GitHub once a week — a generic Calc performance profile and a generic Writer performance profile — so search the Collabora Online GitHub issues and you can see all the charts we've mentioned from past weeks. You can even see the progress there, and the occasional blip during a call where things go horrifically wrong and get sorted out by the next one. So there's plenty to see of what we're doing. There are links in the slide — which you can't see — to the profiles; get involved in the LibreOffice technology. Thank you. That's it — you've been very patient. Thank you.