Okay, so thank you for joining. The next talk is the second Collabora talk of the day, and it's about Collabora Online performance and usability optimisation. We still have Caolán from the previous talk, and Michael is joining us as well. Thank you.

This is Caolán, this is Michael. Good. This is what I'm going to say — you'll see it as we get there. Caolán did a very good spiel earlier on how this thing works, so if you were in the previous talk you saw something similar to this: you have your browser, and then a WebSocket talking to a C++ server on the back end. That talks to LibreOfficeKit over a Unix domain socket, which does all sorts of beautiful interoperability and tiled-rendering goodness. And this fetches data from an ownCloud, an oCIS, a Nextcloud, a Seafile, lots of things — any kind of WOPI host; even SharePoint, I think, we can use. For the good guys, right? So this gets the file, pushes it in here, it gets rendered, and it comes back out to the browser, and we do all sorts of things to try and cache that. JavaScript here, good stuff over there. Anything else on there? Nope. Seems pretty simple.

I just want to talk a little bit about latencies. This is an interactive presentation — I'm not going to ask you to put your hands up just yet — but here are some timings, and the one I want to time is the human eye blink: 100 milliseconds for a human eye blink, okay? Right, so here we are: how good are you at blinking? I'm going to press a button and we'll start blinking, and when you see red, stop. But you need to count at the same time — silently. Ready? Go. How many did you get? Do you want to try again?

Okay, so here is reciprocation for beginners — an advanced topic in maths, if you need help. If you're a falcon, you've got about 7.7 milliseconds per blink, so that's pretty good. Me, I'm more about here, I don't know about you. Six, seven, eight — how many did you get? We're going to try again; you've got the idea now, right? I'm going to click and it's going to go green: start blinking, count the blinks you're doing, blink as fast as you can. I want a high score here — we're going for the peregrine falcon's 153 in a second. Ready? Three, two — you've not started yet, have you? — three, two, one, blink. Okay, that was a second. How many did you get? Five, six, seven, eight. Fair enough. So this tells you your score.

Interestingly, in the UK they say a blink takes between 100 and 150 milliseconds; at Harvard it takes between 100 and 400, which tells you something about Americans. Maybe — I don't know — a slower pace of life is good for people generally. Anyway. The very interesting thing is that when you start looking at some of these numbers — on a log scale, so they're a bit more friendly — blinking is really quite slow. You can go from Frankfurt to the US east coast and back again in the same time, right? So that's pretty good. And the 60 Hz frame time, about 16 milliseconds, is also quite long.
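(The "reciprocation for beginners" slide amounts to this arithmetic, using the numbers quoted above:)

```latex
\text{blinks per second} = \frac{1}{t_{\text{blink}}}:\qquad
\frac{1}{100\,\text{ms}} = 10/\text{s (human)},\quad
\frac{1}{7.7\,\text{ms}} \approx 130/\text{s (falcon)},\quad
\frac{1}{60\,\text{Hz}} \approx 16.7\,\text{ms per frame}.
```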
Frankfurt to Milan, or Frankfurt to London, is a similar time to the time it takes to get something onto the screen, particularly when you add the monitor latency — the round trip is done before you've finished blinking. Lots of people are very worried about latency without having a good feeling for how long things actually take, so it's quite interesting to see some of these numbers. And in terms of typing: the average typist is supposed to manage about three characters a second, a pro about 6.6 — the human eye blink is quicker. Even me typing, not very accurately, is quite a bit faster. And if you mash the keyboard, it turns out you're massively faster — about ten times faster than the average typist. Not good for the keyboard, but there we go. Anyway, I'm going to hand over to Caolán, unless you have anything to add?

No, nothing to add on blinking. But the fundamental point is right: networking is really, really fast, and stuff goes from one end to the other and back in a very, very short period of time — so you generally don't have to worry too much about that part of things.

So, what we do: we have a bunch of demo servers that are generally publicly accessible, and what we've started doing recently is using perf to sample once a second and record for an entire week what happens on those public servers. At the end of the week we generate a single flame graph from all of that, to see where our time is spent over the week generally.

That's the demo servers. Then there's multi-user testing: we have a call once a week — some of the people present in this room join us, along with people from other organisations and community members — and we get a general feel, in that 10-to-20-person call, for whether the applications are still responsive, and whatever issues arise in testing can be checked at that point. That is also profiled and flame graphs generated — typically one for Writer and one for Calc in recent tests — which are all stuck up on GitHub, so you can look at them yourselves if you're interested to see the change over time in what we're looking at. And we use it internally in Collabora, of course, with the deployment that is used daily there; the same week-long profile I mentioned for the demo servers is run on the internal one now as well. So that's the tooling.

Then there's interactive debugging, which you can do yourself with Collabora Online: you go to Help → About and triple-click on the dialog, and that brings up the debugging display we're looking at here. There's loads of information in it. Down the far right side there are tick boxes; as you check them, certain ones display information in the bottom-left corner to tell you things. But maybe more interesting is the one we're calling the tile overlays: when you type in the document, you get these flashing areas, and that's the part of the document that had to be redrawn because of your interaction. What you're really hoping to see, when you watch people typing, is a small rectangle around the area of change they're actually making.
If the entire screen starts flashing, it means a whole pile of other things have been invalidated — marked to be repainted later — for no good reason, and that's what we want to avoid.

These are the kind of flame graphs we look at each week. Just for the purposes of reading them: the colours don't matter, in these or most flame graphs. What matters is the width of a bar — the wider the bar, the more time has proportionally been spent there. So you take a quick look, you find the widest bar, and you ask: can I make the wide ones narrower? There's nothing more to the profiling, really — just make the wide ones narrow.

In this particular one, the widest bar is this whole gigantic pile of boost::spirit::classic, which is all being used to detect whether the PDF that somebody is opening is a particular type of PDF — the hybrid PDF that LibreOffice can produce, where the original LibreOffice document is embedded inside the PDF, so when you open the PDF you also have the original document. It just takes a ludicrous amount of time, especially over the course of a week, to collect that information, when it can be done in many orders of magnitude less. So it's good to see that sort of stuff disappear off the profile. You should never optimise before profiling, obviously.

Cool, thanks. So — storing previous tiles. We've done a whole lot of work to improve our tile rendering performance. We store the previous tiles that have been rendered, so we can see what the difference is and send just the difference. That saves a lot of bandwidth and reduces latency too, and we've completely rewritten how this is done in the last six months to a year. The deltas were already compressed with a simple run-length encoding, but because we're extremely modern, instead of doing stupid stuff like using byte run lengths, we use bit masks — and you'll see why in a second. The bit mask essentially says: is this pixel the same as the previous pixel? Our tiles are 256 pixels square, so in four 64-bit numbers we can have the whole bit mask for a row. It's pretty easy, and it removes a whole load of things. Previously we stored the tiles uncompressed and compared them uncompressed, which turns out to be massively slower — it touches much more memory and uses much more space — and we also did clever things like hashing each row while we were copying it. It turns out it's far better to just use the bit mask.
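As a rough illustration, the bit-mask run-length scheme just described might look like this in plain scalar code — a minimal sketch with assumed names, not the production implementation:

```cpp
#include <cstddef>
#include <cstdint>

// One 256-pixel tile row of 32-bit RGBA pixels: a bit is set when the pixel
// repeats its left neighbour, so four 64-bit words describe the whole row
// and only the distinct pixels need to be stored after the mask.
void rleRow(const uint32_t* row, uint64_t mask[4],
            uint32_t* unique, size_t& nUnique)
{
    uint32_t last = ~row[0]; // guarantee the first pixel is emitted
    for (int w = 0; w < 4; ++w)
    {
        uint64_t bits = 0;
        for (int b = 0; b < 64; ++b)
        {
            const uint32_t px = row[w * 64 + b];
            if (px == last)
                bits |= uint64_t(1) << b;   // run continues
            else
                unique[nUnique++] = px;     // new colour: store it once
            last = px;
        }
        mask[w] = bits;
    }
}
```

The SIMD version does the same comparison eight pixels at a time.

Then Caolán and I did this fun thing with AVX2 — why not? You hear about these processor-accelerated things, and after shrinking our inner loop down to almost nothing, it still wasn't as quick as it could be on the CPU. So this is how we do it: we load eight pixels at a time into a single AVX register, which is kind of nice. The problem is we need to compare each pixel with the previous one, so we shift a pixel off the end and shove the previous one in — although really it's a crossbar switch that you permute to move things; there is no shift across AVX registers that does that. Then we just compare these guys, and that gives you, for each pixel, either all ones or all zeroes. And then comes Caolán's magic trick. In AVX there's AVX2, which is practically available everywhere.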
But AVX-512, which is not practically available, has a particular instruction that will compare the two registers for you and give you that bit mask — and it's missing from AVX2. If you look at what is available, though, you can see that if the comparison were done on floats, the operation is basically there for you. So you cast to floats, and this movemask thing pulls the top bits in and gives you what you were hoping for in the first place: one individual result bit for each pixel you've compared, same or not. You can pull the bits you're looking for out in no time — which is pretty awesome. You convert to a "floating-point number", take the sign bit out, and that's your run-length bit mask. The nice thing about this is there's no branch, no compare, nothing — a simple flat loop of about five instructions. At the end of that we have to work out how many pixels to copy, because it's all very well saying "these are the same", but you need the differing pixels one after another. A popcount counts the bits in the mask, and then with a clever lookup table and the shuffle instructions we can shuffle out exactly the pixels we need, copy them out, and stack them up. Bingo — twice as fast, which is nice. And hopefully AVX-512 will make it even faster... if you believe that, you'll believe anything.

So, yes, here we go — this is a real problem here, if only we could find the idiot responsible. No need to suggest anyone. What's sometimes interesting is that, while I said earlier that narrower is better, sometimes wider is better, in this sense: in a flame graph, individual threads are positioned separately — they aren't combined with the main thread. So if you're not seeing work that you expect to be happening in a thread — over on the left-hand side of the flame graph, basically — then the threading isn't being used. It became apparent that while there's code that attempts to thread this delta work, the threads didn't actually exist; there was a flaw that needed to be sorted. When you fix the flaw and bring the threading back, you then see all that work on the far left-hand side, rooted in the threading area, separate in the flame graph. And while it's wider, it's now running in a separate thread, and you've made progress. It's nice to get twice as fast and then four times as fast on top of it — that's the right sort of approach.

I think we're going to skip through some of these because we're running out of time. But: working out where to do the work, in the browser or not; and premultiplication, and the stupidity of the web in having an RGBA, un-premultiplied-alpha API when it's almost certainly premultiplied under the hood — all the hardware, everything, premultiplies because it's so much quicker. You can see the complaints online from people pushing RGBA into a canvas and getting something out that isn't the same, because it's been premultiplied and then un-premultiplied. Anyway, there you go — the web APIs are awesome. What else?
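Rewinding to the delta loop for a second: here's a minimal intrinsics sketch of the compare / cast-to-float / movemask step described above — hypothetical names, simplified to producing the "same as previous pixel" mask for one group of eight pixels, without the popcount-and-shuffle compaction:

```cpp
#include <immintrin.h>
#include <cstdint>

// Build a "same as previous pixel" bitmask for eight 32-bit RGBA pixels.
// cmpeq gives all-ones / all-zeroes per 32-bit lane; movemask_ps grabs the
// top (sign) bit of each lane, i.e. one result bit per pixel.
uint32_t samePixelMask8(const uint32_t* row, uint32_t carry /* pixel left of row[0] */)
{
    __m256i cur = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(row));

    // There is no cross-lane "shift by one pixel" in AVX2, so it's a lane
    // permute (a crossbar) plus blending the carried pixel into lane 0.
    const __m256i rotIdx = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    __m256i prev = _mm256_permutevar8x32_epi32(cur, rotIdx);
    prev = _mm256_blend_epi32(prev, _mm256_set1_epi32(static_cast<int>(carry)), 0x01);

    __m256i eq = _mm256_cmpeq_epi32(cur, prev);         // per-pixel all-1s/all-0s
    return static_cast<uint32_t>(
        _mm256_movemask_ps(_mm256_castsi256_ps(eq)));   // 8 sign bits -> 8 flags
}
```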
What should be on your profile? Well, it's very hard to know. This one could be okay — there's a whole lot of un-premultiplication in here; it's a very old profile. There's a lot of rendering on it, not very much painting, lots of delta-ing — so we fixed that. But honestly it's very hard to know whether it's good or bad just by looking at it. With lots of bogus invalidations you start to see lots of rendering, and that's not what you want. So everything should shrink, and you end up with a profile that looks much the same — but everything feels much quicker. So we've done lots of work to shrink things, I guess. Caolán, do you want to pick a couple of these?

Yeah. As you mentioned, with the multi-user document tests we basically monitor what's happening. People joining a document triggered that full-document invalidation we mentioned; clicking in headers and footers caused the same thing. Fundamentally, I think, it's because invalidating and redrawing on the desktop has become so cheap: in the very distant past we might have been pretty good at keeping invalidations down, but we've become slack in recent decades, treated redrawing as free, and that has affected things. So we're having a look at that again, bringing things down to smaller rendering areas and fewer invalidations. And the good news is that improves LibreOffice as well, of course — it's more efficient and cleaner on your PC underneath too.

We've done lots of better latency hiding, in terms of more aggressive prefetching — so the next slide is there before you switch to it, and it's absolutely instant. Hiding latency in those ways is quite fun: enlarging the area around the view that we maintain as tiles, and storing and managing much more compressed tile data in the client, which we handle much better now.

This is a fun one, though we don't have much time for it. Classically, std::list in C++ was a genuine linked list, and if you wanted its size you had to walk the entire list from start to finish. That was sorted out decades ago — but, for compatibility purposes, if you build with the particular Red Hat developer toolset, you seem to get the classic behaviour back. So where we were assuming it was cheap and cheerful to get the length of a std::list, that turned out not to be the case here, and you have to go back to a different approach. It appears in your profile, but it looks normal: it's normal for drawing to take some time, and it's normal to have a cache to speed that up. But if the cache has 20,000 items in it and you're walking that list, pointer-chasing... anyway, gone.

Oh, fun stuff. Like: why not have a massive virtual device in the background that you render the whole document to, every time you do something? Not great. Or another one: why not run a benchmark every time you open a document, to see how fast rendering is, allocating a whole load of memory and dirtying it? Great. Then caching images: we didn't bother caching compressed images, because they're compressed, right? So why bother — they're small, they're fine to keep in memory. Except TIFF is not so much compressed, so eventually you have a whole massive chunk of memory sitting there. And we're using the glibc trimming functions on idle to reduce memory usage.
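The glibc idle-trim mentioned above is essentially a one-liner; a minimal sketch, with a hypothetical hook name:

```cpp
#include <malloc.h>

// When the document has been idle for a while, ask glibc to hand freed heap
// pages back to the kernel. malloc_trim(0) releases whatever slack the
// allocator can give up without keeping any extra padding around.
void onIdle()
{
    malloc_trim(0); // return free heap to the OS
}
```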
Yeah, trying to get better measurements of various things. This is a fun one — the smaps story. We read /proc/<pid>/smaps to see how much memory we're using, and classic smaps has multiple entries, one per mapping of your process, so you read lots and lots of lines. There's a relatively new file that has it all pre-aggregated for you — /proc/<pid>/smaps_rollup — which is exactly what we want, and the same code that read the old one should work with the new one. Then apparently we're running out of memory — or it's being reported that we're running out of memory — and it's all very, very bizarre: you cat smaps_rollup yourself and the numbers are good. Something very odd. It turns out that if you seek back to the beginning and read it again, the numbers double every time you do it. There's an actual bug in the original kernel implementation — it's not there in my version 6 kernel, but it is there in the 4.18-era kernels the servers were deployed on, so you had to be on just the right version for it to appear. Linus fixed it, thank God. Caolán found it — well, it was fixed before we found it. But it's always nice to know you have to check your kernel is a quality kernel before you start asking it how much memory it's using.

Hunspell: the spell-checking loop was almost entirely dominated not by actually spelling things, but by looking at the time. I'm sure in a bad talk it's quite similar. A little bit unfortunate — so, some improvements there. And lots of other things; graphs showing speedups.

We've got to get to usability in the last minute, so let me whizz through this. Here we go: accessibility, dark modes, pretty pictures — this is going to be fast. Keyboard accelerators — all the good stuff for people. Screen reading, and all sorts of nice things; videos of that. A better page navigator at the side, so you can see where you're going. And lots of little bits of usability polish: nice font previews; making it easier to insert page numbers, so people can see what's going on easily — was that your page-number thing? I forget who did that. Better change tracking and showing of changes; AI stuff. And the good news is there's more opportunity for performance improvement, so we're still having fun. Hey, come join us — there are some cool profiles to read.

Right. At the moment in Calc, when you're typing, the entire row gets invalidated beyond the right-hand side of where you're actually typing. We've brought that down to the cell in the most generic case — but it's not done for Writer: in Writer, if you're typing, we invalidate all the way to the right-hand edge of the screen, so we'll shrink that back down again. We have some new metrics included in that debugging overlay that give you an indication of how much of the update traffic coming through is the same data as before the update — and the numbers are staggeringly high. So there's plenty of room for improvement: invalidate less, send less data down.

And one thing that's always been troublesome in LibreOffice is the treatment of the alpha layer. We picked the opposite direction to everybody else: everybody else picks transparency and we picked opacity, or vice versa — either way, we have the opposite direction.
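Concretely, that mismatch means every boundary with the outside world pays for a channel flip along these lines — an illustrative sketch, not the actual VCL code:

```cpp
#include <cstdint>

// Illustrative only: with 0 = opaque on one side (a transparency value) and
// 0 = fully transparent on the other (an alpha value), every import/export
// path has to invert the channel for each pixel.
inline uint8_t transparencyToAlpha(uint8_t t) { return 255 - t; }
```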
Everybody else who wants to actually output something into the real world handles transparency, so we have to reverse ours — and that's problematic. That's now fixed. But we've also kept our alpha layer in a separate buffer — a separate bitmap from the actual bitmap — and if we put them together someday, that would make things a lot easier, I believe. It's the Windows 16-bit API decisions that are still with us. But anyway, we're getting rid of them quickly, which is great.

Performance regression testing with Valgrind, and pipelined loading. Oh, we've got five minutes? Fantastic — I went too quickly. No, you're doing fine. Okay, right. I think we're nearly at the end.

So, pipelined loading. At the moment we fetch a web page that passes all the credentials we need to check ourselves, we load lots of JavaScript, we open a WebSocket — and only then do we actually see if we can load the document and start checking who the user is. This is really foolish. On first contact we could be checking the user, downloading the document, even loading the document ready for when the WebSocket arrives, and then have a pre-rendered version waiting. That very substantially reduces startup time and makes it incredibly quick. You already have a huge advantage in having a real server at the back end: you're not having to JIT millions of lines of JavaScript or WebAssembly into something in your browser, so it should be amazingly fast, and this is a great way to speed it up even further. And with a real server — you may be time-sharing it, but when you arrive, your server is probably not doing much; in fact the CPU load on most of our servers is extremely low. So suddenly there are all these threads ready to render your document and get stuff to you quickly. Say some good things.

And Valgrind: we've done a whole lot of work to get this to run nicely under Valgrind — with our privilege model and container model that's a bit of a problem — so we now have some code that turns everything into a single process. You can load and collaborate on one document, automate that, and run it under Valgrind. Why would you want to do performance profiling in Valgrind? It seems like a retro thing to do, right? But the beautiful thing about Valgrind is the simulated CPU: anybody can run the same workload on their machine, and between two runs it's the same thing. Valgrind luckily doesn't have a simulated thermal-management system that randomly throttles your CPU, and it luckily doesn't have people screwing with your cache, running cron jobs in the background, thermally recalibrating your disk, and all this other stuff. So what you discover is that between two identical commits you get small fractions of a percent difference in the Valgrind numbers — which is beautiful, because performance tends not to go away in big jumps. It can, but it tends to go slowly downhill, and if the noise is bigger than the slow downhill, you've no idea where the problem is. Much better to have a little series of steps going down, half a percent at a time, and go: hey, we can get rid of that, and that.
So this is really vital. LibreOffice uses this in its perf automation — it has beautiful web pages with graphs — and we'll be applying it to Collabora Online to try and avoid regressions. Yeah, someday soon. Neil, we think, probably. Anyway — anything else? No, I think we've covered plenty.

And yes, of course, we can't do anything without our partners and customers that pay for it all — blah, blah, blah, commercial plug. Good. Job done.

And conclusions. Computers are unbelievably fast — this is something you should take home. The quarter of a nanosecond your 4 GHz processor takes per cycle is just unbelievable against the scale of the hundred-plus milliseconds it takes you to blink your eye. It's fantastically speedy in a way you can't explain. And the network latency to almost anywhere — you can go London to Frankfurt and back three times in the time it takes you to blink. It's unbelievably fast. In fact, you can go Frankfurt to Milan faster than your monitor can refresh. It's quite amazing when you start looking at the timings of things.

Our architecture is really a bet on CPUs and networks getting faster and cheaper — has anyone noticed a trend there? I think there might be something in that. And we're basically racing the hardware guys: we do stupid stuff, obviously, and then we remove it later, but the hardware people are also trying to beat us by running our stupid stuff quicker. That's their mission.

And it's extremely smooth — don't get the feeling that it's bad. Try it. Most of these problems you'll only start to see when you have twenty-plus people collaboratively editing a document. So give it a try, try the latest version, give us some feedback, get involved — there's lots of fun to get involved with. I'd point you at two things: as I mentioned earlier, the profiles we have for Calc and Writer are uploaded to GitHub once a week — a generic Calc performance profile and a generic Writer performance profile — so search the Collabora Online GitHub issues and you can see all the charts we've mentioned from past weeks. You can even see the progress there, and the occasional blip during a call where things go horrifically wrong and get sorted out by the next one. So there's plenty to see of what we're doing. There are links in the slide — which you can't see — to the profiles; get involved in the LibreOffice technology. Thank you. That's it — you've been very patient. Thank you.