Okay, my name is Noel Grandin. I'm talking about the work that I've been doing making LibreOffice faster.

We all know that LibreOffice is a bit of an elephant. It's a cute elephant, but it's still a bit of an elephant. We've got 10 million-odd lines of code, some of which is 20 years old. Practices that were perfectly acceptable 20 years ago are a little bit trickier these days. Optimizations that made perfect sense back in the day are not great anymore, and things have just changed around.

For example, CPU memory bandwidth has dramatically changed. From around the era of the 486DX2, we suddenly saw a dramatic shift in the speed of main RAM versus the speed of the CPU. Up until that point, CPUs could touch a location in main memory at about the same speed as they could touch locally cached memory, and you wrote your algorithms around that. The DX2 changed things, and it only got dramatically worse after that, to the point where touching main memory can now be up to 50 times slower than touching something in L1 cache. Cache-friendly algorithms are the new game in town.

We now have multiple cores. Everybody has multiple cores. It used to be that only people with vast sums of money had machines with multiple cores or more than one CPU; now everybody has one. So locking algorithms that made perfect sense when they only got touched by a handful of people suddenly get exercised by everybody, and all sorts of interesting flaws come up. An interesting note: nobody actually knew whether the locking algorithms written back then were truly solid, because Intel's own engineers refused to commit to a cache coherency protocol until somewhere around the Pentium era. Up until then, they ruthlessly refused to say anything about it, and if queried, they would just say it wasn't something they had locked down yet. So you were kind of in the dark, and we end up where we end up.

So, the good news. I've been at this for a little while, and the worst and ugliest of the stuff is largely gone. I mean, there are still lots of performance issues left in LibreOffice, but the stuff that used to hang LibreOffice up for minutes at a time, I've managed to nail most of that down. There are still lots of places where it's slow, but it's not heinously slow like it used to be. The bad news is that the remaining stuff is really, really ugly. It's typically the stuff that I just couldn't get my head around the first time; some of it I've looked at again, and I still can't get my head around it. Some of it I've been beating on for a while, and I'm not making a lot of progress.
But anyhow, the point of this exercise is that you keep beating on things, and eventually either you break or it breaks. Hopefully, you get the right one.

So, what worked out nicely? I'm going to talk today about what worked out well and what worked out badly, because I think it's important that we share both the successes and the failures; you learn equally from both.

What worked out nicely was reducing allocations, primarily with things like std::optional. For example, if we were allocating a temporary data structure inside a subroutine, I would switch that to declare the thing using std::optional. That does have the downside that it takes up space on the stack even if you're not using it, but on the upside, using the stack is about 100 times faster than allocating from the heap. The reason is that any time you allocate out of the heap, you have to take a lock, which leads to lock contention, et cetera, et cetera, and you have to talk to main memory. The stack, by contrast, is pretty much guaranteed to be in L1 cache, so allocating out of there is just ridiculously fast, and you can allocate quite a lot of memory before you start falling out of L1 cache; we're talking up to a couple of kilobytes here. So throwing that sort of thing at the problem worked well.

The other thing I did was converting usages of our internal osl::Mutex class to std::mutex. Despite the naming, these are two different kinds of mutexes: our internal osl::Mutex is what's normally called a recursive mutex, the equivalent of std::recursive_mutex, whereas std::mutex is a non-recursive mutex, and it is considerably faster. In the uncontended case, std::mutex is about 20 to 50 times faster than std::recursive_mutex; in the contended case, they tend to fall back to roughly the same cost. Given that we run most of the time with very, very little concurrency, except for the occasional use case here and there, std::mutex is a major win. The downside is that if you hold a std::mutex and try to lock it a second time from the same thread, you get a deadlock, so you have to write your code quite carefully. Most of our usages converted very easily. I did make a couple of mistakes, so there were a couple of regressions I had to fix, and in a couple of cases the regressions were simply unfixable, so we backed those conversions out again. But in general, it's a win. And as a side effect, std::mutex is allocation-free: it's literally just a single word in memory, and taking it in the common uncontended case is a single CAS operation, a compare-and-swap. So it's really fast, and it uses even less memory than osl::Mutex. A win all around.
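(As a side note: here's a minimal sketch of the std::optional pattern described above. The structure and function names are made up for illustration; this is not actual LibreOffice code.)

    #include <array>
    #include <optional>

    // Hypothetical temporary buffer: ~1 KB, small enough to sit
    // comfortably on the stack and stay in L1 cache.
    struct ScratchBuffer {
        std::array<int, 256> slots{};
        int used = 0;
    };

    void processRow(bool needsScratch) {
        // Before: a heap allocation on every call, even when unused:
        //   auto buf = std::make_unique<ScratchBuffer>();
        // After: reserve the space on the stack, construct only on demand.
        std::optional<ScratchBuffer> buf;
        if (needsScratch) {
            buf.emplace();  // constructs in place: no heap, no allocator lock
            buf->slots[buf->used++] = 42;
        }
        // buf is destroyed automatically when it goes out of scope.
    }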
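(And a sketch of the deadlock pitfall from the osl::Mutex-to-std::mutex conversion, using a made-up class. This shows the shape of the bug, not real LibreOffice code.)

    #include <mutex>

    class Counter {
        std::mutex m_mutex;  // non-recursive: one word, a CAS to take it
        int m_value = 0;

    public:
        int get() {
            std::scoped_lock guard(m_mutex);
            return m_value;
        }

        void increment() {
            std::scoped_lock guard(m_mutex);
            // With a recursive mutex (osl::Mutex, std::recursive_mutex),
            // calling get() here would have been fine. With std::mutex it
            // deadlocks, because the same thread locks m_mutex twice.
            // The usual fix is a private helper that assumes the lock is held:
            m_value = getImpl() + 1;
        }

    private:
        int getImpl() { return m_value; }  // caller must hold m_mutex
    };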
So, what didn't work out? We've got this svl::SharedStringPool. It's a really nice thing. We use it primarily in spreadsheets, because spreadsheets often have many tens of thousands of strings that are identical across cells, so we store references to those strings in the shared pool. Now, this shared pool gets hit very hard at load time, because other people implemented concurrent loading of spreadsheets: we have this nice machinery where we fire up multiple threads and load a bunch of different chunks of the spreadsheet at the same time. This is great, it speeds things up, but it was bottlenecking very hard in svl::SharedStringPool.

So I thought: great, this is a hash map; best-case scenario, we can stick a concurrent lock-free hash map in there. I found this great cuckoo hash map written by some very bright people, much smarter than myself, and I stuck it in there. It worked out brilliantly. It completely destroyed the bottleneck, and spreadsheet loading sped up by a factor of two or three. I thought I had a great win. And then the first bug came in: because we were now doing concurrent locking, we were hitting some weird edge cases where we had two different maps both talking to the same thing. So I had to fix those edge cases. I reworked it and we fixed those bugs; I reworked it again and fixed more bugs; and I eventually got it working. And then another bright guy, Luboš, came in and said: but wait. The string pool is indeed a hotspot, but the hotspot is not actually the hash map inside the string pool; the hotspot is the uppercase conversion. He improved the uppercase conversion, and by the time he was done, my concurrent hash map was making no difference whatsoever. So I had identified a hotspot, but I hadn't identified the hotspot. So we backed it out, because there's no point in keeping a highly complicated piece of equipment like that around. I was very sad to see it go, but I learned some stuff in the process, so no harm, no foul. And we're still faster than we were before.
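(For context, the underlying idea of the shared string pool is plain string interning. Here's a minimal sketch of the concept with a global lock like the one the loader threads were fighting over; this is an illustration, not the real svl::SharedStringPool API.)

    #include <mutex>
    #include <string>
    #include <unordered_set>

    class ToyStringPool {
        std::mutex m_mutex;                      // every insert takes this lock:
                                                 // exactly where concurrent
                                                 // spreadsheet loading bottlenecked
        std::unordered_set<std::string> m_pool;  // node-based container, so
                                                 // element addresses stay stable

    public:
        // Identical strings from different cells all come back as the same
        // canonical pointer, so comparisons become pointer compares and
        // duplicate storage collapses.
        const std::string* intern(const std::string& s) {
            std::scoped_lock guard(m_mutex);
            return &*m_pool.insert(s).first;
        }
    };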
Redline processing in Writer, which is our document editor, is often very, very slow, especially if there are a ton of redlines in a big document. It's slow both at load time and at runtime, because when we're doing redlines, we're often doing massive amounts of adding and deleting to data structures as we internally render things. So I thought I'd try to speed that up.

However, this is Writer, and Writer is the most cache-unfriendly program known to mankind. It just wants to chase pointers all over RAM. Once you have more than roughly a ten-page document, its data set falls completely out of L1 cache, so you have no chance of cache-friendly algorithms. The data structures are inherently very, very complicated; human languages and documents just are. The consequence is pointer chasing: we're constantly loading a pointer and following it to some other location in memory, which is guaranteed not to be in L1 cache, then following that to something else, which is also not in L1 cache, and in the process we've blown our cache apart, so if we need the first pointer again, that's not in L1 cache either. We just chase our tails around, very slowly.

I did my best to fix this, which involved trying to speed up something called a node index and a content index. And I failed. I am now three levels deep, fixing different things related to that, with no end in sight, and I'm currently bottlenecked on a piece of processing that I just don't understand. So that didn't work out. But in the process, I now know a lot more about Writer, and so I consider that a win. Thank you.
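(A footnote to the pointer-chasing point above: the effect is easy to reproduce with a tiny standalone benchmark. This is illustrative only and has nothing to do with Writer's actual data structures; the exact numbers vary by machine, but the dependent random walk is typically many times slower than the sequential scan.)

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        constexpr std::size_t N = 1 << 22;  // ~4M ints, far bigger than L1

        std::vector<int> data(N, 1);

        // Build a single random cycle over all N slots (Sattolo's algorithm),
        // so each load below depends on the previous, cache-hostile one.
        std::vector<std::size_t> next(N);
        std::iota(next.begin(), next.end(), std::size_t{0});
        std::mt19937 rng{42};
        for (std::size_t i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        auto time_ms = [](auto&& fn) {
            auto t0 = std::chrono::steady_clock::now();
            fn();
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::milli>(t1 - t0).count();
        };

        long long sum = 0;
        double seq = time_ms([&] { for (int v : data) sum += v; });
        double chase = time_ms([&] {
            std::size_t i = 0;
            for (std::size_t k = 0; k < N; ++k) { sum += data[i]; i = next[i]; }
        });

        std::printf("sequential: %.1f ms, pointer-chase: %.1f ms (sum=%lld)\n",
                    seq, chase, sum);
    }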