[00:00.000 --> 00:15.360] So now we have Orson, he's going to be talking about GlideSort, very beautiful opening slides, [00:15.360 --> 00:17.640] so yeah, take it away. [00:17.640 --> 00:18.640] Thank you. [00:18.640 --> 00:19.640] Can everyone hear me? [00:19.640 --> 00:20.640] All right. [00:20.640 --> 00:21.640] Good. [00:21.640 --> 00:22.960] Thanks for coming. [00:22.960 --> 00:26.640] So my name is Orson, and I'm here to present GlideSort. [00:26.640 --> 00:31.600] I did this research at the CWI Database Architectures group. GlideSort: [00:31.600 --> 00:32.600] what is it? [00:32.600 --> 00:35.760] It's a general-purpose, stable comparison sort. [00:35.760 --> 00:38.200] Does everyone here understand what that means? [00:38.200 --> 00:39.200] No. [00:39.200 --> 00:40.200] Oh, okay. [00:40.200 --> 00:42.560] Well, good luck. [00:42.560 --> 00:45.800] So stable means that it does not reorder equal elements. [00:45.800 --> 00:49.880] They stay in their original order, so essentially it makes sorting deterministic. [00:49.880 --> 00:52.040] GlideSort is a hybrid. [00:52.040 --> 00:56.800] It's a hybrid of merge sort, quicksort, and block insertion sort, which is a variant of [00:56.800 --> 01:03.160] insertion sort, and it is robustly adaptive to pre-sorted and low-cardinality inputs. [01:03.160 --> 01:06.240] Don't worry, I'll talk about what that means. [01:06.240 --> 01:11.520] I made a reference implementation in partially unsafe Rust, and you can think of it, if you're [01:11.520 --> 01:16.560] programming in Rust, as a drop-in for the standard library's stable slice sort algorithm. [01:16.560 --> 01:19.000] So you might wonder: stable quicksort? [01:19.000 --> 01:22.920] The answer is yes. [01:22.920 --> 01:28.720] A guy named Igor van den Hoven did very good work on fluxsort, [01:28.720 --> 01:34.200] where he showed that indeed you can do stable quicksort efficiently. [01:34.200 --> 01:40.760] Wikipedia will tell you that quicksort is in-place, that it is done using element exchanges, [01:40.760 --> 01:45.560] and it will literally tell you that efficient implementations of quicksort are not a stable [01:45.560 --> 01:46.560] sort. [01:46.560 --> 01:50.520] Wikipedia tells you: no, you cannot do it. [01:50.520 --> 01:56.320] A standard stable sort uses extra memory to do its sorting, and if you tell people, hey, [01:56.320 --> 02:00.520] you can do the same with a stable quicksort, they completely lose their minds. [02:00.520 --> 02:02.960] No, quicksort is in-place, you cannot do that. [02:02.960 --> 02:03.960] That's not true. [02:03.960 --> 02:07.840] You can, and you probably should. [02:07.840 --> 02:10.040] So earlier I mentioned adaptive sorting. [02:10.040 --> 02:11.760] What do I mean by that? [02:11.760 --> 02:18.360] To adapt is to change your behavior to deal with new information or a new situation. [02:18.360 --> 02:25.200] And there are, in my opinion, two major ways that you can be adaptive [02:25.200 --> 02:27.280] in sorting. [02:27.280 --> 02:29.720] And they correspond to two schools of sorting. [02:29.720 --> 02:32.000] There is the bottom-up school of sorting. [02:32.000 --> 02:36.880] Those are your merge sorts and your merge sort variants, like Timsort and Powersort. [02:36.880 --> 02:38.200] And they are bottom-up. [02:38.200 --> 02:43.480] They construct larger and larger sorted sequences from smaller sorted sequences.
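To make "stable" concrete, here is a minimal, hedged illustration in plain Rust (not glidesort-specific, and the names are invented): when we sort by only part of each element, equal keys keep their original relative order, so the output is fully deterministic.

    fn main() {
        // (name, city) pairs; we sort by city only, so entries with the same
        // city compare equal from the sort's point of view.
        let mut customers = vec![
            ("alice", "berlin"),
            ("bob", "amsterdam"),
            ("carol", "berlin"),
            ("dave", "amsterdam"),
        ];
        // A stable sort keeps equal keys in their original relative order.
        customers.sort_by_key(|&(_, city)| city);
        assert_eq!(customers, vec![
            ("bob", "amsterdam"),
            ("dave", "amsterdam"),
            ("alice", "berlin"),
            ("carol", "berlin"),
        ]);
    }

Back to the bottom-up school of merge sorts: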
[02:43.480 --> 02:47.800] They are often presented in a schoolbook way top-down, but really fundamentally they are [02:47.800 --> 02:48.800] bottom-up. [02:48.800 --> 02:51.200] And that way they can be adaptive to pre-sorted runs. [02:51.200 --> 02:55.320] If there's already a pre-sorted run in your input, you can just take that as-is and continue [02:55.320 --> 02:56.820] merging up. [02:56.820 --> 03:00.000] There's also the partition school of sorting. [03:00.000 --> 03:03.280] Those are your quicksorts, your sample sorts, your radix sorts. [03:03.280 --> 03:07.080] They partition out or distribute data. [03:07.080 --> 03:08.560] They are fundamentally top-down. [03:08.560 --> 03:12.240] You start at the top, and you partition into smaller and smaller [03:12.240 --> 03:16.840] subpartitions, and that way they can be adaptive to low-cardinality inputs. [03:16.840 --> 03:19.720] So what are low-cardinality inputs? [03:19.720 --> 03:23.960] Essentially you have a lot of data, and you're sorting it by a subset of the data. [03:23.960 --> 03:27.920] So you're sorting your customers, but you're sorting them by which city they live in. [03:27.920 --> 03:31.040] Or you're sorting your cars, but you're sorting by the brand of the car. [03:31.040 --> 03:35.400] And even though you might have hundreds of thousands of cars, you might only have 100 [03:35.400 --> 03:36.400] brands. [03:36.400 --> 03:42.280] So essentially duplicates, at least from the perspective of the comparison operator. [03:42.280 --> 03:45.640] So how does an adaptive quicksort deal with that? [03:45.640 --> 03:50.720] The idea is that during partitioning, we can detect buckets of elements that are all equal [03:50.720 --> 03:52.440] to each other. [03:52.440 --> 03:55.440] And there's a challenge with doing that. [03:55.440 --> 03:59.000] You don't want to do extra unnecessary comparisons. [03:59.000 --> 04:01.120] And we actually want to avoid three-way comparisons. [04:01.120 --> 04:08.040] That's a bit funny, because Rust's basic Ord trait uses three-way comparisons. [04:08.040 --> 04:09.440] But that's a lie. [04:09.440 --> 04:13.760] Under the hood, we turn that back into a two-way comparison, because computers aren't very [04:13.760 --> 04:15.040] good at ternary logic. [04:15.040 --> 04:16.320] They really love binary logic. [04:16.320 --> 04:17.720] They love ifs and elses. [04:17.720 --> 04:21.600] So we still turn that back into two-way comparisons. [04:21.600 --> 04:28.600] And there's been a long history of adaptive quicksorts in this way, with Dijkstra and [04:28.600 --> 04:34.280] Hoare, I still don't know how to pronounce that, working on it. [04:34.280 --> 04:39.640] And already, time flies, eight years ago, I showed in pattern-defeating quicksort [04:39.640 --> 04:43.520] that you can detect this and handle it very efficiently. [04:43.520 --> 04:44.680] So how does that work? [04:44.680 --> 04:49.480] I have an entire earlier talk on pdqsort that you can watch if you're interested in [04:49.480 --> 04:50.480] this. [04:50.480 --> 04:52.440] But essentially, we have two different partition strategies. [04:52.440 --> 04:54.960] A partition-left and a partition-right. [04:54.960 --> 04:58.560] The partition-left puts elements equal to the pivot on the left. [04:58.560 --> 05:01.960] And the partition-right puts equal elements on the right. [05:01.960 --> 05:07.560] And what you do is, when you select a pivot, you check if that pivot is equal to a pivot [05:07.560 --> 05:08.560] we used previously.
[05:08.560 --> 05:14.440] And you can do this efficiently using a single extra comparison during partitioning, or at [05:14.440 --> 05:16.200] least during pivot selection. [05:16.200 --> 05:19.160] And the default is that you put the equal elements on the right. [05:19.160 --> 05:23.040] But if you detect, hey, this pivot is equal to a previous pivot, you put equal elements [05:23.040 --> 05:24.040] on the left. [05:24.040 --> 05:29.320] And this way, you implicitly do a three-way partition using two-way comparisons. [05:29.320 --> 05:35.200] And you can prove that, on average, this means that your sort is O(n log k), where k is the [05:35.200 --> 05:36.520] number of distinct values. [05:36.520 --> 05:40.160] If every value is distinct, that becomes O(n log n), which we're used to. [05:40.160 --> 05:46.760] But if you have a lot of duplicate values, O(n log k) is a lot faster than O(n log n). [05:46.760 --> 05:49.520] There's also adaptive merge sort. [05:49.520 --> 05:53.560] As I said earlier, these merge pre-existing runs in the input. [05:53.560 --> 05:58.800] The problem with solving this is that you want to minimize the number of unbalanced merges [05:58.800 --> 05:59.800] that you do. [05:59.800 --> 06:04.680] So you don't want to merge a very large array with a very small array, because that's quite [06:04.680 --> 06:06.480] inefficient. [06:06.480 --> 06:13.520] And you also want to somehow store, during your algorithm, where the runs are in memory. [06:13.520 --> 06:18.520] And if you do this in a naive way, you potentially have to store a lot of data about [06:18.520 --> 06:21.280] where all the runs are. [06:21.280 --> 06:28.480] And von Neumann invented merge sort very early, and Knuth also described, quite early, a natural [06:28.480 --> 06:31.900] merge sort that takes advantage of pre-existing runs in the input. [06:31.900 --> 06:36.200] And then, in particular, Tim Peters popularized Timsort, which became the default sorting [06:36.200 --> 06:43.640] algorithm in Python, and which really showed the first clever way to keep track of [06:43.640 --> 06:48.280] your run information while minimizing unbalanced merges. [06:48.280 --> 06:55.240] More recent work is Powersort, which essentially extends Timsort, with the same structure [06:55.240 --> 07:03.600] but more clever logic, and it actually has mathematical proofs that it creates balanced merge sequences. [07:03.600 --> 07:08.040] And in fact, Python, I believe, now uses Powersort's logic as well. [07:08.040 --> 07:11.080] So I'm not going to go into detail on how Powersort works. [07:11.080 --> 07:12.920] I don't have enough time for that. [07:12.920 --> 07:18.160] But essentially, the core loop of it is that you create a run, and that can either be [07:18.160 --> 07:24.680] by finding a run in the input or by doing a small sorting algorithm to create a small run. [07:24.680 --> 07:26.040] You compute the power of a run. [07:26.040 --> 07:28.800] That's the heuristic I'm not going to get into. [07:28.800 --> 07:36.080] And then you keep a stack of runs, and then use this power heuristic that we computed [07:36.080 --> 07:38.520] to decide when to merge two runs. [07:38.520 --> 07:44.680] And you can prove that the stack then stays logarithmic in size, and that your merge sequences [07:44.680 --> 07:47.920] are going to be very good. [07:47.920 --> 07:53.880] But yeah, the idea is that "create a run" can take advantage of existing runs in the input. [07:53.880 --> 07:55.320] So a problem emerges.
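As a rough illustration of the equal-pivot trick described a moment ago, here is a toy quicksort of my own. It is not pdqsort's or glidesort's actual partitioning (those are far more careful, and glidesort's is stable and out-of-place); it only shows how a single extra two-way comparison against the previous pivot lets you fence off a whole block of equal elements using nothing but two-way comparisons.

    // Toy sketch: detect "pivot equal to previous pivot" with one extra `<`,
    // then group the equal elements so they never need to be touched again.
    fn partition<T>(v: &mut [T], pred: impl Fn(&T) -> bool) -> usize {
        // Move every element satisfying `pred` to the front; return how many.
        let mut store = 0;
        for i in 0..v.len() {
            if pred(&v[i]) {
                v.swap(i, store);
                store += 1;
            }
        }
        store
    }

    fn quicksort<T: Ord + Clone>(v: &mut [T], prev_pivot: Option<T>) {
        if v.len() < 2 {
            return;
        }
        let pivot = v[v.len() / 2].clone(); // real pivot selection is smarter

        if let Some(prev) = &prev_pivot {
            // Every element here is >= prev, so if the new pivot is not greater
            // than prev it must be equal to it: we have seen this value before.
            if !(prev < &pivot) {
                // "Partition left": group everything equal to prev at the front.
                // That block is already sorted, so recurse only on the rest.
                let equal = partition(v, |x| !(prev < x));
                quicksort(&mut v[equal..], Some(prev.clone()));
                return;
            }
        }

        // "Partition right": strictly smaller elements in front, everything
        // greater or equal (including the pivot's duplicates) behind them.
        let less = partition(v, |x| x < &pivot);
        quicksort(&mut v[..less], prev_pivot);
        quicksort(&mut v[less..], Some(pivot));
    }

    fn main() {
        let mut v = vec![3, 1, 2, 3, 1, 2, 3];
        quicksort(&mut v, None);
        assert_eq!(v, vec![1, 1, 2, 2, 3, 3, 3]);
    }

Back to the problem that emerges when you want both kinds of adaptivity: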
[07:55.320 --> 08:00.080] We want to be adaptive to low-cardinality inputs, and we want to be adaptive to pre-existing [08:00.080 --> 08:01.280] runs in the input. [08:01.280 --> 08:06.560] But one is fundamentally bottom-up, and the other one is fundamentally top-down. [08:06.560 --> 08:08.520] And that's why I call this GlideSort. [08:08.520 --> 08:09.520] We glide. [08:09.520 --> 08:10.760] What do I mean by that? [08:10.760 --> 08:15.120] Well, the idea is that a soaring bird only flaps its wings when necessary. [08:15.120 --> 08:17.240] GlideSort only sorts when necessary. [08:17.240 --> 08:24.960] So during this create-run process, I'm sorry, before that, I changed the concept of a run [08:24.960 --> 08:26.120] to a logical run. [08:26.120 --> 08:28.560] And a logical run can be one of three things. [08:28.560 --> 08:29.960] It can be, just as before, [08:29.960 --> 08:34.360] a sorted range of elements in your array; it can also be an unsorted [08:34.360 --> 08:38.840] range of elements in your array, or two sorted ranges that are right next to each other in [08:38.840 --> 08:42.720] your array. [08:42.720 --> 08:44.880] We change the create-run function. [08:44.880 --> 08:51.160] We do, in fact, if there's a run in the input, detect that and return it as a sorted run. [08:51.160 --> 08:54.080] But if we don't detect a sorted run, we just return an unsorted run. [08:54.080 --> 09:00.440] We don't eagerly sort anything, and how you do that is very simple. [09:00.440 --> 09:04.680] You just scan through the array, and if you find a run that we consider big enough, we [09:04.680 --> 09:12.040] return it, and otherwise, we just skip some elements and return an unsorted run. [09:12.040 --> 09:17.360] And then you add quite a bit of code for merging two runs, but it's actually relatively [09:17.360 --> 09:18.880] simple. [09:18.880 --> 09:24.240] As long as two unsorted runs, concatenated, fit in our scratch space, which is essentially [09:24.240 --> 09:30.240] this extra memory that blows people's minds, as long as they fit in that, we just concatenate [09:30.240 --> 09:32.080] our unsorted runs. [09:32.080 --> 09:39.320] And otherwise, we actively, physically sort the elements using quicksort, and then create [09:39.320 --> 09:44.280] one of these sorted or two-adjacent-sorted run cases. [09:44.280 --> 09:49.840] If we have two sorted runs, we concatenate them. [09:49.840 --> 09:59.440] If we have an unsorted run and something else, we actually sort this unsorted run and then [09:59.440 --> 10:02.080] recurse. [10:02.080 --> 10:04.960] And finally, we have our actual physical merges. [10:04.960 --> 10:12.880] So when we can no longer be lazy, when we can no longer glide, we have to actually merge elements. [10:12.880 --> 10:16.800] So that is essentially the main loop of GlideSort. [10:16.800 --> 10:23.120] So it's an extension of Powersort, but you can apply the same logic to any natural stable [10:23.120 --> 10:24.680] merge sort. [10:24.680 --> 10:26.440] We don't eagerly sort small runs. [10:26.440 --> 10:31.480] We keep them as unsorted runs as long as possible. [10:31.480 --> 10:37.480] And this way, we transform the sorting problem into a sequence of quicksort calls and triple/ [10:37.480 --> 10:39.520] quad merges. [10:39.520 --> 10:43.520] And doing this, we are adaptive to pre-sorted runs and low-cardinality inputs at the same [10:43.520 --> 10:46.440] time. [10:46.440 --> 10:48.760] So why triple and quad merges?
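Before getting to the triple and quad merges, here is a much-simplified sketch of the main loop just described, in my own words and types rather than glidesort's real internals (the names, the u32 element type, and the use of std's sort as a stand-in for glidesort's stable quicksort and its quad merge are all illustrative assumptions):

    use std::ops::Range;

    // A logical run only *describes* a region of the array; nothing is sorted eagerly.
    enum LogicalRun {
        Unsorted(Range<usize>),
        Sorted(Range<usize>),
        // Two sorted ranges sitting right next to each other, not yet merged.
        DoubleSorted(Range<usize>, Range<usize>),
    }

    fn bounds(r: &LogicalRun) -> Range<usize> {
        use LogicalRun::*;
        match r {
            Unsorted(x) | Sorted(x) => x.clone(),
            DoubleSorted(x, y) => x.start..y.end,
        }
    }

    // Combine two adjacent logical runs, staying lazy for as long as possible.
    fn combine(v: &mut [u32], scratch_len: usize, a: LogicalRun, b: LogicalRun) -> LogicalRun {
        use LogicalRun::*;
        match (a, b) {
            // Two unsorted runs that, concatenated, still fit in the scratch
            // space are simply concatenated into a bigger unsorted run.
            (Unsorted(x), Unsorted(y)) if y.end - x.start <= scratch_len => {
                Unsorted(x.start..y.end)
            }
            // Otherwise, physically sort the unsorted part (glidesort uses its
            // stable quicksort here; std's stable sort stands in for it) and retry.
            (Unsorted(x), b) => {
                v[x.clone()].sort();
                combine(v, scratch_len, Sorted(x), b)
            }
            (a, Unsorted(y)) => {
                v[y.clone()].sort();
                combine(v, scratch_len, a, Sorted(y))
            }
            // Two sorted runs are just remembered as a pair; the physical merge
            // is postponed so it can later be part of a triple/quad merge.
            (Sorted(x), Sorted(y)) => DoubleSorted(x, y),
            // No more room to be lazy: physically merge everything. A plain
            // stable sort of the combined range stands in for the real merges.
            (a, b) => {
                let r = bounds(&a).start..bounds(&b).end;
                v[r.clone()].sort();
                Sorted(r)
            }
        }
    }

    fn main() {
        let mut v = vec![5, 4, 1, 2, 3, 7, 8, 9];
        let run = combine(&mut v, 4, LogicalRun::Unsorted(0..4), LogicalRun::Sorted(4..8));
        // The unsorted half got sorted, and we now remember two adjacent sorted runs.
        assert!(matches!(run, LogicalRun::DoubleSorted(_, _)));
        assert_eq!(v, vec![1, 2, 4, 5, 3, 7, 8, 9]);
    }

Back to the question of why triple and quad merges: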
[10:48.760 --> 10:51.600] And there are three main reasons. [10:51.600 --> 10:58.240] There's ping-pong merging, bidirectional merging, oh, sorry, before that I want to quite clearly [10:58.240 --> 11:04.160] mention something: GlideSort is not the first algorithm that is adaptive to both of these [11:04.160 --> 11:09.160] categories at the same time, but to my knowledge, at least, it is the first algorithm that is [11:09.160 --> 11:10.320] robustly adaptive. [11:10.320 --> 11:14.240] So it does not hard-code anything, it does not use heuristics to decide when to switch [11:14.240 --> 11:20.920] to which algorithm; it detects this completely naturally based on the input. [11:20.920 --> 11:23.040] So why triple/quad merges? [11:23.040 --> 11:24.240] There are three main reasons. [11:24.240 --> 11:28.520] Ping-pong merging, bidirectional merging, and parallel merging. [11:28.520 --> 11:32.440] Ping-pong merging is not my idea; it's found in two earlier projects, once again by Igor [11:32.440 --> 11:37.040] van den Hoven, and in an earlier paper, Patience is a Virtue. [11:37.040 --> 11:42.600] And the idea is that in a traditional merge, you copy out part of the data someplace else [11:42.600 --> 11:48.400] and then merge back into the original array; that's an extra memcpy. [11:48.400 --> 11:55.440] With a triple or a quad merge, you can merge both into your scratch space [11:55.440 --> 11:59.200] and on the way back, because essentially when you do an out-of-place merge, you get a [11:59.200 --> 12:02.240] memcpy for free, because you're moving to some other place anyway. [12:02.240 --> 12:05.920] So I think that's best described visually: if you have four, so in this case a quad merge, [12:05.920 --> 12:12.880] you have four sorted runs, you merge two into your scratch space, you merge two [12:12.880 --> 12:16.640] more into your scratch space, and you merge the two results back. [12:16.640 --> 12:22.360] And now we have eliminated three memcpys that we don't have to do; that's one advantage [12:22.360 --> 12:25.080] of being lazy with merging. [12:25.080 --> 12:33.240] We can also do bidirectional merging. This, to my knowledge, was first done, again by Igor [12:33.240 --> 12:37.760] van den Hoven, in quadsort, where he described a parity merge, a very clever [12:37.760 --> 12:43.640] technique to merge two equal-length arrays without any branches, [12:43.640 --> 12:47.160] by merging from both ends at the same time. [12:47.160 --> 12:55.080] But then I really looked into why that was fast and how we can extend that [12:55.080 --> 12:57.680] and use it further. [12:57.680 --> 13:01.480] So the idea behind a bidirectional merge is that if your destination and your source [13:01.480 --> 13:06.480] arrays are disjoint, you can merge from both ends at the same time. [13:06.480 --> 13:11.120] And then the pointer that's going from right to left does not interfere with the pointer [13:11.120 --> 13:12.680] going from left to right. [13:12.680 --> 13:15.760] These two logics are independent, essentially. [13:15.760 --> 13:19.440] And it essentially looks like that. [13:19.440 --> 13:20.940] Why? [13:20.940 --> 13:22.760] Why do we want to do that? [13:22.760 --> 13:30.760] Well, modern processors are quite different from the traditional processor [13:30.760 --> 13:31.960] you might have in your mental image.
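Here is a small sketch of the ping-pong quad merge described above, in my own simplified Rust (concrete u32 elements and plain index-based merges; glidesort's real code is generic, branchless and pointer-based): the first two merges land in the scratch buffer, and the third merge lands back in the original array, so the usual copy-out step disappears.

    // Stable merge of two sorted slices into `dst` (dst.len() == a.len() + b.len()).
    fn merge_into(a: &[u32], b: &[u32], dst: &mut [u32]) {
        let (mut i, mut j) = (0, 0);
        for d in dst.iter_mut() {
            // `<=` keeps the merge stable: ties are taken from `a` first.
            if i < a.len() && (j >= b.len() || a[i] <= b[j]) {
                *d = a[i];
                i += 1;
            } else {
                *d = b[j];
                j += 1;
            }
        }
    }

    // Ping-pong quad merge: v[r[0]..r[4]] holds four adjacent sorted runs,
    // with boundaries r[0] <= r[1] <= ... <= r[4].
    fn quad_merge(v: &mut [u32], scratch: &mut [u32], r: [usize; 5]) {
        // The first two merges go *into* the scratch space...
        merge_into(&v[r[0]..r[1]], &v[r[1]..r[2]], &mut scratch[r[0]..r[2]]);
        merge_into(&v[r[2]..r[3]], &v[r[3]..r[4]], &mut scratch[r[2]..r[4]]);
        // ...and the final merge goes back into `v`, so no separate copy-back
        // (memcpy) is ever needed.
        merge_into(&scratch[r[0]..r[2]], &scratch[r[2]..r[4]], &mut v[r[0]..r[4]]);
    }

    fn main() {
        let mut v = vec![1, 5, 2, 6, 3, 7, 4, 8]; // four sorted runs of length two
        let mut scratch = vec![0u32; v.len()];
        quad_merge(&mut v, &mut scratch, [0, 2, 4, 6, 8]);
        assert_eq!(v, vec![1, 2, 3, 4, 5, 6, 7, 8]);
    }

Back to modern processors: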
[13:31.960 --> 13:35.920] They are superscalar: that means they don't execute one instruction per cycle, no, they [13:35.920 --> 13:38.840] can execute many instructions per cycle. [13:38.840 --> 13:43.280] They are out-of-order: the processor will internally reorder the instructions [13:43.280 --> 13:48.120] in your assembly based on the data paths and on when memory is available. [13:48.120 --> 13:49.560] And they are deeply pipelined. [13:49.560 --> 13:53.920] That means that they don't like it when the next instruction depends immediately on the [13:53.920 --> 13:57.200] result of the previous instruction, because it has to go through the entire pipeline of [13:57.200 --> 14:00.600] the processor. [14:00.600 --> 14:05.680] So to study that in a bit more detail, we look at a branchless merge, which was first [14:05.680 --> 14:09.640] described in "Branch Mispredictions Don't Affect Mergesort". [14:09.640 --> 14:11.800] This is not the code that they used in that paper. [14:11.800 --> 14:16.040] This is roughly translated from GlideSort. [14:16.040 --> 14:18.920] You don't have to get into how it works. [14:18.920 --> 14:26.280] The important part is that you analyze where each result is used, which is on the next slide. [14:26.280 --> 14:31.320] Well, you find that generally all the data that's computed is needed immediately. [14:31.320 --> 14:35.520] And the worst part of it all is that the next iteration cannot really start until the [14:35.520 --> 14:36.960] previous iteration is finished. [14:36.960 --> 14:40.640] When you're merging two arrays, you need to know: am I continuing with the [14:40.640 --> 14:42.880] left array or am I continuing with the right array? [14:42.880 --> 14:46.840] There are a lot of data dependencies. [14:46.840 --> 14:52.440] So that is my main takeaway from GlideSort, and my main low-level design principle is [14:52.440 --> 14:55.440] to interleave independent branchless loops. [14:55.440 --> 14:59.960] So branchless is important, so the processor isn't jumping around and constantly canceling [14:59.960 --> 15:03.160] your pipeline. [15:03.160 --> 15:10.200] And by interleaving, we can hide some of these data dependencies. [15:10.200 --> 15:16.320] The processor can execute multiple instructions at once, and it can essentially reduce the [15:16.320 --> 15:24.240] impact of having to constantly wait for the previous result. [15:24.240 --> 15:25.800] You can also consider parallel merging. [15:25.800 --> 15:30.720] So in this case, we had one merge where we did it in parallel from the left and in parallel [15:30.720 --> 15:33.320] from the right. [15:33.320 --> 15:39.280] But we also noticed that the first step in our quad merge has two independent merges. [15:39.280 --> 15:43.760] These are essentially parallel; we're not using threads, but we can interleave [15:43.760 --> 15:45.920] their loops. [15:45.920 --> 15:51.280] So once I discovered that, I thought, let's create more parallelism. [15:51.280 --> 15:57.920] By doing a binary search, you can identify a split point where you can turn one merge [15:57.920 --> 16:06.400] into two smaller merges by swapping the right blocks in the middle. [16:06.400 --> 16:12.120] I won't go into the exact logic or proof of that, but you can. [16:12.120 --> 16:16.680] And in fact, if you are doing an out-of-place merge, you can do this swap implicitly by [16:16.680 --> 16:19.120] just reassigning pointers. [16:19.120 --> 16:21.360] So there's no actual physical memcpy going on.
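Here is a rough sketch of my own of a branchless merge step (not the code from the paper and not glidesort's): the comparison result is consumed as an integer, so the hot loop has no unpredictable branch, but notice how each line feeds straight into the next, which is exactly the dependency chain that interleaving two independent merges is meant to hide.

    // Merge sorted `a` and `b` into `dst` without a data-dependent branch in
    // the hot loop (dst.len() == a.len() + b.len()).
    fn branchless_merge(a: &[u32], b: &[u32], dst: &mut [u32]) {
        let (mut i, mut j, mut k) = (0, 0, 0);
        while i < a.len() && j < b.len() {
            // Every line below feeds straight into the next one: a long
            // dependency chain, but no unpredictable branch on the data.
            let take_a = a[i] <= b[j];                 // comparison result as a bool
            dst[k] = if take_a { a[i] } else { b[j] }; // usually becomes a conditional move
            i += take_a as usize;                      // advance exactly one of the
            j += !take_a as usize;                     // two source indices
            k += 1;
        }
        // One input is exhausted; copy the remaining tail of the other.
        dst[k..k + (a.len() - i)].copy_from_slice(&a[i..]);
        dst[k + (a.len() - i)..].copy_from_slice(&b[j..]);
    }

    fn main() {
        let a = [1, 3, 5, 7];
        let b = [2, 2, 4, 8];
        let mut dst = [0u32; 8];
        branchless_merge(&a, &b, &mut dst);
        assert_eq!(dst, [1, 2, 2, 3, 4, 5, 7, 8]);
    }

Back to the split-point swap: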
[16:21.360 --> 16:27.880] However, if you're doing an in-place merge, you can actually do the physical swap. [16:27.880 --> 16:32.280] And now you have, for free, a fallback for low-memory merging. [16:32.280 --> 16:37.960] So even if you don't have a large buffer available to merge with, you can use this algorithm [16:37.960 --> 16:42.720] to do it in a low amount of memory. [16:42.720 --> 16:45.920] Then I also optimized the quicksort portion of it with the same principle. [16:45.920 --> 16:48.960] I came up with what I call bidirectional stable partitioning. [16:48.960 --> 16:54.920] Again, I don't have time to get into it, but the idea is that we do, again, partition [16:54.920 --> 16:55.920] like in quicksort. [16:55.920 --> 17:00.840] So one set of elements, the ones that are less than the pivot, goes to one place. [17:00.840 --> 17:01.840] They go here. [17:01.840 --> 17:04.120] And the ones that are greater or equal go somewhere else. [17:04.120 --> 17:08.400] But we do it both from the left-hand side to the right, and from the right-hand side [17:08.400 --> 17:09.400] to the left. [17:09.400 --> 17:12.600] And these two loops are independent from each other, so we can interleave them. [17:12.600 --> 17:15.440] Same principle. [17:15.440 --> 17:19.400] When you recurse, it gets a bit more involved, because now your data is in multiple different [17:19.400 --> 17:21.760] locations. [17:21.760 --> 17:30.080] I can tell you this is not fun to program, but I did it, and here it is. [17:30.080 --> 17:37.080] So I do have some experiments to show you. Very briefly, the experimental setup: [17:37.080 --> 17:46.440] this is a lot of text that basically says a 2021 Apple MacBook. [17:46.440 --> 17:50.680] And these are the numbers you would get on a 2021 Apple MacBook. [17:50.680 --> 17:53.840] So at the top, I have two variants of GlideSort. [17:53.840 --> 17:58.120] One is the default variant that you would get if you were to download it. [17:58.120 --> 18:05.600] GlideSort 1024 is a variant that uses a fixed amount of memory, so 1024 elements of memory. [18:05.600 --> 18:10.400] Then we have the Rust stable sort, the C++ standard stable sort, an implementation of [18:10.400 --> 18:18.000] Timsort, pdqsort, an older algorithm of mine, which is also the unstable Rust sorting [18:18.000 --> 18:25.640] algorithm, and whatever ships as the standard sort in C++. [18:25.640 --> 18:29.360] You can read the slides yourself. [18:29.360 --> 18:34.960] GlideSort is quite a bit faster than the Rust stable sort right now. [18:34.960 --> 18:43.520] What isn't shown on this page are some more competitive algorithms, like fluxsort and quadsort. [18:43.520 --> 18:50.280] So we trade blows on different data sets, but those are written in C, and [18:50.280 --> 18:56.160] they don't have to deal with some of the problems that we deal with, which I'll get to later in [18:56.160 --> 19:00.920] my talk when I discuss sorting in Rust. [19:00.920 --> 19:05.200] If you actually change your comparison operator, so we're only sorting by the last byte of [19:05.200 --> 19:12.520] the integer, fun fact: if you do this, stability even becomes observable for integers. [19:12.520 --> 19:18.120] GlideSort, again, speeds up even more compared to the Rust stable sorting algorithm. [19:18.120 --> 19:26.200] So now we're over an order of magnitude faster for these data sets. [19:26.200 --> 19:29.960] If you want to use it, good news: it's released. [19:29.960 --> 19:33.000] It took me a while, but it's finally out.
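As a hedged illustration of the stable-partitioning part only, here is a single-direction, branchy version of my own (glidesort's actual routine is branchless, runs one loop from each end of the array, and interleaves the two); it just shows the core idea of stable partitioning with scratch space.

    // Stable partition by a pivot: elements < pivot keep their order at the
    // front of `v`, the rest keep their order and end up right behind them.
    fn stable_partition(v: &mut [u32], scratch: &mut [u32], pivot: u32) -> usize {
        let (mut lt, mut ge) = (0, 0);
        for i in 0..v.len() {
            let x = v[i];
            if x < pivot {
                v[lt] = x; // never overtakes the read position: lt <= i
                lt += 1;
            } else {
                scratch[ge] = x; // greater-or-equal elements park in scratch
                ge += 1;
            }
        }
        v[lt..].copy_from_slice(&scratch[..ge]);
        lt // number of elements strictly less than the pivot
    }

    fn main() {
        let mut v = vec![5, 1, 6, 2, 7, 3];
        let mut scratch = vec![0u32; v.len()];
        let lt = stable_partition(&mut v, &mut scratch, 4);
        assert_eq!(lt, 3);
        assert_eq!(v, vec![1, 2, 3, 5, 6, 7]);
    }

Back to the release: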
[19:33.000 --> 19:39.760] You can just cargo add glidesort, and you can replace your call to the standard slice sort with [19:39.760 --> 19:41.160] GlideSort. [19:41.160 --> 19:46.360] If there are any standard library people in the audience, come talk to me after the talk. [19:46.360 --> 19:50.440] I would love to see it integrated, so you don't have to call GlideSort explicitly, and it would [19:50.440 --> 19:54.600] just be used by default. [19:54.600 --> 19:58.960] But this is a Rust dev room, so at least some of you are probably interested in Rust, [19:58.960 --> 20:03.760] so I will also talk about some Rust specifics: what it takes to implement a sorting algorithm [20:03.760 --> 20:05.560] in Rust. [20:05.560 --> 20:15.960] And first I'm just going to rant: unwinding panics are, I think, Rust's billion-dollar mistake. [20:15.960 --> 20:17.600] They are a complete nightmare. [20:17.600 --> 20:21.560] If you are writing unsafe code and you've ever had to deal with panics (some people in [20:21.560 --> 20:27.480] the audience are laughing), they're horrible, because essentially, since you can catch them [20:27.480 --> 20:34.560] and we have to be sound and safe during a panic, they're essentially the same as C++ [20:34.560 --> 20:36.680] exceptions. [20:36.680 --> 20:41.800] In C++, all these functions say: if you throw an exception, tough shit, your vector [20:41.800 --> 20:46.640] is invalid now, too bad, it doesn't matter, you can't use it. You don't have that choice [20:46.640 --> 20:54.320] in Rust; you have to always be safe and sound in Rust, and ensuring that is a nightmare, especially [20:54.320 --> 20:59.080] when you're dealing with generic code in unsafe code. [20:59.080 --> 21:06.080] So if you're calling foreign code, anything you do, any call, can panic, which causes [21:06.080 --> 21:07.080] an unwind. [21:07.080 --> 21:11.520] So whenever you call a foreign function, you have to make sure that you are in a sound [21:11.520 --> 21:14.760] and safe state. [21:14.760 --> 21:18.360] The problem is every single trait is foreign code. [21:18.360 --> 21:20.560] That Clone call, that's foreign code. [21:20.560 --> 21:24.840] This comparison operator, that's foreign code. [21:24.840 --> 21:29.120] Writing GlideSort was a complete nightmare: every time I compare two elements, that could [21:29.120 --> 21:32.920] cause a panic, that could cause an unwind, and you saw all this stuff that I'm doing [21:32.920 --> 21:38.120] with arrays all over the place; all of that has to be restored to the original location, [21:38.120 --> 21:42.400] because it's a mutable slice, and you cannot leave a mutable slice in an unsound state, you can't leave [21:42.400 --> 21:46.680] holes in it, everything has to be returned to the original array. [21:46.680 --> 21:52.920] Yeah, it's a nightmare. I really wish that, instead of panicking, we would just write [21:52.920 --> 21:56.960] out a stack trace and abort and be done with it. [21:56.960 --> 22:00.640] I hate it, that's my rant. [22:00.640 --> 22:08.040] Oh yeah, well, and in fact, GlideSort has an actual, real performance penalty because [22:08.040 --> 22:11.320] panics are a thing. [22:11.320 --> 22:18.320] I can't just write... like, if you're writing an insertion sort, for example, in C++ or [22:18.320 --> 22:24.400] in Python, you would just have a loop with a loop variable and you would put the items [22:24.400 --> 22:26.200] in the correct place.
[22:26.200 --> 22:30.200] If you're implementing a thing like this in Rust and you're leaving gaps, so you're [22:30.200 --> 22:35.240] moving the element out during the insertion sort, you have to have a drop handler that [22:35.240 --> 22:40.680] puts this element back during a panic, because this Ord implementation, this foreign code, [22:40.680 --> 22:44.120] can cause a panic and cause an unwind. [22:44.120 --> 22:48.760] So even when I'm sorting something like integers, which cannot panic, if I don't want to duplicate [22:48.760 --> 22:54.360] my entire code base, I still have to pay this penalty for dealing with the potential for [22:54.360 --> 22:58.880] panics by storing all my data and all this algorithm state in structs. [22:58.880 --> 23:02.680] So yeah, that's a problem. [23:02.680 --> 23:06.800] But I also want to praise Rust where it is pleasurable. [23:06.800 --> 23:08.840] I love that moves are memcpys. [23:08.840 --> 23:09.920] There's no move constructor. [23:09.920 --> 23:15.200] If you want to move a type somewhere else, you essentially just copy it and you ignore [23:15.200 --> 23:17.480] wherever it came from. [23:17.480 --> 23:23.160] This also makes optimizations possible that aren't possible in C++ because of move constructors, [23:23.160 --> 23:28.520] at least not if you don't want to use template metaprogramming. [23:28.520 --> 23:34.680] For example, instead of copying an element conditionally (this is an example actually from GlideSort, [23:34.680 --> 23:40.480] though not written like this), you place an element in one of two places, and if it's [23:40.480 --> 23:44.440] going to the wrong place, it doesn't matter, because it will just be ignored or overwritten [23:44.440 --> 23:46.720] in the next iteration. [23:46.720 --> 23:51.160] If it's a small type, just place it in both; don't do a branch. [23:51.160 --> 23:54.480] So essentially, this is the opposite of unwinding panics. [23:54.480 --> 23:55.480] There are no surprises. [23:55.480 --> 24:01.120] A memcpy is always what you get. [24:01.120 --> 24:05.600] This next one is not purely praise; it's part praise, part complaint. [24:05.600 --> 24:11.040] split_at_mut, so splitting a slice into two or more. [24:11.040 --> 24:12.040] It's a one-way street. [24:12.040 --> 24:13.040] You cannot go back. [24:13.040 --> 24:14.040] Once it's split, it's split. [24:14.040 --> 24:18.080] Unless you go back to the original object, but that's not always an option. [24:18.080 --> 24:25.440] In GlideSort, when I concatenate these arrays, these slices, I need actual concatenation. [24:25.440 --> 24:30.000] My options were raw pointers, or [24:30.000 --> 24:34.840] storing an array with indices, but now you're storing an extra pointer everywhere [24:34.840 --> 24:39.480] and passing an extra pointer everywhere, and that's overhead that I didn't want to pay. [24:39.480 --> 24:42.000] So I came up with a thing I call branded slices. [24:42.000 --> 24:46.840] You could hold an entire talk on this, but it's essentially applying the idea of GhostCell. [24:46.840 --> 24:47.840] Some of you might have heard of this. [24:47.840 --> 24:52.920] You essentially brand a type with a unique [24:52.920 --> 24:54.840] lifetime that you cannot create yourself. [24:54.840 --> 24:59.480] You can only create this lifetime once, and it's not interchangeable with any other lifetime. [24:59.480 --> 25:04.000] And with that, you can make concatenation safe.
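Going back to the insertion-sort gap and its drop handler for a moment, here is a sketch written by me in the spirit of the guard the standard library's stable sort uses internally; it is not glidesort's actual code. While one element is held outside the slice, the guard remembers the current gap, so if the comparison panics and unwinds, dropping the guard writes the element back and the slice is never left with a hole.

    use std::mem::ManuallyDrop;
    use std::ptr;

    // Guard for the "element temporarily moved out of the slice" state: on drop
    // (normal exit *or* unwinding), the held value is written into the gap.
    struct Hole<T> {
        value: ManuallyDrop<T>,
        gap: *mut T,
    }

    impl<T> Drop for Hole<T> {
        fn drop(&mut self) {
            // Safety: `gap` points into the slice the value was read out of.
            unsafe { ptr::write(self.gap, ManuallyDrop::take(&mut self.value)) }
        }
    }

    // Insert v[i] into the already-sorted prefix v[..i] by shifting larger
    // elements one slot to the right.
    fn insert<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], i: usize, is_less: &mut F) {
        unsafe {
            let base = v.as_mut_ptr();
            // Read v[i] out, leaving a logical gap at position i.
            let mut hole = Hole {
                value: ManuallyDrop::new(ptr::read(base.add(i))),
                gap: base.add(i),
            };
            let mut j = i;
            // Each `is_less` call is foreign code and may panic; if it does,
            // unwinding drops `hole`, which plugs the current gap.
            while j > 0 && is_less(&hole.value, &*base.add(j - 1)) {
                ptr::copy_nonoverlapping(base.add(j - 1), base.add(j), 1);
                j -= 1;
                hole.gap = base.add(j);
            }
            // Normal exit: dropping `hole` writes the value into its final slot.
        }
    }

    fn insertion_sort<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], mut is_less: F) {
        for i in 1..v.len() {
            insert(v, i, &mut is_less);
        }
    }

    fn main() {
        let mut v = vec![3, 1, 2, 5, 4];
        insertion_sort(&mut v, |a: &i32, b: &i32| a < b);
        assert_eq!(v, vec![1, 2, 3, 4, 5]);
    }

Back to the branded slices: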
[25:04.000 --> 25:08.080] So you could just check: is the end pointer equal to the begin pointer of the other array? [25:08.080 --> 25:09.840] If yes, we can concatenate. [25:09.840 --> 25:14.040] And that will work, except that could also just happen by chance because of the [25:14.040 --> 25:18.080] local array layout on the stack, and then you could create unsound behavior. [25:18.080 --> 25:24.880] But if you know that they came from the same allocation, then it's safe to concatenate [25:24.880 --> 25:27.520] them after checking that end equals begin. [25:27.520 --> 25:35.360] So that's what I did with what I call MutSlice: in GlideSort, every single slice is [25:35.360 --> 25:41.760] a MutSlice type, which has a brand, so you can do the safe concatenation, and it has [25:41.760 --> 25:44.560] a state, which is one of five things. [25:44.560 --> 25:48.080] It's Weak, Uninit, MaybeInit, Init, or AlwaysInit. [25:48.080 --> 25:54.440] AlwaysInit is essentially a regular mutable slice, so you always have to return it to an initialized [25:54.440 --> 25:55.440] state. [25:55.440 --> 26:01.080] And Uninit and Init are a bit more specialized than just MaybeInit, where the type doesn't [26:01.080 --> 26:06.120] really encode what it actually contains, and Weak is essentially just a pair of pointers. [26:06.120 --> 26:12.840] And then the code becomes a lot more readable and a lot more verifiable by explicitly encoding [26:12.840 --> 26:18.400] your assumptions about your slice using the type, and then calling functions like [26:18.400 --> 26:22.040] upgrade to say: hey, this now becomes an exclusive mutable slice, [26:22.040 --> 26:29.200] I'm only going to access it here; or: hey, I'm now going to temporarily invalidate the [26:29.200 --> 26:33.800] initialization state of this slice. [26:33.800 --> 26:35.440] That was essentially my talk. [26:35.440 --> 26:42.680] I'm leaving academia, so if you have an interesting, potentially Rust, job, my contact details [26:42.680 --> 26:48.120] are on the slides, or come talk to me after the talk. [26:48.120 --> 26:58.120] I'm not interested in cryptocurrency, Web3, or similar endeavors. [26:58.120 --> 27:10.840] I love cryptography, but I don't know, some of this stuff gets rather sketchy, no offense. [27:10.840 --> 27:12.000] That was essentially my talk. [27:12.000 --> 27:13.880] I'm going to leave it on this slide. [27:13.880 --> 27:30.320] Are there any questions? [27:30.320 --> 27:31.320] I have a question. [27:31.320 --> 27:38.040] Did you test GlideSort on, let's say, less modern CPUs, like embedded CPUs that don't [27:38.040 --> 27:40.760] have out-of-order execution, et cetera? [27:40.760 --> 27:41.760] Yes. [27:41.760 --> 27:46.800] Can you repeat the question, please? [27:46.800 --> 27:57.240] The question was: did you test the algorithm on any older CPUs that might not have as much [27:57.240 --> 27:59.240] instruction-level parallelism and that kind of stuff? [27:59.240 --> 28:04.280] The answer is yes, and yes, it is slower than other state-of-the-art sorts that don't do these [28:04.280 --> 28:06.280] tricks. [28:06.280 --> 28:10.400] This is really aimed towards, essentially, the future of modern processors. [28:10.400 --> 28:15.960] On older CPUs, it is slower than, for example, fluxsort, which doesn't do this aggressive [28:15.960 --> 28:19.080] interleaving.
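Stepping back to the branded slices for a moment, here is a compressed sketch of my own of the branding trick (a GhostCell-style invariant lifetime; glidesort's MutSlice carries much more than this, including the initialization states just described): because the brand can only be created inside one closure, two pieces with the same brand provably come from the same allocation, and the end-equals-begin check is then enough for safe concatenation.

    use std::marker::PhantomData;

    // A branded view of part of one allocation. The `fn(&'brand ()) -> &'brand ()`
    // marker makes `'brand` invariant, so brands from different `with_brand`
    // calls can never be mixed up.
    #[derive(Clone, Copy)]
    struct Branded<'brand> {
        begin: *const u32,
        end: *const u32,
        _brand: PhantomData<fn(&'brand ()) -> &'brand ()>,
    }

    // The brand lifetime is created here and nowhere else.
    fn with_brand<R>(data: &[u32], f: impl for<'brand> FnOnce(Branded<'brand>) -> R) -> R {
        let begin = data.as_ptr();
        let end = unsafe { begin.add(data.len()) }; // one-past-the-end pointer
        f(Branded { begin, end, _brand: PhantomData })
    }

    impl<'brand> Branded<'brand> {
        fn split_at(self, mid: usize) -> (Self, Self) {
            let m = unsafe { self.begin.add(mid) };
            assert!(m <= self.end);
            (Branded { end: m, ..self }, Branded { begin: m, ..self })
        }

        // Only offered for two pieces with the *same* brand, i.e. provably from
        // the same allocation, so `end == begin` really does mean "adjacent".
        fn concat(self, other: Self) -> Option<Self> {
            (self.end == other.begin).then(|| Branded { begin: self.begin, ..other })
        }
    }

    fn main() {
        let data = vec![1u32, 2, 3, 4];
        with_brand(&data, |whole| {
            let (a, b) = whole.split_at(2);
            assert!(a.concat(b).is_some()); // adjacent pieces: safe to rejoin
        });
    }

Back to the question about older CPUs: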
[28:19.080 --> 28:22.360] But if you compare it to the Rust stable sort that's currently in the standard [28:22.360 --> 28:29.240] library, it still completely dumps all over that. [28:29.240 --> 28:31.240] Can you hear me? [28:31.240 --> 28:32.240] Barely. [28:32.240 --> 28:34.040] Can you speak louder? [28:34.040 --> 28:39.000] When you take two sorted sequences and you take the bottom half of one and the top half [28:39.000 --> 28:45.400] of another and create a third sorted sequence out of that, I thought that was an interesting [28:45.400 --> 28:50.080] observation, but what do you use it for? [28:50.080 --> 28:52.080] So it's not the top half and the bottom half. [28:52.080 --> 28:53.080] That's just a simplification. [28:53.080 --> 28:57.840] You're talking about the splitting up of merges into smaller merges, right? [28:57.840 --> 28:58.840] Yes. [28:58.840 --> 28:59.840] Yes. [28:59.840 --> 29:00.840] So it is not the top half and the bottom half. [29:00.840 --> 29:07.240] It involves a binary search to find the unique split point that allows you to do this swap. [29:07.240 --> 29:10.840] It could be the bottom half and the top half, but that's not necessarily the case. [29:10.840 --> 29:11.960] What do I use this for? [29:11.960 --> 29:16.000] It creates two independent merges. [29:16.000 --> 29:19.400] After doing the swap, this merge no longer depends on this merge at all. [29:19.400 --> 29:25.000] And by doing that, I can have two independent loops that merge these, and then I interleave [29:25.000 --> 29:26.000] the bodies of these loops. [29:26.000 --> 29:29.760] So it executes one instruction from this merge, one instruction from this merge, one instruction [29:29.760 --> 29:32.920] from this merge, one instruction from this merge, et cetera. [29:32.920 --> 29:37.060] And that way these instructions don't depend on each other and you can hide these data [29:37.060 --> 29:38.440] dependencies and such. [29:38.440 --> 29:46.120] On top of that, I use it as a fallback for the low-memory case, [29:46.120 --> 29:48.840] so GlideSort can use less auxiliary memory. [29:48.840 --> 29:53.120] We have a last question. [29:53.120 --> 29:55.520] Thanks for the talk. [29:55.520 --> 29:59.320] I would like to know if you have a benchmark... [29:59.320 --> 30:00.400] I'm sorry, I cannot hear you. [30:00.400 --> 30:03.800] Can you speak a bit louder, please? [30:03.800 --> 30:09.760] Did you bench when the array is already sorted? [30:09.760 --> 30:12.480] Did I bench when the array is already sorted? [30:12.480 --> 30:13.480] Yes. [30:13.480 --> 30:17.120] Yes, it's on the slides. [30:17.120 --> 30:18.120] Is it on the slides? [30:18.120 --> 30:23.240] Yes, it's the ascending column on the slides, and on this slide as well. [30:23.240 --> 30:25.280] It's the one percent column? [30:25.280 --> 30:26.280] Sorry? [30:26.280 --> 30:28.280] The one percent column? [30:28.280 --> 30:29.280] No, ascending. [30:29.280 --> 30:30.280] Ascending. [30:30.280 --> 30:31.280] Okay. [30:31.280 --> 30:35.480] Okay, that means sorted in this case, and descending means the reverse of sorted. [30:35.480 --> 30:36.480] Okay, okay. [30:36.480 --> 30:37.480] Thank you. [30:37.480 --> 30:38.480] All right. [30:38.480 --> 30:39.480] Thanks very much. [30:39.480 --> 31:05.480] Thank you very much.