Hello, my name is Vlad, I am a senior C++ developer with roughly eight years of experience. Right now I work at a game dev company called Ubisoft, which some of you have probably heard about, doing mostly low-level networking and some bits of system programming. At Ubisoft, while doing a complete rewrite of our networking stack in one of the game engines, I managed to design, I hope, a cool algorithm for doing so-called task scheduling. Today I share this algorithm in the form of this talk and an open-source C++ implementation of it. Although, if you don't like C++, it doesn't have to be C++: today we focus just on the algorithms, and they can be implemented in some other language too, like Rust, Java, maybe C#.

There will be a lot of content, so we will be going relatively quickly, and I ask you to concentrate. The talk will follow this plan: first, what exactly task scheduling is, because many things can be understood as it; then how it is normally done and what the typical problems with it are, at least the ones known to me; then how the old task scheduler in my game engine worked and how the new one works; and then I will briefly show the benchmarks, how it is verified, and what the future plans for it are.

So what is task scheduling exactly? Today, in this talk, by task scheduling I mean any kind of execution of code, such as callbacks, functions or something similar, and we just call each such unit a task, as a generic term. To give a few examples: tasks can be callbacks in a thread pool; tasks can be watchers in an event loop like libev in C, where watchers are timers or wrappers around sockets and also have callbacks; or tasks can be coroutines in some coroutine engine, like C++ coroutines or fibers in the Tarantool database.

Here is a very simple, trivial scheduler implemented in C++-like pseudo code for simplicity, which demonstrates the idea. It is just a single thread. It has a mutex-locked queue of callbacks, and all it does is execute those callbacks one by one. This is a very simple scheduler; a sketch of it is below.
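The slide code is not part of the transcript, so here is a minimal sketch of what such a trivial scheduler might look like; the names and details are my own illustration, not the exact slide code.

```cpp
#include <functional>
#include <mutex>
#include <queue>

// A deliberately trivial scheduler: one thread, one mutex-locked queue,
// callbacks executed one by one in FIFO order.
class TrivialScheduler {
public:
    void post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Meant to be called in a loop by the single worker thread.
    void runOne() {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (tasks_.empty())
                return;
            task = std::move(tasks_.front());
            tasks_.pop();
        }
        task();
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};
```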
Now let's see if such a basic scheduler can solve a real task which I had at Ubisoft. Here, tasks execute requests coming to a server, and they handle save games, save-game blobs. There might be tens of thousands of those requests per second, and they are also CPU intensive. Every task might take milliseconds of pure CPU time, and tasks consist of multiple steps: in each task I have to lock the user profile in some database, then download the save-game blob from another database, then do some pre-processing of this blob, some manipulations, then upload it back and unlock the user profile. As you can guess, most of the time the task is not doing anything at all. It just sleeps, waiting for network input from the databases. Literally more than 90% of the time there is nothing happening in this task.

The trivial scheduler which we just saw, with a single thread, will not work here; it simply will not scale. Firstly, a single thread is not enough to handle so many requests; being so CPU intensive, they will just choke it on CPU. Secondly, we want to postpone blocked tasks and do other tasks while the first ones are waiting for some event like network input. This means we need coroutines, so that tasks could yield, so they could give up their time to some other tasks in the same thread.

The game engine I am working with had a scheduler which could do the job well enough for some years. It was just a thread pool where each thread had a list of its own tasks. When new tasks appeared, they were distributed to those threads in a round-robin manner, one by one, and they were pinned to those threads forever. They could not migrate between threads. Then each worker thread updates all its tasks with some fixed, hard-coded period, like once per 100 milliseconds or once per second. After being updated, a task can eventually return false; then it is considered done and is deleted. So this is polling, basically, and we will call this scheduler the updater, because there isn't much scheduling going on: it just updates those tasks without looking at their state or anything. A sketch of this model is below.
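To make the updater model concrete, here is a minimal sketch of how such a polling thread pool could look: round-robin pinning, fixed-period updates, removal when a task returns false. It is my own illustration of the described behavior, not the engine's actual code.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Each task is a function returning false when it is finished.
using UpdateFn = std::function<bool()>;

class Updater {
public:
    Updater(unsigned threadCount, std::chrono::milliseconds period)
        : period_(period), workers_(threadCount) {
        for (auto& w : workers_)
            w.thread = std::thread([&w, this] { workerLoop(w); });
    }

    // Round-robin distribution: the task is pinned to one thread forever.
    void post(UpdateFn task) {
        Worker& w = workers_[next_++ % workers_.size()];
        std::lock_guard<std::mutex> lock(w.mutex);
        w.tasks.push_back(std::move(task));
    }

    ~Updater() {
        stop_ = true;
        for (auto& w : workers_)
            w.thread.join();
    }

private:
    struct Worker {
        std::thread thread;
        std::mutex mutex;
        std::vector<UpdateFn> tasks;
    };

    void workerLoop(Worker& w) {
        while (!stop_) {
            {
                // Poll every task unconditionally, regardless of its state.
                std::lock_guard<std::mutex> lock(w.mutex);
                for (size_t i = 0; i < w.tasks.size();) {
                    if (!w.tasks[i]())
                        w.tasks.erase(w.tasks.begin() + i); // returned false: done
                    else
                        ++i;
                }
            }
            std::this_thread::sleep_for(period_); // fixed, hard-coded period
        }
    }

    std::chrono::milliseconds period_;
    std::vector<Worker> workers_;
    std::atomic<size_t> next_{0};
    std::atomic<bool> stop_{false};
};
```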
This updater thing had some typical problems that schedulers of this type sometimes have. Firstly, it was unfair: tasks were pinned to threads and could not migrate. This leads to unbalanced CPU usage, because some worker threads get heavy tasks, a lot of them pile up in their queues, while other threads get light tasks and idle most of the time. This happens even if you do perfect round-robin and all your tasks look the same, because in practice tasks are never the same. In my case, all the tasks do the same several steps, but the save-game blobs vary in size from kilobytes to megabytes; their processing and downloading obviously do not take the same time. So some threads are unfairly loaded and perform worse, at least in terms of latency, because they have bigger queues and tasks wait longer in them. And that is not the only problem.

The other problem is polling, which here means that each task is updated unconditionally with some fixed period, regardless of its state. Every task is updated even if it doesn't have work to do yet, even if it is still waiting for network input. If you pick too big a polling interval, you get higher latency than you could have. For instance, imagine a task with just three steps, each taking five milliseconds, which waits for some network input in between, and we update it once per hundred milliseconds. Then the task will always take at least 215 milliseconds. The thing is, you don't actually need that much time: almost always the network response arrives earlier than the 100 milliseconds expire, and you have events to process, but you are not processing them because it is not yet time to update the task. So you have higher latency than you could have.

If you try to fight it and set a lower polling interval, you will burn CPU without need, because you will have spurious wakeups, unnecessary wakeups: you will sometimes wake up tasks before they have anything to do. It might not sound too bad, like how expensive can it really be? Imagine a spurious update of a task costs us just 10 microseconds: check a deadline, check an atomic flag, lock and unlock a mutex, and that's it. It is not much, but if you have 10,000 tasks per second doing those unnecessary wakeups, that is 100 milliseconds of CPU every second, so you just burned 10% of one CPU core, which already sounds worse. And this problem gets worse if you have more threads than CPU cores, because then you didn't just burn the CPU time, you stole it from other threads which could have spent it on something useful. This problem was real: during load tests, some servers were spending more CPU time on spurious wakeups than on doing actual work; they were simply burning that much CPU in the data clusters.

So, in summary, what do we need from a really performant scheduler? Firstly, it should be fair. We are not allowed to pin tasks to threads; that does not scale.
Secondly, we need coroutines, so that tasks can give up their time, so they can yield and let the worker thread do some other work, other tasks. And also, we have zero tolerance for polling: no polling at all, everything should be event based.

These goals are achieved in the new scheduler, which, due to lack of fantasy, is just called the task scheduler. The name is pretty self-explanatory about what it does. The plan is that we will first look at the big picture of the entire scheduler, and then at its individual parts, when you already know what they do.

Imagine we have a process, this server running. It has multiple threads, and they produce tasks. We have a global object of type task scheduler in the process; it can be a C++ task scheduler, a Java task scheduler, whatever language you implement it in, but it is one global object in the process. Those threads produce tasks: they receive requests from clients, wrap them into tasks, and post them into the task scheduler in the form of callbacks of some sort.

In the scheduler, the tasks gather in a so-called front queue. The scheduler will periodically take all the tasks from the front queue and inspect them one by one. It will see that some tasks are ready to be executed right now, the red ones on the slide; they want to be executed ASAP. Some other tasks do not want to be executed right now, they just yielded. We have coroutines, so there will be tasks which do not want to be executed now; they are waiting for something, for a deadline or for an explicit wakeup on some event. Those are moved into the wait queue, where they wait for their events.

From the wait queue we extract the older tasks which were already sleeping there for some time and whose deadline is now up, or which were woken up explicitly. Either way, we extract them from the wait queue, and all those red tasks go to the ready queue. From there they are extracted by the worker threads, which take them from the ready queue one by one and execute them. This is basically the entire pipeline; it is not too complicated. A rough structural sketch of it is below.
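As a rough map of the pipeline just described, here is a structural sketch of what the scheduler's state could look like; all type and member names are hypothetical placeholders for illustration, not the actual implementation.

```cpp
#include <thread>
#include <vector>

// Hypothetical placeholder types; the real queues are described later in the talk.
struct Task;
struct FrontQueue {};  // multi-producer, single-consumer; filled by external threads
struct WaitQueue {};   // yielded tasks waiting for a deadline or an explicit wakeup
struct ReadyQueue {};  // single-producer, multi-consumer; drained by the workers

class TaskScheduler {
public:
    void post(Task* task);             // external threads push into the front queue
private:
    FrontQueue front_;                 // new and re-posted tasks land here
    WaitQueue wait_;                   // sorted by deadline
    ReadyQueue ready_;                 // tasks ready to run right now
    std::vector<std::thread> workers_; // workers both execute tasks and, one at a
                                       // time, do the scheduling between the queues
};
```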
Although some of you might already have a question: who does this part? We have external threads posting tasks, we have worker threads executing tasks, but who does the scheduling itself, the management of the queues? The thing is, there is no dedicated thread doing just the scheduling. Instead, the worker threads compete for doing the scheduling, depending on which of them is idle.

Imagine that this big rectangle with the queues is a room with a single key to it. The worker threads, sometimes, depending on which of them is idle, will try to pick up the key. Whoever does it first enters the room, does the scheduling, this queue-management stuff, leaves the room, and starts doing tasks. So all the threads are the same. There are no threads with special roles of some sort, like a naive implementation could have. It is a bit like the dining philosophers problem, except that we have just one fork here, and the big rectangle in this case is a plate of spaghetti, but not of code; the code is clean.

To improve understanding, here is a short visual example I prepared. Imagine we have five tasks. They are posted onto the front queue, and one of the worker threads has the scheduling key right now. It extracts the tasks from the queue, sees that a couple of them yielded, so they go to the wait queue, waiting for something, and the others are moved into the ready queue. From here they are picked up by the worker threads. Nobody is doing the scheduling right now; all the threads are doing actual work. A couple of tasks are done, and suddenly those waiting tasks are woken up, or their deadline is up: they want to be executed now. So one of the worker threads eventually picks up the scheduling key, notices this, moves the tasks into the ready queue, and in parallel some other thread finishes an older task. Now those two tasks are picked up by a couple of random threads, they are done, and some other thread picks up the scheduling key, and the process repeats all the time as new tasks arrive. This is it. A sketch of such a worker loop is shown below.
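The "single key" is essentially a mutually exclusive scheduling role that idle workers race for. Here is a minimal sketch of what a worker loop with such a role could look like; the scheduler internals are stubbed out with hypothetical names, and the real implementation is lock-free and more involved, so treat this only as the idea.

```cpp
#include <mutex>

// Hypothetical scheduler internals, stubbed out so the sketch compiles;
// the bodies here are placeholders, not the real logic.
struct Task { void execute() {} };
struct SchedulerCore {
    std::mutex schedKey;                     // the "single key to the room"
    void doScheduling() {}                   // front queue -> wait/ready queues
    Task* popReady() { return nullptr; }     // one ready task, or nullptr
    bool waitForWork() { return false; }     // block until work arrives; false = stop
};

// Every worker runs the same loop; there is no dedicated scheduler thread.
void workerLoop(SchedulerCore& sched) {
    for (;;) {
        {
            // Whoever grabs the key first does the queue management this round.
            std::unique_lock<std::mutex> key(sched.schedKey, std::try_to_lock);
            if (key.owns_lock())
                sched.doScheduling();
        }
        // With or without the key, execute ready tasks one by one.
        while (Task* task = sched.popReady())
            task->execute();

        if (!sched.waitForWork())
            return;                          // nothing left to do and told to stop
    }
}
```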
What do we need to implement all this cool stuff? The language and the libraries we will be using need to provide at least the following: mutexes, containers like arrays and lists, condition variables, and one thing that is not important for the algorithm itself but is extremely important for the implementation and its performance: lock-free atomic operations.

Just in case not all of you know what they are, here is a short pseudo-code explanation of these lock-free atomics; we will need three of them. Firstly, atomic load, which is basically reading a variable, but with respect to so-called memory models, which you can Google afterwards; it is too complicated a topic to dive into right now. We will also need atomic compare-exchange, which is a conditional assignment: we set a new value on a variable if it was equal to something else which we wanted to check. And we will also need atomic exchange: it sets a new value and returns the old value.

The cool thing about those lock-free atomics is that they are not only atomic, but they are lock-free. There are no mutexes inside; on the contrary, mutexes use this stuff inside. They are implemented as special instructions right on the CPU, so doing those operations doesn't even involve the kernel. It is quite cheap if you use it efficiently. And those three operations are the basis for some very cool and extremely performant algorithms, as we will see; not just here, there are a lot of algorithms built on those lock-free atomic operations. They are also called lock-free algorithms. A small C++ sketch of these three operations follows.
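The pseudo code from the slide is not in the transcript, so here is a minimal C++ sketch of the three operations using std::atomic; this only illustrates standard library behavior that the talk relies on.

```cpp
#include <atomic>
#include <cassert>

int main() {
    std::atomic<int> value{10};

    // 1. Atomic load: read the variable (memory ordering details aside).
    int observed = value.load();
    assert(observed == 10);

    // 2. Atomic compare-exchange: conditional assignment. The new value 20 is
    //    stored only if the current value still equals 'expected'; otherwise
    //    'expected' is updated to the actual value and false is returned.
    int expected = 10;
    bool updated = value.compare_exchange_strong(expected, 20);
    assert(updated && value.load() == 20);

    // 3. Atomic exchange: store a new value and get the old one back.
    int old = value.exchange(30);
    assert(old == 20 && value.load() == 30);
    return 0;
}
```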
Now we will follow the task pipeline, looking at the scheduler parts, and we start from the front queue, just like the tasks do. We know that this queue has multiple producers and a single consumer, so it is a multi-producer, single-consumer queue. This is a common notation for naming queues: we say multi- or single-producer, multi- or single-consumer, and we get four combinations of queue types. This one is multi-producer, single-consumer.

We also know that this queue will experience high contention, because I want my scheduler to stay functional when I have tens of threads and millions of tasks per second, and that means extreme contention on the front queue. This in turn means I should avoid mutexes, because a mutex will most likely choke here. In a normal queue, not thread safe or anything, you would have to remember two positions, head and tail, and maintain both. If you try to turn that into a lock-free thread-safe implementation, it becomes a nightmare, because the more variables you have to update in a lock-free way, the more complicated the algorithm gets, and a queue with two variables is extremely hard to implement lock-free. We should avoid that if possible, and the idea is: let's make it a stack instead of a queue. Then we maintain just one position, the top item, and that's it.

We don't need two variables here, and the algorithm becomes quite simple. Pushing: we link the new item to the old top and try to set it; the atomic compare-exchange fails if some other thread managed to push faster than us, and then we retry. It is pretty simple. The more complicated part, although it looks shorter, is popping the items. The way pop is implemented, we just replace the top item with null, effectively taking all the items at once; we cannot pop them one by one. Also, we have to reverse the list before returning it, because items pushed in FIFO order onto a stack come back in LIFO order, and we want to preserve the order, so we reverse it. These are the two downsides: we can't pop one by one, and we have to reverse the result before returning it. On the other hand, what we get is completely lock-free and thread safe, and the pop side is even wait-free. Whether that wins over the drawbacks, we will see at the end, in the benchmark section. A sketch of this stack is below.
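Here is a minimal sketch of such a front queue, assuming intrusive items with a `next` pointer; it follows the described push (compare-exchange with retry) and pop-all (exchange with null, then reverse), but it is a simplified illustration, not the library's actual code.

```cpp
#include <atomic>

// Intrusive multi-producer, single-consumer stack: a sketch of the front queue.
struct Item {
    Item* next = nullptr;
};

class MPSCStack {
public:
    // Called by many producer threads concurrently.
    void push(Item* item) {
        Item* old = top_.load(std::memory_order_relaxed);
        do {
            item->next = old;
            // The compare-exchange fails if another producer won the race;
            // 'old' is refreshed and we simply retry.
        } while (!top_.compare_exchange_weak(old, item,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    // Called by the single consumer: grabs the whole stack in one atomic
    // exchange (no retries), then reverses it to restore FIFO order.
    Item* popAll() {
        Item* head = top_.exchange(nullptr, std::memory_order_acquire);
        Item* reversed = nullptr;
        while (head != nullptr) {
            Item* next = head->next;
            head->next = reversed;
            reversed = head;
            head = next;
        }
        return reversed;
    }

private:
    std::atomic<Item*> top_{nullptr};
};
```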
The next part of the task journey is the wait queue. As we know, it stores yielded tasks, tasks waiting for something; in 99 percent of cases they are waiting for a deadline. Since we are using tasks for requests, and requests usually have a timeout, they have a deadline. What we need is to be able to quickly pop all tasks with an expired deadline, simply because we have to do everything here quickly; there are a lot of tasks. We also know that this queue is only ever accessed by one thread at a time, the current scheduler worker who owns the scheduling key, so there is no concurrency on this queue. That gives us quite a lot of freedom in what data structures we can use, and it basically means we can use a binary heap. It is ideal for a job where you have to quickly pop items sorted by something like a deadline. We sort the tasks by deadline, so the task with the closest deadline is on top, and in constant time, just by looking at the top, we can tell whether any task has expired.

In case not all of you know what a binary heap is, a brief explanation: it is a perfectly balanced binary tree where each node's value is less than the values of its child nodes. This is called a min-heap; if we reverse the order, it is a max-heap. And this heap has quite good complexity: in logarithmic time we can pop any item, even from the middle, and we can push new items, also in logarithmic time, which is very nice, very fast.

From the wait queue, the tasks migrate to the ready queue, which, as we know, is populated by one thread, the current scheduler worker, and is consumed by multiple threads, the worker threads. So this is a single-producer, multi-consumer queue. We also know that it will experience high contention as well, because the task scheduler should be perfectly functional with, say, 10 worker threads, why not, and with millions of tasks per second we will have high contention. We have to deal with it somehow. Unfortunately, I don't know a nice, simple algorithm for an unbounded and lock-free queue of this type; for the reason why, you can Google the ABA problem after the talk. It is also quite complicated, and we won't have time to dive into it, but just know that it is much more complicated than the multi-producer, single-consumer version. I do, however, know a couple of other queues: an unbounded lock-based one and a bounded lock-free one. I want my final queue to be unbounded, meaning not limited in size. I don't want any limits inside the scheduler; you can add them on top, but I don't want the scheduler itself to be limited by anything like a queue size.

So let's see what we can do with those two queues. The bounded lock-free queue is simple: it is just a cyclic array, except that the read and write indexes are atomic variables. In push, the only thing that changes compared to a normal cyclic buffer is that you increment the write index with an atomic increment, and that's it. Popping is just a little more complicated: we sometimes have to retry, because there are multiple consumers and they will sometimes compete for the same element, so the atomic compare-exchange will occasionally fail and we retry, but it is still quite simple. And the unbounded locked queue is just trivial: it is a mutex and a list, and we take the mutex on every operation. It is lock-based, but it is unbounded. A sketch of the bounded lock-free version is below.
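Here is a minimal sketch of such a bounded queue for one producer and many consumers, following the description above: a plain publish on the producer side and a compare-exchange with retry on the consumer side. It is an illustration under simplifying assumptions (sequentially consistent atomics for clarity, atomic slots so a consumer that loses the race reads its slot safely, small trivially copyable items such as task pointers), not the real implementation.

```cpp
#include <atomic>
#include <cstddef>

// Bounded single-producer, multi-consumer ring: a sketch of one "block" of the
// ready queue. T should be small and trivially copyable, e.g. a task pointer.
template <typename T, size_t Capacity>
class BoundedSPMCQueue {
public:
    // Single producer: write the slot, then publish it by bumping writeIdx.
    bool push(const T& value) {
        size_t w = writeIdx_.load();
        if (w - readIdx_.load() == Capacity)
            return false;                     // full
        buffer_[w % Capacity].store(value);
        writeIdx_.store(w + 1);
        return true;
    }

    // Multiple consumers: claim a slot by compare-exchanging readIdx.
    // The compare-exchange fails when another consumer won the race; retry.
    bool pop(T& out) {
        size_t r = readIdx_.load();
        for (;;) {
            if (r == writeIdx_.load())
                return false;                 // empty right now
            out = buffer_[r % Capacity].load();
            if (readIdx_.compare_exchange_weak(r, r + 1))
                return true;                  // we own the slot we just read
            // 'r' was refreshed by the failed compare-exchange; try again.
        }
    }

private:
    std::atomic<T> buffer_[Capacity];
    std::atomic<size_t> writeIdx_{0};
    std::atomic<size_t> readIdx_{0};
};
```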
So what can we do with the ready queue? What I had was the craziest idea. The thing is, our enemy is not the mutex itself, it is the contention on the mutex. So instead of deleting the mutex, we could try to reduce the contention: keep the mutex, but somehow not lock it so often. Let's combine the two approaches: take the bounded lock-free queue and build an unbounded lock-based queue out of those lock-free sub-queues. It will be like a std::deque, but the blocks are lock-free, and the big queue of those blocks is lock-based.

The producer works like this: it pushes new items into the latest block in a lock-free way; when the block becomes full, it takes the mutex, appends a new block, then fills that one in a lock-free way, and so on. The consumers do the same thing in reverse: they consume the first block in a lock-free way; when it becomes empty, they take the mutex, switch to the next block, consume it in a lock-free way, and so on. To see the benefit, imagine the sub-queue size, the block size, is not four like on the slide, but 10,000. Then we take the mutex lock not on every operation, but once per 10,000 operations. The mutex is still there, but it is locked so rarely that its cost is negligible; you will not see this mutex lock in any flame graphs anymore.

The only problem is that the consumers need explicit state, because if a consumer does not know which block is the first one in this queue of blocks, it has to locate it. The queue of blocks is protected by a mutex, so if consumers had no state, no reference to the first block that has items, then on every pop they would have to find it while holding the mutex, and that would mean a mutex lock on every pop, which is exactly what we wanted to avoid. So consumers need to register themselves, and then they can do the popping. This is not a problem for the task scheduler itself, because it has a fixed set of worker threads which you specify at creation; they can register themselves as consumers at start and live just fine with it. It is only an inconvenience for generic usage of this type of queue, but for the task scheduler it is just fine. A structural sketch is below.
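Here is a structural sketch of this combined queue: a lock-based chain of lock-free blocks, reusing the BoundedSPMCQueue sketch above as the block type. Details a real implementation needs, such as recycling fully consumed blocks into a pool and managing their lifetime while consumers still point at them, are deliberately left out.

```cpp
#include <memory>
#include <mutex>

// Unbounded ready queue sketch: a mutex-protected chain of lock-free blocks.
template <typename T, size_t BlockSize>
class BlockChainQueue {
    struct Block {
        BoundedSPMCQueue<T, BlockSize> items;
        std::unique_ptr<Block> next;
    };

public:
    BlockChainQueue() : head_(std::make_unique<Block>()), tail_(head_.get()) {}

    // Each consumer registers once and then remembers its current block.
    struct Consumer { Block* block = nullptr; };

    void registerConsumer(Consumer& c) {
        std::lock_guard<std::mutex> lock(mutex_);
        c.block = head_.get();
    }

    // Single producer: lock-free pushes into the tail block; the mutex is
    // taken only once per roughly BlockSize pushes, to append a fresh block.
    void push(const T& value) {
        if (tail_->items.push(value))
            return;
        std::lock_guard<std::mutex> lock(mutex_);
        tail_->next = std::make_unique<Block>();
        tail_ = tail_->next.get();
        tail_->items.push(value);   // fresh block, cannot be full
    }

    // Consumers drain their current block lock-free and take the mutex only
    // when switching from one block to the next.
    bool pop(Consumer& c, T& out) {
        for (;;) {
            if (c.block->items.pop(out))
                return true;
            std::lock_guard<std::mutex> lock(mutex_);
            if (c.block->next == nullptr)
                return false;               // drained everything published so far
            c.block = c.block->next.get();  // switch to the next block
        }
    }

private:
    std::mutex mutex_;
    std::unique_ptr<Block> head_;  // the sketch never frees or pools old blocks
    Block* tail_;
};
```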
Let's walk through the visual example again. Imagine we have this queue, it is empty, just a single block, one producer, two consumers. The producer adds three items in a lock-free way; everything is lock-free so far. One consumer consumes one item, lock-free. Now the producer adds another item, lock-free, and sees that the block became full, so a new block is needed: it takes the mutex, appends a new block, switches to it, releases the mutex, and adds three more items in a lock-free way.

The consumers keep working and finish consuming the first block. Consumer A sees that the block is exhausted, so it takes the mutex, switches to the next block, releases the mutex, and continues consumption in a lock-free way. So we take the mutex only when switching from block to block. The other consumer, when it tries to consume something, will immediately see that its current block is exhausted, because those blocks are, as you remember, lock-free bounded queues with read and write indexes: if the read index of the block equals its size, the block is fully consumed, and we don't even need to scan it. Consumer B then locks the mutex, switches to the next block, releases the mutex, and continues consumption.

The old, completely consumed blocks can be discarded. You can free them, or you can reuse them, like keeping a pool of those blocks and appending them to the chain again. It is not cheap to allocate blocks holding 10,000 items, so you might want to pool them. As for the benchmarks, we will see them at the end, with everything combined.

Our progress so far is that we have seen the task scheduler parts, those queues, and now we can have a glimpse of the coroutines. Unfortunately, we don't have enough time to dive deep into the implementation, but I can show you some usage examples and the features they have; for the implementation, you can look at the source code and ask me questions after the talk.

To see why we need coroutines, and which features we need from them, let's inspect a simplified version of the save-games example. This time we have just two steps: start the download of a save-game blob, and then handle the result. What we know is that while the task is waiting for a response from the network, it shouldn't block other tasks, so it should be able to yield, to step away. We also know that the task can't sleep infinitely; there should be some sort of timeout. Requests can't execute forever, so we need a deadline after which the task is woken up and cancels the request. And if the response arrives in time, before the deadline, we have to wake the task up before the deadline. There should be some sort of explicit wakeup for a task which is right now sleeping in the wait queue, not executing; how exactly do we wake it up from there? So we need an explicit wakeup.
So we need yield, deadlines, and wakeups: the three main features of those coroutines. Let's see an example. It is almost exactly how it looks in real code; on the slide it is a simplified version of C++, but it is very similar to how it looks in reality.

We have a global task scheduler in the process and some HTTP client. Imagine a request from a client arrives: we wrap it into a task, give it a callback called download, and post it into the scheduler, so it goes into the front queue. The scheduler executes our callback in one of the worker threads.

In this download callback, we first set what the next step will be: the next step is handling the result. Then we do the asynchronous HTTP request; assume that our HTTP client is able to do this. So we start an asynchronous HTTP request and give it a future callback to execute when the request is done, which will call wakeup on our task. That is the explicit wakeup when the request completes. Then we start waiting: we post ourselves back into the scheduler with a five-second deadline. Either the task wakes up in five seconds, or it is woken up explicitly when the request is complete. It goes into the scheduler, sleeps in the wait queue for some time, and eventually our handle-result callback is executed.

We have to check what happened; there can be two reasons. It could be that the task has expired, so five seconds passed and we were woken up by the deadline. We literally check whether it is expired; if it is, we cancel the request and start waiting for the cancellation to complete, to properly free all the resources, and we go back into the scheduler, now waiting for the cancellation. Otherwise, if it is not expired, it means the request finally completed with some result. It could be a success, an HTTP error code, or our own cancellation done on a previous wakeup a few lines above. Either way, we just handle it and delete the task.

This is almost exactly how it looks in the source code. There are no spurious wakeups at all, no sleeps, and the request has a clear state machine, with every state expressed as a callback. And what is better, this entire coroutine machinery, all those wakeups, posts, and deadlines, is completely lock-free; there is not a single mutex used to implement it all. A sketch of this example is below.
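Since the slide code is not in the transcript, here is a sketch of the described flow with hypothetical names: TaskScheduler, Task::setCallback, postWait, wakeup, isExpired, and the HTTP client are all placeholders I invented for illustration, not the library's actual API, and the declarations are left unimplemented.

```cpp
#include <chrono>

// Hypothetical declarations standing in for the real scheduler and HTTP client.
struct Task;
struct TaskScheduler {
    void post(Task* t);                                     // run ASAP
    void postWait(Task* t, std::chrono::seconds deadline);  // sleep until wakeup or deadline
};
struct HttpRequest { void handleResultAndCleanup(); };
struct HttpClient {
    HttpRequest* startAsyncGet(const char* url, void (*onDone)(Task*), Task* t);
    void cancel(HttpRequest* r);
};
struct Task {
    void setCallback(void (*cb)(Task*));
    bool isExpired();              // true if woken up by the deadline
    void wakeup();                 // explicit wakeup from the wait queue
    HttpRequest* request = nullptr;
};

TaskScheduler scheduler;   // one global scheduler in the process
HttpClient http;

void handleResult(Task* t);

// Step 1: start the download, then go to sleep with a deadline.
void download(Task* t) {
    t->setCallback(handleResult);                       // the next step
    t->request = http.startAsyncGet("/savegame",
        [](Task* done) { done->wakeup(); }, t);         // explicit wakeup on completion
    scheduler.postWait(t, std::chrono::seconds(5));     // wait for wakeup or 5 s
}

// Step 2: woken up either by the deadline or by the explicit wakeup.
void handleResult(Task* t) {
    if (t->isExpired()) {
        http.cancel(t->request);                        // too late: cancel the request
        t->setCallback(handleResult);                   // now wait for the cancellation
        scheduler.postWait(t, std::chrono::seconds(5));
        return;
    }
    t->request->handleResultAndCleanup();               // success, error, or our own cancellation
    delete t;
}
```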
So this is to get you further interested in looking at the source code and asking questions afterwards. This is quite complicated stuff, as you can imagine, especially the implementation. How do we verify such stuff? Of course there are unit tests, literally hundreds and thousands of lines of unit tests. But with multi-threaded algorithms, even if you have 100% code coverage, it doesn't tell you anything. It is better than not having 100% coverage, I think, but there still might be bugs which can stay hidden for literally years, and you will not find them until the thing explodes in production. There should be a way to at least verify the algorithm. We can't verify the source code, but we can improve our confidence in the algorithm itself.

The solution is TLA+. TLA+ stands for Temporal Logic of Actions, and it is a combination of mathematics and temporal logic. It is also a language and a runtime for this language. This language allows you to verify literally any algorithm or system: you can verify the algorithm of a queue, like I did, or how your microservices interact, or how you go to a grocery store. If you can algorithmize it, you can verify it; TLA+ is suitable for anything like this.

In this TLA+ language you write a specification and run the verification, and it will unfold your system, your algorithm, into the set of all possible states in which this algorithm can be, and verify your own invariants in every reachable state of your system. So first you define the algorithm, then you define what validity means for your algorithm, the invariants, and TLA+ verifies them all.

Let's see an example. Assume you have implemented a queue in some language, and we want to verify the algorithm of this queue. First you define what objects exist in your system, what agents you have there. In my queue I have just a pipe for items, some sort of storage, and a couple of counters and limits. Then you define actions. In TLA+, every action is a set of mathematical conditions combined with mathematical operators: there are "and" and "or" operators, list and set operators, complex operators like "all items in the set comply with this condition", and many more operators which you can use in your actions. The first action is always initialization, where you give initial values to your variables.
Here I say that the pipe storage is an empty list, and my counters of sent and received items are zero. Then I start defining the actual actions doing stuff, like send, or push, or insert, or whatever. There are different ways to do it; I am not aware of a standard way to structure it properly, so I invented my own. Firstly, I split each action into two groups of conditions. In the first group I define when the action is possible: here I say that send is possible when the queue is not full and when I still have items to send, so not everything is pushed yet. In the second group I describe what changes should happen when the first group is true: if the first part is true, then I add a new item to the pipe storage (as items I use numbers), and I increment the counter of sent, of pushed, items.

The thing is, as you can probably see, there is no real distinction between those two groups. It is imaginary, and the only syntactic marker is this small single-quote sign, the prime, which means the next value of the variable. Since this is basically math, there is no assignment in it, no action like "assign a new value to this variable"; the closest thing math has is the equals operator. But you can emulate assignment by saying that the next value of your variable equals the old value with something done to it. So here I literally say LastSent-prime equals LastSent plus one, which effectively amounts to assignment in programming languages, but here you have to express it like this. In TLA+ there is no real separation into groups, it is just several mathematical conditions, but I do separate them to make the specification easier to read. I do the same for the receive action: it is possible when I have items to receive, and the change is to receive one.

Then you define what validity means for your system, the invariants, which will be checked in every reachable state. Here I say that my queue is valid when all the items in the queue are ordered the way I pushed them, the queue never overflows, and the items are received in the same order as they were sent. Then, with some simple technical steps, I run the validation, and it gives me a result like: n states found, hundreds of thousands or millions of states, and they are all valid; or it says that an invalid state was found, and here is how you get into that state from the initial one, following this sequence of actions. A small sketch of such a specification is below.
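To make this concrete, here is a minimal sketch of such a bounded-queue specification in TLA+. The module name, variable names, and constants are my own and arbitrary, so treat this as an illustration of the structure described above, not the talk's actual specification.

```tla
---- MODULE BoundedQueue ----
EXTENDS Naturals, Sequences

CONSTANTS Capacity, Total      \* queue capacity and how many items to send

VARIABLES pipe, lastSent, lastReceived
vars == <<pipe, lastSent, lastReceived>>

Init ==                        \* initial values of all variables
    /\ pipe = <<>>
    /\ lastSent = 0
    /\ lastReceived = 0

Send ==                        \* group 1: when the action is possible
    /\ Len(pipe) < Capacity
    /\ lastSent < Total
                               \* group 2: what changes ("prime" = next value)
    /\ lastSent' = lastSent + 1
    /\ pipe' = Append(pipe, lastSent + 1)
    /\ UNCHANGED lastReceived

Receive ==
    /\ Len(pipe) > 0
    /\ lastReceived' = Head(pipe)
    /\ pipe' = Tail(pipe)
    /\ UNCHANGED lastSent

Next == Send \/ Receive
Spec == Init /\ [][Next]_vars

\* Invariants checked in every reachable state.
OrderIsKept == \A i \in 1..Len(pipe) : pipe[i] = lastReceived + i
NeverOverflows == Len(pipe) <= Capacity
====
```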
Then you can turn those failure traces into unit tests in your actual code. This actually works, and it caught a bug in the scheduler, thanks to this thing. By the links you can find the specifications for the task scheduler as a whole, and for the multi-consumer queue, which was not trivial enough, so I validated it as well. The two specifications are quite big, but most of the lines are comments explaining the algorithm, so the code part of them is not too complicated; you can read it easily. There are also instructions in the source code repository on how to install TLA+ and how to run the validation of those models from the command line; surprisingly, it is not trivial. And there is a great course of lectures by the author of TLA+, Leslie Lamport. The lectures are quite fun; even if you are not planning to use TLA+, they are still worth watching, very entertaining. All of this can be found in the source code repository.

Now about the benchmarks. How I did them: the benchmarks are comparative. I am not just running my algorithms against themselves in a vacuum or against some random stuff; I compare the improved versions, my algorithms, against trivial, naive versions using mutexes, to see whether things actually improved. All the same benchmarks run on my algorithms and on the trivial implementations. For example, the smart queues I benchmark against their mutex-locked versions, and the task scheduler I benchmark against a thread pool without coroutines, with a single queue, a single mutex, and nothing else. I ran this on five different configurations of software and hardware, with tens of scenarios, and all the reports with the performance numbers are available on GitHub in human-readable Markdown format. You can also run them on your system with a single line of Python script; it will generate the report for your case, and you can read it and see what's up.

There are quite a lot of results, so I can show only a few of them on the slides. I will use Debian Linux with eight cores running in Google Cloud, and I will show just some average results, nothing shocking like 100 times faster. There are extreme cases when the algorithms are almost the same, or when mine is extremely faster, but I will show average results which you could actually get in real production.

We start from the front queue again. The benchmark uses five producer threads and one consumer thread, doing busy-loop pushes and pops all the time, to get the worst-case contention.
It is about one and a half times faster, and that is with the two drawbacks you remember: we store items as a stack in the front queue, we have to reverse them, and we cannot pop them one by one. Still, it is one and a half times faster. If we make it 10 producer threads, so that we have more threads than CPU cores, the worst case for mutex contention, it becomes 2.6 times faster.

The ready queue, another benchmark: five consumer threads, one producer thread, again busy-loop pushes and pops. It is already 2.6 times faster, and you can see that the lock contention is multiple orders of magnitude lower than in the trivial implementation. This is thanks to taking the mutex lock not on every operation but once per several thousand operations, and this is the result: the mutex is still there, but it almost doesn't affect the results at all. When we make it 10 consumer threads, it becomes four and a half times faster; the naive queue degrades quite quickly in this case.

And now everything combined, the task scheduler as a whole. In this benchmark the tasks are empty, just empty C++ functions not doing anything, the worst case for contention again. We start with a single worker thread, which does both the scheduling and the tasks, and right away it is 2.2 times faster than a trivial thread pool without coroutine support, with zero lock contention: the lock was not contended even once between the worker and the producer. With five worker threads it becomes three times faster, so it scales better than the naive implementation, and with 10 worker threads it becomes seven and a half times faster. And these are not the best results; it can be even better. I am just not showing the extreme cases here, but I saw 15 times speedups as well. That is just not something you will most likely get in production if you start using this, but those benchmarks are also available.

Now about the real usage, not some benchmarks in a vacuum: how it affects actual code, applied to the save-games case from the beginning. We ported one of the microservices from the updater scheduler to the task scheduler, and immediately, without any further optimizations, got a 10 times speedup. We went from hundreds of RPS to more than 10,000 RPS, and latency dropped five times, right out of the box, before we started doing any other optimizations.

And the algorithm is extendable. As you remember, there is this big rectangle where only one thread at a time can work.
That means it is thread-safe space. We can replace the binary heap with the waiting tasks with something more complicated. For instance, we can put libev inside, or epoll, or IO completion ports from Windows, and we get a multi-threaded event loop, like a multi-threaded libev. We can store sockets in tasks, and we get sockets with deadlines and with yields. This is, in fact, what we do have in Ubisoft: it is a fork of the task scheduler where we just replaced the wait queue with epoll on Linux and IO completion ports on Windows, and we get more than a million messages per second with just several threads on sockets. So with this thing it is not too complicated to extend the scheduler; but maybe next time about this multi-threaded event loop.

What are the future plans for it? Firstly, we could try to run it on ARM. Maybe it already runs, I just haven't tried; maybe it works, maybe not, I have no idea. That's why it is open source: you can try it and send a pull request if something is not working. Also, it is currently implemented only in C++, and it is not even STL, although some people, like me, might consider that a good thing; I don't like STL, but it could use a port to STL as well, or to some other language. And there could be optimizations, like in the front queue: maybe there is a way not to store it as a stack, not to reverse the list of items before returning it. I just haven't found a simple way to do it which would be worth trying.

And this is the end. Thanks for your attention. Here are the links to the source code and to this talk. It will be available online by this link, the animated version with all the slides and my notes, together with my other talks. There are also bonus sections covering things some of you might ask as questions; we can quickly go through them, or you can click through them yourself after the talk if you are interested.

Okay, so time for questions. Show of hands, and we'll give you a mic.

Audience: For the front queue, you could use one of Dmitry Vyukov's MPSC queues; they are very fast, much faster than the Treiber stack, which is the thing you are using. The other thing is, for the wait queue, well, you answered this with IOCP, kqueue, epoll; that would be much more in line with something that uses timers or networking. And you said that we cannot have a lock-free multi-consumer, single-producer queue. Yes, you can, actually: you can use one of the Chase-Lev deques; there is a paper from 2013 with formally verified primitives, including ARM, and those would work for your use case.
Vlad: I know that implementations of such queues exist, I just couldn't find a simple enough one. The thing is, at Ubisoft, internally, we sometimes value code simplicity above all, even to the harm of performance, so it is not an option to use something extremely complicated, like hazard pointers or stuff like that. Also, for example, I saw implementations of such queues which are not wait-free; they can be lock-free, let's say, but not wait-free, and that also wasn't an option, because that is basically a spin-lock. There are, like, a hundred levels of this.