[00:00.000 --> 00:09.400] So, good morning, ladies and gentlemen. [00:09.400 --> 00:12.400] I am Pavel Pisa from Czech Technical University. [00:12.400 --> 00:16.680] I have taught computer architectures for something like 20 years. [00:16.680 --> 00:22.280] Since the year 2000, we have followed the famous Hennessy and [00:22.280 --> 00:26.360] Patterson book about computer architectures. [00:26.360 --> 00:31.840] And we have used the original MipsIt simulator, which [00:31.840 --> 00:35.160] has been distributed with the book, but it has got dated. [00:35.160 --> 00:38.440] So we decided that we need to do something for our students [00:38.440 --> 00:41.160] to have a better experience. [00:41.160 --> 00:44.920] And we started first with a MIPS simulator, [00:44.920 --> 00:47.640] and then we switched to RISC-V. [00:47.640 --> 00:51.160] And this is the work of Jakub Dupak, [00:51.160 --> 00:52.320] our student. [00:52.320 --> 00:54.560] So I hope that everything is working, [00:54.560 --> 00:58.120] and I pass him the mic and the stuff. [00:58.120 --> 00:58.620] OK? [01:18.640 --> 01:21.240] So, good morning. [01:21.240 --> 01:23.640] Over the past decades, computer engineers [01:23.640 --> 01:26.960] have been creating faster and faster computers. [01:26.960 --> 01:29.280] Meanwhile, many software engineers [01:29.280 --> 01:32.640] kept writing slower and slower software. [01:32.640 --> 01:38.760] And in many areas, this is fine, but in other ones, [01:38.760 --> 01:42.000] we need to process a pretty much insane amount of data, [01:42.000 --> 01:43.920] often in real time. [01:43.920 --> 01:47.720] And to do that, we need really efficient software that [01:47.720 --> 01:51.000] can exploit the capabilities of the hardware. [01:51.000 --> 01:53.400] And only software engineers who understand [01:53.400 --> 01:56.080] the principles of the hardware can do that. 
[01:56.080 --> 02:00.080] So to ensure we will have such software engineers, [02:00.080 --> 02:02.600] we need to teach our undergraduate students [02:02.600 --> 02:05.920] at least the basics of computer architectures. [02:05.920 --> 02:10.440] And because nobody wants to learn with pen and pencil [02:10.440 --> 02:14.280] anymore, we started using a graphical simulator. [02:14.280 --> 02:25.080] So, and it is more. [02:25.080 --> 02:26.080] Yes. [02:29.960 --> 02:32.360] So we started with the MipsIt simulator, as was mentioned, [02:32.360 --> 02:35.480] which was shipped with the Hennessy and Patterson books. [02:35.480 --> 02:37.800] However, it had pretty limited features, [02:37.800 --> 02:39.720] and it only worked on Windows. [02:39.720 --> 02:46.000] So in 2019, Karel Kočí, who is sitting over there, [02:46.000 --> 02:49.800] created QtMips, a graphical simulator of the MIPS [02:49.800 --> 02:51.400] microarchitecture. [02:51.400 --> 02:55.560] And it was continued with a lot of work by Dr. Pisa here. [02:55.560 --> 02:59.120] And in 2022, myself, with my colleague Max Hollmann, [02:59.120 --> 03:02.680] and again, a lot of work from Dr. Pisa, [03:02.680 --> 03:05.760] we have switched the simulator to RISC-V [03:05.760 --> 03:08.480] so that we could switch all the undergraduate computer [03:08.480 --> 03:11.200] architecture teaching to RISC-V. [03:11.200 --> 03:12.720] So we are there. [03:12.720 --> 03:15.680] The simulator is licensed under GPLv3, [03:15.680 --> 03:20.960] and it's a native simulator running under Qt 5 and 6. [03:20.960 --> 03:24.520] It's developed and available on GitHub. [03:24.520 --> 03:28.920] And to better collaborate on the materials, [03:28.920 --> 03:32.000] we have joined forces with our sister faculty. 
[03:32.000 --> 03:34.160] We are from the Faculty of Electrical Engineering, [03:34.160 --> 03:37.320] and they are the Faculty of Information Technology, [03:37.320 --> 03:39.920] and we have created what we call the Computer Architectures [03:39.920 --> 03:40.960] Education project. [03:40.960 --> 03:42.760] And you can find all the materials [03:42.760 --> 03:47.200] that we have there: recorded lectures, slides. [03:47.200 --> 03:49.040] You can find it there. [03:49.040 --> 03:52.960] And furthermore, to collaborate with as many people [03:52.960 --> 03:56.720] as possible, the university has joined the RISC-V foundation [03:56.720 --> 03:58.280] lately. [03:58.280 --> 04:02.600] So the simulator is called QtRvSim, [04:02.600 --> 04:05.400] and it is currently used by our university, [04:05.400 --> 04:08.400] the Technical University in Graz, University of Colorado [04:08.400 --> 04:11.520] at Colorado Springs, and University of Porto. [04:11.520 --> 04:13.200] The previous MIPS version is still [04:13.200 --> 04:15.280] used by the Charles University in Prague, [04:15.280 --> 04:19.160] and the National and Kapodistrian University of Athens. [04:19.160 --> 04:24.680] And this talk will focus on the teaching and user perspective. [04:24.680 --> 04:27.960] If you are interested in how it works inside, [04:27.960 --> 04:30.200] at least a little bit of it, [04:30.200 --> 04:32.680] you can look up the talk that we gave [04:32.680 --> 04:36.560] at the RISC-V International Academic and Training [04:36.560 --> 04:38.760] Special Interest Group. [04:38.760 --> 04:43.120] So before I dive in, as Dr. Pisa mentioned, [04:43.120 --> 04:45.760] you can just open this link and follow the presentation [04:45.760 --> 04:47.720] with the simulator running in your browser. [04:47.720 --> 04:49.280] It is in chat. [04:49.280 --> 04:50.520] It is in chat. [04:50.520 --> 04:52.040] Great. [04:52.040 --> 04:55.240] So let's dive in. 
[04:55.240 --> 04:58.680] When we get students who have never heard anything [04:58.680 --> 05:01.840] about computers before, we need to start really simple. [05:01.840 --> 05:07.400] So what we do is that we start with a very simple, single-cycle [05:07.400 --> 05:09.680] microarchitecture, and we show them [05:09.680 --> 05:12.520] how basic instructions are processed there. [05:12.520 --> 05:14.640] So this is the simulator. [05:14.640 --> 05:16.520] This is how it looks when you open it. [05:16.520 --> 05:21.600] So we will hit No Pipeline and No Cache to get started, [05:21.600 --> 05:25.600] and hit Start Empty because we will use one of the examples [05:25.600 --> 05:31.480] provided, so File, Examples, and Simple load and store word. [05:31.480 --> 05:33.640] As you can see, the program is really simple. [05:33.640 --> 05:36.720] It just loads from one part of memory, [05:36.720 --> 05:40.080] stores into another, and then goes into a loop using a branch [05:40.080 --> 05:41.280] instruction. [05:41.280 --> 05:45.400] So this is the basic view that we start with. [05:45.400 --> 05:47.200] And on the right side, you can see [05:47.200 --> 05:52.000] a detail of the program memory with the address of each instruction, [05:52.000 --> 05:54.200] the hexadecimal code of the instruction, [05:54.200 --> 05:57.200] and the disassembled instruction itself, [05:57.200 --> 06:00.960] where the last two columns can be edited directly [06:00.960 --> 06:03.440] in this view by double-clicking and writing [06:03.440 --> 06:06.760] the instruction or pasting the code. [06:06.760 --> 06:10.960] And furthermore, the leftmost column [06:10.960 --> 06:13.720] is a place where you can set breakpoints, [06:13.720 --> 06:16.840] so you can use it like a debugger, [06:16.840 --> 06:20.000] or at least the simplest part of that. [06:20.000 --> 06:23.200] And now we can move to the heart of the simulator, [06:23.200 --> 06:26.680] and that is the visualization of the pipeline. 
[06:26.680 --> 06:29.320] So if you look closer, on the left, [06:29.320 --> 06:32.120] we start with the program counter, program memory, [06:32.120 --> 06:34.720] and circuits to update the program counter. [06:34.720 --> 06:36.840] We load the instruction, and here you [06:36.840 --> 06:41.200] can see the hexadecimal value of the instruction shown [06:41.200 --> 06:44.440] on the bus, and it gets to the control unit, [06:44.440 --> 06:47.680] where we send all the control signals that we need. [06:47.680 --> 06:50.720] So we are showing the load word instruction, [06:50.720 --> 06:57.480] so we need to send the data from memory to registers. [06:57.480 --> 07:01.880] We need to read the memory, and we [07:01.880 --> 07:06.840] will need an immediate value for the arithmetical and logical [07:06.840 --> 07:08.520] unit. [07:08.520 --> 07:11.400] Right here, we can see mechanisms for resolving branches, [07:11.400 --> 07:13.800] but this is a load word instruction. [07:13.800 --> 07:15.480] I think there's nothing there. [07:15.480 --> 07:18.960] And here is our arithmetical and logical unit, [07:18.960 --> 07:24.960] which has two possible inputs, either from registers or from PC. [07:24.960 --> 07:28.280] And the other one has either a value from registers [07:28.280 --> 07:30.880] or from the immediate decode. [07:33.480 --> 07:37.280] Here we get the requested operation of the ALU, [07:37.280 --> 07:40.400] and here we get the signal whether the result was [07:40.400 --> 07:42.360] zero for branching. [07:42.360 --> 07:45.040] And finally, here we have the data memory, [07:45.040 --> 07:49.560] with its address, and the data to be written. 
[07:49.560 --> 07:52.000] So we start executing, and we see [07:52.000 --> 07:54.440] that the instruction load word was highlighted, [07:54.440 --> 07:57.880] and in the register view that appeared up there, [07:57.880 --> 08:02.840] we see that the value 12345678 was written [08:02.840 --> 08:06.880] into the register bank, which is highlighted by the red color. [08:06.880 --> 08:09.640] If we continue, we move to the store word instruction, [08:09.640 --> 08:12.200] and now the value needed to be read. [08:12.200 --> 08:14.280] So it's blue now. [08:14.280 --> 08:20.880] And if we go back to the detail of the view, [08:20.880 --> 08:23.960] we see that we read that value from the register, [08:23.960 --> 08:27.640] and we are sending it by this wire to the data memory [08:27.640 --> 08:28.960] to be stored. [08:28.960 --> 08:31.960] We also read the address from the immediate decode, [08:31.960 --> 08:35.160] and we are sending that to the arithmetical and logical unit [08:35.160 --> 08:38.160] to be added to the register, which is in fact zero. [08:38.160 --> 08:41.320] So we are just sending the value to the memory, [08:41.320 --> 08:43.680] and now the memory can write. [08:43.680 --> 08:46.600] So let's see what happened in the memory. [08:46.600 --> 08:51.200] Here we can see the address 0x404 with the value written. [08:51.200 --> 08:53.760] So it corresponds to the address we have here, [08:53.760 --> 08:55.640] and to the value we have here. [08:55.640 --> 08:58.000] And right above it, we can see the place [08:58.000 --> 09:01.800] where we read the value from. [09:01.800 --> 09:04.800] So speaking of the memory view, right now [09:04.800 --> 09:08.520] we are working with words, but we can work with other sizes. [09:08.520 --> 09:11.360] So you can switch the view for the unit [09:11.360 --> 09:14.720] to be either word, half word, or byte, [09:14.720 --> 09:17.760] which will work with respect to both little and big [09:17.760 --> 09:19.480] endianness. 
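The word, half-word and byte views described here can be sketched in a few lines of Python. This is only an illustration of how the same bytes read differently per unit size and byte order, not the simulator's code; the function name `view_memory` and the sample word 0x12345678 are assumptions made for the example.

```python
def view_memory(raw: bytes, unit_size: int, byteorder: str) -> list[str]:
    """Format raw memory bytes as hex cells of unit_size bytes (1, 2 or 4)."""
    if unit_size not in (1, 2, 4):
        raise ValueError("unit_size must be 1, 2 or 4")
    if len(raw) % unit_size != 0:
        raise ValueError("raw length must be a multiple of unit_size")
    cells = []
    for offset in range(0, len(raw), unit_size):
        # Interpret the slice with the requested byte order.
        value = int.from_bytes(raw[offset:offset + unit_size], byteorder)
        cells.append(f"{value:0{unit_size * 2}x}")
    return cells


# Hypothetical memory content: one word stored by a little-endian CPU.
raw = (0x12345678).to_bytes(4, "little")

print(view_memory(raw, 4, "little"))  # ['12345678']
print(view_memory(raw, 2, "little"))  # ['5678', '1234']
print(view_memory(raw, 1, "little"))  # ['78', '56', '34', '12']
print(view_memory(raw, 4, "big"))     # ['78563412']
```

The last line shows why the byte order matters: the same four bytes read as a different word when the view assumes big-endian storage.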
[09:19.480 --> 09:23.240] And once we add cache, which will be in a moment, [09:23.240 --> 09:26.640] we can switch the memory view between a direct view of the memory [09:26.640 --> 09:30.080] and a view where we are looking through the cache. [09:30.080 --> 09:34.280] So we will see also the cache data. [09:34.280 --> 09:38.440] Now we move to the last instruction of our program. [09:38.440 --> 09:41.120] That is the branch equal instruction. [09:41.120 --> 09:46.960] And we will see that we take the destination from the immediate. [09:46.960 --> 09:49.600] It's negative, so we're jumping back. [09:49.600 --> 09:51.560] We add it to the program counter, [09:51.560 --> 09:55.440] and here we send it back to the program counter. [09:55.440 --> 09:58.680] In the control unit now, the signal for branching, [09:58.680 --> 10:01.520] for the most generic one, is active. [10:01.520 --> 10:06.400] And we see that the branch resolving mechanism [10:06.400 --> 10:08.640] determined that the branch should be taken. [10:08.640 --> 10:12.120] So we go to the program counter, and instead [10:12.120 --> 10:16.120] of taking the program counter plus four, [10:16.120 --> 10:19.640] we are now instructed to switch this multiplexer [10:19.640 --> 10:22.640] and take the value computed by the arithmetical and logical [10:22.640 --> 10:23.360] unit. [10:23.360 --> 10:26.000] So here we have a detail of the program counter [10:26.000 --> 10:29.800] and the register itself in the pipeline. [10:29.800 --> 10:33.320] And as you can see, we continue to the load word instruction [10:33.320 --> 10:39.600] so that we can keep running in a loop in this example. [10:39.600 --> 10:43.840] So this is a very basic example just for this presentation, [10:43.840 --> 10:47.840] but it gives you an idea how we start with the simple processor. 
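The program-counter multiplexer from this walkthrough can be modelled roughly as below. It is a sketch under assumptions: the function name `next_pc`, its parameters and the example addresses are made up for illustration, and only the branch-if-equal case is modelled.

```python
MASK32 = 0xFFFFFFFF


def next_pc(pc: int, is_branch: bool, rs1: int, rs2: int, imm: int) -> int:
    """Select the next program counter for a single-cycle core (beq only)."""
    # The ALU subtracts the operands; a zero result means they are equal.
    alu_zero = ((rs1 - rs2) & MASK32) == 0
    branch_taken = is_branch and alu_zero
    if branch_taken:
        # Multiplexer switched: use PC plus the sign-extended immediate.
        return (pc + imm) & MASK32
    # Default input of the multiplexer: the next sequential instruction.
    return (pc + 4) & MASK32


# Hypothetical loop: a beq x0, x0 at 0x208 jumping 8 bytes back.
print(hex(next_pc(0x208, True, 0, 0, -8)))   # 0x200
print(hex(next_pc(0x200, False, 0, 0, 0)))   # 0x204
```

A negative immediate, as in the talk, simply lowers the program counter, which is how the loop returns to its first instruction.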
[10:47.840 --> 10:51.680] And when students have an idea of what the wires are doing [10:51.680 --> 10:54.400] and what it all means, we can start [10:54.400 --> 10:58.080] speaking about how to make the CPU faster. [10:58.080 --> 11:01.400] And well, what we find out is that the memory is slow, [11:01.400 --> 11:03.480] so we want to cache it. [11:03.480 --> 11:05.400] However, that's not that simple. [11:05.400 --> 11:06.920] There are many ways to cache it. [11:06.920 --> 11:07.800] We can cache data. [11:07.800 --> 11:10.120] We can cache instructions. [11:10.120 --> 11:13.880] We can choose different sizes of the caches. [11:13.880 --> 11:17.600] And if you do it wrong, it will be worse than doing nothing. [11:17.600 --> 11:20.680] So we really need to think about that. [11:20.680 --> 11:24.120] So we switch to the second option in the configuration [11:24.120 --> 11:25.880] and again hit Start Empty. [11:25.880 --> 11:29.560] But this time, we will hit the button up there [11:29.560 --> 11:33.080] which will open the integrated editor with syntax [11:33.080 --> 11:35.280] highlighting and the integrated assembler. [11:35.280 --> 11:39.200] The other options are to open a file, to save a file, close a file. [11:39.200 --> 11:40.120] This is the important one. [11:40.120 --> 11:43.960] This is to run the assembler and upload it to memory. [11:43.960 --> 11:47.360] And if you have some more complex project, [11:47.360 --> 11:50.440] you can even invoke an external makefile. [11:50.440 --> 11:54.400] Except for WebAssembly where there is no make. [11:54.400 --> 12:00.400] So we insert a simple selection sort. [12:00.400 --> 12:03.480] For this example, I put it in three columns [12:03.480 --> 12:04.960] so that we can see the program. [12:04.960 --> 12:08.120] But it's just a simple selection sort. What is important here [12:08.120 --> 12:11.880] is that we have data that will be put into the ELF file [12:11.880 --> 12:14.400] and later stored into memory. 
[12:14.400 --> 12:18.320] And that will be the data that we want to sort. [12:18.320 --> 12:21.120] So we save the file so that we don't lose it [12:21.120 --> 12:23.800] and we move to the memory view. [12:23.800 --> 12:27.000] Right here, you can see the data that we have inputted. [12:27.000 --> 12:30.960] And on the right side, a new window opens. [12:30.960 --> 12:33.600] And it's the detail of the data cache. [12:33.600 --> 12:38.040] It has two parts, statistics of the cache performance. [12:38.040 --> 12:41.720] And it has a very detailed view of the internal structure [12:41.720 --> 12:46.560] of the cache and the data that are there right now. [12:46.560 --> 12:49.760] So if we start running, the cache was empty. [12:49.760 --> 12:53.920] So of course, we need to add it and we will miss. [12:53.920 --> 12:56.200] And we will continue adding the values. [12:56.200 --> 12:58.600] And right now, the cache is full. [12:58.600 --> 13:01.520] So the first decision that the students need to understand [13:01.520 --> 13:04.760] is, what shall we evict? [13:04.760 --> 13:09.560] So right now, there is random choice. [13:09.560 --> 13:14.400] So it evicts the first one and we continue running. [13:14.400 --> 13:18.160] And right now, the selection sort will start placing values [13:18.160 --> 13:19.520] where they should be. [13:19.520 --> 13:22.880] So it moves the value number one to the first one. [13:22.880 --> 13:26.360] And you see the yellow highlight shows us [13:26.360 --> 13:28.840] that the value is now written in the cache [13:28.840 --> 13:30.480] and the cache is dirty. [13:30.480 --> 13:34.040] So right now, the cache is using the write-back policy [13:34.040 --> 13:36.720] just to show you the highlight. [13:36.720 --> 13:38.240] And we can continue. [13:38.240 --> 13:41.120] And now, we switch one with five. [13:41.120 --> 13:44.880] And it was a simple example, so the array is now sorted. 
[13:44.880 --> 13:50.240] But we still need to go over it because the CPU doesn't see [13:50.240 --> 13:51.960] that, it needs to check it. [13:51.960 --> 13:55.320] And because we use the write-back policy, [13:55.320 --> 13:57.080] we also need to run a fence instruction [13:57.080 --> 13:59.840] to make sure that everything is stored back to the memory. [13:59.840 --> 14:02.080] So you see there is no highlight anymore [14:02.080 --> 14:03.920] and the cache is empty. [14:03.920 --> 14:07.200] And this is a time where we should look at the performance [14:07.200 --> 14:12.360] data and we will see that we have quite an improvement. [14:12.360 --> 14:16.080] Of course, it depends on how fast a memory we have [14:16.080 --> 14:17.240] and how fast a cache we have. [14:17.240 --> 14:20.120] So in the configuration, you can set the penalties [14:20.120 --> 14:24.920] for hits and misses so that this data will look [14:24.920 --> 14:28.160] according to what you need. [14:28.160 --> 14:31.080] So this is the configuration of the data cache. [14:31.080 --> 14:35.520] You can see that we can select a different shape of the cache, [14:35.520 --> 14:38.000] which means number of sets, block size, [14:38.000 --> 14:39.760] degree of associativity. [14:39.760 --> 14:41.720] And then we have two policies. [14:41.720 --> 14:44.120] Replacement policy, which is one of random, [14:44.120 --> 14:47.600] least recently used, or least frequently used. [14:47.600 --> 14:50.920] And write policy, which can be, [14:50.920 --> 14:54.880] you can see, write-back or write-through, the options are there. [14:54.880 --> 14:57.680] So here is how the detail of the shape of the cache [14:57.680 --> 15:01.240] looks for two different configurations, which in fact [15:01.240 --> 15:04.440] have the same amount of data there. [15:04.440 --> 15:07.120] But they will probably have quite different performance. [15:07.120 --> 15:13.760] And also, the right one, it's way more complicated. 
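To see why two cache shapes with the same capacity can perform differently, here is a small Python sketch of a set-associative cache with least-recently-used replacement. It is a simplified model written for this transcript, not the simulator's implementation: the class name `Cache`, the 4-byte word size and the example addresses are assumptions, and writes and dirty bits are left out.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache with LRU replacement, counting hits and misses."""

    def __init__(self, sets: int, block_words: int, ways: int):
        self.n_sets = sets
        self.block_bytes = 4 * block_words  # assumes 4-byte words
        self.ways = ways
        # One ordered dict per set: keys are tags, order is recency of use.
        self.sets = [OrderedDict() for _ in range(sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        block = address // self.block_bytes
        index = block % self.n_sets
        tag = block // self.n_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)     # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) == self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        return False


# Two shapes with the same capacity of four words.
direct_mapped = Cache(sets=4, block_words=1, ways=1)
fully_associative = Cache(sets=1, block_words=1, ways=4)

# Hypothetical trace: two addresses that share a set when direct mapped.
for address in [0x400, 0x410, 0x400, 0x410]:
    direct_mapped.access(address)
    fully_associative.access(address)

print(direct_mapped.hits, direct_mapped.misses)          # 0 4
print(fully_associative.hits, fully_associative.misses)  # 2 2
```

With this trace the direct-mapped shape keeps evicting the line it needs next, while the associative shape holds both, which is the kind of difference students can then explore with their own programs.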
[15:13.760 --> 15:16.680] So the students can examine what this [15:16.680 --> 15:19.640] will do to their particular programs and data loads [15:19.640 --> 15:20.600] that they will have. [15:23.560 --> 15:25.920] Of course, we have also a program cache. [15:25.920 --> 15:28.480] And I have added this very stupid design [15:28.480 --> 15:30.320] where we have just one cell. [15:30.320 --> 15:35.760] And well, in a program, you never go to the same address twice. [15:35.760 --> 15:38.400] So you can see that we always missed. [15:38.400 --> 15:41.160] And it's actually worse than doing nothing. [15:41.160 --> 15:44.080] So just an example of that. [15:44.080 --> 15:48.360] And I mentioned that the windows can just appear. [15:48.360 --> 15:51.880] So if we prepare some example code for students, [15:51.880 --> 15:55.320] we can insert special pragmas to make the windows appear [15:55.320 --> 15:59.040] so that the students who have not seen the simulator before [15:59.040 --> 16:01.720] can just start with all the tools they need. [16:01.720 --> 16:03.560] So we can show any part. [16:03.560 --> 16:08.800] We can focus an address in memory, and so on. [16:08.800 --> 16:13.400] So now we have improved the speed of the memory. [16:13.400 --> 16:15.000] And we look at the CPU again. [16:15.000 --> 16:17.320] And we find out that most of the silicon [16:17.320 --> 16:20.400] is just lying there doing nothing most of the time. [16:20.400 --> 16:23.480] So we want to utilize it better. [16:23.480 --> 16:28.680] So we see that most of the parts are somewhat independent. [16:28.680 --> 16:32.280] So we split it into five parts. [16:32.280 --> 16:34.280] So let's go to the third option. [16:34.280 --> 16:36.960] We will choose pipeline without cache. [16:36.960 --> 16:42.040] And you can see that the image got somewhat more complicated. [16:42.040 --> 16:46.960] However, if you remember the previous one, it's all the same. 
[16:46.960 --> 16:49.160] So students already invested some effort [16:49.160 --> 16:53.520] into understanding what the image was supposed to mean. [16:53.520 --> 16:55.560] So we don't want to move things around. [16:55.560 --> 16:58.320] We are just adding the minimum amount of things [16:58.320 --> 17:00.800] that we need so that we can continue [17:00.800 --> 17:01.880] with the more complex stuff. [17:01.880 --> 17:06.440] So you can see we display all the instructions [17:06.440 --> 17:08.600] in each stage of the pipeline. [17:08.600 --> 17:11.280] We are adding the interstage registers. [17:11.280 --> 17:15.440] And the colors are not random. [17:15.440 --> 17:19.600] We are also using them to show which instructions correspond [17:19.600 --> 17:22.880] to each stage of the pipeline in the program view. [17:22.880 --> 17:28.000] So you can see all the stalls and branches nicely [17:28.000 --> 17:30.400] as they progress through the pipeline. [17:30.400 --> 17:35.800] And of course, we needed to add animated windows [17:35.800 --> 17:38.920] to each stage of the pipeline for each wire, [17:38.920 --> 17:41.520] for both control and data that we have. [17:41.520 --> 17:46.840] And now we find out that this is not all we need to do. [17:46.840 --> 17:49.960] Because up until now, we assumed that, OK, [17:49.960 --> 17:51.560] we process one instruction. [17:51.560 --> 17:52.520] Instruction is done. [17:52.520 --> 17:53.800] We continue. [17:53.800 --> 17:55.800] But right now, we start processing an instruction. [17:55.800 --> 18:00.680] But it will take, what, four more cycles before we get the results. [18:00.680 --> 18:04.400] So we have something that's called data hazards. [18:04.400 --> 18:06.440] And that means that if we do nothing, [18:06.440 --> 18:07.720] we will get old data. [18:07.720 --> 18:09.760] Because the data from the instruction [18:09.760 --> 18:12.760] that we depend on is not ready yet. 
[18:12.760 --> 18:19.120] So we can have a hazard spanning two cycles in the next instruction [18:19.120 --> 18:23.920] and a hazard spanning one cycle in the instruction after. [18:23.920 --> 18:26.960] Technically, there is also a hazard in the instruction [18:26.960 --> 18:27.840] that is after that. [18:27.840 --> 18:31.960] But this is something that we can solve inside the register file [18:31.960 --> 18:34.000] without breaking the abstraction of the pipeline. [18:34.000 --> 18:37.360] So we will ignore the last case. [18:37.360 --> 18:40.280] And so what can we do about that? [18:40.280 --> 18:42.360] Well, we need to detect those problems. [18:42.360 --> 18:46.520] And we need to counteract them in some way. [18:46.520 --> 18:50.240] So we are adding a hazard unit to the CPU. [18:50.240 --> 18:53.040] And it has two options. [18:53.040 --> 18:57.040] First option is, OK, we will wait for the data. [18:57.040 --> 19:00.160] But we said we want to build a fast CPU. [19:00.160 --> 19:03.920] And waiting is, well, not fast. [19:03.920 --> 19:08.560] So we also have an option to forward the data as we need it. [19:08.560 --> 19:10.320] So yeah, it's in the core tab. [19:10.320 --> 19:12.240] And this is the option. [19:12.240 --> 19:16.840] So this is the CPU with added forwarding paths. [19:16.840 --> 19:19.960] So what we have here is we are adding a path [19:19.960 --> 19:23.360] from the writeback stage to get it directly [19:23.360 --> 19:26.160] into the execute stage. [19:26.160 --> 19:27.960] The multiplexers got bigger. [19:27.960 --> 19:31.560] And we are adding a second path to forward the data [19:31.560 --> 19:34.440] from the memory stage. [19:34.440 --> 19:36.760] So here is the difference. [19:36.760 --> 19:39.520] There is some extra wiring. [19:39.520 --> 19:41.680] The CPU looks, well, simpler. [19:41.680 --> 19:43.280] But it will wait. [19:43.280 --> 19:44.680] And that's what we want. 
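The choice made by the enlarged multiplexers in front of the execute stage can be sketched as follows. This follows the usual textbook forwarding rule, where the memory stage holds the newer value and therefore takes priority over the writeback stage; whether the simulator encodes it exactly this way is an assumption, and the function and parameter names are hypothetical.

```python
def forward_select(src_reg: int,
                   mem_rd: int, mem_writes: bool,
                   wb_rd: int, wb_writes: bool) -> str:
    """Pick the source of an execute-stage operand in a five-stage pipeline."""
    # Register x0 is hard-wired to zero, so it never needs forwarding.
    if src_reg != 0:
        # The instruction in the memory stage is newer, so check it first.
        if mem_writes and mem_rd == src_reg:
            return "memory"
        if wb_writes and wb_rd == src_reg:
            return "writeback"
    return "register_file"


# Operand x10 was produced by the instruction now in the memory stage.
print(forward_select(10, mem_rd=10, mem_writes=True, wb_rd=5, wb_writes=True))
# Operand x5 was produced two instructions earlier.
print(forward_select(5, mem_rd=10, mem_writes=True, wb_rd=5, wb_writes=True))
# No pending writer: read the register file as usual.
print(forward_select(7, mem_rd=10, mem_writes=True, wb_rd=5, wb_writes=True))
```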
[19:44.680 --> 19:47.000] But it will cost us more to build it [19:47.000 --> 19:49.720] because it needs more components. [19:49.720 --> 19:52.280] So let's see some actual examples. [19:52.280 --> 19:55.360] We are again running the selection sort I showed you [19:55.360 --> 19:55.880] before. [19:55.880 --> 19:58.960] So we start running the instructions. [19:58.960 --> 20:03.680] And right now, we have an AUIPC instruction [20:03.680 --> 20:06.560] that is producing its value to register 10. [20:06.560 --> 20:10.080] And the hazard unit started screaming. [20:10.080 --> 20:11.360] It doesn't like it. [20:11.360 --> 20:14.640] So it tells us, it's not very visible, [20:14.640 --> 20:18.120] but the text in red is "forward". [20:18.120 --> 20:21.880] So this is the hazard that we have here. [20:21.880 --> 20:25.120] And what happens is that this multiplexer [20:25.120 --> 20:29.360] gets the value to use the forwarding wire from the memory stage. [20:29.360 --> 20:32.960] And instead of using the actual value from registers, [20:32.960 --> 20:37.200] we are taking the value from the red path. [20:37.200 --> 20:41.320] And we get here for the arithmetic and logic unit the value [20:41.320 --> 20:44.280] hexa 200, which is what we want. [20:44.280 --> 20:45.320] Well, you don't know that yet. [20:45.320 --> 20:50.320] But do remember that value, it will be important later. [20:50.320 --> 20:53.280] OK, we don't want to add the extra hardware. [20:53.280 --> 20:56.880] So we will stall instead. [20:56.880 --> 21:00.400] So this hazard unit is screaming again. [21:00.400 --> 21:02.160] But this time, it cannot forward. [21:02.160 --> 21:04.800] So it's screaming to stall. [21:04.800 --> 21:08.400] But you can see that the instructions [21:08.400 --> 21:11.240] are right next to each other. [21:11.240 --> 21:13.480] So that was the first case. [21:13.480 --> 21:17.440] And we need to actually stall two cycles. [21:17.440 --> 21:19.920] So we will run the next cycle. 
[21:19.920 --> 21:21.320] Another nop is inserted. [21:21.320 --> 21:23.600] And again, the hazard unit is screaming. [21:23.600 --> 21:25.880] So right now, we have two nops. [21:25.880 --> 21:28.720] But what we have now is that the instructions [21:28.720 --> 21:30.320] are far enough apart. [21:30.320 --> 21:33.680] And we can continue. [21:33.680 --> 21:39.720] So we get value, thank you, 200 hexa again. [21:39.720 --> 21:41.240] And we are happy. [21:41.240 --> 21:44.160] Well, and then the simulator, of course, supports [21:44.160 --> 21:49.200] the most simple option, we do nothing. [21:49.200 --> 21:51.200] And what happens here? [21:51.200 --> 21:53.000] We have the hazard. [21:53.000 --> 21:58.640] And we get value 0 because the registers were initially empty. [21:58.640 --> 22:01.040] The purpose of this setting [22:01.040 --> 22:05.840] is that we will task the students to play the hazard unit [22:05.840 --> 22:07.240] and play the compiler. [22:07.240 --> 22:09.960] So they will have to rearrange the instructions [22:09.960 --> 22:13.920] and insert as few nops as they need [22:13.920 --> 22:17.080] to make sure that the result will be correct. [22:17.080 --> 22:19.200] This will typically be some kind of homework. [22:19.200 --> 22:23.160] So we don't want to control that manually. [22:23.160 --> 22:25.800] So we have a command line interface, [22:25.800 --> 22:27.680] which can look something like that [22:27.680 --> 22:31.040] and will give you output like that. 
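The stall counts from this part of the talk (two cycles when the dependent instruction follows immediately, one cycle when it is one instruction further, none after that) can be turned into a small nop-insertion sketch. It is a simplified model, not the hazard unit of the simulator: instructions are reduced to a hypothetical `(destination, sources)` pair, and the code only inserts nops without the reordering that students are asked to do.

```python
def insert_nops(program):
    """Insert nops so a pipeline without forwarding reads correct values.

    program is a list of (dest_reg, [src_regs]) tuples; dest_reg may be None.
    The result uses None for each inserted nop.
    """
    issued = []
    for dest, sources in program:
        needed = 0
        # Only the two previously issued slots can still cause a hazard.
        for distance in (1, 2):
            if len(issued) >= distance:
                previous = issued[-distance]
                if previous is None:
                    continue  # a nop writes nothing
                previous_dest = previous[0]
                if previous_dest not in (None, 0) and previous_dest in sources:
                    # Distance 1 needs two nops, distance 2 needs one.
                    needed = max(needed, 3 - distance)
        issued.extend([None] * needed)
        issued.append((dest, sources))
    return issued


# Hypothetical snippet: x10 is produced and used right away.
adjacent = [(10, []), (11, [10]), (12, [10])]
print(insert_nops(adjacent).count(None))   # 2

# One independent instruction in between saves one nop.
spaced = [(10, []), (5, []), (11, [10])]
print(insert_nops(spaced).count(None))     # 1
```

The second example hints at the homework: moving an independent instruction between the producer and the consumer removes a nop without changing the result.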
[22:31.040 --> 22:32.800] So the capabilities that we have here [22:32.800 --> 22:36.920] are to assemble a file, to set the configuration, [22:36.920 --> 22:40.360] to trace the instruction in each stage of the pipeline, [22:40.360 --> 22:43.120] trace changes to memory and registers, [22:43.120 --> 22:46.680] also at the end to dump the memory and registers, which [22:46.680 --> 22:49.000] is the most useful part because we [22:49.000 --> 22:53.520] want to make sure that the result is the same as it should [22:53.520 --> 22:56.760] be, for example, when the hazard unit is not available. [22:56.760 --> 23:02.600] And finally, when we want to plug in some data, [23:02.600 --> 23:05.560] for example, when the students are supposed to sort, [23:05.560 --> 23:07.560] they should not know the data, so we [23:07.560 --> 23:12.520] have a special option to load data into the memory. [23:12.520 --> 23:18.160] And right now, we have a CPU that, for undergraduate students, [23:18.160 --> 23:22.360] is quite fast, but it's quite simple. [23:22.360 --> 23:24.680] So now instead of making it fast, [23:24.680 --> 23:26.080] we will make it more complicated. [23:26.080 --> 23:28.360] And we are adding memory-mapped peripherals [23:28.360 --> 23:32.200] and some very simple operating system emulation. [23:32.200 --> 23:35.040] So if we open the templates again, [23:35.040 --> 23:38.000] you can see that there is a template for the operating system. [23:38.000 --> 23:40.840] And it really does nothing special. [23:40.840 --> 23:43.720] All it does is prepare for calling a system call, [23:43.720 --> 23:46.080] which is ecall in RISC-V. [23:46.080 --> 23:49.640] And we will print Hello world to a file. [23:49.640 --> 23:54.080] The file is connected to a terminal. 
[23:54.080 --> 24:00.040] So what we actually do is, after running the, [24:00.040 --> 24:01.840] now we get the ecall detected, you [24:01.840 --> 24:03.840] can see that there is something written there, [24:03.840 --> 24:05.920] that is the detected syscall. [24:05.920 --> 24:08.400] So we will stop fetching new instructions, [24:08.400 --> 24:11.960] so it's more clear for the students. [24:11.960 --> 24:15.640] And once we get to memory, here is [24:15.640 --> 24:19.360] our Hello world in the simulated terminal. [24:19.360 --> 24:22.040] And you can see that the pipeline is empty [24:22.040 --> 24:25.200] because we are looking at this from the perspective of user [24:25.200 --> 24:25.840] land. [24:25.840 --> 24:28.160] So the operating system is emulated, [24:28.160 --> 24:33.040] and we don't see the instructions in the pipeline. [24:33.040 --> 24:35.960] We just see that the pipeline was flushed before. [24:35.960 --> 24:38.160] It was written to us. [24:38.160 --> 24:41.800] So these are the system calls that we support. [24:41.800 --> 24:45.320] It's not much, but it's enough to show the basics [24:45.320 --> 24:47.800] to the students. [24:47.800 --> 24:51.520] And these are all the peripherals that we can play with, [24:51.520 --> 24:53.360] mainly using the system calls. [24:53.360 --> 24:55.320] So this is not a peripheral, but we [24:55.320 --> 24:57.600] have control and status registers, [24:57.600 --> 24:59.720] some basic support of them. [24:59.720 --> 25:02.160] We have an LCD display. [25:02.160 --> 25:03.840] The terminal I've already shown you, [25:03.840 --> 25:08.280] and there are some general purpose I/O peripherals, [25:08.280 --> 25:11.400] two LEDs, three knobs with buttons. [25:11.400 --> 25:14.720] And this might seem a little random to you. [25:14.720 --> 25:15.220] It's not. [25:18.400 --> 25:22.040] Because we also have this real board that [25:22.040 --> 25:24.160] has exactly the same peripherals. 
[25:24.160 --> 25:27.160] And the simulator is set up in a way [25:27.160 --> 25:29.880] that you can take the same code, well, not assembly code, [25:29.880 --> 25:34.560] because, unfortunately, the board is ARM. [25:34.560 --> 25:37.880] But you can take the same C code, [25:37.880 --> 25:42.000] and run it both on the simulator and on the real board, [25:42.000 --> 25:43.520] and you can move back and forth. [25:47.200 --> 25:51.520] So I said C code, so we can't use the integrated assembler [25:51.520 --> 25:54.640] anymore, and we did not implement a C compiler. [25:54.640 --> 25:59.440] So you can use Clang or GCC to compile [25:59.440 --> 26:02.640] the file into ELF, which, of course, [26:02.640 --> 26:03.720] has to be statically linked. [26:03.720 --> 26:09.080] And what we do now is, instead of hitting Start Empty, [26:09.080 --> 26:13.560] we will here insert the ELF executable, [26:13.560 --> 26:16.560] and we will hit Load Machine. [26:16.560 --> 26:19.000] So this is an example of some program [26:19.000 --> 26:21.400] that writes to the LCD display. [26:21.400 --> 26:23.080] I will not show you the code here, [26:23.080 --> 26:29.840] because this is one of the tasks that we give our students. [26:29.840 --> 26:32.640] But I can show you this code. [26:32.640 --> 26:35.320] This is available on our GitLab. [26:35.320 --> 26:36.760] We provide a link here. [26:36.760 --> 26:43.720] And it is a simple test suite for malloc, from newlib. [26:43.720 --> 26:47.120] So you can link your code to newlib [26:47.120 --> 26:50.720] and use it in our simulator, or at least some basic parts. [26:50.720 --> 26:53.080] And we have tested that with malloc. [26:53.080 --> 26:56.680] So we can run this, and it's connected to the terminal. [26:56.680 --> 27:00.280] So we can see that some dynamic allocation takes place, [27:00.280 --> 27:03.000] and some checks take place, and the tests [27:03.000 --> 27:04.320] are running successfully. 
[27:04.320 --> 27:06.920] So now you can use actual dynamic allocation [27:06.920 --> 27:09.920] inside the simulator. [27:09.920 --> 27:11.880] And now some conclusions. [27:11.880 --> 27:14.920] So frequently asked questions. [27:14.920 --> 27:17.080] Is the simulator cycle accurate? [27:17.080 --> 27:19.560] That's a pretty important one. [27:19.560 --> 27:22.880] Yes, kind of. [27:22.880 --> 27:26.080] The thing is, we always assume that the memory has enough time [27:26.080 --> 27:29.200] to finish, and if it doesn't, the CPU will wait, [27:29.200 --> 27:32.760] we will not insert stalls or anything. [27:32.760 --> 27:35.600] The memory will finish, but that's the only exception. [27:35.600 --> 27:39.720] Otherwise, it's inside written in quite similar style [27:39.720 --> 27:43.240] to malloc, system malloc. [27:43.240 --> 27:46.640] So is this compliant with the official RISC-V tests? [27:46.640 --> 27:49.040] And yes, I can please see this. [27:49.040 --> 27:51.680] Yes, in the previous version, we have added that, [27:51.680 --> 27:53.680] and it's integrated into our CI, [27:53.680 --> 27:58.280] so all new changes are tested against that. [27:58.280 --> 28:00.960] As for what we support, the graphical part [28:00.960 --> 28:05.000] supports the RISC-V I with multiplication, [28:05.000 --> 28:08.280] and also with control and status register instructions. [28:08.280 --> 28:11.520] The command line also supports 64 bits. [28:11.520 --> 28:16.560] However, I still need to find out how to fit the 64-bit values [28:16.560 --> 28:21.160] into the visualization, because it's already quite full. [28:21.160 --> 28:23.920] And we don't have virtual memory yet, [28:23.920 --> 28:26.800] but somewhere here in another room, [28:26.800 --> 28:28.640] there is a student who already promised [28:28.640 --> 28:30.880] to take care of that in the next year. [28:30.880 --> 28:35.480] So in a year, he might be presenting the change here, [28:35.480 --> 28:38.160] so we'll see.
[28:38.160 --> 28:41.280] So what we plan for the future, we [28:41.280 --> 28:44.440] are very close to adding interrupt support. [28:44.440 --> 28:47.080] We would like to add compressed instruction set support, [28:47.080 --> 28:50.520] because that's quite a key part of RISC-V. [28:50.520 --> 28:54.960] The only trouble there is it's pretty hard to, [28:54.960 --> 28:57.000] I'm not sure how to fit it into the program view, [28:57.000 --> 29:00.840] because then it needs to be presented sequentially, [29:00.840 --> 29:04.320] and I'm not sure how to show it so that the students see [29:04.320 --> 29:08.200] that it's really half of the size. [29:08.200 --> 29:10.520] So that's the plan. [29:10.520 --> 29:15.600] Also, the instruction encoding in RISC-V is somewhat [29:15.600 --> 29:18.360] special, especially around immediates. [29:18.360 --> 29:25.480] So I would like a visualization of each component, [29:25.480 --> 29:27.440] the blocks that the instruction is composed of, [29:27.440 --> 29:30.440] so that that can be seen for each instruction, [29:30.440 --> 29:33.720] and inspected, and edited. [29:33.720 --> 29:40.440] We want to be able to run a very minimal RISC-V Linux [29:40.440 --> 29:41.760] target. [29:41.760 --> 29:47.200] I mentioned 64-bit visualization and the problem with that, [29:47.200 --> 29:49.120] also the memory unit. [29:49.120 --> 29:53.160] Another nice thing would be to visualize [29:53.160 --> 29:55.080] the utilization of the pipeline, [29:55.080 --> 29:57.040] the image that is typically in every book, [29:57.040 --> 29:59.600] where you have squares for each instruction, [29:59.600 --> 30:03.280] and you see the spaces where no instructions were executed. [30:03.280 --> 30:05.200] So it would be nice to add. [30:05.200 --> 30:08.200] And also, even when I made these slides, [30:08.200 --> 30:10.560] I would really love to have an option to step back, [30:10.560 --> 30:14.360] because I went one instruction too far.
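On the remark that the RISC-V instruction encoding is special around immediates: the bits of an immediate are scattered across the 32-bit instruction word and differ per format. The following C sketch reassembles the immediates for the I, S, B and J formats, following the layout in the RISC-V unprivileged specification. It is an illustration only, not code from the simulator, and the function names (sign_extend, imm_i, imm_s, imm_b, imm_j) are made up here.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of `value` to a 32-bit signed integer.
 * `value` must not have any bits set above position bits-1. */
int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1);
    return (int32_t)((value ^ sign) - sign);
}

/* I-type (e.g. addi, lw): imm[11:0] = inst[31:20] */
int32_t imm_i(uint32_t inst)
{
    return sign_extend(inst >> 20, 12);
}

/* S-type (e.g. sw): imm[11:5] = inst[31:25], imm[4:0] = inst[11:7] */
int32_t imm_s(uint32_t inst)
{
    uint32_t imm = (((inst >> 25) & 0x7f) << 5) | ((inst >> 7) & 0x1f);
    return sign_extend(imm, 12);
}

/* B-type (e.g. beq): imm[12] = inst[31], imm[11] = inst[7],
 * imm[10:5] = inst[30:25], imm[4:1] = inst[11:8], imm[0] = 0 */
int32_t imm_b(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1) << 12)
                 | (((inst >> 7) & 0x1) << 11)
                 | (((inst >> 25) & 0x3f) << 5)
                 | (((inst >> 8) & 0xf) << 1);
    return sign_extend(imm, 13);
}

/* J-type (jal): imm[20] = inst[31], imm[19:12] = inst[19:12],
 * imm[11] = inst[20], imm[10:1] = inst[30:21], imm[0] = 0 */
int32_t imm_j(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1) << 20)
                 | (((inst >> 12) & 0xff) << 12)
                 | (((inst >> 20) & 0x1) << 11)
                 | (((inst >> 21) & 0x3ff) << 1);
    return sign_extend(imm, 21);
}
```

For example, the word 0xfe000ee3 encodes beq x0, x0 with a branch offset of -4, and imm_b recovers that offset from the four separate bit fields. A per-field view in the simulator, as wished for above, would show exactly these pieces.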
[30:14.360 --> 30:17.160] And it would be really hard to do it with all memory [30:17.160 --> 30:20.040] and everything, but at least for the visualization [30:20.040 --> 30:23.200] of the pipeline, I would really like that. [30:23.200 --> 30:26.840] So if you are a teacher, or represent an educational institution, [30:26.840 --> 30:29.960] and you want to use the simulator, please do. [30:29.960 --> 30:31.840] If you have some problems, contact us. [30:31.840 --> 30:33.960] We'll be happy to cooperate on that. [30:33.960 --> 30:36.000] If you are a student or developer, [30:36.000 --> 30:38.840] this is an open source project, we accept pull requests. [30:38.840 --> 30:41.680] And usually the way it works is that students [30:41.680 --> 30:46.160] do their final thesis on this project, so feel free. [30:46.160 --> 30:48.200] And if you are a distribution maintainer, [30:48.200 --> 30:53.640] you could help me with putting it into the official packages. [30:53.640 --> 30:56.120] So you can get the source at GitHub, [30:56.120 --> 31:01.200] and we also provide executables for Windows, Linux, and Mac. [31:01.200 --> 31:10.040] And we have packages using the Open Build Service [31:10.040 --> 31:12.960] and Launchpad for Ubuntu, SUSE, Fedora, and Debian. [31:12.960 --> 31:16.520] And there are also Arch and Nix packages, [31:16.520 --> 31:20.280] because those are the ones that I use. [31:20.280 --> 31:24.960] And as we mentioned, we have the online version as well. [31:24.960 --> 31:28.440] So if you would like to read more, [31:28.440 --> 31:31.600] we have some publications, the thesis, and our paper [31:31.600 --> 31:34.040] from the Embedded World Conference last year. [31:34.040 --> 31:36.520] So those are available. [31:36.520 --> 31:40.280] And we have this subject, and you [31:40.280 --> 31:44.080] can find a lot of the materials at the Comparch. [31:44.080 --> 31:46.920] Some of the materials are videos.
[31:46.920 --> 31:49.040] Some are in Czech, many are in English. [31:49.040 --> 31:52.440] And that's all from me, so thank you for your attention [31:52.440 --> 31:54.360] and for so many people coming. [31:54.360 --> 32:01.360] Thank you. [32:01.360 --> 32:03.600] Please. [32:03.600 --> 32:06.080] How tightly coupled is the simulation [32:06.080 --> 32:09.320] of the processor and the visualization? [32:09.320 --> 32:12.560] I was thinking, would it be possible to somehow connect [32:12.560 --> 32:16.080] something like ModelSim with the HDL model? [32:16.080 --> 32:18.480] Or would that be very, very difficult? [32:18.480 --> 32:20.920] Please repeat the questions for the stream. [32:20.920 --> 32:23.480] Sure, so the question was, how tightly coupled [32:23.480 --> 32:27.000] is the visualization and the simulation? [32:27.000 --> 32:31.320] So they are separate projects that are linked together [32:31.320 --> 32:36.200] and are only connected with some data passing and Qt signals. [32:36.200 --> 32:39.720] So we already have the visualization and the command [32:39.720 --> 32:41.600] line that are completely separate, [32:41.600 --> 32:43.600] and they are just connecting to the same signals. [32:43.600 --> 32:46.920] So it's quite well separated, but it's not at all stable. [32:46.920 --> 32:50.920] So, any other question? [33:02.200 --> 33:04.280] OK, if you have no other questions, [33:04.280 --> 33:06.680] we do not work only on the simulators, [33:06.680 --> 33:10.000] but we do even the design of peripherals. [33:10.000 --> 33:11.840] So for example, we have open source [33:11.840 --> 33:17.840] CAN FD stuff, we work on an open source replacement [33:17.840 --> 33:20.600] of MATLAB and Simulink. [33:20.600 --> 33:23.440] OK, it is a toy, but we work on such stuff. [33:23.440 --> 33:25.760] So if you have interest, there are links [33:25.760 --> 33:27.120] to our other projects. [33:27.120 --> 33:29.120] We have experience with motion control.
[33:29.120 --> 33:31.560] I have about 30 years of experience [33:31.560 --> 33:37.720] with embedded system design, including infusion systems [33:37.720 --> 33:40.280] for medical applications. [33:40.280 --> 33:44.200] I have contributed to the RTEMS project, which [33:44.200 --> 33:47.160] is used in the European Space Agency and so on. [33:47.160 --> 33:51.440] We do VHDL design for the experiments [33:51.440 --> 33:55.040] with space-grade FPGAs and so on. [33:55.040 --> 33:59.360] So this is the work which we do to help our students. [33:59.360 --> 34:02.920] It is a lot of work, something like eight [34:02.920 --> 34:05.760] man-years of work in the simulator, [34:05.760 --> 34:08.440] but it is only for our students, but we [34:08.440 --> 34:12.080] do even the serious stuff for the world. [34:12.080 --> 34:14.600] For example, in SocketCAN, in mainline Linux kernels, [34:14.600 --> 34:19.480] there are our open contributions, drivers, and so on. [34:19.480 --> 34:20.200] I have a question. [34:20.200 --> 34:22.200] OK. [34:22.200 --> 34:23.200] Sir? [34:23.200 --> 34:24.200] Yeah? [34:24.200 --> 34:26.680] So what do you use for the actual visualization [34:26.680 --> 34:27.400] of the pipeline? [34:27.400 --> 34:29.480] Because I remember there's something like this in, [34:29.480 --> 34:33.400] say, Altera Quartus, which does schematic [34:33.400 --> 34:35.480] RTL visualization. [34:35.480 --> 34:37.480] Well, is this yours? [34:37.480 --> 34:38.960] No, no. [34:38.960 --> 34:40.960] Something you could pull in from some other person? [34:40.960 --> 34:41.460] No. [34:41.460 --> 34:43.880] At point point, which will take three hours. [34:43.880 --> 34:48.680] Yes, the visualization that you see is actually an SVG file [34:48.680 --> 34:52.200] that has special annotations which are connected to the core. [34:52.200 --> 34:56.680] Previously, it was done by generated Qt objects, [34:56.680 --> 35:00.760] but I was not OK to work with that.
[35:00.760 --> 35:03.600] So I switched that to the SVG version [35:03.600 --> 35:07.240] where you can design it all in a graphical editor [35:07.240 --> 35:09.200] and then you just connect it to the simulator. [35:09.200 --> 35:37.720] So that's completely ours.