I'm here to talk to you about graphics programming in Python on an embedded microcontroller, which is hilarious because I'm none of these things: I'm not a graphics programmer, I'm not a Python programmer, and I'm not an embedded programmer, so we'll see how this goes. It's for that reason that I can't emphasise enough this part of the talk description: this is not an instructional talk, this is just what I did.

So, some background. EMF Camp is this weekend camping festival for hackers and makers, in a similar vein to the Chaos Communication Camp and the Dutch hacker festivals. There are robots and lasers and geodesic domes and things; it's great fun, and if you get the opportunity to go I highly recommend it. It's a bit of a tradition at this style of event to give the attendees electronic event badges, and the aim of these is to give attendees the opportunity to play with some hardware that they might not have come across before.

These are the two most recent badges from EMF Camp. The one on the left: if I told you they put a SIM card and a GSM modem on it and then set up an on-site cell phone network, you'll understand why it's made to look like a Nokia N-Gage. It's got all of the usual peripherals and sensors on there as well, like accelerometers and humidity and temperature sensors, and because it runs MicroPython it allows people to easily get started experimenting with that kind of hardware.

The one on the right is the newest one. These photographs aren't to scale, by the way; let me hold them up for comparison, the newest one is much smaller. The reason for that, as you might guess, is the silicon shortage that's been caused by fire, flood and plague, but it's still a lovely device.

On the left here you might recognise a version of The Settlers of Catan. I spent a lot of time trying to isolate small parts of the screen to redraw, because the update speed on that screen was so slow it's almost unusable for anything in real time. So when I got my hands on the new one this year, I obviously wanted to see what this one could do.

The first thing I wanted to do was just try to blit a full screen of pixels to the device using the display driver directly, and that took about 70 milliseconds, which is already orders of magnitude faster than the old badge. If I draw to an off-screen buffer instead, that's way faster, but if you're doing that you then have to get into the business of implementing your own drawing functions for primitives, and I didn't really want to do that.
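To make the off-screen idea concrete, here's a minimal sketch of timing a buffered frame, assuming a 240x240 RGB565 panel; `display.blit_buffer` is a hypothetical stand-in for whatever "push these bytes to the panel" call your particular display driver exposes.

```python
# Draw into an off-screen framebuf buffer, then push it to the panel
# in one go, timing the whole frame with the MicroPython ticks API.
import time
import framebuf

WIDTH, HEIGHT = 240, 240
buf = bytearray(WIDTH * HEIGHT * 2)          # 2 bytes per pixel in RGB565
fb = framebuf.FrameBuffer(buf, WIDTH, HEIGHT, framebuf.RGB565)

def timed_frame(display):
    start = time.ticks_ms()
    fb.fill(0)                               # draw everything off-screen...
    fb.line(0, 0, WIDTH - 1, HEIGHT - 1, 0xFFFF)
    display.blit_buffer(buf, 0, 0, WIDTH, HEIGHT)  # ...then push it once
    return time.ticks_diff(time.ticks_ms(), start)
```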
That is ominous foreshadowing, by the way. But I did discover that MicroPython has this framebuf module, which provides you with an off-screen frame buffer and also some drawing functions, which is great. That came in at 41 milliseconds, which I thought was a fair compromise; that's a good start. Now I've got a baseline for how fast I can draw to the screen.

Obviously what this is about is drawing 3D things on the screen of this device, so in case you don't know, this is basically 3D rasterization 101: the minimum we have to do in order to get 3D points onto the screen. We start with our vertex coordinates, which are multiplied by the model matrix to get into world space; then you multiply that by the view matrix to get view space; and then by the projection matrix to get clip space. Clip space lets you see which vertices will be clipped by the edges of the screen. Once we know the list of vertices we want to render, we can do the perspective division to bring them into normalised device coordinate (NDC) space. The perspective division is just the part that makes points that are further away appear closer together, so it gives you that illusion of 3D. Then we've got to convert the normalised device coordinates, which are between minus one and one, to screen space, which is our pixel coordinates.

When I was doing this, rendering the eight points of a cube on screen wasn't too bad at 53 milliseconds, and if you join those up to create your cube wireframe it's not that much slower; there are 12 triangles there, obviously.

The next step is to start filling in those triangles, because you want to draw solid shapes after all. Annoyingly, there's no function for doing that in MicroPython's framebuf module. There is one in the display driver, but as I mentioned, using the display driver directly is much slower, because we're making many more calls to the hardware, setting pins high and low every time we want to draw something, when we only want to do that once, when we blit the whole thing to the screen. So since framebuf doesn't provide a polygon or polygon-fill method, I did have to get into the business of writing these sorts of functions myself after all.
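Here's a minimal sketch of that pipeline in plain Python, using 4x4 row-major matrices and homogeneous coordinates. The helper names are mine, not the talk's actual code, and a real renderer would clip properly against all six frustum planes rather than just discarding vertices.

```python
# Transform one vertex: model -> world -> view -> clip -> NDC -> screen.

def mat_mul_vec(m, v):
    # m is a 4x4 row-major matrix (nested lists/tuples), v a 4-vector
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def project(vertex, model, view, projection, width, height):
    v = (vertex[0], vertex[1], vertex[2], 1.0)
    x, y, z, w = mat_mul_vec(projection, mat_mul_vec(view, mat_mul_vec(model, v)))
    if not (-w <= x <= w and -w <= y <= w and -w <= z <= w):
        return None                      # outside clip space: off-screen
    ndc_x, ndc_y = x / w, y / w          # perspective division -> [-1, 1]
    sx = int((ndc_x + 1) * 0.5 * width)  # NDC -> pixel coordinates
    sy = int((1 - ndc_y) * 0.5 * height) # flip y: screen y grows downward
    return sx, sy
```

It's exactly this kind of arithmetic, repeated for every vertex of every face, that becomes the hot path later in the talk.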
The display driver itself does have these methods, so obviously that's the first place I looked for implementation clues. It has a polygon method and a fill-polygon method, only there are problems with it; it's a little bit rubbish. The figure on the left is just using the outline polygon method, and in the second one I've tried to draw a filled polygon over the top of the wireframe polygon, and you can see it just doesn't quite match up.

Reading the code, it seems to be implementing quite a well-known, well-documented polygon-fill method, and there's a link to the website where the algorithm is described, which also supplies a reference implementation. So I was able to copy the reference implementation to see if the display driver's implementation was different, and it isn't; it's exactly the same. It looks like the display driver inherited the same problems that were in the reference implementation. You'll notice it's not only incorrect on one side: the left edge is completely different from the right edge, so it's over-drawing on one side and not drawing enough on the other.

A lot of the problems were rounding errors and floating-point-to-integer truncation and that sort of thing, which I managed to mostly fix, except for one really annoying pixel down in the corner that I just couldn't get. I wanted to submit this enhancement to the framebuf module upstream to the MicroPython project, so we spent a few days scratching our heads over it to figure out what we could do. Initially we proposed just drawing the outline again on top of the filled polygon, to sweep the problem under the rug, but eventually we figured out a much better way: detect when these stray pixels are going to happen and fill them in explicitly, instead of letting the algorithm do it.

It's pretty obvious that the algorithm was developed by a physicist or a mathematician, because the article that describes it says, and I'm quoting here, that detecting points on the polygon edge "will deliver unpredictable results", but that is, quote, "not generally a problem", because, quote, "the edge of the polygon is infinitely thin". Now, my polygons have an edge one pixel thick, so that's obviously why we had to fix these problems.
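For reference, here's the general shape of a scanline polygon fill of the kind being described: for each row, find where the polygon's edges cross it, sort the crossings, and fill between pairs. This is a sketch of the textbook even-odd approach, not the exact code that went upstream, and it shows why one-pixel edges are so fiddly: every crossing has to be rounded to a pixel.

```python
# Even-odd scanline fill into a framebuf.FrameBuffer.
# points is a list of integer (x, y) vertices in order around the polygon.
def fill_polygon(fb, points, colour):
    ys = [y for _, y in points]
    n = len(points)
    for y in range(min(ys), max(ys) + 1):
        crossings = []
        for i in range(n):
            x0, y0 = points[i]
            x1, y1 = points[(i + 1) % n]
            # half-open test so a scanline through a shared vertex
            # isn't counted twice
            if (y0 <= y < y1) or (y1 <= y < y0):
                crossings.append(x0 + (y - y0) * (x1 - x0) // (y1 - y0))
        crossings.sort()
        for i in range(0, len(crossings) - 1, 2):
            fb.hline(crossings[i], y, crossings[i + 1] - crossings[i] + 1, colour)
```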
Anyway, now we can draw arbitrary polygons to the screen, so let's see what that looks like. This is the cube again, which is basically the hello world of 3D graphics programming, and it seems to work pretty well: 66 milliseconds. But you can see in the left-hand screenshot that it looks like you're looking at the inside of the cube. You aren't; it's just that we're drawing the back face of the cube on top of the front face. So as part of this 3D rasterization process you've now got to do back-face culling, which is more maths added to the pipeline. You've got to calculate the normal vector of each face, which is the direction the face is pointing, then compute the dot product of that with the direction you're looking, so you know whether the triangle is facing you or not, and then just not bother drawing the ones that are facing away. It's just more maths, so it adds more time. Oh, and you get the occasional really long frame, which coincides with a garbage collection, I guess; we'll talk a bit more about that in a bit.

There's some really low-hanging fruit we can pick to improve the performance initially, which basically amounts to being smarter about the algorithms we use. We pre-calculate the normals instead of calculating them every frame, which for a static model like this makes total sense, and we avoid doing the perspective division if we can help it. I'd implemented it as part of the matrix multiplication process, but it's usually a no-op: unless you're multiplying by the perspective matrix, it does nothing, so we can just avoid doing those divisions on every vertex of every face in every frame. That's quite a lot of time saved, which means I can add more things and make it do extra work, like a rudimentary lighting model to make the cube look nice with shading and whatnot.

What I'm trying to do, basically, is keep the rendering time below 100 milliseconds, because that seems like a good target to have: if I can do that, I get a reasonable performance of 10 frames per second. This works well and is close to that target, so I wanted to try something a bit more complex. I downloaded a model of the industry-standard teapot and tried to render that. It's about 240 faces, 240 triangles, and it completely destroyed my 100-millisecond budget. So I had to think of more ways to make this faster, and the obvious way is to rewrite the hottest maths functions in C as a MicroPython native module. The two that are called most often are the matrix-vector multiplication and the dot product.
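For a sense of what those hot spots look like, here's a sketch of the kind of pure-Python vector maths being described, with the back-face test built on top of it. The names and layout are my own guesses at the shape of the code, not the talk's actual source; this is the sort of function that runs once per face per frame, which is why it's worth pushing into C.

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def is_front_facing(v0, v1, v2, view_dir):
    # Face normal from two edge vectors; for a static model this can be
    # precomputed once instead of recalculated every frame.
    e1 = (v1[0] - v0[0], v1[1] - v0[1], v1[2] - v0[2])
    e2 = (v2[0] - v0[0], v2[1] - v0[1], v2[2] - v0[2])
    normal = cross(e1, e2)
    return dot(normal, view_dir) < 0   # negative: facing the camera
```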
You can see that more than cuts the time in half, and with the success of that, it's pretty clear I should rewrite all of the maths in C, because if I've got the bonnet up I might as well. That brings the time right down, to a glorious six frames per second. As a general strategy, if you find yourself calling a method 1,200 times a frame, it's probably a good candidate to be pushed down into the native layer.

A note on writing native code for MicroPython: there are really two ways of doing it. The first is what's called an external C module, which is C code that you write, exposed as a module to the Python runtime and compiled directly into the firmware. That's a bit suboptimal, because it would be nice if I didn't require other people who have these devices to reflash the firmware every time I changed this program. The other way is to write what they call a native module, which allows your application to supply native code as an .mpy file that can be dynamically loaded at runtime, which is a much nicer way of doing it. So obviously that's what I wanted to do.

But I came across problems when I tried to build the native code, because I'd used a floating-point division in there for the perspective division step of the pipeline, and I got a linker error from the Espressif toolchain for the ESP32. I'd love to know why this happens, and if anyone from Espressif is here, I'd love to know if it's fixed in a newer version, but it seems like it can't link the software implementation of floating-point division. So what I did was download the source for their toolchain and find the assembly implementation of that routine to add into my project. That also didn't work, because the MicroPython build system wasn't prepared to accept it, but that was an easy fix, and it was actually the first change I got accepted into MicroPython. In my experience they're very good at accepting patches.

Then, once I got it building, it just caused my application to crash. I'm not sure why this happens, but there seems to be a reference to the native code that gets collected erroneously by the garbage collector. I spent a lot of time trying to reduce my object allocations per frame, but all that did was push the crash further into the future, so I had to settle for compiling my maths functions directly into the firmware.
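One nice property of native modules, since they're loaded with an ordinary import, is that you can fall back to pure Python when the .mpy isn't available. This pattern is my suggestion rather than something from the talk, and `fastmath` is a hypothetical module name.

```python
# Prefer the dynamically loaded native module; fall back to slower
# pure-Python implementations on stock firmware.
try:
    from fastmath import dot, mat_mul_vec      # native .mpy, if present
except ImportError:
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def mat_mul_vec(m, v):
        return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
```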
There are some other things I did to try to make it faster. The big one is reducing object instantiation, which is super costly in Python: wherever you can, pre-allocate your lists and arrays and things, and then just reuse them. I initially wanted a lot of my classes to be totally immutable, like the good programmer I am, but that just wasn't feasible. When you do calculations on your vertices, you just have to mutate one of the operands and send it back.

The other thing I found saved some time was reducing the number of times we cross from Python into native code and back again. I found I was doing lots of the same operation to vertices and matrices, so if I could send them all as one batch, in a single function call into the native side, it performed a lot quicker; I think there's a lot of function-call and stack-manipulation overhead that you save. Also, pass arrays, not lists, into the native functions, especially for this kind of work where we know the data we're passing is floats. If you know ahead of time what type is in your array, you can make assumptions that MicroPython can't make, and when you manipulate the data on the native side you can skip a bunch of type-checking and write directly to the underlying data structure, which is useful.

This one surprised me as well, though I don't know if it's surprising, maybe it's obvious to veteran Pythonistas: I didn't expect the native libc qsort function to be so much faster than the sort function in Python. If you look at this picture, you can see that some parts of the teapot that should be occluded are drawn on top of the body of the teapot. What I had to do was Z-sort the faces so that we draw them from back to front, and I was using the list sort method for that, but implementing this face sorting as a native function as well was, as it says, 100 times faster.

The other thing that made a measurable difference was locally caching object references in your functions. If you're using an object attribute more than once, then instead of writing self.foo, self.foo, self.foo, create a local variable in the function and use that instead; there's some attribute-lookup overhead there that's quite significant to save. A couple of these patterns are sketched below.
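First, the allocation and batching patterns, as mentioned above. This is a sketch of the general idea, not the talk's code: `transform_batch` stands in for a hypothetical native function that processes a whole array of vertices in one Python-to-native crossing, with a pure-Python stand-in here so the sketch runs.

```python
from array import array

def transform_batch(src, dst, matrix):
    # stand-in for the native routine; the real one would apply `matrix`
    # to every (x, y, z) triple in `src` and write the results to `dst`
    for i in range(len(src)):
        dst[i] = src[i]

verts = array('f', [0.0, 0.0, 0.0,  1.0, 0.0, 0.0,  0.0, 1.0, 0.0])
out = array('f', [0.0] * len(verts))       # pre-allocated once, reused per frame

class Renderer:
    def __init__(self, fb, mvp):
        self.fb, self.mvp = fb, mvp

    def draw(self):
        pixel = self.fb.pixel              # cache the attribute lookup once...
        transform_batch(verts, out, self.mvp)  # ...one crossing, whole batch
        for i in range(0, len(out), 3):
            pixel(int(out[i]), int(out[i + 1]), 0xFFFF)  # ...reuse it in the loop
```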
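And second, the painter's-algorithm Z-sort, shown here in its pure-Python list-sort form (the native qsort version was the one that came out ~100 times faster). `face_depth` is a hypothetical helper; this assumes larger view-space Z means further from the camera.

```python
def face_depth(face, verts):
    # average Z of the face's three vertices in view space
    return (verts[face[0]][2] + verts[face[1]][2] + verts[face[2]][2]) / 3

def draw_sorted(faces, verts, draw_face):
    # draw back to front so nearer faces overwrite farther ones
    faces.sort(key=lambda f: face_depth(f, verts), reverse=True)
    for f in faces:
        draw_face(f)
```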
After applying all of this sort of thing, this is the final result, or the result so far. Getting the teapot model down to under 100 milliseconds per frame was really pleasing, and I'm pretty happy with the performance.

So what can this be used for? Honestly, this was just a fun way to spend a few weekends after the festival had happened, but it seems to be performant enough that you could do some kind of small 3D game, like a Lunar Lander or something, or make yourself a Jurassic Park-style 3D user interface for your home automation. But really, the chief lesson for me was that the best way to get involved with a project like MicroPython is to just start using it. Eventually you'll come across some kind of limitation that you're probably the best placed to overcome, because you're the one trying to solve the problem, you've got the vested interest in it, and all of the information is currently paged into your brain. The MicroPython people were extremely helpful in whipping my contributions into shape, so thanks to them for helping me get involved in MicroPython, and thanks to you for listening. I can try to answer questions, but I'm not a super expert on anything I've been talking about.

Audience: Hi, and thanks for your talk. I had a question about the ESP32 that you were implementing this on: did you ever look at using the dual-core setup to try to accelerate any of the maths?

Speaker: That is a good question, and someone has mentioned this to me before, but when I was writing this I was actually unaware that it had more than one core. So I haven't yet, but it's a great idea.

Audience: Thanks very much for your talk. If you're interested in MicroPython, in the A building there is a stand about MicroPython, and also a stand by PINE64, who make a smartwatch that can run MicroPython and things.