Okay, thank you. So our next speaker is Jesús. He has talked with us a few times in the Go devroom about everything that goes on deep within the language, and today he's going to talk to us about what's going on in functions. A round of applause.

Okay. Hello, everybody. My name is Jesús, I'm a software engineer, and I'm going to talk about squeezing a Go function.

So, what is optimization? I think it's important to understand that optimization is not just being faster or consuming less memory; it depends on your needs. Is faster always better? Probably everybody will say yes, but it depends on whether you are looking for convenience or for something that lasts forever. In that case, the fastest option is not necessarily the best one. Optimizing is about knowing what you need and trying to address that.

It's important to optimize at the right level. You can buy the best car, you can get an F1 car, and it's not going to be fast if the road is a dirt track. So always try to optimize at the upper levels first, because the kinds of optimizations we are going to see in this talk are micro-optimizations, and they are probably not the first place you should be starting.

Optimize what you need, when you need it. It's not about taking a Go function and optimizing it forever, trying to make it run super efficiently and scratching out every single nanosecond, because at some point the bottleneck is no longer there. You have to find the bottleneck, optimize where the bottleneck is, and then look again to see whether the bottleneck is still there; if it isn't, you are over-optimizing that function without much gain. So take that into consideration: optimizing is an iterative cycle, and you need to keep moving and keep searching for the bottleneck.

And please, do not guess. Yes, I know everybody has instincts and all that, but guessing about performance is an awful idea, because so many things come into play that it's just impossible: the operating system, the compiler and its optimizations, maybe a noisy neighbor if you are in the cloud. All of that affects performance, so you are almost certainly not good at guessing about performance. Just measure everything. The important thing is to measure everything and work with that data. That is probably what the talk after the next one is about, so I suggest you go there too, because it promises to be very interesting.
So let's talk about benchmarks. The way you measure performance for micro-optimizations, micro-benchmarks, is through Go benchmarks. Benchmarking comes with the Go toolchain and looks like the testing framework, but focused on benchmarks. Here we have an example with two benchmarks, one for an MD5 sum and one for a SHA-256 sum; a reconstruction is sketched below. That's it: a function whose name starts with Benchmark, receiving a *testing.B argument, with a for loop inside. That is going to do all the work of giving you the numbers.

If I run this with `go test -bench=.`, the dot is a regular expression that matches everything; like go test's -run flag, you can use a regular expression to execute only certain benchmarks. And here you can see that the MD5 sum is roughly twice as fast per operation as SHA-256. So, is that important? It depends. If you need more security, MD5 is probably not the best option. Again, it depends on your needs.

Another interesting thing is allocations. One thing you may have heard about is counting allocations. Why is that important? Because every time we allocate something, and by allocation we mean an allocation on the heap, that allocation introduces overhead, and not only that, it adds more pressure on the garbage collector. That's why it's important to count allocations when you care about performance. If you are not worried about performance at that point, don't count allocations; it's not that important, and you are not going to gain a massive amount of performance there.

Okay, let's see an example. For the MD5 and SHA-256 sums we have zero allocations, so that data is not very useful to us. Let's use something else: let's open a file, thousands of times, and see how it goes; the second sketch below shows the shape of such a benchmark. Now I see that every single operation of opening a file, just opening the file, generates three allocations and consumes 120 bytes per operation. Interesting. So now you are measuring things.
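The slide code isn't captured in the transcript, but a minimal reconstruction of the two hashing benchmarks described might look like this (the input data is an assumption):

```go
package hash_test

import (
	"crypto/md5"
	"crypto/sha256"
	"testing"
)

// data stands in for whatever input the slides used.
var data = make([]byte, 1024)

func BenchmarkMD5Sum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		md5.Sum(data)
	}
}

func BenchmarkSHA256Sum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sha256.Sum256(data)
	}
}
```

Running `go test -bench=.` prints the ns/op column; adding `-benchmem` also reports B/op and allocs/op.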
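And a sketch of the file-opening benchmark; the path is an arbitrary choice, and b.ReportAllocs() makes the allocation columns show up even without the -benchmem flag:

```go
package open_test

import (
	"os"
	"testing"
)

func BenchmarkOpenFile(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		f, err := os.Open("/etc/hosts") // hypothetical path
		if err != nil {
			b.Fatal(err)
		}
		f.Close()
	}
}
```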
You are measuring how much time it takes, how much of that time goes into processing something or into allocating things, and how much memory goes where. So let's talk about profiling, because, well, normally you actually do the profiling first, to find your bottleneck, and then you write the benchmark to tune that bottleneck. But I'm playing with the fact that I already have the benchmark, so I'm going to do the profiling on top of the benchmark.

I execute the benchmark passing the memory-profile flag to generate a memory profile, and then I use the pprof tool, which lets me analyze that profile. In this case I just ask for text output, and that text output shows me the top consumers, of memory in this case. I can see there that 84% of the memory goes to os.newFile. Okay, it's that function, but I need more information. I actually quite like this output, but if you don't, you can for example ask for SVG, and you get something very visual where it's kind of obvious where the bottleneck is; in this case, again, it's os.newFile. If I go back to pprof and instead list that function, I can see where the memory is going line by line, and here I can see that line 127 of file_unix.go is where the memory is consumed. You see 74 megabytes there because pprof is counting and aggregating all the allocations; each individual operation is still only consuming 120 bytes.

The same works with a CPU profile. In this case, most of the CPU consumption is in syscall.Syscall6. Looking at the SVG, this time it's more scattered, the CPU is consumed in many more places, but syscall.Syscall6 is still the biggest one. So I list that function and I see assembly code. You are probably not going to optimize this function any further, so this is probably not the place to be looking for optimizations anyway, but it's an example of getting to the root cause through profiling.
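The exact invocations aren't visible in the transcript; the flow described corresponds roughly to commands like these (the profile file names are arbitrary):

```
go test -bench=. -memprofile=mem.prof
go tool pprof -top mem.prof              # text output: top memory consumers
go tool pprof -svg mem.prof > mem.svg    # call graph as an image
go tool pprof -list=newFile mem.prof     # line-by-line breakdown of a function

go test -bench=. -cpuprofile=cpu.prof
go tool pprof -top cpu.prof
```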
Okay, this talk is going to proceed by example. I'm going to show you some optimizations, more to show you the process than the specific optimization. I hope you learn something in between, but it's mostly about the process, okay?

One of the things you can do is reduce CPU usage. This is a kind of silly example: you have a find function that takes a needle and a haystack, goes through the haystack searching for that needle, and gives you the result. It loops over the whole string slice even after finding a match. First I'm going to do a benchmark: I generate a lot of strings and benchmark a search for something around the middle, not exactly in the middle, but around there, and the benchmark says it takes nearly 300 nanoseconds. If I just return early, which is a kind of silly optimization, nothing super smart, I save basically almost half of the time. The benchmark is doing something really silly and the gain can vary depending on the input data, but the optimization is simply doing less, and doing less is one of the best ways to optimize. A sketch of both versions follows below.

Reducing allocations: one of the classic examples of reducing allocations is when you are dealing with slices. This is a common way of constructing a slice: I create a slice, I loop, and I start appending things to it. Okay, fine. I benchmark that, and it takes 39 allocations and around 41 megabytes per operation. That sounds like a lot. Okay, let's do this instead: let's build the slice with an initial size of a million, and this time I'm just setting each element. The final result is exactly the same, but now we have one allocation and we have consumed only one megabyte; and actually, it takes around 800 microseconds where before you had around 10 milliseconds, so it saves a lot of CPU time too. And you can squeeze it even more: if you know the exact size you want at compile time, you can build an array, which is actually faster than any slice. With an array I'm now doing zero heap allocations, it goes on the stack or into the binary somehow, but it's not consuming my heap, and this time it's approximately 300 microseconds. So that's an interesting option when you know that information at compile time. See the second sketch below.

Okay, another thing is packing. If you are concerned about memory, say you build this struct: I have a boolean, I have a float, I have an int32. The Go compiler is going to align my struct to make it more efficient and work better with the CPU, and in this case it adds seven bytes of padding between the boolean and the float, and four bytes after the integer, to keep everything aligned. The third sketch below shows the sizes.
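The find code itself isn't in the transcript; a sketch of the before and after, with hypothetical names:

```go
package search

// find scans the whole haystack even after a match is found.
func find(needle string, haystack []string) bool {
	found := false
	for _, s := range haystack {
		if s == needle {
			found = true
		}
	}
	return found
}

// findEarly returns as soon as the needle appears, roughly halving
// the average work for a needle placed around the middle.
func findEarly(needle string, haystack []string) bool {
	for _, s := range haystack {
		if s == needle {
			return true
		}
	}
	return false
}
```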
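For the slice example, a minimal sketch of the three variants described; the element type is an assumption:

```go
package build

const n = 1_000_000

// Growing with append reallocates the backing array as it grows.
func buildAppend() []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// Preallocating does a single allocation up front.
func buildPrealloc() []int {
	s := make([]int, n)
	for i := range s {
		s[i] = i
	}
	return s
}

// With the size known at compile time, an array needs no heap
// allocation at all; note that returning it copies the values.
func buildArray() [n]int {
	var a [n]int
	for i := range a {
		a[i] = i
	}
	return a
}
```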
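And a sketch of the padding effect; the field names are made up, and the sizes shown hold on 64-bit platforms:

```go
package main

import (
	"fmt"
	"unsafe"
)

// unpacked: 1 (bool) + 7 padding + 8 (float64) + 4 (int32) + 4 padding = 24 bytes.
type unpacked struct {
	b bool
	f float64
	i int32
}

// packed: largest fields first leaves only 3 trailing padding bytes = 16 bytes.
type packed struct {
	f float64
	i int32
	b bool
}

func main() {
	fmt.Println(unsafe.Sizeof(unpacked{})) // 24
	fmt.Println(unsafe.Sizeof(packed{}))   // 16
}
```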
Okay, I build and initialize a slice of a million of these structs: that is one allocation, because that's what the slice construction does, and I'm consuming around 24 megabytes per operation. If I just reorganize the struct, in this case putting the float at the beginning, then the int32, then the boolean, the compiler only adds three bytes of padding, so the whole struct is smaller in memory, and now it's 16 megabytes per operation. This kind of optimization is not going to save your day if you are just creating a few structs, but if you are creating millions of instances of a struct, it can be a significant amount of memory.

Function inlining. Inlining is something the Go compiler does for us: it takes a function and replaces any call to that function with the code generated by the function. I'm going to show you a very dumb example, sketched below: one version where I explicitly prevent inlining, and an inlined version that the compiler inlines because it's simple enough. Executing both, I save a whole nanosecond. So yeah, it's not a great optimization on its own, to be honest; you probably don't care about that nanosecond. But we are going to see later why inlining is important, and it's not because of the nanosecond.

I'm going to talk now about escape analysis. Escape analysis is another thing the compiler does for us: it analyzes our variables and decides when a variable escapes the context of the stack. When a value has to be accessible beyond its stack frame, it can no longer live on the stack, so it escapes to the heap. That is what generates allocations, and we have seen that allocations have certain implications. So let's see an example, also sketched below: one function returns a pointer, which escapes and generates an allocation; another returns a value, which is just copied into the caller's stack frame, so it generates no allocations. We can see that in the benchmark: the first version has one allocation of 8 bytes, the second has zero allocations, and the pointer version takes ten times longer, which in this case is around 12 nanoseconds. That is not a lot, but everything adds up in the end, especially when you are calling things millions of times.

Okay, and one interesting thing is escape analysis plus inlining. Why? Well, imagine this situation: you have a struct, and a constructor that instantiates that struct. The constructor returns a pointer and does all the setup it needs. Okay, great; it is generating three allocations and consuming 56 bytes per operation. What happens if I just move the logic of that initialization into a different function? If we do that, suddenly the constructor is simple enough to be inlined, and because it's inlined, the returned value no longer escapes, so that allocation is no longer needed. Something that simple lets you reduce the number of allocations of certain types whenever you have a constructor. So what I would suggest is: keep your constructors as simple as possible, and if you have complex setup logic, do it in an initialization function, well, if that doesn't hurt readability. Let's see: here we now have two allocations and 32 bytes per operation, and you are saving around 50 nanoseconds every time you instantiate it, so this is a good chunk. A sketch follows below.
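A sketch of the inlining comparison; //go:noinline is the compiler directive that suppresses inlining, the sink variable keeps the calls from being optimized away, and the function names are mine:

```go
package inline_test

import "testing"

var sink int // prevents dead-code elimination of the results

//go:noinline
func addNoInline(a, b int) int { return a + b }

// Simple enough for the compiler to inline at the call site.
func add(a, b int) int { return a + b }

func BenchmarkNoInline(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = addNoInline(i, i)
	}
}

func BenchmarkInline(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = add(i, i)
	}
}
```

Building with `go build -gcflags=-m` reports which functions the compiler decides to inline.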
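A sketch of the pointer-versus-value comparison; the 8-byte struct matches the numbers quoted, the rest is assumed:

```go
package escape_test

import "testing"

type value struct{ n int64 } // 8 bytes, matching the talk's numbers

//go:noinline
func byPointer() *value { return &value{n: 42} } // escapes to the heap

//go:noinline
func byValue() value { return value{n: 42} } // copied to the caller's stack

var (
	sinkP *value
	sinkV value
)

func BenchmarkByPointer(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkP = byPointer()
	}
}

func BenchmarkByValue(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkV = byValue()
	}
}
```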
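And a sketch of the constructor refactor; the document type and its fields are hypothetical, and the exact allocation counts depend on what the initialization actually does:

```go
package documents

type document struct {
	id   string
	tags []string
}

// Before: enough work in the constructor to block inlining, so the
// returned *document always escapes to the heap.
func newDocumentBig() *document {
	d := &document{}
	d.id = "unnamed"                // hypothetical setup work
	d.tags = make([]string, 0, 4)   // this allocation happens either way
	return d
}

// After: a trivial constructor the compiler can inline; once inlined
// at the call site, &document{} may not escape and can stay on the stack.
func newDocument() *document {
	return &document{}
}

// init carries the complex setup logic instead.
func (d *document) init() {
	d.id = "unnamed"
	d.tags = make([]string, 0, 4)
}
```

Callers then write `d := newDocument(); d.init()`; if d never escapes the caller, the struct itself can be stack-allocated.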
Okay, well, optimization is sometimes a matter of trade-offs. Sometimes you can simply do less: fewer allocations, less CPU work, less garbage-collector pressure, all of that can be done. But sometimes it's not about doing less, it's about consuming a different kind of resource: I care less about memory and more about CPU, or the other way around. Concurrency is one of the cases where you need to decide what you want to consume, because goroutines are really cheap, but they are not free at all.

So let's see an example with I/O. These are two functions I created: fakeIO, which simulates a hundred I/O operations with time.Sleep, and fakeIOParallel, which receives a number of goroutines and does basically the same, but distributing those hundred cycles across the goroutines. And I built a benchmark that runs this using three different approaches: the serial one, with no concurrency; a concurrent one using the number of CPUs in my machine; and a concurrent one using the number of tasks I have, one goroutine per job. Because this is I/O, the result is that if I create one goroutine per job, the bytes per operation and the number of allocations spike, but the time consumed is way lower: the benchmark manages to execute this function around a hundred times with one goroutine per job, and only about twelve times with one goroutine per CPU, because this is I/O. So let's see what happens if I do the same with CPU; both fake workloads are sketched below.
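The slide code isn't in the transcript; a minimal sketch of the two fake workloads and the benchmark trio, with assumed names, task counts, and input sizes. The CPU benchmarks discussed next have the same shape, just with cpuTask instead of ioTask:

```go
package work_test

import (
	"crypto/md5"
	"runtime"
	"sync"
	"testing"
	"time"
)

const jobs = 100

var buf = make([]byte, 1<<20) // hypothetical input for the CPU-bound task

func ioTask()  { time.Sleep(time.Millisecond) } // fake I/O: just wait
func cpuTask() { md5.Sum(buf) }                 // fake CPU: just hash

// serial runs every task on the calling goroutine.
func serial(task func()) {
	for i := 0; i < jobs; i++ {
		task()
	}
}

// parallel distributes the tasks across n goroutines.
func parallel(task func(), n int) {
	var wg sync.WaitGroup
	wg.Add(n)
	for g := 0; g < n; g++ {
		go func() {
			defer wg.Done()
			for i := 0; i < jobs/n; i++ { // assumes n divides jobs evenly
				task()
			}
		}()
	}
	wg.Wait()
}

func BenchmarkIOSerial(b *testing.B) {
	for i := 0; i < b.N; i++ {
		serial(ioTask)
	}
}

func BenchmarkIONumCPU(b *testing.B) {
	for i := 0; i < b.N; i++ {
		parallel(ioTask, runtime.NumCPU())
	}
}

func BenchmarkIOPerJob(b *testing.B) {
	for i := 0; i < b.N; i++ {
		parallel(ioTask, jobs)
	}
}
```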
For the CPU case, I simulate some CPU load using an MD5 sum, with more or less the same approach as the fake I/O; the benchmark is exactly the same shape, using no goroutines, the number of CPUs, and the number of jobs. And here it gets interesting, because if this is a CPU-bound workload, using the number of CPUs is what gives you the best efficiency. You can see here that running one goroutine per job is even slower than running everything serially, and you actually get the worst of both worlds: plenty of allocations, plenty of memory consumption, plenty of time consumption, and you are not gaining anything. In the CPU-count case you are consuming more memory than the serial version, but you are getting better CPU performance, because you are basically spreading the job over all your physical CPUs, while the serial version does everything on a single core. So whenever you want to optimize using concurrency, you have to take into consideration what kind of workload you have. Is it a CPU workload? Is it I/O? Do you care about memory, do you care about CPU, what do you care about?

That is the whole idea. What I want to get across is that all of this is about measuring: measuring everything, doing all these benchmarks, doing these kinds of experiments to see whether you are actually getting a performance improvement, and iterating on that. That is the main idea. I showed some examples of how you can improve things, and some of them can be applied as general practice, like keeping constructors small, or sizing slices up front when you know the size, and things like that.

Some references. Efficient Go is a really, really interesting book if you are really interested in efficiency; Bartłomiej Płotka wrote it, and he is actually giving a talk here after the next one, which I am sure is going to be super interesting. There is the High Performance Go workshop from Dave Cheney; there is a lot of documentation from that workshop, and it is really interesting too. The go-perfbook is a good read as well. And the Ultimate Go course from Ardan Labs is also an interesting course, because it gives you a lot of the foundations and cares a lot about hardware sympathy and all that. All the images are Creative Commons, so I put the references here. Thank you. That's it.

Thank you.