Okay, thank you. So our next speaker is Jesús. He has talked with us a few times in the Go devroom about everything that goes on deep within the language, and today he's going to talk to us about what's going on in functions. A round of applause.

Okay. Hello, everybody. My name is Jesús, I'm a software engineer, and I'm going to talk about squeezing a Go function.

So, what is optimization? I think it's important to understand that optimization is not just being faster or consuming less memory; it depends on your needs. Is faster always better? Probably everybody will say yes, but it depends on whether you are looking for convenience or for something that lasts forever. In that case, the fastest option is not necessarily the best one. Optimizing is about knowing what you need and trying to address that.

It's important to optimize at the right level. You can buy the best car, you can get an F1 car, and it's not going to be fast if the road is a dirt track. So always try to optimize at the upper levels first, because the kinds of optimizations we are going to see in this talk are micro-optimizations, and they are probably not the first place you should be starting.

Optimize what you need, when you need it. It's not about taking a Go function and optimizing it forever, trying to make it run super efficiently and scratching out every single nanosecond, because at some point the bottleneck is no longer there. You have to find the bottleneck, optimize where the bottleneck is, and then look again to see whether the bottleneck is still there; if it isn't, you are over-optimizing that function without much gain. So take that into consideration: optimizing is an iterative cycle, and you need to keep moving and keep searching for the bottleneck.

And please, do not guess. Yes, I know everybody has instincts and all that, but guessing about performance is an awful idea, because so many things come into play that it's just impossible: the operating system, the compiler and its optimizations, maybe a noisy neighbor if you are in the cloud. All of that affects performance, so you are almost certainly not good at guessing about performance. Just measure everything. The important thing is to measure everything and work with that data. That is probably what the talk after the next one is about, so I suggest you go there too, because it promises to be very interesting.
So let's talk about benchmarks. The way you measure performance for micro-optimizations, micro-benchmarks, is through Go benchmarks. Benchmarking comes with the Go toolchain and looks like the testing framework, but focused on benchmarks. Here we have an example with two benchmarks, one for an MD5 sum and one for a SHA-256 sum; a reconstruction is sketched below. That's it: a function whose name starts with Benchmark, receiving a *testing.B argument, with a for loop inside. That is going to do all the work of giving you the numbers.

If I run this with `go test -bench=.`, the dot is a regular expression that matches everything; like go test's -run flag, you can use a regular expression to execute only certain benchmarks. And here you can see that the MD5 sum is roughly twice as fast per operation as SHA-256. So, is that important? It depends. If you need more security, MD5 is probably not the best option. Again, it depends on your needs.

Another interesting thing is allocations. One thing you may have heard about is counting allocations. Why is that important? Because every time we allocate something, and by allocation we mean an allocation on the heap, that allocation introduces overhead, and not only that, it adds more pressure on the garbage collector. That's why it's important to count allocations when you care about performance. If you are not worried about performance at that point, don't count allocations; it's not that important, and you are not going to gain a massive amount of performance there.

Okay, let's see an example. For the MD5 and SHA-256 sums we have zero allocations, so that data is not very useful to us. Let's use something else: let's open a file, thousands of times, and see how it goes; the second sketch below shows the shape of such a benchmark. Now I see that every single operation of opening a file, just opening the file, generates three allocations and consumes 120 bytes per operation. Interesting. So now you are measuring things.
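The slide code isn't captured in the transcript, but a minimal reconstruction of the two hashing benchmarks described might look like this (the input data is an assumption):

```go
package hash_test

import (
	"crypto/md5"
	"crypto/sha256"
	"testing"
)

// data stands in for whatever input the slides used.
var data = make([]byte, 1024)

func BenchmarkMD5Sum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		md5.Sum(data)
	}
}

func BenchmarkSHA256Sum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sha256.Sum256(data)
	}
}
```

Running `go test -bench=.` prints the ns/op column; adding `-benchmem` also reports B/op and allocs/op.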
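And a sketch of the file-opening benchmark; the path is an arbitrary choice, and b.ReportAllocs() makes the allocation columns show up even without the -benchmem flag:

```go
package open_test

import (
	"os"
	"testing"
)

func BenchmarkOpenFile(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		f, err := os.Open("/etc/hosts") // hypothetical path
		if err != nil {
			b.Fatal(err)
		}
		f.Close()
	}
}
```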
You are measuring how much time it takes, how much of that time goes into processing something or into allocating things, and how much memory goes where. So let's talk about profiling, because, well, normally you actually do the profiling first, to find your bottleneck, and then you write the benchmark to tune that bottleneck. But I'm playing with the fact that I already have the benchmark, so I'm going to do the profiling on top of the benchmark.

I execute the benchmark passing the memory-profile flag to generate a memory profile, and then I use the pprof tool, which lets me analyze that profile. In this case I just ask for text output, and that text output shows me the top consumers, of memory in this case. I can see there that 84% of the memory goes to os.newFile. Okay, it's that function, but I need more information. I actually quite like this output, but if you don't, you can for example ask for SVG, and you get something very visual where it's kind of obvious where the bottleneck is; in this case, again, it's os.newFile. If I go back to pprof and instead list that function, I can see where the memory is going line by line, and here I can see that line 127 of file_unix.go is where the memory is consumed. You see 74 megabytes there because pprof is counting and aggregating all the allocations; each individual operation is still only consuming 120 bytes.

The same works with a CPU profile. In this case, most of the CPU consumption is in syscall.Syscall6. Looking at the SVG, this time it's more scattered, the CPU is consumed in many more places, but syscall.Syscall6 is still the biggest one. So I list that function and I see assembly code. You are probably not going to optimize this function any further, so this is probably not the place to be looking for optimizations anyway, but it's an example of getting to the root cause through profiling.
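The exact invocations aren't visible in the transcript; the flow described corresponds roughly to commands like these (the profile file names are arbitrary):

```
go test -bench=. -memprofile=mem.prof
go tool pprof -top mem.prof              # text output: top memory consumers
go tool pprof -svg mem.prof > mem.svg    # call graph as an image
go tool pprof -list=newFile mem.prof     # line-by-line breakdown of a function

go test -bench=. -cpuprofile=cpu.prof
go tool pprof -top cpu.prof
```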
Okay, this talk is going to proceed by example. I'm going to show you some optimizations, more to show you the process than the specific optimization. I hope you learn something in between, but it's mostly about the process, okay?

One of the things you can do is reduce CPU usage. This is a kind of silly example: you have a find function that takes a needle and a haystack, goes through the haystack searching for that needle, and gives you the result. It loops over the whole string slice even after finding a match. First I'm going to do a benchmark: I generate a lot of strings and benchmark a search for something around the middle, not exactly in the middle, but around there, and the benchmark says it takes nearly 300 nanoseconds. If I just return early, which is a kind of silly optimization, nothing super smart, I save basically almost half of the time. The benchmark is doing something really silly and the gain can vary depending on the input data, but the optimization is simply doing less, and doing less is one of the best ways to optimize. A sketch of both versions follows below.

Reducing allocations: one of the classic examples of reducing allocations is when you are dealing with slices. This is a common way of constructing a slice: I create a slice, I loop, and I start appending things to it. Okay, fine. I benchmark that, and it takes 39 allocations and around 41 megabytes per operation. That sounds like a lot. Okay, let's do this instead: let's build the slice with an initial size of a million, and this time I'm just setting each element. The final result is exactly the same, but now we have one allocation and we have consumed only one megabyte; and actually, it takes around 800 microseconds where before you had around 10 milliseconds, so it saves a lot of CPU time too. And you can squeeze it even more: if you know the exact size you want at compile time, you can build an array, which is actually faster than any slice. With an array I'm now doing zero heap allocations, it goes on the stack or into the binary somehow, but it's not consuming my heap, and this time it's approximately 300 microseconds. So that's an interesting option when you know that information at compile time. See the second sketch below.

Okay, another thing is packing. If you are concerned about memory, say you build this struct: I have a boolean, I have a float, I have an int32. The Go compiler is going to align my struct to make it more efficient and work better with the CPU, and in this case it adds seven bytes of padding between the boolean and the float, and four bytes after the integer, to keep everything aligned. The third sketch below shows the sizes.
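The find code itself isn't in the transcript; a sketch of the before and after, with hypothetical names:

```go
package search

// find scans the whole haystack even after a match is found.
func find(needle string, haystack []string) bool {
	found := false
	for _, s := range haystack {
		if s == needle {
			found = true
		}
	}
	return found
}

// findEarly returns as soon as the needle appears, roughly halving
// the average work for a needle placed around the middle.
func findEarly(needle string, haystack []string) bool {
	for _, s := range haystack {
		if s == needle {
			return true
		}
	}
	return false
}
```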
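For the slice example, a minimal sketch of the three variants described; the element type is an assumption:

```go
package build

const n = 1_000_000

// Growing with append reallocates the backing array as it grows.
func buildAppend() []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// Preallocating does a single allocation up front.
func buildPrealloc() []int {
	s := make([]int, n)
	for i := range s {
		s[i] = i
	}
	return s
}

// With the size known at compile time, an array needs no heap
// allocation at all; note that returning it copies the values.
func buildArray() [n]int {
	var a [n]int
	for i := range a {
		a[i] = i
	}
	return a
}
```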
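And a sketch of the padding effect; the field names are made up, and the sizes shown hold on 64-bit platforms:

```go
package main

import (
	"fmt"
	"unsafe"
)

// unpacked: 1 (bool) + 7 padding + 8 (float64) + 4 (int32) + 4 padding = 24 bytes.
type unpacked struct {
	b bool
	f float64
	i int32
}

// packed: largest fields first leaves only 3 trailing padding bytes = 16 bytes.
type packed struct {
	f float64
	i int32
	b bool
}

func main() {
	fmt.Println(unsafe.Sizeof(unpacked{})) // 24
	fmt.Println(unsafe.Sizeof(packed{}))   // 16
}
```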
Okay, I build and initialize a slice of a million of these structs: that is one allocation, because that's what the slice construction does, and I'm consuming around 24 megabytes per operation. If I just reorganize the struct, in this case putting the float at the beginning, then the int32, then the boolean, the compiler only adds three bytes of padding, so the whole struct is smaller in memory, and now it's 16 megabytes per operation. This kind of optimization is not going to save your day if you are just creating a few structs, but if you are creating millions of instances of a struct, it can be a significant amount of memory.

Function inlining. Inlining is something the Go compiler does for us: it takes a function and replaces any call to that function with the code generated by the function. I'm going to show you a very dumb example, sketched below: one version where I explicitly prevent inlining, and an inlined version that the compiler inlines because it's simple enough. Executing both, I save a whole nanosecond. So yeah, it's not a great optimization on its own, to be honest; you probably don't care about that nanosecond. But we are going to see later why inlining is important, and it's not because of the nanosecond.

I'm going to talk now about escape analysis. Escape analysis is another thing the compiler does for us: it analyzes our variables and decides when a variable escapes the context of the stack. When a value has to be accessible beyond its stack frame, it can no longer live on the stack, so it escapes to the heap. That is what generates allocations, and we have seen that allocations have certain implications. So let's see an example, also sketched below: one function returns a pointer, which escapes and generates an allocation; another returns a value, which is just copied into the caller's stack frame, so it generates no allocations. We can see that in the benchmark: the first version has one allocation of 8 bytes, the second has zero allocations, and the pointer version takes ten times longer, which in this case is around 12 nanoseconds. That is not a lot, but everything adds up in the end, especially when you are calling things millions of times.

Okay, and one interesting thing is escape analysis plus inlining. Why? Well, imagine this situation: you have a struct, and a constructor that instantiates that struct. The constructor returns a pointer and does all the setup it needs. Okay, great; it is generating three allocations and consuming 56 bytes per operation. What happens if I just move the logic of that initialization into a different function? If we do that, suddenly the constructor is simple enough to be inlined, and because it's inlined, the returned value no longer escapes, so that allocation is no longer needed. Something that simple lets you reduce the number of allocations of certain types whenever you have a constructor. So what I would suggest is: keep your constructors as simple as possible, and if you have complex setup logic, do it in an initialization function, well, if that doesn't hurt readability. Let's see: here we now have two allocations and 32 bytes per operation, and you are saving around 50 nanoseconds every time you instantiate it, so this is a good chunk. A sketch follows below.
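A sketch of the inlining comparison; //go:noinline is the compiler directive that suppresses inlining, the sink variable keeps the calls from being optimized away, and the function names are mine:

```go
package inline_test

import "testing"

var sink int // prevents dead-code elimination of the results

//go:noinline
func addNoInline(a, b int) int { return a + b }

// Simple enough for the compiler to inline at the call site.
func add(a, b int) int { return a + b }

func BenchmarkNoInline(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = addNoInline(i, i)
	}
}

func BenchmarkInline(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = add(i, i)
	}
}
```

Building with `go build -gcflags=-m` reports which functions the compiler decides to inline.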
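A sketch of the pointer-versus-value comparison; the 8-byte struct matches the numbers quoted, the rest is assumed:

```go
package escape_test

import "testing"

type value struct{ n int64 } // 8 bytes, matching the talk's numbers

//go:noinline
func byPointer() *value { return &value{n: 42} } // escapes to the heap

//go:noinline
func byValue() value { return value{n: 42} } // copied to the caller's stack

var (
	sinkP *value
	sinkV value
)

func BenchmarkByPointer(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkP = byPointer()
	}
}

func BenchmarkByValue(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkV = byValue()
	}
}
```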
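And a sketch of the constructor refactor; the document type and its fields are hypothetical, and the exact allocation counts depend on what the initialization actually does:

```go
package documents

type document struct {
	id   string
	tags []string
}

// Before: enough work in the constructor to block inlining, so the
// returned *document always escapes to the heap.
func newDocumentBig() *document {
	d := &document{}
	d.id = "unnamed"                // hypothetical setup work
	d.tags = make([]string, 0, 4)   // this allocation happens either way
	return d
}

// After: a trivial constructor the compiler can inline; once inlined
// at the call site, &document{} may not escape and can stay on the stack.
func newDocument() *document {
	return &document{}
}

// init carries the complex setup logic instead.
func (d *document) init() {
	d.id = "unnamed"
	d.tags = make([]string, 0, 4)
}
```

Callers then write `d := newDocument(); d.init()`; if d never escapes the caller, the struct itself can be stack-allocated.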
Okay, well, optimization is sometimes a matter of trade-offs. Sometimes you can simply do less: fewer allocations, less CPU work, less garbage-collector pressure, all of that can be done. But sometimes it's not about doing less, it's about consuming a different kind of resource: I care less about memory and more about CPU, or the other way around. Concurrency is one of the cases where you need to decide what you want to consume, because goroutines are really cheap, but they are not free at all.

So let's see an example with I/O. These are two functions I created: fakeIO, which simulates a hundred I/O operations with time.Sleep, and fakeIOParallel, which receives a number of goroutines and does basically the same, but distributing those hundred cycles across the goroutines. And I built a benchmark that runs this using three different approaches: the serial one, with no concurrency; a concurrent one using the number of CPUs in my machine; and a concurrent one using the number of tasks I have, one goroutine per job. Because this is I/O, the result is that if I create one goroutine per job, the bytes per operation and the number of allocations spike, but the time consumed is way lower: the benchmark manages to execute this function around a hundred times with one goroutine per job, and only about twelve times with one goroutine per CPU, because this is I/O. So let's see what happens if I do the same with CPU; both fake workloads are sketched below.
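The slide code isn't in the transcript; a minimal sketch of the two fake workloads and the benchmark trio, with assumed names, task counts, and input sizes. The CPU benchmarks discussed next have the same shape, just with cpuTask instead of ioTask:

```go
package work_test

import (
	"crypto/md5"
	"runtime"
	"sync"
	"testing"
	"time"
)

const jobs = 100

var buf = make([]byte, 1<<20) // hypothetical input for the CPU-bound task

func ioTask()  { time.Sleep(time.Millisecond) } // fake I/O: just wait
func cpuTask() { md5.Sum(buf) }                 // fake CPU: just hash

// serial runs every task on the calling goroutine.
func serial(task func()) {
	for i := 0; i < jobs; i++ {
		task()
	}
}

// parallel distributes the tasks across n goroutines.
func parallel(task func(), n int) {
	var wg sync.WaitGroup
	wg.Add(n)
	for g := 0; g < n; g++ {
		go func() {
			defer wg.Done()
			for i := 0; i < jobs/n; i++ { // assumes n divides jobs evenly
				task()
			}
		}()
	}
	wg.Wait()
}

func BenchmarkIOSerial(b *testing.B) {
	for i := 0; i < b.N; i++ {
		serial(ioTask)
	}
}

func BenchmarkIONumCPU(b *testing.B) {
	for i := 0; i < b.N; i++ {
		parallel(ioTask, runtime.NumCPU())
	}
}

func BenchmarkIOPerJob(b *testing.B) {
	for i := 0; i < b.N; i++ {
		parallel(ioTask, jobs)
	}
}
```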
For the CPU case, I simulate some CPU load using an MD5 sum, with more or less the same approach as the fake I/O; the benchmark is exactly the same shape, using no goroutines, the number of CPUs, and the number of jobs. And here it gets interesting, because if this is a CPU-bound workload, using the number of CPUs is what gives you the best efficiency. You can see here that running one goroutine per job is even slower than running everything serially, and you actually get the worst of both worlds: plenty of allocations, plenty of memory consumption, plenty of time consumption, and you are not gaining anything. In the CPU-count case you are consuming more memory than the serial version, but you are getting better CPU performance, because you are basically spreading the job over all your physical CPUs, while the serial version does everything on a single core. So whenever you want to optimize using concurrency, you have to take into consideration what kind of workload you have. Is it a CPU workload? Is it I/O? Do you care about memory, do you care about CPU, what do you care about?

That is the whole idea. What I want to get across is that all of this is about measuring: measuring everything, doing all these benchmarks, doing these kinds of experiments to see whether you are actually getting a performance improvement, and iterating on that. That is the main idea. I showed some examples of how you can improve things, and some of them can be applied as general practice, like keeping constructors small, or sizing slices up front when you know the size, and things like that.

Some references. Efficient Go is a really, really interesting book if you are really interested in efficiency; Bartłomiej Płotka wrote it, and he is actually giving a talk here after the next one, which I am sure is going to be super interesting. There is the High Performance Go workshop from Dave Cheney; there is a lot of documentation from that workshop, and it is really interesting too. The go-perfbook is a good read as well. And the Ultimate Go course from Ardan Labs is also an interesting course, because it gives you a lot of the foundations and cares a lot about hardware sympathy and all that. All the images are Creative Commons, so I put the references here. Thank you. That's it.

Thank you.