Hello, everybody, and welcome to the second talk of the Python Dev Room, where Vadim Markovtsev will be talking to you about accelerating object serialization using constraints. I'm really happy to welcome him here, and I hope you will enjoy this talk.

Thank you.

So let's start with a short bio. I'm a backend developer, and I'm also a machine learning engineer. I used to be a Google Developer Expert in machine learning before COVID. I've coded in quite a few languages in the last 12 months, mostly Python and Go, with a bit of C/C++. And I work at Athenian, a startup that builds a software-as-a-service product for engineering leaders.

The talk will be divided into two parts. The first will be about some custom binary serialization that I had to do in my daily job, and the second will be about speeding up JSON dumps.

So the first part is about how I coded something that works many, many times faster than pickle. Why did I have to do that? Apparently because I had a use case that didn't fit well into regular pickling, into regular serialization of pandas DataFrames. It's not necessarily a best practice to materialize a huge pandas DataFrame while you're serving a request, but nevertheless, I had to do that. And it was really big: a few megabytes at least. I noticed that, according to the traces, pickling it just took too much time, and we really needed to put it into a distributed cache; otherwise we would spend extra time recomputing it on every request.

That DataFrame was quite special. I mean, it was not the usual bunch of integers and floating points; no, it was quite complex. It contained strings; it contained datetime columns, which could be NumPy datetimes or regular Python objects; it contained nested lists and dicts, and even nested NumPy arrays. So it was quite complex, and this was mostly the reason why regular serialization, Arrow and everything else, worked too slowly on it.
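To make that concrete, here is a small, purely illustrative DataFrame of that shape (invented data, not the actual production frame):

```python
import numpy as np
import pandas as pd
from datetime import datetime, timezone

# Hypothetical frame mixing the column kinds from the talk: strings,
# datetimes (both NumPy-native and Python objects), nested lists and
# dicts, and nested NumPy arrays -- the object-dtype cases that make
# generic serializers slow.
df = pd.DataFrame({
    "repository": ["org/alpha", "org/beta"],
    "merged_np": np.array(["2023-01-01", "2023-01-02"], dtype="datetime64[ns]"),
    "merged_py": [datetime(2023, 1, 1, tzinfo=timezone.utc),
                  datetime(2023, 1, 2, tzinfo=timezone.utc)],
    "labels": [["bug", "ci"], ["feature"]],
    "stats": [{"additions": 10}, {"additions": 20}],
    "embedding": [np.zeros(4), np.ones(4)],
})
```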
So I came up with a custom binary format. Although it's not really open, it is public, and everyone can study the source code. It's not universal: it only supports the types inside the pandas DataFrames that we had to serialize at work. It's not backward compatible. It's not portable at all: it doesn't cater to different CPU architectures, it always assumes Linux, and it always assumes CPython 3.11. This is quite important, and I'll talk about it a little bit later. Generally, this is a really, really bad idea, and you only do it when you don't have any other options and you need to squeeze a lot of performance out of the backend. However, on the bright side, the code is quite compact. Cython, because performance. It's a single-pass serializer, really, really, really fast. And it also releases the GIL. This is something that regular pickle or Arrow cannot do, because of the constraint that they have to be universal and pickle everything, any object. As soon as you have an assumption about what kinds of objects you can serialize, you can code it in Cython in such a way that you release the GIL, and it works even faster and also allows other code to execute in parallel.

Releasing the GIL is probably the hardest feature of this serializer, because it was more labor-intensive than I initially thought. You see, Cython offers wrappers for a lot of the CPython C API, and they are great. However, as soon as you release the GIL, you cannot use them; Cython simply doesn't compile them. And this is a good thing, because to use these wrappers without the GIL, you need to be absolutely sure that you are doing the right thing: that you are not allocating on the heap, that you are not writing to any Python objects, and that those Python objects are not mutated somewhere else. This is all true for our backend, but it can be different in other use cases. So this is what happens. And what if there is no C API at all, or some C API wrapper is missing? You have to reimplement it from scratch in Cython, and this is what we had to do quite a lot.

A final note: PyPy, for example, has a garbage collector that can relocate objects from time to time, so the pointer to an object can change. This is why this serializer just doesn't work with PyPy, even in theory, because it always assumes that once you have a pointer to a Python object, it never changes.

Anyway, this is how the high-level serialization of pandas works. This is not really specific to my serializer or to pickling; this is just how pandas works. Every pandas DataFrame, currently in 1.x, has a block manager that maps column names to blocks. Every block holds a single data type, and it is a two-dimensional NumPy array underneath. This 2D NumPy array doesn't know anything about what the columns are or how they are named; it's just the internal storage, let's say. Whenever you reach for a column in the DataFrame, you go through the block manager to access it and perform operations. It works fast, yes, and it's also memory-efficient: every operation on a single column executes on a NumPy array underneath, so it's really efficient. It also allows us to read the pandas DataFrame and serialize it as a bunch of NumPy arrays without the GIL, which is great for us. One thing that we had to think through properly is how we serialize the object data type in NumPy arrays. We, again, don't support all use cases, only those that we know exist in our backend.
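A minimal sketch of the approach, with an illustrative type dispatch, and without the nogil machinery of the real code (which forbids conveniences like isinstance and bytearray):

```cython
# cython: language_level=3
# Illustrative sketch, not the talk's actual code.
cimport numpy as cnp
from cpython.ref cimport PyObject

cnp.import_array()

def serialize_object_column(cnp.ndarray column, bytearray out):
    # PyArray_DATA returns the raw pointer to the array's storage;
    # for an object array, that storage is an array of PyObject*.
    cdef PyObject **data = <PyObject **> cnp.PyArray_DATA(column)
    cdef Py_ssize_t i
    for i in range(cnp.PyArray_DIM(column, 0)):
        obj = <object> data[i]  # element i (the cast adds a reference)
        # Dispatch on the closed set of types known to occur in the frame.
        if isinstance(obj, str):
            out += obj.encode()
        elif isinstance(obj, int):
            out += obj.to_bytes(8, "little", signed=True)
        else:
            raise TypeError(f"unsupported object type: {type(obj)}")
```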
This Cython pseudo-code shows the general idea of how we do this serialization. There is a function called PyArray_DATA that returns a raw pointer to the underlying data of the NumPy array. Then you iterate over the length; we only support one-dimensional arrays so far, i.e., single columns, and we serialize each PyObject independently. This PyObject can in theory be arbitrary, but, again, in our backend there is a predefined list of things that can be stored: a string, a datetime, an integer, and so on.

When we have to iterate standard-library containers such as a list or a dictionary, it turned out, again, to be quite simple. For a list, you use PyList_GET_SIZE and PyList_GET_ITEM; for a dict, you use the PyDict_Next iterator. This is great. If you didn't have to release the GIL, you would just use the list or dict types in Cython, and Cython would generate somewhat similar, also efficient code. But, again, since we release the GIL, we have to do the heavy lifting from scratch, let's say.

For serializing integers and floats, you just convert the PyObject to a C type, like double or long. If it's a NumPy scalar (a NumPy scalar is usually a structure), you use the special NumPy C API for extracting the scalar value, and you just memory-copy it to your output stream. This is one reason why we are not portable across CPUs: the order of bytes in an integer, the endianness, can be anything. For Intel it's little-endian; for ARM it can be little or big. Since our backend always runs on the same kind of CPU, and we are really sure about that, we can do these things without caring about endianness and converting to some other order, network order, and so on. This is also a reason why it works so fast.

Serializing a string is also nothing special, but I have to say a few words about how strings are stored in CPython internally. It's quite smart, and I'm honestly really impressed by how CPython does it. It stores strings in three different ways: in a one-byte encoding, a two-byte encoding, or a four-byte encoding, and it dynamically chooses the encoding based on the largest code point that has to be stored in the string. So, let's say, if you only have ASCII characters, it chooses the one-byte encoding and you're super memory-efficient. If you have some crazy emoji, then it chooses the four-byte encoding, and yes, for the other characters it's then not so memory-efficient. At the same time, when you address a string by an index, it works super fast, because every character is guaranteed to occupy the same number of bytes underneath. Anyway, to copy a string to the binary output format, you just take the raw pointer to the string contents, and for the size, you get the number of bytes per character and multiply it by the length of the string.
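A hedged sketch of that copy, with the PEP 393 accessors declared directly from Python.h (illustrative only; the real, GIL-free code cannot write into a bytearray like this):

```cython
# cython: language_level=3
# PyUnicode_KIND reports bytes per character (1, 2 or 4); the raw
# payload size is simply kind * length. Not the talk's actual code.
from libc.string cimport memcpy

cdef extern from "Python.h":
    unsigned int PyUnicode_KIND(object s)
    void *PyUnicode_DATA(object s)
    Py_ssize_t PyUnicode_GET_LENGTH(object s)

def copy_string_payload(str s, bytearray out):
    cdef Py_ssize_t nbytes = PyUnicode_KIND(s) * PyUnicode_GET_LENGTH(s)
    cdef Py_ssize_t offset = len(out)
    out.extend(b"\x00" * nbytes)   # reserve space in the output stream
    cdef char *dst = out           # bytearray exposes a writable buffer
    memcpy(dst + offset, PyUnicode_DATA(s), nbytes)
```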
So, again, nothing special in theory, but you have to be aware of the internals.

To serialize datetime and timedelta, CPython provides a special C API as well. It's worth mentioning that the internal representation of a datetime in Python is not a timestamp; it's not a single integer like in other programming languages. It's really a struct that contains the year, the month, the day, the seconds, and so on. So you just use these getters and serialize each integer to the output. Works fast. The same for timedelta: you get the days and the seconds.

All of that allowed us to speed up pandas DataFrame serialization by roughly 20x. And, of course, the speed-up can be really different depending on the nature of the DataFrame. If it only contains integers and floating points, the speed-up will be marginal; pickle actually works quite fast for strongly typed NumPy arrays. So, does this mean that we managed to beat pickle? Well, yes and no, because pickle is universal and can handle everything at once, and it's way harder to be fast at all possible use cases than to be super fast at one particular use case. So I don't know how to say it: maybe yes, maybe no.

However, there is also an elephant in the room. Can you see it? So far, I have only talked about how fast we were at serializing objects, at serializing DataFrames. But supposedly that's not all you do: you also need to deserialize them back. Otherwise, it would be like that joke about the compressor that compresses everything to one byte but just can't decompress it back. And deserializing from this format is also complex, but since I don't have much time to talk about it, I'd rather leave it for next time. In brief, that is the place where I had to take some extra steps to stay performant. So, yeah.

The second part of the talk: how we solved a similar problem, but for JSON serialization. In the backend, this is what we had: a lot of models, like a dataclass with some fields inside — strings, floating points, nested lists, nested models — and it could be several levels deep. If you work with FastAPI, this may seem familiar; if I'm not mistaken, Pydantic offers a similar structure. So it's not really about using dataclasses versus something else; a lot of us have seen this. And the problem was that we had thousands of such objects, and we needed to cache them and serve pagination. Again, this is probably an anti-pattern in a way, because ideally you would push all the filters down in the backend so that you don't have to materialize the whole list of objects to return. But in our case, we really had to do that.
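For illustration, a hypothetical model of that shape (invented names, not the talk's actual models):

```python
from dataclasses import dataclass

# Invented names; the point is the shape: typed fields, nested models,
# several levels deep. slots=True (Python 3.10+) stores the fields as
# fixed-offset pointers, which matters for the trick later in the talk.
@dataclass(slots=True)
class DeploymentStats:
    success_ratio: float
    environments: list[str]

@dataclass(slots=True)
class PullRequestModel:
    title: str
    additions: int
    labels: list[str]
    stats: DeploymentStats
```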
So we loaded all the models from cache; then we deserialized the bytes into the actual Python objects, these dataclasses; we took only those that corresponded to the requested page; and we converted them to a JSON string. Two things went wrong here: deserializing everything was super slow, and converting to JSON was also slow.

Our first attempt to fix it was actually lightweight: let's just pre-convert all the dataclasses to atomic objects, like dictionaries and lists. CPython is really, really, really fast at working with dicts and lists, so we assumed that would help. Then you just store those atomic, basic objects in the cache, and you return the ones requested by the pagination. That was for the first call, where you create the cache; on the next call, you just serve the objects corresponding to the page.

However, it was still slow. It was slow because converting dataclasses to basic objects via asdict was really, painfully slow, and serializing the basic objects to JSON was also kind of slow. And for the subsequent calls, we had problems with deserializing all the objects; it was not great. Materializing a lot of Python objects requires a lot of refcount increments, and that just cannot work fast. There is no way it can work fast.

So we had to be inventive and add a custom to-JSON function that pre-serializes all the objects to JSON and also produces an index of where each object begins. This table of contents can then be used in the future: when pagination calls arrive, you just select the part of this huge JSON blob corresponding to the page and return it. Since you only work with strings, this works fast. However, it all hinges on the performance of the function that converts the dataclasses to this huge JSON string with a table of contents. This function had to be implemented from scratch, unfortunately, because json.dumps doesn't produce a table of contents and cannot do that. Also, to convert dataclasses to JSON, you would still need to convert them to basic objects first using asdict; json.dumps cannot handle dataclasses directly. So the only way for us to move on was to really write our own serializer.

This serializer works from a so-called serialization specification. The thing is, these dataclasses are typed, and you can scan those types to build a plan for how to perform the serialization: how to iterate the objects and how to write them to JSON. Apart from that, we had to implement iterating lists and dicts — well, we have kind of already covered that.
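A minimal pure-Python sketch of that table-of-contents trick (hypothetical helper names; the real implementation is the spec-driven Cython serializer described here):

```python
import json

def to_json_with_toc(objects):
    """Serialize objects into one JSON array string, recording the offset
    at which each element begins, so later calls can slice pages out of
    the blob without materializing any Python objects."""
    chunks, offsets, pos = [], [], 1            # position 1: right after '['
    for obj in objects:
        piece = json.dumps(obj, separators=(",", ":"))
        offsets.append(pos)
        pos += len(piece) + 1                   # +1 for the ',' (or final ']')
        chunks.append(piece)
    return "[" + ",".join(chunks) + "]", offsets

def json_page(blob, offsets, start, stop):
    """Return items [start:stop) as a JSON array via pure string slicing."""
    end = offsets[stop] - 1 if stop < len(offsets) else len(blob) - 1
    return "[" + blob[offsets[start]:end] + "]"

blob, toc = to_json_with_toc([{"id": i} for i in range(10_000)])
print(json_page(blob, toc, 100, 103))   # '[{"id":100},{"id":101},{"id":102}]'
```

Serving a page then never touches Python objects again: it's a string slice plus two brackets.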
Back to the pieces we had to implement. Converting integers and floating points to a string is really basic — int to str, float to str, and so on. We just don't think about it when we work with them in Python: you convert them to a string, problem solved. However, since that conversion works with the heap and touches the internals of Python objects, you cannot use it without the GIL, and you have to reimplement it from scratch. The same goes for datetime. Finally, for strings, you need to escape them: if you have a double quote inside a string, you need to escape it, and the same goes for newlines. And since JSON is UTF-8, while the internal binary encoding of a Python string can be one, two, or four bytes per character, we also need to do this conversion ourselves. So this is quite interesting.

The main trick of the serializer is this: if you have slots in your dataclass, those slots are stored as pointers inside the PyObject structure. You take the type of the dataclass, you get the slot members through the C API, and each member carries the byte offset at which the pointer lives. You just read these pointers and you serialize the objects. The serialization spec is therefore recursive: you have the data type that you need to convert to JSON, and you have a list of nested specifications, each with its slot offset. We only support predefined types — lists, floating points, strings, boolean values, datetimes — not everything. We are not universal, we have constraints, but that is what lets us go faster. Oh, the code listing didn't load for some reason; anyway, I don't have time for that.

This serializer turned out to be 100x faster — really, really, really fast, so fast that we just don't see it in our traces and profiles anymore. So the goal was achieved. And it can be improved in the future by pre-computing the serialization spec inside a dataclass member.

The final advice: doing less is doing more. I think it's also mentioned in the Zen of Python, probably. Know your tools: it's always a great idea to know your tools and learn the internals, how they work, because it will allow you to write more performant programs. Going low-level once you have learned the internals is your superweapon that allows you to do many things; you can break the rules and you can do magic. However, you should only do that if you've got your architecture right. Otherwise, it's just stupid: you optimize a place that shouldn't have been optimized in the first place. And also, with great power come consequences, because when we release the GIL, we cannot use almost any of the standard library, and we have to implement crazy things like UTF conversions. This is annoying, and you need to be prepared for that.

Thanks.