[00:00.000 --> 00:13.480] Let's welcome Pedro Holanda for his talk on DuckDB, and a magnificent snake duck.
[00:13.480 --> 00:19.720] Yeah, you can be surprised at what you can find as a rubber duck these days,
[00:19.720 --> 00:20.720] you know.
[00:20.720 --> 00:22.920] All right, so I'm Pedro Holanda.
[00:22.920 --> 00:28.080] I am one of the main contributors to the DuckDB project, which is an open-source database
[00:28.080 --> 00:32.680] system, and I'm also the COO of DuckDB Labs.
[00:32.680 --> 00:36.640] And today I'm going to be talking a little bit about how DuckDB can bring analytical
[00:36.640 --> 00:40.480] SQL power directly into your Python shell.
[00:40.480 --> 00:44.920] So to give you a little bit of an idea of what this talk looks like, I'm going to start
[00:44.920 --> 00:46.640] with what DuckDB is.
[00:46.640 --> 00:50.120] So, I'm here talking about one more database system.
[00:50.120 --> 00:54.360] I want to convince you that we actually needed to build one more database system, because
[00:54.360 --> 00:57.560] the other ones didn't solve the problems we had.
[00:57.560 --> 01:03.760] And then I'm going to go over the main characteristics of DuckDB, so what actually makes it special.
[01:03.760 --> 01:08.240] Then I'm going to go over DuckDB in Python land, so how DuckDB integrates into the Python
[01:08.240 --> 01:09.240] ecosystem.
[01:09.240 --> 01:12.760] I'm going to do a little demo.
[01:12.760 --> 01:18.160] The basic idea is that we're going to use the infamous New York City taxi dataset,
[01:18.160 --> 01:24.360] and we're going to try to do some estimation of fare costs, and we're going to use DuckDB,
[01:24.360 --> 01:28.360] Pandas and PySpark, just to see a couple of the differences in the things I'm going
[01:28.360 --> 01:30.120] to be talking about.
[01:30.120 --> 01:32.440] And then some summary of the talk.
[01:32.440 --> 01:33.440] So what is DuckDB?
[01:33.440 --> 01:39.280] Well, DuckDB was actually born at CWI, which is the research center for mathematics and
[01:39.280 --> 01:41.760] computer science in the Netherlands.
[01:41.760 --> 01:45.760] And what we saw there is that a lot of the projects, the PhD projects,
[01:45.760 --> 01:48.840] the master's projects, are very data-sciencey.
[01:48.840 --> 01:52.400] So usually you have a data science problem, and you want to throw a database management
[01:52.400 --> 01:56.160] system at the data science problem, because you're handling data.
[01:56.160 --> 02:02.080] So initially we were like, OK, we can probably use a database server, use a database connection,
[02:02.080 --> 02:07.800] and then just transfer the data from the relational database to your Python terminal, for example,
[02:07.800 --> 02:10.360] where your analytical tools are.
[02:10.360 --> 02:14.920] And it turns out that's quite a bad idea, because you are transferring a lot of data.
[02:14.920 --> 02:16.880] So that's pretty costly.
[02:16.880 --> 02:20.760] And then you're like, OK, this is really not solving our problem; can we draw inspiration
[02:20.760 --> 02:22.200] from somewhere else?
[02:22.200 --> 02:25.800] And then, of course, there is SQLite, the most famous database out there, or at least the
[02:25.800 --> 02:28.160] most used one.
[02:28.160 --> 02:33.600] And it has quite a nice property, which is being an embedded database system.
[02:33.600 --> 02:37.360] Being an embedded database system means it can run inside your Python process.
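A minimal sketch of what "embedded" means in practice; nothing here is from the talk's slides, it is just an illustration:

```python
# DuckDB runs inside the Python process itself: no server to start,
# no connection string to configure.
import duckdb

con = duckdb.connect()  # an in-memory database living in this process
print(con.execute("SELECT 42").fetchall())  # [(42,)]
```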
[02:37.360 --> 02:41.680] So you can eliminate this data transfer cost.
[02:41.680 --> 02:46.880] But SQLite comes with one design decision: it is a transactional database, so it's actually
[02:46.880 --> 02:52.720] super optimized for small updates, but it's not really optimized for analytics.
[02:52.720 --> 02:57.800] So we kind of wanted to do what SQLite does, in terms of being easy to use and eliminating this
[02:57.800 --> 03:03.160] data transfer cost, but focusing on analytical queries.
[03:03.160 --> 03:07.480] And that's kind of how the database was born, and that's also why we frame it as a SQLite
[03:07.480 --> 03:09.560] for analytics.
[03:09.560 --> 03:11.240] It also has a very simple installation.
[03:11.240 --> 03:14.760] So if you think about Python, you just do a pip install, and you're good.
[03:14.760 --> 03:16.880] It is embedded, so there's no server management.
[03:16.880 --> 03:21.120] So let's say you just want to, I don't know, query a Parquet file: in two lines of code
[03:21.120 --> 03:22.120] you can query it.
[03:22.120 --> 03:28.120] Like, there's no starting a server, there's no schema creation, the schema is inferred
[03:28.120 --> 03:31.240] from the object, so it's very easy, very fast.
[03:31.240 --> 03:36.320] And we also really focus on fast transfer between analytical languages and their tools,
[03:36.320 --> 03:38.840] like Python and R, and DuckDB.
[03:38.840 --> 03:45.600] DuckDB is currently in pre-release; I think the last version we released was 0.6, and 0.7
[03:45.600 --> 03:46.600] is coming up soon.
[03:46.600 --> 03:52.040] The web page has a little bit more detail about all the things that are in each
[03:52.040 --> 03:53.040] release.
[03:53.040 --> 03:59.200] All right, so I'm going to go over some of the main characteristics of DuckDB, particularly
[03:59.200 --> 04:03.400] the columnar data storage, and a little bit about compression.
[04:03.400 --> 04:07.720] I'm going to talk about vectorized execution, so these are all core database topics.
[04:07.720 --> 04:12.840] Actually, talking about the vectorized execution engine is going to be difficult, because
[04:12.840 --> 04:18.720] Professor Boncz is here and he actually created that, so I'll try to do it correctly.
[04:18.720 --> 04:25.160] Then a little bit about end-to-end query optimization, parallelism, and beyond-memory execution.
[04:25.160 --> 04:30.000] So, columnar data storage. Well, there are basically two ways you can do it:
[04:30.000 --> 04:35.160] a row store and a column store. As an example of a row store, we have SQLite, and the whole
[04:35.160 --> 04:42.920] idea is that you're storing your data consecutively in memory, per row.
[04:42.920 --> 04:46.600] So that basically means that if you want to fetch an individual row, that's very cheap,
[04:46.600 --> 04:52.200] because it's contiguous in memory; however, you always have to fetch all the columns.
[04:52.200 --> 04:57.360] But in analytical queries, usually you have very wide tables, and you just really want
[04:57.360 --> 05:00.240] to get a couple of these columns.
[05:00.240 --> 05:02.040] So what if you only need a few of the fields?
[05:02.040 --> 05:08.160] So in this example, what if you are just interested in the price of a product, but not the store
[05:08.160 --> 05:11.480] that sold it, right?
[05:11.480 --> 05:19.240] In a column store, your layout is such that the data of a column is stored consecutively
[05:19.240 --> 05:20.240] in memory.
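A minimal sketch of that two-lines-of-code claim; the file name is a placeholder, and the schema comes straight from the Parquet metadata. It also previews the columnar point: only the referenced columns are read from the file.

```python
import duckdb

# No server, no CREATE TABLE: the schema is inferred from the Parquet file,
# and only the 'price' column is actually read from disk.
print(duckdb.sql("SELECT price FROM 'products.parquet' WHERE price > 0").fetchall())
```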
[05:20.240 --> 05:24.400] So if you want to access just a couple of columns, you can actually have immense savings in
[05:24.400 --> 05:27.280] disk I/O and memory bandwidth.
[05:27.280 --> 05:31.360] So that's why this type of format is really optimized for analytics.
[05:31.360 --> 05:35.880] So to give you a more concrete example, let's say that we have a one-terabyte table with
[05:35.880 --> 05:38.040] 100 columns.
[05:38.040 --> 05:41.880] For simplicity, let's say all the columns have the same size, and we just require five
[05:41.880 --> 05:44.560] columns of the table in our analytical query.
[05:44.560 --> 05:49.520] So in a row store, like SQLite, reading this whole table, if you have a disk that does
[05:49.520 --> 05:52.600] around 100 megabytes per second, will take you three hours.
[05:52.600 --> 05:57.160] If you were using a column-store model, which is what Pandas and DuckDB do, for example,
[05:57.160 --> 06:01.520] reading these five columns from disk takes you eight minutes.
[06:01.520 --> 06:09.400] So there's a huge improvement just from picking the correct storage format for your workload.
[06:09.400 --> 06:10.400] Compression.
[06:10.400 --> 06:13.680] Well, I'm not going to go into a lot of detail about the compression algorithms we implement
[06:13.680 --> 06:19.760] in DuckDB, but what I can tell you is that because you have a column store, you're going to
[06:19.760 --> 06:25.880] have the data of a column contiguous in memory, which gives you a very good starting point
[06:25.880 --> 06:31.120] for compressing it, because usually the data within the same column is somewhat similar.
[06:31.120 --> 06:40.960] So you can apply cool things like RLE, Chimp for floating-point numbers, and FSST
[06:40.960 --> 06:41.960] for strings.
[06:41.960 --> 06:45.760] So you can start applying all these algorithms and really decrease the size of your data.
[06:45.760 --> 06:51.480] So in this table here, and I think this is from one year ago, one year
[06:51.480 --> 06:53.480] and a half,
[06:53.480 --> 06:55.520] version 0.2.8 of DuckDB:
[06:55.520 --> 06:59.800] we had no compression at that point, and then a year and a half later, we actually managed
[06:59.800 --> 07:05.720] to implement all these things, which got us up to five times better compression; for example,
[07:05.720 --> 07:12.600] 3.18 times better compression on the taxi dataset that I'm going to be using later.
[07:12.600 --> 07:14.000] And why is compression so important?
[07:14.000 --> 07:18.960] Well, if we go back again to the same example, where we were reading our five columns, and
[07:18.960 --> 07:23.840] it was costing us eight minutes to read them from disk because of the storage format: if
[07:23.840 --> 07:29.000] we compress these columns, we suddenly don't have to read 50 gigabytes anymore, right?
[07:29.000 --> 07:30.120] You read less.
[07:30.120 --> 07:33.440] And then, of course, if you apply the best case from what I showed you in the last
[07:33.440 --> 07:40.560] table, five times, that read cost drops to one minute and 40 seconds.
[07:40.560 --> 07:47.280] Query execution. Well, there are three ways of doing query execution.
[07:47.280 --> 07:49.680] There's actually one more, but it's not in the slides.
[07:49.680 --> 07:54.080] So SQLite uses tuple-at-a-time processing, which means that you process one row at a
[07:54.080 --> 07:55.080] time.
[07:55.080 --> 07:58.680] Pandas uses column-at-a-time processing, which means that you process one column at
[07:58.680 --> 07:59.680] a time.
[07:59.680 --> 08:05.120] And DuckDB uses kind of a mix of both, vectorized processing, a technique developed by Peter,
[08:05.120 --> 08:09.720] where you process batches of a column at a time.
[08:09.720 --> 08:13.800] So basically, the tuple-at-a-time model of SQLite was optimized for a time when
[08:13.800 --> 08:15.720] computers didn't have a lot of memory.
[08:15.720 --> 08:20.760] It needs very little memory, because you only really need to keep one row in memory
[08:20.760 --> 08:23.840] throughout your whole query plan.
[08:23.840 --> 08:28.560] So when memory was expensive, that's what you could do, but it comes with a high CPU overhead
[08:28.560 --> 08:34.400] per tuple, because you're constantly resetting your caches; you don't have any cache-conscious
[08:34.400 --> 08:40.000] algorithm running over that piece of data up to the production of your query result.
[08:40.000 --> 08:43.560] If you go to column-at-a-time, which is what Pandas uses, this already brings
[08:43.560 --> 08:47.320] better CPU utilization, and it allows for SIMD.
[08:47.320 --> 08:52.040] But it comes with the cost of materializing large intermediates in memory.
[08:52.040 --> 08:56.040] It basically means that you need the whole column in memory at that point in the plan
[08:56.040 --> 08:57.360] for that operator.
[08:57.360 --> 09:01.760] And of course, the intermediates can be gigabytes each, so that's pretty problematic when the
[09:01.760 --> 09:03.080] data sizes are large.
[09:03.080 --> 09:06.760] And that's why you see, for example, that with Pandas, if your data doesn't fit in memory,
[09:06.760 --> 09:07.760] what happens?
[09:07.760 --> 09:10.440] It crashes.
[09:10.440 --> 09:15.640] And then if you go to vectorized query processing, it's actually optimized for CPU
[09:15.640 --> 09:20.200] cache locality, you can do SIMD instructions and pipelining, and the whole idea is that your
[09:20.200 --> 09:24.960] intermediates are actually going to fit in the L1 cache.
[09:24.960 --> 09:28.480] So basically you're going to be paying a latency of about one nanosecond to access
[09:28.480 --> 09:32.880] your data throughout your query plan, instead of paying the latency of main memory, which
[09:32.880 --> 09:37.280] is the case for a column-at-a-time database, and that's around 100 nanoseconds.
[09:37.280 --> 09:41.840] It seems like a small difference, but when you're constantly executing this, it really
[09:41.840 --> 09:46.320] becomes a bottleneck.
[09:46.320 --> 09:49.400] And end-to-end query optimization is, of course, something that we have in DuckDB, so
[09:49.400 --> 09:54.240] we have stuff like expression rewriting, join ordering, subquery flattening, and filter and
[09:54.240 --> 10:00.240] projection pushdown, which is a bit more simple, but is extremely important and makes a
[10:00.240 --> 10:02.400] huge difference in the cost of the query.
[10:02.400 --> 10:05.760] So here's an example of projection pushdown.
[10:05.760 --> 10:10.400] Say you have a table with five columns, A, B, C, D, E, and you want to run a query. It's
[10:10.400 --> 10:16.920] pretty small: it selects the minimum of column A, there's a filter
[10:16.920 --> 10:19.920] saying that column A is bigger than zero, and you group by B.
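Written out as code, the slide's query might look like this; the table and its contents are hypothetical stand-ins:

```python
import duckdb

# A hypothetical five-column table standing in for the slide's example.
duckdb.execute("CREATE TABLE t (a INT, b INT, c INT, d INT, e INT)")
duckdb.execute("INSERT INTO t VALUES (1, 10, 0, 0, 0), (2, 10, 0, 0, 0), (-3, 20, 0, 0, 0)")

# The optimizer pushes the projection down into the scan: only columns
# a and b are ever read, even though the table has five.
duckdb.sql("SELECT min(a) FROM t WHERE a > 0 GROUP BY b").show()
```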
[10:19.920 --> 10:24.800] So the whole point of this query is that you're only using two columns of the table, right?
[10:24.800 --> 10:29.240] So what the DuckDB optimizer will do is say: OK, in this scan, I know I don't need
[10:29.240 --> 10:34.320] all the columns, I just need A and B, and you just don't have to read the other ones.
[10:34.320 --> 10:38.680] If you do the same thing in Pandas, for example, you can apply your filter, and then you have
[10:38.680 --> 10:42.600] the filter, the group-by, the aggregation; but at the time you're doing this filter, you're
[10:42.600 --> 10:45.880] still filtering all the other columns you're not going to be using in your query.
[10:45.880 --> 10:51.120] Of course, you can make this optimization manually, but it's pretty nice that the database system
[10:51.120 --> 10:54.160] can do that for you.
[10:54.160 --> 10:59.440] DuckDB also has automatic parallelism and beyond-memory execution. So DuckDB has
[10:59.440 --> 11:05.560] parallel versions of most of its operators; I think all of our scanners, including with
[11:05.560 --> 11:11.600] insertion-order preservation, are parallelized now, plus aggregations and joins. Pandas, for example,
[11:11.600 --> 11:14.680] only supports single-threaded execution.
[11:14.680 --> 11:17.240] We all have pretty good laptops these days, right?
[11:17.240 --> 11:21.640] So it's a shame if you cannot really take advantage of parallelism.
[11:21.640 --> 11:26.400] And DuckDB, again, supports execution over data that does not fit in memory.
[11:26.400 --> 11:30.240] It's kind of the never-give-up, never-surrender approach: we're going to execute
[11:30.240 --> 11:32.120] this query.
[11:32.120 --> 11:35.960] We try to always have graceful degradation as well, so that performance doesn't just suddenly
[11:35.960 --> 11:36.960] fall off a cliff.
[11:36.960 --> 11:41.640] And the whole goal is really to never crash and always execute the query.
[11:41.640 --> 11:46.960] All right, so a little bit about DuckDB in Python land.
[11:46.960 --> 11:55.160] Basically, we have an API that is DB-API 2.0 compliant, so pretty much what SQLite has, for example:
[11:55.160 --> 11:59.920] you can create a connection and start executing queries. But we also wanted to have
[11:59.920 --> 12:05.560] something similar to a data frame API, so that people who come from Pandas,
[12:05.560 --> 12:08.320] for example, could still have something familiar to work with.
[12:08.320 --> 12:11.400] So here in this example, you can also create a connection.
[12:11.400 --> 12:14.960] You can create this relation, which kind of looks like a data frame; you just point it
[12:14.960 --> 12:19.840] at a table, you can do a show() to inspect what is inside the table, and you can apply, for
[12:19.840 --> 12:24.480] example, these chained operators, right, like a filter, a projection.
[12:24.480 --> 12:31.400] In the end, this is all lazily executed, and this also allows you to take advantage
[12:31.400 --> 12:36.760] of DuckDB's optimizer, even if you use the chained operations.
[12:36.760 --> 12:45.600] Of course, I talked to you about memory transfer, so we were also very careful about being
[12:45.600 --> 12:49.560] very integrated with these very common libraries in Python.
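A sketch of the relation-style API just described. The file and column names are placeholders; `filter`, `project` and `aggregate` are real relation methods, and nothing executes until the result is materialized:

```python
import duckdb

con = duckdb.connect()
# read_parquet yields a relation: a lazy, composable description of a query,
# not a materialized result.
rel = con.read_parquet("taxi.parquet")
out = (
    rel.filter("trip_distance < 2")
       .project("passenger_count, tip_amount")
       .aggregate("passenger_count, avg(tip_amount) AS avg_tip", "passenger_count")
)
print(out.df())  # the whole optimized pipeline only runs here
```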
[12:49.560 --> 12:57.040] So with Pandas and PyArrow, for example, what we actually do is this: in the end, for Pandas,
[12:57.040 --> 13:02.600] the columns are usually NumPy arrays, which turn out to be C arrays, which turns out
[13:02.600 --> 13:04.280] to be also kind of what we use.
[13:04.280 --> 13:09.160] So with a little bit of makeup on the metadata, we can just directly read them, and they're
[13:09.160 --> 13:10.720] all in the same process, right?
[13:10.720 --> 13:16.560] So we have access to that piece of memory, which in the end means that you can actually
[13:16.560 --> 13:21.720] access the data from Pandas in DuckDB without paying any transfer cost, or at most a constant
[13:21.720 --> 13:26.400] transfer cost, just for doing the metadata makeup, let's say.
[13:26.400 --> 13:28.120] And it's the same thing with PyArrow.
[13:28.120 --> 13:34.040] We also have what we call zero-copy, so we can read Arrow objects and output Arrow objects
[13:34.040 --> 13:36.880] without any extra cost.
[13:36.880 --> 13:42.200] We also support NumPy and SQLAlchemy, and in Ibis, we're actually their default backend,
[13:42.200 --> 13:48.360] I think, since six months ago.
[13:48.360 --> 13:55.680] A little bit about usage. So, as you can see, these are our PyPI download counts.
[13:55.680 --> 13:58.400] The Python library is actually our most downloaded API.
[13:58.400 --> 14:04.120] We have APIs for all sorts of languages, and you can see that in the last month we had
[14:04.120 --> 14:09.680] about 900,000 downloads, so there are a lot of people out there trying out and using DuckDB
[14:09.680 --> 14:12.160] in their Python scripts.
[14:12.160 --> 14:23.040] So now it's demo time. Let me get this set up... all right, this looks like something you can see.
[14:23.040 --> 14:29.280] So this is just installing DuckDB and PySpark and getting our yellow trip data dataset,
[14:29.280 --> 14:35.400] extracting it, and importing the stuff we're going to be using. And here
[14:35.400 --> 14:41.000] we're just getting a connection from DuckDB and creating a relation that says: OK, we're
[14:41.000 --> 14:45.360] going to read a Parquet file. DuckDB can read Parquet files. And then you can just print it
[14:45.360 --> 14:50.120] to inspect what's in there, right? So if we run this, we can see: OK, these
[14:50.120 --> 14:54.880] are the columns we have. We have vendor ID, we have pickup date and time, passenger count;
[14:54.880 --> 14:59.400] you have the types of the columns, and you also get a little result preview, just to have
[14:59.400 --> 15:05.000] an idea of what it looks like. I think this dataset has about 20 columns, maybe,
[15:05.000 --> 15:12.560] and it's just information about the taxi rides in New York in 2016. And then you can
[15:12.560 --> 15:17.880] also, for example, run a simple query here. I'm just doing a count to know how many
[15:17.880 --> 15:22.720] tuples there are, and we have about 10 million tuples in this dataset.
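Before the timings, a minimal sketch of the zero-copy Pandas integration described a moment ago. The data is made up; DuckDB resolves `df` by looking the name up in the calling scope:

```python
import duckdb
import pandas as pd

# A made-up stand-in for the taxi data.
df = pd.DataFrame({"passenger_count": [1, 1, 2], "tip_amount": [1.0, 2.0, 4.0]})

# The DataFrame's NumPy-backed columns are scanned in place, without being
# copied into the database; .df() hands the result back as a new DataFrame.
out = duckdb.sql(
    "SELECT passenger_count, avg(tip_amount) AS avg_tip "
    "FROM df GROUP BY passenger_count"
).df()
print(out)
```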
[15:22.720 --> 15:26.680] All right, so this function here is just to do a little bit of benchmarking. Coming from
[15:26.680 --> 15:32.440] academia, we do have to do something that's kind of fair, I guess, so I run everything just five
[15:32.440 --> 15:37.160] times and take the median time. And then this is actually where we start.
[15:37.160 --> 15:41.840] So we start off with a data frame, because Pandas can also read Parquet files, and the whole
[15:41.840 --> 15:46.520] thing about DuckDB, again, is that it's not here as a replacement for Pandas that
[15:46.520 --> 15:51.440] runs by itself, but as something that can work together with Pandas. So the cool thing
[15:51.440 --> 15:57.320] is that we can, again, read and output data frames without any extra cost. So let's say
[15:57.320 --> 16:02.440] that in the query here, we're just getting the passenger count and the average tip
[16:02.440 --> 16:09.920] amount of trips that had a short distance, right, and we group by passengers, by the
[16:09.920 --> 16:16.680] number of passengers. So what we want to know is: for short trips, does the tip amount
[16:16.680 --> 16:24.280] vary with the number of passengers in that ride? And what you can see here is that you
[16:24.280 --> 16:29.120] can, again, read from the data frame, that's what we're doing, and we just have to use
[16:29.120 --> 16:35.120] the data frame's name in our SQL query. And if you call .df() on the query result, you
[16:35.120 --> 16:40.360] also output a data frame. And that's pretty cool, because data frames have these plotting
[16:40.360 --> 16:45.240] capabilities that DuckDB doesn't have, and you can easily get
[16:45.240 --> 16:52.120] a very nice chart. So you see here, apparently, there's some dirty data, because people are getting
[16:52.120 --> 16:57.080] tips when they don't have anyone in their rides. I'm not sure what that is. But apparently,
[16:57.080 --> 17:01.560] if you have more people, seven to nine, maybe the more expensive cars, you get
[17:01.560 --> 17:06.120] a higher tip. And you can do the same thing in Pandas, of course, right? In Pandas
[17:06.120 --> 17:09.160] you don't have SQL, so you're going to have to use their own language to do the
[17:09.160 --> 17:14.480] group-by and the average, and you can directly use the plots. And the whole point here is
[17:14.480 --> 17:21.800] to show the different execution times. Like, now we're waiting... OK, so it took a second,
[17:21.800 --> 17:28.920] and DuckDB took 0.2, so this is like a 5x, right? Or 0.25, so like 4x. And you also have
[17:28.920 --> 17:32.560] to consider that we're not using a very beefy machine, right? This is a Colab
[17:32.560 --> 17:37.880] machine; imagine if you had more cores, this difference would be even bigger. And then
[17:37.880 --> 17:44.400] I added Spark for fun. So actually, Spark can also read data frames, but it crashes out
[17:44.400 --> 17:49.040] of memory on my Colab machine, so I had to give up on that and read directly from Parquet
[17:49.040 --> 17:56.040] files, but it does output a data frame. I think we're going to have to wait a little
[17:56.040 --> 18:05.400] bit while it does its best. So of course, Spark is not designed for small datasets, but it turns
[18:05.400 --> 18:10.000] out there are a lot of use cases where you use these small datasets. As you're seeing,
[18:10.000 --> 18:19.560] it's warming up a little bit; it's good for the winter, it produces some energy, I think.
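The timing helper and the Pandas formulation are only partly visible in the recording; the following is a plausible reconstruction. The file name, the under-two-miles cutoff for "short" trips, and the helper's exact shape are all assumptions:

```python
import statistics
import time

import pandas as pd

def benchmark(fn, runs=5):
    # In the spirit of the demo's helper: run a few times, take the median.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Pandas-native version of the demo query: average tip by passenger count
# for short trips (column names follow the standard NYC yellow-taxi schema).
df = pd.read_parquet("yellow_tripdata_2016-01.parquet")

def pandas_query():
    return df[df["trip_distance"] < 2].groupby("passenger_count")["tip_amount"].mean()

print(f"median runtime: {benchmark(pandas_query):.3f}s")
```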
[18:19.560 --> 18:28.120] All right, OK. So it took two seconds, 2.2 seconds for the actual execution, and that's
[18:28.120 --> 18:36.080] already, what, more than two times what Pandas was. So, yeah. Anyway, in the demo,
[18:36.080 --> 18:40.840] of course, I showed you something that's fairly simple, but you can actually do very complicated
[18:40.840 --> 18:44.800] things, maybe not very complicated, but more complicated. So here I'm not really going to
[18:44.800 --> 18:49.800] go over the query, but the whole idea is that we can just, for example, use DuckDB to run
[18:49.800 --> 18:58.600] a linear regression. So can we predict, can we estimate, the fare from the trip distance?
[18:58.600 --> 19:04.480] And it turns out you can just calculate the alpha and beta with a not-so-crazy query, and
[19:04.480 --> 19:08.800] then you can again export it to Pandas, and you get a very nice figure there. So you
[19:08.800 --> 19:18.880] can really combine these two to get the best out of both. All right, that was the demo. Summary.
[19:18.880 --> 19:24.760] Oh, that's my last slide. Good. So yeah, DuckDB is an embedded database system. Again, it's
[19:24.760 --> 19:29.600] completely open source, under the MIT license. Since it came from academia, this
[19:29.600 --> 19:34.080] is something that we were always mindful about: giving it back to everyone, because
[19:34.080 --> 19:39.600] it was mostly funded by taxpayers' money, so everyone can use it. 100% of what we do
[19:39.600 --> 19:45.320] is actually open source; there's nothing that's closed source. It's designed for analytical
[19:45.320 --> 19:52.160] queries, so data analysis, data science. It has bindings for many languages. Of course,
[19:52.160 --> 19:57.760] I'm in the Python dev room, so I'm talking about Python, but we have Java, and it turns out that
[19:57.760 --> 20:03.800] Java is one of our most downloaded APIs, so I guess that's an interesting sign, JavaScript,
[20:03.800 --> 20:09.280] and a bunch of others. It has very tight integration with the Python ecosystem;
[20:09.280 --> 20:15.320] again, the whole idea is that you eliminate transfer costs. It implements the DB-API and
[20:15.320 --> 20:20.520] the relational API, and the relational API, again, is the more data-frame-like one. And it has full
[20:20.520 --> 20:26.640] SQL support, so anything you can imagine, like window functions or whatnot, you can just
[20:26.640 --> 20:36.280] express using DuckDB. And that's it. Thank you very much for paying attention,
[20:36.280 --> 20:52.680] happy to answer questions.
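On the linear-regression example from the demo: the exact query is not shown in the talk, but the closed-form alpha and beta of simple least squares can be computed with plain SQL aggregates along these lines (the file name is again a placeholder):

```python
import duckdb

# Simple OLS: beta = cov(x, y) / var(x), alpha = avg(y) - beta * avg(x).
beta, alpha = duckdb.sql("""
    SELECT
        covar_pop(trip_distance, fare_amount) / var_pop(trip_distance) AS beta,
        avg(fare_amount)
          - covar_pop(trip_distance, fare_amount) / var_pop(trip_distance)
            * avg(trip_distance) AS alpha
    FROM 'yellow_tripdata_2016-01.parquet'
""").fetchone()
print(f"fare ~ {alpha:.2f} + {beta:.2f} * distance")
```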
[20:52.680 --> 21:02.920] Thank you, Pedro. So, we have five minutes for your questions.
[21:02.920 --> 21:09.800] Thanks for the great presentation. You mentioned beyond-memory execution,
[21:09.800 --> 21:14.960] and that it tries not to degrade as much. Can you shed a little bit more light on
[21:14.960 --> 21:20.960] what happens under the hood, and how much degradation happens?
[21:20.960 --> 21:26.000] Of course. I think right now it's only the ordering operator that actually does that.
[21:26.000 --> 21:31.960] We have Laurens, who's doing his PhD on this, so there are a lot of operators where research
[21:31.960 --> 21:36.240] still needs to be done. That's more of a goal than something that fully happens now, but the
[21:36.240 --> 21:41.520] whole goal is that you really don't have this sudden spike. There's research
[21:41.520 --> 21:57.280] going on, and in the future there will be more to be shared, for sure.
[21:57.280 --> 22:01.920] Thank you very much for the talk, and it's very exciting to see such a tool, such a powerful
[22:01.920 --> 22:09.960] tool. I'm usually working with data warehouses, and I saw on the website that you do not recommend
[22:09.960 --> 22:15.320] using this for data warehouses. I would like to know why.
[22:15.320 --> 22:21.240] So of course, there's no one solution for all problems; there are cases where data warehouses
[22:21.240 --> 22:26.800] are very good fits. It turns out that for data science, for example, which is kind of
[22:26.800 --> 22:32.000] what we preach the most, they're usually not good, because then you fall back to using the
[22:32.000 --> 22:36.280] data outside your database system; you're not really going to be running your Python
[22:36.280 --> 22:40.880] code inside the system. You can do that with UDFs, for example, but they are messy, they're
[22:40.880 --> 22:46.040] a bit nasty. So you really want to have it embedded in your Python process, so you completely
[22:46.040 --> 22:50.400] eliminate data transfer costs. Because usually what you do is, like: OK, I have a table with
[22:50.400 --> 22:57.400] 10 columns, I'm going over four columns, but I'm really reading huge chunks of it. So that's
[22:57.400 --> 22:59.120] a bottleneck we try to eliminate.
[22:59.120 --> 23:06.600] How do you handle updates?
[23:06.600 --> 23:12.440] Although we are an analytical database system, we do do updates. So Mark, I don't
[23:12.440 --> 23:20.160] know where he is, but he's there, developed an MVCC algorithm for OLAP, so we have the same
[23:20.160 --> 23:24.240] ACID transactional capabilities that you would expect from a transactional database.
[23:24.240 --> 23:30.840] Of course, if you have a transactional workload, you should still go for Postgres or SQLite,
[23:30.840 --> 23:37.080] or a database that handles that type of transaction, but Mark developed a full-on algorithm
[23:37.080 --> 23:40.760] to handle updates completely, yeah.
[23:40.760 --> 23:43.200] How do you compare to Vertica?
[23:43.200 --> 23:45.400] How do you compare to Vertica?
[23:45.400 --> 23:51.800] Ah, good question. I think in terms of analytical queries, TPC-H, probably similar
[23:51.800 --> 23:58.280] performance. But then again, the whole point is that if you go, again, to the Python process,
[23:58.280 --> 24:04.480] the data transfer costs will take most of the time there, and DuckDB is really catered
[24:04.480 --> 24:11.320] to this type of scenario, the embedded scenario.
[24:11.320 --> 24:15.280] We have one minute left for one more question.
[24:15.280 --> 24:22.280] Yeah, I actually have a repo somewhere with a bunch of examples as well. I'm very happy
[24:22.280 --> 24:23.280] to share it.
[24:23.280 --> 24:24.280] I don't know where I'll post it.
[24:24.280 --> 24:25.280] Ah, the FOSDEM thing, I guess.
[24:25.280 --> 24:26.280] All right.
[24:26.280 --> 24:27.280] All right.
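On the beyond-memory question above, the knobs involved look roughly like this; a minimal sketch, with placeholder values:

```python
import duckdb

con = duckdb.connect()
# Cap DuckDB's memory budget; operators with larger-than-memory support
# (sorting, at the time of this talk) spill to the temp directory instead
# of crashing.
con.execute("SET memory_limit = '1GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")
```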
[24:27.280 --> 24:28.280] Thank you a lot, Pedro.
[24:28.280 --> 24:29.280] Thanks a lot.
[24:29.280 --> 24:30.280] Thank you very much.
[24:30.280 --> 24:46.280] Thank you very much.