[00:00.000 --> 00:08.280]  I am Gábor Szárnyas and I'm here with David Püroja.
[00:08.280 --> 00:14.320]  We work at CWI Amsterdam and we're here to present you the LDBC social network benchmark.
[00:14.320 --> 00:15.600]  What is the LDBC?
[00:15.600 --> 00:18.560]  The abbreviation stands for Linked Data Benchmark Council.
[00:18.560 --> 00:23.320]  It is a non-profit company founded in 2012 and its mission is to accelerate the progress
[00:23.320 --> 00:25.640]  in the field of graph data management.
[00:25.640 --> 00:30.800]  And to this end, it designs and governs the use of graph benchmarks and everything we
[00:30.800 --> 00:34.960]  do is open source under the Apache version 2 license.
[00:34.960 --> 00:40.360]  From an organizational perspective, LDBC consists of more than 20 members who all have some
[00:40.360 --> 00:42.360]  vested interest in graph data management.
[00:42.360 --> 00:46.760]  We have financial service providers like the Ant Group, database vendors like Oracle,
[00:46.760 --> 00:53.120]  Neo4j and TigerGraph, cloud vendors like AWS and hardware vendors like Intel.
[00:53.120 --> 00:58.840]  Also we have individual contributors like David and me who contribute to the benchmarks.
[00:58.840 --> 01:04.000]  So to put things into context, the last two decades have seen a rise in the use of modern
[01:04.000 --> 01:06.440]  graph database management systems.
[01:06.440 --> 01:10.880]  Typically, the data model used in these systems is called a property graph, which is a labelled
[01:10.880 --> 01:15.520]  graph where both the nodes and the edges can have an arbitrary number of properties.
[01:15.520 --> 01:20.320]  For example, this is a small social network consisting of five person nodes and a single
[01:20.320 --> 01:23.000]  city node, which is the city of Spa.
[01:23.000 --> 01:25.040]  And the properties can be on the nodes.
[01:25.040 --> 01:30.040]  For example, here the nodes have names and the edges have attributes like the date when
[01:30.040 --> 01:31.560]  the friendship was established.
[01:31.560 --> 01:35.760]  We can see that Bob and Carl met in 2015.
[01:35.760 --> 01:40.920]  And if you want to run a query on this system, we can use a graph query where we look for
[01:40.920 --> 01:42.920]  matches of a given graph.
[01:42.920 --> 01:46.280]  So here the query says we want to start from Bob.
[01:46.280 --> 01:51.240]  We want to use an arbitrary number of edges to reach some person who lives in Spa and
[01:51.240 --> 01:55.680]  we want to do an aggregation to return the number of those people.
[01:55.680 --> 02:01.800]  If we want to evaluate this, we start from the person Bob, traverse to all the people
[02:01.800 --> 02:06.520]  who are known by Bob transitively, that is, directly or via multiple edges.
[02:06.520 --> 02:08.680]  This means all four people here.
[02:08.680 --> 02:13.240]  We narrow it down to the people who actually live in Spa, then count them and get
[02:13.240 --> 02:14.960]  the result two.
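A minimal Python sketch of the evaluation just described: start from Bob, follow knows edges transitively, keep the people who live in Spa, and count them. Only Bob, Carl and Spa are named in the talk; the other names and the assignment of people to the city are made up for illustration.

```python
from collections import deque

# Hypothetical toy graph (assumption): five persons, undirected knows edges.
KNOWS = {
    "Bob": ["Carl"],
    "Carl": ["Bob", "Dana", "Eve"],
    "Dana": ["Carl", "Finn"],
    "Eve": ["Carl"],
    "Finn": ["Dana"],
}
# Hypothetical assumption: which persons live in the single city node.
LIVES_IN = {"Carl": "Spa", "Finn": "Spa"}


def count_reachable_in_city(start, city):
    """Count people reachable from `start` over one or more knows edges who live in `city`."""
    seen = {start}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in KNOWS.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    reachable = seen - {start}  # the start person itself is not counted
    return sum(1 for person in reachable if LIVES_IN.get(person) == city)


print(count_reachable_in_city("Bob", "Spa"))  # prints 2
```

The three steps (pathfinding, filtering, aggregation) are spelled out by hand here; a graph query language expresses the same thing in a single pattern.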
[02:14.960 --> 02:21.280]  So graph databases use something called a visual graph syntax, also known as the ASCII-art graph
[02:21.280 --> 02:27.920]  syntax, which is similar to the popular Cypher language of Neo4j.
[02:27.920 --> 02:32.080]  And here this query is actually really similar to the graph pattern that I have shown.
[02:32.080 --> 02:36.360]  So there are similarities in how the nodes are formulated, how the edges are captured
[02:36.360 --> 02:41.080]  in this text, and also how the transitive closure, the little asterisk, is captured
[02:41.080 --> 02:42.440]  in the query language.
[02:42.440 --> 02:46.960]  So this is a very intuitive and concise way of formulating the queries.
[02:46.960 --> 02:51.240]  If we deconstruct this query, we can see three main components.
[02:51.240 --> 02:52.640]  The first one is relational operators.
[02:52.640 --> 02:55.400]  Obviously, we still need relational operators.
[02:55.400 --> 02:58.680]  We want to be able to identify people by filtering.
[02:58.680 --> 03:03.000]  So we filter for Bob, we filter for Spa, and also we want to sometimes aggregate.
[03:03.000 --> 03:06.360]  So the count aggregation is part of this query.
[03:06.360 --> 03:10.560]  The pathfinding is really elegant in this formulation because we have knows asterisk,
[03:10.560 --> 03:13.920]  which captures that we can use an arbitrary number of edges.
[03:13.920 --> 03:20.600]  And the pattern matching which connects the person to Spa is also very concise and readable.
[03:20.600 --> 03:26.240]  So what is interesting from a future work perspective on graph databases?
[03:26.240 --> 03:30.200]  Obviously, relational operators are quite well known at this point, and there are endless
[03:30.200 --> 03:32.960]  papers and techniques on how to implement these.
[03:32.960 --> 03:37.720]  But we believe that pathfinding and pattern matching are really strong in graph databases
[03:37.720 --> 03:42.200]  compared to traditional relational systems because they provide a more concise syntax
[03:42.200 --> 03:44.800]  and better algorithms and implementations.
[03:44.800 --> 03:49.520]  Interestingly enough, even in the last 15 years, there have been lots of papers on better
[03:49.520 --> 03:56.160]  BFS algorithms, better factorized representations for graph patterns, multi-way worst-case optimal
[03:56.160 --> 03:57.720]  joins, and so on.
[03:57.720 --> 04:01.000]  So we believe that these should be adopted by more and more systems.
[04:01.000 --> 04:05.400]  And to this end, we designed benchmarks that try to push the state of the art and force
[04:05.400 --> 04:08.000]  systems to adopt better and better techniques.
[04:08.000 --> 04:10.160]  David will talk about these benchmarks.
[04:10.160 --> 04:11.520]  Yeah, hi.
[04:11.520 --> 04:14.800]  So I will give an overview about the social network benchmark.
[04:14.800 --> 04:21.520]  And so first, we'll go through three steps of this benchmark, so the data sets, two example
[04:21.520 --> 04:25.360]  queries, and the update operations done in this benchmark.
[04:25.360 --> 04:31.360]  So here we see a small example of the data sets where on the left side, we see persons
[04:31.360 --> 04:37.440]  with friendships forming a network, and these persons post messages on the social network
[04:37.440 --> 04:42.880]  and can reply to each other, forming a tree-shaped data structure.
[04:42.880 --> 04:48.640]  And now we will do one query on this very small data set example.
[04:48.640 --> 04:54.720]  So with query nine, we want to retrieve messages posted by a given person's friends and friends
[04:54.720 --> 04:58.280]  of friends before a given date.
[04:58.280 --> 05:00.640]  And the dates are here shortened for simplicity.
[05:00.640 --> 05:06.240]  So if we start with Bob, we traverse to their friends and friends of friends, retrieve
[05:06.240 --> 05:12.120]  the messages, and then keep only the ones that are actually before Saturday.
[05:12.120 --> 05:15.520]  And then we touch upon 10 nodes in this data.
[05:15.520 --> 05:21.240]  Suppose we would start from another person, so for example Finn, and we traverse again
[05:21.240 --> 05:24.640]  to their friends and friends of friends.
[05:24.640 --> 05:30.240]  Here we see that we touch upon five different nodes.
[05:30.240 --> 05:33.200]  So half as many as for Bob.
[05:33.200 --> 05:40.680]  And this difference can actually be troublesome since runtimes for the same query are different,
[05:40.680 --> 05:45.240]  and that doesn't help in understanding what's happening.
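A rough Python sketch of the query nine idea on a toy data set, to make the traversal concrete. This is not the official query text: the data layout, the integer "days" standing in for dates, and the sort order and limit of 20 are assumptions made for illustration.

```python
def friends_within_two_hops(knows, start):
    """Return friends and friends of friends of `start`, excluding `start` itself."""
    direct = set(knows.get(start, ()))
    two_hop = {fof for friend in direct for fof in knows.get(friend, ())}
    return (direct | two_hop) - {start}


def recent_messages(knows, messages, start, max_day, limit=20):
    """Messages created before `max_day` by people within two hops of `start`."""
    people = friends_within_two_hops(knows, start)
    hits = [m for m in messages if m["creator"] in people and m["day"] < max_day]
    # Assumed ordering: newest first, ties broken by message id.
    hits.sort(key=lambda m: (-m["day"], m["id"]))
    return hits[:limit]


# Hypothetical toy data: dates are shortened to day numbers.
knows = {"Bob": ["Carl"], "Carl": ["Bob", "Dana"], "Dana": ["Carl"], "Finn": []}
messages = [
    {"id": 1, "creator": "Carl", "day": 3},
    {"id": 2, "creator": "Dana", "day": 5},
    {"id": 3, "creator": "Dana", "day": 7},
    {"id": 4, "creator": "Bob", "day": 1},
]
print([m["id"] for m in recent_messages(knows, messages, "Bob", max_day=6)])  # prints [2, 1]
```

The amount of work depends directly on how large the two-hop neighbourhood of the start person is, which is why the choice of start person matters so much for the runtime.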
[05:45.240 --> 05:51.640]  So for this benchmark, we actually want to select parameters that have similar runtimes
[05:51.640 --> 05:56.200]  and also to actually stress the technical difficulties in these systems.
[05:56.200 --> 06:00.320]  So we select the parameters more carefully.
[06:00.320 --> 06:05.520]  So here we see an example of when we do not select the parameters carefully, just uniformly at
[06:05.520 --> 06:06.520]  random.
[06:06.520 --> 06:12.400]  And we can see here a trimodal distribution, a bimodal one, and one with many outliers.
[06:12.400 --> 06:14.520]  And we don't want that.
[06:14.520 --> 06:22.400]  So in the data sets, there are also statistics provided, in this example for each person
[06:22.400 --> 06:25.240]  the number of friends and friends of friends.
[06:25.240 --> 06:32.240]  Then we want to select persons with a similar number to get more predictable runtimes.
[06:32.240 --> 06:37.960]  And so if we do that, then we can see here an example that we have unimodal distributions
[06:37.960 --> 06:40.880]  with very tight runtimes.
[06:40.880 --> 06:47.800]  And that also helps in understanding the behavior of the queries.
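One simple way to picture this kind of parameter curation is sketched below in Python. It assumes a hypothetical table of per-person friend and friend-of-friend counts and picks the persons closest to the median; the benchmark's actual parameter generator may use a different selection rule.

```python
import statistics


def curate_parameters(two_hop_counts, how_many):
    """Pick the `how_many` persons whose two-hop neighbourhood size is closest to the median."""
    median = statistics.median(two_hop_counts.values())
    ranked = sorted(
        two_hop_counts,
        key=lambda person: (abs(two_hop_counts[person] - median), person),
    )
    return ranked[:how_many]


# Hypothetical statistics: number of friends and friends of friends per person.
counts = {"Bob": 10, "Finn": 5, "Carl": 9, "Dana": 11, "Eve": 40}
print(curate_parameters(counts, 3))  # prints ['Bob', 'Carl', 'Dana']
```

Outliers such as the person with 40 reachable people are left out, so the selected start persons should lead to similar amounts of work.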
[06:47.800 --> 06:50.600]  So now we're going to the updates.
[06:50.600 --> 06:57.280]  And for example, if Eve and Gia want to be friends, we insert a knows edge.
[06:57.280 --> 07:01.120]  And this then becomes part of the network.
[07:01.120 --> 07:04.080]  Suppose that the next operation is inserting a comment.
[07:04.080 --> 07:09.120]  So Gia's comment replies to a message posted by Eve.
[07:09.120 --> 07:12.880]  And both messages are posted on the same date.
[07:12.880 --> 07:15.280]  Then we have another problem.
[07:15.280 --> 07:20.360]  Because when we are executing these operations concurrently, it can happen that the reply
[07:20.360 --> 07:26.720]  is inserted earlier than the message it replies to, causing an error.
[07:26.720 --> 07:31.320]  And to mitigate this, we introduce dependency tracking.
[07:31.320 --> 07:36.880]  So for each operation, and this also includes the edges, but just for simplicity only the nodes
[07:36.880 --> 07:40.960]  are shown here with the dependent dates,
[07:40.960 --> 07:45.360]  we include a creation date and a dependent date.
[07:45.360 --> 07:50.360]  The creation date is when it's scheduled to be executed, and the dependent date is,
[07:50.360 --> 07:58.040]  in this case for M6, the creation date of M3.
[07:58.040 --> 08:02.760]  And here we can see, actually, that each operation depends on an earlier one, forming
[08:02.760 --> 08:06.360]  a whole chain in the social network.
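The dependency tracking idea can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark driver's actual scheduling logic: the `Operation` class, the integer timestamps and the assumption that creation dates are unique are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Operation:
    name: str
    creation: int             # when the operation is scheduled to run
    dependent: Optional[int]  # creation date of the operation it depends on, if any


def execution_order(arrivals):
    """Execute operations in arrival order, deferring any whose dependency has not completed."""
    completed = set()  # creation dates of operations that have already been executed
    pending = list(arrivals)
    order = []
    while pending:
        ready = next(
            (op for op in pending if op.dependent is None or op.dependent in completed),
            None,
        )
        if ready is None:
            raise ValueError("an operation depends on something that never completes")
        pending.remove(ready)
        completed.add(ready.creation)
        order.append(ready.name)
    return order


# With concurrent execution the reply M6 may arrive before the message M3 it replies to.
arrivals = [Operation("M6", 6, 3), Operation("M3", 3, 1), Operation("Eve", 1, None)]
print(execution_order(arrivals))  # prints ['Eve', 'M3', 'M6']
```

Even though the reply arrives first, it is only executed after the message it depends on, which in turn waits for its creator.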
[08:06.360 --> 08:10.880]  Suppose now that Eve wants to leave the social network and removes her account.
[08:10.880 --> 08:15.920]  And so we start with deleting the node of Eve, and this will trigger a cascading effect,
[08:15.920 --> 08:22.520]  since we then need to remove the edges connected to Eve, the messages posted, and
[08:22.520 --> 08:25.440]  also the replies to those messages.
[08:25.440 --> 08:31.200]  We can actually see, like, this huge cascading effect, and that can actually have a large
[08:31.200 --> 08:38.440]  impact on the data distribution, and also therefore the executability of these operations.
[08:38.440 --> 08:45.320]  And furthermore, it also influences the selection of the parameters, which we have shown before.
[08:45.320 --> 08:49.640]  And we want to include this delete because it prohibits append-only data structures in
[08:49.640 --> 08:54.120]  databases and also stresses the garbage collector of these systems.
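A small Python sketch of the cascading delete described above. The in-memory layout (a set of persons, a set of knows edges, and a dictionary of messages with `creator` and `reply_of` fields) is a hypothetical stand-in for whatever storage a real system uses.

```python
def delete_person(graph, person):
    """Delete a person, their knows edges, their messages and all transitive replies."""
    graph["persons"].discard(person)
    graph["knows"] = {edge for edge in graph["knows"] if person not in edge}

    # Start from the person's own messages and follow reply links downwards.
    doomed = [m for m, meta in graph["messages"].items() if meta["creator"] == person]
    removed = set()
    while doomed:
        message = doomed.pop()
        if message in removed:
            continue
        removed.add(message)
        doomed.extend(
            reply for reply, meta in graph["messages"].items() if meta["reply_of"] == message
        )
    for message in removed:
        del graph["messages"][message]
    return removed


# Hypothetical toy network; M6 replies to M3 and M7 replies to M6.
graph = {
    "persons": {"Eve", "Gia", "Bob"},
    "knows": {frozenset({"Eve", "Gia"}), frozenset({"Bob", "Gia"})},
    "messages": {
        "M3": {"creator": "Eve", "reply_of": None},
        "M6": {"creator": "Gia", "reply_of": "M3"},
        "M7": {"creator": "Bob", "reply_of": "M6"},
        "M8": {"creator": "Bob", "reply_of": None},
    },
}
print(sorted(delete_person(graph, "Eve")))  # prints ['M3', 'M6', 'M7']
```

Removing one person takes out messages written by other people as well, which is the cascading effect that changes the data distribution.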
[08:54.120 --> 09:00.360]  Now we are going to give another example to also stress the temporal aspect of this benchmark.
[09:00.360 --> 09:04.680]  So suppose we want to find a path between two persons.
[09:04.680 --> 09:11.080]  So we have a start person and a destination person, for example, Finn and Gia.
[09:11.080 --> 09:16.040]  Then we can see here that we have a four-hop path between these persons.
[09:16.040 --> 09:22.120]  But at one point in the benchmark, it can happen that a knows edge is removed, and
[09:22.120 --> 09:25.160]  then there is no path anymore.
[09:25.160 --> 09:29.560]  It can also happen that there's another edge inserted between Carl and Gia, and then we
[09:29.560 --> 09:31.840]  have a path again.
[09:31.840 --> 09:36.640]  And so for the same parameters, we can actually have three different outcomes.
[09:36.640 --> 09:39.480]  And to mitigate this, we do temporal parameter selection.
[09:39.480 --> 09:46.240]  So each parameter is assigned to a time bucket to actually ensure that we have similar results
[09:46.240 --> 09:49.320]  and therefore also similar run times.
[09:49.320 --> 09:52.680]  Now going through the benchmark workflow.
[09:52.680 --> 09:59.600]  So we start with the data gen, and the data gen provides us with a temporal graph spanning
[09:59.600 --> 10:05.400]  three years of social media activity, and it is modeled to be similar
[10:05.400 --> 10:08.560]  to the Facebook social network.
[10:08.560 --> 10:15.480]  It's a Spark-based data generator that can generate data up to 30 terabytes, and it contains
[10:15.480 --> 10:24.720]  skewed data, for example, in the knows edges and the person data.
[10:24.720 --> 10:31.160]  And so the output is a data set suitable for loading into the system under test, updates
[10:31.160 --> 10:38.280]  which are then executed during the benchmark, and statistics from which we can select the parameters.
[10:38.280 --> 10:42.360]  And the selection of the parameters is done in the parameter generator.
[10:42.360 --> 10:48.240]  This ensures stable query run times and assigns parameters to temporal buckets.
[10:48.240 --> 10:55.920]  So a bucket may only include parameters once they have been inserted into the data set
[10:55.920 --> 11:00.600]  and before they are removed from the network.
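As an illustration of temporal bucketing, here is a minimal Python sketch. It assumes fixed-length buckets and integer timestamps, and treats a parameter as usable in a bucket only if the entity exists for the whole bucket; the function names and the exact rule are hypothetical, not taken from the parameter generator.

```python
def valid_buckets(created, deleted, bucket_starts, bucket_length):
    """Buckets during which an entity exists from start to end (`deleted` may be None)."""
    return [
        start
        for start in bucket_starts
        if created <= start and (deleted is None or start + bucket_length <= deleted)
    ]


def buckets_for_pair(first, second, bucket_starts, bucket_length):
    """Buckets in which both entities of a two-person parameter exist."""
    a = set(valid_buckets(first[0], first[1], bucket_starts, bucket_length))
    b = set(valid_buckets(second[0], second[1], bucket_starts, bucket_length))
    return sorted(a & b)


bucket_starts = [0, 10, 20, 30]
finn = (5, 30)    # (creation, deletion), hypothetical values
gia = (12, None)  # never deleted
print(valid_buckets(*finn, bucket_starts, 10))          # prints [10, 20]
print(buckets_for_pair(finn, gia, bucket_starts, 10))   # prints [20]
```

A path query between these two persons would then only be issued while the benchmark is inside the bucket starting at 20.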
[11:00.600 --> 11:06.760]  And then we have a benchmark driver which schedules these operations and ensures that
[11:06.760 --> 11:11.120]  they can be executed, using the dependency tracking.
[11:11.120 --> 11:16.920]  And this is especially important when executing the operations concurrently.
[11:16.920 --> 11:21.640]  And lastly, we have the system under test where we have, for example, graph databases,
[11:21.640 --> 11:24.120]  triple stores or relational databases.
[11:24.120 --> 11:27.920]  And now Gabor will go further into the workloads.
[11:27.920 --> 11:36.520]  Okay, so graph workloads are actually quite diverse in terms of what they are trying to
[11:36.520 --> 11:40.680]  achieve, and our benchmark reflects that by having multiple workloads.
[11:40.680 --> 11:45.580]  We have the social network benchmark interactive workload, which is transactional in nature,
[11:45.580 --> 11:47.720]  so it has loads of concurrent operations.
[11:47.720 --> 11:52.920]  The queries here are relatively simple, so they always start from one or two person nodes,
[11:52.920 --> 11:55.800]  the same as David presented before.
[11:55.800 --> 12:00.000]  And here the systems are striving to achieve a high throughput, so the competition is getting
[12:00.000 --> 12:03.360]  as many operations per second as possible.
[12:03.360 --> 12:08.400]  We are happy to report that we have official results from the last three years, where systems
[12:08.400 --> 12:13.560]  started with slightly above 5,000 operations per second and have sped up exponentially,
[12:13.560 --> 12:19.600]  now being close to 17,000 operations per second on a 100 gigabyte dataset.
[12:19.600 --> 12:23.280]  The other workload of the social network benchmark is called business intelligence.
[12:23.280 --> 12:27.800]  This is an analytical workload where the queries touch on large portions of the data.
[12:27.800 --> 12:33.400]  For example, this query in this slide shows a case where we start from a given country
[12:33.400 --> 12:37.720]  and then find all triangles of friendships in that country.
[12:37.720 --> 12:40.400]  It's easy to see that this is a very heavy hitting operation.
[12:40.400 --> 12:45.360]  It may touch on billions of edges in the graph, and it also has to do a complex computation
[12:45.360 --> 12:46.880]  to find those people.
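To give a feel for the triangle query, here is a small Python sketch that counts friendship triangles among the people of one country. It is an illustrative stand-in rather than the official query: the edge list, the `lives_in` mapping and the country codes are made up.

```python
from itertools import combinations


def triangles_in_country(knows, lives_in, country):
    """Count triangles of knows edges whose three persons all live in `country`."""
    residents = sorted(person for person, c in lives_in.items() if c == country)
    adjacency = {person: set() for person in residents}
    for a, b in knows:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)

    count = 0
    for a in residents:
        # Only look at neighbours ordered after `a`, so each triangle is counted once.
        later = sorted(n for n in adjacency[a] if n > a)
        for b, c in combinations(later, 2):
            if c in adjacency[b]:
                count += 1
    return count


# Hypothetical toy data.
knows = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("A", "E")]
lives_in = {"A": "NL", "B": "NL", "C": "NL", "D": "NL", "E": "BE"}
print(triangles_in_country(knows, lives_in, "NL"))  # prints 1
```

On a real data set the inner loop runs over pairs of neighbours for every resident, which is why such a query can touch a very large number of edges.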
[12:46.880 --> 12:52.240]  So here systems can use either a bulk update or a concurrent update method, and they should
[12:52.240 --> 12:57.400]  also strive to get both a high throughput and low query run times.
[12:57.400 --> 12:58.960]  This benchmark is relatively new.
[12:58.960 --> 13:02.560]  It was released at the end of last year, so we only have a single result, which was done
[13:02.560 --> 13:05.040]  by a collaboration of TigerGraph and AMD.
[13:05.040 --> 13:10.080]  We're happy to report that there are more audits under way, so we are going to release
[13:10.080 --> 13:13.600]  more results in 2023.
[13:13.600 --> 13:18.080]  So probably you can see from this presentation that these benchmarks can get fairly complex
[13:18.080 --> 13:20.240]  and implementing them is not trivial.
[13:20.240 --> 13:23.960]  So we did our best to provide everything our users need.
[13:23.960 --> 13:27.680]  For each of the workloads that we have presented, we have a specification, we have detailed
[13:27.680 --> 13:33.680]  academic papers that motivate the design choices and the architecture of these benchmarks.
[13:33.680 --> 13:39.160]  We released a data generator as well as pre-generated datasets, and we have benchmark drivers and
[13:39.160 --> 13:42.160]  at least two reference implementations for each of the workloads.
[13:42.160 --> 13:46.600]  Moreover, we have guidelines on how to execute these benchmarks correctly, how to validate
[13:46.600 --> 13:51.320]  the results of a given system, and how to ensure that the system will not lose your data
[13:51.320 --> 13:54.240]  or mangle the transactions.
[13:54.240 --> 13:58.680]  So we have ACID compliance tests and recovery tests.
[13:58.680 --> 14:00.920]  This leads us to our auditing process.
[14:00.920 --> 14:05.760]  Similarly to the TPC, the Transaction Processing Performance Council, we have a rigorous auditing
[14:05.760 --> 14:11.840]  process where vendors can commission an independent third party who will rerun the benchmark in
[14:11.840 --> 14:17.680]  an executable and reproducible manner, and they will write it up as a full disclosure
[14:17.680 --> 14:23.560]  report so that the benchmark is understandable by whoever wants to see that result.
[14:23.560 --> 14:29.000]  This is important because LDBC is trademarked worldwide, and we only allow official audited
[14:29.000 --> 14:32.440]  results to use the term LDBC benchmark result.
[14:32.440 --> 14:35.960]  This is not to say that we don't allow people to use this benchmark.
[14:35.960 --> 14:40.440]  Researchers, practitioners, and developers are welcome to use the benchmark.
[14:40.440 --> 14:41.440]  They can run it.
[14:41.440 --> 14:45.640]  They can report the results if it is accompanied by the appropriate disclaimer that this is
[14:45.640 --> 14:49.800]  not an official LDBC benchmark result.
[14:49.800 --> 14:53.480]  I would like to talk a bit about standard graph query languages.
[14:53.480 --> 14:57.560]  This is an important topic because this has been a pain point for graph systems for many
[14:57.560 --> 14:58.560]  years.
[14:58.560 --> 15:03.080]  There is a bit of a tower of Babel out there with many languages, most of them using some
[15:03.080 --> 15:07.240]  sort of visual graph syntax, but always with slightly different semantics and a slightly
[15:07.240 --> 15:12.600]  different syntax, which makes it difficult for users to adopt these techniques and may
[15:12.600 --> 15:16.840]  put them in a position of being locked in by their vendors.
[15:16.840 --> 15:21.480]  In the next couple of years, there are going to be new standard query languages.
[15:21.480 --> 15:24.920]  These focus on pathfinding and pattern matching.
[15:24.920 --> 15:26.840]  The first one is called SQL/PGQ.
[15:26.840 --> 15:31.440]  This is an extension to the SQL language and PGQ stands for property graph queries.
[15:31.440 --> 15:36.600]  This is going to be released next summer, and GQL, the standalone graph query language,
[15:36.600 --> 15:39.360]  is going to come out in 2024.
[15:39.360 --> 15:43.680]  We are happy to report that even though we have two new languages, the pattern matching
[15:43.680 --> 15:48.920]  core of them, the visual graph syntax that we all know and love, is going to be the same,
[15:48.920 --> 15:52.840]  so users can port at least those bits of their queries.
[15:52.840 --> 15:57.240]  To give you a taste of what this will look like, here is query 9 that David presented
[15:57.240 --> 16:00.560]  in the social network benchmark interactive workload.
[16:00.560 --> 16:02.520]  This query can be formulated in SQL.
[16:02.520 --> 16:08.280]  It's not too difficult, but the new variants, SQL/PGQ and GQL, can represent it in terms
[16:08.280 --> 16:12.920]  of a graph pattern, and this is a much more concise formulation.
[16:12.920 --> 16:16.920]  The difference is even more pronounced for query 13 with the path queries.
[16:16.920 --> 16:22.240]  Here we can see that in SQL/PGQ, the pattern is really similar to the visual representation.
[16:22.240 --> 16:28.640]  It just has a source, a target, and an arbitrary number of knows edges denoted by knows asterisk
[16:28.640 --> 16:29.840]  in between.
[16:29.840 --> 16:35.040]  In SQL, this is a lot less readable, hard to maintain, and it's even less efficient
[16:35.040 --> 16:39.400]  because it just implements a unidirectional search algorithm instead of doing a bidirectional
[16:39.400 --> 16:42.840]  search which has a better algorithmic complexity.
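For readers who want to see what a bidirectional search looks like, here is a short Python sketch on an unweighted, undirected knows graph. It is a generic textbook-style implementation, not the code of any particular system; the intermediate person names in the example are hypothetical, while Finn, Carl and Gia follow the path example from earlier in the talk.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of knows edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def bidirectional_hops(adjacency, source, target):
    """Length of the shortest path in hops, or None if there is no path."""
    if source == target:
        return 0
    dist = {"fwd": {source: 0}, "bwd": {target: 0}}
    frontier = {"fwd": {source}, "bwd": {target}}
    while frontier["fwd"] and frontier["bwd"]:
        # Always expand the smaller frontier by one full level.
        side = "fwd" if len(frontier["fwd"]) <= len(frontier["bwd"]) else "bwd"
        other = "bwd" if side == "fwd" else "fwd"
        next_frontier, best = set(), None
        for node in frontier[side]:
            for neighbour in adjacency.get(node, ()):
                if neighbour in dist[other]:
                    # The two searches meet on this edge.
                    hops = dist[side][node] + 1 + dist[other][neighbour]
                    best = hops if best is None else min(best, hops)
                elif neighbour not in dist[side]:
                    dist[side][neighbour] = dist[side][node] + 1
                    next_frontier.add(neighbour)
        if best is not None:
            return best
        frontier[side] = next_frontier
    return None


chain = [("Finn", "Dana"), ("Dana", "Carl"), ("Carl", "Eve"), ("Eve", "Gia")]
print(bidirectional_hops(build_adjacency(chain), "Finn", "Gia"))       # prints 4
print(bidirectional_hops(build_adjacency(chain[:-1]), "Finn", "Gia"))  # prints None
print(bidirectional_hops(build_adjacency(chain[:-1] + [("Carl", "Gia")]), "Finn", "Gia"))  # prints 3
```

The three calls also reproduce the three outcomes from the temporal example: a four-hop path, no path after an edge is removed, and a shorter path after a new edge is inserted.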
[16:42.840 --> 16:46.440]  The way LDBC is involved in these new query languages is manifold.
[16:46.440 --> 16:52.640]  First, it had the G-CORE design language released in 2018, which influenced these languages.
[16:52.640 --> 16:56.680]  Then LDBC has the formal semantics working group which formalized the pattern matching
[16:56.680 --> 17:04.000]  core of these new languages, and LDBC is doing further research to advance the state
[17:04.000 --> 17:05.720]  of the art on graph schemas.
[17:05.720 --> 17:10.560]  We have an industry-driven and a theory-driven group, and what they do will end up in the
[17:10.560 --> 17:13.600]  new versions of these languages.
[17:13.600 --> 17:16.600]  The outlook is the LDBC Graphalytics benchmark.
[17:16.600 --> 17:23.640]  This is a broader benchmark because it can target analytical libraries like NetworkX,
[17:23.640 --> 17:27.280]  distributed systems like Apache Giraph, or the GraphBLAS API.
[17:27.280 --> 17:31.120]  This is everything that has to do with analyzing large graphs.
[17:31.120 --> 17:36.560]  Here the graph is an untyped, unattributed graph, so there are no properties or no labels.
[17:36.560 --> 17:42.040]  We do use the LDBC social network benchmark dataset, but it is stripped down to the person-knows-person
[17:42.040 --> 17:43.040]  core graph.
[17:43.040 --> 17:48.680]  Additionally, we have included a number of well-known datasets like Graph500, Twitter,
[17:48.680 --> 17:49.680]  and so on.
[17:49.680 --> 17:53.280]  The algorithms that we run are mostly well-known graph algorithms.
[17:53.280 --> 17:58.960]  There is the BFS, which starts from a given node and assigns the number of steps that
[17:58.960 --> 18:02.320]  need to be taken to all of the other nodes to reach them.
[18:02.320 --> 18:06.800]  We have the famous PageRank centrality algorithm, which highlights the most important nodes in
[18:06.800 --> 18:13.480]  the network, and we have the local clustering coefficient, community detection using label
[18:13.480 --> 18:17.720]  propagation, weakly connected components, and shortest paths.
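Two of the algorithms mentioned here, BFS and the local clustering coefficient, are small enough to sketch in plain Python. The versions below assume an undirected graph stored as a dictionary of neighbour sets; the benchmark specification defines the exact variants (for example for directed graphs), so treat this only as an illustration.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Assign to every reachable node the number of steps needed to reach it from `source`."""
    levels = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in levels:
                levels[neighbour] = levels[node] + 1
                queue.append(neighbour)
    return levels


def local_clustering_coefficient(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a in neighbours for b in neighbours if a < b and b in adjacency[a])
    return 2 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(bfs_levels(adjacency, 1))                     # prints {1: 0, 2: 1, 3: 1, 4: 2}
print(local_clustering_coefficient(adjacency, 1))   # prints 1.0
```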
[18:17.720 --> 18:20.800]  This benchmark is a bit simpler than the social network benchmark.
[18:20.800 --> 18:23.600]  It does not have a rigorous auditing process.
[18:23.600 --> 18:29.040]  We trust people that they can run this benchmark efficiently and correctly on their own infrastructure,
[18:29.040 --> 18:30.800]  and they can report results.
[18:30.800 --> 18:35.040]  If they do so, they will be able to participate in the Graphalytics competition, which has
[18:35.040 --> 18:38.840]  a leaderboard for the best implementations.
[18:38.840 --> 18:44.160]  Wrapping up, you should consider joining the LDBC because members can participate in the
[18:44.160 --> 18:45.160]  benchmark design.
[18:45.160 --> 18:49.080]  They have a say in where we are going in terms of including new features.
[18:49.080 --> 18:55.040]  They can commission audits if they are vendors, and members can gain access to these ISO standard
[18:55.040 --> 18:58.000]  drafts that I mentioned, SQL/PGQ and GQL.
[18:58.000 --> 19:01.560]  Otherwise, these are not available to the general public.
[19:01.560 --> 19:06.440]  Pricing-wise, this is free for individuals, and there is a yearly fee for companies.
[19:06.440 --> 19:10.520]  To sum up, we have presented three benchmarks, the social network benchmark's interactive
[19:10.520 --> 19:15.760]  workload, its business intelligence workload, and the Graphalytics graph algorithms workload.
[19:15.760 --> 19:16.760]  We have more benchmarks.
[19:16.760 --> 19:22.720]  There is the Semantic Publishing Benchmark, which targets RDF systems and is set in the media
[19:22.720 --> 19:24.480]  and publishing industry.
[19:24.480 --> 19:27.880]  There is the financial benchmark, which is going to be released this year, which targets
[19:27.880 --> 19:34.160]  distributed systems, and it uses the financial fraud detection domain as its area, and it
[19:34.160 --> 19:36.800]  imposes strict latency bounds on queries.
[19:36.800 --> 19:39.880]  This is quite a different workload from the previous ones.
[19:39.880 --> 19:44.440]  Of course, graphs are ubiquitous, and they have loads of use cases, so there are many
[19:44.440 --> 19:49.040]  future benchmark ideas, including graph neural networks, mining, and streaming.
[19:49.040 --> 19:51.080]  Thank you very much, and we're open to any questions.
[19:51.080 --> 19:58.080]  Yes.
[19:58.080 --> 20:04.560]  So, in this one overview there was the graph data set, and the updates were kind of separated.
[20:04.560 --> 20:09.960]  Is there a possibility to create a graph data set where the updates are included in the
[20:09.960 --> 20:14.560]  data set, so that the nodes and edges get time stamps when they were deleted or when
[20:14.560 --> 20:15.560]  they were added?
[20:15.560 --> 20:16.560]  Yes.
[20:16.560 --> 20:22.680]  So, is it possible to create something like a temporal graph with the time stamps of when
[20:22.680 --> 20:27.880]  the specific node is created and deleted, and this is actually very easy, because this
[20:27.880 --> 20:30.920]  is the first step that the data gen creates.
[20:30.920 --> 20:36.160]  So, when David said that it creates a social network of three years, that has everything
[20:36.160 --> 20:40.560]  that was ever created or deleted during those three years, and then we have attributes like
[20:40.560 --> 20:44.480]  creation date and deletion date, and when we turn it into something that's loadable
[20:44.480 --> 20:49.640]  into the database, we hide the deletion dates, because the database, of course, shouldn't be aware
[20:49.640 --> 20:54.800]  of this, but this is something that the data gen supports out of the box.
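The step from a raw temporal graph to loadable data can be pictured with the following Python sketch. It assumes hypothetical records with `creation` and `deletion` fields and a hypothetical cutoff that separates the initially loaded snapshot from the later updates; it is not the data generator's actual code.

```python
def split_raw_temporal(records, cutoff):
    """Split raw temporal records into an initial snapshot, later inserts and later deletes."""
    snapshot, inserts, deletes = [], [], []
    for record in records:
        created, deleted = record["creation"], record["deletion"]
        if deleted is not None and deleted <= cutoff:
            continue  # created and deleted before the snapshot: never visible to the system
        visible = {key: value for key, value in record.items() if key != "deletion"}
        if created <= cutoff:
            snapshot.append(visible)  # deletion date is hidden from the loaded data
        else:
            inserts.append(visible)
        if deleted is not None:
            deletes.append((deleted, record["id"]))
    inserts.sort(key=lambda r: r["creation"])
    deletes.sort()
    return snapshot, inserts, deletes


# Hypothetical raw temporal records.
records = [
    {"id": "p1", "creation": 1, "deletion": None},
    {"id": "p2", "creation": 2, "deletion": 8},
    {"id": "p3", "creation": 6, "deletion": None},
    {"id": "p4", "creation": 1, "deletion": 3},
]
snapshot, inserts, deletes = split_raw_temporal(records, cutoff=5)
print([r["id"] for r in snapshot], [r["id"] for r in inserts], deletes)
```

The system under test only ever sees the snapshot and the update streams; the deletion dates stay in the raw data.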
[20:54.800 --> 21:00.040]  Okay, but then is it also possible to get this data set with the deletion dates, because you
[21:00.040 --> 21:02.080]  already said that they are hidden?
[21:02.080 --> 21:07.720]  They are hidden, but we have one which is called the raw temporal data set, and that is available,
[21:07.720 --> 21:14.680]  and we even published that, so that's something that, yeah, it has a lot of chance to be influential
[21:14.680 --> 21:16.520]  in the streaming community, I believe.
[21:16.520 --> 21:18.920]  All right, more questions?
[21:18.920 --> 21:19.920]  Yeah, Michael?
[21:48.120 --> 21:50.840]  So the question is, can we extend to other domains?
[21:50.840 --> 21:57.360]  And we usually emphasize that social networks are not really the domain that is the actual
[21:57.360 --> 22:01.360]  primary use case for graphs, we just use this because this is really easy to understand,
[22:01.360 --> 22:05.640]  we don't have to explain person-knows-person, and you can put in all sorts of interesting
[22:05.640 --> 22:09.440]  technological challenges to a graph domain like this.
[22:09.440 --> 22:14.840]  It would make sense, and sometimes we are approached by our members saying, we want
[22:14.840 --> 22:22.320]  to do a new benchmark in the domain X, and we then send them the process that is required
[22:22.320 --> 22:27.160]  to get one of these benchmarks completed, and that's usually the end of the conversation,
[22:27.160 --> 22:33.640]  but we are definitely open to have more interesting benchmarks, and of course, a good data generator
[22:33.640 --> 22:40.120]  is worth gold to all the researchers and the vendors in this community, so that's usually
[22:40.120 --> 22:45.560]  the hard point, and I would be definitely interested in having a retail graph generator.
[22:45.560 --> 22:46.560]  Carlo?
[22:46.560 --> 22:47.560]  Hi.
[22:47.560 --> 22:57.560]  The question is specifically, what do you see the impact of this will be on the industry
[22:57.560 --> 23:14.320]  or it's more uneductive of evidence if it's, if the system would have improved, or if the
[23:14.320 --> 23:15.320]  system would get more robust as in that you detect stuff that is doing things and stuff
[23:15.320 --> 23:16.320]  get fixed, or what's the, yeah.
[23:16.320 --> 23:17.320]  So the question?
[23:17.320 --> 23:18.320]  Yeah, the question is about the potential impact.
[23:18.320 --> 23:19.960]  What could all this achieve?
[23:19.960 --> 23:26.600]  And we believe that it will help accelerate the field in the sense that systems will get
[23:26.600 --> 23:31.280]  more mature, because if you want to get an audited result, you have to pass all the ACID
[23:31.280 --> 23:36.880]  tests, you have to be able to recover after a crash, and ideally you would have to be
[23:36.880 --> 23:41.480]  fast, so that is hopefully one of the other things that systems will take away.
[23:41.480 --> 23:48.120]  They will have better optimizers, improved storage, better query execution engines,
[23:48.120 --> 23:54.960]  and we have seen this in the aftermath of the TPC benchmarks, so those resulted in quite
[23:54.960 --> 23:56.320]  a big speedup.
[23:56.320 --> 24:02.160]  So that's one area, and of course there is pricing, we would like that users can get
[24:02.160 --> 24:07.240]  more transactions per dollar, and the third that we are personally quite interested in
[24:07.240 --> 24:09.200]  is the new accelerators that come out.
[24:09.200 --> 24:14.560]  So there are, especially in the field of machine learning, there are cards that do fast sparse
[24:14.560 --> 24:19.840]  matrix multiplications, those could be harnessed specifically for the analytical benchmarks
[24:19.840 --> 24:24.960]  that we have, and that would be interesting to see how big of a hassle it is to implement
[24:24.960 --> 24:35.040]  and how big of a speedup they give. Cool, all right, okay, thank you very much.