[00:00.000 --> 00:07.800] Hello, HPC room, my name is Gábor Szárnyas.
[00:07.800 --> 00:14.240] I work at CWI Amsterdam as a researcher, and today I'm here on behalf of the LDBC.
[00:14.240 --> 00:17.360] LDBC stands for the Linked Data Benchmark Council.
[00:17.360 --> 00:23.560] We are a non-profit company founded in 2012, and we design graph benchmarks and govern their use.
[00:23.560 --> 00:32.680] Additionally, we do research on graph schemas and modern graph query languages, and everything we do is available under the Apache v2 license.
[00:32.680 --> 00:36.200] Organizationally, LDBC consists of more than 20 companies.
[00:36.200 --> 00:39.040] These are companies interested in graph data management.
[00:39.040 --> 00:49.080] We have financial service providers, database vendors, cloud vendors, hardware vendors, and consultancy companies, as well as individual contributors like me.
[00:49.080 --> 00:56.600] So we design benchmarks, the first one being the LDBC Social Network Benchmark, which targets database systems.
[00:56.600 --> 01:00.520] Let's go through this benchmark by a series of examples.
[01:00.520 --> 01:05.480] I will touch on the datasets, queries, and updates that we use in this benchmark.
[01:05.480 --> 01:17.040] As the name Social Network Benchmark suggests, we have a social network that consists of person nodes who know each other, via a distribution that mimics that of the Facebook social network.
[01:17.040 --> 01:20.080] The content that these people create is messages.
[01:20.080 --> 01:25.680] These form little tree-shaped subgraphs and are connected via author edges to the people.
[01:25.680 --> 01:28.760] On this graph, we can run queries like the following.
[01:28.760 --> 01:39.760] For a given person, enumerate their friends and their friends of friends, get the messages that these people created, and then filter the messages based on some condition on their dates.
[01:39.760 --> 01:47.640] A potential substitution on this graph could be that we are interested in this query for Bob, with the date set to Saturday.
[01:47.640 --> 01:50.600] If we evaluate this query, we start with Bob.
[01:50.600 --> 01:58.360] We traverse the knows edges to Ada and Carl, then continue to Finn and Eve, and then we move along the author edges.
[01:58.360 --> 02:06.840] Finally, we apply the filter condition, which cuts message M3 and leaves us with messages M1, M2, and M4.
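To make the shape of this query concrete, here is a minimal SQL sketch. The schema is hypothetical (a knows table storing each friendship in both directions, and a message table with a creator and a creation date); the actual SNB schema and parameter values differ.

```sql
-- Sketch over a hypothetical relational schema:
--   person(id, name)
--   knows(person1_id, person2_id)   -- each friendship stored in both directions
--   message(id, creator_id, creation_date)
WITH friends AS (
    SELECT k.person2_id AS person_id
    FROM knows k
    WHERE k.person1_id = :start_person            -- e.g. Bob's id
),
friends_of_friends AS (
    SELECT DISTINCT k2.person2_id AS person_id
    FROM friends f
    JOIN knows k2 ON k2.person1_id = f.person_id
    WHERE k2.person2_id <> :start_person          -- don't loop back to the start person
)
SELECT m.id, m.creation_date
FROM message m
JOIN (SELECT person_id FROM friends
      UNION
      SELECT person_id FROM friends_of_friends) p
  ON m.creator_id = p.person_id
WHERE m.creation_date < :max_date;                -- e.g. the given Saturday
```

The two joins over knows correspond to the two hops of the traversal, and the final predicate is the date filter that cuts message M3 in the example.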
[02:06.840 --> 02:09.840] Obviously, a social network is not a static environment.
[02:09.840 --> 02:11.120] There are always changes.
[02:11.120 --> 02:15.800] For example, people become friends: Eve and Gia may add each other as friends.
[02:15.800 --> 02:18.280] That will result in a new knows edge.
[02:18.280 --> 02:20.000] That's simple enough.
[02:20.000 --> 02:22.120] Gia can also decide to create a message.
[02:22.120 --> 02:24.800] This message is a reply to message M3.
[02:24.800 --> 02:29.160] So we add a new node and connect it to the existing graph via two edges.
[02:29.160 --> 02:31.640] The heavy-hitting updates are the deletes.
[02:31.640 --> 02:36.760] A person may decide to delete their account, and that will result in a cascade of deletes.
[02:36.760 --> 02:45.120] For example, if we remove the node Eve, that will result in the removal of their edges and all the messages they created.
[02:45.120 --> 02:53.360] In some social networks, this will even trigger the deletion of whole message trees and, of course, all the edges that point to those messages.
[02:53.360 --> 02:56.880] So this is quite a hard operation for systems to execute.
[02:56.880 --> 03:02.920] It stresses their garbage collectors, and it rules out certain append-only data structures.
[03:02.920 --> 03:11.600] If we want to weave these three components together, that is, the dataset, the queries, and the updates, we need a benchmark driver that schedules the operations to be executed.
[03:11.600 --> 03:16.080] It runs the updates and the queries concurrently, and, of course, it collects the results.
[03:16.080 --> 03:34.360] The system under test that we run the benchmark on is provided by our members, the database vendors, and we go to great lengths to allow as many candidate systems as possible, so graph databases, triple stores, and relational databases can all compete on this benchmark.
[03:34.360 --> 03:42.880] Speaking of relational databases, some of you may wonder whether SQL is sufficient to express these queries, and the answer is that in most cases it is.
[03:42.880 --> 03:48.800] The query that we have just seen can be formulated as a reasonably simple SQL query.
[03:48.800 --> 03:53.960] It is a bit unwieldy, but it is certainly doable, and the performance will be okay.
[03:53.960 --> 04:00.000] However, this being a graph benchmark, it lends itself quite naturally to other query languages.
[04:00.000 --> 04:08.080] There are two new query languages coming out, and both of them adopted a visual graph syntax inspired by Neo4j's Cypher language.
[04:08.080 --> 04:13.000] The first one is called SQL/PGQ, where PGQ stands for Property Graph Queries.
[04:13.000 --> 04:25.680] This will be released this summer, and as you can see, it is an extension to SQL, so you can use SELECT and FROM, but it adds the GRAPH_TABLE construct, and the query can be formulated in a very concise and readable manner.
[04:25.680 --> 04:36.040] There is also GQL, the Graph Query Language, a standalone language that is going to be released next year, and it shares the same pattern-matching language as SQL/PGQ.
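For comparison, here is a sketch of how the same friends-of-friends query might look in SQL/PGQ. The graph name snb, the labels, and the hasAuthor edge name are illustrative, and the exact syntax varies between the draft standard and early implementations.

```sql
SELECT gt.message_id, gt.creation_date
FROM GRAPH_TABLE (snb
       MATCH (p IS Person)-[IS knows]->{1,2}(friend IS Person)
             <-[IS hasAuthor]-(m IS Message)
       WHERE p.name = 'Bob'
         AND m.creation_date < :max_date          -- e.g. the given Saturday
       COLUMNS (m.id AS message_id, m.creation_date AS creation_date)
) gt;
```

The {1,2} quantifier covers both friends and friends of friends in a single pattern, which is what makes the formulation so much more concise than the plain SQL version.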
[04:36.040 --> 04:44.840] The Social Network Benchmark has multiple workloads to cover the diverse challenges created by graph workloads.
[04:44.840 --> 04:49.120] The first one, the older one, is the Social Network Benchmark's Interactive workload.
[04:49.120 --> 04:53.320] This is transactional in nature, and it has queries like the one I have shown before.
[04:53.320 --> 04:57.120] These queries typically start in one or two person nodes.
[04:57.120 --> 04:58.960] They are not very heavy-hitting.
[04:58.960 --> 05:01.320] They only touch a limited amount of data.
[05:01.320 --> 05:06.720] They have concurrent reads and updates, and systems are competing on achieving high throughput.
[05:06.720 --> 05:11.200] This benchmark has been around for a few years, and we have actually seen very good results.
[05:11.200 --> 05:22.880] In the last three years, we witnessed an exponential increase in throughput, starting from a little above 5,000 operations per second to almost 17,000 operations per second this year.
[05:22.880 --> 05:27.320] Our newer benchmark is the Social Network Benchmark's Business Intelligence workload.
[05:27.320 --> 05:32.240] This is analytical in nature, and it has queries that touch large portions of the data.
[05:32.240 --> 05:44.160] For example, the query on this slide enumerates all triangles of friendships in a given country, which can potentially reach billions of edges, and this is a computationally very difficult problem.
[05:44.160 --> 05:54.080] Systems here are allowed to use either a bulk or a concurrent update approach, but they should strive to achieve both high throughput and low individual query runtimes.
[05:54.080 --> 05:59.800] This benchmark being relatively new, we only have a single result, so it's a bit difficult to put it into context.
[05:59.800 --> 06:02.080] But it allows me to highlight one thing.
[06:02.080 --> 06:08.560] Our benchmark results use many different CPUs; we actually have quite a healthy diversity in the CPUs.
[06:08.560 --> 06:13.400] We have results with the AMD EPYC Genoa, like this one achieved by TigerGraph.
[06:13.400 --> 06:19.640] We have results using Intel Xeon Ice Lake CPUs and the Alibaba Yitian 710, which uses an Arm architecture.
[06:19.640 --> 06:29.280] We expect more and larger-scale results this year, and we are also quite interested in some graph and machine learning accelerators that are going to be released soon.
[06:29.280 --> 06:31.560] Our benchmark process is quite involved.
[06:31.560 --> 06:34.360] For each workload, we release a specification.
[06:34.360 --> 06:36.840] We have an academic paper that motivates the benchmark.
[06:36.840 --> 06:44.520] We have data generators, pre-generated datasets, as well as a benchmark driver and at least two reference implementations.
[06:44.520 --> 06:58.320] We do this because we have an auditing process that requires vendors implementing this benchmark to go through a rigorous test, and if they pass, they can claim that they have an official benchmark result.
[06:58.320 --> 07:14.360] We trademarked the term LDBC so that vendors have to go through these hoops of auditing; we still allow researchers and developers to run unofficial benchmarks, but they have to state that theirs is not an official LDBC benchmark result.
[07:14.360 --> 07:18.200] Another benchmark I would like to touch upon briefly is the Graphalytics benchmark.
[07:18.200 --> 07:26.600] This one casts a wider net: it targets graph databases, graph processing frameworks, embedded graph libraries like NetworkX, and so on.
[07:26.600 --> 07:36.080] It uses untyped, unattributed graphs, so only the person-knows-person graph of the Social Network Benchmark, or other well-known graphs like Graph500.
[07:36.080 --> 07:37.400] We have six algorithms.
[07:37.400 --> 07:47.160] Many of these are textbook algorithms, like BFS, which traverses the graph from a given source node, or PageRank, which selects the most important nodes in the network.
[07:47.160 --> 07:53.360] We also have clustering coefficient, community detection, connected components, and shortest paths.
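As a rough illustration of what the BFS traversal computes, here is a sketch using a recursive SQL query over a hypothetical edge(src, dst) table with undirected edges stored in both directions; the depth cap is an assumption added only to guarantee termination on cyclic graphs.

```sql
-- Sketch over a hypothetical edge table:
--   edge(src, dst)   -- undirected edges stored in both directions
WITH RECURSIVE bfs(node, depth) AS (
    SELECT :source_node, 0                -- the given source node
    UNION                                 -- UNION (not UNION ALL) deduplicates rows
    SELECT e.dst, b.depth + 1
    FROM bfs b
    JOIN edge e ON e.src = b.node
    WHERE b.depth < 32                    -- depth cap guarantees termination on cycles
)
SELECT node, MIN(depth) AS distance       -- first level at which each node is reached
FROM bfs
GROUP BY node;
```

A real implementation would advance an explicit frontier level by level with a visited set, but the output, the minimum depth at which each reachable node is first seen, is the same.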
[07:53.360 --> 07:54.920] This benchmark is a bit simpler to implement.
[07:54.920 --> 07:57.600] We have a leaderboard that we update periodically.
[07:57.600 --> 08:02.680] The next one is going to come out in spring 2023, so talk to us if you're interested.
[08:02.680 --> 08:18.720] Wrapping up, you should consider becoming an LDBC member, because members can participate in the benchmark design and have a say in where we go, they can commission audits of their benchmarks, and they can gain early access to the ISO standard drafts of SQL/PGQ and GQL that I have shown.
[08:18.720 --> 08:23.200] Membership is free for individuals and has a yearly fee for companies.
[08:23.200 --> 08:25.400] To sum up, these are our three main benchmarks.
[08:25.400 --> 08:28.160] We have other benchmarks and many future ideas.
[08:28.160 --> 08:39.640] If you're interested, please reach out.
[08:39.640 --> 08:42.600] Again, we have time for one question.
[08:42.600 --> 08:51.800] Any questions for Gábor?
[08:51.800 --> 08:53.160] This is a newbie question.
[08:53.160 --> 08:57.200] I'm not into graphs.
[08:57.200 --> 09:14.880] Apart from advertisement optimization, mass surveillance, and perhaps content distribution, and I don't know whether those are the major applications, it's just what my naive mind comes up with:
[09:14.880 --> 09:20.120] what other applications are these benchmarks meant to optimize for?
[09:20.120 --> 09:31.400] So the big one this year is supply chain optimization: strengthening supply chains, ensuring that they are ethical, ensuring that they are not passing through conflict zones.
[09:31.400 --> 09:34.120] This is something that is very important these days.
[09:34.120 --> 09:42.520] You can also track CO2 emissions and other aspects of labor and manufacturing.
[09:42.520 --> 09:46.160] So that's certainly a big one, and that's something that we have seen.
[09:46.160 --> 10:22.240] And there are, of course, all the classic graph problems like power grids, a lot of e-commerce problems, and financial fraud detection, which is going to be part of our financial benchmark this year.