[00:00.000 --> 00:07.800] Hello, HPC room, my name is Gábor Szárnyas.
[00:07.800 --> 00:14.240] I work at CWI Amsterdam as a researcher, and today I'm here on behalf of the LDBC.
[00:14.240 --> 00:17.360] LDBC stands for the Linked Data Benchmark Council.
[00:17.360 --> 00:23.560] We are a non-profit company founded in 2012, and we design graph benchmarks and govern their use.
[00:23.560 --> 00:32.680] Additionally, we do research on graph schemas and modern graph query languages, and everything we do is available under the Apache v2 license.
[00:32.680 --> 00:36.200] Organizationally, LDBC consists of more than 20 companies.
[00:36.200 --> 00:39.040] These are companies interested in graph data management.
[00:39.040 --> 00:49.080] We have financial service providers, database vendors, cloud vendors, hardware vendors, and consultancy companies, as well as individual contributors like me.
[00:49.080 --> 00:56.600] So we design benchmarks, the first one being the LDBC Social Network Benchmark, which targets database systems.
[00:56.600 --> 01:00.520] Let's go through this benchmark by a series of examples.
[01:00.520 --> 01:05.480] I will touch on the datasets, queries, and updates that we use in this benchmark.
[01:05.480 --> 01:17.040] As the name Social Network Benchmark suggests, we have a social network that consists of person nodes who know each other, via a distribution that mimics that of the Facebook social network.
[01:17.040 --> 01:20.080] The content that these people create is messages.
[01:20.080 --> 01:25.680] These form little tree-shaped subgraphs and are connected via author edges to the people.
[01:25.680 --> 01:28.760] On this graph, we can run queries like the following.
[01:28.760 --> 01:39.760] For a given person, enumerate their friends and their friends of friends, get the messages that these people created, and then filter the messages based on some condition on their dates.
[01:39.760 --> 01:47.640] A potential substitution on this graph could be that we are interested in this query for Bob, with the date set to Saturday.
[01:47.640 --> 01:50.600] If we evaluate this query, we start with Bob.
[01:50.600 --> 01:58.360] We traverse the knows edges to Ada and Carl, then continue to Finn and Eve, and then we move along the author edges.
[01:58.360 --> 02:06.840] Finally, we apply the filter condition, which cuts message M3 and leaves us with messages M1, M2, and M4.
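To make the shape of this query concrete, here is a minimal SQL sketch. The schema is hypothetical (a knows table storing each friendship in both directions, and a message table with a creator and a creation date); the actual SNB schema and parameter values differ.

```sql
-- Sketch over a hypothetical relational schema:
--   person(id, name)
--   knows(person1_id, person2_id)   -- each friendship stored in both directions
--   message(id, creator_id, creation_date)
WITH friends AS (
    SELECT k.person2_id AS person_id
    FROM knows k
    WHERE k.person1_id = :start_person            -- e.g. Bob's id
),
friends_of_friends AS (
    SELECT DISTINCT k2.person2_id AS person_id
    FROM friends f
    JOIN knows k2 ON k2.person1_id = f.person_id
    WHERE k2.person2_id <> :start_person          -- don't loop back to the start person
)
SELECT m.id, m.creation_date
FROM message m
JOIN (SELECT person_id FROM friends
      UNION
      SELECT person_id FROM friends_of_friends) p
  ON m.creator_id = p.person_id
WHERE m.creation_date < :max_date;                -- e.g. the given Saturday
```

The two joins over knows correspond to the two hops of the traversal, and the final predicate is the date filter that cuts message M3 in the example.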
[02:06.840 --> 02:09.840] Obviously, a social network is not a static environment.
[02:09.840 --> 02:11.120] There are always changes.
[02:11.120 --> 02:15.800] For example, people become friends: Eve and Gia may add each other as friends.
[02:15.800 --> 02:18.280] That will result in a new knows edge.
[02:18.280 --> 02:20.000] That's simple enough.
[02:20.000 --> 02:22.120] Gia can also decide to create a message.
[02:22.120 --> 02:24.800] This message is a reply to message M3.
[02:24.800 --> 02:29.160] So we add a new node and connect it to the existing graph via two edges.
[02:29.160 --> 02:31.640] The heavy-hitting updates are the deletes.
[02:31.640 --> 02:36.760] A person may decide to delete their account, and that will result in a cascade of deletes.
[02:36.760 --> 02:45.120] For example, if we remove the node Eve, that will result in the removal of their edges and all the messages they created.
[02:45.120 --> 02:53.360] In some social networks, this will even trigger the deletion of whole message trees and, of course, all the edges that point to those messages.
[02:53.360 --> 02:56.880] So this is quite a hard operation for systems to execute.
[02:56.880 --> 03:02.920] It stresses their garbage collectors, and it rules out certain append-only data structures.
[03:02.920 --> 03:11.600] If we want to weave these three components together, that is, the dataset, the queries, and the updates, we need a benchmark driver that schedules the operations to be executed.
[03:11.600 --> 03:16.080] It runs the updates and the queries concurrently, and, of course, it collects the results.
[03:16.080 --> 03:34.360] The system under test that we run the benchmark on is provided by our members, the database vendors, and we go to great lengths to allow as many candidate systems as possible, so graph databases, triple stores, and relational databases can all compete on this benchmark.
[03:34.360 --> 03:42.880] Speaking of relational databases, some of you may wonder whether SQL is sufficient to express these queries, and the answer is that in most cases it is.
[03:42.880 --> 03:48.800] The query that we have just seen can be formulated as a reasonably simple SQL query.
[03:48.800 --> 03:53.960] It is a bit unwieldy, but it is certainly doable, and the performance will be okay.
[03:53.960 --> 04:00.000] However, this being a graph benchmark, it lends itself quite naturally to other query languages.
[04:00.000 --> 04:08.080] There are two new query languages coming out, and both of them adopted a visual graph syntax inspired by Neo4j's Cypher language.
[04:08.080 --> 04:13.000] The first one is called SQL/PGQ, where PGQ stands for Property Graph Queries.
[04:13.000 --> 04:25.680] This will be released this summer, and as you can see, it is an extension to SQL, so you can use SELECT and FROM, but it adds the GRAPH_TABLE construct, and the query can be formulated in a very concise and readable manner.
[04:25.680 --> 04:36.040] There is also GQL, the Graph Query Language, a standalone language that is going to be released next year, and it shares the same pattern-matching language as SQL/PGQ.
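For comparison, here is a sketch of how the same friends-of-friends query might look in SQL/PGQ. The graph name snb, the labels, and the hasAuthor edge name are illustrative, and the exact syntax varies between the draft standard and early implementations.

```sql
SELECT gt.message_id, gt.creation_date
FROM GRAPH_TABLE (snb
       MATCH (p IS Person)-[IS knows]->{1,2}(friend IS Person)
             <-[IS hasAuthor]-(m IS Message)
       WHERE p.name = 'Bob'
         AND m.creation_date < :max_date          -- e.g. the given Saturday
       COLUMNS (m.id AS message_id, m.creation_date AS creation_date)
) gt;
```

The {1,2} quantifier covers both friends and friends of friends in a single pattern, which is what makes the formulation so much more concise than the plain SQL version.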
[04:36.040 --> 04:44.840] The Social Network Benchmark has multiple workloads to cover the diverse challenges created by graph workloads.
[04:44.840 --> 04:49.120] The first one, the older one, is the Social Network Benchmark's Interactive workload.
[04:49.120 --> 04:53.320] This is transactional in nature, and it has queries like the one I have shown before.
[04:53.320 --> 04:57.120] These queries typically start in one or two person nodes.
[04:57.120 --> 04:58.960] They are not very heavy-hitting.
[04:58.960 --> 05:01.320] They only touch a limited amount of data.
[05:01.320 --> 05:06.720] They have concurrent reads and updates, and systems are competing on achieving high throughput.
[05:06.720 --> 05:11.200] This benchmark has been around for a few years, and we have actually seen very good results.
[05:11.200 --> 05:22.880] In the last three years, we witnessed an exponential increase in throughput, starting from a little above 5,000 operations per second to almost 17,000 operations per second this year.
[05:22.880 --> 05:27.320] Our newer benchmark is the Social Network Benchmark's Business Intelligence workload.
[05:27.320 --> 05:32.240] This is analytical in nature, and it has queries that touch large portions of the data.
[05:32.240 --> 05:44.160] For example, the query on this slide enumerates all triangles of friendships in a given country, which can potentially reach billions of edges, and this is a computationally very difficult problem.
[05:44.160 --> 05:54.080] Systems here are allowed to use either a bulk or a concurrent update approach, but they should strive to achieve both high throughput and low individual query runtimes.
[05:54.080 --> 05:59.800] This benchmark being relatively new, we only have a single result, so it's a bit difficult to put it into context.
[05:59.800 --> 06:02.080] But it allows me to highlight one thing.
[06:02.080 --> 06:08.560] Our benchmark results use many different CPUs; we actually have quite a healthy diversity in the CPUs.
[06:08.560 --> 06:13.400] We have results with the AMD EPYC Genoa, like this one achieved by TigerGraph.
[06:13.400 --> 06:19.640] We have results using Intel Xeon Ice Lake CPUs and the Alibaba Yitian 710, which uses an Arm architecture.
[06:19.640 --> 06:29.280] We expect more and larger-scale results this year, and we are also quite interested in some graph and machine learning accelerators that are going to be released soon.
[06:29.280 --> 06:31.560] Our benchmark process is quite involved.
[06:31.560 --> 06:34.360] For each workload, we release a specification.
[06:34.360 --> 06:36.840] We have an academic paper that motivates the benchmark.
[06:36.840 --> 06:44.520] We have data generators, pre-generated datasets, as well as a benchmark driver and at least two reference implementations.
[06:44.520 --> 06:58.320] We do this because we have an auditing process that requires vendors implementing this benchmark to go through a rigorous test, and if they pass, they can claim that they have an official benchmark result.
[06:58.320 --> 07:14.360] We trademarked the term LDBC so that vendors have to go through these hoops of auditing; we still allow researchers and developers to run unofficial benchmarks, but they have to state that theirs is not an official LDBC benchmark result.
[07:14.360 --> 07:18.200] Another benchmark I would like to touch upon briefly is the Graphalytics benchmark.
[07:18.200 --> 07:26.600] This one casts a wider net: it targets graph databases, graph processing frameworks, embedded graph libraries like NetworkX, and so on.
[07:26.600 --> 07:36.080] It uses untyped, unattributed graphs, so only the person-knows-person graph of the Social Network Benchmark, or other well-known graphs like Graph500.
[07:36.080 --> 07:37.400] We have six algorithms.
[07:37.400 --> 07:47.160] Many of these are textbook algorithms, like BFS, which traverses the graph from a given source node, or PageRank, which selects the most important nodes in the network.
[07:47.160 --> 07:53.360] We also have clustering coefficient, community detection, connected components, and shortest paths.
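As a rough illustration of what the BFS traversal computes, here is a sketch using a recursive SQL query over a hypothetical edge(src, dst) table with undirected edges stored in both directions; the depth cap is an assumption added only to guarantee termination on cyclic graphs.

```sql
-- Sketch over a hypothetical edge table:
--   edge(src, dst)   -- undirected edges stored in both directions
WITH RECURSIVE bfs(node, depth) AS (
    SELECT :source_node, 0                -- the given source node
    UNION                                 -- UNION (not UNION ALL) deduplicates rows
    SELECT e.dst, b.depth + 1
    FROM bfs b
    JOIN edge e ON e.src = b.node
    WHERE b.depth < 32                    -- depth cap guarantees termination on cycles
)
SELECT node, MIN(depth) AS distance       -- first level at which each node is reached
FROM bfs
GROUP BY node;
```

A real implementation would advance an explicit frontier level by level with a visited set, but the output, the minimum depth at which each reachable node is first seen, is the same.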
[07:53.360 --> 07:54.920] This benchmark is a bit simpler to implement.
[07:54.920 --> 07:57.600] We have a leaderboard that we update periodically.
[07:57.600 --> 08:02.680] The next one is going to come out in spring 2023, so talk to us if you're interested.
[08:02.680 --> 08:18.720] Wrapping up, you should consider becoming an LDBC member, because members can participate in the benchmark design and have a say in where we go, they can commission audits of their benchmarks, and they can gain early access to the ISO standard drafts of SQL/PGQ and GQL that I have shown.
[08:18.720 --> 08:23.200] Membership is free for individuals and has a yearly fee for companies.
[08:23.200 --> 08:25.400] To sum up, these are our three main benchmarks.
[08:25.400 --> 08:28.160] We have other benchmarks and many future ideas.
[08:28.160 --> 08:39.640] If you're interested, please reach out.
[08:39.640 --> 08:42.600] Again, we have time for one question.
[08:42.600 --> 08:51.800] Any questions for Gábor?
[08:51.800 --> 08:53.160] This is a newbie question.
[08:53.160 --> 08:57.200] I'm not into graphs.
[08:57.200 --> 09:14.880] Apart from advertisement optimization, mass surveillance, and perhaps content distribution, and I don't know whether those are the major applications, it's just what my naive mind comes up with:
[09:14.880 --> 09:20.120] what other applications are these benchmarks meant to optimize for?
[09:20.120 --> 09:31.400] So the big one this year is supply chain optimization: strengthening supply chains, ensuring that they are ethical, ensuring that they are not passing through conflict zones.
[09:31.400 --> 09:34.120] This is something that is very important these days.
[09:34.120 --> 09:42.520] You can also track CO2 emissions and other aspects of labor and manufacturing.
[09:42.520 --> 09:46.160] So that's certainly a big one, and that's something that we have seen.
[09:46.160 --> 10:22.240] And there are, of course, all the classic graph problems like power grids, a lot of e-commerce problems, and financial fraud detection, which is going to be part of our financial benchmark this year.