First openCypher Implementers Meeting (oCIM 1) - 8 February 2017

This was the first face-to-face meeting for people, projects and companies who are interested in implementing part or whole of the Cypher language, including current implementers (there are now approaching ten projects or products, commercial and research), or in the design of a standard declarative query language for graph databases, and who want to see how Cypher could evolve to help that goal.

Links

Summary of the day (including links)
Program with links to materials

Introduction

The conference was held in Walldorf, Germany, at SAP’s headquarters.

The conference room, featuring Alastair Green making the case for multiple graph querying.

What happened

The opening presentation was given by Alastair Green, the Product Manager at Neo4j responsible for the development of Cypher and the openCypher project. In his talk, Alastair explained that the goal of the openCypher initiative is to craft a standard language for querying graphs, using as a basis Cypher in its current form, and seeking to evolve it via an open process, with the active participation of all interested vendors and implementers.

Mats Rydberg, an engineer at Neo4j, presented an overview of the library of shared artifacts that have been produced and made publicly available under the auspices of openCypher. He then proceeded to a discussion of ideas around a shared grammar and a verifiable test kit.

Next, Marcus Paradies, developer at SAP, discussed how SAP has integrated Cypher into its HANA Graph stack, detailing how SAP has modelled graphs in its relational system, and mentioning some of the shortcomings that were encountered. In particular, Marcus highlighted the importance of compositionality within the language, and brought up two of the conference’s larger topics: pattern matching semantics, and multiple graphs. Later in the day, Stefan Plantikow, team lead of Neo4j’s Cypher Language Group (CLG, the internal Neo4j team responsible for language development of Cypher), and Oskar van Rest, Principal Member of Technical Staff at Oracle, both described problems relating to Cypher’s pattern matching semantics, and explored alternatives.

Following on from Marcus’ presentation, Dmitry Vrublevsky, software engineer at Neueda, presented and demonstrated the Cypher developer tool that they have been building as a plugin for the popular JetBrains family of IDEs. As of 8 February 2017, the plugin had over 11,000 downloads, and Dmitry showcased its syntax highlighting, refactoring and error reporting features.

Just before the first coffee break of the day, Gábor Szárnyas and József Marton, researchers at Budapest University of Technology and Economics, took the audience through their research project of incremental query execution on graphs. The example model of a railroad network resonated well with the audience, as did the extensive work on mapping Cypher onto relational algebra (with special extensions). The session concluded with a list of challenging components of the language, which include the handling of lists, aggregations and the default bag semantics that are in effect when not using DISTINCT.

Both Gábor and József, as well as Dmitry, had already been directly involved in the openCypher project, raising issues, providing pull requests and discussing various topics, mostly related to the grammar specification.

Andrés Taylor, engineer at Neo4j, father of Cypher, and former team lead of the CLG, started the next session by describing how Cypher has been (and is) implemented in Neo4j, the system for which, and out of which, the language has grown. After a brief overview of Cypher’s history, Andrés described in detail the cost-based query planner and the algorithm it is based on, and ended with a quick look at Neo4j’s new Cypher runtime, which runs on generated code.

Roi Lipman, software engineer at Redis Labs (now Redis), then gave a presentation on how he had developed the Redis graph module based on a hexastore model, where node-relationship-node triplets are stored in six permutations to enable fast prefix-based searches.
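The hexastore idea described above can be illustrated with a small in-memory sketch. This is not the Redis graph module's code: the `Hexastore` class, its `add` and `scan` methods and the sorted-list layout are all hypothetical, chosen only to show why storing each triplet under six orderings turns any partially bound lookup into a prefix scan.

```python
from bisect import bisect_left
from itertools import permutations


class Hexastore:
    """Toy hexastore: each (subject, predicate, object) triplet is indexed
    under all six orderings of its components (hypothetical layout)."""

    # spo, sop, pso, pos, osp, ops
    ORDERS = ["".join(p) for p in permutations("spo")]

    def __init__(self):
        # One sorted list of keys per ordering.
        self.index = {order: [] for order in self.ORDERS}

    def add(self, s, p, o):
        parts = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:
            key = tuple(parts[c] for c in order)
            keys = self.index[order]
            pos = bisect_left(keys, key)
            if pos == len(keys) or keys[pos] != key:
                keys.insert(pos, key)  # keep the list sorted, skip duplicates

    def scan(self, order, *prefix):
        """Return (s, p, o) triplets whose key in `order` starts with `prefix`."""
        keys = self.index[order]
        pos = bisect_left(keys, prefix)  # first key >= prefix
        found = []
        while pos < len(keys) and keys[pos][:len(prefix)] == prefix:
            parts = dict(zip(order, keys[pos]))
            found.append((parts["s"], parts["p"], parts["o"]))
            pos += 1
        return found


store = Hexastore()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
store.add("bob", "works_at", "acme")

outgoing = store.scan("spo", "alice", "knows")  # who does alice know?
incoming = store.scan("ops", "bob")             # what points at bob?
```

Whichever components of a triplet are bound, one of the six orderings has them as a leading prefix, so a lookup is a binary search followed by a short sequential scan. The cost is that every triplet is stored six times.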

A concept that has arisen in prior meetings with several interested parties is that of a shared standard of internal graph query representation, possibly compiled from several distinct source languages. Stefan Plantikow gave an insight as to how Neo4j and the CLG had been thinking about such a model, called QUIL (Query Intermediary Language).

Tomasz Zdybał, software engineer at Dgraph Labs, presented Dgraph, an in-memory native graph database, and the implementation in their product of a graph query language based on Facebook’s GraphQL. Tomasz highlighted Dgraph’s intentions of adding support for Cypher, and how schema validation was an important topic.

Just before lunch, Alastair Green took the floor again, standing in for Bitnine (the Korean company behind Agens Graph, who were unable to attend in person) to discuss how Cypher, SQL, and other query languages could be integrated with one another. Alastair presented slides authored by Kisung Kim from Bitnine, detailing the hybrid relational/graph architecture of Agens Graph, which is based on PostgreSQL. The most significant contribution was the way in which Bitnine had introduced integration points between SQL and Cypher, allowing for each to be passed in as a subquery construct to the other, and how functions may be shared as expressions between the languages.

The lunch break paved the way for the big topic previously brought up by Marcus Paradies of SAP in the morning: multiple graphs. Cypher has always been a language that operates on a single implicit graph, producing a stream of records as output (the collection of which effectively forms a table). Alastair Green discussed the motivations for changing this model to make Cypher a language closed over graphs, envisioning a future where Cypher would be capable of processing multiple graphs provided as input, and producing as output one or more graphs. Alastair explored salient subtopics such as identity, addressing, and ways of defining compositions of graphs from distinct sources.

Following on from Alastair’s talk on the vision of multiple graphs in Cypher, Stefan Plantikow led a longer session on the topic, presenting his latest thinking on how Cypher can be remodeled towards a graph-in-graph-out paradigm. Stefan discussed the motivations from a different angle to those covered by Alastair, focusing on how to make the concept of multiple graphs logically consistent and how to extend the execution model, whilst still keeping in mind Cypher’s considerable user base and the cost of imposing breaking changes in semantics. One of the major concepts in Stefan’s discussion was the re-interpretation of Cypher’s result records as ‘g-records’ (graph-records, or graphlets), meaning each binding of a matched subgraph would itself be interpreted as a (typically very small) graph, and the extended ability to collapse/union all such g-records into one large result graph. Both the g-record model and the companion model of the unionised graph would make Cypher closed over graphs: upon retrieval of the result graph(s), it would be possible to immediately issue a new Cypher query that pattern matches on the newly computed results. Stefan also gave detailed syntax proposals for how to define and use graphs as values in the context of a query, including a take on the addressing topic raised by Alastair. It was made clear that this topic is foremost in the minds of several key Cypher innovators.
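To make the g-record idea more concrete, here is a minimal sketch of what "closed over graphs" could mean in practice. It does not reproduce any of the syntax proposals from the talk: the `Graph` class and the `match_knows` and `collapse` functions are hypothetical names, and a single hard-coded pattern stands in for a real query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """A tiny immutable property-less graph (assumed model, for illustration)."""
    nodes: frozenset = frozenset()
    rels: frozenset = frozenset()  # (start, type, end) triples

    def union(self, other):
        return Graph(self.nodes | other.nodes, self.rels | other.rels)


def match_knows(graph):
    """Stand-in for MATCH (a)-[:KNOWS]->(b): each binding is returned as a
    very small graph of its own, i.e. a 'g-record'."""
    return [
        Graph(frozenset({start, end}), frozenset({(start, rel_type, end)}))
        for (start, rel_type, end) in sorted(graph.rels)
        if rel_type == "KNOWS"
    ]


def collapse(g_records):
    """Union all g-records into one result graph."""
    result = Graph()
    for g_record in g_records:
        result = result.union(g_record)
    return result


source = Graph(
    frozenset({"ann", "bob", "cat", "acme"}),
    frozenset({
        ("ann", "KNOWS", "bob"),
        ("bob", "KNOWS", "cat"),
        ("cat", "WORKS_AT", "acme"),
    }),
)

g_records = match_knows(source)        # two small graphs
result_graph = collapse(g_records)     # one graph containing both matches
requeried = match_knows(result_graph)  # the output is valid input again
```

Because the output of `collapse` has the same type as the input of `match_knows`, queries compose: the result graph can be matched again without any conversion step.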

Before Stefan’s dive into the world of multiple graphs, Martin Junghanns, researcher at the University of Leipzig, presented his research project on implementing Cypher on Gradoop, a graph platform based on Apache Hadoop. Martin gave us an overview of how their system handles query planning, and of the model used to represent (intermediate) query results in Flink, the distributed framework in which the queries are executed. The project also featured interesting extensions to the Property Graph Model, upon which Cypher is based, including the concept of logical subgraphs and a set of graph operations.

Following on from Martin’s talk, Hannes Voigt, researcher at the Technical University of Dresden, walked us through his research project, in which Michael Hunger, community caretaker at Neo4j, had participated. The topic comprised virtual graphs and views, and featured several intriguing extensions to Cypher. These were expressed in terms of ‘crossing the concept chasm’, which Hannes explained as the different levels of abstraction that users view their data in. At the lowest level of abstraction is the actual raw unprocessed data, which is usually very high in volume. At higher levels, larger patterns start to appear, composed of groups of nodes and relationships from the lower levels. These larger patterns are in the model constructed using virtual nodes and relationships, with the additional ability to define views which provide several interesting qualities, such as performance optimizations and query modularization.

The afternoon session featured two larger sections in which four of the members of the CLG presented views and ideas on how to address the most prominently mentioned shortcomings of the language in its current form. Mats Rydberg presented a proposal for a new schema/constraint syntax, and also talked in more depth about the Technology Compatibility Kit, highlighting its usefulness in verifying that a multitude of language implementations are semantically consistent. Petra Selmer went through the latest thinking on several classes of subqueries, including syntax proposals. Petra also detailed the revised Cypher improvement process, which has been designed to fit the open, collaborative format intended for the openCypher project. As mentioned above, Stefan Plantikow and Oskar van Rest discussed the semantics of pattern matching in terms of isomorphism/homomorphism of subgraphs and entities. Tobias Lindaaker provided insights as to how Cypher could complete its support for Conjunctive Regular Path Queries (CRPQs), going through historical thinking as well as recent syntax and semantics proposals. Tobias also presented the topic of vendor-specific extensions, including some of the pitfalls to look out for, drawing on experiences with the SQL standard.

The last session also featured Paolo Guagliardo, researcher at the University of Edinburgh, who presented a recent research project on formalising semantics of SQL, and the advantages conferred on a language through the provision of a formal specification. Paolo also announced the recently begun project of producing a formal semantics specification of Cypher, which is to be carried out by his research team, including Nadime Francis and Professor Leonid Libkin. This project will be undertaken during the spring of 2017, and reports of progress will be given at upcoming oCIMs.

All in all, the meeting was a resounding success.

Programme

09:00	Coffee		30 mins
Chair: Tobias Lindaaker
09:30	Introduction (slides)	Alastair Green (Neo)	15 mins
09:45	openCypher Artefacts (slides)	Mats Rydberg (Neo)	15 mins
10:00	Graph Pattern Matching in SAP HANA (slides)	Marcus Paradies (SAP)	15 mins
10:15	Cypher in JetBrains IDE (slides)	Dmitry Vrublevsky (Neueda)	15 mins
10:30	Incremental Graph Queries for Cypher (slides)	Gábor Szárnyas, József Marton (Budapest University of Technology and Economics)	30 mins
11:00	Break		30 mins
Chair: Petra Selmer
11:20	Neo4j Cypher Implementation (slides)	Andrés Taylor (Neo)	25 mins
11:45	Redis Graph (slides)	Roi Lipman (Redis Labs (now Redis))	15 mins
12:00	QUIL (slides)	Stefan Plantikow (Neo)	15 mins
12:15	Dgraph (slides)	Tomasz Zdybał (Dgraph)	15 mins
12:30	Language Integration: SQL, GraphQL, and Tinkerpop (slides, Bitnine slides)	Open discussion Moderator: Alastair Green (Neo)	30 mins
13:00	Lunch		60 mins
Chair: Mats Rydberg
14:00	The case for Multiple Graph Querying (slides)	Alastair Green (Neo)	15 mins
14:15	Extended Property Graphs and Cypher on Gradoop (slides)	Martin Junghanns (University of Leipzig)	15 mins
14:30	Multiple Graphs: Evolving Cypher (slides)	Stefan Plantikow (Neo)	20 mins
14:50	Views on Cypher (slides)	Hannes Voigt (TU Dresden)	10 mins
15:00	Break		30 mins
Chair: Alastair Green
15:30	Language Evolution: Future Features (slides)		30 mins
	Schema and Constraints	Mats Rydberg (Neo)
	Subqueries	Petra Selmer (Neo)
	Isomorphic Matching (Oskar's slides)	Stefan Plantikow (Neo), Oskar van Rest (Oracle)
	CRPQs	Tobias Lindaaker (Neo)
	What else? Other ideas?
16:00	Natural Language and Formal Specifications of Cypher (slides)	Paolo Guagliardo, Nadime Francis (University of Edinburgh)	20 mins
16:20	Language Evolution: Conformance and Extension (slides)		30 mins
	TCK / Specification	Mats Rydberg (Neo)
	Vendor Extensions	Tobias Lindaaker (Neo)
	CIP Process -- Involvement	Petra Selmer (Neo)
16:50	Wrap-up and future meetings	Alastair Green, Stefan Plantikow (Neo)	10 mins
17:00	End
19:30	Dinner

Talks & Abstracts

openCypher Artefacts

Mats Rydberg (Neo4j)

In this talk, we will present and discuss the motivation behind the provision of the openCypher artefacts, all of which are publicly available for consumption. We will describe the components of the grammar specification and our Technology Compatibility Kit (TCK).

Graph Pattern Matching in SAP HANA

Marcus Paradies (SAP)

In this presentation we will describe SAP HANA Graph, a core component of the SAP HANA database. Specifically, we discuss the major design decisions that drove the integration of native graph processing capabilities and describe how we integrated a language subset of openCypher into SAP HANA. We conclude with a list of observations that we made during the development process when integrating openCypher into an existing database management system.

Cypher in JetBrains IDE

Dmitry Vrublevsky (Neueda)

In this talk we will take a look at the "Graph Database support" plugin for JetBrains IDEs, and in particular at the Cypher support that this plugin offers. We will explore what functionality it provides and how developers can use it to work more efficiently with Cypher.

Incremental Graph Queries for Cypher

Gábor Szárnyas, József Marton (Budapest University of Technology and Economics)

How can we evaluate a global query on huge graphs in 0.1 seconds? Given our current technology, that would be magic. The lack of wizarding skills did not stop us, however, from tackling the problem by using smart caching structures, which are a kind of witchcraft in their own right.

Why is this challenge important? Several applications evaluate global queries on continuously changing graphs: fraud detection in financial transactions, analysis of source code repositories and validation of engineering models. Current approaches employ domain-specific optimizations, which are difficult and error-prone to implement. Meanwhile, the requirements of these (and similar) use cases could be uniformly addressed by incremental graph query evaluation. With this technique, the first execution of the queries takes some time, but once the results are calculated, they can be efficiently maintained for each change in the graph.
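As a rough illustration of incremental evaluation, the sketch below maintains the result of one fixed query, all two-hop paths (a)-->(b)-->(c) over two distinct edges, as edges are inserted. It is an assumption-laden toy, not the engine described in this abstract: the `TwoHopView` class and the `recompute` helper are hypothetical names, deletions are not handled, and the graph is taken to be a set of directed edges with no parallel edges.

```python
from collections import defaultdict


class TwoHopView:
    """Keeps the set of (a, b, c) with edges a->b and b->c up to date."""

    def __init__(self):
        self.succ = defaultdict(set)  # node -> nodes it points to
        self.pred = defaultdict(set)  # node -> nodes pointing to it
        self.rows = set()             # the maintained query result

    def insert_edge(self, u, v):
        """Apply one change and return only the newly created result rows."""
        if v in self.succ[u]:
            return set()  # edge already present, nothing changes
        # The new edge can only appear as the second hop (x -> u -> v)
        # or as the first hop (u -> v -> y) of a new match.
        delta = {(x, u, v) for x in self.pred[u]}
        delta |= {(u, v, y) for y in self.succ[v]}
        self.succ[u].add(v)
        self.pred[v].add(u)
        self.rows |= delta
        return delta


def recompute(edges):
    """Non-incremental baseline: evaluate the query from scratch."""
    return {
        (a, b, c)
        for (a, b) in edges
        for (b2, c) in edges
        if b == b2 and (a, b) != (b2, c)
    }


view = TwoHopView()
inserted = [(1, 2), (2, 3), (3, 1), (2, 2), (2, 1)]
for edge in inserted:
    view.insert_edge(*edge)
```

Each insertion touches only the neighbours of the two endpoints, whereas `recompute` scans every pair of edges. That difference is the point of maintaining results rather than re-running the query.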

To allow incremental queries on property graphs, we implemented the ingraph engine, based on the openCypher language specification. We aim to support the standard subset of openCypher, as most standard constructs can be calculated incrementally. We have already mapped some of the standard constructs to relational algebra, defined incremental relational algebraic operators and implemented them in an incremental relational engine using Akka actors.

We start the talk by presenting use cases that evaluate complex global queries on continuously changing graphs and discuss the idea of incremental graph queries. We show the mapping of basic openCypher constructs (e.g. MATCH, WHERE, WITH, RETURN) to relational operators, such as joins, selections and projections. Finally, we show our approach for optimizing incremental graph queries and outline related challenges.
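The mapping from basic Cypher clauses to relational operators can be sketched on a toy example. The code below is only an illustration under stated assumptions: the `join`, `select`, `project` and `distinct` helpers, the table layout and the sample data are all invented here, and the real operator set used by the project may differ. It also shows the bag semantics that apply when DISTINCT is not used.

```python
# Hypothetical toy graph stored as two relations (lists of row dicts).
nodes = [
    {"id": 1, "name": "Ann", "age": 35},
    {"id": 2, "name": "Bob", "age": 28},
    {"id": 3, "name": "Cat", "age": 41},
]
knows = [  # one row per KNOWS relationship
    {"src": 1, "dst": 2},
    {"src": 3, "dst": 2},
    {"src": 2, "dst": 3},
]


def prefixed(rows, variable):
    """Rename columns to variable.column so two node copies can be joined."""
    return [{f"{variable}.{k}": v for k, v in row.items()} for row in rows]


def join(left, right, on):
    return [{**l, **r} for l in left for r in right if on(l, r)]


def select(rows, predicate):
    return [row for row in rows if predicate(row)]


def project(rows, **columns):
    """Bag projection: duplicate rows are kept."""
    return [{out: row[src] for out, src in columns.items()} for row in rows]


def distinct(rows):
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# MATCH (a)-[:KNOWS]->(b) WHERE a.age > 30 RETURN b.name AS name
a, b = prefixed(nodes, "a"), prefixed(nodes, "b")
matched = join(                                           # MATCH  -> joins
    join(a, knows, lambda l, r: l["a.id"] == r["src"]),
    b,
    lambda l, r: l["dst"] == r["b.id"],
)
filtered = select(matched, lambda row: row["a.age"] > 30)  # WHERE  -> selection
result = project(filtered, name="b.name")                  # RETURN -> projection
result_distinct = distinct(result)                         # RETURN DISTINCT
```

Here Bob is known by two people older than 30, so `result` contains the row for Bob twice, while `result_distinct` contains it once.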

Neo4j Cypher Implementation

Andrés Taylor (Neo4j)

The first and most widely-used version of Cypher is the version that is currently implemented in Neo4j. After a brief history of Cypher, we will present our current implementation of Cypher as it appears in Neo4j. We will discuss the full life cycle of a Cypher query, from the initial parsing of the query string through to the execution of the physical query plan. We define relevant concepts, such as iterative dynamic programming, and describe how we use these techniques within our implementation. The talk concludes with an outlook on next steps for further improving the implementation of Cypher in Neo4j.

Redis Graph

Roi Lipman (Redis Labs)

Redis-Graph is a graph database built on top of Redis. In this talk we will review from top to bottom an implementation of a graph database, dig a bit inside its internals and touch on key decisions made as the project developed.

QUIL

Stefan Plantikow (Neo4j)

QUIL is an ongoing effort to create a platform-agnostic intermediary representation for Cypher. The goal of QUIL is to provide a compact representation of a Cypher query for analysis and execution by different vendors. We would like to discuss the motivation for QUIL, our current approach, open questions, and gather general feedback around the idea.

Dgraph

Tomasz Zdybał (Dgraph)

Dgraph is an open source, scalable, distributed, highly available and fast graph database, designed from the ground up to be run in production. In this talk we present our goals, main design concepts, current status of the project and benchmark results. We also discuss implementation decisions and introduce the GraphQL+- query language.

Language Integration: SQL, GraphQL, and Tinkerpop

Moderator: Alastair Green (Neo4j)

A number of companies and groups have considered how to integrate SQL and Cypher. The simple composition model used by Bitnine in their Agens Graph product is of interest. GraphQL is a widely-used tool for expressing queries and for defining returned data documents, but it is not a fully-featured graph querying language: it will be interesting to see how GraphQL can be integrated with Cypher. Jason Crawford at IBM's System G project has recently raised the idea of implementing Cypher over Tinkerpop, an idea which has also been floated by the Apache Tinkerpop project. This session is an opportunity to talk about these kinds of language integrations in a round table discussion.

Multiple graphs

Alastair Green (Neo4j)

Accessing multiple property graphs within the same query provides many promising new approaches for graph data management, graph analytics, and graph modelling.

Extended Property Graphs and Cypher on Gradoop

Martin Junghanns (University of Leipzig)

Graph pattern matching is one of the most interesting and challenging operations in graph analytics. However, it is primarily supported by graph database systems such as Neo4j and, besides research prototypes, is not generally available for distributed (not-only graph) processing frameworks like Apache Flink or Apache Spark.

In our talk, we want to give an overview of our current implementation of Cypher on Apache Flink. Cypher is the Neo4j graph query language and enables the intuitive definition of graph patterns including structural and semantic predicates. As the Neo4j graph data model is not supported out of the box by Apache Flink, we leverage Gradoop, a Flink-based graph analytics framework that already provides an abstraction of schema-free property graphs.

We will give a brief overview of the technologies used to implement Cypher, explain our query engine and give a demonstration of the available language features. Finally, we will discuss open challenges and missing features, hopefully motivating people to contribute.

The project is a cooperation between the University of Leipzig and Neo4j.

Multiple graphs

Stefan Plantikow (Neo4j)

In this talk we present an outlook on a future world of multiple graph processing and show how applications may benefit from the techniques enabled by accessing multiple, globally-addressable property graphs, including parameterized views, logical graphs, and data graphs. The talk discusses these topics in the context of how the Cypher graph query language may be evolved over time to support querying multiple property graphs end-to-end in OLTP as well as in OLAP scenarios.

Views on Cypher

Hannes Voigt (TU Dresden)

In this talk, we discuss ongoing work on Graph Views in Cypher.

Language Evolution: Future Features — Schema and Constraints

Mats Rydberg (Neo4j)

Schemas and constraints are an integral part of any database management system. In this talk, we will present our view on this topic, and provide details on current and future work in this area.

Language Evolution: Future Features — Subqueries

Petra Selmer (Neo4j)

Subqueries are a well-known and useful adjunct in querying, and in this talk we will discuss how we envisage incorporating existential, nested, and scalar subqueries into Cypher, along with projections and comprehensions.

Language Evolution: Future Features — Isomorphic Matching

Stefan Plantikow (Neo4j)

Cypher's pattern matching semantics is based on relationship-isomorphic matching. While this has proven to be a good, pragmatic choice for real-world applications, it also limits language expressivity for no strong reason. This short talk presents a recent proposal for lifting this restriction by introducing a new set of uniqueness modes to pattern matching as well as accompanying path predicate functions.
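The difference between relationship-isomorphic and homomorphic matching shows up even on a two-relationship graph. The following sketch is an illustration only: `steps` and `two_hop` are hypothetical helpers, the data is invented, and the code does not model the uniqueness modes of the proposal itself. It evaluates the undirected pattern (a)--(b)--(c) with and without the rule that one relationship may be bound at most once per match.

```python
from itertools import product

# Hypothetical data: (relationship id, start node, end node).
rels = [("r1", "ann", "bob"), ("r2", "bob", "cat")]


def steps(relationships):
    """Undirected traversal: each relationship can be walked both ways."""
    for rel_id, start, end in relationships:
        yield rel_id, start, end
        yield rel_id, end, start


def two_hop(relationships, unique_rels):
    """Bindings (a, b, c) for the pattern (a)--(b)--(c).

    unique_rels=True  -> relationship-isomorphic matching
    unique_rels=False -> homomorphic matching (no uniqueness restriction)
    """
    matches = []
    for (id1, a, b), (id2, b2, c) in product(steps(relationships), repeat=2):
        if b != b2:
            continue  # the two hops must meet at the middle node
        if unique_rels and id1 == id2:
            continue  # the same relationship may not be traversed twice
        matches.append((a, b, c))
    return sorted(matches)


isomorphic = two_hop(rels, unique_rels=True)
homomorphic = two_hop(rels, unique_rels=False)
```

With the uniqueness rule there are two matches, the path from ann to cat and its reverse. Without it there are six, including bindings such as (ann, bob, ann) that walk r1 out and straight back.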

Language Evolution: Future Features — CRPQs

Tobias Lindaaker (Neo4j)

Conjunctive Regular Path Queries (CRPQs) lie at the heart of complex graph pattern matching, and research into this area has been ongoing for decades. In this talk, we present our ideas for their incorporation into Cypher.
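For readers unfamiliar with CRPQs, the sketch below shows one textbook way to evaluate them; it says nothing about the syntax or semantics proposed in the talk. A regular path query is evaluated by searching the product of the graph and a deterministic automaton over relationship types, and a conjunctive query joins the resulting endpoint pairs on shared variables. The `rpq` function, the hand-written automata and the data are all assumptions made for illustration, and paths may revisit nodes and relationships.

```python
from collections import defaultdict, deque


def rpq(edges, dfa, start, accepting):
    """Pairs (x, y) connected by a path whose type sequence the DFA accepts.

    edges: iterable of (source, type, target)
    dfa:   dict mapping (state, type) -> next state
    """
    out = defaultdict(list)
    nodes = set()
    for source, rel_type, target in edges:
        out[source].append((rel_type, target))
        nodes.update((source, target))

    answers = set()
    for origin in nodes:
        # Breadth-first search over (graph node, automaton state) pairs.
        seen = {(origin, start)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in accepting:
                answers.add((origin, node))
            for rel_type, target in out[node]:
                nxt = dfa.get((state, rel_type))
                if nxt is not None and (target, nxt) not in seen:
                    seen.add((target, nxt))
                    queue.append((target, nxt))
    return answers


edges = [
    ("ann", "KNOWS", "bob"),
    ("bob", "KNOWS", "cat"),
    ("bob", "WORKS_AT", "acme"),
    ("cat", "WORKS_AT", "acme"),
]

# Automata written by hand: KNOWS+ and a single WORKS_AT step.
KNOWS_PLUS = {(0, "KNOWS"): 1, (1, "KNOWS"): 1}
WORKS_AT = {(0, "WORKS_AT"): 1}

knows_plus = rpq(edges, KNOWS_PLUS, 0, {1})
works_at = rpq(edges, WORKS_AT, 0, {1})

# Conjunction: x -[KNOWS+]-> y AND y -[WORKS_AT]-> z, joined on y.
crpq = sorted(
    (x, y, z) for (x, y) in knows_plus for (y2, z) in works_at if y == y2
)
```

Each regular path atom is answered independently and the conjunction is an ordinary join, which is why CRPQs fit naturally next to the relational view of pattern matching.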

Natural Language and Formal Specifications of Cypher

Paolo Guagliardo, Nadime Francis (University of Edinburgh)

The SQL standard has been around for more than three decades, but we still do not fully understand what to expect when executing an SQL query on a relational DBMS. This is mostly due to vagueness of the standard and the ambiguity of the natural language it is expressed in. In this talk, I will discuss recent efforts in providing a formal semantics for a core fragment of the SQL language that captures the behavior of real DBMSs, and the lessons we learned from this exercise. I will also introduce the newly started, ongoing collaboration between Neo4j and the University of Edinburgh in providing a similar formal semantics for Cypher, the challenges it poses and expected outcomes.

Language Evolution: Conformance and Extension — TCK / Specification

Mats Rydberg (Neo4j)

In this talk, we will provide working details on the Technology Compatibility Kit (TCK), an artefact that we provide as part of openCypher. We will discuss the purpose, benefits, and limitations of the TCK, and walk through an example.

Language Evolution: Conformance and Extension — Vendor Extensions

Tobias Lindaaker (Neo4j)

In SQL's long history, there have been a number of undesirable outcomes, such as different meanings for the same query, alternative ways of implementing the same construct, and difficulty in evolving the language whilst remaining backwards-compatible. These are situations we seek to avoid in Cypher, and in this talk, we discuss various ways of extending Cypher along with a motivating example, as well as language profiles and ideas around versioning the language.

Language Evolution: Conformance and Extension — CIP Process — Involvement

Petra Selmer (Neo4j)

The evolution of Cypher is driven by the production and acceptance of Cypher Improvement Proposals (CIPs), which are documents outlining the syntax and semantics of proposed new Cypher features. In this talk, we will provide details of how we envisage the CIP process working in future.