Second openCypher Implementers Meeting (oCIM 2) - 10 May 2017

This was the second face-to-face meeting for people, projects and organizations who are interested in

implementing part or whole of the Cypher graph query language, including current implementers
the design and development of a standard declarative query language for graph databases, and want to see how Cypher could evolve to reach that goal

Links

Summary of the day (including links)
Program with links to materials

Summary

We were very pleased to welcome an interested and diverse set of attendees to the Second openCypher Implementers Meeting, which was held in central London, UK, on the 10th of May 2017.

Group photo of the participants at the 2nd oCIM conference.

Testament to the increasing impact and adoption of Cypher, and following in the footsteps of the 1st oCIM conference, we had once more a schedule jam-packed with a wide variety of presentations. The day was divided into a number of sessions, each of which centred around a particular topic or theme.

Stefan Plantikow, the team lead of Neo4j’s Cypher Language Group (the team at Neo4j responsible for the language development of Cypher and fostering the growth of the openCyper project) welcomed everyone to the meeting with a summarization of the current state of Cypher, touching upon the vast quantity of new features under consideration, and giving a quick update on both the ongoing specification process, as well as the progress of new implementations using Cypher, such as Scott Tiger and the Cypher for Apache Spark project.

Alastair Green, the Product Manager at Neo4j, followed on from Stefan’s talk with a presentation on the openCypher Implementers Group, building on from the announcement made at the 1st oCIM conference. As a public body, the oCIG is open to any interested parties – vendors, researchers, implementers, users – wishing to be involved in improving the feature set, expressiveness, aesthetics and usability of Cypher. Alastair introduced the process underpinning this initiative, which is broadly as follows:

Cypher Improvement Requests (CIRs) outline at a high level ideas, feature suggestions, or topics for discussion.
Cypher Improvement Proposals (CIPs) act as the formal Cypher specification vehicle in that they address ideas suggested in CIRs, detailing the syntax and semantics of proposed new features.

Every three weeks, the oCIG will meet virtually to discuss CIPs and CIRs – expanding and enriching as needed – and to arrive at a consensus as to whether to accept or reject a CIP, and to assign working groups to produce CIPs in response to high-priority CIRs.

The next session shone a spotlight on four CIPs, with the aim of kicking off a discussion around these.

Petra Selmer, an engineer at Neo4j and a member of the Cypher Language Group, started off by presenting a talk on nested subqueries; i.e. self-contained queries running within the scope of an outer query. This feature would greatly increase the expressivity of complex read, write and read-write queries, as well as provide a useful abstraction mechanism. In addition to following in the footsteps of SQL, this CIP motivates, for the first time, the usage of a query combinator to allow for chained nested subqueries.

In the discussion that followed, Hannes Voigt, a researcher at the Technical University of Dresden, made the comment that it would be very useful to delineate clearly the subsumption and orthogonality of nested subqueries when contrasted with multiple MATCH statements, which currently exist in Cypher. Another suggestion made was to consider named subqueries – i.e. along the lines of Oracle’s temporary tables – within the CIP, motivated by the benefit conferred by being able to refer to these subqueries by name in any subsequent queries.

Stefan followed this with a presentation on query combinators, a foundational construct allowing for a very powerful query composition paradigm. Query combinators come in several different ‘flavours’: set operations (such an UNION, INTERSECT, and so on); multiset operators (such as UNION ALL, INTERSECT ALL, and so on); and general operators such as THEN and OTHERWISE, which are used for pipelining queries (as well as nested queries) and to impose an order of execution on queries.

Professor Leonid Libkin, a researcher at the University of Edinburgh, raised the question regarding whether, in the event of the query preceding OTHERWISE failing to execute, any indication of this was propagated back to the user. Indeed, a user can mimic this behaviour manually by returning a flag, but it may be worth considering how to do it automatically.

Stefan then changed track with a discussion on MANDATORY MATCH (CIP), which can be considered as the inverse to OPTIONAL MATCH, whereby an error is raised if the pattern does not find at least one match. The idea broadly is to improve error reporting, and make querying more expressive and efficient by negating the need to have validation code in the application.

Mats Rydberg, an engineer at Neo4j and a member of the Cypher Language Group, ended the session with a talk on the proposed generic syntax for constraints. In addition to the familiar uniqueness constraint, Mats also discussed the generality of the syntax when applied to more sophisticated constraints underpinning label co-existence, cardinality conditions and property value conditions. The wording of NODE KEY was questioned by Leonid and Hannes, among others. Moreover, concerns were raised around the idea of standardizing a loose framework around constraints in general (instead of specific, feasible constraints) as it was deemed that using a generic syntax to express most constraints would never be implementable.

A general recommendation raised by the group was to design a canonical dataset, which could henceforth be incorporated into CIPs and CIRs. This would have the benefit of not needing to invent contrived examples in each and every CIP/CIR going forward (thus easing the writing thereof for the authors) and, moreover, would lessen the cognitive load required by readers of the CIP/CIR.

The next session gave a lot of food for thought, exploring in detail Conjunctive Regular Path Queries (CRPQs), and work done to date on the formal specification on Cypher, both topics pushing the boundaries on querying in general.

Tobias Lindaaker, an engineer at Neo4j and a member of the Cypher Language Group, began the session with a deep dive into the thinking around CRPQs – or, to use a more Cypher-oriented nomenclature – Path Query Patterns (PQPs) in Cypher (CIP). These allow for paths to be defined by regular expressions (including grouping or nesting) over the relationship types, thus allowing for very complex connections to be expressed. CRPQs have been studied extensively in academia over the past few decades. Bringing the notion of CRPQs into the world of property graphs, PQPs additionally allow for property predicates on intermediate nodes and relationships (i.e. data tests).

Leading on from Tobias’ talk, Nadime Francis, a researcher at the University of Edinburgh, presented the work and thinking to date around the formal specification of Cypher. This work returns to basics, defining all the constructs within Cypher and the property graph formally from the ground up, with the idea of building a solid foundation upon which all future work on Cypher semantics will be based. Nadime started off by detailing the formalization of operations, expressions and pattern matching. He then proceeded to discuss the various corner cases, ambiguities and inconsistencies, and incomplete and ill-defined cases, discovered in the course of this work so far, making for a very illuminating presentation!

For example, it may not be clear what the semantics are for the following query: MATCH (n:Person {name: null}) RETURN n. Does this i) return every node n labelled with :Person having a name property, or ii) return every node n labelled with :Person such that n.name IS NULL = true, or iii) return nothing? It turns out that the third case is the correct one, as the query is equivalent to MATCH (n:Person) WHERE n.name = null RETURN n, thus returning no results as nothing is ever equal to null. As Nadime reflected, this provides a good example as to why formal semantics are not only interesting, but necessary.

After lunch, it was time for matters to take a more practical turn, with the next session being comprised of presentations of various implementations using Cypher.

Jan Posiadała, from Scott Tiger, a software company using advanced data techniques, kicked off the session by presenting a novel implementation of Cypher in Prolog, showing how this was realized from the grammar, through the intermediate representation – represented as a Prolog term – to the end point, resulting in a tool which could be used for verifying and validating Cypher; in effect, an executable specification. Jan concluded with a number of ‘exercises’ exemplifying its use.

Mats then took the floor by showing how graph processing over a combined OLAP and OLTP workload could be achieved by using Cypher as a bridging construct on Apache Spark (OLAP) and Neo4j (OLTP) and seamlessly integrating between the two. Among many novel developments, one idea that resonated particularly was the extension of Cypher to be able to return graphs, yielding a truly compositional language. Mats ended this talk with a comprehensive demonstration of the implementation.

Gábor Szárnyas, a researcher at the Budapest University of Technology and Economics, closed the session with a discussion on the inGraph project, which allows for the incremental evaluation of Cypher queries. Following on from Gábor’s presentation at the 1st oCIM conference, he motivated this with a use case involving proximity detection in a live railway network model. Queries are continuously evaluated over the graph, taking changes into account. He showed how Cypher is transformed into relational graph algebra, which is composed of basic relational algebra with extra graph-oriented extensions, such as expanding from vertices. Gábor continued with a discussion on their Incremental Relational Engine – transforming nested relational algebra to flat algebra expressions – showing how this was used to optimize incremental graph querying. The talk ended with a roadmap of future challenges in this area.

Owing to the volume and breadth of content in the talks, we unfortunately ran out of time at this juncture, which meant that Mats was not able to present his talk on recent implementation changes made to Neo4j.

After a break, the next session focused on openCypher artifacts. Alastair opened the series of talks by quickly drawing attention to the legal side of collaboration – more information on this is forthcoming in summer – and it was ascertained that no one was against a retroactive agreement of collaboration.

This was followed by Dmitry Vrublevsky, a software engineer at Neueda who, following on from his presentation at the 1st oCIM conference, covered work on the Cypher editor they have been building which is included in the Neo4j Browser. Dmitry showed us the syntax highlighting features, context-sensitive error reporting (for example, when a query does not comply with the expected syntax) and auto completion. He concluded with a quick tour of small improvements made to the grammar, and challenges they encountered, comprising vendor extensions and intricacies surrounding the applications of the Cypher style guide.

Gábor then rounded off the presentation section of the session with a discussion on his and his colleagues’ take on the TCK (Technology Compliance Kit), thus exemplifying how an implementation incorporated an openCypher artifact from scratch. Gábor began by covering the TCK as it stands, and then delved into more details about their experiences in implementing the TCK, focusing on issues and inconsistencies they encountered; in particular, in the area of side effects and how the underlying behaviour of these was not clearly specified. Since the meeting, a PR (Pull Request) has been raised which explicitly defines side effects.

Mats then took the floor with an open discussion on how best to align the growing number of artifacts. Hannes stated that it is an unsolved problem, mentioning that a university in Germany is undertaking a research project on this very topic. The general consensus was that we as a group ought to try and think about how to align the artifacts. To illustrate how artifacts may be integrated effectively, Mats stated that the Antlr parser is run on all TCK queries.

A pronounced change of topic was brought about in the penultimate session through the focus on a very pertinent and exciting area: multiple graphs. This area was introduced at the 1st oCIM conference: hitherto, Cypher has operated on a single implicit graph, producing a stream of records (in effect, forming a table) as output. The idea going forwards is to allow Cypher to be closed over graphs; i.e. augmenting Cypher such that multiple graphs could be provided as input, one or more graphs could be produced as output, and defining a pipeline over these.

To motivate this topic, two real-world use cases were presented: social media monitoring and analytics, and financial regulatory compliance.

Frank Smit, the CIO of OBI4WAN, began the session by presenting their tool which is used for social media monitoring, publishing and social analytics, which references multiple graphs from different sources, from which it is possible to derive interesting results and actionable information. They further have the need to segment the graph according to privacy requirements. Frank ended with an overview of their system architecture, which uses Spark, and next steps, in which Cypher will be used as the query language.

Patrick Watts and Matthew Toy, from BNP Paribas, discussed the second use case for multiple graphs. Briefly, they are required by law to show that they are compliant with a number of regulations. In order to do this, they need to show, and subsequently query, a list of files pertaining to a set of risk portfolios. The resulting data is in essence a graph. They showed that using a graph database for these tasks was easy, extensible, and performed well.

After the talks on the two very interesting use cases motivating the need for multiple graphs, Alastair discussed the naming and addressing aspects of multiple graph processing. A number of overarching problems were covered. The first of these concerned loading and saving the graph, in which potential URI schemes and their requirements were discussed, and putative Cypher syntax was proposed. Next to be covered was the notion of how queries and subqueries need to be able to return and possibly persist results as graphs, both in the context of final, as well as intermediate, results. This led naturally into the notion of views in graphs.

Stefan ended the multiple graph session with a talk featuring various ideas and proposals for adding multiple graph support to Cypher, considering the physical and logical aspects alongside the language syntax and semantics. Ensuring that Cypher is extended in a systematic manner such that it changes seamlessly from a table transformation language into a graph transformation language, the main strand here was that of inverting existing operations in Cypher in order to define the corresponding ‘dual’ operation. For instance, the inverse of RETURN – which in effect discards the graph and returns a table (and in the future, graphs) – is the notion of loading a graph using LOAD GRAPH AT <some url>. Analogously, the inverse of MATCH – which reads a graph – can be thought of as returning a graph using RETURN GRAPH FROM <pattern matches>. Stefan then proceeded to propose syntax regarding the management of persistent named graphs, and ended with a call to arms regarding further discussions, as well as the production of CIRs and CIPs needed to bring to fruition this far-reaching, fundamental change to Cypher.

Leonid raised the point that both SQL and XQuery ought to be compared and contrasted with the proposed multiple graph functionality in Cypher, owing to the latter’s far closer alignment with the graph data model than the former (XQuery is a query language operating on semi-structured data).

The last session of the day centred firmly on emerging topics, both in the form of CIRs and ideas under development.

Recently, there has been a lot of activity in providing solutions to address the shortcomings of Cypher’s current aggregation semantics, whereby the lack of differentiation between aggregating and non-aggregating expressions can lead to confusion.

Jan and his colleague from Scott Tiger, Paweł Susicki, began by presenting their proposal on grouping semantics, followed by József Marton, a researcher at the Budapest University of Technology and Economics, discussing his take on a potential solution to this. József then followed with a presentation on his proposal on how to explicitly order elements in functions that construct lists (such as collect()). He also posed the question of whether ordering results matters between query parts in a pipeline. József ended with a proposal for the inception of OPTIONAL UNWIND, partnered with a motivating example.

Time was drawing to a close at this juncture, and Stefan took the floor for the final time, detailing topics that are appearing on the Cypher horizon, such as CRPQs, multiple graphs, subqueries (nested, scalar and list). He proceeded to describe how to compose SQL with Cypher in a bidirectional fashion, showing how this keeps intact the purity of each language, as opposed to, say, extending SQL with graph-processing capabilities. The penultimate topic was a quick run-through of ideas around uniqueness in Cypher, and how this may be extended to allow for multiple notions of uniqueness to be supported.

We should like to thank all our speakers and attendees for making it such an interesting and absorbing day. It was very gratifying to see how many different strands – introduced at the first meeting a mere three months ago – have developed and gained momentum since the last meeting.

Program

All slide presentations from the conference are now publicly available to download, see the below links. A zipped archive of all presentations may be downloaded here.

09:00	Coffee		30 mins
Session: Formalizing the openCypher Implementers Group (oCIG) (Chair: Mats Rydberg)
09:30	Introduction & Status Report (slides)	Stefan Plantikow (Neo4j)	15 mins
09:45	Updated CIP/oCIG Process (slides)	Alastair Green (Neo4j)	15 mins
10.00	Session: CIPs (Chair: Alastair Green)		60 mins
	CIPs proposed for approval by oCIG: Nested Subqueries (slides) Query Combinators (slides) MANDATORY MATCH (slides) Constraints (slides)
11:00	Break		15 mins
Session: CIPs & Specification (Chair: Stefan Plantikow)
11:15	Regular Path Queries (RPQs) (slides)	Tobias Lindaaker (Neo4j)	30 mins
11:45	Formal specification of Cypher (slides)	Nadime Francis (University of Edinburgh)	15 mins
12:00	Lunch		60 mins
Session: openCypher Implementations (Chair: Petra Selmer)
13:00	Implementation in Prolog (slides)	Jan Posiadała (Scott Tiger)	15 mins
13:15	Cypher for Apache Spark (slides)	Mats Rydberg (Neo4j)	15 mins
13:30	The ingraph project and incremental evaluation of Cypher queries (slides)	Gábor Szárnyas, József Marton (Budapest University of Technology and Economics)	15 mins
13:45	Neo4j implementation changes (slides)	Mats Rydberg (Neo4j)	15 mins
14:00	Break		15 mins
Session: openCypher Artifacts (Chair: Petra Selmer)
14:15	Mode of collaboration (slides)	Alastair Green (Neo4j)	5 mins
14:20	Web-based Cypher editor in relationship to openCypher (slides)	Dmitry Vrublevsky (Neueda)	15 mins
14:35	openCypher TCK (slides)	Gábor Szárnyas (Budapest University of Technology and Economics)	10 mins
14:45	openCypher grammar and parsers		15 mins
	openCypher perspective on parsing and grammar	Mats Rydberg (Neo4j)
	openCypher Artifacts & Best Practices	Open discussion
15:00	Break		15 mins
15:15	Session: Multiple graphs (Chair: Mats Rydberg)		75 mins
	User requirements (Frank Smit's slides, Matt Toy's slides)
	Identification and addressing (slides)	Alastair Green (Neo4j)
	Syntax and semantics (slides)	Stefan Plantikow (Neo4j)
16:30	Break		15 mins
Session: Cypher Improvement Requests (CIRs) and developing ideas (Chair: Stefan Plantikow)
16:45	Discussion on grouping semantics		25 mins
	Presentation by Scott Tiger (slides)
	Presentation by József Marton (slides)
	Open discussion
17:10	Sorting lists (slides)	József Marton (Budapest University of Technology and Economics)	10 mins
17:20	Future oCIM Topics (slides)	Stefan Plantikow (Neo4j)	15 mins
	Cypher relationship to SQL
	Expression subqueries
	Pattern matching iso-/homomorphism
	Participant proposals
17:35	Closing & Next Meeting		10 mins
17:45	End
18:30	Dinner

Logistics

The second oCIM will be held at etc.venues Marble Arch at Garfield House, 86 Edgware Road, W2 2EA in London, United Kingdom, on Wednesday the 10th of May 2017.