Graph Databases

February 20, 2018.

Graph Database Applications and Concepts with Neo4J

Notes on the paper written by Justin Miller.

RDBMSs are optimised for aggregate data, whereas graph databases like Neo4J are optimised for highly connected data.

A graph is a data structure composed of edges and vertices. Graph DBs are useful in modeling relationships between entities. A common graph type supported by most systems is the property graph. A property graph can effectively model all other graph types.
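As a rough illustration of the property graph idea, the sketch below models vertices and edges that each carry their own key-value properties. The class names, field names and sample data are hypothetical and are not taken from the paper or from Neo4J's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: int
    labels: List[str]                      # e.g. ["Person"]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: int                            # id of the vertex the edge starts from
    target: int                            # id of the vertex the edge points to
    type: str                              # relationship type, e.g. "KNOWS"
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: two people and a directed, typed relationship.
alice = Vertex(1, ["Person"], {"name": "Alice"})
bob = Vertex(2, ["Person"], {"name": "Bob"})
knows = Edge(alice.id, bob.id, "KNOWS", {"since": 2015})
```

Dropping the properties gives a plain directed graph, and ignoring edge direction gives an undirected one, which is one way to read the claim that a property graph can stand in for the simpler graph types.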

The graph database is optimised for efficient processing of dense, interrelated datasets. This design allows for construction of predictive models, and detection of correlations and patterns.

Graph database traversals are localized and thus do not incur expensive join operations like those common in traditional relational databases.

Like relational database systems, graph databases come in both OLAP and OLTP flavours. Google Pregel, which is based on Valiant’s Bulk Synchronous Parallel (BSP) model, and Apache Hama are examples of OLAP systems; they focus on high-performance processing for problems such as PageRank, shortest path and bipartite matching. Neo4J is an OLTP system that is suitable as a data backend for transaction-based applications.
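To make the bulk synchronous style more concrete, here is a minimal single-process sketch of PageRank written as a sequence of supersteps. It does not use Pregel's or Hama's actual APIs. The function name, the damping factor of 0.85 and the tiny sample graph are assumptions chosen for illustration, and the sketch assumes every vertex has at least one outgoing edge.

```python
from typing import Dict, List


def pagerank_supersteps(out_edges: Dict[str, List[str]],
                        damping: float = 0.85,
                        supersteps: int = 30) -> Dict[str, float]:
    """Vertex-centric PageRank; every vertex must be a key with >= 1 out-edge."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # Compute and communicate: each vertex sends rank / out-degree
        # as a message to every vertex it points to.
        inbox = {v: 0.0 for v in out_edges}
        for v, targets in out_edges.items():
            share = rank[v] / len(targets)
            for t in targets:
                inbox[t] += share
        # Barrier: all messages are delivered before any vertex updates.
        rank = {v: (1.0 - damping) / n + damping * inbox[v] for v in out_edges}
    return rank


# Hypothetical three-vertex graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_supersteps(graph)
```

In a distributed setting the inner loop would run in parallel across workers, with the barrier separating one superstep from the next.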

Some applications of graph databases are social graphs, recommender systems and bioinformatics.

Note: item-to-item correlation is a content-based recommendation approach, whereas user-to-user correlation is collaborative filtering.
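The distinction in the note can be sketched with a single similarity function applied to two different kinds of vectors. Cosine similarity is used here as an assumed measure (the notes do not name one), and the item features and user ratings are made-up sample data.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Content-based: compare items by their own attributes
# (hypothetical genre flags: [action, comedy, sci-fi]).
item_features = {
    "film_a": [1, 0, 1],
    "film_b": [1, 0, 0],
    "film_c": [0, 1, 0],
}
item_sim = cosine(item_features["film_a"], item_features["film_b"])

# Collaborative filtering: compare users by the ratings they gave
# to the same three items (0 means unrated).
user_ratings = {
    "ann": [5, 4, 0],
    "ben": [4, 5, 1],
    "cat": [0, 1, 5],
}
user_sim = cosine(user_ratings["ann"], user_ratings["ben"])
```

In the first case a recommendation follows from what an item is; in the second it follows from which users behave alike.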

Querying – Effective retrieval of information out of a graph requires traversal. There is no global adjacency index. Instead, each vertex and edge in the graph stores a “mini-index” of the objects connected to it. Global indexes do exist in Neo4J, but are used only to find the starting point of a traversal. With such an index, the cost of that lookup can be as low as \(O(\log_{2} n)\).
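The two kinds of lookup described above can be sketched as follows: a sorted global index is searched once to find the starting vertex, and the traversal then follows only the per-vertex adjacency lists. This is a simplified in-memory model, not Neo4J's storage format. The names `adjacency`, `name_index`, `find_start` and `reachable` are hypothetical, as is the sample data.

```python
import bisect
from collections import deque
from typing import Dict, List, Tuple

# Each vertex keeps its own "mini-index": the ids of the vertices it points to.
adjacency: Dict[int, List[int]] = {
    1: [2, 3],
    2: [4],
    3: [4],
    4: [],
}

# Global index: sorted (name, vertex id) pairs, used only to find where to start.
name_index: List[Tuple[str, int]] = sorted(
    [("alice", 1), ("bob", 2), ("carol", 3), ("dave", 4)]
)


def find_start(name: str) -> int:
    """Binary search over the sorted index: O(log n) comparisons."""
    i = bisect.bisect_left(name_index, (name,))
    if i < len(name_index) and name_index[i][0] == name:
        return name_index[i][1]
    raise KeyError(name)


def reachable(start: int, max_depth: int) -> List[int]:
    """Breadth-first traversal that only touches vertices near the start."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        order.append(vertex)
        if depth == max_depth:
            continue
        for neighbour in adjacency[vertex]:   # local lookup, no global scan
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return order


friends_of_alice = reachable(find_start("alice"), max_depth=1)  # [1, 2, 3]
```

The cost of `reachable` depends on the size of the neighbourhood it explores rather than on the total number of vertices, which is the point of keeping adjacency local.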

Unlike RDBMSs, which have SQL, graph databases have no standard query language. Gremlin is a DSL written in Groovy. Cypher is a SQL-like, keyword-based language.

Neo4J is ACID compliant. Neo4J high availability (a commercial-only feature) uses Apache ZooKeeper to coordinate the nodes and their replication.