Neptune Graph Data Model - Amazon Neptune

Neptune Graph Data Model

The basic unit of Amazon Neptune graph data is a four-position (quad) element, which is similar to a Resource Description Framework (RDF) quad. The following are the four positions of a Neptune quad:

  • subject    (S)

  • predicate  (P)

  • object     (O)

  • graph      (G)

Each quad is a statement that makes an assertion about one or more resources. A statement can assert the existence of a relationship between two resources, or it can attach a property (key-value pair) to a resource. You can think of the quad predicate value generally as the verb of the statement. It describes the type of relationship or property that's being defined. The object is the target of the relationship, or the value of the property. The following are examples:

  • A relationship between two vertices can be represented by storing the source vertex identifier in the S position, the target vertex identifier in the O position, and the edge label in the P position.

  • A property can be represented by storing the element identifier in the S position, the property key in the P position, and the property value in the O position.

The graph position G is used differently in the different stacks. For RDF data in Neptune, the G position contains a named graph identifier. For property graphs in Gremlin, it is used to store the edge ID value in the case of an edge. In all other cases, it defaults to a fixed value.

User-facing values in a quad statement are usually stored separately in a dictionary index, where the statement indexes reference them using an 8-byte long term identifier. The exception to this is numeric values, including date and datetime values (represented as milliseconds from the epoch). These can be stored inline directly in the statement indexes.

A set of quad statements with shared resource identifiers creates a graph.

How Statements Are Indexed in Neptune

When you query a graph of quads, for each quad position, you can either specify a value constraint, or not. The query returns all the quads that match the value constraints that you specified.

Neptune uses indexes to resolve queries. In the 2005 paper, Optimized Index Structures for Querying RDF from the Web, Andreas Harth and Stefan Decker observed that there are 16 (24) possible access patterns for the four quad positions. You can query all 16 patterns efficiently without having to scan and filter by using six quad statement indexes. Each quad statement index uses a key that is composed of the four position values concatenated in a different order.

Access Pattern Index key order ---------------------------------------------------- --------------- 1. ???? (No constraints; returns every quad) SPOG 2. SPOG (Every position is constrained) SPOG 3. SPO? (S, P, and O are constrained; G is not) SPOG 4. SP?? (S and P are constrained; O and G are not) SPOG 5. S??? (S is constrained; P, O, and G are not) SPOG 6. ?POG (P, O, and G are constrained; S is not) POGS 7. ?PO? (P and O are constrained; S and G are not) POGS 8. ?P?? (P is constrained; S, O, and G are not) POGS 9. ?P?G (P and G are constrained; S and O are not) GPSO 10. SP?G (S, P, and G are constrained; O is not) GSPO 11. S??G (S and G are constrained; P and O are not) GSPO 12. ???G (G is constrained; S, P, and O are not) GSPO 13. S?OG (S, O, and G are constrained; P is not) OGSP 14. ??OG (O and G are constrained; S and P are not) OGSP 15. ??O? (O is constrained; S, P, and G are not) OGSP 16. S?O? (S and O are constrained; P and G are not) OSGP

Neptune creates and maintains only three out of those six indexes by default:

  • SPOG –   Uses a key composed of Subject + Predicate + Object + Graph.

  • POGS –   Uses a key composed of Predicate + Object + Graph + Subject.

  • GPSO –   Uses a key composed of Graph + Predicate + Subject + Object.

These three indexes handle many of the most common access patterns. Maintaining only three full statement indexes instead of six greatly reduces the resources that you need to support rapid access without scanning and filtering. For example, the SPOG index allows efficient lookup whenever a prefix of the positions, such as the vertex or vertex and property identifier, is bound. The POGS index allows efficient access when only the edge or property label stored in P position is bound.

The low-level API for finding statements takes a statement pattern in which some positions are known and the rest are left for discovery by index search. By composing the known positions into a key prefix according to the index key order for one of the statement indexes, Neptune performs a range scan to retrieve all the statements matching the known positions.

However, one of the statement indexes that Neptune does not create by default is a reverse traversal OSGP index, which can gather predicates across objects and subjects. Instead, Neptune by default tracks distinct predicates in a separate index that it uses to do a union scan of {all P x POGS}. When you are working with Gremlin, a predicate corresponds to a property or an edge label.

If the number of distinct predicates in a graph becomes large, the default Neptune access strategy can become inefficient. In Gremlin, for example, an in() step where no edge labels are given, or any step that uses in() internally such as both() or drop(), may become quite inefficient.

Enabling OSGP Index Creation Using Lab Mode

If your data model creates a large number of distinct predicates, you may experience reduced performance and higher operational costs that can be dramatically improved by using Lab Mode to OSGP Index, in addition to the three indexes that Neptune maintains by default.

This can have a few down-sides:

  • The insert rate may slow by up to 23%.

  • Storage increases by up to 20%.

  • Read queries that touch all indexes equally (which is quite rare) may have increased latencies.

In general, however, it is worth enabling the OSGP index for DB Clusters with a large number of distinct predicates. Object-based searchs become highly efficient (for example, finding all incoming edges to a vertex, or all subjects connected to a given object), and as a result dropping vertices becomes much more efficient too.

Important

A newly enabled OSGP index will not be backfilled with existing data in your DB cluster. Only new entries are indexed. To index existing data in a DB cluster, you must reload it all.