

# Load Data Formats
<a name="bulk-load-tutorial-format"></a>

The Amazon Neptune `Load` API supports loading data in a variety of formats.

**Property-graph load formats**

Data loaded in one of the following property-graph formats can then be queried using both Gremlin and openCypher:
+ [Gremlin load data format](bulk-load-tutorial-format-gremlin.md) (`csv`): a comma-separated values (CSV) format.
+ [openCypher data load format](bulk-load-tutorial-format-opencypher.md) (`opencypher`): a comma-separated values (CSV) format.

**RDF load formats**

To load Resource Description Framework (RDF) data that you query using SPARQL, you can use one of the following standard formats as specified by the World Wide Web Consortium (W3C):
+ N-Triples (`ntriples`) from the specification at [https://www.w3.org/TR/n-triples/](https://www.w3.org/TR/n-triples/).
+ N-Quads (`nquads`) from the specification at [https://www.w3.org/TR/n-quads/](https://www.w3.org/TR/n-quads/).
+ RDF/XML (`rdfxml`) from the specification at [https://www.w3.org/TR/rdf-syntax-grammar/](https://www.w3.org/TR/rdf-syntax-grammar/).
+ Turtle (`turtle`) from the specification at [https://www.w3.org/TR/turtle/](https://www.w3.org/TR/turtle/).

**Load data must use UTF-8 encoding**

**Important**  
All load-data files must be encoded in UTF-8 format. If a file is not encoded in UTF-8, Neptune tries to load it as UTF-8 anyway.

For N-Quads and N-Triples data that includes Unicode characters, `\uxxxxx` escape sequences are supported. However, Neptune does not support normalization. If a value requires normalization, it will not match byte-for-byte during querying. For more information about normalization, see the [Normalization](https://unicode.org/faq/normalization.html) page on [Unicode.org](https://unicode.org).

If your data is not in a supported format, you must convert it before you load it.

A tool for converting GraphML to the Neptune CSV format is available in the [GraphML2CSV project](https://github.com/awslabs/amazon-neptune-tools/blob/master/graphml2csv/README.md) on [GitHub](https://github.com/).

## Compression support for load-data files
<a name="bulk-load-tutorial-format-compression"></a>

Neptune supports compression of individual files in `gzip` or `bzip2` format.

The compressed file must have a `.gz` or `.bz2` extension, and must be a single text file encoded in UTF-8 format. You can load multiple files, but each one must be a separate `.gz`, `.bz2`, or uncompressed text file. Archive files with extensions such as `.tar`, `.tar.gz`, and `.tgz` are not supported.
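For example, an uncompressed UTF-8 load file can be gzipped with Python's standard library before upload (the file names and content here are hypothetical):

```python
import gzip
import shutil

# Write a small UTF-8 vertex file (hypothetical content).
with open("vertices.csv", "w", encoding="utf-8") as f:
    f.write("~id,name:String,~label\nv1,marko,person\n")

# Compress it; the compressed file must have a .gz extension and hold a
# single text file (archives such as .tar.gz are not supported).
with open("vertices.csv", "rb") as src, gzip.open("vertices.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```

The same approach works with `bz2.open` for `.bz2` files.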

The following sections describe the formats in more detail.

**Topics**
+ [Compression support for load-data files](#bulk-load-tutorial-format-compression)
+ [Gremlin load data format](bulk-load-tutorial-format-gremlin.md)
+ [Load format for openCypher data](bulk-load-tutorial-format-opencypher.md)
+ [RDF load data formats](bulk-load-tutorial-format-rdf.md)

# Gremlin load data format
<a name="bulk-load-tutorial-format-gremlin"></a>

To load Apache TinkerPop Gremlin data using the CSV format, you must specify the vertices and the edges in separate files.

The loader can load from multiple vertex files and multiple edge files in a single load job.

For each load command, the set of files to be loaded must be in the same folder in the Amazon S3 bucket, and you specify the folder name for the `source` parameter. The file names and file name extensions are not important.

The Amazon Neptune CSV format follows the RFC 4180 CSV specification. For more information, see [Common Format and MIME Type for CSV Files](https://tools.ietf.org/html/rfc4180) on the Internet Engineering Task Force (IETF) website.

**Note**  
All files must be encoded in UTF-8 format.

Each file has a comma-separated header row. The header row consists of both system column headers and property column headers.

## System Column Headers
<a name="bulk-load-tutorial-format-gremlin-systemheaders"></a>

The required and allowed system column headers are different for vertex files and edge files.

Each system column can appear only once in a header.

All labels are case sensitive.

**Vertex headers**
+ `~id` - **Required**

  An ID for the vertex.
+ `~label`

  A label for the vertex. Multiple label values are allowed, separated by semicolons (`;`).

  If `~label` is not present, TinkerPop supplies a label with the value `vertex`, because every vertex must have at least one label.

**Edge headers**
+ `~id` - **Required**

  An ID for the edge.
+ `~from` - **Required**

  The vertex ID of the *from* vertex.
+ `~to` - **Required**

  The vertex ID of the *to* vertex.
+ `~label`

  A label for the edge. Edges can only have a single label.

  If `~label` is not present, TinkerPop supplies a label with the value `edge`, because every edge must have a label.

## Property Column Headers
<a name="bulk-load-tutorial-format-gremlin-propheaders"></a>

You can specify a column for a property by using the following syntax. The type names are not case sensitive. Note, however, that if a colon (`:`) appears within a property name, it must be escaped by preceding it with a backslash: `\:`.

```
propertyname:type
```

**Note**  
Space, comma, carriage return and newline characters are not allowed in the column headers, so property names cannot include these characters.

You can specify a column for an array type by adding `[]` to the type:

```
propertyname:type[]
```

**Note**  
Edge properties can have only a single value. Specifying an array type, or supplying a second value, causes an error.

The following example shows the column header for a property named `age` of type `Int`.

```
age:Int
```

Every row in the file must then have either an integer in that position or an empty value.

Arrays of strings are allowed, but strings in an array cannot include the semicolon (`;`) character unless it is escaped using a backslash (like this: `\;`).
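As an illustration, this escaping rule can be applied in code before a row is written (the helper name and values are hypothetical):

```python
def escape_array_field(values):
    # A backslash escapes any literal semicolon; values are then joined on ';'.
    return ";".join(v.replace(";", "\\;") for v in values)

field = escape_array_field(["sailing", "graphs;diagrams"])
```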

**Specifying the Cardinality of a Column**

The column header can be used to specify *cardinality* for the property identified by the column. This allows the bulk loader to honor cardinality similarly to the way Gremlin queries do.

You specify the cardinality of a column like this:

```
propertyname:type(cardinality)
```

The *cardinality* value can be either `single` or `set`. The default is `set`, meaning that the column can accept multiple values. In edge files, cardinality is always `single`, and specifying any other cardinality causes the loader to throw an exception.

If the cardinality is `single`, the loader throws an error if a value is loaded when a previous value is already present, or if multiple values are loaded. You can override this behavior, so that a new value replaces an existing one, by using the `updateSingleCardinalityProperties` flag. See [Loader Command](load-api-reference-load.md).
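For context, a load request that sets this flag might look like the following sketch of the loader's JSON request body (the bucket, IAM role ARN, and region are placeholders; see the Loader Command reference for the authoritative parameter list):

```python
import json

# Sketch of a Neptune bulk-load request body that allows new values to
# replace existing single-cardinality property values.
payload = {
    "source": "s3://example-bucket/load-folder/",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
    "updateSingleCardinalityProperties": "TRUE",
}
body = json.dumps(payload)
```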

It is possible to use a cardinality setting with an array type, although this is not generally necessary. Here are the possible combinations:
+ `name:type`   –   the cardinality is `set`, and the content is single-valued.
+ `name:type[]`   –   the cardinality is `set`, and the content is multi-valued.
+ `name:type(single)`   –   the cardinality is `single`, and the content is single-valued.
+ `name:type(set)`   –   the cardinality is `set`, which is the same as the default, and the content is single-valued.
+ `name:type(set)[]`   –   the cardinality is `set`, and the content is multi-valued.
+ `name:type(single)[]`   –   this is contradictory and causes an error to be thrown.

The following section lists all the available Gremlin data types.

## Gremlin Data Types
<a name="bulk-load-tutorial-format-gremlin-datatypes"></a>

This is a list of the allowed property types, with a description of each type.

**Bool (or Boolean)**  
Indicates a Boolean field. Allowed values: `false`, `true`

**Note**  
Any value other than `true` is treated as `false`.

**Whole Number Types**  
Values outside of the defined ranges result in an error.


| Type | Range |
| --- | --- |
| Byte | -128 to 127 |
| Short | -32768 to 32767 |
| Int | -2^31 to 2^31-1 |
| Long | -2^63 to 2^63-1 |

**Decimal Number Types**  
Supports both decimal notation and scientific notation. The symbols `Infinity`, `-Infinity`, and `NaN` are also allowed; `INF` is not supported.


| Type | Range |
| --- | --- |
| Float | 32-bit IEEE 754 floating point |
| Double | 64-bit IEEE 754 floating point |

Float and double values with more precision than the type can hold are loaded and rounded to the nearest representable value (24 bits of precision for `Float`, 53 bits for `Double`). A midway value is rounded to 0 for the last remaining digit at the bit level.

**String**  
Quotation marks are optional. Commas, newline, and carriage return characters are automatically escaped if they are included in a string surrounded by double quotation marks (`"`). *Example:* `"Hello, World"`

To include quotation marks in a quoted string, you can escape the quotation mark by using two in a row: *Example:* `"Hello ""World"""`

Arrays of strings are allowed, but strings in an array cannot include the semicolon (`;`) character unless it is escaped using a backslash (like this: `\;`).

If you want to surround strings in an array with quotation marks, you must surround the whole array with one set of quotation marks. *Example:* `"String one; String 2; String 3"`

**Date**  
Java date in ISO-8601 format. Supports the following formats: `yyyy-MM-dd`, `yyyy-MM-ddTHH:mm`, `yyyy-MM-ddTHH:mm:ss`, `yyyy-MM-ddTHH:mm:ssZ`. The values are converted to epoch time and stored.

**Datetime**  
Java date in ISO-8601 format. Supports the following formats: `yyyy-MM-dd`, `yyyy-MM-ddTHH:mm`, `yyyy-MM-ddTHH:mm:ss`, `yyyy-MM-ddTHH:mm:ssZ`. The values are converted to epoch time and stored.
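As a sketch, Python's standard `datetime` formatting can produce each of the accepted forms (the sample timestamp is arbitrary):

```python
from datetime import datetime, timezone

ts = datetime(2024, 5, 1, 13, 30, 15, tzinfo=timezone.utc)

date_only = ts.strftime("%Y-%m-%d")            # yyyy-MM-dd
minutes = ts.strftime("%Y-%m-%dT%H:%M")        # yyyy-MM-ddTHH:mm
seconds = ts.strftime("%Y-%m-%dT%H:%M:%S")     # yyyy-MM-ddTHH:mm:ss
zoned = ts.strftime("%Y-%m-%dT%H:%M:%SZ")      # yyyy-MM-ddTHH:mm:ssZ (UTC only)
```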

## Gremlin Row Format
<a name="bulk-load-tutorial-format-gremlin-rowformat"></a>

**Delimiters**  
Fields in a row are separated by a comma. Records are separated by a newline or a carriage return followed by a newline.

**Blank Fields**  
Blank fields are allowed for non-required columns (such as user-defined properties). A blank field still requires a comma separator. Blank fields in required columns result in a parsing error. An empty string value (`""`) is interpreted as an empty string for the field, not as a blank field. The example in the next section has a blank field in each example vertex.

**Vertex IDs**  
`~id` values must be unique for all vertices in every vertex file. Multiple vertex rows with identical `~id` values are applied to a single vertex in the graph. Empty string (`""`) is a valid id, and the vertex is created with an empty string as the id.

**Edge IDs**  
Likewise, `~id` values must be unique for all edges in every edge file. Multiple edge rows with identical `~id` values are applied to a single edge in the graph. Empty string (`""`) is a valid id, and the edge is created with an empty string as the id.

**Labels**  
Labels are case sensitive and cannot be empty. A value of `""` will result in an error.

**String Values**  
Quotation marks are optional. Comma, newline, and carriage return characters are automatically escaped if they are included in a string surrounded by double quotation marks (`"`). Empty string values (`""`) are interpreted as an empty string value for the field, not as a blank field.

## CSV Format Specification
<a name="bulk-load-tutorial-format-csv-info"></a>

The Neptune CSV format follows the RFC 4180 CSV specification, including the following requirements.
+ Both Unix and Windows style line endings are supported (`\n` or `\r\n`).
+ Any field can be quoted (using double quotation marks).
+ Fields containing a line break, double quotation mark, or comma must be quoted. (If they are not, the load aborts immediately.)
+ A double quotation mark character (`"`) in a field must be represented by two double quotation mark characters. For example, the string `Hello "World"` must appear as `"Hello ""World"""` in the data.
+ Spaces surrounding the delimiters are ignored. If a row contains `value1, value2`, the values are stored as `"value1"` and `"value2"`.
+ Any other escape characters are stored verbatim. For example, `"data1\tdata2"` is stored as `"data1\tdata2"`. No further escaping is needed as long as these characters are enclosed within quotation marks.
+ Blank fields are allowed. A blank field is considered an empty value.
+ Multiple values for a field are specified with a semicolon (`;`) between values.

For more information, see [Common Format and MIME Type for CSV Files](https://tools.ietf.org/html/rfc4180) on the Internet Engineering Task Force (IETF) website.
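As a sketch, Python's standard `csv` module follows these same quoting rules, which makes it a convenient way to generate conforming load files (the header and values are hypothetical):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["~id", "name:String", "~label"])
# The embedded comma and quotation marks force quoting, and each inner
# quotation mark is doubled, as RFC 4180 requires.
writer.writerow(["v1", 'Hello, "World"', "person"])
data_line = buf.getvalue().splitlines()[1]
```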

## Gremlin Example
<a name="bulk-load-tutorial-format-gremlin-example"></a>

The following diagram shows an example of two vertices and an edge taken from the TinkerPop Modern Graph.

![\[Diagram depicting two vertices and an edge, contains marko age 29 and lop software with lang: java.\]](http://docs.aws.amazon.com/neptune/latest/userguide/images/tiny-modern-graph.png)


The following is the graph in Neptune CSV load format.

Vertex file:

```
~id,name:String,age:Int,lang:String,interests:String[],~label
v1,"marko",29,,"sailing;graphs",person
v2,"lop",,"java",,software
```

Tabular view of the vertex file:

| ~id | name:String | age:Int | lang:String | interests:String[] | ~label |
| --- | --- | --- | --- | --- | --- |
| v1 | "marko" | 29 |  | ["sailing", "graphs"] | person |
| v2 | "lop" |  | "java" |  | software |

Edge file:

```
~id,~from,~to,~label,weight:Double
e1,v1,v2,created,0.4
```

Tabular view of the edge file:

| ~id | ~from | ~to | ~label | weight:Double |
| --- | --- | --- | --- | --- |
| e1 | v1 | v2 | created | 0.4 |

**Next Steps**  
Now that you know more about the loading formats, see [Example: Loading Data into a Neptune DB Instance](bulk-load-data.md).

# Load format for openCypher data
<a name="bulk-load-tutorial-format-opencypher"></a>

To load openCypher data using the openCypher CSV format, you must specify nodes and relationships in separate files. The loader can load from multiple node files and multiple relationship files in a single load job.

For each load command, the set of files to be loaded must have the same path prefix in an Amazon Simple Storage Service (Amazon S3) bucket. You specify that prefix in the `source` parameter. The actual file names and extensions are not important.

In Amazon Neptune, the openCypher CSV format conforms to the RFC 4180 CSV specification. For more information, see [Common Format and MIME Type for CSV Files](https://tools.ietf.org/html/rfc4180) on the Internet Engineering Task Force (IETF) website.

**Note**  
These files MUST be encoded in UTF-8 format.

Each file has a comma-separated header row that contains both system column headers and property column headers.

## System column headers in openCypher data loading files
<a name="bulk-load-tutorial-format-opencypher-system-headers"></a>

A given system column can only appear once in each file. All system column header labels are case-sensitive.

The system column headers that are required and allowed are different for openCypher node load files and relationship load files:

### System column headers in node files
<a name="bulk-load-tutorial-format-opencypher-system-headers-nodes"></a>
+ **`:ID`**   –   (Required) An ID for the node.

  An optional ID space can be added to the node `:ID` column header like this: `:ID(ID Space)`. An example is `:ID(movies)`.

  When loading relationships that connect the nodes in this file, use the same ID spaces in the relationship files' `:START_ID` and/or `:END_ID` columns.

  The node `:ID` column can optionally be stored as a property in the form `propertyname:ID`. An example is `name:ID`.

  Node IDs should be unique across all node files in the current and previous loads. If an ID space is used, node IDs should be unique across all node files that use the same ID space in the current and previous loads.
+ **`:LABEL`**   –   A label for the node.

  When using multiple label values for a single node, separate the labels with semicolons (`;`).

### System column headers in relationship files
<a name="bulk-load-tutorial-format-opencypher-system-headers-relationships"></a>
+ **`:ID`**   –   An ID for the relationship. This is required when `userProvidedEdgeIds` is `true` (the default), but invalid when `userProvidedEdgeIds` is `false`.

  Relationship IDs should be unique across all relationship files in current and previous loads.
+ **`:START_ID`**   –   (*Required*) The node ID of the node this relationship starts from.

  Optionally, an ID space can be associated with the start ID column in the form `:START_ID(ID Space)`. The ID space assigned to the start node ID should match the ID space assigned to the node in its node file.
+ **`:END_ID`**   –   (*Required*) The node ID of the node this relationship ends at.

  Optionally, an ID space can be associated with the end ID column in the form `:END_ID(ID Space)`. The ID space assigned to the end node ID should match the ID space assigned to the node in its node file.
+ **`:TYPE`**   –   A type for the relationship. Relationships can only have a single type.

**Note**  
See [Loading openCypher data](load-api-reference-load.md#load-api-reference-load-parameters-opencypher) for information about how duplicate node or relationship IDs are handled by the bulk load process.

## Property column headers in openCypher data loading files
<a name="bulk-load-tutorial-format-opencypher-property-headers"></a>

You can specify that a column holds the values for a particular property using a property column header in the following form:

```
propertyname:type
```

Space, comma, carriage return and newline characters are not allowed in the column headers, so property names cannot include these characters. Here is an example of a column header for a property named `age` of type `Int`:

```
age:Int
```

The column with `age:Int` as a column header would then have to contain either an integer or an empty value in every row.

## Data types in Neptune openCypher data loading files
<a name="bulk-load-tutorial-format-opencypher-data-types"></a>
+ **`Bool`** or **`Boolean`**   –   A Boolean field. Allowed values are `true` and `false`.

  Any value other than `true` is treated as `false`.
+ **`Byte`**   –   A whole number in the range `-128` through `127`.
+ **`Short`**   –   A whole number in the range `-32,768` through `32,767`.
+ **`Int`**   –   A whole number in the range `-2^31` through `2^31 - 1`.
+ **`Long`**   –   A whole number in the range `-2^63` through `2^63 - 1`.
+ **`Float`**   –   A 32-bit IEEE 754 floating point number. Decimal notation and scientific notation are both supported. `Infinity`, `-Infinity`, and `NaN` are all recognized, but `INF` is not.

  Values with too many digits to fit are rounded to the nearest value (a midway value is rounded to 0 for the last remaining digit at the bit level).
+ **`Double`**   –   A 64-bit IEEE 754 floating point number. Decimal notation and scientific notation are both supported. `Infinity`, `-Infinity`, and `NaN` are all recognized, but `INF` is not.

  Values with too many digits to fit are rounded to the nearest value (a midway value is rounded to 0 for the last remaining digit at the bit level).
+ **`String`**   –   Quotation marks are optional. Comma, newline, and carriage return characters are automatically escaped if they are included in a string that is surrounded by double quotation marks (`"`) like `"Hello, World"`.

  You can include quotation marks in a quoted string by using two in a row, like `"Hello ""World"""`.
+ **`DateTime`**   –   A Java date in one of the following ISO-8601 formats:
  + `yyyy-MM-dd`
  + `yyyy-MM-ddTHH:mm`
  + `yyyy-MM-ddTHH:mm:ss`
  + `yyyy-MM-ddTHH:mm:ssZ`

### Auto-cast data types in Neptune openCypher data loading files
<a name="bulk-load-tutorial-format-opencypher-data-auto-cast"></a>

Auto-cast data types are provided for loading data types that Neptune does not currently support natively. Data in such columns is stored as strings, verbatim, with no verification against the intended formats. The following auto-cast data types are allowed:
+ **`Char`**   –   A `Char` field. Stored as a string.
+ **`Date`**, **`LocalDate`**, and **`LocalDateTime`**   –   See [Neo4j Temporal Instants](https://neo4j.com/docs/cypher-manual/current/values-and-types/temporal/#cypher-temporal-instants) for a description of the `date`, `localdate`, and `localdatetime` types. The values are loaded verbatim as strings, without validation.
+ **`Duration`**   –   See the [Neo4j Duration format](https://neo4j.com/docs/cypher-manual/current/values-and-types/temporal/#cypher-temporal-durations). The values are loaded verbatim as strings, without validation.
+ **`Point`**   –   A point field, for storing spatial data. See [Spatial instants](https://neo4j.com/docs/cypher-manual/current/values-and-types/spatial/#spatial-values-spatial-instants). The values are loaded verbatim as strings, without validation.

## Example of the openCypher load format
<a name="bulk-load-tutorial-format-opencypher-example"></a>

The following diagram taken from the TinkerPop Modern Graph shows an example of two nodes and a relationship:

![\[Diagram of two nodes and a relationship between them.\]](http://docs.aws.amazon.com/neptune/latest/userguide/images/tinkerpop-2-nodes-and-relationship.png)


The following is the graph in the normal Neptune openCypher load format.

**Node file:**

```
:ID,name:String,age:Int,lang:String,:LABEL
v1,"marko",29,,person
v2,"lop",,"java",software
```

**Relationship file:**

```
:ID,:START_ID,:END_ID,:TYPE,weight:Double
e1,v1,v2,created,0.4
```

Alternatively, you could use ID spaces and store the ID as a property, as follows:

**First node file:**

```
name:ID(person),age:Int,lang:String,:LABEL
"marko",29,,person
```

**Second node file:**

```
name:ID(software),age:Int,lang:String,:LABEL
"lop",,"java",software
```

**Relationship file:**

```
:ID,:START_ID(person),:END_ID(software),:TYPE,weight:Double
e1,"marko","lop",created,0.4
```
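The same ID-space layout can be generated programmatically; here is a minimal sketch using Python's standard `csv` module (the file names are hypothetical):

```python
import csv

# Hypothetical file names; each node file declares its own ID space.
with open("person_nodes.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["name:ID(person)", "age:Int", "lang:String", ":LABEL"])
    w.writerow(["marko", "29", "", "person"])

# The relationship file refers back to both ID spaces.
with open("relationships.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow([":ID", ":START_ID(person)", ":END_ID(software)", ":TYPE", "weight:Double"])
    w.writerow(["e1", "marko", "lop", "created", "0.4"])
```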

# RDF load data formats
<a name="bulk-load-tutorial-format-rdf"></a>

To load Resource Description Framework (RDF) data, you can use one of the following standard formats as specified by the World Wide Web Consortium (W3C):
+ N-Triples (`ntriples`) from the specification at [https://www.w3.org/TR/n-triples/](https://www.w3.org/TR/n-triples/)
+ N-Quads (`nquads`) from the specification at [https://www.w3.org/TR/n-quads/](https://www.w3.org/TR/n-quads/)
+ RDF/XML (`rdfxml`) from the specification at [https://www.w3.org/TR/rdf-syntax-grammar/](https://www.w3.org/TR/rdf-syntax-grammar/)
+ Turtle (`turtle`) from the specification at [https://www.w3.org/TR/turtle/](https://www.w3.org/TR/turtle/)
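For illustration only, a minimal N-Triples file might look like this (the IRIs are hypothetical); each line is one subject, predicate, and object, terminated by a period:

```
<http://example.org/marko> <http://example.org/created> <http://example.org/lop> .
<http://example.org/marko> <http://example.org/age> "29"^^<http://www.w3.org/2001/XMLSchema#int> .
```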

**Important**  
All files must be encoded in UTF-8 format.  
For N-Quads and N-Triples data that includes Unicode characters, `\uxxxxx` escape sequences are supported. However, Neptune does not support normalization. If a value requires normalization, it will not match byte-for-byte during querying. For more information about normalization, see the [Normalization](https://unicode.org/faq/normalization.html) page on [Unicode.org](https://unicode.org).

**Next Steps**  
Now that you know more about the loading formats, see [Example: Loading Data into a Neptune DB Instance](bulk-load-data.md).