Using the neptune-export tool or Neptune-Export service to export data from Neptune for Neptune ML - Amazon Neptune

Neptune ML requires that you provide training data for the Deep Graph Library (DGL) to create and evaluate models.

You can export data from Neptune using either the Neptune-Export service or the neptune-export command-line tool. Both publish data to Amazon Simple Storage Service (Amazon S3) in CSV format, encrypted using Amazon S3 server-side encryption (SSE-S3). See Files exported by Neptune-Export and neptune-export.

In addition, when you configure an export of training data for Neptune ML, the export job creates and publishes an encrypted model-training configuration file along with the exported data. By default, this file is named training-data-configuration.json.

Examples of using the Neptune-Export service to export training data for Neptune ML

This request exports property-graph training data for a node classification task:

    curl \
      (your NeptuneExportApiUri) \
      -X POST \
      -H 'Content-Type: application/json' \
      -d '{
            "command": "export-pg",
            "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
            "params": {
              "endpoint": "(your Neptune endpoint DNS name)",
              "profile": "neptune_ml"
            },
            "additionalParams": {
              "neptune_ml": {
                "version": "v2.0",
                "targets": [
                  {
                    "node": "Movie",
                    "property": "genre",
                    "type": "classification"
                  }
                ]
              }
            }
          }'

This request exports RDF training data for a node classification task:

    curl \
      (your NeptuneExportApiUri) \
      -X POST \
      -H 'Content-Type: application/json' \
      -d '{
            "command": "export-rdf",
            "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
            "params": {
              "endpoint": "(your Neptune endpoint DNS name)",
              "profile": "neptune_ml"
            },
            "additionalParams": {
              "neptune_ml": {
                "version": "v2.0",
                "targets": [
                  {
                    "node": "http://aws.amazon.com/neptune/csv2rdf/class/Movie",
                    "predicate": "http://aws.amazon.com/neptune/csv2rdf/datatypeProperty/genre",
                    "type": "classification"
                  }
                ]
              }
            }
          }'

Fields to set in the params object when exporting training data

The params object in an export request can contain various fields, as described in the params documentation. The following ones are most relevant for exporting machine-learning training data:

  • endpoint   –   Use endpoint to specify an endpoint of a Neptune instance in your DB cluster that the export process can query to extract data.

  • profile   –   The profile field in the params object must be set to neptune_ml.

    This causes the export process to format the exported data appropriately for Neptune ML model training, in a CSV format for property-graph data or as N-Triples for RDF data. It also causes a training-data-configuration.json file to be created and written to the same Amazon S3 location as the exported training data.

  • cloneCluster   –   If set to true, the export process clones your DB cluster, exports from the clone, and then deletes the clone when it is finished.

  • useIamAuth   –   If your DB cluster has IAM authentication enabled, you must include this field set to true.
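The fields above can be assembled programmatically before the request is sent. The following is a minimal Python sketch, not an official client: the function name build_export_request and its keyword arguments are hypothetical, and the profile value is the one used in the request examples on this page. It only builds the JSON body and leaves the placeholders for you to fill in.

```python
import json


def build_export_request(endpoint, output_s3_path, targets,
                         clone_cluster=False, use_iam_auth=False):
    """Assemble a property-graph export request body for Neptune ML.

    Hypothetical helper for illustration; it builds the JSON body only
    and does not contact any service.
    """
    params = {
        "endpoint": endpoint,
        "profile": "neptune_ml",  # value used in the request examples on this page
    }
    # Add the optional flags only when they are requested.
    if clone_cluster:
        params["cloneCluster"] = True
    if use_iam_auth:
        params["useIamAuth"] = True

    return {
        "command": "export-pg",
        "outputS3Path": output_s3_path,
        "params": params,
        "additionalParams": {
            "neptune_ml": {"version": "v2.0", "targets": targets}
        },
    }


request_body = build_export_request(
    endpoint="(your Neptune endpoint DNS name)",
    output_s3_path="s3://(your Amazon S3 bucket)/neptune-export",
    targets=[{"node": "Movie", "property": "genre", "type": "classification"}],
    clone_cluster=True,   # export from a temporary clone of the DB cluster
    use_iam_auth=True,    # required when IAM authentication is enabled
)
print(json.dumps(request_body, indent=2))
```

The printed JSON can be passed as the -d argument of a curl request to your NeptuneExportApiUri.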

The export process also provides several ways to filter the data you export (see these examples).

Using the additionalParams object to tune the export of model-training information

The additionalParams object contains fields that you can use to specify machine-learning class labels and features for training, and to guide the creation of a training data configuration file.

The export process cannot automatically infer which node and edge properties should serve as the machine-learning class labels for training. It also cannot automatically infer the best feature encoding for numeric, categorical, and text properties. You therefore need to supply hints in the additionalParams object to specify these things or to override the default encoding.

For property-graph data, the top-level structure of additionalParams in an export request might look like this:

    {
      "command": "export-pg",
      "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
      "params": {
        "endpoint": "(your Neptune endpoint DNS name)",
        "profile": "neptune_ml"
      },
      "additionalParams": {
        "neptune_ml": {
          "version": "v2.0",
          "targets": [ (an array of node and edge class label targets) ],
          "features": [ (an array of node feature hints) ]
        }
      }
    }

For RDF data, its top-level structure might look like this:

    {
      "command": "export-rdf",
      "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
      "params": {
        "endpoint": "(your Neptune endpoint DNS name)",
        "profile": "neptune_ml"
      },
      "additionalParams": {
        "neptune_ml": {
          "version": "v2.0",
          "targets": [ (an array of node and edge class label targets) ]
        }
      }
    }
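The difference between the two target shapes can be sketched in Python. This is an illustration only: the helper names pg_node_target, rdf_node_target, and neptune_ml_params are hypothetical, and the keys they emit (node, property or predicate, type) are taken from the two node classification requests shown earlier on this page.

```python
import json


def pg_node_target(label, prop, task="classification"):
    """Class-label target for property-graph data (keys as in the export-pg example)."""
    return {"node": label, "property": prop, "type": task}


def rdf_node_target(class_iri, predicate_iri, task="classification"):
    """Class-label target for RDF data (keys as in the export-rdf example)."""
    return {"node": class_iri, "predicate": predicate_iri, "type": task}


def neptune_ml_params(targets, features=None):
    """Wrap targets, and optionally feature hints, in an additionalParams object."""
    config = {"version": "v2.0", "targets": list(targets)}
    if features is not None:
        config["features"] = list(features)
    return {"neptune_ml": config}


pg_params = neptune_ml_params([pg_node_target("Movie", "genre")])
rdf_params = neptune_ml_params([
    rdf_node_target(
        "http://aws.amazon.com/neptune/csv2rdf/class/Movie",
        "http://aws.amazon.com/neptune/csv2rdf/datatypeProperty/genre",
    )
])
print(json.dumps(pg_params, indent=2))
print(json.dumps(rdf_params, indent=2))
```

Each result is the value to place under "additionalParams" in the corresponding export-pg or export-rdf request.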

You can also supply multiple export configurations, using the jobs field:

    {
      "command": "export-pg",
      "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
      "params": {
        "endpoint": "(your Neptune endpoint DNS name)",
        "profile": "neptune_ml"
      },
      "additionalParams": {
        "neptune_ml": {
          "version": "v2.0",
          "jobs": [
            {
              "name": "(training data configuration name)",
              "targets": [ (an array of node and edge class label targets) ],
              "features": [ (an array of node feature hints) ]
            },
            {
              "name": "(another training data configuration name)",
              "targets": [ (an array of node and edge class label targets) ],
              "features": [ (an array of node feature hints) ]
            }
          ]
        }
      }
    }
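A jobs array like the one above could be generated rather than written by hand. The Python sketch below is illustrative: make_job and neptune_ml_with_jobs are hypothetical helpers, the job names and the "studio" property are invented for the example, and the uniqueness check on job names is a guard added in this sketch rather than documented service behavior.

```python
import json


def make_job(name, targets, features=None):
    """Build one training-data configuration object for the jobs array."""
    job = {"name": name, "targets": list(targets)}
    if features:
        job["features"] = list(features)
    return job


def neptune_ml_with_jobs(jobs):
    """Wrap several job objects in an additionalParams value."""
    names = [job["name"] for job in jobs]
    # Guard added in this sketch: each job name should map to its own file.
    if len(names) != len(set(names)):
        raise ValueError("job names must be unique")
    return {"neptune_ml": {"version": "v2.0", "jobs": list(jobs)}}


additional_params = neptune_ml_with_jobs([
    make_job(
        "movie-genre",  # hypothetical job name
        [{"node": "Movie", "property": "genre", "type": "classification"}],
    ),
    make_job(
        "movie-studio",  # hypothetical job name and property
        [{"node": "Movie", "property": "studio", "type": "classification"}],
    ),
])
print(json.dumps(additional_params, indent=2))
```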

Top-level elements in the neptune_ml field in additionalParams

The version element in neptune_ml

Specifies the version of training data configuration to generate.

(Optional) Type: string. Default: "v2.0".

If you do include version, set it to v2.0.

The jobs field in neptune_ml

Contains an array of training-data configuration objects. Each object defines a data processing job and contains:

  • name   –   The name of the training data configuration to be created.

    For example, a training data configuration with the name "job-number-1" results in a training data configuration file named job-number-1.json.

  • targets   –   A JSON array of node and edge class label targets that represent the machine-learning class labels for training purposes. See The targets field in a neptune_ml object.

  • features   –   A JSON array of node property features. See The features field in neptune_ml.
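The naming rule for the name field can be expressed as a small Python sketch that lists the configuration files to look for after an export. The function expected_config_files is hypothetical. It assumes that a request without a jobs array produces only the default training-data-configuration.json file, and that each job produces a file named after the job with a .json suffix, as in the job-number-1 example above.

```python
DEFAULT_CONFIG_FILE = "training-data-configuration.json"


def expected_config_files(neptune_ml):
    """Return the configuration file names implied by a neptune_ml object.

    Assumption: with no jobs array, only the default file is written.
    """
    jobs = neptune_ml.get("jobs")
    if not jobs:
        return [DEFAULT_CONFIG_FILE]
    # Each job name maps to "<name>.json".
    return [f"{job['name']}.json" for job in jobs]


single = {"version": "v2.0", "targets": []}
multiple = {
    "version": "v2.0",
    "jobs": [
        {"name": "job-number-1", "targets": []},
        {"name": "job-number-2", "targets": []},
    ],
}
print(expected_config_files(single))
print(expected_config_files(multiple))
```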