Data processing using the dataprocessing command - Amazon Neptune

Use the Neptune ML dataprocessing command to create a data-processing job, check its status, stop it, or list all active data-processing jobs.

Creating a data-processing job using the Neptune ML dataprocessing command

A typical Neptune ML dataprocessing command for creating a new job looks like this:

curl \
  -X POST https://(your Neptune endpoint)/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)"
      }'

A command to initiate incremental re-processing looks like this:

curl \
  -X POST https://(your Neptune endpoint)/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for this job)",
        "processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)",
        "previousDataProcessingJobId" : "(the job ID of a previously completed job to update)"
      }'
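
The request bodies above can also be assembled programmatically, which avoids hand-editing mistakes such as a missing comma between fields. The following is a minimal Python sketch, not an official client: the function name build_create_job_body and its arguments are hypothetical, and only the four JSON field names come from the commands above.

```python
import json


def build_create_job_body(input_s3, output_s3, job_id=None, previous_job_id=None):
    """Return the JSON body for POST /ml/dataprocessing as a string.

    Hypothetical helper: only the JSON field names are taken from the
    documented commands; everything else is illustrative.
    """
    # Both locations are required and are described as Amazon S3 URIs.
    for name, uri in (("inputDataS3Location", input_s3),
                      ("processedDataS3Location", output_s3)):
        if not uri.startswith("s3://"):
            raise ValueError(f"{name} must be an s3:// URI, got {uri!r}")

    body = {
        "inputDataS3Location": input_s3,
        "processedDataS3Location": output_s3,
    }
    # "id" is optional; when it is omitted the service autogenerates a UUID.
    if job_id is not None:
        body["id"] = job_id
    # Supplying a previous job ID requests incremental re-processing.
    if previous_job_id is not None:
        body["previousDataProcessingJobId"] = previous_job_id
    return json.dumps(body)
```

The returned string can be passed to curl with -d, or sent with any HTTP client using the Content-Type: application/json header shown above.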

Parameters for dataprocessing job creation

  • id   –   (Optional) A unique identifier for the new job.

    Type: string. Default: An autogenerated UUID.

  • previousDataProcessingJobId   –   (Optional) The job ID of a completed data processing job run on an earlier version of the data.

    Type: string. Default: none.

    Note: Use this for incremental data processing, to update the model when graph data has changed (but not when data has been deleted).

  • inputDataS3Location   –   (Required) The URI of the Amazon S3 location where you want SageMaker to download the data needed to run the data processing job.

    Type: string.

  • processedDataS3Location   –   (Required) The URI of the Amazon S3 location where you want SageMaker to save the results of a data processing job.

    Type: string.

  • sagemakerIamRoleArn   –   (Optional) The ARN of an IAM role for SageMaker execution.

    Type: string. Note: This must be listed in your DB cluster parameter group or an error will occur.

  • neptuneIamRoleArn   –   (Optional) The Amazon Resource Name (ARN) of an IAM role that SageMaker can assume to perform tasks on your behalf.

    Type: string. Note: This must be listed in your DB cluster parameter group or an error will occur.

  • processingInstanceType   –   (Optional) The type of ML instance used during data processing. Its memory should be large enough to hold the processed dataset.

    Type: string. Default: the smallest ml.r5 type whose memory is at least ten times the size of the exported graph data on disk.

    Note: Neptune ML can select the instance type automatically. See Selecting an instance for data processing.

  • processingInstanceVolumeSizeInGB   –   (Optional) The disk volume size of the processing instance. Both input data and processed data are stored on disk, so the volume size must be large enough to hold both data sets.

    Type: integer. Default: 0.

    Note: If not specified or 0, Neptune ML chooses the volume size automatically based on the data size.

  • processingTimeOutInSeconds   –   (Optional) Timeout in seconds for the data processing job.

    Type: integer. Default: 86,400 (1 day).

  • modelType   –   (Optional) One of the two model types that Neptune ML currently supports: heterogeneous graph models (heterogeneous) and knowledge graph models (kge).

    Type: string. Default: none.

    Note: If not specified, Neptune ML chooses the model type automatically based on the data.

  • configFileName   –   (Optional) A data specification file that describes how to load the exported graph data for training. The file is automatically generated by the Neptune export toolkit.

    Type: string. Default: training-data-configuration.json.

  • subnets   –   (Optional) The IDs of the subnets in the Neptune VPC.

    Type: list of strings. Default: none.

  • securityGroupIds   –   (Optional) The VPC security group IDs.

    Type: list of strings. Default: none.

  • volumeEncryptionKMSKey   –   (Optional) The AWS Key Management Service (AWS KMS) key that SageMaker uses to encrypt data on the storage volume attached to the ML compute instances that run the processing job.

    Type: string. Default: none.

  • s3OutputEncryptionKMSKey   –   (Optional) The AWS Key Management Service (AWS KMS) key that SageMaker uses to encrypt the output of the processing job.

    Type: string. Default: none.
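
The default described for processingInstanceType can be worked through as simple arithmetic: multiply the size of the exported data by ten and take the smallest ml.r5 size with at least that much memory. The sketch below only illustrates that rule and is not Neptune ML's actual selection logic. The memory table is an assumption based on the published R5 instance family sizes and should be verified, and the function name is hypothetical.

```python
# Memory in GiB per ml.r5 size, smallest first. Assumed from the R5 instance
# family specifications; verify against current SageMaker documentation.
R5_MEMORY_GIB = [
    ("ml.r5.large", 16),
    ("ml.r5.xlarge", 32),
    ("ml.r5.2xlarge", 64),
    ("ml.r5.4xlarge", 128),
    ("ml.r5.8xlarge", 256),
    ("ml.r5.12xlarge", 384),
    ("ml.r5.16xlarge", 512),
    ("ml.r5.24xlarge", 768),
]


def pick_processing_instance(export_size_gib, factor=10):
    """Return the smallest listed type whose memory covers factor x export size.

    Returns None when no listed type is large enough.
    """
    needed_gib = export_size_gib * factor
    for instance_type, memory_gib in R5_MEMORY_GIB:
        if memory_gib >= needed_gib:
            return instance_type
    return None
```

For example, a 5 GiB export needs about 50 GiB of memory under this rule, so the sketch returns ml.r5.2xlarge, the first size in the table with at least 50 GiB.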

Getting the status of a data-processing job using the Neptune ML dataprocessing command

A sample Neptune ML dataprocessing command for the status of a job looks like this:

curl -s \
  "https://(your Neptune endpoint)/ml/dataprocessing/(the job ID)" \
  | python -m json.tool

Parameters for dataprocessing job status

  • id   –   (Required) The unique identifier of the data-processing job.

    Type: string.

  • neptuneIamRoleArn   –   (Optional) The ARN of an IAM role that provides Neptune access to SageMaker and Amazon S3 resources.

    Type: string. Note: This must be listed in your DB cluster parameter group or an error will occur.
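
Because a data-processing job can run for a long time, the status command is often called repeatedly until the job finishes. The following Python sketch shows one way to poll it. It makes several assumptions that this page does not state: that the response is a JSON object with a top-level "status" field, and that "Completed", "Failed", and "Stopped" are the terminal values. The function names are hypothetical, and request signing for clusters with IAM authentication is not handled.

```python
import json
import time
import urllib.request

# Assumed terminal status values; this page does not list them.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def fetch_status(endpoint, job_id):
    """GET the status document for one data-processing job."""
    url = f"https://{endpoint}/ml/dataprocessing/{job_id}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job(endpoint, job_id, fetch=fetch_status,
                 interval=60, max_polls=100, sleep=time.sleep):
    """Poll until the job reports a terminal status, then return the response.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without a live endpoint.
    """
    for _ in range(max_polls):
        info = fetch(endpoint, job_id)
        # Assumes the job state is reported in a top-level "status" field.
        if info.get("status") in TERMINAL_STATUSES:
            return info
        sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

Capping the number of polls keeps a stuck job from blocking the caller indefinitely. The interval and cap here are arbitrary and should be chosen with processingTimeOutInSeconds in mind.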

Stopping a data-processing job using the Neptune ML dataprocessing command

A sample Neptune ML dataprocessing command for stopping a job looks like this:

curl -s \
  -X DELETE "https://(your Neptune endpoint)/ml/dataprocessing/(the job ID)"

Or this:

curl -s \
  -X DELETE "https://(your Neptune endpoint)/ml/dataprocessing/(the job ID)?clean=true"

Parameters for dataprocessing stop job

  • id   –   (Required) The unique identifier of the data-processing job.

    Type: string.

  • neptuneIamRoleArn   –   (Optional) The ARN of an IAM role that provides Neptune access to SageMaker and Amazon S3 resources.

    Type: string. Note: This must be listed in your DB cluster parameter group or an error will occur.

  • clean   –   (Optional) This flag specifies that all Amazon S3 artifacts should be deleted when the job is stopped.

    Type: Boolean. Default: FALSE.
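
The two stop commands above differ only in the clean query parameter. As a minimal sketch, the Python helper below builds the equivalent DELETE request without sending it. The function name build_stop_request is hypothetical, and the endpoint and job ID are placeholders supplied by the caller.

```python
import urllib.request
from urllib.parse import quote


def build_stop_request(endpoint, job_id, clean=False):
    """Build (but do not send) the DELETE request that stops a job.

    With clean=True the URL carries ?clean=true, which asks for the job's
    Amazon S3 artifacts to be deleted when the job is stopped.
    """
    # Percent-encode the job ID so it is safe to place in the URL path.
    url = f"https://{endpoint}/ml/dataprocessing/{quote(job_id, safe='')}"
    if clean:
        url += "?clean=true"
    return urllib.request.Request(url, method="DELETE")
```

Passing the returned object to urllib.request.urlopen sends the request. Leaving clean at its default of False matches the documented default.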

Listing active data-processing jobs using the Neptune ML dataprocessing command

A sample Neptune ML dataprocessing command for listing active jobs looks like this:

curl -s "https://(your Neptune endpoint)/ml/dataprocessing"

Or this:

curl -s "https://(your Neptune endpoint)/ml/dataprocessing?maxItems=3"

Parameters for dataprocessing list jobs

  • maxItems   –   (Optional) The maximum number of items to return.

    Type: integer. Default: 10. Maximum allowed value: 1024.

  • neptuneIamRoleArn   –   (Optional) The ARN of an IAM role that provides Neptune access to SageMaker and Amazon S3 resources.

    Type: string. Note: This must be listed in your DB cluster parameter group or an error will occur.
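
The maxItems limit above can be checked on the client before the list request is sent. The Python sketch below builds the list URL and rejects values above the documented maximum of 1024. The lower bound of 1 is an assumption, since this page gives only the maximum, and the function name is hypothetical.

```python
from urllib.parse import urlencode

MAX_ITEMS_LIMIT = 1024  # documented maximum for maxItems


def list_jobs_url(endpoint, max_items=None):
    """Return the URL that lists active data-processing jobs.

    When max_items is None the parameter is omitted, so the service
    default of 10 applies.
    """
    base = f"https://{endpoint}/ml/dataprocessing"
    if max_items is None:
        return base
    # Lower bound of 1 is assumed; only the upper bound is documented.
    if not 1 <= max_items <= MAX_ITEMS_LIMIT:
        raise ValueError(f"maxItems must be between 1 and {MAX_ITEMS_LIMIT}")
    return f"{base}?{urlencode({'maxItems': max_items})}"
```

Calling list_jobs_url with max_items=3 produces the same URL as the second curl command above.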