Input Data - Amazon SageMaker

Input Data

The input data are the data objects that you send to your workforce to be labeled. Each object in the input data is described in a manifest file. Each line in the manifest is an entry containing an object to label. An entry can also contain labels from previous jobs.

Input data and the manifest file must be stored in Amazon Simple Storage Service (Amazon S3). Each has specific storage and access requirements, as follows:

  • The Amazon S3 bucket that contains the input data must be in the same AWS Region in which you are running Amazon SageMaker Ground Truth. You must give Amazon SageMaker read access to the data stored in the Amazon S3 bucket. For more information about Amazon S3 buckets, see Working with Amazon S3 buckets.

  • The manifest file must be in the same AWS Region as the data files, but it doesn't need to be in the same location as the data files. It can be stored in any Amazon S3 bucket that is accessible to the AWS Identity and Access Management (IAM) role that you assigned to Ground Truth when you created the labeling job.

The manifest is a UTF-8 encoded file in which each line is a complete and valid JSON object. Each line is delimited by a standard line break, \n or \r\n. Because each line must be a valid JSON object, you can't have unescaped line break characters. For more information about data format, see JSON Lines.

Each JSON object in the manifest file can be no larger than 100,000 characters. No single attribute within an object can be larger than 20,000 characters. Attribute names can't begin with $ (dollar sign).
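
The format rules above can be checked locally before you upload a manifest. The following is a minimal sketch, not an official validator. The function name check_manifest_line is hypothetical, and it assumes that the size of an attribute is the length of its serialized JSON value, which the limits above do not spell out.

```python
import json

MAX_LINE_CHARS = 100_000       # per-object limit stated above
MAX_ATTRIBUTE_CHARS = 20_000   # per-attribute limit stated above


def check_manifest_line(line: str) -> list[str]:
    """Return a list of problems found in one manifest line (empty if none)."""
    problems = []
    text = line.rstrip("\r\n")  # drop the line delimiter before measuring
    if len(text) > MAX_LINE_CHARS:
        problems.append("object exceeds 100,000 characters")
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as err:
        return problems + [f"not valid JSON: {err.msg}"]
    if not isinstance(obj, dict):
        return problems + ["line is not a JSON object"]
    for name, value in obj.items():
        if name.startswith("$"):
            problems.append(f"attribute name {name!r} begins with $")
        # Assumption: attribute size is the length of its serialized JSON value.
        if len(json.dumps(value, ensure_ascii=False)) > MAX_ATTRIBUTE_CHARS:
            problems.append(f"attribute {name!r} exceeds 20,000 characters")
    return problems
```

Running this over every line of a manifest and collecting the non-empty results gives a quick report of the entries that need attention.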

Important

For 3D point cloud labeling job input data requirements, see 3D Point Cloud Input Data.

For video frame labeling job input data requirements, see Video Frame Input Data.

Each JSON object in the manifest file must contain one of the following keys: source-ref or source. The values of these keys are interpreted as follows:

  • source-ref – The source of the object is the Amazon S3 object specified in the value. Use this value when the object is a binary object, such as an image, or when you have text in individual files.

  • source – The source of the object is the value. Use this value when the object is a text value.
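
As a rough illustration of this rule, the sketch below reports which of the two keys an entry uses. The helper name data_key is hypothetical. Treating an entry that contains both keys as an error is an assumption made here for strictness; the text above only says that one of the keys must be present.

```python
def data_key(entry: dict) -> str:
    """Return "source-ref" or "source", whichever key the entry uses."""
    present = [key for key in ("source-ref", "source") if key in entry]
    # Assumption: exactly one of the two keys is expected per entry.
    if len(present) != 1:
        raise ValueError(
            "entry must contain exactly one of 'source-ref' or 'source', "
            f"found: {present}"
        )
    return present[0]
```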

The following is an example of a manifest file for files stored in an Amazon S3 bucket:

    {"source-ref": "S3 bucket location 1"}
    {"source-ref": "S3 bucket location 2"}
       ...
    {"source-ref": "S3 bucket location n"}

Use the source-ref key for image files for bounding box, image classification (single and multi-label), and semantic segmentation labeling jobs.

Use the source-ref key for text-based labeling jobs (such as single and multi-label text classification and named entity recognition) if your dataset is stored in text files (for example, .txt or .csv files).
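
If you already have a list of object locations, a source-ref manifest can be generated with a few lines of code. This is a sketch only: the function name source_ref_manifest and the bucket and object names in the usage note are hypothetical placeholders.

```python
import json


def source_ref_manifest(s3_uris):
    """Build manifest text with one {"source-ref": ...} object per line."""
    # json.dumps emits each object on a single line; "\n" delimits the entries.
    return "".join(json.dumps({"source-ref": uri}) + "\n" for uri in s3_uris)
```

For example, source_ref_manifest(["s3://example-bucket/images/cat.jpg", "s3://example-bucket/images/dog.jpg"]) returns two lines, one per object. Write the result with UTF-8 encoding before uploading it to Amazon S3.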

The following is an example of a manifest file with the input data stored in the manifest:

    {"source": "Lorem ipsum dolor sit amet"}
    {"source": "consectetur adipiscing elit"}
       ...
    {"source": "mollit anim id est laborum"}

Use the source key for single and multi-label text classification and named entity recognition labeling jobs if the text you want labeled is listed directly in the input manifest file.
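
When text is embedded directly in the manifest, any line breaks inside the text must be escaped so that each entry stays on one line. The sketch below relies on a standard JSON serializer to do that escaping. The function name source_manifest is hypothetical.

```python
import json


def source_manifest(texts):
    """Build manifest text with one {"source": ...} object per line."""
    # json.dumps escapes embedded line breaks as \n and, by default,
    # non-ASCII characters as \uXXXX, so each entry stays on a single line.
    return "".join(json.dumps({"source": text}) + "\n" for text in texts)
```

A text item that itself spans two lines still produces a single manifest line, and parsing that line returns the original text unchanged.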

You can include other key-value pairs in the manifest file. These pairs are passed to the output file unchanged. This is useful when you want to pass information between your applications. For more information, see Output Data.

Automated Data Setup

You can create a manifest file for your labeling jobs in the Ground Truth console using images, videos, video frames, text (.txt) files, and comma-separated value (.csv) files. Before using the following procedure, ensure that your input images or files are correctly formatted:

  • Image files – Image files must comply with the size and resolution limits listed in the tables found in Input File Size Quota.

  • Text files – Text data can be stored in one or more .txt files. Each item that you want labeled must be separated by a standard line break.

  • CSV files – Text data can be stored in one or more .csv files. Each item that you want labeled must be in a separate row.

  • Videos – Video files can be any of the following formats: MP4, OGG, and WEBM. If you want to extract video frames from your video files for object detection or object tracking, see Provide Video Files.

  • Video frames – Video frames are images extracted from a video. All images extracted from a single video are referred to as a sequence of video frames. Each sequence of video frames must have unique prefix keys in Amazon S3. See Provide Video Frames. To use the automated data setup with this data type, see Automated Video Frame Input Data Setup.

Important

For video frame object detection and video frame object tracking labeling jobs, see Automated Video Frame Input Data Setup to learn how to use the automated data setup.

Use these instructions to automatically set up your input dataset connection with Ground Truth.

Automatically connect your data in Amazon S3 with Ground Truth

  1. Navigate to the Create labeling job page in the Amazon SageMaker console: https://console.aws.amazon.com/sagemaker/.

    This link puts you in the US East (N. Virginia) AWS Region (us-east-1). If your input data is in an Amazon S3 bucket in another Region, switch to that Region. To change your AWS Region, on the navigation bar, choose the name of the currently displayed Region.

  2. Select Create labeling job.

  3. Enter a Job name.

  4. In the section Input data setup, select Automated data setup.

  5. Enter an Amazon S3 URI for S3 location for input datasets.

  6. Specify your S3 location for output datasets. This is where your output data is stored.

  7. Choose your Data type using the dropdown list.

  8. Select Set up connection.

This creates an input manifest in the Amazon S3 location for input datasets that you specified in step 5. If you are creating a labeling job using the Amazon SageMaker API, the AWS CLI, or an AWS SDK, use the Amazon S3 URI for this input manifest file as input to the parameter ManifestS3Uri.
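
As a sketch of where ManifestS3Uri goes, the helper below builds the input configuration fragment of a labeling job request. The nesting (DataSource, then S3DataSource, then ManifestS3Uri) reflects the CreateLabelingJob request shape as understood here; verify it against the API reference before relying on it. The function name input_config and the example URI are hypothetical, and the other required request parameters are intentionally left out.

```python
def input_config(manifest_s3_uri: str) -> dict:
    """Build the InputConfig fragment that points at an input manifest."""
    if not manifest_s3_uri.startswith("s3://"):
        raise ValueError("expected an Amazon S3 URI such as s3://bucket/key")
    return {
        "DataSource": {
            "S3DataSource": {
                # URI of the input manifest created by the automated data setup
                "ManifestS3Uri": manifest_s3_uri,
            }
        }
    }
```

The returned dictionary would be passed as the InputConfig argument of a create-labeling-job call, alongside the job name, output location, role, and task settings.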