Amazon SageMaker
Developer Guide

Amazon SageMaker Ground Truth

To train a machine learning model, you need a large, high-quality, labeled dataset. Ground Truth helps you build high-quality training datasets for your machine learning models. With Ground Truth, you can use workers from either Amazon Mechanical Turk, a vendor company that you choose, or an internal, private workforce along with machine learning to enable you to create a labeled dataset. You can use the labeled dataset output from Ground Truth to train your own models. You can also use the output as a training dataset for an Amazon SageMaker model.

In order to automate labeling your training dataset, you can optionally use automated data labeling, a Ground Truth process that uses machine learning to decide which data needs to be labeled by humans. Automated data labeling may reduce the labeling time and manual effort required. For more information, see Using Automated Data Labeling.

Use either pre-built or custom tools to assign the labeling tasks for your training dataset. A labeling UI template is a webpage that Ground Truth uses to present tasks and instructions to your workers. The Amazon SageMaker console provides built-in templates for labeling data. You can use these templates to get started , or you can build your own tasks and instructions by using our HTML 2.0 components. For more information, see Creating Custom Labeling Workflows.

Use the workforce of your choice to label your dataset. You can choose your workforce from:

  • The Amazon Mechanical Turk workforce of over 500,000 independent contractors worldwide.

  • A private workforce that you create from your employees or contractors for handling data within your organization.

  • A vendor company that you can find in the AWS Marketplace that specializes in data labeling services.

For more information, see Managing Your Workforce.

You store your datasets in Amazon S3 buckets. The buckets contain three things: The data to be labeled, an input manifest file that Ground Truth uses to read the data files, and an output manifest file. The output file contains the results of the labeling job. For more information, see Using Input and Output Data.

Events from your labeling jobs appear in Amazon CloudWatch under the /aws/sagemaker/LabelingJobs group. CloudWatch uses the labeling job name as the name for the log stream.

Are You a First-time User of Ground Truth?

If you are a first-time user of Ground Truth, we recommend that you do the following:

  1. Read Getting started—This section walks you through setting up your first Ground Truth labeling job.

  2. Explore other topics—Depending on your needs, do the following:

    • Create instruction pages for your labeling jobs—Create a custom instruction page that makes it easier for your workers to understand the requirements of the job. For more information, see Creating Instruction Pages.

    • Manage your labeling workforce—Create new work teams and manage your existing workforce. For more information, see Managing Your Workforce.

    • Create a custom UI—Make it easier for your workers to quickly and correctly label your data by creating a custom UI for them to use. For more information, see Creating Custom Labeling Workflows.

  3. See the API Reference—This section describes operations to automate Ground Truth operations.