Create a Labeling Job (API)
To create a labeling job using the Amazon SageMaker API, you use the CreateLabelingJob operation. For specific instructions on creating a labeling job for a built-in task type, see that task type page. To learn how to create a streaming labeling job, which is a labeling job that runs perpetually, see Create a streaming labeling job.
To use the CreateLabelingJob operation, you need the following:
- A worker task template in Amazon S3 (UiTemplateS3Uri) or a human task UI ARN (HumanTaskUiArn).
  - For 3D point cloud jobs, video object detection and tracking jobs, and NER jobs, use the ARN listed in HumanTaskUiArn for your task type.
  - If you are using a built-in task type other than 3D point cloud tasks, you can add your worker instructions to one of the pre-built templates and save the template (using a .html or .liquid extension) in your S3 bucket. Find the pre-built templates on your task type page.
  - If you are using a custom labeling workflow, you can create a custom template and save the template in your S3 bucket. To learn how to build a custom worker template, see Creating a custom worker task template. For custom HTML elements that you can use to customize your template, see Crowd HTML Elements Reference. For a repository of demo templates for a variety of labeling tasks, see Amazon SageMaker Ground Truth Sample Task UIs.
- An input manifest file that specifies your input data in Amazon S3. Specify the location of your input manifest file in ManifestS3Uri. For information about creating an input manifest, see Input data. If you create a streaming labeling job, this is optional. To learn how to create a streaming labeling job, see Create a streaming labeling job.
- An Amazon S3 bucket to store your output data. You specify this bucket, and optionally a prefix, in S3OutputPath.
- A label category configuration file. Each label category name must be unique. Specify the location of this file in Amazon S3 using the LabelCategoryConfigS3Uri parameter. The format and label categories for this file depend on the task type you use:
  - For image classification and text classification (single and multi-label), you must specify at least two label categories. For all other task types, the minimum number of label categories required is one.
  - For named entity recognition tasks, you must provide worker instructions in this file. See Provide Worker Instructions in a Label Category Configuration File for details and an example.
  - For 3D point cloud and video frame task types, use the format in Labeling category configuration file with label category and frame attributes reference.
  - For all other built-in task types and custom tasks, your label category configuration file must be a JSON file in the following format. Identify the labels you want to use by replacing label_1, label_2, ..., label_n with your label categories.

    {
      "document-version": "2018-11-28",
      "labels": [
        {"label": "label_1"},
        {"label": "label_2"},
        ...
        {"label": "label_n"}
      ]
    }
- An AWS Identity and Access Management (IAM) role with the AmazonSageMakerGroundTruthExecution managed IAM policy attached and with permissions to access your S3 buckets. Specify this role in RoleArn. To learn more about this policy, see Use IAM Managed Policies with Ground Truth. If you require more granular permissions, see Assign IAM Permissions to Use Ground Truth.

  If your input or output bucket name does not contain sagemaker, you can attach a policy similar to the following to the role that is passed to the CreateLabelingJob operation.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": ["arn:aws:s3:::my_input_bucket/*"]
        },
        {
          "Effect": "Allow",
          "Action": ["s3:PutObject"],
          "Resource": ["arn:aws:s3:::my_output_bucket/*"]
        }
      ]
    }
- The Amazon Resource Names (ARNs) of a pre-annotation and a post-annotation (or annotation-consolidation) AWS Lambda function to process your input and output data.
  - Lambda functions are predefined in each AWS Region for built-in task types. To find the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn. To find the annotation-consolidation Lambda ARN for your Region, see AnnotationConsolidationLambdaArn.
  - For custom labeling workflows, you must provide custom pre-annotation and post-annotation Lambda ARNs. To learn how to create these Lambda functions, see Processing data in a custom labeling workflow with AWS Lambda.
- A work team ARN that you specify in WorkteamArn. You receive a work team ARN when you subscribe to a vendor workforce or create a private workteam. If you are creating a labeling job for a video frame or point cloud task type, you cannot use the Amazon Mechanical Turk workforce. For all other task types, to use the Mechanical Turk workforce, use the following ARN. Replace region with the AWS Region you are using to create the labeling job.

    arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default

  If you use the Amazon Mechanical Turk workforce, use the ContentClassifiers parameter in DataAttributes of InputConfig to declare that your content is free of personally identifiable information and adult content.

  Ground Truth requires that your input data is free of personally identifiable information (PII) if you use the Mechanical Turk workforce. If you use Mechanical Turk and do not specify that your input data is free of PII using the FreeOfPersonallyIdentifiableInformation flag, your labeling job will fail. Use the FreeOfAdultContent flag to declare that your input data is free of adult content. SageMaker AI may restrict the Amazon Mechanical Turk workers that can view your task if it contains adult content.

  To learn more about work teams and workforces, see Workforces.
- If you use the Mechanical Turk workforce, you must specify the price you'll pay workers for performing a single task in PublicWorkforceTaskPrice.
- To configure the task, you must provide a task description and title using TaskDescription and TaskTitle, respectively. Optionally, you can provide time limits that control how long workers have to work on an individual task (TaskTimeLimitInSeconds) and how long tasks remain available to workers in the worker portal (TaskAvailabilityLifetimeInSeconds).
- (Optional) For some task types, you can have multiple workers label a single data object by specifying a number greater than one for the NumberOfHumanWorkersPerDataObject parameter. For more information about annotation consolidation, see Annotation consolidation.
- (Optional) To create an automated data labeling job, specify one of the ARNs listed in LabelingJobAlgorithmSpecificationArn in LabelingJobAlgorithmsConfig. This ARN identifies the algorithm used in the automated data labeling job. The task type associated with this ARN must match the task type of the PreHumanTaskLambdaArn and AnnotationConsolidationLambdaArn you specify. Automated data labeling is supported for the following task types: image classification, bounding box, semantic segmentation, and text classification. The minimum number of objects allowed for automated data labeling is 1,250, and we strongly suggest providing a minimum of 5,000 objects. To learn more about automated data labeling jobs, see Automate data labeling.
- (Optional) You can provide StoppingConditions that cause the labeling job to stop if one of the conditions is met. You can use stopping conditions to control the cost of the labeling job.
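As a rough illustration of a few requirements in the list above, the following Python sketch builds a label category configuration file, fills a Region into the Mechanical Turk work team ARN, and assembles the settings that a Mechanical Turk job needs. The helper names (build_label_category_config, public_workteam_arn, mturk_settings) and the min_labels argument are hypothetical and are not part of any AWS SDK. The field names nested under PublicWorkforceTaskPrice are not given in this topic, so check them against the API reference before relying on them.

```python
import json

# Hypothetical helpers for illustration; none of these come from an AWS SDK.


def build_label_category_config(labels, min_labels=1):
    """Return the label category configuration as a dict.

    Pass min_labels=2 for image or text classification, which require
    at least two label categories.
    """
    if len(set(labels)) != len(labels):
        raise ValueError("each label category name must be unique")
    if len(labels) < min_labels:
        raise ValueError(f"at least {min_labels} label categories are required")
    return {
        "document-version": "2018-11-28",
        "labels": [{"label": name} for name in labels],
    }


def public_workteam_arn(region):
    """Fill the Region into the Mechanical Turk work team ARN."""
    return f"arn:aws:sagemaker:{region}:394669845002:workteam/public-crowd/default"


def mturk_settings(dollars, cents, tenth_cents=0):
    """Return request fragments needed when using Mechanical Turk.

    Only declare the content classifiers if they are true for your data.
    """
    return {
        # Goes in InputConfig -> DataAttributes -> ContentClassifiers.
        "ContentClassifiers": [
            "FreeOfPersonallyIdentifiableInformation",
            "FreeOfAdultContent",
        ],
        # Price paid to a worker for a single task (assumed field layout).
        "PublicWorkforceTaskPrice": {
            "AmountInUsd": {
                "Dollars": dollars,
                "Cents": cents,
                "TenthFractionsOfACent": tenth_cents,
            }
        },
    }


# Example: a two-label image classification configuration.
config = build_label_category_config(["cat", "dog"], min_labels=2)
config_json = json.dumps(config, indent=2)  # file contents to upload to Amazon S3
arn = public_workteam_arn("us-west-2")
```

The JSON text in config_json is what you would save to the Amazon S3 location named in LabelCategoryConfigS3Uri. The uniqueness and minimum-count checks only mirror the rules stated in the list; they do not replace the validation that the service performs.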
Examples
The following code examples demonstrate how to create a labeling job using CreateLabelingJob. For additional examples, we recommend you use one of the Ground Truth Labeling Jobs Jupyter notebooks in the SageMaker Examples section of a SageMaker notebook instance. To learn how to use a notebook example from the SageMaker AI Examples, see Access example notebooks. You can also see these example notebooks on GitHub in the SageMaker AI Examples repository.
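The following Python sketch shows one way a CreateLabelingJob request for a private work team might be assembled. It assumes the AWS SDK for Python (Boto3), whose SageMaker client exposes this operation as create_labeling_job. The function names build_request and submit are hypothetical, and every job name, bucket, object key, account ID, and ARN is a placeholder. LabelingJobName and LabelAttributeName are request fields that this topic does not discuss, and the field used inside StoppingConditions should be checked against the API reference.

```python
def build_request(job_name, bucket, role_arn, workteam_arn,
                  pre_lambda_arn, consolidation_lambda_arn):
    """Assemble CreateLabelingJob parameters from placeholder values."""
    return {
        "LabelingJobName": job_name,
        "LabelAttributeName": "label",  # placeholder output attribute name
        "InputConfig": {
            "DataSource": {
                "S3DataSource": {"ManifestS3Uri": f"s3://{bucket}/input.manifest"}
            }
        },
        "OutputConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "RoleArn": role_arn,
        "LabelCategoryConfigS3Uri": f"s3://{bucket}/label-categories.json",
        # Optional: stop once this share of the input dataset is labeled.
        "StoppingConditions": {"MaxPercentageOfInputDatasetLabeled": 100},
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": f"s3://{bucket}/template.liquid"},
            "PreHumanTaskLambdaArn": pre_lambda_arn,
            "TaskTitle": "Example task title",
            "TaskDescription": "Example task description",
            "NumberOfHumanWorkersPerDataObject": 3,
            "TaskTimeLimitInSeconds": 300,
            "TaskAvailabilityLifetimeInSeconds": 21600,
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": consolidation_lambda_arn
            },
        },
    }


def submit(request):
    """Send the request; needs boto3 and AWS credentials at call time."""
    import boto3  # imported here so the request can be built without boto3

    client = boto3.client("sagemaker")
    return client.create_labeling_job(**request)


# Placeholder values only; replace them with resources from your own account.
request = build_request(
    job_name="example-labeling-job",
    bucket="my_bucket",
    role_arn="arn:aws:iam::111122223333:role/ExampleGroundTruthRole",
    workteam_arn="arn:aws:sagemaker:us-west-2:111122223333:workteam/private-crowd/example-team",
    pre_lambda_arn="arn:aws:lambda:us-west-2:111122223333:function:example-pre-annotation",
    consolidation_lambda_arn="arn:aws:lambda:us-west-2:111122223333:function:example-consolidation",
)
```

Calling submit(request) in an environment with Boto3 installed and credentials configured sends the request. For a Mechanical Turk job, the request would also need ContentClassifiers in DataAttributes of InputConfig and a PublicWorkforceTaskPrice in HumanTaskConfig, as described in the requirements list.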
For more information about this operation, see CreateLabelingJob. For information about how to use other language-specific SDKs, see the See Also section in the CreateLabelingJob topic.