How Amazon SageMaker Processes Training Output
As your algorithm runs in a container, it generates output that includes the status of the training job as well as model and output artifacts. Your algorithm should write this information to the following files and directories, which are located under the container's /opt/ml directory. Amazon SageMaker processes the information in these locations as follows:
- /opt/ml/model – Your algorithm should write all final model artifacts to this directory. SageMaker copies this data as a single object in compressed tar format to the S3 location that you specified in the CreateTrainingJob request. If multiple containers in a single training job write to this directory, they should ensure that no file or directory names clash. SageMaker aggregates the result in a tar file and uploads it to S3 at the end of the training job.
- /opt/ml/output/data – Your algorithm should write any artifacts that you want to store, other than the final model, to this directory. SageMaker copies this data as a single object in compressed tar format to the S3 location that you specified in the CreateTrainingJob request. If multiple containers in a single training job write to this directory, they should ensure that no file or directory names clash. SageMaker aggregates the result in a tar file and uploads it to S3 at the end of the training job.
- /opt/ml/output/failure – If training fails, after all algorithm output (for example, logging) completes, your algorithm should write the failure description to this file. In a DescribeTrainingJob response, SageMaker returns the first 1024 characters from this file as FailureReason.
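The three locations above can be exercised from a training entry point roughly as follows. This is a minimal sketch, not SageMaker's own tooling: the helper name run_training, the ML_ROOT environment variable, and the file names model.json and metrics.json are hypothetical and chosen only for illustration. Because only the first 1024 characters of the failure file are surfaced as FailureReason, the sketch puts a one-line summary ahead of the full traceback.

```python
import json
import os
import traceback
from pathlib import Path

# Root of the container layout described above. The ML_ROOT override is a
# hypothetical convenience for running this sketch outside a container.
ML_ROOT = Path(os.environ.get("ML_ROOT", "/opt/ml"))


def run_training(train_fn, ml_root=ML_ROOT):
    """Run train_fn and write its results where SageMaker looks for them.

    train_fn is a hypothetical callable returning (model, metrics), both
    JSON-serializable. Returns 0 on success and 1 on failure.
    """
    model_dir = ml_root / "model"              # final model artifacts
    data_dir = ml_root / "output" / "data"     # other artifacts to keep
    failure_file = ml_root / "output" / "failure"

    model_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)

    try:
        model, metrics = train_fn()
        # File names are illustrative; any non-clashing names would do.
        (model_dir / "model.json").write_text(json.dumps(model))
        (data_dir / "metrics.json").write_text(json.dumps(metrics))
        return 0
    except Exception as exc:
        # Summary line first, then the traceback, so the most useful part
        # falls within the first 1024 characters.
        summary = f"{type(exc).__name__}: {exc}"
        failure_file.write_text(summary + "\n" + traceback.format_exc())
        return 1
```

A real entry point would typically pass the return value to sys.exit so that a failed run ends with a nonzero exit code. Taking ml_root as a parameter keeps the function testable against a temporary directory.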