Amazon SageMaker
Developer Guide

Step 4.3: Transform the Training Dataset and Upload It to Amazon S3

The XGBoost algorithm expects its training input in comma-separated values (CSV) format, but the training dataset is currently in numpy.array format. Convert the dataset from numpy.array format to CSV format, and then upload it to the Amazon S3 bucket that you created in Step 1: Create an Amazon S3 Bucket.

To convert the dataset to CSV format and upload it

  • Type the following code into a cell in your notebook and then run the cell.

    %%time

    import struct
    import io
    import csv
    import boto3
    import numpy as np  # required for np.insert and np.savetxt

    def convert_data():
        # train_set, valid_set, test_set, bucket, and prefix come from earlier cells.
        data_partitions = [('train', train_set), ('validation', valid_set), ('test', test_set)]
        for data_partition_name, data_partition in data_partitions:
            print('{}: {} {}'.format(data_partition_name, data_partition[0].shape, data_partition[1].shape))
            labels = [t.tolist() for t in data_partition[1]]
            features = [t.tolist() for t in data_partition[0]]

            # Training and validation rows get the label as the first column;
            # the test partition is written with features only.
            if data_partition_name != 'test':
                examples = np.insert(features, 0, labels, axis=1)
            else:
                examples = features
            #print(examples[50000,:])

            # Write the partition to a local CSV file (overwritten on each iteration).
            np.savetxt('data.csv', examples, delimiter=',')

            # Upload the file to s3://<bucket>/<prefix>/<partition name>/examples.
            key = "{}/{}/examples".format(prefix, data_partition_name)
            url = 's3://{}/{}'.format(bucket, key)
            boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file('data.csv')
            print('Done writing to {}'.format(url))

    convert_data()

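    The code converts each partition (train, validation, and test) to CSV format and then uploads the resulting CSV file to the S3 bucket. For the train and validation partitions, the label is inserted as the first column of each row. The test partition is written without labels.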
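
The label-first layout that the conversion produces can be checked locally before anything is uploaded. The following is a minimal sketch, not part of the tutorial code: the toy `features` and `labels` arrays are made-up stand-ins for one data partition, and the output goes to an in-memory buffer rather than to data.csv or Amazon S3.

```python
import io

import numpy as np

# Hypothetical stand-ins for one partition: 3 examples with 4 features each.
features = np.array([[0.0, 0.5, 0.0, 1.0],
                     [0.25, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.75, 0.5]])
labels = np.array([5, 0, 4])

# Prepend the labels as column 0, the same call the tutorial code uses
# for the train and validation partitions.
examples = np.insert(features, 0, labels, axis=1)

# Write to an in-memory text buffer so nothing touches disk or S3.
buffer = io.StringIO()
np.savetxt(buffer, examples, delimiter=',')
csv_text = buffer.getvalue()

# Each CSV row now has one more field than the feature count,
# and the first field is the label.
first_row = csv_text.splitlines()[0]
fields = first_row.split(',')
print(examples.shape)
print(len(fields), float(fields[0]))
```

Because `np.savetxt` uses the format `'%.18e'` by default, every value, including the label, is written in scientific notation. Passing a different `fmt` argument would change the text representation but not the column order.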

    Next Step

    Step 5: Train a Model