
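Common Data Formats for Training

To prepare for training, you can preprocess your data using a variety of AWS services, including AWS Glue, Amazon EMR, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After preprocessing, publish the data to an Amazon S3 bucket. For training, the data must go through a series of conversions and transformations, including:

Training data serialization (handled by you)
Training data deserialization (handled by the algorithm)
Training model serialization (handled by the algorithm)
Trained model deserialization (optional, handled by you)

When you use Amazon SageMaker in the training portion of the algorithm, make sure to upload all data at once. If more data is added to that location, a new training call is needed to build a brand new model.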

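Content Types Supported by Built-In Algorithms

The following table lists some of the commonly supported ContentType values and the algorithms that use them:

ContentTypes for Built-in Algorithms

ContentType	Algorithm
application/x-image	Object Detection Algorithm, Semantic Segmentation
application/x-recordio	Object Detection Algorithm
application/x-recordio-protobuf	Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence
application/jsonlines	BlazingText, DeepAR
image/jpeg	Object Detection Algorithm, Semantic Segmentation
image/png	Object Detection Algorithm, Semantic Segmentation
text/csv	IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost
text/libsvm	XGBoost

For a summary of the parameters used by each algorithm, see the documentation for the individual algorithms or this table.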

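Using Pipe Mode

In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3). Streaming can provide faster start times for training jobs and better throughput. This is in contrast to File mode, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses disk space to store both your final model artifacts and your full training dataset. By streaming your data directly from Amazon S3 in Pipe mode, you reduce the size of the Amazon Elastic Block Store volumes of your training instances. Pipe mode needs only enough disk space to store your final model artifacts. See AlgorithmSpecification for additional details on the training input mode.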
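
As a minimal sketch, the input mode is chosen through the TrainingInputMode field of AlgorithmSpecification in a create_training_job request. The helper name and the image URI below are hypothetical placeholders, not values from this page.

```python
# Sketch of the AlgorithmSpecification portion of a create_training_job request.
# The helper and the TrainingImage placeholder are illustrative assumptions.
def algorithm_specification(training_image, use_pipe_mode=True):
    """Build an AlgorithmSpecification dict for Pipe or File input mode."""
    return {
        "TrainingImage": training_image,
        "TrainingInputMode": "Pipe" if use_pipe_mode else "File",
    }


# Placeholder image URI; substitute the image for your algorithm and Region.
image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/<algorithm>:<tag>"

pipe_spec = algorithm_specification(image_uri)
file_spec = algorithm_specification(image_uri, use_pipe_mode=False)
print(pipe_spec["TrainingInputMode"], file_spec["TrainingInputMode"])
```

The resulting dict would be passed as the AlgorithmSpecification argument of the training request; everything else in the request is unchanged between the two modes.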

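Using CSV Format

Many Amazon SageMaker algorithms support training with data in CSV format. To use data in CSV format for training, in the input data channel specification, specify text/csv as the ContentType. Amazon SageMaker requires that a CSV file does not have a header record and that the target variable is in the first column. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type. For example, in this case 'content_type=text/csv;label_size=0'. For more information, see Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker built-in algorithms.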
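
The CSV layout described above (no header record, target in the first column) can be sketched with the Python standard library. The toy rows and the helper name are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical toy dataset: each row is (target, feature_1, feature_2).
rows = [
    (1, 0.5, 2.0),
    (0, 1.5, 3.0),
]


def to_headerless_csv(records):
    """Serialize records as CSV with no header row, target in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


csv_body = to_headerless_csv(rows)

# ContentType strings for a supervised channel and for an unsupervised
# channel that has no label column.
supervised_content_type = "text/csv"
unsupervised_content_type = "text/csv;label_size=0"

print(csv_body)
```

The string in csv_body is what would be uploaded to Amazon S3; the matching ContentType is set on the input data channel.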

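Using RecordIO Format

In the protobuf recordIO format, SageMaker converts each observation in the dataset into a binary representation as a set of 4-byte floats, then loads it in the protobuf values field. If you are using Python for your data preparation, we strongly recommend that you use these existing transformations. However, if you are using another language, the protobuf definition file below provides the schema that you use to convert your data into the SageMaker protobuf format.

Note

For an example that shows how to convert the commonly used numPy array into the protobuf recordIO format, see An Introduction to Factorization Machines with MNIST.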
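
To make the "4-byte floats" wording concrete, the sketch below packs one observation as 32-bit floats with the standard struct module. It illustrates only the per-value width; it does not produce the recordIO framing or the protobuf encoding, and the helper name is hypothetical.

```python
import struct


def pack_float32(values):
    """Pack a sequence of numbers as little-endian 4-byte floats."""
    return struct.pack("<%df" % len(values), *values)


# Hypothetical observation with three features.
observation = [1.0, 2.5, -3.0]

packed = pack_float32(observation)
restored = list(struct.unpack("<%df" % len(observation), packed))

# Each feature occupies exactly 4 bytes.
print(len(packed), restored)
```

Values that cannot be represented exactly in 32 bits lose precision in this round trip, which is worth keeping in mind when the source data is float64.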


syntax = "proto2";

 package aialgs.data;

 option java_package = "com.amazonaws.aialgorithms.proto";
 option java_outer_classname = "RecordProtos";

 // A sparse or dense rank-R tensor that stores data as floats (float32).
 message Float32Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated float values = 1 [packed = true];

     // If keys is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as doubles (float64).
 message Float64Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated double values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
 message Int32Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated int32 values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // Support for storing binary data for parsing in other ways (such as JPEG/etc).
 // This is an example of another type of value and may not immediately be supported.
 message Bytes {
     repeated bytes value = 1;

     // If the content type of the data is known, stores it.
     // This allows for the possibility of using decoders for common formats
     // in the future.
     optional string content_type = 2;
 }

 message Value {
     oneof value {
         // The numbering assumes the possible use of:
         // - float16, float128
         // - int8, int16, int32
         Float32Tensor float32_tensor = 2;
         Float64Tensor float64_tensor = 3;
         Int32Tensor int32_tensor = 7;
         Bytes bytes = 9;
     }
 }

 message Record {
     // Map from the name of the feature to the value.
     //
     // For vectors and libsvm-like datasets,
     // a single feature with the name `values`
     // should be specified.
     map<string, Value> features = 1;

     // An optional set of labels for this record.
     // Similar to the features field above, the key used for
     // generic scalar / vector labels should be 'values'.
     map<string, Value> label = 2;

     // A unique identifier for this record in the dataset.
     //
     // Whilst not necessary, this allows better
     // debugging where there are data issues.
     //
     // This is not used by the algorithm directly.
     optional string uid = 3;

     // Textual metadata describing the record.
     //
     // This may include JSON-serialized information
     // about the source of the record.
     //
     // This is not used by the algorithm directly.
     optional string metadata = 4;

     // An optional serialized JSON object that allows per-record
     // hyper-parameters/configuration/other information to be set.
     //
     // The meaning/interpretation of this field is defined by
     // the algorithm author and may not be supported.
     //
     // This is used to pass additional inference configuration
     // when batch inference is used (e.g. types of scores to return).
     optional string configuration = 5;
 }
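
The keys and shape comments in the tensor messages imply row-major indexing for a two-dimensional shape. The following sketch, with hypothetical helper names, shows that arithmetic and how a small dense matrix could be reduced to the values, keys, and shape lists.

```python
def key_to_row_col(key, shape):
    """Map a flat key to (row, column) for a 2-D shape in row-major order."""
    n_rows, n_cols = shape
    if not 0 <= key < n_rows * n_cols:
        raise ValueError("key out of range for shape")
    return divmod(key, n_cols)


def dense_to_sparse(matrix):
    """Return (values, keys, shape) holding only the non-zero entries."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    values, keys = [], []
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v != 0:
                values.append(float(v))
                keys.append(r * n_cols + c)
    return values, keys, [n_rows, n_cols]


# With shape [10, 20], key 45 is row 45 // 20 and column 45 % 20.
print(key_to_row_col(45, (10, 20)))  # (2, 5)
print(dense_to_sparse([[0, 1.5], [2, 0]]))
```

These lists correspond to the values, keys, and shape fields of a tensor message; building the actual protobuf record is left to the existing transformations mentioned earlier.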

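After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker can access and that can be passed as part of InputDataConfig in create_training_job.

Note

For all Amazon SageMaker algorithms, the ChannelName in InputDataConfig must be set to train. Some algorithms also support validation or test input channels. These are typically used to evaluate the model's performance by using a hold-out dataset. Hold-out datasets are not used in the initial training but can be used to further tune the model.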
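
A minimal sketch of an InputDataConfig entry for the train channel follows. The helper name and the bucket URI are hypothetical; the S3DataType and S3DataDistributionType values shown are one common choice, not a requirement stated on this page.

```python
def train_channel(s3_uri, content_type):
    """Build one InputDataConfig entry whose ChannelName is 'train'."""
    return {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
    }


# Hypothetical S3 location holding the serialized training data.
input_data_config = [
    train_channel(
        "s3://amzn-s3-demo-bucket/train/",
        "application/x-recordio-protobuf",
    )
]
print(input_data_config[0]["ChannelName"])
```

The list would be passed as the InputDataConfig argument of create_training_job; a validation or test channel, where supported, would be an additional entry with a different ChannelName.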

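Trained Model Deserialization

Amazon SageMaker models are stored as model.tar.gz in the S3 bucket specified in the OutputDataConfig S3OutputPath parameter of the create_training_job call. The S3 bucket must be in the same AWS Region as the notebook instance. You can specify most of these model artifacts when creating a hosting model. You can also open and review them in your notebook instance. When model.tar.gz is untarred, it contains model_algo-1, which is a serialized Apache MXNet object. For example, you can use the following to load the k-means model into memory and view it: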


import mxnet as mx

# Load the serialized MXNet object from the extracted artifact and print it.
print(mx.ndarray.load('model_algo-1'))
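
The load call above assumes that model.tar.gz has already been downloaded and untarred. A minimal sketch of the untar step with the standard tarfile module is shown here; the helper name and the paths in the usage note are hypothetical.

```python
import tarfile


def extract_model(archive_path, target_dir):
    """Extract a gzip-compressed model archive and return its member names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names
```

For example, extract_model("model.tar.gz", ".") would place model_algo-1 in the current directory, where the MXNet snippet expects to find it. Only extract archives that come from a training job you trust.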
