CSV ファイルからのマニフェストファイルの作成

この Python スクリプト例では、カンマ区切り値 (CSV) ファイルを使用してイメージにラベルを付けることにより、マニフェストファイルの作成を簡素化しています。ユーザーによって CSV ファイルを作成します。マニフェストファイルはマルチラベルイメージ分類またはマルチラベルイメージ分類に適しています。詳細については、「オブジェクト、シーン、概念を検出する」を参照してください。

注記

このスクリプトでは、オブジェクトの位置やブランドの位置の検索に適したマニフェストファイルは作成されません。

マニフェストファイルには、モデルのトレーニングに使用されるイメージが記述されています。例えば、イメージの位置やイメージに割り当てられたラベルなどです。マニフェストファイルは、1 行以上の JSON 行で構成されます。各 JSON Lines では 1 つの画像について記述されます。詳細については、「マニフェストファイル内のイメージレベルのラベル」を参照してください。

CSV ファイルは、テキストファイル内の複数行にわたる表形式のデータを表します。行内のフィールドはカンマで区切られます。詳細については、「カンマ区切り値」を参照してください。このスクリプトでは、CSV ファイルの各行が 1 つのイメージを表し、マニフェストファイルの JSON 行にマッピングされます。マルチラベルイメージ分類をサポートするマニフェストファイル用の CSV ファイルを作成するには、各行に 1 つ以上のイメージレベルのラベルを追加します。画像分類に適したマニフェストファイルを作成するには、各行に 1 つのイメージレベルのラベルを追加します。

例えば、次の CSV ファイルにはマルチラベルイメージ分類 (花) 入門プロジェクトのイメージが記述されています。


camellia1.jpg,camellia,with_leaves
camellia2.jpg,camellia,with_leaves
camellia3.jpg,camellia,without_leaves
helleborus1.jpg,helleborus,without_leaves,not_fully_grown
helleborus2.jpg,helleborus,with_leaves,fully_grown
helleborus3.jpg,helleborus,with_leaves,fully_grown
jonquil1.jpg,jonquil,with_leaves
jonquil2.jpg,jonquil,with_leaves
jonquil3.jpg,jonquil,with_leaves
jonquil4.jpg,jonquil,without_leaves
mauve_honey_myrtle1.jpg,mauve_honey_myrtle,without_leaves
mauve_honey_myrtle2.jpg,mauve_honey_myrtle,with_leaves
mauve_honey_myrtle3.jpg,mauve_honey_myrtle,with_leaves
mediterranean_spurge1.jpg,mediterranean_spurge,with_leaves
mediterranean_spurge2.jpg,mediterranean_spurge,without_leaves

このスクリプトは、各行に JSON 行を生成します。例えば、以下は最初の行の JSON 行 (camellia1.jpg,camellia,with_leaves) です。


{"source-ref": "s3://bucket/flowers/train/camellia1.jpg","camellia": 1,"camellia-metadata":{"confidence": 1,"job-name": "labeling-job/camellia","class-name": "camellia","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"},"with_leaves": 1,"with_leaves-metadata":{"confidence": 1,"job-name": "labeling-job/with_leaves","class-name": "with_leaves","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"}}

CSV の例では、イメージへの Amazon S3 パスは存在しません。CSV ファイルにイメージの Amazon S3 パスが含まれていない場合は、--s3_path コマンドライン引数を使用してイメージへの Amazon S3 パスを指定します。

このスクリプトは、各イメージの最初のエントリを重複排除されたイメージ CSV ファイルに記録します。重複排除されたイメージ CSV ファイルには、入力 CSV ファイルで見つかった各イメージのインスタンスが含まれています。入力 CSV ファイル内でイメージがさらに見つかると、重複イメージ CSV ファイルに記録されます。スクリプトによって重複イメージが検知された場合は、重複イメージ CSV ファイルを確認し、必要に応じて重複排除されたイメージ CSV ファイルを更新します。重複排除されたファイルを使用してスクリプトを再度実行します。入力 CSV ファイルに重複が見つからなかった場合、スクリプトは重複排除されたイメージ CSV ファイルと重複イメージ CSV ファイルが空として削除します。

この手順では、CSV ファイルを作成し、Python スクリプトを実行してマニフェストファイルを作成します。

CSV ファイルからマニフェストファイルを作成するには

各行に以下のフィールドを含む CSV ファイルを作成します (1 画像ごとに 1 行)。CSV ファイルにはヘッダー行を追加しないでください。

フィールド 1	フィールド 2	フィールド n
イメージ名または Amazon S3 へのイメージのパス。例えば `s3://my-bucket/flowers/train/camellia1.jpg` です。Amazon S3 パスのイメージと、パスがないイメージを混在させることはできません。	イメージの最初のイメージレベルのラベル。	カンマで区切られた 1 つ以上の追加のイメージレベルのラベル。マルチラベルイメージ分類をサポートするマニフェストファイルを作成する場合にのみ追加してください。

例えば、camellia1.jpg,camellia,with_leaves、s3://my-bucket/flowers/train/camellia1.jpg,camellia,with_leaves などです。

CSV ファイルを保存します。

以下の Python スクリプトを実行します。以下の情報を提供します:

csv_file - ステップ 1 で作成した CSV ファイル。
manifest_file - 作成するマニフェストファイルの名前。
(オプション) --s3_path s3://path_to_folder/ - イメージファイル名に追加する Amazon S3 パス (フィールド 1)。この段階でフィールド 1 のイメージに S3 パスが含まれていない場合は、--s3_path を使用します。


# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier:  Apache-2.0

from datetime import datetime, timezone
import argparse
import logging
import csv
import os
import json

"""
Purpose
Amazon Rekognition Custom Labels model example used in the service documentation.
Shows how to create an image-level (classification) manifest file from a CSV file.
You can specify multiple image level labels per image.
CSV file format is
image,label,label,..
If necessary, use the bucket argument to specify the S3 bucket folder for the images.
https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/md-gt-cl-transform.html
"""

logger = logging.getLogger(__name__)


def check_duplicates(csv_file, deduplicated_file, duplicates_file):
    """
    Checks for duplicate images in a CSV file. If duplicate images
    are found, deduplicated_file is the deduplicated CSV file - only the first
    occurence of a duplicate is recorded. Other duplicates are recorded in duplicates_file.
    :param csv_file: The source CSV file.
    :param deduplicated_file: The deduplicated CSV file to create. If no duplicates are found
    this file is removed.
    :param duplicates_file: The duplicate images CSV file to create. If no duplicates are found
    this file is removed.
    :return: True if duplicates are found, otherwise false.
    """

    logger.info("Deduplicating %s", csv_file)

    duplicates_found = False

    # Find duplicates.
    with open(csv_file, 'r', newline='', encoding="UTF-8") as f,\
            open(deduplicated_file, 'w', encoding="UTF-8") as dedup,\
            open(duplicates_file, 'w', encoding="UTF-8") as duplicates:

        reader = csv.reader(f, delimiter=',')
        dedup_writer = csv.writer(dedup)
        duplicates_writer = csv.writer(duplicates)

        entries = set()
        for row in reader:
            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            key = row[0]
            if key not in entries:
                dedup_writer.writerow(row)
                entries.add(key)
            else:
                duplicates_writer.writerow(row)
                duplicates_found = True

    if duplicates_found:
        logger.info("Duplicates found check %s", duplicates_file)

    else:
        os.remove(duplicates_file)
        os.remove(deduplicated_file)

    return duplicates_found


def create_manifest_file(csv_file, manifest_file, s3_path):
    """
    Reads a CSV file and creates a Custom Labels classification manifest file.
    :param csv_file: The source CSV file.
    :param manifest_file: The name of the manifest file to create.
    :param s3_path: The S3 path to the folder that contains the images.
    """
    logger.info("Processing CSV file %s", csv_file)

    image_count = 0
    label_count = 0

    with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
            open(manifest_file, "w", encoding="UTF-8") as output_file:

        image_classifications = csv.reader(
            csvfile, delimiter=',', quotechar='|')

        # Process each row (image) in CSV file.
        for row in image_classifications:
            source_ref = str(s3_path)+row[0]

            image_count += 1

            # Create JSON for image source ref.
            json_line = {}
            json_line['source-ref'] = source_ref

            # Process each image level label.
            for index in range(1, len(row)):
                image_level_label = row[index]

                # Skip empty columns.
                if image_level_label == '':
                    continue
                label_count += 1

               # Create the JSON line metadata.
                json_line[image_level_label] = 1
                metadata = {}
                metadata['confidence'] = 1
                metadata['job-name'] = 'labeling-job/' + image_level_label
                metadata['class-name'] = image_level_label
                metadata['human-annotated'] = "yes"
                metadata['creation-date'] = \
                    datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
                metadata['type'] = "groundtruth/image-classification"

                json_line[f'{image_level_label}-metadata'] = metadata

                # Write the image JSON Line.
            output_file.write(json.dumps(json_line))
            output_file.write('\n')

    output_file.close()
    logger.info("Finished creating manifest file %s\nImages: %s\nLabels: %s",
                manifest_file, image_count, label_count)

    return image_count, label_count


def add_arguments(parser):
    """
    Adds command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "csv_file", help="The CSV file that you want to process."
    )

    parser.add_argument(
        "--s3_path", help="The S3 bucket and folder path for the images."
        " If not supplied, column 1 is assumed to include the S3 path.", required=False
    )


def main():

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    try:

        # Get command line arguments
        parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
        add_arguments(parser)
        args = parser.parse_args()

        s3_path = args.s3_path
        if s3_path is None:
            s3_path = ''

        # Create file names.
        csv_file = args.csv_file
        file_name = os.path.splitext(csv_file)[0]
        manifest_file = f'{file_name}.manifest'
        duplicates_file = f'{file_name}-duplicates.csv'
        deduplicated_file = f'{file_name}-deduplicated.csv'

        # Create manifest file, if there are no duplicate images.
        if check_duplicates(csv_file, deduplicated_file, duplicates_file):
            print(f"Duplicates found. Use {duplicates_file} to view duplicates "
                  f"and then update {deduplicated_file}. ")
            print(f"{deduplicated_file} contains the first occurence of a duplicate. "
                  "Update as necessary with the correct label information.")
            print(f"Re-run the script with {deduplicated_file}")
        else:
            print("No duplicates found. Creating manifest file.")

            image_count, label_count = create_manifest_file(csv_file,
                                                            manifest_file,
                                                            s3_path)

            print(f"Finished creating manifest file: {manifest_file} \n"
                  f"Images: {image_count}\nLabels: {label_count}")

    except FileNotFoundError as err:
        logger.exception("File not found: %s", err)
        print(f"File not found: {err}. Check your input CSV file.")


if __name__ == "__main__":
    main()

テストデータセットを使用する場合は、ステップ 1～3 を繰り返してテストデータセットのマニフェストファイルを作成します。
必要に応じて、CSV ファイルの列 1 で指定した (--s3_pathまたはコマンドラインで指定した) Amazon S3 バケットパスにイメージをコピーします。次の AWS S3 コマンドを使用できます。
```
aws s3 cp --recursive your-local-folder s3://your-target-S3-location
```
マニフェストファイルを保存するために使用する Amazon S3 バケットにマニフェストファイルをアップロードします。

注記
Amazon Rekognition Custom Labels が、マニフェストファイルの JSON 行の source-ref フィールドで参照されている Amazon S3 バケットにアクセスできることを確認してください。詳細については、「外部の Amazon S3 バケットへのアクセス」を参照してください。Ground Truth ジョブが Amazon Rekognition Custom Labels コンソールバケットにイメージを保存する場合、アクセス許可を追加する必要はありません。
「 SageMaker Ground Truth マニフェストファイルを使用したデータセットの作成 (コンソール）」の手順に従って、アップロードしたマニフェストファイルを使用してデータセットを作成します。ステップ 8 では、[マニフェストファイルの場所] に、マニフェストファイルの場所として Amazon S3 URL を入力します。 AWS SDK を使用している場合は、を実行します SageMaker Ground Truth マニフェストファイル (SDK) を使用したデータセットの作成。

ブラウザで JavaScript が無効になっているか、使用できません。

AWS ドキュメントを使用するには、JavaScript を有効にする必要があります。手順については、使用するブラウザのヘルプページを参照してください。

ドキュメントの表記規則

マルチラベルの Ground Truth マニフェストファイルの変換

既存のデータセット