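Creating a classification manifest file from a CSV file

This example Python script simplifies the creation of a classification manifest file by using a comma-separated values (CSV) file to label images. You create the CSV file.

A manifest file describes the images used to train a model. A manifest file is made up of one or more JSON lines. Each JSON line describes a single image. For more information, see Defining JSON lines for image classification.

A CSV file represents tabular data over multiple rows in a text file. Fields on a row are separated by commas. For more information, see comma-separated values. For this script, each row in your CSV file includes the S3 location of an image and the anomaly classification for the image (normal or anomaly). Each row maps to one JSON line in the manifest file.

For example, the following CSV file describes some of the images in the example images.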


s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train-normal_1.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_10.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_11.jpg,normal

The script generates a JSON line for each row. For example, the following is the JSON line for the first row (s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly).


{"source-ref": "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg","anomaly-label": 1,"anomaly-label-metadata": {"confidence": 1,"job-name": "labeling-job/anomaly-classification","class-name": "anomaly","human-annotated": "yes","creation-date": "2022-02-04T22:47:07","type": "groundtruth/image-classification"}}
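
As a rough illustration of how one CSV row maps to one JSON line, the following sketch builds the same fields that appear in the example above. The function name `row_to_json_line` is hypothetical and is not part of the script in this topic; the field names and fixed metadata values are taken from the example JSON line.

```python
import json
from datetime import datetime, timezone


def row_to_json_line(image_location, classification):
    """Build one manifest JSON line from one CSV row (hypothetical helper)."""
    label = classification.lower()
    if label not in ("normal", "anomaly"):
        raise ValueError(f"Unexpected classification: {classification}")

    entry = {
        "source-ref": image_location,
        # 1 marks an anomalous image, 0 a normal one.
        "anomaly-label": 1 if label == "anomaly" else 0,
        "anomaly-label-metadata": {
            "confidence": 1,
            "job-name": "labeling-job/anomaly-classification",
            "class-name": label,
            "human-annotated": "yes",
            "creation-date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"),
            "type": "groundtruth/image-classification",
        },
    }
    return json.dumps(entry)


line = row_to_json_line(
    "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg", "anomaly")
print(line)
```

The creation date reflects the time at which the line is generated, so it differs from the date shown in the example.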

If your CSV file doesn't include the Amazon S3 path for the images, use the --s3-path command line argument to specify the Amazon S3 path to the images.
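
The script in this topic builds each image location by concatenating the supplied S3 path and the value in field 1, so the path you supply should end with a forward slash (/). The following sketch shows one way the prefix could be applied more defensively. The helper `add_s3_prefix` is hypothetical and only illustrates the idea.

```python
def add_s3_prefix(s3_path, image_name):
    """Prepend an S3 folder path to an image name (hypothetical helper)."""
    if not s3_path:
        # No prefix supplied: field 1 is assumed to hold the full S3 path already.
        return image_name
    # Make sure exactly one "/" separates the folder path from the file name.
    return s3_path.rstrip("/") + "/" + image_name


print(add_s3_prefix("s3://s3bucket/circuitboard/train/anomaly", "train-anomaly_1.jpg"))
print(add_s3_prefix("s3://s3bucket/circuitboard/train/anomaly/", "train-anomaly_1.jpg"))
```

Both calls produce the same location, whether or not the folder path ends with a slash.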

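Before creating the manifest file, the script checks the CSV file for duplicate images and for any image classification that isn't normal or anomaly. If it finds duplicate images or image classification errors, the script does the following:

Records the first entry for each image in a deduplicated CSV file.
Records duplicate occurrences of an image in the errors file.
Records image classifications that aren't normal or anomaly in the errors file.
Doesn't create a manifest file.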
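
The checks listed above can be sketched in memory as follows. This is a simplified illustration and not the script itself: the helper `split_rows` is hypothetical, and it works on a string rather than on files. Like the script in this topic, it keeps the first entry for an image in the deduplicated rows even when that entry has an invalid classification.

```python
import csv
import io

VALID_CLASSIFICATIONS = {"normal", "anomaly"}


def split_rows(csv_text):
    """Return (deduplicated_rows, errors) for CSV text (hypothetical helper)."""
    seen = set()
    deduplicated, errors = [], []
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not "".join(row).strip():
            continue  # Skip blank lines.
        image, label = row[0], row[1]
        if label.lower() not in VALID_CLASSIFICATIONS:
            errors.append((line_number, image, label, "INVALID_CLASSIFICATION"))
        if image in seen:
            errors.append((line_number, image, label, "DUPLICATE"))
        else:
            seen.add(image)
            deduplicated.append(row)
    return deduplicated, errors


sample = (
    "image_1.jpg,anomaly\n"
    "image_1.jpg,normal\n"
    "image_2.jpg,broken\n"
)
rows, problems = split_rows(sample)
print(rows)
print(problems)
```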

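The errors file contains the line number in the input CSV file at which each duplicate image or classification error was found. Use the errors CSV file to update the input CSV file, and then run the script again. Alternatively, use the errors CSV file to update the deduplicated CSV file, which contains only one entry per image, and then rerun the script with the updated deduplicated CSV file. Note that the script as listed copies the first entry for an image into the deduplicated file even if its classification is invalid, so correct those rows as well.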
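
When the errors file is long, grouping its rows by error type can make it easier to review. The following sketch assumes the row layout that the script in this topic writes: line number, image, classification, and error type (DUPLICATE or INVALID_CLASSIFICATION). The helper `group_errors` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_errors(errors_csv_text):
    """Group errors-file rows by error type (hypothetical helper)."""
    by_type = defaultdict(list)
    for row in csv.reader(io.StringIO(errors_csv_text)):
        if not row:
            continue  # Skip blank lines.
        line_number, image, classification, error_type = row
        by_type[error_type].append((int(line_number), image, classification))
    return dict(by_type)


sample_errors = (
    "2,image_1.jpg,normal,DUPLICATE\n"
    "3,image_2.jpg,broken,INVALID_CLASSIFICATION\n"
)
grouped = group_errors(sample_errors)
for error_type, entries in grouped.items():
    print(error_type, entries)
```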

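If no duplicates or errors are found in the input CSV file, the script deletes the deduplicated image CSV file and the errors file, because they are empty.

In this procedure, you create the CSV file and run the Python script to create the manifest file. The script has been tested with Python version 3.7.

To create a manifest file from a CSV file

Create a CSV file with the following fields in each row (one row per image). Don't add a header row to the CSV file.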

Field 1	Field 2
The image name or the Amazon S3 path of the image. For example, `s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg`. You can't mix images that have an Amazon S3 path with images that don't.	The anomaly classification for the image (`normal` or `anomaly`).

For example, s3://s3bucket/circuitboard/train/anomaly/image_10.jpg,anomaly or image_11.jpg,normal
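
The script in this topic doesn't check whether field 1 mixes S3 paths with bare image names. If you want to check this before running the script, something like the following sketch could be used. The helper `uses_s3_paths` is hypothetical.

```python
def uses_s3_paths(image_fields):
    """Return True if every field 1 value is an S3 path and False if none is.

    Raises ValueError when the values are mixed (hypothetical helper).
    """
    flags = [field.startswith("s3://") for field in image_fields]
    if all(flags):
        return True
    if not any(flags):
        return False
    raise ValueError("Field 1 mixes S3 paths and image names without S3 paths.")


print(uses_s3_paths(["s3://s3bucket/circuitboard/train/anomaly/image_10.jpg"]))
print(uses_s3_paths(["image_10.jpg", "image_11.jpg"]))
```

A result of False indicates that the CSV file contains bare image names, in which case the --s3-path argument is needed.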

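Save the CSV file.

Run the following Python script. Supply the following arguments:

csv_file: The CSV file that you created in step 1.
(Optional) --s3-path s3://path_to_folder/: The Amazon S3 path to add to the image file names (field 1). Use --s3-path only if the images in field 1 don't already contain an S3 path.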


# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier:  Apache-2.0
"""
Purpose
Shows how to create an Amazon Lookout for Vision manifest file from a CSV file.
The CSV file format is image location,anomaly classification (normal or anomaly)
For example:
s3://s3bucket/circuitboard/train/anomaly/train_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train_1.jpg,normal

If necessary, use the S3 path command line argument to specify the Amazon S3 folder that contains the images.
"""

from datetime import datetime, timezone
import argparse
import logging
import csv
import os
import json

logger = logging.getLogger(__name__)


def check_errors(csv_file):
    """
    Checks for duplicate images and incorrect classifications in a CSV file.
    If duplicate images or invalid anomaly assignments are found, an errors CSV file
    and deduplicated CSV file are created. Only the first
    occurrence of a duplicate is recorded. Other duplicates are recorded in the errors file.
    :param csv_file: The source CSV file
    :return: True if errors or duplicates are found, otherwise false.
    """

    logger.info("Checking %s.", csv_file)

    errors_found = False
    errors_file = f"{os.path.splitext(csv_file)[0]}_errors.csv"
    deduplicated_file = f"{os.path.splitext(csv_file)[0]}_deduplicated.csv"

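    # newline='' lets the csv module control line endings on every platform.
    with open(csv_file, 'r', newline='', encoding="UTF-8") as input_file,\
            open(deduplicated_file, 'w', newline='', encoding="UTF-8") as dedup,\
            open(errors_file, 'w', newline='', encoding="UTF-8") as errors: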

        reader = csv.reader(input_file, delimiter=',')
        dedup_writer = csv.writer(dedup)
        error_writer = csv.writer(errors)
        entries = set()
        # enumerate keeps line numbers in step with the input file,
        # even when blank lines are skipped.
        for line, row in enumerate(reader, start=1):

            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            # Record any incorrect classifications.
            if row[1].lower() not in ("normal", "anomaly"):
                error_writer.writerow(
                    [line, row[0], row[1], "INVALID_CLASSIFICATION"])
                errors_found = True

            # Write first image entry to dedup file and record duplicates.
            key = row[0]
            if key not in entries:
                dedup_writer.writerow(row)
                entries.add(key)
            else:
                error_writer.writerow([line, row[0], row[1], "DUPLICATE"])
                errors_found = True

    if errors_found:
        logger.info("Errors found. Check %s.", errors_file)
    else:
        os.remove(errors_file)
        os.remove(deduplicated_file)

    return errors_found


def create_manifest_file(csv_file, manifest_file, s3_path):
    """
    Read a CSV file and create an Amazon Lookout for Vision classification manifest file.
    :param csv_file: The source CSV file.
    :param manifest_file: The name of the manifest file to create.
    :param s3_path: The Amazon S3 path to the folder that contains the images.
    """
    logger.info("Processing CSV file %s.", csv_file)

    image_count = 0
    anomalous_count = 0

    with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
            open(manifest_file, "w", encoding="UTF-8") as output_file:

        # Use the same CSV settings as check_errors so both functions parse rows identically.
        image_classifications = csv.reader(csvfile, delimiter=',')

        # Process each row (image) in the CSV file.
        for row in image_classifications:
            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            source_ref = str(s3_path) + row[0]
            classification = 0

            if row[1].lower() == 'anomaly':
                classification = 1
                anomalous_count += 1

            # Create the JSON line.
            json_line = {}
            json_line['source-ref'] = source_ref
            # Write the label as a number (1 or 0), as in the example JSON line.
            json_line['anomaly-label'] = classification

            metadata = {}
            metadata['confidence'] = 1
            metadata['job-name'] = "labeling-job/anomaly-classification"
            # Normalize case so the class name is always "normal" or "anomaly".
            metadata['class-name'] = row[1].lower()
            metadata['human-annotated'] = "yes"
            metadata['creation-date'] = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
            metadata['type'] = "groundtruth/image-classification"

            json_line['anomaly-label-metadata'] = metadata

            output_file.write(json.dumps(json_line))
            output_file.write('\n')
            image_count += 1

    logger.info("Finished creating manifest file %s.\n"
                "Images: %s\nAnomalous: %s",
                manifest_file,
                image_count,
                anomalous_count)
    return image_count, anomalous_count


def add_arguments(parser):
    """
    Add command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "csv_file", help="The CSV file that you want to process."
    )

    # Accept both --s3-path (as documented) and --s3_path.
    parser.add_argument(
        "--s3-path", "--s3_path", dest="s3_path",
        help="The Amazon S3 bucket and folder path for the images."
        " If not supplied, column 1 is assumed to include the Amazon S3 path.", required=False
    )


def main():

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    try:

        # Get command line arguments.
        parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
        add_arguments(parser)
        args = parser.parse_args()
        s3_path = args.s3_path
        if s3_path is None:
            s3_path = ""

        csv_file = args.csv_file
        csv_file_no_extension = os.path.splitext(csv_file)[0]
        manifest_file = csv_file_no_extension + '.manifest'

        # Create manifest file if there are no duplicate images.
        if check_errors(csv_file):
            print(f"Issues found. Use {csv_file_no_extension}_errors.csv "
                  "to view duplicates and errors.")
            print(f"{csv_file_no_extension}_deduplicated.csv contains the first "
                  "occurrence of a duplicate.\n"
                  "Update as necessary with the correct information.")
            print(f"Re-run the script with {csv_file_no_extension}_deduplicated.csv")
        else:
            print('No duplicates found. Creating manifest file.')

            image_count, anomalous_count = create_manifest_file(csv_file, manifest_file, s3_path)

            print(f"Finished creating manifest file: {manifest_file} \n")

            normal_count = image_count-anomalous_count
            print(f"Images processed: {image_count}")
            print(f"Normal: {normal_count}")
            print(f"Anomalous: {anomalous_count}")

    except FileNotFoundError as err:
        logger.exception("File not found: %s", err)
        print(f"File not found: {err}. Check your input CSV file.")

if __name__ == "__main__":
    main()
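
After the script finishes, a quick sanity check is to count the entries in the manifest file by class and compare the totals with the counts that the script prints. The following is a minimal sketch of such a check. The helper `count_classes` is hypothetical, and the sample lines are shortened stand-ins for real manifest lines.

```python
import json
from collections import Counter


def count_classes(manifest_lines):
    """Count manifest entries per class name (hypothetical helper)."""
    counts = Counter()
    for line in manifest_lines:
        if not line.strip():
            continue  # Ignore blank lines.
        entry = json.loads(line)
        counts[entry["anomaly-label-metadata"]["class-name"]] += 1
    return counts


# Two shortened JSON lines standing in for a real manifest file.
sample_manifest = [
    '{"source-ref": "s3://s3bucket/a.jpg", "anomaly-label": 1, '
    '"anomaly-label-metadata": {"class-name": "anomaly"}}',
    '{"source-ref": "s3://s3bucket/b.jpg", "anomaly-label": 0, '
    '"anomaly-label-metadata": {"class-name": "normal"}}',
]
counts = count_classes(sample_manifest)
print(f"Normal: {counts['normal']}, Anomalous: {counts['anomaly']}")
```

To check a real file, pass an open file object instead of the list. The script names the manifest file after the CSV file, with a .manifest extension.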

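If duplicate images or classification errors occur:
1. Use the errors file to update the deduplicated CSV file or the input CSV file.
2. Run the script again with the updated deduplicated CSV file or the updated input CSV file.
If you plan to use a test dataset, repeat steps 1-4 to create a manifest file for your test dataset.
If necessary, copy the images from your computer to the Amazon S3 bucket path that you specified in column 1 of the CSV file (or in the --s3-path command line argument). To copy the images, enter the following command at the command prompt.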
```
aws s3 cp --recursive your-local-folder/ s3://your-target-S3-location/
```
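
If you prefer to copy the images from Python, the following sketch shows one possible approach with the AWS SDK for Python (Boto3). It assumes that Boto3 is installed and that AWS credentials are configured. The helpers `plan_uploads` and `upload_folder` are hypothetical; only `upload_folder` contacts Amazon S3.

```python
import os
from urllib.parse import urlparse


def plan_uploads(local_folder, s3_location):
    """List (local_path, bucket, key) tuples for every file in a folder (hypothetical helper)."""
    parsed = urlparse(s3_location)
    bucket = parsed.netloc
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"

    uploads = []
    for root, _dirs, files in os.walk(local_folder):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            # Keep the folder structure below local_folder in the object key.
            relative = os.path.relpath(local_path, local_folder).replace(os.sep, "/")
            uploads.append((local_path, bucket, prefix + relative))
    return uploads


def upload_folder(local_folder, s3_location):
    """Upload every file in local_folder to s3_location (hypothetical helper)."""
    import boto3  # Imported here so that plan_uploads works without Boto3 installed.

    s3_client = boto3.client("s3")
    for local_path, bucket, key in plan_uploads(local_folder, s3_location):
        s3_client.upload_file(local_path, bucket, key)
```

Calling `plan_uploads` first lets you review the bucket and object keys before any upload takes place.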
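Follow the instructions at Creating a dataset with a manifest file (console) to create a dataset. If you're using the AWS SDK, see Creating a dataset with a manifest file (SDK).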
