Creating a classification manifest file from a CSV file

This example Python script simplifies the creation of a classification manifest file by using a comma-separated values (CSV) file to label images. You create the CSV file.

A manifest file describes the images used to train a model. A manifest file is made up of one or more JSON lines. Each JSON line describes a single image. For more information, see Defining JSON lines for image classification.

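A CSV file represents tabular data over multiple rows in a text file. Fields on a row are separated by commas. For more information, see comma separated values. For this script, each row in your CSV file includes the S3 location of an image and the anomaly classification for the image (normal or anomaly). Each row maps to a JSON line in the manifest file.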

For example, the following CSV file describes some of the example images.


s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train-normal_1.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_10.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_11.jpg,normal
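
As a quick illustration of the format, the following minimal sketch parses CSV text like the preceding rows with the Python csv module and counts the images in each classification. The `count_classifications` helper is hypothetical and isn't part of the script in this topic.

```python
import csv
import io
from collections import Counter


def count_classifications(csv_text):
    """Count images per classification in header-less CSV text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        # Skip blank lines, as the script in this topic does.
        if not ''.join(row).strip():
            continue
        counts[row[1].strip().lower()] += 1
    return counts


sample = (
    "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly\n"
    "s3://s3bucket/circuitboard/train/normal/train-normal_1.jpg,normal\n"
    "s3://s3bucket/circuitboard/train/normal/train-normal_10.jpg,normal\n"
)
print(count_classifications(sample))
```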

The script generates a JSON line for each row. For example, the following is the JSON line for the first row (s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly).


{"source-ref": "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg","anomaly-label": 1,"anomaly-label-metadata": {"confidence": 1,"job-name": "labeling-job/anomaly-classification","class-name": "anomaly","human-annotated": "yes","creation-date": "2022-02-04T22:47:07","type": "groundtruth/image-classification"}}
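
To make the mapping from a CSV row to a JSON line concrete, the following is a minimal sketch that builds a JSON line with the same fields as the preceding example. The `build_json_line` helper is hypothetical and isn't part of the script in this topic. Its optional `creation_date` parameter is an assumption added so that the output can be reproduced, and the sketch lowercases the class name, which the script doesn't do.

```python
import json
from datetime import datetime, timezone


def build_json_line(image, classification, creation_date=None):
    """Build one manifest JSON line from a CSV row (illustrative helper)."""
    if creation_date is None:
        creation_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    label = classification.lower()
    json_line = {
        "source-ref": image,
        # 1 marks an anomalous image, 0 a normal image.
        "anomaly-label": 1 if label == "anomaly" else 0,
        "anomaly-label-metadata": {
            "confidence": 1,
            "job-name": "labeling-job/anomaly-classification",
            "class-name": label,
            "human-annotated": "yes",
            "creation-date": creation_date,
            "type": "groundtruth/image-classification",
        },
    }
    return json.dumps(json_line)


print(build_json_line(
    "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg", "anomaly"))
```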

If your CSV file doesn't include the Amazon S3 path for the images, use the --s3-path command line argument to specify the Amazon S3 path to the images.
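
The script in this topic prepends the supplied path to each image name as-is, so the path should end with a forward slash. The following sketch shows one way the join could be made tolerant of a missing slash. The `add_s3_path` helper is hypothetical and isn't part of the script.

```python
def add_s3_path(s3_path, image_name):
    """Prefix an image file name with an S3 folder path (hypothetical helper)."""
    if not s3_path:
        # No path supplied: field 1 is assumed to hold the full S3 path already.
        return image_name
    # Tolerate a missing or doubled slash between the folder and the file name.
    return s3_path.rstrip("/") + "/" + image_name.lstrip("/")


print(add_s3_path("s3://s3bucket/circuitboard/train/anomaly", "train-anomaly_1.jpg"))
```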

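Before creating the manifest file, the script checks the CSV file for duplicate images and for image classifications other than normal or anomaly. If it finds duplicate images or image classification errors, the script does the following: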

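Records the first entry for each image in a deduplicated CSV file.
Records duplicate occurrences of an image in the errors file.
Records image classifications that aren't normal or anomaly in the errors file.
Doesn't create a manifest file.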
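
The checks in the preceding list can be sketched on in-memory rows as follows. This is an illustration only, and `find_issues` is a hypothetical helper. Like the script in this topic, it keeps the first occurrence of each image in the deduplicated list even when that entry has an invalid classification, so that you can correct it there.

```python
def find_issues(rows):
    """Return (deduplicated_rows, errors) for a list of [image, classification] rows."""
    seen = set()
    deduplicated = []
    errors = []
    for line, row in enumerate(rows, start=1):
        image, classification = row[0], row[1]
        if classification.lower() not in ("normal", "anomaly"):
            errors.append((line, image, classification, "INVALID_CLASSIFICATION"))
        if image in seen:
            errors.append((line, image, classification, "DUPLICATE"))
        else:
            seen.add(image)
            deduplicated.append(row)
    return deduplicated, errors


rows = [
    ["image_1.jpg", "anomaly"],
    ["image_1.jpg", "anomaly"],   # Duplicate of line 1.
    ["image_2.jpg", "defect"],    # Not normal or anomaly.
]
deduplicated, errors = find_issues(rows)
print(deduplicated)
print(errors)
```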

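The errors file includes the line number in the input CSV file where each duplicate image or classification error was found. Use the errors CSV file to update the input CSV file, and then run the script again. Alternatively, use the errors CSV file to correct the deduplicated CSV file so that it contains only unique image entries with valid classifications, and then rerun the script with the updated deduplicated CSV file.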
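
Each row that the script writes to the errors file has four fields: the input line number, the image, the classification, and the error type (DUPLICATE or INVALID_CLASSIFICATION). As a minimal sketch, the following hypothetical `summarize_errors` helper groups the line numbers by error type so that you can see which input lines to fix.

```python
import csv
import io
from collections import defaultdict


def summarize_errors(errors_csv):
    """Group input line numbers by error type from an open errors file."""
    lines_by_error = defaultdict(list)
    for row in csv.reader(errors_csv):
        if not row:
            continue  # Ignore blank lines.
        line, _image, _classification, error_type = row
        lines_by_error[error_type].append(int(line))
    return dict(lines_by_error)


# In practice, pass the open errors file that the script created.
sample_errors = io.StringIO(
    "2,image_1.jpg,anomaly,DUPLICATE\n"
    "3,image_2.jpg,defect,INVALID_CLASSIFICATION\n"
)
print(summarize_errors(sample_errors))
```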

If no duplicates or errors are found in the input CSV file, the script deletes the deduplicated CSV file and the errors file, because they aren't needed.

In this procedure, you create the CSV file and run the Python script to create the manifest file. The script has been tested with Python version 3.7.

To create a manifest file from a CSV file

Create a CSV file with the following fields in each row (one row per image). Don't add a header row to the CSV file.

Field 1	Field 2
Image name or Amazon S3 image path. For example, `s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg`. You can't mix images that have an Amazon S3 path with images that don't.	The anomaly classification for the image (`normal` or `anomaly`).

For example, s3://s3bucket/circuitboard/train/anomaly/image_10.jpg,anomaly or image_11.jpg,normal.

Save the CSV file.
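
If your images are already sorted into local folders, you might generate the CSV file instead of typing it by hand. The following is a minimal sketch under stated assumptions: the local folder contains `normal` and `anomaly` subfolders, the same layout is later copied to the Amazon S3 folder, and only JPEG and PNG files are wanted. The `write_label_csv` helper and its parameters are hypothetical.

```python
import csv
import os

# Assumption: only JPEG and PNG files are wanted in the dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def write_label_csv(image_folder, s3_folder, csv_file):
    """Write one S3 path,classification row per image (hypothetical helper).

    Assumes image_folder contains normal/ and anomaly/ subfolders and that
    the same layout is later copied to s3_folder (which ends with "/").
    """
    count = 0
    with open(csv_file, "w", newline="", encoding="UTF-8") as output:
        writer = csv.writer(output)  # No header row is written.
        for classification in ("normal", "anomaly"):
            folder = os.path.join(image_folder, classification)
            if not os.path.isdir(folder):
                continue
            for name in sorted(os.listdir(folder)):
                if name.lower().endswith(IMAGE_EXTENSIONS):
                    writer.writerow(
                        [f"{s3_folder}{classification}/{name}", classification])
                    count += 1
    return count
```

For example, `write_label_csv("train", "s3://s3bucket/circuitboard/train/", "train.csv")` would write one row for each image under `train/normal` and `train/anomaly` and return the number of rows written.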

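Run the following Python script. Supply the following arguments:

csv_file: The CSV file that you created in the first step.
(Optional) --s3-path s3://path_to_folder/: The Amazon S3 path to add to the image file names (field 1). Use --s3-path if the images in field 1 don't already contain an S3 path. Include the trailing forward slash, because the script prepends this value to each image name as-is.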


# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier:  Apache-2.0
"""
Purpose
Shows how to create an Amazon Lookout for Vision manifest file from a CSV file.
The CSV file format is image location,anomaly classification (normal or anomaly)
For example:
s3://s3bucket/circuitboard/train/anomaly/train_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train_1.jpg,normal

If necessary, use the S3 path command line argument to specify the Amazon S3 bucket folder for the images.
"""

from datetime import datetime, timezone
import argparse
import logging
import csv
import os
import json

logger = logging.getLogger(__name__)


def check_errors(csv_file):
    """
    Checks for duplicate images and incorrect classifications in a CSV file.
    If duplicate images or invalid anomaly assignments are found, an errors CSV file
    and deduplicated CSV file are created. Only the first
    occurrence of a duplicate is recorded. Other duplicates are recorded in the errors file.
    :param csv_file: The source CSV file
    :return: True if errors or duplicates are found, otherwise false.
    """

    logger.info("Checking %s.", csv_file)

    errors_found = False
    errors_file = f"{os.path.splitext(csv_file)[0]}_errors.csv"
    deduplicated_file = f"{os.path.splitext(csv_file)[0]}_deduplicated.csv"

    # newline='' lets the csv module handle line endings on every platform.
    with open(csv_file, 'r', newline='', encoding="UTF-8") as input_file,\
            open(deduplicated_file, 'w', newline='', encoding="UTF-8") as dedup,\
            open(errors_file, 'w', newline='', encoding="UTF-8") as errors:

        reader = csv.reader(input_file, delimiter=',')
        dedup_writer = csv.writer(dedup)
        error_writer = csv.writer(errors)
        entries = set()

        # Number rows from 1 so that the errors file reports the input
        # line number even when blank lines are skipped.
        for line, row in enumerate(reader, start=1):

            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            # Record any incorrect classifications.
            if row[1].lower() not in ("normal", "anomaly"):
                error_writer.writerow(
                    [line, row[0], row[1], "INVALID_CLASSIFICATION"])
                errors_found = True

            # Write first image entry to dedup file and record duplicates.
            key = row[0]
            if key not in entries:
                dedup_writer.writerow(row)
                entries.add(key)
            else:
                error_writer.writerow([line, row[0], row[1], "DUPLICATE"])
                errors_found = True

    if errors_found:
        logger.info("Errors found. Check %s.", errors_file)
    else:
        os.remove(errors_file)
        os.remove(deduplicated_file)

    return errors_found


def create_manifest_file(csv_file, manifest_file, s3_path):
    """
    Read a CSV file and create an Amazon Lookout for Vision classification manifest file.
    :param csv_file: The source CSV file.
    :param manifest_file: The name of the manifest file to create.
    :param s3_path: The Amazon S3 path to the folder that contains the images.
    """
    logger.info("Processing CSV file %s.", csv_file)

    image_count = 0
    anomalous_count = 0

    with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
        open(manifest_file, "w", encoding="UTF-8") as output_file:

        image_classifications = csv.reader(
            csvfile, delimiter=',', quotechar='|')

        # Process each row (image) in the CSV file.
        for row in image_classifications:
            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            source_ref = str(s3_path) + row[0]
            classification = 0

            if row[1].lower() == 'anomaly':
                classification = 1
                anomalous_count += 1

            # Create the JSON line. The label is written as an integer
            # (1 = anomaly, 0 = normal).
            json_line = {}
            json_line['source-ref'] = source_ref
            json_line['anomaly-label'] = classification

            metadata = {}
            metadata['confidence'] = 1
            metadata['job-name'] = "labeling-job/anomaly-classification"
            metadata['class-name'] = row[1]
            metadata['human-annotated'] = "yes"
            metadata['creation-date'] = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
            metadata['type'] = "groundtruth/image-classification"

            json_line['anomaly-label-metadata'] = metadata

            output_file.write(json.dumps(json_line))
            output_file.write('\n')
            image_count += 1

    logger.info("Finished creating manifest file %s.\n"
                "Images: %s\nAnomalous: %s",
                manifest_file,
                image_count,
                anomalous_count)
    return image_count, anomalous_count


def add_arguments(parser):
    """
    Add command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "csv_file", help="The CSV file that you want to process."
    )

    # argparse stores the value of --s3-path in args.s3_path.
    parser.add_argument(
        "--s3-path", help="The Amazon S3 bucket and folder path for the images."
        " If not supplied, column 1 is assumed to include the Amazon S3 path.", required=False
    )


def main():

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    try:

        # Get command line arguments.
        parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
        add_arguments(parser)
        args = parser.parse_args()
        s3_path = args.s3_path
        if s3_path is None:
            s3_path = ""

        csv_file = args.csv_file
        csv_file_no_extension = os.path.splitext(csv_file)[0]
        manifest_file = csv_file_no_extension + '.manifest'

        # Create manifest file if there are no duplicate images.
        if check_errors(csv_file):
            print(f"Issues found. Use {csv_file_no_extension}_errors.csv "
                  "to view duplicates and errors.")
            print(f"{csv_file_no_extension}_deduplicated.csv contains the first "
                  "occurrence of a duplicate.\n"
                  "Update as necessary with the correct information.")
            print(f"Re-run the script with {csv_file_no_extension}_deduplicated.csv")
        else:
            print('No duplicates found. Creating manifest file.')

            image_count, anomalous_count = create_manifest_file(csv_file, manifest_file, s3_path)

            print(f"Finished creating manifest file: {manifest_file} \n")

            normal_count = image_count-anomalous_count
            print(f"Images processed: {image_count}")
            print(f"Normal: {normal_count}")
            print(f"Anomalous: {anomalous_count}")

    except FileNotFoundError as err:
        logger.exception("File not found: %s", err)
        print(f"File not found: {err}. Check your input CSV file.")

if __name__ == "__main__":
    main()
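
After the script finishes, you might want to spot-check the manifest file before you create a dataset from it. The following is a minimal sketch, not a full validation: the hypothetical `check_manifest` helper confirms that each line parses as JSON and contains the three top-level fields that the script writes, and it returns the image and anomaly counts.

```python
import json


def check_manifest(manifest_file):
    """Return (image_count, anomalous_count) after basic checks on each JSON line."""
    image_count = 0
    anomalous_count = 0
    with open(manifest_file, encoding="UTF-8") as manifest:
        for number, line in enumerate(manifest, start=1):
            if not line.strip():
                continue
            entry = json.loads(line)
            for key in ("source-ref", "anomaly-label", "anomaly-label-metadata"):
                if key not in entry:
                    raise ValueError(f"Line {number}: missing {key}")
            # int() accepts a label written as 1 or as "1".
            anomalous_count += int(entry["anomaly-label"])
            image_count += 1
    return image_count, anomalous_count
```

The counts that the helper returns should match the totals that the script prints when it finishes.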

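If duplicate images or classification errors occur:
1. Use the errors file to update the deduplicated CSV file or the input CSV file.
2. Rerun the script with the updated deduplicated CSV file or the updated input CSV file.
If you plan to use a test dataset, repeat the preceding steps to create a manifest file for your test dataset.
If necessary, copy the images to the Amazon S3 bucket path that you specified in column 1 of the CSV file (or with the --s3-path command line argument). To copy the images, enter the following command at the command prompt.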
```
aws s3 cp --recursive your-local-folder/ s3://your-target-S3-location/
```
Follow the instructions at Creating a dataset with a manifest file (console) to create a dataset. If you're using the AWS SDK, see Creating a dataset with a manifest file (SDK).
