사전 조건

참고

AWS SDK for Python (Boto3) AWS CLI, 또는 SageMaker AI 콘솔을 사용하여 모델을 컴파일한 경우이 섹션의 지침을 따르세요.

SageMaker NEO 컴파일 모델을 생성하려면 다음이 필요합니다.

도커 이미지 Amazon ECR URI. 이 목록에서 필요에 맞는 것을 선택할 수 있습니다.

진입점 스크립트 파일:

PyTorch 및 MXNet 모델의 경우:

SageMaker AI를 사용하여 모델을 훈련시킨 경우 훈련 스크립트는 아래에 설명된 함수를 구현해야 합니다. 훈련 스크립트는 추론 중에 진입점 스크립트 역할을 합니다. MXNet 모듈 및 SageMaker Neo를 사용한 MNIST 훈련, 컴파일 및 배포에 자세히 설명된 예제에서는 훈련 스크립트(mnist.py)가 필수 함수를 구현합니다.

SageMaker AI를 사용하여 모델을 훈련하지 않은 경우 추론 시 사용할 수 있는 진입점 스크립트(inference.py) 파일을 제공해야 합니다. 프레임워크(MXNet 또는 Pytorch)에 따라 추론 스크립트 위치는 SageMaker Python SDK MXnet용 모델 디렉터리 구조 또는 PyTorch용 모델 디렉터리 구조를 준수해야 합니다.

CPU 및 GPU 인스턴스 유형에서 PyTorch 및 MXNet과 함께 Neo 추론 최적화 컨테이너 이미지를 사용하는 경우, 추론 스크립트는 다음 함수를 구현해야 합니다.

model_fn: 모델을 로드합니다. (선택 사항)
input_fn: 수신 요청 페이로드를 numpy 배열로 변환합니다.
predict_fn: 예측을 수행합니다.
output_fn: 예측 출력을 응답 페이로드로 변환합니다.
또는, input_fn, predict_fn 및 output_fn을 조합하여 transform_fn을 정의할 수도 있습니다.

다음은 PyTorch와 MXNet(Gluon 및 모듈)의 경우 code(code/inference.py)라는 디렉터리 내의 inference.py 스크립트 예제입니다. 이 예제는 먼저 모델을 로드한 다음 GPU의 이미지 데이터에 사용합니다.

MXNet Module


import numpy as np
import json
import mxnet as mx
import neomx  # noqa: F401
from collections import namedtuple

Batch = namedtuple('Batch', ['data'])

# Change the context to mx.cpu() if deploying to a CPU endpoint
ctx = mx.gpu()

def model_fn(model_dir):
    # The compiled model artifacts are saved with the prefix 'compiled'
    sym, arg_params, aux_params = mx.model.load_checkpoint('compiled', 0)
    mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
    exe = mod.bind(for_training=False,
                   data_shapes=[('data', (1,3,224,224))],
                   label_shapes=mod._label_shapes)
    mod.set_params(arg_params, aux_params, allow_missing=True)
    
    # Run warm-up inference on empty data during model load (required for GPU)
    data = mx.nd.empty((1,3,224,224), ctx=ctx)
    mod.forward(Batch([data]))
    return mod


def transform_fn(mod, image, input_content_type, output_content_type):
    # pre-processing
    decoded = mx.image.imdecode(image)
    resized = mx.image.resize_short(decoded, 224)
    cropped, crop_info = mx.image.center_crop(resized, (224, 224))
    normalized = mx.image.color_normalize(cropped.astype(np.float32) / 255,
                                  mean=mx.nd.array([0.485, 0.456, 0.406]),
                                  std=mx.nd.array([0.229, 0.224, 0.225]))
    transposed = normalized.transpose((2, 0, 1))
    batchified = transposed.expand_dims(axis=0)
    casted = batchified.astype(dtype='float32')
    processed_input = casted.as_in_context(ctx)

    # prediction/inference
    mod.forward(Batch([processed_input]))

    # post-processing
    prob = mod.get_outputs()[0].asnumpy().tolist()
    prob_json = json.dumps(prob)
    return prob_json, output_content_type

MXNet Gluon


import numpy as np
import json
import mxnet as mx
import neomx  # noqa: F401

# Change the context to mx.cpu() if deploying to a CPU endpoint
ctx = mx.gpu()

def model_fn(model_dir):
    # The compiled model artifacts are saved with the prefix 'compiled'
    block = mx.gluon.nn.SymbolBlock.imports('compiled-symbol.json',['data'],'compiled-0000.params', ctx=ctx)
    
    # Hybridize the model & pass required options for Neo: static_alloc=True & static_shape=True
    block.hybridize(static_alloc=True, static_shape=True)
    
    # Run warm-up inference on empty data during model load (required for GPU)
    data = mx.nd.empty((1,3,224,224), ctx=ctx)
    warm_up = block(data)
    return block


def input_fn(image, input_content_type):
    # pre-processing
    decoded = mx.image.imdecode(image)
    resized = mx.image.resize_short(decoded, 224)
    cropped, crop_info = mx.image.center_crop(resized, (224, 224))
    normalized = mx.image.color_normalize(cropped.astype(np.float32) / 255,
                                  mean=mx.nd.array([0.485, 0.456, 0.406]),
                                  std=mx.nd.array([0.229, 0.224, 0.225]))
    transposed = normalized.transpose((2, 0, 1))
    batchified = transposed.expand_dims(axis=0)
    casted = batchified.astype(dtype='float32')
    processed_input = casted.as_in_context(ctx)
    return processed_input


def predict_fn(processed_input_data, block):
    # prediction/inference
    prediction = block(processed_input_data)
    return prediction

def output_fn(prediction, output_content_type):
    # post-processing
    prob = prediction.asnumpy().tolist()
    prob_json = json.dumps(prob)
    return prob_json, output_content_type

PyTorch 1.4 and Older


import os
import torch
import torch.nn.parallel
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
from PIL import Image
import io
import json
import pickle


def model_fn(model_dir):
    """Load the model and return it.
    Providing this function is optional.
    There is a default model_fn available which will load the model
    compiled using SageMaker Neo. You can override it here.

    Keyword arguments:
    model_dir -- the directory path where the model artifacts are present
    """

    # The compiled model is saved as "compiled.pt"
    model_path = os.path.join(model_dir, 'compiled.pt')
    with torch.neo.config(model_dir=model_dir, neo_runtime=True):
        model = torch.jit.load(model_path)
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model = model.to(device)

    # We recommend that you run warm-up inference during model load
    sample_input_path = os.path.join(model_dir, 'sample_input.pkl')
    with open(sample_input_path, 'rb') as input_file:
        model_input = pickle.load(input_file)
    if torch.is_tensor(model_input):
        model_input = model_input.to(device)
        model(model_input)
    elif isinstance(model_input, tuple):
        model_input = (inp.to(device) for inp in model_input if torch.is_tensor(inp))
        model(*model_input)
    else:
        print("Only supports a torch tensor or a tuple of torch tensors")
        return model


def transform_fn(model, request_body, request_content_type,
                 response_content_type):
    """Run prediction and return the output.
    The function
    1. Pre-processes the input request
    2. Runs prediction
    3. Post-processes the prediction output.
    """
    # preprocess
    decoded = Image.open(io.BytesIO(request_body))
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[
                0.485, 0.456, 0.406], std=[
                0.229, 0.224, 0.225]),
    ])
    normalized = preprocess(decoded)
    batchified = normalized.unsqueeze(0)
    # predict
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    batchified = batchified.to(device)
    output = model.forward(batchified)

    return json.dumps(output.cpu().numpy().tolist()), response_content_type

PyTorch 1.5 and Newer


import os
import torch
import torch.nn.parallel
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
from PIL import Image
import io
import json
import pickle


def model_fn(model_dir):
    """Load the model and return it.
    Providing this function is optional.
    There is a default_model_fn available, which will load the model
    compiled using SageMaker Neo. You can override the default here.
    The model_fn only needs to be defined if your model needs extra
    steps to load, and can otherwise be left undefined.

    Keyword arguments:
    model_dir -- the directory path where the model artifacts are present
    """

    # The compiled model is saved as "model.pt"
    model_path = os.path.join(model_dir, 'model.pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.jit.load(model_path, map_location=device)
    model = model.to(device)

    return model


def transform_fn(model, request_body, request_content_type,
                    response_content_type):
    """Run prediction and return the output.
    The function
    1. Pre-processes the input request
    2. Runs prediction
    3. Post-processes the prediction output.
    """
    # preprocess
    decoded = Image.open(io.BytesIO(request_body))
    preprocess = transforms.Compose([
                                transforms.Resize(256),
                                transforms.CenterCrop(224),
                                transforms.ToTensor(),
                                transforms.Normalize(
                                    mean=[
                                        0.485, 0.456, 0.406], std=[
                                        0.229, 0.224, 0.225]),
                                    ])
    normalized = preprocess(decoded)
    batchified = normalized.unsqueeze(0)
    
    # predict
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    batchified = batchified.to(device)
    output = model.forward(batchified)
    return json.dumps(output.cpu().numpy().tolist()), response_content_type

inf1 인스턴스 또는 onnx, xgboost, keras 컨테이너 이미지의 경우

다른 모든 Neo 추론에 최적화된 컨테이너 이미지 또는 inferentia 인스턴스 유형의 경우, 진입점 스크립트는 Neo 딥 러닝 런타임을 위한 다음 함수를 구현해야 합니다.
- neo_preprocess: 수신 요청 페이로드를 numpy 배열로 변환합니다.
- neo_postprocess: Neo 딥 러닝 런타임의 예측 출력을 응답 본문으로 변환합니다.
  
  참고
  이러한 두 개의 함수는 MXNet, Pytorch 또는 Tensorflow의 기능을 사용하지 않습니다.
이러한 함수를 사용하는 방법에 대한 예는 Neo Model 컴파일 샘플 노트북을 참조하세요.

TensorFlow 모델의 경우

모델에 데이터가 전송되기 전에 사용자 지정 사전 및 사후 처리 로직이 필요한 경우, 추론 시 사용할 수 있는 진입점 스크립트 inference.py 파일을 지정해야 합니다. 스크립트는 input_handler 및 output_handler 함수 쌍이나 단일 핸들러 함수를 구현해야 합니다.

참고

참고로 핸들러 함수가 구현된 경우, input_handler 및 output_handler는 무시됩니다.

다음은 컴파일 모델과 함께 사용하여 이미지 분류 모델에서 사용자 지정 사전 및 사후 처리를 수행할 수 있는 inference.py 스크립트의 코드 예제입니다. SageMaker AI 클라이언트는 이미지 파일을 input_handler 함수에 application/x-image 콘텐츠 유형으로 전송하여 JSON으로 변환합니다. 그런 다음 변환된 이미지 파일은 REST API를 사용하여 Tensorflow 모델 서버(TFX)로 전송됩니다.


import json
import numpy as np
import json
import io
from PIL import Image

def input_handler(data, context):
    """ Pre-process request input before it is sent to TensorFlow Serving REST API
    
    Args:
    data (obj): the request data, in format of dict or string
    context (Context): an object containing request and configuration details
    
    Returns:
    (dict): a JSON-serializable dict that contains request body and headers
    """
    f = data.read()
    f = io.BytesIO(f)
    image = Image.open(f).convert('RGB')
    batch_size = 1
    image = np.asarray(image.resize((512, 512)))
    image = np.concatenate([image[np.newaxis, :, :]] * batch_size)
    body = json.dumps({"signature_name": "serving_default", "instances": image.tolist()})
    return body

def output_handler(data, context):
    """Post-process TensorFlow Serving output before it is returned to the client.
    
    Args:
    data (obj): the TensorFlow serving response
    context (Context): an object containing request and configuration details
    
    Returns:
    (bytes, string): data to return to client, response content type
    """
    if data.status_code != 200:
        raise ValueError(data.content.decode('utf-8'))

    response_content_type = context.accept_header
    prediction = data.content
    return prediction, response_content_type

사용자 지정 사전 또는 사후 처리가 없는 경우 SageMaker AI 클라이언트는 파일 이미지를 유사한 방식으로 JSON으로 변환한 후 SageMaker AI 엔드포인트로 전송합니다.

자세한 내용은 SageMaker Python SDK에서 TensorFlow 제공 엔드포인트에 배포를 참조하세요.

컴파일된 모델 아티팩트가 포함된 Amazon S3 버킷 URI입니다.

javascript가 브라우저에서 비활성화되거나 사용이 불가합니다.

AWS 설명서를 사용하려면 Javascript가 활성화되어야 합니다. 지침을 보려면 브라우저의 도움말 페이지를 참조하십시오.

문서 규칙

모델 배포

SageMaker AI SDK를 사용하여 컴파일된 모델 배포