
Deploy large models for inference with TorchServe

This tutorial demonstrates how to deploy large models and serve inference in Amazon SageMaker AI with TorchServe on GPUs. The example deploys the OPT-30b model to an ml.g5 instance. You can modify it to work with other models and instance types. Replace the italicized placeholder text in the examples with your own information.

TorchServe is a powerful open platform for large distributed model inference. By supporting popular libraries such as PyTorch, native PiPPy, DeepSpeed, and HuggingFace Accelerate, it offers uniform handler APIs that remain consistent across distributed large model and non-distributed model inference scenarios. For more information, see TorchServe's large model inference documentation.

Deep learning containers with TorchServe

To deploy a large model with TorchServe on SageMaker AI, you can use one of the SageMaker AI deep learning containers (DLCs). By default, TorchServe is installed in all AWS PyTorch DLCs. During model loading, TorchServe can install specialized libraries tailored for large models, such as PiPPy, DeepSpeed, and Accelerate.

The following table lists all of the SageMaker AI DLCs with TorchServe.

DLC category	Framework	Hardware	Example URL
SageMaker AI Framework Containers	PyTorch 2.0.0+	CPU, GPU	763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
SageMaker AI Framework Graviton Containers	PyTorch 2.0.0+	CPU	763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.1-cpu-py310-ubuntu20.04-sagemaker
StabilityAI Inference Containers	PyTorch 2.0.0+	GPU	763104351884.dkr.ecr.us-east-1.amazonaws.com/stabilityai-pytorch-inference:2.0.1-sgm0.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker
Neuron Containers	PyTorch 1.13.1	Neuronx	763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.12.0-ubuntu20.04
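
The example URLs in the table share one pattern: registry account, Region, repository name, and image tag. The following sketch assembles such a URI from its parts. The helper build_dlc_image_uri is hypothetical (it is not part of any AWS SDK), and the default account ID is simply the one shown in the table, which may not apply to every Region.

```python
def build_dlc_image_uri(repository: str, tag: str, region: str = "us-east-1",
                        account: str = "763104351884") -> str:
    """Assemble an ECR image URI of the form used in the table's examples.

    Hypothetical helper for illustration; the default account ID is the one
    from the table and is an assumption for other Regions.
    """
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


uri = build_dlc_image_uri(
    "pytorch-inference", "2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker"
)
print(uri)
```

Treat the result as a starting point and confirm that the image exists in your Region before deploying.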

Getting started

Before deploying your model, complete the prerequisites. You can also configure your model parameters and customize the handler code.

Prerequisites

To get started, ensure that you have the following prerequisites:

Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For testing in your personal account, you can attach managed permissions policies to the IAM role.
For more information about attaching IAM policies to a role, see Adding and removing IAM identity permissions in the AWS IAM User Guide.

Configure your dependencies locally, as shown in the following examples.

Install version 2 of the AWS CLI:


# Install the latest AWS CLI v2 if it is not installed
!curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
!unzip awscliv2.zip
# Follow the instructions to install v2 on the terminal
!cat aws/README.md

Install the SageMaker AI and Boto3 clients:


# If already installed, update your client
#%pip install sagemaker pip --upgrade --quiet
!pip install -U sagemaker
!pip install -U boto
!pip install -U botocore
!pip install -U boto3

Configure model settings and parameters

TorchServe uses torchrun to set up the distributed environment for model parallel processing. TorchServe can support multiple workers for a large model. By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host. For large model inference, the number of GPUs assigned to each worker is calculated automatically from the number of GPUs specified in the model_config.yaml file. The environment variable CUDA_VISIBLE_DEVICES, which specifies the GPU device IDs that are visible at a given time, is set based on this number.

For example, suppose there are 8 GPUs on a node and one worker needs 4 GPUs on the node (nproc_per_node=4). In this case, TorchServe assigns four GPUs to the first worker (CUDA_VISIBLE_DEVICES="0,1,2,3") and four GPUs to the second worker (CUDA_VISIBLE_DEVICES="4,5,6,7").

In addition to this default behavior, TorchServe lets users specify the GPUs for a worker. For example, if you set the variable deviceIds: [2,3,4,5] in the model config YAML file and set nproc_per_node=2, then TorchServe assigns CUDA_VISIBLE_DEVICES="2,3" to the first worker and CUDA_VISIBLE_DEVICES="4,5" to the second worker.
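
The two assignment examples above can be reproduced with a short sketch. This is an illustration of the behavior described in the text, not TorchServe's actual implementation; the function name assign_gpus and its parameters are assumptions made for this example.

```python
from typing import List, Optional


def assign_gpus(num_gpus: int, nproc_per_node: int,
                device_ids: Optional[List[int]] = None) -> List[str]:
    """Return one CUDA_VISIBLE_DEVICES string per worker.

    Illustrative only: mirrors the behavior described in the text,
    not TorchServe's actual code.
    """
    # Use the user-specified IDs if given, otherwise every GPU on the node.
    ids = device_ids if device_ids is not None else list(range(num_gpus))
    # Each worker receives a contiguous group of nproc_per_node devices;
    # leftover devices that cannot fill a whole group are not assigned.
    return [
        ",".join(str(i) for i in ids[start:start + nproc_per_node])
        for start in range(0, len(ids) - nproc_per_node + 1, nproc_per_node)
    ]


default = assign_gpus(num_gpus=8, nproc_per_node=4)
custom = assign_gpus(num_gpus=8, nproc_per_node=2, device_ids=[2, 3, 4, 5])
print(default)  # ['0,1,2,3', '4,5,6,7']
print(custom)   # ['2,3', '4,5']
```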

In the following model_config.yaml example, we configure both front-end and back-end parameters for the OPT-30b model. The configured front-end parameters are parallelType, deviceType, deviceIds, and torchrun. For more detailed information about the front-end parameters that you can configure, see the PyTorch GitHub documentation. The back-end configuration is based on a YAML map that allows free-style customization. For the back-end parameters, we define the DeepSpeed configuration and additional parameters used by the custom handler code.


# TorchServe front-end parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
parallelType: "tp"
deviceType: "gpu"
# example of user specified GPU deviceIds
deviceIds: [0,1,2,3] # sets CUDA_VISIBLE_DEVICES

torchrun:
    nproc-per-node: 4

# TorchServe back-end parameters
deepspeed:
    config: ds-config.json
    checkpoint: checkpoints.json

handler: # parameters for custom handler code
    model_name: "facebook/opt-30b"
    model_path: "model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546"
    max_length: 50
    max_new_tokens: 10
    manual_seed: 40

Customize handlers

TorchServe offers base handlers and handler utilities for large model inference built with popular libraries. The following example demonstrates how the custom handler class TransformersSeqClassifierHandler extends BaseDeepSpeedHandler and uses the handler utilities. For a full code example, see the custom_handler.py code in the PyTorch GitHub documentation.


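# Imports that this excerpt relies on
import logging
from abc import ABC

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from ts.context import Context
from ts.handler_utils.distributed.deepspeed import get_ds_engine
from ts.torch_handler.distributed.base_deepspeed_handler import BaseDeepSpeedHandler

logger = logging.getLogger(__name__)


class TransformersSeqClassifierHandler(BaseDeepSpeedHandler, ABC):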
    """
    Transformers handler class for sequence, token classification and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.max_length = None
        self.max_new_tokens = None
        self.tokenizer = None
        self.initialized = False

    def initialize(self, ctx: Context):
        """In this initialize function, the HF large model is loaded and
        partitioned using DeepSpeed.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        super().initialize(ctx)
        model_dir = ctx.system_properties.get("model_dir")
        self.max_length = int(ctx.model_yaml_config["handler"]["max_length"])
        self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
        torch.manual_seed(seed)

        logger.info("Model %s loading tokenizer", ctx.model_name)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        config = AutoConfig.from_pretrained(model_name)
        with torch.device("meta"):
            self.model = AutoModelForCausalLM.from_config(
                config, torch_dtype=torch.float16
            )
        self.model = self.model.eval()

        ds_engine = get_ds_engine(self.model, ctx)
        self.model = ds_engine.module
        logger.info("Model %s loaded successfully", ctx.model_name)
        self.initialized = True

    def preprocess(self, requests):
        """
        Basic text preprocessing, based on the user's choice of application mode.
        Args:
            requests (list): A list of dictionaries with a "data" or "body" field, each
                            containing the input text to be processed.
        Returns:
            tuple: A tuple with two tensors: the batch of input ids and the batch of
                attention masks.
        """

    def inference(self, input_batch):
        """
        Predicts the class (or classes) of the received text using the serialized transformers
        checkpoint.
        Args:
            input_batch (tuple): A tuple with two tensors: the batch of input ids and the batch
                                of attention masks, as returned by the preprocess function.
        Returns:
            list: A list of strings with the predicted values for each input text in the batch.
        """
        
    def postprocess(self, inference_output):
        """Post Process Function converts the predicted response into Torchserve readable format.
        Args:
            inference_output (list): It contains the predicted response of the input text.
        Returns:
            (list): Returns a list of the Predictions and Explanations.
        """
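
The method bodies for preprocess, inference, and postprocess are elided in the excerpt above. As a rough illustration of only the first step of preprocess, the following sketch extracts the input text from a batch of requests, based on the docstring's description of a "data" or "body" field. It is not the code from custom_handler.py: the helper name extract_texts is hypothetical, the bytes-decoding step is an assumption, and tokenization is omitted.

```python
from typing import Any, Dict, List


def extract_texts(requests: List[Dict[str, Any]]) -> List[str]:
    """Pull the raw input text out of each request in a batch."""
    texts = []
    for req in requests:
        # Per the preprocess docstring, the payload is under "data" or "body".
        payload = req.get("data")
        if payload is None:
            payload = req.get("body")
        # Assumption: payloads may arrive as raw bytes, so decode them to str.
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        texts.append(payload)
    return texts


batch = [{"data": b"Today the weather is"}, {"body": "really nice"}]
texts = extract_texts(batch)
print(texts)
```

In a real handler, the extracted strings would then be passed to the tokenizer to produce the input ids and attention masks that the docstring mentions.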

Prepare your model artifacts

Before deploying your model on SageMaker AI, you must package your model artifacts. For large models, we recommend that you use the PyTorch torch-model-archiver tool with the argument --archive-format no-archive, which skips compressing the model artifacts. The following example saves all of the model artifacts to a new folder named opt/.


torch-model-archiver --model-name opt --version 1.0 --handler custom_handler.py --extra-files ds-config.json -r requirements.txt --config-file opt/model-config.yaml --archive-format no-archive

After the opt/ folder is created, download the OPT-30b model to the folder using the PyTorch Download_model tool.


cd opt
python path_to/Download_model.py --model_path model --model_name facebook/opt-30b --revision main

Lastly, upload the model artifacts to an Amazon S3 bucket.


aws s3 cp opt {your_s3_bucket}/opt --recursive

You should now have model artifacts stored in Amazon S3 that are ready to deploy to a SageMaker AI endpoint.

Deploy the model using the SageMaker Python SDK

After preparing your model artifacts, you can deploy your model to a SageMaker AI Hosting endpoint. This section describes how to deploy a single large model to an endpoint and make streaming response predictions. For more information about streaming responses from endpoints, see Invoke real-time endpoints.

To deploy your model, complete the following steps:

Create a SageMaker AI session, as shown in the following example.


import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

boto3_session=boto3.session.Session(region_name="us-west-2")
smr = boto3.client('sagemaker-runtime')
sm = boto3.client('sagemaker')
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess= sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)  # SageMaker AI session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio Classic environment
account = sess.account_id()  # account_id of the current SageMaker Studio Classic environment

# Configuration:
bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

Create an uncompressed model in SageMaker AI, as shown in the following example.


from datetime import datetime

instance_type = "ml.g5.24xlarge"
endpoint_name = sagemaker.utils.name_from_base("ts-opt-30b")
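s3_uri = "{your_s3_bucket}/opt/"  # replace the placeholder with your S3 URI; an S3Prefix URI ends with "/"
# Assumption: look up the PyTorch inference DLC for the current Region. You can
# instead set `container` to one of the example URLs from the DLC table.
container = image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="2.0.1",
    py_version="py310",
    image_scope="inference",
    instance_type=instance_type,
)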

model = Model(
    name="torchserve-opt-30b" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # Enable SageMaker uncompressed model artifacts
    model_data={
        "S3DataSource": {
                "S3Uri": s3_uri,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
        }
    },
    image_uri=container,
    role=role,
    sagemaker_session=sess,
    env={"TS_INSTALL_PY_DEP_PER_MODEL": "true"},
)
print(model)

Deploy the model to an Amazon EC2 instance, as shown in the following example.


model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    volume_size=512, # increase the size to store large model
    model_data_download_timeout=3600, # increase the timeout to download large model
    container_startup_health_check_timeout=600, # increase the timeout to load large model
)

Initialize a class to process the streaming response, as shown in the following example.


import io

class Parser:
    """
    A helper class for parsing the byte stream input. 
    
    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by concatenating bytes written via the 'write' function
    and then exposing a method which will return lines (ending with a '\n' character) within
    the buffer via the 'scan_lines' function. It maintains the position of the last read 
    position to ensure that previous bytes are not exposed again. 
    """
    
    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0
        
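    def write(self, content):
        # Append new bytes at the end of the buffer.
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        # Resume reading from the end of the last complete line.
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Yield only complete lines; a trailing partial line stays buffered.
            if line[-1:] == b'\n':
                self.read_pos += len(line)
                yield line[:-1]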
                
    def reset(self):
        self.read_pos = 0
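
To see why the buffering matters, the following sketch feeds a simulated event stream through a line buffer in which one JSON object is split across two PayloadPart events, as described in the docstring. LineParser is a condensed, hypothetical stand-in for the Parser class so that the snippet runs on its own, and the events are made-up sample data rather than real endpoint output.

```python
import io


class LineParser:
    """Condensed, hypothetical stand-in for the Parser class."""

    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content: bytes) -> None:
        # Always append at the end of the buffer.
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        # Resume from the end of the last fully consumed line.
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Only emit lines that are complete (newline-terminated).
            if line.endswith(b"\n"):
                self.read_pos += len(line)
                yield line[:-1]


# Simulated event stream in which one JSON object is split across two parts.
events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]

parser = LineParser()
lines = []
for event in events:
    parser.write(event["PayloadPart"]["Bytes"])
    lines.extend(parser.scan_lines())

print(lines)
```

The partial object from the first event is held back until the second event completes it, so each emitted line is a full JSON document.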

Test a streaming response prediction, as shown in the following example.


import json

body = "Today the weather is really nice and I am planning on".encode('utf-8')
resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=body, ContentType="application/json")
event_stream = resp['Body']
parser = Parser()
for event in event_stream:
    parser.write(event['PayloadPart']['Bytes'])
    for line in parser.scan_lines():
        print(line.decode("utf-8"), end=' ')
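
The loop above prints each raw JSON line. If you prefer the generated text itself, a small helper can decode the lines and join the tokens. The following sketch assumes that every line has the {"outputs": [...]} shape shown in the Parser docstring; the helper name collect_text and the sample lines are illustrative only.

```python
import json
from typing import Iterable


def collect_text(lines: Iterable[bytes]) -> str:
    """Join the generated tokens carried by a sequence of JSON lines."""
    pieces = []
    for line in lines:
        record = json.loads(line.decode("utf-8"))
        # Assumption: each record looks like {"outputs": [" token", ...]}.
        pieces.extend(record["outputs"])
    return "".join(pieces)


sample = [
    b'{"outputs": [" a"]}',
    b'{"outputs": [" challenging"]}',
    b'{"outputs": [" problem"]}',
]
text = collect_text(sample)
print(text)
```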

You have now deployed your model to a SageMaker AI endpoint and should be able to invoke it for responses. For more information about SageMaker AI real-time endpoints, see Single-model endpoints.
