

# Train and deploy models with HyperPod CLI and SDK
<a name="getting-started-hyperpod-training-deploying-models"></a>

Amazon SageMaker HyperPod helps you train and deploy machine learning (ML) models at scale. The HyperPod CLI is a unified command-line interface that abstracts infrastructure complexity and streamlines submitting, monitoring, and managing ML training jobs, so data scientists and ML engineers can focus on model development rather than infrastructure management. This topic walks you through three scenarios: training a PyTorch model, deploying a custom model from the trained artifacts, and deploying a JumpStart model. It is written for first-time users and shows each step with both the HyperPod CLI and the Python SDK. Because the handoff between training and inference is manual, the walkthrough also shows how to track model artifacts from the training job to the endpoint.

## Prerequisites
<a name="prerequisites"></a>

Before you begin using Amazon SageMaker HyperPod, make sure you have:
+ An AWS account with access to Amazon SageMaker HyperPod
+ Python 3.9, 3.10, or 3.11 installed
+ AWS CLI configured with appropriate credentials

## Install the HyperPod CLI and SDK
<a name="install-cli-sdk"></a>

Install the required package to access the CLI and SDK:

```
pip install sagemaker-hyperpod
```

This command sets up the tools needed to interact with HyperPod clusters.
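If you are scripting your setup, a quick check that the CLI landed on your PATH can catch environment problems early. This is a minimal sketch using only the standard library; it assumes the package installs an executable named `hyp`, as used in the commands below:

```python
import shutil

def cli_available(name: str) -> bool:
    """Return True if an executable with this name is on the PATH."""
    return shutil.which(name) is not None

if cli_available("hyp"):
    print("HyperPod CLI is installed")
else:
    print("HyperPod CLI not found - check your pip environment")
```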

## Configure your cluster context
<a name="configure-cluster"></a>

HyperPod operates on clusters optimized for machine learning. Start by listing available clusters to select one for your tasks.

1. List all available clusters:

   ```
   hyp list-cluster
   ```

1. Choose and set your active cluster:

   ```
   hyp set-cluster-context your-eks-cluster-name
   ```

1. Verify the configuration:

   ```
   hyp get-cluster-context
   ```

**Note**  
All subsequent commands target the cluster you've set as your context.

## Choose your scenario
<a name="choose-scenario"></a>

For detailed instructions on each scenario, see the following topics:

**Topics**
+ [Train a PyTorch model](train-models-with-hyperpod.md)
+ [Deploy a custom model](deploy-trained-model.md)
+ [Deploy a JumpStart model](deploy-jumpstart-model.md)

# Train a PyTorch model
<a name="train-models-with-hyperpod"></a>

This topic walks you through training a PyTorch model with HyperPod.

In this scenario, you train a PyTorch model using the `hyp-pytorch-job` template, which simplifies job creation by exposing commonly used parameters. This example stores the model artifacts in an S3 bucket so they can be used for inference later, but you can choose any storage location you prefer.

## Create a training job
<a name="create-training-job"></a>

You can train the model using either the CLI or Python SDK.

### Using the CLI
<a name="using-cli"></a>

Create a training job with the following command:

```
hyp create hyp-pytorch-job \
    --version 1.0 \
    --job-name test-pytorch-job \
    --image pytorch/pytorch:latest \
    --command '["python", "train.py"]' \
    --args '["--epochs", "10", "--batch-size", "32"]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.p4d.24xlarge \
    --tasks-per-node 8 \
    --label-selector '{"accelerator": "nvidia", "network": "efa"}' \
    --deep-health-check-passed-nodes-only true \
    --scheduler-type "kueue" \
    --queue-name "training-queue" \
    --priority "high" \
    --max-retry 3 \
    --volumes '["data-vol", "model-vol", "checkpoint-vol"]' \
    --persistent-volume-claims '["shared-data-pvc", "model-registry-pvc"]' \
    --output-s3-uri s3://my-bucket/model-artifacts
```

**Key required parameters explained**:
+ `--job-name`: Unique identifier for your training job
+ `--image`: Docker image containing your training environment

This command starts a training job named `test-pytorch-job`. The `--output-s3-uri` specifies where the trained model artifacts will be stored, for example, `s3://my-bucket/model-artifacts`. Note this location, as you’ll need it for deploying the custom model.
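Several of the flags above (`--command`, `--args`, `--environment`, `--label-selector`) take JSON strings, and a malformed value is a common source of submission errors. A quick way to sanity-check the values before running the command (a generic sketch using only the standard library):

```python
import json

# The JSON-valued flags from the command above
flag_values = {
    "--command": '["python", "train.py"]',
    "--args": '["--epochs", "10", "--batch-size", "32"]',
    "--environment": '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}',
    "--label-selector": '{"accelerator": "nvidia", "network": "efa"}',
}

for flag, raw in flag_values.items():
    try:
        parsed = json.loads(raw)
        print(f"{flag}: OK ({type(parsed).__name__})")
    except json.JSONDecodeError as err:
        print(f"{flag}: invalid JSON - {err}")
```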

### Using the Python SDK
<a name="using-python-sdk"></a>

For programmatic control, use the SDK. Create a Python script to launch the same training job.

```
from sagemaker.hyperpod import HyperPodPytorchJob
from sagemaker.hyperpod.job import (
    ReplicaSpec, Template, Spec, Container, Resources, RunPolicy, Metadata
)

# Number of processes per node
nproc_per_node = "1"

# Define the replica specification for the training pods
replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        template=Template(
            spec=Spec(
                containers=[
                    Container(
                        name="container-name",
                        # Training image
                        image="448049793756.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist",
                        # Always pull the image
                        image_pull_policy="Always",
                        resources=Resources(
                            requests={"nvidia.com/gpu": "0"},  # No GPUs requested
                            limits={"nvidia.com/gpu": "0"},    # No GPU limit
                        ),
                        # Command and arguments for the training script
                        command=["python", "train.py"],
                        args=["--epochs", "10", "--batch-size", "32"],
                    )
                ]
            )
        ),
    )
]

# Keep pods after completion
run_policy = RunPolicy(clean_pod_policy="None")

# Create the PyTorch job
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="demo"),                  # Job name
    nproc_per_node=nproc_per_node,                   # Processes per node
    replica_specs=replica_specs,                     # Replica specifications
    run_policy=run_policy,                           # Run policy
    output_s3_uri="s3://my-bucket/model-artifacts",  # S3 location for artifacts
)

# Launch the job
pytorch_job.create()
```

## Monitor your training job
<a name="monitor-training-job"></a>

Monitor your job's progress with these commands:

### Using the CLI
<a name="monitor-cli"></a>

```
# Check job status
hyp list hyp-pytorch-job

# Get detailed information
hyp describe hyp-pytorch-job --job-name test-pytorch-job

# View logs
hyp get-logs hyp-pytorch-job \
    --pod-name test-pytorch-job-pod-0 \
    --job-name test-pytorch-job
```

**Note**: Training time varies based on model complexity and instance type. Monitor the logs to track progress.

These commands help you verify the job’s status and troubleshoot issues. Once the job completes successfully, the model artifacts are saved to `s3://my-bucket/model-artifacts`.
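If you are scripting around the CLI, a simple polling loop can wait for the job to finish before you look for artifacts. The sketch below is generic: `fetch_status` is a hypothetical placeholder for however you read the job's state (for example, by parsing `hyp list hyp-pytorch-job` output), and the status strings are illustrative:

```python
import time

def wait_for_completion(fetch_status, timeout_s=3600, poll_s=30):
    """Poll fetch_status() until it reports a terminal state or we time out.

    fetch_status is a zero-argument callable returning a status string
    such as "Running", "Completed", or "Failed" (hypothetical values).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("Completed", "Failed"):
            return status
        time.sleep(poll_s)
    raise TimeoutError("job did not reach a terminal state in time")

# Example with a stubbed status source:
statuses = iter(["Running", "Running", "Completed"])
print(wait_for_completion(lambda: next(statuses), poll_s=0))  # -> Completed
```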

### Using the Python SDK
<a name="monitor-python-sdk"></a>

Add the following code to your Python script:

```
print("List all pods created for this job:")
print(pytorch_job.list_pods())

print("Check the logs from pod0:")
print(pytorch_job.get_logs_from_pod(pod_name="demo-pod-0"))

print("List all HyperPodPytorchJobs:")
print(HyperPodPytorchJob.list())

print("Describe job:")
print(HyperPodPytorchJob.get(name="demo").model_dump())

pytorch_job.refresh()
print(pytorch_job.status.model_dump())
```

## Next steps
<a name="next-steps"></a>

After training, the model artifacts are stored in the S3 bucket you specified (`s3://my-bucket/model-artifacts`). You can use these artifacts to deploy a model. Currently, you must manually manage the transition from training to inference. This involves:
+ **Locating artifacts**: Check the S3 bucket (`s3://my-bucket/model-artifacts`) to confirm the trained model files are present.
+ **Recording the path**: Note the exact S3 path (e.g., `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`) for use in the inference setup.
+ **Referencing in deployment**: Provide this S3 path when configuring the custom endpoint to ensure the correct model is loaded.
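Because the training-to-inference handoff is manual, a small helper that composes the artifact path from the bucket, prefix, and job name can reduce copy-paste errors. This is a sketch: the layout `<prefix>/<job-name>/model.tar.gz` mirrors the example path above and may differ in your setup:

```python
def artifact_uri(bucket, prefix, job_name, filename="model.tar.gz"):
    """Compose the S3 URI where a training job's artifacts are expected."""
    parts = [p.strip("/") for p in (prefix, job_name, filename) if p]
    return f"s3://{bucket}/" + "/".join(parts)

print(artifact_uri("my-bucket", "model-artifacts", "test-pytorch-job"))
# -> s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz
```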

# Deploy a custom model
<a name="deploy-trained-model"></a>

After training completes, deploy your model for inference. You can deploy a custom model using either the CLI or the SDK.

## Locate your model artifacts
<a name="locate-model-artifacts"></a>
+ **Check your S3 bucket**: Verify that model artifacts are saved at `s3://my-bucket/model-artifacts/`
+ **Note the exact path**: You'll need the full path (for example, `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`)

## Deploy using the CLI
<a name="deploy-using-cli"></a>

Run the following command to deploy your custom model:

```
hyp create hyp-custom-endpoint \
    --version 1.0 \
    --env '{"HF_MODEL_ID":"/opt/ml/model", "SAGEMAKER_PROGRAM":"inference.py"}' \
    --model-source-type s3 \
    --model-location test-pytorch-job \
    --s3-bucket-name my-bucket \
    --s3-region us-east-2 \
    --prefetch-enabled true \
    --image-uri 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-inference:latest \
    --model-volume-mount-name model-weights \
    --container-port 8080 \
    --resources-requests '{"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"}' \
    --resources-limits '{"nvidia.com/gpu": 1}' \
    --tls-output-s3-uri s3://<bucket_name> \
    --instance-type ml.g5.8xlarge \
    --endpoint-name endpoint-custom-pytorch \
    --model-name pytorch-custom-model
```

This command deploys the trained model as an endpoint named `endpoint-custom-pytorch`. The `--model-location` references the artifact path from the training job.
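The `--resources-requests` value uses Kubernetes quantity notation: `"30000m"` means 30,000 millicores, that is, 30 vCPUs. A small converter clarifies the notation (a generic sketch, standard library only):

```python
def cpu_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("30000m" or "2") to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

print(cpu_to_cores("30000m"))  # -> 30.0
print(cpu_to_cores("2"))       # -> 2.0
```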

## Deploy using the Python SDK
<a name="deploy-using-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_custom_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint

model = Model(
    model_source_type="s3",
    model_location="test-pytorch-job",
    s3_bucket_name="my-bucket",
    s3_region="us-east-2",
    prefetch_enabled=True
)

server = Server(
    instance_type="ml.g5.8xlarge",
    image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
    container_port=8080,
    model_volume_mount_name="model-weights"
)

resources = {
    "requests": {"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
    "limits": {"nvidia.com/gpu": 1}
}

env = EnvironmentVariables(
    HF_MODEL_ID="/opt/ml/model",
    SAGEMAKER_PROGRAM="inference.py",
    SAGEMAKER_SUBMIT_DIRECTORY="/opt/ml/model/code",
    MODEL_CACHE_ROOT="/opt/ml/model",
    SAGEMAKER_ENV="1"
)

endpoint_name = SageMakerEndpoint(name="endpoint-custom-pytorch")

tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://<bucket_name>")

custom_endpoint = HPCustomEndpoint(
    model=model,
    server=server,
    resources=resources,
    environment=env,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config
)

custom_endpoint.create()
```

## Invoke the endpoint
<a name="invoke-endpoint"></a>

### Using the CLI
<a name="invoke-using-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-custom-endpoint \
    --endpoint-name endpoint-custom-pytorch \
    --body '{"inputs":"What is the capital of USA?"}'
```

This returns the model’s response, such as “The capital of the USA is Washington, D.C.”
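The `--body` value must be a JSON document; building it with `json.dumps` avoids shell-quoting mistakes when the prompt itself contains quotes (a generic sketch):

```python
import json

prompt = 'What is the "capital" of USA?'  # quotes inside the prompt are escaped safely
body = json.dumps({"inputs": prompt})
print(body)
```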

### Using the SDK
<a name="invoke-using-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = custom_endpoint.invoke(body=data).body.read()
print(response)
```

## Manage the endpoint
<a name="manage-endpoint"></a>

### Using the CLI
<a name="manage-using-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-custom-endpoint
hyp get hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="manage-using-sdk"></a>

Add the following code to your Python script:

```
logs = custom_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-using-cli"></a>

```
hyp delete hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="cleanup-using-sdk"></a>

```
custom_endpoint.delete()
```

## Next steps
<a name="deploy-custom-next-steps"></a>

You've successfully deployed and tested a custom model using SageMaker HyperPod. You can now use this endpoint for inference in your applications.

# Deploy a JumpStart model
<a name="deploy-jumpstart-model"></a>

You can deploy a pre-trained JumpStart model for inference using either the CLI or the SDK.

## Using the CLI
<a name="deploy-jumpstart-cli"></a>

Run the following command to deploy a JumpStart model:

```
hyp create hyp-jumpstart-endpoint \
  --version 1.0 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-jumpstart
```

## Using the SDK
<a name="deploy-jumpstart-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model = Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b'
)

server = Server(
    instance_type='ml.g5.8xlarge'
)

endpoint_name = SageMakerEndpoint(name='<endpoint-name>')

# Create the endpoint spec
js_endpoint = HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name
)

# Deploy the endpoint
js_endpoint.create()
```

## Invoke the endpoint
<a name="invoke-jumpstart-endpoint"></a>

### Using the CLI
<a name="invoke-jumpstart-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name endpoint-jumpstart \
    --body '{"inputs":"What is the capital of USA?"}'
```

### Using the SDK
<a name="invoke-jumpstart-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = js_endpoint.invoke(body=data).body.read()
print(response)
```

## Manage the endpoint
<a name="manage-jumpstart-endpoint"></a>

### Using the CLI
<a name="manage-jumpstart-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-jumpstart-endpoint
hyp get hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="manage-jumpstart-sdk"></a>

Add the following code to your Python script:

```
endpoint_iterator = HPJumpStartEndpoint.list()
for endpoint in endpoint_iterator:
    print(endpoint.name, endpoint.status)

logs = js_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-jumpstart-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-jumpstart-cli"></a>

```
hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="cleanup-jumpstart-sdk"></a>

```
js_endpoint.delete()
```

## Next steps
<a name="jumpstart-next-steps"></a>

Now that you've trained a PyTorch model, deployed it as a custom endpoint, and deployed a JumpStart model using HyperPod's CLI and SDK, explore advanced features:
+ **Multi-node training**: Scale training across multiple instances
+ **Custom containers**: Build specialized training environments
+ **Integration with SageMaker Pipelines**: Automate your ML workflows
+ **Advanced monitoring**: Set up custom metrics and alerts

For more examples and advanced configurations, visit the [SageMaker HyperPod GitHub repository](https://github.com/aws/amazon-sagemaker-examples).