Benchmark LLMs with vLLM and lm-evaluation-harness
This tutorial walks you through evaluating multiple large language models (LLMs) against multiple benchmarks in a single Deadline Cloud job. Each model becomes one task in a parameter sweep, and tasks run in parallel across workers. A final step aggregates per-model results into a ranked leaderboard in CSV and Markdown format.
The source code for this tutorial is available in the deadline-cloud-samples repository on GitHub.
Estimated time: 20–40 minutes (depending on the number of models and benchmarks).
Overview
Each task in the EvalModels step starts a local vLLM server for one model and runs the selected lm-evaluation-harness benchmarks against it.
To complete this tutorial, follow these steps:
- Complete the prerequisites.
- Set up your farm.
- Submit the evaluation job.
- Download and review results.
- Clean up resources.
Prerequisites
Before you begin, the following setup is recommended:
- A Deadline Cloud farm with an NVIDIA GPU service-managed fleet (A10G or L4, at least 32 GB RAM, at least 4 vCPUs).
- A queue with a conda queue environment attached that reads the CondaPackages and CondaChannels job parameters.
- The Deadline Cloud CLI installed on your workstation.
- Sufficient Amazon Elastic Compute Cloud (Amazon EC2) vCPU service quota for GPU instances. The default 3-model run on g5.xlarge (4 vCPUs each) requires at least 12 vCPUs under Running On-Demand G and VT instances.
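The quota figure in the last prerequisite is a simple product, and it is worth recomputing before you change the model list. The sketch below does that arithmetic. The function name `required_vcpus` is hypothetical, and the calculation assumes the worst case for quota: every model task runs at the same time on its own instance.

```python
def required_vcpus(num_models: int, vcpus_per_instance: int = 4) -> int:
    """Worst-case vCPU demand: one GPU instance per model task, all running at once.

    The default of 4 vCPUs matches the g5.xlarge figure quoted in the prerequisites.
    """
    if num_models < 0 or vcpus_per_instance <= 0:
        raise ValueError("num_models must be >= 0 and vcpus_per_instance must be > 0")
    return num_models * vcpus_per_instance


# Default run: 3 models, 4 vCPUs each.
print(required_vcpus(3))  # 12
```

If the result exceeds your current quota, either request an increase or expect some tasks to wait until a worker becomes free.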
Note
A Hugging Face token is only required for gated models (such as Llama). The default model list uses ungated models.
Set up your farm
The fastest way to get a compatible farm is to deploy the CUDA farm CloudFormation template.
To configure the CLI for your farm
- After the CloudFormation stack reaches CREATE_COMPLETE, configure the Deadline Cloud CLI to use the new farm:
    deadline config set defaults.farm_id FarmId-from-stack-outputs
    deadline config set defaults.queue_id CUDAQueueId-from-stack-outputs
If you already have a farm, the following configuration is recommended:
- A service-managed fleet (SMF) with NVIDIA GPUs, at least 32 GB RAM, and at least 4 vCPUs.
- A queue with a conda queue environment that reads the CondaPackages and CondaChannels job parameters.
Submit the evaluation job
To submit the evaluation job
- Clone the samples repository and navigate to the job bundle directory:
    git clone https://github.com/aws-deadline/deadline-cloud-samples.git
    cd deadline-cloud-samples/job_bundles/vllm_lm_eval_leaderboard
- Submit the job with the default models and benchmarks:
    deadline bundle submit . \
        --parameter MaxModelLen=2048
  The default model list evaluates three small, ungated models: Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-1.5B, and EleutherAI/pythia-1.4b. The default benchmarks are a commonsense reasoning suite: hellaswag, arc_easy, arc_challenge, winogrande.
- Monitor the job status in the Deadline Cloud console or by using the deadline job get command.
Changing the model list
Models are defined as a STRING parameter space on the EvalModels step in template.yaml:
    parameterSpace:
      taskParameterDefinitions:
        - name: ModelName
          type: STRING
          range:
            - "Qwen/Qwen2.5-0.5B"
            - "Qwen/Qwen2.5-1.5B"
            - "EleutherAI/pythia-1.4b"
To add or remove models, edit the range list. Each entry becomes a task visible in the Deadline Cloud monitor. Model IDs must be supported by vLLM (see the vLLM supported models list).
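If you maintain the model list elsewhere (for example, in a script), you can generate the range block instead of editing it by hand. The following is a minimal sketch, not part of the job bundle: the helper name `render_parameter_space` is hypothetical, and the output uses two-space indentation starting at column zero, so you need to indent it to match where parameterSpace sits inside the EvalModels step of your template.yaml.

```python
import json


def render_parameter_space(models: list[str]) -> str:
    """Render a parameterSpace block with one STRING range entry per model ID."""
    if not models:
        raise ValueError("at least one model ID is required")
    if len(set(models)) != len(models):
        raise ValueError("duplicate model IDs would create duplicate tasks")
    lines = [
        "parameterSpace:",
        "  taskParameterDefinitions:",
        "    - name: ModelName",
        "      type: STRING",
        "      range:",
    ]
    # json.dumps emits a double-quoted string, which is also a valid YAML scalar.
    lines += [f"        - {json.dumps(model)}" for model in models]
    return "\n".join(lines)


print(render_parameter_space([
    "Qwen/Qwen2.5-0.5B",
    "Qwen/Qwen2.5-1.5B",
    "EleutherAI/pythia-1.4b",
]))
```

The duplicate check is a design choice of this sketch: two identical entries would run the same evaluation twice.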
Choosing benchmarks
The Benchmarks job parameter is a comma-separated list of lm-evaluation-harness task names. Override the default benchmarks at submit time:
    deadline bundle submit . \
        --parameter Benchmarks="hellaswag,mmlu,gsm8k"
All benchmarks in the list run sequentially against each model's vLLM server. Keep MaxModelLen less than or equal to the smallest model's context window. For a full list of available benchmarks, see the lm-evaluation-harness tasks.
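The two rules above (a comma-separated Benchmarks value, and MaxModelLen no larger than the smallest context window) can be checked before you submit. The sketch below builds the submit command as a string and refuses to build it when MaxModelLen is too large. The helper name `build_submit_command`, the model names, and the context window values are all illustrative assumptions; look up the real context window for each model you evaluate.

```python
import shlex


def build_submit_command(
    benchmarks: list[str],
    max_model_len: int,
    context_windows: dict[str, int],
) -> str:
    """Return a 'deadline bundle submit' command line after basic validation."""
    names = [name.strip() for name in benchmarks]
    if not names or any(not name or "," in name for name in names):
        raise ValueError("benchmark names must be non-empty and contain no commas")

    smallest = min(context_windows.values())
    if max_model_len > smallest:
        raise ValueError(
            f"MaxModelLen={max_model_len} exceeds the smallest context window ({smallest})"
        )

    argv = [
        "deadline", "bundle", "submit", ".",
        "--parameter", f"Benchmarks={','.join(names)}",
        "--parameter", f"MaxModelLen={max_model_len}",
    ]
    return shlex.join(argv)  # quotes arguments only where the shell needs it


# Hypothetical models and context windows, for illustration only.
windows = {"example/model-a": 2048, "example/model-b": 32768}
print(build_submit_command(["hellaswag", "mmlu", "gsm8k"], 2048, windows))
```

Run the printed command from the job bundle directory, as in the submit procedure.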
Download and review results
To download the leaderboard results
- After the job completes, download the output:
    deadline job download-output --job-id job-id
- View the leaderboard:
    cat leaderboard_results/leaderboard.md
The following example shows typical leaderboard output:
# LLM Leaderboard

Models: 3 | Benchmarks: arc_challenge, arc_easy, hellaswag, winogrande

| Rank | Model                  | arc_challenge | arc_easy | hellaswag | winogrande | Mean   |
|------|------------------------|---------------|----------|-----------|------------|--------|
| 1    | Qwen/Qwen2.5-1.5B      | 0.4497        | 0.7176   | 0.6775    | 0.6322     | 0.6192 |
| 2    | Qwen/Qwen2.5-0.5B      | 0.3200        | 0.5816   | 0.5223    | 0.5691     | 0.4982 |
| 3    | EleutherAI/pythia-1.4b | 0.2833        | 0.5387   | 0.5201    | 0.5730     | 0.4788 |
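The Mean column in this sample is consistent with an unweighted average of the four benchmark scores, with models ranked by that average in descending order. The sketch below reproduces the ranking from the sample numbers under that assumption. It is not the aggregation script from the job bundle, and `build_leaderboard` is a hypothetical name.

```python
from statistics import mean


def build_leaderboard(scores: dict[str, dict[str, float]]) -> list[tuple[int, str, float]]:
    """Rank models by the unweighted mean of their benchmark scores, highest first."""
    averaged = [(model, mean(per_bench.values())) for model, per_bench in scores.items()]
    averaged.sort(key=lambda row: row[1], reverse=True)
    return [(rank, model, avg) for rank, (model, avg) in enumerate(averaged, start=1)]


# Scores copied from the sample leaderboard output.
scores = {
    "Qwen/Qwen2.5-0.5B": {
        "arc_challenge": 0.3200, "arc_easy": 0.5816, "hellaswag": 0.5223, "winogrande": 0.5691,
    },
    "EleutherAI/pythia-1.4b": {
        "arc_challenge": 0.2833, "arc_easy": 0.5387, "hellaswag": 0.5201, "winogrande": 0.5730,
    },
    "Qwen/Qwen2.5-1.5B": {
        "arc_challenge": 0.4497, "arc_easy": 0.7176, "hellaswag": 0.6775, "winogrande": 0.6322,
    },
}

for rank, model, avg in build_leaderboard(scores):
    print(f"{rank}  {model:<24} {avg:.4f}")
```

An unweighted mean treats every benchmark equally, so adding a benchmark where all models score low lowers every Mean value without necessarily changing the ranking.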
Clean up
To avoid ongoing charges, clean up the resources that you created for this tutorial:
To clean up tutorial resources
- If you deployed the CUDA farm CloudFormation template, delete the CloudFormation stack from the CloudFormation console.
- If you used an existing farm, stop or delete the GPU fleet that you used for this tutorial.
- Remove local output files if they are no longer needed:
    rm -rf leaderboard_results/
Troubleshooting
Fleet does not scale up workers
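The most common cause is an Amazon EC2 vCPU service quota that is too low. Open the Service Quotas console and check the Running On-Demand G and VT instances quota. The default 3-model run on g5.xlarge requires at least 12 vCPUs.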