MLSUS-13: Optimize models for inference

Improve the efficiency of your models, and thus consume fewer resources for inference, by compiling the models into optimized forms.

Implementation plan

  • Use open-source model compilers - Libraries such as Treelite (for decision tree ensembles) improve prediction throughput by compiling trained models into native code that uses compute resources more efficiently; see the Treelite sketch after this list.

  • Use third-party tools - Solutions like Hugging Face Infinity allow you to accelerate transformer models and run inference not only on GPUs but also on CPUs.

  • Use Amazon SageMaker Neo - SageMaker Neo enables developers to optimize ML models for inference on SageMaker in the cloud and on supported devices at the edge. The SageMaker Neo runtime consumes as little as one-tenth of the footprint of a deep learning framework, while optimizing models to perform up to 25 times faster with no loss in accuracy; see the Neo sketch after this list.
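
For the Treelite path, a minimal sketch assuming Treelite 3.x (where the compiler ships as treelite and the prediction runtime as treelite_runtime) and a trained XGBoost model saved to disk; the file paths, the gcc toolchain, and the input matrix shape are placeholders:

```python
import numpy as np
import treelite
import treelite_runtime

# Load a trained XGBoost tree ensemble from disk (path is a placeholder)
model = treelite.Model.load("xgboost_model.bin", model_format="xgboost")

# Compile the ensemble into a native shared library; parallel_comp
# splits the trees into chunks that are compiled in parallel
model.export_lib(
    toolchain="gcc",
    libpath="./compiled_model.so",
    params={"parallel_comp": 4},
)

# Load the compiled library with the lightweight runtime and predict
predictor = treelite_runtime.Predictor("./compiled_model.so")
dmat = treelite_runtime.DMatrix(np.random.rand(100, 10).astype("float32"))
predictions = predictor.predict(dmat)
```

The compiled library is self-contained, so the inference host only needs the small treelite_runtime package rather than the full training framework.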
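
For the SageMaker Neo path, a sketch using the SageMaker Python SDK (v2). The S3 paths, IAM role ARN, input tensor name and shape, and framework versions below are placeholders; Neo supports a specific set of frameworks and versions, so check the current documentation for valid values:

```python
from sagemaker.pytorch import PyTorchModel

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# A trained PyTorch model artifact stored in S3 (placeholder path)
model = PyTorchModel(
    model_data="s3://my-bucket/models/model.tar.gz",
    role=role,
    entry_point="inference.py",
    framework_version="1.13",
    py_version="py39",
)

# Start a Neo compilation job targeting a specific instance family;
# input_shape must match the tensor the model expects
compiled_model = model.compile(
    target_instance_family="ml_c5",
    input_shape={"input0": [1, 3, 224, 224]},
    output_path="s3://my-bucket/compiled/",
    role=role,
    framework="pytorch",
    framework_version="1.13",
)

# Deploy the compiled artifact to a real-time endpoint in the same family
predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)
```

compile() blocks until the compilation job finishes and returns a model configured to use the Neo inference container for the chosen target, so the deployed endpoint serves the optimized artifact directly.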
