MLSUS-13: Optimize models for inference
Improve the efficiency of your models, and thus consume fewer resources for inference, by compiling them into optimized forms.
Implementation plan
- Use open-source model compilers - Libraries such as Treelite (for decision tree ensembles) improve the prediction throughput of models through more efficient use of compute resources (see the Treelite sketch after this list).
- Use third-party tools - Solutions like Hugging Face Infinity allow you to accelerate transformer models and run inference not only on GPUs but also on CPUs (an open-source illustration of the same idea follows the list).
- Use Amazon SageMaker Neo - SageMaker Neo enables developers to optimize ML models for inference on SageMaker in the cloud and on supported edge devices (see the compilation-job sketch below). The SageMaker Neo runtime consumes as little as one-tenth the footprint of a deep learning framework while optimizing models to perform up to 25 times faster with no loss in accuracy.
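As a minimal sketch of the first option, the following compiles a trained XGBoost ensemble into a native shared library with Treelite. The model path, feature count, and compiler settings are placeholder assumptions, and the API shown is Treelite 3.x (newer releases move compilation into the separate TL2cgen package).

```python
# Sketch: compile a tree ensemble with Treelite (3.x API); paths are placeholders.
import numpy as np
import treelite
import treelite_runtime

# Load a trained XGBoost model from disk (hypothetical file).
model = treelite.Model.load("xgboost_model.json", model_format="xgboost_json")

# Compile the ensemble to a native shared library for faster prediction.
model.export_lib(
    toolchain="gcc",              # host C compiler
    libpath="./predictor.so",     # output shared library
    params={"parallel_comp": 4},  # split codegen across 4 translation units
)

# Run inference through the compiled predictor.
predictor = treelite_runtime.Predictor("./predictor.so")
X = np.random.rand(100, 20)  # 100 rows, 20 features (placeholder data)
preds = predictor.predict(treelite_runtime.DMatrix(X))
```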
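Hugging Face Infinity is a managed, proprietary product, so there is no public client API to sketch here. As a hedged illustration of the same idea, accelerated transformer inference on CPUs, this sketch swaps in the open-source Hugging Face Optimum and ONNX Runtime stack; the model checkpoint is an assumption.

```python
# Sketch: CPU-accelerated transformer inference via Optimum + ONNX Runtime,
# shown as an open-source stand-in for a managed accelerator like Infinity.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint

# Export the checkpoint to ONNX and load it under ONNX Runtime, which applies
# graph optimizations suited to CPU inference.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimized inference also runs well on CPUs."))
```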
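Finally, a minimal sketch of submitting a SageMaker Neo compilation job through boto3. The bucket, role ARN, job name, framework, input shape, and target device are all placeholder assumptions; adjust them to your trained model artifact.

```python
# Sketch: submit and poll a SageMaker Neo compilation job (placeholder values).
import time

import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="resnet50-neo-example",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/models/model.tar.gz",      # trained model artifact
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',  # expected input shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "ml_c5",  # compile for CPU-based ml.c5 instances
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)

# Poll until the job reaches a terminal state (Neo jobs take a few minutes).
while True:
    status = sm.describe_compilation_job(
        CompilationJobName="resnet50-neo-example"
    )["CompilationJobStatus"]
    if status in ("COMPLETED", "FAILED", "STOPPED"):
        break
    time.sleep(30)
print("Compilation finished with status:", status)
```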