Troubleshooting

The following are common Amazon Elastic Inference errors and troubleshooting steps.

Issues Launching Accelerators

Ensure that you are launching in a Region where Elastic Inference accelerators are available. For more information, see the Region Table.
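
If you are unsure whether accelerators are offered in your current Region, you can query the Elastic Inference API directly. The following AWS CLI call is a quick check; it assumes your CLI default Region is the one in which you intend to launch:

# List Elastic Inference accelerator offerings in the current Region
aws elastic-inference describe-accelerator-offerings --location-type region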

Resolving Configuration Issues

If you launched your instance with the Deep Learning AMI (DLAMI), run python ~/anaconda3/bin/EISetupValidator.py to verify that the instance is correctly configured. You can also download the EISetupValidator.py script and run python EISetupValidator.py.
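
For example, on a DLAMI instance you can run the bundled copy of the script and check its exit status. This sketch assumes the validator signals a configuration problem through a nonzero exit code:

# Validate the Elastic Inference setup on a DLAMI instance
python ~/anaconda3/bin/EISetupValidator.py
# Assumption: a nonzero status indicates a configuration problem
echo $?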

Issues Running AWS Batch

Running an AWS Batch job from an Amazon EC2 instance with Elastic Inference might fail with the following error:

[Sat Nov 23 20:21:11 2019, 792775us] Error during accelerator discovery
[Sat Nov 23 20:21:11 2019, 792895us] Failed to detect any accelerator
[Sat Nov 23 20:21:11 2019, 792920us] Warning - Preconditions not met for reaching Accelerator

To fix this issue, unset the ECS_CONTAINER_METADATA_URI environment variable for the processes using Elastic Inference enabled frameworks. The ECS_CONTAINER_METADATA_URI environment variable is automatically set for containers launched as Amazon Elastic Container Service tasks. AWS Batch uses Amazon ECS to run containerized jobs. The following shows how to unset the ECS_CONTAINER_METADATA_URI variable.

env -u ECS_CONTAINER_METADATA_URI python script_using_tf_predictor_api.py
env -u ECS_CONTAINER_METADATA_URI amazonei_tensorflow_model_server
env -u ECS_CONTAINER_METADATA_URI python script_using_ei_mxnet.py

This does not unset ECS_CONTAINER_METADATA_URI globally. It unsets the variable only for the launched process, so there are no undesirable side effects for other processes. Once ECS_CONTAINER_METADATA_URI is no longer set, Elastic Inference should work with AWS Batch.
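
To confirm that the variable really is absent from a child process's environment, you can run a quick check like the following, which prints unset when the variable has been cleared:

# Sanity check: the child shell should report the variable as unset
env -u ECS_CONTAINER_METADATA_URI sh -c 'echo "${ECS_CONTAINER_METADATA_URI:-unset}"'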

Resolving Permission Issues

If you are unable to successfully connect to accelerators, verify that you have completed the following:

  • Set up a Virtual Private Cloud (VPC) endpoint for Elastic Inference for the subnet in which you have launched your instance.

  • Configure security groups for the instance and VPC endpoints with outbound rules that allow communications for HTTPS (Port 443). Configure the VPC endpoint security group with an inbound rule that allows HTTPS traffic.

  • Add an IAM instance role with the elastic-inference:Connect permission to the instance from which you are connecting to the accelerator. A minimal policy sketch follows this list.

  • Check CloudWatch Logs to verify that your accelerator is healthy. The Amazon EC2 instance details from the console contain a link to CloudWatch, which allows you to view the health of its associated accelerator.
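
As a reference point, the following AWS CLI call attaches a minimal inline policy granting only the elastic-inference:Connect action to an instance role. The role and policy names are placeholders; depending on your setup, the role may need additional permissions:

# Hypothetical role and policy names; grants only the Connect action described above
aws iam put-role-policy \
  --role-name MyEC2InstanceRole \
  --policy-name ElasticInferenceConnect \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"elastic-inference:Connect","Resource":"*"}]}'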

Stop and Start the Instance

If your Elastic Inference accelerator is in an unhealthy state, stopping the instance and starting it again is the simplest option. For more information, see Stopping and Starting Your Instances.

Warning

When you stop an instance, the data on any instance store volumes is erased. If you have any data to preserve on instance store volumes, make sure to back it up to persistent storage.
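
With the AWS CLI, a stop-and-start cycle looks like the following. The instance ID is a placeholder, and because stop-instances is asynchronous, the example waits for the stopped state before starting again:

# Stop the instance, wait until it is fully stopped, then start it again
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 wait instance-stopped --instance-ids i-1234567890abcdef0
aws ec2 start-instances --instance-ids i-1234567890abcdef0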

Troubleshooting Model Performance

Elastic Inference accelerates operations defined by frameworks like TensorFlow and MXNet. While Elastic Inference accelerates most neural network, math, array manipulation, and control flow operators, there are many operators that it does not accelerate, including:

  • training-related operators

  • input/output operators

  • operators in contrib

When a model contains operators that Elastic Inference does not accelerate, the framework runs them on the instance. The frequency and location of these operators within a model graph can affect the model's inference performance with Elastic Inference accelerators. If your model is known to benefit from GPU acceleration but does not perform well on Elastic Inference, contact AWS Support or amazon-ei-feedback@amazon.com.

Submitting Feedback

Contact AWS Support or send feedback to: amazon-ei-feedback@amazon.com.