Troubleshooting
The following are common Amazon Elastic Inference errors and troubleshooting steps.
Issues Launching Accelerators
Ensure that you are launching in a Region where Elastic Inference accelerators are available. For more information, see the Region Table.
Resolving Configuration Issues
If you launched your instance with the Deep Learning AMI (DLAMI), run python ~/anaconda3/bin/EISetupValidator.py to verify that the instance is correctly configured. You can also download the EISetupValidator.py script and run it directly:
python EISetupValidator.py
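If you would rather run the check from a script, the following minimal Python sketch (assuming the DLAMI path above) invokes the validator and inspects its exit code; the failure message is illustrative:

import os
import subprocess
import sys

# Path to the validator shipped with the DLAMI (see above); adjust this
# if you downloaded EISetupValidator.py somewhere else.
validator = os.path.expanduser("~/anaconda3/bin/EISetupValidator.py")

# Run the validator with the current interpreter and check its exit code.
result = subprocess.run([sys.executable, validator])
if result.returncode != 0:
    print("Elastic Inference setup validation failed", file=sys.stderr)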
Issues Running AWS Batch
Running an AWS Batch job from an Amazon EC2 instance with Elastic Inference may throw the following error:
[Sat Nov 23 20:21:11 2019, 792775us] Error during accelerator discovery
[Sat Nov 23 20:21:11 2019, 792895us] Failed to detect any accelerator
[Sat Nov 23 20:21:11 2019, 792920us] Warning - Preconditions not met for reaching Accelerator
To fix this issue, unset the ECS_CONTAINER_METADATA_URI environment variable for the processes using Elastic Inference enabled frameworks. The ECS_CONTAINER_METADATA_URI environment variable is automatically set for containers launched as Amazon Elastic Container Service tasks, and AWS Batch uses Amazon ECS to run containerized jobs. The following examples show how to unset the ECS_CONTAINER_METADATA_URI variable.
env -u ECS_CONTAINER_METADATA_URI python script_using_tf_predictor_api.py
env -u ECS_CONTAINER_METADATA_URI amazonei_tensorflow_model_server
env -u ECS_CONTAINER_METADATA_URI python script_using_ei_mxnet.py
This does not unset ECS_CONTAINER_METADATA_URI globally. It only unsets the variable for the relevant processes, so unsetting it has no undesirable side effects. Once ECS_CONTAINER_METADATA_URI is no longer set, Elastic Inference should work with AWS Batch.
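If your job's entry point is a Python script, you can also remove the variable from within the process itself before importing the Elastic Inference enabled framework. A minimal sketch:

import os

# Remove the variable for this process only; this is equivalent to the
# env -u invocations above and does not affect other processes.
os.environ.pop("ECS_CONTAINER_METADATA_URI", None)

# Import the Elastic Inference enabled framework only after this point,
# for example: import mxnet as mx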
Resolving Permission Issues
If you are unable to successfully connect to accelerators, verify that you have completed the following:
- Set up a Virtual Private Cloud (VPC) endpoint for Elastic Inference for the subnet in which you launched your instance (see the verification sketch after this list).
- Configure security groups for the instance and VPC endpoint with outbound rules that allow HTTPS (port 443) communication. Configure the VPC endpoint security group with an inbound rule that allows HTTPS traffic.
- Add an IAM instance role with the elastic-inference:Connect permission to the instance from which you are connecting to the accelerator.
- Check CloudWatch Logs to verify that your accelerator is healthy. The Amazon EC2 instance details in the console contain a link to CloudWatch, where you can view the health of the associated accelerator.
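The first and third checks can be scripted with boto3. In the following minimal sketch, the Region and role name are placeholders, and the elastic-inference.runtime service name is an assumption based on standard AWS VPC endpoint naming; confirm it for your Region.

import json

import boto3

REGION = "us-west-2"               # placeholder: your Region
ROLE_NAME = "my-ei-instance-role"  # placeholder: your instance role name

ec2 = boto3.client("ec2", region_name=REGION)

# Look for an Elastic Inference VPC endpoint in the Region. The service
# name below is an assumption; verify it before relying on this check.
endpoints = ec2.describe_vpc_endpoints(
    Filters=[{
        "Name": "service-name",
        "Values": ["com.amazonaws.%s.elastic-inference.runtime" % REGION],
    }]
)["VpcEndpoints"]
print("Elastic Inference VPC endpoints found:", len(endpoints))

# Attach an inline policy granting elastic-inference:Connect to the
# instance role. The policy name is a placeholder.
iam = boto3.client("iam")
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="ei-connect",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "elastic-inference:Connect",
            "Resource": "*",
        }],
    }),
)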
Stop and Start the Instance
If your Elastic Inference accelerator is in an unhealthy state, stopping the instance and starting it again is the simplest option. For more information, see Stopping and Starting Your Instances.
When you stop an instance, the data on any instance store volumes is erased. If you have any data to preserve on instance store volumes, make sure to back it up to persistent storage.
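For example, a minimal boto3 sketch (the instance ID is a placeholder) that stops the instance, waits for it to stop, and then starts it again:

import boto3

INSTANCE_ID = "i-1234567890abcdef0"  # placeholder: your instance ID

ec2 = boto3.client("ec2")

# Stop the instance and wait until it is fully stopped.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# Start it again; the instance comes back with a fresh accelerator attachment.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])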
Troubleshooting Model Performance
Elastic Inference accelerates operations defined by frameworks like TensorFlow and MXNet. While Elastic Inference accelerates most neural network, math, array manipulation, and control flow operators, there are many operators that it does not accelerate. These include training-related operators, input/output operators, and operators in contrib.
When a model contains operators that Elastic Inference does not accelerate, the framework runs them on the instance. The frequency and location of these operators within a model graph can have an impact on the model's inference performance with Elastic Inference accelerators. If your model is known to benefit from GPU acceleration and does not perform well on Elastic Inference, contact AWS Support or amazon-ei-feedback@amazon.com.
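To see which operator types your model contains, one option is to count the node types in a frozen TensorFlow graph. The following minimal sketch assumes a serialized GraphDef at frozen_graph.pb (a hypothetical path); it lists operator types but does not indicate which ones Elastic Inference accelerates:

from collections import Counter

import tensorflow as tf

GRAPH_PATH = "frozen_graph.pb"  # hypothetical path to a frozen model

# Parse the serialized GraphDef.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile(GRAPH_PATH, "rb") as f:
    graph_def.ParseFromString(f.read())

# Count operator types; training, I/O, and contrib ops stand out here.
op_counts = Counter(node.op for node in graph_def.node)
for op, count in op_counts.most_common():
    print(op, count)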
Submitting Feedback
Contact AWS Support or send feedback to: amazon-ei-feedback@amazon.com.