Monitoring Elastic Inference Accelerators - Amazon Elastic Inference

Monitoring Elastic Inference Accelerators

The following tools are provided to monitor and check the status of your Elastic Inference accelerators.

EI_VISIBLE_DEVICES

EI_VISIBLE_DEVICES is an environment variable that you use to control which Elastic Inference accelerator devices are visible to the deep learning frameworks. EI_VISIBLE_DEVICES can also be used with EI Tool. The variable is a comma-separated list of device ordinal numbers or device IDs. Use EI Tool to see all attached Elastic Inference accelerator devices.

EI_VISIBLE_DEVICES is used as follows. In this example, only the device with the ordinal number value 3 will be used when starting the server.

EI_VISIBLE_DEVICES=3 amazonei_tensorflow_model_server --port=8502 --rest_api_port=8503 --model_name=ssdresnet --model_base_path=/home/ec2-user/models/ssdresnet

If EI_VISIBLE_DEVICES is not set, then all attached devices are visible. If EI_VISIBLE_DEVICES is set to an empty string, then none of the devices are visible.
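The three cases above (unset, empty string, and a list) can be sketched in Python as a small helper that interprets the variable according to the rules in this section. The function name visible_devices is hypothetical and is not part of any Elastic Inference library.

```python
import os
from typing import List, Mapping, Optional


def visible_devices(env: Mapping[str, str] = os.environ) -> Optional[List[str]]:
    """Interpret EI_VISIBLE_DEVICES as described in this section.

    Returns None when the variable is unset (all attached devices visible),
    an empty list when it is set to an empty string (no devices visible),
    and otherwise the listed ordinals or device IDs, in the order given.
    """
    value = env.get("EI_VISIBLE_DEVICES")
    if value is None:
        return None
    if value == "":
        return []
    # Entries may be ordinal numbers or device IDs, so keep them as strings.
    return value.split(",")
```

Returning None for "unset" and an empty list for "empty string" keeps the two cases distinct, which matters because they have opposite meanings.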

Using EI_VISIBLE_DEVICES with Multiple Devices

To pass multiple devices with EI_VISIBLE_DEVICES, use a comma-separated list. This list can contain device ordinal numbers or device IDs. The following command shows the use of multiple devices with EI Tool:

EI_VISIBLE_DEVICES=1,3 /opt/amazon/ei/ei_tools/bin/ei describe-accelerators -j

When you use multiple Elastic Inference accelerators with EI_VISIBLE_DEVICES, the devices visible to the framework are renumbered within the process, starting from zero. This renumbering applies only inside the process. It does not affect the ordinal numbers of the devices outside the process, and it does not affect devices that are not included in EI_VISIBLE_DEVICES.
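As an illustration of the renumbering, the following Python sketch maps each listed device to the ordinal it would have inside the process. It assumes that the new ordinals follow the order of the list, which this section does not state explicitly. The helper name in_process_ordinals is hypothetical.

```python
from typing import Dict, List


def in_process_ordinals(visible: List[str]) -> Dict[str, int]:
    """Map each listed device to its ordinal inside the process.

    Assumption: new ordinals follow the order of the list, starting at zero.
    """
    return {device: new for new, device in enumerate(visible)}


# With EI_VISIBLE_DEVICES=1,3 the two visible devices become 0 and 1.
mapping = in_process_ordinals("1,3".split(","))
print(mapping)  # {'1': 0, '3': 1}
```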

Exporting EI_VISIBLE_DEVICES

To set the EI_VISIBLE_DEVICES variable for use with all child processes of the current shell process, use the following command:

export EI_VISIBLE_DEVICES=1,3

All subsequently launched processes use this value. You must override or update the EI_VISIBLE_DEVICES value to change this behavior.
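The same inheritance applies when the parent process is a Python program rather than a shell. The sketch below sets the variable in the current process, confirms that a child process sees it, and then overrides it for one child only. The child here is just another Python interpreter that prints the variable; it stands in for a real framework process.

```python
import os
import subprocess
import sys

# Equivalent of `export EI_VISIBLE_DEVICES=1,3` for the current Python process.
os.environ["EI_VISIBLE_DEVICES"] = "1,3"

# Command for a stand-in child process that prints the value it received.
PRINT_VAR = [sys.executable, "-c", "import os; print(os.environ['EI_VISIBLE_DEVICES'])"]

# A child process started afterwards inherits the value.
child = subprocess.run(PRINT_VAR, capture_output=True, text=True, check=True)
inherited = child.stdout.strip()

# Override the value for one child only; the parent's value is unchanged.
override_env = dict(os.environ, EI_VISIBLE_DEVICES="0")
other = subprocess.run(
    PRINT_VAR, capture_output=True, text=True, check=True, env=override_env
)
overridden = other.stdout.strip()
```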

EI Tool

The EI Tool is a binary that comes with the latest version, v26.0, of the Conda DLAMI. You can also download it from the Amazon S3 Bucket. It can be used to monitor the status of multiple Elastic Inference accelerators. 

By default, running EI Tool as follows prints basic information about the Elastic Inference accelerators attached to the Amazon Elastic Compute Cloud instance.

ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei describe-accelerators
EI Client Version: 1.5.0
Time: Fri Nov  1 03:09:15 2019
Attached accelerators: 2
Device 0:
    Type: eia1.xlarge
    Id: eia-679e4c622d584803aed5b42ab6a97706
    Status: healthy
Device 1:
    Type: eia1.xlarge
    Id: eia-6c414c6ee37a4d93874afc00825c2f28
    Status: healthy

The following sections describe options for using EI Tool from the command line.

Getting Help

There are two ways to get help when using EI Tool:

  • EI Tool prints usage information when you run it without a command.

    ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei
    Usage: ei describe-accelerators [options]
    Print description of attached accelerators.
    Options:
      -j, --json    Print description of attached accelerators in JSON format.
      -h, --help    Print this help instructions and exit.
    ubuntu@ip-10-0-0-98:~/ei_tools/bin$ echo $?
    1

  • You can use the -h and --help switches to output the same information.

    ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei describe-accelerators -h
    Usage: ei describe-accelerators [options]
    Print description of attached accelerators.
    Options:
      -j, --json    Print description of attached accelerators in JSON format.
      -h, --help    Print this help instructions and exit.
    ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei describe-accelerators --help
    Usage: ei describe-accelerators [options]
    Print description of attached accelerators.
    Options:
      -j, --json    Print description of attached accelerators in JSON format.
      -h, --help    Print this help instructions and exit.

JSON

The EI Tool supports JSON output when describing attached Elastic Inference accelerators. The -j/--json switches can be used to print the accelerator state description as a JSON object.

ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei describe-accelerators -j
{
  "ei_client_version": "1.5.0",
  "time": "Fri Nov  1 03:09:38 2019",
  "attached_accelerators": 2,
  "devices": [
    {
      "ordinal": 0,
      "type": "eia1.xlarge",
      "id": "eia-679e4c622d584803aed5b42ab6a97706",
      "status": "healthy"
    },
    {
      "ordinal": 1,
      "type": "eia1.xlarge",
      "id": "eia-6c414c6ee37a4d93874afc00825c2f28",
      "status": "healthy"
    }
  ]
}
ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei describe-accelerators --json
{
  "ei_client_version": "1.5.0",
  "time": "Fri Nov  1 03:10:15 2019",
  "attached_accelerators": 2,
  "devices": [
    {
      "ordinal": 0,
      "type": "eia1.xlarge",
      "id": "eia-679e4c622d584803aed5b42ab6a97706",
      "status": "healthy"
    },
    {
      "ordinal": 1,
      "type": "eia1.xlarge",
      "id": "eia-6c414c6ee37a4d93874afc00825c2f28",
      "status": "healthy"
    }
  ]
}
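The JSON form is convenient for scripts. As a minimal sketch, the following Python reads a report with the fields shown in the output above and builds a mapping from device ordinal to status. The sample text is copied from that output; the function name device_statuses is hypothetical.

```python
import json
from typing import Dict

# Sample report copied from the `describe-accelerators -j` output above.
SAMPLE = """
{
  "ei_client_version": "1.5.0",
  "time": "Fri Nov  1 03:09:38 2019",
  "attached_accelerators": 2,
  "devices": [
    {"ordinal": 0, "type": "eia1.xlarge",
     "id": "eia-679e4c622d584803aed5b42ab6a97706", "status": "healthy"},
    {"ordinal": 1, "type": "eia1.xlarge",
     "id": "eia-6c414c6ee37a4d93874afc00825c2f28", "status": "healthy"}
  ]
}
"""


def device_statuses(report_json: str) -> Dict[int, str]:
    """Return a mapping of device ordinal to status from the JSON report."""
    report = json.loads(report_json)
    return {dev["ordinal"]: dev["status"] for dev in report["devices"]}


statuses = device_statuses(SAMPLE)
print(statuses)  # {0: 'healthy', 1: 'healthy'}
```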

Errors

Errors encountered when running EI Tool are output to stderr. The following illustrates an error encountered due to blocked outgoing traffic.

ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei describe-accelerators
[Fri Nov  1 03:20:29 2019, 046923us] [Connect] Failed. Error message - Last Error:
    EI Error Code: [1, 4, 1]
    EI Error Description: Internal error
    EI Request ID: MX-EFBD3C87-6E8E-4E99-A855-949CB2A24E7F  --  EI Accelerator ID: eia-679e4c622d584803aed5b42ab6a97706
    EI Client Version: 1.5.0
[Fri Nov  1 03:20:44 2019, 055905us] [Connect] Failed. Error message - Last Error:
    EI Error Code: [1, 4, 1]
    EI Error Description: Internal error
    EI Request ID: MX-BD40C53D-6BBC-49A8-AF6D-27FF542DA38A  --  EI Accelerator ID: eia-6c414c6ee37a4d93874afc00825c2f28
    EI Client Version: 1.5.0
EI Client Version: 1.5.0
Time: Fri Nov  1 03:20:44 2019
Attached accelerators: 2
Device 0:
    Type: eia1.xlarge
    Id: eia-679e4c622d584803aed5b42ab6a97706
    Status: not reachable
Device 1:
    Type: eia1.xlarge
    Id: eia-6c414c6ee37a4d93874afc00825c2f28
    Status: not reachable
ubuntu@ip-10-0-0-98:~/ei_tools/bin$ echo $?
0

A JSON object is still written to stdout when the -j/--json switches are set. Because errors encountered when running EI Tool go to stderr, stdout can still be parsed as a JSON object.

ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./ei describe-accelerators -j
E1101 03:54:54.084712 25091 log_stream.cpp:232] [Connect] Failed. Error message - Last Error:
    EI Error Code: [1, 4, 1]
    EI Error Description: Internal error
    EI Request ID: MX-192D16B1-65CD-43AA-9CA8-0D717D134C0E  --  EI Accelerator ID: eia-679e4c622d584803aed5b42ab6a97706
    EI Client Version: 1.5.0
E1101 03:55:09.096704 25091 log_stream.cpp:232] [Connect] Failed. Error message - Last Error:
    EI Error Code: [1, 4, 1]
    EI Error Description: Internal error
    EI Request ID: MX-A4C4C90E-FC13-4D58-AA4F-54382222E8D7  --  EI Accelerator ID: eia-6c414c6ee37a4d93874afc00825c2f28
    EI Client Version: 1.5.0
{
  "ei_client_version": "1.5.0",
  "time": "Fri Nov  1 03:55:09 2019",
  "attached_accelerators": 2,
  "devices": [
    {
      "ordinal": 0,
      "type": "eia1.xlarge",
      "id": "eia-679e4c622d584803aed5b42ab6a97706",
      "status": "not reachable"
    },
    {
      "ordinal": 1,
      "type": "eia1.xlarge",
      "id": "eia-6c414c6ee37a4d93874afc00825c2f28",
      "status": "not reachable"
    }
  ]
}
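Because the error messages and the JSON report arrive on different streams, a script can capture them separately. The Python sketch below is one way to do that, assuming the tool is installed at the path used in the examples above. The earlier example shows `ei describe-accelerators` exiting with 0 even when devices are not reachable, so the sketch inspects the status field rather than the exit code. The function names are hypothetical.

```python
import json
import subprocess
from typing import List, Tuple

EI_TOOL = "/opt/amazon/ei/ei_tools/bin/ei"  # path used in the examples above


def run_describe(tool: str = EI_TOOL) -> Tuple[str, str]:
    """Run `ei describe-accelerators -j` and return (stdout, stderr)."""
    result = subprocess.run(
        [tool, "describe-accelerators", "-j"],
        capture_output=True,  # keep stdout (JSON) and stderr (errors) apart
        text=True,
    )
    return result.stdout, result.stderr


def unhealthy_ids(stdout_json: str) -> List[str]:
    """List the IDs of devices whose status is anything other than "healthy"."""
    report = json.loads(stdout_json)
    return [d["id"] for d in report["devices"] if d["status"] != "healthy"]
```

A caller would pass the first element returned by run_describe() to unhealthy_ids() and log the second element for troubleshooting.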

Using EI Tool with LD_LIBRARY_PATH

If your local LD_LIBRARY_PATH variable has changed, you may have to modify how you invoke EI Tool. Include the following LD_LIBRARY_PATH value when using EI Tool:

LD_LIBRARY_PATH=/opt/amazon/ei/ei_tools/lib

The following example uses this value with a single Elastic Inference accelerator:

EI_VISIBLE_DEVICES=1 LD_LIBRARY_PATH=/opt/amazon/ei/ei_tools/lib /opt/amazon/ei/ei_tools/bin/ei describe-accelerators -j
{
  "ei_client_version": "1.5.3",
  "time": "Tue Nov 19 16:57:21 2019",
  "attached_accelerators": 1,
  "devices": [
    {
      "ordinal": 0,
      "type": "eia1.xlarge",
      "id": "eia-7f127e2640e642d48a7d4673a57581be",
      "status": "healthy"
    }
  ]
}
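When EI Tool is launched from a Python program, the same two variables can be placed in the child's environment. The following is a sketch only: the helper name ei_tool_env is hypothetical, and it replaces any existing LD_LIBRARY_PATH with the value given in this section instead of appending to it.

```python
import os
from typing import Dict, Mapping, Optional

EI_TOOL_LIB = "/opt/amazon/ei/ei_tools/lib"  # value given in this section


def ei_tool_env(
    base: Mapping[str, str] = os.environ,
    visible: Optional[str] = None,
) -> Dict[str, str]:
    """Copy `base`, set LD_LIBRARY_PATH, and optionally EI_VISIBLE_DEVICES."""
    env = dict(base)
    env["LD_LIBRARY_PATH"] = EI_TOOL_LIB
    if visible is not None:
        # Same comma-separated format as on the command line, for example "1".
        env["EI_VISIBLE_DEVICES"] = visible
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when invoking /opt/amazon/ei/ei_tools/bin/ei.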

Health Check

You can use Health Check to monitor the health of your Elastic Inference accelerators. The exit code of the Health Check command is 0 if all accelerators are healthy and reachable. If they are not, then the exit code is 1.

ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./health_check
EI Client Version: 1.5.0
Device 0: healthy
Device 1: healthy
ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ echo $?
0

The following illustrates the error reported by Health Check when outgoing traffic is blocked.

ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ ./health_check
[Fri Nov  1 07:00:47 2019, 134735us] [Connect] Failed. Error message - Last Error:
    EI Error Code: [1, 4, 1]
    EI Error Description: Internal error
    EI Request ID: MX-A0558121-49D8-48DB-8CCB-9322D78BFCA5  --  EI Accelerator ID: eia-679e4c622d584803aed5b42ab6a97706
    EI Client Version: 1.5.0
Device 0: not reachable
[Fri Nov  1 07:01:02 2019, 143732us] [Connect] Failed. Error message - Last Error:
    EI Error Code: [1, 4, 1]
    EI Error Description: Internal error
    EI Request ID: MX-AC879033-FB46-46EE-B2B6-A76F5E674E0D  --  EI Accelerator ID: eia-6c414c6ee37a4d93874afc00825c2f28
    EI Client Version: 1.5.0
Device 1: not reachable
ubuntu@ip-10-0-0-98:/opt/amazon/ei/ei_tools/bin$ echo $?
1
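Since Health Check reports its result through the exit code, a monitoring script only needs to test that code. The Python sketch below assumes the binary lives at the path shown in the examples above. The function name all_accelerators_healthy and its run parameter are hypothetical; the parameter exists so the function can be exercised without the binary installed.

```python
import subprocess
from typing import Callable

HEALTH_CHECK = "/opt/amazon/ei/ei_tools/bin/health_check"  # path from the examples


def all_accelerators_healthy(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when Health Check exits with 0, and False otherwise."""
    result = run([HEALTH_CHECK], capture_output=True, text=True)
    return result.returncode == 0
```

A scheduled job could call all_accelerators_healthy() and raise an alert when it returns False.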