Amazon EVS host maintenance

Because Amazon EVS is a self-managed service, you are responsible for maintenance of the VMware Cloud Foundation (VCF) software that runs on the host, monitoring host health, and remediating host issues, including host replacement in the event of host failure. For more information about managing ESXi hosts in VMware Cloud Foundation (VCF), see Host Management in the VMware Cloud Foundation documentation.

Checking health of the underlying EC2 instance

Amazon EC2 performs automated checks on every running EC2 instance to identify hardware and software issues. You can view the results of these status checks in the EC2 console or with the AWS CLI to identify specific and detectable problems. For more information, see View status checks for Amazon EC2 instances in the Amazon EC2 User Guide and describe-instance-status in the AWS CLI Command Reference.
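
For example, you might run a command similar to the following to view the system and instance status checks for the EC2 instance that backs one of your hosts. The instance ID shown is a placeholder.

    # Placeholder instance ID; substitute the ID of the EC2 instance that backs your host.
    aws ec2 describe-instance-status \
        --instance-ids i-0123456789abcdef0 \
        --query "InstanceStatuses[].{InstanceId:InstanceId,System:SystemStatus.Status,Instance:InstanceStatus.Status}"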

You can create a CloudWatch alarm to warn you if status checks fail on a specific instance. For more information, see Create CloudWatch alarms for Amazon EC2 instances that fail status checks in the Amazon EC2 User Guide.
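
As a sketch, the following AWS CLI command creates an alarm that triggers when the StatusCheckFailed metric is nonzero for two consecutive five-minute periods. The instance ID and the SNS topic ARN used for notifications are placeholders.

    # Placeholder instance ID and SNS topic ARN; substitute your own values.
    aws cloudwatch put-metric-alarm \
        --alarm-name "evs-host-status-check-failed" \
        --namespace AWS/EC2 \
        --metric-name StatusCheckFailed \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --statistic Maximum \
        --period 300 \
        --evaluation-periods 2 \
        --threshold 1 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --alarm-actions arn:aws:sns:us-east-1:111122223333:evs-alerts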

About AWS scheduled maintenance for EC2 instances

AWS performs scheduled maintenance on the underlying EC2 instances to ensure reliability, availability, and performance. EC2 bare metal instances are subject to the same types of scheduled events as other EC2 instances. AWS can schedule events to reboot, stop, and retire your instances due to underlying hardware issues or scheduled maintenance. These events do not occur frequently. For more information, see Types of scheduled events in the Amazon EC2 User Guide.
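
You can also list any scheduled events for an instance with the AWS CLI, for example (placeholder instance ID):

    aws ec2 describe-instance-status \
        --instance-ids i-0123456789abcdef0 \
        --query "InstanceStatuses[].Events[]"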

Note

You should place your hosts in maintenance mode in the vSphere Client before any scheduled reboot event.

If one of your instances will be affected by a scheduled event, AWS notifies you in advance by email, using the email address that’s associated with your AWS account. AWS also sends an AWS Health event, which you can monitor and manage by using Amazon EventBridge. For more information, see Monitoring events in AWS Health with Amazon EventBridge and Scheduled events for Amazon EC2 instances in the Amazon EC2 User Guide.
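
As a minimal sketch, the following commands create an EventBridge rule that matches scheduled-change AWS Health events for EC2 and sends them to an SNS topic. The rule name and topic ARN are placeholders.

    aws events put-rule \
        --name "evs-ec2-scheduled-change" \
        --event-pattern '{"source":["aws.health"],"detail-type":["AWS Health Event"],"detail":{"service":["EC2"],"eventTypeCategory":["scheduledChange"]}}'

    aws events put-targets \
        --rule "evs-ec2-scheduled-change" \
        --targets "Id"="sns-target","Arn"="arn:aws:sns:us-east-1:111122223333:evs-alerts"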

At any time, you can reschedule the event so that it occurs at a specific date and time that suits you. The event can be rescheduled up to the event deadline date. For more information, see Reschedule a scheduled event for an EC2 instance in the Amazon EC2 User Guide.
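
For example, you might reschedule an event with a command similar to the following. The instance ID, event ID, and timestamp are placeholders; you can find the event ID in the output of describe-instance-status.

    # Placeholder instance ID, event ID, and start time; substitute your own values.
    aws ec2 modify-instance-event-start-time \
        --instance-id i-0123456789abcdef0 \
        --instance-event-id instance-event-0123456789abcdef0 \
        --not-before "2025-10-01T09:00:00Z"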

Using EC2 On-Demand Capacity Reservations

You can use EC2 On-Demand Capacity Reservations to ensure that your cluster has sufficient capacity during maintenance periods. You can reserve capacity in a specific Availability Zone for any duration. For more information, see Reserve compute capacity with EC2 On-Demand Capacity Reservations in the Amazon EC2 User Guide.

For steps to create a Capacity Reservation, see Create a Capacity Reservation in the Amazon EC2 User Guide.
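
As a sketch, the following AWS CLI command reserves capacity for one i4i.metal instance in a single Availability Zone. The Availability Zone is a placeholder.

    # Placeholder Availability Zone; substitute the zone that contains your environment.
    aws ec2 create-capacity-reservation \
        --instance-type i4i.metal \
        --instance-platform Linux/UNIX \
        --availability-zone us-east-1a \
        --instance-count 1 \
        --end-date-type unlimited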

Note

If you use EC2 On-Demand Capacity Reservations or EC2 Dedicated Hosts, we recommend that you retain a spare host for mission-critical workloads. While Capacity Reservations ensure you have access to a specific amount of EC2 instance capacity in a given Availability Zone, having a spare host provides an additional layer of redundancy that is crucial for mission-critical workloads. For Dedicated Hosts, having a spare host ensures that you maintain the environment for mission-critical workloads, even if a primary host requires maintenance or experiences an issue.
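
If you use Dedicated Hosts, you could, for example, allocate a spare host ahead of time with a command similar to the following. The Availability Zone is a placeholder.

    aws ec2 allocate-hosts \
        --instance-type i4i.metal \
        --availability-zone us-east-1a \
        --quantity 1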

Preparing for AWS scheduled system-maintenance and instance-retirement events

AWS schedules two types of system-maintenance events: network maintenance and power maintenance.

  • During network maintenance, scheduled instances lose network connectivity for a brief period of time. Normal network connectivity to your instance is restored after maintenance is complete.

  • During power maintenance, scheduled instances are taken offline for a brief period, and then rebooted. When a reboot is performed on EC2 bare metal instances, instance store volume data is not preserved.

AWS schedules EC2 instance-retirement events when degradation of the underlying hardware hosting your EC2 instances is detected.

To remediate system-maintenance and instance-retirement events, replace the failed host with a new host using the Amazon EVS console or AWS CLI and SDDC Manager before the maintenance event occurs. If you wait for the maintenance event to occur and an EC2 instance reboot is required, you will lose your vSAN data that is stored on the instance store volume. For detailed steps, see Replace an Amazon EVS host.

Important

Do not use the EC2 console to manage the state of your Amazon EVS hosts. Do not attempt to start, stop, or terminate the EC2 instances that Amazon EVS deploys; doing so results in vSAN data loss.

Replace an Amazon EVS host

Follow this procedure to replace an Amazon EVS host.

Warning

Amazon EVS hosts use a custom vendor add-on to provide important host functionality. When you add a host to your environment, it has the latest available version of the Amazon EVS custom add-on. If your environment uses hosts with an older add-on version, adding the new host to your vSphere cluster will cause cluster image remediation to fail. For steps to troubleshoot this issue, see Troubleshoot add host failure due to incompatible cluster image.

Warning

If you have updated your ESXi version after deployment, SDDC Manager may fail VCF host validation during the host commissioning step. For steps to troubleshoot this issue, see SDDC Manager fails VCF host validation during host commissioning.

Note

Ensure that your Amazon EVS host count per EVS environment quota is set high enough for host creation to succeed. Host creation fails if this quota value is less than the number of hosts that you are attempting to provision within a single Amazon EVS environment. You may need to request a quota increase for maintenance operations that require host replacement. For more information, see Amazon EVS service quotas.
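
To check the current quota value, you could list the Amazon EVS quotas with the Service Quotas AWS CLI. The "evs" service code shown here is an assumption; confirm it before relying on the output.

    # The "evs" service code is an assumption; run `aws service-quotas list-services` to confirm it.
    aws service-quotas list-service-quotas --service-code evs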

Amazon EVS console and SDDC Manager UI
  1. Go to the Amazon EVS console.

  2. In the navigation pane, choose Environment.

  3. Select the environment that contains the host to be replaced.

  4. Select the Hosts tab.

  5. Choose Create host.

  6. Specify host details and choose Create host.

  7. To verify completion, check that the Host state has changed to Created.

  8. Retrieve the credentials for the ESXi root password from AWS Secrets Manager (an example AWS CLI command is shown after this procedure). For more information about retrieving secrets, see Get secrets from AWS Secrets Manager in the AWS Secrets Manager User Guide.

  9. Go to SDDC Manager.

  10. Commission the new host in SDDC Manager, using the ESXi root credentials that you retrieved in a previous step. For more information, see Commission Hosts in the VMware Cloud Foundation documentation.

  11. Add the new host to the cluster. For more information, see How to Add an ESXi Host to Your vSphere Cluster by Using the Quickstart Workflow in the vSphere documentation.

  12. Decommission the old host that you want to remove from SDDC Manager. For more information, see Decommission Hosts in the VMware Cloud Foundation documentation.

  13. Return to the Amazon EVS console.

  14. Under the Hosts tab, select the failed host and choose Delete > Delete host.
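
The following is a minimal sketch of retrieving the ESXi root password with the AWS CLI. The secret ID is a placeholder for the secret that Amazon EVS created for the host.

    # Placeholder secret ID; substitute the ARN or name of the host's ESXi root password secret.
    aws secretsmanager get-secret-value \
        --secret-id "your-esxi-root-password-secret" \
        --query SecretString \
        --output text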

AWS CLI and SDDC Manager UI
  1. Open a new terminal session.

  2. Create a new host. See example command below for reference.

    aws evs create-environment-host \
        --environment-id "env-abcde12345" \
        --host '{
            "hostName": "esxi-host-05",
            "keyName": "your-ec2-keypair-name",
            "instanceType": "i4i.metal"
        }'
  3. Retrieve the credentials for the ESXi root password from AWS Secrets Manager. For more information about retrieving secrets, see Get secrets from AWS Secrets Manager in the AWS Secrets Manager User Guide.

  4. Go to SDDC Manager.

  5. Commission the new host in SDDC Manager, using the ESXi root credentials that you retrieved in a previous step. For more information, see Commission Hosts in the VMware Cloud Foundation documentation.

  6. Add the new host to the cluster that contains the impaired host.

  7. Decommission the impaired host in SDDC Manager. For more information, see Decommission Hosts in the VMware Cloud Foundation documentation.

  8. Return to the terminal.

  9. Delete the failed host. See example command below for reference.

    aws evs delete-environment-host --environment-id "env-abcde12345" --host-name "esxi-host-05"
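
To confirm that the host was removed, you can list the remaining hosts in the environment, for example with the following command. This assumes the list-environment-hosts command is available in your installed version of the AWS CLI.

    aws evs list-environment-hosts --environment-id "env-abcde12345"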

Troubleshooting

For troubleshooting guidance, see Troubleshooting. If you continue to experience issues after reviewing the troubleshooting guidance, contact AWS Support for further assistance.