Failure management - Connected Mobility Lens

Failure management

"Everything fails, all the time"

Werner Vogels, CTO - Amazon.com

Connected mobility platforms have evolved faster in recent years than at any time in the past few decades. Concern about vehicle safety has grown just as quickly, covering the technology in the vehicle, the data it generates, and the backend that supports the platform. As complexity grows, system failures become highly probable. Failures are difficult to predict, so reliability requires that systems detect failures and react to them before they affect availability. Building and deploying robust systems should be the target state of every automotive company.

Back up data

CMREL_12: How do your strategy and mechanisms manage failures to prevent impact on your workload?

[CMREL_BP12.1] Identify and back up all data that needs to be backed up, or reproduce the data from sources

[CMREL_BP12.2] Perform data backup automatically

Connected mobility platforms integrate with many data sources, but not all data is equally important. Classify the data entities and determine the criticality of each. This supports the right strategic decision for data recovery based on the recovery point objective (RPO): either back up the data or reproduce it from its sources. In connected mobility use cases where compliance is critical, such as emergency calls or breakdown calls, the balance between the two approaches has to satisfy the compliance requirements. Classifying data also helps determine the retention period for each data entity, which is one of the primary drivers of cost optimization.
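The classification described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lens: the tier names, the retention periods, the example entities, and the rule that high-criticality data is always backed up are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical retention periods per tier, in days (not taken from the lens).
RETENTION_DAYS = {
    Criticality.HIGH: 3650,
    Criticality.MEDIUM: 365,
    Criticality.LOW: 30,
}


@dataclass(frozen=True)
class DataEntity:
    name: str
    criticality: Criticality
    rpo_seconds: int    # maximum tolerable window of data loss
    reproducible: bool  # can the data be rebuilt from its sources?


def recovery_strategy(entity: DataEntity) -> str:
    """Return "backup" or "reproduce" for one data entity."""
    # Assumption: compliance-critical data is always backed up,
    # even if it could in principle be reproduced.
    if entity.criticality is Criticality.HIGH:
        return "backup"
    return "reproduce" if entity.reproducible else "backup"


def max_backup_interval_seconds(entity: DataEntity) -> int:
    """Periodic backups can only meet the RPO if they run at least this often."""
    return entity.rpo_seconds


def retention_days(entity: DataEntity) -> int:
    """Look up the retention period for the entity's criticality tier."""
    return RETENTION_DAYS[entity.criticality]


# Hypothetical data entities for illustration.
ecall = DataEntity("ecall_records", Criticality.HIGH, rpo_seconds=60, reproducible=False)
trip_stats = DataEntity("trip_aggregates", Criticality.LOW, rpo_seconds=86400, reproducible=True)

for entity in (ecall, trip_stats):
    print(entity.name, recovery_strategy(entity), retention_days(entity))
```

Keeping the classification in one place like this makes the backup-versus-reproduce decision and the retention period reviewable per data entity, rather than implied by individual service settings.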

Prescriptive guidance:

Backup capabilities are available with all AWS data sources. Some services provide additional features such as point-in-time recovery (PITR), continuous replication, and cross-Region copy. As the number of services grows, managing backup configurations adds overhead. Tools such as AWS Backup and AWS Elastic Disaster Recovery can improve productivity and availability and achieve an RPO of seconds.
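As an illustration of automating backups centrally, the sketch below builds a backup plan document in the shape that the AWS Backup `CreateBackupPlan` API accepts. The plan name, vault name, schedule, and retention period are hypothetical values. The code only constructs and prints the document; the comment at the end shows how it could be submitted with boto3, assuming boto3 is installed and credentials are configured.

```python
import json


def build_backup_plan(plan_name, vault_name, schedule_cron, retention_days):
    """Build a plan document in the shape used by AWS Backup CreateBackupPlan."""
    rule = {
        "RuleName": f"{plan_name}-rule",
        "TargetBackupVaultName": vault_name,
        # AWS cron expressions use six fields: min hour day month weekday year.
        "ScheduleExpression": schedule_cron,
        # Recovery points are deleted after the retention period.
        "Lifecycle": {"DeleteAfterDays": retention_days},
    }
    return {"BackupPlanName": plan_name, "Rules": [rule]}


# Hypothetical names and values for illustration.
plan = build_backup_plan(
    plan_name="vehicle-data-daily",
    vault_name="vehicle-data-vault",
    schedule_cron="cron(0 2 * * ? *)",  # every day at 02:00 UTC
    retention_days=35,
)
print(json.dumps(plan, indent=2))

# With boto3 available, the document could be submitted like this:
#   import boto3
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Resources are attached to a plan separately through a backup selection, which is not shown here.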

Automotive companies generally run some variation of a hybrid infrastructure, with systems in the cloud and in on-premises data centers. AWS services such as AWS Storage Gateway or AWS DataSync can help in these deployment modes.

[CMREL_BP12.3] Secure and encrypt backups

"Encrypt Everything."

Werner Vogels, CTO - Amazon.com

Safeguard the backups just as you secure the vehicle data. Securing access to the data with the right level of permissions prevents tampering. Encrypt the backups so that the data cannot be read if it is accidentally leaked.

Prescriptive guidance:

Use AWS Identity and Access Management (IAM) to create roles and policies that grant least-privilege access. Use AWS managed policies where applicable. As more connected vehicles are rolled out, the volume of vehicle data hosted in vehicle data lakes keeps growing exponentially. Because Amazon S3 is the most popular data store for vehicle data lakes, following best practices for handling this data is essential.
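The sketch below shows what least-privilege access and enforced encryption could look like for backups stored in Amazon S3. It builds two standard policy documents as Python dictionaries: an identity policy that allows read-only access to a single prefix, and a bucket policy statement that denies uploads that do not request SSE-KMS encryption. The bucket name and prefix are hypothetical, and a real policy set would depend on the workload.

```python
import json


def backup_reader_policy(bucket, prefix):
    """Identity policy granting read-only access to one backup prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBackupPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the backup prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadBackupObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def deny_unencrypted_uploads(bucket):
    """Bucket policy statement rejecting uploads that do not request SSE-KMS."""
    return {
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
        },
    }


# Hypothetical bucket and prefix for illustration.
reader = backup_reader_policy("vehicle-data-lake-backups", "telemetry")
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [deny_unencrypted_uploads("vehicle-data-lake-backups")],
}
print(json.dumps(reader, indent=2))
print(json.dumps(bucket_policy, indent=2))
```

The reader policy contains no write or delete actions, so a role that carries only this policy cannot tamper with the backups it can read.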

Ensure that backups are not accidentally deleted, because this could result in violations of vehicle regulatory compliance. Legal hold is a feature of AWS Backup that prevents accidental deletion. Multiple departments or units within an organization can each place their own hold on a backup.

[CMREL_BP12.4] Perform periodic recovery of the data to verify backup integrity and processes

"Hope for the best and be prepared for the worst" - Maya Angelou

With backups configured, equal effort is needed to validate, through recovery tests, that the backups meet the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Repeat this exercise periodically using well-defined mechanisms to ensure that data is recovered within the RTO and that data loss stays within the limit established by the RPO.

Prescriptive guidance:

Testing restores builds confidence in the system and trains teams to handle disaster situations better. Various features in AWS services aid in testing restores and assessing RTO and RPO. Amazon RDS and Amazon DynamoDB support point-in-time recovery (PITR). AWS Elastic Disaster Recovery offers continual point-in-time recovery snapshots of Amazon EBS volumes.
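A recovery test produces three timestamps that are enough to check both objectives. The sketch below, which assumes hypothetical field names and example targets, computes the achieved RTO as the time from the failure to the moment the restored data is usable, and the achieved RPO as the gap between the restored recovery point and the failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreTest:
    failure_at: datetime         # simulated moment of data loss
    recovery_point_at: datetime  # timestamp of the backup or PITR point restored
    restored_at: datetime        # moment the restored data was usable again


def evaluate(test: RestoreTest, rto: timedelta, rpo: timedelta) -> dict:
    """Compare the achieved recovery time and data loss against the targets."""
    achieved_rto = test.restored_at - test.failure_at
    achieved_rpo = test.failure_at - test.recovery_point_at
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


# Hypothetical test run: failure at 10:00, restored from a 09:55 recovery
# point, service usable again at 10:40.
run = RestoreTest(
    failure_at=datetime(2024, 1, 15, 10, 0),
    recovery_point_at=datetime(2024, 1, 15, 9, 55),
    restored_at=datetime(2024, 1, 15, 10, 40),
)
result = evaluate(run, rto=timedelta(hours=1), rpo=timedelta(minutes=15))
print(result)
```

Recording these results for every periodic test gives a history that shows whether recovery times drift as data volumes grow.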

Use fault isolation

The architecture should ensure that there is no single point of failure. Systems should be resilient enough to handle the partial or complete failure of the infrastructure stack. Fault-isolated boundaries limit the effect of a failure within a workload to a limited number of components. Components outside of the boundary are unaffected by the failure. Using multiple fault-isolated boundaries, you can limit the impact on your workload.

CMREL_13: How do you ensure that you don't have a single point of failure?

[CMREL_BP13.1] Deploy the workload to multiple locations

To ensure high availability and prevent a single point of failure, deploy the services across as many diverse locations as required. The system should rely on redundant components that are decoupled from each other.

Prescriptive guidance:

A typical connected mobility setup consists of several distributed systems, which creates ideal conditions for spreading out and isolating components. Multi-AZ solutions are a must for most connected mobility use cases. A multi-Region deployment, though ideal, may be cost prohibitive at wider scope, so apply it selectively where required. A multi-Region approach is common in disaster recovery strategies to meet recovery objectives when one-off large-scale events occur. AWS Local Zones can be used to deploy workloads closer to vehicles for low-latency requirements.
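A simple audit can flag components that do not span enough Availability Zones. The sketch below assumes a hypothetical inventory that maps each component to the zones where its instances run; the component names and zone names are illustrative, and the inventory itself would have to come from your own deployment tooling.

```python
def single_points_of_failure(placements, min_zones=2):
    """Return the components that run in fewer than min_zones distinct zones.

    placements maps a component name to the list of Availability Zones
    in which its instances are deployed.
    """
    return sorted(
        component
        for component, zones in placements.items()
        if len(set(zones)) < min_zones
    )


# Hypothetical inventory for illustration.
placements = {
    "telemetry-ingest": ["eu-central-1a", "eu-central-1b", "eu-central-1c"],
    "ota-scheduler": ["eu-central-1a", "eu-central-1a"],
    "ecall-router": ["eu-central-1a", "eu-central-1b"],
}
print(single_points_of_failure(placements))
```

Two instances in the same zone, as with the hypothetical `ota-scheduler` entry, count as a single location because a zone-level failure would take out both.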

In certain cases, vehicles can communicate with edge locations rather than with cloud Regions. This deployment pattern is most relevant when low latency is required. Static content can be delivered to end users from Amazon CloudFront with millisecond-level latency. Other services that enable edge computing include AWS Global Accelerator, Amazon API Gateway, and Lambda@Edge. Amazon CloudFront can scale automatically to deliver over-the-air (OTA) software updates at scale with high transfer rates.

Evaluate AWS Outposts for your workload. If your workload requires low latency to your on-premises data center or has local data processing requirements, as is the case for plant systems, run AWS infrastructure and services on premises using AWS Outposts.