Failure management
"Everything fails, all the time"
Connected mobility platforms have evolved faster in recent years than at any time in the past few decades. Concern about vehicle safety has grown just as quickly, covering the technology in the vehicle, the data it generates, and the backend that supports the platform. As complexity grows, system failures become highly probable. While failures are difficult to predict, reliability requires that systems detect failures and react to them to avoid an impact on availability. Building and deploying robust systems should be the target state of every automotive company.
Back up data
| CMREL_12: How do your strategy and mechanisms manage failures to prevent impact on your workload? |
|---|
[CMREL_BP12.1] Identify and back up all data that needs to be backed up, or reproduce the data from sources
[CMREL_BP12.2] Perform data backup automatically
Connected mobility platforms may have to integrate with many data sources, but not all data is equally important. It's essential to classify data entities and determine their criticality. This helps you make the right strategic decision for data recovery based on the recovery point objective (RPO): either back up the data or reproduce it from its sources. In connected mobility use cases such as emergency calls or breakdown calls, where compliance is critical, this balance has to be struck carefully. Classifying data also helps determine the retention periods for the various data entities, which is one of the primary drivers for cost optimization.
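The classification step can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the `DataEntity` fields, the three-level criticality scale, and the example entities with their RPO and retention values are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataEntity:
    """A classified data entity (all fields are hypothetical)."""
    name: str
    criticality: str      # "high", "medium", or "low" (assumed scale)
    rpo: timedelta        # maximum tolerable data loss
    reproducible: bool    # True if the data can be rebuilt from its sources
    retention_days: int


def recovery_plan(entity: DataEntity) -> dict:
    """Choose between backing up and reproducing a data entity."""
    # Reproducing is only an option for data that can be rebuilt and
    # that is not compliance critical.
    if entity.reproducible and entity.criticality != "high":
        return {"strategy": "reproduce",
                "retention_days": entity.retention_days}
    # Otherwise back up at least as often as the RPO demands.
    return {"strategy": "backup",
            "max_backup_interval": entity.rpo,
            "retention_days": entity.retention_days}


# Example entities with assumed values.
ecall_record = DataEntity("emergency-call-record", "high",
                          timedelta(minutes=5), False, 3650)
trip_summary = DataEntity("aggregated-trip-summary", "low",
                          timedelta(hours=24), True, 30)

for entity in (ecall_record, trip_summary):
    print(entity.name, recovery_plan(entity)["strategy"])
```

Keeping the retention period on the entity makes the cost driver visible at the same place where the recovery decision is taken.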
Prescriptive guidance:
Backup capabilities are available with all AWS data sources. Some services
provide additional features such as point-in-time recovery (PITR), continuous replication,
and cross-Region copy. With the growing number of services, managing backup configurations
can add overhead. Tools such as AWS Backup and AWS Elastic Disaster Recovery can help reduce this overhead.
Automotive companies generally run some variation of a hybrid infrastructure setup, with systems running in the cloud and in on-premises data centers. AWS services such as AWS Storage Gateway or AWS DataSync can help in these deployment modes.
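As an illustration of automated backups, the sketch below builds a backup plan document in the shape accepted by the AWS Backup `create_backup_plan` API. The plan name, vault names, account ID, schedule, and retention period are placeholder assumptions. The block only constructs the document; the call that would submit it is shown as a comment because it needs AWS credentials.

```python
from typing import Optional


def build_backup_plan(name: str, vault_name: str, schedule: str,
                      retention_days: int,
                      copy_vault_arn: Optional[str] = None) -> dict:
    """Build a BackupPlan document with a single scheduled rule."""
    rule = {
        "RuleName": f"{name}-rule",
        "TargetBackupVaultName": vault_name,
        "ScheduleExpression": schedule,
        "Lifecycle": {"DeleteAfterDays": retention_days},
    }
    if copy_vault_arn:
        # Optional copy to a vault in another Region.
        rule["CopyActions"] = [{
            "DestinationBackupVaultArn": copy_vault_arn,
            "Lifecycle": {"DeleteAfterDays": retention_days},
        }]
    return {"BackupPlanName": name, "Rules": [rule]}


# All values below are hypothetical placeholders.
plan = build_backup_plan(
    name="vehicle-data-daily",
    vault_name="vehicle-data-vault",
    schedule="cron(0 3 ? * * *)",  # every day at 03:00 UTC
    retention_days=35,
    copy_vault_arn=("arn:aws:backup:eu-west-1:111122223333:"
                    "backup-vault:vehicle-data-vault-dr"),
)

# With credentials configured, the plan could be submitted like this:
#   import boto3
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Resources still have to be assigned to the plan separately, for example by tag, before any backup runs.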
[CMREL_BP12.3] Secure and encrypt backups
"Encrypt Everything."
Safeguard the backups just as you secure the vehicle data. Securing access to the data with the right level of permissions prevents tampering. Encrypt the backups so that the data cannot be read in case of an accidental leak.
Prescriptive guidance:
Use AWS Identity and Access Management (IAM) to create roles and policies that grant least-privilege access. Use AWS managed policies where applicable. As more connected vehicles are rolled out, the volume of vehicle data hosted in vehicle data lakes keeps growing exponentially. Because Amazon S3 is the most popular data store for vehicle data lakes, following best practices for handling this data is essential.
Ensure that backups are not accidentally deleted, as this could result in violations of vehicle regulatory compliance. Legal hold is a feature of AWS Backup that prevents accidental deletion. Multiple departments or units within an organization can each place their own hold on a backup.
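A least-privilege policy for backup data stored in Amazon S3 might look like the following sketch. The bucket name and prefix are hypothetical, and the policy covers only the read-only case: it allows listing and reading under one prefix and explicitly denies deletion. The small helper is an assumption of this example, not an AWS tool, and only checks that no `Allow` statement uses a wildcard action.

```python
import json


def backup_reader_policy(bucket: str, prefix: str) -> dict:
    """Read-only IAM policy for one backup prefix in an S3 bucket."""
    objects_arn = f"arn:aws:s3:::{bucket}/{prefix}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBackupPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
            {
                "Sid": "ReadBackups",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": objects_arn,
            },
            {
                # An explicit Deny overrides any Allow granted elsewhere.
                "Sid": "DenyBackupDeletion",
                "Effect": "Deny",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": objects_arn,
            },
        ],
    }


def has_wildcard_allow(policy: dict) -> bool:
    """Return True if any Allow statement contains a wildcard action."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if any("*" in action for action in actions):
            return True
    return False


# Hypothetical bucket and prefix names.
policy = backup_reader_policy("vehicle-data-lake-backups", "telemetry")
print(json.dumps(policy, indent=2))
```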
[CMREL_BP12.4] Perform periodic recovery of the data to verify backup integrity and processes
"Hope for the best and be prepared for the worst" - Maya
Angelou
With backups configured, equal effort is needed to validate, through recovery tests, whether the backups meet the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Repeat this exercise periodically using well-defined mechanisms to ensure that data is recovered within the RTO and with no more data loss than the RPO allows.
Prescriptive guidance:
Testing restores builds confidence in the system and trains teams to handle disaster situations better. Various features in AWS services aid in testing restores and assessing RTO and RPO. Amazon RDS and Amazon DynamoDB support point-in-time recovery (PITR). AWS Elastic Disaster Recovery offers continual point-in-time recovery snapshots of Amazon EBS volumes.
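The outcome of a restore drill can be compared against the objectives with a small calculation. The sketch below assumes that three timestamps are recorded for each drill; the `RestoreDrill` type, the function name, and the sample times and objectives are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    """Timestamps recorded during one restore test (hypothetical type)."""
    failure_time: datetime          # simulated moment of failure
    recovery_point_time: datetime   # point in time the restored data reflects
    restore_completed: datetime     # moment the restored system was usable


def evaluate_drill(drill: RestoreDrill, rto: timedelta,
                   rpo: timedelta) -> dict:
    """Compare the achieved recovery time and data loss with RTO and RPO."""
    achieved_rto = drill.restore_completed - drill.failure_time
    achieved_rpo = drill.failure_time - drill.recovery_point_time
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


# Sample drill with assumed values: 45 minutes to restore,
# 10 minutes of data lost.
drill = RestoreDrill(
    failure_time=datetime(2024, 1, 15, 10, 0),
    recovery_point_time=datetime(2024, 1, 15, 9, 50),
    restore_completed=datetime(2024, 1, 15, 10, 45),
)
result = evaluate_drill(drill, rto=timedelta(hours=1),
                        rpo=timedelta(minutes=15))
print(result)
```

Recording these results for every periodic drill shows whether recovery times drift toward the objectives as data volumes grow.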
Use fault isolation
The architecture should ensure that there isn't any single point of failure. Systems should be resilient enough to handle the partial or complete failure of the infrastructure stack. Fault-isolated boundaries limit the effect of a failure within a workload to a limited number of components. Components outside of the boundary are unaffected by the failure. Using multiple fault-isolated boundaries, you can limit the impact on your workload.
| CMREL_13: How do you ensure that you don't have a single point of failure? |
|---|
[CMREL_BP13.1] Deploy the workload to multiple locations
To ensure high availability and prevent a single point of failure, deploy services across as many diverse locations as required. The system should rely on redundant components that are decoupled from each other.
Prescriptive guidance:
A typical connected mobility setup consists of several distributed systems, which creates ideal conditions for spreading out and isolating components. Multi-AZ solutions are a must for most connected mobility use cases. A multi-Region deployment, though ideal, may be cost prohibitive for a wider scope, so it should be applied selectively where required. A multi-Region approach is common for disaster recovery strategies to meet recovery objectives when one-off large-scale events occur.
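One way to reason about redundancy in a Multi-AZ deployment is to size each Availability Zone so that the workload still has enough capacity if one zone is lost. The sizing rule below is an assumption added for illustration and is not taken from the guidance above; the function names and instance counts are hypothetical.

```python
import math


def instances_per_az(required_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing any one AZ still
    leaves at least `required_instances` running."""
    if az_count < 2:
        raise ValueError("at least two Availability Zones are needed")
    # The remaining (az_count - 1) zones must carry the full load.
    return math.ceil(required_instances / (az_count - 1))


def capacity_after_az_loss(per_az: int, az_count: int) -> int:
    """Instances still running after one AZ becomes unavailable."""
    return per_az * (az_count - 1)


# Assumed example: the workload needs 6 instances at peak, across 3 AZs.
per_az = instances_per_az(required_instances=6, az_count=3)
remaining = capacity_after_az_loss(per_az, az_count=3)
print(per_az, remaining)
```

Adding a zone lowers the spare capacity each zone has to carry, which is one input to the cost trade-off between Multi-AZ and multi-Region designs.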
AWS Local Zones
In certain cases, vehicles can communicate with edge locations rather than the cloud Regions. This deployment pattern is most relevant when low latency is required. Static content can be delivered to end users from Amazon CloudFront.
Evaluate AWS Outposts for your workload. If your workload requires low latency to your on-premises data center or has local data processing requirements, as is the case with plant systems, run AWS infrastructure and services on premises using AWS Outposts.