Change management
At any OEM, the connected mobility landscape is continuously evolving with new use cases and implementation patterns. On the vehicle side, the number of sensors collecting data has been multiplying rapidly. This leads to refactoring and rewriting of some of the processing logic that feeds the vehicle data platform and enables insights to be derived. Thus, there is a greater need for streamlined processes that introduce and manage change in your environment. The runbooks and deployment strategies, however, vary with the use cases, applications, and infrastructure. Changes to the workloads and environment should be anticipated, monitored, accommodated, and carefully executed to preserve the reliability of the connected mobility platform.
Monitor workload resources
CMREL_6: Are you monitoring all components of the workloads, including vehicle-based units? |
---|
[CMREL_BP6.1] Monitor what matters.
Observability is the ability to understand how a system is working from the telemetry that it emits. A recommended approach is to implement a Vehicle Network Operations Center (V-NOC), an intelligent observability system that is aware of the health of the systems running in the vehicle, the backend systems supporting the platform, and the user-facing applications. It is essential to monitor all components of the workloads, including vehicle-based units.
Generally, a connected mobility platform has several integrations with internal and external systems. Any system change that can affect the interfacing message contracts needs a detailed impact analysis. Enable observability on these integrations for quicker troubleshooting, failure analysis, and performance profiling.
Vehicle-based telematics units should have health monitoring and a data transmission mechanism that can relay the state of systems in near real time when a connection is available, or ship buffered logs when connectivity is restored.
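The relay-or-buffer behavior described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `BufferedHealthReporter`, the `send` callback, and the `max_buffered` limit are all hypothetical, and a real telematics unit would also need persistence across power cycles and retry handling.

```python
from collections import deque
from typing import Callable, Deque, Dict


class BufferedHealthReporter:
    """Relays health records immediately when online, buffers them when offline."""

    def __init__(self, send: Callable[[Dict], None], max_buffered: int = 1000) -> None:
        self._send = send  # hypothetical uplink callback (assumption)
        # Bounded buffer: when full, the oldest record is dropped first.
        self._buffer: Deque[Dict] = deque(maxlen=max_buffered)
        self.online = False

    def report(self, record: Dict) -> None:
        """Send the record now if connected, otherwise keep it for later."""
        if self.online:
            self._send(record)
        else:
            self._buffer.append(record)

    def on_connectivity_lost(self) -> None:
        self.online = False

    def on_connectivity_restored(self) -> int:
        """Ship buffered records in their original order; return how many were sent."""
        self.online = True
        shipped = 0
        while self._buffer:
            # Send before removing, so a failed send leaves the record buffered.
            self._send(self._buffer[0])
            self._buffer.popleft()
            shipped += 1
        return shipped
```

Bounding the buffer is a design choice for memory-constrained units: it trades completeness of old logs for a guarantee that the most recent state is always retained.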
Prescriptive guidance:
AWS provides native monitoring, logging, alarming, and dashboards with Amazon CloudWatch.
If the preference is for open-source based managed services, consider Amazon Managed Service for Prometheus.
It is important to be aware of the end-to-end functioning of all endpoints that implement the connected mobility use cases. This active monitoring can be done with synthetic transactions, which periodically run a number of common tasks that match the actions performed by clients of the workload. You can also use the AWS X-Ray console together with the synthetic canary client nodes to pinpoint which synthetic canaries are experiencing errors, faults, or throttling during the selected time frame.
In all cases, tool interoperability and extensibility are important considerations in observability.
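As a rough illustration of a synthetic transaction, the sketch below runs a set of named checks and records whether each one succeeded and how long it took. It does not use the CloudWatch Synthetics API; `run_synthetic_checks`, `CheckResult`, and the check names are assumptions made for this example, and each check is expected to raise an exception on failure.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    ok: bool
    latency_ms: float
    error: str = ""


def run_synthetic_checks(checks: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run each check once and capture its outcome and latency."""
    results: List[CheckResult] = []
    for name, check in checks.items():
        start = time.perf_counter()
        try:
            check()  # a check signals failure by raising
            ok, error = True, ""
        except Exception as exc:
            ok, error = False, repr(exc)
        latency_ms = (time.perf_counter() - start) * 1000.0
        results.append(CheckResult(name, ok, latency_ms, error))
    return results
```

A scheduler would call `run_synthetic_checks` periodically, with checks such as a simulated remote start request or door lock command, and publish the results as metrics.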
CMREL_7: Are your connected mobility logs from various systems lacking the correct level of information? |
---|
[CMREL_BP7.1] Log interpretation and context propagation
Considering the number of systems and services supporting a connected mobility setup, it is important to collect logs into a centralized store. Not all log entries are equal, however: logs may be too verbose, or may lack the level of information needed for correlation. A system that filters logs based on prefixes reduces the amount of data that is retained and processed for insights. A processing and transformation engine is needed to make the logs more usable, or to enrich the log entries with additional information that can be used later for better correlation.
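The filter-then-enrich step can be sketched in a few lines. This is only an illustration under assumed names: `filter_and_enrich`, the prefixes, and the context fields (for example a hypothetical `trace_id`) are not part of any AWS service.

```python
from typing import Dict, Iterable, Iterator, Tuple


def filter_and_enrich(
    lines: Iterable[str],
    keep_prefixes: Tuple[str, ...],
    context: Dict[str, str],
) -> Iterator[Dict[str, str]]:
    """Keep only lines with a wanted prefix and attach correlation context."""
    for line in lines:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if line.startswith(keep_prefixes):
            yield {"message": line, **context}
```

For example, keeping only lines that start with `ERROR` or `WARN` and attaching a vehicle platform name and trace identifier lets downstream queries correlate entries from different systems.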
Because connected mobility platforms generate a high volume of logs, a system that simplifies searching, querying, and visualizing the logs in different ways based on need is recommended.
An ideal NOC should be able to create dashboards and reports and should support dynamic querying. Considering the mix of on-board and off-board systems, log entries may contain sensitive information such as the location of the vehicle. Securing the logs with encryption at rest, auditing access, and masking sensitive data in logs are required for compliance validation.
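Masking of sensitive fields before logs are stored could look like the sketch below. It assumes structured (dictionary-like) log entries; the key names in `SENSITIVE_KEYS` and the `mask_sensitive` helper are hypothetical and would need to match your own log schema.

```python
from typing import Any, FrozenSet

MASK = "***"
# Hypothetical field names; adjust to the actual log schema.
SENSITIVE_KEYS: FrozenSet[str] = frozenset({"latitude", "longitude"})


def mask_sensitive(value: Any, sensitive_keys: FrozenSet[str] = SENSITIVE_KEYS) -> Any:
    """Return a copy of a log entry with sensitive fields replaced by a mask."""
    if isinstance(value, dict):
        return {
            key: MASK if key in sensitive_keys else mask_sensitive(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [mask_sensitive(item, sensitive_keys) for item in value]
    return value  # scalars are returned unchanged
```

The function returns a new structure rather than modifying the entry in place, so the unmasked original can still be handled separately under stricter access controls if required.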
Prescriptive guidance:
CloudWatch Logs enables you to centralize the logs from all of your connected mobility systems, applications, and AWS services in a single, highly scalable service. You can then view them, live-tail them, search them for specific error codes or patterns, filter them based on specific fields, or archive them securely for future analysis. CloudWatch Logs enables you to see all of your logs, regardless of their source, as a single and consistent flow of events ordered by time.
CloudWatch Logs Insights enables you to query your logs with a powerful query language, visualize log data, and add the results to dashboards.
Logs can be filtered using CloudWatch Logs subscription filters, which support four target services: Kinesis Data Streams, AWS Lambda, Amazon Data Firehose, and Amazon OpenSearch Service.
For a list of AWS services that publish logs to CloudWatch Logs, see the CloudWatch documentation.
Some AWS services can directly write logs to other destinations.
CMREL_8: Is your metric collection aligned with a business outcome? |
---|
[CMREL_BP8.1] Define and calculate metrics (Aggregation)
Metric collection should always begin with an objective or business outcome. Metrics that are linked to business outcomes lead to a quicker response from the system and the teams supporting it than generic metrics do. As an OEM, you are aware of the regular patterns or trends for certain metrics in your system. For example, a critical KPI could be the number of remote start commands executed on vehicles across the hours of the day. Generally, the count is higher in the mornings and evenings, and it changes across time zones and seasons. If the traffic shows an anomaly after a new version of the services that process such requests is released, the dashboard should flag it as an issue. Aggregations should be available on that metric to determine the severity of the impact.
Prescriptive guidance:
You can create metric filters to match terms in your log events and convert log data into metrics. When a metric filter matches a term, it increments the metric's count. For example, on release of a new firmware you can create a metric filter that counts the number of times the word ECU_Connected occurs in your log events. You can assign units and dimensions to metrics. For example, if you create a metric filter that counts the number of times the word ECU_Connected occurs in your log events, you can specify a dimension that's called ECUConnectionCount to show the total number of log events that contain the word ECU_Connected and filter data by reported firmware version.
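To make the counting behavior concrete, the sketch below simulates locally what such a metric filter computes: the number of log events containing a term, broken down by one field. It is not the CloudWatch API. The JSON log layout and the field names `message` and `firmwareVersion` are assumptions for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_term_by_dimension(
    log_events: Iterable[str],
    term: str = "ECU_Connected",
    dimension_field: str = "firmwareVersion",  # assumed field name
) -> Counter:
    """Count JSON log events whose message contains `term`, per dimension value."""
    counts: Counter = Counter()
    for raw in log_events:
        event = json.loads(raw)
        if term in event.get("message", ""):
            counts[event.get(dimension_field, "unknown")] += 1
    return counts
```

Comparing the per-version counts before and after a firmware rollout gives a quick view of whether vehicles on the new version are connecting as expected.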
CMREL_9: Does your connected mobility network operations center (NOC) have the correct level of searchability and interactivity? |
---|
[CMREL_BP9.1] Real-time processing and alarming with notifications and automated response
Some OEMs may have existing processes and tools for alarming, trouble tracking, and automated response. For better visibility across the organization, the Vehicle-NOC should be integrated with these systems. With thousands to millions of vehicles connecting to the platform, the number of data points can become overwhelming, so an intelligent system that reduces noise by finding patterns across issues can boost productivity and reduce mean time to resolve (MTTR). Typically, such intelligent systems also reduce the number of repeat notifications for the same issue.
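One simple way to reduce repeat notifications is a suppression window per issue, sketched below. The `AlertDeduplicator` class and the idea of an `issue_key` (for example, service name plus error code) are assumptions for this example; production systems usually add grouping and escalation rules on top.

```python
from typing import Dict


class AlertDeduplicator:
    """Suppresses repeat notifications for the same issue within a time window."""

    def __init__(self, suppress_seconds: float) -> None:
        self._suppress_seconds = suppress_seconds
        self._last_sent: Dict[str, float] = {}

    def should_notify(self, issue_key: str, now: float) -> bool:
        """Return True if a notification should go out at time `now` (in seconds)."""
        last = self._last_sent.get(issue_key)
        if last is not None and now - last < self._suppress_seconds:
            return False  # same issue was notified recently
        self._last_sent[issue_key] = now
        return True
```

Passing the timestamp in explicitly keeps the logic easy to test and independent of the system clock.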
Prescriptive guidance:
Alerts can be sent to Amazon Simple Notification Service (Amazon SNS) topics, and then pushed to any number of subscribers. For example, Amazon SNS can forward alerts to an email alias or messaging channel so that technical staff can respond.
When you enable anomaly detection for a metric, CloudWatch applies statistical and machine learning algorithms. These algorithms continuously analyze metrics of systems and applications, determine normal baselines, and surface anomalies with minimal user intervention. In addition, anomaly detection on metric math is a feature that you can use to create anomaly detection alarms on the output of metric math expressions.
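The idea of a baseline with an expected band can be illustrated with a deliberately simple statistic. The sketch below is not the algorithm CloudWatch uses; it only flags a value that falls outside `band_width` standard deviations of historical values for the same hour of day, and the function name and parameters are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], observed: float, band_width: float = 2.0) -> bool:
    """Flag `observed` if it lies outside mean +/- band_width * stdev of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical values to estimate spread")
    center = mean(history)
    spread = stdev(history)  # sample standard deviation
    return abs(observed - center) > band_width * spread
```

For a KPI such as remote start commands per hour, `history` would hold the counts seen at that hour on previous comparable days, so that normal morning and evening peaks are not reported as anomalies.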
Design your workload to adapt to changes in demand
CMREL_10: How does your connected mobility workload adapt to vehicle traffic demand on resources? |
---|
[CMREL_BP10.1] Use automation when obtaining or scaling resources
[CMREL_BP10.2] Scale resources reactively on impairment to restore workload availability
[CMREL_BP10.3] Scale resources proactively to meet demand and avoid availability impact
Telemetry traffic from vehicles follows patterns that vary with factors such as time of day, weather conditions, season, and geographic location. If the connected mobility systems scale to meet demand, automating the process by using managed services provides better control.
When you automate scaling, it is important to know the target utilization, which is generally derived from observing historic trends and extrapolating from them. In a typical connected mobility setup, this can turn out to be a complex task because of the mix of several different services.
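The role of target utilization can be shown with a simplified proportional rule: capacity is adjusted so that the observed utilization would move toward the target. This is a sketch under stated assumptions, not the exact algorithm of any AWS scaling policy; `desired_capacity` and its parameters are illustrative names.

```python
import math


def desired_capacity(
    current_capacity: int,
    current_utilization: float,
    target_utilization: float,
    min_capacity: int,
    max_capacity: int,
) -> int:
    """Scale capacity proportionally to utilization, clamped to [min, max]."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = current_capacity * current_utilization / target_utilization
    # Round up so the workload is never left under-provisioned by rounding.
    return max(min_capacity, min(max_capacity, math.ceil(raw)))
```

For instance, four instances running at 90 percent utilization with a 60 percent target would be scaled to six, provided the configured maximum allows it.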
Another challenge is to monitor your connected mobility applications and services while capacity is added or removed in real time as demand changes. Doing so builds confidence that the right level of end user experience is maintained as the workloads change periodically or unpredictably.
Prescriptive guidance:
The exact nature of the automation depends upon the type of services that support the landscape. Managed services like AWS Lambda, Amazon S3, Amazon CloudFront, and others scale automatically based upon load. Service Quotas may impose limits that need to be taken into account.
Self-managed services like Amazon EC2 require careful planning to ensure that the load distribution meets the requirements of the business function. The automation should ensure that each instance returns to the state it is expected to be in to handle the traffic. Automation is vital to efficient DevOps, and getting your fleets of Amazon EC2 instances to launch, provision software, and self-heal automatically is a key challenge. Amazon EC2 Auto Scaling addresses this need.
To make scaling simpler, AWS launched predictive scaling policies for Amazon EC2 Auto Scaling.
Automotive companies generally use AWS managed services to reduce the overhead of administering the infrastructure and to let AWS handle the undifferentiated heavy lifting. Configuring these managed services is the customer's responsibility. Given the varied nature of these services, configuring them to scale automatically and cohesively becomes challenging very quickly. Thus, even in this case, you need a service that orchestrates automatic scaling across the workload. With Application Auto Scaling, you can configure automatic scaling for various resources beyond Amazon EC2.
[CMREL_BP10.4] Load and stress test your workload
After following the best practices above, it is important to test whether the scaling activities meet the requirements of the connected mobility functions. Stress testing the platform reveals how robust the various connected mobility functions are under extreme load conditions. Some use cases may deserve a higher level of robustness than others for business and compliance reasons. The load testing framework should be able to execute various testing strategies with varied traffic patterns that cover both regular business-as-usual cases and the corner cases of failure and subsequent recovery. Given the agility and cost optimization that AWS cloud computing provides, setting up stage environments for load testing should take less time, money, and effort. While load testing should be done in the stage environment, training the teams to handle such events in the production environment is also a must-do activity.
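A load test that covers business-as-usual traffic, a spike, and the subsequent recovery needs a traffic profile to drive it. The sketch below builds such a profile as a list of requests-per-second targets, one per time step. The function `build_load_profile` and its parameters are hypothetical and independent of any particular load testing tool.

```python
from typing import List


def build_load_profile(
    baseline_rps: int,
    peak_rps: int,
    ramp_steps: int,
    hold_steps: int,
    spike_multiplier: float,
) -> List[int]:
    """Return per-step request rates: baseline, ramp, hold, spike, recovery."""
    if ramp_steps < 1:
        raise ValueError("ramp_steps must be at least 1")
    profile = [baseline_rps]
    # Linear ramp from baseline up to the regular peak.
    for step in range(1, ramp_steps + 1):
        profile.append(baseline_rps + (peak_rps - baseline_rps) * step // ramp_steps)
    profile.extend([peak_rps] * hold_steps)            # business-as-usual peak
    profile.append(int(peak_rps * spike_multiplier))   # stress: sudden spike
    profile.append(baseline_rps)                       # recovery back to baseline
    return profile
```

Feeding each step to the load generator for a fixed interval exercises both scale-out during the ramp and spike and scale-in during recovery.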
Prescriptive guidance:
Distributed load testing
Game Days
Implement change
CMREL_11: How are you controlling the changes that are deployed? |
---|
The number of features and scenarios supported by the connected mobility platform has been growing rapidly. It is recommended to control the impact of these changes when deploying new functions.
[CMREL_BP11.1] Use runbooks for standard activities such as deployment
Vehicles that are connected to the backend generally operate in various geographies and time zones. It is important to plan change activities for a time that causes the least disruption to the services. The runbook documentation should be concise and specific to the tasks being conducted. The execution of the steps should be controlled by the right level of authorization, which requires approvals and captures an appropriate reason. Unauthorized changes can result in vehicles losing connectivity to the backend. It is important to track the changes that were implemented so that deployment rollbacks can be analyzed for integrity.
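The combination of approval, captured reason, and change tracking can be sketched as a small gate around each runbook step. All names here (`ChangeRecord`, `run_step`, the rule that the approver must differ from the requester) are assumptions chosen for illustration, not a prescribed process.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class ChangeRecord:
    step: str
    requested_by: str
    approved_by: str
    reason: str
    executed_at: datetime


def run_step(
    step: str,
    action: Callable[[], None],
    requested_by: str,
    approved_by: str,
    reason: str,
    audit_log: List[ChangeRecord],
) -> None:
    """Run a runbook step only with a separate approver and a stated reason."""
    if not approved_by or approved_by == requested_by:
        raise PermissionError("step requires approval from a different person")
    if not reason.strip():
        raise ValueError("a reason must be captured for every change")
    action()
    # Record the change only after the action completed without raising.
    audit_log.append(
        ChangeRecord(step, requested_by, approved_by, reason, datetime.now(timezone.utc))
    )
```

The audit log built this way gives a time-ordered record that can later be compared against rollbacks to verify that every executed change was authorized.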
Prescriptive guidance:
Runbooks provide an excellent mechanism for communicating the actions to be performed with the minimum necessary information. Change activities are generally audited and can be rolled back to the previous version. AWS Config performs compliance checks based on the rules that you configure, and Audit Manager reports the results as compliance check evidence.
Resources:
Using AWS Config managed rules with Audit Manager
Using AWS Config custom rules with Audit Manager
Troubleshooting AWS Config integration with Audit Manager
[CMREL_BP11.2] Integrate functional and resilience testing as part of your deployment
Functional and resilience tests should be part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back. These tests are run in a stage environment and done as part of a deployment pipeline.
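The halt-or-roll-back decision can be expressed as a small gate function. This is a sketch with assumed names (`evaluate_gate`, the test names in the usage note); real pipelines express the same logic in their own stage configuration.

```python
from typing import Dict


def evaluate_gate(test_results: Dict[str, bool], already_deployed: bool) -> str:
    """Decide the pipeline action from functional and resilience test outcomes."""
    if not test_results:
        return "halt"  # no evidence of success, so do not proceed
    if all(test_results.values()):
        return "proceed"
    # A failure before deployment stops the pipeline; after deployment it rolls back.
    return "rollback" if already_deployed else "halt"
```

For example, results such as `{"remote_start_flow": True, "broker_failover": False}` evaluated after a stage deployment would yield `"rollback"`.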
Prescriptive guidance:
Vehicle owners or the related systems can take different functional paths while using the connected services. It is important to track such paths and simulate them step by step using services like Amazon CloudWatch Synthetics. As a bonus, this also reveals the uptime of the services.
Chaos engineering
Resources:
Videos:
AWS re:Invent 2022 - Building confidence through chaos engineering on AWS
[CMREL_BP11.3] Use a canary or blue/green deployment when deploying applications in immutable infrastructures
The operational complexity of the connected mobility platform builds up with the numerous distributed services that support various scenarios. Release cycles, patching, and monitoring of changes all add overhead that is difficult to manage and error prone. Offboard teams begin to delay the release of changes to avoid disrupting the availability of services. A solution to this problem is immutable infrastructure, where infrastructure is not updated or fixed in place but is replaced.
Prescriptive guidance:
Canary release
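A canary release shifts traffic to the new version in stages and rolls back if the new version misbehaves. The sketch below shows only that control loop. The function `run_canary`, the `error_rate_at` callback (which would query monitoring for the new version's error rate at a given traffic share), and the thresholds are assumptions for illustration.

```python
from typing import Callable, Sequence


def run_canary(
    traffic_steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> int:
    """Return the final traffic percent on the new version (0 means rolled back)."""
    for percent in traffic_steps:
        # Shift `percent` of traffic to the new version, then check its health.
        if error_rate_at(percent) > max_error_rate:
            return 0  # roll back: all traffic returns to the old version
    return traffic_steps[-1] if traffic_steps else 0
```

With steps such as 5, 25, 50, and 100 percent, a fault that only appears under higher load is caught before the whole fleet of connected vehicles is affected.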