Operate - Analytics Lens


Operational success is the achievement of outcomes measured by the metrics you define. By understanding the operational health of your analytics application, you can identify when it’s impacted by operational events and respond appropriately.

To operate successfully, you need to consider the following:

  • Understanding the operational health of the entire analytics pipeline

  • Responding to operational events

ANALYTICS_OPS 03: How do you monitor the health of the analytics pipeline?

Your team must be able to easily understand the operational health of your analytics workload. Use metrics based on key performance indicators (KPIs) and operational outcomes to gain useful insights. Use these metrics to implement dashboards with business and technical viewpoints that help team members make informed decisions. On AWS, capabilities are available to help you bring together and analyze your workload and operations logs so that you can understand your operating status and gain insight from operations over time.

Many AWS managed services, such as Amazon RDS and Amazon Redshift, provide service-specific metrics that can be integrated into CloudWatch Dashboards. For example, by monitoring the average number of read and write operations per second on a data warehouse, you can determine usage patterns and scale your capacity accordingly. Metrics such as queue length (or depth) are also important: they indicate the number of queries or requests waiting to be processed by the data warehouse at any given time.
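As a minimal sketch of pulling such a service-specific metric, the following builds the parameters for a CloudWatch `GetMetricStatistics` call that reads the workload management (WLM) queue length of a Redshift cluster. The cluster identifier `my-warehouse` is a placeholder, not a name from this document.

```python
import datetime

def queue_depth_request(cluster_id, minutes=60):
    """Build parameters for CloudWatch GetMetricStatistics covering the
    last `minutes` of the WLMQueueLength metric (queries waiting in the
    Redshift workload-management queue). With boto3, pass the result to
    boto3.client("cloudwatch").get_metric_statistics(**params)."""
    now = datetime.datetime.utcnow()
    return {
        "Namespace": "AWS/Redshift",
        "MetricName": "WLMQueueLength",
        "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
        "StartTime": now - datetime.timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 300,                     # 5-minute datapoints
        "Statistics": ["Average", "Maximum"],
    }

# "my-warehouse" is a hypothetical cluster identifier.
params = queue_depth_request("my-warehouse")
```

A sustained non-zero average here is the kind of signal that should drive a capacity decision rather than guesswork.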

Send log data from your clusters to Amazon CloudWatch Logs and define baseline metrics to establish normal operating patterns. Create Amazon CloudWatch dashboards that present system- and business-level views of your metrics. For processing frameworks such as Amazon EMR, you should also install the Ganglia monitoring system, a scalable distributed monitoring system for high-performance computing environments such as clusters and grids.

You can also ingest your CloudWatch Logs log data into Amazon Elasticsearch Service (Amazon ES) and then use its built-in support for Kibana to create dashboards and visualizations of your operational health (for example, order rates, connected users, and transaction times).

In the AWS shared responsibility model, portions of monitoring are delivered to you through the AWS Service Health Dashboard (SHD) and the AWS Personal Health Dashboard (PHD). These dashboards provide alerts and remediation guidance when AWS is experiencing events that might affect you. Customers with Business or Enterprise Support plans also get access to the AWS Health API, enabling integration with their event management systems. AWS also supports third-party log analysis and business intelligence tools through the AWS service APIs and SDKs (for example, Grafana, Kibana, and Logstash).

Right-sizing capacity in your analytics pipeline should not be based on guesswork. Instead, use operational metrics to make informed decisions.
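One common way to stream CloudWatch Logs into Amazon ES is a subscription filter that forwards log events to a Lambda function which writes to the domain. The sketch below builds the parameters for the `PutSubscriptionFilter` call; the log group name, filter name, and Lambda ARN are all placeholder assumptions.

```python
# Hypothetical names throughout; with boto3 you would call
# boto3.client("logs").put_subscription_filter(**subscription_params).
subscription_params = {
    "logGroupName": "/analytics/app",
    "filterName": "ship-to-elasticsearch",
    "filterPattern": "",  # an empty pattern forwards every log event
    "destinationArn": (
        "arn:aws:lambda:us-east-1:123456789012"
        ":function:LogsToElasticsearch"
    ),
}
```

Once the events land in Amazon ES, Kibana visualizations can be built directly on the indexed fields.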

ANALYTICS_OPS 04: How do you manage operational events for your analytics application?

An analytics pipeline consists of many moving parts across ingestion, storage, analysis, and visualization of your data. A failure in any individual layer of your data and analytics pipeline can have downstream impact on business-critical applications that depend on your analytics layer. Having remediation steps in place for such events helps ensure business continuity and helps you meet your availability SLAs. Anticipate operational events, both planned (for example, sales promotions, deployments, and failure tests) and unplanned (for example, surges in utilization and component failures).

Create and use runbooks and playbooks so that your team responds to alerts methodically and consistently, without confusion. Each defined alert must be owned by a role or team that is accountable for the response and any escalations. Understand the business impact of your system components and use this knowledge to prioritize effort where it is needed. Perform a root cause analysis (RCA) after events, then document workarounds and prevent the recurrence of failures. Know when a human decision is needed before an action is taken and, when possible, have that decision made before the action is required. Keep critical manual procedures available for use when automated procedures fail.

AWS simplifies your event response by providing tools that support all aspects of your workload and operations programmatically. These tools allow you to script responses to operational events and trigger their execution in response to monitored data. In AWS, you can improve recovery time by replacing failed components with known good versions rather than trying to repair them. There are multiple ways to automate the execution of runbook and playbook actions on AWS.

To respond to an event from a state change in your AWS resources, or to your own custom events, create CloudWatch Events rules that trigger responses through targets (for example, Lambda functions, Amazon Simple Notification Service (Amazon SNS) topics, Amazon Elastic Container Service (Amazon ECS) tasks, AWS Step Functions state machines, or AWS Systems Manager Automation). AWS also supports third-party systems through the AWS service APIs and SDKs. A number of partner and third-party tools provide monitoring, notifications, and responses, including New Relic, Splunk, Loggly, SumoLogic, and Datadog.
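As a concrete sketch of this pattern, the following builds a CloudWatch Events rule that matches failed Amazon EMR steps and routes them to an SNS topic for the on-call team. The rule name, target ID, and topic ARN are placeholder assumptions.

```python
import json

# EMR emits "EMR Step Status Change" events; this pattern matches only
# steps that reach the FAILED state.
event_pattern = {
    "source": ["aws.emr"],
    "detail-type": ["EMR Step Status Change"],
    "detail": {"state": ["FAILED"]},
}

# With boto3: boto3.client("events").put_rule(**rule_params)
rule_params = {
    "Name": "emr-step-failed",          # hypothetical rule name
    "EventPattern": json.dumps(event_pattern),
    "State": "ENABLED",
}

# With boto3: boto3.client("events").put_targets(**target_params)
target_params = {
    "Rule": "emr-step-failed",
    "Targets": [{
        "Id": "notify-oncall",
        "Arn": "arn:aws:sns:us-east-1:123456789012:analytics-alerts",
    }],
}
```

The same rule could instead target a Lambda function or a Systems Manager Automation document that executes the relevant runbook step automatically.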

ANALYTICS_OPS 05: How are you evolving your data and analytics workload while minimizing the impact of change?

As new technologies are introduced, it is common for organizations to upgrade their data and analytics stack to newer versions, or to replace a service for ingestion, processing, or visualization with a managed or serverless alternative. Decoupling the storage and compute layers for your data assets, using an external metadata store, and keeping versioned configuration artifacts in Amazon S3 buckets enable you to upgrade and re-launch clusters and resume processing.
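For example, an EMR cluster can use the AWS Glue Data Catalog as its external Hive metastore, so table definitions survive cluster termination and re-launch. The sketch below builds the `Configurations` list that would be passed to the EMR `RunJobFlow` API; the overall job-flow settings around it are assumed, not taken from this document.

```python
# Point both Hive and Spark SQL at the Glue Data Catalog instead of a
# cluster-local metastore.
GLUE_FACTORY = ("com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory")

configurations = [
    {"Classification": "hive-site",
     "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY}},
    {"Classification": "spark-hive-site",
     "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY}},
]

# With boto3: boto3.client("emr").run_job_flow(
#     Name="analytics-cluster", Configurations=configurations, ...)
```

Because the metastore lives outside the cluster, a version upgrade becomes a re-launch with new release labels rather than a migration.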

If you need more fine-grained control over configuration that is shared across multiple applications, consider the AWS Systems Manager Parameter Store rather than environment variables. Use AWS Secrets Manager to store database credentials and other secrets so that you can easily rotate, manage, and retrieve them from your analytics application code.
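A minimal sketch of this pattern follows. The parameter and secret names are placeholders; the actual boto3 calls are shown in comments so the sketch stays self-contained, and only the secret-parsing helper (an assumption about an RDS-style JSON secret) runs locally.

```python
import json

def parse_db_secret(secret_string):
    """Secrets Manager returns database-style secrets as a JSON string;
    extract the credential fields the analytics code needs."""
    secret = json.loads(secret_string)
    return secret["username"], secret["password"]

# With boto3 (names are hypothetical):
#   ssm = boto3.client("ssm")
#   batch_size = ssm.get_parameter(
#       Name="/analytics/etl/batch-size")["Parameter"]["Value"]
#   sm = boto3.client("secretsmanager")
#   raw = sm.get_secret_value(
#       SecretId="analytics/warehouse")["SecretString"]
#   user, pwd = parse_db_secret(raw)

# Local demonstration with an example secret payload:
user, pwd = parse_db_secret('{"username": "etl_user", "password": "example"}')
```

Because the application resolves configuration and credentials at run time, rotating a secret or changing a parameter requires no code deployment.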