OPS04-BP02 Implement application telemetry
Application telemetry serves as the foundation for observability of your workload. It's crucial to emit telemetry that offers actionable insights into the state of your application and the achievement of both technical and business outcomes. From troubleshooting to measuring the impact of a new feature or ensuring alignment with business key performance indicators (KPIs), application telemetry informs the way you build, operate, and evolve your workload.
Metrics, logs, and traces form the three primary pillars of observability. These serve as diagnostic tools that describe the state of your application. Over time, they assist in creating baselines and identifying anomalies. However, to ensure alignment between monitoring activities and business objectives, it's pivotal to define and monitor KPIs. Business KPIs often make it easier to identify issues compared to technical metrics alone.
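As a rough illustration of how a baseline helps identify anomalies, the sketch below treats an initial window of metric samples as the baseline and flags later samples that deviate strongly from it. The function name, the window size, the z-score threshold, and the latency values are all assumptions made for this example; a production monitoring service would normally use more robust models.

```python
from statistics import mean, stdev


def find_anomalies(samples, baseline_size=20, threshold=3.0):
    """Return (index, value) pairs that deviate from a fixed baseline window.

    The first `baseline_size` samples define the baseline (mean and standard
    deviation). Later samples whose z-score exceeds `threshold` are flagged.
    """
    if baseline_size < 2 or len(samples) <= baseline_size:
        return []
    baseline = samples[:baseline_size]
    mu = mean(baseline)
    sigma = stdev(baseline)
    later = list(enumerate(samples[baseline_size:], baseline_size))
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as anomalous.
        return [(i, v) for i, v in later if v != mu]
    return [(i, v) for i, v in later if abs(v - mu) / sigma > threshold]


# Hypothetical latencies in milliseconds: a steady period, then one spike.
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98] * 2 + [101, 99, 250, 100]
print(find_anomalies(latencies_ms))  # → [(22, 250)]
```

A fixed baseline window keeps the example short. A rolling window would adapt to gradual change, at the cost of slowly absorbing a sustained regression into the baseline.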
Other telemetry types, like real user monitoring (RUM) and synthetic transactions, complement these primary data sources. RUM offers insights into real-time user interactions, whereas synthetic transactions simulate potential user behaviors, helping detect bottlenecks before real users encounter them.
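A synthetic transaction can be pictured as a scripted sequence of user steps that is timed and checked on a schedule. The sketch below is a minimal, service-agnostic illustration of that idea; the function name, the step names, and the slowness threshold are hypothetical, and the no-op steps stand in for real requests against your workload.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_synthetic_transaction(
    steps: List[Tuple[str, Callable[[], None]]],
    slow_after_seconds: float,
    clock: Callable[[], float] = time.perf_counter,
) -> List[Dict[str, object]]:
    """Run scripted steps in order and record the outcome and duration of each."""
    results = []
    for name, action in steps:
        started = clock()
        try:
            action()
            status = "ok"
        except Exception as exc:  # a failed step is a finding, not a crash
            status = f"failed: {exc}"
        duration = clock() - started
        results.append({
            "step": name,
            "status": status,
            "seconds": duration,
            "slow": duration > slow_after_seconds,
        })
        if status != "ok":
            break  # later steps usually depend on earlier ones
    return results


# Hypothetical user journey; replace the lambdas with real calls to your application.
journey = [
    ("open_home_page", lambda: None),
    ("add_item_to_cart", lambda: None),
    ("check_out", lambda: None),
]
for result in run_synthetic_transaction(journey, slow_after_seconds=0.5):
    print(result)
```

The injectable `clock` parameter exists only so the timing logic can be exercised deterministically in tests.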
Desired outcome: Derive actionable insights into the performance of your workload. These insights allow you to make proactive decisions about performance optimization, achieve increased workload stability, streamline CI/CD processes, and utilize resources effectively.
Common anti-patterns:
- Incomplete observability: Neglecting to incorporate observability at every layer of the workload, resulting in blind spots that can obscure vital system performance and behavior insights.
- Fragmented data view: When data is scattered across multiple tools and systems, it becomes challenging to maintain a holistic view of your workload's health and performance.
- User-reported issues: A sign that proactive issue detection through telemetry and business KPI monitoring is lacking.
Benefits of establishing this best practice:
- Informed decision-making: With insights from telemetry and business KPIs, you can make data-driven decisions.
- Improved operational efficiency: Data-driven resource utilization leads to cost-effectiveness.
- Enhanced workload stability: Faster detection and resolution of issues lead to improved uptime.
- Streamlined CI/CD processes: Insights from telemetry data facilitate refinement of processes and reliable code delivery.
Level of risk exposed if this best practice is not established: High
Implementation guidance
To implement application telemetry for your workload, use AWS services such as Amazon CloudWatch and AWS X-Ray.
Implementation steps
- Identify what data to collect: Ascertain the essential metrics, logs, and traces that would offer substantial insights into your workload's health, performance, and behavior.
- Deploy the CloudWatch agent: The CloudWatch agent is instrumental in procuring system and application metrics and logs from your workload and its underlying infrastructure. The CloudWatch agent can also be used to collect OpenTelemetry or X-Ray traces and send them to X-Ray.
- Define and monitor business KPIs: Establish custom metrics that align with your business outcomes.
- Instrument your application with AWS X-Ray: In addition to deploying the CloudWatch agent, it's crucial to instrument your application to emit trace data. This process can provide further insights into your workload's behavior and performance.
- Standardize data collection across your application: Standardize data collection practices across your entire application. Uniformity aids in correlating and analyzing data, providing a comprehensive view of your application's behavior.
- Analyze and act on the data: Once data collection and normalization are in place, use Amazon CloudWatch for metrics and logs analysis, and AWS X-Ray for trace analysis. Such analysis can yield crucial insights into your workload's health, performance, and behavior, guiding your decision-making process.
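One way to combine the business KPI and standardization steps is to emit every KPI as a structured log line in the CloudWatch embedded metric format, which lets CloudWatch extract a custom metric from the log event while the same record carries fields for correlation. The sketch below uses only the Python standard library and is an illustration, not the method this guidance prescribes. The helper names, the `ExampleShop` namespace, the `OrdersPlaced` metric, the `Channel` dimension, and the `trace_id` field are hypothetical. How the line reaches CloudWatch Logs depends on your environment and is outside this sketch.

```python
import json
import sys
import time


def kpi_log_record(namespace, metric_name, value, unit, dimensions,
                   trace_id=None, now_ms=None):
    """Build one structured log record in CloudWatch embedded metric format.

    `dimensions` maps dimension names to values. The record carries the metric
    value, its dimensions, and an optional trace ID used for correlation.
    """
    record = {
        "_aws": {
            # Epoch milliseconds, as the embedded metric format expects.
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
    }
    record.update(dimensions)       # dimension values live at the top level
    record[metric_name] = value     # so does the metric value itself
    if trace_id is not None:
        record["trace_id"] = trace_id  # plain property, searchable in logs
    return record


def emit(record, stream=sys.stdout):
    """Write the record as a single JSON line."""
    stream.write(json.dumps(record) + "\n")


# Hypothetical KPI: orders placed through the web channel.
emit(kpi_log_record(
    namespace="ExampleShop",
    metric_name="OrdersPlaced",
    value=3,
    unit="Count",
    dimensions={"Channel": "web"},
    trace_id="1-5759e988-bd862e3fe1be46a994272793",  # X-Ray style trace ID, illustrative
))
```

Routing all KPI emission through one helper like this keeps field names and units uniform across the application, which is what makes later correlation between logs, metrics, and traces practical.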
Level of effort for the implementation plan: High