OPS04-BP04 Implement dependency telemetry - Operational Excellence Pillar

OPS04-BP04 Implement dependency telemetry

Dependency telemetry is essential for monitoring the health and performance of the external services and components your workload relies on. It provides valuable insights into reachability, timeouts, and other critical events related to dependencies such as DNS, databases, or third-party APIs. When you instrument your application to emit metrics, logs, and traces about these dependencies, you gain a clearer understanding of potential bottlenecks, performance issues, or failures that might impact your workload.
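As an illustration of the instrumentation described above, the following is a minimal sketch that times a single call to a dependency, classifies the outcome as success, timeout, or error, and emits one structured log record. It uses only the Python standard library. The names `track_dependency`, `dependency_telemetry`, `example-db`, and the record fields are assumptions made for this example, not part of any AWS SDK.

```python
import json
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; route it to whatever log pipeline you already use.
logger = logging.getLogger("dependency_telemetry")


@contextmanager
def track_dependency(name, operation, records=None):
    """Time one dependency call and emit a structured record about it.

    `name` and `operation` are free-form labels chosen by the caller.
    If `records` is a list, the record is also appended to it.
    """
    record = {"dependency": name, "operation": operation, "outcome": "success"}
    start = time.perf_counter()
    try:
        yield record  # the caller may add extra fields to the record
    except TimeoutError:
        # Only the built-in TimeoutError is treated as a timeout here.
        record["outcome"] = "timeout"
        raise
    except Exception as exc:
        record["outcome"] = "error"
        record["error_type"] = type(exc).__name__
        raise
    finally:
        # Runs on every path, so failed calls are measured as well.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(record))
        if records is not None:
            records.append(record)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    with track_dependency("example-db", "SELECT 1"):
        time.sleep(0.01)  # stand-in for the real dependency call
```

The exception is re-raised after it is recorded, so wrapping a call this way does not change the application's error handling. Client libraries that raise their own timeout exception types would need an additional `except` clause.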

Desired outcome: You can confirm that the dependencies your workload relies on are performing as expected, which lets you address issues proactively and maintain workload performance.

Common anti-patterns:

  • Overlooking external dependencies: Focusing only on internal application metrics while neglecting metrics related to external dependencies.

  • Lack of proactive monitoring: Waiting for issues to arise instead of continuously monitoring dependency health and performance.

  • Siloed monitoring: Using multiple, disparate monitoring tools which can result in fragmented and inconsistent views of dependency health.

Benefits of establishing this best practice:

  • Improved workload reliability: By ensuring that external dependencies are consistently available and performing optimally.

  • Faster issue detection and resolution: Proactively identifying and addressing issues with dependencies before they impact the workload.

  • Comprehensive view: Gaining a holistic view of both internal and external components that influence workload health.

  • Enhanced workload scalability: By understanding the scalability limits and performance characteristics of external dependencies.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Implement dependency telemetry by first identifying the services, infrastructure, and processes that your workload depends on. Quantify what good conditions look like when those dependencies are functioning as expected, and then determine what data you need to measure those conditions. With that information, you can build dashboards and alerts that show your operations teams the state of those dependencies. Use AWS tools to discover and quantify the impact when dependencies cannot deliver as needed. Continually revisit your strategy to account for changes in priorities, goals, and gained insights.
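To make "quantify what good conditions look like" concrete, the sketch below compares observed call samples for one dependency against a target success rate and a target 99th-percentile latency. The class `DependencyTarget`, the function names, and the threshold fields are assumptions chosen for illustration. The percentile uses the simple nearest-rank method, which may differ slightly from what your monitoring tool reports.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyTarget:
    """Hypothetical description of 'good' for one dependency."""
    name: str
    min_success_rate: float    # for example 0.99
    max_p99_latency_ms: float  # for example 300.0


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with 0 < pct <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(target, samples):
    """Check samples against a target.

    `samples` is a list of (succeeded, latency_ms) tuples, one per call.
    Missing data is reported as unhealthy rather than silently passing.
    """
    if not samples:
        return {"dependency": target.name, "healthy": False,
                "breaches": ["no_data"]}
    success_rate = sum(1 for ok, _ in samples if ok) / len(samples)
    p99 = percentile([latency for _, latency in samples], 99)
    breaches = []
    if success_rate < target.min_success_rate:
        breaches.append("success_rate")
    if p99 > target.max_p99_latency_ms:
        breaches.append("p99_latency_ms")
    return {
        "dependency": target.name,
        "success_rate": success_rate,
        "p99_latency_ms": p99,
        "healthy": not breaches,
        "breaches": breaches,
    }
```

The list of breached conditions, rather than a single boolean, is what an alert or dashboard widget would typically surface to the operations team.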

Implementation steps

To implement dependency telemetry effectively:

  1. Identify external dependencies: Collaborate with stakeholders to pinpoint the external dependencies your workload relies on. External dependencies can encompass services like external databases, third-party APIs, network connectivity routes to other environments, and DNS services. The first step towards effective dependency telemetry is being comprehensive in understanding what those dependencies are.

  2. Develop a monitoring strategy: Once you have a clear picture of your external dependencies, architect a monitoring strategy tailored to them. This involves understanding the criticality of each dependency, its expected behavior, and any associated service-level agreements or targets (SLAs or SLTs). Set up proactive alerts to notify you of status changes or performance deviations.

  3. Use network monitoring: Use Amazon CloudWatch Internet Monitor and Network Monitor, which provide insights into global internet and network conditions. These tools help you understand and respond to outages, disruptions, or performance degradations that affect your external dependencies.

  4. Stay informed with AWS Health Dashboard: AWS Health Dashboard provides alerts and remediation guidance when AWS is experiencing events that may impact your services.

    1. Monitor AWS Health events with Amazon EventBridge rules, or integrate programmatically with the AWS Health API to automate actions when you receive AWS Health events. These can be general actions, such as sending all planned lifecycle event messages to a chat interface, or specific actions, such as the initiation of a workflow in an IT service management tool.

    2. If you use AWS Organizations, aggregate AWS Health events across accounts.

  5. Instrument your application with AWS X-Ray: AWS X-Ray provides insights into how applications and their underlying dependencies are performing. By tracing requests from start to end, you can identify bottlenecks or failures in the external services or components your application relies on.

  6. Use Amazon DevOps Guru: This machine learning-driven service identifies operational issues, predicts when critical issues might occur, and recommends specific actions to take. Use its insights to determine whether a dependency is the source of an operational problem.

  7. Monitor regularly: Continually monitor metrics and logs related to external dependencies. Set up alerts for unexpected behavior or degraded performance.

  8. Validate after changes: Whenever there's an update or change in any of the external dependencies, validate their performance and check their alignment with your application's requirements.
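The AWS Health automation in step 4 can be sketched as follows. The example shows an EventBridge event pattern that matches AWS Health events, and a small routing function that could run in the rule's target (for example, an AWS Lambda function). The destination labels `chat`, `itsm_ticket`, and `log_only`, and the function name `route_health_event`, are assumptions for illustration. Delivering the message to a real chat or IT service management tool is left out.

```python
import json

# Event pattern for an EventBridge rule that matches AWS Health events.
HEALTH_EVENT_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
}


def route_health_event(event):
    """Choose an action for one AWS Health event delivered by EventBridge.

    Returns a (destination, message) tuple. Destinations are hypothetical
    labels that the caller maps to real integrations.
    """
    if event.get("source") != "aws.health":
        return ("ignore", "not an AWS Health event")

    detail = event.get("detail", {})
    category = detail.get("eventTypeCategory", "unknown")
    message = "{} {} in {}".format(
        detail.get("service", "unknown-service"),
        detail.get("eventTypeCode", "unknown-event"),
        event.get("region", "unknown-region"),
    )

    if category == "scheduledChange":
        # General action: post planned changes to a chat channel.
        return ("chat", message)
    if category == "issue":
        # Specific action: open a ticket in an ITSM tool.
        return ("itsm_ticket", message)
    return ("log_only", message)


if __name__ == "__main__":
    # The pattern is supplied to the rule as a JSON string.
    print(json.dumps(HEALTH_EVENT_PATTERN, indent=2))
```

Keeping the routing decision in a pure function makes it straightforward to test with sample events before attaching it to a live rule.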

Level of effort for the implementation plan: Medium
