
Design for Failure

“Everything fails all the time.” – Werner Vogels

This adage is no less true in the container world than anywhere else in the cloud. Achieving high availability is a top priority for workloads, but remains an arduous undertaking for development teams. Modern applications running in containers should not be tasked with managing the underlying layers, from physical infrastructure like electricity sources or environmental controls all the way to the stability of the underlying operating system. If a set of containers fails while delivering a service, those containers should be re-instantiated automatically and without delay. Similarly, because microservices interact with each other over the network far more than they do locally and synchronously, connections need to be monitored and managed. Latency and timeouts should be assumed and gracefully handled. More generally, microservices should apply the same error retries and exponential backoff with jitter that are advised for any application running in a networked environment, as in the sketch below.
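
As a minimal sketch of that retry guidance, the following Python helper retries a call with exponential backoff and full jitter. The function name, attempt count, and delay bounds are illustrative choices, not values from any particular library:

```python
import random
import time

def call_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry func with exponential backoff and full jitter.

    Each failed attempt doubles the backoff cap, and the actual sleep is a
    random duration in [0, cap], which prevents synchronized retry storms.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```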

Designing for failure also means testing the design and watching services cope with deteriorating conditions. Not all technology departments need to apply this principle to the extent that Netflix does, but we encourage you to test these mechanisms often.
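
In that spirit, here is a toy chaos experiment, assuming the Docker SDK for Python and a hypothetical "chaos" opt-in label; it kills one random container so you can watch whether the orchestrator replaces it and the service stays healthy:

```python
import random

import docker

client = docker.from_env()

# Only containers that opted in via the (hypothetical) "chaos" label are fair game.
targets = [c for c in client.containers.list() if c.labels.get("chaos") == "true"]
if targets:
    victim = random.choice(targets)
    print(f"chaos experiment: killing {victim.name}")
    victim.kill()  # the scheduler should re-instantiate it automatically
```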

Designing for failure yields a self-healing infrastructure that acts with the maturity expected of modern workloads. Preventing emergency calls guarantees a base level of satisfaction for the team that owns the service and removes a level of stress that can otherwise grow into accelerated attrition. Designing for failure delivers greater uptime for your products and can shield a company from outages that erode customer trust.

Here are the key factors from the twelve-factor app pattern methodology that play a role in designing for failure:

  • Disposability (maximize robustness with fast startup and graceful shutdown) – Produce lean container images and strive for processes that can start and stop in a matter of seconds (see the sketch after this list).

  • Logs (treat logs as event streams) – If part of a system fails, troubleshooting is necessary. Ensure that material for forensics exists.

  • Dev/prod parity – Keep development, staging, and production as similar as possible.
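
To make the disposability and log-handling factors concrete, here is a minimal Python worker sketch: it writes its logs to stdout as an event stream and traps SIGTERM, the signal container runtimes send before the grace period expires, so it can stop cleanly within seconds:

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Container runtimes send SIGTERM first and SIGKILL only after a grace
    # period, so flag the loop to finish in-flight work and exit promptly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # Logs go to stdout, unbuffered, where the runtime treats them as a stream.
    print("handling work", flush=True)
    time.sleep(1)

print("drained in-flight work, exiting cleanly", flush=True)
sys.exit(0)
```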

AWS recommends that container hosts be part of a self-healing group. Ideally, container management systems are aware of different data centers and the microservices that span across them, mitigating possible events at the physical level.

Containers offer an abstraction from operating system management. You can treat container instances as immutable servers. Containers behave identically whether they run on a developer’s laptop or on a fleet of virtual machines in the cloud.

One very useful container pattern for hardening an application’s resiliency is the circuit breaker. With circuit breaker implementations such as Resilience4j or Hystrix, an application container is proxied by a container in charge of monitoring connection attempts from the application container. While connections succeed, the circuit breaker remains in the closed state and lets communication happen. When connections start failing, the circuit breaker logic triggers: if a pre-defined threshold for the failure/success ratio is breached, the breaker enters the open state, which prevents more connections. This mechanism offers a predictable and clean breaking point, a departure from partial-failure situations that can make recovery difficult. The application container can then move on, switching to a backup service or entering a degraded state.
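
The logic reduces to a small state machine. Here is a minimal sketch in Python; the thresholds and the half-open trial behavior are simplified illustrations, not how Resilience4j or Hystrix are actually implemented:

```python
import time

class CircuitBreaker:
    """Closed -> open after too many failures; half-open after a cool-down,
    where a single trial call decides whether to close or re-open."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit again
        return result
```

A caller wraps outbound requests as breaker.call(do_request) and, on the fail-fast error, can switch to a backup service or return a degraded response.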

Another useful pattern for application resilience is the service mesh, which manages the network of microservices communicating with each other. Tools such as AWS App Mesh and Istio are available to manage and monitor these meshes. A service mesh deploys a sidecar, a separate process installed alongside each service in the same container set. The important feature of the sidecar is that all communication to and from the service is routed through the sidecar process, and this redirection is completely transparent to the service. Service meshes offer several resilience patterns that can be activated by rules in the sidecar, including timeouts, retries, and circuit breakers.
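
To show the shape of the idea, here is a minimal sidecar sketch in Python: a transparent TCP proxy that fronts a local service and enforces one rule, an I/O timeout. The ports, addresses, and timeout value are illustrative; real meshes such as App Mesh and Istio use Envoy proxies configured by a control plane rather than hand-written code:

```python
import asyncio

LISTEN_PORT = 9090                   # callers are pointed at the sidecar
SERVICE_ADDR = ("127.0.0.1", 8080)   # the service process being fronted
IO_TIMEOUT = 2.0                     # seconds before a stalled read is abandoned

async def pipe(reader, writer):
    # Copy bytes in one direction; a read that stalls past IO_TIMEOUT
    # closes the stream, which is the sidecar's "timeout" rule in action.
    try:
        while True:
            data = await asyncio.wait_for(reader.read(4096), IO_TIMEOUT)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except (asyncio.TimeoutError, ConnectionError):
        pass
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    try:
        svc_reader, svc_writer = await asyncio.open_connection(*SERVICE_ADDR)
    except OSError:
        client_writer.close()
        return
    # Shuttle traffic both ways; the service never knows the proxy is there.
    await asyncio.gather(
        pipe(client_reader, svc_writer),
        pipe(svc_reader, client_writer),
        return_exceptions=True,
    )

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```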

Modern container management services allow developers to retrieve near real-time, event-driven updates on the state of containers. Docker supports multiple logging drivers (the list below is current as of Docker v20.10). For more information, see Implement Logging with EFK.

  • none – No logs are available for the container, and docker logs does not return any output.

  • json-file – The logs are formatted as JSON. The default logging driver for Docker.

  • syslog – Writes logging messages to the syslog facility. The syslog daemon must be running on the host machine.

  • journald – Writes log messages to journald. The journald daemon must be running on the host machine.

  • gelf – Writes log messages to a Graylog Extended Log Format (GELF) endpoint such as Graylog or Logstash.

  • fluentd – Writes log messages to fluentd (forward input). The fluentd daemon must be running on the host machine.

  • awslogs – Writes log messages to Amazon CloudWatch Logs.

  • splunk – Writes log messages to Splunk using the HTTP Event Collector.

  • etwlogs – Writes log messages as Event Tracing for Windows (ETW) events. Only available on Windows platforms.

  • gcplogs – Writes log messages to Google Cloud Platform (GCP) Logging.

  • local – Logs are stored in a custom format designed for minimal overhead.

  • logentries – Writes log messages to Rapid7 Logentries.

Sending these logs to the appropriate destination becomes as easy as specifying the chosen driver and its options as key/value pairs. You can then define appropriate metrics and alarms in your monitoring solution. Another way to collect telemetry and troubleshooting material from containers is to link a logging container to the application container, a pattern generically referred to as a sidecar. More specifically, when the companion container works to standardize and normalize the output, the pattern is known as an adapter.
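
For example, with the Docker SDK for Python the driver and its options are literally a key/value mapping; the log group, region, and image below are placeholders for illustration:

```python
import docker
from docker.types import LogConfig

client = docker.from_env()

# Route this container's stdout/stderr to CloudWatch Logs via the awslogs
# driver; the daemon needs AWS credentials, and the log group must exist
# (or be auto-created, depending on driver options).
log_config = LogConfig(type="awslogs", config={
    "awslogs-region": "us-east-1",
    "awslogs-group": "my-service-logs",
})
client.containers.run("nginx", detach=True, log_config=log_config)
```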

Container monitoring is another approach for tracking the operation of a containerized application. Monitoring systems collect metrics to ensure that applications running in containers are performing properly, combining metric capture, analytics, transaction tracing, and visualization. Container monitoring covers basic metrics such as memory utilization, CPU usage, CPU limit, and memory limit, and it also offers the real-time streaming logs, tracing, and observability that containers need.
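
As a small illustration of metric capture, this sketch uses the Docker SDK for Python to take a point-in-time stats sample per running container and derive memory utilization; the key names follow the Linux stats payload:

```python
import docker

client = docker.from_env()

for container in client.containers.list():
    # stream=False returns a single stats snapshot instead of a live stream.
    stats = container.stats(stream=False)
    mem = stats.get("memory_stats", {})
    usage, limit = mem.get("usage", 0), mem.get("limit", 0)
    if limit:
        print(f"{container.name}: memory {usage / limit:.1%} of limit")
```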

Containers can also be leveraged to ensure that various environments are as similar as possible. Infrastructure as code turns infrastructure into templates so that a footprint can be replicated easily from one environment to another.
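
As one sketch of that idea, the AWS CDK (v2, Python) can define a load-balanced Fargate service once and instantiate the same template for staging and production; the stack names, sizes, and sample image below are placeholders, not recommendations:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2, aws_ecs as ecs, aws_ecs_patterns as ecs_patterns
from constructs import Construct

class ServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        vpc = ec2.Vpc(self, "Vpc", max_azs=2)          # spread across data centers
        cluster = ecs.Cluster(self, "Cluster", vpc=vpc)
        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "Service",
            cluster=cluster,
            desired_count=2,       # self-healing: the scheduler keeps 2 copies alive
            cpu=256,
            memory_limit_mib=512,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("amazon/amazon-ecs-sample")
            ),
        )

app = App()
# The same template stamps out near-identical environments (dev/prod parity).
ServiceStack(app, "staging-service")
ServiceStack(app, "prod-service")
app.synth()
```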