Monitoring and Logging - AWS AppSync

Monitoring and Logging

You can use Amazon CloudWatch to monitor and debug requests in AWS AppSync. Using AWS AppSync, you can use GraphQL to request data from the cloud. There are often times when you want to get more information about the execution of your API. To help debug issues related to request execution of your GraphQL API, you can enable Amazon CloudWatch Logs to monitor API calls. After you enable by CloudWatch, AWS AppSync logs API calls in CloudWatch.

Setup and Configuration

You can enable logging on a GraphQL API automatically through the AWS AppSync console.

  1. Sign in to the AWS AppSync console.

  2. Choose Settings from the navigation panel.

  3. Under Logging, click the toggle to Enable Logs.

  4. When the console prompts you, provide or create a CloudWatch ARN role.

  5. Optionally, Choose to configure the Field resolver log level from the list.

  6. Choose Save. The logging is automatically configured for your API.

Manual Role Configuration

To enable CloudWatch Logs, you must grant AWS AppSync correct permissions to write logs to CloudWatch for your account. To do this, you need to provide a service role ARN so that AWS AppSync can assume this role when writing the logs.

First, navigate to the IAM console. Then create a new policy with the name AWSAppSyncPushToCloudWatchLogsPolicy that has the following definition:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents" ], "Resource": "*" } ] }

Next, create a new role with the name AWSAppSyncPushToCloudWatchLogsRole, and attach the above policy to this role. Edit the trust relationship for this role to have the following:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "appsync.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }

Copy the role ARN and register this with AWS AppSync to enable writing into CloudWatch.

CloudWatch Metrics

You can use CloudWatch metrics to monitor and provide alerts about specific events that can be triggered by HTTP status codes or by latency, such as the overall GraphQL request and response latency. The following are the metrics that are emitted.

4XX

The number of errors captured as a result of invalid requests due to an incorrect client configuration. Typically, these errors happen anywhere outside of the GraphQL execution. For example, this error can occur when the request includes an incorrect JSON payload or an incorrect query, when the service is throttled, or when the Auth settings are misconfigured.

Unit: Count. Use the Sum statistic to get the total occurrences of these errors.

5XX

Errors encountered during the execution of a GraphQL query. For example, this could occur when a query request is initiated for an empty or incorrect schema. It can also occur when the Amazon Cognito user pool ID or AWS Region is invalid. Alternatively, this could also happen if AWS AppSync encounters an issue during an execution of a request.

Unit: Count. Use the Sum statistic to get the total occurrences of these errors.

Latency

The time between when AWS AppSync receives a request from a client and when it returns a response to the client. This doesn’t include the network latency encountered for a response to reach the end devices.

Unit: Millisecond. Use the Average statistic to evaluate expected latencies.

Real-Time Subscriptions

All metrics are emitted in one dimension: GraphQLAPIId. This means that all metrics are coupled with GraphQL API IDs. The following metrics are related to GrapQL subscriptions over pure WebSockets:

ConnectSuccess

The number of successful WebSocket connections to AWS AppSync. It is possible to have connections without subscriptions.

Unit: Count. Use the Sum statistic to get the total occurrences of the successful connections.

ConnectClientError

The number of WebSocket connections that were rejected by AWS AppSync because of client-side errors. This could imply that the service is throttled or the Authorization settings are misconfigured.

Unit: Count. Use the Sum statistic to get the total occurrences of the client-side connection errors.

ConnectServerError

The number of errors that originated from AWS AppSync while processing connections. This usually happens when an unexpected server-side issue occurs.

Unit: Count. Use the Sum statistic to get the total occurrences of the server-side connection errors.

DisconnectSuccess

The number of successful WebSocket disconnections from AWS AppSync.

Unit: Count. Use the Sum statistic to get the total occurrences of the successful disconnections.

DisconnectError

The number of errors that originated from AWS AppSync while disconnecting WebSocket connections.

Unit: Count. Use the Sum statistic to get the total occurrences of the disconnection errors.

SubscribeSuccess

The number of subscriptions that were successfully registered to AWS AppSync through WebSocket. It is possible to have connections without subscriptions, but it isn’t possible to have subscriptions without connections.

Unit: Count. Use the Sum statistic to get the total occurrences of the successful subscriptions.

SubscribeClientError

The number of subscriptions that were rejected by AWS AppSync because of client-side errors. This can occur when a JSON payload is incorrect, the service is throttled, or the Authorization settings are misconfigured.

Unit: Count. Use the Sum statistic to get the total occurrences of the client-side subscription errors.

SubscribeServerError

The number of errors that originated from AWS AppSync while processing subscriptions. This usually happens when an unexpected server-side issue occurs.

Unit: Count. Use the Sum statistic to get the total occurrences of the server-side subscription errors.

UnsubscribeSuccess

The number of unsubscriptions that were successfully processed from AWS AppSync.

Unit: Count. Use the Sum statistic to get the total occurrences of the successful unsubscriptions.

UnsubscribeClientError

The number of unsubscriptions that were rejected by AWS AppSync because of client-side errors.

Unit: Count. Use the Sum statistic to get the total occurrences of the client-side unsubscription errors.

UnsubscribeServerError

The number of errors that originated from AWS AppSync while processing unsubscriptions. This usually happens when an unexpected server-side issue occurs.

Unit: Count. Use the Sum statistic to get the total occurrences of the server-side unsubscription errors.

PublishDataMessageSuccess

The number of subscription event messages that were successfully published.

Unit: Count. Use the Sum statistic to get the total of the subscription event messages were successfully published.

PublishDataMessageClientError

The number of subscription event messages that failed to publish because of client-side errors.

Unit: Count. Use the Sum statistic to get the total occurrences of the client-side publishing subscription events errors.

PublishDataMessageServerError

The number of errors that originated from AWS AppSync while publishing subscription event messages. This usually happens when an unexpected server-side issue occurs.

Unit: Count. Use the Sum statistic to get the total occurrences of the server-side publishing subscription events errors.

PublishDataMessageSize

The size of subscription event messages published.

Unit: Bytes.

ActiveConnection

The number of concurrent WebSocket connections from clients to AWS AppSync in 1 minute.

Unit: Count. Use the Sum statistic to get the total opened connections.

ActiveSubscription

The number of concurrent subscriptions from clients in 1 minute.

Unit: Count. Use the Sum statistic to get the total active subscriptions.

ConnectionDuration

The amount of time that the connection stays open.

Unit: Milliseconds. Use the Average statistic to evaluate connection duration.

CloudWatch Logs

You can configure two types of logging on any new or existing GraphQL API: request-level and field-level.

Request-Level Logs

When enabled, the following information is logged:

  • The request and response HTTP headers

  • The GraphQL query that is being executed in the request

  • The overall execution summary

  • New and existing GraphQL subscriptions that are registered

Field-Level Logs

When enabled, the following information is logged:

  • Generated Request Mapping with source and arguments for each field

  • The transformed Response Mapping for each field, which includes the data as a result of resolving that field

  • Tracing information for each field

If you turn on logging, AWS AppSync manages the CloudWatch Logs. The process includes creating log groups and log streams, and reporting to the log streams with these logs.

When you enable logging on a GraphQL API and make requests, AWS AppSync creates a log group and log streams under the log group. The log group is named following the /aws/appsync/apis/{graphql_api_id} format. Within each log group, the logs are further divided into log streams. These are ordered by Last Event Time as logged data is reported.

Every log event is tagged with the x-amzn-RequestId of that request. This helps you filter log events in CloudWatch to get all logged information pertaining to that request. You can get the RequestId from the response headers of every GraphQL AWS AppSync request.

The field-Level logging is configured with the following log levels:

  • NONE - No field-level logs are captured.

  • ERROR - Logs the following information only for the fields that are in error:
    • The error section in the server response

    • Field-level errors

    • The generated request/response functions that got resolved for error fields

  • ALL - The following information is logged for all fields in the query:
    • Field-level tracing information

    • The generated request/response functions that got resolved for each field

Benefits of Enabling Monitoring

You can use logging and metrics to identify, troubleshoot, and optimize your GraphQL queries. For example, these will help you debug latency issues using the tracing information that is logged for each field in the query. To demonstrate this, suppose you are using one or more resolvers nested in a GraphQL query. A sample field execution in CloudWatch Logs might look similar to the following:

{ "path": [ "singlePost", "authors", 0, "name" ], "parentType": "Post", "returnType": "String!", "fieldName": "name", "startOffset": 416563350, "duration": 11247 }

This might correspond to a GraphQL schema, similar to the following:

type Post { id: ID! name: String! authors: [Author] } type Author { id: ID! name: String! } type Query { singlePost(id:ID!): Post }

In the log results above, path shows a single item in your data returned from running a query named singlePost(). In this example, it’s representing the name field at the first index (0). startOffset gives an offset from the start of the GraphQL query execution. duration is the total time to resolve the field. These values can be useful to troubleshoot why data from a particular data source might be running slower than expected, or if a specific field is slowing down the entire query. For example, you might choose to increase provisioned throughput for an Amazon DynamoDB table, or remove a specific field from a query that is causing the overall execution to perform poorly.

As of May 8, 2019, AWS AppSync generates log events as fully structured JSON. This enables you to use log analytics services such as Amazon CloudWatch Logs Insights and Amazon Elasticsearch Service to understand the performance of your GraphQL requests and usage characteristics of your schema fields. For example, you can easily identify resolvers with large latencies that may be the root cause of a performance issue. You can also identify the most and least frequently used fields in your schema and assess the impact of deprecating GraphQL fields.

Conflict Detection and Sync Logging

If an AWS AppSync API has enabled CloudWatch Logs with the logging settings set to Field-Level Logs enabled and log-level for the Field-Level Logs set to ALL, then AWS AppSync will emit Conflict Detection and Resolution information to the log group. This will provide granular insight on what decisions the AWS AppSync API decided to take when a conflict was detected. To accomplish this, the following information will be provided in the logs:

conflictType

Details whether a conflict occurred due to a version mismatch or the customer-supplied condition.

conflictHandlerConfigured

States the conflict handler configured on the resolver at the time of the request.

message

Provides information on how the conflict was detected and resolved.

syncAttempt

The number of tries the server attempted in order to synchronize the data before ultimately rejecting the request.

data

If the conflict handler configured was Automerge, this field will be populated to show what decision Automerge took for each field. Actions provided can be:

  • REJECTED - When Automerge rejects the incoming field value in favor of the value in the server.

  • ADDED - When Automerge adds on the incoming field due to no pre-existing value in the server.

  • APPENDED - When Automerge appends the incoming values to the values for the List that exists in the server.

  • MERGED - When Automerge merges the incoming values to the values for the Set that exists in the server.

Log Type Reference

RequestSummary

  • requestId: Unique identifier for the request.

  • graphQLAPIId: ID of the GraphQL API making the request.

  • statusCode: HTTP Status code response.

  • latency: End-to-end latency of the request, in nanoseconds, as an integer.

{ "logType": "RequestSummary", "requestId": "dbe87af3-c114-4b32-ae79-8af11f3f96f1", "graphQLAPIId": "pmo28inf75eepg63qxq4ekoeg4", "statusCode": 200, "latency": 242000000 }

ExecutionSummary

  • requestId: Unique identifier for the request.

  • graphQLAPIId: ID of the GraphQL API making the request.

  • startTime: The start timestamp of GraphQL execution for the request, in RFC 3339 format.

  • endTime: The end timestamp of GraphQL execution for the request, in RFC 3339 format.

  • duration: The total elapsed GraphQL execution time, in nanoseconds, as an integer.

  • version: The schema version of the ExecutionSummary.

  • parsing:
    • startOffset: The start offset for parsing, in nanoseconds, relative to execution start, as an integer.

    • duration: The time spent parsing, in nanoseconds, as an integer.

  • validation:
    • startOffset: The start offset for validation, in nanoseconds, relative to execution start, as an integer.

    • duration: The time spent performing validation, in nanoseconds, as an integer.

{ "duration": 217406145, "logType": "ExecutionSummary", "requestId": "dbe87af3-c114-4b32-ae79-8af11f3f96f1", "startTime": "2019-01-01T06:06:18.956Z", "endTime": "2019-01-01T06:06:19.174Z", "parsing": { "startOffset": 49033, "duration": 34784 }, "version": 1, "validation": { "startOffset": 129048, "duration": 69126 }, "graphQLAPIId": "pmo28inf75eepg63qxq4ekoeg4" }

Tracing

  • requestId: Unique identifier for the request.

  • graphQLAPIId: ID of the GraphQL API making the request.

  • startOffset: The start offset for field resolution, in nanoseconds, relative to execution start, as an integer.

  • duration: The time spent resolving the field, in nanoseconds, as an integer.

  • fieldName: The name of the field being resolved.

  • parentType: The parent type of the field being resolved.

  • returnType: The return type of the field being resolved.

  • path: A list of path segments, starting at the root of the response and ending with the field being resolved.

  • resolverArn: The ARN of the resolver used for field resolution. Might not be present on nested fields.

{ "duration": 216820346, "logType": "Tracing", "path": [ "putItem" ], "fieldName": "putItem", "startOffset": 178156, "resolverArn": "arn:aws:appsync:us-east-1:111111111111:apis/pmo28inf75eepg63qxq4ekoeg4/types/Mutation/fields/putItem", "requestId": "dbe87af3-c114-4b32-ae79-8af11f3f96f1", "parentType": "Mutation", "returnType": "Item", "graphQLAPIId": "pmo28inf75eepg63qxq4ekoeg4" }

Analyzing Your Logs with Amazon CloudWatch Logs Insights

The following are examples of queries you can run to get actionable insights into the performance and health of your GraphQL operations. These examples are available as sample queries in the CloudWatch Logs Insights console. In the Amazon CloudWatch console, choose Logs Insights, select the AWS AppSync log group for your GraphQL API, and then choose AWS AppSync queries under Sample queries.

The following query returns the top 10 GraphQL requests with maximum latency:

fields requestId, latency | filter logType = "RequestSummary" | limit 10 | sort latency desc

The following query returns the top 10 resolvers with maximum latency:

fields resolverArn, duration | filter logType = "Tracing" | limit 10 | sort duration desc

The following query returns the most frequently invoked resolvers:

fields ispresent(resolverArn) as isRes | stats count() as invocationCount by resolverArn | filter isRes and logType = "Tracing" | limit 10 | sort invocationCount desc

The following query returns resolvers with the most errors in mapping templates:

fields ispresent(resolverArn) as isRes | stats count() as errorCount by resolverArn, logType | filter isRes and (logType = "RequestMapping" or logType = "ResponseMapping") and fieldInError | limit 10 | sort errorCount desc

The following query returns resolver latency statistics:

fields ispresent(resolverArn) as isRes | stats min(duration), max(duration), avg(duration) as avg_dur by resolverArn | filter isRes and logType = "Tracing" | limit 10 | sort avg_dur desc

The following query returns field latency statistics:

stats min(duration), max(duration), avg(duration) as avg_dur by concat(parentType, '/', fieldName) as fieldKey | filter logType = "Tracing" | limit 10 | sort avg_dur desc

The results of CloudWatch Logs Insights queries can be exported to CloudWatch dashboards.

Analyze Your Logs with Amazon Elasticsearch Service

You can search, analyze, and visualize your AWS AppSync logs with Amazon Elasticsearch Service to identify performance bottlenecks and root cause of operational issues. You can identify resolvers with the maximum latency and errors. In addition, you can use Kibana to create dashboards with powerful visualizations. Kibana is an open source data visualization and exploration tool available in Amazon ES. Using Kibana dashboards, you can continuously monitor the performance and health of your GraphQL operations. For example, you can create dashboards that enable you to visualize the P90 latency of your GraphQL requests and drill down into the P90 latencies of each resolver.

When using Amazon ES, use “cwl*” as the filter pattern to search Elasticsearch indexes. Elasticsearch indexes the logs streamed from CloudWatch Logs with a prefix of “cwl-”. To differentiate AWS AppSync API logs from other CloudWatch logs sent to Elasticsearch, we recommend adding an additional filter expression of graphQLAPIID.keyword=<AWS AppSync's GraphQL API Id> to your search.

Log Format Migration

Log events generated by AWS AppSync on May 8, 2019 or later are formatted as fully structured JSON. If you want to analyze GraphQL requests prior to May 8, 2019, you can migrate older logs to fully structured JSON using a script available in the GitHub Sample. If you need to use the log format prior to May 8, 2019, create a support ticket with the following settings: set Type to Account Management and then set Category to General Account Question.

You can also choose to enable metric filters in CloudWatch. You can then use these metric filters to turn log data into numerical CloudWatch metrics, so you can graph or set an alarm on them.