Performance and Resource Utilization - Amazon DocumentDB

Performance and Resource Utilization

This section provides questions and solutions for common diagnostics issues in Amazon DocumentDB deployments. The examples provided use the mongo shell and are scoped to an individual instance. To find an instance endpoint, see Understanding Amazon DocumentDB Endpoints.

How Do I Find and Terminate Long Running or Blocked Queries?

User queries can run slowly because of a suboptimal query plan or can be blocked due to resource contention.

To find long running queries that slow down due to a suboptimal query plan, or queries that are blocked due to resource contention, use the currentOp command. You can filter the command to help narrow down the list of relevant queries to terminate. You must have opid associated with the long running query to be able to terminate the query.

The following query uses the currentOp command to list all queries that are either blocked or running for more than 10 seconds.

db.adminCommand({ aggregate: 1, pipeline: [ {$currentOp: {}}, {$match: {$or: [ {secs_running: {$gt: 10}}, {WaitState: {$exists: true}}]}}, {$project: {_id:0, opid: 1, secs_running: 1}}], cursor: {} });

Next, you can narrow down the query to find the opid of a query running for more than 10 seconds and terminate it.

To find and terminate a query running for more than 10 seconds

  1. Find the opid of the query.

    db.adminCommand({ aggregate: 1, pipeline: [ {$currentOp: {}}, {$match: {$or: [{secs_running: {$gt: 10}}, {WaitState: {$exists: true}}]}}], cursor: {} });

    Output from this operation looks something like the following (JSON format).

    { "waitedMS" : NumberLong(0), "cursor" : { "firstBatch" : [ { "opid" : 24646, "secs_running" : 12 } ], "id" : NumberLong(0), "ns" : "admin.$cmd" }, "ok" : 1 }
  2. Terminate the query using the killOp operation.

    db.adminCommand({killOp: 1, op: 24646});

How Can I See a Query Plan and Optimize a Query?

If a query runs slow, it could be because the query execution requires a full scan of the collection to choose the relevant documents. Sometimes creating appropriate indexes enables the query to run faster. To detect this scenario and decide the fields on which to create the indexes, use the explain command.

Note

Amazon DocumentDB emulates the MongoDB 3.6 API on a purpose-built database engine that utilizes a distributed, fault-tolerant, self-healing storage system. As a result, query plans and the output of explain() may differ between Amazon DocumentDB and MongoDB. Customers who want control over their query plan can use the $hint operator to enforce selection of a preferred index.

Run the query that you want to improve under the explain command as follows.

db.runCommand({explain: {<query document>}})

The following is an example operation.

db.runCommand({explain:{ aggregate: "sample-document", pipeline: [{$match: {x: {$eq: 1}}}], cursor: {batchSize: 1}} });

Output from this operation looks something like the following (JSON format).

{ "queryPlanner" : { "plannerVersion" : 1, "namespace" : "db.test", "winningPlan" : { "stage" : "COLLSCAN" } }, "serverInfo" : { "host" : "...", "port" : ..., "version" : "..." }, "ok" : 1 }

The preceding output indicates that the $match stage requires scanning the whole collection and checking if the field "x" in each document is equal to 1. If there are many documents in the collection, the collection scan (and therefore the overall query performance) is very slow. Thus the presence of the "COLLSCAN" in the output of the explain command indicates that the query performance can be improved by creating appropriate indexes.

In this example, the query checks whether the field "x" equals 1 in all documents. So creating an index on field "x" enables the query to avoid the complete collection scan and use the index to return the relevant documents sooner.

After creating an index on field "x", the explain output is as follows.

{ "queryPlanner" : { "plannerVersion" : 1, "namespace" : "db.test", "winningPlan" : { "stage" : "IXSCAN", "indexName" : "x_1", "direction" : "forward" } }, "serverInfo" : { "host" : "...", "port" : ..., "version" : "..." }, "ok" : 1 }

Creating an index on field "x" enables the $match stage to use an index scan to reduce the number of documents on which the predicate "x = 1" must be evaluated.

For small collections, the Amazon DocumentDB query processor can choose not to use an index if the performance gains are negligible.

How Do I List All Running Operations on an Instance?

As a user or master user, you often want to list all the current operations running on an instance for diagnostics and troubleshooting purposes. (For information about managing users, see Managing Amazon DocumentDB Users.)

With the mongo shell, you can use the following query to list all the running operations on an Amazon DocumentDB instance.

db.adminCommand({currentOp: 1, $all: 1});

The query returns the complete list of all user queries and internal system tasks currently operating on the instance.

Output from this operation looks something like the following (JSON format).

{ "inprog" : [ { "desc" : "INTERNAL" }, { "desc" : "TTLMonitor", "active" : false }, { "desc" : "GARBAGE_COLLECTION" }, { "client" : ..., "desc" : "Conn", "active" : true, "killPending" : false, "opid" : 195, "ns" : "admin.$cmd", "command" : { "currentOp" : 1, "$all" : 1 }, "op" : "command", "$db" : "admin", "secs_running" : 0, "microsecs_running" : NumberLong(68), "clientMetaData" : { "application" : { "name" : "MongoDB Shell" }, "driver" : { ... }, "os" : { ... } } }], "ok" : 1 }

The following are valid values for the "desc" field:

  • INTERNAL — Internal system tasks like the cursor cleanup or stale user cleanup tasks.

  • TTLMonitor — The Time to Live (TTL) monitor thread. Its running status is reflected in the "active" field.

  • GARBAGE_COLLECTION — The internal garbage collector thread. There can be a maximum of three garbage collector threads running concurrently in the system.

  • CONN — The user query.

The preceding output also lists all user queries running in the system. Each user query runs in the context of a database and collection, and the union of these two is called a namespace. The namespace of each user query is available in the "ns" field.

Sometimes you need to list all user queries that are running in a particular namespace. So the previous output must be filtered on the "ns" field. The following is an example query to achieve the output to filter. The query lists all user queries that are currently running in the database "db" and collection "test" (that is, the "db.test" namespace).

db.adminCommand({aggregate: 1, pipeline: [{$currentOp: {allUsers: true, idleConnections: true}}, {$match: {ns: {$eq: "db.test"}}}], cursor: {} });

As the master user of the system, you can see queries of all users and also all internal system tasks. All other users can see only their respective queries.

If the total number of queries and internal system tasks exceeds the default batch cursor size, the mongo shell automatically generates an iterator object 'it' to view the rest of the results. Keep executing the 'it' command until all results have been exhausted.

How Do I Know When a Query Is Making Progress?

User queries can run slowly due to a suboptimal query plan, or they can be blocked due to resource contention. Debugging such queries is a multi-step process that can require executing the same step multiple times.

The first step of debugging is to list all queries that are long running or blocked. The following query lists all user queries that have been running for more than 10 seconds or that are waiting for resources.

db.adminCommand({aggregate: 1, pipeline: [{$currentOp: {}}, {$match: {$or: [{secs_running: {$gt: 10}}, {WaitState: {$exists: true}}]}}, {$project: {_id:0, opid: 1, secs_running: 1, WaitState: 1, blockedOn: 1, command: 1}}], cursor: {} });

Repeat the preceding query periodically to determine whether the list of queries changes and to identify the long running or blocked queries.

If the output document for the query of interest has a WaitState field, it indicates that resource contention is why the query is running slow or is blocked. The resource contention could either be due to I/O, internal system tasks, or other user queries.

Output from this operation looks something like the following (JSON format).

{ "waitedMS" : NumberLong(0), "cursor" : { "firstBatch" : [ { "opid" : 201, "command" : { "aggregate" : ... }, "secs_running" : 208, "WaitState" : "IO" } ], "id" : NumberLong(0), "ns" : "admin.$cmd" }, "ok" : 1 }

I/O can be a bottleneck if many queries across different collections are running concurrently on the same instance, or if the instance size is too small for the dataset that the query is running on. If the queries are read-only queries, you can mitigate the former situation by separating the queries for each collection across separate replicas. For concurrent updates across different collections, or when the instance size is too small for the dataset, you can mitigate by scaling up the instance.

If the resource contention is due to other user queries, the "blockedOn" field in the output document will have the "opid" of the query that is affecting this query. Using the "opid" follows the chain of "WaitState" and "blockedOn" fields of all the queries to find the query at the head of the chain.

If the task at the head of the chain is an internal task, the only mitigation in this case would be to terminate the query and rerun it later.

The following is sample output in which the find query is blocked on a collection lock that is owned by another task.

{ "inprog" : [ { "client" : "...", "desc" : "Conn", "active" : true, "killPending" : false, "opid" : 75, "ns" : "...", "command" : { "find" : "...", "filter" : { } }, "op" : "query", "$db" : "test", "secs_running" : 9, "microsecs_running" : NumberLong(9449440), "threadId" : 24773, "clientMetaData" : { "application" : { "name" : "MongoDB Shell" }, "driver" : { ... }, "os" : { ... } }, "WaitState" : "CollectionLock", "blockedOn" : "INTERNAL" }, { "desc" : "INTERNAL" }, { "client" : "...", ... "command" : { "currentOp" : 1 }, ... } ], "ok" : 1 }

If the "WaitState" has values "Latch", "SystemLock", "BufferLock", "BackgroundActivity", or "Other", the source of resource contention is internal system tasks. If the situation continues for a long time, the only mitigation would be to terminate the query and rerun it later.

How Do I Determine Why a System Suddenly Runs Slowly?

The following are some common reasons for a system slowing down:

  • Excessive resource contention between concurrent queries

  • The number of active concurrent queries increasing over time

  • Internal system tasks such as "GARBAGE_COLLECTION"

To monitor the system usage over time, run the following "currentOp" query periodically and output the results to an external store. The query counts the number of queries and operations in each namespace in the system. You can then analyze the system usage results to understand the load on the system and take appropriate action.

db.adminCommand({aggregate: 1, pipeline: [{$currentOp: {allUsers: true, idleConnections: true}}, {$group: {_id: {desc: "$desc", ns: "$ns", WaitState: "$WaitState"}, count: {$sum: 1}}}], cursor: {} });

This query returns an aggregate of all queries running in each namespace, all the internal system tasks, and the unique number of wait states (if any) per namespace.

Output from this operation looks something like the following (JSON format).

{ "waitedMS" : NumberLong(0), "cursor" : { "firstBatch" : [ { "_id" : { "desc" : "Conn", "ns" : "db.test", "WaitState" : "CollectionLock" }, "count" : 2 }, { "_id" : { "desc" : "Conn", "ns" : "admin.$cmd" }, "count" : 1 }, { "_id" : { "desc" : "TTLMonitor" }, "count" : 1 } ], "id" : NumberLong(0), "ns" : "admin.$cmd" }, "ok" : 1 }

In the preceding output, two user queries in namespace "db.test" are blocked on collection lock: one query in the namespace "admin.$cmd", and one internal "TTLMonitor" task.

If the output indicates many queries with blocking wait states, see How Do I Find and Terminate Long Running or Blocked Queries?

How Do I Determine the Cause of High CPU Utilization on One or More Cluster Instances?

The following sections might help you identify the cause of high instance CPU utilization. Your results can vary depending on the workload.

Depending on the reason for your high instance CPU utilization, doing one or more of the following can help.

  • If the primary instance exhibits high CPU utilization, but the replica instances don't, consider distributing read traffic across replicas via client read preference settings (for example, secondaryPreferred). For more information, see Connecting to Amazon DocumentDB as a Replica Set.

    Using replicas for reads can make better use of the cluster’s resources by allowing the primary instance to process more write traffic. Reads from replicas are eventually consistent.

  • If the high CPU utilization is a result of your write workload, changing the size of the cluster’s instances to a larger instance type increases the number of CPU cores available to service the workload. For more information, see Instances and Instance Class Specifications.

  • If all cluster instances exhibit high CPU utilization, and the workload is using replicas for reads, adding more replicas to the cluster increases the resources available for read traffic. For more information, see Adding an Amazon DocumentDB Instance to a Cluster.

How Do I Determine the Open Cursors on an Instance?

When connected to a Amazon DocumentDB instance, you can use the command db.runCommand("listCursors") to list the open cursors on that instance. Note, there is a limit of 450 active cursors open at any given time on a given Amazon DocumentDB instance. It is generally advised to close cursors that are no longer in use because cursors utilize resources on an instance and have an upper limit.

db.runCommand("listCursors")

How do I Determine the Current Amazon DocumentDB Engine Version?

To determine your current Amazon DocumentDB engine version, run the following command.

db.runCommand({getEngineVersion: 1})

Output from this operation looks something like the following (JSON format).

{ "engineVersion" : "1.0.202313", "ok" : 1 }

How Do I Identify Unused Indexes?

It is a best practice to regularly identify and remove unused indexes in order to improve performance and reduce cost, as it eliminates unnecessary compute, storage, and IOs used to maintain the indexes. To identify the indexes for a given collection, run the following command:

db.collection.getIndexes()

To identify whether or not an index has been utilized, run the following command. Output from the command describes the following:

db.collection.aggregate([{$indexStats:{}}]).pretty()
  • ops —The number of operations that used the index. If your workload has been running for a sufficiently long time and you are confident that your workload is in a steady state, an ops value of zero would indicate that the index is not used at all.

  • since —The time since Amazon DocumentDB started collecting stats on index usage, which is typically the value since the last database restart or maintenance action.

To determine the overall index size for a collection, run the following command:

db.collection.stats()

To drop an unused index, run the following command:

db.collection.dropIndex("indexName")

How Do I Identify Missing Indexes?

You can use the Amazon DocumentDB profiler to log slow queries. A query that appears repeatedly in the slow query log may indicate that an additional index is required to improve that query's performance.

You can identify opportunities for helpful indexes by looking for long running queries that have one or more stages that perform at least one COLLSCAN stage, meaning that they query stage has to read every document in the collection in order to provide a response to the query.

The following example shows a query on a collection of taxi rides that ran on a large collection.

db.rides.count({"fare.totalAmount":{$gt:10.0}}))

In order to execute this example, the query had to perform a collection scan (i.e. read every single document in the collection) since there is no index on the fare.totalAmount field. Output from the Amazon DocumentDB profiler for this query looks something like the following:

{ ... "cursorExhausted": true, "nreturned": 0, "responseLength": 0, "protocol": "op_query", "millis": 300679, "planSummary": "COLLSCAN", "execStats": { "stage": "COLLSCAN", "nReturned": "0", "executionTimeMillisEstimate": "300678.042" }, "client": "172.31.5.63:53878", "appName": "MongoDB Shell", "user": "example" }

To speed up the query in this example, you want to create an index on fare.totalAmount, as shown below.

db.rides.createIndex( {"fare.totalAmount": 1}, {background: true} )
Note

Indexes created in the foreground (meaning if the {background:true} option was not supplied when creating the index) take an exclusive write lock, which prevents applications from writing data to the collection until the index build completes. Be aware of this potential impact when creating indexes on production clusters. When creating indexes, we recommend setting {background:true}.

In general, you want to create indexes on fields that have high cardinality (for example, a large number of unique values). Creating an index on a field with low cardinality can result in a large index that is not used. The Amazon DocumentDB query optimizer considers the overall size of the collection and selectivity of the indexes when creating a query plan. There are times where you will see the query processor select a COLLSCAN even when an index is present. This happens when the query processor estimates that utilizing the index will not yield a performance advantage over scanning the entire collection. If you want to force the query processor to utilize a particular index, you can use the hint() operator as shown below.

db.collection.find().hint("indexName")

Summary of Useful Queries

The following queries can be useful for monitoring performance and resource utilization in Amazon DocumentDB.

  • Use the following query to list all activity.

    db.adminCommand({currentOp: 1, $all: 1});
  • The following code lists all long running or blocked queries.

    db.adminCommand({aggregate: 1, pipeline: [{$currentOp: {}}, {$match: {$or: [{secs_running: {$gt: 10}}, {WaitState: {$exists: true}}]}}, {$project: {_id:0, opid: 1, secs_running: 1, WaitState: 1, blockedOn: 1, command: 1}}], cursor: {} });
  • The following code terminates a query.

    db.adminCommand({killOp: 1, op: <opid of running or blocked query>});
  • Use the following code to get an aggregated view of the system state.

    db.adminCommand({aggregate: 1, pipeline: [{$currentOp: {allUsers: true, idleConnections: true}}, {$group: {_id: {desc: "$desc", ns: "$ns", WaitState: "$WaitState"}, count: {$sum: 1}}}], cursor: {} });