Computer-use agents
Computer-use agents can simulate or control digital environments like browsers, terminals, file systems, and applications. These agents interpret user intent, interact with visual and text interfaces, and perform goal-directed actions by combining LLM reasoning, visual language models (VLMs), and tool servers that execute commands or simulate input events.
This pattern is important for practical AI automations, where the agent functions not just as an assistant but also as a proxy that performs actions as a human would, often by using the same tools and environments.
Architecture
A computer-use agent pattern is shown in the following diagram:

The workflow consists of the following steps:
1. Receives a query: A task or request is provided through a UI, API, or natural language interface.
2. Accesses memory: The agent retrieves short-term and long-term memory to recall past commands, goals, and system states.
3. Analyzes the visual context: A VLM observes the computer screen, system state, or UI elements to understand the current context and identify actionable items.
4. Reasons through an LLM: The LLM combines the query, memory state, and tool server responses to determine the next action.
5. Interacts with the tool server: The agent invokes tools that are hosted on a server, which may include the following:
   - Browsers (for example, headless Chrome) and shell environments
   - Text and code editors
   - Custom script interfaces
6. Updates visual inputs: If the system UI changes or further observation is needed, the VLM may reanalyze the screen state or text buffers.
7. Updates memory: New insights, system states, or user feedback are written to short-term and long-term memory.
8. Formulates final decisions and explanations: The LLM synthesizes results or recommends actions based on the query and tool output.
9. Returns a response: The agent returns results to the interface (for example, a completed task, confirmation, or generated content).
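Steps 4 through 7 typically run as a loop: the LLM plans a tool call, the tool server executes it, and the result is fed back as new context until the task is complete. The following Python sketch outlines one way to structure that loop with the Amazon Bedrock Converse API. The run_action tool, its input schema, the model ID, and the stub implementation are illustrative assumptions rather than part of this pattern.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative model ID

# Hypothetical tool exposed by the tool server: one generic UI action.
TOOL_CONFIG = {
    "tools": [{
        "toolSpec": {
            "name": "run_action",
            "description": "Run a UI action such as click, type, or open_url.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "action": {"type": "string"},
                    "target": {"type": "string"},
                    "text": {"type": "string"},
                },
                "required": ["action"],
            }},
        }
    }]
}

def run_action(action, target=None, text=None):
    """Stub for the tool server; in practice this drives a browser, shell, or editor."""
    return {"status": "ok", "action": action, "target": target}

def run_agent(task: str, max_steps: int = 10) -> str:
    """Observe-reason-act loop: the LLM plans, the tool server executes, results feed back."""
    messages = [{"role": "user", "content": [{"text": task}]}]
    for _ in range(max_steps):
        response = bedrock.converse(
            modelId=MODEL_ID, messages=messages, toolConfig=TOOL_CONFIG
        )
        output = response["output"]["message"]
        messages.append(output)
        if response["stopReason"] != "tool_use":
            # The model is done reasoning; return its final text response.
            return "".join(block.get("text", "") for block in output["content"])
        # Execute each requested tool call and return the results to the model.
        results = []
        for block in output["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                result = run_action(**tool_use["input"])
                results.append({"toolResult": {
                    "toolUseId": tool_use["toolUseId"],
                    "content": [{"json": result}],
                }})
        messages.append({"role": "user", "content": results})
    return "Stopped after reaching the step limit."
```

In a fuller implementation, the tool server would also return a screenshot that is attached to the conversation as an image block, so that the VLM analysis in step 3 can ground the next decision.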
Capabilities
- Multimodal reasoning with visual and textual inputs
- Control over applications through simulated or API-driven inputs (see the sketch after this list)
- Memory management for persistent state
- Autonomy in sequence execution (multistep flows)
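Control over applications usually means that the tool server translates high-level actions chosen by the model into real or simulated input events. The following sketch shows one possible translation layer that uses the pyautogui library; the action vocabulary and JSON shape are assumptions for illustration.

```python
# One possible tool-server translation layer: high-level JSON actions from the
# agent become simulated mouse and keyboard events. The action vocabulary
# (click, type, screenshot) is an assumption for illustration.
import pyautogui

def execute(action: dict) -> dict:
    """Translate an agent-issued action into a simulated input event."""
    kind = action.get("action")
    if kind == "click":
        pyautogui.click(action["x"], action["y"])       # simulated mouse click
    elif kind == "type":
        pyautogui.write(action["text"], interval=0.02)  # simulated keystrokes
    elif kind == "screenshot":
        pyautogui.screenshot("observation.png")         # captured for the VLM
    else:
        return {"status": "error", "reason": f"unknown action: {kind}"}
    return {"status": "ok", "action": kind}

# Example: the agent asks to type into the currently focused field.
# execute({"action": "type", "text": "quarterly report"})
```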
Common use cases
- AI developer agents that write and run code in IDEs
- Computer-use agents for repetitive digital workflows
- Simulated users for software testing and quality assurance
- Accessibility agents that navigate UIs through voice or high-level instructions
- Smart robotic process automation (RPA) enhanced with reasoning
Implementation guidance
You can build this pattern by using the following AWS services:
- Amazon Bedrock for LLM-based planning and reasoning
- Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, or Amazon SageMaker notebooks to run tool servers with simulated UI environments
- Amazon Simple Storage Service (Amazon S3) or Amazon DynamoDB for memory persistence (see the sketch after this list)
- Amazon Rekognition (or custom models) for UI image analysis in hybrid scenarios
- Amazon CloudWatch Logs or AWS X-Ray for observability and audit trails
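As one way to approach memory persistence, the following sketch writes each step of an agent's trajectory to a DynamoDB table and reads recent steps back to rebuild short-term context. The table name, key schema, and item attributes are assumptions for illustration.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

# Assumed table: "agent-memory" with partition key "session_id" (string)
# and sort key "timestamp" (number).
table = boto3.resource("dynamodb").Table("agent-memory")

def save_step(session_id: str, step: int, observation: str, action: str) -> None:
    """Persist one step of the agent's trajectory for later recall."""
    table.put_item(Item={
        "session_id": session_id,
        "timestamp": int(time.time() * 1000),
        "step": step,
        "observation": observation,
        "action": action,
    })

def load_recent(session_id: str, limit: int = 10) -> list:
    """Fetch the most recent steps to rebuild short-term memory for the LLM prompt."""
    response = table.query(
        KeyConditionExpression=Key("session_id").eq(session_id),
        ScanIndexForward=False,  # newest items first
        Limit=limit,
    )
    return response["Items"]
```

Longer-lived artifacts, such as saved files or run logs, could be persisted to Amazon S3 in a similar way.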
Summary
Computer-use agents act as autonomous digital operators, bridging the gap between human-computer interactions and AI-driven actions. By incorporating memory, tool orchestration, and VLMs, these agents can adaptively interact with systems designed for humans, execute actions, update files, navigate menus, and generate responses.