Computer-use agents
Computer-use agents can simulate or control digital environments like browsers, terminals, file systems, and applications. These agents interpret user intent, interact with visual and text interfaces, and perform goal-directed actions by combining LLM reasoning, visual language models (VLMs), and tool servers that execute commands or simulate input events.
This pattern is important for practical AI automations, where the agent functions not just as an assistant but also as a proxy that performs actions as a human would, often by using the same tools and environments.
Architecture
A computer-use agent pattern is shown in the following diagram:

The workflow consists of the following steps:
1. Receives a query: A task or request is provided through a UI, API, or natural language interface.
2. Accesses memory: The agent retrieves short-term and long-term memory to recall past commands, goals, and system states.
3. Analyzes the visual context: A VLM observes the computer screen, system state, or UI elements to understand the current context and identify actionable items.
4. Reasons through an LLM: The LLM combines the query, memory state, and tool server responses to determine the next action.
5. Interacts with the tool server: The agent invokes tools that are hosted on a server, which may include the following:
   - Browsers (for example, headless Chrome) and shell environments
   - Text and code editors
   - Custom script interfaces
6. Updates visual inputs: If the system UI changes or further observation is needed, the VLM may reanalyze the screen state or text buffers.
7. Updates memory: New insights, system states, or user feedback are written to short-term and long-term memory.
8. Formulates final decisions and explanations: The LLM synthesizes results or recommends actions based on the query and tool output.
9. Returns a response: The agent returns results to the interface (for example, a completed task, confirmation, or generated content).
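Steps 4 through 7 typically run as a loop: the LLM plans a tool call, the tool server executes it, and the result is fed back as new context until the task is complete. The following Python sketch outlines one way to structure that loop with the Amazon Bedrock Converse API. The run_action tool, its input schema, the model ID, and the stub implementation are illustrative assumptions rather than part of this pattern.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative model ID

# Hypothetical tool exposed by the tool server: one generic UI action.
TOOL_CONFIG = {
    "tools": [{
        "toolSpec": {
            "name": "run_action",
            "description": "Run a UI action such as click, type, or open_url.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "action": {"type": "string"},
                    "target": {"type": "string"},
                    "text": {"type": "string"},
                },
                "required": ["action"],
            }},
        }
    }]
}

def run_action(action, target=None, text=None):
    """Stub for the tool server; in practice this drives a browser, shell, or editor."""
    return {"status": "ok", "action": action, "target": target}

def run_agent(task: str, max_steps: int = 10) -> str:
    """Observe-reason-act loop: the LLM plans, the tool server executes, results feed back."""
    messages = [{"role": "user", "content": [{"text": task}]}]
    for _ in range(max_steps):
        response = bedrock.converse(
            modelId=MODEL_ID, messages=messages, toolConfig=TOOL_CONFIG
        )
        output = response["output"]["message"]
        messages.append(output)
        if response["stopReason"] != "tool_use":
            # The model is done reasoning; return its final text response.
            return "".join(block.get("text", "") for block in output["content"])
        # Execute each requested tool call and return the results to the model.
        results = []
        for block in output["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                result = run_action(**tool_use["input"])
                results.append({"toolResult": {
                    "toolUseId": tool_use["toolUseId"],
                    "content": [{"json": result}],
                }})
        messages.append({"role": "user", "content": results})
    return "Stopped after reaching the step limit."
```

In a fuller implementation, the tool server would also return a screenshot that is attached to the conversation as an image block, so that the VLM analysis in step 3 can ground the next decision.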
Capabilities
- Multimodal reasoning with visual and textual inputs
- Control over applications through simulated or API-driven inputs (see the sketch after this list)
- Memory management for persistent state
- Autonomy in sequence execution (multistep flows)
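Control over applications usually means that the tool server translates high-level actions chosen by the model into real or simulated input events. The following sketch shows one possible translation layer that uses the pyautogui library; the action vocabulary and JSON shape are assumptions for illustration.

```python
# One possible tool-server translation layer: high-level JSON actions from the
# agent become simulated mouse and keyboard events. The action vocabulary
# (click, type, screenshot) is an assumption for illustration.
import pyautogui

def execute(action: dict) -> dict:
    """Translate an agent-issued action into a simulated input event."""
    kind = action.get("action")
    if kind == "click":
        pyautogui.click(action["x"], action["y"])       # simulated mouse click
    elif kind == "type":
        pyautogui.write(action["text"], interval=0.02)  # simulated keystrokes
    elif kind == "screenshot":
        pyautogui.screenshot("observation.png")         # captured for the VLM
    else:
        return {"status": "error", "reason": f"unknown action: {kind}"}
    return {"status": "ok", "action": kind}

# Example: the agent asks to type into the currently focused field.
# execute({"action": "type", "text": "quarterly report"})
```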
Common use cases
- AI developer agents that write and run code in IDEs
- Computer-use agents for repetitive digital workflows
- Simulated users for software testing and quality assurance
- Accessibility agents that navigate UIs through voice or high-level instructions
- Smart robotic process automation (RPA) enhanced with reasoning
Implementation guidance
You can build this pattern by using the following AWS services:
- Amazon Bedrock for LLM-based planning and reasoning
- Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, or Amazon SageMaker notebooks to run tool servers with simulated UI environments
- Amazon Simple Storage Service (Amazon S3) or Amazon DynamoDB for memory persistence (see the sketch after this list)
- Amazon Rekognition (or custom models) for UI image analysis in hybrid scenarios
- Amazon CloudWatch Logs or AWS X-Ray for observability and audit trails
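As one way to approach memory persistence, the following sketch writes each step of an agent's trajectory to a DynamoDB table and reads recent steps back to rebuild short-term context. The table name, key schema, and item attributes are assumptions for illustration.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

# Assumed table: "agent-memory" with partition key "session_id" (string)
# and sort key "timestamp" (number).
table = boto3.resource("dynamodb").Table("agent-memory")

def save_step(session_id: str, step: int, observation: str, action: str) -> None:
    """Persist one step of the agent's trajectory for later recall."""
    table.put_item(Item={
        "session_id": session_id,
        "timestamp": int(time.time() * 1000),
        "step": step,
        "observation": observation,
        "action": action,
    })

def load_recent(session_id: str, limit: int = 10) -> list:
    """Fetch the most recent steps to rebuild short-term memory for the LLM prompt."""
    response = table.query(
        KeyConditionExpression=Key("session_id").eq(session_id),
        ScanIndexForward=False,  # newest items first
        Limit=limit,
    )
    return response["Items"]
```

Longer-lived artifacts, such as saved files or run logs, could be persisted to Amazon S3 in a similar way.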
Summary
Computer-use agents act as autonomous digital operators, bridging the gap between human-computer interactions and AI-driven actions. By incorporating memory, tool orchestration, and VLMs, these agents can adaptively interact with systems designed for humans, execute actions, update files, navigate menus, and generate responses.