What is the lakehouse architecture of Amazon SageMaker?
The lakehouse architecture of Amazon SageMaker is a unified data architecture built on AWS's cloud-native infrastructure that bridges Amazon S3 data lakes and Amazon Redshift data warehouses into a cohesive analytics platform. The architecture leverages Apache Iceberg table format for cross-service interoperability and implements a shared metadata catalog that provides consistent data access patterns across storage systems.
This integrated approach enables organizations to perform analytics, machine learning, and AI workloads on a single data foundation without data movement or duplication. The architecture integrates with AWS machine learning and analytics services, enabling data scientists, analysts, and engineers to collaborate on the same datasets using their preferred tools and interfaces.
What is a data lakehouse?
A data lakehouse is an architectural pattern that unifies the scalability and cost-effectiveness of data lakes with the performance and reliability characteristics of data warehouses. This approach eliminates the traditional trade-offs between storing diverse data types and maintaining query performance for analytical workloads.
The lakehouse architecture addresses the key limitations of isolated systems by providing the following:
- Transactional consistency – ACID compliance ensures reliable concurrent operations
- Schema management – Flexible schema evolution without breaking existing queries
- Multi-format support – Native handling of structured, semi-structured, and unstructured data
- Compute-storage separation – Independent scaling of processing and storage resources
- Open standards – Vendor-neutral formats preventing data lock-in
- Single source of truth – Eliminates data silos and redundant storage costs
- Real-time and batch processing – Supports both streaming and historical analytics
- Direct file access – Enables both SQL queries and programmatic data access
- Unified governance – Consistent security and compliance across all data types
This architecture enables organizations to support business intelligence, advanced analytics, and machine learning workloads on the same data platform, reducing complexity and operational overhead while maintaining performance requirements for each use case.
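The schema management item in the list above can be illustrated with a small sketch. It assumes the approach used by open table formats such as Apache Iceberg, where each column is tracked by a stable field ID rather than by name, so adding or renaming a column does not invalidate rows written earlier. `ToyTable` and its methods are hypothetical names made up for this illustration. They are not part of any AWS or Apache Iceberg API, and the real format stores far more metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory table; not an Iceberg or AWS API."""

    # Maps a stable field ID to the column's current name.
    columns: dict[int, str] = field(default_factory=dict)
    # Each stored row maps field ID -> value, so it never depends on names.
    rows: list[dict[int, object]] = field(default_factory=list)
    next_id: int = 1

    def add_column(self, name: str) -> int:
        if name in self.columns.values():
            raise ValueError(f"column {name!r} already exists")
        field_id = self.next_id
        self.columns[field_id] = name
        self.next_id += 1
        return field_id

    def rename_column(self, old: str, new: str) -> None:
        if new in self.columns.values():
            raise ValueError(f"column {new!r} already exists")
        for field_id, name in self.columns.items():
            if name == old:
                # Only the name changes; the field ID and stored rows do not.
                self.columns[field_id] = new
                return
        raise KeyError(old)

    def append(self, **values: object) -> None:
        ids = {name: field_id for field_id, name in self.columns.items()}
        self.rows.append({ids[name]: value for name, value in values.items()})

    def scan(self) -> list[dict[str, object]]:
        # Project every row through the current schema; columns added
        # after a row was written read back as None.
        return [
            {name: row.get(field_id) for field_id, name in self.columns.items()}
            for row in self.rows
        ]


table = ToyTable()
table.add_column("id")
table.add_column("amount")
table.append(id=1, amount=9.5)         # written under the original schema
table.add_column("currency")           # schema evolves: new column
table.rename_column("amount", "total")  # schema evolves: rename
table.append(id=2, total=3.0, currency="EUR")
print(table.scan())
```

In this sketch the first row is still readable after both schema changes: its value appears under the new name `total`, and the later-added `currency` column reads as `None` for it.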
Key Capabilities
The lakehouse architecture of Amazon SageMaker provides the following key capabilities:
- Unified data access – Query and access data across Amazon S3 data lakes, Amazon Redshift data warehouses, and other sources using Apache Iceberg-compatible tools and engines. This includes AWS services such as Amazon Athena, Amazon Redshift, Amazon EMR, and Amazon SageMaker AI, as well as third-party engines, all of which you can use to query your data in place.
- Integrated access control – Fine-grained access control to your data with permissions that you can define and consistently apply across all analytics and ML tools and engines, regardless of the underlying storage formats or query engines used.
- Open source compatibility – Leverages open-source Apache Iceberg, enabling data interoperability across various Apache Iceberg-compatible query engines and tools. This gives you the flexibility to choose your preferred tools and engines.
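The idea behind integrated access control can be sketched as a single shared catalog that every engine consults before reading data, so one grant applies uniformly regardless of which engine issues the query. The following is a conceptual model only. `ToyCatalog`, `run_query`, and the engine, principal, and table names are hypothetical, and the sketch does not reflect how the service actually implements or enforces permissions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical shared catalog; not an AWS API."""

    # Table name -> column names registered in the catalog.
    tables: dict[str, list[str]]
    # (principal, table) -> columns that principal may read.
    grants: dict[tuple[str, str], set[str]] = field(default_factory=dict)

    def grant(self, principal: str, table: str, columns: list[str]) -> None:
        unknown = set(columns) - set(self.tables[table])
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.grants.setdefault((principal, table), set()).update(columns)

    def authorize(self, principal: str, table: str, columns: list[str]) -> None:
        allowed = self.grants.get((principal, table), set())
        denied = [c for c in columns if c not in allowed]
        if denied:
            raise PermissionError(f"{principal} may not read {denied} in {table}")


def run_query(
    catalog: ToyCatalog, engine: str, principal: str, table: str, columns: list[str]
) -> str:
    # Every engine goes through the same catalog check before touching data.
    catalog.authorize(principal, table, columns)
    return f"[{engine}] SELECT {', '.join(columns)} FROM {table}"


catalog = ToyCatalog(tables={"sales": ["id", "region", "email"]})
catalog.grant("analyst", "sales", ["id", "region"])

# The same grant is honored by two different (hypothetical) engines.
print(run_query(catalog, "sql_engine", "analyst", "sales", ["id", "region"]))
print(run_query(catalog, "spark_engine", "analyst", "sales", ["region"]))

# A column outside the grant is rejected no matter which engine asks.
try:
    run_query(catalog, "spark_engine", "analyst", "sales", ["email"])
except PermissionError as err:
    print(f"denied: {err}")
```

Keeping the permission check in the catalog rather than in each engine is what makes the policy consistent: adding another engine requires no new grants.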