
Decision matrix

To decide which multi-tenant SaaS partitioning model you should use with PostgreSQL, consult the following decision matrix. The matrix analyzes these four partitioning options:

  • Silo – A separate PostgreSQL instance or cluster for each tenant.

  • Bridge with separate databases – A separate database for each tenant in a single PostgreSQL instance or cluster.

  • Bridge with separate schemas – A separate schema for each tenant in a single PostgreSQL database, in a single PostgreSQL instance or cluster.

  • Pool – Shared tables for tenants in a single instance and schema.

For each of the following criteria, the matrix compares the Silo, Bridge with separate databases, Bridge with separate schemas, and Pool models.

Use case
  • Silo – Isolation of data with full control of resource usage is a key requirement, or you have very large and very performance-sensitive tenants.
  • Bridge with separate databases – Isolation of data is a key requirement, and limited or no cross-referencing of tenants’ data is required.
  • Bridge with separate schemas – A moderate number of tenants with a moderate amount of data. This is the preferred model if you have to cross-reference tenants’ data.
  • Pool – A large number of tenants with less data per tenant.

New tenant onboarding agility
  • Silo – Very slow. (A new instance or cluster is required for each tenant.)
  • Bridge with separate databases – Moderately slow. (Requires creating a new database for each tenant to store schema objects.)
  • Bridge with separate schemas – Moderately slow. (Requires creating a new schema for each tenant to store objects; see the onboarding sketch that follows this criterion.)
  • Pool – Fastest option. (Minimal setup is required.)

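To make the onboarding differences concrete, the following sketch shows what provisioning a new tenant might look like in each model. It is a minimal illustration only; the tenant, schema, and table names (tenant_42, app.tenants, and so on) are hypothetical.

```sql
-- Silo: provisioning happens at the infrastructure level (a new DB instance
-- or cluster for the tenant), not in SQL.

-- Bridge with separate databases: one database per tenant.
CREATE DATABASE tenant_42;

-- Bridge with separate schemas: one schema per tenant in the shared database.
CREATE SCHEMA tenant_42;
CREATE TABLE tenant_42.orders (
    order_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Pool: no new database objects; a new tenant is just a row in a shared,
-- tenant-keyed table (app.tenants is a hypothetical table).
INSERT INTO app.tenants (tenant_id, tenant_name) VALUES (42, 'Example Tenant');
```
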
Database connection pool configuration effort and efficiency
  • Silo – Significant effort required. (One connection pool per tenant.) Less efficient. (No database connection sharing between tenants.)
  • Bridge with separate databases – Significant effort required. (One connection pool configuration per tenant unless you use Amazon RDS Proxy.) Less efficient. (No database connection sharing between tenants, and the total number of connections used across all tenants is limited by the DB instance class.)
  • Bridge with separate schemas – Less effort required. (One connection pool configuration for all tenants.) Moderately efficient. (Connections are reused through the SET ROLE or SET SCHEMA command, in session pool mode only; see the sketch that follows this criterion. SET commands also cause session pinning when you use Amazon RDS Proxy, but you can eliminate client-side connection pools and make a direct connection for each request for efficiency.)
  • Pool – Least effort required. Most efficient. (One connection pool for all tenants and efficient connection reuse across all tenants. Database connection limits are based on the DB instance class.)

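In the bridge with separate schemas model, a shared pool can reuse one physical connection for different tenants by switching role and schema context when the connection is checked out. The following is a minimal sketch, assuming one PostgreSQL role and one schema per tenant (tenant_42 is a hypothetical name) and a pooler running in session pool mode:

```sql
-- Run when the application checks a pooled connection out for tenant 42.
SET ROLE tenant_42;
SET search_path TO tenant_42;

-- ... run the tenant's queries on this connection ...

-- Run before returning the connection to the pool so tenant state
-- doesn't leak to the next request.
RESET ROLE;
RESET search_path;
```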

Database maintenance (vacuum management) and resource usage
  • Silo – Simpler management.
  • Bridge with separate databases – Medium complexity. (Might lead to high resource consumption, because a vacuum worker has to be started for each database after vacuum_naptime, which leads to high autovacuum launcher CPU usage. There might also be additional overhead associated with vacuuming the PostgreSQL system catalog tables for each database.)
  • Bridge with separate schemas – Large PostgreSQL system catalog tables. (The total pg_catalog size can reach tens of GBs, depending on the number of tenants and relations. Likely to require modifications to vacuuming-related parameters to control table bloat.)
  • Pool – Tables might be large, depending on the number of tenants and the data per tenant. (Likely to require modifications to vacuuming-related parameters to control table bloat; see the sketch that follows this criterion.)

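Where the matrix mentions adjusting vacuuming-related parameters, that typically means making autovacuum more aggressive on the largest tables. The following is a minimal sketch, assuming a hypothetical shared table app.orders in the pool model:

```sql
-- Vacuum and analyze the (hypothetical) shared table once about 1% of its
-- rows are dead or changed, instead of the default 20% for vacuum, so bloat
-- stays under control on very large multi-tenant tables.
ALTER TABLE app.orders SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.01
);
```
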
Extensions management effort
  • Silo – Significant effort (for each database in separate instances).
  • Bridge with separate databases – Significant effort (at each database level).
  • Bridge with separate schemas – Minimal effort (one time in the common database).
  • Pool – Minimal effort (one time in the common database).

Change deployment effort
  • Silo – Significant effort. (Connect to each separate instance and roll out changes.)
  • Bridge with separate databases – Significant effort. (Connect to each database and schema, and roll out changes.)
  • Bridge with separate schemas – Moderate effort. (Connect to the common database and roll out changes for each schema.)
  • Pool – Minimal effort. (Connect to the common database and roll out changes.)

Change deployment – scope of impact
  • Silo – Minimal. (Single tenant affected.)
  • Bridge with separate databases – Minimal. (Single tenant affected.)
  • Bridge with separate schemas – Minimal. (Single tenant affected.)
  • Pool – Very large. (All tenants affected.)

Query performance management and effort
  • Silo – Manageable query performance.
  • Bridge with separate databases – Manageable query performance.
  • Bridge with separate schemas – Manageable query performance.
  • Pool – Significant effort likely required to maintain query performance. (Over time, queries might run more slowly because of the increased size of tables. You can use table partitioning and database sharding to maintain performance, as in the partitioning sketch that follows this criterion.)

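One common way to keep pool-model tables manageable is declarative partitioning keyed by the tenant identifier, so that large or noisy tenants can be isolated in their own partitions. A minimal sketch with hypothetical table and tenant names:

```sql
-- Hypothetical shared table, partitioned by tenant identifier.
CREATE TABLE app.orders (
    tenant_id  int         NOT NULL,
    order_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, order_id)
) PARTITION BY LIST (tenant_id);

-- A large, performance-sensitive tenant gets its own partition.
CREATE TABLE app.orders_tenant_42 PARTITION OF app.orders
    FOR VALUES IN (42);

-- Remaining tenants share a default partition.
CREATE TABLE app.orders_default PARTITION OF app.orders DEFAULT;
```
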
Cross-tenant resource impact
  • Silo – No impact. (No resource sharing among tenants.)
  • Bridge with separate databases – Moderate impact. (Tenants share common resources such as instance CPU and memory.)
  • Bridge with separate schemas – Moderate impact. (Tenants share common resources such as instance CPU and memory.)
  • Pool – Heavy impact. (Tenants affect one another in terms of resources, lock conflicts, and so on.)

Tenant-level tuning (for example, creation of additional indexes per tenant or DB parameter tweaking for a particular tenant)
  • Silo – Possible.
  • Bridge with separate databases – Somewhat possible. (Schema-level changes can be made for each tenant, but database parameters are global across all tenants.)
  • Bridge with separate schemas – Somewhat possible. (Schema-level changes can be made for each tenant, but database parameters are global across all tenants; see the tuning sketch that follows this criterion.)
  • Pool – Not possible. (Tables are shared by all tenants.)

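In the bridge models, schema-level tuning for a single tenant is straightforward because each tenant owns its own objects. A minimal sketch, assuming a hypothetical tenant schema tenant_42 that contains an orders table:

```sql
-- Add an index only for tenant 42, for example to support a report that
-- only this tenant runs. Other tenants' schemas are unaffected.
CREATE INDEX CONCURRENTLY orders_created_at_idx
    ON tenant_42.orders (created_at);

-- In the pool model, an equivalent index on the shared table would apply to
-- (and add write overhead for) every tenant.
```
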
Rebalance effort for performance-sensitive tenants
  • Silo – Minimal. (There is no need to rebalance. Scale server and I/O resources to handle this scenario.)
  • Bridge with separate databases – Moderate. (Use logical replication or pg_dump to export the database, but downtime might be lengthy depending on the data size. You can use the transportable database feature in Amazon RDS for PostgreSQL to copy databases between instances faster.)
  • Bridge with separate schemas – Moderate, but likely involves lengthy downtime. (Use logical replication or pg_dump to export the schema, but downtime might be lengthy depending on the data size; see the logical replication sketch that follows this criterion.)
  • Pool – Significant, because all tenants share the same tables. (Sharding the database requires copying everything to another instance and an additional step to clean up tenant data. It most likely also requires a change in application logic.)

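As one example of the logical-replication approach for moving a single tenant's schema to another instance, the following sketch assumes PostgreSQL 15 or later (for schema-level publications) and hypothetical names (tenant_42, appdb, and the source host); the connection details are placeholders.

```sql
-- On the source instance: publish every table in the tenant's schema.
-- (FOR TABLES IN SCHEMA requires PostgreSQL 15 or later.)
CREATE PUBLICATION tenant_42_pub FOR TABLES IN SCHEMA tenant_42;

-- On the target instance: the schema and table definitions must already
-- exist there; the subscription then copies the data and streams changes.
CREATE SUBSCRIPTION tenant_42_sub
    CONNECTION 'host=source.example.internal dbname=appdb user=replicator'
    PUBLICATION tenant_42_pub;
```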

Database downtime for major version upgrades
  • Silo – Standard downtime. (Depends on the PostgreSQL system catalog size.)
  • Bridge with separate databases – Longer downtime likely. (The time varies depending on the system catalog size. PostgreSQL system catalog tables are also duplicated across databases.)
  • Bridge with separate schemas – Longer downtime likely. (The time varies depending on the PostgreSQL system catalog size.)
  • Pool – Standard downtime. (Depends on the PostgreSQL system catalog size.)

Administration overhead (for example, for database log analysis or backup job monitoring)
  • Silo – Significant effort.
  • Bridge with separate databases – Minimal effort.
  • Bridge with separate schemas – Minimal effort.
  • Pool – Minimal effort.

Tenant-level availability
  • Silo – Highest. (Each tenant fails and recovers independently.)
  • Bridge with separate databases – Higher scope of impact. (All tenants fail and recover together in case of hardware or resource issues.)
  • Bridge with separate schemas – Higher scope of impact. (All tenants fail and recover together in case of hardware or resource issues.)
  • Pool – Higher scope of impact. (All tenants fail and recover together in case of hardware or resource issues.)

Tenant-level backup and recovery effort
  • Silo – Least effort. (Each tenant can be backed up and restored independently.)
  • Bridge with separate databases – Moderate effort. (Use logical export and import for each tenant. Some coding and automation are required.)
  • Bridge with separate schemas – Moderate effort. (Use logical export and import for each tenant. Some coding and automation are required.)
  • Pool – Significant effort. (All tenants share the same tables.)

Tenant-level point-in-time recovery effort
  • Silo – Minimal effort. (Use point-in-time recovery with snapshots, or use backtracking in Amazon Aurora.)
  • Bridge with separate databases – Moderate effort. (Use a snapshot restore followed by export/import. However, this is a slow operation.)
  • Bridge with separate schemas – Moderate effort. (Use a snapshot restore followed by export/import. However, this is a slow operation.)
  • Pool – Significant effort and complexity.

Uniform schema name
  • Silo – Same schema name for each tenant.
  • Bridge with separate databases – Same schema name for each tenant.
  • Bridge with separate schemas – Different schema name for each tenant.
  • Pool – Common schema.

Per-tenant customization (for example, additional table columns for a specific tenant)
  • Silo – Possible.
  • Bridge with separate databases – Possible.
  • Bridge with separate schemas – Possible.
  • Pool – Complicated (because all tenants share the same tables).

Catalog management efficiency at the object-relational mapping (ORM) layer (for example, Ruby)
  • Silo – Efficient (because the client connection is specific to a tenant).
  • Bridge with separate databases – Efficient (because the client connection is specific to a database).
  • Bridge with separate schemas – Moderately efficient. (Depending on the ORM used, the user/role security model, and the search_path configuration, the client sometimes caches the metadata for all tenants, leading to high DB connection memory usage.)
  • Pool – Efficient (because all tenants share the same tables).

Consolidated tenant reporting effort
  • Silo – Significant effort. (You have to use foreign data wrappers [FDWs] to consolidate data across all tenants, or extract, transform, and load [ETL] it to another reporting database.)
  • Bridge with separate databases – Significant effort. (You have to use FDWs to consolidate data across all tenants, or ETL it to another reporting database.)
  • Bridge with separate schemas – Moderate effort. (You can aggregate data across all schemas by using unions, as in the sketch that follows this criterion.)
  • Pool – Minimal effort. (All tenant data is in the same tables, so reporting is simple.)

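For the bridge with separate schemas model, a consolidated report can be built by unioning the per-schema tables into one result set, often wrapped in a view. A minimal sketch with hypothetical schema, table, and view names:

```sql
-- Consolidated view over hypothetical per-tenant schemas.
-- (Assumes a reporting schema already exists for the view.)
CREATE VIEW reporting.all_orders AS
SELECT 'tenant_1' AS tenant, order_id, created_at FROM tenant_1.orders
UNION ALL
SELECT 'tenant_2' AS tenant, order_id, created_at FROM tenant_2.orders
UNION ALL
SELECT 'tenant_3' AS tenant, order_id, created_at FROM tenant_3.orders;

-- Example consolidated report across tenants.
SELECT tenant, count(*) AS orders_last_7_days
FROM reporting.all_orders
WHERE created_at >= now() - interval '7 days'
GROUP BY tenant;
```
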
Tenant-specific read-only instance for reporting (for example, based on subscription)
  • Silo – Least effort. (Create a read replica.)
  • Bridge with separate databases – Moderate effort. (You can use logical replication or AWS Database Migration Service [AWS DMS] to configure this.)
  • Bridge with separate schemas – Moderate effort. (You can use logical replication or AWS DMS to configure this.)
  • Pool – Complicated (because all tenants share the same tables).

Data isolation
  • Silo – Best.
  • Bridge with separate databases – Better. (You can manage database-level permissions by using PostgreSQL roles.)
  • Bridge with separate schemas – Better. (You can manage schema-level permissions by using PostgreSQL roles.)
  • Pool – Worse. (Because all tenants share the same tables, you have to implement features such as row-level security [RLS] for tenant isolation; see the RLS sketch that follows this criterion.)

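The following sketch shows what pool-model tenant isolation with RLS might look like. It assumes a hypothetical shared table app.orders with a tenant_id column and an application that sets a session variable (here app.current_tenant) for each request:

```sql
-- Enable and enforce RLS on the hypothetical shared table.
ALTER TABLE app.orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE app.orders FORCE ROW LEVEL SECURITY;

-- Each session can only see and modify rows that belong to its own tenant.
CREATE POLICY tenant_isolation ON app.orders
    USING (tenant_id = current_setting('app.current_tenant')::int)
    WITH CHECK (tenant_id = current_setting('app.current_tenant')::int);

-- The application sets the tenant context when it picks up a connection.
SET app.current_tenant = '42';
SELECT count(*) FROM app.orders;   -- sees only tenant 42's rows
```
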
Tenant-specific storage encryption key
  • Silo – Possible. (Each PostgreSQL cluster can have its own AWS Key Management Service [AWS KMS] key for storage encryption.)
  • Bridge with separate databases – Not possible. (All tenants share the same KMS key for storage encryption.)
  • Bridge with separate schemas – Not possible. (All tenants share the same KMS key for storage encryption.)
  • Pool – Not possible. (All tenants share the same KMS key for storage encryption.)

Using AWS Identity and Access Management (IAM) for database authentication for each tenant
  • Silo – Possible.
  • Bridge with separate databases – Possible.
  • Bridge with separate schemas – Possible (by having a separate PostgreSQL user for each schema; see the sketch that follows this criterion).
  • Pool – Not possible (because tables are shared by all tenants).

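For per-tenant IAM authentication in the bridge with separate schemas model, each tenant can get its own PostgreSQL user that is granted the rds_iam role (used by Amazon RDS for PostgreSQL and Aurora PostgreSQL for IAM authentication) and privileges only on its own schema. A minimal sketch with hypothetical names:

```sql
-- One database user per tenant, authenticated through IAM instead of a password.
CREATE USER tenant_42_app WITH LOGIN;
GRANT rds_iam TO tenant_42_app;

-- Restrict the user to its own tenant schema.
GRANT USAGE ON SCHEMA tenant_42 TO tenant_42_app;
GRANT SELECT, INSERT, UPDATE, DELETE
    ON ALL TABLES IN SCHEMA tenant_42 TO tenant_42_app;
```
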
Infrastructure cost
  • Silo – Highest (because nothing is shared).
  • Bridge with separate databases – Moderate.
  • Bridge with separate schemas – Moderate.
  • Pool – Lowest.

Data duplication and storage usage
  • Silo – Highest aggregate across all tenants. (PostgreSQL system catalog tables and the application’s static and common data are duplicated across all tenants.)
  • Bridge with separate databases – Highest aggregate across all tenants. (PostgreSQL system catalog tables and the application’s static and common data are duplicated across all tenants.)
  • Bridge with separate schemas – Moderate. (The application’s static and common data can be kept in a common schema and accessed by other tenants.)
  • Pool – Minimal. (No duplication of data. The application’s static and common data can be in the same schema.)

Tenant-centric monitoring (quickly find out which tenant is causing issues)
  • Silo – Least effort. (Because each tenant is monitored separately, it’s easy to check the activity of a specific tenant.)
  • Bridge with separate databases – Moderate effort. (Because all tenants share the same physical resources, you have to apply additional filtering to check the activity of a specific tenant; see the monitoring sketch that follows this criterion.)
  • Bridge with separate schemas – Moderate effort. (Because all tenants share the same physical resources, you have to apply additional filtering to check the activity of a specific tenant.)
  • Pool – Significant effort. (Because all tenants share all resources, including tables, you have to use bind variable capture to determine which tenant a specific SQL query belongs to.)

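In the bridge models, the additional filtering mentioned above typically means grouping session statistics by tenant database or tenant role. A minimal sketch using the standard pg_stat_activity view; the tenant naming convention (tenant_*) is an assumption:

```sql
-- Bridge with separate databases: session activity per tenant database.
SELECT datname, state, count(*) AS sessions
FROM pg_stat_activity
GROUP BY datname, state
ORDER BY sessions DESC;

-- Bridge with separate schemas: session activity per tenant role,
-- assuming one PostgreSQL role per tenant.
SELECT usename, state, count(*) AS sessions
FROM pg_stat_activity
WHERE usename LIKE 'tenant_%'
GROUP BY usename, state
ORDER BY sessions DESC;
```
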
Centralized management and health/activity monitoring
  • Silo – Significant effort (to set up central monitoring and a central command center).
  • Bridge with separate databases – Moderate effort (because all tenants share the same instance).
  • Bridge with separate schemas – Moderate effort (because all tenants share the same instance).
  • Pool – Minimal effort (because all tenants share the same resources, including the schema).

Chances of object identifier (OID) and transaction ID (XID) wraparound
  • Silo – Minimal.
  • Bridge with separate databases – High. (OIDs and XIDs are single, cluster-wide counters in PostgreSQL, and it can be difficult to vacuum effectively across physical databases.)
  • Bridge with separate schemas – Moderate. (OIDs and XIDs are single, cluster-wide counters in PostgreSQL.)
  • Pool – High. (For example, a single table can reach the TOAST OID limit of 4 billion, depending on the number of out-of-line columns. See the wraparound monitoring sketch that follows.)

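Whichever model you run on a shared cluster, it helps to watch how close each database is to transaction ID wraparound. A minimal sketch using the standard pg_database catalog:

```sql
-- Age of the oldest unfrozen transaction ID per database. Values approaching
-- autovacuum_freeze_max_age (200 million by default) mean aggressive
-- anti-wraparound vacuums are imminent.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```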