Multi-Tenant Isolation on AWS
Problem
A single noisy customer could degrade the platform for everyone. We needed isolation that didn't explode costs, plus a story for running the biggest customers on dedicated infrastructure.
Constraints
- Single logical product; no code forks per tenant
- Must support both pooled and dedicated deployment modes
- No customer-visible downtime during isolation changes
Architecture
We adopted a cell-based topology: a pool of small, identical cells, each owning a subset of tenants. A routing layer maps each tenant to its cell. Within a cell, Postgres row-level security (RLS) enforces tenant boundaries, and the RLS policies are tested in CI. Large customers can be migrated to a dedicated cell through an online tenant move: dual-writes → backfill → cutover. Observability is tenant-aware end to end.
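To make the routing layer and the tenant move concrete, here is a minimal in-memory sketch in Python. It is an illustration under stated assumptions, not the production implementation: the names TenantRouter, Route, MovePhase, and the cell identifiers are all hypothetical, and a real router would persist its state and coordinate with the data layer. The sketch shows only the ordering guarantee: reads stay on the source cell, writes fan out to both cells during the move, and cutover is refused until backfill has completed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class MovePhase(Enum):
    STEADY = "steady"          # tenant is served by exactly one cell
    DUAL_WRITE = "dual_write"  # writes go to source and target; backfill may run
    BACKFILLED = "backfilled"  # historical data copied; safe to cut over


@dataclass
class Route:
    cell: str                      # cell that currently serves reads
    target: Optional[str] = None   # destination cell during a move
    phase: MovePhase = MovePhase.STEADY


class TenantRouter:
    """Hypothetical tenant-to-cell routing table with an online move."""

    def __init__(self, assignments: Dict[str, str]) -> None:
        self._routes = {t: Route(cell=c) for t, c in assignments.items()}

    def read_cell(self, tenant_id: str) -> str:
        # Reads always hit the source cell until cutover flips it.
        return self._routes[tenant_id].cell

    def write_cells(self, tenant_id: str) -> List[str]:
        route = self._routes[tenant_id]
        if route.phase is MovePhase.STEADY:
            return [route.cell]
        # During a move every write is applied to both cells.
        return [route.cell, route.target]

    def start_dual_writes(self, tenant_id: str, target: str) -> None:
        route = self._routes[tenant_id]
        if route.phase is not MovePhase.STEADY:
            raise ValueError("a move is already in progress")
        route.target = target
        route.phase = MovePhase.DUAL_WRITE

    def mark_backfill_done(self, tenant_id: str) -> None:
        route = self._routes[tenant_id]
        if route.phase is not MovePhase.DUAL_WRITE:
            raise ValueError("backfill requires dual-writes to be active")
        route.phase = MovePhase.BACKFILLED

    def cutover(self, tenant_id: str) -> None:
        route = self._routes[tenant_id]
        if route.phase is not MovePhase.BACKFILLED:
            raise ValueError("cannot cut over before backfill completes")
        # Flip reads and writes to the target cell in one step.
        route.cell, route.target = route.target, None
        route.phase = MovePhase.STEADY


if __name__ == "__main__":
    router = TenantRouter({"acme": "cell-1", "globex": "cell-1"})
    router.start_dual_writes("acme", "cell-dedicated-acme")
    print(router.write_cells("acme"))
    router.mark_backfill_done("acme")
    router.cutover("acme")
    print(router.read_cell("acme"))
```

Starting dual-writes before the backfill means no write issued during the copy is lost, which is what lets the final cutover be a routing flip rather than a maintenance window.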
Trade-offs
Cell-based topology
- Why: Contains blast radius; enables tenant moves.
- Cost: More moving parts; routing and deploys must be managed per cell.
Postgres row-level security
- Why: Better for the 99% case; simpler migrations.
- Cost: Policies must be tested rigorously; requires strict auditing.
Outcome
- Noisy-neighbor incidents reduced by ~85%
- First dedicated-cell migration done with zero downtime
- Tenant-aware SLOs gave the team a clear on-call story