AWS Well-Architected Review
Spotlight has been reviewed against the six pillars of the AWS Well-Architected Framework. This page documents the assessment, current alignment, identified gaps, and recommended improvements.
Review Date
This review covers the architecture as of the Phase 7 implementation. It should be revisited quarterly or after major architectural changes.
Summary Scorecard
| Pillar | Score | Status | Key Strength | Key Gap |
|---|---|---|---|---|
| Operational Excellence | 7.5/10 | Partial | Full IaC, observability stack | No alerting rules or gradual (canary) rollouts |
| Security | 8/10 | Good | Layered auth, OWASP headers, tenant isolation | Per-tenant rate limiting absent |
| Reliability | 7.5/10 | Partial | Transactional outbox, circuit breaker | Lambda DLQ configuration missing |
| Performance Efficiency | 7.5/10 | Partial | Multi-layer caching, on-demand DynamoDB | Cold start monitoring absent |
| Cost Optimisation | 8/10 | Good | Serverless, TTL cleanup, no idle resources | CloudWatch log retention not tuned per environment |
| Sustainability | 6.5/10 | Partial | Serverless scales to zero | No explicit green metrics |
Pillar 1: Operational Excellence
Current Alignment
Infrastructure as Code -- All infrastructure is provisioned via Terraform modules. The infrastructure/terraform/ directory contains reusable modules for DynamoDB, Lambda, API Gateway, EventBridge, CDN, and monitoring. Environment-specific configs exist for local (LocalStack), dev, staging, and prod.
Observability -- The platform ships with a complete observability stack:
- Traces: OpenTelemetry auto-instrumentation for FastAPI and botocore, exported to Tempo
- Metrics: 13+ custom counters and histograms covering cache, auth, SDK, tours, content, outbox
- Logs: Structured JSON logging via structlog with OTEL trace correlation, exported to Loki
- Dashboards: Four Grafana dashboards (Product Overview, Tour Deep Dive, System Health, Admin Activity)
Local Development -- Docker Compose stack includes LocalStack, DynamoDB Admin, Grafana, Prometheus, Loki, Tempo, and OTEL Collector. Makefile provides 20+ targets for setup, testing, building, and deployment.
Gaps
- No alerting rules: Grafana dashboards exist but no alert thresholds for auth failures, cache miss rates, or outbox backlog
- No canary / blue-green deployments: CI pipelines exist (`.github/workflows/ci.yml` plus `deploy-api-{dev,prod}.yml`, `deploy-dashboard.yml`, `deploy-marketing.yml`, `deploy-sdk.yml`, `deploy-docs.yml`) but each runs an all-or-nothing publish; no traffic shifting via Lambda aliases or weighted CF Pages deployments
- No operational runbooks: Incident response procedures not documented for common failures
- No automated backup validation: DynamoDB PITR enabled but no restore testing schedule
Recommendations
- Create GitHub Actions workflows for test, lint, security scan, and deploy
- Define CloudWatch alarms for: auth failure rate > 5%, outbox pending > 100, Lambda error rate > 1%
- Implement Lambda alias traffic shifting for gradual rollouts
- Write runbooks for: circuit breaker activation, outbox backlog, DynamoDB throttling, JWKS endpoint failure
- Schedule quarterly backup restore tests
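As a rough illustration, the alarm thresholds recommended above could be kept as data and checked against observed values before being translated into CloudWatch alarms. This is a minimal sketch: the metric names (`auth_failure_rate`, `outbox_pending`, `lambda_error_rate`) are hypothetical and do not come from the Spotlight codebase.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    name: str     # hypothetical metric name
    limit: float  # alarm fires when the observed value is strictly greater


# Limits taken from the recommendation above: 5%, 100 pending, 1%.
THRESHOLDS = (
    Threshold("auth_failure_rate", 0.05),
    Threshold("outbox_pending", 100),
    Threshold("lambda_error_rate", 0.01),
)


def breached(observed: Dict[str, float]) -> List[str]:
    """Return the names of all thresholds exceeded by the observed values."""
    return [t.name for t in THRESHOLDS if observed.get(t.name, 0.0) > t.limit]
```

Keeping the limits in one structure makes it easier to generate the alarms and the matching Grafana alert rules from a single source.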
Pillar 2: Security
Current Alignment
Authentication -- Layered auth with API key (SHA-256 hashed, constant-time comparison) plus optional JWT validated via tenant-configured JWKS endpoint. Admin detection uses a pluggable strategy pattern (table lookup, JWT permissions, JWT roles, composite).
Authorisation -- Tenant isolation enforced at the DynamoDB partition key level. All 17 tables use tenant_id as PK or PK prefix. Admin endpoints gated behind role verification.
Network Security -- OWASP-recommended security headers on all responses (CSP, HSTS, X-Frame-Options, X-Content-Type-Options). Per-tenant origin lockdown with wildcard subdomain support.
Data Protection -- DynamoDB server-side encryption enabled on all tables. HTML sanitisation in the SDK prevents XSS. No PII logged (sanitised structured logging).
Infrastructure -- Least-privilege IAM roles per Lambda function. S3 buckets block public access. CloudFront uses OAI for S3 origin access.
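The API key check described under Authentication (SHA-256 hash, constant-time comparison) might look roughly like the sketch below. The function names are illustrative assumptions, not the project's actual API.

```python
import hashlib
import hmac


def hash_api_key(raw_key: str) -> str:
    """Hash an API key so only the digest needs to be stored."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Compare digests in constant time to avoid leaking match position via timing."""
    presented_hash = hash_api_key(presented_key)
    return hmac.compare_digest(presented_hash, stored_hash)
```

`hmac.compare_digest` is the standard-library primitive for this; a plain `==` on strings can return early on the first mismatching character.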
Gaps
- No per-tenant rate limiting: API Gateway throttling is global (1000 req/s in prod, 100/s in dev); a single tenant can exhaust shared capacity
- API key cache window: Rotated keys remain valid in cache for up to 10 minutes
- Default CORS too permissive: Defaults to `["*"]`; production should require explicit allowlist
- No WAF rules: CloudFront lacks WAF for common attack patterns
- No CloudFront access logging: CDN requests not logged for security audit
- JWKS fetch timeout: 10-second timeout could be exploited in slow-rate attacks
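For the CORS gap, an explicit allowlist with wildcard subdomain support (as mentioned under Network Security) could be matched along these lines. This is a sketch under assumptions: the pattern format `https://*.example.com` and the function names are hypothetical, and a bare `*` is deliberately never accepted.

```python
from typing import Iterable, Optional, Tuple


def _split(origin: str) -> Optional[Tuple[str, str]]:
    """Split 'scheme://host[:port]' into (scheme, host); None if malformed."""
    scheme, sep, host = origin.strip().lower().partition("://")
    if not sep or not scheme or not host or "/" in host:
        return None
    return scheme, host


def origin_allowed(origin: str, allowed_patterns: Iterable[str]) -> bool:
    """Check an Origin header against exact or '*.domain' allowlist entries."""
    parsed = _split(origin)
    if parsed is None:
        return False
    scheme, host = parsed
    for pattern in allowed_patterns:
        p = _split(pattern)
        if p is None or p[0] != scheme:
            continue  # malformed entries (including a bare '*') never match
        p_host = p[1]
        if p_host.startswith("*."):
            suffix = p_host[1:]  # '.example.com'
            # Require at least one label before the suffix; the apex does not match.
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == p_host:
            return True
    return False
```

Matching on the dot-prefixed suffix is what stops `evilexample.com` from passing a `*.example.com` entry.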
Recommendations
- Implement per-tenant rate limiting using API Gateway usage plans or a DynamoDB-backed token bucket
- Reduce API key cache TTL to 60 seconds after rotation events
- Enforce explicit CORS allowlist in production (block `*` default)
- Attach AWS WAF to CloudFront with OWASP rule set
- Enable CloudFront access logs to S3 with lifecycle policy
- Reduce JWKS timeout to 5 seconds with retry backoff
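The token bucket option for per-tenant rate limiting can be sketched in memory as follows. This only shows the algorithm: a DynamoDB-backed version would store `tokens` and the last-refill timestamp per tenant and update them with a conditional write. The class name and parameters are assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class TenantRateLimiter:
    """In-memory token bucket keyed by tenant_id (illustrative only)."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        # tenant_id -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (self._capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._refill)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

Because each tenant has its own bucket, one tenant exhausting its tokens does not affect the others, which is the isolation the global API Gateway throttle lacks.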
Pillar 3: Reliability
Current Alignment
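Event Delivery -- Transactional outbox pattern provides at-least-once delivery. Atomic writes pair business data with outbox entries via DynamoDB transact_write_items, so an event is never recorded without its business write (or the reverse). Idempotent processing via status checks guards against duplicate handling.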
Failure Handling -- Dead-letter pattern with 30-day retention for events that exceed retry limits. Circuit breaker per tenant allows immediate kill-switch. SDK gracefully skips missing target elements without crashing.
Data Durability -- DynamoDB Point-in-Time Recovery enabled. EventBridge archive retains events for 30 days for replay.
Degradation -- Anonymous SDK access works if JWT is absent. Tour steps with missing targets are skipped. SDK continues operating if API is temporarily unreachable (cached config).
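To make the Event Delivery description concrete, the sketch below builds the `TransactItems` payload that pairs a business write with its outbox entry. The table names reuse `Events.Interactions` and `Events.Outbox` from this page, but the attribute names, key schema, and `PENDING` status value are assumptions for illustration.

```python
import json
import uuid
from typing import Any, Dict, List, Optional


def build_outbox_transaction(tenant_id: str, event_type: str,
                             payload: Dict[str, Any],
                             event_id: Optional[str] = None) -> List[Dict[str, Any]]:
    """Return TransactItems that write business data and an outbox row atomically."""
    event_id = event_id or str(uuid.uuid4())
    base_item = {
        "tenant_id": {"S": tenant_id},  # partition key (assumed schema)
        "event_id": {"S": event_id},    # sort key (assumed schema)
        "event_type": {"S": event_type},
        "payload": {"S": json.dumps(payload, sort_keys=True)},
    }
    return [
        {"Put": {"TableName": "Events.Interactions", "Item": dict(base_item)}},
        {"Put": {
            "TableName": "Events.Outbox",
            "Item": {**base_item, "status": {"S": "PENDING"}},
            # Refuse to overwrite an existing outbox row for the same event.
            "ConditionExpression": "attribute_not_exists(event_id)",
        }},
    ]
```

With boto3, the result would be passed as `client.transact_write_items(TransactItems=items)`; either both puts succeed or neither does.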
Gaps
- No Lambda DLQ: If Lambda execution fails on DynamoDB Stream events, no explicit dead-letter queue captures failures
- Single Lambda function: All API traffic routes through one function; no separation by tenant tier or endpoint type
- No proactive cache cleanup: CacheManager caps the cache at 10,000 entries, but expired entries are not evicted proactively and can occupy that capacity
- JWKS failure cascade: If a tenant's JWKS endpoint goes down, all requests for that tenant fail with no fallback to previously cached keys
- No event schema validation: Malformed events in the pipeline could crash processors
Recommendations
- Configure Lambda DLQ (SQS) for stream and event processor functions
- Split API Lambda into SDK and admin functions with separate concurrency profiles
- Add periodic cache cleanup (every 5 minutes, evict expired entries)
- Implement JWKS cache fallback: serve stale keys on fetch failure with a warning metric
- Validate event schemas in processors; route invalid events to DLQ
- Add DynamoDB access verification to `/health` endpoint
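The JWKS fallback recommendation (serve stale keys on fetch failure, with a warning metric) could be sketched as below. The `fetch` callable, class name, and `stale_served` counter are hypothetical stand-ins for the real HTTP client and OTEL metric.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleFallbackCache:
    """Cache that serves the last good value when a refresh fails."""

    def __init__(self, fetch: Callable[[str], Dict[str, Any]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[Dict[str, Any], float]] = {}
        self.stale_served = 0  # stand-in for a warning metric

    def get(self, url: str) -> Dict[str, Any]:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh hit
        try:
            keys = self._fetch(url)
        except Exception:
            if entry is None:
                raise  # nothing cached yet, so the failure must surface
            self.stale_served += 1
            return entry[0]  # expired but last known good
        self._entries[url] = (keys, now)
        return keys
```

A production version would likely also bound how long stale keys may be served, so that a revoked signing key cannot remain trusted indefinitely.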
Pillar 4: Performance Efficiency
Current Alignment
Compute -- Serverless Lambda with configurable memory (256MB default). FastAPI with async/await throughout. No idle compute costs.
Data -- DynamoDB on-demand billing scales automatically. Sparse GSI for outbox pending events minimises query costs. Two-tier analytics rollup (hourly + daily) uses atomic ADD operations.
Caching -- Multi-layer in-memory caching: API keys (10min), JWKS (1hr), tenant config (30s), themes (5min), published content (1min). All layers emit OTEL hit/miss metrics.
Content Delivery -- CloudFront CDN for SDK assets with versioned paths and long-lived cache headers. CORS preflight caching (max_age 3600s) reduces OPTIONS requests.
SDK -- Shadow DOM isolation. Target bundle size < 20KB gzipped. Dynamic imports for admin code (only loaded for admins).
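The caching layers listed above can be pictured as one TTL cache with a per-layer expiry and hit/miss counters. This is a simplified sketch, not the actual CacheManager: the layer names are assumptions, and plain dictionaries stand in for the OTEL metrics.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple

# TTLs in seconds, as listed under Caching; layer names are hypothetical.
LAYER_TTLS: Dict[str, float] = {
    "api_keys": 600, "jwks": 3600, "tenant_config": 30,
    "themes": 300, "published_content": 60,
}


class LayeredTTLCache:
    def __init__(self, ttls: Dict[str, float] = LAYER_TTLS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttls = dict(ttls)
        self._clock = clock
        # (layer, key) -> (value, expires_at)
        self._entries: Dict[Tuple[str, Hashable], Tuple[Any, float]] = {}
        self.hits = {layer: 0 for layer in self._ttls}
        self.misses = {layer: 0 for layer in self._ttls}

    def get(self, layer: str, key: Hashable) -> Optional[Any]:
        entry = self._entries.get((layer, key))
        if entry is not None and entry[1] > self._clock():
            self.hits[layer] += 1
            return entry[0]
        self._entries.pop((layer, key), None)  # drop the expired entry, if any
        self.misses[layer] += 1
        return None

    def put(self, layer: str, key: Hashable, value: Any) -> None:
        self._entries[(layer, key)] = (value, self._clock() + self._ttls[layer])

    def purge_expired(self) -> int:
        """Evict all expired entries; returns how many were removed."""
        now = self._clock()
        expired = [k for k, (_, exp) in self._entries.items() if exp <= now]
        for k in expired:
            del self._entries[k]
        return len(expired)
```

Because the cache lives in process memory, each Lambda execution environment has its own copy, which matches the trade-off noted for in-memory caching over Redis.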
Gaps
- No cold start monitoring: Lambda init time not instrumented; latency spikes undetectable
- No DynamoDB query batching: Related entities fetched with individual `get_item` calls
- API Gateway caching not configured: Repeated identical requests hit Lambda every time
- Connection pooling not explicit: HTTP client lifecycle for JWKS fetches unclear
- No pagination defaults: List operations could return unbounded result sets
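One low-cost way to close the cold start gap is a module-level flag: Lambda keeps module state between invocations in a warm execution environment, so only the first invocation sees the flag set. The sketch below assumes a hypothetical `emit_metric` helper that would wrap the real OTEL or X-Ray instrumentation.

```python
import time
from typing import Any, Dict, List

# Module-level state survives across invocations in a warm environment.
_STATE = {"cold_start": True}

# Stand-in sink for the real metrics pipeline (hypothetical).
EMITTED: List[Dict[str, Any]] = []


def emit_metric(name: str, value: float, **attrs: Any) -> None:
    EMITTED.append({"name": name, "value": value, **attrs})


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    started = time.perf_counter()
    was_cold = _STATE["cold_start"]
    _STATE["cold_start"] = False  # every later invocation in this environment is warm

    response = {"statusCode": 200}

    duration_ms = (time.perf_counter() - started) * 1000
    emit_metric("handler_duration_ms", duration_ms, cold_start=was_cold)
    return response
```

Tagging the latency metric with `cold_start` lets a dashboard separate cold and warm latency distributions instead of mixing them into one spike-prone series.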
Recommendations
- Add Lambda Extension or X-Ray segment for cold start measurement
- Use `batch_get_item` for fetching multiple tours/content in a single request
- Enable API Gateway caching on `GET /v1/sdk/config` (30-second TTL)
- Use `httpx.AsyncClient` singleton with connection pooling for JWKS fetches
- Enforce maximum page size (100 items) on all list endpoints
- Consider DynamoDB DAX when read costs exceed $200/month (documented upgrade path)
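Two of the recommendations above reduce to small helpers: splitting keys into chunks that fit one BatchGetItem request (DynamoDB accepts at most 100 keys per call) and clamping a requested page size. The function names and the default page size of 25 are assumptions.

```python
from typing import Any, Dict, Iterator, List, Optional, Sequence

MAX_BATCH_GET_KEYS = 100  # DynamoDB BatchGetItem limit per request
MAX_PAGE_SIZE = 100       # maximum page size recommended above


def chunk_keys(keys: Sequence[Dict[str, Any]],
               size: int = MAX_BATCH_GET_KEYS) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of keys, each small enough for one request."""
    for start in range(0, len(keys), size):
        yield list(keys[start:start + size])


def clamp_page_size(requested: Optional[int], default: int = 25) -> int:
    """Bound a caller-supplied page size to the range 1..MAX_PAGE_SIZE."""
    if requested is None:
        return default
    return max(1, min(requested, MAX_PAGE_SIZE))
```

Each chunk would be sent through `batch_get_item`; any `UnprocessedKeys` in the response need to be retried, since a batch call can return a partial result.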
Pillar 5: Cost Optimisation
Current Alignment
Pay-per-use -- Lambda (invocation-based), DynamoDB (on-demand), EventBridge (event-based), S3 (storage + requests). No idle infrastructure costs.
Data Lifecycle -- TTL on transient data: Events.Outbox (7 days after processing), Events.Interactions (90 days). DynamoDB handles cleanup automatically.
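DynamoDB TTL expects a Number attribute holding an epoch timestamp in seconds. A minimal sketch of how the two retention periods above might be turned into that value follows; the constant and function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

OUTBOX_RETENTION = timedelta(days=7)         # after processing
INTERACTIONS_RETENTION = timedelta(days=90)


def ttl_epoch(reference: datetime, retention: timedelta) -> int:
    """Return the expiry time as epoch seconds for a DynamoDB TTL attribute."""
    if reference.tzinfo is None:
        # Naive datetimes would be interpreted in local time; require explicit tz.
        raise ValueError("reference must be timezone-aware")
    return int((reference + retention).timestamp())


# Example: expiry for an outbox row processed right now.
expires_at = ttl_epoch(datetime.now(timezone.utc), OUTBOX_RETENTION)
```

Expired items are removed by a background process rather than at the exact expiry time, so readers should still filter on the attribute if stale rows matter.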
Right-sized Observability -- Self-hosted Grafana stack in Docker for development (zero cloud billing). CloudWatch log retention set to 30 days on Lambda functions.
CDN -- CloudFront PriceClass_100 (lowest-cost tier, covers major regions). Versioned SDK assets allow aggressive caching.
Gaps
- CloudWatch log retention is uniformly 30 days (api-gateway, lambda, eventbridge modules all set `retention_in_days = 30`)
- Unused DynamoDB Streams: Some tables have streams enabled without active consumers
- No cost allocation tags: Multi-tenant cost attribution not possible
- No reserved capacity analysis: Sustained workloads may benefit from provisioned DynamoDB or Savings Plans
- GSI storage costs: Unused GSIs incur storage costs regardless of query volume
Recommendations
- Set explicit CloudWatch Logs retention on ALL log groups (30 days production, 7 days development)
- Disable DynamoDB Streams on tables without active consumers (e.g., Themes.Definitions)
- Implement AWS cost allocation tags: `Environment`, `Service`, `TenantTier`
- Review DynamoDB costs monthly; evaluate provisioned capacity when baseline exceeds 1000 RCU
- Audit GSI usage; remove or disable indexes with zero queries
Pillar 6: Sustainability
Current Alignment
Resource Efficiency -- Serverless architecture scales to zero. No VMs running 24/7. Event-driven processing minimises active compute time.
Data Lifecycle -- TTL policies prevent unbounded storage growth. Analytics use incremental counters (cheap atomic updates) rather than full event replay.
Infrastructure -- AWS data centres use renewable energy. On-demand capacity avoids over-provisioning.
Gaps
- No carbon footprint tracking: No monitoring of energy consumption per request
- Lambda memory not optimised: Default 256MB may over-provision for lightweight handlers
- Local dev stack heavy: Full Grafana + Prometheus + Loki + Tempo stack consumes resources during development
- CloudFront cache effectiveness not measured: No metrics on cache-to-origin ratio
Recommendations
- Enable AWS Carbon Footprint monitoring in the billing console
- Use AWS Compute Optimizer to right-size Lambda memory allocation
- Make local observability stack optional (`docker compose --profile monitoring up`)
- Track CloudFront cache hit ratio; target > 80% to reduce origin requests
- Consider arm64 (Graviton2) Lambda architecture, which is priced about 20% lower per GB-second and is positioned by AWS as more energy-efficient than x86
Cross-Cutting Concerns
Missing Capabilities
| Capability | Status | Priority |
|---|---|---|
| CI/CD pipeline | Not implemented | P0 |
| CloudWatch alarms | Not configured | P0 |
| Per-tenant rate limiting | Not implemented | P1 |
| Lambda DLQ | Not configured | P1 |
| WAF rules | Not attached | P1 |
| Multi-region failover | Not designed | P2 |
| Load testing | Not implemented | P2 |
| Schema versioning | Not addressed | P2 |
| Cost allocation tagging | Not configured | P3 |
Priority Action Items
Immediate (P0 -- production blockers):
1. Create CI/CD pipeline with automated test, lint, and security scan
2. Configure CloudWatch alarms for Lambda errors, DynamoDB throttling, and outbox backlog
3. Enforce explicit CORS allowlist in production configuration
Short-term (P1 -- operational maturity):
4. Implement per-tenant rate limiting via API Gateway usage plans
5. Add Lambda DLQ for stream and event processor functions
6. Attach WAF to CloudFront with managed OWASP rule set
7. Write operational runbooks for top 5 failure scenarios
Medium-term (P2 -- scale readiness):
8. Design multi-region active-passive failover strategy
9. Implement load testing with representative traffic patterns
10. Add event schema versioning with backwards compatibility guarantees
11. Split API Lambda by endpoint type for independent scaling
Architecture Decisions Impacting Well-Architected
| Decision | Rationale | Trade-off |
|---|---|---|
| In-memory caching over Redis | Zero latency, zero cost, suitable for Lambda lifecycle | Cache not shared across Lambda instances |
| DynamoDB over RDS | Serverless, auto-scaling, no connection management | Limited query flexibility, no JOINs |
| Single API Lambda | Simpler deployment, shared connection pools | Single bottleneck, coarse scaling |
| Transactional outbox over direct EventBridge | Guaranteed delivery, atomic with business writes | Additional DynamoDB write per event |
| Shadow DOM in SDK | Complete CSS/JS isolation from host site | Slightly larger bundle, no inherited styles |
| LocalStack for local dev | Full AWS parity without cloud costs | Community edition lacks some features (Streams) |
Review Cadence
This Well-Architected review should be updated:
- Quarterly: Routine assessment of all six pillars
- After major changes: New services, new AWS resources, architecture refactors
- Before production launch: Full review with action items resolved to P1 level
- After incidents: Post-mortem findings incorporated into relevant pillars