AWS Well-Architected Review

Spotlight has been reviewed against the six pillars of the AWS Well-Architected Framework. This page documents the assessment, current alignment, identified gaps, and recommended improvements.

Review Date

This review covers the architecture as of the Phase 7 implementation. It should be revisited quarterly or after major architectural changes.

Summary Scorecard

Pillar	Score	Status	Key Strength	Key Gap
Operational Excellence	7.5/10	Partial	Full IaC, observability stack	No alerting rules or CI/CD pipeline
Security	8/10	Good	Layered auth, OWASP headers, tenant isolation	Per-tenant rate limiting absent
Reliability	7.5/10	Partial	Transactional outbox, circuit breaker	Lambda DLQ configuration missing
Performance Efficiency	7.5/10	Partial	Multi-layer caching, on-demand DynamoDB	Cold start monitoring absent
Cost Optimisation	8/10	Good	Serverless, TTL cleanup, no idle resources	Log retention not tuned per environment
Sustainability	6.5/10	Partial	Serverless scales to zero	No explicit green metrics

Pillar 1: Operational Excellence

Current Alignment

Infrastructure as Code -- All infrastructure is provisioned via Terraform modules. The infrastructure/terraform/ directory contains reusable modules for DynamoDB, Lambda, API Gateway, EventBridge, CDN, and monitoring. Environment-specific configs exist for local (LocalStack), dev, staging, and prod.

Observability -- The platform ships with a complete observability stack:

Traces: OpenTelemetry auto-instrumentation for FastAPI and botocore, exported to Tempo
Metrics: 13+ custom counters and histograms covering cache, auth, SDK, tours, content, outbox
Logs: Structured JSON logging via structlog with OTEL trace correlation, exported to Loki
Dashboards: Four Grafana dashboards (Product Overview, Tour Deep Dive, System Health, Admin Activity)

Local Development -- Docker Compose stack includes LocalStack, DynamoDB Admin, Grafana, Prometheus, Loki, Tempo, and OTEL Collector. Makefile provides 20+ targets for setup, testing, building, and deployment.

Gaps

No alerting rules: Grafana dashboards exist but no alert thresholds for auth failures, cache miss rates, or outbox backlog
No canary / blue-green deployments: CI pipelines exist (.github/workflows/ci.yml plus deploy-api-{dev,prod}.yml, deploy-dashboard.yml, deploy-marketing.yml, deploy-sdk.yml, deploy-docs.yml), but each runs an all-or-nothing publish; there is no traffic shifting via Lambda aliases or weighted CF Pages deployments
No operational runbooks: Incident response procedures not documented for common failures
No automated backup validation: DynamoDB PITR enabled but no restore testing schedule

Recommendations

Create GitHub Actions workflows for test, lint, security scan, and deploy
Define CloudWatch alarms for: auth failure rate > 5%, outbox pending > 100, Lambda error rate > 1%
Implement Lambda alias traffic shifting for gradual rollouts
Write runbooks for: circuit breaker activation, outbox backlog, DynamoDB throttling, JWKS endpoint failure
Schedule quarterly backup restore tests
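
One of the alarm thresholds recommended above (outbox pending > 100) could be expressed as keyword arguments for CloudWatch's PutMetricAlarm API. This is a minimal sketch rather than the project's configuration: the namespace `Spotlight`, the metric name `outbox_pending`, the alarm naming scheme, and the helper `outbox_backlog_alarm` are all hypothetical, and the evaluation window is an assumed starting point.

```python
from typing import Any


def outbox_backlog_alarm(environment: str, threshold: int = 100) -> dict[str, Any]:
    """Build PutMetricAlarm arguments for the 'outbox pending > 100' rule.

    Namespace, metric name, and alarm name are assumptions for illustration.
    """
    return {
        "AlarmName": f"spotlight-{environment}-outbox-backlog",
        "Namespace": "Spotlight",        # hypothetical custom namespace
        "MetricName": "outbox_pending",  # hypothetical custom metric
        "Statistic": "Maximum",
        "Period": 60,                    # seconds per evaluation window
        "EvaluationPeriods": 5,          # breach must persist for 5 windows
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


alarm = outbox_backlog_alarm("prod")
```

The resulting dictionary could be passed to boto3 as `cloudwatch.put_metric_alarm(**alarm)`. Since the infrastructure is managed in Terraform, the same values would more likely live in the monitoring module.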

Pillar 2: Security

Current Alignment

Authentication -- Layered auth with API key (SHA-256 hashed, constant-time comparison) plus optional JWT validated via tenant-configured JWKS endpoint. Admin detection uses a pluggable strategy pattern (table lookup, JWT permissions, JWT roles, composite).
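
The API key check described above can be sketched with the Python standard library. This is an illustration of the technique (SHA-256 hashing plus constant-time comparison), not the project's actual code; the function names and the example key are hypothetical.

```python
import hashlib
import hmac


def hash_api_key(raw_key: str) -> str:
    """Return the hex SHA-256 digest that would be stored instead of the raw key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Compare digests in constant time so timing does not leak matching prefixes."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)


stored = hash_api_key("sk_example_123")  # hypothetical key value
```

`hmac.compare_digest` takes time that does not depend on where the two strings first differ, which is what makes the comparison constant-time.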

Authorisation -- Tenant isolation enforced at the DynamoDB partition key level. All 17 tables use tenant_id as PK or PK prefix. Admin endpoints gated behind role verification.

Network Security -- OWASP-recommended security headers on all responses (CSP, HSTS, X-Frame-Options, X-Content-Type-Options). Per-tenant origin lockdown with wildcard subdomain support.
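
Origin lockdown with wildcard subdomains might work along the lines of the sketch below. The pattern syntax (`https://*.example.com`) and the function names are assumptions. In this sketch a wildcard matches subdomains only, not the bare domain, and the scheme must match exactly.

```python
from typing import Optional, Tuple


def _split_origin(origin: str) -> Optional[Tuple[str, str]]:
    """Return (scheme, host[:port]) in lowercase, or None if malformed."""
    scheme, sep, host = origin.strip().partition("://")
    if not sep or not scheme or not host or "/" in host:
        return None
    return scheme.lower(), host.lower()


def origin_allowed(origin: str, allowed_patterns: list[str]) -> bool:
    """Check an Origin header value against a tenant's allowlist."""
    parsed = _split_origin(origin)
    if parsed is None:
        return False
    scheme, host = parsed
    for pattern in allowed_patterns:
        parsed_pattern = _split_origin(pattern)
        if parsed_pattern is None:
            continue
        pattern_scheme, pattern_host = parsed_pattern
        if scheme != pattern_scheme:
            continue
        if pattern_host.startswith("*."):
            # Keep the leading dot so "evilexample.com" cannot match "*.example.com".
            suffix = pattern_host[1:]
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == pattern_host:
            return True
    return False
```

Keeping the leading dot in the suffix is the important detail: a plain `endswith("example.com")` check would accept look-alike domains.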

Data Protection -- DynamoDB server-side encryption enabled on all tables. HTML sanitisation in the SDK prevents XSS. No PII logged (sanitised structured logging).

Infrastructure -- Least-privilege IAM roles per Lambda function. S3 buckets block public access. CloudFront uses OAI for S3 origin access.

Gaps

No per-tenant rate limiting: API Gateway throttling is global (1000 req/s in prod, 100/s in dev); a single tenant can exhaust shared capacity
API key cache window: Rotated keys remain valid in cache for up to 10 minutes
Default CORS too permissive: Defaults to ["*"]; production should require explicit allowlist
No WAF rules: CloudFront lacks WAF for common attack patterns
No CloudFront access logging: CDN requests not logged for security audit
JWKS fetch timeout: 10-second timeout could be exploited in slow-rate attacks

Recommendations

Implement per-tenant rate limiting using API Gateway usage plans or a DynamoDB-backed token bucket
Reduce API key cache TTL to 60 seconds after rotation events
Enforce explicit CORS allowlist in production (block * default)
Attach AWS WAF to CloudFront with OWASP rule set
Enable CloudFront access logs to S3 with lifecycle policy
Reduce JWKS timeout to 5 seconds with retry backoff
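
The token bucket option for per-tenant rate limiting can be illustrated in memory. The sketch below assumes one bucket per tenant held in a dictionary; a DynamoDB-backed version would persist the same two fields (token count and last update time) per tenant. Class names and parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Bucket that refills continuously up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Keeps an independent bucket per tenant so one tenant cannot drain another."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

Because Lambda instances do not share memory, an in-memory limiter like this only bounds traffic per instance; shared state (or API Gateway usage plans) is needed for a true per-tenant limit.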

Pillar 3: Reliability

Current Alignment

Event Delivery -- Transactional outbox pattern guarantees no lost events. Atomic writes pair business data with outbox entries via DynamoDB transact_write_items. Idempotent processing via status checks.
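
The atomic pairing of a business write with an outbox entry can be sketched as the `TransactItems` payload for DynamoDB's `transact_write_items`. This is an assumed shape, not the project's schema: the table name `Tours.Definitions`, the attribute names, the key layout, and the event type are hypothetical (only `Events.Outbox` appears elsewhere in this review).

```python
import json
import uuid
from typing import Any


def build_tour_publish_transaction(tenant_id: str, tour_id: str,
                                   tour_name: str) -> list[dict[str, Any]]:
    """Return two Put operations that succeed or fail together."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"tour_id": tour_id, "name": tour_name})
    business_write = {
        "Put": {
            "TableName": "Tours.Definitions",  # hypothetical table name
            "Item": {
                "tenant_id": {"S": tenant_id},
                "tour_id": {"S": tour_id},
                "name": {"S": tour_name},
            },
        }
    }
    outbox_write = {
        "Put": {
            "TableName": "Events.Outbox",
            "Item": {
                "tenant_id": {"S": tenant_id},
                "event_id": {"S": event_id},
                "event_type": {"S": "tour.published"},  # hypothetical event type
                "status": {"S": "pending"},
                "payload": {"S": payload},
            },
            # Refuse to overwrite an existing outbox entry with the same key.
            "ConditionExpression": "attribute_not_exists(event_id)",
        }
    }
    return [business_write, outbox_write]


items = build_tour_publish_transaction("tenant-1", "tour-42", "Onboarding")
```

With boto3 the list would be submitted as `client.transact_write_items(TransactItems=items)`. If either operation fails, neither item is written, so no business change exists without its event.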

Failure Handling -- Dead-letter pattern with 30-day retention for events that exceed retry limits. Circuit breaker per tenant allows immediate kill-switch. SDK gracefully skips missing target elements without crashing.

Data Durability -- DynamoDB Point-in-Time Recovery enabled. EventBridge archive retains events for 30 days for replay.

Degradation -- Anonymous SDK access works if JWT is absent. Tour steps with missing targets are skipped. SDK continues operating if API is temporarily unreachable (cached config).

Gaps

No Lambda DLQ: If Lambda execution fails on DynamoDB Stream events, no explicit dead-letter queue captures failures
Single Lambda function: All API traffic routes through one function; no separation by tenant tier or endpoint type
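No proactive cache cleanup: CacheManager caps each cache at 10,000 entries, but expired entries are only removed lazily, so stale data can occupy memory until it is next accessed or displaced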
JWKS failure cascade: If a tenant's JWKS endpoint goes down, all requests for that tenant fail with no fallback to previously cached keys
No event schema validation: Malformed events in the pipeline could crash processors

Recommendations

Configure Lambda DLQ (SQS) for stream and event processor functions
Split API Lambda into SDK and admin functions with separate concurrency profiles
Add periodic cache cleanup (every 5 minutes, evict expired entries)
Implement JWKS cache fallback: serve stale keys on fetch failure with a warning metric
Validate event schemas in processors; route invalid events to DLQ
Add DynamoDB access verification to /health endpoint
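
The JWKS stale-key fallback recommended above could look roughly like the following. The class name, the injected `fetch` callable, and the `stale_served` counter (standing in for the warning metric) are assumptions; no particular HTTP client is implied.

```python
import time
from typing import Any, Callable


class JwksCache:
    """Caches key sets per URL and serves the last good copy if a refresh fails."""

    def __init__(self, fetch: Callable[[str], dict[str, Any]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict[str, Any]]] = {}
        self.stale_served = 0  # stand-in for a warning metric

    def get(self, jwks_url: str) -> dict[str, Any]:
        now = self._clock()
        entry = self._entries.get(jwks_url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit
        try:
            keys = self._fetch(jwks_url)
        except Exception:
            if entry is None:
                raise  # nothing cached yet, so the failure must surface
            self.stale_served += 1
            return entry[1]  # serve stale keys rather than fail the tenant
        self._entries[jwks_url] = (now, keys)
        return keys
```

A production version would probably bound how long stale keys may be served, since a revoked signing key should not remain trusted indefinitely.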

Pillar 4: Performance Efficiency

Current Alignment

Compute -- Serverless Lambda with configurable memory (256MB default). FastAPI with async/await throughout. No idle compute costs.

Data -- DynamoDB on-demand billing scales automatically. Sparse GSI for outbox pending events minimises query costs. Two-tier analytics rollup (hourly + daily) uses atomic ADD operations.
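
The two-tier rollup can be pictured as one atomic `ADD` update per granularity. The sketch below builds the `update_item` arguments only. The table names `Analytics.Hourly` and `Analytics.Daily`, the key layout, and the bucket format are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any


def rollup_updates(tenant_id: str, tour_id: str, metric: str,
                   occurred_at: datetime) -> list[dict[str, Any]]:
    """Return update_item arguments that increment hourly and daily counters."""
    ts = occurred_at.astimezone(timezone.utc)
    buckets = {
        "Hourly": ts.strftime("%Y-%m-%dT%H"),
        "Daily": ts.strftime("%Y-%m-%d"),
    }
    return [
        {
            "TableName": f"Analytics.{granularity}",  # hypothetical table names
            "Key": {
                "tenant_id": {"S": tenant_id},
                "bucket": {"S": f"{tour_id}#{bucket}"},
            },
            # ADD creates the attribute at 0 if absent, then increments atomically.
            "UpdateExpression": "ADD #metric :one",
            "ExpressionAttributeNames": {"#metric": metric},
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }
        for granularity, bucket in buckets.items()
    ]


updates = rollup_updates(
    "tenant-1", "tour-42", "views",
    datetime(2024, 5, 17, 14, 30, tzinfo=timezone.utc),
)
```

Because `ADD` is applied server-side, concurrent Lambda invocations can increment the same counter without a read-modify-write cycle.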

Caching -- Multi-layer in-memory caching: API keys (10min), JWKS (1hr), tenant config (30s), themes (5min), published content (1min). All layers emit OTEL hit/miss metrics.
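
A single cache layer of this kind might be sketched as a TTL cache with a size cap and hit/miss counters. The class below is illustrative, not the project's CacheManager: the names are hypothetical, and plain integer counters stand in for the OTEL metrics.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TtlCache:
    """In-memory cache with per-entry expiry, a size cap, and hit/miss counts."""

    def __init__(self, ttl_seconds: float, max_entries: int = 10_000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._max_entries = max_entries
        self._clock = clock
        self._entries: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(key, None)  # drop the expired entry lazily
            self.misses += 1
            return None
        self.hits += 1
        return entry[1]

    def set(self, key: str, value: Any) -> None:
        self._entries.pop(key, None)
        if len(self._entries) >= self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest insertion
        self._entries[key] = (self._clock() + self._ttl, value)

    def evict_expired(self) -> int:
        """Proactive cleanup; returns how many entries were removed."""
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._entries.items() if expires_at <= now]
        for k in expired:
            del self._entries[k]
        return len(expired)
```

Each layer listed above would be one instance with its own TTL, for example `TtlCache(ttl_seconds=30)` for tenant config. Calling `evict_expired` on a timer is one way to provide proactive cleanup.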

Content Delivery -- CloudFront CDN for SDK assets with versioned paths and long-lived cache headers. CORS preflight caching (max_age 3600s) reduces OPTIONS requests.

SDK -- Shadow DOM isolation. Target bundle size < 20KB gzipped. Dynamic imports for admin code (only loaded for admins).

Gaps

No cold start monitoring: Lambda init time not instrumented; latency spikes undetectable
No DynamoDB query batching: Related entities fetched with individual get_item calls
API Gateway caching not configured: Repeated identical requests hit Lambda every time
Connection pooling not explicit: HTTP client lifecycle for JWKS fetches unclear
No pagination defaults: List operations could return unbounded result sets

Recommendations

Add Lambda Extension or X-Ray segment for cold start measurement
Use batch_get_item for fetching multiple tours/content in a single request
Enable API Gateway caching on GET /v1/sdk/config (30-second TTL)
Use httpx.AsyncClient singleton with connection pooling for JWKS fetches
Enforce maximum page size (100 items) on all list endpoints
Consider DynamoDB DAX when read costs exceed $200/month (documented upgrade path)
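
The page-size cap recommended above amounts to clamping the client-supplied limit before it reaches the query. In this small sketch the default of 25 and the function name are assumptions; only the maximum of 100 comes from the recommendation.

```python
from typing import Optional

MAX_PAGE_SIZE = 100      # cap from the recommendation above
DEFAULT_PAGE_SIZE = 25   # assumed default when the client sends no limit


def clamp_page_size(requested: Optional[int]) -> int:
    """Return a safe page size for a list endpoint."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    if requested < 1:
        raise ValueError("page size must be at least 1")
    return min(requested, MAX_PAGE_SIZE)
```

The clamped value would then be passed as the `Limit` on the DynamoDB query, with the returned `LastEvaluatedKey` used as the cursor for the next page.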

Pillar 5: Cost Optimisation

Current Alignment

Pay-per-use -- Lambda (invocation-based), DynamoDB (on-demand), EventBridge (event-based), S3 (storage + requests). No idle infrastructure costs.

Data Lifecycle -- TTL on transient data: Events.Outbox (7 days after processing), Events.Interactions (90 days). DynamoDB handles cleanup automatically.
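
DynamoDB TTL expects a Number attribute holding the expiry time as epoch seconds, so the retention periods above translate into a small calculation at write time. The function and constant names in this sketch are hypothetical.

```python
from datetime import datetime, timedelta, timezone

OUTBOX_RETENTION = timedelta(days=7)        # after processing
INTERACTION_RETENTION = timedelta(days=90)


def ttl_epoch(reference: datetime, retention: timedelta) -> int:
    """Return the expiry time as epoch seconds for a DynamoDB TTL attribute."""
    return int((reference + retention).timestamp())


processed_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
outbox_expires_at = ttl_epoch(processed_at, OUTBOX_RETENTION)
```

DynamoDB deletes expired items in the background rather than at the exact expiry second, so readers that care about precision should still filter on the attribute.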

Right-sized Observability -- Self-hosted Grafana stack in Docker for development (zero cloud billing). CloudWatch log retention set to 30 days on Lambda functions.

CDN -- CloudFront PriceClass_100 (lowest-cost tier, covers major regions). Versioned SDK assets allow aggressive caching.

Gaps

Uniform log retention: CloudWatch log retention is 30 days in every environment (the api-gateway, lambda, and eventbridge modules all set retention_in_days = 30), so development logs are kept as long as production logs
Unused DynamoDB Streams: Some tables have streams enabled without active consumers
No cost allocation tags: Multi-tenant cost attribution not possible
No reserved capacity analysis: Sustained workloads may benefit from provisioned DynamoDB or Savings Plans
GSI storage costs: Unused GSIs incur storage costs regardless of query volume

Recommendations

Set explicit CloudWatch Logs retention on ALL log groups (30 days production, 7 days development)
Disable DynamoDB Streams on tables without active consumers (e.g., Themes.Definitions)
Implement AWS cost allocation tags: Environment, Service, TenantTier
Review DynamoDB costs monthly; evaluate provisioned capacity when baseline exceeds 1000 RCU
Audit GSI usage; remove or disable indexes with zero queries

Pillar 6: Sustainability

Current Alignment

Resource Efficiency -- Serverless architecture scales to zero. No VMs running 24/7. Event-driven processing minimises active compute time.

Data Lifecycle -- TTL policies prevent unbounded storage growth. Analytics use incremental counters (cheap atomic updates) rather than full event replay.

Infrastructure -- AWS data centres use renewable energy. On-demand capacity avoids over-provisioning.

Gaps

No carbon footprint tracking: No monitoring of energy consumption per request
Lambda memory not optimised: Default 256MB may over-provision for lightweight handlers
Local dev stack heavy: Full Grafana + Prometheus + Loki + Tempo stack consumes resources during development
CloudFront cache effectiveness not measured: No metrics on cache-to-origin ratio

Recommendations

Track emissions with the AWS Customer Carbon Footprint Tool in the Billing console
Use AWS Compute Optimizer to right-size Lambda memory allocation
Make local observability stack optional (docker compose --profile monitoring up)
Track CloudFront cache hit ratio; target > 80% to reduce origin requests
Consider arm64 (Graviton2) Lambda architecture, which AWS prices about 20% lower per GB-second and positions as more energy-efficient than x86

Cross-Cutting Concerns

Missing Capabilities

Capability	Status	Priority
CI/CD pipeline	Not implemented	P0
CloudWatch alarms	Not configured	P0
Per-tenant rate limiting	Not implemented	P1
Lambda DLQ	Not configured	P1
WAF rules	Not attached	P1
Multi-region failover	Not designed	P2
Load testing	Not implemented	P2
Schema versioning	Not addressed	P2
Cost allocation tagging	Not configured	P3

Priority Action Items

Immediate (P0 -- production blockers):

1. Create CI/CD pipeline with automated test, lint, and security scan
2. Configure CloudWatch alarms for Lambda errors, DynamoDB throttling, and outbox backlog
3. Enforce explicit CORS allowlist in production configuration

Short-term (P1 -- operational maturity):

4. Implement per-tenant rate limiting via API Gateway usage plans
5. Add Lambda DLQ for stream and event processor functions
6. Attach WAF to CloudFront with managed OWASP rule set
7. Write operational runbooks for top 5 failure scenarios

Medium-term (P2 -- scale readiness):

8. Design multi-region active-passive failover strategy
9. Implement load testing with representative traffic patterns
10. Add event schema versioning with backwards compatibility guarantees
11. Split API Lambda by endpoint type for independent scaling

Architecture Decisions Impacting Well-Architected

Decision	Rationale	Trade-off
In-memory caching over Redis	Zero latency, zero cost, suitable for Lambda lifecycle	Cache not shared across Lambda instances
DynamoDB over RDS	Serverless, auto-scaling, no connection management	Limited query flexibility, no JOINs
Single API Lambda	Simpler deployment, shared connection pools	Single bottleneck, coarse scaling
Transactional outbox over direct EventBridge	Guaranteed delivery, atomic with business writes	Additional DynamoDB write per event
Shadow DOM in SDK	Complete CSS/JS isolation from host site	Slightly larger bundle, no inherited styles
LocalStack for local dev	Full AWS parity without cloud costs	Community edition lacks some features (Streams)

Review Cadence

This Well-Architected review should be updated:

Quarterly: Routine assessment of all six pillars
After major changes: New services, new AWS resources, architecture refactors
Before production launch: Full review with action items resolved to P1 level
After incidents: Post-mortem findings incorporated into relevant pillars
