
AWS Well-Architected Review

Spotlight has been reviewed against the six pillars of the AWS Well-Architected Framework. This page documents the assessment, current alignment, identified gaps, and recommended improvements.

Review Date

This review covers the architecture as of the Phase 7 implementation. It should be revisited quarterly or after major architectural changes.

Summary Scorecard

| Pillar | Score | Status | Key Strength | Key Gap |
| --- | --- | --- | --- | --- |
| Operational Excellence | 7.5/10 | Partial | Full IaC, observability stack | No alerting rules or CI/CD pipeline |
| Security | 8/10 | Good | Layered auth, OWASP headers, tenant isolation | Per-tenant rate limiting absent |
| Reliability | 7.5/10 | Partial | Transactional outbox, circuit breaker | Lambda DLQ configuration missing |
| Performance Efficiency | 7.5/10 | Partial | Multi-layer caching, on-demand DynamoDB | Cold start monitoring absent |
| Cost Optimisation | 8/10 | Good | Serverless, TTL cleanup, no idle resources | CloudWatch log retention unbounded |
| Sustainability | 6.5/10 | Partial | Serverless scales to zero | No explicit green metrics |

Pillar 1: Operational Excellence

Current Alignment

Infrastructure as Code -- All infrastructure is provisioned via Terraform modules. The infrastructure/terraform/ directory contains reusable modules for DynamoDB, Lambda, API Gateway, EventBridge, CDN, and monitoring. Environment-specific configs exist for local (LocalStack), dev, staging, and prod.

Observability -- The platform ships with a complete observability stack:

  • Traces: OpenTelemetry auto-instrumentation for FastAPI and botocore, exported to Tempo
  • Metrics: 13+ custom counters and histograms covering cache, auth, SDK, tours, content, outbox
  • Logs: Structured JSON logging via structlog with OTEL trace correlation, exported to Loki
  • Dashboards: Four Grafana dashboards (Product Overview, Tour Deep Dive, System Health, Admin Activity)

Local Development -- Docker Compose stack includes LocalStack, DynamoDB Admin, Grafana, Prometheus, Loki, Tempo, and OTEL Collector. Makefile provides 20+ targets for setup, testing, building, and deployment.

Gaps

  • No alerting rules: Grafana dashboards exist but no alert thresholds for auth failures, cache miss rates, or outbox backlog
  • No canary / blue-green deployments: CI pipelines exist (.github/workflows/ci.yml plus deploy-api-{dev,prod}.yml, deploy-dashboard.yml, deploy-marketing.yml, deploy-sdk.yml, deploy-docs.yml), but each runs an all-or-nothing publish, with no traffic shifting via Lambda aliases or weighted CF Pages deployments
  • No operational runbooks: Incident response procedures not documented for common failures
  • No automated backup validation: DynamoDB PITR enabled but no restore testing schedule

Recommendations

  1. Create GitHub Actions workflows for test, lint, security scan, and deploy
  2. Define CloudWatch alarms for: auth failure rate > 5%, outbox pending > 100, Lambda error rate > 1%
  3. Implement Lambda alias traffic shifting for gradual rollouts
  4. Write runbooks for: circuit breaker activation, outbox backlog, DynamoDB throttling, JWKS endpoint failure
  5. Schedule quarterly backup restore tests
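
The alarm thresholds in recommendation 2 can be expressed as data and checked against a metrics snapshot. The following is a minimal Python sketch, not the platform's alerting code: the metric names (`auth_failure_rate`, `outbox_pending`, `lambda_error_rate`) and the `breached` helper are assumptions made for illustration. Only the limits come from the recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str     # hypothetical metric name, not taken from the codebase
    limit: float  # the alert fires when the observed value is strictly greater


# Limits from the recommendation: auth failures > 5%, outbox pending > 100,
# Lambda error rate > 1%.
THRESHOLDS = (
    Threshold("auth_failure_rate", 0.05),
    Threshold("outbox_pending", 100),
    Threshold("lambda_error_rate", 0.01),
)


def breached(snapshot: dict[str, float]) -> list[str]:
    """Return the names of all thresholds exceeded by a metrics snapshot."""
    return [t.name for t in THRESHOLDS if snapshot.get(t.name, 0.0) > t.limit]
```

Keeping the limits in one table like this makes it easier to generate the equivalent CloudWatch alarm or Grafana alert definitions from a single source.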

Pillar 2: Security

Current Alignment

Authentication -- Layered auth with API key (SHA-256 hashed, constant-time comparison) plus optional JWT validated via tenant-configured JWKS endpoint. Admin detection uses a pluggable strategy pattern (table lookup, JWT permissions, JWT roles, composite).
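
The hash-and-compare step for API keys can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions rather than the actual Spotlight implementation: the function names are hypothetical, and storing the digest as a hex string is an assumption.

```python
import hashlib
import hmac


def hash_api_key(raw_key: str) -> str:
    """Return the SHA-256 hex digest that would be stored instead of the raw key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Compare digests in constant time to avoid leaking match position via timing."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```

`hmac.compare_digest` takes time independent of where the two strings first differ, which is the property the "constant-time comparison" requirement refers to.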

Authorisation -- Tenant isolation enforced at the DynamoDB partition key level. All 17 tables use tenant_id as PK or PK prefix. Admin endpoints gated behind role verification.

Network Security -- OWASP-recommended security headers on all responses (CSP, HSTS, X-Frame-Options, X-Content-Type-Options). Per-tenant origin lockdown with wildcard subdomain support.
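
One way to implement origin lockdown with wildcard subdomains is sketched below. The pattern syntax (`https://*.example.com`), the function name, and the choice that a wildcard does not match the apex domain are all assumptions for illustration. The real matching rules may differ.

```python
from urllib.parse import urlsplit


def origin_allowed(origin: str, allowed_patterns: list[str]) -> bool:
    """Check an Origin header value against a tenant's allowlist.

    A pattern host starting with "*." matches any subdomain of the remainder,
    but not the bare domain itself (an assumption of this sketch).
    """
    candidate = urlsplit(origin)
    if candidate.scheme not in ("http", "https") or not candidate.hostname:
        return False
    for pattern in allowed_patterns:
        rule = urlsplit(pattern)
        # Scheme and port must match exactly.
        if (rule.scheme, rule.port) != (candidate.scheme, candidate.port):
            continue
        rule_host = rule.hostname or ""
        if rule_host.startswith("*."):
            # Keep the leading dot so "evilexample.com" cannot match "*.example.com".
            if candidate.hostname.endswith(rule_host[1:]):
                return True
        elif candidate.hostname == rule_host:
            return True
    return False
```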

Data Protection -- DynamoDB server-side encryption enabled on all tables. HTML sanitisation in the SDK prevents XSS. No PII logged (sanitised structured logging).

Infrastructure -- Least-privilege IAM roles per Lambda function. S3 buckets block public access. CloudFront uses OAI for S3 origin access.

Gaps

  • No per-tenant rate limiting: API Gateway throttling is global (1000 req/s in prod, 100/s in dev); a single tenant can exhaust shared capacity
  • API key cache window: Rotated keys remain valid in cache for up to 10 minutes
  • Default CORS too permissive: Defaults to ["*"]; production should require explicit allowlist
  • No WAF rules: CloudFront lacks WAF for common attack patterns
  • No CloudFront access logging: CDN requests not logged for security audit
  • JWKS fetch timeout: 10-second timeout could be exploited in slow-rate attacks

Recommendations

  1. Implement per-tenant rate limiting using API Gateway usage plans or a DynamoDB-backed token bucket
  2. Reduce API key cache TTL to 60 seconds after rotation events
  3. Enforce explicit CORS allowlist in production (block * default)
  4. Attach AWS WAF to CloudFront with OWASP rule set
  5. Enable CloudFront access logs to S3 with lifecycle policy
  6. Reduce JWKS timeout to 5 seconds with retry backoff
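
The token-bucket option in recommendation 1 can be illustrated with an in-memory sketch. The class name and parameters are hypothetical. A production version would need to keep bucket state in a shared store (for example DynamoDB with conditional updates), because in-memory state is not shared across Lambda instances.

```python
import time
from typing import Callable


class TenantRateLimiter:
    """Per-tenant token bucket: `rate` tokens refill per second, up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # tenant_id -> (tokens remaining, time of last update)
        self._buckets: dict[str, tuple[float, float]] = {}

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self._buckets[tenant_id] = (tokens, now)
            return False
        self._buckets[tenant_id] = (tokens - 1.0, now)
        return True
```

Because each tenant has its own bucket, one tenant exhausting its allowance does not affect requests from another.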

Pillar 3: Reliability

Current Alignment

Event Delivery -- Transactional outbox pattern guarantees no lost events. Atomic writes pair business data with outbox entries via DynamoDB transact_write_items. Idempotent processing via status checks.
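
The atomic pairing of a business write with an outbox entry might look like the request payload below. This is a sketch under assumptions: the `Tours.Definitions` table name, the attribute names, and the `PENDING` status value are hypothetical. Only `Events.Outbox`, the `tenant_id` key, and the use of `transact_write_items` come from this document.

```python
import json
import uuid
from datetime import datetime, timezone


def build_outbox_transaction(tenant_id: str, tour: dict, event_type: str) -> list[dict]:
    """Build TransactItems that write a tour and its outbox event together."""
    now = datetime.now(timezone.utc).isoformat()
    event_id = str(uuid.uuid4())
    payload = json.dumps(tour)
    return [
        {   # Business write (table and attribute names are hypothetical).
            "Put": {
                "TableName": "Tours.Definitions",
                "Item": {
                    "tenant_id": {"S": tenant_id},
                    "tour_id": {"S": tour["tour_id"]},
                    "payload": {"S": payload},
                    "updated_at": {"S": now},
                },
            }
        },
        {   # Outbox entry, committed or rolled back with the write above.
            "Put": {
                "TableName": "Events.Outbox",
                "Item": {
                    "tenant_id": {"S": tenant_id},
                    "event_id": {"S": event_id},
                    "event_type": {"S": event_type},
                    "status": {"S": "PENDING"},
                    "payload": {"S": payload},
                    "created_at": {"S": now},
                },
                # Refuse to overwrite an existing event with the same key.
                "ConditionExpression": "attribute_not_exists(event_id)",
            }
        },
    ]

# With a boto3 DynamoDB client, the list would be passed as:
#   client.transact_write_items(TransactItems=items)
```

If either `Put` fails, DynamoDB rejects the whole transaction, so a tour is never stored without its event or the reverse.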

Failure Handling -- Dead-letter pattern with 30-day retention for events that exceed retry limits. Circuit breaker per tenant allows immediate kill-switch. SDK gracefully skips missing target elements without crashing.

Data Durability -- DynamoDB Point-in-Time Recovery enabled. EventBridge archive retains events for 30 days for replay.

Degradation -- Anonymous SDK access works if JWT is absent. Tour steps with missing targets are skipped. SDK continues operating if API is temporarily unreachable (cached config).

Gaps

  • No Lambda DLQ: If Lambda execution fails on DynamoDB Stream events, no explicit dead-letter queue captures failures
  • Single Lambda function: All API traffic routes through one function; no separation by tenant tier or endpoint type
  • No proactive cache cleanup: CacheManager caps the cache at 10,000 entries, but expired entries are not evicted proactively
  • JWKS failure cascade: If a tenant's JWKS endpoint goes down, all requests for that tenant fail with no fallback to previously cached keys
  • No event schema validation: Malformed events in the pipeline could crash processors

Recommendations

  1. Configure Lambda DLQ (SQS) for stream and event processor functions
  2. Split API Lambda into SDK and admin functions with separate concurrency profiles
  3. Add periodic cache cleanup (every 5 minutes, evict expired entries)
  4. Implement JWKS cache fallback: serve stale keys on fetch failure with a warning metric
  5. Validate event schemas in processors; route invalid events to DLQ
  6. Add DynamoDB access verification to /health endpoint
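
Recommendation 4 (serve stale JWKS keys on fetch failure) could be approached as in the sketch below. The `JwksCache` class, its `fetch` callable, and the `stale_served` counter are hypothetical stand-ins. In the real service the fetch would be an HTTP call and the counter would be an OTEL metric.

```python
import time
from typing import Callable


class JwksCache:
    """Cache JWKS documents per URL and fall back to stale keys if a refresh fails."""

    def __init__(self, fetch: Callable[[str], dict], ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[dict, float]] = {}  # url -> (keys, fetched_at)
        self.stale_served = 0  # stand-in for a warning metric

    def get(self, jwks_url: str) -> dict:
        now = self._clock()
        cached = self._entries.get(jwks_url)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        try:
            keys = self._fetch(jwks_url)
        except Exception:
            if cached is None:
                raise  # nothing to fall back to
            self.stale_served += 1
            return cached[0]
        self._entries[jwks_url] = (keys, now)
        return keys
```

A production version would likely also bound how long stale keys may be served, so that a revoked key does not remain trusted indefinitely.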

Pillar 4: Performance Efficiency

Current Alignment

Compute -- Serverless Lambda with configurable memory (256MB default). FastAPI with async/await throughout. No idle compute costs.

Data -- DynamoDB on-demand billing scales automatically. Sparse GSI for outbox pending events minimises query costs. Two-tier analytics rollup (hourly + daily) uses atomic ADD operations.
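
The two-tier rollup with atomic ADD can be illustrated by the `update_item` arguments it would produce. This is a sketch only: the table names (`Analytics.Hourly`, `Analytics.Daily`), the key layout, and the bucket formats are assumptions, not the actual schema.

```python
from datetime import datetime, timezone


def rollup_updates(tenant_id: str, metric: str, at: datetime, count: int = 1) -> list[dict]:
    """Build one update_item request per rollup tier (hourly, then daily)."""
    at = at.astimezone(timezone.utc)
    buckets = {
        "Hourly": at.strftime("%Y-%m-%dT%H"),
        "Daily": at.strftime("%Y-%m-%d"),
    }
    return [
        {
            "TableName": f"Analytics.{tier}",  # hypothetical table name
            "Key": {
                "tenant_id": {"S": tenant_id},
                "bucket": {"S": f"{metric}#{bucket}"},
            },
            # ADD creates the attribute if it is missing and increments it
            # atomically otherwise. COUNT is a DynamoDB reserved word, so the
            # attribute name is aliased.
            "UpdateExpression": "ADD #count :inc",
            "ExpressionAttributeNames": {"#count": "count"},
            "ExpressionAttributeValues": {":inc": {"N": str(count)}},
        }
        for tier, bucket in buckets.items()
    ]
```

Each dictionary could be passed to a boto3 client as `client.update_item(**request)`. No read is needed before the write, which is what keeps the counters cheap.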

Caching -- Multi-layer in-memory caching: API keys (10min), JWKS (1hr), tenant config (30s), themes (5min), published content (1min). All layers emit OTEL hit/miss metrics.
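
A single cache layer with a TTL and hit/miss counters can be sketched as follows. The TTL values are the ones listed above. The `TtlCache` class and the layer names in `CACHE_TTLS` are hypothetical, and the plain integer counters stand in for the OTEL metrics.

```python
import time
from typing import Any, Callable, Optional

# TTLs in seconds, taken from the layer list above; the key names are assumed.
CACHE_TTLS = {
    "api_keys": 600,
    "jwks": 3600,
    "tenant_config": 30,
    "themes": 300,
    "published_content": 60,
}


class TtlCache:
    """In-memory cache whose entries expire a fixed time after being set."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._data: dict[str, tuple[Any, float]] = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (value, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is not None and self._clock() < entry[1]:
            self.hits += 1
            return entry[0]
        self._data.pop(key, None)  # drop the expired entry, if any
        self.misses += 1
        return None
```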

Content Delivery -- CloudFront CDN for SDK assets with versioned paths and long-lived cache headers. CORS preflight caching (max_age 3600s) reduces OPTIONS requests.

SDK -- Shadow DOM isolation. Target bundle size < 20KB gzipped. Dynamic imports for admin code (only loaded for admins).

Gaps

  • No cold start monitoring: Lambda init time not instrumented; latency spikes undetectable
  • No DynamoDB query batching: Related entities fetched with individual get_item calls
  • API Gateway caching not configured: Repeated identical requests hit Lambda every time
  • Connection pooling not explicit: HTTP client lifecycle for JWKS fetches unclear
  • No pagination defaults: List operations could return unbounded result sets

Recommendations

  1. Add Lambda Extension or X-Ray segment for cold start measurement
  2. Use batch_get_item for fetching multiple tours/content in a single request
  3. Enable API Gateway caching on GET /v1/sdk/config (30-second TTL)
  4. Use httpx.AsyncClient singleton with connection pooling for JWKS fetches
  5. Enforce maximum page size (100 items) on all list endpoints
  6. Consider DynamoDB DAX when read costs exceed $200/month (documented upgrade path)
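
Recommendations 2 and 5 both come down to bounding request sizes. The sketch below shows one possible shape. `BatchGetItem` accepts at most 100 keys per call, so larger key sets need chunking. The helper names and the default page size of 25 are assumptions.

```python
from typing import Iterator, Optional

MAX_BATCH_GET_KEYS = 100  # DynamoDB BatchGetItem limit per request
MAX_PAGE_SIZE = 100       # maximum page size from recommendation 5


def chunk_keys(keys: list[dict], size: int = MAX_BATCH_GET_KEYS) -> Iterator[list[dict]]:
    """Yield successive slices of `keys`, each small enough for one batch_get_item call."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def clamp_page_size(requested: Optional[int], default: int = 25) -> int:
    """Apply a default when no size is given and cap client-supplied sizes."""
    if requested is None:
        return default
    return max(1, min(requested, MAX_PAGE_SIZE))
```

A caller using `batch_get_item` would also need to retry any `UnprocessedKeys` returned in the response, which this sketch leaves out.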

Pillar 5: Cost Optimisation

Current Alignment

Pay-per-use -- Lambda (invocation-based), DynamoDB (on-demand), EventBridge (event-based), S3 (storage + requests). No idle infrastructure costs.

Data Lifecycle -- TTL on transient data: Events.Outbox (7 days after processing), Events.Interactions (90 days). DynamoDB handles cleanup automatically.
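
DynamoDB TTL expects a Number attribute holding an expiry time in Unix epoch seconds. The retention periods above translate into attribute values as in this minimal sketch. The constant and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

OUTBOX_RETENTION = timedelta(days=7)        # Events.Outbox: 7 days after processing
INTERACTION_RETENTION = timedelta(days=90)  # Events.Interactions: 90 days


def ttl_epoch(start: datetime, keep_for: timedelta) -> int:
    """Return the expiry time as Unix epoch seconds for a DynamoDB TTL attribute."""
    return int((start + keep_for).timestamp())
```

Deletion is not immediate once the timestamp passes, so readers that must never see expired items should also filter on the TTL attribute.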

Right-sized Observability -- Self-hosted Grafana stack in Docker for development (zero cloud billing). CloudWatch log retention set to 30 days on Lambda functions.

CDN -- CloudFront PriceClass_100 (lowest-cost tier, covers major regions). Versioned SDK assets allow aggressive caching.

Gaps

  • Uniform log retention: CloudWatch log retention is 30 days everywhere (the api-gateway, lambda, and eventbridge modules all set retention_in_days = 30), with no shorter retention for non-production environments
  • Unused DynamoDB Streams: Some tables have streams enabled without active consumers
  • No cost allocation tags: Multi-tenant cost attribution not possible
  • No reserved capacity analysis: Sustained workloads may benefit from provisioned DynamoDB or Savings Plans
  • GSI storage costs: Unused GSIs incur storage costs regardless of query volume

Recommendations

  1. Set explicit CloudWatch Logs retention on ALL log groups (30 days production, 7 days development)
  2. Disable DynamoDB Streams on tables without active consumers (e.g., Themes.Definitions)
  3. Implement AWS cost allocation tags: Environment, Service, TenantTier
  4. Review DynamoDB costs monthly; evaluate provisioned capacity when baseline exceeds 1000 RCU
  5. Audit GSI usage; remove or disable indexes with zero queries

Pillar 6: Sustainability

Current Alignment

Resource Efficiency -- Serverless architecture scales to zero. No VMs running 24/7. Event-driven processing minimises active compute time.

Data Lifecycle -- TTL policies prevent unbounded storage growth. Analytics use incremental counters (cheap atomic updates) rather than full event replay.

Infrastructure -- AWS data centres use renewable energy. On-demand capacity avoids over-provisioning.

Gaps

  • No carbon footprint tracking: No monitoring of energy consumption per request
  • Lambda memory not optimised: Default 256MB may over-provision for lightweight handlers
  • Local dev stack heavy: Full Grafana + Prometheus + Loki + Tempo stack consumes resources during development
  • CloudFront cache effectiveness not measured: No metrics on cache-to-origin ratio

Recommendations

  1. Review the AWS Customer Carbon Footprint Tool in the Billing console on a regular schedule
  2. Use AWS Compute Optimizer to right-size Lambda memory allocation
  3. Make local observability stack optional (docker compose --profile monitoring up)
  4. Track CloudFront cache hit ratio; target > 80% to reduce origin requests
  5. Consider arm64 Lambda architecture for ~20% energy reduction

Cross-Cutting Concerns

Missing Capabilities

| Capability | Status | Priority |
| --- | --- | --- |
| CI/CD pipeline | Not implemented | P0 |
| CloudWatch alarms | Not configured | P0 |
| Per-tenant rate limiting | Not implemented | P1 |
| Lambda DLQ | Not configured | P1 |
| WAF rules | Not attached | P1 |
| Multi-region failover | Not designed | P2 |
| Load testing | Not implemented | P2 |
| Schema versioning | Not addressed | P2 |
| Cost allocation tagging | Not configured | P3 |

Priority Action Items

Immediate (P0 -- production blockers):

  1. Create CI/CD pipeline with automated test, lint, and security scan
  2. Configure CloudWatch alarms for Lambda errors, DynamoDB throttling, and outbox backlog
  3. Enforce explicit CORS allowlist in production configuration

Short-term (P1 -- operational maturity):

  4. Implement per-tenant rate limiting via API Gateway usage plans
  5. Add Lambda DLQ for stream and event processor functions
  6. Attach WAF to CloudFront with managed OWASP rule set
  7. Write operational runbooks for top 5 failure scenarios

Medium-term (P2 -- scale readiness):

  8. Design multi-region active-passive failover strategy
  9. Implement load testing with representative traffic patterns
  10. Add event schema versioning with backwards compatibility guarantees
  11. Split API Lambda by endpoint type for independent scaling


Architecture Decisions Impacting Well-Architected

| Decision | Rationale | Trade-off |
| --- | --- | --- |
| In-memory caching over Redis | Zero latency, zero cost, suitable for Lambda lifecycle | Cache not shared across Lambda instances |
| DynamoDB over RDS | Serverless, auto-scaling, no connection management | Limited query flexibility, no JOINs |
| Single API Lambda | Simpler deployment, shared connection pools | Single bottleneck, coarse scaling |
| Transactional outbox over direct EventBridge | Guaranteed delivery, atomic with business writes | Additional DynamoDB write per event |
| Shadow DOM in SDK | Complete CSS/JS isolation from host site | Slightly larger bundle, no inherited styles |
| LocalStack for local dev | Full AWS parity without cloud costs | Community edition lacks some features (Streams) |

Review Cadence

This Well-Architected review should be updated:

  • Quarterly: Routine assessment of all six pillars
  • After major changes: New services, new AWS resources, architecture refactors
  • Before production launch: Full review with action items resolved to P1 level
  • After incidents: Post-mortem findings incorporated into relevant pillars
