
InterviewBee — Cloud Engineer Question Bank

FAANG-Level Interview Preparation | Senior · Staff · Principal


Question 1: Multi-Region Disaster Recovery Strategy

Difficulty: Elite | Role: Cloud Architect / SRE | Level: Staff / Principal | Company Examples: Netflix, Stripe, AWS, Google


The Question

Your e-commerce platform processes $4M/day in GMV and currently runs in a single AWS region (us-east-1). The CISO has mandated a 99.99% availability SLA with an RTO of 15 minutes and an RPO of 30 seconds following a complete regional outage. Your engineering team is 18 people. Design the multi-region DR strategy, justify the architecture choices against cost and operational complexity, and explain how you'd validate it without causing a production incident.


1. What Is This Question Testing?

  • Systems thinking — ability to reason about failure domains at regional scale, not just availability zones
  • Financial literacy — distinguishing between active-active, active-passive, and pilot light models as cost multipliers (1.2x to 2.5x baseline infrastructure cost)
  • Reliability engineering — translating RTO/RPO SLAs into concrete data replication lag budgets and automated failover choreography
  • Risk assessment — recognizing that multi-region DR introduces its own failure modes: split-brain, stale DNS TTLs, cross-region latency spikes, data divergence
  • Organizational thinking — calibrating architecture complexity against an 18-person team's operational capacity and on-call rotation depth
  • Infrastructure-as-code knowledge — understanding that DR is only auditable if runbooks are Terraform-codified and tested quarterly

2. Framework: Multi-Region Resilience Framework (MRRF)

  1. Assumption Documentation — Traffic profile, RTO/RPO targets, budget ceiling, team size, compliance requirements (PCI-DSS for payments)
  2. Constraint Analysis — Network egress costs (~$0.09/GB), cross-region replication lag ceilings, Route 53 health check propagation delay (~60s), human toil budget
  3. Tradeoff Evaluation — Active-Active vs. Active-Passive vs. Pilot Light against cost and operational complexity
  4. Hidden Cost Identification — NAT Gateway cross-AZ traffic, Aurora Global Database replication overhead, doubled ALB + WAF costs, doubled CloudFront origin capacity
  5. Risk Signals / Early Warning Metrics — Replication lag (target <5s), health check failure rate, cross-region latency p99, RDS replica read latency divergence
  6. Pivot Triggers — If replication lag exceeds 20s continuously for 5 minutes, trigger automated failover drill; if team on-call incidents exceed 8/month from DR complexity, re-evaluate to pilot light
  7. Long-Term Evolution Plan — Start with active-passive; instrument fully; validate quarterly via GameDay; evolve to active-active per-service as team scales

3. The Answer

Explicit Assumptions:

  • Traffic: ~46 req/s average, 180 req/s peak (Black Friday 4x surge)
  • Budget ceiling: $60,000–$80,000/month incremental DR cost (3–4% of GMV)
  • Team: 18 engineers, ~4-person platform team, 2-person on-call rotation
  • Compliance: PCI-DSS Level 1 (cardholder data must not leave approved regions)
  • Primary: us-east-1; DR: us-west-2 (lowest latency divergence, separate power grids)

Architecture Decision: Active-Passive with Warm Standby

Active-Active would achieve sub-1-minute RTO but introduces distributed transaction complexity across regions — especially critical for payment idempotency and inventory consistency. At 18 engineers, the operational overhead of managing cross-region write conflicts would consume 2–3 FTEs. Instead, deploy warm standby in us-west-2: all stateless tiers (ECS Fargate, ALB, API Gateway) pre-provisioned at 20% of production capacity. Auto Scaling groups configured to surge from 20% to 100% in ~8 minutes based on CloudWatch alarms triggered by Route 53 failover.

Data Layer Strategy

Aurora Global Database with us-west-2 as secondary cluster. Typical cross-region replication lag: 1–2 seconds measured, so the 30-second RPO SLA is covered with headroom; Aurora Global DB secondary clusters support <1 second RPO in practice. DynamoDB Global Tables (for session, cart, rate-limit data) with active-active replication at ~100ms lag. S3 Cross-Region Replication (CRR) for assets and order documents with versioning enabled. ElastiCache is not replicated cross-region in this design (Redis Global Datastore is an option but adds cost and failover complexity) — pre-warm caches on failover using CloudWatch Events-triggered Lambda warm-up scripts targeting the most frequently queried SKUs and user sessions.
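
A minimal sketch of the cache warm-up Lambda described above, assuming a Redis-protocol ElastiCache endpoint in the DR region and a hypothetical top_skus DynamoDB table that tracks the hottest keys (all names, environment variables, and key shapes are illustrative, not part of the original design):

    import os
    import json
    import boto3
    import redis  # assumes the 'redis' client library is packaged with the Lambda

    # Hypothetical configuration — names are illustrative.
    CACHE_HOST = os.environ["DR_CACHE_HOST"]            # ElastiCache endpoint in us-west-2
    SKU_TABLE = os.environ.get("SKU_TABLE", "top_skus")

    dynamodb = boto3.resource("dynamodb")
    cache = redis.Redis(host=CACHE_HOST, port=6379, socket_timeout=2)

    def handler(event, context):
        """Triggered on failover; pre-loads the hottest SKUs into the cold DR cache."""
        table = dynamodb.Table(SKU_TABLE)
        warmed = 0
        scan = table.scan(Limit=1000)  # top-N hot keys maintained by an upstream job
        for item in scan.get("Items", []):
            # Cache each SKU record under the same key scheme the application reads.
            cache.set(f"sku:{item['sku_id']}", json.dumps(item, default=str), ex=3600)
            warmed += 1
        return {"warmed_keys": warmed}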

DNS and Traffic Failover

Route 53 with latency-based routing + health checks on both regions. Health check interval: 10s, failure threshold: 3 consecutive failures (30s detection window). Use Route 53 Application Recovery Controller (ARC) for cell-based routing controls — this prevents split-brain where both regions think they're primary. ARC routing controls require explicit operator confirmation via API, preventing accidental dual-primary scenarios. TTL: 60 seconds on failover records; warn stakeholders that DNS propagation to all resolvers can take 120–180 seconds, pushing observed RTO to 8–12 minutes realistically.
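
A sketch of the health check and PRIMARY/SECONDARY failover record pair described above, using boto3 (hosted zone ID, domain names, and ALB endpoints are placeholders):

    import boto3

    route53 = boto3.client("route53")

    # Health check matching the parameters above: 10s interval, 3 failures ≈ 30s detection.
    hc_id = route53.create_health_check(
        CallerReference="primary-alb-hc-1",  # must be unique per request
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "alb.us-east-1.example.com",  # placeholder
            "ResourcePath": "/healthz",
            "RequestInterval": 10,
            "FailureThreshold": 3,
        },
    )["HealthCheck"]["Id"]

    # PRIMARY/SECONDARY failover records with a 60-second TTL.
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",  # placeholder
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": hc_id,
                "ResourceRecords": [{"Value": "alb.us-east-1.example.com"}]}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "alb.us-west-2.example.com"}]}},
        ]},
    )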

Cost Model

Primary region baseline: ~$38,000/month. DR warm standby at 20% capacity: ~$14,000/month additional (ECS Fargate, ALB, RDS Aurora secondary: ~$8,000, DynamoDB Global Tables replication: ~$2,500, S3 CRR egress: ~$800, Aurora Global DB replication: ~$1,200, Route 53 ARC + health checks: ~$400, NAT Gateway + VPC endpoints: ~$1,100). Total DR overhead: ~$14,000/month or ~37% uplift. Compared to active-active ($34,000/month additional = 90% uplift), active-passive saves ~$20,000/month — $240,000/year — while accepting a 10–12 minute RTO vs. <2 minute active-active RTO.

Validation Without Production Risk

Quarterly GameDay exercises using AWS Fault Injection Simulator (FIS): inject us-east-1 API Gateway failure, RDS primary unavailability, and full AZ loss separately. Each drill validates: automated failover completes within 15 min, no data loss beyond RPO, payment processing resumes on secondary, monitoring dashboards reflect correct active region. Shadow traffic replication: use Lambda edge functions to asynchronously duplicate 0.5% of production requests to us-west-2 (read path only) continuously — this validates the secondary stack handles real traffic patterns without data mutation risk. Monthly RPO drills: pause Aurora replication, wait 25 seconds, measure lag, resume — validates recovery point is within budget.

Failure Modes and Blast Radius

DNS failover delay: 2–3 minutes for Route 53 health check consensus + TTL propagation. Mitigation: pre-warm client DNS caches via a health check endpoint that clients poll every 30s. Cache cold start: ElastiCache in us-west-2 starts empty, causing 3–5x database load for 8–15 minutes. Mitigation: pre-warm Lambda triggered by ARC routing control flip. Aurora promotion time: promoting the secondary cluster to primary takes 60–120 seconds. Mitigation: test in the quarterly GameDay, ensure no long-running transactions during the failover window. Payment gateway failover: Stripe and Braintree have regional endpoints — validate the gateway failover order in configuration.

Recommendation

Deploy active-passive warm standby at $14,000/month incremental cost. This satisfies 15-minute RTO (realistically 10–12 minutes) and 30-second RPO with Aurora Global DB. Do not attempt active-active until team exceeds 35 engineers with dedicated distributed systems expertise. Encode entire DR topology in Terraform with GitOps CI/CD; DR infrastructure must be indistinguishable from production in code quality. Validate quarterly. Escalate to active-active only if business requires <2-minute RTO — at that point, cost and complexity tradeoffs must be re-evaluated at board level given the $240,000/year cost delta.

Early Warning Metrics:

  • Aurora replication lag — alert at 10s, page at 20s (hard SLA: 30s RPO)
  • Route 53 health check success rate — alert below 99.5% over 5 minutes
  • Cross-region latency p99 — alert if us-east-1 to us-west-2 exceeds 120ms (baseline ~65ms)
  • ElastiCache cache hit rate on DR cluster — if <20% during GameDay, warm-up scripts need tuning
  • ECS task startup time in us-west-2 — should scale from 20% to 80% capacity in under 8 minutes

4. Interview Score: 8.5 / 10

Why this demonstrates senior-level maturity: This answer demonstrates principal-engineer thinking by rejecting the naive active-active recommendation and instead sizing architecture to team capability. Quantifying the $240,000/year cost delta between active-active and active-passive, naming Aurora Global DB replication lag in seconds, identifying ElastiCache cold-start as a non-obvious blast radius, and prescribing Route 53 ARC to prevent split-brain — these reflect real production experience, not whiteboard architecture.

What differentiates it from mid-level thinking: A mid-level engineer would recommend active-active without costing it, describe Route 53 failover without acknowledging DNS TTL propagation delays, and omit the ElastiCache cold-start problem. They would not identify that 18 engineers cannot operationally sustain active-active complexity without dedicated platform staffing.

What would make it a 10/10: A 10/10 response would additionally detail the PCI-DSS audit scope implications of a secondary region (cardholder data flows must be explicitly scoped into the DR architecture), provide a concrete Terraform module structure for the DR VPC, and include a formal decision tree for the operator on-call during a real regional outage with documented go/no-go criteria for failover.



Question 2: Zero-Downtime Cloud Migration (On-Prem → Cloud)

Difficulty: Elite | Role: Cloud Engineer / Platform Engineer | Level: Senior / Staff | Company Examples: Goldman Sachs, Shopify, Airbnb, Twitter


The Question

A 12-year-old financial services platform running on-premises handles 2,000 transactions per second at peak. The stack includes Oracle RAC on bare metal, 400 Java microservices on VMware, a custom HAProxy load balancer configuration, and a co-located data center with 10-year contracts expiring in 18 months. You have been tasked with migrating 100% to AWS with zero downtime, compliance with SOC 2 Type II and PCI-DSS, and a hard budget of $8M for the migration program. Describe your end-to-end strategy, phasing, risk mitigation, and how you determine what NOT to migrate.


1. What Is This Question Testing?

  • Cloud architecture maturity — recognizing that lift-and-shift for Oracle RAC destroys the value proposition; understanding RDS, Aurora, or managed Oracle licensing traps
  • Risk assessment — identifying migration-induced data consistency windows, dual-write complexity, and the danger of schema divergence between on-prem and cloud during parallel operation
  • Financial literacy — calculating Oracle BYOL vs. License Included costs, VMware → EC2 rightsizing, and hidden migration costs (AWS DMS replication, Data Pipeline, DirectConnect provisioning: 45–90 day lead time)
  • Systems thinking — understanding that 400 microservices cannot be migrated simultaneously; designing a dependency graph and migration wave sequencing
  • Security awareness — SOC 2 and PCI-DSS require continuous compliance evidence during migration, not just at the end state; any intermediate architecture must be auditable
  • Organizational thinking — $8M budget for an 18-month program implies ~25–30 FTEs; scope management is as important as technical execution

2. Framework: Zero-Downtime Migration Model (ZDMM)

  1. Assumption Documentation — Inventory all 400 services by dependency tier, data residency, compliance scope, and traffic pattern
  2. Constraint Analysis — 18-month deadline, $8M budget (~$444K/month burn), DirectConnect 45–90 day provisioning lead time, Oracle licensing decisions
  3. Tradeoff Evaluation — Rehost (lift-and-shift) vs. Replatform vs. Refactor per service cluster
  4. Hidden Cost Identification — Oracle bring-your-own-license (BYOL) compliance, AWS DMS replication costs, dual-run cloud + on-prem costs during transition ($120K–$180K/month), egress fees during cutover
  5. Risk Signals / Early Warning Metrics — DMS replication lag, schema drift alerts, transaction error rate on migrated services, PCI scope expansion events
  6. Pivot Triggers — If Oracle migration to Aurora PostgreSQL encounters >2% data validation failures in UAT, pause and evaluate Oracle RDS BYOL instead
  7. Long-Term Evolution Plan — Post-migration: decommission on-prem hardware, renegotiate co-lo contract, migrate Oracle to Aurora over 6 additional months after baseline stability

3. The Answer

Explicit Assumptions:

  • 2,000 TPS peak, ~800 TPS average; data volume ~80TB Oracle, ~15TB application data
  • 18 months to data center exit; 6-month buffer built into plan (target 12-month migration)
  • $8M total: ~$2M infrastructure (cloud during dual-run), ~$3M labor (25 FTEs), ~$1.5M tooling/licensing, ~$1.5M contingency
  • DirectConnect 10Gbps provisioned in Month 1 (order immediately — 45-90 day lead time)
  • PCI-DSS scope: ~15% of services touch cardholder data; SOC 2 applies to all 400 services

Phase 0: Foundation and Discovery (Months 1-2)

Order AWS DirectConnect 10Gbps immediately — 45–90 day provisioning is the longest lead time in the program. Establish AWS Landing Zone using Control Tower: separate accounts for Production, Staging, Security, Shared Services, and Log Archive. Deploy AWS Transit Gateway for hub-and-spoke connectivity. Run AWS Application Discovery Service agents on all 400 VMware VMs to capture network dependencies, CPU/memory utilization, and traffic patterns. This dependency map is the foundation of wave sequencing — without it, migrating a service that has an undocumented gRPC dependency on an on-prem Oracle stored procedure will cause cascading failures. Establish Cloud Center of Excellence (CCoE) team: 4 engineers dedicated to platform, not product migration.

Service Tiering and 'What NOT to Migrate'

After dependency analysis, classify services into four buckets: (1) Cloud-native candidates — stateless Java services with no Oracle direct dependencies: 180 services, migrate in waves 1-4; (2) Replatform candidates — services with Oracle JDBC dependencies that can be re-targeted to Aurora PostgreSQL: 140 services after schema migration; (3) Deferred — 60 services with deep Oracle RAC features (partitioning, RAC interconnect, Oracle Text): migrate post-primary cutover with dedicated DBA team; (4) Retire — 20 legacy services with no active users confirmed by traffic analysis. DO NOT migrate mainframe batch jobs, SWIFT messaging gateway (regulatory complexity exceeds $8M budget scope), or the legacy reporting cluster (retire and replace with QuickSight + Redshift in parallel workstream).

Data Migration Strategy: Oracle RAC

Oracle RAC on bare metal to RDS Oracle BYOL is the path of least resistance for Phase 1, despite higher cost (~$28,000/month for RDS Oracle SE2 BYOL at production scale vs. ~$18,000/month Aurora PostgreSQL equivalent). Budget 6 months on RDS Oracle BYOL, then execute Oracle → Aurora PostgreSQL migration as a separate program. Attempting Oracle RAC → Aurora PostgreSQL in a single pass in 18 months for 80TB with 400 dependent services is high-risk and has caused program failures at similar-scale financial institutions. Use AWS DMS with Change Data Capture (CDC) for ongoing replication during cutover. DMS replication lag budget: <5 seconds during steady state, <30 seconds during peak. Validate with Striim or AWS DMS data validation rules — character set encoding issues, implicit Oracle type conversions, and timezone handling are the top causes of silent data corruption in Oracle → RDS migrations.
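
A sketch of the full-load-plus-CDC DMS task described above, using boto3 with row-level validation enabled (endpoint and replication instance ARNs, the schema name, and the task settings are placeholders):

    import json
    import boto3

    dms = boto3.client("dms")

    # Table mappings: replicate the application schema; rule names and values are illustrative.
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "app-schema",
            "object-locator": {"schema-name": "APPDATA", "table-name": "%"},
            "rule-action": "include",
        }]
    }

    dms.create_replication_task(
        ReplicationTaskIdentifier="oracle-to-rds-cdc",
        SourceEndpointArn="arn:aws:dms:...:endpoint:SRC",       # on-prem Oracle endpoint (placeholder)
        TargetEndpointArn="arn:aws:dms:...:endpoint:TGT",       # RDS Oracle endpoint (placeholder)
        ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",  # placeholder
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps(table_mappings),
        ReplicationTaskSettings=json.dumps({
            "ValidationSettings": {"EnableValidation": True},   # row-level validation per the text
            "Logging": {"EnableLogging": True},
        }),
    )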

Traffic Migration: Strangler Fig Pattern

Do not do a big-bang cutover. Use the Strangler Fig pattern: deploy an AWS-side API Gateway + ALB in front of migrated services. Route 5% of traffic to cloud via weighted Route 53 records, monitor for 72 hours, increment to 20%, 50%, 80%, 100% over 3–4 weeks per service wave. Maintain on-prem as fallback: Route 53 weight rollback takes <60 seconds. For stateful services, implement dual-write: write to both on-prem Oracle and RDS simultaneously for 2 weeks before read-side cutover, then validate record counts and checksums. Dual-write introduces synchronous latency overhead of 8–15ms at cross-datacenter roundtrip — benchmark against SLA before enabling.
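
A sketch of the dual-write validation step, assuming already-open DB-API connections to the on-prem Oracle source (e.g. python-oracledb) and the cloud target (e.g. psycopg2); the table names and id primary key are illustrative:

    import hashlib

    TABLES = ["orders", "payments", "ledger_entries"]  # illustrative subset

    def checksum(rows):
        """Order-independent checksum over primary keys so both sides can be compared."""
        digest = hashlib.sha256()
        for pk in sorted(str(r[0]) for r in rows):
            digest.update(pk.encode())
        return digest.hexdigest()

    def validate(onprem_conn, cloud_conn):
        """Per-table count and checksum comparison between the on-prem and cloud copies."""
        report = {}
        for table in TABLES:
            with onprem_conn.cursor() as c1, cloud_conn.cursor() as c2:
                c1.execute(f"SELECT id FROM {table}")
                c2.execute(f"SELECT id FROM {table}")
                rows1, rows2 = c1.fetchall(), c2.fetchall()
            report[table] = {
                "count_match": len(rows1) == len(rows2),
                "checksum_match": checksum(rows1) == checksum(rows2),
            }
        return report

Any non-matching table halts the read-side cutover for that service, per the dual-write error-rate trigger listed later in this answer.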

Cost Model and Budget Burn

Months 1–6: $280,000–$320,000/month (DirectConnect, Landing Zone, dual-run infrastructure, 25 FTEs). Months 7–12: $350,000–$400,000/month (peak dual-run with cloud at 50–80% traffic). Post-migration steady-state: $220,000/month AWS bill (EC2 EKS nodes for 400 services, RDS Oracle BYOL, ElastiCache, ALB, CloudFront, DataDog observability). On-prem cost elimination: ~$180,000/month hardware/co-lo savings. Net cloud infrastructure increase: ~$40,000/month — partially offset by VMware licensing elimination (~$25,000/month). Total program spend against $8M budget: $6.8M projected, $1.2M contingency preserved.

Compliance During Migration

PCI-DSS requires all cardholder data environments to be continuously auditable — not just at end state. This means: encrypt all DMS replication with TLS 1.2+, encrypt RDS at rest with KMS CMK, enforce VPC endpoint for all S3 access, deploy AWS Security Hub with PCI standard enabled from Day 1 of each new account, and ensure AWS Config rules flag any public S3 bucket or unencrypted EBS volume within 5 minutes. SOC 2 Type II evidence collection: use AWS CloudTrail, Config, and GuardDuty from the first day of cloud operation — the auditors will need 6 months of continuous evidence for Type II certification. Brief external auditors on migration architecture in Month 2 to avoid surprises at audit time.

Risk Mitigation

Top 3 risks: (1) Oracle licensing audit during migration — Oracle's BYOL compliance team actively audits customers migrating to cloud; engage Oracle license counsel before moving any Oracle workload. (2) DirectConnect capacity — 10Gbps may be insufficient for peak DMS replication plus production traffic simultaneously; model data transfer rates before cutover weekends. (3) Scope creep — 400 services will surface undocumented dependencies during migration; maintain a strict change freeze on the on-prem estate from Month 3 onward, no new features deployed on-prem.

Early Warning Metrics:

  • DMS replication lag — alert >5s, page >15s, consider cutover pause >30s
  • Dual-write error rate — any non-zero divergence between on-prem and RDS record counts triggers investigation halt
  • Budget burn rate — weekly review; if Month 6 spend exceeds $2.1M total, re-scope wave 3+
  • PCI scope creep events — any new service touching cardholder data must be reviewed by the security team before cloud deployment

4. Interview Score: 9 / 10

Why this demonstrates senior-level maturity: This answer demonstrates program-level thinking at Staff+ level: recognizing Oracle licensing as a non-technical blocker, calculating DirectConnect lead time as a critical path dependency, structuring a Strangler Fig migration rather than a big-bang, and explicitly defining what NOT to migrate. The dual-write pattern with checksum validation shows database engineering depth, not just cloud service familiarity.

What differentiates it from mid-level thinking: A mid-level engineer would propose 'lift-and-shift everything to EC2' without addressing Oracle RAC features, ignore DirectConnect lead times, attempt Oracle → Aurora in a single pass, and not budget for the dual-run cost overlap period. They would also miss the Oracle licensing audit risk that has derailed comparable migrations.

What would make it a 10/10: A 10/10 response would include a specific wave sequencing table with the 400 services organized by dependency tier and migration complexity score, a concrete DMS task configuration for the Oracle CDC setup, and a signed-off rollback trigger document specifying the exact conditions (error rate threshold, replication lag, data validation failure percentage) that would trigger a rollback to on-prem for each wave.



Question 3: Cloud Cost Explosion — $500K/Month AWS Bill Spike

Difficulty: Senior | Role: Cloud Engineer / FinOps Engineer | Level: Senior / Staff | Company Examples: Uber, Lyft, Snap, Coinbase


The Question

Your AWS bill has spiked from $180,000/month to $500,000/month over 23 days. You are the on-call cloud engineer. Leadership is asking for a root cause in 4 hours, a cost reduction plan in 24 hours, and a prevention framework in 72 hours. You have access to AWS Cost Explorer, CloudWatch, and the AWS Organizations master payer account. Describe your systematic investigation approach, the 12 most likely causes, your cost reduction playbook, and the architectural controls you'd implement to prevent recurrence.


1. What Is This Question Testing?

  • Financial literacy — reading Cost Explorer at resource, tag, and service level; understanding blended vs. unblended costs and reservation gaps
  • Systems thinking — correlating billing dimension spikes with deployment events, traffic patterns, and infrastructure changes
  • Security awareness — a cost spike can be evidence of a security breach: cryptomining EC2 instances, data exfiltration via S3 egress, or compromised IAM keys spinning up infrastructure
  • Risk assessment — distinguishing between a business-justified spike (product launch, seasonal traffic) vs. engineering error vs. security incident
  • Tradeoff analysis — aggressive cost cutting (terminate instances immediately) vs. service stability (verify before terminating)
  • Organizational thinking — communicating cost analysis to finance, engineering, and the C-suite simultaneously with different levels of technical depth

2. Framework: Cost Explosion Investigation Framework (CEIF)

  1. Assumption Documentation — Understand if any planned events could explain the spike: product launch, Black Friday, new service deployment, team onboarding with AWS access
  2. Constraint Analysis — 4-hour root cause SLA, must not break production while cutting costs, teams in multiple time zones need visibility
  3. Tradeoff Evaluation — Speed of cost reduction vs. risk of accidental production impact from hasty resource termination
  4. Hidden Cost Identification — Data transfer egress is notoriously invisible until month-end; NAT Gateway per-GB charges; DynamoDB on-demand pricing with scan-heavy queries
  5. Risk Signals — Determine within the first 30 minutes if this could be a security incident (GuardDuty findings, CloudTrail unauthorized API calls, unfamiliar EC2 instance types)
  6. Pivot Triggers — If GuardDuty shows cryptomining or data exfiltration signals in the first 30 minutes, escalate to security incident response, not cost investigation
  7. Long-Term Evolution Plan — Implement AWS Budgets + anomaly detection, tag enforcement policies, cost allocation by team, and quarterly FinOps reviews

3. The Answer

Explicit Assumptions:

  • No planned major product launches or traffic events in the past 23 days
  • Multiple engineering teams have AWS IAM access; no centralized tagging enforcement
  • Organization has ~15 AWS accounts; billing consolidated to the master payer
  • No existing cost anomaly detection or budget alerts were configured
  • GuardDuty is enabled (critical first check)

First 30 Minutes: Security or Engineering Error?

Before opening Cost Explorer, check GuardDuty findings first. A $320,000 cost spike in 23 days is a red flag for cryptomining (EC2 P3/G4 GPU instances running XMRig), data exfiltration (S3 egress spikes), or a compromised IAM key launching infrastructure in unfamiliar regions. If GuardDuty shows CryptoCurrency:EC2/BitcoinTool.B or UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration findings — stop the cost investigation and escalate to security incident response immediately. Containment supersedes cost analysis. If GuardDuty is clean, proceed to Cost Explorer.
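
A minimal boto3 sketch of that first check — pull any HIGH-severity GuardDuty findings before treating this as a cost problem (detector discovery and the severity cutoff are illustrative):

    import boto3

    guardduty = boto3.client("guardduty")

    def high_severity_findings():
        """Returns the types of HIGH+ severity findings — the go/no-go check before
        treating the spike as an engineering problem rather than a security incident."""
        detector_id = guardduty.list_detectors()["DetectorIds"][0]
        finding_ids = guardduty.list_findings(
            DetectorId=detector_id,
            FindingCriteria={"Criterion": {
                "severity": {"GreaterThanOrEqual": 7},   # HIGH and above
            }},
        )["FindingIds"]
        if not finding_ids:
            return []
        findings = guardduty.get_findings(
            DetectorId=detector_id,
            FindingIds=finding_ids[:50],   # get_findings accepts up to 50 IDs per call
        )["Findings"]
        # Cryptomining or credential-exfiltration types mean: escalate to incident response.
        return [f["Type"] for f in findings]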

Cost Explorer Investigation Protocol

Open AWS Cost Explorer in the master payer account. Set date range: last 30 days, daily granularity. Group by Service first — identify which service line caused the spike. Then Group by Linked Account — identify which account. Then Group by Usage Type — identify the specific billable dimension (e.g., DataTransfer-Out-Bytes vs. BoxUsage:m5.2xlarge). Then Group by Tag (team/environment) if tags exist. This 4-level drill-down typically identifies the offending resource type within 20 minutes.
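
The same drill-down can be scripted; a sketch with the Cost Explorer API, where grouping by tag would use {"Type": "TAG", "Key": "team"} as the fourth level (the metric and window are as described above):

    import boto3
    from datetime import date, timedelta

    ce = boto3.client("ce")  # run in the master payer account

    def daily_spend_by(dimension, days=30):
        """One level of the drill-down: SERVICE, then LINKED_ACCOUNT, then USAGE_TYPE."""
        end = date.today()
        start = end - timedelta(days=days)
        resp = ce.get_cost_and_usage(
            TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
            Granularity="DAILY",
            Metrics=["UnblendedCost"],
            GroupBy=[{"Type": "DIMENSION", "Key": dimension}],
        )
        return resp["ResultsByTime"]

    # Usage: walk the drill-down described above, narrowing the filter at each level.
    for level in ["SERVICE", "LINKED_ACCOUNT", "USAGE_TYPE"]:
        results = daily_spend_by(level)
        # ...inspect which group's daily cost jumps around the spike date, then add a
        # Filter for that group before moving to the next level.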

The 12 most common root causes, in descending frequency: (1) NAT Gateway data transfer — teams routing all traffic through a NAT GW instead of VPC endpoints for S3/DynamoDB; $0.045/GB accumulates explosively with data-intensive workloads; (2) EC2 over-provisioning after a scale event — Auto Scaling Group scale-out never scaled back in due to misconfigured scale-in policies; (3) S3 egress — a new analytics pipeline reading from S3 to an on-prem Spark cluster or to a non-CloudFront endpoint; (4) RDS Multi-AZ storage explosion — automated backups with retention set to 35 days on a 10TB database; (5) DynamoDB on-demand pricing with table scan — a developer switched from Provisioned to On-Demand capacity mode on a table receiving full-table scans; (6) CloudWatch Logs — application debug logging enabled in production sending 50GB/day at $0.50/GB ingestion + $0.03/GB storage; (7) EC2 Spot interruption fallback to On-Demand — Spot capacity dried up and Auto Scaling Group fallback to On-Demand launched 40x m5.2xlarge instances at $0.384/hr; (8) EKS node group over-scaling — HPA misconfiguration causing runaway pod replication; (9) Data Sync or DMS replication task running continuously without termination; (10) Elastic IP charges for unattached IPs after a deployment; (11) AWS Support plan uplift — someone upgraded to Enterprise Support unexpectedly; (12) Rekognition / Textract / SageMaker endpoint left running — ML endpoints are $0.046/hr for ml.m5.xlarge × 24hr × 30 days = $33.12 each, multiplied by forgotten endpoints.

24-Hour Cost Reduction Playbook

Immediate actions (0-4 hours): Pull the Cost Explorer Group By Service + Usage Type report for the spike period. Identify the top 3 cost drivers. For each: find the specific Resource ID using CloudWatch Metrics or Cost Allocation Tags, validate it is not serving production traffic via ALB access logs or ECS service status, then terminate or resize. Do not terminate blindly — a $50,000/month EC2 cluster may be a legitimate production batch job. Use AWS Instance Scheduler to enforce off-hours shutdown for non-production environments (typically saves 65% of Dev/Staging EC2 costs). Review the Savings Plans coverage report: if more than 40% of steady-state compute is running as uncovered On-Demand, purchase Compute Savings Plans immediately — Savings Plans offer up to 66% off On-Demand at the deepest commitment, and even a 1-year no-upfront Compute SP typically yields roughly 25–30% off EC2 On-Demand while retaining flexibility across instance families. Estimated 48-hour cost reduction: $80,000–$140,000/month, depending on root cause.

72-Hour Prevention Framework

Deploy AWS Budgets with anomaly detection: set an account-level budget at $210,000/month (roughly 17% above the $180,000/month baseline), SNS alert at 80% ($168,000), hard alert at 100% ($210,000). Enable AWS Cost Anomaly Detection with service-level monitors — this uses ML to detect unusual spend patterns and alerts within hours, not days. Enforce mandatory resource tagging using AWS Config rule required-tags with automatic remediation: untagged resources get tagged 'team=unidentified' and trigger a Slack alert to the team owning the account. Implement VPC endpoints for S3 and DynamoDB in all VPCs to eliminate NAT Gateway data transfer charges for internal AWS traffic — this alone saves $0.045/GB for all S3 traffic. Enforce Auto Scaling scale-in policies: all ASGs must have scale-in cooldown ≤ scale-out cooldown, and scale-in policies triggered at CPU <30% for 10 consecutive minutes. Implement EC2 right-sizing recommendations from AWS Compute Optimizer on a monthly basis — average 15–25% compute cost reduction from right-sizing alone.
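
A sketch of the budget and the service-level anomaly monitor via boto3 (account ID and SNS topic ARN are placeholders; an anomaly subscription with a dollar-impact threshold and subscribers would be attached to the returned monitor ARN):

    import boto3

    ACCOUNT_ID = "123456789012"  # payer account (placeholder)

    budgets = boto3.client("budgets")
    ce = boto3.client("ce")

    # Monthly cost budget with the 80% alert threshold described above.
    budgets.create_budget(
        AccountId=ACCOUNT_ID,
        Budget={
            "BudgetName": "org-monthly-cost",
            "BudgetLimit": {"Amount": "210000", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,                 # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{
                "SubscriptionType": "SNS",
                "Address": "arn:aws:sns:us-east-1:123456789012:cost-alerts",  # placeholder
            }],
        }],
    )

    # Service-level anomaly monitor for Cost Anomaly Detection.
    monitor_arn = ce.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": "service-level-anomalies",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )["MonitorArn"]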

Communication Protocol

4-hour update to CFO/CTO: "Root cause identified as [X]. Estimated monthly impact: $320,000. Immediate actions taken: [Y]. No production impact. Full RCA in 24 hours." 24-hour update: full RCA with timeline, financial impact, and immediate savings actions taken. 72-hour update: prevention framework proposal with estimated ROI (if preventing recurrence saves $150,000/month, AWS Budgets + tagging enforcement costs 40 engineering hours = $8,000 — 18x ROI). Always separate the 'what happened' from 'who is responsible' in the first report — blame assignment derails investigation and alienates the teams you need to cooperate for root cause.

Early Warning Metrics:

  • Daily spend alert — AWS Budgets SNS notification if daily spend exceeds 130% of the 7-day rolling average
  • Service-level anomaly — Cost Anomaly Detection with $5,000 impact threshold per service per day
  • NAT Gateway data transfer — CloudWatch metric BytesOutToDestination alert at 100GB/day per NAT GW
  • GuardDuty high-severity findings — any CryptoCurrency or UnauthorizedAccess finding triggers PagerDuty P1

4. Interview Score: 9 / 10

Why this demonstrates senior-level maturity: The response leads with security triage before cost analysis — a hallmark of senior production experience where cost spikes and security incidents are correlated. Enumerating 12 specific root causes with their billing mechanisms (NAT Gateway per-GB, DynamoDB on-demand scan, Spot fallback to On-Demand) demonstrates deep AWS cost modeling knowledge, not generic FinOps buzzwords. The communication protocol section shows organizational maturity.

What differentiates it from mid-level thinking: A mid-level engineer would open Cost Explorer immediately without checking GuardDuty, recommend 'enable Savings Plans' as the primary fix without identifying the root cause, and lack the structured 4-tier drill-down (Service → Account → Usage Type → Tag) approach. They would also propose solutions that could accidentally terminate production resources.

What would make it a 10/10: A 10/10 response would include specific Cost Explorer filter queries for each of the 12 root causes, a concrete AWS Config conformance pack YAML for tagging enforcement, and a worked example of how to use S3 Storage Lens to identify specific buckets driving egress costs — with actual per-bucket cost attribution.



Question 4: Kubernetes Scaling Failure Under Traffic Surge

Difficulty: Elite | Role: Platform / Site Reliability Engineer | Level: Senior / Staff | Company Examples: Spotify, DoorDash, Robinhood, Cloudflare


The Question

Your EKS production cluster serves 3,000 RPS steady state. During a viral marketing event, traffic spikes to 22,000 RPS over 4 minutes. Within 90 seconds of the surge, your HPA begins scaling pods, but latency climbs from 45ms p99 to 8,200ms p99. 40% of pods enter CrashLoopBackOff. The node autoscaler is provisioning new nodes, but they're not becoming Ready. Your SLO is 99.9% success rate at <500ms p99. You are paged at 2 a.m. Walk through your triage, root cause hypothesis tree, and remediation steps — and explain what architectural changes would have prevented this.


1. What Is This Question Testing?

  • Reliability engineering — understanding the HPA → CA (Cluster Autoscaler) → node provisioning → pod scheduling → readiness probe chain and where each can fail independently
  • Systems thinking — tracing the failure cascade from surge detection through pod startup failure to node unavailability
  • Cloud architecture maturity — knowing the difference between EKS Managed Node Groups, Karpenter, and Fargate for burst scaling characteristics
  • Risk assessment — distinguishing between pod-level failures (OOM, readiness probe timeout, init container failure) and cluster-level failures (node capacity, ENI limits, instance type exhaustion)
  • Tradeoff analysis — immediate blast radius mitigation (load shedding, circuit breakers) vs. root cause fix during live incident
  • Infrastructure-as-code knowledge — understanding that Kubernetes resource requests/limits, HPA configuration, and readiness probe timeouts are code artifacts that caused this incident

2. Framework: Kubernetes Surge Failure Response Tree (KSFRT)

  1. Assumption Documentation — Identify if a surge was expected or unexpected, whether Karpenter/CA is configured, and instance type availability in the AZ
  2. Constraint Analysis — 2 a.m. on-call, limited blast radius visibility, need to restore service before full RCA
  3. Tradeoff Evaluation — Load shed immediately (drop requests gracefully) vs. wait for nodes to come up (risk prolonged degradation)
  4. Hidden Cost Identification — ENI limits per instance type, EC2 instance type exhaustion in AZ, ALB connection draining timeout causing retry storms
  5. Risk Signals — New nodes NotReady for >5 minutes, pod pending >3 minutes after node provisioning, init container failure rate, OOM kill events
  6. Pivot Triggers — If nodes are NotReady due to instance type exhaustion in the AZ, broaden the CA's instance-type diversification or manually provision in an alternate AZ
  7. Long-Term Evolution Plan — Pre-warming strategy, Karpenter with spot + on-demand fallback, KEDA for event-driven scaling, VPA for right-sizing, load testing quarterly

3. The Answer

Explicit Assumptions:

  • EKS 1.29, Managed Node Groups, Cluster Autoscaler (not Karpenter)
  • HPA configured on CPU utilization (target 60%), min 10 pods, max 80 pods
  • Application pods: 2 CPU / 4GB memory requests; 4 CPU / 8GB limits
  • Node type: m5.2xlarge (8 vCPU, 32GB) — max ~7 pods per node with current requests
  • VPC has sufficient subnets, but pod ENI allocation has not been verified

First 5 Minutes: Parallel Triage Protocol

Do not start with logs. Start with blast radius metrics. Open three browser tabs simultaneously: (1) Datadog/CloudWatch: p99 latency trend + error rate + pod count; (2) kubectl get nodes -o wide — how many nodes are Ready vs. NotReady; (3) kubectl get pods -n production | grep -v Running — count CrashLoopBackOff pods and note which deployments. This 3-screen snapshot gives you the failure scope in 90 seconds: Is it a pod-level failure (CrashLoopBackOff), a scheduling failure (Pending), or a node-level failure (NotReady)?
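
The same three-view snapshot can be scripted; a sketch using the official Kubernetes Python client (the namespace and the app label key are assumptions):

    from kubernetes import client, config

    config.load_kube_config()   # or load_incluster_config() from a debug pod
    v1 = client.CoreV1Api()

    # Node-level view: how many nodes are Ready vs. NotReady?
    not_ready = []
    for node in v1.list_node().items:
        ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
        if ready != "True":
            not_ready.append(node.metadata.name)

    # Pod-level view: which apps are crash-looping or stuck Pending?
    broken = {}
    for pod in v1.list_namespaced_pod("production").items:
        waiting = [
            cs.state.waiting.reason
            for cs in (pod.status.container_statuses or [])
            if cs.state.waiting
        ]
        if pod.status.phase != "Running" or "CrashLoopBackOff" in waiting:
            app = (pod.metadata.labels or {}).get("app", "unknown")
            broken.setdefault(app, []).append(pod.metadata.name)

    print("NotReady nodes:", not_ready)
    print("Broken pods by app:", {k: len(v) for k, v in broken.items()})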

Hypothesis Tree

Branch 1 — Nodes are NotReady: CA provisioned nodes, but they're not joining. Sub-causes: (a) EC2 instance type exhausted in the AZ — check EC2 service health dashboard for capacity issues; switch to m5.4xlarge or add us-east-1c to node group; (b) VPC CNI (aws-node) DaemonSet failure — pods cannot start without CNI; check aws-node pod logs; (c) UserData/bootstrap script failure — node joined but kubelet failed to register; check EC2 Systems Manager Session Manager logs; (d) ENI limit reached — each m5.2xlarge supports max 3 ENIs × 15 IPs = 45 pod IPs maximum; if VPC CIDR is exhausted, pods cannot get IPs.

Branch 2 — Nodes are Ready but pods are CrashLoopBackOff: (a) OOM kill — pod memory limit too low for surge traffic; check kubectl describe pod and look for OOMKilled exit code; (b) Readiness probe timeout — application takes 45 seconds to warm up JVM / load caches, but readinessProbe failureThreshold × periodSeconds = 30 seconds; pod marked unhealthy before ready; (c) Init container failure — config map or secret mount failing due to rate limiting on AWS Secrets Manager API (happens during mass pod startup — 80 pods × 3 secrets = 240 API calls in <60s, hitting throttle limit); (d) Liveness probe too aggressive — killing pods that are alive but slow during startup.

Branch 3 — Pods are running but latency is 8,200ms: Application is overwhelmed. Connection pool exhausted on upstream RDS or Redis. Database connection limit hit — RDS MySQL default max_connections for db.r5.2xlarge is 1000; 80 pods × 20 connections each = 1,600 connections. This is the most likely cause of combined CrashLoopBackOff + extreme latency.

Immediate Remediation (Live Incident)

Step 1: Implement load shedding at the ALB level. Use ALB Target Group deregistration with connection draining to shed 30% of traffic — return 503 with a Retry-After header. This is preferable to letting requests queue and time out with 8,200ms p99. Step 2: If the CrashLoopBackOff is confirmed as OOMKilled — kubectl patch deployment [app] -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"12Gi"}}}]}}}}' — temporarily raise limits without waiting for a CI/CD pipeline deploy. Step 3: If RDS connection exhaustion is confirmed — deploy RDS Proxy immediately (RDS Proxy multiplexes connections, allowing 80 pods × 20 connections = 1,600 connections funneled through RDS Proxy's 200-connection pool to RDS). RDS Proxy provisioning: ~3 minutes. Step 4: kubectl rollout restart deployment [app] for the CrashLoopBackOff deployments after applying the above fixes.

Architectural Prevention

Replace Cluster Autoscaler with Karpenter: Karpenter provisions nodes in 60–90 seconds vs. CA's 3–5 minutes, uses Spot + On-Demand fallback automatically, and can select instance types dynamically based on pod resource requirements. This alone reduces node provisioning time from 5 minutes to 90 seconds — sufficient to absorb the 4-minute surge ramp. Pre-warming strategy: configure HPA with behavior.scaleUp.stabilizationWindowSeconds: 0 and selectPolicy: Max to maximize scale-out speed. Maintain a 'warm pool' of 20% excess capacity during business hours using scheduled scaling.

Fix readiness probes: separate startupProbe (longer timeout for JVM warmup) from readinessProbe (strict, fast). startupProbe failureThreshold: 30, periodSeconds: 5 (= 150 seconds max startup time); readinessProbe failureThreshold: 3, periodSeconds: 5 (= 15 seconds). Implement KEDA for event-driven scaling from SQS queue depth or Kafka lag — CPU-based HPA cannot predict traffic surges that arrive faster than the metric reporting interval (30–60 seconds). Deploy RDS Proxy as a standard pattern — not as a crisis response. Implement circuit breakers (Resilience4j for Java) with connection pool limits per pod: max 25 connections/pod regardless of traffic, preventing RDS connection exhaustion.

VPC CNI ENI Limit Deep Dive

This is the most commonly missed root cause. On m5.2xlarge: 3 ENIs × 15 IPv4 addresses per ENI = 45 pod IP addresses maximum. At 80 pods across 12 nodes, average 6.7 pods/node — within limits. However, if CA placed pods unevenly (10+ pods on some nodes), those nodes hit ENI limits. Solution: Enable VPC CNI prefix delegation — each ENI can assign /28 CIDR blocks (16 IPs), multiplying capacity by 16x. Additionally, configure pod topology spread constraints to ensure even distribution across nodes and AZs.
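
A worked version of the capacity arithmetic above, treating the ENI and per-ENI IP figures quoted in this section as assumptions for the node type:

    def pod_ip_capacity(enis, ips_per_eni, prefix_delegation=False):
        """Approximate pod IP capacity per node, following the simplified model in the
        text: ENIs x secondary IPs, multiplied by 16 when /28 prefix delegation is on."""
        per_eni = ips_per_eni * (16 if prefix_delegation else 1)
        return enis * per_eni

    # Figures from the discussion above (assumptions for this node type):
    print(pod_ip_capacity(3, 15))                          # 45  — base VPC CNI behavior
    print(pod_ip_capacity(3, 15, prefix_delegation=True))  # 720 — with prefix delegation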

Early Warning Metrics:

  • HPA replication lag — time from traffic surge detection to first new pod becoming Ready; alert >4 minutes
  • Pending pod count — alert if >5 pods pending for >2 minutes (indicates node provisioning lag)
  • RDS connections — alert at 80% of max_connections, page at 90%
  • OOMKill events — any OOMKill in production triggers a right-sizing review within 24 hours
  • Node NotReady duration — alert if any node is NotReady for >3 minutes

4. Interview Score: 9 / 10

Why this demonstrates senior-level maturity: The structured hypothesis tree (nodes NotReady → pods CrashLoopBackOff → pods running but slow) with specific sub-causes and kubectl commands demonstrates real incident triage maturity. Identifying RDS connection exhaustion as the most likely combined cause, and the VPC CNI ENI prefix delegation fix, reflects production Kubernetes experience at scale that goes far beyond CKAD certification knowledge.

What differentiates it from mid-level thinking: A mid-level engineer would describe HPA and Cluster Autoscaler correctly but miss the ENI limit, Secrets Manager API throttling during mass pod startup, and the JVM warmup vs. readiness probe mismatch. They would not immediately suggest RDS Proxy as a 3-minute fix during the incident, and would not propose Karpenter as a structural improvement.

What would make it a 10/10: A 10/10 response would include specific kubectl commands for the entire triage sequence, a concrete Karpenter NodePool YAML configuration, and a worked example of calculating ENI limits for the specific instance type — demonstrating that ENI capacity planning is part of cluster sizing design, not an afterthought.



Question 5: IAM Misconfiguration Causing Security Breach

Difficulty: Elite | Role: Cloud Security Engineer / SRE | Level: Senior / Staff | Company Examples: Capital One, Twilio, LastPass, Okta


The Question

Your security team has detected that an attacker has exfiltrated 2.3TB of customer PII from S3 over the past 11 days. CloudTrail shows the access originated from an EC2 instance role in your production account. GuardDuty shows findings of type Exfiltration:S3/MaliciousIPCaller.Custom and Exfiltration:IAMUser/AnomalousBehavior. Initial analysis shows the EC2 instance role has s3:GetObject and s3:ListBucket permissions on * (all buckets). You are the incident commander. Describe your containment strategy, forensic investigation approach, root cause analysis, and the IAM governance framework you'd implement to prevent this class of breach.


1. What Is This Question Testing?

  • Security awareness — understanding the principle of least privilege as an architectural guarantee, not a policy document
  • Risk assessment — calculating blast radius: which customer data was exposed, what regulatory obligations are triggered, and what is the timeline for breach notification
  • Systems thinking — tracing the attack chain: EC2 compromise → IAM credential exfiltration → S3 access with legitimate credentials → data exfiltration to external IP
  • Organizational thinking — breach notification obligations (GDPR 72-hour, CCPA, state breach laws), legal hold requirements, executive and regulatory communication
  • Cloud architecture maturity — knowing that S3:* on * is a Level 1 misconfiguration that should be caught by AWS Config rules before production
  • Reliability engineering — containing the breach without destroying forensic evidence or breaking production services

2. Framework: Breach Containment and IAM Governance Framework (BCIGF)

  1. Assumption Documentation — Determine breach timeline, data classification of exfiltrated S3 buckets, and regulatory jurisdiction of affected customers
  2. Constraint Analysis — Must contain breach without destroying CloudTrail/VPC Flow Log forensic evidence; legal hold requirements; regulatory notification timelines
  3. Tradeoff Evaluation — Terminate compromised EC2 instance immediately (destroys volatile memory forensics) vs. isolate in quarantine VPC (preserves evidence, but breach continues)
  4. Hidden Cost Identification — Breach notification costs ($5–$100 per affected individual), regulatory fines (GDPR: 4% of global revenue), forensic investigation firm fees ($50K–$500K), class action legal costs
  5. Risk Signals — GuardDuty findings escalation, additional EC2 instances with the same role making external API calls, new IAM users or access keys created in the account
  6. Pivot Triggers — If the attacker has pivoted to additional AWS accounts via role assumption chains, escalate to full AWS Organizations lockdown
  7. Long-Term Evolution Plan — Zero-trust IAM architecture, AWS IAM Access Analyzer deployed across all accounts, quarterly IAM permission reviews, mandatory SCP guardrails

3. The Answer

Explicit Assumptions:

  • 2.3TB exfiltrated over 11 days; S3 buckets contain PII (names, emails, SSNs) for ~800,000 customers
  • GDPR and CCPA apply (customers in the EU and California); 72-hour GDPR notification clock started when detection occurred
  • EC2 instance role: arn:aws:iam::123456789012:role/app-prod-role with AdministratorAccess or s3:* on Resource: *
  • GuardDuty enabled; CloudTrail enabled with S3 data events logging (critical — without data events, GetObject calls are not logged)
  • Legal counsel and DPO have been notified within the first hour

Containment: First 30 Minutes

Do not terminate the EC2 instance — yet. Termination destroys volatile memory (bash history, network connections, running processes) that forensic analysis needs. Instead: (1) Attach a restrictive IAM inline policy to the EC2 role that denies all S3 actions immediately — this stops exfiltration while preserving the instance. Use the explicit Deny override: {"Effect": "Deny", "Action": "s3:*", "Resource": "*"} added as an inline policy; an explicit Deny overrides every Allow, including those in managed policies. (2) Isolate the instance at the network layer: swap its security groups for a quarantine group with no outbound rules (security groups deny by default) and inbound SSH only from the incident response team's IP. This cuts the C2 channel while preserving the instance. (3) Revoke the role's active sessions — attach the standard deny policy keyed on aws:TokenIssueTime, as the IAM console's 'Revoke active sessions' does; temporary STS credentials already issued remain valid until they expire, which is why the explicit Deny in step (1) is what actually stops in-flight access. (4) Enable S3 Block Public Access on all buckets not already blocked. (5) Check CloudTrail for any AssumeRole calls from this role to other accounts — if the attacker has pivoted to other AWS accounts, initiate organizational lockdown.
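
A sketch of containment steps (1) and (2) with boto3, assuming a pre-created quarantine security group (the role name is from the finding above; the instance and group IDs are placeholders):

    import json
    import boto3

    iam = boto3.client("iam")
    ec2 = boto3.client("ec2")

    ROLE_NAME = "app-prod-role"            # from the finding above
    INSTANCE_ID = "i-0abc123example"       # compromised instance (placeholder)

    # (1) Explicit Deny on all S3 actions — overrides every Allow attached to the role.
    iam.put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName="ir-deny-s3-exfiltration",
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Deny", "Action": "s3:*", "Resource": "*"}],
        }),
    )

    # (2) Swap the instance onto a quarantine security group
    #     (no outbound rules, IR-team ingress only — pre-created; placeholder ID).
    ec2.modify_instance_attribute(
        InstanceId=INSTANCE_ID,
        Groups=["sg-quarantine-example"],
    )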

Forensic Investigation Protocol

Create an EBS snapshot of the compromised EC2 instance immediately. This snapshot is your forensic image. Do not mount it in the production account — copy it to a dedicated forensic account with no internet access. In the forensic account, mount the snapshot read-only and analyze: /var/log/auth.log for unauthorized SSH access, bash history for commands run, and network connections at the time of compromise. CloudTrail investigation: query CloudTrail Lake or Athena for all S3:GetObject events by the EC2 instance role in the past 30 days. Identify which buckets were accessed, when, and the volume of objects retrieved. S3 Server Access Logs (if enabled) provide object-level access with client IP — critical for determining if exfiltration destination IPs match known threat actor infrastructure. VPC Flow Logs: identify outbound traffic from the EC2 instance to external IPs during the 11-day window.

Root Cause Analysis

The IAM misconfiguration is the direct cause: s3:GetObject + s3:ListBucket on Resource: * grants access to every S3 bucket in the account. This is a Level 1 misconfiguration that should have been caught by: (1) AWS IAM Access Analyzer — would flag overly permissive policies at creation time; (2) AWS Config rules s3-bucket-public-access-prohibited; (3) A pre-commit Terraform policy scan (e.g., Checkov, tfsec) that blocks s3:* on * from being deployed.

The attack chain likely: (a) EC2 instance compromised via SSRF vulnerability in the application (IMDSv1 enabled — attacker sent HTTP request to 169.254.169.254/latest/meta-data/iam/security-credentials/ and received the role credentials); (b) Attacker used credentials to enumerate S3 buckets; (c) Identified buckets with PII naming convention; (d) Exfiltrated via aws s3 sync to an attacker-controlled bucket or external HTTP endpoint. IMDSv1 is the root enabling condition — it allows any SSRF vulnerability to become a credential exfiltration. Mitigation: require IMDSv2 (token-based metadata API) on all EC2 instances.
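
A sketch of the IMDSv2 enforcement described above for existing instances, using the EC2 ModifyInstanceMetadataOptions API (instance IDs are placeholders; the fleet-wide guardrail is the SCP mentioned in the next section):

    import boto3

    ec2 = boto3.client("ec2")

    def require_imdsv2(instance_ids):
        """Flips each instance to token-required metadata access (IMDSv2 only),
        closing the SSRF-to-credential-theft path described above."""
        for instance_id in instance_ids:
            ec2.modify_instance_metadata_options(
                InstanceId=instance_id,
                HttpTokens="required",        # reject IMDSv1 (no-token) requests
                HttpPutResponseHopLimit=1,    # keep credentials from leaking past the host
                HttpEndpoint="enabled",
            )

    # Usage: require_imdsv2(["i-0abc123example"])   # placeholder instance ID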

IAM Governance Framework

Implement across all accounts within 72 hours of containment: (1) Service Control Policies (SCPs) at Organization root: deny s3:* where Resource does not match a specific ARN prefix; deny CreateRole where AssumeRolePolicyDocument contains Principal: {AWS: *}; deny PutRolePolicy for any policy containing Resource: *. (2) AWS IAM Access Analyzer: enable in every account with Organization-level aggregation. Any externally accessible IAM entity triggers a P1 alert. (3) Require IMDSv2 via SCP: deny RunInstances if HttpTokens is not required. (4) Terraform policy-as-code: integrate Checkov into the CI/CD pipeline with blocking checks for any policy that combines Effect: Allow, Action: s3:*, and Resource: *, and for any managed-policy attachment of AdministratorAccess to service roles. (5) Quarterly IAM permission reviews: use IAM Access Advisor's last-accessed data to identify and remove permissions not used in 90 days.

Regulatory and Breach Notification

GDPR Article 33: notify the relevant supervisory authority within 72 hours of becoming aware of the breach. CCPA breach notification: notify affected California residents 'in the most expedient time possible' — typically interpreted as 30–45 days. Engagement of an external forensic firm (Mandiant, CrowdStrike, or Palo Alto Unit 42) is standard for PII breaches of this scale. Estimated total breach cost for 800,000 records: notification and credit monitoring services ($25–$50/affected individual = $20M–$40M), regulatory fines (GDPR 2–4% of global revenue), class action exposure, forensic investigation ($200K–$500K), reputational damage (share price impact for public companies is typically 3–7% in first 90 days).

Early Warning Metrics:

  • IAM Access Analyzer findings — any new external principal finding triggers P1 within 15 minutes
  • GuardDuty S3 exfiltration findings — any finding severity HIGH triggers automated S3 Block Public Access enforcement
  • Unusual S3 data transfer volume — CloudWatch alarm on BucketSizeBytes + NumberOfObjects rate of change >500% week-over-week
  • IMDSv1 usage — the EC2 MetadataNoToken CloudWatch metric (counts IMDSv1 calls) triggers a P2 alert to the owning team for remediation within 5 days

4. Interview Score: 9.5 / 10

Why this demonstrates senior-level maturity: This answer is exceptional in sequencing containment before forensics, knowing that inline policy Deny overrides managed policy Allow, and identifying IMDSv1 + SSRF as the attack vector rather than treating 'misconfigured IAM' as the root cause. The regulatory dimension (GDPR 72-hour clock, CCPA notification) and cost estimation ($20M–$40M) demonstrate that this engineer operates at the business and legal level, not just the technical level.

What differentiates it from mid-level thinking: A mid-level engineer would terminate the EC2 instance immediately (destroying forensic evidence), not think about IMDSv1 SSRF as the credential exfiltration vector, and have no knowledge of GDPR 72-hour notification requirements or how to preserve legal hold while containing the breach.

What would make it a 10/10: A 10/10 response would include the specific SCP JSON for denying IMDSv1, a Checkov policy rule YAML that blocks s3:* on *, and a concrete incident timeline template showing the required actions at T+0, T+1hr, T+4hr, T+72hr for a PII breach of this scale.


Question 6: Serverless vs. Containers — Tradeoff at Scale

Difficulty: Senior | Role: Cloud Architect / Engineering Manager | Level: Senior / Staff | Company Examples: Amazon, Slack, Twitch, Figma


The Question

You are designing the compute layer for a new real-time data processing platform that will: (1) ingest 50,000 events/second from IoT devices, (2) process each event through 4 sequential transformation steps, (3) write results to DynamoDB and push notifications via SNS, and (4) handle extreme traffic variability — baseline 2,000 events/sec, surge to 50,000 events/sec in under 30 seconds. Your team is 6 engineers. Analyze the architectural tradeoffs between Lambda, ECS Fargate, and EKS for this specific workload, quantify the cost model for each, and make a concrete recommendation with reversibility analysis.


1. What Is This Question Testing?

  • Cloud architecture maturity — understanding Lambda concurrency limits, cold starts, execution duration limits, and how they interact with streaming event processing
  • Cost modeling — calculating Lambda cost ($0.0000166667/GB-second) vs. Fargate ($0.04048/vCPU-hour + $0.004445/GB-hour) vs. EKS at 50,000 events/sec
  • Systems thinking — modeling the 4-step sequential processing pipeline as a fan-out + fan-in problem and understanding Lambda chaining vs. ECS task chain vs. Kafka streams
  • Reliability engineering — understanding Lambda's 15-minute execution timeout, burst concurrency limits (an initial burst of 500–3,000 concurrent executions, depending on region), and Fargate's task startup latency (30–90 seconds)
  • Organizational thinking — calibrating operational complexity against a 6-engineer team that cannot maintain a Kubernetes cluster without platform overhead
  • Tradeoff analysis — reversibility: can the team migrate from Lambda to Fargate (or vice versa) if the initial choice proves wrong after 6 months of production data?

2. Framework: Compute Tradeoff Evaluation Matrix (CTEM)

  1. Assumption Documentation — Event payload size, processing latency SLA, data residency, team Kubernetes expertise
  2. Constraint Analysis — Lambda concurrency limits, Fargate task startup time, 6-engineer team operational budget
  3. Tradeoff Evaluation — Lambda vs. Fargate vs. EKS across: cost, scaling speed, cold start, operational burden, observability, debugging experience
  4. Hidden Cost Identification — Lambda: API Gateway invocation cost ($3.50/million), VPC Lambda cold start penalty (+400ms), DynamoDB write throttling at burst; Fargate: task startup latency (30–90s) fails the 30-second surge requirement; EKS: ~$73/month cluster fee + node group management overhead
  5. Risk Signals — Lambda concurrency exhaustion (ThrottledInvocations metric), Fargate task startup failures under burst, EKS node scheduling bottlenecks
  6. Pivot Triggers — If Lambda p99 latency exceeds 500ms after 3 months of production traffic, migrate hot-path to Fargate; if Fargate task startup during surge exceeds 45 seconds, pre-warm task pool
  7. Long-Term Evolution Plan — Start with Lambda for event-driven simplicity; add Fargate for long-running processing if execution duration limits become a constraint after 12 months

3. The Answer

Explicit Assumptions:

  • Event payload: 2KB average; processing latency SLA: <200ms end-to-end (event ingestion to DynamoDB write)
  • Sequential 4-step pipeline: validate → enrich → transform → write; each step <40ms target
  • Budget: $15,000–$25,000/month compute budget
  • Team: 6 engineers, moderate AWS experience, no Kubernetes expertise
  • Surge characteristics: 2,000 → 50,000 events/sec in 30 seconds, sustained for 15–30 minutes

Option A: AWS Lambda

Lambda is the natural fit for event-driven, variable-scale workloads. Each IoT event triggers a Lambda function via Kinesis Data Streams (128-shard stream for 50,000 events/sec × 2KB = 100MB/sec, within Kinesis limits). Lambda concurrency: at 50,000 events/sec with 100ms processing time per event, peak concurrency = 50,000 × 0.1 = 5,000 concurrent executions. Default account concurrency limit: 1,000. This is a critical blocker — it requires requesting a concurrency limit increase to 6,000+ (AWS support ticket, 24–48 hours). Burst scaling is also capped (historically an initial burst of 500–3,000 concurrent executions depending on region, then roughly 500 more per minute) — the 30-second surge from 2,000 to 50,000 events/sec can hit that ceiling, causing ThrottledInvocations that Kinesis will retry, but with growing iterator lag.

Cold starts: Lambda in a VPC (required for DynamoDB VPC endpoint access) adds a 400–800ms cold start penalty for Java/Node.js runtimes. During the 2,000 → 50,000 surge, thousands of cold starts occur simultaneously. Mitigation: Provisioned Concurrency on a warm pool — at $0.0000046484/GB-second, 500 instances × 512MB running 24×7 costs roughly $3,000/month (proportionally less if provisioned only around anticipated surge windows) — ensuring core capacity is always warm. Cost model at the stated traffic profile (2,000 events/sec baseline ≈ 173M invocations/day, 100ms average duration at 512MB): invocation cost ≈ 173M × $0.0000002 ≈ $35/day; compute ≈ 173M × 0.1s × 0.5GB ≈ 8.6M GB-seconds × $0.0000166667 ≈ $144/day. Each 15–30 minute surge to 50,000 events/sec adds roughly $50–$100. Total Lambda request-and-compute cost: ~$180/day, or roughly $5,500/month before Provisioned Concurrency.
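
A worked version of the concurrency and cost arithmetic above (rates as quoted; the traffic figures come from the stated assumptions):

    # Worked version of the sizing arithmetic above (rates as assumed in the text).
    REQ_PRICE = 0.0000002        # $ per invocation ($0.20 per 1M requests)
    GB_S_PRICE = 0.0000166667    # $ per GB-second
    DURATION_S = 0.1             # 100 ms average
    MEM_GB = 0.5                 # 512 MB

    def peak_concurrency(events_per_sec, duration_s=DURATION_S):
        return events_per_sec * duration_s

    def cost_for_window(events_per_sec, seconds):
        invocations = events_per_sec * seconds
        compute_gb_s = invocations * DURATION_S * MEM_GB
        return invocations * REQ_PRICE + compute_gb_s * GB_S_PRICE

    print(peak_concurrency(50_000))                      # 5,000 concurrent executions at peak
    print(round(cost_for_window(2_000, 86_400), 2))      # ≈ $179/day at the 2,000 events/sec baseline
    print(round(cost_for_window(50_000, 1_800), 2))      # ≈ $93 for a 30-minute surge window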

Option B: ECS Fargate

Fargate provides consistent compute without cold starts, but its scaling model is fundamentally different: it scales tasks (containers), not function invocations. Task startup: 30–90 seconds per Fargate task. This means a surge from 2,000 to 50,000 events/sec in 30 seconds CANNOT be absorbed by Fargate scaling — you would need to pre-provision tasks. A buffer pool of 50 pre-warmed Fargate tasks (2 vCPU / 4GB each) continuously running would handle the surge without scaling latency, but at high cost: 50 tasks × 2 vCPU × $0.04048/hr × 24hr × 30 days = $2,913/month in idle compute. Plus memory: 50 tasks × 4GB × $0.004445/hr × 24hr × 30 days = $640/month. Pre-warm pool cost alone: ~$3,550/month for 50 tasks. Active processing tasks add ~$1,200/month. Total Fargate: ~$4,800/month — roughly comparable to Lambda's compute cost for this workload, without the scaling headroom. Critically, Fargate does not solve the 30-second surge requirement unless pre-warmed.

Option C: Amazon EKS

EKS is appropriate when teams need Kubernetes-native tooling, custom scheduling, or GPU workloads — none of which apply here. For a 6-engineer team with no Kubernetes expertise: EKS cluster management, node group upgrades, RBAC configuration, PodDisruptionBudgets, and HPA tuning represent $8,000–$15,000/month in engineering opportunity cost. EKS with Karpenter can scale nodes in 60–90 seconds — better than Fargate's task startup, but still insufficient for a 30-second surge without pre-warming. Recommendation: Do not use EKS for a 6-engineer team processing IoT events with no Kubernetes expertise.

Recommendation: Lambda with Kinesis + Provisioned Concurrency

Use Lambda as the compute layer. IoT devices → IoT Core → Kinesis Data Streams (128 shards) → Lambda (4 sequential functions chained via SQS queues, not synchronous chaining) → DynamoDB + SNS. Implement Provisioned Concurrency at 500 warm instances to handle the first 30 seconds of surge without cold starts. Request a concurrency limit increase to 6,000 before launch. Use SQS as a buffer between Lambda stages — this decouples the 4-step pipeline and prevents a slow step from creating backpressure on Kinesis. Total cost: $4,500–$6,000/month, including Kinesis, SQS, DynamoDB writes, and Lambda — well within budget.
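A sketch of the event-source wiring for the first two stages (stream ARN, queue name, and function names are placeholders; the batch sizes and parallelization factor are starting points to tune in load tests, not prescriptions):

# Kinesis -> validate stage: batch records, fan out per shard, isolate poison records
aws lambda create-event-source-mapping \
  --function-name iot-validate \
  --event-source-arn arn:aws:kinesis:us-east-1:123456789012:stream/iot-events \
  --starting-position LATEST \
  --batch-size 500 \
  --maximum-batching-window-in-seconds 1 \
  --parallelization-factor 10 \
  --bisect-batch-on-function-error \
  --maximum-retry-attempts 3

# SQS buffer between validate and enrich decouples the two stages' throughput
aws sqs create-queue --queue-name iot-enrich-queue \
  --attributes VisibilityTimeout=120
aws lambda create-event-source-mapping \
  --function-name iot-enrich \
  --event-source-arn arn:aws:sqs:us-east-1:123456789012:iot-enrich-queue \
  --batch-size 10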

Reversibility Analysis

Lambda → Fargate migration: Moderate difficulty. The business logic in Lambda functions can be containerized with minimal code change. Estimate: 3–4 sprint weeks for a 6-person team. Lambda → EKS migration: High difficulty. Requires Kubernetes expertise the team doesn't have; likely requires hiring a platform engineer. Estimate: 3–6 months, including knowledge transfer. Fargate → Lambda: Easy. Extract business logic from containers to Lambda functions. The initial Lambda choice is the highest-reversibility option.

Early Warning Metrics:

  • Lambda ThrottledInvocations — alert >100/minute; page >1,000/minute
  • Kinesis IteratorAge (GetRecords.IteratorAgeMilliseconds) — alert >30 seconds; page >5 minutes
  • Lambda duration p99 — alert >150ms; page >180ms; approaching 200ms SLA
  • DynamoDB ConsumedWriteCapacityUnits — alert at 80% provisioned

4. Interview Score: 8.5 / 10

Why this demonstrates senior-level maturity: The response correctly identifies Lambda concurrency burst limits as the primary risk for the 30-second surge requirement. Calculating the Fargate pre-warm pool cost as $3,714/month with actual vCPU-hour pricing, and showing that this eliminates Fargate's cost advantage, demonstrates financial modeling depth. Rejecting EKS on organizational grounds (6-engineer team, no Kubernetes expertise) is a sign of senior judgment.

What differentiates it from mid-level thinking: A mid-level engineer would recommend Lambda generically without calculating peak concurrency (5,000), not know the burst concurrency limit (3,000/minute), not calculate Provisioned Concurrency cost, and recommend EKS as the 'most scalable' option without factoring in operational burden for the team size.

What would make it a 10/10: A 10/10 response would include a specific Kinesis shard count calculation (50,000 events/sec × 2KB = 100MB/sec ÷ 1MB/sec per shard = 100 shards minimum), a Lambda scaling burst limit timeline simulation for the 30-second surge event, and a concrete Terraform module structure for the Kinesis → Lambda → SQS → Lambda → DynamoDB pipeline.



Question 7: Designing 99.99% Availability Architecture

Difficulty: Elite | Role: Cloud Architect / Principal Engineer | Level: Principal / Distinguished | Company Examples: AWS, Stripe, Datadog, PagerDuty


The Question

Your SLA with enterprise customers requires 99.99% availability — 52.6 minutes of allowable downtime per year. Your current architecture achieves 99.9% (8.76 hours/year). You have 12 months and $2.5M in engineering investment to close the gap. The system is a SaaS B2B platform with: multi-tenant database, REST API layer, async job processing, and 3 external dependencies (Stripe for payments, Twilio for notifications, SendGrid for email). Explain the architectural and operational changes required to achieve 99.99%, and be honest about where the limits of cloud-native availability reside.


1. What Is This Question Testing?

  • Reliability engineering — understanding the difference between 99.9% and 99.99% is not incremental; it requires architectural elimination of every single-point-of-failure
  • Systems thinking — recognizing that SLA availability is bounded by the weakest external dependency (Twilio's published SLA is 99.95%; the system SLA cannot exceed the product of dependency SLAs)
  • Cloud architecture maturity — knowing that multi-AZ is table stakes for 99.9%; 99.99% requires multi-region or cell-based architecture
  • Financial literacy — calculating the cost of downtime against the $2.5M engineering investment; if each minute of downtime costs $10,000 (enterprise SLA penalties), the 8.76hr → 52.6min improvement recovers ~473 minutes, roughly $4.7M in annual penalty avoidance
  • Intellectual honesty — being explicit that 99.99% cannot be guaranteed if external dependencies don't offer equivalent SLAs; degraded-mode operation must be designed in
  • Organizational thinking — 12 months, $2.5M requires prioritization; phased approach with highest-impact changes first

2. Framework: Four-Nines Availability Framework (FNAF)

  1. Assumption Documentation — Current downtime root cause distribution (infrastructure failure vs. deployment errors vs. external dependency vs. software bugs)
  1. Constraint Analysis — External dependency SLA ceilings, database replication lag budgets, multi-region data consistency requirements
  1. Tradeoff Evaluation — Multi-AZ vs. multi-region vs. cell-based architecture against cost and complexity
  1. Hidden Cost Identification — 99.99% requires investment in chaos engineering, gameday exercises, synthetic monitoring, and incident response automation — often more expensive than infrastructure changes
  1. Risk Signals — Availability budget burn rate (how many minutes of downtime consumed year-to-date), dependency error rate trending
  1. Pivot Triggers — If multi-region implementation introduces split-brain incidents that consume more downtime than they prevent, pivot to active-passive with fast failover
  1. Long-Term Evolution Plan — Achieve 99.99% in 12 months; build toward 99.999% (5.26 minutes/year) in 3 years through cell-based architecture and chaos engineering culture

3. The Answer

Explicit Assumptions:

  • Current 99.9% → 99.99% target; allowable downtime reduction from 8.76hr to 52.6 min/year
  • Current architecture: single-region AWS, multi-AZ RDS, ECS Fargate, single ALB
  • Current downtime distribution (estimated): 40% deployment-related, 25% infrastructure failures, 20% external dependency outages, 15% software bugs
  • $2.5M budget: $1.5M engineering labor (15 FTEs × 12 months) + $500K infrastructure increase + $500K tooling
  • Enterprise customers: 50 accounts, average contract value $150K/year; SLA penalty: 10% credit per hour beyond SLA

Where 99.9% Fails: Root Cause Analysis First

Before designing for 99.99%, understand why you're currently at 99.9%. The error budget gap is 8.76hr - 0.876hr = 7.88 additional hours to eliminate. Based on industry data for SaaS B2B platforms: Deployments cause ~3.5 hours of downtime/year (rolling deployments without proper readiness probes, database migration failures during deployments, and config changes without blue/green). Infrastructure failures cause ~2.2 hours (multi-AZ RDS failovers during maintenance: ~30s per event × 4 events/year + ECS task failures without health checks). External dependency outages cause ~1.75hr (Stripe 99.99% SLA = 52min/year maximum; if Stripe is on the critical path for all API calls, every Stripe outage is your outage). Software bugs causing cascading failures: ~1.2hr. This prioritization drives the investment plan: fix deployments first (highest ROI).

Elimination of Deployment Downtime (Year Budget: 3.5hr → 0)

Implement blue/green deployments for all services using ECS with CodeDeploy Blue/Green. This eliminates deployment downtime — traffic shifts happen at the ALB listener level with pre-shift canary validation. Add automated rollback triggers: if error rate on green target group exceeds 1% during 5-minute canary phase, CodeDeploy automatically shifts traffic back to blue in <30 seconds. Zero-downtime database migrations are the hardest deployment challenge: require that all DB schema changes be backwards-compatible with both old and new application code (expand-contract pattern). Never rename columns, never make columns non-nullable in a single migration. Use Flyway or Liquibase with mandatory CI checks. This alone should reduce deployment-related downtime from 3.5 hours to <15min/year.
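A hedged sketch of the deployment group configuration behind this (application, cluster, target group, listener ARN, and alarm names are all placeholders; the canary config and rollback events mirror the behavior described above):

cat > dg.json <<'EOF'
{
  "applicationName": "checkout-api",
  "deploymentGroupName": "checkout-api-prod",
  "serviceRoleArn": "arn:aws:iam::123456789012:role/CodeDeployECSRole",
  "deploymentConfigName": "CodeDeployDefault.ECSCanary10Percent5Minutes",
  "deploymentStyle": {
    "deploymentType": "BLUE_GREEN",
    "deploymentOption": "WITH_TRAFFIC_CONTROL"
  },
  "blueGreenDeploymentConfiguration": {
    "deploymentReadyOption": { "actionOnTimeout": "CONTINUE_DEPLOYMENT" },
    "terminateBlueInstancesOnDeploymentSuccess": {
      "action": "TERMINATE",
      "terminationWaitTimeInMinutes": 5
    }
  },
  "ecsServices": [{ "clusterName": "prod", "serviceName": "checkout-api" }],
  "loadBalancerInfo": {
    "targetGroupPairInfoList": [{
      "targetGroups": [{ "name": "checkout-blue" }, { "name": "checkout-green" }],
      "prodTrafficRoute": {
        "listenerArns": ["arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/prod-alb/abc123/def456"]
      }
    }]
  },
  "autoRollbackConfiguration": {
    "enabled": true,
    "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  },
  "alarmConfiguration": {
    "enabled": true,
    "alarms": [{ "name": "checkout-green-5xx-error-rate" }]
  }
}
EOF
aws deploy create-deployment-group --cli-input-json file://dg.json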

External Dependency Isolation: The 20% Budget

This is the most intellectually honest part of the answer: your system SLA cannot exceed the product of your external dependency SLAs in the worst case. Stripe: 99.99% SLA (52 min/year). Twilio: 99.95% SLA (4.38 hours/year). If Twilio SMS notifications are on the critical path for user authentication (2FA), Twilio's 99.95% ceiling limits your system to 99.95% — regardless of how good your infrastructure is. Resolution: degrade gracefully. Notification failures should not fail API calls. Implement the circuit breaker pattern for all external dependencies using Resilience4j. When Twilio error rate exceeds 5% over 60 seconds, open the circuit: queue SMS notifications in SQS for retry later, return a degraded-mode response to the client (inform user: 'SMS may be delayed'). Never make SMS sending synchronous in the API response path for 99.99% targets. Payment processing (Stripe): implement idempotency keys, retry logic with exponential backoff, and a payment status webhook reconciliation job.

Chaos Engineering and Operational Excellence

Infrastructure is not the constraint for the final 1-hour gap between 99.9% and 99.99%. Operations is. 99.99% requires: (1) Mean Time To Detect (MTTD) <2 minutes — synthetic monitoring pinging every 30 seconds from 3 AWS regions; alerting on error rate, p99 latency, and availability simultaneously. Use CloudWatch Synthetics or Datadog Synthetics. (2) Mean Time To Respond (MTTR) <10 minutes — automated runbooks for the top 10 most common failure modes; no manual investigation for known failure patterns. AWS Systems Manager Runbook automations triggered by CloudWatch alarms. (3) Quarterly GameDay exercises: inject AZ failure, database failover, Stripe circuit breaker activation — validate that every failure scenario has automated recovery. (4) Post-incident reviews with a blameless culture and mandatory architectural improvements for any incident consuming >10 minutes of error budget.

Honest Assessment: The 99.99% Ceiling

99.99% is achievable for the infrastructure and deployment layers with the changes above. However, any unplanned failure mode not yet encountered will consume error budget in ways that no architecture can fully prevent. The 52.6-minute annual budget provides very little margin — two 25-minute incidents per year consume the entire budget. The honest answer is: 99.99% is achievable if (a) deployment downtime is eliminated via blue/green, (b) external dependencies are decoupled from the critical path via circuit breakers, (c) MTTD is <2 minutes and MTTR is <10 minutes for all known failure modes, and (d) the team has a rigorous chaos engineering practice. Claiming 99.99% in an enterprise SLA without all four pillars is a business risk, not just an engineering gap.

Early Warning Metrics:

  • Availability budget burn rate — alert if >25% of annual error budget consumed in any single month
  • Synthetic monitor uptime — alert immediately if availability drops below 99.98% on a 5-minute trailing window
  • External dependency error rate — Stripe/Twilio/SendGrid error rate >1% over 60s triggers circuit breaker and on-call page
  • Deployment success rate — any failed deployment that reaches production triggers a blameless post-mortem within 24 hours

4. Interview Score: 9 / 10

Why this demonstrates senior-level maturity: The intellectual honesty about external dependency SLA ceilings (Twilio at 99.95% limits the system ceiling) and the explicit statement that 99.99% requires operational excellence at MTTD <2min/MTTR <10min — not just infrastructure redundancy — demonstrates principal-level thinking. The deployment downtime root cause prioritization (3.5hr, highest ROI) and the expand-contract pattern for zero-downtime database migrations show breadth across the full SRE discipline.

What differentiates it from mid-level thinking: A mid-level engineer would propose 'multi-region active-active' as the solution, not calculating that it costs $500K–$1M more than the budget allows. They would not identify that deployment errors are the #1 cause of downtime at this scale, not infrastructure failures. They would not address the external dependency SLA ceiling problem.

What would make it a 10/10: A 10/10 response would include a concrete availability budget tracker showing how each architectural change maps to minutes of recovered error budget, a specific CodeDeploy blue/green deployment YAML configuration, and a quantified comparison of 99.9% SLA customer penalty exposure (current) vs. 99.99% SLA penalty exposure (target) with the break-even calculation against the $2.5M investment.



Question 8: Terraform State Corruption Incident

Difficulty: Senior | Role: Platform Engineer / DevOps Engineer | Level: Senior | Company Examples: HashiCorp, Cloudflare, Digital Ocean, Wealthsimple


The Question

At 11 am on a Tuesday, a senior engineer runs terraform apply on the production environment. Midway through the apply, their laptop's VPN drops, terminating the Terraform process. The S3 backend shows a .terraform.tfstate.lock.info file that was never released. Three other engineers are blocked from running Terraform. Upon investigation, you discover the partial application created 4 new security groups but failed to associate them with the ECS tasks, leaving production ECS services with orphaned security group references. Additionally, the tfstate file may have been partially written — you cannot be sure if the state reflects reality. Describe your recovery procedure, the immediate risk assessment, and the IaC governance changes to prevent this class of incident.


1. What Is This Question Testing?

  • Infrastructure-as-code knowledge — understanding Terraform state locking mechanics, the DynamoDB lock table, partial apply scenarios, and state drift
  • Risk assessment — evaluating whether the partial state means Terraform would try to re-create, destroy, or leave dangling resources on the next apply
  • Systems thinking — understanding that orphaned security group references on ECS tasks may cause task replacement failures on next deployment, not immediate outage
  • Reliability engineering — designing a recovery procedure that restores IaC integrity without causing additional production impact
  • Tradeoff analysis — forcefully unlocking the state (fast, risky) vs. restoring from S3 versioned backup (slower, safer)
  • Cloud architecture maturity — knowing that production Terraform runs should never happen from a developer laptop; understanding CI/CD-gated Terraform pipelines

2. Framework: State Recovery and IaC Governance Framework (SRIGF)

  1. Assumption Documentation — Identify whether ECS services are currently degraded, whether the partial apply changed any production-impacting resources, and S3 state backup availability
  1. Constraint Analysis — 3 engineers blocked from Terraform, but is this causing production impact? Production services are still running with the current (pre-apply) security groups
  1. Tradeoff Evaluation — Force unlock immediately (risk: if another apply is truly in-progress somewhere, concurrent applies corrupt state) vs. verify no concurrent apply first, then unlock
  1. Hidden Cost Identification — Orphaned security groups: AWS charges $0 for security groups, but they count toward the 2,500 security groups per VPC limit
  1. Risk Signals — ECS task replacement failures on next deployment, Terraform plan showing unexpected destroy/create for security group associations
  1. Pivot Triggers — If terraform plan after state restore shows >20 resource changes (unexpected), restore from the previous S3 version of tfstate before the interrupted apply
  1. Long-Term Evolution Plan — Terraform Cloud or Atlantis for remote, atomic applies; state locking via CI/CD; no local Terraform apply to production

3. The Answer

Explicit Assumptions:

  • S3 backend with versioning enabled; DynamoDB for state locking (standard AWS backend config)
  • Production ECS services are currently running — the partial apply did not break existing tasks, only failed to update them
  • The senior engineer's laptop is confirmed offline (VPN dropped); no concurrent apply is in progress
  • Team uses Terraform 1.5+; S3 backend with DynamoDB lock table arn:aws:dynamodb:us-east-1:123456789012:table/terraform-state-lock
  • Blast radius: 4 new security groups created but not associated with ECS tasks; no security groups deleted; no existing resources modified

Immediate Risk Assessment

First question: Is production currently degraded? No — the partial apply created new security groups (additive) but did not modify or delete any existing resources. The ECS tasks are running with their original security group associations. The 'orphaned security group references' in the state may mean the tfstate believes the new security groups are associated with ECS tasks, while AWS reality shows they are not. This state drift means the next Terraform plan will show unexpected changes. Critically, the 3 blocked engineers cannot run terraform plan or apply, but production is operational. This is a P2 incident (team velocity blocked), not a P1 (production impacted). Communicate this distinction to leadership immediately to prevent panic-driven actions that could cause an actual outage.

State Lock Recovery Procedure

Step 1: Verify no concurrent apply is running. Check CI/CD pipeline logs (GitHub Actions, GitLab CI, Jenkins) for any in-progress Terraform jobs targeting production. Check EC2 instances running Terraform agents. Check the lock info: aws s3 cp s3://your-tfstate-bucket/path/to/.terraform.tfstate.lock.info - | cat — this shows the lock ID, the operation (apply), and the timestamp. If the lock timestamp is >15 minutes ago and you have confirmed the locking engineer's session is dead (VPN disconnected), the lock is stale.

Step 2: Force unlock the state. Use terraform force-unlock [LOCK_ID] where LOCK_ID is found in the lock info file. This removes the DynamoDB lock record. Alternatively, directly delete the lock record from DynamoDB: aws dynamodb delete-item --table-name terraform-state-lock --key '{"LockID": {"S": "your-state-path"}}'. Document who performed the force unlock, at what time, and why — this is an audit trail entry.
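A consolidated sketch of the inspection and unlock commands (the table name and state path are the values assumed earlier; the LockID is normally the '<bucket>/<key>' string, and a companion '-md5' digest item also exists in the table):

# Inspect the lock item before touching it (who holds it, which operation, when)
aws dynamodb get-item \
  --table-name terraform-state-lock \
  --key '{"LockID": {"S": "your-tfstate-bucket/path/to/terraform.tfstate"}}'

# Preferred: let Terraform validate and remove the lock by its ID
terraform force-unlock 6f5c2d2e-EXAMPLE-LOCK-ID

# Fallback only if force-unlock fails: delete the item directly, and log who/when/why
aws dynamodb delete-item \
  --table-name terraform-state-lock \
  --key '{"LockID": {"S": "your-tfstate-bucket/path/to/terraform.tfstate"}}'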

State Integrity Verification

After unlocking, do NOT run terraform apply immediately. Run terraform plan first to understand the delta between state and reality. The plan will reveal: (a) Resources in state but not in AWS (orphaned state entries) — these will show as 'will be destroyed' unless you remove them from state. (b) Resources in AWS but not in state (out-of-band created resources) — these will show as 'will be created'. (c) State drift in resource attributes.

For the 4 partially created security groups: Terraform plan will show them as existing in state. If they exist in AWS (verify with aws ec2 describe-security-groups --filters Name=group-name,Values=your-new-sg-name), the state is consistent for those resources — the partial apply successfully created them. The missing association with ECS tasks will show as a pending change in the plan. If the security groups do NOT exist in AWS, the state is corrupt — use terraform state rm aws_security_group.new_sg_1 for each orphaned state entry, then re-run terraform plan to confirm a clean state.
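A short worked sketch of that drift-resolution loop (security group names, resource addresses, and the sg- ID are placeholders):

# Does the partially created SG actually exist in AWS?
aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=ecs-app-sg-1" \
  --query "SecurityGroups[].GroupId"

# In state but not in AWS: drop the orphaned entry
terraform state list | grep aws_security_group
terraform state rm aws_security_group.new_sg_1

# In AWS but not in state: import it instead of letting Terraform recreate it
terraform import aws_security_group.new_sg_1 sg-0abc123def4567890

# Re-plan; exit code 0 = clean, 2 = changes still pending, 1 = error
terraform plan -detailed-exitcode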

S3 State Backup Restoration (If State Is Corrupt)

If terraform plan shows an unexpectedly large number of changes (>10 unexpected destroys or creates), the tfstate is corrupt from the partial write. Restore from S3 versioning: aws s3api list-object-versions --bucket your-tfstate-bucket --prefix path/to/terraform.tfstate — identify the version immediately before the interrupted apply (by timestamp). Copy the specific version back: aws s3api copy-object --bucket your-tfstate-bucket --copy-source your-tfstate-bucket/path/to/terraform.tfstate?versionId=PREVIOUS_VERSION_ID --key path/to/terraform.tfstate. After restoring the previous version, run terraform plan — this should show only the 4 security group changes that were intended in the interrupted apply.

Production Fix: Security Group Associations

After state integrity is confirmed, run Terraform apply in a controlled environment. The application should be minimal: create the 4 security group rules that were intended, and associate them with ECS task definitions. Before applying, review the plan carefully for any unexpected changes. Apply with -target flag if the plan shows unexpected changes outside the security group scope: terraform apply -target=aws_security_group.new_sg_1 -target=aws_ecs_task_definition.app. After applying, verify ECS services with the new task definition are healthy — run a smoke test against the production endpoint.

IaC Governance Changes

This incident is a symptom of a governance failure: production Terraform should never run from a developer laptop. Implement within 2 weeks: (1) Atlantis or Terraform Cloud for remote, server-side plan and apply. Atlantis runs in EKS/ECS, triggered by pull request comments ('atlantis apply'), with audit logs, automatic plan on PR creation, and apply locked until PR approval. (2) Terraform Sentinel or OPA policy enforcement: any plan affecting >5 resources requires a second human approval. (3) State backend hardening: enable S3 Object Lock with Governance mode on the tfstate bucket; 30-day versioning retention; DynamoDB TTL on lock entries (auto-expire stale locks after 1 hour). (4) No AWS IAM credentials with Terraform production permissions on developer laptops — use OIDC federation for GitHub Actions/GitLab CI to assume the Terraform role with short-lived credentials. Developers use read-only credentials for plan locally, never apply. (5) Separate Terraform state files per environment (dev/staging/prod) in separate S3 paths.

Long-Term: Terragrunt for DRY State Management

If the team manages >10 Terraform root modules, introduce Terragrunt for state path standardization, module versioning locks, and dependency graph management. Terragrunt enforces consistent backend configuration across all modules and makes it impossible to accidentally target the wrong state backend.

Early Warning Metrics:

  • Stale state lock alert — Lambda function triggered by DynamoDB TTL monitor; if any lock entry is older than 30 minutes, page the oncall
  • Terraform plan drift alert — scheduled daily Terraform plan in CI/CD; if the plan shows >0 unexpected changes, alert the platform team
  • S3 state file modification events — CloudTrail S3 object write events for tfstate files trigger a Slack notification for all writes not originating from the CI/CD pipeline principal
  • ECS task launch success rate — any task launch failure rate >5% in a 5-minute window triggers investigation

4. Interview Score: 8.5 / 10

Why this demonstrates senior-level maturity: This answer correctly sequences the recovery procedure: risk assessment → lock verification → force unlock → state integrity check → targeted apply → governance fix. Knowing to check CI/CD pipelines before force-unlocking (to confirm no concurrent apply), using terraform state rm for orphaned entries rather than editing tfstate JSON directly, and proposing OIDC federation to eliminate laptop credentials as a systemic fix — these reflect hands-on IaC production experience.

What differentiates it from mid-level thinking: A mid-level engineer would immediately run Terraform force-unlock without verifying no concurrent apply is in progress (risking state corruption from concurrent applies), would manually edit the tfstate JSON file (extremely risky), and would not propose the systematic governance changes (Atlantis, OIDC federation, Object Lock on S3).

What would make it a 10/10: A 10/10 response would include the exact DynamoDB CLI command to inspect and delete the lock entry, a concrete Atlantis configuration YAML for the PR-based apply workflow, and a worked example of using terraform state rm and terraform import to resolve the most common types of state drift that occur during partial applies.


Question 9: Multi-Cloud Strategy — AWS + GCP Hybrid Architecture

Difficulty: Elite | Role: Cloud Architect / Principal Engineer | Level: Staff / Principal | Company Examples: Spotify, Dropbox, Pinterest, Lyft


The Question

Your CTO has mandated a multi-cloud strategy: your primary SaaS platform runs on AWS, but the data science team insists on using GCP for BigQuery and Vertex AI — where the tooling is genuinely superior. Additionally, your legal team requires that EU customer data never leave European data centers, while US customer data must stay in US regions. You now have workloads split across AWS us-east-1, AWS eu-west-1, GCP us-central1, and GCP europe-west1. Networking costs are exploding, data transfer between clouds is slow and expensive, and the platform team is drowning in complexity. Design the network architecture, data governance model, and operational framework for this multi-cloud setup — and be honest about when multi-cloud creates more problems than it solves.


1. What Is This Question Testing?

  • Cloud architecture maturity — understanding that multi-cloud is not a feature, it's an operational tax; recognizing the legitimate use cases vs. vendor-pressure-driven decisions
  • Financial literacy — calculating cross-cloud egress costs ($0.08–$0.12/GB between AWS and GCP) and modeling the break-even point where multi-cloud tooling advantages are consumed by networking overhead
  • Systems thinking — designing data gravity: workloads should live where the data lives; moving data across clouds for every ML training job is a $50,000–$200,000/month mistake at scale
  • Security awareness — cross-cloud IAM federation, workload identity federation, and the expanded attack surface of credentials that span multiple cloud control planes
  • Risk assessment — identifying that multi-cloud increases mean time to diagnose incidents (which cloud's network is dropping packets?) and requires engineers fluent in two cloud platforms simultaneously
  • Organizational thinking — the hidden cost of multi-cloud is team cognitive load: AWS certifications don't transfer to GCP; a team fluent in both costs 20–30% more to hire and retain

2. Framework: Multi-Cloud Governance and Data Gravity Model (MCGDGM)

  1. Assumption Documentation — Identify which workloads genuinely require GCP (BigQuery, Vertex AI) vs. which are there due to team preference; quantify data volumes flowing between clouds monthly
  1. Constraint Analysis — EU data residency requirements, cross-cloud egress pricing ($0.08–$0.12/GB), latency between AWS EU-West-1 and GCP Europe-West1 (typically 15–30ms), team expertise split
  1. Tradeoff Evaluation — Full multi-cloud vs. primary cloud with SaaS connectors (use BigQuery via API from AWS) vs. replicating GCP tooling on AWS (Redshift + SageMaker instead of BigQuery + Vertex AI)
  1. Hidden Cost Identification — Dedicated interconnect between AWS and GCP ($1,750–$5,000/month for 1Gbps), cross-cloud data transfer for ML training jobs, doubled observability tooling (CloudWatch + Cloud Monitoring), doubled security tooling
  1. Risk Signals / Early Warning Metrics — Cross-cloud data transfer costs as % of total cloud bill (alert >15%), inter-cloud latency p99 degradation, data residency compliance drift events
  1. Pivot Triggers — If cross-cloud data transfer costs exceed $80,000/month or if a compliance audit finds EU data flowing through US regions, trigger architecture review within 30 days
  1. Long-Term Evolution Plan — Establish data gravity principle: data-intensive workloads must live where the data lives; use cross-cloud only for control plane integration, not data plane replication

3. The Answer

Explicit Assumptions:

  • AWS: primary application platform (~$280,000/month); GCP: data analytics and ML (~$45,000/month)
  • EU data: ~40TB/month processed in BigQuery europe-west1; US data: ~120TB/month in BigQuery us-central1
  • Current cross-cloud data transfer: ~8TB/month at $0.10/GB average = $800/month (currently manageable, but growing 40% MoM)
  • Team: 22 engineers total, 8 with GCP expertise, 18 with AWS expertise (4 overlap)
  • Compliance: GDPR Article 46 — EU customer PII cannot be transferred to non-adequate countries; US data has no cross-border restriction

Network Architecture: Cloud Interconnect + Private Connectivity

Do not route cross-cloud traffic over the public internet. AWS to GCP traffic over public internet: latency 40–80ms, no SLA, subject to BGP routing anomalies. Instead, establish dedicated connectivity: Google Cloud Dedicated Interconnect ($1,750/month for 10Gbps in us-central1) connecting to AWS Direct Connect ($1,500/month for 10Gbps in us-east-1) via a colocation facility (an Equinix site in Chicago or Ashburn is a common meet point for AWS us-east-1 and GCP us-central1). Total dedicated interconnect cost: ~$3,250/month. This reduces cross-cloud latency from 40–80ms to 8–15ms and eliminates public egress charges for traffic flowing through the interconnect. For EU: repeat with AWS eu-west-1 (Dublin) and GCP europe-west1 (Belgium) via Equinix LD8 in London. EU interconnect: ~$3,500/month (slightly higher due to colocation pricing). Total networking infrastructure: ~$6,750/month — this is the minimum cost of a serious multi-cloud architecture.

Data Gravity Principle: Where Should Workloads Live?

The single most important architectural decision in multi-cloud is data gravity. Data is heavy — moving it is expensive and slow. The rule: workloads must live where the data originates, not where the team prefers the tooling. For this architecture: (1) Application data (user records, transactions, product catalog) originates in AWS → stays in AWS (RDS Aurora, DynamoDB, S3). (2) Analytics data (event streams, clickstream, logs) is exported from AWS Kinesis Firehose to GCP BigQuery via the dedicated interconnect — one-way, append-only, batched hourly. This is legitimate because analytics is inherently downstream. (3) ML model training happens in GCP Vertex AI against data already in BigQuery — no cross-cloud data movement during training. (4) ML model artifacts (trained models) are exported from Vertex AI → S3 → served via AWS SageMaker Inference or Lambda. Model files are small (1–10GB); this cross-cloud transfer is acceptable. Anti-pattern to avoid: running ML training in GCP against data that must be replicated from AWS in real-time. If a training job requires 10TB of real-time AWS data in GCP, the architecture is wrong — either run training in AWS or accept the $1,000/training-run transfer cost.

EU Data Residency Architecture

GDPR compliance requires strict data residency controls. Architecture: EU customer data flows are completely isolated from US flows at the AWS account level (separate AWS account: eu-production) and GCP project level (separate GCP project: analytics-eu). No cross-region data replication for PII. AWS eu-west-1 → GCP europe-west1 analytics pipeline uses the EU dedicated interconnect — data never leaves Europe. S3 bucket policies enforce aws:RequestedRegion conditions to prevent accidental cross-region copies. GCP BigQuery dataset location is set to EU and cannot be changed after creation — this is an immutable GCP constraint that enforces residency at the storage layer. AWS Config rule: any S3 cross-region replication rule that includes eu-west-1 buckets triggers an immediate compliance alert and auto-remediation (disable the rule). Quarterly data residency audit: use AWS Macie to scan all S3 buckets for EU PII indicators (EU phone formats, EU postal codes, IBANs) and verify they exist only in eu-west-1.
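A sketch of enforcing the "no cross-region replication for PII" rule at the storage layer (bucket name, GCP project, and dataset are placeholders; pair the bucket policy with the Config rule described above):

cat > deny-replication-config.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyAddingReplicationRulesToEUPIIBucket",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutReplicationConfiguration",
    "Resource": "arn:aws:s3:::eu-prod-customer-pii"
  }]
}
EOF
aws s3api put-bucket-policy --bucket eu-prod-customer-pii \
  --policy file://deny-replication-config.json

# BigQuery: dataset location is immutable once created — create it as EU up front
bq mk --dataset --location=EU analytics-eu:customer_events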

Cross-Cloud IAM and Identity Federation

Never create static service account keys for cross-cloud authentication. GCP Workload Identity Federation allows AWS IAM roles to authenticate to GCP without service account keys — the AWS IAM role presents its STS token to GCP's STS endpoint, which exchanges it for a short-lived GCP access token. Configuration: create a GCP Workload Identity Pool with an AWS provider, configure attribute mapping from AWS ARN to GCP identity. This eliminates the security risk of long-lived GCP service account JSON keys stored in AWS Secrets Manager. For the reverse (GCP → AWS): GCP Service Account can use AWS STS AssumeRoleWithWebIdentity with the GCP OIDC token. This creates a zero-static-credential cross-cloud auth architecture. Audit cross-cloud access monthly: any GCP service account with AWS permissions or AWS role with GCP permissions must be documented, reviewed, and time-bounded.
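A sketch of the GCP-side setup (pool and provider IDs, project number, service account, and the AWS role ARN are placeholders; confirm the attribute mapping syntax against the current gcloud reference before relying on it):

# Create a workload identity pool and register the AWS account as a provider
gcloud iam workload-identity-pools create aws-pool \
  --location=global --display-name="AWS federation"
gcloud iam workload-identity-pools providers create-aws aws-provider \
  --location=global --workload-identity-pool=aws-pool \
  --account-id=123456789012

# Let the federated AWS role impersonate a narrowly scoped GCP service account
gcloud iam service-accounts add-iam-policy-binding \
  bq-loader@analytics-prod.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/123456/locations/global/workloadIdentityPools/aws-pool/attribute.aws_role/arn:aws:sts::123456789012:assumed-role/firehose-export"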

Observability in Multi-Cloud

Resist the temptation to run separate observability stacks per cloud. The cost of debugging a cross-cloud incident with two separate monitoring systems is measured in hours of MTTR. Deploy a unified observability platform: Datadog or Grafana Cloud (cloud-agnostic) with agents on both AWS (CloudWatch metrics forwarded via Kinesis → Datadog) and GCP (Cloud Monitoring metrics forwarded via Pub/Sub → Datadog). Cost: ~$8,000–$15,000/month for unified observability at this scale. This is expensive but recovers its cost after the first major cross-cloud incident, where you would otherwise spend 4 hours determining whether the slowdown is in AWS, GCP, or the interconnect.

Honest Assessment: When Multi-Cloud Is the Wrong Answer

Multi-cloud is justified when: (a) a specific cloud has genuinely superior tooling for a workload type (GCP BigQuery for petabyte analytics, AWS Lambda for serverless event processing); (b) regulatory requirements mandate geographic distribution across providers; (c) vendor lock-in risk justifies the operational premium for strategic workloads. Multi-cloud is the wrong answer when: (a) the motivation is "avoid vendor lock-in" for undifferentiated infrastructure (compute, storage, databases — all major clouds offer equivalent services); (b) the team lacks expertise in both clouds; (c) the workloads are tightly coupled (cross-cloud latency will degrade the application). The honest recommendation: use GCP for BigQuery and Vertex AI, where the tooling advantage is measurable and specific. Do not try to run application servers, databases, or message queues in both clouds simultaneously — the operational complexity and networking cost will consume the tooling advantage within 6 months.

Early Warning Metrics:

  • Cross-cloud data transfer cost — alert if monthly egress between AWS and GCP exceeds $15,000; indicates data gravity violations (workloads accessing data across clouds)
  • Interconnect utilization — alert at 70% of dedicated interconnect capacity; provision additional capacity 60 days before saturation
  • EU data residency drift — any AWS Macie finding of EU PII in us-east-1 buckets triggers P1 within 15 minutes
  • Cross-cloud authentication failures — any spike in GCP Workload Identity Federation token exchange failures triggers an investigation of IAM configuration drift

4. Interview Score: 8.5 / 10

Why this demonstrates senior-level maturity: The data gravity principle (workloads must live where data originates) is a senior-level insight that eliminates the most expensive multi-cloud anti-patterns before they are built. Recommending Workload Identity Federation over static service account keys, calculating dedicated interconnect costs ($6,750/month) as the minimum price of serious multi-cloud, and being explicit about when multi-cloud creates more problems than it solves — these demonstrate principal-engineer judgment over technical enthusiasm.

What differentiates it from mid-level thinking: A mid-level engineer would design data replication pipelines between AWS and GCP for every workload without modeling the egress costs, use long-lived service account JSON keys stored in Secrets Manager (a significant security risk), and not address the EU data residency isolation at the AWS account level.

What would make it a 10/10: A 10/10 response would include a specific GCP Workload Identity Federation Terraform configuration, a worked cost model showing the 18-month TCO comparison between multi-cloud with dedicated interconnect vs. moving the analytics workload to AWS Redshift + SageMaker, and a concrete data classification tagging schema for enforcing residency requirements at the object level in both S3 and GCS.



Question 10: Production Database Migration — PostgreSQL to Aurora at Scale

Difficulty: Elite | Role: Cloud Engineer / Database Engineer | Level: Senior / Staff | Company Examples: GitHub, Atlassian, HubSpot, Zendesk


The Question

Your product runs on a self-managed PostgreSQL 12 cluster on EC2 (primary + 2 read replicas), storing 4TB of data with 3,200 reads/second and 800 writes/second at peak. Your team has decided to migrate to Amazon Aurora PostgreSQL to reduce operational burden. However, you have 12 downstream services reading from the read replicas directly via hardcoded connection strings, 3 long-running ETL jobs that run overnight and hold table locks for up to 45 minutes, and a strict SLA that allows a maximum of 30 seconds of write downtime and zero read downtime during the cutover. Additionally, your schema includes 14 custom PostgreSQL extensions, 3 of which (PostGIS, pg_trgm, pg_partman) must be validated for Aurora compatibility. You have a 4-hour Saturday maintenance window. Describe the migration strategy, extension compatibility plan, cutover procedure, and rollback plan.


1. What Is This Question Testing?

  • Cloud architecture maturity — understanding that Aurora PostgreSQL is not a drop-in replacement for self-managed PostgreSQL; extension compatibility, parameter group differences, and replication lag during cutover are real blockers
  • Risk assessment — modeling the blast radius of a failed cutover: 12 downstream services hard-coded to old connection strings, 3 ETL jobs that will fail if the database is unavailable mid-run
  • Systems thinking — sequencing the migration to achieve zero read downtime (route reads to Aurora replicas before promoting Aurora to primary) while limiting write downtime to <30 seconds
  • Reliability engineering — designing the rollback path: if Aurora promotion fails at minute 28, can you revert to EC2 PostgreSQL with <30 seconds of additional write downtime?
  • Infrastructure-as-code knowledge — understanding that 12 hardcoded connection strings in downstream services are a technical debt problem that the migration surfaces, but did not create
  • Financial literacy — calculating Aurora cost vs. self-managed EC2 PostgreSQL: Aurora db.r6g.2xlarge + 2 read replicas vs. 3x r6g.2xlarge EC2 instances; Aurora is typically 20–30% more expensive but eliminates 2–3 FTE-weeks/year of DBA operational overhead

2. Framework: Zero-Downtime Database Migration Model (ZDDMM)

  1. Assumption Documentation — Extension compatibility matrix, downstream service connection string audit, ETL job schedules, Aurora parameter group differences from RDS default
  1. Constraint Analysis — 30-second write downtime ceiling, zero read downtime, 4-hour maintenance window, 12 hardcoded connection strings that cannot be changed before migration
  1. Tradeoff Evaluation — AWS DMS for continuous replication vs. pg_dump/restore vs. native PostgreSQL logical replication; each has different lag characteristics and compatibility requirements
  1. Hidden Cost Identification — Aurora storage auto-scaling charges ($0.10/GB-month vs. EBS gp3 at $0.08/GB-month for 4TB = $80/month premium), Aurora I/O charges ($0.20/million I/O requests in on-demand mode), DMS replication instance during migration (~$0.14/hr × 4 weeks = ~$100)
  1. Risk Signals / Early Warning Metrics — DMS replication lag (target <5s before cutover), Aurora parameter group differences causing query plan changes, pg_partman behavior differences post-migration
  1. Pivot Triggers — If DMS replication lag exceeds 60 seconds at T-24hr before cutover, postpone migration; if Aurora parameter group causes >10% query performance regression in load testing, tune before proceeding
  1. Long-Term Evolution Plan — Post-migration: migrate connection strings to AWS RDS Proxy (eliminates hardcoded IP problem for future migrations), enable Aurora Global Database for read replicas in secondary regions

3. The Answer

Explicit Assumptions:

  • PostgreSQL 12 on EC2 r6g.2xlarge (primary) + 2x r6g.xlarge (read replicas); 4TB data, growing ~50GB/week
  • Aurora PostgreSQL 14 target (upgrade the PostgreSQL version simultaneously — Aurora 12 is end-of-life; the version upgrade adds risk, but is necessary)
  • 12 downstream services: 8 read replica consumers (hardcoded to read replica IPs), 4 primary consumers (hardcoded to primary IP)
  • ETL jobs run Friday 11 pm – Saturday 3 am and Sunday 11 pm – Monday 2 am; the Saturday cutover window avoids this conflict
  • Budget: $500 for DMS replication instance; 3 weeks pre-migration validation; 1 week post-migration stabilization

Extension Compatibility Assessment

This must happen before any migration work begins. The 14 extensions fall into three categories: (1) Fully supported by Aurora PostgreSQL: PostGIS (Aurora natively supports PostGIS 3.x), pg_trgm (native support), uuid-ossp, pgcrypto, hstore, ltree, pg_stat_statements, pg_buffercache, plpgsql, fuzzystrmatch — 10 extensions, no action required. (2) Supported with version differences: pg_partman — Aurora PostgreSQL supports pg_partman, but the version may differ; run a pg_partman version comparison and test all partition maintenance jobs in staging. (3) Not supported/requires replacement: check if any of the remaining 3 extensions are in the Aurora unsupported list (aurora_stat_utils replaces some pg_stat extensions; aws_s3 replaces pg_s3 patterns). Any unsupported extension must be replaced or the migration blocked. Extension validation must be completed in a staging Aurora instance with a restored copy of production data — not just tested in isolation.
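A quick way to run that comparison against the staging Aurora cluster and the EC2 primary (hostnames, database, and user are placeholders):

# What Aurora staging can install, and at which default versions
psql "host=aurora-staging.cluster-xyz.us-east-1.rds.amazonaws.com dbname=app user=dba" -c \
  "SELECT name, default_version FROM pg_available_extensions
   WHERE name IN ('postgis','pg_trgm','pg_partman','uuid-ossp','pgcrypto','hstore','ltree');"

# What production actually has installed today
psql "host=pg-primary.internal dbname=app user=dba" -c \
  "SELECT extname, extversion FROM pg_extension ORDER BY extname;"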

Pre-Migration: Replication Setup (3 Weeks Before)

Use AWS DMS with PostgreSQL logical replication, not pg_dump/restore. pg_dump/restore requires the database to be offline for the duration of the restore — unacceptable for 4TB. DMS with Change Data Capture (CDC) allows continuous replication of ongoing changes while the initial full load runs. Setup: (1) Enable logical replication on EC2 PostgreSQL: set wal_level = logical, max_replication_slots = 5, max_wal_senders = 5 in postgresql.conf. Requires a PostgreSQL restart — schedule this 3 weeks before cutover. (2) Create Aurora PostgreSQL cluster: db.r6g.2xlarge primary + 2x db.r6g.xlarge read replicas in 3 AZs. Restore a pg_dump snapshot as the initial load baseline (this takes ~6–8 hours for 4TB at typical DMS rates). (3) DMS task: configure ongoing replication from EC2 PostgreSQL to Aurora PostgreSQL. DMS will replay all WAL changes since the snapshot. Monitor DMS replication lag daily for 3 weeks — it should stabilize at <5 seconds within 48 hours.
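A sketch of the two setup steps (ARNs, the task identifier, and the service unit name are placeholders; the table-mapping rule simply includes everything in the public schema). It is shown as a single full-load-and-cdc task — if the baseline is seeded from a pg_dump restore as described above, create a cdc-only task with a native start position instead:

# 3 weeks out, on the EC2 primary: enable logical decoding, then restart PostgreSQL
psql -c "ALTER SYSTEM SET wal_level = 'logical';"
psql -c "ALTER SYSTEM SET max_replication_slots = 5;"
psql -c "ALTER SYSTEM SET max_wal_senders = 5;"
sudo systemctl restart postgresql   # service name varies by distro and version

# Full load + CDC replication task into the Aurora target
aws dms create-replication-task \
  --replication-task-identifier pg-to-aurora \
  --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:SRCEXAMPLE \
  --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:TGTEXAMPLE \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:INSTEXAMPLE \
  --migration-type full-load-and-cdc \
  --table-mappings '{"rules":[{"rule-type":"selection","rule-id":"1","rule-name":"include-public","object-locator":{"schema-name":"public","table-name":"%"},"rule-action":"include"}]}'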

Connection String Problem: RDS Proxy as the Decoupling Layer

The 12 hardcoded connection strings cannot be individually migrated before the cutover (insufficient time, too much risk of breaking each service). Solution: put a single switchable endpoint in front of the database and flip that endpoint at cutover. Deploy AWS RDS Proxy in front of the new Aurora cluster 2 weeks before cutover; RDS Proxy provides a stable DNS endpoint that survives failovers and connection storms. One important caveat: RDS Proxy can only target RDS and Aurora databases — it cannot proxy the self-managed EC2 primary — so the switchable layer during the interim period is an internal Route 53 CNAME with a short TTL pointing at the EC2 primary (this assumes the hardcoded strings reference that internal hostname rather than raw IPs; audit this in week 1 and fix any raw-IP stragglers). At cutover: repoint the CNAME at the RDS Proxy endpoint for Aurora — all 12 services pick up the change without any per-service connection string modification. This is the cleanest available resolution of the hardcoded connection string problem. Post-migration: enforce a policy that all new services must connect via the RDS Proxy DNS endpoint, never a direct IP or hostname.

Cutover Procedure (Saturday 10 pm – 2 am)

  • T-60min: Verify DMS replication lag is <5 seconds. Run the full validation query suite against Aurora (row counts, checksum spot checks on the 20 largest tables, confirm all 14 extensions are functional). Confirm the ETL jobs completed Friday night.
  • T-30min: Pre-warm the Aurora connection pool (RDS Proxy). Notify all downstream service teams that the maintenance window is starting.
  • T+0: Stop the ETL scheduler. Set the EC2 PostgreSQL primary to read-only (ALTER SYSTEM SET default_transaction_read_only = on, followed by a config reload) — this blocks writes to EC2 PostgreSQL while reads continue. Read traffic keeps flowing through the existing read replicas (zero read downtime maintained).
  • T+2min: Wait for DMS replication lag to reach 0 (all pending WAL events replayed to Aurora). This is the critical wait — do not proceed until lag = 0.
  • T+5min: Flip the shared endpoint (the internal Route 53 CNAME) to the RDS Proxy endpoint fronting the Aurora primary. All 12 downstream services now resolve to Aurora without connection string changes.
  • T+7min: Promote Aurora to standalone (stop DMS replication). Enable writes on Aurora.
  • T+10min: Validate: run a write test (INSERT + UPDATE + DELETE on a test table), verify all 12 services are responding, and check Aurora CloudWatch metrics (CPUUtilization, DatabaseConnections, ReplicaLag).
  • T+30min: If all validation passes, declare the cutover complete.

Total write downtime: ~7 minutes (T+0 read-only to T+7 writes enabled). This exceeds the 30-second SLA; the time is consumed by draining replication lag and promoting Aurora, not by the endpoint flip itself. To approach 30 seconds, declare T+0 only once DMS lag is already <1 second, flip the internal Route 53 CNAME immediately after the write freeze, and rely on the 30-second DNS TTL for propagation. This is faster but assumes all 12 services respect DNS TTLs (some JDBC connection pools cache DNS indefinitely — validate this in staging).
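A minimal sketch of the T+0 write freeze, the DMS lag check, and the DNS flip (hostnames, zone ID, the task resource ID, and the Aurora/proxy endpoint are placeholders; the date syntax is GNU date, and the DMS dimension value is the task's resource ID, not its display name):

# T+0: freeze writes on the EC2 primary (reloadable setting; existing sessions
# pick it up on their next transaction unless they SET it explicitly)
psql -h pg-primary.internal -U dba -d app -c "ALTER SYSTEM SET default_transaction_read_only = on;"
psql -h pg-primary.internal -U dba -d app -c "SELECT pg_reload_conf();"

# Wait for DMS target latency to hit ~0 before switching traffic
aws cloudwatch get-metric-statistics \
  --namespace AWS/DMS --metric-name CDCLatencyTarget \
  --dimensions Name=ReplicationInstanceIdentifier,Value=dms-inst-1 \
               Name=ReplicationTaskIdentifier,Value=ABCD1234EXAMPLE \
  --statistics Maximum --period 60 \
  --start-time "$(date -u -d '10 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)"

# Flip the internal CNAME (30s TTL) from the EC2 primary to the Aurora/RDS Proxy endpoint
aws route53 change-resource-record-sets --hosted-zone-id Z0ABCDEXAMPLE --change-batch '{
  "Changes": [{"Action": "UPSERT", "ResourceRecordSet": {
    "Name": "db.internal.example.com", "Type": "CNAME", "TTL": 30,
    "ResourceRecords": [{"Value": "backend-proxy.proxy-xyz.us-east-1.rds.amazonaws.com"}]}}]}'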

Rollback Plan

If Aurora validation fails at any point before T+30min: set RDS Proxy target back to EC2 PostgreSQL (2-minute operation), re-enable writes on EC2 PostgreSQL, restart DMS replication (will catch up from WAL). Estimated rollback write downtime: <5 minutes. If DMS fails to restart (WAL segment expired on EC2 PostgreSQL after 7+ hours): restore from the most recent Aurora snapshot to a new EC2 PostgreSQL instance — this is the catastrophic rollback path (2–3 hours of downtime). Mitigate by extending WAL retention on EC2 PostgreSQL: set wal_keep_size = 10240 (10GB) for the week before cutover.

Post-Migration: Aurora-Specific Optimizations

After 2 weeks of stable operation, switch from Aurora On-Demand pricing to Aurora Serverless v2 if write traffic has variance >3x (saves 30–50% on compute during off-peak). Note that Aurora Backtrack is available only for Aurora MySQL, not Aurora PostgreSQL, so the rewind safety net must come from standard mechanisms: keep automated backup retention at 35 days for point-in-time restore, and take a manual cluster snapshot immediately before any schema migration during the first 3 months post-migration. If a schema migration error is discovered days later, restore to the point in time immediately before the migration rather than attempting an in-place repair.

Early Warning Metrics:

  • DMS replication lag — alert >10s, page >30s; if lag >60s at T-24hr, postpone cutover
  • Aurora replica lag — alert >1s, page >5s (indicates Aurora internal replication issue, not DMS)
  • Query performance regression — run pg_stat_statements comparison between Aurora staging and EC2 production weekly; alert if any query's mean execution time increases >50%
  • Extension function errors — monitor Aurora PostgreSQL logs for any errors from PostGIS, pg_partman, or pg_trgm functions post-migration

4. Interview Score: 9 / 10

Why this demonstrates senior-level maturity: Identifying that the 12 hardcoded connection strings are solved architecturally (RDS Proxy as decoupling layer) rather than operationally (rushing 12 service teams to update configs before migration) demonstrates senior judgment. Calculating that the RDS Proxy approach delivers 7-minute write downtime (not 30 seconds) and proposing the DNS TTL alternative with the JDBC cache caveat shows honest engineering analysis, not optimistic planning.

What differentiates it from mid-level thinking: A mid-level engineer would propose pg_dump/restore (requires full downtime), not consider the WAL level change requiring a PostgreSQL restart 3 weeks before migration, and miss the Aurora parameter group query plan divergence risk. They would also not model the rollback path at each step of the cutover procedure.

What would make it a 10/10: A 10/10 response would include the specific DMS task JSON configuration for PostgreSQL logical replication, a checksum validation SQL query for verifying data integrity between EC2 PostgreSQL and Aurora post-replication, and a worked Aurora pricing comparison (On-Demand vs. Serverless v2 vs. Reserved Instances) at the specific traffic profile described.



Question 11: Zero-Trust Network Architecture on AWS

Difficulty: Elite | Role: Cloud Security Architect / Platform Engineer | Level: Staff / Principal | Company Examples: Palo Alto Networks, CrowdStrike, Cloudflare, HashiCorp


The Question

Your company has grown from 50 to 600 engineers in 3 years. Your current network security model is perimeter-based: engineers VPN into the corporate network, which has trusted access to the production VPC via a peering connection. Security has found that 3 former contractors still have active VPN credentials. Last month, a phishing attack compromised an engineer's laptop — and because the laptop was VPN-connected, the attacker had access to production RDS databases directly. You have been asked to design a zero-trust network architecture for AWS that eliminates implicit trust, enforces least-privilege access to every production resource, and scales to 600 engineers without becoming an operational nightmare. You have 9 months and a team of 4 platform engineers.


1. What Is This Question Testing?

  • Security awareness — understanding zero-trust as an architectural principle (never trust, always verify, assume breach) vs. a product category (ZTNA vendors)
  • Systems thinking — recognizing that zero-trust is not just network segmentation; it requires identity verification at every layer: network, workload, data, and application
  • Cloud architecture maturity — knowing the difference between AWS PrivateLink, VPC Lattice, AWS Verified Access, and traditional VPN + peering for zero-trust implementation patterns
  • Organizational thinking — designing a zero-trust model that 600 engineers can actually use without becoming a productivity blocker; security that is too friction-heavy gets bypassed
  • Risk assessment — modeling the blast radius of the phishing attack in the old vs. new architecture: in zero-trust, a compromised engineer's laptop should have access only to the specific resources the engineer needs, not all of production
  • Tradeoff analysis — migrating 600 engineers off VPN over 9 months while maintaining access continuity; cannot do a big-bang cutover without engineering productivity collapse

2. Framework: Zero-Trust Implementation Roadmap (ZTIR)

  1. Assumption Documentation — Current VPN solution (Cisco AnyConnect, AWS Client VPN, OpenVPN), identity provider (Okta, Azure AD, Google Workspace), and current production access patterns by role
  1. Constraint Analysis — 600 engineers, 9-month migration, 4-person platform team, cannot break developer access to staging/debug environments during migration
  1. Tradeoff Evaluation — AWS Verified Access (managed ZTNA) vs. self-hosted zero-trust proxy (Pomerium, Teleport) vs. commercial ZTNA (Zscaler, Cloudflare Access); build vs. buy at 600 engineers
  1. Hidden Cost Identification — AWS Verified Access: $0.012/hour per endpoint + $0.015/GB data processed; for 600 engineers × 8hr/day = significant cost; Teleport enterprise: $100–$200/user/month; Cloudflare Access: $7/user/month
  1. Risk Signals / Early Warning Metrics — Failed authentication attempts per user (>10/hour suggests credential stuffing), access to resources outside the engineer's defined role, lateral movement attempts (accessing multiple resources in sequence from a single session)
  1. Pivot Triggers — If device posture checking blocks >5% of valid engineers in the first month, reduce posture check strictness; if ZTNA solution causes >200ms additional latency on database connections, re-evaluate architecture
  1. Long-Term Evolution Plan — Phase 1: identity-aware access to web applications; Phase 2: database and API access via ZTNA; Phase 3: workload-to-workload zero-trust via AWS VPC Lattice and service mesh

3. The Answer

Explicit Assumptions:

  • Identity provider: Okta (SAML/OIDC); engineer devices: mix of macOS and Windows, some corporate-managed, some BYOD
  • Current access pattern: VPN → production VPC (flat network, all engineers can reach all resources)
  • Production resources requiring access: RDS databases (50 instances), EKS clusters (3), EC2 bastion hosts (12), S3 buckets (400+), internal web applications (25)
  • Budget: $200,000 for ZTNA tooling over 9 months (~$22K/month); 4 platform engineers at 60% allocation

Phase 1: Eliminate Implicit Trust (Months 1-3)

The most dangerous element of the current architecture is not the VPN itself — it is the flat network behind the VPN. Once inside the VPN, any device can reach any production resource with no further authentication. Fix this first before replacing the VPN. Implement micro-segmentation: deploy AWS Security Groups with explicit allow rules per team per resource, removing the implicit "VPN = trusted" rule. Example: the backend engineering team's EC2 instances can reach only the backend RDS cluster (port 5432), not the data platform RDS cluster. Use AWS IAM Identity Center (SSO) with permission sets scoped to each team's AWS resources. This ensures that even if a VPN credential is compromised, the attacker cannot access resources outside the compromised engineer's permission set. Immediately audit all VPN credentials: cross-reference active VPN credentials against Okta active users. Revoke any VPN credentials for a user not active in Okta within the last 30 days. This alone addresses the 3 former contractors with active credentials.
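A sketch of what one micro-segmentation change looks like at the CLI (security group IDs and the VPN CIDR are placeholders):

# Allow: backend application SG -> backend RDS SG, Postgres only
aws ec2 authorize-security-group-ingress \
  --group-id sg-0backendrdsEXAMPLE \
  --protocol tcp --port 5432 \
  --source-group sg-0backendappEXAMPLE

# Remove the blanket "anything on the VPN can reach the database" rule
aws ec2 revoke-security-group-ingress \
  --group-id sg-0backendrdsEXAMPLE \
  --protocol tcp --port 5432 \
  --cidr 10.8.0.0/16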

Phase 2: Identity-Aware Application Access (Months 2-5)

Replace VPN access to internal web applications (25 apps) with AWS Verified Access or Cloudflare Access. AWS Verified Access: engineers authenticate via Okta OIDC, device posture is checked (OS version, disk encryption, endpoint protection running), and access is granted to specific applications per user — not to the entire network. AWS Verified Access sits in front of internal ALBs for each application. No VPN required for web application access after this phase. Cloudflare Access is cheaper ($7/user/month × 600 = $4,200/month vs. AWS Verified Access at ~$8,000–$12,000/month at this scale) and has better performance due to Cloudflare's global edge network. However, AWS Verified Access integrates more natively with IAM, CloudTrail, and AWS WAF. Recommendation: use Cloudflare Access for developer-facing web applications (cost optimization); use AWS Verified Access for compliance-sensitive internal applications (audit trail integration with CloudTrail).

Phase 3: Database and Infrastructure Access (Months 4-8)

Database access via VPN is the highest-risk pattern: a VPN-connected compromised laptop has direct TCP access to port 5432 on production RDS. Replace with: AWS Systems Manager Session Manager for all EC2 access (eliminates SSH key management and direct port 22 exposure), and AWS RDS Proxy with IAM authentication for all database access. With RDS Proxy IAM auth: engineers authenticate with their AWS IAM identity (federated from Okta), and receive a short-lived RDS auth token (15-minute TTL). No password. No shared credentials. The database connection is attributable to a specific IAM identity in CloudTrail. This converts database access from a network trust problem (is the engineer on the VPN?) to an identity trust problem (is this engineer's IAM role permitted to access this database?). For EKS access: implement EKS access entries with Okta group federation — engineers access kubectl via aws eks update-kubeconfig with their IAM role, which is scoped per team. No kubeconfig sharing, no static kubeconfig files. For S3: remove all bucket policies that grant access based on VPC origin (aws:SourceVpc condition). Replace with IAM role-based access scoped to specific bucket prefixes per team.
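A sketch of what day-to-day access looks like for an engineer under this model (instance ID, proxy endpoint, database, and username are placeholders):

# Shell access through Session Manager — no SSH keys, no inbound port 22
aws ssm start-session --target i-0123456789abcdef0

# Database access with a 15-minute IAM auth token instead of a password
TOKEN=$(aws rds generate-db-auth-token \
  --hostname backend-proxy.proxy-xyz.us-east-1.rds.amazonaws.com \
  --port 5432 --username jane.doe --region us-east-1)
psql "host=backend-proxy.proxy-xyz.us-east-1.rds.amazonaws.com port=5432 dbname=orders user=jane.doe password=$TOKEN sslmode=require"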

Phase 4: Workload-to-Workload Zero Trust (Months 7-9)

Zero-trust is not only about human access — workload-to-workload trust is equally important. The phishing attack scenario: a compromised engineer's laptop contacts an internal service that has broad production access. If the internal service itself has zero-trust controls, it limits the blast radius. Implement AWS VPC Lattice for service-to-service authentication: each microservice authenticates to VPC Lattice with its IAM role, and VPC Lattice enforces that only permitted services can call specific APIs. A compromised service cannot call other services it was not explicitly authorized to reach, even within the same VPC. For EKS workloads: implement IRSA (IAM Roles for Service Accounts) with least-privilege policies per microservice. No service should use the EC2 instance role (which grants broad access) — each pod should have its own IAM role scoped to specific DynamoDB tables, specific S3 prefixes, and specific SNS topics.
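A sketch of provisioning one such per-service identity with eksctl (cluster, namespace, service, and policy names are placeholders):

# One IAM role per Kubernetes service account, scoped to that service's resources only
eksctl create iamserviceaccount \
  --cluster prod-eks \
  --namespace checkout \
  --name checkout-api \
  --attach-policy-arn arn:aws:iam::123456789012:policy/checkout-api-least-privilege \
  --approve

# Verify the pod identity annotation is in place
kubectl -n checkout get serviceaccount checkout-api \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'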

Device Posture Enforcement

Zero-trust requires knowing the trust level of the device making the request — not just the identity of the user. Integrate Okta Device Trust with Jamf (for macOS) and Microsoft Intune (for Windows): devices that are not enrolled, not encrypted, or not running the latest OS version are denied access regardless of valid credentials. For the BYOD population: create a separate Okta policy that restricts BYOD devices to web-only access via Cloudflare Access (no database, no SSH, no kubectl). Corporate-managed devices receive full access based on IAM roles.

Blast Radius Comparison: Old vs. New

Old architecture (phishing attack impact): compromised laptop with VPN = access to all 50 RDS databases, all 12 EC2 bastions, all 25 internal applications, all 400 S3 buckets. An attacker can exfiltrate TBs of data before detection. New architecture (same phishing attack): compromised laptop triggers Okta Verify notification (attacker cannot pass MFA without physical device access). If the laptop itself is compromised (not just credentials), device posture check detects the endpoint protection agent as tampered or absent → Verified Access denies access → attacker is limited to resources accessible without ZTNA (none). If device posture is somehow bypassed: attacker has access only to the specific engineer's IAM role permissions — a backend engineer has access to 2 RDS databases, 1 EKS namespace, and specific S3 prefixes. Maximum blast radius is the engineer's individual IAM permissions, not all of production.

Early Warning Metrics:

  • Failed MFA attempts per user — alert >5 failures in 10 minutes; indicates credential stuffing or phishing attempt
  • Anomalous access patterns — Okta Identity Threat Protection: alert if an engineer's access originates from a new country, new device, or impossible travel (same credential in NYC at 2 pm and London at 3 pm)
  • IAM role assumption from non-standard origin — CloudTrail alert if any IAM role is assumed from an IP not in the expected engineer range or from an unexpected AWS region
  • Database connection from non-RDS Proxy endpoint — any direct connection to port 5432 on RDS (bypassing RDS Proxy) triggers immediate alert

4. Interview Score: 9 / 10

Why this demonstrates senior-level maturity: The phased migration approach (fix the flat network first, then eliminate VPN progressively) avoids the big-bang cutover that would collapse engineering productivity. Quantifying the blast radius comparison between old and new architectures, proposing RDS Proxy with IAM auth as the database zero-trust pattern (not a network-based solution), and addressing workload-to-workload trust (not just human access) demonstrate security architecture depth at the principal level.

What differentiates it from mid-level thinking: A mid-level engineer would propose "implement Zscaler" or "deploy AWS Verified Access everywhere" as the solution without phasing the migration or addressing database access patterns. They would not identify the flat network behind VPN as the primary risk, would miss workload-to-workload trust, and would not address the device posture enforcement gap for BYOD devices.

What would make it a 10/10: A 10/10 response would include a specific Okta Device Trust integration Terraform configuration, a worked IAM policy example for a backend engineer scoped to specific RDS database ARNs and specific S3 bucket prefixes, and a concrete migration plan showing which of the 600 engineers migrate to ZTNA first (start with the data platform team that has the highest-risk access, not all 600 simultaneously).
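
As an illustration of that worked policy, here is a minimal Python/boto3 sketch of a backend-engineer policy scoped to two RDS Proxy databases and one S3 prefix; the account ID, proxy resource IDs, bucket name, and prefix are hypothetical:

    # Sketch: least-privilege policy for a backend engineer — rds-db:connect to two specific
    # databases plus one team S3 prefix. All ARNs and identifiers are placeholders.
    import json
    import boto3

    backend_engineer_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # IAM database authentication to exactly two databases (via RDS Proxy)
                "Effect": "Allow",
                "Action": "rds-db:connect",
                "Resource": [
                    "arn:aws:rds-db:us-east-1:123456789012:dbuser:prx-0abc/app_readwrite",
                    "arn:aws:rds-db:us-east-1:123456789012:dbuser:prx-0def/app_readwrite",
                ],
            },
            {   # Object access limited to the team's prefix, nothing else in the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": "arn:aws:s3:::acme-data-lake/backend/*",
            },
            {   # Listing restricted to the same prefix
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": "arn:aws:s3:::acme-data-lake",
                "Condition": {"StringLike": {"s3:prefix": ["backend/*"]}},
            },
        ],
    }

    iam = boto3.client("iam")
    iam.create_policy(
        PolicyName="backend-engineer-scoped",
        PolicyDocument=json.dumps(backend_engineer_policy),
    )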


Question 12: CI/CD Pipeline at Scale — 1,000+ Deploys Per Day

Difficulty: Senior | Role: Platform / DevOps Engineer | Level: Senior / Staff | Company Examples: Etsy, Netflix, Amazon, Meta


The Question

Your engineering organization has grown to 200 engineers across 40 microservice teams. Each team deploys independently, but all pipelines run through a shared Jenkins cluster on EC2. The situation has reached a breaking point: average build time has increased from 8 minutes to 47 minutes over the past 18 months, the Jenkins cluster has had 6 unplanned outages this year (each blocking all 200 engineers for 2–4 hours), and your target is 1,000 deploys/day across all teams, but you are currently achieving 340. Your platform team has 5 engineers. Design the next-generation CI/CD architecture that achieves 1,000 deploys/day, reduces average pipeline duration to under 10 minutes, eliminates the single-point-of-failure Jenkins cluster, and self-serves team onboarding without platform team intervention.


1. What Is This Question Testing?

  • Systems thinking — diagnosing why pipelines slow down at scale: resource contention on shared build agents, test suite growth without parallel execution, Docker layer cache misses due to stateless agents, and artifact storage bottlenecks
  • Cloud architecture maturity — understanding the architectural differences between Jenkins (stateful master/agent), GitHub Actions (managed, per-workflow ephemeral runners), and Tekton (Kubernetes-native, self-hosted) for different scale and cost profiles
  • Reliability engineering — designing CI/CD infrastructure with the same availability standards as production: the Jenkins cluster, being a single point of failure for all 200 engineers, is a P1 availability risk
  • Financial literacy — modeling the cost of 1,000 deploys/day across ephemeral cloud compute vs. reserved EC2 capacity vs. GitHub Actions hosted runners ($0.008/minute for Linux, ~$0.016/minute for 4-core)
  • Organizational thinking — a 5-person platform team cannot support 40 microservice teams if every pipeline requires platform team involvement; self-service must be an architectural requirement, not a feature
  • Infrastructure-as-code knowledge — understanding that pipeline configuration should be codified (pipeline-as-code), version-controlled, and tested the same way as application code

2. Framework: CI/CD Scalability and Reliability Framework (CSRF)

  1. Assumption Documentation — Current build composition (compile, unit test, integration test, Docker build, push, deploy), test suite sizes per team, artifact sizes, and current EC2 instance types for Jenkins agents
  1. Constraint Analysis — 5-person platform team, 40 teams to onboard self-service, 1,000 deploys/day target, <10-minute pipeline target
  1. Tradeoff Evaluation — Migrate Jenkins to GitHub Actions (managed) vs. self-hosted runners on EKS vs. hybrid; cost and control tradeoffs
  1. Hidden Cost Identification — GitHub Actions hosted runner cost at 1,000 deploys/day × 10 minutes = 10,000 runner-minutes/day × $0.008 = $80/day = $2,400/month. Self-hosted runners on EKS: fixed compute cost, but requires the platform team to maintain; Spot Instance interruptions can fail builds
  1. Risk Signals / Early Warning Metrics — Pipeline queue depth (time-in-queue per build), build flakiness rate per test suite, artifact registry storage growth rate, runner utilization (>80% sustained = capacity issue)
  1. Pivot Triggers — If GitHub Actions hosted runner costs exceed $8,000/month, evaluate self-hosted runners on Spot EKS nodes; if self-hosted runner maintenance exceeds 20% of platform team capacity, switch back to managed runners
  1. Long-Term Evolution Plan — Progressive delivery (feature flags, canary deployments) as the evolution beyond binary deploy/don't-deploy decisions; Argo Rollouts for automated canary analysis

3. The Answer

Explicit Assumptions:

  • Current build breakdown per pipeline: compile (8min), unit tests (15min), Docker build (12min), integration tests (9min), deploy (3min) = 47min total
  • 40 teams: currently ~8–9 deploys/team/day (≈340/day total); target ~25 deploys/team/day (1,000/day)
  • Current Jenkins: 3 m5.4xlarge EC2 masters + 20 m5.2xlarge build agents; all stateful, pets not cattle
  • Monorepo or polyrepo? Assume polyrepo — 40 separate repositories, each with its own pipeline
  • Artifact registry: self-managed Nexus on EC2 (another single point of failure)

Root Cause of 47-Minute Pipelines

Before redesigning infrastructure, diagnose why pipelines are slow. The 47-minute breakdown reveals: unit tests at 15 minutes on a single thread — this is the #1 problem. Most test suites can be parallelized across multiple workers. Jenkins does not automatically parallelize test execution; teams must explicitly configure test sharding. Docker build at 12 minutes — this indicates layer cache misses. Stateless Jenkins agents with no persistent Docker layer cache rebuild the entire image on every build. Solution: use a shared Docker layer cache (either ECR pull-through cache or BuildKit remote cache in S3). A warm layer cache reduces Docker builds from 12 minutes to 1–3 minutes for typical application code changes. Integration tests at 9 minutes — typically indicate tests that spin up dependencies (databases, message queues) on each run. Containerize test dependencies using Docker Compose or Testcontainers; run in parallel. Compile at 8 minutes — indicates either a very large codebase, no incremental compilation, or a cold JVM startup on every build. Gradle/Maven build caches stored in S3 can reduce compile time to 1–3 minutes for incremental changes. Target breakdown for <10 minutes: compile 2min (with cache), unit tests 4min (parallelized, 4-shard), Docker build 2min (layer cache), deploy 2min = 10 minutes, with the containerized integration tests running as a parallel stage alongside the Docker build so they stay off the critical path.

Architecture: GitHub Actions + Self-Hosted Runners on EKS

Do not rebuild Jenkins. Jenkins' fundamental architecture (stateful master coordinating stateful agents) is the root cause of the single-point-of-failure and scaling problems. Migrate to GitHub Actions with self-hosted runners on EKS. GitHub Actions provides: workflow-as-code in YAML checked into each repository, managed orchestration (no master to maintain), native GitHub integration (PR checks, branch protection, deployment environments). Self-hosted runners on EKS: each GitHub Actions job spawns a dedicated Kubernetes pod (using the actions-runner-controller or KEDA-based autoscaler). The pod is ephemeral — spun up for one job, terminated after completion. No shared state, no resource contention between teams. Runner nodes: EKS node group of m6i.4xlarge Spot instances (16 vCPU, 64GB). At 1,000 builds/day × 10 minutes average = 10,000 runner-minutes/day ÷ 480 minutes/workday = 20.8 concurrent builds at peak. Each build uses 4 vCPU pods → need 83 vCPUs peak → 6 m6i.4xlarge instances peak. Spot pricing for m6i.4xlarge: ~$0.27/hr. 6 instances × $0.27 × 8 peak hours + 2 instances × $0.27 × 16 off-peak hours = $21.60/day = $648/month. Add on-demand base capacity (2 m6i.4xlarge for priority builds): $0.768/hr × 2 × 730hr = $1,122/month. Total runner compute: ~$1,770/month — dramatically cheaper than GitHub-hosted runners at $2,400/month and cheaper than the current Jenkins EC2 fleet (~$3,200/month).
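
The capacity and cost arithmetic above is worth keeping as a small, re-runnable model so it can be refreshed when deploy volume or Spot prices move; the rates below are the rough estimates used in this answer, not authoritative pricing:

    # Sketch: capacity/cost model for self-hosted GitHub Actions runners on EKS Spot nodes.
    # All prices are the rough estimates used in this answer, not quoted AWS pricing.
    import math

    builds_per_day = 1_000
    avg_build_minutes = 10
    vcpu_per_build = 4
    node_vcpu = 16                 # m6i.4xlarge
    spot_hourly = 0.27             # assumed Spot price
    on_demand_hourly = 0.768       # assumed on-demand price
    workday_minutes = 480

    runner_minutes = builds_per_day * avg_build_minutes
    concurrent_builds = runner_minutes / workday_minutes                     # ~20.8
    peak_nodes = math.ceil(concurrent_builds * vcpu_per_build / node_vcpu)   # 6

    spot_daily = peak_nodes * spot_hourly * 8 + 2 * spot_hourly * 16         # peak + off-peak
    spot_monthly = spot_daily * 30
    on_demand_monthly = 2 * on_demand_hourly * 730                           # always-on base

    print(f"peak nodes:  {peak_nodes}")
    print(f"spot:        ${spot_monthly:,.0f}/month")
    print(f"on-demand:   ${on_demand_monthly:,.0f}/month")
    print(f"total:       ${spot_monthly + on_demand_monthly:,.0f}/month")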

Build Caching Strategy

Implement three caching layers: (1) Docker BuildKit remote cache via ECR: configure BuildKit to push layer cache to ECR on every build (--cache-to=type=registry,ref=123456789012.dkr.ecr.us-east-1.amazonaws.com/build-cache:myservice) and pull on next build. Cache hit rate >80% expected after 1 week of warming. Docker build time: 12min → 1–2min. (2) Gradle/Maven/npm dependency cache in S3: store ~/.gradle, ~/.m2, or node_modules per branch in S3 (the stock actions/cache action targets GitHub's hosted cache service, so on self-hosted runners use an S3-backed cache action or a shared cache volume). Compile time: 8min → 1–2min for incremental changes. (3) Test parallelization: implement test sharding. For JVM suites, split test classes across parallel jobs via Gradle test filtering or tag-based matrix includes; for pytest, use pytest-xdist plus a per-shard file split. Split unit tests across 4 parallel pods (matrix strategy in GitHub Actions); a sharding sketch follows below. Unit test time: 15min → 4–5min. Total pipeline time after caching and parallelization: 10–12 minutes. This meets the target.
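
One way to implement the shard split on self-hosted runners is a small, deterministic Python helper that each matrix job runs with its shard index; the paths and environment variable names are illustrative:

    # Sketch: deterministic test sharding for a CI matrix job. Each parallel runner receives
    # its shard index via environment variables and runs a stable, disjoint subset of files.
    # Directory layout, env var names, and the pytest-xdist dependency are assumptions.
    import os
    import subprocess
    import sys
    from pathlib import Path
    from zlib import crc32

    shard_index = int(os.environ.get("SHARD_INDEX", "0"))   # 0..SHARD_TOTAL-1
    shard_total = int(os.environ.get("SHARD_TOTAL", "4"))

    # Stable assignment: the same file always lands in the same shard across runs.
    test_files = sorted(str(p) for p in Path("tests").rglob("test_*.py"))
    my_files = [f for f in test_files if crc32(f.encode()) % shard_total == shard_index]

    if not my_files:
        sys.exit(0)  # nothing assigned to this shard

    # pytest-xdist (-n auto) still parallelizes within the shard across the pod's CPUs.
    sys.exit(subprocess.call(["pytest", "-n", "auto", *my_files]))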

Eliminating the Single Point of Failure

GitHub Actions orchestration is managed by GitHub (99.9% SLA). Self-hosted runners on EKS have no master — each runner is an independent pod. If a node fails, the Kubernetes scheduler restarts the runner pod on another node. The only remaining single point of failure is EKS itself: use a managed node group across 3 AZs. EKS control plane SLA: 99.95%. This is a significant improvement over the Jenkins cluster's observed 6 outages/year (99.83% availability). For critical pipeline stages (production deployments), implement job retry with idempotency checks: if a deployment job fails due to runner infrastructure failure, the retry must detect whether the deployment partially succeeded (check Kubernetes deployment status before re-deploying).

Self-Service Team Onboarding

The 5-person platform team cannot review and configure pipelines for 40 teams. Design self-service: create a standardized GitHub Actions workflow template published in a central organization-level repository. Teams inherit the template via a reusable workflow call — they do not write their own pipeline YAML from scratch. The template handles: Docker build with BuildKit cache, unit test parallelization, ECR push, and EKS deployment via Argo CD. Teams customize via workflow inputs (Dockerfile path, test command, deployment namespace). New team onboarding: fork the template repository, configure 3 variables (service name, ECR repo, EKS namespace), open a PR — the platform team reviews the PR (15 minutes) rather than building a custom pipeline. Target: new team onboarding time from 2 weeks (current, with platform team involvement) to 2 days (self-service template).

Early Warning Metrics:

  • Pipeline queue time — alert if average time-in-queue exceeds 3 minutes (indicates runner capacity shortage)
  • Build flakiness rate — alert if any team's pipeline failure rate due to infrastructure (not test failures) exceeds 2%; flaky infrastructure undermines developer trust in CI/CD
  • Docker cache hit rate — alert if ECR cache hit rate drops below 70% (indicates cache eviction policy needs tuning)
  • Runner Spot interruption rate — alert if >5% of builds are interrupted by Spot reclamation; increase on-demand base capacity or diversify instance types

4. Interview Score: 8.5 / 10

Why this demonstrates senior-level maturity: Diagnosing the 47-minute pipeline as a symptom of three separate problems (test parallelization, Docker layer cache, compile cache) rather than "Jenkins is slow" demonstrates engineering depth. Calculating the self-hosted runner cost ($1,770/month) against GitHub-hosted ($2,400/month) and the current Jenkins fleet ($3,200/month) demonstrates financial modeling. Designing self-service onboarding as an architectural requirement (not a future feature) reflects organizational maturity.

What differentiates it from mid-level thinking: A mid-level engineer would recommend "migrate to GitHub Actions hosted runners" without modeling the cost at 1,000 deploys/day, would not identify Docker layer cache misses as the primary Docker build slowdown, and would not design a self-service onboarding process — instead planning for the platform team to manually configure each of 40 teams' pipelines.

What would make it a 10/10: A 10/10 response would include a specific GitHub Actions workflow YAML for the reusable template, a Karpenter NodePool configuration for the runner EKS node group with Spot diversification across 5 instance types, and a concrete test sharding configuration for both JUnit and pytest showing how 200 tests are distributed across 4 parallel runners.



Question 13: Data Lake Architecture and Cost Optimization at Petabyte Scale

Difficulty: Senior | Role: Cloud Data Engineer / Architect | Level: Senior / Staff | Company Examples: Airbnb, Uber, Netflix, Twitter


The Question

Your company's data lake on S3 has grown to 4 petabytes over 5 years and now costs $1.2M/year in S3 storage alone — before compute. 70% of the data has not been accessed in over 18 months. Your Athena queries against the largest tables take 45–90 minutes and cost $15–$40 per query run (at $5/TB scanned). Your data engineering team runs 800 Glue ETL jobs per day, and Glue costs alone are $280,000/year. Business stakeholders are complaining that "the data lake is slow and expensive." You have been asked to reduce total data lake cost by 40% within 6 months without losing any data and without degrading query performance for the 20% of data that is actively used. Describe your cost optimization strategy, storage tiering architecture, and query performance improvements.


1. What Is This Question Testing?

  • Financial literacy — understanding S3 storage class pricing (Standard $0.023/GB vs. Glacier Instant Retrieval $0.004/GB vs. Glacier Deep Archive $0.00099/GB) and calculating the ROI of tiering 70% of cold data
  • Cloud architecture maturity — knowing that Athena cost is driven by data scanned, not data stored; columnar file formats (Parquet/ORC) and partitioning are the primary cost levers for query optimization
  • Systems thinking — recognizing that 800 Glue jobs/day for a 4PB lake likely contains significant redundancy: jobs that re-process the same data, jobs with overlapping outputs, and jobs that could be consolidated
  • Cost modeling — calculating the 40% cost reduction: 70% of 4PB in cold storage = 2.8PB × ($0.023 - $0.004) ≈ $53,200/month in savings from tiering alone, plus Glue job optimization and Athena partitioning to reduce scan costs
  • Risk assessment — S3 Lifecycle policies moving data to Glacier have a 90-day minimum storage charge; moving data to Glacier Instant Retrieval incorrectly can turn a 45-minute Athena query into a 12-hour query (Glacier Standard retrieval)
  • Organizational thinking — reducing data lake costs is politically complex: every "cold" dataset has a team that believes it will be needed urgently someday; the governance process for tiering decisions must involve stakeholders

2. Framework: Data Lake Cost Optimization Model (DLCOM)

  1. Assumption Documentation — Data age distribution, access pattern analysis (S3 Storage Lens last-accessed data), query patterns (Athena query history), Glue job dependency graph
  1. Constraint Analysis — Cannot delete data (legal hold or business requirement), 40% cost reduction in 6 months, no degradation for active data (top 20% by access frequency)
  1. Tradeoff Evaluation — S3 Intelligent-Tiering (automatic, set-and-forget, $0.0025/1,000 objects monitoring charge) vs. manual Lifecycle policies (cheaper for predictable cold data, no monitoring charge)
  1. Hidden Cost Identification — S3 Glacier retrieval costs ($0.01–$0.03/GB depending on retrieval tier), S3 Intelligent-Tiering monitoring charge ($0.0025/1,000 objects × 4PB in small files = potentially significant), Glue Data Catalog API call charges, Athena per-query minimum cost ($0.01 even for empty results)
  1. Risk Signals / Early Warning Metrics — Athena scan cost per query trending upward (indicates new unpartitioned data), Glue DPU-hours per job trending upward (indicates data volume growth without job optimization), S3 Storage Lens showing access frequency by prefix
  1. Pivot Triggers — If S3 Intelligent-Tiering monitoring charges exceed $15,000/month for small files (<128KB), switch to manual Lifecycle policies for those prefixes; if Glue job consolidation reduces jobs from 800 to 400, re-evaluate if Glue is still the right engine vs. Spark on EMR
  1. Long-Term Evolution Plan — Adopt Apache Iceberg as the table format for all new data; Iceberg's time-travel, compaction, and partition evolution features reduce operational overhead and enable more aggressive cost optimization

3. The Answer

Explicit Assumptions:

  • 4PB total: 1.2PB hot (accessed weekly), 2.8PB cold (not accessed in 18+ months)
  • File format mix: 60% Parquet, 30% CSV/JSON (legacy pipelines), 10% Avro
  • Current annual cost breakdown: S3 Standard storage $1,200,000/year; Glue ETL $280,000/year; Athena queries $180,000/year; total ~$1,660,000/year. Target: reduce by 40% = save $664,000/year
  • Average object size: 50MB (meaning 4PB = ~80 million objects — small object count, not a small-files problem at this scale)
  • Athena workgroup is configured without query result reuse or partition projection

Storage Tiering: The Highest-ROI Action

Use S3 Storage Lens last-accessed data to identify the cold 2.8PB with precision. S3 Storage Lens Activity Metrics show the last-accessed date at the prefix (folder) level. Create S3 Lifecycle rules to transition cold prefixes. Tiering strategy: data not accessed in 90 days → S3 Intelligent-Tiering (automatically moves between Frequent Access, Infrequent Access, and Archive tiers based on access patterns); data not accessed in 365 days → S3 Glacier Instant Retrieval (millisecond retrieval, directly queryable by Athena, $0.004/GB-month vs. $0.023 Standard). Do NOT use S3 Glacier Standard or Deep Archive for data that may ever be queried by Athena — retrieval takes 3–12 hours, making Athena queries unusable. Glacier Deep Archive ($0.00099/GB) is appropriate only for compliance archive data that will never be queried — define this category explicitly with business stakeholders before applying. Savings from tiering 2.8PB to Glacier Instant Retrieval: 2.8PB × 1,024 TB/PB × 1,024 GB/TB ≈ 2,936,012 GB; 2,936,012 GB × ($0.023 − $0.004) ≈ $55,784/month = $669,408/year. This alone exceeds the 40% cost reduction target.
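
A minimal Python/boto3 sketch of the Lifecycle rules described above, applied to one hypothetical bucket with one queryable cold prefix and one never-queried compliance-archive prefix:

    # Sketch: Lifecycle rules implementing the tiering policy for two example prefixes.
    # Bucket and prefix names are placeholders; apply per prefix identified by Storage Lens.
    # NOTE: Lifecycle transitions key off object age, not last access; access-based movement
    # happens inside Intelligent-Tiering itself.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="acme-data-lake",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-cold-but-queryable",
                    "Filter": {"Prefix": "events/2023/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 90, "StorageClass": "INTELLIGENT_TIERING"},
                        # still queryable by Athena at millisecond latency
                        {"Days": 365, "StorageClass": "GLACIER_IR"},
                    ],
                },
                {
                    "ID": "compliance-archive-never-queried",
                    "Filter": {"Prefix": "compliance-archive/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
                },
            ]
        },
    )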

Athena Query Cost Optimization: Partitioning and Columnar Format

Athena charges $5 per TB scanned. A query scanning 8TB of unpartitioned CSV costs $40. The same query against partitioned Parquet costs $0.50–$2.00 (95–97% cost reduction). Two actions: (1) Convert CSV/JSON to Parquet: the 1.2PB of hot data stored as CSV/JSON (30% × 1.2PB = 360TB) must be converted to Parquet. Run a one-time Glue ETL job to convert. Parquet columnar compression typically reduces storage size by 60–80% (360TB → 72–144TB after conversion) and reduces Athena scan cost by 85–95% per query. (2) Add partition projection to all Athena tables: configure partition projection in the Glue Data Catalog to eliminate GetPartitions API calls for date-based partitions. For a table partitioned by year/month/day, partition projection tells Athena to infer partition paths without scanning the Glue catalog — reducing query startup time from 45–90 seconds to <5 seconds for heavily partitioned tables. Implement Athena query result reuse (5-minute TTL for dashboards running the same queries): business intelligence tools (Tableau, Looker) often run the same query multiple times per session. Query result reuse serves cached results at $0 per reuse. Expected Athena cost reduction: 60–70% for the active workload.
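
A minimal Python/boto3 sketch of enabling partition projection on an existing date-partitioned Glue table; the database, table, partition column (dt), and S3 location are hypothetical:

    # Sketch: set Athena partition-projection table properties on a Glue Data Catalog table.
    # Database/table names, partition column, and S3 path are placeholders. A production
    # version should carry over every field of the existing table definition into TableInput.
    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]

    params = table.get("Parameters", {})
    params.update({
        "projection.enabled": "true",
        "projection.dt.type": "date",
        "projection.dt.range": "2019-01-01,NOW",
        "projection.dt.format": "yyyy-MM-dd",
        # Athena derives partition locations directly, skipping Glue GetPartitions calls
        "storage.location.template": "s3://acme-data-lake/events/dt=${dt}/",
    })

    glue.update_table(
        DatabaseName="analytics",
        TableInput={
            "Name": table["Name"],
            "StorageDescriptor": table["StorageDescriptor"],
            "PartitionKeys": table["PartitionKeys"],
            "TableType": table.get("TableType", "EXTERNAL_TABLE"),
            "Parameters": params,
        },
    )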

Glue ETL Job Audit: From 800 to 300 Jobs

At $0.44/DPU-hour, the $280,000/year Glue spend works out to roughly $767/day across 800 jobs, about $0.96 per job, or ~2.2 DPU-hours per job on average. That low average implies the cost is concentrated in a minority of heavy jobs running 20–50 DPUs against large datasets, which is exactly where the audit should start. Audit the 800 jobs for: (1) Duplicate outputs — jobs from different teams that produce the same aggregated dataset independently. Consolidate into one canonical pipeline with downstream consumers. (2) Unnecessary full refreshes — jobs that reprocess the entire historical dataset daily when only yesterday's data changed. Convert to incremental processing (Glue bookmarks or Apache Iceberg incremental read). Incremental Glue jobs use 10x fewer DPUs than full refresh jobs for typical incremental volumes. (3) Orphaned jobs — jobs with no downstream consumers, running on a schedule established 2+ years ago. Delete. Target: reduce Glue DPU-hours by 60% through consolidation and incremental processing. Glue cost: $280,000/year → $112,000/year.
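
A small Python/boto3 sketch of the first audit step: ranking jobs by recent DPU-hours so the heavy 20–50 DPU jobs surface first. Field names come from the Glue GetJobRuns response; the $0.44/DPU-hour rate is the estimate used above:

    # Sketch: rank Glue jobs by recent DPU-hours to find the heavy jobs worth auditing first.
    # Uses ExecutionTime (seconds) and MaxCapacity (DPUs) from recent job runs.
    import boto3
    from collections import defaultdict

    glue = boto3.client("glue")
    dpu_hours = defaultdict(float)

    for page in glue.get_paginator("get_jobs").paginate():
        for job in page["Jobs"]:
            runs = glue.get_job_runs(JobName=job["Name"], MaxResults=20)["JobRuns"]
            for run in runs:
                seconds = run.get("ExecutionTime", 0)
                dpus = run.get("MaxCapacity", job.get("MaxCapacity", 0))
                dpu_hours[job["Name"]] += seconds / 3600 * dpus

    for name, hours in sorted(dpu_hours.items(), key=lambda kv: kv[1], reverse=True)[:20]:
        print(f"{name:60s} {hours:8.1f} DPU-hours (recent runs) ~ ${hours * 0.44:,.0f}")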

Apache Iceberg Migration for Future-Proofing

For new data pipelines and incrementally migrated existing tables: adopt Apache Iceberg as the table format. Iceberg provides: time-travel queries (eliminate the need for snapshot tables that currently consume ~200TB of the lake), partition evolution (add/change partition columns without rewriting all data), row-level deletes (GDPR right-to-erasure without rewriting entire Parquet files — currently requires full partition rewrites), and compaction (automatically merge small files without manual Glue jobs). Iceberg compaction alone can eliminate 50–100 Glue maintenance jobs from the 800-job daily schedule. Cost impact: significant reduction in Glue maintenance job cost and elimination of snapshot storage overhead.

Total Cost Reduction Summary

S3 storage tiering (2.8PB cold → Glacier Instant Retrieval): -$669,000/year. Athena partitioning + Parquet conversion (60–70% query cost reduction): -$108,000–$126,000/year. Glue consolidation + incremental processing (60% DPU reduction): -$168,000/year. Total annual savings: ~$945,000–$963,000/year against a $1,660,000 baseline = 57% cost reduction. This exceeds the 40% target, providing $300,000 of headroom for implementation costs and unexpected optimization friction.

Early Warning Metrics:

  • Athena cost per query (by workgroup) — alert if any team's average query cost exceeds $10; indicates a new unpartitioned table being queried
  • S3 storage by class — monthly S3 Storage Lens report showing bytes in each storage class; alert if Standard storage grows >5% month-over-month for cold prefixes (indicates Lifecycle policy gap)
  • Glue DPU-hours per job — alert if any job's DPU-hours increases >50% week-over-week (indicates data volume growth without job capacity adjustment)
  • S3 Glacier retrieval requests — alert on any Glacier Standard or Deep Archive retrieval (not Instant Retrieval) — these are expensive ($0.01–$0.03/GB) and indicate a tiering policy error

4. Interview Score: 8.5 / 10

Why this demonstrates senior-level maturity: Calculating the tiering savings to the dollar ($669,408/year from Glacier Instant Retrieval), identifying that Glacier Standard must not be used for Athena-queryable data (a production mistake that turns queries into 12-hour operations), and recommending Apache Iceberg for GDPR row-level deletes (eliminating the full-partition-rewrite pattern) demonstrate deep data engineering + cloud cost expertise. The stakeholder dimension (cold data governance process) shows organizational maturity.

What differentiates it from mid-level thinking: A mid-level engineer would recommend "enable S3 Intelligent-Tiering for everything" without calculating the monitoring charge for 80 million objects, recommend Glacier Deep Archive for cold data without considering Athena compatibility, and not connect the Glue job audit to specific DPU-hour cost modeling.

What would make it a 10/10: A 10/10 response would include a specific S3 Lifecycle policy JSON for the tiering rules, a worked Athena partition projection configuration for a date-partitioned table, and a Glue bookmark configuration example showing how to convert a full-refresh Glue job to incremental processing — with the expected DPU savings calculated.


Question 14: Designing for Regulatory Compliance — HIPAA on AWS

Difficulty: Elite | Role: Cloud Security Architect / Compliance Engineer | Level: Senior / Staff | Company Examples: Change Healthcare, Oscar Health, Teladoc, Nuvation Bio


The Question

You have been hired as the first Cloud Engineer at a Series B health-tech startup that stores Protected Health Information (PHI) for 200,000 patients. The company has been running on a single AWS account with no HIPAA controls: PHI is stored in unencrypted S3 buckets, CloudTrail is disabled, there is no VPC (all services run in the default VPC), engineers have AdministratorAccess to the AWS account, and RDS instances are publicly accessible with password authentication. An enterprise hospital system wants to sign a $4M annual contract, but its procurement team requires HIPAA BAA (Business Associate Agreement) completion, SOC 2 Type II certification, and a completed HIPAA technical safeguard assessment within 90 days. You have 3 engineers and a $120,000 budget. Prioritize and execute the remediation.


1. What Is This Question Testing?

  • Security awareness — HIPAA Technical Safeguards (45 CFR §164.312) as specific, auditable requirements: access controls, audit controls, integrity controls, transmission security — not abstract security concepts
  • Risk assessment — prioritizing the most critical remediations (public RDS with PHI = immediate P0) vs. important but less urgent (SOC 2 evidence collection can start while technical remediations are in progress)
  • Cloud architecture maturity — knowing which AWS services are HIPAA Eligible (listed in the AWS HIPAA compliance guide) and which are not; not all AWS services can process PHI
  • Organizational thinking — a 90-day timeline with 3 engineers and a $120,000 budget requires ruthless prioritization; perfect compliance architecture cannot be built in 90 days, but the minimum viable HIPAA posture can
  • Financial literacy — calculating the cost of non-compliance ($50,000–$1.9M per HIPAA violation) against the $120,000 remediation budget and the $4M contract opportunity
  • Systems thinking — understanding that HIPAA compliance is not a one-time project; it requires continuous monitoring, audit evidence collection, and annual risk assessment

2. Framework: HIPAA Remediation Prioritization Matrix (HRPM)

  1. Assumption Documentation — PHI data locations (S3 buckets, RDS databases, SQS queues, CloudWatch logs), current AWS services in use, engineers' AWS expertise level, legal counsel engagement
  1. Constraint Analysis — 90-day deadline, 3 engineers (assume 60% allocation to compliance, 40% to product), $120,000 budget, AWS BAA must be signed before any HIPAA-eligible workload proceeds
  1. Tradeoff Evaluation — Build a new compliant account and migrate PHI vs. remediate the existing account in place; new account is cleaner but takes longer; in-place remediation is faster but risks configuration errors during migration
  1. Hidden Cost Identification — AWS HIPAA Eligible services cost premiums (AWS Config: $0.003/resource/month, AWS Security Hub: $0.001/check/month, GuardDuty: $1.00/1M CloudTrail events), external security consultant for penetration testing ($15,000–$40,000), SOC 2 auditor fees ($30,000–$60,000 for Type II)
  1. Risk Signals / Early Warning Metrics — AWS Config compliance score (target >95% before audit), GuardDuty finding severity distribution (any HIGH or CRITICAL findings must be remediated before audit), CloudTrail completeness (must cover all 90 days of the audit period)
  1. Pivot Triggers — If penetration testing reveals additional PHI exposure vectors beyond the 5 identified, halt enterprise contract negotiation until remediated; do not sign a BAA while known PHI exposure exists
  1. Long-Term Evolution Plan — Post-90-day: pursue SOC 2 Type II (requires 6 months of continuous evidence), annual HIPAA Risk Assessment, HITRUST certification for enterprise hospital system contracts requiring it

3. The Answer

Explicit Assumptions:

  • PHI locations identified: 3 S3 buckets (patient records, lab results, clinical notes), 1 RDS PostgreSQL (patient demographics, appointments), 1 SQS queue (HL7 message processing), CloudWatch Logs (application logs containing PHI in stack traces)
  • Current AWS spend: ~$8,000/month; budget headroom for compliance tooling: ~$3,000–$5,000/month additional
  • 3 engineers: 1 with security background, 2 generalists; no dedicated compliance officer (will engage external HIPAA consultant for $25,000)
  • AWS BAA: must be signed at the beginning of Day 1 — without a signed BAA, the company is already in violation of HIPAA by storing PHI on AWS at all (AWS requires a BAA for HIPAA-covered entities)
  • SOC 2 Type II auditor: engaged in Month 1, audit window starts immediately to maximize the evidence period

Day 1: Sign the AWS BAA and Stop the Bleeding

The single most urgent action is not technical — it is legal: sign the AWS Business Associate Agreement through the AWS Management Console (Account Settings → Agreements). Without a signed BAA, every day of PHI storage on AWS is a HIPAA violation. This takes 15 minutes. Next, the two most critical technical P0 remediations: (1) Disable public accessibility on all RDS instances: aws rds modify-db-instance --db-instance-identifier prod-db --no-publicly-accessible. Enable at-rest encryption — this requires a snapshot + restore with encryption enabled (30-minute downtime, schedule for the next maintenance window). Enable SSL/TLS for RDS connections: set the rds.force_ssl parameter group value to 1. (2) Enable S3 Block Public Access at the account level. This single API call closes all current and future public S3 bucket risks: aws s3control put-public-access-block --account-id 123456789012 --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true. Enable S3 default encryption (SSE-KMS with a dedicated KMS CMK for PHI data) on all S3 buckets containing PHI.
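
The same Day-1 remediations expressed as a Python/boto3 sketch; the instance identifier, account ID, bucket name, and KMS key alias are placeholders:

    # Sketch: Day-1 technical remediations as API calls. All identifiers are placeholders.
    import boto3

    # 1. Take production RDS off the public internet.
    rds = boto3.client("rds")
    rds.modify_db_instance(
        DBInstanceIdentifier="prod-db",
        PubliclyAccessible=False,
        ApplyImmediately=True,
    )

    # 2. Block public access account-wide: closes current and future public-bucket risk.
    s3control = boto3.client("s3control")
    s3control.put_public_access_block(
        AccountId="123456789012",
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    # 3. Default encryption with a dedicated PHI KMS key on each PHI bucket.
    s3 = boto3.client("s3")
    s3.put_bucket_encryption(
        Bucket="phi-patient-records",
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/phi-data",
                },
                "BucketKeyEnabled": True,
            }]
        },
    )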

Week 1–2: Enable Audit Controls (HIPAA §164.312(b))

HIPAA requires audit controls: "Implement hardware, software, and/or procedural mechanisms that record and examine activity in information systems that contain or use ePHI." Enable AWS CloudTrail with all management events and all data events for S3 (GetObject, PutObject, DeleteObject) on PHI buckets; for RDS, which is not covered by CloudTrail data events, enable database audit logging (pgAudit for PostgreSQL) exported to CloudWatch Logs. Store CloudTrail logs in a dedicated S3 bucket with Object Lock (WORM — Write Once, Read Many) for 6+ years (HIPAA requires 6 years of audit log retention). Enable AWS Config in all regions to record configuration changes. Enable Amazon GuardDuty for threat detection. Enable AWS Security Hub with the HIPAA standard (note: Security Hub HIPAA standard covers many but not all HIPAA Technical Safeguards). These four services — CloudTrail + Config + GuardDuty + Security Hub — form the audit evidence backbone for both HIPAA and SOC 2 Type II.
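
A minimal Python/boto3 sketch of the trail setup; the trail name, log bucket, and PHI bucket ARNs are placeholders, and the log bucket is assumed to have been created separately with Object Lock enabled:

    # Sketch: multi-region trail with S3 data events scoped to the PHI buckets,
    # delivered to a WORM (Object Lock) log bucket. All names are placeholders.
    import boto3

    cloudtrail = boto3.client("cloudtrail")

    cloudtrail.create_trail(
        Name="hipaa-audit-trail",
        S3BucketName="acme-cloudtrail-worm",   # bucket pre-created with Object Lock
        IsMultiRegionTrail=True,
        EnableLogFileValidation=True,          # tamper-evidence for auditors
    )

    # Management events plus S3 object-level events for the PHI buckets only.
    cloudtrail.put_event_selectors(
        TrailName="hipaa-audit-trail",
        EventSelectors=[{
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [{
                "Type": "AWS::S3::Object",
                "Values": [
                    "arn:aws:s3:::phi-patient-records/",
                    "arn:aws:s3:::phi-lab-results/",
                    "arn:aws:s3:::phi-clinical-notes/",
                ],
            }],
        }],
    )

    cloudtrail.start_logging(Name="hipaa-audit-trail")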

Week 2–4: Network Isolation and Access Controls

Migrate all services from the default VPC to a purpose-built HIPAA VPC: private subnets for RDS and application servers, no public subnet for PHI-touching resources, NAT Gateway for outbound internet access, VPC endpoints for S3 and DynamoDB (eliminating internet-routed traffic for PHI). Remove AdministratorAccess from all engineers. Replace with: a break-glass role (AdministratorAccess, requires MFA, CloudTrail-logged, Slack-notified on every assumption), a developer role (least-privilege for development resources, no production PHI access without explicit request), and a read-only role (CloudWatch, Config, Security Hub access for debugging). Enable AWS IAM Identity Center (SSO) with Okta federation and enforce MFA for all AWS console access. HIPAA §164.312(d) requires unique user identification — shared credentials are non-compliant.

Week 3–6: PHI Identification and Data Minimization

Find where PHI actually lives: deploy Amazon Macie on all S3 buckets to identify PHI (Macie detects SSNs, patient identifiers, medical record numbers). Any Macie HIGH or CRITICAL finding in a non-PHI bucket must be investigated and remediated. For CloudWatch Logs containing PHI in stack traces: implement a log sanitization Lambda triggered by a CloudWatch Logs subscription filter that strips patient identifiers before logs are stored downstream. HIPAA does not prohibit logging, but logs containing PHI must receive the same access controls as PHI data.
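
A minimal sketch of such a sanitization Lambda, assuming the standard CloudWatch Logs subscription payload; the redaction patterns are illustrative and the real deny-list must come from the PHI data-flow inventory:

    # Sketch: CloudWatch Logs subscription-filter Lambda that scrubs likely PHI patterns
    # before log events are forwarded to long-term storage. Patterns are illustrative.
    import base64
    import gzip
    import json
    import re

    PHI_PATTERNS = [
        (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
        (re.compile(r"\bMRN[-:]?\s*\d{6,10}\b", re.I), "[REDACTED-MRN]"),
    ]

    def scrub(message: str) -> str:
        for pattern, replacement in PHI_PATTERNS:
            message = pattern.sub(replacement, message)
        return message

    def handler(event, context):
        # CloudWatch Logs delivers a gzipped, base64-encoded JSON payload
        payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
        sanitized = [{**e, "message": scrub(e["message"])} for e in payload["logEvents"]]
        # Forward `sanitized` to the chosen destination (Firehose, S3, OpenSearch).
        return {"scrubbed": len(sanitized)}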

Month 2–3: PHI Encryption in Transit and Remaining Safeguards

Implement AWS Certificate Manager for all load balancers with TLS 1.2+ enforcement; disable TLS 1.0 and 1.1 by selecting an ALB security policy that only permits TLS 1.2+ (for example, ELBSecurityPolicy-TLS13-1-2-2021-06). For SQS HL7 message processing: enable SQS server-side encryption with a KMS CMK. Review all third-party integrations (lab systems, EHR vendors) for HIPAA BAA status — any business associate that touches PHI must have a signed BAA with your company. Conduct the annual HIPAA Security Risk Assessment (required by §164.308(a)(1)): document all PHI data flows, threat scenarios, likelihood and impact ratings, and mitigation status. The risk assessment document is the primary artifact that demonstrates HIPAA compliance to auditors.

Budget Allocation

AWS compliance tooling (GuardDuty, Config, Security Hub, Macie, CloudTrail): ~$2,500/month = $22,500 over 9 months. External HIPAA consultant (risk assessment, policy documentation, audit prep): $25,000. Penetration test: $20,000. SOC 2 Type II auditor: $45,000. Contingency: $7,500. Total: $120,000. Note: this budget is tight for SOC 2 Type II. If the auditor requires additional evidence or the penetration test reveals significant remediation work, the budget may be exceeded. Communicate this risk to leadership in Month 1.

Early Warning Metrics:

  • AWS Config compliance score — alert if drops below 90%; track weekly; must be >95% before audit
  • GuardDuty HIGH/CRITICAL findings — any finding must be remediated within 24 hours; page the security engineer immediately
  • Macie PHI findings in non-PHI buckets — any finding triggers P1 investigation within 4 hours
  • RDS SSL connection ratio — alert if any connection to production RDS is non-SSL (indicates application not enforcing TLS to database)

4. Interview Score: 9 / 10

Why this demonstrates senior-level maturity: Beginning with the AWS BAA signature (legal, not technical) on Day 1 reflects understanding that HIPAA is a legal framework, not just a security checklist. Prioritizing public RDS remediation and S3 Block Public Access as the first technical actions (not "set up GuardDuty," which many engineers list first) correctly sequences by blast radius. The budget allocation to external consultants and auditors — and the honest statement that $120,000 is tight for SOC 2 Type II — demonstrates program management maturity.

What differentiates it from mid-level thinking: A mid-level engineer would list HIPAA technical safeguards generically without mapping them to specific AWS services and configurations, not know that AdministratorAccess for all engineers violates HIPAA unique user identification requirements, and not mention the AWS BAA as the priority (which is a fundamental legal compliance gap before any technical work matters).

What would make it a 10/10: A 10/10 response would include the specific AWS Config managed rules for HIPAA (rds-instance-public-access-check, s3-bucket-public-read-prohibited, cloudtrail-enabled, etc.), a worked PHI data flow diagram showing all ingress and egress paths for the described workload, and a concrete 90-day Gantt chart showing which remediations are sequential (cannot enable S3 KMS encryption before KMS CMK is created) vs. parallel.


Question 15: API Gateway Design for 50 Million Requests Per Day

Difficulty: Senior | Role: Cloud Engineer / Backend Architect | Level: Senior | Company Examples: Twilio, Stripe, Plaid, SendGrid


The Question

You are designing the API gateway layer for a new fintech platform that will serve 50 million API requests per day at launch (growing to 500 million/day within 18 months). The API serves 3 types of clients: (1) mobile apps (latency-sensitive, <100ms p99), (2) partner integrations (batch-oriented, up to 10,000 req/sec burst), (3) internal microservices (high-volume, trusted network). You must implement rate limiting, authentication, request routing, response caching, DDoS protection, and developer portal self-service — all at this scale. Your team is 6 engineers. Compare AWS API Gateway (v1 REST and v2 HTTP), Kong, and AWS CloudFront + ALB + Lambda@Edge as architectural options, and make a concrete recommendation with cost modeling.


1. What Is This Question Testing?

  • Cloud architecture maturity — understanding the performance and cost differences between AWS API Gateway REST (v1) at $3.50/million requests, HTTP (v2) at $1.00/million requests, and self-managed Kong at fixed compute cost; recognizing that at 50M requests/day, the choice matters by $100,000+/year
  • Systems thinking — designing separate gateway tiers for different client types (mobile = latency-critical, partner = burst-tolerant, internal = throughput-critical) rather than a single gateway for all traffic
  • Reliability engineering — rate limiting at 10,000 req/sec burst from a single partner cannot be implemented correctly with a shared counter in a distributed system without understanding the token bucket vs. leaky bucket algorithms and their consistency tradeoffs
  • Financial literacy — calculating the cost cliff: AWS API Gateway HTTP API at $1.00/million × 1,500M requests/month = $1,500/month at 50M req/day; at 500M req/day = $15,000/month. Self-managed Kong on EKS at 500M req/day: ~$2,500/month compute but ~$8,000/month in platform engineering overhead
  • Security awareness — DDoS protection at API Gateway layer: AWS WAF rate rules, Shield Advanced ($3,000/month minimum), and the tradeoff between CloudFront terminating DDoS at the edge vs. absorbing DDoS at ALB (more expensive, more infrastructure exposed)
  • Tradeoff analysis — developer portal self-service: AWS API Gateway has native developer portal integration via API Gateway Developer Portal; Kong has Kong Developer Portal (enterprise license required at ~$100K/year)

2. Framework: API Gateway Selection and Scaling Framework (AGSSF)

  1. Assumption Documentation — Request size distribution (average payload), authentication method (JWT, API key, mTLS), caching requirements (response TTL per endpoint), developer portal feature requirements
  1. Constraint Analysis — 50M→500M req/day growth trajectory, 6-engineer team, <100ms p99 for mobile, 10,000 req/sec partner burst, no Kubernetes expertise (rules out self-managed Kong on EKS without hiring)
  1. Tradeoff Evaluation — AWS API Gateway HTTP (v2) vs. REST (v1) vs. Kong vs. CloudFront+ALB; build vs. buy for custom middleware, rate limiting engine, developer portal
  1. Hidden Cost Identification — AWS WAF ($5/web ACL/month + $1/rule/month + $0.60/million requests inspected), API Gateway data transfer ($0.09/GB outbound), CloudFront vs. API Gateway for global distribution (CloudFront cheaper for cacheable responses), developer portal hosting cost
  1. Risk Signals / Early Warning Metrics — API Gateway 5XX error rate, rate limiter false positive rate (legitimate partners being throttled), cache hit ratio (<50% indicates caching strategy needs tuning), latency p99 vs. p50 divergence (indicates hot-path bottleneck)
  1. Pivot Triggers — If AWS API Gateway throttling limits (10,000 req/sec default, requestable to 29,000/sec) become a constraint before the 500M req/day target, migrate to Kong or CloudFront+ALB; if Kong maintenance consumes >15% of team capacity, migrate to managed API Gateway
  1. Long-Term Evolution Plan — At 500M req/day, evaluate custom-built high-performance gateway (Envoy proxy on EKS) if API Gateway costs exceed $50,000/month and latency SLAs tighten to <50ms p99

3. The Answer

Explicit Assumptions:

  • 50M requests/day = 578 req/sec average; peak 3x = 1,734 req/sec at launch
  • 500M requests/day = 5,787 req/sec average; peak 3x = 17,361 req/sec in 18 months
  • Average request/response payload: 4KB inbound, 8KB outbound
  • Authentication: JWT (mobile apps), HMAC API keys (partner integrations), IAM SigV4 (internal microservices)
  • Cacheable responses: 40% of mobile API endpoints are cacheable with 60-second TTL; partner batch endpoints are not cacheable
  • Team: 6 engineers, AWS expertise, no Kubernetes expertise

Architecture Decision: Tiered Gateway (Not a Single Gateway for All)

The critical insight is that mobile (latency-sensitive), partner (burst-tolerant), and internal (throughput) clients have fundamentally different requirements that cannot be optimally served by a single gateway configuration. Design three separate gateway tiers: (1) Mobile API Gateway — AWS API Gateway HTTP (v2) + CloudFront CDN; (2) Partner API Gateway — AWS API Gateway HTTP (v2) with dedicated rate limiting; (3) Internal API Gateway — AWS ALB with IAM authentication (no API Gateway overhead for trusted internal traffic).

Option Analysis: AWS API Gateway HTTP (v2)

AWS API Gateway HTTP API is the correct choice for mobile and partner tiers at this scale. HTTP API (v2) vs. REST API (v1): HTTP API is 71% cheaper ($1.00/million vs. $3.50/million), has lower latency (typically 5–10ms vs. 15–25ms added latency), and supports JWT authorizers natively. REST API (v1) is required only if you need usage plans, API key management with quota tracking, or request/response transformation — features not needed at launch. At 50M requests/day × 30 days = 1,500M requests/month × $1.00/million = $1,500/month for API Gateway HTTP. At 500M requests/day = $15,000/month. API Gateway default account-level throttling: 10,000 req/sec steady-state with a 5,000-request burst bucket. At the 17,361 req/sec peak target (18 months), this is a constraint — request a limit increase to 30,000 req/sec (AWS support, 24–48 hours).

CloudFront for Mobile API: The Cache Layer

Deploy CloudFront in front of API Gateway for the mobile tier. For the 40% of endpoints that are cacheable with 60-second TTL: CloudFront hit ratio will be high for popular data (product catalog, exchange rates, user profile — data that thousands of users request simultaneously). A 60% CloudFront cache hit rate reduces API Gateway requests by 60% × 40% cacheable = 24% total request reduction. At 500M req/day: 500M × 24% = 120M requests/day served from CloudFront cache at $0.0085 per 10,000 HTTPS requests ≈ $102/day ≈ $3,060/month, while removing ~3,600M API Gateway requests/month × $1.00/million ≈ $3,600/month of gateway fees, roughly a wash on request charges. CloudFront caching pays off by eliminating compute-intensive backend work (Lambda invocations, database reads) and by cutting mobile latency, not by reducing API Gateway request costs.

Rate Limiting: Token Bucket at the Edge

AWS API Gateway rate limiting uses the token bucket algorithm. Per-API-key Usage Plans (burst limit: 10,000, rate: 1,000) are a REST API (v1) feature; HTTP APIs (v2) support only route-level and stage-level throttling, which is one more reason to enforce partner-specific limits in the authorizer layer. However, API Gateway rate limiting is eventually consistent in multi-region deployments — a partner can briefly exceed limits if requests hit different API Gateway regional endpoints simultaneously. For strict rate limiting at the partner tier: implement rate limiting in a Lambda authorizer using ElastiCache Redis atomic counters (INCR + EXPIRE). This adds ~8–15ms latency per request (Redis round-trip) but provides exact rate limiting across all API Gateway instances. Only implement Redis-based rate limiting for the partner tier where exactness matters; the overhead is not justified for mobile users where approximate rate limiting is sufficient.
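
A minimal sketch of the Redis-backed check inside a Lambda authorizer, assuming an HTTP API authorizer with payload format 2.0 and simple responses enabled; the Redis endpoint and limits are placeholders:

    # Sketch: exact per-partner rate limiting via a Redis fixed-window counter (INCR + EXPIRE)
    # inside a Lambda authorizer. Endpoint, header name, and limits are placeholders.
    import os
    import time
    import redis

    r = redis.Redis(
        host=os.environ.get("REDIS_HOST", "ratelimit.abc123.cache.amazonaws.com"),
        port=6379,
        socket_timeout=0.05,   # fail fast; fall back to allow if Redis is unreachable
    )

    WINDOW_SECONDS = 1
    LIMIT_PER_SECOND = 1_000   # partner's contracted steady-state rate

    def is_allowed(api_key: str) -> bool:
        window = int(time.time()) // WINDOW_SECONDS
        key = f"rl:{api_key}:{window}"
        pipe = r.pipeline()
        pipe.incr(key)                          # atomic counter for this key + window
        pipe.expire(key, WINDOW_SECONDS * 2)    # let old windows age out
        count, _ = pipe.execute()
        return count <= LIMIT_PER_SECOND

    def handler(event, context):
        api_key = event["headers"].get("x-api-key", "anonymous")
        # Simple authorizer response for API Gateway HTTP APIs (payload format 2.0)
        return {"isAuthorized": is_allowed(api_key)}

A fixed window is the simplest exact scheme; a sliding-window or token-bucket variant in Redis smooths bursts at the window boundary if partners need it.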

DDoS Protection

Layer 1: CloudFront absorbs volumetric DDoS at the edge — CloudFront's distributed edge network can absorb Tbps-scale attacks without exposing the origin (API Gateway or ALB) to direct traffic. This is the most cost-effective DDoS protection layer. Layer 2: AWS WAF rate rules behind CloudFront: a WAF rule blocking any IP exceeding 2,000 requests in 5 minutes costs $1/rule/month and blocks L7 application-layer floods. Layer 3: AWS Shield Standard is automatically included with CloudFront at no additional cost. AWS Shield Advanced ($3,000/month minimum) provides DDoS cost protection (AWS will credit costs from DDoS-related scaling events) and 24x7 DRT (DDoS Response Team) access. Shield Advanced is justified at the enterprise scale (500M req/day, SLA-backed enterprise contracts) but not at launch. Defer Shield Advanced until Month 6 if the product has reached $1M ARR.

Developer Portal

AWS API Gateway Developer Portal is a managed static site (hosted on S3 + CloudFront) that auto-generates API documentation from OpenAPI specs and allows self-service API key issuance. It requires a small Lambda backend for key management. Setup time: 2–3 days using the AWS SAR deployment. Cost: ~$50/month (Lambda + S3 + CloudFront for low-traffic developer portal). This is the fastest path to a functional developer portal for a 6-engineer team. Kong Developer Portal (enterprise) is significantly more capable (custom branding, monetization, versioning) but costs $100,000+/year in enterprise licensing and requires Kubernetes expertise to operate. Defer Kong Enterprise until the platform reaches 50+ enterprise API partners, where the portal features justify the cost.

Cost Model Summary

At launch (50M req/day): API Gateway HTTP: $1,500/month; CloudFront: $200/month; WAF: $150/month; JWT Authorizer Lambda: $50/month; Redis rate limiting: $200/month. Total: ~$2,100/month. At 500M req/day (18 months): API Gateway HTTP: $15,000/month; CloudFront: $2,000/month; WAF: $800/month; Redis cluster upgrade: $600/month. Total: ~$18,400/month. Self-managed Kong on EKS at 500M req/day: compute ~$2,500/month, but platform engineer overhead (~$15,000–$25,000/month equivalent in team time) + Kong Enterprise license ($100,000/year = $8,333/month). Total Kong: ~$26,000–$36,000/month. AWS API Gateway HTTP is the correct choice at this scale for a 6-engineer team.

Early Warning Metrics:

  • API Gateway 5XX rate — alert >0.1% over 5 minutes; page >0.5% (indicates Lambda timeout, downstream DB failure, or API Gateway throttling)
  • Rate limiter false positive rate — track 429 responses for verified-legitimate partners; alert if >0.5% of a partner's requests are throttled when their usage is within contract limits
  • CloudFront cache hit ratio — alert if drops below 40% for cacheable endpoints (indicates cache key configuration issue or TTL too short)
  • API Gateway latency p99 — alert >150ms; page >300ms for mobile tier (approaching 100ms SLA including network roundtrip)

4. Interview Score: 8.5 / 10

Why this demonstrates senior-level maturity: Designing three separate gateway tiers (mobile, partner, internal) rather than one gateway for all traffic, calculating the API Gateway HTTP vs. REST pricing difference and when the limit increase must be requested, and quantifying the "self-managed Kong cost" as $26,000–$36,000/month including engineering overhead (not just compute) demonstrates the financial and organizational sophistication of a senior engineer.

What differentiates it from mid-level thinking: A mid-level engineer would recommend "use Kong because it's more powerful" without calculating the engineering cost to operate it, choose REST API (v1) over HTTP API (v2) without knowing the 71% price difference, and implement a single rate limiter for all client types without distinguishing between mobile (approximate is fine) and partner (exact is required).