DevOps Engineer Interview Question Bank
by InterviewBee — Production-Grade, Scenario-Driven
Question 1: Fixing a Broken CI Pipeline [ci-cd, troubleshooting]
Difficulty: Medium
Role: DevOps Engineer
Level: Junior
Company Examples: GitHub, Shopify, Atlassian, GitLab
Question:
Your team's GitHub Actions pipeline has been failing intermittently for the past two days — roughly 30% of builds fail with a connection timeout error during the Docker image push step. The on-call engineer has already restarted the runners twice with no improvement. Deployments are blocked. How do you diagnose and fix this?
What is This Question Testing?
- Systematic CI/CD troubleshooting methodology
- Understanding of Docker registry authentication and network dependencies
- Ability to distinguish flaky infrastructure from code bugs
- Communication and prioritisation under pressure
Framework to Answer This Question
Use the Layered Elimination Framework: isolate the failure at each layer (runner → network → registry → auth) before making changes.
- Reproduce the failure locally and in CI with verbose logging
- Check runner resource utilization and network egress
- Audit registry auth token expiry and rate limits
- Implement retry logic as short-term fix
- Harden the pipeline with caching and exponential backoff long-term
Key Principles:
- Never change two variables at once during diagnosis
- Timeouts almost always point to network or auth, not code
- Flaky = 30% failure rate → systematic, not random
- Add observability before adding fixes
The Answer:
Assumptions: GitHub Actions self-hosted runners on EC2, pushing to ECR or Docker Hub, ~50 builds/day.
Step 1 — Enable verbose logging:
yaml
- name: Push Docker image
  env:
    DOCKER_BUILDKIT: 1
  run: |
    set -x
    docker push $IMAGE_TAG 2>&1 | tee push.log
Step 2 — Check runner metrics:
bash
# On the runner host
vmstat 1 5
netstat -an | grep ESTABLISHED | wc -l
curl -v https://registry-1.docker.io/v2/ 2>&1 | head -30
Step 3 — Check Docker Hub rate limits (if applicable):
bash
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest -I 2>&1 | grep -i ratelimit
Step 4 — Short-term fix — add retry with backoff:
yaml
- name: Push with retry
  uses: nick-fields/retry@v2
  with:
    timeout_minutes: 10
    max_attempts: 3
    retry_wait_seconds: 15
    command: docker push $IMAGE_TAG
Step 5 — Long-term fix: Switch to ECR with IAM role authentication (no token expiry), enable runner autoscaling, and add a build cache layer.
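A minimal sketch of that long-term direction, assuming GitHub OIDC federation into an AWS role; the role ARN, region, and image name below are placeholders, not values from this scenario:
yaml
# Hypothetical job steps: short-lived IAM credentials via OIDC, then push to ECR (no static registry tokens)
permissions:
  id-token: write
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/ci-ecr-push   # placeholder role ARN
      aws-region: us-east-1
  - id: ecr
    uses: aws-actions/amazon-ecr-login@v2
  - name: Build and push
    run: |
      docker build -t ${{ steps.ecr.outputs.registry }}/my-app:${{ github.sha }} .
      docker push ${{ steps.ecr.outputs.registry }}/my-app:${{ github.sha }}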
Metrics to Watch:
- Pipeline success rate < 95% → alert (SLO breach)
- Docker push duration > 3 min → investigate
- Runner CPU > 80% sustained → scale out
Rollback/Mitigation: Re-enable manual deploy workflow as a bypass; pin runner AMI to last known-good version. RTO: 15 min.
Interview Score: 7/10
Why this score:
- Full marks for systematic diagnosis and concrete commands
- Deduct if candidate jumps to "restart everything" without diagnosis
- Deduct if they miss rate limiting as a root cause
- Bonus for mentioning IAM-based auth over static credentials
- Watch for: candidates who treat flakiness as acceptable without root-causing it
Question 2: Terraform State Corruption After Team Conflict [iac, terraform]
Difficulty: Medium
Role: DevOps Engineer
Level: Mid
Company Examples: HashiCorp, Stripe, Cloudflare, Datadog
Question:
Two engineers on your team ran terraform apply simultaneously against the same workspace. The S3 backend state file is now corrupt, and terraform plan returns: Error: Failed to load state: state file version is incompatible. Production infrastructure is unchanged, but you cannot deploy or modify any resources. How do you safely recover?
What is This Question Testing?
- Terraform state management and locking mechanisms
- Risk-awareness when editing state directly
- Team process improvements to prevent recurrence
- Blast radius minimization
Framework to Answer This Question
Use the Backup → Inspect → Repair → Harden cycle for state recovery.
- Immediately back up the corrupt state file
- Inspect state file structure and identify corruption extent
- Use terraform state commands or manual JSON repair
- Re-enable locking and verify plan matches reality
- Implement DynamoDB locking and CI-only apply going forward
Key Principles:
- Never edit state without a versioned backup
- State corruption ≠ infra corruption — verify separately
- Locking is table stakes, not optional
- CI/CD should be the only principal with write access to state
The Answer:
Assumptions: S3 backend with versioning enabled, DynamoDB locking NOT configured (that's the bug), AWS infra untouched.
Step 1 — Backup and inspect:
bash
aws s3 cp s3://my-tfstate-bucket/prod/terraform.tfstate ./terraform.tfstate.corrupt
aws s3api list-object-versions --bucket my-tfstate-bucket --prefix prod/terraform.tfstate
# Restore last known-good version
aws s3api get-object \
--bucket my-tfstate-bucket \
--key prod/terraform.tfstate \
--version-id <LAST_GOOD_VERSION_ID> \
./terraform.tfstate.restored
Step 2 — Validate restored state:
bash
terraform state list # should enumerate all resources
terraform plan # should show no changes if infra matches
Step 3 — If no clean backup, rebuild state by importing:
bash
terraform import aws_instance.web i-0abc1234def56789
terraform import aws_security_group.main sg-0123456789abcdef0
Step 4 — Add DynamoDB locking (never again):
hcl
terraform {
backend "s3" {
bucket = "my-tfstate-bucket"
key = "prod/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
Step 5 — Process hardening: Restrict terraform apply to CI only via IAM policy. Engineers get read-only state access.
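One way to express that guardrail, sketched in Terraform under the assumptions above (bucket name from this scenario; the CI role ARN is a placeholder):
hcl
# Sketch: only the CI role may write the prod state object; everyone else is read-only
resource "aws_s3_bucket_policy" "state_ci_only_writes" {
  bucket = "my-tfstate-bucket"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyStateWritesExceptCI"
      Effect    = "Deny"
      Principal = "*"
      Action    = ["s3:PutObject", "s3:DeleteObject"]
      Resource  = "arn:aws:s3:::my-tfstate-bucket/prod/*"
      Condition = {
        StringNotEquals = {
          "aws:PrincipalArn" = "arn:aws:iam::123456789012:role/terraform-ci" # hypothetical CI role
        }
      }
    }]
  })
}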
Metrics to Watch:
- State lock acquisition time > 30s → alert (possible deadlock)
- Concurrent applies in CI > 1 → block at pipeline level
- S3 versioning delete events → alert on any state deletion
Rollback: Restore previous S3 version. RTO: 10 min. RPO: last successful apply (S3 versioning).
Interview Score: 8/10
Why this score:
- Full credit for S3 versioning recovery path — many candidates miss this
- Deduct if they attempt to hand-edit JSON without backup
- Bonus for mentioning CI-only apply as the structural fix
- Watch for: candidates who only fix the symptom, not the locking gap
Question 3: Kubernetes Pod Eviction Mystery [kubernetes, observability, sre]
Difficulty: Medium
Role: SRE
Level: Mid-to-Senior
Company Examples: Spotify, LinkedIn, Lyft, Robinhood
Question:
Your SLO dashboard shows a spike: 2% of HTTP requests are returning 503 over the last 20 minutes. Kubernetes events show OOMKilled and Evicted pods across three microservices. The cluster has 12 nodes, each with 64GB RAM. No deployment was triggered in the last 6 hours. What do you do?
What is This Question Testing?
- Kubernetes resource model (requests vs. limits vs. actual usage)
- Incident triage under active production impact
- Observability fluency (kubectl, Prometheus, events)
- Distinguishing node pressure from pod-level misconfiguration
Framework to Answer This Question
Use Triage → Contain → Root Cause → Remediate: stop the bleeding first, then diagnose.
- Identify which nodes are under pressure
- Cordon affected nodes if necessary
- Check pod resource requests/limits vs. actual usage
- Identify the memory leak or sudden load spike source
- Set correct resource requests and add VPA/HPA
Key Principles:
- OOMKill = limit hit; Eviction = node pressure — distinguish them
- Pods without requests get scheduled as Burstable — dangerous
- kubectl top shows current usage; Prometheus shows history
- Never cordon without understanding blast radius
The Answer:
Assumptions: EKS cluster, Prometheus + Grafana installed, services have no resource requests set.
Step 1 — Immediate triage:
bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -E "OOM|Evict" | tail -30
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory | head -20
kubectl describe node <pressured-node> | grep -A 10 "Conditions:"
Step 2 — Contain: cordon node if >80% memory used:
bash
kubectl cordon <node-name>
# Drain only if node is unrecoverable
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=30
Step 3 — Identify the offending pod:
bash
kubectl describe pod <evicted-pod> -n <ns> | grep -E "OOM|Limits|Requests|Last State"
Step 4 — Check Prometheus for memory growth trend:
promql
container_memory_working_set_bytes{namespace="production"} / 1024 / 1024
Look for exponential growth in the last 2 hours.
Step 5 — Set resource requests immediately:
yaml
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
Step 6 — Deploy VPA in recommendation mode to collect data for right-sizing.
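A minimal VPA manifest in recommendation-only mode; the workload name and namespace are illustrative:
yaml
# Sketch: VPA that only emits recommendations and never evicts pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"   # collect right-sizing data without acting on it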
Metrics to Watch (with thresholds):
- Node memory utilization > 85% → page on-call
- Pod restarts > 5 in 10 min → alert
- OOMKill events > 0 in production → P1 investigation trigger
Rollback: Uncordon nodes after resource requests applied; scale node group +2 nodes as buffer. RTO: 15 min.
Interview Score: 8/10
Why this score:
- Strong answer: cordon before drain, triage before fix
- Deduct for recommending kubectl delete pod as the first step
- Bonus for PromQL query and VPA mention
- Watch for: conflating OOMKill (limit) with Eviction (node pressure)
Question 4: Designing a Zero-Downtime Deployment Pipeline [ci-cd, kubernetes, reliability]
Difficulty: Medium
Role: DevOps Engineer
Level: Mid
Company Examples: Netflix, Shopify, DoorDash, Figma
Question:
Your e-commerce platform processes $500K/day in transactions. The engineering team deploys 8–12 times per week, and the current kubectl set image approach causes ~90 seconds of elevated errors during each deploy. Leadership wants zero-downtime deploys within the next sprint. What do you design and implement?
What is This Question Testing?
- Kubernetes deployment strategies (rolling, blue/green, canary)
- Readiness/liveness probe design
- Traffic management and load balancer behavior
- Tradeoff analysis (complexity vs. reliability)
Framework to Answer This Question
Use the Deploy Strategy Ladder: match deployment strategy to risk tolerance and operational maturity.
- Fix readiness probes (immediate wins — often the real issue)
- Configure proper rolling update parameters
- Add PodDisruptionBudgets
- Implement canary via Argo Rollouts or Flagger
- Validate with synthetic load test
Key Principles:
- 90s of errors often means missing or slow readiness probes, not strategy
- PDB prevents disruption from simultaneous voluntary evictions
- Canary catches regressions before full rollout
- Health checks must match actual app startup time
The Answer:
Assumptions: EKS, NGINX Ingress, 3-replica deployments, Node.js services with 15s startup time.
Step 1 — Fix readiness probes (often resolves 80% of issues):
yaml
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 20
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
Step 2 — Rolling update parameters:
yaml
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
Step 3 — PodDisruptionBudget:
yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-service
spec:
minAvailable: 2
selector:
matchLabels:
app: checkout-service
Step 4 — Canary with Argo Rollouts:
yaml
strategy:
canary:
steps:
- setWeight: 10
- pause: {duration: 5m}
- setWeight: 50
- pause: {duration: 5m}
- setWeight: 100
Step 5 — Validate: Run k6 load test during canary phase, watch error rate in Grafana.
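To make that check automatic rather than eyeballed in Grafana, a sketch of an Argo Rollouts AnalysisTemplate that gates the canary on 5xx rate; the Prometheus address and metric labels are assumptions:
yaml
# Sketch: fail the canary analysis if the 5xx ratio exceeds 0.1%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: http-error-rate
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 0.001
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed in-cluster Prometheus
          query: |
            sum(rate(http_requests_total{app="checkout-service",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="checkout-service"}[5m]))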
Metrics to Watch:
- HTTP 5xx rate > 0.1% during deploy → automatic rollback trigger
- Readiness probe failures > 3 consecutive → pod not added to rotation
- Deploy duration > 10 min → alert (stuck rollout)
Rollback: kubectl rollout undo deployment/checkout-service — RTO: 2 min.
Interview Score: 7/10
Why this score:
- Solid: probe fix + PDB + canary is the complete answer
- Deduct for going straight to blue/green without checking probes first
- Bonus for Argo Rollouts with automated analysis
- Watch for: candidates who ignore PDB (allows simultaneous voluntary disruptions)
Question 5: On-Call Incident — Database CPU Spike in Production [incident-response, observability, database]
Difficulty: High
Role: SRE
Level: Mid-to-Senior
Company Examples: Stripe, PagerDuty, MongoDB, Atlassian
Question:
It's 2:47 AM. You receive a PagerDuty alert: RDS PostgreSQL CPU > 95% for 8 minutes. The payments service is returning timeouts. 40,000 active users are affected. Your RDS instance is a db.r6g.4xlarge (16 vCPU, 128GB RAM). A new feature was deployed 3 hours ago. Walk through your incident response.
What is This Question Testing?
- Incident command and structured communication
- Database performance diagnosis (slow queries, locks, connections)
- Ability to distinguish symptom from root cause under pressure
- Rollback decision-making with business context
Framework to Answer This Question
Use the STAR-I Incident Protocol: Stabilize → Triage → Analyze → Remediate → Improve.
- Declare incident, assign roles, open war room
- Check the slow query log and pg_stat_activity
- Kill long-running queries if causing lock contention
- Evaluate rollback of recent deploy vs. query-level mitigation
- Write RCA within 48 hours
Key Principles:
- Time to mitigate > time to root cause in P1
- Never kill queries without understanding the cascade
- Feature flag rollback is faster than code rollback
- Always capture diagnostic state before taking action
The Answer:
Assumptions: Aurora PostgreSQL, CloudWatch enhanced monitoring enabled, feature was a new analytics endpoint.
Step 1 — Declare incident and capture state:
bash
# Slack: "P1 DECLARED: RDS CPU 95%+, payments degraded. IC: @you. Bridge: zoom.us/j/xxx"
# Capture immediately — don't change yet
aws rds describe-db-instances --db-instance-identifier prod-payments-db \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,MultiAZ:MultiAZ}'
Step 2 — Diagnose active queries:
sql
-- Connect via psql or the RDS Query Editor
SELECT pid,
       now() - pg_stat_activity.query_start AS duration,
       query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 20;
Step 3 — Check for lock contention:
sql
SELECT blocked_locks.pid, blocked_activity.query,
blocking_locks.pid AS blocking_pid,
blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Step 4 — Kill offending queries (if confirmed safe):
sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE now() - query_start > interval '5 minutes'
AND query ILIKE '%analytics%'
AND state = 'active';
Step 5 — Rollback via feature flag (fastest): Disable new analytics feature flag in LaunchDarkly → CPU should drop within 60 seconds.
Step 6 — If no feature flag: git revert and emergency deploy (estimated 12 min).
Metrics to Watch:
- RDS CPU > 80% sustained 5 min → P2 alert
- Active connections > 80% of max_connections → alert
- Query duration p99 > 2s → SLO warning
Rollback plan: Feature flag disable → RTO: 2 min. Code rollback → RTO: 15 min. RPO: 0 (no data loss for query-only issue).
Interview Score: 9/10
Why this score:
- Excellent: structured IC declaration, diagnostic-first, feature flag rollback awareness
- Full credit for lock contention query — many candidates miss this
- Deduct if candidate reboots the DB as first action
- Deduct if no RCA/postmortem mentioned
- Bonus for pg_terminate_backend with a safety check (filtering by query pattern)
Question 6: Designing SLOs for a Payment API [sre, observability, reliability]
Difficulty: High
Role: SRE
Level: Senior
Company Examples: Google, Stripe, Square, Braintree
Question:
You're joining a fintech company as their first SRE. The payments API has no SLOs. Engineering ships code 3–5x/week, and the last three months show an average of 99.2% uptime measured by a ping check — but customers are complaining. Your job is to design meaningful SLOs from scratch within 30 days. How do you do it?
What is This Question Testing?
- SLI/SLO/Error budget framework knowledge
- Understanding of customer-centric reliability metrics
- Ability to distinguish vanity metrics (uptime ping) from meaningful SLIs
- Stakeholder alignment and change management
Framework to Answer This Question
Use the SRE SLO Design Loop: Identify critical journeys → Define SLIs → Set SLO targets → Instrument → Burn rate alerts.
- Map critical user journeys (CUJ)
- Define SLIs that measure user experience, not infra health
- Set initial SLO targets conservatively (below current performance)
- Instrument with Prometheus/OpenTelemetry
- Configure burn rate alerts with multi-window approach
Key Principles:
- Ping uptime is a vanity metric — use success rate of real transactions
- Start with 28-day rolling windows
- Error budget = 100% - SLO target
- Burn rate alerts fire before budget exhaustion, not after
The Answer:
Assumptions: Payment API built on FastAPI, Prometheus + Grafana, GCP environment, ~500K transactions/day.
Step 1 — Identify Critical User Journeys:
- Payment authorization (synchronous, latency-sensitive)
- Refund processing (async, correctness-sensitive)
- Merchant dashboard load (read-heavy, P2 priority)
Step 2 — Define SLIs:
yaml
# SLI 1: Payment Authorization Availability
# Numerator: HTTP 200/201 responses to POST /v1/payments
# Denominator: All non-4xx responses to POST /v1/payments
# SLI 2: Payment Authorization Latency
# % of requests completing in < 500ms (p99 < 2s)
# SLI 3: Refund Correctness
# % of refunds that complete without manual intervention
Step 3 — Set SLO targets (week 1, conservative):
Payment Availability SLO: 99.5% (28-day rolling)
Payment Latency SLO: 95% of requests < 500ms
Refund Correctness: 99.9%
Error budget (availability): 0.5% = ~3.6 hours/month
Step 4 — Prometheus instrumentation:
yaml
# prometheus-rules.yaml
groups:
- name: payment-slo
rules:
- record: job:payment_request_duration:rate5m
expr: rate(http_request_duration_seconds_bucket{job="payment-api",le="0.5"}[5m])
- alert: PaymentSLOBurnRateHigh
expr: |
(
rate(http_requests_total{job="payment-api",status!~"5.."}[1h]) /
rate(http_requests_total{job="payment-api"}[1h])
) < 0.99
for: 5m
labels:
severity: page
Step 5 — Multi-window burn rate alerts:
- 1h window at 14x burn rate → page (at that rate the 28-day budget is gone in about 2 days)
- 6h window at 6x burn rate → ticket (budget gone in 5 days)
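A sketch of those two windows as Prometheus rules for the 99.5% availability SLO (0.5% budget); metric and label names are assumed to match the recording rules above:
yaml
# Sketch: fast and slow burn-rate alerts, each requiring both a long and a short window to fire
- alert: PaymentAvailabilityFastBurn
  expr: |
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[1h]))
          / sum(rate(http_requests_total{job="payment-api"}[1h])))) > (14 * 0.005)
    and
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="payment-api"}[5m])))) > (14 * 0.005)
  for: 2m
  labels: {severity: page}
- alert: PaymentAvailabilitySlowBurn
  expr: |
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[6h]))
          / sum(rate(http_requests_total{job="payment-api"}[6h])))) > (6 * 0.005)
    and
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[30m]))
          / sum(rate(http_requests_total{job="payment-api"}[30m])))) > (6 * 0.005)
  for: 15m
  labels: {severity: ticket}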
Metrics to Watch:
- Error budget consumption > 50% in first 14 days → freeze non-critical deploys
- Burn rate > 14x for 5 min → immediate page
- Latency p99 > 2s for 3 consecutive minutes → P2 alert
Rollback/Mitigation: If SLO breach during deploy, auto-rollback via Argo Rollouts analysis. RTO: 3 min.
Interview Score: 9/10
Why this score:
- Excellent: CUJ-first approach, rejects ping uptime, concrete PromQL
- Bonus for multi-window burn rate (Google SRE Workbook pattern)
- Deduct if candidate sets 99.99% SLO on day 1 without historical data
- Watch for: candidates who conflate SLA (contract) with SLO (internal target)
Question 7: Terraform Module Refactoring Without Downtime [iac, terraform, reliability]
Difficulty: High
Role: DevOps Engineer
Level: Senior
Company Examples: HashiCorp, Gruntwork, Cloudflare, Twilio
Question:
Your team has 40 microservices, each with duplicated Terraform code defining their own VPC, subnets, and security groups. Drift is accumulating. You need to refactor into shared Terraform modules without destroying and recreating any existing infrastructure. The constraint: zero production disruption and no downtime. How do you approach this safely?
What is This Question Testing?
- Terraform state mv and module adoption patterns
- Risk management for infrastructure refactoring
- Understanding of Terraform plan/apply lifecycle
- Team coordination for large IaC migrations
Framework to Answer This Question
Use the Encapsulate → Move → Verify refactoring pattern.
- Build the new module matching existing resource configurations exactly
- Use terraform state mv to re-home resources without recreation
- Verify with terraform plan showing zero changes
- Migrate services in batches of 3–5
- Clean up orphaned code after validation
Key Principles:
- terraform plan must show 0 to add, 0 to change, 0 to destroy before merging
- State moves are non-destructive if resource addresses match
- Never migrate all 40 services in one PR
- Tag resources before migration for audit trail
The Answer:
Assumptions: AWS, S3 backend, 40 services in monorepo with environments/prod/service-X/ structure.
Step 1 — Create the shared module:
hcl
# modules/networking/main.tf
variable "service_name" {}
variable "vpc_cidr" { default = "10.0.0.0/16" }
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
tags = { Name = var.service_name, ManagedBy = "terraform-module-v2" }
}
Step 2 — Dry-run state migration for one service:
bash
cd environments/prod/service-checkout
# List current resource addresses
terraform state list | grep aws_vpc
# Back up the remote state, then preview the move with -dry-run before changing anything
terraform state pull > terraform.tfstate.backup-$(date +%Y%m%d)
terraform state mv -dry-run 'aws_vpc.main' 'module.networking.aws_vpc.main'
# Move resource to module address
terraform state mv \
'aws_vpc.main' \
'module.networking.aws_vpc.main'
terraform state mv \
'aws_subnet.public[0]' \
'module.networking.aws_subnet.public[0]'
Step 3 — Validate no changes:
bash
terraform plan -out=migration.tfplan
# Must show: Plan: 0 to add, 0 to change, 0 to destroy
terraform show migration.tfplan | grep -E "add|change|destroy"
Step 4 — Batch migration script:
bash
SERVICES=("checkout" "inventory" "notifications")
for svc in "${SERVICES[@]}"; do
echo "Migrating $svc..."
cd environments/prod/$svc
terraform state mv 'aws_vpc.main' 'module.networking.aws_vpc.main'
terraform plan -detailed-exitcode && echo "OK: $svc" || { echo "FAIL: $svc — stopping"; exit 1; }
cd ../../..
done
Timeline: 2 engineers, 3 weeks. 5 services/day, with 24h bake time between batches.
Metrics to Watch:
- terraform plan exit code ≠ 0 after state mv → stop migration immediately
- AWS CloudTrail events for resource modification during migration window → alert
- State file size growth > 20% unexpectedly → investigate
Rollback: Restore .tfstate.backup file to S3. No infra change = no impact. RTO: 5 min.
Interview Score: 8/10
Why this score:
- Correct: state mv + plan verify is the exact right pattern
- Deduct if candidate suggests terraform destroy + recreate
- Deduct for attempting all 40 services in one batch
- Bonus for backup-before-move discipline and batch script with fail-fast
Question 8: Multi-Region Active-Active Failover Design [networking, reliability, dr]
Difficulty: High
Role: Staff DevOps
Level: Staff
Company Examples: Netflix, Cloudflare, Amazon, Uber
Question:
Your company runs a SaaS product used globally. The board mandates 99.99% availability (52 min downtime/year). Currently you run active-passive across us-east-1 and us-west-2 with a 15-minute RTO. You've been tasked with designing active-active multi-region with RTO < 30 seconds. Budget: $80K/month additional cloud spend. What do you design?
What is This Question Testing?
- Active-active architecture patterns and tradeoffs
- Data replication and consistency tradeoffs (CAP theorem)
- Global load balancing (Route53, Cloudflare, GFE)
- Realistic cost and complexity awareness
Framework to Answer This Question
Use the RACE Design Pattern: Replication → Availability layer → Consistency tradeoffs → Edge routing.
- Assess which tiers can be active-active vs. active-passive
- Design global routing and health checks
- Choose data replication strategy per data type
- Implement circuit breakers and regional fallback
- Validate with chaos engineering
Key Principles:
- Not every tier needs active-active — be pragmatic
- Writes to two regions simultaneously require conflict resolution strategy
- DNS TTL < 30s is required to meet 30s RTO
- Test failover monthly or it's not real
The Answer:
Assumptions: AWS, PostgreSQL on Aurora Global, stateless app tier, Redis for sessions, Route53.
Step 1 — Tier analysis:
- App tier: active-active (stateless, trivial)
- Cache (Redis): active-active via Elasticache Global Datastore
- Database: Aurora Global with us-east-1 primary, us-west-2 read replica (< 1s replication lag)
- Database writes: region-affinity routing (primary handles writes, replicas handle reads)
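For the database tier above, a minimal Terraform sketch of the Aurora Global topology; the provider aliases, identifiers, and engine version are assumptions:
hcl
# Sketch: one global cluster, a writable primary in us-east-1, a readable secondary in us-west-2
resource "aws_rds_global_cluster" "this" {
  global_cluster_identifier = "my-global-cluster"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4" # placeholder
  database_name             = "app"
}

resource "aws_rds_cluster" "primary" {
  provider                  = aws.us_east_1   # assumed provider alias
  cluster_identifier        = "app-us-east-1"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier = aws_rds_global_cluster.this.id
  master_username           = "admin"
  master_password           = var.db_password # assumed variable
}

resource "aws_rds_cluster" "secondary" {
  provider                  = aws.us_west_2   # assumed provider alias
  cluster_identifier        = "app-us-west-2"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier = aws_rds_global_cluster.this.id
  depends_on                = [aws_rds_cluster.primary]
}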
Step 2 — Global routing:
hcl
# Route53 health check + latency routing
resource "aws_route53_health_check" "us_east_1" {
fqdn = "api-us-east-1.internal.example.com"
port = 443
type = "HTTPS"
request_interval = 10
failure_threshold = 2 # 20s to detect failure
}
resource "aws_route53_record" "api" {
zone_id = var.zone_id
name = "api.example.com"
type = "A"
set_identifier = "us-east-1"
latency_routing_policy { region = "us-east-1" }
health_check_id = aws_route53_health_check.us_east_1.id
alias { ... }
}
Step 3 — Aurora Global failover:
bash
# Promote us-west-2 replica to primary (< 1 min)
aws rds failover-global-cluster \
--global-cluster-identifier my-global-cluster \
--target-db-cluster-identifier arn:aws:rds:us-west-2:...
Step 4 — Application-level conflict resolution: Use event sourcing + timestamp-based last-write-wins for non-financial data. For financial transactions: route all writes to a single region (no active-active for payments — explain why).
Step 5 — Chaos validation: Monthly GameDay — inject region failure using AWS FIS, measure actual RTO.
Metrics to Watch:
- Aurora replication lag > 1s → alert (failover will have stale reads)
- Route53 health check failures > 2 → automatic DNS failover begins
- Cross-region latency > 150ms p99 → investigate routing
Cost estimate: ~$65K/month additional (Aurora Global: $30K, cross-region traffic: $20K, second app tier: $15K) — within budget.
Rollback: Fail back to us-east-1 after root cause resolved. RPO: < 1s (Aurora replication lag). RTO: < 30s.
Interview Score: 9/10
Why this score:
- Excellent: acknowledges payments can't be active-active naively
- CAP theorem awareness and conflict resolution strategy
- Concrete Route53 + Aurora commands
- Deduct if candidate claims "full active-active" without addressing write conflicts
- Bonus for GameDay validation and cost estimate
Question 9: Secrets Sprawl — Migrating Hardcoded Credentials to Vault [security, iac, ci-cd]
Difficulty: High
Role: DevOps Engineer
Level: Senior
Company Examples: HashiCorp, Twilio, Okta, GitHub
Question:
A security audit reveals 23 microservices have database passwords and API keys hardcoded in environment variables baked into Docker images and Kubernetes ConfigMaps. Three secrets have already appeared in git history. You have 6 weeks to remediate before the SOC 2 Type II audit. How do you execute this migration without breaking production?
What is This Question Testing?
- Secrets management architecture (Vault, AWS Secrets Manager, External Secrets Operator)
- Git history remediation and secret rotation
- Zero-downtime migration strategy for secrets
- Security posture improvement under deadline pressure
Framework to Answer This Question
Use the Rotate → Centralize → Inject → Audit pattern.
- Immediately rotate all exposed secrets
- Deploy centralized secrets backend (Vault or AWS Secrets Manager)
- Migrate services one-by-one using External Secrets Operator
- Purge secrets from git history and ConfigMaps
- Enforce via admission controller (no plaintext secrets in manifests)
Key Principles:
- Rotate first, then migrate — don't migrate stale compromised secrets
- ESO (External Secrets Operator) is least-invasive for K8s migration
- Git history rewrite requires coordination (all devs must re-clone)
- Audit trail of secret access is a SOC 2 requirement
The Answer:
Assumptions: EKS, GitHub, AWS Secrets Manager (simpler than Vault for AWS-native), 23 services, 6-week deadline.
Step 1 — Emergency rotation (Day 1):
bash
# For each exposed secret
aws secretsmanager rotate-secret --secret-id prod/postgres/password
# For API keys: revoke in provider dashboard, generate new, store in Secrets Manager
aws secretsmanager create-secret \
--name prod/stripe/api-key \
--secret-string '{"key":"sk_live_newkey123"}'
Step 2 — Deploy External Secrets Operator:
bash
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace
Step 3 — Create ExternalSecret per service:
yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: postgres-creds
namespace: checkout
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: ClusterSecretStore
target:
name: postgres-creds
creationPolicy: Owner
data:
- secretKey: DB_PASSWORD
remoteRef:
key: prod/postgres/password
property: password
Step 4 — Remove secrets from ConfigMaps and Deployments; reference K8s Secret instead.
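For example, the container spec then consumes the ESO-managed Secret instead of a ConfigMap value (names match the ExternalSecret above):
yaml
# Sketch: Deployment env sourced from the Secret created by External Secrets Operator
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-creds   # created by the ExternalSecret above
        key: DB_PASSWORD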
Step 5 — Git history remediation:
bash
# Use git-filter-repo (preferred over BFG)
pip install git-filter-repo
git filter-repo --path-glob '*.env' --invert-paths
# Force push + require all devs to re-clone
git push --force --all
Step 6 — Enforce with OPA/Gatekeeper: Deny any ConfigMap or Deployment containing strings matching secret patterns.
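A minimal sketch of such a Gatekeeper policy; the template name and regex are illustrative, not an exhaustive secret pattern. A matching Constraint object would then bind it to ConfigMaps in the relevant namespaces.
yaml
# Sketch: reject ConfigMaps whose values look like credentials
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyplaintextsecrets
spec:
  crd:
    spec:
      names:
        kind: K8sDenyPlaintextSecrets
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenyplaintextsecrets

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          some key
          value := input.review.object.data[key]
          regex.match(`(?i)(api[_-]?key|password|secret|sk_live_)`, value)
          msg := sprintf("ConfigMap %v: key %v appears to contain a credential", [input.review.object.metadata.name, key])
        }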
Timeline: Week 1: rotate + deploy ESO. Weeks 2–5: migrate 5–6 services/week. Week 6: audit evidence collection.
Metrics to Watch:
- Secrets Manager API errors > 5/min → alert (ESO injection failing)
- Any K8s Secret with type: Opaque containing base64 plaintext → OPA policy violation alert
- Secret rotation failures → immediate page
Rollback per service: Revert Deployment to use old env var (secret already rotated, so use new value in both places during transition). RTO: 5 min/service.
Interview Score: 8/10
Why this score:
- Correct priority: rotate before migrate
- ESO is the right low-disruption tool
- Deduct for missing git history remediation step
- Bonus for OPA enforcement (prevents regression)
- Watch for: candidates who skip rotation and just move the compromised secret
Question 10: Cost Optimization — 40% Cloud Bill Reduction [cost, kubernetes, iac]
Difficulty: High
Role: DevOps Engineer / SRE
Level: Senior
Company Examples: Stripe, Figma, Notion, GitLab
Question:
Your AWS bill hit $420K last month, up 35% from six months ago despite flat traffic growth. The CTO wants a 40% cost reduction ($168K/month) within 90 days without degrading reliability. You have access to Cost Explorer, CloudWatch, and the Kubernetes cluster. Where do you start and what levers do you pull?
What is This Question Testing?
- Cloud cost analysis methodology
- Right-sizing, reserved instances, and Spot usage
- Kubernetes resource efficiency (requests, limits, bin-packing)
- Balancing cost reduction with reliability risk
Framework to Answer This Question
Use the Analyze → Quick Wins → Structural Changes → Automate cost optimization cycle.
- Categorize spend by service (EC2, RDS, data transfer, S3)
- Attack idle/oversized resources first (quick wins, low risk)
- Reserved Instances / Savings Plans for predictable baseline
- Kubernetes bin-packing and Spot for batch workloads
- Automate with scheduled scaling and anomaly alerts
Key Principles:
- Data transfer costs are often invisible — check first
- Reserved Instances for 70% baseline load; Spot for 30% burst
- Over-provisioned RDS instances are common and safe to fix
- Never right-size production databases without load testing first
The Answer:
Assumptions: AWS, EKS cluster, Aurora RDS, S3 heavy usage, cross-AZ data transfer costs high.
Step 1 — Cost breakdown via AWS Cost Explorer:
bash
aws ce get-cost-and-usage \
--time-period Start=2025-09-01,End=2025-10-01 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
Typical finding: EC2 40%, RDS 25%, Data Transfer 20%, S3 10%, other 5%.
Step 2 — Quick wins (Week 1–2):
bash
# Find unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available \
--query 'Volumes[*].[VolumeId,Size,CreateTime]' --output table
# Find idle load balancers (0 requests/day)
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB --metric-name RequestCount \
--start-time 2025-09-01T00:00:00 --end-time 2025-10-01T00:00:00 \
--period 2592000 --statistics Sum
Step 3 — Kubernetes resource right-sizing:
bash
# Install Goldilocks (VPA-based recommender)
helm install goldilocks fairwinds/goldilocks --namespace goldilocks
# Check recommendations
kubectl -n goldilocks get vpa --all-namespaces
```
Target: reduce average CPU request padding by 40% (most teams over-request 2–3x).
Step 4 — Reserved Instances (30-day analysis, then commit):
- 1-year Compute Savings Plan for 70% of baseline EC2 → ~30% discount
- Spot instances for CI runners and batch jobs → 70% discount
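Before committing, pull AWS's own recommendation data; a sketch assuming the Cost Explorer CLI (parameter values may need adjusting for the account):
bash
# Sketch: 1-year, no-upfront Compute Savings Plan recommendation based on the last 30 days
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS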
Step 5 — Cross-AZ transfer reduction:
- Enable `topology.kubernetes.io/zone` node affinity for pod scheduling
- Use S3 Transfer Acceleration only where needed; enable S3 Intelligent Tiering
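A sketch of the zone co-location idea from the first bullet above, expressed as a preferred pod affinity; the app label is illustrative:
yaml
# Sketch: prefer scheduling replicas in the same zone as a chatty upstream dependency
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: checkout-api   # illustrative dependency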
Projected savings:
- Idle resources cleanup: ~$25K/month
- K8s right-sizing: ~$40K/month
- Reserved Instances: ~$60K/month
- Spot for batch: ~$25K/month
- Data transfer optimization: ~$20K/month
- Total: ~$170K/month ✓
Metrics to Watch:
- Monthly spend growth > 5% MoM without traffic growth → alert
- CPU utilization < 20% sustained on any instance type → right-size candidate
- Data transfer costs > $30K/month → investigate AZ routing
Rollback: Reserved Instance commitments are 1-year — validate with 30 days of Savings Plans recommendations first (no-commitment option available).
Interview Score: 8/10
Why this score:
- Excellent: structured analysis, multiple levers, realistic projections
- Deduct if candidate goes straight to "use Spot for everything"
- Bonus for Goldilocks and cross-AZ transfer insight (often missed)
- Watch for: candidates who ignore data transfer costs (often 15–25% of bill)
Question 11: Designing a GitOps Deployment Platform [gitops, ci-cd, kubernetes]
Difficulty: Very High
Role: Staff DevOps
Level: Staff
Company Examples: Weaveworks, GitHub, Shopify, Spotify
Question:
Your organization has 80 engineers across 12 teams, deploying 15 microservices to three environments (dev, staging, prod). There's no standardized deployment process — teams use a mix of raw kubectl apply, Helm, and bash scripts. You've been asked to design and implement a GitOps platform using Argo CD within 8 weeks. Prod is currently stable and cannot be disrupted. How do you design and roll it out?
What is This Question Testing?
- GitOps principles and Argo CD architecture
- Multi-tenant platform design for multiple teams
- Progressive rollout without disrupting existing deployments
- Organizational change management alongside technical delivery
Framework to Answer This Question
Use the Platform Product Mindset: treat the internal platform as a product with customers (the dev teams).
- Define GitOps contracts (repo structure, sync policies)
- Deploy Argo CD with RBAC mapped to team boundaries
- Onboard one pilot team → iterate → scale
- Migrate existing workloads non-destructively
- Enforce via admission webhooks (no kubectl apply in prod)
Key Principles:
- Git is the single source of truth — drift means something bypassed a PR
- App-of-apps pattern for managing multiple teams at scale
- Sync windows prevent accidental prod deploys outside approved hours
- Measure adoption rate as a platform metric
The Answer:
Assumptions: EKS, GitHub, 3 clusters (dev/staging/prod), teams own their namespaces.
Step 1 — Repo structure:
gitops-platform/
├── apps/ # App-of-apps root
│ ├── dev/
│ ├── staging/
│ └── prod/
├── clusters/
│ ├── dev-cluster/
│ ├── staging-cluster/
│ └── prod-cluster/
└── platform/ # Argo CD itself, ingress, monitoring
Step 2 — Deploy Argo CD:
bash
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# For prod, install the Helm chart in HA mode instead of the raw manifest
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd -n argocd \
  --set controller.replicas=2 \
  --set server.replicas=2 \
  --set repoServer.replicas=2
Step 3 — RBAC per team:
yaml
# argocd-rbac-cm
policy.csv: |
p, team-checkout, applications, get, checkout/*, allow
p, team-checkout, applications, sync, checkout/*, allow
p, team-checkout, applications, override, checkout/*, deny
g, github-org:team-checkout, role:team-checkout
Step 4 — App-of-apps for prod:
yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: prod-apps
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/org/gitops-platform
targetRevision: main
path: apps/prod
destination:
server: https://prod-cluster.example.com
namespace: argocd
syncPolicy:
syncOptions:
- CreateNamespace=true
Step 5 — Prod sync window (no accidental weekend deploys):
yaml
# Configured on the AppProject that owns the prod Applications
syncWindows:
- kind: allow
schedule: '0 10 * * 1-5' # Mon-Fri 10am only
duration: 8h
applications:
- '*'
namespaces:
- production
Step 6 — Migration approach: Keep existing kubectl/Helm deployments running. Onboard each service by adding an Argo CD Application pointing to a new Helm chart. Validate drift detection catches any manual changes. Teams stop using kubectl directly once they trust the platform.
Timeline: Week 1–2: platform setup + pilot team. Week 3–5: 6 teams onboarded. Week 6–8: prod migration + enforcement.
Metrics to Watch:
- Argo CD sync failure rate > 5% → alert (broken manifests or connectivity)
- App out-of-sync duration > 15 min in prod → alert (drift or blocked sync)
- Teams still using direct kubectl in prod (audit via CloudTrail) > 0 → governance report
Rollback: Argo CD is additive — existing workloads are unaffected until explicitly onboarded. Each team can opt out by removing their Application CR. Platform rollback: helm rollback argocd. RTO: 10 min.
Code Appendix:
yaml
# ApplicationSet — stamps out one Application per directory under apps/prod/
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: team-apps
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/org/gitops-platform
        revision: main
        directories:
          - path: apps/prod/*
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/org/gitops-platform
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://prod-cluster.example.com
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
Interview Score: 9/10
Why this score:
- Excellent: app-of-apps, RBAC design, sync windows, non-disruptive migration
- Bonus for ApplicationSet (scales to N teams automatically)
- Deduct if candidate enables prune: true in prod without explaining the risk
- Watch for: over-engineering on day 1 vs. pragmatic pilot-first approach
Question 12: Prometheus Alert Fatigue — Redesigning Alerting Strategy [observability, sre, reliability]
Difficulty: Very High
Role: SRE
Level: Senior
Company Examples: Google, Datadog, PagerDuty, Cloudflare
Question:
Your on-call rotation receives 200+ alerts per week, 85% of which are noise. Engineers are ignoring PagerDuty pages. Three real incidents were missed in the last month. You need to redesign the alerting strategy from scratch. The team has Prometheus, Grafana, AlertManager, and PagerDuty. How do you fix this in 4 weeks?
What is This Question Testing?
- Alert fatigue diagnosis and systematic remediation
- SLO-based alerting vs. cause-based alerting
- AlertManager routing, grouping, and inhibition rules
- Organizational behavior change alongside technical fixes
Framework to Answer This Question
Use the Signal/Noise Reduction Funnel: Audit → Classify → Redesign → Measure.
- Audit all alerts: actionable vs. informational vs. noise
- Migrate critical service alerts to SLO burn rate model
- Configure AlertManager grouping, silences, and inhibition
- Establish alert review cadence (weekly for 4 weeks)
- Track MTTA and noise ratio as success metrics
Key Principles:
- Every page must require human action — or it's not a page
- SLO burn rate alerts are more actionable than threshold alerts
- Alert storms require inhibition rules, not more alerts
- Measure alert quality, not just quantity
The Answer:
Assumptions: ~300 active Prometheus alerts, Alertmanager routing to PagerDuty for everything, no inhibitions configured.
Step 1 — Audit (Week 1):
bash
# Export all current alert rules
kubectl exec -n monitoring prometheus-0 -- \
curl -s localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="alerting") | .name' | sort > all_alerts.txt
# Pull 30 days of firing history
curl -G 'http://alertmanager:9093/api/v1/alerts' \
--data-urlencode 'filter=severity="page"' | jq '.data | length'
Classify each alert: Actionable page / Ticket / Log and delete.
Step 2 — Migrate to SLO burn rate alerts (keeps only ~20 critical alerts):
yaml
# prometheus-slo-alerts.yaml
- alert: HighErrorBurnRate
expr: |
(
rate(http_requests_total{status=~"5.."}[1h]) /
rate(http_requests_total[1h])
) > (14 * 0.001)
for: 2m
labels:
severity: page
team: "{{ $labels.team }}"
annotations:
summary: "Error budget burning 14x fast — action required"
runbook: "https://runbooks.example.com/high-error-burn-rate"
Step 3 — AlertManager: grouping + inhibition:
yaml
# alertmanager.yml
route:
group_by: ['alertname', 'cluster', 'team']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'pagerduty-critical'
routes:
- match:
severity: warning
receiver: 'slack-warnings'
continue: false
inhibit_rules:
- source_match:
severity: 'page'
alertname: 'NodeDown'
target_match:
severity: 'page'
equal: ['node']
# Suppress pod alerts when their node is down
Step 4 — Routing by team (reduce cross-team noise):
yaml
routes:
- match:
team: checkout
receiver: pagerduty-checkout
group_by: ['alertname']
Step 5 — Weekly review cadence: Every Monday, review previous week's pages. Silence or delete any alert that fired > 5 times without action.
Metrics to Watch (success metrics):
- Pages/week: target < 30 (from 200) by week 4
- MTTA (mean time to acknowledge) < 5 min (currently 18 min due to fatigue)
- Missed incidents: 0 per month (current: 3)
- Alert actionability rate > 90% (surveys with on-call engineers)
Rollback: Old alert rules are in git — restore previous prometheus-rules.yaml. AlertManager config is version-controlled. RTO: 5 min.
Code Appendix:
yaml
# Multi-window burn rate — covers both fast and slow burns
- alert: SLOBurnRateFast
expr: |
(rate(errors[1h]) / rate(requests[1h])) > (14 * 0.005)
and
(rate(errors[5m]) / rate(requests[5m])) > (14 * 0.005)
for: 2m
labels: {severity: page}
- alert: SLOBurnRateSlow
expr: |
(rate(errors[6h]) / rate(requests[6h])) > (6 * 0.005)
and
(rate(errors[30m]) / rate(requests[30m])) > (6 * 0.005)
for: 15m
labels: {severity: ticket}
Interview Score: 9/10
Why this score:
- Excellent: audit-first, SLO burn rate model, inhibition rules, team-based routing
- Multi-window burn rate is Google SRE Workbook best practice
- Deduct if candidate only adjusts thresholds (treats symptoms not causes)
- Bonus for tracking MTTA and actionability rate as platform metrics
- Watch for: candidates who add more dashboards instead of fixing alerting
Question 13: Immutable Infrastructure — Blue/Green AMI Pipeline [iac, ci-cd, reliability]
Difficulty: Very High
Role: Staff DevOps
Level: Staff
Company Examples: Netflix, Amazon, Hashicorp, Cloudflare
Question:
Your team currently SSHs into production EC2 instances to apply patches and config changes, creating significant configuration drift. Security has flagged this as a compliance issue. You need to implement immutable infrastructure using Packer-built AMIs with a blue/green deployment pipeline within 6 weeks. The environment: 200 EC2 instances across 8 Auto Scaling Groups, running stateless services. How do you design and execute this?
What is This Question Testing?
- Immutable infrastructure philosophy and implementation
- Packer image pipeline design
- Blue/green ASG swap strategies
- Breaking cultural habits (no more SSH to prod)
Framework to Answer This Question
Use the Bake → Validate → Swap → Terminate immutable pipeline.
- Build hardened base AMI with Packer (OS + agents)
- Layer application AMI on top (app code + config)
- Launch new ASG with new AMI (blue)
- Shift traffic gradually; terminate old ASG (green)
- Block SSH via security group policy
Key Principles:
- Configuration belongs in the AMI, not applied post-launch
- Validate AMI with InSpec/Serverspec before promoting to prod
- Blue/green swap takes 10–15 min — acceptable for stateless services
- SSM Session Manager replaces SSH — no inbound 22 required
The Answer:
Assumptions: AWS, 200 stateless EC2 instances, Terraform for ASGs, Jenkins CI, stateless app.
Step 1 — Packer base AMI:
hcl
# packer/base-ami.pkr.hcl
source "amazon-ebs" "ubuntu" {
region = "us-east-1"
source_ami_filter {
filters = { name = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" }
owners = ["099720109477"]
most_recent = true
}
instance_type = "t3.medium"
ssh_username = "ubuntu"
ami_name = "base-{{timestamp}}"
}
build {
sources = ["source.amazon-ebs.ubuntu"]
provisioner "shell" {
scripts = ["scripts/harden.sh", "scripts/install-agents.sh"]
}
provisioner "inspec" {
profile = "profiles/cis-ubuntu"
}
post-processor "manifest" {
output = "manifest.json"
}
}
Step 2 — App AMI (inherits base):
bash
# In CI pipeline
packer build -var "base_ami=$(cat manifest.json | jq -r '.builds[-1].artifact_id')" app-ami.pkr.hcl
NEW_AMI=$(cat app-manifest.json | jq -r '.builds[-1].artifact_id' | cut -d: -f2)
Step 3 — Terraform blue/green ASG swap:
hcl
resource "aws_autoscaling_group" "blue" {
name = "checkout-blue-${var.ami_id}"
launch_template {
id = aws_launch_template.checkout.id
version = "$Latest"
}
target_group_arns = [aws_lb_target_group.main.arn]
min_size = var.min_capacity
max_size = var.max_capacity
lifecycle { create_before_destroy = true }
}
Step 4 — Traffic shift script:
bash
# Deregister old ASG from target group, register new
aws autoscaling attach-load-balancer-target-groups \
--auto-scaling-group-name checkout-blue-$NEW_AMI \
--target-group-arns $TG_ARN
# Wait for health checks
sleep 120
aws autoscaling detach-load-balancer-target-groups \
--auto-scaling-group-name checkout-green-$OLD_AMI \
--target-group-arns $TG_ARN
# Terminate old ASG after 30 min bake
aws autoscaling delete-auto-scaling-group \
--auto-scaling-group-name checkout-green-$OLD_AMI \
--force-delete
Step 5 — Block SSH permanently:
bash
# Remove port 22 from all prod security groups
aws ec2 revoke-security-group-ingress \
--group-id $SG_ID --protocol tcp --port 22 --cidr 0.0.0.0/0
# Enable SSM for break-glass access
aws ssm start-session --target i-0abc1234def56789
Timeline: 3 engineers, 6 weeks. Week 1: Packer pipeline. Week 2–3: one ASG piloted. Week 4–5: all 8 ASGs. Week 6: SSH revoked.
Metrics to Watch:
- New instance boot time > 5 min → investigate AMI bloat
- Blue ASG health check failures > 5% during swap → abort and rollback
- Any SSH connection attempt to prod instances (CloudTrail) → P2 alert
Rollback: Reattach old green ASG to target group in < 5 min. AMIs are immutable — no rollback needed at AMI level; just keep old ASG alive for 1 hour post-swap. RTO: 5 min. RPO: 0.
Code Appendix:
bash
#!/bin/bash
# swap-asg.sh — Blue/Green swap with health gate
NEW_ASG=$1; OLD_ASG=$2; TG_ARN=$3; THRESHOLD=95
aws autoscaling attach-load-balancer-target-groups \
--auto-scaling-group-name $NEW_ASG --target-group-arns $TG_ARN
echo "Waiting 90s for health checks..."
sleep 90
HEALTHY=$(aws elbv2 describe-target-health --target-group-arn $TG_ARN \
--query 'TargetHealthDescriptions[?TargetHealth.State==`healthy`] | length(@)')
TOTAL=$(aws elbv2 describe-target-health --target-group-arn $TG_ARN \
--query 'TargetHealthDescriptions | length(@)')
PCT=$(( HEALTHY * 100 / TOTAL ))
if [ $PCT -lt $THRESHOLD ]; then
echo "Health check failed ($PCT% healthy). Aborting swap."
exit 1
fi
aws autoscaling detach-load-balancer-target-groups \
--auto-scaling-group-name $OLD_ASG --target-group-arns $TG_ARN
echo "Swap complete. Old ASG $OLD_ASG detached."
Interview Score: 9/10
Why this score:
- Excellent: full pipeline from Packer → validate → swap → SSH revoke
- InSpec validation in AMI build pipeline is security best practice
- Health-gated swap script prevents bad deploys from completing
- Deduct if candidate skips AMI validation or leaves SSH open
- Bonus for SSM Session Manager as SSH replacement (no inbound port 22 needed)
Question 14: Disaster Recovery — RDS Corruption in Production [dr, database, incident-response]
Difficulty: Very High
Role: SRE
Level: Senior
Company Examples: Stripe, GitHub, Shopify, Cloudflare
Question:
At 11:22 AM on a Tuesday, a developer accidentally runs DROP TABLE orders; on the production PostgreSQL RDS database. The orders table has 18 months of transaction history. The business has an RPO of 1 hour. Automated backups are enabled with a 35-day retention. Point-in-time recovery is available. How do you recover, and what's your communication plan?
What is This Question Testing?
- RDS PITR (Point-in-Time Recovery) knowledge and limitations
- Incident communication and stakeholder management
- Data validation after restore
- Post-incident hardening (IAM, least privilege)
Framework to Answer This Question
Use the Contain → Restore → Validate → Harden DR playbook.
- Immediately prevent further writes to corrupt database
- Initiate PITR to a new RDS instance (target: 11:21 AM)
- Validate restored data against application checksums
- Promote restored instance (or export table and reimport)
- Revoke DDL permissions for developers; add IAM guardrails
Key Principles:
- PITR restores to a NEW instance — you cannot restore in-place
- Fastest path: restore table only from PITR export, not full instance promotion
- Communicate status every 15–30 minutes during P1
- The RPO requirement is 1 hour; PITR restores to roughly 1 minute before the incident, so the 18 months of history survive almost intact
The Answer:
Assumptions: Aurora PostgreSQL, PITR enabled, 11:22 AM incident, 11:21 AM target restore time, application can be put in read-only mode.
Step 1 — Contain (11:22 AM):
bash
# Put app in maintenance mode immediately (prevent further writes)
kubectl set env deployment/orders-service DB_READ_ONLY=true
# Or enable maintenance page via ingress annotation
kubectl annotate ingress orders-ingress nginx.ingress.kubernetes.io/server-snippet='return 503;'
Step 2 — Initiate PITR to new cluster (11:23 AM):
bash
aws rds restore-db-cluster-to-point-in-time \
--db-cluster-identifier prod-orders-restored \
--source-db-cluster-identifier prod-orders \
--restore-to-time 2025-10-14T11:21:00Z \
--vpc-security-group-ids $PROD_SG_ID \
--db-subnet-group-name prod-subnet-group
# Takes ~15-25 min for Aurora. Monitor:
watch -n 30 "aws rds describe-db-clusters \
--db-cluster-identifier prod-orders-restored \
--query 'DBClusters[0].Status'"
Step 3 — Export only the orders table (faster than full promotion):
bash
# Once PITR cluster is available, connect and export
psql -h restored-cluster.endpoint -U admin -d production \
-c "\COPY orders TO '/tmp/orders_recovered.csv' CSV HEADER"
# Compress and move to S3
aws s3 cp /tmp/orders_recovered.csv s3://dr-exports/orders-recovered-$(date +%Y%m%dT%H%M).csv
Step 4 — Reimport to production (11:48 AM estimate):
bash
# The orders table was dropped, so recreate its schema from the restored cluster first
pg_dump -h restored-cluster.endpoint -U admin -d production --schema-only --table=orders \
  | psql -h prod-cluster.endpoint -U admin -d production
psql -h prod-cluster.endpoint -U admin -d production \
  -c "\COPY orders FROM '/tmp/orders_recovered.csv' CSV HEADER"
# Validate row counts
psql -h restored-cluster.endpoint -U admin -d production -tAc "SELECT COUNT(*) FROM orders;"
psql -h prod-cluster.endpoint -U admin -d production -tAc "SELECT COUNT(*) FROM orders;"  # should match
Step 5 — Validate and restore service:
bash
# Application-level validation
curl -s https://internal-api/health/orders-db | jq '.row_count'
# Re-enable service
kubectl set env deployment/orders-service DB_READ_ONLY-
Step 6 — Communication:
- 11:22 AM: "P1 declared. Orders service in maintenance. ETR: 60 min."
- 11:40 AM: "PITR restore in progress. ETR: 30 min."
- 12:15 PM: "Data restored and validated. Service restored. RCA in 48h."
Step 7 — Post-incident hardening:
sql
-- PostgreSQL has no grantable DROP privilege: only the table owner (or a superuser) can drop a table,
-- so ensure the application role does not own the tables and holds DML rights only
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
-- Developers get a read-only role, never prod write access
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_readonly;
Metrics to Watch:
- PITR restore duration > 30 min → escalate to AWS Support
- Row count mismatch after restore > 0 → do not restore service, investigate
- Any DDL statements (DROP, TRUNCATE) from non-migration users → real-time alert via pgAudit
Timeline: Incident at 11:22 AM → service restored by 12:10–12:20 PM. RPO achieved: ~1 min of data loss. RTO: ~50 min.
Code Appendix:
bash
#!/bin/bash
# validate-restore.sh — compare row counts between restored and prod
PROD_HOST=$1; RESTORED_HOST=$2; DB=$3; TABLE=$4
PROD_COUNT=$(psql -h $PROD_HOST -U admin -d $DB -tAc "SELECT COUNT(*) FROM $TABLE")
RESTORED_COUNT=$(psql -h $RESTORED_HOST -U admin -d $DB -tAc "SELECT COUNT(*) FROM $TABLE")
echo "Production: $PROD_COUNT | Restored: $RESTORED_COUNT"
if [ "$PROD_COUNT" -ne "$RESTORED_COUNT" ]; then
echo "MISMATCH — do not cut over. Investigate immediately."
exit 1
fi
echo "Counts match. Safe to promote restored instance."
Interview Score: 9/10
Why this score:
- Excellent: maintenance mode first (stops the bleeding), PITR + table export (faster than full cluster promotion)
- Validation script with row count check is production-grade
- Communication timeline demonstrates IC experience
- Deduct if candidate promotes full cluster (30–60 min longer) without considering table-export shortcut
- Bonus for pgAudit real-time DDL alerting and post-incident least-privilege hardening
- Watch for: candidates who skip data validation before restoring service
Question 15: Service Mesh Adoption — Migrating to Istio [networking, kubernetes, security]
Difficulty: Very High
Role: Staff DevOps
Level: Staff
Company Examples: Google, Lyft, Airbnb, LinkedIn
Question:
Your organization has 30 microservices communicating over plain HTTP inside a Kubernetes cluster. Security requires mutual TLS (mTLS) between all services within 60 days. The platform team has chosen Istio. However, the previous attempt to install Istio 18 months ago caused a 45-minute outage. Engineering leadership is skeptical. How do you design a safe migration?
What is This Question Testing?
- Istio architecture and sidecar injection model
- Progressive, non-disruptive service mesh adoption
- mTLS migration modes (PERMISSIVE → STRICT)
- Risk management and stakeholder communication
Framework to Answer This Question
Use the Shadow → Permissive → Strict mTLS migration ladder.
- Install Istio in ambient or sidecar mode on non-prod first
- Enable sidecar injection namespace-by-namespace (not cluster-wide)
- Start in PERMISSIVE mode (accepts both HTTP and mTLS)
- Validate observability and no traffic disruption
- Migrate to STRICT mode namespace-by-namespace
Key Principles:
- PERMISSIVE mode is the safety net — use it for weeks, not hours
- Sidecar injection causes pod restarts — schedule during maintenance windows
- Ambient mesh (Istio 1.18+) avoids sidecar overhead entirely
- Measure traffic success rate before and after each namespace migration
The Answer:
Assumptions: EKS 1.28, Istio 1.20, 30 services across 8 namespaces, current 0 mTLS.
Step 1 — Install Istio with minimal footprint:
bash
istioctl install --set profile=default \
--set values.global.proxy.autoInject=disabled \
-y # Injection disabled cluster-wide — opt-in per namespace
istioctl verify-install
Step 2 — Enable sidecar injection on pilot namespace (non-critical):
bash
kubectl label namespace logging istio-injection=enabled
kubectl rollout restart deployment -n logging
# Monitor for 24h
kubectl get pods -n logging # all should show 2/2 READY (app + sidecar)
Step 3 — Set PERMISSIVE mTLS globally (accepts HTTP and mTLS simultaneously):
yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system # cluster-wide
spec:
mtls:
mode: PERMISSIVE
Step 4 — Gradually enable injection per namespace (one per week):
bash
NAMESPACES=("payments" "orders" "inventory" "notifications")
for ns in "${NAMESPACES[@]}"; do
echo "Enabling injection for $ns"
kubectl label namespace $ns istio-injection=enabled
kubectl rollout restart deployment -n $ns
# Gate: verify no 5xx increase
sleep 300
kubectl exec -n istio-system deploy/prometheus -- \
promtool query instant http://localhost:9090 \
'rate(istio_requests_total{response_code=~"5.."}[5m])'
done
Step 5 — Switch to STRICT mTLS per namespace (after all injected):
yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: strict-mtls
namespace: payments
spec:
mtls:
mode: STRICT
Step 6 — Validate mTLS is active:
bash
istioctl x check-inject -n payments
kubectl exec -n payments deploy/payment-api -c istio-proxy -- \
pilot-agent request GET stats | grep ssl.handshake
Timeline: 3 engineers, 8 weeks. Week 1: install + pilot namespace. Weeks 2–6: one namespace/week. Week 7: STRICT mode. Week 8: policy enforcement + runbook.
Metrics to Watch:
- HTTP 503 rate > 0.1% immediately after namespace injection → rollback injection for that namespace
- Envoy proxy CPU > 10% of total pod CPU → investigate filter chain complexity
- mTLS handshake failures > 0 in STRICT mode → service missing sidecar
Rollback per namespace:
bash
kubectl label namespace payments istio-injection-
kubectl rollout restart deployment -n payments
# Returns to plain HTTP — PERMISSIVE mode means nothing breaks
RTO: 5 min per namespace. Full cluster rollback: istioctl uninstall --purge (risky — prefer namespace-by-namespace).
Code Appendix:
yaml
# AuthorizationPolicy — allow only specific services after mTLS strict
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: payments-allow
namespace: payments
spec:
action: ALLOW
rules:
- from:
- source:
principals:
- "cluster.local/ns/orders/sa/orders-service"
- "cluster.local/ns/checkout/sa/checkout-service"
Interview Score: 9/10
Why this score:
- Excellent: PERMISSIVE-first strategy directly addresses the previous outage risk
- Namespace-by-namespace with metric gates is production-safe
- AuthorizationPolicy shows understanding of mTLS + authz as separate concerns
- Deduct for enabling cluster-wide injection on day 1
- Bonus for noting Istio ambient mode as a lower-overhead alternative