DevOps Engineer Interview Question Bank
by InterviewBee — Production-Grade, Scenario-Driven
Question 1: Fixing a Broken CI Pipeline [ci-cd, troubleshooting]
Difficulty: Medium
Role: DevOps Engineer
Level: Junior
Company Examples: GitHub, Shopify, Atlassian, GitLab
Question:
Your team's GitHub Actions pipeline has been failing intermittently for the past two days — roughly 30% of builds fail with a connection timeout error during the Docker image push step. The on-call engineer has already restarted the runners twice with no improvement. Deployments are blocked. How do you diagnose and fix this?
What is This Question Testing?
- Systematic CI/CD troubleshooting methodology
- Understanding of Docker registry authentication and network dependencies
- Ability to distinguish flaky infrastructure from code bugs
- Communication and prioritisation under pressure
Framework to Answer This Question
Use the Layered Elimination Framework: isolate the failure at each layer (runner → network → registry → auth) before making changes.
- Reproduce the failure locally and in CI with verbose logging
- Check runner resource utilization and network egress
- Audit registry auth token expiry and rate limits
- Implement retry logic as short-term fix
- Harden the pipeline with caching and exponential backoff long-term
Key Principles:
- Never change two variables at once during diagnosis
- Timeouts almost always point to network or auth, not code
- Flaky = 30% failure rate → systematic, not random
- Add observability before adding fixes
The Answer:
Assumptions: GitHub Actions self-hosted runners on EC2, pushing to ECR or Docker Hub, ~50 builds/day.
Step 1 — Enable verbose logging:
yaml
- name: Push Docker image
  env:
    DOCKER_BUILDKIT: 1
  run: |
    set -x
    docker push $IMAGE_TAG 2>&1 | tee push.log
Step 2 — Check runner metrics:
bash
# On the runner host
vmstat 1 5
netstat -an | grep ESTABLISHED | wc -l
curl -v https://registry-1.docker.io/v2/ 2>&1 | head -30
Step 3 — Check Docker Hub rate limits (if applicable):
bash
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest -I 2>&1 | grep -i ratelimit
Step 4 — Short-term fix — add retry with backoff:
yaml
- name: Push with retry
  uses: nick-fields/retry@v2
  with:
    timeout_minutes: 10
    max_attempts: 3
    retry_wait_seconds: 15
    command: docker push $IMAGE_TAG
Step 5 — Long-term fix: Switch to ECR with IAM role authentication (no token expiry), enable runner autoscaling, and add a build cache layer.
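A minimal sketch of that long-term direction, assuming GitHub OIDC federation into an AWS role; the role ARN, region, and image name below are placeholders, not values from this scenario:
yaml
# Hypothetical job steps: short-lived IAM credentials via OIDC, then push to ECR (no static registry tokens)
permissions:
  id-token: write
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/ci-ecr-push   # placeholder role ARN
      aws-region: us-east-1
  - id: ecr
    uses: aws-actions/amazon-ecr-login@v2
  - name: Build and push
    run: |
      docker build -t ${{ steps.ecr.outputs.registry }}/my-app:${{ github.sha }} .
      docker push ${{ steps.ecr.outputs.registry }}/my-app:${{ github.sha }}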
Metrics to Watch:
- Pipeline success rate < 95% → alert (SLO breach)
- Docker push duration > 3 min → investigate
- Runner CPU > 80% sustained → scale out
Rollback/Mitigation: Re-enable manual deploy workflow as a bypass; pin runner AMI to last known-good version. RTO: 15 min.
Interview Score: 7/10
Why this score:
- Full marks for systematic diagnosis and concrete commands
- Deduct if candidate jumps to "restart everything" without diagnosis
- Deduct if they miss rate limiting as a root cause
- Bonus for mentioning IAM-based auth over static credentials
- Watch for: candidates who treat flakiness as acceptable without root-causing it
Question 2: Terraform State Corruption After Team Conflict [iac, terraform]
Difficulty: Medium
Role: DevOps Engineer
Level: Mid
Company Examples: HashiCorp, Stripe, Cloudflare, Datadog
Question:
Two engineers on your team ran terraform apply simultaneously against the same workspace. The S3 backend state file is now corrupt, and terraform plan returns: Error: Failed to load state: state file version is incompatible. Production infrastructure is unchanged, but you cannot deploy or modify any resources. How do you safely recover?
What is This Question Testing?
- Terraform state management and locking mechanisms
- Risk-awareness when editing state directly
- Team process improvements to prevent recurrence
- Blast radius minimization
Framework to Answer This Question
Use the Backup → Inspect → Repair → Harden cycle for state recovery.
- Immediately back up the corrupt state file
- Inspect state file structure and identify corruption extent
- Use terraform state commands or manual JSON repair
- Re-enable locking and verify plan matches reality
- Implement DynamoDB locking and CI-only apply going forward
Key Principles:
- Never edit state without a versioned backup
- State corruption ≠ infra corruption — verify separately
- Locking is table stakes, not optional
- CI/CD should be the only principal with write access to state
The Answer:
Assumptions: S3 backend with versioning enabled, DynamoDB locking NOT configured (that's the bug), AWS infra untouched.
Step 1 — Backup and inspect:
bash
aws s3 cp s3://my-tfstate-bucket/prod/terraform.tfstate ./terraform.tfstate.corrupt
aws s3api list-object-versions --bucket my-tfstate-bucket --prefix prod/terraform.tfstate
# Restore last known-good version
aws s3api get-object \
--bucket my-tfstate-bucket \
--key prod/terraform.tfstate \
--version-id <LAST_GOOD_VERSION_ID> \
./terraform.tfstate.restored
Step 2 — Validate restored state:
bash
terraform state list # should enumerate all resources
terraform plan # should show no changes if infra matches
Step 3 — If no clean backup, rebuild state by importing:
bash
terraform import aws_instance.web i-0abc1234def56789
terraform import aws_security_group.main sg-0123456789abcdef0
Step 4 — Add DynamoDB locking (never again):
hcl
terraform {
backend "s3" {
bucket = "my-tfstate-bucket"
key = "prod/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
Step 5 — Process hardening: Restrict terraform apply to CI only via IAM policy. Engineers get read-only state access.
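One way to express that guardrail, sketched in Terraform under the assumptions above (bucket name from this scenario; the CI role ARN is a placeholder):
hcl
# Sketch: only the CI role may write the prod state object; everyone else is read-only
resource "aws_s3_bucket_policy" "state_ci_only_writes" {
  bucket = "my-tfstate-bucket"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyStateWritesExceptCI"
      Effect    = "Deny"
      Principal = "*"
      Action    = ["s3:PutObject", "s3:DeleteObject"]
      Resource  = "arn:aws:s3:::my-tfstate-bucket/prod/*"
      Condition = {
        StringNotEquals = {
          "aws:PrincipalArn" = "arn:aws:iam::123456789012:role/terraform-ci" # hypothetical CI role
        }
      }
    }]
  })
}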
Metrics to Watch:
- State lock acquisition time > 30s → alert (possible deadlock)
- Concurrent applies in CI > 1 → block at pipeline level
- S3 versioning delete events → alert on any state deletion
Rollback: Restore previous S3 version. RTO: 10 min. RPO: last successful apply (S3 versioning).
Interview Score: 8/10
Why this score:
- Full credit for S3 versioning recovery path — many candidates miss this
- Deduct if they attempt to hand-edit JSON without backup
- Bonus for mentioning CI-only apply as the structural fix
- Watch for: candidates who only fix the symptom, not the locking gap
Question 3: Kubernetes Pod Eviction Mystery [kubernetes, observability, sre]
Difficulty: Medium
Role: SRE
Level: Mid-to-Senior
Company Examples: Spotify, LinkedIn, Lyft, Robinhood
Question:
Your SLO dashboard shows a spike: 2% of HTTP requests are returning 503 over the last 20 minutes. Kubernetes events show OOMKilled and Evicted pods across three microservices. The cluster has 12 nodes, each with 64GB RAM. No deployment was triggered in the last 6 hours. What do you do?
What is This Question Testing?
- Kubernetes resource model (requests vs. limits vs. actual usage)
- Incident triage under active production impact
- Observability fluency (kubectl, Prometheus, events)
- Distinguishing node pressure from pod-level misconfiguration
Framework to Answer This Question
Use Triage → Contain → Root Cause → Remediate: stop the bleeding first, then diagnose.
- Identify which nodes are under pressure
- Cordon affected nodes if necessary
- Check pod resource requests/limits vs. actual usage
- Identify the memory leak or sudden load spike source
- Set correct resource requests and add VPA/HPA
Key Principles:
- OOMKill = limit hit; Eviction = node pressure — distinguish them
- Pods without requests get scheduled as Burstable — dangerous
- kubectl top shows current usage; Prometheus shows history
- Never cordon without understanding blast radius
The Answer:
Assumptions: EKS cluster, Prometheus + Grafana installed, services have no resource requests set.
Step 1 — Immediate triage:
bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -E "OOM|Evict" | tail -30
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory | head -20
kubectl describe node <pressured-node> | grep -A 10 "Conditions:"
Step 2 — Contain: cordon node if >80% memory used:
bash
kubectl cordon <node-name>
# Drain only if node is unrecoverable
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=30
Step 3 — Identify the offending pod:
bash
kubectl describe pod <evicted-pod> -n <ns> | grep -E "OOM|Limits|Requests|Last State"
Step 4 — Check Prometheus for memory growth trend:
promql
container_memory_working_set_bytes{namespace="production"} / 1024 / 1024
Look for exponential growth in the last 2 hours.
Step 5 — Set resource requests immediately:
yaml
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
Step 6 — Deploy VPA in recommendation mode to collect data for right-sizing.
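A minimal VPA manifest in recommendation-only mode; the workload name and namespace are illustrative:
yaml
# Sketch: VPA that only emits recommendations and never evicts pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"   # collect right-sizing data without acting on it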
Metrics to Watch (with thresholds):
- Node memory utilization > 85% → page on-call
- Pod restarts > 5 in 10 min → alert
- OOMKill events > 0 in production → P1 investigation trigger
Rollback: Uncordon nodes after resource requests applied; scale node group +2 nodes as buffer. RTO: 15 min.
Interview Score: 8/10
Why this score:
- Strong answer: cordon before drain, triage before fix
- Deduct for recommending kubectl delete pod as the first step
- Bonus for PromQL query and VPA mention
- Watch for: conflating OOMKill (limit) with Eviction (node pressure)
Question 4: Designing a Zero-Downtime Deployment Pipeline [ci-cd, kubernetes, reliability]
Difficulty: Medium
Role: DevOps Engineer
Level: Mid
Company Examples: Netflix, Shopify, DoorDash, Figma
Question:
Your e-commerce platform processes $500K/day in transactions. The engineering team deploys 8–12 times per week, and the current kubectl set image approach causes ~90 seconds of elevated errors during each deploy. Leadership wants zero-downtime deploys within the next sprint. What do you design and implement?
What is This Question Testing?
- Kubernetes deployment strategies (rolling, blue/green, canary)
- Readiness/liveness probe design
- Traffic management and load balancer behavior
- Tradeoff analysis (complexity vs. reliability)
Framework to Answer This Question
Use the Deploy Strategy Ladder: match deployment strategy to risk tolerance and operational maturity.
- Fix readiness probes (immediate wins — often the real issue)
- Configure proper rolling update parameters
- Add PodDisruptionBudgets
- Implement canary via Argo Rollouts or Flagger
- Validate with synthetic load test
Key Principles:
- 90s of errors often means missing or slow readiness probes, not strategy
- PDB prevents disruption from simultaneous voluntary evictions
- Canary catches regressions before full rollout
- Health checks must match actual app startup time
The Answer:
Assumptions: EKS, NGINX Ingress, 3-replica deployments, Node.js services with 15s startup time.
Step 1 — Fix readiness probes (often resolves 80% of issues):
yaml
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 20
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
Step 2 — Rolling update parameters:
yaml
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
Step 3 — PodDisruptionBudget:
yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-service
spec:
minAvailable: 2
selector:
matchLabels:
app: checkout-service
Step 4 — Canary with Argo Rollouts:
yaml
strategy:
canary:
steps:
- setWeight: 10
- pause: {duration: 5m}
- setWeight: 50
- pause: {duration: 5m}
- setWeight: 100
Step 5 — Validate: Run k6 load test during canary phase, watch error rate in Grafana.
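To make that check automatic rather than eyeballed in Grafana, a sketch of an Argo Rollouts AnalysisTemplate that gates the canary on 5xx rate; the Prometheus address and metric labels are assumptions:
yaml
# Sketch: fail the canary analysis if the 5xx ratio exceeds 0.1%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: http-error-rate
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 0.001
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed in-cluster Prometheus
          query: |
            sum(rate(http_requests_total{app="checkout-service",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="checkout-service"}[5m]))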
Metrics to Watch:
- HTTP 5xx rate > 0.1% during deploy → automatic rollback trigger
- Readiness probe failures > 3 consecutive → pod not added to rotation
- Deploy duration > 10 min → alert (stuck rollout)
Rollback: kubectl rollout undo deployment/checkout-service — RTO: 2 min.
Interview Score: 7/10
Why this score:
- Solid: probe fix + PDB + canary is the complete answer
- Deduct for going straight to blue/green without checking probes first
- Bonus for Argo Rollouts with automated analysis
- Watch for: candidates who ignore PDB (allows simultaneous voluntary disruptions)
Question 5: On-Call Incident — Database CPU Spike in Production [incident-response, observability, database]
Difficulty: High
Role: SRE
Level: Mid-to-Senior
Company Examples: Stripe, PagerDuty, MongoDB, Atlassian
Question:
It's 2:47 AM. You receive a PagerDuty alert: RDS PostgreSQL CPU > 95% for 8 minutes. The payments service is returning timeouts. 40,000 active users are affected. Your RDS instance is a db.r6g.4xlarge (16 vCPU, 128GB RAM). A new feature was deployed 3 hours ago. Walk through your incident response.
What is This Question Testing?
- Incident command and structured communication
- Database performance diagnosis (slow queries, locks, connections)
- Ability to distinguish symptom from root cause under pressure
- Rollback decision-making with business context
Framework to Answer This Question
Use the STAR-I Incident Protocol: Stabilize → Triage → Analyze → Remediate → Improve.
- Declare incident, assign roles, open war room
- Check the slow query log and pg_stat_activity
- Kill long-running queries if causing lock contention
- Evaluate rollback of recent deploy vs. query-level mitigation
- Write RCA within 48 hours
Key Principles:
- Time to mitigate > time to root cause in P1
- Never kill queries without understanding the cascade
- Feature flag rollback is faster than code rollback
- Always capture diagnostic state before taking action
The Answer:
Assumptions: Aurora PostgreSQL, CloudWatch enhanced monitoring enabled, feature was a new analytics endpoint.
Step 1 — Declare incident and capture state:
bash
# Slack: "P1 DECLARED: RDS CPU 95%+, payments degraded. IC: @you. Bridge: zoom.us/j/xxx"
# Capture immediately — don't change yet
aws rds describe-db-instances --db-instance-identifier prod-payments-db \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,MultiAZ:MultiAZ}'
Step 2 — Diagnose active queries:
sql
-- Connect via psql or the RDS Query Editor
SELECT pid,
       now() - pg_stat_activity.query_start AS duration,
       query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 20;
Step 3 — Check for lock contention:
sql
SELECT blocked_locks.pid, blocked_activity.query,
blocking_locks.pid AS blocking_pid,
blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Step 4 — Kill offending queries (if confirmed safe):
sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE now() - query_start > interval '5 minutes'
AND query ILIKE '%analytics%'
AND state = 'active';
Step 5 — Rollback via feature flag (fastest): Disable new analytics feature flag in LaunchDarkly → CPU should drop within 60 seconds.
Step 6 — If no feature flag: git revert and emergency deploy (estimated 12 min).
Metrics to Watch:
- RDS CPU > 80% sustained 5 min → P2 alert
- Active connections > 80% of max_connections → alert
- Query duration p99 > 2s → SLO warning
Rollback plan: Feature flag disable → RTO: 2 min. Code rollback → RTO: 15 min. RPO: 0 (no data loss for query-only issue).
Interview Score: 9/10
Why this score:
- Excellent: structured IC declaration, diagnostic-first, feature flag rollback awareness
- Full credit for lock contention query — many candidates miss this
- Deduct if candidate reboots the DB as first action
- Deduct if no RCA/postmortem mentioned
- Bonus for pg_terminate_backend with a safety check (filtering by query pattern)
Question 6: Designing SLOs for a Payment API [sre, observability, reliability]
Difficulty: High
Role: SRE
Level: Senior
Company Examples: Google, Stripe, Square, Braintree
Question:
You're joining a fintech company as their first SRE. The payments API has no SLOs. Engineering ships code 3–5x/week, and the last three months show an average of 99.2% uptime measured by a ping check — but customers are complaining. Your job is to design meaningful SLOs from scratch within 30 days. How do you do it?
What is This Question Testing?
- SLI/SLO/Error budget framework knowledge
- Understanding of customer-centric reliability metrics
- Ability to distinguish vanity metrics (uptime ping) from meaningful SLIs
- Stakeholder alignment and change management
Framework to Answer This Question
Use the SRE SLO Design Loop: Identify critical journeys → Define SLIs → Set SLO targets → Instrument → Burn rate alerts.
- Map critical user journeys (CUJ)
- Define SLIs that measure user experience, not infra health
- Set initial SLO targets conservatively (below current performance)
- Instrument with Prometheus/OpenTelemetry
- Configure burn rate alerts with multi-window approach
Key Principles:
- Ping uptime is a vanity metric — use success rate of real transactions
- Start with 28-day rolling windows
- Error budget = 100% - SLO target
- Burn rate alerts fire before budget exhaustion, not after
The Answer:
Assumptions: Payment API built on FastAPI, Prometheus + Grafana, GCP environment, ~500K transactions/day.
Step 1 — Identify Critical User Journeys:
- Payment authorization (synchronous, latency-sensitive)
- Refund processing (async, correctness-sensitive)
- Merchant dashboard load (read-heavy, P2 priority)
Step 2 — Define SLIs:
yaml
# SLI 1: Payment Authorization Availability
# Numerator: HTTP 200/201 responses to POST /v1/payments
# Denominator: All non-4xx responses to POST /v1/payments
# SLI 2: Payment Authorization Latency
# % of requests completing in < 500ms (p99 < 2s)
# SLI 3: Refund Correctness
# % of refunds that complete without manual intervention
Step 3 — Set SLO targets (week 1, conservative):
Payment Availability SLO: 99.5% (28-day rolling)
Payment Latency SLO: 95% of requests < 500ms
Refund Correctness: 99.9%
Error budget (availability): 0.5% = ~3.6 hours/month
Step 4 — Prometheus instrumentation:
yaml
# prometheus-rules.yaml
groups:
- name: payment-slo
rules:
- record: job:payment_request_duration:rate5m
expr: rate(http_request_duration_seconds_bucket{job="payment-api",le="0.5"}[5m])
- alert: PaymentSLOBurnRateHigh
expr: |
(
rate(http_requests_total{job="payment-api",status!~"5.."}[1h]) /
rate(http_requests_total{job="payment-api"}[1h])
) < 0.99
for: 5m
labels:
severity: page
Step 5 — Multi-window burn rate alerts:
- 1h window at 14x burn rate → page (at that rate the 28-day budget is gone in about 2 days)
- 6h window at 6x burn rate → ticket (budget gone in 5 days)
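A sketch of those two windows as Prometheus rules for the 99.5% availability SLO (0.5% budget); metric and label names are assumed to match the recording rules above:
yaml
# Sketch: fast and slow burn-rate alerts, each requiring both a long and a short window to fire
- alert: PaymentAvailabilityFastBurn
  expr: |
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[1h]))
          / sum(rate(http_requests_total{job="payment-api"}[1h])))) > (14 * 0.005)
    and
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="payment-api"}[5m])))) > (14 * 0.005)
  for: 2m
  labels: {severity: page}
- alert: PaymentAvailabilitySlowBurn
  expr: |
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[6h]))
          / sum(rate(http_requests_total{job="payment-api"}[6h])))) > (6 * 0.005)
    and
    (1 - (sum(rate(http_requests_total{job="payment-api",status!~"5.."}[30m]))
          / sum(rate(http_requests_total{job="payment-api"}[30m])))) > (6 * 0.005)
  for: 15m
  labels: {severity: ticket}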
Metrics to Watch:
- Error budget consumption > 50% in first 14 days → freeze non-critical deploys
- Burn rate > 14x for 5 min → immediate page
- Latency p99 > 2s for 3 consecutive minutes → P2 alert
Rollback/Mitigation: If SLO breach during deploy, auto-rollback via Argo Rollouts analysis. RTO: 3 min.
Interview Score: 9/10
Why this score:
- Excellent: CUJ-first approach, rejects ping uptime, concrete PromQL
- Bonus for multi-window burn rate (Google SRE Workbook pattern)
- Deduct if candidate sets 99.99% SLO on day 1 without historical data
- Watch for: candidates who conflate SLA (contract) with SLO (internal target)
Question 7: Terraform Module Refactoring Without Downtime [iac, terraform, reliability]
Difficulty: High
Role: DevOps Engineer
Level: Senior
Company Examples: HashiCorp, Gruntwork, Cloudflare, Twilio
Question:
Your team has 40 microservices, each with duplicated Terraform code defining their own VPC, subnets, and security groups. Drift is accumulating. You need to refactor into shared Terraform modules without destroying and recreating any existing infrastructure. The constraint: zero production disruption and no downtime. How do you approach this safely?
What is This Question Testing?
- Terraform state mv and module adoption patterns
- Risk management for infrastructure refactoring
- Understanding of Terraform plan/apply lifecycle
- Team coordination for large IaC migrations
Framework to Answer This Question
Use the Encapsulate → Move → Verify refactoring pattern.
- Build the new module matching existing resource configurations exactly
- Use terraform state mv to re-home resources without recreation
- Verify with terraform plan showing zero changes
- Migrate services in batches of 3–5
- Clean up orphaned code after validation
Key Principles:
- terraform plan must show 0 to add, 0 to change, 0 to destroy before merging
- State moves are non-destructive if resource addresses match
- Never migrate all 40 services in one PR
- Tag resources before migration for audit trail
The Answer:
Assumptions: AWS, S3 backend, 40 services in monorepo with environments/prod/service-X/ structure.
Step 1 — Create the shared module:
hcl
# modules/networking/main.tf
variable "service_name" {}
variable "vpc_cidr" { default = "10.0.0.0/16" }
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
tags = { Name = var.service_name, ManagedBy = "terraform-module-v2" }
}
Step 2 — Dry-run state migration for one service:
bash
cd environments/prod/service-checkout
# List current resource addresses
terraform state list | grep aws_vpc
# Back up the remote state, then preview the move with -dry-run before changing anything
terraform state pull > terraform.tfstate.backup-$(date +%Y%m%d)
terraform state mv -dry-run 'aws_vpc.main' 'module.networking.aws_vpc.main'
# Move resource to module address
terraform state mv \
'aws_vpc.main' \
'module.networking.aws_vpc.main'
terraform state mv \
'aws_subnet.public[0]' \
'module.networking.aws_subnet.public[0]'
Step 3 — Validate no changes:
bash
terraform plan -out=migration.tfplan
# Must show: Plan: 0 to add, 0 to change, 0 to destroy
terraform show migration.tfplan | grep -E "add|change|destroy"
Step 4 — Batch migration script:
bash
SERVICES=("checkout" "inventory" "notifications")
for svc in "${SERVICES[@]}"; do
echo "Migrating $svc..."
cd environments/prod/$svc
terraform state mv 'aws_vpc.main' 'module.networking.aws_vpc.main'
terraform plan -detailed-exitcode && echo "OK: $svc" || { echo "FAIL: $svc — stopping"; exit 1; }
cd ../../..
done
Timeline: 2 engineers, 3 weeks. 5 services/day, with 24h bake time between batches.
Metrics to Watch:
- terraform plan exit code ≠ 0 after state mv → stop migration immediately
- AWS CloudTrail events for resource modification during migration window → alert
- State file size growth > 20% unexpectedly → investigate
Rollback: Restore .tfstate.backup file to S3. No infra change = no impact. RTO: 5 min.
Interview Score: 8/10
Why this score:
- Correct: state mv + plan verify is the exact right pattern
- Deduct if candidate suggests terraform destroy + recreate
- Deduct for attempting all 40 services in one batch
- Bonus for backup-before-move discipline and batch script with fail-fast
Question 8: Multi-Region Active-Active Failover Design [networking, reliability, dr]
Difficulty: High
Role: Staff DevOps
Level: Staff
Company Examples: Netflix, Cloudflare, Amazon, Uber
Question:
Your company runs a SaaS product used globally. The board mandates 99.99% availability (52 min downtime/year). Currently you run active-passive across us-east-1 and us-west-2 with a 15-minute RTO. You've been tasked with designing active-active multi-region with RTO < 30 seconds. Budget: $80K/month additional cloud spend. What do you design?
What is This Question Testing?
- Active-active architecture patterns and tradeoffs
- Data replication and consistency tradeoffs (CAP theorem)
- Global load balancing (Route53, Cloudflare, GFE)
- Realistic cost and complexity awareness
Framework to Answer This Question
Use the RACE Design Pattern: Replication → Availability layer → Consistency tradeoffs → Edge routing.
- Assess which tiers can be active-active vs. active-passive
- Design global routing and health checks
- Choose data replication strategy per data type
- Implement circuit breakers and regional fallback
- Validate with chaos engineering
Key Principles:
- Not every tier needs active-active — be pragmatic
- Writes to two regions simultaneously require conflict resolution strategy
- DNS TTL < 30s is required to meet 30s RTO
- Test failover monthly or it's not real
The Answer:
Assumptions: AWS, PostgreSQL on Aurora Global, stateless app tier, Redis for sessions, Route53.
Step 1 — Tier analysis:
- App tier: active-active (stateless, trivial)
- Cache (Redis): active-active via Elasticache Global Datastore
- Database: Aurora Global with us-east-1 primary, us-west-2 read replica (< 1s replication lag)
- Database writes: region-affinity routing (primary handles writes, replicas handle reads)
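For the database tier above, a minimal Terraform sketch of the Aurora Global topology; the provider aliases, identifiers, and engine version are assumptions:
hcl
# Sketch: one global cluster, a writable primary in us-east-1, a readable secondary in us-west-2
resource "aws_rds_global_cluster" "this" {
  global_cluster_identifier = "my-global-cluster"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4" # placeholder
  database_name             = "app"
}

resource "aws_rds_cluster" "primary" {
  provider                  = aws.us_east_1   # assumed provider alias
  cluster_identifier        = "app-us-east-1"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier = aws_rds_global_cluster.this.id
  master_username           = "admin"
  master_password           = var.db_password # assumed variable
}

resource "aws_rds_cluster" "secondary" {
  provider                  = aws.us_west_2   # assumed provider alias
  cluster_identifier        = "app-us-west-2"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  global_cluster_identifier = aws_rds_global_cluster.this.id
  depends_on                = [aws_rds_cluster.primary]
}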
Step 2 — Global routing:
hcl
# Route53 health check + latency routing
resource "aws_route53_health_check" "us_east_1" {
fqdn = "api-us-east-1.internal.example.com"
port = 443
type = "HTTPS"
request_interval = 10
failure_threshold = 2 # 20s to detect failure
}
resource "aws_route53_record" "api" {
zone_id = var.zone_id
name = "api.example.com"
type = "A"
set_identifier = "us-east-1"
latency_routing_policy { region = "us-east-1" }
health_check_id = aws_route53_health_check.us_east_1.id
alias { ... }
}
Step 3 — Aurora Global failover:
bash
# Promote us-west-2 replica to primary (< 1 min)
aws rds failover-global-cluster \
--global-cluster-identifier my-global-cluster \
--target-db-cluster-identifier arn:aws:rds:us-west-2:...
Step 4 — Application-level conflict resolution: Use event sourcing + timestamp-based last-write-wins for non-financial data. For financial transactions: route all writes to a single region (no active-active for payments — explain why).
Step 5 — Chaos validation: Monthly GameDay — inject region failure using AWS FIS, measure actual RTO.
Metrics to Watch:
- Aurora replication lag > 1s → alert (failover will have stale reads)
- Route53 health check failures > 2 → automatic DNS failover begins
- Cross-region latency > 150ms p99 → investigate routing
Cost estimate: ~$65K/month additional (Aurora Global: $30K, cross-region traffic: $20K, second app tier: $15K) — within budget.
Rollback: Fail back to us-east-1 after root cause resolved. RPO: < 1s (Aurora replication lag). RTO: < 30s.
Interview Score: 9/10
Why this score:
- Excellent: acknowledges payments can't be active-active naively
- CAP theorem awareness and conflict resolution strategy
- Concrete Route53 + Aurora commands
- Deduct if candidate claims "full active-active" without addressing write conflicts
- Bonus for GameDay validation and cost estimate
Question 9: Secrets Sprawl — Migrating Hardcoded Credentials to Vault [security, iac, ci-cd]
Difficulty: High
Role: DevOps Engineer
Level: Senior
Company Examples: HashiCorp, Twilio, Okta, GitHub
Question:
A security audit reveals 23 microservices have database passwords and API keys hardcoded in environment variables baked into Docker images and Kubernetes ConfigMaps. Three secrets have already appeared in git history. You have 6 weeks to remediate before the SOC 2 Type II audit. How do you execute this migration without breaking production?
What is This Question Testing?
- Secrets management architecture (Vault, AWS Secrets Manager, External Secrets Operator)
- Git history remediation and secret rotation
- Zero-downtime migration strategy for secrets
- Security posture improvement under deadline pressure
Framework to Answer This Question
Use the Rotate → Centralize → Inject → Audit pattern.
- Immediately rotate all exposed secrets
- Deploy centralized secrets backend (Vault or AWS Secrets Manager)
- Migrate services one-by-one using External Secrets Operator
- Purge secrets from git history and ConfigMaps
- Enforce via admission controller (no plaintext secrets in manifests)
Key Principles:
- Rotate first, then migrate — don't migrate stale compromised secrets
- ESO (External Secrets Operator) is least-invasive for K8s migration
- Git history rewrite requires coordination (all devs must re-clone)
- Audit trail of secret access is a SOC 2 requirement
The Answer:
Assumptions: EKS, GitHub, AWS Secrets Manager (simpler than Vault for AWS-native), 23 services, 6-week deadline.
Step 1 — Emergency rotation (Day 1):
bash
# For each exposed secret
aws secretsmanager rotate-secret --secret-id prod/postgres/password
# For API keys: revoke in provider dashboard, generate new, store in Secrets Manager
aws secretsmanager create-secret \
--name prod/stripe/api-key \
--secret-string '{"key":"sk_live_newkey123"}'
Step 2 — Deploy External Secrets Operator:
bash
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace
Step 3 — Create ExternalSecret per service:
yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: postgres-creds
namespace: checkout
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: ClusterSecretStore
target:
name: postgres-creds
creationPolicy: Owner
data:
- secretKey: DB_PASSWORD
remoteRef:
key: prod/postgres/password
property: password
Step 4 — Remove secrets from ConfigMaps and Deployments; reference K8s Secret instead.
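For example, the container spec then consumes the ESO-managed Secret instead of a ConfigMap value (names match the ExternalSecret above):
yaml
# Sketch: Deployment env sourced from the Secret created by External Secrets Operator
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-creds   # created by the ExternalSecret above
        key: DB_PASSWORD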
Step 5 — Git history remediation:
bash
# Use git-filter-repo (preferred over BFG)
pip install git-filter-repo
git filter-repo --path-glob '*.env' --invert-paths
# Force push + require all devs to re-clone
git push --force --all
Step 6 — Enforce with OPA/Gatekeeper: Deny any ConfigMap or Deployment containing strings matching secret patterns.
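A minimal sketch of such a Gatekeeper policy; the template name and regex are illustrative, not an exhaustive secret pattern. A matching Constraint object would then bind it to ConfigMaps in the relevant namespaces.
yaml
# Sketch: reject ConfigMaps whose values look like credentials
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyplaintextsecrets
spec:
  crd:
    spec:
      names:
        kind: K8sDenyPlaintextSecrets
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenyplaintextsecrets

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          some key
          value := input.review.object.data[key]
          regex.match(`(?i)(api[_-]?key|password|secret|sk_live_)`, value)
          msg := sprintf("ConfigMap %v: key %v appears to contain a credential", [input.review.object.metadata.name, key])
        }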
Timeline: Week 1: rotate + deploy ESO. Weeks 2–5: migrate 5–6 services/week. Week 6: audit evidence collection.
Metrics to Watch:
- Secrets Manager API errors > 5/min → alert (ESO injection failing)
- Any K8s Secret with type: Opaque containing base64 plaintext → OPA policy violation alert
- Secret rotation failures → immediate page
Rollback per service: Revert Deployment to use old env var (secret already rotated, so use new value in both places during transition). RTO: 5 min/service.
Interview Score: 8/10
Why this score:
- Correct priority: rotate before migrate
- ESO is the right low-disruption tool
- Deduct for missing git history remediation step
- Bonus for OPA enforcement (prevents regression)
- Watch for: candidates who skip rotation and just move the compromised secret
Question 10: Cost Optimization — 40% Cloud Bill Reduction [cost, kubernetes, iac]
Difficulty: High
Role: DevOps Engineer / SRE
Level: Senior
Company Examples: Stripe, Figma, Notion, GitLab
Question:
Your AWS bill hit $420K last month, up 35% from six months ago despite flat traffic growth. The CTO wants a 40% cost reduction ($168K/month) within 90 days without degrading reliability. You have access to Cost Explorer, CloudWatch, and the Kubernetes cluster. Where do you start and what levers do you pull?
What is This Question Testing?
- Cloud cost analysis methodology
- Right-sizing, reserved instances, and Spot usage
- Kubernetes resource efficiency (requests, limits, bin-packing)
- Balancing cost reduction with reliability risk
Framework to Answer This Question
Use the Analyze → Quick Wins → Structural Changes → Automate cost optimization cycle.
- Categorize spend by service (EC2, RDS, data transfer, S3)
- Attack idle/oversized resources first (quick wins, low risk)
- Reserved Instances / Savings Plans for predictable baseline
- Kubernetes bin-packing and Spot for batch workloads
- Automate with scheduled scaling and anomaly alerts
Key Principles:
- Data transfer costs are often invisible — check first
- Reserved Instances for 70% baseline load; Spot for 30% burst
- Over-provisioned RDS instances are common and safe to fix
- Never right-size production databases without load testing first
The Answer:
Assumptions: AWS, EKS cluster, Aurora RDS, S3 heavy usage, cross-AZ data transfer costs high.
Step 1 — Cost breakdown via AWS Cost Explorer:
bash
aws ce get-cost-and-usage \
--time-period Start=2025-09-01,End=2025-10-01 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
Typical finding: EC2 40%, RDS 25%, Data Transfer 20%, S3 10%, other 5%.
Step 2 — Quick wins (Week 1–2):
bash
# Find unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available \
--query 'Volumes[*].[VolumeId,Size,CreateTime]' --output table
# Find idle load balancers (0 requests/day)
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB --metric-name RequestCount \
--start-time 2025-09-01T00:00:00 --end-time 2025-10-01T00:00:00 \
--period 2592000 --statistics Sum
Step 3 — Kubernetes resource right-sizing:
bash
# Install Goldilocks (VPA-based recommender)
helm install goldilocks fairwinds/goldilocks --namespace goldilocks
# Check recommendations
kubectl -n goldilocks get vpa --all-namespaces
```
Target: reduce average CPU request padding by 40% (most teams over-request 2–3x).
Step 4 — Reserved Instances (30-day analysis, then commit):
- 1-year Compute Savings Plan for 70% of baseline EC2 → ~30% discount
- Spot instances for CI runners and batch jobs → 70% discount
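Before committing, pull AWS's own recommendation data; a sketch assuming the Cost Explorer CLI (parameter values may need adjusting for the account):
bash
# Sketch: 1-year, no-upfront Compute Savings Plan recommendation based on the last 30 days
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS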
Step 5 — Cross-AZ transfer reduction:
- Enable `topology.kubernetes.io/zone` node affinity for pod scheduling
- Use S3 Transfer Acceleration only where needed; enable S3 Intelligent Tiering
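A sketch of the zone co-location idea from the first bullet above, expressed as a preferred pod affinity; the app label is illustrative:
yaml
# Sketch: prefer scheduling replicas in the same zone as a chatty upstream dependency
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: checkout-api   # illustrative dependency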
Projected savings:
- Idle resources cleanup: ~$25K/month
- K8s right-sizing: ~$40K/month
- Reserved Instances: ~$60K/month
- Spot for batch: ~$25K/month
- Data transfer optimization: ~$20K/month
- Total: ~$170K/month ✓
Metrics to Watch:
- Monthly spend growth > 5% MoM without traffic growth → alert
- CPU utilization < 20% sustained on any instance type → right-size candidate
- Data transfer costs > $30K/month → investigate AZ routing
Rollback: Reserved Instance commitments are 1-year — validate with 30 days of Savings Plans recommendations first (no-commitment option available).
Interview Score: 8/10
Why this score:
- Excellent: structured analysis, multiple levers, realistic projections
- Deduct if candidate goes straight to "use Spot for everything"
- Bonus for Goldilocks and cross-AZ transfer insight (often missed)
- Watch for: candidates who ignore data transfer costs (often 15–25% of bill)
Question 11: Designing a GitOps Deployment Platform [gitops, ci-cd, kubernetes]
Difficulty: Very High
Role: Staff DevOps
Level: Staff
Company Examples: Weaveworks, GitHub, Shopify, Spotify
Question:
Your organization has 80 engineers across 12 teams, deploying 15 microservices to three environments (dev, staging, prod). There's no standardized deployment process — teams use a mix of raw kubectl apply, Helm, and bash scripts. You've been asked to design and implement a GitOps platform using Argo CD within 8 weeks. Prod is currently stable and cannot be disrupted. How do you design and roll it out?
What is This Question Testing?
- GitOps principles and Argo CD architecture
- Multi-tenant platform design for multiple teams
- Progressive rollout without disrupting existing deployments
- Organizational change management alongside technical delivery
Framework to Answer This Question
Use the Platform Product Mindset: treat the internal platform as a product with customers (the dev teams).
- Define GitOps contracts (repo structure, sync policies)
- Deploy Argo CD with RBAC mapped to team boundaries
- Onboard one pilot team → iterate → scale
- Migrate existing workloads non-destructively
- Enforce via admission webhooks (no kubectl apply in prod)
Key Principles:
- Git is the single source of truth — drift means something bypassed a PR
- App-of-apps pattern for managing multiple teams at scale
- Sync windows prevent accidental prod deploys outside approved hours
- Measure adoption rate as a platform metric
The Answer:
Assumptions: EKS, GitHub, 3 clusters (dev/staging/prod), teams own their namespaces.
Step 1 — Repo structure:
gitops-platform/
├── apps/ # App-of-apps root
│ ├── dev/
│ ├── staging/
│ └── prod/
├── clusters/
│ ├── dev-cluster/
│ ├── staging-cluster/
│ └── prod-cluster/
└── platform/ # Argo CD itself, ingress, monitoring
Step 2 — Deploy Argo CD:
bash
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# For prod, install the Helm chart in HA mode instead of the raw manifest
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd -n argocd \
  --set controller.replicas=2 \
  --set server.replicas=2 \
  --set repoServer.replicas=2
Step 3 — RBAC per team:
yaml
# argocd-rbac-cm
policy.csv: |
p, team-checkout, applications, get, checkout/*, allow
p, team-checkout, applications, sync, checkout/*, allow
p, team-checkout, applications, override, checkout/*, deny
g, github-org:team-checkout, role:team-checkout
Step 4 — App-of-apps for prod:
yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: prod-apps
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/org/gitops-platform
targetRevision: main
path: apps/prod
destination:
server: https://prod-cluster.example.com
namespace: argocd
syncPolicy:
syncOptions:
- CreateNamespace=true
Step 5 — Prod sync window (no accidental weekend deploys):
yaml
# Configured on the AppProject that owns the prod Applications
syncWindows:
- kind: allow
schedule: '0 10 * * 1-5' # Mon-Fri 10am only
duration: 8h
applications:
- '*'
namespaces:
- production
Step 6 — Migration approach: Keep existing kubectl/Helm deployments running. Onboard each service by adding an Argo CD Application pointing to a new Helm chart. Validate drift detection catches any manual changes. Teams stop using kubectl directly once they trust the platform.
Timeline: Week 1–2: platform setup + pilot team. Week 3–5: 6 teams onboarded. Week 6–8: prod migration + enforcement.
Metrics to Watch:
- Argo CD sync failure rate > 5% → alert (broken manifests or connectivity)
- App out-of-sync duration > 15 min in prod → alert (drift or blocked sync)
- Teams still using direct kubectl in prod (audit via CloudTrail) > 0 → governance report
Rollback: Argo CD is additive — existing workloads are unaffected until explicitly onboarded. Each team can opt out by removing their Application CR. Platform rollback: helm rollback argocd. RTO: 10 min.
Code Appendix:
yaml
# ApplicationSet — stamps out one Application per directory under apps/prod/
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: team-apps
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/org/gitops-platform
        revision: main
        directories:
          - path: apps/prod/*
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/org/gitops-platform
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://prod-cluster.example.com
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
Interview Score: 9/10
Why this score:
- Excellent: app-of-apps, RBAC design, sync windows, non-disruptive migration
- Bonus for ApplicationSet (scales to N teams automatically)
- Deduct if candidate enables prune: true in prod without explaining the risk
- Watch for: over-engineering on day 1 vs. pragmatic pilot-first approach
Question 12: Prometheus Alert Fatigue — Redesigning Alerting Strategy [observability, sre, reliability]
Difficulty: Very High
Role: SRE
Level: Senior
Company Examples: Google, Datadog, PagerDuty, Cloudflare
Question:
Your on-call rotation receives 200+ alerts per week, 85% of which are noise. Engineers are ignoring PagerDuty pages. Three real incidents were missed in the last month. You need to redesign the alerting strategy from scratch. The team has Prometheus, Grafana, AlertManager, and PagerDuty. How do you fix this in 4 weeks?
What is This Question Testing?
- Alert fatigue diagnosis and systematic remediation
- SLO-based alerting vs. cause-based alerting
- AlertManager routing, grouping, and inhibition rules
- Organizational behavior change alongside technical fixes
Framework to Answer This Question
Use the Signal/Noise Reduction Funnel: Audit → Classify → Redesign → Measure.
- Audit all alerts: actionable vs. informational vs. noise
- Migrate critical service alerts to SLO burn rate model
- Configure AlertManager grouping, silences, and inhibition
- Establish alert review cadence (weekly for 4 weeks)
- Track MTTA and noise ratio as success metrics
Key Principles:
- Every page must require human action — or it's not a page
- SLO burn rate alerts are more actionable than threshold alerts
- Alert storms require inhibition rules, not more alerts
- Measure alert quality, not just quantity
The Answer:
Assumptions: ~300 active Prometheus alerts, Alertmanager routing to PagerDuty for everything, no inhibitions configured.
Step 1 — Audit (Week 1):
bash
# Export all current alert rules
kubectl exec -n monitoring prometheus-0 -- \
curl -s localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="alerting") | .name' | sort > all_alerts.txt
# Pull 30 days of firing history
curl -G 'http://alertmanager:9093/api/v1/alerts' \
--data-urlencode 'filter=severity="page"' | jq '.data | length'
Classify each alert: Actionable page / Ticket / Log and delete.
Step 2 — Migrate to SLO burn rate alerts (keeps only ~20 critical alerts):
yaml
# prometheus-slo-alerts.yaml
- alert: HighErrorBurnRate
expr: |
(
rate(http_requests_total{status=~"5.."}[1h]) /
rate(http_requests_total[1h])
) > (14 * 0.001)
for: 2m
labels:
severity: page
team: "{{ $labels.team }}"
annotations:
summary: "Error budget burning 14x fast — action required"
runbook: "https://runbooks.example.com/high-error-burn-rate"
Step 3 — AlertManager: grouping + inhibition:
yaml
# alertmanager.yml
route:
group_by: ['alertname', 'cluster', 'team']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'pagerduty-critical'
routes:
- match:
severity: warning
receiver: 'slack-warnings'
continue: false
inhibit_rules:
- source_match:
severity: 'page'
alertname: 'NodeDown'
target_match:
severity: 'page'
equal: ['node']
# Suppress pod alerts when their node is down
Step 4 — Routing by team (reduce cross-team noise):
yaml
routes:
- match:
team: checkout
receiver: pagerduty-checkout
group_by: ['alertname']
Step 5 — Weekly review cadence: Every Monday, review previous week's pages. Silence or delete any alert that fired > 5 times without action.
Metrics to Watch (success metrics):
- Pages/week: target < 30 (from 200) by week 4
- MTTA (mean time to acknowledge) < 5 min (currently 18 min due to fatigue)
- Missed incidents: 0 per month (current: 3)
- Alert actionability rate > 90% (surveys with on-call engineers)
Rollback: Old alert rules are in git — restore previous prometheus-rules.yaml. AlertManager config is version-controlled. RTO: 5 min.
Code Appendix:
yaml
# Multi-window burn rate — covers both fast and slow burns
- alert: SLOBurnRateFast
expr: |
(rate(errors[1h]) / rate(requests[1h])) > (14 * 0.005)
and
(rate(errors[5m]) / rate(requests[5m])) > (14 * 0.005)
for: 2m
labels: {severity: page}
- alert: SLOBurnRateSlow
expr: |
(rate(errors[6h]) / rate(requests[6h])) > (6 * 0.005)
and
(rate(errors[30m]) / rate(requests[30m])) > (6 * 0.005)
for: 15m
labels: {severity: ticket}
Interview Score: 9/10
Why this score:
- Excellent: audit-first, SLO burn rate model, inhibition rules, team-based routing
- Multi-window burn rate is Google SRE Workbook best practice
- Deduct if candidate only adjusts thresholds (treats symptoms not causes)
- Bonus for tracking MTTA and actionability rate as platform metrics
- Watch for: candidates who add more dashboards instead of fixing alerting
Question 13: Immutable Infrastructure — Blue/Green AMI Pipeline [iac, ci-cd, reliability]
Difficulty: Very High
Role: Staff DevOps
Level: Staff
Company Examples: Netflix, Amazon, Hashicorp, Cloudflare
Question:
Your team currently SSHs into production EC2 instances to apply patches and config changes, creating significant configuration drift. Security has flagged this as a compliance issue. You need to implement immutable infrastructure using Packer-built AMIs with a blue/green deployment pipeline within 6 weeks. The environment: 200 EC2 instances across 8 Auto Scaling Groups, running stateless services. How do you design and execute this?
What is This Question Testing?
- Immutable infrastructure philosophy and implementation
- Packer image pipeline design
- Blue/green ASG swap strategies
- Breaking cultural habits (no more SSH to prod)
Framework to Answer This Question
Use the Bake → Validate → Swap → Terminate immutable pipeline.
- Build hardened base AMI with Packer (OS + agents)
- Layer application AMI on top (app code + config)
- Launch new ASG with new AMI (blue)
- Shift traffic gradually; terminate old ASG (green)
- Block SSH via security group policy
Key Principles:
- Configuration belongs in the AMI, not applied post-launch
- Validate AMI with InSpec/Serverspec before promoting to prod
- Blue/green swap takes 10–15 min — acceptable for stateless services
- SSM Session Manager replaces SSH — no inbound 22 required
The Answer:
Assumptions: AWS, 200 stateless EC2 instances, Terraform for ASGs, Jenkins CI, stateless app.
Step 1 — Packer base AMI:
hcl
# packer/base-ami.pkr.hcl
source "amazon-ebs" "ubuntu" {
region = "us-east-1"
source_ami_filter {
filters = { name = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" }
owners = ["099720109477"]
most_recent = true
}
instance_type = "t3.medium"
ssh_username = "ubuntu"
ami_name = "base-{{timestamp}}"
}
build {
sources = ["source.amazon-ebs.ubuntu"]
provisioner "shell" {
scripts = ["scripts/harden.sh", "scripts/install-agents.sh"]
}
provisioner "inspec" {
profile = "profiles/cis-ubuntu"
}
post-processor "manifest" {
output = "manifest.json"
}
}
Step 2 — App AMI (inherits base):
bash
# In CI pipeline
packer build -var "base_ami=$(cat manifest.json | jq -r '.builds[-1].artifact_id')" app-ami.pkr.hcl
NEW_AMI=$(cat app-manifest.json | jq -r '.builds[-1].artifact_id' | cut -d: -f2)
Step 3 — Terraform blue/green ASG swap:
hcl
resource "aws_autoscaling_group" "blue" {
name = "checkout-blue-${var.ami_id}"
launch_template {
id = aws_launch_template.checkout.id
version = "$Latest"
}
target_group_arns = [aws_lb_target_group.main.arn]
min_size = var.min_capacity
max_size = var.max_capacity
lifecycle { create_before_destroy = true }
}
Step 4 — Traffic shift script:
bash
# Deregister old ASG from target group, register new
aws autoscaling attach-load-balancer-target-groups \
--auto-scaling-group-name checkout-blue-$NEW_AMI \
--target-group-arns $TG_ARN
# Wait for health checks
sleep 120
aws autoscaling detach-load-balancer-target-groups \
--auto-scaling-group-name checkout-green-$OLD_AMI \
--target-group-arns $TG_ARN
# Terminate old ASG after 30 min bake
aws autoscaling delete-auto-scaling-group \
--auto-scaling-group-name checkout-green-$OLD_AMI \
--force-delete
Step 5 — Block SSH permanently:
bash
# Remove port 22 from all prod security groups
aws ec2 revoke-security-group-ingress \
--group-id $SG_ID --protocol tcp --port 22 --cidr 0.0.0.0/0
# Enable SSM for break-glass access
aws ssm start-session --target i-0abc1234def56789
Timeline: 3 engineers, 6 weeks. Week 1: Packer pipeline. Week 2–3: one ASG piloted. Week 4–5: all 8 ASGs. Week 6: SSH revoked.
Metrics to Watch:
- New instance boot time > 5 min → investigate AMI bloat
- Blue ASG health check failures > 5% during swap → abort and rollback
- Any SSH connection attempt to prod instances (CloudTrail) → P2 alert
Rollback: Reattach old green ASG to target group in < 5 min. AMIs are immutable — no rollback needed at AMI level; just keep old ASG alive for 1 hour post-swap. RTO: 5 min. RPO: 0.
Code Appendix:
bash
#!/bin/bash
# swap-asg.sh — Blue/Green swap with health gate
NEW_ASG=$1; OLD_ASG=$2; TG_ARN=$3; THRESHOLD=95
aws autoscaling attach-load-balancer-target-groups \
--auto-scaling-group-name $NEW_ASG --target-group-arns $TG_ARN
echo "Waiting 90s for health checks..."
sleep 90
HEALTHY=$(aws elbv2 describe-target-health --target-group-arn $TG_ARN \
--query 'TargetHealthDescriptions[?TargetHealth.State==`healthy`] | length(@)')
TOTAL=$(aws elbv2 describe-target-health --target-group-arn $TG_ARN \
--query 'TargetHealthDescriptions | length(@)')
PCT=$(( HEALTHY * 100 / TOTAL ))
if [ $PCT -lt $THRESHOLD ]; then
echo "Health check failed ($PCT% healthy). Aborting swap."
exit 1
fi
aws autoscaling detach-load-balancer-target-groups \
--auto-scaling-group-name $OLD_ASG --target-group-arns $TG_ARN
echo "Swap complete. Old ASG $OLD_ASG detached."
Interview Score: 9/10
Why this score:
- Excellent: full pipeline from Packer → validate → swap → SSH revoke
- InSpec validation in AMI build pipeline is security best practice
- Health-gated swap script prevents bad deploys from completing
- Deduct if candidate skips AMI validation or leaves SSH open
- Bonus for SSM Session Manager as SSH replacement (no inbound port 22 needed)
Question 14: Disaster Recovery — RDS Corruption in Production [dr, database, incident-response]
Difficulty: Very High
Role: SRE
Level: Senior
Company Examples: Stripe, GitHub, Shopify, Cloudflare
Question:
At 11:22 AM on a Tuesday, a developer accidentally runs DROP TABLE orders; on the production PostgreSQL RDS database. The orders table has 18 months of transaction history. The business has an RPO of 1 hour. Automated backups are enabled with a 35-day retention. Point-in-time recovery is available. How do you recover, and what's your communication plan?
What is This Question Testing?
- RDS PITR (Point-in-Time Recovery) knowledge and limitations
- Incident communication and stakeholder management
- Data validation after restore
- Post-incident hardening (IAM, least privilege)
Framework to Answer This Question
Use the Contain → Restore → Validate → Harden DR playbook.
- Immediately prevent further writes to corrupt database
- Initiate PITR to a new RDS instance (target: 11:21 AM)
- Validate restored data against application checksums
- Promote restored instance (or export table and reimport)
- Revoke DDL permissions for developers; add IAM guardrails
Key Principles:
- PITR restores to a NEW instance — you cannot restore in-place
- Fastest path: restore table only from PITR export, not full instance promotion
- Communicate status every 15–30 minutes during P1
- The RPO requirement is 1 hour; PITR restores to roughly 1 minute before the incident, so the 18 months of history survive almost intact
The Answer:
Assumptions: Aurora PostgreSQL, PITR enabled, 11:22 AM incident, 11:21 AM target restore time, application can be put in read-only mode.
Step 1 — Contain (11:22 AM):
bash
# Put app in maintenance mode immediately (prevent further writes)
kubectl set env deployment/orders-service DB_READ_ONLY=true
# Or enable maintenance page via ingress annotation
kubectl annotate ingress orders-ingress nginx.ingress.kubernetes.io/server-snippet='return 503;'
Step 2 — Initiate PITR to new cluster (11:23 AM):
bash
aws rds restore-db-cluster-to-point-in-time \
--db-cluster-identifier prod-orders-restored \
--source-db-cluster-identifier prod-orders \
--restore-to-time 2025-10-14T11:21:00Z \
--vpc-security-group-ids $PROD_SG_ID \
--db-subnet-group-name prod-subnet-group
# Takes ~15-25 min for Aurora. Monitor:
watch -n 30 "aws rds describe-db-clusters \
--db-cluster-identifier prod-orders-restored \
--query 'DBClusters[0].Status'"
Step 3 — Export only the orders table (faster than full promotion):
bash
# Once PITR cluster is available, connect and export
psql -h restored-cluster.endpoint -U admin -d production \
-c "\COPY orders TO '/tmp/orders_recovered.csv' CSV HEADER"
# Compress and move to S3
aws s3 cp /tmp/orders_recovered.csv s3://dr-exports/orders-recovered-$(date +%Y%m%dT%H%M).csv
Step 4 — Reimport to production (11:48 AM estimate):
bash
# The orders table was dropped, so recreate its schema from the restored cluster first
pg_dump -h restored-cluster.endpoint -U admin -d production --schema-only --table=orders \
  | psql -h prod-cluster.endpoint -U admin -d production
psql -h prod-cluster.endpoint -U admin -d production \
  -c "\COPY orders FROM '/tmp/orders_recovered.csv' CSV HEADER"
# Validate row counts
psql -h restored-cluster.endpoint -U admin -d production -tAc "SELECT COUNT(*) FROM orders;"
psql -h prod-cluster.endpoint -U admin -d production -tAc "SELECT COUNT(*) FROM orders;"  # should match
Step 5 — Validate and restore service:
bash
# Application-level validation
curl -s https://internal-api/health/orders-db | jq '.row_count'
# Re-enable service
kubectl set env deployment/orders-service DB_READ_ONLY-
Step 6 — Communication:
- 11:22 AM: "P1 declared. Orders service in maintenance. ETR: 60 min."
- 11:40 AM: "PITR restore in progress. ETR: 30 min."
- 12:15 PM: "Data restored and validated. Service restored. RCA in 48h."
Step 7 — Post-incident hardening:
sql
-- PostgreSQL has no grantable DROP privilege: only the table owner (or a superuser) can drop a table,
-- so ensure the application role does not own the tables and holds DML rights only
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
-- Developers get a read-only role, never prod write access
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_readonly;
Metrics to Watch:
- PITR restore duration > 30 min → escalate to AWS Support
- Row count mismatch after restore > 0 → do not restore service, investigate
- Any DDL statements (DROP, TRUNCATE) from non-migration users → real-time alert via pgAudit
Timeline: Incident at 11:22 AM → service restored by 12:10–12:20 PM. RPO achieved: ~1 min of data loss. RTO: ~50 min.
Code Appendix:
bash
#!/bin/bash
# validate-restore.sh — compare row counts between restored and prod
PROD_HOST=$1; RESTORED_HOST=$2; DB=$3; TABLE=$4
PROD_COUNT=$(psql -h $PROD_HOST -U admin -d $DB -tAc "SELECT COUNT(*) FROM $TABLE")
RESTORED_COUNT=$(psql -h $RESTORED_HOST -U admin -d $DB -tAc "SELECT COUNT(*) FROM $TABLE")
echo "Production: $PROD_COUNT | Restored: $RESTORED_COUNT"
if [ "$PROD_COUNT" -ne "$RESTORED_COUNT" ]; then
echo "MISMATCH — do not cut over. Investigate immediately."
exit 1
fi
echo "Counts match. Safe to promote restored instance."
Interview Score: 9/10
Why this score:
- Excellent: maintenance mode first (stops the bleeding), PITR + table export (faster than full cluster promotion)
- Validation script with row count check is production-grade
- Communication timeline demonstrates IC experience
- Deduct if candidate promotes full cluster (30–60 min longer) without considering table-export shortcut
- Bonus for pgAudit real-time DDL alerting and post-incident least-privilege hardening
- Watch for: candidates who skip data validation before restoring service
Question 15: Service Mesh Adoption — Migrating to Istio [networking, kubernetes, security]
Difficulty: Very High
Role: Staff DevOps
Level: Staff
Company Examples: Google, Lyft, Airbnb, LinkedIn
Question:
Your organization has 30 microservices communicating over plain HTTP inside a Kubernetes cluster. Security requires mutual TLS (mTLS) between all services within 60 days. The platform team has chosen Istio. However, the previous attempt to install Istio 18 months ago caused a 45-minute outage. Engineering leadership is skeptical. How do you design a safe migration?
What is This Question Testing?
- Istio architecture and sidecar injection model
- Progressive, non-disruptive service mesh adoption
- mTLS migration modes (PERMISSIVE → STRICT)
- Risk management and stakeholder communication
Framework to Answer This Question
Use the Shadow → Permissive → Strict mTLS migration ladder.
- Install Istio in ambient or sidecar mode on non-prod first
- Enable sidecar injection namespace-by-namespace (not cluster-wide)
- Start in PERMISSIVE mode (accepts both HTTP and mTLS)
- Validate observability and no traffic disruption
- Migrate to STRICT mode namespace-by-namespace
Key Principles:
- PERMISSIVE mode is the safety net — use it for weeks, not hours
- Sidecar injection causes pod restarts — schedule during maintenance windows
- Ambient mesh (Istio 1.18+) avoids sidecar overhead entirely
- Measure traffic success rate before and after each namespace migration
The Answer:
Assumptions: EKS 1.28, Istio 1.20, 30 services across 8 namespaces, current 0 mTLS.
Step 1 — Install Istio with minimal footprint:
bash
istioctl install --set profile=default \
--set values.global.proxy.autoInject=disabled \
-y # Injection disabled cluster-wide — opt-in per namespace
istioctl verify-install
Step 2 — Enable sidecar injection on pilot namespace (non-critical):
bash
kubectl label namespace logging istio-injection=enabled
kubectl rollout restart deployment -n logging
# Monitor for 24h
kubectl get pods -n logging # all should show 2/2 READY (app + sidecar)
Step 3 — Set PERMISSIVE mTLS globally (accepts HTTP and mTLS simultaneously):
yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system # cluster-wide
spec:
mtls:
mode: PERMISSIVE
Step 4 — Gradually enable injection per namespace (one per week):
bash
NAMESPACES=("payments" "orders" "inventory" "notifications")
for ns in "${NAMESPACES[@]}"; do
echo "Enabling injection for $ns"
kubectl label namespace $ns istio-injection=enabled
kubectl rollout restart deployment -n $ns
# Gate: verify no 5xx increase
sleep 300
kubectl exec -n istio-system deploy/prometheus -- \
promtool query instant http://localhost:9090 \
'rate(istio_requests_total{response_code=~"5.."}[5m])'
done
Step 5 — Switch to STRICT mTLS per namespace (after all injected):
yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: strict-mtls
namespace: payments
spec:
mtls:
mode: STRICT
Step 6 — Validate mTLS is active:
bash
istioctl x check-inject -n payments
kubectl exec -n payments deploy/payment-api -c istio-proxy -- \
pilot-agent request GET stats | grep ssl.handshake
Timeline: 3 engineers, 8 weeks. Week 1: install + pilot namespace. Weeks 2–6: one namespace/week. Week 7: STRICT mode. Week 8: policy enforcement + runbook.
Metrics to Watch:
- HTTP 503 rate > 0.1% immediately after namespace injection → rollback injection for that namespace
- Envoy proxy CPU > 10% of total pod CPU → investigate filter chain complexity
- mTLS handshake failures > 0 in STRICT mode → service missing sidecar
Rollback per namespace:
bash
kubectl label namespace payments istio-injection-
kubectl rollout restart deployment -n payments
# Returns to plain HTTP — PERMISSIVE mode means nothing breaks
RTO: 5 min per namespace. Full cluster rollback: istioctl uninstall --purge (risky — prefer namespace-by-namespace).
Code Appendix:
yaml
# AuthorizationPolicy — allow only specific services after mTLS strict
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: payments-allow
namespace: payments
spec:
action: ALLOW
rules:
- from:
- source:
principals:
- "cluster.local/ns/orders/sa/orders-service"
- "cluster.local/ns/checkout/sa/checkout-service"
Interview Score: 9/10
Why this score:
- Excellent: PERMISSIVE-first strategy directly addresses the previous outage risk
- Namespace-by-namespace with metric gates is production-safe
- AuthorizationPolicy shows understanding of mTLS + authz as separate concerns
- Deduct for enabling cluster-wide injection on day 1
- Bonus for noting Istio ambient mode as a lower-overhead alternative