Backend Developer Interview Questions & Answers
Question 1: Rate Limiting & Throttling for Hot APIs at Scale
Difficulty: Very High
Role: Senior Backend Engineer
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: Stripe, AWS, Shopify, PayPal
Question: “You’re a Senior Backend Engineer on a payments or API platform team (Stripe-style). Design a multi-tenant, globally distributed rate limiting system that enforces per-customer, per-endpoint, and per-region limits with burst handling. How would you implement this using Redis or another fast store, handle race conditions in a distributed setting, and expose observability for SRE/on-call? Discuss trade-offs between fixed window, sliding window, and token bucket, and how your design changes when limits must be enforced at the API gateway vs inside downstream services.”
1. What is This Question Testing?
This question tests critical Senior Backend Engineer competencies:
- Distributed Systems Design: Can you build rate limiters that work correctly across multiple servers?
- Algorithm Knowledge: Do you understand token bucket, sliding window, and their trade-offs?
- Production Readiness: Can you handle race conditions, observability, and SRE requirements?
- Multi-Tenancy: Can you isolate limits per customer without centralized bottlenecks?
- Trade-off Analysis: Can you justify architectural choices (gateway vs service-level enforcement)?
The interviewer wants to see if you’re a Senior Backend Engineer who can design production-grade infrastructure, not just implement basic algorithms.
2. Framework to Answer This Question
Use the “Algorithm → Architecture → Scale Framework” with these components:
Structure:
1. Rate Limiting Algorithms - Token bucket, sliding window, fixed window comparison
2. System Architecture - Redis-based design for multi-tenant, global distribution
3. Race Condition Handling - Lua scripts, atomic operations, consistency trade-offs
4. Observability - Metrics, logging, SLOs for SRE teams
5. Gateway vs Service Enforcement - Trade-offs and hybrid approach
Key Principles:
- Start with algorithm choice justification
- Design for distributed correctness
- Prioritize observability and debuggability
- Discuss trade-offs explicitly
3. The Answer
Answer:
I’d design a Redis-based token bucket rate limiter with Lua scripts for atomic operations, deployed at both API gateway and service levels with different enforcement policies. Let me walk through the complete design.
First, choosing the rate limiting algorithm:
Token Bucket is the best choice for API platforms because it allows controlled bursts while maintaining average rate limits—critical for payment APIs where legitimate traffic spikes occur.
Algorithm Comparison:
Fixed Window:
Time window: 0-60s allows 100 requests
Problem: "Window reset attack"
- At 59s: Send 100 requests (allowed)
- At 61s (1s into the next window): Send 100 requests (allowed)
- Result: 200 requests in 2 seconds!

Sliding Window:
Counts requests in the past 60 seconds continuously
Pros: Smooth rate limiting, no reset attacks
Cons: Memory intensive (stores all timestamps), complex to distribute

Token Bucket (My Choice):
Bucket holds tokens (max = burst limit)
Tokens refill at fixed rate (e.g., 100/minute)
Request consumes 1 token
Pros: Allows controlled bursts, memory efficient, simple distributed implementation
Cons: Burst can temporarily exceed average (acceptable for APIs)

For Stripe-style APIs:
- Allow burst: 200 requests
- Refill rate: 100 tokens/minute
- Result: Client can burst 200 instantly, then throttled to 100/min
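As a sanity check on those numbers, here is a minimal single-process sketch of the same bucket (a hypothetical `TokenBucket` class driven by an explicit clock, not the Redis implementation described below):

```python
class TokenBucket:
    """Minimal sketch of the config above: capacity 200, refill 100 tokens/min."""

    def __init__(self, capacity=200, refill_per_sec=100 / 60):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # bucket starts full
        self.last_refill = 0.0

    def allow(self, now, cost=1):
        # Refill in proportion to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket()
# Burst of 250 requests at t=0: only the 200-token capacity passes
allowed_burst = sum(bucket.allow(now=0.0) for _ in range(250))
# One minute later, roughly 100 tokens have refilled
allowed_later = sum(bucket.allow(now=60.0) for _ in range(250))
print(allowed_burst)  # 200
```

The production design keeps this state in Redis per key; the refill arithmetic is the same.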
Second, Redis-based distributed architecture:
Why Redis:
- Sub-millisecond latency (<1ms for 99th percentile)
- Atomic operations via Lua scripts (no race conditions)
- Built-in expiration (TTL) for rate limit windows
- Scales to millions of keys (per-customer, per-endpoint)
Data Model:
Key: rate_limit:{customer_id}:{endpoint}:{region}
Value: {
"tokens": 95, // Current tokens
"last_refill": 1638360000 // Unix timestamp
}
TTL: 3600 seconds (auto-expire if inactive)

Third, handling race conditions with Lua scripts:
The Problem:
Two requests arrive simultaneously from different servers:
Server A: Read tokens (100) → Allow request → Write tokens (99)
Server B: Read tokens (100) → Allow request → Write tokens (99)
Result: Both allowed, but should be 98 tokens (race condition!)

The Solution - Atomic Lua Script:
-- token_bucket.lua (runs atomically in Redis)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])   -- 200
local refill_rate = tonumber(ARGV[2])  -- 100/min = 1.67/sec
local cost = tonumber(ARGV[3])         -- 1

local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or max_tokens
local last_refill = tonumber(state[2]) or 0

-- Calculate tokens to add based on time elapsed
local now = tonumber(redis.call('TIME')[1])
local elapsed = now - last_refill
local tokens_to_add = math.floor(elapsed * refill_rate)
tokens = math.min(max_tokens, tokens + tokens_to_add)

-- Check if request allowed
if tokens >= cost then
  tokens = tokens - cost
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 3600)
  return {1, tokens}  -- Allowed, remaining tokens
else
  return {0, tokens}  -- Denied, remaining tokens
end

Why Lua solves race conditions:
- Entire script runs atomically (Redis is single-threaded)
- No other operation can execute between read and write
- Guarantees correctness in distributed environment
Fourth, multi-tenant isolation:
Per-Customer, Per-Endpoint, Per-Region Limits:
Customer A: /api/payments → 1000 req/min (global)
Customer B: /api/payments → 100 req/min (global)
Customer A: /api/payments → 300 req/min (us-east-1)

Key Design:
rate_limit:{customer_a}:{payments}:global
rate_limit:{customer_a}:{payments}:us-east-1
rate_limit:{customer_b}:{payments}:global

Hierarchical Enforcement:
1. Check region limit first (fastest, local Redis cluster)
2. If allowed, check global limit (cross-region Redis with replication lag tolerance)
3. Accept eventual consistency for global (99.9% accurate, trade-off for speed)
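A hedged sketch of that hierarchy, with trivial in-memory quotas standing in for the regional and global Redis clusters (all class names hypothetical):

```python
class FixedQuota:
    """Trivial counter standing in for a real rate limiter backend."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def allow(self, cost=1):
        if self.used + cost > self.limit:
            return False
        self.used += cost
        return True

class HierarchicalLimiter:
    """Check the cheap regional limit first, then the cross-region global one.
    (A production version would refund the regional token if the global check fails.)"""

    def __init__(self, regional, global_quota):
        self.regional = regional          # local cluster, sub-millisecond
        self.global_quota = global_quota  # cross-region, eventually consistent

    def allow(self, cost=1):
        if not self.regional.allow(cost):
            return False
        return self.global_quota.allow(cost)

limiter = HierarchicalLimiter(FixedQuota(300), FixedQuota(1000))
granted = sum(limiter.allow() for _ in range(500))
print(granted)  # the 300 req/min regional cap binds first
```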
Fifth, observability for SRE:
Metrics (Prometheus/Datadog):
rate_limit_requests_total{customer, endpoint, result="allowed|denied"}
rate_limit_latency_seconds{quantile="0.5|0.99"}
rate_limit_redis_errors_total
rate_limit_tokens_remaining{customer, endpoint}

Logging:
{
  "event": "rate_limit_exceeded",
  "customer_id": "cust_abc123",
  "endpoint": "/api/payments",
  "region": "us-east-1",
  "tokens_remaining": 0,
  "retry_after_seconds": 15
}

Alerting SLOs:
- Redis latency p99 < 5ms (alert if > 10ms for 5 min)
- Rate limit error rate < 0.01% (alert if > 0.1%)
- Redis availability > 99.99%
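As one possible encoding, the first SLO above could become a Prometheus alerting rule like the following (assuming `rate_limit_latency_seconds` is exported as a histogram; the rule and label names are illustrative):

```yaml
groups:
  - name: rate-limiter-slo
    rules:
      - alert: RateLimitRedisLatencyHigh
        # p99 over the last 5 minutes, computed from histogram buckets
        expr: |
          histogram_quantile(0.99,
            sum(rate(rate_limit_latency_seconds_bucket[5m])) by (le)) > 0.010
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Rate limiter Redis p99 latency above 10ms for 5 minutes"
```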
Sixth, gateway vs service-level enforcement:
API Gateway Enforcement (Tier 1):
- Pros: Protects all downstream services, fail-fast, low latency cost
- Cons: Coarse-grained (can’t differentiate between expensive vs cheap endpoints)
- Use case: DDoS protection, customer-level quotas
Service-Level Enforcement (Tier 2):
- Pros: Fine-grained control (expensive DB queries get tighter limits)
- Cons: Adds latency to every service call, bypassed if gateway compromised
- Use case: Resource-specific limits (e.g., report generation endpoints)
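Fast rejection at either tier usually pairs with an HTTP 429 plus a `Retry-After` hint derived from the bucket's refill rate. A small illustrative helper (function and header conventions assumed, not part of the original design):

```python
import math

def throttle_response(tokens, refill_per_sec, cost=1):
    """Build a 429 response when the bucket lacks `cost` tokens.
    (Hypothetical helper; headers follow common API conventions.)"""
    deficit = cost - tokens
    retry_after = math.ceil(deficit / refill_per_sec)  # seconds until enough tokens refill
    return {
        "status": 429,
        "headers": {
            "Retry-After": str(retry_after),
            "X-RateLimit-Remaining": str(max(0, math.floor(tokens))),
        },
        "body": {"error": "rate_limit_exceeded", "retry_after_seconds": retry_after},
    }

# Empty bucket refilling at 100 tokens/min: retry in 1 second
resp = throttle_response(tokens=0, refill_per_sec=100 / 60)
print(resp["headers"]["Retry-After"])  # 1
```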
My Hybrid Approach:
API Gateway:
- Enforce customer-level global limits (broad DDoS protection)
- Fast rejection (reject at edge, save compute downstream)
Payment Service:
- Enforce endpoint-specific limits (e.g., /refunds limited tighter than /charges)
- Context-aware limits (e.g., higher limits for verified merchants)

Handling Distributed Race Conditions Trade-offs:
Strict Consistency (Not Recommended):
- Single Redis instance globally → bottleneck, high latency
- Distributed lock (e.g., Redlock) → adds 10-50ms latency, complex
Eventual Consistency (Recommended):
- Regional Redis clusters with async replication
- Accept 0.1-1% over-limit during replication lag
- Trade-off justified: occasionally allowing 100.5 requests/min instead of 100.0 is acceptable; adding 50ms of latency instead of 1ms is not
Handling Burst Traffic:
Normal: 100 req/min average
Burst allowed: 200 req/min for 10 seconds
Token bucket config:
- Capacity: 200 tokens
- Refill: 100 tokens/min (1.67/sec)
Real scenario:
00:00 - Idle (200 tokens available)
00:01 - Burst: 200 requests in 1 second (0 tokens)
00:02-00:12 - Refill 1.67 tokens/sec = 16.7 tokens (allow ~17 requests)
00:13+ - Back to steady state

Interview Score: 9/10
Why: Algorithm choice with clear justification (token bucket), atomic Lua script solving race conditions, multi-tenant isolation design, observability/SLO focus, hybrid gateway+service enforcement with trade-off discussion, and production-ready Redis architecture.
Question 2: Idempotency, Webhooks, and Double-Payment Avoidance
Difficulty: Very High
Role: Senior Backend Engineer (Payments)
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: Stripe, PayPal, Razorpay, Airbnb
Question: “You’re designing the payments subsystem for a marketplace like Airbnb or a PSP like Stripe/PayPal. Users frequently see ‘money debited but order pending’ due to client/network failures. Design an idempotent payment API plus webhook-based confirmation flow that guarantees no double charge and consistent order state, even with retries, delayed/missing webhooks, and out-of-order events. How do you choose and store idempotency keys, structure database transactions, and recover from partial failures?”
1. What is This Question Testing?
This question tests critical payment systems competencies:
- Idempotency Design: Can you prevent duplicate charges despite retries?
- Distributed Transaction Handling: Can you maintain consistency across payment gateway and your DB?
- Webhook Reliability: Can you handle delayed, missing, or out-of-order webhook events?
- Failure Recovery: Can you reconcile “money debited but order pending” states?
- Database Transaction Design: Do you understand ACID properties for payment workflows?
The interviewer wants to see if you can build production-grade payment systems that handle real-world failures gracefully.
2. Framework to Answer This Question
Use the “Idempotency → Webhooks → Reconciliation Framework”:
Structure:
1. Idempotency Key Design - How to generate, validate, and store keys
2. Payment API Flow - ACID transaction structure with idempotency
3. Webhook Handling - Delayed, duplicate, out-of-order event processing
4. Reconciliation - Periodic jobs to fix “pending” states
5. Failure Scenarios - Network failures, timeout handling, retry logic
3. The Answer
Answer:
I’d design a client-generated idempotency key system with database-level deduplication, webhook-driven state machine, and background reconciliation jobs. This prevents double charges even with retries and network failures.
First, idempotency key design:
Why Client-Generated Keys:
Client (mobile app) crashes mid-request.
Server-generated ID → client has no way to retry safely (might double-charge)
Client-generated ID → client can safely retry with same ID

Key Format:
Idempotency-Key: {user_id}_{timestamp}_{random}
Example: user_123_1638360000_a7b3f2
Constraints:
- Maximum 255 characters
- Valid for 24 hours (prevent key reuse attacks)
- Stored with payment attempt

Database Schema:
CREATE TABLE payments (
id UUID PRIMARY KEY,
idempotency_key VARCHAR(255) UNIQUE NOT NULL,
user_id BIGINT NOT NULL,
amount DECIMAL(10,2) NOT NULL,
status ENUM('pending', 'processing', 'succeeded', 'failed') NOT NULL,
payment_gateway_id VARCHAR(255), -- Stripe payment_intent_id
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
INDEX idx_idempotency (idempotency_key, created_at)
);
CREATE TABLE payment_events (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
payment_id UUID NOT NULL,
event_type VARCHAR(50) NOT NULL, -- webhook.received, payment.succeeded
event_id VARCHAR(255) UNIQUE, -- Stripe event_id (for dedup)
payload JSON,
processed_at TIMESTAMP DEFAULT NOW(),
INDEX idx_payment_events (payment_id, processed_at)
);

Second, idempotent payment API flow:
POST /api/payments (with idempotency)
@app.post("/api/payments")
def create_payment(request):
    idempotency_key = request.headers["Idempotency-Key"]
    amount = request.json["amount"]
    user_id = request.user_id

    # Step 1: Check if payment with this key already exists
    with db.transaction():
        existing = Payment.get_by_idempotency_key(idempotency_key)
        if existing:
            # Idempotent response: return existing payment status
            if existing.created_at < now() - timedelta(hours=24):
                return error("Idempotency key expired", 400)
            return {
                "payment_id": existing.id,
                "status": existing.status,
                "amount": existing.amount
            }, 200  # Same response, safe to retry

        # Step 2: Create payment record in 'pending' state
        payment = Payment.create(
            idempotency_key=idempotency_key,
            user_id=user_id,
            amount=amount,
            status="pending"
        )
        db.commit()

    # Step 3: Call payment gateway (outside transaction)
    try:
        # Stripe API call (idempotent via idempotency_key)
        stripe_payment = stripe.PaymentIntent.create(
            amount=int(amount * 100),  # cents
            currency="usd",
            metadata={"payment_id": str(payment.id)},
            idempotency_key=idempotency_key  # Stripe's own idempotency
        )

        # Step 4: Update payment with gateway ID
        with db.transaction():
            payment.payment_gateway_id = stripe_payment.id
            payment.status = "processing"
            payment.save()
            db.commit()

        return {
            "payment_id": payment.id,
            "status": "processing",
            "client_secret": stripe_payment.client_secret
        }, 201

    except stripe.error.CardError as e:
        # Card declined
        with db.transaction():
            payment.status = "failed"
            payment.save()
            db.commit()
        return {"error": "Card declined"}, 400

    except Timeout as e:
        # Network timeout - payment may or may not have gone through!
        # Leave status as 'pending' for reconciliation job
        log.error(f"Timeout calling Stripe for payment {payment.id}")
        return {
            "payment_id": payment.id,
            "status": "pending",  # Client should check status later
            "message": "Payment processing, check status in 30s"
        }, 202  # Accepted, processing

Why This Works:
Scenario: Client retries due to timeout
Request 1: Idempotency-Key: abc123
- Creates payment in DB (status=pending)
- Calls Stripe → timeout
- Returns 202 "processing"
Request 2 (retry): Idempotency-Key: abc123
- DB lookup finds existing payment
- Returns same response: 202 "processing"
- No duplicate Stripe call!

Third, webhook handling for confirmation:
Why Webhooks:
- Synchronous API call may timeout before Stripe confirms charge
- Webhook is asynchronous confirmation from Stripe (payment succeeded/failed)
- Must handle: delays, duplicates, out-of-order delivery
POST /webhook/stripe (handles payment events)
@app.post("/webhook/stripe")
def stripe_webhook(request):
    payload = request.body
    sig_header = request.headers["Stripe-Signature"]

    # Step 1: Verify webhook signature (prevent spoofing)
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, webhook_secret
        )
    except ValueError:
        return error("Invalid payload", 400)
    except stripe.error.SignatureVerificationError:
        return error("Invalid signature", 400)

    event_id = event["id"]
    event_type = event["type"]
    payment_intent = event["data"]["object"]

    # Step 2: Deduplicate webhook (Stripe may send same event multiple times)
    with db.transaction():
        if PaymentEvent.exists(event_id=event_id):
            log.info(f"Duplicate webhook {event_id}, ignoring")
            return {"status": "ok"}, 200  # Idempotent!

        # Record webhook receipt
        PaymentEvent.create(
            payment_id=payment_intent.metadata["payment_id"],
            event_type=event_type,
            event_id=event_id,
            payload=payment_intent
        )
        db.commit()

    # Step 3: Update payment status based on event type
    with db.transaction():
        payment = Payment.get_by_gateway_id(payment_intent.id)
        if event_type == "payment_intent.succeeded":
            payment.status = "succeeded"
            create_order(payment)  # Create order
        elif event_type == "payment_intent.payment_failed":
            payment.status = "failed"
            notify_user_failure(payment)
        payment.updated_at = now()
        payment.save()
        db.commit()

    return {"status": "ok"}, 200

Handling Out-of-Order Webhooks:
Scenario: Webhook arrives BEFORE API response returns
Timeline:
00:00 - Client sends POST /api/payments
00:01 - Server calls Stripe API (success)
00:02 - Stripe sends webhook (payment.succeeded)
00:02 - Webhook handler updates DB: status=succeeded
00:03 - Stripe API returns to server (slow network)
00:03 - API handler tries to update status=processing (stale!)
Solution: Use updated_at timestamp + optimistic locking (reject the stale write if updated_at has already advanced)

Fourth, reconciliation for “money debited, order pending”:
Problem:
- Stripe debited money (payment succeeded)
- Webhook delayed/lost due to network issue
- User sees “payment pending” forever
Reconciliation Job (runs every 15 min):
def reconcile_pending_payments():
    # Find payments stuck in 'pending' or 'processing' > 10 minutes
    stuck_payments = Payment.filter(
        status__in=["pending", "processing"],
        created_at__lt=now() - timedelta(minutes=10)
    )

    for payment in stuck_payments:
        if not payment.payment_gateway_id:
            # Never reached Stripe, safe to mark failed
            payment.status = "failed"
            payment.save()
            continue

        # Fetch real status from Stripe
        try:
            stripe_payment = stripe.PaymentIntent.retrieve(
                payment.payment_gateway_id
            )
            if stripe_payment.status == "succeeded":
                # Money was debited! Update DB
                payment.status = "succeeded"
                create_order(payment)  # Fix missing order
                notify_user_success(payment)
            elif stripe_payment.status == "canceled":
                payment.status = "failed"
            payment.updated_at = now()
            payment.save()
        except stripe.error.InvalidRequestError:
            # Payment doesn't exist in Stripe
            payment.status = "failed"
            payment.save()

Fifth, handling critical failure scenarios:
Scenario 1: Database commit fails after Stripe charge
Issue: Money debited, but DB shows 'pending' (commit failed)
Solution: Reconciliation job pulls Stripe status, fixes DB

Scenario 2: Client sends same idempotency key with different amount
Request 1: Idempotency-Key: abc, amount=$100
Request 2: Idempotency-Key: abc, amount=$200 (malicious/bug)
Solution: Validate amount matches existing payment, return error if different

Scenario 3: Webhook arrives before payment exists in DB
Race condition: Webhook arrives milliseconds before API handler commits
Solution: Webhook retries (DLQ), or query Stripe if payment_id not found

Dead Letter Queue (DLQ) for Failed Webhooks:
Webhook processing failures → SQS/Kafka DLQ
Retry policy: exponential backoff (1min, 5min, 15min, 1hr)
After 24 hours: Alert team, manual intervention

Interview Score: 9/10
Why: Client-generated idempotency keys with DB deduplication, ACID transaction handling, webhook deduplication and out-of-order handling, reconciliation job for “money debited” scenarios, and comprehensive failure scenario coverage including DLQ.
Question 3: N+1 Queries, ORM Pitfalls, and Production Scaling
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L4-L6, 4-7 Years of Experience)
Company Examples: All companies with high-traffic APIs
Question: “In a high-traffic service (e.g., an API returning user profiles with related entities), how would you detect, explain, and fix N+1 query problems in production? Describe concrete techniques for query bundling, preloading/joins, and instrumentation. How would you demonstrate to a skeptical manager that the existing N+1 pattern will create scalability and cost problems at 10× traffic?”
1. What is This Question Testing?
This question tests critical backend performance competencies:
- Query Optimization: Can you identify and fix N+1 queries that kill performance?
- ORM Understanding: Do you know when ORMs create N+1 problems and how to prevent them?
- Production Debugging: Can you detect N+1 in live systems without bringing them down?
- Cost Awareness: Can you quantify the business impact of query inefficiency?
- Profiling Tools: Do you know how to use query profilers, logs, and APM tools?
The interviewer wants to see if you understand database performance at scale, not just basic SQL.
2. Framework to Answer This Question
Use the “Detect → Explain → Fix → Prove Framework”:
Structure:
1. What is N+1 - Clear definition with example
2. Detection - Tools and techniques to find N+1 in production
3. Root Cause - Why ORMs create N+1, lazy loading pitfalls
4. Solutions - Eager loading, joins, caching strategies
5. Business Case - Demonstrate cost/scale impact to management
3. The Answer
Answer:
I’d use a combination of APM tools, query logging, and database statistics to detect N+1, then apply eager loading or explicit joins to fix it. Let me walk through detection, explanation, and remediation.
First, what is the N+1 query problem:
Example: User profiles API
# BAD: N+1 Query Pattern
@app.get("/api/users")
def get_users():
    users = User.query.all()  # 1 query: SELECT * FROM users
    result = []
    for user in users:  # N iterations
        # Each iteration triggers a separate query!
        posts = user.posts  # SELECT * FROM posts WHERE user_id = ?
        result.append({
            "user": user.name,
            "post_count": len(posts)
        })
    return result

# Result: 1 + N queries
# 100 users = 101 queries
# 10,000 users = 10,001 queries (disaster!)

Why this is a problem:
Each SQL query has overhead:
- Network round trip: ~1-5ms
- Query planning: ~0.5-2ms
- Execution: ~0.1-1ms
100 users = 101 queries × 3ms = 303ms (acceptable)
10,000 users = 10,001 queries × 3ms = 30 seconds (timeout!)
Plus: Database connection pool exhaustion, high CPU

Second, detecting N+1 in production:
Method 1: APM Tools (Datadog, New Relic, Sentry)
APM tools automatically flag N+1 patterns:
Datadog APM Alert:
"Endpoint: GET /api/users
Queries executed: 10,001
Total query time: 28.5s
Pattern: Repeated SELECT FROM posts WHERE user_id = ?
Recommendation: Use eager loading or join"

Method 2: Database Query Logging
Enable slow query log + analyze patterns:
-- PostgreSQL slow query log
SET log_min_duration_statement = 100; -- Log queries > 100ms

-- Output shows repeated queries:
SELECT * FROM posts WHERE user_id = 1;
SELECT * FROM posts WHERE user_id = 2;
SELECT * FROM posts WHERE user_id = 3;
... (10,000 times)
-- Red flag: Same query with different parameters repeated

Method 3: Database Statistics
-- PostgreSQL: Check query statistics
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
WHERE query LIKE '%posts WHERE user_id%'
ORDER BY calls DESC;

-- Output:
-- query: SELECT * FROM posts WHERE user_id = $1
-- calls: 10,000
-- total_time: 25,000ms
-- mean_time: 2.5ms
-- 10,000 calls of same pattern = N+1!

Method 4: Custom Instrumentation
Wrap ORM with query counter:
from functools import wraps
query_count = 0

def count_queries(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        global query_count
        query_count = 0
        # Hook into ORM query execution
        with db.query_counter():
            result = func(*args, **kwargs)
        queries = db.get_query_count()
        if queries > 50:
            log.warning(f"{func.__name__} executed {queries} queries (N+1 suspected!)")
        return result
    return wrapper

@count_queries
@app.get("/api/users")
def get_users():
    # ... code ...
    pass

Third, explaining the root cause (ORM lazy loading):
Why ORMs cause N+1:
Most ORMs (Django, SQLAlchemy, ActiveRecord) use lazy loading by default:
# Django ORM Example
user = User.objects.get(id=1)  # 1 query
posts = user.posts.all()       # Lazy: Query only when accessed!

# Seems innocent, but in a loop:
users = User.objects.all()     # 1 query
for user in users:
    print(user.posts.count())  # N queries! (lazy loading)

ORM thinks it’s helping:
- “Don’t fetch posts unless needed” (memory efficient)
- But in loops, “needed” happens N times!
Fourth, fixing N+1 with eager loading:
Solution 1: ORM Eager Loading
# Django: select_related (for foreign keys, one-to-one)
users = User.objects.select_related('profile').all()
# SQL: SELECT * FROM users JOIN profiles ON users.profile_id = profiles.id

# Django: prefetch_related (for many-to-many, reverse foreign keys)
users = User.objects.prefetch_related('posts').all()
# SQL:
# 1. SELECT * FROM users
# 2. SELECT * FROM posts WHERE user_id IN (1,2,3,...,N)
# Result: 2 queries instead of N+1!

@app.get("/api/users")
def get_users():
    users = User.objects.prefetch_related('posts').all()  # FIX!
    result = []
    for user in users:
        # No additional query! Posts already loaded
        posts = user.posts.all()
        result.append({
            "user": user.name,
            "post_count": len(posts)
        })
    return result

Solution 2: Explicit JOIN
# SQLAlchemy explicit join
from sqlalchemy.orm import joinedload

users = session.query(User).options(joinedload(User.posts)).all()
# SQL:
# SELECT users.*, posts.*
# FROM users LEFT JOIN posts ON users.id = posts.user_id

Solution 3: Raw SQL (for complex cases)
# When ORM is too slow, write raw SQL
query = """
    SELECT u.id, u.name, COUNT(p.id) AS post_count
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
    GROUP BY u.id, u.name
"""
result = db.execute(query).fetchall()
# Single query, optimal performance

Fifth, demonstrating cost/scale impact to management:
Current State (N+1 pattern):
Traffic: 1,000 requests/minute
Users per request: 100
Queries: 1,000 req/min × 101 queries = 101,000 queries/min
Database CPU: 60%
P99 latency: 500ms

At 10× Traffic:
Traffic: 10,000 requests/minute
Queries: 10,000 × 101 = 1,010,000 queries/min
Database CPU: 600% (impossible, will crash!)
P99 latency: >5 seconds (timeouts)
Cost impact:
- Need 6× more database instances ($500/month → $3,000/month)
- OR accept degraded user experience (users abandon app)

After Fix (eager loading):
Traffic: 10,000 requests/minute
Queries: 10,000 × 2 = 20,000 queries/min (50× reduction!)
Database CPU: 15%
P99 latency: 80ms
Cost: Same $500/month database handles 10× traffic

ROI Calculation for Manager:
Fix effort: 2 hours (add prefetch_related)
Database cost savings: $2,500/month at scale
Engineering time savings: 10 hours/month (no firefighting)
ROI: $2,500/month saved / 2 hours of work = $1,250/hour of engineering value

Sixth, proactive prevention:
Code Review Checklist:
# RED FLAGS in code review:
❌ for item in query.all(): item.related_field
❌ [item.related for item in items]
❌ Accessing relationships inside loops
❌ No select_related/prefetch_related on queries with joins
✅ query.prefetch_related('related_field').all()
✅ query.select_related('foreign_key').all()
✅ Explicit JOINs in raw SQL

Linting + CI Checks:
# Custom linter rule
def check_n_plus_one(code):
    if "for" in code and ".objects.all()" in code:
        if "prefetch_related" not in code:
            raise Warning("Potential N+1: use prefetch_related")

Performance Testing:
# In test suite
def test_user_api_query_count():
    with assert_num_queries(2):  # Expect exactly 2 queries
        response = client.get('/api/users?limit=100')
    # Fails if N+1 present (would be 101 queries)

Interview Score: 9/10
Why: Clear N+1 definition with code examples, multiple detection methods (APM, logs, DB stats, instrumentation), ORM lazy loading explanation, concrete fixes (prefetch_related, joins, raw SQL), business case with cost calculations, and proactive prevention strategies.
Question 4: ACID vs BASE and CAP/PACELC in Real Systems
Difficulty: Very High
Role: Staff Backend Engineer / Architect
Level: Senior/Staff (L5-L7, 5-10 Years of Experience)
Company Examples: Amazon, Netflix, Airbnb, Uber
Question: “Pick a multi-region system you’ve worked on or know (e.g., shopping cart, messaging, or booking). Walk through the concrete trade-offs between ACID and BASE you would make, and map your design to CAP and PACELC: which consistency and availability guarantees do you provide at read and write paths? How do you handle conflict resolution and eventual consistency at the UX and data layers?”
1. What is This Question Testing?
This question tests distributed systems architecture competencies:
- CAP Theorem Understanding: Can you explain Consistency, Availability, Partition Tolerance trade-offs?
- PACELC Extension: Do you know the latency vs consistency trade-off when no partition exists?
- Real-World Design: Can you apply theory to actual systems (shopping cart, bookings)?
- Conflict Resolution: Can you handle concurrent writes across regions?
- Business Trade-offs: Can you justify eventual consistency to product teams?
2. The Answer
Answer:
I’d design an Amazon-style shopping cart with eventual consistency that prioritizes availability over strong consistency, using BASE principles with last-write-wins and merge-based conflict resolution.
First, understanding CAP and PACELC:
CAP Theorem:
- C (Consistency): All nodes see the same data simultaneously
- A (Availability): Every request gets a response (even if stale)
- P (Partition Tolerance): System works despite network failures
You can only choose 2 of 3. In real systems, partition tolerance is mandatory (networks fail), so the remaining choice is: C or A?
PACELC Extension:
- If Partition: Choose Availability or Consistency?
- Else (no partition): Choose Latency or Consistency?
Shopping Cart Design Choice:
CAP: AP (Availability + Partition Tolerance, sacrifice Consistency)
PACELC: PA/EL (Partition→Availability, Else→Latency)
Why:
- Shopping cart reads/writes must be fast (<100ms)
- Temporary inconsistency is acceptable (seeing cart from 1 second ago is fine)
- Cart is not critical (unlike payments, which need ACID)
Design Architecture:
Write Path:
1. User adds item in US-EAST region
2. Write to local DynamoDB replica (5ms)
3. Return success immediately
4. Async replicate to EU, ASIA (100-500ms delay)
Read Path:
1. User reads cart in EU
2. Read from EU replica
3. May see stale data if US write hasn't replicated yet
4. Eventually consistent within 500ms

Conflict Resolution:
Scenario: User adds items in two regions simultaneously
Time 00:00:
- US-EAST: User adds "Laptop" to cart
- EU-WEST: User adds "Mouse" to cart (network split)
Time 00:01:
- Networks merge
- Conflict: Cart has {Laptop} in US, {Mouse} in EU
Resolution Strategy 1: Last-Write-Wins (LWW)
- Compare timestamps
- EU write (00:00:15) beats US write (00:00:10)
- Final cart: {Mouse} only
- Problem: Lost "Laptop"! Bad UX.
Resolution Strategy 2: Merge (Amazon's Choice)
- Union of both carts
- Final cart: {Laptop, Mouse}
- Better UX, no lost data

Implementation:
# DynamoDB with merge-based conflict resolution
class ShoppingCart:
    def add_item(self, user_id, item, region):
        # Write to local region
        cart = dynamodb.get_item(
            Key={'user_id': user_id},
            ConsistentRead=False  # Eventual consistency
        )

        # Merge new item
        items = cart.get('items', [])
        items.append({
            'item_id': item,
            'added_at': time.time(),
            'region': region
        })

        # Write back
        dynamodb.put_item(Item={
            'user_id': user_id,
            'items': items,
            'version': uuid.uuid4()  # For conflict detection
        })
        return {'status': 'success', 'latency_ms': 5}

    def get_cart(self, user_id):
        # Read from local replica (may be stale)
        cart = dynamodb.get_item(
            Key={'user_id': user_id},
            ConsistentRead=False  # Fast, eventual consistency
        )

        # Deduplicate items (in case of merge conflicts)
        items = cart.get('items', [])
        unique_items = list({item['item_id']: item for item in items}.values())
        return unique_items

Contrast: Banking (ACID required)
-- Bank transfer REQUIRES ACID
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- Both updates or neither (atomic)

-- If partition occurs, system blocks writes (CA choice)
-- Consistency > Availability for money

UX Handling for Eventual Consistency:
Shopping Cart UI:
- Show "Syncing..." indicator if write pending
- Optimistic UI update (show item immediately, sync in background)
- If conflict detected, show "We added Mouse from another device"
Booking System (stronger consistency needed):
- Use distributed locks for inventory
- "Only 1 room left" → strong consistency required
- Accept higher latency (50-100ms) for correctness

Interview Score: 9/10
Why: Clear CAP/PACELC explanation, concrete shopping cart design with PA/EL choice justified, merge-based conflict resolution, code example, and contrast with ACID banking scenario.
Question 5: Sharding, Partitioning, and Hot-Shard Mitigation
Difficulty: Very High
Role: Senior/Staff Backend Engineer
Level: Senior/Staff (L5-L7, 4-10 Years of Experience)
Company Examples: Airbnb, Uber, LinkedIn, Pinterest
Question: “You own the database for a listing/booking service like Airbnb. At 10× growth, a single Postgres cluster is hitting CPU, I/O, and lock contention limits. Propose a sharding and partitioning strategy that addresses scale, hot rows, and operational complexity. How would you decide between functional sharding (by domain), geographic sharding (region-aware), and purely key-based horizontal partitioning? How do you route traffic, rebalance shards, and plan for zero-downtime resharding?”
1. What is This Question Testing?
- Sharding Strategy: Can you choose the right sharding key for the use case?
- Hot Shard Mitigation: Can you handle uneven data distribution (NYC vs small cities)?
- Operational Complexity: Can you reshard without downtime?
- Trade-offs: Can you explain functional vs geographic vs hash-based sharding?
2. The Answer
Answer:
I’d use geographic + functional hybrid sharding with consistent hashing for hot shard distribution, and a routing layer with dual-write migration for zero-downtime resharding.
First, choosing sharding strategy:
Option 1: Functional Sharding (by domain)
Shard 1: Users table
Shard 2: Listings table
Shard 3: Bookings table
Pros: Clean separation, easy to reason about
Cons: Doesn't solve single-table scale (Listings still huge)
Option 2: Hash-Based Horizontal Sharding
Shard = hash(listing_id) % num_shards
Pros: Even distribution
Cons: Cross-region joins expensive, no data locality
Option 3: Geographic Sharding (My Choice)
Shard_US: Listings in US/Canada
Shard_EU: Listings in Europe
Shard_ASIA: Listings in Asia
Pros:
- Data locality (users search local listings 90% of time)
- Legal compliance (GDPR data residency)
- Reduced cross-shard joins
Cons: Uneven distribution (NYC has 10× listings vs Des Moines)
My Hybrid Approach: Geographic + Consistent Hashing
Primary: Geographic sharding
Secondary: Consistent hashing within region for hot shards
Shard Key: hash(region_code + listing_id)
Second, handling hot shards (NYC problem):
Problem:
NYC: 100,000 listings
Des Moines: 5,000 listings
Simple geographic sharding:
- NYC shard: 95% CPU (bottleneck!)
- Des Moines shard: 10% CPU (wasted capacity)
Solution: Consistent Hashing with Virtual Nodes
Hash Ring with Virtual Nodes:
- NYC_1, NYC_2, NYC_3, ..., NYC_10 (10 virtual shards)
- DesMoines_1 (1 shard)
Physical servers:
- NYC listings distributed across 10 servers
- Des Moines on 1 server
As data grows:
- Add NYC_11, NYC_12 dynamically
- Rebalance automatically via consistent hashing
Implementation:
class ShardRouter:
    def __init__(self):
        # Consistent hash ring
        self.ring = ConsistentHashRing()
        # Add virtual nodes per region
        self.ring.add_nodes('NYC', virtual_nodes=10)
        self.ring.add_nodes('SF', virtual_nodes=8)
        self.ring.add_nodes('DesMoines', virtual_nodes=1)

    def get_shard(self, listing_id, region):
        # Hash: region + listing_id
        key = f"{region}_{listing_id}"
        shard = self.ring.get_node(key)
        return shard

# Usage
router = ShardRouter()
shard = router.get_shard('listing_123', 'NYC')  # Routes to e.g. NYC_7
db = shard_connections[shard]
listing = db.query("SELECT * FROM listings WHERE id = %s", 'listing_123')
Third, routing layer:
Application → Router → Shard
↓
(ShardMap)
ShardMap (stored in Redis):
{
'NYC_1': 'db-nyc-01.postgres.us-east-1',
'NYC_2': 'db-nyc-02.postgres.us-east-1',
'EU_1': 'db-eu-01.postgres.eu-west-1'
}
Fourth, zero-downtime resharding:
Scenario: Split NYC shard (too hot) into NYC_new_1 and NYC_new_2
Phase 1: Dual-Write (Weeks 1-2)
def write_listing(listing):
    # Write to OLD shard AND NEW shards
    old_shard.write(listing)
    new_shard_1.write(listing)  # Dual write
    new_shard_2.write(listing)  # Dual write
    # Read from OLD shard (source of truth)
    return old_shard.read(listing.id)
Phase 2: Backfill (Weeks 2-4)
# Background job
def backfill_new_shards():
    for listing in old_shard.all_listings():
        # Determine new shard based on hash
        new_shard = router.get_shard(listing.id, listing.region)
        new_shard.write(listing)
# Verify: count(old_shard) == count(new_shard_1) + count(new_shard_2)
Phase 3: Switch Reads (Week 4)
def read_listing(listing_id):
    # Now read from NEW shards
    shard = router.get_shard(listing_id, 'NYC')
    return shard.read(listing_id)
Phase 4: Cleanup (Week 5)
# Stop writing to OLD shard
# Drop OLD shard after 7-day safety buffer
Cross-Shard Queries:
-- BAD: Cross-shard JOIN
SELECT * FROM bookings b
JOIN listings l ON b.listing_id = l.id
WHERE b.user_id = 123;
-- Bookings in Shard_A, Listings in Shard_B → expensive!

-- GOOD: Denormalize critical fields
CREATE TABLE bookings (
    id UUID,
    listing_id UUID,
    listing_region VARCHAR, -- Denormalized!
    listing_title VARCHAR   -- Denormalized! Avoids cross-shard joins
);
Interview Score: 9/10
Why: Geographic + consistent hashing hybrid strategy, virtual nodes for hot shard mitigation, routing layer design, zero-downtime resharding with dual-write phases, and denormalization to avoid cross-shard joins.
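The `ConsistentHashRing` used by the router above is assumed; a minimal sketch using MD5 and `bisect`, with the `add_nodes`/`get_node` API names taken from the answer:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self):
        self.hashes = []  # sorted hash positions on the ring
        self.ring = []    # parallel list of (hash, virtual_node_name)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_nodes(self, node, virtual_nodes=1):
        # Each virtual node lands at its own position on the ring
        for i in range(virtual_nodes):
            h = self._hash(f"{node}_{i}")
            idx = bisect.bisect(self.hashes, h)
            self.hashes.insert(idx, h)
            self.ring.insert(idx, (h, f"{node}_{i + 1}"))

    def get_node(self, key):
        if not self.ring:
            raise ValueError("empty ring")
        # First virtual node clockwise from the key's hash (wrap around)
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Because NYC registers 10 virtual nodes and Des Moines 1, NYC keys spread across 10 positions while Des Moines keys map to one, which is exactly the hot-shard mitigation described above.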
Question 6: Cache Hierarchies and Invalidation Strategies
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L4-L6, 4-7 Years of Experience)
Company Examples: Netflix, Airbnb, Spotify, Twitter
Question: “Design a multi-tier caching strategy for a high-traffic, read-heavy service like Airbnb search or Netflix catalog. Explain what you cache at CDN, application, and database levels; how you choose keys and TTLs; and how you avoid stale or inconsistent data for critical flows (e.g., bookings, availability). Compare write-through, write-back, and cache-aside patterns and describe concrete invalidation strategies for updates, deletes, and backfills.”
1. What is This Question Testing?
- Multi-Tier Caching: Can you design layered caching (CDN, app, DB)?
- Cache Patterns: Do you understand cache-aside, write-through, write-back trade-offs?
- Invalidation: Can you prevent stale data without over-invalidating?
- TTL Strategy: Can you choose appropriate expiration times?
2. The Answer
Answer:
I’d use 3-tier cache-aside pattern with CDN for static assets, Redis for application cache, and database query cache, with TTL-based expiration and event-driven invalidation for critical data.
First, multi-tier architecture:
Tier 1: CDN (CloudFlare, Fastly)
- What: Static assets (images, CSS, JS, videos)
- TTL: 24 hours - 7 days
- Invalidation: Versioned URLs (/assets/v123/logo.png)
- Hit rate: 95%+
Tier 2: Application Cache (Redis)
- What: Listing details, search results, user sessions
- TTL: 30 seconds - 10 minutes
- Invalidation: Explicit delete on update + TTL fallback
- Hit rate: 70-80%
Tier 3: Database Query Cache (Postgres)
- What: Frequent SELECT query results
- TTL: 1-5 minutes
- Invalidation: Automatic on table writes
- Hit rate: 50-60%
Second, cache patterns comparison:
Cache-Aside (Lazy Loading) - My Choice:
def get_listing(listing_id):
    # Try cache first
    cache_key = f"listing:{listing_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)  # Cache hit!
    # Cache miss → query database
    listing = db.query("SELECT * FROM listings WHERE id = %s", listing_id)
    # Store in cache with TTL
    redis.setex(cache_key, 300, json.dumps(listing))  # 5 min TTL
    return listing
Pros: Simple, only caches what's requested
Cons: Cache miss adds latency, thundering herd risk
Write-Through:
def update_listing(listing_id, data):
    # Write to database
    db.update("UPDATE listings SET ... WHERE id = %s", listing_id)
    # Immediately update cache
    redis.setex(f"listing:{listing_id}", 300, json.dumps(data))
    # Both always in sync
Pros: Cache always fresh
Cons: Write latency (2× writes), wasted cache on rarely-read data
Write-Back (Write-Behind):
def update_listing(listing_id, data):
    # Write to cache only
    redis.setex(f"listing:{listing_id}", 300, json.dumps(data))
    # Async write to DB later (background job)
    queue.enqueue('write_to_db', listing_id, data)
    # Fast response
Pros: Fastest writes
Cons: Risk of data loss if cache fails before DB write
My Choice: Cache-Aside + Event-Driven Invalidation
Third, invalidation strategies:
Strategy 1: TTL-Based (Passive)
Listing updated at 10:00
Cache still has old version (cached at 9:55, TTL 10min)
Cache expires at 10:05
Next read at 10:06 → cache miss → fetch fresh data
Pros: Simple, no code changes
Cons: 5 min stale data (may be acceptable for listings, not for availability)
Strategy 2: Explicit Invalidation (Active)
def update_listing(listing_id, data):
    # Update database
    db.update("UPDATE listings SET ... WHERE id = %s", listing_id)
    # Invalidate cache (delete, not update)
    redis.delete(f"listing:{listing_id}")
    # Next read will fetch fresh from DB
    # Let cache-aside repopulate on demand
Pros: Immediate freshness
Cons: Cache miss spike after update
Strategy 3: Event-Driven Invalidation (for distributed systems)
# Publisher (on listing update)
def update_listing(listing_id, data):
    db.update(...)
    # Publish event to Kafka/Redis Pub-Sub
    event_bus.publish('listing.updated', {
        'listing_id': listing_id,
        'timestamp': time.time()
    })

# Subscribers (across all app servers)
@event_handler('listing.updated')
def invalidate_cache(event):
    # Each app server clears its Redis cache
    redis.delete(f"listing:{event['listing_id']}")
Fourth, handling critical flows (bookings, availability):
Problem: Stale availability data causes double-bookings
Scenario:
10:00 - Room ABC available (cached)
10:05 - User A books Room ABC (DB updated, cache NOT invalidated)
10:06 - User B sees Room ABC available (stale cache!)
10:07 - User B tries to book → CONFLICT!
Solution: Skip cache for critical reads
def check_availability(listing_id, dates):
    # Critical flow: Read directly from DB (bypass cache)
    availability = db.query(
        "SELECT * FROM availability WHERE listing_id = %s AND date IN (...)",
        listing_id, dates,
        read_replica=False  # Read from primary DB, not replica
    )
    return availability

def book_listing(listing_id, dates):
    # Atomic booking with DB transaction
    with db.transaction():
        # Lock row
        availability = db.query(
            "SELECT * FROM availability WHERE listing_id = %s FOR UPDATE",
            listing_id
        )
        if not availability.is_available:
            raise BookingConflict()
        # Mark unavailable
        db.update("UPDATE availability SET booked = true WHERE ...")
    # Invalidate cache after successful booking
    redis.delete(f"listing:{listing_id}")
Fifth, cache key and TTL design:
Key Naming:
listing:{id} → TTL: 5 min
listing:{id}:availability → TTL: 30 sec (fresher)
search:{city}:{checkin}:{checkout} → TTL: 1 min
user:{id}:session → TTL: 24 hours
TTL Strategy:
Static data (listing description): 10 min TTL
Semi-dynamic (pricing, reviews): 2-5 min TTL
Critical (availability, inventory): 30 sec OR no cache
User sessions: 24 hours
Search results: 1 min (balance freshness vs load)
Interview Score: 9/10
Why: 3-tier caching architecture, cache-aside pattern with justification, explicit + event-driven invalidation strategies, critical flow handling (skip cache for bookings), and thoughtful TTL design per data type.
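The thundering herd risk noted under cache-aside (many clients refilling the same expired key at once) is commonly mitigated with a short-lived refill lock. A sketch, assuming a redis-py style client and a hypothetical `db.fetch_listing` helper:

```python
import json
import time

def get_listing_with_lock(redis, db, listing_id, ttl=300):
    """Cache-aside read where only one caller refills an expired key."""
    cache_key = f"listing:{listing_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    lock_key = f"{cache_key}:refill"
    # Only one caller wins the refill lock (NX = set only if not exists)
    if redis.set(lock_key, "1", nx=True, ex=10):
        try:
            listing = db.fetch_listing(listing_id)  # hypothetical DB helper
            redis.setex(cache_key, ttl, json.dumps(listing))
            return listing
        finally:
            redis.delete(lock_key)
    # Losers briefly wait and re-read the cache instead of hitting the DB
    time.sleep(0.05)
    cached = redis.get(cache_key)
    return json.loads(cached) if cached else db.fetch_listing(listing_id)
```

Under a herd of N concurrent misses, this turns N database queries into roughly one, at the cost of a small wait for the losers.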
Question 7: Event-Driven Architectures, Exactly-Once, and Idempotent Consumers
Difficulty: Very High
Role: Senior/Staff Backend Engineer
Level: Senior/Staff (L5-L7, 5-10 Years of Experience)
Company Examples: PayPal, Uber, Netflix, LinkedIn
Question: “A notifications or billing platform (like PayPal, Uber, or internal platform teams) uses Kafka/Kinesis with multiple consumers. Design the system so that each business operation is applied exactly once despite at-least-once delivery semantics, retries, and consumer restarts. How do you model idempotency on the consumer side, manage deduplication keys, and handle poison messages and DLQs? What failure scenarios would you explicitly test?”
1. What is This Question Testing?
- Exactly-Once Semantics: Can you achieve effectively-once despite at-least-once delivery?
- Idempotent Consumers: Can you design consumers that handle duplicate events safely?
- Failure Handling: Can you deal with poison messages, consumer crashes, DLQs?
- Event Deduplication: Can you track processed events to prevent re-processing?
2. The Answer
Answer:
I’d use idempotent consumers with database-backed event deduplication tracking unique event IDs, plus Dead Letter Queues for poison messages and retry logic, achieving effectively-once processing.
First, understanding the problem:
At-Least-Once Delivery (Kafka default):
Producer sends event → Kafka stores → Consumer processes → Consumer commits offset
If consumer crashes AFTER processing but BEFORE commit:
- Kafka redelivers same event on restart
- Event processed twice! (duplicate charge, duplicate notification)Exactly-Once is impossible in distributed systems (network failures make it theoretically impossible), but effectively-once is achievable via idempotency.
Second, idempotent consumer design:
Core Principle: Check if event already processed BEFORE processing
def consume_payment_event(event):
    event_id = event['event_id']  # Unique ID from producer
    payment_id = event['payment_id']
    amount = event['amount']
    # Atomic check + process in single transaction
    with db.transaction():
        # Check if already processed
        if EventLog.exists(event_id=event_id):
            log.info(f"Duplicate event {event_id}, skipping")
            return  # Idempotent! Safe to skip.
        # Process business logic
        charge_card(payment_id, amount)
        send_confirmation_email(payment_id)
        # Mark as processed
        EventLog.create(
            event_id=event_id,
            payment_id=payment_id,
            processed_at=datetime.now(),
            status='success'
        )
        db.commit()  # Atomic: either ALL happens or NONE
    # Commit Kafka offset (after successful processing)
    consumer.commit_offset(event)
Event Deduplication Table:
CREATE TABLE event_log (
    event_id VARCHAR(255) PRIMARY KEY, -- UUID from event
    event_type VARCHAR(50),
    processed_at TIMESTAMP,
    status ENUM('success', 'failed'),
    payload JSON,
    INDEX idx_processed_at (processed_at)
);

-- Cleanup old events (after 7-30 days)
DELETE FROM event_log WHERE processed_at < NOW() - INTERVAL 30 DAY;
Third, handling failure scenarios:
Scenario 1: Consumer crashes after processing, before commit
1. Consumer receives event (event_id: abc123)
2. Processes: Charges card $100
3. Writes to event_log: event_id=abc123, status=success
4. CRASH before committing Kafka offset
5. Consumer restarts, Kafka redelivers event abc123
6. Consumer checks event_log: abc123 exists → SKIP
7. No duplicate charge! ✓
Scenario 2: Database transaction fails mid-processing
1. Start transaction
2. Charge card (succeeds)
3. Write event_log (DB fails!)
4. Transaction rolls back
5. Charge card operation also rolled back (compensating transaction)
6. Event NOT marked processed
7. Kafka redelivers → Retries successfully
Note: If charge is to external API (not in transaction), need idempotency keys:
stripe.charge(idempotency_key=event_id)  # Stripe prevents duplicate charges
Scenario 3: Poison message (malformed JSON)
def consume_with_dlq(event):
    try:
        # Validate event structure
        if not validate_event_schema(event):
            raise ValidationError("Invalid schema")
        # Process
        consume_payment_event(event)
    except ValidationError as e:
        # Poison message: won't succeed even with retries
        log.error(f"Poison message {event.get('id')}: {e}")
        # Send to Dead Letter Queue for manual review
        dlq.send(event, error=str(e))
        # Commit offset (skip this message, don't block queue)
        consumer.commit_offset(event)
    except TransientError as e:
        # Transient error (DB timeout, network issue)
        # Don't commit offset → Kafka will retry
        log.warning(f"Transient error, will retry: {e}")
        raise  # Consumer framework handles retry
Fourth, DLQ and retry strategy:
Dead Letter Queue Setup:
Main Topic: payment.events
DLQ: payment.events.dlq
Poison message conditions:
- Invalid JSON
- Schema validation failure
- Business logic error (e.g., unknown payment_id)
DLQ Consumer (manual intervention):
- Alerts team via Slack/PagerDuty
- Shows event in admin dashboard
- Engineer fixes data, replays event to main topic
Retry Strategy:
# Exponential backoff for transient errors
class RetryableConsumer:
    max_retries = 3
    base_delay = 1  # second

    def consume(self, event):
        for attempt in range(self.max_retries):
            try:
                consume_payment_event(event)
                return  # Success!
            except TransientError as e:
                if attempt < self.max_retries - 1:
                    delay = self.base_delay * (2 ** attempt)  # 1s, 2s, 4s
                    log.warning(f"Retry {attempt+1}/{self.max_retries} after {delay}s")
                    time.sleep(delay)
                else:
                    # Max retries exceeded → DLQ
                    dlq.send(event, error="Max retries exceeded")
                    consumer.commit_offset(event)
Fifth, testing scenarios:
def test_duplicate_event_processing():
    """Test idempotency: same event processed 2× should only charge once"""
    event = {'event_id': 'test123', 'payment_id': 'pay_1', 'amount': 100}
    # Process once
    consume_payment_event(event)
    assert get_charge('pay_1').amount == 100
    # Process duplicate
    consume_payment_event(event)
    assert get_charge('pay_1').amount == 100  # Still $100, not $200!

def test_consumer_crash_recovery():
    """Test recovery after crash before offset commit"""
    event = {'event_id': 'test456', 'payment_id': 'pay_2', 'amount': 50}
    # Simulate crash
    with mock.patch('consumer.commit_offset', side_effect=SystemExit):
        try:
            consume_payment_event(event)
        except SystemExit:
            pass
    # Restart consumer
    consume_payment_event(event)  # Redelivered by Kafka
    assert get_charge('pay_2').amount == 50  # Only charged once

def test_poison_message_dlq():
    """Test malformed event sent to DLQ"""
    poison_event = {'invalid': 'schema'}  # Missing required fields
    consume_with_dlq(poison_event)
    # Event in DLQ, not in event_log
    assert dlq.count() == 1
    assert EventLog.exists(event_id='invalid') is False
Interview Score: 9/10
Why: Database-backed idempotency with event deduplication table, atomic transaction pattern preventing partial processing, DLQ for poison messages, retry strategy with exponential backoff, and comprehensive failure scenario testing.
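One refinement to the fixed 1s/2s/4s backoff above: when many consumers fail together, fixed delays make them retry in lockstep and hammer the recovering dependency again. "Full jitter" randomizes each delay; a small sketch:

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=3):
    """Full-jitter exponential backoff: each delay is uniform in
    [0, min(cap, base * 2**attempt)], so retrying consumers de-synchronize."""
    return [random.uniform(0, min(cap, base * (2 ** a))) for a in range(attempts)]
```

The upper bound still doubles per attempt (1s, 2s, 4s here), but individual consumers spread out within each window instead of retrying simultaneously.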
Question 8: Sagas, Distributed Transactions, and Compensating Actions
Difficulty: Very High
Role: Principal Backend Engineer / Architect
Level: Staff/Principal (L6-L8, 7-12 Years of Experience)
Company Examples: Uber, Airbnb, Booking.com, Expedia
Question: “You are the Principal Backend Engineer designing cross-service booking or checkout flows (e.g., travel booking: flights + hotels + payments). You can’t rely on 2PC or global ACID transactions. Propose a saga-based approach and explain how you design forward steps and compensating transactions, handle partial failures, timeouts, and long-running operations, and maintain observability of saga state across services. How do you ensure idempotency of compensating steps?”
1. What is This Question Testing?
- Saga Pattern Knowledge: Can you design workflows without distributed transactions?
- Compensating Transactions: Can you handle rollbacks in distributed systems?
- Failure Handling: Can you deal with partial failures, timeouts?
- Idempotency: Can you ensure compensating actions are safe to retry?
- Observability: Can you track saga state across services?
2. The Answer
Answer:
I’d use choreography-based saga with event-driven compensations and a saga orchestrator tracking state, ensuring idempotent compensating transactions via unique compensation IDs.
First, saga pattern basics:
Problem: Distributed transactions don’t scale
Traditional 2PC (Two-Phase Commit):
1. Coordinator asks all services to prepare
2. All services lock resources, vote YES/NO
3. If all YES, coordinator commits; if any NO, abort
Issues:
- Blocking: Services hold locks during coordination (performance killer)
- Single point of failure: Coordinator crash = deadlock
- Network partitions: Can't guarantee atomicity across regionsSaga Alternative:
Each service commits locally (no global lock)
If failure occurs → run compensating transactions (undo previous steps)Second, travel booking saga design:
Booking Flow: Book Flight + Hotel + Payment
Choreography-Based Saga (Event-Driven):
Step 1: Reserve Flight
→ Service: Flight Service
→ Action: Create flight reservation (status=reserved)
→ Event: FlightReserved
Step 2: Reserve Hotel (triggered by FlightReserved event)
→ Service: Hotel Service
→ Action: Create hotel reservation (status=reserved)
→ Event: HotelReserved
Step 3: Charge Payment (triggered by HotelReserved event)
→ Service: Payment Service
→ Action: Charge credit card
→ Event: PaymentSucceeded OR PaymentFailed
Step 4a: Confirm Booking (if PaymentSucceeded)
→ Update flight: status=confirmed
→ Update hotel: status=confirmed
→ Event: BookingCompleted
Step 4b: Compensate (if PaymentFailed)
→ Cancel hotel reservation (compensating transaction)
→ Cancel flight reservation (compensating transaction)
→ Event: BookingCanceled
Implementation:
# Flight Service
class FlightService:
    def reserve_flight(self, booking_id, flight_id):
        # Step 1: Reserve flight
        reservation = FlightReservation.create(
            booking_id=booking_id,
            flight_id=flight_id,
            status='reserved',
            expires_at=now() + timedelta(minutes=15)
        )
        # Publish event
        event_bus.publish('flight.reserved', {
            'booking_id': booking_id,
            'reservation_id': reservation.id
        })
        return reservation

    @event_handler('booking.canceled')
    def cancel_reservation(self, event):
        # Compensating transaction
        booking_id = event['booking_id']
        compensation_id = event['compensation_id']  # For idempotency
        # Check if already compensated
        if Compensation.exists(compensation_id=compensation_id):
            return  # Idempotent!
        # Cancel reservation
        reservation = FlightReservation.get(booking_id=booking_id)
        reservation.status = 'canceled'
        reservation.save()
        # Mark compensation as done
        Compensation.create(
            compensation_id=compensation_id,
            booking_id=booking_id,
            action='cancel_flight'
        )
        event_bus.publish('flight.canceled', {'booking_id': booking_id})
# Hotel Service (similar pattern)
class HotelService:
    @event_handler('flight.reserved')
    def reserve_hotel(self, event):
        booking_id = event['booking_id']
        # Reserve hotel
        reservation = HotelReservation.create(
            booking_id=booking_id,
            status='reserved'
        )
        event_bus.publish('hotel.reserved', {
            'booking_id': booking_id,
            'reservation_id': reservation.id
        })

    @event_handler('booking.canceled')
    def cancel_reservation(self, event):
        # Compensating transaction (idempotent)
        compensation_id = event['compensation_id']
        if Compensation.exists(compensation_id=compensation_id):
            return
        # Cancel
        reservation = HotelReservation.get(booking_id=event['booking_id'])
        reservation.status = 'canceled'
        reservation.save()
        Compensation.create(
            compensation_id=compensation_id,
            booking_id=event['booking_id'],
            action='cancel_hotel'
        )
# Payment Service
class PaymentService:
    @event_handler('hotel.reserved')
    def charge_payment(self, event):
        booking_id = event['booking_id']
        try:
            # Charge credit card
            charge = stripe.charge(amount=total, booking_id=booking_id)
            event_bus.publish('payment.succeeded', {
                'booking_id': booking_id,
                'charge_id': charge.id
            })
        except StripeError as e:
            # Payment failed → trigger compensations
            event_bus.publish('payment.failed', {
                'booking_id': booking_id,
                'error': str(e)
            })
Third, saga orchestrator (for observability):
class SagaOrchestrator:
    """Tracks saga state across services"""

    def create_booking_saga(self, user_id, flight_id, hotel_id):
        # Create saga state
        saga = Saga.create(
            saga_id=uuid.uuid4(),
            type='booking',
            status='started',
            steps=[
                {'name': 'reserve_flight', 'status': 'pending'},
                {'name': 'reserve_hotel', 'status': 'pending'},
                {'name': 'charge_payment', 'status': 'pending'}
            ]
        )
        # Start saga
        event_bus.publish('saga.started', {
            'saga_id': saga.saga_id,
            'booking_id': saga.saga_id  # Use saga_id as booking_id
        })
        return saga

    @event_handler('flight.reserved')
    def on_flight_reserved(self, event):
        saga = Saga.get(saga_id=event['booking_id'])
        saga.update_step('reserve_flight', 'completed')
        saga.save()

    @event_handler('payment.failed')
    def on_payment_failed(self, event):
        saga = Saga.get(saga_id=event['booking_id'])
        saga.status = 'compensating'
        saga.save()
        # Trigger compensations
        compensation_id = uuid.uuid4()
        event_bus.publish('booking.canceled', {
            'booking_id': event['booking_id'],
            'compensation_id': compensation_id  # Ensures idempotency
        })
Fourth, handling failure scenarios:
Scenario 1: Timeout during hotel reservation
1. Flight reserved (success)
2. Hotel reservation times out (network issue)
3. Payment never triggered
Solution: Timeouts + Expiration
- Flight reservation expires in 15 minutes
- Background job checks for expired reservations
- Auto-cancel if saga not completed
Scenario 2: Compensation fails
1. Payment fails
2. Trigger compensation: cancel hotel
3. Cancel hotel fails (hotel service down)
Solution: Retry with exponential backoff
- Retry compensation 3 times (1s, 2s, 4s)
- If still failing, alert team (manual intervention)
- Idempotency ensures safe retries
Fifth, ensuring idempotency of compensations:
-- Compensation tracking table
CREATE TABLE compensations (
    compensation_id UUID PRIMARY KEY,
    booking_id UUID NOT NULL,
    action VARCHAR(50), -- 'cancel_flight', 'cancel_hotel'
    executed_at TIMESTAMP,
    INDEX idx_booking (booking_id)
);

# Idempotent compensation
def cancel_flight(booking_id, compensation_id):
    # Check if already executed
    if Compensation.exists(compensation_id=compensation_id):
        log.info(f"Compensation {compensation_id} already executed")
        return  # Safe to retry!
    # Execute compensation
    with db.transaction():
        reservation = FlightReservation.get(booking_id=booking_id)
        reservation.status = 'canceled'
        reservation.save()
        # Record compensation
        Compensation.create(
            compensation_id=compensation_id,
            booking_id=booking_id,
            action='cancel_flight',
            executed_at=now()
        )
        db.commit()
Observability Dashboard:
Saga ID: abc-123
Status: Compensating
Started: 2024-12-08 10:00:00
Steps:
✅ Reserve Flight (completed at 10:00:05)
✅ Reserve Hotel (completed at 10:00:10)
❌ Charge Payment (failed at 10:00:15 - "Card declined")
Compensations:
⏳ Cancel Hotel (in progress)
⏳ Cancel Flight (pending)
Interview Score: 9/10
Why: Choreography-based saga with event-driven flow, compensating transaction design with idempotency via compensation_id tracking, saga orchestrator for state visibility, timeout/expiration handling, and failure scenario coverage.
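The timeout handling in Scenario 1 implies a background sweeper that cancels reservations whose saga never completed. A sketch, with the query helper and event bus passed in as parameters (names assumed, following the event vocabulary above):

```python
from datetime import datetime, timezone

def sweep_expired_reservations(find_expired, event_bus, new_compensation_id):
    """Background job: cancel reservations stuck in 'reserved' past their
    expires_at, reusing the same idempotent compensation path as payment failure."""
    now = datetime.now(timezone.utc)
    canceled = []
    # find_expired(now) is assumed to query e.g.
    # status='reserved' AND expires_at < now AND saga not completed
    for reservation in find_expired(now):
        event_bus.publish('booking.canceled', {
            'booking_id': reservation['booking_id'],
            'compensation_id': new_compensation_id(),  # idempotency key
        })
        canceled.append(reservation['booking_id'])
    return canceled
```

Because cancellation goes through the same `booking.canceled` event with a `compensation_id`, a sweeper run that overlaps a payment-failure compensation stays safe.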
Question 9: Database Migrations and Zero-Downtime Releases
Difficulty: Very High
Role: Senior/Staff Backend Engineer
Level: Senior/Staff (L5-L7, 4-10 Years of Experience)
Company Examples: Stripe, GitHub, Shopify, LinkedIn
Question: “Your monolithic service is being decomposed into microservices, and you need to perform a breaking database schema migration (e.g., splitting a user table, changing primary keys, or moving to sharded instances) with zero customer-visible downtime. Describe your migration plan in phases, including dual writes/reads, backfill strategies, feature flags, fallbacks, monitoring, and rollback. How would you test and de-risk this plan in a live, high-traffic environment?”
1. What is This Question Testing?
- Migration Planning: Can you design multi-phase, zero-downtime migrations?
- Dual Write/Read: Can you maintain consistency during transition?
- Risk Mitigation: Can you de-risk with feature flags, monitoring, rollback plans?
- Testing: Can you validate migration in production safely?
2. The Answer
Answer:
I’d use a 5-phase migration strategy with dual writes, gradual rollout via feature flags, and continuous monitoring with instant rollback capability.
Scenario: Split users table into users + user_profiles
Old Schema:
CREATE TABLE users (
id BIGINT PRIMARY KEY,
email VARCHAR(255),
name VARCHAR(255),
    bio TEXT,                -- Moving to user_profiles
    avatar_url VARCHAR(500), -- Moving to user_profiles
    created_at TIMESTAMP
);
New Schema:
CREATE TABLE users (
id BIGINT PRIMARY KEY,
email VARCHAR(255),
name VARCHAR(255),
created_at TIMESTAMP);
CREATE TABLE user_profiles (
user_id BIGINT PRIMARY KEY,
bio TEXT,
avatar_url VARCHAR(500),
FOREIGN KEY (user_id) REFERENCES users(id)
);
Phase 1: Add New Table (Week 1)
-- Create new table
CREATE TABLE user_profiles (
    user_id BIGINT PRIMARY KEY,
    bio TEXT,
    avatar_url VARCHAR(500),
    FOREIGN KEY (user_id) REFERENCES users(id)
);
-- No code changes yet, just schema
Phase 2: Dual Write (Weeks 2-3)
# Application code: Write to BOTH tables
def update_user_profile(user_id, bio, avatar_url):
    with db.transaction():
        # Write to OLD table (still source of truth)
        User.update(user_id, bio=bio, avatar_url=avatar_url)
        # ALSO write to NEW table (dual write)
        UserProfile.upsert(user_id=user_id, bio=bio, avatar_url=avatar_url)
        db.commit()

# Reads still from OLD table
def get_user_profile(user_id):
    user = User.get(user_id)
    return {'bio': user.bio, 'avatar_url': user.avatar_url}
Phase 3: Backfill Historical Data (Weeks 3-4)
# Background job
def backfill_user_profiles():
    batch_size = 1000
    offset = 0
    while True:
        users = User.query.limit(batch_size).offset(offset).all()
        if not users:
            break
        for user in users:
            # Copy data from users to user_profiles
            UserProfile.upsert(
                user_id=user.id,
                bio=user.bio,
                avatar_url=user.avatar_url
            )
        offset += batch_size
        log.info(f"Backfilled {offset} users")
        time.sleep(0.1)  # Rate limit to avoid DB overload

-- Verify backfill
SELECT COUNT(*) FROM users;         -- 1,000,000
SELECT COUNT(*) FROM user_profiles; -- 1,000,000 (should match!)
Phase 4: Gradual Read Migration (Weeks 4-6)
# Feature flag: Gradually route reads to NEW table
@feature_flag('read_from_user_profiles', rollout_percentage=10)
def get_user_profile(user_id):
    if feature_flag_enabled('read_from_user_profiles', user_id):
        # Read from NEW table
        profile = UserProfile.get(user_id)
        return {'bio': profile.bio, 'avatar_url': profile.avatar_url}
    else:
        # Read from OLD table (fallback)
        user = User.get(user_id)
        return {'bio': user.bio, 'avatar_url': user.avatar_url}

# Rollout:
# Week 4: 10% of users read from new table
# Week 5: 50% of users
# Week 6: 100% of users (if no issues)
Monitoring during rollout:
Metrics:
- read_user_profiles_latency (OLD vs NEW comparison)
- read_user_profiles_errors (alert if >0.1%)
- data_consistency_check (OLD == NEW?)
Alert Conditions:
- NEW table latency >2× OLD table latency
- Error rate >0.1%
- Data mismatch detected
→ Instant rollback to 0%
Phase 5: Cleanup (Week 7)
-- Stop dual writes
-- Remove bio, avatar_url from users table
ALTER TABLE users DROP COLUMN bio;
ALTER TABLE users DROP COLUMN avatar_url;
-- Remove feature flag (100% on NEW table)
Rollback Plan:
# If issues detected at 50% rollout:
# 1. Set feature flag to 0% → All reads from OLD table
# 2. Continue dual writes (data still syncing)
# 3. Investigate issue
# 4. Fix and retry rollout

# Instant rollback (< 1 minute):
feature_flag.set('read_from_user_profiles', 0)
Testing in Production:
# Shadow reads: Compare OLD vs NEW
def get_user_profile(user_id):
    # Primary read (OLD table)
    old_data = User.get(user_id)
    # Shadow read (NEW table, non-blocking)
    async_shadow_read(user_id, old_data)
    return old_data

def async_shadow_read(user_id, old_data):
    try:
        new_data = UserProfile.get(user_id)
        # Compare results
        if new_data.bio != old_data.bio:
            alert("Data mismatch detected!", user_id)
    except Exception as e:
        log.error(f"Shadow read failed: {e}")
Interview Score: 9/10
Why: 5-phase migration with dual writes, backfill strategy, gradual rollout via feature flags with percentage control, continuous monitoring with instant rollback, and shadow reads for data consistency validation.
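The data_consistency_check metric listed under monitoring implies a sampling comparator run during the dual-write window. A sketch, with the data-access calls passed in as functions so it stays storage-agnostic (names assumed):

```python
import random

def check_consistency(user_ids, read_old, read_new, sample_size=100):
    """Compare a random sample of users between OLD and NEW tables;
    returns the ids that mismatch so they can be re-backfilled or alerted on."""
    sample = random.sample(user_ids, min(sample_size, len(user_ids)))
    mismatches = []
    for user_id in sample:
        old, new = read_old(user_id), read_new(user_id)
        if old != new:
            mismatches.append(user_id)
    return mismatches
```

Run periodically during Phases 2-4, a non-empty result trips the "Data mismatch detected" alert condition above before the rollout percentage is increased.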
Question 10: Handling Concurrency, Race Conditions, and Distributed Locks
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: All high-traffic systems
Question: “Consider a rate limiter or resource allocator backed by Redis or a relational DB in a distributed environment with multiple instances. How do you design your system to avoid race conditions and ensure correctness under concurrent requests? Compare optimistic vs pessimistic locking, Lua scripts in Redis, and per-resource distributed locks. Where do you accept ‘eventual enforcement’ vs strict enforcement, and how do you justify the trade-off?”
1. What is This Question Testing?
- Race Condition Handling: Can you design systems that handle concurrent requests correctly?
- Locking Strategies: Do you understand optimistic vs pessimistic locking trade-offs?
- Redis Lua Scripts: Can you use atomic operations to prevent race conditions?
- Distributed Locks: Do you know when to use distributed locks vs other approaches?
- Trade-off Analysis: Can you justify eventual vs strict enforcement?
2. The Answer
Answer:
I’d use Redis Lua scripts for atomic operations in rate limiting, combined with optimistic locking for resource allocation, accepting eventual enforcement for performance.
First, the race condition problem:
# BAD: Race condition in rate limiter
def check_rate_limit(user_id):
    count = int(redis.get(f"rate:{user_id}") or 0)
    if count >= 100:
        return False  # Rate limited
    # Race condition here! Two requests can both pass this check
    redis.incr(f"rate:{user_id}")
    return True
# Result: User makes 105 requests instead of 100 (5% over-limit)
Solution 1: Lua Script (Atomic)
-- rate_limit.lua (runs atomically in Redis)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local current = redis.call('GET', key)
if not current then
    current = 0
end
if tonumber(current) >= limit then
    return 0  -- Rate limited
else
    redis.call('INCR', key)
    redis.call('EXPIRE', key, 60)
    return 1  -- Allowed
end

# Python: Execute Lua script atomically
# (redis-py signature: eval(script, numkeys, *keys_and_args))
def check_rate_limit(user_id, limit=100):
    result = redis.eval(
        lua_script,
        1,                  # one key
        f"rate:{user_id}",  # KEYS[1]
        limit               # ARGV[1]
    )
    return result == 1  # True if allowed

Solution 2: Optimistic Locking
# Resource allocation with version check
def allocate_resource(resource_id, user_id):
    while True:
        # Read resource + version
        resource = Resource.get(resource_id)
        if resource.allocated:
            return False  # Already allocated
        # Try to allocate (check version hasn't changed)
        updated = Resource.update(
            resource_id=resource_id,
            allocated=True,
            allocated_to=user_id,
            version=resource.version + 1,
            where={'version': resource.version}  # Optimistic lock
        )
        if updated:
            return True  # Success!
        else:
            # Version changed → retry
            continue

Solution 3: Distributed Lock (Redis)
import redis_lock

def book_last_ticket(event_id, user_id):
    lock_key = f"lock:event:{event_id}"
    with redis_lock.Lock(redis, lock_key, expire=5):
        # Only one request holds lock at a time
        tickets = Ticket.count(event_id=event_id, available=True)
        if tickets == 0:
            return False
        # Allocate ticket
        ticket = Ticket.get_available(event_id)
        ticket.allocated_to = user_id
        ticket.save()
        return True

Fourth, trade-off analysis:
Strict Enforcement (Lua scripts, locks):
Pros: Exactly 100.0 requests/min, guaranteed correctness
Cons: Higher latency (lock contention), single point of failure (Redis)
Use case: Critical resources (inventory, tickets, payments)

Eventual Enforcement (Regional Redis with replication lag):
Pros: Low latency (<1ms), high availability
Cons: Slight over-limit during lag (100-102 requests/min, 2% error)
Use case: Rate limiting, quotas, soft limits
Acceptable trade-off:
- 2% over-limit for 100× better latency
- Rate limiting is a soft limit (not life-critical)

Fifth, choosing the right approach:
Lua Scripts (My choice for rate limiting):
- Atomic operations
- No network round trips
- Sub-millisecond performance
- 99.9% accuracy (regional lag acceptable)
Optimistic Locking (For resources with low contention):
- No locks needed
- Retry on conflict
- Works well if conflicts rare (<5%)
Distributed Locks (For critical resources):
- Strong consistency
- Use only when necessary (tickets, inventory)
- Accept latency cost (10-50ms)

Interview Score: 9/10
Why: Clear race condition explanation, three solutions compared (Lua, optimistic, locks), trade-off justification (strict vs eventual), and practical guidance on when to use each approach.
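The question also asks about token bucket, which the answer above names but does not show. A single-process sketch of the token-bucket arithmetic follows; a production version would run the same refill-and-consume logic atomically in Redis (e.g., via a Lua script as above), and the class name and parameters here are illustrative:

```python
import time

class TokenBucket:
    """Single-process token-bucket sketch: allows bursts up to `capacity`,
    then refills steadily at `refill_per_sec` tokens per second."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity            # burst size
        self.refill_per_sec = refill_per_sec  # steady-state rate
        self.tokens = capacity              # start full
        self.last = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # consume one token
        return False      # bucket empty → rate limited
```

For example, a bucket with capacity 5 and refill 1 token/sec admits 5 back-to-back requests, rejects the 6th, and admits one more after a second — smoother than the fixed-window counter, which resets all 100 requests at once on the minute boundary.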
Question 11: Observability, SLOs, and On-Call Incident Response
Difficulty: High
Role: Senior Backend Engineer / SRE
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: All production systems
Question: “You are the on-call Senior Backend Engineer for a microservices-based API platform. Suddenly, p99 latency doubles and error rates spike in one region. Walk through your incident response: what dashboards, traces, and logs do you check; how do you distinguish between downstream dependency issues (e.g., DB, cache, message queue) vs application-level regressions; and how do you decide whether to roll back, degrade features, or perform a partial failover?”
1. What is This Question Testing?
- Incident Response: Can you systematically diagnose production issues?
- Observability Tools: Do you know how to use dashboards, traces, logs effectively?
- Root Cause Analysis: Can you distinguish app vs infrastructure vs dependency issues?
- Decision Making: Can you decide between rollback, degradation, failover under pressure?
- SRE Mindset: Do you prioritize customer impact over perfect diagnosis?
2. The Answer
Answer:
I’d follow a systematic 4-step incident response playbook: dashboards → traces → logs → decision matrix, prioritizing fast mitigation over perfect diagnosis.
First, initial assessment (0-2 minutes):
Primary Dashboards (Datadog, Grafana):
1. Service Health Dashboard:
- p50/p95/p99 latency trends
- Error rate (4XX, 5XX)
- Request rate (RPS)
- By region, service, endpoint
2. Dependency Dashboard:
- Database: query latency, connection pool usage
- Cache (Redis): latency, hit rate, eviction rate
- Message Queue (Kafka): consumer lag, broker health
- External APIs: latency, error rate
3. Infrastructure Dashboard:
- CPU, memory, disk I/O
- Network throughput, packet loss
- Container/pod health (if Kubernetes)

What I See:
Alert: API latency spike in US-EAST region
Metrics:
- p99 latency: 100ms → 500ms (5× increase) ← RED FLAG
- 5XX error rate: 0.1% → 2% (20× increase) ← RED FLAG
- Request rate: Stable at 10K RPS (no traffic spike)
- Only US-EAST affected (EU, ASIA normal)
Initial hypothesis: Issue localized to US-EAST region

Second, distributed trace analysis (2-5 minutes):
Use Datadog APM or Jaeger to find slow requests:
Example slow trace (p99 = 500ms):
Span 1: API Gateway → 5ms (normal)
Span 2: Auth Service → 10ms (normal)
Span 3: Product Service → 480ms ← SLOW!
↳ Span 3.1: Database query → 450ms ← ROOT CAUSE
↳ Span 3.2: Redis cache → 15ms (normal)
↳ Span 3.3: Business logic → 10ms (normal)
Conclusion: Database queries in Product Service are slow

Drill down into database span:
Slow query:
SELECT * FROM products
WHERE category_id IN (SELECT id FROM categories WHERE ...)
ORDER BY created_at DESC
LIMIT 100;
Execution time: 450ms (normal: 50ms)
Rows scanned: 500,000 (normal: 1,000)
Red flag: Missing index or inefficient query

Third, log analysis (5-10 minutes):
Check application logs:
# Filter logs for errors in US-EAST Product Service
kubectl logs -l app=product-service,region=us-east --since=10m | grep ERROR

Output:
[ERROR] Database connection pool exhausted (95/100)
[ERROR] Query timeout after 1000ms: SELECT * FROM products...
[ERROR] Slow query detected: 450ms execution time

Check database metrics:
PostgreSQL metrics (US-EAST):
- Active connections: 95/100 (near limit!) ← RED FLAG
- Slow query log: SELECT * FROM products ... (repeated 10,000×)
- Query execution time: 450ms avg (p99: 800ms)
- Lock wait time: 50ms (increased from 5ms)
Root cause confirmed: Database connection pool exhaustion + slow queries

Fourth, decision matrix:
Option 1: Rollback
When to rollback:
- Recent deploy within last 2 hours
- Clear correlation between deploy time and incident
- Rollback is low-risk (feature flag or code deploy)
Check:
- Last deploy: 15 minutes ago (Product Service v2.3.5)
- Changeset: Added new "related products" feature with N+1 query
Decision: ROLLBACK
Action: Rollback Product Service to v2.3.4
ETA: 2 minutes (automated rollback)

Option 2: Feature Degradation
When to degrade:
- External dependency failing (not our code)
- Can't rollback (critical security fix deployed)
- Graceful degradation possible
Example:
- Redis cache down → Serve stale data (30 sec old)
- Search service slow → Skip search suggestions, show static results
- Payment gateway timeout → Retry queue, notify user "processing"
Action: Feature flag to disable "related products" feature
ETA: 1 minute

Option 3: Partial Failover
When to failover:
- Infrastructure issue (datacenter, AZ outage)
- Database replica failure
- Network partition
Action:
- Route US-EAST traffic to US-WEST
- Scale up US-WEST capacity 2×
- Monitor cross-region latency increase (acceptable temporarily)
ETA: 5-10 minutes

My Decision for This Incident:
Root cause: Recent deploy introduced N+1 query
Best action: ROLLBACK
Reasoning:
- Deploy was 15 min ago (clear correlation)
- Rollback is safe (automated, tested)
- Fastest mitigation (2 min vs 10+ min for other options)
- Preserves customer experience
Execute:
1. Slack: "#incident-response Deploy rollback initiated for product-service"
2. Command: kubectl rollout undo deployment/product-service -n us-east
3. Monitor: Dashboard shows latency dropping within 30 seconds
4. Confirm: p99 latency back to 100ms, errors back to 0.1%
5. All-clear: Incident resolved in 8 minutes

Fifth, post-incident actions:
Immediate (during incident):
1. Update incident channel with ETA
2. Notify stakeholders (eng lead, product)
3. Monitor for 15 min to confirm resolution
4. Close incident ticket

Post-Mortem (within 24 hours):
1. Root Cause:
- N+1 query in "related products" feature
- Missing index on products.category_id
- No load testing before deploy
2. Action Items:
- Add eager loading: Product.objects.prefetch_related('related')
- Add index: CREATE INDEX ON products(category_id)
- Add query count assertion in tests
- Require load testing for database-heavy features
3. Prevention:
- Enable query count warnings in staging
- Add dashboard alert: "Query count >50 per request"

SLO Impact:
SLO: p99 latency <200ms for 99.9% of requests
Impact:
- Breach duration: 8 minutes
- Affected requests: ~4.8M (10K RPS × 480 seconds)
- Error budget: Consumed 0.5% of monthly budget
Conclusion: Within acceptable range (stayed <1% budget burn)

Interview Score: 9/10
Why: Systematic 4-step playbook (dashboards → traces → logs → decision), clear decision matrix for rollback vs degradation vs failover, real-world trace analysis identifying N+1 query, and post-incident actions with SLO impact calculation.
Question 12: API Versioning, Backward Compatibility, and Contract Negotiation
Difficulty: High
Role: Senior Backend Engineer / API Platform
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: Stripe, Twilio, GitHub
Question: “Your team owns a core backend API used by multiple internal and external clients. You need to ship a breaking change to the contract. How do you design your API versioning strategy, manage deprecation, and avoid breaking consumers? Describe how you’d coordinate across teams, enforce backward compatibility in the short term, and use schema validation, contract tests, or API gateways to keep the ecosystem stable.”
1. What is This Question Testing?
- API Design: Can you version APIs without breaking clients?
- Backward Compatibility: Can you maintain old and new versions simultaneously?
- Deprecation Management: Can you sunset old versions gracefully?
- Contract Testing: Do you know how to validate API contracts programmatically?
- Cross-Team Coordination: Can you manage migration across multiple teams?
2. The Answer
Answer:
I’d use URL-based versioning (/api/v1, /api/v2) with parallel version support, gradual client migration, contract tests, and a 12-month deprecation timeline.
First, versioning strategy:
URL-Based Versioning (Recommended):
Current: /api/v1/users
New: /api/v2/users
Pros:
- Clear version in URL (easy for clients to understand)
- Can run both versions simultaneously
- Easy to route in API gateway
Cons:
- URL changes (but that's the point for breaking changes)

Alternatives (Why I don’t recommend):
Header-based: Accept: application/vnd.api+json; version=2
- Pros: Clean URLs
- Cons: Harder to test, caching issues
Query param: /api/users?version=2
- Pros: Easy to toggle
- Cons: Pollutes query params, caching issues

Second, breaking change example:
Scenario: User API redesign
Old (v1):
GET /api/v1/users/123

Response:
{
  "user_id": 123,
  "name": "John Smith",
  "email": "john@example.com",
  "created": "2024-01-01"
}

New (v2) - Breaking changes:
GET /api/v2/users/123

Response:
{
  "id": 123,                            // Renamed from user_id
  "first_name": "John",                 // Split from name
  "last_name": "Smith",                 // Split from name
  "email": "john@example.com",
  "created_at": "2024-01-01T00:00:00Z"  // ISO 8601
}

Third, implementation with parallel support:
# Both versions use same data model internally
class UserSerializer_V1:
    def to_json(self, user):
        return {
            "user_id": user.id,
            "name": f"{user.first_name} {user.last_name}",
            "email": user.email,
            "created": user.created_at.strftime("%Y-%m-%d")
        }

class UserSerializer_V2:
    def to_json(self, user):
        return {
            "id": user.id,
            "first_name": user.first_name,
            "last_name": user.last_name,
            "email": user.email,
            "created_at": user.created_at.isoformat()
        }

# v1 endpoint
@app.route("/api/v1/users/<int:user_id>")
def get_user_v1(user_id):
    user = User.get(user_id)
    return jsonify(UserSerializer_V1().to_json(user))

# v2 endpoint
@app.route("/api/v2/users/<int:user_id>")
def get_user_v2(user_id):
    user = User.get(user_id)
    return jsonify(UserSerializer_V2().to_json(user))

Fourth, deprecation timeline:
12-Month Deprecation Plan:
Month 0 (Today):
- Announce v2 launch
- v1 will be deprecated in 12 months
- Email all API consumers
- Add deprecation warning header to v1 responses:
Warning: 299 - "API v1 is deprecated. Migrate to v2 by 2025-12-31"
Month 3:
- Email clients still on v1 (identify via logs)
- Offer migration support (office hours, docs)
- Track migration progress: 30% on v2
Month 6:
- Email v1 users again
- Warn: v1 sunset in 6 months
- Track: 60% on v2
Month 9:
- Final warning: v1 sunset in 3 months
- Personally contact large clients still on v1
- Track: 85% on v2
Month 12:
- Sunset v1 (return 410 Gone)
- Keep v1 code for 30 days (rollback safety)
- Track: 98%+ on v2
Month 13:
- Delete v1 code

Fifth, contract testing to prevent breaking changes:
JSON Schema Validation:
# v1 contract (OpenAPI/JSON Schema)
user_v1_schema = {
    "type": "object",
    "required": ["user_id", "name", "email", "created"],
    "properties": {
        "user_id": {"type": "integer"},
        "name": {"type": "string"},
        "email": {"type": "string", "format": "email"},
        "created": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"}
    }
}

# Contract test
def test_user_v1_contract():
    response = client.get('/api/v1/users/123')
    data = response.json()
    # Validate against schema
    validate(data, user_v1_schema)  # Fails if contract broken
    # Ensure v1 contract unchanged
    assert "user_id" in data  # Must have user_id, not id
    assert "name" in data     # Must have name, not first_name/last_name

Pact Contract Testing (for multiple consumers):
# Consumer (Mobile App) defines expected contract
from pact import Consumer, Provider

pact = Consumer('MobileApp').has_pact_with(Provider('UserAPI'))

(pact
 .upon_receiving('get user request')
 .with_request('GET', '/api/v1/users/123')
 .will_respond_with(200, body={
     'user_id': 123,
     'name': 'John Smith',
     'email': 'john@example.com',
     'created': '2024-01-01'
 }))

# Provider (API) must satisfy contract
# CI fails if API changes break consumer expectations

Sixth, cross-team coordination:
Migration Kickoff (Month 0):
1. Announce in eng-all channel
2. Create migration guide: docs.company.com/api-v2-migration
3. Breaking changes highlighted:
- user_id → id
- name → first_name + last_name
- created → created_at (ISO 8601)
4. Migration checklist:
☐ Update API base URL: /v1 → /v2
☐ Update field mappings
☐ Test in staging
☐ Deploy to production
☐ Monitor for errors

Office Hours (Months 1-6):
Weekly Zoom sessions:
- Answer migration questions
- Debug integration issues
- Provide code examples

Tracking Migration Progress:
-- Track API usage by version
SELECT
    version,
    COUNT(*) AS requests,
    COUNT(DISTINCT client_id) AS unique_clients
FROM api_logs
WHERE endpoint = '/users'
GROUP BY version;
Results (Month 6):
v1: 100K requests, 15 clients
v2: 200K requests, 35 clients
Action: Contact 15 v1 clients, offer help

Seventh, enforcing backward compatibility:
During Transition Period (Months 0-12):
Rule: v1 contract CANNOT change
Enforcement:
1. Contract tests in CI (fail if v1 schema changes)
2. Code review checklist: "Does this affect v1?"
3. Freeze v1 codebase (only security fixes allowed)

API Gateway for Routing:
API Gateway (Kong, AWS API Gateway):
/api/v1/* → Route to v1 backend
/api/v2/* → Route to v2 backend
Allows:
- Independent deployment of v1 and v2
- Traffic splitting for testing
- Gradual rollout of v2

Interview Score: 9/10
Why: Clear URL-based versioning strategy, parallel version support with code examples, 12-month deprecation timeline with monthly milestones, contract testing (JSON Schema + Pact), cross-team migration coordination, and backward compatibility enforcement during transition.
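A migrating client has to apply exactly the field mappings listed above (user_id → id, name → first_name/last_name, created → created_at). A minimal client-side adapter sketch — a hypothetical helper, not part of the API, which assumes "name" is a single "first last" string:

```python
from datetime import datetime, timezone

def adapt_v1_to_v2(v1):
    """Translate a v1 user payload into the v2 shape (illustrative only)."""
    first_name, _, last_name = v1["name"].partition(" ")
    created = datetime.strptime(v1["created"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return {
        "id": v1["user_id"],               # user_id → id
        "first_name": first_name,          # name split into two fields
        "last_name": last_name,
        "email": v1["email"],
        "created_at": created.isoformat(), # date → ISO 8601 timestamp
    }
```

Encoding the mapping in one function like this also gives the team a natural place to hang contract tests during the 12-month transition.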
Question 13: Backend Performance Bottlenecks and Memory Leaks in Production
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: All production systems
Question: “A high-traffic backend service (e.g., search or recommendations) has gradually increasing memory usage and GC pauses, leading to intermittent timeouts under peak load. How would you diagnose and fix a memory leak or performance regression in production? Explain your approach to profiling, heap analysis, sampling traces, and experimenting safely with fixes. How do you differentiate between application-level leaks, library issues, and infrastructure misconfiguration?”
1. What is This Question Testing?
- Performance Debugging: Can you diagnose memory leaks in production?
- Profiling Tools: Do you know heap dumps, GC logs, profilers?
- Root Cause Analysis: Can you distinguish app vs library vs infrastructure issues?
- Safe Experimentation: Can you test fixes without impacting customers?
- Memory Management: Do you understand GC behavior, memory allocation patterns?
2. The Answer
Answer:
I’d use heap dumps, GC log analysis, and sampling profilers to identify memory leaks, validate with controlled experiments, then deploy fixes gradually with canary releases.
First, symptoms and initial triage:
Observed Symptoms:
1. Memory usage: Gradually increasing from 2GB → 6GB → 8GB (OOM kill)
2. GC pauses: Increasing from 50ms → 500ms → 2 seconds
3. Timeouts: p99 latency 100ms → 5 seconds during GC pauses
4. Pattern: Happens after ~6 hours uptime
Timeline:
00:00 - Service starts, memory at 2GB
06:00 - Memory at 4GB, GC pauses 200ms
12:00 - Memory at 7GB, GC pauses 1 second
14:00 - OOM kill, restart, cycle repeats

Second, heap dump analysis:
Capture Heap Dump (Java example):
# During high memory usage
jmap -dump:live,format=b,file=heap.hprof <pid>

# Or automatic on OOM
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heap.hprof

Analyze with Eclipse MAT (Memory Analyzer Tool):
Top Memory Consumers:
1. HashMap<String, User> cache: 4.2GB (53% of heap!)
- 2,000,000 entries
- Growing unbounded
- Never evicted
2. ArrayList<Request> requestLog: 800MB (10% of heap)
- 500,000 entries
- Accumulating without limit
3. Normal objects: 3GB (37% of heap)

Leak Suspects Report:
Suspected leak:
- HashMap<String, User> cache
- Accumulator: CacheManager class
- Problem: No size limit, no TTL, no eviction policy
Dominator tree shows:
CacheManager → HashMap → 2M User objects → 4.2GB

Third, GC log analysis:
Enable GC Logging:
# Java
-Xlog:gc*:file=gc.log:time,uptime,level,tags

# Python
PYTHONTRACEMALLOC=1

Analyze GC Pattern:
GC Log Analysis:
Time=00:00, YoungGC: 50ms, OldGC: N/A, Heap: 2GB
Time=06:00, YoungGC: 100ms, OldGC: 500ms, Heap: 4GB
Time=12:00, YoungGC: 200ms, FullGC: 2000ms, Heap: 7GB
Pattern:
- Young GC frequency increasing (more objects created)
- Full GC happening (old gen filling up)
- Heap not shrinking after GC (objects still referenced = leak)
Conclusion: Objects accumulating in old generation (memory leak)

Fourth, profiling in production (safe):
Sampling Profiler (Low Overhead):
# Python: py-spy (sampling profiler, <1% overhead)
py-spy record --pid <pid> --output profile.svg --duration 60
# Samples call stacks to show where time is spent;
# pair with tracemalloc for per-line allocation data

Profiling Results:
Top Memory Allocators:
1. cache_user() - 60% of allocations
- Called 10,000× per second
- Allocates User object each call
- Never frees old entries
2. log_request() - 15% of allocations
- Appends to unbounded list
- List grows indefinitely

Fifth, identifying root cause:
Code Review of Suspect:
# BAD: Memory leak (unbounded cache)
cache = {}  # Global dictionary, never cleared

def get_user(user_id):
    if user_id not in cache:
        # Fetch from DB
        user = db.query(f"SELECT * FROM users WHERE id = {user_id}")
        cache[user_id] = user  # LEAK: Cache grows forever
    return cache[user_id]

# After 12 hours at 10K RPS:
# 10K requests/sec × 43,200 sec = 432M requests
# Even if only 5% unique users = 21M cache entries = OOM

Differentiate: App vs Library vs Infrastructure
Application-level leak (This case):
- Code explicitly creates unbounded data structures
- Fix: Application code changes
Library leak:
- Third-party library not releasing resources
- Example: HTTP client not closing connections
- Fix: Update library or workaround
Infrastructure misconfiguration:
- JVM heap too small for workload
- Example: -Xmx2g for service needing 4GB
- Fix: Increase heap size (but doesn't fix leak, just delays it)

Sixth, implementing the fix:
Fix 1: Bounded Cache with LRU Eviction:
# GOOD: Bounded cache with automatic eviction
from functools import lru_cache

@lru_cache(maxsize=10000)  # Maximum 10K entries
def get_user(user_id):
    return db.query(f"SELECT * FROM users WHERE id = {user_id}")

# LRU automatically evicts least-recently-used entries
# Memory bounded to ~10K users × 2KB/user = 20MB

Fix 2: TTL-Based Cache:
# Alternative: Time-based expiration
from cachetools import TTLCache

cache = TTLCache(maxsize=10000, ttl=300)  # 5 min TTL

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = db.query(f"SELECT * FROM users WHERE id = {user_id}")
    return cache[user_id]

Fix 3: External Cache (Redis):
def get_user(user_id):
    # Try Redis first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache miss
    user = db.query(f"SELECT * FROM users WHERE id = {user_id}")
    # Store in Redis with TTL
    redis.setex(f"user:{user_id}", 300, json.dumps(user))
    return user

# Redis handles eviction automatically (maxmemory-policy allkeys-lru)

Seventh, safe deployment and validation:
Canary Deployment (20% traffic):
1. Deploy fixed version to 20% of instances
2. Monitor memory usage for 12 hours:
- Old instances: Memory grows to 7GB → OOM
- Canary instances: Memory stable at 2.5GB ✓
3. Validate metrics:
- GC pauses: 50ms (down from 2 seconds) ✓
- p99 latency: 100ms (down from 5 seconds) ✓
- Error rate: 0.1% (unchanged) ✓
4. Rollout to 100% after 24 hours

Before/After Comparison:
Before (with leak):
- Memory: 2GB → 8GB over 12 hours
- GC frequency: Every 10 seconds (Full GC)
- GC pause: Up to 2 seconds
- Instance lifetime: 14 hours (OOM kill)
After (with fix):
- Memory: Stable at 2.5GB
- GC frequency: Every 60 seconds (Young GC only)
- GC pause: 50ms
- Instance lifetime: Unlimited (no OOM)

Interview Score: 9/10
Why: Complete diagnostic workflow (heap dump + GC logs + profiling), clear differentiation between app/library/infrastructure leaks, three fix approaches (LRU cache, TTL, Redis), safe canary deployment validation, and before/after metrics showing impact.
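The heap-dump workflow above is Java-centric; in Python the same "diff two snapshots" idea can be sketched with the stdlib tracemalloc module (illustrative in-process sketch — a real service would snapshot on a timer or admin endpoint):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulate a leak: ~1MB of objects that stay referenced
leaked = [bytearray(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
# Diff the snapshots: the biggest positive size_diff points at the leak site
stats = after.compare_to(before, "lineno")
growth = sum(s.size_diff for s in stats)
print(f"net growth: {growth} bytes")  # dominated by the ~1MB list above
```

Unlike a one-off heap dump, the snapshot diff isolates what grew *between* two points in time, which is exactly the question a gradually climbing memory graph raises.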
Question 14: Security, Identities, and Authz at Scale
Difficulty: Very High
Role: Senior Backend Engineer / Security
Level: Senior/Staff (L5-L7, 5-10 Years of Experience)
Company Examples: Auth0, Okta, Stripe
Question: “Design a secure authentication and authorization architecture for a multi-tenant SaaS platform with public APIs. How would you combine OAuth2/OIDC, JWT, API gateways, and service-to-service auth (mTLS, service accounts) to protect resources, implement fine-grained authorization, and make rollout safe? Discuss token lifetimes, refresh flows, revocation, and how you’d instrument the system to detect abuse or privilege escalation attempts.”
1. What is This Question Testing?
- Security Architecture: Can you design auth for multi-tenant SaaS platforms?
- OAuth2/OIDC: Do you understand modern authentication protocols?
- Service-to-Service Auth: Can you secure internal API calls with mTLS?
- Token Management: Do you know JWT lifecycles, refresh flows, revocation?
- Abuse Detection: Can you instrument systems to catch privilege escalation?
2. The Answer
Answer:
I’d use OAuth2 authorization code flow + JWT for user auth, mTLS for service-to-service, RBAC + ABAC for authorization, and anomaly detection for abuse prevention.
First, user authentication flow (OAuth2 + OIDC):
OAuth2 Authorization Code Flow (Most Secure):
1. User clicks "Login" → Redirect to auth.company.com/authorize
2. User authenticates (username/password + MFA)
3. Auth server returns authorization code to callback URL
4. Frontend exchanges code for tokens:
POST /oauth/token
{
"grant_type": "authorization_code",
"code": "abc123",
"redirect_uri": "https://app.company.com/callback"
}
5. Response:
{
"access_token": "eyJhbGc..." (JWT, 15 min),
"refresh_token": "rt_abc123" (opaque, 30 days),
"id_token": "eyJhbGc..." (OIDC, user identity)
}
6. Frontend stores tokens:
- Access token: Memory only (not localStorage, XSS risk)
- Refresh token: HttpOnly cookie (CSRF protected)

JWT Structure:
// Access Token (JWT)
{
  "header": { "alg": "RS256", "typ": "JWT" },
  "payload": {
    "sub": "user_123",
    "tenant_id": "tenant_abc",
    "roles": ["admin", "editor"],
    "permissions": ["users:read", "users:write"],
    "exp": 1640000000,  // 15 min expiry
    "iat": 1639999100
  },
  "signature": "..."
}

Second, API gateway validation:
# API Gateway (Kong, AWS API Gateway, custom)
class APIGateway:
    def validate_request(self, request):
        # Extract JWT from Authorization header
        auth_header = request.headers.get("Authorization")
        if not auth_header or not auth_header.startswith("Bearer "):
            return error("Unauthorized", 401)
        token = auth_header.replace("Bearer ", "")
        # Verify JWT signature (RS256 with public key)
        try:
            payload = jwt.decode(
                token,
                public_key,
                algorithms=["RS256"],
                options={"verify_exp": True}  # Check expiration
            )
        except jwt.ExpiredSignatureError:
            return error("Token expired", 401)
        except jwt.InvalidTokenError:
            return error("Invalid token", 401)
        # Check token not revoked (Redis blacklist)
        if redis.exists(f"revoked:{payload['jti']}"):  # jti = JWT ID
            return error("Token revoked", 401)
        # Attach user context to request
        request.user = {
            "user_id": payload["sub"],
            "tenant_id": payload["tenant_id"],
            "roles": payload["roles"],
            "permissions": payload["permissions"]
        }
        # Route to backend service
        return forward_to_backend(request)

Third, fine-grained authorization (RBAC + ABAC):
RBAC (Role-Based Access Control):
# Simple role check
def get_users(request):
    if "admin" not in request.user["roles"]:
        return error("Forbidden", 403)
    # Admin can access
    users = User.query.all()
    return jsonify(users)

ABAC (Attribute-Based Access Control):
# Policy: Users can only edit their own tenant's data
def update_user(request, user_id):
    target_user = User.get(user_id)
    # Check permission
    if not has_permission(request.user, "users:write", target_user):
        return error("Forbidden", 403)
    # Update user
    target_user.update(request.json)
    return jsonify(target_user)

def has_permission(current_user, permission, resource):
    # Check permission exists
    if permission not in current_user["permissions"]:
        return False
    # Check tenant isolation (ABAC attribute)
    if resource.tenant_id != current_user["tenant_id"]:
        return False  # Can't access other tenants' data
    return True

Fourth, service-to-service auth (mTLS):
Why mTLS over JWT for internal services:
JWT issues for service-to-service:
- Need to manage service accounts, rotate secrets
- Adds latency (sign, verify tokens)
- Token expiry handling
mTLS benefits:
- Certificate-based mutual authentication
- No tokens to manage
- Lower latency
- Already have TLS infrastructure

mTLS Setup:
Each service has:
1. Client certificate (signed by internal CA)
2. Private key
3. CA certificate (to verify peers)
Service A calling Service B:
1. TLS handshake with mutual cert verification
2. Both sides verify peer certificate
3. Connection established only if both valid
4. No JWT needed!

Implementation:
# Service A (caller)
import requests

response = requests.get(
    "https://service-b.internal/api/data",
    cert=("/path/to/client-cert.pem", "/path/to/client-key.pem"),
    verify="/path/to/ca-cert.pem"  # Verify Service B's cert
)

# Service B (receiver) - Nginx config
server {
    listen 443 ssl;
    ssl_certificate     /path/to/server-cert.pem;
    ssl_certificate_key /path/to/server-key.pem;
    # Require client cert
    ssl_client_certificate /path/to/ca-cert.pem;
    ssl_verify_client on;
    location /api {
        # Only requests with valid cert reach here
        proxy_pass http://backend;
    }
}

Fifth, token lifecycle management:
Access Token Refresh:
# When access token expires (15 min)
def refresh_access_token(refresh_token):
    # Verify refresh token
    stored = redis.get(f"refresh_token:{refresh_token}")
    if not stored:
        return error("Invalid refresh token", 401)
    user_id = json.loads(stored)["user_id"]
    # Issue new access token
    access_token = jwt.encode({
        "sub": user_id,
        "tenant_id": get_user_tenant(user_id),
        "roles": get_user_roles(user_id),
        "exp": time.time() + 900  # 15 min
    }, private_key, algorithm="RS256")
    # Rotate refresh token (security best practice)
    new_refresh_token = generate_secure_token()
    redis.delete(f"refresh_token:{refresh_token}")
    redis.setex(
        f"refresh_token:{new_refresh_token}",
        2592000,  # 30 days
        json.dumps({"user_id": user_id})
    )
    return {
        "access_token": access_token,
        "refresh_token": new_refresh_token
    }

Token Revocation:
# Logout - revoke tokens
def logout(request):
    access_token = extract_token(request)
    # Don't verify signature, just extract claims
    payload = jwt.decode(access_token, options={"verify_signature": False})
    # Add to blacklist (TTL = remaining token lifetime)
    ttl = int(payload["exp"] - time.time())
    redis.setex(f"revoked:{payload['jti']}", ttl, "1")
    # Delete refresh token
    refresh_token = request.cookies.get("refresh_token")
    redis.delete(f"refresh_token:{refresh_token}")
    return {"status": "logged out"}

Sixth, abuse detection and instrumentation:
Anomaly Detection:
# Track API usage per user
def track_api_call(user_id, endpoint, response_time):
    # Increment counter (INCR returns the new integer value)
    key = f"api_usage:{user_id}:{date.today()}"
    count = redis.incr(key)
    redis.expire(key, 86400)  # 24 hours
    # Check rate (simple anomaly detection)
    if count > 10000:  # 10K calls per day
        alert(f"User {user_id} exceeded normal API usage: {count} calls")
        # Optional: Temporary rate limit or flag for review

# Track failed auth attempts
def track_failed_login(user_id, ip_address):
    key = f"failed_login:{user_id}"
    failed_count = redis.incr(key)
    redis.expire(key, 3600)  # 1 hour
    if failed_count >= 5:
        # Lock account temporarily
        redis.setex(f"locked:{user_id}", 1800, "1")  # 30 min lockout
        alert(f"Account locked due to failed login attempts: {user_id}")

Privilege Escalation Detection:
# Log all permission changes
def grant_permission(admin_user_id, target_user_id, permission):
    # Verify admin has permission to grant
    if "admin" not in get_user_roles(admin_user_id):
        alert(f"Unauthorized permission grant attempt: {admin_user_id}")
        return error("Forbidden", 403)
    # Grant permission
    add_user_permission(target_user_id, permission)
    # Audit log
    audit_log.create({
        "event": "permission_granted",
        "admin_user_id": admin_user_id,
        "target_user_id": target_user_id,
        "permission": permission,
        "timestamp": time.time()
    })
    # Alert if granting admin permission
    if permission == "admin":
        alert(f"Admin permission granted: {admin_user_id} → {target_user_id}")
Why: OAuth2 authorization code flow with JWT, API gateway validation with signature verification, RBAC + ABAC for fine-grained authorization, mTLS for service-to-service auth, token rotation and revocation strategies, and comprehensive abuse detection with anomaly tracking and audit logging.
Question 15: Leadership: Technical Debt, Legacy Modernization, and Mentoring
Difficulty: Very High
Role: Staff/Principal Engineer / EM
Level: Staff+ (L6-L8, 7-15 Years of Experience)
Company Examples: All companies with legacy systems
Question: “You’re a Staff Backend Engineer or EM inheriting a legacy monolith that is critical to revenue but accumulating severe technical debt. How do you prioritize refactors vs new feature delivery, communicate trade-offs with product, and coach junior developers on making architecture decisions that scale? Describe a concrete framework for deciding when to pay down debt (e.g., strangler fig pattern, safety rails, incremental modularization) and how you would measure the impact on reliability and velocity.”
1. What is This Question Testing?
- Leadership: Can you balance technical health with business needs?
- Communication: Can you explain technical debt to non-technical stakeholders?
- Mentorship: Can you coach juniors to make scalable decisions?
- Strategic Thinking: Do you have frameworks for prioritizing debt paydown?
- Measurement: Can you quantify impact on velocity and reliability?
2. The Answer
Answer:
I’d use the Strangler Fig pattern for incremental modernization, enforce a 70/30 feature-to-refactor ratio, measure impact via DORA metrics, and mentor through architecture design reviews.
First, assessing technical debt:
Debt Audit Framework:
1. Inventory debt:
- Monolithic deployments (deploy time >30 min)
- No test coverage (<20%)
- Hard-coded config (no feature flags)
- Manual operations (deployments, rollbacks)
- Performance issues (p99 >1s)
2. Categorize by impact:
HIGH: Blocks new features or causes outages
MEDIUM: Slows development velocity
LOW: Minor annoyance, no business impact
3. Estimate paydown cost:
- Engineering weeks required
- Risk level (could break production?)
- Dependencies (other teams affected?)

Example Debt Assessment:
HIGH Impact Debt (Fix first):
1. Monolithic database (single point of failure)
- Impact: Outages affect all features
- Cost: 12 eng-weeks to shard
- ROI: Prevents $500K/outage losses
2. No feature flags (can't rollback bad deploys)
- Impact: Each bad deploy = 2 hour outage
- Cost: 4 eng-weeks to implement
- ROI: Saves 10 hours/month incident response
MEDIUM Impact Debt:
3. Slow CI/CD (30 min builds)
- Impact: Slows feature delivery by 20%
- Cost: 6 eng-weeks to optimize
- ROI: Ship features 20% faster
LOW Impact Debt:
4. Inconsistent code style
- Impact: Minor code review friction
- Cost: 2 eng-weeks (linting setup)
- ROI: Small quality-of-life improvement

Second, prioritization framework (70/30 rule):
The Rule:
70% time: New features (product value)
30% time: Refactors/debt (engineering health)
Why this ratio:
- 100% features → Technical debt accumulates, velocity crashes
- 100% refactors → No customer value, business dies
- 70/30 → Sustainable balance

Communicating to Product:
"Here's the trade-off:
Option A: 100% features now
- Ship 10 features this quarter
- But: Deploy time increases 2× each quarter
- Result: In 12 months, we ship 2 features/quarter (80% slowdown)
Option B: 70% features, 30% refactors
- Ship 7 features this quarter
- But: Deploy time stays constant
- Result: In 12 months, we still ship 7 features/quarter (sustainable)
Investment: 30% refactors = 30% faster long-term feature delivery"

Third, Strangler Fig pattern for modernization:
Pattern: Build New Alongside Old, Gradually Migrate
Legacy Monolith:
┌─────────────────────────┐
│ Users Service │
│ Products Service │
│ Payments Service │
│ Orders Service │
│ (All in one codebase) │
└─────────────────────────┘
Step 1: Extract one service (Payments)
┌─────────────────────────┐ ┌──────────────────┐
│ Users Service │ │ Payments Service │
│ Products Service │◀────▶│ (New microservice)│
│ Orders Service │ └──────────────────┘
│ (Monolith) │
└─────────────────────────┘
Step 2: Route traffic via feature flag
- 10% payments → new service
- 90% payments → old monolith
- Gradually increase to 100%
Step 3: Repeat for next service
Continue until the monolith is fully strangled.

Implementation:
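The routing shim below assumes a `feature_flag_enabled` helper. One plausible implementation (a sketch, not from the original) rolls the dice per call, or hashes a stable key so a given order is routed consistently across retries:

```python
import hashlib
import random

def feature_flag_enabled(flag_name, percentage, key=None):
    """True for roughly `percentage`% of traffic.

    With a stable key (e.g. an order_id) the decision is deterministic per
    key, so retries of the same order never flip between code paths.
    """
    if key is None:
        return random.random() * 100 < percentage  # unkeyed: random sample
    digest = hashlib.sha256(f"{flag_name}:{key}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage  # bucket 0-99 per key

# Percentage bounds behave as expected regardless of key
assert feature_flag_enabled("new_payment_service", 100)
assert not feature_flag_enabled("new_payment_service", 0)
```

Raising the rollout from 10% to 100% is then a config change, not a deploy, which is exactly what makes the strangler migration safe to pause or reverse.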
# Monolith code (transition state)
def process_payment(order_id, amount):
    # Feature flag: route to the new service or the old code path
    if feature_flag_enabled("new_payment_service", percentage=10):
        # Call the new microservice
        response = requests.post(
            "https://payments.internal/api/charge",
            json={"order_id": order_id, "amount": amount}
        )
        return response.json()
    else:
        # Old monolith code (legacy path)
        return legacy_process_payment(order_id, amount)

Fourth, measuring impact (DORA metrics):
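Deployment frequency and MTTR can be computed directly from deploy and incident logs. A hedged sketch (the event shapes are assumptions, not an established API):

```python
from datetime import datetime, timedelta

def deploys_per_week(deploy_times, window_days=28):
    """Deployment frequency over a trailing window of deploy timestamps."""
    cutoff = max(deploy_times) - timedelta(days=window_days)
    recent = [t for t in deploy_times if t >= cutoff]
    return len(recent) / (window_days / 7)

def mttr_minutes(incidents):
    """Mean time to recover, from (started, resolved) timestamp pairs."""
    durations = [(resolved - started).total_seconds() / 60
                 for started, resolved in incidents]
    return sum(durations) / len(durations)

now = datetime(2024, 1, 29)
deploys = [now - timedelta(days=d) for d in range(0, 28, 7)]  # 4 deploys in 28 days
incidents = [(now, now + timedelta(minutes=10)),
             (now, now + timedelta(minutes=20))]
print(deploys_per_week(deploys))  # 1.0 deploy/week
print(mttr_minutes(incidents))    # 15.0 minutes
```

Emitting these from CI and incident tooling gives the before/after numbers below without manual bookkeeping.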
DORA Metrics (DevOps Research and Assessment):
1. Deployment Frequency
- Before refactor: 2 deploys/week
- After refactor: 10 deploys/week
- Improvement: 5× faster shipping
2. Lead Time for Changes
- Before: 2 weeks (code → production)
- After: 2 days
- Improvement: 7× faster
3. Change Failure Rate
- Before: 20% (1 in 5 deploys breaks production)
- After: 5%
- Improvement: 4× more reliable
4. Mean Time to Recover (MTTR)
- Before: 4 hours (manual rollback)
- After: 10 minutes (automated rollback)
- Improvement: 24× faster recovery

Tracking Progress:
Dashboard:
- Deploy frequency: [Chart showing increase]
- MTTR: [Chart showing decrease]
- Test coverage: 20% → 60%
- Build time: 30 min → 5 min
Business Impact:
- Features shipped/quarter: 5 → 10 (2× velocity)
- Outage hours/month: 8 → 1 (8× more reliable)
- Customer satisfaction (NPS): 40 → 65

Fifth, mentoring junior developers:
Mentorship Framework:
1. Architecture Design Reviews (ADR):
Process:
- Junior proposes solution in design doc
- Staff engineer reviews async
- 30-min sync to discuss trade-offs
Example:
Junior: "I'll use Redis for session storage"
Mentor: "Good choice. Consider:
- What happens if Redis goes down? (Fallback to DB?)
- How will you handle Redis cluster failover? (Client-side retry?)
- TTL strategy? (Match session expiry)
Let's update the doc with failure modes."

2. Pairing on Complex Refactors:
Weekly pairing session:
- Junior drives (writes code)
- Senior navigates (suggests approach)
- Teaches patterns in real-time
Example refactor:
"Let's extract this 500-line function together.
First, identify the core responsibility (payment processing).
Then extract dependencies (DB, external APIs).
Finally, write tests before refactoring (safety net)."

3. Code Review as Teaching:
Instead of: "This is bad, fix it"
Teach: "This works, but consider scalability:
Current:
for user in users:
    send_email(user)  # sends 10K emails serially, one blocking call at a time
Better:
batch_send_emails(users) # Batch API, sends 10K in parallel
Why? 10K serial emails = 10K × 100ms = 16 minutes
Batch: 10K / 100 per batch = 100 batches × 500ms = 50 seconds
Trade-off: Batch is 20× faster but more complex. Worth it for 10K+ users."

4. Safe Experimentation Environment:
Give juniors low-risk projects to learn:
- Internal tools (not customer-facing)
- Feature flags (easy to disable if broken)
- Code reviews before merge (safety net)
Example: "Build a new admin dashboard using microservices.
If it fails, no customer impact. But you'll learn:
- Service communication
- API design
- Database design
Great learning opportunity with low risk."

Sixth, communicating trade-offs to product:
Framework: Cost-Benefit in Business Terms
Product asks: "Why can't we ship feature X faster?"
Tech answer: "We have technical debt in the payment system."
Product-friendly answer:
"Our payment code is complex (10,000 lines in one file).
Adding feature X would take 4 weeks and risk breaking existing payments.
If we refactor first (2 weeks), we can:
1. Add feature X safely in 1 week (total: 3 weeks)
2. All future payment features ship 2× faster
3. Reduce payment errors by 50% (better customer experience)
ROI: 2-week refactor investment = 1 week saved on feature X + faster shipping forever"

Interview Score: 9/10
Why: Debt audit framework with HIGH/MED/LOW categorization, 70/30 feature-to-refactor rule with product communication, Strangler Fig pattern for gradual modernization, DORA metrics for measuring impact, comprehensive mentorship framework (ADRs, pairing, code review teaching), and business-friendly trade-off communication.