Backend Developer Interview Questions & Answers
Question 1: Rate Limiting & Throttling for Hot APIs at Scale
Difficulty: Very High
Role: Senior Backend Engineer
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: Stripe, AWS, Shopify, PayPal
Question: “You’re a Senior Backend Engineer on a payments or API platform team (Stripe-style). Design a multi-tenant, globally distributed rate limiting system that enforces per-customer, per-endpoint, and per-region limits with burst handling. How would you implement this using Redis or another fast store, handle race conditions in a distributed setting, and expose observability for SRE/on-call? Discuss trade-offs between fixed window, sliding window, and token bucket, and how your design changes when limits must be enforced at the API gateway vs inside downstream services.”
1. What is This Question Testing?
This question tests critical Senior Backend Engineer competencies:
- Distributed Systems Design: Can you build rate limiters that work correctly across multiple servers?
- Algorithm Knowledge: Do you understand token bucket, sliding window, and their trade-offs?
- Production Readiness: Can you handle race conditions, observability, and SRE requirements?
- Multi-Tenancy: Can you isolate limits per customer without centralized bottlenecks?
- Trade-off Analysis: Can you justify architectural choices (gateway vs service-level enforcement)?
The interviewer wants to see if you’re a Senior Backend Engineer who can design production-grade infrastructure, not just implement basic algorithms.
2. Framework to Answer This Question
Use the “Algorithm → Architecture → Scale Framework” with these components:
Structure:
1. Rate Limiting Algorithms - Token bucket, sliding window, fixed window comparison
2. System Architecture - Redis-based design for multi-tenant, global distribution
3. Race Condition Handling - Lua scripts, atomic operations, consistency trade-offs
4. Observability - Metrics, logging, SLOs for SRE teams
5. Gateway vs Service Enforcement - Trade-offs and hybrid approach
Key Principles:
- Start with algorithm choice justification
- Design for distributed correctness
- Prioritize observability and debuggability
- Discuss trade-offs explicitly
3. The Answer
Answer:
I’d design a Redis-based token bucket rate limiter with Lua scripts for atomic operations, deployed at both API gateway and service levels with different enforcement policies. Let me walk through the complete design.
First, choosing the rate limiting algorithm:
Token Bucket is the best choice for API platforms because it allows controlled bursts while maintaining average rate limits—critical for payment APIs where legitimate traffic spikes occur.
Algorithm Comparison:
Fixed Window:
Time window: 0-60s allows 100 requests
Problem: "Window reset attack"
- At 59s: Send 100 requests (allowed)
- At 61s (1s into the next window): Send 100 requests (allowed)
- Result: 200 requests in 2 seconds!

Sliding Window:
Counts requests in the past 60 seconds continuously
Pros: Smooth rate limiting, no reset attacks
Cons: Memory intensive (stores all timestamps), complex to distribute

Token Bucket (My Choice):
Bucket holds tokens (max = burst limit)
Tokens refill at fixed rate (e.g., 100/minute)
Request consumes 1 token
Pros: Allows controlled bursts, memory efficient, simple distributed implementation
Cons: Burst can temporarily exceed average (acceptable for APIs)

For Stripe-style APIs:
- Allow burst: 200 requests
- Refill rate: 100 tokens/minute
- Result: Client can burst 200 instantly, then throttled to 100/min
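As a sanity check on those numbers, here is a minimal single-process sketch of the same bucket (a hypothetical `TokenBucket` class driven by an explicit clock, not the Redis implementation described below):

```python
class TokenBucket:
    """Minimal sketch of the config above: capacity 200, refill 100 tokens/min."""

    def __init__(self, capacity=200, refill_per_sec=100 / 60):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # bucket starts full
        self.last_refill = 0.0

    def allow(self, now, cost=1):
        # Refill in proportion to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket()
# Burst of 250 requests at t=0: only the 200-token capacity passes
allowed_burst = sum(bucket.allow(now=0.0) for _ in range(250))
# One minute later, roughly 100 tokens have refilled
allowed_later = sum(bucket.allow(now=60.0) for _ in range(250))
print(allowed_burst)  # 200
```

The production design keeps this state in Redis per key; the refill arithmetic is the same.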
Second, Redis-based distributed architecture:
Why Redis:
- Sub-millisecond latency (<1ms for 99th percentile)
- Atomic operations via Lua scripts (no race conditions)
- Built-in expiration (TTL) for rate limit windows
- Scales to millions of keys (per-customer, per-endpoint)
Data Model:
Key: rate_limit:{customer_id}:{endpoint}:{region}
Value: {
"tokens": 95, // Current tokens
"last_refill": 1638360000 // Unix timestamp
}
TTL: 3600 seconds (auto-expire if inactive)

Third, handling race conditions with Lua scripts:
The Problem:
Two requests arrive simultaneously from different servers:
Server A: Read tokens (100) → Allow request → Write tokens (99)
Server B: Read tokens (100) → Allow request → Write tokens (99)
Result: Both allowed, but should be 98 tokens (race condition!)

The Solution - Atomic Lua Script:
-- token_bucket.lua (runs atomically in Redis)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])   -- 200
local refill_rate = tonumber(ARGV[2])  -- 100/min = 1.67/sec
local cost = tonumber(ARGV[3])         -- 1

local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or max_tokens
local last_refill = tonumber(state[2]) or 0

-- Calculate tokens to add based on time elapsed
local now = tonumber(redis.call('TIME')[1])
local elapsed = now - last_refill
local tokens_to_add = math.floor(elapsed * refill_rate)
tokens = math.min(max_tokens, tokens + tokens_to_add)

-- Check if request allowed
if tokens >= cost then
  tokens = tokens - cost
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 3600)
  return {1, tokens}  -- Allowed, remaining tokens
else
  return {0, tokens}  -- Denied, remaining tokens
end

Why Lua solves race conditions:
- Entire script runs atomically (Redis is single-threaded)
- No other operation can execute between read and write
- Guarantees correctness in distributed environment
Fourth, multi-tenant isolation:
Per-Customer, Per-Endpoint, Per-Region Limits:
Customer A: /api/payments → 1000 req/min (global)
Customer B: /api/payments → 100 req/min (global)
Customer A: /api/payments → 300 req/min (us-east-1)

Key Design:
rate_limit:{customer_a}:{payments}:global
rate_limit:{customer_a}:{payments}:us-east-1
rate_limit:{customer_b}:{payments}:global

Hierarchical Enforcement:
1. Check region limit first (fastest, local Redis cluster)
2. If allowed, check global limit (cross-region Redis with replication lag tolerance)
3. Accept eventual consistency for global (99.9% accurate, trade-off for speed)
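A hedged sketch of that hierarchy, with trivial in-memory quotas standing in for the regional and global Redis clusters (all class names hypothetical):

```python
class FixedQuota:
    """Trivial counter standing in for a real rate limiter backend."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def allow(self, cost=1):
        if self.used + cost > self.limit:
            return False
        self.used += cost
        return True

class HierarchicalLimiter:
    """Check the cheap regional limit first, then the cross-region global one.
    (A production version would refund the regional token if the global check fails.)"""

    def __init__(self, regional, global_quota):
        self.regional = regional          # local cluster, sub-millisecond
        self.global_quota = global_quota  # cross-region, eventually consistent

    def allow(self, cost=1):
        if not self.regional.allow(cost):
            return False
        return self.global_quota.allow(cost)

limiter = HierarchicalLimiter(FixedQuota(300), FixedQuota(1000))
granted = sum(limiter.allow() for _ in range(500))
print(granted)  # the 300 req/min regional cap binds first
```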
Fifth, observability for SRE:
Metrics (Prometheus/Datadog):
rate_limit_requests_total{customer, endpoint, result="allowed|denied"}
rate_limit_latency_seconds{quantile="0.5|0.99"}
rate_limit_redis_errors_total
rate_limit_tokens_remaining{customer, endpoint}

Logging:
{
  "event": "rate_limit_exceeded",
  "customer_id": "cust_abc123",
  "endpoint": "/api/payments",
  "region": "us-east-1",
  "tokens_remaining": 0,
  "retry_after_seconds": 15
}

Alerting SLOs:
- Redis latency p99 < 5ms (alert if > 10ms for 5 min)
- Rate limit error rate < 0.01% (alert if > 0.1%)
- Redis availability > 99.99%
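As one possible encoding, the first SLO above could become a Prometheus alerting rule like the following (assuming `rate_limit_latency_seconds` is exported as a histogram; the rule and label names are illustrative):

```yaml
groups:
  - name: rate-limiter-slo
    rules:
      - alert: RateLimitRedisLatencyHigh
        # p99 over the last 5 minutes, computed from histogram buckets
        expr: |
          histogram_quantile(0.99,
            sum(rate(rate_limit_latency_seconds_bucket[5m])) by (le)) > 0.010
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Rate limiter Redis p99 latency above 10ms for 5 minutes"
```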
Sixth, gateway vs service-level enforcement:
API Gateway Enforcement (Tier 1):
- Pros: Protects all downstream services, fail-fast, low latency cost
- Cons: Coarse-grained (can’t differentiate between expensive vs cheap endpoints)
- Use case: DDoS protection, customer-level quotas
Service-Level Enforcement (Tier 2):
- Pros: Fine-grained control (expensive DB queries get tighter limits)
- Cons: Adds latency to every service call, bypassed if gateway compromised
- Use case: Resource-specific limits (e.g., report generation endpoints)
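Fast rejection at either tier usually pairs with an HTTP 429 plus a `Retry-After` hint derived from the bucket's refill rate. A small illustrative helper (function and header conventions assumed, not part of the original design):

```python
import math

def throttle_response(tokens, refill_per_sec, cost=1):
    """Build a 429 response when the bucket lacks `cost` tokens.
    (Hypothetical helper; headers follow common API conventions.)"""
    deficit = cost - tokens
    retry_after = math.ceil(deficit / refill_per_sec)  # seconds until enough tokens refill
    return {
        "status": 429,
        "headers": {
            "Retry-After": str(retry_after),
            "X-RateLimit-Remaining": str(max(0, math.floor(tokens))),
        },
        "body": {"error": "rate_limit_exceeded", "retry_after_seconds": retry_after},
    }

# Empty bucket refilling at 100 tokens/min: retry in 1 second
resp = throttle_response(tokens=0, refill_per_sec=100 / 60)
print(resp["headers"]["Retry-After"])  # 1
```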
My Hybrid Approach:
API Gateway:
- Enforce customer-level global limits (broad DDoS protection)
- Fast rejection (reject at edge, save compute downstream)
Payment Service:
- Enforce endpoint-specific limits (e.g., /refunds limited tighter than /charges)
- Context-aware limits (e.g., higher limits for verified merchants)

Handling Distributed Race Conditions Trade-offs:
Strict Consistency (Not Recommended):
- Single Redis instance globally → bottleneck, high latency
- Distributed lock (e.g., Redlock) → adds 10-50ms latency, complex
Eventual Consistency (Recommended):
- Regional Redis clusters with async replication
- Accept 0.1-1% over-limit during replication lag
- Trade-off justified: occasionally allowing 100.5 requests/min instead of 100.0 is acceptable; adding 50ms of latency instead of 1ms is not
Handling Burst Traffic:
Normal: 100 req/min average
Burst allowed: 200 req/min for 10 seconds
Token bucket config:
- Capacity: 200 tokens
- Refill: 100 tokens/min (1.67/sec)
Real scenario:
00:00 - Idle (200 tokens available)
00:01 - Burst: 200 requests in 1 second (0 tokens)
00:02-00:12 - Refill 1.67 tokens/sec = 16.7 tokens (allow ~17 requests)
00:13+ - Back to steady state

Interview Score: 9/10
Why: Algorithm choice with clear justification (token bucket), atomic Lua script solving race conditions, multi-tenant isolation design, observability/SLO focus, hybrid gateway+service enforcement with trade-off discussion, and production-ready Redis architecture.
Question 2: Idempotency, Webhooks, and Double-Payment Avoidance
Difficulty: Very High
Role: Senior Backend Engineer (Payments)
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: Stripe, PayPal, Razorpay, Airbnb
Question: “You’re designing the payments subsystem for a marketplace like Airbnb or a PSP like Stripe/PayPal. Users frequently see ‘money debited but order pending’ due to client/network failures. Design an idempotent payment API plus webhook-based confirmation flow that guarantees no double charge and consistent order state, even with retries, delayed/missing webhooks, and out-of-order events. How do you choose and store idempotency keys, structure database transactions, and recover from partial failures?”
1. What is This Question Testing?
This question tests critical payment systems competencies:
- Idempotency Design: Can you prevent duplicate charges despite retries?
- Distributed Transaction Handling: Can you maintain consistency across payment gateway and your DB?
- Webhook Reliability: Can you handle delayed, missing, or out-of-order webhook events?
- Failure Recovery: Can you reconcile “money debited but order pending” states?
- Database Transaction Design: Do you understand ACID properties for payment workflows?
The interviewer wants to see if you can build production-grade payment systems that handle real-world failures gracefully.
2. Framework to Answer This Question
Use the “Idempotency → Webhooks → Reconciliation Framework”:
Structure:
1. Idempotency Key Design - How to generate, validate, and store keys
2. Payment API Flow - ACID transaction structure with idempotency
3. Webhook Handling - Delayed, duplicate, out-of-order event processing
4. Reconciliation - Periodic jobs to fix “pending” states
5. Failure Scenarios - Network failures, timeout handling, retry logic
3. The Answer
Answer:
I’d design a client-generated idempotency key system with database-level deduplication, webhook-driven state machine, and background reconciliation jobs. This prevents double charges even with retries and network failures.
First, idempotency key design:
Why Client-Generated Keys:
Client (mobile app) crashes mid-request.
Server-generated ID → client has no way to retry safely (might double-charge)
Client-generated ID → client can safely retry with same ID

Key Format:
Idempotency-Key: {user_id}_{timestamp}_{random}
Example: user_123_1638360000_a7b3f2
Constraints:
- Maximum 255 characters
- Valid for 24 hours (prevent key reuse attacks)
- Stored with payment attempt

Database Schema:
CREATE TABLE payments (
id UUID PRIMARY KEY,
idempotency_key VARCHAR(255) UNIQUE NOT NULL,
user_id BIGINT NOT NULL,
amount DECIMAL(10,2) NOT NULL,
status ENUM('pending', 'processing', 'succeeded', 'failed') NOT NULL,
payment_gateway_id VARCHAR(255), -- Stripe payment_intent_id
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
INDEX idx_idempotency (idempotency_key, created_at)
);
CREATE TABLE payment_events (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
payment_id UUID NOT NULL,
event_type VARCHAR(50) NOT NULL, -- webhook.received, payment.succeeded
event_id VARCHAR(255) UNIQUE, -- Stripe event_id (for dedup)
payload JSON,
processed_at TIMESTAMP DEFAULT NOW(),
INDEX idx_payment_events (payment_id, processed_at)
);

Second, idempotent payment API flow:
POST /api/payments (with idempotency)
@app.post("/api/payments")
def create_payment(request):
    idempotency_key = request.headers["Idempotency-Key"]
    amount = request.json["amount"]
    user_id = request.user_id

    # Step 1: Check if payment with this key already exists
    with db.transaction():
        existing = Payment.get_by_idempotency_key(idempotency_key)
        if existing:
            # Idempotent response: return existing payment status
            if existing.created_at < now() - timedelta(hours=24):
                return error("Idempotency key expired", 400)
            return {
                "payment_id": existing.id,
                "status": existing.status,
                "amount": existing.amount
            }, 200  # Same response, safe to retry

        # Step 2: Create payment record in 'pending' state
        payment = Payment.create(
            idempotency_key=idempotency_key,
            user_id=user_id,
            amount=amount,
            status="pending"
        )
        db.commit()

    # Step 3: Call payment gateway (outside transaction)
    try:
        # Stripe API call (idempotent via idempotency_key)
        stripe_payment = stripe.PaymentIntent.create(
            amount=int(amount * 100),  # cents
            currency="usd",
            metadata={"payment_id": str(payment.id)},
            idempotency_key=idempotency_key  # Stripe's own idempotency
        )

        # Step 4: Update payment with gateway ID
        with db.transaction():
            payment.payment_gateway_id = stripe_payment.id
            payment.status = "processing"
            payment.save()
            db.commit()

        return {
            "payment_id": payment.id,
            "status": "processing",
            "client_secret": stripe_payment.client_secret
        }, 201

    except stripe.error.CardError as e:
        # Card declined
        with db.transaction():
            payment.status = "failed"
            payment.save()
            db.commit()
        return {"error": "Card declined"}, 400

    except Timeout as e:
        # Network timeout - payment may or may not have gone through!
        # Leave status as 'pending' for reconciliation job
        log.error(f"Timeout calling Stripe for payment {payment.id}")
        return {
            "payment_id": payment.id,
            "status": "pending",  # Client should check status later
            "message": "Payment processing, check status in 30s"
        }, 202  # Accepted, processing

Why This Works:
Scenario: Client retries due to timeout
Request 1: Idempotency-Key: abc123
- Creates payment in DB (status=pending)
- Calls Stripe → timeout
- Returns 202 "processing"
Request 2 (retry): Idempotency-Key: abc123
- DB lookup finds existing payment
- Returns same response: 202 "processing"
- No duplicate Stripe call!

Third, webhook handling for confirmation:
Why Webhooks:
- Synchronous API call may timeout before Stripe confirms charge
- Webhook is asynchronous confirmation from Stripe (payment succeeded/failed)
- Must handle: delays, duplicates, out-of-order delivery
POST /webhook/stripe (handles payment events)
@app.post("/webhook/stripe")
def stripe_webhook(request):
    payload = request.body
    sig_header = request.headers["Stripe-Signature"]

    # Step 1: Verify webhook signature (prevent spoofing)
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, webhook_secret
        )
    except ValueError:
        return error("Invalid payload", 400)
    except stripe.error.SignatureVerificationError:
        return error("Invalid signature", 400)

    event_id = event["id"]
    event_type = event["type"]
    payment_intent = event["data"]["object"]

    # Step 2: Deduplicate webhook (Stripe may send same event multiple times)
    with db.transaction():
        if PaymentEvent.exists(event_id=event_id):
            log.info(f"Duplicate webhook {event_id}, ignoring")
            return {"status": "ok"}, 200  # Idempotent!

        # Record webhook receipt
        PaymentEvent.create(
            payment_id=payment_intent.metadata["payment_id"],
            event_type=event_type,
            event_id=event_id,
            payload=payment_intent
        )
        db.commit()

    # Step 3: Update payment status based on event type
    with db.transaction():
        payment = Payment.get_by_gateway_id(payment_intent.id)
        if event_type == "payment_intent.succeeded":
            payment.status = "succeeded"
            create_order(payment)  # Create order
        elif event_type == "payment_intent.payment_failed":
            payment.status = "failed"
            notify_user_failure(payment)
        payment.updated_at = now()
        payment.save()
        db.commit()

    return {"status": "ok"}, 200

Handling Out-of-Order Webhooks:
Scenario: Webhook arrives BEFORE API response returns
Timeline:
00:00 - Client sends POST /api/payments
00:01 - Server calls Stripe API (success)
00:02 - Stripe sends webhook (payment.succeeded)
00:02 - Webhook handler updates DB: status=succeeded
00:03 - Stripe API returns to server (slow network)
00:03 - API handler tries to update status=processing (stale!)
Solution: Use updated_at timestamp + optimistic locking (reject the stale write if updated_at has already advanced)

Fourth, reconciliation for “money debited, order pending”:
Problem:
- Stripe debited money (payment succeeded)
- Webhook delayed/lost due to network issue
- User sees “payment pending” forever
Reconciliation Job (runs every 15 min):
def reconcile_pending_payments():
    # Find payments stuck in 'pending' or 'processing' > 10 minutes
    stuck_payments = Payment.filter(
        status__in=["pending", "processing"],
        created_at__lt=now() - timedelta(minutes=10)
    )

    for payment in stuck_payments:
        if not payment.payment_gateway_id:
            # Never reached Stripe, safe to mark failed
            payment.status = "failed"
            payment.save()
            continue

        # Fetch real status from Stripe
        try:
            stripe_payment = stripe.PaymentIntent.retrieve(
                payment.payment_gateway_id
            )
            if stripe_payment.status == "succeeded":
                # Money was debited! Update DB
                payment.status = "succeeded"
                create_order(payment)  # Fix missing order
                notify_user_success(payment)
            elif stripe_payment.status == "canceled":
                payment.status = "failed"
            payment.updated_at = now()
            payment.save()
        except stripe.error.InvalidRequestError:
            # Payment doesn't exist in Stripe
            payment.status = "failed"
            payment.save()

Fifth, handling critical failure scenarios:
Scenario 1: Database commit fails after Stripe charge
Issue: Money debited, but DB shows 'pending' (commit failed)
Solution: Reconciliation job pulls Stripe status, fixes DB

Scenario 2: Client sends same idempotency key with different amount
Request 1: Idempotency-Key: abc, amount=$100
Request 2: Idempotency-Key: abc, amount=$200 (malicious/bug)
Solution: Validate amount matches existing payment, return error if different

Scenario 3: Webhook arrives before payment exists in DB
Race condition: Webhook arrives milliseconds before API handler commits
Solution: Webhook retries (DLQ), or query Stripe if payment_id not found

Dead Letter Queue (DLQ) for Failed Webhooks:
Webhook processing failures → SQS/Kafka DLQ
Retry policy: exponential backoff (1min, 5min, 15min, 1hr)
After 24 hours: Alert team, manual intervention

Interview Score: 9/10
Why: Client-generated idempotency keys with DB deduplication, ACID transaction handling, webhook deduplication and out-of-order handling, reconciliation job for “money debited” scenarios, and comprehensive failure scenario coverage including DLQ.
Question 3: N+1 Queries, ORM Pitfalls, and Production Scaling
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L4-L6, 4-7 Years of Experience)
Company Examples: All companies with high-traffic APIs
Question: “In a high-traffic service (e.g., an API returning user profiles with related entities), how would you detect, explain, and fix N+1 query problems in production? Describe concrete techniques for query bundling, preloading/joins, and instrumentation. How would you demonstrate to a skeptical manager that the existing N+1 pattern will create scalability and cost problems at 10× traffic?”
1. What is This Question Testing?
This question tests critical backend performance competencies:
- Query Optimization: Can you identify and fix N+1 queries that kill performance?
- ORM Understanding: Do you know when ORMs create N+1 problems and how to prevent them?
- Production Debugging: Can you detect N+1 in live systems without bringing them down?
- Cost Awareness: Can you quantify the business impact of query inefficiency?
- Profiling Tools: Do you know how to use query profilers, logs, and APM tools?
The interviewer wants to see if you understand database performance at scale, not just basic SQL.
2. Framework to Answer This Question
Use the “Detect → Explain → Fix → Prove Framework”:
Structure:
1. What is N+1 - Clear definition with example
2. Detection - Tools and techniques to find N+1 in production
3. Root Cause - Why ORMs create N+1, lazy loading pitfalls
4. Solutions - Eager loading, joins, caching strategies
5. Business Case - Demonstrate cost/scale impact to management
3. The Answer
Answer:
I’d use a combination of APM tools, query logging, and database statistics to detect N+1, then apply eager loading or explicit joins to fix it. Let me walk through detection, explanation, and remediation.
First, what is the N+1 query problem:
Example: User profiles API
# BAD: N+1 Query Pattern
@app.get("/api/users")
def get_users():
    users = User.query.all()  # 1 query: SELECT * FROM users
    result = []
    for user in users:  # N iterations
        # Each iteration triggers a separate query!
        posts = user.posts  # SELECT * FROM posts WHERE user_id = ?
        result.append({
            "user": user.name,
            "post_count": len(posts)
        })
    return result

# Result: 1 + N queries
# 100 users = 101 queries
# 10,000 users = 10,001 queries (disaster!)

Why this is a problem:
Each SQL query has overhead:
- Network round trip: ~1-5ms
- Query planning: ~0.5-2ms
- Execution: ~0.1-1ms
100 users = 101 queries × 3ms = 303ms (acceptable)
10,000 users = 10,001 queries × 3ms = 30 seconds (timeout!)
Plus: Database connection pool exhaustion, high CPU

Second, detecting N+1 in production:
Method 1: APM Tools (Datadog, New Relic, Sentry)
APM tools automatically flag N+1 patterns:
Datadog APM Alert:
"Endpoint: GET /api/users
Queries executed: 10,001
Total query time: 28.5s
Pattern: Repeated SELECT FROM posts WHERE user_id = ?
Recommendation: Use eager loading or join"

Method 2: Database Query Logging
Enable slow query log + analyze patterns:
-- PostgreSQL slow query log
SET log_min_duration_statement = 100; -- Log queries > 100ms

-- Output shows repeated queries:
SELECT * FROM posts WHERE user_id = 1;
SELECT * FROM posts WHERE user_id = 2;
SELECT * FROM posts WHERE user_id = 3;
... (10,000 times)
-- Red flag: Same query with different parameters repeated

Method 3: Database Statistics
-- PostgreSQL: Check query statistics
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
WHERE query LIKE '%posts WHERE user_id%'
ORDER BY calls DESC;

-- Output:
-- query: SELECT * FROM posts WHERE user_id = $1
-- calls: 10,000
-- total_time: 25,000ms
-- mean_time: 2.5ms
-- 10,000 calls of same pattern = N+1!

Method 4: Custom Instrumentation
Wrap ORM with query counter:
from functools import wraps
query_count = 0

def count_queries(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        global query_count
        query_count = 0
        # Hook into ORM query execution
        with db.query_counter():
            result = func(*args, **kwargs)
        queries = db.get_query_count()
        if queries > 50:
            log.warning(f"{func.__name__} executed {queries} queries (N+1 suspected!)")
        return result
    return wrapper

@count_queries
@app.get("/api/users")
def get_users():
    # ... code ...
    pass

Third, explaining the root cause (ORM lazy loading):
Why ORMs cause N+1:
Most ORMs (Django, SQLAlchemy, ActiveRecord) use lazy loading by default:
# Django ORM Example
user = User.objects.get(id=1)  # 1 query
posts = user.posts.all()       # Lazy: Query only when accessed!

# Seems innocent, but in a loop:
users = User.objects.all()     # 1 query
for user in users:
    print(user.posts.count())  # N queries! (lazy loading)

ORM thinks it’s helping:
- “Don’t fetch posts unless needed” (memory efficient)
- But in loops, “needed” happens N times!
Fourth, fixing N+1 with eager loading:
Solution 1: ORM Eager Loading
# Django: select_related (for foreign keys, one-to-one)
users = User.objects.select_related('profile').all()
# SQL: SELECT * FROM users JOIN profiles ON users.profile_id = profiles.id

# Django: prefetch_related (for many-to-many, reverse foreign keys)
users = User.objects.prefetch_related('posts').all()
# SQL:
# 1. SELECT * FROM users
# 2. SELECT * FROM posts WHERE user_id IN (1,2,3,...,N)
# Result: 2 queries instead of N+1!

@app.get("/api/users")
def get_users():
    users = User.objects.prefetch_related('posts').all()  # FIX!
    result = []
    for user in users:
        # No additional query! Posts already loaded
        posts = user.posts.all()
        result.append({
            "user": user.name,
            "post_count": len(posts)
        })
    return result

Solution 2: Explicit JOIN
# SQLAlchemy explicit join
from sqlalchemy.orm import joinedload

users = session.query(User).options(joinedload(User.posts)).all()
# SQL:
# SELECT users.*, posts.*
# FROM users LEFT JOIN posts ON users.id = posts.user_id

Solution 3: Raw SQL (for complex cases)
# When ORM is too slow, write raw SQL
query = """
    SELECT u.id, u.name, COUNT(p.id) AS post_count
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
    GROUP BY u.id, u.name
"""
result = db.execute(query).fetchall()
# Single query, optimal performance

Fifth, demonstrating cost/scale impact to management:
Current State (N+1 pattern):
Traffic: 1,000 requests/minute
Users per request: 100
Queries: 1,000 req/min × 101 queries = 101,000 queries/min
Database CPU: 60%
P99 latency: 500ms

At 10× Traffic:
Traffic: 10,000 requests/minute
Queries: 10,000 × 101 = 1,010,000 queries/min
Database CPU: 600% (impossible, will crash!)
P99 latency: >5 seconds (timeouts)
Cost impact:
- Need 6× more database instances ($500/month → $3,000/month)
- OR accept degraded user experience (users abandon app)

After Fix (eager loading):
Traffic: 10,000 requests/minute
Queries: 10,000 × 2 = 20,000 queries/min (50× reduction!)
Database CPU: 15%
P99 latency: 80ms
Cost: Same $500/month database handles 10× traffic

ROI Calculation for Manager:
Fix effort: 2 hours (add prefetch_related)
Database cost savings: $2,500/month at scale
Engineering time savings: 10 hours/month (no firefighting)
ROI: $2,500/month saved / 2 hours of work = $1,250/hour of engineering value

Sixth, proactive prevention:
Code Review Checklist:
# RED FLAGS in code review:
❌ for item in query.all(): item.related_field
❌ [item.related for item in items]
❌ Accessing relationships inside loops
❌ No select_related/prefetch_related on queries with joins
✅ query.prefetch_related('related_field').all()
✅ query.select_related('foreign_key').all()
✅ Explicit JOINs in raw SQL

Linting + CI Checks:
# Custom linter rule
def check_n_plus_one(code):
    if "for" in code and ".objects.all()" in code:
        if "prefetch_related" not in code:
            raise Warning("Potential N+1: use prefetch_related")

Performance Testing:
# In test suite
def test_user_api_query_count():
    with assert_num_queries(2):  # Expect exactly 2 queries
        response = client.get('/api/users?limit=100')
    # Fails if N+1 present (would be 101 queries)

Interview Score: 9/10
Why: Clear N+1 definition with code examples, multiple detection methods (APM, logs, DB stats, instrumentation), ORM lazy loading explanation, concrete fixes (prefetch_related, joins, raw SQL), business case with cost calculations, and proactive prevention strategies.
Question 4: ACID vs BASE and CAP/PACELC in Real Systems
Difficulty: Very High
Role: Staff Backend Engineer / Architect
Level: Senior/Staff (L5-L7, 5-10 Years of Experience)
Company Examples: Amazon, Netflix, Airbnb, Uber
Question: “Pick a multi-region system you’ve worked on or know (e.g., shopping cart, messaging, or booking). Walk through the concrete trade-offs between ACID and BASE you would make, and map your design to CAP and PACELC: which consistency and availability guarantees do you provide at read and write paths? How do you handle conflict resolution and eventual consistency at the UX and data layers?”
1. What is This Question Testing?
This question tests distributed systems architecture competencies:
- CAP Theorem Understanding: Can you explain Consistency, Availability, Partition Tolerance trade-offs?
- PACELC Extension: Do you know the latency vs consistency trade-off when no partition exists?
- Real-World Design: Can you apply theory to actual systems (shopping cart, bookings)?
- Conflict Resolution: Can you handle concurrent writes across regions?
- Business Trade-offs: Can you justify eventual consistency to product teams?
2. The Answer
Answer:
I’d design an Amazon-style shopping cart with eventual consistency that prioritizes availability over strong consistency, using BASE principles with last-write-wins and merge-based conflict resolution.
First, understanding CAP and PACELC:
CAP Theorem:
- C (Consistency): All nodes see the same data simultaneously
- A (Availability): Every request gets a response (even if stale)
- P (Partition Tolerance): System works despite network failures
You can only choose 2 of 3. In real systems, partition tolerance is mandatory (networks fail), so the remaining choice is: C or A?
PACELC Extension:
- If Partition: Choose Availability or Consistency?
- Else (no partition): Choose Latency or Consistency?
Shopping Cart Design Choice:
CAP: AP (Availability + Partition Tolerance, sacrifice Consistency)
PACELC: PA/EL (Partition→Availability, Else→Latency)
Why:
- Shopping cart reads/writes must be fast (<100ms)
- Temporary inconsistency is acceptable (seeing cart from 1 second ago is fine)
- Cart is not critical (unlike payments, which need ACID)
Design Architecture:
Write Path:
1. User adds item in US-EAST region
2. Write to local DynamoDB replica (5ms)
3. Return success immediately
4. Async replicate to EU, ASIA (100-500ms delay)
Read Path:
1. User reads cart in EU
2. Read from EU replica
3. May see stale data if US write hasn't replicated yet
4. Eventually consistent within 500ms

Conflict Resolution:
Scenario: User adds items in two regions simultaneously
Time 00:00:
- US-EAST: User adds "Laptop" to cart
- EU-WEST: User adds "Mouse" to cart (network split)
Time 00:01:
- Networks merge
- Conflict: Cart has {Laptop} in US, {Mouse} in EU
Resolution Strategy 1: Last-Write-Wins (LWW)
- Compare timestamps
- EU write (00:00:15) beats US write (00:00:10)
- Final cart: {Mouse} only
- Problem: Lost "Laptop"! Bad UX.
Resolution Strategy 2: Merge (Amazon's Choice)
- Union of both carts
- Final cart: {Laptop, Mouse}
- Better UX, no lost data

Implementation:
# DynamoDB with merge-based conflict resolution
class ShoppingCart:
    def add_item(self, user_id, item, region):
        # Write to local region
        cart = dynamodb.get_item(
            Key={'user_id': user_id},
            ConsistentRead=False  # Eventual consistency
        )

        # Merge new item
        items = cart.get('items', [])
        items.append({
            'item_id': item,
            'added_at': time.time(),
            'region': region
        })

        # Write back
        dynamodb.put_item(Item={
            'user_id': user_id,
            'items': items,
            'version': uuid.uuid4()  # For conflict detection
        })
        return {'status': 'success', 'latency_ms': 5}

    def get_cart(self, user_id):
        # Read from local replica (may be stale)
        cart = dynamodb.get_item(
            Key={'user_id': user_id},
            ConsistentRead=False  # Fast, eventual consistency
        )

        # Deduplicate items (in case of merge conflicts)
        items = cart.get('items', [])
        unique_items = list({item['item_id']: item for item in items}.values())
        return unique_items

Contrast: Banking (ACID required)
-- Bank transfer REQUIRES ACID
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- Both updates or neither (atomic)

-- If partition occurs, system blocks writes (CA choice)
-- Consistency > Availability for money

UX Handling for Eventual Consistency:
Shopping Cart UI:
- Show "Syncing..." indicator if write pending
- Optimistic UI update (show item immediately, sync in background)
- If conflict detected, show "We added Mouse from another device"
Booking System (stronger consistency needed):
- Use distributed locks for inventory
- "Only 1 room left" → strong consistency required
- Accept higher latency (50-100ms) for correctness

Interview Score: 9/10
Why: Clear CAP/PACELC explanation, concrete shopping cart design with PA/EL choice justified, merge-based conflict resolution, code example, and contrast with ACID banking scenario.
Question 5: Sharding, Partitioning, and Hot-Shard Mitigation
Difficulty: Very High
Role: Senior/Staff Backend Engineer
Level: Senior/Staff (L5-L7, 4-10 Years of Experience)
Company Examples: Airbnb, Uber, LinkedIn, Pinterest
Question: “You own the database for a listing/booking service like Airbnb. At 10× growth, a single Postgres cluster is hitting CPU, I/O, and lock contention limits. Propose a sharding and partitioning strategy that addresses scale, hot rows, and operational complexity. How would you decide between functional sharding (by domain), geographic sharding (region-aware), and purely key-based horizontal partitioning? How do you route traffic, rebalance shards, and plan for zero-downtime resharding?”
1. What is This Question Testing?
- Sharding Strategy: Can you choose the right sharding key for the use case?
- Hot Shard Mitigation: Can you handle uneven data distribution (NYC vs small cities)?
- Operational Complexity: Can you reshard without downtime?
- Trade-offs: Can you explain functional vs geographic vs hash-based sharding?
2. The Answer
Answer:
I’d use geographic + functional hybrid sharding with consistent hashing for hot shard distribution, and a routing layer with dual-write migration for zero-downtime resharding.
First, choosing sharding strategy:
Option 1: Functional Sharding (by domain)
Shard 1: Users table
Shard 2: Listings table
Shard 3: Bookings table
Pros: Clean separation, easy to reason about
Cons: Doesn't solve single-table scale (Listings still huge)
Option 2: Hash-Based Horizontal Sharding
Shard = hash(listing_id) % num_shards
Pros: Even distribution
Cons: Cross-region joins expensive, no data locality
Option 3: Geographic Sharding (My Choice)
Shard_US: Listings in US/Canada
Shard_EU: Listings in Europe
Shard_ASIA: Listings in Asia
Pros:
- Data locality (users search local listings 90% of time)
- Legal compliance (GDPR data residency)
- Reduced cross-shard joins
Cons: Uneven distribution (NYC has 10× listings vs Des Moines)
My Hybrid Approach: Geographic + Consistent Hashing
Primary: Geographic sharding
Secondary: Consistent hashing within region for hot shards
Shard Key: hash(region_code + listing_id)
Second, handling hot shards (NYC problem):
Problem:
NYC: 100,000 listings
Des Moines: 5,000 listings
Simple geographic sharding:
- NYC shard: 95% CPU (bottleneck!)
- Des Moines shard: 10% CPU (wasted capacity)
Solution: Consistent Hashing with Virtual Nodes
Hash Ring with Virtual Nodes:
- NYC_1, NYC_2, NYC_3, ..., NYC_10 (10 virtual shards)
- DesMoines_1 (1 shard)
Physical servers:
- NYC listings distributed across 10 servers
- Des Moines on 1 server
As data grows:
- Add NYC_11, NYC_12 dynamically
- Rebalance automatically via consistent hashing
Implementation:
class ShardRouter:
    def __init__(self):
        # Consistent hash ring
        self.ring = ConsistentHashRing()
        # Add virtual nodes per region
        self.ring.add_nodes('NYC', virtual_nodes=10)
        self.ring.add_nodes('SF', virtual_nodes=8)
        self.ring.add_nodes('DesMoines', virtual_nodes=1)

    def get_shard(self, listing_id, region):
        # Hash: region + listing_id
        key = f"{region}_{listing_id}"
        shard = self.ring.get_node(key)
        return shard

# Usage
router = ShardRouter()
shard = router.get_shard('listing_123', 'NYC')  # Routes to e.g. NYC_7
db = shard_connections[shard]
listing = db.query("SELECT * FROM listings WHERE id = %s", 'listing_123')
Third, routing layer:
Application → Router → Shard
↓
(ShardMap)
ShardMap (stored in Redis):
{
'NYC_1': 'db-nyc-01.postgres.us-east-1',
'NYC_2': 'db-nyc-02.postgres.us-east-1',
'EU_1': 'db-eu-01.postgres.eu-west-1'
}
Fourth, zero-downtime resharding:
Scenario: Split NYC shard (too hot) into NYC_new_1 and NYC_new_2
Phase 1: Dual-Write (Weeks 1-2)
def write_listing(listing):
    # Write to OLD shard AND NEW shards
    old_shard.write(listing)
    new_shard_1.write(listing)  # Dual write
    new_shard_2.write(listing)  # Dual write
    # Read from OLD shard (source of truth)
    return old_shard.read(listing.id)
Phase 2: Backfill (Weeks 2-4)
# Background job
def backfill_new_shards():
    for listing in old_shard.all_listings():
        # Determine new shard based on hash
        new_shard = router.get_shard(listing.id, listing.region)
        new_shard.write(listing)
# Verify: count(old_shard) == count(new_shard_1) + count(new_shard_2)
Phase 3: Switch Reads (Week 4)
def read_listing(listing_id):
    # Now read from NEW shards
    shard = router.get_shard(listing_id, 'NYC')
    return shard.read(listing_id)
Phase 4: Cleanup (Week 5)
# Stop writing to OLD shard
# Drop OLD shard after 7-day safety buffer
Cross-Shard Queries:
-- BAD: Cross-shard JOIN
SELECT * FROM bookings b
JOIN listings l ON b.listing_id = l.id
WHERE b.user_id = 123;
-- Bookings in Shard_A, Listings in Shard_B → expensive!

-- GOOD: Denormalize critical fields
CREATE TABLE bookings (
    id UUID,
    listing_id UUID,
    listing_region VARCHAR, -- Denormalized!
    listing_title VARCHAR   -- Denormalized! Avoids cross-shard joins
);
Interview Score: 9/10
Why: Geographic + consistent hashing hybrid strategy, virtual nodes for hot shard mitigation, routing layer design, zero-downtime resharding with dual-write phases, and denormalization to avoid cross-shard joins.
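The `ConsistentHashRing` used by the router above is assumed; a minimal sketch using MD5 and `bisect`, with the `add_nodes`/`get_node` API names taken from the answer:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self):
        self.hashes = []  # sorted hash positions on the ring
        self.ring = []    # parallel list of (hash, virtual_node_name)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_nodes(self, node, virtual_nodes=1):
        # Each virtual node lands at its own position on the ring
        for i in range(virtual_nodes):
            h = self._hash(f"{node}_{i}")
            idx = bisect.bisect(self.hashes, h)
            self.hashes.insert(idx, h)
            self.ring.insert(idx, (h, f"{node}_{i + 1}"))

    def get_node(self, key):
        if not self.ring:
            raise ValueError("empty ring")
        # First virtual node clockwise from the key's hash (wrap around)
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Because NYC registers 10 virtual nodes and Des Moines 1, NYC keys spread across 10 positions while Des Moines keys map to one, which is exactly the hot-shard mitigation described above.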
Question 6: Cache Hierarchies and Invalidation Strategies
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L4-L6, 4-7 Years of Experience)
Company Examples: Netflix, Airbnb, Spotify, Twitter
Question: “Design a multi-tier caching strategy for a high-traffic, read-heavy service like Airbnb search or Netflix catalog. Explain what you cache at CDN, application, and database levels; how you choose keys and TTLs; and how you avoid stale or inconsistent data for critical flows (e.g., bookings, availability). Compare write-through, write-back, and cache-aside patterns and describe concrete invalidation strategies for updates, deletes, and backfills.”
1. What is This Question Testing?
- Multi-Tier Caching: Can you design layered caching (CDN, app, DB)?
- Cache Patterns: Do you understand cache-aside, write-through, write-back trade-offs?
- Invalidation: Can you prevent stale data without over-invalidating?
- TTL Strategy: Can you choose appropriate expiration times?
2. The Answer
Answer:
I’d use 3-tier cache-aside pattern with CDN for static assets, Redis for application cache, and database query cache, with TTL-based expiration and event-driven invalidation for critical data.
First, multi-tier architecture:
Tier 1: CDN (CloudFlare, Fastly)
- What: Static assets (images, CSS, JS, videos)
- TTL: 24 hours - 7 days
- Invalidation: Versioned URLs (/assets/v123/logo.png)
- Hit rate: 95%+
Tier 2: Application Cache (Redis)
- What: Listing details, search results, user sessions
- TTL: 30 seconds - 10 minutes
- Invalidation: Explicit delete on update + TTL fallback
- Hit rate: 70-80%
Tier 3: Database Query Cache (Postgres)
- What: Frequent SELECT query results
- TTL: 1-5 minutes
- Invalidation: Automatic on table writes
- Hit rate: 50-60%
Second, cache patterns comparison:
Cache-Aside (Lazy Loading) - My Choice:
def get_listing(listing_id):
    # Try cache first
    cache_key = f"listing:{listing_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)  # Cache hit!
    # Cache miss → query database
    listing = db.query("SELECT * FROM listings WHERE id = %s", listing_id)
    # Store in cache with TTL
    redis.setex(cache_key, 300, json.dumps(listing))  # 5 min TTL
    return listing
Pros: Simple, only caches what's requested
Cons: Cache miss adds latency, thundering herd risk
Write-Through:
def update_listing(listing_id, data):
    # Write to database
    db.update("UPDATE listings SET ... WHERE id = %s", listing_id)
    # Immediately update cache
    redis.setex(f"listing:{listing_id}", 300, json.dumps(data))
    # Both always in sync
Pros: Cache always fresh
Cons: Write latency (2× writes), wasted cache on rarely-read data
Write-Back (Write-Behind):
def update_listing(listing_id, data):
    # Write to cache only
    redis.setex(f"listing:{listing_id}", 300, json.dumps(data))
    # Async write to DB later (background job)
    queue.enqueue('write_to_db', listing_id, data)
    # Fast response
Pros: Fastest writes
Cons: Risk of data loss if cache fails before DB write
My Choice: Cache-Aside + Event-Driven Invalidation
Third, invalidation strategies:
Strategy 1: TTL-Based (Passive)
Listing updated at 10:00
Cache still has old version (cached at 9:55, TTL 10min)
Cache expires at 10:05
Next read at 10:06 → cache miss → fetch fresh data
Pros: Simple, no code changes
Cons: 5 min stale data (may be acceptable for listings, not for availability)
Strategy 2: Explicit Invalidation (Active)
def update_listing(listing_id, data):
    # Update database
    db.update("UPDATE listings SET ... WHERE id = %s", listing_id)
    # Invalidate cache (delete, not update)
    redis.delete(f"listing:{listing_id}")
    # Next read will fetch fresh from DB
    # Let cache-aside repopulate on demand
Pros: Immediate freshness
Cons: Cache miss spike after update
Strategy 3: Event-Driven Invalidation (for distributed systems)
# Publisher (on listing update)
def update_listing(listing_id, data):
    db.update(...)
    # Publish event to Kafka/Redis Pub-Sub
    event_bus.publish('listing.updated', {
        'listing_id': listing_id,
        'timestamp': time.time()
    })

# Subscribers (across all app servers)
@event_handler('listing.updated')
def invalidate_cache(event):
    # Each app server clears its Redis cache
    redis.delete(f"listing:{event['listing_id']}")
Fourth, handling critical flows (bookings, availability):
Problem: Stale availability data causes double-bookings
Scenario:
10:00 - Room ABC available (cached)
10:05 - User A books Room ABC (DB updated, cache NOT invalidated)
10:06 - User B sees Room ABC available (stale cache!)
10:07 - User B tries to book → CONFLICT!
Solution: Skip cache for critical reads
def check_availability(listing_id, dates):
    # Critical flow: Read directly from DB (bypass cache)
    availability = db.query(
        "SELECT * FROM availability WHERE listing_id = %s AND date IN (...)",
        listing_id, dates,
        read_replica=False  # Read from primary DB, not replica
    )
    return availability

def book_listing(listing_id, dates):
    # Atomic booking with DB transaction
    with db.transaction():
        # Lock row
        availability = db.query(
            "SELECT * FROM availability WHERE listing_id = %s FOR UPDATE",
            listing_id
        )
        if not availability.is_available:
            raise BookingConflict()
        # Mark unavailable
        db.update("UPDATE availability SET booked = true WHERE ...")
    # Invalidate cache after successful booking
    redis.delete(f"listing:{listing_id}")
Fifth, cache key and TTL design:
Key Naming:
listing:{id} → TTL: 5 min
listing:{id}:availability → TTL: 30 sec (fresher)
search:{city}:{checkin}:{checkout} → TTL: 1 min
user:{id}:session → TTL: 24 hours
TTL Strategy:
Static data (listing description): 10 min TTL
Semi-dynamic (pricing, reviews): 2-5 min TTL
Critical (availability, inventory): 30 sec OR no cache
User sessions: 24 hours
Search results: 1 min (balance freshness vs load)
Interview Score: 9/10
Why: 3-tier caching architecture, cache-aside pattern with justification, explicit + event-driven invalidation strategies, critical flow handling (skip cache for bookings), and thoughtful TTL design per data type.
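The thundering herd risk noted under cache-aside (many clients refilling the same expired key at once) is commonly mitigated with a short-lived refill lock. A sketch, assuming a redis-py style client and a hypothetical `db.fetch_listing` helper:

```python
import json
import time

def get_listing_with_lock(redis, db, listing_id, ttl=300):
    """Cache-aside read where only one caller refills an expired key."""
    cache_key = f"listing:{listing_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    lock_key = f"{cache_key}:refill"
    # Only one caller wins the refill lock (NX = set only if not exists)
    if redis.set(lock_key, "1", nx=True, ex=10):
        try:
            listing = db.fetch_listing(listing_id)  # hypothetical DB helper
            redis.setex(cache_key, ttl, json.dumps(listing))
            return listing
        finally:
            redis.delete(lock_key)
    # Losers briefly wait and re-read the cache instead of hitting the DB
    time.sleep(0.05)
    cached = redis.get(cache_key)
    return json.loads(cached) if cached else db.fetch_listing(listing_id)
```

Under a herd of N concurrent misses, this turns N database queries into roughly one, at the cost of a small wait for the losers.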
Question 7: Event-Driven Architectures, Exactly-Once, and Idempotent Consumers
Difficulty: Very High
Role: Senior/Staff Backend Engineer
Level: Senior/Staff (L5-L7, 5-10 Years of Experience)
Company Examples: PayPal, Uber, Netflix, LinkedIn
Question: “A notifications or billing platform (like PayPal, Uber, or internal platform teams) uses Kafka/Kinesis with multiple consumers. Design the system so that each business operation is applied exactly once despite at-least-once delivery semantics, retries, and consumer restarts. How do you model idempotency on the consumer side, manage deduplication keys, and handle poison messages and DLQs? What failure scenarios would you explicitly test?”
1. What is This Question Testing?
- Exactly-Once Semantics: Can you achieve effectively-once despite at-least-once delivery?
- Idempotent Consumers: Can you design consumers that handle duplicate events safely?
- Failure Handling: Can you deal with poison messages, consumer crashes, DLQs?
- Event Deduplication: Can you track processed events to prevent re-processing?
2. The Answer
Answer:
I’d use idempotent consumers with database-backed event deduplication tracking unique event IDs, plus Dead Letter Queues for poison messages and retry logic, achieving effectively-once processing.
First, understanding the problem:
At-Least-Once Delivery (Kafka default):
Producer sends event → Kafka stores → Consumer processes → Consumer commits offset
If consumer crashes AFTER processing but BEFORE commit:
- Kafka redelivers same event on restart
- Event processed twice! (duplicate charge, duplicate notification)Exactly-Once is impossible in distributed systems (network failures make it theoretically impossible), but effectively-once is achievable via idempotency.
Second, idempotent consumer design:
Core Principle: Check if event already processed BEFORE processing
def consume_payment_event(event):
    event_id = event['event_id']  # Unique ID from producer
    payment_id = event['payment_id']
    amount = event['amount']
    # Atomic check + process in single transaction
    with db.transaction():
        # Check if already processed
        if EventLog.exists(event_id=event_id):
            log.info(f"Duplicate event {event_id}, skipping")
            return  # Idempotent! Safe to skip.
        # Process business logic
        charge_card(payment_id, amount)
        send_confirmation_email(payment_id)
        # Mark as processed
        EventLog.create(
            event_id=event_id,
            payment_id=payment_id,
            processed_at=datetime.now(),
            status='success'
        )
        db.commit()  # Atomic: either ALL happens or NONE
    # Commit Kafka offset (after successful processing)
    consumer.commit_offset(event)
Event Deduplication Table:
CREATE TABLE event_log (
    event_id VARCHAR(255) PRIMARY KEY, -- UUID from event
    event_type VARCHAR(50),
    processed_at TIMESTAMP,
    status ENUM('success', 'failed'),
    payload JSON,
    INDEX idx_processed_at (processed_at)
);

-- Cleanup old events (after 7-30 days)
DELETE FROM event_log WHERE processed_at < NOW() - INTERVAL 30 DAY;
Third, handling failure scenarios:
Scenario 1: Consumer crashes after processing, before commit
1. Consumer receives event (event_id: abc123)
2. Processes: Charges card $100
3. Writes to event_log: event_id=abc123, status=success
4. CRASH before committing Kafka offset
5. Consumer restarts, Kafka redelivers event abc123
6. Consumer checks event_log: abc123 exists → SKIP
7. No duplicate charge! ✓
Scenario 2: Database transaction fails mid-processing
1. Start transaction
2. Charge card (succeeds)
3. Write event_log (DB fails!)
4. Transaction rolls back
5. Charge card operation also rolled back (compensating transaction)
6. Event NOT marked processed
7. Kafka redelivers → Retries successfully
Note: If charge is to external API (not in transaction), need idempotency keys:
stripe.charge(idempotency_key=event_id)  # Stripe prevents duplicate charges
Scenario 3: Poison message (malformed JSON)
def consume_with_dlq(event):
    try:
        # Validate event structure
        if not validate_event_schema(event):
            raise ValidationError("Invalid schema")
        # Process
        consume_payment_event(event)
    except ValidationError as e:
        # Poison message: won't succeed even with retries
        log.error(f"Poison message {event.get('id')}: {e}")
        # Send to Dead Letter Queue for manual review
        dlq.send(event, error=str(e))
        # Commit offset (skip this message, don't block queue)
        consumer.commit_offset(event)
    except TransientError as e:
        # Transient error (DB timeout, network issue)
        # Don't commit offset → Kafka will retry
        log.warning(f"Transient error, will retry: {e}")
        raise  # Consumer framework handles retry
Fourth, DLQ and retry strategy:
Dead Letter Queue Setup:
Main Topic: payment.events
DLQ: payment.events.dlq
Poison message conditions:
- Invalid JSON
- Schema validation failure
- Business logic error (e.g., unknown payment_id)
DLQ Consumer (manual intervention):
- Alerts team via Slack/PagerDuty
- Shows event in admin dashboard
- Engineer fixes data, replays event to main topic
Retry Strategy:
# Exponential backoff for transient errors
class RetryableConsumer:
    max_retries = 3
    base_delay = 1  # second

    def consume(self, event):
        for attempt in range(self.max_retries):
            try:
                consume_payment_event(event)
                return  # Success!
            except TransientError as e:
                if attempt < self.max_retries - 1:
                    delay = self.base_delay * (2 ** attempt)  # 1s, 2s, 4s
                    log.warning(f"Retry {attempt+1}/{self.max_retries} after {delay}s")
                    time.sleep(delay)
                else:
                    # Max retries exceeded → DLQ
                    dlq.send(event, error="Max retries exceeded")
                    consumer.commit_offset(event)
Fifth, testing scenarios:
def test_duplicate_event_processing():
    """Test idempotency: same event processed 2× should only charge once"""
    event = {'event_id': 'test123', 'payment_id': 'pay_1', 'amount': 100}
    # Process once
    consume_payment_event(event)
    assert get_charge('pay_1').amount == 100
    # Process duplicate
    consume_payment_event(event)
    assert get_charge('pay_1').amount == 100  # Still $100, not $200!

def test_consumer_crash_recovery():
    """Test recovery after crash before offset commit"""
    event = {'event_id': 'test456', 'payment_id': 'pay_2', 'amount': 50}
    # Simulate crash
    with mock.patch('consumer.commit_offset', side_effect=SystemExit):
        try:
            consume_payment_event(event)
        except SystemExit:
            pass
    # Restart consumer
    consume_payment_event(event)  # Redelivered by Kafka
    assert get_charge('pay_2').amount == 50  # Only charged once

def test_poison_message_dlq():
    """Test malformed event sent to DLQ"""
    poison_event = {'invalid': 'schema'}  # Missing required fields
    consume_with_dlq(poison_event)
    # Event in DLQ, not in event_log
    assert dlq.count() == 1
    assert EventLog.exists(event_id='invalid') is False
Interview Score: 9/10
Why: Database-backed idempotency with event deduplication table, atomic transaction pattern preventing partial processing, DLQ for poison messages, retry strategy with exponential backoff, and comprehensive failure scenario testing.
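One refinement to the fixed 1s/2s/4s backoff above: when many consumers fail together, fixed delays make them retry in lockstep and hammer the recovering dependency again. "Full jitter" randomizes each delay; a small sketch:

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=3):
    """Full-jitter exponential backoff: each delay is uniform in
    [0, min(cap, base * 2**attempt)], so retrying consumers de-synchronize."""
    return [random.uniform(0, min(cap, base * (2 ** a))) for a in range(attempts)]
```

The upper bound still doubles per attempt (1s, 2s, 4s here), but individual consumers spread out within each window instead of retrying simultaneously.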
Question 8: Sagas, Distributed Transactions, and Compensating Actions
Difficulty: Very High
Role: Principal Backend Engineer / Architect
Level: Staff/Principal (L6-L8, 7-12 Years of Experience)
Company Examples: Uber, Airbnb, Booking.com, Expedia
Question: “You are the Principal Backend Engineer designing cross-service booking or checkout flows (e.g., travel booking: flights + hotels + payments). You can’t rely on 2PC or global ACID transactions. Propose a saga-based approach and explain how you design forward steps and compensating transactions, handle partial failures, timeouts, and long-running operations, and maintain observability of saga state across services. How do you ensure idempotency of compensating steps?”
1. What is This Question Testing?
- Saga Pattern Knowledge: Can you design workflows without distributed transactions?
- Compensating Transactions: Can you handle rollbacks in distributed systems?
- Failure Handling: Can you deal with partial failures, timeouts?
- Idempotency: Can you ensure compensating actions are safe to retry?
- Observability: Can you track saga state across services?
2. The Answer
Answer:
I’d use choreography-based saga with event-driven compensations and a saga orchestrator tracking state, ensuring idempotent compensating transactions via unique compensation IDs.
First, saga pattern basics:
Problem: Distributed transactions don’t scale
Traditional 2PC (Two-Phase Commit):
1. Coordinator asks all services to prepare
2. All services lock resources, vote YES/NO
3. If all YES, coordinator commits; if any NO, abort
Issues:
- Blocking: Services hold locks during coordination (performance killer)
- Single point of failure: Coordinator crash = deadlock
- Network partitions: Can't guarantee atomicity across regionsSaga Alternative:
Each service commits locally (no global lock)
If failure occurs → run compensating transactions (undo previous steps)Second, travel booking saga design:
Booking Flow: Book Flight + Hotel + Payment
Choreography-Based Saga (Event-Driven):
Step 1: Reserve Flight
→ Service: Flight Service
→ Action: Create flight reservation (status=reserved)
→ Event: FlightReserved
Step 2: Reserve Hotel (triggered by FlightReserved event)
→ Service: Hotel Service
→ Action: Create hotel reservation (status=reserved)
→ Event: HotelReserved
Step 3: Charge Payment (triggered by HotelReserved event)
→ Service: Payment Service
→ Action: Charge credit card
→ Event: PaymentSucceeded OR PaymentFailed
Step 4a: Confirm Booking (if PaymentSucceeded)
→ Update flight: status=confirmed
→ Update hotel: status=confirmed
→ Event: BookingCompleted
Step 4b: Compensate (if PaymentFailed)
→ Cancel hotel reservation (compensating transaction)
→ Cancel flight reservation (compensating transaction)
→ Event: BookingCanceled
Implementation:
# Flight Service
class FlightService:
    def reserve_flight(self, booking_id, flight_id):
        # Step 1: Reserve flight
        reservation = FlightReservation.create(
            booking_id=booking_id,
            flight_id=flight_id,
            status='reserved',
            expires_at=now() + timedelta(minutes=15)
        )
        # Publish event
        event_bus.publish('flight.reserved', {
            'booking_id': booking_id,
            'reservation_id': reservation.id
        })
        return reservation

    @event_handler('booking.canceled')
    def cancel_reservation(self, event):
        # Compensating transaction
        booking_id = event['booking_id']
        compensation_id = event['compensation_id']  # For idempotency
        # Check if already compensated
        if Compensation.exists(compensation_id=compensation_id):
            return  # Idempotent!
        # Cancel reservation
        reservation = FlightReservation.get(booking_id=booking_id)
        reservation.status = 'canceled'
        reservation.save()
        # Mark compensation as done
        Compensation.create(
            compensation_id=compensation_id,
            booking_id=booking_id,
            action='cancel_flight'
        )
        event_bus.publish('flight.canceled', {'booking_id': booking_id})
# Hotel Service (similar pattern)
class HotelService:
    @event_handler('flight.reserved')
    def reserve_hotel(self, event):
        booking_id = event['booking_id']
        # Reserve hotel
        reservation = HotelReservation.create(
            booking_id=booking_id,
            status='reserved'
        )
        event_bus.publish('hotel.reserved', {
            'booking_id': booking_id,
            'reservation_id': reservation.id
        })

    @event_handler('booking.canceled')
    def cancel_reservation(self, event):
        # Compensating transaction (idempotent)
        compensation_id = event['compensation_id']
        if Compensation.exists(compensation_id=compensation_id):
            return
        # Cancel
        reservation = HotelReservation.get(booking_id=event['booking_id'])
        reservation.status = 'canceled'
        reservation.save()
        Compensation.create(
            compensation_id=compensation_id,
            booking_id=event['booking_id'],
            action='cancel_hotel'
        )
# Payment Service
class PaymentService:
    @event_handler('hotel.reserved')
    def charge_payment(self, event):
        booking_id = event['booking_id']
        try:
            # Charge credit card
            charge = stripe.charge(amount=total, booking_id=booking_id)
            event_bus.publish('payment.succeeded', {
                'booking_id': booking_id,
                'charge_id': charge.id
            })
        except StripeError as e:
            # Payment failed → trigger compensations
            event_bus.publish('payment.failed', {
                'booking_id': booking_id,
                'error': str(e)
            })
Third, saga orchestrator (for observability):
class SagaOrchestrator:
    """Tracks saga state across services"""

    def create_booking_saga(self, user_id, flight_id, hotel_id):
        # Create saga state
        saga = Saga.create(
            saga_id=uuid.uuid4(),
            type='booking',
            status='started',
            steps=[
                {'name': 'reserve_flight', 'status': 'pending'},
                {'name': 'reserve_hotel', 'status': 'pending'},
                {'name': 'charge_payment', 'status': 'pending'}
            ]
        )
        # Start saga
        event_bus.publish('saga.started', {
            'saga_id': saga.saga_id,
            'booking_id': saga.saga_id  # Use saga_id as booking_id
        })
        return saga

    @event_handler('flight.reserved')
    def on_flight_reserved(self, event):
        saga = Saga.get(saga_id=event['booking_id'])
        saga.update_step('reserve_flight', 'completed')
        saga.save()

    @event_handler('payment.failed')
    def on_payment_failed(self, event):
        saga = Saga.get(saga_id=event['booking_id'])
        saga.status = 'compensating'
        saga.save()
        # Trigger compensations
        compensation_id = uuid.uuid4()
        event_bus.publish('booking.canceled', {
            'booking_id': event['booking_id'],
            'compensation_id': compensation_id  # Ensures idempotency
        })
Fourth, handling failure scenarios:
Scenario 1: Timeout during hotel reservation
1. Flight reserved (success)
2. Hotel reservation times out (network issue)
3. Payment never triggered
Solution: Timeouts + Expiration
- Flight reservation expires in 15 minutes
- Background job checks for expired reservations
- Auto-cancel if saga not completed
Scenario 2: Compensation fails
1. Payment fails
2. Trigger compensation: cancel hotel
3. Cancel hotel fails (hotel service down)
Solution: Retry with exponential backoff
- Retry compensation 3 times (1s, 2s, 4s)
- If still failing, alert team (manual intervention)
- Idempotency ensures safe retries
Fifth, ensuring idempotency of compensations:
-- Compensation tracking table
CREATE TABLE compensations (
    compensation_id UUID PRIMARY KEY,
    booking_id UUID NOT NULL,
    action VARCHAR(50), -- 'cancel_flight', 'cancel_hotel'
    executed_at TIMESTAMP,
    INDEX idx_booking (booking_id)
);

# Idempotent compensation
def cancel_flight(booking_id, compensation_id):
    # Check if already executed
    if Compensation.exists(compensation_id=compensation_id):
        log.info(f"Compensation {compensation_id} already executed")
        return  # Safe to retry!
    # Execute compensation
    with db.transaction():
        reservation = FlightReservation.get(booking_id=booking_id)
        reservation.status = 'canceled'
        reservation.save()
        # Record compensation
        Compensation.create(
            compensation_id=compensation_id,
            booking_id=booking_id,
            action='cancel_flight',
            executed_at=now()
        )
        db.commit()
Observability Dashboard:
Saga ID: abc-123
Status: Compensating
Started: 2024-12-08 10:00:00
Steps:
✅ Reserve Flight (completed at 10:00:05)
✅ Reserve Hotel (completed at 10:00:10)
❌ Charge Payment (failed at 10:00:15 - "Card declined")
Compensations:
⏳ Cancel Hotel (in progress)
⏳ Cancel Flight (pending)
Interview Score: 9/10
Why: Choreography-based saga with event-driven flow, compensating transaction design with idempotency via compensation_id tracking, saga orchestrator for state visibility, timeout/expiration handling, and failure scenario coverage.
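The timeout handling in Scenario 1 implies a background sweeper that cancels reservations whose saga never completed. A sketch, with the query helper and event bus passed in as parameters (names assumed, following the event vocabulary above):

```python
from datetime import datetime, timezone

def sweep_expired_reservations(find_expired, event_bus, new_compensation_id):
    """Background job: cancel reservations stuck in 'reserved' past their
    expires_at, reusing the same idempotent compensation path as payment failure."""
    now = datetime.now(timezone.utc)
    canceled = []
    # find_expired(now) is assumed to query e.g.
    # status='reserved' AND expires_at < now AND saga not completed
    for reservation in find_expired(now):
        event_bus.publish('booking.canceled', {
            'booking_id': reservation['booking_id'],
            'compensation_id': new_compensation_id(),  # idempotency key
        })
        canceled.append(reservation['booking_id'])
    return canceled
```

Because cancellation goes through the same `booking.canceled` event with a `compensation_id`, a sweeper run that overlaps a payment-failure compensation stays safe.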
Question 9: Database Migrations and Zero-Downtime Releases
Difficulty: Very High
Role: Senior/Staff Backend Engineer
Level: Senior/Staff (L5-L7, 4-10 Years of Experience)
Company Examples: Stripe, GitHub, Shopify, LinkedIn
Question: “Your monolithic service is being decomposed into microservices, and you need to perform a breaking database schema migration (e.g., splitting a user table, changing primary keys, or moving to sharded instances) with zero customer-visible downtime. Describe your migration plan in phases, including dual writes/reads, backfill strategies, feature flags, fallbacks, monitoring, and rollback. How would you test and de-risk this plan in a live, high-traffic environment?”
1. What is This Question Testing?
- Migration Planning: Can you design multi-phase, zero-downtime migrations?
- Dual Write/Read: Can you maintain consistency during transition?
- Risk Mitigation: Can you de-risk with feature flags, monitoring, rollback plans?
- Testing: Can you validate migration in production safely?
2. The Answer
Answer:
I’d use a 5-phase migration strategy with dual writes, gradual rollout via feature flags, and continuous monitoring with instant rollback capability.
Scenario: Split users table into users + user_profiles
Old Schema:
CREATE TABLE users (
id BIGINT PRIMARY KEY,
email VARCHAR(255),
name VARCHAR(255),
    bio TEXT,                -- Moving to user_profiles
    avatar_url VARCHAR(500), -- Moving to user_profiles
    created_at TIMESTAMP
);
New Schema:
CREATE TABLE users (
id BIGINT PRIMARY KEY,
email VARCHAR(255),
name VARCHAR(255),
created_at TIMESTAMP);
CREATE TABLE user_profiles (
user_id BIGINT PRIMARY KEY,
bio TEXT,
avatar_url VARCHAR(500),
FOREIGN KEY (user_id) REFERENCES users(id)
);
Phase 1: Add New Table (Week 1)
-- Create new table
CREATE TABLE user_profiles (
    user_id BIGINT PRIMARY KEY,
    bio TEXT,
    avatar_url VARCHAR(500),
    FOREIGN KEY (user_id) REFERENCES users(id)
);
-- No code changes yet, just schema
Phase 2: Dual Write (Weeks 2-3)
# Application code: Write to BOTH tables
def update_user_profile(user_id, bio, avatar_url):
    with db.transaction():
        # Write to OLD table (still source of truth)
        User.update(user_id, bio=bio, avatar_url=avatar_url)
        # ALSO write to NEW table (dual write)
        UserProfile.upsert(user_id=user_id, bio=bio, avatar_url=avatar_url)
        db.commit()

# Reads still from OLD table
def get_user_profile(user_id):
    user = User.get(user_id)
    return {'bio': user.bio, 'avatar_url': user.avatar_url}
Phase 3: Backfill Historical Data (Weeks 3-4)
# Background job
def backfill_user_profiles():
    batch_size = 1000
    offset = 0
    while True:
        users = User.query.limit(batch_size).offset(offset).all()
        if not users:
            break
        for user in users:
            # Copy data from users to user_profiles
            UserProfile.upsert(
                user_id=user.id,
                bio=user.bio,
                avatar_url=user.avatar_url
            )
        offset += batch_size
        log.info(f"Backfilled {offset} users")
        time.sleep(0.1)  # Rate limit to avoid DB overload

-- Verify backfill
SELECT COUNT(*) FROM users;         -- 1,000,000
SELECT COUNT(*) FROM user_profiles; -- 1,000,000 (should match!)
Phase 4: Gradual Read Migration (Weeks 4-6)
# Feature flag: Gradually route reads to NEW table
@feature_flag('read_from_user_profiles', rollout_percentage=10)
def get_user_profile(user_id):
    if feature_flag_enabled('read_from_user_profiles', user_id):
        # Read from NEW table
        profile = UserProfile.get(user_id)
        return {'bio': profile.bio, 'avatar_url': profile.avatar_url}
    else:
        # Read from OLD table (fallback)
        user = User.get(user_id)
        return {'bio': user.bio, 'avatar_url': user.avatar_url}

# Rollout:
# Week 4: 10% of users read from new table
# Week 5: 50% of users
# Week 6: 100% of users (if no issues)
Monitoring during rollout:
Metrics:
- read_user_profiles_latency (OLD vs NEW comparison)
- read_user_profiles_errors (alert if >0.1%)
- data_consistency_check (OLD == NEW?)
Alert Conditions:
- NEW table latency >2× OLD table latency
- Error rate >0.1%
- Data mismatch detected
→ Instant rollback to 0%
Phase 5: Cleanup (Week 7)
-- Stop dual writes
-- Remove bio, avatar_url from users table
ALTER TABLE users DROP COLUMN bio;
ALTER TABLE users DROP COLUMN avatar_url;
-- Remove feature flag (100% on NEW table)
Rollback Plan:
# If issues detected at 50% rollout:
# 1. Set feature flag to 0% → All reads from OLD table
# 2. Continue dual writes (data still syncing)
# 3. Investigate issue
# 4. Fix and retry rollout

# Instant rollback (< 1 minute):
feature_flag.set('read_from_user_profiles', 0)
Testing in Production:
# Shadow reads: Compare OLD vs NEW
def get_user_profile(user_id):
    # Primary read (OLD table)
    old_data = User.get(user_id)
    # Shadow read (NEW table, non-blocking)
    async_shadow_read(user_id, old_data)
    return old_data

def async_shadow_read(user_id, old_data):
    try:
        new_data = UserProfile.get(user_id)
        # Compare results
        if new_data.bio != old_data.bio:
            alert("Data mismatch detected!", user_id)
    except Exception as e:
        log.error(f"Shadow read failed: {e}")
Interview Score: 9/10
Why: 5-phase migration with dual writes, backfill strategy, gradual rollout via feature flags with percentage control, continuous monitoring with instant rollback, and shadow reads for data consistency validation.
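The data_consistency_check metric listed under monitoring implies a sampling comparator run during the dual-write window. A sketch, with the data-access calls passed in as functions so it stays storage-agnostic (names assumed):

```python
import random

def check_consistency(user_ids, read_old, read_new, sample_size=100):
    """Compare a random sample of users between OLD and NEW tables;
    returns the ids that mismatch so they can be re-backfilled or alerted on."""
    sample = random.sample(user_ids, min(sample_size, len(user_ids)))
    mismatches = []
    for user_id in sample:
        old, new = read_old(user_id), read_new(user_id)
        if old != new:
            mismatches.append(user_id)
    return mismatches
```

Run periodically during Phases 2-4, a non-empty result trips the "Data mismatch detected" alert condition above before the rollout percentage is increased.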
Question 10: Handling Concurrency, Race Conditions, and Distributed Locks
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: All high-traffic systems
Question: “Consider a rate limiter or resource allocator backed by Redis or a relational DB in a distributed environment with multiple instances. How do you design your system to avoid race conditions and ensure correctness under concurrent requests? Compare optimistic vs pessimistic locking, Lua scripts in Redis, and per-resource distributed locks. Where do you accept ‘eventual enforcement’ vs strict enforcement, and how do you justify the trade-off?”
1. What is This Question Testing?
- Race Condition Handling: Can you design systems that handle concurrent requests correctly?
- Locking Strategies: Do you understand optimistic vs pessimistic locking trade-offs?
- Redis Lua Scripts: Can you use atomic operations to prevent race conditions?
- Distributed Locks: Do you know when to use distributed locks vs other approaches?
- Trade-off Analysis: Can you justify eventual vs strict enforcement?
2. The Answer
Answer:
I’d use Redis Lua scripts for atomic operations in rate limiting, combined with optimistic locking for resource allocation, accepting eventual enforcement for performance.
First, the race condition problem:
# BAD: Race condition in rate limiter
def check_rate_limit(user_id):
    count = int(redis.get(f"rate:{user_id}") or 0)
    if count >= 100:
        return False  # Rate limited
    # Race condition here! Two requests can both pass this check
    redis.incr(f"rate:{user_id}")
    return True
# Result: User makes 105 requests instead of 100 (5% over-limit)
Solution 1: Lua Script (Atomic)
-- rate_limit.lua (runs atomically in Redis)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local current = redis.call('GET', key)
if not current then
    current = 0
end
if tonumber(current) >= limit then
    return 0  -- Rate limited
else
    redis.call('INCR', key)
    redis.call('EXPIRE', key, 60)
    return 1  -- Allowed
end

# Python: Execute Lua script atomically
# (redis-py signature: eval(script, numkeys, *keys_and_args))
def check_rate_limit(user_id, limit=100):
    result = redis.eval(
        lua_script,
        1,                  # one key
        f"rate:{user_id}",  # KEYS[1]
        limit               # ARGV[1]
    )
    return result == 1  # True if allowed

Solution 2: Optimistic Locking
# Resource allocation with version check
def allocate_resource(resource_id, user_id):
    while True:
        # Read resource + version
        resource = Resource.get(resource_id)
        if resource.allocated:
            return False  # Already allocated
        # Try to allocate (check version hasn't changed)
        updated = Resource.update(
            resource_id=resource_id,
            allocated=True,
            allocated_to=user_id,
            version=resource.version + 1,
            where={'version': resource.version}  # Optimistic lock
        )
        if updated:
            return True  # Success!
        else:
            # Version changed → retry
            continue

Solution 3: Distributed Lock (Redis)
import redis_lock

def book_last_ticket(event_id, user_id):
    lock_key = f"lock:event:{event_id}"
    with redis_lock.Lock(redis, lock_key, expire=5):
        # Only one request holds lock at a time
        tickets = Ticket.count(event_id=event_id, available=True)
        if tickets == 0:
            return False
        # Allocate ticket
        ticket = Ticket.get_available(event_id)
        ticket.allocated_to = user_id
        ticket.save()
        return True

Fourth, trade-off analysis:
Strict Enforcement (Lua scripts, locks):
Pros: Exactly 100.0 requests/min, guaranteed correctness
Cons: Higher latency (lock contention), single point of failure (Redis)
Use case: Critical resources (inventory, tickets, payments)

Eventual Enforcement (Regional Redis with replication lag):
Pros: Low latency (<1ms), high availability
Cons: Slight over-limit during lag (100-102 requests/min, 2% error)
Use case: Rate limiting, quotas, soft limits
Acceptable trade-off:
- 2% over-limit for 100× better latency
- Rate limiting is a soft limit (not life-critical)

Fifth, choosing the right approach:
Lua Scripts (My choice for rate limiting):
- Atomic operations
- No network round trips
- Sub-millisecond performance
- 99.9% accuracy (regional lag acceptable)
Optimistic Locking (For resources with low contention):
- No locks needed
- Retry on conflict
- Works well if conflicts rare (<5%)
Distributed Locks (For critical resources):
- Strong consistency
- Use only when necessary (tickets, inventory)
- Accept latency cost (10-50ms)

Interview Score: 9/10
Why: Clear race condition explanation, three solutions compared (Lua, optimistic, locks), trade-off justification (strict vs eventual), and practical guidance on when to use each approach.
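The question also asks about token bucket, which the answer above names but does not show. A single-process sketch of the token-bucket arithmetic follows; a production version would run the same refill-and-consume logic atomically in Redis (e.g., via a Lua script as above), and the class name and parameters here are illustrative:

```python
import time

class TokenBucket:
    """Single-process token-bucket sketch: allows bursts up to `capacity`,
    then refills steadily at `refill_per_sec` tokens per second."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity            # burst size
        self.refill_per_sec = refill_per_sec  # steady-state rate
        self.tokens = capacity              # start full
        self.last = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # consume one token
        return False      # bucket empty → rate limited
```

For example, a bucket with capacity 5 and refill 1 token/sec admits 5 back-to-back requests, rejects the 6th, and admits one more after a second — smoother than the fixed-window counter, which resets all 100 requests at once on the minute boundary.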
Question 11: Observability, SLOs, and On-Call Incident Response
Difficulty: High
Role: Senior Backend Engineer / SRE
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: All production systems
Question: “You are the on-call Senior Backend Engineer for a microservices-based API platform. Suddenly, p99 latency doubles and error rates spike in one region. Walk through your incident response: what dashboards, traces, and logs do you check; how do you distinguish between downstream dependency issues (e.g., DB, cache, message queue) vs application-level regressions; and how do you decide whether to roll back, degrade features, or perform a partial failover?”
1. What is This Question Testing?
- Incident Response: Can you systematically diagnose production issues?
- Observability Tools: Do you know how to use dashboards, traces, logs effectively?
- Root Cause Analysis: Can you distinguish app vs infrastructure vs dependency issues?
- Decision Making: Can you decide between rollback, degradation, failover under pressure?
- SRE Mindset: Do you prioritize customer impact over perfect diagnosis?
2. The Answer
Answer:
I’d follow a systematic 4-step incident response playbook: dashboards → traces → logs → decision matrix, prioritizing fast mitigation over perfect diagnosis.
First, initial assessment (0-2 minutes):
Primary Dashboards (Datadog, Grafana):
1. Service Health Dashboard:
- p50/p95/p99 latency trends
- Error rate (4XX, 5XX)
- Request rate (RPS)
- By region, service, endpoint
2. Dependency Dashboard:
- Database: query latency, connection pool usage
- Cache (Redis): latency, hit rate, eviction rate
- Message Queue (Kafka): consumer lag, broker health
- External APIs: latency, error rate
3. Infrastructure Dashboard:
- CPU, memory, disk I/O
- Network throughput, packet loss
- Container/pod health (if Kubernetes)

What I See:
Alert: API latency spike in US-EAST region
Metrics:
- p99 latency: 100ms → 500ms (5× increase) ← RED FLAG
- 5XX error rate: 0.1% → 2% (20× increase) ← RED FLAG
- Request rate: Stable at 10K RPS (no traffic spike)
- Only US-EAST affected (EU, ASIA normal)
Initial hypothesis: Issue localized to US-EAST region

Second, distributed trace analysis (2-5 minutes):
Use Datadog APM or Jaeger to find slow requests:
Example slow trace (p99 = 500ms):
Span 1: API Gateway → 5ms (normal)
Span 2: Auth Service → 10ms (normal)
Span 3: Product Service → 480ms ← SLOW!
↳ Span 3.1: Database query → 450ms ← ROOT CAUSE
↳ Span 3.2: Redis cache → 15ms (normal)
↳ Span 3.3: Business logic → 10ms (normal)
Conclusion: Database queries in Product Service are slow

Drill down into database span:
Slow query:
SELECT * FROM products
WHERE category_id IN (SELECT id FROM categories WHERE ...)
ORDER BY created_at DESC
LIMIT 100;
Execution time: 450ms (normal: 50ms)
Rows scanned: 500,000 (normal: 1,000)
Red flag: Missing index or inefficient query

Third, log analysis (5-10 minutes):
Check application logs:
# Filter logs for errors in US-EAST Product Service
kubectl logs -l app=product-service,region=us-east --since=10m | grep ERROR

Output:
[ERROR] Database connection pool exhausted (95/100)
[ERROR] Query timeout after 1000ms: SELECT * FROM products...
[ERROR] Slow query detected: 450ms execution time

Check database metrics:
PostgreSQL metrics (US-EAST):
- Active connections: 95/100 (near limit!) ← RED FLAG
- Slow query log: SELECT * FROM products ... (repeated 10,000×)
- Query execution time: 450ms avg (p99: 800ms)
- Lock wait time: 50ms (increased from 5ms)
Root cause confirmed: Database connection pool exhaustion + slow queries

Fourth, decision matrix:
Option 1: Rollback
When to rollback:
- Recent deploy within last 2 hours
- Clear correlation between deploy time and incident
- Rollback is low-risk (feature flag or code deploy)
Check:
- Last deploy: 15 minutes ago (Product Service v2.3.5)
- Changeset: Added new "related products" feature with N+1 query
Decision: ROLLBACK
Action: Rollback Product Service to v2.3.4
ETA: 2 minutes (automated rollback)

Option 2: Feature Degradation
When to degrade:
- External dependency failing (not our code)
- Can't rollback (critical security fix deployed)
- Graceful degradation possible
Example:
- Redis cache down → Serve stale data (30 sec old)
- Search service slow → Skip search suggestions, show static results
- Payment gateway timeout → Retry queue, notify user "processing"
Action: Feature flag to disable "related products" feature
ETA: 1 minute

Option 3: Partial Failover
When to failover:
- Infrastructure issue (datacenter, AZ outage)
- Database replica failure
- Network partition
Action:
- Route US-EAST traffic to US-WEST
- Scale up US-WEST capacity 2×
- Monitor cross-region latency increase (acceptable temporarily)
ETA: 5-10 minutes

My Decision for This Incident:
Root cause: Recent deploy introduced N+1 query
Best action: ROLLBACK
Reasoning:
- Deploy was 15 min ago (clear correlation)
- Rollback is safe (automated, tested)
- Fastest mitigation (2 min vs 10+ min for other options)
- Preserves customer experience
Execute:
1. Slack: "#incident-response Deploy rollback initiated for product-service"
2. Command: kubectl rollout undo deployment/product-service -n us-east
3. Monitor: Dashboard shows latency dropping within 30 seconds
4. Confirm: p99 latency back to 100ms, errors back to 0.1%
5. All-clear: Incident resolved in 8 minutes

Fifth, post-incident actions:
Immediate (during incident):
1. Update incident channel with ETA
2. Notify stakeholders (eng lead, product)
3. Monitor for 15 min to confirm resolution
4. Close incident ticket

Post-Mortem (within 24 hours):
1. Root Cause:
- N+1 query in "related products" feature
- Missing index on products.category_id
- No load testing before deploy
2. Action Items:
- Add eager loading: Product.objects.prefetch_related('related')
- Add index: CREATE INDEX ON products(category_id)
- Add query count assertion in tests
- Require load testing for database-heavy features
3. Prevention:
- Enable query count warnings in staging
- Add dashboard alert: "Query count >50 per request"

SLO Impact:
SLO: p99 latency <200ms for 99.9% of requests
Impact:
- Breach duration: 8 minutes
- Affected requests: ~4.8M (10K RPS × 480 seconds)
- Error budget: Consumed 0.5% of monthly budget
Conclusion: Within acceptable range (stayed <1% budget burn)

Interview Score: 9/10
Why: Systematic 4-step playbook (dashboards → traces → logs → decision), clear decision matrix for rollback vs degradation vs failover, real-world trace analysis identifying N+1 query, and post-incident actions with SLO impact calculation.
Question 12: API Versioning, Backward Compatibility, and Contract Negotiation
Difficulty: High
Role: Senior Backend Engineer / API Platform
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: Stripe, Twilio, GitHub
Question: “Your team owns a core backend API used by multiple internal and external clients. You need to ship a breaking change to the contract. How do you design your API versioning strategy, manage deprecation, and avoid breaking consumers? Describe how you’d coordinate across teams, enforce backward compatibility in the short term, and use schema validation, contract tests, or API gateways to keep the ecosystem stable.”
1. What is This Question Testing?
- API Design: Can you version APIs without breaking clients?
- Backward Compatibility: Can you maintain old and new versions simultaneously?
- Deprecation Management: Can you sunset old versions gracefully?
- Contract Testing: Do you know how to validate API contracts programmatically?
- Cross-Team Coordination: Can you manage migration across multiple teams?
2. The Answer
Answer:
I’d use URL-based versioning (/api/v1, /api/v2) with parallel version support, gradual client migration, contract tests, and a 12-month deprecation timeline.
First, versioning strategy:
URL-Based Versioning (Recommended):
Current: /api/v1/users
New: /api/v2/users
Pros:
- Clear version in URL (easy for clients to understand)
- Can run both versions simultaneously
- Easy to route in API gateway
Cons:
- URL changes (but that's the point for breaking changes)

Alternatives (Why I don’t recommend):
Header-based: Accept: application/vnd.api+json; version=2
- Pros: Clean URLs
- Cons: Harder to test, caching issues
Query param: /api/users?version=2
- Pros: Easy to toggle
- Cons: Pollutes query params, caching issues

Second, breaking change example:
Scenario: User API redesign
Old (v1):
GET /api/v1/users/123

Response:
{
  "user_id": 123,
  "name": "John Smith",
  "email": "john@example.com",
  "created": "2024-01-01"
}

New (v2) - Breaking changes:
GET /api/v2/users/123

Response:
{
  "id": 123,                            // Renamed from user_id
  "first_name": "John",                 // Split from name
  "last_name": "Smith",                 // Split from name
  "email": "john@example.com",
  "created_at": "2024-01-01T00:00:00Z"  // ISO 8601
}

Third, implementation with parallel support:
# Both versions use same data model internally
class UserSerializer_V1:
    def to_json(self, user):
        return {
            "user_id": user.id,
            "name": f"{user.first_name} {user.last_name}",
            "email": user.email,
            "created": user.created_at.strftime("%Y-%m-%d")
        }

class UserSerializer_V2:
    def to_json(self, user):
        return {
            "id": user.id,
            "first_name": user.first_name,
            "last_name": user.last_name,
            "email": user.email,
            "created_at": user.created_at.isoformat()
        }

# v1 endpoint
@app.route("/api/v1/users/<int:user_id>")
def get_user_v1(user_id):
    user = User.get(user_id)
    return jsonify(UserSerializer_V1().to_json(user))

# v2 endpoint
@app.route("/api/v2/users/<int:user_id>")
def get_user_v2(user_id):
    user = User.get(user_id)
    return jsonify(UserSerializer_V2().to_json(user))

Fourth, deprecation timeline:
12-Month Deprecation Plan:
Month 0 (Today):
- Announce v2 launch
- v1 will be deprecated in 12 months
- Email all API consumers
- Add deprecation warning header to v1 responses:
Warning: 299 - "API v1 is deprecated. Migrate to v2 by 2025-12-31"
Month 3:
- Email clients still on v1 (identify via logs)
- Offer migration support (office hours, docs)
- Track migration progress: 30% on v2
Month 6:
- Email v1 users again
- Warn: v1 sunset in 6 months
- Track: 60% on v2
Month 9:
- Final warning: v1 sunset in 3 months
- Personally contact large clients still on v1
- Track: 85% on v2
Month 12:
- Sunset v1 (return 410 Gone)
- Keep v1 code for 30 days (rollback safety)
- Track: 98%+ on v2
Month 13:
- Delete v1 code

Fifth, contract testing to prevent breaking changes:
JSON Schema Validation:
# v1 contract (OpenAPI/JSON Schema)
user_v1_schema = {
    "type": "object",
    "required": ["user_id", "name", "email", "created"],
    "properties": {
        "user_id": {"type": "integer"},
        "name": {"type": "string"},
        "email": {"type": "string", "format": "email"},
        "created": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"}
    }
}

# Contract test
def test_user_v1_contract():
    response = client.get('/api/v1/users/123')
    data = response.json()
    # Validate against schema
    validate(data, user_v1_schema)  # Fails if contract broken
    # Ensure v1 contract unchanged
    assert "user_id" in data  # Must have user_id, not id
    assert "name" in data     # Must have name, not first_name/last_name

Pact Contract Testing (for multiple consumers):
# Consumer (Mobile App) defines expected contract
from pact import Consumer, Provider

pact = Consumer('MobileApp').has_pact_with(Provider('UserAPI'))

(pact
 .upon_receiving('get user request')
 .with_request('GET', '/api/v1/users/123')
 .will_respond_with(200, body={
     'user_id': 123,
     'name': 'John Smith',
     'email': 'john@example.com',
     'created': '2024-01-01'
 }))

# Provider (API) must satisfy contract
# CI fails if API changes break consumer expectations

Sixth, cross-team coordination:
Migration Kickoff (Month 0):
1. Announce in eng-all channel
2. Create migration guide: docs.company.com/api-v2-migration
3. Breaking changes highlighted:
- user_id → id
- name → first_name + last_name
- created → created_at (ISO 8601)
4. Migration checklist:
☐ Update API base URL: /v1 → /v2
☐ Update field mappings
☐ Test in staging
☐ Deploy to production
☐ Monitor for errors

Office Hours (Months 1-6):
Weekly Zoom sessions:
- Answer migration questions
- Debug integration issues
- Provide code examples

Tracking Migration Progress:
-- Track API usage by version
SELECT
    version,
    COUNT(*) AS requests,
    COUNT(DISTINCT client_id) AS unique_clients
FROM api_logs
WHERE endpoint = '/users'
GROUP BY version;
Results (Month 6):
v1: 100K requests, 15 clients
v2: 200K requests, 35 clients
Action: Contact 15 v1 clients, offer help

Seventh, enforcing backward compatibility:
During Transition Period (Months 0-12):
Rule: v1 contract CANNOT change
Enforcement:
1. Contract tests in CI (fail if v1 schema changes)
2. Code review checklist: "Does this affect v1?"
3. Freeze v1 codebase (only security fixes allowed)

API Gateway for Routing:
API Gateway (Kong, AWS API Gateway):
/api/v1/* → Route to v1 backend
/api/v2/* → Route to v2 backend
Allows:
- Independent deployment of v1 and v2
- Traffic splitting for testing
- Gradual rollout of v2

Interview Score: 9/10
Why: Clear URL-based versioning strategy, parallel version support with code examples, 12-month deprecation timeline with monthly milestones, contract testing (JSON Schema + Pact), cross-team migration coordination, and backward compatibility enforcement during transition.
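A migrating client has to apply exactly the field mappings listed above (user_id → id, name → first_name/last_name, created → created_at). A minimal client-side adapter sketch — a hypothetical helper, not part of the API, which assumes "name" is a single "first last" string:

```python
from datetime import datetime, timezone

def adapt_v1_to_v2(v1):
    """Translate a v1 user payload into the v2 shape (illustrative only)."""
    first_name, _, last_name = v1["name"].partition(" ")
    created = datetime.strptime(v1["created"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return {
        "id": v1["user_id"],               # user_id → id
        "first_name": first_name,          # name split into two fields
        "last_name": last_name,
        "email": v1["email"],
        "created_at": created.isoformat(), # date → ISO 8601 timestamp
    }
```

Encoding the mapping in one function like this also gives the team a natural place to hang contract tests during the 12-month transition.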
Question 13: Backend Performance Bottlenecks and Memory Leaks in Production
Difficulty: High
Role: Senior Backend Engineer
Level: Senior (L5-L6, 4-7 Years of Experience)
Company Examples: All production systems
Question: “A high-traffic backend service (e.g., search or recommendations) has gradually increasing memory usage and GC pauses, leading to intermittent timeouts under peak load. How would you diagnose and fix a memory leak or performance regression in production? Explain your approach to profiling, heap analysis, sampling traces, and experimenting safely with fixes. How do you differentiate between application-level leaks, library issues, and infrastructure misconfiguration?”
1. What is This Question Testing?
- Performance Debugging: Can you diagnose memory leaks in production?
- Profiling Tools: Do you know heap dumps, GC logs, profilers?
- Root Cause Analysis: Can you distinguish app vs library vs infrastructure issues?
- Safe Experimentation: Can you test fixes without impacting customers?
- Memory Management: Do you understand GC behavior, memory allocation patterns?
2. The Answer
Answer:
I’d use heap dumps, GC log analysis, and sampling profilers to identify memory leaks, validate with controlled experiments, then deploy fixes gradually with canary releases.
First, symptoms and initial triage:
Observed Symptoms:
1. Memory usage: Gradually increasing from 2GB → 6GB → 8GB (OOM kill)
2. GC pauses: Increasing from 50ms → 500ms → 2 seconds
3. Timeouts: p99 latency 100ms → 5 seconds during GC pauses
4. Pattern: Happens after ~6 hours uptime
Timeline:
00:00 - Service starts, memory at 2GB
06:00 - Memory at 4GB, GC pauses 200ms
12:00 - Memory at 7GB, GC pauses 1 second
14:00 - OOM kill, restart, cycle repeats

Second, heap dump analysis:
Capture Heap Dump (Java example):
# During high memory usage
jmap -dump:live,format=b,file=heap.hprof <pid>

# Or automatic on OOM
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heap.hprof

Analyze with Eclipse MAT (Memory Analyzer Tool):
Top Memory Consumers:
1. HashMap<String, User> cache: 4.2GB (53% of heap!)
- 2,000,000 entries
- Growing unbounded
- Never evicted
2. ArrayList<Request> requestLog: 800MB (10% of heap)
- 500,000 entries
- Accumulating without limit
3. Normal objects: 3GB (37% of heap)

Leak Suspects Report:
Suspected leak:
- HashMap<String, User> cache
- Accumulator: CacheManager class
- Problem: No size limit, no TTL, no eviction policy
Dominator tree shows:
CacheManager → HashMap → 2M User objects → 4.2GB

Third, GC log analysis:
Enable GC Logging:
# Java
-Xlog:gc*:file=gc.log:time,uptime,level,tags

# Python
PYTHONTRACEMALLOC=1

Analyze GC Pattern:
GC Log Analysis:
Time=00:00, YoungGC: 50ms, OldGC: N/A, Heap: 2GB
Time=06:00, YoungGC: 100ms, OldGC: 500ms, Heap: 4GB
Time=12:00, YoungGC: 200ms, FullGC: 2000ms, Heap: 7GB
Pattern:
- Young GC frequency increasing (more objects created)
- Full GC happening (old gen filling up)
- Heap not shrinking after GC (objects still referenced = leak)
Conclusion: Objects accumulating in old generation (memory leak)

Fourth, profiling in production (safe):
Sampling Profiler (Low Overhead):
# Python: py-spy (sampling profiler, <1% overhead)
py-spy record --pid <pid> --output profile.svg --duration 60
# Samples call stacks to show where time is spent;
# pair with tracemalloc for per-line allocation data

Profiling Results:
Top Memory Allocators:
1. cache_user() - 60% of allocations
- Called 10,000× per second
- Allocates User object each call
- Never frees old entries
2. log_request() - 15% of allocations
- Appends to unbounded list
- List grows indefinitely

Fifth, identifying root cause:
Code Review of Suspect:
# BAD: Memory leak (unbounded cache)
cache = {}  # Global dictionary, never cleared

def get_user(user_id):
    if user_id not in cache:
        # Fetch from DB
        user = db.query(f"SELECT * FROM users WHERE id = {user_id}")
        cache[user_id] = user  # LEAK: Cache grows forever
    return cache[user_id]

# After 12 hours at 10K RPS:
# 10K requests/sec × 43,200 sec = 432M requests
# Even if only 5% unique users = 21M cache entries = OOM

Differentiate: App vs Library vs Infrastructure
Application-level leak (This case):
- Code explicitly creates unbounded data structures
- Fix: Application code changes
Library leak:
- Third-party library not releasing resources
- Example: HTTP client not closing connections
- Fix: Update library or workaround
Infrastructure misconfiguration:
- JVM heap too small for workload
- Example: -Xmx2g for service needing 4GB
- Fix: Increase heap size (but doesn't fix leak, just delays it)

Sixth, implementing the fix:
Fix 1: Bounded Cache with LRU Eviction:
# GOOD: Bounded cache with automatic eviction
from functools import lru_cache

@lru_cache(maxsize=10000)  # Maximum 10K entries
def get_user(user_id):
    return db.query(f"SELECT * FROM users WHERE id = {user_id}")

# LRU automatically evicts least-recently-used entries
# Memory bounded to ~10K users × 2KB/user = 20MB

Fix 2: TTL-Based Cache:
# Alternative: Time-based expiration
from cachetools import TTLCache

cache = TTLCache(maxsize=10000, ttl=300)  # 5 min TTL

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = db.query(f"SELECT * FROM users WHERE id = {user_id}")
    return cache[user_id]

Fix 3: External Cache (Redis):
def get_user(user_id):
    # Try Redis first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache miss
    user = db.query(f"SELECT * FROM users WHERE id = {user_id}")
    # Store in Redis with TTL
    redis.setex(f"user:{user_id}", 300, json.dumps(user))
    return user

# Redis handles eviction automatically (maxmemory-policy allkeys-lru)

Seventh, safe deployment and validation:
Canary Deployment (20% traffic):
1. Deploy fixed version to 20% of instances
2. Monitor memory usage for 12 hours:
- Old instances: Memory grows to 7GB → OOM
- Canary instances: Memory stable at 2.5GB ✓
3. Validate metrics:
- GC pauses: 50ms (down from 2 seconds) ✓
- p99 latency: 100ms (down from 5 seconds) ✓
- Error rate: 0.1% (unchanged) ✓
4. Rollout to 100% after 24 hours

Before/After Comparison:
Before (with leak):
- Memory: 2GB → 8GB over 12 hours
- GC frequency: Every 10 seconds (Full GC)
- GC pause: Up to 2 seconds
- Instance lifetime: 14 hours (OOM kill)
After (with fix):
- Memory: Stable at 2.5GB
- GC frequency: Every 60 seconds (Young GC only)
- GC pause: 50ms
- Instance lifetime: Unlimited (no OOM)

Interview Score: 9/10
Why: Complete diagnostic workflow (heap dump + GC logs + profiling), clear differentiation between app/library/infrastructure leaks, three fix approaches (LRU cache, TTL, Redis), safe canary deployment validation, and before/after metrics showing impact.
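The heap-dump workflow above is Java-centric; in Python the same "diff two snapshots" idea can be sketched with the stdlib tracemalloc module (illustrative in-process sketch — a real service would snapshot on a timer or admin endpoint):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulate a leak: ~1MB of objects that stay referenced
leaked = [bytearray(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
# Diff the snapshots: the biggest positive size_diff points at the leak site
stats = after.compare_to(before, "lineno")
growth = sum(s.size_diff for s in stats)
print(f"net growth: {growth} bytes")  # dominated by the ~1MB list above
```

Unlike a one-off heap dump, the snapshot diff isolates what grew *between* two points in time, which is exactly the question a gradually climbing memory graph raises.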
Question 14: Security, Identities, and Authz at Scale
Difficulty: Very High
Role: Senior Backend Engineer / Security
Level: Senior/Staff (L5-L7, 5-10 Years of Experience)
Company Examples: Auth0, Okta, Stripe
Question: “Design a secure authentication and authorization architecture for a multi-tenant SaaS platform with public APIs. How would you combine OAuth2/OIDC, JWT, API gateways, and service-to-service auth (mTLS, service accounts) to protect resources, implement fine-grained authorization, and make rollout safe? Discuss token lifetimes, refresh flows, revocation, and how you’d instrument the system to detect abuse or privilege escalation attempts.”
1. What is This Question Testing?
- Security Architecture: Can you design auth for multi-tenant SaaS platforms?
- OAuth2/OIDC: Do you understand modern authentication protocols?
- Service-to-Service Auth: Can you secure internal API calls with mTLS?
- Token Management: Do you know JWT lifecycles, refresh flows, revocation?
- Abuse Detection: Can you instrument systems to catch privilege escalation?
2. The Answer
Answer:
I’d use OAuth2 authorization code flow + JWT for user auth, mTLS for service-to-service, RBAC + ABAC for authorization, and anomaly detection for abuse prevention.
First, user authentication flow (OAuth2 + OIDC):
OAuth2 Authorization Code Flow (Most Secure):
1. User clicks "Login" → Redirect to auth.company.com/authorize
2. User authenticates (username/password + MFA)
3. Auth server returns authorization code to callback URL
4. Frontend exchanges code for tokens:
POST /oauth/token
{
"grant_type": "authorization_code",
"code": "abc123",
"redirect_uri": "https://app.company.com/callback"
}
5. Response:
{
"access_token": "eyJhbGc..." (JWT, 15 min),
"refresh_token": "rt_abc123" (opaque, 30 days),
"id_token": "eyJhbGc..." (OIDC, user identity)
}
6. Frontend stores tokens:
- Access token: Memory only (not localStorage, XSS risk)
- Refresh token: HttpOnly cookie (CSRF protected)

JWT Structure:
// Access Token (JWT)
{
  "header": { "alg": "RS256", "typ": "JWT" },
  "payload": {
    "sub": "user_123",
    "tenant_id": "tenant_abc",
    "roles": ["admin", "editor"],
    "permissions": ["users:read", "users:write"],
    "exp": 1640000000,  // 15 min expiry
    "iat": 1639999100
  },
  "signature": "..."
}

Second, API gateway validation:
# API Gateway (Kong, AWS API Gateway, custom)
class APIGateway:
    def validate_request(self, request):
        # Extract JWT from Authorization header
        auth_header = request.headers.get("Authorization")
        if not auth_header or not auth_header.startswith("Bearer "):
            return error("Unauthorized", 401)
        token = auth_header.replace("Bearer ", "")
        # Verify JWT signature (RS256 with public key)
        try:
            payload = jwt.decode(
                token,
                public_key,
                algorithms=["RS256"],
                options={"verify_exp": True}  # Check expiration
            )
        except jwt.ExpiredSignatureError:
            return error("Token expired", 401)
        except jwt.InvalidTokenError:
            return error("Invalid token", 401)
        # Check token not revoked (Redis blacklist)
        if redis.exists(f"revoked:{payload['jti']}"):  # jti = JWT ID
            return error("Token revoked", 401)
        # Attach user context to request
        request.user = {
            "user_id": payload["sub"],
            "tenant_id": payload["tenant_id"],
            "roles": payload["roles"],
            "permissions": payload["permissions"]
        }
        # Route to backend service
        return forward_to_backend(request)

Third, fine-grained authorization (RBAC + ABAC):
RBAC (Role-Based Access Control):
# Simple role check
def get_users(request):
    if "admin" not in request.user["roles"]:
        return error("Forbidden", 403)
    # Admin can access
    users = User.query.all()
    return jsonify(users)

ABAC (Attribute-Based Access Control):
# Policy: Users can only edit their own tenant's data
def update_user(request, user_id):
    target_user = User.get(user_id)
    # Check permission
    if not has_permission(request.user, "users:write", target_user):
        return error("Forbidden", 403)
    # Update user
    target_user.update(request.json)
    return jsonify(target_user)

def has_permission(current_user, permission, resource):
    # Check permission exists
    if permission not in current_user["permissions"]:
        return False
    # Check tenant isolation (ABAC attribute)
    if resource.tenant_id != current_user["tenant_id"]:
        return False  # Can't access other tenants' data
    return True

Fourth, service-to-service auth (mTLS):
Why mTLS over JWT for internal services:
JWT issues for service-to-service:
- Need to manage service accounts, rotate secrets
- Adds latency (sign, verify tokens)
- Token expiry handling
mTLS benefits:
- Certificate-based mutual authentication
- No tokens to manage
- Lower latency
- Already have TLS infrastructure

mTLS Setup:
Each service has:
1. Client certificate (signed by internal CA)
2. Private key
3. CA certificate (to verify peers)
Service A calling Service B:
1. TLS handshake with mutual cert verification
2. Both sides verify peer certificate
3. Connection established only if both valid
4. No JWT needed!

Implementation:
# Service A (caller)
import requests

response = requests.get(
    "https://service-b.internal/api/data",
    cert=("/path/to/client-cert.pem", "/path/to/client-key.pem"),
    verify="/path/to/ca-cert.pem"  # Verify Service B's cert
)

# Service B (receiver) - Nginx config
server {
    listen 443 ssl;
    ssl_certificate     /path/to/server-cert.pem;
    ssl_certificate_key /path/to/server-key.pem;
    # Require client cert
    ssl_client_certificate /path/to/ca-cert.pem;
    ssl_verify_client on;
    location /api {
        # Only requests with valid cert reach here
        proxy_pass http://backend;
    }
}

Fifth, token lifecycle management:
Access Token Refresh:
# When access token expires (15 min)
def refresh_access_token(refresh_token):
    # Verify refresh token
    stored = redis.get(f"refresh_token:{refresh_token}")
    if not stored:
        return error("Invalid refresh token", 401)
    user_id = json.loads(stored)["user_id"]
    # Issue new access token
    access_token = jwt.encode({
        "sub": user_id,
        "tenant_id": get_user_tenant(user_id),
        "roles": get_user_roles(user_id),
        "exp": time.time() + 900  # 15 min
    }, private_key, algorithm="RS256")
    # Rotate refresh token (security best practice)
    new_refresh_token = generate_secure_token()
    redis.delete(f"refresh_token:{refresh_token}")
    redis.setex(
        f"refresh_token:{new_refresh_token}",
        2592000,  # 30 days
        json.dumps({"user_id": user_id})
    )
    return {
        "access_token": access_token,
        "refresh_token": new_refresh_token
    }

Token Revocation:
# Logout - revoke tokens
def logout(request):
    access_token = extract_token(request)
    # Don't verify signature, just extract claims
    payload = jwt.decode(access_token, options={"verify_signature": False})
    # Add to blacklist (TTL = remaining token lifetime)
    ttl = int(payload["exp"] - time.time())
    redis.setex(f"revoked:{payload['jti']}", ttl, "1")
    # Delete refresh token
    refresh_token = request.cookies.get("refresh_token")
    redis.delete(f"refresh_token:{refresh_token}")
    return {"status": "logged out"}

Sixth, abuse detection and instrumentation:
Anomaly Detection:
# Track API usage per user
def track_api_call(user_id, endpoint, response_time):
    # Increment counter (INCR returns the new integer value)
    key = f"api_usage:{user_id}:{date.today()}"
    count = redis.incr(key)
    redis.expire(key, 86400)  # 24 hours
    # Check rate (simple anomaly detection)
    if count > 10000:  # 10K calls per day
        alert(f"User {user_id} exceeded normal API usage: {count} calls")
        # Optional: Temporary rate limit or flag for review

# Track failed auth attempts
def track_failed_login(user_id, ip_address):
    key = f"failed_login:{user_id}"
    failed_count = redis.incr(key)
    redis.expire(key, 3600)  # 1 hour
    if failed_count >= 5:
        # Lock account temporarily
        redis.setex(f"locked:{user_id}", 1800, "1")  # 30 min lockout
        alert(f"Account locked due to failed login attempts: {user_id}")

Privilege Escalation Detection:
# Log all permission changes
def grant_permission(admin_user_id, target_user_id, permission):
    # Verify admin has permission to grant
    if "admin" not in get_user_roles(admin_user_id):
        alert(f"Unauthorized permission grant attempt: {admin_user_id}")
        return error("Forbidden", 403)
    # Grant permission
    add_user_permission(target_user_id, permission)
    # Audit log
    audit_log.create({
        "event": "permission_granted",
        "admin_user_id": admin_user_id,
        "target_user_id": target_user_id,
        "permission": permission,
        "timestamp": time.time()
    })
    # Alert if granting admin permission
    if permission == "admin":
        alert(f"Admin permission granted: {admin_user_id} → {target_user_id}")
Why: OAuth2 authorization code flow with JWT, API gateway validation with signature verification, RBAC + ABAC for fine-grained authorization, mTLS for service-to-service auth, token rotation and revocation strategies, and comprehensive abuse detection with anomaly tracking and audit logging.
Question 15: Leadership: Technical Debt, Legacy Modernization, and Mentoring
Difficulty: Very High
Role: Staff/Principal Engineer / EM
Level: Staff+ (L6-L8, 7-15 Years of Experience)
Company Examples: All companies with legacy systems
Question: “You’re a Staff Backend Engineer or EM inheriting a legacy monolith that is critical to revenue but accumulating severe technical debt. How do you prioritize refactors vs new feature delivery, communicate trade-offs with product, and coach junior developers on making architecture decisions that scale? Describe a concrete framework for deciding when to pay down debt (e.g., strangler fig pattern, safety rails, incremental modularization) and how you would measure the impact on reliability and velocity.”
1. What is This Question Testing?
- Leadership: Can you balance technical health with business needs?
- Communication: Can you explain technical debt to non-technical stakeholders?
- Mentorship: Can you coach juniors to make scalable decisions?
- Strategic Thinking: Do you have frameworks for prioritizing debt paydown?
- Measurement: Can you quantify impact on velocity and reliability?
2. The Answer
Answer:
I’d use the Strangler Fig pattern for incremental modernization, enforce a 70/30 feature-to-refactor ratio, measure impact via DORA metrics, and mentor through architecture design reviews.
First, assessing technical debt:
Debt Audit Framework:
1. Inventory debt:
- Monolithic deployments (deploy time >30 min)
- No test coverage (<20%)
- Hard-coded config (no feature flags)
- Manual operations (deployments, rollbacks)
- Performance issues (p99 >1s)
2. Categorize by impact:
HIGH: Blocks new features or causes outages
MEDIUM: Slows development velocity
LOW: Minor annoyance, no business impact
3. Estimate paydown cost:
- Engineering weeks required
- Risk level (could break production?)
- Dependencies (other teams affected?)

Example Debt Assessment:
HIGH Impact Debt (Fix first):
1. Monolithic database (single point of failure)
- Impact: Outages affect all features
- Cost: 12 eng-weeks to shard
- ROI: Prevents $500K/outage losses
2. No feature flags (can't rollback bad deploys)
- Impact: Each bad deploy = 2 hour outage
- Cost: 4 eng-weeks to implement
- ROI: Saves 10 hours/month incident response
MEDIUM Impact Debt:
3. Slow CI/CD (30 min builds)
- Impact: Slows feature delivery by 20%
- Cost: 6 eng-weeks to optimize
- ROI: Ship features 20% faster
LOW Impact Debt:
4. Inconsistent code style
- Impact: Minor code review friction
- Cost: 2 eng-weeks (linting setup)
- ROI: Small quality-of-life improvement

Second, prioritization framework (70/30 rule):
The Rule:
70% time: New features (product value)
30% time: Refactors/debt (engineering health)
Why this ratio:
- 100% features → Technical debt accumulates, velocity crashes
- 100% refactors → No customer value, business dies
- 70/30 → Sustainable balance

Communicating to Product:
"Here's the trade-off:
Option A: 100% features now
- Ship 10 features this quarter
- But: Deploy time increases 2× each quarter
- Result: In 12 months, we ship 2 features/quarter (80% slowdown)
Option B: 70% features, 30% refactors
- Ship 7 features this quarter
- But: Deploy time stays constant
- Result: In 12 months, we still ship 7 features/quarter (sustainable)
Investment: 30% refactors = 30% faster long-term feature delivery"

Third, Strangler Fig pattern for modernization:
Pattern: Build New Alongside Old, Gradually Migrate
Legacy Monolith:
┌─────────────────────────┐
│ Users Service │
│ Products Service │
│ Payments Service │
│ Orders Service │
│ (All in one codebase) │
└─────────────────────────┘
Step 1: Extract one service (Payments)
┌─────────────────────────┐ ┌──────────────────┐
│ Users Service │ │ Payments Service │
│ Products Service │◀────▶│ (New microservice)│
│ Orders Service │ └──────────────────┘
│ (Monolith) │
└─────────────────────────┘
Step 2: Route traffic via feature flag
- 10% payments → new service
- 90% payments → old monolith
- Gradually increase to 100%
Step 3: Repeat for next service
Continue until the monolith is fully strangled.

Implementation:
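The routing shim below assumes a `feature_flag_enabled` helper. One plausible implementation (a sketch, not from the original) rolls the dice per call, or hashes a stable key so a given order is routed consistently across retries:

```python
import hashlib
import random

def feature_flag_enabled(flag_name, percentage, key=None):
    """True for roughly `percentage`% of traffic.

    With a stable key (e.g. an order_id) the decision is deterministic per
    key, so retries of the same order never flip between code paths.
    """
    if key is None:
        return random.random() * 100 < percentage  # unkeyed: random sample
    digest = hashlib.sha256(f"{flag_name}:{key}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage  # bucket 0-99 per key

# Percentage bounds behave as expected regardless of key
assert feature_flag_enabled("new_payment_service", 100)
assert not feature_flag_enabled("new_payment_service", 0)
```

Raising the rollout from 10% to 100% is then a config change, not a deploy, which is exactly what makes the strangler migration safe to pause or reverse.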
# Monolith code (transition state)
def process_payment(order_id, amount):
    # Feature flag: route to the new service or the old code path
    if feature_flag_enabled("new_payment_service", percentage=10):
        # Call the new microservice
        response = requests.post(
            "https://payments.internal/api/charge",
            json={"order_id": order_id, "amount": amount}
        )
        return response.json()
    else:
        # Old monolith code (legacy path)
        return legacy_process_payment(order_id, amount)

Fourth, measuring impact (DORA metrics):
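Deployment frequency and MTTR can be computed directly from deploy and incident logs. A hedged sketch (the event shapes are assumptions, not an established API):

```python
from datetime import datetime, timedelta

def deploys_per_week(deploy_times, window_days=28):
    """Deployment frequency over a trailing window of deploy timestamps."""
    cutoff = max(deploy_times) - timedelta(days=window_days)
    recent = [t for t in deploy_times if t >= cutoff]
    return len(recent) / (window_days / 7)

def mttr_minutes(incidents):
    """Mean time to recover, from (started, resolved) timestamp pairs."""
    durations = [(resolved - started).total_seconds() / 60
                 for started, resolved in incidents]
    return sum(durations) / len(durations)

now = datetime(2024, 1, 29)
deploys = [now - timedelta(days=d) for d in range(0, 28, 7)]  # 4 deploys in 28 days
incidents = [(now, now + timedelta(minutes=10)),
             (now, now + timedelta(minutes=20))]
print(deploys_per_week(deploys))  # 1.0 deploy/week
print(mttr_minutes(incidents))    # 15.0 minutes
```

Emitting these from CI and incident tooling gives the before/after numbers below without manual bookkeeping.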
DORA Metrics (DevOps Research and Assessment):
1. Deployment Frequency
- Before refactor: 2 deploys/week
- After refactor: 10 deploys/week
- Improvement: 5× faster shipping
2. Lead Time for Changes
- Before: 2 weeks (code → production)
- After: 2 days
- Improvement: 7× faster
3. Change Failure Rate
- Before: 20% (1 in 5 deploys breaks production)
- After: 5%
- Improvement: 4× more reliable
4. Mean Time to Recover (MTTR)
- Before: 4 hours (manual rollback)
- After: 10 minutes (automated rollback)
- Improvement: 24× faster recovery

Tracking Progress:
Dashboard:
- Deploy frequency: [Chart showing increase]
- MTTR: [Chart showing decrease]
- Test coverage: 20% → 60%
- Build time: 30 min → 5 min
Business Impact:
- Features shipped/quarter: 5 → 10 (2× velocity)
- Outage hours/month: 8 → 1 (8× more reliable)
- Customer satisfaction (NPS): 40 → 65

Fifth, mentoring junior developers:
Mentorship Framework:
1. Architecture Design Reviews (ADR):
Process:
- Junior proposes solution in design doc
- Staff engineer reviews async
- 30-min sync to discuss trade-offs
Example:
Junior: "I'll use Redis for session storage"
Mentor: "Good choice. Consider:
- What happens if Redis goes down? (Fallback to DB?)
- How will you handle Redis cluster failover? (Client-side retry?)
- TTL strategy? (Match session expiry)
Let's update the doc with failure modes."

2. Pairing on Complex Refactors:
Weekly pairing session:
- Junior drives (writes code)
- Senior navigates (suggests approach)
- Teaches patterns in real-time
Example refactor:
"Let's extract this 500-line function together.
First, identify the core responsibility (payment processing).
Then extract dependencies (DB, external APIs).
Finally, write tests before refactoring (safety net)."

3. Code Review as Teaching:
Instead of: "This is bad, fix it"
Teach: "This works, but consider scalability:
Current:
for user in users:
    send_email(user)  # sends 10K emails serially, one blocking call at a time
Better:
batch_send_emails(users) # Batch API, sends 10K in parallel
Why? 10K serial emails = 10K × 100ms = 16 minutes
Batch: 10K / 100 per batch = 100 batches × 500ms = 50 seconds
Trade-off: Batch is 20× faster but more complex. Worth it for 10K+ users."

4. Safe Experimentation Environment:
Give juniors low-risk projects to learn:
- Internal tools (not customer-facing)
- Feature flags (easy to disable if broken)
- Code reviews before merge (safety net)
Example: "Build a new admin dashboard using microservices.
If it fails, no customer impact. But you'll learn:
- Service communication
- API design
- Database design
Great learning opportunity with low risk."

Sixth, communicating trade-offs to product:
Framework: Cost-Benefit in Business Terms
Product asks: "Why can't we ship feature X faster?"
Tech answer: "We have technical debt in the payment system."
Product-friendly answer:
"Our payment code is complex (10,000 lines in one file).
Adding feature X would take 4 weeks and risk breaking existing payments.
If we refactor first (2 weeks), we can:
1. Add feature X safely in 1 week (total: 3 weeks)
2. All future payment features ship 2× faster
3. Reduce payment errors by 50% (better customer experience)
ROI: 2-week refactor investment = 1 week saved on feature X + faster shipping forever"

Interview Score: 9/10
Why: Debt audit framework with HIGH/MED/LOW categorization, 70/30 feature-to-refactor rule with product communication, Strangler Fig pattern for gradual modernization, DORA metrics for measuring impact, comprehensive mentorship framework (ADRs, pairing, code review teaching), and business-friendly trade-off communication.