InterviewBee — Software Engineer Premium Question Bank | Volume 1
FAANG-Level Interview Preparation | Senior · Staff · Principal
Question 1: System Design — Designing a Rate Limiter for a High-Traffic API
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Google, Stripe, Cloudflare, Twitter, Twilio
The Question
You are a Senior Software Engineer at a payments company. The public-facing API processes 50,000 requests per second at peak and is being abused by a small number of clients who are sending bursts of thousands of requests within seconds, overwhelming downstream services and degrading performance for all other clients. The Head of Platform has asked you to design and implement a rate limiter that enforces per-client request limits (1,000 requests per minute per API key), handles burst traffic gracefully, works correctly across a horizontally-scaled fleet of 20 API servers, and fails open (allows traffic through rather than blocking everything if the rate limiter itself has a failure). Design the algorithm, the data storage approach, the distributed coordination mechanism, and the API response behaviour when a client is rate-limited.
1. What Is This Question Testing?
- Rate limiting algorithm selection — understanding the four main rate limiting algorithms and their tradeoffs: fixed window counter (simplest, but allows a burst at the window boundary — 1,000 requests at 11:59 + 1,000 at 12:00 = 2,000 requests in 2 seconds), sliding window log (accurate, but stores each request timestamp — memory-expensive at scale), sliding window counter (approximates the sliding window by interpolating between the current and previous fixed window counts — low memory, close to accurate), and token bucket (allows controlled bursting — a client can save tokens and use them in a burst, up to a configurable bucket maximum); knowing that the token bucket is the standard for payment APIs because it allows short legitimate bursts without permitting sustained overload
- Distributed rate limiting with Redis — understanding that rate limiting across 20 API servers requires a shared counter store; each server independently querying a centralised Redis cluster is the standard approach; knowing that Redis's INCRBY + EXPIRE commands are not atomic together (a race condition exists between them), and that a Lua script in Redis (which executes atomically) is the correct implementation for the token bucket or sliding window counter: the script checks the count, increments it, and sets the TTL in a single atomic operation
- Fail-open vs. fail-closed — understanding the tradeoff: fail-closed (block all traffic if the rate limiter fails) is safer for security use cases but catastrophic for availability; fail-open (allow all traffic if the rate limiter cannot be reached) preserves availability but means rate limiting is not enforced during outages; for a payments API where downtime is more damaging than a burst of excess requests during a brief Redis outage, fail-open is the correct choice; the implementation must be explicit: try { checkRateLimit() } catch (e: RedisException) { return allow() } — never rely on default exception propagation to implement fail-open
- HTTP response design for rate limiting — knowing the standard HTTP headers: the 429 Too Many Requests status code, a Retry-After: <seconds> header indicating when the client can retry, X-RateLimit-Limit: 1000 (the limit), X-RateLimit-Remaining: 0 (remaining requests), and X-RateLimit-Reset: 1735689600 (Unix timestamp of the window reset); knowing that well-designed rate limit responses enable clients to implement polite backoff without guessing
- The thundering herd problem — knowing that if 10,000 clients are all told Retry-After: 60, they will all retry simultaneously 60 seconds later, producing a thundering herd; the correct mitigation is adding jitter: Retry-After: 60 + random(0, 30) — spreading retries over a 30-second window (see the sketch after this list); alternatively, returning the exact time of the next token replenishment (not a fixed window reset) allows clients to retry at their individually-computed next available slot
- Monitoring and alerting — knowing that a rate limiter without observability is flying blind: track the rate limit hit rate per client (clients consistently hitting rate limits may need quota increases or are misbehaving), the Redis latency for rate limit checks (if Redis adds more than 2ms to every request, it is the new bottleneck), and the fail-open event rate (any period where the rate limiter failed open must be reviewed for abuse patterns)
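A minimal sketch of the jittered Retry-After computation from the thundering-herd bullet above; the 60-second base and 30-second jitter window are the figures used in this section, and the function name is illustrative:

import random

def retry_after_with_jitter(base_seconds: int, jitter_seconds: int = 30) -> int:
    """Spread retries over [base, base + jitter] so rate-limited clients
    do not all return at the same instant (thundering herd)."""
    return base_seconds + random.randint(0, jitter_seconds)

# 10,000 clients told to wait ~60s now retry spread across a 30-second window
header_value = retry_after_with_jitter(60)  # e.g. Retry-After: 73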
2. Framework: Distributed Rate Limiter Design Model (DRLDM)
- Assumption Documentation — Establish the rate limit dimensions before designing the algorithm: is the limit per API key, per IP, per user ID, or per endpoint? A payments API typically rate-limits per API key (the authenticated client) with different limits per tier (starter: 100/min, pro: 1,000/min, enterprise: custom); different limits per endpoint are also common (read endpoints allow more requests than write endpoints)
- Constraint Analysis — 50,000 RPS means the rate limiter check must complete in under 1ms (it is on the critical path of every request); Redis round-trip latency is typically 0.3–0.8ms on a co-located deployment; at 50,000 RPS, the Redis cluster must handle 50,000 commands per second — this requires a Redis cluster with at least 3 shards
- Tradeoff Evaluation — Centralised Redis (accurate, single point of coordination, adds network hop) vs. local in-process rate limiting with eventual synchronisation (lower latency, but a client can exceed the limit by up to N times if they distribute requests across N servers simultaneously); for a payments API where accuracy matters, centralised Redis is correct
- Hidden Cost Identification — The Lua script for atomic rate limiting must be cached on the Redis server using SCRIPT LOAD and called via EVALSHA (the hash of the loaded script) rather than EVAL (which sends the script text on every call); at 50,000 RPS, the difference between EVALSHA (small hash) and EVAL (full script text) is significant network overhead
- Risk Signals / Early Warning Metrics — Rate limit hit rate per client per hour (a sudden spike in hits for a specific API key indicates either a buggy client or a credential leak being used for abuse), Redis p99 latency (above 2ms on the rate limit check path degrades API response times for all clients), fail-open event count per hour (any non-zero value requires an immediate post-mortem)
- Pivot Triggers — If the Redis cluster becomes a bottleneck at scale (p99 rate limit check latency exceeds 5ms): implement a two-tier rate limiting approach — a local in-process rate limiter uses a smaller local token bucket (e.g., 100 tokens, replenished every 6 seconds from the Redis count), reducing Redis calls by 90% while bounding the local token over-usage to the local bucket size (see the sketch after this list)
- Long-Term Evolution Plan — Phase 1: per-API-key token bucket with Redis (described below); Phase 2: per-endpoint rate limits with different limits for read vs. write; Phase 3: adaptive rate limiting (detect abuse patterns and automatically tighten limits for suspicious clients); Phase 4: rate limit tiers exposed as a product feature (clients can purchase higher limits)
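A minimal sketch of that two-tier design, in Python; redis_take_tokens is an assumed stub for the shared Redis bucket (e.g. the Lua script in Section 3), and the 100-token bucket with a 6-second refill matches the figures above. Over-admission is bounded because a server can never hand out more local tokens than Redis granted it:

import time

class TwoTierRateLimiter:
    """Local bucket in front of the shared Redis bucket: roughly one Redis
    call per LOCAL_BUCKET_SIZE requests instead of one per request."""

    LOCAL_BUCKET_SIZE = 100   # tokens claimed from Redis per refill
    REFILL_INTERVAL_S = 6.0   # resync with the shared count every 6 seconds

    def __init__(self, redis_take_tokens):
        # redis_take_tokens(n) -> number of tokens the shared bucket granted (0..n)
        self._redis_take_tokens = redis_take_tokens
        self._local_tokens = 0
        self._last_refill = float("-inf")

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self._last_refill >= self.REFILL_INTERVAL_S:
            self._local_tokens = self._redis_take_tokens(self.LOCAL_BUCKET_SIZE)
            self._last_refill = now
        if self._local_tokens > 0:
            self._local_tokens -= 1
            return True
        return False   # local tokens exhausted until the next refill window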
3. The Answer
The Token Bucket Algorithm
The token bucket is the correct choice for a payments API: it allows controlled bursting (a client who has been quiet for a minute can use accumulated tokens for a burst of up to 200 requests in 2 seconds) while preventing sustained overload. Parameters: bucket capacity = 200 tokens (allowing a 200-request burst for well-behaved clients), refill rate = 1,000 tokens per minute = 16.67 tokens per second, initial tokens = 200 (on first request). Token bucket state per API key stored in Redis: {tokens: 200.0, last_refill_timestamp: 1735689600.123}.
The Atomic Redis Lua Script
-- KEYS[1] = rate_limit:{api_key}
-- ARGV[1] = current_timestamp_ms, ARGV[2] = refill_rate_per_ms
-- ARGV[3] = bucket_capacity, ARGV[4] = tokens_requested (always 1)
local key = KEYS[1]
local now = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per millisecond (1000/60000 = 0.01667)
local capacity = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity -- Default to full bucket for new keys
local last_refill = tonumber(data[2]) or now
-- Refill tokens based on elapsed time
local elapsed_ms = now - last_refill
local refilled = elapsed_ms * refill_rate
tokens = math.min(capacity, tokens + refilled)
-- Check if the request can be served
if tokens >= requested then
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 120) -- TTL: expire if inactive for 2 minutes
return {1, math.floor(tokens)} -- {allowed=true, remaining_tokens}
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 120)
-- Calculate milliseconds until 1 token is refilled
local ms_until_token = math.ceil((1 - tokens) / refill_rate)
return {0, ms_until_token} -- {allowed=false, ms_until_next_token}
end

This Lua script executes atomically on the Redis server — no race condition between the read, update, and write operations. At 50,000 RPS, the script executes in under 0.1ms on the Redis server.
The API Rate Limit Middleware (Java/Kotlin example)
class RateLimitMiddleware(
private val redisClient: RedisClient,
private val rateLimitConfig: RateLimitConfig
) {
private val luaScript = redisClient.loadScript(LUA_SCRIPT) // Cached as EVALSHA hash
fun checkRateLimit(apiKey: String): RateLimitResult {
return try {
val result = redisClient.evalSha(
luaScript,
keys = listOf("rate_limit:$apiKey"),
args = listOf(
System.currentTimeMillis().toString(),
rateLimitConfig.refillRatePerMs.toString(),
rateLimitConfig.bucketCapacity.toString(),
"1"
)
)
val allowed = result[0] as Long == 1L
val secondValue = result[1] as Long
if (allowed) {
RateLimitResult.Allowed(remainingTokens = secondValue)
} else {
val retryAfterMs = secondValue
RateLimitResult.RateLimited(retryAfterMs = retryAfterMs)
}
} catch (e: RedisException) {
// FAIL OPEN: Redis unavailable — allow the request
logger.error("Rate limiter Redis unavailable, failing open", e)
metrics.increment("rate_limiter.fail_open")
RateLimitResult.Allowed(remainingTokens = -1) // -1 signals fail-open mode
}
}
}

HTTP Response Headers
When rate-limited, return:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689660
Retry-After: 62 ← with jitter: base_seconds + random(0, 30)
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "You have exceeded the rate limit of 1000 requests per minute.",
"retry_after_seconds": 62,
"documentation_url": "https://docs.company.com/api/rate-limits"
}
}

When allowed, include the rate limit headers in every response (not just 429s) — this allows clients to self-throttle before hitting the limit:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1735689660

Redis Cluster Sharding
At 50,000 RPS, the rate limiter issues approximately 50,000 Lua script executions per second against Redis. A single Redis instance can handle approximately 100,000 simple commands per second — with the Lua script performing 3–4 Redis commands internally, the throughput is approximately 25,000–33,000 scripts per second per instance. For 50,000 RPS, use a Redis cluster with at least 2–3 primary shards. Route each API key to its shard by hashing the API key: Redis Cluster's built-in hash slot mechanism (CRC16(api_key) % 16384) distributes keys automatically across shards, ensuring that the same API key always hits the same shard.
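For illustration, the slot computation is easy to reproduce: Redis Cluster uses CRC16 (XMODEM) modulo 16,384, which matches Python's binascii.crc_hqx. This is a simplified sketch — the real implementation also honours {hash tags} in key names, hashing only the tagged portion, which is what keeps rate_limit:{api_key} routing on the api_key alone:

import binascii

def redis_hash_slot(key: str) -> int:
    """Hash slot as Redis Cluster computes it: CRC16(key) % 16384."""
    return binascii.crc_hqx(key.encode(), 0) % 16384

# The same API key always maps to the same slot, and therefore the same shard
print(redis_hash_slot("api_key_123"))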
Early Warning Metrics:
- Rate limit hit rate by API key — a key hitting the rate limit more than 50% of the time is either misconfigured (the limit is too low for its legitimate use) or abusive; alert the account team for the former, the security team for the latter
- Redis cluster latency P99 for the rate limit script — alert at 2ms; above 5ms requires immediate investigation as this is now adding perceptible latency to every API response
- Fail-open event rate per hour — zero in normal operation; any non-zero value triggers a PagerDuty alert and a post-incident review; the rate limiter's Redis availability must be treated as a Tier 1 service dependency
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: The Lua script for atomic token bucket operations — with the specific detail that the script must be pre-loaded with SCRIPT LOAD and called via EVALSHA (not EVAL) to avoid sending the script text on every one of 50,000 requests per second — demonstrates that this engineer has thought about the performance characteristics of the implementation, not just its correctness. The fail-open implementation with metrics.increment("rate_limiter.fail_open") in the catch block — so that Redis outages are visible in dashboards even when they are handled gracefully — is the operations engineering discipline that makes a production system debuggable. The jitter in the Retry-After header (base + random(0, 30)) directly addresses the thundering herd problem that a naively implemented rate limiter introduces.
What differentiates it from mid-level thinking: A mid-level engineer would propose "use Redis with INCR and EXPIRE" without knowing about the non-atomicity of those two commands together (requiring a Lua script), would implement a fixed window counter (allowing boundary bursts) rather than a token bucket, would not include rate limit headers in every response (only in 429s), and would not address the thundering herd problem. They would also not calculate whether a single Redis instance can handle 50,000 RPS.
What would make it a 10/10: A 10/10 response would include the two-tier local + Redis hybrid rate limiter design (reducing Redis calls by 90%), a concrete Redis cluster sizing calculation showing the number of shards required for 50,000 RPS, and a distributed tracing integration showing how rate limit checks appear in the request trace alongside the upstream service calls.
Question 2: Data Structures and Algorithms — Solving a Real-World Problem Under Engineering Constraints
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Google, Meta, Amazon, Apple, Microsoft
The Question
You are a Senior Software Engineer at a logistics company. You have been asked to implement a real-time package routing system. Packages arrive at a warehouse and must be assigned to the optimal delivery route. Each route is a sequence of stops, and each package has a destination address. The addresses have been pre-geocoded into (lat, lng) coordinates. The routing problem has three specific requirements: (1) Given a delivery route (an ordered list of GPS coordinates) and a new package's destination coordinate, find the stop on the route that is geographically closest to the package's destination — this must be answered in under 10ms for routes with up to 5,000 stops; (2) The same destination coordinate must be looked up repeatedly (the same address appears on multiple packages) — the system processes 20,000 packages per hour and 30% have duplicate destinations; (3) Route stops are updated every 5 minutes as drivers complete stops — the data structure must support efficient deletion of completed stops and insertion of emergency stops. Walk through the data structure choices, time complexities, and the implementation for each requirement.
1. What Is This Question Testing?
- Spatial data structures — knowing that a linear scan of 5,000 stops to find the nearest one is O(n) and may be too slow at scale; knowing that a k-d tree (k-dimensional tree) is the standard data structure for nearest-neighbour search in 2D space — building a k-d tree is O(n log n), and a nearest-neighbour query is O(log n) average case; knowing that a quadtree is an alternative that recursively divides 2D space into quadrants, suitable for geographic regions; and knowing the practical consideration that for n=5,000, a linear scan with SIMD optimisation may outperform a k-d tree due to cache locality — the O(n) constant factor matters
- Memoisation and caching for repeated lookups — knowing that 30% duplicate destinations means that the nearest-stop computation for those destinations has been done before; a HashMap<Coordinate, Stop> cache (key: destination coordinate rounded to 5 decimal places, value: the nearest stop found last time) reduces repeated computation; knowing the cache invalidation problem: when stops are updated (requirement 3), the cache entries referencing the changed stops must be invalidated; this is the classic "cache invalidation" problem — the solution depends on whether approximate results are acceptable (serve from cache until TTL expires) or must be exact (invalidate eagerly when stops change)
- Dynamic updates to a k-d tree — knowing that standard k-d trees are not efficiently updatable — deleting a node requires rebuilding the subtree; knowing the alternatives: a KD-B tree supports dynamic updates but is more complex to implement; a Ball tree (a hierarchical tree where each node represents a ball in n-dimensional space, not a hyperplane split) supports approximate nearest-neighbour with dynamic updates; or a lazy deletion strategy for the k-d tree (mark deleted nodes rather than removing them, and periodically rebuild the tree during the 5-minute update window)
- Priority queue for route optimisation — knowing that when an emergency stop is inserted, it must be inserted in the optimal position of the route (between the stops it is nearest to, minimising total route distance); this is the travelling salesman insertion heuristic — for each new stop, compute the cost of inserting it between every pair of consecutive stops and choose the minimum-cost insertion point; this is O(n) per insertion, which is acceptable for 5,000 stops
- Space-time tradeoff analysis — demonstrating the ability to reason about the tradeoffs explicitly: the k-d tree uses O(n) space and O(log n) query time; the linear scan uses O(n) space and O(n) query time with better cache locality for small n; the cache uses O(m) space (m = number of distinct destinations) and O(1) query time for cached hits; providing the crossover point (at what n does the k-d tree outperform the linear scan? approximately n > 1,000 for random access patterns in 2D) demonstrates algorithmic depth
- Concurrency considerations — at 20,000 packages per hour, the system receives approximately 5.5 packages per second; the k-d tree is built on the route's 5,000 stops and queried for each incoming package; concurrent queries from multiple threads must be safe — the k-d tree must be either immutable (built once, queried by multiple threads without synchronisation) or protected by a read-write lock; the 5-minute rebuild cycle creates a window where the old tree is being queried while the new tree is being built — the swap must be atomic (see the sketch after this list)
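A minimal sketch of that atomic swap, assuming the immutable k-d tree from Section 3 below; in CPython a single attribute assignment is atomic, so a query thread sees either the old tree or the new one, never a half-built structure:

class SwappableIndex:
    def __init__(self, build_fn, stops):
        self._build_fn = build_fn      # e.g. build_kd_tree from Section 3
        self._root = build_fn(stops)   # immutable once built: queried lock-free

    def query(self, query_fn, *args):
        root = self._root              # snapshot the reference once
        return query_fn(root, *args)   # old tree stays valid during a rebuild

    def rebuild(self, stops):
        new_root = self._build_fn(stops)  # built off to the side; queries continue
        self._root = new_root             # single atomic reference swap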
2. Framework: Spatial Algorithm Selection and Implementation Model (SASIM)
- Assumption Documentation — Clarify the precision requirements: is "nearest stop" the Euclidean distance in lat/lng coordinates, or the great-circle distance (Haversine formula for the curvature of the Earth)? For distances under 100km (a typical delivery route), Euclidean distance in lat/lng introduces less than 0.5% error — acceptable for routing; for accuracy, use the Haversine formula in the final verification step after the k-d tree narrows the candidates (see the sketch after this list)
- Constraint Analysis — The 10ms response time requirement at a route size of 5,000 stops; a k-d tree nearest-neighbour query completes in under 1ms for 5,000 nodes on modern hardware — the bottleneck is likely the cache lookup and network I/O, not the algorithm itself
- Tradeoff Evaluation — For n=5,000 stops: a linear scan processes 5,000 distance calculations in approximately 0.05ms (50μs) on modern hardware with SIMD vectorisation — potentially faster than the k-d tree's cache-unfriendly tree traversal; benchmark both approaches on the target hardware before committing to the more complex k-d tree implementation
- Hidden Cost Identification — Geocoding precision: latitude and longitude coordinates are typically given to 6 decimal places (approximately 0.1m precision); rounding to 5 decimal places (approximately 1m) for the cache key reduces cache fragmentation from GPS coordinate noise while preserving practical location accuracy
- Risk Signals / Early Warning Metrics — Cache hit rate for the memoisation layer (target above 25% given the 30% duplicate destination rate — a hit rate below 20% indicates the cache key rounding strategy is creating false misses), k-d tree rebuild time per 5-minute cycle (the rebuild must complete in under 30 seconds to not overlap with the next cycle), nearest-neighbour query P99 latency (must remain under 10ms at all traffic levels)
- Pivot Triggers — If the k-d tree query time increases above 5ms when the route expands beyond 5,000 stops (e.g., a warehouse hub with 20,000-stop routes): switch to an approximate nearest-neighbour algorithm (Annoy — Approximate Nearest Neighbours Oh Yeah — by Spotify, which builds multiple random projection trees and returns the approximate nearest neighbour in O(log n) with configurable accuracy)
- Long-Term Evolution Plan — The current design handles a single route; at scale, the system routes packages across hundreds of routes simultaneously; the production architecture uses a geospatial index (PostGIS with ST_ClosestPoint, or Elasticsearch with geo_point queries) to first find which route a package belongs to, then the k-d tree to find the nearest stop on that route
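A sketch of the Haversine verification mentioned in the Assumption Documentation bullet: the k-d tree (working in raw lat/lng space) shortlists candidates, and the great-circle distance picks the true nearest. This is the standard formula, using a mean Earth radius of 6,371 km:

import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance between two (lat, lng) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# e.g. verify the top few k-d tree candidates with the exact distance:
# best = min(candidates, key=lambda s: haversine_km(s.lat, s.lng, dest_lat, dest_lng))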
3. The Answer
Requirement 1: Nearest Stop on a Route — K-D Tree Implementation
Build a 2D k-d tree from the route's (lat, lng) stop coordinates. Each node in the k-d tree represents a stop; the tree splits on alternating dimensions (latitude at even depth, longitude at odd depth), with the median coordinate as the splitting value:
from dataclasses import dataclass
from typing import Optional
@dataclass
class Stop:
stop_id: str
lat: float
lng: float
sequence: int # Position in the route
class KDNode:
    def __init__(self, stop: Stop, left: Optional['KDNode'] = None, right: Optional['KDNode'] = None):
self.stop = stop
self.left = left
self.right = right
def build_kd_tree(stops: list[Stop], depth: int = 0) -> Optional[KDNode]:
if not stops:
return None
# Alternate splitting dimension: 0=latitude, 1=longitude
axis = depth % 2
sorted_stops = sorted(stops, key=lambda s: s.lat if axis == 0 else s.lng)
mid = len(sorted_stops) // 2
return KDNode(
stop=sorted_stops[mid],
left=build_kd_tree(sorted_stops[:mid], depth + 1),
right=build_kd_tree(sorted_stops[mid + 1:], depth + 1)
)
def nearest_neighbour(root: KDNode, target_lat: float, target_lng: float) -> Stop:
best = [None, float('inf')] # [best_stop, best_dist_sq]
def _search(node: KDNode, depth: int):
if node is None:
return
# Euclidean distance squared (no sqrt needed for comparison)
dist_sq = (node.stop.lat - target_lat) ** 2 + (node.stop.lng - target_lng) ** 2
if dist_sq < best[1]:
best[0], best[1] = node.stop, dist_sq
axis = depth % 2
diff = (target_lat - node.stop.lat) if axis == 0 else (target_lng - node.stop.lng)
# Search the nearer half first, then check if the farther half could be closer
near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
_search(near, depth + 1)
if diff ** 2 < best[1]: # The splitting hyperplane is closer than the current best
_search(far, depth + 1)
_search(root, 0)
    return best[0]

Time complexity: Build: O(n log n). Query: O(log n) average, O(n) worst case (for pathological distributions like all stops on a line). Space complexity: O(n).
Requirement 2: Memoisation for Repeated Destinations
Wrap the k-d tree query with a cache keyed on the rounded coordinate:
import threading
class RouteNearestStopFinder:
def __init__(self, stops: list[Stop]):
self._tree = build_kd_tree(stops)
self._cache: dict[tuple, Stop] = {}
self._cache_lock = threading.RLock()
def find_nearest(self, dest_lat: float, dest_lng: float) -> Stop:
# Round to 5 decimal places (~1m precision) to catch GPS-noise duplicates
cache_key = (round(dest_lat, 5), round(dest_lng, 5))
with self._cache_lock:
if cache_key in self._cache:
return self._cache[cache_key]
# Cache miss — query the k-d tree
nearest = nearest_neighbour(self._tree, dest_lat, dest_lng)
with self._cache_lock:
self._cache[cache_key] = nearest
return nearest
def invalidate_stop(self, stop_id: str):
"""Call when a stop is completed or moved — invalidates all cache entries for that stop."""
with self._cache_lock:
keys_to_delete = [k for k, v in self._cache.items() if v.stop_id == stop_id]
for key in keys_to_delete:
                del self._cache[key]

Requirement 3: Dynamic Stop Updates
For the 5-minute stop update cycle, use a lazy deletion + periodic rebuild strategy:
class DynamicRouteFinder:
def __init__(self, stops: list[Stop]):
self._active_stops = {s.stop_id: s for s in stops}
self._deleted_stop_ids: set[str] = set()
self._finder = RouteNearestStopFinder(stops)
self._rebuild_threshold = 50 # Rebuild after 50 deletions (avoid staleness)
def complete_stop(self, stop_id: str):
"""Mark a stop as completed (driver has visited it)."""
del self._active_stops[stop_id]
self._deleted_stop_ids.add(stop_id)
self._finder.invalidate_stop(stop_id)
if len(self._deleted_stop_ids) >= self._rebuild_threshold:
self._rebuild_tree()
def insert_emergency_stop(self, new_stop: Stop) -> int:
"""Insert an emergency stop at the optimal position in the route.
Returns the sequence position where the stop was inserted."""
active = sorted(self._active_stops.values(), key=lambda s: s.sequence)
best_position = self._find_optimal_insertion(new_stop, active)
# Shift sequence numbers of all stops after the insertion point
for stop in active[best_position:]:
stop.sequence += 1
new_stop.sequence = best_position
self._active_stops[new_stop.stop_id] = new_stop
self._rebuild_tree() # Rebuild immediately after insertion
return best_position
def _find_optimal_insertion(self, new_stop: Stop, route: list[Stop]) -> int:
"""O(n) insertion heuristic: find the position minimising added route distance."""
if not route:
return 0
best_cost = float('inf')
best_pos = 0
for i in range(len(route) - 1):
# Cost = distance(route[i] → new_stop) + distance(new_stop → route[i+1])
# - distance(route[i] → route[i+1]) ← the segment we're replacing
added_cost = (
_dist(route[i], new_stop) +
_dist(new_stop, route[i + 1]) -
_dist(route[i], route[i + 1])
)
if added_cost < best_cost:
best_cost, best_pos = added_cost, i + 1
return best_pos
def _rebuild_tree(self):
self._deleted_stop_ids.clear()
        self._finder = RouteNearestStopFinder(list(self._active_stops.values()))

def _dist(a: Stop, b: Stop) -> float:
    """Euclidean distance in lat/lng space, sufficient for comparing insertion costs."""
    return ((a.lat - b.lat) ** 2 + (a.lng - b.lng) ** 2) ** 0.5

Complexity summary:
- Build k-d tree: O(n log n) — executed at most once per 5-minute cycle
- Nearest-stop query (cache miss): O(log n) average
- Nearest-stop query (cache hit): O(1)
- Complete stop (lazy deletion): O(m) scan of the memoisation cache (m = cached destinations) to invalidate entries referencing the completed stop
- Insert emergency stop (optimal position): O(n) for the insertion heuristic, O(n log n) for the tree rebuild
Early Warning Metrics:
- Cache hit rate (target above 25% given 30% duplicate destinations) — a hit rate below 20% after the first hour of operation indicates the cache key rounding is too aggressive or the coordinate data has more noise than expected
- K-d tree build time per rebuild cycle — the rebuild must complete in under 5 seconds for a 5,000-stop route; above 30 seconds indicates the active stop count has grown beyond the expected bound
- Nearest-neighbour query P99 latency — alert at 5ms; above 10ms violates the requirement; this is most commonly caused by a degenerate k-d tree (poorly balanced due to clustered geographic stops) that can be addressed with a better splitting strategy (median split vs. midpoint split)
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: The specific observation that for n=5,000 with SIMD vectorisation, a linear scan may outperform the k-d tree due to cache locality — and the recommendation to benchmark both before committing to the more complex implementation — demonstrates the algorithmic pragmatism that distinguishes an experienced engineer from one who reaches for the theoretically optimal data structure without considering constant factors. The lazy deletion strategy (mark deleted nodes, rebuild after 50 deletions) rather than attempting dynamic deletion from the k-d tree (which requires subtree rebuilding) is the practical engineering trade-off that produces a maintainable system. The cache invalidation design (invalidate_stop(stop_id) scanning the cache for all entries mapping to the changed stop) shows awareness of the consistency problem that arises when the underlying data changes.
What differentiates it from mid-level thinking: A mid-level engineer would implement a linear scan for all three requirements, not knowing about k-d trees for spatial nearest-neighbour queries. They would not address the concurrent access problem (the threading.RLock on the cache), would not design the memoisation cache invalidation on stop deletion, and would not know the O(n) insertion heuristic for finding the optimal route position for an emergency stop.
What would make it a 10/10: A 10/10 response would include a benchmark comparing the linear scan vs. k-d tree at n=100, n=1,000, n=5,000, and n=20,000 stops showing the crossover point, a thread-safe atomic tree swap implementation (using concurrent.futures or AtomicReference in Java to swap the old tree for the new one during the rebuild cycle without blocking queries), and an analysis of when to switch to Annoy for approximate nearest-neighbour at route sizes above 20,000.
Question 3: Distributed Systems — Designing for Consistency, Availability, and Fault Tolerance
Difficulty: Elite | Role: Software Engineer | Level: Senior / Staff | Company Examples: Amazon, Google, Netflix, Uber, Stripe
The Question
You are a Senior Software Engineer at a fintech company. The engineering team is designing a money transfer system. A user initiates a transfer of £500 from Account A (at Bank A) to Account B (at Bank B). The operation involves 3 steps: deduct £500 from Account A, credit £500 to Account B, and record the transfer in a transaction ledger. These 3 steps touch 3 different services (Bank A API, Bank B API, and the internal Ledger Service) — no shared database exists across all three. The business requirement is that the system must never result in money being debited from Account A without being credited to Account B, and vice versa — partial transfers (money lost or money created) are both unacceptable. However, the three services are independently operated, can fail at any time, and have no cross-service transaction capability. Design the distributed transaction protocol that guarantees the atomicity of the transfer — all 3 steps succeed or none of them take effect — while handling network failures, service crashes, and partial completions.
1. What Is This Question Testing?
- The distributed transaction problem — understanding that ACID transactions across multiple services with no shared database is a fundamental distributed systems problem; knowing that 2PC (Two-Phase Commit) is the classical protocol for distributed transactions and its limitations: it requires all participants to be available simultaneously and has a blocking failure mode (if the coordinator crashes after Phase 1 but before Phase 2, participants are blocked indefinitely holding their locks); and knowing that for a financial transfer involving external bank APIs (which cannot participate in 2PC), 2PC is not applicable
- The Saga pattern for distributed transactions — knowing that the Saga pattern replaces a distributed ACID transaction with a sequence of local transactions, each followed by a compensating transaction that undoes the local transaction if a subsequent step fails; knowing the two Saga variants: choreography (each service publishes events and subscribes to events from other services — no central coordinator) and orchestration (a central Saga orchestrator sends commands to each service and handles the compensating transactions on failure); knowing that orchestration is more appropriate for financial transactions because the audit trail is centralised and easier to reason about
- Idempotency keys — understanding that in a distributed system where messages can be duplicated (retries, at-least-once delivery), every operation must be idempotent: applying the same operation multiple times has the same effect as applying it once; for a money transfer, the debit of Account A must be idempotent — if the Bank A API call is retried (because the first call's response was lost), the second call must not debit the account a second time; this is implemented with an idempotency key (a UUID generated once per transfer attempt) that Bank A uses to deduplicate requests
- The Outbox Pattern — knowing that a service that writes to its own database and then publishes an event to a message queue has a consistency problem: if the database write succeeds but the message publish fails (or the process crashes between the two), the downstream services never receive the event; the Outbox pattern solves this by writing the event to an outbox table in the same local database transaction as the business data; a separate message relay process reads the outbox table and publishes events to the message queue, retrying until the publish succeeds — this guarantees at-least-once event delivery with no split-brain between the database and the event bus
- Exactly-once semantics vs. at-least-once + idempotency — knowing that exactly-once delivery is extremely difficult and expensive to achieve in distributed systems; the pragmatic alternative is at-least-once delivery combined with idempotent consumers — the consumer deduplicates incoming messages using the message ID and processes each unique message exactly once; this is the pattern used by every major payment system (a minimal consumer sketch follows this list)
- Failure scenario handling — being able to reason through specific failure scenarios: what happens if Bank A crashes after deducting the money but before the orchestrator is notified? (The orchestrator must retry the Bank A status check — idempotency ensures the deduction is not duplicated), what happens if Bank B is down when the credit is attempted? (The orchestrator retries with exponential backoff for a configurable period — if Bank B does not recover within the timeout, the compensating transaction reverses the Bank A debit), what happens if the compensating transaction itself fails? (The failed compensation is recorded and a human escalation alert is sent — this is the "pending manual resolution" state that every payment system must handle)
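A minimal sketch of the at-least-once + idempotent-consumer pattern from the bullets above; the in-memory set stands in for what would be a durable, unique-keyed store in production, and all names are illustrative:

class IdempotentConsumer:
    def __init__(self, handler):
        self._handler = handler
        self._seen_ids: set[str] = set()  # production: durable table with a unique constraint

    def on_message(self, message_id: str, payload: dict) -> None:
        if message_id in self._seen_ids:
            return                    # duplicate delivery: already processed
        self._handler(payload)        # each unique message is processed once
        self._seen_ids.add(message_id)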
2. Framework: Distributed Financial Transaction Design Model (DFTDM)
- Assumption Documentation — Establish the consistency requirements: can the system tolerate a brief window where money is "in flight" (deducted from Account A but not yet credited to Account B)? In practice, all payment systems have this window; the requirement is that the window resolves correctly — not that it never exists; distinguish between atomicity (the transfer either fully completes or fully reverses) and consistency (at any point in time, the total money in the system is conserved)
- Constraint Analysis — Bank A and Bank B are external services with their own APIs; they cannot be made to participate in a 2PC protocol; the Ledger Service is internal and can be designed to support the required transactional semantics; the design must handle Bank A's API being idempotent (requires idempotency key support) vs. non-idempotent (requires a deduplication layer)
- Tradeoff Evaluation — Saga Choreography (decentralised, event-driven, no single point of failure) vs. Saga Orchestration (centralised coordinator, easier to reason about state, single point of failure mitigated by replication); for a financial transfer where auditability and state visibility are critical, orchestration is preferred — the transfer orchestrator is a replicated, stateful service that can be queried for the current state of any transfer
- Hidden Cost Identification — The "double debit" scenario: if the Bank A API call succeeds, the process crashes before writing the success to the orchestrator's database, and the orchestrator retries — without idempotency keys, Bank A debits the account twice; idempotency keys must be generated once (at transfer initiation) and stored durably before the first Bank A API call
- Risk Signals / Early Warning Metrics — Saga completion rate (what percentage of initiated transfers complete successfully without any compensating transactions?), compensating transaction success rate (when a transfer must be reversed, does the reversal succeed? — a compensating transaction failure rate above 0.01% requires immediate engineering attention), transfers in "pending manual resolution" state (the count of transfers stuck in an unresolvable state requiring human intervention — this should be zero under normal conditions)
- Pivot Triggers — If the Bank B API has a planned downtime window that causes the transfer success rate to drop below 95%: implement a "deferred credit" pattern where the Bank A debit and the Ledger Service recording are committed immediately, and the Bank B credit is queued for retry when Bank B returns; this requires the Ledger to show the transfer as "credit pending" rather than "complete" until Bank B confirms receipt
- Long-Term Evolution Plan — Phase 1: Saga orchestration with manual compensation for unresolvable failures; Phase 2: automated compensation with exponential backoff (24-hour retry window before escalating to manual); Phase 3: ISO 20022 message format support for interoperability with the SWIFT network; Phase 4: idempotency key management service shared across all payment operations
3. The Answer
The Transfer State Machine
The transfer orchestrator manages a transfer through the following states: INITIATED → DEBIT_PENDING → DEBIT_CONFIRMED → CREDIT_PENDING → CREDIT_CONFIRMED → LEDGER_PENDING → COMPLETED. Compensation states: CREDIT_FAILED → REVERSAL_PENDING → REVERSED (or REVERSAL_FAILED → MANUAL_RESOLUTION_REQUIRED). Every state transition is persisted to the orchestrator's database before the next action is taken.
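A sketch of that state machine as a transition table, using the states listed above; an orchestrator can validate every transition against the map before persisting it, so an illegal jump (e.g. INITIATED straight to COMPLETED) fails loudly:

VALID_TRANSITIONS = {
    "INITIATED":        {"DEBIT_PENDING"},
    "DEBIT_PENDING":    {"DEBIT_CONFIRMED", "FAILED"},        # FAILED: no money moved
    "DEBIT_CONFIRMED":  {"CREDIT_PENDING"},
    "CREDIT_PENDING":   {"CREDIT_CONFIRMED", "CREDIT_FAILED"},
    "CREDIT_CONFIRMED": {"LEDGER_PENDING"},
    "LEDGER_PENDING":   {"COMPLETED"},
    "CREDIT_FAILED":    {"REVERSAL_PENDING"},                  # compensation path
    "REVERSAL_PENDING": {"REVERSED", "REVERSAL_FAILED"},
    "REVERSAL_FAILED":  {"MANUAL_RESOLUTION_REQUIRED"},
}

def transition(current: str, new: str) -> str:
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transfer transition: {current} -> {new}")
    return new  # persist to the orchestrator DB before taking the next action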
The Idempotency Key Architecture
At transfer initiation, a transfer record is created:
INSERT INTO transfers (
transfer_id, -- UUID, the Saga correlation ID
idempotency_key_debit, -- UUID, used as Bank A's request idempotency key
idempotency_key_credit, -- UUID, used as Bank B's request idempotency key
amount,
source_account,
destination_account,
state,
created_at
) VALUES (gen_random_uuid(), gen_random_uuid(), gen_random_uuid(),
500.00, 'ACC_A', 'ACC_B', 'INITIATED', now())

Critically: the idempotency_key_debit and idempotency_key_credit are generated and committed to the database before any external API call is made. If the orchestrator crashes and restarts, it reads the transfer records whose current step is not yet confirmed and retries using the same idempotency keys — Bank A and Bank B will recognise the key and return the result of the original operation rather than executing it again.
The Saga Orchestration Flow
Step 1 — Debit Account A: the orchestrator calls Bank A's API: POST /v1/debits { amount: 500, account: ACC_A, idempotency_key: <idempotency_key_debit> }. Bank A returns { debit_id: "DEB_123", status: "confirmed" }. The orchestrator updates the transfer state to DEBIT_CONFIRMED and saves debit_id. If Bank A returns an error: the orchestrator retries with exponential backoff up to 3 times (using the same idempotency_key_debit each time — idempotent retry). If all retries fail: the transfer transitions to FAILED (no money has moved — no compensation needed).
Step 2 — Credit Account B: the orchestrator calls Bank B's API: POST /v1/credits { amount: 500, account: ACC_B, idempotency_key: <idempotency_key_credit> }. Bank B returns { credit_id: "CRE_456", status: "confirmed" }. The orchestrator updates the transfer state to CREDIT_CONFIRMED. If Bank B is unavailable: the orchestrator retries for up to 24 hours with exponential backoff (the debit is now committed — the credit must eventually succeed or be reversed). If Bank B confirms it cannot credit the account (insufficient destination account details, account closed): immediately transition to CREDIT_FAILED and execute the compensating transaction.
Compensating Transaction (if credit fails): call Bank A's API: POST /v1/reversals { debit_id: "DEB_123", idempotency_key: <reversal_key> }. Bank A reverses the debit and credits Account A. The transfer transitions to REVERSED. If the reversal also fails (Bank A is now unavailable): the transfer transitions to REVERSAL_FAILED → MANUAL_RESOLUTION_REQUIRED; a PagerDuty alert is sent to the payments operations team with the full transfer state.
Step 3 — Record in Ledger: after both the debit and credit are confirmed, the orchestrator calls the internal Ledger Service: POST /v1/entries { transfer_id, debit_account, credit_account, amount, debit_confirmation, credit_confirmation }. The Ledger Service records the double-entry bookkeeping entry. If the Ledger Service is unavailable: retry indefinitely (the money has already moved correctly — the ledger entry is the record of what happened and cannot be omitted); the transfer is marked COMPLETED only after the ledger confirms.
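A condensed sketch of the failure handling in Steps 2 and 3, with stubbed bank clients (bank_a, bank_b and their methods are illustrative, as are the result fields): retry the credit with capped exponential backoff, then fall back to the compensating reversal described above.

import time

def credit_or_compensate(transfer, bank_a, bank_b, max_wait_s=24 * 3600):
    """Retry the credit for up to 24 hours; on permanent failure, reverse the debit."""
    delay, deadline = 1.0, time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        result = bank_b.credit(transfer.amount, transfer.destination_account,
                               idempotency_key=transfer.idempotency_key_credit)
        if result.status == "confirmed":
            return "CREDIT_CONFIRMED"
        if result.permanent:              # e.g. account closed: retrying is pointless
            break
        time.sleep(delay)
        delay = min(delay * 2, 300.0)     # exponential backoff, capped at 5 minutes
    # The debit is committed but the credit cannot succeed: reverse the debit
    reversal = bank_a.reverse(transfer.debit_id,
                              idempotency_key=transfer.reversal_key)
    return "REVERSED" if reversal.status == "confirmed" else "MANUAL_RESOLUTION_REQUIRED"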
The Outbox Pattern for the Orchestrator's Events
The orchestrator must publish events (TransferInitiated, TransferDebited, TransferCompleted, TransferFailed) to a message bus for downstream consumers (notifications service, analytics, fraud detection). To avoid the split-brain between the database state transition and the event publication:
-- This INSERT happens in the SAME database transaction as the state update
BEGIN;
UPDATE transfers SET state = 'DEBIT_CONFIRMED', debit_id = 'DEB_123'
WHERE transfer_id = '<id>';
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('TransferDebited', '{"transfer_id": "<id>", "amount": 500}', now());
COMMIT;
-- A separate relay process reads the outbox table and publishes to the event bus

This guarantees that if the process crashes after the state update commits, the outbox entry is also committed — the relay will publish the event on the next run. If the event bus is unavailable, the relay retries; the outbox entry stays until the publish succeeds.
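A sketch of that relay process, with a generic DB-API-style connection and a stubbed bus client (both assumptions): the row is deleted only after a successful publish, so a crash at any point leaves the entry for the next run, giving at-least-once delivery.

def relay_outbox_once(conn, bus, batch_size=100):
    cur = conn.cursor()
    cur.execute("SELECT id, event_type, payload FROM outbox ORDER BY created_at LIMIT %s",
                (batch_size,))
    for event_id, event_type, payload in cur.fetchall():
        bus.publish(event_type, payload)  # may raise: the row then survives for retry
        cur.execute("DELETE FROM outbox WHERE id = %s", (event_id,))
        conn.commit()                     # delete committed only after the publish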
Handling the "Zombie Transfer" — Orchestrator Crash Recovery
If the orchestrator process crashes while waiting for Bank B's response (the transfer is in CREDIT_PENDING state), recovery is straightforward: on startup, the orchestrator queries for all transfers in non-terminal states (DEBIT_CONFIRMED, CREDIT_PENDING, CREDIT_CONFIRMED) and resumes them from their last confirmed state — retrying the pending step with the stored idempotency key. This is the "transactional outbox + at-least-once retry" pattern that makes the orchestrator crash-resilient without requiring distributed locking.
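A sketch of that startup recovery, against the transfers table from the idempotency-key example (resume_step is an illustrative stub that re-runs the pending step with the stored keys):

NON_TERMINAL = ("DEBIT_CONFIRMED", "CREDIT_PENDING", "CREDIT_CONFIRMED", "LEDGER_PENDING")

def recover_in_flight_transfers(conn, resume_step):
    cur = conn.cursor()
    cur.execute("SELECT * FROM transfers WHERE state IN (%s, %s, %s, %s)", NON_TERMINAL)
    for transfer in cur.fetchall():
        # Idempotency keys were stored at initiation, so resuming cannot double-debit
        resume_step(transfer)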
Early Warning Metrics:
- Transfer completion rate (successful completions / initiations) — target above 99.5%; below 99% indicates a systemic issue with one of the external bank APIs
- Compensating transaction success rate — any failed compensation is a critical incident; monitor in real time; below 100% triggers an immediate incident response
- MANUAL_RESOLUTION_REQUIRED transfer count — should be zero under normal conditions; any non-zero value triggers a human review within 4 hours to prevent money from being "lost in flight" for an extended period
4. Interview Score: 10 / 10
Why this demonstrates staff-level maturity: The idempotency key generation before the first API call — and the specific explanation that the keys must be durably stored before any external call so that crash-and-retry uses the same keys — is the distributed systems engineering discipline that prevents the double-debit bug that breaks naive implementations. The Outbox pattern integration (writing the event to the outbox table in the same database transaction as the state update) addresses the "dual write" consistency problem without requiring a distributed transaction between the database and the event bus. The "Zombie Transfer" recovery procedure — querying non-terminal states on startup and resuming with stored idempotency keys — is the specific operational procedure that makes the orchestrator crash-resilient, which most engineers miss because it requires thinking about recovery, not just the happy path.
What differentiates it from mid-level thinking: A mid-level engineer would propose a database transaction spanning all three services (not possible with external APIs), or would propose 2PC (which requires all participants to be available simultaneously and creates blocking failures). They would not know about the Saga pattern, would not design the idempotency key architecture, would not address the zombie transfer recovery, and would not know about the Outbox pattern for consistent event publishing.
What would make it a perfect implementation: This response scores 10/10 on the dimensions tested. The theoretical extension would be a formal state machine diagram showing all states and transitions including the compensation paths, a concrete exponential backoff configuration (initial delay 1 second, multiplier 2, max delay 300 seconds, retried until the 24-hour window is exhausted, which is roughly 290 attempts at the capped delay), and a database schema for the orchestrator's transfers table with the right indexes for the recovery query (index on state for finding non-terminal transfers).
Question 4: Code Quality and Refactoring — Identifying and Resolving Design Smells in a Production Codebase
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Google, ThoughtWorks, JetBrains, Stripe, Basecamp
The Question
You are a Senior Software Engineer joining a team that has been maintaining a 6-year-old e-commerce order management system written in Java. During a code review of the checkout service, you encounter the following 300-line OrderService class that handles order creation. The class: checks inventory availability by calling the InventoryService directly, calculates the total price including discounts by implementing pricing logic inline, sends confirmation emails by instantiating EmailClient directly, writes audit logs by calling a static logger, processes payment by calling PaymentGateway with hardcoded timeout values, updates inventory after payment with a direct InventoryService call, and publishes an OrderCreatedEvent to a message broker with a direct KafkaProducer instantiation inside the method. The class has no unit tests because "you can't mock the static logger or the hardcoded Kafka producer." Walk through the design smells you identify, the refactoring strategy, and how you would transform this class into a testable, maintainable design without breaking the existing behaviour.
1. What Is This Question Testing?
- SOLID principles in practice — identifying the violations: Single Responsibility Principle (the OrderService does 7 things — checking inventory, pricing, emailing, logging, payment, inventory update, and event publishing — it should do one: coordinate an order creation workflow); Open/Closed Principle (pricing logic inline means adding a new discount type requires modifying the class); Dependency Inversion Principle (the class depends on concrete implementations, not abstractions — EmailClient and KafkaProducer are instantiated directly instead of being injected as interfaces)
- Dependency injection and testability — understanding the specific connection between DI and testability: code that instantiates its own dependencies (new EmailClient(), new KafkaProducer()) is untestable because the test cannot substitute a fake; code that receives dependencies via constructor injection (or interface injection) is testable because the test provides a fake implementation of the interface; knowing the rule: a class that can only be tested by sending a real email or publishing to a real Kafka topic is a class that violates DI
- The Strangler Fig refactoring pattern for legacy code — knowing that refactoring a class that has no tests must be done with extreme caution (every change risks breaking behaviour); the safe approach: write characterisation tests first (tests that document the current behaviour of the class, whatever it may be), then refactor incrementally with each step verified by the characterisation tests; "make the change easy, then make the easy change" (Kent Beck)
- Extract Interface for testability — knowing the specific refactoring step: for each direct dependency (EmailClient, KafkaProducer, InventoryService), extract an interface (EmailSender, EventPublisher, InventoryRepository) that expresses the contract; create the production implementation; inject the interface via the constructor; now the class can be tested with a fake implementation of each interface
- The Tell Don't Ask principle — recognising that the inline pricing logic violates encapsulation: OrderService is "asking" the Order object for its items and computing the price itself, rather than "telling" a PricingService to calculate the price; the fix: extract a PricingService with a calculateOrderTotal(order, appliedDiscounts) method
- God class symptoms — knowing the code smells that indicate a God class (a class that knows too much): more than 200 lines, more than 10 dependencies, methods that contain "and" in their name (createOrderAndSendEmail), more than 3 levels of nesting, and a class that is changed for multiple different reasons (adding a new payment method, adding a new discount type, changing the email template — all require modifying the same class)
2. Framework: Legacy Code Refactoring Model (LCRM)
- Assumption Documentation — Before refactoring any line, confirm with the team: what is the expected behaviour of createOrder()? Are there integration tests (even if not unit tests) that exercise this method? Understanding the observable contract of the method before refactoring prevents introducing regressions that the team may not discover for weeks
- Constraint Analysis — The class has no unit tests; every refactoring step must be small enough to be verified manually or with the existing integration tests; large refactoring steps (rewrites) without test coverage are high-risk; the team may be running manual smoke tests — coordinate with QA before deploying each refactoring step
- Tradeoff Evaluation — Incremental refactoring (extract one dependency at a time, deploy, verify) vs. full redesign then replace (do all the refactoring in one branch, deploy all at once); incremental is safer but takes longer; for a production system with no unit tests, incremental is always correct
- Hidden Cost Identification — The static logger is the least visible dependency but the most insidious: it makes the logging behaviour impossible to verify in tests (a unit test cannot assert "an error was logged" without a log capture library); replacing the static logger with an injected AuditLogger interface is the change that has the largest impact on testability despite appearing trivial
- Risk Signals / Early Warning Metrics — Cyclomatic complexity of the refactored class (target below 10 — a complexity above 10 means the class has too many conditional branches for one class), test coverage after refactoring (target above 80% for all new classes extracted from the refactoring), lines of code in the new OrderCoordinator (target under 80 lines — if the coordinator is still 200 lines, the extraction did not sufficiently decompose the responsibilities)
- Pivot Triggers — If a characterisation test reveals that the current implementation has an undocumented behaviour (e.g., it silently ignores payment gateway timeouts and marks the order as pending rather than failed): document the undocumented behaviour as a known issue, do not fix it during the refactoring (changing behaviour and structure simultaneously doubles the risk of regression)
- Long-Term Evolution Plan — Refactoring sprint: extract dependencies and write unit tests; second sprint: extract domain logic (pricing, inventory) into their own services; third sprint: migrate the event publishing to a proper event sourcing pattern; fourth sprint: remove the God class entirely by decomposing into feature-oriented services
3. The Answer
Step 1: Write Characterisation Tests Before Touching Any Code
Before the first line is changed, write a characterisation test that calls createOrder() through the entire stack with a real database (or a test database) and real services where unavoidable. The test documents: what does the method return for a valid order? What does it return when the item is out of stock? What happens to the database state after a payment failure? These tests run slowly and are not proper unit tests, but they are the safety net for the refactoring. Sample characterisation test:
@Test
public void createOrder_withValidItems_persistsOrderAndReturnsId() {
// Arrange — uses real implementations, real test database
Order order = new Order(userId, List.of(new OrderItem("ITEM_001", 2)));
// Act
OrderConfirmation result = orderService.createOrder(order);
// Assert observable behaviour — don't assert internals
assertNotNull(result.getOrderId());
assertEquals(OrderStatus.CONFIRMED, result.getStatus());
assertTrue(emailClient.wasSent()); // Real EmailClient with a capture mode
}

Step 2: Extract Interfaces for Every Direct Dependency
For each of the 7 direct dependencies in the OrderService:
// BEFORE (untestable — direct instantiation)
EmailClient emailClient = new EmailClient("smtp.company.com", 587);
emailClient.send(order.getUserEmail(), "Order confirmed", buildEmailBody(order));
// AFTER (testable — interface injection)
public interface OrderNotifier {
void notifyOrderConfirmed(Order order);
}
// OrderService receives the interface via constructor
public class OrderService {
private final InventoryRepository inventory;
private final PricingService pricing;
private final OrderNotifier notifier;
private final PaymentGateway payment;
private final EventPublisher events;
private final AuditLogger audit;
private final OrderRepository orders;
@Inject
public OrderService(InventoryRepository inventory, PricingService pricing,
OrderNotifier notifier, PaymentGateway payment,
EventPublisher events, AuditLogger audit, OrderRepository orders) {
this.inventory = inventory; // etc.
}
}

The production SmtpOrderNotifier implements OrderNotifier and sends the real email. The test FakeOrderNotifier implements OrderNotifier and records calls for assertion. The key refactoring: moving from static final Logger LOG = LoggerFactory.getLogger(OrderService.class) (an untestable static) to private final AuditLogger audit (an injected interface that a test can verify calls on).
Step 3: Extract the Pricing Logic into PricingService
// BEFORE (pricing logic embedded in OrderService — violates SRP)
double total = 0;
for (OrderItem item : order.getItems()) {
double itemPrice = item.getQuantity() * item.getUnitPrice();
if (order.isPromoCodeApplied()) {
itemPrice *= (1 - PROMO_DISCOUNT_RATE); // Hardcoded discount rate
}
total += itemPrice;
}
if (total > 100) total *= (1 - BULK_DISCOUNT_RATE); // Another hardcoded rate
// AFTER (delegated to PricingService)
public interface PricingService {
OrderPrice calculateTotal(Order order);
}
// In OrderService.createOrder():
OrderPrice price = pricing.calculateTotal(order);
// PricingService is independently testable — all discount logic in one place

Step 4: The Refactored OrderCoordinator
After extracting all responsibilities, the OrderService becomes a thin coordinator:
public class OrderCoordinator {
// All 7 dependencies injected via constructor (see above)
public OrderConfirmation createOrder(Order order) {
// Step 1: Validate inventory
if (!inventory.isAvailable(order.getItems())) {
return OrderConfirmation.failed("INSUFFICIENT_INVENTORY");
}
// Step 2: Calculate price
OrderPrice price = pricing.calculateTotal(order);
// Step 3: Process payment
PaymentResult payment = paymentGateway.charge(
order.getPaymentMethod(),
price.getTotalAmount(),
order.getIdempotencyKey()
);
if (!payment.isSuccessful()) {
audit.log("payment_failed", order.getOrderId(), payment.getError());
return OrderConfirmation.failed("PAYMENT_DECLINED");
}
// Step 4: Persist the order
Order savedOrder = orders.save(order.withPaymentConfirmation(payment));
// Step 5: Update inventory
inventory.reserve(savedOrder.getItems());
// Step 6: Notify and publish
notifier.notifyOrderConfirmed(savedOrder);
events.publish(new OrderCreatedEvent(savedOrder));
audit.log("order_created", savedOrder.getOrderId());
return OrderConfirmation.success(savedOrder.getOrderId());
}
}
The refactored class is 50 lines (down from 300), has 7 injected dependencies (replacing the 7 hardcoded instantiations: the same count, but now swappable in tests), and has a cyclomatic complexity of 3 (two conditional branches). It can be fully unit-tested with fake implementations of each interface.
Writing the First Meaningful Unit Test
@Test
public void createOrder_withPaymentFailure_returnsFailedConfirmation() {
// Arrange — all fakes, no real services, no network calls, under 5ms
FakeInventoryRepository inventory = new FakeInventoryRepository(/* isAvailable = */ true);
FakePricingService pricing = new FakePricingService(/* total = */ 50.00);
FakePaymentGateway payment = new FakePaymentGateway(/* shouldSucceed = */ false);
FakeOrderNotifier notifier = new FakeOrderNotifier();
FakeEventPublisher events = new FakeEventPublisher();
FakeAuditLogger audit = new FakeAuditLogger();
FakeOrderRepository orders = new FakeOrderRepository();
OrderCoordinator coordinator = new OrderCoordinator(
inventory, pricing, notifier, payment, events, audit, orders
);
Order order = OrderTestBuilder.anOrder().build();
// Act
OrderConfirmation result = coordinator.createOrder(order);
// Assert
assertEquals("PAYMENT_DECLINED", result.getFailureReason());
assertFalse(notifier.wasNotified()); // No email sent on payment failure
assertFalse(events.wasPublished(OrderCreatedEvent.class)); // No event on failure
assertTrue(audit.wasLogged("payment_failed")); // Audit log is recorded
}
This test runs in under 5ms with no network calls, no database, and no email sending — the first real unit test the codebase has ever had.
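The OrderTestBuilder used in the Arrange step could take the following minimal shape, assuming the two-argument Order constructor from the characterisation test (the defaults are illustrative; each test overrides only the fields it cares about):
import java.util.List;

public class OrderTestBuilder {
    // Sensible defaults produce a valid order out of the box
    private String userId = "user-1";
    private List<OrderItem> items = List.of(new OrderItem("ITEM_001", 1));

    public static OrderTestBuilder anOrder() {
        return new OrderTestBuilder();
    }

    public OrderTestBuilder withUserId(String userId) {
        this.userId = userId;
        return this;
    }

    public OrderTestBuilder withItems(OrderItem... items) {
        this.items = List.of(items);
        return this;
    }

    public Order build() {
        return new Order(userId, items);
    }
}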
Early Warning Metrics:
- Lines of code per class (enforce a maximum of 200 lines — a class above 200 lines is a candidate for extraction; above 300 lines is a God class that must be decomposed)
- Number of injected dependencies per class (enforce a maximum of 7 — above 7 dependencies indicates the class is doing too many things; a class with 12 dependencies is a God class that has not yet been refactored)
- Cyclomatic complexity per method (enforce a maximum of 10 — a method with complexity above 10 is too complex to test exhaustively and too complex to understand; split into smaller methods)
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: Writing characterisation tests before the first line of refactoring — and the specific explanation that these tests document the current behaviour to protect against accidental regressions during the refactoring — is the disciplined approach that distinguishes engineers who have safely refactored large legacy codebases from those who "refactor" by rewriting and hoping the behaviour is preserved. The FakeAuditLogger fix for the static logger — identifying the static logger as the most insidious untestable dependency (not the Kafka producer, which is more obvious) because it prevents verifying that specific audit events were recorded — is the diagnostic sophistication that comes from having debugged why tests cannot verify logging behaviour.
What differentiates it from mid-level thinking: A mid-level engineer would know about interfaces and dependency injection but would not start with characterisation tests (risking regression during refactoring), would not identify the static logger as an untestable dependency, would not calculate the cyclomatic complexity target for the refactored class, and would not write the complete unit test showing all the assertions including "no email sent on payment failure."
What would make it a 10/10: A 10/10 response would include the OrderTestBuilder builder pattern implementation (showing the test data builder that creates valid test orders with sensible defaults, overridable per test), a ArchUnit rule configuration that enforces the maximum cyclomatic complexity and lines-of-code constraints in CI, and a migration guide for updating the dependency injection framework (Spring, Guice) configuration to wire the new interfaces to their production implementations.
Question 5: Observability — Designing Logging, Metrics, and Distributed Tracing for a Microservices System
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Netflix, Uber, Datadog, Honeycomb, Grafana Labs
The Question
You are a Senior Software Engineer at a travel booking company. The platform has grown from a monolith to 22 microservices over 3 years. The team is experiencing a painful on-call situation: when a user reports "my booking failed," the on-call engineer must manually grep through logs on 5 different Kubernetes pods across 3 services, correlating log lines by timestamp and guessing at the request flow — an investigation that takes 45 minutes on average. Two recent incidents were caused by: a slow database query in the PricingService that caused timeouts in the BookingService which triggered retries in the SearchService — none of the three services logged the correlation — and a third-party hotel API returning 3% HTTP 500 errors that were being silently swallowed by the HotelInventoryService's catch block (the same println() anti-pattern from before). Design the observability stack that reduces the mean time to investigation (MTTI) from 45 minutes to under 5 minutes.
1. What Is This Question Testing?
- The three pillars of observability — understanding logs (discrete events with context — "payment failed for booking 123"), metrics (numeric measurements over time — "booking success rate = 97.3%"), and traces (a record of a request's path through all services — "booking request touched SearchService → PricingService → BookingService → PaymentService, total duration 3.2 seconds, PricingService took 2.8 seconds"); knowing that logs and metrics alone do not solve the microservices correlation problem — only distributed tracing connects a user's request across all 22 services into a single trace
- Distributed tracing with OpenTelemetry — knowing the OpenTelemetry standard: a vendor-neutral set of APIs, SDKs, and tools for generating, collecting, and exporting traces, metrics, and logs; knowing the trace model: a trace is a collection of spans, each span represents one unit of work (one service call, one database query, one outbound HTTP request); the trace_id is generated at the entry point (the API gateway) and propagated to all downstream services via the W3C Trace Context header (traceparent); every service creates a span with a parent_span_id linking it to the calling service's span
- Structured logging — knowing that printf-style log messages ("Error processing booking for user " + userId) are unsearchable (you cannot filter by userId unless you grep with a regex); structured logging (JSON log objects with defined fields: {"level": "ERROR", "service": "booking", "trace_id": "abc123", "user_id": "u456", "booking_id": "b789", "error": "PricingService timeout"}) allows filtering by any field in a log aggregation system (Elasticsearch, Loki, CloudWatch Logs Insights); the trace_id field in every log line is the bridge between logs and traces
- The RED method for service metrics — knowing the three metrics that every microservice must expose: Rate (requests per second), Errors (error rate per second), and Duration (latency distribution — p50, p95, p99); knowing the USE method for infrastructure metrics: Utilisation, Saturation, Errors (for CPU, memory, network, disk); the RED metrics are the starting point for any incident investigation — "is the booking success rate lower than normal?" is answered by the error rate metric in under 10 seconds
- The "silent swallowing" anti-pattern — knowing that catch blocks that log to
println()or use a non-monitored logger are one of the most common causes of production incidents that are invisible in monitoring; the fix: every caught exception must either be propagated (re-thrown) or recorded as an error metric increment + structured error log with stack trace + span error event in the trace; "catch and continue" is only acceptable for genuinely non-critical operations where degraded behaviour is explicitly designed and documented
- Alerting on SLOs not symptoms — knowing the difference between alerting on symptoms ("error count > 10 in 5 minutes") vs. alerting on SLOs ("booking success rate < 99.5% over 5 minutes"); symptom-based alerts cause alert fatigue (too many alerts for unimportant issues) and miss gradual degradations (a success rate that slowly drops from 99.8% to 97% over 2 hours without ever triggering the "error count > 10" alert in any 5-minute window)
2. Framework: Observability Stack Design Model (OSDM)
- Assumption Documentation — Establish the current baseline: what logging infrastructure exists? (Kubernetes pod logs aggregated to Elasticsearch via Fluentd, but without structured logging or trace IDs); what metrics exist? (Basic infrastructure metrics in Prometheus — CPU, memory — but no application-level RED metrics); what is the current p99 API response time? (Unknown — no percentile latency metrics)
- Constraint Analysis — 22 microservices means instrumentation must be applied consistently across all services without maintaining 22 separate configurations; OpenTelemetry's auto-instrumentation (Java agent, Python auto-instrumentation) handles HTTP and database spans automatically without code changes for most frameworks; manual instrumentation is required for custom business logic spans
- Tradeoff Evaluation — Full trace sampling (capture every trace — complete data, high storage cost: at 100 RPS, a 22-service trace of 500 bytes × 22 spans = 11KB per request = 1.1MB/second = 94GB/day) vs. probabilistic sampling (capture 1% of traces — low storage, misses rare failures), vs. tail-based sampling (capture all traces but only write to storage for traces with errors or above a latency threshold — the ideal approach, but requires a trace collector that buffers spans until the trace is complete)
- Hidden Cost Identification — OpenTelemetry trace propagation through asynchronous message queues (Kafka) requires injecting the trace context into the message headers; if any service in the chain publishes to Kafka without injecting the traceparent header, the trace is broken (the consumer starts a new, disconnected trace); all Kafka producer/consumer code must be updated to use the OpenTelemetry Kafka instrumentation library
- Risk Signals / Early Warning Metrics — Trace completeness rate (what percentage of traces that enter the system have spans from all expected services? — a trace completeness below 95% indicates missing instrumentation in one or more services), p99 trace storage volume per service (identify which services are generating the most trace data — candidates for sampling rate adjustment)
- Pivot Triggers — If the trace storage cost becomes unsustainable at full-traffic sampling: implement Honeycomb's dynamic sampling (increase sampling rate for traces with errors, decrease for healthy traces) or switch to OpenTelemetry Collector's tail-based sampling processor — which retains all error traces while sampling healthy traces at 1% (a minimal configuration sketch follows this list)
- Long-Term Evolution Plan — Week 1–2: structured logging + trace ID in all logs; Week 3–4: OpenTelemetry auto-instrumentation deployed to all 22 services; Week 5–6: RED metrics dashboards + SLO alerting; Week 7–8: custom business logic spans for the critical booking flow; Month 3: tail-based sampling to control storage costs; Month 6: exemplars (linking from a metric data point to the trace that caused the anomaly)
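As a concrete reference for that pivot, a minimal sketch of the Collector's tail-based sampling configuration (the tail_sampling processor ships in the OpenTelemetry Collector contrib distribution; policy names are illustrative):
processors:
  tail_sampling:
    decision_wait: 10s   # buffer spans until the trace is complete before deciding
    policies:
      - name: keep-all-error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-healthy-traces-at-1-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 1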
3. The Answer
The Incident Correlation Problem: Why Trace IDs Solve It
The current 45-minute investigation involves manually correlating log lines across 5 pods by timestamp. The core problem: without a shared trace_id in every log line, there is no way to filter "all log lines related to booking request X" without human pattern matching. The solution: every incoming HTTP request gets a trace_id (UUID) at the API gateway. Every log line, metric data point, and database query during that request's lifetime carries that trace_id. Finding the root cause of "my booking failed" becomes: user provides the booking ID → query the booking database for the trace_id → open the trace in Jaeger/Tempo → see the full 22-service request waterfall with per-span timing → the slow span is immediately visible.
The Structured Logging Standard
Replace all logger.info("Processed booking for " + bookingId) with:
logger.info("booking_processed",
kv("booking_id", bookingId),
kv("user_id", userId),
kv("duration_ms", durationMs),
kv("total_amount", totalAmount),
kv("trace_id", Span.current().getSpanContext().getTraceId()),
kv("span_id", Span.current().getSpanContext().getSpanId())
);
The output in Elasticsearch: {"timestamp": "2024-01-15T10:23:45Z", "level": "INFO", "service": "booking-service", "message": "booking_processed", "booking_id": "b789", "user_id": "u456", "duration_ms": 3200, "trace_id": "abc123def456", "span_id": "789abc"}. This log line is filterable by any field. The trace_id links it directly to the distributed trace in Jaeger.
OpenTelemetry Instrumentation for All 22 Services
Deploy the OpenTelemetry Java Agent to all JVM services via a Kubernetes mutating webhook (automatically adds the agent JAR to any pod with the annotation instrumentation.opentelemetry.io/inject-java: "true"). The Java agent auto-instruments: all inbound HTTP requests (creates root span), all outbound HTTP requests (creates child span), all JDBC queries (creates database span with the SQL query as the span attribute), and Kafka producer/consumer operations (propagates trace context through message headers). No code changes required for the 80% coverage from auto-instrumentation. Add manual instrumentation for the business-critical booking flow steps:
// Manual span for the PricingService's price calculation — the slow step
Span pricingSpan = tracer.spanBuilder("calculate_booking_price")
.setAttribute("hotel_id", hotelId)
.setAttribute("check_in", checkInDate.toString())
.setAttribute("num_nights", numNights)
.startSpan();
try (Scope scope = pricingSpan.makeCurrent()) {
Price price = pricingEngine.calculate(hotelId, checkInDate, numNights);
pricingSpan.setAttribute("calculated_price", price.getAmount());
return price;
} catch (Exception e) {
pricingSpan.setStatus(StatusCode.ERROR, e.getMessage());
pricingSpan.recordException(e); // Exception appears in the trace UI
throw e;
} finally {
pricingSpan.end();
}
Fixing the Silent Swallowing Anti-Pattern
The HotelInventoryService catch block:
// BEFORE (broken — the 3% error rate is invisible)
try {
return hotelApi.checkAvailability(hotelId, dates);
} catch (HotelApiException e) {
System.out.println("Hotel API error: " + e.getMessage()); // ← the crime
return AvailabilityResult.unknown();
}
// AFTER (correct — error is visible in metrics, logs, and traces)
try {
return hotelApi.checkAvailability(hotelId, dates);
} catch (HotelApiException e) {
// 1. Increment the error metric (visible in Prometheus/Grafana immediately)
metrics.counter("hotel_api_errors_total",
Tags.of("hotel_id", hotelId, "error_type", e.getErrorCode())).increment();
// 2. Structured error log (searchable in Elasticsearch with trace correlation)
logger.error("hotel_api_availability_failed",
kv("hotel_id", hotelId),
kv("error_code", e.getErrorCode()),
kv("trace_id", Span.current().getSpanContext().getTraceId()),
e // Exception with stack trace
);
// 3. Record the error on the current span (visible as red span in Jaeger)
Span.current().setStatus(StatusCode.ERROR, e.getMessage());
Span.current().recordException(e);
return AvailabilityResult.unknown();
}
With this fix: the 3% Hotel API error rate immediately appears as a hotel_api_errors_total metric spike in Grafana, an alert fires within 2 minutes, and the on-call engineer can filter the Elasticsearch logs by error_code to see exactly which hotel API endpoints are failing.
The RED Metrics Dashboard
Every service exposes three metrics via the OpenTelemetry Meter API: http_server_requests_total (counter — the Rate), http_server_errors_total (counter filtered to 5xx — the Error rate), http_server_request_duration_seconds (histogram — the Duration). A Grafana dashboard shows all 22 services in a grid with three colour-coded cells each (green = healthy, amber = degraded, red = failing). During an incident, the dashboard shows which service's error rate or latency changed first — identifying the root cause service in under 30 seconds.
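A minimal sketch of this instrumentation using the OpenTelemetry Meter API (the metric names match the dashboard above; the RedMetrics wrapper and the route/status attributes are illustrative):
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class RedMetrics {
    private final LongCounter requests;
    private final LongCounter errors;
    private final DoubleHistogram duration;

    public RedMetrics(String serviceName) {
        Meter meter = GlobalOpenTelemetry.getMeter(serviceName);
        requests = meter.counterBuilder("http_server_requests_total").build();
        errors = meter.counterBuilder("http_server_errors_total").build();
        duration = meter.histogramBuilder("http_server_request_duration_seconds").build();
    }

    // Called once per completed HTTP request, e.g. from a servlet filter
    public void record(String route, int statusCode, double seconds) {
        Attributes attrs = Attributes.of(
                AttributeKey.stringKey("route"), route,
                AttributeKey.stringKey("status"), String.valueOf(statusCode));
        requests.add(1, attrs);                  // Rate
        if (statusCode >= 500) {
            errors.add(1, attrs);                // Errors (5xx only, per the RED definition)
        }
        duration.record(seconds, attrs);         // Duration histogram for p50/p95/p99
    }
}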
The SLO Alerting Configuration
Replace symptom-based alerting ("5 errors in 5 minutes") with SLO-based alerting:
# Prometheus alerting rule — booking success rate SLO
- alert: BookingSuccessRateLow
expr: |
sum(rate(http_server_requests_total{service="booking-service", status="2xx"}[5m]))
/
sum(rate(http_server_requests_total{service="booking-service"}[5m]))
< 0.995
for: 3m
labels:
severity: critical
annotations:
summary: "Booking service success rate below 99.5% SLO"
runbook: "https://runbooks.company.com/booking-success-rate"This alert fires when the booking success rate drops below 99.5% for 3 consecutive minutes — regardless of whether the failure manifests as 10 errors in 5 minutes or 1 error per minute for 10 minutes.
The 5-Minute Investigation Workflow (Post-Implementation)
- User reports "booking failed" and provides booking ID "b789" (30 seconds).
- Query Elasticsearch: trace_id WHERE booking_id=b789 (10 seconds).
- Open trace abc123def456 in Jaeger (5 seconds).
- The waterfall diagram shows: SearchService (50ms), PricingService (2,800ms — RED), BookingService (300ms), PaymentService (45ms). The PricingService span has a red error badge with the SQL query that timed out visible on the span attributes (30 seconds).
- The engineer navigates to the PricingService's slow_queries Grafana dashboard and confirms the query started timing out 2 hours ago when a database index was accidentally dropped during a schema migration (60 seconds).
- Total investigation time: under 3 minutes.
Early Warning Metrics:
- Mean time to investigation (MTTI) — measured from the moment an incident alert fires to the moment the root cause is identified (logged in the incident management tool by the on-call engineer); target under 5 minutes; above 15 minutes triggers a post-incident review of the observability stack
- Trace completeness rate — the percentage of traces that have spans from all expected services in the chain; a completeness below 90% indicates missing instrumentation; run a weekly completeness check using a synthetic transaction that touches all 22 services
- Unsampled error traces — the number of error traces that were dropped by the sampling policy (tail-based sampler must have a 100% retention rate for error traces); any non-zero count indicates the sampler is incorrectly configured
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: The concrete "5-minute investigation workflow" — with second-level timing for each step showing exactly how the observability stack enables the investigation — transforms the abstract "implement distributed tracing" recommendation into a measurable outcome that the Head of Engineering can understand and the on-call engineer can execute. The three-way fix for the silent swallowing anti-pattern (metrics counter + structured log + span error event) — explaining that each channel serves a different investigation tool (Grafana dashboard for alert, Elasticsearch for log correlation, Jaeger for trace context) — shows the systems-level thinking about observability as a three-pillar discipline, not just "add logging." The SLO-based alerting configuration — with the specific Prometheus rule showing the success rate calculation and the 3-minute evaluation window — is the production alerting maturity that eliminates alert fatigue from symptom-based thresholds.
What differentiates it from mid-level thinking: A mid-level engineer would propose "add more logging" and "use Datadog" without designing the structured logging standard (trace_id field, key-value format), without knowing about OpenTelemetry auto-instrumentation via Kubernetes mutating webhooks, without understanding tail-based sampling, and without designing the SLO-based alerting rules. They would not calculate the storage cost of full-trace sampling or design the 5-minute investigation workflow end-to-end.
What would make it a 10/10: A 10/10 response would include a complete OpenTelemetry Collector configuration YAML showing the tail-based sampling processor (retaining 100% of error traces, 1% of healthy traces), a Grafana dashboard JSON showing the RED metrics grid for all 22 services, and a concrete exemplars configuration (linking metric data points to the traces that caused the anomaly — the observability feature that enables one-click navigation from a metric spike to the causative trace).
Question 6: System Design — Designing a Rate Limiter for a High-Traffic API
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Google, Stripe, Cloudflare, Twitter, Twilio
The Question
You are a Senior Software Engineer at a payments company. The public-facing API processes 50,000 requests per second at peak and is being abused by a small number of clients who are sending bursts of thousands of requests within seconds, overwhelming downstream services and degrading performance for all other clients. The Head of Platform has asked you to design and implement a rate limiter that enforces per-client request limits (1,000 requests per minute per API key), handles burst traffic gracefully, works correctly across a horizontally-scaled fleet of 20 API servers, and fails open (allows traffic through rather than blocking everything if the rate limiter itself has a failure). Design the algorithm, the data storage approach, the distributed coordination mechanism, and the API response behaviour when a client is rate-limited.
1. What Is This Question Testing?
- Rate limiting algorithm selection — understanding the four main rate limiting algorithms and their tradeoffs: fixed window counter (simplest, but allows a burst at the window boundary — 1,000 requests at 11:59 + 1,000 at 12:00 = 2,000 requests in 2 seconds), sliding window log (accurate, but stores each request timestamp — memory-expensive at scale), sliding window counter (approximates the sliding window by interpolating between the current and previous fixed window counts — low memory, close to accurate), and token bucket (allows controlled bursting — a client can save tokens and use them in a burst, up to a configurable bucket maximum); knowing that the token bucket is the standard for payment APIs because it allows short legitimate bursts without permitting sustained overload
- Distributed rate limiting with Redis — understanding that rate limiting across 20 API servers requires a shared counter store; each server independently querying a centralised Redis cluster is the standard approach; knowing that Redis's INCRBY+EXPIRE commands are not atomic together (a race condition exists between them), and that a Lua script in Redis (which executes atomically) is the correct implementation for the token bucket or sliding window counter: the script checks the count, increments it, and sets the TTL in a single atomic operation
- Fail-open vs. fail-closed — understanding the tradeoff: fail-closed (block all traffic if the rate limiter fails) is safer for security use cases but catastrophic for availability; fail-open (allow all traffic if the rate limiter cannot be reached) preserves availability but means rate limiting is not enforced during outages; for a payments API where downtime is more damaging than a burst of excess requests during a brief Redis outage, fail-open is the correct choice; the implementation must be explicit: try { checkRateLimit() } catch (e: RedisException) { return allow() } — never rely on default exception propagation to implement fail-open
- HTTP response design for rate limiting — knowing the standard HTTP headers: 429 Too Many Requests status code, Retry-After: <seconds> header indicating when the client can retry, X-RateLimit-Limit: 1000 (the limit), X-RateLimit-Remaining: 0 (remaining requests), X-RateLimit-Reset: 1735689600 (Unix timestamp of the window reset); knowing that well-designed rate limit responses enable clients to implement polite backoff without guessing
- The thundering herd problem — knowing that if 10,000 clients are all told Retry-After: 60, they will all retry simultaneously 60 seconds later, producing a thundering herd; the correct mitigation is adding jitter: Retry-After: 60 + random(0, 30) — spreading retries over a 30-second window (a short sketch follows this list); alternatively, returning the exact time of the next token replenishment (not a fixed window reset) allows clients to retry at their individually-computed next available slot
- Monitoring and alerting — knowing that a rate limiter without observability is flying blind: track the rate limit hit rate per client (clients consistently hitting rate limits may need quota increases or are misbehaving), the Redis latency for rate limit checks (if Redis adds more than 2ms to every request, it is the new bottleneck), and the fail-open event rate (any period where the rate limiter failed open must be reviewed for abuse patterns)
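A minimal sketch of the jitter mitigation from the thundering herd bullet above:
import java.util.concurrent.ThreadLocalRandom;

final class RetryAfterJitter {
    // Returns base + random(0, 30) seconds, spreading retries over a 30-second
    // window instead of letting every rate-limited client retry at the same instant
    static int withJitter(int baseSeconds) {
        return baseSeconds + ThreadLocalRandom.current().nextInt(0, 31); // upper bound exclusive
    }
}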
2. Framework: Distributed Rate Limiter Design Model (DRLDM)
- Assumption Documentation — Establish the rate limit dimensions before designing the algorithm: is the limit per API key, per IP, per user ID, or per endpoint? A payments API typically rates limits per API key (the authenticated client) with different limits per tier (starter: 100/min, pro: 1,000/min, enterprise: custom); different limits per endpoint are also common (read endpoints allow more requests than write endpoints)
- Constraint Analysis — 50,000 RPS means the rate limiter check must complete in under 1ms (it is on the critical path of every request); Redis round-trip latency is typically 0.3–0.8ms on a co-located deployment; at 50,000 RPS, the Redis cluster must handle 50,000 commands per second — this requires a Redis cluster with at least 3 shards
- Tradeoff Evaluation — Centralised Redis (accurate, single point of coordination, adds network hop) vs. local in-process rate limiting with eventual synchronisation (lower latency, but a client can exceed the limit by up to N times if they distribute requests across N servers simultaneously); for a payments API where accuracy matters, centralised Redis is correct
- Hidden Cost Identification — The Lua script for atomic rate limiting must be cached on the Redis server using SCRIPT LOAD and called via EVALSHA (the hash of the loaded script) rather than EVAL (which sends the script text on every call); at 50,000 RPS, the difference between EVALSHA (small hash) and EVAL (full script text) is significant network overhead
- Risk Signals / Early Warning Metrics — Rate limit hit rate per client per hour (a sudden spike in hits for a specific API key indicates either a buggy client or a credential leak being used for abuse), Redis p99 latency (above 2ms on the rate limit check path degrades API response times for all clients), fail-open event count per hour (any non-zero value requires an immediate post-mortem)
- Pivot Triggers — If the Redis cluster becomes a bottleneck at scale (p99 rate limit check latency exceeds 5ms): implement a two-tier rate limiting approach — a local in-process rate limiter uses a smaller local token bucket (e.g., 100 tokens, replenished every 6 seconds from the Redis count), reducing Redis calls by 90% while bounding the local token over-usage to the local bucket size
- Long-Term Evolution Plan — Phase 1: per-API-key token bucket with Redis (described below); Phase 2: per-endpoint rate limits with different limits for read vs. write; Phase 3: adaptive rate limiting (detect abuse patterns and automatically tighten limits for suspicious clients); Phase 4: rate limit tiers exposed as a product feature (clients can purchase higher limits)
3. The Answer
The Token Bucket Algorithm
The token bucket is the correct choice for a payments API: it allows controlled bursting (a client who has been quiet for a minute can use accumulated tokens for a burst of 100 requests in 2 seconds) while preventing sustained overload. Parameters: bucket capacity = 200 tokens (allowing a 200-request burst for well-behaved clients), refill rate = 1,000 tokens per minute = 16.67 tokens per second, initial tokens = 200 (on first request). Token bucket state per API key stored in Redis: {tokens: 200.0, last_refill_timestamp: 1735689600.123}.
The Atomic Redis Lua Script
-- KEYS[1] = rate_limit:{api_key}
-- ARGV[1] = current_timestamp_ms, ARGV[2] = refill_rate_per_ms
-- ARGV[3] = bucket_capacity, ARGV[4] = tokens_requested (always 1)
local key = KEYS[1]
local now = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per millisecond (1000/60000 = 0.01667)
local capacity = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity -- Default to full bucket for new keys
local last_refill = tonumber(data[2]) or now
-- Refill tokens based on elapsed time
local elapsed_ms = now - last_refill
local refilled = elapsed_ms * refill_rate
tokens = math.min(capacity, tokens + refilled)
-- Check if the request can be served
if tokens >= requested then
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 120) -- TTL: expire if inactive for 2 minutes
return {1, math.floor(tokens)} -- {allowed=true, remaining_tokens}
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 120)
-- Calculate milliseconds until 1 token is refilled
local ms_until_token = math.ceil((1 - tokens) / refill_rate)
return {0, ms_until_token} -- {allowed=false, ms_until_next_token}
end
This Lua script executes atomically on the Redis server — no race condition between the read, update, and write operations. At 50,000 RPS, the script executes in under 0.1ms on the Redis server.
The API Rate Limit Middleware (Java/Kotlin example)
class RateLimitMiddleware(
private val redisClient: RedisClient,
private val rateLimitConfig: RateLimitConfig
) {
private val luaScript = redisClient.loadScript(LUA_SCRIPT) // Cached as EVALSHA hash
fun checkRateLimit(apiKey: String): RateLimitResult {
return try {
val result = redisClient.evalSha(
luaScript,
keys = listOf("rate_limit:$apiKey"),
args = listOf(
System.currentTimeMillis().toString(),
rateLimitConfig.refillRatePerMs.toString(),
rateLimitConfig.bucketCapacity.toString(),
"1"
)
)
val allowed = result[0] as Long == 1L
val secondValue = result[1] as Long
if (allowed) {
RateLimitResult.Allowed(remainingTokens = secondValue)
} else {
val retryAfterMs = secondValue
RateLimitResult.RateLimited(retryAfterMs = retryAfterMs)
}
} catch (e: RedisException) {
// FAIL OPEN: Redis unavailable — allow the request
logger.error("Rate limiter Redis unavailable, failing open", e)
metrics.increment("rate_limiter.fail_open")
RateLimitResult.Allowed(remainingTokens = -1) // -1 signals fail-open mode
}
}
}
HTTP Response Headers
When rate-limited, return:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689660
Retry-After: 62 ← with jitter: base_seconds + random(0, 30)
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "You have exceeded the rate limit of 1000 requests per minute.",
"retry_after_seconds": 62,
"documentation_url": "https://docs.company.com/api/rate-limits"
}
}
When allowed, include the rate limit headers in every response (not just 429s) — this allows clients to self-throttle before hitting the limit:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1735689660
Redis Cluster Sharding
At 50,000 RPS, the rate limiter executes approximately 50,000 Lua scripts per second against Redis. A single Redis instance can handle approximately 100,000 commands per second — and because the Lua script performs 3–4 Redis commands internally, the practical throughput is approximately 25,000–33,000 script executions per second per instance. For 50,000 RPS, use a Redis cluster with at least 2–3 primary shards. Route each API key deterministically to its shard: the Redis cluster's built-in hash slot mechanism (CRC16(api_key) % 16384) distributes keys automatically across shards, ensuring that the same API key always hits the same shard.
Early Warning Metrics:
- Rate limit hit rate by API key — a key hitting the rate limit more than 50% of the time is either misconfigured (the limit is too low for its legitimate use) or abusive; alert the account team for the former, the security team for the latter
- Redis cluster latency P99 for the rate limit script — alert at 2ms; above 5ms requires immediate investigation as this is now adding perceptible latency to every API response
- Fail-open event rate per hour — zero in normal operation; any non-zero value triggers a PagerDuty alert and a post-incident review; the rate limiter's Redis availability must be treated as a Tier 1 service dependency
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: The Lua script for atomic token bucket operations — with the specific detail that the script must be pre-loaded with SCRIPT LOAD and called via EVALSHA (not EVAL) to avoid sending the script text on every one of 50,000 requests per second — demonstrates that this engineer has thought about the performance characteristics of the implementation, not just its correctness. The fail-open implementation with metrics.increment("rate_limiter.fail_open") in the catch block — so that Redis outages are visible in dashboards even when they are handled gracefully — is the operations engineering discipline that makes a production system debuggable. The jitter in the Retry-After header (base + random(0, 30)) directly addresses the thundering herd problem that a naively implemented rate limiter introduces.
What differentiates it from mid-level thinking: A mid-level engineer would propose "use Redis with INCR and EXPIRE" without knowing about the non-atomicity of those two commands together (requiring a Lua script), would implement a fixed window counter (allowing boundary bursts) rather than a token bucket, would not include rate limit headers in every response (only in 429s), and would not address the thundering herd problem. They would also not calculate whether a single Redis instance can handle 50,000 RPS.
What would make it a 10/10: A 10/10 response would include the two-tier local + Redis hybrid rate limiter design (reducing Redis calls by 90%), a concrete Redis cluster sizing calculation showing the number of shards required for 50,000 RPS, and a distributed tracing integration showing how rate limit checks appear in the request trace alongside the upstream service calls.
Question 7: Data Structures and Algorithms — Solving a Real-World Problem Under Engineering Constraints
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Google, Meta, Amazon, Apple, Microsoft
The Question
You are a Senior Software Engineer at a logistics company. You have been asked to implement a real-time package routing system. Packages arrive at a warehouse and must be assigned to the optimal delivery route. Each route is a sequence of stops, and each package has a destination address. The addresses have been pre-geocoded into (lat, lng) coordinates. The routing problem has three specific requirements: (1) Given a delivery route (an ordered list of GPS coordinates) and a new package's destination coordinate, find the stop on the route that is geographically closest to the package's destination — this must be answered in under 10ms for routes with up to 5,000 stops; (2) The same destination coordinate must be looked up repeatedly (the same address appears on multiple packages) — the system processes 20,000 packages per hour and 30% have duplicate destinations; (3) Route stops are updated every 5 minutes as drivers complete stops — the data structure must support efficient deletion of completed stops and insertion of emergency stops. Walk through the data structure choices, time complexities, and the implementation for each requirement.
1. What Is This Question Testing?
- Spatial data structures — knowing that a linear scan of 5,000 stops to find the nearest one is O(n) and may be too slow at scale; knowing that a k-d tree (k-dimensional tree) is the standard data structure for nearest-neighbour search in 2D space — building a k-d tree is O(n log n), and a nearest-neighbour query is O(log n) average case; knowing that a quadtree is an alternative that recursively divides 2D space into quadrants, suitable for geographic regions; and knowing the practical consideration that for n=5,000, a linear scan with SIMD optimisation may outperform a k-d tree due to cache locality — the O(n) constant factor matters
- Memoisation and caching for repeated lookups — knowing that 30% duplicate destinations means that the nearest-stop computation for those destinations has been done before; a HashMap<Coordinate, Stop> cache (key: the destination coordinate rounded to a fixed precision, 5 decimal places in the implementation below; value: the nearest stop found last time) reduces repeated computation; knowing the cache invalidation problem: when stops are updated (requirement 3), the cache entries referencing the changed stops must be invalidated; this is the classic "cache invalidation" problem — the solution depends on whether approximate results are acceptable (serve from cache until TTL expires) or must be exact (invalidate eagerly when stops change)
- Dynamic updates to a k-d tree — knowing that standard k-d trees are not efficiently updatable — deleting a node requires rebuilding the subtree; knowing the alternatives: a KD-B tree supports dynamic updates but is more complex to implement; a Ball tree (a hierarchical tree where each node represents a ball in n-dimensional space, not a hyperplane split) supports approximate nearest-neighbour with dynamic updates; or a lazy deletion strategy for the k-d tree (mark deleted nodes rather than removing them, and periodically rebuild the tree during the 5-minute update window)
- Priority queue for route optimisation — knowing that when an emergency stop is inserted, it must be inserted in the optimal position of the route (between the stops it is nearest to, minimising total route distance); this is the travelling salesman insertion heuristic — for each new stop, compute the cost of inserting it between every pair of consecutive stops and choose the minimum-cost insertion point; this is O(n) per insertion, which is acceptable for 5,000 stops
- Space-time tradeoff analysis — demonstrating the ability to reason about the tradeoffs explicitly: the k-d tree uses O(n) space and O(log n) query time; the linear scan uses O(n) space and O(n) query time with better cache locality for small n; the cache uses O(m) space (m = number of distinct destinations) and O(1) query time for cached hits; providing the crossover point (at what n does the k-d tree outperform the linear scan? approximately n > 1,000 for random access patterns in 2D) demonstrates algorithmic depth
- Concurrency considerations — at 20,000 packages per hour, the system receives approximately 5.5 packages per second; the k-d tree is built on the route's 5,000 stops and queried for each incoming package; concurrent queries from multiple threads must be safe — the k-d tree must be either immutable (built once, queried by multiple threads without synchronisation) or protected by a read-write lock; the 5-minute rebuild cycle creates a window where the old tree is being queried while the new tree is being built — the swap must be atomic
2. Framework: Spatial Algorithm Selection and Implementation Model (SASIM)
- Assumption Documentation — Clarify the precision requirements: is "nearest stop" the Euclidean distance in lat/lng coordinates, or the great-circle distance (Haversine formula for the curvature of the Earth)? For distances under 100km (a typical delivery route), Euclidean distance in lat/lng introduces less than 0.5% error — acceptable for routing; for accuracy, use the Haversine formula in the final verification step after the k-d tree narrows the candidates (the formula is shown for reference after this list)
- Constraint Analysis — The 10ms response time requirement at a route size of 5,000 stops; a k-d tree nearest-neighbour query completes in under 1ms for 5,000 nodes on modern hardware — the bottleneck is likely the cache lookup and network I/O, not the algorithm itself
- Tradeoff Evaluation — For n=5,000 stops: a linear scan processes 5,000 distance calculations in approximately 0.05ms (50μs) on modern hardware with SIMD vectorisation — potentially faster than the k-d tree's cache-unfriendly tree traversal; benchmark both approaches on the target hardware before committing to the more complex k-d tree implementation
- Hidden Cost Identification — Geocoding precision: latitude and longitude coordinates are typically given to 6 decimal places (approximately 0.1m precision); rounding to 5 decimal places (approximately 1m) for the cache key reduces cache fragmentation from GPS coordinate noise while preserving practical location accuracy
- Risk Signals / Early Warning Metrics — Cache hit rate for the memoisation layer (target above 25% given the 30% duplicate destination rate — a hit rate below 20% indicates the cache key rounding strategy is creating false misses), k-d tree rebuild time per 5-minute cycle (the rebuild must complete in under 30 seconds to not overlap with the next cycle), nearest-neighbour query P99 latency (must remain under 10ms at all traffic levels)
- Pivot Triggers — If the k-d tree query time increases above 5ms when the route expands beyond 5,000 stops (e.g., a warehouse hub with 20,000-stop routes): switch to an approximate nearest-neighbour algorithm (Annoy — Approximate Nearest Neighbours Oh Yeah — by Spotify, which builds multiple random projection trees and returns the approximate nearest neighbour in O(log n) with configurable accuracy)
- Long-Term Evolution Plan — The current design handles a single route; at scale, the system routes packages across hundreds of routes simultaneously; the production architecture uses a geospatial index (PostGIS with ST_ClosestPoint, or Elasticsearch with geo_point queries) to first find which route a package belongs to, then the k-d tree to find the nearest stop on that route
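For reference, the Haversine great-circle distance between points $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ (latitude/longitude in radians, Earth radius $R \approx 6371$ km) is:

$$ d = 2R \arcsin\!\left( \sqrt{ \sin^2\!\left(\tfrac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\tfrac{\lambda_2 - \lambda_1}{2}\right) } \right) $$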
3. The Answer
Requirement 1: Nearest Stop on a Route — K-D Tree Implementation
Build a 2D k-d tree from the route's (lat, lng) stop coordinates. Each node in the k-d tree represents a stop; the tree splits on alternating dimensions (latitude at even depth, longitude at odd depth), with the median coordinate as the splitting value:
from dataclasses import dataclass
from typing import Optional
@dataclass
class Stop:
stop_id: str
lat: float
lng: float
sequence: int # Position in the route
class KDNode:
    def __init__(self, stop: Stop, left: Optional['KDNode'] = None, right: Optional['KDNode'] = None):
self.stop = stop
self.left = left
self.right = right
def build_kd_tree(stops: list[Stop], depth: int = 0) -> Optional[KDNode]:
if not stops:
return None
# Alternate splitting dimension: 0=latitude, 1=longitude
axis = depth % 2
sorted_stops = sorted(stops, key=lambda s: s.lat if axis == 0 else s.lng)
mid = len(sorted_stops) // 2
return KDNode(
stop=sorted_stops[mid],
left=build_kd_tree(sorted_stops[:mid], depth + 1),
right=build_kd_tree(sorted_stops[mid + 1:], depth + 1)
)
def nearest_neighbour(root: KDNode, target_lat: float, target_lng: float) -> Stop:
best = [None, float('inf')] # [best_stop, best_dist_sq]
def _search(node: KDNode, depth: int):
if node is None:
return
# Euclidean distance squared (no sqrt needed for comparison)
dist_sq = (node.stop.lat - target_lat) ** 2 + (node.stop.lng - target_lng) ** 2
if dist_sq < best[1]:
best[0], best[1] = node.stop, dist_sq
axis = depth % 2
diff = (target_lat - node.stop.lat) if axis == 0 else (target_lng - node.stop.lng)
# Search the nearer half first, then check if the farther half could be closer
near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
_search(near, depth + 1)
if diff ** 2 < best[1]: # The splitting hyperplane is closer than the current best
_search(far, depth + 1)
_search(root, 0)
    return best[0]
Time complexity: Build: O(n log n) with linear-time median selection (the simple sort-at-each-level version above is O(n log² n), which is negligible at 5,000 stops). Query: O(log n) average, O(n) worst case (for pathological distributions like all stops on a line). Space complexity: O(n).
Requirement 2: Memoisation for Repeated Destinations
Wrap the k-d tree query with a cache keyed on the rounded coordinate:
import threading
class RouteNearestStopFinder:
def __init__(self, stops: list[Stop]):
self._tree = build_kd_tree(stops)
self._cache: dict[tuple, Stop] = {}
self._cache_lock = threading.RLock()
def find_nearest(self, dest_lat: float, dest_lng: float) -> Stop:
# Round to 5 decimal places (~1m precision) to catch GPS-noise duplicates
cache_key = (round(dest_lat, 5), round(dest_lng, 5))
with self._cache_lock:
if cache_key in self._cache:
return self._cache[cache_key]
# Cache miss — query the k-d tree
nearest = nearest_neighbour(self._tree, dest_lat, dest_lng)
with self._cache_lock:
self._cache[cache_key] = nearest
return nearest
def invalidate_stop(self, stop_id: str):
"""Call when a stop is completed or moved — invalidates all cache entries for that stop."""
with self._cache_lock:
keys_to_delete = [k for k, v in self._cache.items() if v.stop_id == stop_id]
for key in keys_to_delete:
                del self._cache[key]
Requirement 3: Dynamic Stop Updates
For the 5-minute stop update cycle, use a lazy deletion + periodic rebuild strategy:
class DynamicRouteFinder:
def __init__(self, stops: list[Stop]):
self._active_stops = {s.stop_id: s for s in stops}
self._deleted_stop_ids: set[str] = set()
self._finder = RouteNearestStopFinder(stops)
self._rebuild_threshold = 50 # Rebuild after 50 deletions (avoid staleness)
def complete_stop(self, stop_id: str):
"""Mark a stop as completed (driver has visited it)."""
del self._active_stops[stop_id]
self._deleted_stop_ids.add(stop_id)
self._finder.invalidate_stop(stop_id)
if len(self._deleted_stop_ids) >= self._rebuild_threshold:
self._rebuild_tree()
def insert_emergency_stop(self, new_stop: Stop) -> int:
"""Insert an emergency stop at the optimal position in the route.
Returns the sequence position where the stop was inserted."""
active = sorted(self._active_stops.values(), key=lambda s: s.sequence)
best_position = self._find_optimal_insertion(new_stop, active)
        # Splice the new stop in and renumber from the list order; completed stops
        # leave gaps in the sequence values, so a simple +1 shift is not safe
        active.insert(best_position, new_stop)
        for idx, stop in enumerate(active):
            stop.sequence = idx
        self._active_stops[new_stop.stop_id] = new_stop
self._rebuild_tree() # Rebuild immediately after insertion
return best_position
def _find_optimal_insertion(self, new_stop: Stop, route: list[Stop]) -> int:
"""O(n) insertion heuristic: find the position minimising added route distance."""
if not route:
return 0
best_cost = float('inf')
best_pos = 0
for i in range(len(route) - 1):
# Cost = distance(route[i] → new_stop) + distance(new_stop → route[i+1])
# - distance(route[i] → route[i+1]) ← the segment we're replacing
added_cost = (
_dist(route[i], new_stop) +
_dist(new_stop, route[i + 1]) -
_dist(route[i], route[i + 1])
)
if added_cost < best_cost:
best_cost, best_pos = added_cost, i + 1
return best_pos
    def _rebuild_tree(self):
        self._deleted_stop_ids.clear()
        self._finder = RouteNearestStopFinder(list(self._active_stops.values()))

def _dist(a: Stop, b: Stop) -> float:
    # Module-level helper used by _find_optimal_insertion: Euclidean distance in
    # lat/lng space, adequate for ranking candidate insertion points on a local route
    return ((a.lat - b.lat) ** 2 + (a.lng - b.lng) ** 2) ** 0.5
Complexity summary:
- Build k-d tree: O(n log n) — executed at most once per 5-minute cycle
- Nearest-stop query (cache miss): O(log n) average
- Nearest-stop query (cache hit): O(1)
- Complete stop (lazy deletion): O(m), where m = total cached destinations (invalidate_stop scans the whole cache for entries that point at the completed stop)
- Insert emergency stop (optimal position): O(n) for the insertion heuristic, O(n log n) for the tree rebuild
Early Warning Metrics:
- Cache hit rate (target above 25% given 30% duplicate destinations) — a hit rate below 20% after the first hour of operation indicates the cache key rounding is too aggressive or the coordinate data has more noise than expected
- K-d tree build time per rebuild cycle — the rebuild must complete in under 5 seconds for a 5,000-stop route; above 30 seconds indicates the active stop count has grown beyond the expected bound
- Nearest-neighbour query P99 latency — alert at 5ms; above 10ms violates the requirement; this is most commonly caused by a degenerate k-d tree (poorly balanced due to clustered geographic stops) that can be addressed with a better splitting strategy (median split vs. midpoint split)
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: The specific observation that for n=5,000 with SIMD vectorisation, a linear scan may outperform the k-d tree due to cache locality — and the recommendation to benchmark both before committing to the more complex implementation — demonstrates the algorithmic pragmatism that distinguishes an experienced engineer from one who reaches for the theoretically optimal data structure without considering constant factors. The lazy deletion strategy (mark deleted nodes, rebuild after 50 deletions) rather than attempting dynamic deletion from the k-d tree (which requires subtree rebuilding) is the practical engineering trade-off that produces a maintainable system. The cache invalidation design (invalidate_stop(stop_id) scanning the cache for all entries mapping to the changed stop) shows awareness of the consistency problem that arises when the underlying data changes.
What differentiates it from mid-level thinking: A mid-level engineer would implement a linear scan for all three requirements, not knowing about k-d trees for spatial nearest-neighbour queries. They would not address the concurrent access problem (the threading.RLock on the cache), would not design the memoisation cache invalidation on stop deletion, and would not know the O(n) insertion heuristic for finding the optimal route position for an emergency stop.
What would make it a 10/10: A 10/10 response would include a benchmark comparing the linear scan vs. k-d tree at n=100, n=1,000, n=5,000, and n=20,000 stops showing the crossover point, a thread-safe atomic tree swap implementation (using concurrent.futures or AtomicReference in Java to swap the old tree for the new one during the rebuild cycle without blocking queries), and an analysis of when to switch to Annoy for approximate nearest-neighbour at route sizes above 20,000.
Question 8: Distributed Systems — Designing for Consistency, Availability, and Fault Tolerance
Difficulty: Elite | Role: Software Engineer | Level: Senior / Staff | Company Examples: Amazon, Google, Netflix, Uber, Stripe
The Question
You are a Senior Software Engineer at a fintech company. The engineering team is designing a money transfer system. A user initiates a transfer of £500 from Account A (at Bank A) to Account B (at Bank B). The operation involves 3 steps: deduct £500 from Account A, credit £500 to Account B, and record the transfer in a transaction ledger. These 3 steps touch 3 different services (Bank A API, Bank B API, and the internal Ledger Service) — no shared database exists across all three. The business requirement is that the system must never result in money being debited from Account A without being credited to Account B, and vice versa — partial transfers (money lost or money created) are both unacceptable. However, the three services are independently operated, can fail at any time, and have no cross-service transaction capability. Design the distributed transaction protocol that guarantees the atomicity of the transfer — all 3 steps succeed or none of them take effect — while handling network failures, service crashes, and partial completions.
1. What Is This Question Testing?
- The distributed transaction problem — understanding that ACID transactions across multiple services with no shared database is a fundamental distributed systems problem; knowing that 2PC (Two-Phase Commit) is the classical protocol for distributed transactions and its limitations: it requires all participants to be available simultaneously and has a blocking failure mode (if the coordinator crashes after Phase 1 but before Phase 2, participants are blocked indefinitely holding their locks); and knowing that for a financial transfer involving external bank APIs (which cannot participate in 2PC), 2PC is not applicable
- The Saga pattern for distributed transactions — knowing that the Saga pattern replaces a distributed ACID transaction with a sequence of local transactions, each followed by a compensating transaction that undoes the local transaction if a subsequent step fails; knowing the two Saga variants: choreography (each service publishes events and subscribes to events from other services — no central coordinator) and orchestration (a central Saga orchestrator sends commands to each service and handles the compensating transactions on failure); knowing that orchestration is more appropriate for financial transactions because the audit trail is centralised and easier to reason about
- Idempotency keys — understanding that in a distributed system where messages can be duplicated (retries, at-least-once delivery), every operation must be idempotent: applying the same operation multiple times has the same effect as applying it once; for a money transfer, the debit of Account A must be idempotent — if the Bank A API call is retried (because the first call's response was lost), the second call must not debit the account a second time; this is implemented with an idempotency key (a UUID generated once per transfer attempt) that Bank A uses to deduplicate requests
- The Outbox Pattern — knowing that a service that writes to its own database and then publishes an event to a message queue has a consistency problem: if the database write succeeds but the message publish fails (or the process crashes between the two), the downstream services never receive the event; the Outbox pattern solves this by writing the event to an outbox table in the same local database transaction as the business data; a separate message relay process reads the outbox table and publishes events to the message queue, retrying until the publish succeeds — this guarantees at-least-once event delivery with no split-brain between the database and the event bus (a minimal sketch follows this list)
- Exactly-once semantics vs. at-least-once + idempotency — knowing that exactly-once delivery is extremely difficult and expensive to achieve in distributed systems; the pragmatic alternative is at-least-once delivery combined with idempotent consumers — the consumer deduplicates incoming messages using the message ID and processes each unique message exactly once; this is the pattern used by every major payment system
- Failure scenario handling — being able to reason through specific failure scenarios: what happens if Bank A crashes after deducting the money but before the orchestrator is notified? (The orchestrator must retry the Bank A status check — idempotency ensures the deduction is not duplicated), what happens if Bank B is down when the credit is attempted? (The orchestrator retries with exponential backoff for a configurable period — if Bank B does not recover within the timeout, the compensating transaction reverses the Bank A debit), what happens if the compensating transaction itself fails? (The failed compensation is recorded and a human escalation alert is sent — this is the "pending manual resolution" state that every payment system must handle)
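A minimal sketch of the Outbox write path described in the bullets above, assuming a PostgreSQL-style schema (the outbox table shape, column names, and the relay summarised in the trailing comment are all illustrative):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferOutboxWriter {
    // Writes the state change and the outbox event in ONE local transaction:
    // either both commit or neither does, so no event is ever lost or orphaned.
    public void confirmDebitWithEvent(Connection conn, String transferId, String eventJson)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement business = conn.prepareStatement(
                 "UPDATE transfers SET state = 'DEBIT_CONFIRMED' WHERE transfer_id = ?");
             PreparedStatement outbox = conn.prepareStatement(
                 "INSERT INTO outbox (event_id, aggregate_id, payload, published) " +
                 "VALUES (gen_random_uuid(), ?, ?::jsonb, false)")) {
            business.setString(1, transferId);
            business.executeUpdate();
            outbox.setString(1, transferId);
            outbox.setString(2, eventJson);
            outbox.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
    // A separate relay process polls SELECT * FROM outbox WHERE published = false,
    // publishes each row to the message queue, and marks it published only after
    // the broker acknowledges; retrying until success gives at-least-once delivery.
}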
2. Framework: Distributed Financial Transaction Design Model (DFTDM)
- Assumption Documentation — Establish the consistency requirements: can the system tolerate a brief window where money is "in flight" (deducted from Account A but not yet credited to Account B)? In practice, all payment systems have this window; the requirement is that the window resolves correctly — not that it never exists; distinguish between atomicity (the transfer either fully completes or fully reverses) and consistency (at any point in time, the total money in the system is conserved)
- Constraint Analysis — Bank A and Bank B are external services with their own APIs; they cannot be made to participate in a 2PC protocol; the Ledger Service is internal and can be designed to support the required transactional semantics; the design must handle Bank A's API being idempotent (requires idempotency key support) vs. non-idempotent (requires a deduplication layer)
- Tradeoff Evaluation — Saga Choreography (decentralised, event-driven, no single point of failure) vs. Saga Orchestration (centralised coordinator, easier to reason about state, single point of failure mitigated by replication); for a financial transfer where auditability and state visibility are critical, orchestration is preferred — the transfer orchestrator is a replicated, stateful service that can be queried for the current state of any transfer
- Hidden Cost Identification — The "double debit" scenario: if the Bank A API call succeeds, the process crashes before writing the success to the orchestrator's database, and the orchestrator retries — without idempotency keys, Bank A debits the account twice; idempotency keys must be generated once (at transfer initiation) and stored durably before the first Bank A API call
- Risk Signals / Early Warning Metrics — Saga completion rate (what percentage of initiated transfers complete successfully without any compensating transactions?), compensating transaction success rate (when a transfer must be reversed, does the reversal succeed? — a compensating transaction failure rate above 0.01% requires immediate engineering attention), transfers in "pending manual resolution" state (the count of transfers stuck in an unresolvable state requiring human intervention — this should be zero under normal conditions)
- Pivot Triggers — If the Bank B API has a planned downtime window that causes the transfer success rate to drop below 95%: implement a "deferred credit" pattern where the Bank A debit and the Ledger Service recording are committed immediately, and the Bank B credit is queued for retry when Bank B returns; this requires the Ledger to show the transfer as "credit pending" rather than "complete" until Bank B confirms receipt
- Long-Term Evolution Plan — Phase 1: Saga orchestration with manual compensation for unresolvable failures; Phase 2: automated compensation with exponential backoff (24-hour retry window before escalating to manual); Phase 3: ISO 20022 message format support for interoperability with the SWIFT network; Phase 4: idempotency key management service shared across all payment operations
3. The Answer
The Transfer State Machine
The transfer orchestrator manages a transfer through the following states: INITIATED → DEBIT_PENDING → DEBIT_CONFIRMED → CREDIT_PENDING → CREDIT_CONFIRMED → LEDGER_PENDING → COMPLETED. Compensation states: CREDIT_FAILED → REVERSAL_PENDING → REVERSED (or REVERSAL_FAILED → MANUAL_RESOLUTION_REQUIRED). Every state transition is persisted to the orchestrator's database before the next action is taken.
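As an illustration, the state machine can be encoded so that illegal transitions are unrepresentable. The following Java enum is a sketch that mirrors only the states and transitions named above, plus the FAILED terminal state used when the initial debit never succeeds (see Step 1 below); the next() method shape is an assumption.
import java.util.EnumSet;
import java.util.Set;

public enum TransferState {
    INITIATED, DEBIT_PENDING, DEBIT_CONFIRMED, CREDIT_PENDING, CREDIT_CONFIRMED,
    LEDGER_PENDING, COMPLETED, FAILED,
    CREDIT_FAILED, REVERSAL_PENDING, REVERSED, REVERSAL_FAILED, MANUAL_RESOLUTION_REQUIRED;

    // Legal next states; the orchestrator persists each transition before acting.
    public Set<TransferState> next() {
        switch (this) {
            case INITIATED:        return EnumSet.of(DEBIT_PENDING);
            case DEBIT_PENDING:    return EnumSet.of(DEBIT_CONFIRMED, FAILED);
            case DEBIT_CONFIRMED:  return EnumSet.of(CREDIT_PENDING);
            case CREDIT_PENDING:   return EnumSet.of(CREDIT_CONFIRMED, CREDIT_FAILED);
            case CREDIT_CONFIRMED: return EnumSet.of(LEDGER_PENDING);
            case LEDGER_PENDING:   return EnumSet.of(COMPLETED);
            case CREDIT_FAILED:    return EnumSet.of(REVERSAL_PENDING);
            case REVERSAL_PENDING: return EnumSet.of(REVERSED, REVERSAL_FAILED);
            case REVERSAL_FAILED:  return EnumSet.of(MANUAL_RESOLUTION_REQUIRED);
            default:               return EnumSet.noneOf(TransferState.class); // terminal
        }
    }
}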
The Idempotency Key Architecture
At transfer initiation, a transfer record is created:
INSERT INTO transfers (
transfer_id, -- UUID, the Saga correlation ID
idempotency_key_debit, -- UUID, used as Bank A's request idempotency key
idempotency_key_credit,-- UUID, used as Bank B's request idempotency key
amount,
source_account,
destination_account,
state,
created_at
) VALUES (gen_random_uuid(), gen_random_uuid(), gen_random_uuid(),
500.00, 'ACC_A', 'ACC_B', 'INITIATED', now());
Critically: the idempotency_key_debit and idempotency_key_credit are generated and committed to the database before any external API call is made. If the orchestrator crashes and restarts, it reads the transfer records stuck in non-terminal states and retries using the same idempotency keys — Bank A and Bank B will recognise the key and return the result of the original operation rather than executing it again.
The Saga Orchestration Flow
Step 1 — Debit Account A: the orchestrator calls Bank A's API: POST /v1/debits { amount: 500, account: ACC_A, idempotency_key: <idempotency_key_debit> }. Bank A returns { debit_id: "DEB_123", status: "confirmed" }. The orchestrator updates the transfer state to DEBIT_CONFIRMED and saves debit_id. If Bank A returns an error: the orchestrator retries with exponential backoff up to 3 times (using the same idempotency_key_debit each time — idempotent retry). If all retries fail: the transfer transitions to FAILED (no money has moved — no compensation needed).
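A sketch of Step 1's retry loop, assuming a JSON-over-HTTP client. The /v1/debits path, the three-attempt policy, and the reuse of the stored idempotency key come from the text; the Bank A hostname, the HttpClient wiring, and the hand-built JSON body are illustrative assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class DebitStep {
    private final HttpClient http = HttpClient.newHttpClient();

    public boolean debit(String account, String amount, String idempotencyKeyDebit)
            throws InterruptedException {
        long delayMs = 1_000; // exponential backoff: 1s, 2s, 4s between attempts
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("https://bank-a.example/v1/debits")) // hostname assumed
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(
                                "{\"amount\": " + amount + ", \"account\": \"" + account
                                + "\", \"idempotency_key\": \"" + idempotencyKeyDebit + "\"}"))
                        .build();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    return true; // caller persists DEBIT_CONFIRMED before the next step
                }
            } catch (Exception e) {
                // network failure or lost response — retrying with the SAME key is safe
            }
            Thread.sleep(delayMs);
            delayMs *= 2;
        }
        return false; // transfer transitions to FAILED; no money has moved
    }
}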
Step 2 — Credit Account B: the orchestrator calls Bank B's API: POST /v1/credits { amount: 500, account: ACC_B, idempotency_key: <idempotency_key_credit> }. Bank B returns { credit_id: "CRE_456", status: "confirmed" }. The orchestrator updates the transfer state to CREDIT_CONFIRMED. If Bank B is unavailable: the orchestrator retries for up to 24 hours with exponential backoff (the debit is now committed — the credit must eventually succeed or be reversed). If Bank B confirms it cannot credit the account (insufficient destination account details, account closed): immediately transition to CREDIT_FAILED and execute the compensating transaction.
Compensating Transaction (if credit fails): call Bank A's API: POST /v1/reversals { debit_id: "DEB_123", idempotency_key: <reversal_key> }. Bank A reverses the debit and credits Account A. The transfer transitions to REVERSED. If the reversal also fails (Bank A is now unavailable): the transfer transitions to REVERSAL_FAILED → MANUAL_RESOLUTION_REQUIRED; a PagerDuty alert is sent to the payments operations team with the full transfer state.
Step 3 — Record in Ledger: after both the debit and credit are confirmed, the orchestrator calls the internal Ledger Service: POST /v1/entries { transfer_id, debit_account, credit_account, amount, debit_confirmation, credit_confirmation }. The Ledger Service records the double-entry bookkeeping entry. If the Ledger Service is unavailable: retry indefinitely (the money has already moved correctly — the ledger entry is the record of what happened and cannot be omitted); the transfer is marked COMPLETED only after the ledger confirms.
The Outbox Pattern for the Orchestrator's Events
The orchestrator must publish events (TransferInitiated, TransferDebited, TransferCompleted, TransferFailed) to a message bus for downstream consumers (notifications service, analytics, fraud detection). To avoid the split-brain between the database state transition and the event publication:
-- This INSERT happens in the SAME database transaction as the state update
BEGIN;
UPDATE transfers SET state = 'DEBIT_CONFIRMED', debit_id = 'DEB_123'
WHERE transfer_id = '<id>';
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('TransferDebited', '{"transfer_id": "<id>", "amount": 500}', now());
COMMIT;
-- A separate relay process reads the outbox table and publishes to the event bus
This guarantees that if the process crashes after the state update commits, the outbox entry is also committed — the relay will publish the event on the next run. If the event bus is unavailable, the relay retries; the outbox entry stays until the publish succeeds.
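A sketch of that relay process, assuming the outbox table additionally carries a numeric id primary key (not shown in the INSERT above) and letting a stub publish() stand in for a real Kafka producer send. Deleting the row only after a successful publish is what yields at-least-once delivery.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class OutboxRelay {
    private final Connection db;

    public OutboxRelay(Connection db) {
        this.db = db;
    }

    public void relayOnce() throws SQLException {
        try (PreparedStatement select = db.prepareStatement(
                "SELECT id, event_type, payload FROM outbox ORDER BY created_at LIMIT 100");
             ResultSet rows = select.executeQuery()) {
            while (rows.next()) {
                long id = rows.getLong("id");
                publish(rows.getString("event_type"), rows.getString("payload"));
                // Delete only after a successful publish; a crash before this line
                // means the event is published again next run — safe, because
                // consumers deduplicate by message ID.
                try (PreparedStatement delete =
                             db.prepareStatement("DELETE FROM outbox WHERE id = ?")) {
                    delete.setLong(1, id);
                    delete.executeUpdate();
                }
            }
        }
    }

    private void publish(String eventType, String payload) {
        System.out.printf("publishing %s: %s%n", eventType, payload);
    }
}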
Handling the "Zombie Transfer" — Orchestrator Crash Recovery
If the orchestrator process crashes while waiting for Bank B's response (the transfer is in CREDIT_PENDING state), recovery is straightforward: on startup, the orchestrator queries for all transfers in non-terminal states (DEBIT_CONFIRMED, CREDIT_PENDING, CREDIT_CONFIRMED) and resumes them from their last confirmed state — retrying the pending step with the stored idempotency key. This is the "transactional outbox + at-least-once retry" pattern that makes the orchestrator crash-resilient without requiring distributed locking.
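A sketch of that startup recovery scan, using the transfers table from earlier; the resume() dispatch is illustrative. The WHERE clause is the recovery query that benefits from an index on state.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class TransferRecovery {
    private final Connection db;

    public TransferRecovery(Connection db) {
        this.db = db;
    }

    public void recoverOnStartup() throws SQLException {
        // Find every transfer stranded in a non-terminal state.
        String sql = "SELECT transfer_id, state FROM transfers "
                + "WHERE state IN ('DEBIT_CONFIRMED', 'CREDIT_PENDING', 'CREDIT_CONFIRMED')";
        try (PreparedStatement stmt = db.prepareStatement(sql);
             ResultSet rows = stmt.executeQuery()) {
            while (rows.next()) {
                // Resume from the last confirmed state; the stored idempotency
                // keys make the retried external calls safe to repeat.
                resume(rows.getString("transfer_id"), rows.getString("state"));
            }
        }
    }

    private void resume(String transferId, String state) {
        System.out.printf("resuming %s from state %s%n", transferId, state);
    }
}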
Early Warning Metrics:
- Transfer completion rate (successful completions / initiations) — target above 99.5%; below 99% indicates a systemic issue with one of the external bank APIs
- Compensating transaction success rate — any failed compensation is a critical incident; monitor in real time; below 100% triggers an immediate incident response
- MANUAL_RESOLUTION_REQUIRED transfer count — should be zero under normal conditions; any non-zero value triggers a human review within 4 hours to prevent money from being "lost in flight" for an extended period
4. Interview Score: 10 / 10
Why this demonstrates staff-level maturity: The idempotency key generation before the first API call — and the specific explanation that the keys must be durably stored before any external call so that crash-and-retry uses the same keys — is the distributed systems engineering discipline that prevents the double-debit bug that breaks naive implementations. The Outbox pattern integration (writing the event to the outbox table in the same database transaction as the state update) addresses the "dual write" consistency problem without requiring a distributed transaction between the database and the event bus. The "Zombie Transfer" recovery procedure — querying non-terminal states on startup and resuming with stored idempotency keys — is the specific operational procedure that makes the orchestrator crash-resilient, which most engineers miss because it requires thinking about recovery, not just the happy path.
What differentiates it from mid-level thinking: A mid-level engineer would propose a database transaction spanning all three services (not possible with external APIs), or would propose 2PC (which requires all participants to be available simultaneously and creates blocking failures). They would not know about the Saga pattern, would not design the idempotency key architecture, would not address the zombie transfer recovery, and would not know about the Outbox pattern for consistent event publishing.
What would make it a perfect implementation: This response scores 10/10 on the dimensions tested. The theoretical extension would be a formal state machine diagram showing all states and transitions including the compensation paths, a concrete exponential backoff configuration (e.g., initial delay 1 second, multiplier 2, max delay 300 seconds — roughly 290 capped attempts to span a 24-hour retry window), and a database schema for the orchestrator's transfers table with the right indexes for the recovery query (index on state for finding non-terminal transfers).
Question 9: Code Quality and Refactoring — Identifying and Resolving Design Smells in a Production Codebase
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Google, ThoughtWorks, JetBrains, Stripe, Basecamp
The Question
You are a Senior Software Engineer joining a team that has been maintaining a 6-year-old e-commerce order management system written in Java. During a code review of the checkout service, you encounter the following 300-line OrderService class that handles order creation. The class: checks inventory availability by calling the InventoryService directly, calculates the total price including discounts by implementing pricing logic inline, sends confirmation emails by instantiating EmailClient directly, writes audit logs by calling a static logger, processes payment by calling PaymentGateway with hardcoded timeout values, updates inventory after payment with a direct InventoryService call, and publishes an OrderCreatedEvent to a message broker with a direct KafkaProducer instantiation inside the method. The class has no unit tests because "you can't mock the static logger or the hardcoded Kafka producer." Walk through the design smells you identify, the refactoring strategy, and how you would transform this class into a testable, maintainable design without breaking the existing behaviour.
1. What Is This Question Testing?
- SOLID principles in practice — identifying the violations: Single Responsibility Principle (the OrderService does 7 things — checking inventory, pricing, emailing, logging, payment, inventory update, and event publishing — when it should do one: coordinate an order creation workflow); Open/Closed Principle (pricing logic inline means adding a new discount type requires modifying the class); Dependency Inversion Principle (the class depends on concrete implementations, not abstractions — EmailClient and KafkaProducer are instantiated directly instead of being injected as interfaces)
- Dependency injection and testability — understanding the specific connection between DI and testability: code that instantiates its own dependencies (new EmailClient(), new KafkaProducer()) is untestable because the test cannot substitute a fake; code that receives dependencies via constructor injection (or interface injection) is testable because the test provides a fake implementation of the interface; knowing the rule: a class that can only be tested by sending a real email or publishing to a real Kafka topic is a class that violates DI
- The Strangler Fig refactoring pattern for legacy code — knowing that refactoring a class that has no tests must be done with extreme caution (every change risks breaking behaviour); the safe approach: write characterisation tests first (tests that document the current behaviour of the class, whatever it may be), then refactor incrementally with each step verified by the characterisation tests; "make the change easy, then make the easy change" (Kent Beck)
- Extract Interface for testability — knowing the specific refactoring step: for each direct dependency (EmailClient, KafkaProducer, InventoryService), extract an interface (EmailSender, EventPublisher, InventoryRepository) that expresses the contract; create the production implementation; inject the interface via the constructor; now the class can be tested with a fake implementation of each interface
- The Tell Don't Ask principle — recognising that the inline pricing logic violates encapsulation: OrderService is "asking" the Order object for its items and computing the price itself, rather than "telling" a PricingService to calculate the price; the fix: extract a PricingService with a calculateOrderTotal(order, appliedDiscounts) method
- God class symptoms — knowing the code smells that indicate a God class (a class that knows too much): more than 200 lines, more than 10 dependencies, methods that contain "and" in their name (createOrderAndSendEmail), more than 3 levels of nesting, and a class that is changed for multiple different reasons (adding a new payment method, adding a new discount type, changing the email template — all require modifying the same class)
2. Framework: Legacy Code Refactoring Model (LCRM)
- Assumption Documentation — Before refactoring any line, confirm with the team: what is the expected behaviour of createOrder()? Are there integration tests (even if not unit tests) that exercise this method? Understanding the observable contract of the method before refactoring prevents introducing regressions that the team may not discover for weeks
- Constraint Analysis — The class has no unit tests; every refactoring step must be small enough to be verified manually or with the existing integration tests; large refactoring steps (rewrites) without test coverage are high-risk; the team may be running manual smoke tests — coordinate with QA before deploying each refactoring step
- Tradeoff Evaluation — Incremental refactoring (extract one dependency at a time, deploy, verify) vs. full redesign then replace (do all the refactoring in one branch, deploy all at once); incremental is safer but takes longer; for a production system with no unit tests, incremental is always correct
- Hidden Cost Identification — The static logger is the least visible dependency but the most insidious: it makes the logging behaviour impossible to verify in tests (a unit test cannot assert "an error was logged" without a log capture library); replacing the static logger with an injected AuditLogger interface is the change that has the largest impact on testability despite appearing trivial
- Risk Signals / Early Warning Metrics — Cyclomatic complexity of the refactored class (target below 10 — a complexity above 10 means the class has too many conditional branches for one class), test coverage after refactoring (target above 80% for all new classes extracted from the refactoring), lines of code in the new OrderCoordinator (target under 80 lines — if the coordinator is still 200 lines, the extraction did not sufficiently decompose the responsibilities)
- Pivot Triggers — If a characterisation test reveals that the current implementation has an undocumented behaviour (e.g., it silently ignores payment gateway timeouts and marks the order as pending rather than failed): document the undocumented behaviour as a known issue, do not fix it during the refactoring (changing behaviour and structure simultaneously doubles the risk of regression)
- Long-Term Evolution Plan — Refactoring sprint: extract dependencies and write unit tests; second sprint: extract domain logic (pricing, inventory) into their own services; third sprint: migrate the event publishing to a proper event sourcing pattern; fourth sprint: remove the God class entirely by decomposing into feature-oriented services
3. The Answer
Step 1: Write Characterisation Tests Before Touching Any Code
Before the first line is changed, write a characterisation test that calls createOrder() through the entire stack with a real database (or a test database) and real services where unavoidable. The test documents: what does the method return for a valid order? What does it return when the item is out of stock? What happens to the database state after a payment failure? These tests run slowly and are not proper unit tests, but they are the safety net for the refactoring. Sample characterisation test:
@Test
public void createOrder_withValidItems_persistsOrderAndReturnsId() {
// Arrange — uses real implementations, real test database
Order order = new Order(userId, List.of(new OrderItem("ITEM_001", 2)));
// Act
OrderConfirmation result = orderService.createOrder(order);
// Assert observable behaviour — don't assert internals
assertNotNull(result.getOrderId());
assertEquals(OrderStatus.CONFIRMED, result.getStatus());
assertTrue(emailClient.wasSent()); // Real EmailClient with a capture mode
}
Step 2: Extract Interfaces for Every Direct Dependency
For each of the 7 direct dependencies in the OrderService:
// BEFORE (untestable — direct instantiation)
EmailClient emailClient = new EmailClient("smtp.company.com", 587);
emailClient.send(order.getUserEmail(), "Order confirmed", buildEmailBody(order));
// AFTER (testable — interface injection)
public interface OrderNotifier {
void notifyOrderConfirmed(Order order);
}
// OrderService receives the interface via constructor
public class OrderService {
private final InventoryRepository inventory;
private final PricingService pricing;
private final OrderNotifier notifier;
private final PaymentGateway payment;
private final EventPublisher events;
private final AuditLogger audit;
private final OrderRepository orders;
@Inject
public OrderService(InventoryRepository inventory, PricingService pricing,
OrderNotifier notifier, PaymentGateway payment,
EventPublisher events, AuditLogger audit, OrderRepository orders) {
this.inventory = inventory;
this.pricing = pricing;
this.notifier = notifier;
this.payment = payment;
this.events = events;
this.audit = audit;
this.orders = orders;
}
}
The production SmtpOrderNotifier implements OrderNotifier and sends the real email. The test FakeOrderNotifier implements OrderNotifier and records calls for assertion. The key refactoring: moving from static final Logger LOG = LoggerFactory.getLogger(OrderService.class) (an untestable static) to private final AuditLogger audit (an injected interface whose calls a test can verify).
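For completeness, a minimal sketch of that test-side fake; its shape is an assumption consistent with the OrderNotifier interface above and the wasNotified() assertion used in the unit test later in this answer.
import java.util.ArrayList;
import java.util.List;

public final class FakeOrderNotifier implements OrderNotifier {
    private final List<Order> notified = new ArrayList<>();

    @Override
    public void notifyOrderConfirmed(Order order) {
        notified.add(order); // no SMTP connection — just record the call
    }

    public boolean wasNotified() {
        return !notified.isEmpty();
    }
}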
Step 3: Extract the Pricing Logic into PricingService
// BEFORE (pricing logic embedded in OrderService — violates SRP)
double total = 0;
for (OrderItem item : order.getItems()) {
double itemPrice = item.getQuantity() * item.getUnitPrice();
if (order.isPromoCodeApplied()) {
itemPrice *= (1 - PROMO_DISCOUNT_RATE); // Hardcoded discount rate
}
total += itemPrice;
}
if (total > 100) total *= (1 - BULK_DISCOUNT_RATE); // Another hardcoded rate
// AFTER (delegated to PricingService)
public interface PricingService {
OrderPrice calculateTotal(Order order);
}
// In OrderService.createOrder():
OrderPrice price = pricing.calculateTotal(order);
// PricingService is independently testable — all discount logic in one place
Step 4: The Refactored OrderCoordinator
After extracting all responsibilities, the OrderService becomes a thin coordinator:
public class OrderCoordinator {
// All 7 dependencies injected via constructor (see above)
public OrderConfirmation createOrder(Order order) {
// Step 1: Validate inventory
if (!inventory.isAvailable(order.getItems())) {
return OrderConfirmation.failed("INSUFFICIENT_INVENTORY");
}
// Step 2: Calculate price
OrderPrice price = pricing.calculateTotal(order);
// Step 3: Process payment
PaymentResult paymentResult = paymentGateway.charge(
order.getPaymentMethod(),
price.getTotalAmount(),
order.getIdempotencyKey()
);
if (!paymentResult.isSuccessful()) {
audit.log("payment_failed", order.getOrderId(), paymentResult.getError());
return OrderConfirmation.failed("PAYMENT_DECLINED");
}
// Step 4: Persist the order
Order savedOrder = orders.save(order.withPaymentConfirmation(payment));
// Step 5: Update inventory
inventory.reserve(savedOrder.getItems());
// Step 6: Notify and publish
notifier.notifyOrderConfirmed(savedOrder);
events.publish(new OrderCreatedEvent(savedOrder));
audit.log("order_created", savedOrder.getOrderId());
return OrderConfirmation.success(savedOrder.getOrderId());
}
}
The refactored class is about 50 lines (down from 300), has 7 injected dependencies (the same count as the 7 hardcoded instantiations, but now substitutable in tests), and has a cyclomatic complexity of 3 (two conditional branches). It can be fully unit-tested with fake implementations of each interface.
Writing the First Meaningful Unit Test
@Test
public void createOrder_withPaymentFailure_returnsFailedConfirmation() {
// Arrange — all fakes, no real services, no network calls, under 5ms
FakeInventoryRepository inventory = new FakeInventoryRepository(/* isAvailable = */ true);
FakePricingService pricing = new FakePricingService(/* total = */ 50.00);
FakePaymentGateway payment = new FakePaymentGateway(/* shouldSucceed = */ false);
FakeOrderNotifier notifier = new FakeOrderNotifier();
FakeEventPublisher events = new FakeEventPublisher();
FakeAuditLogger audit = new FakeAuditLogger();
FakeOrderRepository orders = new FakeOrderRepository();
OrderCoordinator coordinator = new OrderCoordinator(
inventory, pricing, notifier, payment, events, audit, orders
);
Order order = OrderTestBuilder.anOrder().build();
// Act
OrderConfirmation result = coordinator.createOrder(order);
// Assert
assertEquals("PAYMENT_DECLINED", result.getFailureReason());
assertFalse(notifier.wasNotified()); // No email sent on payment failure
assertFalse(events.wasPublished(OrderCreatedEvent.class)); // No event on failure
assertTrue(audit.wasLogged("payment_failed")); // Audit log is recorded
}
This test runs in under 5ms with no network calls, no database, and no email sending — the first real unit test the codebase has ever had.
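The test above calls OrderTestBuilder.anOrder().build(). A minimal sketch of that builder, assuming the Order constructor shape from the characterisation test; the default user ID and item are illustrative.
import java.util.ArrayList;
import java.util.List;

public final class OrderTestBuilder {
    // Defaults are illustrative — any test can override them before build()
    private String userId = "u_test";
    private List<OrderItem> items = new ArrayList<>(List.of(new OrderItem("ITEM_001", 1)));

    public static OrderTestBuilder anOrder() {
        return new OrderTestBuilder();
    }

    public OrderTestBuilder forUser(String userId) {
        this.userId = userId;
        return this;
    }

    public OrderTestBuilder withItems(List<OrderItem> items) {
        this.items = new ArrayList<>(items);
        return this;
    }

    public Order build() {
        return new Order(userId, items); // constructor shape from the characterisation test
    }
}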
Early Warning Metrics:
- Lines of code per class (enforce a maximum of 200 lines — a class above 200 lines is a candidate for extraction; above 300 lines is a God class that must be decomposed)
- Number of injected dependencies per class (enforce a maximum of 7 — above 7 dependencies indicates the class is doing too many things; a class with 12 dependencies is a God class that has not yet been refactored)
- Cyclomatic complexity per method (enforce a maximum of 10 — a method with complexity above 10 is too complex to test exhaustively and too complex to understand; split into smaller methods)
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: Writing characterisation tests before the first line of refactoring — and the specific explanation that these tests document the current behaviour to protect against accidental regressions during the refactoring — is the disciplined approach that distinguishes engineers who have safely refactored large legacy codebases from those who "refactor" by rewriting and hoping the behaviour is preserved. The FakeAuditLogger fix for the static logger — identifying the static logger as the most insidious untestable dependency (not the Kafka producer, which is more obvious) because it prevents verifying that specific audit events were recorded — is the diagnostic sophistication that comes from having debugged why tests cannot verify logging behaviour.
What differentiates it from mid-level thinking: A mid-level engineer would know about interfaces and dependency injection but would not start with characterisation tests (risking regression during refactoring), would not identify the static logger as an untestable dependency, would not calculate the cyclomatic complexity target for the refactored class, and would not write the complete unit test showing all the assertions including "no email sent on payment failure."
What would make it a 10/10: A 10/10 response would include the OrderTestBuilder test data builder implementation (creating valid test orders with sensible defaults, overridable per test), a CI quality gate enforcing the lines-of-code and cyclomatic complexity limits (PMD or Checkstyle rules) alongside ArchUnit rules for the dependency constraints, and a migration guide for updating the dependency injection framework (Spring, Guice) configuration to wire the new interfaces to their production implementations.
Question 10: Observability — Designing Logging, Metrics, and Distributed Tracing for a Microservices System
Difficulty: Senior | Role: Software Engineer | Level: Senior | Company Examples: Netflix, Uber, Datadog, Honeycomb, Grafana Labs
The Question
You are a Senior Software Engineer at a travel booking company. The platform has grown from a monolith to 22 microservices over 3 years. The team is experiencing a painful on-call situation: when a user reports "my booking failed," the on-call engineer must manually grep through logs on 5 different Kubernetes pods across 3 services, correlating log lines by timestamp and guessing at the request flow — an investigation that takes 45 minutes on average. Two recent incidents were caused by: a slow database query in the PricingService that caused timeouts in the BookingService which triggered retries in the SearchService — none of the three services logged the correlation — and a third-party hotel API returning 3% HTTP 500 errors that were being silently swallowed by the HotelInventoryService's catch block (the same println() anti-pattern from before). Design the observability stack that reduces the mean time to investigation (MTTI) from 45 minutes to under 5 minutes.
1. What Is This Question Testing?
- The three pillars of observability — understanding logs (discrete events with context — "payment failed for booking 123"), metrics (numeric measurements over time — "booking success rate = 97.3%"), and traces (a record of a request's path through all services — "booking request touched SearchService → PricingService → BookingService → PaymentService, total duration 3.2 seconds, PricingService took 2.8 seconds"); knowing that logs and metrics alone do not solve the microservices correlation problem — only distributed tracing connects a user's request across all 22 services into a single trace
- Distributed tracing with OpenTelemetry — knowing the OpenTelemetry standard: a vendor-neutral set of APIs, SDKs, and tools for generating, collecting, and exporting traces, metrics, and logs; knowing the trace model: a trace is a collection of spans, and each span represents one unit of work (one service call, one database query, one outbound HTTP request); the trace_id is generated at the entry point (the API gateway) and propagated to all downstream services via the W3C Trace Context header (traceparent); every service creates a span with a parent_span_id linking it to the calling service's span
- Structured logging — knowing that printf-style log messages ("Error processing booking for user " + userId) are unsearchable (you cannot filter by userId unless you grep with a regex); structured logging (JSON log objects with defined fields: {"level": "ERROR", "service": "booking", "trace_id": "abc123", "user_id": "u456", "booking_id": "b789", "error": "PricingService timeout"}) allows filtering by any field in a log aggregation system (Elasticsearch, Loki, CloudWatch Logs Insights); the trace_id field in every log line is the bridge between logs and traces
- The RED method for service metrics — knowing the three metrics that every microservice must expose: Rate (requests per second), Errors (error rate per second), and Duration (latency distribution — p50, p95, p99); knowing the USE method for infrastructure metrics: Utilisation, Saturation, Errors (for CPU, memory, network, disk); the RED metrics are the starting point for any incident investigation — "is the booking success rate lower than normal?" is answered by the error rate metric in under 10 seconds
- The "silent swallowing" anti-pattern — knowing that catch blocks that log to
println()or use a non-monitored logger are one of the most common causes of production incidents that are invisible in monitoring; the fix: every caught exception must either be propagated (re-thrown) or recorded as an error metric increment + structured error log with stack trace + span error event in the trace; "catch and continue" is only acceptable for genuinely non-critical operations where degraded behaviour is explicitly designed and documented
- Alerting on SLOs not symptoms — knowing the difference between alerting on symptoms ("error count > 10 in 5 minutes") vs. alerting on SLOs ("booking success rate < 99.5% over 5 minutes"); symptom-based alerts cause alert fatigue (too many alerts for unimportant issues) and miss gradual degradations (a success rate that slowly drops from 99.8% to 97% over 2 hours without ever triggering the "error count > 10" alert in any 5-minute window)
2. Framework: Observability Stack Design Model (OSDM)
- Assumption Documentation — Establish the current baseline: what logging infrastructure exists? (Kubernetes pod logs aggregated to Elasticsearch via Fluentd, but without structured logging or trace IDs); what metrics exist? (Basic infrastructure metrics in Prometheus — CPU, memory — but no application-level RED metrics); what is the current p99 API response time? (Unknown — no percentile latency metrics)
- Constraint Analysis — 22 microservices means the observability tooling must be instrumented consistently across all services without requiring 22 separate configurations; OpenTelemetry's auto-instrumentation (Java agent, Python auto-instrumentation) handles HTTP and database spans automatically without code changes for most frameworks; manual instrumentation is required for custom business logic spans
- Tradeoff Evaluation — Full trace sampling (capture every trace — complete data, high storage cost: at 100 RPS, a 22-service trace of 500 bytes × 22 spans = 11KB per request = 1.1MB/second = 94GB/day) vs. probabilistic sampling (capture 1% of traces — low storage, misses rare failures), vs. tail-based sampling (capture all traces but only write to storage for traces with errors or above a latency threshold — the ideal approach, but requires a trace collector that buffers spans until the trace is complete)
- Hidden Cost Identification — OpenTelemetry trace propagation through asynchronous message queues (Kafka) requires injecting the trace context into the message headers; if any service in the chain publishes to Kafka without injecting the traceparent header, the trace is broken (the consumer starts a new, disconnected trace); all Kafka producer/consumer code must be updated to use the OpenTelemetry Kafka instrumentation library — a manual-injection sketch follows this list
- Risk Signals / Early Warning Metrics — Trace completeness rate (what percentage of traces that enter the system have spans from all expected services? — a trace completeness below 95% indicates missing instrumentation in one or more services), p99 trace storage volume per service (identify which services are generating the most trace data — candidates for sampling rate adjustment)
- Pivot Triggers — If the trace storage cost becomes unsustainable at full-traffic sampling: implement Honeycomb's dynamic sampling (increase sampling rate for traces with errors, decrease for healthy traces) or switch to OpenTelemetry Collector's tail-based sampling processor — which retains all error traces while sampling healthy traces at 1%
- Long-Term Evolution Plan — Week 1–2: structured logging + trace ID in all logs; Week 3–4: OpenTelemetry auto-instrumentation deployed to all 22 services; Week 5–6: RED metrics dashboards + SLO alerting; Week 7–8: custom business logic spans for the critical booking flow; Month 3: tail-based sampling to control storage costs; Month 6: exemplars (linking from a metric data point to the trace that caused the anomaly)
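The manual-injection sketch referenced above: injecting the W3C trace context into Kafka producer headers for any code path not covered by the instrumentation library. The class name and the String-typed record are assumptions; the propagator API is OpenTelemetry's standard TextMapPropagator.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;

public final class TracedKafkaPublisher {
    // Writes each propagation field (notably `traceparent`) into the record headers.
    private static final TextMapSetter<ProducerRecord<String, String>> SETTER =
            (record, key, value) ->
                    record.headers().add(key, value.getBytes(StandardCharsets.UTF_8));

    public ProducerRecord<String, String> withTraceContext(
            ProducerRecord<String, String> record) {
        // Without this, the consumer starts a new, disconnected trace.
        GlobalOpenTelemetry.getPropagators()
                .getTextMapPropagator()
                .inject(Context.current(), record, SETTER);
        return record;
    }
}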
3. The Answer
The Incident Correlation Problem: Why Trace IDs Solve It
The current 45-minute investigation involves manually correlating log lines across 5 pods by timestamp. The core problem: without a shared trace_id in every log line, there is no way to filter "all log lines related to booking request X" without human pattern matching. The solution: every incoming HTTP request gets a trace_id (UUID) at the API gateway. Every log line, metric data point, and database query during that request's lifetime carries that trace_id. Finding the root cause of "my booking failed" becomes: user provides the booking ID → query the booking database for the trace_id → open the trace in Jaeger/Tempo → see the full 22-service request waterfall with per-span timing → the slow span is immediately visible.
The Structured Logging Standard
Replace all logger.info("Processed booking for " + bookingId) with:
logger.info("booking_processed",
kv("booking_id", bookingId),
kv("user_id", userId),
kv("duration_ms", durationMs),
kv("total_amount", totalAmount),
kv("trace_id", Span.current().getSpanContext().getTraceId()),
kv("span_id", Span.current().getSpanContext().getSpanId())
);
The output in Elasticsearch: {"timestamp": "2024-01-15T10:23:45Z", "level": "INFO", "service": "booking-service", "message": "booking_processed", "booking_id": "b789", "user_id": "u456", "duration_ms": 3200, "trace_id": "abc123def456", "span_id": "789abc"}. This log line is filterable by any field. The trace_id links it directly to the distributed trace in Jaeger.
OpenTelemetry Instrumentation for All 22 Services
Deploy the OpenTelemetry Java Agent to all JVM services via a Kubernetes mutating webhook (automatically adds the agent JAR to any pod with the annotation instrumentation.opentelemetry.io/inject-java: "true"). The Java agent auto-instruments: all inbound HTTP requests (creates root span), all outbound HTTP requests (creates child span), all JDBC queries (creates database span with the SQL query as the span attribute), and Kafka producer/consumer operations (propagates trace context through message headers). No code changes required for the 80% coverage from auto-instrumentation. Add manual instrumentation for the business-critical booking flow steps:
// Manual span for the PricingService's price calculation — the slow step
Span pricingSpan = tracer.spanBuilder("calculate_booking_price")
.setAttribute("hotel_id", hotelId)
.setAttribute("check_in", checkInDate.toString())
.setAttribute("num_nights", numNights)
.startSpan();
try (Scope scope = pricingSpan.makeCurrent()) {
Price price = pricingEngine.calculate(hotelId, checkInDate, numNights);
pricingSpan.setAttribute("calculated_price", price.getAmount());
return price;
} catch (Exception e) {
pricingSpan.setStatus(StatusCode.ERROR, e.getMessage());
pricingSpan.recordException(e); // Exception appears in the trace UI
throw e;
} finally {
pricingSpan.end();
}
Fixing the Silent Swallowing Anti-Pattern
The HotelInventoryService catch block:
// BEFORE (broken — the 3% error rate is invisible)
try {
return hotelApi.checkAvailability(hotelId, dates);
} catch (HotelApiException e) {
System.out.println("Hotel API error: " + e.getMessage()); // ← the crime
return AvailabilityResult.unknown();
}
// AFTER (correct — error is visible in metrics, logs, and traces)
try {
return hotelApi.checkAvailability(hotelId, dates);
} catch (HotelApiException e) {
// 1. Increment the error metric (visible in Prometheus/Grafana immediately)
metrics.counter("hotel_api_errors_total",
Tags.of("hotel_id", hotelId, "error_type", e.getErrorCode())).increment();
// 2. Structured error log (searchable in Elasticsearch with trace correlation)
logger.error("hotel_api_availability_failed",
kv("hotel_id", hotelId),
kv("error_code", e.getErrorCode()),
kv("trace_id", Span.current().getSpanContext().getTraceId()),
e // Exception with stack trace
);
// 3. Record the error on the current span (visible as red span in Jaeger)
Span.current().setStatus(StatusCode.ERROR, e.getMessage());
Span.current().recordException(e);
return AvailabilityResult.unknown();
}
With this fix, the 3% Hotel API error rate immediately appears as a hotel_api_errors_total metric spike in Grafana, an alert fires within 2 minutes, and the on-call engineer can filter the Elasticsearch logs by error_code to see exactly which hotel API endpoints are failing.
The RED Metrics Dashboard
Every service exposes three metrics via the OpenTelemetry Meter API: http_server_requests_total (counter — the Rate), http_server_errors_total (counter filtered to 5xx — the Error rate), http_server_request_duration_seconds (histogram — the Duration). A Grafana dashboard shows all 22 services in a grid with three colour-coded cells each (green = healthy, amber = degraded, red = failing). During an incident, the dashboard shows which service's error rate or latency changed first — identifying the root cause service in under 30 seconds.
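A sketch of the per-request recording behind those three metrics, using the OpenTelemetry Meter API named above; the route/status attribute keys and the 5xx error test are illustrative assumptions.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public final class RedMetrics {
    private final LongCounter requests;
    private final LongCounter errors;
    private final DoubleHistogram duration;

    public RedMetrics() {
        Meter meter = GlobalOpenTelemetry.getMeter("booking-service");
        requests = meter.counterBuilder("http_server_requests_total").build();
        errors = meter.counterBuilder("http_server_errors_total").build();
        duration = meter.histogramBuilder("http_server_request_duration_seconds").build();
    }

    // Called once per completed HTTP request, e.g. from a servlet filter.
    public void record(String route, int status, double seconds) {
        Attributes attrs = Attributes.builder()
                .put("route", route)
                .put("status", Integer.toString(status))
                .build();
        requests.add(1, attrs);                   // Rate
        if (status >= 500) errors.add(1, attrs);  // Errors (5xx only)
        duration.record(seconds, attrs);          // Duration → p50/p95/p99
    }
}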
The SLO Alerting Configuration
Replace symptom-based alerting ("5 errors in 5 minutes") with SLO-based alerting:
# Prometheus alerting rule — booking success rate SLO
- alert: BookingSuccessRateLow
expr: |
sum(rate(http_server_requests_total{service="booking-service", status="2xx"}[5m]))
/
sum(rate(http_server_requests_total{service="booking-service"}[5m]))
< 0.995
for: 3m
labels:
severity: critical
annotations:
summary: "Booking service success rate below 99.5% SLO"
runbook: "https://runbooks.company.com/booking-success-rate"This alert fires when the booking success rate drops below 99.5% for 3 consecutive minutes — regardless of whether the failure manifests as 10 errors in 5 minutes or 1 error per minute for 10 minutes.
The 5-Minute Investigation Workflow (Post-Implementation)
- User reports "booking failed" and provides booking ID "b789" (30 seconds).
- Query Elasticsearch for the trace_id WHERE booking_id=b789 (10 seconds).
- Open trace abc123def456 in Jaeger (5 seconds).
- The waterfall diagram shows: SearchService (50ms), PricingService (2,800ms — RED), BookingService (300ms), PaymentService (45ms). The PricingService span has a red error badge with the SQL query that timed out visible on the span attributes (30 seconds).
- The engineer navigates to the PricingService's slow_queries Grafana dashboard and confirms the query started timing out 2 hours ago when a database index was accidentally dropped during a schema migration (60 seconds).
- Total investigation time: under 3 minutes.
Early Warning Metrics:
- Mean time to investigation (MTTI) — measured from the moment an incident alert fires to the moment the root cause is identified (logged in the incident management tool by the on-call engineer); target under 5 minutes; above 15 minutes triggers a post-incident review of the observability stack
- Trace completeness rate — the percentage of traces that have spans from all expected services in the chain; a completeness below 90% indicates missing instrumentation; run a weekly completeness check using a synthetic transaction that touches all 22 services
- Unsampled error traces — the number of error traces that were dropped by the sampling policy (tail-based sampler must have a 100% retention rate for error traces); any non-zero count indicates the sampler is incorrectly configured
4. Interview Score: 9.5 / 10
Why this demonstrates senior-level maturity: The concrete "5-minute investigation workflow" — with second-level timing for each step showing exactly how the observability stack enables the investigation — transforms the abstract "implement distributed tracing" recommendation into a measurable outcome that the Head of Engineering can understand and the on-call engineer can execute. The three-way fix for the silent swallowing anti-pattern (metrics counter + structured log + span error event) — explaining that each channel serves a different investigation tool (Grafana dashboard for alert, Elasticsearch for log correlation, Jaeger for trace context) — shows the systems-level thinking about observability as a three-pillar discipline, not just "add logging." The SLO-based alerting configuration — with the specific Prometheus rule showing the success rate calculation and the 3-minute evaluation window — is the production alerting maturity that eliminates alert fatigue from symptom-based thresholds.
What differentiates it from mid-level thinking: A mid-level engineer would propose "add more logging" and "use Datadog" without designing the structured logging standard (trace_id field, key-value format), without knowing about OpenTelemetry auto-instrumentation via Kubernetes mutating webhooks, without understanding tail-based sampling, and without designing the SLO-based alerting rules. They would not calculate the storage cost of full-trace sampling or design the 5-minute investigation workflow end-to-end.
What would make it a 10/10: A 10/10 response would include a complete OpenTelemetry Collector configuration YAML showing the tail-based sampling processor (retaining 100% of error traces, 1% of healthy traces), a Grafana dashboard JSON showing the RED metrics grid for all 22 services, and a concrete exemplars configuration (linking metric data points to the traces that caused the anomaly — the observability feature that enables one-click navigation from a metric spike to the causative trace).