Swiggy Software Engineer
This guide features 10 challenging Software Engineer interview questions for Swiggy (SDE-1 to SDE-2 levels), covering backend systems, frontend optimization, low-level design, machine learning, and behavioral scenarios specific to food delivery and quick commerce platforms.
1. Kafka-Backed Order Event Processor (200k Events/Min)
Difficulty Level: Hard
Role: SDE-2 (Senior Software Development Engineer)
Source: InterviewExperiences.in (July 2025)
Topic: Backend Engineering / Distributed Systems
Interview Round: System Design + Bar-Raiser (60-90 min)
Technology Stack: Kafka, Redis, Java, Spring Boot, Kubernetes, Prometheus
Swiggy Product Area: Food Delivery (Order Management)
Question: “Design a system to process 200k events/min with strict ordering per orderId, at-least-once delivery, and idempotent downstream effects for ordered, deduplicated event fan-out to inventory & dispatch. Consider throughput requirements, SLA <1s p95, ordering by key, back-pressure handling, and visibility/observability.”
Answer Framework
STAR Method Structure:
- Situation: High-throughput event processing (200k/min) requiring strict ordering per order, reliability (at-least-once), and idempotency preventing duplicate processing
- Task: Design scalable architecture balancing throughput, ordering guarantees, fault tolerance, and operational observability
- Action: Kafka partitioning by orderId, Redis idempotency keys, two-tier retry with DLQ, RED metrics + distributed tracing, canary deployments
- Result: <1s p95 latency, zero duplicate processing, 99.9% uptime, clear operational runbooks
Key Competencies Evaluated:
- Distributed Systems Mastery: Understanding Kafka partitioning, consumer groups, offset management
- Idempotency Design: Preventing duplicate processing across retries and failures
- Observability: Metrics, tracing, logging for production debugging
- Trade-off Navigation: At-least-once + idempotency vs exactly-once complexity
Architecture Framework
Event Flow:
Order Service → Kafka Topic (partitioned by orderId)
↓
Consumer Group (parallel processing per partition)
↓
Idempotency Check (Redis: "processed:{eventId}")
↓
Fan-out to: Inventory Service, Dispatch Service
↓
Retry Logic (exponential backoff) → DLQ (manual investigation)
Kafka Partitioning:
- Partition key: orderId (ensures all events for same order → same partition → strict ordering)
- Partitions and consumer instances: 10 each (one consumer per partition for parallelism)
- Replication factor: 3 (fault tolerance)
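The key-to-partition mapping above can be sketched in plain Java. This is a simplified illustration, not Kafka's actual partitioner: Kafka's default partitioner hashes the serialized key with murmur2, while the sketch uses String.hashCode as a stand-in. The class and method names are hypothetical.

```java
public class PartitionSketch {
    public static final int PARTITIONS = 10;

    // Stand-in for Kafka's key hashing (the real default partitioner uses murmur2).
    public static int partitionFor(String orderId) {
        return Math.floorMod(orderId.hashCode(), PARTITIONS);
    }

    // Average load per partition, assuming keys are spread evenly.
    public static double eventsPerSecondPerPartition(long eventsPerMinute) {
        return eventsPerMinute / 60.0 / PARTITIONS;
    }

    public static void main(String[] args) {
        // The same orderId always maps to the same partition, which preserves per-order ordering.
        System.out.println(partitionFor("order-42") == partitionFor("order-42")); // true
        System.out.println(eventsPerSecondPerPartition(200_000));
    }
}
```

At 200k events/min the topic sees about 3,333 events/sec in total, or roughly 333 events/sec per partition when keys are evenly distributed. A few very busy orderIds would skew this, because every event for one order must stay on one partition.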
Idempotency Implementation:
- Redis key: "processed:{eventId}" with TTL 24h
- Before processing: Check if key exists
- If exists: Skip (duplicate); If not: Process + Set key
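The check, process, then set sequence above can be sketched with an in-memory map standing in for Redis. This is a minimal illustration under stated assumptions: the map, the explicit clock parameter, and the class name IdempotencySketch are hypothetical, and a real deployment would rely on Redis key expiry instead of comparing timestamps by hand.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotencySketch {
    // key -> expiry time; stands in for Redis keys with a TTL
    private final Map<String, Instant> processed = new ConcurrentHashMap<>();
    private final Duration ttl;
    public int sideEffects = 0; // counts how many times the handler actually ran

    public IdempotencySketch(Duration ttl) {
        this.ttl = ttl;
    }

    // Returns true if the event was processed now, false if it was skipped as a duplicate.
    public boolean handle(String eventId, Instant now) {
        String key = "processed:" + eventId;
        Instant expiry = processed.get(key);
        if (expiry != null && now.isBefore(expiry)) {
            return false; // seen within the TTL window: duplicate
        }
        sideEffects++;                     // placeholder for the real downstream calls
        processed.put(key, now.plus(ttl)); // mark only after processing succeeded
        return true;
    }

    public static void main(String[] args) {
        IdempotencySketch sketch = new IdempotencySketch(Duration.ofHours(24));
        Instant t0 = Instant.parse("2025-01-01T00:00:00Z");
        System.out.println(sketch.handle("evt-1", t0));                           // true
        System.out.println(sketch.handle("evt-1", t0.plusSeconds(5)));            // false
        System.out.println(sketch.handle("evt-1", t0.plus(Duration.ofHours(25)))); // true
    }
}
```

The last call shows the cost of the TTL: a duplicate arriving after the 24h window is processed again, which is why the downstream operations still need to be idempotent.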
Observability:
- RED Metrics: Request rate, Error rate, Duration (p50, p95, p99)
- Distributed tracing: Correlation IDs across services
- Structured logging: JSON format with orderId, eventId, timestamp
Answer
Partitioning the Kafka topic by orderId ensures strict ordering: all events for the same order land in the same partition and are processed sequentially by a single consumer instance, while different orders are processed in parallel (10 partitions = 10 concurrent consumers, each handling about 20k events/min). Replication factor 3 provides fault tolerance if a broker fails.
Consumer group coordination relies on Kafka’s built-in offset management. Each consumer tracks the last committed offset per partition, so when a consumer crashes Kafka reassigns its partitions to healthy consumers (within about 30 seconds, depending on the configured session timeout). Offsets are committed manually after successful processing, which prevents message loss during failures. The design accepts at-least-once delivery (events may be reprocessed after a consumer restart) instead of exactly-once (estimated at 200-300ms of added latency from transactional coordination), because the idempotency layer handles duplicates more cheaply than a coordination protocol.
The Redis idempotency layer checks the “processed:{eventId}” key before processing (sub-millisecond lookup). If the key exists, the event is a duplicate from a retry and is skipped; otherwise the event is processed and the key is set with a 24h TTL, which balances memory usage against the duplicate-detection window. A consumer may crash after processing but before setting the key, so downstream operations are themselves idempotent: the inventory service uses optimistic locking (“UPDATE … WHERE version = X”) and the dispatch service checks a “notification_sent” flag.
A two-tier retry mechanism handles failures. Transient failures (network timeout, temporary service unavailability) are retried immediately, up to 3 attempts with exponential backoff (100ms, 200ms, 400ms). Persistent failures (invalid data, downstream service permanently down) are sent to a Dead Letter Queue for manual investigation so they do not block the main pipeline. DLQ depth is monitored, with an alert when it exceeds 100 messages, which indicates a systemic issue requiring immediate attention.
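The two-tier retry flow can be sketched without any framework. This is an illustrative sketch under stated assumptions: TransientFailure, RetrySketch, and the in-memory dead-letter list are hypothetical names, and delays are recorded instead of slept so the example runs instantly. It uses three attempts in total, which gives two waits (100ms, 200ms); allowing one more attempt would add the 400ms step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class RetrySketch {
    // Hypothetical marker for failures that are worth retrying.
    public static class TransientFailure extends RuntimeException {
        public TransientFailure(String message) { super(message); }
    }

    public static final int MAX_ATTEMPTS = 3;
    public static final long BASE_DELAY_MS = 100;

    public final List<String> deadLetters = new ArrayList<>();
    public final List<Long> delaysMs = new ArrayList<>(); // recorded instead of sleeping

    // Returns true if the handler succeeded, false if the event went to the DLQ.
    public boolean process(String event, Consumer<String> handler) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                handler.accept(event);
                return true;
            } catch (TransientFailure e) {
                if (attempt == MAX_ATTEMPTS) break;           // retries exhausted
                delaysMs.add(BASE_DELAY_MS << (attempt - 1)); // 100, 200, ...
            } catch (RuntimeException e) {
                break; // persistent failure: retrying will not help
            }
        }
        deadLetters.add(event);
        return false;
    }

    public static void main(String[] args) {
        RetrySketch retry = new RetrySketch();
        int[] calls = {0};
        // Fails twice with a transient error, then succeeds on the third attempt.
        boolean ok = retry.process("evt-1", e -> {
            if (++calls[0] < 3) throw new TransientFailure("timeout");
        });
        System.out.println(ok + " " + retry.delaysMs); // true [100, 200]
        // Invalid data goes straight to the DLQ without retries.
        boolean bad = retry.process("evt-2", e -> { throw new IllegalArgumentException("bad payload"); });
        System.out.println(bad + " " + retry.deadLetters); // false [evt-2]
    }
}
```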
Observability and deployment: RED metrics (Request rate, Error rate, Duration against the <1s p95 SLA) are scraped by Prometheus every 15s, shown on Grafana dashboards, and wired to PagerDuty alerts when p95 >1s or error rate >1%. Distributed tracing propagates correlation IDs (orderId + eventId) across Kafka → Consumer → Inventory → Dispatch, so the end-to-end flow can be visualized in Jaeger to identify bottlenecks (e.g., the inventory service taking 800ms due to a slow DB query). Canary releases deploy to 10% of traffic and monitor metrics for 30 mins before scaling to 100%, with feature flags enabling quick rollback without redeployment.
Runbooks document common failures (Kafka broker down: consumers auto-rebalance; Redis unavailable: fail open and process without the idempotency check, logging possible duplicates for reconciliation; downstream timeout: retry, then DLQ) so that on-call engineers can resolve incidents within the 15-minute MTTR target.
Code
Kafka Consumer Implementation (Java + Spring Boot):
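import java.time.Duration;
import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

// OrderEvent, InventoryService, DispatchService and TransientException are application types
@Service
public class OrderEventProcessor {
    private static final Logger log = LoggerFactory.getLogger(OrderEventProcessor.class);
    @Autowired
    private RedisTemplate<String, String> redisTemplate;
    @Autowired
    private KafkaTemplate<String, OrderEvent> kafkaTemplate;
    @Autowired
    private MeterRegistry meterRegistry;
    @Autowired
    private OrderEventFanOut fanOut;

    @KafkaListener(topics = "order-events", groupId = "order-processor-group")
    public void processOrderEvent(ConsumerRecord<String, OrderEvent> record, Acknowledgment ack) {
        String eventId = record.value().getEventId();
        String orderId = record.key(); // Partition key
        // Idempotency check using Redis: skip events already marked as processed
        String idempotencyKey = "processed:" + eventId;
        if (Boolean.TRUE.equals(redisTemplate.hasKey(idempotencyKey))) {
            log.info("Duplicate event {} for order {}, skipping", eventId, orderId);
            ack.acknowledge();
            return; // Already processed
        }
        try {
            // Process event with retries
            fanOut.process(record.value());
            // Mark as processed only after success. A crash before this line causes a
            // redelivery, which the idempotent downstream operations absorb.
            redisTemplate.opsForValue().set(idempotencyKey, "1", Duration.ofHours(24));
        } catch (Exception e) {
            log.error("Failed to process event {} after retries", eventId, e);
            sendToDLQ(record.value());
        }
        // Manual offset commit once the event is processed or parked in the DLQ
        // (Kafka listener configured with MANUAL_IMMEDIATE ack mode)
        ack.acknowledge();
    }

    private void sendToDLQ(OrderEvent event) {
        kafkaTemplate.send("order-events-dlq", event.getOrderId(), event);
        meterRegistry.counter("order_events_dlq_total").increment();
    }
}

// Separate bean: @Retryable is applied through a Spring proxy, so it has no effect on
// private methods or on calls made from within the same class
@Component
class OrderEventFanOut {
    @Autowired
    private InventoryService inventoryService;
    @Autowired
    private DispatchService dispatchService;

    @Retryable(
        value = {TransientException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100, multiplier = 2)
    )
    public void process(OrderEvent event) {
        // Fan-out to downstream services with idempotent operations
        inventoryService.updateStock(event.getOrderId(), event.getItems());
        dispatchService.assignRider(event.getOrderId(), event.getDeliveryAddress());
    }
}
Kafka Configuration: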
spring:
  kafka:
    consumer:
      group-id: order-processor-group
      enable-auto-commit: false # Manual offset management
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.RangeAssignor
        max.poll.records: 100
    listener:
      ack-mode: MANUAL_IMMEDIATE
# Kafka Topic Configuration (illustrative summary, not Spring Boot properties)
order-events:
  partitions: 10 # Parallel processing
  replication-factor: 3 # Fault tolerance
  partition-key: orderId # Ensures ordering per order
Observability (Prometheus Metrics):
@Component
public class EventProcessorMetrics {
    private final Counter eventsProcessed = Counter.builder("order_events_processed_total")
        .description("Total events processed")
        .tag("status", "success")
        .register(Metrics.globalRegistry);
    private final Counter eventsFailed = Counter.builder("order_events_failed_total")
        .description("Total events failed")
        .tag("error_type", "timeout")
        .register(Metrics.globalRegistry);
    private final Timer processingLatency = Timer.builder("order_event_processing_duration")
        .description("Event processing latency")
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(Metrics.globalRegistry);
    // Last observed DLQ depth, refreshed by a periodic job (how the depth is measured,
    // for example from offsets on the DLQ topic, is left to the application)
    private final AtomicLong dlqDepth = new AtomicLong();

    public EventProcessorMetrics() {
        Gauge.builder("dlq_depth", dlqDepth, AtomicLong::doubleValue)
            .description("Dead Letter Queue depth")
            .register(Metrics.globalRegistry);
    }

    public void setDlqDepth(long depth) {
        dlqDepth.set(depth);
    }
}
2. Food-Order Matching System with Consistency-Latency Trade-offs
Difficulty Level: Hard
Role: SDE-1 to SDE-2
Source: InterviewExperiences.in, LinkedIn (2024-2025)
Topic: Backend Engineering / System Architecture
Interview Round: System Design (45-60 min)
Technology Stack: Microservices, Redis, MySQL, ElasticSearch, Kafka, Load Balancers
Swiggy Product Area: Food Delivery (Order-Restaurant Matching)
Question: “Design a food-order matching system that outlines all core services, databases, caching layers, and clearly explains trade-offs between consistency and latency. The system must handle millions of concurrent orders and thousands of restaurants.”
Answer Framework
STAR Method Structure:
- Situation: Millions concurrent orders, thousands restaurants requiring fast matching (<500ms) with acceptable consistency trade-offs
- Task: Design microservices architecture balancing read/write consistency, caching strategy, search performance, and event-driven communication
- Action: Service decomposition (Order, Restaurant, Delivery, Notification), Redis caching with TTL, ElasticSearch for search, Kafka for async events, Saga pattern for distributed transactions
- Result: <500ms order placement, 10k+ concurrent requests/sec, eventual consistency acceptable for non-critical paths
Key Competencies Evaluated:
- Microservices Design: Service boundaries, data ownership, inter-service communication
- CAP Theorem Application: Choosing availability over consistency where appropriate
- Caching Strategy: Cache invalidation, TTL selection, cache-aside pattern
- Event-Driven Architecture: Async communication preventing tight coupling
System Architecture
API Gateway (Load Balancer)
├── Order Service
│ ├── MySQL (ACID transactions: order creation, payment)
│ ├── Redis (order state cache: 5-min TTL)
│ └── Kafka Producer (order.created, order.confirmed events)
├── Restaurant Service
│ ├── MySQL (restaurant data, menu, pricing)
│ ├── ElasticSearch (search with filters: cuisine, rating, distance)
│ └── Redis (availability cache: real-time open/closed status)
├── Delivery Service
│ ├── Redis (rider location: geohash for proximity search)
│ ├── Kafka Consumer (order.confirmed → assign rider)
│ └── Route optimization (distance + traffic API)
└── Notification Service
    ├── WebSocket (real-time updates to customers)
    └── FCM (push notifications)
Answer
The Order Service owns the order lifecycle (creation, payment, status tracking). It stores transactional data in MySQL with ACID guarantees so that payment is atomic: an order is confirmed only after payment succeeds, which prevents revenue loss. It caches order state in Redis (5-min TTL) for fast status lookups (10k reads/sec) without hitting MySQL, and publishes “order.created” and “order.confirmed” events to Kafka to decouple itself from downstream services.
The Restaurant Service manages restaurant metadata (menu, pricing, operating hours) in MySQL and indexes it in ElasticSearch for sub-second search with filters (cuisine=Italian, rating>4, distance<3km). It caches availability status in Redis, refreshed every 30 seconds. This accepts stale data: a customer may see an “open” restaurant that has just closed, which is caught by validation during order placement and surfaced as a “currently unavailable” error. The trade-off favors eventual consistency for search (ElasticSearch syncs from MySQL every 5 mins) over strong consistency (distributed transactions adding 500ms+ latency), since users tolerate slightly stale restaurant data but not slow search.
Strong consistency is required on three paths. Payment transactions rely on MySQL ACID guarantees, so an order is confirmed only after payment succeeds, preventing double-charging or unpaid orders. Inventory deduction uses a conditional atomic update (“UPDATE menu_items SET quantity = quantity - 1 WHERE id = X AND quantity >= 1”), which prevents overselling popular items. Rider assignment must never give a single rider multiple concurrent orders, so it uses 2-phase commit or the Saga pattern with compensating transactions, accepting 200-300ms of latency overhead for correctness.
Eventual consistency is acceptable for restaurant search (ElasticSearch lag of 5 mins, caught during order validation), availability status (Redis 30-sec refresh, validated at order time), and analytics dashboards (hourly batch updates are sufficient). Order placement stays under 500ms by avoiding distributed locks, using the cache-aside pattern (check Redis → on a miss, query MySQL → populate the cache), and processing events asynchronously via Kafka so that sending notifications does not block order confirmation.
Horizontal scaling partitions the Order Service by userId (10 instances handling 1M users each), the Restaurant Service by city (separate instances for Mumbai, Delhi, Bangalore), and the Delivery Service by geohash (grid-based rider assignment that avoids global locking). Three MySQL read replicas handle read traffic (90% reads, 10% writes), a 6-node Redis Cluster (3 masters, 3 replicas) targets 99.9% availability, and a 5-node ElasticSearch cluster with 2 replicas keeps search available during node failures.
Failure scenarios: MySQL master down (promote a replica to master within 30 secs), Redis cache miss or outage (fall back to MySQL behind a circuit breaker), Kafka broker down (producer retries with exponential backoff), ElasticSearch node failure (queries route to healthy nodes). SLOs (order placement <500ms p95, search <200ms p95, 99.9% uptime) are monitored with PagerDuty alerts when thresholds are breached, enabling 15-min incident response.
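The cache-aside read path mentioned above (check the cache, fall back to the database on a miss, then populate the cache) can be sketched with in-memory stand-ins. This is a minimal sketch under stated assumptions: the HashMap plays the role of Redis, the injected function plays the role of a MySQL lookup, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class CacheAsideSketch {
    private static final class Entry {
        final String value;
        final long expiresAtMs;
        Entry(String value, long expiresAtMs) {
            this.value = value;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();       // stands in for Redis
    private final Function<String, Optional<String>> database;      // stands in for MySQL
    private final long ttlMs;
    public int dbReads = 0;

    public CacheAsideSketch(Function<String, Optional<String>> database, long ttlMs) {
        this.database = database;
        this.ttlMs = ttlMs;
    }

    public Optional<String> get(String key, long nowMs) {
        Entry entry = cache.get(key);
        if (entry != null && nowMs < entry.expiresAtMs) {
            return Optional.of(entry.value); // cache hit
        }
        dbReads++;                                       // cache miss or expired entry
        Optional<String> fromDb = database.apply(key);
        fromDb.ifPresent(v -> cache.put(key, new Entry(v, nowMs + ttlMs)));
        return fromDb;
    }

    public static void main(String[] args) {
        Map<String, String> orders = Map.of("order:1", "CONFIRMED");
        CacheAsideSketch store =
            new CacheAsideSketch(k -> Optional.ofNullable(orders.get(k)), 5 * 60_000L);
        store.get("order:1", 0L);        // miss: reads the database and fills the cache
        store.get("order:1", 60_000L);   // hit: served from the cache
        store.get("order:1", 400_000L);  // TTL elapsed: reads the database again
        System.out.println(store.dbReads); // 2
    }
}
```

The staleness window equals the TTL, which is why this pattern suits order status and availability reads but not the payment or stock-deduction paths.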
Code
Microservices Implementation (Spring Boot):
@RestController
@RequestMapping("/api/orders")
public class OrderController {
    @Autowired
    private OrderService orderService;
    @Autowired
    private RedisTemplate<String, Order> redisTemplate;
    @Autowired
    private KafkaTemplate<String, OrderEvent> kafkaTemplate;

    @PostMapping
    public ResponseEntity<OrderResponse> createOrder(@RequestBody OrderRequest request) {
        // Creates the order and takes payment in one MySQL transaction
        Order order = orderService.createOrderWithPayment(request);
        // Cache order state for fast status lookups (5-min TTL)
        redisTemplate.opsForValue().set("order:" + order.getId(), order, Duration.ofMinutes(5));
        // Publish asynchronously so downstream services do not block the response
        kafkaTemplate.send("order.created", order.getId(), new OrderEvent(order));
        return ResponseEntity.ok(new OrderResponse(order));
    }
}

@Service
public class RestaurantSearchService {
    @Autowired
    private ElasticsearchOperations elasticsearchTemplate;

    public List<Restaurant> searchRestaurants(SearchCriteria criteria) {
        // Full-text match on cuisine; rating and distance are non-scoring filters
        NativeSearchQuery query = new NativeSearchQueryBuilder()
            .withQuery(QueryBuilders.boolQuery()
                .must(QueryBuilders.matchQuery("cuisine", criteria.getCuisine()))
                .filter(QueryBuilders.rangeQuery("rating").gte(criteria.getMinRating()))
                .filter(QueryBuilders.geoDistanceQuery("location")
                    .point(criteria.getLat(), criteria.getLng())
                    .distance(criteria.getRadius(), DistanceUnit.KILOMETERS))
            )
            .build();
        return elasticsearchTemplate.search(query, Restaurant.class)
            .stream().map(SearchHit::getContent).collect(Collectors.toList());
    }
}
3. Splitwise Application Low-Level Design (LLD)
Difficulty Level: Hard
Role: SDE-1
Source: InterviewExperiences.in (multiple candidates), Reddit (2024-2025)
Topic: Backend Engineering / Software Design
Interview Round: Low-Level Design - Machine Coding (45-60 min)
Technology Stack: Java, OOP (SOLID principles), Database Design
Swiggy Product Area: General / Learning Objective
Question: “Design and implement a Splitwise application (45-60 minutes) with complete functional requirements, UML class diagram, entity relationships, database schema, API contracts, and partial working code implementation. Handle multiple split types (equal, exact, percentage), user groups, and settlement logic.”
Answer Framework
STAR Method Structure:
- Situation: Build expense-splitting app supporting equal/exact/percentage splits, user groups, settlement calculation within 45-60 mins
- Task: Design OOP model with SOLID principles, database schema, API contracts, implement core logic with edge case handling
- Action: Entity design (User, Expense, Split, Settlement), service layer (ExpenseService, BalanceService), split strategy pattern, settlement algorithm
- Result: Working implementation handling 3 split types, circular debt simplification, extensible for new split types
Key Competencies Evaluated:
- OOP Design: SOLID principles, design patterns (Strategy for split types)
- Time Management: Complete design + code + explanation in 45-60 mins
- Edge Case Handling: Floating-point precision, circular debts, validation
- Scalability Thinking: How to optimize for millions of expenses
Class Design
Core Entities:
├── User
│ ├── id: String
│ ├── name: String
│ ├── email: String
│ └── getBalance(): Map<User, Double>
├── Expense
│ ├── id: String
│ ├── description: String
│ ├── amount: Double
│ ├── payer: User
│ ├── splits: List<Split>
│ ├── date: LocalDateTime
│ └── validate(): boolean
├── Split (Abstract)
│ ├── user: User
│ ├── expense: Expense
│ ├── amount: Double
│ └── calculateShare(): Double
│ ├── EqualSplit extends Split
│ ├── ExactSplit extends Split
│ └── PercentSplit extends Split
└── Settlement
    ├── fromUser: User
    ├── toUser: User
    └── amount: Double
Services:
├── ExpenseService
│ ├── createExpense(Expense): void
│ ├── deleteExpense(String expenseId): void
│ └── getExpense(String expenseId): Expense
├── BalanceService
│ ├── getBalance(User): Map<User, Double>
│ ├── simplifyDebts(): List<Settlement>
│ └── getTransactionHistory(User): List<Expense>
└── SplitStrategy (Strategy Pattern)
    ├── calculateSplit(amount, users): List<Split>
    └── validate(splits, totalAmount): boolean
Answer
The User entity stores id, name, and email. Its getBalance() method computes the net balance across all expenses (amount others owe the user minus amount the user owes others) instead of reading a stored value: persisting a derived balance would duplicate the source of truth and require complex update logic on every expense change.
The Expense entity contains a description, total amount, payer (who paid upfront), list of splits (how the amount is divided), and date. Its validate() method ensures the splits sum to the total, which prevents inconsistent data: a ₹100 expense split as ₹40 + ₹40 + ₹30 = ₹110 is rejected.
The Split hierarchy is polymorphic. The abstract Split class declares calculateShare(); EqualSplit divides the amount equally (4 people = ₹25 each for ₹100), ExactSplit uses specified amounts (₹40, ₹30, ₹20, ₹10), and PercentSplit derives amounts from percentages (40%, 30%, 20%, 10%). Combined with the SplitStrategy interface (Strategy pattern), this allows new split types (e.g., ShareSplit for unequal ratios) to be added without modifying existing code (Open/Closed Principle).
The database schema uses three tables: Users (id PK, name, email unique), Expenses (id PK, description, amount, payer_id FK, created_at), and Splits (id PK, expense_id FK, user_id FK, split_type ENUM, amount). Balance is not stored in the Users table; it is derived as total paid minus total share, for example: “SELECT u.id, COALESCE((SELECT SUM(e.amount) FROM expenses e WHERE e.payer_id = u.id), 0) - COALESCE((SELECT SUM(s.amount) FROM splits s WHERE s.user_id = u.id), 0) AS net_balance FROM users u”.
API contracts include POST /expenses (create expense with splits), GET /expenses/{id} (retrieve expense details), DELETE /expenses/{id} (remove expense, recalculating balances), GET /users/{id}/balance (net balance per friend), and GET /users/{id}/settlements (optimal payment plan). Requests are validated: split amounts must sum to the total (400 Bad Request on mismatch), the user must exist (404 Not Found), and percentage splits must sum to 100% (validation error).
Edge cases handled: floating-point precision (use BigDecimal for money calculations to avoid ₹0.01 rounding errors), negative balances (A owes B ₹50 is represented as -50 for A and +50 for B), and circular debts (A→B ₹30, B→C ₹20, C→A ₹10 simplifies to A→B ₹20, B→C ₹10).
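The rounding edge case is easiest to see with an equal split that does not divide evenly. The sketch below is one possible approach, not necessarily what the original answer had in mind: it keeps money in integer paise and hands the leftover paise to the first few participants so the shares always sum to the total. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EqualSplitSketch {
    // Splits an amount given in paise so that the shares always sum to the total.
    public static List<Long> splitEqually(long totalPaise, int people) {
        if (people <= 0 || totalPaise < 0) {
            throw new IllegalArgumentException("people must be positive and total non-negative");
        }
        long base = totalPaise / people;
        long remainder = totalPaise % people;
        List<Long> shares = new ArrayList<>();
        for (int i = 0; i < people; i++) {
            // The first `remainder` participants pay one extra paisa each
            shares.add(base + (i < remainder ? 1 : 0));
        }
        return shares;
    }

    public static void main(String[] args) {
        // ₹100.00 split three ways
        System.out.println(splitEqually(10_000, 3)); // [3334, 3333, 3333]
    }
}
```

Rounding each share independently to ₹33.33 would leave ₹0.01 unaccounted for, which a strict "splits must sum to total" validation would then reject.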
The debt simplification algorithm calculates the net balance per user (A paid ₹100 and owes ₹40, so net +₹60; B paid ₹0 and owes ₹60, so net -₹60), separates creditors (positive balance) from debtors (negative balance), and greedily matches the largest debtor with the largest creditor until everyone is settled. This reduces the number of transactions (a 3-person circular debt drops from 3 payments to 2) and needs at most n-1 payments for n users, although the greedy match is a heuristic and does not always find the true minimum. The implementation uses two priority queues (max-heap for creditors, min-heap for debtors) with O(n log n) complexity, acceptable for <1000 users per group.
Scalability optimizations for millions of expenses: partition by userId (shard the database), cache frequent queries (user balance in Redis with a 5-min TTL), index on (user_id, created_at) for transaction history pagination, and archive old expenses (>1 year) to cold storage to reduce the active table size. This accepts eventual consistency for balance display (the Redis cache may show a stale balance for up to 5 mins) over strong consistency (querying MySQL every time adds 100ms+ latency).
Time management strategy: spend 10 mins on requirements clarification and entity design, 20 mins coding core logic (expense creation, split calculation), 10 mins on the settlement algorithm, and 5 mins explaining trade-offs. If time is constrained, prioritize an MVP (working equal split) over completeness (all 3 split types).
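The greedy settlement can be traced on the circular-debt example (A→B ₹30, B→C ₹20, C→A ₹10). The sketch below is a self-contained illustration with hypothetical names; it uses integer paise and breaks ties by name so the output is deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class DebtSimplifySketch {
    private static final class Party {
        final String name;
        final long amount; // always positive: owed to a creditor, or owed by a debtor
        Party(String name, long amount) { this.name = name; this.amount = amount; }
    }

    // Each debt row is {from, to, amountPaise}. Positive net = creditor, negative = debtor.
    public static Map<String, Long> netBalances(String[][] debts) {
        Map<String, Long> net = new TreeMap<>();
        for (String[] d : debts) {
            long amount = Long.parseLong(d[2]);
            net.merge(d[0], -amount, Long::sum);
            net.merge(d[1], amount, Long::sum);
        }
        return net;
    }

    public static List<String> settle(Map<String, Long> net) {
        Comparator<Party> largestFirst = Comparator
            .comparingLong((Party p) -> p.amount).reversed()
            .thenComparing((Party p) -> p.name);
        PriorityQueue<Party> creditors = new PriorityQueue<>(largestFirst);
        PriorityQueue<Party> debtors = new PriorityQueue<>(largestFirst);
        net.forEach((name, balance) -> {
            if (balance > 0) creditors.add(new Party(name, balance));
            else if (balance < 0) debtors.add(new Party(name, -balance));
        });
        List<String> payments = new ArrayList<>();
        while (!creditors.isEmpty() && !debtors.isEmpty()) {
            Party creditor = creditors.poll();
            Party debtor = debtors.poll();
            long pay = Math.min(creditor.amount, debtor.amount);
            payments.add(debtor.name + " pays " + creditor.name + " " + pay);
            // Re-queue whichever side still has an outstanding amount
            if (creditor.amount > pay) creditors.add(new Party(creditor.name, creditor.amount - pay));
            if (debtor.amount > pay) debtors.add(new Party(debtor.name, debtor.amount - pay));
        }
        return payments;
    }

    public static void main(String[] args) {
        String[][] debts = {{"A", "B", "3000"}, {"B", "C", "2000"}, {"C", "A", "1000"}};
        Map<String, Long> net = netBalances(debts);
        System.out.println(net);         // {A=-2000, B=1000, C=1000}
        System.out.println(settle(net)); // [A pays B 1000, A pays C 1000]
    }
}
```

The plan produced here (A pays B ₹10, A pays C ₹10) differs from A→B ₹20, B→C ₹10 but leaves the same net balances with the same two payments, which shows that simplified plans are not unique.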
Code
Splitwise Implementation (Java):
// Entity Classes
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.*;

public abstract class Split {
    protected User user;
    protected Expense expense; // set when the split is attached to an expense
    protected double amount;

    public User getUser() {
        return user;
    }

    public abstract double calculateShare();
}

public class EqualSplit extends Split {
    @Override
    public double calculateShare() {
        int numUsers = expense.getSplits().size();
        return BigDecimal.valueOf(expense.getAmount())
            .divide(BigDecimal.valueOf(numUsers), 2, RoundingMode.HALF_UP)
            .doubleValue();
    }
}

public class ExactSplit extends Split {
    public ExactSplit(User user, double amount) {
        this.user = user;
        this.amount = amount;
    }

    @Override
    public double calculateShare() {
        return amount;
    }
}

public class PercentSplit extends Split {
    private double percentage;

    @Override
    public double calculateShare() {
        // Multiply first, then divide by 100 with explicit rounding to 2 decimals
        return BigDecimal.valueOf(expense.getAmount())
            .multiply(BigDecimal.valueOf(percentage))
            .divide(BigDecimal.valueOf(100), 2, RoundingMode.HALF_UP)
            .doubleValue();
    }
}

// Service Layer
@Service
public class ExpenseService {
    @Autowired
    private ExpenseRepository expenseRepository;

    public Expense createExpense(ExpenseRequest request) {
        // Build the expense first so every split can reach its parent
        // (assumes the Expense constructor links each split back to the expense)
        Expense expense = new Expense(request);
        // Validate splits sum to total
        double totalSplits = expense.getSplits().stream()
            .mapToDouble(Split::calculateShare)
            .sum();
        // Each share is rounded to 2 decimals, so allow half a paisa of drift per split
        double tolerance = 0.005 * expense.getSplits().size() + 1e-9;
        if (Math.abs(totalSplits - expense.getAmount()) > tolerance) {
            throw new ValidationException("Splits don't sum to total amount");
        }
        return expenseRepository.save(expense);
    }
}

// Settlement Algorithm (Debt Simplification)
@Service
public class BalanceService {
    @Autowired
    private ExpenseRepository expenseRepository;

    public List<Settlement> simplifyDebts(List<User> users) {
        Map<User, Double> balances = calculateNetBalances(users);
        // Max-heap: largest creditor first
        PriorityQueue<Map.Entry<User, Double>> creditors = new PriorityQueue<>(
            (a, b) -> Double.compare(b.getValue(), a.getValue())
        );
        // Min-heap: most negative balance (largest debtor) first
        PriorityQueue<Map.Entry<User, Double>> debtors = new PriorityQueue<>(
            Comparator.comparingDouble(Map.Entry::getValue)
        );
        balances.entrySet().forEach(entry -> {
            if (entry.getValue() > 0) creditors.offer(entry);
            else if (entry.getValue() < 0) debtors.offer(entry);
        });
        List<Settlement> settlements = new ArrayList<>();
        while (!creditors.isEmpty() && !debtors.isEmpty()) {
            Map.Entry<User, Double> creditor = creditors.poll();
            Map.Entry<User, Double> debtor = debtors.poll();
            double amount = Math.min(creditor.getValue(), -debtor.getValue());
            settlements.add(new Settlement(debtor.getKey(), creditor.getKey(), amount));
            double newCreditorBalance = creditor.getValue() - amount;
            double newDebtorBalance = debtor.getValue() + amount;
            // Re-queue whichever side is still unsettled (ignoring sub-paisa residue)
            if (newCreditorBalance > 0.01) {
                creditors.offer(Map.entry(creditor.getKey(), newCreditorBalance));
            }
            if (newDebtorBalance < -0.01) {
                debtors.offer(Map.entry(debtor.getKey(), newDebtorBalance));
            }
        }
        return settlements;
    }

    private Map<User, Double> calculateNetBalances(List<User> users) {
        Map<User, Double> balances = new HashMap<>();
        for (User user : users) {
            double totalPaid = expenseRepository.findByPayer(user).stream()
                .mapToDouble(Expense::getAmount)
                .sum();
            // Use calculateShare() so equal and percentage splits are counted correctly
            double totalOwed = expenseRepository.findByUser(user).stream()
                .flatMap(e -> e.getSplits().stream())
                .filter(s -> s.getUser().equals(user))
                .mapToDouble(Split::calculateShare)
                .sum();
            balances.put(user, totalPaid - totalOwed);
        }
        return balances;
    }
}
4. Warehouse Management System for Swiggy Instamart
Difficulty Level: Hard
Role: SDE-2
Source: InterviewExperiences.in (2024)
Topic: Backend Engineering / High-Level Design
Interview Round: High-Level Design (60 min)
Technology Stack: Microservices, MySQL, Kafka, Redis, ElasticSearch
Swiggy Product Area: Instamart (Quick Commerce)
Question: “Design a warehouse management system (HLD + API design) for Swiggy Instamart handling inventory, stock movements, supplier batches, product SKUs, multi-warehouse coordination, and ensuring data consistency across all integrated systems (Order Service, Fulfillment Service, Analytics).”
Answer
The Warehouse Service manages inventory across 100+ dark stores through these APIs: POST /inventory/stock (add items with batch tracking), GET /inventory/{productId}/{warehouseId} (current stock levels), PATCH /inventory/reserve (optimistic locking to prevent overselling), and DELETE /inventory/remove (expired/damaged goods).
The database schema includes Products (sku, name, category, supplier_id), Warehouses (id, location, capacity, region), Stock (product_id, warehouse_id, quantity, reorder_point, last_updated), InventoryTransactions (type ENUM: INBOUND/OUTBOUND/RESERVED/EXPIRED, quantity, timestamp for audit trail), and SupplierBatches (batch_id, expiry_date, cost_price for FIFO/FEFO inventory rotation).
Multi-warehouse coordination uses Kafka partitioned by product_id to serialize updates, which prevents the race condition of two warehouses updating the same product's stock simultaneously. The service publishes events: stock.updated for search reindexing, inventory.low to trigger auto-reorder when quantity < reorder_point, and order.fulfilled to notify the fulfillment team. It accepts eventual consistency for analytics (hourly batch sync) over strong consistency (distributed transactions adding 500ms+ latency and blocking order placement).
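FEFO (first expired, first out) rotation over SupplierBatches can be sketched as a sort by expiry date followed by a greedy pick. This is an illustrative sketch, assuming each batch carries a remaining quantity (a field the schema above does not list); the Batch class and allocate method are hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FefoSketch {
    public static final class Batch {
        final String batchId;
        final LocalDate expiry;
        int quantity; // remaining units in this batch (assumed field)
        public Batch(String batchId, LocalDate expiry, int quantity) {
            this.batchId = batchId;
            this.expiry = expiry;
            this.quantity = quantity;
        }
    }

    // Takes `needed` units from the batches that expire soonest. Returns batchId -> units taken.
    public static Map<String, Integer> allocate(List<Batch> batches, int needed) {
        int available = batches.stream().mapToInt(b -> b.quantity).sum();
        if (needed > available) {
            throw new IllegalStateException("insufficient stock");
        }
        List<Batch> byExpiry = new ArrayList<>(batches);
        byExpiry.sort(Comparator.comparing((Batch b) -> b.expiry));
        Map<String, Integer> picked = new LinkedHashMap<>();
        int remaining = needed;
        for (Batch batch : byExpiry) {
            if (remaining == 0) break;
            int take = Math.min(batch.quantity, remaining);
            if (take > 0) {
                picked.put(batch.batchId, take);
                batch.quantity -= take;
                remaining -= take;
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Batch> batches = List.of(
            new Batch("B1", LocalDate.of(2025, 3, 10), 5),
            new Batch("B2", LocalDate.of(2025, 3, 5), 3));
        System.out.println(allocate(batches, 6)); // {B2=3, B1=3}
    }
}
```

FIFO would sort by receipt date instead of expiry date; for perishables FEFO is usually the safer default because a later delivery can still expire sooner.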
Order Service integration reserves stock immediately on order placement through PATCH /inventory/reserve with optimistic locking (“UPDATE stock SET quantity = quantity - X, version = version + 1 WHERE product_id = Y AND warehouse_id = Z AND version = V”), which fails if a concurrent update occurred. The reserved quantity is reflected in Redis immediately for sub-second reads, and any asynchronous MySQL writes catch up within about 5 secs; the versioned UPDATE stays the authoritative guard against overselling, so this eventual consistency is acceptable. If an order is cancelled after reservation, an “order.cancelled” event triggers the stock release.
The Fulfillment Service decides which warehouse fulfills an order based on geolocation (customer 3km from Warehouse A, 8km from Warehouse B → prefer A), inventory availability (A has 2 units, B has 10 units → choose A if quantity ≤2, else B for better stock distribution), and delivery partner availability (A has 5 idle riders, B has 0 → prefer A for faster dispatch). A scoring algorithm (distance weight 50%, stock weight 30%, rider weight 20%) selects the highest-scoring warehouse, and a “fulfillment.assigned” event updates the stock status from RESERVED to ALLOCATED.
Scalability optimizations: partition the Stock table by warehouse_id (each dark store is an independent database shard), cache frequently accessed products in Redis (top 100 SKUs per warehouse with a 5-min TTL, handling 10k reads/sec), index on (product_id, warehouse_id, last_updated) for fast lookups, and archive transactions older than 90 days to cold storage (S3), reducing the active table from 100M to 10M rows and improving query performance by roughly 10x.
Monitoring tracks stock accuracy (weekly reconciliation of physical count vs system count, alerting if the discrepancy exceeds 5%), stockout rate (% of orders unfulfilled due to unavailable inventory, target <2%), inventory turnover (how fast stock sells, target 15 days for perishables), and reorder lead time (supplier delivery time, used to tune the reorder_point calculation). Dashboards show real-time stock levels per warehouse, low-stock alerts (quantity < reorder_point), and expiry warnings (batches expiring within 7 days), enabling proactive inventory management that prevents revenue loss from stockouts and wastage from expired goods.
Code
Warehouse Service Implementation:
@RestController
@RequestMapping("/api/inventory")
public class InventoryController {
    @Autowired
    private JdbcTemplate jdbcTemplate;
    @Autowired
    private StringRedisTemplate redisTemplate;
    @Autowired
    private KafkaTemplate<String, StockReservedEvent> kafkaTemplate;

    @PatchMapping("/reserve")
    public ResponseEntity<ReservationResponse> reserveStock(@RequestBody ReservationRequest request) {
        // Optimistic locking to prevent overselling
        int updated = jdbcTemplate.update(
            "UPDATE stock SET quantity = quantity - ?, version = version + 1 " +
            "WHERE product_id = ? AND warehouse_id = ? AND version = ? AND quantity >= ?",
            request.getQuantity(),
            request.getProductId(),
            request.getWarehouseId(),
            request.getVersion(),
            request.getQuantity()
        );
        if (updated == 0) {
            // Stale version or insufficient stock: the caller should re-read and retry
            return ResponseEntity.status(HttpStatus.CONFLICT)
                .body(new ReservationResponse("CONFLICT"));
        }
        // Update Redis cache
        redisTemplate.opsForValue().decrement(
            "stock:" + request.getProductId() + ":" + request.getWarehouseId(),
            request.getQuantity()
        );
        // Publish Kafka event, keyed by product so updates for one product stay ordered
        kafkaTemplate.send("stock.reserved", String.valueOf(request.getProductId()),
            new StockReservedEvent(request));
        return ResponseEntity.ok(new ReservationResponse("SUCCESS"));
    }
}

// Warehouse Selection Algorithm
@Service
public class FulfillmentService {
    public Warehouse selectWarehouse(Order order, List<Warehouse> warehouses) {
        return warehouses.stream()
            .map(w -> new ScoredWarehouse(w, calculateScore(order, w)))
            .max(Comparator.comparingDouble(ScoredWarehouse::getScore))
            .map(ScoredWarehouse::getWarehouse)
            .orElseThrow();
    }

    private double calculateScore(Order order, Warehouse warehouse) {
        double distance = calculateDistance(
            order.getDeliveryAddress(),
            warehouse.getLocation()
        );
        int availableStock = getAvailableStock(warehouse, order.getItems());
        int availableRiders = getRiderCount(warehouse);
        // Weighted scoring: distance 50%, stock 30%, riders 20%
        // Each component is normalized to the range 0..1 before weighting
        double distanceScore = 1.0 / (1.0 + distance);
        double stockScore = Math.min(availableStock / (double) order.getTotalQuantity(), 1.0);
        double riderScore = Math.min(availableRiders / 5.0, 1.0);
        return (0.5 * distanceScore) + (0.3 * stockScore) + (0.2 * riderScore);
    }
}

// Kafka Event Handler
@Service
public class StockEventHandler {
    @Autowired
    private JdbcTemplate jdbcTemplate;
    @Autowired
    private StringRedisTemplate redisTemplate;

    @KafkaListener(topics = "order.cancelled")
    public void handleOrderCancellation(OrderCancelledEvent event) {
        // Release reserved stock; bump the version so concurrent reservations re-read.
        // Note: with at-least-once delivery this handler needs a dedup check per event
        // to avoid releasing the same stock twice.
        jdbcTemplate.update(
            "UPDATE stock SET quantity = quantity + ?, version = version + 1 " +
            "WHERE product_id = ? AND warehouse_id = ?",
            event.getQuantity(),
            event.getProductId(),
            event.getWarehouseId()
        );
        // Update cache
        redisTemplate.opsForValue().increment(
            "stock:" + event.getProductId() + ":" + event.getWarehouseId(),
            event.getQuantity()
        );
    }
}
5. Real-Time Order Tracking System - Frontend
Difficulty Level: Hard
Role: SDE-2
Source: LinkedIn (2025)
Topic: Frontend / Mobile Engineering
Interview Round: Frontend System Design (60 min)
Technology Stack: React, WebSockets, SSE, Service Workers, IndexedDB
Swiggy Product Area: Food Delivery (Order Tracking)
Question: “Design and implement a real-time order tracking system for millions of concurrent users. Address: (1) Handling thousands of map marker updates without crashing the browser, (2) Choosing between WebSocket vs Server-Sent Events (SSE) vs Long Polling with trade-off justification, (3) Managing offline UI scenarios when user enters tunnel/loses connectivity, (4) Optimizing battery consumption on mobile browsers.”
Answer
WebSocket is chosen for primary real-time updates: latency is under 100ms, the channel is bidirectional (the client can send a “rider arrived” acknowledgment), and a persistent connection avoids per-request overhead. SSE is the fallback if the WebSocket fails, since it reconnects automatically, a one-way stream is sufficient for location updates, and it can be more battery-efficient. A minimal client:
const ws = new WebSocket('/orders/track');
ws.onmessage = (event) => {
  const {riderId, lat, lng} = JSON.parse(event.data);
  updateMarker(riderId, {lat, lng});
};
Reconnection uses exponential backoff (1s, 2s, 4s, capped at 8s) to prevent server overload during mass disconnections.
Long Polling is rejected because of its 1-3s latency (unacceptable for real-time tracking, where the rider moves 10-20 meters between polls and the animation becomes jerky), higher battery drain (repeated HTTP requests instead of a single persistent connection), and server overhead (every poll is a fresh request with full headers, whereas a WebSocket keeps one lightweight connection open per client). The trade-off accepts WebSocket’s complexity (fallback logic, load balancer sticky sessions) over Long Polling’s simplicity for a better user experience.
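The reconnection schedule (1s, 2s, 4s, capped at 8s) is a capped exponential. The sketch below shows only the delay calculation, written in Java to match the other listings in this guide; the class and method names are hypothetical, and a browser client would apply the same formula before calling its reconnect logic.

```java
public class ReconnectBackoffSketch {
    public static final long BASE_DELAY_MS = 1_000;
    public static final long MAX_DELAY_MS = 8_000;

    // attempt is zero-based: 0 -> 1s, 1 -> 2s, 2 -> 4s, 3 and above -> 8s
    public static long delayMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be non-negative");
        }
        int shift = Math.min(attempt, 30); // keep the shift small enough to avoid overflow
        return Math.min(MAX_DELAY_MS, BASE_DELAY_MS << shift);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMs(attempt)); // 1000, 2000, 4000, 8000, 8000, 8000
        }
    }
}
```

Adding a small random jitter to each delay would spread reconnects out further when many clients drop at the same moment.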
Map marker virtualization renders only visible markers (user viewing 1km radius sees 10 riders, not all 1000 in city) using viewport bounds filtering (markers.filter(m => isInViewport(m.lat, m.lng))), throttles updates to 5-second intervals (rider location updates every 1s but map redraws every 5s preventing 60fps animation overhead causing browser lag), uses requestAnimationFrame for smooth transitions (interpolates between old and new position over 5s instead of instant jump)—handles 1000+ concurrent markers without crashing by lazy-loading marker icons (load on-demand vs preloading all assets), using canvas rendering instead of DOM manipulation (10x faster for >100 markers), implementing marker clustering (group nearby riders into single cluster marker showing count). Battery optimization requests location updates every 5s (vs continuous GPS draining battery 40% faster), uses throttling/debouncing for user interactions (pan/zoom triggers API call only after user stops moving for 300ms), implements background sync (Service Worker queues updates when app backgrounded, syncs when foregrounded) reducing active connection time.
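The two core ideas in this paragraph, viewport filtering and interpolated marker movement, are language-independent. A minimal Python sketch follows; the Bounds type and function names are hypothetical, and the bounds check ignores viewports that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    south: float
    west: float
    north: float
    east: float


def in_viewport(lat, lng, bounds):
    """True when the point lies inside the visible map rectangle."""
    return bounds.south <= lat <= bounds.north and bounds.west <= lng <= bounds.east


def visible_markers(markers, bounds):
    """Keep only the markers that need to be drawn."""
    return [m for m in markers if in_viewport(m["lat"], m["lng"], bounds)]


def interpolate(start, end, progress):
    """Linear interpolation between two positions; progress is clamped to [0, 1]."""
    t = max(0.0, min(progress, 1.0))
    return {
        "lat": start["lat"] + (end["lat"] - start["lat"]) * t,
        "lng": start["lng"] + (end["lng"] - start["lng"]) * t,
    }
```

Each animation frame would call interpolate with progress = elapsed / duration, which is the same calculation the requestAnimationFrame loop performs in the browser.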
Service Worker + IndexedDB caches the last known rider location for offline display (a “Last updated 2 mins ago” timestamp shows data freshness), stores order details (restaurant name, items, estimated time) for offline viewing, and uses background sync to resume updates when connectivity is restored (queues failed requests, retries automatically). Graceful degradation shows a static map with the last known position instead of crashing, displays a “Reconnecting…” banner with a retry countdown, and blocks user actions that require the network (e.g., the “Call Rider” button is disabled with the tooltip “No internet connection”). Core Web Vitals optimization targets LCP (Largest Contentful Paint) <2.5s by lazy-loading the map (load the Google Maps API only after critical content has rendered), FID (First Input Delay) <100ms by debouncing zoom/pan handlers (note that Interaction to Next Paint, INP, replaced FID as a Core Web Vital in March 2024, with a “good” threshold of 200ms), and CLS (Cumulative Layout Shift) <0.1 by reserving map container space so the layout does not shift when the map loads. Metrics are monitored via Lighthouse, with alerts when they degrade so that performance fixes land before users complain.
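The freshness label shown next to cached data is a small calculation on the cached timestamp. A Python sketch, assuming timestamps in seconds and a hypothetical one-minute threshold for "just now":

```python
def freshness_label(last_update_s, now_s):
    """Human-readable age of the cached rider location."""
    age_s = max(0, int(now_s - last_update_s))
    if age_s < 60:
        return "Last updated just now"
    minutes = age_s // 60
    unit = "min" if minutes == 1 else "mins"
    return f"Last updated {minutes} {unit} ago"


print(freshness_label(0, 125))  # Last updated 2 mins ago
```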
Code
Real-Time Tracking (React + WebSocket):
import { useState, useEffect, useRef, useCallback } from 'react';
// Custom Hook for Order Tracking
const useOrderTracking = (orderId) => {
const [riderLocation, setRiderLocation] = useState(null);
const [connectionStatus, setConnectionStatus] = useState('connecting');
const wsRef = useRef(null);
const retryCount = useRef(0);
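useEffect(() => {
const MAX_WS_RETRIES = 4; // assumed limit before switching to SSE
let unmounted = false;
let retryTimer = null;
let eventSource = null;
const fallbackToSSE = () => {
if (unmounted || eventSource) return; // already on the fallback
eventSource = new EventSource(`/api/orders/${orderId}/updates`);
eventSource.onopen = () => setConnectionStatus('connected');
eventSource.onmessage = (event) => {
const update = JSON.parse(event.data);
setRiderLocation(update);
};
};
const connectWebSocket = () => {
try {
wsRef.current = new WebSocket(`wss://api.swiggy.com/orders/${orderId}/track`);
wsRef.current.onopen = () => {
setConnectionStatus('connected');
retryCount.current = 0;
};
wsRef.current.onmessage = (event) => {
const update = JSON.parse(event.data);
setRiderLocation({
lat: update.latitude,
lng: update.longitude,
timestamp: update.timestamp
});
};
wsRef.current.onerror = () => {
// A close event follows an error, so reconnection is handled in onclose
setConnectionStatus('error');
};
wsRef.current.onclose = () => {
if (unmounted) return; // closed by cleanup, do not reconnect
setConnectionStatus('disconnected');
retryCount.current++;
if (retryCount.current > MAX_WS_RETRIES) {
fallbackToSSE();
return;
}
// Exponential backoff: 1s, 2s, 4s, capped at 8s
const delay = Math.min(1000 * 2 ** (retryCount.current - 1), 8000);
retryTimer = setTimeout(connectWebSocket, delay);
};
} catch (error) {
fallbackToSSE();
}
};
connectWebSocket();
return () => {
unmounted = true;
clearTimeout(retryTimer);
wsRef.current?.close();
eventSource?.close();
};
}, [orderId]);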
return { riderLocation, connectionStatus };
};
// Map Component with Virtualization
const OrderTrackingMap = ({ orderId }) => {
const { riderLocation, connectionStatus } = useOrderTracking(orderId);
const mapRef = useRef(null);
const markerRef = useRef(null);
// Throttle updates to 5 seconds
const throttledUpdate = useCallback(
throttle((location) => {
if (mapRef.current && markerRef.current) {
// Smooth animation using requestAnimationFrame
animateMarker(markerRef.current, location);
}
}, 5000),
[]
);
const animateMarker = (marker, newPosition) => {
const start = marker.getPosition();
const end = newPosition;
const duration = 5000; // 5 seconds
const startTime = Date.now();
const animate = () => {
const elapsed = Date.now() - startTime;
const progress = Math.min(elapsed / duration, 1);
const lat = start.lat + (end.lat - start.lat) * progress;
const lng = start.lng + (end.lng - start.lng) * progress;
marker.setPosition({ lat, lng });
if (progress < 1) {
requestAnimationFrame(animate);
}
};
requestAnimationFrame(animate);
};
useEffect(() => {
if (riderLocation) {
throttledUpdate(riderLocation);
}
}, [riderLocation, throttledUpdate]);
// Offline handling with Service Worker
useEffect(() => {
if ('serviceWorker' in navigator) {
navigator.serviceWorker.register('/sw.js').then(registration => {
console.log('Service Worker registered');
});
}
}, []);
return (
<div>
{connectionStatus === 'disconnected' && (
<div className="reconnecting-banner">Reconnecting...</div>
)}
<GoogleMap
ref={mapRef}
center={riderLocation || { lat: 0, lng: 0 }}
zoom={15}
>
{riderLocation && (
<Marker
ref={markerRef}
position={riderLocation}
icon="/rider-icon.png"
/>
)}
</GoogleMap>
{riderLocation && (
<p>Last updated: {new Date(riderLocation.timestamp).toLocaleTimeString()}</p>
)}
</div>
);
};
6. React Optimization: useDebounce Hook + Restaurant Menu
Difficulty Level: Hard
Role: SDE-1 to SDE-2
Source: LinkedIn, InterviewRecap (2024-2025)
Topic: Frontend Engineering
Interview Round: Machine Coding (90 min)
Technology Stack: React (Hooks, Context API), JavaScript, CSS
Swiggy Product Area: Food Delivery
Question: “(Part 1) Implement a custom useDebounce hook from scratch for search functionality. (Part 2) Build a dynamic restaurant menu page where users can add item customizations (e.g., toppings, sides), and the cart total updates instantly on any change without using a state management library like Redux. Optimize re-renders to prevent performance degradation.”
Answer
Custom hook prevents excessive API calls while user types (typing “pizza” triggers 5 calls for “p”, “pi”, “piz”, “pizz”, “pizza” vs 1 call after 500ms pause) reducing server load from 100k/sec to 10k/sec during peak search hours—implementation: const useDebounce = (value, delay = 500) => { const [debouncedValue, setDebouncedValue] = useState(value); useEffect(() => { const handler = setTimeout(() => setDebouncedValue(value), delay); return () => clearTimeout(handler); }, [value, delay]); return debouncedValue; }; where cleanup function clears previous timeout preventing memory leaks, dependency array [value, delay] ensures effect runs only when inputs change. Usage pattern: const searchQuery = useDebounce(inputValue, 500); useEffect(() => { if (searchQuery) fetchRestaurants(searchQuery); }, [searchQuery]); where API call triggered only after user stops typing for 500ms, handles edge case of empty string (if check prevents unnecessary API call when user clears search).
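The debounce behavior can be checked without a browser by simulating keystroke timestamps. The Python sketch below is an illustration with hypothetical names: a timer started by a keystroke fires only if no newer keystroke arrives before the delay elapses (a keystroke landing exactly at the deadline is treated here as arriving too late to cancel).

```python
def debounced_calls(keystrokes, delay_ms=500):
    """keystrokes: list of (time_ms, value) sorted by time.

    Returns the (fire_time_ms, value) pairs that would reach the API.
    """
    fired = []
    for i, (t, value) in enumerate(keystrokes):
        next_t = keystrokes[i + 1][0] if i + 1 < len(keystrokes) else None
        # The timer set at t survives only if nothing newer arrives before t + delay
        if next_t is None or next_t - t >= delay_ms:
            fired.append((t + delay_ms, value))
    return fired


typing = [(0, "p"), (100, "pi"), (200, "piz"), (300, "pizz"), (400, "pizza")]
print(debounced_calls(typing))  # [(900, 'pizza')]
```

Five keystrokes collapse into one request, issued 500ms after the last one.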
React.memo + useMemo + useCallback prevents unnecessary re-renders where Cart component memoized (const Cart = React.memo(({ items, onItemChange }) => { const total = useMemo(() => items.reduce((sum, item) => sum + item.price * item.qty, 0), [items]); return <div>Total: ₹{total}</div>; });) recalculating total only when items array changes (not on every parent render), MenuItem memoized with useCallback (const MenuItem = React.memo(({ item, onAdd }) => { const handleCustomize = useCallback(() => onAdd({...item, customizations: [...]}), [item, onAdd]); return <button onClick={handleCustomize}>Add</button>; });) preventing function recreation on every render. Context API with value splitting avoids global re-renders by separating cart state (<CartContext.Provider value={{items, addItem}}>) from UI state (menu filters, search), using multiple contexts (CartContext for cart operations, UIContext for filters) ensuring menu items don’t re-render when cart updates—accepts slight complexity (managing multiple contexts) over performance degradation (entire menu re-rendering on every cart change causing 500ms lag for 100+ items).
React DevTools Profiler identifies expensive renders (MenuItem re-rendering despite React.memo due to onAdd function changing identity every render, fixed by wrapping in useCallback), measures commit phase duration (target <16ms for 60fps, alerts if >50ms indicating performance bottleneck), highlights unnecessary re-renders (parent state change causing all children to re-render even when props unchanged)—optimization strategy: start with working implementation (no memoization), profile under realistic load (100 menu items, 10 cart items), identify bottlenecks (Profiler shows MenuItem rendering 100 times on single cart update), apply targeted optimizations (React.memo on MenuItem, useMemo for expensive calculations), verify improvement (re-renders reduced from 100 to 1). Edge cases handled: floating-point precision (use Math.round(total * 100) / 100 preventing ₹10.999999 display), concurrent updates (user adds item while customization modal open, queue updates preventing race condition), empty cart (show “Cart is empty” instead of ₹0 total avoiding confusion).
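The floating-point edge case is often avoided altogether by keeping prices in integer paise and formatting only at display time. A Python sketch of that alternative, with hypothetical field names (price_paise, quantity):

```python
def cart_total_paise(items):
    """Sum of price * quantity using integer paise, so no rounding error accumulates."""
    return sum(item["price_paise"] * item["quantity"] for item in items)


def format_rupees(paise):
    """Format a non-negative paise amount as a rupee string."""
    return f"₹{paise // 100}.{paise % 100:02d}"


cart = [
    {"name": "Margherita", "price_paise": 1099, "quantity": 3},
    {"name": "Garlic bread", "price_paise": 4950, "quantity": 1},
]
print(format_rupees(cart_total_paise(cart)))  # ₹82.47
```

Rounding with Math.round(total * 100) / 100 repairs the display value after the fact; integer arithmetic avoids the drift in the first place.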
Code
React Optimization Implementation:
import React, { useState, useEffect, useCallback, useMemo, createContext } from 'react';
// Custom useDebounce Hook
const useDebounce = (value, delay = 500) => {
const [debouncedValue, setDebouncedValue] = useState(value);
useEffect(() => {
const handler = setTimeout(() => {
setDebouncedValue(value);
}, delay);
return () => clearTimeout(handler);
}, [value, delay]);
return debouncedValue;
};
// Restaurant Menu with Cart Optimization
const RestaurantMenu = () => {
const [cartItems, setCartItems] = useState([]);
const [searchQuery, setSearchQuery] = useState('');
const [menuItems, setMenuItems] = useState([]);
const debouncedQuery = useDebounce(searchQuery, 300);
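useEffect(() => {
if (!debouncedQuery) return;
let cancelled = false;
// fetchMenuItems is assumed to return a Promise resolving to the matching items
fetchMenuItems(debouncedQuery).then(items => {
if (!cancelled) setMenuItems(items);
});
return () => { cancelled = true; };
}, [debouncedQuery]);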
const addToCart = useCallback((item) => {
setCartItems(prev => {
const existing = prev.find(i => i.id === item.id);
if (existing) {
return prev.map(i =>
i.id === item.id
? { ...i, quantity: i.quantity + 1 }
: i
);
}
return [...prev, { ...item, quantity: 1 }];
});
}, []);
const updateQuantity = useCallback((itemId, quantity) => {
setCartItems(prev =>
prev.map(item =>
item.id === itemId ? { ...item, quantity } : item
).filter(item => item.quantity > 0)
);
}, []);
return (
<div className="restaurant-menu">
<SearchBar value={searchQuery} onChange={setSearchQuery} />
<MenuItems items={menuItems} onAdd={addToCart} />
<Cart items={cartItems} onUpdateQuantity={updateQuantity} />
</div>
);
};
// Memoized Cart Component
const Cart = React.memo(({ items, onUpdateQuantity }) => {
const total = useMemo(() => {
return items.reduce((sum, item) => {
return sum + (item.price * item.quantity);
}, 0);
}, [items]);
return (
<div className="cart">
<h3>Cart Total: ₹{total.toFixed(2)}</h3>
{items.length === 0 ? (
<p>Cart is empty</p>
) : (
items.map(item => (
<CartItem
key={item.id}
item={item}
onUpdateQuantity={onUpdateQuantity}
/>
))
)}
</div>
);
});
// Memoized Menu Item Component
const MenuItem = React.memo(({ item, onAdd }) => {
const handleAdd = useCallback(() => {
onAdd(item);
}, [item, onAdd]);
return (
<div className="menu-item">
<img src={item.image} alt={item.name} loading="lazy" />
<h4>{item.name}</h4>
<p className="description">{item.description}</p>
<div className="price-section">
<span className="price">₹{item.price}</span>
<button onClick={handleAdd} className="add-btn">
Add to Cart
</button>
</div>
</div>
);
});
// Context API for Cart State
const CartContext = createContext();
export const CartProvider = ({ children }) => {
const [items, setItems] = useState([]);
const value = useMemo(() => ({
items,
addItem: (item) => setItems(prev => [...prev, item]),
removeItem: (id) => setItems(prev => prev.filter(i => i.id !== id)),
updateQuantity: (id, qty) => setItems(prev =>
prev.map(i => i.id === id ? { ...i, quantity: qty } : i)
)
}), [items]);
return (
<CartContext.Provider value={value}>
{children}
</CartContext.Provider>
);
};
7. ML System Design: Delivery Time Prediction & Rider Assignment
Difficulty Level: Hard
Role: Data Scientist 1-2, Senior ML Engineer
Source: LinkedIn, Swiggy Bytes blog, InterviewQuery (2024-2025)
Topic: Data Engineering / ML Engineering
Interview Round: Machine Learning System Design (60-90 min)
Technology Stack: Python, XGBoost, PySpark, Kafka, Feature Store, Prophet/ARIMA
Swiggy Product Area: Food Delivery (Logistics & Pricing)
Question: “Design a complete machine learning system for: (1) Predicting exact delivery time (ETA) using real-time traffic, weather, food prep speed, and historical patterns. (2) Smart rider assignment matching right delivery partner based on distance, availability, restaurant load, past delivery patterns. (3) Kitchen load forecasting for peak hour prediction. (4) Dynamic route optimization for riders. (5) Demand forecasting for surge pricing based on city events, holidays, weather.”
Answer
ETA prediction model uses XGBoost with features: distance (GPS haversine formula), real-time traffic (Google Maps API congestion levels: low/medium/high adding 0/5/15 mins), historical delivery times (median time for restaurant-area pair from past 30 days), food prep time (learned from order-to-ready duration per restaurant averaging 15-25 mins), weather (rain delays deliveries 10-20% from wet roads, reduced rider availability)—achieves 85% accuracy within ±5 min margin with p95 latency <500ms for real-time inference, retrains daily on new data (incremental learning preventing model drift), handles cold-start (new restaurant with no history uses city-wide average prep time). Data pipeline ingests raw events via Kafka (order.placed, rider.assigned, order.delivered), batch processes with Spark (hourly aggregations computing features), stores in feature store (Tecton/Feast enabling feature reuse across models), serves via FastAPI (model loaded in memory, inference <100ms)—monitors prediction error (MAPE: Mean Absolute Percentage Error target <15%), alerts when accuracy drops below 80% indicating model degradation requiring retraining.
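The two monitoring numbers quoted above, MAPE and the share of predictions within ±5 minutes, can be computed as follows. This is a plain-Python sketch with hypothetical function names; it assumes no actual delivery time is zero.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def within_tolerance(actual, predicted, tol_min=5.0):
    """Fraction of predictions whose absolute error is at most tol_min minutes."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tol_min)
    return hits / len(actual)


actual = [20, 30, 40, 50]      # delivered in (minutes)
predicted = [22, 27, 48, 50]   # model ETA (minutes)
print(round(mape(actual, predicted), 2))    # 10.0
print(within_tolerance(actual, predicted))  # 0.75
```

The two metrics answer different questions: MAPE penalizes relative error on short trips, while the tolerance band matches what a customer perceives as "on time".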
Rider assignment solves constraint optimization (assign order to rider minimizing delivery time while balancing workload) using greedy approximation: geohash finds riders within 3km radius (spatial index enabling sub-second lookup among 15k riders), scoring function combines distance (70% weight: closer rider preferred), availability (20%: idle rider vs rider finishing delivery in 5 mins), historical acceptance rate (10%: rider with 90% acceptance preferred over 60% preventing frequent rejections)—accepts approximate solution (greedy picks highest-scoring rider) over optimal (NP-hard problem requiring minutes to solve) since real-time constraint (<1s assignment) more important than 5% efficiency gain, handles peak load (100k concurrent orders, 15k riders) by partitioning city into zones (each zone runs independent assignment preventing global locking).
Kitchen load forecasting predicts peak hours (lunch 12-1:30pm, dinner 7-9pm) and popular dishes per restaurant using ARIMA time series with external regressors (day of week: Friday dinner 40% higher than Monday, holidays: Diwali 2x normal volume, weather: rain increases orders 30%, promotional events: 50% off campaign doubles orders)—outputs hourly predictions per restaurant alerting when predicted load >80% capacity enabling proactive ingredient prep, reduces order cancellations from 15% to 5% during peaks by setting realistic prep times (30 mins vs usual 20 mins when forecasted busy), updates every 15 mins incorporating real-time order velocity (if 100 orders placed in last 15 mins vs usual 50, increase forecast by 2x). Surge pricing predicts demand surge per city zone using features: time of day (8pm peak vs 3pm valley), weather (rain/extreme heat increases delivery orders), temperature (>35°C reduces food delivery, <20°C increases), city events (cricket match, concert nearby), holidays (New Year’s Eve 3x surge)—model outputs surge multiplier (1.0-2.0x) updating every 15 mins, applies to delivery fee (base ₹20 × 1.5 surge = ₹30), caps at 2x preventing customer alienation, communicates transparently (“High demand in your area, ₹10 extra delivery fee”) reducing complaints.
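The fee arithmetic at the end of this paragraph can be sketched as below. The mapping from demand to multiplier (a demand-to-capacity ratio) is a hypothetical stand-in for the model output; only the 1.0 to 2.0 clamp and the fee formula come from the text.

```python
def surge_multiplier(demand, capacity, cap=2.0):
    """Hypothetical mapping: demand/capacity ratio, clamped to [1.0, cap]."""
    return max(1.0, min(demand / capacity, cap))


def delivery_fee(base_fee, multiplier):
    """Delivery fee in rupees after surge, rounded to the nearest rupee."""
    return round(base_fee * multiplier)


print(delivery_fee(20, surge_multiplier(150, 100)))  # 30
print(delivery_fee(20, surge_multiplier(500, 100)))  # 40, capped at 2x
```

Clamping at the lower bound matters as much as the cap: a low forecast should never discount the base fee by accident.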
Evaluation metrics track ETA accuracy (85% within ±5 mins, p95 error <10 mins), rider assignment efficiency (average delivery time 25 mins vs 30 mins without ML, rider utilization 80% vs 60%), kitchen load forecast accuracy (MAPE <20%, false positive rate <10% preventing unnecessary alerts), surge pricing impact (revenue +15%, order cancellation rate <5% indicating acceptable pricing)—A/B tests new models (10% traffic gets new model, 90% gets current, compare metrics after 7 days) before full rollout preventing degradation. Ethical considerations prevent surge pricing discrimination (same surge for all users in zone, no personalized pricing based on income/history), cap maximum surge (2x limit even during extreme demand preventing price gouging), provide transparency (show surge reason: “High demand due to rain”), offer alternatives (suggest ordering from less-busy restaurant with lower surge)—demonstrates understanding ML systems require fairness, transparency, accountability beyond just accuracy metrics, with regular audits ensuring models don’t develop bias (e.g., assigning worse riders to certain neighborhoods).
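A common way to implement the 10%/90% split is deterministic hashing, so a given user sees the same model for the whole test. The Python sketch below is one possible approach, not a description of Swiggy's tooling; the experiment name and function are hypothetical.

```python
import hashlib


def in_treatment(user_id, experiment, percent=10):
    """Deterministically place a user in the treatment group for an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return bucket < percent


share = sum(in_treatment(f"user-{i}", "eta-model-v2") for i in range(10_000)) / 10_000
print(f"treatment share: {share:.3f}")
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users are not always the ones receiving new models.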
Code
ML Model Implementation (Python):
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from prophet import Prophet

# ETA Prediction Model
# The lookup helpers (calculate_distance, get_traffic_level, get_avg_prep_time,
# get_weather) are not shown; they are assumed to return numeric values.
class ETAPredictionModel:
    def __init__(self):
        self.model = xgb.XGBRegressor(
            objective='reg:squarederror',
            n_estimators=100,
            learning_rate=0.1,
            max_depth=6
        )

    def prepare_features(self, order_data):
        features = {
            'distance_km': self.calculate_distance(
                order_data['restaurant_location'],
                order_data['delivery_location']
            ),
            'traffic_level': self.get_traffic_level(order_data['timestamp']),
            'historical_prep_time': self.get_avg_prep_time(order_data['restaurant_id']),
            'weather_condition': self.get_weather(order_data['city']),
            'hour_of_day': order_data['timestamp'].hour,
            'day_of_week': order_data['timestamp'].weekday(),
            'is_peak_hour': 1 if order_data['timestamp'].hour in [12, 13, 19, 20, 21] else 0
        }
        return pd.DataFrame([features])

    def train(self, training_data):
        X = training_data[['distance_km', 'traffic_level', 'historical_prep_time',
                           'weather_condition', 'hour_of_day', 'day_of_week', 'is_peak_hour']]
        y = training_data['actual_delivery_time']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        self.model.fit(X_train, y_train)
        predictions = self.model.predict(X_test)
        # MAPE on the held-out 20% (assumes no zero delivery times)
        mape = np.mean(np.abs((y_test - predictions) / y_test)) * 100
        print(f"MAPE: {mape:.2f}%")

    def predict(self, order_data):
        features = self.prepare_features(order_data)
        eta_minutes = self.model.predict(features)[0]
        return int(round(eta_minutes))

# Rider Assignment Algorithm
import geohash  # provided by the python-geohash package
from math import radians, sin, cos, sqrt, atan2

class RiderAssignmentService:
    def assign_rider(self, order, available_riders):
        # Geohash for proximity search. A precision-6 cell is roughly 1.2km x 0.6km,
        # so find_riders_in_geohash (helper not shown) should also check neighboring cells.
        order_geohash = geohash.encode(order['lat'], order['lng'], precision=6)
        nearby_riders = self.find_riders_in_geohash(order_geohash, available_riders)
        if not nearby_riders:
            return None

        # Score each rider
        scored_riders = []
        for rider in nearby_riders:
            distance = self.calculate_distance(order, rider['location'])
            availability_score = 1.0 if rider['is_idle'] else 0.5
            acceptance_rate = rider['acceptance_rate']
            # Weighted scoring: distance 70%, availability 20%, acceptance 10%
            score = (0.7 * (1.0 / (1.0 + distance))) + (0.2 * availability_score) + (0.1 * acceptance_rate)
            scored_riders.append((rider, score))
        return max(scored_riders, key=lambda x: x[1])[0]

    def calculate_distance(self, point1, point2):
        # Haversine great-circle distance between two {'lat', 'lng'} points
        R = 6371  # Earth radius in km
        lat1, lon1 = radians(point1['lat']), radians(point1['lng'])
        lat2, lon2 = radians(point2['lat']), radians(point2['lng'])
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * atan2(sqrt(a), sqrt(1-a))
        return R * c
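
# Surge Pricing Model
class SurgePricingModel:
    def __init__(self):
        self.model = Prophet(
            yearly_seasonality=True,
            weekly_seasonality=True,
            daily_seasonality=True
        )
        # Prophet only uses extra columns that are registered before fitting.
        # weather is assumed to be numerically encoded, and self.model.fit(history)
        # must be called on historical demand before predict_surge.
        for name in ('hour', 'day_of_week', 'weather', 'temperature', 'is_holiday'):
            self.model.add_regressor(name)

    def predict_surge(self, city_zone, timestamp, weather, temperature):
        features = pd.DataFrame([{
            'ds': timestamp,
            'hour': timestamp.hour,
            'day_of_week': timestamp.weekday(),
            'weather': weather,
            'temperature': temperature,
            'is_holiday': self.is_holiday(timestamp)  # helper not shown, assumed to return 0/1
        }])
        demand_forecast = self.model.predict(features)['yhat'].values[0]
        # Calculate surge multiplier, clamped to the 1.0 to 2.0 range
        surge_multiplier = max(1.0, min(1.0 + (demand_forecast / 100), 2.0))
        return surge_multiplier
8. Infinite Scroll Component in React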
Difficulty Level: Medium-Hard
Role: SDE-1
Source: YouTube, Reddit, InterviewExperiences.in (2024-2025)
Topic: Frontend Engineering
Interview Round: Machine Coding (60 min)
Technology Stack: React, JavaScript (Fetch API, Intersection Observer), CSS
Swiggy Product Area: General / Restaurant Search
Question: “Build an infinite scroll component in React that: (1) Fetches data from a paginated API (OpenLibrary), (2) Implements debounced search (call API only after user stops typing for 300ms), (3) Uses Intersection Observer to detect when user scrolls near bottom, (4) Handles duplicate requests (abort previous request when user retypes), (5) Implements useCallback to prevent unnecessary child re-renders, (6) Shows loading/error states gracefully.”
Answer
Intersection Observer detects when user scrolls near bottom triggering next page fetch: const observer = new IntersectionObserver(entries => { if (entries[0].isIntersecting && !isLoading) { pageNumber.current++; getData(query, pageNumber.current); } }); observer.observe(lastItemRef.current); where lastItemRef points to last item in list (sentinel element), isIntersecting fires when element enters viewport (user scrolled to bottom), isLoading prevents duplicate requests (waits for current fetch to complete before triggering next)—cleanup: return () => observer.disconnect(); preventing memory leaks. Abort controller cancels previous request when user retypes: const controller = useRef(null); if (controller.current) controller.current.abort(); controller.current = new AbortController(); const response = await fetch(url, { signal: controller.current.signal }); where abort() cancels in-flight request preventing stale data (user types “pizza” then “burger”, “pizza” results discarded showing only “burger” results), handles AbortError gracefully (catch block checks error.name === 'AbortError' skipping error display for intentional cancellations).
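AbortController cancels the stale request at the network level. A complementary guard, useful where cancellation is not available, is a sequence token: only the response belonging to the most recent request may update state. The Python sketch below illustrates that idea; the class and method names are hypothetical.

```python
class LatestOnly:
    """Accept results only from the most recently started request."""

    def __init__(self):
        self._token = 0
        self.results = None

    def start(self):
        """Register a new request and return its token."""
        self._token += 1
        return self._token

    def finish(self, token, results):
        """Store results if the token is current; report whether they were kept."""
        if token != self._token:
            return False  # a newer request has started, discard stale data
        self.results = results
        return True


search = LatestOnly()
pizza = search.start()
burger = search.start()                     # user retyped before "pizza" returned
print(search.finish(pizza, ["pizza hut"]))  # False
print(search.finish(burger, ["burger king"]))  # True
```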
Debounced search delays API call until user stops typing: const debouncedQuery = useDebounce(query, 300); useEffect(() => { if (debouncedQuery) { pageNumber.current = 1; setData([]); getData(debouncedQuery, 1); } }, [debouncedQuery]); where typing “pizza” triggers single API call 300ms after last keystroke (not 5 calls for each letter), resets page number to 1 on new search (prevents fetching page 5 of “burger” results when user searched “pizza” previously), clears existing data preventing mixing results from different queries. useCallback prevents child re-renders: const getData = useCallback(async (query, page) => { ... }, []); const renderItem = useCallback((item) => <div>{item.title}</div>, []); where function identity remains stable across renders (child components receiving getData as prop don’t re-render unless dependencies change), empty dependency array indicates function never changes (safe since no external state referenced)—trade-off accepts slight complexity (managing useCallback) over performance degradation (100 list items re-rendering on every parent state change).
Loading states show skeleton screens during initial fetch (3 placeholder cards with shimmer animation), “Loading more…” text during pagination (appended to bottom of list), spinner during search (overlays existing results with semi-transparent background)—prevents jarring UX where content disappears then reappears. Error handling displays user-friendly messages: network error (“Check your internet connection”, retry button), API error 500 (“Something went wrong, try again later”), no results (“No restaurants found for ‘xyz’”, suggest clearing filters), rate limit 429 (“Too many requests, wait 1 minute”)—logs errors to monitoring (Sentry) for debugging without exposing technical details to users. Edge cases: empty query (don’t call API, show placeholder “Search for restaurants”), rapid scrolling (throttle Intersection Observer callbacks to 500ms preventing 10 simultaneous page fetches), page reset (when query changes, scroll to top preventing confusion where user sees middle of results), duplicate items (deduplicate by id preventing same restaurant appearing twice when pagination overlaps).
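Two of the edge cases above reduce to small pure functions: merging a new page without duplicates, and mapping a failure to a user-facing message. A Python sketch with hypothetical names follows; the message strings are the ones listed in the text, and a status of None stands for a network failure.

```python
def merge_page(existing, page, key="id"):
    """Append items from page whose key has not been seen, preserving order."""
    seen = {item[key] for item in existing}
    merged = list(existing)
    for item in page:
        if item[key] not in seen:
            seen.add(item[key])
            merged.append(item)
    return merged


USER_MESSAGES = {
    429: "Too many requests, wait 1 minute",
    500: "Something went wrong, try again later",
}


def user_message(status):
    """Translate an HTTP status (or None for a network error) into display text."""
    if status is None:
        return "Check your internet connection"
    return USER_MESSAGES.get(status, "Something went wrong, try again later")


page1 = [{"id": 1}, {"id": 2}]
page2 = [{"id": 2}, {"id": 3}]  # overlapping pagination
print(merge_page(page1, page2))  # [{'id': 1}, {'id': 2}, {'id': 3}]
```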
Code
Infinite Scroll Implementation (React):
const InfiniteScrollList = ({ searchQuery }) => {
const [data, setData] = useState([]);
const [page, setPage] = useState(1);
const [isLoading, setIsLoading] = useState(false);
const [hasMore, setHasMore] = useState(true);
const [error, setError] = useState(null);
const lastItemRef = useRef(null);
const controllerRef = useRef(null);
const debouncedQuery = useDebounce(searchQuery, 300);
const fetchData = useCallback(async (query, pageNum) => {
// Abort previous request
if (controllerRef.current) {
controllerRef.current.abort();
}
const controller = new AbortController();
controllerRef.current = controller;
setIsLoading(true);
setError(null);
try {
const response = await fetch(
`https://openlibrary.org/search.json?q=${encodeURIComponent(query)}&page=${pageNum}`,
{ signal: controller.signal }
);
if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${response.statusText}`);
}
const result = await response.json();
if (pageNum === 1) {
setData(result.docs);
} else {
// Deduplicate by key against the latest state, not a stale closure
setData(prev => {
const seen = new Set(prev.map(doc => doc.key));
return [...prev, ...result.docs.filter(doc => !seen.has(doc.key))];
});
}
setHasMore(result.docs.length > 0);
} catch (err) {
if (err.name !== 'AbortError') {
setError(err.message);
console.error('Fetch error:', err);
}
} finally {
// Only the most recent request may clear the loading flag
if (controllerRef.current === controller) {
setIsLoading(false);
}
}
}, []);
// Reset on new search
useEffect(() => {
if (debouncedQuery) {
setPage(1);
setData([]);
setHasMore(true);
fetchData(debouncedQuery, 1);
// Scroll to top
window.scrollTo({ top: 0, behavior: 'smooth' });
}
}, [debouncedQuery]);
// Intersection Observer for infinite scroll
useEffect(() => {
const observer = new IntersectionObserver(
(entries) => {
if (entries[0].isIntersecting && !isLoading && hasMore) {
const nextPage = page + 1;
setPage(nextPage);
fetchData(debouncedQuery, nextPage);
}
},
{ threshold: 1.0, rootMargin: '100px' }
);
if (lastItemRef.current) {
observer.observe(lastItemRef.current);
}
return () => observer.disconnect();
}, [isLoading, hasMore, debouncedQuery, page, fetchData]);
const renderItem = useCallback((item, index) => (
<div
key={item.key}
ref={index === data.length - 1 ? lastItemRef : null}
className="list-item"
>
<h3>{item.title}</h3>
<p className="author">{item.author_name?.join(', ') || 'Unknown Author'}</p>
<p className="year">{item.first_publish_year}</p>
</div>
), [data.length]);
if (!debouncedQuery) {
return <div className="placeholder">Search for restaurants...</div>;
}
return (
<div className="infinite-scroll-container">
{data.map((item, index) => renderItem(item, index))}
{isLoading && (
<div className="loading-skeleton">
{[1, 2, 3].map(i => (
<div key={i} className="skeleton-card shimmer" />
))}
</div>
)}
{error && (
<div className="error-message">
<p>{error}</p>
<button onClick={() => fetchData(debouncedQuery, page)}>
Retry
</button>
</div>
)}
{!hasMore && data.length > 0 && (
<div className="end-message">No more results</div>
)}
{!isLoading && !error && data.length === 0 && (
<div className="no-results">
No restaurants found for "{debouncedQuery}"
</div>
)}
</div>
);
};
// Throttle utility
function throttle(func, wait) {
let timeout;
let previous = 0;
return function(...args) {
const now = Date.now();
const remaining = wait - (now - previous);
if (remaining <= 0 || remaining > wait) {
if (timeout) {
clearTimeout(timeout);
timeout = null;
}
previous = now;
func.apply(this, args);
} else if (!timeout) {
timeout = setTimeout(() => {
previous = Date.now();
timeout = null;
func.apply(this, args);
}, remaining);
}
};
}
9. Cab Booking System - Low-Level Design
Difficulty Level: Hard
Role: SDE-2
Source: InterviewExperiences.in (2024-2025)
Topic: Backend Engineering / OOP Design
Interview Round: Low-Level Design (60 min)
Technology Stack: Java, OOP (Inheritance, Polymorphism), SOLID, Relational DB
Swiggy Product Area: Ride-sharing
Question: “Design a Cab Booking System (similar to Uber) with: (1) Core entities (Rider, Driver, Trip, Cab), (2) Main functionalities (booking, driver allocation, trip status tracking), (3) Complete UML class diagram showing relationships, (4) API endpoints (REST), (5) Database schema, (6) Edge cases (concurrent bookings, driver unavailability, trip cancellation), (7) Scalability considerations.”
Answer
Core entities include Rider (riderId, name, email, phone, rating, rideHistory: List<Trip>, requestTrip(pickup, dropoff): Trip), Driver (driverId, name, phone, licenseNumber, rating, currentCab: Cab, isAvailable: boolean, acceptTrip(trip): void, updateLocation(location): void, completeTrip(trip): void), Cab extends Vehicle (licensePlate, capacity, currentLocation: Location, cabType ENUM: MINI/SEDAN/PRIME, baseFare), and Trip (tripId, pickupLocation, dropoffLocation, fare, status: TripStatus ENUM: REQUESTED/ACCEPTED/IN_PROGRESS/COMPLETED/CANCELLED, driver, rider, startTime, endTime). Relationships: Rider 1:N Trip (one rider, many trips), Driver 1:N Trip (one driver, many trips), Driver 1:1 Cab (one driver, one cab at a time). Design patterns: Strategy for fare calculation (different pricing for MINI/SEDAN/PRIME), Observer for trip status notifications (rider and driver are notified on every status change), and Factory for creating trips (TripFactory.createTrip() encapsulates validation logic).
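The Strategy pattern for fares can be sketched compactly. The example is in Python for brevity, although the interview answer targets Java; the class names and all rates are hypothetical.

```python
from abc import ABC, abstractmethod


class FareStrategy(ABC):
    """Pricing rule that can be swapped per cab type."""

    @abstractmethod
    def fare(self, distance_km: float) -> float:
        ...


class PerKmFare(FareStrategy):
    def __init__(self, base_fare: float, per_km: float):
        self.base_fare = base_fare
        self.per_km = per_km

    def fare(self, distance_km: float) -> float:
        return self.base_fare + self.per_km * distance_km


# Illustrative rates only, not real pricing
STRATEGIES = {
    "MINI": PerKmFare(base_fare=40, per_km=10),
    "SEDAN": PerKmFare(base_fare=60, per_km=14),
    "PRIME": PerKmFare(base_fare=80, per_km=18),
}


def calculate_fare(cab_type: str, distance_km: float) -> float:
    """Delegate to whichever strategy is registered for the cab type."""
    return STRATEGIES[cab_type].fare(distance_km)


print(calculate_fare("MINI", 5))   # 90
print(calculate_fare("PRIME", 5))  # 170
```

Adding a new pricing rule (for example, time-based fares) means registering another FareStrategy, with no change to calculate_fare.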
REST APIs include POST /trips (create trip: {riderId, pickup: {lat, lng}, dropoff: {lat, lng}, cabType} returns tripId), GET /trips/{tripId} (retrieve trip details), PATCH /trips/{tripId}/accept (driver accepts trip), PATCH /trips/{tripId}/start (driver starts trip), PATCH /trips/{tripId}/complete (driver completes trip, calculates fare), DELETE /trips/{tripId} (cancel trip with reason), GET /drivers/nearby (find available drivers within radius: {lat, lng, radius, cabType})—validation: pickup/dropoff coordinates valid (latitude -90 to 90, longitude -180 to 180), rider exists (404 if not found), driver available (400 if already on trip), trip in correct state (can’t complete REQUESTED trip, must be IN_PROGRESS). Edge cases handle concurrent bookings (100 riders request trips simultaneously: use request queue processed FIFO, assign drivers fairly preventing starvation), driver unavailability (driver goes offline mid-trip: trip marked CANCELLED, rider refunded, driver penalized reducing acceptance rate), trip cancellation (rider cancels after driver accepted: cancellation fee ₹20, driver compensated for time wasted), location spoofing (driver reports fake GPS: validate signal quality, compare with cell tower triangulation, flag if accuracy >50m).
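The state and coordinate validation rules above amount to a small transition table. A Python sketch under the assumption that a trip can be cancelled from any non-terminal state (the text covers cancellation after acceptance and mid-trip); the names are hypothetical.

```python
ALLOWED_TRANSITIONS = {
    "REQUESTED": {"ACCEPTED", "CANCELLED"},
    "ACCEPTED": {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS": {"COMPLETED", "CANCELLED"},
    "COMPLETED": set(),
    "CANCELLED": set(),
}


def transition(current, target):
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot move trip from {current} to {target}")
    return target


def valid_coordinates(lat, lng):
    """Latitude must be in [-90, 90] and longitude in [-180, 180]."""
    return -90 <= lat <= 90 and -180 <= lng <= 180


print(transition("IN_PROGRESS", "COMPLETED"))  # COMPLETED
print(valid_coordinates(12.97, 77.59))         # True
```

An API handler would map the ValueError to the 400 response described above.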
Database schema uses tables: Riders (rider_id PK, name, email unique, phone, rating), Drivers (driver_id PK, name, phone, license_number unique, rating, is_available boolean, current_cab_id FK), Cabs (cab_id PK, license_plate unique, cab_type ENUM, base_fare, current_location POINT), Trips (trip_id PK, rider_id FK, driver_id FK nullable, pickup_location POINT, dropoff_location POINT, fare, status ENUM, created_at, started_at, completed_at). Indexes: (driver_id, is_available) for fast available-driver lookup, (rider_id, created_at DESC) for ride history pagination, and a spatial index on current_location for geospatial queries (find drivers within 3km). Scalability optimizations: partition the Trips table by created_at (monthly partitions such as trips_2024_01 and trips_2024_02, which speeds up queries for recent trips), shard Drivers by city (Mumbai drivers in shard 1, Delhi in shard 2, avoiding cross-city queries), cache driver locations in Redis geo sets (GEOSEARCH, which supersedes the GEORADIUS command deprecated in Redis 6.2, finds drivers within a radius in <10ms vs 500ms for a MySQL spatial query), and serve trip history from read replicas (3 replicas; the workload is 90% reads, 10% writes). Growth from 100k to 10M drivers is handled by geohashing: divide each city into grid cells, run matching per cell independently, and merge results, which avoids global locking.
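Two of the scalability ideas are simple key derivations: the monthly partition a trip belongs to, and the grid cell used to match riders and drivers locally. The Python sketch below uses a fixed-size latitude/longitude grid as a simplified stand-in for geohashing; the 0.01 degree cell size (roughly 1km) and the function names are assumptions.

```python
import math
from datetime import datetime


def trips_partition(created_at: datetime) -> str:
    """Name of the monthly partition that stores a trip."""
    return f"trips_{created_at:%Y_%m}"


def grid_cell(lat: float, lng: float, cell_deg: float = 0.01):
    """Integer (row, column) of the grid cell containing the point."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))


print(trips_partition(datetime(2024, 1, 15)))  # trips_2024_01
print(grid_cell(19.076, 72.8777))              # (1907, 7287)
```

Drivers and trip requests that map to the same cell (plus its neighbors) can be matched without touching data from the rest of the city.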
Code
Cab Booking System (Java):
// Entity Classes
@Entity
@Table(name = "trips")
public class Trip {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private String tripId;
@Embedded
private Location pickupLocation;
@Embedded
private Location dropoffLocation;
@Enumerated(EnumType.STRING)
private TripStatus status;
@ManyToOne
@JoinColumn(name = "rider_id")
private Rider rider;
@ManyToOne
@JoinColumn(name = "driver_id")
private Driver driver;
private Double fare;
private LocalDateTime startTime;
private LocalDateTime endTime;
}
@Entity
@Table(name = "drivers")
public class Driver {
@Id
private String driverId;
private String name;
private String licenseNumber;
private Double rating;
@OneToOne
@JoinColumn(name = "cab_id")
private Cab currentCab;
@Column(name = "is_available")
private Boolean isAvailable;
@Embedded
private Location currentLocation;
}
// Service Layer
@Service
@Transactional
public class TripService {
@Autowired
private DriverRepository driverRepository;
@Autowired
private TripRepository tripRepository;
@Autowired
private NotificationService notificationService;
@Autowired
private RiderRepository riderRepository; // used by requestTrip to load the rider
public Trip requestTrip(TripRequest request) {
// Find nearby available drivers using spatial query
List<Driver> nearbyDrivers = driverRepository.findNearbyDrivers(
request.getPickupLocation().getLatitude(),
request.getPickupLocation().getLongitude(),
3000.0 // 3km radius in meters
);
if (nearbyDrivers.isEmpty()) {
throw new NoDriverAvailableException("No drivers available in your area");
}
// Create trip
Trip trip = new Trip();
trip.setRider(riderRepository.findById(request.getRiderId()).orElseThrow());
trip.setPickupLocation(request.getPickupLocation());
trip.setDropoffLocation(request.getDropoffLocation());
trip.setStatus(TripStatus.REQUESTED);
trip.setFare(calculateFare(request));
Trip savedTrip = tripRepository.save(trip);
// Notify nearby drivers, nearest first (the repository query orders by distance)
nearbyDrivers.forEach(driver ->
notificationService.notifyDriver(driver, savedTrip)
);
return savedTrip;
}
public Trip acceptTrip(String tripId, String driverId) {
Trip trip = tripRepository.findById(tripId)
.orElseThrow(() -> new TripNotFoundException(tripId));
Driver driver = driverRepository.findById(driverId)
.orElseThrow(() -> new DriverNotFoundException(driverId));
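// Guard checks; on their own these do not stop two concurrent accepts. For real optimistic
// locking, Trip and Driver need a JPA @Version column so the losing transaction fails on save.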
if (!driver.getIsAvailable()) {
throw new DriverNotAvailableException("Driver is already on a trip");
}
if (trip.getStatus() != TripStatus.REQUESTED) {
throw new InvalidTripStatusException("Trip already accepted by another driver");
}
driver.setIsAvailable(false);
trip.setDriver(driver);
trip.setStatus(TripStatus.ACCEPTED);
trip.setStartTime(LocalDateTime.now());
driverRepository.save(driver);
Trip updatedTrip = tripRepository.save(trip);
// Notify rider
notificationService.notifyRider(trip.getRider(), "Driver accepted your trip");
return updatedTrip;
}
public Trip completeTrip(String tripId) {
Trip trip = tripRepository.findById(tripId).orElseThrow();
if (trip.getStatus() != TripStatus.IN_PROGRESS) {
throw new InvalidTripStatusException("Trip is not in progress");
}
trip.setStatus(TripStatus.COMPLETED);
trip.setEndTime(LocalDateTime.now());
// Make driver available again
Driver driver = trip.getDriver();
driver.setIsAvailable(true);
driverRepository.save(driver);
return tripRepository.save(trip);
}
private Double calculateFare(TripRequest request) {
double distance = calculateDistance(
request.getPickupLocation(),
request.getDropoffLocation()
);
double baseFare = request.getCabType() == CabType.MINI ? 50.0 :
request.getCabType() == CabType.SEDAN ? 70.0 : 100.0;
double perKmRate = request.getCabType() == CabType.MINI ? 10.0 :
request.getCabType() == CabType.SEDAN ? 12.0 : 15.0;
return baseFare + (distance * perKmRate);
}
}
// Repository with Spatial Queries
@Repository
public interface DriverRepository extends JpaRepository<Driver, String> {
@Query(value = "SELECT * FROM drivers WHERE is_available = true " +
"AND ST_Distance_Sphere(POINT(current_location_lng, current_location_lat), " +
"POINT(:lng, :lat)) <= :radiusMeters " +
"ORDER BY ST_Distance_Sphere(POINT(current_location_lng, current_location_lat), " +
"POINT(:lng, :lat)) ASC",
nativeQuery = true)
List<Driver> findNearbyDrivers(
@Param("lat") double latitude,
@Param("lng") double longitude,
@Param("radiusMeters") double radiusMeters
);
}
// Redis Caching for Driver Locations
@Service
public class DriverLocationService {
@Autowired
private RedisTemplate<String, String> redisTemplate;
public void updateDriverLocation(String driverId, double lat, double lng) {
// All drivers live in one geo set, the same key that findNearbyDrivers queries
redisTemplate.opsForGeo().add("drivers:locations", new Point(lng, lat), driverId);
}
public List<String> findNearbyDrivers(double lat, double lng, double radiusKm) {
Distance radius = new Distance(radiusKm, Metrics.KILOMETERS);
Circle area = new Circle(new Point(lng, lat), radius);
GeoResults<GeoLocation<String>> results = redisTemplate.opsForGeo()
.radius("drivers:locations", area);
return results.getContent().stream()
.map(result -> result.getContent().getName())
.collect(Collectors.toList());
}
}
10. Behavioral: Production Incident & Leadership
Difficulty Level: Hard
Role: SDE-2 to Staff Engineer
Source: InterviewExperiences.in, LinkedIn, Blind (2024-2025)
Topic: Behavioral / Leadership / Engineering Mindset
Interview Round: Hiring Manager / Bar-Raiser (30-45 min)
Technology Stack: N/A (real technical work examples)
Swiggy Product Area: All (culture fit)
Question: “Tell us about a time you: (1) Resolved a production incident under pressure (what was failing, how did you debug, what was impact), (2) Had strong disagreement with Product Manager (how did you handle it, did you convince them or accept their view), (3) Balanced speed vs code quality (shipping feature in 2 days vs ‘proper’ 2-week implementation), (4) Led a small team/sprint (what did you accomplish, what challenges), (5) Used data to make a technical decision (metrics, analysis, outcome).”
Answer
Situation: Payment service crashed during dinner rush (7pm Friday) causing 40% missed payments (₹50L revenue at risk per hour)—immediately enabled fallback to synchronous processing (1-min slower but reliable preventing further losses), debugged concurrent request issue discovering thread pool exhaustion (100 threads all blocked waiting for database connections, new requests queued causing timeouts), applied fix increasing connection pool size from 20 to 50 and implementing connection timeout (5s vs infinite preventing indefinite blocking), deployed behind feature flag with 10% traffic first validating stability (monitored error rate, latency for 15 mins), scaled to 100% once confirmed stable. Impact: reduced missed payments from 40% to 2%, recovered SLA (p95 latency 300ms vs 3s during incident), prevented ₹2Cr revenue loss (4 hours × ₹50L/hour)—learning: should have load-tested concurrency scenarios before peak hours, now run weekly chaos engineering drills simulating database slowdowns, thread exhaustion, network partitions ensuring team prepared for failures.
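The fix described above replaces an unbounded wait for a database connection with a bounded one, so a worker thread fails fast instead of blocking forever. The sketch below models that idea with a `Semaphore` standing in for the connection pool. `BoundedPool`, `PoolTimeoutException`, and `withConnection` are hypothetical names; a real service would set the equivalent timeout on its actual connection pool.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical exception raised when no connection frees up in time.
final class PoolTimeoutException extends RuntimeException {
    PoolTimeoutException(String message) { super(message); }
}

// Models a fixed-size connection pool whose callers wait at most acquireTimeoutMillis.
final class BoundedPool {
    private final Semaphore permits;
    private final long acquireTimeoutMillis;

    BoundedPool(int size, long acquireTimeoutMillis) {
        this.permits = new Semaphore(size, true); // fair: longest waiter is served first
        this.acquireTimeoutMillis = acquireTimeoutMillis;
    }

    // Runs `work` while holding one connection; throws instead of blocking indefinitely.
    <T> T withConnection(Supplier<T> work) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(acquireTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
        if (!acquired) {
            throw new PoolTimeoutException("no connection within " + acquireTimeoutMillis + " ms");
        }
        try {
            return work.get();
        } finally {
            permits.release(); // always return the connection, even when work throws
        }
    }

    int available() {
        return permits.availablePermits();
    }
}
```

With the incident's numbers this would be `new BoundedPool(50, 5_000)`: at most 50 concurrent database calls, and any caller that waits longer than 5 seconds gets an exception that the circuit breaker or fallback path can handle.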
PM disagreement: The PM wanted to show fake reviews as “recommended” to boost engagement metrics (30% more clicks on restaurants with 4.5+ fake ratings). I disagreed on ethical grounds (it violates user trust, and long-term brand damage outweighs short-term engagement) and presented data: users who discover fake reviews churn 2x faster (retention 60% vs 80%), and a competitor was fined ₹10Cr for fake reviews, which creates legal risk. I proposed an alternative: highlight verified reviews with a “Verified Purchase” badge, increasing trust without deception. The PM had business constraints I didn’t fully understand (investor pressure for engagement metrics), so after voicing my concerns I accepted their decision, implemented it cleanly behind a feature flag to allow quick rollback, and logged the learning for the future: understand the business context before pushing back.
Speed vs quality: The PM needed a feature in 2 days for an investor demo, against my estimate of 2 weeks for a “proper” implementation with tests, monitoring, and documentation. I chose a thin TRD slice: built an MVP with core functionality (restaurant search, basic filters) behind a feature flag, skipped edge cases (advanced filters, sorting), wrote minimal tests (happy path only), added TODO comments for technical debt, and shipped in 2 days. After the demo, I allocated 1 week to refactoring (comprehensive tests, error handling, monitoring) so the technical debt did not accumulate. This demonstrates bias for action (ship fast, iterate) over perfectionism (wait 2 weeks, miss the opportunity).
Data-driven decision: I had to choose between two caching strategies for restaurant search. Option A: cache entire search results (faster at 50ms response, but stale: a 5-min TTL means new restaurants stay invisible for up to 5 mins). Option B: cache only restaurant metadata and recompute search results (slower at 200ms response, but fresh: new restaurants are visible immediately). I analyzed the metrics: 95% of searches repeat within 5 mins (users browse the same area multiple times), new restaurant additions are <10/hour (low freshness requirement), and user complaints about slow search were 10x higher than complaints about stale results (speed matters more than freshness). I chose Option A, accepting 5-min staleness for a 4x faster response, and monitored the impact: search latency p95 dropped 200ms→50ms, user engagement rose 15%, and complaints about missing restaurants stayed <1%, validating the decision.
Ownership mindset: The delivery time SLA was missed in 3 cities (Mumbai, Delhi, Bangalore). I didn’t blame external factors (“rider supply low”) and instead owned the outcome (“my allocation algorithm doesn’t optimize for SLA under supply-constrained scenarios”). I analyzed the root cause (the algorithm prioritized distance over delivery time and assigned far riders to close orders when nearby riders were busy), proposed an improved model (multi-objective optimization balancing distance, estimated delivery time, and rider workload), and tracked progress (SLA compliance 70%→85% over 2 months). This demonstrates taking responsibility for outcomes, using data for diagnosis, and implementing systematic fixes rather than band-aids.
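Option A from the caching decision above amounts to a time-to-live cache keyed by the search query. The sketch below is a minimal illustration under stated assumptions: `TtlCache` is a hypothetical class, the clock is injected so expiry can be exercised without sleeping, and a production setup would more likely rely on Redis key expiry or an existing caching library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical whole-result cache: entries are served until their TTL elapses.
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value, or calls `loader` when the entry is missing or expired.
    // Two threads racing on the same expired key may both load; the last write wins.
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || now >= entry.expiresAtMillis) {
            entry = new Entry<>(loader.apply(key), now + ttlMillis);
            entries.put(key, entry);
        }
        return entry.value;
    }
}
```

Constructed as `new TtlCache<>(300_000L, System::currentTimeMillis)`, repeat searches inside the 5-minute window skip the expensive recomputation, which is the trade the 95% repeat rate justifies. A restaurant added during that window stays invisible until the entry expires.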
Code
Production Incident Response (Monitoring + Alerting):
// Circuit Breaker Pattern for Payment Service
@Service
public class PaymentService {
@CircuitBreaker(name = "payment", fallbackMethod = "fallbackPayment")
@Retry(name = "payment", fallbackMethod = "fallbackPayment")
@TimeLimiter(name = "payment")
public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
return CompletableFuture.supplyAsync(() -> {
// Primary payment processing
return paymentGateway.charge(request);
});
}
public CompletableFuture<PaymentResponse> fallbackPayment(
PaymentRequest request,
Exception e
) {
log.warn("Payment service degraded, using fallback", e);
// Synchronous fallback processing (slower but reliable)
return CompletableFuture.completedFuture(
synchronousPaymentProcessor.process(request)
);
}
}
// Prometheus Metrics
@Component
public class PaymentMetrics {
private final Counter paymentsProcessed = Counter.builder("payments_processed_total")
.description("Total payments processed")
.tag("status", "success")
.register(Metrics.globalRegistry);
private final Counter paymentsFailed = Counter.builder("payments_failed_total")
.description("Total payments failed")
.tag("error_type", "timeout")
.register(Metrics.globalRegistry);
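private final Timer paymentLatency = Timer.builder("payment_processing_duration")
.description("Payment processing latency")
.publishPercentiles(0.5, 0.95, 0.99)
.publishPercentileHistogram() // emits the _bucket series that histogram_quantile() needs
.register(Metrics.globalRegistry);
// Gauges are registered with Gauge.builder against the payment executor they observe
public PaymentMetrics(@Qualifier("paymentTaskExecutor") ThreadPoolTaskExecutor executor) {
Gauge.builder("thread_pool_active_threads", executor, ThreadPoolTaskExecutor::getActiveCount)
.description("Active threads in payment pool")
.register(Metrics.globalRegistry);
// Denominator for the ThreadPoolExhaustion alert
Gauge.builder("thread_pool_max_threads", executor, ThreadPoolTaskExecutor::getMaxPoolSize)
.description("Maximum threads in payment pool")
.register(Metrics.globalRegistry);
Gauge.builder("thread_pool_queue_size", executor, e -> e.getThreadPoolExecutor().getQueue().size())
.description("Queued tasks in payment pool")
.register(Metrics.globalRegistry);
}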
}
// Thread Pool Configuration
@Configuration
public class ThreadPoolConfig {
@Bean(name = "paymentTaskExecutor")
public ThreadPoolTaskExecutor paymentTaskExecutor() {
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(20);
executor.setMaxPoolSize(50);
executor.setQueueCapacity(100);
executor.setThreadNamePrefix("payment-");
executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
executor.initialize();
return executor;
}
}
// Distributed Tracing with Sleuth
@Service
public class OrderService {
@NewSpan("create-order")
public Order createOrder(OrderRequest request) {
Span span = tracer.currentSpan();
span.tag("order.id", request.getOrderId());
span.tag("restaurant.id", request.getRestaurantId());
try {
Order order = processOrder(request);
span.tag("order.status", "success");
return order;
} catch (Exception e) {
span.tag("order.status", "failed");
span.tag("error", e.getMessage());
throw e;
}
}
}
Alert Configuration (Prometheus + AlertManager):
groups:
  - name: payment_service_alerts
    rules:
      - alert: HighPaymentErrorRate
        expr: rate(payments_failed_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "High payment error rate detected"
          description: "Payment error rate is {{ $value }} errors/sec (threshold: 0.01)"
          runbook_url: "https://wiki.swiggy.com/runbooks/payment-errors"
      - alert: HighPaymentLatency
        # Micrometer exports the timer with a _seconds suffix; quantiles are computed over bucket rates
        expr: histogram_quantile(0.95, sum(rate(payment_processing_duration_seconds_bucket[5m])) by (le)) > 1.0
        for: 5m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Payment p95 latency exceeds 1s"
          description: "Current p95 latency: {{ $value }}s (SLA: <1s)"
      - alert: ThreadPoolExhaustion
        expr: thread_pool_active_threads / thread_pool_max_threads > 0.9
        for: 3m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Thread pool near exhaustion"
          description: "{{ $value | humanizePercentage }} of threads in use"
          action: "Scale up payment service instances or increase thread pool size"
      - alert: HighQueueDepth
        expr: thread_pool_queue_size > 50
        for: 2m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Payment queue depth high"
          description: "{{ $value }} tasks queued (threshold: 50)"
  # Grafana Dashboard Query Examples
  - name: Payment Service Dashboard
    panels:
      - title: "Payment Success Rate"
        expr: |
          sum(rate(payments_processed_total{status="success"}[5m])) /
          sum(rate(payments_processed_total[5m])) * 100
      - title: "P95 Latency"
        expr: |
          histogram_quantile(0.95,
            sum(rate(payment_processing_duration_seconds_bucket[5m])) by (le)
          )
      - title: "Error Rate by Type"
        expr: |
          sum(rate(payments_failed_total[5m])) by (error_type)
Incident Response Runbook:
# Payment Service Incident Response
## Symptoms
- High error rate (>1%)
- P95 latency >1s
- Thread pool exhaustion
## Immediate Actions
1. Check Grafana dashboard for metrics
2. Check Jaeger for distributed traces
3. Check application logs for exceptions
## Common Causes & Fixes
### Thread Pool Exhaustion
**Cause**: Database connection pool exhausted
**Fix**:
- Increase connection pool size (20 → 50)
- Add connection timeout (5s)
- Deploy with feature flag (10% traffic first)
### High Latency
**Cause**: Slow database queries
**Fix**:
- Check slow query log
- Add missing indexes
- Enable query caching
### High Error Rate
**Cause**: Payment gateway timeout
**Fix**:
- Enable circuit breaker fallback
- Switch to synchronous processing
- Contact payment gateway support
## Escalation
- L1: On-call engineer (15 min response)
- L2: Payment team lead (30 min response)
- L3: VP Engineering (1 hour response)