Walmart Software Engineer
1. Design a high-scale, consistent, distributed inventory system supporting BOPIS for 5,000+ locations under peak load
Difficulty Level: Extreme
Engineering Level: Senior Software Engineer, Staff Engineer, Principal Engineer
Source: GeeksforGeeks (2024), SystemDesignHandbook.com (November 2024), Reddit r/cscareerquestions
Team: Supply Chain Technology, Store Systems
Interview Round: System Design Round (Round 2 or 3)
Question: “Design a distributed system that tracks inventory in real-time across Walmart’s 5,000+ stores and distribution centers. The system must handle millions of concurrent updates, ensure consistency between online and physical inventory, support BOPIS (Buy-Online-Pickup-In-Store), and maintain high availability during Black Friday traffic spikes.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Real-time inventory updates across all channels (in-store POS, online orders, warehouse transfers)
- Support for 5,000+ stores and 150 distribution centers
- BOPIS reservation system with 15-min hold periods
- Multi-channel inventory visibility (web, mobile, in-store kiosks)
- RFID integration for automated tracking (16% stockout reduction)
Non-Functional Requirements:
- Scale: 10M+ updates/day, 100K+ concurrent requests, P95 latency <150ms reads/<800ms writes
- Availability: 99.99% uptime with auto-scaling during Black Friday (10x traffic)
- Consistency: Eventual consistency for reads (5-min cache), strong consistency for reservations
- Cost: ~$50K/month infrastructure (PostgreSQL, Redis, Kafka)
Key Design Decisions:
- CAP Theorem: AP (availability + partition tolerance) for inventory reads and cross-channel visibility; reservations keep the strong consistency called out above
- Database: Shard by store_id across 10 nodes; optimistic locking with a version field (sketched after this list)
- Caching: Redis with 5-min TTL, 85%+ hit rate, cache-aside pattern
- Concurrency: Distributed locks (Redis) for updates, timeout 5s
- Monitoring: Track inventory accuracy >99%, cache hit rate, consumer lag <1 min
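A minimal sketch of the optimistic-locking decision above, assuming a JPA entity (the field names mirror the inventory schema later in this answer; nothing here is from the original code): the JPA provider appends WHERE version = ? to each UPDATE, increments the value on success, and raises OptimisticLockException when a concurrent writer got there first.

import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Version;

@Entity
public class Inventory {
    @Id
    private Long id;

    private int availableQuantity;

    // Compared-and-incremented by JPA on every UPDATE; a stale write matches
    // zero rows and surfaces as an OptimisticLockException to retry or fail.
    @Version
    private int version;
}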
System Architecture
High-Level Design:
┌─────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Web App] [Mobile App] [In-Store POS] [Store Associate Devices]│
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ API GATEWAY LAYER │
│ [Load Balancer] → [API Gateway Cluster] │
│ Rate Limiting (100 req/s/user) | Auth | Routing │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Inventory │ │ Reservation │ │ Analytics │
│ Service │ │ Service │ │ Service │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────────────┐
│ MESSAGE QUEUE (Kafka) │
│ Topics: inventory-updates, reservations │
│ 10 consumers/group, 30-day retention │
└────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Warehouse │ │ Real-Time │ │ Supplier │
│ Management │ │ Processor │ │ Integration │
└──────────────┘ └──────────────┘ └──────────────┘
│
▼
┌────────────────────────────────────────────┐
│ DATA PERSISTENCE LAYER │
│ [PostgreSQL Cluster - 10 shards by store] │
│ [Redis Cache - 5min TTL] [Cassandra] │
└────────────────────────────────────────────┘
Scalability & Performance:
- Throughput: 100K reads/sec (cached), 10K writes/sec (Kafka buffered)
- Auto-scaling: Kubernetes HPA, pre-warm 2hrs before Black Friday
- Degraded mode: Serve cached (possibly stale) data if the database is unavailable (see the sketch after this list)
- Event sourcing: Complete audit trail in Cassandra, replay capability
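A hedged sketch of the degraded-mode bullet above: a read path that falls back to the last cached value when the database is down. It is written as a method that could sit in the InventoryService shown below; DataAccessException is Spring's root data-access exception, and the cache types match that service.

// Degraded-mode read for the InventoryService below: if the primary database
// is unreachable, serve the last cached value instead of failing the request.
public InventoryResponse getInventoryWithFallback(String productId, String storeId) {
    String cacheKey = "inventory:" + productId + ":" + storeId;
    try {
        Inventory fresh = inventoryRepo.findByProductAndStore(productId, storeId);
        redisTemplate.opsForValue().set(cacheKey, fresh, 300, TimeUnit.SECONDS);
        return InventoryResponse.from(fresh);
    } catch (DataAccessException dbUnavailable) {
        InventoryCache cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) {
            return InventoryResponse.fromCache(cached); // stale but available
        }
        throw dbUnavailable; // nothing cached either; surface the outage
    }
}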
Code
Inventory Service (Java + Spring Boot):
@Service
public class InventoryService {
    @Autowired private InventoryRepository inventoryRepo;
    @Autowired private RedisTemplate redisTemplate;
    @Autowired private DistributedLockManager lockManager;
    @Autowired private KafkaTemplate<String, InventoryEvent> kafkaTemplate;

    // Get inventory with cache-aside pattern
    public InventoryResponse getInventory(String productId, String storeId) {
        String cacheKey = "inventory:" + productId + ":" + storeId;
        InventoryCache cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null && !cached.isExpired()) {
            return InventoryResponse.fromCache(cached);
        }
        Inventory inventory = inventoryRepo.findByProductAndStore(productId, storeId);
        redisTemplate.opsForValue().set(cacheKey, inventory, 300, TimeUnit.SECONDS);
        return InventoryResponse.from(inventory);
    }

    // Update inventory with distributed locking
    @Transactional
    public void updateInventory(InventoryUpdateRequest request) {
        String lockKey = "inventory:lock:" + request.getProductId() + ":" + request.getStoreId();
        if (!lockManager.tryLock(lockKey, 5, TimeUnit.SECONDS)) {
            throw new InventoryLockException("Could not acquire lock");
        }
        try {
            Inventory inventory = inventoryRepo.findByProductAndStoreForUpdate(
                request.getProductId(), request.getStoreId());
            int newQuantity = inventory.getAvailableQuantity() + request.getQuantityDelta();
            if (newQuantity < 0) throw new InsufficientInventoryException();
            inventory.setAvailableQuantity(newQuantity);
            inventory.setVersion(inventory.getVersion() + 1);
            inventoryRepo.save(inventory);
            // Publish event to Kafka
            kafkaTemplate.send("inventory-updates", createEvent(request, newQuantity));
            // Invalidate cache
            redisTemplate.delete("inventory:" + request.getProductId() + ":" + request.getStoreId());
        } finally {
            lockManager.unlock(lockKey);
        }
    }

    // Reserve inventory for BOPIS with TTL
    @Transactional
    public ReservationResponse reserveInventory(ReservationRequest request) {
        Inventory inventory = inventoryRepo.findByProductAndStoreForUpdate(
            request.getProductId(), request.getStoreId());
        if (inventory.getAvailableQuantity() < request.getQuantity()) {
            return ReservationResponse.failed("Insufficient inventory");
        }
        Reservation reservation = Reservation.builder()
            .reservationId(UUID.randomUUID().toString())
            .expiresAt(Instant.now().plus(15, ChronoUnit.MINUTES))
            .build();
        inventory.setAvailableQuantity(inventory.getAvailableQuantity() - request.getQuantity());
        inventory.setReservedQuantity(inventory.getReservedQuantity() + request.getQuantity());
        inventoryRepo.save(inventory);
        redisTemplate.opsForValue().set("reservation:" + reservation.getReservationId(),
            reservation, 15, TimeUnit.MINUTES);
        return ReservationResponse.success(reservation.getReservationId());
    }
}
Database Schema (PostgreSQL):
-- Inventory table, hash-partitioned by store_id for horizontal scaling
CREATE TABLE inventory (
id BIGSERIAL,
product_id VARCHAR(50) NOT NULL,
store_id VARCHAR(20) NOT NULL,
available_quantity INT NOT NULL DEFAULT 0,
reserved_quantity INT NOT NULL DEFAULT 0,
total_quantity INT GENERATED ALWAYS AS (available_quantity + reserved_quantity) STORED,
last_updated TIMESTAMP NOT NULL DEFAULT NOW(),
version INT NOT NULL DEFAULT 1, -- For optimistic locking
PRIMARY KEY (id, store_id), -- Postgres requires the partition key in the PK
CONSTRAINT positive_quantity CHECK (available_quantity >= 0),
CONSTRAINT unique_product_store UNIQUE (product_id, store_id)
) PARTITION BY HASH (store_id);
-- Create partitions for horizontal scaling
CREATE TABLE inventory_p0 PARTITION OF inventory FOR VALUES WITH (MODULUS 10, REMAINDER 0);
CREATE TABLE inventory_p1 PARTITION OF inventory FOR VALUES WITH (MODULUS 10, REMAINDER 1);
-- ... up to p9
-- Indexes for fast lookups
CREATE INDEX idx_inventory_product ON inventory (product_id);
CREATE INDEX idx_inventory_store ON inventory (store_id);
CREATE INDEX idx_inventory_updated ON inventory (last_updated);
-- Event sourcing table for audit trail
CREATE TABLE inventory_events (
event_id UUID,
event_type VARCHAR(20) NOT NULL, -- SALE, RESTOCK, RETURN, TRANSFER
product_id VARCHAR(50) NOT NULL,
store_id VARCHAR(20) NOT NULL,
quantity_delta INT NOT NULL,
new_quantity INT NOT NULL,
source VARCHAR(20) NOT NULL, -- POS, ONLINE, WAREHOUSE, RFID
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
metadata JSONB,
PRIMARY KEY (event_id, timestamp) -- includes the range partition key
) PARTITION BY RANGE (timestamp);
-- Partition by month for efficient archival
CREATE TABLE inventory_events_2025_01 PARTITION OF inventory_events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
Kafka Event Consumer:
@Service
public class InventoryEventConsumer {
    @Autowired private AnalyticsService analyticsService;
    @Autowired private NotificationService notificationService;
    @Autowired private SupplierService supplierService;
    @Autowired private KafkaTemplate<String, InventoryEvent> kafkaTemplate;

    @KafkaListener(topics = "inventory-updates", concurrency = "10")
    public void processInventoryEvent(InventoryEvent event) {
        try {
            analyticsService.recordInventoryChange(event);
            // Low stock alert and auto-reorder
            if (event.getNewQuantity() <= 10) {
                notificationService.sendLowStockAlert(event);
                if (isAutoReorderEnabled(event.getProductId())) {
                    supplierService.createPurchaseOrder(event);
                }
            }
            updateSearchIndex(event);
        } catch (Exception e) {
            // Dead-letter queue for poison messages
            kafkaTemplate.send("inventory-updates-dlq", event);
        }
    }
}
RFID Integration:
@Service
public class RFIDInventoryService {
    @Autowired private InventoryService inventoryService;
    @Autowired private AuditService auditService;

    // Process RFID scans for real-time tracking (16% stockout reduction)
    public void processRFIDScan(RFIDScanEvent scanEvent) {
        Map<String, Integer> productCounts = groupByProduct(scanEvent.getElectronicProductCodes());
        for (Map.Entry<String, Integer> entry : productCounts.entrySet()) {
            InventoryReconciliation reconciliation = reconcileInventory(
                entry.getKey(), scanEvent.getStoreId(), entry.getValue());
            if (reconciliation.hasDrift()) {
                inventoryService.updateInventory(
                    InventoryUpdateRequest.rfidReconciliation(entry.getKey(),
                        scanEvent.getStoreId(), reconciliation.getDrift()));
                auditService.logInventoryDrift(reconciliation);
            }
        }
    }
}
2. Design a rating and review system for e-commerce that prevents fake reviews at scale
Difficulty Level: Very Hard
Engineering Level: Software Engineer III (SDE-3), Senior Engineer
Source: YouTube interview by Sanket Singh (March 2024), Reddit r/developersIndia
Team: E-commerce Platform, Marketplace
Interview Round: Technical Round 2 - High Level Design (45 minutes)
Question: “Design a complete rating and review system for Walmart’s e-commerce platform handling millions of products. The system must prevent fake reviews, scale to support different product categories (groceries to electronics), allow verified purchase reviews, implement fraud detection mechanisms, and be extensible for Walmart Marketplace integration.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Submit reviews with star rating (1-5), text, images/videos
- Verified purchase badge for authenticated buyers
- Review helpfulness voting (upvote/downvote)
- Seller response capability
- Filter/sort reviews (most helpful, recent, rating)
- Report inappropriate reviews
Non-Functional Requirements:
- Scale: Support 10M+ products, 100M+ reviews, 1M writes/day
- Fraud Detection: ML-based fake review detection (>95% accuracy)
- Latency: <100ms review retrieval, <2s review submission with fraud check
- Availability: 99.9% uptime, eventual consistency acceptable
- Cost: ~$30K/month (database, ML inference, storage)
Key Design Decisions:
- Database: PostgreSQL for reviews + Elasticsearch for search/filtering
- Fraud Detection: Real-time ML scoring, block suspicious patterns (IP clustering, burst reviews)
- Storage: S3 for images/videos with CDN caching
- Rate Limiting: 5 reviews/product/user, 10 reviews/day/user (Redis-counter enforcement sketched after this list)
- Moderation: Hybrid approach (ML auto-moderation + human review queue)
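One way to enforce the rate limits above with plain Redis counters; a sketch assuming Spring Data Redis (the key layout and RateLimitExceededException are illustrative, not part of the review service below):

import java.time.Duration;
import java.time.LocalDate;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class ReviewRateLimiter {
    @Autowired private StringRedisTemplate redis;

    public void enforceLimits(String userId, String productId) {
        // Lifetime cap: at most 5 reviews per (user, product) pair
        Long perProduct = redis.opsForValue()
            .increment("rl:reviews:" + userId + ":" + productId, 1);
        if (perProduct != null && perProduct > 5) {
            throw new RateLimitExceededException("Per-product review limit reached");
        }
        // Daily cap: at most 10 reviews per user per calendar day
        String dailyKey = "rl:reviews:" + userId + ":" + LocalDate.now();
        Long daily = redis.opsForValue().increment(dailyKey, 1);
        redis.expire(dailyKey, Duration.ofDays(1));
        if (daily != null && daily > 10) {
            throw new RateLimitExceededException("Daily review limit reached");
        }
    }
}

class RateLimitExceededException extends RuntimeException {
    RateLimitExceededException(String message) { super(message); }
}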
System Architecture
High-Level Design:
┌──────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Web App] [Mobile App] [Seller Dashboard] │
└──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ API GATEWAY │
│ Rate Limiting | Authentication | Request Validation │
└──────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Review │ │ Fraud │ │ Search │
│ Service │ │ Detection │ │ Service │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────────┐
│ MESSAGE QUEUE (Kafka) │
│ Topics: review-submitted, moderation │
└────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ ML Fraud │ │ Moderation │ │ Analytics │
│ Pipeline │ │ Queue │ │ Service │
└──────────────┘ └──────────────┘ └──────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ [PostgreSQL - Reviews] [Elasticsearch - Search] │
│ [Redis - Cache] [S3 - Media] [ML Models - SageMaker] │
└──────────────────────────────────────────────────────────┘
Scalability & Performance:
- Read Optimization: Elasticsearch with aggregations for filtering, Redis cache for popular products
- Write Throughput: Async processing via Kafka, batch ML inference
- Media Handling: Direct S3 upload with pre-signed URLs (sketched after this list), CloudFront CDN
- Fraud Detection: Real-time scoring <500ms, batch analysis nightly
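A sketch of the pre-signed upload flow using the AWS SDK v1 API (the bucket name and 10-minute expiry are assumptions): the browser PUTs the image straight to S3, so media bytes never transit the review service.

import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import java.net.URL;
import java.util.Date;

public class MediaUploadUrls {
    private final AmazonS3 s3;

    public MediaUploadUrls(AmazonS3 s3) { this.s3 = s3; }

    // Returns a short-lived URL the client can PUT the image to directly.
    public URL presignedUploadUrl(String reviewId, String fileName) {
        Date expiry = new Date(System.currentTimeMillis() + 10 * 60 * 1000); // 10 min
        GeneratePresignedUrlRequest request =
            new GeneratePresignedUrlRequest("review-media-bucket", // assumed bucket
                                            "reviews/" + reviewId + "/" + fileName)
                .withMethod(HttpMethod.PUT)
                .withExpiration(expiry);
        return s3.generatePresignedUrl(request);
    }
}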
Code
Review Service (Java + Spring Boot):
@Service
public class ReviewService {
    @Autowired private ReviewRepository reviewRepo;
    @Autowired private FraudDetectionService fraudService;
    @Autowired private OrderService orderService;
    @Autowired private KafkaTemplate<String, ReviewEvent> kafkaTemplate;
    @Autowired private RedisTemplate redisTemplate;
    @Autowired private ElasticsearchOperations elasticsearchTemplate;

    // Submit review with fraud detection
    @Transactional
    public ReviewResponse submitReview(ReviewRequest request) {
        validateReviewEligibility(request.getUserId(), request.getProductId());
        Review review = Review.builder()
            .id(UUID.randomUUID().toString())
            .productId(request.getProductId())
            .userId(request.getUserId())
            .rating(request.getRating())
            .content(request.getContent())
            .isVerifiedPurchase(orderService.hasUserPurchased(request.getUserId(), request.getProductId()))
            .status(ReviewStatus.PENDING)
            .build();

        // Fraud detection scoring
        FraudScore fraudScore = fraudService.calculateFraudScore(review, request.getMetadata());
        review.setFraudScore(fraudScore.getScore());
        if (fraudScore.getScore() > 0.8) {
            review.setStatus(ReviewStatus.REJECTED);
        } else if (fraudScore.getScore() > 0.5) {
            review.setStatus(ReviewStatus.PENDING_MANUAL_REVIEW);
        } else {
            review.setStatus(ReviewStatus.APPROVED);
        }
        reviewRepo.save(review);
        kafkaTemplate.send("review-submitted", ReviewEvent.from(review));
        redisTemplate.delete("reviews:" + request.getProductId());
        return ReviewResponse.success(review.getId(), review.getStatus());
    }

    // Get reviews with caching
    public ReviewListResponse getReviews(String productId, ReviewFilter filter) {
        String cacheKey = "reviews:" + productId + ":" + filter;
        ReviewStats cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) return ReviewListResponse.fromCache(cached);
        SearchHits<Review> hits = elasticsearchTemplate.search(buildQuery(productId, filter), Review.class);
        ReviewStats stats = calculateReviewStats(productId, hits);
        redisTemplate.opsForValue().set(cacheKey, stats, 10, TimeUnit.MINUTES);
        return ReviewListResponse.from(stats);
    }
}
Fraud Detection Service (Python + ML):
from sklearn.ensemble import RandomForestClassifier
import redis

class FraudDetectionService:
    def __init__(self):
        self.model = self.load_trained_model()
        self.redis = redis.Redis()

    def calculate_fraud_score(self, review, metadata):
        # Blend the ML model's output with rule-based pattern scoring
        ml_score = self.model.predict_proba([self.extract_features(review)])[0][1]
        pattern_score = self.check_suspicious_patterns(review, metadata)
        return FraudScore(score=0.7 * ml_score + 0.3 * pattern_score)

    def extract_features(self, review):
        return [
            len(review.content), review.rating,
            self.calculate_sentiment_score(review.content),
            self.get_user_review_count(review.user_id),
            self.get_account_age_days(review.user_id),
            int(review.is_verified_purchase),
            self.calculate_text_similarity(review),
            self.get_review_velocity(review.user_id)
        ]

    def check_suspicious_patterns(self, review, metadata):
        score = 0.0
        # IP burst detection: more than 10 reviews from one IP in 24h
        ip_key = f"reviews:ip:{metadata.ip_address}:24h"
        if self.redis.incr(ip_key) > 10:
            score += 0.4
        self.redis.expire(ip_key, 86400)
        # New account with many reviews
        if self.get_account_age_days(review.user_id) < 7 and \
                self.get_user_review_count(review.user_id) > 5:
            score += 0.3
        # Template text detection
        if self.is_template_text(review.content):
            score += 0.2
        # Extreme sentiment pattern: identical rating on every review
        ratings = self.get_user_rating_history(review.user_id)
        if len(ratings) > 3 and len(set(ratings)) == 1:
            score += 0.2
        return min(score, 1.0)
Database Schema (PostgreSQL):
-- Reviews table
CREATE TABLE reviews (
id VARCHAR(36) PRIMARY KEY,
product_id VARCHAR(50) NOT NULL,
user_id VARCHAR(36) NOT NULL,
rating INT NOT NULL CHECK (rating >= 1 AND rating <= 5),
title VARCHAR(200),
content TEXT NOT NULL,
is_verified_purchase BOOLEAN DEFAULT FALSE,
status VARCHAR(20) NOT NULL, -- PENDING, APPROVED, REJECTED, PENDING_MANUAL_REVIEW
fraud_score DECIMAL(3,2),
rejection_reason TEXT,
helpful_count INT DEFAULT 0,
not_helpful_count INT DEFAULT 0,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT unique_user_product UNIQUE (user_id, product_id)
);
CREATE INDEX idx_reviews_product ON reviews (product_id, status, created_at DESC);
CREATE INDEX idx_reviews_user ON reviews (user_id);
CREATE INDEX idx_reviews_fraud ON reviews (fraud_score) WHERE fraud_score > 0.5;
-- Review media (images/videos)
CREATE TABLE review_media (
id BIGSERIAL PRIMARY KEY,
review_id VARCHAR(36) NOT NULL REFERENCES reviews(id) ON DELETE CASCADE,
media_type VARCHAR(10) NOT NULL, -- IMAGE, VIDEO
s3_key VARCHAR(500) NOT NULL,
cdn_url VARCHAR(500) NOT NULL,
uploaded_at TIMESTAMP NOT NULL DEFAULT NOW()
);
-- Review helpfulness votes
CREATE TABLE review_votes (
review_id VARCHAR(36) NOT NULL REFERENCES reviews(id) ON DELETE CASCADE,
user_id VARCHAR(36) NOT NULL,
vote_type VARCHAR(10) NOT NULL, -- HELPFUL, NOT_HELPFUL
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (review_id, user_id)
);
-- Seller responses
CREATE TABLE seller_responses (
id BIGSERIAL PRIMARY KEY,
review_id VARCHAR(36) NOT NULL REFERENCES reviews(id) ON DELETE CASCADE,
seller_id VARCHAR(36) NOT NULL,
response_text TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT unique_review_response UNIQUE (review_id)
);
3. Implement a concurrent thread pool executor with dynamic task scheduling
Difficulty Level: Very Hard
Engineering Level: Staff Engineer, Principal Engineer
Source: LinkedIn post documenting Walmart Staff Engineer interview (September 2024)
Team: Walmart Global Tech, Platform Engineering
Interview Round: Low Level Design Round
Question: “Design and implement from scratch a thread pool executor that schedules tasks dynamically as soon as a thread becomes free. The implementation must handle thread lifecycle management, task queuing, load balancing across threads, graceful shutdown, exception handling, and support for different task priorities.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Fixed-size thread pool with configurable threads (e.g., 10 threads)
- Task queue with priority support (HIGH, NORMAL, LOW)
- Dynamic task scheduling (assign task when thread becomes free)
- Graceful shutdown (finish running tasks, reject new tasks)
- Task rejection policy when queue is full
- Task timeout support
Non-Functional Requirements:
- Thread Safety: No race conditions, proper synchronization
- Performance: Minimal locking overhead, <1ms task dispatch latency
- Memory: Bounded queue to prevent OOM (max 10,000 tasks)
- Monitoring: Track completed tasks, queue size, thread utilization
Key Design Decisions:
- Queue: PriorityBlockingQueue for task ordering
- Synchronization: ReentrantLock with Condition variables for thread coordination
- Rejection Policy: Throw exception when queue full (configurable)
- Shutdown: Two-phase (shutdown() vs shutdownNow())
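For orientation before the design, a short usage sketch of the executor implemented below (the class and enum names match that implementation):

import java.util.concurrent.Future;

public class ThreadPoolDemo {
    public static void main(String[] args) {
        CustomThreadPoolExecutor pool = new CustomThreadPoolExecutor(4, 1_000);
        Future<?> high = pool.submit(() -> System.out.println("high-priority task"),
                                     CustomThreadPoolExecutor.TaskPriority.HIGH);
        pool.submit(() -> System.out.println("low-priority task"),
                    CustomThreadPoolExecutor.TaskPriority.LOW);
        pool.shutdown(); // reject new submissions; workers drain the queue, then exit
    }
}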
System Architecture
Component Design:
┌────────────────────────────────────────────────────────────┐
│ ThreadPoolExecutor │
│ - Worker threads pool │
│ - Task queue (PriorityBlockingQueue) │
│ - Lifecycle management (RUNNING, SHUTDOWN, TERMINATED) │
└────────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
│ (Thread) │ │ (Thread) │ │ (Thread) │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────┐
│ PriorityBlockingQueue │
│ [Task 1] [Task 2] ... [Task N] │
└────────────────────────────────────┘
Code
Thread Pool Executor Implementation (Java):
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;
import java.util.concurrent.locks.ReentrantLock;

public class CustomThreadPoolExecutor {
    private final PriorityBlockingQueue<PriorityTask> taskQueue;
    private final List<Worker> workers;
    private final AtomicInteger activeThreads = new AtomicInteger(0);
    private final AtomicLong completedTasks = new AtomicLong(0);
    private volatile ExecutorState state = ExecutorState.RUNNING;
    private final ReentrantLock mainLock = new ReentrantLock();

    enum ExecutorState { RUNNING, SHUTDOWN, TERMINATED }

    public CustomThreadPoolExecutor(int corePoolSize, int maxQueueSize) {
        this.taskQueue = new PriorityBlockingQueue<>(maxQueueSize);
        this.workers = new ArrayList<>();
        for (int i = 0; i < corePoolSize; i++) {
            Worker worker = new Worker();
            workers.add(worker);
            worker.thread.start();
        }
    }

    public Future<?> submit(Runnable task, TaskPriority priority) {
        if (state != ExecutorState.RUNNING) {
            throw new RejectedExecutionException("Executor shutting down");
        }
        PriorityTask priorityTask = new PriorityTask(task, priority);
        if (!taskQueue.offer(priorityTask)) {
            throw new RejectedExecutionException("Queue full");
        }
        return priorityTask.future;
    }

    public void shutdown() {
        mainLock.lock();
        try {
            state = ExecutorState.SHUTDOWN;
            workers.forEach(w -> w.thread.interrupt());
        } finally {
            mainLock.unlock();
        }
    }

    // Worker thread: pulls the highest-priority task as soon as it is free
    private class Worker implements Runnable {
        final Thread thread = new Thread(this);

        @Override
        public void run() {
            while (state != ExecutorState.TERMINATED) {
                try {
                    PriorityTask task = (state == ExecutorState.RUNNING)
                        ? taskQueue.take()
                        : taskQueue.poll(100, TimeUnit.MILLISECONDS);
                    if (task != null) {
                        activeThreads.incrementAndGet();
                        try {
                            task.run();
                            completedTasks.incrementAndGet();
                        } catch (Exception e) {
                            task.future.completeExceptionally(e);
                        } finally {
                            activeThreads.decrementAndGet();
                        }
                    } else if (state == ExecutorState.SHUTDOWN) {
                        break; // queue drained after shutdown
                    }
                } catch (InterruptedException e) {
                    if (state == ExecutorState.TERMINATED) break;
                }
            }
        }
    }

    // Priority task wrapper: orders by priority, then FIFO within a priority
    static class PriorityTask implements Runnable, Comparable<PriorityTask> {
        final Runnable task;
        final TaskPriority priority;
        final long sequenceNumber;
        final CompletableFuture<Void> future = new CompletableFuture<>();
        static final AtomicLong sequencer = new AtomicLong(0);

        PriorityTask(Runnable task, TaskPriority priority) {
            this.task = task;
            this.priority = priority;
            this.sequenceNumber = sequencer.getAndIncrement();
        }

        @Override
        public void run() {
            try {
                task.run();
                future.complete(null);
            } catch (Exception e) {
                future.completeExceptionally(e);
                throw e;
            }
        }

        @Override
        public int compareTo(PriorityTask other) {
            int result = Integer.compare(other.priority.value, this.priority.value);
            return (result != 0) ? result : Long.compare(this.sequenceNumber, other.sequenceNumber);
        }
    }

    enum TaskPriority {
        LOW(1), NORMAL(5), HIGH(10);
        final int value;
        TaskPriority(int value) { this.value = value; }
    }
}
4. Design a high-concurrency ticket booking system handling race conditions and payment workflows
Difficulty Level: Very Hard
Engineering Level: Staff Engineer
Source: LinkedIn post on Walmart Staff Engineer interview (September 2024)
Team: Walmart Global Tech
Interview Round: System Design/High Level Design Round
Question: “Design a complete ticket booking system similar to BookMyShow in 45 minutes, covering seat selection with real-time availability, booking workflows, payment processing, handling high concurrency for popular shows, scalability for millions of users, and performance optimization. The system must prevent double-booking, handle payment failures gracefully, and provide responsive user experience.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Browse shows, theaters, and showtimes
- View seat map with real-time availability
- Reserve seats (hold for 10 minutes)
- Process payment and confirm booking
- Cancel bookings with refunds
Non-Functional Requirements:
- Concurrency: Handle 10K concurrent users, prevent double-booking
- Latency: <500ms seat selection, <3s booking confirmation
- Availability: 99.9% uptime, strong consistency for reservations
- Scale: 1M bookings/day, 100K shows
- Cost: ~$40K/month (DB, Redis, compute)
Key Design Decisions:
- Locking: Pessimistic locks using Redis distributed locks with 10-min TTL
- Payment: Two-phase commit (reserve → pay → confirm)
- Idempotency: Unique booking IDs plus client idempotency keys to make payment retries safe (sketched after this list)
- Cleanup: Scheduled job to release expired reservations
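A sketch of the idempotency decision above (the facade, key layout, and BookingResponse helpers are assumptions layered on the BookingService below): the first attempt records its booking under the client's idempotency key, so a retried payment callback returns the same booking instead of charging twice.

import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class IdempotentBookingFacade {
    @Autowired private RedisTemplate<String, String> redisTemplate;
    @Autowired private BookingService bookingService;

    public BookingResponse confirmOnce(String idempotencyKey,
                                       String reservationId,
                                       PaymentResponse payment) {
        String key = "idem:booking:" + idempotencyKey;
        // First caller wins; retries of the same request see the stored marker.
        Boolean first = redisTemplate.opsForValue()
            .setIfAbsent(key, "PENDING", 24, TimeUnit.HOURS);
        if (Boolean.FALSE.equals(first)) {
            // Duplicate request: return the previously created booking id.
            return BookingResponse.alreadyProcessed(redisTemplate.opsForValue().get(key));
        }
        BookingResponse response = bookingService.confirmBooking(reservationId, payment);
        redisTemplate.opsForValue().set(key, response.getBookingId(), 24, TimeUnit.HOURS);
        return response;
    }
}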
System Architecture
High-Level Design:
┌────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Web App] [Mobile App] [Booking Kiosk] │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ API GATEWAY + LOAD BALANCER │
│ Rate Limiting (100 req/s) | Session Management │
└────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Show │ │ Booking │ │ Payment │
│ Service │ │ Service │ │ Service │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────┐
│ MESSAGE QUEUE (Kafka) │
│ Topics: bookings, payments │
└────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ [PostgreSQL - Bookings] [Redis - Locks/Cache] │
│ [Elasticsearch - Search] [S3 - Tickets] │
└────────────────────────────────────────────────────────┘
Scalability & Performance:
- Seat Locking: Redis SETNX with automatic expiry prevents double-booking
- Read Optimization: Cache seat maps, invalidate on booking
- Write Optimization: Async notifications via Kafka
- Payment Retry: Idempotent APIs with deduplication keys
Code
Booking Service (Java + Spring Boot):
@Service
public class BookingService {
    @Autowired private RedisTemplate<String, String> redisTemplate;
    @Autowired private BookingRepository bookingRepo;
    @Autowired private PaymentService paymentService;
    @Autowired private KafkaTemplate<String, BookingEvent> kafkaTemplate;

    private static final int SEAT_LOCK_DURATION_MINUTES = 10;

    // Reserve seats with distributed locking
    @Transactional
    public ReservationResponse reserveSeats(ReservationRequest request) {
        List<String> lockedSeats = new ArrayList<>();
        try {
            // Try to acquire locks for all seats
            for (String seatId : request.getSeatIds()) {
                String lockKey = "seat:lock:" + request.getShowId() + ":" + seatId;
                Boolean lockAcquired = redisTemplate.opsForValue().setIfAbsent(
                    lockKey, request.getUserId(), SEAT_LOCK_DURATION_MINUTES, TimeUnit.MINUTES);
                if (Boolean.FALSE.equals(lockAcquired)) {
                    releaseLocks(request.getShowId(), lockedSeats);
                    return ReservationResponse.failed("Seat " + seatId + " already reserved");
                }
                lockedSeats.add(seatId);
            }
            // Create reservation
            Reservation reservation = Reservation.builder()
                .id(UUID.randomUUID().toString())
                .showId(request.getShowId())
                .userId(request.getUserId())
                .seatIds(request.getSeatIds())
                .status(ReservationStatus.PENDING_PAYMENT)
                .expiresAt(Instant.now().plus(SEAT_LOCK_DURATION_MINUTES, ChronoUnit.MINUTES))
                .build();
            bookingRepo.save(reservation);
            return ReservationResponse.success(reservation.getId(), reservation.getExpiresAt());
        } catch (Exception e) {
            releaseLocks(request.getShowId(), lockedSeats);
            throw new BookingException("Failed to reserve seats", e);
        }
    }

    // Confirm booking after payment
    @Transactional
    public BookingResponse confirmBooking(String reservationId, PaymentResponse payment) {
        Reservation reservation = bookingRepo.findById(reservationId)
            .orElseThrow(() -> new NotFoundException("Reservation not found"));
        if (reservation.getExpiresAt().isBefore(Instant.now())) {
            releaseLocks(reservation.getShowId(), reservation.getSeatIds());
            throw new ReservationExpiredException("Reservation expired");
        }
        if (!paymentService.verifyPayment(payment.getPaymentId())) {
            throw new PaymentVerificationException("Payment failed");
        }
        Booking booking = Booking.builder()
            .id(UUID.randomUUID().toString())
            .showId(reservation.getShowId())
            .userId(reservation.getUserId())
            .seatIds(reservation.getSeatIds())
            .paymentId(payment.getPaymentId())
            .status(BookingStatus.CONFIRMED)
            .build();
        bookingRepo.save(booking);
        updateSeatStatus(reservation.getShowId(), reservation.getSeatIds(), SeatStatus.BOOKED);
        releaseLocks(reservation.getShowId(), reservation.getSeatIds());
        kafkaTemplate.send("bookings", BookingEvent.from(booking));
        return BookingResponse.success(booking.getId());
    }

    // Cleanup expired reservations every minute
    @Scheduled(fixedRate = 60000)
    public void cleanupExpiredReservations() {
        List<Reservation> expired = bookingRepo.findExpiredReservations(Instant.now());
        expired.forEach(r -> {
            releaseLocks(r.getShowId(), r.getSeatIds());
            r.setStatus(ReservationStatus.EXPIRED);
            bookingRepo.save(r);
        });
    }

    private void releaseLocks(String showId, List<String> seatIds) {
        seatIds.forEach(seatId ->
            redisTemplate.delete("seat:lock:" + showId + ":" + seatId));
    }
}
Database Schema (PostgreSQL):
CREATE TABLE shows (
id VARCHAR(36) PRIMARY KEY,
movie_id VARCHAR(36) NOT NULL,
theater_id VARCHAR(36) NOT NULL,
show_time TIMESTAMP NOT NULL,
available_seats INT NOT NULL,
total_seats INT NOT NULL
);
CREATE TABLE seats (
id VARCHAR(36) PRIMARY KEY,
show_id VARCHAR(36) NOT NULL REFERENCES shows(id),
row_number VARCHAR(5) NOT NULL,
seat_number INT NOT NULL,
status VARCHAR(20) NOT NULL, -- AVAILABLE, RESERVED, BOOKED
CONSTRAINT unique_seat UNIQUE (show_id, row_number, seat_number)
);
CREATE TABLE reservations (
id VARCHAR(36) PRIMARY KEY,
show_id VARCHAR(36) NOT NULL,
user_id VARCHAR(36) NOT NULL,
seat_ids TEXT[] NOT NULL,
status VARCHAR(20) NOT NULL,
expires_at TIMESTAMP NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_reservations_expires ON reservations (expires_at)
WHERE status = 'PENDING_PAYMENT';
CREATE TABLE bookings (
id VARCHAR(36) PRIMARY KEY,
show_id VARCHAR(36) NOT NULL,
user_id VARCHAR(36) NOT NULL,
seat_ids TEXT[] NOT NULL,
payment_id VARCHAR(36) NOT NULL,
status VARCHAR(20) NOT NULL,
booked_at TIMESTAMP NOT NULL DEFAULT NOW()
);
5. Design a secure document vault with granular access controls and encryption
Difficulty Level: Hard
Engineering Level: Senior Software Engineer (SDE-3)
Source: Reddit r/developersIndia (September 2024)
Team: Walmart US - Full Stack Engineering
Interview Round: System Design Round 2 (conducted by Distinguished Engineer)
Question: “Design a secure document storage system similar to Google Drive or Dropbox with focus on multi-user access control. The system must support granular permissions (read, write, delete, share), hierarchical folder structures, sharing documents with different permission levels, encryption at rest and in transit, audit logging for compliance, and version control.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Upload/download documents and folders
- Hierarchical folder structure (nested folders)
- Granular permissions (READ, WRITE, DELETE, SHARE)
- Share files/folders with specific users or groups
- Version control (track document changes)
- Search documents by name, content, tags
Non-Functional Requirements:
- Security: Encryption at rest (AES-256), in transit (TLS 1.3)
- Scale: 1M users, 100M documents, 10PB storage
- Latency: <100ms permission checks, <1s file access
- Compliance: SOX, GDPR, HIPAA audit trails
- Cost: ~$35K/month (S3, RDS, KMS)
Key Design Decisions:
- Storage: S3 for files, PostgreSQL for metadata
- Access Control: RBAC with permission inheritance from parent folders
- Encryption: Envelope encryption (DEK + CMK via AWS KMS)
- Versioning: S3 versioning + metadata tracking
System Architecture
High-Level Design:
┌────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Web App] [Mobile App] [Desktop Sync] │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ API GATEWAY + WAF │
│ TLS Termination | Rate Limiting | DDoS Protection │
└────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Document │ │ Access │ │ Search │
│ Service │ │ Control │ │ Service │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────┐
│ ENCRYPTION SERVICE (KMS) │
│ Key Management | Encrypt/Decrypt │
└────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ [PostgreSQL - Metadata] [S3 - Files + Versioning] │
│ [Redis - Permission Cache] [Elasticsearch - Search] │
└────────────────────────────────────────────────────────┘
Scalability & Performance:
- Caching: Redis for permission checks (5-min TTL), CloudFront for downloads
- Sharding: Partition metadata by user_id for horizontal scaling
- Async Processing: Kafka for audit logs, virus scanning, thumbnail generation
Code
Access Control Service (Java + Spring Boot):
@Service
public class AccessControlService {
    @Autowired private PermissionRepository permissionRepo;
    @Autowired private RedisTemplate<String, Boolean> redisTemplate;
    @Autowired private AuditService auditService;
    @Autowired private NotificationService notificationService;

    // Check permission with inheritance
    public boolean hasPermission(String userId, String documentId, PermissionType required) {
        String cacheKey = "perm:" + userId + ":" + documentId + ":" + required;
        Boolean cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) return cached;

        // Check direct permissions
        Optional<Permission> direct = permissionRepo.findByUserAndDocument(userId, documentId);
        if (direct.isPresent() && direct.get().hasPermission(required)) {
            cacheResult(cacheKey, true);
            return true;
        }
        // Check inherited permissions from parent folders
        List<String> parentFolders = getParentFolderHierarchy(documentId);
        for (String folderId : parentFolders) {
            Optional<Permission> folderPerm = permissionRepo.findByUserAndDocument(userId, folderId);
            if (folderPerm.isPresent() && folderPerm.get().hasPermission(required)) {
                cacheResult(cacheKey, true);
                return true;
            }
        }
        // Check group permissions
        List<String> userGroups = getUserGroups(userId);
        for (String groupId : userGroups) {
            Optional<Permission> groupPerm = permissionRepo.findByGroupAndDocument(groupId, documentId);
            if (groupPerm.isPresent() && groupPerm.get().hasPermission(required)) {
                cacheResult(cacheKey, true);
                return true;
            }
        }
        cacheResult(cacheKey, false);
        return false;
    }

    // Share document with another user
    @Transactional
    public ShareResponse shareDocument(ShareRequest request) {
        if (!hasPermission(request.getOwnerId(), request.getDocumentId(), PermissionType.SHARE)) {
            throw new UnauthorizedException("No share permission");
        }
        Permission permission = Permission.builder()
            .id(UUID.randomUUID().toString())
            .documentId(request.getDocumentId())
            .userId(request.getTargetUserId())
            .permissionType(request.getPermissionType())
            .grantedBy(request.getOwnerId())
            .grantedAt(Instant.now())
            .build();
        permissionRepo.save(permission);
        invalidatePermissionCache(request.getTargetUserId(), request.getDocumentId());
        auditService.logShare(request);
        notificationService.sendShareNotification(request.getTargetUserId(), request.getDocumentId());
        return ShareResponse.success(permission.getId());
    }

    private void cacheResult(String key, boolean result) {
        redisTemplate.opsForValue().set(key, result, 5, TimeUnit.MINUTES);
    }
}

enum PermissionType {
    READ(1), WRITE(2), DELETE(3), SHARE(4), OWNER(5);
    final int level;
    PermissionType(int level) { this.level = level; }
    public boolean isAtLeast(PermissionType other) { return this.level >= other.level; }
}
Encryption Service:
@Service
public class EncryptionService {
    @Autowired private KMSClient kmsClient;

    // Encrypt file using envelope encryption (AES-256-GCM for data, KMS for the key)
    public EncryptedFile encryptFile(byte[] fileData, String userId) throws GeneralSecurityException {
        // Generate Data Encryption Key (DEK)
        byte[] dataKey = kmsClient.generateDataKey(userId);
        // Encrypt file with DEK using AES-256-GCM
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        SecretKeySpec keySpec = new SecretKeySpec(dataKey, "AES");
        byte[] iv = generateIV();
        GCMParameterSpec gcmSpec = new GCMParameterSpec(128, iv);
        cipher.init(Cipher.ENCRYPT_MODE, keySpec, gcmSpec);
        byte[] encryptedData = cipher.doFinal(fileData);
        // Encrypt DEK with Customer Master Key (CMK)
        byte[] encryptedDEK = kmsClient.encrypt(dataKey, userId);
        // Clear plaintext DEK from memory
        Arrays.fill(dataKey, (byte) 0);
        return EncryptedFile.builder()
            .encryptedData(encryptedData)
            .encryptedDEK(encryptedDEK)
            .iv(iv)
            .build();
    }

    // Decrypt file
    public byte[] decryptFile(EncryptedFile encryptedFile, String userId) throws GeneralSecurityException {
        byte[] dataKey = kmsClient.decrypt(encryptedFile.getEncryptedDEK(), userId);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        SecretKeySpec keySpec = new SecretKeySpec(dataKey, "AES");
        GCMParameterSpec gcmSpec = new GCMParameterSpec(128, encryptedFile.getIv());
        cipher.init(Cipher.DECRYPT_MODE, keySpec, gcmSpec);
        byte[] decryptedData = cipher.doFinal(encryptedFile.getEncryptedData());
        Arrays.fill(dataKey, (byte) 0);
        return decryptedData;
    }
}
Database Schema (PostgreSQL):
CREATE TABLE documents (
id VARCHAR(36) PRIMARY KEY,
name VARCHAR(255) NOT NULL,
type VARCHAR(20) NOT NULL, -- FILE, FOLDER
parent_folder_id VARCHAR(36) REFERENCES documents(id),
owner_id VARCHAR(36) NOT NULL,
s3_key VARCHAR(500),
encrypted_dek TEXT,
file_size BIGINT,
version INT DEFAULT 1,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_documents_parent ON documents (parent_folder_id);
CREATE INDEX idx_documents_owner ON documents (owner_id);
CREATE TABLE permissions (
id VARCHAR(36) PRIMARY KEY,
document_id VARCHAR(36) NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
user_id VARCHAR(36),
group_id VARCHAR(36),
permission_type VARCHAR(20) NOT NULL, -- READ, WRITE, DELETE, SHARE, OWNER
granted_by VARCHAR(36) NOT NULL,
granted_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT user_or_group CHECK (
(user_id IS NOT NULL AND group_id IS NULL) OR (user_id IS NULL AND group_id IS NOT NULL)
)
);
CREATE INDEX idx_permissions_doc_user ON permissions (document_id, user_id);
CREATE TABLE audit_logs (
id BIGSERIAL,
user_id VARCHAR(36) NOT NULL,
document_id VARCHAR(36) NOT NULL,
action VARCHAR(50) NOT NULL, -- VIEW, DOWNLOAD, UPLOAD, DELETE, SHARE
ip_address INET,
timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (id, timestamp) -- Postgres requires the partition key in the PK
) PARTITION BY RANGE (timestamp);
6. Design a container image storage system similar to Amazon ECR
Difficulty Level: Hard
Engineering Level: Senior Software Engineer
Source: Reddit r/developersIndia (September 2024)
Team: Walmart Global Tech - Infrastructure
Interview Round: System Design Round 3 (conducted by Staff Engineer)
Question: “Design a container image storage and distribution system like Amazon Elastic Container Registry (ECR). The system must handle container image layers, implement deduplication to save storage, support versioning and tagging, provide fast image pull/push operations at scale, implement access control and vulnerability scanning, and optimize for both regional and global distribution.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Push/pull container images (Docker, OCI format)
- Image layer deduplication (same layers shared across images)
- Tagging and versioning (latest, v1.0, etc.)
- Registry API compatibility (Docker Registry v2)
- Vulnerability scanning integration
- Multi-region distribution
Non-Functional Requirements:
- Scale: 100K images, 1M layers, 1PB storage
- Latency: <2s image push, <5s pull (depending on size)
- Availability: 99.99% uptime
- Cost: ~$25K/month (S3, compute, data transfer)
Key Design Decisions:
- Storage: Content-addressable storage using SHA256 hashing (see the sketch after this list)
- Deduplication: Layer-level deduplication (save 60-70% storage)
- Distribution: Multi-region S3 + CloudFront CDN
- Access Control: IAM integration with registry tokens
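The content-addressing decision in miniature (a Java sketch to match the other examples in this write-up; the registry code below does the same thing in Go): because the key is a pure function of the bytes, identical layers collapse to a single stored object.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class ContentAddress {
    // A layer shared by 100 images hashes to the same digest, so it is stored
    // exactly once under layers/<digest>.
    public static String digestKey(byte[] layerBytes) throws NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(layerBytes);
        return "layers/sha256:" + HexFormat.of().formatHex(hash);
    }
}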
System Architecture
High-Level Design:
┌────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Docker CLI] [K8s] [CI/CD Pipelines] │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ API GATEWAY + AUTH │
│ Registry API v2 | Token Auth | Rate Limiting │
└────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Registry │ │ Scan │ │ Replica │
│ Service │ │ Service │ │ Service │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ [S3 - Content-Addressable Storage] [RDS - Metadata] │
│ [Redis - Cache] [CloudFront - CDN] │
└────────────────────────────────────────────────────────┘
Scalability & Performance:
- Content Addressing: Store layers by SHA256 digest for deduplication
- CDN: CloudFront for fast global pulls
- Garbage Collection: Remove unreferenced layers periodically
- Multi-region: S3 cross-region replication for HA
Code
Registry Service (Go):
package registry

import (
    "bytes"
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "io"
    "time"

    "github.com/google/uuid"
)

type RegistryService struct {
    s3Client     *S3Client
    metadataRepo *MetadataRepository
    cache        *RedisCache
}

// countingReader tracks bytes read so the layer size can be recorded.
type countingReader struct {
    r io.Reader
    n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
    n, err := c.r.Read(p)
    c.n += int64(n)
    return n, err
}

// PushLayer stores a layer under a key derived from its digest (content-addressable storage).
func (r *RegistryService) PushLayer(imageRepo string, layerData io.Reader) (*LayerResponse, error) {
    // Calculate the SHA256 digest and size while streaming to S3
    hash := sha256.New()
    counted := &countingReader{r: io.TeeReader(layerData, hash)}

    tempKey := fmt.Sprintf("temp/%s", uuid.New().String())
    if err := r.s3Client.Upload(tempKey, counted); err != nil {
        return nil, err
    }

    digest := "sha256:" + hex.EncodeToString(hash.Sum(nil))
    layerKey := fmt.Sprintf("layers/%s", digest)

    // Deduplication: only move the upload if this digest is new
    if !r.layerExists(layerKey) {
        if err := r.s3Client.Move(tempKey, layerKey); err != nil {
            return nil, err
        }
    } else {
        r.s3Client.Delete(tempKey) // layer already exists
    }

    layer := &Layer{
        Digest:     digest,
        Repository: imageRepo,
        Size:       counted.n,
        UploadedAt: time.Now(),
    }
    r.metadataRepo.SaveLayer(layer)
    return &LayerResponse{Digest: digest, Location: layerKey}, nil
}

// PullLayer returns a short-lived CDN URL for a layer.
func (r *RegistryService) PullLayer(digest string) (*LayerData, error) {
    cacheKey := fmt.Sprintf("layer:%s", digest)
    if cached, err := r.cache.Get(cacheKey); err == nil {
        return cached, nil
    }
    layerKey := fmt.Sprintf("layers/%s", digest)
    url, err := r.generateCDNUrl(layerKey)
    if err != nil {
        return nil, err
    }
    layerData := &LayerData{
        Digest:      digest,
        DownloadURL: url,
        Expires:     time.Now().Add(15 * time.Minute),
    }
    // Cache for future requests
    r.cache.Set(cacheKey, layerData, 10*time.Minute)
    return layerData, nil
}

// PushManifest validates layer references, stores the manifest, and moves the tag pointer.
func (r *RegistryService) PushManifest(imageRepo, tag string, manifest *Manifest) error {
    // Validate all layers exist
    for _, layer := range manifest.Layers {
        if !r.layerExists(fmt.Sprintf("layers/%s", layer.Digest)) {
            return fmt.Errorf("layer %s not found", layer.Digest)
        }
    }
    // Calculate the manifest digest
    manifestJSON, _ := json.Marshal(manifest)
    sum := sha256.Sum256(manifestJSON)
    manifestDigest := "sha256:" + hex.EncodeToString(sum[:])

    // Store the manifest
    manifestKey := fmt.Sprintf("manifests/%s/%s", imageRepo, manifestDigest)
    if err := r.s3Client.Upload(manifestKey, bytes.NewReader(manifestJSON)); err != nil {
        return err
    }
    // Update the tag pointer
    return r.metadataRepo.UpdateTag(imageRepo, tag, manifestDigest)
}

// GarbageCollect removes layers no manifest references anymore.
func (r *RegistryService) GarbageCollect() error {
    allLayers, err := r.metadataRepo.GetAllLayerDigests()
    if err != nil {
        return err
    }
    // Collect referenced layers from all manifests
    referencedLayers := make(map[string]bool)
    manifests, _ := r.metadataRepo.GetAllManifests()
    for _, manifest := range manifests {
        for _, layer := range manifest.Layers {
            referencedLayers[layer.Digest] = true
        }
    }
    // Delete unreferenced layers
    for _, layerDigest := range allLayers {
        if !referencedLayers[layerDigest] {
            r.s3Client.Delete(fmt.Sprintf("layers/%s", layerDigest))
            r.metadataRepo.DeleteLayer(layerDigest)
        }
    }
    return nil
}
Database Schema (PostgreSQL):
CREATE TABLE repositories (
id VARCHAR(36) PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE TABLE layers (
digest VARCHAR(71) PRIMARY KEY, -- sha256:... (71 chars)
size BIGINT NOT NULL,
media_type VARCHAR(100),
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE TABLE manifests (
digest VARCHAR(71) PRIMARY KEY,
repository_id VARCHAR(36) NOT NULL REFERENCES repositories(id),
manifest_json JSONB NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_manifests_repo ON manifests (repository_id);
CREATE TABLE tags (
id BIGSERIAL PRIMARY KEY,
repository_id VARCHAR(36) NOT NULL REFERENCES repositories(id),
tag_name VARCHAR(128) NOT NULL,
manifest_digest VARCHAR(71) NOT NULL REFERENCES manifests(digest),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT unique_repo_tag UNIQUE (repository_id, tag_name)
);
CREATE TABLE layer_references (
manifest_digest VARCHAR(71) NOT NULL REFERENCES manifests(digest) ON DELETE CASCADE,
layer_digest VARCHAR(71) NOT NULL REFERENCES layers(digest),
layer_index INT NOT NULL,
PRIMARY KEY (manifest_digest, layer_digest)
);
CREATE INDEX idx_layer_refs_layer ON layer_references (layer_digest);
7. Design a product recommendation engine at scale for e-commerce
Difficulty Level: Extreme
Engineering Level: Senior Engineer, Staff Engineer, ML Engineers
Source: SystemDesignHandbook.com (November 2024), Emerj AI interview with Walmart (January 2023)
Team: E-commerce, Data & Analytics, ML Engineering
Interview Round: System Design Round
Question: “Design Walmart’s product recommendation system that personalizes suggestions for millions of users across billions of products. The system must implement collaborative filtering (users who bought X also bought Y), content-based filtering (based on product attributes), hybrid approaches combining multiple techniques, handle cold start problems for new users/products, support real-time personalization based on current session, integrate A/B testing framework, and scale to handle Walmart’s traffic.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Recommend products based on user history and behavior
- Support multiple recommendation types (you may like, frequently bought together, trending)
- Real-time personalization during browsing session
- A/B testing for recommendation algorithms
- Handle cold start (new users/products)
Non-Functional Requirements:
- Scale: 270M weekly users, billions of products, 100M recommendations/day
- Latency: <100ms for recommendation generation
- Accuracy: recommendation-model accuracy >95%, conversion lift >10%
- Cost: ~$60K/month (compute, storage, ML infrastructure)
Key Design Decisions:
- Hybrid Approach: Collaborative filtering + content-based + contextual signals
- Architecture: Batch processing (Spark) + real-time serving (low-latency API)
- Cold Start: Content-based fallback + popularity-based recommendations
- A/B Testing: Multi-armed bandit for algorithm selection
System Architecture
High-Level Design:
┌────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Web App] [Mobile App] [Email] │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ API GATEWAY │
│ Recommendation API | Personalization │
└────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Online │ │ Model │ │ A/B │
│ Serving │ │ Selection│ │ Testing │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────┐
│ FEATURE STORE (Redis) │
│ User Embeddings | Product Vectors│
└────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Offline │ │ Event │ │ Product │
│ Training │ │ Tracking │ │ Catalog │
│ (Spark) │ │ (Kafka) │ │ (RDS) │
└──────────┘ └──────────┘ └──────────┘
Scalability & Performance:
- Batch Processing: Nightly model training on Spark, pre-compute top-N recommendations (serving path sketched after this list)
- Real-time: Stream processing with Kafka for immediate behavior updates
- Caching: Redis for user/product embeddings, 1-hour TTL
- A/B Testing: Thompson sampling for algorithm selection
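A sketch of the batch/online split above, assuming the nightly job materializes each user's top-N list into a Redis sorted set (the key layout is an assumption): the online read is then a single ZREVRANGE.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class PrecomputedRecommendations {
    @Autowired private StringRedisTemplate redis;

    // Batch side: called once per (user, product, score) row by the nightly job.
    public void store(String userId, String productId, double score) {
        redis.opsForZSet().add("recs:user:" + userId, productId, score);
    }

    // Online side: one sorted-set read, highest scores first.
    public List<String> topN(String userId, int n) {
        Set<String> ids = redis.opsForZSet().reverseRange("recs:user:" + userId, 0, n - 1);
        return ids == null ? List.of() : new ArrayList<>(ids);
    }
}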
Code
Recommendation Service (Python + FastAPI):
from typing import List

import numpy as np
import redis
from fastapi import FastAPI

app = FastAPI()

class RecommendationService:
    # Product, ProductCatalog, and the helper methods below are assumed domain
    # classes; the sketch focuses on the hybrid scoring flow.
    def __init__(self):
        self.redis = redis.Redis()
        self.product_catalog = ProductCatalog()
        self.ab_tester = ABTestingService()

    # Get recommendations with the hybrid approach
    async def get_recommendations(self, user_id: str, context: dict) -> List[Product]:
        # Select algorithm via A/B testing
        algorithm = self.ab_tester.select_algorithm(user_id)

        # Get user embedding
        user_embedding = self.get_user_embedding(user_id)
        if user_embedding is None:
            # Cold start: use content-based + popularity
            return self.cold_start_recommendations(context)

        # Collaborative filtering
        collab_scores = self.collaborative_filtering(user_embedding)
        # Content-based filtering
        content_scores = self.content_based_filtering(user_id, context)
        # Contextual signals (time, location, device)
        contextual_boost = self.apply_contextual_signals(context)

        # Hybrid scoring: weighted blend per product id
        all_ids = set(collab_scores) | set(content_scores) | set(contextual_boost)
        final_scores = {
            pid: 0.5 * collab_scores.get(pid, 0.0)
               + 0.3 * content_scores.get(pid, 0.0)
               + 0.2 * contextual_boost.get(pid, 0.0)
            for pid in all_ids
        }

        # Get top-N products
        top_products = self.get_top_n_products(final_scores, n=20)
        # Apply business rules (inventory, profitability)
        filtered_products = self.apply_business_rules(top_products)
        # Track recommendation for the feedback loop
        self.track_recommendation(user_id, filtered_products, algorithm)
        return filtered_products[:10]

    def get_user_embedding(self, user_id: str) -> np.ndarray:
        # Try cache first
        cached = self.redis.get(f"user:embedding:{user_id}")
        if cached:
            return np.frombuffer(cached, dtype=np.float32)
        # Fall back to database
        embedding = self.load_user_embedding_from_db(user_id)
        if embedding is not None:
            self.redis.setex(
                f"user:embedding:{user_id}",
                3600,  # 1 hour TTL
                embedding.tobytes()
            )
        return embedding

    def collaborative_filtering(self, user_embedding: np.ndarray) -> dict:
        # Get similar users
        similar_users = self.find_similar_users(user_embedding, top_k=100)
        # Aggregate their purchases, weighted by similarity
        product_scores = {}
        for similar_user, similarity in similar_users:
            user_purchases = self.get_user_purchases(similar_user)
            for product_id in user_purchases:
                product_scores[product_id] = product_scores.get(product_id, 0) + similarity
        return product_scores

    def content_based_filtering(self, user_id: str, context: dict) -> dict:
        # Get the user's historical preferences
        user_preferences = self.get_user_preferences(user_id)
        # Current browsing context
        current_product_id = context.get('current_product')
        if current_product_id:
            # Find similar products
            similar_products = self.find_similar_products(current_product_id, top_k=50)
            return {p.id: p.similarity_score for p in similar_products}
        # Match user preferences with product attributes
        return self.match_preferences_to_products(user_preferences)

    def cold_start_recommendations(self, context: dict) -> List[Product]:
        # For new users: trending + context-based
        category = context.get('category')
        if category:
            # Popular products in the category
            return self.get_trending_in_category(category, limit=10)
        else:
            # Overall trending products
            return self.get_trending_products(limit=10)

    def apply_business_rules(self, products: List[Product]) -> List[Product]:
        filtered = []
        for product in products:
            # Check inventory
            if product.in_stock:
                # Boost profitable items
                if product.margin > 0.25:
                    product.score *= 1.2
                filtered.append(product)
        return sorted(filtered, key=lambda p: p.score, reverse=True)

# Offline training job (Spark); assumes a live SparkSession `spark` and
# ALS from pyspark.ml.recommendation
class ModelTrainer:
    def train_user_embeddings(self):
        # Load interaction data
        interactions = spark.read.parquet("s3://data/interactions/")
        # Matrix factorization (ALS algorithm)
        als = ALS(
            rank=128,
            maxIter=10,
            regParam=0.01,
            userCol="user_id",
            itemCol="product_id",
            ratingCol="implicit_rating"
        )
        model = als.fit(interactions)
        # Extract embeddings
        user_embeddings = model.userFactors
        product_embeddings = model.itemFactors
        # Save to the feature store
        user_embeddings.write.format("redis").save()
        product_embeddings.write.format("redis").save()
A/B Testing Service:
import numpy as np

class ABTestingService:
    def __init__(self):
        # Candidate weightings for the hybrid scorer (collab / content / context)
        self.algorithms = {
            'collab_heavy':  {'collab': 0.7, 'content': 0.2, 'context': 0.1},
            'content_heavy': {'collab': 0.3, 'content': 0.5, 'context': 0.2},
            'balanced':      {'collab': 0.5, 'content': 0.3, 'context': 0.2}
        }
        self.performance = {alg: {'successes': 0, 'trials': 0} for alg in self.algorithms}

    def select_algorithm(self, user_id: str) -> str:
        # Thompson sampling (multi-armed bandit): sample each arm's CTR from a
        # Beta posterior and serve the arm with the highest draw
        samples = {}
        for alg in self.algorithms:
            successes = self.performance[alg]['successes']
            trials = self.performance[alg]['trials']
            alpha = successes + 1              # Beta(1, 1) prior
            beta = trials - successes + 1
            samples[alg] = np.random.beta(alpha, beta)
        return max(samples, key=samples.get)

    def record_outcome(self, user_id: str, algorithm: str, clicked: bool):
        self.performance[algorithm]['trials'] += 1
        if clicked:
            self.performance[algorithm]['successes'] += 1
Database Schema (PostgreSQL):
CREATE TABLE user_interactions (
id BIGSERIAL,
user_id VARCHAR(36) NOT NULL,
product_id VARCHAR(36) NOT NULL,
interaction_type VARCHAR(20) NOT NULL, -- VIEW, CART, PURCHASE
timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (id, timestamp) -- Postgres requires the partition key in the PK
) PARTITION BY RANGE (timestamp);
CREATE INDEX idx_interactions_user ON user_interactions (user_id, timestamp DESC);
CREATE INDEX idx_interactions_product ON user_interactions (product_id);
CREATE TABLE recommendations_served (
id BIGSERIAL,
user_id VARCHAR(36) NOT NULL,
product_ids TEXT[] NOT NULL,
algorithm_version VARCHAR(50) NOT NULL,
context JSONB,
served_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (id, served_at) -- includes the range partition key
) PARTITION BY RANGE (served_at);
CREATE TABLE recommendation_feedback (
id BIGSERIAL PRIMARY KEY,
recommendation_id BIGINT NOT NULL,
clicked BOOLEAN DEFAULT FALSE,
purchased BOOLEAN DEFAULT FALSE,
feedback_at TIMESTAMP NOT NULL DEFAULT NOW()
);
8. Design Walmart’s last-mile delivery system with multi-carrier orchestration
Difficulty Level: Extreme
Engineering Level: Senior Engineer, Staff Engineer, Distinguished Engineer
Source: DELIVER Event 2025 (November 2025), YouTube interview
Team: Supply Chain Technology, Walmart GoLocal, Last Mile Delivery
Interview Round: System Design/High Level Design Round
Question: “Design Walmart’s last-mile delivery orchestration platform that routes orders across 5,000+ stores using multiple delivery modes: in-house delivery vans, Spark Driver (Walmart’s gig platform), third-party delivery service providers (DSPs), and drone delivery (300,000+ deliveries to date). The system must provide real-time tracking, optimize delivery routes, handle carrier failover, scale to 100,000+ deliveries per day (1,000+ routes of ~100 stops each), achieve 98%+ on-time delivery rate, and support same-day, next-day, and scheduled deliveries.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Assign orders to optimal delivery carriers (in-house, gig, DSP, drone)
- Optimize delivery routes (minimize time/cost)
- Real-time tracking and ETA updates
- Handle carrier failures (automatic reassignment)
- Support multiple delivery speeds (same-day, next-day, 2-hour)
Non-Functional Requirements:
- Scale: 100K+ deliveries/day, 1K+ routes, 5K+ stores
- Latency: <500ms carrier assignment, <2s route optimization
- Availability: 99.99% uptime, carrier failover <1min
- Accuracy: 98%+ on-time delivery, <0.5% lost packages
- Cost: ~$55K/month (compute, routing API, carriers)
Key Design Decisions:
- Carrier Selection: Multi-criteria optimization (cost, ETA, capacity, reliability)
- Routing: Google Maps API + internal routing engine
- Failover: Automatic reassignment with priority queue
- Tracking: Geofencing + periodic location updates (geofence test sketched after this list)
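The geofence test behind the tracking bullet reduces to a great-circle distance check; a minimal haversine sketch (the 200 m radius is an assumption):

public class Geofence {
    private static final double EARTH_RADIUS_M = 6_371_000;

    // Great-circle distance between two lat/lng points in meters (haversine).
    public static double distanceMeters(double lat1, double lng1,
                                        double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    // An "arriving" event fires when the driver enters the 200 m geofence.
    public static boolean insideGeofence(double driverLat, double driverLng,
                                         double stopLat, double stopLng) {
        return distanceMeters(driverLat, driverLng, stopLat, stopLng) < 200;
    }
}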
System Architecture
High-Level Design:
┌────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Customer App] [Driver App] [Store Associate App] │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ API GATEWAY │
│ Tracking API | Driver API | Assignment API │
└────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Order │ │ Route │ │ Tracking │
│Assignment│ │Optimizer │ │ Service │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────┐
│ MESSAGE QUEUE (Kafka) │
│ Topics: deliveries, locations │
└────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Carrier │ │ ETA │ │ Failover │
│ Integr. │ │Calculator│ │ Manager │
└──────────┘ └──────────┘ └──────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ [PostgreSQL - Orders] [Redis - Live Tracking] │
│ [Google Maps API - Routing] [MongoDB - Routes] │
└────────────────────────────────────────────────────────┘
Scalability & Performance:
- Carrier Selection: Real-time scoring algorithm considers cost, ETA, driver rating
- Routing: Batch optimization every 5 minutes, incremental updates for new orders
- Tracking: WebSocket connections for live updates, Redis pub/sub
- Failover: Automatic detection (timeout, driver decline), priority reassignment queue (see the sketch below)
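Before the full code, a brief sketch of the failover path called out above. This is a hedged illustration, not the production design: FailoverManager, the 60-second timeout, and assign_fn are assumed names. The priority queue is keyed by delivery deadline so the most urgent reassignments happen first, which is what keeps failover inside the <1 min budget:

import heapq
import time

class FailoverManager:
    ASSIGNMENT_TIMEOUT_SEC = 60  # assumed budget; tune per carrier SLA

    def __init__(self):
        # Min-heap keyed by delivery deadline: most urgent reassigned first
        self.reassignment_queue = []

    def on_driver_decline(self, delivery_id: str, deadline_ts: float):
        heapq.heappush(self.reassignment_queue, (deadline_ts, delivery_id))

    def check_timeout(self, delivery_id: str, assigned_ts: float, deadline_ts: float):
        # Automatic detection: no driver acceptance within the timeout window
        if time.time() - assigned_ts > self.ASSIGNMENT_TIMEOUT_SEC:
            heapq.heappush(self.reassignment_queue, (deadline_ts, delivery_id))

    def drain(self, assign_fn):
        # assign_fn would call back into the Order Assignment service
        while self.reassignment_queue:
            _, delivery_id = heapq.heappop(self.reassignment_queue)
            assign_fn(delivery_id)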
Code
Order Assignment Service (Java + Spring Boot):
@Service
public class OrderAssignmentService {
    @Autowired private CarrierService carrierService;
    @Autowired private RoutingService routingService;
    @Autowired private NotificationService notificationService;

    // Assign order to optimal carrier
    public AssignmentResponse assignOrder(Order order) {
        // Get available carriers near store
        List<Carrier> availableCarriers = carrierService.getAvailableCarriers(
            order.getStoreLocation(), order.getDeliveryWindow());
        if (availableCarriers.isEmpty()) {
            return AssignmentResponse.failed("No carriers available");
        }
        // Score each carrier
        List<CarrierScore> scores = availableCarriers.stream()
            .map(carrier -> scoreCarrier(carrier, order))
            .sorted(Comparator.comparingDouble(CarrierScore::getScore).reversed())
            .collect(Collectors.toList());
        // Try to assign to best carrier; fall through on declines
        for (CarrierScore score : scores) {
            AssignmentResult result = attemptAssignment(score.getCarrier(), order);
            if (result.isSuccess()) {
                // Update routing
                routingService.addStopToRoute(score.getCarrier().getRouteId(), order);
                // Send notification
                notificationService.notifyDriver(score.getCarrier().getDriverId(), order);
                return AssignmentResponse.success(result);
            }
        }
        return AssignmentResponse.failed("All carriers declined");
    }

    // Multi-criteria scoring
    private CarrierScore scoreCarrier(Carrier carrier, Order order) {
        double costScore = calculateCostScore(carrier, order);
        double etaScore = calculateETAScore(carrier, order);
        double capacityScore = calculateCapacityScore(carrier);
        double reliabilityScore = carrier.getOnTimeRate();
        // Weighted scoring (weights sum to 1.0)
        double finalScore =
            0.3 * costScore + 0.4 * etaScore + 0.2 * capacityScore + 0.1 * reliabilityScore;
        return new CarrierScore(carrier, finalScore);
    }

    private double calculateETAScore(Carrier carrier, Order order) {
        Duration eta = routingService.calculateETA(
            carrier.getCurrentLocation(), order.getDeliveryAddress());
        // Score: 1.0 if within delivery window, decreases linearly outside window
        long minutesToDeadline =
            Duration.between(Instant.now(), order.getDeliveryBy()).toMinutes();
        long etaMinutes = eta.toMinutes();
        if (etaMinutes <= minutesToDeadline) {
            return 1.0;
        }
        return Math.max(0, 1.0 - (etaMinutes - minutesToDeadline) / 60.0);
    }
}

Route Optimizer (Python):
from typing import List

from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp

class RouteOptimizer:
    def __init__(self):
        self.google_maps = GoogleMapsClient()

    # Optimize routes for all carriers
    def optimize_routes(self, orders: List[Order], carriers: List[Carrier]) -> List[Route]:
        # Group orders by store/zone
        order_clusters = self.cluster_orders(orders)
        optimized_routes = []
        for cluster in order_clusters:
            routes = self.solve_vrp(cluster, carriers)
            optimized_routes.extend(routes)
        return optimized_routes

    def solve_vrp(self, orders: List[Order], carriers: List[Carrier]) -> List[Route]:
        # Vehicle Routing Problem with Time Windows
        num_vehicles = len(carriers)
        depot = orders[0].store_location

        # Create routing model
        manager = pywrapcp.RoutingIndexManager(len(orders), num_vehicles, depot)
        routing = pywrapcp.RoutingModel(manager)

        # Distance callback
        def distance_callback(from_index, to_index):
            from_node = manager.IndexToNode(from_index)
            to_node = manager.IndexToNode(to_index)
            return self.get_distance(orders[from_node], orders[to_node])

        transit_callback_index = routing.RegisterTransitCallback(distance_callback)
        routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)

        # Add capacity constraints
        def demand_callback(from_index):
            from_node = manager.IndexToNode(from_index)
            return orders[from_node].weight

        demand_callback_index = routing.RegisterUnaryTransitCallback(demand_callback)
        routing.AddDimensionWithVehicleCapacity(
            demand_callback_index,
            0,  # null capacity slack
            [carrier.capacity for carrier in carriers],
            True,  # start cumul to zero
            'Capacity'
        )

        # Add time window constraints
        def time_callback(from_index, to_index):
            from_node = manager.IndexToNode(from_index)
            to_node = manager.IndexToNode(to_index)
            return self.get_travel_time(orders[from_node], orders[to_node])

        time_callback_index = routing.RegisterTransitCallback(time_callback)
        routing.AddDimension(
            time_callback_index,
            30,     # allow waiting time
            180,    # maximum time per vehicle (3 hours)
            False,  # don't force start cumul to zero
            'Time'
        )
        time_dimension = routing.GetDimensionOrDie('Time')
        for i, order in enumerate(orders):
            index = manager.NodeToIndex(i)
            time_dimension.CumulVar(index).SetRange(
                order.delivery_window_start,
                order.delivery_window_end
            )

        # Solve
        search_parameters = pywrapcp.DefaultRoutingSearchParameters()
        search_parameters.first_solution_strategy = (
            routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
        )
        search_parameters.local_search_metaheuristic = (
            routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH
        )
        search_parameters.time_limit.seconds = 30
        solution = routing.SolveWithParameters(search_parameters)
        return self.extract_routes(solution, routing, manager, orders, carriers)

Tracking Service (Real-time):
@Service
public class TrackingService {
    @Autowired private RedisTemplate<String, Location> redisTemplate;
    @Autowired private SimpMessagingTemplate websocketTemplate;

    // Update driver location
    public void updateDriverLocation(String driverId, Location location) {
        // Store in Redis
        redisTemplate.opsForValue().set(
            "driver:location:" + driverId, location, 5, TimeUnit.MINUTES);
        // Get orders assigned to this driver
        List<Order> assignedOrders = getAssignedOrders(driverId);
        // Update ETA for each order
        for (Order order : assignedOrders) {
            Duration newETA = calculateETA(location, order.getDeliveryAddress());
            order.setEstimatedDelivery(Instant.now().plus(newETA));
            // Push update to customer via WebSocket
            websocketTemplate.convertAndSend(
                "/topic/order/" + order.getId(),
                new TrackingUpdate(driverId, location, newETA));
            // Check geofence (delivery imminent)
            if (isNearDestination(location, order.getDeliveryAddress())) {
                notificationService.sendArrivalNotification(order.getCustomerId());
            }
        }
    }

    private boolean isNearDestination(Location current, Address destination) {
        double distance = calculateDistance(current, destination);
        return distance < 500; // 500 meters
    }
}

Database Schema (PostgreSQL):
CREATE TABLE deliveries (
id VARCHAR(36) PRIMARY KEY,
order_id VARCHAR(36) NOT NULL,
store_id VARCHAR(36) NOT NULL,
carrier_type VARCHAR(20) NOT NULL, -- IN_HOUSE, SPARK, DSP, DRONE
driver_id VARCHAR(36),
status VARCHAR(20) NOT NULL, -- ASSIGNED, IN_TRANSIT, DELIVERED, FAILED
assigned_at TIMESTAMP,
delivered_at TIMESTAMP,
estimated_delivery TIMESTAMP
);
CREATE INDEX idx_deliveries_driver ON deliveries (driver_id, status);
CREATE INDEX idx_deliveries_status ON deliveries (status, assigned_at);
CREATE TABLE delivery_routes (
id VARCHAR(36) PRIMARY KEY,
driver_id VARCHAR(36) NOT NULL,
route_date DATE NOT NULL,
stops JSONB NOT NULL, -- Array of delivery IDs with sequence
optimized_at TIMESTAMP NOT NULL,
total_distance_km DECIMAL(10,2),
estimated_duration_min INT
);
CREATE TABLE carrier_performance (
carrier_id VARCHAR(36) PRIMARY KEY,
on_time_rate DECIMAL(5,4),
avg_delivery_time_min INT,
total_deliveries INT,
last_updated TIMESTAMP NOT NULL DEFAULT NOW()
);

9. Design an efficient tag management system for multi-million product catalog
Difficulty Level: Hard
Engineering Level: Mid-Level Engineer, Senior Engineer
Source: Reddit r/developersIndia (November 2024)
Team: Product Catalog Engineering
Interview Round: System Design Round
Question: “Design an efficient tag management system for Walmart’s product catalog with billions of products. The system must support flexible tagging (categories, attributes, promotions, seasonal), efficient search and filtering (find all products tagged ‘organic’ AND ‘gluten-free’), tag hierarchies (Electronics > Mobile > Smartphones), bulk tag operations, and ensure consistency across distributed systems. Handle 10M+ tag updates per day while maintaining <100ms search latency.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Add/update/delete tags on products
- Hierarchical tags (parent-child relationships)
- Multi-tag search (AND, OR, NOT operations)
- Bulk tag operations (tag 10K products at once)
- Tag analytics (most used tags, trending)
Non-Functional Requirements:
- Scale: Billions of products, 100K+ unique tags, 10M updates/day
- Latency: <100ms tag search, <500ms bulk operations
- Consistency: Eventually consistent, no lost updates
- Cost: ~$15K/month (DB, cache, search)
Key Design Decisions:
- Storage: PostgreSQL for tags + Elasticsearch for search
- Indexing: Inverted index for fast tag lookups
- Hierarchies: Materialized path pattern for efficient traversal (see the sketch after this list)
- Bulk Ops: Async processing via message queue
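To make the materialized-path item above concrete, here is a small, pure-Python illustration (hypothetical data; in production this is a SQL prefix query against the tags.path column defined in the schema later in this answer). The point is that a subtree lookup becomes a string-prefix match instead of a recursive join:

# Each tag stores its full ancestry as a path string.
tags = {
    'electronics': '/electronics/',
    'mobile': '/electronics/mobile/',
    'smartphones': '/electronics/mobile/smartphones/',
    'grocery': '/grocery/',
}

def descendants(parent_slug: str) -> list:
    prefix = tags[parent_slug]
    return [slug for slug, path in tags.items()
            if path.startswith(prefix) and slug != parent_slug]

# SQL equivalent: SELECT * FROM tags WHERE path LIKE '/electronics/%'
print(descendants('electronics'))  # ['mobile', 'smartphones']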
System Architecture
High-Level Design:
┌────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Catalog Manager] [Search API] [Bulk Import Tool] │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ API GATEWAY │
│ Tag API | Search API | Bulk API │
└────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Tag │ │ Search │ │ Bulk │
│ Service │ │ Service │ │Processor │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌────────────────────────────────────┐
│ MESSAGE QUEUE (Kafka) │
│ Topics: tag-updates, bulk-ops │
└────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ [PostgreSQL - Tags/Products] [Elasticsearch - Search] │
│ [Redis - Cache] [S3 - Bulk Import Files] │
└────────────────────────────────────────────────────────┘
Scalability & Performance:
- Caching: Redis for frequently accessed tags, 1-hour TTL
- Search: Elasticsearch with custom analyzers for tag matching
- Bulk Operations: Async processing with progress tracking
- Consistency: Event sourcing for tag changes, idempotent operations (see the sketch below)
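A sketch of the idempotency guard mentioned above (names are illustrative; a production version would keep processed IDs in Redis or behind a DB unique constraint rather than in process memory). Each tag event carries a unique event_id, so a Kafka redelivery is detected and skipped instead of double-applying the change:

processed_event_ids = set()  # illustration only; use Redis SETNX or a unique key in production

def handle_tag_event(event: dict, apply_change) -> bool:
    event_id = event['event_id']
    if event_id in processed_event_ids:
        return False  # duplicate delivery: already applied, safely skipped
    apply_change(event['product_id'], event['tag_id'], event['op'])
    processed_event_ids.add(event_id)
    return True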
Code
Tag Service (Java + Spring Boot):
@Service
public class TagService {
    @Autowired private TagRepository tagRepo;
    @Autowired private ProductTagRepository productTagRepo;
    @Autowired private ElasticsearchTemplate esTemplate;
    @Autowired private RedisTemplate<String, Object> redisTemplate;
    @Autowired private KafkaTemplate<String, Object> kafkaTemplate;

    // Add tag to product
    @Transactional
    public void addTagToProduct(String productId, String tagId) {
        Tag tag = tagRepo.findById(tagId)
            .orElseThrow(() -> new NotFoundException("Tag not found"));
        // Add tag and all parent tags (hierarchy)
        List<Tag> tagsToAdd = getTagHierarchy(tag);
        for (Tag t : tagsToAdd) {
            ProductTag productTag = ProductTag.builder()
                .productId(productId)
                .tagId(t.getId())
                .createdAt(Instant.now())
                .build();
            productTagRepo.save(productTag);
        }
        // Update search index
        esTemplate.update(productId, doc -> doc.addTags(tagsToAdd));
        // Invalidate cache
        redisTemplate.delete("product:tags:" + productId);
        // Publish event
        kafkaTemplate.send("tag-updates", TagEvent.added(productId, tagId));
    }

    // Get tag hierarchy (including parents)
    private List<Tag> getTagHierarchy(Tag tag) {
        List<Tag> hierarchy = new ArrayList<>();
        Tag current = tag;
        while (current != null) {
            hierarchy.add(current);
            current = current.getParentId() != null
                ? tagRepo.findById(current.getParentId()).orElse(null)
                : null;
        }
        return hierarchy;
    }

    // Search products by tags (AND/OR operations)
    public List<Product> searchByTags(List<String> tagIds, TagOperator operator) {
        String cacheKey = "search:tags:" + String.join(",", tagIds) + ":" + operator;
        List<Product> cached = (List<Product>) redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) return cached;
        // Build Elasticsearch query
        BoolQueryBuilder query = QueryBuilders.boolQuery();
        if (operator == TagOperator.AND) {
            for (String tagId : tagIds) {
                query.must(QueryBuilders.termQuery("tags.id", tagId));
            }
        } else if (operator == TagOperator.OR) {
            BoolQueryBuilder orQuery = QueryBuilders.boolQuery();
            for (String tagId : tagIds) {
                orQuery.should(QueryBuilders.termQuery("tags.id", tagId));
            }
            query.must(orQuery);
        }
        SearchHits<Product> results = esTemplate.search(
            new NativeSearchQueryBuilder()
                .withQuery(query)
                .withPageable(PageRequest.of(0, 100))
                .build(),
            Product.class);
        List<Product> products = results.stream()
            .map(SearchHit::getContent)
            .collect(Collectors.toList());
        // Cache results
        redisTemplate.opsForValue().set(cacheKey, products, 10, TimeUnit.MINUTES);
        return products;
    }

    // Bulk tag operation: enqueue and return immediately
    @Async
    public CompletableFuture<BulkOperationResult> bulkAddTags(
            List<String> productIds, List<String> tagIds) {
        String operationId = UUID.randomUUID().toString();
        // Publish to Kafka for async processing
        BulkTagOperation operation = BulkTagOperation.builder()
            .id(operationId)
            .productIds(productIds)
            .tagIds(tagIds)
            .operation(BulkOperation.ADD)
            .status(OperationStatus.PENDING)
            .createdAt(Instant.now())
            .build();
        kafkaTemplate.send("bulk-tag-ops", operation);
        return CompletableFuture.completedFuture(
            BulkOperationResult.pending(operationId));
    }
}

enum TagOperator { AND, OR, NOT }
enum BulkOperation { ADD, REMOVE, REPLACE }
enum OperationStatus { PENDING, IN_PROGRESS, COMPLETED, FAILED }

Bulk Tag Processor (Kafka Consumer):
@Service
public class BulkTagProcessor {
    @Autowired private TagService tagService;
    @Autowired private BulkOperationRepository bulkOpRepo;

    @KafkaListener(topics = "bulk-tag-ops", concurrency = "5")
    public void processBulkOperation(BulkTagOperation operation) {
        operation.setStatus(OperationStatus.IN_PROGRESS);
        bulkOpRepo.save(operation);
        int processed = 0;
        int failed = 0;
        try {
            // Process in batches of 1000
            for (int i = 0; i < operation.getProductIds().size(); i += 1000) {
                int end = Math.min(i + 1000, operation.getProductIds().size());
                List<String> batch = operation.getProductIds().subList(i, end);
                for (String productId : batch) {
                    try {
                        for (String tagId : operation.getTagIds()) {
                            if (operation.getOperation() == BulkOperation.ADD) {
                                tagService.addTagToProduct(productId, tagId);
                            } else if (operation.getOperation() == BulkOperation.REMOVE) {
                                tagService.removeTagFromProduct(productId, tagId);
                            }
                        }
                        processed++;
                    } catch (Exception e) {
                        failed++;
                        log.error("Failed to process product {}: {}", productId, e.getMessage());
                    }
                }
                // Update progress
                operation.setProgress((double) processed / operation.getProductIds().size());
                bulkOpRepo.save(operation);
            }
            operation.setStatus(OperationStatus.COMPLETED);
        } catch (Exception e) {
            operation.setStatus(OperationStatus.FAILED);
            operation.setErrorMessage(e.getMessage());
        }
        operation.setProcessedCount(processed);
        operation.setFailedCount(failed);
        operation.setCompletedAt(Instant.now());
        bulkOpRepo.save(operation);
    }
}

Database Schema (PostgreSQL):
-- Tags table with materialized path for hierarchy
CREATE TABLE tags (
id VARCHAR(36) PRIMARY KEY,
name VARCHAR(100) NOT NULL,
slug VARCHAR(100) NOT NULL UNIQUE,
parent_id VARCHAR(36) REFERENCES tags(id),
path VARCHAR(500), -- Materialized path (e.g., "/electronics/mobile/")
level INT NOT NULL DEFAULT 0,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_tags_path ON tags USING GIN (path gin_trgm_ops);
CREATE INDEX idx_tags_parent ON tags (parent_id);
-- Product-Tag junction table
CREATE TABLE product_tags (
product_id VARCHAR(36) NOT NULL,
tag_id VARCHAR(36) NOT NULL REFERENCES tags(id),
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (product_id, tag_id)
);
CREATE INDEX idx_product_tags_tag ON product_tags (tag_id);
-- Bulk operations tracking
CREATE TABLE bulk_tag_operations (
id VARCHAR(36) PRIMARY KEY,
operation_type VARCHAR(20) NOT NULL,
status VARCHAR(20) NOT NULL,
total_count INT NOT NULL,
processed_count INT DEFAULT 0,
failed_count INT DEFAULT 0,
progress DECIMAL(5,4) DEFAULT 0,
error_message TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
completed_at TIMESTAMP
);
-- Tag analytics
CREATE TABLE tag_usage_stats (
tag_id VARCHAR(36) PRIMARY KEY REFERENCES tags(id),
product_count INT NOT NULL DEFAULT 0,
search_count INT NOT NULL DEFAULT 0,
last_used TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);

Elasticsearch Index Mapping:
{
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": { "type": "text" },
      "tags": {
        "type": "nested",
        "properties": {
          "id": { "type": "keyword" },
          "name": { "type": "text" },
          "slug": { "type": "keyword" },
          "path": { "type": "keyword" }
        }
      }
    }
  }
}

10. Design Walmart’s omnichannel architecture supporting online, in-store, and mobile
Difficulty Level: Extreme
Engineering Level: Staff Engineer, Distinguished Engineer, Principal Architect
Source: Glassdoor Walmart interview experiences (2024-2025), Architecture Summit talks
Team: Enterprise Architecture, Platform Engineering
Interview Round: System Design Round (senior roles)
Question: “Design Walmart’s omnichannel platform architecture that seamlessly integrates online shopping (Walmart.com, app), in-store experiences (POS, kiosks, Scan & Go), BOPIS (Buy Online Pickup In-Store), curbside pickup, and delivery services. The system must maintain consistent inventory visibility, unified customer profiles, synchronized pricing and promotions, order fulfillment across channels, and real-time data synchronization. Handle 5,000+ physical stores, 270M+ weekly customers, peak traffic of 1M concurrent users, and ensure 99.99% uptime across all channels.”
Answer Framework
Requirements Clarification
Functional Requirements:
- Unified product catalog across all channels
- Real-time inventory sync (online, in-store, warehouse)
- Unified customer profile and cart across devices
- Order fulfillment options (ship, pickup, delivery)
- Consistent pricing and promotions
- Support BOPIS, curbside pickup, Scan & Go
Non-Functional Requirements:
- Scale: 5K+ stores, 270M+ users, 1M concurrent, 100M orders/day
- Latency: <200ms API responses, <1s inventory sync
- Availability: 99.99% uptime, zero downtime deployments
- Consistency: Strong consistency for orders, eventual for inventory
- Cost: ~$200K/month (cloud, data sync, CDN)
Key Design Decisions:
- Architecture: Event-driven microservices with domain-driven design
- Data Sync: CDC (Change Data Capture) + event streaming (Kafka)
- API Gateway: GraphQL federation for unified API layer
- Consistency: SAGA pattern for distributed transactions
System Architecture
High-Level Design:
┌────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Walmart.com] [Mobile App] [In-Store POS] [Scan & Go] │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ API GATEWAY (GraphQL Federation) │
│ Authentication | Rate Limiting | Request Routing │
└────────────────────────────────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Product │ │ Customer │ │ Order │
│ Service │ │ Service │ │ Service │
└────────────┘ └────────────┘ └────────────┘
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Inventory │ │ Cart │ │ Fulfillment│
│ Service │ │ Service │ │ Service │
└────────────┘ └────────────┘ └────────────┘
│
▼
┌───────────────────────────────────────────────┐
│ EVENT STREAMING (Kafka) │
│ Topics: inventory, orders, customers, carts │
└───────────────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│Store POS │ │ Warehouse │ │ Analytics │
│Integration │ │Management │ │ Service │
└────────────┘ └────────────┘ └────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ [Product DB] [Customer DB] [Order DB] [Inventory DB] │
│ [Redis Cache] [Elasticsearch] [Data Warehouse] │
└────────────────────────────────────────────────────────────────┘
Scalability & Performance:
- Microservices: Domain-driven, independently deployable services
- Event Streaming: Kafka for async communication, CDC for data sync
- Caching: Multi-layer (CDN, API Gateway, Redis, App Cache); see the read-path sketch after this list
- Database: Polyglot persistence (PostgreSQL, MongoDB, Cassandra)
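A hedged sketch of the multi-layer read path from the caching item above (client names, TTLs, and serialization are illustrative assumptions): an in-process cache absorbs the hottest keys, Redis absorbs the rest, and the database is only hit on a double miss, with the cache-aside fill writing back to Redis:

import time

local_cache = {}   # key -> (fetched_at, value); per-instance hot cache
LOCAL_TTL_SEC = 30

def get_product(product_id: str, redis_client, db) -> str:
    key = f"product:{product_id}"
    # Layer 1: in-process cache
    entry = local_cache.get(key)
    if entry and time.time() - entry[0] < LOCAL_TTL_SEC:
        return entry[1]
    # Layer 2: Redis
    value = redis_client.get(key)
    if value is None:
        # Layer 3: database, then cache-aside fill (5-min Redis TTL)
        value = db.fetch_product_json(product_id)
        redis_client.setex(key, 300, value)
    local_cache[key] = (time.time(), value)
    return value

CDN and API-gateway caching sit in front of this path for anonymous, cacheable responses.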
Code
Order Orchestration Service (Java + Spring Boot):
@Service
public class OrderOrchestrationService {
    @Autowired private InventoryService inventoryService;
    @Autowired private PaymentService paymentService;
    @Autowired private FulfillmentService fulfillmentService;

    // Place order using SAGA pattern
    @Transactional
    public OrderResponse placeOrder(OrderRequest request) {
        String orderId = UUID.randomUUID().toString();
        Order order = Order.builder()
            .id(orderId)
            .customerId(request.getCustomerId())
            .items(request.getItems())
            .fulfillmentType(request.getFulfillmentType()) // SHIP, PICKUP, DELIVERY
            .status(OrderStatus.PENDING)
            .build();
        orderRepository.save(order);
        try {
            // Step 1: Reserve inventory
            InventoryReservation reservation = inventoryService.reserveItems(
                order.getItems(), order.getStoreId());
            order.setReservationId(reservation.getId()); // recorded for compensation
            // Step 2: Process payment
            PaymentResult payment = paymentService.processPayment(
                order.getId(), order.getTotalAmount(), request.getPaymentMethod());
            order.setPaymentId(payment.getPaymentId()); // recorded for compensation
            // Step 3: Create fulfillment
            FulfillmentRequest fulfillment = fulfillmentService.createFulfillment(
                order.getId(), order.getFulfillmentType(), order.getStoreId());
            order.setFulfillmentId(fulfillment.getId()); // recorded for compensation
            // Confirm order
            order.setStatus(OrderStatus.CONFIRMED);
            orderRepository.save(order);
            kafkaTemplate.send("orders", OrderEvent.confirmed(order));
            return OrderResponse.success(orderId);
        } catch (Exception e) {
            // SAGA compensation (roll back completed steps in reverse)
            compensateOrder(order, e);
            return OrderResponse.failed(e.getMessage());
        }
    }

    private void compensateOrder(Order order, Exception error) {
        if (order.getFulfillmentId() != null) {
            fulfillmentService.cancelFulfillment(order.getFulfillmentId());
        }
        if (order.getPaymentId() != null) {
            paymentService.refundPayment(order.getPaymentId());
        }
        if (order.getReservationId() != null) {
            inventoryService.releaseReservation(order.getReservationId());
        }
        order.setStatus(OrderStatus.FAILED);
        orderRepository.save(order);
    }
}

Unified Cart Service (synced across channels):
@Service
public class CartService {
    @Autowired private RedisTemplate<String, Cart> redisTemplate;

    // Add item to cart (works across all channels)
    public Cart addItem(String userId, String channel, CartItem item) {
        String cartKey = "cart:" + userId;
        Cart cart = redisTemplate.opsForValue().get(cartKey);
        if (cart == null) {
            cart = cartRepo.findByUserId(userId)
                .orElse(Cart.builder().userId(userId).items(new ArrayList<>()).build());
        }
        // Check inventory
        boolean available = inventoryService.checkAvailability(
            item.getProductId(), item.getQuantity(), channel);
        if (!available) {
            throw new OutOfStockException("Product not available");
        }
        cart.addItem(item);
        cart.setLastModified(Instant.now());
        // Update Redis
        redisTemplate.opsForValue().set(cartKey, cart, 7, TimeUnit.DAYS);
        // Async persist to DB
        cartRepo.save(cart);
        // Publish cart update event
        kafkaTemplate.send("cart-updates", CartEvent.itemAdded(userId, item, channel));
        return cart;
    }

    // Sync cart across devices
    @KafkaListener(topics = "cart-updates")
    public void syncCart(CartEvent event) {
        websocketTemplate.convertAndSend("/topic/cart/" + event.getUserId(), event);
    }
}

Inventory Sync Service (CDC + Kafka):
@Service
public class InventorySyncService {
    // Listen to inventory changes from stores (CDC)
    @KafkaListener(topics = "store-inventory-cdc", concurrency = "10")
    public void processInventoryChange(InventoryChangeEvent event) {
        Inventory inventory = inventoryRepo.findByProductAndStore(
            event.getProductId(), event.getStoreId());
        inventory.setQuantity(event.getNewQuantity());
        inventory.setLastUpdated(Instant.now());
        inventoryRepo.save(inventory);
        // Update cache
        redisTemplate.opsForValue().set(
            "inventory:" + event.getProductId() + ":" + event.getStoreId(),
            inventory, 5, TimeUnit.MINUTES);
        // Publish to all channels
        kafkaTemplate.send("inventory-updates", InventoryEvent.updated(inventory));
        // Update online availability
        if (inventory.getQuantity() == 0) {
            productService.markOutOfStock(event.getProductId(), event.getStoreId());
        }
    }

    // Propagate to online systems
    @KafkaListener(topics = "inventory-updates")
    public void propagateToOnline(InventoryEvent event) {
        // Update Elasticsearch
        elasticsearchService.updateProductAvailability(
            event.getProductId(), event.getStoreId(), event.getQuantity());
        // Notify subscribed customers
        if (event.getQuantity() > 0) {
            notificationService.notifyBackInStock(event.getProductId(), event.getStoreId());
        }
    }
}

Database Schema (PostgreSQL):
-- Orders table
CREATE TABLE orders (
id VARCHAR(36) PRIMARY KEY,
customer_id VARCHAR(36) NOT NULL,
store_id VARCHAR(36),
fulfillment_type VARCHAR(30) NOT NULL,
status VARCHAR(20) NOT NULL,
total_amount DECIMAL(12,2) NOT NULL,
payment_id VARCHAR(36),
created_at TIMESTAMP NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);
CREATE INDEX idx_orders_customer ON orders (customer_id, created_at DESC);
-- Unified cart
CREATE TABLE carts (
id BIGSERIAL PRIMARY KEY,
user_id VARCHAR(36) NOT NULL UNIQUE,
items JSONB NOT NULL,
channel VARCHAR(20), -- WEB, MOBILE, INSTORE
last_modified TIMESTAMP NOT NULL
);

-- Multi-channel inventory
CREATE TABLE omnichannel_inventory (
id BIGSERIAL PRIMARY KEY,
product_id VARCHAR(36) NOT NULL,
store_id VARCHAR(36) NOT NULL,
online_quantity INT NOT NULL DEFAULT 0,
instore_quantity INT NOT NULL DEFAULT 0,
reserved_quantity INT NOT NULL DEFAULT 0,
last_synced TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT unique_product_store UNIQUE (product_id, store_id)
);
CREATE INDEX idx_inventory_product ON omnichannel_inventory (product_id);
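Finally, a hedged sketch of how the reserved_quantity column above supports strongly consistent reservations (psycopg2-style placeholders; the function and helper names are assumptions). A single conditional UPDATE makes the check-and-reserve atomic, so concurrent BOPIS orders cannot oversell:

RESERVE_SQL = """
UPDATE omnichannel_inventory
   SET reserved_quantity = reserved_quantity + %(qty)s,
       last_synced = NOW()
 WHERE product_id = %(product_id)s
   AND store_id = %(store_id)s
   AND online_quantity - reserved_quantity >= %(qty)s
"""

def reserve_for_pickup(conn, product_id: str, store_id: str, qty: int) -> bool:
    # rowcount == 1 means the reservation succeeded; 0 means insufficient stock
    with conn.cursor() as cur:
        cur.execute(RESERVE_SQL, {'product_id': product_id,
                                  'store_id': store_id,
                                  'qty': qty})
        conn.commit()
        return cur.rowcount == 1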