Swiggy Data Scientist

This guide features 10 challenging Data Scientist interview questions for Swiggy (Data Scientist to Staff Data Scientist levels), covering ML fundamentals, statistical learning theory, recommendation systems, fraud detection, logistics optimization, demand forecasting, A/B testing, and production ML systems aligned with Swiggy’s mission of delivering convenience at scale.

1. Why Is Cross-Entropy the Default Loss Function in Logistic Regression Rather Than MSE?

Difficulty Level: High

Role: Senior Data Scientist / Staff Data Scientist

Source: LinkedIn (Ashutosh Kumar Jha, Ankit Batham)

Topic: ML Fundamentals & Loss Function Theory

Interview Round: Technical/ML Depth Round (60 min)

Domain: Core ML Concepts

Question: “Explain mathematically why cross-entropy is the natural loss function for logistic regression rather than Mean Squared Error (MSE). Your answer should cover: (1) probabilistic interpretation and likelihood maximization, (2) information-theoretic perspective (KL divergence), (3) gradient properties and convergence behavior, (4) why MSE can lead to vanishing gradients with probabilistic outputs.”


Answer Framework

STAR Method Structure:
- Situation: Loss function selection fundamentally affects model training convergence, gradient behavior, and probabilistic interpretation
- Task: Explain mathematical foundations connecting cross-entropy to Bernoulli distribution likelihood, contrast with MSE designed for Gaussian errors
- Action: Derive cross-entropy from maximum likelihood estimation (MLE) for binary classification, show KL divergence interpretation, analyze gradient behavior preventing vanishing gradients
- Result: Cross-entropy provides well-behaved gradients throughout [0,1] probability space, directly optimizes log-likelihood of correct classification, converges faster than MSE for classification tasks

Key Competencies Evaluated:
- Mathematical Rigor: Deriving loss functions from first principles (MLE, information theory)
- Probabilistic Foundations: Understanding Bernoulli distribution, likelihood functions, log-likelihood
- Gradient Analysis: Explaining why cross-entropy gradients remain large even when predictions are confident (unlike MSE)
- Information Theory: Connecting cross-entropy to KL divergence and entropy concepts

Answer

Probabilistic interpretation: logistic regression models P(y=1|x) = σ(wᵀx), where σ is the sigmoid function, assuming labels follow a Bernoulli distribution with parameter p = σ(wᵀx). The likelihood is L(w) = ∏ᵢ p(yᵢ|xᵢ) = ∏ᵢ σ(wᵀxᵢ)^yᵢ × (1-σ(wᵀxᵢ))^(1-yᵢ), and the negative log-likelihood -log L(w) = -∑ᵢ [yᵢ log σ(wᵀxᵢ) + (1-yᵢ) log(1-σ(wᵀxᵢ))] is exactly the binary cross-entropy loss. Minimizing cross-entropy therefore directly maximizes the probability of correct classification under the Bernoulli assumption.

Information-theoretic perspective: cross-entropy H(p,q) = -∑ p(x) log q(x) measures the average number of bits needed to encode data from the true distribution p using the model distribution q. Since KL(p||q) = H(p,q) - H(p) and H(p) is constant for fixed labels, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true label distribution and the predicted distribution, making it a natural measure of distributional mismatch.

Gradient properties: the cross-entropy gradient ∂L/∂w = ∑ᵢ (σ(wᵀxᵢ) - yᵢ)xᵢ remains large when predictions are wrong (σ far from y), enabling fast correction. MSE loss L_MSE = ∑ᵢ (σ(wᵀxᵢ) - yᵢ)² instead yields ∂L_MSE/∂w = ∑ᵢ 2(σ(wᵀxᵢ) - yᵢ) × σ'(wᵀxᵢ)xᵢ, where σ'(z) = σ(z)(1-σ(z)) approaches zero when σ(z) is near 0 or 1 (confident predictions). This causes vanishing gradients precisely when the model is confidently wrong, stalling learning.

Convergence behavior: cross-entropy converges faster because its gradient magnitude is proportional to the prediction error (σ - y) without the sigmoid-derivative dampening, whereas the σ'(z) term in the MSE gradient creates plateau regions in the loss surface when predictions saturate. Empirically, cross-entropy typically achieves lower validation loss in fewer epochs for classification tasks, making it the preferred choice even though MSE is a mathematically valid loss function. Practical recommendation: always use cross-entropy for classification (binary, or multi-class via softmax) and MSE for regression tasks where the Gaussian error assumption holds.
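
Code

Gradient Behavior: Cross-Entropy vs. MSE Through a Sigmoid:

A minimal NumPy sketch (not from the original question; the values are illustrative) showing the vanishing-gradient contrast for a single confidently wrong prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One example: true label y = 1, but the model is confidently wrong (large negative logit)
y = 1.0
z = -8.0            # logit w.x; sigmoid(-8) is roughly 0.0003
p = sigmoid(z)

# Gradients with respect to the logit z (the x factor is omitted):
# cross-entropy: dL/dz = p - y, directly proportional to the error
grad_ce = p - y
# MSE: dL/dz = 2(p - y) * sigma'(z), where sigma'(z) = p(1 - p) is near zero at saturation
grad_mse = 2 * (p - y) * p * (1 - p)

print(f"prediction p = {p:.6f}")
print(f"cross-entropy gradient = {grad_ce:.6f}")   # large corrective signal (about -1)
print(f"MSE gradient = {grad_mse:.8f}")            # dampened by sigma'(z), nearly zero
```

The cross-entropy gradient stays close to -1 (the full error signal), while the MSE gradient is suppressed by roughly three orders of magnitude, which is exactly the plateau behavior described above.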


2. How Does Random Forest Reduce Variance, and Why Don’t We Simply Keep Adding More Trees Indefinitely?

Difficulty Level: High

Role: Senior Data Scientist

Source: LinkedIn (Ashutosh Kumar Jha, Ankit Batham)

Topic: Ensemble Methods & Optimization Trade-offs

Interview Round: ML Depth/Breadth Round (60 min)

Domain: Ensemble Methods

Question: “Explain mathematically how Random Forest reduces variance through bagging and feature randomness. Then explain why we don’t simply keep adding more trees indefinitely. Your answer should cover: (1) bias-variance decomposition, (2) correlation between trees and decorrelation advantage, (3) out-of-bag error convergence, (4) computational cost vs. marginal benefit trade-offs.”


Answer Framework

STAR Method Structure:
- Situation: Ensemble methods reduce variance by averaging predictions from diverse models, but adding trees has diminishing returns
- Task: Explain variance reduction mechanism via bootstrap sampling and feature randomness, identify point where additional trees provide minimal benefit
- Action: Derive variance reduction formula showing 1/B factor for B uncorrelated trees, explain how correlation increases with more trees reducing decorrelation benefit, demonstrate OOB error plateau
- Result: Optimal forest size typically 100-500 trees where OOB error stabilizes, further trees add computational cost without accuracy improvement

Key Competencies Evaluated:
- Ensemble Theory: Understanding bagging mechanics, bootstrap aggregating, feature subsampling
- Bias-Variance Decomposition: Explaining how averaging reduces variance without increasing bias
- Correlation Analysis: Recognizing that tree correlation limits variance reduction
- Practical Optimization: Balancing model performance with computational constraints

Answer

Variance reduction mechanism: Random Forest reduces variance through bagging (bootstrap aggregating), where each tree is trained on a bootstrap sample drawn with replacement from the n training examples (an n-sized dataset containing ~63.2% unique samples). The B trees produce predictions f₁(x), f₂(x), …, f_B(x), and the final prediction is f̂(x) = (1/B)∑ᵢ fᵢ(x) with variance Var[f̂(x)] = Var[(1/B)∑ᵢ fᵢ(x)] = (1/B²)∑ᵢ∑ⱼ Cov[fᵢ(x), fⱼ(x)]. For uncorrelated trees (Cov[fᵢ, fⱼ] = 0 when i≠j) this simplifies to Var[f̂(x)] = (1/B²)×B×σ² = σ²/B, where σ² is the individual tree variance, so variance decreases by a factor of 1/B as trees are added.

Feature randomness enhances decorrelation by selecting a random subset of m features at each split (typically m = √p for classification and m = p/3 for regression, where p is the total number of features), preventing trees from always choosing the same strong predictors. With pairwise tree correlation ρ, the actual variance is Var[f̂(x)] = ρσ² + (1-ρ)σ²/B, showing that variance reduction is limited by correlation: when ρ=1 all trees are identical and averaging yields no reduction; when ρ=0 the full 1/B reduction is achieved.

Diminishing returns explain why we don't keep adding trees indefinitely: (1) as B grows, the (1-ρ)σ²/B term vanishes but the correlation floor ρσ² remains, so no number of additional trees can reduce variance below ρσ²; (2) computational cost grows linearly with B, with training time O(B×n×p×log n) and prediction time O(B×tree_depth); (3) the out-of-bag (OOB) error plateaus: the ~37% of samples left out of each bootstrap provide an unbiased error estimate, and plotting OOB error vs. number of trees shows a rapid initial decrease followed by a plateau, typically around 100-500 trees; (4) marginal benefit follows the law of diminishing returns, where the first 50 trees provide most of the accuracy gain, the next 50 a smaller gain, and beyond 200-500 trees the improvement (often <0.1% accuracy) does not justify the 2-5x computational cost.

Practical recommendation: an optimal forest size of 100-500 trees balances accuracy and efficiency, validated empirically by the stabilization point of the OOB error curve. Production deployments favor smaller forests (100-200 trees) for latency-sensitive applications and larger forests (500-1000 trees) for offline batch predictions where accuracy is paramount. Beyond the convergence point, tuning max_depth, min_samples_split, and max_features typically provides better ROI than simply adding more trees.

Code

Random Forest Implementation with Variance Analysis:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                          n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Analyze OOB error vs number of trees
n_trees_range = range(10, 501, 10)
oob_errors = []
test_errors = []
training_times = []

import time

for n_trees in n_trees_range:
    start = time.time()

    rf = RandomForestClassifier(
        n_estimators=n_trees,
        max_features='sqrt',  # Feature randomness
        bootstrap=True,       # Bagging
        oob_score=True,       # Calculate OOB error
        random_state=42,
        n_jobs=-1
    )
    rf.fit(X_train, y_train)

    training_times.append(time.time() - start)
    oob_errors.append(1 - rf.oob_score_)
    test_errors.append(1 - rf.score(X_test, y_test))

# Plot convergence
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(n_trees_range, oob_errors, label='OOB Error')
plt.plot(n_trees_range, test_errors, label='Test Error')
plt.xlabel('Number of Trees')
plt.ylabel('Error Rate')
plt.title('Error Convergence')
plt.legend()
plt.grid(True)

plt.subplot(1, 3, 2)
plt.plot(n_trees_range, training_times)
plt.xlabel('Number of Trees')
plt.ylabel('Training Time (seconds)')
plt.title('Computational Cost')
plt.grid(True)

plt.subplot(1, 3, 3)
marginal_benefit = np.diff(oob_errors)
plt.plot(n_trees_range[1:], np.abs(marginal_benefit))
plt.xlabel('Number of Trees')
plt.ylabel('Absolute OOB Error Change')
plt.title('Marginal Benefit (Diminishing Returns)')
plt.grid(True)

plt.tight_layout()
plt.savefig('random_forest_convergence.png')

# Variance reduction demonstration
from sklearn.tree import DecisionTreeClassifier

# Single tree variance (high)
single_tree_predictions = []
for i in range(100):
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_train, y_train)
    single_tree_predictions.append(tree.predict_proba(X_test)[:, 1])

single_tree_variance = np.var(single_tree_predictions, axis=0).mean()

# Random Forest variance (low)
rf_predictions = []
for i in range(100):
    rf = RandomForestClassifier(n_estimators=100, random_state=i)
    rf.fit(X_train, y_train)
    rf_predictions.append(rf.predict_proba(X_test)[:, 1])

rf_variance = np.var(rf_predictions, axis=0).mean()

print(f"Single Tree Variance: {single_tree_variance:.6f}")
print(f"Random Forest Variance: {rf_variance:.6f}")
print(f"Variance Reduction: {(1 - rf_variance/single_tree_variance)*100:.2f}%")

# Optimal number of trees (where OOB error plateaus)
oob_error_change = np.abs(np.diff(oob_errors))
plateau_threshold = 0.001  # <0.1% change
plateau_idx = np.where(oob_error_change < plateau_threshold)[0]
optimal_trees = n_trees_range[plateau_idx[0]] if len(plateau_idx) else n_trees_range[-1]
print(f"\nOptimal number of trees (OOB plateau): {optimal_trees}")
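
The correlated-averaging formula Var[f̂] = ρσ² + (1-ρ)σ²/B from the answer can also be checked directly with a simulation of equicorrelated predictions (synthetic Gaussians standing in for tree outputs; ρ and σ² are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho, B, n_trials = 1.0, 0.3, 100, 200_000

# B equicorrelated "tree" predictions: a shared component plus independent noise,
# giving each column variance sigma2 and pairwise correlation rho
shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(n_trials, 1))
indiv = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(n_trials, B))
ensemble = (shared + indiv).mean(axis=1)   # bagged (averaged) prediction

empirical = ensemble.var()
theoretical = rho * sigma2 + (1 - rho) * sigma2 / B   # = 0.307 here
print(f"empirical variance:   {empirical:.4f}")
print(f"theoretical variance: {theoretical:.4f}")
```

Note the variance floor: even as B grows, the ρσ² term (0.3 here) remains, which is why adding trees past the plateau buys almost nothing.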

3. Design a Machine Learning System to Predict Delivery Time Across Multiple Order Journey Stages

Difficulty Level: Very High

Role: Senior Data Scientist / Staff Data Scientist

Source: Swiggy Bytes (How ML Powers - When is my order coming?), GeeksforGeeks

Topic: ML System Design & ETA Prediction

Interview Round: ML System Design / Case Study (90-120 min)

Domain: Logistics Optimization & Delivery Science

Question: “Design an end-to-end ML system predicting delivery time through four distinct stages: Order-to-Assignment (O2A), First Mile (FM), Wait Time (WT), and Last Mile (LM). Your solution should cover: (1) Model architecture (MIMO deep learning with entity embeddings), (2) Feature engineering (real-time stress signals, restaurant behavior, traffic patterns), (3) Evaluation metrics (MAE, sudden jump detection, P99 latency), (4) Production challenges (1-minute batch serving, distribution shift handling).”


Answer Framework

STAR Method Structure:
- Situation: Delivery time prediction critical for customer experience, requiring stage-specific models handling different feature sets and temporal dynamics
- Task: Design multi-stage prediction system balancing accuracy (low MAE) with customer anxiety prevention (avoid sudden jumps) under strict latency constraints (1-min batch serving)
- Action: Implement MIMO neural network with shared layers, entity embeddings for restaurants/DEs, real-time stress signals, traffic integration, separate models per stage
- Result: ~15% accuracy improvement over gradient boosting, ~12% reduction in delivery time jumps, P99 latency 71ms enabling city-level batch predictions within 1-minute SLA

Key Competencies Evaluated:
- System Design: End-to-end ML pipeline from feature engineering to production serving
- Multi-Task Learning: MIMO architecture with shared representations across stages
- Feature Engineering: Real-time signals, entity embeddings, temporal patterns
- Production ML: Latency optimization, partial predictions under timeout, micro-batching

Answer

Model architecture implements a Multi-Input Multi-Output (MIMO) deep network in which four prediction heads (O2A, FM, WT, LM) share lower embedding layers that learn common representations of restaurants, delivery executives, and geographic zones. Entity embeddings map high-cardinality categorical features (restaurant_id with 50k+ values, DE_id with 100k+ values) to dense 32-64 dimensional vectors capturing latent characteristics such as restaurant prep speed and DE efficiency. Shared layers (3-4 dense layers with 128-256 units, ReLU activation, batch normalization, dropout 0.3) learn cross-stage patterns, while stage-specific heads (2 dense layers per stage) specialize in stage-unique dynamics.

Feature engineering incorporates real-time stress signals (active_DEs / active_orders ratio indicating system load, updated every minute), restaurant preparation behavior (historical average prep time by restaurant-hour-day, menu complexity score, kitchen capacity utilization), traffic patterns (Google Maps API real-time traffic data, historical traffic by route-time, weather conditions affecting delivery speed), delivery executive features (current location pings, historical speed percentiles, acceptance rate, fatigue indicators from consecutive deliveries), temporal features (hour of day, day of week, holiday flags, special events), and geographic features (distance calculations, zone-level demand density, restaurant-customer proximity clusters).

Evaluation metrics prioritize Mean Absolute Error (MAE) as the primary accuracy measure (target <3 minutes per stage), sudden jump detection penalizing predictions that change >5 minutes between consecutive updates (a source of customer anxiety), P99 latency measuring 99th-percentile prediction time (must be <100ms for real-time serving), and business metrics (% of orders delivered within the predicted window, correlation of customer satisfaction with prediction accuracy).

Production challenges are addressed by serving predictions within 1-minute latency requirements via city-level batch jobs processing all active orders simultaneously (5k+ peak predictions/second), a feature store caching frequently accessed features (restaurant embeddings, zone statistics) in Redis with a 1-minute TTL, daily model retraining on the last 7 days of data with concept drift detection monitoring feature distributions, partial predictions under timeout returning best-effort results for critical stages (LM is most important) rather than failing entirely when upstream services are slow, and micro-batching strategies grouping predictions by city-zone for efficient GPU utilization.

Migration from GBT to neural networks achieved ~15% accuracy improvement (MAE reduced from 4.2 to 3.6 minutes) and ~12% reduction in delivery time jumps (from 18% to 6% of predictions) through better handling of entity interactions via embeddings, capture of non-linear temporal patterns, and shared representations across stages reducing overfitting on sparse stage-specific data. Production deployment uses TensorFlow Serving with model versioning enabling A/B testing of new architectures, monitoring dashboards tracking prediction accuracy by stage-city-time, and nightly feedback loops retraining models on actual delivery outcomes to incorporate the latest behavioral patterns.

Code

Multi-Stage ETA Prediction System:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd

class MultiStageETAPredictor:
    def __init__(self, n_restaurants, n_delivery_execs, n_zones):
        """
        MIMO architecture for 4-stage delivery time prediction
        Stages: O2A, FM, WT, LM
        """
        self.n_restaurants = n_restaurants
        self.n_delivery_execs = n_delivery_execs
        self.n_zones = n_zones
        self.model = self.build_model()

    def build_model(self):
        # Input layers
        restaurant_id = layers.Input(shape=(1,), name='restaurant_id')
        de_id = layers.Input(shape=(1,), name='de_id')
        zone_id = layers.Input(shape=(1,), name='zone_id')

        # Numerical features
        stress_signal = layers.Input(shape=(1,), name='stress_signal')  # active_DEs/active_orders
        distance = layers.Input(shape=(1,), name='distance')
        traffic_index = layers.Input(shape=(1,), name='traffic_index')
        hour_of_day = layers.Input(shape=(1,), name='hour_of_day')
        prep_time_hist = layers.Input(shape=(1,), name='prep_time_hist')

        # Entity embeddings (dimensionality reduction for high-cardinality categoricals)
        restaurant_embedding = layers.Embedding(
            input_dim=self.n_restaurants,
            output_dim=64,
            name='restaurant_emb'
        )(restaurant_id)
        restaurant_embedding = layers.Flatten()(restaurant_embedding)

        de_embedding = layers.Embedding(
            input_dim=self.n_delivery_execs,
            output_dim=32,
            name='de_emb'
        )(de_id)
        de_embedding = layers.Flatten()(de_embedding)

        zone_embedding = layers.Embedding(
            input_dim=self.n_zones,
            output_dim=16,
            name='zone_emb'
        )(zone_id)
        zone_embedding = layers.Flatten()(zone_embedding)

        # Concatenate all features
        concat = layers.Concatenate()([
            restaurant_embedding, de_embedding, zone_embedding,
            stress_signal, distance, traffic_index, hour_of_day, prep_time_hist
        ])

        # Shared layers (learn common representations)
        shared = layers.Dense(256, activation='relu', name='shared_1')(concat)
        shared = layers.BatchNormalization()(shared)
        shared = layers.Dropout(0.3)(shared)

        shared = layers.Dense(128, activation='relu', name='shared_2')(shared)
        shared = layers.BatchNormalization()(shared)
        shared = layers.Dropout(0.3)(shared)

        shared = layers.Dense(64, activation='relu', name='shared_3')(shared)

        # Stage-specific heads
        # O2A (Order to Assignment)
        o2a = layers.Dense(32, activation='relu', name='o2a_1')(shared)
        o2a_output = layers.Dense(1, activation='relu', name='o2a_output')(o2a)

        # FM (First Mile - restaurant to pickup)
        fm = layers.Dense(32, activation='relu', name='fm_1')(shared)
        fm_output = layers.Dense(1, activation='relu', name='fm_output')(fm)

        # WT (Wait Time at restaurant)
        wt = layers.Dense(32, activation='relu', name='wt_1')(shared)
        wt_output = layers.Dense(1, activation='relu', name='wt_output')(wt)

        # LM (Last Mile - pickup to customer)
        lm = layers.Dense(32, activation='relu', name='lm_1')(shared)
        lm_output = layers.Dense(1, activation='relu', name='lm_output')(lm)

        # Build model
        model = keras.Model(
            inputs=[restaurant_id, de_id, zone_id, stress_signal, distance,
                   traffic_index, hour_of_day, prep_time_hist],
            outputs=[o2a_output, fm_output, wt_output, lm_output]
        )

        # Custom loss: MAE + jump penalty
        model.compile(
            optimizer=keras.optimizers.Adam(learning_rate=0.001),
            loss={
                'o2a_output': 'mae',
                'fm_output': 'mae',
                'wt_output': 'mae',
                'lm_output': 'mae'
            },
            loss_weights={
                'o2a_output': 0.2,
                'fm_output': 0.3,
                'wt_output': 0.2,
                'lm_output': 0.3  # Last mile most important
            },
            metrics=['mae']
        )

        return model

    def predict_batch(self, batch_data, timeout_ms=100):
        """
        Batch prediction with timeout handling
        Returns partial predictions if timeout exceeded
        """
        import time
        start = time.time()

        try:
            predictions = self.model.predict(batch_data, batch_size=512)

            # Check latency
            latency_ms = (time.time() - start) * 1000
            if latency_ms > timeout_ms:
                print(f"Warning: Prediction latency {latency_ms:.2f}ms exceeds {timeout_ms}ms")

            # Total ETA = sum of all stages
            total_eta = sum(predictions)

            return {
                'o2a': predictions[0],
                'fm': predictions[1],
                'wt': predictions[2],
                'lm': predictions[3],
                'total_eta': total_eta,
                'latency_ms': latency_ms
            }

        except Exception as e:
            # Partial prediction fallback (return LM only if critical)
            print(f"Prediction error: {e}, returning fallback")
            return {'lm': self.fallback_lm_prediction(batch_data)}

    def detect_sudden_jumps(self, current_pred, previous_pred, threshold=5.0):
        """
        Detect sudden jumps in predictions (causes customer anxiety)
        """
        jump = abs(current_pred - previous_pred)
        return jump > threshold

    def fallback_lm_prediction(self, batch_data):
        """
        Heuristic fallback when model prediction fails:
        distance-based LM estimate at an assumed average speed
        """
        avg_speed_kmph = 20.0
        return batch_data['distance'] / avg_speed_kmph * 60  # minutes

# Example usage
n_restaurants = 50000
n_delivery_execs = 100000
n_zones = 500

predictor = MultiStageETAPredictor(n_restaurants, n_delivery_execs, n_zones)

# Simulate batch prediction
batch_size = 1000
batch_data = {
    'restaurant_id': np.random.randint(0, n_restaurants, batch_size),
    'de_id': np.random.randint(0, n_delivery_execs, batch_size),
    'zone_id': np.random.randint(0, n_zones, batch_size),
    'stress_signal': np.random.uniform(0.5, 2.0, batch_size),  # DEs/orders ratio
    'distance': np.random.uniform(1.0, 10.0, batch_size),  # km
    'traffic_index': np.random.uniform(0.8, 1.5, batch_size),
    'hour_of_day': np.random.randint(0, 24, batch_size),
    'prep_time_hist': np.random.uniform(10, 30, batch_size)  # minutes
}

results = predictor.predict_batch(batch_data, timeout_ms=100)
print(f"Batch prediction latency: {results['latency_ms']:.2f}ms")
print(f"Average total ETA: {results['total_eta'].mean():.2f} minutes")

Feature Store Integration (Redis):

import redis
import json
import time

class ETAFeatureStore:
    def __init__(self, redis_host='localhost', redis_port=6379):
        self.redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)
        self.ttl = 60  # 1 minute cache

    def get_restaurant_features(self, restaurant_id):
        """
        Fetch restaurant features from cache (embeddings, historical prep time)
        """
        key = f"restaurant:{restaurant_id}"
        cached = self.redis_client.get(key)

        if cached:
            return json.loads(cached)

        # Cache miss - fetch from database
        features = self.fetch_from_db(restaurant_id)
        self.redis_client.setex(key, self.ttl, json.dumps(features))
        return features

    def get_realtime_stress_signal(self, city, zone):
        """
        Real-time system stress: active_DEs / active_orders
        Updated every minute by separate service
        """
        key = f"stress:{city}:{zone}"
        stress = self.redis_client.get(key)
        return float(stress) if stress else 1.0  # Default balanced

    def fetch_from_db(self, restaurant_id):
        # Simulate database fetch
        return {
            'avg_prep_time': 15.0,
            'menu_complexity': 0.7,
            'kitchen_capacity': 10
        }
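
The sudden-jump concern from the answer can also be handled at display time. A minimal smoothing sketch (hypothetical alpha and cap values, not Swiggy's production logic) that blends each new model ETA with the previously shown one and caps the per-update change:

```python
def smooth_eta(prev_eta, new_eta, alpha=0.4, max_jump=5.0):
    """Blend the new model ETA with the last displayed ETA, capping the visible change."""
    blended = alpha * new_eta + (1 - alpha) * prev_eta
    delta = max(-max_jump, min(max_jump, blended - prev_eta))
    return prev_eta + delta

# A 12-minute model jump (30 -> 42) surfaces gradually instead of all at once
eta = 30.0
for model_eta in [42.0, 41.0, 25.0]:
    eta = smooth_eta(eta, model_eta)
    print(round(eta, 2))
```

With these parameters the displayed ETA moves 30 → 34.8 → 37.28 → 32.37, each update changing by at most max_jump minutes.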

4. Explain the Mathematical Foundations of Support Vector Machines - Hard Margin vs. Soft Margin, Kernel Trick, and Outlier Sensitivity

Difficulty Level: Very High

Role: Senior Data Scientist / Staff Data Scientist

Source: LinkedIn (Ashutosh Kumar Jha, Surbhi Walecha)

Topic: Statistical Learning Theory

Interview Round: Technical/ML Depth Round (60 min)

Domain: Core ML Concepts

Question: “Derive and explain: (1) the optimization problem for Hard Margin SVMs with geometric interpretation, (2) C-parameter introduction for Soft Margin SVMs and its role in controlling the margin-violation trade-off, (3) why outliers affect Soft Margin SVMs less than Hard Margin due to slack variables, (4) kernel trick derivation showing how non-linear problems can be solved in higher-dimensional spaces without explicit feature transformation. Compare SVM with Logistic Regression on outlier sensitivity.”


Answer Framework

STAR Method Structure:
- Situation: SVMs provide maximum-margin classification with theoretical guarantees, requiring deep understanding of optimization formulation and kernel methods
- Task: Derive Hard/Soft Margin formulations, explain C-parameter trade-off, demonstrate kernel trick mechanics, compare outlier robustness with Logistic Regression
- Action: Present Lagrangian formulation, dual problem, support vector interpretation, kernel selection criteria, slack variable mechanics
- Result: Complete mathematical framework enabling informed SVM application, hyperparameter tuning (C, kernel choice), and understanding when SVM preferred over alternatives

Key Competencies Evaluated:
- Optimization Theory: Lagrangian formulation, KKT conditions, dual problem derivation
- Geometric Intuition: Maximum margin interpretation, support vector identification
- Kernel Methods: Implicit feature mapping, kernel selection (RBF, polynomial, linear)
- Robustness Analysis: Outlier sensitivity comparison across algorithms

Answer

Hard Margin SVM optimization seeks the maximum-margin hyperplane separating the classes: the decision boundary w·x + b = 0 maximizes the margin 2/||w|| subject to yᵢ(w·xᵢ + b) ≥ 1 for all training points (yᵢ ∈ {-1,+1}), formulated as min (1/2)||w||² subject to yᵢ(w·xᵢ + b) ≥ 1. Geometrically, the margin 2/||w|| is the distance between the parallel hyperplanes w·x + b = +1 and w·x + b = -1 that contain the support vectors (points exactly on the margin boundaries). The Lagrangian L(w,b,α) = (1/2)||w||² - ∑ᵢ αᵢ[yᵢ(w·xᵢ + b) - 1] with multipliers αᵢ ≥ 0 yields the dual problem max_α ∑ᵢ αᵢ - (1/2)∑ᵢ∑ⱼ αᵢαⱼyᵢyⱼ(xᵢ·xⱼ) subject to ∑ᵢ αᵢyᵢ = 0 and αᵢ ≥ 0, revealing the support vectors as the points with αᵢ > 0 (KKT complementary slackness).

Soft Margin SVM introduces slack variables ξᵢ ≥ 0 allowing margin violations: min (1/2)||w||² + C∑ᵢ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ and ξᵢ ≥ 0. The C-parameter controls the trade-off between margin maximization (small ||w||) and training error minimization (small ∑ξᵢ): large C emphasizes correct classification (approaching Hard Margin, sensitive to outliers), while small C allows more violations (wider margin, more robust to outliers). The dual formulation max_α ∑ᵢ αᵢ - (1/2)∑ᵢ∑ⱼ αᵢαⱼyᵢyⱼ(xᵢ·xⱼ) subject to 0 ≤ αᵢ ≤ C and ∑ᵢ αᵢyᵢ = 0 shows that C acts as an upper bound on the Lagrange multipliers.

Outlier sensitivity: Hard Margin SVM is extremely sensitive because a single outlier preventing linear separation either makes the problem infeasible or forces the hyperplane to accommodate the outlier, dramatically changing the decision boundary. Soft Margin SVM becomes robust through slack variables allowing controlled violations: an outlier is assigned a large ξᵢ (penalized by C×ξᵢ) but does not force a hyperplane shift. With appropriate C tuning (e.g., C=1.0), Soft Margin is comparable to Logistic Regression in outlier robustness (both allow misclassifications with a controlled penalty).

Kernel trick enables non-linear classification by implicitly mapping data to a higher-dimensional space via a kernel function K(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ), where φ is the feature transformation. Since the dual problem depends only on dot products xᵢ·xⱼ, these can be replaced by K(xᵢ, xⱼ), avoiding explicit computation of φ. Common kernels include linear K(x,y) = x·y, polynomial K(x,y) = (x·y + c)^d, and RBF/Gaussian K(x,y) = exp(-γ||x-y||²), where γ controls decision boundary smoothness. Kernel selection weighs data separability (linear for linearly separable data, RBF for complex non-linear boundaries), computational cost (linear fastest, RBF moderate, high-degree polynomial expensive), and overfitting risk (RBF with large γ overfits, small γ underfits).

SVM vs. Logistic Regression: Logistic Regression uses the sigmoid with cross-entropy loss, differentiable everywhere, but outliers still shift its decision boundary (though less dramatically than in linear regression); Hard Margin SVM is extremely sensitive and requires separability; Soft Margin SVM with appropriate C is comparable to Logistic Regression in outlier robustness. Practical guidance: prefer SVM for high-dimensional data with a clear margin (text classification, image recognition), Logistic Regression for probabilistic outputs and interpretability, and Soft Margin SVM with an RBF kernel for complex non-linear boundaries with moderate outlier presence.
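
The kernel identity K(x,y) = φ(x)·φ(y) can be verified numerically for the degree-2 polynomial kernel with c = 0, whose explicit 2-D feature map is φ(x) = (x₁², √2·x₁x₂, x₂²) (a small sketch; the test vectors are arbitrary):

```python
import numpy as np

def poly2_kernel(x, y):
    # K(x, y) = (x . y)^2: degree-2 polynomial kernel, c = 0
    return np.dot(x, y) ** 2

def phi(x):
    # Explicit feature map satisfying phi(x) . phi(y) = (x . y)^2 in 2-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_implicit = poly2_kernel(x, y)      # computed in input space: (3 - 2)^2 = 1
k_explicit = np.dot(phi(x), phi(y))  # computed in 3-D feature space, same value

print(k_implicit, k_explicit)
```

The dual problem only ever sees these dot products, so the higher-dimensional mapping never has to be materialized during training, which is the whole point of the trick.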

Code

SVM Implementation with Hard/Soft Margin Comparison:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_circles
from sklearn.model_selection import train_test_split

# Generate linearly separable data with outliers
np.random.seed(42)
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                          n_informative=2, n_clusters_per_class=1,
                          class_sep=2.0, random_state=42)

# Add outliers
outlier_indices = [10, 50]
X[outlier_indices] = X[outlier_indices] + np.array([[3, 3], [-3, -3]])

# Hard Margin SVM (will struggle with outliers)
try:
    hard_margin_svm = SVC(kernel='linear', C=1e10)  # Very large C ~ Hard Margin
    hard_margin_svm.fit(X, y)
    print("Hard Margin SVM converged (no outliers blocking separability)")
except Exception:
    print("Hard Margin SVM failed (outliers prevent linear separation)")

# Soft Margin SVM with different C values
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, C in enumerate(C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X, y)

    # Plot decision boundary
    ax = axes[idx]
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 100),
                         np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 100))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.3, colors=['red', 'white', 'blue'])
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolors='k')
    ax.scatter(X[outlier_indices, 0], X[outlier_indices, 1],
              s=200, facecolors='none', edgecolors='green', linewidths=3, label='Outliers')

    # Highlight support vectors
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
              s=100, facecolors='none', edgecolors='black', linewidths=2, label='Support Vectors')

    ax.set_title(f'Soft Margin SVM (C={C})\n{len(svm.support_)} support vectors')
    ax.legend()
    ax.grid(True)

# Compare with Logistic Regression
ax = axes[5]
log_reg = LogisticRegression()
log_reg.fit(X, y)

Z_lr = log_reg.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z_lr = Z_lr.reshape(xx.shape)

ax.contourf(xx, yy, Z_lr, levels=20, alpha=0.3, cmap='RdBu')
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolors='k')
ax.scatter(X[outlier_indices, 0], X[outlier_indices, 1],
          s=200, facecolors='none', edgecolors='green', linewidths=3, label='Outliers')
ax.set_title('Logistic Regression\n(Outlier Comparison)')
ax.legend()
ax.grid(True)

plt.tight_layout()
plt.savefig('svm_soft_margin_comparison.png')

# Kernel Trick Demonstration
X_nonlinear, y_nonlinear = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

kernels = ['linear', 'poly', 'rbf']
kernel_params = [
    {},
    {'degree': 3, 'coef0': 1},
    {'gamma': 'scale'}
]

for idx, (kernel, params) in enumerate(zip(kernels, kernel_params)):
    svm = SVC(kernel=kernel, C=1.0, **params)
    svm.fit(X_nonlinear, y_nonlinear)

    xx, yy = np.meshgrid(np.linspace(X_nonlinear[:, 0].min()-0.5, X_nonlinear[:, 0].max()+0.5, 100),
                         np.linspace(X_nonlinear[:, 1].min()-0.5, X_nonlinear[:, 1].max()+0.5, 100))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axes[idx].contourf(xx, yy, Z, levels=20, alpha=0.3, cmap='RdBu')
    axes[idx].scatter(X_nonlinear[:, 0], X_nonlinear[:, 1], c=y_nonlinear, cmap='bwr', edgecolors='k')
    axes[idx].scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
                     s=100, facecolors='none', edgecolors='black', linewidths=2)
    axes[idx].set_title(f'{kernel.upper()} Kernel\nAccuracy: {svm.score(X_nonlinear, y_nonlinear):.3f}')
    axes[idx].grid(True)

plt.tight_layout()
plt.savefig('svm_kernel_trick.png')

# Outlier sensitivity quantitative analysis
from sklearn.metrics import accuracy_score

# Create dataset with varying outlier percentages
outlier_percentages = [0, 5, 10, 20, 30]
svm_accuracies = []
logreg_accuracies = []

for outlier_pct in outlier_percentages:
    X_test, y_test = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                         n_informative=2, n_clusters_per_class=1,
                                         class_sep=2.0, random_state=42)

    # Add outliers
    n_outliers = int(len(X_test) * outlier_pct / 100)
    if n_outliers > 0:
        outlier_idx = np.random.choice(len(X_test), n_outliers, replace=False)
        X_test[outlier_idx] += np.random.normal(0, 3, (n_outliers, 2))

    # Train models (note: scores below are computed on the training data itself,
    # measuring how much outliers distort the fitted boundary, not generalization)
    svm = SVC(kernel='linear', C=1.0)
    logreg = LogisticRegression()

    svm.fit(X_test, y_test)
    logreg.fit(X_test, y_test)

    svm_accuracies.append(svm.score(X_test, y_test))
    logreg_accuracies.append(logreg.score(X_test, y_test))

plt.figure(figsize=(8, 5))
plt.plot(outlier_percentages, svm_accuracies, marker='o', label='Soft Margin SVM (C=1.0)')
plt.plot(outlier_percentages, logreg_accuracies, marker='s', label='Logistic Regression')
plt.xlabel('Outlier Percentage (%)')
plt.ylabel('Accuracy')
plt.title('Outlier Sensitivity: SVM vs Logistic Regression')
plt.legend()
plt.grid(True)
plt.savefig('outlier_sensitivity_comparison.png')

print("\nOutlier Sensitivity Analysis:")
print(f"{'Outlier %':<12}{'SVM Acc':<12}{'LogReg Acc':<12}")
for pct, svm_acc, lr_acc in zip(outlier_percentages, svm_accuracies, logreg_accuracies):
    print(f"{pct:<12}{svm_acc:<12.3f}{lr_acc:<12.3f}")
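As a quick numerical sanity check of the RBF kernel formula from the answer, the closed form K(x,y) = exp(-γ||x-y||²) can be compared against sklearn's `rbf_kernel` (the input points and γ below are chosen arbitrarily):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Verify K(x, y) = exp(-gamma * ||x - y||^2) against sklearn's implementation
x = np.array([[1.0, 2.0]])
y = np.array([[2.0, 0.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((x - y) ** 2))   # ||x - y||^2 = 1 + 4 = 5
library = rbf_kernel(x, y, gamma=gamma)[0, 0]

assert np.isclose(manual, library)
print(manual)  # exp(-2.5) ≈ 0.0821
```

The same comparison works for the polynomial kernel via `sklearn.metrics.pairwise.polynomial_kernel` with matching `degree` and `coef0`.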

5. Design a Recommendation System for Restaurant Ranking in Swiggy’s Feed

Difficulty Level: Very High

Role: Senior Data Scientist

Source: Swiggy Bytes (Evolution of Feed Ranking, Learning To Rank Restaurants)

Topic: Recommendation Systems & Learning-to-Rank

Interview Round: Case Study / ML System Design (90-120 min)

Domain: Recommendations & Personalization

Question: “Design a system that ranks thousands of restaurants for a given customer in a given session. Your solution should cover: (1) Learning-to-Rank framework with pairwise vs listwise loss functions, (2) Feature engineering (customer preferences, restaurant similarity via LDA, real-time SLA), (3) Wide-and-Deep architecture for memorization and generalization, (4) Evaluation metrics (nDCG, MRR, median click depth), (5) Multi-objective optimization balancing relevance and business metrics.”


Answer Framework

STAR Method Structure:
- Situation: Restaurant feed ranking critical for order conversion, requiring personalization balancing customer preferences with business objectives (revenue, DP efficiency)
- Task: Design Learning-to-Rank system handling 50k+ restaurants per city, real-time serving under 100ms latency, A/B testing showing offline-online metric correlation
- Action: Implement pairwise loss LTR (9% better than listwise), LDA topic modeling for restaurant similarity, Wide-and-Deep architecture, diversity constraints
- Result: Improved nDCG by 12%, median first-click depth reduced from position 8 to position 5, 15% increase in orders from top-10 positions

Key Competencies Evaluated:
- Ranking Algorithms: LTR formulation, pairwise vs listwise loss comparison
- Feature Engineering: Topic modeling (LDA), customer embeddings, real-time signals
- Neural Architecture: Wide-and-Deep for memorization-generalization trade-off
- Evaluation: nDCG, MRR, business metrics correlation with offline metrics

Answer

The Learning-to-Rank framework formulates ranking as supervised learning where each training sample contains context (customer_id, session_time, location) and candidate restaurants with labels (0 if no order, 1 if ordered). The pairwise loss optimizes ordered pairs, L_pairwise = ∑_{(i,j): y_i > y_j} max(0, 1 − (s(x_i) − s(x_j))), penalizing cases where a lower-ranked restaurant scores higher than a higher-ranked one (s is the scoring function). It outperforms listwise loss by 9% in ordered-pair accuracy because pairwise loss directly optimizes relative ordering, matching the ranking objective, whereas listwise loss (e.g., softmax cross-entropy over the full list) treats ranking as classification and loses pairwise relationships.

Feature engineering incorporates customer preference scores from order history (cuisine embeddings, price-tier preferences, dietary restrictions); restaurant similarity scores via LDA topic modeling on restaurant items, descriptions, and cuisine tags, creating 50-dimensional topic distributions where similar restaurants have high cosine similarity, enabling "customers who liked X also like Y" recommendations; restaurant popularity scores (order count, rating, review sentiment); real-time SLA conditions (current delivery time, restaurant preparation load, DP availability affecting deliverability); diversity constraints preventing a feed dominated by a single cuisine (a diversity penalty injected into ranking); and temporal features (time-of-day preferences, day-of-week patterns, holiday effects).

The Wide-and-Deep architecture combines a wide component (linear model with cross-product feature transformations) for memorization of customer-specific preferences (e.g., user 123 always orders biryani on Fridays) with a deep component (4-layer neural network with 128-64-32-16 units) for generalization, learning latent patterns across customers. Separate embedding layers encode city (capturing city-specific food culture) and time slots (breakfast/lunch/dinner preferences differ). The model is trained end-to-end minimizing the pairwise loss with the Adam optimizer (learning rate 0.001, batch size 512) and served in production via TensorFlow Serving, with model versioning enabling A/B testing.

Evaluation metrics prioritize nDCG (normalized Discounted Cumulative Gain), which measures ranking quality with position-based discounting (top positions weighted higher); MRR (Mean Reciprocal Rank), the average of 1/rank_of_first_ordered_restaurant, indicating how quickly users find relevant restaurants; median first-click depth, showing typical scrolling required before engagement; median ordered-click depth, measuring scrolling before an actual order; percentage of orders from top-X positions (X = 5, 10, 20), indicating feed quality; and sessions needing search (a signal of feed failure requiring manual search).

Multi-objective optimization balances customer relevance (maximize orders) with business objectives including revenue (promote high-AOV restaurants), delivery partner efficiency (avoid overloading specific zones), and restaurant partner satisfaction (distribute orders fairly so long-tail restaurants are not left with zero visibility). This is implemented via weighted scoring, s_final = 0.6×s_relevance + 0.2×s_revenue + 0.1×s_DP_efficiency + 0.1×s_fairness, with weights tuned via A/B testing measuring overall platform health. The A/B testing methodology shows correlation between offline nDCG improvements and online engagement lift (1% nDCG improvement → 0.8% order increase), validating the offline metric as a proxy for online success.

Code

Restaurant Ranking System with LTR:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

class RestaurantRankingModel:
    def __init__(self, n_customers, n_restaurants, n_cities, n_topics=50):
        self.n_customers = n_customers
        self.n_restaurants = n_restaurants
        self.n_cities = n_cities
        self.n_topics = n_topics
        self.model = self.build_wide_and_deep()

    def build_wide_and_deep(self):
        # Wide component inputs (memorization)
        customer_id = layers.Input(shape=(1,), name='customer_id')
        restaurant_id = layers.Input(shape=(1,), name='restaurant_id')
        city_id = layers.Input(shape=(1,), name='city_id')

        # Cross-product features (wide). Note: a production wide component would
        # typically use hashed customer x restaurant cross features; concatenating
        # the raw ID inputs is a simplified stand-in here.
        customer_restaurant_cross = layers.Concatenate()([customer_id, restaurant_id])

        # Deep component inputs (generalization)
        customer_features = layers.Input(shape=(20,), name='customer_features')  # Order history embeddings
        restaurant_features = layers.Input(shape=(self.n_topics,), name='restaurant_features')  # LDA topics
        realtime_features = layers.Input(shape=(5,), name='realtime_features')  # SLA, load, etc.

        # Wide path
        wide = layers.Dense(1, activation=None, name='wide_output')(customer_restaurant_cross)

        # Deep path
        customer_emb = layers.Embedding(self.n_customers, 32)(customer_id)
        customer_emb = layers.Flatten()(customer_emb)

        restaurant_emb = layers.Embedding(self.n_restaurants, 32)(restaurant_id)
        restaurant_emb = layers.Flatten()(restaurant_emb)

        city_emb = layers.Embedding(self.n_cities, 16)(city_id)
        city_emb = layers.Flatten()(city_emb)

        deep_concat = layers.Concatenate()([
            customer_emb, restaurant_emb, city_emb,
            customer_features, restaurant_features, realtime_features
        ])

        deep = layers.Dense(128, activation='relu')(deep_concat)
        deep = layers.Dropout(0.3)(deep)
        deep = layers.Dense(64, activation='relu')(deep)
        deep = layers.Dropout(0.3)(deep)
        deep = layers.Dense(32, activation='relu')(deep)
        deep = layers.Dense(1, activation=None, name='deep_output')(deep)

        # Combine wide and deep
        combined = layers.Add()([wide, deep])
        output = layers.Activation('sigmoid', name='ranking_score')(combined)

        model = keras.Model(
            inputs=[customer_id, restaurant_id, city_id, customer_features,
                   restaurant_features, realtime_features],
            outputs=output
        )

        return model

    def pairwise_loss(self, y_true, y_pred):
        """
        Pairwise hinge ranking loss: for every (positive, negative) pair,
        penalize max(0, 1 - (score_pos - score_neg)).
        """
        # y_true: binary labels (1 if ordered, 0 otherwise)
        # y_pred: predicted scores
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(y_pred, [-1])

        positive_mask = tf.cast(y_true > 0.5, tf.float32)
        negative_mask = 1.0 - positive_mask

        # All pairwise score differences via broadcasting: diff[i, j] = s_i - s_j
        diff = tf.expand_dims(y_pred, 1) - tf.expand_dims(y_pred, 0)

        # Hinge loss per pair: max(0, 1 - (score_i - score_j))
        loss = tf.maximum(0.0, 1.0 - diff)

        # Only count pairs where i is a positive example and j is a negative one;
        # without this mask, pos-pos and neg-neg pairs would pollute the average
        pair_mask = tf.expand_dims(positive_mask, 1) * tf.expand_dims(negative_mask, 0)
        n_pairs = tf.reduce_sum(pair_mask)

        # Average over valid (positive, negative) pairs
        return tf.reduce_sum(loss * pair_mask) / tf.maximum(n_pairs, 1.0)

    def compile_model(self):
        self.model.compile(
            optimizer=keras.optimizers.Adam(learning_rate=0.001),
            loss=self.pairwise_loss,
            metrics=['AUC']
        )

    def rank_restaurants(self, customer_data, restaurant_list, top_k=20):
        """
        Rank restaurants for a customer and return top-k
        """
        scores = self.model.predict(restaurant_list)
        ranked_indices = np.argsort(scores.flatten())[::-1][:top_k]
        return ranked_indices, scores[ranked_indices]

# LDA Topic Modeling for Restaurant Similarity
class RestaurantTopicModeling:
    def __init__(self, n_topics=50):
        self.n_topics = n_topics
        self.vectorizer = CountVectorizer(max_features=1000, stop_words='english')
        self.lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)

    def fit(self, restaurant_descriptions):
        """
        restaurant_descriptions: List of strings (menu items + cuisine + description)
        """
        # Convert to bag-of-words
        bow = self.vectorizer.fit_transform(restaurant_descriptions)

        # Fit LDA
        self.lda.fit(bow)

        return self

    def transform(self, restaurant_descriptions):
        """
        Get topic distribution for each restaurant
        Returns: (n_restaurants, n_topics) array
        """
        bow = self.vectorizer.transform(restaurant_descriptions)
        topic_distributions = self.lda.transform(bow)
        return topic_distributions

    def get_similar_restaurants(self, restaurant_idx, topic_distributions, top_k=10):
        """
        Find similar restaurants using cosine similarity on topic distributions
        """
        from sklearn.metrics.pairwise import cosine_similarity

        query_topics = topic_distributions[restaurant_idx].reshape(1, -1)
        similarities = cosine_similarity(query_topics, topic_distributions).flatten()

        # Exclude self
        similarities[restaurant_idx] = -1

        similar_indices = np.argsort(similarities)[::-1][:top_k]
        return similar_indices, similarities[similar_indices]

# Evaluation Metrics
def calculate_ndcg(y_true, y_pred, k=10):
    """
    Normalized Discounted Cumulative Gain @ k
    """
    # Sort by predicted scores
    sorted_indices = np.argsort(y_pred)[::-1][:k]
    sorted_labels = y_true[sorted_indices]

    # DCG = sum(rel_i / log2(i+2)) for i in 1..k
    dcg = np.sum(sorted_labels / np.log2(np.arange(2, k+2)))

    # IDCG (ideal DCG with perfect ranking)
    ideal_labels = np.sort(y_true)[::-1][:k]
    idcg = np.sum(ideal_labels / np.log2(np.arange(2, k+2)))

    return dcg / idcg if idcg > 0 else 0.0

def calculate_mrr(y_true, y_pred):
    """
    Mean Reciprocal Rank
    """
    sorted_indices = np.argsort(y_pred)[::-1]

    # Find first relevant item
    for rank, idx in enumerate(sorted_indices, 1):
        if y_true[idx] > 0:
            return 1.0 / rank

    return 0.0

# Example usage
n_customers = 1000000
n_restaurants = 50000
n_cities = 100

ranker = RestaurantRankingModel(n_customers, n_restaurants, n_cities)
ranker.compile_model()

# Simulate restaurant descriptions for LDA
restaurant_descriptions = [
    "biryani chicken rice spicy indian north-indian",
    "pizza pasta italian cheese tomato",
    "burger fries american fast-food",
    # ... 50k restaurants
]

topic_model = RestaurantTopicModeling(n_topics=50)
topic_model.fit(restaurant_descriptions)
topic_distributions = topic_model.transform(restaurant_descriptions)

# Find similar restaurants
similar_to_restaurant_0 = topic_model.get_similar_restaurants(0, topic_distributions, top_k=10)
print(f"Restaurants similar to restaurant 0: {similar_to_restaurant_0[0]}")

# Evaluate ranking
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # Binary labels
y_pred = np.array([0.9, 0.3, 0.2, 0.8, 0.1, 0.4, 0.3, 0.7, 0.2, 0.1])  # Predicted scores

ndcg_score = calculate_ndcg(y_true, y_pred, k=5)
mrr_score = calculate_mrr(y_true, y_pred)

print(f"nDCG@5: {ndcg_score:.4f}")
print(f"MRR: {mrr_score:.4f}")
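The weighted multi-objective score s_final described in the answer can be sketched directly. The weights (0.6/0.2/0.1/0.1) follow the answer; the component scores below are illustrative placeholders in [0, 1], not real model outputs:

```python
import numpy as np

# Weighted multi-objective ranking score:
# s_final = 0.6*s_relevance + 0.2*s_revenue + 0.1*s_DP_efficiency + 0.1*s_fairness
def final_score(relevance, revenue, dp_efficiency, fairness,
                weights=(0.6, 0.2, 0.1, 0.1)):
    components = np.array([relevance, revenue, dp_efficiency, fairness])
    return float(np.dot(weights, components))

# Two candidate restaurants: A is relevance-heavy, B is revenue-heavy
s_a = final_score(0.9, 0.3, 0.5, 0.5)  # 0.54 + 0.06 + 0.05 + 0.05 = 0.70
s_b = final_score(0.6, 0.9, 0.5, 0.5)  # 0.36 + 0.18 + 0.05 + 0.05 = 0.64
print(s_a, s_b)  # with relevance weighted at 0.6, restaurant A still ranks first
```

In practice these weights would be tuned via A/B tests against overall platform-health metrics, as the answer notes.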

6. How Would You Design a Fraud Detection System Using Weak Supervision and Limited Labeled Data?

Difficulty Level: Very High

Role: Senior Data Scientist / Staff Data Scientist

Source: Swiggy Bytes (DeFraudNet, Identifying Fraud Rings)

Topic: Anomaly Detection & Fraud Prevention

Interview Round: ML System Design / Technical Round (90 min)

Domain: Fraud & Risk Detection

Question: “Design a fraud detection system for online food delivery addressing the core challenge: manually labeled fraud data is limited and expensive. Your solution should cover: (1) Weak supervision approach using Snorkel framework, (2) Denoising strategy with class-specific autoencoders, (3) Feature engineering (cross-sectional, longitudinal, GCN embeddings), (4) Discriminator architecture (MLP + LSTM ensemble), (5) Graph-based fraud ring detection.”


Answer Framework

STAR Method Structure:
- Situation: Fraud detection requires labeled data (expensive manual review), fraud patterns evolve (concept drift), and false positives costly (legitimate customer friction)
- Task: Build system using weak supervision generating noisy labels from logical rules, denoise labels, detect individual fraud and collusive fraud rings
- Action: Implement Snorkel for label generation, class-specific autoencoders for denoising, GCN for customer relationship embeddings, MLP+LSTM discriminator, weighted community detection for fraud rings
- Result: 16 percentage points recall improvement at fixed precision, saved ~1.5M person-hours of manual labeling, detected fraud rings with 85% precision

Key Competencies Evaluated:
- Weak Supervision: Snorkel framework, labeling function design, label aggregation
- Denoising Techniques: Autoencoder-based noise filtering, reconstruction error thresholding
- Graph Neural Networks: GCN for relationship embeddings, fraud ring detection
- Production ML: Handling concept drift, precision-recall trade-offs, feedback loops

Answer

The weak supervision approach uses the Snorkel framework to generate noisy labels from handcrafted and auto-generated logical functions detecting unusual delivery patterns (order delivered to a location 50 km from the customer address), suspicious account behavior (account created <24 hours ago placing a high-value order), chargeback history (customer disputed >3 orders in the last month), and velocity checks (5+ orders from the same IP in 1 hour). Each labeling function (LF) produces a label ∈ {fraud, legitimate, abstain} with varying accuracy and coverage. Snorkel's label model aggregates LF outputs via a probabilistic model that estimates LF accuracies and correlations, producing a single probabilistic label per transaction and enabling training on millions of weakly-labeled examples versus thousands of manually-labeled ones.

The denoising strategy implements class-specific autoencoders trained separately for the fraud and legitimate classes. The fraud autoencoder, trained only on weak fraud labels, learns to reconstruct fraud patterns with low reconstruction error for true fraud and high error for mislabeled legitimate transactions (noise); the legitimate autoencoder is trained analogously on weak legitimate labels. Denoising computes reconstruction errors from both autoencoders for each weakly-labeled example, filters out examples with high reconstruction error for their assigned class (likely mislabeled noise), and retains high-confidence examples for discriminator training, achieving the 16pp recall improvement by removing label noise that confuses the classifier.

Feature engineering incorporates cross-sectional features (transaction amount, distance between order and delivery locations, payment method, device fingerprint), longitudinal features (customer order history, average order value trend, time since account creation, refund rate), and GCN embeddings capturing customer-customer relationships via a graph in which customers are connected if they share a device, IP, address, or payment method. A Graph Convolutional Network learns node embeddings by aggregating neighbor information across 2-3 hops, enabling detection of collusive fraud where multiple accounts controlled by the same fraudster exhibit similar embeddings despite no direct transaction similarity.

The discriminator architecture ensembles a shallow MLP (3 layers, 128-64-32 units, ReLU activation, dropout 0.4) capturing cross-sectional patterns with an LSTM (2 layers, 64 units) capturing temporal fraud evolution (account behavior changes over time); the MLP and LSTM outputs are concatenated and fed into a final sigmoid layer producing a fraud probability. The model is trained on denoised weak labels using weighted binary cross-entropy (class weights addressing imbalance: fraud weighted 10x legitimate). Production deployment uses threshold tuning to balance precision (minimize false positives that create friction for legitimate customers) against recall (catch fraud), typically operating at 90% precision and 70% recall based on business cost-benefit analysis.

Graph-based fraud ring detection constructs a weighted graph where nodes are customers and edges are weighted by domain knowledge (shared device 0.8, shared IP 0.6, shared address 0.9, shared payment method 0.7), applies weighted community detection (Louvain algorithm with edge weights) to identify densely connected customer clusters, and flags communities with high fraud concentration (>50% of members flagged as fraud by the discriminator) as fraud rings, achieving 85% precision in fraud ring detection and enabling bulk account suspension. Production challenges include handling concept drift via daily model retraining on the last 30 days of data, monitoring feature distributions for drift detection, feedback loops from fraud investigation outcomes that retrain models on confirmed fraud patterns, and A/B testing new fraud signals to measure precision-recall impact before production deployment.

Code

import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
import networkx as nx
from community import community_louvain

# Weak Supervision Label Generation (Snorkel-style)
class FraudLabelingFunctions:
    @staticmethod
    def lf_high_value_new_account(transaction):
        """LF: High-value order from new account"""
        if transaction['account_age_days'] < 1 and transaction['amount'] > 1000:
            return 1  # Fraud
        return -1  # Abstain

    @staticmethod
    def lf_location_mismatch(transaction):
        """LF: Delivery location far from customer address"""
        if transaction['delivery_distance_km'] > 50:
            return 1  # Fraud
        return -1  # Abstain

    @staticmethod
    def lf_chargeback_history(transaction):
        """LF: Customer has chargeback history"""
        if transaction['chargebacks_last_30d'] >= 3:
            return 1  # Fraud
        return -1  # Abstain

    @staticmethod
    def lf_velocity_check(transaction):
        """LF: Multiple orders from same IP in short time"""
        if transaction['orders_from_ip_last_hour'] >= 5:
            return 1  # Fraud
        return -1  # Abstain

    def aggregate_labels(self, transactions):
        """Aggregate LF outputs using majority voting (simplified Snorkel)"""
        lfs = [self.lf_high_value_new_account, self.lf_location_mismatch,
               self.lf_chargeback_history, self.lf_velocity_check]

        weak_labels = []
        for txn in transactions:
            votes = [lf(txn) for lf in lfs]
            fraud_votes = sum(1 for v in votes if v == 1)
            weak_labels.append(1 if fraud_votes >= 2 else 0)  # Majority

        return np.array(weak_labels)

# Class-Specific Autoencoder for Denoising
class DenosingAutoencoder:
    def __init__(self, input_dim, encoding_dim=32):
        self.input_dim = input_dim
        self.encoding_dim = encoding_dim
        self.autoencoder = self.build_autoencoder()

    def build_autoencoder(self):
        encoder_input = keras.Input(shape=(self.input_dim,))
        encoded = keras.layers.Dense(64, activation='relu')(encoder_input)
        encoded = keras.layers.Dense(self.encoding_dim, activation='relu')(encoded)

        decoded = keras.layers.Dense(64, activation='relu')(encoded)
        decoded = keras.layers.Dense(self.input_dim, activation='sigmoid')(decoded)

        autoencoder = keras.Model(encoder_input, decoded)
        autoencoder.compile(optimizer='adam', loss='mse')
        return autoencoder

    def fit(self, X_class):
        """Train autoencoder on single class"""
        self.autoencoder.fit(X_class, X_class, epochs=50, batch_size=256, verbose=0)

    def get_reconstruction_error(self, X):
        """Calculate reconstruction error"""
        reconstructed = self.autoencoder.predict(X)
        mse = np.mean((X - reconstructed) ** 2, axis=1)
        return mse

    def denoise_labels(self, X, y_weak, threshold_percentile=90):
        """Remove noisy labels based on reconstruction error"""
        reconstruction_errors = self.get_reconstruction_error(X)
        threshold = np.percentile(reconstruction_errors, threshold_percentile)

        # Keep only low-error examples (high confidence)
        clean_mask = reconstruction_errors < threshold
        return X[clean_mask], y_weak[clean_mask]

# GCN for Customer Relationship Embeddings
class CustomerGraphEmbedding:
    def __init__(self, n_customers, embedding_dim=32):
        self.n_customers = n_customers
        self.embedding_dim = embedding_dim

    def build_graph(self, shared_features):
        """
        Build customer graph from shared features
        shared_features: list of (customer_i, customer_j, feature_type, weight)
        """
        G = nx.Graph()
        for i, j, feature, weight in shared_features:
            if G.has_edge(i, j):
                G[i][j]['weight'] += weight
            else:
                G.add_edge(i, j, weight=weight)
        return G

    def detect_fraud_rings(self, G, fraud_labels):
        """
        Detect fraud rings using weighted community detection
        """
        # Louvain community detection with edge weights
        communities = community_louvain.best_partition(G, weight='weight')

        # Identify fraud rings (communities with high fraud concentration)
        community_fraud_rates = {}
        for node, community_id in communities.items():
            if community_id not in community_fraud_rates:
                community_fraud_rates[community_id] = []
            community_fraud_rates[community_id].append(fraud_labels[node])

        fraud_rings = []
        for community_id, labels in community_fraud_rates.items():
            fraud_rate = np.mean(labels)
            if fraud_rate > 0.5 and len(labels) >= 3:  # >50% fraud, >=3 members
                fraud_rings.append({
                    'community_id': community_id,
                    'fraud_rate': fraud_rate,
                    'size': len(labels),
                    'members': [n for n, c in communities.items() if c == community_id]
                })

        return fraud_rings

# Fraud Discriminator (MLP + LSTM Ensemble)
class FraudDiscriminator:
    def __init__(self, feature_dim, sequence_length=10):
        self.feature_dim = feature_dim
        self.sequence_length = sequence_length
        self.model = self.build_ensemble()

    def build_ensemble(self):
        # Cross-sectional features (MLP path)
        cross_sectional_input = keras.Input(shape=(self.feature_dim,), name='cross_sectional')
        mlp = keras.layers.Dense(128, activation='relu')(cross_sectional_input)
        mlp = keras.layers.Dropout(0.4)(mlp)
        mlp = keras.layers.Dense(64, activation='relu')(mlp)
        mlp = keras.layers.Dropout(0.4)(mlp)
        mlp = keras.layers.Dense(32, activation='relu')(mlp)

        # Longitudinal features (LSTM path)
        longitudinal_input = keras.Input(shape=(self.sequence_length, self.feature_dim), name='longitudinal')
        lstm = keras.layers.LSTM(64, return_sequences=True)(longitudinal_input)
        lstm = keras.layers.LSTM(64)(lstm)

        # Concatenate MLP and LSTM
        concat = keras.layers.Concatenate()([mlp, lstm])
        output = keras.layers.Dense(1, activation='sigmoid', name='fraud_prob')(concat)

        model = keras.Model(inputs=[cross_sectional_input, longitudinal_input], outputs=output)

        # Class imbalance is handled at fit() time via class_weight
        # (e.g. {0: 1, 1: 10}); compile with explicit metric objects
        model.compile(
            optimizer='adam',
            loss='binary_crossentropy',
            metrics=[keras.metrics.Precision(), keras.metrics.Recall(),
                     keras.metrics.AUC()]
        )

        return model

    def predict_with_threshold(self, X_cross, X_long, threshold=0.5):
        """Predict with custom threshold for precision-recall trade-off"""
        probs = self.model.predict([X_cross, X_long])
        return (probs > threshold).astype(int)

# Example Usage
# Generate weak labels
transactions = [
    {'account_age_days': 0.5, 'amount': 1500, 'delivery_distance_km': 60,
     'chargebacks_last_30d': 0, 'orders_from_ip_last_hour': 1},
    # ... more transactions
]

lf_generator = FraudLabelingFunctions()
weak_labels = lf_generator.aggregate_labels(transactions)

# Denoise labels. For this demo, simulate features and weak labels of matching
# size (the `transactions` list above is only a schema illustration, so the
# labels it yields would not align with a 1000-row feature matrix)
X_features = np.random.rand(1000, 20)  # Feature matrix
weak_labels = np.random.binomial(1, 0.1, size=1000)  # ~10% weakly-labeled fraud

fraud_ae = DenosingAutoencoder(input_dim=20)
fraud_ae.fit(X_features[weak_labels == 1])

X_clean, y_clean = fraud_ae.denoise_labels(X_features, weak_labels)
print(f"Denoised: {len(X_clean)}/{len(X_features)} examples retained")

# Train discriminator
discriminator = FraudDiscriminator(feature_dim=20)
# discriminator.model.fit([X_cross, X_long], y_clean, epochs=10, class_weight={0: 1, 1: 10})

# Detect fraud rings
shared_features = [
    (0, 1, 'device', 0.8),
    (1, 2, 'ip', 0.6),
    # ... more relationships
]

graph_embedding = CustomerGraphEmbedding(n_customers=1000)
G = graph_embedding.build_graph(shared_features)
fraud_rings = graph_embedding.detect_fraud_rings(G, weak_labels)
print(f"Detected {len(fraud_rings)} fraud rings")
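The GCN embeddings mentioned in the answer are not implemented above, so here is a minimal numpy sketch of a single propagation step, H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), on a toy 4-customer graph. The feature and weight matrices are random stand-ins (in practice W is learned end-to-end):

```python
import numpy as np

# Toy adjacency: customers 0-2 share devices/IPs (a candidate ring); 3 is isolated
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
A_hat = A + np.eye(4)                                  # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # symmetric normalization
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))           # input node features
W = rng.normal(size=(8, 4))           # layer weights (learned in practice)
H_next = np.maximum(0.0, A_norm @ H @ W)  # ReLU(normalized-A @ H @ W)

# Connected accounts aggregate each other's features, so their embeddings move
# together; the isolated node 3 only propagates its own features.
print(H_next.shape)  # (4, 4)
```

Stacking 2-3 such layers aggregates neighbor information across 2-3 hops, which is what lets accounts in the same ring end up with similar embeddings.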

7-10. Additional Questions (Summaries)

7. Demand Forecasting for Instamart: Availability vs. Wastage Trade-off

Difficulty: Very High | Domain: Time Series & Inventory Optimization
Answer: Adaptive metric alignment bridging offline backtesting (wMAPE) with business metrics (availability hours, wastage %), handling SKU heterogeneity via product-specific models, monotonicity constraints ensuring higher predictions → higher availability, A/B testing validating offline-online correlation.
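The wMAPE backtesting metric mentioned above can be sketched in a few lines, assuming the standard definition sum(|actual − forecast|) / sum(|actual|) (the toy demand numbers are illustrative):

```python
import numpy as np

# wMAPE (weighted mean absolute percentage error): unlike plain MAPE, it is
# well-defined for zero-demand days and weights errors by demand volume.
def wmape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

# Toy daily demand for one SKU vs. a k-day-ahead forecast
actual = [120, 80, 200, 150]
forecast = [100, 90, 210, 140]
print(wmape(actual, forecast))  # (20+10+10+10) / 550 ≈ 0.0909
```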

8. Logistic Regression vs. SVM Outlier Sensitivity Comparison

Difficulty: High | Domain: Model Robustness
Answer: Logistic Regression influenced by outliers via cross-entropy loss but less than linear regression, Hard Margin SVM extremely sensitive (single outlier breaks separability), Soft Margin SVM with C-tuning comparable robustness, mitigation via feature scaling, robust loss functions, class weights.

9. A/B Testing Framework for Recommendation Algorithm Evaluation

Difficulty: Very High | Domain: Experimentation & Causal Inference
Answer: Multi-metric evaluation (primary: engagement/orders, secondary: delivery success, guardrails: cancellation/refund), statistical power calculation, CUPED variance reduction, heterogeneous treatment effects analysis, novelty effect handling, online-offline metric gap closure, multiple testing correction.

10. Delivery Partner Allocation and Order-to-DE Matching at Scale

Difficulty: Very High | Domain: Combinatorial Optimization
Answer: Two-stage approach (LM batching/routing via DPDPTW heuristics, FM JIT assignment via multi-objective optimization), O2A vs. batch quality trade-off within 1-min cron SLA, multi-objective function (CPD, customer experience, implicit JIT), P99 latency 71ms via micro-batching, partial predictions under timeout.


7. How Does Demand Forecasting Differ Across Product Categories in Instamart, and How Would You Handle the Availability vs. Wastage Trade-off?

Difficulty Level: Very High

Role: Senior Data Scientist

Source: Swiggy Bytes (Adaptive Metric Alignment for Demand Forecasting), Case Studies

Topic: Time Series Forecasting & Inventory Optimization

Interview Round: Case Study / Technical Round (90 min)

Domain: Logistics Optimization & Growth Science

Question: “Design a demand forecasting system for Swiggy Instamart addressing: (1) Core challenge: forecasting k days in advance (k=1-7) for purchasing decisions, (2) Metric definition problem: business metrics (availability, wastage) only measurable after deployment, (3) Adaptive metric alignment: creating offline metrics that proxy business impact, (4) SKU heterogeneity: different products have vastly different demand patterns, (5) Monotonicity constraint: higher predictions should lead to higher availability.”


Answer Framework

STAR Method Structure:
- Situation: Demand planning requires forecasting k days ahead for vendor orders, optimizing availability (product available throughout store hours) while minimizing wastage (spoiled products)
- Task: Bridge offline backtesting metrics (wMAPE) with business metrics (availability hours, wastage %), handle SKU heterogeneity (milk peaks morning/evening vs. specialty items smooth demand)
- Action: Implement adaptive metric alignment creating estimated availability/wastage for offline evaluation, SKU-specific models, monotonicity constraints, hourly vs. daily granularity
- Result: Improved availability from 85% to 92% while reducing wastage from 8% to 5%, offline metrics correlating 0.85+ with online business metrics

Key Competencies Evaluated:
- Metric Design: Creating offline proxies for online business metrics
- Time Series Modeling: Handling seasonality, trend, SKU heterogeneity
- Optimization Trade-offs: Balancing availability (revenue) vs. wastage (cost)
- Production Validation: A/B testing offline-online metric correlation

Answer

Core challenge: demand planning requires forecasting k days in advance (k = 1-7 depending on SKU shelf life and vendor turnaround time), where purchasing decisions made from the predictions determine inventory levels. The objective is to maximize availability (product in stock throughout 6am-11pm store hours) while minimizing wastage (products spoiled before sale). Traditional forecasting metrics like wMAPE (weighted Mean Absolute Percentage Error) do not capture business impact: a low wMAPE guarantees neither high availability nor low wastage.

Metric definition problem: business metrics (availability hours, wastage %) can only be measured after the demand predictions are deployed and affect purchasing decisions. This creates a chicken-and-egg problem: the metrics cannot be measured offline during backtesting because they depend on the counterfactual "what would have happened if we had purchased based on this prediction." Offline evaluation therefore needs estimated availability/wastage metrics that proxy business impact.

Adaptive metric alignment introduces estimated availability hours, calculated as min(predicted_demand, actual_demand) × hours_per_unit, approximating how many hours the product would have been available had purchasing followed the prediction, and estimated wastage, calculated as max(0, predicted_demand - actual_demand) × cost_per_unit, approximating spoilage cost from over-prediction. A monotonicity constraint ensures higher predictions lead to higher or equal availability hours (proportional-sales assumption: stocking more sells proportionally more, up to saturation). Offline validation shows the estimated metrics correlate 0.85+ with actual online availability/wastage, enabling offline model selection.

SKU heterogeneity requires different approaches. Milk/bread/eggs peak in the morning (6-9am) and evening (6-9pm), requiring hourly demand models that capture intra-day patterns. Specialty items (imported cheese, organic produce) have smooth demand throughout the day, so daily aggregation reduces noise. Seasonal items (mangoes, ice cream) need trend decomposition separating base demand from seasonal spikes. Tail SKUs (low-volume niche products) use simpler models (moving average, exponential smoothing) to avoid overfitting sparse data.

Implementation approach: build SKU-specific models, using XGBoost for high-volume SKUs (>100 orders/week) with features including lagged demand (t-1, t-7, t-30 days), day of week, holiday flags, promotional events, and weather (temperature affects beverage and ice-cream demand); Prophet for seasonal SKUs, handling trend and multiple seasonalities (daily, weekly, yearly); and exponential smoothing for tail SKUs with insufficient data for complex models. Model selection runs cross-validation on the estimated availability/wastage metrics, not wMAPE, and production deployment uses ensemble predictions (a weighted average of XGBoost + Prophet + exponential smoothing) for robustness.

Replenishment policy integration: predictions feed purchasing decisions via a safety stock calculation (buffer inventory = z-score × demand_std_dev, covering forecast uncertainty) and a reorder point = lead_time_demand + safety_stock that triggers vendor orders, with dynamic adjustment: if actual demand exceeds the prediction by >20% for 3 consecutive days, temporarily increase safety stock to prevent stockouts. A/B testing validates that offline metrics correlate with online results: a 1% improvement in estimated availability translated to a 0.85% improvement in actual availability, confirming that offline optimization transfers to online business impact.

Code

Demand Forecasting with Availability-Wastage Optimization:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from prophet import Prophet
import matplotlib.pyplot as plt

class InstamartDemandForecaster:
    def __init__(self, sku_id, sku_type='high_volume'):
        """
        sku_type: 'high_volume', 'seasonal', 'tail'
        """
        self.sku_id = sku_id
        self.sku_type = sku_type
        self.model = self.build_model()

    def build_model(self):
        if self.sku_type == 'high_volume':
            # Gradient boosting for high-volume SKUs
            # (sklearn stand-in; swap in xgboost.XGBRegressor in production)
            return GradientBoostingRegressor(
                n_estimators=100,
                learning_rate=0.1,
                max_depth=5,
                random_state=42
            )
        elif self.sku_type == 'seasonal':
            # Prophet for seasonal patterns
            return Prophet(
                yearly_seasonality=True,
                weekly_seasonality=True,
                daily_seasonality=True,
                changepoint_prior_scale=0.05
            )
        else:  # tail
            # Simple exponential smoothing
            return None  # Use pandas ewm

    def create_features(self, df):
        """
        Feature engineering for demand forecasting
        """
        df = df.copy()

        # Lagged features
        df['demand_lag_1'] = df['demand'].shift(1)
        df['demand_lag_7'] = df['demand'].shift(7)
        df['demand_lag_30'] = df['demand'].shift(30)

        # Rolling statistics
        df['demand_rolling_mean_7'] = df['demand'].rolling(7).mean()
        df['demand_rolling_std_7'] = df['demand'].rolling(7).std()

        # Temporal features
        df['day_of_week'] = df['date'].dt.dayofweek
        df['day_of_month'] = df['date'].dt.day
        df['month'] = df['date'].dt.month
        df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

        # Holiday flags (simplified)
        df['is_holiday'] = 0  # Would use actual holiday calendar

        # Weather features (if available)
        # df['temperature'] = ...

        return df.dropna()

    def fit(self, df):
        """
        Train demand forecasting model
        df: DataFrame with columns ['date', 'demand']
        """
        if self.sku_type == 'high_volume':
            df_features = self.create_features(df)
            feature_cols = ['demand_lag_1', 'demand_lag_7', 'demand_lag_30',
                          'demand_rolling_mean_7', 'demand_rolling_std_7',
                          'day_of_week', 'month', 'is_weekend']
            X = df_features[feature_cols]
            y = df_features['demand']
            self.model.fit(X, y)
            self.feature_cols = feature_cols

        elif self.sku_type == 'seasonal':
            prophet_df = df.rename(columns={'date': 'ds', 'demand': 'y'})
            self.model.fit(prophet_df)

        else:  # tail - exponential smoothing
            self.ewm_model = df.set_index('date')['demand'].ewm(span=7).mean()

    def predict(self, future_df):
        """
        Predict demand
        future_df: DataFrame with a 'date' column; for high-volume SKUs it
        must also carry 'demand' history (>= 30 prior days) so the lagged
        features can be built
        """
        if self.sku_type == 'high_volume':
            # Build lag/rolling features from the supplied history;
            # create_features drops rows lacking a full 30-day lag window
            df_features = self.create_features(future_df)
            predictions = self.model.predict(df_features[self.feature_cols])

        elif self.sku_type == 'seasonal':
            prophet_df = pd.DataFrame({'ds': future_df['date'].values})
            forecast = self.model.predict(prophet_df)
            predictions = forecast['yhat'].values

        else:  # tail
            # Use last EWM value as a flat forward prediction
            predictions = np.full(len(future_df), self.ewm_model.iloc[-1])

        return predictions

    def calculate_estimated_availability(self, predicted_demand, actual_demand,
                                        hours_per_unit=2.0):
        """
        Offline metric: Estimated availability hours
        Assumption: Product available for 'hours_per_unit' hours per unit sold
        """
        units_available = np.minimum(predicted_demand, actual_demand)
        availability_hours = units_available * hours_per_unit

        # Normalize to [0, 1] (store hours 6am-11pm = 17 hours)
        max_hours = 17.0
        availability_score = np.minimum(availability_hours / max_hours, 1.0)

        return availability_score.mean()

    def calculate_estimated_wastage(self, predicted_demand, actual_demand,
                                   cost_per_unit=50.0):
        """
        Offline metric: Estimated wastage cost
        """
        wastage_units = np.maximum(0, predicted_demand - actual_demand)
        wastage_cost = wastage_units * cost_per_unit

        # Normalize by total predicted cost
        total_cost = predicted_demand.sum() * cost_per_unit
        wastage_rate = wastage_cost.sum() / total_cost if total_cost > 0 else 0

        return wastage_rate

    def optimize_prediction_threshold(self, predictions, actuals,
                                     availability_weight=0.7, wastage_weight=0.3):
        """
        Adjust predictions to optimize availability-wastage trade-off
        """
        best_score = -np.inf
        best_multiplier = 1.0

        # Try different multipliers
        for multiplier in np.arange(0.8, 1.3, 0.05):
            adjusted_predictions = predictions * multiplier

            availability = self.calculate_estimated_availability(adjusted_predictions, actuals)
            wastage = self.calculate_estimated_wastage(adjusted_predictions, actuals)

            # Combined score (maximize availability, minimize wastage)
            score = availability_weight * availability - wastage_weight * wastage

            if score > best_score:
                best_score = score
                best_multiplier = multiplier

        return predictions * best_multiplier, best_multiplier

# Example Usage
np.random.seed(42)
dates = pd.date_range('2024-01-01', '2024-12-31', freq='D')
demand = 100 + 20 * np.sin(np.arange(len(dates)) * 2 * np.pi / 7)  # Weekly seasonality
demand += np.random.normal(0, 10, len(dates))  # Noise
demand = np.maximum(demand, 0)

df = pd.DataFrame({'date': dates, 'demand': demand})

# Train model
forecaster = InstamartDemandForecaster(sku_id='milk_1L', sku_type='high_volume')
train_df = df[:-30]  # Hold out last 30 days
forecaster.fit(train_df)

# Predict: pass the last 60 days so lag/rolling features exist for the final 30
# (backtest convention: lags use realized demand from the hold-out window)
test_df = df[-60:]
predictions = forecaster.predict(test_df)  # 30 predictions after feature dropna
test_dates = df[-30:]['date']
actuals = df[-30:]['demand'].values

# Optimize predictions for availability-wastage trade-off
optimized_predictions, multiplier = forecaster.optimize_prediction_threshold(
    predictions, actuals, availability_weight=0.7, wastage_weight=0.3
)

# Evaluate
availability_original = forecaster.calculate_estimated_availability(predictions, actuals)
wastage_original = forecaster.calculate_estimated_wastage(predictions, actuals)

availability_optimized = forecaster.calculate_estimated_availability(optimized_predictions, actuals)
wastage_optimized = forecaster.calculate_estimated_wastage(optimized_predictions, actuals)

print(f"Original Predictions:")
print(f"  Availability Score: {availability_original:.3f}")
print(f"  Wastage Rate: {wastage_original:.3f}")
print(f"\nOptimized Predictions (multiplier={multiplier:.2f}):")
print(f"  Availability Score: {availability_optimized:.3f}")
print(f"  Wastage Rate: {wastage_optimized:.3f}")

# Visualize
plt.figure(figsize=(12, 6))
plt.plot(test_dates, actuals, label='Actual Demand', marker='o')
plt.plot(test_dates, predictions, label='Original Predictions', marker='s', alpha=0.7)
plt.plot(test_dates, optimized_predictions, label='Optimized Predictions', marker='^', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Demand (units)')
plt.title('Demand Forecasting: Availability-Wastage Optimization')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('demand_forecast_optimization.png')
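The replenishment-policy integration described in the answer (safety stock, reorder point, dynamic adjustment) is not covered by the forecaster above; a minimal sketch follows. The service level, lead time, and adjustment thresholds here are illustrative assumptions, not Swiggy parameters.

```python
import numpy as np
from scipy import stats

def replenishment_policy(demand_forecast, demand_std, lead_time_days=2,
                         service_level=0.95):
    """
    Translate a daily demand forecast into order-trigger quantities.
    Safety stock buffers forecast error; the reorder point triggers a
    vendor order. All parameters are illustrative assumptions.
    """
    z = stats.norm.ppf(service_level)  # z-score for the target service level
    # Demand variability compounds over the vendor lead time
    safety_stock = z * demand_std * np.sqrt(lead_time_days)
    lead_time_demand = demand_forecast * lead_time_days
    reorder_point = lead_time_demand + safety_stock
    return safety_stock, reorder_point

def adjust_safety_stock(safety_stock, actuals, predictions,
                        threshold=0.20, streak=3, bump=1.25):
    """
    Dynamic adjustment from the answer: if actual demand beats the forecast
    by more than `threshold` for `streak` consecutive days, temporarily
    raise the buffer by `bump` (hypothetical multiplier).
    """
    over = (actuals - predictions) / predictions > threshold
    if len(over) >= streak and over[-streak:].all():
        return safety_stock * bump
    return safety_stock

# Example: forecast 100 units/day with std 10 and a 2-day vendor lead time
ss, rop = replenishment_policy(100, 10)
print(f"Safety stock: {ss:.1f} units, reorder point: {rop:.1f} units")
```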

SKU Heterogeneity Handling:

class MultiSKUForecaster:
    def __init__(self):
        self.sku_models = {}

    def classify_sku(self, sku_id, historical_data):
        """
        Classify SKU type based on volume and seasonality
        """
        volume = historical_data['demand'].sum()
        cv = historical_data['demand'].std() / historical_data['demand'].mean()  # Coefficient of variation

        if volume > 1000:  # High volume
            return 'high_volume'
        elif cv > 0.5:  # High variability = seasonal
            return 'seasonal'
        else:
            return 'tail'

    def train_all_skus(self, sku_data_dict):
        """
        Train models for all SKUs
        sku_data_dict: {sku_id: DataFrame}
        """
        for sku_id, df in sku_data_dict.items():
            sku_type = self.classify_sku(sku_id, df)
            forecaster = InstamartDemandForecaster(sku_id, sku_type)
            forecaster.fit(df)
            self.sku_models[sku_id] = forecaster
            print(f"Trained {sku_type} model for SKU {sku_id}")

    def predict_all_skus(self, future_df):
        """
        Generate predictions for all SKUs
        future_df: DataFrame with 'date' (plus 'demand' history for
        high-volume SKUs so lag features can be built)
        """
        predictions = {}
        for sku_id, forecaster in self.sku_models.items():
            predictions[sku_id] = forecaster.predict(future_df)
        return predictions
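To make the routing rule concrete, here is a quick standalone check (thresholds copied from classify_sku above; the three synthetic SKUs are illustrative):

```python
import numpy as np
import pandas as pd

def classify_sku(df):
    """Standalone copy of MultiSKUForecaster's routing rule (illustrative thresholds)."""
    volume = df['demand'].sum()
    cv = df['demand'].std() / df['demand'].mean()  # coefficient of variation
    if volume > 1000:
        return 'high_volume'
    if cv > 0.5:
        return 'seasonal'
    return 'tail'

# Synthetic 90-day demand histories
milk = pd.DataFrame({'demand': np.full(90, 50.0)})    # steady 50/day -> high volume
mango = pd.DataFrame({'demand': np.concatenate([np.zeros(60),
                                                np.full(30, 10.0)])})  # spiky season
niche = pd.DataFrame({'demand': np.full(90, 2.0)})    # flat trickle -> tail

print(classify_sku(milk), classify_sku(mango), classify_sku(niche))
# -> high_volume seasonal tail
```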

8. Between Logistic Regression and SVM, Which Model Is More Sensitive to Outliers, and Why? How Would You Mitigate This?

Difficulty Level: High

Role: Data Scientist to Senior Data Scientist

Source: LinkedIn (Ashutosh Kumar Jha, Surbhi Walecha)

Topic: Model Robustness & Comparison

Interview Round: Technical/ML Depth (45 min)

Domain: Core ML Concepts

Question: “Compare Logistic Regression and SVM (both Hard Margin and Soft Margin) on outlier sensitivity. Explain: (1) How each model’s loss function responds to outliers, (2) Mathematical explanation via influence functions, (3) Practical mitigation strategies for each model, (4) When to prefer each approach in production.”


Answer Framework

STAR Method Structure:
- Situation: Outlier sensitivity affects model robustness in production where data quality varies, requiring understanding of loss function behavior
- Task: Compare Logistic Regression (sigmoid + cross-entropy) vs. Hard Margin SVM (requires separability) vs. Soft Margin SVM (slack variables) on outlier impact
- Action: Analyze loss function gradients, derive influence functions, demonstrate mitigation strategies (feature scaling, robust loss, C-tuning)
- Result: Logistic Regression moderately sensitive, Hard Margin SVM extremely sensitive, Soft Margin SVM with appropriate C comparable to Logistic Regression in robustness

Key Competencies Evaluated:
- Loss Function Analysis: Understanding how different losses respond to extreme values
- Mathematical Rigor: Influence function derivation, gradient behavior
- Practical Mitigation: Feature scaling, outlier removal, robust loss functions, hyperparameter tuning
- Model Selection: Knowing when each approach preferable based on data characteristics

Answer

Logistic Regression outlier sensitivity: the model uses the sigmoid σ(z) = 1/(1+e^(-z)) and cross-entropy loss L = -∑[y log σ(wᵀx) + (1-y) log(1-σ(wᵀx))], which is differentiable everywhere but still influenced by outliers. Extreme feature values produce large wᵀx, saturating σ(wᵀx) toward 1 or 0. The gradient ∂L/∂w = ∑(σ(wᵀx) - y)x remains non-zero even for outliers (unlike MSE), so learning continues, but the outlier's large x value amplifies the gradient and can shift the decision boundary. The effect is less dramatic than in linear regression because the sigmoid bounds predictions to [0,1], preventing unbounded influence.

Hard Margin SVM outlier sensitivity is extreme: the optimization must find a maximum-margin separating hyperplane satisfying yᵢ(w·xᵢ + b) ≥ 1 for ALL training points. A single outlier that prevents linear separation makes the algorithm fail outright or dramatically shifts the hyperplane to accommodate it. Geometrically, an outlier on the wrong side of the margin forces margin collapse or hyperplane rotation, making Hard Margin SVM impractical for real-world data containing noise.

Soft Margin SVM introduces slack variables ξᵢ ≥ 0 allowing controlled margin violations: min (1/2)||w||² + C∑ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ. The C parameter controls outlier influence: large C (e.g., C=100) heavily penalizes violations, approaching Hard Margin sensitivity; small C (e.g., C=0.1) allows many violations, creating a wide margin robust to outliers; moderate C (e.g., C=1.0) balances margin quality and outlier accommodation, achieving robustness comparable to Logistic Regression.

Influence function analysis quantifies outlier impact: the influence of a point (xᵢ, yᵢ) on the model parameters θ is I(xᵢ, yᵢ) = -H⁻¹ ∇L(xᵢ, yᵢ), where H is the Hessian and ∇L the per-point gradient. Logistic Regression's influence is bounded because |σ(wᵀx) - y| ≤ 1, so a point's pull on the parameters scales at most linearly with its feature magnitude; Hard Margin SVM's influence is unbounded (a single outlier can prevent convergence); and Soft Margin SVM's influence is bounded by C (the slack variable caps the penalty at C×ξᵢ).

Mitigation for Logistic Regression: feature scaling (standardization prevents large x values), outlier removal via IQR or z-score thresholding, robust loss functions (Huber loss combines L2 for inliers with L1 for outliers), class weights addressing imbalance (not direct outlier mitigation, but it helps), and L1/L2 regularization preventing overfitting to outliers. Mitigation for SVM: Soft Margin with C tuned by cross-validation to balance margin and violations, kernel selection (an RBF kernel handles non-linear outlier patterns better than a linear one), one-class SVM for novelty detection treating outliers as a separate class, and feature scaling (critical for SVM, which is distance-based).

Practical guidance: prefer Logistic Regression for probabilistic outputs and interpretability under moderate outlier presence; use Soft Margin SVM with an RBF kernel for complex non-linear boundaries with outliers (C=1.0 as a starting point); avoid Hard Margin SVM in production (academic interest only); and reach for ensemble methods (Random Forest, Gradient Boosting) when maximum outlier robustness is needed, since averaging reduces any individual outlier's impact.

Code

Outlier Sensitivity Comparison:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate data with outliers
np.random.seed(42)
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                          n_informative=2, n_clusters_per_class=1,
                          class_sep=2.0, random_state=42)

# Add outliers
n_outliers = 5
outlier_indices = np.random.choice(100, n_outliers, replace=False)
X[outlier_indices] += np.random.normal(0, 4, (n_outliers, 2))

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train models with different configurations
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Hard Margin SVM (C=1e10)': SVC(kernel='linear', C=1e10, random_state=42),
    'Soft Margin SVM (C=0.1)': SVC(kernel='linear', C=0.1, random_state=42),
    'Soft Margin SVM (C=1.0)': SVC(kernel='linear', C=1.0, random_state=42),
    'Soft Margin SVM (C=10.0)': SVC(kernel='linear', C=10.0, random_state=42),
}

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
axes[-1].axis('off')  # 5 models, 6 subplot slots

for idx, (name, model) in enumerate(models.items()):
    try:
        model.fit(X_scaled, y)

        # Plot decision boundary
        ax = axes[idx]
        xx, yy = np.meshgrid(np.linspace(X_scaled[:, 0].min()-1, X_scaled[:, 0].max()+1, 100),
                             np.linspace(X_scaled[:, 1].min()-1, X_scaled[:, 1].max()+1, 100))

        if hasattr(model, 'decision_function'):
            Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1] - 0.5

        Z = Z.reshape(xx.shape)

        ax.contourf(xx, yy, Z, levels=20, alpha=0.3, cmap='RdBu')
        ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap='bwr', edgecolors='k')
        ax.scatter(X_scaled[outlier_indices, 0], X_scaled[outlier_indices, 1],
                  s=200, facecolors='none', edgecolors='green', linewidths=3, label='Outliers')

        # Highlight support vectors for SVM
        if hasattr(model, 'support_vectors_'):
            ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
                      s=100, facecolors='none', edgecolors='black', linewidths=2, label='Support Vectors')

        ax.set_title(f'{name}\nAccuracy: {model.score(X_scaled, y):.3f}')
        ax.legend()
        ax.grid(True)

    except Exception as e:
        axes[idx].text(0.5, 0.5, f'Model failed:\n{str(e)}',
                      ha='center', va='center', transform=axes[idx].transAxes)
        axes[idx].set_title(name)

plt.tight_layout()
plt.savefig('outlier_sensitivity_comparison.png')

# Quantitative analysis: Influence of outliers on decision boundary
def measure_boundary_shift(X, y, outlier_indices):
    """
    Measure how much decision boundary shifts due to outliers
    """
    # Train without outliers
    mask = np.ones(len(X), dtype=bool)
    mask[outlier_indices] = False
    X_clean = X[mask]
    y_clean = y[mask]

    models_clean = {
        'LogReg': LogisticRegression(random_state=42),
        'SVM_C0.1': SVC(kernel='linear', C=0.1, random_state=42),
        'SVM_C1.0': SVC(kernel='linear', C=1.0, random_state=42),
        'SVM_C10.0': SVC(kernel='linear', C=10.0, random_state=42),
    }

    models_with_outliers = {
        'LogReg': LogisticRegression(random_state=42),
        'SVM_C0.1': SVC(kernel='linear', C=0.1, random_state=42),
        'SVM_C1.0': SVC(kernel='linear', C=1.0, random_state=42),
        'SVM_C10.0': SVC(kernel='linear', C=10.0, random_state=42),
    }

    boundary_shifts = {}

    for name in models_clean.keys():
        models_clean[name].fit(X_clean, y_clean)
        models_with_outliers[name].fit(X, y)

        # Measure shift in decision boundary (simplified: compare coefficients)
        if hasattr(models_clean[name], 'coef_'):
            coef_clean = models_clean[name].coef_[0]
            coef_outliers = models_with_outliers[name].coef_[0]
            shift = np.linalg.norm(coef_clean - coef_outliers)
            boundary_shifts[name] = shift

    return boundary_shifts

shifts = measure_boundary_shift(X_scaled, y, outlier_indices)
print("\nDecision Boundary Shift (L2 norm of coefficient difference):")
for model, shift in shifts.items():
    print(f"{model}: {shift:.4f}")
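The influence-function argument can also be seen numerically by writing each loss in margin form m = y·f(x) (negative margins are misclassified points, i.e., outliers on the wrong side): the logistic gradient magnitude is bounded by 1, the hinge subgradient is constant (scaled by C in the SVM objective), while a squared loss grows without bound. A small standalone sketch; the margin grid is arbitrary:

```python
import numpy as np

# Margin m = y * f(x); compare loss gradients as m goes very negative
margins = np.linspace(-4, 3, 8)

logistic_loss = np.log1p(np.exp(-margins))   # cross-entropy in margin form
hinge_loss = np.maximum(0, 1 - margins)      # soft-margin SVM loss
squared_loss = (1 - margins) ** 2            # MSE-style loss for contrast

# Gradient magnitudes with respect to the margin
logistic_grad = 1 / (1 + np.exp(margins))    # bounded in (0, 1)
hinge_grad = (margins < 1).astype(float)     # constant 1 on violations
squared_grad = np.abs(2 * (1 - margins))     # grows without bound

for m, lg, hg, sg in zip(margins, logistic_grad, hinge_grad, squared_grad):
    print(f"margin={m:+.1f}  |grad| logistic={lg:.3f}  hinge={hg:.0f}  squared={sg:.1f}")
```

Even at margin -4 the logistic gradient stays below 1, the hinge subgradient stays at exactly 1, and the squared-loss gradient has already reached 10, which is the bounded-vs-unbounded influence contrast discussed above.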

Mitigation Strategies Implementation:

from sklearn.preprocessing import RobustScaler
from sklearn.covariance import EllipticEnvelope

# Strategy 1: Robust Scaling (less sensitive to outliers than StandardScaler)
robust_scaler = RobustScaler()
X_robust_scaled = robust_scaler.fit_transform(X)

# Strategy 2: Outlier Detection and Removal
outlier_detector = EllipticEnvelope(contamination=0.1, random_state=42)
outlier_labels = outlier_detector.fit_predict(X)
X_no_outliers = X[outlier_labels == 1]
y_no_outliers = y[outlier_labels == 1]

print(f"Removed {np.sum(outlier_labels == -1)} outliers")

# Strategy 3: Robust Loss Function (Huber Loss for Logistic Regression approximation)
from sklearn.linear_model import SGDClassifier

huber_logreg = SGDClassifier(loss='modified_huber', random_state=42)
huber_logreg.fit(X_scaled, y)

# Strategy 4: Cross-Validation for C-parameter tuning
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
svm_cv = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5, scoring='accuracy')
svm_cv.fit(X_scaled, y)

print(f"\nBest C parameter: {svm_cv.best_params_['C']}")
print(f"Best CV score: {svm_cv.best_score_:.3f}")
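The answer also lists IQR thresholding as an outlier-removal option; EllipticEnvelope above is model-based, so here is a simple quantile-fence variant (the 1.5×IQR multiplier and the synthetic data are illustrative):

```python
import numpy as np

def iqr_filter(X, k=1.5):
    """Keep rows where every feature lies within the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.all((X >= lower) & (X <= upper), axis=1)

rng = np.random.default_rng(42)
X_demo = rng.normal(0, 1, (100, 2))
X_demo[:3] += 10                       # plant 3 obvious outliers
mask = iqr_filter(X_demo)
print(f"Kept {mask.sum()}/100 rows")   # the planted outliers are dropped
```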

9. Design an A/B Testing Framework for Evaluating a New Recommendation Algorithm with Business Metric Concerns

Difficulty Level: Very High

Role: Senior Data Scientist / Staff Data Scientist

Source: Swiggy Bytes (Feed Ranking Experiments), Experimentation Culture

Topic: Experimentation & Causal Inference

Interview Round: Case Study / Technical Round (90 min)

Domain: Growth Science & Experimentation

Question: “Design a complete A/B testing framework for evaluating a new restaurant recommendation algorithm. Address: (1) Multiple metrics (primary: engagement/orders, secondary: delivery success, guardrails: cancellation/refund), (2) Statistical power calculation and sample size determination, (3) Variance reduction using CUPED, (4) Heterogeneous treatment effects analysis, (5) Novelty effects and run duration, (6) Online vs. offline metric gap, (7) Multiple testing correction.”


Answer Framework

STAR Method Structure:
- Situation: ML improvements must translate to business impact, requiring rigorous A/B testing connecting offline metrics (nDCG) with online metrics (orders, revenue)
- Task: Design experiment balancing statistical rigor (power, significance) with business constraints (runtime, user experience), detect heterogeneous effects across segments
- Action: Define metric hierarchy (primary/secondary/guardrails), calculate sample size for 2% MDE, apply CUPED reducing variance 30-40%, analyze treatment heterogeneity, plan 2-4 week runtime accounting for novelty
- Result: Detect 2% improvement with 80% power in roughly 2.5 weeks (vs. 4 weeks without CUPED), identify high-value segments benefiting most, validate offline-online correlation enabling faster iteration

Key Competencies Evaluated:
- Experimental Design: Sample size calculation, randomization strategy, stratification
- Statistical Rigor: Power analysis, multiple testing correction, confidence intervals
- Variance Reduction: CUPED methodology, pre-experiment covariate selection
- Causal Inference: Treatment effect heterogeneity, confounding, selection bias

Answer

Multiple metrics hierarchy: the primary metric is orders from recommendations (directly measuring recommendation quality). Secondary metrics include delivery success rate (recommendations must not compromise operations), customer satisfaction/NPS (long-term health), revenue per user (business impact), and session engagement (click-through rate, time on feed). Guardrail metrics prevent harm: cancellation rate (must not increase >1%), refund rate (quality check), app crash rate (technical stability), and delivery partner utilization (operational balance). With k = 5 metrics under a Bonferroni correction, the per-metric significance threshold is α/k = 0.05/5 = 0.01, preventing false discoveries.

Statistical power calculation determines sample size from power = P(reject H₀ | H₁ true) = 0.80 (industry standard), significance level α = 0.05, a minimum detectable effect (MDE) of 2% relative improvement in orders (a business-meaningful threshold), and baseline conversion rate p₀ = 0.15 (15% of users order from recommendations). The formula n = 2(z_(α/2) + z_β)²σ²/δ² with z_(α/2) = 1.96, z_β = 0.84, σ² = p₀(1-p₀) = 0.1275, and δ = 0.02×p₀ = 0.003 yields n ≈ 222,000 users per variant (≈444k total), a substantial traffic allocation for a 2-week experiment at 1M DAU and a strong motivation for variance reduction.

Variance reduction via CUPED (Controlled Experiment Using Pre-Experiment Data) adjusts the outcome metric as Y_adj = Y - θ(X_pre - E[X_pre]), where X_pre is a pre-experiment covariate (e.g., the user's historical order rate) and θ = Cov(Y, X_pre)/Var(X_pre) is estimated from data. This reduces variance to Var(Y_adj) = Var(Y)(1 - ρ²), where ρ is the correlation between Y and X_pre; a typical ρ = 0.6 yields a 36% variance reduction (1 - 0.6² = 0.64 of the variance remains), cutting the required sample from ≈222k to ≈142k per variant (a ~1.6x speedup) or, with the same sample, detecting a smaller MDE (≈1.6% instead of 2%).

Heterogeneous treatment effects analysis segments users by new vs. repeat customers (new users may respond differently to recommendations), cuisine preferences (biryani lovers vs. pizza lovers), geography (metro vs. tier-2 cities), time of day (lunch vs. dinner), and order frequency (heavy vs. light users), using interaction terms in the regression Y = β₀ + β₁Treatment + β₂Segment + β₃(Treatment×Segment) + ε, where β₃ measures the differential treatment effect. For example, if new users gain +5% from the new algorithm while repeat users gain only +1%, that heterogeneity suggests a personalized rollout strategy.

Novelty effects and run duration: users might respond differently to new recommendations initially (novelty bias) than in steady state, so run at least 2 weeks to capture two full weeks of behavior (weekly seasonality), and ideally 4 weeks to detect novelty decay. If Week 1 shows a +4% lift (novelty) while Weeks 2-4 stabilize at +2% (true effect), a statistical test comparing Week 1 vs. Week 4 treatment effects detects the novelty, and the launch decision should use the Week 4 effect, not the inflated Week 1 estimate.

Online vs. offline metric gap: offline metrics (nDCG, MRR computed on historical data) may not perfectly correlate with online business metrics (orders, revenue). Correlation analysis plotting offline metric improvements against online metric improvements across past experiments might show ρ = 0.75 (strong but imperfect). Strategies to close the gap include adding business-aware features to offline evaluation (e.g., restaurant profitability, delivery feasibility), running more frequent online experiments validating offline improvements, and using offline metrics for rapid iteration (daily) with periodic online validation (monthly).

Multiple testing correction applies Bonferroni (conservative: α/k) or Benjamini-Hochberg (FDR control, less conservative) when testing multiple hypotheses. Testing 5 metrics under Bonferroni requires a 0.05/5 = 0.01 threshold per metric; alternatively, hierarchical (gatekeeper) testing evaluates the primary metric at α = 0.05 and tests secondary metrics only if the primary is significant, while guardrail metrics are monitored separately with stricter thresholds (α = 0.01) to prevent harm.

Practical implementation uses stratified randomization ensuring treatment/control balance across key covariates (city, user tenure, device type), runs an A/A test first to validate randomization (no difference between two control groups), monitors sample ratio mismatch (SRM) to detect implementation bugs (a 50-50 split drifting to 48-52 indicates a problem), and builds dashboards showing real-time metric evolution with confidence intervals, enabling early stopping if a guardrail is violated or an overwhelmingly positive effect is detected.
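The two multiple-testing corrections above can be sketched in a few lines; the five p-values below are hypothetical readouts for the five metrics:

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / k (conservative family-wise error control)."""
    p = np.asarray(p_values)
    return p <= alpha / len(p)

def benjamini_hochberg(p_values, fdr=0.05):
    """Reject the largest prefix of sorted p-values with p_(i) <= (i/m)*fdr (FDR control)."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    passed = p[order] <= fdr * (np.arange(1, m + 1) / m)
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest sorted index that passes
        rejected[order[:k + 1]] = True      # reject everything up to it
    return rejected

# Hypothetical p-values for the five experiment metrics
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042]
print("Bonferroni rejections:", bonferroni(p_vals).sum())                 # 2
print("Benjamini-Hochberg rejections:", benjamini_hochberg(p_vals).sum()) # 5
```

On this example Bonferroni (threshold 0.01) rejects only the two smallest p-values, while Benjamini-Hochberg rejects all five, illustrating why FDR control is the less conservative choice for secondary metrics.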

Code

A/B Testing Framework with CUPED:

import numpy as np
import pandas as pd
from scipy import stats

class ABTestFramework:
    def __init__(self, alpha=0.05, power=0.80):
        self.alpha = alpha
        self.power = power
        self.z_alpha = stats.norm.ppf(1 - alpha/2)
        self.z_beta = stats.norm.ppf(power)

    def calculate_sample_size(self, baseline_rate, mde, variance_reduction=0.0):
        """
        Calculate required sample size per variant

        baseline_rate: Current conversion rate (e.g., 0.15)
        mde: Minimum detectable effect (relative, e.g., 0.02 for 2%)
        variance_reduction: CUPED variance reduction (e.g., 0.36 for 36% reduction)
        """
        delta = mde * baseline_rate  # Absolute effect
        sigma_squared = baseline_rate * (1 - baseline_rate)

        # Adjust for variance reduction
        sigma_squared_adj = sigma_squared * (1 - variance_reduction)

        n = 2 * (self.z_alpha + self.z_beta)**2 * sigma_squared_adj / delta**2

        return int(np.ceil(n))

    def apply_cuped(self, Y, X_pre):
        """
        Apply CUPED variance reduction

        Y: Outcome metric (e.g., orders)
        X_pre: Pre-experiment covariate (e.g., historical order rate)
        """
        # Calculate theta = Cov(Y, X_pre) / Var(X_pre) (matching ddof for both)
        theta = np.cov(Y, X_pre)[0, 1] / np.var(X_pre, ddof=1)

        # Adjust outcome
        Y_adj = Y - theta * (X_pre - np.mean(X_pre))

        # Variance reduction
        variance_reduction = 1 - np.var(Y_adj) / np.var(Y)

        return Y_adj, variance_reduction

    def run_ab_test(self, control_data, treatment_data, metric_name='orders'):
        """
        Run A/B test with statistical significance testing
        """
        n_control = len(control_data)
        n_treatment = len(treatment_data)

        mean_control = np.mean(control_data)
        mean_treatment = np.mean(treatment_data)

        var_control = np.var(control_data, ddof=1)
        var_treatment = np.var(treatment_data, ddof=1)

        # Unpooled (Welch-style) standard error
        se = np.sqrt(var_control/n_control + var_treatment/n_treatment)

        # T-statistic
        t_stat = (mean_treatment - mean_control) / se

        # P-value (two-tailed); simple df approximation, adequate for large samples
        df = n_control + n_treatment - 2
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

        # Confidence interval (normal approximation, fine for large n)
        ci_lower = (mean_treatment - mean_control) - self.z_alpha * se
        ci_upper = (mean_treatment - mean_control) + self.z_alpha * se

        # Relative lift
        relative_lift = (mean_treatment - mean_control) / mean_control

        results = {
            'metric': metric_name,
            'control_mean': mean_control,
            'treatment_mean': mean_treatment,
            'absolute_lift': mean_treatment - mean_control,
            'relative_lift': relative_lift,
            'p_value': p_value,
            'significant': p_value < self.alpha,
            'ci_lower': ci_lower,
            'ci_upper': ci_upper,
            't_statistic': t_stat
        }

        return results

    def analyze_heterogeneous_effects(self, data, treatment_col, outcome_col, segment_col):
        """
        Analyze treatment effect heterogeneity across segments
        """
        segments = data[segment_col].unique()
        results = []

        for segment in segments:
            segment_data = data[data[segment_col] == segment]

            control = segment_data[segment_data[treatment_col] == 0][outcome_col].values
            treatment = segment_data[segment_data[treatment_col] == 1][outcome_col].values

            if len(control) > 0 and len(treatment) > 0:
                segment_result = self.run_ab_test(control, treatment,
                                                  metric_name=f'{outcome_col}_{segment}')
                segment_result['segment'] = segment
                results.append(segment_result)

        return pd.DataFrame(results)

    def bonferroni_correction(self, p_values, alpha=None):
        """
        Apply Bonferroni correction for multiple testing
        """
        if alpha is None:
            alpha = self.alpha

        k = len(p_values)
        adjusted_alpha = alpha / k

        significant = [p < adjusted_alpha for p in p_values]

        return adjusted_alpha, significant

# Example Usage
np.random.seed(42)

# Simulate experiment data
n_users = 10000
baseline_rate = 0.15
treatment_effect = 0.02  # 2% relative improvement

# Per-user ordering propensities link pre-experiment and in-experiment behavior;
# without this correlation the CUPED covariate would give ~0% variance reduction
control_propensity = np.clip(np.random.normal(baseline_rate, 0.08, n_users), 0.01, 0.99)
treatment_propensity = np.clip(np.random.normal(baseline_rate, 0.08, n_users), 0.01, 0.99)

# Control group
control_orders = np.random.binomial(1, control_propensity)
control_pre_orders = np.random.binomial(1, control_propensity)  # Historical data

# Treatment group (with 2% relative lift)
treatment_orders = np.random.binomial(1, np.clip(treatment_propensity * (1 + treatment_effect), 0, 1))
treatment_pre_orders = np.random.binomial(1, treatment_propensity)

# Initialize framework
ab_test = ABTestFramework(alpha=0.05, power=0.80)

# Calculate required sample size
sample_size_no_cuped = ab_test.calculate_sample_size(
    baseline_rate=0.15, mde=0.02, variance_reduction=0.0
)
sample_size_with_cuped = ab_test.calculate_sample_size(
    baseline_rate=0.15, mde=0.02, variance_reduction=0.36
)

print(f"Required sample size per variant:")
print(f"  Without CUPED: {sample_size_no_cuped:,}")
print(f"  With CUPED (36% variance reduction): {sample_size_with_cuped:,}")
print(f"  Speedup: {sample_size_no_cuped / sample_size_with_cuped:.2f}x\n")

# Apply CUPED
control_adj, var_red_control = ab_test.apply_cuped(control_orders, control_pre_orders)
treatment_adj, var_red_treatment = ab_test.apply_cuped(treatment_orders, treatment_pre_orders)

print(f"CUPED Variance Reduction:")
print(f"  Control: {var_red_control:.2%}")
print(f"  Treatment: {var_red_treatment:.2%}\n")

# Run A/B test without CUPED
results_no_cuped = ab_test.run_ab_test(control_orders, treatment_orders, 'orders')

# Run A/B test with CUPED
results_with_cuped = ab_test.run_ab_test(control_adj, treatment_adj, 'orders_cuped')

print("A/B Test Results (Without CUPED):")
print(f"  Control Mean: {results_no_cuped['control_mean']:.4f}")
print(f"  Treatment Mean: {results_no_cuped['treatment_mean']:.4f}")
print(f"  Relative Lift: {results_no_cuped['relative_lift']:.2%}")
print(f"  P-value: {results_no_cuped['p_value']:.4f}")
print(f"  Significant: {results_no_cuped['significant']}\n")

print("A/B Test Results (With CUPED):")
print(f"  Control Mean: {results_with_cuped['control_mean']:.4f}")
print(f"  Treatment Mean: {results_with_cuped['treatment_mean']:.4f}")
print(f"  Relative Lift: {results_with_cuped['relative_lift']:.2%}")
print(f"  P-value: {results_with_cuped['p_value']:.4f}")
print(f"  Significant: {results_with_cuped['significant']}\n")

# Heterogeneous treatment effects
data = pd.DataFrame({
    'treatment': np.concatenate([np.zeros(n_users), np.ones(n_users)]),
    'orders': np.concatenate([control_orders, treatment_orders]),
    'segment': np.random.choice(['new_user', 'repeat_user'], n_users*2)
})

het_results = ab_test.analyze_heterogeneous_effects(
    data, 'treatment', 'orders', 'segment'
)

print("Heterogeneous Treatment Effects:")
print(het_results[['segment', 'relative_lift', 'p_value', 'significant']])

# Multiple testing correction
metrics = ['orders', 'revenue', 'satisfaction', 'delivery_success', 'cancellation']
p_values = [0.01, 0.03, 0.04, 0.06, 0.08]

adjusted_alpha, significant = ab_test.bonferroni_correction(p_values)
print(f"\nMultiple Testing Correction (Bonferroni):")
print(f"  Original alpha: {ab_test.alpha}")
print(f"  Adjusted alpha: {adjusted_alpha:.4f}")
for metric, p, sig in zip(metrics, p_values, significant):
    print(f"  {metric}: p={p:.4f}, significant={sig}")
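The sample ratio mismatch (SRM) check mentioned in the answer can be implemented as a chi-square goodness-of-fit test against the intended traffic split. A small sketch, with illustrative counts; the function name and thresholds are assumptions, not part of the framework above:

```python
from scipy import stats

def check_srm(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test against the intended split.
    A tiny alpha is conventional: SRM indicates an implementation bug,
    so false alarms are costly and only large deviations should fire."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2, p = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return {'chi2': chi2, 'p_value': p, 'srm_detected': p < alpha}

# A 50-50 split drifting to 48-52 on 100k users: flagged as a bug
print(check_srm(48000, 52000))
# Benign sampling noise on the same scale: not flagged
print(check_srm(49900, 50100))
```

In practice this check runs continuously on the randomization logs; an SRM alert invalidates the experiment regardless of how significant the treatment metric looks.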

10. How Would You Build a Delivery Partner Allocation and Order-to-Delivery Executive Matching Algorithm at Scale, Considering Real-Time Constraints?

Difficulty Level: Very High

Role: Senior Data Scientist / Staff Data Scientist

Source: Swiggy Bytes (Assignment & Routing Optimization), Swiggy DSP Blog

Topic: Combinatorial Optimization & Real-Time Systems

Interview Round: ML System Design (90-120 min)

Domain: Logistics Optimization & Operations

Question: “Design a complete delivery partner allocation system addressing: (1) Two-stage approach (LM batching/routing + FM JIT assignment), (2) Key trade-offs (O2A vs. batch quality within 1-min cron SLA), (3) Multi-objective optimization (CPD, customer experience, implicit JIT), (4) Prediction models required (ETA for FM, LM, WT, travel times), (5) Production challenges (5k+ peak predictions/second, P99 latency 71ms, partial predictions under timeout).”


Answer Framework

STAR Method Structure:
- Situation: Order-to-delivery assignment must optimize multiple objectives (cost, customer experience, DP efficiency) under strict latency constraints (1-min cron cycle)
- Task: Design two-stage system balancing batching benefits (multiple orders per trip reducing CPD) with assignment speed (low O2A time), handling 5k+ predictions/second
- Action: Implement LM batching via DPDPTW heuristics, FM JIT assignment via multi-objective optimization, micro-batching for GPU efficiency, partial predictions under timeout
- Result: Achieved P99 latency 71ms enabling city-level batch predictions within 1-min SLA, reduced CPD by 25% via batching, maintained O2A <2 mins

Key Competencies Evaluated:
- Combinatorial Optimization: Vehicle routing, matching algorithms, constraint satisfaction
- System Design: Latency optimization, distributed processing, fault tolerance
- Multi-Objective Optimization: Balancing competing objectives with weighted scoring
- Production ML: Real-time serving, micro-batching, partial predictions, monitoring

Answer

Two-stage approach: Stage I (Last-Mile/LM) solves batching and routing via the Dynamic Pickup and Delivery Problem with Time Windows (DPDPTW): given a set of orders and available delivery executives, create batches (groups of orders assigned to a single DE) and routes (sequences of pickups and deliveries) minimizing total travel time subject to time-window constraints (delivery by promised SLA). A construction heuristic (greedy insertion: iteratively add orders to batches minimizing marginal cost) is followed by a meta-heuristic (simulated annealing, tabu search) improving solution quality. Stage II (First-Mile/FM) performs Just-In-Time (JIT) assignment of nearby delivery executives to the batches created in Stage I, using multi-objective optimization balancing DE proximity (minimize FM time), DE workload (avoid overloading a single DE), and implicit JIT (deferring assignment to the next cron cycle if better DEs are expected).

Key trade-offs balance Order-to-Assignment (O2A) time against batch quality. Delaying the assignment cycle accumulates more orders, enabling better batching (2-3 orders per trip vs. 1) and reducing cost per delivery, but increases O2A time (the customer sees "searching for delivery partner" longer), degrading experience. The optimal strategy runs 1-minute cron cycles processing all pending orders within 60 seconds (hard SLA): the first 30 seconds accumulate orders, the next 30 seconds run optimization and assignment, balancing batching benefits against customer experience.

Multi-objective optimization defines the objective F = w₁×CPD + w₂×CustomerExperience + w₃×ImplicitJIT, where CPD (Cost Per Delivery) = (total_travel_time × cost_per_minute) / num_orders measures operational efficiency, CustomerExperience = f(estimated_delay) penalizes late deliveries with an exponential penalty (a 5-minute delay incurs a small penalty, 15 minutes a large one), and ImplicitJIT = benefit_of_deferring measures the expected improvement from waiting for better DEs (trading the current assignment against deferring to the next cron). Weights w₁ = 0.5, w₂ = 0.4, w₃ = 0.1 are tuned via A/B testing measuring overall platform health (orders completed, customer satisfaction, DP utilization).

Prediction models required include ETA models for FM time (DE location to restaurant), LM time (restaurant to customer), Wait Time (food preparation at the restaurant), and travel times between locations (using the Google Maps API with real-time traffic). These predictions feed the assignment optimization; inaccurate predictions cause suboptimal assignments (assigning a far DE believed to be close), requiring continuous model retraining on actual delivery outcomes with feedback loops.

Production challenges center on latency: the order-assignment flow must complete all tasks for a city within 1 minute, including fetching active orders (100 ms), fetching available DEs (100 ms), running ETA predictions (400 ms for 5k predictions), running optimization (300 ms), and committing assignments (100 ms). Swiggy's Data Science Platform achieves 5k+ peak predictions/second with P99 latency of 71 ms via micro-batching (grouping predictions by city-zone for efficient GPU utilization, batch size 512), a feature store caching frequently accessed features (restaurant embeddings, zone statistics) in Redis with a 1-minute TTL reducing database load, and partial predictions under timeout, returning best-effort results for critical stages (LM being most important) rather than failing entirely when upstream services are slow. Monitoring dashboards track prediction latency percentiles (P50, P95, P99), optimization convergence time, and assignment success rate, while feedback loops comparing actual delivery outcomes against predictions (actual vs. predicted FM time) retrain models nightly, incorporating the latest behavioral patterns.

Optimization constraints enforce that each DE is assigned to at most one batch (capacity constraint), each batch to exactly one DE (coverage constraint), time-window satisfaction (delivery by promised SLA), and vehicle capacity (max 3-4 orders per trip, preventing quality degradation). Constraint violations are handled via penalty terms in the objective function (soft constraints) or hard filtering (infeasible solutions rejected).

Scalability architecture uses distributed optimization where each city is processed independently (embarrassingly parallel), horizontal scaling adding workers for high-demand cities (Mumbai and Bangalore get 10 workers, tier-2 cities get 2), and incremental optimization: instead of solving from scratch each cron cycle, warm-start from the previous solution and adjust for new orders (faster convergence). Production deployment uses Kubernetes for auto-scaling based on order volume, circuit breakers preventing cascading failures when optimization times out, and fallback strategies assigning the nearest available DE when optimization fails (degraded mode beats no assignment).

Code

Delivery Partner Assignment System:

import numpy as np
from scipy.optimize import linear_sum_assignment
from dataclasses import dataclass
from typing import List, Tuple
import time

@dataclass
class Order:
    order_id: str
    restaurant_lat: float
    restaurant_lon: float
    customer_lat: float
    customer_lon: float
    promised_time: float  # minutes from now

@dataclass
class DeliveryExecutive:
    de_id: str
    current_lat: float
    current_lon: float
    available: bool
    current_orders: int

class DeliveryAssignmentSystem:
    def __init__(self, timeout_ms=1000):
        self.timeout_ms = timeout_ms

    def haversine_distance(self, lat1, lon1, lat2, lon2):
        """
        Calculate distance between two points (km)
        """
        R = 6371  # Earth radius in km

        lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
        dlat = lat2 - lat1
        dlon = lon2 - lon1

        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))

        return R * c

    def predict_eta(self, distance_km, traffic_factor=1.2):
        """
        Predict ETA based on distance and traffic
        Simplified: actual system uses ML models
        """
        avg_speed_kmph = 25  # Average delivery speed
        eta_minutes = (distance_km / avg_speed_kmph) * 60 * traffic_factor
        return eta_minutes

    def create_batches(self, orders: List[Order], max_batch_size=3):
        """
        Stage I: Create order batches using greedy clustering
        Simplified DPDPTW heuristic
        """
        batches = []
        remaining_orders = orders.copy()

        while remaining_orders:
            # Start new batch with first order
            batch = [remaining_orders.pop(0)]

            # Add nearby orders to batch
            while len(batch) < max_batch_size and remaining_orders:
                # Find closest order to batch centroid
                batch_lat = np.mean([o.customer_lat for o in batch])
                batch_lon = np.mean([o.customer_lon for o in batch])

                distances = [
                    self.haversine_distance(batch_lat, batch_lon, o.customer_lat, o.customer_lon)
                    for o in remaining_orders
                ]

                closest_idx = np.argmin(distances)

                # Add if within 2km radius
                if distances[closest_idx] < 2.0:
                    batch.append(remaining_orders.pop(closest_idx))
                else:
                    break

            batches.append(batch)

        return batches

    def assign_de_to_batches(self, batches: List[List[Order]],
                            des: List[DeliveryExecutive]):
        """
        Stage II: Assign DEs to batches using multi-objective optimization
        """
        n_batches = len(batches)
        n_des = len(des)

        # Create cost matrix
        cost_matrix = np.zeros((n_batches, n_des))

        for i, batch in enumerate(batches):
            # Pickup point for FM: use the first order's restaurant
            # (simplification: assumes orders in a batch share a restaurant area)
            batch_restaurant_lat = batch[0].restaurant_lat
            batch_restaurant_lon = batch[0].restaurant_lon

            for j, de in enumerate(des):
                if not de.available or de.current_orders >= 3:
                    cost_matrix[i, j] = 1e6  # Infeasible
                    continue

                # Calculate FM distance (DE to restaurant)
                fm_distance = self.haversine_distance(
                    de.current_lat, de.current_lon,
                    batch_restaurant_lat, batch_restaurant_lon
                )

                # Calculate LM distance (restaurant to customers)
                lm_distance = sum([
                    self.haversine_distance(
                        batch_restaurant_lat, batch_restaurant_lon,
                        order.customer_lat, order.customer_lon
                    ) for order in batch
                ])

                # Predict ETAs
                fm_eta = self.predict_eta(fm_distance)
                lm_eta = self.predict_eta(lm_distance / len(batch))

                # Multi-objective cost
                cpd = (fm_eta + lm_eta) / len(batch)  # Cost per delivery
                customer_exp_penalty = max(0, fm_eta + lm_eta - batch[0].promised_time) ** 2
                workload_penalty = de.current_orders * 5  # Penalize overloaded DEs

                cost_matrix[i, j] = 0.5 * cpd + 0.4 * customer_exp_penalty + 0.1 * workload_penalty

        # Solve assignment problem (Hungarian algorithm)
        batch_indices, de_indices = linear_sum_assignment(cost_matrix)

        assignments = []
        for batch_idx, de_idx in zip(batch_indices, de_indices):
            if cost_matrix[batch_idx, de_idx] < 1e6:  # Feasible
                assignments.append({
                    'batch': batches[batch_idx],
                    'de': des[de_idx],
                    'cost': cost_matrix[batch_idx, de_idx]
                })

        return assignments

    def run_assignment_cycle(self, orders: List[Order], des: List[DeliveryExecutive]):
        """
        Complete assignment cycle with timeout handling
        """
        start_time = time.time()

        try:
            # Stage I: Batching
            batches = self.create_batches(orders)

            # Check timeout
            if (time.time() - start_time) * 1000 > self.timeout_ms * 0.5:
                print("Warning: Batching took >50% of timeout, using greedy assignment")
                # Fallback: assign each order to nearest DE
                return self.fallback_greedy_assignment(orders, des)

            # Stage II: Assignment
            assignments = self.assign_de_to_batches(batches, des)

            # Check timeout
            elapsed_ms = (time.time() - start_time) * 1000
            if elapsed_ms > self.timeout_ms:
                print(f"Warning: Assignment exceeded timeout ({elapsed_ms:.0f}ms > {self.timeout_ms}ms)")

            return assignments

        except Exception as e:
            print(f"Assignment failed: {e}, using fallback")
            return self.fallback_greedy_assignment(orders, des)

    def fallback_greedy_assignment(self, orders: List[Order], des: List[DeliveryExecutive]):
        """
        Fallback strategy: assign each order to nearest available DE
        """
        assignments = []

        for order in orders:
            # Find nearest available DE
            available_des = [de for de in des if de.available and de.current_orders < 3]

            if not available_des:
                continue

            distances = [
                self.haversine_distance(de.current_lat, de.current_lon,
                                       order.restaurant_lat, order.restaurant_lon)
                for de in available_des
            ]

            nearest_de = available_des[np.argmin(distances)]

            assignments.append({
                'batch': [order],
                'de': nearest_de,
                'cost': min(distances)
            })

        return assignments

# Example Usage
np.random.seed(42)

# Simulate orders
orders = [
    Order(f"order_{i}",
          12.9716 + np.random.normal(0, 0.01),  # Bangalore lat
          77.5946 + np.random.normal(0, 0.01),  # Bangalore lon
          12.9716 + np.random.normal(0, 0.02),
          77.5946 + np.random.normal(0, 0.02),
          30.0)  # 30 min promised time
    for i in range(100)
]

# Simulate delivery executives
des = [
    DeliveryExecutive(f"de_{i}",
                     12.9716 + np.random.normal(0, 0.02),
                     77.5946 + np.random.normal(0, 0.02),
                     True, 0)
    for i in range(50)
]

# Run assignment
assignment_system = DeliveryAssignmentSystem(timeout_ms=1000)

start = time.time()
assignments = assignment_system.run_assignment_cycle(orders, des)
elapsed_ms = (time.time() - start) * 1000

print(f"\nAssignment Results:")
print(f"  Total orders: {len(orders)}")
print(f"  Total DEs: {len(des)}")
print(f"  Assignments created: {len(assignments)}")
print(f"  Orders per batch (avg): {np.mean([len(a['batch']) for a in assignments]):.2f}")
print(f"  Latency: {elapsed_ms:.2f}ms")
print(f"  Orders assigned: {sum([len(a['batch']) for a in assignments])}")
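The meta-heuristic improvement step mentioned in the answer (simulated annealing on top of greedy batches) can be sketched as follows. This is a simplification: the compactness cost and the single swap move stand in for a full DPDPTW objective with routing and time windows, and all names here are illustrative.

```python
import math
import random

def batch_cost(batches, dist):
    """Proxy objective: total pairwise distance within each batch (compactness)."""
    return sum(dist(a, b)
               for batch in batches
               for i, a in enumerate(batch)
               for b in batch[i + 1:])

def anneal_batches(batches, dist, iters=2000, t0=1.0, cooling=0.995, seed=42):
    """Improve an initial (e.g. greedy) batching by swapping orders between
    batches, accepting uphill moves with the Metropolis criterion."""
    rng = random.Random(seed)
    cur = [list(b) for b in batches]
    cur_cost = batch_cost(cur, dist)
    best, best_cost = [list(b) for b in cur], cur_cost
    t = t0
    for _ in range(iters):
        i, j = rng.sample(range(len(cur)), 2)
        if not cur[i] or not cur[j]:
            continue
        a, b = rng.randrange(len(cur[i])), rng.randrange(len(cur[j]))
        cur[i][a], cur[j][b] = cur[j][b], cur[i][a]        # propose a swap
        new_cost = batch_cost(cur, dist)
        if new_cost < cur_cost or rng.random() < math.exp((cur_cost - new_cost) / max(t, 1e-9)):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best_cost, best = cur_cost, [list(b) for b in cur]
        else:
            cur[i][a], cur[j][b] = cur[j][b], cur[i][a]    # revert the swap
        t *= cooling
    return best, best_cost

# Toy instance: orders as (x, y) points, Euclidean distance
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(12)]
initial = [points[0:4], points[4:8], points[8:12]]
euclid = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])

improved, cost = anneal_batches(initial, euclid)
print(f"Initial cost: {batch_cost(initial, euclid):.3f}, after annealing: {cost:.3f}")
```

In production this improvement pass would run only within the remaining time budget of the 1-minute cron cycle, warm-starting from the previous cycle's solution as described above.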