Paytm Data Engineer

Q: It’s 10 AM on Diwali Sale day. The Real-time Sales Dashboard used by the CEO stops updating. 1. How do you investigate? 2. How do you communicate? 3. How do you prevent it from happening again?

STAR Method Structure: - Situation: Kafka brokers were overloaded due to 5x traffic spike, causing “Consumer Lag” to skyrocket. Dashboards were 30 mins stale. Use of “Sales War Room” panic. - Task: Restore dashboard latency to <1 min immediately. - Action: (1) Acknowledged the issue within 5 mins. (2) Temporarily disabled “Ad-hoc Analytics” consumers to free up bandwidth (Load Shedding). (3) Scaled Flink Consumers. - Result: Dashboard caught up in 15 mins. Established “Tier 1 vs Tier 2” consumer

This guide features 10 challenging Data Engineer interview questions for Paytm (Data Engineer to Lead DE levels), covering big data processing, real-time streaming, SQL optimization, and system design aligned with Paytm’s mission of building India’s leading payments ecosystem.

1. Recursive Revenue Calculation with Multi-Level Referrals — Complex SQL Window Functions

Difficulty Level: Very High

Role: Data Engineer / Senior Data Engineer

Source: LinkedIn Post by Shubham Wadekar (Paytm Data Engineering Interview Questions 2025)

Topic: Data Warehousing & Analytics

Interview Round: SQL Coding (60 minutes)

Domain: Growth & Referral Analytics

Question: “Given a table of user referrals with columns user_id, referred_by_user_id, transaction_amount, and transaction_date, write a SQL query to calculate the total revenue contribution of each user, including revenue from their direct referrals (Level 1), their referrals’ referrals (Level 2), and their referrals’ referrals’ referrals (Level 3). Only include transactions from the last 30 days.”

Schema:

CREATE TABLE user_referrals (
    user_id INT,
    referred_by_user_id INT,
    transaction_amount DECIMAL(10, 2),
    transaction_date DATE,
    PRIMARY KEY (user_id, transaction_date)
);

Answer Framework

STAR Method Structure:
- Situation: Paytm’s referral program rewards users not just for direct invites but for network effects (L1-L3). Calculating this attribution efficiently on millions of rows is complex.
- Task: Write a SQL query to traverse the referral tree up to 3 levels and aggregate revenue for the last 30 days.
- Action: Utilized a Recursive CTE (Common Table Expression) to traverse the hierarchy, setting a depth limit of 3 to prevent infinite loops and filtering early by date.
- Result: Delivered a scalable solution to track “Influencer Value,” enabling the Growth team to target high-network users.

Key Competencies Evaluated:
- Recursive Logic: Understanding WITH RECURSIVE for hierarchical data structures.
- Optimization: Limiting recursion depth and filtering data before joining.
- Edge Case Handling: Handling circular references (users referring themselves).

SQL Implementation

WITH RECURSIVE referral_tree AS (
    -- Base case: Direct users (no referrer context needed for root, but we need to start somewhere)
    -- Actually, for revenue contribution, we want to find descendants OF a root.
    -- Better approach: Start with all users as potential roots, then find their descendants.

    -- Level 0: The User themselves (if counting own spend) or just Direct Referrals.
    -- The prompt asks for "revenue contribution... including referrals".
    -- Usually this means: My Spend + Child Spend + Grandchild Spend.

    -- Let's stick to the prompt's implied logic: Find all connections.

    SELECT
        user_id,
        user_id AS root_user,
        0 AS level,
        transaction_amount,
        transaction_date
    FROM user_referrals
    WHERE transaction_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)

    UNION ALL

    -- Recursive case: Find people referred by the current set
    SELECT
        child.user_id,
        parent.root_user,
        parent.level + 1,
        child.transaction_amount,
        child.transaction_date
    FROM user_referrals child
    INNER JOIN referral_tree parent
        ON child.referred_by_user_id = parent.user_id
    WHERE parent.level < 3  -- Limit recursion to 3 levels (0, 1, 2, 3)
      AND child.transaction_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
)
SELECT
    root_user,
    SUM(transaction_amount) AS total_revenue,
    COUNT(DISTINCT user_id) AS referral_network_size,
    MAX(level) AS max_depth
FROM referral_tree
GROUP BY root_user
ORDER BY total_revenue DESC;

Answer (Part 1 of 2): Recursive Strategy & Complexity

Recursive CTEs are the standard solution for hierarchical data (trees/graphs) in SQL. The query works in two parts: the Anchor Member (Base Case) selects the starting nodes, and the Recursive Member joins the table to itself to find children.
* Time Complexity: O(N × D) where N is the number of rows and D is the depth (3).
* Optimization: We filter transaction_date inside the CTE to minimize the dataset before recursion.
* Performance: For deep trees (>10 levels), recursion can be slow. In those cases, we’d store a “Path String” or “Hierarchy ID” in the table or use a Graph Database.

Answer (Part 2 of 2): Edge Cases & Alternatives

Circular References: If User A refers B, and B refers A, infinite recursion occurs. To fix this, we can add a check: WHERE child.user_id != parent.root_user.
Window Functions vs CTE: Window functions (DENSE_RANK, LEAD) enable analysis on flat partitions but cannot traverse parent-child relationships easily. CTEs are strictly required here.
Self-Referrals: The query should explicitly exclude user_id = referred_by_user_id in the join condition if the data allows self-referrals.

2. Optimize Slow Query on 1 Billion-Row Transaction Table with Multiple Joins

Difficulty Level: Very High

Role: Senior Data Engineer

Source: LinkedIn Post by Shubham Wadekar (Paytm Data Engineering Interview Questions 2025)

Topic: Data Warehousing & Performance Optimization

Interview Round: System Design / Performance Optimization (60 minutes)

Domain: Core Payments Platform

Question:
“You have a query running on a 1 billion-row transactions table that joins with user, merchant, and payment_method tables. The query uses multiple window functions to calculate running balance, transaction rank, and moving averages. Currently, it takes 45 minutes to complete. How would you rewrite and optimize it for acceptable performance (target: <2 minutes)?”

Original Slow Query:

SELECT
    t.transaction_id, t.user_id, t.amount,
    SUM(t.amount) OVER (PARTITION BY t.user_id ORDER BY t.transaction_date) AS running_balance,
    ROW_NUMBER() OVER (PARTITION BY t.user_id ORDER BY t.transaction_date DESC) AS transaction_rank,
    AVG(t.amount) OVER (PARTITION BY t.user_id ORDER BY t.transaction_date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS moving_avg
FROM transactions t
JOIN users u ON t.user_id = u.user_id
JOIN merchants m ON t.merchant_id = m.merchant_id
WHERE t.transaction_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY);

Answer Framework

STAR Method Structure:
- Situation: A critical report on 1B+ rows was timing out (45 mins) due to full table scans and heavy window function computations on raw data.
- Task: Optimize the query to run under 2 minutes.
- Action: Applied a 3-step optimization: Indexing/Partitioning (reduce scan range), Materialized Views (pre-compute window functions), and Columnar Storage (reduce I/O).
- Result: Query time reduced to 90 seconds (30x improvement) by eliminating repetitive aggregation.

Key Competencies Evaluated:
- Execution Plan Analysis: Identifying bottlenecks (Scans vs Sorts).
- Materialization: Knowing when to calculate on the fly vs pre-store.
- Partitioning Strategy: Pruning partitions by date.

Optimization Strategy & Solution

Step 1: Partitioning & Indexing (Foundation)
Instead of scanning 1B rows, we scan only the relevant partitions (last 90 days).

-- Partition by Date Range
ALTER TABLE transactions PARTITION BY RANGE (YEAR(transaction_date)) (...);

-- Composite Index for the specific query pattern
CREATE INDEX idx_txn_composite ON transactions(user_id, transaction_date, amount);

Step 2: Pre-Materialization (The Game Changer)
Window functions (Running Sum, Moving Avg) are expensive to compute per query. We pre-calculate them daily.

CREATE MATERIALIZED VIEW user_daily_metrics AS
SELECT
    user_id, transaction_date, amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY transaction_date) AS running_balance,
    AVG(amount) OVER (PARTITION BY user_id ORDER BY transaction_date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS moving_avg
FROM transactions
WHERE transaction_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY);

Step 3: Optimized Query (Final)
Query the pre-computed view and join dimension tables last (after filtering).

SELECT
    utm.user_id, utm.running_balance, utm.moving_avg,
    u.user_segment, m.merchant_category
FROM user_daily_metrics utm
INNER JOIN users u ON utm.user_id = u.user_id  -- Join small dimensions late
INNER JOIN merchants m ON utm.merchant_id = m.merchant_id
WHERE utm.transaction_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY);

Answer (Part 1 of 2): Performance Bottlenecks & Fixes

The Bottlenecks:
1. Full Table Scan: Without partitioning/indexing, DB reads 1B rows. Partitioning by transaction_date prunes 90% of data.
2. Wide Joins: Joining users and merchants before aggregation duplicates dimension data, exploding memory usage. Optimization: Aggregation First -> Join Last.
3. Window Functions: calculating ROWS BETWEEN 29 PRECEDING requires sorting, which is O(Nlog N). Pre-materializing this removes the sort cost from the read path.

Answer (Part 2 of 2): Advanced Storage Techniques

Columnar Storage (Parquet/ORC): If using Hive/Spark, storing data in Columnar format improves performance by 10x for aggregations (reading only amount column vs entire row).
Approximate Algorithms: If exact precision isn’t required (e.g., dashboard Trend Lines), use APPROX_QUANTILES or HyperLogLog for counting distincts, which is 100x faster than exact calculations.

3. Design Real-Time Payment Transaction Analytics Pipeline — System Design

Difficulty Level: Very High

Role: Senior Data Engineer / Lead Data Engineer

Source: Paytm’s Technical Case Study (AWS EMR Modernization)

Topic: Real-Time Streaming & Event Processing

Interview Round: System Design (90 minutes)

Domain: Growth Monitoring & Fraud Detection

Question:
“Design a real-time analytics pipeline that ingests 5 billion payment events per day for Paytm, processes them with sub-second latency, and enables the Growth team to explore data interactively. Consider scalability, cost, fault tolerance, and how to handle peak loads (10x normal during festivals like Diwali).”

Answer Framework

STAR Method Structure:
- Situation: Processing 5B events/day (~57k/sec) with 10x spikes (570k/sec) while providing sub-second analytics is challenging. Standard batch pipelines (Airflow/Hive) have 1-hour latency.
- Task: Build a streaming architecture for Instant Insights (Growth Team) and Fraud Alerts (Risk Team).
- Action: Designed a Lambda Architecture using Apache Kafka (Ingest), Apache Flink (Stateful Processing), and Apache Druid (OLAP Serving). Implemented auto-scaling for peak loads.
- Result: Latency reduced to <2s end-to-end. Cost optimized by using Spot Instances for stateless compute and S3 for cold storage.

Key Competencies Evaluated:
- Technology Selection: Why Flink over Spark Streaming? (Lower latency). Why Druid over Postgres? (OLAP Aggregation speed).
- Scale Handling: Specifically addressing the “Diwali Spike” (10x load).
- Cost Awareness: Tiered storage (Hot/Warm/Cold) and Spot instances.

System Architecture Code (Flink + Kafka)

Stream Processing Logic (Python/Flink):

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import WindowFunction

env = StreamExecutionEnvironment.get_execution_environment()

# 1. Ingest from Kafka (Partitioned by UserID)
kafka_source = env.add_source(...)

# 2. Enrich Stream (Join with State)
def enrich_event(event):
    # Lookup User Segment from RocksDB State Backend
    return {**event, "segment": user_state[event['user_id']]}

# 3. Window Aggregation (1-Minute Tumbling Window)
windowed_events = kafka_source.map(enrich_event) \
    .key_by(lambda e: e['merchant_id']) \
    .time_window(60000) \
    .apply(PaymentAggregationFunction())

# 4. Sink to Multiple Targets
windowed_events.add_sink(DruidSink(topic="realtime_metrics"))  # For Analytics
windowed_events.add_sink(S3Sink(path="s3://paytm/raw/"))       # For Data Lake

Answer (Part 1 of 3): Ingestion & Processing Layer

Kafka (Ingestion): We use Kafka with 100 partitions and Replication Factor 3. Key design: Partition by user_id ensures all transactions for a user land on the same partition, guaranteeing order.
Flink (Processing): Chosen over Spark Streaming because Flink handles Event Time processing better (handling late-arriving data) and has true Low Latency (streaming vs micro-batch). We use RocksDB as the state backend to manage large state (e.g., “User’s last 5 transactions”) without blowing up RAM.

Answer (Part 2 of 3): Serving & Storage Layer

Hot Storage (Apache Druid): Druid is optimized for “Slice and Dice” queries. It indexes every column, allowing the Growth team to query “UPI failure rates in Mumbai for GenZ segment” in milliseconds.
Cold Storage (S3): All raw events are dumped to S3 in Parquet format (partitioned by date/hour). This is the “Source of Truth” for training ML models later (Batch Layer).
Cost Optimization:
* Spot Instances: Since Flink checkpoints state, we run Task Managers on AWS Spot Instances (60% cheaper). If a node dies, Flink restores state from the last checkpoint.
* Tiered Storage: Druid keeps only 7 days of “Hot” data. Older data is offloaded to “Warm” tiers (S3-backed) which are cheaper but slower.

Answer (Part 3 of 3): Handling Peak Loads (The Diwali Challenge)

10x Scaling Strategy:
1. Kafka: Pre-split partitions. We run with 500 partitions normally (over-provisioned) so we don’t need to re-partition during the event.
2. Flink: Increase parallelism. We spin up 50 extra Task Managers using an Auto-Scaling Group trigger based on “Kafka Lag” metrics.
3. Circuit Breakers: If the sink (Druid) slows down, we implement “Backpressure” to slow down ingestion or spill to S3 purely to catch up later, prioritizing Data Integrity over Real-Time Latency during extreme peaks.

4. Duplicate Payment Detection in Real-Time Streaming — Edge Case Handling

Difficulty Level: High

Role: Senior Data Engineer

Source: LinkedIn Post by Shubham Wadekar / Rajashekar

Topic: Real-Time Streaming & Event Processing

Interview Round: Coding + System Design (60 minutes)

Domain: Payment Gateway Engineering

Question:
“Write a Python script to detect duplicate payments in real-time. You receive millions of payment events per second. Handle edge cases:
1. Retries: Same transaction sent at T and T+2 seconds.
2. Currency: ₹1000 vs $12 USD (needs normalization).
3. Partial Payments: User pays ₹500, then ₹500 more (Not a duplicate).
How would you efficiently detect duplicates at scale?”

Answer Framework

STAR Method Structure:
- Situation: Users double-click “Pay” or network retries cause duplicate events. Processing them twice causes financial loss (double charge).
- Task: Build a de-duplication service that handles retries (exact dups) vs partial payments (legitimate) at 1M TPS.
- Action: Implemented a Fingerprinting Strategy (Hash of User+Merchant+Amount) stored in a Bloom Filter (L1 Check) and Redis (L2 Confirmation) time-windowed for 5 minutes.
- Result: Eliminated double-charges with <1ms latency overhead. 99% of non-dups filtered by Bloom filter (saving Redis IO).

Key Competencies Evaluated:
- Idempotency: Designing systems that can handle the same message twice without side effects.
- Probabilistic Data Structures: Using Bloom Filters to save memory.
- Edge Case handling: Distinguishing “Same Amount” from “Same Transaction”.

Python Implementation (Optimized)

import hashlib
from datetime import datetime
from pybloom import BloomFilter
import redis

class PaymentDeduplicator:
    def __init__(self):
        # L1: Bloom Filter (Fast, Memory Efficient, 0.01% False Positive)
        self.bloom_filter = BloomFilter(capacity=1000000, error_rate=0.0001)
        # L2: Redis (Exact Source of Truth)
        self.redis_client = redis.StrictRedis(host='localhost', port=6379)
        self.FX_RATES = {'USD': 83.0, 'INR': 1.0}

    def get_fingerprint(self, event):
        # Normalize Amount to INR
        amount_inr = event['amount'] * self.FX_RATES.get(event['currency'], 1.0)
        # Fingerprint: User + Merchant + Normalized Amount + Payment Method
        # Note: Do NOT include timestamp (as retries have different stamps)
        raw_str = f"{event['user_id']}_{event['merchant_id']}_{amount_inr:.2f}_{event['payment_method']}"
        return hashlib.md5(raw_str.encode()).hexdigest()

    def is_duplicate(self, event):
        fingerprint = self.get_fingerprint(event)

        # Step 1: Fast Check (Bloom Filter)
        if fingerprint not in self.bloom_filter:
            self.bloom_filter.add(fingerprint)
            return False  # Definitely not a duplicate

        # Step 2: Exact Check (Redis) - Handle False Positives & Time Window
        # Key expires in 5 minutes (typical retry window)
        redis_key = f"dedup:{fingerprint}"
        if self.redis_client.exists(redis_key):
            # Check timestamp diff
            last_time = float(self.redis_client.get(redis_key))
            current_time = datetime.now().timestamp()
            if (current_time - last_time) < 300: # 5 mins
                return True # DUPLICATE DETECTED

        # New valid transaction (or False Positive from Bloom)
        self.redis_client.setex(redis_key, 300, datetime.now().timestamp())
        return False

    def handle_partial(self, event):
        # Logic: Partial payments have specific 'split_id' or state
        # Deduplicator treats them as unique if amount/metadata differs
        pass

Answer (Part 1 of 3): Deduplication Logic (Fingerprinting)

The Fingerprint: We cannot just use transaction_id because some gateways generate new IDs for retries. We create a composite key: Hash(User_ID + Merchant_ID + Standardized_Amount).
Currency Normalization: Comparing ₹1000 vs $12 requires live FX conversion (or tolerance range ±2%) before hashing.
Partial Payments: These are not duplicates. They usually carry a parent_order_id but distinct payment_ids or amounts. Our specific fingerprint (User+Merchant+Amount) might mistakenly flag two ₹500 payments as dups. Fix: Include sequence_number or split_id in the hash if present.

Answer (Part 2 of 3): Scaling with Bloom Filters

At 1M TPS, checking Redis for every transaction is slow and expensive.
Why Bloom Filter? It fits in RAM. It answers “Definitely No” or “Maybe Yes”.
* If Bloom says “No” (99.9% of traffic): We accept the payment immediately. Zero network call.
* If Bloom says “Maybe” (0.1%): We check Redis to confirm.
This architecture reduces Redis load by 99.9%.

Answer (Part 3 of 3): Time Windows & Expiry

We don’t store history forever. Real-world retries happen within seconds or minutes. We set a TTL (Time To Live) of 5 minutes in Redis. If the same user makes the same payment an hour later, it’s likely a new distinct purchase (e.g., buying a second coffee), not a technical retry.

5. PySpark Job to Join 10TB+ Datasets with Minimal Data Shuffling

Difficulty Level: High

Role: Senior Data Engineer

Source: LinkedIn Post by Shubham Wadekar

Topic: Big Data Engineering / Spark Optimization

Interview Round: Coding + System Design (60 minutes)

Domain: Payments Data Lake

Question:
“You need to join two large datasets (Transaction logs: 10TB, User profiles: 5TB). The join key user_id is highly skewed (Power Users have 1M+ txns). Write a PySpark job that minimizes shuffling (<20%) and handles skew. Explain your partitioning strategy.”

Answer Framework

STAR Method Structure:
- Situation: Joining 15TB of data caused OOM errors and excessive shuffle because “Whale Users” (Paytm Merchants) had 1000x more transactions than regular users, causing “Straggler Tasks.”
- Task: Optimize the join to complete within SLA limits.
- Action: Implemented a Skew-Aware Join using the Salting Technique (splitting big keys) and enabled Adaptive Query Execution (AQE).
- Result: Join time reduced from 2 hours to 45 mins. Shuffle data reduced to <1.5TB (~10%).

Key Competencies Evaluated:
- Data Skew: Identifying and fixing stragglers.
- Join Strategies: Broadcast vs Sort-Merge vs Skewed.
- Salting: Manually redistributing skewed keys.

PySpark Optimized Implementation

from pyspark.sql.functions import col, broadcast, rand, floor

def join_with_skew_handling(txn_df, user_df):
    # 1. Enable AQE (Adaptive Query Execution) - The modern way
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # 2. Strategy A: Broadcast Join (if User Table < 8GB)
    # 5TB is too big for Broadcast. So we skip this unless we filter users first.

    # 3. Strategy B: Salting (Manual Skew Handling)
    # Step 3.1: Add "Salt" (Random 0-19) to the Skewed Table (Transactions)
    txn_salted = txn_df.withColumn("salt", floor(rand() * 20))

    # Step 3.2: Explode the Small Table (Users) 20 times to match salts
    # This prevents all 1M transactions for 'User A' going to 1 task.
    # They now go to 20 different tasks.
    user_exploded = user_df.withColumn("salt_array", array([lit(i) for i in range(20)])) \
                           .select(col("user_id"), explode(col("salt_array")).alias("salt"))

    # Step 3.3: Join on Key + Salt
    joined_df = txn_salted.join(user_exploded,
                                on=["user_id", "salt"],
                                how="inner")

    # 4. Result
    return joined_df.drop("salt")

Answer (Part 1 of 3): The Skew Problem & Salting

Data Skew: In Payments, a merchant like “Uber” has 10M transactions, while “User Bob” has 5. In a standard Sort-Merge Join, the reducer for “Uber” gets overwhelmed (Straggler), checking 1 core while 99 others wait.
Salting: By adding a random number (0-19) to the transaction key and replicating the User record 20 times, we split “Uber’s” 10M transactions into 20 chunks of 500k. These can be processed in parallel by 20 tasks, eliminating the bottleneck.

Answer (Part 2 of 3): Partitioning Strategy

Shuffle Partitions: 10TB is massive. Default spark.sql.shuffle.partitions (200) is too low (resulting in 50GB partitions).
Calculation: Target partition size ~150MB.
Total Data = 15TB.
Partitions = 15,000,000 MB / 150 MB = 100,000 Partitions.
We set spark.sql.shuffle.partitions = 100000.

Answer (Part 3 of 3): File Format & Compression

Parquet/Snappy: We write output in Parquet (Columnar) with Snappy compression.
Bucketing: If we join these tables frequently, we should pre-bucket them by user_id (e.g., 10,000 buckets). This eliminates the Shuffle step entirely for future joins (SortMergeJoin becomes BucketJoin), reducing join time by 60%.

6. Data Quality & Monitoring Pipeline — Ensure Regulatory Compliance

Difficulty Level: High

Role: Senior Data Engineer

Source: LinkedIn Post by Jaydeep Patel

Topic: Data Quality & Governance

Interview Round: System Design (60 minutes)

Domain: Regulatory Reporting (RBI/PCI-DSS)

Question:
“Design a data quality pipeline for Paytm’s payment data to meet RBI guidelines.
1. Real-time validation (Schema, Nulls, Values).
2. Automated Alerts (Anomalies).
3. SLA Monitoring (Freshness).
4. Audit Logging (Compliance).
How do you ensure Zero Data Loss?”

Answer Framework

STAR Method Structure:
- Situation: Regulatory reports (RBI) were failing validaton because of ‘Ghost Merchants’ (Null IDs) and ‘Future Dates’ in transaction logs.
- Task: Build a “Great Expectations” style pipeline to block bad data entering the warehouse.
- Action: Implemented a Data Quality Firewall on Airflow. Checks run before loading to Warehouse. Failed batches are routed to a “Dead Letter Queue” for manual fix.
- Result: Data Quality Score improved from 85% to 99.9%. Regulatory compliance penalty risk eliminated.

Key Competencies Evaluated:
- Observability: Metric-driven quality (Availability, Validity, Completeness).
- Compliance: Understanding that ‘Audit Trails’ are as important as the data itself.
- Error Handling: Dead Letter Queues (DLQ) vs crashing the pipeline.

Data Quality Framework (Python)

import pandas as pd
from datetime import datetime

class DataValidator:
    def validate_schema(self, df):
        required_cols = {'txn_id': 'object', 'amount': 'float', 'timestamp': 'datetime64[ns]'}
        for col, dtype in required_cols.items():
            if col not in df.columns or df[col].dtype != dtype:
                raise ValueError(f"Schema Mismatch:{col}")

    def validate_business_rules(self, df):
        # Rule 1: Amount must be positive & < 10 Lakh (RBI Limit)
        invalid_amt = df[(df['amount'] <= 0) | (df['amount'] > 1000000)]
        if not invalid_amt.empty:
            self.route_to_dlq(invalid_amt, "InvalidAmount")

        # Rule 2: No Future Dates
        future_dates = df[df['timestamp'] > datetime.now()]
        if not future_dates.empty:
            self.route_to_dlq(future_dates, "FutureDate")

        return df[(df['amount'] > 0) & (df['timestamp'] <= datetime.now())]

    def check_sla(self, df):
        # Freshness Check: Data must be < 15 mins old
        max_delay_mins = (datetime.now() - df['timestamp'].max()).total_seconds() / 60
        if max_delay_mins > 15:
            AlertSystem.send_pagerduty(f"SLA BREACH: Data is{max_delay_mins} mins old")

    def route_to_dlq(self, bad_records, reason):
        # Write to S3 for manual inspection
        path = f"s3://paytm-datalake/dlq/{reason}/{datetime.now()}.csv"
        bad_records.to_csv(path)
        print(f"Routed{len(bad_records)} rows to DLQ:{reason}")

Answer (Part 1 of 3): The Validation Layer

We implement checks at 3 levels:
1. Schema Validation: Strict type checking.
2. Point Checks: Row-level logical rules (e.g., amount > 0).
3. Aggregate Checks: “Did transaction volume drop by 50% vs yesterday?” (Anomaly Detection).
All checks run in the Staging Area. Only valid data moves to Production.

Answer (Part 2 of 3): Handling Bad Data (DLQ)

We never silently drop data (Financial Loss risk).
Dead Letter Queue (DLQ): Invalid rows are effectively “quarantined” in a separate S3 bucket. An automated report is sent to the Engineering team daily to fix the upstream bug (e.g., “IOS App v4.2 is sending null Merchant IDs”).

Answer (Part 3 of 3): Audit & Compliance

For RBI audits, we maintain an Audit Log Table:
{BatchID, StartTime, EndTime, RowCount_Input, RowCount_Valid, RowCount_DLQ, Status}.
This proves to auditors that we account for every single transaction, even the failed ones.

7. Lambda vs. Kappa Architecture — Modernizing the Data Stack

Difficulty Level: High

Role: Lead Data Engineer / Architect

Source: System Design Patterns

Topic: Architecture Strategy

Interview Round: System Design (60 minutes)

Domain: Core Platform Modernization

Question:
“Paytm currently runs a Lambda Architecture (Batch Layer via Hive + Speed Layer via Storm). We are seeing data consistency issues where batch and speed views diverge.
1. Explain the flaws in our current setup.
2. Propose a migration to Kappa Architecture.
3. What are the trade-offs?”

Answer Framework

STAR Method Structure:
- Situation: Maintaining two codebases (Java for Storm, SQL for Hive) led to “Logic Drift” — the calculate_revenue() function produced slightly different results in real-time vs batch.
- Task: Unify the stack to a single “Stream-First” processing engine.
- Action: Migrated to Kappa Architecture using Apache Flink as the single engine for both Real-time and Historical Backfill (using Kafka’s long retention).
- Result: Reduced engineering maintenance by 50%. Guaranteed 100% consistency between Real-time dashboards and End-of-Day reports.

Key Competencies Evaluated:
- Architectural Evolution: Knowing when to retire Lambda.
- Stream Processing: “Stream is just a Batch with no end.”
- Operational Complexity: Reducing moving parts.

Migration Strategy (Lambda -> Kappa)

1. The Problem (Lambda):

[Kafka] -> [Storm (Speed)] -> [Redis]  (Logic A: Java)
   |
   +-----> [HDFS] -> [Hive (Batch)] -> [MySQL] (Logic B: SQL)

Issue: Logic A != Logic B. Debugging is a nightmare.

2. The Solution (Kappa):

[Kafka (Infinite Retention/Tiered Storage)]
   |
   +-> [Flink (Unified Engine)] -> [Sink]

Concept: To re-process history (e.g., fix a bug from last month), we don’t run a Hive Re-computation. We simply replay the Kafka Topic from “Offset 0” using a parallel Flink job.

Answer (Part 1 of 3): Consistency & Maintenance

Lambda Flaw: Violates “DRY” (Don’t Repeat Yourself). You write logic twice.
Kappa Fix: You write logic once in Flink SQL/DataStream. Real-time uses the latest offset. Backfill uses the earliest offset. The code is identical.

Answer (Part 2 of 3): The “Replay” Challenge

Critique of Kappa: “Kafka is expensive for long retention.”
Paytm Solution: Use Tiered Storage (Kafka 3.0+). Hot data (2 days) is on SSD. Cold data (Year 1) is offloaded to S3 but appears as a “Topic” to consumers. This makes infinite retention affordable.

Answer (Part 3 of 3): When to stick with Lambda?

If your complex ML aggregates (e.g., “PageRank over 5 years of graph data”) are impossible to compute incrementally in a stream, you still need a Batch layer (Lambda). Kappa is best for event-driven aggregation, not global graph algorithms.

8. Schema Evolution Strategy for 100+ Microservices

Difficulty Level: High

Role: Senior Data Engineer

Source: Data Lake Best Practices

Topic: Data Modelling & Governance

Interview Round: System Design (45 min)

Domain: Ingestion Platform

Question:
“Paytm has 100+ microservices emitting events to Kafka. Developers frequently change schemas (e.g., renaming user_id to uid or adding mandatory fields), causing downstream Spark jobs to fail.
Design a Schema Evolution Strategy that prevents pipeline breakage while allowing agility.”

Answer Framework

STAR Method Structure:
- Situation: “Silent Schema Changes” caused P1 incidents where our nightly ETL pipelines crashed, leading to stale executive dashboards.
- Task: Enforce strict contracts between Producers (App Teams) and Consumers (Data Teams).
- Action: Implemented Confluent Schema Registry with AVRO format. Enforced “Backward Compatibility” checks in the CI/CD pipeline.
- Result: Pipeline failures dropped to zero. Developers get immediate feedback (“Build Failed”) if they make breaking changes, preventing bad data from hitting Kafka.

Key Competencies Evaluated:
- Serialization Formats: JSON vs Avro/Protobuf.
- Compatibility Modes: Backward vs Forward vs Full.
- Governance: CI/CD integration.

Schema Registry Workflow

The Contract:
1. Format: Switch from JSON (schemaless) to Avro (Schema embedded).
2. Registry: All producers must register schema with the Registry Server before publishing.
3. Compatibility Rule: BACKWARD (New schema can read old data).
* Allowed: Adding an optional field (default=null).
* Forbidden: Renaming a field, Removing a mandatory field.

CI/CD Logic (GitHub Action):
If a developer changes order_event.avsc:
1. CI Script calls SchemaRegistry.check_compatibility(new_schema).
2. If Incompatible -> Block Merge.
3. If Compatible -> Allow Merge & Auto-Update Registry.

Answer (Part 1 of 3): JSON vs Avro

JSON: Human readable but risky. No type enforcement. {"age": "25"} vs {"age": 25} breaks Spark.
Avro: Binary, compact (faster), and requires a schema. Best for Data Lakes (HDFS/S3).
Protobuf: Best for RPC (gRPC), less common for storage.

Answer (Part 2 of 3): Handling “Breaking Changes”

What if a breaking change is necessary (e.g., heavy refactoring)?
Strategy: Dual-Write.
1. Create v2_topic.
2. Producer writes to both v1 and v2.
3. Migrate consumers to v2.
4. Deprecate v1 after 30 days.
This decouples producers from consumers.

Answer (Part 3 of 3): Spark Schema merging

In the Data Lake (Delta Lake/Parquet), set failOnDataLoss=false and mergeSchema=true for reading. However, relying on Spark to infer schema causes “Performance Hit” (listing all files). Using a Registry is cleaner.

9. Multi-Tenant Data Architecture for Merchant Analytics

Difficulty Level: High

Role: Lead Data Engineer

Source: SaaS Architecture Patterns

Topic: Data Security & Access Control

Interview Round: System Design (60 min)

Domain: Paytm for Business (B2B)

Question:
“You are building a Data Warehouse for ‘Paytm for Business’. We have 10 Million Merchants. Each Merchant needs a dashboard showing only their sales.
1. How do you physically store this data? (Separate DBs vs Shared Table?)
2. How do you enforce Strict Data Isolation (Apple must not see Samsung’s data)?
3. How do you optimize for query performance given the scale volume difference (Uber vs Chai shop)?”

Answer Framework

STAR Method Structure:
- Situation: Paytm stores data for 10M+ merchants. Creating 10M separate databases is impossible (maintenance nightmare). Storing in one table creates security risks (Leaky WHERE clause).
- Task: Design a Multi-Tenant Warehouse that is scalable, secure, and performant.
- Action: Implemented a Shared-Table schema with Row-Level Security (RLS) policies. Used clustering keys on merchant_id.
- Result: Single warehouse instance serving 10M tenants securely. Onboarding a new merchant takes 0 ms (no new DDL).

Key Competencies Evaluated:
- Multi-Tenancy Models: Database-per-Tenant vs Schema-per-Tenant vs Shared-Table.
- Security: RLS (Row Level Security) functions.
- Performance: Clustering/Partitioning for tenant isolation.

Implementation Strategy

1. Storage Model: Shared Table (pool)
All transactions live in one giant table.
Transactions { txn_id, merchant_id, amount, date }Why: Analytics queries often need to span all merchants (e.g., “Total GMV of Paytm”). 10M tables would break the metadata store of any DB (Postgres/Snowflake).

2. Enforcement: Row-Level Security (RLS)
Never rely on application logic (WHERE merchant_id = ?). Developers forget.
Use Database Policies (Postgres/Redshift/Snowflake RLS):

-- Create Policy
CREATE POLICY merchant_isolation ON transactions
USING (merchant_id = current_user_merchant_id());

-- Enable it
ALTER TABLE transactions ENABLE ROW LEVEL SECURITY;

Now, SELECT * FROM transactions automatically adds the WHERE clause deep in the engine.

Answer (Part 1 of 3): The “Noisy Neighbor” Problem

Issue: A large merchant (Uber) runs a heavy query, eating 100% CPU. Small merchants (Chai shop) get timeouts.
Fix:
1. Workload Management (WLM): Create separate Queues. “Whale Queue” (max 5 slots) vs “Standard Queue” (max 50 slots).
2. Clustering: Sort data by merchant_id on disk. This ensures Uber’s data blocks are separate from Chai shop’s data blocks, minimizing I/O for small queries.

Answer (Part 2 of 3): Separate Databases? (The Banking Exception)

For extremely sensitive clients (e.g., HDFC Bank using Paytm gateway), we might use a Schema-per-Tenant or even Database-per-Tenant approach. This is stricter but costs more. For 99% of SMBs, Shared Table is the only viable economic model.

Answer (Part 3 of 3): Scaling Storage

We shard the database based on merchant_id.
* Shard 1: Merchant IDs 1-1M
* Shard 2: Merchant IDs 1M-2M
This allows infinite horizontal scaling while keeping related data (orders, refunds) on the same shard.

10. Behavioral — Handling a P0 Data Outage During Peak Sale

Difficulty Level: Medium

Role: Data Engineer / Lead

Source: Behavioral / SRE Principles

Topic: Crisis Management & Communication

Interview Round: Hiring Manager (45 min)

Domain: Operations

Question:
“It’s 10 AM on Diwali Sale day. The Real-time Sales Dashboard used by the CEO stops updating.
1. How do you investigate?
2. How do you communicate?
3. How do you prevent it from happening again?”

Answer Framework

STAR Method Structure:
- Situation: Kafka brokers were overloaded due to 5x traffic spike, causing “Consumer Lag” to skyrocket. Dashboards were 30 mins stale. Use of “Sales War Room” panic.
- Task: Restore dashboard latency to <1 min immediately.
- Action: (1) Acknowledged the issue within 5 mins. (2) Temporarily disabled “Ad-hoc Analytics” consumers to free up bandwidth (Load Shedding). (3) Scaled Flink Consumers.
- Result: Dashboard caught up in 15 mins. Established “Tier 1 vs Tier 2” consumer prioritization policy for future.

Key Competencies Evaluated:
- Bias for Action: Fixing the bleeding first vs Root Cause analysis.
- Communication: “Update every 30 mins” protocol.
- Prioritization: Load shedding non-critical jobs.

Incident Response Playbook

Phase 1: Diagnosis (0-10 Mins)
* Check Lag: kafka-consumer-groups --describe shows lag increasing.
* Check Infrastructure: CPU/Memory on Flink/Kafka nodes.
* The Pivot: If Infra is healthy but Lag is high, it’s a Volume issue.

Phase 2: Mitigation (10-30 Mins)
* Load Shedding: “I am killing the ‘Marketing Analytics’ job to prioritize ‘CEO Dashboard’. Marketing data will be backfilled later.”
* Communication: “CEO Team, data is delayed 20 mins. We have isolated the cause (Volume). ETA recovery 10:45 AM. Next update at 10:30.”

Phase 3: Prevention (Post-Mortem)
* Action Item: Implement Auto-Scaling Policies based on Lag (not just CPU).
* Action Item: Topic Isolation. Important dashboards should not share a Kafka Cluster with experimental jobs.

Answer (Part 1 of 3): Fix vs Root Cause

In a P0, Mitigation (make it work) > Root Cause (why it broke). I prioritize restarting services or adding capacity over debugging logs. Logs are for later.

Answer (Part 2 of 3): Communication is Key

Silence is terrifying to stakeholders. Even “I don’t know yet, still looking” is better than silence. I follow the 30-30-30 Rule: Detect in 30s, Acknowledge in 30m? No -> Update stakeholders every 30 mins.

Answer (Part 3 of 3): Technical Debt

This incident usually reveals technical debt (e.g., “Manual Scaling”). I use the outage as leverage to get prioritizing for “Auto-Scaling Infrastructure” in the next sprint planning.