IBM Data Scientist Interview Questions
Introduction
Data Scientists at IBM sit at the convergence of advanced analytics, machine learning, and enterprise-scale decision-making. IBM's data science teams work across some of the most data-intensive industries in the world — building predictive models that detect fraud in real time for global banks, optimising supply chain logistics for multinational manufacturers, and surfacing clinical insights for healthcare providers. The scale and complexity of these problems demand more than technical proficiency: IBM Data Scientists are expected to frame ambiguous business questions as tractable analytical problems and translate model outputs into decisions that executives can act on.
In practice, this means working with structured and unstructured datasets that run into billions of rows, applying machine learning techniques ranging from classical regression and ensemble methods to deep learning and NLP, and deploying models through IBM's AI ecosystem — including Watson Studio, Watson Machine Learning, and AutoAI. Data Scientists at IBM also collaborate closely with software engineers, product managers, and client-facing consultants, which means the ability to communicate statistical reasoning clearly to non-technical stakeholders is as valued as the ability to tune a gradient boosting model.
IBM's interview process for Data Scientists reflects this breadth. Candidates are assessed on their statistical foundations, their practical experience with messy real-world data, their model evaluation rigour, and their capacity for business thinking. The ten questions below are representative of the scenarios IBM interviewers use to probe these dimensions — and are designed to help you walk into the interview with the depth and specificity the role demands.
Interview Questions
Question 1: Diagnosing and Cleaning a Corrupted Enterprise Dataset
Interview Question
You've been handed a dataset from an IBM retail banking client containing 8 million rows of customer transaction records. The dataset will be used to train a credit risk model. During initial exploration, you notice: 14% of rows have missing values in the income column, the transaction_date column has inconsistent formats across rows, several numeric columns have extreme outliers that appear to be data entry errors, and a handful of customer IDs appear thousands of times more than others. How do you approach this before any modelling begins?
Why Interviewers Ask This Question
Real enterprise datasets are almost never clean, and IBM interviewers know this well. This question tests whether a candidate treats EDA as a genuine diagnostic process rather than a box to tick. Strong candidates demonstrate systematic thinking — understanding why data is dirty before deciding how to fix it — and recognise that cleaning decisions have downstream consequences for model fairness, performance, and business trust.
Example Strong Answer
I'd approach this in four stages, treating each anomaly as a signal rather than just noise to remove.
Missing values in income: First, I'd test whether the missingness is random or systematic. I'd cross-tabulate missing income rows against other features — age, employment status, account type — to see if there's a pattern. If lower-income customers are more likely to have missing values, simple mean imputation would bias the credit risk model against them. In that case, I'd use multiple imputation (e.g., IterativeImputer in scikit-learn, which uses other features to predict the missing value) and add a binary income_missing indicator feature, preserving the signal that the missingness itself carries.
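A minimal sketch of how that could look with scikit-learn — the column names and toy values here are illustrative, not the client's schema:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer

# Toy stand-in for the customer table; 'income' has missing values as in the scenario
df = pd.DataFrame({
    "income": [32000, np.nan, 54000, np.nan, 41000, 78000],
    "age": [29, 41, 35, 52, 44, 38],
    "tenure_months": [12, 60, 25, 84, 31, 47],
})

# Keep the missingness itself as a feature before imputing
df["income_missing"] = df["income"].isna().astype(int)

# Model-based imputation that uses the other numeric features to predict the missing income
imputer = IterativeImputer(random_state=0)
df[["income", "age", "tenure_months"]] = imputer.fit_transform(
    df[["income", "age", "tenure_months"]]
)
print(df)
```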
Date format inconsistency: I'd use a robust parser (dateutil.parser in Python) that handles multiple formats, then validate parsed dates against business logic — no transaction dates before the bank's founding, no dates in the future. I'd flag ambiguous dates (e.g., 04/05/23 could be April 5th or May 4th) for source-team clarification rather than silently assuming one format.
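A rough sketch of that parsing-plus-validation step using dateutil; the dayfirst assumption and the founding-date bound are placeholders that would be confirmed with the source team rather than assumed:

```python
from datetime import date
from typing import Optional

from dateutil import parser

BANK_FOUNDED = date(1990, 1, 1)   # illustrative lower bound for plausible transaction dates

def parse_transaction_date(raw: str) -> Optional[date]:
    """Parse a date string in any common format; return None if parsing or validation fails."""
    try:
        # dayfirst=True is an assumption to confirm with the source team, not a given
        parsed = parser.parse(raw, dayfirst=True).date()
    except (ValueError, OverflowError):
        return None
    if parsed < BANK_FOUNDED or parsed > date.today():
        return None   # fails business-logic checks -> route to source-team review
    return parsed

print(parse_transaction_date("2023-04-05"))      # ISO format parses cleanly
print(parse_transaction_date("not a date"))      # -> None
```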
Extreme outliers: Rather than removing outliers immediately, I'd separate them into two categories: plausible extremes (a high-net-worth customer with a legitimate £2M transaction) and data entry errors (a transaction amount of 999999999 that exceeds the bank's own product limits). I'd validate against domain knowledge — the bank's actual product caps — and use Winsorisation at the 1st/99th percentile for features where genuine extremes shouldn't disproportionately influence the model.
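The Winsorisation step itself is only a couple of lines on an illustrative amount column:

```python
import pandas as pd

# Illustrative transaction amounts with one obvious entry error and one genuine extreme
amounts = pd.Series([120.0, 85.5, 240.0, 999_999_999.0, 60.0, 2_000_000.0])

# Cap at the 1st/99th percentiles so extremes can't dominate model training
lower, upper = amounts.quantile([0.01, 0.99])
print(amounts.clip(lower=lower, upper=upper))
```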
Duplicated customer IDs: This is a data pipeline problem, not a statistical one. I'd investigate whether these represent legitimate high-frequency traders, a join gone wrong upstream, or a schema issue where the same customer has multiple IDs across systems. I'd flag this to the data engineering team immediately — modelling on duplicate rows inflates one customer's influence and can leak labels through train/test contamination.
Finally, I'd document every decision made during cleaning in a reproducible notebook, so the client can audit the choices and the model's behaviour can be explained clearly if it's ever challenged.
Key Concepts Tested
- Systematic EDA vs. reactive cleaning
- Missing not at random (MNAR) vs. missing completely at random (MCAR)
- Multiple imputation and indicator variables
- Outlier treatment strategies (Winsorisation vs. removal)
- Data leakage awareness and reproducibility
Follow-Up Questions
- After cleaning, you discover that the income column was collected differently across two data sources that were merged — one uses gross income, the other uses net income. How do you handle this before modelling?
- How would your approach to outlier treatment change if this were a fraud detection model rather than a credit risk model?
Question 2: Feature Engineering for a Customer Churn Prediction Model
Interview Question
IBM is building a churn prediction model for a telecommunications client. The raw dataset contains: customer demographic information, a monthly billing history going back 3 years, a log of customer service calls with timestamps and resolution codes, and a product subscription table. The target variable is whether a customer churned within the next 30 days. Walk through how you would approach feature engineering before model training.
Why Interviewers Ask This Question
Feature engineering is where domain understanding and statistical thinking combine — and it's often what separates a mediocre model from a high-performing one in production. IBM interviewers use this question to assess whether a candidate can extract meaningful signal from raw transactional data, think temporally (avoiding data leakage), and bring business intuition to the feature design process.
Example Strong Answer
I'd organise feature engineering around the key behavioural signals that precede churn — declining engagement, billing friction, and unresolved complaints.
From billing history — trend and volatility features:
Raw monthly spend is less informative than change in spend. I'd create:
- 3-month and 6-month spend trend (slope of a linear fit over recent months) — a declining trend is a strong churn signal
- Spend volatility (standard deviation over 6 months) — erratic billing often indicates dissatisfaction or plan confusion
- Last month vs. 12-month average ratio — captures sudden drop-offs
- Days since last payment and number of late payments in last 6 months — financial friction is predictive
From customer service logs — recency, frequency, and resolution quality:
- Number of calls in last 30/60/90 days — rising call frequency is a leading indicator
- Time to resolution (average and trend) — customers with poor resolution experiences churn at higher rates
- Proportion of unresolved calls — any call logged as unresolved in the last 60 days is a high-signal feature
- Contact channel diversity — customers escalating from chat to phone to in-store are showing distress
From the subscription table — product fit signals:
- Number of active products — customers with more products have higher switching costs and lower churn rates
- Time since last product change — recent downgrades are churn predictors
- Contract type (month-to-month vs. annual) — a simple but powerful feature
Temporal hygiene — avoiding leakage:
All features must be computed using data available before the prediction window. For a 30-day forward-looking target, every feature must be a snapshot at t-30 or earlier. I'd enforce this through a strict point-in-time feature store design rather than relying on discipline alone.
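To make the point-in-time constraint concrete, here is a small pandas sketch, assuming a long-format billing table with customer_id, month, and spend columns (names and values illustrative):

```python
import numpy as np
import pandas as pd

# Toy long-format billing table: one row per customer-month
billing = pd.DataFrame({
    "customer_id": [1] * 8 + [2] * 8,
    "month": list(pd.date_range("2024-01-01", periods=8, freq="MS")) * 2,
    "spend": [50, 52, 51, 48, 40, 35, 30, 25,      # declining spend -> churn risk
              60, 61, 59, 62, 60, 63, 61, 62],     # stable spend
})

cutoff = pd.Timestamp("2024-08-01")   # prediction date: only earlier months may be used

hist = billing[billing["month"] < cutoff]

def features(g: pd.DataFrame) -> pd.Series:
    g = g.sort_values("month")
    last6 = g["spend"].tail(6)
    return pd.Series({
        "spend_trend_6m": np.polyfit(np.arange(len(last6)), last6, 1)[0],  # slope of linear fit
        "spend_volatility_6m": last6.std(ddof=0),
        "last_vs_12m_avg": last6.iloc[-1] / g["spend"].tail(12).mean(),
    })

print(hist.groupby("customer_id").apply(features))
```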
Interaction features:
I'd also test a few interaction terms — high call frequency combined with unresolved issues is likely more predictive than either alone. I'd let a tree-based model (XGBoost or LightGBM) surface these automatically rather than manually engineering too many cross-terms.
Key Concepts Tested
- Time-series feature engineering from transactional logs
- Point-in-time correctness and data leakage prevention
- Recency, Frequency, and Monetary (RFM) thinking
- Feature interaction design
- Business domain alignment in feature selection
Follow-Up Questions
- After training your model, you find that number of customer service calls in the last 7 days has the highest feature importance by a large margin. What do you do with this finding before deploying the model?
- The telecommunications client wants to use this model to proactively offer retention discounts. How does this use case change which features you'd prioritise?
Question 3: Choosing and Evaluating a Model for an Imbalanced Fraud Dataset
Interview Question
IBM is building a fraud detection model for a payment processing client. The training dataset has 10 million transactions, of which only 0.08% are confirmed fraud. You train a baseline logistic regression model that achieves 99.92% accuracy. The client is pleased — but your team lead is not. Explain what the problem is, how you would evaluate this model properly, and what approach you'd take to build a better one.
Why Interviewers Ask This Question
The "accuracy paradox" on imbalanced datasets is a classic trap, and IBM interviewers use it deliberately to filter candidates who understand model evaluation beyond surface metrics. This question tests statistical rigour, practical knowledge of resampling and algorithmic techniques for class imbalance, and the ability to reframe evaluation in terms of business cost — the asymmetric cost of missing fraud vs. falsely flagging a legitimate transaction.
Example Strong Answer
The problem is immediately clear: a model that classifies every single transaction as legitimate would also achieve 99.92% accuracy on this dataset. Our baseline has almost certainly learned to do exactly that — it has never seen enough fraud examples to learn anything meaningful about them.
Proper evaluation metrics for this problem:
Accuracy is meaningless here. The right metrics are:
- Precision — of all transactions we flag as fraud, what proportion actually are? Low precision means legitimate customers get blocked constantly, destroying trust.
- Recall (Sensitivity) — of all actual fraud transactions, what proportion do we catch? Low recall means fraud slips through undetected.
- F1 Score — harmonic mean of precision and recall, useful for a single summary metric
- Area Under the Precision-Recall Curve (AUPRC) — far more informative than ROC-AUC on heavily imbalanced data, since ROC-AUC can look artificially high when negatives dominate
- Cost-weighted metric — I'd work with the client to assign a cost to each error type. Missing fraud (false negative) might cost £200 on average; wrongly blocking a transaction (false positive) might cost £5 in customer service and goodwill. The optimal decision threshold is where expected cost is minimised, not where accuracy is maximised.
Approaches to address class imbalance:
Data-level:
- SMOTE (Synthetic Minority Oversampling Technique) — generates synthetic fraud examples by interpolating between existing fraud samples in feature space. Better than simple duplication.
- Undersampling the majority class — combined with SMOTE (SMOTETomek), this can create a more balanced training set without excessive synthetic data
Algorithm-level:
- Class weight adjustment — most sklearn models accept class_weight='balanced', which internally penalises misclassifying the minority class more heavily. This is often the cleanest first step.
- Ensemble methods designed for imbalance — BalancedRandomForest and EasyEnsemble are purpose-built for this
Threshold tuning:
The default 0.5 decision threshold is rarely optimal for imbalanced problems. I'd plot the precision-recall curve and select the threshold that optimises the business cost function described above.
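A small sketch of that threshold search, using the illustrative £200/£5 costs from above (a brute-force loop for clarity — a production version would vectorise this over far more data):

```python
import numpy as np

# Illustrative asymmetric costs: missing fraud (FN) vs. wrongly blocking a payment (FP)
FN_COST, FP_COST = 200.0, 5.0

def best_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Return the score threshold that minimises expected cost, not the default 0.5."""
    best_t, best_cost = 0.5, float("inf")
    for t in np.unique(y_score):
        pred = y_score >= t
        fn = int(((y_true == 1) & ~pred).sum())   # fraud we miss at this threshold
        fp = int(((y_true == 0) & pred).sum())    # legitimate payments we block
        cost = fn * FN_COST + fp * FP_COST
        if cost < best_cost:
            best_t, best_cost = float(t), cost
    return best_t

# Tiny demo with made-up labels and scores
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.1, 0.2, 0.15, 0.05, 0.9, 0.3, 0.55, 0.25])
print(best_threshold(y_true, y_score))
```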
My approach would be: start with class_weight='balanced' on a gradient boosting model (LightGBM, which handles imbalance well natively), evaluate on AUPRC and cost-weighted metrics, then apply SMOTE if further improvement is needed. I'd never apply resampling to the test set — only the training fold during cross-validation.
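Sketched with imbalanced-learn and LightGBM on synthetic data, so that SMOTE is applied only inside each training fold while scoring happens on the untouched evaluation folds:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline          # imblearn's Pipeline resamples only during fit
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the transaction feature matrix, with a heavily imbalanced target
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.999, 0.001], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),                                   # training folds only
    ("model", LGBMClassifier(class_weight="balanced", random_state=0)),
])

# average_precision approximates area under the precision-recall curve (AUPRC)
scores = cross_val_score(pipe, X, y, cv=5, scoring="average_precision")
print(scores.mean())
```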
Key Concepts Tested
- Accuracy paradox and class imbalance mechanics
- Precision, recall, F1, and AUPRC
- SMOTE and resampling strategies
- Class weight adjustment
- Decision threshold optimisation using business cost functions
Follow-Up Questions
- After deployment, the fraud team reports that the model's recall has dropped from 78% to 61% over six months. What has likely happened, and how do you diagnose it?
- The client wants to use the model's output score to automatically block transactions above a certain confidence threshold, with no human review. How does this change your approach to threshold selection?
Question 4: Designing an A/B Test for a New Recommendation Algorithm
Interview Question
IBM's e-commerce analytics team has developed a new product recommendation algorithm for a retail client. The current algorithm drives 4.2% click-through rate (CTR) on product recommendations. The team believes the new algorithm could improve CTR by at least 0.5 percentage points. You are asked to design the A/B test to validate this. Walk through your complete experimental design.
Why Interviewers Ask This Question
A/B testing sits at the intersection of statistics, product intuition, and business judgement — and IBM data scientists are frequently asked to own experimentation design end to end. This question tests whether a candidate can correctly frame hypotheses, calculate sample sizes with appropriate statistical power, identify threats to validity, and interpret results with nuance rather than just reporting a p-value.
Example Strong Answer
Step 1 — Define the hypothesis and primary metric:
- Null hypothesis (H₀): The new algorithm produces the same CTR as the current algorithm
- Alternative hypothesis (H₁): The new algorithm produces a higher CTR (one-tailed, since we're only interested in improvement)
- Primary metric: Click-through rate on product recommendations (clicks / recommendation impressions)
- Guardrail metrics: Add-to-cart rate, revenue per session, and page load time — to ensure CTR improvements don't come at the cost of downstream conversion or performance
Step 2 — Sample size calculation:
Using a two-proportion z-test:
- Baseline CTR: 4.2%
- Minimum detectable effect (MDE): 0.5 percentage points (4.7% target)
- Significance level (α): 0.05
- Statistical power (1 − β): 0.80 (standard; I'd push for 0.90 given the business stakes)
With these inputs, a standard two-proportion power calculation yields roughly 21,000 users per variant (about 42,000 total). At the client's daily traffic of ~50,000 unique visitors to recommendation pages, a 50/50 split would reach that number within a day or two — but I'd still run the test for a minimum of 2 full business weeks to account for day-of-week effects (e.g., weekend browsing patterns are systematically different).
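That figure can be sanity-checked with statsmodels, which uses the arcsine (Cohen's h) approximation — exact numbers vary slightly between calculators:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for lifting CTR from 4.2% to 4.7%
effect = proportion_effectsize(0.047, 0.042)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="larger",   # one-tailed, matching the hypothesis above
)
print(round(n_per_variant))  # ~21,000 users per variant with these inputs
```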
Step 3 — Randomisation unit and assignment:
Randomise at the user level, not session or page level. Assigning the same user to different variants across sessions creates carry-over effects — a user who sees the better algorithm first may behave differently when switched. User-level randomisation ensures clean comparison. I'd use a deterministic hashing function (e.g., hash(user_id + experiment_id) % 2) to ensure stable assignment.
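A minimal sketch of stable assignment — note it uses a cryptographic hash rather than Python's built-in hash(), which is salted per process and therefore not reproducible; the identifiers are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str) -> str:
    """Deterministic 50/50 assignment that is stable across sessions and machines."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

print(assign_variant("user_8341", "rec_algo_v2"))   # same input always yields the same arm
```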
Step 4 — Validity threats to control for:
- Novelty effect: Users may click new recommendations simply because they're different, not better. I'd monitor CTR week-over-week and consider extending the test if there's a significant week-1 spike that doesn't persist.
- Network effects / contamination: If users in the control group see recommendations influenced by their social graph (shared wishlists, household accounts), contamination is possible. I'd investigate whether user-level assignment is sufficient or whether household-level assignment is needed.
- Sample Ratio Mismatch (SRM): After assignment, I'd verify the split is actually 50/50 by running a chi-squared test on observed assignment counts. Any significant imbalance signals a logging or assignment bug that invalidates results.
Step 5 — Analysis and decision:
I'd use a two-proportion z-test on the primary metric and apply Bonferroni correction if testing multiple secondary metrics to control family-wise error rate. I'd report the absolute lift (not just relative lift — "0.5pp improvement" is more honest than "12% improvement"), the 95% confidence interval, and the expected annualised revenue impact to give the business a decision framework beyond the p-value alone.
Key Concepts Tested
- Hypothesis formulation and one- vs. two-tailed tests
- Power analysis and sample size calculation
- Randomisation unit selection and carry-over effects
- Sample Ratio Mismatch detection
- Multiple testing correction and business-impact framing
Follow-Up Questions
- After two weeks, the new algorithm shows a CTR of 4.65% vs. 4.21% with p = 0.04. The product team wants to ship immediately. What questions would you ask before agreeing to launch?
- The client has a much smaller user base — only 3,000 daily visitors. The sample size calculation suggests the test would take 18 months. How would you adapt your approach?
Question 5: Communicating a Model's Limitations to a Non-Technical Stakeholder
Interview Question
You've built a machine learning model for IBM's healthcare client that predicts which patients are at high risk of hospital readmission within 30 days of discharge. The model achieves 81% AUC-ROC on the test set. The hospital's Chief Medical Officer (CMO) wants to use this model to automatically allocate post-discharge support resources — care calls, home visits, and follow-up appointments. In your review meeting, the CMO says: "Great, 81% accuracy — let's roll this out to all 12 hospitals next week." How do you respond?
Why Interviewers Ask This Question
This question tests one of the most underrated skills in enterprise data science: the ability to communicate model limitations clearly, confidently, and constructively to a senior non-technical stakeholder — without being dismissive of their enthusiasm or defaulting to jargon. IBM data scientists regularly face this scenario with C-suite stakeholders at client organisations, and getting this wrong can mean deploying a model that causes real harm.
Example Strong Answer
I'd acknowledge the result positively, then reframe the conversation around what matters for the deployment decision — without making the CMO feel their excitement is misplaced.
What I'd say (in the meeting):
"An 81% AUC is a strong result and means the model is genuinely useful. Before we discuss rollout, I want to make sure we're aligned on a few things that will affect how we use it safely — and actually, answering these will help us get more value from it, not less."
Then I'd walk through three things concisely:
1. What the model can and can't do:
AUC-ROC measures the model's ability to rank patients by risk — it doesn't tell us how many of the patients it flags as high-risk actually will be readmitted. I'd show a confusion matrix at the operating threshold we'd use in practice. If we're flagging the top 20% of patients as high-risk, what proportion of those truly are high-risk? That's the number that determines whether the care team is making evidence-based decisions or chasing false alarms.
2. Fairness and subgroup performance:
I'd ask whether the model has been evaluated separately across patient demographics — age, gender, primary language, insurance status. A model that performs well overall can perform poorly for specific subgroups, which in a healthcare context isn't just a technical issue, it's an equity issue. I'd share a performance breakdown by key subgroups from our analysis, and flag any groups where performance is meaningfully lower.
3. The rollout plan:
Rather than deploying to all 12 hospitals simultaneously, I'd recommend a phased rollout — starting with two hospitals as a pilot, with a monitoring framework in place. We'd track readmission rates for flagged vs. unflagged patients in production, watch for distribution shift (patient populations differ across hospitals), and establish a feedback loop with clinical staff to capture cases where the model's output didn't match clinical judgement.
I'd frame the phased approach not as slowing things down, but as protecting the hospital's reputation and ensuring the model performs as well in production as it did in testing — which is genuinely in the CMO's interest.
Closing the conversation: "I'm confident this model adds real value. A two-hospital pilot over 6–8 weeks gives us the evidence we need to roll out to all 12 with confidence — and it means we can report back to your board with real-world outcome data, not just a test set metric."
Key Concepts Tested
- Translating AUC-ROC into operationally meaningful metrics
- Model fairness and subgroup performance evaluation
- Phased deployment and production monitoring strategy
- Communicating risk without undermining stakeholder confidence
- Clinical and ethical considerations in healthcare AI
Follow-Up Questions
- During the pilot, you notice the model's AUC drops from 81% to 68% at one of the two pilot hospitals. What are the most likely explanations, and how do you investigate?
- A clinical staff member at the pilot hospital says they've stopped looking at the model's output because it "keeps flagging patients who are fine." How do you diagnose whether this is a model problem or a communication/training problem?
Question 6: Writing and Optimising SQL for a Business Intelligence Report
Interview Question
IBM's analytics team supports a retail banking client who wants a monthly report showing, for each customer segment: the average number of transactions in the last 90 days, the percentage of customers who have used more than one product, and the top 3 transaction categories by total spend. The data lives across three tables: customers (customer_id, segment, join_date), transactions (transaction_id, customer_id, amount, category, transaction_date), and products (customer_id, product_type, start_date). Write the SQL logic for this report and explain any performance considerations.
Why Interviewers Ask This Question
SQL is a day-to-day tool for IBM Data Scientists — for data extraction, validation, feature computation, and ad hoc analysis. This question tests whether a candidate can handle multi-table aggregation with window functions and subqueries, think about query correctness (e.g., avoiding double-counting from joins), and reason about performance at scale. It also checks whether the candidate asks clarifying questions before writing code, which reflects strong analytical discipline.
Example Strong Answer
Before writing anything, I'd clarify two things: what "last 90 days" means (rolling 90 days from today, or a fixed reporting period?) and whether a customer can appear in multiple segments. I'll assume rolling 90 days and one segment per customer.
Part 1 — Average transactions per customer in the last 90 days, by segment:
SELECT
c.segment,
AVG(COALESCE(txn_counts.txn_count, 0)) AS avg_transactions_90d
FROM customers c
LEFT JOIN (
SELECT
customer_id,
COUNT(*) AS txn_count
FROM transactions
WHERE transaction_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY customer_id
) txn_counts ON c.customer_id = txn_counts.customer_id
GROUP BY c.segment;
Using a LEFT JOIN and coalescing the missing counts to 0 ensures customers with zero transactions in the period are included in the average, not silently excluded — a common correctness error.
Part 2 — Percentage of customers using more than one product, by segment:
SELECT
c.segment,
ROUND(
100.0 * SUM(CASE WHEN p.product_count > 1 THEN 1 ELSE 0 END) / COUNT(*), 2
) AS pct_multi_product
FROM customers c
LEFT JOIN (
SELECT customer_id, COUNT(DISTINCT product_type) AS product_count
FROM products
GROUP BY customer_id
) p ON c.customer_id = p.customer_id
GROUP BY c.segment;
Part 3 — Top 3 transaction categories by total spend, per segment:
WITH segment_category_spend AS (
SELECT
c.segment,
t.category,
SUM(t.amount) AS total_spend,
RANK() OVER (PARTITION BY c.segment ORDER BY SUM(t.amount) DESC) AS spend_rank
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
WHERE t.transaction_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY c.segment, t.category
)
SELECT segment, category, total_spend
FROM segment_category_spend
WHERE spend_rank <= 3
ORDER BY segment, spend_rank;
I use RANK() rather than ROW_NUMBER() here so tied categories at position 3 are both included, which is usually the correct business behaviour.
Performance considerations:
- Index transactions(transaction_date, customer_id) for the date-filtered scans — a composite index avoids full table scans on a large transactions table
- Materialise the 90-day transaction filter as a CTE or temp table if it's being joined multiple times in the same report run
- For very large tables (billions of rows), consider partitioning transactions by transaction_date so the 90-day filter prunes partitions at the storage layer
Key Concepts Tested
- Multi-table JOIN logic and avoiding double-counting
- Window functions (RANK, PARTITION BY)
- LEFT JOIN for preserving zero-count records
- CTEs for readable, modular query design
- Indexing strategy for date-range queries at scale
Follow-Up Questions
- The report is running for 5 minutes on a table with 2 billion rows. Walk through how you'd diagnose and optimise it, starting from the query execution plan.
- The client now wants this report to update in real time as transactions come in, not just monthly. What architectural changes would be needed beyond the SQL itself?
Question 7: Detecting and Responding to Model Drift in Production
Interview Question
IBM deployed a credit scoring model for a lending client 14 months ago. The model performed well at launch — AUC of 0.84, and default rates among approved applicants were in line with predictions. Recently, the business team has noticed that approved applicants are defaulting at nearly twice the predicted rate, but the model's AUC on a held-out validation set still looks healthy at 0.82. How do you diagnose what has gone wrong, and what do you do about it?
Why Interviewers Ask This Question
Model degradation in production is one of the most practically important — and commonly underestimated — challenges in enterprise data science. IBM's deployed models operate in dynamic environments where macroeconomic conditions, customer behaviour, and data pipelines all change. This question tests whether a candidate understands the distinction between data drift and concept drift, can design a monitoring framework, and knows when to retrain vs. rebuild a model entirely.
Example Strong Answer
The key diagnostic insight here is the disconnect: calibration is broken (predictions are wrong), but discrimination is intact (AUC is still strong). This tells us the model can still rank customers by relative risk, but its absolute probability estimates are no longer reliable. These two failure modes have different causes.
Step 1 — Distinguish data drift from concept drift:
Data drift (covariate shift): The distribution of input features has changed. I'd compare the current input feature distributions against the training-time distributions using statistical tests:
- Numerical features: Kolmogorov-Smirnov test or Population Stability Index (PSI)
- Categorical features: Chi-squared test or PSI
A PSI > 0.2 on any feature is a strong signal of distribution shift. Given this model was deployed 14 months ago, macroeconomic changes (rising interest rates, unemployment shifts post-COVID recovery) would likely have altered income, debt-to-income, and employment stability distributions substantially.
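A compact PSI implementation for a numerical feature might look like this (the bin count and the synthetic income shift are illustrative):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a training-time sample and a current sample."""
    # Bin edges from the training-time (expected) distribution, widened to cover both samples
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0] = min(expected.min(), actual.min()) - 1e-9
    edges[-1] = max(expected.max(), actual.max()) + 1e-9
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor avoids division by zero / log(0) in empty bins
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
train_income = rng.normal(50_000, 12_000, 100_000)
current_income = rng.normal(44_000, 15_000, 100_000)   # shifted distribution
print(psi(train_income, current_income))               # > 0.2 would flag a material shift
```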
Concept drift: Even if inputs look the same, the relationship between features and default has changed. The credit risk landscape in the current macroeconomic environment may be fundamentally different from the training period. I'd test this by examining calibration curves — plotting predicted probability vs. actual default rate across probability bins. If the actual default rate for customers scored at "10% risk" is consistently 20%, the model's learned probabilities are systematically wrong.
Step 2 — Audit the data pipeline:
Before concluding it's a model problem, I'd check whether data quality has degraded — incorrect income values, changed field definitions from a CRM update, or a feature that was computed differently after a pipeline change. Data pipeline bugs are a frequent cause of apparent model drift.
Step 3 — Immediate remediation:
If the discrimination is intact (AUC 0.82), a recalibration may be sufficient in the short term — fitting a Platt scaling or isotonic regression layer on recent labelled data to remap the model's scores to accurate probabilities. This is faster than retraining and can be deployed quickly.
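A minimal sketch of that recalibration step with scikit-learn's isotonic regression, on synthetic scores that are miscalibrated by roughly a factor of two:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# scores: the deployed model's predicted default probabilities on recently labelled loans
# outcomes: the observed defaults (1) / non-defaults (0) for those same loans
rng = np.random.default_rng(0)
scores = rng.uniform(0, 0.4, 5_000)
outcomes = (rng.uniform(size=5_000) < scores * 2).astype(int)   # actual risk ~2x the prediction

# Fit a monotonic mapping from raw score -> empirically accurate probability
recalibrator = IsotonicRegression(out_of_bounds="clip")
recalibrator.fit(scores, outcomes)

print(recalibrator.predict([0.10]))   # a "10% risk" score remaps to roughly the observed ~20%
```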
Step 4 — Retraining strategy:
For a longer-term fix, I'd retrain the model on a data window that reflects current conditions — likely the most recent 12–18 months rather than the original 3-year window. I'd use time-based cross-validation (walk-forward validation) rather than random splits to ensure temporal integrity.
Step 5 — Monitoring framework going forward:
I'd set up automated monitoring for: PSI on key input features (weekly), calibration checks on recent predictions vs. actuals (monthly, with a lag for defaults to materialise), and AUC on a rolling labelled holdout. Alert thresholds trigger a model review before degradation reaches business impact.
Key Concepts Tested
- Data drift vs. concept drift distinction
- Population Stability Index (PSI) and statistical drift tests
- Calibration vs. discrimination as separate failure modes
- Platt scaling and model recalibration
- Production monitoring framework design
Follow-Up Questions
- The client's risk team argues that retraining on only recent data will make the model "too pessimistic" about economic conditions that will eventually improve. How do you respond and what approach do you suggest?
- How would you design a monitoring system that alerts the team before the default rate diverges, rather than after — given that credit defaults have a 6–12 month lag?
Question 8: Applying NLP to Unstructured Customer Feedback at Scale
Interview Question
IBM is working with a global insurance client that receives approximately 80,000 customer complaint emails per month. Currently, a team of 12 analysts manually categorises each email into one of 15 complaint types and assigns a priority level (high/medium/low). The process takes 3 days on average and the client wants near-real-time categorisation. You're asked to build an NLP solution. Walk through your approach from raw text to a deployed classification system.
Why Interviewers Ask This Question
NLP on enterprise unstructured data is a growing use case across IBM's client base — in insurance, banking, and healthcare especially. This question tests end-to-end ML pipeline thinking, knowledge of modern NLP approaches (including when to use transformers vs. simpler models), handling of multi-label and multi-task classification, and practical understanding of how to evaluate and deploy an NLP system at production scale.
Example Strong Answer
Step 1 — Data preparation and labelling:
I'd start with the 80,000 monthly emails as a corpus, but more importantly, I'd request the historical labelled data from the analyst team — even 6–12 months of labelled emails (potentially 500,000–1M examples) provides a strong foundation. Before training, I'd audit the label quality: inter-annotator agreement between analysts, label distribution (are some of the 15 categories rare?), and whether the priority labels are consistently applied or subjective.
Step 2 — Text preprocessing:
Insurance complaints contain domain-specific language — policy numbers, claim IDs, regulatory references. I'd clean aggressively: remove PII (policy numbers, names, addresses) before any modelling for compliance, normalise insurance jargon, and handle multilingual emails if the client is global (IBM's clients often are).
Step 3 — Model architecture selection:
For 15-category classification with near-real-time requirements, I'd evaluate two approaches:
Option A — Fine-tuned transformer (DistilBERT or RoBERTa):
Pre-trained transformers understand context and handle nuanced complaints well. I'd fine-tune on the labelled dataset with a classification head for both complaint type and priority simultaneously (multi-task learning), sharing the encoder and adding two separate output layers. The shared representation improves both tasks.
Option B — TF-IDF + LightGBM (baseline first):
Before fine-tuning a transformer, I'd always build a strong TF-IDF baseline. It's fast to train, interpretable, and often performs surprisingly well on structured complaint text. This gives me a performance floor and a deployment fallback.
In practice, a fine-tuned DistilBERT typically outperforms TF-IDF on complaint categorisation by 8–15 F1 points — worth the inference cost for this use case.
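The baseline is only a few lines; the corpus and labels below are illustrative stand-ins for the PII-scrubbed emails and analyst categories:

```python
from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus; real inputs would be the scrubbed complaint emails and analyst labels
emails = [
    "My claim has been pending for six weeks with no update",
    "I was charged twice for my monthly premium",
    "The renewal quote is far higher than what I was promised",
    "Nobody has responded to my complaint about the claim decision",
]
labels = ["claims_delay", "billing", "pricing", "claims_delay"]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("model", LGBMClassifier(class_weight="balanced", min_child_samples=1, random_state=0)),
])
baseline.fit(emails, labels)
print(baseline.predict(["Still waiting on my claim payout after two months"]))
```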
Step 4 — Handling rare categories:
If some of the 15 complaint types have very few examples (<500), I'd use few-shot techniques — either data augmentation (paraphrasing with a generative model) or a hierarchical classification approach where rare types are first grouped into broader buckets before fine-grained classification.
Step 5 — Evaluation:
Given class imbalance across categories, I'd evaluate using macro-averaged F1 (treats all categories equally regardless of frequency) alongside weighted F1. I'd build a confusion matrix to identify which complaint types are most often confused — these often share domain language and may benefit from additional labelled data or feature engineering.
Step 6 — Deployment:
I'd deploy the model as a REST API (FastAPI + Docker) with a p95 inference latency target of <200ms per email. For the near-real-time requirement, I'd use an async queue (Kafka or RabbitMQ) to process incoming emails, with results written to a database the analysts can review. I'd build a human-in-the-loop interface for low-confidence predictions (model confidence <70%) that routes those emails to analyst review — preserving human oversight while automating the high-confidence majority.
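A minimal sketch of the confidence-threshold routing inside a FastAPI endpoint — the inline model here is a trivial stand-in for the real fine-tuned classifier:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Dummy stand-in for the real classifier, so the routing logic below is runnable
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(
    ["charged twice this month", "claim still not paid", "website keeps logging me out"],
    ["billing", "claims_delay", "digital"],
)

app = FastAPI()
CONFIDENCE_FLOOR = 0.70   # predictions below this are routed to analyst review

class Complaint(BaseModel):
    email_text: str

@app.post("/classify")
def classify(complaint: Complaint) -> dict:
    probs = model.predict_proba([complaint.email_text])[0]
    label, confidence = str(model.classes_[probs.argmax()]), float(probs.max())
    route = "auto" if confidence >= CONFIDENCE_FLOOR else "analyst_review"
    return {"route": route, "label": label, "confidence": confidence}
```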
Key Concepts Tested
- NLP pipeline design from raw text to deployment
- Transfer learning and transformer fine-tuning
- Multi-task learning for joint classification
- Handling class imbalance in multi-class NLP
- Human-in-the-loop system design and confidence thresholding
Follow-Up Questions
- Six months after deployment, the model's accuracy on "regulatory complaint" emails drops significantly. Investigation shows a new financial regulation introduced a new type of complaint the model has never seen. How do you handle this without full retraining?
- An analyst raises a concern that the model may be categorising complaints differently for customers with non-native English writing styles, potentially routing them to lower-priority queues. How do you investigate and address this?
Question 9: Forecasting Demand for an Inventory Optimisation Problem
Interview Question
IBM is supporting a pharmaceutical distributor that needs to forecast weekly demand for 4,000 SKUs (individual drug products) across 35 warehouses. Accurate forecasts feed directly into automated inventory replenishment — overestimating demand ties up capital in excess stock, while underestimating causes stockouts with patient safety implications. The historical data goes back 3 years, but many SKUs have intermittent demand patterns — weeks of zero sales followed by occasional large orders. How do you approach this forecasting problem?
Why Interviewers Ask This Question
Demand forecasting at scale across heterogeneous SKUs is a real and challenging problem that IBM tackles with supply chain clients. This question tests a candidate's knowledge of time series methods beyond "ARIMA and Prophet," their understanding of intermittent demand patterns, their ability to reason about forecast accuracy metrics in a business context, and their architectural thinking about how to scale forecasting to thousands of series efficiently.
Example Strong Answer
This problem has two distinct subproblems that require different approaches: high-volume regular SKUs and intermittent-demand SKUs. Treating them the same way is the most common mistake.
Step 1 — Segment SKUs by demand pattern:
I'd classify all 4,000 SKUs using the ADI/CV² matrix (Average Demand Interval vs. Coefficient of Variation squared) — a short classification sketch follows this list:
- Smooth demand (low ADI, low CV²): regular, forecastable — standard time series methods work well
- Intermittent demand (high ADI, low CV²): infrequent but consistent order sizes — Croston's method or its variants
- Erratic demand (low ADI, high CV²): frequent but variable — harder; ensemble or quantile regression approaches
- Lumpy demand (high ADI, high CV²): rare and irregular — often better served by safety stock rules than point forecasting
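A sketch of that segmentation for a single SKU, using the commonly cited Syntetos–Boylan cut-offs (ADI ≈ 1.32, CV² ≈ 0.49):

```python
import pandas as pd

def classify_demand(weekly_demand: pd.Series) -> str:
    """Classify one SKU's weekly demand history into the four ADI/CV-squared segments."""
    nonzero = weekly_demand[weekly_demand > 0]
    if nonzero.empty:
        return "no_demand"
    adi = len(weekly_demand) / len(nonzero)                   # average interval between demand weeks
    cv2 = (nonzero.std(ddof=0) / nonzero.mean()) ** 2         # squared CV of nonzero order sizes
    if adi < 1.32:
        return "smooth" if cv2 < 0.49 else "erratic"
    return "intermittent" if cv2 < 0.49 else "lumpy"

demand = pd.Series([0, 0, 12, 0, 0, 0, 10, 0, 0, 11, 0, 0])   # illustrative sparse SKU
print(classify_demand(demand))                                 # -> "intermittent"
```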
Step 2 — Model selection by segment:
For smooth/erratic SKUs (~60% of volume), I'd use LightGBM with lag features and calendar features — a global model trained across all SKUs simultaneously, with SKU-level embeddings. This "global model" approach dramatically outperforms fitting individual models per SKU when you have 4,000 series, because it can learn cross-SKU patterns (seasonal spikes, supply chain disruptions that affect categories).
Features would include: lagged demand (1, 2, 4, 8, 13, 26, 52 weeks), rolling means and standard deviations, day-of-year / month / week-of-year encodings, promotional event flags, and warehouse-level features (region, size, product mix).
For intermittent SKUs (~40%), I'd use Croston's method or ADIDA (Aggregate-Disaggregate Intermittent Demand Approach) — methods designed specifically for sparse time series where standard models produce biased forecasts.
Step 3 — Forecast evaluation metrics:
Standard RMSE is a poor metric here because large-volume SKUs dominate and stockout risk isn't symmetric. I'd use:
- MASE (Mean Absolute Scaled Error) — scale-free, compares model to a seasonal naïve baseline per SKU
- Weighted MAPE by SKU revenue — weights accuracy by business impact
- Service level simulation — given forecast and safety stock policy, what percentage of demand weeks would have been met without stockout? This is the metric the business actually cares about.
Step 4 — Scaling to 4,000 × 35 = 140,000 series:
Fitting individual models per series doesn't scale. The global LightGBM model handles this naturally. For the Croston models, I'd parallelise fitting using Spark or Dask across a cluster. IBM Cloud Pak for Data is well-suited for this.
Step 5 — Uncertainty quantification:
Point forecasts aren't enough for inventory decisions — the replenishment algorithm needs prediction intervals to set safety stock levels. I'd use quantile regression within the LightGBM framework (predicting the 10th, 50th, and 90th percentile of demand) rather than symmetric confidence intervals, since demand distributions are typically right-skewed.
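Concretely, this is just a separate LightGBM model per quantile — shown here on synthetic features standing in for the lag/calendar matrix:

```python
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression

# Synthetic stand-in for the lag/calendar feature matrix and weekly demand target
X, y = make_regression(n_samples=5_000, n_features=12, noise=20.0, random_state=0)
y = y - y.min()   # demand is non-negative

# One model per demand quantile; the upper quantile drives safety stock
quantile_models = {
    q: LGBMRegressor(objective="quantile", alpha=q, random_state=0).fit(X, y)
    for q in (0.10, 0.50, 0.90)
}
p10, p50, p90 = (quantile_models[q].predict(X[:1])[0] for q in (0.10, 0.50, 0.90))
print(round(p10), round(p50), round(p90))
```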
Key Concepts Tested
- Intermittent demand classification (ADI/CV² matrix)
- Global vs. local time series modelling
- Croston's method for sparse time series
- Forecast evaluation metrics for business impact (MASE, service level)
- Quantile regression for inventory safety stock
Follow-Up Questions
- A new drug product has just been added to the catalogue with zero sales history. How do you generate a forecast for it in week one?
- The automated replenishment system will act directly on your forecasts without human review. How does this change your model design and what safeguards would you put in place?
Question 10: Causal Inference — Separating Correlation from Business Impact
Interview Question
IBM's analytics team is reviewing results for a telecommunications client. The client's data shows that customers who use the mobile app have a 22% lower churn rate than customers who don't. The product team wants to invest heavily in driving app adoption, citing this as proof that the app reduces churn. A colleague challenges this interpretation. Who is right, and how would you design an analysis to get closer to the true causal effect of app adoption on churn?
Why Interviewers Ask This Question
Confusing correlation with causation is one of the most consequential errors a data scientist can make in a business context — it leads to wasted investment and misattributed success. IBM interviewers use this question to assess whether candidates have genuine statistical maturity, understand the limits of observational data, and know how to apply causal inference techniques to get closer to actionable business answers. It also tests communication skills: can the candidate explain this clearly to a non-statistician?
Example Strong Answer
My colleague is right to challenge it — and the product team is committing a classic selection bias error.
The confounding problem:
Customers who choose to download and regularly use the mobile app are almost certainly different from those who don't, in ways that independently predict churn. App users are likely younger, more digitally engaged, more satisfied with the service, and possibly on better value plans. These same characteristics independently reduce churn risk. The 22% difference in churn rates may be entirely driven by who uses the app, not what the app does for them. Investing heavily in forcing app adoption on customers who are fundamentally disengaged may produce no churn reduction at all.
The right framing:
The business question isn't "do app users churn less?" (we already know they do). The real question is: "If we cause a non-app-user to adopt the app, does their churn probability decrease, and by how much?" That's a causal question, and observational data alone can't answer it.
Approach 1 — Propensity Score Matching (feasible quickly):
I'd build a propensity score model predicting the probability of app adoption based on observable confounders — demographics, plan type, tenure, engagement history, device type. I'd then match each app user to a non-app user with a similar propensity score, creating a pseudo-experimental comparison group. The matched difference in churn rates is a much cleaner estimate of the treatment effect than the raw 22%.
This doesn't eliminate all confounding (only observed variables can be matched on), but it removes the most obvious sources of bias.
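A simplified sketch of the matching step on synthetic data — real work would check covariate balance after matching and apply a caliper, but the mechanics are the same:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in: X holds observed confounders, `treated` = app user, `churned` = outcome
rng = np.random.default_rng(0)
n = 20_000
X = pd.DataFrame({"age": rng.normal(40, 12, n), "tenure": rng.normal(36, 18, n)})
treated = (rng.uniform(size=n) < 1 / (1 + np.exp(-(45 - X["age"]) / 10))).astype(int)  # younger customers adopt more
churned = (rng.uniform(size=n) < 0.25 - 0.02 * treated + 0.002 * (X["age"] - 40)).astype(int)

# 1. Propensity model: probability of app adoption given observed confounders
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. Match each app user to the non-user with the nearest propensity score (with replacement)
t_idx, c_idx = np.where(treated == 1)[0], np.where(treated == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(propensity[c_idx].reshape(-1, 1))
matched = c_idx[nn.kneighbors(propensity[t_idx].reshape(-1, 1))[1].ravel()]

# 3. Matched difference in churn rates ~ treatment effect on the treated
print(churned[t_idx].mean() - churned[matched].mean())
```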
Approach 2 — Instrumental Variable (if available):
If there was a period where the company ran a targeted app promotion campaign that reached some customers semi-randomly (e.g., based on a postcode lottery or a marketing database quirk), that promotion acts as an instrument — it affects app adoption but has no direct effect on churn other than through adoption. An IV regression using this instrument gives an unbiased estimate of the causal effect.
Approach 3 — Randomised Controlled Trial (gold standard):
The cleanest answer is a prospective experiment: randomly assign a cohort of non-app-users to receive an intensive onboarding campaign designed to drive adoption, and compare their 6-month churn rate against an untreated control group. This is the only approach that fully eliminates confounding.
How I'd present this to the product team: "The 22% difference is real, but it almost certainly overstates the impact of the app itself. Our analysis suggests the true causal effect is closer to [X]% — still meaningful, but the ROI calculation for the adoption campaign looks different. Here's what a 3-month pilot experiment would cost and what it would tell us."
Key Concepts Tested
- Selection bias and confounding in observational data
- Propensity score matching methodology
- Instrumental variable intuition
- Randomised controlled trial design as a causal gold standard
- Communicating statistical nuance to business stakeholders
Follow-Up Questions
- The propensity score matching analysis shows the causal effect of app adoption on churn is only 6%, not 22%. The product team is disappointed and wants to deprioritise app investment. What additional analysis would you do before accepting this conclusion?
- The RCT isn't feasible because the business won't allow a control group to go without the app promotion — they say it's "leaving money on the table." What alternative quasi-experimental designs could you use instead?