American Express — Data Scientist Interview Question Bank

Card & Risk Analytics Team


Question 1: Credit Risk Modeling & Default Prediction

📌 Question Title

Building a Credit Default Prediction Model for a New Cardmember Segment


💬 Detailed Business Scenario

AmEx is expanding its card acquisition strategy into a segment of thin-file applicants — consumers with limited credit history (fewer than 3 tradelines, credit history under 24 months) but strong income indicators and high digital engagement. Traditional FICO-based underwriting performs poorly on this segment because the score is either absent or unreliable.

You've been asked to build a default prediction model for this segment using alternative data sources: rent payment history, utility payments, device and behavioral signals from the application journey, and income verification data.

(a) How do you frame this as a machine learning problem — what is the target variable, the training population, and the key modeling challenges?

(b) Walk through your end-to-end modeling pipeline — from data preparation to model deployment considerations.

(c) Your model achieves AUC-ROC of 0.74 on the holdout set. A stakeholder says: "That's not good enough — traditional models hit 0.82 on prime customers." How do you respond?


📋 Structured Model Answer

Part (a) — Problem Framing:

Target variable definition is the first and most consequential decision:

  • Binary target: 1 = default (typically defined as 90+ days past due within 12–24 months of account opening), 0 = no default
  • The choice of observation window (12 vs. 24 months) significantly impacts class balance and model behavior — a 24-month window captures more defaults but introduces more censoring risk

Training population considerations:

  • Reject inference problem: Historical data only contains approved applicants — rejected thin-file applicants are unobserved. A model trained only on approvals is biased toward the accepted population.
  • Solution: Apply reject inference techniques — augmented data methods, parceling, or fuzzy augmentation to estimate likely outcomes for declined applicants
  • Class imbalance: Default rates on thin-file segments may be 8–15%, creating significant class imbalance → requires SMOTE, class weighting, or threshold calibration (see the sketch below)
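
A minimal sketch of the class-weighting option, assuming an XGBoost classifier on synthetic stand-in data (all names and figures are illustrative, not a prescribed setup):

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# synthetic stand-in for the thin-file training set (~10% default rate)
X_train, y_train = make_classification(n_samples=10_000, weights=[0.9], random_state=0)

neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
clf = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    scale_pos_weight=neg / pos,   # upweight the minority (default) class
    eval_metric="aucpr",          # PR-AUC is more informative under imbalance
)
clf.fit(X_train, y_train)

Threshold calibration (choosing the score cutoff after training) is often preferable to resampling, since it leaves probability estimates closer to the true base rate.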

Key modeling challenges:

  • Feature sparsity (missing tradeline history)
  • Non-stationarity: thin-file behavior at acquisition may drift post-approval
  • Regulatory compliance: alternative data (device signals, rent) must clear Fair Lending / ECOA scrutiny — disparate impact analysis required on all features

Part (b) — End-to-End Pipeline:

1. DATA PREPARATION
   ├── Merge application data + bureau data + alternative signals
   ├── Define observation point (application date) and outcome window
   ├── Handle missing values: MCAR/MAR/MNAR analysis per feature
   │   └── For thin-file: absence of bureau data IS informative → create
   │       binary missingness indicator features
   └── Train/Validation/Test split — use time-based split, NOT random
       (random split leaks future information in time-series financial data)

2. FEATURE ENGINEERING
   ├── Payment velocity features: rent payment streak, utility on-time %
   ├── Application behavior: time-on-page, form completion rate, device type
   ├── Income stability proxies: income-to-obligation ratio, employment tenure
   └── Interaction features: income × rent_payment_consistency

3. MODELING
   ├── Baseline: Logistic Regression with WOE-encoded features
   │   (interpretable, audit-ready, benchmark for business)
   ├── Primary: Gradient Boosting (XGBoost / LightGBM)
   │   └── Hyperparameter tuning via Bayesian optimization
   ├── Calibration: Platt Scaling or Isotonic Regression
   │   (raw probabilities must be calibrated for scorecard conversion)
   └── Ensemble: Stacked model if individual models show low correlation

4. EVALUATION
   ├── AUC-ROC: Overall discriminatory power
   ├── KS Statistic: Separation between default/non-default score distributions
   ├── Gini Coefficient: Gini = 2×AUC − 1
   ├── PSI (Population Stability Index): Monitor score distribution drift
   └── Fairness metrics: Demographic parity, equalized odds across
       protected class proxies

5. DEPLOYMENT CONSIDERATIONS
   ├── Score-to-cutoff decision: business risk appetite, not just model metric
   ├── Champion/Challenger framework: deploy new model to 10% traffic first
   └── Model monitoring: monthly PSI tracking, performance decay alerts
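
A quick sketch of how the step-4 discrimination metrics relate, on illustrative scores: KS is the maximum gap between the two score CDFs, and Gini follows directly from AUC.

import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

# illustrative labels and model scores
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 5_000)
y_score = np.clip(rng.normal(0.3 + 0.2 * y_true, 0.15), 0, 1)

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1                    # Gini = 2*AUC - 1, as in step 4
ks = ks_2samp(y_score[y_true == 1], y_score[y_true == 0]).statistic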

Part (c) — Responding to the AUC Comparison:

This is a statistical reasoning and business communication challenge:

  • The comparison is invalid: AUC of 0.82 on prime customers with deep, decades-long credit files is not the right benchmark for a thin-file model. These populations have fundamentally different signal richness.
  • Reframe the question: The right question is — "Does this model perform better than the current alternative for this segment?" The current alternative is likely a hard decline (AUC = 0.50) or a generic bureau score that performs at 0.61 on thin-file data.
  • Business impact framing: A model with AUC 0.74 on a previously un-scored segment that enables 200,000 new approvals annually — with acceptable loss rates — generates revenue that a 0.82 model on a different segment doesn't.
  • Present a lift curve: Show how the model identifies the top 30% of applicants by score, which contain X% of eventual defaults — even at 0.74 AUC, the concentration of risk in the bottom score deciles may be commercially viable.

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 18–20 minutes

✅ What a Strong Candidate Must Mention

  • Reject inference as a fundamental problem in credit model training — models built only on approved applicants are biased
  • Time-based train/test splits — random splits introduce data leakage in sequential financial data
  • Missingness as signal: in thin-file data, absence of bureau features is itself predictive — encode it explicitly
  • Calibration vs. discrimination: AUC measures rank-ordering ability; calibration measures whether predicted probabilities are accurate — both matter for scorecard use
  • Fair lending compliance: disparate impact analysis on protected class proxies is non-negotiable for alternative data models

🔁 Smart Follow-Up Questions

  1. "Your model uses device behavioral signals from the application journey. Legal flags that this may create a disparate impact on older applicants who use older devices. How do you test for this — and if confirmed, what are your options?"
  1. "Twelve months after deployment, PSI for your model's score distribution has risen to 0.28. What does that mean, and what do you do?"
  1. "How would you redesign your target variable definition if AmEx wanted to optimize for minimizing losses on the worst defaulters, rather than maximizing overall rank-ordering performance?"


Question 2: A/B Testing & Experimentation

📌 Question Title

Designing and Analyzing a Credit Limit Increase Experiment


💬 Detailed Business Scenario

AmEx's product team hypothesizes that proactively offering credit limit increases (CLIs) to cardmembers who are consistently utilizing 60–75% of their current limit will increase spend, improve retention, and not materially increase default risk. They want to run an experiment before rolling out the policy at scale.

You're asked to design the experiment, determine sample sizes, run the analysis, and interpret the results for a business decision.

(a) How do you design this experiment — treatment, control, randomization strategy, and guardrail metrics?

(b) After 90 days, treatment group shows: spend up 11% (p = 0.03), default rate up 0.4 percentage points (p = 0.11), and retention up 2.1% (p = 0.07). How do you interpret these results — and what is your recommendation?

(c) A stakeholder says: "The default rate increase isn't statistically significant, so we should ignore it." How do you respond?


📋 Structured Model Answer

Part (a) — Experiment Design:

Treatment definition:

  • Treatment: Proactive CLI offers to cardmembers in the 60–75% utilization band
  • Control: No CLI offer (current policy — members can request a CLI, but none is proactively offered)
  • Do not offer CLI to >85% utilization (already financially stressed) or <40% utilization (not the target behavior) — keep the experiment population clean

Randomization strategy:

  • Unit of randomization: Individual cardmember (not household — to prevent spillover if two household members have separate cards)
  • Use stratified randomization on: credit score band, tenure, spend category mix, and geography — ensures balance on confounders
  • Holdout ratio: 50/50 for maximum statistical power at this stage; can move to 80/20 after initial validation

Sample size calculation:

# Power calculation for spend lift (primary metric)
from statsmodels.stats.power import TTestIndPower

mean_spend = 4000.0   # illustrative baseline monthly spend ($)
std_spend = 2500.0    # illustrative pooled std dev of spend ($)

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=(0.11 * mean_spend) / std_spend,  # Cohen's d: absolute lift / pooled std
    alpha=0.05,                                   # Type I error rate
    power=0.80,                                   # 1 - Type II error rate
    ratio=1.0                                     # equal group sizes
)
# Repeat separately for each primary and guardrail metric

Guardrail metrics (must not degrade beyond threshold):

  • Default rate increase: must stay below +0.75 pp (pre-agreed risk tolerance threshold)
  • Fraud rate: monitor for limit increases enabling fraud
  • Early delinquency (30+ DPD): leading indicator of default risk

Observation period: 90 days for spend signal; 12 months for full credit risk signal — decision must acknowledge the lag

Part (b) — Interpreting the Results:

| Metric | Result | p-value | Interpretation |
| --- | --- | --- | --- |
| Spend lift | +11% | 0.03 | Statistically significant; positive |
| Default rate | +0.4 pp | 0.11 | Not significant at α=0.05, but concerning |
| Retention | +2.1% | 0.07 | Borderline significant; positive signal |

Business interpretation:

  • The spend result is clean and significant — the hypothesis on spend behavior is confirmed
  • The retention result is directionally positive and commercially meaningful even at p=0.07
  • The default rate result requires careful treatment (see Part c)

Recommendation framing: "The data supports a conditional rollout — the spend and retention signals are compelling. However, the experiment was powered for spend, not for credit risk. The 90-day observation window is insufficient to conclude on default behavior, which typically manifests at 9–18 months. I recommend: (1) extend the experiment cohort monitoring to 12 months for credit outcomes, (2) proceed with a limited geographic or segment rollout at 20% scale while awaiting full credit signal, (3) set a pre-agreed default rate circuit breaker that triggers a policy pause if exceeded."

Part (c) — Responding to the Stakeholder:

This is one of the most important statistical communication moments:

  • "Statistical insignificance does not mean absence of effect — it means we don't have enough evidence to rule out chance. Those are very different statements."
  • Type II error risk: The experiment may be underpowered for the default metric. If the true effect is +0.4 pp and the study had 60% power to detect it, a p=0.11 is entirely consistent with a real effect.
  • Asymmetric consequences: A false negative on default risk (missing a real increase) has very different business consequences than a false negative on spend lift. The cost of being wrong is not symmetric, which justifies a more conservative threshold for the guardrail metric.
  • Practical significance vs. statistical significance: A +0.4 pp default rate increase across 2 million CLIs is $X million in additional credit losses — calculate it and present it. The business should decide on real dollar impact, not a p-value threshold alone.
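
A back-of-envelope version of that dollar framing, with loudly illustrative assumptions (population size and loss-per-default are placeholders, not actual portfolio figures):

# illustrative assumptions; replace with actual portfolio figures
cli_population = 2_000_000      # CLIs granted at full rollout
default_rate_lift = 0.004       # the observed +0.4 pp
loss_per_default = 3_000        # assumed avg incremental charge-off ($)

incremental_defaults = cli_population * default_rate_lift        # 8,000
incremental_losses = incremental_defaults * loss_per_default     # $24M
print(f"{incremental_defaults:,.0f} extra defaults = ${incremental_losses/1e6:.0f}M in losses")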

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 17–20 minutes

✅ What a Strong Candidate Must Mention

  • Separate power calculations for each metric — a study powered for spend may be massively underpowered for a 0.4 pp default rate change
  • Asymmetry of Type I and Type II errors in risk decisions — the cost of missing a real default rate increase is not equal to the cost of a false positive on spend
  • Observation window mismatch: 90-day spend results are informative; 90-day credit results are not — a strong candidate explicitly flags this
  • Pre-registration of guardrail thresholds — define what "too much default rate increase" means before seeing the data, not after
  • The circuit breaker principle: scale up only with a pre-agreed trigger for pause or reversal

🔁 Smart Follow-Up Questions

  1. "Your experiment shows sa ignificant spend lift in the treatment group. But you later discover that the control group had a higher proportion of cardmembers who called in requesting CLIs during the experiment period. How does this affect your conclusions?"
  1. "How would you handle the fact that cardmembers in the treatment group might change their behavior simply because they received a proactive offer — a form of novelty or Hawthorne effect — rather than because of the higher limit itself?"
  1. "If you had to design a multi-armed experiment testing three different CLI increase amounts (10%, 25%, 50%), how would your design and analysis change?"


Question 3: Feature Engineering for Financial Transaction Data

📌 Question Title

Engineering Behavioral Features from Raw Transaction Data for a Churn Prediction Model


💬 Detailed Business Scenario

AmEx wants to predict which Platinum cardmembers are likely to cancel their card within the next 6 months before their annual fee renewal date. You have access to 24 months of transaction-level data for 500,000 cardmembers: transaction date, merchant name, merchant category code (MCC), transaction amount, whether it was approved/declined, country, and channel (in-person, online, contactless).

(a) What features would you engineer from this raw transaction data — and what behavioral signals are most predictive of churn in a premium card context?

(b) Walk through the SQL logic to compute three of your most important features for a single cardmember's 90-day window.

(c) How do you handle the temporal nature of this data to prevent leakage — and what specific leakage risks exist in churn modeling on financial data?


📋 Structured Model Answer

Part (a) — Feature Engineering Strategy:

Features should capture behavioral momentum — it's not the absolute level that predicts churn, it's the behavior change that matters. Organize by signal type:

Spend Volume & Velocity Features:

  • Total spend in L30/L60/L90 days; trend (L30 vs. prior L30)
  • Month-over-month spend growth rate (rolling 3-month)
  • Spend per active day (normalized engagement)

Category Mix Features (most predictive for Platinum):

  • T&E spend share: % of spend in airline (MCC 3000–3299), hotel (MCC 7011), restaurant (MCC 5812) MCCs
  • T&E share trend: if a Platinum cardmember's T&E share drops from 60% to 20%, that's a strong churn signal
  • Competitor category proxy: high spend at merchants associated with competitor card benefits (certain airline lounges, hotel chains with co-brand partnerships)

Benefit Utilization Features:

  • Number of distinct benefit categories used in L90 days
  • Lounge visit frequency (via transaction records at lounge MCC)
  • Amex Offer redemption rate: offers clicked vs. offers redeemed
  • Days since last benefit redemption

Engagement Recency Features:

  • Recency of last transaction (RFM-style)
  • Days since last online/mobile channel transaction
  • Number of declined transactions in L30 days (financial stress signal)

Account Health Features:

  • Payment behavior: full pay vs. minimum pay trend
  • Credit utilization trajectory (for charge/lending card holders)
  • Number of customer service contacts in L90 days (high contact = dissatisfaction signal)
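
Before the SQL in part (b), a minimal pandas sketch of the momentum idea (L30 spend vs. the prior L30); the tiny frame and column names are illustrative:

import pandas as pd

# illustrative transactions frame; real data comes from the warehouse
txns = pd.DataFrame({
    "cardmember_id": [1, 1, 1, 2, 2],
    "transaction_date": pd.to_datetime(
        ["2024-06-25", "2024-06-05", "2024-05-10", "2024-06-20", "2024-05-15"]),
    "transaction_amount": [120.0, 80.0, 300.0, 50.0, 45.0],
})

asof = pd.Timestamp("2024-06-30")                        # observation point
cur_win = txns[txns.transaction_date > asof - pd.Timedelta(days=30)]
prev_win = txns[(txns.transaction_date <= asof - pd.Timedelta(days=30))
                & (txns.transaction_date > asof - pd.Timedelta(days=60))]

cur = cur_win.groupby("cardmember_id").transaction_amount.sum()
prev = prev_win.groupby("cardmember_id").transaction_amount.sum()
spend_trend = ((cur - prev) / prev).rename("l30_vs_prior_l30")   # momentum feature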

Part (b) — SQL Implementation (3 Key Features):

-- Feature 1: 90-day T&E spend share (critical Platinum churn signal)
WITH spend_base AS (
    SELECT
        cardmember_id,
        SUM(transaction_amount) AS total_spend_90d,
        SUM(CASE
            WHEN mcc BETWEEN 3000 AND 3299    -- Airlines
              OR mcc = 7011                   -- Hotels
              OR mcc = 5812                   -- Restaurants
            THEN transaction_amount ELSE 0
        END) AS te_spend_90d
    FROM transactions
    WHERE transaction_date >= CURRENT_DATE - INTERVAL '90 days'
      AND transaction_status = 'APPROVED'
    GROUP BY cardmember_id
)
SELECT
    cardmember_id,
    ROUND(te_spend_90d / NULLIF(total_spend_90d, 0), 4) AS te_spend_share_90d
FROM spend_base;

-- Feature 2: Month-over-month spend trend (momentum signal)
WITH monthly_spend AS (
    SELECT
        cardmember_id,
        DATE_TRUNC('month', transaction_date) AS spend_month,
        SUM(transaction_amount) AS monthly_spend
    FROM transactions
    WHERE transaction_date >= CURRENT_DATE - INTERVAL '60 days'
    GROUP BY cardmember_id, DATE_TRUNC('month', transaction_date)
),
lagged AS (
    SELECT
        cardmember_id,
        spend_month,
        monthly_spend,
        LAG(monthly_spend) OVER (
            PARTITION BY cardmember_id
            ORDER BY spend_month
        ) AS prior_month_spend
    FROM monthly_spend
)
SELECT
    cardmember_id,
    ROUND((monthly_spend - prior_month_spend)
          / NULLIF(prior_month_spend, 0), 4) AS mom_spend_growth
FROM lagged
WHERE spend_month = DATE_TRUNC('month', CURRENT_DATE);
-- note: the current month is partial at runtime; in production, pro-rate it
-- or compare the two most recent completed months

-- Feature 3: Benefit utilization diversity (L90 distinct benefit categories)
SELECT
    cardmember_id,
    COUNT(DISTINCT benefit_category) AS benefit_categories_used_90d,
    MAX(benefit_redemption_date) AS last_benefit_date,
    DATEDIFF('day', MAX(benefit_redemption_date),
             CURRENT_DATE) AS days_since_last_benefit
FROM benefit_redemptions
WHERE benefit_redemption_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY cardmember_id;

Part (c) — Temporal Leakage Prevention:

Definition of the prediction task first:

  • Observation point: T (the date at which we make the prediction)
  • Label window: T to T+180 days (did the member cancel?)
  • Feature window: T-90 days to T (all features computed from data before T only)

Specific leakage risks in churn modeling:

| Leakage Type | Example | Prevention |
| --- | --- | --- |
| Future data in features | Including spend from T+30 in a feature computed at T | Strict feature window cutoff at observation point |
| Label leakage | Including cancellation-related service calls as a feature (they occur simultaneously with the decision) | Exclude events that are consequences of the churn decision, not causes |
| Account status leakage | Flagging accounts already in cancellation process as training examples | Remove any account that had a cancellation inquiry before T from the training set |
| Survival bias | Only training on accounts still active at T | Ensure cohort includes accounts that churned before T to represent the full risk spectrum |
# Correct temporal pipeline structure
import pandas as pd

def create_features(transactions_df: pd.DataFrame, observation_date) -> pd.DataFrame:
    """
    ALWAYS filter to data STRICTLY BEFORE observation_date.
    Never use data from [observation_date, observation_date + label_window].
    """
    feature_data = transactions_df[
        transactions_df['transaction_date'] < observation_date  # strict <, not <=
    ]
    return compute_features(feature_data)  # project-specific feature builder

📊 Difficulty Level: Medium–Hard

⏱ Expected Interview Time: 16–18 minutes

✅ What a Strong Candidate Must Mention

  • Trend features outperform level features for churn: it's the change in T&E share, not the absolute level, that signals intent to cancel
  • MCC-level granularity as a rich behavioral signal — not just spend amount, but where the spend goes
  • Benefit utilization is the most Platinum-specific signal — members who stop redeeming benefits have already psychologically churned before they call
  • Strict temporal discipline: observation point, feature window, and label window must be explicitly defined and enforced in code
  • The survival bias risk: churn models trained only on currently active members underrepresent the characteristics of early churners

🔁 Smart Follow-Up Questions

  1. "You engineer 200 features from transaction data. Your XGBoost model uses 40 of them with high importance scores. How do you decide which features are safe to use in a production model from a fairness and regulatory standpoint?"
  1. "A PM asks you to add a feature: 'Did the cardmember contact customer service in the last 30 days?' You know from analysis that this is one of the top 3 predictors. What are the risks of including it?"
  1. "Your churn model performs well on AUC, but the calibration is poor — predicted churn probability of 0.3 corresponds to actual churn rate of 0.18. Why does this matter, and how do you fix it?"


Question 4: Fraud Detection & Anomaly Detection

📌 Question Title

Designing a Real-Time Transaction Fraud Scoring System


💬 Detailed Business Scenario

AmEx processes millions of transactions daily. The fraud team wants to upgrade their existing rule-based fraud detection system with an ML-based real-time scoring model that assigns a fraud probability to each transaction at the point of authorization — within 200 milliseconds.

The current rule-based system has a false positive rate of 3.2% (legitimate transactions declined), causing significant cardmember friction. Fraud accounts for approximately 0.08% of transactions by volume but a disproportionate share of dollar losses.

(a) How do you frame this as an ML problem — and what makes fraud detection fundamentally different from other classification problems?

(b) What features would you engineer for real-time scoring, and what are the constraints imposed by the 200ms latency requirement?

(c) The model reduces fraud losses by 18% but increases false positives by 0.4 percentage points. How do you frame this tradeoff for a business decision?


📋 Structured Model Answer

Part (a) — Problem Framing & Unique Challenges:

Target variable:

  • Binary: 1 = fraudulent transaction, 0 = legitimate
  • Ground truth lag problem: fraud labels are often confirmed days to weeks after the transaction (chargebacks, investigations) — the training label is not available in real-time, creating a delayed labeling problem

What makes fraud detection uniquely difficult:

| Challenge | Description | Mitigation |
| --- | --- | --- |
| Extreme class imbalance | 0.08% fraud rate = 1 in 1,250 transactions | Precision-Recall AUC >> ROC-AUC as primary metric; cost-sensitive learning |
| Adversarial adaptation | Fraudsters observe declines and adapt behavior | Model must be retrained frequently; concept drift monitoring is critical |
| Feedback loop bias | Only reviewed/confirmed transactions generate labels — high-score transactions reviewed more, biasing training data | Explore random auditing of low-score transactions to avoid blind spots |
| Temporal non-stationarity | Fraud patterns change with seasons, economic conditions, new card-not-present channels | Rolling training windows; time-weighted samples |
| Latency constraint | 200ms authorization window limits model complexity | Feature pre-computation; lightweight inference models |

Part (b) — Feature Engineering Under Latency Constraints:

Architectural principle: Compute expensive features offline (updated hourly/daily); compute only lightweight real-time features at transaction time.

OFFLINE FEATURES (pre-computed, stored in feature store):
├── Cardmember spend velocity: avg_daily_spend_L30, std_daily_spend_L30
├── Merchant risk profile: merchant_fraud_rate_L90 (from historical data)
├── Geographic anchor: most_common_country_L60, home_zip_code
├── Card-not-present ratio: pct_cnp_transactions_L30
└── Account age, product type, credit limit utilization

REAL-TIME FEATURES (computed at authorization, <5ms each):
├── Transaction amount vs. cardmember L30 average:
│   deviation_score = (txn_amount - avg_amount_L30) / std_amount_L30
├── Time since last transaction (velocity check):
│   seconds_since_last_txn
├── Geographic velocity: distance from last transaction location / time elapsed
│   → "impossible travel" detection: 2 txns in different countries within 1 hour
├── MCC consistency: is this MCC in the cardmember's historical MCC distribution?
└── Device/channel fingerprint match

FEATURE ENGINEERING EXAMPLE — Geographic Velocity:
distance_km = haversine(last_txn_lat_lon, current_txn_lat_lon)
time_hours = (current_txn_timestamp - last_txn_timestamp).seconds / 3600
velocity_kmh = distance_km / max(time_hours, 0.001)
is_impossible_travel = (velocity_kmh > 900)  # above commercial flight speed
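
The haversine call above is pseudocode; a runnable version of the same check (coordinates illustrative):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in km
    r = 6371.0
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

distance_km = haversine_km(40.71, -74.01, 51.51, -0.13)  # NYC to London (illustrative)
time_hours = 1.0                                          # elapsed between the two txns
velocity_kmh = distance_km / max(time_hours, 0.001)
is_impossible_travel = velocity_kmh > 900                 # above commercial flight speed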

Model architecture for latency:

  • Gradient Boosted Trees (XGBoost/LightGBM): fast inference, handles tabular data well, ~2ms inference time
  • Two-stage scoring: lightweight model for all transactions (fast), heavier model triggered only for transactions above a risk threshold (sketched below)
  • Feature store: pre-computed cardmember profiles served via in-memory cache (Redis/DynamoDB) for sub-millisecond lookup
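
A sketch of the two-stage idea; the escalation threshold and model objects are illustrative assumptions, not a documented design:

def score_transaction(txn_features, light_model, heavy_model, escalate_at=0.10):
    """Stage 1 scores everything cheaply; stage 2 runs only on risky traffic."""
    p_light = light_model.predict_proba([txn_features])[0, 1]   # ~2ms tree model
    if p_light < escalate_at:
        return p_light               # vast majority of transactions stop here
    return heavy_model.predict_proba([txn_features])[0, 1]      # heavier model, rare path

Because only a small fraction of transactions cross the threshold, the heavy model's latency cost applies to a sliver of traffic.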

Part (c) — Framing the False Positive Tradeoff:

This is a cost-benefit analysis, not a model optimization problem:

COST OF FRAUD (status quo):
  - Dollar loss per fraudulent transaction × fraud volume
  - Operational cost of dispute resolution (~$40–60 per case)

COST OF FALSE POSITIVE (new model):
  - Estimated 0.4 pp increase in false positive rate
  - On 10M daily transactions: 40,000 additional legitimate declines/day
  - Cost per false positive: cardmember friction + potential churn
    → Research suggests 15–20% of cardmembers who experience
      an unexpected decline consider switching cards
  - At 0.5% churn conversion on 40,000 daily false positives:
    200 additional at-risk cardmembers per day × LTV of $2,000+
    = $400,000/day in LTV at risk

RECOMMENDATION FRAMEWORK:
Present as a Precision-Recall operating point decision:
  → At threshold T1: 18% fraud reduction, +0.4pp FPR
  → At threshold T2: 14% fraud reduction, +0.1pp FPR (more conservative)
  → Let the business choose the operating point based on their
    relative valuation of fraud losses vs. cardmember experience

The data scientist's job is to present the Pareto frontier, not make the business decision unilaterally.


📊 Difficulty Level: Hard

⏱ Expected Interview Time: 18–20 minutes

✅ What a Strong Candidate Must Mention

  • Precision-Recall AUC as the primary metric for fraud (not ROC-AUC) — in extreme imbalance, ROC-AUC is misleadingly optimistic
  • The delayed labeling problem: fraud ground truth isn't available at transaction time, creating a training data lag that must be explicitly managed
  • Feedback loop bias: models trained on reviewed transactions oversample high-risk patterns — random audits of low-score transactions are essential
  • Two-stage model architecture for latency management — a single heavy model can't serve 200ms at scale
  • The asymmetric cost framing: false positives harm known, identifiable, loyal cardmembers; false negatives harm AmEx financially — this asymmetry should drive threshold selection

🔁 Smart Follow-Up Questions

  1. "Six months after deployment, your model's precision drops from 78% to 61% at the same threshold, but recall is unchanged. What are the most likely explanations,s and how do you investigate?"
  1. "A fraudster realizes your model uses geographic velocity and starts making fraudulent transactions close to the cardmember's home location. How does your system detect and adapt to this behavioral adversarial attack?"
  1. "How would you design the retraining pipeline for this model — what triggers retraining, what data goes in, and what governance exists before a new model version goes live?"


Question 5: Model Evaluation & Business Impact Measurement

📌 Question Title

Measuring the True Business Impact of a Deployed Churn Intervention Model


💬 Detailed Business Scenario

Six months ago, AmEx deployed a churn prediction model that flags Platinum cardmembers with predicted churn probability >35% and triggers an automated retention outreach (targeted offer, RM call, or fee waiver). The model has been running in production for 6 months.

Your manager asks you to produce a business impact report: did the model actually reduce churn, and what was the financial return on the intervention investment?

The challenge: there is no clean holdout group — the model was rolled out to all eligible cardmembers at launch because leadership wanted immediate impact.

(a) Without a clean control group, how do you measure the causal impact of the intervention?

(b) You find that the model-triggered retention group has a 12% lower churn rate than similar non-triggered cardmembers. A colleague says: "That proves the model works." What's wrong with that claim?

(c) Design a rigorous going-forward measurement framework — both for model performance and business ROI — that you would present to the VP of Card Analytics.


📋 Structured Model Answer

Part (a) — Causal Impact Without a Clean Control Group:

This is a causal inference problem, not a predictive modeling problem. Several approaches:

Option 1: Propensity Score Matching (PSM)

Approach:
1. Define treatment: cardmembers who received retention outreach (model score > 35%)
2. For each treated member, find a statistically similar untreated member
   (members who scored just below the threshold, or high-risk members
   the model missed due to data gaps)
3. Match on confounders: tenure, spend level, benefit utilization,
   prior churn risk indicators
4. Compare churn rates between matched treated and control groups

Limitation: Only controls for OBSERVED confounders — unobserved
differences between treated and untreated groups may still bias results

Option 2: Regression Discontinuity Design (RDD)

Exploit the 35% score threshold as a natural experiment:
- Cardmembers just above 35% (received intervention) vs.
  cardmembers just below 35% (did not receive intervention)
- Near the threshold, treatment assignment is "as good as random"
- Compare churn rates in a narrow bandwidth around the 35% cutoff

This is the most credible causal estimate available without
a randomized experiment, assuming:
  (a) The 35% threshold was applied consistently
  (b) Cardmembers couldn't manipulate their own score
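
A minimal sketch of that local comparison, on synthetic data with an illustrative bandwidth; a production RDD would add local linear regression and bandwidth sensitivity checks:

import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# synthetic scores and outcomes with a built-in -5 pp treatment effect above 0.35
rng = np.random.default_rng(0)
df = pd.DataFrame({"model_score": rng.uniform(0, 1, 20_000)})
df["churned"] = rng.binomial(1, 0.2 + 0.3 * df.model_score
                                - 0.05 * (df.model_score > 0.35))

bandwidth = 0.03
near = df[(df.model_score - 0.35).abs() <= bandwidth]
treated = near[near.model_score > 0.35]     # just above: received outreach
control = near[near.model_score <= 0.35]    # just below: did not

effect = treated.churned.mean() - control.churned.mean()   # local causal estimate
_, pval = proportions_ztest([treated.churned.sum(), control.churned.sum()],
                            [len(treated), len(control)])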

Option 3: Difference-in-Differences (DiD)

- Compare churn trends BEFORE and AFTER model deployment
- For both high-risk (treated) and lower-risk (never treated) segments
- DiD estimate: (post-pre change in treated) - (post-pre change in control)

Assumption: parallel trends — both groups would have trended
similarly in the absence of the intervention
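
And the DiD estimate as a regression, where the interaction coefficient is the effect; the four-cell panel is illustrative:

import pandas as pd
import statsmodels.formula.api as smf

# segment-period churn rates (illustrative); treated = high-risk segment
panel = pd.DataFrame({
    "treated":    [1, 1, 0, 0],
    "post":       [0, 1, 0, 1],
    "churn_rate": [0.30, 0.24, 0.18, 0.17],
})

did = smf.ols("churn_rate ~ treated + post + treated:post", data=panel).fit()
print(did.params["treated:post"])  # (post-pre in treated) - (post-pre in control) = -0.05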

Part (b) — The Selection Bias Problem:

The colleague's claim is a classic selection bias / confounding error:

  • The model selectively targeted high-risk cardmembers for intervention. The comparison group (similar non-triggered members) is not similar — they were deemed lower-risk by the model.
  • The 12% lower churn rate in the treated group could reflect:
    1. The intervention is genuinely working ✓
    2. The model is wrong about some "high risk" members who weren't actually at risk (false positives who wouldn't have churned anyway — regression to the mean)
    3. High-risk members who received outreach are more engaged to begin with (selection into the high-score group correlates with engagement)

The correct framing: The 12% difference is a correlation, not a causal estimate. Without knowing the counterfactual (what those cardmembers would have done without the intervention), we cannot attribute the difference to the model.

Part (c) — Going-Forward Measurement Framework:

Model Performance Metrics (ongoing):

DISCRIMINATION METRICS (monthly):
├── AUC-ROC and Precision-Recall AUC on rolling holdout cohort
├── KS Statistic: separation of churners vs. non-churners at 6-month mark
└── Lift curve: concentration of actual churners in top score deciles

CALIBRATION METRICS:
├── Score vs. actual churn rate by score band (reliability diagram)
└── Expected Calibration Error (ECE)

STABILITY METRICS:
├── PSI (Population Stability Index) — monthly score distribution
│   PSI < 0.10: stable; 0.10–0.25: monitor; >0.25: investigate/retrain
└── Characteristic Stability Index (CSI) for key features
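
PSI, referenced throughout this framework, is simple to compute; a standard implementation sketch (assumes continuous scores so the quantile bin edges are distinct):

import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between baseline and current score distributions."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))  # baseline deciles
    e_cnt = np.histogram(expected, bins=edges)[0]
    a_cnt = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0]
    e_pct = np.clip(e_cnt / e_cnt.sum(), 1e-6, None)   # avoid log(0)
    a_pct = np.clip(a_cnt / a_cnt.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# < 0.10 stable | 0.10-0.25 monitor | > 0.25 investigate/retrain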

Business ROI Framework:

STEP 1: Establish causal estimate via RDD or PSM (as above)
        → Estimated incremental retention rate: X%

STEP 2: Convert to cardmembers retained
        Triggered population × causal retention estimate
        = Incrementally retained cardmembers (N)

STEP 3: Value per retained cardmember
        Average LTV of Platinum member × probability of multi-year retention
        (use survival curve, not point estimate)

STEP 4: Intervention cost
        Cost per outreach × volume + fee waivers granted × value

STEP 5: Net ROI
        ROI = (N × LTV_per_member − Total_intervention_cost)
               / Total_intervention_cost

STEP 6: Confidence interval on ROI
        Propagate uncertainty from causal estimate through to ROI
        → Present range: "We estimate ROI between X% and Y%
          with 80% confidence"
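
Step 6 in miniature: propagating uncertainty in the causal estimate through to an ROI interval via simulation (every number below is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(1)
lift = rng.normal(0.02, 0.005, 100_000)     # causal retention lift from RDD/PSM, with SE
triggered = 50_000                          # members receiving outreach
ltv = 2_000                                 # value per retained member ($)
cost = 1_500_000                            # total intervention cost ($)

roi = (triggered * lift * ltv - cost) / cost
lo, hi = np.percentile(roi, [10, 90])       # 80% interval for the VP readout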

VP Presentation structure:

  1. What the model does and who it targets
  2. Our best causal estimate of impact (with honest caveats on methodology)
  3. Financial ROI range with confidence bounds
  4. What we're doing to get a cleaner estimate going forward (prospective holdout)
  5. Model health dashboard — stability, calibration, and performance trends

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 17–20 minutes

✅ What a Strong Candidate Must Mention

  • The fundamental problem of causal inference in production ML: correlation between model triggers and outcomes is not causal evidence of impact
  • Regression Discontinuity Design is the most credible post-hoc causal estimate when threshold-based assignment was used — this is a graduate-level insight that separates strong candidates
  • Selection bias/regression to the mean: model false positives who receive treatment inflate the "success rate" by including people who wouldn't have churned anyway
  • PSI as the primary model health metric in production — knowing the threshold values (0.10, 0.25) demonstrates practical deployment experience
  • Uncertainty quantification in ROI: presenting a range with confidence bounds is more credible and more honest than a point estimate

🔁 Smart Follow-Up Questions

  1. "The VP asks why you can't just compare churn rates before and after deployment as proof that the model works. Walk her through exactly why that's insufficient."
  1. "You implement a prospective holdout going forward — 10% of high-risk cardmembers receive no intervention. Two months in, a stakeholder says it's unethical to withhold retention offers from at-risk members. How do you respond?"
  1. "Your model's AUC is stable at 0.76, but you notice that the top score decile's actual churn rate has dropped from 28% to 19% over 6 months. Is this good news or bad news for the model — and what does it tell you?"

Question 6: Customer Lifetime Value Modeling & Segmentation

📌 Question Title

Building a Forward-Looking CLV Model to Drive Acquisition Spend Allocation


💬 Detailed Business Scenario

AmEx's marketing team currently allocates acquisition budget based on short-term proxies — expected first-year spend and estimated annual fee revenue. The Chief Marketing Officer believes this is leaving money on the table: high-CLV cardmembers are being acquired at the same cost as low-CLV ones, and some high-cost acquisition channels are actually delivering the most valuable long-term customers.

You're asked to build a forward-looking Customer Lifetime Value model for newly acquired Platinum cardmembers and use it to re-optimize acquisition channel spend allocation.

(a) How do you define and structure a CLV model for a premium credit card product — what are its components, and what makes this harder than a typical subscription CLV model?

(b) Walk through the statistical modeling choices for each component — spend prediction, retention/survival modeling, and margin estimation.

(c) Your CLV model shows that Channel A (digital influencer partnerships) delivers 2.3× higher 5-year CLV than Channel B (direct mail), despite Channel A costing 40% more per acquisition. How do you turn this into a budget reallocation recommendation?


📋 Structured Model Answer

Part (a) — CLV Model Structure:

Core CLV formula for a credit card context:

CLV = Σ [t=1 to T] ( M_t × r_t / (1 + d)^t ) − CAC

Where:
  M_t  = Net margin in period t (revenue − credit losses − servicing cost − rewards cost)
  r_t  = Probability of still being an active cardmember at time t (survival probability)
  d    = Discount rate (cost of capital, typically 8–12% for AmEx)
  CAC  = Customer Acquisition Cost (channel-specific)
  T    = Time horizon (typically 5 years; beyond that, uncertainty dominates)
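
The formula translated directly into code (inputs illustrative):

def clv(margins, survival, d=0.10, cac=1_400.0):
    """CLV = sum over t of M_t * r_t / (1 + d)^t, minus CAC, with annual periods."""
    pv = sum(m * r / (1 + d) ** t
             for t, (m, r) in enumerate(zip(margins, survival), start=1))
    return pv - cac

# 5-year horizon, declining survival curve (illustrative numbers)
print(clv(margins=[900, 950, 1000, 1000, 1000],
          survival=[0.92, 0.84, 0.77, 0.71, 0.66]))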

Why credit card CLV is harder than subscription CLV:

| Dimension | Subscription (Netflix) | Credit Card (AmEx Platinum) |
| --- | --- | --- |
| Revenue per period | Fixed ($X/month) | Variable (spend-driven + fee) |
| Churn definition | Binary, clean | Gradual disengagement before formal cancel |
| Margin | Relatively stable | Volatile (credit losses are stochastic) |
| Cross-sell | Limited | Rich (lending, business cards, travel) |
| Regulatory constraints | Minimal | Fair lending, data usage restrictions |

Additional complexity: multi-product relationships — a Platinum holder who also has a Business Gold and a personal loan is worth far more than a single-product model suggests; cross-product CLV requires a portfolio-level view.

Part (b) — Component Modeling:

Component 1: Spend Prediction (M_t revenue portion)

# Approach: Two-stage model
# Stage 1: Will the cardmember be active in period t? (survival model)
# Stage 2: Conditional on being active, how much will they spend?

# For spend prediction conditional on activity:
# - Gamma-Gamma model (conjugate prior for spend variability)
# - Or: XGBoost regression on lagged spend features + macro variables
# Key insight: use QUANTILE regression to model spend distribution,
# not just the mean — high-spend cardmembers have fat-tailed distributions

from lifetimes import BetaGeoFitter, GammaGammaFitter

# BG/NBD model for transaction frequency (required by the CLV calculation below)
bgf = BetaGeoFitter(penalizer_coef=0.01)
bgf.fit(cardmember_df['frequency'], cardmember_df['recency'], cardmember_df['T'])

ggf = GammaGammaFitter(penalizer_coef=0.01)
ggf.fit(cardmember_df['frequency'], cardmember_df['avg_transaction_value'])
predicted_clv = ggf.customer_lifetime_value(
    bgf,                                     # fitted BG/NBD frequency model
    cardmember_df['frequency'],
    cardmember_df['recency'],
    cardmember_df['T'],                      # age of cardmember relationship
    cardmember_df['avg_transaction_value'],  # monetary value per transaction
    time=12,                                 # months
    discount_rate=0.01                       # monthly
)

Component 2: Survival / Retention Modeling

# Approach: Parametric survival model (preferred over KM for prediction)
# Cox Proportional Hazards: semi-parametric, interpretable
# Weibull AFT: fully parametric, better for extrapolation beyond observed horizon

from lifelines import WeibullAFTFitter
aft = WeibullAFTFitter()
aft.fit(df, duration_col='tenure_months', event_col='churned',
        formula='spend_L90 + te_share + benefit_utilization + fico_band')

# Survival probability at month t for a specific cardmember:
survival_prob = aft.predict_survival_function(new_member_features)

Component 3: Margin Estimation

Net margin per period =
  Interchange revenue (spend × net interchange rate)
+ Annual fee (recognized ratably)
+ Net interest income (revolve balance × net interest margin)
− Rewards cost (spend × rewards rate × redemption probability)
− Provision for credit losses (balance × predicted loss rate)
− Servicing cost (fixed per account)

Key challenge: credit loss component is correlated with economic cycle —
build scenario-adjusted margin: base / adverse / severely adverse

Part (c) — Budget Reallocation Recommendation:

STEP 1: Compute CLV-to-CAC ratio by channel
  Channel A: CLV = $4,600 | CAC = $1,400 | CLV/CAC = 3.3×
  Channel B: CLV = $2,000 | CAC = $1,000 | CLV/CAC = 2.0×

STEP 2: Compute marginal return on incremental spend
  Key question: Does Channel A's CLV/CAC ratio hold at HIGHER volume?
  → Diminishing returns: digital influencer audiences are finite;
    scaling spend 2× on a channel rarely produces 2× volume
    at the same quality
  → Model: fit a spend-volume-quality curve for each channel

STEP 3: Optimal allocation (constrained optimization)
  Maximize: Σ_channels (Volume_c × CLV_c − Spend_c × CAC_c)
  Subject to: Σ_channels Spend_c ≤ Total_budget
              Volume_c ≤ Channel_capacity_c
  → Solve with linear programming or gradient-based optimization

STEP 4: Communicate with confidence intervals
  CLV estimates carry uncertainty — present the reallocation
  recommendation with: expected case, conservative case, and
  the breakeven CLV assumption at which the reallocation still wins
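
Step 3 as a tiny linear program; note that a pure LP assumes constant CLV/CAC per channel, so the capacity bounds stand in for step 2's diminishing-returns curve (all figures illustrative):

from scipy.optimize import linprog

clv_ = [4600, 2000]                 # per-acquisition CLV, channels A and B
cac_ = [1400, 1000]                 # per-acquisition cost
net_per_dollar = [(v - c) / c for v, c in zip(clv_, cac_)]  # net value per $ spent

budget = 10_000_000
capacity = [6_000_000, 8_000_000]   # max productive spend per channel

res = linprog(
    c=[-n for n in net_per_dollar],       # linprog minimizes, so negate
    A_ub=[[1, 1]], b_ub=[budget],         # total budget constraint
    bounds=list(zip([0, 0], capacity)),   # channel capacity bounds
)
spend_a, spend_b = res.x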

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 18–20 minutes

✅ What a Strong Candidate Must Mention

  • Gamma-Gamma / BG-NBD models as the industry-standard probabilistic CLV framework for non-contractual settings — awareness of lifetimes library is a plus
  • Discounting future cash flows: CLV without a discount rate is economically meaningless for a multi-year horizon
  • Diminishing returns on channel scaling: CLV/CAC ratios are not constant with spend volume — a candidate who ignores this is proposing a naive reallocation
  • Cross-product CLV: single-product CLV systematically undervalues cardmembers who have or will have multiple AmEx products
  • Confidence intervals on CLV: point estimates of CLV are less useful than distributions — especially for budget decisions with significant dollar impact

🔁 Smart Follow-Up Questions

  1. "Your CLV model is 5 years forward-looking, but your acquisition campaign results are evaluated quarterly. How do you bridge that organizational tension?"
  1. "Channel A delivers high-CLV customers, but you discover they also have a 40% higher fraud rate in the first 90 days. How does that change your recommendation?"
  1. "A PM argues that your CLV model can't account for macroeconomic uncertainty, so it shouldn't drive budget decisions. How do you respond — and how do you incorporate macro uncertainty into the model?"


Question 7: SQL & Data Manipulation at Scale

📌 Question Title

Diagnosing a Sudden Drop in Transaction Approval Rates Using SQL


💬 Detailed Business Scenario

On a Monday morning, the operations team flags that the overall transaction approval rate dropped from 94.2% to 89.7% over the weekend — a 4.5 percentage point decline affecting millions of transactions. Leadership wants a root cause analysis within 2 hours.

You have access to a transactions table with the following schema:

transactions (
  transaction_id      VARCHAR,
  cardmember_id       VARCHAR,
  transaction_date    TIMESTAMP,
  transaction_amount  DECIMAL(12,2),
  merchant_id         VARCHAR,
  merchant_category   VARCHAR,   -- MCC description
  country_code        CHAR(2),
  channel             VARCHAR,   -- 'in_person', 'online', 'contactless'
  card_product        VARCHAR,   -- 'Platinum', 'Gold', 'Green', 'Blue'
  decline_reason      VARCHAR,   -- NULL if approved
  is_approved         BOOLEAN
)

(a) Write the SQL queries you'd use to systematically isolate where the approval rate drop is concentrated — by dimension.

(b) Your analysis reveals the drop is almost entirely concentrated in online transactions from cardmembers in 3 countries, on one card product. Write a query to quantify the exact impact and identify the top 10 merchants by declined transaction volume in this segment.

(c) You find the decline reason is overwhelmingly "velocity_check_triggered." What does this tell you, and how do you communicate the finding and its business impact to a non-technical stakeholder?


📋 Structured Model Answer

Part (a) — Systematic Dimensional Decomposition:

-- ============================================================
-- QUERY 1: Approval rate by date and hour (isolate timing)
-- ============================================================
SELECT
    DATE(transaction_date)                              AS txn_date,
    EXTRACT(HOUR FROM transaction_date)                 AS txn_hour,
    COUNT(*)                                            AS total_transactions,
    SUM(CASE WHEN is_approved THEN 1 ELSE 0 END)        AS approved,
    ROUND(
        AVG(CASE WHEN is_approved THEN 1.0 ELSE 0.0 END) * 100, 2
    )                                                   AS approval_rate_pct
FROM transactions
WHERE transaction_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(transaction_date), EXTRACT(HOUR FROM transaction_date)
ORDER BY txn_date, txn_hour;

-- ============================================================
-- QUERY 2: Approval rate by channel (identify channel concentration)
-- ============================================================
SELECT
    channel,
    DATE(transaction_date)                              AS txn_date,
    COUNT(*)                                            AS total_txns,
    SUM(CASE WHEN is_approved THEN 1 ELSE 0 END)        AS approved_txns,
    ROUND(AVG(CASE WHEN is_approved
              THEN 1.0 ELSE 0.0 END) * 100, 2)          AS approval_rate_pct,
    LAG(ROUND(AVG(CASE WHEN is_approved
              THEN 1.0 ELSE 0.0 END) * 100, 2))
        OVER (PARTITION BY channel
              ORDER BY DATE(transaction_date))          AS prior_day_rate
FROM transactions
WHERE transaction_date >= CURRENT_DATE - INTERVAL '3 days'
GROUP BY channel, DATE(transaction_date)
ORDER BY channel, txn_date;

-- ============================================================
-- QUERY 3: Approval rate by card product × country × channel
-- (multi-dimensional breakdown to isolate intersection)
-- ============================================================
SELECT
    card_product,
    country_code,
    channel,
    COUNT(*)                                            AS total_txns,
    SUM(CASE WHEN NOT is_approved THEN 1 ELSE 0 END)    AS declined_txns,
    ROUND(AVG(CASE WHEN is_approved
              THEN 1.0 ELSE 0.0 END) * 100, 2)          AS approval_rate_pct
FROM transactions
WHERE transaction_date >= CURRENT_DATE - INTERVAL '2 days'
GROUP BY card_product, country_code, channel
HAVING COUNT(*) > 100   -- filter noise from low-volume segments
ORDER BY declined_txns DESC
LIMIT 30;

-- ============================================================
-- QUERY 4: Decline reason distribution (identify root cause)
-- ============================================================
SELECT
    decline_reason,
    COUNT(*)                                            AS decline_count,
    ROUND(COUNT(*) * 100.0 /
          SUM(COUNT(*)) OVER (), 2)                     AS pct_of_all_declines
FROM transactions
WHERE is_approved = FALSE
  AND transaction_date >= CURRENT_DATE - INTERVAL '2 days'
GROUP BY decline_reason
ORDER BY decline_count DESC;

Part (b) — Quantifying the Concentrated Impact:

-- ============================================================
-- QUERY 5: Exact impact quantification for affected segment
-- ============================================================
WITH affected_segment AS (
    SELECT *
    FROM transactions
    WHERE channel = 'online'
      AND country_code IN ('GB', 'AU', 'CA')  -- hypothetical affected countries
      AND card_product = 'Platinum'
      AND transaction_date >= CURRENT_DATE - INTERVAL '2 days'
),
baseline AS (
    SELECT
        AVG(CASE WHEN is_approved THEN 1.0 ELSE 0.0 END) AS baseline_approval_rate
    FROM transactions
    WHERE channel = 'online'
      AND country_code IN ('GB', 'AU', 'CA')
      AND card_product = 'Platinum'
      AND transaction_date BETWEEN CURRENT_DATE - INTERVAL '9 days'
                               AND CURRENT_DATE - INTERVAL '3 days'
)
SELECT
    COUNT(*)                                            AS total_txns_affected,
    SUM(CASE WHEN NOT is_approved THEN 1 ELSE 0 END)    AS total_declines,
    ROUND(AVG(CASE WHEN is_approved
              THEN 1.0 ELSE 0.0 END) * 100, 2)          AS current_approval_rate,
    ROUND((SELECT baseline_approval_rate FROM baseline) * 100, 2)
                                                        AS baseline_approval_rate,
    SUM(CASE WHEN NOT is_approved
             THEN transaction_amount ELSE 0 END)        AS declined_transaction_value
FROM affected_segment;

-- ============================================================
-- QUERY 6: Top 10 merchants by declined volume in affected segment
-- ============================================================
SELECT
    t.merchant_id,
    m.merchant_name,          -- assuming merchant lookup table
    t.merchant_category,
    COUNT(*)                                            AS total_declines,
    SUM(t.transaction_amount)                           AS total_declined_value,
    ROUND(AVG(CASE WHEN t.is_approved
              THEN 1.0 ELSE 0.0 END) * 100, 2)          AS merchant_approval_rate
FROM transactions t
LEFT JOIN merchants m USING (merchant_id)
WHERE t.channel = 'online'
  AND t.country_code IN ('GB', 'AU', 'CA')
  AND t.card_product = 'Platinum'
  AND t.is_approved = FALSE
  AND t.transaction_date >= CURRENT_DATE - INTERVAL '2 days'
GROUP BY t.merchant_id, m.merchant_name, t.merchant_category
ORDER BY total_declined_value DESC
LIMIT 10;

Part (c) — Root Cause Interpretation & Stakeholder Communication:

Technical interpretation of "velocity_check_triggered":
A velocity check fires when a cardmember exceeds a pre-defined transaction frequency or volume threshold within a short time window — designed to catch fraud patterns (rapid sequential transactions). A sudden spike in velocity check declines suggests one of three things:

  1. A velocity rule threshold was inadvertently tightened (config change over the weekend)
  2. A legitimate surge in transaction volume hit the threshold — e.g., a promotional event, Black Friday-style campaign, or a large merchant processing batch transactions
  3. A coordinated fraud attack on those 3 countries triggered mass velocity flags (less likely if affecting legitimate merchants)

Non-technical stakeholder communication:

"Over the weekend, our system began automatically declining a higher-than-normal number of transactions from Platinum cardmembers shopping online in the UK, Australia, and Canada. The root cause is a security rule that limits how many transactions can happen in a short time window — it's designed to catch fraud. Something caused that rule to trigger much more frequently than normal, resulting in legitimate purchases being blocked.

The impact: approximately $X million in transactions were declined that would normally have been approved. We're working with the risk engineering team to identify whether a rule threshold was changed or whether there was a transaction pattern that triggered it. We expect a resolution within [X hours] and can provide an update at [time]."


📊 Difficulty Level: Medium–Hard

⏱ Expected Interview Time: 16–18 minutes

✅ What a Strong Candidate Must Mention

  • Dimensional decomposition discipline: start broad (time, channel, product), then narrow to intersections — don't jump to hypotheses before the data directs you
  • LAG() and window functions for period-over-period comparison within a single query — avoids multiple self-joins
  • HAVING clause for noise filtering: low-volume segments produce volatile approval rates that mislead root cause analysis
  • Declined transaction value, not just count: a 4.5pp decline rate drop matters very differently if it's $50 average transactions vs. $5,000 average transactions
  • Translate "velocity_check_triggered" into business language without losing accuracy — the ability to explain technical root causes to operations or product leadership is a core DS skill at AmEx

🔁 Smart Follow-Up Questions

  1. "The transactions table has 2 billion rows and your query is running for 45 minutes. What are three concrete things you'd do to optimize it for a 2-hour RCA deadline?"
  1. "After fixing the velocity check issue, leadership asks you to build a monitoring dashboard that catches approval rate drops like this within 15 minutes of onset. What does that system look like?"
  1. "You discover that one of the top 10 affected merchants is a large airline. How does that change the business urgency of your communication — and who else do you loop in immediately?"


Question 8: Behavioral Segmentation & Unsupervised Learning

📌 Question Title

Segmenting the Commercial Card Portfolio for Targeted Product Strategy


💬 Detailed Business Scenario

AmEx's Commercial Cards division serves companies ranging from sole proprietors to large enterprises. The product team currently uses a simple revenue-based segmentation (Small: <$10M revenue, Mid: $10M–$100M, Large: >$100M) to design product offerings and set service levels. The Head of Commercial Products believes this segmentation misses meaningful behavioral differences within each tier — two companies with identical revenue may have completely different card usage patterns, risk profiles, and product needs.

You're asked to build a behavioral segmentation of the commercial card portfolio using transaction and account-level data to identify natural customer clusters that should inform product design and RM coverage strategy.

(a) What features would you use for behavioral segmentation — and how do you prepare them for clustering algorithms?

(b) Walk through your end-to-end clustering approach — algorithm selection, determining optimal number of clusters, and validation.

(c) Your clustering produces 6 segments. How do you make these segments actionable for the product team and RMs — and how do you guard against the common failure mode where data science segments are never adopted?


📋 Structured Model Answer

Part (a) — Feature Selection & Preparation:

Feature categories for commercial card behavioral segmentation:

SPEND BEHAVIOR:
├── Total monthly billed business (L6M average)
├── Spend volatility (std dev / mean — coefficient of variation)
├── MCC concentration: Herfindahl index of spend across MCC categories
│   HHI = Σ(share_i²) → high HHI = concentrated spend; low = diversified
├── T&E share (% spend in travel/entertainment MCCs)
├── Supplier payment share (% spend in B2B/vendor MCCs)
└── International spend share (% non-domestic transactions)

ACCOUNT BEHAVIOR:
├── Cards-in-force to company size ratio (card density)
├── Average transaction size (proxy for purchasing level/type)
├── Payment behavior: days-to-pay average, early vs. on-time vs. late
├── Credit line utilization trajectory (L6M trend)
└── Number of expense categories used (breadth of use)

ENGAGEMENT:
├── Benefit redemption rate (Amex Offers, travel credits)
├── Digital platform adoption (online account management usage)
└── Customer service contact frequency
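
The HHI feature from the spend-behavior block, computed per account in pandas (tiny illustrative frame):

import pandas as pd

txns = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "mcc":        ["5812", "3000", "5812", "7011", "7011"],
    "amount":     [100.0, 300.0, 100.0, 50.0, 450.0],
})

totals = txns.groupby("account_id").amount.sum()
shares = (txns.groupby(["account_id", "mcc"]).amount.sum()
              .div(totals, level="account_id"))
hhi = (shares ** 2).groupby(level="account_id").sum()   # 1.0 = all spend in one MCC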

Feature preparation for clustering:

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

# Step 1: Handle skewness — financial features are often right-skewed
# Log-transform spend features (add 1 to handle zeros)
spend_cols = ['monthly_spend', 'avg_txn_size', 'supplier_spend']
df[spend_cols] = np.log1p(df[spend_cols])

# Step 2: Scale — clustering is distance-based, scale is critical
# Use RobustScaler (resistant to outliers, common in financial data)
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df[feature_cols])

# Step 3: Dimensionality reduction (optional but recommended for >20 features)
# PCA retaining 85% of variance
pca = PCA(n_components=0.85, random_state=42)
df_pca = pca.fit_transform(df_scaled)

# IMPORTANT: Do NOT run PCA blindly — inspect which original features
# drive each principal component for interpretability
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)],
    index=feature_cols
)

Part (b) — Clustering Approach:

# ============================================================
# ALGORITHM SELECTION
# ============================================================
# K-Means: fast, scalable, assumes spherical clusters — good baseline
# DBSCAN: identifies outliers/anomalies — useful for flagging unusual accounts
# Gaussian Mixture Models: soft assignments, handles elliptical clusters
# Hierarchical: interpretable dendrogram — useful for presenting to business

# For this use case: K-Means + validation, with GMM as robustness check

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# ============================================================
# DETERMINING OPTIMAL K
# ============================================================
inertia = []
silhouette_scores = []
k_range = range(2, 12)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(df_pca)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(df_pca, labels, sample_size=10000))

# Elbow method: look for inflection in inertia curve
# Silhouette score: higher is better (range -1 to 1)
# Davies-Bouldin index: lower is better
# Business constraint: 5-8 segments is typically the maximum
# an RM team can operationalize meaningfully

# ============================================================
# VALIDATION BEYOND METRICS
# ============================================================
# Statistical: silhouette score, Davies-Bouldin, Calinski-Harabasz
# Stability: run clustering on 80% subsamples 20 times — do segments persist? (sketched below)
# Business: are segments meaningfully different on KEY business outcomes?

# Cross-validation of business relevance — refit at the chosen k
# (k_optimal is selected from the elbow/silhouette diagnostics above)
kmeans_final = KMeans(n_clusters=k_optimal, random_state=42, n_init=10)
labels = kmeans_final.fit_predict(df_pca)
for cluster in range(k_optimal):
    mask = labels == cluster
    print(f"Cluster {cluster}: "
          f"Default Rate={df[mask]['default_rate'].mean():.3f}, "
          f"Avg CLV={df[mask]['clv_estimate'].mean():.0f}, "
          f"Retention={df[mask]['12m_retention'].mean():.3f}")

Part (c) — Making Segments Actionable:

The most common failure mode: data science produces beautiful clusters that live in a slide deck and are never used. Prevention requires co-designing adoption from the start:

Step 1: Give segments business names, not numbers

| Cluster | Data Profile | Business Name | RM Strategy |
|---|---|---|---|
| 1 | High spend, high T&E, international | "Global Travelers" | Platinum Business upgrade, travel benefits |
| 2 | High spend, concentrated in supplier MCCs | "Supply Chain Managers" | Vendor Pay, Working Capital |
| 3 | Low spend, high card density, diversified | "Office Administrators" | Employee card optimization, expense tool |
| 4 | Volatile spend, late payment trend | "Cash Flow Stressed" | Working Capital, proactive risk outreach |
| 5 | Low engagement, single MCC | "Reluctant Adopters" | Re-engagement, benefits education |
| 6 | High spend growth, young account | "High-Growth Prospects" | Rapid deepening, referral program |

Step 2: Build a real-time segment assignment tool

# Deploy the scaler + PCA + KMeans pipeline as a scoring API
# New accounts get assigned to a segment within 60 days of first transactions
# Segment assignments update monthly as behavior evolves

def assign_segment(account_features: dict) -> int:
    # Order values explicitly to match the training feature order —
    # never rely on dict insertion order in a production scorer
    row = [[account_features[col] for col in feature_cols]]
    features_scaled = scaler.transform(row)
    features_pca = pca.transform(features_scaled)
    return int(kmeans_final.predict(features_pca)[0])

Step 3: Embed segments into RM workflows

  • Segment label visible in RM dashboard alongside account profile
  • Segment-specific playbook: "If this account is a Supply Chain Manager, lead with Vendor Pay in first QBR"
  • Track segment migration: accounts moving from "Reluctant Adopter" to "Office Administrator" is a win signal

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 17–19 minutes

✅ What a Strong Candidate Must Mention

  • HHI (Herfindahl Index) as a spend concentration measure — a sophisticated, domain-relevant feature choice that signals financial analytics experience
  • RobustScaler over StandardScaler for financial data — financial features have significant outliers that StandardScaler handles poorly
  • Cluster stability testing across bootstrap samples — silhouette score alone is insufficient; unstable clusters don't generalize to new data
  • The naming and operationalization step: the biggest failure mode for segmentation projects is beautiful math that produces no business change
  • Segment migration tracking: static segmentation is less valuable than understanding how accounts move between segments over time

🔁 Smart Follow-Up Questions

  1. "Your K-Means model assigns every account to exactly one segment. But an account that is 'Global Traveler' 6 months a year and 'Supply Chain Manager' the other 6 months gets mislabeled. How would you address this limitation?"
  1. "An RM complains that their highest-revenue client was assigned to the 'Reluctant Adopter' segment, which she thinks is insulting and wrong. How do you handle that conversation — and is it possible she's right?"
  1. "One year after deployment, how do you evaluate whether the segmentation actually improved business outcomes — what's your measurement framework?"


Question 9: Causal Inference & Uplift Modeling

📌 Question Title

Building an Uplift Model to Optimize Targeted Retention Offers


💬 Detailed Business Scenario

AmEx's retention team sends targeted fee waiver offers to at-risk Platinum cardmembers. The current approach uses a propensity-to-retain model (probability of staying given an offer) to select who receives outreach. The VP of Retention suspects this is suboptimal: the model targets members who would have stayed anyway — members with a high probability of retaining regardless of whether they receive an offer.

She asks you to build an uplift model (also called an incremental response model) that identifies cardmembers for whom the offer has the maximum causal impact on retention — not just the highest predicted retention rate.

(a) Explain the conceptual difference between a propensity model and an uplift model — and why targeting high-propensity members with retention offers is wasteful.

(b) Walk through the technical approach to building an uplift model using historical A/B test data (where 50% of at-risk members were randomly assigned to receive the offer).

(c) Your uplift model identifies 4 behavioral segments (the "four quadrants of uplift"). How do you allocate retention spend across these quadrants?


📋 Structured Model Answer

Part (a) — Propensity vs. Uplift: The Conceptual Gap:

PROPENSITY MODEL asks:
  "Who is most likely to stay?"
  → Targets members with P(retain | offer) = 0.90
  → Problem: P(retain | no offer) for this group is also 0.88
  → Incremental effect of offer = 0.90 - 0.88 = 0.02 (waste of offer cost)

UPLIFT MODEL asks:
  "For whom does the offer make the MOST DIFFERENCE?"
  → Targets members with:
    P(retain | offer) = 0.65 AND P(retain | no offer) = 0.40
  → Incremental effect = 0.65 - 0.40 = 0.25 (high impact)

THE FOUR QUADRANTS:
┌─────────────────────┬──────────────────────────────────┐
│                     │  P(retain | no offer)            │
│                     ├──────────────┬───────────────────┤
│                     │    HIGH      │      LOW          │
├────────────┬────────┼──────────────┼───────────────────┤
│ P(retain   │  HIGH  │ "Sure Things"│  "Persuadables"   │
│  | offer)  │        │ (don't waste │  ← TARGET THESE   │
│            │        │  offer here) │                   │
│            ├────────┼──────────────┼───────────────────┤
│            │  LOW   │ "Lost Causes"│  "Do Not Disturb" │
│            │        │ (offer won't │  (offer backfires)│
│            │        │  help)       │                   │
└────────────┴────────┴──────────────┴───────────────────┘

The critical insight: "Do Not Disturb" members may be MORE LIKELY
to cancel after receiving a retention offer (it signals desperation,
reminds them to think about the fee, or they resent the intrusion)

Part (b) — Technical Uplift Modeling Approach:

# Approach 1: Two-Model (T-Learner) — simplest, good baseline
# Train separate models on treatment and control groups,
# then subtract predicted probabilities

from sklearn.ensemble import GradientBoostingClassifier

# Split historical A/B data
treated = df[df['received_offer'] == 1]
control = df[df['received_offer'] == 0]

# Model 1: P(retain | offer, X) — trained on treatment group
model_treat = GradientBoostingClassifier(n_estimators=200)
model_treat.fit(treated[features], treated['retained'])

# Model 2: P(retain | no offer, X) — trained on control group
model_control = GradientBoostingClassifier(n_estimators=200)
model_control.fit(control[features], control['retained'])

# Uplift score = individual treatment effect estimate
df['uplift_score'] = (
    model_treat.predict_proba(df[features])[:, 1] -
    model_control.predict_proba(df[features])[:, 1]
)

# Approach 2: X-Learner — better for imbalanced treatment/control splits
# Approach 3: Causal Forest (GRF) — gold standard, handles heterogeneity
# (sketch of the econml API — parameters shown are illustrative):
# from econml.dml import CausalForestDML
# cf = CausalForestDML(model_y=GradientBoostingRegressor(),
#                      model_t=LogisticRegression(), discrete_treatment=True)
# cf.fit(Y=df['retained'], T=df['received_offer'], X=df[features])
# uplift = cf.effect(X_new)

# ============================================================
# EVALUATION: Uplift models can't be evaluated like standard classifiers
# No "true" individual treatment effect is observable (fundamental
# problem of causal inference — you can't observe both counterfactuals)

# Use QINI CURVE and AUUC (Area Under the Uplift Curve) instead:
def compute_qini_curve(df, uplift_col, treatment_col, outcome_col):
    df_sorted = df.sort_values(uplift_col, ascending=False).reset_index(drop=True)
    n = len(df_sorted)

    qini_values = []
    for k in range(1, n + 1):  # O(n²) as written — score on a decile grid in practice
        top_k = df_sorted.iloc[:k]
        treated_outcomes = top_k[top_k[treatment_col] == 1][outcome_col].sum()
        control_outcomes = top_k[top_k[treatment_col] == 0][outcome_col].sum()
        n_treat_k = top_k[treatment_col].sum()
        n_control_k = k - n_treat_k
        # Qini: treated successes minus control successes, with the control
        # count rescaled to the treated population size within the top-k bucket
        qini = treated_outcomes - control_outcomes * (n_treat_k / max(n_control_k, 1))
        qini_values.append(qini)
    return qini_values
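
To summarize the curve as a single number, a minimal sketch (assuming the function above; np.trapz integrates the curve, and random targeting corresponds to the straight line from 0 to the final Qini value):

import numpy as np

qini = compute_qini_curve(df, 'uplift_score', 'received_offer', 'retained')
auuc = np.trapz(qini)                                        # area under the model curve
random_area = np.trapz(np.linspace(0, qini[-1], len(qini)))  # random-targeting baseline
print(f"Qini coefficient (area above random): {auuc - random_area:,.1f}")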

Part (c) — Allocating Spend Across the Four Quadrants:

| Quadrant | Uplift Score | Recommended Action | Reasoning |
|---|---|---|---|
| Persuadables | High positive | Maximum spend; priority targeting | Every dollar of offer spend generates incremental retention |
| Sure Things | Near zero | No offer; save the budget | Will retain regardless; offer cost is pure waste |
| Lost Causes | Near zero or negative | No offer; consider product fix | Offer won't move the needle; underlying dissatisfaction is structural |
| Do Not Disturb | Negative | Explicitly exclude from outreach | Offer accelerates cancellation — a harmful intervention |

Budget optimization:

For each at-risk member i:
  Expected ROI of offer = uplift_score_i × LTV_i − offer_cost_i

Allocate offers in descending order of Expected ROI
until budget is exhausted

This is superior to:
  (a) Targeting everyone: wastes budget on Sure Things
  (b) Targeting by propensity: misses Persuadables, wastes on Sure Things
  (c) Targeting by CLV alone: high-CLV Sure Things absorb budget
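
A minimal greedy-allocation sketch of the rule above (the `uplift_score`, `ltv`, and `offer_cost` columns are illustrative assumptions):

def allocate_offers(df, budget):
    df = df.copy()
    # Expected incremental value of offering to each member
    df['expected_roi'] = df['uplift_score'] * df['ltv'] - df['offer_cost']
    # Only positive-ROI members, best first, until the budget runs out
    ranked = df[df['expected_roi'] > 0].sort_values('expected_roi', ascending=False)
    cumulative_cost = ranked['offer_cost'].cumsum()
    return ranked[cumulative_cost <= budget].index  # members to target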

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 18–20 minutes

✅ What a Strong Candidate Must Mention

  • The fundamental problem of causal inference: you can never observe both treatment and control outcomes for the same individual — uplift is always estimated, never directly observed
  • The "Do Not Disturb" quadrant: offer-averse members are a real phenomenon in financial services — retention offers can remind members to cancel. This insight is what separates uplift experts from beginners.
  • Qini curve / AUUC as the correct evaluation metric — standard AUC is meaningless for uplift models because there is no observable "true uplift" label
  • Causal Forest (GRF) as the current gold standard for heterogeneous treatment effect estimation — awareness of econml or grf packages signals genuine expertise
  • Budget allocation as an optimization problem — ranking by uplift alone ignores offer cost and member LTV; the correct objective is expected ROI per dollar spent

🔁 Smart Follow-Up Questions

  1. "Your uplift model was trained on A/B test data where 50% of members were randomly treated. In production, only 20% will receive offers. Does this affect how you apply the model — and why?"
  1. "A business stakeholder says: 'We should still target Sure Things because even if the offer doesn't change their behavior, it signals that AmEx values them.' How do you quantify and evaluate this argument?"
  1. "Six months after deploying the uplift model, you notice the 'Do Not Disturb' segment has grown from 8% to 19% of at-risk members. What are the most likely explanations?"


Question 10: Model Risk, Governance & Responsible AI

📌 Question Title

Auditing a Credit Decisioning Model for Fairness, Bias, and Regulatory Compliance


💬 Detailed Business Scenario

AmEx has been using an ML-based credit line assignment model for 18 months that sets initial credit limits for new cardmembers based on application data, bureau features, and behavioral signals. During a routine model audit, a fair lending compliance officer flags a concern: female applicants are receiving credit limits that are, on average, 12% lower than male applicants with similar creditworthiness indicators.

You are asked to lead the technical investigation as the model owner.

(a) Walk through a rigorous statistical framework to determine whether this disparity represents illegal discrimination, legitimate risk differentiation, or a modeling artifact.

(b) Your investigation reveals the disparity is largely explained by a proxy feature — "years at current employer" — which has different distributions by gender due to workforce participation patterns. What are your options, and what are the tradeoffs?

(c) How do you build a model governance framework that catches bias issues like this before they reach a compliance audit — and what does responsible AI look like in a credit decisioning context specifically?


📋 Structured Model Answer

Part (a) — Statistical Fairness Investigation Framework:

# ============================================================
# STEP 1: Establish the raw disparity (unadjusted gap)
# ============================================================
import scipy.stats as stats

male_limits = df[df['gender'] == 'M']['assigned_credit_limit']
female_limits = df[df['gender'] == 'F']['assigned_credit_limit']

# Test statistical significance of the gap (Welch's t-test: no equal-variance
# assumption, appropriate for groups of different sizes)
t_stat, p_value = stats.ttest_ind(male_limits, female_limits, equal_var=False)
effect_size = (male_limits.mean() - female_limits.mean()) / df['assigned_credit_limit'].std()

print(f"Raw gap: ${male_limits.mean() - female_limits.mean():.0f}")
print(f"Cohen's d (approximated with the overall std): {effect_size:.3f}")

# ============================================================
# STEP 2: Decompose the gap — Oaxaca-Blinder decomposition
# Separates "explained" (different risk profiles) from
# "unexplained" (same risk profiles, different outcomes) components
# ============================================================

from sklearn.linear_model import LinearRegression

# Fit identical model specifications on male and female subsamples
# (X_male / X_female: creditworthiness features, prepared upstream)
model_male = LinearRegression().fit(X_male, y_male_limit)
model_female = LinearRegression().fit(X_female, y_female_limit)

# Counterfactual: what limit would female applicants receive under the male model?
counterfactual_female_limit = model_male.predict(X_female)

# Explained: different feature distributions, priced identically
explained_gap = male_limits.mean() - counterfactual_female_limit.mean()
# Unexplained: same features, different treatment — the concerning component
unexplained_gap = counterfactual_female_limit.mean() - female_limits.mean()

# ============================================================
# STEP 3: Test for disparate impact (legal standard)
# Four-Fifths Rule (80% rule): the favorable-outcome (selection) rate for
# the protected class must be ≥ 80% of the reference group's rate
# ============================================================
# For credit limits: compare approval rates above threshold by gender
approval_threshold = 5000  # minimum viable credit limit

male_approval_rate = (df[df['gender']=='M']['assigned_credit_limit']
                      >= approval_threshold).mean()
female_approval_rate = (df[df['gender']=='F']['assigned_credit_limit']
                        >= approval_threshold).mean()

disparate_impact_ratio = female_approval_rate / male_approval_rate
print(f"Disparate Impact Ratio: {disparate_impact_ratio:.3f}")
# If < 0.80: potential fair lending (ECOA / Regulation B) violation flag

# ============================================================
# STEP 4: Fairness metrics beyond the four-fifths rule
# ============================================================
# Demographic Parity: P(high limit | Male) ≈ P(high limit | Female)
# Equalized Odds: TPR and FPR equal across gender groups
#   (at same predicted risk score, same credit limit assignment)
# Calibration: predicted default rate for score X = actual default rate
#   for score X, across both gender groups
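
A sketch of these group-wise checks (the `high_limit` binary decision and `defaulted` outcome column names are illustrative assumptions):

def fairness_by_group(df, group_col, decision_col, outcome_col):
    # Equalized odds compares TPR/FPR across groups; here "positive" means
    # receiving a high limit and "qualified" means not defaulting
    report = {}
    for g, sub in df.groupby(group_col):
        qualified = sub[outcome_col] == 0
        report[g] = {
            'approval_rate': sub[decision_col].mean(),        # demographic parity
            'TPR': sub.loc[qualified, decision_col].mean(),   # equalized odds (1/2)
            'FPR': sub.loc[~qualified, decision_col].mean(),  # equalized odds (2/2)
        }
    return report

print(fairness_by_group(df, 'gender', 'high_limit', 'defaulted'))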

Part (b) — Handling the Proxy Feature:

"Years at current employer" is a facially neutral feature that has disparate impact on female applicants due to structural workforce patterns (career interruptions, part-time work during caregiving periods). This is a classic proxy discrimination scenario.

Four options and their tradeoffs:

| Option | Description | Tradeoff |
|---|---|---|
| Remove the feature entirely | Drop "years at employer" from the model | Reduces discrimination risk; loses predictive power (AUC may drop 1–3 points); cleanest compliance posture |
| Replace with a less biased proxy | Use "employment stability score" combining tenure + job type + income consistency | May preserve predictive power while reducing gender correlation; requires validation |
| Apply fairness constraints during training | Add a regularization penalty for features correlated with protected class | Reduces bias but can reduce overall model accuracy; requires careful calibration |
| Post-processing adjustment | Adjust predicted limits upward for affected group to equalize outcomes | Legally risky — explicit group-based adjustment may itself violate ECOA; consult legal before implementing |

Recommendation: Option 1 (remove) is the safest regulatory posture. If the predictive power loss is material (more than ~3 points of AUC), explore Option 2 with rigorous disparate impact testing on the replacement feature. Option 4 (post-processing) should not be implemented without explicit legal sign-off.

Part (c) — Model Governance Framework:

PRE-DEPLOYMENT GOVERNANCE (Model Risk Management):
├── Mandatory fairness audit checklist before any credit model goes live:
│   ├── Disparate impact testing on all protected classes (ECOA: race,
│   │   color, religion, national origin, sex, marital status, age)
│   ├── Feature correlation analysis: flag any feature with
│   │   |correlation| > 0.15 with a protected class proxy
│   ├── Oaxaca-Blinder decomposition on model outputs
│   └── Four-fifths rule test across all protected class intersections
│
├── Model documentation requirements:
│   ├── Intended use, out-of-scope uses, and known limitations
│   ├── Training data representativeness analysis
│   └── Adversarial testing: how does the model behave on edge cases?
│
└── Independent validation team review (separate from model builders)

POST-DEPLOYMENT MONITORING:
├── Monthly disparate impact monitoring dashboard
│   → Alert threshold: DI ratio drops below 0.85 (early warning)
│   → Mandatory review threshold: DI ratio drops below 0.80
├── Model performance by demographic segment (not just overall)
├── Feature distribution monitoring (PSI per feature per demographic)
└── Quarterly fairness reports to Model Risk Committee
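
A minimal sketch of the per-feature PSI computation referenced above (the binning scheme and the conventional 0.1/0.25 alert thresholds are assumptions):

import numpy as np

def psi(expected, actual, bins=10):
    # Population Stability Index between a baseline sample and a current sample
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf          # catch values outside baseline range
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(actual, cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Run per feature AND per demographic slice; PSI > 0.25 → investigate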

RESPONSIBLE AI PRINCIPLES FOR CREDIT DECISIONING:
1. Explainability: every declined/reduced credit decision must be
   explainable to the applicant (ECOA adverse action notice requirements)
   → SHAP values for individual decision explanation
2. Contestability: clear process for applicants to dispute decisions
3. Human oversight: automated decisions above $X in impact require
   human review
4. Data minimization: don't use features you can't legally justify
5. Regular re-certification: models older than 18 months require
   a full fairness re-audit before continued use

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 18–20 minutes

✅ What a Strong Candidate Must Mention

  • ECOA (Equal Credit Opportunity Act) specifically — not just generic "fairness." AmEx operates under this regulation and a candidate who names it signals domain awareness
  • Oaxaca-Blinder decomposition to separate explained from unexplained gaps — this is the econometric gold standard for pay/outcome gap analysis and directly applicable here
  • Proxy discrimination as a legal concept: a facially neutral feature can be discriminatory if it has unjustified disparate impact on a protected class
  • The four-fifths rule as the legal standard for disparate impact in credit — knowing the specific threshold (80%) demonstrates regulatory fluency
  • SHAP for adverse action explanations: ECOA requires that applicants be told the specific reasons for adverse credit actions — black-box models create a compliance problem that SHAP helps solve

🔁 Smart Follow-Up Questions

  1. "You remove 'years at employer' and retrain the model. The gender gap reduces from 12% to 6%. Legal says 6% is still concerning. You've now exhausted your obvious proxy features. What do you do next?"
  1. "A model can satisfy demographic parity (equal approval rates) OR equalized odds (equal error rates across groups) — but mathematically, it often cannot satisfy both simultaneously. How do you decide which fairness criterion to optimize for in a credit context?"
  1. "The model governance team asks you to implement a fully explainable model (logistic regression with WOE encoding) instead of GBM, accepting a 4-point AUC reduction to gain interpretability. How do you frame the business tradeoff — and what is your recommendation?"

American Express — Product Manager Interview Question Bank


Question 1: Product Strategy & Roadmap Planning

📌 Question Title

Building a 12-Month Product Roadmap for a Redesigned Card Onboarding Experience


💬 Detailed Business Scenario

AmEx has identified that 38% of newly acquired Platinum cardmembers never activate a key benefit (lounge access, dining credit, or travel insurance) within their first 90 days. Internal data shows that cardmembers who activate at least 2 benefits within 90 days have a 3.4× higher 12-month retention rate than those who don't. The current onboarding flow is a generic welcome email sequence with a static PDF benefits guide.

You've been asked to own the end-to-end redesign of the onboarding product experience and build a 12-month roadmap to measurably improve benefit activation rates.

(a) How do you diagnose the root cause of low benefit activation before committing to any solution?

(b) Using a structured prioritization framework, how do you decide what to build first, second, and third on your roadmap?

(c) How do you define and track success — and what does "done" look like at 12 months?


📋 Structured Model Answer

Part (a) — Root Cause Diagnosis Before Building:

A strong PM never starts with solutions. The diagnostic phase should cover:

Quantitative discovery:

  • Funnel analysis: where in the onboarding flow do cardmembers drop off?
    • Email open rate → click-through rate → benefit page view → benefit activation attempt → successful activation
  • Segment the 38% non-activators: are they concentrated in a specific acquisition channel, demographic, or spend category?
  • Time-to-first-activation distribution: is the problem "never activate" or "activate too late"?

Qualitative discovery:

  • 15–20 user interviews with recent non-activators: "Walk me through what you did in the first week after receiving your card."
  • Usability testing on current onboarding flow: can a new cardmember find and activate the dining credit in under 3 minutes without help?
  • Card Satisfaction surveys (NPS at Day 7, Day 30, Day 90): are members aware benefits exist?

Hypotheses to test:

| Hypothesis | Signal to look for |
|---|---|
| Discovery problem | <40% of members visit benefits page within first 7 days |
| Complexity problem | High bounce rate on benefits pages; low activation completion rate |
| Relevance problem | Members activate 0 benefits but spend in benefit-eligible categories |
| Channel problem | Digital-first members activate more than direct mail acquired ones |

Part (b) — Roadmap Prioritization Framework:

Use RICE scoring (Reach × Impact × Confidence ÷ Effort) combined with a Now / Next / Later framework:

| Initiative | Reach | Impact | Confidence | Effort | RICE Score | Timeline |
|---|---|---|---|---|---|---|
| Personalized benefit highlight in welcome email (top 2 benefits matched to spend profile) | High | High | High | Low | 90 | Now (Q1) |
| In-app interactive onboarding checklist (gamified, progress-tracked) | High | High | Medium | Medium | 72 | Now (Q1–Q2) |
| Push notification benefit reminder at Day 7, 14, 30 (non-activators only) | High | Medium | High | Low | 68 | Now (Q2) |
| ML-personalized benefit recommendation engine (predict which benefit each member will value most) | Medium | High | Medium | High | 45 | Next (Q3) |
| Concierge-assisted onboarding call for high-CLV new members | Low | High | Medium | High | 35 | Next (Q3–Q4) |
| Integrated benefit activation within card activation flow (activate card → immediately enroll in top benefit) | High | High | Low | Very High | 28 | Later (Q4+) |
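
For reference, the arithmetic behind a RICE score, as a sketch (the numeric inputs below are illustrative assumptions, not the exact values behind the table):

# RICE = (Reach × Impact × Confidence) / Effort
# Reach: members touched per quarter; Impact: 0.25–3 scale;
# Confidence: 0–1; Effort: person-weeks — all values below are made up
def rice(reach, impact, confidence, effort):
    return reach * impact * confidence / effort

print(f"{rice(reach=60_000, impact=2, confidence=0.8, effort=1_000):.0f}")  # 96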

Part (c) — Success Definition & Metrics:

Primary metric (North Star):

  • % of new cardmembers activating ≥2 benefits within 90 days
  • Target: increase from 62% to 78% at 12 months

Leading indicators (weekly tracking):

  • Day 7 benefit page visit rate
  • Email open rate and benefit click-through rate by segment
  • Onboarding checklist completion rate (once built)
  • Day 30 first benefit activation rate

Guardrail metrics (must not degrade):

  • Unsubscribe rate from onboarding email sequence (<2%)
  • Push notification opt-out rate (<5% of enrolled members)
  • CS contact rate for "I don't know how to use my benefits" (should decrease)

Business outcome metrics (quarterly):

  • 90-day retention rate for activation cohorts vs. non-activation cohorts
  • Revenue impact: activation cohort's spend in benefit categories vs. baseline
  • Estimated LTV uplift per additional activated member

"Done" at 12 months:
The product is "done" when: (1) the personalization engine is live, (2) activation rates are trending toward 78%, and (3) a clear A/B-tested causal link between the new onboarding flow and retention improvement is established — not just a correlation.


📊 Difficulty Level: Medium

⏱ Expected Interview Time: 14–16 minutes

✅ What a Strong Candidate Must Mention

  • Discovery before roadmap: the biggest PM mistake is building a roadmap before understanding the root cause — diagnosis comes first
  • RICE or ICE scoring with explicit assumptions — not just a gut-feel prioritization
  • Leading vs. lagging indicators: activation rate is a leading indicator of retention; retention is the lagging business outcome — you need both
  • Personalization as a multiplier: a generic onboarding sequence treats every new Platinum member the same; behavioral personalization is the highest-leverage intervention
  • The causal attribution challenge: improving activation rate doesn't prove causality with retention — the roadmap must include an A/B test design to establish the link

🔁 Smart Follow-Up Questions

  1. "Engineering tells you they can only deliver one of your Q1 initiatives due to resource constraints. The personalized email and the in-app checklist are both scored equally. How do you decide which one ships first — and how do you make that case to engineering leadership?"
  1. "Three months in, your Day 30 activation rate has improved by 8 points, but 90-day retention hasn't moved. What are the possible explanations — and how does that change your Q3 roadmap?"
  1. "A senior stakeholder wants to add a 'premium concierge welcome call for all new Platinum members' to the roadmap as a Q1 priority. You think it's too expensive and operationally complex. How do you handle that conversation?"


Question 2: Payment Product Innovation

📌 Question Title

Designing a B2B Digital Payment Feature for Small Business Cardmembers


💬 Detailed Business Scenario

AmEx's small business cardmember segment ($1M–$10M annual revenue) is growing rapidly, but behavioral data shows that over 60% of these businesses still pay their largest suppliers via paper check or ACH bank transfer — transactions that don't run through their AmEx Business Card. This represents a significant untapped billed business opportunity, estimated at $40B+ annually across the segment.

You've been asked to define, scope, and build the business case for a B2B digital payment feature that enables small business cardmembers to pay their suppliers using their AmEx card — capturing spend that currently bypasses the AmEx network entirely.

(a) How do you validate that this is the right problem to solve before committing to building?

(b) Define the MVP — what is the minimum set of capabilities that makes this product genuinely useful, and what do you deliberately leave out of v1?

(c) What are the three biggest product risks — and how do you mitigate each one?


📋 Structured Model Answer

Part (a) — Problem Validation:

Three validation layers before building:

Layer 1 — Demand validation (is the problem real and painful enough?)

  • Survey 500 small business cardmembers: "What % of your supplier payments are currently on your AmEx card? What prevents you from putting more on it?"
  • Expected findings: supplier resistance (won't accept cards), transaction fees passed to buyer, ACH is free and embedded in accounting workflow
  • Benchmark: competitor products (e.g., Brex, Ramp, Bill.com, Divvy) — what has the market validated already?

Layer 2 — Willingness to pay / use validation

  • Prototype test: show mockups of a "Pay Your Suppliers" feature to 30 small business owners — would they use it? What would make them trust it?
  • Key question: would they absorb the interchange cost or require a rebate structure to make it work?

Layer 3 — Business model validation

  • Financial model: if AmEx captures 15% of the $40B addressable spend at 1.5% net interchange, that's roughly $90M in incremental annual revenue; even a 5% capture yields about $30M
  • Supplier adoption: without supplier enrollment, the feature doesn't work — validate that suppliers will accept or that a virtual card solution (where AmEx pays supplier via check/ACH but cardmember pays AmEx) resolves this

Part (b) — MVP Scoping:

What's IN v1:

  • Virtual card issuance for supplier payments: cardmember enters supplier details + payment amount → AmEx generates a single-use virtual card number → AmEx pays the supplier via ACH (no change to supplier behavior) → cardmember's AmEx balance carries the charge
  • Accounting software integration (QuickBooks, Xero): one-click payment initiation from within the tool the small business already uses
  • Payment scheduling: pay on due date, not today — critical for working capital management
  • Basic payment tracking: view status of supplier payments in AmEx dashboard

What's deliberately OUT of v1:

  • Multi-currency supplier payments (complexity; v2)
  • Supplier self-enrollment portal (v2 — first solve for cardmember side)
  • Invoice OCR / auto-capture (v3 — adds complexity without proving core usage)
  • Integration with 20+ accounting platforms (v1: QuickBooks + Xero = 70% of SMB market)

Design principle: v1 should feel like "I paid my supplier and it went on my AmEx" — not "I onboarded to a new payments platform."

Part (c) — Top 3 Product Risks & Mitigations:

| Risk | Description | Mitigation |
|---|---|---|
| Adoption risk | Small business owners are creatures of habit; ACH is free and embedded — switching to AmEx payment adds a step and potentially a cost | Integrate into existing accounting tools so the behavior change is minimal; offer a rebate or rewards multiplier for supplier payments to offset any perceived cost |
| Supplier friction risk | If AmEx pays suppliers via virtual card, suppliers may decline or charge card acceptance fees back to the buyer | Use ACH as the default payment rail to suppliers (invisible to them); virtual card is optional for suppliers who already accept cards |
| Credit risk concentration | Enabling large supplier payments on AmEx cards could dramatically increase exposure for small business accounts — a $200K supplier payment from a $500K credit limit business creates concentration risk | Implement smart credit limit management; supplier payment feature subject to separate sub-limit; underwriting review triggered above threshold payment size |

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 15–17 minutes

✅ What a Strong Candidate Must Mention

  • The virtual card as the technical solution to supplier non-acceptance — this is the key product insight that resolves the "suppliers don't accept AmEx" objection
  • Distribution strategy: integrating into QuickBooks/Xero is not a nice-to-have — it IS the go-to-market strategy for SMB products; standalone apps fail
  • Competitive awareness: Brex, Ramp, and Bill.com have moved aggressively into B2B payments — AmEx's advantage is the existing cardmember relationship and brand trust, not technology
  • The credit risk dimension: a PM who doesn't flag credit exposure from large supplier payments on revolving cards is missing a core financial services concern
  • Success metrics for v1: not just adoption rate, but incremental billed business per enrolled cardmember (did it actually capture new spend, or just shift spend from another AmEx product?)

🔁 Smart Follow-Up Questions

  1. "QuickBooks tells you the integration will take their team 9 months to build because AmEx is not a priority for them. What are your alternatives — and how does this change your v1 timeline and scope?"
  1. "Three months after launch, adoption is 4% of eligible small business cardmembers. Your VP asks whether to double down or pull back. What data do you want before making that recommendation?"
  1. "A competitor launches a nearly identical feature two months before your planned launch date. Do you accelerate, differentiate, or deprioritize? Walk me through your thinking."


Question 3: Data-Driven Decision Making & Customer Experience

📌 Question Title

Using Data to Redesign the Card Upgrade and Upsell Experience


💬 Detailed Business Scenario

AmEx has identified that Gold cardmembers who have held their card for 18–36 months represent the highest-conversion segment for upgrades to the Platinum card — but the current upgrade experience is a generic banner in the mobile app saying "Upgrade to Platinum." Conversion from the upgrade prompt to actual upgrade completion is only 2.3%, despite these being warm, engaged cardmembers who are already in the AmEx ecosystem.

You are the PM responsible for improving the Gold-to-Platinum upgrade conversion rate by building a more personalized, data-driven upgrade experience.

(a) What data do you analyze first to understand why the conversion rate is so low?

(b) Design a personalized upgrade experience — what signals do you use, and what does the experience look like for different cardmember sub-segments?

(c) How do you set up the experimentation program to improve conversion rate, and what metrics tell you you've succeeded — beyond just conversion rate?


📋 Structured Model Answer

Part (a) — Data Diagnosis:

Funnel decomposition (first thing to build):

Eligible Gold members (18–36 months tenure)
    ↓ [What % see the upgrade prompt?]
Upgrade prompt impressions
    ↓ [What % click?]
Upgrade prompt click-through (CTR)
    ↓ [What % start the application?]
Upgrade flow start
    ↓ [What % complete?]
Upgrade application submitted
    ↓ [What % are approved and activate?]
Upgrade completed (2.3% of impressions)

Each stage of the funnel is a separate problem requiring a separate solution. A 2.3% overall rate could mean:

  • 40% CTR but 5% completion (awareness is fine; the flow is broken)
  • 8% CTR and 30% completion (the prompt itself isn't compelling)
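
A sketch of the per-stage decomposition (the counts below are hypothetical; the point is that each ratio isolates one stage of the funnel):

funnel = {
    'eligible':    1_000_000,
    'impressions':   850_000,
    'clicks':         68_000,   # the 8% CTR scenario
    'flow_starts':    30_600,
    'submitted':      24_500,
    'completed':      19_600,   # ≈ 2.3% of impressions
}
stages = list(funnel)
for prev, curr in zip(stages, stages[1:]):
    print(f"{prev:>12} → {curr:<12} {funnel[curr] / funnel[prev]:6.1%}")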

Key analytical questions:

  • What is the timing of the prompt? (Showing an upgrade offer when a member just had a bad service experience is counterproductive)
  • Which Gold members have already used benefits that exist only on Platinum? (They've self-selected as Platinum candidates)
  • What is the fee sensitivity signal? Members paying $250/year for Gold who have realized $300+ in Gold benefits are demonstrably willing to pay for value — more likely to accept a $695 Platinum fee
  • What does the post-upgrade regret rate look like? If upgrade-to-cancel within 6 months is high, we're converting the wrong people

Part (b) — Personalized Upgrade Experience Design:

Segment-based personalization:

| Segment | Behavioral Signal | Personalized Message | Timing |
|---|---|---|---|
| T&E High Spender | >40% spend in travel/dining MCCs | Show: Centurion Lounge access, Fine Hotels & Resorts, $200 airline credit | After a travel transaction |
| Benefit Maximizer | Redeems all Gold credits consistently | Show: incremental benefits over Gold; frame as "You're already getting full value — here's what you're leaving on the table" | At annual fee renewal reminder |
| Status Seeker | Frequent flyer, hotel loyalty member | Emphasize Global Lounge Collection, hotel elite status benefits | After an airline transaction |
| Business Traveler | Mix of personal + business spend | Suggest Business Platinum as an alternative | After an international transaction |
| Fence-Sitter | Has clicked upgrade prompt 2+ times, never completed | Proactive RM outreach or chat offer to walk through benefits | Triggered by 2nd prompt click |

Experience design principles:

  • Show don't tell: replace the generic banner with a personalized ROI statement — "Based on your spending, you'd receive $1,340 in Platinum benefits annually"
  • Reduce friction in the upgrade flow: a Gold member already has an account — the upgrade should require minimal new information (no full re-application; a one-click upgrade with instant decision)
  • Timing matters: show the upgrade prompt within 24 hours of a trigger event (just returned from a trip, just had a dining experience at an Amex-eligible restaurant), not on a generic schedule

Part (c) — Experimentation Program Design:

Experiment hierarchy:

Layer 1 — MESSAGE TESTING (2-week sprint):
  Control: "Upgrade to Platinum" (current)
  Variant A: Personalized ROI message ("You'd earn $X more in benefits")
  Variant B: Social proof ("Members like you upgraded after 22 months")
  Variant C: Scarcity/urgency ("Limited-time upgrade offer with waived first-year fee")
  Primary metric: CTR on upgrade prompt
  Sample: 50K eligible Gold members per variant

Layer 2 — FLOW TESTING (4-week test):
  Control: Current multi-step upgrade application
  Variant: Streamlined one-click upgrade (pre-filled, instant decision)
  Primary metric: Completion rate from click to upgrade confirmation

Layer 3 — TIMING TESTING (6-week test):
  Control: Generic weekly prompt
  Variant: Trigger-based prompt (within 24h of qualifying spend event)
  Primary metric: Overall upgrade conversion rate (impressions → completed)

Success metrics beyond conversion rate:

| Metric | Why It Matters |
|---|---|
| Post-upgrade 6-month retention rate | High conversion + high early cancellation = we upgraded the wrong people |
| Post-upgrade spend lift | Did converting to Platinum unlock higher spend behavior? |
| Benefit activation rate at Day 30 (post-upgrade) | Upgraded members who don't activate Platinum benefits will regret the fee |
| Net upgrade revenue per experiment cohort | (Incremental fee revenue + spend revenue) − (cost of any incentive offered) |

📊 Difficulty Level: Medium–Hard

⏱ Expected Interview Time: 14–16 minutes

✅ What a Strong Candidate Must Mention

  • Funnel decomposition before solution design — the 2.3% overall rate is hiding where the real problem is; you need the per-stage breakdown
  • Post-upgrade regret as a guardrail metric — optimizing for conversion rate without tracking downstream retention is a vanity metric trap
  • Personalized ROI framing as the highest-leverage message change: "Here's what you personally would get" vs. "Here's what Platinum offers" — the former requires data, the latter doesn't
  • Trigger-based timing as a potentially more impactful variable than message content — showing the right offer at the right moment is the PM insight that separates this from a marketing problem
  • One-click upgrade flow: a warm, existing cardmember going through a full application is the #1 friction kill — the PM should advocate hard for engineering to solve this

🔁 Smart Follow-Up Questions

  1. "Your personalized ROI message tests 3× better than the control in CTR, but post-upgrade 6-month retention is 8 points lower for the ROI message group. How do you interpret that result and what do you do?"
  1. "The legal team flags that showing a personalized benefits value estimate in the upgrade prompt constitutes a 'financial promise' and needs compliance review, adding 6 weeks to your timeline. How do you respond?"
  1. "How do you ensure your upgrade experimentation program doesn't inadvertently show upgrade prompts to Gold members who are already at risk of canceling — and who would be better served with a retention offer instead?"


Question 4: A/B Testing, Experimentation & Stakeholder Management

📌 Question Title

Running a Rewards Program Redesign Experiment With High Organizational Stakes


💬 Detailed Business Scenario

AmEx is considering a significant change to the Membership Rewards program for Gold cardmembers: replacing the current flat 4× points on dining with a dynamic multiplier (2× to 6× points) that adjusts based on the cardmember's dining spend history and engagement level. The hypothesis is that high-frequency diners will be delighted by earning up to 6× points, driving more spend concentration on AmEx, while low-frequency diners get a baseline that still rewards them.

The Chief Rewards Officer loves the idea. The CFO is worried about rewards liability cost overrun. The Head of Restaurant Merchant Services is concerned it will upset merchant relationships. You are the PM responsible for running the experiment.

(a) How do you design the experiment — population, variants, success criteria, and guardrails?

(b) After 8 weeks, results show: dining spend up 9% in treatment, rewards liability up 14%, merchant satisfaction unchanged. How do you interpret and present these results to each of the three senior stakeholders?

(c) The Chief Rewards Officer wants to declare success and ship immediately. The CFO wants to run the experiment for another 12 weeks. How do you navigate this tension and make a recommendation?


📋 Structured Model Answer

Part (a) — Experiment Design:

Population & variants:

  • Eligible population: Gold cardmembers with ≥3 dining transactions in the prior 90 days (active diners; exclude members where dining isn't a category to avoid diluting the signal)
  • Randomization unit: individual cardmember (not household)
  • Control: Current 4× flat on dining
  • Treatment A: Dynamic 2×–6× (high-frequency diners get 5×–6×; occasional diners get 2×–3×)
  • Treatment B: Flat 5× on dining (simpler alternative; tests whether the multiplier boost itself — not the dynamic nature — drives spend)
    • Rationale for B: If B performs as well as A, the complexity of dynamic multipliers isn't justified

Pre-registered success criteria (defined before seeing data):

  • Primary metric: Dining billed business per cardmember (% lift vs. control)
  • Secondary metric: Dining transaction frequency (are members making more visits or just larger transactions?)
  • Financial guardrail: Rewards liability per $1 of dining spend must not increase >10% vs. control
  • Merchant guardrail: No statistically significant decline in merchant NPS for participating restaurant merchants
  • Minimum detectable effect: 5% lift in dining spend (pre-agreed with CFO as the minimum commercially meaningful improvement)
  • Duration: 12 weeks minimum (captures monthly billing cycles; avoids novelty effect in first 2 weeks)
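
A quick feasibility check on that MDE (a sketch; the coefficient of variation of per-member dining spend is an assumption, and the spend lift is treated as a simple difference in means):

from statsmodels.stats.power import TTestIndPower

cv = 1.5                      # assumed std/mean of dining spend per member
d = 0.05 / cv                 # 5% lift expressed as Cohen's d
n_per_arm = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.8)
print(f"~{n_per_arm:,.0f} members per arm to detect the 5% MDE")  # ≈ 14,000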

Part (b) — Interpreting and Communicating Results (8-Week Read):

Statistical context first: 8 weeks is premature for a rewards experiment. Members change spending patterns when they notice a new rewards structure — novelty effect can inflate treatment results in the first 4–6 weeks, which then regress. This is the first thing to say to all three stakeholders.

Tailored communication by stakeholder:

To the Chief Rewards Officer (wants to ship):

"The 9% dining spend lift is genuinely exciting and directionally validates the hypothesis. However, 8 weeks likely includes a novelty effect — members who noticed the higher multiplier changed behavior temporarily. We need 12 weeks to see whether the spend lift stabilizes or regresses. Shipping at 8 weeks and seeing a regression post-launch would be a much more damaging outcome than waiting 4 more weeks to be confident."

To the CFO (worried about liability):

"The 14% rewards liability increase is the number we need to watch most carefully. At 8 weeks, we don't know if this is temporary (high-value members front-loading dining) or structural. If liability stabilizes at +14% while spend is +9%, the economics are borderline. I'll give you the 12-week read alongside the net revenue model — the question is whether the incremental spend revenue exceeds the incremental rewards cost."

Net revenue framing for CFO:

Incremental dining spend: +9% × baseline dining revenue = +$X
Incremental interchange revenue on +9% spend = +$Y
Incremental rewards liability: +14% of rewards cost = −$Z
Net incremental margin = $Y − $Z
→ Present this at both 8-week annualized and 12-week projected

To the Head of Merchant Services (concerned about relationships):

"The merchant NPS data is clean — no deterioration at 8 weeks. The dynamic multiplier is cardmember-facing only; merchants don't see the multiplier change. I'll flag if that changes, but the current signal is stable."

Part (c) — Navigating the Stakeholder Tension:

This is a PM leadership moment — not a technical statistics problem:

  • Don't take sides between the Chief Rewards Officer and CFO. That's not your role. Your role is to provide the best possible recommendation grounded in data.
  • Frame the tradeoff explicitly: "Shipping at 8 weeks carries the risk of [list specific risks]; waiting 4 more weeks costs us [estimated revenue delay]. Here is my recommendation."
  • Make a clear recommendation:
"I recommend a structured path to full launch: extend the experiment to 12 weeks with a pre-agreed decision framework. If at 12 weeks the dining spend lift remains ≥7% AND rewards liability growth is ≤12% vs. control, we launch. If either condition is not met, we test Treatment B (flat 5×) before a full rollout decision. This gives the Chief Rewards Officer a clear, fast path to launch while giving the CFO the financial confidence gate she needs."
  • Document the decision: whoever makes the call should own it in writing. The PM facilitates; the senior stakeholders decide on pre-agreed criteria.

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 16–18 minutes

✅ What a Strong Candidate Must Mention

  • Novelty effect as the critical caveat in 8-week rewards experiment results — members consciously adjusting behavior when they notice a higher multiplier is a temporary, not sustainable, effect
  • Pre-registering success criteria before seeing data — defining "success" after seeing the results is p-hacking, and in a high-stakes financial product context, it's also an internal governance risk
  • Treatment B (flat 5×) as a critical control arm: testing dynamic complexity vs. a simple higher multiplier is the design insight that tells you whether the engineering complexity is justified
  • Net revenue model, not just spend lift: the Chief Rewards Officer will celebrate spend lift; the CFO will calculate whether the rewards cost outweighs the revenue — the PM must bridge both
  • The structured decision framework with pre-agreed launch criteria as the resolution to the stakeholder conflict — it depersonalizes the decision and makes it data-governed

🔁 Smart Follow-Up Questions

  1. "Your 12-week results show dining spend lift of 7.2% and rewards liability growth of 11.8% — both exactly at your pre-agreed thresholds. Do you launch? How do you make this call when the data is right on the boundary?"
  1. "Post-launch, you discover that 30% of the spend lift is coming from a small segment of 'gaming' cardmembers who are making many small dining transactions to maximize multiplier earnings. Does that change how you view the success of the experiment?"
  1. "How would you design the rollout sequencing — do you launch to all eligible Gold members at once, or do you stage the rollout? What are the risks of each approach?"


Question 5: Customer Acquisition, Retention & Churn Strategy

📌 Question Title

Designing a Proactive Churn Prevention Product for At-Risk Cardmembers


💬 Detailed Business Scenario

AmEx's analytics team has developed a churn prediction model that identifies Platinum cardmembers with a >40% probability of canceling within the next 6 months. The model currently flags approximately 85,000 cardmembers per quarter as high-risk. Today, the only intervention is a reactive retention call when a member actually initiates a cancellation — by which point, 60% of callers still cancel despite the offer.

You are the PM responsible for building a proactive churn prevention product — a systematic, scalable, personalized intervention system that engages at-risk members before they decide to cancel, using product features rather than just outreach calls.

(a) How do you think about the product design of a churn prevention system — what are the intervention layers, and how do you avoid making members feel surveilled or pressured?

(b) Prioritize 3 product interventions from a list of 8 candidates using a structured framework.

(c) How do you measure whether your churn prevention product is actually working — and what is the single most important metric you'd report to the VP of Card Products every month?


📋 Structured Model Answer

Part (a) — Churn Prevention Product Design Philosophy:

The core design tension: effective churn prevention requires acting on behavioral signals — but if members feel "watched" or receive offers that signal desperation, you can accelerate the very behavior you're trying to prevent.

Design principles:

  • Value delivery, not retention desperation: the product should feel like "AmEx is getting better at serving me" — not "AmEx knows I'm thinking about leaving"
  • Proactive, not reactive: intervene when you can add genuine value, not only when the churn signal is at its highest
  • Personalization over broadcast: a fee waiver offer to a member who canceled because they never used any benefits is a band-aid; a re-engagement with the specific unused benefit they were originally excited about is a solution

Intervention layer framework (ordered by subtlety and scalability):

LAYER 1 — Product & Experience Improvements (scale: all at-risk members)
  Passive interventions delivered through existing product surfaces
  Examples: personalized benefit reminder in monthly statement,
  in-app "You're leaving value on the table" notification,
  proactive annual fee value summary 60 days before renewal

LAYER 2 — Personalized Digital Outreach (scale: medium — top 50% of at-risk)
  Targeted, behavior-triggered communications
  Examples: "You haven't used your dining credit in 4 months —
  here are 3 restaurants near you that are Amex-eligible"

LAYER 3 — High-Touch Interventions (scale: top 20% of at-risk by CLV)
  Human or near-human interventions reserved for highest-value members
  Examples: RM outreach call, personalized video message from Concierge,
  exclusive experience invitation (cardmember event, early access)

LAYER 4 — Commercial Interventions (scale: final 10% — highest CLV, highest risk)
  Fee waiver, points bonus, product downgrade offer
  Use only when layers 1–3 haven't moved the signal
  Critical: track "offer acceptance → subsequent behavior" to ensure
  you're not training members to wait for offers at every renewal

Part (b) — Prioritization of 8 Intervention Candidates:

Using a Reach × Impact × Effort scoring framework (ICE-style), with an additional "Feel Good" dimension (does this feel like customer value or retention desperation?):

| Intervention | Reach | Impact | Effort | Feel Good? | Priority |
|---|---|---|---|---|---|
| Personalized annual fee value summary (email + in-app) 60 days before renewal | All 85K | High | Low | ✅ Natural timing | #1 — Ship First |
| In-app benefit utilization nudge (unused benefit + nearby redemption location) | Top 60K | High | Medium | ✅ Genuinely helpful | #2 — Ship Q2 |
| Proactive product downgrade offer (suggest Gold if Platinum benefits not used) | Top 30K | Medium | Low | ✅ Member-centric | #3 — Ship Q2 |
| Personalized points bonus for next qualifying spend (reactivation incentive) | Top 20K | High | Medium | ⚠️ Slightly transactional | #4 — Test |
| RM outreach call for top 5K by CLV | Top 5K | Very High | High | ✅ Premium feel | #5 — Phased |
| Fee waiver offer (1-year) | Top 10K | High | Low | ❌ Signals desperation | #6 — Last Resort |
| Cardmember exclusive event invitation | Top 2K | High | Very High | ✅ Premium experience | #7 — Pilot |
| Automated SMS with "We noticed you haven't used your lounge benefit" | All 85K | Low | Low | ❌ Surveillance feel | Deprioritized |

Part (c) — Measuring Success:

The single most important monthly metric:

Incremental 6-month retention rate for at-risk members who received a proactive intervention vs. a matched control group that did not.

This is the only metric that answers the causal question: "Is the product actually preventing churn, or are we just observing members who would have stayed anyway?"

Why not raw retention rate? Because the at-risk model may flag members who, upon reflection, weren't going to cancel — the raw retention rate of the treated group will look good regardless of whether the intervention worked.
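
A sketch of the monthly causal read (counts are hypothetical; assumes a randomized holdout of flagged members that receives no proactive intervention):

from statsmodels.stats.proportion import proportions_ztest

retained = [7_400, 6_900]    # still active at 6 months: treated, holdout
cohort   = [10_000, 10_000]  # at-risk members in each arm
z, p = proportions_ztest(retained, cohort)
lift = retained[0] / cohort[0] - retained[1] / cohort[1]
print(f"Incremental retention: {lift:+.1%} (p = {p:.4f})")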

Full metrics dashboard for VP reporting:

| Metric | Cadence | What It Tells You |
|---|---|---|
| Incremental retention rate (treatment vs. control) | Monthly | Causal proof the product works |
| Intervention engagement rate | Weekly | Are members actually interacting with Layer 1/2 nudges? |
| Benefit activation rate among at-risk members post-intervention | Monthly | Did the intervention solve the root cause (underutilization)? |
| Fee waiver usage rate | Monthly | Are we over-relying on commercial interventions (Layer 4)? |
| Post-intervention 12-month CLV vs. baseline | Quarterly | Are retained members genuinely re-engaged or just delayed churners? |
| False positive rate of churn model | Quarterly | Are we intervening with members who were never at risk? (waste + annoyance) |

📊 Difficulty Level: Hard

⏱ Expected Interview Time: 15–17 minutes

✅ What a Strong Candidate Must Mention

  • The "feel good" design dimension: interventions that feel like surveillance or desperation can accelerate churn — the product must be designed to feel like customer value delivery, not retention management
  • Proactive downgrade offer as a customer-centric intervention: recommending Gold to a member who isn't using Platinum benefits is counterintuitive from a revenue standpoint but is the right thing to do — and members who downgrade are far less likely to cancel entirely
  • Layer-based escalation framework: not every at-risk member needs a fee waiver — burning commercial budget on Sure Things is wasteful; reserve high-cost interventions for high-CLV, high-risk members
  • Causal measurement via holdout group: without a randomized control group, you cannot distinguish between "our product retained these members" and "these members were going to stay anyway"
  • Post-retention CLV tracking: a member who stays because of a fee waiver but disengages afterward is a retained member who is functionally still churning — the product must track whether retention is durable

🔁 Smart Follow-Up Questions

  1. "Your churn model has a 35% false positive rate — one in three members flagged as 'high risk' wasn't actually planning to cancel. How does this affect your intervention design — and is it a problem you need to solve before building the product?"
  1. "Six months after launch, retention of at-risk members has improved by 11 percentage points, but you notice that 22% of retained members still cancel within 12 months of the intervention. What does that tell you — and how does it change your product strategy?"
  1. "A competitor launches a feature that automatically matches any retention offer AmEx makes — effectively making retention offers a commodity. How does that change the long-term product strategy for churn prevention?"

📎 Complete Interview Question Summary

| # | Domain | Title | Difficulty | Time |
|---|---|---|---|---|
| 1 | Product Strategy & Roadmap | Redesigning the Card Onboarding Experience | Medium | 14–16 min |
| 2 | Payment Product Innovation | B2B Digital Payment Feature for SMBs | Hard | 15–17 min |
| 3 | Data-Driven Decision Making | Data-Led Card Upgrade & Upsell Experience | Medium–Hard | 14–16 min |
| 4 | A/B Testing & Stakeholder Mgmt | Rewards Program Redesign Experiment | Hard | 16–18 min |
| 5 | Acquisition, Retention & Churn | Proactive Churn Prevention Product | Hard | 15–17 min |

💡 Senior Interviewer Tip: The most revealing PM interview moments are when candidates are asked to make a recommendation with incomplete data — Questions 4c and 5c are specifically designed for this. Strong product managers at AmEx don't wait for perfect data; they define the decision criteria in advance, acknowledge uncertainty honestly, and make a clear, defensible recommendation. The candidates who stand out are those who protect the customer experience as fiercely as the business metrics — and who understand that in financial services, those two things are almost always aligned over the long term.