WPP Data Analyst and Data Scientist
Marketing Analytics & Attribution
1. Marketing Attribution Modeling and Multi-Touch Analytics
Level: Data Scientist, Senior Data Analyst
Difficulty: Very Hard
Source: GroupM Analytics Teams, Choreograph Data Science
Team: Media Analytics, Audience Insights
Interview Round: Technical Deep Dive
Question: “Explain how you would design a multi-touch attribution model for a client running campaigns across TV, digital display, paid social, search, and OOH. What methodology would you use, and how would you handle the challenge of offline-to-online attribution?”
Concise Answer:
Core Challenge: Multiple touchpoints influence conversion; need to assign credit appropriately to optimize budget allocation.
Attribution Methodologies Comparison:
| Method | Approach | Best For | Limitations |
|---|---|---|---|
| Last-Click | 100% credit to final touch | Direct response, paid search | Ignores awareness building |
| First-Click | 100% credit to initial touch | Brand awareness | Ignores conversion drivers |
| Linear | Equal credit to all touches | Simple, fair distribution | Doesn’t reflect actual impact |
| Time-Decay | More credit to recent touches | Considered purchases | Arbitrary decay function |
| Data-Driven | ML calculates actual contribution | Most accurate | Requires significant data |
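Before the hybrid recommendation below, a minimal sketch of one rule-based option from the table — time-decay weighting — helps anchor the comparison; the exponential half-life and the helper name are illustrative assumptions, not a standard library API:

import numpy as np

def time_decay_credits(touch_days, conversion_day, half_life_days=7.0):
    """Illustrative time-decay attribution: weight each touchpoint by an
    exponential decay on its distance from the conversion, then normalize
    so credits sum to 1 (half_life_days is an assumed tuning knob)."""
    touch_days = np.asarray(touch_days, dtype=float)
    days_before_conversion = conversion_day - touch_days
    # A touch half_life_days before conversion receives half the weight
    weights = 0.5 ** (days_before_conversion / half_life_days)
    return weights / weights.sum()

# Example: touches on days 0, 5, and 9; conversion on day 10
print(time_decay_credits([0, 5, 9], 10))  # Most credit goes to the day-9 touch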
Recommended Approach: Hybrid Model
1. For Digital Channels (Data-Driven MTA):
# Shapley Value Attribution Framework
import numpy as np
from itertools import combinations

class ShapleyAttribution:
    def __init__(self, conversion_data):
        self.data = conversion_data

    def calculate_shapley_values(self, touchpoints):
        """Calculate the marginal contribution of each channel.

        Note: this simplified estimate averages marginal contributions
        across all coalitions; exact Shapley values weight coalitions
        by size, but the difference is small for a handful of channels.
        """
        n_channels = len(touchpoints)
        shapley_values = {}
        for channel in touchpoints:
            marginal_contributions = []
            # For each possible coalition without this channel
            for r in range(n_channels):
                coalitions = combinations([c for c in touchpoints if c != channel], r)
                for coalition in coalitions:
                    # Contribution = Value(coalition + channel) - Value(coalition)
                    with_channel = self.conversion_rate(coalition + (channel,))
                    without_channel = self.conversion_rate(coalition)
                    marginal_contributions.append(with_channel - without_channel)
            shapley_values[channel] = np.mean(marginal_contributions)
        return shapley_values

    def conversion_rate(self, channels):
        """Get conversion rate for a channel combination"""
        # Filter to journeys touching these channels (empty coalition → 0)
        journeys = self.data[self.data['channels'].isin(channels)]
        if journeys.empty:
            return 0.0
        return journeys['converted'].mean()

# Usage
attribution_model = ShapleyAttribution(customer_journey_data)
channel_credits = attribution_model.calculate_shapley_values(
    ['display', 'social', 'search', 'email']
)

2. For Offline Channels (Marketing Mix Modeling):
import numpy as np
import pandas as pd
import statsmodels.api as sm

class MMM_Model:
    def build_mmm(self, data):
        """Regression-based attribution for TV and OOH"""
        # Dependent variable: sales
        y = data['sales']
        # Independent variables: marketing spend + controls
        X = pd.DataFrame({
            'tv_spend': data['tv_spend'],
            'ooh_spend': data['ooh_spend'],
            'digital_spend': data['digital_spend'],
            # Control variables
            'price': data['price'],
            'seasonality': data['month'],
            'competitor_activity': data['competitor_spend'],
            # Lag effects (TV impact takes time)
            'tv_spend_lag1': data['tv_spend'].shift(1),
            'tv_spend_lag2': data['tv_spend'].shift(2)
        })
        # Add adstock transformation (decay effect)
        X['tv_adstock'] = self.adstock_transform(data['tv_spend'], decay_rate=0.7)
        # Fit regression (fill the lag-induced NaNs in the first rows)
        X = sm.add_constant(X.fillna(0))
        model = sm.OLS(y, X).fit()
        # Extract ROI by channel
        roi = {
            'tv': model.params['tv_adstock'] / X['tv_spend'].mean(),
            'ooh': model.params['ooh_spend'] / X['ooh_spend'].mean(),
            'digital': model.params['digital_spend'] / X['digital_spend'].mean()
        }
        return model, roi

    def adstock_transform(self, spend, decay_rate):
        """Apply advertising decay effect"""
        spend = np.asarray(spend, dtype=float)
        adstock = np.zeros(len(spend))
        adstock[0] = spend[0]
        for t in range(1, len(spend)):
            adstock[t] = spend[t] + decay_rate * adstock[t - 1]
        return adstock

3. Offline-to-Online Attribution Solutions:
Unique Promo Codes:
- TV ad includes unique code → track digital conversions with code
- OOH billboard features specific landing page → measure traffic
Geo-Testing:
- Run TV in 50% of markets, not in control markets
- Compare digital conversion lift in TV markets vs. control
- Incremental digital conversions = digital conversions in TV markets − digital conversions in control markets (see the sketch below)
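A minimal sketch of that geo-lift arithmetic, assuming a market-level DataFrame with hypothetical 'group' and 'digital_conversions' columns:

import pandas as pd

def geo_test_lift(markets: pd.DataFrame) -> dict:
    """Estimate TV-driven incremental digital conversions from a geo split.
    Assumes one row per market with 'group' in {'tv', 'control'} and
    'digital_conversions' totals over a matched test window."""
    tv_mean = markets.loc[markets['group'] == 'tv', 'digital_conversions'].mean()
    control_mean = markets.loc[markets['group'] == 'control', 'digital_conversions'].mean()
    incremental_per_market = tv_mean - control_mean
    return {
        'incremental_per_market': incremental_per_market,
        'total_incremental': incremental_per_market * (markets['group'] == 'tv').sum(),
        'relative_lift': incremental_per_market / control_mean,
    }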
Brand Lift Studies:
- Survey exposed vs. unexposed audiences
- Measure awareness → search behavior correlation
- Quantify TV’s role in driving branded search volume
Foot Traffic Attribution:
- Partner with location data providers
- Track store visits from exposed audiences
- Connect in-store purchases to digital touchpoints
Integrated Framework:
┌─────────────────────────────────────────────────┐
│ Customer Journey with Multi-Channel Exposure │
└─────────────────────────────────────────────────┘
│
├─→ Digital Channels → Data-Driven MTA
│ (Search, Display, Social, Email)
│ • Shapley values
│ • 95% attribution confidence
│
├─→ Offline Channels → Marketing Mix Model
│ (TV, OOH, Radio)
│ • Regression-based ROI
│ • Adstock effects
│
└─→ Cross-Channel Validation
• Geo-experiments
• Hold-out testing
        • Incrementality studies

Implementation Recommendations:
- Data Requirements:
- Minimum 12 months historical data
- User-level journey data for digital
- Weekly aggregated spend and sales for MMM
- 500+ conversions per month for statistical power
- Validation:
- Compare total attributed conversions to actual conversions (should sum to 100%)
- Back-test predictions against a holdout period (see the sketch after this list)
- Run incrementality experiments to validate model accuracy
- Business Application:
- Provide channel-specific ROAS estimates
- Recommend budget reallocation (shift from over-invested to under-invested channels)
- Update monthly as campaign performance evolves
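A hedged sketch of the back-test step above, assuming the weekly MMM design matrix and sales series from the earlier snippet (the 12-week holdout and the MAPE cutoff are assumptions):

import numpy as np
import statsmodels.api as sm

def backtest_mmm(X, y, holdout_weeks=12):
    """Refit the MMM on all but the last `holdout_weeks` of data, then
    compare predicted sales to actuals in the holdout window."""
    X_train, y_train = X.iloc[:-holdout_weeks], y.iloc[:-holdout_weeks]
    X_hold, y_hold = X.iloc[-holdout_weeks:], y.iloc[-holdout_weeks:]
    model = sm.OLS(y_train, X_train).fit()
    preds = model.predict(X_hold)
    mape = np.mean(np.abs((y_hold - preds) / y_hold)) * 100
    return model, mape  # e.g., investigate the model if MAPE exceeds ~15%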
Expected Outcomes:
- Digital attribution: 85-90% confidence in credit allocation
- TV/OOH ROI estimates: ±20% margin of error (typical for MMM)
- Budget optimization: 10-15% efficiency improvement through reallocation
Statistical Testing & Experimentation
2. A/B Testing Design and Analysis for Marketing Campaigns
Level: Data Analyst, Data Scientist
Difficulty: Hard
Source: Standard GroupM Analytics Assessment
Team: Campaign Analytics, Performance Measurement
Interview Round: Technical Screening
Question: “Design an A/B test to evaluate whether a new creative treatment improves campaign conversion rate. Walk through sample size calculation, randomization strategy, test duration, statistical test selection, and how you’d interpret results. What would you do if results are borderline significant?”
Concise Answer:
Test Design Framework:
1. Hypothesis Definition:
- H₀ (Null): New creative has same conversion rate as control
- H₁ (Alternative): New creative increases conversion rate
- Primary Metric: Conversion rate (conversions ÷ impressions)
- Secondary Metrics: CTR, CPC, time to conversion
2. Sample Size Calculation:
import scipy.stats as stats
import numpy as np
def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """Calculate required sample size per variant.

    Parameters:
    - baseline_rate: Current conversion rate (e.g., 0.02 for 2%)
    - mde: Minimum detectable effect (e.g., 0.10 for 10% relative lift)
    - alpha: Significance level (Type I error rate)
    - power: Statistical power (1 - Type II error rate)
    """
    # Effect size (Cohen's h)
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde)
    effect_size = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))
    # Calculate sample size using power analysis
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))

# Example: Detect a 10% relative lift on a 2% baseline
baseline_cr = 0.02
min_detectable_effect = 0.10
n_per_variant = calculate_sample_size(baseline_cr, min_detectable_effect)
print(f"Required sample size per variant: {n_per_variant:,}")
print(f"Total sample size: {n_per_variant * 2:,}")
# Output: ~40,300 per variant (~80,600 total)

3. Randomization Strategy:
class ABTestRandomizer:
    def __init__(self, variants=['control', 'treatment'], split=[0.5, 0.5]):
        self.variants = variants
        self.split = split

    def assign_variant(self, user_id):
        """Deterministic random assignment based on user_id hash"""
        import hashlib
        # Hash user_id for consistent assignment
        hash_value = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        # Assign to a variant bucket based on hash modulo
        percentile = (hash_value % 100) / 100
        cumulative = 0
        for variant, proportion in zip(self.variants, self.split):
            cumulative += proportion
            if percentile < cumulative:
                return variant
        return self.variants[-1]

# Usage
randomizer = ABTestRandomizer()
user_variant = randomizer.assign_variant(user_id=12345)

4. Test Duration:
- Minimum: run until the required sample size is reached (see the duration sketch after this list)
- Typical: 2-4 weeks to account for:
- Day-of-week effects (weekday vs. weekend behavior)
- Weekly cycles in business
- Seasonal patterns
- Early Stopping: Only if pre-defined stopping rules met (e.g., overwhelming evidence)
- Avoid: Continuous p-value checking (“p-hacking” inflates false positive rate)
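Tying duration back to the sample-size calculation, a small sketch (the daily traffic figure is a made-up input):

import math

def min_test_duration_days(n_per_variant, daily_eligible_users, n_variants=2):
    """Days needed to reach the target sample size, rounded up to whole
    weeks to smooth day-of-week effects, with a two-week floor."""
    days = math.ceil(n_per_variant * n_variants / daily_eligible_users)
    weeks = math.ceil(days / 7)
    return max(weeks * 7, 14)

# Example: ~40,300 per variant at 5,000 eligible users/day → 21 days
print(min_test_duration_days(40_300, 5_000))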
5. Statistical Analysis:
def analyze_ab_test(control_conversions, control_n, treatment_conversions, treatment_n):
    """Two-proportion z-test for conversion rates"""
    # Conversion rates
    p_control = control_conversions / control_n
    p_treatment = treatment_conversions / treatment_n
    # Pooled proportion (under the null hypothesis)
    p_pooled = (control_conversions + treatment_conversions) / (control_n + treatment_n)
    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/control_n + 1/treatment_n))
    # Z-statistic
    z_stat = (p_treatment - p_control) / se
    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    # Confidence interval for the difference (unpooled standard error)
    se_diff = np.sqrt(p_control*(1-p_control)/control_n + p_treatment*(1-p_treatment)/treatment_n)
    ci_lower = (p_treatment - p_control) - 1.96 * se_diff
    ci_upper = (p_treatment - p_control) + 1.96 * se_diff
    # Relative lift
    relative_lift = (p_treatment - p_control) / p_control
    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'absolute_diff': p_treatment - p_control,
        'relative_lift': relative_lift,
        'ci_95': (ci_lower, ci_upper),
        'z_statistic': z_stat,
        'p_value': p_value,
        'significant': p_value < 0.05
    }

# Example
results = analyze_ab_test(
    control_conversions=320,
    control_n=16000,
    treatment_conversions=374,
    treatment_n=16000
)
print(f"Control CR: {results['control_rate']:.3%}")
print(f"Treatment CR: {results['treatment_rate']:.3%}")
print(f"Relative Lift: {results['relative_lift']:.1%}")
print(f"P-value: {results['p_value']:.4f}")
print(f"95% CI: [{results['ci_95'][0]:.3%}, {results['ci_95'][1]:.3%}]")
print(f"Statistically Significant: {results['significant']}")

6. Handling Borderline Results (p = 0.06):
Don’t Automatically Dismiss:
- p = 0.06 means a 6% chance of observing an effect at least this large if the null hypothesis were true
- Statistical significance is an arbitrary threshold, not absolute truth
Decision Framework:
| Factor | Consideration |
|---|---|
| Confidence Interval | If CI for lift is [+2%, +18%], practical significance exists even if p > 0.05 |
| Business Cost | Low cost to implement new creative → accept higher risk threshold |
| Prior Evidence | If aligns with previous tests showing lift, increases credibility |
| Sample Size | If barely missed target, consider extending test |
| Effect Size | Large effect (>20% lift) with p=0.06 more compelling than small effect |
Recommended Approach:
1. Present full picture: “p = 0.06, 95% CI shows [+1%, +15%] lift”
2. Quantify uncertainty: “If the new creative had no effect, a lift this large would appear only ~6% of the time” (a Bayesian framing is sketched after this list)
3. Business decision: “Given low implementation cost and positive signal, recommend rolling out with continued monitoring”
4. Alternative: “Run follow-up test with larger sample to confirm”
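One hedged way to put a number on “quantify uncertainty” is a Beta-Binomial simulation of the probability that the treatment truly beats control; this Bayesian framing is an alternative to the z-test above, with uniform Beta(1, 1) priors assumed:

import numpy as np

def prob_treatment_beats_control(conv_c, n_c, conv_t, n_t, n_draws=100_000, seed=42):
    """Monte Carlo estimate of P(treatment rate > control rate) under
    independent Beta(1, 1) priors on each variant's conversion rate."""
    rng = np.random.default_rng(seed)
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, n_draws)
    treatment = rng.beta(1 + conv_t, 1 + n_t - conv_t, n_draws)
    return (treatment > control).mean()

# Using the example counts from the z-test above
print(f"P(treatment > control) ≈ {prob_treatment_beats_control(320, 16000, 374, 16000):.1%}")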
7. Validity Checks:
from scipy.stats import chisquare

def validate_test(control_data, treatment_data):
    """Check for violations of test assumptions"""
    checks = {}
    # 1. Sample Ratio Mismatch (SRM): chi-square on assignment counts vs. a 50/50 split
    n_control, n_treatment = len(control_data), len(treatment_data)
    expected = (n_control + n_treatment) / 2
    chi2_stat, srm_p_value = chisquare([n_control, n_treatment], [expected, expected])
    checks['srm_violation'] = srm_p_value < 0.001
    # 2. Novelty effect (check if lift decays over time)
    treatment_data['week'] = treatment_data['date'].dt.isocalendar().week
    weekly_cr = treatment_data.groupby('week')['converted'].mean()
    checks['novelty_effect'] = weekly_cr.iloc[0] > weekly_cr.iloc[-1] * 1.5
    # 3. Segment heterogeneity (analyze_segment assumed to be defined elsewhere)
    for segment in ['new_users', 'returning_users']:
        segment_effect = analyze_segment(control_data, treatment_data, segment)
        checks[f'{segment}_effect'] = segment_effect
    return checks

Key Principles:
- Pre-register analysis plan: Define stopping rules and primary metric before starting
- One primary metric: multiple testing inflates the false positive rate (a simple correction is sketched after this list)
- Guard rails: Monitor secondary metrics for negative unintended effects
- Segment analysis: Check if effects differ by user type, device, etc.
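If guardrail metrics are formally tested alongside the primary metric, a Bonferroni-style correction is one simple safeguard; a minimal sketch with placeholder metric names:

def bonferroni_flags(p_values: dict, alpha=0.05) -> dict:
    """Compare each metric's p-value to alpha split across all tests."""
    adjusted_alpha = alpha / len(p_values)
    return {metric: p < adjusted_alpha for metric, p in p_values.items()}

# Example: one primary metric plus two guardrails (placeholder values)
print(bonferroni_flags({'conversion_rate': 0.004, 'ctr': 0.20, 'bounce_rate': 0.04}))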
Expected Outcome: Demonstrate rigorous statistical thinking, awareness of practical considerations (not just theoretical), and ability to communicate uncertainty to business stakeholders.
Machine Learning & Predictive Analytics
3. Customer Segmentation Using Unsupervised Learning
Level: Data Scientist
Difficulty: Hard
Source: Choreograph Analytics, GroupM Audience Insights
Team: Consumer Analytics, Audience Strategy
Interview Round: ML Technical Assessment
Question: “You have a dataset with customer demographics, purchase history, website behavior, and engagement metrics. How would you segment this customer base to inform targeted marketing strategies? Walk through your approach from data preparation through final segment definition.”
Concise Answer:
Step 1: Feature Engineering
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
def create_behavioral_features(customer_data):
    """Engineer marketing-relevant features"""
    features = pd.DataFrame()
    # RFM (Recency, Frequency, Monetary)
    features['recency_days'] = (pd.Timestamp.now() - customer_data['last_purchase_date']).dt.days
    features['frequency'] = customer_data['purchase_count_12m']
    features['monetary'] = customer_data['total_spend_12m']
    features['avg_order_value'] = customer_data['total_spend_12m'] / (customer_data['purchase_count_12m'] + 1)
    # Engagement
    features['email_open_rate'] = customer_data['emails_opened'] / (customer_data['emails_sent'] + 1)
    features['website_sessions'] = customer_data['sessions_3m']
    features['pages_per_session'] = customer_data['total_pageviews'] / (customer_data['sessions_3m'] + 1)
    # Product preferences
    features['product_diversity'] = customer_data['unique_product_categories']
    features['pref_category_pct'] = customer_data['top_category_spend'] / (customer_data['total_spend_12m'] + 1)
    # Demographics (one-hot encode)
    demographics = pd.get_dummies(customer_data[['age_group', 'location_tier']], drop_first=True)
    return pd.concat([features, demographics], axis=1)

features = create_behavioral_features(customer_data)

Step 2: Clustering Algorithm Selection
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Normalize features (critical for distance-based clustering)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Method 1: K-Means (most common)
def find_optimal_k(data, k_range=range(2, 11)):
    """Elbow method + silhouette analysis"""
    inertias = []
    silhouette_scores = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(data)
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(data, labels))
    # Plot elbow curve
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(k_range, inertias, marker='o')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method')
    plt.subplot(1, 2, 2)
    plt.plot(k_range, silhouette_scores, marker='o')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Analysis')
    plt.tight_layout()
    plt.show()
    # Recommend k (high silhouette, interpretable number)
    optimal_k = k_range[np.argmax(silhouette_scores)]
    return optimal_k

optimal_k = find_optimal_k(features_scaled)
print(f"Recommended number of clusters: {optimal_k}")

# Fit final model
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)
segments = kmeans.fit_predict(features_scaled)

Step 3: Segment Profiling
def profile_segments(data, segments, features):
    """Create business-friendly segment descriptions"""
    data['segment'] = segments
    profiles = data.groupby('segment').agg({
        'recency_days': ['mean', 'median'],
        'frequency': ['mean', 'median'],
        'monetary': ['mean', 'median', 'sum'],
        'email_open_rate': 'mean',
        'website_sessions': 'mean'
    }).round(2)
    # Size and value contribution
    segment_sizes = data.groupby('segment').size()
    segment_revenue = data.groupby('segment')['monetary'].sum()
    profiles['size'] = segment_sizes
    profiles['size_pct'] = (segment_sizes / len(data) * 100).round(1)
    profiles['revenue_contribution'] = segment_revenue
    profiles['revenue_pct'] = (segment_revenue / segment_revenue.sum() * 100).round(1)
    return profiles

profiles = profile_segments(customer_data, segments, features)
print(profiles)

Step 4: Segment Naming and Strategy
Example Segmentation Results:
| Segment | Description | Size | Revenue % | Strategy |
|---|---|---|---|---|
| Champions | High frequency, high spend, engaged | 12% | 35% | VIP program, early access, retention focus |
| Loyal Customers | Regular purchasers, moderate spend | 22% | 28% | Loyalty rewards, cross-sell opportunities |
| Promising | Recent customers, growing spend | 18% | 15% | Nurture campaigns, onboarding programs |
| At Risk | Previously active, declining engagement | 15% | 12% | Win-back campaigns, special offers |
| Bargain Hunters | Only purchase on promotion | 20% | 8% | Targeted promotions, clearance alerts |
| Lost | No activity >6 months | 13% | 2% | Reactivation campaign or deprioritize |
Marketing Actions by Segment:
def generate_segment_actions(segment_name):
    """Map segments to marketing tactics"""
    strategies = {
        'Champions': {
            'channels': ['Email (premium content)', 'Direct mail', 'SMS'],
            'messaging': 'Exclusive access, personalized recommendations',
            'offers': 'Early product launches, VIP events',
            'frequency': 'Weekly engagement'
        },
        'At Risk': {
            'channels': ['Email (re-engagement)', 'Retargeting ads', 'SMS'],
            'messaging': 'We miss you, special comeback offer',
            'offers': '20% discount, free shipping',
            'frequency': 'Bi-weekly for 2 months'
        },
        'Bargain Hunters': {
            'channels': ['Email (promotions)', 'Social ads'],
            'messaging': 'Flash sales, limited-time offers',
            'offers': 'Discount codes, BOGO deals',
            'frequency': 'Promotion-based (weekly)'
        }
    }
    return strategies.get(segment_name, {})

Step 5: Validation and Actionability
def validate_segmentation(data, segments):
    """Ensure segments are statistically distinct and actionable"""
    validations = {}
    # 1. Statistical separation (ANOVA on a key metric)
    from scipy.stats import f_oneway
    segment_groups = [data[segments == i]['monetary'] for i in range(max(segments) + 1)]
    f_stat, p_value = f_oneway(*segment_groups)
    validations['statistically_distinct'] = p_value < 0.05
    # 2. Segment stability: re-cluster on different time periods and
    #    check whether customers remain in the same segments
    # 3. Actionability checklist
    validations['identifiable'] = True  # Can we assign new customers to a segment?
    validations['substantial'] = all(data.groupby(segments).size() > len(data) * 0.05)  # Each >5% of base?
    validations['differentiable'] = True  # Do segments have distinct characteristics?
    return validations

validation_results = validate_segmentation(customer_data, segments)

Alternative: Hierarchical Clustering for Nested Segments
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Hierarchical clustering
linkage_matrix = linkage(features_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.show()

# Fit hierarchical model
hierarchical = AgglomerativeClustering(n_clusters=optimal_k, linkage='ward')
hierarchical_segments = hierarchical.fit_predict(features_scaled)

Business Deliverables:
1. Segment definitions dashboard (size, characteristics, value)
2. Customer assignment API (score new customers in real time; see the sketch below)
3. Campaign playbook (recommended tactics per segment)
4. Performance tracking (segment migration, revenue by segment)
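A minimal sketch of deliverable 2, reusing the fitted `scaler` and `kmeans` objects from the steps above (assumes new customers arrive with the same raw columns used in Step 1):

def assign_segment(new_customer_data):
    """Score new customers into the existing segments: rebuild the same
    behavioral features, apply the already-fitted scaler (no refitting),
    then assign each customer to the nearest K-Means centroid."""
    new_features = create_behavioral_features(new_customer_data)
    new_scaled = scaler.transform(new_features)  # transform only; never refit on new data
    return kmeans.predict(new_scaled)

# Example: one segment label per new customer row
# segment_labels = assign_segment(new_customers_df)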
Expected Outcome: Actionable segments that marketing teams can operationalize, demonstrating both technical ML skills and business translation ability.
4. Predictive Modeling for Customer Churn
Level: Data Scientist, Senior Data Analyst
Difficulty: Very Hard
Source: GroupM Predictive Analytics, Choreograph ML Team
Team: Customer Analytics, Retention Strategy
Interview Round: ML System Design
Question: “Build a model to predict which customers are likely to churn in the next 90 days. Walk through your approach: feature engineering, model selection, handling class imbalance, evaluation metrics, and how you’d present results to marketing stakeholders.”
Concise Answer:
End-to-End Churn Prediction Pipeline:
1. Feature Engineering for Churn:
def create_churn_features(customer_data, transaction_data, engagement_data):
    """Engineer predictive features for 90-day churn"""
    features = pd.DataFrame(index=customer_data.index)
    # Recency & trend features
    features['days_since_last_purchase'] = (pd.Timestamp.now() - customer_data['last_purchase_date']).dt.days
    features['days_since_last_interaction'] = (pd.Timestamp.now() - engagement_data['last_activity_date']).dt.days
    # Declining engagement (compare recent vs. historical)
    features['purchase_frequency_30d'] = transaction_data.groupby('customer_id').apply(
        lambda x: len(x[x['date'] > pd.Timestamp.now() - pd.Timedelta(days=30)])
    )
    features['purchase_frequency_90d'] = transaction_data.groupby('customer_id').apply(
        lambda x: len(x[x['date'] > pd.Timestamp.now() - pd.Timedelta(days=90)])
    )
    features['engagement_decline'] = features['purchase_frequency_90d'] - features['purchase_frequency_30d'] * 3
    # Spend patterns
    features['avg_order_value_trend'] = customer_data['recent_aov'] - customer_data['historical_aov']
    features['total_spend_decline_pct'] = (customer_data['spend_last_90d'] - customer_data['spend_prev_90d']) / (customer_data['spend_prev_90d'] + 1)
    # Product category shift (high = loyal to one category)
    features['category_consistency'] = customer_data['top_category_spend_pct']
    # Customer service interactions (negative signal)
    features['complaints_30d'] = engagement_data['complaints_count_30d']
    features['support_contacts_30d'] = engagement_data['support_interactions_30d']
    # Email/communication engagement
    features['email_open_rate'] = engagement_data['emails_opened'] / (engagement_data['emails_sent'] + 1)
    features['unsubscribe_flag'] = engagement_data['email_unsubscribed'].astype(int)
    # Tenure
    features['customer_age_days'] = (pd.Timestamp.now() - customer_data['signup_date']).dt.days
    # Promotion dependency
    features['discount_purchase_rate'] = customer_data['purchases_with_discount'] / (customer_data['total_purchases'] + 1)
    return features

features = create_churn_features(customer_data, transactions, engagement)

2. Handle Severe Class Imbalance:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
# Typical churn rate: 5-15%
print(f"Churn rate: {y.mean():.2%}")

# Strategy 1: SMOTE + undersampling
smote_undersampling = ImbPipeline([
    ('smote', SMOTE(sampling_strategy=0.3, random_state=42)),  # Oversample minority to 30% of majority
    ('undersample', RandomUnderSampler(sampling_strategy=0.5, random_state=42))  # Then undersample to a 1:2 minority:majority ratio
])
X_resampled, y_resampled = smote_undersampling.fit_resample(X_train, y_train)

# Strategy 2: Class weights (no resampling)
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}

3. Model Training and Selection:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Baseline: Logistic Regression (interpretable)
logreg = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)

# Tree-based models (handle non-linearity)
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    class_weight='balanced',
    random_state=42
)
xgb = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),  # Class imbalance ratio
    random_state=42
)

# Compare models
models = {'LogisticRegression': logreg, 'RandomForest': rf, 'XGBoost': xgb}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f"{name} - Mean AUC: {scores.mean():.3f} (±{scores.std():.3f})")

# Select best model (typically XGBoost)
best_model = xgb
best_model.fit(X_train, y_train)

4. Evaluation Metrics (Beyond Accuracy):
from sklearn.metrics import (classification_report, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score, recall_score,
                             roc_auc_score)
import matplotlib.pyplot as plt

y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Metrics for imbalanced classification
def evaluate_churn_model(y_true, y_pred_proba, threshold=0.5):
    """Comprehensive model evaluation"""
    y_pred = (y_pred_proba >= threshold).astype(int)
    # ROC-AUC (overall discrimination)
    auc = roc_auc_score(y_true, y_pred_proba)
    # Precision-recall at threshold
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    # Business metrics
    retention_offer_cost = 20      # Cost to send retention offer
    customer_ltv = 500             # Average customer lifetime value
    retention_success_rate = 0.30  # 30% of offers succeed
    # Expected value of the model
    true_positives_value = tp * customer_ltv * retention_success_rate
    false_positives_cost = fp * retention_offer_cost
    false_negatives_cost = fn * customer_ltv  # Lost customers we didn't target
    net_value = true_positives_value - false_positives_cost - false_negatives_cost
    return {
        'auc': auc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'confusion_matrix': cm,
        'net_business_value': net_value
    }

results = evaluate_churn_model(y_test, y_pred_proba, threshold=0.3)  # Lower threshold for higher recall
print(f"AUC: {results['auc']:.3f}")
print(f"Precision: {results['precision']:.3f} | Recall: {results['recall']:.3f}")
print(f"Net Business Value: ${results['net_business_value']:,.0f}")

5. Threshold Optimization:
def optimize_threshold(y_true, y_pred_proba, offer_cost=20, ltv=500, retention_rate=0.3):
    """Find the threshold maximizing business value"""
    thresholds = np.linspace(0.1, 0.9, 50)
    net_values = []
    for threshold in thresholds:
        y_pred = (y_pred_proba >= threshold).astype(int)
        cm = confusion_matrix(y_true, y_pred)
        tn, fp, fn, tp = cm.ravel()
        value = (tp * ltv * retention_rate) - (fp * offer_cost) - (fn * ltv)
        net_values.append(value)
    optimal_idx = np.argmax(net_values)
    optimal_threshold = thresholds[optimal_idx]
    plt.figure(figsize=(10, 5))
    plt.plot(thresholds, net_values)
    plt.axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.2f}')
    plt.xlabel('Prediction Threshold')
    plt.ylabel('Net Business Value ($)')
    plt.title('Threshold Optimization')
    plt.legend()
    plt.show()
    return optimal_threshold

optimal_threshold = optimize_threshold(y_test, y_pred_proba)

6. Feature Importance and Interpretation:
import shap

# SHAP values for model interpretation
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Top predictive features
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)
print("Top 10 Churn Predictors:")
print(feature_importance.head(10))

7. Business Presentation:
Stakeholder Communication Framework:
=== CHURN PREDICTION MODEL RESULTS ===
MODEL PERFORMANCE:
- AUC: 0.83 (Strong predictive power)
- Identifies 65% of churners (recall) with 55% precision
TOP CHURN INDICATORS:
1. Days since last purchase (35+ days = high risk)
2. Declining email engagement (-40% open rate)
3. Customer service complaints (2+ in 30 days)
4. Spend decline (>30% drop vs. prior period)
5. Email unsubscribe
RECOMMENDED ACTIONS:
→ Target top 20% risk score (5,000 customers)
- Expected churners in this group: 2,750
- Retention campaign saves 30%: 825 customers
- Value saved: 825 × $500 LTV = $412,500
- Campaign cost: 5,000 × $20 = $100,000
- NET VALUE: $312,500
SEGMENTED STRATEGIES:
- High risk + high value: Personal outreach, exclusive offers
- High risk + moderate value: Automated email campaigns
- Moderate risk: Monitor, engage with content

8. Deployment and Monitoring:
# Monthly scoring pipeline
def score_customers_for_churn(customer_df):
    """Production scoring function"""
    # Extract features
    features = create_churn_features(customer_df, transactions, engagement)
    # Generate predictions
    churn_probabilities = best_model.predict_proba(features)[:, 1]
    # Create output
    output = pd.DataFrame({
        'customer_id': customer_df['customer_id'],
        'churn_probability': churn_probabilities,
        'risk_tier': pd.cut(churn_probabilities, bins=[0, 0.3, 0.6, 1.0], labels=['Low', 'Medium', 'High']),
        'recommended_action': np.where(churn_probabilities > optimal_threshold, 'RETENTION_CAMPAIGN', 'MONITOR')
    })
    return output

# Model monitoring (detect drift)
def monitor_model_performance(predictions_df, actual_churn_df):
    """Track model accuracy over time"""
    monthly_performance = predictions_df.merge(actual_churn_df, on='customer_id')
    auc = roc_auc_score(monthly_performance['actual_churn'], monthly_performance['churn_probability'])
    if auc < 0.75:  # Performance degraded
        trigger_model_retraining()  # Retraining hook assumed to exist elsewhere
    return auc

Expected Outcome: End-to-end churn prediction system demonstrating ML expertise, business acumen, and ability to translate technical models into actionable retention strategies with quantified ROI.
5. SQL Query Optimization and Complex Data Extraction
Level: Data Analyst, Data Scientist
Difficulty: Hard
Source: Choreograph SQL Assessment, GroupM Technical Screen
Team: Data Engineering, Analytics
Interview Round: Technical Coding
Question: “Write a SQL query to calculate 30-day rolling retention rate by acquisition cohort from tables containing user_signups, user_activity, and user_purchases. Optimize for performance on a dataset with 10 million users.”
Concise Answer:
Database Schema:
-- user_signups: user_id, signup_date, acquisition_source
-- user_activity: user_id, activity_date, activity_type
-- user_purchases: user_id, purchase_date, amount

Solution:
-- 30-day rolling retention by acquisition cohort
-- Note: DATE_TRUNC / DATEDIFF syntax varies by engine; adjust for your dialect
WITH user_cohorts AS (
    -- Assign users to monthly cohorts
    SELECT
        user_id,
        DATE_TRUNC('month', signup_date) AS cohort_month,
        signup_date
    FROM user_signups
    WHERE signup_date >= '2023-01-01'
),
cohort_sizes AS (
    -- Count users in each cohort
    SELECT
        cohort_month,
        COUNT(DISTINCT user_id) AS cohort_size
    FROM user_cohorts
    GROUP BY cohort_month
),
user_activity_with_cohort AS (
    -- Join activity with cohort info and calculate days since signup
    SELECT
        ua.user_id,
        uc.cohort_month,
        ua.activity_date,
        DATEDIFF(ua.activity_date, uc.signup_date) AS days_since_signup
    FROM user_activity ua
    INNER JOIN user_cohorts uc
        ON ua.user_id = uc.user_id
    WHERE ua.activity_date >= uc.signup_date  -- Only activity after signup
),
retention_30day AS (
    -- Count users active in the 30-60 day window
    SELECT
        cohort_month,
        COUNT(DISTINCT user_id) AS active_users_30day
    FROM user_activity_with_cohort
    WHERE days_since_signup BETWEEN 30 AND 59
    GROUP BY cohort_month
)
-- Final calculation
SELECT
    cs.cohort_month,
    cs.cohort_size,
    COALESCE(r.active_users_30day, 0) AS active_users_30day,
    ROUND(100.0 * COALESCE(r.active_users_30day, 0) / cs.cohort_size, 2) AS retention_rate_pct,
    -- Additional metrics
    cs.cohort_size - COALESCE(r.active_users_30day, 0) AS churned_users,
    ROUND(100.0 * (cs.cohort_size - COALESCE(r.active_users_30day, 0)) / cs.cohort_size, 2) AS churn_rate_pct
FROM cohort_sizes cs
LEFT JOIN retention_30day r
    ON cs.cohort_month = r.cohort_month
ORDER BY cs.cohort_month DESC;

Performance Optimizations:
-- 1. Create indexes for JOIN and WHERE clauses
CREATE INDEX idx_user_signups_date ON user_signups(signup_date, user_id);
CREATE INDEX idx_user_activity_user_date ON user_activity(user_id, activity_date);
CREATE INDEX idx_user_activity_date ON user_activity(activity_date);

-- 2. Partition large tables by date
ALTER TABLE user_activity
PARTITION BY RANGE (YEAR(activity_date), MONTH(activity_date));

-- 3. Materialized view for cohort assignments (if frequently queried)
CREATE MATERIALIZED VIEW mv_user_cohorts AS
SELECT
    user_id,
    DATE_TRUNC('month', signup_date) AS cohort_month,
    signup_date
FROM user_signups;

REFRESH MATERIALIZED VIEW mv_user_cohorts;  -- Update daily

-- 4. Single-pass alternative using a conditional aggregate
WITH cohort_activity AS (
    SELECT
        DATE_TRUNC('month', s.signup_date) AS cohort_month,
        s.user_id,
        MAX(CASE
            WHEN DATEDIFF(a.activity_date, s.signup_date) BETWEEN 30 AND 59
            THEN 1 ELSE 0
        END) AS active_in_period  -- MAX: active at least once in the window
    FROM user_signups s
    LEFT JOIN user_activity a
        ON s.user_id = a.user_id
        AND a.activity_date BETWEEN s.signup_date + INTERVAL 30 DAY
                                AND s.signup_date + INTERVAL 60 DAY
    GROUP BY 1, 2
)
SELECT
    cohort_month,
    COUNT(*) AS cohort_size,
    SUM(active_in_period) AS retained_users,
    ROUND(100.0 * SUM(active_in_period) / COUNT(*), 2) AS retention_rate_pct
FROM cohort_activity
GROUP BY cohort_month
ORDER BY cohort_month DESC;

Query Execution Plan Analysis:
-- Check query performance
EXPLAIN ANALYZE
SELECT ...

-- Look for:
-- - Sequential scans (bad) vs. index scans (good)
-- - Join method (hash join is typically faster for large tables)
-- - Estimated vs. actual rows (if far off, update table statistics)

Additional Complexity: Multi-Dimensional Retention:
-- Retention by acquisition source
-- (the original CROSS JOIN to a distinct-month derived table is redundant;
--  grouping directly on the truncated signup date is equivalent and cheaper)
SELECT
    DATE_TRUNC('month', s.signup_date) AS cohort_month,
    s.acquisition_source,
    COUNT(DISTINCT s.user_id) AS cohort_size,
    COUNT(DISTINCT CASE
        WHEN DATEDIFF(a.activity_date, s.signup_date) BETWEEN 30 AND 59
        THEN a.user_id
    END) AS retained_users,
    ROUND(100.0 * COUNT(DISTINCT CASE
        WHEN DATEDIFF(a.activity_date, s.signup_date) BETWEEN 30 AND 59
        THEN a.user_id
    END) / COUNT(DISTINCT s.user_id), 2) AS retention_rate_pct
FROM user_signups s
LEFT JOIN user_activity a
    ON s.user_id = a.user_id
GROUP BY DATE_TRUNC('month', s.signup_date), s.acquisition_source
ORDER BY cohort_month DESC, retention_rate_pct DESC;

Key Optimization Principles:
1. Index strategically: Join columns, WHERE clause filters, GROUP BY columns
2. Filter early: Apply WHERE clauses before JOINs when possible
3. Partition large tables: By date for time-series queries
4. Use CTEs wisely: Readable but may not be materialized; sometimes temp tables faster
5. Aggregate before joining: Reduce row count before expensive operations
6. Monitor execution plans: Use EXPLAIN to identify bottlenecks
Expected Outcome: Demonstrate SQL proficiency, performance optimization thinking, and ability to handle complex analytical queries on large datasets typical in marketing analytics.
Data Visualization & Communication
6. Python Data Analysis and Campaign Performance Reporting
Level: Data Analyst, Junior Data Scientist
Difficulty: Moderate
Source: WPP Media Case Study, GroupM Analytics
Team: Campaign Analytics, Reporting
Interview Round: Technical Exercise
Question: “You have a CSV file with campaign data: date, campaign_id, impressions, clicks, conversions, spend. Write Python code to: load and clean the data, calculate performance metrics (CTR, CPC, conversion rate, ROAS), identify underperforming campaigns (bottom 20% by ROAS), visualize trends, and recommend actions.”
Concise Answer:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load and clean data
df = pd.read_csv('campaign_data.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.dropna()
df = df[df['spend'] > 0]  # Remove zero-spend campaigns

# Calculate performance metrics
df['CTR'] = (df['clicks'] / df['impressions']) * 100
df['CPC'] = df['spend'] / df['clicks']
df['conversion_rate'] = (df['conversions'] / df['clicks']) * 100

# Assume $50 revenue per conversion
revenue_per_conversion = 50
df['revenue'] = df['conversions'] * revenue_per_conversion
df['ROAS'] = df['revenue'] / df['spend']
# Identify underperformers (bottom 20%)
roas_threshold = df['ROAS'].quantile(0.20)
underperformers = df[df['ROAS'] <= roas_threshold]
print(f"\n=== UNDERPERFORMING CAMPAIGNS (ROAS <= {roas_threshold:.2f}) ===")
print(underperformers[['campaign_id', 'ROAS', 'CTR', 'conversion_rate', 'spend']].sort_values('ROAS'))
print(f"\nTotal wasted spend: ${underperformers['spend'].sum():,.2f}")
# Aggregate by campaign for summary
campaign_summary = df.groupby('campaign_id').agg({
    'impressions': 'sum',
    'clicks': 'sum',
    'conversions': 'sum',
    'spend': 'sum',
    'revenue': 'sum'
}).reset_index()
campaign_summary['overall_CTR'] = (campaign_summary['clicks'] / campaign_summary['impressions']) * 100
campaign_summary['overall_ROAS'] = campaign_summary['revenue'] / campaign_summary['spend']

# Visualization 1: Performance over time
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Daily impressions and spend
daily_metrics = df.groupby('date').agg({'impressions': 'sum', 'spend': 'sum', 'ROAS': 'mean'})
axes[0, 0].plot(daily_metrics.index, daily_metrics['impressions'], color='blue')
axes[0, 0].set_title('Daily Impressions Trend')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Impressions')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 1].plot(daily_metrics.index, daily_metrics['ROAS'], color='green')
axes[0, 1].axhline(y=1.0, color='r', linestyle='--', label='Breakeven')
axes[0, 1].set_title('Daily ROAS Trend')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('ROAS')
axes[0, 1].legend()
axes[0, 1].tick_params(axis='x', rotation=45)
# Campaign comparison scatter plot
axes[1, 0].scatter(campaign_summary['overall_CTR'], campaign_summary['overall_ROAS'],
                   s=campaign_summary['spend']/100, alpha=0.6)
axes[1, 0].axhline(y=1.0, color='r', linestyle='--', alpha=0.5)
axes[1, 0].set_xlabel('CTR (%)')
axes[1, 0].set_ylabel('ROAS')
axes[1, 0].set_title('Campaign Performance: CTR vs ROAS (size = spend)')
# Top vs. bottom performers
top_performers = campaign_summary.nlargest(5, 'overall_ROAS')
axes[1, 1].barh(top_performers['campaign_id'].astype(str), top_performers['overall_ROAS'], color='green', alpha=0.7)
axes[1, 1].axvline(x=1.0, color='r', linestyle='--', label='Breakeven')
axes[1, 1].set_xlabel('ROAS')
axes[1, 1].set_title('Top 5 Performing Campaigns')
axes[1, 1].legend()
plt.tight_layout()
plt.savefig('campaign_performance_report.png', dpi=300, bbox_inches='tight')
plt.show()
# Recommendations
print("\n=== OPTIMIZATION RECOMMENDATIONS ===\n")
print(f"1. PAUSE UNDERPERFORMERS")
print(f" - {len(underperformers)} campaigns with ROAS < {roas_threshold:.2f}")
print(f" - Savings: ${underperformers['spend'].sum():,.2f}")
print(f" - Campaigns: {underperformers['campaign_id'].unique().tolist()}\n")
top_5 = campaign_summary.nlargest(5, 'overall_ROAS')
print(f"2. SCALE TOP PERFORMERS")
print(f" - Top 5 campaigns have average ROAS of {top_5['overall_ROAS'].mean():.2f}")
print(f" - Recommend increasing budget by 50% for:")
for idx, row in top_5.iterrows():
    print(f"    Campaign {row['campaign_id']}: ROAS {row['overall_ROAS']:.2f}")
medium_performers = campaign_summary[
(campaign_summary['overall_ROAS'] >= 1.0) &
(campaign_summary['overall_ROAS'] < campaign_summary['overall_ROAS'].quantile(0.75))
]
print(f"\n3. OPTIMIZE MEDIUM PERFORMERS")
print(f" - {len(medium_performers)} campaigns with ROAS between 1.0 and top quartile")
print(f" - Test creative variations to improve conversion rate")
print(f" - Refine audience targeting to reduce wasted impressions")
# Budget reallocation calculation
savings_from_pause = underperformers['spend'].sum()
reallocate_to_top = savings_from_pause * 0.7
keep_for_testing = savings_from_pause * 0.3

print(f"\n4. BUDGET REALLOCATION")
print(f" - Current total spend: ${df['spend'].sum():,.2f}")
print(f" - Saved from pausing: ${savings_from_pause:,.2f}")
print(f" - Reallocate to top performers: ${reallocate_to_top:,.2f}")
print(f" - Reserve for testing new campaigns: ${keep_for_testing:,.2f}")
print(f" - Expected additional revenue: ${reallocate_to_top * top_5['overall_ROAS'].mean():,.2f}")
# Export results
campaign_summary.to_csv('campaign_performance_summary.csv', index=False)
underperformers[['campaign_id', 'date', 'ROAS', 'spend']].to_csv('underperformers.csv', index=False)
print("\n✓ Analysis complete. Reports saved.")

Key Deliverables:
1. Performance dashboard (4-panel visualization)
2. Actionable recommendations (pause, scale, optimize, reallocate)
3. Quantified impact (savings, expected revenue lift)
4. Export files for stakeholder review
Expected Outcome: Clean, efficient Python code with clear visualizations and business-focused recommendations demonstrating analytical and communication skills.
7. Machine Learning Explainability and Model Interpretation
Level: Data Scientist
Difficulty: Very Hard
Source: Choreograph ML Team
Team: Advanced Analytics, ML Engineering
Interview Round: ML Design
Question: “You’ve built a random forest model predicting campaign conversion probability. Marketing stakeholders want to understand WHY the model makes certain predictions. How would you explain the model’s decision-making process?”
Concise Answer:
Explainability Framework:
1. Global Interpretability (Overall Model Behavior):
import shap
import matplotlib.pyplot as plt
from sklearn.inspection import partial_dependence, PartialDependenceDisplay
# Feature importances (built into Random Forest)
feature_importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importances['feature'][:10], feature_importances['importance'][:10])
plt.xlabel('Importance Score')
plt.title('Top 10 Most Important Features')
plt.gca().invert_yaxis()
plt.show()

print("Top Conversion Drivers:")
for idx, row in feature_importances.head(5).iterrows():
    print(f"{row['feature']}: {row['importance']:.3f}")

2. Partial Dependence Plots (Feature Effects):
# Show how changing one feature affects predictions
features_to_plot = ['email_engagement_score', 'days_since_last_visit', 'ad_frequency']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, feature in enumerate(features_to_plot):
    feature_idx = list(feature_names).index(feature)
    display = PartialDependenceDisplay.from_estimator(
        rf_model, X_train, [feature_idx], ax=axes[idx]
    )
    axes[idx].set_title(f'Effect of {feature}')
plt.tight_layout()
plt.show()

3. Local Interpretability (Individual Predictions):
# SHAP for individual prediction explanation
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

# Explain a specific prediction
def explain_prediction(customer_idx):
    """Explain why the model predicted conversion for a specific customer"""
    prediction_proba = rf_model.predict_proba(X_test[customer_idx].reshape(1, -1))[0][1]
    # Get SHAP values for this prediction (class 1 = conversion)
    shap_vals = shap_values[1][customer_idx]
    # Create explanation dataframe
    explanation = pd.DataFrame({
        'feature': feature_names,
        'value': X_test[customer_idx],
        'shap_value': shap_vals
    }).sort_values('shap_value', key=abs, ascending=False)
    print(f"\n=== PREDICTION EXPLANATION ===")
    print(f"Predicted Conversion Probability: {prediction_proba:.1%}\n")
    print("Top factors INCREASING conversion probability:")
    positive_factors = explanation[explanation['shap_value'] > 0].head(5)
    for idx, row in positive_factors.iterrows():
        print(f"  + {row['feature']} = {row['value']:.2f} → +{row['shap_value']:.3f}")
    print("\nTop factors DECREASING conversion probability:")
    negative_factors = explanation[explanation['shap_value'] < 0].head(5)
    for idx, row in negative_factors.iterrows():
        print(f"  - {row['feature']} = {row['value']:.2f} → {row['shap_value']:.3f}")
    # Waterfall plot
    shap.waterfall_plot(shap.Explanation(values=shap_vals, base_values=explainer.expected_value[1],
                                         data=X_test[customer_idx], feature_names=feature_names))

# Example
explain_prediction(customer_idx=42)

4. Business-Friendly Explanation:
Customer #42 Conversion Prediction: 68%
WHY THIS PREDICTION?
POSITIVE SIGNALS (increasing conversion probability):
✓ High email engagement score (0.85) → +12%
→ This customer actively opens and clicks emails
✓ Recent website visit (2 days ago) → +10%
→ Recent visitors are 3x more likely to convert
✓ Previous purchase history (5 purchases) → +8%
→ Repeat customers have 65% baseline conversion rate
NEGATIVE SIGNALS (decreasing conversion probability):
✗ Low ad frequency (1 impression) → -5%
→ Customers need 3-5 exposures for optimal conversion
✗ Mobile device → -3%
→ Mobile conversion rate 40% lower (poor checkout UX)
RECOMMENDATION:
→ Increase ad frequency to 3-5 impressions
→ Send personalized email with product recommendations
→ Optimize mobile checkout experience

5. SHAP Summary Plot (All Predictions):
# Global view of feature impacts
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)

# The beeswarm plot shows:
# - Feature importance (vertical ordering)
# - Feature values (color: red = high, blue = low)
# - Impact on prediction (horizontal position)

6. Stakeholder Communication Strategy:
| Audience | Approach | Tools |
|---|---|---|
| Marketing Executives | Business outcomes, not technical details | Feature importance rankings, business narratives |
| Campaign Managers | Actionable insights | Individual prediction explanations, segment patterns |
| Data Team | Technical rigor | SHAP values, partial dependence, model diagnostics |
7. Address Limitations:
MODEL LIMITATIONS & TRANSPARENCY:
1. Correlation ≠ Causation
- Model shows email engagement predicts conversion
- Doesn't prove emails CAUSE conversions
- Could be reverse causality (engaged users open emails)
2. Predictions are Probabilistic
- 68% conversion probability ≠ certainty
- Confidence intervals: [55%, 78%]
3. Model Reflects Historical Patterns
- Trained on 2023-2024 data
- May not capture new trends or behaviors
- Requires monthly retraining
4. Potential Biases
- If training data over-represents certain demographics
- Model may perform worse on underrepresented groups
- Fairness monitoring in production

Expected Outcome: Demonstrate ability to make complex models transparent, communicate technical concepts to non-technical stakeholders, and bridge ML accuracy with business trust and adoption.
Advanced Experimentation & Causal Inference
8. Incrementality Testing and Media Effectiveness Measurement
Level: Senior Data Scientist
Difficulty: Extreme
Source: GroupM Media Science, Advanced Analytics
Team: Media Effectiveness, Measurement Science
Interview Round: Strategic Case Study
Question: “A client wants to know if their $10M TV advertising spend is actually driving incremental sales. Design a study to measure TV’s incremental impact.”
Concise Answer:
Study Design: Randomized Geo-Experiment
1. Methodology Selection:
| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| Randomized Geo-Experiment | Gold standard, causal inference | Requires pausing TV in markets | BEST |
| Synthetic Control | No pause needed | Assumes good match exists | Alternative |
| Observational Regression | Cheapest, fastest | Confounding bias | Validation only |
2. Experimental Design:
import pandas as pd
import numpy as np
from scipy import stats
class GeoExperiment:
    def __init__(self, markets_df):
        self.markets = markets_df

    def select_matched_pairs(self, n_pairs=10):
        """Select similar market pairs for testing"""
        # Features for matching
        matching_features = [
            'population',
            'median_income',
            'historical_sales',
            'competitor_presence',
            'seasonality_index'
        ]
        # Standardize features
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        scaled_features = scaler.fit_transform(self.markets[matching_features])
        # Find similar pairs using distance metrics
        from sklearn.metrics.pairwise import euclidean_distances
        distances = euclidean_distances(scaled_features)
        pairs = []
        used_markets = set()
        for i in range(len(distances)):
            if i in used_markets:
                continue
            # Find the closest unused match
            distances_i = distances[i].copy()
            distances_i[i] = np.inf
            distances_i[list(used_markets)] = np.inf
            closest = np.argmin(distances_i)
            if len(pairs) < n_pairs:
                pairs.append((self.markets.iloc[i]['market_id'],
                              self.markets.iloc[closest]['market_id']))
                used_markets.add(i)
                used_markets.add(closest)
        return pairs

    def randomize_treatment(self, pairs):
        """Randomly assign treatment/control within pairs"""
        np.random.seed(42)
        assignments = []
        for market_a, market_b in pairs:
            if np.random.rand() > 0.5:
                assignments.append({'market': market_a, 'group': 'treatment'})
                assignments.append({'market': market_b, 'group': 'control'})
            else:
                assignments.append({'market': market_a, 'group': 'control'})
                assignments.append({'market': market_b, 'group': 'treatment'})
        return pd.DataFrame(assignments)

# Example usage
experiment = GeoExperiment(market_data)
matched_pairs = experiment.select_matched_pairs(n_pairs=12)
treatment_assignment = experiment.randomize_treatment(matched_pairs)

3. Power Analysis (Sample Size Calculation):
from statsmodels.stats.power import TTestPower

def calculate_power(baseline_sales, expected_lift, n_pairs):
    """Calculate statistical power to detect lift"""
    # Assumptions
    baseline_mean = baseline_sales  # e.g., $1M per market
    lift_pct = expected_lift        # e.g., 8% lift expected
    cv = 0.20                       # Coefficient of variation (20%)
    # Effect size (difference in means / pooled std dev)
    lift_amount = baseline_mean * lift_pct
    std_dev = baseline_mean * cv
    effect_size = lift_amount / (std_dev * np.sqrt(2))
    # Power calculation (paired t-test via statsmodels; scipy has no power functions)
    alpha = 0.05
    power = TTestPower().power(effect_size=effect_size, nobs=n_pairs, alpha=alpha)
    return {
        'n_pairs': n_pairs,
        'expected_lift': f"{lift_pct:.1%}",
        'power': f"{power:.1%}",
        'sufficient': power >= 0.80
    }

# Check if we have enough markets
result = calculate_power(baseline_sales=1_000_000, expected_lift=0.08, n_pairs=12)
print(f"With {result['n_pairs']} market pairs, power to detect {result['expected_lift']} lift: {result['power']}")
print(f"Sufficient (>80%): {result['sufficient']}")

4. Implementation:
EXPERIMENTAL SETUP:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Duration: 8 weeks (2 months)
Markets: 12 pairs (24 total DMAs)
Treatment: Pause TV advertising
Control: Continue normal TV spend
Cost: ~$1.7M in forgone TV spend
PRE-PERIOD (4 weeks before):
→ Validate baseline sales similarity
→ Confirm no pre-existing trends
TEST PERIOD (8 weeks):
→ Treatment markets: $0 TV spend
→ Control markets: Normal TV spend
→ Monitor daily sales
POST-ANALYSIS:
→ Compare sales differences
→ Calculate incremental ROAS

5. Analysis:
def analyze_geo_experiment(results_df):
    """Difference-in-differences analysis"""
    # Calculate change from baseline
    treatment_markets = results_df[results_df['group'] == 'treatment']
    control_markets = results_df[results_df['group'] == 'control']
    treatment_change = treatment_markets['test_sales'].mean() - treatment_markets['baseline_sales'].mean()
    control_change = control_markets['test_sales'].mean() - control_markets['baseline_sales'].mean()
    # Incremental impact (DiD estimator)
    incremental_effect = treatment_change - control_change
    # Statistical significance (paired t-test on within-market changes)
    t_stat, p_value = stats.ttest_rel(
        treatment_markets['test_sales'] - treatment_markets['baseline_sales'],
        control_markets['test_sales'] - control_markets['baseline_sales']
    )
    # Calculate incremental ROAS
    tv_spend_treatment = treatment_markets['tv_spend'].sum()
    total_incremental_sales = incremental_effect * len(treatment_markets) * 8  # 8-week test period
    incremental_roas = total_incremental_sales / tv_spend_treatment if tv_spend_treatment > 0 else 0
    return {
        'incremental_sales_per_market': incremental_effect,
        'total_incremental_sales': total_incremental_sales,
        'incremental_roas': incremental_roas,
        'p_value': p_value,
        'statistically_significant': p_value < 0.05,
        'confidence_interval': stats.t.interval(
            0.95, df=len(treatment_markets) - 1,
            loc=incremental_effect,
            scale=stats.sem(treatment_markets['test_sales'] - treatment_markets['baseline_sales'])
        )
    }

results = analyze_geo_experiment(experiment_results)
print(f"\n=== INCREMENTALITY TEST RESULTS ===")
print(f"Incremental sales per market: ${results['incremental_sales_per_market']:,.0f}")
print(f"Total incremental sales (8 weeks): ${results['total_incremental_sales']:,.0f}")
print(f"Incremental ROAS: {results['incremental_roas']:.2f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Statistically significant: {results['statistically_significant']}")
print(f"95% CI: ${results['confidence_interval'][0]:,.0f} to ${results['confidence_interval'][1]:,.0f}")

6. Alternative: Synthetic Control (No Pause Required):
from sklearn.linear_model import Ridge
def synthetic_control_method(treatment_market, control_markets, pre_period, test_period):
    """Create a synthetic control market as a weighted average of controls.

    Assumes panel data with one row per (market, date); control markets
    are pivoted so each market becomes a feature column."""
    # Pivot controls to wide format: rows = time periods, columns = markets
    controls_wide = control_markets.pivot(index='date', columns='market_id', values='sales')
    pre_dates = treatment_market[treatment_market['period'] == 'pre']['date']
    test_dates = treatment_market[treatment_market['period'] == 'test']['date']
    # Fit model on the pre-period to find weights
    X_pre = controls_wide.loc[pre_dates].values
    y_pre = treatment_market[treatment_market['period'] == 'pre']['sales'].values
    # Ridge regression to find optimal weights (constrained to be positive)
    model = Ridge(alpha=1.0, positive=True)
    model.fit(X_pre, y_pre)
    weights = model.coef_
    # Predict the counterfactual in the test period
    X_test = controls_wide.loc[test_dates].values
    synthetic_sales = model.predict(X_test)
    # Actual sales in the treatment market
    actual_sales = treatment_market[treatment_market['period'] == 'test']['sales'].values
    # Incremental impact
    incremental = actual_sales - synthetic_sales
    return {
        'actual_sales': actual_sales,
        'synthetic_sales': synthetic_sales,
        'incremental_impact': incremental,
        'control_weights': dict(zip(controls_wide.columns, weights))
    }

7. Business Recommendation:
RECOMMENDATION FOR CLIENT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FINDINGS:
• TV advertising drives $4.2M incremental sales per year
• Incremental ROAS: 2.4:1
• Statistical confidence: 95%
TV EFFECTIVENESS:
✓ TV is working (positive incremental impact)
✗ But not optimally (ROAS <3:1 benchmark for CPG)
RECOMMENDATIONS:
1. Continue TV advertising (it's incremental)
2. Optimize media mix:
- Reduce TV spend by 20% ($2M)
- Reallocate to higher-ROAS digital channels
- Expected net benefit: +$1.5M revenue
3. Test creative variations to improve TV efficiency
4. Run quarterly incrementality tests for ongoing monitoring

Expected Outcome: Demonstrate sophisticated causal inference methodology, ability to design rigorous experiments, and translate statistical findings into business recommendations with quantified confidence.
NLP & Text Analytics
9. Natural Language Processing for Customer Insights
Level: Data Scientist
Difficulty: Hard
Source: Choreograph Text Analytics, Consumer Insights
Team: Voice of Customer, Social Listening
Interview Round: NLP Technical
Question: “You have 100,000 customer reviews for a CPG brand. How would you analyze this text data to extract insights about product perception, sentiment trends, and emerging issues?”
Concise Answer:
NLP Analysis Pipeline:
1. Text Preprocessing:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
def preprocess_text(text):
    """Clean and normalize text"""
    # Lowercase
    text = text.lower()
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    # Remove special characters, keeping punctuation useful for sentiment
    text = re.sub(r'[^a-zA-Z\s!?.]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)

# Apply preprocessing
reviews['clean_text'] = reviews['review_text'].apply(preprocess_text)

2. Sentiment Analysis:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline
# Method 1: VADER (rule-based, fast)
analyzer = SentimentIntensityAnalyzer()

def get_sentiment_vader(text):
    scores = analyzer.polarity_scores(text)
    # Classify based on compound score
    if scores['compound'] >= 0.05:
        return 'positive', scores['compound']
    elif scores['compound'] <= -0.05:
        return 'negative', scores['compound']
    else:
        return 'neutral', scores['compound']

reviews[['sentiment', 'sentiment_score']] = reviews['review_text'].apply(
    lambda x: pd.Series(get_sentiment_vader(x))
)

# Method 2: Transformer model (more accurate, slower)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def get_sentiment_bert(text):
    result = sentiment_pipeline(text[:512])[0]  # Truncate long reviews to the model's limit
    return result['label'], result['score']

# Use for a sample or for critical reviews only
# reviews[['sentiment_bert', 'confidence']] = reviews['review_text'].apply(...)
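Scoring all 100,000 reviews one at a time with the transformer is slow; a common compromise, sketched here under the assumption that the pipeline above is already loaded, is to batch-score a random sample (the sample size is an assumption):

# Sketch: score a 5,000-review sample in batches
sample = reviews.sample(n=5000, random_state=42).copy()
texts = [t[:512] for t in sample['review_text'].tolist()]
bert_results = sentiment_pipeline(texts, batch_size=32, truncation=True)
sample['sentiment_bert'] = [r['label'] for r in bert_results]
sample['confidence'] = [r['score'] for r in bert_results]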
3. Topic Modeling (What are customers talking about?):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Create document-term matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(reviews['clean_text'])

# LDA topic modeling
n_topics = 8
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, max_iter=20)
lda.fit(doc_term_matrix)

# Display topics
def display_topics(model, feature_names, n_top_words=10):
    topics = {}
    for idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[-n_top_words:][::-1]]
        topics[f"Topic {idx}"] = top_words
    return topics

feature_names = vectorizer.get_feature_names_out()
topics = display_topics(lda, feature_names)

# Interpret topics
for topic_name, words in topics.items():
    print(f"{topic_name}: {', '.join(words)}")

# Example output:
# Topic 0: battery, charge, lasting, hours, life → Battery Performance
# Topic 1: flavor, taste, sweet, delicious, love → Taste/Flavor
# Topic 2: packaging, damaged, arrived, broken, box → Shipping/Packaging
# Topic 3: price, expensive, worth, value, cost → Pricing
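The choice of n_topics = 8 is a starting point rather than a given. One hedged way to compare candidate values is held-out perplexity (lower is better), reusing doc_term_matrix from above:

# Compare candidate topic counts by held-out perplexity (illustrative grid)
from sklearn.model_selection import train_test_split

dtm_train, dtm_test = train_test_split(doc_term_matrix, test_size=0.2, random_state=42)
for k in [4, 6, 8, 10, 12]:
    candidate = LatentDirichletAllocation(n_components=k, random_state=42, max_iter=20)
    candidate.fit(dtm_train)
    print(f"n_topics={k}: perplexity={candidate.perplexity(dtm_test):,.0f}")

In practice you would weigh perplexity against topic interpretability, since the two often disagree.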
4. Aspect-Based Sentiment:
def extract_aspects_and_sentiment(reviews_df):
    """Extract sentiment about specific product aspects."""
    aspects = {
        'quality': ['quality', 'durable', 'well-made', 'sturdy', 'cheap', 'broke'],
        'price': ['expensive', 'cheap', 'price', 'value', 'cost', 'affordable'],
        'taste': ['taste', 'flavor', 'delicious', 'bland', 'sweet', 'bitter'],
        'packaging': ['packaging', 'package', 'box', 'wrapped', 'damaged'],
        'battery': ['battery', 'charge', 'charging', 'power', 'lasting']
    }
    aspect_sentiments = {}
    for aspect, keywords in aspects.items():
        # Find reviews mentioning this aspect (simple keyword match)
        mask = reviews_df['clean_text'].str.contains('|'.join(keywords), case=False)
        aspect_reviews = reviews_df[mask]
        if len(aspect_reviews) > 0:
            avg_sentiment = aspect_reviews['sentiment_score'].mean()
            mention_count = len(aspect_reviews)
            positive_pct = (aspect_reviews['sentiment'] == 'positive').mean() * 100
            aspect_sentiments[aspect] = {
                'mentions': mention_count,
                'avg_sentiment': avg_sentiment,
                'positive_pct': positive_pct
            }
    return pd.DataFrame(aspect_sentiments).T

aspect_analysis = extract_aspects_and_sentiment(reviews)
print("\n=== ASPECT-BASED SENTIMENT ===")
print(aspect_analysis.sort_values('avg_sentiment'))

5. Trend Analysis (Time Series of Sentiment):
import matplotlib.pyplot as plt
# Aggregate sentiment by month
reviews['month'] = pd.to_datetime(reviews['date']).dt.to_period('M')
sentiment_over_time = reviews.groupby('month').agg({
    'sentiment_score': 'mean',
    'review_id': 'count'
}).rename(columns={'review_id': 'review_count'})

# Plot trend: sentiment line, with review volume as bars on a second axis
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.plot(sentiment_over_time.index.astype(str), sentiment_over_time['sentiment_score'],
         color='blue', marker='o')
ax1.set_xlabel('Month')
ax1.set_ylabel('Avg Sentiment Score', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.tick_params(axis='x', rotation=45)
ax1.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Neutral')

ax2 = ax1.twinx()
ax2.bar(sentiment_over_time.index.astype(str), sentiment_over_time['review_count'],
        alpha=0.3, color='gray')
ax2.set_ylabel('Review Volume', color='gray')

plt.title('Customer Sentiment Trend Over Time')
plt.tight_layout()
plt.show()

6. Key Insights Extraction:
# Identify emerging issues (negative sentiment spikes)
def detect_sentiment_anomalies(sentiment_ts):
    """Flag periods with unusually negative sentiment (below mean - 2 std)."""
    mean_sentiment = sentiment_ts['sentiment_score'].mean()
    std_sentiment = sentiment_ts['sentiment_score'].std()
    anomalies = sentiment_ts[sentiment_ts['sentiment_score'] < mean_sentiment - 2 * std_sentiment]
    return anomalies

anomalies = detect_sentiment_anomalies(sentiment_over_time)
print("\n=== ALERT: NEGATIVE SENTIMENT SPIKES ===")
print(anomalies)

# Extract representative reviews for each topic
def get_representative_reviews(reviews_df, topic_id, n=3):
    """Get the reviews most strongly associated with a topic."""
    # Get topic distribution for each review (uses lda and doc_term_matrix from above)
    topic_dist = lda.transform(doc_term_matrix)
    reviews_df['topic_strength'] = topic_dist[:, topic_id]
    # Get the top-n reviews for this topic
    top_reviews = reviews_df.nlargest(n, 'topic_strength')[['review_text', 'sentiment', 'topic_strength']]
    return top_reviews

# Example: get reviews about battery issues
battery_reviews = get_representative_reviews(reviews, topic_id=0, n=5)
print("\n=== EXAMPLE BATTERY REVIEWS ===")
print(battery_reviews)

7. Business Presentation:
CUSTOMER REVIEW ANALYSIS: 100,000 REVIEWS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OVERALL SENTIMENT:
• Positive: 62%
• Neutral: 23%
• Negative: 15%
• Average Score: +0.32 (moderately positive)
TOP MENTIONED ASPECTS:
1. Taste/Flavor (42% of reviews) → 71% positive
2. Packaging (28% of reviews) → 45% positive ⚠️
3. Price/Value (24% of reviews) → 58% positive
4. Quality (18% of reviews) → 68% positive
5. Battery Life (12% of reviews) → 38% positive ⚠️
EMERGING ISSUES:
⚠️ Battery complaints UP 150% since July 2024
⚠️ Packaging damage mentions increased 35%
✓ Taste satisfaction stable and positive
ACTIONABLE RECOMMENDATIONS:
1. URGENT: Investigate battery supplier (defect rate spike)
2. Packaging: Switch to reinforced boxes to reduce damage
3. Continue current flavor formulation (strong positive signal)
4. Price perception: Communicate value proposition better
COMPETITOR COMPARISON:
• Our sentiment (+0.32) vs. Competitor A (+0.18)
• Battery issues: We have 2x the mention rate vs. competitors
Expected Outcome: Demonstrate NLP technical skills, business-insight extraction from unstructured data, and the ability to translate text analytics into actionable product and marketing recommendations.
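As an aside, a figure like "battery complaints up 150% since July 2024" can be derived from the monthly aspect mentions computed earlier; a minimal sketch, with the comparison windows as assumptions:

# Growth in battery mentions, pre- vs. post-July 2024 (windows are assumptions)
battery_mask = reviews['clean_text'].str.contains('battery|charge|charging', case=False)
monthly_mentions = reviews[battery_mask].groupby('month').size()

baseline = monthly_mentions.loc[:'2024-06'].mean()  # average monthly mentions before July
recent = monthly_mentions.loc['2024-07':].mean()    # average monthly mentions from July on
growth_pct = (recent - baseline) / baseline * 100
print(f"Battery mention growth: {growth_pct:+.0f}%")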
Behavioral & Communication
10. Communicating Data Insights to Non-Technical Stakeholders
Level: All levels
Difficulty: Moderate to Hard
Source: Common WPP Interview Theme
Team: All Data Teams
Interview Round: Behavioral Assessment
Question: “Tell me about a time when you had to present complex analytical findings to senior marketing leaders who had limited technical background. How did you ensure they understood and acted on your recommendations?”
Concise Answer (STAR Method):
Situation:
“In my previous role as a Data Scientist at [Company], I led a multi-touch attribution analysis for a major e-commerce client. The analysis revealed that our attribution model was significantly over-crediting last-click (paid search) and under-valuing earlier touchpoints like display and social media. This had major implications—we were under-investing in awareness channels by approximately $5M annually, leaving an estimated $8M in potential conversions unrealized.
I needed to present these findings to the CMO and VP of Marketing, both with limited technical backgrounds, to secure approval for a substantial budget reallocation across channels.”
Task:
“My objectives were to:
1. Explain complex Shapley value attribution methodology in accessible terms
2. Build sufficient trust in the analysis to justify $5M budget shift
3. Address anticipated concerns about reducing search spend
4. Get approval for phased implementation plan”
Action:
“I structured my approach around six key actions:
1. Lead with Business Impact, Not Methodology:
- Opened presentation with: ‘We’re currently under-investing in awareness channels by $5M, which is costing us $8M in potential revenue annually’
- Showed customer journey map illustrating typical 5-6 touch path before conversion
- Made the ‘last-click problem’ visceral: ‘Imagine crediting only the cashier for a sale, ignoring the sales floor associate who helped the customer for 30 minutes’
2. Translated Technical Concepts:
- Avoided jargon like ‘Shapley values’ and ‘marginal contributions’
- Instead said: ‘We calculate what each channel’s unique contribution is by comparing what happens with and without it—like testing which ingredient truly makes the recipe work’
- Used sports analogy: ‘You can’t just credit the player who scores the goal. We need to value the defender who started the play’
3. Visualization Over Tables:
- Created customer journey maps showing typical paths (not statistical charts)
- Used before/after budget allocation visuals with expected impact
- Built interactive dashboard they could explore themselves
4. Addressed Concerns Proactively:
- Anticipated CMO’s worry: ‘Are you saying search isn’t valuable?’
- Prepared response: ‘Search is critical for capturing existing demand at the bottom of the funnel. But we need display and social to CREATE that demand in the first place. They work together.’
- Cross-referenced with brand lift studies showing display did increase awareness
- Showed competitor strategies proving our approach was under-market
5. Made Recommendations Concrete and Phased:
- Instead of abstract ‘reallocate budget,’ provided specific channel shifts:
- Reduce search by $3M (from $15M to $12M)
- Increase display by $2M
- Increase social by $1M
- Proposed 3-month pilot in one product category before full rollout
- Created monthly dashboard to track impact and course-correct
- Gave explicit success criteria: ‘If we don’t see 10%+ lift in total conversions by month 3, we revert changes’
6. Stakeholder Involvement (the one gap in my approach):
- In retrospect, I should have shared preliminary findings earlier to build progressive buy-in
- Instead of revealing everything in the final presentation, I could have held a mid-analysis checkpoint”
Result:
“Immediate outcomes:
- CMO approved recommendation in the meeting
- Secured $5M budget reallocation with 3-month pilot
- VP of Marketing became champion of the analysis
6-month impact:
- Total conversions increased 22% (vs. 10% target)
- Cost per acquisition decreased 15%
- Overall ROAS improved from 3.2:1 to 4.1:1
- Analysis saved client an estimated $8M in lost revenue
Recognition:
- CMO cited this project as ‘most impactful data science initiative of the year’ in quarterly all-hands
- I was promoted to Senior Data Scientist based partly on this work
- Built internal dashboard now used across all client accounts
Lessons learned:
1. Start with ‘so what’ not ‘how’: Business impact first, methodology second
2. Analogies > equations: Sports, cooking, everyday experiences make concepts accessible
3. Progressive involvement: Don’t surprise stakeholders—build consensus incrementally
4. Safety nets reduce resistance: Phased approach with exit criteria lowers perceived risk
5. Visuals tell stories: One customer journey map is worth ten statistical tables”
What I’d do differently:
“I would have:
- Involved stakeholders earlier in analysis (not just final presentation)
- Had them articulate concerns upfront so I could address preemptively
- Created simpler 1-pager executive summary for busy C-suite
- Practiced presentation with non-technical colleague for feedback
This experience taught me that technical excellence means nothing without effective communication. In data science, influence and storytelling are as important as statistical rigor.”
Expected Outcome: Demonstrate communication skills, stakeholder management, business acumen, and self-awareness through structured storytelling with quantified impact.
End of WPP Data Analyst and Data Scientist Interview Guide
This comprehensive guide covers essential technical skills (SQL, Python, ML), statistical rigor (A/B testing, attribution, incrementality), and business communication abilities required for WPP data roles across Choreograph, GroupM, and agency-embedded analytics teams.