WPP Data Analyst and Data Scientist
Marketing Analytics & Attribution
1. Marketing Attribution Modeling and Multi-Touch Analytics
Level: Data Scientist, Senior Data Analyst
Difficulty: Very Hard
Source: GroupM Analytics Teams, Choreograph Data Science
Team: Media Analytics, Audience Insights
Interview Round: Technical Deep Dive
Question: “Explain how you would design a multi-touch attribution model for a client running campaigns across TV, digital display, paid social, search, and OOH. What methodology would you use, and how would you handle the challenge of offline-to-online attribution?”
Concise Answer:
Core Challenge: Multiple touchpoints influence conversion; need to assign credit appropriately to optimize budget allocation.
Attribution Methodologies Comparison:
| Method | Approach | Best For | Limitations |
|---|---|---|---|
| Last-Click | 100% credit to final touch | Direct response, paid search | Ignores awareness building |
| First-Click | 100% credit to initial touch | Brand awareness | Ignores conversion drivers |
| Linear | Equal credit to all touches | Simple, fair distribution | Doesn’t reflect actual impact |
| Time-Decay | More credit to recent touches | Considered purchases | Arbitrary decay function |
| Data-Driven | ML calculates actual contribution | Most accurate | Requires significant data |
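Before the hybrid recommendation below, a minimal sketch of one rule-based option from the table — time-decay weighting — helps anchor the comparison; the exponential half-life and the helper name are illustrative assumptions, not a standard library API:

import numpy as np

def time_decay_credits(touch_days, conversion_day, half_life_days=7.0):
    """Illustrative time-decay attribution: weight each touchpoint by an
    exponential decay on its distance from the conversion, then normalize
    so credits sum to 1 (half_life_days is an assumed tuning knob)."""
    touch_days = np.asarray(touch_days, dtype=float)
    days_before_conversion = conversion_day - touch_days
    # A touch half_life_days before conversion receives half the weight
    weights = 0.5 ** (days_before_conversion / half_life_days)
    return weights / weights.sum()

# Example: touches on days 0, 5, and 9; conversion on day 10
print(time_decay_credits([0, 5, 9], 10))  # Most credit goes to the day-9 touch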
Recommended Approach: Hybrid Model
1. For Digital Channels (Data-Driven MTA):
# Shapley Value Attribution Framework
import numpy as np
from itertools import combinations

class ShapleyAttribution:
    def __init__(self, conversion_data):
        self.data = conversion_data

    def calculate_shapley_values(self, touchpoints):
        """Calculate the marginal contribution of each channel.

        Note: this simplified estimate averages marginal contributions
        across all coalitions; exact Shapley values weight coalitions
        by size, but the difference is small for a handful of channels.
        """
        n_channels = len(touchpoints)
        shapley_values = {}
        for channel in touchpoints:
            marginal_contributions = []
            # For each possible coalition without this channel
            for r in range(n_channels):
                coalitions = combinations([c for c in touchpoints if c != channel], r)
                for coalition in coalitions:
                    # Contribution = Value(coalition + channel) - Value(coalition)
                    with_channel = self.conversion_rate(coalition + (channel,))
                    without_channel = self.conversion_rate(coalition)
                    marginal_contributions.append(with_channel - without_channel)
            shapley_values[channel] = np.mean(marginal_contributions)
        return shapley_values

    def conversion_rate(self, channels):
        """Get conversion rate for a channel combination"""
        # Filter to journeys touching these channels (empty coalition → 0)
        journeys = self.data[self.data['channels'].isin(channels)]
        if journeys.empty:
            return 0.0
        return journeys['converted'].mean()

# Usage
attribution_model = ShapleyAttribution(customer_journey_data)
channel_credits = attribution_model.calculate_shapley_values(
    ['display', 'social', 'search', 'email']
)

2. For Offline Channels (Marketing Mix Modeling):
import numpy as np
import pandas as pd
import statsmodels.api as sm

class MMM_Model:
    def build_mmm(self, data):
        """Regression-based attribution for TV and OOH"""
        # Dependent variable: sales
        y = data['sales']
        # Independent variables: marketing spend + controls
        X = pd.DataFrame({
            'tv_spend': data['tv_spend'],
            'ooh_spend': data['ooh_spend'],
            'digital_spend': data['digital_spend'],
            # Control variables
            'price': data['price'],
            'seasonality': data['month'],
            'competitor_activity': data['competitor_spend'],
            # Lag effects (TV impact takes time)
            'tv_spend_lag1': data['tv_spend'].shift(1),
            'tv_spend_lag2': data['tv_spend'].shift(2)
        })
        # Add adstock transformation (decay effect)
        X['tv_adstock'] = self.adstock_transform(data['tv_spend'], decay_rate=0.7)
        # Fit regression (fill the lag-induced NaNs in the first rows)
        X = sm.add_constant(X.fillna(0))
        model = sm.OLS(y, X).fit()
        # Extract ROI by channel
        roi = {
            'tv': model.params['tv_adstock'] / X['tv_spend'].mean(),
            'ooh': model.params['ooh_spend'] / X['ooh_spend'].mean(),
            'digital': model.params['digital_spend'] / X['digital_spend'].mean()
        }
        return model, roi

    def adstock_transform(self, spend, decay_rate):
        """Apply advertising decay effect"""
        spend = np.asarray(spend, dtype=float)
        adstock = np.zeros(len(spend))
        adstock[0] = spend[0]
        for t in range(1, len(spend)):
            adstock[t] = spend[t] + decay_rate * adstock[t - 1]
        return adstock

3. Offline-to-Online Attribution Solutions:
Unique Promo Codes:
- TV ad includes unique code → track digital conversions with code
- OOH billboard features specific landing page → measure traffic
Geo-Testing:
- Run TV in 50% of markets, not in control markets
- Compare digital conversion lift in TV markets vs. control
- Incremental digital conversions = digital conversions in TV markets − digital conversions in control markets (see the sketch below)
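A minimal sketch of that geo-lift arithmetic, assuming a market-level DataFrame with hypothetical 'group' and 'digital_conversions' columns:

import pandas as pd

def geo_test_lift(markets: pd.DataFrame) -> dict:
    """Estimate TV-driven incremental digital conversions from a geo split.
    Assumes one row per market with 'group' in {'tv', 'control'} and
    'digital_conversions' totals over a matched test window."""
    tv_mean = markets.loc[markets['group'] == 'tv', 'digital_conversions'].mean()
    control_mean = markets.loc[markets['group'] == 'control', 'digital_conversions'].mean()
    incremental_per_market = tv_mean - control_mean
    return {
        'incremental_per_market': incremental_per_market,
        'total_incremental': incremental_per_market * (markets['group'] == 'tv').sum(),
        'relative_lift': incremental_per_market / control_mean,
    }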
Brand Lift Studies:
- Survey exposed vs. unexposed audiences
- Measure awareness → search behavior correlation
- Quantify TV’s role in driving branded search volume
Foot Traffic Attribution:
- Partner with location data providers
- Track store visits from exposed audiences
- Connect in-store purchases to digital touchpoints
Integrated Framework:
┌─────────────────────────────────────────────────┐
│ Customer Journey with Multi-Channel Exposure │
└─────────────────────────────────────────────────┘
│
├─→ Digital Channels → Data-Driven MTA
│ (Search, Display, Social, Email)
│ • Shapley values
│ • 95% attribution confidence
│
├─→ Offline Channels → Marketing Mix Model
│ (TV, OOH, Radio)
│ • Regression-based ROI
│ • Adstock effects
│
└─→ Cross-Channel Validation
• Geo-experiments
• Hold-out testing
        • Incrementality studies

Implementation Recommendations:
- Data Requirements:
- Minimum 12 months historical data
- User-level journey data for digital
- Weekly aggregated spend and sales for MMM
- 500+ conversions per month for statistical power
- Validation:
- Compare total attributed conversions to actual conversions (should sum to 100%)
- Back-test predictions against a holdout period (see the sketch after this list)
- Run incrementality experiments to validate model accuracy
- Business Application:
- Provide channel-specific ROAS estimates
- Recommend budget reallocation (shift from over-invested to under-invested channels)
- Update monthly as campaign performance evolves
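A hedged sketch of the back-test step above, assuming the weekly MMM design matrix and sales series from the earlier snippet (the 12-week holdout and the MAPE cutoff are assumptions):

import numpy as np
import statsmodels.api as sm

def backtest_mmm(X, y, holdout_weeks=12):
    """Refit the MMM on all but the last `holdout_weeks` of data, then
    compare predicted sales to actuals in the holdout window."""
    X_train, y_train = X.iloc[:-holdout_weeks], y.iloc[:-holdout_weeks]
    X_hold, y_hold = X.iloc[-holdout_weeks:], y.iloc[-holdout_weeks:]
    model = sm.OLS(y_train, X_train).fit()
    preds = model.predict(X_hold)
    mape = np.mean(np.abs((y_hold - preds) / y_hold)) * 100
    return model, mape  # e.g., investigate the model if MAPE exceeds ~15%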
Expected Outcomes:
- Digital attribution: 85-90% confidence in credit allocation
- TV/OOH ROI estimates: ±20% margin of error (typical for MMM)
- Budget optimization: 10-15% efficiency improvement through reallocation
Statistical Testing & Experimentation
2. A/B Testing Design and Analysis for Marketing Campaigns
Level: Data Analyst, Data Scientist
Difficulty: Hard
Source: Standard GroupM Analytics Assessment
Team: Campaign Analytics, Performance Measurement
Interview Round: Technical Screening
Question: “Design an A/B test to evaluate whether a new creative treatment improves campaign conversion rate. Walk through sample size calculation, randomization strategy, test duration, statistical test selection, and how you’d interpret results. What would you do if results are borderline significant?”
Concise Answer:
Test Design Framework:
1. Hypothesis Definition:
- H₀ (Null): New creative has same conversion rate as control
- H₁ (Alternative): New creative increases conversion rate
- Primary Metric: Conversion rate (conversions ÷ impressions)
- Secondary Metrics: CTR, CPC, time to conversion
2. Sample Size Calculation:
import scipy.stats as stats
import numpy as np
def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """Calculate required sample size per variant.

    Parameters:
    - baseline_rate: Current conversion rate (e.g., 0.02 for 2%)
    - mde: Minimum detectable effect (e.g., 0.10 for 10% relative lift)
    - alpha: Significance level (Type I error rate)
    - power: Statistical power (1 - Type II error rate)
    """
    # Effect size (Cohen's h)
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde)
    effect_size = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))
    # Calculate sample size using power analysis
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))

# Example: Detect a 10% relative lift on a 2% baseline
baseline_cr = 0.02
min_detectable_effect = 0.10
n_per_variant = calculate_sample_size(baseline_cr, min_detectable_effect)
print(f"Required sample size per variant: {n_per_variant:,}")
print(f"Total sample size: {n_per_variant * 2:,}")
# Output: ~40,300 per variant (~80,600 total)

3. Randomization Strategy:
class ABTestRandomizer:
    def __init__(self, variants=['control', 'treatment'], split=[0.5, 0.5]):
        self.variants = variants
        self.split = split

    def assign_variant(self, user_id):
        """Deterministic random assignment based on user_id hash"""
        import hashlib
        # Hash user_id for consistent assignment
        hash_value = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        # Assign to a variant bucket based on hash modulo
        percentile = (hash_value % 100) / 100
        cumulative = 0
        for variant, proportion in zip(self.variants, self.split):
            cumulative += proportion
            if percentile < cumulative:
                return variant
        return self.variants[-1]

# Usage
randomizer = ABTestRandomizer()
user_variant = randomizer.assign_variant(user_id=12345)

4. Test Duration:
- Minimum: run until the required sample size is reached (see the duration sketch after this list)
- Typical: 2-4 weeks to account for:
- Day-of-week effects (weekday vs. weekend behavior)
- Weekly cycles in business
- Seasonal patterns
- Early Stopping: Only if pre-defined stopping rules met (e.g., overwhelming evidence)
- Avoid: Continuous p-value checking (“p-hacking” inflates false positive rate)
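Tying duration back to the sample-size calculation, a small sketch (the daily traffic figure is a made-up input):

import math

def min_test_duration_days(n_per_variant, daily_eligible_users, n_variants=2):
    """Days needed to reach the target sample size, rounded up to whole
    weeks to smooth day-of-week effects, with a two-week floor."""
    days = math.ceil(n_per_variant * n_variants / daily_eligible_users)
    weeks = math.ceil(days / 7)
    return max(weeks * 7, 14)

# Example: ~40,300 per variant at 5,000 eligible users/day → 21 days
print(min_test_duration_days(40_300, 5_000))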
5. Statistical Analysis:
def analyze_ab_test(control_conversions, control_n, treatment_conversions, treatment_n):
    """Two-proportion z-test for conversion rates"""
    # Conversion rates
    p_control = control_conversions / control_n
    p_treatment = treatment_conversions / treatment_n
    # Pooled proportion (under the null hypothesis)
    p_pooled = (control_conversions + treatment_conversions) / (control_n + treatment_n)
    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/control_n + 1/treatment_n))
    # Z-statistic
    z_stat = (p_treatment - p_control) / se
    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    # Confidence interval for the difference (unpooled standard error)
    se_diff = np.sqrt(p_control*(1-p_control)/control_n + p_treatment*(1-p_treatment)/treatment_n)
    ci_lower = (p_treatment - p_control) - 1.96 * se_diff
    ci_upper = (p_treatment - p_control) + 1.96 * se_diff
    # Relative lift
    relative_lift = (p_treatment - p_control) / p_control
    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'absolute_diff': p_treatment - p_control,
        'relative_lift': relative_lift,
        'ci_95': (ci_lower, ci_upper),
        'z_statistic': z_stat,
        'p_value': p_value,
        'significant': p_value < 0.05
    }

# Example
results = analyze_ab_test(
    control_conversions=320,
    control_n=16000,
    treatment_conversions=374,
    treatment_n=16000
)
print(f"Control CR: {results['control_rate']:.3%}")
print(f"Treatment CR: {results['treatment_rate']:.3%}")
print(f"Relative Lift: {results['relative_lift']:.1%}")
print(f"P-value: {results['p_value']:.4f}")
print(f"95% CI: [{results['ci_95'][0]:.3%}, {results['ci_95'][1]:.3%}]")
print(f"Statistically Significant: {results['significant']}")

6. Handling Borderline Results (p = 0.06):
Don’t Automatically Dismiss:
- p = 0.06 means a 6% chance of observing an effect at least this large if the null hypothesis were true
- Statistical significance is an arbitrary threshold, not absolute truth
Decision Framework:
| Factor | Consideration |
|---|---|
| Confidence Interval | If CI for lift is [+2%, +18%], practical significance exists even if p > 0.05 |
| Business Cost | Low cost to implement new creative → accept higher risk threshold |
| Prior Evidence | If aligns with previous tests showing lift, increases credibility |
| Sample Size | If barely missed target, consider extending test |
| Effect Size | Large effect (>20% lift) with p=0.06 more compelling than small effect |
Recommended Approach:
1. Present full picture: “p = 0.06, 95% CI shows [+1%, +15%] lift”
2. Quantify uncertainty: “If the new creative had no effect, a lift this large would appear only ~6% of the time” (a Bayesian framing is sketched after this list)
3. Business decision: “Given low implementation cost and positive signal, recommend rolling out with continued monitoring”
4. Alternative: “Run follow-up test with larger sample to confirm”
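One hedged way to put a number on “quantify uncertainty” is a Beta-Binomial simulation of the probability that the treatment truly beats control; this Bayesian framing is an alternative to the z-test above, with uniform Beta(1, 1) priors assumed:

import numpy as np

def prob_treatment_beats_control(conv_c, n_c, conv_t, n_t, n_draws=100_000, seed=42):
    """Monte Carlo estimate of P(treatment rate > control rate) under
    independent Beta(1, 1) priors on each variant's conversion rate."""
    rng = np.random.default_rng(seed)
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, n_draws)
    treatment = rng.beta(1 + conv_t, 1 + n_t - conv_t, n_draws)
    return (treatment > control).mean()

# Using the example counts from the z-test above
print(f"P(treatment > control) ≈ {prob_treatment_beats_control(320, 16000, 374, 16000):.1%}")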
7. Validity Checks:
from scipy.stats import chisquare

def validate_test(control_data, treatment_data):
    """Check for violations of test assumptions"""
    checks = {}
    # 1. Sample Ratio Mismatch (SRM): chi-square on assignment counts vs. a 50/50 split
    n_control, n_treatment = len(control_data), len(treatment_data)
    expected = (n_control + n_treatment) / 2
    chi2_stat, srm_p_value = chisquare([n_control, n_treatment], [expected, expected])
    checks['srm_violation'] = srm_p_value < 0.001
    # 2. Novelty effect (check if lift decays over time)
    treatment_data['week'] = treatment_data['date'].dt.isocalendar().week
    weekly_cr = treatment_data.groupby('week')['converted'].mean()
    checks['novelty_effect'] = weekly_cr.iloc[0] > weekly_cr.iloc[-1] * 1.5
    # 3. Segment heterogeneity (analyze_segment assumed to be defined elsewhere)
    for segment in ['new_users', 'returning_users']:
        segment_effect = analyze_segment(control_data, treatment_data, segment)
        checks[f'{segment}_effect'] = segment_effect
    return checks

Key Principles:
- Pre-register analysis plan: Define stopping rules and primary metric before starting
- One primary metric: multiple testing inflates the false positive rate (a simple correction is sketched after this list)
- Guard rails: Monitor secondary metrics for negative unintended effects
- Segment analysis: Check if effects differ by user type, device, etc.
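If guardrail metrics are formally tested alongside the primary metric, a Bonferroni-style correction is one simple safeguard; a minimal sketch with placeholder metric names:

def bonferroni_flags(p_values: dict, alpha=0.05) -> dict:
    """Compare each metric's p-value to alpha split across all tests."""
    adjusted_alpha = alpha / len(p_values)
    return {metric: p < adjusted_alpha for metric, p in p_values.items()}

# Example: one primary metric plus two guardrails (placeholder values)
print(bonferroni_flags({'conversion_rate': 0.004, 'ctr': 0.20, 'bounce_rate': 0.04}))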
Expected Outcome: Demonstrate rigorous statistical thinking, awareness of practical considerations (not just theoretical), and ability to communicate uncertainty to business stakeholders.
Machine Learning & Predictive Analytics
3. Customer Segmentation Using Unsupervised Learning
Level: Data Scientist
Difficulty: Hard
Source: Choreograph Analytics, GroupM Audience Insights
Team: Consumer Analytics, Audience Strategy
Interview Round: ML Technical Assessment
Question: “You have a dataset with customer demographics, purchase history, website behavior, and engagement metrics. How would you segment this customer base to inform targeted marketing strategies? Walk through your approach from data preparation through final segment definition.”
Concise Answer:
Step 1: Feature Engineering
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
def create_behavioral_features(customer_data):
    """Engineer marketing-relevant features"""
    features = pd.DataFrame()
    # RFM (Recency, Frequency, Monetary)
    features['recency_days'] = (pd.Timestamp.now() - customer_data['last_purchase_date']).dt.days
    features['frequency'] = customer_data['purchase_count_12m']
    features['monetary'] = customer_data['total_spend_12m']
    features['avg_order_value'] = customer_data['total_spend_12m'] / (customer_data['purchase_count_12m'] + 1)
    # Engagement
    features['email_open_rate'] = customer_data['emails_opened'] / (customer_data['emails_sent'] + 1)
    features['website_sessions'] = customer_data['sessions_3m']
    features['pages_per_session'] = customer_data['total_pageviews'] / (customer_data['sessions_3m'] + 1)
    # Product preferences
    features['product_diversity'] = customer_data['unique_product_categories']
    features['pref_category_pct'] = customer_data['top_category_spend'] / (customer_data['total_spend_12m'] + 1)
    # Demographics (one-hot encode)
    demographics = pd.get_dummies(customer_data[['age_group', 'location_tier']], drop_first=True)
    return pd.concat([features, demographics], axis=1)

features = create_behavioral_features(customer_data)

Step 2: Clustering Algorithm Selection
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Normalize features (critical for distance-based clustering)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Method 1: K-Means (most common)
def find_optimal_k(data, k_range=range(2, 11)):
    """Elbow method + silhouette analysis"""
    inertias = []
    silhouette_scores = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(data)
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(data, labels))
    # Plot elbow curve
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(k_range, inertias, marker='o')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method')
    plt.subplot(1, 2, 2)
    plt.plot(k_range, silhouette_scores, marker='o')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Analysis')
    plt.tight_layout()
    plt.show()
    # Recommend k (high silhouette, interpretable number)
    optimal_k = k_range[np.argmax(silhouette_scores)]
    return optimal_k

optimal_k = find_optimal_k(features_scaled)
print(f"Recommended number of clusters: {optimal_k}")

# Fit final model
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)
segments = kmeans.fit_predict(features_scaled)

Step 3: Segment Profiling
def profile_segments(data, segments, features):
    """Create business-friendly segment descriptions"""
    data['segment'] = segments
    profiles = data.groupby('segment').agg({
        'recency_days': ['mean', 'median'],
        'frequency': ['mean', 'median'],
        'monetary': ['mean', 'median', 'sum'],
        'email_open_rate': 'mean',
        'website_sessions': 'mean'
    }).round(2)
    # Size and value contribution
    segment_sizes = data.groupby('segment').size()
    segment_revenue = data.groupby('segment')['monetary'].sum()
    profiles['size'] = segment_sizes
    profiles['size_pct'] = (segment_sizes / len(data) * 100).round(1)
    profiles['revenue_contribution'] = segment_revenue
    profiles['revenue_pct'] = (segment_revenue / segment_revenue.sum() * 100).round(1)
    return profiles

profiles = profile_segments(customer_data, segments, features)
print(profiles)

Step 4: Segment Naming and Strategy
Example Segmentation Results:
| Segment | Description | Size | Revenue % | Strategy |
|---|---|---|---|---|
| Champions | High frequency, high spend, engaged | 12% | 35% | VIP program, early access, retention focus |
| Loyal Customers | Regular purchasers, moderate spend | 22% | 28% | Loyalty rewards, cross-sell opportunities |
| Promising | Recent customers, growing spend | 18% | 15% | Nurture campaigns, onboarding programs |
| At Risk | Previously active, declining engagement | 15% | 12% | Win-back campaigns, special offers |
| Bargain Hunters | Only purchase on promotion | 20% | 8% | Targeted promotions, clearance alerts |
| Lost | No activity >6 months | 13% | 2% | Reactivation campaign or deprioritize |
Marketing Actions by Segment:
def generate_segment_actions(segment_name):
    """Map segments to marketing tactics"""
    strategies = {
        'Champions': {
            'channels': ['Email (premium content)', 'Direct mail', 'SMS'],
            'messaging': 'Exclusive access, personalized recommendations',
            'offers': 'Early product launches, VIP events',
            'frequency': 'Weekly engagement'
        },
        'At Risk': {
            'channels': ['Email (re-engagement)', 'Retargeting ads', 'SMS'],
            'messaging': 'We miss you, special comeback offer',
            'offers': '20% discount, free shipping',
            'frequency': 'Bi-weekly for 2 months'
        },
        'Bargain Hunters': {
            'channels': ['Email (promotions)', 'Social ads'],
            'messaging': 'Flash sales, limited-time offers',
            'offers': 'Discount codes, BOGO deals',
            'frequency': 'Promotion-based (weekly)'
        }
    }
    return strategies.get(segment_name, {})

Step 5: Validation and Actionability
def validate_segmentation(data, segments):
    """Ensure segments are statistically distinct and actionable"""
    validations = {}
    # 1. Statistical separation (ANOVA on a key metric)
    from scipy.stats import f_oneway
    segment_groups = [data[segments == i]['monetary'] for i in range(max(segments) + 1)]
    f_stat, p_value = f_oneway(*segment_groups)
    validations['statistically_distinct'] = p_value < 0.05
    # 2. Segment stability: re-cluster on different time periods and
    #    check whether customers remain in the same segments
    # 3. Actionability checklist
    validations['identifiable'] = True  # Can we assign new customers to a segment?
    validations['substantial'] = all(data.groupby(segments).size() > len(data) * 0.05)  # Each >5% of base?
    validations['differentiable'] = True  # Do segments have distinct characteristics?
    return validations

validation_results = validate_segmentation(customer_data, segments)

Alternative: Hierarchical Clustering for Nested Segments
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Hierarchical clustering
linkage_matrix = linkage(features_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.show()

# Fit hierarchical model
hierarchical = AgglomerativeClustering(n_clusters=optimal_k, linkage='ward')
hierarchical_segments = hierarchical.fit_predict(features_scaled)

Business Deliverables:
1. Segment definitions dashboard (size, characteristics, value)
2. Customer assignment API (score new customers in real time; see the sketch below)
3. Campaign playbook (recommended tactics per segment)
4. Performance tracking (segment migration, revenue by segment)
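A minimal sketch of deliverable 2, reusing the fitted `scaler` and `kmeans` objects from the steps above (assumes new customers arrive with the same raw columns used in Step 1):

def assign_segment(new_customer_data):
    """Score new customers into the existing segments: rebuild the same
    behavioral features, apply the already-fitted scaler (no refitting),
    then assign each customer to the nearest K-Means centroid."""
    new_features = create_behavioral_features(new_customer_data)
    new_scaled = scaler.transform(new_features)  # transform only; never refit on new data
    return kmeans.predict(new_scaled)

# Example: one segment label per new customer row
# segment_labels = assign_segment(new_customers_df)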
Expected Outcome: Actionable segments that marketing teams can operationalize, demonstrating both technical ML skills and business translation ability.
4. Predictive Modeling for Customer Churn
Level: Data Scientist, Senior Data Analyst
Difficulty: Very Hard
Source: GroupM Predictive Analytics, Choreograph ML Team
Team: Customer Analytics, Retention Strategy
Interview Round: ML System Design
Question: “Build a model to predict which customers are likely to churn in the next 90 days. Walk through your approach: feature engineering, model selection, handling class imbalance, evaluation metrics, and how you’d present results to marketing stakeholders.”
Concise Answer:
End-to-End Churn Prediction Pipeline:
1. Feature Engineering for Churn:
def create_churn_features(customer_data, transaction_data, engagement_data):
    """Engineer predictive features for 90-day churn"""
    features = pd.DataFrame(index=customer_data.index)
    # Recency & trend features
    features['days_since_last_purchase'] = (pd.Timestamp.now() - customer_data['last_purchase_date']).dt.days
    features['days_since_last_interaction'] = (pd.Timestamp.now() - engagement_data['last_activity_date']).dt.days
    # Declining engagement (compare recent vs. historical)
    features['purchase_frequency_30d'] = transaction_data.groupby('customer_id').apply(
        lambda x: len(x[x['date'] > pd.Timestamp.now() - pd.Timedelta(days=30)])
    )
    features['purchase_frequency_90d'] = transaction_data.groupby('customer_id').apply(
        lambda x: len(x[x['date'] > pd.Timestamp.now() - pd.Timedelta(days=90)])
    )
    features['engagement_decline'] = features['purchase_frequency_90d'] - features['purchase_frequency_30d'] * 3
    # Spend patterns
    features['avg_order_value_trend'] = customer_data['recent_aov'] - customer_data['historical_aov']
    features['total_spend_decline_pct'] = (customer_data['spend_last_90d'] - customer_data['spend_prev_90d']) / (customer_data['spend_prev_90d'] + 1)
    # Product category shift (high = loyal to one category)
    features['category_consistency'] = customer_data['top_category_spend_pct']
    # Customer service interactions (negative signal)
    features['complaints_30d'] = engagement_data['complaints_count_30d']
    features['support_contacts_30d'] = engagement_data['support_interactions_30d']
    # Email/communication engagement
    features['email_open_rate'] = engagement_data['emails_opened'] / (engagement_data['emails_sent'] + 1)
    features['unsubscribe_flag'] = engagement_data['email_unsubscribed'].astype(int)
    # Tenure
    features['customer_age_days'] = (pd.Timestamp.now() - customer_data['signup_date']).dt.days
    # Promotion dependency
    features['discount_purchase_rate'] = customer_data['purchases_with_discount'] / (customer_data['total_purchases'] + 1)
    return features

features = create_churn_features(customer_data, transactions, engagement)

2. Handle Severe Class Imbalance:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
# Typical churn rate: 5-15%
print(f"Churn rate: {y.mean():.2%}")

# Strategy 1: SMOTE + undersampling
smote_undersampling = ImbPipeline([
    ('smote', SMOTE(sampling_strategy=0.3, random_state=42)),  # Oversample minority to 30% of majority
    ('undersample', RandomUnderSampler(sampling_strategy=0.5, random_state=42))  # Then undersample to a 1:2 minority:majority ratio
])
X_resampled, y_resampled = smote_undersampling.fit_resample(X_train, y_train)

# Strategy 2: Class weights (no resampling)
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}

3. Model Training and Selection:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Baseline: Logistic Regression (interpretable)
logreg = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)

# Tree-based models (handle non-linearity)
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    class_weight='balanced',
    random_state=42
)
xgb = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),  # Class imbalance ratio
    random_state=42
)

# Compare models
models = {'LogisticRegression': logreg, 'RandomForest': rf, 'XGBoost': xgb}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f"{name} - Mean AUC: {scores.mean():.3f} (±{scores.std():.3f})")

# Select best model (typically XGBoost)
best_model = xgb
best_model.fit(X_train, y_train)

4. Evaluation Metrics (Beyond Accuracy):
from sklearn.metrics import (classification_report, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score, recall_score,
                             roc_auc_score)
import matplotlib.pyplot as plt

y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Metrics for imbalanced classification
def evaluate_churn_model(y_true, y_pred_proba, threshold=0.5):
    """Comprehensive model evaluation"""
    y_pred = (y_pred_proba >= threshold).astype(int)
    # ROC-AUC (overall discrimination)
    auc = roc_auc_score(y_true, y_pred_proba)
    # Precision-recall at threshold
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    # Business metrics
    retention_offer_cost = 20      # Cost to send retention offer
    customer_ltv = 500             # Average customer lifetime value
    retention_success_rate = 0.30  # 30% of offers succeed
    # Expected value of the model
    true_positives_value = tp * customer_ltv * retention_success_rate
    false_positives_cost = fp * retention_offer_cost
    false_negatives_cost = fn * customer_ltv  # Lost customers we didn't target
    net_value = true_positives_value - false_positives_cost - false_negatives_cost
    return {
        'auc': auc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'confusion_matrix': cm,
        'net_business_value': net_value
    }

results = evaluate_churn_model(y_test, y_pred_proba, threshold=0.3)  # Lower threshold for higher recall
print(f"AUC: {results['auc']:.3f}")
print(f"Precision: {results['precision']:.3f} | Recall: {results['recall']:.3f}")
print(f"Net Business Value: ${results['net_business_value']:,.0f}")

5. Threshold Optimization:
def optimize_threshold(y_true, y_pred_proba, offer_cost=20, ltv=500, retention_rate=0.3):
    """Find the threshold maximizing business value"""
    thresholds = np.linspace(0.1, 0.9, 50)
    net_values = []
    for threshold in thresholds:
        y_pred = (y_pred_proba >= threshold).astype(int)
        cm = confusion_matrix(y_true, y_pred)
        tn, fp, fn, tp = cm.ravel()
        value = (tp * ltv * retention_rate) - (fp * offer_cost) - (fn * ltv)
        net_values.append(value)
    optimal_idx = np.argmax(net_values)
    optimal_threshold = thresholds[optimal_idx]
    plt.figure(figsize=(10, 5))
    plt.plot(thresholds, net_values)
    plt.axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.2f}')
    plt.xlabel('Prediction Threshold')
    plt.ylabel('Net Business Value ($)')
    plt.title('Threshold Optimization')
    plt.legend()
    plt.show()
    return optimal_threshold

optimal_threshold = optimize_threshold(y_test, y_pred_proba)

6. Feature Importance and Interpretation:
import shap

# SHAP values for model interpretation
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Top predictive features
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)
print("Top 10 Churn Predictors:")
print(feature_importance.head(10))

7. Business Presentation:
Stakeholder Communication Framework:
=== CHURN PREDICTION MODEL RESULTS ===
MODEL PERFORMANCE:
- AUC: 0.83 (Strong predictive power)
- Identifies 65% of churners (recall) with 55% precision
TOP CHURN INDICATORS:
1. Days since last purchase (35+ days = high risk)
2. Declining email engagement (-40% open rate)
3. Customer service complaints (2+ in 30 days)
4. Spend decline (>30% drop vs. prior period)
5. Email unsubscribe
RECOMMENDED ACTIONS:
→ Target top 20% risk score (5,000 customers)
- Expected churners in this group: 2,750
- Retention campaign saves 30%: 825 customers
- Value saved: 825 × $500 LTV = $412,500
- Campaign cost: 5,000 × $20 = $100,000
- NET VALUE: $312,500
SEGMENTED STRATEGIES:
- High risk + high value: Personal outreach, exclusive offers
- High risk + moderate value: Automated email campaigns
- Moderate risk: Monitor, engage with content

8. Deployment and Monitoring:
# Monthly scoring pipeline
def score_customers_for_churn(customer_df):
    """Production scoring function"""
    # Extract features
    features = create_churn_features(customer_df, transactions, engagement)
    # Generate predictions
    churn_probabilities = best_model.predict_proba(features)[:, 1]
    # Create output
    output = pd.DataFrame({
        'customer_id': customer_df['customer_id'],
        'churn_probability': churn_probabilities,
        'risk_tier': pd.cut(churn_probabilities, bins=[0, 0.3, 0.6, 1.0], labels=['Low', 'Medium', 'High']),
        'recommended_action': np.where(churn_probabilities > optimal_threshold, 'RETENTION_CAMPAIGN', 'MONITOR')
    })
    return output

# Model monitoring (detect drift)
def monitor_model_performance(predictions_df, actual_churn_df):
    """Track model accuracy over time"""
    monthly_performance = predictions_df.merge(actual_churn_df, on='customer_id')
    auc = roc_auc_score(monthly_performance['actual_churn'], monthly_performance['churn_probability'])
    if auc < 0.75:  # Performance degraded
        trigger_model_retraining()  # Retraining hook assumed to exist elsewhere
    return auc

Expected Outcome: End-to-end churn prediction system demonstrating ML expertise, business acumen, and ability to translate technical models into actionable retention strategies with quantified ROI.
5. SQL Query Optimization and Complex Data Extraction
Level: Data Analyst, Data Scientist
Difficulty: Hard
Source: Choreograph SQL Assessment, GroupM Technical Screen
Team: Data Engineering, Analytics
Interview Round: Technical Coding
Question: “Write a SQL query to calculate 30-day rolling retention rate by acquisition cohort from tables containing user_signups, user_activity, and user_purchases. Optimize for performance on a dataset with 10 million users.”
Concise Answer:
Database Schema:
-- user_signups: user_id, signup_date, acquisition_source
-- user_activity: user_id, activity_date, activity_type
-- user_purchases: user_id, purchase_date, amount

Solution:
-- 30-day rolling retention by acquisition cohort
-- Note: DATE_TRUNC / DATEDIFF syntax varies by engine; adjust for your dialect
WITH user_cohorts AS (
    -- Assign users to monthly cohorts
    SELECT
        user_id,
        DATE_TRUNC('month', signup_date) AS cohort_month,
        signup_date
    FROM user_signups
    WHERE signup_date >= '2023-01-01'
),
cohort_sizes AS (
    -- Count users in each cohort
    SELECT
        cohort_month,
        COUNT(DISTINCT user_id) AS cohort_size
    FROM user_cohorts
    GROUP BY cohort_month
),
user_activity_with_cohort AS (
    -- Join activity with cohort info and calculate days since signup
    SELECT
        ua.user_id,
        uc.cohort_month,
        ua.activity_date,
        DATEDIFF(ua.activity_date, uc.signup_date) AS days_since_signup
    FROM user_activity ua
    INNER JOIN user_cohorts uc
        ON ua.user_id = uc.user_id
    WHERE ua.activity_date >= uc.signup_date  -- Only activity after signup
),
retention_30day AS (
    -- Count users active in the 30-60 day window
    SELECT
        cohort_month,
        COUNT(DISTINCT user_id) AS active_users_30day
    FROM user_activity_with_cohort
    WHERE days_since_signup BETWEEN 30 AND 59
    GROUP BY cohort_month
)
-- Final calculation
SELECT
    cs.cohort_month,
    cs.cohort_size,
    COALESCE(r.active_users_30day, 0) AS active_users_30day,
    ROUND(100.0 * COALESCE(r.active_users_30day, 0) / cs.cohort_size, 2) AS retention_rate_pct,
    -- Additional metrics
    cs.cohort_size - COALESCE(r.active_users_30day, 0) AS churned_users,
    ROUND(100.0 * (cs.cohort_size - COALESCE(r.active_users_30day, 0)) / cs.cohort_size, 2) AS churn_rate_pct
FROM cohort_sizes cs
LEFT JOIN retention_30day r
    ON cs.cohort_month = r.cohort_month
ORDER BY cs.cohort_month DESC;

Performance Optimizations:
-- 1. Create indexes for JOIN and WHERE clauses
CREATE INDEX idx_user_signups_date ON user_signups(signup_date, user_id);
CREATE INDEX idx_user_activity_user_date ON user_activity(user_id, activity_date);
CREATE INDEX idx_user_activity_date ON user_activity(activity_date);

-- 2. Partition large tables by date
ALTER TABLE user_activity
PARTITION BY RANGE (YEAR(activity_date), MONTH(activity_date));

-- 3. Materialized view for cohort assignments (if frequently queried)
CREATE MATERIALIZED VIEW mv_user_cohorts AS
SELECT
    user_id,
    DATE_TRUNC('month', signup_date) AS cohort_month,
    signup_date
FROM user_signups;

REFRESH MATERIALIZED VIEW mv_user_cohorts;  -- Update daily

-- 4. Single-pass alternative using a conditional aggregate
WITH cohort_activity AS (
    SELECT
        DATE_TRUNC('month', s.signup_date) AS cohort_month,
        s.user_id,
        MAX(CASE
            WHEN DATEDIFF(a.activity_date, s.signup_date) BETWEEN 30 AND 59
            THEN 1 ELSE 0
        END) AS active_in_period  -- MAX: active at least once in the window
    FROM user_signups s
    LEFT JOIN user_activity a
        ON s.user_id = a.user_id
        AND a.activity_date BETWEEN s.signup_date + INTERVAL 30 DAY
                                AND s.signup_date + INTERVAL 60 DAY
    GROUP BY 1, 2
)
SELECT
    cohort_month,
    COUNT(*) AS cohort_size,
    SUM(active_in_period) AS retained_users,
    ROUND(100.0 * SUM(active_in_period) / COUNT(*), 2) AS retention_rate_pct
FROM cohort_activity
GROUP BY cohort_month
ORDER BY cohort_month DESC;

Query Execution Plan Analysis:
-- Check query performance
EXPLAIN ANALYZE
SELECT ...

-- Look for:
-- - Sequential scans (bad) vs. index scans (good)
-- - Join method (hash join is typically faster for large tables)
-- - Estimated vs. actual rows (if far off, update table statistics)

Additional Complexity: Multi-Dimensional Retention:
-- Retention by acquisition source
-- (the original CROSS JOIN to a distinct-month derived table is redundant;
--  grouping directly on the truncated signup date is equivalent and cheaper)
SELECT
    DATE_TRUNC('month', s.signup_date) AS cohort_month,
    s.acquisition_source,
    COUNT(DISTINCT s.user_id) AS cohort_size,
    COUNT(DISTINCT CASE
        WHEN DATEDIFF(a.activity_date, s.signup_date) BETWEEN 30 AND 59
        THEN a.user_id
    END) AS retained_users,
    ROUND(100.0 * COUNT(DISTINCT CASE
        WHEN DATEDIFF(a.activity_date, s.signup_date) BETWEEN 30 AND 59
        THEN a.user_id
    END) / COUNT(DISTINCT s.user_id), 2) AS retention_rate_pct
FROM user_signups s
LEFT JOIN user_activity a
    ON s.user_id = a.user_id
GROUP BY DATE_TRUNC('month', s.signup_date), s.acquisition_source
ORDER BY cohort_month DESC, retention_rate_pct DESC;

Key Optimization Principles:
1. Index strategically: Join columns, WHERE clause filters, GROUP BY columns
2. Filter early: Apply WHERE clauses before JOINs when possible
3. Partition large tables: By date for time-series queries
4. Use CTEs wisely: Readable but may not be materialized; sometimes temp tables faster
5. Aggregate before joining: Reduce row count before expensive operations
6. Monitor execution plans: Use EXPLAIN to identify bottlenecks
Expected Outcome: Demonstrate SQL proficiency, performance optimization thinking, and ability to handle complex analytical queries on large datasets typical in marketing analytics.
Data Visualization & Communication
6. Python Data Analysis and Campaign Performance Reporting
Level: Data Analyst, Junior Data Scientist
Difficulty: Moderate
Source: WPP Media Case Study, GroupM Analytics
Team: Campaign Analytics, Reporting
Interview Round: Technical Exercise
Question: “You have a CSV file with campaign data: date, campaign_id, impressions, clicks, conversions, spend. Write Python code to: load and clean the data, calculate performance metrics (CTR, CPC, conversion rate, ROAS), identify underperforming campaigns (bottom 20% by ROAS), visualize trends, and recommend actions.”
Concise Answer:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load and clean data
df = pd.read_csv('campaign_data.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.dropna()
df = df[df['spend'] > 0]  # Remove zero-spend campaigns

# Calculate performance metrics
df['CTR'] = (df['clicks'] / df['impressions']) * 100
df['CPC'] = df['spend'] / df['clicks']
df['conversion_rate'] = (df['conversions'] / df['clicks']) * 100

# Assume $50 revenue per conversion
revenue_per_conversion = 50
df['revenue'] = df['conversions'] * revenue_per_conversion
df['ROAS'] = df['revenue'] / df['spend']
# Identify underperformers (bottom 20%)
roas_threshold = df['ROAS'].quantile(0.20)
underperformers = df[df['ROAS'] <= roas_threshold]
print(f"\n=== UNDERPERFORMING CAMPAIGNS (ROAS <= {roas_threshold:.2f}) ===")
print(underperformers[['campaign_id', 'ROAS', 'CTR', 'conversion_rate', 'spend']].sort_values('ROAS'))
print(f"\nTotal wasted spend: ${underperformers['spend'].sum():,.2f}")
# Aggregate by campaign for summary
campaign_summary = df.groupby('campaign_id').agg({
    'impressions': 'sum',
    'clicks': 'sum',
    'conversions': 'sum',
    'spend': 'sum',
    'revenue': 'sum'
}).reset_index()
campaign_summary['overall_CTR'] = (campaign_summary['clicks'] / campaign_summary['impressions']) * 100
campaign_summary['overall_ROAS'] = campaign_summary['revenue'] / campaign_summary['spend']

# Visualization 1: Performance over time
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Daily impressions and spend
daily_metrics = df.groupby('date').agg({'impressions': 'sum', 'spend': 'sum', 'ROAS': 'mean'})
axes[0, 0].plot(daily_metrics.index, daily_metrics['impressions'], color='blue')
axes[0, 0].set_title('Daily Impressions Trend')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Impressions')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 1].plot(daily_metrics.index, daily_metrics['ROAS'], color='green')
axes[0, 1].axhline(y=1.0, color='r', linestyle='--', label='Breakeven')
axes[0, 1].set_title('Daily ROAS Trend')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('ROAS')
axes[0, 1].legend()
axes[0, 1].tick_params(axis='x', rotation=45)
# Campaign comparison scatter plot
axes[1, 0].scatter(campaign_summary['overall_CTR'], campaign_summary['overall_ROAS'],
                   s=campaign_summary['spend']/100, alpha=0.6)
axes[1, 0].axhline(y=1.0, color='r', linestyle='--', alpha=0.5)
axes[1, 0].set_xlabel('CTR (%)')
axes[1, 0].set_ylabel('ROAS')
axes[1, 0].set_title('Campaign Performance: CTR vs ROAS (size = spend)')
# Top vs. bottom performers
top_performers = campaign_summary.nlargest(5, 'overall_ROAS')
axes[1, 1].barh(top_performers['campaign_id'].astype(str), top_performers['overall_ROAS'], color='green', alpha=0.7)
axes[1, 1].axvline(x=1.0, color='r', linestyle='--', label='Breakeven')
axes[1, 1].set_xlabel('ROAS')
axes[1, 1].set_title('Top 5 Performing Campaigns')
axes[1, 1].legend()
plt.tight_layout()
plt.savefig('campaign_performance_report.png', dpi=300, bbox_inches='tight')
plt.show()
# Recommendations
print("\n=== OPTIMIZATION RECOMMENDATIONS ===\n")
print(f"1. PAUSE UNDERPERFORMERS")
print(f" - {len(underperformers)} campaigns with ROAS < {roas_threshold:.2f}")
print(f" - Savings: ${underperformers['spend'].sum():,.2f}")
print(f" - Campaigns: {underperformers['campaign_id'].unique().tolist()}\n")
top_5 = campaign_summary.nlargest(5, 'overall_ROAS')
print(f"2. SCALE TOP PERFORMERS")
print(f" - Top 5 campaigns have average ROAS of {top_5['overall_ROAS'].mean():.2f}")
print(f" - Recommend increasing budget by 50% for:")
for idx, row in top_5.iterrows():
    print(f"    Campaign {row['campaign_id']}: ROAS {row['overall_ROAS']:.2f}")
medium_performers = campaign_summary[
(campaign_summary['overall_ROAS'] >= 1.0) &
(campaign_summary['overall_ROAS'] < campaign_summary['overall_ROAS'].quantile(0.75))
]
print(f"\n3. OPTIMIZE MEDIUM PERFORMERS")
print(f" - {len(medium_performers)} campaigns with ROAS between 1.0 and top quartile")
print(f" - Test creative variations to improve conversion rate")
print(f" - Refine audience targeting to reduce wasted impressions")
# Budget reallocation calculation
savings_from_pause = underperformers['spend'].sum()
reallocate_to_top = savings_from_pause * 0.7
keep_for_testing = savings_from_pause * 0.3

print(f"\n4. BUDGET REALLOCATION")
print(f" - Current total spend: ${df['spend'].sum():,.2f}")
print(f" - Saved from pausing: ${savings_from_pause:,.2f}")
print(f" - Reallocate to top performers: ${reallocate_to_top:,.2f}")
print(f" - Reserve for testing new campaigns: ${keep_for_testing:,.2f}")
print(f" - Expected additional revenue: ${reallocate_to_top * top_5['overall_ROAS'].mean():,.2f}")
# Export results
campaign_summary.to_csv('campaign_performance_summary.csv', index=False)
underperformers[['campaign_id', 'date', 'ROAS', 'spend']].to_csv('underperformers.csv', index=False)
print("\n✓ Analysis complete. Reports saved.")

Key Deliverables:
1. Performance dashboard (4-panel visualization)
2. Actionable recommendations (pause, scale, optimize, reallocate)
3. Quantified impact (savings, expected revenue lift)
4. Export files for stakeholder review
Expected Outcome: Clean, efficient Python code with clear visualizations and business-focused recommendations demonstrating analytical and communication skills.
7. Machine Learning Explainability and Model Interpretation
Level: Data Scientist
Difficulty: Very Hard
Source: Choreograph ML Team
Team: Advanced Analytics, ML Engineering
Interview Round: ML Design
Question: “You’ve built a random forest model predicting campaign conversion probability. Marketing stakeholders want to understand WHY the model makes certain predictions. How would you explain the model’s decision-making process?”
Concise Answer:
Explainability Framework:
1. Global Interpretability (Overall Model Behavior):
import shap
import matplotlib.pyplot as plt
from sklearn.inspection import partial_dependence, PartialDependenceDisplay
# Feature importances (built into Random Forest)
feature_importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importances['feature'][:10], feature_importances['importance'][:10])
plt.xlabel('Importance Score')
plt.title('Top 10 Most Important Features')
plt.gca().invert_yaxis()
plt.show()

print("Top Conversion Drivers:")
for idx, row in feature_importances.head(5).iterrows():
    print(f"{row['feature']}: {row['importance']:.3f}")

2. Partial Dependence Plots (Feature Effects):
# Show how changing one feature affects predictions
features_to_plot = ['email_engagement_score', 'days_since_last_visit', 'ad_frequency']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, feature in enumerate(features_to_plot):
    feature_idx = list(feature_names).index(feature)
    display = PartialDependenceDisplay.from_estimator(
        rf_model, X_train, [feature_idx], ax=axes[idx]
    )
    axes[idx].set_title(f'Effect of {feature}')
plt.tight_layout()
plt.show()

3. Local Interpretability (Individual Predictions):
# SHAP for individual prediction explanation
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

# Explain a specific prediction
def explain_prediction(customer_idx):
    """Explain why the model predicted conversion for a specific customer"""
    prediction_proba = rf_model.predict_proba(X_test[customer_idx].reshape(1, -1))[0][1]
    # Get SHAP values for this prediction (class 1 = conversion)
    shap_vals = shap_values[1][customer_idx]
    # Create explanation dataframe
    explanation = pd.DataFrame({
        'feature': feature_names,
        'value': X_test[customer_idx],
        'shap_value': shap_vals
    }).sort_values('shap_value', key=abs, ascending=False)
    print(f"\n=== PREDICTION EXPLANATION ===")
    print(f"Predicted Conversion Probability: {prediction_proba:.1%}\n")
    print("Top factors INCREASING conversion probability:")
    positive_factors = explanation[explanation['shap_value'] > 0].head(5)
    for idx, row in positive_factors.iterrows():
        print(f"  + {row['feature']} = {row['value']:.2f} → +{row['shap_value']:.3f}")
    print("\nTop factors DECREASING conversion probability:")
    negative_factors = explanation[explanation['shap_value'] < 0].head(5)
    for idx, row in negative_factors.iterrows():
        print(f"  - {row['feature']} = {row['value']:.2f} → {row['shap_value']:.3f}")
    # Waterfall plot
    shap.waterfall_plot(shap.Explanation(values=shap_vals, base_values=explainer.expected_value[1],
                                         data=X_test[customer_idx], feature_names=feature_names))

# Example
explain_prediction(customer_idx=42)

4. Business-Friendly Explanation:
Customer #42 Conversion Prediction: 68%
WHY THIS PREDICTION?
POSITIVE SIGNALS (increasing conversion probability):
✓ High email engagement score (0.85) → +12%
→ This customer actively opens and clicks emails
✓ Recent website visit (2 days ago) → +10%
→ Recent visitors are 3x more likely to convert
✓ Previous purchase history (5 purchases) → +8%
→ Repeat customers have 65% baseline conversion rate
NEGATIVE SIGNALS (decreasing conversion probability):
✗ Low ad frequency (1 impression) → -5%
→ Customers need 3-5 exposures for optimal conversion
✗ Mobile device → -3%
→ Mobile conversion rate 40% lower (poor checkout UX)
RECOMMENDATION:
→ Increase ad frequency to 3-5 impressions
→ Send personalized email with product recommendations
→ Optimize mobile checkout experience

5. SHAP Summary Plot (All Predictions):
# Global view of feature impacts
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)

# The beeswarm plot shows:
# - Feature importance (vertical ordering)
# - Feature values (color: red = high, blue = low)
# - Impact on prediction (horizontal position)

6. Stakeholder Communication Strategy:
| Audience | Approach | Tools |
|---|---|---|
| Marketing Executives | Business outcomes, not technical details | Feature importance rankings, business narratives |
| Campaign Managers | Actionable insights | Individual prediction explanations, segment patterns |
| Data Team | Technical rigor | SHAP values, partial dependence, model diagnostics |
7. Address Limitations:
MODEL LIMITATIONS & TRANSPARENCY:
1. Correlation ≠ Causation
- Model shows email engagement predicts conversion
- Doesn't prove emails CAUSE conversions
- Could be reverse causality (engaged users open emails)
2. Predictions are Probabilistic
- 68% conversion probability ≠ certainty
- Confidence intervals: [55%, 78%]
3. Model Reflects Historical Patterns
- Trained on 2023-2024 data
- May not capture new trends or behaviors
- Requires monthly retraining
4. Potential Biases
- If training data over-represents certain demographics
- Model may perform worse on underrepresented groups
- Fairness monitoring in production

Expected Outcome: Demonstrate ability to make complex models transparent, communicate technical concepts to non-technical stakeholders, and bridge ML accuracy with business trust and adoption.
Advanced Experimentation & Causal Inference
8. Incrementality Testing and Media Effectiveness Measurement
Level: Senior Data Scientist
Difficulty: Extreme
Source: GroupM Media Science, Advanced Analytics
Team: Media Effectiveness, Measurement Science
Interview Round: Strategic Case Study
Question: “A client wants to know if their $10M TV advertising spend is actually driving incremental sales. Design a study to measure TV’s incremental impact.”
Concise Answer:
Study Design: Randomized Geo-Experiment
1. Methodology Selection:
| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| Randomized Geo-Experiment | Gold standard, causal inference | Requires pausing TV in markets | BEST |
| Synthetic Control | No pause needed | Assumes good match exists | Alternative |
| Observational Regression | Cheapest, fastest | Confounding bias | Validation only |
2. Experimental Design:
import pandas as pd
import numpy as np
from scipy import stats
class GeoExperiment:
    def __init__(self, markets_df):
        self.markets = markets_df

    def select_matched_pairs(self, n_pairs=10):
        """Select similar market pairs for testing"""
        # Features for matching
        matching_features = [
            'population',
            'median_income',
            'historical_sales',
            'competitor_presence',
            'seasonality_index'
        ]
        # Standardize features
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        scaled_features = scaler.fit_transform(self.markets[matching_features])
        # Find similar pairs using distance metrics
        from sklearn.metrics.pairwise import euclidean_distances
        distances = euclidean_distances(scaled_features)
        pairs = []
        used_markets = set()
        for i in range(len(distances)):
            if i in used_markets:
                continue
            # Find the closest unused match
            distances_i = distances[i].copy()
            distances_i[i] = np.inf
            distances_i[list(used_markets)] = np.inf
            closest = np.argmin(distances_i)
            if len(pairs) < n_pairs:
                pairs.append((self.markets.iloc[i]['market_id'],
                              self.markets.iloc[closest]['market_id']))
                used_markets.add(i)
                used_markets.add(closest)
        return pairs

    def randomize_treatment(self, pairs):
        """Randomly assign treatment/control within pairs"""
        np.random.seed(42)
        assignments = []
        for market_a, market_b in pairs:
            if np.random.rand() > 0.5:
                assignments.append({'market': market_a, 'group': 'treatment'})
                assignments.append({'market': market_b, 'group': 'control'})
            else:
                assignments.append({'market': market_a, 'group': 'control'})
                assignments.append({'market': market_b, 'group': 'treatment'})
        return pd.DataFrame(assignments)

# Example usage
experiment = GeoExperiment(market_data)
matched_pairs = experiment.select_matched_pairs(n_pairs=12)
treatment_assignment = experiment.randomize_treatment(matched_pairs)

3. Power Analysis (Sample Size Calculation):
from statsmodels.stats.power import TTestPower

def calculate_power(baseline_sales, expected_lift, n_pairs):
    """Calculate statistical power to detect lift"""
    # Assumptions
    baseline_mean = baseline_sales  # e.g., $1M per market
    lift_pct = expected_lift        # e.g., 8% lift expected
    cv = 0.20                       # Coefficient of variation (20%)
    # Effect size (difference in means / pooled std dev)
    lift_amount = baseline_mean * lift_pct
    std_dev = baseline_mean * cv
    effect_size = lift_amount / (std_dev * np.sqrt(2))
    # Power calculation (paired t-test via statsmodels; scipy has no power functions)
    alpha = 0.05
    power = TTestPower().power(effect_size=effect_size, nobs=n_pairs, alpha=alpha)
    return {
        'n_pairs': n_pairs,
        'expected_lift': f"{lift_pct:.1%}",
        'power': f"{power:.1%}",
        'sufficient': power >= 0.80
    }

# Check if we have enough markets
result = calculate_power(baseline_sales=1_000_000, expected_lift=0.08, n_pairs=12)
print(f"With {result['n_pairs']} market pairs, power to detect {result['expected_lift']} lift: {result['power']}")
print(f"Sufficient (>80%): {result['sufficient']}")

4. Implementation:
EXPERIMENTAL SETUP:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Duration: 8 weeks (2 months)
Markets: 12 pairs (24 total DMAs)
Treatment: Pause TV advertising
Control: Continue normal TV spend
Cost: ~$1.7M in forgone TV spend
PRE-PERIOD (4 weeks before):
→ Validate baseline sales similarity
→ Confirm no pre-existing trends
TEST PERIOD (8 weeks):
→ Treatment markets: $0 TV spend
→ Control markets: Normal TV spend
→ Monitor daily sales
POST-ANALYSIS:
→ Compare sales differences
→ Calculate incremental ROAS

5. Analysis:
def analyze_geo_experiment(results_df):
    """Difference-in-differences analysis"""
    # Calculate change from baseline
    treatment_markets = results_df[results_df['group'] == 'treatment']
    control_markets = results_df[results_df['group'] == 'control']
    treatment_change = treatment_markets['test_sales'].mean() - treatment_markets['baseline_sales'].mean()
    control_change = control_markets['test_sales'].mean() - control_markets['baseline_sales'].mean()
    # Incremental impact (DiD estimator)
    incremental_effect = treatment_change - control_change
    # Statistical significance (paired t-test on within-market changes)
    t_stat, p_value = stats.ttest_rel(
        treatment_markets['test_sales'] - treatment_markets['baseline_sales'],
        control_markets['test_sales'] - control_markets['baseline_sales']
    )
    # Calculate incremental ROAS
    tv_spend_treatment = treatment_markets['tv_spend'].sum()
    total_incremental_sales = incremental_effect * len(treatment_markets) * 8  # 8-week test period
    incremental_roas = total_incremental_sales / tv_spend_treatment if tv_spend_treatment > 0 else 0
    return {
        'incremental_sales_per_market': incremental_effect,
        'total_incremental_sales': total_incremental_sales,
        'incremental_roas': incremental_roas,
        'p_value': p_value,
        'statistically_significant': p_value < 0.05,
        'confidence_interval': stats.t.interval(
            0.95, df=len(treatment_markets) - 1,
            loc=incremental_effect,
            scale=stats.sem(treatment_markets['test_sales'] - treatment_markets['baseline_sales'])
        )
    }

results = analyze_geo_experiment(experiment_results)
print(f"\n=== INCREMENTALITY TEST RESULTS ===")
print(f"Incremental sales per market: ${results['incremental_sales_per_market']:,.0f}")
print(f"Total incremental sales (8 weeks): ${results['total_incremental_sales']:,.0f}")
print(f"Incremental ROAS: {results['incremental_roas']:.2f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Statistically significant: {results['statistically_significant']}")
print(f"95% CI: ${results['confidence_interval'][0]:,.0f} to ${results['confidence_interval'][1]:,.0f}")

6. Alternative: Synthetic Control (No Pause Required):
from sklearn.linear_model import Ridge
def synthetic_control_method(treatment_market, control_markets, pre_period, test_period):
    """Create a synthetic control market as a weighted average of controls.

    Assumes panel data with one row per (market, date); control markets
    are pivoted so each market becomes a feature column."""
    # Pivot controls to wide format: rows = time periods, columns = markets
    controls_wide = control_markets.pivot(index='date', columns='market_id', values='sales')
    pre_dates = treatment_market[treatment_market['period'] == 'pre']['date']
    test_dates = treatment_market[treatment_market['period'] == 'test']['date']
    # Fit model on the pre-period to find weights
    X_pre = controls_wide.loc[pre_dates].values
    y_pre = treatment_market[treatment_market['period'] == 'pre']['sales'].values
    # Ridge regression to find optimal weights (constrained to be positive)
    model = Ridge(alpha=1.0, positive=True)
    model.fit(X_pre, y_pre)
    weights = model.coef_
    # Predict the counterfactual in the test period
    X_test = controls_wide.loc[test_dates].values
    synthetic_sales = model.predict(X_test)
    # Actual sales in the treatment market
    actual_sales = treatment_market[treatment_market['period'] == 'test']['sales'].values
    # Incremental impact
    incremental = actual_sales - synthetic_sales
    return {
        'actual_sales': actual_sales,
        'synthetic_sales': synthetic_sales,
        'incremental_impact': incremental,
        'control_weights': dict(zip(controls_wide.columns, weights))
    }

7. Business Recommendation:
RECOMMENDATION FOR CLIENT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FINDINGS:
• TV advertising drives $4.2M incremental sales per year
• Incremental ROAS: 2.4:1
• Statistical confidence: 95%
TV EFFECTIVENESS:
✓ TV is working (positive incremental impact)
✗ But not optimally (ROAS <3:1 benchmark for CPG)
RECOMMENDATIONS:
1. Continue TV advertising (it's incremental)
2. Optimize media mix:
- Reduce TV spend by 20% ($2M)
- Reallocate to higher-ROAS digital channels
- Expected net benefit: +$1.5M revenue
3. Test creative variations to improve TV efficiency
4. Run quarterly incrementality tests for ongoing monitoring

Expected Outcome: Demonstrate sophisticated causal inference methodology, ability to design rigorous experiments, and translate statistical findings into business recommendations with quantified confidence.
NLP & Text Analytics
9. Natural Language Processing for Customer Insights
Level: Data Scientist
Difficulty: Hard
Source: Choreograph Text Analytics, Consumer Insights
Team: Voice of Customer, Social Listening
Interview Round: NLP Technical
Question: “You have 100,000 customer reviews for a CPG brand. How would you analyze this text data to extract insights about product perception, sentiment trends, and emerging issues?”
Concise Answer:
NLP Analysis Pipeline:
1. Text Preprocessing:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
def preprocess_text(text):
    """Clean and normalize text"""
    # Lowercase
    text = text.lower()
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    # Remove special characters, keeping punctuation useful for sentiment
    text = re.sub(r'[^a-zA-Z\s!?.]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)

# Apply preprocessing
reviews['clean_text'] = reviews['review_text'].apply(preprocess_text)

2. Sentiment Analysis:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline
# Method 1: VADER (rule-based, fast)
analyzer = SentimentIntensityAnalyzer()

def get_sentiment_vader(text):
    scores = analyzer.polarity_scores(text)
    # Classify based on compound score
    if scores['compound'] >= 0.05:
        return 'positive', scores['compound']
    elif scores['compound'] <= -0.05:
        return 'negative', scores['compound']
    else:
        return 'neutral', scores['compound']

reviews[['sentiment', 'sentiment_score']] = reviews['review_text'].apply(
    lambda x: pd.Series(get_sentiment_vader(x))
)

# Method 2: Transformer model (more accurate, slower)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def get_sentiment_bert(text):
    result = sentiment_pipeline(text[:512])[0]  # Truncate long reviews to the model's limit
    return result['label'], result['score']

# Use for a sample or for critical reviews only
# reviews[['sentiment_bert', 'confidence']] = reviews['review_text'].apply(...)
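Scoring all 100,000 reviews one at a time with the transformer is slow; a common compromise, sketched here under the assumption that the pipeline above is already loaded, is to batch-score a random sample (the sample size is an assumption):

# Sketch: score a 5,000-review sample in batches
sample = reviews.sample(n=5000, random_state=42).copy()
texts = [t[:512] for t in sample['review_text'].tolist()]
bert_results = sentiment_pipeline(texts, batch_size=32, truncation=True)
sample['sentiment_bert'] = [r['label'] for r in bert_results]
sample['confidence'] = [r['score'] for r in bert_results]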
3. Topic Modeling (What are customers talking about?):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Create document-term matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(reviews['clean_text'])

# LDA topic modeling
n_topics = 8
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, max_iter=20)
lda.fit(doc_term_matrix)

# Display topics
def display_topics(model, feature_names, n_top_words=10):
    topics = {}
    for idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[-n_top_words:][::-1]]
        topics[f"Topic {idx}"] = top_words
    return topics

feature_names = vectorizer.get_feature_names_out()
topics = display_topics(lda, feature_names)

# Interpret topics
for topic_name, words in topics.items():
    print(f"{topic_name}: {', '.join(words)}")

# Example output:
# Topic 0: battery, charge, lasting, hours, life → Battery Performance
# Topic 1: flavor, taste, sweet, delicious, love → Taste/Flavor
# Topic 2: packaging, damaged, arrived, broken, box → Shipping/Packaging
# Topic 3: price, expensive, worth, value, cost → Pricing
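The choice of n_topics = 8 is a starting point rather than a given. One hedged way to compare candidate values is held-out perplexity (lower is better), reusing doc_term_matrix from above:

# Compare candidate topic counts by held-out perplexity (illustrative grid)
from sklearn.model_selection import train_test_split

dtm_train, dtm_test = train_test_split(doc_term_matrix, test_size=0.2, random_state=42)
for k in [4, 6, 8, 10, 12]:
    candidate = LatentDirichletAllocation(n_components=k, random_state=42, max_iter=20)
    candidate.fit(dtm_train)
    print(f"n_topics={k}: perplexity={candidate.perplexity(dtm_test):,.0f}")

In practice you would weigh perplexity against topic interpretability, since the two often disagree.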
4. Aspect-Based Sentiment:
def extract_aspects_and_sentiment(reviews_df):
    """Extract sentiment about specific product aspects."""
    aspects = {
        'quality': ['quality', 'durable', 'well-made', 'sturdy', 'cheap', 'broke'],
        'price': ['expensive', 'cheap', 'price', 'value', 'cost', 'affordable'],
        'taste': ['taste', 'flavor', 'delicious', 'bland', 'sweet', 'bitter'],
        'packaging': ['packaging', 'package', 'box', 'wrapped', 'damaged'],
        'battery': ['battery', 'charge', 'charging', 'power', 'lasting']
    }
    aspect_sentiments = {}
    for aspect, keywords in aspects.items():
        # Find reviews mentioning this aspect (simple keyword match)
        mask = reviews_df['clean_text'].str.contains('|'.join(keywords), case=False)
        aspect_reviews = reviews_df[mask]
        if len(aspect_reviews) > 0:
            avg_sentiment = aspect_reviews['sentiment_score'].mean()
            mention_count = len(aspect_reviews)
            positive_pct = (aspect_reviews['sentiment'] == 'positive').mean() * 100
            aspect_sentiments[aspect] = {
                'mentions': mention_count,
                'avg_sentiment': avg_sentiment,
                'positive_pct': positive_pct
            }
    return pd.DataFrame(aspect_sentiments).T

aspect_analysis = extract_aspects_and_sentiment(reviews)
print("\n=== ASPECT-BASED SENTIMENT ===")
print(aspect_analysis.sort_values('avg_sentiment'))

5. Trend Analysis (Time Series of Sentiment):
import matplotlib.pyplot as plt
# Aggregate sentiment by month
reviews['month'] = pd.to_datetime(reviews['date']).dt.to_period('M')
sentiment_over_time = reviews.groupby('month').agg({
    'sentiment_score': 'mean',
    'review_id': 'count'
}).rename(columns={'review_id': 'review_count'})

# Plot trend: sentiment line, with review volume as bars on a second axis
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.plot(sentiment_over_time.index.astype(str), sentiment_over_time['sentiment_score'],
         color='blue', marker='o')
ax1.set_xlabel('Month')
ax1.set_ylabel('Avg Sentiment Score', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.tick_params(axis='x', rotation=45)
ax1.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Neutral')

ax2 = ax1.twinx()
ax2.bar(sentiment_over_time.index.astype(str), sentiment_over_time['review_count'],
        alpha=0.3, color='gray')
ax2.set_ylabel('Review Volume', color='gray')

plt.title('Customer Sentiment Trend Over Time')
plt.tight_layout()
plt.show()

6. Key Insights Extraction:
# Identify emerging issues (negative sentiment spikes)
def detect_sentiment_anomalies(sentiment_ts):
    """Flag periods with unusually negative sentiment (below mean - 2 std)."""
    mean_sentiment = sentiment_ts['sentiment_score'].mean()
    std_sentiment = sentiment_ts['sentiment_score'].std()
    anomalies = sentiment_ts[sentiment_ts['sentiment_score'] < mean_sentiment - 2 * std_sentiment]
    return anomalies

anomalies = detect_sentiment_anomalies(sentiment_over_time)
print("\n=== ALERT: NEGATIVE SENTIMENT SPIKES ===")
print(anomalies)

# Extract representative reviews for each topic
def get_representative_reviews(reviews_df, topic_id, n=3):
    """Get the reviews most strongly associated with a topic."""
    # Get topic distribution for each review (uses lda and doc_term_matrix from above)
    topic_dist = lda.transform(doc_term_matrix)
    reviews_df['topic_strength'] = topic_dist[:, topic_id]
    # Get the top-n reviews for this topic
    top_reviews = reviews_df.nlargest(n, 'topic_strength')[['review_text', 'sentiment', 'topic_strength']]
    return top_reviews

# Example: get reviews about battery issues
battery_reviews = get_representative_reviews(reviews, topic_id=0, n=5)
print("\n=== EXAMPLE BATTERY REVIEWS ===")
print(battery_reviews)

7. Business Presentation:
CUSTOMER REVIEW ANALYSIS: 100,000 REVIEWS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OVERALL SENTIMENT:
• Positive: 62%
• Neutral: 23%
• Negative: 15%
• Average Score: +0.32 (moderately positive)
TOP MENTIONED ASPECTS:
1. Taste/Flavor (42% of reviews) → 71% positive
2. Packaging (28% of reviews) → 45% positive ⚠️
3. Price/Value (24% of reviews) → 58% positive
4. Quality (18% of reviews) → 68% positive
5. Battery Life (12% of reviews) → 38% positive ⚠️
EMERGING ISSUES:
⚠️ Battery complaints UP 150% since July 2024
⚠️ Packaging damage mentions increased 35%
✓ Taste satisfaction stable and positive
ACTIONABLE RECOMMENDATIONS:
1. URGENT: Investigate battery supplier (defect rate spike)
2. Packaging: Switch to reinforced boxes to reduce damage
3. Continue current flavor formulation (strong positive signal)
4. Price perception: Communicate value proposition better
COMPETITOR COMPARISON:
• Our sentiment (+0.32) vs. Competitor A (+0.18)
• Battery issues: We have 2x the mention rate vs. competitors
Expected Outcome: Demonstrate NLP technical skills, business-insight extraction from unstructured data, and the ability to translate text analytics into actionable product and marketing recommendations.
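As an aside, a figure like "battery complaints up 150% since July 2024" can be derived from the monthly aspect mentions computed earlier; a minimal sketch, with the comparison windows as assumptions:

# Growth in battery mentions, pre- vs. post-July 2024 (windows are assumptions)
battery_mask = reviews['clean_text'].str.contains('battery|charge|charging', case=False)
monthly_mentions = reviews[battery_mask].groupby('month').size()

baseline = monthly_mentions.loc[:'2024-06'].mean()  # average monthly mentions before July
recent = monthly_mentions.loc['2024-07':].mean()    # average monthly mentions from July on
growth_pct = (recent - baseline) / baseline * 100
print(f"Battery mention growth: {growth_pct:+.0f}%")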
Behavioral & Communication
10. Communicating Data Insights to Non-Technical Stakeholders
Level: All levels
Difficulty: Moderate to Hard
Source: Common WPP Interview Theme
Team: All Data Teams
Interview Round: Behavioral Assessment
Question: “Tell me about a time when you had to present complex analytical findings to senior marketing leaders who had limited technical background. How did you ensure they understood and acted on your recommendations?”
Concise Answer (STAR Method):
Situation:
“In my previous role as a Data Scientist at [Company], I led a multi-touch attribution analysis for a major e-commerce client. The analysis revealed that our attribution model was significantly over-crediting last-click (paid search) and under-valuing earlier touchpoints like display and social media. This had major implications—we were under-investing in awareness channels by approximately $5M annually, leaving an estimated $8M in potential conversions unrealized.
I needed to present these findings to the CMO and VP of Marketing, both with limited technical backgrounds, to secure approval for a substantial budget reallocation across channels.”
Task:
“My objectives were to:
1. Explain complex Shapley value attribution methodology in accessible terms
2. Build sufficient trust in the analysis to justify $5M budget shift
3. Address anticipated concerns about reducing search spend
4. Get approval for phased implementation plan”
Action:
“I structured my approach around six key actions:
1. Lead with Business Impact, Not Methodology:
- Opened presentation with: ‘We’re currently under-investing in awareness channels by $5M, which is costing us $8M in potential revenue annually’
- Showed customer journey map illustrating typical 5-6 touch path before conversion
- Made the ‘last-click problem’ visceral: ‘Imagine crediting only the cashier for a sale, ignoring the sales floor associate who helped the customer for 30 minutes’
2. Translated Technical Concepts:
- Avoided jargon like ‘Shapley values’ and ‘marginal contributions’
- Instead said: ‘We calculate what each channel’s unique contribution is by comparing what happens with and without it—like testing which ingredient truly makes the recipe work’
- Used sports analogy: ‘You can’t just credit the player who scores the goal. We need to value the defender who started the play’
3. Visualization Over Tables:
- Created customer journey maps showing typical paths (not statistical charts)
- Used before/after budget allocation visuals with expected impact
- Built interactive dashboard they could explore themselves
4. Addressed Concerns Proactively:
- Anticipated CMO’s worry: ‘Are you saying search isn’t valuable?’
- Prepared response: ‘Search is critical for capturing existing demand at the bottom of the funnel. But we need display and social to CREATE that demand in the first place. They work together.’
- Cross-referenced with brand lift studies showing display did increase awareness
- Showed competitor strategies proving our approach was under-market
5. Made Recommendations Concrete and Phased:
- Instead of abstract ‘reallocate budget,’ provided specific channel shifts:
- Reduce search by $3M (from $15M to $12M)
- Increase display by $2M
- Increase social by $1M
- Proposed 3-month pilot in one product category before full rollout
- Created monthly dashboard to track impact and course-correct
- Gave explicit success criteria: ‘If we don’t see 10%+ lift in total conversions by month 3, we revert changes’
6. Stakeholder Involvement (the one gap in my approach):
- In retrospect, I should have shared preliminary findings earlier to build progressive buy-in
- Instead of revealing everything in the final presentation, I could have held a mid-analysis checkpoint”
Result:
“Immediate outcomes:
- CMO approved recommendation in the meeting
- Secured $5M budget reallocation with 3-month pilot
- VP of Marketing became champion of the analysis
6-month impact:
- Total conversions increased 22% (vs. 10% target)
- Cost per acquisition decreased 15%
- Overall ROAS improved from 3.2:1 to 4.1:1
- Analysis saved client an estimated $8M in lost revenue
Recognition:
- CMO cited this project as ‘most impactful data science initiative of the year’ in quarterly all-hands
- I was promoted to Senior Data Scientist based partly on this work
- Built internal dashboard now used across all client accounts
Lessons learned:
1. Start with ‘so what’ not ‘how’: Business impact first, methodology second
2. Analogies > equations: Sports, cooking, everyday experiences make concepts accessible
3. Progressive involvement: Don’t surprise stakeholders—build consensus incrementally
4. Safety nets reduce resistance: Phased approach with exit criteria lowers perceived risk
5. Visuals tell stories: One customer journey map is worth ten statistical tables”
What I’d do differently:
“I would have:
- Involved stakeholders earlier in analysis (not just final presentation)
- Had them articulate concerns upfront so I could address preemptively
- Created simpler 1-pager executive summary for busy C-suite
- Practiced presentation with non-technical colleague for feedback
This experience taught me that technical excellence means nothing without effective communication. In data science, influence and storytelling are as important as statistical rigor.”
Expected Outcome: Demonstrate communication skills, stakeholder management, business acumen, and self-awareness through structured storytelling with quantified impact.
End of WPP Data Analyst and Data Scientist Interview Guide
This comprehensive guide covers essential technical skills (SQL, Python, ML), statistical rigor (A/B testing, attribution, incrementality), and business communication abilities required for WPP data roles across Choreograph, GroupM, and agency-embedded analytics teams.