WPP Software Engineer


System Design & Scalability

1. Real-Time Campaign Management System Architecture

Level: Senior Software Engineer

Difficulty: Hard

Source: Extrapolated from GroupM/Choreograph Technology Stack

Team: Platform Engineering, Ad Tech

Interview Round: System Design

Question: “Design a real-time campaign management system handling 100,000 requests/second during product launches. Track impressions, clicks, conversions across display, social, and search with near-real-time reporting.”

Concise Answer:

Architecture Overview:

┌─────────────┐    ┌──────────────┐    ┌─────────────────┐
│ Load        │ -> │ API Gateway  │ -> │ Event Ingestion │
│ Balancer    │    │ (Rate Limit) │    │ (Kafka/Kinesis) │
└─────────────┘    └──────────────┘    └─────────────────┘
                                              │
                    ┌─────────────────────────┼─────────────────────┐
                    ▼                         ▼                     ▼
            ┌───────────────┐       ┌─────────────────┐    ┌──────────┐
            │ Stream Proc   │       │ Redis           │    │ S3/Data  │
            │ (Flink/Spark) │ ----> │ (Real-time)     │    │ Lake     │
            └───────────────┘       └─────────────────┘    └──────────┘
                    │                         │
                    ▼                         ▼
            ┌───────────────┐       ┌─────────────────┐
            │ Cassandra/    │       │ PostgreSQL      │
            │ DynamoDB      │ <---- │ (Aggregated)    │
            │ (Events)      │       │                 │
            └───────────────┘       └─────────────────┘
                                            │
                                            ▼
                                    ┌─────────────────┐
                                    │ Reporting API   │
                                    │ + Dashboard     │
                                    └─────────────────┘

Technology Stack:

Ingestion Layer:

// API Gateway with rate limiting
const rateLimit = require('express-rate-limit');

const apiLimiter = rateLimit({
  windowMs: 1000,        // 1 second
  max: 100000,           // 100K requests per second
  standardHeaders: true
});

app.post('/api/events', apiLimiter, async (req, res) => {
  const event = req.body;

  // Async publish to Kafka (don't block the response)
  kafkaProducer.send({
    topic: 'campaign-events',
    messages: [{
      key: event.campaign_id,
      value: JSON.stringify(event)
    }]
  }).catch(err => logger.error('Kafka publish failed', err));

  // Immediate 202 response
  res.status(202).send();
});

Stream Processing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, count, sum, when, from_json

spark = SparkSession.builder.appName("CampaignMetrics").getOrCreate()

# Read from Kafka
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "campaign-events") \
    .load()

# Real-time aggregation (5-second windows)
metrics = events \
    .selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), event_schema).alias("data")) \
    .groupBy(
        window(col("data.timestamp"), "5 seconds"),
        col("data.campaign_id")
    ) \
    .agg(
        count("*").alias("impressions"),
        sum("data.cost").alias("spend"),
        count(when(col("data.event_type") == "click", 1)).alias("clicks")
    )

# Write to Redis for real-time dashboard
metrics.writeStream \
    .foreach(RedisWriter()) \
    .start()

Data Storage Strategy:

-- Hot Path: Redis (sub-second access)
Key: campaign:{campaign_id}:realtime
Value: {
  "impressions": 15234,
  "clicks": 456,
  "spend": 1234.56,
  "updated_at": "2025-01-15T10:30:00Z"
}
TTL: 3600 seconds

-- Warm Path: Cassandra (recent events, high write throughput)
CREATE TABLE campaign_events (
    campaign_id UUID,
    event_date DATE,
    event_id TIMEUUID,
    event_type TEXT,
    user_id UUID,
    cost DECIMAL,
    PRIMARY KEY ((campaign_id, event_date), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

-- Cold Path: S3/Parquet (historical analysis)
s3://campaign-data-lake/year=2025/month=01/day=15/events.parquet
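
The Spark job above leaves RedisWriter abstract. As a rough illustration of the hot-path update it would perform against the key layout just shown, here is a minimal Node.js sketch (assuming the ioredis client, and modelling the value as a Redis hash rather than a JSON string so counters can be incremented atomically):

// Hot-path update (sketch): atomically bump per-campaign counters and keep the 1-hour TTL
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

async function updateRealtimeMetrics(event) {
  const key = `campaign:${event.campaign_id}:realtime`;
  const pipeline = redis.pipeline();

  pipeline.hincrby(key, 'impressions', 1);
  if (event.event_type === 'click') pipeline.hincrby(key, 'clicks', 1);
  pipeline.hincrbyfloat(key, 'spend', event.cost || 0);
  pipeline.hset(key, 'updated_at', new Date().toISOString());
  pipeline.expire(key, 3600);  // matches the TTL in the storage strategy above

  await pipeline.exec();
}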

Scalability Approach:

  1. Horizontal Scaling: Stateless API services behind ALB, auto-scale based on CPU/requests
  2. Partitioning: Kafka partitions by campaign_id for parallel processing (see the consumer sketch below)
  3. Caching: Redis for real-time counters (5-second refresh)
  4. Database Sharding: Cassandra sharded by campaign_id
  5. Asynchronous Processing: Non-critical aggregations in batch jobs
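
To make the partitioning point concrete (item 2 above), a minimal kafkajs consumer sketch: because producers key messages by campaign_id, every event for a given campaign lands on the same partition, and consumers in one group each own a disjoint set of partitions. The client config and the updateRealtimeMetrics call are assumptions carried over from the sketches above.

const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'metrics-worker', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'campaign-metrics' });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'campaign-events', fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      const event = JSON.parse(message.value.toString());
      // All events for this campaign_id arrive on this partition, in order
      await updateRealtimeMetrics(event);
    },
  });
}

run().catch(console.error);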

Performance Optimizations:

// Batch writes to reduce database pressure
class EventBatcher {
  constructor(maxSize = 1000, maxWait = 1000) {
    this.batch = [];
    this.maxSize = maxSize;
    this.maxWait = maxWait;
    this.timer = null;
  }

  add(event) {
    this.batch.push(event);
    if (this.batch.length >= this.maxSize) {
      this.flush();
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.maxWait);
    }
  }

  async flush() {
    if (this.batch.length === 0) return;
    const toWrite = this.batch;
    this.batch = [];
    clearTimeout(this.timer);
    this.timer = null;
    await cassandra.batchWrite(toWrite);
  }
}

Trade-offs Addressed:

| Consideration | Choice | Rationale |
| --- | --- | --- |
| Consistency | Eventual consistency | Real-time analytics can tolerate a 5-10s delay |
| Storage | Tiered (Redis/Cassandra/S3) | Balance cost and performance |
| Processing | Stream + batch hybrid | Real-time for dashboard, batch for complex reports |
| Database | Cassandra for events, PostgreSQL for aggregates | Write-heavy vs. read-heavy optimization |

Monitoring:

// Distributed tracing
const { trace, SpanStatusCode } = require('@opentelemetry/api');

app.post('/api/events', async (req, res) => {
  const span = trace.getTracer('api').startSpan('process_event');
  try {
    await processEvent(req.body);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
  } finally {
    span.end();
  }
});

// Metrics
metrics.histogram('api.latency', Date.now() - startTime);
metrics.increment('events.processed');

Expected Outcomes:
- Throughput: 100K+ requests/second with auto-scaling
- Latency: <100ms API response, <5s dashboard updates
- Availability: 99.95% uptime with multi-AZ deployment
- Cost: ~$15K/month at scale (reserved instances + spot for batch)


Algorithms & Data Structures

2. Ad Frequency Capping at Scale

Level: Mid-Senior Software Engineer

Difficulty: Moderate-Hard

Source: Digital Advertising Best Practices

Team: Ad Tech, Platform Engineering

Interview Round: Technical Coding

Question: “Implement frequency capping ensuring users see the same ad ≤N times/day across devices. Handle 50M active users. Optimize for speed and memory.”

Concise Answer:

Approach 1: Exact Counting (Small Scale)

from collections import defaultdict
from datetime import datetime, timedelta

class FrequencyCapper:
    def __init__(self, max_impressions_per_day):
        self.max_impressions = max_impressions_per_day
        self.user_impressions = defaultdict(list)  # {user_ad: [timestamps]}

    def can_show_ad(self, user_id, ad_id):
        key = f"{user_id}:{ad_id}"
        now = datetime.now()
        cutoff = now - timedelta(days=1)
        # Remove old impressions
        self.user_impressions[key] = [
            ts for ts in self.user_impressions[key] if ts > cutoff
        ]
        return len(self.user_impressions[key]) < self.max_impressions

    def record_impression(self, user_id, ad_id):
        key = f"{user_id}:{ad_id}"
        self.user_impressions[key].append(datetime.now())

# Complexity: O(k) time, O(n*k) space where k = impressions/user
# Problem: ~10GB+ memory for 50M users

Approach 2: Probabilistic (Production Scale)

import mmh3
import numpy as np
from datetime import datetime

class ScalableFrequencyCapper:
    """Count-Min Sketch for memory-efficient counting"""

    def __init__(self, max_impressions, width=100000, depth=5):
        self.max_impressions = max_impressions
        self.width = width
        self.depth = depth
        self.counts = np.zeros((depth, width), dtype=np.int16)
        self.last_reset = datetime.now().date()

    def _check_daily_reset(self):
        today = datetime.now().date()
        if today > self.last_reset:
            self.counts.fill(0)
            self.last_reset = today

    def _hash(self, user_id, ad_id, seed):
        key = f"{user_id}:{ad_id}"
        return mmh3.hash(key, seed) % self.width

    def can_show_ad(self, user_id, ad_id):
        self._check_daily_reset()
        # Get minimum count across hash functions
        min_count = min(
            self.counts[i][self._hash(user_id, ad_id, i)]
            for i in range(self.depth)
        )
        return min_count < self.max_impressions

    def record_impression(self, user_id, ad_id):
        for i in range(self.depth):
            idx = self._hash(user_id, ad_id, i)
            self.counts[i][idx] += 1

# Complexity: O(d) time where d=depth (constant ~5)
# Space: O(width * depth) = ~1MB for 100K width, 5 depth
# Handles 50M+ users with minimal memory

Approach 3: Distributed (Redis)

import redis
from datetime import datetime, timedelta

class DistributedFrequencyCapper:
    def __init__(self, redis_client, max_impressions):
        self.redis = redis_client
        self.max_impressions = max_impressions

    def can_show_ad(self, user_id, ad_id):
        key = f"freq:{user_id}:{ad_id}"
        count = self.redis.get(key)
        if count is None:
            return True
        return int(count) < self.max_impressions

    def record_impression(self, user_id, ad_id):
        key = f"freq:{user_id}:{ad_id}"
        pipe = self.redis.pipeline()
        # Atomic increment
        pipe.incr(key)
        # Set expiry to the end of the day
        seconds_until_midnight = (
            datetime.combine(datetime.now().date() + timedelta(days=1),
                             datetime.min.time()) - datetime.now()
        ).seconds
        pipe.expire(key, seconds_until_midnight)
        pipe.execute()

# Scales horizontally with Redis Cluster
# Trade-off: Network latency vs. memory efficiency

Comparison:

| Approach | Memory (50M users) | Latency | Accuracy | Best For |
| --- | --- | --- | --- | --- |
| Exact | ~10GB | O(k) | 100% | Small scale |
| Count-Min Sketch | ~1MB | O(1) | 95-99% | Tight memory constraints |
| Redis | Distributed | <1ms | 100% | Production (horizontal scale) |

Production Implementation:

class HybridFrequencyCapper:
    """Combine local cache + distributed store"""

    def __init__(self, redis_client, max_impressions):
        self.redis = redis_client
        self.local_cache = {}  # simple in-process cache
        self.cache_size = 10000
        self.max_impressions = max_impressions

    def can_show_ad(self, user_id, ad_id):
        key = f"{user_id}:{ad_id}"
        # Check local cache first
        if key in self.local_cache:
            return self.local_cache[key] < self.max_impressions
        # Fall back to Redis
        count = self.redis.get(f"freq:{key}")
        count = int(count) if count else 0
        # Update local cache
        if len(self.local_cache) >= self.cache_size:
            # Evict the oldest inserted entry (use a real LRU in production)
            self.local_cache.pop(next(iter(self.local_cache)))
        self.local_cache[key] = count
        return count < self.max_impressions

    def record_impression(self, user_id, ad_id):
        key = f"{user_id}:{ad_id}"
        # Update both cache and Redis
        self.local_cache[key] = self.local_cache.get(key, 0) + 1
        self.redis.incr(f"freq:{key}")

# Reduces Redis calls by 80%+ with local caching

Cross-Device Tracking:

# Probabilistic user matching
def get_unified_user_id(user_identifiers):
    """
    Combine device IDs, cookie IDs, and login IDs.
    """
    deterministic_ids = [
        id for id in user_identifiers
        if id['type'] in ['email_hash', 'login_id']
    ]
    if deterministic_ids:
        return deterministic_ids[0]['value']
    # Probabilistic matching via device graph
    return device_graph.match(user_identifiers)

Expected Outcomes:
- Memory: <5MB for Count-Min Sketch, distributed with Redis
- Latency: <1ms lookup, <2ms update
- Accuracy: 99%+ (over-capping acceptable, under-capping not)
- Scalability: Linear with Redis Cluster sharding


API Design & Backend Development

3. Campaign Management RESTful API

Level: Mid-Senior Software Engineer

Difficulty: Moderate

Source: Marketing Platform Best Practices

Team: Platform Engineering, Backend

Interview Round: Technical Design

Question: “Design a RESTful API for campaign management supporting CRUD operations, asset uploads, targeting, scheduling, and performance reports. Define endpoints, auth, rate limiting, and versioning.”

Concise Answer:

Core Endpoints:

Authentication:
POST   /api/v1/auth/login
POST   /api/v1/auth/refresh
POST   /api/v1/auth/logout

Campaigns:
GET    /api/v1/campaigns                    # List (paginated, filtered)
POST   /api/v1/campaigns                    # Create
GET    /api/v1/campaigns/{id}               # Get details
PUT    /api/v1/campaigns/{id}               # Full update
PATCH  /api/v1/campaigns/{id}               # Partial update
DELETE /api/v1/campaigns/{id}               # Delete
PATCH  /api/v1/campaigns/{id}/status        # Activate/pause

Assets:
POST   /api/v1/campaigns/{id}/assets        # Upload (multipart/form-data; see the handler sketch below)
GET    /api/v1/campaigns/{id}/assets
DELETE /api/v1/campaigns/{id}/assets/{assetId}

Targeting:
PUT    /api/v1/campaigns/{id}/targeting
GET    /api/v1/campaigns/{id}/targeting

Reports:
GET    /api/v1/campaigns/{id}/reports?start_date=...&end_date=...
GET    /api/v1/campaigns/{id}/reports/export?format=csv
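
The asset upload endpoint accepts multipart/form-data; a minimal handler sketch, assuming Express with multer and placeholder helpers (uploadToS3, an Asset model) rather than the platform's real storage layer:

// Multipart upload handler (sketch); multer buffers the file in memory
const multer = require('multer');
const upload = multer({
  storage: multer.memoryStorage(),
  limits: { fileSize: 50 * 1024 * 1024 }  // 50MB cap per asset
});

app.post('/api/v1/campaigns/:id/assets', authenticate, upload.single('file'), async (req, res) => {
  if (!req.file) {
    return res.status(400).json({ error: { code: 'VALIDATION_ERROR', message: 'file is required' } });
  }
  const url = await uploadToS3(req.file.buffer, req.file.originalname);  // placeholder helper
  const asset = await Asset.create({
    campaign_id: req.params.id,
    tenant_id: req.user.tenant_id,
    name: req.file.originalname,
    content_type: req.file.mimetype,
    url
  });
  res.status(201).json(asset);
});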

Request/Response Format:

// POST /api/v1/campaigns
{
  "name": "Summer Sale 2025",
  "brand_id": "brand-123",
  "budget": {
    "total": 50000,
    "currency": "USD",
    "daily_cap": 2000
  },
  "schedule": {
    "start_date": "2025-06-01T00:00:00Z",
    "end_date": "2025-08-31T23:59:59Z"
  },
  "objectives": ["awareness", "conversions"]
}

// Response: 201 Created
{
  "id": "campaign-789",
  "name": "Summer Sale 2025",
  "status": "draft",
  "created_at": "2025-05-15T10:30:00Z",
  "created_by": "user-456",
  "budget": { ... },
  "_links": {
    "self": "/api/v1/campaigns/campaign-789",
    "assets": "/api/v1/campaigns/campaign-789/assets",
    "reports": "/api/v1/campaigns/campaign-789/reports"
  }
}

Pagination & Filtering:

// GET /api/v1/campaigns?page=2&limit=50&status=active&sort=-created_at
app.get('/api/v1/campaigns', authenticate, async (req, res) => {
  const {
    page = 1,
    limit = 50,
    status,
    brand_id,
    sort = '-created_at'
  } = req.query;

  const query = { tenant_id: req.user.tenant_id };
  if (status) query.status = status;
  if (brand_id) query.brand_id = brand_id;

  const sortField = sort.startsWith('-') ? sort.slice(1) : sort;
  const sortOrder = sort.startsWith('-') ? -1 : 1;

  const [campaigns, total] = await Promise.all([
    Campaign.find(query)
      .sort({ [sortField]: sortOrder })
      .skip((page - 1) * limit)
      .limit(limit)
      .lean(),
    Campaign.countDocuments(query)
  ]);

  res.json({
    data: campaigns,
    pagination: {
      page: parseInt(page),
      limit: parseInt(limit),
      total_pages: Math.ceil(total / limit),
      total_count: total
    },
    _links: {
      next: page * limit < total ? `/api/v1/campaigns?page=${parseInt(page)+1}&limit=${limit}` : null,
      prev: page > 1 ? `/api/v1/campaigns?page=${parseInt(page)-1}&limit=${limit}` : null
    }
  });
});

Authentication (JWT):

const jwt = require('jsonwebtoken');
const bcrypt = require('bcrypt');

// Login
app.post('/api/v1/auth/login', async (req, res) => {
  const { email, password } = req.body;
  const user = await User.findOne({ email });
  if (!user || !await bcrypt.compare(password, user.password_hash)) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }

  const accessToken = jwt.sign(
    { user_id: user.id, tenant_id: user.tenant_id, role: user.role },
    process.env.JWT_SECRET,
    { expiresIn: '15m' }
  );
  const refreshToken = jwt.sign(
    { user_id: user.id },
    process.env.JWT_REFRESH_SECRET,
    { expiresIn: '7d' }
  );

  await RefreshToken.create({ token: refreshToken, user_id: user.id });
  res.json({ access_token: accessToken, refresh_token: refreshToken });
});

// Middleware
function authenticate(req, res, next) {
  const token = req.headers.authorization?.replace('Bearer ', '');
  if (!token) return res.status(401).json({ error: 'No token' });
  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET);
    next();
  } catch (error) {
    res.status(401).json({ error: 'Invalid token' });
  }
}

Rate Limiting:

const rateLimit = require('express-rate-limit');

const apiLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,  // 15 minutes
  max: 1000,                 // 1000 requests per window
  standardHeaders: true,
  keyGenerator: (req) => req.user?.user_id || req.ip,
  handler: (req, res) => {
    res.status(429).json({
      error: 'Rate limit exceeded',
      retry_after: Math.ceil(req.rateLimit.resetTime / 1000)
    });
  }
});

app.use('/api', authenticate, apiLimiter);

Multi-Tenancy:

// Middleware for tenant isolation
function enforceTenancy(req, res, next) {
  req.tenantId = req.user.tenant_id;
  next();
}

// All queries auto-filter by tenant
async function getCampaign(req, res) {
  const campaign = await Campaign.findOne({
    _id: req.params.id,
    tenant_id: req.tenantId  // Automatic isolation
  });
  if (!campaign) {
    return res.status(404).json({ error: 'Campaign not found' });
  }
  res.json(campaign);
}

Error Handling:

// 400 Bad Request
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Request validation failed",
    "errors": [
      { "field": "budget.total", "message": "Must be positive number" },
      { "field": "schedule.start_date", "message": "Must be future date" }
    ]
  }
}

// Validation middleware
const { body, validationResult } = require('express-validator');

app.post('/api/v1/campaigns',
  authenticate,
  body('name').notEmpty().trim(),
  body('budget.total').isFloat({ gt: 0 }),
  body('schedule.start_date').isISO8601(),
  async (req, res) => {
    const errors = validationResult(req);
    if (!errors.isEmpty()) {
      return res.status(400).json({ error: { code: 'VALIDATION_ERROR', errors: errors.array() } });
    }
    // ... create campaign
  }
);

API Versioning:

// URL-based versioning
app.use('/api/v1', routerV1);
app.use('/api/v2', routerV2);

// Deprecation headers
app.use('/api/v1', (req, res, next) => {
  res.set('Sunset', 'Sat, 31 Dec 2025 23:59:59 GMT');
  res.set('Link', '</api/v2>; rel="successor-version"');
  next();
});

Expected Outcomes:
- Consistency: RESTful conventions, predictable responses
- Security: JWT auth, rate limiting, tenant isolation
- Performance: Pagination, caching headers (see the sketch below), efficient queries
- Developer Experience: Clear errors, HATEOAS links, OpenAPI docs
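
Caching headers are listed above but not shown elsewhere in this section; a minimal sketch for the reports endpoint, assuming report payloads are safe to cache briefly per user (buildReport is a placeholder helper):

// Short-lived caching for report reads (sketch): ETag lets clients revalidate cheaply,
// and a private Cache-Control keeps shared caches from serving cross-tenant data
const crypto = require('crypto');

app.get('/api/v1/campaigns/:id/reports', authenticate, async (req, res) => {
  const report = await buildReport(req.params.id, req.query);  // placeholder helper
  const etag = crypto.createHash('sha1').update(JSON.stringify(report)).digest('hex');

  if (req.headers['if-none-match'] === etag) {
    return res.status(304).end();
  }
  res.set('ETag', etag);
  res.set('Cache-Control', 'private, max-age=300');  // 5 minutes
  res.json(report);
});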


Frontend Development

4. Asset Management Dashboard Performance

Level: Frontend Engineer

Difficulty: Moderate-Hard

Source: Hogarth Digital Asset Management

Team: Creative Technology, Frontend

Interview Round: Technical Coding

Question: “Optimize a React dashboard displaying 10,000+ image thumbnails. Users experience slow load times. How do you fix performance?”

Concise Answer:

Problem Diagnosis:

  • Initial State: 10,000 DOM nodes, 5s load time, 500MB memory
  • Root Causes: Rendering all items, loading full images, no virtualization

Solution 1: Virtualization (react-window)

import { FixedSizeGrid } from 'react-window';
import AutoSizer from 'react-virtualized-auto-sizer';

function AssetGrid({ assets }) {
  const COLUMN_COUNT = 5;
  const COLUMN_WIDTH = 200;
  const ROW_HEIGHT = 200;

  const Cell = ({ columnIndex, rowIndex, style }) => {
    const index = rowIndex * COLUMN_COUNT + columnIndex;
    if (index >= assets.length) return null;
    return (
      <div style={style}>
        <AssetCard asset={assets[index]} />
      </div>
    );
  };

  return (
    <AutoSizer>
      {({ height, width }) => (
        <FixedSizeGrid
          columnCount={COLUMN_COUNT}
          columnWidth={COLUMN_WIDTH}
          height={height}
          rowCount={Math.ceil(assets.length / COLUMN_COUNT)}
          rowHeight={ROW_HEIGHT}
          width={width}
        >
          {Cell}
        </FixedSizeGrid>
      )}
    </AutoSizer>
  );
}

// Only renders ~30 visible items instead of 10,000

Solution 2: Image Optimization

function AssetCard({ asset }) {
  return (
    <img
      src={asset.thumbnail_url}  // Serve 150x150px, not the 4K original
      srcSet={`
        ${asset.thumbnail_small} 150w,
        ${asset.thumbnail_medium} 300w
      `}
      sizes="(max-width: 768px) 150px, 200px"
      loading="lazy"  // Native lazy loading
      alt={asset.name}
      onError={(e) => e.target.src = '/fallback-thumbnail.png'}
    />
  );
}

// Backend: Generate thumbnails on upload
const sharp = require('sharp');

async function processUpload(file) {
  const original = await uploadToS3(file);
  // Generate WebP thumbnails
  const thumbnail = await sharp(file.buffer)
    .resize(150, 150)
    .webp({ quality: 80 })
    .toBuffer();
  const thumbnailUrl = await uploadToS3(thumbnail, 'thumbnail_');
  return { original_url: original, thumbnail_url: thumbnailUrl };
}

Solution 3: Data Fetching Optimization

import { useInfiniteQuery } from '@tanstack/react-query';
import InfiniteLoader from 'react-window-infinite-loader';

function useAssets() {
  return useInfiniteQuery({
    queryKey: ['assets'],
    queryFn: ({ pageParam = 0 }) =>
      fetch(`/api/assets?offset=${pageParam}&limit=100`).then(r => r.json()),
    getNextPageParam: (lastPage) => lastPage.next_offset,
    staleTime: 5 * 60 * 1000,  // Cache for 5 minutes
  });
}

function AssetManager() {
  const { data, fetchNextPage, hasNextPage, isFetchingNextPage } = useAssets();
  const assets = data?.pages.flatMap(page => page.assets) || [];

  return (
    <InfiniteLoader
      isItemLoaded={index => index < assets.length}
      loadMoreItems={fetchNextPage}
      itemCount={hasNextPage ? assets.length + 1 : assets.length}
    >
      {({ onItemsRendered, ref }) => (
        <AssetGrid assets={assets} onItemsRendered={onItemsRendered} ref={ref} />
      )}
    </InfiniteLoader>
  );
}

Solution 4: Component Optimization

// Memoize expensive components
const AssetCard = React.memo(({ asset }) => {
  return (
    <div className="asset-card">
      <img src={asset.thumbnail_url} alt={asset.name} />
      <p>{asset.name}</p>
    </div>
  );
}, (prev, next) => prev.asset.id === next.asset.id);

// useMemo for expensive computations (copy before sorting to avoid mutating props/state)
const sortedAssets = useMemo(() => {
  return [...assets].sort((a, b) => b.created_at - a.created_at);
}, [assets]);

// useCallback for stable callbacks
const handleAssetClick = useCallback((assetId) => {
  navigate(`/assets/${assetId}`);
}, [navigate]);

Solution 5: Code Splitting

import { lazy, Suspense } from 'react';

// Lazy load the asset detail view
const AssetDetail = lazy(() => import('./AssetDetail'));

function App() {
  return (
    <Suspense fallback={<Spinner />}>
      <AssetDetail />
    </Suspense>
  );
}

Performance Monitoring:

// Web Vitals tracking
import { onCLS, onFID, onLCP } from 'web-vitals';

onLCP(metric => analytics.track('LCP', metric.value));
onFID(metric => analytics.track('FID', metric.value));
onCLS(metric => analytics.track('CLS', metric.value));

// Performance budget in CI
// lighthouse-ci.json
{
  "ci": {
    "assert": {
      "assertions": {
        "first-contentful-paint": ["error", { "maxNumericValue": 2000 }],
        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }],
        "total-blocking-time": ["error", { "maxNumericValue": 300 }]
      }
    }
  }
}

Expected Outcomes:
- Load Time: 5s → 0.8s (84% improvement)
- Memory: 500MB → 50MB (90% reduction)
- DOM Nodes: 10,000 → 30 (99.7% reduction)
- FPS: Smooth 60fps scrolling


Database Optimization

5. Campaign Reporting Query Performance

Level: Backend Engineer, Data Engineer

Difficulty: Hard

Source: Marketing Analytics Platforms

Team: Data Platform, Backend

Interview Round: Technical Deep Dive

Question: “This query takes 45 seconds on 50M campaign rows. Optimize it.”

SELECT
  c.campaign_name, COUNT(i.id) as impressions,
  SUM(i.cost) as spend, COUNT(DISTINCT i.user_id) as unique_users
FROM campaigns c
LEFT JOIN impressions i ON c.id = i.campaign_id
WHERE c.status = 'active' AND i.impression_date BETWEEN '2025-06-01' AND '2025-06-30'
GROUP BY c.id, c.campaign_name
ORDER BY spend DESC LIMIT 100;

Concise Answer:

Step 1: Analyze Execution Plan

EXPLAIN ANALYZE [query];
-- Likely issues:
-- 1. Sequential scan on campaigns (no index on status)
-- 2. Sequential scan on impressions (no index on date/campaign_id)
-- 3. Full table JOIN before filtering

Step 2: Add Indexes

-- Partial index for active campaigns
CREATE INDEX idx_campaigns_active ON campaigns(status, start_date)
WHERE status = 'active';

-- Composite index for impressions
CREATE INDEX idx_impressions_campaign_date ON impressions(campaign_id, impression_date)
INCLUDE (cost, user_id);

-- Or a covering index
CREATE INDEX idx_impressions_covering ON impressions(
  campaign_id, impression_date, id, cost, user_id
) WHERE impression_date >= '2025-01-01';

Step 3: Rewrite Query

-- Optimized version with CTEs
WITH active_campaigns AS (
  SELECT id, campaign_name
  FROM campaigns
  WHERE status = 'active' AND start_date >= '2025-01-01'
),
impression_stats AS (
  SELECT
    campaign_id,
    COUNT(*) as impression_count,
    SUM(cost) as total_spend,
    COUNT(DISTINCT user_id) as unique_users
  FROM impressions
  WHERE impression_date BETWEEN '2025-06-01' AND '2025-06-30'
    AND campaign_id IN (SELECT id FROM active_campaigns)
  GROUP BY campaign_id
)
SELECT
  ac.campaign_name,
  COALESCE(ist.impression_count, 0) as impressions,
  COALESCE(ist.total_spend, 0) as spend,
  COALESCE(ist.unique_users, 0) as unique_users
FROM active_campaigns ac
LEFT JOIN impression_stats ist ON ac.id = ist.campaign_id
ORDER BY spend DESC NULLS LAST
LIMIT 100;

Step 4: Materialized Views

-- Pre-aggregate daily stats
CREATE MATERIALIZED VIEW daily_campaign_stats AS
SELECT
  campaign_id,
  DATE(impression_date) as day,
  COUNT(*) as impressions,
  SUM(cost) as spend,
  COUNT(DISTINCT user_id) as unique_users
FROM impressions
GROUP BY campaign_id, DATE(impression_date);

-- Refresh nightly (CONCURRENTLY requires a unique index, e.g. on (campaign_id, day))
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_campaign_stats;

-- Query becomes trivial
SELECT
  c.campaign_name,
  SUM(dcs.impressions) as impressions,
  SUM(dcs.spend) as spend,
  -- Caveat: summing daily uniques over-counts users active on multiple days;
  -- use HyperLogLog sketches if exact cross-day uniques are required
  SUM(dcs.unique_users) as unique_users
FROM campaigns c
JOIN daily_campaign_stats dcs ON c.id = dcs.campaign_id
WHERE c.status = 'active' AND dcs.day BETWEEN '2025-06-01' AND '2025-06-30'
GROUP BY c.id, c.campaign_name
ORDER BY spend DESC
LIMIT 100;

-- Now executes in <200ms
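
The nightly refresh has to be scheduled somewhere; a minimal sketch using node-cron and the pg client (a database-side scheduler such as pg_cron would work equally well):

// Nightly refresh of the pre-aggregated stats (sketch)
const cron = require('node-cron');
const { Pool } = require('pg');

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Every day at 02:00; CONCURRENTLY avoids blocking readers during the refresh
cron.schedule('0 2 * * *', async () => {
  try {
    await pool.query('REFRESH MATERIALIZED VIEW CONCURRENTLY daily_campaign_stats');
  } catch (err) {
    console.error('Materialized view refresh failed', err);
  }
});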

Step 5: Partitioning

-- Partition impressions by month
CREATE TABLE impressions (
  id BIGSERIAL,
  campaign_id BIGINT,
  impression_date DATE,
  cost DECIMAL(10,2),
  user_id BIGINT
) PARTITION BY RANGE (impression_date);
CREATE TABLE impressions_2025_06 PARTITION OF impressions
  FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');
CREATE INDEX ON impressions_2025_06(campaign_id, impression_date);
-- Queries automatically scan only relevant partition

Step 6: Application-Level Caching

const redis = require('redis').createClient();

async function getCampaignReport(startDate, endDate) {
  const cacheKey = `report:${startDate}:${endDate}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Query database
  const result = await db.query(optimizedQuery, [startDate, endDate]);

  // Cache for 5 minutes
  await redis.setex(cacheKey, 300, JSON.stringify(result));
  return result;
}

Performance Comparison:

| Optimization | Query Time | Improvement |
| --- | --- | --- |
| Original | 45s | Baseline |
| + Indexes | 8s | 82% |
| + Query Rewrite | 2s | 96% |
| + Materialized View | 200ms | 99.6% |
| + Caching | <10ms | 99.98% |

Expected Outcomes:
- Query Time: 45s → 200ms (99.6% faster)
- Database Load: 80% reduction in CPU usage
- Scalability: Handles 10x data growth with same performance
- Cost: Lower RDS instance size saves $500/month


Microservices & Distributed Systems

6. Resilient Microservices Communication

Level: Senior Software Engineer

Difficulty: Hard

Source: Distributed Systems Best Practices

Team: Platform Engineering, DevOps

Interview Round: System Design

Question: “Service A (campaign management) calls Service B (targeting) and Service C (asset delivery). How do you handle failures when B or C are down? Design a resilient system.”

Concise Answer:

Resilience Patterns:

1. Circuit Breaker Pattern

class CircuitBreaker {
  constructor(service, options = {}) {
    this.service = service;
    this.failureThreshold = options.failureThreshold || 5;
    this.timeout = options.timeout || 3000;
    this.resetTimeout = options.resetTimeout || 60000;
    this.state = 'CLOSED';  // CLOSED, OPEN, HALF_OPEN
    this.failureCount = 0;
    this.nextAttempt = Date.now();
  }

  _timeoutAfter(ms) {
    // Rejects after ms so Promise.race fails fast on slow calls
    return new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Request timed out')), ms)
    );
  }

  async call(method, ...args) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker OPEN');
      }
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await Promise.race([
        this.service[method](...args),
        this._timeoutAfter(this.timeout)
      ]);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.resetTimeout;
    }
  }
}

// Usage
const targetingBreaker = new CircuitBreaker(targetingService, {
  failureThreshold: 5,
  timeout: 3000,
  resetTimeout: 30000
});

2. Retry with Exponential Backoff

async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.min(1000 * Math.pow(2, i), 10000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

3. Fallback Strategies

async function getCampaignData(campaignId) {
  try {
    const [targeting, assets] = await Promise.all([
      targetingBreaker.call('getAudience', campaignId),
      assetBreaker.call('getAssets', campaignId)
    ]);
    return { targeting, assets, source: 'live' };
  } catch (error) {
    logger.warn('Live services failed', { error, campaignId });

    // Fallback 1: Cached data
    const cached = await cache.get(`campaign:${campaignId}`);
    if (cached) return { ...cached, source: 'cache' };

    // Fallback 2: Default/degraded data
    return {
      targeting: { audience: 'broad', segments: [] },
      assets: { creative_id: 'default' },
      source: 'default',
      degraded: true
    };
  }
}

4. Event-Driven Async Communication

// Don't wait for downstream processing - publish events
class CampaignService {
  async createCampaign(data) {
    // Create campaign in the local DB
    const campaign = await db.campaigns.create(data);

    // Publish event (fire and forget)
    await eventBus.publish('campaign.created', {
      campaign_id: campaign.id,
      timestamp: new Date()
    });

    return campaign;  // Return immediately
  }
}

// Subscribers process asynchronously
eventBus.subscribe('campaign.created', async (event) => {
  try {
    await targetingService.initializeAudience(event.campaign_id);
  } catch (error) {
    // Dead-letter queue for retries
    await dlq.enqueue('campaign.created', event);
  }
});
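
The subscriber hands failures to dlq.enqueue, which is left abstract above; a minimal sketch assuming an SQS dead-letter queue and the AWS SDK v3 (the queue URL is an assumed environment variable):

// Dead-letter queue helper (sketch): park failed events for later replay instead of dropping them
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');
const sqsClient = new SQSClient({});

const dlq = {
  async enqueue(eventType, event) {
    await sqsClient.send(new SendMessageCommand({
      QueueUrl: process.env.DLQ_URL,
      MessageBody: JSON.stringify({ eventType, event, failed_at: new Date().toISOString() })
    }));
  }
};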

5. Health Checks & Monitoring

// Service health endpoint
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    targeting_service: await checkService('targeting'),
    asset_service: await checkService('assets')
  };
  const healthy = Object.values(checks).every(c => c.status === 'ok');
  res.status(healthy ? 200 : 503).json(checks);
});

// Distributed tracing
const { trace, SpanStatusCode } = require('@opentelemetry/api');

async function callService(serviceName, fn) {
  const span = trace.getTracer('api').startSpan(`call_${serviceName}`);
  try {
    const result = await fn();
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error);
    metrics.increment(`service.${serviceName}.error`);
    throw error;
  } finally {
    span.end();
  }
}
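
checkService is referenced above but not defined; a minimal sketch that probes a dependency's /health endpoint with a short timeout (assumes Node 18+ global fetch and a SERVICE_URLS config map):

// Dependency probe (sketch): a 2s timeout keeps the health endpoint fast even when a downstream service hangs
async function checkService(name) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 2000);
  try {
    const resp = await fetch(`${SERVICE_URLS[name]}/health`, { signal: controller.signal });
    return { status: resp.ok ? 'ok' : 'degraded', http_status: resp.status };
  } catch (error) {
    return { status: 'down', error: error.message };
  } finally {
    clearTimeout(timer);
  }
}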

Architectural Recommendations:

# Kubernetes deployment with service mesh (Istio)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: campaign-service
spec:
  hosts:
  - campaign-service
  http:
  - route:
    - destination:
        host: campaign-service
    retries:
      attempts: 3
      perTryTimeout: 2s
    timeout: 10s

Expected Outcomes:
- Availability: 99.9%+ despite downstream failures
- Latency: <100ms with circuit breaker (vs. 30s timeout)
- Error Recovery: Automatic retry with exponential backoff
- Observability: Full request tracing across services


DevOps & CI/CD

7. Zero-Downtime Deployment Pipeline

Level: Senior Software Engineer, DevOps Engineer

Difficulty: Moderate-Hard

Source: AWS Best Practices

Team: Platform Engineering, Infrastructure

Interview Round: System Design

Question: “Design a CI/CD pipeline for microservices on AWS with automated testing, security scanning, and zero-downtime deployments.”

Concise Answer:

Pipeline Architecture:

# .github/workflows/deploy.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'
      - name: Install & Test
        run: |
          npm ci
          npm run lint
          npm run test:unit
          npm run test:integration
      - name: Code Coverage
        run: npm run coverage

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Security Scan
        run: |
          npm audit --audit-level=high
          docker build -t app:${{ github.sha }} .
          trivy image app:${{ github.sha }}

  build:
    needs: [test, security]
    runs-on: ubuntu-latest
    steps:
      - name: Build & Push to ECR
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
          docker build -t $ECR_REGISTRY/campaign-service:${{ github.sha }} .
          docker push $ECR_REGISTRY/campaign-service:${{ github.sha }}

  deploy-production:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Blue-Green Deployment
        run: |
          # Deploy to green environment
          aws ecs update-service \
            --cluster prod \
            --service campaign-green \
            --task-definition campaign:${{ github.sha }} \
            --force-new-deployment

          # Wait for green to be healthy
          aws ecs wait services-stable --cluster prod --services campaign-green

          # Run smoke tests
          ./scripts/smoke-test.sh https://green.api.com

          # Switch traffic (update ALB target group)
          aws elbv2 modify-listener \
            --listener-arn $LISTENER_ARN \
            --default-actions TargetGroupArn=$GREEN_TG

          # Monitor for 10 minutes
          ./scripts/monitor.sh --duration=10m
      - name: Rollback on Failure
        if: failure()
        run: |
          aws elbv2 modify-listener \
            --listener-arn $LISTENER_ARN \
            --default-actions TargetGroupArn=$BLUE_TG

Infrastructure as Code (Terraform):

resource "aws_ecs_service" "campaign" {
  name            = "campaign-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.campaign.arn
  desired_count   = 3

  deployment_maximum_percent         = 200   # Allow double capacity during deploy
  deployment_minimum_healthy_percent = 100   # Never drop below 100%

  deployment_circuit_breaker {
    enable   = true
    rollback = true  # Auto-rollback on failure
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.campaign.arn
    container_name   = "campaign"
    container_port   = 3000
  }
}

Deployment Strategies:

BLUE-GREEN DEPLOYMENT:
┌─────────────┐    ┌─────────────┐
│   Blue      │    │   Green     │
│ (Current)   │    │   (New)     │
└──────┬──────┘    └──────┬──────┘
       │                  │
       ▼                  ▼
┌─────────────────────────────┐
│     Load Balancer           │
│  Traffic: 100% Blue         │
└─────────────────────────────┘
         ↓
   After validation
         ↓
┌─────────────────────────────┐
│     Load Balancer           │
│  Traffic: 100% Green        │
└─────────────────────────────┘

CANARY DEPLOYMENT:
5% → 25% → 50% → 100% traffic shift
Monitor metrics at each stage
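
For the canary path, traffic can be shifted gradually with weighted target groups on the ALB; a minimal sketch assuming the AWS SDK v3, existing blue/green target group ARNs, and a checkMetrics helper that throws (triggering rollback) when error rate or latency degrades:

// Canary traffic shifting (sketch): 5% → 25% → 50% → 100%, validating between steps
const { ElasticLoadBalancingV2Client, ModifyListenerCommand } = require('@aws-sdk/client-elastic-load-balancing-v2');
const elb = new ElasticLoadBalancingV2Client({});

async function shiftTraffic(listenerArn, blueArn, greenArn) {
  for (const weight of [5, 25, 50, 100]) {
    await elb.send(new ModifyListenerCommand({
      ListenerArn: listenerArn,
      DefaultActions: [{
        Type: 'forward',
        ForwardConfig: {
          TargetGroups: [
            { TargetGroupArn: greenArn, Weight: weight },
            { TargetGroupArn: blueArn, Weight: 100 - weight }
          ]
        }
      }]
    }));
    await checkMetrics();  // assumed helper: throws to abort and roll back
  }
}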

Testing Strategy:

// Unit Tests (80%+ coverage)
describe('Campaign Service', () => {
  it('should create campaign', async () => {
    const campaign = await service.create({ name: 'Test' });
    expect(campaign.id).toBeDefined();
  });
});

// Integration Tests
describe('Campaign API', () => {
  it('should persist to database', async () => {
    const res = await request(app).post('/campaigns').send(data);
    expect(res.status).toBe(201);
    const dbRecord = await db.campaigns.findById(res.body.id);
    expect(dbRecord).toBeDefined();
  });
});

// Contract Tests (Pact)
describe('Targeting Service Contract', () => {
  it('should return audience data', () => {
    provider.addInteraction({
      uponReceiving: 'request for audience',
      withRequest: { method: 'GET', path: '/audience/123' },
      willRespondWith: { status: 200, body: { segments: [] } }
    });
  });
});

Monitoring & Rollback:

// Automated rollback on errors
class DeploymentMonitor {
  async monitor(duration = 600000) {  // 10 minutes
    const startTime = Date.now();
    while (Date.now() - startTime < duration) {
      const metrics = await getMetrics();
      if (metrics.errorRate > 0.01) {   // >1% error rate
        throw new Error('Error rate threshold exceeded');
      }
      if (metrics.p99Latency > 1000) {  // >1s p99
        throw new Error('Latency threshold exceeded');
      }
      await sleep(30000);  // Check every 30s
    }
  }
}

Expected Outcomes:
- Deployment Time: <10 minutes end-to-end
- Downtime: 0 seconds (blue-green deployment)
- Rollback Time: <2 minutes (switch traffic back)
- Failure Rate: <0.1% (automated rollback prevents incidents)


Real-Time Systems

8. Campaign Alert Notification System

Level: Senior Software Engineer

Difficulty: Hard

Source: Event-Driven Architecture Patterns

Team: Platform Engineering

Interview Round: System Design

Question: “Design a notification system sending real-time alerts when campaigns reach thresholds (budget exhausted, performance targets met). Handle 1000+ clients with custom rules.”

Concise Answer:

Architecture:

Campaign Events → Kinesis → Lambda → Rule Engine → SQS → Notification Workers
                              ↓
                         DynamoDB (Rules)

Rule Storage:

// DynamoDB Rule Schema
{
  rule_id: "rule-123",
  client_id: "client-456",
  campaign_id: "campaign-789",
  conditions: [
    { metric: "spend", operator: ">=", threshold: 10000 },
    { metric: "ctr", operator: "<", threshold: 1.0 }
  ],
  logic: "OR",     // AND or OR
  channels: ["email", "webhook"],
  cooldown: 3600,  // Don't re-trigger for 1 hour
  notification: {
    email: ["manager@client.com"],
    webhook_url: "https://client.com/webhook",
    template: "Campaign {{name}} has spent ${{spend}}"
  }
}

Event Processing:

// Lambda function processes campaign events
exports.handler = async (event) => {
  for (const record of event.Records) {
    const campaignEvent = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString()
    );

    // Fetch rules for this campaign
    const rules = await getRulesForCampaign(campaignEvent.campaign_id);

    for (const rule of rules) {
      if (evaluateRule(rule, campaignEvent.metrics)) {
        // Skip if still inside the cooldown window
        if (await isCooledDown(rule.rule_id)) continue;

        // Queue notification
        await sqs.sendMessage({
          QueueUrl: NOTIFICATION_QUEUE,
          MessageBody: JSON.stringify({
            rule_id: rule.rule_id,
            campaign_id: campaignEvent.campaign_id,
            metrics: campaignEvent.metrics,
            channels: rule.channels,
            notification: rule.notification
          })
        });

        // Set cooldown
        await setCooldown(rule.rule_id, rule.cooldown);
      }
    }
  }
};

function evaluateRule(rule, metrics) {
  const results = rule.conditions.map(cond => {
    const value = metrics[cond.metric];
    switch (cond.operator) {
      case '>=': return value >= cond.threshold;
      case '<': return value < cond.threshold;
      case '==': return value == cond.threshold;
      default: return false;
    }
  });
  return rule.logic === 'AND'
    ? results.every(r => r)
    : results.some(r => r);
}
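
isCooledDown and setCooldown are left abstract above; a minimal sketch backed by Redis keys with expiry (assuming an ioredis client named redis; a DynamoDB item with a TTL attribute would work similarly):

// Cooldown helpers (sketch): a key per rule suppresses re-triggering until it expires
async function isCooledDown(ruleId) {
  // true while the rule is still inside its cooldown window
  return (await redis.exists(`cooldown:${ruleId}`)) === 1;
}

async function setCooldown(ruleId, seconds) {
  // NX + EX: only set if absent, expire after the rule's cooldown period
  await redis.set(`cooldown:${ruleId}`, '1', 'EX', seconds, 'NX');
}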

Notification Delivery:

// Worker processes the notification queue
exports.handler = async (event) => {
  for (const record of event.Records) {
    const notification = JSON.parse(record.body);

    // Render message
    const message = renderTemplate(
      notification.notification.template,
      notification.metrics
    );

    // Send via multiple channels
    const promises = notification.channels.map(channel => {
      switch (channel) {
        case 'email':
          return ses.sendEmail({
            To: notification.notification.email,
            Subject: 'Campaign Alert',
            Body: message
          });
        case 'webhook':
          return axios.post(notification.notification.webhook_url, notification);
        case 'slack':
          return axios.post(SLACK_WEBHOOK, { text: message });
      }
    });

    await Promise.allSettled(promises);  // Don't fail if one channel fails
  }
};
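
renderTemplate fills the {{...}} placeholders from the rule's template; a minimal sketch:

// Tiny template renderer (sketch): replaces {{field}} with values from the context object,
// leaving unknown placeholders untouched
function renderTemplate(template, context) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, field) =>
    context[field] !== undefined ? String(context[field]) : match
  );
}

// e.g. renderTemplate('Campaign {{name}} has spent ${{spend}}', { name: 'Summer Sale', spend: 10500 })
//      -> 'Campaign Summer Sale has spent $10500'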

User Interface:

// React rule builder
function RuleBuilder({ campaignId }) {
  const [conditions, setConditions] = useState([{
    metric: 'spend',
    operator: '>=',
    threshold: 0
  }]);

  const saveRule = async () => {
    await api.post('/rules', {
      campaign_id: campaignId,
      conditions,
      logic: 'OR',
      channels: ['email'],
      notification: { email: [user.email] }
    });
  };

  return (
    <div>
      {conditions.map((cond, i) => (
        <div key={i}>
          <select value={cond.metric} onChange={e => updateCondition(i, 'metric', e.target.value)}>
            <option value="spend">Spend</option>
            <option value="ctr">CTR</option>
            <option value="conversions">Conversions</option>
          </select>
          <select value={cond.operator}>
            <option value=">=">≥</option>
            <option value="<">{'<'}</option>
          </select>
          <input type="number" value={cond.threshold} />
        </div>
      ))}
      <button onClick={saveRule}>Save Rule</button>
    </div>
  );
}

Expected Outcomes:
- Latency: <10s from event to notification delivery
- Throughput: 10,000+ events/second processed
- Scalability: Lambda auto-scales, SQS buffers bursts
- Reliability: Dead letter queue for failed notifications


Security & Authentication

9. Multi-Tenant JWT Authentication

Level: Mid-Senior Software Engineer

Difficulty: Moderate

Source: Security Best Practices

Team: Backend Engineering

Interview Round: Technical Design

Question: “Implement JWT authentication for multi-tenant API. Each tenant accesses only their data. Include token refresh and rate limiting.”

Concise Answer:

JWT Implementation:

const jwt = require('jsonwebtoken');
const bcrypt = require('bcrypt');

// Login
app.post('/api/auth/login', async (req, res) => {
  const { email, password } = req.body;
  const user = await User.findOne({ email });
  if (!user || !await bcrypt.compare(password, user.password_hash)) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }

  // Generate tokens
  const accessToken = jwt.sign(
    {
      user_id: user.id,
      tenant_id: user.tenant_id,
      role: user.role
    },
    process.env.JWT_SECRET,
    { expiresIn: '15m' }
  );
  const refreshToken = jwt.sign(
    { user_id: user.id },
    process.env.JWT_REFRESH_SECRET,
    { expiresIn: '7d' }
  );

  await RefreshToken.create({
    token: refreshToken,
    user_id: user.id,
    expires_at: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000)
  });

  res.json({
    access_token: accessToken,
    refresh_token: refreshToken,
    expires_in: 900
  });
});

// Refresh endpoint
app.post('/api/auth/refresh', async (req, res) => {
  const { refresh_token } = req.body;
  try {
    const decoded = jwt.verify(refresh_token, process.env.JWT_REFRESH_SECRET);
    const storedToken = await RefreshToken.findOne({
      token: refresh_token,
      user_id: decoded.user_id,
      revoked: false
    });
    if (!storedToken || storedToken.expires_at < new Date()) {
      return res.status(401).json({ error: 'Invalid refresh token' });
    }

    const user = await User.findById(decoded.user_id);
    const accessToken = jwt.sign(
      { user_id: user.id, tenant_id: user.tenant_id, role: user.role },
      process.env.JWT_SECRET,
      { expiresIn: '15m' }
    );
    res.json({ access_token: accessToken });
  } catch (error) {
    res.status(401).json({ error: 'Invalid refresh token' });
  }
});
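
The /auth/logout route listed in the API section isn't shown; a minimal sketch that revokes the stored refresh token so it cannot be reused (short-lived access tokens are simply allowed to expire):

// Logout (sketch): mark the refresh token revoked; RefreshToken is the same model used above
app.post('/api/auth/logout', authenticate, async (req, res) => {
  const { refresh_token } = req.body;
  if (refresh_token) {
    await RefreshToken.updateOne(
      { token: refresh_token, user_id: req.user.user_id },
      { revoked: true }
    );
  }
  res.status(204).send();
});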

Authentication Middleware:

function authenticate(req, res, next) {
  const token = req.headers.authorization?.replace('Bearer ', '');
  if (!token) return res.status(401).json({ error: 'No token' });
  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET);
    next();
  } catch (error) {
    res.status(401).json({ error: 'Invalid or expired token' });
  }
}

Multi-Tenant Isolation:

// Automatic tenant filtering
function enforceTenancy(req, res, next) {
  req.tenantId = req.user.tenant_id;
  next();
}

app.use('/api', authenticate, enforceTenancy);

// All queries auto-filter by tenant
app.get('/api/campaigns', async (req, res) => {
  const campaigns = await Campaign.find({
    tenant_id: req.tenantId  // Automatic isolation
  });
  res.json(campaigns);
});

// Prevent cross-tenant access
app.get('/api/campaigns/:id', async (req, res) => {
  const campaign = await Campaign.findOne({
    _id: req.params.id,
    tenant_id: req.tenantId
  });
  if (!campaign) {
    return res.status(404).json({ error: 'Not found' });
  }
  res.json(campaign);
});

Rate Limiting:

const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');

const limiter = rateLimit({
  store: new RedisStore({ client: redisClient }),
  windowMs: 15 * 60 * 1000,  // 15 minutes
  max: 1000,
  keyGenerator: (req) => req.user?.user_id || req.ip,
  handler: (req, res) => {
    res.status(429).json({
      error: 'Rate limit exceeded',
      retry_after: Math.ceil(req.rateLimit.resetTime / 1000)
    });
  }
});

app.use('/api', authenticate, limiter);

RBAC (Role-Based Access Control):

function authorize(...allowedRoles) {
  return (req, res, next) => {
    if (!allowedRoles.includes(req.user.role)) {
      return res.status(403).json({ error: 'Insufficient permissions' });
    }
    next();
  };
}

// Usage
app.delete('/api/campaigns/:id',
  authenticate,
  authorize('admin', 'campaign_manager'),
  deleteCampaign
);

Security Best Practices:

// Helmet.js for security headers
const helmet = require('helmet');
app.use(helmet());

// Input validation
const { body, validationResult } = require('express-validator');

app.post('/api/campaigns',
  authenticate,
  body('name').trim().isLength({ min: 1, max: 100 }),
  body('budget').isFloat({ gt: 0 }),
  async (req, res) => {
    const errors = validationResult(req);
    if (!errors.isEmpty()) {
      return res.status(400).json({ errors: errors.array() });
    }
    // ...
  }
);

// CORS configuration
const cors = require('cors');
app.use(cors({
  origin: process.env.ALLOWED_ORIGINS.split(','),
  credentials: true
}));

Expected Outcomes:
- Security: JWT tokens, tenant isolation, rate limiting
- User Experience: Token refresh for seamless sessions
- Scalability: Redis-backed rate limiting
- Compliance: RBAC for fine-grained permissions


Problem Solving & Debugging

10. Production Incident Response

Level: All Levels

Difficulty: Moderate (Behavioral)

Source: Standard Behavioral Question

Team: All Teams

Interview Round: Behavioral Assessment

Question: “Describe a time you debugged a critical production issue under pressure. Walk through your process from alert to resolution.”

Concise Answer (STAR Method):

Situation:
“At my previous role, I received a 2 AM alert that our campaign API response times spiked from 200ms to 15+ seconds. This API serves real-time ad requests, so the degradation was causing ~$5,000/hour revenue loss.”

Task:
“As the on-call engineer, I needed to quickly diagnose and restore service within our 99.9% uptime SLA.”

Action:

1. Confirm & Scope (5 minutes)
- Verified alert in monitoring dashboards (DataDog)
- Confirmed: p99 latency at 18s, error rate 12%
- Affected all API endpoints
- No recent deployments (ruled out bad code)

2. Form Hypotheses
- Database slowdown (most common)
- External service degradation
- Memory leak causing GC pauses
- Network issues

3. Gather Data (10 minutes)

# Database health
SELECT * FROM pg_stat_activity WHERE state = 'active';
# Result: Normal query times, no locks, CPU 40%

# Application logs
tail -f /var/log/app.log | grep ERROR
# Result: Timeout errors from the external targeting service

# Application metrics
curl localhost:9090/metrics | grep memory
# Result: Memory usage normal

4. Root Cause Identified
- External targeting service responding in 30+ seconds
- Our API waited synchronously (no timeout configured)
- Blocked threads caused request queue buildup

5. Immediate Mitigation (2 minutes)

// Deployed a feature flag to disable external calls
config.targeting.enabled = false;

// Fallback to cached targeting data
const targeting = await cache.get(`targeting:${userId}`)
  || getDefaultTargeting();

Response times dropped to 300ms within 2 minutes.

6. Communication
- Posted status updates in Slack incident channel every 10 minutes
- Updated status page for external clients
- Notified manager after extending beyond 30 minutes

7. Long-Term Fix (Next Day)

// Implemented a circuit breaker
const targetingBreaker = new CircuitBreaker(targetingService, {
  failureThreshold: 3,
  timeout: 3000,  // Fail fast after 3s
  fallback: getCachedTargeting
});

// Added monitoring
metrics.histogram('external.targeting.latency');
alerting.addRule('targeting.latency > 1000ms for 2m');

Result:

Immediate:
- Restored service in 15 minutes from alert
- Prevented estimated $1,250 revenue loss
- No customer data loss

Long-Term:
- Circuit breaker prevented 2 subsequent outages in following months
- Received VP Engineering recognition for incident response

Lessons Learned:
1. Always implement timeouts for external dependencies
2. Design for failure with fallbacks and graceful degradation
3. Clear communication reduces stakeholder anxiety
4. Blameless post-mortems drive continuous improvement

Post-Mortem Actions:
- Documented timeline and root cause
- Added circuit breakers to all external service calls
- Implemented synthetic monitoring for external dependencies
- Reduced default timeout from 30s to 3s across all services

Key Takeaway: Assume all dependencies will fail and design accordingly.


End of WPP Software Engineer Interview Guide

This comprehensive guide covers essential skills for WPP software engineering roles across Platform Engineering, Backend Development, Frontend Development, DevOps, and Data Engineering teams at agencies including Hogarth Technology, AKQA Engineering, Choreograph, GroupM Technology, and VML.