Spotify Data Engineer
This guide features 10 challenging Data Engineering interview questions for Spotify (Senior DE to Staff DE levels), covering real-time event processing, data modeling for personalization, cost-efficient infrastructure, and data quality at scale.
1. Real-Time Processing for "Spotify Wrapped" - Architecture & Scalability
Difficulty Level: Very High
Role: Senior Data Engineer
Topic: Streaming Systems & Event Processing
Business Function: Core Infrastructure / Personalization
Question:
"During the 'Spotify Wrapped' season, event data spikes by 10x. How would you design a real-time pipeline to process billions of 'listening events' daily to ensure user statistics are updated with minimal latency, while handling late-arriving data from offline devices?"
Answer Framework
STAR Method Structure:
● Situation: Massive seasonal data spikes (10x) for "Wrapped" processing and the need to handle offline-to-online data sync.
● Task: Build a scalable, low-latency pipeline that handles late-arrival events (watermarking) without data loss.
● Action: Implement a Kappa Architecture using Apache Beam (Scio) on Google Cloud Dataflow. Use event-time processing and windowing to manage data from offline devices.
● Result: Achieved 99.9% data completeness within a 5-minute processing window, ensuring accurate "Year in Review" stats.
Key Competencies Evaluated:
Streaming Concepts: Understanding Watermarks, Windowing, and Event Time vs. Processing Time.
Cloud Architecture: Leveraging GCP Pub/Sub and Dataflow for auto-scaling.
Answer (Part 1 of 3): Architecture Design
We move away from a Lambda Architecture to a Kappa Architecture: a single streaming codepath serves both real-time processing and historical reprocessing (replay), so there is no separate batch layer to keep in sync.
● Ingestion: Use Cloud Pub/Sub to buffer the 10x spike. It decouples producers (Apps) from consumers.
● Processing: Apache Beam (running on Dataflow). Beam handles both batch and stream with the same code.
● State Management: Use a scalable state backend (like RocksDB) to store intermediate aggregations (e.g., "song_play_count") before flushing to storage.
Answer (Part 2 of 3): Handling Late Data (Offline Mode)
The Core Challenge: A user listens offline on a flight and syncs 10 hours later.
● Watermarking: We define a heuristic watermark (e.g., "We expect 99% of data within 2 hours").
● Triggers:
○ Early Trigger: Update stats immediately when data arrives (Speculative result).
○ Late Trigger: If data arrives after the watermark, fire a correction pane to update the database.
This ensures the user sees provisional numbers immediately and the corrected totals eventually (a minimal sketch of this configuration follows).
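As an illustration of the windowing and trigger setup described above, here is a minimal Apache Beam (Python SDK) sketch. The subscription path, JSON payload shape, and the 12-hour lateness bound are assumptions, not Spotify's actual values:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows

# Sketch only: per-user play counts with speculative (early) panes and
# late-data correction panes. Names and constants are illustrative.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/listening-events")
        | "Parse" >> beam.Map(lambda msg: (json.loads(msg)["user_id"], 1))
        | "Window" >> beam.WindowInto(
            FixedWindows(60 * 60),                           # 1-hour event-time windows
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(60),       # speculative result every minute
                late=trigger.AfterCount(1)),                 # correction pane per late record
            allowed_lateness=12 * 60 * 60,                   # accept offline syncs up to 12h late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "CountPlays" >> beam.CombinePerKey(sum)
        | "Emit" >> beam.Map(print)                          # replace with a Bigtable/BigQuery sink
    )
```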
Answer (Part 3 of 3): Storage & Serving
● Serving Layer: Bigtable for low-latency lookups (User Profile Service).
● Analytical Layer: BigQuery for the Data Science team to run "Wrapped" aggregations.
● Optimization: Partition BigQuery tables by ingestion_time to reduce scan costs.
2. Cost-Efficient Storage Strategy - BigQuery & GCS Optimization
Difficulty Level: Very High
Role: Senior Data Engineer
Topic: Cloud Economics & Data Modeling
Business Function: Data Infrastructure / Platform
Question:
"Spotify stores petabytes of historical user data. A specific dataset used for long-term trend analysis is costing $50k/month in BigQuery storage and processing. How would you redesign
the storage and partitioning strategy to reduce costs by 40% without impacting analyst query speed?"
Answer Framework
STAR Method Structure:
● Situation: High storage costs for large-scale historical datasets (Listening History).
● Task: Optimize partitioning, clustering, and storage tiers (Long-term vs. Active).
● Action: Implemented "Partition Expiration" policies and moved cold data to GCS Coldline (Parquet format).
● Result: Reduced monthly bill by 45% while keeping query speed for recent data under 3 seconds.
Key Competencies Evaluated:
Cost Optimization: Understanding BigQuery pricing (Active vs. Long-term storage).
Data Modeling: Partitioning vs. Clustering usage.
Answer (Part 1 of 3): Diagnosis & Partitioning
● Audit: Analyze INFORMATION_SCHEMA to see which table partitions are actually queried.
● Partitioning: Partition by DATE. Analysts usually query "last 30 days" or "last year".
● Clustering: Cluster by user_id and country. This prunes the data scanned (Cost = Data Scanned).
● Impact: If an analyst filters by Country='SE', BigQuery skips 90% of the data, slashing costs immediately.
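A hedged sketch of what the resulting table definition could look like, run through the BigQuery Python client; the dataset, table, and column names (event_date, country, user_id) are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative DDL: rebuild the listening-history table partitioned by date
# and clustered by the columns analysts filter on most. Names are assumptions.
ddl = """
CREATE TABLE analytics.listening_history_v2
PARTITION BY event_date
CLUSTER BY country, user_id
AS SELECT * FROM analytics.listening_history
"""
client.query(ddl).result()  # blocks until the CTAS job finishes
```

Clustering column order is a judgment call; country leads here because the Country='SE' style of filter is assumed to be the most common.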
Answer (Part 2 of 3): Storage Tiering Strategy
The 90-Day Rule:
● Hot Data (0-90 days): Keep in BigQuery Standard Storage for fast access.
● Warm Data (90 days - 1 year): BigQuery Long-Term Storage (50% cheaper automatically).
● Cold Data (> 1 year): Export to Google Cloud Storage (GCS) as Parquet/Avro files.
Why Parquet? It's columnar (compressed) and can still be queried via BigQuery External Tables if absolutely needed (slower, but cheaper).
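A minimal sketch of the cold-data export, assuming the google-cloud-bigquery client and illustrative project, table, and bucket names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Export a single cold partition (selected via the $YYYYMMDD decorator)
# to GCS as Parquet. After a successful export the partition can be dropped.
job_config = bigquery.ExtractJobConfig(destination_format="PARQUET")
extract_job = client.extract_table(
    "my-project.analytics.listening_history$20230101",
    "gs://cold-archive/listening_history/2023-01-01/*.parquet",
    job_config=job_config,
)
extract_job.result()
```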
Answer (Part 3 of 3): Automation
Implement a Data Lifecycle Policy.
● Use Terraform to enforce partition_expiration_days on the BigQuery dataset.
● Result: The cleanup happens automatically. No manual engineering effort required.
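Terraform is the answer's tool of choice; as a hedged alternative illustration, the same default can be set with the BigQuery Python client (the dataset name and the 400-day retention are assumptions):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Apply a default partition expiration to every table in the dataset,
# so old partitions age out without manual cleanup.
dataset = client.get_dataset("my-project.analytics")
dataset.default_partition_expiration_ms = 400 * 24 * 60 * 60 * 1000  # ~400 days
client.update_dataset(dataset, ["default_partition_expiration_ms"])
```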
3. Data Quality Incident - The "Missing Streams" Anomaly
Difficulty Level: High
Role: Senior Data Engineer
Topic: Data Quality & Observability
Business Function: Content Insights
Question:
"You notice a 15% drop in the 'Global Stream Count' metric at 2 AM. The application team says the player is working fine. How do you debug this data pipeline issue? Walk through your investigation steps from Source to Dashboard."
Answer Framework
STAR Method Structure:
● Situation: Unexplained drop in a critical KPI (Stream Count) while the app is functional.
● Task: Identify if it's a Data Ingestion failure, Transformation bug, or Delayed arrival.
● Action: Traced the DAG lineage (Airflow), checked Pub/Sub backlog, and validated schema compatibility.
● Result: Identified a "Schema Evolution" break (Client sent string, DB expected int). Deployed a hotfix to the Dead Letter Queue (DLQ).
Key Competencies Evaluated:
Debugging: Systematic isolation of pipeline stages.
Schema Management: Handling Protobuf/Avro schema changes.
Answer (Part 1 of 3): Isolation (Source vs. Pipe)
● Check the Source: Is the drop global or regional? (e.g., only iOS users in Europe?)
● Check the Plumbing: Look at Pub/Sub "Unacknowledged Messages".
● Hypothesis: If unacked messages are spiking, the consumers (Dataflow workers) are crashing or stuck.
● Finding: Workers are failing with SerializationException.
Answer (Part 2 of 3): Root Cause Analysis
The App Team released a new version (v8.4) at 1:45 AM.
● They changed the track_id format.
● Our ingestion job (Scio) uses a strict Avro schema. The new format caused the job to reject records.
● The Gap: The records weren't "lost"; they were being routed to the Dead Letter Queue (DLQ).
Answer (Part 3 of 3): Recovery & Fix
● Immediate: Update the Avro schema registry to accept the new format. Restart the pipeline.
● Reprocessing: Replay the messages from the DLQ back into the main pipeline.
● Prevention: Implement "Contract Testing" in the CI/CD pipeline so the Mobile team cannot merge code that breaks the Data Contract (a minimal check is sketched below).
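A minimal contract test along these lines, using fastavro, assuming the pipeline's current (reader) schema and the client's proposed (writer) schema; field names are illustrative:

```python
import io

import fastavro

# Reader schema: what the ingestion job expects. Writer schema: what the new
# client version proposes to send. Field names are illustrative.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "ListenEvent",
    "fields": [{"name": "track_id", "type": "long"}],
})
writer_schema = fastavro.parse_schema({
    "type": "record", "name": "ListenEvent",
    "fields": [{"name": "track_id", "type": "string"}],  # the breaking change
})

def is_backward_compatible(writer, reader, sample_record) -> bool:
    """Write with the new schema, then try to read with the old one."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, writer, sample_record)
    buf.seek(0)
    try:
        fastavro.schemaless_reader(buf, writer, reader)
        return True
    except Exception:
        return False

# CI would fail the mobile client's build here.
assert not is_backward_compatible(writer_schema, reader_schema,
                                  {"track_id": "sp:track:123"})
```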
4. Designing the "Discover Weekly" Pipeline - Batch vs. Stream
Difficulty Level: High
Role: Staff Data Engineer
Topic: Workflow Orchestration & ML Pipelines
Business Function: Recommendations / Personalization
Question:
"Discover Weekly is generated once a week for 500M+ users. It involves complex Matrix Factorization (ML). How would you orchestrate this pipeline? Would you use Streaming or Batch? How do you handle dependencies (e.g., User History must be ready before Model Training)?"
Answer Framework
STAR Method Structure:
● Situation: Orchestrating a massive-scale ML pipeline with strict SLAs (Monday Morning delivery).
● Task: Choose the right paradigm (Batch) and tool (Flyte/Airflow) to manage DAG dependencies.
● Action: Designed a Batch pipeline using Airflow. Separated "Feature Engineering" from "Inference".
● Result: Reliable delivery of 500M playlists every Monday by 6 AM.
Key Competencies Evaluated:
Orchestration: Managing complex DAGs (Directed Acyclic Graphs).
ML Ops: Serving pre-computed predictions vs. real-time inference.
Answer (Part 1 of 3): Batch vs. Stream Decision
● Verdict: Batch.
● Why? Discover Weekly is a "static" playlist. It doesn't need to change instantly.
● Complex ML models (Collaborative Filtering) are computationally expensive to run in real-time for everyone.
● We compute it over the weekend using bulk resources.
Answer (Part 2 of 3): Pipeline Steps (The DAG)
● Step 1: Feature Snapshot. Freeze the "User-Item Interaction Matrix" (Listening history).
● Step 2: Candidate Generation (Spark). Run ALS (Alternating Least Squares) on the cluster.
● Step 3: Filtering. Remove songs the user has already heard (Anti-join).
● Step 4: Persistence. Write the top 30 song IDs per user to Cassandra/Bigtable.
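A skeleton of how such a DAG could be wired in Airflow 2.x; the operators are stand-ins (the real tasks would be Spark submits and Bigtable loads), and all IDs and the schedule are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Skeleton only: the dependency chain matters more than the operator types here.
with DAG(
    dag_id="discover_weekly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * 6",   # start Saturday night, finish before Monday 6 AM
    catchup=False,
) as dag:
    snapshot_features = EmptyOperator(task_id="snapshot_user_item_matrix")
    generate_candidates = EmptyOperator(task_id="run_als_candidate_generation")
    filter_heard = EmptyOperator(task_id="anti_join_listening_history")
    persist_playlists = EmptyOperator(task_id="write_top30_to_bigtable")

    snapshot_features >> generate_candidates >> filter_heard >> persist_playlists
```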
Answer (Part 3 of 3): Handling Failure (The SLA)
● Checkpointing: If the job fails at "Step 3", we shouldn't re-run "Step 1" (expensive).
● Backfill Strategy: If the pipeline is delayed, do we serve last week's playlist?
● Decision: Yes. Reliability > Freshness for this specific feature. We set a "Fallback" flag in the serving layer.
5. GDPR & The "Right to be Forgotten" - Deletion at Scale
Difficulty Level: Medium-High
Role: Data Engineer
Topic: Data Governance & Compliance
Business Function: Trust & Safety
Question:
"A user exercises their GDPR right to delete all their data. Spotify data is spread across BigQuery (Analytics), GCS (Logs), and Bigtable (Live). How do you design a system to ensure this user's data is purged from all systems within 30 days?"
Answer Framework
STAR Method Structure:
● Situation: Compliance requirement to delete user data scattered across heterogeneous systems.
● Task: Create a centralized "Deletion Service" that propagates commands to all data stores.
● Action: Built a "Deletion Propagation" pipeline using Pub/Sub.
● Result: 100% compliance with GDPR auditing.
Key Competencies Evaluated:
System Design: Event-driven architecture for distributed tasks.
Data Governance: Handling backups and immutable logs.
Answer (Part 1 of 3): The "Deletion Stream"
We cannot write scripts for every database manually.
● Design: A central "User Privacy Service" publishes a UserDeletedEvent {user_id: "12345"} to a Kafka/PubSub topic.
● Subscribers: Every data team (Ads, Music, Payments) subscribes to this topic.
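A minimal sketch of the fan-out publish, assuming Google Cloud Pub/Sub and illustrative project/topic names:

```python
import json

from google.cloud import pubsub_v1

# The central User Privacy Service publishes one event; every data team's
# subscriber picks it up independently. Names are assumptions.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "user-deleted-events")

event = {"event_type": "UserDeletedEvent", "user_id": "12345"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("published message id:", future.result())
```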
Answer (Part 2 of 3): Handling Immutable Data (BigQuery)
● Challenge: You can't easily delete one row from a massive Parquet file or immutable log.
● Strategy: "Crypto-Shredding".
● Instead of deleting the data, we delete the Encryption Key associated with that user.
● If the data is unreadable, it is effectively deleted.
● Alternative: For BigQuery, use MERGE statements periodically to filter out deleted users (Re-writing partitions).
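A hedged sketch of the periodic BigQuery purge, assuming a deletion-requests table maintained by the privacy service; table names are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Periodic DML purge: BigQuery rewrites the affected partitions under the hood.
purge_sql = """
DELETE FROM `analytics.listening_history`
WHERE user_id IN (
  SELECT user_id FROM `privacy.deletion_requests` WHERE status = 'PENDING'
)
"""
client.query(purge_sql).result()
```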
Answer (Part 3 of 3): Verification
● The Audit Log: The system must produce a "Proof of Deletion" report.
● Each subscriber (e.g., The Bigtable cleanup job) sends an Ack back to the central service.
Only when all Acks are received is the request marked "Complete".
Questions 6-10 cover advanced topics: fraud detection, financial data engineering (royalties), experimentation platforms, external-facing analytics, and metadata/search consistency.
6. Fraud Detection - Identifying "Stream Farms" & Bots
Difficulty Level: High
Role: Senior Data Engineer
Topic: Anomaly Detection & Windowing
Business Function: Trust & Safety / Royalties
Question:
"Spotify loses money when 'Stream Farms' (bots) play songs on repeat to generate royalties. How would you design a detection pipeline to identify users who play 30-second clips continuously for 24 hours? The system must flag these accounts within 1 hour of the behavior starting."
Answer Framework
STAR Method Structure:
● Situation: Malicious actors using scripts to artificially inflate play counts, impacting royalty pool distribution.
● Task: Build a near real-time detection engine to flag accounts exceeding human listening capabilities (e.g., >1000 streams/day).
● Action: Designed a Session Window pipeline using Apache Flink (or Dataflow) to aggregate stream counts per user_id in sliding windows.
● Result: Reduced fraudulent royalty payouts by 15% and automated the account suspension process.
Key Competencies Evaluated:
Stream Processing: Sliding Windows vs. Tumbling Windows.
Pattern Matching: Defining complex event patterns (CEP) for fraud.
Answer (Part 1 of 3): The Detection Logic
● The Rule: No human can listen to more than 2,880 songs (30 secs each) in 24 hours.
● Ingestion: Read raw listening events from Pub/Sub.
● Windowing Strategy: Use a Sliding Window of 1 hour, updating every 5 minutes (see the sketch below).
● Aggregation: Count(track_id) grouped by user_id.
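A minimal Beam (Python SDK) sketch of this sliding-window count; the subscription, topic, threshold, and payload shape are assumptions:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import SlidingWindows

STREAMS_PER_HOUR_LIMIT = 120   # 2,880 streams/day spread evenly over 24 hours

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/listening-events")
        | beam.Map(lambda msg: (json.loads(msg)["user_id"], 1))
        | beam.WindowInto(SlidingWindows(size=3600, period=300))   # 1h window, 5m slide
        | beam.CombinePerKey(sum)
        | beam.Filter(lambda kv: kv[1] > STREAMS_PER_HOUR_LIMIT)
        | beam.Map(lambda kv: json.dumps(
            {"user_id": kv[0], "streams_last_hour": kv[1]}).encode("utf-8"))
        | beam.io.WriteToPubSub(topic="projects/my-project/topics/fraud-alerts")
    )
```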
Answer (Part 2 of 3): Handling State
● State Backend: We need to store the count for millions of active users.
● Use an embedded state backend for fast local access (RocksDB in Flink; Dataflow manages persistent state for the pipeline automatically).
● Challenge: High cardinality keys (millions of users).
● Optimization: Filter out "Low Activity" users early. Only keep state for users with >10 streams in the last hour.
Answer (Part 3 of 3): The Action Pipeline
● Output: When a threshold is breached, emit a FraudAlertEvent to a separate topic.
● Consumer 1 (Immediate): A service that temporarily "shadow bans" the user (streams don't count for charts).
● Consumer 2 (Review): Sends data to the Trust & Safety team's review dashboard (BigQuery) for manual ban validation.
7. Financial Data Engineering - The Royalty Payout Pipeline
Difficulty Level: Very High
Role: Staff Data Engineer
Topic: Data Accuracy & "Exactly-Once" Processing
Business Function: Financial Systems
Question:
"You are responsible for the pipeline that calculates how much to pay artists. This involves joining billions of streams with contract data. The pipeline must be 100% accurate (zero data loss, no duplicates). How do you design this, and how do you handle 'Currency Conversion' drifts?"
Answer Framework
STAR Method Structure:
● Situation: Monthly royalty calculations involving high-stakes financial data where precision is non-negotiable.
● Task: Ensure Exactly-Once processing and handle multi-currency logic spanning different time zones.
● Action: Implemented a "Watermarked" Batch pipeline using Spark with rigorous Data Auditing steps (Row Count Checks) between stages.
● Result: Eliminated "Overpayment" incidents and reduced monthly closing time from 5 days to 2 days.
Key Competencies Evaluated:
ACID Properties: Transactional guarantees in data pipelines.
Data Precision: Handling floating-point errors (Money should be Decimal or BigInt, not Float).
Answer (Part 1 of 3): Deduplication Strategy
● Problem: If a pipeline retries, we might double-count streams, paying the artist twice.
● Solution: Idempotency.
● Generate a unique payment_id for every aggregatable row based on deterministic_hash(user_id + timestamp + track_id).
● Before writing to the General Ledger, check if payment_id already exists.
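A minimal illustration of the deterministic key; the hash inputs follow the formula above, while the delimiter and hash function are assumptions:

```python
import hashlib

def payment_id(user_id: str, timestamp: str, track_id: str) -> str:
    """Same stream -> same id, so a retried run cannot create a second ledger row."""
    raw = f"{user_id}|{timestamp}|{track_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# A retry of the same event produces the same key; the ledger write becomes a no-op.
assert payment_id("u1", "2024-01-15T10:00:00Z", "t42") == \
       payment_id("u1", "2024-01-15T10:00:00Z", "t42")
```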
Answer (Part 2 of 3): The Join Logic (Contracts)
● Challenge: Contracts change. An artist might be on "Label A" on Jan 1st and "Label B" on Jan 15th.
● Design: Slowly Changing Dimensions (SCD Type 2).
● The join condition must be: Stream.date BETWEEN Contract.start_date AND Contract.end_date.
● Optimization: Broadcast the Contract table (if small enough) to all worker nodes to avoid "Shuffle" operations.
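A hedged PySpark sketch of the SCD Type 2 join with a broadcast hint; table and column names are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("royalty-join").getOrCreate()

streams = spark.table("royalties.streams")       # user_id, track_id, stream_date, ...
contracts = spark.table("royalties.contracts")   # track_id, label, start_date, end_date

# Each stream is matched to the contract version in force on the stream date;
# broadcasting the small contracts table avoids a shuffle join.
paid_streams = streams.join(
    F.broadcast(contracts),
    on=[
        streams.track_id == contracts.track_id,
        streams.stream_date.between(contracts.start_date, contracts.end_date),
    ],
    how="inner",
)
```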
Answer (Part 3 of 3): Validation & Alerting
● Audit: Implementing "Balance Checks".
● Sum(Incoming Revenue) must equal Sum(Artist Payouts) + Sum(Spotify Margin).
● If the equation is off by even $0.01, the pipeline halts immediately and pages the on-call engineer (see the check sketched below). We do not write corrupt financial data.
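A minimal version of that balance check; using Decimal (never float) keeps the cent-level comparison exact, as noted in the competencies above. The function and values are illustrative:

```python
from decimal import Decimal

def check_balance(incoming_revenue: Decimal, payouts: Decimal, margin: Decimal) -> None:
    """Halt the run if the books do not balance to the cent."""
    if incoming_revenue != payouts + margin:
        # In the real pipeline this would stop the DAG and page the on-call engineer.
        raise ValueError(
            f"Balance check failed: {incoming_revenue} != {payouts} + {margin}")

# Decimal avoids the float rounding that makes 0.10 + 0.20 != 0.30.
check_balance(Decimal("100.00"), Decimal("85.00"), Decimal("15.00"))
```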
8. Experimentation Platform - A/B Testing Data Infrastructure
Difficulty Level: High
Role: Senior Data Engineer
Topic: Data Modeling & Log Attribution
Business Function: R&D / Platform
Question:
"Spotify runs thousands of A/B tests simultaneously (e.g., 'Green Play Button' vs 'Blue Play Button'). How do you model the data so that Data Scientists can easily query 'Did Test Group A listen to more music than Control Group B?', ensuring we handle users who switch devices?"
Answer Framework
STAR Method Structure:
● Situation: Need to attribute downstream metrics (retention, streams) to upstream experiment assignments.
● Task: Build a centralized "Experiment Assignment" table and join it efficiently with "User Activity".
● Action: Created a Fact_Experiment_Assignment table partitioned by day and clustered by user_id.
● Result: Reduced query cost for A/B test analysis by 60% and standardized metrics across the company.
Key Competencies Evaluated:
Data Modeling: Star Schema design for experimentation.
Attribution: Handling "Exposure" vs. "Assignment" (Did they actually see the feature?).
Answer (Part 1 of 3): The Data Model
● Table 1: Assignment Logs: {user_id, experiment_id, variant_id, timestamp, device_id}.
● Table 2: Activity Logs: {user_id, timestamp, songs_played, session_length}.
● The "Bucketing" Issue: Users are hashed into buckets. We must log the moment they were exposed to the experiment.
Answer (Part 2 of 3): The Attribution Join
● The Query: We only count activity after the assignment timestamp.
● Logic: JOIN Activity on Assignment WHERE Activity.timestamp > Assignment.timestamp.
● Challenge: "Sample Ratio Mismatch" (SRM). If 50% of users are in A and 50% in B, but we see a 40/60 split, the data pipeline is broken.
● Fix: Add an automated Data Quality check that alerts if the ratio deviates from 50/50 by more than 1% (or fails a statistical test, as sketched below).
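One common implementation of that check uses a chi-square test rather than a fixed 1% cutoff; a small sketch (the alpha threshold and counts are illustrative):

```python
from scipy.stats import chisquare

def srm_check(count_a: int, count_b: int, alpha: float = 0.001) -> bool:
    """True if the observed split is statistically consistent with 50/50."""
    total = count_a + count_b
    _, p_value = chisquare([count_a, count_b], f_exp=[total / 2, total / 2])
    return p_value >= alpha

# A 40/60 split on a large sample fails and should page the experimentation team.
assert srm_check(400_000, 600_000) is False
assert srm_check(500_400, 499_600) is True
```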
Answer (Part 3 of 3): User Device Switching
● Problem: User logs in on Phone (Group A) then Desktop (not in the experiment).
● Solution: Experiment consistency is keyed off user_id, not device_id.
● The backend config service fetches the assignment from a central Redis cache to ensure the user sees the "Blue Button" on all devices.
9. "Spotify for Artists" - Serving Analytics Externally
Difficulty Level: Medium-High
Role: Data Engineer
Topic: High-Concurrency Data Serving
Business Function: Marketplace
Question:
"Millions of artists check the 'Spotify for Artists' dashboard daily to see their stream counts. You cannot query BigQuery directly for every page load (too slow/expensive). How do you design the serving layer to provide sub-second latency for these stats?"
Answer Framework
STAR Method Structure:
● Situation: External-facing dashboard requires low latency (<200ms) but underlying data is massive (Petabytes).
● Task: Bridge the gap between "Big Data Processing" and "Fast Web Serving".
● Action: Built a pre-aggregation pipeline that pushes data into Google Cloud Bigtable (NoSQL).
● Result: Dashboard loads in <100ms, handling 50k requests/second during new album releases.
Key Competencies Evaluated:
NoSQL Design: Schema design for Key-Value stores (Bigtable/Cassandra/DynamoDB).
Caching: Read-through vs. Write-through caching strategies.
Answer (Part 1 of 3): The Database Choice
● Why not BigQuery? High latency (seconds) and concurrency limits.
● Why Bigtable? Designed for high throughput writes and low latency reads.
● Schema Design:
○ Row Key: artist_id#YYYY-MM-DD (Prefix scan allows fetching a full month of data quickly).
○ Columns: stream_count, listener_count, save_count.
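A hedged sketch of the prefix scan this row-key design enables, using the Bigtable Python client; the instance, table, column family, and the big-endian counter encoding are assumptions:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("artist-stats").table("daily_stats")

# One month of daily rows for a single artist: a contiguous key range,
# so the scan touches only that slice of the table.
rows = table.read_rows(start_key=b"artist_42#2024-05-01",
                       end_key=b"artist_42#2024-06-01")   # end key is exclusive
for row in rows:
    cell = row.cells["stats"][b"stream_count"][0]
    print(row.row_key.decode(), int.from_bytes(cell.value, "big"))
```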
Answer (Part 2 of 3): Pre-Aggregation Pipeline
● We don't calculate "Total Streams" on the fly.
● Job: A nightly Airflow job runs a GROUP BY artist_id, date on the raw data.
● Output: The job writes the "Ready to Serve" numbers directly into Bigtable.
● Backfill: If we change the definition of a "Stream", we re-run the aggregation job and overwrite the Bigtable rows.
Answer (Part 3 of 3): Caching Layer
● Hot Artists: Taylor Swift's page is loaded 10,000x more often than a local indie band's.
● Strategy: Place Memcached or Redis in front of Bigtable.
● TTL: Set a Time-To-Live of 5 minutes. This protects the database from "Thundering Herd" problems during viral releases.
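A minimal read-through cache along those lines, assuming a Redis instance and a hypothetical fetch_from_bigtable helper standing in for the prefix scan shown earlier:

```python
import json

import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # the 5-minute TTL described above

def fetch_from_bigtable(artist_id: str, date: str) -> dict:
    # Placeholder for the Bigtable lookup; returns pre-aggregated stats.
    return {"stream_count": 0, "listener_count": 0, "save_count": 0}

def get_artist_stats(artist_id: str, date: str) -> dict:
    """Serve hot artists from Redis; fall back to Bigtable on a cache miss."""
    key = f"stats:{artist_id}:{date}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    stats = fetch_from_bigtable(artist_id, date)
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(stats))
    return stats
```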
10. Metadata Management - The "Song Search" Problem
Difficulty Level: Medium
Role: Data Engineer
Topic: Search Indexing & Data Consistency
Business Function: Content / Search
Question:
"When a new song is uploaded, it needs to be searchable within minutes. The metadata (Song Title, Artist, Genre) lives in SQL, but the search engine is Elasticsearch. How do you sync these systems reliably without race conditions?"
Answer Framework
STAR Method Structure:
● Situation: Data inconsistency between the "Source of Truth" (PostgreSQL) and the "Search Index" (Elasticsearch). Users couldn't find new songs.
● Task: Move from "Dual Writes" (which fail) to an Event-Driven synchronization.
● Action: Implemented Change Data Capture (CDC) using Debezium and Kafka.
● Result: Reduced "Search Lag" from 2 hours to 15 seconds and ensured 100% consistency.
Key Competencies Evaluated:
CDC (Change Data Capture): capturing DB logs.
Distributed Systems: Eventual Consistency.
Answer (Part 1 of 3): The Anti-Pattern (Dual Writes)
● Don't do this: App saves to DB -> App saves to Elasticsearch.
● Why? If the DB save succeeds but Elasticsearch fails (network blip), the song exists but is invisible.
Answer (Part 2 of 3): The CDC Solution
● Source: The application writes only to the PostgreSQL database (ACID transaction).
● Connector: Debezium listens to the Postgres Write-Ahead Log (WAL).
● Stream: Every INSERT/UPDATE is converted into a Kafka message: { "op": "c", "after": { "id": 1, "title": "New Song" } }.
Answer (Part 3 of 3): Indexing Consumer
● A consumer service reads the Kafka stream.
● It transforms the data (denormalizing it, e.g., joining "Artist ID" to get "Artist Name").
● It performs an Upsert into Elasticsearch.
● Benefit: If the Search Index crashes, we just replay the Kafka stream to rebuild it.
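A hedged sketch of that indexing consumer (kafka-python plus elasticsearch-py 8.x style); the topic, index, hosts, and the unwrapped Debezium envelope are assumptions:

```python
import json

from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "postgres.public.songs",                       # Debezium topic for the songs table
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    change = message.value
    if change.get("op") in ("c", "u"):             # create or update
        doc = change["after"]                      # new row state from the WAL
        es.index(index="songs", id=doc["id"], document=doc)   # idempotent upsert
    elif change.get("op") == "d":                  # delete
        try:
            es.delete(index="songs", id=change["before"]["id"])
        except NotFoundError:
            pass                                   # already gone; replay-safe
```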