IBM Software Engineer Interview Questions


Introduction

Software Engineers at IBM operate at the intersection of enterprise scale and modern cloud-native innovation. IBM's engineering teams are responsible for building and maintaining some of the world's most critical software infrastructure — from AI-powered platforms like watsonx to hybrid cloud solutions built on Red Hat OpenShift. Engineers at IBM design systems that serve Fortune 500 clients across banking, healthcare, logistics, and government sectors, where performance, security, and reliability are non-negotiable.

In practice, this means IBM Software Engineers spend their time designing distributed microservices architectures, building and consuming RESTful and gRPC APIs, optimising for high-throughput data pipelines, and writing code that scales across multi-region cloud deployments. Familiarity with Java, Python, Go, and Kubernetes is common across teams, as is experience with messaging systems like Apache Kafka, observability tooling, and CI/CD pipelines. IBM engineers are expected to write code that is not just functional, but maintainable, testable, and production-ready from day one.

IBM's interview process for Software Engineers reflects this. Candidates are assessed not just on raw algorithmic ability, but on how they think about system trade-offs, handle ambiguity, debug complex distributed failures, and architect solutions for enterprise-grade scale. The questions below are representative of the scenarios IBM interviewers use to probe these capabilities — and are designed to help you prepare with the depth and specificity the role demands.


Interview Questions


Question 1: Designing a Rate-Limited API Gateway

Interview Question

You're building an internal API gateway for IBM's cloud platform that routes requests from thousands of microservices to downstream services. The gateway must enforce per-client rate limiting (e.g., 1,000 requests/minute per API key) without introducing significant latency. The system must support horizontal scaling and handle bursts gracefully. How would you design this?

Why Interviewers Ask This Question

API gateways are foundational infrastructure at IBM, where internal services communicate at massive scale. This question probes whether a candidate understands the operational realities of building shared infrastructure — rate limiting algorithms, distributed state synchronisation, and the latency vs. accuracy trade-offs involved. It also tests whether candidates can reason about horizontal scalability without defaulting to a naive "add a database" answer.


Example Strong Answer

I'd start by selecting the right rate-limiting algorithm. A token bucket is the best fit here because it allows controlled bursts while enforcing an average rate — preferable to a strict fixed window, which lets a client send up to twice its limit in a short span around window boundaries.

For the distributed state problem, I'd use Redis with a Lua script to atomically check and decrement tokens per API key. Redis gives us sub-millisecond reads and its single-threaded command execution avoids race conditions without needing distributed locks.

-- Lua script executed atomically in Redis
-- KEYS[1] = bucket key, ARGV[1] = bucket capacity, ARGV[2] = window TTL in seconds
local tokens = tonumber(redis.call('GET', KEYS[1]))
if tokens == nil then
  -- First request in this window: initialise the bucket and consume one token
  redis.call('SET', KEYS[1], tonumber(ARGV[1]) - 1, 'EX', ARGV[2])
  return 1
elseif tokens > 0 then
  redis.call('DECR', KEYS[1])
  return 1
else
  return 0
end

For horizontal scaling, each gateway instance communicates with the same Redis cluster. I'd use a sliding window log variant in Redis if we need strict accuracy, or accept ~5% inaccuracy with a local token counter plus periodic Redis sync if we need to reduce Redis round-trips for extremely high-throughput routes.
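
A rough sketch of the approximate variant, assuming a Jedis client — the class name and sync interval are illustrative, and the initialisation and expiry of the shared counter are assumed to be handled elsewhere (for example by the Lua script above):

// Approximate limiter: admit requests from a local flag, batch-report consumption to Redis
import redis.clients.jedis.JedisPooled;

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class ApproximateRateLimiter {
    private final JedisPooled redis;
    private final String redisKey;                        // e.g. "ratelimit:<api-key>"
    private final AtomicLong consumedSinceSync = new AtomicLong(0);
    private final AtomicBoolean allowed = new AtomicBoolean(true);

    public ApproximateRateLimiter(JedisPooled redis, String redisKey) {
        this.redis = redis;
        this.redisKey = redisKey;
        // Reconcile with the shared bucket every 100ms; inaccuracy is bounded by one interval
        Executors.newSingleThreadScheduledExecutor()
                .scheduleAtFixedRate(this::syncWithRedis, 100, 100, TimeUnit.MILLISECONDS);
    }

    // Fast path: no Redis round-trip per request
    public boolean tryAcquire() {
        if (!allowed.get()) {
            return false;
        }
        consumedSinceSync.incrementAndGet();
        return true;
    }

    private void syncWithRedis() {
        long consumed = consumedSinceSync.getAndSet(0);
        long remaining = redis.decrBy(redisKey, consumed);  // batch-report local consumption
        allowed.set(remaining > 0);                         // stop admitting once the shared budget is gone
    }
}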

For graceful burst handling, I'd pair this with an async queue for non-critical requests that exceed the rate limit, rather than immediately returning 429s — clients get queued rather than dropped.

Observability is critical: I'd instrument per-API-key metrics (current token count, rejection rate, latency) into Prometheus/Grafana, and set alerts on sustained high rejection rates which might indicate a misbehaving client or a capacity planning issue.
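
As a sketch of that instrumentation with Micrometer — the metric names are my own, not an IBM convention, and per-API-key tags need a cardinality cap in practice:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.concurrent.TimeUnit;

public class GatewayMetrics {
    private final MeterRegistry registry;

    public GatewayMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordDecision(String apiKey, boolean allowed, long latencyNanos) {
        // Per-API-key allow/reject counts, scraped by Prometheus and alerted on in Grafana
        Counter.builder("gateway.ratelimit.requests")
                .tag("api_key", apiKey)
                .tag("outcome", allowed ? "allowed" : "rejected")
                .register(registry)
                .increment();

        // Latency of the rate-limit check itself, to catch Redis round-trip regressions
        Timer.builder("gateway.ratelimit.check.latency")
                .register(registry)
                .record(latencyNanos, TimeUnit.NANOSECONDS);
    }
}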


Key Concepts Tested

  • Rate limiting algorithms (token bucket, sliding window, fixed window)
  • Distributed state management with Redis
  • Atomic operations and avoiding race conditions
  • Horizontal scalability in stateless services
  • Observability and operational readiness

Follow-Up Questions

  1. How would you handle the case where the Redis cluster goes down? Would you fail open or fail closed, and what are the trade-offs?
  2. A client legitimately needs to burst to 10x their rate limit for a 30-second window during a batch job. How would you support this without compromising other tenants?


Question 2: Optimising a Slow Database Query in a High-Traffic Service

Interview Question

A critical IBM internal service that processes financial transactions for enterprise clients starts experiencing p99 latency spikes from 80ms to 4 seconds under peak load. Initial investigation points to a PostgreSQL query that's performing a full table scan on a transactions table with 500 million rows. Walk me through how you'd diagnose and resolve this in production, without taking the system offline.

Why Interviewers Ask This Question

Performance degradation in production is a defining challenge for enterprise software engineers. IBM interviewers use this scenario to test whether candidates can move methodically through a real incident — using the right diagnostic tools, understanding query execution plans, and implementing changes safely in a live system. It also probes knowledge of database internals, indexing strategies, and the practical constraints of zero-downtime deployments.


Example Strong Answer

First, I'd confirm the hypothesis by running EXPLAIN ANALYZE on the offending query to inspect the actual execution plan:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM transactions
WHERE client_id = $1 AND status = 'PENDING'
ORDER BY created_at DESC
LIMIT 100;

If I see Seq Scan with a high row estimate and large Buffers: shared hit counts, that confirms the full table scan. I'd check pg_stat_user_indexes to see whether relevant indexes exist and whether they're actually being used.

The fix is a composite index on (client_id, status, created_at) — this supports the equality filters on client_id and status, then uses the B-tree ordering for the ORDER BY without an additional sort step.

Critically, I'd create this concurrently to avoid locking the table in production:

CREATE INDEX CONCURRENTLY idx_transactions_client_status_date
ON transactions (client_id, status, created_at DESC);

While the index builds, I'd also check whether table statistics are stale via pg_stat_user_tables (a high n_dead_tup or an old last_autoanalyze) — if autovacuum has fallen behind on a write-heavy table, the planner's row estimates will be off and it may choose a bad plan even with a valid index. A targeted ANALYZE transactions; can help immediately.

For the longer term, I'd review whether this table should be partitioned by created_at — range partitioning on a time-series transactional table of this size is a significant win for both query performance and maintenance operations like archiving old data.

Finally, I'd set up a slow query log threshold (log_min_duration_statement = 500) and surface query performance via pg_stat_statements to catch regressions before they become incidents.


Key Concepts Tested

  • Query execution plans and EXPLAIN ANALYZE
  • Composite index design and ordering
  • Zero-downtime index creation in PostgreSQL
  • Autovacuum, table statistics, and query planner behaviour
  • Table partitioning for large-scale data

Follow-Up Questions

  1. After adding the index, the query is fast for most clients but still slow for one specific enterprise client with 50 million transactions of their own. What would you investigate next?
  2. The team wants to migrate this table to a distributed database like CockroachDB to support global deployments. What new challenges does this introduce for your indexing strategy?


Question 3: Handling Message Ordering in a Distributed Event-Driven System

Interview Question

IBM's supply chain platform uses Apache Kafka to stream inventory update events from warehouses around the world. A bug is reported: occasionally, a "stock depleted" event is processed before the "stock replenished" event that preceded it, causing incorrect inventory counts downstream. The events are produced by different producer instances in the same warehouse. How would you diagnose and fix this without rebuilding the entire pipeline?

Why Interviewers Ask This Question

Distributed messaging and event ordering are core engineering challenges in IBM's enterprise platforms. This question tests whether a candidate genuinely understands Kafka's partitioning and ordering guarantees — not just at a surface level, but at the level needed to make correct architectural decisions. It also probes how engineers balance correctness with throughput in real production systems.


Example Strong Answer

This is a partition ordering problem. Kafka guarantees message ordering only within a single partition. If two producer instances are sending events for the same warehouse without a consistent partition key, those events may land in different partitions and be consumed out of order.

Diagnosis: I'd first inspect the producer configuration — specifically whether events are being produced with a key that maps to the warehouse ID. If the key is null or inconsistent, the default partitioner spreads events across partitions (round-robin in older clients, sticky batching in newer ones), so ordering across those partitions is not guaranteed.

Fix — Partition Key Strategy:
Ensure all events for a given warehouse use the same partition key (warehouse ID or a more granular warehouse_id + product_id). This guarantees all events for a given entity land in the same partition and are consumed in order.

ProducerRecord<String, InventoryEvent> record = new ProducerRecord<>(
    "inventory-updates",
    warehouseId + "_" + productId,  // Consistent partition key
    event
);

However, this alone doesn't solve the case where two different producers race to produce events for the same key within microseconds. For this, I'd introduce event sequence numbers at the producer level — a monotonically increasing counter per warehouse/product combination, stored in a lightweight coordination store (Redis or a database sequence).

On the consumer side, I'd implement sequence validation: if an event arrives out of sequence (e.g., sequence 42 arrives before sequence 41), the consumer holds it in a small in-memory buffer and waits for the missing event within a configurable timeout window; if the gap isn't filled in time, it raises an alert rather than silently applying events out of order.
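
A simplified sketch of that consumer-side buffer — class, field, and event names are illustrative, and the timeout/alerting path is omitted for brevity:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class SequenceValidatingHandler {

    record InventoryEvent(String type, int quantity) {}

    // Last sequence applied per warehouse/product entity
    private final Map<String, Long> lastApplied = new HashMap<>();
    // Out-of-order events parked until the gap is filled
    private final Map<String, TreeMap<Long, InventoryEvent>> parked = new HashMap<>();

    public void onEvent(String entityKey, long sequence, InventoryEvent event) {
        long expected = lastApplied.getOrDefault(entityKey, 0L) + 1;
        if (sequence == expected) {
            apply(event);
            lastApplied.put(entityKey, sequence);
            drainParked(entityKey);                         // replay anything that is now in order
        } else if (sequence > expected) {
            parked.computeIfAbsent(entityKey, k -> new TreeMap<>()).put(sequence, event);
            // A separate timer alerts if the gap isn't filled within the configured window
        }                                                   // sequence < expected: duplicate, drop it
    }

    private void drainParked(String entityKey) {
        TreeMap<Long, InventoryEvent> buffer = parked.getOrDefault(entityKey, new TreeMap<>());
        while (!buffer.isEmpty() && buffer.firstKey() == lastApplied.get(entityKey) + 1) {
            Map.Entry<Long, InventoryEvent> next = buffer.pollFirstEntry();
            apply(next.getValue());
            lastApplied.put(entityKey, next.getKey());
        }
    }

    private void apply(InventoryEvent event) {
        // Update downstream inventory counts
    }
}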

For the longer term, I'd advocate for idempotent producers (enable.idempotence=true) and, where the downstream supports transactional Kafka consumers, exactly-once semantics. This removes duplicates and retry-induced reordering from each individual producer, though it doesn't by itself resolve races between separate producers — which is why the sequence numbers above still matter.
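
The producer-side configuration for that is small — a sketch, with the transactional pieces left as a note:

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");            // broker-side dedupe on retries
props.put(ProducerConfig.ACKS_CONFIG, "all");                           // required for idempotence
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");   // ordering preserved up to 5 with idempotence
// For end-to-end exactly-once, also set transactional.id here and use
// read_committed, transaction-aware consumers downstream.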


Key Concepts Tested

  • Kafka partition ordering guarantees and their limits
  • Partition key strategy for message routing
  • Producer idempotency and exactly-once semantics
  • Consumer-side sequence validation and buffering
  • Trade-offs between ordering guarantees and throughput

Follow-Up Questions

  1. Introducing a single partition key per warehouse dramatically reduces parallelism — a high-volume warehouse now serialises all events through one partition. How would you balance ordering guarantees with throughput?
  2. How would you test this fix before deploying to production, given that the bug is a race condition that's hard to reproduce deterministically?


Question 4: Designing a Resilient Microservice with Circuit Breaking

Interview Question

You're building a customer-facing IBM SaaS service that aggregates data from five downstream microservices (billing, account management, usage metrics, notifications, and audit logs). In production, one of the downstream services — usage metrics — intermittently times out for 30–60 seconds at a time. These timeouts are cascading into the aggregation service, which hangs for up to 60 seconds and eventually takes down all five service calls for the user. How do you redesign this service to be resilient to partial downstream failures?

Why Interviewers Ask This Question

Cascading failures are one of the most common causes of large-scale outages in microservices architectures. IBM's production systems must remain available even when individual components fail. This question tests whether candidates have moved beyond basic microservices theory to understand patterns like circuit breaking, bulkheads, and graceful degradation — and whether they can reason about when to apply each.


Example Strong Answer

The core problem is synchronous coupling — a single slow downstream is holding threads in the aggregation layer, which eventually exhausts the thread pool and starves all other requests. The fix is a layered resilience strategy.

1. Timeouts — first line of defence:
Every downstream call must have an explicit, short timeout. For non-critical services like usage metrics, 500ms is a reasonable ceiling. I'd configure this at the HTTP client level, not just rely on the downstream's SLA.

2. Circuit Breaker (Resilience4j or similar):
I'd wrap each downstream call in a circuit breaker. When usage metrics starts timing out repeatedly (e.g., 50% failure rate over 10 calls), the circuit opens and subsequent calls fail immediately rather than waiting, protecting the thread pool.

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)
    .build();

3. Bulkheads — isolate failure domains:
I'd use a thread pool bulkhead per downstream service. Usage metrics calls go to a dedicated pool of 5 threads. Even if all 5 are hung, the billing and account management pools are completely unaffected.
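
A sketch of that bulkhead with Resilience4j (pool sizes are illustrative):

import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;

ThreadPoolBulkheadConfig usageMetricsPool = ThreadPoolBulkheadConfig.custom()
    .coreThreadPoolSize(5)
    .maxThreadPoolSize(5)       // a hung usage-metrics dependency can never consume more than 5 threads
    .queueCapacity(20)          // small queue; anything beyond this is rejected immediately
    .build();

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("usage-metrics", usageMetricsPool);

// Calls are submitted through the bulkhead, e.g.:
// CompletionStage<UsageMetrics> result = bulkhead.executeSupplier(usageMetricsClient::fetch);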

4. Graceful degradation — serve partial data:
For a customer dashboard, billing and account management are critical; usage metrics is not. I'd redesign the aggregation response to return partial data when non-critical services are unavailable, with a flag indicating which sections are stale or unavailable. The user sees their account information immediately; a placeholder appears where usage data would be.

5. Async calls for non-critical services:
I'd move usage metrics, notifications, and audit logs to async parallel calls (CompletableFuture with a timeout), so their latency doesn't block the critical path at all.
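
A minimal sketch of that async fan-out, assuming hypothetical client stubs, executors, and fallback objects:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

CompletableFuture<UsageMetrics> usage = CompletableFuture
    .supplyAsync(usageMetricsClient::fetch, usageMetricsExecutor)
    .orTimeout(500, TimeUnit.MILLISECONDS)
    .exceptionally(ex -> UsageMetrics.unavailable());    // degrade to a placeholder, never block the page

CompletableFuture<Notifications> notifications = CompletableFuture
    .supplyAsync(notificationsClient::fetch, notificationsExecutor)
    .orTimeout(500, TimeUnit.MILLISECONDS)
    .exceptionally(ex -> Notifications.empty());

// Billing and account data are fetched on the critical path; the futures above are
// joined only when building the final response, with whatever has completed by then.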

This architecture means a 60-second usage metrics outage shows as a degraded UI element for ~10 seconds before the circuit opens — after which it's invisible to users.


Key Concepts Tested

  • Circuit breaker, bulkhead, and timeout patterns
  • Thread pool exhaustion and cascading failure mechanics
  • Graceful degradation vs. hard failure
  • Async parallel service calls
  • Resilience4j / Hystrix configuration trade-offs

Follow-Up Questions

  1. How would you test that your circuit breaker is working correctly in a staging environment before deploying to production?
  2. The product team now wants real-time usage metrics — they cannot accept stale data. How does this change your resilience strategy for that specific service?


Question 5: Refactoring a Monolith Incrementally Without Downtime

Interview Question

IBM has acquired a mid-sized company whose core product is a 12-year-old Java monolith — 1.2 million lines of code, a single Oracle database, and a deployment cycle of once every 6 weeks. IBM needs this system to be cloud-native within 18 months, with independently deployable services and zero planned downtime during migration. You're the lead engineer. How do you approach this?

Why Interviewers Ask This Question

Legacy modernisation is one of IBM's core business propositions, and engineers at IBM regularly face exactly this scenario with client systems. This question tests architectural thinking at a strategic level — specifically knowledge of the Strangler Fig pattern, database decomposition, and how to manage incremental migration risk in production. It also probes whether candidates can communicate a multi-phase strategy clearly, rather than jumping to a "rewrite everything" answer.


Example Strong Answer

A full rewrite is too risky — 18 months is not enough time to rewrite 1.2 million lines of a complex domain, and you'd be running two systems in parallel with no guarantee the new one is correct. The right approach is the Strangler Fig pattern: incrementally extract services while keeping the monolith running, routing traffic to new services as they're proven stable.

Phase 1 — Understand and instrument (Months 1–2):
Before writing a line of new code, I'd instrument the monolith thoroughly — distributed tracing (OpenTelemetry), structured logging, and database query profiling. The goal is to understand the actual call graph, domain boundaries, and which modules are genuinely independent. Domain-driven design workshops with the existing team are essential here — the codebase may not reflect the actual domain model clearly.

Phase 2 — Introduce an API gateway (Months 2–3):
Deploy an API gateway (IBM API Connect or Kong) in front of the monolith. Initially it passes all traffic through. This is the routing layer that will gradually redirect traffic to extracted services. Zero downtime impact, and it can be deployed with a feature flag.

Phase 3 — Extract the least-coupled, highest-value services first:
Using the call graph analysis, I'd identify modules with clear boundaries and minimal database coupling — typically auth, notification, or reporting services. Each extraction follows this pattern:

  1. Build the new microservice alongside the monolith
  2. Use the Strangler Fig proxy to shadow traffic to both, comparing responses
  3. When the new service matches 100%, cut over via the gateway config change
  4. Deprecate the monolith module (don't delete yet)

Phase 4 — Database decomposition:
This is the hardest part. I'd use the Database-per-Service pattern incrementally:

  • For each extracted service, create a new schema/database
  • Use dual writes during transition: the monolith writes to both old and new schema (see the sketch after this list)
  • Run a data sync validator to ensure consistency
  • Once the new service is stable, retire the monolith's write path
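
A rough sketch of the dual-write step — the repository names and failure counter are illustrative, not part of the actual codebase:

public Order saveOrder(Order order) {
    // The monolith's Oracle schema remains the source of truth until cut-over
    Order saved = legacyOrderRepository.save(order);
    try {
        // Best-effort mirror write into the extracted service's new schema
        newOrderRepository.save(saved);
    } catch (Exception e) {
        // Never fail the business operation because the new store lags;
        // the async data-sync validator reconciles the discrepancy later
        dualWriteFailures.increment();
    }
    return saved;
}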

Phase 5 — Containerise and move to Kubernetes:
Each extracted service gets a Dockerfile and Helm chart, deployed to OpenShift. CI/CD pipelines (Tekton) replace the 6-week manual deployment cycle. The monolith itself can be containerised as an interim step — it doesn't need to be decomposed to benefit from container orchestration.

Throughout, I'd maintain a migration risk register and ensure the team has a rollback plan for every phase. No phase should make the system harder to roll back than the previous one.


Key Concepts Tested

  • Strangler Fig and incremental migration patterns
  • Domain-driven design for identifying service boundaries
  • API gateway as a traffic routing layer
  • Database decomposition and dual-write strategies
  • CI/CD pipeline modernisation and containerisation

Follow-Up Questions

  1. Three months in, the team discovers that two of the "independent" modules actually share 15 database tables with complex join queries. How does this change your extraction plan?
  2. How would you handle the Oracle licence cost during the migration period, given IBM is paying for both Oracle and the new cloud databases in parallel?

Question 6: Designing a Distributed Caching Strategy

Interview Question

IBM's enterprise search platform serves 50,000 concurrent users querying a product catalogue of 10 million items. The underlying Elasticsearch cluster is being overwhelmed — average query latency has climbed from 20ms to 800ms, and the ops team has flagged that adding more Elasticsearch nodes is not cost-justifiable. You're asked to design a caching layer that reduces cluster load by at least 70% without serving stale data older than 60 seconds. How do you approach this?

Why Interviewers Ask This Question

Caching is one of the most misunderstood areas in distributed systems — candidates either over-simplify it ("just add Redis") or ignore the hard problems: cache invalidation, cold start, thundering herd, and consistency guarantees. IBM interviewers use this question to probe whether candidates can design a multi-layered caching strategy with explicit trade-offs, not just name a technology.


Example Strong Answer

I'd design a three-tier caching strategy: local in-process cache, distributed cache, and query result cache — each with different TTLs and eviction policies suited to the access patterns.

Tier 1 — In-process cache (Caffeine):
For the most frequently accessed items (top 0.1% of catalogue — think bestsellers or featured products), an in-process Caffeine cache with a 10-second TTL and a maximum size of 10,000 entries. This eliminates network round-trips entirely for hot items and handles the majority of read traffic at near-zero cost.

Cache<String, SearchResult> localCache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(10, TimeUnit.SECONDS)
    .recordStats()
    .build();

Tier 2 — Distributed cache (Redis Cluster):
For the broader catalogue, a Redis cluster with a 60-second TTL. Cache keys are constructed from a canonical query hash (normalised query params, sorted filters) to maximise hit rate across different users running semantically equivalent searches.
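
A small sketch of that key canonicalisation (the class name and key prefix are my own):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;

public final class QueryCacheKeys {

    // Normalise the query text and sort the filters so that semantically equivalent
    // searches from different users produce the same cache key
    public static String canonicalKey(String queryText, Map<String, String> filters) {
        StringBuilder canonical = new StringBuilder(queryText.trim().toLowerCase());
        new TreeMap<>(filters).forEach((k, v) ->
                canonical.append('|').append(k).append('=').append(v.trim().toLowerCase()));
        return "search:" + sha256(canonical.toString());
    }

    private static String sha256(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}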

Thundering herd protection:
When a popular cache entry expires, thousands of requests can simultaneously miss and hammer Elasticsearch. I'd use a probabilistic early expiration approach (also called "fetch-ahead") — with some probability proportional to how close the TTL is to expiry, proactively refresh the cache entry in a background thread before it expires.

Alternatively, for guaranteed protection: a distributed lock (Redis SET NX) ensures only one request fetches from Elasticsearch on a miss; others wait and then read the freshly populated entry.
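
A hedged sketch of that lock-guarded cache fill with a Jedis client (the serialisation, query, and sleep helpers are hypothetical):

import redis.clients.jedis.JedisPooled;
import redis.clients.jedis.params.SetParams;

public SearchResult getWithSingleFlight(JedisPooled redis, String cacheKey) {
    String cached = redis.get(cacheKey);
    if (cached != null) {
        return deserialize(cached);
    }
    // Only one instance wins the lock and queries Elasticsearch; the lock auto-expires after 5s
    String lockKey = cacheKey + ":lock";
    if ("OK".equals(redis.set(lockKey, "1", SetParams.setParams().nx().px(5000)))) {
        SearchResult fresh = queryElasticsearch(cacheKey);
        redis.setex(cacheKey, 60, serialize(fresh));        // 60-second TTL per the staleness budget
        redis.del(lockKey);
        return fresh;
    }
    // Losers back off briefly and read the entry the winner populated (bounded retries in practice)
    sleepQuietly(50);
    String populated = redis.get(cacheKey);
    return populated != null ? deserialize(populated) : queryElasticsearch(cacheKey);
}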

Tier 3 — Query result materialisation:
For common structured queries (category browsing, filtered searches), I'd pre-materialise results in Redis via a background job that refreshes every 30 seconds using the last-24-hours query log. These cached results have a 60-second TTL aligned to the staleness requirement.

Cache invalidation:
For explicit inventory or price changes, I'd publish invalidation events to a Kafka topic. Cache consumers listen to this topic and evict affected keys immediately — this is the only way to guarantee the 60-second staleness SLA is a ceiling, not just an average.

Observability: Cache hit rate per tier, latency distribution with and without cache, and eviction rate tracked in Prometheus. A drop in hit rate below 60% triggers an alert.


Key Concepts Tested

  • Multi-tier caching architecture
  • Cache key design and query normalisation
  • Thundering herd prevention
  • Event-driven cache invalidation
  • Observability for caching layers

Follow-Up Questions

  1. Your cache hit rate is 85% on average, but specific users — those who always search with unique long-tail queries — experience 0% hit rates and consistently slow responses. How would you handle this segment without polluting the cache?
  2. The business introduces personalised search rankings. How does personalisation break your caching strategy, and how would you adapt?


Question 7: Implementing a Safe Continuous Deployment Pipeline

Interview Question

Your team at IBM ships updates to a payment processing microservice that handles $2 billion in transactions per year. The current deployment process involves a manual 2-hour maintenance window every Friday evening — acceptable in 2015, not acceptable now. Leadership wants continuous deployment with zero planned downtime. A bad deployment last year caused a 45-minute outage and $3 million in failed transactions. How would you design the deployment pipeline to make this safe?

Why Interviewers Ask This Question

IBM's SaaS products must meet enterprise SLAs that make downtime commercially and contractually unacceptable. This question tests whether candidates understand progressive delivery techniques beyond basic blue-green deployments — specifically canary releases, automated rollback triggers, and the role of feature flags in decoupling deployment from release. It also probes engineering maturity around testing strategies for high-stakes systems.


Example Strong Answer

The goal is to decouple deployment (code goes to production) from release (users see the change). This is the foundation of safe continuous deployment.

Pipeline Architecture:

Stage 1 — Pre-deployment gates:
Every commit triggers a pipeline with: unit tests, integration tests against a contract test suite (Pact for payment API contracts), static analysis (SonarQube), and a security scan (SAST/DAST). No merge to main without 100% pass. For a payment service, I'd also include a mandatory chaos engineering test in a staging environment — deliberately killing downstream services to verify circuit breakers behave correctly.

Stage 2 — Canary deployment:
Instead of deploying to all instances simultaneously, I'd use a canary strategy on Kubernetes (via Argo Rollouts or Flagger):

spec:
  strategy:
    canary:
      steps:
      - setWeight: 5    # 5% of traffic
      - pause: {duration: 10m}
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}

At each step, automated analysis checks: error rate, p99 latency, transaction success rate, and business metrics (payment decline rate). If any metric exceeds a threshold — automated rollback, no human required.

Stage 3 — Feature flags for logic changes:
Database schema changes and business logic changes ship behind a feature flag (LaunchDarkly or a homegrown toggle). The code is deployed but inactive. The flag is enabled for internal test accounts first, then 1% of users, then progressive rollout. This means a bad logic change can be disabled in seconds without a redeployment.
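
A minimal sketch of the homegrown toggle with a deterministic percentage rollout — FlagConfig and flagStore are illustrative, not an existing IBM library:

import java.util.Set;

record FlagConfig(boolean active, Set<String> allowList, int rolloutPercent) {}

boolean isEnabled(String flagName, String userId) {
    FlagConfig flag = flagStore.get(flagName);            // e.g. loaded from a config service or database
    if (flag == null || !flag.active()) return false;
    if (flag.allowList().contains(userId)) return true;   // internal test accounts first
    // Deterministic bucketing: the same user gets the same answer on every request
    int bucket = Math.floorMod((flagName + ":" + userId).hashCode(), 100);
    return bucket < flag.rolloutPercent();
}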

Stage 4 — Database migrations:
The most dangerous part. I'd enforce the expand-contract pattern: new columns/tables are added in one release (expand), the application is updated to use them in the next, and old columns are dropped in a third. No migration ever modifies or drops a column that the current running version of the service depends on.

Rollback SLA: Automated rollback must complete within 90 seconds. This is tested monthly as part of a Game Day exercise.


Key Concepts Tested

  • Canary deployments and progressive traffic shifting
  • Automated quality gates and rollback triggers
  • Feature flags for decoupling deployment from release
  • Safe database migration with expand-contract pattern
  • CI/CD pipeline design for high-stakes services

Follow-Up Questions

  1. During a canary rollout at 25% traffic, your error rate metric looks fine but payment decline rates have increased 3% compared to baseline — a business metric your pipeline doesn't currently track. How do you catch this class of problem in future deployments?
  2. How do you handle the case where a data migration has already run on 30% of records when a bug is detected mid-canary? The rollback is straightforward for code, but not for data.


Question 8: Debugging a Memory Leak in a Long-Running JVM Service

Interview Question

An IBM enterprise reporting service — a Spring Boot application running on a JVM — has been in production for two years without issues. Over the past three weeks, it's been requiring a manual restart every 4–5 days as heap usage climbs steadily from 2GB to its 8GB maximum, at which point GC overhead causes response times to spike from 200ms to 12 seconds before the pod OOMKills. No recent code changes have been deployed. How do you find and fix the memory leak?

Why Interviewers Ask This Question

Memory management in long-running JVM services is a practical, senior-level concern across IBM's Java-heavy estate. This question tests methodical debugging discipline — specifically the ability to use heap profiling tools, interpret GC logs, and reason about common Java memory leak patterns like static collections, listener registrations, and connection pool growth. The "no recent code changes" constraint is intentional: it probes whether candidates consider external factors like data volume growth or dependency updates.


Example Strong Answer

The fact that heap grows linearly over days without a code change points to a slow accumulation leak — something holding references to objects that should be garbage collected. My approach is methodical: instrument first, then isolate, then fix.

Step 1 — Enable GC logging and heap metrics:
If not already present, I'd add JVM flags to capture GC behaviour:

-Xlog:gc*:file=/logs/gc.log:time,uptime:filecount=5,filesize=20m
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/

And expose JVM heap metrics via Micrometer/Prometheus — jvm.memory.used per generation (heap, old gen, metaspace) gives a clear growth chart.

Step 2 — Capture heap dumps at two points:
I'd trigger a heap dump at the start of the cycle (2GB baseline) and another at 6GB, then compare with Eclipse MAT (Memory Analyser Tool). The Leak Suspects Report in MAT identifies which object types are accumulating and, crucially, what's holding the GC root reference preventing collection.

Step 3 — Common suspects for this pattern:

Static collections: A static Map or List that's being populated but never cleared — common in poorly implemented caches, event registries, or metrics collectors.

Listener/callback registration without deregistration: A common Spring pattern where ApplicationEventListener beans are registered on every request or scheduled job but never removed.

Connection or thread pool misconfiguration: If the reporting service opens database connections per report job without pooling, and jobs run in parallel, connections accumulate. I'd check the HikariCP metrics hikaricp.connections.active and hikaricp.connections.pending.

Growing input data volume: The service may be correct, but if report sizes have grown — a customer with 10x more data than 2 years ago — in-memory aggregation that was fine before is now exhausting the heap for that single job.

Step 4 — MAT heap comparison:
The delta between the two dumps will show the growing class. If it's HashMap$Entry or ArrayList, I'd trace the reference chain to find the owning object. If it's byte[], I'd look at report serialisation — perhaps a response is being buffered in memory rather than streamed.

Fix — once identified:
Typically: clear static collections on job completion, switch to WeakReference for caches, add @PreDestroy listener deregistration, or switch to streaming report output rather than in-memory aggregation.
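
As a concrete example of the first case, an unbounded static map replaced by a bounded, time-limited cache (the Report type and sizes are illustrative):

// Before: entries accumulate for the lifetime of the JVM and survive every GC cycle
// private static final Map<String, Report> REPORT_CACHE = new ConcurrentHashMap<>();

// After: bounded and time-limited, so the cache can no longer grow without limit
private static final com.github.benmanes.caffeine.cache.Cache<String, Report> REPORT_CACHE =
        com.github.benmanes.caffeine.cache.Caffeine.newBuilder()
                .maximumSize(5_000)
                .expireAfterWrite(java.time.Duration.ofMinutes(30))
                .build();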


Key Concepts Tested

  • JVM heap structure (young gen, old gen, metaspace) and GC behaviour
  • Heap dump capture and analysis with Eclipse MAT
  • Common Java memory leak patterns
  • Prometheus/Micrometer JVM observability
  • Streaming vs. in-memory processing trade-offs

Follow-Up Questions

  1. The heap dump shows the leak is inside a third-party reporting library that IBM doesn't own — the library maintains an internal cache that never evicts. You can't modify the library. What are your options?
  2. After fixing the leak, how would you write an automated test that would have caught this regression before it reached production?


Question 9: Designing an Idempotent Payment Processing API

Interview Question

IBM is building a payment processing API for a banking client. The client's mobile app retries failed requests automatically — if a payment request times out, the app retries up to 3 times. The bank has reported that customers are occasionally seeing duplicate charges: the payment went through on the first attempt, but the response timed out before reaching the client, which then retried. Design an idempotency system that eliminates duplicate charges without preventing legitimate retries.

Why Interviewers Ask This Question

Idempotency is a fundamental correctness requirement for financial APIs, and IBM engineers working on banking and fintech platforms deal with this directly. This question tests whether candidates understand the full end-to-end contract of an idempotency key system — not just the concept, but the storage, TTL, race condition handling, and client contract. It also probes understanding of the difference between network-level retries and application-level retries.


Example Strong Answer

The solution is a server-side idempotency key system. The client generates a unique key per payment attempt (e.g., UUID v4) and sends it in a request header. The server uses this key to deduplicate.

Client contract:

POST /v1/payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json

{ "amount": 150.00, "currency": "GBP", "to_account": "GB29NWBK..." }

Server implementation:

Step 1 — Check idempotency store before processing:

IdempotencyRecord record = idempotencyStore.get(idempotencyKey);

if (record != null && record.isComplete()) {
    return record.getCachedResponse();  // Return exact same response
}
if (record != null && record.isInProgress()) {
    return ResponseEntity.status(409).body("Request in progress, retry later");
}

Step 2 — Atomic lock and mark in-progress:
Before executing the payment, I'd use a Redis SET NX PX (set if not exists, with TTL) to atomically claim the key. This prevents two concurrent retries from both executing the payment simultaneously — a race condition that a simple if not exists check doesn't protect against.

boolean claimed = redis.setIfAbsent(
    "idempotency:" + key,
    "IN_PROGRESS",
    Duration.ofSeconds(30)
);
if (!claimed) {
    return ResponseEntity.status(409).body("Concurrent request detected");
}

Step 3 — Execute and persist response:
After payment execution (success or failure), store the full response in the idempotency store with a 24-hour TTL. The stored record includes: the idempotency key, the canonical request hash, the response body, status code, and timestamp.

Step 4 — Request body validation:
If the same idempotency key arrives with a different request body (amount, account number), reject with HTTP 422 — this indicates a client bug where a key is being reused incorrectly.

if (!record.getRequestHash().equals(hashOf(currentRequest))) {
    return ResponseEntity.unprocessableEntity()
        .body("Idempotency key reused with different request body");
}

Storage: For a payment system, I'd persist idempotency records in the primary database (not just Redis) to survive cache evictions — consistency matters more than speed here. Redis serves as a fast lookup layer; the database is the source of truth.


Key Concepts Tested

  • Idempotency key design and server-side deduplication
  • Atomic operations for race condition prevention
  • TTL strategy for idempotency record lifecycle
  • Request body hashing and validation
  • Consistency vs. availability trade-offs in financial systems

Follow-Up Questions

  1. The payment service is horizontally scaled across 10 instances. How does your Redis-based locking strategy behave during a Redis failover, and what is the risk window?
  2. A client's SDK has a bug: it generates the same idempotency key for all payments from a given user session. How would you detect and alert on this pattern without breaking legitimate traffic?


Question 10: Tracing a Latency Regression Across a Microservices Call Chain

Interview Question

A post-deployment monitoring alert fires: the p99 response time for IBM's customer onboarding API has increased from 350ms to 2.1 seconds. The service makes synchronous calls to 7 downstream microservices. The deployment touched only the onboarding service itself — none of the downstream services were changed. Your monitoring dashboard shows average latency looks normal. How do you find the source of the regression?

Why Interviewers Ask This Question

Distributed tracing and performance debugging across service meshes are daily realities for IBM platform engineers. This question specifically tests whether candidates understand why average latency is the wrong metric for detecting tail latency problems, and whether they know how to use distributed tracing, percentile metrics, and dependency analysis to isolate regressions in complex call chains. The "no downstream changes" misdirection tests whether candidates think beyond the obvious.


Example Strong Answer

The first signal is important: p99 is elevated but averages look normal. This is the signature of a tail latency problem — a small percentage of requests are experiencing severe slowdowns. Averages mask this entirely, which is why p95/p99/p999 metrics are non-negotiable for production services.
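
For context, publishing those percentiles is a small change with Micrometer (the metric name here is illustrative):

import io.micrometer.core.instrument.Timer;

Timer onboardingLatency = Timer.builder("onboarding.api.latency")
        .publishPercentiles(0.95, 0.99, 0.999)    // expose p95/p99/p999 alongside the mean
        .publishPercentileHistogram()             // histogram buckets for server-side aggregation in Prometheus
        .register(meterRegistry);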

Step 1 — Distributed trace sampling:
I'd pull recent distributed traces from Jaeger or IBM Instana, filtering specifically for requests in the 2-second range. The trace waterfall will show exactly which service in the call chain is responsible for the latency — whether it's the onboarding service itself, a specific downstream, or a serialisation bottleneck between services.

Step 2 — Isolate by percentile per service:
In the observability platform, I'd break down p99 latency by service rather than just the entry point. If downstream service D (say, identity verification) shows a p99 spike to 1.8 seconds while others remain flat, that's the culprit — regardless of whether it was deployed.

Step 3 — The "no downstream changes" misdirection:
This is a deliberate constraint. Several causes can degrade downstream performance without a code deployment:

  • Database query degradation: A table that crossed a size threshold, causing a query plan change (index scan → full table scan)
  • Connection pool exhaustion: The new onboarding deployment makes slightly more concurrent calls, pushing a downstream's connection pool to its limit
  • Increased call volume: The new onboarding code makes an additional downstream call that didn't exist before — the downstream is now receiving more load
  • Third-party dependency: One of the 7 services depends on an external API that has degraded

Step 4 — Check the deployment diff carefully:
Even though the change was "only to onboarding," I'd audit the diff for: added downstream calls, changed timeout configurations, increased parallelism (which could flood downstreams), or removed a cache that previously absorbed load.

Step 5 — Reproduce with load testing:
With a hypothesis in hand (e.g., connection pool exhaustion on the identity service), I'd reproduce with a controlled load test targeting that specific downstream call pattern, observing hikaricp.connections.pending or the equivalent pool metric.

Immediate mitigation: If the root cause is a new synchronous call in the critical path that could be async, I'd move it behind a feature flag and disable it while the investigation continues.


Key Concepts Tested

  • Percentile-based latency metrics vs. averages
  • Distributed tracing for call chain analysis
  • Root cause analysis in distributed systems
  • Dependency analysis for indirect performance regressions
  • Connection pool and resource exhaustion patterns

Follow-Up Questions

  1. Your distributed tracing shows the latency regression is inside the onboarding service itself — specifically in JSON serialisation of a response object. The new deployment added 3 new fields to the response. Serialisation time went from 2ms to 900ms. How is this possible, and how do you fix it?
  2. You've identified the root cause and fixed it. How do you write a performance regression test that would catch this class of issue before it reaches production in future?