Salesforce Cloud Infrastructure Engineer Interview Questions
Introduction
Cloud Infrastructure Engineers at Salesforce are responsible for one of the most demanding operational environments in enterprise software — a globally distributed SaaS platform serving over 150,000 customer organisations, processing billions of API requests daily, and maintaining availability commitments that enterprise customers embed into their own revenue operations. The infrastructure underpinning Sales Cloud, Service Cloud, Marketing Cloud, MuleSoft, and the broader Salesforce platform must be simultaneously elastic enough to absorb unpredictable traffic spikes, reliable enough to meet strict SLA obligations, and secure enough to satisfy the compliance requirements of financial services, healthcare, and government customers worldwide.
The role sits at the intersection of systems engineering, reliability engineering, and platform architecture. Salesforce Cloud Infrastructure Engineers design and operate Kubernetes-based container platforms, architect multi-region deployment strategies, build and own the observability stacks that give the engineering organisation visibility into complex distributed systems, and partner with security and compliance teams to ensure that infrastructure-level controls meet the trust standards that define Salesforce's brand. In practice, this means making engineering decisions with direct business consequences — a misconfigured load balancer affects a Fortune 500 company's sales cycle; an undetected memory leak degrades the experience of thousands of reps during a critical quarter-end push.
Interviews for Cloud Infrastructure Engineer roles at Salesforce are designed to surface deep systems thinking, not just operational familiarity with cloud tooling. Candidates are expected to reason from first principles about scalability, failure modes, observability gaps, and the trade-offs between competing engineering priorities under real-world constraints. The five questions below reflect the types of scenarios you will encounter — grounded in the specific challenges of running a multi-tenant, globally distributed SaaS platform at Salesforce's scale.
Interview Questions
Question 1: Multi-Region Architecture for a Mission-Critical SaaS Platform
Interview Question
Salesforce is expanding a core Sales Cloud service — the opportunity management API — to support active-active multi-region deployment across three geographic regions: US-East, EU-West, and APAC. Currently the service is deployed in a single primary region (US-East) with a passive warm standby in US-West for disaster recovery. The service handles approximately 800,000 API requests per minute at peak and requires a 99.99% availability SLA. The opportunity data model has complex relational consistency requirements — a single opportunity record can be updated by multiple users simultaneously, and updates must be visible to all users globally within 2 seconds.

Design the multi-region architecture for this service, explicitly addressing data consistency, traffic routing, failover behaviour, and the trade-offs you are making at each layer.
Why Interviewers Ask This Question
Multi-region architecture for a SaaS platform with strong consistency requirements is one of the hardest distributed systems problems in production engineering. It surfaces whether a candidate understands the CAP theorem not as an academic concept but as a set of practical trade-offs that must be made explicitly at every layer of the stack. Interviewers look for candidates who can articulate why each architectural decision was made and what it costs — not just describe a topology.
Example Strong Answer
The core tension: consistency vs availability vs latency
Active-active multi-region deployment with a 2-second global consistency requirement immediately runs into the CAP theorem and the practical reality of inter-region network latency. The round-trip time between US-East and APAC is approximately 200–250ms. Achieving strong consistency (every write visible globally before acknowledging) within 2 seconds is theoretically achievable but introduces significant tail latency risk and complexity. I would design for causal consistency rather than strict linearisability — a slightly weaker but far more operationally tractable guarantee that preserves the user experience requirement.
Traffic routing layer
Route user traffic to the geographically nearest region using latency-based DNS routing (Route 53 or equivalent) combined with Anycast routing for the API gateway layer. Each region serves its local user base for reads and writes. This is not active-passive — all three regions accept writes. The routing layer also performs health-based failover: if a region fails health checks for 30 consecutive seconds, traffic is redistributed to the remaining healthy regions using weighted routing, with weights adjusted dynamically based on remaining capacity.
The routing layer must be aware of data residency constraints: EU-West is the exclusive region for EU-domiciled customer data under GDPR. Traffic routing must ensure that EU customers' data never flows to US-East or APAC, even during a failover of EU-West. This means EU failover routes to a secondary EU region, not to US-East — a constraint that must be encoded into the routing policy and tested explicitly.
Data layer: the hardest problem
For a relational data model with concurrent update requirements, I would use a multi-master database architecture with conflict resolution:
- Primary choice: CockroachDB or Google Spanner — both provide serialisable distributed transactions across regions with automatic conflict resolution. Spanner uses TrueTime for global transaction ordering; CockroachDB uses a hybrid logical clock. Both are significantly more operationally complex than a single-region Postgres cluster, but they provide the consistency guarantees the opportunity data model requires.
- Replication topology: Each region has a local replica. Writes are acknowledged to the client after a quorum of regional replicas confirms the write (typically 2 of 3). This provides durability without requiring all three regions to confirm before responding.
- Conflict resolution policy: For concurrent updates to the same opportunity record, implement optimistic locking via version vectors at the application layer. When two users in different regions update the same record simultaneously, the write that arrives with a stale version fails the version check and is surfaced to the user as a conflict to resolve, rather than being silently overwritten.
The 2-second consistency window
Asynchronous replication lag between US-East and APAC at peak is typically 300–600ms. To meet the 2-second global visibility requirement without paying the synchronous write latency cost on every operation:
- Read-your-writes consistency is guaranteed locally — a user always sees their own most recent write within their region.
- Cross-region read staleness is bounded by the replication lag SLA, which I would set at 1 second maximum. Monitor via replication lag metrics per region pair; alert and page if lag exceeds 500ms sustained (see the example alerting rule after this list).
- For critical consistency scenarios (e.g., when a manager reviews a pipeline immediately after a rep updates a forecast), the application can explicitly request a strong read that routes to the global leader — paying the latency cost when the use case demands it, not on every read.
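A minimal sketch of that replication-lag alert, in the same Prometheus rule style used later in this document. The metric name replication_lag_seconds and its region labels are assumptions about what the database or replication exporter would expose:

# Hypothetical replication lag alert (metric name is an assumption)
- alert: CrossRegionReplicationLagHigh
  expr: max by (source_region, target_region) (replication_lag_seconds) > 0.5
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Replication lag {{ $labels.source_region }} -> {{ $labels.target_region }} above 500ms for 10 minutes"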
Failover behaviour
In an active-active topology, "failover" is not a single event — it is a continuous spectrum of degradation handling:
| Scenario | Response |
|---|---|
| Single AZ failure within a region | Kubernetes pod rescheduling within remaining AZs — no user impact, < 30s |
| Full regional failure | DNS failover to nearest healthy region within 60s; routing layer redistributes traffic |
| Network partition (split-brain) | Quorum-based write availability maintained; minority partition serves stale reads with a staleness indicator |
| Data layer leader failure | Automatic leader election within 15–30s; writes blocked during election, reads served from replicas |
Trade-offs I am explicitly making
- Choosing causal consistency over linearisability: Eliminates the synchronous cross-region write latency cost. Accepted trade-off: a user in APAC may see a 500–1,000ms stale read from a concurrent US-East update. For an opportunity record, this is operationally acceptable.
- Choosing multi-master over primary-replica: Accepts the operational complexity of distributed transactions and conflict resolution in exchange for regional write availability. If US-East goes down, APAC users can still create and update opportunities.
- GDPR data residency constraint on EU failover: Limits the available failover capacity for EU customers — accepted because the alternative (violating data residency) is not acceptable.
Key Concepts Tested
- CAP theorem applied to real-world multi-region architecture decisions
- Active-active vs active-passive failover trade-offs
- Distributed database options: CockroachDB, Spanner, multi-master Postgres
- Latency-based DNS routing with health-based failover
- Data residency constraints as architectural inputs
- Causal consistency vs strong consistency — when each is appropriate
Follow-Up Questions
- "Your replication lag monitoring shows that APAC-to-EU-West replication has been running at 1.8 seconds consistently for the past week — above your 1-second SLA. The on-call engineer has not paged it as a P1. Walk through your investigation process: where do you start, what do you look for, and under what conditions would you escalate to a P1 incident?"
- "A Salesforce enterprise customer in Germany reports that after the EU-West region experienced a 12-minute degradation last month, some of their opportunity records appear to be in an inconsistent state — two reps see different values for the same field. Explain the mechanism that could cause this in your architecture and what remediation path exists."
Question 2: Kubernetes Platform Engineering at Scale
Interview Question
Salesforce runs a large-scale Kubernetes platform that hosts hundreds of microservices across multiple production clusters. Engineering teams are reporting three recurring problems: (1) node resource utilisation averages 35% CPU and 41% memory across the fleet, indicating significant resource wastage at significant cost; (2) during peak quarter-end periods, several critical services experience pod evictions because their resource requests are misconfigured — either dramatically under-requested (leading to OOM kills) or dramatically over-requested (consuming resources other services need); (3) a recent incident was caused by a misconfigured Horizontal Pod Autoscaler that scaled a stateful service down to zero replicas during a low-traffic overnight window, deleting all in-memory session state.

You are asked to redesign the resource management and autoscaling strategy for this platform. How do you approach it?
Why Interviewers Ask This Question
Kubernetes resource management is deceptively complex in production — it is easy to run a Kubernetes cluster, and very hard to run one efficiently and reliably at scale. This question surfaces whether a candidate understands the full resource lifecycle: how requests and limits interact with the scheduler, why over-provisioning and under-provisioning are both failure modes, and the nuanced differences between autoscaling stateless and stateful workloads. It also tests operational maturity — the HPA incident is a real class of failure that has caused production outages at multiple large Kubernetes operators.
Example Strong Answer
Diagnosing the three problems separately
The three problems are distinct in cause and require different solutions. Treating them as one "resource management problem" produces a muddled fix.
Problem 1: 35% CPU / 41% memory utilisation — resource waste
Low utilisation at the fleet level almost always has one of two root causes: over-specified resource requests (teams requesting 4 CPU cores for a service that uses 0.8 at peak) or poor bin-packing due to node size choices (large nodes with heterogeneous workloads that cannot be fully packed). I would investigate both:
- Requests vs actual usage audit: Use the Kubernetes Metrics Server or a VPA (Vertical Pod Autoscaler) in recommendation-only mode to compute the 95th percentile CPU and memory usage for every deployed workload over the past 30 days. Compare against current requests. Services where actual p95 usage is < 50% of requested resources are candidates for right-sizing.
- VPA in recommendation mode (not auto mode): Expose VPA recommendations in a dashboard that engineering teams can review before accepting. Do not enable VPA auto-apply in production — it causes unexpected pod restarts and is inappropriate for stateful services or services with strict startup latency requirements.
- Namespace-level ResourceQuotas and LimitRanges: Implement LimitRanges that enforce a maximum request-to-limit ratio (e.g., memory limit ≤ 2× memory request). This prevents the pathological case of a service requesting 512Mi but setting a 16Gi limit: the scheduler places the pod based on the small request while the container can balloon to many times that, creating node memory pressure and evictions the scheduler never accounted for. A minimal LimitRange enforcing this ratio is sketched after this list.
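A minimal LimitRange sketch enforcing the 2× ratio described above; the namespace and default sizes are illustrative:

apiVersion: v1
kind: LimitRange
metadata:
  name: memory-ratio-guardrail
  namespace: example-team          # hypothetical namespace
spec:
  limits:
    - type: Container
      maxLimitRequestRatio:
        memory: "2"                # a container's memory limit may be at most 2x its request
      defaultRequest:
        memory: 512Mi              # applied when a container omits its request
      default:
        memory: 1Gi                # applied when a container omits its limit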
Problem 2: Misconfigured requests causing evictions and OOM kills
Pod evictions during peak periods are a scheduling and QoS class problem. In Kubernetes, pod priority during resource pressure is determined by QoS class, which is derived from how requests and limits are configured:
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | requests == limits for every container | Last to be evicted |
| Burstable | At least one request or limit set, but not requests == limits everywhere | Evicted under node pressure, after BestEffort |
| BestEffort | No requests or limits set | First to be evicted |
For every P0/P1 critical service, I would require requests == limits (Guaranteed QoS). This is a stricter resource contract but guarantees the service is never evicted during node pressure events. The higher resource reservation is the explicit cost of that guarantee.
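A minimal sketch of what that Guaranteed QoS contract looks like in a pod spec; the name, image, and sizes are placeholders, not recommendations:

apiVersion: v1
kind: Pod
metadata:
  name: critical-api                      # illustrative only
spec:
  containers:
    - name: app
      image: example/critical-api:1.0     # hypothetical image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"                        # requests == limits for every container => Guaranteed QoS
          memory: 4Gi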
For services that are OOM-killing, the issue is that memory limits are set too low relative to actual peak usage. Memory profiling via kubectl top pods and heap dump analysis should determine the correct limit. A memory limit should be set at approximately p99.9 peak memory + 20% headroom, not at average memory usage.
Problem 3: HPA scaling a stateful service to zero
This is a configuration error with a straightforward but often overlooked fix. The HPA minReplicas field must never be set to 0 or 1 for a stateful service with session state. The root cause is that the team treated this service as if it were stateless.
The corrective actions:
- Immediate: Set minReplicas: 3 for all stateful services on the affected HPAs. Three replicas ensure that even if one replica is unavailable during a rolling update or node failure, the service remains available.
- Architectural: Any service with in-memory session state that cannot tolerate loss should externalise that state to Redis or a distributed cache. An HPA scaling down a service should be a non-event from a state-persistence perspective — if it causes data loss, the service has an architectural dependency on replica count that makes it inherently fragile.
- Policy enforcement: Create a custom admission webhook that validates HPA configurations at apply time. The webhook should reject any HPA that sets minReplicas < 2 for services with the label stateful: true, and emit a warning for any service without explicit state externalisation documentation. This prevents the same class of misconfiguration from recurring; a policy-engine sketch follows this list.
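One way to implement that apply-time guardrail without writing a bespoke webhook is a policy engine such as Kyverno; a minimal sketch, assuming stateful services carry a stateful: "true" label:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: stateful-hpa-min-replicas
spec:
  validationFailureAction: Enforce        # reject non-compliant HPAs at admission time
  rules:
    - name: require-min-replicas
      match:
        any:
          - resources:
              kinds:
                - HorizontalPodAutoscaler
              selector:
                matchLabels:
                  stateful: "true"        # assumed labelling convention
      validate:
        message: "HPAs for stateful services must set minReplicas to at least 2."
        pattern:
          spec:
            minReplicas: ">=2"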
Platform-wide autoscaling strategy
Beyond fixing the specific incidents, I would implement a coherent autoscaling hierarchy:
Cluster Autoscaler (node-level)
└── Horizontal Pod Autoscaler (pod count scaling)
└── Vertical Pod Autoscaler in recommendation mode (right-sizing)
            └── KEDA (event-driven autoscaling for queue-based workloads)

- HPA scales stateless services based on CPU utilisation (target 65–70%) and custom metrics (queue depth for async workers, RPS for API services); an example HPA and PDB pair is sketched after this list
- Cluster Autoscaler provisions new nodes when pods are unschedulable due to insufficient resources. Configure scaleDownDelayAfterAdd: 10m to prevent thrashing during bursty traffic
- KEDA for services that should scale based on downstream signals (Kafka consumer lag, SQS queue depth) rather than CPU — more accurate for async processing workloads
- Pod Disruption Budgets (PDBs) on all production services, ensuring that autoscaling-driven evictions and node drains never take more than one replica of a critical service offline simultaneously
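As referenced above, a sketch of an HPA and PDB pair for a stateless API service; the service name, thresholds, and replica counts are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api                    # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65      # matches the 65-70% target above
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api-pdb
spec:
  maxUnavailable: 1                   # never take more than one replica offline at a time
  selector:
    matchLabels:
      app: orders-api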
Key Concepts Tested
- Kubernetes QoS classes (Guaranteed, Burstable, BestEffort) and eviction behaviour
- VPA in recommendation mode vs auto mode and when each is appropriate
- HPA minReplicas configuration for stateful vs stateless services
- Admission webhooks for policy enforcement at configuration time
- Cluster Autoscaler configuration and scale-down delay tuning
- KEDA for event-driven autoscaling beyond CPU/memory metrics
- LimitRanges and ResourceQuotas for namespace-level resource governance
Follow-Up Questions
- "You implement VPA in recommendation mode and publish the right-sizing recommendations. Three weeks later, fewer than 20% of engineering teams have acted on the recommendations for their services. The remaining 80% say they are 'too busy' or 'not confident the recommendations are accurate.' How do you drive adoption without forcing potentially breaking changes onto teams?"
- "A new microservice is being onboarded to the platform. The team is asking for 32 CPU cores and 128Gi memory per pod, citing 'we need this for peak load.' Your own analysis suggests 8 cores and 32Gi covers their p99 usage with 40% headroom. How do you handle this disagreement, and what process do you put in place to allow teams to start conservative and scale up safely?"
Question 3: Observability Architecture for Distributed Microservices
Interview Question
Salesforce's platform engineering team is responding to a recent high-severity incident. A latency degradation affecting the Service Cloud case management API took 47 minutes to detect and 2.3 hours to resolve. Post-incident review reveals several observability gaps: the team had CPU and memory metrics but no distributed tracing, logs were stored in three separate systems with no correlation between them, and the degradation was first reported by a customer rather than detected by an internal alert. The incident was ultimately caused by a slow database query introduced by a schema migration that increased p99 query latency from 12ms to 890ms — invisible at the aggregate metrics level but catastrophic at the p99 tail.

Design the observability architecture that would have detected this incident in under 5 minutes and provided the on-call engineer with the information needed to isolate the root cause within 15 minutes.
Why Interviewers Ask This Question
Observability is one of the most strategically important investments in distributed systems engineering, and it is also one of the most frequently underinvested until after a major incident. This question tests whether a candidate understands the three pillars of observability (metrics, traces, logs) not as independent systems but as a correlated, complementary triad — and whether they can reason about the specific failure modes that each pillar addresses and misses. The detail about p99 latency being invisible at the aggregate level is the critical diagnostic clue the candidate must pick up on.
Example Strong Answer
Diagnosing the observability failure
The 47-minute detection gap and the customer-first report reveal multiple distinct failures:
- Metrics were aggregate-only: CPU and memory averages are almost always the wrong metric for latency degradation. The slow query pushed p99 latency to 890ms — but if average query latency was 45ms, a metric alert on average latency would never trigger. This is the tail latency problem: aggregate metrics are insensitive to the worst user experiences.
- No distributed tracing: Without trace data, the 890ms query was invisible until someone happened to look at database slow query logs. The on-call engineer had no way to follow a single request from the API gateway to the database and see where time was being spent.
- Fragmented log storage: Correlating a frontend API error with a backend slow query requires joining logs across systems by a shared request ID. If no correlation ID is propagated through the call chain, the logs are forensic evidence of unrelated events, not a coherent narrative of a single request.
- No SLO-based alerting: An alert that fires when customer-visible latency exceeds a threshold would have triggered within 2–3 minutes of the degradation starting. Instead, the alert policy was monitoring internal infrastructure metrics, not user-facing outcomes.
The target observability architecture
I would structure the observability stack around the OpenTelemetry standard as the collection layer, with purpose-built backends for each signal type.
Pillar 1: Metrics — percentile-based SLO alerting
Replace all aggregate latency alerts with histogram-based percentile alerts. For the case management API:
# Example Prometheus alerting rule
- alert: CaseAPIHighTailLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket{
        service="case-management-api"
      }[5m])
    ) > 0.5
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Case API p99 latency exceeds 500ms for 2 minutes"

This alert would have fired within 2 minutes of the schema migration deploying. The key architectural decision: store latency as histograms, not averages. Prometheus histograms (or explicit HDR histograms via Micrometer) allow arbitrary percentile computation at query time without storing every individual measurement.
Define formal SLOs for every user-facing API: "99% of requests complete in under 300ms, measured over a 30-day rolling window." Derive alerting thresholds from error budget burn rate — not from arbitrary absolute thresholds. A burn rate alert fires when the error budget is being consumed 14× faster than sustainable, which typically surfaces degradations within 5–10 minutes.
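A minimal sketch of a fast-burn alert for that 99%-under-300ms SLO, assuming the request histogram exposes a 0.3s bucket boundary; a 14.4× burn rate exhausts a 30-day error budget in roughly two days:

# Fast-burn alert: latency error budget consumed ~14x faster than sustainable
- alert: CaseAPILatencySLOFastBurn
  expr: |
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{service="case-management-api", le="0.3"}[5m]))
        /
        sum(rate(http_request_duration_seconds_count{service="case-management-api"}[5m]))
      )
    ) > 14.4 * 0.01
  for: 2m
  labels:
    severity: page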
Pillar 2: Distributed Tracing
Instrument every service with OpenTelemetry SDK to propagate trace context (trace ID, span ID) via W3C Trace Context headers. Every inbound request generates a root span; every downstream call (database query, cache read, inter-service HTTP call) generates a child span with its own duration measurement.
The schema migration incident would be immediately visible in trace data:
Trace: GET /api/cases/C-001234 [2847ms]
├── Auth validation [12ms]
├── Permission check [8ms]
└── Database query: SELECT * FROM Cases WHERE... [890ms] ← anomaly
        └── Execution plan: Full table scan (missing index)

Store traces in Jaeger or Tempo (lower cost than commercial APM at Salesforce's trace volume). Integrate trace sampling strategy: tail-based sampling that always retains traces with errors or latency above the p99 threshold, regardless of overall sample rate. This ensures high-latency traces like the one that caused the incident are never dropped for cost reasons.
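One concrete way to express that tail-based sampling policy is the tail_sampling processor in the OpenTelemetry Collector (contrib distribution); the thresholds and baseline sample rate below are illustrative:

processors:
  tail_sampling:
    decision_wait: 10s              # hold spans until the trace is complete enough to judge
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-traces
        type: latency
        latency:
          threshold_ms: 500         # always retain anything slower than the 500ms SLO
      - name: baseline-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 5    # small random sample of healthy traffic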
Pillar 3: Logs — structured, correlated, centralised
Replace the three separate log storage systems with a unified log pipeline:
Application → OpenTelemetry Collector → Kafka (buffer) → OpenSearch / Loki

Every log line must include the trace ID and span ID from the active OpenTelemetry context. This is the critical correlation mechanism:
{
  "timestamp": "2024-01-15T14:23:45.123Z",
  "level": "ERROR",
  "service": "case-management-api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Database query exceeded latency SLA",
  "query_duration_ms": 890,
  "query_hash": "abc123"
}

With trace ID correlation, the on-call engineer can jump from a Jaeger trace showing a slow span directly to the associated structured log entries for that exact request — without manually searching across disparate log systems.
The 5-minute detection / 15-minute isolation target
| Time | Event | Mechanism |
|---|---|---|
| T+0 | Schema migration deploys | Deployment event emitted to observability platform |
| T+2m | p99 latency exceeds 500ms SLA | Prometheus histogram alert fires, pages on-call |
| T+3m | On-call opens SLO dashboard | Error budget burn rate chart shows spike at exact deployment timestamp |
| T+5m | Root cause isolated to database layer | Distributed trace waterfall shows database span as the outlier |
| T+8m | Specific slow query identified | Trace links to structured log with query_hash; slow query log confirms full table scan |
| T+12m | Rollback initiated | Schema migration reverted; p99 returns to 12ms |
The deployment change correlation layer
One additional capability that transforms incident response: correlate every deployment event with the observability timeline. A vertical line on the SLO dashboard at the exact moment of the schema migration deployment makes the "deployment caused this" hypothesis testable in seconds rather than minutes of log archaeology.
Key Concepts Tested
- Three pillars of observability: metrics, traces, logs — as a correlated system, not independent tools
- Histogram-based percentile metrics vs aggregate averages for tail latency detection
- SLO error budget burn rate alerting — a more sensitive and semantically meaningful alert policy than threshold alerts
- OpenTelemetry as the vendor-neutral instrumentation standard
- Tail-based trace sampling to retain high-latency traces
- Structured logging with trace/span ID correlation
- Deployment event correlation with observability timelines
Follow-Up Questions
- "Your OpenTelemetry collector is ingesting 4 million spans per minute from 200 microservices. Storage costs for your tracing backend have grown to $180K/month and are projected to double in 6 months. How do you design a sampling strategy that reduces storage cost by 60% while ensuring you retain the traces that matter most for incident investigation?"
- "A service team comes to you and says: 'Our service's p99 latency looks fine in Prometheus — 280ms. But our customers are reporting timeouts.' What are the three most likely explanations for this discrepancy, and how would you use your observability stack to determine which is occurring?"
Question 4: CI/CD Pipeline for Infrastructure at Scale
Interview Question
Salesforce's platform infrastructure team manages approximately 2,400 Terraform modules and Kubernetes manifest repositories across 14 production environments (multiple regions, multiple cloud providers, and Salesforce's own private data centres). Currently, infrastructure changes go through a manual review process: an engineer makes changes locally, opens a pull request, a senior engineer reviews it, and changes are applied manually with terraform apply or kubectl apply after approval. In the past 6 months there have been four production incidents caused by: (1) configuration drift between environments where a change was applied to staging but never promoted to production, (2) a terraform apply run that included unintended resource deletions because the engineer's local Terraform state was stale, (3) a Kubernetes manifest change deployed directly to production without going through staging, and (4) a secret accidentally committed to a Git repository.

Design the CI/CD pipeline architecture for infrastructure automation that prevents each of these four incident classes.
Why Interviewers Ask This Question
Infrastructure CI/CD at the scale of a multi-cloud, multi-region enterprise platform is a distinct engineering discipline from application CI/CD. This question tests whether a candidate can design a pipeline that is simultaneously automated enough to eliminate the human-error failure modes described, and safe enough that automated pipelines don't cause their own class of incidents. Each of the four failure modes maps to a specific architectural control, and the interviewer is looking for a candidate who can identify and implement all four.
Example Strong Answer
Map each incident to its architectural control
Before designing the pipeline, I would map each failure mode to the specific control that prevents it:
| Incident | Root Cause | Architectural Control |
|---|---|---|
| Config drift between environments | Manual, untracked changes | GitOps — all changes via Git, automated promotion pipeline |
| Stale state causing unintended deletions | Local Terraform state | Remote state with locking; plan-before-apply with explicit approval |
| Direct prod deployment skipping staging | No enforcement of promotion path | Automated promotion gates; production deployment requires staging success |
| Secret committed to Git | No pre-commit detection | Pre-commit secret scanning + CI-enforced secret detection |
The GitOps foundation
The entire infrastructure estate is represented as code in Git, with environment state managed declaratively. Engineers never apply changes by hand — the automation reconciles the desired state (in Git) with the actual state (in the cloud). This is the GitOps operating model:
Git repository (desired state)
└── Reconciliation loop (Flux or ArgoCD)
        └── Cloud environment (actual state)

For Kubernetes manifests, I would use Flux or ArgoCD running in each cluster. These controllers continuously watch the Git repository and apply changes when the repository state diverges from cluster state. No human ever runs kubectl apply in production — the GitOps controller is the only actor authorised to modify cluster state.
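A sketch of the declarative handle the GitOps controller watches, shown here as an Argo CD Application; the repository URL, path, and namespaces are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: case-management-api                                        # hypothetical service
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git    # placeholder repository
    targetRevision: main
    path: services/case-management-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: case-management
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git-declared state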
For Terraform, I would use Atlantis or Terraform Cloud to run plans and applies automatically from Git events, with remote state stored in S3 (with DynamoDB locking) or Terraform Cloud backend.
The promotion pipeline architecture
Feature Branch
│
├── PR opened → CI runs:
│ ├── terraform plan (against dev state, shows exact diff)
│ ├── terraform validate + tflint (linting)
│ ├── checkov (security policy scanning)
│ ├── gitleaks (secret scanning — PR is BLOCKED if secrets detected)
│ └── Kubernetes manifest: kubeval + conftest (OPA policy)
│
▼ PR merged to main
│
├── Auto-deploy to DEV (all 3 dev clusters)
│ └── Automated smoke tests + integration tests
│
▼ DEV tests pass (gate)
│
├── Auto-deploy to STAGING (all staging environments)
│ ├── Full regression test suite
│ ├── 30-minute soak period — monitor error rates and latency
│ └── Automated canary analysis (compare staging metrics to baseline)
│
▼ STAGING tests pass + soak period clean (gate)
│
├── Production deployment (requires explicit human approval)
│ ├── Shows Terraform plan diff again — engineer reviews exact resources changing
│ ├── Change window enforcement (no production changes Fri 16:00 – Mon 09:00)
│ └── Canary deployment: 5% traffic → 25% → 100% with automated rollback on error rate spike
│
▼ PRODUCTION healthy for 30 minutes
        └── Deployment record stored in CMDB

Preventing incident 2: stale state and unintended deletions
Atlantis or Terraform Cloud always runs terraform plan against the remote state — never local state. The plan output is posted as a PR comment, showing exactly which resources will be created, modified, or destroyed. Engineers must review and explicitly approve the plan before apply runs.
Critical safety controls on terraform apply:
- The -target flag is banned in CI pipelines — partial applies that only change specific resources mask the full scope of a change
- Destroy operations require a secondary approval — any plan that includes resource deletion triggers a second approval gate from a different engineer
- State locking prevents two engineers running conflicting applies simultaneously (DynamoDB lock for S3 backend)
Preventing incident 4: secret detection
Multi-layer secret detection:
- Pre-commit hook (local): gitleaks or git-secrets as a developer-side pre-commit hook. Catches secrets before they reach the remote repository. This is the developer experience layer — fast feedback, not blocking. (An example pre-commit configuration follows this list.)
- CI pipeline gate (enforced): Every PR CI run includes a secret scanning step using truffleHog or gitleaks against the full diff. PRs with detected secrets are blocked — not warned, blocked. This is the enforced gate.
- Historical scan: Run a full repository history scan quarterly to detect any secrets that may have been committed before the tooling was in place. Alert and rotate any found credentials immediately.
- Secret management integration: All secrets referenced in Terraform and Kubernetes manifests must be sourced from Vault (HashiCorp) or AWS Secrets Manager — never hardcoded in manifests. Enforce this via OPA/Conftest policy that rejects any Kubernetes Secret resource with literal values (they must reference an external secrets operator instead).
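The pre-commit layer in the first bullet can be wired up with the standard pre-commit framework; a minimal sketch, where the hook id is the one gitleaks publishes and the pinned version is a placeholder:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0            # pin to whatever version you have vetted
    hooks:
      - id: gitleaks        # scans staged changes for secrets before each commit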
Drift detection
For configuration drift, the GitOps reconciliation loop is the prevention mechanism. But I would also run a daily drift detection job that compares actual cloud resource configurations against the Terraform state and reports discrepancies. Any resource that exists in the cloud but not in Git (indicating a manual change) triggers a P3 alert and a required remediation ticket.
Key Concepts Tested
- GitOps operating model (Flux/ArgoCD for Kubernetes, Atlantis/Terraform Cloud for IaC)
- Terraform remote state with locking — preventing stale state applies
- Multi-stage promotion pipeline with automated gates
- Canary deployment strategy for production changes
- Multi-layer secret detection: pre-commit, CI, historical scan, Vault integration
- OPA/Conftest policy enforcement for infrastructure compliance at PR time
- Drift detection as a continuous control, not a one-time check
Follow-Up Questions
- "Your GitOps promotion pipeline takes 3.5 hours end-to-end from PR merge to production deployment, primarily because the STAGING soak period takes 2 hours. An engineering team with an urgent security patch needs to get a Terraform change to production within 30 minutes. What is your process for emergency change deployment, and what controls do you maintain even in the fast path?"
- "Three months after implementing the pipeline, a post-incident review reveals that an automated
terraform applyin STAGING deleted a shared RDS instance that was being used by two services — the deletion was in the Terraform plan but the engineer who approved it did not notice it. What changes do you make to the pipeline to prevent approval of high-impact destructive operations?"
Question 5: Security and Zero-Trust Network Architecture in Cloud Infrastructure
Interview Question
Salesforce is redesigning the network security architecture for its internal microservices platform, which hosts hundreds of services across multiple Kubernetes clusters. The current architecture uses a flat network model where any pod can communicate with any other pod, with access controlled only at the perimeter (ingress/egress firewalls). A recent security audit found three critical vulnerabilities: (1) a compromised pod in the payments processing namespace could make lateral network calls to the user authentication service; (2) service-to-service authentication relies on IP-based allowlisting, which is fragile in a dynamic Kubernetes environment where pod IPs change frequently; (3) secrets for database connections and external APIs are stored as Kubernetes Secrets (base64-encoded, unencrypted at rest in etcd).

Redesign the network security architecture using a zero-trust model. Address each of the three vulnerabilities specifically, and explain the operational trade-offs of the controls you implement.
Why Interviewers Ask This Question
Zero-trust network architecture is the current security standard for cloud-native platforms, and this question tests whether a candidate can articulate the principles behind it and implement them concretely in a Kubernetes context. The three vulnerabilities represent the three layers of the zero-trust problem: network segmentation (lateral movement), identity-based authentication (replacing IP-based trust), and secrets management (protecting credentials at rest and in transit). Interviewers look for candidates who understand not just what to implement but why each control addresses a specific threat model.
Example Strong Answer
Zero-trust principles applied to this architecture
Zero-trust means: never trust, always verify. In a Kubernetes context, this translates to three concrete properties:
- Every network path is explicitly allowed — nothing is permitted by default
- Every service call is authenticated by cryptographic identity — not by network location
- Every secret is encrypted, access-controlled, and rotated — never stored as plaintext
Vulnerability 1: Lateral movement — Kubernetes Network Policies
The flat network model is the default in Kubernetes and is the most common misconfiguration in production clusters. The fix is Kubernetes Network Policies, which are namespace-scoped L3/L4 firewall rules enforced by the CNI plugin (Calico, Cilium, or AWS VPC CNI).
Default deny policy for every namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}        # applies to all pods in the namespace
  policyTypes:
    - Ingress
    - Egress

This single policy drops all ingress and egress for every pod in the payments namespace until explicit allow rules are added. The payments service then explicitly allows only the specific traffic it requires:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-service-policy
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-processor
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: api-gateway
      ports:
        - port: 8443
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - port: 5432

A compromised pod in the payments namespace can no longer make arbitrary calls to the authentication service — that path is not in the explicit allow list and is dropped at the kernel level by the CNI plugin.
For L7 policy enforcement (HTTP method, path, headers) — which Network Policies cannot enforce — I would deploy a service mesh (Istio) or Cilium's L7-aware network policies. The service mesh sidecar proxy intercepts all traffic and enforces L7 policies: "Only the payments service can call POST /auth/validate on the authentication service, and only with a valid service identity certificate."
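A sketch of that L7 rule as an Istio AuthorizationPolicy; the namespace, labels, and service account names are assumptions about how these workloads would be deployed:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: auth-validate-callers
  namespace: auth                        # namespace of the authentication service (assumed)
spec:
  selector:
    matchLabels:
      app: auth-service                  # assumed label on the authentication service
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/payments/sa/payments-processor   # the only caller allowed
      to:
        - operation:
            methods: ["POST"]
            paths: ["/auth/validate"]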
Vulnerability 2: IP-based authentication — mTLS with SPIFFE/SPIRE
IP-based allowlisting is fundamentally broken in Kubernetes because pod IPs are ephemeral — they change on every restart, rollout, and rescheduling event. The correct solution is cryptographic service identity using mTLS (mutual TLS).
I would implement SPIFFE (Secure Production Identity Framework for Everyone) via its reference implementation SPIRE:
- SPIRE Server issues short-lived X.509 SVIDs (SPIFFE Verifiable Identity Documents) to each workload. An SVID encodes a cryptographic identity tied to the Kubernetes service account — not the IP address:
spiffe://salesforce.internal/ns/payments/sa/payments-processor
- Each pod's SPIRE Agent automatically renews its certificate before expiry (default: 1-hour lifetime with rotation starting at 50% of lifetime)
- The Istio service mesh uses SPIRE-issued certificates for automatic mTLS between all services. Every inter-service call is mutually authenticated — both sides present their SVID and verify the other's identity
This eliminates IP-based trust entirely. The authentication service can verify: "This request comes from the payments service running in the payments namespace, not from an arbitrary IP." A compromised pod that steals a certificate is limited by the 1-hour certificate lifetime.
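In Istio, rejecting any non-mTLS traffic across the mesh is a single PeerAuthentication object; a sketch:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system    # applying it in the root namespace makes the setting mesh-wide
spec:
  mtls:
    mode: STRICT             # reject any connection that is not mutual TLS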
Vulnerability 3: Kubernetes Secrets in etcd — External Secrets Operator + Vault
Kubernetes Secrets are base64-encoded, not encrypted by default, and stored in etcd. Any pod whose service account carries an over-broad RBAC policy can read them through the Kubernetes API. There are two layers to fix:
Layer 1: etcd encryption at rest
Enable Kubernetes EncryptionConfiguration to encrypt Secret resources in etcd using AES-GCM or KMS. For Salesforce's compliance requirements, use KMS provider integration (AWS KMS, GCP KMS) so the encryption key is managed outside the cluster and requires cloud IAM authentication to use. An attacker who gains direct etcd access cannot read secrets without the KMS key.
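A sketch of the Layer 1 EncryptionConfiguration, assuming a KMS v2 plugin is already running on the control-plane hosts; the plugin name and socket path are placeholders:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2
          name: cloud-kms-plugin                             # placeholder plugin name
          endpoint: unix:///var/run/kmsplugin/socket.sock    # placeholder socket path
          timeout: 3s
      - identity: {}    # read fallback so pre-existing plaintext objects remain readable during migration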
Layer 2: Remove secrets from Kubernetes entirely — use Vault
The more comprehensive solution is to stop storing secrets in Kubernetes at all:
Application Pod
│
│ 1. Pod presents SPIRE SVID to Vault
├── Vault authenticates via Kubernetes Auth Method (validates service account JWT)
│
│ 2. Vault returns a dynamic secret with TTL
└── Pod receives: database credentials (valid for 1 hour, auto-rotated)
                        API keys (valid for 24 hours, auto-rotated)

Use the External Secrets Operator to sync only non-sensitive configuration from Vault into Kubernetes ConfigMaps, and inject sensitive credentials directly into pod memory via the Vault Agent Sidecar or the Vault Secrets Operator — never writing them to disk or etcd.
Dynamic secrets are the highest-value Vault feature: rather than storing a static database password, Vault generates a unique username/password pair for each pod at startup with a TTL matching the pod's expected lifetime. When the pod is terminated, the credential is automatically revoked. A credential theft from a running pod has a maximum exposure window equal to the credential TTL.
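One concrete delivery mechanism for those dynamic credentials is the Vault Agent injector, driven by pod annotations; the Vault role and secret path below are assumptions about how Vault would be configured:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-processor
spec:
  selector:
    matchLabels:
      app: payments-processor
  template:
    metadata:
      labels:
        app: payments-processor
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "payments"                                          # assumed Vault role
        vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/payments"   # assumed dynamic credential path
    spec:
      serviceAccountName: payments-processor
      containers:
        - name: app
          image: example/payments-processor:1.0    # placeholder image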
RBAC tightening as a complementary control
Zero-trust network policy and mTLS do not replace RBAC — they layer on top of it. I would audit and tighten every ServiceAccount's permissions:
- Remove the automountServiceAccountToken default from pods that do not need Kubernetes API access
- Eliminate all ClusterRole bindings for application workloads — applications should only have namespace-scoped roles
- Audit for wildcard permissions (verbs: ["*"]) and replace with explicit, minimal permission sets (a minimal namespace-scoped Role is sketched after this list)
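As referenced in the last bullet, a minimal namespace-scoped Role and RoleBinding sketch; the resource list is illustrative of the little an application workload should typically need:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-processor-config-reader
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]    # explicit, minimal verbs instead of "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-processor-config-reader
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-processor
    namespace: payments
roleRef:
  kind: Role
  name: payments-processor-config-reader
  apiGroup: rbac.authorization.k8s.io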
Operational trade-offs
| Control | Operational cost | Trade-off |
|---|---|---|
| Network Policies + default deny | High initial configuration effort; teams must explicitly document all required traffic paths | Accepted: security gain justifies one-time policy authoring cost |
| mTLS with SPIRE | Certificate rotation infra to operate; 5–10ms latency overhead per service call from TLS handshake | Accepted: latency overhead is negligible vs security gain; SPIRE automates rotation |
| Vault dynamic secrets | Vault becomes a critical dependency — it must be highly available | Accepted: run Vault in HA mode (5-node Raft cluster); the security gain of revocable, short-lived credentials justifies the operational dependency |
Key Concepts Tested
- Kubernetes Network Policies for namespace-level traffic segmentation
- Default-deny network posture — nothing permitted unless explicitly allowed
- SPIFFE/SPIRE for cryptographic workload identity in dynamic environments
- mTLS between services as the replacement for IP-based authentication
- etcd EncryptionConfiguration with KMS provider for secrets at rest
- Vault dynamic secrets and the External Secrets Operator for zero-trust secrets management
- RBAC minimisation and ServiceAccount token management
Follow-Up Questions
- "After implementing mTLS across the entire service mesh, your on-call team reports a new class of incident: certificate rotation failures are causing service-to-service authentication errors. Two services fail to renew their SVID before expiry and start rejecting each other's connections. How do you design the certificate renewal process to be resilient to this failure mode, and what observability do you add to detect certificate expiry risk before it causes an outage?"
- "Your security team has asked you to implement mutual TLS for traffic between the Kubernetes cluster and an on-premises legacy system that cannot be updated to present a SPIFFE identity. It only supports IP allowlisting. How do you bridge the zero-trust model to accommodate this legacy system without creating a gap in your security posture?"
Question 6: Distributed Systems Reliability — Cascading Failure Prevention
Interview Question
Salesforce's platform team is conducting a post-incident review following a major outage. What started as a single slow database replica in the reporting service cascaded into a full platform degradation affecting Sales Cloud, Service Cloud, and the API gateway over 23 minutes. The sequence was: (1) the reporting service's database replica began returning slow responses due to a long-running analytical query; (2) the reporting service's connection pool filled up waiting for responses, exhausting all available threads; (3) the API gateway, which called the reporting service synchronously on every request for dashboard data, began queuing requests; (4) the API gateway's thread pool filled up, causing it to stop processing all requests — including requests that had nothing to do with reporting; (5) the entire platform became unavailable.

This is a textbook cascading failure. Redesign the architecture to prevent each propagation step, and explain the resiliency patterns you would apply at each layer.
Why Interviewers Ask This Question
Cascading failures are one of the most insidious failure modes in distributed systems — they convert a localised, recoverable degradation into a total outage. This question tests whether a candidate understands the specific mechanisms by which failures propagate across service boundaries and can name the concrete resiliency patterns that interrupt each propagation step. The scenario is deliberately realistic: the exact sequence described has caused major outages at Netflix, Amazon, and multiple SaaS platforms. Interviewers want to see pattern recognition and architectural remediation, not just "add more replicas."
Example Strong Answer
Map the propagation chain before prescribing fixes
Each step in the cascade has a distinct failure mechanism and a distinct fix. A candidate who jumps straight to "add circuit breakers everywhere" has diagnosed the symptom, not the disease. I would address each propagation step explicitly.
Step 1 → Step 2: Slow database query exhausting the connection pool
The root cause at the origin is an unbounded analytical query running on a replica that should be serving low-latency operational reads. Two controls:
- Query isolation: Analytical queries should never run on the same replica as operational reads. Separate read replicas by workload class: one replica for sub-100ms operational reads, a dedicated replica (or a data warehouse connection) for analytical queries with no latency SLA. If the analytical query saturates its replica, it affects only analytics — not operational reporting.
- Statement timeout: Set a maximum query execution time on the database connection (e.g., statement_timeout = 5000 in Postgres, in milliseconds). Any query running longer than 5 seconds is terminated automatically and returns an error rather than blocking a connection indefinitely. The reporting service must handle this error gracefully — return a cached result or a degraded response — but the connection is freed immediately.
- Connection pool sizing and timeout: The connection pool should have a connection timeout (time to wait for a connection from the pool before failing fast) set to a small value — 200ms for an operational service. If the pool is exhausted, the next request fails immediately with a 503 rather than waiting and accumulating into a queue that grows without bound. (A pool configuration sketch follows this list.)
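A sketch of those two guards for a JVM-based reporting service, assuming HikariCP configured through Spring Boot properties; the keys are HikariCP's and the values mirror the numbers above:

spring:
  datasource:
    hikari:
      maximum-pool-size: 40
      connection-timeout: 200                              # ms to wait for a pooled connection before failing fast
      connection-init-sql: "SET statement_timeout = 5000"  # Postgres: terminate any query running longer than 5s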
Step 2 → Step 3: Reporting service failure propagating to the API gateway
The API gateway calling the reporting service synchronously on every request is the architectural mistake that converts a localised service failure into a platform-wide failure. Two patterns interrupt this propagation:
- Circuit breaker: Implement a circuit breaker (Resilience4j, Hystrix, or Envoy/Istio circuit breaking at the mesh layer) around the API gateway's call to the reporting service. When the error rate exceeds a threshold (e.g., >50% of calls failing within a 10-second window), the circuit trips to OPEN state and all subsequent calls to the reporting service fail immediately — without waiting for the network timeout. This prevents the API gateway from accumulating slow in-flight requests while the reporting service is degraded.
Circuit Breaker States:
CLOSED → All requests pass through normally
OPEN → All requests fail immediately (fail fast), no calls to downstream
  HALF-OPEN → Probe requests permitted; if healthy, transitions to CLOSED

- Bulkhead isolation: The API gateway's thread pool for reporting service calls must be isolated from its thread pools for other services. Using a bulkhead pattern (separate thread pools per downstream dependency), a fully exhausted reporting thread pool cannot consume threads intended for authentication, opportunity APIs, or case management. The reporting feature degrades; the rest of the platform continues serving requests. (An example of mesh-layer circuit breaking and connection limits follows below.)
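As referenced above, a sketch of mesh-layer circuit breaking and connection limits for the reporting dependency using an Istio DestinationRule; the host name and thresholds are illustrative:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reporting-service-circuit-breaker
spec:
  host: reporting-service.reporting.svc.cluster.local    # assumed service DNS name
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 50     # bound the queue of requests waiting for a connection
        http2MaxRequests: 200           # bound concurrent in-flight requests to the service
    outlierDetection:
      consecutive5xxErrors: 5           # eject an endpoint after 5 consecutive errors
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100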
Step 3 → Step 4: API gateway thread pool exhaustion
Even with a circuit breaker, if the downstream call has a 30-second timeout (a common default in HTTP client libraries), 30 seconds × (requests per second) in-flight requests accumulate before the circuit trips. The gateway thread pool fills before the circuit breaker has enough data to make a tripping decision.
- Aggressive timeout hierarchy: Set explicit timeouts at every layer, chosen so that failures surface as fast errors rather than slow waits:
- Database query timeout: 5s
- Reporting service internal timeout: 8s
- API gateway → reporting service timeout: 3s — deliberately shorter than the reporting service's internal budget, so the gateway fails fast and falls back rather than waiting out a degraded downstream
- Client → API gateway timeout: 10s
This timeout hierarchy ensures failures propagate upward as fast failures, not slow queue accumulations.
- Load shedding at the gateway: Implement a request queue with a maximum depth at the API gateway. When the queue depth exceeds the threshold, new requests are rejected immediately with a 503 rather than enqueued. This bounds the worst-case latency for queued requests and prevents unbounded memory consumption.
Step 4 → Step 5: Complete platform unavailability
The final propagation step — reporting degradation taking down the entire API gateway — is prevented by the bulkhead isolation above. But one additional architectural principle eliminates this class of failure entirely:
- Asynchronous decoupling of non-critical data: Dashboard and reporting data is not on the critical path for core CRM operations. A sales rep updating an opportunity does not need real-time dashboard data to complete their work. The API gateway should never call the reporting service synchronously on requests that do not explicitly require reporting data. Dashboard widgets should be loaded asynchronously, client-side, with independent timeout and error handling — isolated from the core transactional API path.
The redesigned architecture
Client Request
│
├── API Gateway (bulkhead-isolated thread pools per downstream)
│ ├── [Opportunity API calls] → Opportunity Service (synchronous, critical path)
│ ├── [Case API calls] → Case Service (synchronous, critical path)
│ └── [Dashboard calls] → Reporting Service
│ │
│ ├── Circuit Breaker (trips at >50% error rate, 10s window)
│ ├── Timeout: 3s hard limit
│ └── Fallback: cached dashboard data (stale-while-revalidate)
│
└── Reporting Service
├── Operational read replica (p99 < 100ms queries only, statement_timeout=5s)
       └── Analytical read replica (long-running queries, isolated from operational)

Chaos engineering as validation
I would not consider this architecture resilient until it has been validated by deliberate failure injection. Using Chaos Mesh or LitmusChaos in a staging environment, I would inject each failure scenario: slow database responses, connection pool exhaustion, network latency between services, and deliberate circuit breaker tripping. The test passes only if each failure remains isolated to its origin service without cascading.
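A sketch of one such injection using Chaos Mesh, adding latency to the reporting service's database traffic in a staging namespace; selectors and durations are illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: reporting-db-slow-responses
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - reporting-staging            # assumed staging namespace
    labelSelectors:
      app: reporting-service
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - reporting-staging
      labelSelectors:
        app: reporting-db            # assumed label on the database pods
  delay:
    latency: "800ms"                 # simulate the slow replica
    jitter: "200ms"
  duration: "10m"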
Key Concepts Tested
- Cascading failure propagation mechanics — identifying the mechanism at each step
- Circuit breaker pattern: CLOSED / OPEN / HALF-OPEN state machine
- Bulkhead isolation: separate thread pools per downstream dependency
- Timeout hierarchy: ensuring each layer's timeout is shorter than the layer above
- Load shedding at ingress to bound queue depth and prevent memory exhaustion
- Asynchronous decoupling of non-critical data paths from critical transaction paths
- Chaos engineering as validation methodology for resiliency claims
Follow-Up Questions
- "Your circuit breaker is configured to trip after 50% error rate over a 10-second window. In production, you observe that the circuit trips correctly during full reporting service outages, but fails to protect the gateway during partial degradation — when 30% of reporting requests are slow (8–12 seconds) rather than failing. The circuit never trips because the error rate stays below 50%. How do you reconfigure the circuit breaker to handle slow requests as failures, and what metric do you use as the trip signal?"
- "After implementing bulkhead isolation, your platform handles the reporting service failure correctly in synthetic tests. But during the next real incident, a completely different service — the email notification service — causes the same cascading failure pattern. What process do you put in place to systematically audit all service-to-service dependencies for the same propagation risk, rather than fixing them one incident at a time?"
Question 7: Network Performance and Latency Optimisation at Global Scale
Interview Question
Salesforce is experiencing latency complaints from enterprise customers in Southeast Asia and Australia. Their Salesforce orgs are hosted in the US-East region (Virginia), and users in Singapore and Sydney report that page loads and API responses take 4–6 seconds — compared to 400–600ms for US-based users performing identical operations. The application engineering team has investigated and confirmed that the backend processing time for the affected requests is under 120ms. The remaining 3.8–5.9 seconds of latency is network and infrastructure overhead.

Diagnose the sources of latency overhead and design a network performance improvement strategy. You do not have the authority to migrate the primary data storage to APAC — that is a separate long-term initiative. Your mandate is to reduce perceived latency for APAC users using infrastructure-layer changes only.
Why Interviewers Ask This Question
Network latency optimisation for globally distributed SaaS users is a real and common infrastructure engineering challenge. This question tests whether a candidate understands the physics of network latency — round-trip time across oceans is non-negotiable at the speed of light — and the infrastructure-layer techniques that reduce the number of round trips, the connection establishment overhead, and the distance travelled by application data. The constraint (no data migration) is realistic and forces the candidate to focus on the right layer of the problem.
Example Strong Answer
Decompose the latency budget before fixing anything
4–6 seconds of overhead for a request with 120ms of backend processing is a large but diagnosable gap. I would use distributed tracing with precise timestamps at each network hop to decompose the latency budget:
- TCP connection establishment: A new TCP connection from Singapore to Virginia involves a 3-way handshake across a ~180ms round-trip path. That is 180ms × 1.5 round trips = 270ms just to establish the connection — before a single byte of application data is sent.
- TLS handshake: A TLS 1.2 handshake requires 2 round trips after TCP establishment (TLS 1.3 reduces this to 1). At 180ms RTT: TLS 1.2 adds 360ms; TLS 1.3 adds 180ms. For a user whose browser makes 10 API calls per page load with no connection reuse, that is 10 × (270ms TCP + 360ms TLS) = 6.3 seconds of pure connection establishment overhead.
- HTTP request-response cycles: If each page load requires 15 sequential HTTP requests (waterfall loading), that is 15 × 180ms = 2.7 seconds of network round-trip overhead at the application layer.
- DNS resolution latency: If the DNS resolver is not geographically local, each DNS lookup adds 50–100ms.
The diagnosis: the majority of the latency is connection establishment overhead from a high-RTT path, compounded by unoptimised HTTP request patterns.
Fix 1: CDN with regional Points of Presence (PoP) — highest impact
Deploy a CDN with APAC PoPs (Cloudflare, Fastly, or AWS CloudFront) in front of the Salesforce application tier. The CDN serves two functions:
- Static asset caching: JavaScript bundles, CSS, images, and API documentation are served from a Singapore or Sydney PoP — eliminating cross-ocean round trips for these assets entirely. A 2MB JavaScript bundle that previously needed multiple 180ms round trips to transfer now delivers from a PoP 5ms away.
- TCP and TLS termination at the edge: The CDN terminates the TCP and TLS connection at the nearest PoP, establishing a local connection (5–10ms RTT) with the end user. The CDN maintains a persistent, pre-warmed connection pool to the US-East origin over a private backbone network. The user experiences local connection establishment latency; the CDN absorbs the cross-ocean connection cost internally.
This change alone typically reduces perceived latency by 40–60% for static asset-heavy pages.
Fix 2: HTTP/2 and HTTP/3 with connection multiplexing
If the application is still serving over HTTP/1.1 to APAC clients, each API request requires its own TCP connection (or waits in a queue for a free connection). HTTP/2 allows multiplexing multiple requests over a single TCP connection, eliminating the repeated connection establishment overhead. HTTP/3 (QUIC) improves further by reducing the handshake to a single round trip and eliminating TCP head-of-line blocking.
Audit current protocol usage via CDN access logs. Enforce HTTP/2 as the minimum for all APAC traffic. Enable TLS 1.3 with 0-RTT session resumption — a returning user's TLS handshake is reduced to zero additional round trips by reusing the session ticket.
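As a quick illustration of that audit, the snippet below is a minimal client-side spot check of which protocol a given endpoint actually negotiates from an APAC vantage point. The hostname is a placeholder, and the CDN access logs remain the authoritative source.

# Spot-check the negotiated HTTP version from a client in the affected region.
# Requires: pip install 'httpx[http2]'. The hostname is a placeholder, not a real endpoint.
import httpx

with httpx.Client(http2=True) as client:
    resp = client.get("https://api.example.com/health")
    # "HTTP/2" confirms multiplexing is available; "HTTP/1.1" means every request
    # pays its own connection and head-of-line-blocking cost on the 180ms path.
    print(resp.http_version, resp.status_code)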
Fix 3: TCP optimisation at the origin
For traffic that cannot be served from the CDN edge (dynamic API responses with customer-specific data):
- TCP BBR congestion control: Deploy BBR (Bottleneck Bandwidth and Round-trip propagation time) congestion control on all servers serving APAC traffic. BBR is designed for high-bandwidth, high-latency paths — the exact profile of a Singapore-to-Virginia connection. On lossy transoceanic paths, BBR can deliver severalfold higher throughput than loss-based CUBIC.
- Increase TCP initial congestion window (initcwnd): the Linux default of 10 segments forces slow start to spend several 180ms round trips ramping up before a response of meaningful size is fully delivered. Raising initcwnd (for example to 20–32 segments) on APAC-facing endpoints shortens that ramp-up; the slow-start sketch after this list illustrates the effect.
- TCP Fast Open: Allows data to be sent in the SYN packet for returning connections, eliminating one round trip for the first request.
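The sketch below is an idealised slow-start model illustrating the initcwnd point: how many 180ms round trips it takes to deliver a response of a given size when the congestion window starts at different values. It ignores loss and receive-window limits, so treat the numbers as directional only.

# Rough slow-start model: round trips needed to deliver size_bytes when the
# congestion window starts at initcwnd segments and doubles each RTT.
# Idealised model (no loss, no receive-window limit); real stacks will differ.

MSS = 1460  # bytes per segment (typical Ethernet MTU minus headers)

def slow_start_rtts(size_bytes: int, initcwnd: int) -> int:
    cwnd, sent_segments, rtts = initcwnd, 0, 0
    while sent_segments * MSS < size_bytes:
        sent_segments += cwnd
        cwnd *= 2
        rtts += 1
    return rtts

for icw in (4, 10, 30):
    rtts = slow_start_rtts(500_000, icw)  # ~500 KB dynamic API response
    print(f"initcwnd={icw:>2}: {rtts} round trips ≈ {rtts * 180} ms on the Singapore–Virginia path")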
Fix 4: API regionalisation for latency-sensitive paths
Even without migrating primary data storage, I can deploy read-only regional API replicas in APAC for endpoints that serve cacheable or non-transactional data:
- The Salesforce opportunity list view, account records, and contact details do not change on every request. A regional API node in Singapore that serves reads from a read replica (with acceptable replication lag for read-heavy use cases) reduces the round-trip for these requests from 180ms to 5ms.
- Writes still route to US-East. The regional API node proxies write requests to the origin, accepting the full round-trip latency for write operations — but write operations represent a small fraction of total API call volume in a CRM.
This is read-local, write-global — a well-established pattern for reducing read latency without data sovereignty changes.
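A minimal sketch of the routing rule such a regional node would apply is below. The upstream URLs and the lag-sensitive path prefix are placeholders; a production implementation would also handle read-your-own-writes stickiness, retries, and replica-lag headers.

# Minimal read-local / write-global routing rule for a regional API node in Singapore.
# Upstream URLs and path prefixes are illustrative placeholders.

LOCAL_READ_REPLICA = "https://apac-read.internal.example"       # ~5 ms away
US_EAST_ORIGIN = "https://origin-us-east.internal.example"      # ~180 ms away

READ_METHODS = {"GET", "HEAD", "OPTIONS"}

def pick_upstream(method: str, path: str) -> str:
    """Route cacheable reads to the local replica, everything else to the writable origin."""
    if method in READ_METHODS and not path.startswith("/api/realtime/"):
        return LOCAL_READ_REPLICA   # accepts bounded replication lag
    return US_EAST_ORIGIN           # writes and lag-sensitive reads pay the full RTT

assert pick_upstream("GET", "/api/opportunities") == LOCAL_READ_REPLICA
assert pick_upstream("POST", "/api/opportunities") == US_EAST_ORIGIN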
Fix 5: DNS optimisation
Deploy regional DNS resolvers (or use Anycast DNS) so that DNS lookups for Salesforce endpoints resolve from a node in APAC rather than routing to a US DNS resolver. Use DNS TTL optimisation: static assets should have long TTLs (86400s); the API endpoint record should have a short TTL (60s) to enable fast failover while still being cacheable.
Expected outcome
| Optimisation | Estimated Latency Reduction |
|---|---|
| CDN with APAC PoP for static assets | 1.5–2.5s reduction (static asset delivery) |
| HTTP/2 + TLS 1.3 with 0-RTT | 0.5–1.0s reduction (connection overhead) |
| TCP BBR + initcwnd tuning | 0.3–0.6s reduction (throughput on long-fat pipes) |
| Regional read API replicas | 0.8–1.5s reduction (dynamic read responses) |
| Total projected reduction | 3.1–5.6s |
Target: APAC users experiencing 400–900ms perceived latency — comparable to the US-based baseline.
Key Concepts Tested
- Latency budget decomposition: TCP, TLS, DNS, and application-layer contributions
- CDN PoP architecture for edge termination and static asset delivery
- HTTP/2 multiplexing and HTTP/3 QUIC for reducing per-request connection overhead
- TLS 1.3 with 0-RTT session resumption
- TCP BBR congestion control for high-latency, high-bandwidth paths
- Read-local, write-global pattern for regional API performance without data migration
- DNS Anycast and TTL strategy
Follow-Up Questions
- "After deploying the CDN with APAC PoPs, your security team raises a concern: the CDN now terminates TLS on behalf of Salesforce, meaning the CDN provider can in theory inspect decrypted customer CRM data in transit. Several enterprise customers have compliance requirements that prohibit third-party TLS termination. How do you address this architectural concern without removing the CDN's latency benefits?"
- "Three months after the CDN deployment, a customer in Singapore reports that they are still experiencing 3.5-second latency for one specific workflow: the CPQ quote generation API, which makes 14 sequential internal API calls synchronously before returning a response. The CDN cannot help with this because the response is dynamic and non-cacheable. How do you diagnose and address this specific latency pattern?"
Question 8: High Availability and Disaster Recovery Design
Interview Question
Salesforce is designing the disaster recovery strategy for its Service Cloud contact centre platform. The platform processes 2.4 million customer support interactions daily. The business has defined the following recovery objectives: Recovery Time Objective (RTO) of 15 minutes, Recovery Point Objective (RPO) of 60 seconds, and a 99.99% annual availability SLA (52 minutes of downtime budget per year). The current DR strategy is a warm standby in a secondary region that is manually activated by an on-call engineer following a runbook. In the last DR test, activation took 47 minutes and resulted in 8 minutes of data loss — both significantly exceeding the RTO and RPO targets. Redesign the disaster recovery architecture to meet the stated RTO and RPO, and explain why the current warm standby approach is structurally incapable of meeting these targets.
Why Interviewers Ask This Question
Disaster recovery design is one of the most consequential infrastructure responsibilities at a SaaS company, and there is a precise relationship between RTO/RPO targets and the architectural complexity required to meet them. Many organisations have DR plans on paper that cannot actually meet their stated objectives — a gap that is only discovered during an incident. This question tests whether a candidate understands why certain DR architectures have hard floors on achievable RTO and RPO, and what architectural changes are required to cross those floors.
Example Strong Answer
Why the warm standby cannot meet the targets — structural analysis
A warm standby with manual activation has structural constraints that make a 15-minute RTO physically impossible:
- Human detection latency: An incident must be detected, triaged, and a decision made to activate DR before any automated process begins. Even with excellent monitoring, this takes 5–10 minutes for a clear-cut primary region failure. For ambiguous degradation (partial failure, not total outage), engineers spend additional time verifying whether DR activation is warranted — the "is it bad enough to failover?" decision adds 10–20 minutes.
- Manual runbook execution time: A DR runbook for a system of this complexity — promoting a read replica to primary, updating DNS records, reconfiguring load balancers, validating health checks across all dependent services — takes 20–40 minutes even for an experienced, well-practiced engineer.
- Replication lag translates directly to data loss: Asynchronous replication to a warm standby typically operates at 1–5 minutes of lag during normal operations. Under the conditions that usually accompany a disaster (a degrading primary falling behind on shipping its replication stream while write traffic spikes), lag can increase to 8–15 minutes. The 8-minute data loss in the DR test is entirely explained by this.
The arithmetic is unambiguous: detection (5–10m) + decision (5–15m) + manual activation (20–40m) = 30–65 minutes. A manual warm standby process cannot achieve a 15-minute RTO. A 60-second RPO requires synchronous or near-synchronous replication, not asynchronous.
The required architecture: active-active with automated failover
Meeting a 15-minute RTO and 60-second RPO requires eliminating human action from the critical path of failover. The target architecture:
Data layer: synchronous replication with automatic leader election
- Replace asynchronous replication with synchronous replication to the standby region for all writes. A write is acknowledged to the application only after it has been committed in both regions. This reduces RPO to near-zero (the maximum data loss equals the in-flight writes at the moment of failure — typically < 1 second).
- Use a distributed database with automatic leader election (Patroni for Postgres, CockroachDB, or Aurora Global Database with automated failover). When the primary region becomes unreachable, the cluster automatically promotes the standby to primary within 30–60 seconds — without human intervention.
- Trade-off: synchronous replication adds write latency. Measure the baseline write latency increase from synchronous cross-region commit. For a US-East to US-West path, this is typically 60–90ms added to each write. For a CRM write operation, this is acceptable. For a high-frequency event streaming workload, it would require a different architecture.
Application layer: active-active with traffic shifting
Deploy the application tier in both regions simultaneously (active-active, not active-passive). Both regions serve live traffic at all times. Failover becomes traffic shifting rather than activation:
- Normal operation: US-East handles 100% of traffic (or a weighted split if geographic routing is desired)
- Failover trigger: Automated health checks (Route 53 health checks or a dedicated failover controller) detect primary region unavailability after 3 consecutive failed checks (30-second window)
- Failover action: DNS weight shifts to 100% secondary region — no manual intervention, no runbook execution
- DNS TTL: Set to 30 seconds for the API endpoint record. During failover, all clients receive the updated DNS record within 30 seconds of the TTL expiry
Automated failover controller
The human decision point is replaced by an automated failover controller:
Failover Controller (running in both regions, independent of application tier)
│
├── Health probe: primary region API health endpoint, every 10 seconds
├── Consensus: 3 consecutive failed probes required to trigger (30s)
├── Automatic actions on trigger:
│ ├── Database: initiate leader election in standby cluster
│ ├── DNS: update Route 53 weight to 0% primary / 100% secondary
│ ├── Load balancer: activate secondary region ALB
│ └── Notification: page on-call, create incident, post to status page
└── Human role: observe, validate, and initiate fail-back (not fail-over)

The human is in the loop for fail-back (returning to primary after recovery), not fail-over. Fail-back is a planned, validated operation with no time pressure — it can wait for human review.
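A minimal sketch of that controller loop is below. The three actions are left as stubs because each is provider-specific (database operator, Route 53, paging system); the thresholds mirror the 10-second probe and 3-failure consensus described above.

# Minimal failover controller loop mirroring the diagram above.
# Runs in both regions, independent of the application tier. Actions are stubs.
import time
import urllib.request

HEALTH_URL = "https://us-east.api.internal.example/healthz"  # placeholder endpoint
PROBE_INTERVAL_S = 10
FAILURES_TO_TRIGGER = 3   # 3 consecutive failures = 30-second detection window

def probe_healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def promote_standby_database() -> None:
    print("stub: trigger leader election in the standby database cluster")

def shift_dns_to_secondary() -> None:
    print("stub: set Route 53 weights to 0% primary / 100% secondary")

def notify_oncall_and_statuspage() -> None:
    print("stub: page on-call, open an incident, post to the status page")

def initiate_failover() -> None:
    promote_standby_database()
    shift_dns_to_secondary()
    notify_oncall_and_statuspage()

def run() -> None:
    consecutive_failures = 0
    while True:
        if probe_healthy(HEALTH_URL):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_TO_TRIGGER:
                initiate_failover()
                break  # fail-back is a human-initiated, planned operation
        time.sleep(PROBE_INTERVAL_S)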
Meeting the 99.99% SLA — the arithmetic
99.99% annual availability = 52.6 minutes of downtime budget per year.
With the automated failover architecture:
- Detection time: 30 seconds (3 health check failures)
- Database leader election: 30–60 seconds
- DNS propagation: 30 seconds (TTL-bounded)
- Application ramp-up in secondary region: 60–120 seconds (the tier already serves live traffic in active-active; this covers connection-pool growth, cache warm-up, and autoscaling to absorb the full load)
- Total RTO: 2.5–4 minutes
This leaves a margin of 48–50 minutes of annual downtime budget for planned maintenance, partial degradation events, and unexpected scenarios — a workable operating budget for a 99.99% SLA.
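The arithmetic behind those figures, written out as a quick check:

# Annual downtime budget at 99.99% and the automated-failover RTO range.
minutes_per_year = 365 * 24 * 60              # 525,600
budget = minutes_per_year * (1 - 0.9999)      # ≈ 52.6 minutes per year

rto_low = (30 + 30 + 30 + 60) / 60            # 2.5 minutes
rto_high = (30 + 60 + 30 + 120) / 60          # 4.0 minutes
print(f"budget ≈ {budget:.1f} min/yr; one automated failover costs {rto_low:.1f}–{rto_high:.1f} min")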
DR testing as a first-class engineering practice
The 47-minute DR test failure reveals that DR testing was not being done regularly enough for the team to identify and fix the gap. I would implement:
- Monthly automated DR drills: A scheduled job that simulates primary region failure in a staging environment and validates automated failover completes within RTO
- Quarterly full production DR test: Actual failover of production traffic to the secondary region, with customer communication and a defined maintenance window. This is the only way to validate that the production architecture actually meets its stated objectives
- Game Day exercises: Unannounced failure injection during business hours to validate detection, escalation, and response — not just the technical failover mechanism
Key Concepts Tested
- RTO/RPO target analysis — why specific architectural patterns have hard floors on achievable values
- Synchronous vs asynchronous replication and the RPO trade-off
- Active-active vs active-passive and the distinction between traffic shifting and activation
- Automated failover controller design — removing humans from the critical failover path
- Route 53 health-based routing with TTL strategy for fast DNS propagation
- DR testing as an engineering practice: automated drills, quarterly production tests, Game Days
Follow-Up Questions
- "Your automated failover controller triggers successfully during a real incident, shifting 100% of traffic to the secondary region within 3 minutes. However, 20 minutes after failover, you discover that 1,400 support cases created in the final 45 seconds before the primary region failed are missing from the secondary — despite your synchronous replication configuration. What are the three most likely explanations for this data loss, and how do you investigate each?"
- "The CFO reviews the cost of running an active-active architecture with full application capacity in two regions simultaneously and asks: 'We're paying for double the infrastructure to handle an event that may never happen.' How do you justify the cost, and is there a cost-optimised variant of this architecture that still meets the RTO/RPO targets?"
Question 9: Platform Infrastructure Automation and Self-Service Engineering
Interview Question
Salesforce's internal developer platform team is facing a scaling problem. The platform engineering team of 12 engineers is responsible for provisioning infrastructure for 280 product engineering teams. Currently, teams request infrastructure through a ticketing system: they submit a Jira ticket, a platform engineer reviews it, manually provisions the resources (Kubernetes namespaces, service accounts, IAM roles, network policies, Vault paths, CI/CD pipelines), and closes the ticket. The average provisioning time is 6 business days. Engineering teams report that this bottleneck is their top productivity complaint. Platform engineers report that 70% of their time is spent on repetitive provisioning work, leaving little time for architectural improvements. Design a self-service infrastructure platform that allows engineering teams to provision production-ready infrastructure in under 30 minutes without a platform engineer in the critical path. Address what guardrails you would build to prevent teams from provisioning insecure or non-compliant infrastructure without manual review.
Why Interviewers Ask This Question
Developer experience and platform self-service are strategic infrastructure engineering priorities at companies like Salesforce, where the ratio of product engineers to platform engineers makes manual provisioning operationally unsustainable. This question tests whether a candidate understands the "you build it, you run it" platform model, the concept of paved roads and golden paths, and how policy-as-code enables self-service without sacrificing the compliance and security standards that a manual review was previously enforcing. It also tests product thinking in an infrastructure context.
Example Strong Answer
The diagnosis: a bottleneck masquerading as a process
The 6-day provisioning time is not a staffing problem — adding more platform engineers would reduce the backlog temporarily but not solve the structural issue. The root cause is that manual review is in the critical path for work that is largely deterministic: the vast majority of infrastructure provisioning requests are variations on the same set of approved patterns. A senior platform engineer reviewing a Jira ticket for a new Kubernetes namespace is providing compliance theatre, not genuine security value — because the reviewer has no time to do a deep review in a 6-day queue.
The correct model: shift compliance enforcement left, into automated policy checks at provisioning time, so that self-service requests that conform to approved patterns are provisioned instantly, while genuinely novel or non-compliant requests are flagged for human review.
The Internal Developer Platform (IDP) architecture
I would build the platform around three principles: golden paths (opinionated, pre-approved patterns), policy-as-code guardrails (automated compliance enforcement), and escape hatches with audit trails (a path for non-standard requests that does not go through the 6-day queue).
Layer 1: Service catalogue — golden path templates
Create a self-service service catalogue (using Backstage as the frontend, backed by Terraform and Helm templates) with pre-built, pre-approved templates for the most common infrastructure patterns:
- Standard microservice: Kubernetes namespace, Deployment, Service, HPA, PodDisruptionBudget, NetworkPolicy, ServiceAccount, Vault path, CI/CD pipeline scaffold — all configured to platform standards
- Async worker service: Same as above, plus Kafka consumer group and KEDA autoscaler
- Batch job: CronJob with resource limits, service account, and log aggregation configured
- Database-backed service: Above plus RDS/CloudSQL instance provisioned in the approved VPC subnet, credentials injected via Vault dynamic secrets
An engineering team fills out a form in Backstage: service name, team name, resource tier (small/medium/large), region. The template generates a pull request to the team's infrastructure repository. The CI pipeline validates the generated configuration against OPA policies. If all policies pass, the infrastructure is automatically provisioned via Terraform. Total time: under 10 minutes.
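A minimal sketch of that CI gate is below. The chart, policy, and Terraform paths are illustrative rather than Salesforce's actual pipeline; the flow is render the template output, evaluate it against the policy bundle, and apply only if the policy check passes.

# Minimal CI gate for the golden-path flow: render, check policies, then apply.
# Paths and the release name are illustrative placeholders.
import subprocess
import sys

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit

def main() -> None:
    # 1. Render the Kubernetes manifests produced by the golden-path template
    run(["helm", "template", "payments-service", "charts/standard-microservice",
         "--output-dir", "rendered/"])
    # 2. Reject the change automatically if any OPA deny rule fires
    run(["conftest", "test", "rendered/", "--policy", "policy/"])
    # 3. Policies passed: provision cloud resources without a human in the loop
    run(["terraform", "-chdir=infra/", "init", "-input=false"])
    run(["terraform", "-chdir=infra/", "apply", "-auto-approve", "-input=false"])

if __name__ == "__main__":
    try:
        main()
    except subprocess.CalledProcessError:
        sys.exit("Provisioning blocked: fix the policy violations reported above.")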
Layer 2: Policy-as-code guardrails — OPA/Conftest
The policies that a platform engineer was previously checking manually are encoded as OPA (Open Policy Agent) Rego policies enforced in CI:
# Example policy file, evaluated automatically in CI (conftest checks deny rules
# in the main package by default)
package main

# Every Kubernetes Deployment must have resource requests and limits
deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.requests
    msg := sprintf("Container '%v' must specify resource requests", [container.name])
}

# No container may run as root
deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    container.securityContext.runAsUser == 0
    msg := sprintf("Container '%v' must not run as root", [container.name])
}

# NetworkPolicy must exist in every namespace before workloads are deployed
# (namespace_has_network_policy is a helper rule defined elsewhere in the policy bundle)
deny[msg] {
    input.kind == "Deployment"
    not namespace_has_network_policy(input.metadata.namespace)
    msg := "Namespace must have a NetworkPolicy before deploying workloads"
}

Policies cover: resource requests/limits required, security context (no root, read-only root filesystem), NetworkPolicy presence, image registry allowlist (only approved internal registries), Vault secret injection required (no raw Kubernetes Secrets), and RBAC minimum permissions.
A provisioning request that violates any policy is rejected automatically with a specific error message — no platform engineer review required. A request that passes all policies is provisioned automatically.
Layer 3: Escape hatch — non-standard request pathway
Not every team fits a golden path. For genuinely novel requirements, the process is:
- Team submits a design document (not a Jira ticket) to the platform team's async review queue
- A platform engineer reviews the design document within 1 business day (not 6) — because they are reviewing architecture, not performing repetitive provisioning
- If approved, the platform engineer adds a new template or policy exception to the catalogue — benefiting all future teams with the same need
- The non-standard provisioning itself is still automated; only the design review is manual
This model inverts the time allocation: platform engineers spend their time on architectural review and platform improvement, not on provisioning.
Layer 4: Self-service Day 2 operations
Provisioning is only the first self-service gap. Teams also need self-service for Day 2 operations:
- Scaling: Teams adjust HPA min/max replicas, resource tier, and Cluster Autoscaler node pool size via a PR to their infrastructure repository — no ticket required
- Secret rotation: Teams trigger Vault secret rotation via a CLI command or Backstage action — no platform engineer involvement
- On-call dashboard: Pre-built Grafana dashboards are automatically provisioned for every service, linked to the team's Prometheus namespace
Measuring success
The platform team's success metrics shift from "tickets closed" to:
- Mean time to provision: Target < 30 minutes (from golden path template use)
- Self-service rate: % of provisioning requests fulfilled without platform engineer involvement — target > 85%
- Policy violation rate at PR time: A declining rate indicates teams are learning the standards; an increasing rate indicates a policy that is too restrictive or poorly communicated
- Platform engineer time on roadmap work: From 30% to > 70% of time spent on platform improvement rather than provisioning
Key Concepts Tested
- Internal Developer Platform (IDP) design: Backstage as the self-service frontend
- Golden path / paved road philosophy — opinionated templates that encode best practices
- Policy-as-code with OPA: shifting compliance enforcement from manual review to automated gates
- Escape hatch design: non-standard requests routed to architectural review, not provisioning review
- Gitops-driven infrastructure provisioning: PR-based workflow with automated apply
- Platform team productivity metrics: self-service rate as the primary KPI
Follow-Up Questions
- "Your self-service platform is live and the self-service rate reaches 82% in the first month. However, you discover that three teams have used the escape hatch pathway to provision infrastructure that was technically approved but collectively creates a sprawl problem: 47 different custom VPC configurations, 12 different database engine versions, and no standardisation on logging frameworks. None of this violated your OPA policies. How do you address the sprawl problem without reverting to the bottleneck model?"
- "A security audit finds that 14 services provisioned through the golden path have a critical vulnerability: the Kubernetes ServiceAccount token is mounted in every pod by default, giving compromised pods access to the Kubernetes API. The golden path template was the source of this misconfiguration. How do you remediate 14 production services simultaneously, update the template, and prevent the same class of error from recurring in future templates?"
Question 10: Incident Response and On-Call Engineering Culture
Interview Question
Salesforce's platform engineering organisation is experiencing on-call burnout. The team of 18 engineers rotates on-call weekly. In the past quarter: on-call engineers received an average of 47 pages per week (target: < 10), 68% of pages were triggered by alerts that required no action (alert fatigue), 23% of pages were for the same recurring issue that had been "fixed" three times previously, and on-call engineers report spending an average of 11 hours per week on incident response outside business hours. Two senior engineers have cited on-call burden as a reason for considering leaving the team. This is an on-call health crisis. Design a programme to bring the on-call experience to a sustainable state — addressing alert quality, recurring incidents, and the broader engineering culture changes required.
Why Interviewers Ask This Question
On-call health is a reliability engineering and engineering culture problem, not just a tooling problem. Alert fatigue is a well-documented path to serious incidents: engineers who receive 47 pages per week stop treating pages as urgent signals and start ignoring them — exactly the condition under which a real P1 incident goes undetected. This question tests operational maturity, systems thinking about feedback loops, and the leadership qualities required to drive cultural change across an engineering organisation.
Example Strong Answer
Diagnose the crisis with precision before proposing solutions
The data tells a specific story. I would not start by adjusting alert thresholds — that is the last step, not the first. I would start with a structured audit:
Alert audit: categorise every page from the past quarter
Pull all 47 × 13 weeks ≈ 611 pages from the alerting system. Categorise each one:
- Actionable and correctly calibrated: Alert fired, engineer investigated, took a real action to resolve. These are the alerts the system should generate.
- Actionable but poorly timed: Alert fired, but the issue resolved itself before the engineer acted, or the alert fired at 2am for an issue that could have waited until morning.
- Not actionable — flapping: Alert fired for a transient spike that resolved within 2 minutes without intervention. These are noise.
- Not actionable — misconfigured threshold: Alert fired because the threshold was set below normal operating variance (e.g., CPU > 70% alert on a service that routinely runs at 65–75%).
- Recurring incident (same root cause): Alert is legitimate but fires repeatedly because the underlying issue was not fixed — only mitigated.
This audit converts 68% noise into a prioritised list of specific alerts to fix. It also surfaces the recurring incidents explicitly.
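A first pass at this categorisation can be scripted against an export of the quarter's pages. The field names below are assumptions about what an alerting-system export would contain, and the "poorly timed" bucket still needs a human review:

# First-pass categorisation of a quarter of pages. Field names are assumptions
# about the alerting-system export; thresholds mirror the categories above.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Page:
    alert_name: str
    engineer_acted: bool           # did on-call take a real resolving action?
    auto_resolved_after_s: float   # seconds until auto-resolve; float("inf") if it never did
    fired_out_of_hours: bool

def categorise(page: Page) -> str:
    if page.auto_resolved_after_s <= 120 and not page.engineer_acted:
        return "not actionable - flapping"
    if not page.engineer_acted:
        return "not actionable - misconfigured threshold"
    if page.fired_out_of_hours:
        # crude proxy: needs human review to decide whether it could have waited
        return "actionable but poorly timed (review)"
    return "actionable"

def audit(pages: list[Page]) -> Counter:
    counts = Counter(categorise(p) for p in pages)
    # Recurring incidents surface separately: distinct alerts that paged 3+ times
    by_alert = Counter(p.alert_name for p in pages)
    counts["recurring-candidate alerts (3+ fires)"] = sum(1 for c in by_alert.values() if c >= 3)
    return counts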
Fix 1: Alert quality — ruthless reduction
The target is < 10 pages per week, actionable, non-duplicated. The path to that target:
- Eliminate all flapping alerts: Any alert that auto-resolves within 5 minutes without engineer action in > 80% of pages is a flapping alert. Add a for: 5m clause to Prometheus alert rules (the alert must be continuously firing for 5 minutes before it pages). This eliminates transient spikes from paging.
- SLO-based alerting replaces threshold-based alerting: Replace "CPU > 70%" alerts with SLO-based burn rate alerts. An SLO burn rate alert fires only when the user-visible error budget is being consumed faster than sustainable — it is semantically meaningful in a way that CPU threshold alerts are not. A CPU spike that doesn't affect user-visible latency or error rate does not page. (A minimal burn-rate check is sketched after this list.)
- Severity-appropriate routing: Not every alert should page. Create three tiers:
- Page (immediate): User-visible SLO breach, data loss risk, security incident — wake the engineer up
- Ticket (next business day): Infrastructure anomaly that is not yet user-visible but needs attention — create a ticket automatically, no page
- Log (informational): Background noise worth recording but not acting on — written to a log, not surfaced to any human
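To make the burn-rate idea concrete, the sketch below follows the multi-window pattern popularised by the Google SRE Workbook. The 14.4 and 6 thresholds are the commonly cited defaults rather than a Salesforce standard, and the error ratios are assumed to come from Prometheus recording rules:

# Minimal multi-window burn-rate check (pattern from the Google SRE Workbook).
# Error ratios per window are assumed inputs, e.g. from Prometheus recording rules.

SLO = 0.999                 # 99.9% availability objective
ERROR_BUDGET = 1 - SLO      # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than the sustainable rate the error budget is being consumed."""
    return error_ratio / ERROR_BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    # Page only if the budget is burning fast over the long window AND is still
    # burning right now (short window), which filters out spikes that already recovered.
    return burn_rate(err_1h) >= 14.4 and burn_rate(err_5m) >= 14.4

def should_ticket(err_6h: float, err_30m: float) -> bool:
    return burn_rate(err_6h) >= 6 and burn_rate(err_30m) >= 6

# Example: 2% of requests failing over both the last hour and the last 5 minutes
print(should_page(err_1h=0.02, err_5m=0.02))  # True: pages the on-call engineer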
Fix 2: Recurring incidents — mandatory post-incident reviews with teeth
23% of pages for the same recurring issue is the most damaging pattern: the team is investing incident response time repeatedly without eliminating the root cause. This is a process failure, not a technical failure.
I would implement a Recurring Incident Policy:
- Any incident that triggers for the third time in a quarter is automatically escalated to a P1 — even if its impact is minor — and a blameless post-incident review is mandatory within 48 hours
- The PIR must produce a corrective action item with a completion date and a named owner, not a vague "we'll look at improving the monitoring"
- Corrective action completion is tracked in a public dashboard visible to the team's leadership
- An incident that recurs after a stated fix was applied triggers a five-whys analysis specifically on why the fix was ineffective
The cultural shift: recurring incidents are not bad luck — they are a signal that the engineering organisation is not investing enough in eliminating root causes. Making this visible and tracking it publicly changes the incentive structure.
Fix 3: On-call load distribution and toil reduction
11 hours per week outside business hours is unsustainable regardless of alert quality. Structural changes:
- On-call rotation length: 1-week rotations with 18 engineers means each engineer is on-call approximately every 18 weeks. Consider whether services can be divided into sub-teams with their own rotations — allowing engineers who own specific services to be on-call for those services rather than a generalised platform rotation.
- Follow-the-sun rotation for global teams: If the team has engineers in multiple time zones, implement a follow-the-sun rotation where on-call coverage shifts between time zones so that no engineer is routinely woken at 3am. APAC engineers cover APAC hours; EU engineers cover EU hours; US engineers cover US hours.
- Toil budget: Allocate 20% of each sprint's engineering capacity explicitly to on-call toil reduction — fixing the alerts that fire most frequently, automating the runbook steps that on-call engineers perform manually. Track toil as a metric alongside feature velocity.
Fix 4: On-call health metrics and feedback loops
Create a public on-call health dashboard tracking weekly:
- Pages per on-call week (target: < 10)
- Percentage of pages that are actionable (target: > 90%)
- Mean time to resolve (MTTR) by incident category
- Recurring incidents open vs resolved
- On-call hours outside business hours per rotation
Review this dashboard in every team retrospective. When metrics improve, acknowledge the improvement publicly. When they regress, treat it as a P1 engineering problem. Leadership visibility into on-call health is the most powerful driver of cultural change — it signals that on-call burnout is an engineering quality metric, not an individual endurance test.
The cultural shift: sustainable on-call as a reliability property
The framing matters. On-call health is not a quality-of-life initiative — it is a reliability initiative. Engineers experiencing alert fatigue miss real P1 incidents. Burned-out engineers make worse decisions during incidents. Attrition of senior engineers who carry institutional knowledge of production systems is a direct reliability risk. Making this argument to leadership — in the language of reliability and business risk, not engineer wellbeing alone — is what converts it from a nice-to-have into an engineering priority.
Key Concepts Tested
- Alert audit methodology — categorising pages before adjusting thresholds
- SLO-based burn rate alerting as the replacement for threshold-based alerting
- Flapping alert elimination via Prometheus for: duration clauses
- Three-tier alert severity routing: page vs ticket vs log
- Recurring incident policy with mandatory PIRs and tracked corrective actions
- Follow-the-sun on-call rotation for global engineering teams
- Toil budget as a sprint planning construct
- On-call health as a reliability metric, not just an HR metric
Follow-Up Questions
- "You implement all the alert quality improvements and pages drop from 47 per week to 11 per week over two months. However, in week 10 after the changes, a P1 customer-impacting incident lasts 2.3 hours before being detected — it was caused by a slow memory leak that previously would have been caught by one of the alerts you deleted as 'not actionable.' How do you respond to this situation, and how do you avoid overcorrecting back toward alert proliferation?"
- "A senior engineer on the team pushes back on the Recurring Incident Policy, arguing: 'Some of these recurring alerts are for known limitations in legacy infrastructure that we cannot fix because the modernisation is on a 12-month roadmap. Making them P1s just creates paperwork without fixing the underlying issue.' How do you respond to this concern, and how does your policy accommodate genuinely known, unfixable issues without abandoning accountability for recurring incidents?"