IBM Cloud Infrastructure Engineer Interview Questions
Introduction
Cloud Infrastructure Engineers at IBM operate at the centre of one of the most complex enterprise technology portfolios in the industry. IBM's cloud strategy is built around hybrid cloud — the architecture that allows enterprises to run workloads across on-premises data centres, private clouds, and public cloud environments as a unified, governed system. This is not a simple lift-and-shift proposition. IBM's enterprise clients — banks, insurers, government agencies, global manufacturers — have infrastructure estates that have evolved over decades, with compliance obligations, data residency requirements, and integration dependencies that make pure-cloud migration either impractical or impermissible. IBM Cloud Infrastructure Engineers design the architectures that bridge these worlds, primarily through IBM Cloud, Red Hat OpenShift Container Platform, and IBM Cloud Pak solutions built on Kubernetes.
The technical scope of the role is deliberately broad. On any given week, an IBM Cloud Infrastructure Engineer might be designing a multi-zone OpenShift cluster topology for a financial services client, writing Terraform modules to codify a hybrid networking configuration across IBM Cloud and an on-premises data centre, building an observability stack that correlates metrics and traces across both environments, or architecting a disaster recovery plan that meets a hospital network's RPO and RTO requirements. Red Hat OpenShift is the centrepiece of IBM's hybrid cloud strategy — understanding OpenShift at depth, including its enterprise extensions (Machine Config Operator, OLM-managed operators, OpenShift GitOps, and OpenShift Pipelines) is foundational to success in this role. So is operational maturity: designing infrastructure that is resilient, observable, and automatable from day one.
Interviews for Cloud Infrastructure Engineer roles at IBM reflect this breadth and operational depth. Expect questions that require you to reason from first principles about distributed systems reliability, demonstrate hands-on familiarity with Kubernetes and OpenShift internals, articulate the trade-offs in hybrid networking architectures, and show how you would design, automate, and operate infrastructure that meets the stringent requirements IBM's enterprise clients demand. The questions below cover the full range of topics you are likely to encounter, grounded in the real infrastructure challenges IBM and its clients face.
Interview Questions
Question 1: Hybrid Cloud Architecture for a Regulated Financial Services Workload
Interview Question
IBM is engaged with a global investment bank that operates its core trading platform on-premises in two private data centres (London and New York). The bank wants to modernise its risk calculation workloads — computationally intensive Monte Carlo simulations that run in batch every evening — by bursting them to IBM Cloud during peak demand, while keeping the trading platform itself and all customer data on-premises. The bank's requirements: all customer account data must remain in their own data centres (data residency); risk calculations can use cloud compute but cannot write results back to any IBM-managed storage; the entire hybrid environment must be managed through a single control plane; and the network connection between on-premises and IBM Cloud must support at least 10Gbps throughput with latency under 5ms.
Design the hybrid cloud architecture that meets these requirements, and explain the specific IBM technologies you would use at each layer.
Why Interviewers Ask This Question
Hybrid cloud architecture for regulated workloads is IBM's primary commercial differentiation — it is what IBM Cloud is specifically designed for, and it is the scenario that most of IBM's enterprise cloud engagements start from. This question tests whether a candidate understands IBM's hybrid cloud technology stack (IBM Cloud Satellite, Red Hat OpenShift, DirectLink) at a depth beyond marketing material, and whether they can reason about the data residency, network performance, and unified management requirements that regulated clients impose. Interviewers also look for candidates who understand that "hybrid cloud" is not an architecture choice but a set of constraints — the bank's regulatory obligations define the architecture, not the other way around.
Example Strong Answer
The core architectural principle: compute bursts, data stays
The bank's requirements define a clear architectural principle: compute capacity can extend into IBM Cloud, but data gravity must remain on-premises. Every architectural decision flows from this principle.
Layer 1: Unified control plane — IBM Cloud Satellite
IBM Cloud Satellite is the correct technology for this requirement. Satellite extends IBM Cloud's managed services — including managed OpenShift (ROKS), IBM Cloud Monitoring, and IBM Cloud Object Storage — to run on the bank's own on-premises infrastructure while being managed and observable from the IBM Cloud control plane.
Architecture:
- Deploy IBM Cloud Satellite Locations in both the London and New York data centres. Each Satellite Location consists of a minimum of 3 control plane hosts that communicate with IBM Cloud's Satellite management plane.
- The Satellite hosts run entirely on the bank's own servers — IBM Cloud has management-plane visibility but no data-plane access to the bank's infrastructure.
- The trading platform and all customer account data remain on Satellite-managed infrastructure within the bank's physical perimeter. IBM Cloud never touches this data.
- The Satellite control plane communication (metadata only — no customer data) travels over an encrypted channel to IBM Cloud; this channel is the only network path between the bank's on-premises environment and IBM Cloud's control plane.
Layer 2: Network connectivity — IBM Cloud DirectLink Dedicated
10Gbps throughput with < 5ms latency between on-premises and IBM Cloud cannot be achieved over the public internet. The correct IBM product is IBM Cloud DirectLink Dedicated:
- DirectLink Dedicated provides a physical, dedicated fibre connection between the bank's colocation facility and IBM Cloud's network point of presence. No traffic traverses the public internet.
- Bandwidth options: 1Gbps, 2Gbps, 5Gbps, 10Gbps. The bank requires 10Gbps — provision two 10Gbps circuits for redundancy (active-active load balancing with automatic failover if one circuit fails).
- Latency: DirectLink Dedicated between London and IBM Cloud's London PoP achieves sub-millisecond latency. Cross-region traffic travels over IBM Cloud's private backbone — faster and more stable than public internet routing.
- BGP routing: Establish BGP peering between the bank's edge routers and IBM Cloud's DirectLink routers. Advertise the bank's on-premises subnets to IBM Cloud and IBM Cloud's VPC subnets to the bank's network. This enables private IP routing between environments without NAT.
Layer 3: Compute bursting — Red Hat OpenShift with cluster federation
The risk calculation workloads burst to IBM Cloud compute using Red Hat OpenShift on IBM Cloud (ROKS) with a cluster federation pattern:
- An OpenShift cluster runs on-premises (on Satellite) and serves as the primary cluster.
- A ROKS cluster in IBM Cloud (IBM Cloud region: eu-gb for London jobs, us-east for NY jobs) serves as the burst cluster.
- OpenShift cluster federation via Red Hat Advanced Cluster Management (RHACM): RHACM provides the single pane of glass that manages both the on-premises Satellite cluster and the cloud ROKS cluster from one console — satisfying the "single control plane" requirement.
- Risk calculation jobs are submitted to the on-premises cluster as Kubernetes Jobs. An RHACM placement policy routes jobs exceeding on-premises capacity to the cloud ROKS cluster for execution.
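As a minimal illustration of the placement mechanics, the sketch below assumes the burst ROKS cluster has been imported into RHACM, labelled environment: ibm-cloud-burst, and grouped into a hypothetical risk-compute ManagedClusterSet; the capacity-based burst decision itself would be layered on top (for example via a placement prioritizer or a custom placement score).
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: risk-burst-placement
  namespace: risk-calculation          # hypothetical namespace for the risk workload
spec:
  clusterSets:
    - risk-compute                     # assumed ManagedClusterSet containing both clusters
  numberOfClusters: 1
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            environment: ibm-cloud-burst   # assumed label on the ROKS burst cluster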
The critical data residency control:
Risk calculation jobs in the cloud ROKS cluster:
- Pull their input data (market data, position data) from on-premises via DirectLink at job start — data is transferred into ephemeral pod memory, never written to IBM Cloud storage
- Perform computation entirely in pod memory (Monte Carlo simulations are compute-bound, not storage-bound)
- Write results back to the bank's on-premises Satellite cluster via DirectLink at job completion — results never touch IBM Cloud Object Storage or any IBM-managed persistence layer
- Pod ephemeral storage and memory are cleared on pod termination — IBM Cloud has no persistent copy of any computation input or output
Enforcement mechanism: Kubernetes NetworkPolicies on the cloud ROKS cluster block all egress to IBM Cloud Object Storage endpoints. An OPA/Gatekeeper policy rejects any PersistentVolumeClaim creation in the burst cluster — no workload can write to persistent storage, enforced at the admission controller level.
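A minimal sketch of the egress control, assuming the burst jobs run in a namespace named risk-calculation and the bank's on-premises ranges advertised over DirectLink sit in 10.20.0.0/16 (both values are illustrative): the policy denies all egress except DNS and traffic back to the on-premises range, so calls to IBM Cloud Object Storage endpoints are blocked by default. The OPA/Gatekeeper PVC rejection would be a separate ConstraintTemplate denying PersistentVolumeClaim objects at admission.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress-to-onprem
  namespace: risk-calculation            # assumed namespace for the burst jobs
spec:
  podSelector: {}                        # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.20.0.0/16           # assumed on-premises range reachable via DirectLink
    - ports:                             # allow in-cluster DNS resolution
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53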
Layer 4: Identity and access — IBM Cloud IAM with on-premises integration
The bank's engineers manage both environments through a unified identity model:
- IBM Cloud IAM is federated with the bank's Active Directory via SAML 2.0. Engineers authenticate with their bank credentials; IBM Cloud IAM issues scoped tokens.
- RHACM role-based access policies ensure that on-premises cluster administration is only available to engineers with the bank's internal cluster-admin LDAP group membership — not to IBM Cloud staff.
Key Concepts Tested
- IBM Cloud Satellite architecture: extending IBM Cloud managed services to on-premises infrastructure
- IBM Cloud DirectLink Dedicated for private, high-throughput, low-latency hybrid connectivity
- Red Hat Advanced Cluster Management (RHACM) for unified multi-cluster management
- Data residency enforcement: ephemeral compute vs persistent storage separation
- OPA/Gatekeeper admission control for storage policy enforcement at the cluster level
- BGP peering for private IP routing between on-premises and cloud environments
Follow-Up Questions
- "The bank's CISO raises a concern: the IBM Cloud Satellite control plane communication — even though it carries only metadata — traverses IBM Cloud's network infrastructure. Their internal policy requires that no communication channel to a third-party cloud provider carry any data derived from their trading systems, including metadata about job execution. How do you redesign the architecture to address this, and does it change your technology choices?"
- "Six months after deployment, the bank wants to add a second use case: running their anti-money laundering ML model inference on IBM Cloud to take advantage of GPU instances unavailable on-premises. Unlike the risk calculation jobs, the AML model must query a live customer transaction stream — which appears to conflict with the data residency requirement. How do you architect a solution that enables cloud GPU inference on customer transaction data without violating data residency?"
Question 2: Red Hat OpenShift Cluster Operations and Node Management at Scale
Interview Question
You are the lead infrastructure engineer responsible for a Red Hat OpenShift Container Platform 4.x cluster running on IBM Cloud that hosts 180 microservices for an insurance company's claims processing platform. The cluster has 42 worker nodes across three availability zones. Over the past month, the platform team has reported three problems: (1) approximately twice per week, one or more worker nodes enters a NotReady state due to kernel-level memory pressure, causing pod rescheduling that temporarily degrades the claims API; (2) several application teams are requesting custom kernel parameters — higher vm.max_map_count for Elasticsearch nodes, higher file descriptor limits for high-connection services — that are difficult to apply safely without disrupting running workloads; (3) a security patch requires updating the worker node OS (Red Hat CoreOS) across all 42 nodes, but the team is concerned about coordinating the rolling update without causing a claims processing outage.
Address each of the three problems using OpenShift-native tooling and explain the operational approach for each.
Why Interviewers Ask This Question
Red Hat OpenShift is IBM's flagship container platform for enterprise clients, and IBM Cloud Infrastructure Engineers are expected to operate it at a depth that goes beyond basic Kubernetes administration. This question tests familiarity with OpenShift-specific operational tooling — particularly the Machine Config Operator (MCO), which is how OpenShift manages node-level configuration — and whether a candidate can reason through rolling node updates in a way that preserves application availability. Each of the three problems maps to a specific OpenShift feature that distinguishes it from vanilla Kubernetes.
Example Strong Answer
Problem 1: Node NotReady due to kernel memory pressure — eviction threshold tuning
In OpenShift 4.x on IBM Cloud, node configuration is managed by the Machine Config Operator (MCO) — not by SSHing into nodes and editing files directly. For memory pressure issues, I would first audit resource governance before tuning eviction thresholds:
oc describe node <node-name> | grep -A5 "Conditions:"
oc top node <node-name>
If pods lack memory limits (common root cause), enforce them via LimitRange and ResourceQuota at the namespace level — this addresses the root cause rather than just tuning recovery behaviour.
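A minimal sketch of that namespace-level guardrail, with an assumed claims-processing namespace and illustrative sizing; containers that declare no memory request or limit inherit the defaults, so no pod can run unbounded and push a node into memory pressure.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: claims-processing       # assumed application namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi                # applied when a container declares no request
      default:
        memory: 512Mi                # applied when a container declares no limit
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota
  namespace: claims-processing
spec:
  hard:
    requests.memory: 64Gi            # illustrative namespace ceiling
    limits.memory: 96Gi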
For eviction threshold tuning, use a KubeletConfig custom resource:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: worker-memory-eviction
spec:
machineConfigPoolSelector:
matchLabels:
pools.operator.machineconfiguration.openshift.io/worker: ""
kubeletConfig:
evictionHard:
memory.available: "500Mi" # increased from default 100Mi
evictionSoft:
memory.available: "1Gi"
evictionSoftGracePeriod:
memory.available: "2m"This gives pods a 2-minute grace period to reduce memory before hard eviction, and raises the hard threshold so recovery starts earlier — before the node reaches NotReady.
Problem 2: Custom kernel parameters per workload type — MachineConfigPools
Applying different kernel parameters to different nodes requires a dedicated MachineConfigPool for the specialised nodes:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
name: elasticsearch-workers
spec:
machineConfigSelector:
matchExpressions:
- key: machineconfiguration.openshift.io/role
operator: In
values: [worker, elasticsearch-workers]
nodeSelector:
matchLabels:
node-role.kubernetes.io/elasticsearch-worker: ""
Label the target nodes:
oc label node <es-node-1> node-role.kubernetes.io/elasticsearch-worker=""
Apply the kernel parameter only to this pool via a MachineConfig:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: elasticsearch-workers
name: 50-elasticsearch-kernel-tuning
spec:
config:
ignition:
version: 3.2.0
storage:
files:
- path: /etc/sysctl.d/99-elasticsearch.conf
mode: 420 # decimal for octal 0644
contents:
source: data:,vm.max_map_count%3D262144%0A
The MCO applies this only to nodes in the elasticsearch-workers pool via a managed drain → reboot → uncordon cycle — no manual node access required, and standard worker nodes are completely unaffected.
Problem 3: Rolling RHCOS OS update across 42 nodes without a claims outage
Red Hat CoreOS on OpenShift 4.x is updated through the Machine Config Operator, which handles the drain, update, reboot, and uncordon lifecycle automatically. The critical safety dependency is Pod Disruption Budgets (PDBs):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: claims-api-pdb
namespace: claims-processing
spec:
minAvailable: 2
selector:
matchLabels:
app: claims-api
The MCO's drain step respects PDBs. If draining a node would violate minAvailable: 2, the drain is blocked until the constraint can be satisfied — the update waits, it does not force an outage.
Control the rollout speed with maxUnavailable on the MachineConfigPool:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
name: worker
spec:
maxUnavailable: 1
With maxUnavailable: 1, nodes update sequentially — one node at a time, approximately 5–8 minutes per node. For 42 nodes, total update time is 3.5–5.6 hours. Schedule this in a low-traffic overnight window and ensure no critical service has all its replicas on a single node before starting.
Key Concepts Tested
- Machine Config Operator (MCO): OpenShift's mechanism for all node-level configuration
- KubeletConfig for eviction threshold tuning without direct node access
- MachineConfigPools with custom node labels for targeted kernel parameter application
- Pod Disruption Budgets as the update safety control — MCO respects PDBs during drain
- maxUnavailable on MachineConfigPools to control rolling update speed
- RHCOS update model via MCO vs traditional OS patching
Follow-Up Questions
- "During a node update rollout with
maxUnavailable: 1, one node gets stuck in theUpdatingstate for 45 minutes — far longer than the expected 5–8 minutes. The MCO is not producing error logs and the node appears healthy inoc get nodes. What is your investigation process, and at what point would you consider manually intervening to prevent the stuck update from blocking the rest of the rollout?"
- "The security team requests that SSH daemon be disabled on all RHCOS nodes — direct SSH access is prohibited by policy. However, your team uses
oc debug node/<n>for troubleshooting. After disabling SSH via a MachineConfig,oc debug node/<worker-3>fails with a permissions error during a critical incident. Explain whatoc debug noderequires to function, why it failed, and how you restore the troubleshooting capability without re-enabling SSH."
Question 3: Observability Architecture for Hybrid Cloud Environments
Interview Question
IBM has deployed a hybrid cloud platform for a large UK retailer. The e-commerce platform runs across two environments: a Red Hat OpenShift cluster on-premises managed via IBM Cloud Satellite, and a managed OpenShift cluster on IBM Cloud used for burst compute during peak seasons. The platform currently has no unified observability — on-premises workloads are monitored by legacy Nagios while the IBM Cloud cluster uses IBM Cloud Monitoring. During a Black Friday incident last year, a latency degradation affecting the checkout API took 54 minutes to detect because the root cause originated in the on-premises inventory service but manifested as slow responses from a cloud-hosted payment processing service. The two monitoring systems had no correlation capability.
Design the full observability stack — metrics, traces, and logs — for both environments, addressing cross-environment correlation and the specific failure mode that caused the 54-minute detection gap.
Why Interviewers Ask This Question
Observability for hybrid cloud environments is one of the hardest operational challenges IBM's infrastructure clients face. The inability to correlate signals across on-premises and cloud environments is exactly what IBM Instana and distributed tracing are designed to address. This question tests whether a candidate understands the three pillars of observability as a correlated system — not three independent tools — and can design an architecture that provides a single operational view across physically and administratively separate environments.
Example Strong Answer
Diagnosing the 54-minute detection failure
The incident reveals three distinct observability failures:
- No distributed tracing: Without a trace following the checkout request from the cloud payment service through to the on-premises inventory service, there was no way to see that slow payment API responses were caused by downstream inventory calls — they appeared as a standalone payment service problem.
- Siloed monitoring: Nagios and IBM Cloud Monitoring had no shared data model. Neither could see the other environment's state, so neither could correlate root cause with symptom.
- Alert on symptom, not cause: Alerts fired on the payment service's elevated latency — the visible downstream symptom. The on-premises inventory degradation was invisible.
The collection standard: OpenTelemetry across both environments
Standardise on OpenTelemetry (OTel) as the instrumentation layer across both clusters. OpenTelemetry is natively supported by IBM Instana, produces W3C TraceContext headers for cross-environment trace propagation, and works identically on OpenShift whether running on-premises or in IBM Cloud.
Pillar 1: Distributed Tracing — the critical missing layer
Instrument all services with OpenTelemetry SDKs. Configure W3C TraceContext header propagation across all inter-service calls, including cross-environment calls that traverse DirectLink between on-premises and IBM Cloud. With tracing in place, the Black Friday incident trace would look like:
Trace: GET /checkout [940ms]
├── [Cloud] Payment Service [920ms]
│ └── [On-Prem] Inventory Service [850ms] ← root cause visible here
This trace, available within seconds of the incident beginning, would have reduced investigation time from 54 minutes to under 5 minutes.
Collection topology:
- Each OpenShift cluster runs an OpenTelemetry Collector DaemonSet receiving spans from all pods on each node
- On-premises collectors forward spans over DirectLink to IBM Instana — the preferred backend for IBM Cloud engagements, providing automatic instrumentation, application topology discovery, and hybrid environment correlation in a single console
- IBM Instana receives spans from both environments and renders cross-environment request flows as a unified trace waterfall
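A minimal collector configuration for this topology might look like the sketch below; the Instana agent service name and the decision to leave the in-cluster hop unencrypted are assumptions to adjust for the actual deployment.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp:
    endpoint: instana-agent.instana-agent.svc:4317   # assumed in-cluster Instana agent OTLP endpoint
    tls:
      insecure: true                                  # assumption: agent-to-backend hop carries the TLS
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]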
Pillar 2: Metrics — IBM Cloud Monitoring with cross-environment collection
The OpenShift built-in monitoring stack (openshift-monitoring namespace) deploys Prometheus, Alertmanager, and Grafana automatically in each cluster. For unified cross-environment visibility:
- Deploy the IBM Cloud Monitoring agent (Sysdig-based) as a DaemonSet in the on-premises Satellite cluster. The agent forwards all metrics to IBM Cloud Monitoring over DirectLink — no metrics traverse the public internet.
- IBM Cloud Monitoring provides a single dashboard showing metrics from both environments simultaneously.
- Define SLOs at the business service level: "99.5% of checkout requests complete in under 400ms" — measured end-to-end regardless of which environment the components live in. IBM Instana's Business Impact Monitoring evaluates this cross-environment SLO against real traffic.
Replace Nagios threshold alerts with SLO error budget burn rate alerts: the alert fires when the checkout API's error budget is being consumed faster than sustainable — typically within 3–5 minutes of a meaningful degradation starting, not after 54 minutes.
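A sketch of one such burn-rate alert as a PrometheusRule, assuming the checkout API exposes a standard request-duration histogram (metric names and namespace are illustrative); the expression fires when the 400ms latency SLO's error budget is being burned roughly 14 times faster than sustainable over the last 5 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo-burn-rate
  namespace: openshift-monitoring          # assumed placement; could equally live in user workload monitoring
spec:
  groups:
    - name: checkout-latency-slo
      rules:
        - alert: CheckoutLatencySLOFastBurn
          expr: |
            (
              1 - (
                sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.4"}[5m]))
                /
                sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
              )
            ) > (14.4 * 0.005)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Checkout API is burning its 99.5% / 400ms error budget ~14x faster than sustainable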
Pillar 3: Logs — structured and correlated via trace ID
Every service in both environments must inject the OpenTelemetry trace ID and span ID into every log line:
{
"timestamp": "2024-11-29T11:23:45.123Z",
"service": "inventory-service",
"environment": "on-premises",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"message": "Database query exceeded threshold",
"query_duration_ms": 850
}
With the trace ID in both the Instana trace and the log record, an on-call engineer jumps from the slow span directly to the associated log entries — no cross-system searching required.
Log pipeline:
- On-premises: FluentBit DaemonSet → Kafka (buffer for connectivity interruptions) → OpenSearch
- IBM Cloud: OpenShift Logging → IBM Log Analysis
- IBM Log Analysis aggregates logs from both environments into a single query interface, filterable by trace ID, service, environment, and time window
Deployment event correlation
Configure deployment events from both OpenShift GitOps pipelines to post to IBM Instana via the Events API. These appear as vertical markers on all metric and trace charts — making "did a deployment cause this?" answerable in 30 seconds.
Key Concepts Tested
- OpenTelemetry as the vendor-neutral instrumentation standard for hybrid environments
- W3C TraceContext header propagation across cross-environment service calls
- IBM Instana for unified hybrid cloud observability with automatic topology discovery
- OpenShift built-in monitoring stack and IBM Cloud Monitoring agent for cross-environment metrics
- Structured logging with trace ID / span ID as the log-to-trace correlation mechanism
- SLO burn rate alerting replacing threshold-based monitoring
Follow-Up Questions
- "Your OpenTelemetry Collector is forwarding 4.2 million spans per minute to IBM Instana. After 3 months, IBM Cloud Monitoring costs have grown to £22,000/month — 40% over budget. The on-call team says they only look at traces during incidents, which occur roughly twice per month. How do you redesign the trace collection strategy to reduce cost by at least 50% while ensuring the traces most needed during an incident are always retained?"
- "A new microservice team is onboarding to the platform. They are building a service in Go and have never used OpenTelemetry before. They ask: 'Do we have to manually instrument our code, or can the platform inject tracing automatically?' Explain the difference between manual instrumentation and auto-instrumentation options available in your OpenShift environment, and give a recommendation with your reasoning."
Question 4: Infrastructure Automation with Terraform and Ansible for IBM Cloud
Interview Question
IBM is building a repeatable infrastructure-as-code framework for deploying a standard "IBM Cloud Landing Zone" for enterprise clients — a baseline IBM Cloud environment that includes a Virtual Private Cloud with multiple subnets across three availability zones, transit gateway connectivity to on-premises networks, a managed OpenShift cluster, IBM Key Protect for encryption key management, IBM Cloud Activity Tracker for audit logging, and IBM Cloud Security and Compliance Centre integration. This Landing Zone must be deployable in under 2 hours, must be customisable for different clients (different CIDR ranges, regions, node counts), and must be fully reproducible — applying the same configuration twice must produce identical infrastructure.
Design the infrastructure automation framework, addressing tool choice, module structure, configuration management, and the controls that ensure the Landing Zone stays compliant after initial deployment.
Why Interviewers Ask This Question
IBM Cloud Landing Zones are a real IBM product offering — IBM publishes reference architectures for repeatable, compliant cloud environment provisioning, and the tooling behind them is exactly what IBM Cloud Infrastructure Engineers build and maintain. This question tests whether a candidate can design modular, reusable Terraform at the scale of a full cloud environment, understands the IBM Cloud Terraform provider, knows when Ansible complements Terraform in a hybrid automation stack, and can reason about post-deployment governance controls that keep infrastructure compliant over time.
Example Strong Answer
Tool selection: Terraform for provisioning, Ansible for configuration
Terraform and Ansible serve complementary functions:
- Terraform: Declarative infrastructure provisioning with state management, through the IBM Cloud Terraform provider. Manages IBM Cloud resources (VPC, subnets, ROKS clusters, Key Protect, Transit Gateway) with an explicit plan-before-apply workflow. Terraform's state file tracks the relationship between HCL declarations and actual cloud resources.
- Ansible: Imperative configuration management for tasks Terraform cannot model: bootstrapping OpenShift operators, configuring LDAP authentication integration, installing custom CA certificates into cluster trust stores, running post-deployment validation playbooks. Also the right tool for on-premises configuration that operates outside the IBM Cloud API.
Module structure: composable, versioned Terraform modules
ibm-landing-zone/
├── modules/
│ ├── vpc/ # VPC, subnets, ACLs, security groups
│ ├── transit-gateway/ # Transit gateway + on-prem connectivity
│ ├── openshift-cluster/ # ROKS cluster, worker pools, add-ons
│ ├── key-protect/ # Key Protect instance + root keys
│ ├── activity-tracker/ # Activity Tracker routing rules
│ ├── security-compliance/ # SCC integration, profile attachment
│ └── iam-baseline/ # Access groups, trusted profiles, policies
│
├── patterns/
│ └── standard-landing-zone/ # Root module composing all sub-modules
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
│
└── environments/
├── client-a/ # Client A's variable values
│ └── terraform.tfvars
└── client-b/
└── terraform.tfvars
Each client customises the Landing Zone via a terraform.tfvars file. The module code never changes — only variable values differ per client. This allows IBM to offer the Landing Zone as a tested, supported product where client configuration is cleanly separated from platform code.
Module versioning:
module "vpc" {
source = "git::https://github.ibm.com/cloud-platform/ibm-landing-zone//modules/vpc?ref=v1.4.2"
region = var.region
# ...
}
When IBM releases a security update, it publishes a new module version. Clients upgrade by bumping the version reference and running terraform plan to review the diff before applying.
State management: IBM Cloud Object Storage backend with locking
terraform {
backend "s3" {
bucket = "ibm-landing-zone-tfstate"
key = "client-a/terraform.tfstate"
region = "eu-gb"
endpoint = "https://s3.eu-gb.cloud-object-storage.appdomain.cloud"
}
}IBM COS supports the AWS S3-compatible API, enabling Terraform's S3 backend with Object Lock for state locking — preventing two engineers from running concurrent applies against the same state.
Ansible for post-provisioning configuration
After Terraform provisions the infrastructure, an Ansible playbook handles configuration that Terraform cannot model:
- name: Configure OpenShift cluster post-deployment
hosts: localhost
tasks:
- name: Add IBM LDAP authentication to OpenShift OAuth
k8s:
definition: "{{ lookup('template', 'oauth-ldap.yaml.j2') }}"
- name: Install IBM Cloud Pak operator via OLM
k8s:
definition: "{{ lookup('template', 'cp4i-subscription.yaml.j2') }}"
- name: Run SCC compliance validation checks
include_role:
name: ibm.scc_validation
Post-deployment compliance: IBM Security and Compliance Centre (SCC)
The Landing Zone must stay compliant after deployment — not just at initial provisioning. IBM Cloud Security and Compliance Centre continuously evaluates IBM Cloud resources against the IBM Cloud Framework for Financial Services profile:
- SCC scans run every 24 hours, evaluating ~200 controls (VPC ACL configurations, Key Protect key rotation policy, IAM policies, Activity Tracker routing)
- Any drift from the compliant baseline — a security group rule added manually, an IAM policy changed outside Terraform — appears as a failed control in the SCC dashboard within 24 hours
- SCC findings are routed to IBM Cloud Activity Tracker, which emits a PagerDuty alert for any high-severity compliance failure
This creates a detect-and-remediate loop: engineers who make manual changes are notified immediately and must either revert the change or codify it in Terraform — preventing compliance drift from accumulating silently.
Key Concepts Tested
- Terraform module structure: separating module code from per-client variable values
- IBM Cloud Terraform provider resources: VPC, ROKS, Key Protect, Transit Gateway
- Terraform state backend on IBM Cloud Object Storage with S3-compatible API and state locking
- Module versioning via Git tags for controlled upgrades across client environments
- Ansible for post-provisioning configuration tasks outside Terraform's declarative model
- IBM Security and Compliance Centre (SCC) for continuous post-deployment compliance monitoring
Follow-Up Questions
- "During a
terraform planfor a routine module upgrade, the plan includes destruction and recreation of the client's ROKS cluster — a 20-minute outage that would require all workloads to be rescheduled. The only change in the module version was an update to cluster tagging. What causes Terraform to plan a resource recreation instead of an in-place update, and how do you resolve this without causing the outage?"
- "A client's platform team has been manually modifying security group rules in the IBM Cloud console for 3 months. When you run
terraform plan, it shows 47 diffs across 12 security groups. The team is nervous about applying the plan because they are unsure which of the 47 changes are their intentional modifications and which are legitimate updates from the module upgrade. How do you safely reconcile this state drift without losing valid manual changes or reverting module updates?"
Question 5: Security, Identity, and Zero-Trust Access Management in IBM Cloud
Interview Question
IBM is designing the IAM architecture for a hybrid cloud environment serving a UK financial services firm under FCA regulation. The environment includes IBM Cloud with 14 VPCs across production, non-production, and management environments; a Red Hat OpenShift cluster in each VPC; IBM Cloud Satellite locations in the firm's two on-premises data centres; and 340 engineers across platform engineering, application development, security operations, and data science teams. The firm's security requirements: no standing privileged access (engineers must not hold persistent admin credentials); all access must be auditable at the API-call level; privileged operations must require MFA and a secondary approval; and production environments must be technically segregated from development — no engineer should be able to promote their own code from development to production.
Design the IAM architecture that meets all four requirements, and explain how you would implement zero-trust access controls for both IBM Cloud resources and the OpenShift clusters within them.
Why Interviewers Ask This Question
Financial services IAM is one of the most demanding access management challenges in cloud infrastructure, and IBM's FCA-regulated clients are among the most security-conscious in IBM's client portfolio. This question tests whether a candidate understands IBM Cloud IAM at production depth — not just access groups and policies, but the specific mechanisms for just-in-time access, privileged access approval workflows, and the separation of OpenShift RBAC from cloud-level IAM. The zero-trust requirement forces candidates beyond traditional perimeter-based access thinking to a model where no identity is permanently trusted.
Example Strong Answer
The zero-trust access principle applied to IAM
Zero-trust access means every access request is authenticated, authorised, and logged regardless of network location or prior access history. For a 340-person engineering organisation, this has five concrete implications:
- No engineer holds permanent admin credentials — access is granted just-in-time and expires
- Every IBM Cloud API call and every oc command to OpenShift is logged
- Sensitive operations require MFA plus a secondary approval
- Production and non-production access are technically segregated, not just policy-segregated
- Workload identity (service accounts, CI/CD pipelines) is managed separately from human identity
Requirement 1: No standing privileged access — JIT access via IBM Cloud IAM
IBM Cloud IAM's native primitive for this pattern is Access Groups. Engineers never belong to privileged Access Groups permanently. When a privileged operation is needed:
- Engineer submits an access request (via ServiceNow or a custom Slack bot) specifying the reason, duration (maximum 4 hours), and the required Access Group
- A secondary approver (team lead or security officer) approves the request
- An IBM Cloud Function calls the IBM Cloud IAM Groups API to add the engineer to the Access Group for the requested duration
- A scheduled IBM Cloud Function removes the engineer when the duration expires
This workflow is built entirely on IBM Cloud IAM primitives — no third-party PAM solution required, though HashiCorp Boundary or CyberArk are valid alternatives for clients that prefer a commercial PAM tool.
Requirement 2: API-call-level audit logging — IBM Cloud Activity Tracker
IBM Cloud Activity Tracker captures every IBM Cloud API call — IAM policy changes, VPC modifications, Key Protect operations, ROKS cluster administration — with full context (who, what, when, from where).
Configuration:
- One Activity Tracker instance per region, routing events from all services in that region
- Global events routing: IAM events (global, not regional) route to the Frankfurt instance for EU data residency compliance
- 30-day hot retention + 1-year archival to IBM Cloud Object Storage — meets FCA audit trail obligations
- Activity Tracker alerts for high-risk events: iam.policy.create, kms.secrets.delete, or any privileged operation outside an approved change window triggers an immediate security alert
For OpenShift audit logging:
- OpenShift API server audit logging is enabled by default and captures all kubectl/oc commands
- Forward OpenShift audit logs to IBM Log Analysis alongside Activity Tracker events for a unified audit trail covering both IBM Cloud API calls and Kubernetes API calls
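A sketch of that audit forwarding using the OpenShift Cluster Logging ClusterLogForwarder; the output type and endpoint shown are placeholders, and the concrete output would be whatever ingestion mechanism the chosen IBM Log Analysis or central logging instance exposes.
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: central-audit
      type: syslog                                  # placeholder transport for the central log service
      url: tls://audit-logs.example.internal:6514   # assumed collector endpoint
  pipelines:
    - name: audit-to-central
      inputRefs:
        - audit                                     # the API server audit log stream
      outputRefs:
        - central-audit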
Requirement 3: MFA + secondary approval for privileged operations
MFA is enforced at the IBM Cloud account level:
- Enable account-wide TOTP MFA in IBM Cloud IAM settings — all console logins and API token requests require a TOTP code
- For programmatic access (Terraform, CI/CD pipelines), use Service IDs with scoped API keys — service IDs correctly have no MFA requirement, as MFA is a human-identity control
- Secondary approval is provided by the JIT access workflow described above — the combination of MFA (something the engineer has) and manager approval (an organisational control) provides two independent factors for production access
Requirement 4: Technical production/development segregation — IBM Cloud Enterprise Accounts
Policy segregation alone is insufficient for FCA-regulated environments. Technical segregation requires separate IBM Cloud accounts:
IBM Cloud Enterprise Account (management)
├── Production Account
│ ├── Production VPCs (14)
│ ├── Production ROKS clusters
│ └── IBM Cloud Satellite (on-premises production)
│
├── Non-Production Account
│ ├── Development and staging VPCs
│ └── Non-production ROKS clusters
│
└── Management Account
└── Activity Tracker, SCC, Secrets Manager (shared services)
An engineer in the non-production account cannot assume a production identity — there is no IAM policy path between accounts except through explicitly defined Enterprise Account cross-account trust policies. A developer deploying to staging is technically incapable of promoting to production without a separate production account credential, which they do not permanently hold.
OpenShift RBAC integration with IBM Cloud IAM
ROKS integrates IBM Cloud IAM with OpenShift RBAC — Access Group membership automatically maps to OpenShift ClusterRoles:
| IBM Cloud IAM Access Group | OpenShift ClusterRole |
|---|---|
| platform-admin-production | cluster-admin |
| app-developer-production | edit (namespace-scoped) |
| security-reader-all-envs | view (cluster-scoped) |
When the JIT workflow adds an engineer to platform-admin-production, they automatically receive cluster-admin in the production OpenShift cluster for the same duration. When the JIT access expires, they lose both IBM Cloud and OpenShift access simultaneously — no separate RBAC management required.
Workload identity — no human credentials in CI/CD
CI/CD pipelines must never use human engineer credentials:
- Create IBM Cloud Service IDs with scoped API keys per pipeline role (pipeline-deployer-staging, pipeline-image-push)
- Store API keys in IBM Secrets Manager, injected into pipeline environments at runtime — never stored in Git or CI/CD variable stores
- Rotate service ID API keys automatically every 90 days via IBM Secrets Manager rotation policies
Key Concepts Tested
- IBM Cloud Access Groups as the unit of IAM policy assignment
- JIT privileged access using IBM Cloud IAM Groups API with approval workflow
- IBM Cloud Activity Tracker for API-level audit logging with FCA-compliant retention
- IBM Cloud Enterprise Account structure for technical production/non-production segregation
- ROKS IBM Cloud IAM to OpenShift RBAC automatic mapping — unified identity across layers
- Service IDs with scoped API keys and IBM Secrets Manager rotation for workload identity
Follow-Up Questions
- "Three months after deploying the JIT access system, a security audit finds that engineers are approving each other's access requests reciprocally — engineer A approves engineer B's production access request, and engineer B approves engineer A's. This defeats the secondary approval control. How do you detect this pattern retroactively in Activity Tracker logs, and what controls do you add to the approval workflow to prevent it going forward?"
Question 6: Disaster Recovery Architecture for a Hybrid IBM Cloud Environment
Interview Question
IBM is designing the disaster recovery strategy for a UK National Health Service trust's clinical information system. The system runs across a hybrid environment: the primary workload is a Red Hat OpenShift cluster hosted on IBM Cloud in the London region (eu-gb), with patient record data stored in IBM Cloud Databases for PostgreSQL. A secondary IBM Cloud Satellite location runs in the trust's own on-premises data centre for low-latency clinical applications. The NHS trust's recovery objectives are: an RTO of 30 minutes and an RPO of 5 minutes for the patient record system, and an RTO of 4 hours and an RPO of 1 hour for the administrative workloads. The current DR strategy is a manual runbook that was last tested 14 months ago and took 2.5 hours to execute — far exceeding the 30-minute RTO.
Redesign the disaster recovery architecture to meet both sets of recovery objectives, and explain why the current manual runbook approach is structurally incapable of meeting a 30-minute RTO.
Why Interviewers Ask This Question
Disaster recovery for hybrid cloud environments combining IBM Cloud managed services and on-premises Satellite infrastructure is a direct IBM Cloud Infrastructure Engineer responsibility at NHS and healthcare clients, where regulatory obligations (NHS DSP Toolkit, ISO 27001) make DR planning and testing a compliance requirement — not just a best practice. This question tests whether a candidate understands the structural relationship between RTO targets and automation requirements (manual runbooks have hard floors on achievable RTO), knows IBM Cloud's specific DR capabilities (cross-region replication for IBM Cloud Databases, IBM Cloud Object Storage cross-region buckets), and can differentiate DR strategies for workloads with different recovery objectives.
Example Strong Answer
Why the manual runbook cannot achieve a 30-minute RTO — structural analysis
A manual runbook with a 2.5-hour historical execution time has three structural constraints that make 30-minute RTO unachievable regardless of how well the runbook is written:
- Detection and escalation latency: A regional IBM Cloud incident must be detected by monitoring, an on-call engineer must be paged and respond, the incident must be confirmed as requiring DR activation, and an approval decision must be made. Even with excellent monitoring, this typically takes 10–20 minutes for a clear-cut regional failure — longer for ambiguous partial degradation scenarios.
- Sequential manual execution: A 30-minute RTO requires that every step in the failover process completes in the remaining 10–20 minutes after detection. Manual steps — promoting a database replica, updating DNS records, verifying OpenShift cluster health in the failover region, reconfiguring load balancers — each take several minutes and cannot be parallelised by a single on-call engineer.
- Cognitive load under pressure: Engineers executing a DR runbook at 3am for a patient-facing clinical system are under significant stress. Step skipping, sequencing errors, and time lost re-reading runbook steps are not edge cases — they are expected outcomes of manual DR execution under time pressure.
The arithmetic: detection (10–20 minutes) plus the decision to activate DR (5–10 minutes) consumes 15–30 minutes before the first runbook step even begins, leaving at most 15 minutes for a procedure that historically took 2.5 hours. A 30-minute RTO therefore cannot be met by executing the runbook faster; the manual steps have to come out of the critical path.
The required architecture: automated failover with human oversight for fail-back
Tier 1 workloads (RTO: 30 min, RPO: 5 min) — Patient record system
Database layer — IBM Cloud Databases for PostgreSQL with cross-region replication:
IBM Cloud Databases for PostgreSQL supports read replicas in a secondary IBM Cloud region (eu-de, Frankfurt, for a primary in eu-gb, London). Configure synchronous replication to the Frankfurt read replica:
- Synchronous replication: every write is committed in both London and Frankfurt before the application receives acknowledgement. This achieves RPO of near-zero (maximum data loss equals in-flight writes at the moment of failure — typically under 1 second) — well inside the 5-minute RPO requirement.
- Trade-off: synchronous replication adds write latency equal to the London-to-Frankfurt round-trip time (~15ms). For a clinical information system where write correctness is paramount, this latency cost is acceptable and explicitly preferable to asynchronous replication's data loss risk.
- Automatic leader election: IBM Cloud Databases for PostgreSQL supports automated promotion of the Frankfurt replica to primary when the London primary becomes unreachable, within 60–90 seconds. No manual DBA intervention required.
Application layer — ROKS cluster in Frankfurt with active-warm standby:
Deploy a warm standby ROKS cluster in IBM Cloud eu-de (Frankfurt). "Warm" means the cluster is provisioned and healthy with all Kubernetes workloads deployed, but serving no live traffic:
- Workloads are deployed identically to London using OpenShift GitOps (ArgoCD) — the same GitOps repository drives both clusters, so any application change deployed to London is automatically synchronised to Frankfurt within minutes (an Application sketch follows this list).
- The Frankfurt cluster is configured as a standby: its Ingress routes to a health check endpoint that returns 503 until failover is activated. Real traffic is not served until failover is triggered.
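A sketch of the Frankfurt side of that GitOps pairing, assuming a Kustomize overlay per region and illustrative repository and cluster API URLs; an equivalent Application pointing at overlays/eu-gb and the London cluster keeps the primary in sync from the same repository.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clinical-platform-eu-de
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.nhs.uk/trust/platform-config.git   # assumed GitOps repository
    targetRevision: main
    path: overlays/eu-de                                            # Frankfurt-specific overlay
  destination:
    server: https://api.frankfurt-standby.example:6443              # assumed standby cluster API endpoint
    namespace: clinical-platform
  syncPolicy:
    automated:
      prune: true
      selfHeal: true          # drift on the standby is corrected automatically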
Automated failover controller:
Failover Controller (running independently in both regions)
│
├── Health probe: IBM Cloud ROKS API + PostgreSQL primary every 30 seconds
├── Consensus: 3 consecutive failed probes = failover trigger (90 seconds)
├── Automated actions on trigger:
│ ├── IBM Cloud Databases: promote Frankfurt replica to primary
│ ├── IBM Cloud Internet Services (CIS): update DNS CNAME to Frankfurt Ingress
│ │ (DNS TTL pre-set to 60 seconds — propagation within 60s of update)
│ ├── CIS Global Load Balancer: set London pool health to disabled
│ └── Notification: PagerDuty P1 alert, IBM Cloud Activity Tracker event
└── Human role: validate failover success, initiate planned fail-back
Failover timeline with automation:
| Step | Time |
|---|---|
| Incident begins | T+0 |
| Health probe detects failure (3 probes × 30s) | T+90s |
| Automated database promotion | T+3m |
| DNS propagation (60s TTL) | T+4m |
| Frankfurt cluster serving live traffic | T+5m |
| Total RTO | ~5 minutes |
This is 6× better than the 30-minute RTO requirement and provides a substantial operational safety margin.
Tier 2 workloads (RTO: 4 hours, RPO: 1 hour) — Administrative workloads
For administrative workloads with a 1-hour RPO, asynchronous replication is acceptable:
- IBM Cloud Object Storage with cross-region bucket replication: Administrative documents, reports, and attachments replicated asynchronously to a secondary region every 15 minutes (well within the 1-hour RPO).
- IBM Cloud Databases asynchronous replica: A separate PostgreSQL instance for administrative data with asynchronous replication — replication lag typically < 5 minutes under normal load.
- Manual failover with a tested, automated runbook: Tier 2 workloads use an Ansible-driven runbook that executes failover steps automatically but requires a human to initiate. The 4-hour RTO provides time for a deliberate human decision and structured execution.
DR testing as a first-class engineering practice
The 14-month testing gap is itself a compliance failure for an NHS trust. I would implement:
- Monthly automated DR drills in a staging environment: the failover controller triggers automatically, and a test validates that the Frankfurt cluster is serving traffic and the database promotion completed successfully — no human involvement required
- Quarterly production DR test: Planned failover of 10% of production traffic to Frankfurt during a low-activity window, with the trust's clinical informatics team observing. This validates the production architecture, not just the staging approximation
- All DR test results are documented and submitted as evidence for NHS DSP Toolkit Assertion 9.7 (business continuity)
Key Concepts Tested
- Structural analysis of why manual runbooks cannot achieve short RTO targets
- IBM Cloud Databases for PostgreSQL synchronous cross-region replication for near-zero RPO
- Warm standby ROKS cluster with OpenShift GitOps for workload synchronisation
- Automated failover controller design: removing humans from the failover critical path
- IBM Cloud Internet Services (CIS) DNS TTL strategy for fast traffic failover
- Tiered DR strategy: different RTO/RPO requirements driving different automation levels
- DR testing cadence as a compliance requirement, not an operational nice-to-have
Follow-Up Questions
- "Your automated failover controller triggers successfully, promoting the Frankfurt database replica and shifting DNS within 5 minutes. However, 20 minutes after failover, clinical staff report that 340 patient records updated in the final 3 minutes before the London failure are missing from the Frankfurt database — despite your synchronous replication configuration. What are the three most likely explanations for this data loss, and how do you investigate each to determine the root cause?"
- "The NHS trust's clinical informatics director asks: 'What happens if Frankfurt also has an outage simultaneously with London? Our patients cannot have zero access to their records.' Design the architecture changes required to handle a simultaneous dual-region failure, and explain what additional constraints this introduces on the RTO and RPO targets."
Question 7: Network Performance Optimisation in IBM Cloud VPC Environments
Interview Question
IBM has deployed a financial trading analytics platform for a European asset management firm on IBM Cloud. The platform consists of a Red Hat OpenShift cluster in IBM Cloud eu-gb (London), consuming real-time market data feeds from external data providers, processing tick data through a stream processing pipeline (Apache Kafka + Apache Flink), and serving analytics results to 850 portfolio managers via a web application. The firm is reporting three network performance problems: (1) the market data ingestion pipeline experiences intermittent 200–400ms latency spikes every 10–15 minutes, disrupting real-time price calculations; (2) inter-pod communication within the OpenShift cluster between the Flink processing pods and the Kafka consumer pods has higher than expected latency (8–12ms average versus 1–2ms expected); (3) the web application serving portfolio managers in Frankfurt, Paris, and Amsterdam is experiencing 600–900ms page load times, despite the backend API responding in under 80ms.
Diagnose each of the three network performance problems and design the architectural changes to resolve them.
Why Interviewers Ask This Question
Network performance troubleshooting in IBM Cloud VPC environments — spanning external ingestion pipelines, intra-cluster pod networking, and geographically distributed end-user performance — is a core IBM Cloud Infrastructure Engineer competency. This question tests whether a candidate can decompose multi-layer network performance problems, identify the IBM Cloud and Kubernetes networking constructs responsible for each issue, and design targeted fixes rather than broad infrastructure changes. Financial services clients in particular have stringent latency requirements that make network engineering a first-class concern.
Example Strong Answer
Problem 1: 200–400ms market data ingestion spikes every 10–15 minutes
The regularity of the spike interval (10–15 minutes) is a strong diagnostic signal — truly random network congestion does not produce periodic spikes. Periodic patterns in network latency almost always indicate one of three causes: scheduled processes (backups, log rotations, garbage collection), rate limiting with a refill interval, or TCP retransmission events triggered by a specific traffic pattern.
Diagnosis steps:
# Capture network traffic during a spike window
tcpdump -i eth0 -w /tmp/market-data-capture.pcap host <market-data-provider-ip>
# Check for TCP retransmission events
ss -s # socket statistics
netstat -s | grep retransmit
# Correlate spike timing with system processes
atop -r /var/log/atop/atop_<date>.log # check for periodic CPU/IO spikes
Most likely cause and fix: TCP buffer exhaustion and flow control
A 200–400ms spike on a high-throughput market data feed is characteristic of TCP receive buffer exhaustion triggering flow control. When the application cannot consume data from the receive buffer fast enough (e.g., during a Flink checkpoint or GC pause), the TCP window shrinks to zero — the sender stops transmitting and must wait for a window update. The periodic pattern matches typical GC or checkpoint intervals.
Fixes:
- Increase TCP socket buffer sizes on the IBM Cloud VSI instances running the Kafka brokers:
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
Persist via a MachineConfig in OpenShift to survive node reboots.
- Dedicated market data ingestion pods with CPU and memory guarantees (QoS: Guaranteed): If Flink checkpointing pauses are causing the receive buffer to fill, isolate the ingestion tier from checkpoint-heavy processing pods by running them on dedicated nodes with the node label workload-type: ingestion (a deployment sketch follows this list). Flink checkpoints happen on separate nodes and cannot interrupt ingestion processing.
- IBM Cloud DirectLink for market data providers: If the data provider supports it, use IBM Cloud DirectLink (Connect or Dedicated) rather than the public internet for the market data feed. DirectLink eliminates public internet routing variability that can cause the observed latency spikes.
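A sketch of that ingestion tier isolation, with illustrative names and sizing: requests equal to limits put the pods in the Guaranteed QoS class, and the nodeSelector pins them to the dedicated ingestion nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: market-data-ingest
  namespace: streaming
spec:
  replicas: 3
  selector:
    matchLabels:
      app: market-data-ingest
  template:
    metadata:
      labels:
        app: market-data-ingest
    spec:
      nodeSelector:
        workload-type: ingestion                                   # dedicated ingestion nodes
      containers:
        - name: ingest
          image: registry.example.internal/market-data-ingest:1.0   # assumed image reference
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:                                                  # requests == limits gives Guaranteed QoS
              cpu: "4"
              memory: 8Gi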
Problem 2: 8–12ms intra-cluster pod latency (expected: 1–2ms)
Intra-cluster pod-to-pod latency of 8–12ms is an order of magnitude higher than expected for pods within the same OpenShift cluster in a single IBM Cloud VPC. This rules out network hardware as the cause — the physical latency within a single IBM Cloud availability zone is under 1ms.
Diagnosis steps:
# Check if Flink and Kafka pods are on the same node or different nodes
oc get pods -o wide -n streaming # Check NODE column
# Check network policy rules between namespaces
oc get networkpolicy -n streaming
oc get networkpolicy -n kafka
# Check if a service mesh (Istio) is adding mTLS overhead
oc get pods -n istio-system
istioctl analyze -n streamingMost likely cause: cross-node traffic with service mesh mTLS overhead
If Flink and Kafka pods are scheduled on different nodes in different availability zones, the pod-to-pod traffic crosses the IBM Cloud VPC network — adding 2–5ms. If Istio service mesh is deployed with mTLS, each pod-to-pod call also incurs a TLS handshake on the Envoy sidecar proxy — adding another 3–6ms per connection establishment.
Fixes:
- Pod affinity rules: Schedule Flink consumer pods on the same nodes as (or nodes adjacent to) the Kafka broker pods:
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: kafka-broker
          topologyKey: kubernetes.io/hostname
- Istio mTLS with session resumption: If Istio is the latency source, configure TLS session resumption to eliminate the per-connection handshake overhead for long-lived streaming connections. Alternatively, for the high-throughput Flink-to-Kafka path, evaluate whether the service mesh adds security value that justifies the latency cost — internal cluster traffic between known workloads may not require mTLS (a scoped PeerAuthentication sketch follows this list).
- IBM Cloud VPC worker placement: Keep the Flink and Kafka worker nodes in a single availability zone for this latency-sensitive path so pod-to-pod traffic never crosses a zone boundary; IBM Cloud VPC placement groups can additionally control how those worker VSIs are distributed across physical hosts within the zone.
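If the decision is that the Flink-to-Kafka path does not need mesh mTLS, a scoped Istio PeerAuthentication can switch it off for just that workload rather than mesh-wide; the namespace and labels below are illustrative.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: kafka-broker-no-mtls
  namespace: kafka                    # assumed namespace of the Kafka brokers
spec:
  selector:
    matchLabels:
      app: kafka-broker
  mtls:
    mode: DISABLE                     # or PERMISSIVE if some clients must keep mTLS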
Problem 3: 600–900ms page load for Frankfurt, Paris, Amsterdam users
The backend API responds in 80ms, but page loads take 600–900ms from continental Europe to London. The delta (520–820ms) is pure network and browser rendering overhead — the application is not the problem.
Decomposing the latency budget:
- London to Frankfurt RTT: ~12ms. For a page load requiring 15 sequential HTTP/1.1 requests, that is 15 × 12ms = 180ms of network round-trip time — significant but not the full 520–820ms.
- TLS handshakes: each new TCP connection requires a TLS 1.2 handshake (2 round trips = ~24ms per connection). Fifteen connections = 360ms in TLS overhead alone with HTTP/1.1.
- Static asset delivery: large JavaScript bundles and images delivered from London add transfer time proportional to bundle size and available bandwidth.
Fixes:
- IBM Cloud Internet Services (CIS) with CDN and edge caching: Enable IBM CIS (powered by Cloudflare) with edge caching for the web application. IBM CIS has PoPs in Frankfurt, Paris, and Amsterdam — static assets (JS bundles, CSS, images) are served from the nearest PoP at < 5ms latency, not from London. This alone eliminates the majority of the 600–900ms page load time.
- HTTP/2 with TLS 1.3 enforcement: Enforce HTTP/2 at the IBM CIS edge. HTTP/2 multiplexes all 15 requests over a single TCP connection — eliminating 14 of 15 TLS handshakes. TLS 1.3 reduces the remaining handshake from 2 round trips to 1. Combined, this reduces connection establishment overhead from ~360ms to ~12ms.
- IBM CIS Origin Certificates + end-to-end TLS: Configure the CIS edge to terminate TLS from the browser and re-establish TLS to the London origin over IBM's private backbone — not the public internet. This gives portfolio managers encrypted end-to-end communication while the edge-to-origin path benefits from IBM's lower-latency private network.
Key Concepts Tested
- Periodic latency spike diagnosis: TCP buffer exhaustion vs rate limiting vs GC interference
- TCP socket buffer tuning via MachineConfig in OpenShift
- Pod affinity rules for co-locating latency-sensitive workloads on the same nodes
- IBM Cloud VPC placement groups for physical host co-location
- Istio service mesh mTLS overhead identification and mitigation
- IBM Cloud Internet Services (CIS) CDN for European edge delivery
- HTTP/2 multiplexing and TLS 1.3 for reducing connection establishment overhead
Follow-Up Questions
- "After deploying pod affinity rules for the Flink-to-Kafka path, intra-cluster latency drops to 1.8ms as expected. However, three days later, a cluster node fails and Kubernetes reschedules the Flink pods onto nodes in a different availability zone — latency spikes to 14ms and the trading analytics pipeline breaches its SLA for 40 minutes before the situation is manually corrected. How do you make the affinity rules more resilient so that node failures do not silently move pods to suboptimal placement?"
- "The IBM CIS CDN deployment reduces European page load times from 700ms to 180ms for static content. However, the portfolio managers report that the personalised dashboard — which loads their specific portfolio data via API calls that cannot be cached — still takes 600ms. Explain why CDN caching cannot help with this specific problem, and describe the architectural options available to reduce latency for dynamic, personalised API responses served from London to continental European users."
Question 8: CI/CD Pipeline Architecture for OpenShift on IBM Cloud
Interview Question
IBM is building a CI/CD platform for a large UK government department that is migrating 47 legacy applications to Red Hat OpenShift on IBM Cloud. The department has strict requirements: all container images must be built from approved base images and scanned for CVEs before deployment; no code can be deployed to production without passing automated tests and a mandatory peer review from a second engineer; all deployments to production must be auditable with a complete trail showing who approved what, when; and the production OpenShift cluster must only accept images from the department's own internal IBM Cloud Container Registry — no images from Docker Hub or public registries. The department's development teams use GitHub Enterprise (on-premises) as their source control system. Design the complete CI/CD pipeline architecture meeting all four requirements, using IBM Cloud and OpenShift-native tooling where possible.
Why Interviewers Ask This Question
CI/CD pipeline design for OpenShift in regulated government environments is a direct IBM Cloud Infrastructure Engineer responsibility in IBM's public sector practice. Government clients impose the same four requirements described — CVE scanning, mandatory peer review gates, deployment audit trails, and registry allowlisting — and these requirements map to specific OpenShift and IBM Cloud tooling that a candidate must know. This question also tests whether a candidate understands the difference between infrastructure-enforced policy (OPA/Gatekeeper restricting image sources at the cluster level) and process-enforced policy (a runbook saying "check images come from our registry"), and why only the former is sufficient for a regulated client.
Example Strong Answer
Pipeline architecture overview
The pipeline has four distinct stages, each addressing one of the four requirements:
GitHub Enterprise (on-prem)
│
├── Pull Request opened
│ └── [Stage 1: Build + Scan] OpenShift Pipelines (Tekton)
│ ├── Build image from approved base
│ ├── CVE scan (IBM Cloud Security Advisor / Trivy)
│ └── Push to IBM Cloud Container Registry (staging tag)
│
├── PR requires peer review approval (GitHub Enterprise CODEOWNERS)
│ └── [Gate: mandatory second engineer approval]
│
├── PR merged to main
│ └── [Stage 2: Integration Tests] OpenShift Pipelines
│ ├── Deploy to dev/test namespace
│ ├── Automated integration test suite
│ └── Test results recorded in pipeline run
│
└── Release tag created (by release engineer, not PR author)
└── [Stage 3: Production Deployment] OpenShift GitOps (ArgoCD)
├── ArgoCD Application syncs from release tag
├── Image digest validated against ICIR signature
└── Deployment event logged to IBM Cloud Activity Tracker
Requirement 1: Approved base images + CVE scanning
All Dockerfiles must use only images from an approved base image catalogue maintained in the IBM Cloud Container Registry. This is enforced at two layers:
Layer 1 — Pipeline enforcement (process gate):
The Tekton pipeline includes a validate-base-image task that parses the Dockerfile's FROM statement and checks it against an allowlist:
# Tekton Task: validate-base-image
steps:
- name: check-base-image
image: registry.access.redhat.com/ubi8/ubi-minimal
script: |
BASE_IMAGE=$(grep "^FROM" /workspace/source/Dockerfile | head -1 | awk '{print $2}')
if ! grep -q "$BASE_IMAGE" /workspace/approved-base-images.txt; then
echo "ERROR: Base image $BASE_IMAGE is not in the approved catalogue"
exit 1
fi
Layer 2 — CVE scanning with IBM Cloud Container Registry Vulnerability Advisor:
After the image is built and pushed to the staging tag in ICIR, the pipeline executes a Vulnerability Advisor scan. The pipeline task polls the scan result and fails if any Critical or High CVEs are found:
- name: check-vulnerability-scan
script: |
RESULT=$(ibmcloud cr va --output json ${IMAGE_TAG} | jq '.[] | .status')
if echo "$RESULT" | grep -q "Fail"; then
echo "Vulnerability scan failed — Critical/High CVEs detected"
exit 1
fi
Images that fail the scan cannot be tagged as approved and cannot proceed to the peer review gate.
Requirement 2: Mandatory peer review — GitHub Enterprise CODEOWNERS
GitHub Enterprise's CODEOWNERS file combined with branch protection rules enforces the mandatory second engineer approval at the source control layer — not just as a cultural practice:
# .github/CODEOWNERS
# All production deployment manifests require approval from a second member of the platform team
/deployment/production/* @department/platform-team-leads
/helm/values-production.yaml @department/platform-team-leads
Branch protection rules on main:
- Require at least 1 approving review from a CODEOWNERS member
- Dismiss stale reviews when new commits are pushed (prevents approval of a safe commit, then pushing a malicious change)
- Restrict who can push directly to main — only the CI service account, after all checks pass
Critically: the PR author cannot approve their own pull request — GitHub Enterprise enforces this natively when "Require review from Code Owners" is enabled. This addresses the "no engineer promotes their own code" requirement.
Requirement 3: Full deployment audit trail — IBM Cloud Activity Tracker + pipeline provenance
Every deployment to production generates an audit record at multiple layers:
- GitHub Enterprise: PR merge event includes the author, reviewer, approvals, and timestamps — immutable once merged
- OpenShift Pipelines (Tekton): Every pipeline run produces a PipelineRun resource in OpenShift with a complete record of all task executions, inputs, outputs, and timestamps — queryable via oc get pipelineruns
- ArgoCD sync history: Every ArgoCD sync (deployment to production) records the Git commit SHA, the user who triggered the sync, and the timestamp in ArgoCD's application history
- IBM Cloud Activity Tracker: All IBM Cloud Container Registry pushes and all ROKS cluster API operations (deployments, pod creations) are logged automatically via Activity Tracker with full request context
For a compliance auditor asking "who approved the deployment of version 2.4.1 to production, and when?" — the answer is traceable through GitHub PR approval history → Tekton PipelineRun → ArgoCD sync history → Activity Tracker ROKS events.
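To make the GitOps deployment stage concrete, a minimal sketch of the ArgoCD Application pinned to a release tag (repository URL, tag, paths, and namespaces are illustrative assumptions):
```yaml
# Hypothetical ArgoCD Application: production deployments sync only from an immutable
# release tag, so every sync is traceable to a specific, peer-reviewed Git revision.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service-production        # illustrative application name
  namespace: openshift-gitops
spec:
  project: production
  source:
    repoURL: https://github.department.example/platform/deployment-manifests.git
    targetRevision: v2.4.1                  # release tag created by the release engineer
    path: deployment/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-production
  syncPolicy:
    automated:
      prune: false      # deletions require an explicit, reviewed change
      selfHeal: true    # drift in the cluster is reverted to the tagged state
```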
Requirement 4: Production cluster only accepts images from ICIR — OPA/Gatekeeper admission policy
Process controls (a rule saying "only use our registry") are insufficient — they can be bypassed intentionally or accidentally. The production cluster must technically reject any pod that references an image from an unauthorised registry. This is implemented via OPA/Gatekeeper (available on OpenShift through the Gatekeeper Operator):
# OPA Rego policy: restrict image source to IBM Cloud Container Registry
package k8sallowedrepos
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
not startswith(container.image, "uk.icr.io/department-registry/")
msg := sprintf("Image '%v' is not from the approved registry", [container.image])
}
This ConstraintTemplate is instantiated as a K8sAllowedRepos Constraint in the production namespace. Any pod whose container image does not begin with uk.icr.io/department-registry/ is rejected at admission — before it is scheduled, before it pulls the image, before it runs. A developer who accidentally references docker.io/nginx in their Kubernetes manifest receives an immediate validation error when applying the manifest.
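A minimal sketch of the Constraint that instantiates the template for the production namespace (the resource name is an assumption):
```yaml
# Hypothetical Constraint for the ConstraintTemplate above. With the hard-coded Rego
# shown, no parameters are needed; the gatekeeper-library variant of K8sAllowedRepos
# instead takes the allowed prefixes via spec.parameters.repos.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: production-icr-only
spec:
  enforcementAction: deny        # reject at admission rather than only audit
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["production"]
```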
Additionally, enable IBM Cloud Container Registry image signing (Notary v2): only images signed with the department's private signing key are accepted by the cluster. A signature-verification admission policy checks the signature before admitting the pod, so even images hosted in ICIR that were not built and signed through the approved pipeline are rejected.
Key Concepts Tested
- OpenShift Pipelines (Tekton) for cloud-native CI pipeline execution on OpenShift
- IBM Cloud Container Registry Vulnerability Advisor for CVE scanning in pipeline gates
- GitHub Enterprise CODEOWNERS + branch protection for mandatory peer review enforcement
- OpenShift GitOps (ArgoCD) for auditable GitOps-driven production deployments
- IBM Cloud Activity Tracker for deployment audit trail across the full pipeline
- OPA/Gatekeeper ConstraintTemplate for infrastructure-enforced image registry allowlisting
- IBM Cloud Container Registry image signing with Notary v2 for supply chain security
Follow-Up Questions
- "The CVE scanning gate has been live for 6 weeks. A development team reports that their pipeline has been failing for 3 days because a newly disclosed Critical CVE in their approved base image (Red Hat UBI 8.7) has no fix available yet — the patched version won't be released for another 2 weeks. Their application needs to go to production urgently. How do you handle this exception without either bypassing the security control entirely or blocking a legitimate business-critical deployment?"
- "The OPA/Gatekeeper image registry policy is enforced in production. However, an incident investigation reveals that a pod running in the
kube-systemnamespace is pulling an image fromquay.io— a registry not in your allowlist — and your Gatekeeper policy did not block it. Explain the most likely reasons the policy did not apply to this pod, and how you audit and close the gap."
Question 9: Distributed Storage and Stateful Workloads on OpenShift
Interview Question
IBM has deployed a Red Hat OpenShift cluster on IBM Cloud for a media company running a video processing platform. The platform includes: a Kafka cluster (6 brokers, each requiring 4TB of persistent storage for topic data retention), an Elasticsearch cluster (12 nodes, each requiring 2TB of fast NVMe storage for search indices), and a PostgreSQL database cluster (primary + 2 read replicas, each requiring 500GB storage). The operations team is reporting three storage-related incidents in the past month: (1) a Kafka broker pod was rescheduled to a different node after a node failure, but the new node could not attach the original PersistentVolume — the pod stayed in Pending state for 47 minutes; (2) Elasticsearch indexing performance has degraded by 40% over the past 6 weeks, with disk I/O identified as the bottleneck; (3) a PostgreSQL primary pod was evicted due to node memory pressure, and during the 8-minute recovery period, all three pods (primary + 2 replicas) attempted to become primary simultaneously — resulting in a split-brain condition that required 2 hours of manual remediation. Diagnose each incident and redesign the storage and stateful workload architecture to prevent each class of failure.
Why Interviewers Ask This Question
Stateful workloads on Kubernetes — particularly storage-intensive applications like Kafka, Elasticsearch, and databases — are among the hardest operational challenges on OpenShift. IBM Cloud Infrastructure Engineers supporting media, analytics, and data platform clients routinely face exactly these three failure classes: PersistentVolume attachment failures after pod rescheduling, I/O performance degradation on shared storage, and split-brain conditions in clustered stateful applications. This question tests whether a candidate understands the Kubernetes storage primitives (StorageClass, PV topology, StatefulSets) at a depth that allows them to diagnose real incidents and design architectures that prevent them.
Example Strong Answer
Incident 1: Kafka PV attachment failure after node failover (47 minutes in Pending)
The 47-minute Pending state after a node failure is a zone topology mismatch between the PersistentVolume and the node where the pod was rescheduled.
Root cause analysis:
IBM Cloud Block Storage volumes are zonal — a volume provisioned in eu-gb-2 (London AZ 2) can only be attached to a VSI in eu-gb-2. If the Kafka broker pod is rescheduled to a node in eu-gb-1 or eu-gb-3 (because the scheduler does not know about the volume's zone constraint), the attach operation fails. The pod stays in Pending state until either: a node in eu-gb-2 becomes available (which could take 47+ minutes if all eu-gb-2 nodes are at capacity), or the volume topology constraint is correctly enforced.
Fix: StorageClass with volumeBindingMode: WaitForFirstConsumer
The standard IBM Cloud Block Storage StorageClass uses immediate volume binding — volumes are provisioned in a zone before a pod is scheduled, without knowing which zone the pod will land in. This creates the zone mismatch.
Change the StorageClass to WaitForFirstConsumer:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibmc-block-gold-topology-aware
provisioner: vpc.block.csi.ibm.io
parameters:
  profile: "10iops-tier"
volumeBindingMode: WaitForFirstConsumer  # critical: delays binding until the pod is scheduled
reclaimPolicy: Retain
With WaitForFirstConsumer, the volume is not provisioned until the pod is scheduled to a specific node. The CSI driver then provisions the volume in the same zone as the scheduled node — guaranteeing the volume can be attached to that node.
Additionally, use a StatefulSet (not a Deployment) for Kafka brokers. StatefulSets maintain stable pod identity and PVC binding — when kafka-broker-2 is rescheduled, it is always associated with data-kafka-broker-2 (its original PVC) and the scheduler attempts to place it in the zone where that PVC lives. Combined with WaitForFirstConsumer, this eliminates the zone mismatch problem.
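A minimal sketch of the broker StatefulSet's volumeClaimTemplates wired to the topology-aware StorageClass above (namespace, image, and sizes are illustrative assumptions):
```yaml
# Hypothetical Kafka StatefulSet excerpt: each broker keeps a stable PVC
# (data-kafka-broker-N) provisioned in the zone where the pod is scheduled.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka-broker
  namespace: streaming-kafka                # illustrative namespace
spec:
  serviceName: kafka-broker
  replicas: 6
  selector:
    matchLabels:
      app: kafka-broker
  template:
    metadata:
      labels:
        app: kafka-broker
    spec:
      containers:
        - name: kafka
          image: icr.io/example/kafka:3.6   # illustrative image reference
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ibmc-block-gold-topology-aware   # StorageClass defined above
        resources:
          requests:
            storage: 4Ti
```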
Incident 2: Elasticsearch I/O degradation on shared storage
A 40% performance degradation over 6 weeks on a storage-intensive workload is characteristic of storage contention on shared block storage, not hardware degradation. IBM Cloud Block Storage is provisioned from shared storage infrastructure — multiple volumes share the same underlying physical storage arrays, and IOPS limits are enforced per volume (not per physical disk).
Diagnosis:
# Check block storage IOPS utilisation on ES nodes
oc exec -n elasticsearch <es-node-pod> -- iostat -xz 1 10
# Look for: %util approaching 100%, rising await, r/s + w/s approaching the profile's IOPS limit
If the Elasticsearch data volumes are on the 5iops-tier IBM Cloud Block Storage profile (the default), each 2TB volume is limited to 5 × 2,000 = 10,000 IOPS. For 12 Elasticsearch nodes writing and reading concurrently during heavy indexing, 10,000 IOPS per node may be insufficient.
Fix: IBM Cloud Block Storage with higher IOPS profile + dedicated node storage
- Upgrade to the 10iops-tier profile: The IBM Cloud Block Storage 10iops-tier profile provides 10 IOPS per GB — for a 2TB volume, that is 20,000 IOPS. This doubles available I/O capacity per node.
- IBM Cloud Bare Metal Servers with local NVMe for Elasticsearch: For I/O-intensive Elasticsearch workloads, consider migrating the Elasticsearch cluster to IBM Cloud Bare Metal Servers with local NVMe storage (2–4TB NVMe per server, 800K+ IOPS). Local NVMe eliminates the shared storage contention entirely. Use IBM Cloud Satellite to bring these bare metal nodes into the OpenShift cluster as a dedicated worker pool for the elasticsearch-workers MachineConfigPool.
- OpenShift Local Storage Operator: If bare metal with local disks is used, the Local Storage Operator provisions PersistentVolumes from the local NVMe disks on each host node. These PVs carry node affinity, so Elasticsearch pods are always scheduled onto the node that holds their local PV (a LocalVolume sketch follows this list).
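A minimal sketch of the Local Storage Operator configuration (the node label, device path, and StorageClass name are assumptions):
```yaml
# Hypothetical LocalVolume: creates local PersistentVolumes from the NVMe device
# on each dedicated Elasticsearch bare metal worker node.
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: elasticsearch-nvme
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/elasticsearch-worker   # illustrative node label
            operator: Exists
  storageClassDevices:
    - storageClassName: local-nvme
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/nvme1n1        # illustrative device path on the bare metal host
```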
Incident 3: PostgreSQL split-brain after eviction
The split-brain condition — three PostgreSQL pods simultaneously attempting to be primary — is a Patroni configuration failure combined with a Kubernetes storage access control gap. In a properly configured Patroni PostgreSQL cluster, only one pod should ever hold the distributed lock (via etcd) that grants primary status. The simultaneous promotion indicates the etcd lock was not respected during the recovery.
Root cause: If Patroni's etcd endpoint was also affected by the node failure (etcd is often co-located with application workloads), Patroni may have entered a degraded state where each replica believed itself entitled to become primary due to etcd unavailability.
Fix: Dedicated etcd for Patroni + STONITH fencing
- Dedicated etcd cluster: Patroni's distributed lock store (etcd) must run on dedicated infrastructure that is isolated from application workload disruptions. Deploy a 3-node etcd cluster on dedicated OpenShift worker nodes with node-role.kubernetes.io/etcd-patroni: "" labels and corresponding taints — no application workloads can be scheduled on these nodes.
- STONITH (Shoot The Other Node In The Head) fencing: Configure Patroni with a STONITH fencing mechanism. Before promoting a replica to primary, Patroni must confirm via the fence agent that the current primary is definitively unavailable — it uses the IBM Cloud VSI API to verify (or stop) the failing node before allowing any promotion. This prevents the race condition where the original primary recovers while a replica has already been promoted.
- PodDisruptionBudget for PostgreSQL: A PDB with minAvailable: 1 prevents voluntary evictions (node drains, cluster upgrades) from removing both replicas simultaneously, ensuring at least one replica is always serving reads even during primary recovery (a sketch follows this list).
- PostgreSQL anti-affinity with requiredDuringSchedulingIgnoredDuringExecution: Ensure the primary and both replicas are scheduled on different nodes and different availability zones — a single node memory pressure event can then affect at most one PostgreSQL pod.
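A minimal sketch of the PodDisruptionBudget and the anti-affinity block referenced in the last two fixes (namespace and labels are illustrative assumptions):
```yaml
# Hypothetical PDB: voluntary disruptions (node drains, upgrades) cannot take
# the PostgreSQL cluster below one available pod.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgresql-pdb
  namespace: postgresql
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: postgresql
---
# Anti-affinity excerpt from the PostgreSQL pod template spec: replicas are forced
# into different availability zones, so a single node or zone event affects at most one pod.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: postgresql
        topologyKey: topology.kubernetes.io/zone
```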
Key Concepts Tested
- IBM Cloud Block Storage zone topology and the WaitForFirstConsumer StorageClass binding mode
- StatefulSet stable identity and PVC binding — preventing zone mismatch on rescheduling
- IBM Cloud Block Storage IOPS profiles and when local NVMe (bare metal) is the right choice
- OpenShift Local Storage Operator for local disk provisioning
- Patroni split-brain prevention: dedicated etcd + STONITH fencing
- PodDisruptionBudgets for stateful workloads to prevent simultaneous eviction
- Pod anti-affinity rules for distributing stateful replicas across zones
Follow-Up Questions
- "The Kafka cluster has 6 brokers with 4TB volumes each — 24TB of total storage. The team wants to implement a backup strategy for the Kafka topic data. Your first instinct is IBM Cloud Object Storage for backup, but the recovery team points out that restoring 4TB of Kafka data from object storage would take 6–8 hours — far too long for their RTO of 30 minutes. What are the alternative backup and recovery strategies for Kafka at this scale, and what trade-offs does each involve?"
- "Three months after deploying the Local Storage Operator for Elasticsearch, a hardware failure causes one of the bare metal nodes to fail completely — the local NVMe disk on that node (containing an Elasticsearch shard) is unrecoverable. The Elasticsearch shard is lost. Walk through the steps to recover from this failure, and explain what Elasticsearch configuration should have been in place to make the shard loss a non-incident rather than a data loss event."
Question 10: Cost Optimisation and FinOps for IBM Cloud Infrastructure
Interview Question
IBM is conducting a cloud cost review for a retail client whose IBM Cloud spend has grown from £180,000/month to £420,000/month over 18 months — a 133% increase against a planned 40% growth. The client's platform runs on IBM Cloud and consists of: 14 Red Hat OpenShift clusters (mix of ROKS and Satellite), IBM Cloud Databases for PostgreSQL and MongoDB, IBM Cloud Object Storage (4.2 PB total), IBM Cloud Internet Services, and IBM Cloud Container Registry. The finance team has escalated the overspend to IBM, and your team has been asked to conduct a cost optimisation engagement. Initial analysis shows IBM Cloud Monitoring dashboards indicating average CPU utilisation of 28% and average memory utilisation of 34% across all ROKS clusters. IBM Cloud billing data shows that 23% of the monthly spend is on resources provisioned over 6 months ago that have no associated workloads. Design the cost optimisation programme — identifying the highest-impact reduction areas, the tooling and automation you would use, and the governance processes to prevent cost drift from recurring.
Why Interviewers Ask This Question
FinOps — the practice of financial accountability for cloud infrastructure — is a growing responsibility for IBM Cloud Infrastructure Engineers, particularly as IBM's enterprise clients face pressure to demonstrate cloud ROI. A 133% cost increase against a 40% plan is a common pattern when infrastructure teams provision resources on demand without governance or cleanup processes. This question tests whether a candidate can identify waste systematically using IBM Cloud billing and monitoring data, knows the specific IBM Cloud cost optimisation levers available (reserved capacity, right-sizing, abandoned resource cleanup), and can design the governance processes that prevent drift from recurring after the initial optimisation.
Example Strong Answer
Step 1: Quantify the opportunity before committing to interventions
Before changing anything, I would generate a precise cost breakdown using IBM Cloud Billing APIs and IBM Cloud Monitoring:
# Export 90-day resource-level billing data
ibmcloud billing resource-instances-usage --start 2024-07-01 --end 2024-09-30 \
--output json > billing-breakdown.json
# Cross-reference with resource inventory
ibmcloud resource service-instances --all-resource-groups --output json > inventory.json
Categorise every resource into one of three buckets:
- Active and right-sized: Running workloads consuming resources proportionate to their allocation
- Active but over-provisioned: Running workloads consuming significantly less than their allocated capacity
- Abandoned: Provisioned resources with no associated workloads or traffic in the past 30 days
The 23% abandoned resource finding is the highest-priority, lowest-risk intervention — deleting abandoned resources produces immediate cost reduction with no business impact. At £420K/month, 23% = £96,600/month in recoverable spend.
Priority 1: Abandoned resource cleanup — £96,600/month recovery
The most common abandoned resources in IBM Cloud enterprise environments:
- Snapshot and backup volumes: Database snapshots and block storage snapshots created for a deleted database that were never cleaned up. IBM Cloud Databases snapshots are billed by stored GB.
- Unattached block storage volumes: PersistentVolumes from deleted OpenShift clusters or namespaces that remain in IBM Cloud after the cluster resources are deleted. IBM Cloud Block Storage is billed even when not attached to a VSI.
- Idle load balancers: IBM Cloud VPC Load Balancers provisioned for applications that have been decommissioned — still billed at the hourly rate.
- Unused Container Registry namespaces: ICIR is billed per GB of image storage. Old images from deprecated services accumulate and are rarely cleaned up.
Automated cleanup tooling:
# IBM Cloud resource tagging + cleanup automation
# Tag all resources at provisioning time with:
# - owner: <team-name>
# - project: <project-name>
# - expiry-date: <YYYY-MM-DD>
# Weekly cleanup job: flag resources with no activity for 30 days
# AND no "keep-alive" tag for human review
# Resources flagged for 7 days with no response are deleted
Priority 2: ROKS cluster right-sizing — estimated £80,000/month reduction
28% CPU and 34% memory utilisation across 14 clusters indicates significant over-provisioning. The right-sizing process:
Step 1: Identify right-sizing opportunities using IBM Cloud Monitoring + VPA recommendations (a VPA manifest sketch follows Step 3 below):
# Get VPA recommendations for all namespaces
oc get vpa -A -o json | jq '.items[] | {
namespace: .metadata.namespace,
name: .metadata.name,
cpu_request: .status.recommendation.containerRecommendations[0].target.cpu,
memory_request: .status.recommendation.containerRecommendations[0].target.memory
}'
Step 2: Identify nodes that can be removed:
Using IBM Cloud Monitoring, identify nodes where the sum of all pod resource requests is under 40% of node capacity consistently over 30 days. These nodes are candidates for consolidation — their workloads can be migrated to other nodes and the nodes decommissioned.
Step 3: Right-size worker pools:
IBM Cloud ROKS worker pools support in-place resizing (smaller flavour) or node count reduction. Reduce worker node counts progressively, validating that PodDisruptionBudgets are respected and workloads reschedule cleanly.
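The recommendations queried in Step 1 come from VerticalPodAutoscaler objects running in recommendation-only mode; a minimal sketch, assuming the VPA operator is installed on the cluster and using illustrative names:
```yaml
# Hypothetical VPA in recommendation-only mode: it computes right-sizing targets
# without changing pod resources, so the data can feed the right-sizing review.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-service-vpa        # illustrative workload name
  namespace: retail-prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  updatePolicy:
    updateMode: "Off"               # recommend only; do not evict or resize pods
```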
Priority 3: Reserved capacity for stable workloads — estimated £65,000/month saving
IBM Cloud on-demand pricing for VSIs and ROKS worker nodes carries a significant premium over reserved instance pricing. For workloads that run continuously:
- IBM Cloud Savings Plans: Commit to a sustained usage level (e.g., £15,000/month of compute) in exchange for a 25–35% discount on that committed spend. Unlike reserved instances, Savings Plans apply across instance types — providing flexibility for right-sizing without losing the discount.
- 1-year reserved instances for stable ROKS worker pools: Production ROKS clusters that have been stable for 6+ months are strong candidates for 1-year reservations at 30–40% discount vs on-demand.
Applying a ~30% discount to the £250,000/month compute portion of the bill yields roughly £75,000/month gross; the £65,000/month figure above is a conservative estimate allowing for workloads that are not stable enough to commit.
Priority 4: IBM Cloud Object Storage lifecycle policies — estimated £25,000/month saving
4.2 PB of IBM Cloud Object Storage is a significant cost centre. IBM COS billing varies by storage class:
- Smart Tier (auto): £0.025/GB/month
- Standard: £0.020/GB/month
- Vault (infrequent access): £0.009/GB/month
- Cold Vault (archive): £0.004/GB/month
Apply lifecycle policies to automatically transition objects to lower-cost storage classes:
{
"Rules": [{
"ID": "transition-to-vault-after-30-days",
"Status": "Enabled",
"Transitions": [
{"Days": 30, "StorageClass": "VAULT"},
{"Days": 90, "StorageClass": "COLD_VAULT"}
]
}]
}
For media assets stored after processing (a one-time operation), this transition reduces storage cost from £0.025 to £0.004/GB — an 84% cost reduction on qualifying objects.
Governance: preventing cost drift from recurring
The optimisation itself is meaningless without governance changes that prevent the same drift from happening again:
- Mandatory resource tagging policy via IBM Cloud Security and Compliance Centre: Any resource without required tags (owner, project, cost-centre) is flagged as non-compliant within 24 hours. Teams with non-compliant resources receive weekly reports.
- Per-team spending dashboards in IBM Cloud Billing: Create IBM Cloud enterprise billing account sub-accounts per team, providing each team visibility into their own spending. Teams that exceed their monthly budget receive automated alerts at 80% and 100% of budget.
- Monthly FinOps review: A 30-minute monthly review with engineering team leads covering: top 10 cost line items, right-sizing recommendations actioned vs pending, abandoned resource report. Cost data is owned by engineering, not just finance.
- Provisioning approval for large resources: Any resource provisioning over £5,000/month requires a platform engineering approval — a Jira workflow that validates the resource is tagged, budgeted, and has a defined decommission plan.
Total expected monthly saving:
| Intervention | Monthly Saving |
|---|---|
| Abandoned resource cleanup | £96,600 |
| ROKS cluster right-sizing | £80,000 |
| Reserved capacity / Savings Plans | £65,000 |
| COS lifecycle policies | £25,000 |
| Total projected reduction | £266,600 (63% reduction) |
Target monthly spend: £420,000 − £266,600 = £153,400/month — below the original planned growth trajectory of £252,000 (40% above the starting £180,000).
Key Concepts Tested
- IBM Cloud Billing API for resource-level cost attribution
- Abandoned resource identification and automated cleanup with tagging policies
- VPA recommendations for Kubernetes workload right-sizing
- IBM Cloud Savings Plans and reserved instances for stable workload cost reduction
- IBM Cloud Object Storage storage class lifecycle policies
- IBM Cloud enterprise billing account structure for per-team cost visibility
- FinOps governance: tagging policies, budget alerts, monthly reviews, provisioning approval gates
Follow-Up Questions
- "You present the cost optimisation findings to the client's CTO. She agrees with the analysis but raises a concern: 'If we right-size our production clusters down to 40% headroom, we have no capacity buffer for unexpected traffic spikes. Last Black Friday, we had a 4× traffic spike that would have taken down a right-sized cluster.' How do you address this tension between cost optimisation and operational headroom, and what is your recommendation for the appropriate headroom level for production workloads?"
- "Three months after the optimisation engagement, IBM Cloud Billing shows that spend has crept back up to £310,000/month — a £156,600 increase from the post-optimisation baseline of £153,400. The tagging policy is in place and the FinOps review process is running. What are the most likely causes of this cost rebound, and what additional automation would you implement to detect and prevent it before it accumulates to this level again?"
Preparation Tip: Across all ten questions in this complete guide, the answers that resonate most strongly with IBM interviewers share a consistent analytical structure: constraint → mechanism → trade-off. Every infrastructure decision exists in a context defined by constraints — regulatory requirements, latency SLAs, cost budgets, compliance frameworks, operational team capacity. Strong candidates name the constraint explicitly, explain the technical mechanism that satisfies it, and then articulate what is given up in making that choice. IBM's enterprise clients are sophisticated enough to understand that no architecture is free — they want engineers who can navigate trade-offs intelligently, not engineers who promise that everything is achievable without compromise. In your preparation, train yourself to end every technical answer with a deliberate trade-off statement: what does this architecture optimise for, and what does it accept as a cost?