IBM Cloud Infrastructure Engineer Interview Questions
Introduction
Cloud Infrastructure Engineers at IBM operate at the centre of one of the most complex enterprise technology portfolios in the industry. IBM's cloud strategy is built around hybrid cloud — the architecture that allows enterprises to run workloads across on-premises data centres, private clouds, and public cloud environments as a unified, governed system. This is not a simple lift-and-shift proposition. IBM's enterprise clients — banks, insurers, government agencies, global manufacturers — have infrastructure estates that have evolved over decades, with compliance obligations, data residency requirements, and integration dependencies that make pure-cloud migration either impractical or impermissible. IBM Cloud Infrastructure Engineers design the architectures that bridge these worlds, primarily through IBM Cloud, Red Hat OpenShift Container Platform, and IBM Cloud Pak solutions built on Kubernetes.
The technical scope of the role is deliberately broad. On any given week, an IBM Cloud Infrastructure Engineer might be designing a multi-zone OpenShift cluster topology for a financial services client, writing Terraform modules to codify a hybrid networking configuration across IBM Cloud and an on-premises data centre, building an observability stack that correlates metrics and traces across both environments, or architecting a disaster recovery plan that meets a hospital network's RPO and RTO requirements. Red Hat OpenShift is the centrepiece of IBM's hybrid cloud strategy — understanding OpenShift at depth, including its enterprise extensions (Machine Config Operator, OLM-managed operators, OpenShift GitOps, and OpenShift Pipelines) is foundational to success in this role. So is operational maturity: designing infrastructure that is resilient, observable, and automatable from day one.
Interviews for Cloud Infrastructure Engineer roles at IBM reflect this breadth and operational depth. Expect questions that require you to reason from first principles about distributed systems reliability, demonstrate hands-on familiarity with Kubernetes and OpenShift internals, articulate the trade-offs in hybrid networking architectures, and show how you would design, automate, and operate infrastructure that meets the stringent requirements IBM's enterprise clients demand. The questions below cover the full range of topics you are likely to encounter, grounded in the real infrastructure challenges IBM and its clients face.
Interview Questions
Question 1: Hybrid Cloud Architecture for a Regulated Financial Services Workload
Interview Question
IBM is engaged with a global investment bank that operates its core trading platform on-premises in two private data centres (London and New York). The bank wants to modernise its risk calculation workloads — computationally intensive Monte Carlo simulations that run in batch every evening — by bursting them to IBM Cloud during peak demand, while keeping the trading platform itself and all customer data on-premises. The bank's requirements: all customer account data must remain in their own data centres (data residency); risk calculations can use cloud compute but cannot write results back to any IBM-managed storage; the entire hybrid environment must be managed through a single control plane; and the network connection between on-premises and IBM Cloud must support at least 10Gbps throughput with latency under 5ms.
Design the hybrid cloud architecture that meets these requirements, and explain the specific IBM technologies you would use at each layer.
Why Interviewers Ask This Question
Hybrid cloud architecture for regulated workloads is IBM's primary commercial differentiation — it is what IBM Cloud is specifically designed for, and it is the scenario that most of IBM's enterprise cloud engagements start from. This question tests whether a candidate understands IBM's hybrid cloud technology stack (IBM Cloud Satellite, Red Hat OpenShift, DirectLink) at a depth beyond marketing material, and whether they can reason about the data residency, network performance, and unified management requirements that regulated clients impose. Interviewers also look for candidates who understand that "hybrid cloud" is not an architecture choice but a set of constraints — the bank's regulatory obligations define the architecture, not the other way around.
Example Strong Answer
The core architectural principle: compute bursts, data stays
The bank's requirements define a clear architectural principle: compute capacity can extend into IBM Cloud, but data gravity must remain on-premises. Every architectural decision flows from this principle.
Layer 1: Unified control plane — IBM Cloud Satellite
IBM Cloud Satellite is the correct technology for this requirement. Satellite extends IBM Cloud's managed services — including managed OpenShift (ROKS), IBM Cloud Monitoring, and IBM Cloud Object Storage — to run on the bank's own on-premises infrastructure while being managed and observable from the IBM Cloud control plane.
Architecture:
- Deploy IBM Cloud Satellite Locations in both the London and New York data centres. Each Satellite Location consists of a minimum of 3 control plane hosts that communicate with IBM Cloud's Satellite management plane.
- The Satellite hosts run entirely on the bank's own servers — IBM Cloud has management-plane visibility but no data-plane access to the bank's infrastructure.
- The trading platform and all customer account data remain on Satellite-managed infrastructure within the bank's physical perimeter. IBM Cloud never touches this data.
- The Satellite control plane communication (metadata only — no customer data) travels over an encrypted channel to IBM Cloud; this channel is the only network path between the bank's on-premises environment and IBM Cloud's control plane.
Layer 2: Network connectivity — IBM Cloud DirectLink Dedicated
10Gbps throughput with < 5ms latency between on-premises and IBM Cloud cannot be achieved over the public internet. The correct IBM product is IBM Cloud DirectLink Dedicated:
- DirectLink Dedicated provides a physical, dedicated fibre connection between the bank's colocation facility and IBM Cloud's network point of presence. No traffic traverses the public internet.
- Bandwidth options: 1Gbps, 2Gbps, 5Gbps, 10Gbps. The bank requires 10Gbps — provision two 10Gbps circuits for redundancy (active-active load balancing with automatic failover if one circuit fails).
- Latency: DirectLink Dedicated between London and IBM Cloud's London PoP achieves sub-millisecond latency. Cross-region traffic travels over IBM Cloud's private backbone — faster and more stable than public internet routing.
- BGP routing: Establish BGP peering between the bank's edge routers and IBM Cloud's DirectLink routers. Advertise the bank's on-premises subnets to IBM Cloud and IBM Cloud's VPC subnets to the bank's network. This enables private IP routing between environments without NAT.
Layer 3: Compute bursting — Red Hat OpenShift with cluster federation
The risk calculation workloads burst to IBM Cloud compute using Red Hat OpenShift on IBM Cloud (ROKS) with a cluster federation pattern:
- An OpenShift cluster runs on-premises (on Satellite) and serves as the primary cluster.
- A ROKS cluster in IBM Cloud (IBM Cloud region: eu-gb for London jobs, us-east for NY jobs) serves as the burst cluster.
- OpenShift cluster federation via Red Hat Advanced Cluster Management (RHACM): RHACM provides the single pane of glass that manages both the on-premises Satellite cluster and the cloud ROKS cluster from one console — satisfying the "single control plane" requirement.
- Risk calculation jobs are submitted to the on-premises cluster as Kubernetes Jobs. An RHACM placement policy routes jobs exceeding on-premises capacity to the cloud ROKS cluster for execution.
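As a minimal illustration of the placement mechanics, the sketch below assumes the burst ROKS cluster has been imported into RHACM, labelled environment: ibm-cloud-burst, and grouped into a hypothetical risk-compute ManagedClusterSet; the capacity-based burst decision itself would be layered on top (for example via a placement prioritizer or a custom placement score).
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: risk-burst-placement
  namespace: risk-calculation          # hypothetical namespace for the risk workload
spec:
  clusterSets:
    - risk-compute                     # assumed ManagedClusterSet containing both clusters
  numberOfClusters: 1
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            environment: ibm-cloud-burst   # assumed label on the ROKS burst cluster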
The critical data residency control:
Risk calculation jobs in the cloud ROKS cluster:
- Pull their input data (market data, position data) from on-premises via DirectLink at job start — data is transferred into ephemeral pod memory, never written to IBM Cloud storage
- Perform computation entirely in pod memory (Monte Carlo simulations are compute-bound, not storage-bound)
- Write results back to the bank's on-premises Satellite cluster via DirectLink at job completion — results never touch IBM Cloud Object Storage or any IBM-managed persistence layer
- Pod ephemeral storage and memory are cleared on pod termination — IBM Cloud has no persistent copy of any computation input or output
Enforcement mechanism: Kubernetes NetworkPolicies on the cloud ROKS cluster block all egress to IBM Cloud Object Storage endpoints. An OPA/Gatekeeper policy rejects any PersistentVolumeClaim creation in the burst cluster — no workload can write to persistent storage, enforced at the admission controller level.
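A minimal sketch of the egress control, assuming the burst jobs run in a namespace named risk-calculation and the bank's on-premises ranges advertised over DirectLink sit in 10.20.0.0/16 (both values are illustrative): the policy denies all egress except DNS and traffic back to the on-premises range, so calls to IBM Cloud Object Storage endpoints are blocked by default. The OPA/Gatekeeper PVC rejection would be a separate ConstraintTemplate denying PersistentVolumeClaim objects at admission.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress-to-onprem
  namespace: risk-calculation            # assumed namespace for the burst jobs
spec:
  podSelector: {}                        # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.20.0.0/16           # assumed on-premises range reachable via DirectLink
    - ports:                             # allow in-cluster DNS resolution
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53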
Layer 4: Identity and access — IBM Cloud IAM with on-premises integration
The bank's engineers manage both environments through a unified identity model:
- IBM Cloud IAM is federated with the bank's Active Directory via SAML 2.0. Engineers authenticate with their bank credentials; IBM Cloud IAM issues scoped tokens.
- RHACM role-based access policies ensure that on-premises cluster administration is only available to engineers with the bank's internal cluster-admin LDAP group membership — not to IBM Cloud staff.
Key Concepts Tested
- IBM Cloud Satellite architecture: extending IBM Cloud managed services to on-premises infrastructure
- IBM Cloud DirectLink Dedicated for private, high-throughput, low-latency hybrid connectivity
- Red Hat Advanced Cluster Management (RHACM) for unified multi-cluster management
- Data residency enforcement: ephemeral compute vs persistent storage separation
- OPA/Gatekeeper admission control for storage policy enforcement at the cluster level
- BGP peering for private IP routing between on-premises and cloud environments
Follow-Up Questions
- "The bank's CISO raises a concern: the IBM Cloud Satellite control plane communication — even though it carries only metadata — traverses IBM Cloud's network infrastructure. Their internal policy requires that no communication channel to a third-party cloud provider carry any data derived from their trading systems, including metadata about job execution. How do you redesign the architecture to address this, and does it change your technology choices?"
- "Six months after deployment, the bank wants to add a second use case: running their anti-money laundering ML model inference on IBM Cloud to take advantage of GPU instances unavailable on-premises. Unlike the risk calculation jobs, the AML model must query a live customer transaction stream — which appears to conflict with the data residency requirement. How do you architect a solution that enables cloud GPU inference on customer transaction data without violating data residency?"
Question 2: Red Hat OpenShift Cluster Operations and Node Management at Scale
Interview Question
You are the lead infrastructure engineer responsible for a Red Hat OpenShift Container Platform 4.x cluster running on IBM Cloud that hosts 180 microservices for an insurance company's claims processing platform. The cluster has 42 worker nodes across three availability zones. Over the past month, the platform team has reported three problems: (1) approximately twice per week, one or more worker nodes enters a NotReady state due to kernel-level memory pressure, causing pod rescheduling that temporarily degrades the claims API; (2) several application teams are requesting custom kernel parameters — higher vm.max_map_count for Elasticsearch nodes, higher file descriptor limits for high-connection services — that are difficult to apply safely without disrupting running workloads; (3) a security patch requires updating the worker node OS (Red Hat CoreOS) across all 42 nodes, but the team is concerned about coordinating the rolling update without causing a claims processing outage.
Address each of the three problems using OpenShift-native tooling and explain the operational approach for each.
Why Interviewers Ask This Question
Red Hat OpenShift is IBM's flagship container platform for enterprise clients, and IBM Cloud Infrastructure Engineers are expected to operate it at a depth that goes beyond basic Kubernetes administration. This question tests familiarity with OpenShift-specific operational tooling — particularly the Machine Config Operator (MCO), which is how OpenShift manages node-level configuration — and whether a candidate can reason through rolling node updates in a way that preserves application availability. Each of the three problems maps to a specific OpenShift feature that distinguishes it from vanilla Kubernetes.
Example Strong Answer
Problem 1: Node NotReady due to kernel memory pressure — eviction threshold tuning
In OpenShift 4.x on IBM Cloud, node configuration is managed by the Machine Config Operator (MCO) — not by SSHing into nodes and editing files directly. For memory pressure issues, I would first audit resource governance before tuning eviction thresholds:
oc describe node <node-name> | grep -A5 "Conditions:"
oc top node <node-name>
If pods lack memory limits (common root cause), enforce them via LimitRange and ResourceQuota at the namespace level — this addresses the root cause rather than just tuning recovery behaviour.
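A minimal sketch of that namespace-level guardrail, with an assumed claims-processing namespace and illustrative sizing; containers that declare no memory request or limit inherit the defaults, so no pod can run unbounded and push a node into memory pressure.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: claims-processing       # assumed application namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi                # applied when a container declares no request
      default:
        memory: 512Mi                # applied when a container declares no limit
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota
  namespace: claims-processing
spec:
  hard:
    requests.memory: 64Gi            # illustrative namespace ceiling
    limits.memory: 96Gi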
For eviction threshold tuning, use a KubeletConfig custom resource:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: worker-memory-eviction
spec:
machineConfigPoolSelector:
matchLabels:
pools.operator.machineconfiguration.openshift.io/worker: ""
kubeletConfig:
evictionHard:
memory.available: "500Mi" # increased from default 100Mi
evictionSoft:
memory.available: "1Gi"
evictionSoftGracePeriod:
memory.available: "2m"This gives pods a 2-minute grace period to reduce memory before hard eviction, and raises the hard threshold so recovery starts earlier — before the node reaches NotReady.
Problem 2: Custom kernel parameters per workload type — MachineConfigPools
Applying different kernel parameters to different nodes requires a dedicated MachineConfigPool for the specialised nodes:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
name: elasticsearch-workers
spec:
machineConfigSelector:
matchExpressions:
- key: machineconfiguration.openshift.io/role
operator: In
values: [worker, elasticsearch-workers]
nodeSelector:
matchLabels:
node-role.kubernetes.io/elasticsearch-worker: ""
Label the target nodes:
oc label node <es-node-1> node-role.kubernetes.io/elasticsearch-worker=""
Apply the kernel parameter only to this pool via a MachineConfig:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: elasticsearch-workers
name: 50-elasticsearch-kernel-tuning
spec:
config:
ignition:
version: 3.2.0
storage:
files:
- path: /etc/sysctl.d/99-elasticsearch.conf
mode: 420 # decimal for octal 0644
contents:
source: data:,vm.max_map_count%3D262144%0A
The MCO applies this only to nodes in the elasticsearch-workers pool via a managed drain → reboot → uncordon cycle — no manual node access required, and standard worker nodes are completely unaffected.
Problem 3: Rolling RHCOS OS update across 42 nodes without a claims outage
Red Hat CoreOS on OpenShift 4.x is updated through the Machine Config Operator, which handles the drain, update, reboot, and uncordon lifecycle automatically. The critical safety dependency is Pod Disruption Budgets (PDBs):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: claims-api-pdb
namespace: claims-processing
spec:
minAvailable: 2
selector:
matchLabels:
app: claims-api
The MCO's drain step respects PDBs. If draining a node would violate minAvailable: 2, the drain is blocked until the constraint can be satisfied — the update waits, it does not force an outage.
Control the rollout speed with maxUnavailable on the MachineConfigPool:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
name: worker
spec:
maxUnavailable: 1
With maxUnavailable: 1, nodes update sequentially — one node at a time, approximately 5–8 minutes per node. For 42 nodes, total update time is 3.5–5.6 hours. Schedule this in a low-traffic overnight window and ensure no critical service has all its replicas on a single node before starting.
Key Concepts Tested
- Machine Config Operator (MCO): OpenShift's mechanism for all node-level configuration
- KubeletConfig for eviction threshold tuning without direct node access
- MachineConfigPools with custom node labels for targeted kernel parameter application
- Pod Disruption Budgets as the update safety control — MCO respects PDBs during drain
- maxUnavailable on MachineConfigPools to control rolling update speed
- RHCOS update model via MCO vs traditional OS patching
Follow-Up Questions
- "During a node update rollout with
maxUnavailable: 1, one node gets stuck in theUpdatingstate for 45 minutes — far longer than the expected 5–8 minutes. The MCO is not producing error logs and the node appears healthy inoc get nodes. What is your investigation process, and at what point would you consider manually intervening to prevent the stuck update from blocking the rest of the rollout?"
- "The security team requests that SSH daemon be disabled on all RHCOS nodes — direct SSH access is prohibited by policy. However, your team uses
oc debug node/<n>for troubleshooting. After disabling SSH via a MachineConfig,oc debug node/<worker-3>fails with a permissions error during a critical incident. Explain whatoc debug noderequires to function, why it failed, and how you restore the troubleshooting capability without re-enabling SSH."
Question 3: Observability Architecture for Hybrid Cloud Environments
Interview Question
IBM has deployed a hybrid cloud platform for a large UK retailer. The e-commerce platform runs across two environments: a Red Hat OpenShift cluster on-premises managed via IBM Cloud Satellite, and a managed OpenShift cluster on IBM Cloud used for burst compute during peak seasons. The platform currently has no unified observability — on-premises workloads are monitored by legacy Nagios while the IBM Cloud cluster uses IBM Cloud Monitoring. During a Black Friday incident last year, a latency degradation affecting the checkout API took 54 minutes to detect because the root cause originated in the on-premises inventory service but manifested as slow responses from a cloud-hosted payment processing service. The two monitoring systems had no correlation capability.
Design the full observability stack — metrics, traces, and logs — for both environments, addressing cross-environment correlation and the specific failure mode that caused the 54-minute detection gap.
Why Interviewers Ask This Question
Observability for hybrid cloud environments is one of the hardest operational challenges IBM's infrastructure clients face. The inability to correlate signals across on-premises and cloud environments is exactly what IBM Instana and distributed tracing are designed to address. This question tests whether a candidate understands the three pillars of observability as a correlated system — not three independent tools — and can design an architecture that provides a single operational view across physically and administratively separate environments.
Example Strong Answer
Diagnosing the 54-minute detection failure
The incident reveals three distinct observability failures:
- No distributed tracing: Without a trace following the checkout request from the cloud payment service through to the on-premises inventory service, there was no way to see that slow payment API responses were caused by downstream inventory calls — they appeared as a standalone payment service problem.
- Siloed monitoring: Nagios and IBM Cloud Monitoring had no shared data model. Neither could see the other environment's state, so neither could correlate root cause with symptom.
- Alert on symptom, not cause: Alerts fired on the payment service's elevated latency — the visible downstream symptom. The on-premises inventory degradation was invisible.
The collection standard: OpenTelemetry across both environments
Standardise on OpenTelemetry (OTel) as the instrumentation layer across both clusters. OpenTelemetry is natively supported by IBM Instana, produces W3C TraceContext headers for cross-environment trace propagation, and works identically on OpenShift whether running on-premises or in IBM Cloud.
Pillar 1: Distributed Tracing — the critical missing layer
Instrument all services with OpenTelemetry SDKs. Configure W3C TraceContext header propagation across all inter-service calls, including cross-environment calls that traverse DirectLink between on-premises and IBM Cloud. With tracing in place, the Black Friday incident trace would look like:
Trace: GET /checkout [940ms]
├── [Cloud] Payment Service [920ms]
│ └── [On-Prem] Inventory Service [850ms] ← root cause visible here
This trace, available within seconds of the incident beginning, would have reduced investigation time from 54 minutes to under 5 minutes.
Collection topology:
- Each OpenShift cluster runs an OpenTelemetry Collector DaemonSet receiving spans from all pods on each node
- On-premises collectors forward spans over DirectLink to IBM Instana — the preferred backend for IBM Cloud engagements, providing automatic instrumentation, application topology discovery, and hybrid environment correlation in a single console
- IBM Instana receives spans from both environments and renders cross-environment request flows as a unified trace waterfall
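A minimal collector configuration for this topology might look like the sketch below; the Instana agent service name and the decision to leave the in-cluster hop unencrypted are assumptions to adjust for the actual deployment.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp:
    endpoint: instana-agent.instana-agent.svc:4317   # assumed in-cluster Instana agent OTLP endpoint
    tls:
      insecure: true                                  # assumption: agent-to-backend hop carries the TLS
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]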
Pillar 2: Metrics — IBM Cloud Monitoring with cross-environment collection
The OpenShift built-in monitoring stack (openshift-monitoring namespace) deploys Prometheus, Alertmanager, and Grafana automatically in each cluster. For unified cross-environment visibility:
- Deploy the IBM Cloud Monitoring agent (Sysdig-based) as a DaemonSet in the on-premises Satellite cluster. The agent forwards all metrics to IBM Cloud Monitoring over DirectLink — no metrics traverse the public internet.
- IBM Cloud Monitoring provides a single dashboard showing metrics from both environments simultaneously.
- Define SLOs at the business service level: "99.5% of checkout requests complete in under 400ms" — measured end-to-end regardless of which environment the components live in. IBM Instana's Business Impact Monitoring evaluates this cross-environment SLO against real traffic.
Replace Nagios threshold alerts with SLO error budget burn rate alerts: the alert fires when the checkout API's error budget is being consumed faster than sustainable — typically within 3–5 minutes of a meaningful degradation starting, not after 54 minutes.
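A sketch of one such burn-rate alert as a PrometheusRule, assuming the checkout API exposes a standard request-duration histogram (metric names and namespace are illustrative); the expression fires when the 400ms latency SLO's error budget is being burned roughly 14 times faster than sustainable over the last 5 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo-burn-rate
  namespace: openshift-monitoring          # assumed placement; could equally live in user workload monitoring
spec:
  groups:
    - name: checkout-latency-slo
      rules:
        - alert: CheckoutLatencySLOFastBurn
          expr: |
            (
              1 - (
                sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.4"}[5m]))
                /
                sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
              )
            ) > (14.4 * 0.005)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Checkout API is burning its 99.5% / 400ms error budget ~14x faster than sustainable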
Pillar 3: Logs — structured and correlated via trace ID
Every service in both environments must inject the OpenTelemetry trace ID and span ID into every log line:
{
"timestamp": "2024-11-29T11:23:45.123Z",
"service": "inventory-service",
"environment": "on-premises",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"message": "Database query exceeded threshold",
"query_duration_ms": 850
}
With the trace ID in both the Instana trace and the log record, an on-call engineer jumps from the slow span directly to the associated log entries — no cross-system searching required.
Log pipeline:
- On-premises: FluentBit DaemonSet → Kafka (buffer for connectivity interruptions) → OpenSearch
- IBM Cloud: OpenShift Logging → IBM Log Analysis
- IBM Log Analysis aggregates logs from both environments into a single query interface, filterable by trace ID, service, environment, and time window
Deployment event correlation
Configure deployment events from both OpenShift GitOps pipelines to post to IBM Instana via the Events API. These appear as vertical markers on all metric and trace charts — making "did a deployment cause this?" answerable in 30 seconds.
Key Concepts Tested
- OpenTelemetry as the vendor-neutral instrumentation standard for hybrid environments
- W3C TraceContext header propagation across cross-environment service calls
- IBM Instana for unified hybrid cloud observability with automatic topology discovery
- OpenShift built-in monitoring stack and IBM Cloud Monitoring agent for cross-environment metrics
- Structured logging with trace ID / span ID as the log-to-trace correlation mechanism
- SLO burn rate alerting replacing threshold-based monitoring
Follow-Up Questions
- "Your OpenTelemetry Collector is forwarding 4.2 million spans per minute to IBM Instana. After 3 months, IBM Cloud Monitoring costs have grown to £22,000/month — 40% over budget. The on-call team says they only look at traces during incidents, which occur roughly twice per month. How do you redesign the trace collection strategy to reduce cost by at least 50% while ensuring the traces most needed during an incident are always retained?"
- "A new microservice team is onboarding to the platform. They are building a service in Go and have never used OpenTelemetry before. They ask: 'Do we have to manually instrument our code, or can the platform inject tracing automatically?' Explain the difference between manual instrumentation and auto-instrumentation options available in your OpenShift environment, and give a recommendation with your reasoning."
Question 4: Infrastructure Automation with Terraform and Ansible for IBM Cloud
Interview Question
IBM is building a repeatable infrastructure-as-code framework for deploying a standard "IBM Cloud Landing Zone" for enterprise clients — a baseline IBM Cloud environment that includes a Virtual Private Cloud with multiple subnets across three availability zones, transit gateway connectivity to on-premises networks, a managed OpenShift cluster, IBM Key Protect for encryption key management, IBM Cloud Activity Tracker for audit logging, and IBM Cloud Security and Compliance Centre integration. This Landing Zone must be deployable in under 2 hours, must be customisable for different clients (different CIDR ranges, regions, node counts), and must be fully reproducible — applying the same configuration twice must produce identical infrastructure.
Design the infrastructure automation framework, addressing tool choice, module structure, configuration management, and the controls that ensure the Landing Zone stays compliant after initial deployment.
Why Interviewers Ask This Question
IBM Cloud Landing Zones are a real IBM product offering — IBM publishes reference architectures for repeatable, compliant cloud environment provisioning, and the tooling behind them is exactly what IBM Cloud Infrastructure Engineers build and maintain. This question tests whether a candidate can design modular, reusable Terraform at the scale of a full cloud environment, understands the IBM Cloud Terraform provider, knows when Ansible complements Terraform in a hybrid automation stack, and can reason about post-deployment governance controls that keep infrastructure compliant over time.
Example Strong Answer
Tool selection: Terraform for provisioning, Ansible for configuration
Terraform and Ansible serve complementary functions:
- Terraform: Declarative infrastructure provisioning with state management, through the IBM Cloud Terraform provider. Manages IBM Cloud resources (VPC, subnets, ROKS clusters, Key Protect, Transit Gateway) with an explicit plan-before-apply workflow. Terraform's state file tracks the relationship between HCL declarations and actual cloud resources.
- Ansible: Imperative configuration management for tasks Terraform cannot model: bootstrapping OpenShift operators, configuring LDAP authentication integration, installing custom CA certificates into cluster trust stores, running post-deployment validation playbooks. Also the right tool for on-premises configuration that operates outside the IBM Cloud API.
Module structure: composable, versioned Terraform modules
ibm-landing-zone/
├── modules/
│ ├── vpc/ # VPC, subnets, ACLs, security groups
│ ├── transit-gateway/ # Transit gateway + on-prem connectivity
│ ├── openshift-cluster/ # ROKS cluster, worker pools, add-ons
│ ├── key-protect/ # Key Protect instance + root keys
│ ├── activity-tracker/ # Activity Tracker routing rules
│ ├── security-compliance/ # SCC integration, profile attachment
│ └── iam-baseline/ # Access groups, trusted profiles, policies
│
├── patterns/
│ └── standard-landing-zone/ # Root module composing all sub-modules
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
│
└── environments/
├── client-a/ # Client A's variable values
│ └── terraform.tfvars
└── client-b/
└── terraform.tfvars
Each client customises the Landing Zone via a terraform.tfvars file. The module code never changes — only variable values differ per client. This allows IBM to offer the Landing Zone as a tested, supported product where client configuration is cleanly separated from platform code.
Module versioning:
module "vpc" {
source = "git::https://github.ibm.com/cloud-platform/ibm-landing-zone//modules/vpc?ref=v1.4.2"
region = var.region
# ...
}
When IBM releases a security update, it publishes a new module version. Clients upgrade by bumping the version reference and running terraform plan to review the diff before applying.
State management: IBM Cloud Object Storage backend with locking
terraform {
backend "s3" {
bucket = "ibm-landing-zone-tfstate"
key = "client-a/terraform.tfstate"
region = "eu-gb"
endpoint = "https://s3.eu-gb.cloud-object-storage.appdomain.cloud"
}
}IBM COS supports the AWS S3-compatible API, enabling Terraform's S3 backend with Object Lock for state locking — preventing two engineers from running concurrent applies against the same state.
Ansible for post-provisioning configuration
After Terraform provisions the infrastructure, an Ansible playbook handles configuration that Terraform cannot model:
- name: Configure OpenShift cluster post-deployment
hosts: localhost
tasks:
- name: Add IBM LDAP authentication to OpenShift OAuth
k8s:
definition: "{{ lookup('template', 'oauth-ldap.yaml.j2') }}"
- name: Install IBM Cloud Pak operator via OLM
k8s:
definition: "{{ lookup('template', 'cp4i-subscription.yaml.j2') }}"
- name: Run SCC compliance validation checks
include_role:
name: ibm.scc_validation
Post-deployment compliance: IBM Security and Compliance Centre (SCC)
The Landing Zone must stay compliant after deployment — not just at initial provisioning. IBM Cloud Security and Compliance Centre continuously evaluates IBM Cloud resources against the IBM Cloud Framework for Financial Services profile:
- SCC scans run every 24 hours, evaluating ~200 controls (VPC ACL configurations, Key Protect key rotation policy, IAM policies, Activity Tracker routing)
- Any drift from the compliant baseline — a security group rule added manually, an IAM policy changed outside Terraform — appears as a failed control in the SCC dashboard within 24 hours
- SCC findings are routed to IBM Cloud Activity Tracker, which emits a PagerDuty alert for any high-severity compliance failure
This creates a detect-and-remediate loop: engineers who make manual changes are notified immediately and must either revert the change or codify it in Terraform — preventing compliance drift from accumulating silently.
Key Concepts Tested
- Terraform module structure: separating module code from per-client variable values
- IBM Cloud Terraform provider resources: VPC, ROKS, Key Protect, Transit Gateway
- Terraform state backend on IBM Cloud Object Storage with S3-compatible API and state locking
- Module versioning via Git tags for controlled upgrades across client environments
- Ansible for post-provisioning configuration tasks outside Terraform's declarative model
- IBM Security and Compliance Centre (SCC) for continuous post-deployment compliance monitoring
Follow-Up Questions
- "During a
terraform planfor a routine module upgrade, the plan includes destruction and recreation of the client's ROKS cluster — a 20-minute outage that would require all workloads to be rescheduled. The only change in the module version was an update to cluster tagging. What causes Terraform to plan a resource recreation instead of an in-place update, and how do you resolve this without causing the outage?"
- "A client's platform team has been manually modifying security group rules in the IBM Cloud console for 3 months. When you run
terraform plan, it shows 47 diffs across 12 security groups. The team is nervous about applying the plan because they are unsure which of the 47 changes are their intentional modifications and which are legitimate updates from the module upgrade. How do you safely reconcile this state drift without losing valid manual changes or reverting module updates?"
Question 5: Security, Identity, and Zero-Trust Access Management in IBM Cloud
Interview Question
IBM is designing the IAM architecture for a hybrid cloud environment serving a UK financial services firm under FCA regulation. The environment includes IBM Cloud with 14 VPCs across production, non-production, and management environments; a Red Hat OpenShift cluster in each VPC; IBM Cloud Satellite locations in the firm's two on-premises data centres; and 340 engineers across platform engineering, application development, security operations, and data science teams. The firm's security requirements: no standing privileged access (engineers must not hold persistent admin credentials); all access must be auditable at the API-call level; privileged operations must require MFA and a secondary approval; and production environments must be technically segregated from development — no engineer should be able to promote their own code from development to production.
Design the IAM architecture that meets all four requirements, and explain how you would implement zero-trust access controls for both IBM Cloud resources and the OpenShift clusters within them.
Why Interviewers Ask This Question
Financial services IAM is one of the most demanding access management challenges in cloud infrastructure, and IBM's FCA-regulated clients are among the most security-conscious in IBM's client portfolio. This question tests whether a candidate understands IBM Cloud IAM at production depth — not just access groups and policies, but the specific mechanisms for just-in-time access, privileged access approval workflows, and the separation of OpenShift RBAC from cloud-level IAM. The zero-trust requirement forces candidates beyond traditional perimeter-based access thinking to a model where no identity is permanently trusted.
Example Strong Answer
The zero-trust access principle applied to IAM
Zero-trust access means every access request is authenticated, authorised, and logged regardless of network location or prior access history. For a 340-person engineering organisation, this has five concrete implications:
- No engineer holds permanent admin credentials — access is granted just-in-time and expires
- Every IBM Cloud API call and every oc command to OpenShift is logged
- Sensitive operations require MFA plus a secondary approval
- Production and non-production access are technically segregated, not just policy-segregated
- Workload identity (service accounts, CI/CD pipelines) is managed separately from human identity
Requirement 1: No standing privileged access — JIT access via IBM Cloud IAM
IBM Cloud IAM's native primitive for this pattern is Access Groups. Engineers never belong to privileged Access Groups permanently. When a privileged operation is needed:
- Engineer submits an access request (via ServiceNow or a custom Slack bot) specifying the reason, duration (maximum 4 hours), and the required Access Group
- A secondary approver (team lead or security officer) approves the request
- An IBM Cloud Function calls the IBM Cloud IAM Groups API to add the engineer to the Access Group for the requested duration
- A scheduled IBM Cloud Function removes the engineer when the duration expires
This workflow is built entirely on IBM Cloud IAM primitives — no third-party PAM solution required, though HashiCorp Boundary or CyberArk are valid alternatives for clients that prefer a commercial PAM tool.
Requirement 2: API-call-level audit logging — IBM Cloud Activity Tracker
IBM Cloud Activity Tracker captures every IBM Cloud API call — IAM policy changes, VPC modifications, Key Protect operations, ROKS cluster administration — with full context (who, what, when, from where).
Configuration:
- One Activity Tracker instance per region, routing events from all services in that region
- Global events routing: IAM events (global, not regional) route to the Frankfurt instance for EU data residency compliance
- 30-day hot retention + 1-year archival to IBM Cloud Object Storage — meets FCA audit trail obligations
- Activity Tracker alerts for high-risk events: iam.policy.create, kms.secrets.delete, or any privileged operation outside an approved change window triggers an immediate security alert
For OpenShift audit logging:
- OpenShift API server audit logging is enabled by default and captures all kubectl/oc commands
- Forward OpenShift audit logs to IBM Log Analysis alongside Activity Tracker events for a unified audit trail covering both IBM Cloud API calls and Kubernetes API calls
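A sketch of that audit forwarding using the OpenShift Cluster Logging ClusterLogForwarder; the output type and endpoint shown are placeholders, and the concrete output would be whatever ingestion mechanism the chosen IBM Log Analysis or central logging instance exposes.
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: central-audit
      type: syslog                                  # placeholder transport for the central log service
      url: tls://audit-logs.example.internal:6514   # assumed collector endpoint
  pipelines:
    - name: audit-to-central
      inputRefs:
        - audit                                     # the API server audit log stream
      outputRefs:
        - central-audit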
Requirement 3: MFA + secondary approval for privileged operations
MFA is enforced at the IBM Cloud account level:
- Enable account-wide TOTP MFA in IBM Cloud IAM settings — all console logins and API token requests require a TOTP code
- For programmatic access (Terraform, CI/CD pipelines), use Service IDs with scoped API keys — service IDs correctly have no MFA requirement, as MFA is a human-identity control
- Secondary approval is provided by the JIT access workflow described above — the combination of MFA (something the engineer has) and manager approval (an organisational control) provides two independent factors for production access
Requirement 4: Technical production/development segregation — IBM Cloud Enterprise Accounts
Policy segregation alone is insufficient for FCA-regulated environments. Technical segregation requires separate IBM Cloud accounts:
IBM Cloud Enterprise Account (management)
├── Production Account
│ ├── Production VPCs (14)
│ ├── Production ROKS clusters
│ └── IBM Cloud Satellite (on-premises production)
│
├── Non-Production Account
│ ├── Development and staging VPCs
│ └── Non-production ROKS clusters
│
└── Management Account
└── Activity Tracker, SCC, Secrets Manager (shared services)
An engineer in the non-production account cannot assume a production identity — there is no IAM policy path between accounts except through explicitly defined Enterprise Account cross-account trust policies. A developer deploying to staging is technically incapable of promoting to production without a separate production account credential, which they do not permanently hold.
OpenShift RBAC integration with IBM Cloud IAM
ROKS integrates IBM Cloud IAM with OpenShift RBAC — Access Group membership automatically maps to OpenShift ClusterRoles:
| IBM Cloud IAM Access Group | OpenShift ClusterRole |
|---|---|
| platform-admin-production | cluster-admin |
| app-developer-production | edit (namespace-scoped) |
| security-reader-all-envs | view (cluster-scoped) |
When the JIT workflow adds an engineer to platform-admin-production, they automatically receive cluster-admin in the production OpenShift cluster for the same duration. When the JIT access expires, they lose both IBM Cloud and OpenShift access simultaneously — no separate RBAC management required.
Workload identity — no human credentials in CI/CD
CI/CD pipelines must never use human engineer credentials:
- Create IBM Cloud Service IDs with scoped API keys per pipeline role (pipeline-deployer-staging, pipeline-image-push)
- Store API keys in IBM Secrets Manager, injected into pipeline environments at runtime — never stored in Git or CI/CD variable stores
- Rotate service ID API keys automatically every 90 days via IBM Secrets Manager rotation policies
Key Concepts Tested
- IBM Cloud Access Groups as the unit of IAM policy assignment
- JIT privileged access using IBM Cloud IAM Groups API with approval workflow
- IBM Cloud Activity Tracker for API-level audit logging with FCA-compliant retention
- IBM Cloud Enterprise Account structure for technical production/non-production segregation
- ROKS IBM Cloud IAM to OpenShift RBAC automatic mapping — unified identity across layers
- Service IDs with scoped API keys and IBM Secrets Manager rotation for workload identity
Follow-Up Questions
- "Three months after deploying the JIT access system, a security audit finds that engineers are approving each other's access requests reciprocally — engineer A approves engineer B's production access request, and engineer B approves engineer A's. This defeats the secondary approval control. How do you detect this pattern retroactively in Activity Tracker logs, and what controls do you add to the approval workflow to prevent it going forward?"
Question 6: Disaster Recovery Architecture for a Hybrid IBM Cloud Environment
Interview Question
IBM is designing the disaster recovery strategy for a UK National Health Service trust's clinical information system. The system runs across a hybrid environment: the primary workload is a Red Hat OpenShift cluster hosted on IBM Cloud in the London region (eu-gb), with patient record data stored in IBM Cloud Databases for PostgreSQL. A secondary IBM Cloud Satellite location runs in the trust's own on-premises data centre for low-latency clinical applications. The NHS trust's recovery objectives are: an RTO of 30 minutes and an RPO of 5 minutes for the patient record system, and an RTO of 4 hours and an RPO of 1 hour for the administrative workloads. The current DR strategy is a manual runbook that was last tested 14 months ago and took 2.5 hours to execute — far exceeding the 30-minute RTO.
Redesign the disaster recovery architecture to meet both sets of recovery objectives, and explain why the current manual runbook approach is structurally incapable of meeting a 30-minute RTO.
Why Interviewers Ask This Question
Disaster recovery for hybrid cloud environments combining IBM Cloud managed services and on-premises Satellite infrastructure is a direct IBM Cloud Infrastructure Engineer responsibility at NHS and healthcare clients, where regulatory obligations (NHS DSP Toolkit, ISO 27001) make DR planning and testing a compliance requirement — not just a best practice. This question tests whether a candidate understands the structural relationship between RTO targets and automation requirements (manual runbooks have hard floors on achievable RTO), knows IBM Cloud's specific DR capabilities (cross-region replication for IBM Cloud Databases, IBM Cloud Object Storage cross-region buckets), and can differentiate DR strategies for workloads with different recovery objectives.
Example Strong Answer
Why the manual runbook cannot achieve a 30-minute RTO — structural analysis
A manual runbook with a 2.5-hour historical execution time has three structural constraints that make 30-minute RTO unachievable regardless of how well the runbook is written:
- Detection and escalation latency: A regional IBM Cloud incident must be detected by monitoring, an on-call engineer must be paged and respond, the incident must be confirmed as requiring DR activation, and an approval decision must be made. Even with excellent monitoring, this typically takes 10–20 minutes for a clear-cut regional failure — longer for ambiguous partial degradation scenarios.
- Sequential manual execution: A 30-minute RTO requires that every step in the failover process completes in the remaining 10–20 minutes after detection. Manual steps — promoting a database replica, updating DNS records, verifying OpenShift cluster health in the failover region, reconfiguring load balancers — each take several minutes and cannot be parallelised by a single on-call engineer.
- Cognitive load under pressure: Engineers executing a DR runbook at 3am for a patient-facing clinical system are under significant stress. Step skipping, sequencing errors, and time lost re-reading runbook steps are not edge cases — they are expected outcomes of manual DR execution under time pressure.
The arithmetic: detection (10–20 minutes) plus the decision to activate DR (5–10 minutes) consumes 15–30 minutes before the first runbook step even begins, leaving at most 15 minutes for a procedure that historically took 2.5 hours. A 30-minute RTO therefore cannot be met by executing the runbook faster; the manual steps have to come out of the critical path.
The required architecture: automated failover with human oversight for fail-back
Tier 1 workloads (RTO: 30 min, RPO: 5 min) — Patient record system
Database layer — IBM Cloud Databases for PostgreSQL with cross-region replication:
IBM Cloud Databases for PostgreSQL supports read replicas in a secondary IBM Cloud region (eu-de, Frankfurt, for a primary in eu-gb, London). Configure synchronous replication to the Frankfurt read replica:
- Synchronous replication: every write is committed in both London and Frankfurt before the application receives acknowledgement. This achieves RPO of near-zero (maximum data loss equals in-flight writes at the moment of failure — typically under 1 second) — well inside the 5-minute RPO requirement.
- Trade-off: synchronous replication adds write latency equal to the London-to-Frankfurt round-trip time (~15ms). For a clinical information system where write correctness is paramount, this latency cost is acceptable and explicitly preferable to asynchronous replication's data loss risk.
- Automatic leader election: IBM Cloud Databases for PostgreSQL supports automated promotion of the Frankfurt replica to primary when the London primary becomes unreachable, within 60–90 seconds. No manual DBA intervention required.
Application layer — ROKS cluster in Frankfurt with active-warm standby:
Deploy a warm standby ROKS cluster in IBM Cloud eu-de (Frankfurt). "Warm" means the cluster is provisioned and healthy with all Kubernetes workloads deployed, but serving no live traffic:
- Workloads are deployed identically to London using OpenShift GitOps (ArgoCD) — the same GitOps repository drives both clusters, so any application change deployed to London is automatically synchronised to Frankfurt within minutes (an Application sketch follows this list).
- The Frankfurt cluster is configured as a standby: its Ingress routes to a health check endpoint that returns 503 until failover is activated. Real traffic is not served until failover is triggered.
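A sketch of the Frankfurt side of that GitOps pairing, assuming a Kustomize overlay per region and illustrative repository and cluster API URLs; an equivalent Application pointing at overlays/eu-gb and the London cluster keeps the primary in sync from the same repository.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clinical-platform-eu-de
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.nhs.uk/trust/platform-config.git   # assumed GitOps repository
    targetRevision: main
    path: overlays/eu-de                                            # Frankfurt-specific overlay
  destination:
    server: https://api.frankfurt-standby.example:6443              # assumed standby cluster API endpoint
    namespace: clinical-platform
  syncPolicy:
    automated:
      prune: true
      selfHeal: true          # drift on the standby is corrected automatically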
Automated failover controller:
Failover Controller (running independently in both regions)
│
├── Health probe: IBM Cloud ROKS API + PostgreSQL primary every 30 seconds
├── Consensus: 3 consecutive failed probes = failover trigger (90 seconds)
├── Automated actions on trigger:
│ ├── IBM Cloud Databases: promote Frankfurt replica to primary
│ ├── IBM Cloud Internet Services (CIS): update DNS CNAME to Frankfurt Ingress
│ │ (DNS TTL pre-set to 60 seconds — propagation within 60s of update)
│ ├── CIS Global Load Balancer: set London pool health to disabled
│ └── Notification: PagerDuty P1 alert, IBM Cloud Activity Tracker event
└── Human role: validate failover success, initiate planned fail-back
Failover timeline with automation:
| Step | Time |
|---|---|
| Incident begins | T+0 |
| Health probe detects failure (3 probes × 30s) | T+90s |
| Automated database promotion | T+3m |
| DNS propagation (60s TTL) | T+4m |
| Frankfurt cluster serving live traffic | T+5m |
| Total RTO | ~5 minutes |
This is 6× better than the 30-minute RTO requirement and provides a substantial operational safety margin.
Tier 2 workloads (RTO: 4 hours, RPO: 1 hour) — Administrative workloads
For administrative workloads with a 1-hour RPO, asynchronous replication is acceptable:
- IBM Cloud Object Storage with cross-region bucket replication: Administrative documents, reports, and attachments replicated asynchronously to a secondary region every 15 minutes (well within the 1-hour RPO).
- IBM Cloud Databases asynchronous replica: A separate PostgreSQL instance for administrative data with asynchronous replication — replication lag typically < 5 minutes under normal load.
- Manual failover with a tested, automated runbook: Tier 2 workloads use an Ansible-driven runbook that executes failover steps automatically but requires a human to initiate. The 4-hour RTO provides time for a deliberate human decision and structured execution.
DR testing as a first-class engineering practice
The 14-month testing gap is itself a compliance failure for an NHS trust. I would implement:
- Monthly automated DR drills in a staging environment: the failover controller triggers automatically, and a test validates that the Frankfurt cluster is serving traffic and the database promotion completed successfully — no human involvement required
- Quarterly production DR test: Planned failover of 10% of production traffic to Frankfurt during a low-activity window, with the trust's clinical informatics team observing. This validates the production architecture, not just the staging approximation
- All DR test results are documented and submitted as evidence for NHS DSP Toolkit Assertion 9.7 (business continuity)
Key Concepts Tested
- Structural analysis of why manual runbooks cannot achieve short RTO targets
- IBM Cloud Databases for PostgreSQL synchronous cross-region replication for near-zero RPO
- Warm standby ROKS cluster with OpenShift GitOps for workload synchronisation
- Automated failover controller design: removing humans from the failover critical path
- IBM Cloud Internet Services (CIS) DNS TTL strategy for fast traffic failover
- Tiered DR strategy: different RTO/RPO requirements driving different automation levels
- DR testing cadence as a compliance requirement, not an operational nice-to-have
Follow-Up Questions
- "Your automated failover controller triggers successfully, promoting the Frankfurt database replica and shifting DNS within 5 minutes. However, 20 minutes after failover, clinical staff report that 340 patient records updated in the final 3 minutes before the London failure are missing from the Frankfurt database — despite your synchronous replication configuration. What are the three most likely explanations for this data loss, and how do you investigate each to determine the root cause?"
- "The NHS trust's clinical informatics director asks: 'What happens if Frankfurt also has an outage simultaneously with London? Our patients cannot have zero access to their records.' Design the architecture changes required to handle a simultaneous dual-region failure, and explain what additional constraints this introduces on the RTO and RPO targets."
Question 7: Network Performance Optimisation in IBM Cloud VPC Environments
Interview Question
IBM has deployed a financial trading analytics platform for a European asset management firm on IBM Cloud. The platform consists of a Red Hat OpenShift cluster in IBM Cloud eu-gb (London), consuming real-time market data feeds from external data providers, processing tick data through a stream processing pipeline (Apache Kafka + Apache Flink), and serving analytics results to 850 portfolio managers via a web application. The firm is reporting three network performance problems: (1) the market data ingestion pipeline experiences intermittent 200–400ms latency spikes every 10–15 minutes, disrupting real-time price calculations; (2) inter-pod communication within the OpenShift cluster between the Flink processing pods and the Kafka consumer pods has higher than expected latency (8–12ms average versus 1–2ms expected); (3) the web application serving portfolio managers in Frankfurt, Paris, and Amsterdam is experiencing 600–900ms page load times, despite the backend API responding in under 80ms.
Diagnose each of the three network performance problems and design the architectural changes to resolve them.
Why Interviewers Ask This Question
Network performance troubleshooting in IBM Cloud VPC environments — spanning external ingestion pipelines, intra-cluster pod networking, and geographically distributed end-user performance — is a core IBM Cloud Infrastructure Engineer competency. This question tests whether a candidate can decompose multi-layer network performance problems, identify the IBM Cloud and Kubernetes networking constructs responsible for each issue, and design targeted fixes rather than broad infrastructure changes. Financial services clients in particular have stringent latency requirements that make network engineering a first-class concern.
Example Strong Answer
Problem 1: 200–400ms market data ingestion spikes every 10–15 minutes
The regularity of the spike interval (10–15 minutes) is a strong diagnostic signal — truly random network congestion does not produce periodic spikes. Periodic patterns in network latency almost always indicate one of three causes: scheduled processes (backups, log rotations, garbage collection), rate limiting with a refill interval, or TCP retransmission events triggered by a specific traffic pattern.
Diagnosis steps:
# Capture network traffic during a spike window
tcpdump -i eth0 -w /tmp/market-data-capture.pcap host <market-data-provider-ip>
# Check for TCP retransmission events
ss -s # socket statistics
netstat -s | grep retransmit
# Correlate spike timing with system processes
atop -r /var/log/atop/atop_<date>.log # check for periodic CPU/IO spikes
Most likely cause and fix: TCP buffer exhaustion and flow control
A 200–400ms spike on a high-throughput market data feed is characteristic of TCP receive buffer exhaustion triggering flow control. When the application cannot consume data from the receive buffer fast enough (e.g., during a Flink checkpoint or GC pause), the TCP window shrinks to zero — the sender stops transmitting and must wait for a window update. The periodic pattern matches typical GC or checkpoint intervals.
Fixes:
- Increase TCP socket buffer sizes on the IBM Cloud VSI instances running the Kafka brokers:
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
Persist via a MachineConfig in OpenShift to survive node reboots.
- Dedicated market data ingestion pods with CPU and memory guarantees (QoS: Guaranteed): If Flink checkpointing pauses are causing the receive buffer to fill, isolate the ingestion tier from checkpoint-heavy processing pods by running them on dedicated nodes with the node label workload-type: ingestion (a deployment sketch follows this list). Flink checkpoints happen on separate nodes and cannot interrupt ingestion processing.
- IBM Cloud DirectLink for market data providers: If the data provider supports it, use IBM Cloud DirectLink (Connect or Dedicated) rather than the public internet for the market data feed. DirectLink eliminates public internet routing variability that can cause the observed latency spikes.
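A sketch of that ingestion tier isolation, with illustrative names and sizing: requests equal to limits put the pods in the Guaranteed QoS class, and the nodeSelector pins them to the dedicated ingestion nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: market-data-ingest
  namespace: streaming
spec:
  replicas: 3
  selector:
    matchLabels:
      app: market-data-ingest
  template:
    metadata:
      labels:
        app: market-data-ingest
    spec:
      nodeSelector:
        workload-type: ingestion                                   # dedicated ingestion nodes
      containers:
        - name: ingest
          image: registry.example.internal/market-data-ingest:1.0   # assumed image reference
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:                                                  # requests == limits gives Guaranteed QoS
              cpu: "4"
              memory: 8Gi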
Problem 2: 8–12ms intra-cluster pod latency (expected: 1–2ms)
Intra-cluster pod-to-pod latency of 8–12ms is an order of magnitude higher than expected for pods within the same OpenShift cluster in a single IBM Cloud VPC. This rules out network hardware as the cause — the physical latency within a single IBM Cloud availability zone is under 1ms.
Diagnosis steps:
# Check if Flink and Kafka pods are on the same node or different nodes
oc get pods -o wide -n streaming # Check NODE column
# Check network policy rules between namespaces
oc get networkpolicy -n streaming
oc get networkpolicy -n kafka
# Check if a service mesh (Istio) is adding mTLS overhead
oc get pods -n istio-system
istioctl analyze -n streamingMost likely cause: cross-node traffic with service mesh mTLS overhead
If Flink and Kafka pods are scheduled on different nodes in different availability zones, the pod-to-pod traffic crosses the IBM Cloud VPC network — adding 2–5ms. If Istio service mesh is deployed with mTLS, each pod-to-pod call also incurs a TLS handshake on the Envoy sidecar proxy — adding another 3–6ms per connection establishment.
Fixes:
- Pod affinity rules: Schedule Flink consumer pods on the same nodes as (or nodes adjacent to) the Kafka broker pods:
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: kafka-broker
          topologyKey: kubernetes.io/hostname
- Istio mTLS with session resumption: If Istio is the latency source, configure TLS session resumption to eliminate the per-connection handshake overhead for long-lived streaming connections. Alternatively, for the high-throughput Flink-to-Kafka path, evaluate whether the service mesh adds security value that justifies the latency cost — internal cluster traffic between known workloads may not require mTLS (a scoped PeerAuthentication sketch follows this list).
- IBM Cloud VPC worker placement: Keep the Flink and Kafka worker nodes in a single availability zone for this latency-sensitive path so pod-to-pod traffic never crosses a zone boundary; IBM Cloud VPC placement groups can additionally control how those worker VSIs are distributed across physical hosts within the zone.
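If the decision is that the Flink-to-Kafka path does not need mesh mTLS, a scoped Istio PeerAuthentication can switch it off for just that workload rather than mesh-wide; the namespace and labels below are illustrative.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: kafka-broker-no-mtls
  namespace: kafka                    # assumed namespace of the Kafka brokers
spec:
  selector:
    matchLabels:
      app: kafka-broker
  mtls:
    mode: DISABLE                     # or PERMISSIVE if some clients must keep mTLS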
Problem 3: 600–900ms page load for Frankfurt, Paris, Amsterdam users
The backend API responds in 80ms, but page loads take 600–900ms from continental Europe to London. The delta (520–820ms) is pure network and browser rendering overhead — the application is not the problem.
Decomposing the latency budget:
- London to Frankfurt RTT: ~12ms. For a page load requiring 15 sequential HTTP/1.1 requests, that is 15 × 12ms = 180ms of network round-trip time — significant but not the full 520–820ms.
- TLS handshakes: each new TCP connection requires a TLS 1.2 handshake (2 round trips = ~24ms per connection). Fifteen connections = 360ms in TLS overhead alone with HTTP/1.1.
- Static asset delivery: large JavaScript bundles and images delivered from London add transfer time proportional to bundle size and available bandwidth.
Fixes:
- IBM Cloud Internet Services (CIS) with CDN and edge caching: Enable IBM CIS (powered by Cloudflare) with edge caching for the web application. IBM CIS has PoPs in Frankfurt, Paris, and Amsterdam — static assets (JS bundles, CSS, images) are served from the nearest PoP at < 5ms latency, not from London. This alone eliminates the majority of the 600–900ms page load time.
- HTTP/2 with TLS 1.3 enforcement: Enforce HTTP/2 at the IBM CIS edge. HTTP/2 multiplexes all 15 requests over a single TCP connection — eliminating 14 of 15 TLS handshakes. TLS 1.3 reduces the remaining handshake from 2 round trips to 1. Combined, this reduces connection establishment overhead from ~360ms to ~12ms.
- IBM CIS Origin Certificates + end-to-end TLS: Configure the CIS edge to terminate TLS from the browser and re-establish TLS to the London origin over IBM's private backbone — not the public internet. This gives portfolio managers encrypted end-to-end communication while the edge-to-origin path benefits from IBM's lower-latency private network.
Key Concepts Tested
- Periodic latency spike diagnosis: TCP buffer exhaustion vs rate limiting vs GC interference
- TCP socket buffer tuning via MachineConfig in OpenShift
- Pod affinity rules for co-locating latency-sensitive workloads on the same nodes
- IBM Cloud VPC placement groups for physical host co-location
- Istio service mesh mTLS overhead identification and mitigation
- IBM Cloud Internet Services (CIS) CDN for European edge delivery
- HTTP/2 multiplexing and TLS 1.3 for reducing connection establishment overhead
Follow-Up Questions
- "After deploying pod affinity rules for the Flink-to-Kafka path, intra-cluster latency drops to 1.8ms as expected. However, three days later, a cluster node fails and Kubernetes reschedules the Flink pods onto nodes in a different availability zone — latency spikes to 14ms and the trading analytics pipeline breaches its SLA for 40 minutes before the situation is manually corrected. How do you make the affinity rules more resilient so that node failures do not silently move pods to suboptimal placement?"
- "The IBM CIS CDN deployment reduces European page load times from 700ms to 180ms for static content. However, the portfolio managers report that the personalised dashboard — which loads their specific portfolio data via API calls that cannot be cached — still takes 600ms. Explain why CDN caching cannot help with this specific problem, and describe the architectural options available to reduce latency for dynamic, personalised API responses served from London to continental European users."
Question 8: CI/CD Pipeline Architecture for OpenShift on IBM Cloud
Interview Question
IBM is building a CI/CD platform for a large UK government department that is migrating 47 legacy applications to Red Hat OpenShift on IBM Cloud. The department has strict requirements: all container images must be built from approved base images and scanned for CVEs before deployment; no code can be deployed to production without passing automated tests and a mandatory peer review from a second engineer; all deployments to production must be auditable with a complete trail showing who approved what, when; and the production OpenShift cluster must only accept images from the department's own internal IBM Cloud Container Registry — no images from Docker Hub or public registries. The department's development teams use GitHub Enterprise (on-premises) as their source control system. Design the complete CI/CD pipeline architecture meeting all four requirements, using IBM Cloud and OpenShift-native tooling where possible.
Why Interviewers Ask This Question
CI/CD pipeline design for OpenShift in regulated government environments is a direct IBM Cloud Infrastructure Engineer responsibility in IBM's public sector practice. Government clients impose the same four requirements described — CVE scanning, mandatory peer review gates, deployment audit trails, and registry allowlisting — and these requirements map to specific OpenShift and IBM Cloud tooling that a candidate must know. This question also tests whether a candidate understands the difference between infrastructure-enforced policy (OPA/Gatekeeper restricting image sources at the cluster level) and process-enforced policy (a runbook saying "check images come from our registry"), and why only the former is sufficient for a regulated client.
Example Strong Answer
Pipeline architecture overview
The pipeline has four distinct stages, each addressing one of the four requirements:
GitHub Enterprise (on-prem)
│
├── Pull Request opened
│ └── [Stage 1: Build + Scan] OpenShift Pipelines (Tekton)
│ ├── Build image from approved base
│ ├── CVE scan (IBM Cloud Security Advisor / Trivy)
│ └── Push to IBM Cloud Container Registry (staging tag)
│
├── PR requires peer review approval (GitHub Enterprise CODEOWNERS)
│ └── [Gate: mandatory second engineer approval]
│
├── PR merged to main
│ └── [Stage 2: Integration Tests] OpenShift Pipelines
│ ├── Deploy to dev/test namespace
│ ├── Automated integration test suite
│ └── Test results recorded in pipeline run
│
└── Release tag created (by release engineer, not PR author)
└── [Stage 3: Production Deployment] OpenShift GitOps (ArgoCD)
├── ArgoCD Application syncs from release tag
├── Image digest validated against ICIR signature
└── Deployment event logged to IBM Cloud Activity Tracker
Requirement 1: Approved base images + CVE scanning
All Dockerfiles must use only images from an approved base image catalogue maintained in the IBM Cloud Container Registry. This is enforced at two layers:
Layer 1 — Pipeline enforcement (process gate):
The Tekton pipeline includes a validate-base-image task that parses the Dockerfile's FROM statement and checks it against an allowlist:
# Tekton Task: validate-base-image
steps:
- name: check-base-image
image: registry.access.redhat.com/ubi8/ubi-minimal
script: |
BASE_IMAGE=$(grep "^FROM" /workspace/source/Dockerfile | head -1 | awk '{print $2}')
if ! grep -q "$BASE_IMAGE" /workspace/approved-base-images.txt; then
echo "ERROR: Base image $BASE_IMAGE is not in the approved catalogue"
exit 1
fi
Layer 2 — CVE scanning with IBM Cloud Container Registry Vulnerability Advisor:
After the image is built and pushed to the staging tag in ICIR, the pipeline executes a Vulnerability Advisor scan. The pipeline task polls the scan result and fails if any Critical or High CVEs are found:
- name: check-vulnerability-scan
script: |
RESULT=$(ibmcloud cr va --output json ${IMAGE_TAG} | jq '.[] | .status')
if echo "$RESULT" | grep -q "Fail"; then
echo "Vulnerability scan failed — Critical/High CVEs detected"
exit 1
fi
Images that fail the scan cannot be tagged as approved and cannot proceed to the peer review gate.
Requirement 2: Mandatory peer review — GitHub Enterprise CODEOWNERS
GitHub Enterprise's CODEOWNERS file combined with branch protection rules enforces the mandatory second engineer approval at the source control layer — not just as a cultural practice:
# .github/CODEOWNERS
# All production deployment manifests require approval from a second member of the platform team
/deployment/production/* @department/platform-team-leads
/helm/values-production.yaml @department/platform-team-leads
Branch protection rules on main:
- Require at least 1 approving review from a CODEOWNERS member
- Dismiss stale reviews when new commits are pushed (prevents approval of a safe commit, then pushing a malicious change)
- Restrict who can push directly to main — only the CI service account, after all checks pass
Critically: the PR author cannot approve their own pull request — GitHub Enterprise enforces this natively when "Require review from Code Owners" is enabled. This addresses the "no engineer promotes their own code" requirement.
Requirement 3: Full deployment audit trail — IBM Cloud Activity Tracker + pipeline provenance
Every deployment to production generates an audit record at multiple layers:
- GitHub Enterprise: PR merge event includes the author, reviewer, approvals, and timestamps — immutable once merged
- OpenShift Pipelines (Tekton): Every pipeline run produces a PipelineRun resource in OpenShift with a complete record of all task executions, inputs, outputs, and timestamps — queryable via oc get pipelineruns
- ArgoCD sync history: Every ArgoCD sync (deployment to production) records the Git commit SHA, the user who triggered the sync, and the timestamp in ArgoCD's application history
- IBM Cloud Activity Tracker: All IBM Cloud Container Registry pushes and all ROKS cluster API operations (deployments, pod creations) are logged automatically via Activity Tracker with full request context
For a compliance auditor asking "who approved the deployment of version 2.4.1 to production, and when?" — the answer is traceable through GitHub PR approval history → Tekton PipelineRun → ArgoCD sync history → Activity Tracker ROKS events.
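To make the GitOps deployment stage concrete, a minimal sketch of the ArgoCD Application pinned to a release tag (repository URL, tag, paths, and namespaces are illustrative assumptions):
```yaml
# Hypothetical ArgoCD Application: production deployments sync only from an immutable
# release tag, so every sync is traceable to a specific, peer-reviewed Git revision.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service-production        # illustrative application name
  namespace: openshift-gitops
spec:
  project: production
  source:
    repoURL: https://github.department.example/platform/deployment-manifests.git
    targetRevision: v2.4.1                  # release tag created by the release engineer
    path: deployment/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-production
  syncPolicy:
    automated:
      prune: false      # deletions require an explicit, reviewed change
      selfHeal: true    # drift in the cluster is reverted to the tagged state
```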
Requirement 4: Production cluster only accepts images from ICIR — OPA/Gatekeeper admission policy
Process controls (a rule saying "only use our registry") are insufficient — they can be bypassed intentionally or accidentally. The production cluster must technically reject any pod that references an image from an unauthorised registry. This is implemented via OPA/Gatekeeper (available on OpenShift through the Gatekeeper Operator):
# OPA Rego policy: restrict image source to IBM Cloud Container Registry
package k8sallowedrepos
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
not startswith(container.image, "uk.icr.io/department-registry/")
msg := sprintf("Image '%v' is not from the approved registry", [container.image])
}
This ConstraintTemplate is instantiated as a K8sAllowedRepos Constraint in the production namespace. Any pod whose container image does not begin with uk.icr.io/department-registry/ is rejected at admission — before it is scheduled, before it pulls the image, before it runs. A developer who accidentally references docker.io/nginx in their Kubernetes manifest receives an immediate validation error when applying the manifest.
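A minimal sketch of the Constraint that instantiates the template for the production namespace (the resource name is an assumption):
```yaml
# Hypothetical Constraint for the ConstraintTemplate above. With the hard-coded Rego
# shown, no parameters are needed; the gatekeeper-library variant of K8sAllowedRepos
# instead takes the allowed prefixes via spec.parameters.repos.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: production-icr-only
spec:
  enforcementAction: deny        # reject at admission rather than only audit
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["production"]
```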
Additionally, enable IBM Cloud Container Registry image signing (Notary v2): only images signed with the department's private signing key are accepted by the cluster. A signature-verification admission policy checks the signature before admitting the pod, so even images hosted in ICIR that were not built and signed through the approved pipeline are rejected.
Key Concepts Tested
- OpenShift Pipelines (Tekton) for cloud-native CI pipeline execution on OpenShift
- IBM Cloud Container Registry Vulnerability Advisor for CVE scanning in pipeline gates
- GitHub Enterprise CODEOWNERS + branch protection for mandatory peer review enforcement
- OpenShift GitOps (ArgoCD) for auditable GitOps-driven production deployments
- IBM Cloud Activity Tracker for deployment audit trail across the full pipeline
- OPA/Gatekeeper ConstraintTemplate for infrastructure-enforced image registry allowlisting
- IBM Cloud Container Registry image signing with Notary v2 for supply chain security
Follow-Up Questions
- "The CVE scanning gate has been live for 6 weeks. A development team reports that their pipeline has been failing for 3 days because a newly disclosed Critical CVE in their approved base image (Red Hat UBI 8.7) has no fix available yet — the patched version won't be released for another 2 weeks. Their application needs to go to production urgently. How do you handle this exception without either bypassing the security control entirely or blocking a legitimate business-critical deployment?"
- "The OPA/Gatekeeper image registry policy is enforced in production. However, an incident investigation reveals that a pod running in the
kube-systemnamespace is pulling an image fromquay.io— a registry not in your allowlist — and your Gatekeeper policy did not block it. Explain the most likely reasons the policy did not apply to this pod, and how you audit and close the gap."
Question 9: Distributed Storage and Stateful Workloads on OpenShift
Interview Question
IBM has deployed a Red Hat OpenShift cluster on IBM Cloud for a media company running a video processing platform. The platform includes: a Kafka cluster (6 brokers, each requiring 4TB of persistent storage for topic data retention), an Elasticsearch cluster (12 nodes, each requiring 2TB of fast NVMe storage for search indices), and a PostgreSQL database cluster (primary + 2 read replicas, each requiring 500GB storage). The operations team is reporting three storage-related incidents in the past month: (1) a Kafka broker pod was rescheduled to a different node after a node failure, but the new node could not attach the original PersistentVolume — the pod stayed in Pending state for 47 minutes; (2) Elasticsearch indexing performance has degraded by 40% over the past 6 weeks, with disk I/O identified as the bottleneck; (3) a PostgreSQL primary pod was evicted due to node memory pressure, and during the 8-minute recovery period, all three pods (primary + 2 replicas) attempted to become primary simultaneously — resulting in a split-brain condition that required 2 hours of manual remediation. Diagnose each incident and redesign the storage and stateful workload architecture to prevent each class of failure.
Why Interviewers Ask This Question
Stateful workloads on Kubernetes — particularly storage-intensive applications like Kafka, Elasticsearch, and databases — are among the hardest operational challenges on OpenShift. IBM Cloud Infrastructure Engineers supporting media, analytics, and data platform clients routinely face exactly these three failure classes: PersistentVolume attachment failures after pod rescheduling, I/O performance degradation on shared storage, and split-brain conditions in clustered stateful applications. This question tests whether a candidate understands the Kubernetes storage primitives (StorageClass, PV topology, StatefulSets) at a depth that allows them to diagnose real incidents and design architectures that prevent them.
Example Strong Answer
Incident 1: Kafka PV attachment failure after node failover (47 minutes in Pending)
The 47-minute Pending state after a node failure is a zone topology mismatch between the PersistentVolume and the node where the pod was rescheduled.
Root cause analysis:
IBM Cloud Block Storage volumes are zonal — a volume provisioned in eu-gb-2 (London AZ 2) can only be attached to a VSI in eu-gb-2. If the Kafka broker pod is rescheduled to a node in eu-gb-1 or eu-gb-3 (because the scheduler does not know about the volume's zone constraint), the attach operation fails. The pod stays in Pending state until either: a node in eu-gb-2 becomes available (which could take 47+ minutes if all eu-gb-2 nodes are at capacity), or the volume topology constraint is correctly enforced.
Fix: StorageClass with volumeBindingMode: WaitForFirstConsumer
The standard IBM Cloud Block Storage StorageClass uses immediate volume binding — volumes are provisioned in a zone before a pod is scheduled, without knowing which zone the pod will land in. This creates the zone mismatch.
Change the StorageClass to WaitForFirstConsumer:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibmc-block-gold-topology-aware
provisioner: vpc.block.csi.ibm.io
parameters:
  profile: "10iops-tier"
volumeBindingMode: WaitForFirstConsumer  # critical: delays binding until the pod is scheduled
reclaimPolicy: Retain
With WaitForFirstConsumer, the volume is not provisioned until the pod is scheduled to a specific node. The CSI driver then provisions the volume in the same zone as the scheduled node — guaranteeing the volume can be attached to that node.
Additionally, use a StatefulSet (not a Deployment) for Kafka brokers. StatefulSets maintain stable pod identity and PVC binding — when kafka-broker-2 is rescheduled, it is always associated with data-kafka-broker-2 (its original PVC) and the scheduler attempts to place it in the zone where that PVC lives. Combined with WaitForFirstConsumer, this eliminates the zone mismatch problem.
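A minimal sketch of the broker StatefulSet's volumeClaimTemplates wired to the topology-aware StorageClass above (namespace, image, and sizes are illustrative assumptions):
```yaml
# Hypothetical Kafka StatefulSet excerpt: each broker keeps a stable PVC
# (data-kafka-broker-N) provisioned in the zone where the pod is scheduled.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka-broker
  namespace: streaming-kafka                # illustrative namespace
spec:
  serviceName: kafka-broker
  replicas: 6
  selector:
    matchLabels:
      app: kafka-broker
  template:
    metadata:
      labels:
        app: kafka-broker
    spec:
      containers:
        - name: kafka
          image: icr.io/example/kafka:3.6   # illustrative image reference
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ibmc-block-gold-topology-aware   # StorageClass defined above
        resources:
          requests:
            storage: 4Ti
```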
Incident 2: Elasticsearch I/O degradation on shared storage
A 40% performance degradation over 6 weeks on a storage-intensive workload is characteristic of storage contention on shared block storage, not hardware degradation. IBM Cloud Block Storage is provisioned from shared storage infrastructure — multiple volumes share the same underlying physical storage arrays, and IOPS limits are enforced per volume (not per physical disk).
Diagnosis:
# Check block storage IOPS utilisation on ES nodes
oc exec -n elasticsearch <es-node-pod> -- iostat -xz 1 10
# Look for: %util approaching 100%, rising await, r/s + w/s approaching the profile's IOPS limit
If the Elasticsearch data volumes are on the 5iops-tier IBM Cloud Block Storage profile (the default), each 2TB volume is limited to 5 × 2,000 = 10,000 IOPS. For 12 Elasticsearch nodes writing and reading concurrently during heavy indexing, 10,000 IOPS per node may be insufficient.
Fix: IBM Cloud Block Storage with higher IOPS profile + dedicated node storage
- Upgrade to the 10iops-tier profile: The IBM Cloud Block Storage 10iops-tier profile provides 10 IOPS per GB — for a 2TB volume, that is 20,000 IOPS. This doubles available I/O capacity per node.
- IBM Cloud Bare Metal Servers with local NVMe for Elasticsearch: For I/O-intensive Elasticsearch workloads, consider migrating the Elasticsearch cluster to IBM Cloud Bare Metal Servers with local NVMe storage (2–4TB NVMe per server, 800K+ IOPS). Local NVMe eliminates the shared storage contention entirely. Use IBM Cloud Satellite to bring these bare metal nodes into the OpenShift cluster as a dedicated worker pool for the elasticsearch-workers MachineConfigPool.
- OpenShift Local Storage Operator: If bare metal with local disks is used, the Local Storage Operator provisions PersistentVolumes from the local NVMe disks on each host node. These PVs carry node affinity, so Elasticsearch pods are always scheduled onto the node that holds their local PV (a LocalVolume sketch follows this list).
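A minimal sketch of the Local Storage Operator configuration (the node label, device path, and StorageClass name are assumptions):
```yaml
# Hypothetical LocalVolume: creates local PersistentVolumes from the NVMe device
# on each dedicated Elasticsearch bare metal worker node.
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: elasticsearch-nvme
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/elasticsearch-worker   # illustrative node label
            operator: Exists
  storageClassDevices:
    - storageClassName: local-nvme
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/nvme1n1        # illustrative device path on the bare metal host
```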
Incident 3: PostgreSQL split-brain after eviction
The split-brain condition — three PostgreSQL pods simultaneously attempting to be primary — is a Patroni configuration failure combined with a Kubernetes storage access control gap. In a properly configured Patroni PostgreSQL cluster, only one pod should ever hold the distributed lock (via etcd) that grants primary status. The simultaneous promotion indicates the etcd lock was not respected during the recovery.
Root cause: If Patroni's etcd endpoint was also affected by the node failure (etcd is often co-located with application workloads), Patroni may have entered a degraded state where each replica believed itself entitled to become primary due to etcd unavailability.
Fix: Dedicated etcd for Patroni + STONITH fencing
- Dedicated etcd cluster: Patroni's distributed lock store (etcd) must run on dedicated infrastructure that is isolated from application workload disruptions. Deploy a 3-node etcd cluster on dedicated OpenShift worker nodes with node-role.kubernetes.io/etcd-patroni: "" labels and corresponding taints — no application workloads can be scheduled on these nodes.
- STONITH (Shoot The Other Node In The Head) fencing: Configure Patroni with a STONITH fencing mechanism. Before promoting a replica to primary, Patroni must confirm via the fence agent that the current primary is definitively unavailable — it uses the IBM Cloud VSI API to verify (or stop) the failing node before allowing any promotion. This prevents the race condition where the original primary recovers while a replica has already been promoted.
- PodDisruptionBudget for PostgreSQL: A PDB with minAvailable: 1 prevents voluntary evictions (node drains, cluster upgrades) from removing both replicas simultaneously, ensuring at least one replica is always serving reads even during primary recovery (a sketch follows this list).
- PostgreSQL anti-affinity with requiredDuringSchedulingIgnoredDuringExecution: Ensure the primary and both replicas are scheduled on different nodes and different availability zones — a single node memory pressure event can then affect at most one PostgreSQL pod.
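A minimal sketch of the PodDisruptionBudget and the anti-affinity block referenced in the last two fixes (namespace and labels are illustrative assumptions):
```yaml
# Hypothetical PDB: voluntary disruptions (node drains, upgrades) cannot take
# the PostgreSQL cluster below one available pod.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgresql-pdb
  namespace: postgresql
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: postgresql
---
# Anti-affinity excerpt from the PostgreSQL pod template spec: replicas are forced
# into different availability zones, so a single node or zone event affects at most one pod.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: postgresql
        topologyKey: topology.kubernetes.io/zone
```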
Key Concepts Tested
- IBM Cloud Block Storage zone topology and the WaitForFirstConsumer StorageClass binding mode
- StatefulSet stable identity and PVC binding — preventing zone mismatch on rescheduling
- IBM Cloud Block Storage IOPS profiles and when local NVMe (bare metal) is the right choice
- OpenShift Local Storage Operator for local disk provisioning
- Patroni split-brain prevention: dedicated etcd + STONITH fencing
- PodDisruptionBudgets for stateful workloads to prevent simultaneous eviction
- Pod anti-affinity rules for distributing stateful replicas across zones
Follow-Up Questions
- "The Kafka cluster has 6 brokers with 4TB volumes each — 24TB of total storage. The team wants to implement a backup strategy for the Kafka topic data. Your first instinct is IBM Cloud Object Storage for backup, but the recovery team points out that restoring 4TB of Kafka data from object storage would take 6–8 hours — far too long for their RTO of 30 minutes. What are the alternative backup and recovery strategies for Kafka at this scale, and what trade-offs does each involve?"
- "Three months after deploying the Local Storage Operator for Elasticsearch, a hardware failure causes one of the bare metal nodes to fail completely — the local NVMe disk on that node (containing an Elasticsearch shard) is unrecoverable. The Elasticsearch shard is lost. Walk through the steps to recover from this failure, and explain what Elasticsearch configuration should have been in place to make the shard loss a non-incident rather than a data loss event."
Question 10: Cost Optimisation and FinOps for IBM Cloud Infrastructure
Interview Question
IBM is conducting a cloud cost review for a retail client whose IBM Cloud spend has grown from £180,000/month to £420,000/month over 18 months — a 133% increase against a planned 40% growth. The client's platform runs on IBM Cloud and consists of: 14 Red Hat OpenShift clusters (mix of ROKS and Satellite), IBM Cloud Databases for PostgreSQL and MongoDB, IBM Cloud Object Storage (4.2 PB total), IBM Cloud Internet Services, and IBM Cloud Container Registry. The finance team has escalated the overspend to IBM, and your team has been asked to conduct a cost optimisation engagement. Initial analysis shows IBM Cloud Monitoring dashboards indicating average CPU utilisation of 28% and average memory utilisation of 34% across all ROKS clusters. IBM Cloud billing data shows that 23% of the monthly spend is on resources provisioned over 6 months ago that have no associated workloads. Design the cost optimisation programme — identifying the highest-impact reduction areas, the tooling and automation you would use, and the governance processes to prevent cost drift from recurring.
Why Interviewers Ask This Question
FinOps — the practice of financial accountability for cloud infrastructure — is a growing responsibility for IBM Cloud Infrastructure Engineers, particularly as IBM's enterprise clients face pressure to demonstrate cloud ROI. A 133% cost increase against a 40% plan is a common pattern when infrastructure teams provision resources on demand without governance or cleanup processes. This question tests whether a candidate can identify waste systematically using IBM Cloud billing and monitoring data, knows the specific IBM Cloud cost optimisation levers available (reserved capacity, right-sizing, abandoned resource cleanup), and can design the governance processes that prevent drift from recurring after the initial optimisation.
Example Strong Answer
Step 1: Quantify the opportunity before committing to interventions
Before changing anything, I would generate a precise cost breakdown using IBM Cloud Billing APIs and IBM Cloud Monitoring:
# Export 90-day resource-level billing data
ibmcloud billing resource-instances-usage --start 2024-07-01 --end 2024-09-30 \
--output json > billing-breakdown.json
# Cross-reference with resource inventory
ibmcloud resource service-instances --all-resource-groups --output json > inventory.json
Categorise every resource into one of three buckets:
- Active and right-sized: Running workloads consuming resources proportionate to their allocation
- Active but over-provisioned: Running workloads consuming significantly less than their allocated capacity
- Abandoned: Provisioned resources with no associated workloads or traffic in the past 30 days
The 23% abandoned resource finding is the highest-priority, lowest-risk intervention — deleting abandoned resources produces immediate cost reduction with no business impact. At £420K/month, 23% = £96,600/month in recoverable spend.
Priority 1: Abandoned resource cleanup — £96,600/month recovery
The most common abandoned resources in IBM Cloud enterprise environments:
- Snapshot and backup volumes: Database snapshots and block storage snapshots created for a deleted database that were never cleaned up. IBM Cloud Databases snapshots are billed by stored GB.
- Unattached block storage volumes: PersistentVolumes from deleted OpenShift clusters or namespaces that remain in IBM Cloud after the cluster resources are deleted. IBM Cloud Block Storage is billed even when not attached to a VSI.
- Idle load balancers: IBM Cloud VPC Load Balancers provisioned for applications that have been decommissioned — still billed at the hourly rate.
- Unused Container Registry namespaces: ICIR is billed per GB of image storage. Old images from deprecated services accumulate and are rarely cleaned up.
Automated cleanup tooling:
# IBM Cloud resource tagging + cleanup automation
# Tag all resources at provisioning time with:
# - owner: <team-name>
# - project: <project-name>
# - expiry-date: <YYYY-MM-DD>
# Weekly cleanup job: flag resources with no activity for 30 days
# AND no "keep-alive" tag for human review
# Resources flagged for 7 days with no response are deleted
Priority 2: ROKS cluster right-sizing — estimated £80,000/month reduction
28% CPU and 34% memory utilisation across 14 clusters indicates significant over-provisioning. The right-sizing process:
Step 1: Identify right-sizing opportunities using IBM Cloud Monitoring + VPA recommendations (a VPA manifest sketch follows Step 3 below):
# Get VPA recommendations for all namespaces
oc get vpa -A -o json | jq '.items[] | {
namespace: .metadata.namespace,
name: .metadata.name,
cpu_request: .status.recommendation.containerRecommendations[0].target.cpu,
memory_request: .status.recommendation.containerRecommendations[0].target.memory
}'
Step 2: Identify nodes that can be removed:
Using IBM Cloud Monitoring, identify nodes where the sum of all pod resource requests is under 40% of node capacity consistently over 30 days. These nodes are candidates for consolidation — their workloads can be migrated to other nodes and the nodes decommissioned.
Step 3: Right-size worker pools:
IBM Cloud ROKS worker pools support in-place resizing (smaller flavour) or node count reduction. Reduce worker node counts progressively, validating that PodDisruptionBudgets are respected and workloads reschedule cleanly.
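The recommendations queried in Step 1 come from VerticalPodAutoscaler objects running in recommendation-only mode; a minimal sketch, assuming the VPA operator is installed on the cluster and using illustrative names:
```yaml
# Hypothetical VPA in recommendation-only mode: it computes right-sizing targets
# without changing pod resources, so the data can feed the right-sizing review.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-service-vpa        # illustrative workload name
  namespace: retail-prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  updatePolicy:
    updateMode: "Off"               # recommend only; do not evict or resize pods
```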
Priority 3: Reserved capacity for stable workloads — estimated £65,000/month saving
IBM Cloud on-demand pricing for VSIs and ROKS worker nodes carries a significant premium over reserved instance pricing. For workloads that run continuously:
- IBM Cloud Savings Plans: Commit to a sustained usage level (e.g., £15,000/month of compute) in exchange for a 25–35% discount on that committed spend. Unlike reserved instances, Savings Plans apply across instance types — providing flexibility for right-sizing without losing the discount.
- 1-year reserved instances for stable ROKS worker pools: Production ROKS clusters that have been stable for 6+ months are strong candidates for 1-year reservations at 30–40% discount vs on-demand.
Applying a ~30% discount to the £250,000/month compute portion of the bill yields roughly £75,000/month gross; the £65,000/month figure above is a conservative estimate allowing for workloads that are not stable enough to commit.
Priority 4: IBM Cloud Object Storage lifecycle policies — estimated £25,000/month saving
4.2 PB of IBM Cloud Object Storage is a significant cost centre. IBM COS billing varies by storage class:
- Smart Tier (auto): £0.025/GB/month
- Standard: £0.020/GB/month
- Vault (infrequent access): £0.009/GB/month
- Cold Vault (archive): £0.004/GB/month
Apply lifecycle policies to automatically transition objects to lower-cost storage classes:
{
"Rules": [{
"ID": "transition-to-vault-after-30-days",
"Status": "Enabled",
"Transitions": [
{"Days": 30, "StorageClass": "VAULT"},
{"Days": 90, "StorageClass": "COLD_VAULT"}
]
}]
}
For media assets stored after processing (a one-time operation), this transition reduces storage cost from £0.025 to £0.004/GB — an 84% cost reduction on qualifying objects.
Governance: preventing cost drift from recurring
The optimisation itself is meaningless without governance changes that prevent the same drift from happening again:
- Mandatory resource tagging policy via IBM Cloud Security and Compliance Centre: Any resource without required tags (owner, project, cost-centre) is flagged as non-compliant within 24 hours. Teams with non-compliant resources receive weekly reports.
- Per-team spending dashboards in IBM Cloud Billing: Create IBM Cloud enterprise billing account sub-accounts per team, providing each team visibility into their own spending. Teams that exceed their monthly budget receive automated alerts at 80% and 100% of budget.
- Monthly FinOps review: A 30-minute monthly review with engineering team leads covering: top 10 cost line items, right-sizing recommendations actioned vs pending, abandoned resource report. Cost data is owned by engineering, not just finance.
- Provisioning approval for large resources: Any resource provisioning over £5,000/month requires a platform engineering approval — a Jira workflow that validates the resource is tagged, budgeted, and has a defined decommission plan.
Total expected monthly saving:
| Intervention | Monthly Saving |
|---|---|
| Abandoned resource cleanup | £96,600 |
| ROKS cluster right-sizing | £80,000 |
| Reserved capacity / Savings Plans | £65,000 |
| COS lifecycle policies | £25,000 |
| Total projected reduction | £266,600 (63% reduction) |
Target monthly spend: £420,000 − £266,600 = £153,400/month — below the original planned growth trajectory of £252,000 (40% above the starting £180,000).
Key Concepts Tested
- IBM Cloud Billing API for resource-level cost attribution
- Abandoned resource identification and automated cleanup with tagging policies
- VPA recommendations for Kubernetes workload right-sizing
- IBM Cloud Savings Plans and reserved instances for stable workload cost reduction
- IBM Cloud Object Storage storage class lifecycle policies
- IBM Cloud enterprise billing account structure for per-team cost visibility
- FinOps governance: tagging policies, budget alerts, monthly reviews, provisioning approval gates
Follow-Up Questions
- "You present the cost optimisation findings to the client's CTO. She agrees with the analysis but raises a concern: 'If we right-size our production clusters down to 40% headroom, we have no capacity buffer for unexpected traffic spikes. Last Black Friday, we had a 4× traffic spike that would have taken down a right-sized cluster.' How do you address this tension between cost optimisation and operational headroom, and what is your recommendation for the appropriate headroom level for production workloads?"
- "Three months after the optimisation engagement, IBM Cloud Billing shows that spend has crept back up to £310,000/month — a £156,600 increase from the post-optimisation baseline of £153,400. The tagging policy is in place and the FinOps review process is running. What are the most likely causes of this cost rebound, and what additional automation would you implement to detect and prevent it before it accumulates to this level again?"
Preparation Tip: Across all ten questions in this complete guide, the answers that resonate most strongly with IBM interviewers share a consistent analytical structure: constraint → mechanism → trade-off. Every infrastructure decision exists in a context defined by constraints — regulatory requirements, latency SLAs, cost budgets, compliance frameworks, operational team capacity. Strong candidates name the constraint explicitly, explain the technical mechanism that satisfies it, and then articulate what is given up in making that choice. IBM's enterprise clients are sophisticated enough to understand that no architecture is free — they want engineers who can navigate trade-offs intelligently, not engineers who promise that everything is achievable without compromise. In your preparation, train yourself to end every technical answer with a deliberate trade-off statement: what does this architecture optimise for, and what does it accept as a cost?