CRED — Machine Learning Engineer Interview Questions

CRED's ML Engineer role (under Prefr) is about building the infrastructure that loan decisions run on — not research notebooks, but production-grade systems that process real credit applications at scale, integrate with multiple lender partners, and need to be reliable at 50,000 decisions per day. The interview rewards candidates who think in systems: latency budgets, failure modes, rollback strategies, and monitoring, not just model accuracy. Three or more years of production ML system experience is required.

CRED's Interview Process for ML Engineer

The process has three rounds: a technical coding screen in Python, Java, or Scala; a system design round focused on ML infrastructure in a fintech context; and a domain round on production ML practices and agentic system design. Expect detailed follow-up on any design decision you propose.

Question 1: Loan Decisioning Pipeline Design

Design a real-time loan eligibility scoring system for Prefr that processes 50,000 applications per day from multiple partner channels. The system must return a credit decision in under 3 seconds and support multiple underwriting models running in parallel for different lender partners.

Why interviewers ask this

Tests system design thinking in a fintech context where latency, reliability, multi-tenancy, and model versioning are simultaneously in play. Weak candidates describe a linear model-serving architecture with no consideration for failure modes, feature freshness, or the multi-lender complexity. Strong candidates propose a layered architecture with explicit latency budgets, fallback logic, and a clear separation between the feature store, inference layer, and policy engine.

Example strong answer

I would design this as three decoupled layers: ingestion and validation, feature assembly, and decision execution. Each layer has its own latency budget and failure mode.

The ingestion layer is an API gateway — Kong or AWS API Gateway — that receives the application payload from partner channels, validates the schema against a contract (rejecting malformed requests fast), stamps a request ID for tracing, and routes to the decision engine. This layer should add no more than 50 milliseconds. Authentication and rate limiting per partner also happen here.

The feature assembly layer is where most of the latency budget goes. Bureau features (CIBIL score, DPD history, credit utilisation, enquiry velocity) are expensive to obtain in real time because they require a bureau API call that can take 400 to 800 milliseconds. To stay under the 3-second total budget, I would pre-fetch and cache bureau data for users who have initiated any interaction with the Prefr app in the last 24 hours, storing it in Redis with a 24-hour TTL. For cold users with no prior interaction, the bureau call happens inline and up to 800 milliseconds is absorbed into the budget. Real-time features (requested loan amount, employment type, channel source) are computed on the fly from the application payload in under 50 milliseconds. Total feature assembly target: 600 milliseconds in the warm path, 1,200 milliseconds in the cold path.
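The warm and cold paths described above can be sketched as a cache-aside lookup. This is a minimal illustration, not a production design: the `TTLCache` class is an in-memory stand-in for Redis, and the names `get_bureau_features` and `fetch_from_bureau` are hypothetical. A pre-fetch job would simply call `cache.set` when a user opens the app.

```python
import time
from typing import Callable, Dict, Optional, Tuple

BUREAU_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as in the answer above


class TTLCache:
    """In-memory stand-in for a Redis key with an expiry (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key: str, value: dict, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)


def get_bureau_features(user_id: str, cache: TTLCache,
                        fetch_from_bureau: Callable[[str], dict]) -> Tuple[dict, str]:
    """Return (features, path), where path is 'warm' or 'cold'."""
    key = f"bureau:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached, "warm"
    features = fetch_from_bureau(user_id)  # slow inline call (400 to 800 ms)
    cache.set(key, features, BUREAU_TTL_SECONDS)
    return features, "cold"
```

Passing the clock in as a parameter keeps expiry testable without sleeping; with real Redis the TTL would be enforced by the server instead.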

The decision execution layer receives the assembled feature vector and routes it to the appropriate underwriting model based on the lender partner ID in the request. Each lender can have a distinct model — different risk thresholds, different approved segments — running as independent FastAPI microservices in Docker containers behind a load balancer. Model versions are independently deployable with no shared state. Inference per model: under 200 milliseconds. A post-inference policy engine applies the lender's approve/decline/pricing rules to the model output score, adding another 100 milliseconds. Total at the cold path: around 1,600 milliseconds, comfortably under 3 seconds.
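As a quick check on the arithmetic, the per-stage budgets quoted above can be summed per path. The stage names below are labels chosen for this sketch; the millisecond values are the ones stated in the answer.

```python
TOTAL_BUDGET_MS = 3000  # the 3-second requirement from the question

# Per-stage budgets in milliseconds, as quoted in the answer above.
STAGE_BUDGETS_MS = {
    "ingestion": 50,
    "feature_assembly_warm": 600,
    "feature_assembly_cold": 1200,
    "inference": 200,
    "policy_engine": 100,
}


def path_total_ms(path: str) -> int:
    """Sum the stage budgets for the 'warm' or 'cold' path."""
    if path not in ("warm", "cold"):
        raise ValueError(f"unknown path: {path}")
    return (
        STAGE_BUDGETS_MS["ingestion"]
        + STAGE_BUDGETS_MS[f"feature_assembly_{path}"]
        + STAGE_BUDGETS_MS["inference"]
        + STAGE_BUDGETS_MS["policy_engine"]
    )


def headroom_ms(path: str) -> int:
    """Budget left over for network hops, retries, and queueing."""
    return TOTAL_BUDGET_MS - path_total_ms(path)


print(path_total_ms("warm"), path_total_ms("cold"), headroom_ms("cold"))
```

The cold path sums to 1,550 milliseconds, which the answer rounds to about 1,600, leaving roughly 1.4 seconds of headroom.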

For reliability: if the inference service times out after 500 milliseconds, a rule-based fallback applies a conservative decision using only the bureau score and loan-to-income ratio — no ML, but a real answer that does not leave the partner's UI hanging. For monitoring: every decision is logged with the full input feature vector, model version, score, and final decision. A daily drift detection job computes KL divergence on the score distribution for each partner channel versus a 30-day baseline. If divergence spikes — meaning the model is suddenly seeing a different population than it was calibrated on — the on-call engineer gets an alert before disbursements are affected.
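The timeout-then-fallback behaviour can be sketched with the standard library. This is one possible shape under stated assumptions: the cut-offs `MIN_BUREAU_SCORE` and `MAX_LOAN_TO_INCOME`, the choice to decline everything that is not clearly safe, and the function names are all hypothetical, and `slow_model` only simulates a stalled inference service.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_EXECUTOR = ThreadPoolExecutor(max_workers=4)

# Hypothetical conservative cut-offs; real values would come from lender policy.
MIN_BUREAU_SCORE = 750
MAX_LOAN_TO_INCOME = 0.30


def rule_based_fallback(bureau_score: int, loan_to_income: float) -> str:
    """Approve only clearly safe applications; decline the rest (an assumption)."""
    if bureau_score >= MIN_BUREAU_SCORE and loan_to_income <= MAX_LOAN_TO_INCOME:
        return "approve"
    return "decline"


def decide(features: dict, model_fn, timeout_s: float = 0.5) -> dict:
    """Call the model, falling back to rules if it exceeds the timeout."""
    future = _EXECUTOR.submit(model_fn, features)
    try:
        score = future.result(timeout=timeout_s)
        return {"source": "model", "score": score}
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running is not interrupted
        decision = rule_based_fallback(
            features["bureau_score"], features["loan_to_income"]
        )
        return {"source": "fallback", "decision": decision}


def slow_model(features: dict) -> float:
    """Simulated slow inference service, used only to exercise the timeout."""
    time.sleep(0.3)
    return 0.9
```

Recording the `source` field in the decision log makes it possible to monitor how often the fallback fires, which is itself a useful health signal.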

Follow-up questions

One lender's model starts producing anomalous approval rates after a bureau data schema change upstream. How does your monitoring detect this and what is your response playbook?
How do you A/B test a new underwriting model in this architecture without exposing the lender partner to uncontrolled risk?

Question 2: Feature Engineering for Credit

You're building a feature pipeline for a personal loan default prediction model at Prefr. What features would you engineer from raw bureau data, application data, and internal repayment history? Which features do you expect to be most predictive and why?

Why interviewers ask this

Tests domain understanding of credit risk combined with ML feature thinking. Generic answers — "I'd use all available features and let the model select" — reveal candidates who haven't worked in fintech. Strong candidates demonstrate specific knowledge of what bureau data contains, why certain signals are predictive in the Indian personal lending context, and how to handle the thin-file problem.

Example strong answer

I would engineer features from three distinct sources, prioritising by predictive signal strength based on what the Indian personal lending literature and production models consistently show.

From bureau data, the highest-signal features are: DPD30+ and DPD90+ counts in the last 12 months — these are the strongest predictors of future default because past repayment behaviour in credit products is the best available signal of intent and capacity. Enquiry velocity in the last 90 days, meaning the count of hard enquiries from other lenders, is the second most important signal: a user submitting four or five loan applications in one quarter is typically in credit stress or desperation, which is a strong default predictor regardless of their CIBIL score. Credit utilisation ratio across revolving accounts, age of oldest credit account as a proxy for credit maturity, and the mix of secured versus unsecured credit lines round out the bureau feature set.
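A minimal sketch of three of these bureau features follows. The record shapes (`MonthlyStatus`, `RevolvingAccount`) are assumptions made for illustration; a real bureau response will be structured differently and would need its own parsing.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class MonthlyStatus:
    month_end: date
    days_past_due: int


@dataclass
class RevolvingAccount:
    balance: float
    credit_limit: float


def dpd_count(history: List[MonthlyStatus], as_of: date,
              min_dpd: int, lookback_days: int = 365) -> int:
    """Count months in the lookback window with at least min_dpd days past due."""
    start = as_of - timedelta(days=lookback_days)
    return sum(
        1 for h in history
        if start < h.month_end <= as_of and h.days_past_due >= min_dpd
    )


def enquiry_velocity(enquiry_dates: List[date], as_of: date,
                     window_days: int = 90) -> int:
    """Count hard enquiries in the last window_days."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in enquiry_dates if start < d <= as_of)


def utilisation_ratio(accounts: List[RevolvingAccount]) -> Optional[float]:
    """Total balance over total limit; None when there is no revolving credit."""
    total_limit = sum(a.credit_limit for a in accounts)
    if total_limit <= 0:
        return None  # keep it missing rather than inventing a zero
    return sum(a.balance for a in accounts) / total_limit
```

Returning `None` for an undefined ratio keeps thin-file applicants distinguishable from applicants with genuinely zero utilisation.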

From application data: the loan-to-income ratio is the strongest structural predictor — it directly captures whether the user can service this loan given their declared income. Employment type matters because self-employed income is more volatile than salaried income; this should interact with the loan-to-income ratio rather than being used as a standalone feature. Employment tenure at the current employer is a useful stability signal, with sub-6-month tenure being a meaningful risk flag.

From Prefr's internal repayment history on previous loans: repayment consistency score — the ratio of on-time payments to total due payments — is highly predictive for repeat borrowers. Average days early or late on prior EMIs gives a behavioural signature that generalises across loan amounts. Whether the user has ever initiated a restructuring conversation, even if it was not approved, is a flag worth including.
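The two repayment features named above can be computed directly from an EMI ledger. The `EmiPayment` record is a hypothetical shape for this sketch, and treating a still-unpaid EMI as not on time is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class EmiPayment:
    due_date: date
    paid_date: Optional[date]  # None means the EMI is still unpaid


def repayment_consistency(payments: List[EmiPayment]) -> Optional[float]:
    """Ratio of on-time payments to total due payments."""
    if not payments:
        return None  # first-time borrower: no internal history
    on_time = sum(
        1 for p in payments
        if p.paid_date is not None and p.paid_date <= p.due_date
    )
    return on_time / len(payments)


def average_days_late(payments: List[EmiPayment]) -> Optional[float]:
    """Mean of (paid date minus due date) in days; negative means early."""
    deltas = [
        (p.paid_date - p.due_date).days
        for p in payments
        if p.paid_date is not None
    ]
    return sum(deltas) / len(deltas) if deltas else None
```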

For validation, I would use SHAP values post-training to confirm which features are actually driving predictions versus which are collinear noise, and I would specifically check that enquiry velocity and DPD history are in the top five by absolute mean SHAP value — if they are not, it usually signals a data pipeline issue rather than a model insight. For deployment, I would set up distribution monitoring specifically on bureau features, because bureau data quality and coverage varies by lender partner and can shift silently when a partner changes their bureau integration.
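The top-five check can be expressed without depending on any particular explainability library. The sketch below assumes the SHAP values have already been computed elsewhere and are available as one row per application and one column per feature; the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def mean_abs_shap(shap_rows: Sequence[Sequence[float]],
                  feature_names: Sequence[str]) -> Dict[str, float]:
    """Mean absolute SHAP value per feature across all rows."""
    if not shap_rows:
        raise ValueError("need at least one row of SHAP values")
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for i, value in enumerate(row):
            totals[i] += abs(value)
    return {name: totals[i] / len(shap_rows) for i, name in enumerate(feature_names)}


def missing_from_top_k(importance: Dict[str, float],
                       expected: Sequence[str], k: int = 5) -> List[str]:
    """Return the expected features that did not make the top k by importance."""
    top = sorted(importance, key=importance.get, reverse=True)[:k]
    return [name for name in expected if name not in top]
```

A non-empty result from `missing_from_top_k` is a prompt to inspect the feature pipeline, not proof of a bug.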

Follow-up questions

Bureau data is unavailable for 15% of applicants because they are thin-file or new-to-credit users. How do you handle this segment — do you exclude them, impute, or build a separate model?

Question 3: Model Monitoring in Production

Your loan approval model has been in production for 4 months. Approval rates have dropped from 34% to 27% over the last 3 weeks without any model change or explicit policy change. What do you investigate?

Why interviewers ask this

Tests the ability to diagnose production ML issues systematically, distinguishing between data drift, upstream data changes, and configuration errors. Weak candidates suggest retraining immediately. Strong candidates propose a structured diagnostic that rules out the more common and fixable causes before touching the model.

Example strong answer

A 7 percentage point drop in approval rate over 3 weeks without a deliberate model or policy change has four plausible root causes, and I would investigate them in order of likelihood and ease of diagnosis.

The first and most likely cause is input distribution shift — the characteristics of the applications arriving at the model have changed, not the model itself. I would pull the feature distributions for the most recent 3-week window and compare them to the prior 3 months using KL divergence. Specifically, I would look at the bureau score distribution, the loan-to-income ratio distribution, and the enquiry velocity distribution. If any of these have shifted — for example, if average bureau scores in the incoming population have dropped by 15 points — that would explain a lower approval rate entirely through changed inputs, with no model issue at all. The most common cause of this kind of shift is a new partner channel that started sending lower credit-quality applicants, or the end of a campaign that was attracting premium users.
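A minimal sketch of the comparison, assuming the feature has been bucketed into the same bins for both windows. The bin edges and the additive smoothing constant are choices made for this illustration, and the alert threshold would need to be calibrated on historical data.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per bin; the last bin includes its right edge.

    Values outside the edges are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    last = len(edges) - 2
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1] or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


def kl_divergence(p_counts: Sequence[int], q_counts: Sequence[int],
                  eps: float = 1e-6) -> float:
    """KL(P || Q), smoothed so that empty bins do not produce infinities.

    P is the recent window and Q is the baseline.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must use the same bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl
```

KL divergence is asymmetric, so the direction (recent versus baseline) should be fixed once and kept consistent across dashboards and alerts.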

The second cause to check is an upstream data change from the bureau provider. Bureau vendors occasionally update their scoring methodology, change field names in their API response, or alter how they report DPD history, without adequate notice to downstream users. If a key feature is now being populated differently — for instance, if DPD30+ is now being reported more conservatively and is coming back as 1 for users who previously returned 0 — the model would score those users lower without any change on our side. I would compare the bureau API response format and field distributions against a snapshot from 4 weeks ago to check for this.
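The snapshot comparison can start very simply: which fields appeared or disappeared, and how often a key field is non-zero. The record layout and field names below are hypothetical, and a fuller check would also compare types and value ranges.

```python
from typing import Dict, List, Optional, Sequence


def field_diff(old_records: Sequence[dict],
               new_records: Sequence[dict]) -> Dict[str, List[str]]:
    """Fields added to or removed from the response between two snapshots."""
    old_fields = set().union(*(r.keys() for r in old_records))
    new_fields = set().union(*(r.keys() for r in new_records))
    return {
        "added": sorted(new_fields - old_fields),
        "removed": sorted(old_fields - new_fields),
    }


def nonzero_rate(records: Sequence[dict], field: str) -> Optional[float]:
    """Share of records where the field is present and non-zero."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return None
    return sum(1 for v in values if v != 0) / len(values)
```

A jump in the non-zero rate of a DPD field between the two snapshots, with no matching change in the applicant mix, points at the vendor rather than the model.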

The third cause is a configuration error — specifically, a threshold change that was applied to production inadvertently during a deployment, even if the model weights themselves were not changed. I would pull the deployment history for the last month and verify that the approve/decline threshold in the policy engine matches what was documented in the model governance record.

The fourth cause, only if the first three are ruled out, is model drift: the model was trained on data from 6 or more months ago, and the population it was calibrated on has genuinely shifted. In this case, retraining on the most recent 3 months of data with a rolling window would be the fix. But I would not jump to retraining until the other three causes are eliminated, because retraining on bad data or a misconfigured pipeline makes the underlying problem worse, not better.

Follow-up questions

You discover that the bureau provider changed their DPD reporting methodology 4 weeks ago. What is your immediate action versus your longer-term fix?

Question 4: Agentic System Design

CRED wants to build an internal "AI for BI" copilot — analysts type plain-English questions and get a generated SQL query plus a chart. Design the end-to-end system including guardrails and failure modes.

Why interviewers ask this

Directly from the job description — "agentic systems, auto-insight generation, AI for BI" — and tests whether candidates can design LLM-integrated systems with real production constraints, not just describe a chatbot that calls an LLM and hopes for the best.

Example strong answer

The system has five components: context retrieval, prompt construction, LLM inference, query validation and execution, and output rendering. I would design each with a specific failure mode in mind.

Context retrieval: before sending anything to the LLM, I need to give it accurate schema context so it can generate valid SQL. I would maintain a vector store containing table names, column names, column descriptions written in plain English, and sample values for key categorical columns. When a query comes in, I do a semantic search over this store to retrieve the 5 to 10 most relevant tables and columns for the question. This context is injected into the prompt. Without this step, the LLM hallucinates table and column names constantly, which is unusable in production.

Prompt construction: the prompt instructs the LLM to generate a single SQL SELECT statement using only the tables and columns provided in context, to add a comment explaining what the query does, and to flag if it cannot construct a valid query given the available context. I use a structured output format — JSON with a "sql" field and an "explanation" field — so the response is parseable rather than free text.

Query validation: before execution, a validation layer checks that every table and column referenced in the generated SQL actually exists in the production schema, that the query contains no data modification keywords (INSERT, UPDATE, DELETE, DROP), and that the estimated cost of the query — derived from EXPLAIN — is below a configurable threshold. If any check fails, the query is rejected and the LLM is asked to regenerate with an error message appended to the context. This prevents both schema hallucination and accidental destructive queries.
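One way to sketch the schema and keyword checks is to compile the generated query against an empty copy of the schema. The example below uses an in-memory SQLite database as that shadow schema purely for illustration; a real deployment would run EXPLAIN on the warehouse itself, whose dialect will differ, and the cost-threshold check is not shown. The function names are hypothetical.

```python
import re
import sqlite3
from typing import Sequence, Tuple

# Keywords listed in the answer above; matched as whole words, case-insensitively.
FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP)\b", re.IGNORECASE)


def build_schema_shadow(ddl_statements: Sequence[str]) -> sqlite3.Connection:
    """Create an empty in-memory copy of the schema for validation only."""
    conn = sqlite3.connect(":memory:")
    for ddl in ddl_statements:
        conn.execute(ddl)
    return conn


def validate_sql(sql: str, shadow: sqlite3.Connection) -> Tuple[bool, str]:
    """Return (ok, reason); the reason can be fed back to the LLM on failure."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        return False, "multiple statements are not allowed"
    if not stmt.upper().startswith("SELECT"):
        return False, "only SELECT statements are allowed"
    match = FORBIDDEN.search(stmt)
    if match:
        return False, f"forbidden keyword: {match.group(1).upper()}"
    try:
        # Compiling the query fails if a table or column does not exist.
        shadow.execute(f"EXPLAIN {stmt}")
    except sqlite3.Error as exc:
        return False, f"schema check failed: {exc}"
    return True, "ok"
```

The keyword scan is deliberately blunt and will reject a query whose string literal contains a forbidden word; the read-only replica remains the real safety net.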

Execution happens against a read-only replica with a 30-second timeout. Results are paginated at 10,000 rows to prevent memory issues. A PII masking layer strips or hashes any column that is flagged in the schema metadata as containing user identifiers before results are returned to the analyst.
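The row cap and masking step might look like the following. This is a sketch under assumptions: the set of PII columns would come from the schema metadata, the salted SHA-256 truncation is one possible hashing scheme rather than a recommendation, and the function names are hypothetical.

```python
import hashlib
from typing import List, Set, Tuple

MAX_ROWS = 10_000  # page size from the answer above


def mask_value(value: object, salt: str) -> str:
    """Replace an identifier with a short, stable, salted hash."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:16]


def mask_and_paginate(rows: List[dict], pii_columns: Set[str], salt: str,
                      max_rows: int = MAX_ROWS) -> Tuple[List[dict], bool]:
    """Return (first page with PII columns hashed, whether rows were truncated)."""
    page = rows[:max_rows]
    masked = [
        {
            col: mask_value(val, salt) if col in pii_columns and val is not None else val
            for col, val in row.items()
        }
        for row in page
    ]
    return masked, len(rows) > max_rows
```

Hashing rather than dropping the column keeps group-by and join results meaningful, since the same user maps to the same token within a salt.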

Output rendering: results are passed to a charting library — a thin Plotly wrapper — which selects the appropriate chart type based on the data shape (time series gets a line chart, categorical breakdown gets a bar chart). The LLM generates a one-sentence summary of what the chart shows, which appears as a caption. The analyst sees: the question, the SQL, the chart, and the caption. They can edit the SQL directly and re-run.

If the copilot has to serve multiple teams, each team's schema context is stored in a separate namespace in the vector store, and queries run against team-scoped views in the database so analysts can only access data relevant to their function.


Follow-up questions

The LLM generates a syntactically valid query that passes your validation but returns results that are semantically wrong — for instance, summing revenue at the wrong grain. How do you detect and handle this?

Question 5: Production Incident Response

Your team ships a new underwriting model to production on a Friday afternoon. By Monday morning, the operations team reports that 3x more applications than usual are being flagged for manual review. Walk through how you diagnose and resolve this without causing further disruption.

Why interviewers ask this

Tests incident response thinking, ownership mindset, and the ability to triage a production ML issue under pressure without making it worse. From the JD: "continuously monitor production performance... proactively propose improvements." Weak candidates suggest an immediate rollback. Strong candidates gather data first, identify the most likely cause from first principles, fix the immediate operational impact, and design a post-mortem to prevent recurrence.

Example strong answer

The first thing I would do on Monday morning is resist the instinct to roll back immediately. A rollback is the right call only if we know what broke — an uninformed rollback could mask the real problem and make it harder to diagnose, and it also disrupts any applications that are in-flight under the new model. So I would start with data gathering in the first 30 minutes.

I would pull the score distribution from the new model for Friday through Monday and compare it to the distribution from the prior model over the same traffic volume. If the new model is producing systematically lower scores, so that more applications fall into the "refer to manual review" band rather than the "approve" or "decline" bands, that tells me the scores have shifted relative to the decision bands, which is not the same as the model being more accurate. The second thing I would check immediately is whether the decision thresholds were correctly migrated when the new model was deployed. This is the most common cause of a manual review spike: the new model was calibrated with different score ranges than the old one, but the approve/decline/refer thresholds were not updated to reflect the new score distribution. A model that previously produced scores between 0.4 and 0.9 might now produce scores between 0.2 and 0.7, and if the "refer" threshold is still set at 0.5, a large chunk of the population that would have been auto-approved is now hitting the refer bucket.
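The threshold mismatch is easy to reproduce with synthetic scores. The sketch below uses evenly spaced scores over the two ranges mentioned above, and `remap_threshold` is a hypothetical helper that picks the new cut-off preserving the share of traffic above it. That is only an illustration of the effect; actual thresholds are a risk policy decision.

```python
from typing import Sequence


def share_at_or_above(scores: Sequence[float], threshold: float) -> float:
    """Fraction of applications scoring at or above the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


def remap_threshold(old_scores: Sequence[float], new_scores: Sequence[float],
                    old_threshold: float) -> float:
    """New-model cut-off that keeps the same share of traffic above it."""
    target_share = share_at_or_above(old_scores, old_threshold)
    ranked = sorted(new_scores, reverse=True)
    k = round(target_share * len(ranked))
    if k == 0:
        return float("inf")  # nothing cleared the old threshold
    return ranked[k - 1]


# Synthetic, evenly spaced scores: 0.4 to 0.9 for the old model, 0.2 to 0.7 for the new.
old_scores = [(400 + 5 * i) / 1000 for i in range(101)]
new_scores = [(200 + 5 * i) / 1000 for i in range(101)]

print(share_at_or_above(old_scores, 0.5))  # about 0.80 cleared 0.5 before
print(share_at_or_above(new_scores, 0.5))  # about 0.41 clear it now
print(remap_threshold(old_scores, new_scores, 0.5))
```

With an unchanged 0.5 cut-off, the share clearing it roughly halves even though the ranking of applicants is identical, which is exactly the manual review spike described.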

If the threshold configuration is correct and the score distributions look reasonable, the next check is the feature pipeline. If any feature is arriving as NULL or defaulting to a zero value incorrectly — for instance, if the bureau API was returning empty responses over the weekend and the fallback logic is substituting zero for DPD30+ — the model would receive degraded feature vectors and produce uncertain outputs. I would spot-check 20 to 30 flagged applications by pulling their feature vectors from the decision log and verifying the values look plausible.

Once I identify the root cause — likely threshold misconfiguration in 70% of cases — the immediate action is targeted: fix the threshold configuration in the policy engine to match the new model's score distribution. This can usually be done without redeploying the model itself, which means a 15-minute fix rather than a full rollback. I would communicate the estimated fix time to the operations team every 30 minutes until resolved.

Post-incident: I would add two checks to the deployment pipeline. First, an automated gate that compares the new model's score distribution on the last 1,000 applications against the prior model's distribution on the same traffic and flags if the distributions differ by more than 15% at any decile. Second, a threshold validation step that recomputes the approve/refer/decline thresholds based on the new model's score range before any deployment proceeds to production.
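The first gate could be sketched as a decile comparison. Interpreting "differ by more than 15% at any decile" as the relative difference between decile cut points is an assumption of this sketch, and `decile_gate` is a hypothetical name.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple


def decile_gate(old_scores: Sequence[float], new_scores: Sequence[float],
                max_rel_diff: float = 0.15) -> List[Tuple[int, float]]:
    """Return (percentile, relative difference) for every breached decile.

    An empty list means the gate passes.
    """
    old_deciles = quantiles(old_scores, n=10)  # nine cut points: 10th to 90th
    new_deciles = quantiles(new_scores, n=10)
    breaches = []
    for i, (old, new) in enumerate(zip(old_deciles, new_deciles), start=1):
        if old == 0:
            rel = 0.0 if new == 0 else float("inf")
        else:
            rel = abs(new - old) / abs(old)
        if rel > max_rel_diff:
            breaches.append((i * 10, round(rel, 3)))
    return breaches
```

In a deployment pipeline, a non-empty result would block promotion until someone confirms the thresholds were recomputed for the new score range.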

Follow-up questions

After fixing the threshold issue, the operations team asks for a guarantee this won't happen again. What specific process changes do you propose and what can you realistically guarantee?

Preparation tip

CRED's ML interviews consistently distinguish between candidates who have built things in notebooks and candidates who have owned things in production. For every design you propose, be ready to answer: how do you know when it breaks, and what is your first action when it does? End-to-end ownership is the differentiator.
