Consumer Lending Risk – End-to-End Pipeline

Production-style credit default prediction API demonstrating robust ML engineering patterns for credit risk modelling. Built with LightGBM/XGBoost, FastAPI, and comprehensive automated testing.

| Component   | Detail                                             |
|-------------|----------------------------------------------------|
| Backends    | LightGBM / XGBoost (configurable)                  |
| Calibration | CalibratedClassifierCV (Platt scaling)             |
| Imbalance   | class_weight="balanced" / scale_pos_weight         |
| Serving     | FastAPI with Pydantic feature-contract enforcement |
| Testing     | 10 pytest tests (unit, integration, API)           |
| Deployment  | Docker, GitHub Actions CI                          |

> Dataset: Kaggle Loan Default Dataset — place Loan_Default.csv in data/raw/.

Dataset context: The Loan Default dataset contains 148,670 mortgage applications with 34 features covering borrower demographics (age, gender, income), loan characteristics (amount, term, type), credit profile (credit score, credit type), and property details. The prediction target is binary loan default. With a ~25% default rate, this is a moderately imbalanced classification problem typical of real-world consumer lending portfolios.
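The ~25% default rate feeds directly into the imbalance settings listed in the component table. A minimal sketch of deriving both backend-specific parameters from the target vector (an illustrative helper, not part of the repo):

```python
import numpy as np


def imbalance_params(y: np.ndarray) -> dict:
    """Derive backend-specific imbalance settings from a binary target."""
    pos = int(y.sum())
    neg = int(len(y) - pos)
    return {
        "scale_pos_weight": neg / pos,  # XGBoost: up-weight the positive class
        "class_weight": "balanced",     # LightGBM / sklearn-style estimators
    }


y = np.array([0] * 75 + [1] * 25)  # ~25% default rate, as in this dataset
print(imbalance_params(y)["scale_pos_weight"])  # → 3.0
```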

📋 [Back to README](https://github.com/agbajames/consumer-lending-risk-api)

1. Setup & Configuration

In [1]:
# file: pipeline/setup.py (or first notebook cell)

from __future__ import annotations

import warnings
warnings.filterwarnings(
    "ignore",
    message=".*str.*dtypes are inc.*",
    category=UserWarning,
)
warnings.filterwarnings(
    "ignore",
    message=".*X does not have valid feature names.*",
    category=UserWarning,
    module="sklearn",
)

import json
import importlib.util
import logging
import subprocess
import sys
import time
from pathlib import Path

AUTO_INSTALL = True  # set False if you never want runtime installs

_REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "sklearn": "scikit-learn",
    "lightgbm": "lightgbm",
    "xgboost": "xgboost",
    "jinja2": "jinja2", 
    "joblib": "joblib",
    "isort": "isort",
}


def _missing_packages(required: dict[str, str]) -> list[str]:
    missing: list[str] = []
    for import_name, pip_name in required.items():
        if importlib.util.find_spec(import_name) is None:
            missing.append(pip_name)
    return missing


def ensure_dependencies(auto_install: bool = True) -> None:
    """
    Ensures this kernel/interpreter has required packages.
    Installs missing deps using the same interpreter as the current kernel.
    """
    missing = _missing_packages(_REQUIRED)
    if not missing:
        return

    msg = (
        "Missing packages for this kernel/interpreter:\n"
        f"  - {', '.join(missing)}\n"
        f"Kernel Python: {sys.executable}\n"
    )
    if not auto_install:
        raise ModuleNotFoundError(
            msg
            + "Install them into this interpreter, then restart the kernel."
        )

    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-U", *missing],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"pip install failed for {missing}:\n{result.stderr}")


ensure_dependencies(AUTO_INSTALL)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    brier_score_loss,
    f1_score,
    precision_score,
    recall_score,
    roc_curve,
    precision_recall_curve,
    confusion_matrix,
)
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
import joblib

logging.basicConfig(level=logging.INFO, format="%(levelname)s | %(message)s")
logger = logging.getLogger(__name__)

plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "white",
        "axes.grid": True,
        "grid.alpha": 0.3,
        "axes.spines.top": False,
        "axes.spines.right": False,
        "font.size": 11,
    }
)

PALETTE = {
    "lgb": "#2196F3",
    "xgb": "#FF9800",
    "cal": "#4CAF50",
    "ref": "#9E9E9E",
    "red": "#E53935",
}

print(
    "Setup complete ✓",
    f"(python={sys.version.split()[0]}, exe={Path(sys.executable).name}, numpy={np.__version__}, pandas={pd.__version__})",
)
Setup complete ✓ (python=3.11.13, exe=python3.11, numpy=2.4.3, pandas=3.0.1)

2. Configuration

In [2]:
# ── Paths ──
BASE_DIR = Path(".")
DATA_DIR = BASE_DIR / "data"
RAW_DATA = DATA_DIR / "raw" / "Loan_Default.csv"
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_TRAIN = PROCESSED_DIR / "train.csv"
PROCESSED_TEST = PROCESSED_DIR / "test.csv"
ARTIFACTS_DIR = BASE_DIR / "artifacts"
PIPELINE_PATH = ARTIFACTS_DIR / "model.joblib"
METADATA_PATH = ARTIFACTS_DIR / "metadata.json"

# ── Modelling ──
TARGET_COLUMN = "Status"
TEST_SIZE = 0.2
RANDOM_STATE = 42
DEFAULT_THRESHOLD = 0.5

# ── Cleaning ──
DROP_COLS = ["ID"]
RENAME_MAP = {"co-applicant_credit_type": "co_applicant_credit_type"}

# Post-origination columns whose missingness is a near-perfect target proxy
LEAKY_COLS = [
    "rate_of_interest",
    "Interest_rate_spread",
    "Upfront_charges",
    "property_value",
    "LTV",
]

ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"Raw data path : {RAW_DATA}")
print(f"Artifacts path: {ARTIFACTS_DIR}")
Raw data path : data/raw/Loan_Default.csv
Artifacts path: artifacts

3. Data Loading & Initial Cleaning

Initial cleaning renames the awkward co-applicant column, drops the ID column, and removes rows with a missing target. Imputation is not done here — it is handled exclusively by the sklearn pipeline to prevent train → test leakage.
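Why this matters in miniature: fitting the imputer on the training split only means test rows are filled with train statistics, never their own (toy data, not from the repo):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"income": [1000.0, np.nan, 3000.0]})
test = pd.DataFrame({"income": [np.nan, 500.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                  # statistics learned from the train split only
filled = imputer.transform(test)
print(filled[0, 0])                 # → 2000.0, the *train* median
```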

In [3]:
df_raw = pd.read_csv(RAW_DATA)
print(f"Raw shape: {df_raw.shape}")
df_raw.head()
Raw shape: (148670, 34)
ID year loan_limit Gender approv_in_adv loan_type loan_purpose Credit_Worthiness open_credit business_or_commercial ... credit_type Credit_Score co-applicant_credit_type age submission_of_application LTV Region Security_Type Status dtir1
0 24890 2019 cf Sex Not Available nopre type1 p1 l1 nopc nob/c ... EXP 758 CIB 25-34 to_inst 98.728814 south direct 1 45.0
1 24891 2019 cf Male nopre type2 p1 l1 nopc b/c ... EQUI 552 EXP 55-64 to_inst NaN North direct 1 NaN
2 24892 2019 cf Male pre type1 p1 l1 nopc nob/c ... EXP 834 CIB 35-44 to_inst 80.019685 south direct 0 46.0
3 24893 2019 cf Male nopre type1 p4 l1 nopc nob/c ... EXP 587 CIB 45-54 not_inst 69.376900 North direct 0 42.0
4 24894 2019 cf Joint pre type1 p1 l1 nopc nob/c ... CRIF 602 EXP 25-34 not_inst 91.886544 North direct 0 39.0

5 rows × 34 columns

In [4]:
def basic_clean(df, *, drop_leaky=False):
    """Clean without imputing. Optionally drop leaky columns."""
    df = df.copy()
    df = df.rename(columns=RENAME_MAP)
    for c in DROP_COLS:
        if c in df.columns:
            df = df.drop(columns=[c])
    if drop_leaky:
        leaky_present = [c for c in LEAKY_COLS if c in df.columns]
        df = df.drop(columns=leaky_present)
        logger.info("Dropped %d leaky columns: %s", len(leaky_present), leaky_present)
    df = df.dropna(subset=[TARGET_COLUMN])
    return df

# Clean without dropping leaky columns yet — we'll analyse them first
df_all = basic_clean(df_raw, drop_leaky=False)
print(f"Cleaned shape (all features): {df_all.shape}")
print(f"Target distribution:\n{df_all[TARGET_COLUMN].value_counts(normalize=True).round(4)}")
Cleaned shape (all features): (148670, 33)
Target distribution:
Status
0    0.7536
1    0.2464
Name: proportion, dtype: float64

4. Exploratory Data Analysis

In [5]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 4a – Target distribution
counts = df_all[TARGET_COLUMN].value_counts()
bars = axes[0].bar(
    ["Non-default (0)", "Default (1)"],
    counts.values,
    color=[PALETTE["lgb"], PALETTE["xgb"]],
    edgecolor="white", linewidth=1.2,
)
axes[0].set_title("Target Distribution")
axes[0].set_ylabel("Count")
for bar, val in zip(bars, counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height(),
                 f"{val:,}", ha="center", va="bottom", fontsize=10)

# 4b – Missing values (top 15)
miss = df_all.isna().sum().sort_values(ascending=False).head(15)
miss = miss[miss > 0]
if len(miss) > 0:
    colours = [PALETTE["red"] if idx in LEAKY_COLS else PALETTE["xgb"] for idx in miss.index]
    axes[1].barh(miss.index, miss.values, color=colours, edgecolor="white")
    axes[1].set_title("Missing Values (red = leaky)")
    axes[1].set_xlabel("Count")
    axes[1].invert_yaxis()
else:
    axes[1].text(0.5, 0.5, "No missing values", ha="center", va="center",
                 transform=axes[1].transAxes, fontsize=13)
    axes[1].set_title("Missing Values")

# 4c – Correlations of numeric features with target
numeric_cols = df_all.select_dtypes(include=["number"]).columns.drop(TARGET_COLUMN, errors="ignore")
sample_feats = [f for f in numeric_cols if f not in LEAKY_COLS][:8]
if sample_feats:
    corr = df_all[sample_feats + [TARGET_COLUMN]].corr()[TARGET_COLUMN].drop(TARGET_COLUMN).sort_values()
    axes[2].barh(corr.index, corr.values,
                 color=[PALETTE["xgb"] if v > 0 else PALETTE["lgb"] for v in corr.values],
                 edgecolor="white")
    axes[2].set_title("Correlation with Target (non-leaky)")
    axes[2].axvline(0, color="grey", linewidth=0.8)

plt.tight_layout()
plt.show()
In [6]:
num_summary = df_all.describe().T
num_summary["missing"] = df_all.isna().sum()
num_summary["missing_%"] = (df_all.isna().sum() / len(df_all) * 100).round(2)
num_summary[["count", "mean", "std", "min", "max", "missing", "missing_%"]].head(15)
count mean std min max missing missing_%
year 148670.0 2019.000000 0.000000 2019.000000 2.019000e+03 0 0.00
loan_amount 148670.0 331117.743997 183909.310127 16500.000000 3.576500e+06 0 0.00
rate_of_interest 112231.0 4.045476 0.561391 0.000000 8.000000e+00 36439 24.51
Interest_rate_spread 112031.0 0.441656 0.513043 -3.638000 3.357000e+00 36639 24.64
Upfront_charges 109028.0 3224.996127 3251.121510 0.000000 6.000000e+04 39642 26.66
term 148629.0 335.136582 58.409084 96.000000 3.600000e+02 41 0.03
property_value 133572.0 497893.465696 359935.315562 8000.000000 1.650800e+07 15098 10.16
income 139520.0 6957.338876 6496.586382 0.000000 5.785800e+05 9150 6.15
Credit_Score 148670.0 699.789103 115.875857 500.000000 9.000000e+02 0 0.00
LTV 133572.0 72.746457 39.967603 0.967478 7.831250e+03 15098 10.16
Status 148670.0 0.246445 0.430942 0.000000 1.000000e+00 0 0.00
dtir1 124549.0 37.732932 10.545435 5.000000 6.100000e+01 24121 16.22

5. Leakage Analysis

A critical step for any lending dataset: checking whether feature missingness patterns leak the target. Post-origination fields (e.g. interest rate, upfront charges) are often populated only for funded, performing loans — meaning their absence is itself a near-perfect predictor of default.

In [7]:
# Compute missingness rate by target class
miss_by_class = []

for c in df_all.columns:
    if c == TARGET_COLUMN:
        continue
    miss_rate = df_all[c].isna().mean()
    if miss_rate < 0.01:
        continue
    m0 = df_all.loc[df_all[TARGET_COLUMN] == 0, c].isna().mean()
    m1 = df_all.loc[df_all[TARGET_COLUMN] == 1, c].isna().mean()
    miss_by_class.append({"column": c, "miss_non_default": m0, "miss_default": m1})

leak_df = pd.DataFrame(miss_by_class).sort_values("miss_default", ascending=False)
leak_df["leaky"] = leak_df["column"].isin(LEAKY_COLS)

# Visualise
fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(leak_df))
w = 0.35

bars0 = ax.bar(x - w/2, leak_df["miss_non_default"], w, label="Non-default (0)",
               color=PALETTE["lgb"], edgecolor="white")
bars1 = ax.bar(x + w/2, leak_df["miss_default"], w, label="Default (1)",
               color=PALETTE["xgb"], edgecolor="white")

# Highlight leaky columns with a red outline on both bars
for i, row in enumerate(leak_df.itertuples()):
    if row.leaky:
        for bar in (bars0[i], bars1[i]):
            bar.set_edgecolor(PALETTE["red"])
            bar.set_linewidth(2)

ax.set_xticks(x)
ax.set_xticklabels(leak_df["column"], rotation=45, ha="right")
ax.set_ylabel("Missingness Rate")
ax.set_title("Missingness by Target Class (leaky columns highlighted)")
ax.legend()
ax.set_ylim(0, 1.1)

for i, row in enumerate(leak_df.itertuples()):
    if row.miss_default > 0.5:
        ax.annotate("⚠️", (i + w/2, row.miss_default + 0.02), ha="center", fontsize=12)

plt.tight_layout()
plt.show()

print("\nMissingness rates by class:")
print(leak_df.to_string(index=False, float_format="{:.1%}".format))
Missingness rates by class:
              column  miss_non_default  miss_default  leaky
Interest_rate_spread              0.0%        100.0%   True
     Upfront_charges              2.8%         99.6%   True
    rate_of_interest              0.0%         99.5%   True
               dtir1              7.0%         44.5%  False
      property_value              0.0%         41.2%   True
                 LTV              0.0%         41.2%   True
              income              7.1%          3.4%  False
          loan_limit              2.2%          2.4%  False
In [8]:
# Demonstrate the impact: train WITH leaky columns
X_demo = df_all.drop(columns=[TARGET_COLUMN])
y_demo = df_all[TARGET_COLUMN].astype(int)
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE, stratify=y_demo
)

# LightGBM handles NaN natively; cast object/string columns to category so it can split on them directly
for c in X_tr_d.select_dtypes(include=["object", "str"]).columns:
    X_tr_d[c] = X_tr_d[c].astype("category")
    X_te_d[c] = X_te_d[c].astype("category")

clf_leak = LGBMClassifier(n_estimators=100, random_state=RANDOM_STATE, verbose=-1)
clf_leak.fit(X_tr_d, y_tr_d)
auc_leak = roc_auc_score(y_te_d, clf_leak.predict_proba(X_te_d)[:, 1])

# Train WITHOUT leaky columns
X_clean = df_all.drop(columns=[TARGET_COLUMN] + LEAKY_COLS)
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(
    X_clean, y_demo, test_size=0.2, random_state=RANDOM_STATE, stratify=y_demo
)

for c in X_tr_c.select_dtypes(include=["object", "str"]).columns:
    X_tr_c[c] = X_tr_c[c].astype("category")
    X_te_c[c] = X_te_c[c].astype("category")

clf_clean = LGBMClassifier(n_estimators=100, random_state=RANDOM_STATE, verbose=-1)
clf_clean.fit(X_tr_c, y_tr_c)
auc_clean = roc_auc_score(y_te_c, clf_clean.predict_proba(X_te_c)[:, 1])

print(f"ROC-AUC WITH leaky columns:    {auc_leak:.4f}  ← missingness = free answer")
print(f"ROC-AUC WITHOUT leaky columns: {auc_clean:.4f}  ← genuine signal only")
print(f"\nDropping {len(LEAKY_COLS)} leaky columns for all subsequent analysis.")
ROC-AUC WITH leaky columns:    1.0000  ← missingness = free answer
ROC-AUC WITHOUT leaky columns: 0.8834  ← genuine signal only

Dropping 5 leaky columns for all subsequent analysis.

6. Train / Test Split (Clean Data)

With the leaky columns removed, we create a stratified 80/20 split.

In [9]:
df = basic_clean(df_raw, drop_leaky=True)
print(f"Clean shape: {df.shape}")

X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y,
)

# Save to disk for API reproducibility
train_out = X_train.copy(); train_out[TARGET_COLUMN] = y_train
test_out = X_test.copy();   test_out[TARGET_COLUMN] = y_test
train_out.to_csv(PROCESSED_TRAIN, index=False)
test_out.to_csv(PROCESSED_TEST, index=False)

print(f"Train: {X_train.shape[0]:,} rows | Test: {X_test.shape[0]:,} rows")
print(f"Train default rate: {y_train.mean():.4f}")
print(f"Test  default rate: {y_test.mean():.4f}")
INFO | Dropped 5 leaky columns: ['rate_of_interest', 'Interest_rate_spread', 'Upfront_charges', 'property_value', 'LTV']
Clean shape: (148670, 28)
Train: 118,936 rows | Test: 29,734 rows
Train default rate: 0.2464
Test  default rate: 0.2465

7. Pipeline Construction

All preprocessing lives inside .fit() / .transform(), ensuring imputation statistics are learned only from training data.

```
ColumnTransformer
├── num → SimpleImputer(median)
└── cat → SimpleImputer(mode) → OneHotEncoder
          ↓
CalibratedClassifierCV(LGBMClassifier / XGBClassifier)
```

In [10]:
from src.model import build_pipeline, split_feature_types

numeric, categorical = split_feature_types(X_train)
print(f"Numeric features:     {len(numeric)}")
print(f"Categorical features: {len(categorical)}")
Numeric features:     6
Categorical features: 21

8. Training (Both Backends)

In [11]:
results = {}

for backend in ["lightgbm", "xgboost"]:
    logger.info("Training %s...", backend)
    t0 = time.time()

    pipe = build_pipeline(numeric, categorical, backend=backend, y=y_train, calibration_cv=5)
    pipe.fit(X_train, y_train)

    elapsed = time.time() - t0
    proba = pipe.predict_proba(X_test)[:, 1]
    pred = (proba >= DEFAULT_THRESHOLD).astype(int)

    metrics = {
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
        "brier": brier_score_loss(y_test, proba),
        "f1": f1_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "train_time_s": round(elapsed, 1),
    }

    results[backend] = {"pipeline": pipe, "proba": proba, "pred": pred, "metrics": metrics}
    logger.info("%s done in %.1fs  |  ROC-AUC=%.4f  PR-AUC=%.4f  Brier=%.4f",
                backend, elapsed, metrics["roc_auc"], metrics["pr_auc"], metrics["brier"])

print()
INFO | Training lightgbm...
INFO | lightgbm done in 15.6s  |  ROC-AUC=0.8827  PR-AUC=0.8229  Brier=0.0913
INFO | Training xgboost...
INFO | xgboost done in 24.8s  |  ROC-AUC=0.8821  PR-AUC=0.8229  Brier=0.0913
In [12]:
metrics_df = pd.DataFrame({k: v["metrics"] for k, v in results.items()}).T
metrics_df.index.name = "backend"
metrics_df.style.format("{:.4f}").background_gradient(cmap="Blues", axis=0)
| backend  | roc_auc | pr_auc | brier  | f1     | precision | recall | train_time_s |
|----------|---------|--------|--------|--------|-----------|--------|--------------|
| lightgbm | 0.8827  | 0.8229 | 0.0913 | 0.7311 | 0.8607    | 0.6355 | 15.6         |
| xgboost  | 0.8821  | 0.8229 | 0.0913 | 0.7308 | 0.8593    | 0.6358 | 24.8         |

9. Evaluation

9a. ROC & Precision–Recall Curves

In [13]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5.5))
label_map = {"lightgbm": "LightGBM", "xgboost": "XGBoost"}
color_map = {"lightgbm": PALETTE["lgb"], "xgboost": PALETTE["xgb"]}

for backend, res in results.items():
    fpr, tpr, _ = roc_curve(y_test, res["proba"])
    auc = res["metrics"]["roc_auc"]
    axes[0].plot(fpr, tpr, label=f'{label_map[backend]}  AUC={auc:.4f}',
                 color=color_map[backend], linewidth=2)
axes[0].plot([0, 1], [0, 1], "--", color=PALETTE["ref"], linewidth=1)
axes[0].set(xlabel="False Positive Rate", ylabel="True Positive Rate", title="ROC Curve")
axes[0].legend(loc="lower right")

for backend, res in results.items():
    prec, rec, _ = precision_recall_curve(y_test, res["proba"])
    ap = res["metrics"]["pr_auc"]
    axes[1].plot(rec, prec, label=f'{label_map[backend]}  AP={ap:.4f}',
                 color=color_map[backend], linewidth=2)
baseline = y_test.mean()
axes[1].axhline(baseline, linestyle="--", color=PALETTE["ref"], linewidth=1, label=f"Baseline={baseline:.3f}")
axes[1].set(xlabel="Recall", ylabel="Precision", title="Precision–Recall Curve")
axes[1].legend(loc="upper right")

plt.tight_layout()
plt.show()

9b. Calibration Curves

In [14]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5.5))

for i, (backend, res) in enumerate(results.items()):
    fraction_pos, mean_pred = calibration_curve(y_test, res["proba"], n_bins=10, strategy="uniform")
    axes[i].plot(mean_pred, fraction_pos, "o-", color=color_map[backend], linewidth=2,
                 markersize=6, label=label_map[backend])
    axes[i].plot([0, 1], [0, 1], "--", color=PALETTE["ref"], linewidth=1, label="Perfectly calibrated")
    axes[i].set(xlabel="Mean predicted probability", ylabel="Observed frequency",
                title=f"Calibration – {label_map[backend]}  (Brier={res['metrics']['brier']:.4f})")
    axes[i].legend(loc="upper left")

plt.tight_layout()
plt.show()

9c. Confusion Matrices

In [15]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for i, (backend, res) in enumerate(results.items()):
    cm = confusion_matrix(y_test, res["pred"])
    im = axes[i].imshow(cm, cmap="Blues", aspect="equal")

    for row in range(2):
        for col in range(2):
            val = cm[row, col]
            colour = "white" if val > cm.max() / 2 else "black"
            axes[i].text(col, row, f"{val:,}", ha="center", va="center",
                         fontsize=14, fontweight="bold", color=colour)

    axes[i].set(xticks=[0, 1], yticks=[0, 1],
                xticklabels=["Non-default", "Default"],
                yticklabels=["Non-default", "Default"],
                xlabel="Predicted", ylabel="Actual",
                title=f"Confusion Matrix – {label_map[backend]}")

plt.tight_layout()
plt.show()

9d. Probability Distribution by Class

In [16]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for i, (backend, res) in enumerate(results.items()):
    p0 = res["proba"][y_test == 0]
    p1 = res["proba"][y_test == 1]
    axes[i].hist(p0, bins=50, alpha=0.6, color=PALETTE["lgb"], label="Non-default", density=True, edgecolor="white")
    axes[i].hist(p1, bins=50, alpha=0.6, color=PALETTE["xgb"], label="Default", density=True, edgecolor="white")
    axes[i].axvline(DEFAULT_THRESHOLD, color="red", linestyle="--", linewidth=1.5, label=f"Threshold={DEFAULT_THRESHOLD}")
    axes[i].set(xlabel="Predicted P(default)", ylabel="Density",
                title=f"Score Distribution – {label_map[backend]}")
    axes[i].legend()

plt.tight_layout()
plt.show()

9e. Feature Importance (Top 20)

In [17]:
fig, axes = plt.subplots(1, 2, figsize=(14, 7))

for i, (backend, res) in enumerate(results.items()):
    pipe = res["pipeline"]
    pre = pipe.named_steps["pre"]
    clf = pipe.named_steps["clf"]

    feature_names = [n.split("__", 1)[-1] for n in pre.get_feature_names_out()]

    importances = np.mean([
        cc.estimator.feature_importances_
        for cc in clf.calibrated_classifiers_
    ], axis=0)

    imp_df = pd.Series(importances, index=feature_names).sort_values(ascending=True).tail(20)
    axes[i].barh(imp_df.index, imp_df.values, color=color_map[backend], edgecolor="white")
    axes[i].set_title(f"Feature Importance – {label_map[backend]}")
    axes[i].set_xlabel("Importance")

plt.tight_layout()
plt.show()

10. Threshold Analysis

In production lending, the decision threshold is a business lever — not just 0.5.

This section explores the F1 / precision / recall trade-off across thresholds.

In [18]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5.5))

for i, (backend, res) in enumerate(results.items()):
    thresholds = np.linspace(0.01, 0.99, 200)
    f1s, precs, recs = [], [], []

    for t in thresholds:
        p = (res["proba"] >= t).astype(int)
        f1s.append(f1_score(y_test, p, zero_division=0))
        precs.append(precision_score(y_test, p, zero_division=0))
        recs.append(recall_score(y_test, p, zero_division=0))

    axes[i].plot(thresholds, f1s, label="F1", color=PALETTE["cal"], linewidth=2)
    axes[i].plot(thresholds, precs, label="Precision", color=PALETTE["lgb"], linewidth=1.5, linestyle="--")
    axes[i].plot(thresholds, recs, label="Recall", color=PALETTE["xgb"], linewidth=1.5, linestyle="--")

    best_t = thresholds[np.argmax(f1s)]
    axes[i].axvline(best_t, color="red", linestyle=":", linewidth=1, label=f"Best F1 @ {best_t:.2f}")
    axes[i].set(xlabel="Threshold", ylabel="Score", title=f"Threshold Sweep – {label_map[backend]}")
    axes[i].legend(loc="center left")

plt.tight_layout()
plt.show()
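The F1-optimal threshold above is only one way to set the lever. In lending, asymmetric misclassification costs usually dominate; a sketch with illustrative, made-up unit costs (a missed default assumed 5x as costly as a wrongly declined loan):

```python
import numpy as np


def min_cost_threshold(y_true, proba, *, cost_fn=5.0, cost_fp=1.0):
    """Return the threshold minimising total expected misclassification cost.

    cost_fn / cost_fp are hypothetical business costs, not repo values.
    """
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = [
        cost_fn * np.sum((proba < t) & (y_true == 1))     # missed defaults
        + cost_fp * np.sum((proba >= t) & (y_true == 0))  # wrongly declined
        for t in thresholds
    ]
    return thresholds[int(np.argmin(costs))]
```

Raising cost_fn pushes the chosen threshold down, flagging more applications as risky.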

11. Save Model & Metadata

In [19]:
best_backend = max(results, key=lambda b: results[b]["metrics"]["pr_auc"])
best = results[best_backend]

joblib.dump(best["pipeline"], PIPELINE_PATH)

metadata = {
    "trained_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "model_type": f"Calibrated({best_backend})",
    "backend": best_backend,
    "features": X_train.columns.tolist(),
    "numeric": numeric,
    "categorical": categorical,
    "threshold": DEFAULT_THRESHOLD,
    **{k: round(v, 6) for k, v in best["metrics"].items()},
}

with open(METADATA_PATH, "w") as f:
    json.dump(metadata, f, indent=2)

print(f"Saved pipeline  → {PIPELINE_PATH}")
print(f"Saved metadata  → {METADATA_PATH}")
print(f"Best backend    → {best_backend}")
print(f"\nMetadata:\n{json.dumps(metadata, indent=2)}")
Saved pipeline  → artifacts/model.joblib
Saved metadata  → artifacts/metadata.json
Best backend    → xgboost

Metadata:
{
  "trained_at": "2026-03-24T12:38:53Z",
  "model_type": "Calibrated(xgboost)",
  "backend": "xgboost",
  "features": [
    "year",
    "loan_limit",
    "Gender",
    "approv_in_adv",
    "loan_type",
    "loan_purpose",
    "Credit_Worthiness",
    "open_credit",
    "business_or_commercial",
    "loan_amount",
    "term",
    "Neg_ammortization",
    "interest_only",
    "lump_sum_payment",
    "construction_type",
    "occupancy_type",
    "Secured_by",
    "total_units",
    "income",
    "credit_type",
    "Credit_Score",
    "co_applicant_credit_type",
    "age",
    "submission_of_application",
    "Region",
    "Security_Type",
    "dtir1"
  ],
  "numeric": [
    "year",
    "loan_amount",
    "term",
    "income",
    "Credit_Score",
    "dtir1"
  ],
  "categorical": [
    "loan_limit",
    "Gender",
    "approv_in_adv",
    "loan_type",
    "loan_purpose",
    "Credit_Worthiness",
    "open_credit",
    "business_or_commercial",
    "Neg_ammortization",
    "interest_only",
    "lump_sum_payment",
    "construction_type",
    "occupancy_type",
    "Secured_by",
    "total_units",
    "credit_type",
    "co_applicant_credit_type",
    "age",
    "submission_of_application",
    "Region",
    "Security_Type"
  ],
  "threshold": 0.5,
  "roc_auc": 0.882076,
  "pr_auc": 0.822938,
  "brier": 0.091336,
  "f1": 0.730824,
  "precision": 0.859277,
  "recall": 0.635781,
  "train_time_s": 24.8
}

12. API Smoke Test

Quick verification that the saved model loads and scores correctly via the same code path the FastAPI app uses.

In [20]:
loaded_pipe = joblib.load(PIPELINE_PATH)
with open(METADATA_PATH) as f:
    loaded_meta = json.load(f)

sample = {
    "year": 2017, "loan_limit": "cf", "Gender": "Male", "approv_in_adv": "nopre",
    "loan_type": "type1", "loan_purpose": "p1", "Credit_Worthiness": "l1",
    "open_credit": "yes", "business_or_commercial": "no", "loan_amount": 150000.0,
    "term": 180.0, "Neg_ammortization": "no", "interest_only": "no",
    "lump_sum_payment": "no", "construction_type": "existing",
    "occupancy_type": "owner", "Secured_by": "home", "total_units": "1",
    "income": 6500.0, "credit_type": "CR1", "Credit_Score": 720.0,
    "co_applicant_credit_type": "Na", "age": "25-34",
    "submission_of_application": "to_inst", "Region": "south",
    "Security_Type": "type1", "dtir1": 28.0,
}

features = loaded_meta["features"]
X_single = pd.DataFrame([[sample[c] for c in features]], columns=features)
p1 = loaded_pipe.predict_proba(X_single)[0, 1]
threshold = loaded_meta["threshold"]

response = {
    "default_probability": round(float(p1), 6),
    "prediction": int(p1 >= threshold),
    "threshold": threshold,
    "model_info": {
        "model_type": loaded_meta["model_type"],
        "backend": loaded_meta["backend"],
        "trained_at": loaded_meta["trained_at"],
    },
}

print("POST /score response (simulated):")
print(json.dumps(response, indent=2))
POST /score response (simulated):
{
  "default_probability": 0.467872,
  "prediction": 0,
  "threshold": 0.5,
  "model_info": {
    "model_type": "Calibrated(xgboost)",
    "backend": "xgboost",
    "trained_at": "2026-03-24T12:38:53Z"
  }
}

13. Next Steps

> 📋 [Back to README](https://github.com/agbajames/consumer-lending-risk-api)