Consumer Lending Risk – End-to-End Pipeline

Production-style credit default prediction API demonstrating robust ML engineering patterns for credit risk modelling. Built with LightGBM/XGBoost, FastAPI, and comprehensive automated testing.

| Component   | Detail                                             |
|-------------|----------------------------------------------------|
| Backends    | LightGBM / XGBoost (configurable)                  |
| Calibration | CalibratedClassifierCV (Platt scaling)             |
| Imbalance   | class_weight="balanced" / scale_pos_weight         |
| Serving     | FastAPI with Pydantic feature-contract enforcement |
| Testing     | 10 pytest tests (unit, integration, API)           |
| Deployment  | Docker, GitHub Actions CI                          |

> Dataset: Kaggle Loan Default Dataset — place Loan_Default.csv in data/raw/.

Dataset context: The Loan Default dataset contains 148,670 mortgage applications with 34 features covering borrower demographics (age, gender, income), loan characteristics (amount, term, type), credit profile (credit score, credit type), and property details. The prediction target is binary loan default. With a ~25% default rate, this is a moderately imbalanced classification problem typical of real-world consumer lending portfolios.
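The ~25% default rate feeds directly into the imbalance settings listed in the component table. A minimal sketch of deriving both backend-specific parameters from the target vector (an illustrative helper, not part of the repo):

```python
import numpy as np


def imbalance_params(y: np.ndarray) -> dict:
    """Derive backend-specific imbalance settings from a binary target."""
    pos = int(y.sum())
    neg = int(len(y) - pos)
    return {
        "scale_pos_weight": neg / pos,  # XGBoost: up-weight the positive class
        "class_weight": "balanced",     # LightGBM / sklearn-style estimators
    }


y = np.array([0] * 75 + [1] * 25)  # ~25% default rate, as in this dataset
print(imbalance_params(y)["scale_pos_weight"])  # → 3.0
```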

📋 [Back to README](https://github.com/agbajames/consumer-lending-risk-api)

1. Setup & Configuration

In [1]:
# file: pipeline/setup.py (or first notebook cell)

from __future__ import annotations

import warnings
warnings.filterwarnings(
    "ignore",
    message=".*str.*dtypes are inc.*",
    category=UserWarning,
)
warnings.filterwarnings(
    "ignore",
    message=".*X does not have valid feature names.*",
    category=UserWarning,
    module="sklearn",
)

import json
import importlib.util
import logging
import subprocess
import sys
import time
from pathlib import Path

AUTO_INSTALL = True  # set False if you never want runtime installs

_REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "sklearn": "scikit-learn",
    "lightgbm": "lightgbm",
    "xgboost": "xgboost",
    "jinja2": "jinja2", 
    "joblib": "joblib",
    "isort": "isort",
}


def _missing_packages(required: dict[str, str]) -> list[str]:
    missing: list[str] = []
    for import_name, pip_name in required.items():
        if importlib.util.find_spec(import_name) is None:
            missing.append(pip_name)
    return missing


def ensure_dependencies(auto_install: bool = True) -> None:
    """
    Ensures this kernel/interpreter has required packages.
    Installs missing deps using the same interpreter as the current kernel.
    """
    missing = _missing_packages(_REQUIRED)
    if not missing:
        return

    msg = (
        "Missing packages for this kernel/interpreter:\n"
        f"  - {', '.join(missing)}\n"
        f"Kernel Python: {sys.executable}\n"
    )
    if not auto_install:
        raise ModuleNotFoundError(
            msg
            + "Install them into this interpreter, then restart the kernel."
        )

    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-U", *missing],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"pip install failed for {missing}:\n{result.stderr}")


ensure_dependencies(AUTO_INSTALL)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    brier_score_loss,
    f1_score,
    precision_score,
    recall_score,
    roc_curve,
    precision_recall_curve,
    confusion_matrix,
)
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
import joblib

logging.basicConfig(level=logging.INFO, format="%(levelname)s | %(message)s")
logger = logging.getLogger(__name__)

plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "white",
        "axes.grid": True,
        "grid.alpha": 0.3,
        "axes.spines.top": False,
        "axes.spines.right": False,
        "font.size": 11,
    }
)

PALETTE = {
    "lgb": "#2196F3",
    "xgb": "#FF9800",
    "cal": "#4CAF50",
    "ref": "#9E9E9E",
    "red": "#E53935",
}

print(
    "Setup complete ✓",
    f"(python={sys.version.split()[0]}, exe={Path(sys.executable).name}, numpy={np.__version__}, pandas={pd.__version__})",
)
Setup complete ✓ (python=3.11.13, exe=python3.11, numpy=2.4.3, pandas=3.0.1)

2. Configuration

In [2]:
# ── Paths ──
BASE_DIR = Path(".")
DATA_DIR = BASE_DIR / "data"
RAW_DATA = DATA_DIR / "raw" / "Loan_Default.csv"
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_TRAIN = PROCESSED_DIR / "train.csv"
PROCESSED_TEST = PROCESSED_DIR / "test.csv"
ARTIFACTS_DIR = BASE_DIR / "artifacts"
PIPELINE_PATH = ARTIFACTS_DIR / "model.joblib"
METADATA_PATH = ARTIFACTS_DIR / "metadata.json"

# ── Modelling ──
TARGET_COLUMN = "Status"
TEST_SIZE = 0.2
RANDOM_STATE = 42
DEFAULT_THRESHOLD = 0.5

# ── Cleaning ──
DROP_COLS = ["ID"]
RENAME_MAP = {"co-applicant_credit_type": "co_applicant_credit_type"}

# Post-origination columns whose missingness is a near-perfect target proxy
LEAKY_COLS = [
    "rate_of_interest",
    "Interest_rate_spread",
    "Upfront_charges",
    "property_value",
    "LTV",
]

ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"Raw data path : {RAW_DATA}")
print(f"Artifacts path: {ARTIFACTS_DIR}")
Raw data path : data/raw/Loan_Default.csv
Artifacts path: artifacts

3. Data Loading & Initial Cleaning

Initial cleaning renames the awkward co-applicant column, drops the ID column, and removes rows with a missing target. Imputation is not done here — it is handled exclusively by the sklearn pipeline to prevent train → test leakage.
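Why this matters in miniature: fitting the imputer on the training split only means test rows are filled with train statistics, never their own (toy data, not from the repo):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"income": [1000.0, np.nan, 3000.0]})
test = pd.DataFrame({"income": [np.nan, 500.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                  # statistics learned from the train split only
filled = imputer.transform(test)
print(filled[0, 0])                 # → 2000.0, the *train* median
```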

In [3]:
df_raw = pd.read_csv(RAW_DATA)
print(f"Raw shape: {df_raw.shape}")
df_raw.head()
Raw shape: (148670, 34)
ID year loan_limit Gender approv_in_adv loan_type loan_purpose Credit_Worthiness open_credit business_or_commercial ... credit_type Credit_Score co-applicant_credit_type age submission_of_application LTV Region Security_Type Status dtir1
0 24890 2019 cf Sex Not Available nopre type1 p1 l1 nopc nob/c ... EXP 758 CIB 25-34 to_inst 98.728814 south direct 1 45.0
1 24891 2019 cf Male nopre type2 p1 l1 nopc b/c ... EQUI 552 EXP 55-64 to_inst NaN North direct 1 NaN
2 24892 2019 cf Male pre type1 p1 l1 nopc nob/c ... EXP 834 CIB 35-44 to_inst 80.019685 south direct 0 46.0
3 24893 2019 cf Male nopre type1 p4 l1 nopc nob/c ... EXP 587 CIB 45-54 not_inst 69.376900 North direct 0 42.0
4 24894 2019 cf Joint pre type1 p1 l1 nopc nob/c ... CRIF 602 EXP 25-34 not_inst 91.886544 North direct 0 39.0

5 rows × 34 columns

In [4]:
def basic_clean(df, *, drop_leaky=False):
    """Clean without imputing. Optionally drop leaky columns."""
    df = df.copy()
    df = df.rename(columns=RENAME_MAP)
    for c in DROP_COLS:
        if c in df.columns:
            df = df.drop(columns=[c])
    if drop_leaky:
        leaky_present = [c for c in LEAKY_COLS if c in df.columns]
        df = df.drop(columns=leaky_present)
        logger.info("Dropped %d leaky columns: %s", len(leaky_present), leaky_present)
    df = df.dropna(subset=[TARGET_COLUMN])
    return df

# Clean without dropping leaky columns yet — we'll analyse them first
df_all = basic_clean(df_raw, drop_leaky=False)
print(f"Cleaned shape (all features): {df_all.shape}")
print(f"Target distribution:\n{df_all[TARGET_COLUMN].value_counts(normalize=True).round(4)}")
Cleaned shape (all features): (148670, 33)
Target distribution:
Status
0    0.7536
1    0.2464
Name: proportion, dtype: float64

4. Exploratory Data Analysis

In [5]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 4a – Target distribution
counts = df_all[TARGET_COLUMN].value_counts()
bars = axes[0].bar(
    ["Non-default (0)", "Default (1)"],
    counts.values,
    color=[PALETTE["lgb"], PALETTE["xgb"]],
    edgecolor="white", linewidth=1.2,
)
axes[0].set_title("Target Distribution")
axes[0].set_ylabel("Count")
for bar, val in zip(bars, counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height(),
                 f"{val:,}", ha="center", va="bottom", fontsize=10)

# 4b – Missing values (top 15)
miss = df_all.isna().sum().sort_values(ascending=False).head(15)
miss = miss[miss > 0]
if len(miss) > 0:
    colours = [PALETTE["red"] if idx in LEAKY_COLS else PALETTE["xgb"] for idx in miss.index]
    axes[1].barh(miss.index, miss.values, color=colours, edgecolor="white")
    axes[1].set_title("Missing Values (red = leaky)")
    axes[1].set_xlabel("Count")
    axes[1].invert_yaxis()
else:
    axes[1].text(0.5, 0.5, "No missing values", ha="center", va="center",
                 transform=axes[1].transAxes, fontsize=13)
    axes[1].set_title("Missing Values")

# 4c – Correlations of numeric features with target
numeric_cols = df_all.select_dtypes(include=["number"]).columns.drop(TARGET_COLUMN, errors="ignore")
sample_feats = [f for f in numeric_cols if f not in LEAKY_COLS][:8]
if sample_feats:
    corr = df_all[sample_feats + [TARGET_COLUMN]].corr()[TARGET_COLUMN].drop(TARGET_COLUMN).sort_values()
    axes[2].barh(corr.index, corr.values,
                 color=[PALETTE["xgb"] if v > 0 else PALETTE["lgb"] for v in corr.values],
                 edgecolor="white")
    axes[2].set_title("Correlation with Target (non-leaky)")
    axes[2].axvline(0, color="grey", linewidth=0.8)

plt.tight_layout()
plt.show()
In [6]:
num_summary = df_all.describe().T
num_summary["missing"] = df_all.isna().sum()
num_summary["missing_%"] = (df_all.isna().sum() / len(df_all) * 100).round(2)
num_summary[["count", "mean", "std", "min", "max", "missing", "missing_%"]].head(15)
count mean std min max missing missing_%
year 148670.0 2019.000000 0.000000 2019.000000 2.019000e+03 0 0.00
loan_amount 148670.0 331117.743997 183909.310127 16500.000000 3.576500e+06 0 0.00
rate_of_interest 112231.0 4.045476 0.561391 0.000000 8.000000e+00 36439 24.51
Interest_rate_spread 112031.0 0.441656 0.513043 -3.638000 3.357000e+00 36639 24.64
Upfront_charges 109028.0 3224.996127 3251.121510 0.000000 6.000000e+04 39642 26.66
term 148629.0 335.136582 58.409084 96.000000 3.600000e+02 41 0.03
property_value 133572.0 497893.465696 359935.315562 8000.000000 1.650800e+07 15098 10.16
income 139520.0 6957.338876 6496.586382 0.000000 5.785800e+05 9150 6.15
Credit_Score 148670.0 699.789103 115.875857 500.000000 9.000000e+02 0 0.00
LTV 133572.0 72.746457 39.967603 0.967478 7.831250e+03 15098 10.16
Status 148670.0 0.246445 0.430942 0.000000 1.000000e+00 0 0.00
dtir1 124549.0 37.732932 10.545435 5.000000 6.100000e+01 24121 16.22

5. Leakage Analysis

A critical step for any lending dataset: checking whether feature missingness patterns leak the target. Post-origination fields (e.g. interest rate, upfront charges) are often populated only for funded, performing loans — meaning their absence is itself a near-perfect predictor of default.

In [7]:
# Compute missingness rate by target class
miss_by_class = []

for c in df_all.columns:
    if c == TARGET_COLUMN:
        continue
    miss_rate = df_all[c].isna().mean()
    if miss_rate < 0.01:
        continue
    m0 = df_all.loc[df_all[TARGET_COLUMN] == 0, c].isna().mean()
    m1 = df_all.loc[df_all[TARGET_COLUMN] == 1, c].isna().mean()
    miss_by_class.append({"column": c, "miss_non_default": m0, "miss_default": m1})

leak_df = pd.DataFrame(miss_by_class).sort_values("miss_default", ascending=False)
leak_df["leaky"] = leak_df["column"].isin(LEAKY_COLS)

# Visualise
fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(leak_df))
w = 0.35

bars0 = ax.bar(x - w/2, leak_df["miss_non_default"], w, label="Non-default (0)",
               color=PALETTE["lgb"], edgecolor="white")
bars1 = ax.bar(x + w/2, leak_df["miss_default"], w, label="Default (1)",
               color=PALETTE["xgb"], edgecolor="white")

# Highlight leaky columns with a red outline on both bars
for i, row in enumerate(leak_df.itertuples()):
    if row.leaky:
        for bar in (bars0[i], bars1[i]):
            bar.set_edgecolor(PALETTE["red"])
            bar.set_linewidth(2)

ax.set_xticks(x)
ax.set_xticklabels(leak_df["column"], rotation=45, ha="right")
ax.set_ylabel("Missingness Rate")
ax.set_title("Missingness by Target Class (leaky columns highlighted)")
ax.legend()
ax.set_ylim(0, 1.1)

for i, row in enumerate(leak_df.itertuples()):
    if row.miss_default > 0.5:
        ax.annotate("⚠️", (i + w/2, row.miss_default + 0.02), ha="center", fontsize=12)

plt.tight_layout()
plt.show()

print("\nMissingness rates by class:")
print(leak_df.to_string(index=False, float_format="{:.1%}".format))
Missingness rates by class:
              column  miss_non_default  miss_default  leaky
Interest_rate_spread              0.0%        100.0%   True
     Upfront_charges              2.8%         99.6%   True
    rate_of_interest              0.0%         99.5%   True
               dtir1              7.0%         44.5%  False
      property_value              0.0%         41.2%   True
                 LTV              0.0%         41.2%   True
              income              7.1%          3.4%  False
          loan_limit              2.2%          2.4%  False
In [8]:
# Demonstrate the impact: train WITH leaky columns
X_demo = df_all.drop(columns=[TARGET_COLUMN])
y_demo = df_all[TARGET_COLUMN].astype(int)
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE, stratify=y_demo
)

# LightGBM handles NaN natively; cast object/string columns to category so it can split on them directly
for c in X_tr_d.select_dtypes(include=["object", "str"]).columns:
    X_tr_d[c] = X_tr_d[c].astype("category")
    X_te_d[c] = X_te_d[c].astype("category")

clf_leak = LGBMClassifier(n_estimators=100, random_state=RANDOM_STATE, verbose=-1)
clf_leak.fit(X_tr_d, y_tr_d)
auc_leak = roc_auc_score(y_te_d, clf_leak.predict_proba(X_te_d)[:, 1])

# Train WITHOUT leaky columns
X_clean = df_all.drop(columns=[TARGET_COLUMN] + LEAKY_COLS)
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(
    X_clean, y_demo, test_size=0.2, random_state=RANDOM_STATE, stratify=y_demo
)

for c in X_tr_c.select_dtypes(include=["object", "str"]).columns:
    X_tr_c[c] = X_tr_c[c].astype("category")
    X_te_c[c] = X_te_c[c].astype("category")

clf_clean = LGBMClassifier(n_estimators=100, random_state=RANDOM_STATE, verbose=-1)
clf_clean.fit(X_tr_c, y_tr_c)
auc_clean = roc_auc_score(y_te_c, clf_clean.predict_proba(X_te_c)[:, 1])

print(f"ROC-AUC WITH leaky columns:    {auc_leak:.4f}  ← missingness = free answer")
print(f"ROC-AUC WITHOUT leaky columns: {auc_clean:.4f}  ← genuine signal only")
print(f"\nDropping {len(LEAKY_COLS)} leaky columns for all subsequent analysis.")
ROC-AUC WITH leaky columns:    1.0000  ← missingness = free answer
ROC-AUC WITHOUT leaky columns: 0.8834  ← genuine signal only

Dropping 5 leaky columns for all subsequent analysis.

6. Train / Test Split (Clean Data)

With the leaky columns removed, we create a stratified 80/20 split.

In [9]:
df = basic_clean(df_raw, drop_leaky=True)
print(f"Clean shape: {df.shape}")

X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y,
)

# Save to disk for API reproducibility
train_out = X_train.copy(); train_out[TARGET_COLUMN] = y_train
test_out = X_test.copy();   test_out[TARGET_COLUMN] = y_test
train_out.to_csv(PROCESSED_TRAIN, index=False)
test_out.to_csv(PROCESSED_TEST, index=False)

print(f"Train: {X_train.shape[0]:,} rows | Test: {X_test.shape[0]:,} rows")
print(f"Train default rate: {y_train.mean():.4f}")
print(f"Test  default rate: {y_test.mean():.4f}")
INFO | Dropped 5 leaky columns: ['rate_of_interest', 'Interest_rate_spread', 'Upfront_charges', 'property_value', 'LTV']
Clean shape: (148670, 28)
Train: 118,936 rows | Test: 29,734 rows
Train default rate: 0.2464
Test  default rate: 0.2465

7. Pipeline Construction

All preprocessing lives inside .fit() / .transform(), ensuring imputation statistics are learned only from training data.

```
ColumnTransformer
├── num → SimpleImputer(median)
└── cat → SimpleImputer(mode) → OneHotEncoder
          ↓
CalibratedClassifierCV(LGBMClassifier / XGBClassifier)
```

In [10]:
from src.model import build_pipeline, split_feature_types

numeric, categorical = split_feature_types(X_train)
print(f"Numeric features:     {len(numeric)}")
print(f"Categorical features: {len(categorical)}")
Numeric features:     6
Categorical features: 21

8. Training (Both Backends)

In [11]:
results = {}

for backend in ["lightgbm", "xgboost"]:
    logger.info("Training %s...", backend)
    t0 = time.time()

    pipe = build_pipeline(numeric, categorical, backend=backend, y=y_train, calibration_cv=5)
    pipe.fit(X_train, y_train)

    elapsed = time.time() - t0
    proba = pipe.predict_proba(X_test)[:, 1]
    pred = (proba >= DEFAULT_THRESHOLD).astype(int)

    metrics = {
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
        "brier": brier_score_loss(y_test, proba),
        "f1": f1_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "train_time_s": round(elapsed, 1),
    }

    results[backend] = {"pipeline": pipe, "proba": proba, "pred": pred, "metrics": metrics}
    logger.info("%s done in %.1fs  |  ROC-AUC=%.4f  PR-AUC=%.4f  Brier=%.4f",
                backend, elapsed, metrics["roc_auc"], metrics["pr_auc"], metrics["brier"])

print()
INFO | Training lightgbm...
INFO | lightgbm done in 15.6s  |  ROC-AUC=0.8827  PR-AUC=0.8229  Brier=0.0913
INFO | Training xgboost...
INFO | xgboost done in 24.8s  |  ROC-AUC=0.8821  PR-AUC=0.8229  Brier=0.0913
In [12]:
metrics_df = pd.DataFrame({k: v["metrics"] for k, v in results.items()}).T
metrics_df.index.name = "backend"
metrics_df.style.format("{:.4f}").background_gradient(cmap="Blues", axis=0)
| backend  | roc_auc | pr_auc | brier  | f1     | precision | recall | train_time_s |
|----------|---------|--------|--------|--------|-----------|--------|--------------|
| lightgbm | 0.8827  | 0.8229 | 0.0913 | 0.7311 | 0.8607    | 0.6355 | 15.6         |
| xgboost  | 0.8821  | 0.8229 | 0.0913 | 0.7308 | 0.8593    | 0.6358 | 24.8         |

9. Evaluation

9a. ROC & Precision–Recall Curves

In [13]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5.5))
label_map = {"lightgbm": "LightGBM", "xgboost": "XGBoost"}
color_map = {"lightgbm": PALETTE["lgb"], "xgboost": PALETTE["xgb"]}

for backend, res in results.items():
    fpr, tpr, _ = roc_curve(y_test, res["proba"])
    auc = res["metrics"]["roc_auc"]
    axes[0].plot(fpr, tpr, label=f'{label_map[backend]}  AUC={auc:.4f}',
                 color=color_map[backend], linewidth=2)
axes[0].plot([0, 1], [0, 1], "--", color=PALETTE["ref"], linewidth=1)
axes[0].set(xlabel="False Positive Rate", ylabel="True Positive Rate", title="ROC Curve")
axes[0].legend(loc="lower right")

for backend, res in results.items():
    prec, rec, _ = precision_recall_curve(y_test, res["proba"])
    ap = res["metrics"]["pr_auc"]
    axes[1].plot(rec, prec, label=f'{label_map[backend]}  AP={ap:.4f}',
                 color=color_map[backend], linewidth=2)
baseline = y_test.mean()
axes[1].axhline(baseline, linestyle="--", color=PALETTE["ref"], linewidth=1, label=f"Baseline={baseline:.3f}")
axes[1].set(xlabel="Recall", ylabel="Precision", title="Precision–Recall Curve")
axes[1].legend(loc="upper right")

plt.tight_layout()
plt.show()

9b. Calibration Curves

In [14]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5.5))

for i, (backend, res) in enumerate(results.items()):
    fraction_pos, mean_pred = calibration_curve(y_test, res["proba"], n_bins=10, strategy="uniform")
    axes[i].plot(mean_pred, fraction_pos, "o-", color=color_map[backend], linewidth=2,
                 markersize=6, label=label_map[backend])
    axes[i].plot([0, 1], [0, 1], "--", color=PALETTE["ref"], linewidth=1, label="Perfectly calibrated")
    axes[i].set(xlabel="Mean predicted probability", ylabel="Observed frequency",
                title=f"Calibration – {label_map[backend]}  (Brier={res['metrics']['brier']:.4f})")
    axes[i].legend(loc="upper left")

plt.tight_layout()
plt.show()

9c. Confusion Matrices

In [15]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for i, (backend, res) in enumerate(results.items()):
    cm = confusion_matrix(y_test, res["pred"])
    im = axes[i].imshow(cm, cmap="Blues", aspect="equal")

    for row in range(2):
        for col in range(2):
            val = cm[row, col]
            colour = "white" if val > cm.max() / 2 else "black"
            axes[i].text(col, row, f"{val:,}", ha="center", va="center",
                         fontsize=14, fontweight="bold", color=colour)

    axes[i].set(xticks=[0, 1], yticks=[0, 1],
                xticklabels=["Non-default", "Default"],
                yticklabels=["Non-default", "Default"],
                xlabel="Predicted", ylabel="Actual",
                title=f"Confusion Matrix – {label_map[backend]}")

plt.tight_layout()
plt.show()

9d. Probability Distribution by Class

In [16]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for i, (backend, res) in enumerate(results.items()):
    p0 = res["proba"][y_test == 0]
    p1 = res["proba"][y_test == 1]
    axes[i].hist(p0, bins=50, alpha=0.6, color=PALETTE["lgb"], label="Non-default", density=True, edgecolor="white")
    axes[i].hist(p1, bins=50, alpha=0.6, color=PALETTE["xgb"], label="Default", density=True, edgecolor="white")
    axes[i].axvline(DEFAULT_THRESHOLD, color="red", linestyle="--", linewidth=1.5, label=f"Threshold={DEFAULT_THRESHOLD}")
    axes[i].set(xlabel="Predicted P(default)", ylabel="Density",
                title=f"Score Distribution – {label_map[backend]}")
    axes[i].legend()

plt.tight_layout()
plt.show()

9e. Feature Importance (Top 20)

In [17]:
fig, axes = plt.subplots(1, 2, figsize=(14, 7))

for i, (backend, res) in enumerate(results.items()):
    pipe = res["pipeline"]
    pre = pipe.named_steps["pre"]
    clf = pipe.named_steps["clf"]

    feature_names = [n.split("__", 1)[-1] for n in pre.get_feature_names_out()]

    importances = np.mean([
        cc.estimator.feature_importances_
        for cc in clf.calibrated_classifiers_
    ], axis=0)

    imp_df = pd.Series(importances, index=feature_names).sort_values(ascending=True).tail(20)
    axes[i].barh(imp_df.index, imp_df.values, color=color_map[backend], edgecolor="white")
    axes[i].set_title(f"Feature Importance – {label_map[backend]}")
    axes[i].set_xlabel("Importance")

plt.tight_layout()
plt.show()

10. Threshold Analysis

In production lending, the decision threshold is a business lever — not just 0.5.

This section explores the F1 / precision / recall trade-off across thresholds.

In [18]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5.5))

for i, (backend, res) in enumerate(results.items()):
    thresholds = np.linspace(0.01, 0.99, 200)
    f1s, precs, recs = [], [], []

    for t in thresholds:
        p = (res["proba"] >= t).astype(int)
        f1s.append(f1_score(y_test, p, zero_division=0))
        precs.append(precision_score(y_test, p, zero_division=0))
        recs.append(recall_score(y_test, p, zero_division=0))

    axes[i].plot(thresholds, f1s, label="F1", color=PALETTE["cal"], linewidth=2)
    axes[i].plot(thresholds, precs, label="Precision", color=PALETTE["lgb"], linewidth=1.5, linestyle="--")
    axes[i].plot(thresholds, recs, label="Recall", color=PALETTE["xgb"], linewidth=1.5, linestyle="--")

    best_t = thresholds[np.argmax(f1s)]
    axes[i].axvline(best_t, color="red", linestyle=":", linewidth=1, label=f"Best F1 @ {best_t:.2f}")
    axes[i].set(xlabel="Threshold", ylabel="Score", title=f"Threshold Sweep – {label_map[backend]}")
    axes[i].legend(loc="center left")

plt.tight_layout()
plt.show()
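The F1-optimal threshold above is only one way to set the lever. In lending, asymmetric misclassification costs usually dominate; a sketch with illustrative, made-up unit costs (a missed default assumed 5x as costly as a wrongly declined loan):

```python
import numpy as np


def min_cost_threshold(y_true, proba, *, cost_fn=5.0, cost_fp=1.0):
    """Return the threshold minimising total expected misclassification cost.

    cost_fn / cost_fp are hypothetical business costs, not repo values.
    """
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = [
        cost_fn * np.sum((proba < t) & (y_true == 1))     # missed defaults
        + cost_fp * np.sum((proba >= t) & (y_true == 0))  # wrongly declined
        for t in thresholds
    ]
    return thresholds[int(np.argmin(costs))]
```

Raising cost_fn pushes the chosen threshold down, flagging more applications as risky.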

11. Save Model & Metadata

In [19]:
best_backend = max(results, key=lambda b: results[b]["metrics"]["pr_auc"])
best = results[best_backend]

joblib.dump(best["pipeline"], PIPELINE_PATH)

metadata = {
    "trained_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "model_type": f"Calibrated({best_backend})",
    "backend": best_backend,
    "features": X_train.columns.tolist(),
    "numeric": numeric,
    "categorical": categorical,
    "threshold": DEFAULT_THRESHOLD,
    **{k: round(v, 6) for k, v in best["metrics"].items()},
}

with open(METADATA_PATH, "w") as f:
    json.dump(metadata, f, indent=2)

print(f"Saved pipeline  → {PIPELINE_PATH}")
print(f"Saved metadata  → {METADATA_PATH}")
print(f"Best backend    → {best_backend}")
print(f"\nMetadata:\n{json.dumps(metadata, indent=2)}")
Saved pipeline  → artifacts/model.joblib
Saved metadata  → artifacts/metadata.json
Best backend    → xgboost

Metadata:
{
  "trained_at": "2026-03-24T12:38:53Z",
  "model_type": "Calibrated(xgboost)",
  "backend": "xgboost",
  "features": [
    "year",
    "loan_limit",
    "Gender",
    "approv_in_adv",
    "loan_type",
    "loan_purpose",
    "Credit_Worthiness",
    "open_credit",
    "business_or_commercial",
    "loan_amount",
    "term",
    "Neg_ammortization",
    "interest_only",
    "lump_sum_payment",
    "construction_type",
    "occupancy_type",
    "Secured_by",
    "total_units",
    "income",
    "credit_type",
    "Credit_Score",
    "co_applicant_credit_type",
    "age",
    "submission_of_application",
    "Region",
    "Security_Type",
    "dtir1"
  ],
  "numeric": [
    "year",
    "loan_amount",
    "term",
    "income",
    "Credit_Score",
    "dtir1"
  ],
  "categorical": [
    "loan_limit",
    "Gender",
    "approv_in_adv",
    "loan_type",
    "loan_purpose",
    "Credit_Worthiness",
    "open_credit",
    "business_or_commercial",
    "Neg_ammortization",
    "interest_only",
    "lump_sum_payment",
    "construction_type",
    "occupancy_type",
    "Secured_by",
    "total_units",
    "credit_type",
    "co_applicant_credit_type",
    "age",
    "submission_of_application",
    "Region",
    "Security_Type"
  ],
  "threshold": 0.5,
  "roc_auc": 0.882076,
  "pr_auc": 0.822938,
  "brier": 0.091336,
  "f1": 0.730824,
  "precision": 0.859277,
  "recall": 0.635781,
  "train_time_s": 24.8
}

12. API Smoke Test

Quick verification that the saved model loads and scores correctly via the same code path the FastAPI app uses.

In [20]:
loaded_pipe = joblib.load(PIPELINE_PATH)
with open(METADATA_PATH) as f:
    loaded_meta = json.load(f)

sample = {
    "year": 2017, "loan_limit": "cf", "Gender": "Male", "approv_in_adv": "nopre",
    "loan_type": "type1", "loan_purpose": "p1", "Credit_Worthiness": "l1",
    "open_credit": "yes", "business_or_commercial": "no", "loan_amount": 150000.0,
    "term": 180.0, "Neg_ammortization": "no", "interest_only": "no",
    "lump_sum_payment": "no", "construction_type": "existing",
    "occupancy_type": "owner", "Secured_by": "home", "total_units": "1",
    "income": 6500.0, "credit_type": "CR1", "Credit_Score": 720.0,
    "co_applicant_credit_type": "Na", "age": "25-34",
    "submission_of_application": "to_inst", "Region": "south",
    "Security_Type": "type1", "dtir1": 28.0,
}

features = loaded_meta["features"]
X_single = pd.DataFrame([[sample[c] for c in features]], columns=features)
p1 = loaded_pipe.predict_proba(X_single)[0, 1]
threshold = loaded_meta["threshold"]

response = {
    "default_probability": round(float(p1), 6),
    "prediction": int(p1 >= threshold),
    "threshold": threshold,
    "model_info": {
        "model_type": loaded_meta["model_type"],
        "backend": loaded_meta["backend"],
        "trained_at": loaded_meta["trained_at"],
    },
}

print("POST /score response (simulated):")
print(json.dumps(response, indent=2))
POST /score response (simulated):
{
  "default_probability": 0.467872,
  "prediction": 0,
  "threshold": 0.5,
  "model_info": {
    "model_type": "Calibrated(xgboost)",
    "backend": "xgboost",
    "trained_at": "2026-03-24T12:38:53Z"
  }
}

13. Next Steps

> 📋 [Back to README](https://github.com/agbajames/consumer-lending-risk-api)