ML Serving for Fraud Detection

An end-to-end MLOps project for real-time credit card fraud detection. Covers the full pipeline from data preprocessing to model deployment on Kubernetes with automated CI/CD.

Overview

This project trains a machine learning model to detect fraudulent credit card transactions and serves it as a production-ready REST API. The entire pipeline is automated using DVC, containerized with Docker, deployed on Kubernetes, and integrated with a CI/CD pipeline via GitHub Actions.

Architecture

Raw Data (Kaggle)
      |
      v
DVC Pipeline
  1_preprocess.py  ->  2_train.py  ->  3_evaluate.py
      |
      v
Trained Models (XGBoost + RandomForest)
      |
      v
FastAPI REST API  (/predict, /health)
      |
      v
Docker Image  ->  Docker Hub
      |
      v
Kubernetes Cluster (Minikube)
  - Deployment (2 replicas)
  - Service (NodePort)
  - HPA (auto-scales 2 to 10 pods)
      |
      v
GitHub Actions CI/CD
  - Run tests
  - Build Docker image
  - Push to Docker Hub

Tech Stack

Category	Tool
Language	Python 3.10
ML Models	XGBoost, Random Forest
Pipeline	DVC
API	FastAPI
Containerization	Docker
Orchestration	Kubernetes (Minikube)
CI/CD	GitHub Actions
Data Versioning	DVC
Class Imbalance	SMOTE

Project Structure

ML-Serving-for-fraude-detetction/
|
|-- data/
|   |-- FraudTrain.csv          (not tracked by git - see below)
|   |-- FraudTest.csv           (not tracked by git - see below)
|
|-- dvc-pipeline/
|   |-- src/
|   |   |-- 1_preprocess.py     (load, clean, feature engineering, encode)
|   |   |-- 2_train.py          (SMOTE, scale, train RF + XGBoost)
|   |   |-- 3_evaluate.py       (predict, metrics, confusion matrix)
|   |-- dvc.yaml                (pipeline definition)
|   |-- params.yaml             (all configurable parameters)
|
|-- k8s/
|   |-- deployment.yaml         (2 replicas, liveness + readiness probes)
|   |-- service.yaml            (NodePort service)
|   |-- hpa.yaml                (auto-scales 2-10 pods based on CPU)
|
|-- .github/
|   |-- workflows/
|       |-- ci.yml              (test, build, push on every git push)
|
|-- notebook/
|   |-- main.ipynb              (original exploration notebook)
|
|-- models/                     (created by DVC pipeline)
|   |-- xgboost.pkl
|   |-- random_forest.pkl
|   |-- scaler.pkl
|
|-- metrics/                    (created by DVC pipeline)
|   |-- scores.json
|
|-- main.py                     (FastAPI application)
|-- Dockerfile
|-- requirements.txt
|-- .gitignore

Dataset

Download from Kaggle: https://www.kaggle.com/datasets/kartik2112/fraud-detection

Place both files in the data/ folder:

data/FraudTrain.csv
data/FraudTest.csv

The dataset contains real credit card transactions labeled as fraud or legitimate. It has a heavy class imbalance (fraud is less than 1% of transactions) which is handled using SMOTE during training.

Getting Started

1. Clone the repo

git clone https://github.com/ShreyanshuGhosh/ML-Serving-for-fraude-detetction.git
cd ML-Serving-for-fraude-detetction

2. Install dependencies

python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # Mac/Linux

pip install -r requirements.txt
pip install dvc

3. Download the dataset

Download from the Kaggle link above and place in data/.

4. Run the DVC pipeline

dvc init
cd dvc-pipeline
python -m dvc repro

This runs all 3 stages in order:

Preprocessing: cleans data, extracts features, encodes categoricals
Training: applies SMOTE, scales features, trains both models
Evaluation: generates metrics and saves to metrics/scores.json

5. View metrics

python -m dvc metrics show

6. Run the API locally

cd ..
python -m uvicorn main:app --reload --port 8000

API docs available at: http://localhost:8000/docs

API Endpoints

Endpoint	Method	Description
/	GET	Welcome message
/health	GET	Health check
/predict	POST	Predict fraud for a transaction

Sample Request

POST /predict

{
  "merchant": 319,
  "category": 10,
  "amt": 2.86,
  "gender": 1,
  "city": 157,
  "state": 39,
  "zip": 29209,
  "lat": 33.9659,
  "long": -80.9355,
  "city_pop": 333497,
  "job": 275,
  "unix_time": 1371816865,
  "merch_lat": 33.986391,
  "merch_long": -81.200714,
  "hour": 12,
  "age": 58,
  "distance": 0.266004
}

Sample Response

{
  "prediction": 0,
  "label": "LEGIT",
  "fraud_probability": 0.0231,
  "legit_probability": 0.9769
}

Docker

# Build
docker build -t shreyanshughosh/fraud-detector:latest .

# Run
docker run -p 8000:8000 shreyanshughosh/fraud-detector:latest

# Pull from Docker Hub
docker pull shreyanshughosh/fraud-detector:latest

Kubernetes Deployment

Requirements: Minikube and kubectl installed.

# Start cluster
minikube start

# Deploy
kubectl apply -f k8s/

# Check pods
kubectl get pods

# Get API URL
minikube service fraud-api-service --url

# Check autoscaler
kubectl get hpa

The HPA automatically scales pods between 2 and 10 based on CPU usage. If CPU exceeds 70%, new pods are added automatically.

CI/CD Pipeline

Every push to the main branch automatically:

Sets up Python environment
Installs dependencies
Runs tests
Builds Docker image
Pushes to Docker Hub with latest tag and commit SHA tag

Pipeline defined in .github/workflows/ci.yml.

Required GitHub Secrets:

DOCKERHUB_USERNAME
DOCKERHUB_TOKEN

Model Performance

Metric	Random Forest	XGBoost
Precision (Fraud)	0.59	0.06
Recall (Fraud)	0.69	0.93
F1 Score (Fraud)	0.64	0.10
ROC AUC	0.84	0.98

XGBoost achieves higher ROC AUC (0.98) and recall (0.93) which is more important for fraud detection — catching more fraud matters more than false alarms. A threshold of 0.3 is used instead of default 0.5.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.dvc		.dvc
.github/workflows		.github/workflows
__pycache__		__pycache__
dvc-pipeline2		dvc-pipeline2
k8s		k8s
metrics		metrics
models		models
notebook		notebook
.dvcignore		.dvcignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
fast_api.py		fast_api.py
requirements.txt		requirements.txt
tempCodeRunnerFile.py		tempCodeRunnerFile.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ML Serving for Fraud Detection

Overview

Architecture

Tech Stack

Project Structure

Dataset

Getting Started

1. Clone the repo

2. Install dependencies

3. Download the dataset

4. Run the DVC pipeline

5. View metrics

6. Run the API locally

API Endpoints

Sample Request

Sample Response

Docker

Kubernetes Deployment

CI/CD Pipeline

Model Performance

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ML Serving for Fraud Detection

Overview

Architecture

Tech Stack

Project Structure

Dataset

Getting Started

1. Clone the repo

2. Install dependencies

3. Download the dataset

4. Run the DVC pipeline

5. View metrics

6. Run the API locally

API Endpoints

Sample Request

Sample Response

Docker

Kubernetes Deployment

CI/CD Pipeline

Model Performance

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages