🤖 Generative AI

Haydar Kilic | Artificial Intelligence Engineering

A comprehensive, hands-on guide to the fundamentals and advanced concepts of Generative Artificial Intelligence (GAI). This repository contains a curated series of Jupyter Notebooks bridging the gap between foundational statistical theory and state-of-the-art deep generative architectures.

📚 Contents

| Lecture | Topic | Notebook |
|---|---|---|
| Lecture 1 | Generative Modelling Fundamentals | `GAI_Lecture1_Notebook.ipynb` |
| Lecture 2 | Derivation of Generative Models (MAP · MLE · Beta-Binomial · Dirichlet) | `GAI_Lecture2_Notebook.ipynb` |
| Lecture 3 | Deep Generative Models (VAE · GAN · GMMN · Diffusion) | `GAI_Lecture3_Notebook.ipynb` |
| Lecture 4 | Transformers and Large Language Models (Attention · RoPE · Mini GPT · Scaling) | `GAI_Lecture4_Notebook.ipynb` |

The table will be updated as new lectures are added.

🗂 Lecture 1 — Generative Modelling Fundamentals

Topics Covered

Section 1 — Core Concepts

Handwritten digit recognition: 28×28 pixel vector representation, train/test/validation split
Polynomial regression and curve fitting (Vandermonde matrix, Least Squares)
Overfitting / Underfitting and RMS error analysis
Ridge Regularisation (L2 penalty, λ hyperparameter)
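
The curve-fitting topics in this section can be illustrated with a minimal sketch, assuming NumPy and synthetic sin(2πx) data. The function names `fit_polynomial_ridge` and `rms_error`, the polynomial degree and the λ value are illustrative choices, not taken from the notebook.

```python
import numpy as np

def fit_polynomial_ridge(x, t, degree, lam=0.0):
    """Weights minimising ||Phi w - t||^2 + lam * ||w||^2 (lam=0 is least squares)."""
    # Vandermonde design matrix with columns x^0, x^1, ..., x^degree
    phi = np.vander(x, degree + 1, increasing=True)
    a = phi.T @ phi + lam * np.eye(degree + 1)
    return np.linalg.solve(a, phi.T @ t)

def rms_error(x, t, w):
    """Root-mean-square error of the polynomial with weights w."""
    phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((phi @ w - t) ** 2))

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(10)

w_ls = fit_polynomial_ridge(x_train, t_train, degree=3)
w_ridge = fit_polynomial_ridge(x_train, t_train, degree=3, lam=0.1)
print("RMS (least squares):", rms_error(x_train, t_train, w_ls))
print("RMS (ridge):        ", rms_error(x_train, t_train, w_ridge))
```

The L2 penalty shrinks the weight vector, so the ridge fit trades a slightly higher training error for smaller coefficients.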

Section 2 — Probability Theory

Joint, marginal and conditional probability distributions
Bayes' theorem — medical diagnosis and base rate fallacy
Gaussian (Normal) distribution: PDF, CDF, numerical verification
Maximum Likelihood Estimation (MLE) and bias
Bayesian updating: coin flip prior → posterior
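
The base rate fallacy in the medical diagnosis example can be made concrete with a few lines of arithmetic. This is a sketch with assumed numbers (1% prevalence, 95% sensitivity and specificity); the notebook may use different values.

```python
def posterior_positive(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Marginal probability of a positive test (the evidence term)
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / evidence

p = posterior_positive(prior=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(disease | positive) = {p:.3f}")
```

Even with an accurate test, the posterior stays well below the sensitivity because healthy patients vastly outnumber sick ones.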

Section 3 — Decision Theory

Minimum-error decision boundaries and posterior probabilities
Reject Option and threshold θ
Asymmetric loss matrix (medical diagnosis scenario)
Comparison of generative models, discriminative models and discriminant functions
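
The decision rules listed in this section can be sketched as one small function. The helper name `decide`, the loss values and the threshold are hypothetical; `loss[k, j]` is assumed to be the cost of choosing class j when the true class is k.

```python
import numpy as np

def decide(posterior, loss=None, reject_threshold=None):
    """Return the class with minimum expected loss, or -1 to reject."""
    posterior = np.asarray(posterior, dtype=float)
    # Reject option: abstain when the largest posterior is below theta
    if reject_threshold is not None and posterior.max() < reject_threshold:
        return -1
    if loss is None:
        # 0-1 loss reduces to the minimum-error (maximum posterior) rule
        loss = 1.0 - np.eye(len(posterior))
    expected_loss = posterior @ loss  # entry j = sum_k p(k|x) * loss[k, j]
    return int(np.argmin(expected_loss))

# Class 0 = healthy, class 1 = disease; missing a disease costs 100x more
medical_loss = np.array([[0.0, 1.0],
                         [100.0, 0.0]])
print(decide([0.9, 0.1]))                        # minimum-error rule
print(decide([0.9, 0.1], loss=medical_loss))     # asymmetric loss
print(decide([0.55, 0.45], reject_threshold=0.8))
```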

🗂 Lecture 2 — Derivation of Generative Models

Topics Covered

Section 1 — Learning from Positive Examples & The Number Game

Concept learning = binary classification; posterior predictive distribution
Strong sampling assumption: p(D|h) = (1/|h|)^N
Size Principle: narrow hypothesis → high likelihood
Prior, likelihood and posterior computation; Bayesian updating
MAP estimation and N → ∞ behaviour (Dirac convergence)
Bayesian Model Averaging (BMA) vs. Plug-In approach
Mixture prior (π₀ parameter): rule-based vs. interval-based hypotheses
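
A minimal sketch of the size principle under the strong sampling assumption follows. The three hypotheses and the uniform prior are illustrative assumptions; the notebook's hypothesis space is likely larger.

```python
from fractions import Fraction

# Hypothetical hypothesis space over the integers 1..100
hypotheses = {
    "even": set(range(2, 101, 2)),
    "powers_of_two": {2, 4, 8, 16, 32, 64},
    "multiples_of_four": set(range(4, 101, 4)),
}

def posterior(data, hypotheses, prior=None):
    """p(h | D) with likelihood p(D | h) = (1/|h|)^N for consistent h, else 0."""
    if prior is None:
        prior = {name: Fraction(1, len(hypotheses)) for name in hypotheses}
    unnorm = {}
    for name, extension in hypotheses.items():
        consistent = all(x in extension for x in data)
        lik = Fraction(1, len(extension)) ** len(data) if consistent else Fraction(0)
        unnorm[name] = prior[name] * lik
    z = sum(unnorm.values())
    return {name: value / z for name, value in unnorm.items()}

post = posterior([16, 8, 2, 64], hypotheses)
for name, value in post.items():
    print(name, float(value))
```

Both "even" and "powers_of_two" are consistent with the data, but the narrower hypothesis receives almost all of the posterior mass.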

Section 2 — Beta-Binomial Model

Bernoulli likelihood and sufficient statistics (N₁, N₀)
Beta distribution: conjugate prior, various (a, b) parameters
Sequential Bayesian updating: Beta(a,b) → Beta(N₁+a, N₀+b)
MLE, MAP and posterior mean formulas; convergence as N grows
Zero Count Problem and Laplace's rule of succession
Posterior variance and credible interval: σ ∝ 1/√N
Compound Beta-Binomial distribution: prediction of future trials
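
The three point estimates and the zero count problem can be compared with a short sketch. The function name `beta_binomial_summary` and the example counts are assumptions; the formulas are the standard ones for a Beta(a, b) prior.

```python
def beta_binomial_summary(n1, n0, a=1.0, b=1.0):
    """MLE, MAP and posterior mean of theta after n1 successes and n0 failures."""
    a_post, b_post = n1 + a, n0 + b          # Beta(a, b) -> Beta(N1 + a, N0 + b)
    total = a_post + b_post
    return {
        "mle": n1 / (n1 + n0),
        # Posterior mode; this formula assumes a_post + b_post > 2
        "map": (a_post - 1) / (total - 2),
        "posterior_mean": a_post / total,
        "posterior_var": a_post * b_post / (total ** 2 * (total + 1)),
    }

# Three heads, zero tails, uniform Beta(1, 1) prior
summary = beta_binomial_summary(n1=3, n0=0)
print(summary)
```

With a uniform prior the posterior mean equals (N₁ + 1) / (N + 2), which is Laplace's rule of succession: it never assigns zero probability to the unseen outcome, unlike the MLE.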

Section 3 — Dirichlet-Multinomial

Multinomial likelihood and Dirichlet prior
Visualisation of the K=3 probability simplex (barycentric coordinates)
Dirichlet-Multinomial update and posterior prediction
Add-K smoothing (β): MLE → Laplace → uniform
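
The effect of the smoothing strength β on the Dirichlet-Multinomial posterior predictive can be sketched as below, assuming a symmetric Dirichlet(β, ..., β) prior. The function name and the counts are illustrative.

```python
def posterior_predictive(counts, beta=1.0):
    """p(x = k | D) = (N_k + beta) / (N + K * beta) for a symmetric Dirichlet prior."""
    k = len(counts)
    n = sum(counts)
    return [(c + beta) / (n + k * beta) for c in counts]

counts = [3, 0, 1]
for beta in (0.0, 1.0, 1e6):
    # beta = 0 gives the MLE, beta = 1 Laplace smoothing, large beta approaches uniform
    print(beta, posterior_predictive(counts, beta))
```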

Section 4 — Mixture Model

Effect of the π₀ parameter on the posterior predictive distribution

Section 5 — MLE vs MAP vs Bayes Comparison

Error analysis, convergence of θ estimates with N

🗂 Lecture 3 — Deep Generative Models

Topics Covered

Section 1 — Probabilistic Framework & MLE

Real data simulation with a 2D Gaussian mixture
Gaussian log-likelihood function
MLE vs. bad model comparison

Section 2 — KL Divergence

Closed-form Gaussian KL computation
KL asymmetry: KL(p‖q) ≠ KL(q‖p)
MLE ≡ KL minimisation relationship
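
The closed-form KL divergence between two univariate Gaussians, and its asymmetry, can be checked numerically. The function name `kl_gauss` and the example parameters are assumptions chosen for illustration.

```python
import math

def kl_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for p = N(mu_p, sigma_p^2) and q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

forward = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL(p || q)
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL(q || p)
print(f"KL(p||q) = {forward:.4f}, KL(q||p) = {reverse:.4f}")
```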

Section 3 — Latent Space & Manifold Hypothesis

MNIST: 784 pixels → ~10-dimensional manifold (PCA variance analysis)
Latent space visualisation via 2D PCA projection
Latent space arithmetic: z(7) − z(1) + z(0) ≈ z(6)

Section 4 — ELBO Derivation

Closed-form KL computation and heat map
Balance between reconstruction and KL terms

Section 5 — Variational Autoencoder (VAE)

Encoder–Decoder architecture, Reparametrisation Trick
Gradient flow diagram (why backprop works)
Training on MNIST; 2D latent space visualisation
β-VAE: KL regularisation effect; Posterior Collapse problem
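
The reparametrisation trick and the KL term of the VAE objective can be sketched in NumPy, independently of any deep learning framework. The function names are hypothetical, and the encoder is assumed to output a mean and a log-variance per latent dimension.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); randomness is moved out of mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def beta_vae_loss(reconstruction_loss, kl, beta=1.0):
    """Negative ELBO with the KL term weighted by beta (beta = 1 is the plain VAE)."""
    return reconstruction_loss + beta * kl

rng = np.random.default_rng(0)
mu = np.array([[1.0, -2.0]])
log_var = np.log(np.array([[0.25, 0.25]]))
z = reparameterise(mu, log_var, rng)
print(z, kl_to_standard_normal(mu, log_var))
```

Because z is a deterministic function of mu, log_var and an external noise sample, gradients can flow through the sampling step to the encoder parameters.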

Section 6 — Generative Adversarial Networks (GAN)

Generator + Discriminator architecture (LeakyReLU, BatchNorm)
Optimal Discriminator formula and Nash equilibrium visualisation
MNIST training; G/D loss curves and mode-collapse discussion

Section 7 — GMMN & MMD

Gaussian (RBF) kernel and MMD² computation (multi-scale)
MMD intuition test: same / nearby / distant distributions
Discriminator-free GMMN training (MMD loss only)
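
The multi-scale MMD² computation and the same / nearby / distant intuition test can be sketched as follows. This uses the biased (V-statistic) estimator and assumed bandwidths (0.5, 1, 2); the notebook's kernel scales and estimator may differ.

```python
import numpy as np

def rbf_kernel(x, y, sigma):
    """Gaussian kernel matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq = np.sum(x ** 2, 1)[:, None] + np.sum(y ** 2, 1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, y, sigmas=(0.5, 1.0, 2.0)):
    """Biased estimate of MMD^2 using a sum of RBF kernels at several scales."""
    total = 0.0
    for s in sigmas:
        total += (rbf_kernel(x, x, s).mean()
                  + rbf_kernel(y, y, s).mean()
                  - 2.0 * rbf_kernel(x, y, s).mean())
    return total

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))
same = rng.standard_normal((200, 2))
near = rng.standard_normal((200, 2)) + 1.0
far = rng.standard_normal((200, 2)) + 4.0
print(mmd2(x, same), mmd2(x, near), mmd2(x, far))
```

In a GMMN this quantity, computed between a batch of generated samples and a batch of real samples, serves directly as the training loss.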

Section 8 — Diffusion Models (DDPM)

Forward process: β schedule, closed-form q(x_t|x_0)
SimpleUNet: time embedding + skip-connection noise estimator
DDPM training (MSE loss) and reverse process sampling
Step-by-step denoising visualisation
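
The closed-form forward process q(x_t | x_0) can be sketched in a few lines. The linear β schedule from 1e-4 to 0.02 over T = 1000 steps is an assumption borrowed from common DDPM setups, and `q_sample` is a hypothetical helper name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)    # alpha_bar_t = prod_{s <= t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps   # eps is the regression target of the MSE training loss

rng = np.random.default_rng(0)
x0 = np.ones((2, 4))
for t in (0, 499, 999):
    xt, eps = q_sample(x0, t, rng)
    print(t, float(alpha_bars[t]), float(xt.mean()))
```

As t grows, alpha_bar_t decays towards zero, so x_t loses the signal from x_0 and approaches pure Gaussian noise.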

Section 9 — Model Comparison & FID

Fréchet Inception Distance computation (PCA feature space)
Radar chart: Quality / Diversity / Speed / Stability / Latent Control
Generative model chronology (1985–2022)
Comprehensive comparison table
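
The Fréchet distance behind FID can be sketched for any pair of feature matrices. This is a simplified illustration on random features, not Inception activations; `frechet_distance` is a hypothetical name, and the matrix square root trace is computed from eigenvalues to avoid a SciPy dependency.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)) for two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^(1/2)) equals the sum of square roots of the eigenvalues of C_a C_b
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 4))
shifted = feats + np.array([1.0, 0.0, 0.0, 2.0])
print(frechet_distance(feats, feats), frechet_distance(feats, shifted))
```

When the two covariances match, only the squared mean shift remains, which is why the second value is close to 1² + 2² = 5.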

🗂 Lecture 4 — Transformers and Large Language Models

Topics Covered

Section 1 — RNN vs Transformer: Vanishing Gradients

Simulation of |dL/dh_t| ≈ |W_hh|^(T-t) exponential decay in simple RNNs
Vanishing / stable / exploding regimes (|W_hh| = 0.85 / 1.00 / 1.15)
Transformer O(1) connection distance: direct access to every token pair
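
The three gradient regimes follow directly from the scalar approximation |dL/dh_t| ≈ |W_hh|^(T−t). A minimal sketch, assuming a scalar recurrent weight and a sequence length of 50 (the helper name `gradient_scale` is illustrative):

```python
def gradient_scale(w_hh, t, T):
    """Approximate magnitude of dL/dh_t for a scalar RNN after T - t steps."""
    return abs(w_hh) ** (T - t)

T = 50
for w_hh in (0.85, 1.00, 1.15):
    # Below 1 the gradient vanishes, at 1 it is stable, above 1 it explodes
    print(w_hh, gradient_scale(w_hh, t=0, T=T))
```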

Section 2 — Encoder–Decoder and the Information Bottleneck

Cosine similarity loss at different sequence lengths with a GRU encoder
RNN Enc-Dec single-vector bottleneck vs. Attention context vector comparison
Visual explanation of c_t = Σ α_{t,i} · h_i

Section 3 — Bahdanau (Additive) Attention Mechanism

From-scratch BahdanauAttention: W_s, W_h, v parameterised scoring
e_{t,i} = vᵀ tanh(W_s·s_{t-1} + W_h·h_i) → softmax → context vector
English→German translation simulation: 4×4 attention heatmap
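
The additive scoring function and the resulting context vector can be sketched in NumPy. The function name, the shapes and the random parameters are assumptions for illustration; the notebook's `BahdanauAttention` class is presumably a trainable module.

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_s, W_h, v):
    """s_prev: (d_s,), H: (T, d_h). Returns context (d_h,) and weights (T,)."""
    # e_i = v^T tanh(W_s s_{t-1} + W_h h_i), computed for all encoder states at once
    scores = np.tanh(W_s @ s_prev + H @ W_h.T) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over source positions
    context = weights @ H                 # c_t = sum_i alpha_{t,i} h_i
    return context, weights

rng = np.random.default_rng(0)
T_src, d_s, d_h, d_a = 4, 3, 5, 6
H = rng.standard_normal((T_src, d_h))
s_prev = rng.standard_normal(d_s)
W_s = rng.standard_normal((d_a, d_s))
W_h = rng.standard_normal((d_a, d_h))
v = rng.standard_normal(d_a)
context, weights = bahdanau_attention(s_prev, H, W_s, W_h, v)
print(weights, context.shape)
```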

Section 4 — Scaled Dot-Product Attention (Q, K, V)

Attention(Q,K,V) = softmax(QK^T / √d_k) · V step-by-step implementation
Importance of √d_k scaling: entropy analysis (unscaled softmax collapses as d_k grows)
Dimension analysis: (B, T, d_model) → Q/K/V → (B, T, d_k) → Z
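
The attention formula in this section can be written as a short NumPy sketch. Shapes follow the (B, T, d_k) convention above; the optional boolean `mask` argument is an assumption added for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for inputs of shape (B, T, d_k)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)   # (B, T, T)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)         # True = position is visible
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 5, 8))
k = rng.standard_normal((2, 5, 8))
v = rng.standard_normal((2, 5, 8))
z, w = scaled_dot_product_attention(q, k, v)
print(z.shape, w.shape)
```

Dividing by √d_k keeps the variance of the scores roughly independent of d_k, which prevents the softmax from saturating into a near one-hot distribution.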

Section 5 — Multi-Head Attention

Single large W_q/W_k/W_v matrix approach; split_heads → (B, n_heads, T, d_k)
4-head attention maps: Position / Syntax / Semantics / Distance
Parameter analysis: 4 × d_model² weights
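
The head-splitting reshape and the weight count can be sketched as below. The name `split_heads` comes from the topic list, but its signature here is assumed; `merge_heads` and `mha_weight_count` are hypothetical helpers, and the fourth d_model² matrix is assumed to be the output projection.

```python
import numpy as np

def split_heads(x, n_heads):
    """(B, T, d_model) -> (B, n_heads, T, d_k) with d_k = d_model // n_heads."""
    b, t, d_model = x.shape
    d_k = d_model // n_heads
    return x.reshape(b, t, n_heads, d_k).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(B, n_heads, T, d_k) -> (B, T, n_heads * d_k); inverse of split_heads."""
    b, h, t, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, t, h * d_k)

def mha_weight_count(d_model):
    """Weights of W_q, W_k, W_v and the output projection, biases excluded."""
    return 4 * d_model ** 2

x = np.arange(2 * 5 * 8, dtype=float).reshape(2, 5, 8)
heads = split_heads(x, n_heads=4)
print(heads.shape, mha_weight_count(512))
```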

Section 6 — Positional Encoding (Sinusoidal, RoPE, ALiBi)

PE_{pos,2i} = sin(pos/10000^{2i/d}), PE_{pos,2i+1} = cos(…): matrix visualisation
Wave frequencies: low dimension = high frequency; PE similarity matrix
RoPE: relative positional encoding via 2D rotation; q^T_m k_n ∝ f(m-n)
ALiBi: e_{ij} = q_i^Tk_j − m·|i−j| linear penalty; slope m_i = 2^{−8i/n_heads}
Comparison table: Sinusoidal / Learned / RoPE / ALiBi
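
The three positional schemes can each be sketched in a few lines. The function names, the RoPE base angle `theta` and the single 2D pair are illustrative assumptions; the formulas follow the ones listed in this section.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...); d_model must be even."""
    pos = np.arange(max_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope_rotate(x, pos, theta=0.1):
    """Rotate one 2D feature pair by the angle pos * theta (a single RoPE frequency)."""
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ x

def alibi_slopes(n_heads):
    """Per-head slopes m_i = 2^(-8 i / n_heads) for i = 1..n_heads."""
    return [2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)]

pe = sinusoidal_pe(50, 16)
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# The RoPE score depends only on the offset m - n, not on absolute positions
print(rope_rotate(q, 3) @ rope_rotate(k, 1), rope_rotate(q, 7) @ rope_rotate(k, 5))
print(alibi_slopes(8))
```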

Section 7 — Feed-Forward Network & Activation Functions

ReLU → GELU → Swish/SiLU → SwiGLU(x,W,V) = Swish(xW) ⊙ xV
Gradient analysis: dead neuron problem in ReLU for x<0 region
d_ff = 4×d_model expansion rule and FFN parameter growth

Section 8 — Layer Normalization: LayerNorm vs RMSNorm / Pre-LN vs Post-LN

LN(x) = γ·(x−μ)/√(σ²+ε)+β vs. RMSNorm(x) = γ·x/RMS(x) (no β, ~10% faster)
std/mean comparison at different input scales
Pre-LN (modern) vs Post-LN (original): gradient distribution histogram
BN vs LN vs RMSNorm: preference analysis in sequence models
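
The two normalisation formulas can be compared directly in NumPy. This is a sketch with scalar γ and β for brevity (in practice they are learned per-feature vectors), and the ε value is an assumption.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x) = gamma * (x - mu) / sqrt(var + eps) + beta over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm(x) = gamma * x / RMS(x); no mean subtraction and no beta."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)) * 3.0 + 2.0
y_ln = layer_norm(x, gamma=1.0, beta=0.0)
y_rms = rms_norm(x, gamma=1.0)
print(y_ln.mean(axis=-1), y_ln.std(axis=-1))
print(y_rms.mean(axis=-1), np.sqrt((y_rms ** 2).mean(axis=-1)))
```

LayerNorm centres and rescales each row, whereas RMSNorm only rescales, so its output mean is generally non-zero.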

Section 9 — Attention Masking: Full vs Causal

make_full_mask (Bidirectional): BERT/RoBERTa — every token attends to every other
make_causal_mask (lower triangular): GPT — only past visible, future −∞
Masking → model family → task matching table (Encoder / Decoder / Enc-Dec)
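
The two masks and their effect on the attention weights can be sketched as follows. The names `make_full_mask` and `make_causal_mask` appear in the topic list, but their signatures here are assumed; `masked_softmax` is a hypothetical helper.

```python
import numpy as np

def make_full_mask(t):
    """Bidirectional mask: every token may attend to every other token."""
    return np.ones((t, t), dtype=bool)

def make_causal_mask(t):
    """Lower-triangular mask: token i may attend only to positions j <= i."""
    return np.tril(np.ones((t, t), dtype=bool))

def masked_softmax(scores, mask):
    """Set hidden positions to -inf before the softmax so they get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))
print(masked_softmax(scores, make_full_mask(4)))
print(masked_softmax(scores, make_causal_mask(4)))
```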

Section 10 — Full Transformer Block (From-Scratch Implementation)

TransformerEncoderBlock: Pre-LN + MHA + FFN + Residual
TransformerEncoder: N layers, learned PE, final LayerNorm
Parameter analysis for 3 model configurations (Small / BERT-mini / BERT-base)
#params ≈ 12 × N × d²_model estimation formula
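
The 12 × N × d² rule of thumb can be checked against a more detailed count. This sketch assumes d_ff = 4 × d_model, biases on every linear layer and two LayerNorms per block, and it excludes embeddings; the BERT-base values (N = 12, d_model = 768) are used as an example.

```python
def approx_block_params(n_layers, d_model):
    """Rule of thumb: 4 d^2 (attention) + 8 d^2 (FFN) per layer."""
    return 12 * n_layers * d_model ** 2

def detailed_block_params(n_layers, d_model, d_ff=None):
    """Per-layer count including biases and LayerNorm parameters, no embeddings."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * (d_model * d_model + d_model)          # W_q, W_k, W_v, W_o
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    norms = 2 * 2 * d_model                                # two LayerNorms (gamma, beta)
    return n_layers * (attention + ffn + norms)

approx = approx_block_params(12, 768)
detailed = detailed_block_params(12, 768)
print(approx, detailed, approx / detailed)
```

The two counts agree to within a fraction of a percent, since biases and normalisation parameters grow only linearly in d_model.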

Section 11 — Mini GPT: Character-Level Language Model

GPTDecoderBlock: Causal MHA + Pre-LN + FFN
MiniGPT: tok_emb + pos_emb + 3 decoder blocks + lm_head (weight tying)
Autoregressive generate(): top-k sampling + temperature control
500-step training on Turkish text: loss curve + attention map
Generated text samples at different temperatures (0.5 / 1.0 / 1.5)
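
The sampling step inside an autoregressive generate() loop can be sketched on its own. This is a NumPy illustration with made-up logits; `sample_next_token` is a hypothetical helper and the notebook's implementation may differ.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token index from temperature-scaled, top-k filtered logits."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        # Keep the k largest logits (ties at the cutoff are also kept)
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, -1.0]
for temperature in (0.5, 1.0, 1.5):
    samples = [sample_next_token(logits, temperature, top_k=3, rng=rng) for _ in range(10)]
    print(temperature, samples)
```

Lower temperatures sharpen the distribution towards the most likely token, while higher temperatures flatten it and increase diversity.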

Section 12 — Hyperparameter Analysis & Scaling Laws

Real LLM table: BERT-base/large, GPT-2, GPT-3, LLaMA-2 7B/70B
Scaling law: L ∝ N^{−0.076} log-log visualisation
d_model vs number of heads (d_k = d_model/h ≈ 64–128 rule)
GPT vs BERT comparison table: architecture, task, context, usage
Modern LLM block: RMSNorm + Pre-LN + SwiGLU + RoPE

⚙️ Installation

# Clone the repository
git clone https://github.com/HAYDARKILIC/generative_artificial_intelligence
cd generative_artificial_intelligence

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate        # Linux/macOS
# venv\Scripts\activate         # Windows

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook

📦 Requirements

numpy>=2.0
matplotlib>=3.7
scipy>=1.11
scikit-learn>=1.3
jupyter>=1.0
ipykernel>=6.0
torch>=2.0
torchvision>=0.15
tqdm>=4.65

The requirements.txt file is included in the repository.

⚠️ torch and torchvision are required from Lecture 3 onwards. For GPU support, select a CUDA-compatible version at pytorch.org.

🏗 Project Structure

generative_artificial_intelligence/
├── LICENSE
├── Readme.md
├── requirements.txt
├── GAI_Lecture1_Notebook.ipynb   # Lecture 1 — Generative Modelling Fundamentals
├── GAI_Lecture2_Notebook.ipynb   # Lecture 2 — MAP · MLE · Beta-Binomial · Dirichlet
├── GAI_Lecture3_Notebook.ipynb   # Lecture 3 — VAE · GAN · GMMN · Diffusion
├── GAI_Lecture4_Notebook.ipynb   # Lecture 4 — Transformer · Attention · Mini GPT · LLM
└── (future lecture notebooks will be added here)

📖 References

Pattern Recognition and Machine Learning – Christopher M. Bishop (1st Ed., 2006), Ch. 1–2

Machine Learning: A Probabilistic Perspective – Kevin P. Murphy (1st Ed., 2012), Ch. 3

Deep Learning – Goodfellow, Bengio, Courville (1st Ed., 2016), Ch. 20.10.3–20.10.4

Probabilistic Machine Learning: Advanced Topics – Kevin P. Murphy (1st Ed., 2023), Ch. 25

Natural Language Processing with Transformers – Lewis et al. (1st Ed., 2022), Ch. 1–2

Speech and Language Processing – Jurafsky & Martin (3rd Ed., draft), Ch. 3, 10, 11

Generative AI — Haydar Kılıç, Artificial Intelligence Engineering
