GPU-Optimized Convolution from Scratch

Project Description: This project implements 2D and 3D convolutional layers from scratch in CUDA, optimized for modern GPU architectures with tensor cores. It includes naive and optimized implementations, im2col transformation, and GEMM using Tensor cores (WMMA API), with profiling and comparison against cuDNN.

🚀 Features

Naive Implementation

2D/3D convolution kernel in CUDA using global memory.

Profile: High latency, low occupancy.

Basic Optimizations

Constant Memory: Used for kernels since kernel data is static during execution.

Shared Memory Tiling: Reduces global memory accesses by approximately ~11x.

Profiling: Bottleneck analysis using NVIDIA Nsight.

im2col Transformation

Converts input images to column format for efficient GEMM.

Implemented in Raw CUDA for GPU and NumPy for CPU verification.

Tensor Core GEMM

Uses CUDA WMMA API for FP16 tensor operations.

Leverages warp-level matrix multiply-accumulate for high throughput.

Comparison with cuDNN: Baseline performance and correctness checks.

Executable & Library: Can be built as a Linux executable and linkable library.

PyTorch Integration: (Planned) Importable C++/CUDA module.

🛠 Implementation Details

im2col

Converts input image to a padded matrix for GEMM. The GPU kernel is implemented in im2col.cu. For a $5 \times 5$ input and $3 \times 3$ kernel, im2col produces a $9 \times 25$ matrix.

GEMM with Tensor Cores

Implemented in gemm_tensors.cu using the CUDA WMMA API for warp-level matrix multiplication. It supports large matrices (multiples of 16) and utilizes hardware-level acceleration.

Profiling & Testing

Profiling: Nsight used to monitor occupancy, latency, and memory access patterns.

Testing: Validated with synthetic data and MNIST.

Verification: Results are cross-checked against NumPy's convolve for mathematical correctness.

💻 Code Examples

im2col Kernel (CUDA)

__global__ void im2col_kernel(float *image, int C_in, int H_in, int W_in, int im2col_H, int im2col_W, int kernel_size, float *im2col) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx_count = 0;

    if (row < H_in && col < W_in) {
        for (int c = 0; c < C_in; c++) {
            for (int i = -FILTER_RADIUS; i <= FILTER_RADIUS; i++) {
                for (int j = -FILTER_RADIUS; j <= FILTER_RADIUS; j++) {
                    float val = 0.0f;
                    if (col + j >= 0 && col + j < W_in && row + i >= 0 && row + i < H_in) {
                        val = image[c * (H_in * W_in) + (row + i) * W_in + (col + j)];
                    }
                    im2col[IDX2R(idx_count, row * W_in + col, im2col_W)] = val;
                    idx_count++;
                }
            }
        }
    }
}

GEMM with Tensor Cores (WMMA API)

__global__ void gemm_tensor(half *a, half *b, float *c, int M, int N, int K, float alpha, float beta) {
    // ... (see gemm_tensors.cu for full implementation)
    // Uses nvcuda::wmma for warp-level matrix operations
}

📅 Project Plan & Status

[x] Implemented:

Naive convolution (2D/3D) in RAW CUDA.

Shared memory tiled convolution.

im2col transformation (CPU & GPU).

GEMM using Tensor cores (WMMA API).

[ ] TODO:

Baseline comparison with cuDNN's convolution.

Implement WMMA convolution as GEMM with im2col.

Finalize Linux build system (CMake/Make).

Create PyTorch importable module.

📦 Usage

Build: Use nvcc and ensure CUDA >= 10.0 is installed.

Input: Shapes are read from text files or provided via command line arguments.

Output: Generates the im2col matrix, convolution result, and detailed profiling data.

📚 References

CUDA WMMA API Documentation

NVIDIA Nsight Profiling Tools

cuDNN Library

🤝 Contact

For questions or contributions, please open an issue or submit a pull request.

📄 License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
512.jpg		512.jpg
512_array.txt		512_array.txt
512_array_shape.txt		512_array_shape.txt
512_gaussian_output.jpg		512_gaussian_output.jpg
512_output_gemm_gauss.txt		512_output_gemm_gauss.txt
512_output_gemm_identity.txt		512_output_gemm_identity.txt
512_output_identical.jpg		512_output_identical.jpg
PLAN.org		PLAN.org
README.bak		README.bak
README.md		README.md
convolution_2D_basic.cu		convolution_2D_basic.cu
convolution_2D_cached_tiled.cu		convolution_2D_cached_tiled.cu
convolution_2D_const_mem.cu		convolution_2D_const_mem.cu
cublas_convolution.cu		cublas_convolution.cu
dog.jpg		dog.jpg
dog_array.txt		dog_array.txt
dog_array_shape.txt		dog_array_shape.txt
gaussian_kernel.txt		gaussian_kernel.txt
gemm_raw		gemm_raw
gemm_raw.cu		gemm_raw.cu
gemm_tensors.cu		gemm_tensors.cu
identity_kernel_array.txt		identity_kernel_array.txt
im2col		im2col
im2col.cu		im2col.cu
im2col.py		im2col.py
im2col_512_array_identity.txt		im2col_512_array_identity.txt
im2col_512_gaussian.txt		im2col_512_gaussian.txt
im2col_array.txt		im2col_array.txt
im2col_array_identity.txt		im2col_array_identity.txt
im2col_array_py.txt		im2col_array_py.txt
im2col_basic.py		im2col_basic.py
im2col_test.cu		im2col_test.cu
image_build.py		image_build.py
image_load.py		image_load.py
image_output_identity.jpg		image_output_identity.jpg
input_image_array.txt		input_image_array.txt
input_kernel_array.txt		input_kernel_array.txt
kernel_array.txt		kernel_array.txt
kernel_array_shape.txt		kernel_array_shape.txt
laplace_kernel.txt		laplace_kernel.txt
output_gemm_gaussian.txt		output_gemm_gaussian.txt
output_gemm_identity.txt		output_gemm_identity.txt
output_gemm_laplace.txt		output_gemm_laplace.txt
output_identify_kernel.txt		output_identify_kernel.txt
profile_raw_gemm.ncu-rep		profile_raw_gemm.ncu-rep
wmma_convolution.cu		wmma_convolution.cu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GPU-Optimized Convolution from Scratch

🚀 Features

Naive Implementation

Basic Optimizations

im2col Transformation

Tensor Core GEMM

🛠 Implementation Details

im2col

GEMM with Tensor Cores

Profiling & Testing

💻 Code Examples

im2col Kernel (CUDA)

GEMM with Tensor Cores (WMMA API)

📅 Project Plan & Status

[x] Implemented:

[ ] TODO:

📦 Usage

📚 References

🤝 Contact

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GPU-Optimized Convolution from Scratch

🚀 Features

Naive Implementation

Basic Optimizations

im2col Transformation

Tensor Core GEMM

🛠 Implementation Details

im2col

GEMM with Tensor Cores

Profiling & Testing

💻 Code Examples

im2col Kernel (CUDA)

GEMM with Tensor Cores (WMMA API)

📅 Project Plan & Status

[x] Implemented:

[ ] TODO:

📦 Usage

📚 References

🤝 Contact

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages