LLM Inference Optimization Engine
A from-scratch inference engine for transformer-based language models, implementing
KV caching, paged attention, speculative decoding, FlashAttention, and low-precision
matrix multiplication to maximize throughput and reduce latency.
CUDA · Python · C++ · FlashAttention · Quantization
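The KV-caching technique named above can be sketched in a few lines: during autoregressive decoding, the keys and values of past tokens are computed once, stored, and reused at every step, so each new token costs one attention pass over the cache rather than a full recompute of the sequence. This is a minimal single-head sketch in NumPy; the projection step is stubbed out (real models apply learned Q/K/V projections), and all names here are illustrative, not taken from the project's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)

def attend(q, K, V):
    # Scaled dot-product attention of one query over all cached positions.
    scores = (K @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V

# KV cache: grows by one (key, value) row per decoded token.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(4):
    x = rng.normal(size=d)   # new token's hidden state (stand-in)
    q, k, v = x, x, x        # real models use learned Q/K/V projections
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    # O(t) work per step, instead of re-running attention over all t tokens.
    out = attend(q, K_cache, V_cache)

print(K_cache.shape)  # → (4, 8)
```

Paged attention extends this idea by storing the cache in fixed-size blocks so memory for many concurrent sequences can be allocated and freed like virtual-memory pages.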
Machine Learning Projects
A collection of ML experiments spanning classification, data mining, and model
evaluation, exploring techniques from traditional statistical learning through
deep neural networks.
PyTorch · Python · scikit-learn · Data Mining