Minimal GPU runtime for Python: high-performance CUDA kernels, memory management, and LLM inference without heavy dependencies
Cross-platform Triton implementation of FlashAttention-2 for Turing and newer GPUs, with a custom configuration mode