Minimal GPU runtime for Python: high-performance CUDA kernels, memory management, and LLM inference without heavy dependencies