A Python DSL to write NVIDIA PTX for Hopper and Blackwell in JAX and PyTorch
Minimal GPU runtime for Python: high-performance CUDA kernels, memory management, and LLM inference without heavy dependencies
Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode