Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode