Advanced CUDA
CUDA vs cuBLAS vs cuBLASLt vs CUTLASS vs CuTe vs CuTeDSL vs Triton
These tools sit at different abstraction levels of the NVIDIA-centric GPU-kernel ecosystem. Picking the right one is less about "which is best" and more about how much control you need, from calling a tuned library kernel to building custom Tensor Core pipelines from scratch.
Quick mental model
- Use cuBLAS when your workload maps to standard BLAS operations and you want fast, stable defaults.
- Use cuBLASLt when you need more GEMM control (layouts, epilogues, algorithm selection).
- Use CUTLASS when you need custom GEMM/convolution kernels beyond library call boundaries.
- Use CuTe and CuTeDSL when you want to express tile layouts and kernel structure with the building blocks of modern CUTLASS.
- Use Triton when you want productive Python-based custom kernels, especially for deep learning operators.
- Use raw CUDA when you need total control for non-library algorithms or highly specialized kernels.
What each option is
CUDA
The base platform and programming model. You write kernels, manage memory, schedule work, and optimize manually. Maximum flexibility, maximum responsibility.
cuBLAS
NVIDIA's highly optimized BLAS library. Great for GEMM and classic dense linear algebra when standard interfaces are enough.
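To make "standard interfaces" concrete: cuBLAS follows the BLAS convention of column-major storage, and its GEMM computes C = alpha * op(A) * op(B) + beta * C. The block below is a plain-Python reference for the no-transpose case, not a call into cuBLAS. The function name `gemm_colmajor` and its argument order are assumptions made for illustration.

```python
def gemm_colmajor(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference for C = alpha * A @ B + beta * C on column-major flat buffers.

    A is m x k, B is k x n, C is m x n. Element (i, j) of a matrix with
    leading dimension ld lives at flat index i + j * ld.
    This is a hypothetical helper, not the cuBLAS API.
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]
    return c


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], each stored column by column.
a = [1.0, 3.0, 2.0, 4.0]
b = [5.0, 7.0, 6.0, 8.0]
c = [0.0] * 4
gemm_colmajor(2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
# c now holds [[19, 22], [43, 50]] in column-major order.
```

If a workload fits this shape (possibly with transposes or batching), the library call is usually the right starting point.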
cuBLASLt
A newer, GEMM-focused API that ships alongside cuBLAS and exposes more configuration. It supports flexible data layouts, fused epilogues, and heuristic-driven algorithm selection.
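A fused epilogue applies extra work, such as a bias add and an activation, to each accumulator before it is written out, so the raw product never makes a round trip through memory. The sketch below shows that idea in plain Python. It does not call cuBLASLt, the name `matmul_bias_relu_fused` is hypothetical, and the per-output-column bias is an assumption chosen for the example.

```python
def matmul_bias_relu_fused(a, b, bias):
    """Compute relu(A @ B + bias) with the epilogue applied per output element.

    a is m x k and b is k x n, both as nested lists. bias has one entry per
    output column (an assumption made for this sketch).
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Epilogue: bias add and ReLU run on the accumulator before the
            # single store, so the unactivated product is never written out.
            out[i][j] = max(acc + bias[j], 0.0)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B == A
result = matmul_bias_relu_fused(a, b, bias=[0.5, 0.5])
```

An unfused version would store A @ B, then launch a second pass to add the bias and clamp, paying for an extra read and write of the whole output.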
CUTLASS
A CUDA C++ template library for building high-performance matrix kernels. It exposes hierarchical tiling and Tensor Core pipelines while still compiling to native CUDA kernels.
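The hierarchical tiling idea can be sketched without any GPU code: split the output into tiles, then walk the K dimension in slices and accumulate into each tile. The block below is a plain-Python illustration of that loop structure only. The function names and tile sizes are assumptions, and nothing here reflects actual CUTLASS templates.

```python
def naive_matmul(a, b):
    """Straightforward triple loop, used as the reference result."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiled_matmul(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Same product, computed tile by tile (hypothetical tile sizes)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):          # each (i0, j0) pair is one output
        for j0 in range(0, n, tile_n):      # tile, the unit a thread block owns
            for k0 in range(0, k, tile_k):  # main loop over slices of K
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


# Dimensions deliberately not multiples of the tile sizes.
a = [[float(i + p) for p in range(5)] for i in range(3)]
b = [[float(p * j + 1) for j in range(4)] for p in range(5)]
tiled = tiled_matmul(a, b)
reference = naive_matmul(a, b)
```

On a GPU, each level of this nest maps to a level of the hardware hierarchy (thread block, warp, Tensor Core instruction), and the K loop is where data staging is pipelined.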
CuTe
A layout and tensor algebra layer used in modern CUTLASS development. It provides composable abstractions for shapes, tiling, and mapping data and computation across hardware hierarchy.
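At its core, a layout pairs a shape with strides and maps a logical coordinate to a memory offset by an inner product of coordinate and stride. The sketch below shows that mapping for flat (non-nested) layouts in plain Python. `layout_offset` is a hypothetical helper, not part of CuTe, and real CuTe layouts also support nested shapes that this example leaves out.

```python
def layout_offset(coord, shape, stride):
    """Map a logical coordinate to a linear offset: sum(coord[i] * stride[i])."""
    assert len(coord) == len(shape) == len(stride)
    for c, s in zip(coord, shape):
        assert 0 <= c < s, "coordinate out of bounds"
    return sum(c * d for c, d in zip(coord, stride))


shape = (4, 8)           # 4 rows, 8 columns
col_major = (1, 4)       # stepping one row moves 1 element, one column moves 4
row_major = (8, 1)       # stepping one row moves 8 elements, one column moves 1

col_offset = layout_offset((2, 3), shape, col_major)  # 2*1 + 3*4
row_offset = layout_offset((2, 3), shape, row_major)  # 2*8 + 3*1
```

Because the same logical coordinate works with either stride tuple, code written against the layout does not change when the memory order does.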
CuTeDSL
A higher-level, Python-based domain-specific language built around CuTe concepts (introduced with CUTLASS 4.x). It lets you express kernels with less low-level boilerplate while preserving strong control over tiling and pipeline structure.
Triton
An open-source language and compiler (a Python-embedded DSL) for writing GPU kernels. It is popular for custom deep learning ops and rapid kernel iteration.
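Triton kernels are written as block programs: each program instance handles one block of elements, derives its offsets from a program id, and masks out-of-range lanes. The block below emulates that structure in plain Python so it runs without a GPU. It is not Triton code, and `add_kernel` and `launch_add` are hypothetical names. In real Triton the corresponding pieces are `tl.program_id`, `tl.arange`, and masked `tl.load` and `tl.store`.

```python
def add_kernel(x, y, out, n, pid, block_size):
    """One emulated program instance: handles block_size contiguous elements."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n for off in offsets]
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:  # the mask guards the ragged final block
            out[off] = x[off] + y[off]


def launch_add(x, y, block_size=4):
    """Emulated launch: one program instance per block of the input."""
    n = len(x)
    out = [0.0] * n
    grid = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(grid):  # a GPU would run these instances in parallel
        add_kernel(x, y, out, n, pid, block_size)
    return out


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
total = launch_add(x, y)  # 10 elements, block size 4, so 3 program instances
```

The appeal of the model is that the author reasons about blocks rather than individual threads, and the compiler decides how each block maps onto the hardware.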
Control vs productivity spectrum
From easiest integration to deepest control, a practical ordering is often:
- cuBLAS
- cuBLASLt
- Triton
- CUTLASS
- CuTe and CuTeDSL
- Raw CUDA
This is not a strict ranking. For some teams, Triton is faster to deliver than CUTLASS; for others, existing C++ infrastructure makes CUTLASS the better productivity path.
Choosing by use case
1) Standard dense linear algebra
Start with cuBLAS. It is battle-tested, fast, and simple to integrate.
2) GEMM plus fusion and layout constraints
Try cuBLASLt first. Many practical transformer-style matmul patterns can be handled without writing custom kernels.
3) Custom operator with irregular memory-access and computation patterns
Use Triton or CUDA. Triton is usually faster to iterate on, while CUDA gives lower-level control when you need to dictate exactly how threads, memory, and synchronization behave.
4) Peak-performance custom GEMM-family kernel
Use CUTLASS, increasingly together with CuTe and CuTeDSL, for precise control of tiling, pipelining, and Tensor Core usage.
Performance and maintenance trade-offs
- Libraries (cuBLAS and cuBLASLt): fastest path to reliable speed, lowest maintenance.
- Triton: high iteration speed and strong results for custom deep learning kernels, though compiler maturity varies by pattern.
- CUTLASS and CuTe stack: excellent performance headroom, higher C++ template complexity.
- Raw CUDA: unmatched flexibility, highest implementation and tuning burden.
Practical decision checklist
- Can the problem be expressed as standard BLAS? Start with cuBLAS.
- Need advanced GEMM knobs or fused epilogues? Move to cuBLASLt.
- Need a custom op and fast experimentation? Prototype in Triton.
- Need maximum kernel-level control for GEMM-like workloads? Move to CUTLASS plus CuTe and CuTeDSL.
- Need full generality beyond these abstractions? Write CUDA kernels directly.
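The checklist above can be read as a first-match-wins decision procedure. The sketch below encodes it that way in Python. The function name `choose_tool` and its boolean flags are hypothetical, and real decisions will weigh more factors than four yes/no answers.

```python
def choose_tool(standard_blas=False, needs_gemm_knobs=False,
                wants_fast_prototyping=False, gemm_like_max_control=False):
    """Walk the checklist top to bottom and return the first match."""
    if standard_blas and not needs_gemm_knobs:
        return "cuBLAS"
    if needs_gemm_knobs:
        return "cuBLASLt"
    if wants_fast_prototyping:
        return "Triton"
    if gemm_like_max_control:
        return "CUTLASS + CuTe/CuTeDSL"
    return "CUDA"  # full generality beyond the other abstractions


plain_gemm = choose_tool(standard_blas=True)
fused_gemm = choose_tool(standard_blas=True, needs_gemm_knobs=True)
custom_op = choose_tool(wants_fast_prototyping=True)
```

The order of the checks mirrors the order of the list: the procedure only reaches a lower-level tool after the higher-level ones have been ruled out.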
Bottom line
Think in layers, not competitors. cuBLAS and cuBLASLt are optimized libraries; CUTLASS, CuTe, and CuTeDSL are kernel-construction frameworks; Triton is a productive kernel DSL; and CUDA is the foundational platform under all of them. The best choice is the smallest abstraction that still gives you the control your workload requires.