CUDA vs cuBLAS vs cuBLASLt vs CUTLASS vs CuTe vs CuTeDSL vs Triton

These tools live at different abstraction levels in the NVIDIA and GPU-kernel ecosystem. Picking the right one is less about "which is best" and more about where you need control: from calling a tuned library kernel to building custom Tensor Core pipelines from scratch.

Quick mental model

What each option is

CUDA

The base platform and programming model. You write kernels, manage memory, schedule work, and optimize manually. Maximum flexibility, maximum responsibility.

cuBLAS

NVIDIA's highly optimized BLAS library. Great for GEMM and classic dense linear algebra when standard interfaces are enough.
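For reference, the operation a standard GEMM call computes can be sketched in plain Python. This is a minimal illustration of C = alpha * A * B + beta * C over flat column-major buffers with leading dimensions, the storage convention cuBLAS inherits from Fortran BLAS. The function name `gemm_colmajor` and its argument order are assumptions made for this example, not the cuBLAS signature.

```python
def gemm_colmajor(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference C = alpha * A @ B + beta * C on flat column-major lists.

    A is m x k, B is k x n, C is m x n. Element (i, j) of a matrix with
    leading dimension ld is stored at index i + j * ld.
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]
    return c


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column by column.
A = [1.0, 3.0, 2.0, 4.0]
B = [5.0, 7.0, 6.0, 8.0]
C = [0.0] * 4
gemm_colmajor(2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2)
# C now holds [[19, 22], [43, 50]] in column-major order.
```

If your problem already has this shape, a library call is likely to beat anything hand-written.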

cuBLASLt

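A lightweight, GEMM-focused library that ships alongside cuBLAS and exposes a more configurable API. It supports flexible data layouts, fused epilogues (for example bias and activation applied as part of the matmul), and heuristic-driven algorithm selection.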
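To make "fused epilogue" concrete, here is a small plain-Python sketch, not cuBLASLt code. The fused version applies bias and ReLU while each output element is still in the accumulator, so the output is written once. The unfused version makes three passes over the output. The function names and the per-column bias convention are assumptions made for illustration.

```python
def matmul_bias_relu_fused(a, b, bias):
    """Row-major lists of lists; bias has one entry per output column."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            # Fused epilogue: bias and ReLU applied before the single store.
            out[i][j] = max(acc + bias[j], 0.0)
    return out


def matmul_bias_relu_unfused(a, b, bias):
    """Same math as three separate passes: matmul, bias add, ReLU."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
           for i in range(m)]
    out = [[x + bias[j] for j, x in enumerate(row)] for row in out]
    return [[max(x, 0.0) for x in row] for row in out]


A = [[1.0, -2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]
fused = matmul_bias_relu_fused(A, B, bias)
unfused = matmul_bias_relu_unfused(A, B, bias)
```

Both paths produce the same values. On a GPU the fused form avoids extra round trips of the output through global memory, which is the main reason fusion matters.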

CUTLASS

A CUDA C++ template library for building high-performance matrix kernels. It exposes hierarchical tiling and Tensor Core pipelines while still compiling to native CUDA kernels.
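The tiling idea can be sketched in plain Python. This is an illustration of the loop structure only, not CUTLASS code: the outer loops pick an output tile (roughly the role of a threadblock), and the inner mainloop walks the K dimension one tile at a time. The function name `tiled_matmul` and the tile sizes are assumptions chosen for the example.

```python
def tiled_matmul(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Row-major lists of lists; computes a @ b tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # One iteration per output tile (what a threadblock would own on a GPU).
    for m0 in range(0, m, tile_m):
        for n0 in range(0, n, tile_n):
            # Mainloop: step along K in tiles, accumulating into the output tile.
            for k0 in range(0, k, tile_k):
                for i in range(m0, min(m0 + tile_m, m)):
                    for j in range(n0, min(n0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = tiled_matmul(A, B)
```

The `min(...)` bounds handle matrix sizes that are not multiples of the tile size. In a real kernel each level of this hierarchy maps to a hardware level (threadblock, warp, Tensor Core instruction), which is the part CUTLASS templates parameterize.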

CuTe

A layout and tensor algebra layer used in modern CUTLASS development. It provides composable abstractions for shapes, tiling, and mapping data and computation across hardware hierarchy.
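The core layout idea is small enough to sketch in plain Python: a layout pairs a shape with a stride, and maps a logical coordinate to a linear offset by taking the inner product of coordinate and stride. This is an illustration of the concept, not the CuTe API. The helper names `layout_index` and `layout_table` are hypothetical.

```python
from itertools import product


def layout_index(coord, stride):
    """Map a logical coordinate to a linear offset: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_table(shape, stride):
    """Enumerate every coordinate in `shape` with its offset under `stride`."""
    coords = product(*(range(extent) for extent in shape))
    return {coord: layout_index(coord, stride) for coord in coords}


shape = (2, 3)
col_major = layout_table(shape, stride=(1, 2))  # rows are contiguous in memory
row_major = layout_table(shape, stride=(3, 1))  # columns are contiguous in memory
```

The same logical coordinate `(0, 1)` lands at offset 2 in the column-major layout and offset 1 in the row-major one. Swapping the stride changes the memory mapping without touching the code that indexes by coordinate, which is what makes layouts composable.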

CuTeDSL

A Python-embedded domain-specific language built on CuTe concepts, introduced with CUTLASS 4. It lets you express kernels with less C++ template boilerplate while keeping explicit control over tiling and pipeline structure.

Triton

An open-source, Python-embedded language and compiler for writing GPU kernels. It is popular for custom deep learning ops and rapid kernel iteration.
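Triton kernels are typically written at the block level: each program instance handles a block of element offsets and masks out lanes that fall past the end of the data. The sketch below emulates that structure in plain Python so it runs without a GPU. It is not Triton code, and the names `add_program` and `launch_add` are hypothetical.

```python
def add_program(pid, x, y, out, n, block_size):
    """One program instance: handles offsets [pid * block_size, (pid + 1) * block_size)."""
    offsets = [pid * block_size + lane for lane in range(block_size)]
    mask = [off < n for off in offsets]  # guard the ragged last block
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:
            out[off] = x[off] + y[off]


def launch_add(x, y, block_size=4):
    """Run one program instance per block; a GPU would run these in parallel."""
    n = len(x)
    out = [0.0] * n
    grid = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(grid):
        add_program(pid, x, y, out, n, block_size)
    return out


result = launch_add([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0])
```

With five elements and a block size of four, the grid has two program instances, and the second one has three of its four lanes masked off.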

Control vs productivity spectrum

From easiest integration to deepest control, a practical ordering is often:

  1. cuBLAS
  2. cuBLASLt
  3. Triton
  4. CUTLASS
  5. CuTe and CuTeDSL
  6. Raw CUDA

This is not a strict ranking. For some teams, Triton is faster to deliver than CUTLASS; for others, existing C++ infrastructure makes CUTLASS the better productivity path.

Choosing by use case

1) Standard dense linear algebra

Start with cuBLAS. It is battle-tested, fast, and simple to integrate.

2) GEMM plus fusion and layout constraints

Try cuBLASLt first. Many practical transformer-style matmul patterns can be handled without writing custom kernels.

3) Custom operator with irregular memory and computation pattern

Use Triton or CUDA. Triton is usually faster to iterate on, while CUDA gives you explicit control over threads, shared memory, and synchronization when the kernel depends on it.

4) Peak-performance custom GEMM-family kernel

Use CUTLASS, increasingly through its CuTe and CuTeDSL layers, for precise control of tiling, pipelining, and Tensor Core usage.

Performance and maintenance trade-offs

Practical decision checklist

  1. Can the problem be expressed as standard BLAS? Start with cuBLAS.
  2. Need advanced GEMM knobs or fused epilogues? Move to cuBLASLt.
  3. Need a custom op and fast experimentation? Prototype in Triton.
  4. Need maximum kernel-level control for GEMM-like workloads? Move to CUTLASS plus CuTe and CuTeDSL.
  5. Need full generality beyond these abstractions? Write CUDA kernels directly.
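The checklist above can be read as a first-match decision function. The sketch below is one way to encode it. The function name, the flag names, and the first-match policy are assumptions made for illustration, and real projects often weigh more factors than four booleans.

```python
def pick_tool(standard_blas=False, gemm_knobs_or_fusion=False,
              custom_op_fast_iteration=False, max_control_gemm_like=False):
    """Walk the checklist top to bottom and return the first tool that fits."""
    if standard_blas:
        return "cuBLAS"
    if gemm_knobs_or_fusion:
        return "cuBLASLt"
    if custom_op_fast_iteration:
        return "Triton"
    if max_control_gemm_like:
        return "CUTLASS + CuTe/CuTeDSL"
    # Nothing above fits: full generality means writing CUDA kernels directly.
    return "CUDA"


choice_plain = pick_tool(standard_blas=True)
choice_fused = pick_tool(gemm_knobs_or_fusion=True)
choice_custom = pick_tool(custom_op_fast_iteration=True, max_control_gemm_like=True)
choice_fallback = pick_tool()
```

Because the checks run in checklist order, a request that sets several flags resolves to the smallest abstraction that matches.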

Bottom line

Think in layers, not competitors. cuBLAS and cuBLASLt are optimized libraries; CUTLASS, CuTe, and CuTeDSL are kernel-construction frameworks; Triton is a productive kernel DSL; and CUDA is the foundational platform under all of them. The best choice is the smallest abstraction that still gives you the control your workload requires.
