# TritonGPUOps ### `triton_gpu.alloc_tensor` (triton::gpu::AllocTensorOp) _Allocate tensor_ Syntax: ``` operation ::= `triton_gpu.alloc_tensor` attr-dict `:` type($result) ``` This operation defines a tensor of a particular shape. The contents of the tensor are supposed to be in shared memory. Note: This op can be repalced to a `bufferization.alloc_tensor` in LLVM 16. Traits: ResultsAreSharedEncoding, VerifyTensorLayoutsTrait Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface) Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource} #### Results: | Result | Description | | :----: | ----------- | | `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values ### `triton_gpu.async_bulk_commit_group` (triton::gpu::AsyncBulkCommitGroupOp) _Async bulk commit group_ Syntax: ``` operation ::= `triton_gpu.async_bulk_commit_group` attr-dict ``` Traits: VerifyTensorLayoutsTrait ### `triton_gpu.async_bulk_wait` (triton::gpu::AsyncBulkWaitOp) _Async bulk wait_ Syntax: ``` operation ::= `triton_gpu.async_bulk_wait` attr-dict ``` Traits: VerifyTensorLayoutsTrait #### Attributes:
AttributeMLIR TypeDescription
num::mlir::IntegerAttr32-bit signless integer attribute
### `triton_gpu.async_commit_group` (triton::gpu::AsyncCommitGroupOp) _Async commit group_ Syntax: ``` operation ::= `triton_gpu.async_commit_group` attr-dict ``` Traits: VerifyTensorLayoutsTrait ### `triton_gpu.async_wait` (triton::gpu::AsyncWaitOp) _Async wait_ Syntax: ``` operation ::= `triton_gpu.async_wait` attr-dict ``` Traits: VerifyTensorLayoutsTrait #### Attributes:
AttributeMLIR TypeDescription
num::mlir::IntegerAttr32-bit signless integer attribute
### `triton_gpu.convert_layout` (triton::gpu::ConvertLayoutOp) _Convert layout_ Syntax: ``` operation ::= `triton_gpu.convert_layout` $src attr-dict `:` functional-type(operands, results) ``` Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, SameOperandsAndResultShape, VerifyTensorLayoutsTrait Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface) Effects: MemoryEffects::Effect{} #### Operands: | Operand | Description | | :-----: | ----------- | | `src` | tensor of floating-point values or tensor of integer values or tensor of ptr values #### Results: | Result | Description | | :----: | ----------- | | `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values ### `triton_gpu.extract_slice` (triton::gpu::ExtractSliceOp) _Extract slice operation_ Syntax: ``` operation ::= `triton_gpu.extract_slice` $source `` custom($offsets, $static_offsets) custom($sizes, $static_sizes) custom($strides, $static_strides) attr-dict `:` type($source) `to` type($result) ``` same as tensor.extract_slice, but with int32 index. The motivations for re-implementing it are: We reimplement ExtractSliceOp with int32 index, because: - we want to enforce int32 indexing on GPUs since Triton tensors fit in SRAM - we still want to use indexWidth = 64 when lowering to LLVM because our loops can have 64-bit induction variables and scf.for uses indexType for bounds/ivs Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, ResultsAreSharedEncoding, VerifyTensorLayoutsTrait Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OffsetSizeAndStrideOpInterface Effects: MemoryEffects::Effect{} #### Attributes:
AttributeMLIR TypeDescription
static_offsets::mlir::DenseI64ArrayAttri64 dense array attribute
static_sizes::mlir::DenseI64ArrayAttri64 dense array attribute
static_strides::mlir::DenseI64ArrayAttri64 dense array attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `source` | ranked tensor of any type values | `offsets` | variadic of 32-bit signless integer | `sizes` | variadic of 32-bit signless integer | `strides` | variadic of 32-bit signless integer #### Results: | Result | Description | | :----: | ----------- | | `result` | ranked tensor of any type values ### `triton_gpu.insert_slice_async` (triton::gpu::InsertSliceAsyncOp) _Insert slice async_ This operation inserts a tensor `$src` into another tensor `$dst` as specified by the operation’s `$index` argument and `$axis` attribute. It returns a copy of `$dst` with the proper slice updated asynchronously with the value of `$src`. This operation is non-blocking, and `$results` will have the updated value after the corresponding async_wait. When converting from `tt.load` to `triton_gpu.insert_slice_async`, the `$evict`, `$cache`, and `$isVolatile` fields might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend, and `$evict` and `$isVolatile` are ignored because they apply to L1 cache only. The insert_slice_async operation supports the following arguments: * src: the tensor that is inserted. * dst: the tensor into which the `$src` tensor is inserted. * index: the index of the `$src` tensor at the given `$axis` from which the `$dst` tensor is inserted into * mask: optional tensor-rank number of boolean masks which specify which elements of the `$src` tensor are inserted into the `$dst` tensor. * other: optional tensor-rank number of other tensors which specify what values are inserted into the `$dst` tensor if the corresponding element of the `$mask` tensor is false. In the future, we may decompose this operation into a sequence of: * `async` operation to specify a sequence of asynchronous operations * `load` operation to load a tensor from global memory * `insert_slice` operations to insert the `$src` tensor into the `$dst` tensor Example: ``` %1 = triton_gpu.alloc_tensor : tensor<2x32xf32> %2 = triton_gpu.insert_slice_async %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr, #AL> -> tensor<2x32xf32, #A> triiton_gpu.async_wait { num = 0 : i32 } ``` Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding, VerifyTensorLayoutsTrait Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface) Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource} #### Attributes:
AttributeMLIR TypeDescription
cache::mlir::triton::CacheModifierAttr
allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6{{% markdown %}}Enum cases: * none (`NONE`) * ca (`CA`) * cg (`CG`) * wb (`WB`) * cs (`CS`) * wt (`WT`){{% /markdown %}}
evict::mlir::triton::EvictionPolicyAttr
allowed 32-bit signless integer cases: 1, 2, 3{{% markdown %}}Enum cases: * evict_normal (`NORMAL`) * evict_first (`EVICT_FIRST`) * evict_last (`EVICT_LAST`){{% /markdown %}}
isVolatile::mlir::BoolAttrbool attribute
axis::mlir::IntegerAttr32-bit signless integer attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `src` | ptr or tensor of ptr values | `dst` | tensor of floating-point values or tensor of integer values or tensor of ptr values | `index` | 32-bit signless integer | `mask` | tensor of 1-bit signless integer values | `other` | floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr #### Results: | Result | Description | | :----: | ----------- | | `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values ### `triton_gpu.insert_slice` (triton::gpu::InsertSliceOp) _Insert slice_ This operation inserts a tensor `$src` into another tensor `$dst` as specified by the operation’s `$index` argument and `$axis` attribute. It returns a copy of `$dst` with the proper slice updated with the value of `$src`. When converting from `tt.load` to `triton_gpu.insert_slice`, the `$evict`, `$cache`, and `$isVolatile` fields might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend, and `$evict` and `$isVolatile` are ignored because they apply to L1 cache only. The insert_slice operation supports the following arguments: * src: the tensor that is inserted. * dst: the tensor into which the `$src` tensor is inserted. * index: the index of the `$src` tensor at the given `$axis` from which the `$dst` tensor is inserted into * mask: optional tensor-rank number of boolean masks which specify which elements of the `$src` tensor are inserted into the `$dst` tensor. * other: optional tensor-rank number of other tensors which specify what values are inserted into the `$dst` tensor if the corresponding element of the `$mask` tensor is false. ttgpu.load_tile_async depracate triton_gpu.insert_slice might be further lowered into triton_gpu_async for different hardware implementations like tt.load, ttgpu.insert_slice/insert_slice_async has two modes up to the type of src mode 1: ptr/src is a tensor of pointers mode 2: ptr/src is a tensor pointer Some typical lowering paths are: in case the load is pipelined by the pipeline pass( load is inside kBlock loop, which means "pipeline pass): Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1) Non-bulk cp.async : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1) -(MaterializeLoad)> ttgpu.insert_slice_async(mode 1) + ttgpu.await-> llvm TMA load : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -(Pipeline)-> ttgpu.insert_slice(mode 2) -(MaterializeLoad)> ttgpu.insert_slice_async_v2(mode 2) + ttgpu.await-> llvm otherwise: Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) Non-bulk cp.async : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -> ... -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 1) + ttgpu.await -> llvm TMA load : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -> ... -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 2) + ttgpu.await -> llvm Example: ``` %1 = triton_gpu.alloc_tensor : tensor<2x32xf32> %2 = triton_gpu.insert_slice %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr, #AL> -> tensor<2x32xf32, #A> ``` Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding, VerifyTensorLayoutsTrait Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface) Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource} #### Attributes:
AttributeMLIR TypeDescription
cache::mlir::triton::CacheModifierAttr
allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6{{% markdown %}}Enum cases: * none (`NONE`) * ca (`CA`) * cg (`CG`) * wb (`WB`) * cs (`CS`) * wt (`WT`){{% /markdown %}}
evict::mlir::triton::EvictionPolicyAttr
allowed 32-bit signless integer cases: 1, 2, 3{{% markdown %}}Enum cases: * evict_normal (`NORMAL`) * evict_first (`EVICT_FIRST`) * evict_last (`EVICT_LAST`){{% /markdown %}}
isVolatile::mlir::BoolAttrbool attribute
axis::mlir::IntegerAttr32-bit signless integer attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `src` | ptr or tensor of ptr values | `dst` | tensor of floating-point values or tensor of integer values or tensor of ptr values | `index` | 32-bit signless integer | `mask` | tensor of 1-bit signless integer values | `other` | floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr #### Results: | Result | Description | | :----: | ----------- | | `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values