# TritonGPUOps<!-- Autogenerated by mlir-tblgen; don't manually edit -->
### `triton_gpu.alloc_tensor` (triton::gpu::AllocTensorOp)

_Allocate tensor_


Syntax:

```
operation ::= `triton_gpu.alloc_tensor` attr-dict `:` type($result)
```

This operation defines a tensor of a particular shape.
The contents of the tensor are supposed to be in shared memory.

Note: This op can be repalced to a `bufferization.alloc_tensor` in LLVM 16.

Traits: ResultsAreSharedEncoding, VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource}

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values

### `triton_gpu.async_bulk_commit_group` (triton::gpu::AsyncBulkCommitGroupOp)

_Async bulk commit group_


Syntax:

```
operation ::= `triton_gpu.async_bulk_commit_group` attr-dict
```


Traits: VerifyTensorLayoutsTrait

### `triton_gpu.async_bulk_wait` (triton::gpu::AsyncBulkWaitOp)

_Async bulk wait_


Syntax:

```
operation ::= `triton_gpu.async_bulk_wait` attr-dict
```


Traits: VerifyTensorLayoutsTrait

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>num</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

### `triton_gpu.async_commit_group` (triton::gpu::AsyncCommitGroupOp)

_Async commit group_


Syntax:

```
operation ::= `triton_gpu.async_commit_group` attr-dict
```


Traits: VerifyTensorLayoutsTrait

### `triton_gpu.async_wait` (triton::gpu::AsyncWaitOp)

_Async wait_


Syntax:

```
operation ::= `triton_gpu.async_wait` attr-dict
```


Traits: VerifyTensorLayoutsTrait

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>num</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

### `triton_gpu.convert_layout` (triton::gpu::ConvertLayoutOp)

_Convert layout_


Syntax:

```
operation ::= `triton_gpu.convert_layout` $src attr-dict `:` functional-type(operands, results)
```


Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, SameOperandsAndResultShape, VerifyTensorLayoutsTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `src` | tensor of floating-point values or tensor of integer values or tensor of ptr values

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values

### `triton_gpu.extract_slice` (triton::gpu::ExtractSliceOp)

_Extract slice operation_


Syntax:

```
operation ::= `triton_gpu.extract_slice` $source ``
              custom<DynamicIndexList>($offsets, $static_offsets)
              custom<DynamicIndexList>($sizes, $static_sizes)
              custom<DynamicIndexList>($strides, $static_strides)
              attr-dict `:` type($source) `to` type($result)
```

same as tensor.extract_slice, but with int32 index. The motivations for re-implementing it are:
We reimplement ExtractSliceOp with int32 index, because:
- we want to enforce int32 indexing on GPUs since Triton tensors fit in SRAM
- we still want to use indexWidth = 64 when lowering to LLVM because our loops can have
  64-bit induction variables and scf.for uses indexType for bounds/ivs

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, ResultsAreSharedEncoding, VerifyTensorLayoutsTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OffsetSizeAndStrideOpInterface

Effects: MemoryEffects::Effect{}

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>static_offsets</code></td><td>::mlir::DenseI64ArrayAttr</td><td>i64 dense array attribute</td></tr>
<tr><td><code>static_sizes</code></td><td>::mlir::DenseI64ArrayAttr</td><td>i64 dense array attribute</td></tr>
<tr><td><code>static_strides</code></td><td>::mlir::DenseI64ArrayAttr</td><td>i64 dense array attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `source` | ranked tensor of any type values
| `offsets` | variadic of 32-bit signless integer
| `sizes` | variadic of 32-bit signless integer
| `strides` | variadic of 32-bit signless integer

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of any type values

### `triton_gpu.insert_slice_async` (triton::gpu::InsertSliceAsyncOp)

_Insert slice async_

This operation inserts a tensor `$src` into another tensor `$dst` as specified by the operation’s
`$index` argument and `$axis` attribute.

It returns a copy of `$dst` with the proper slice updated asynchronously with the value of `$src`.
This operation is non-blocking, and `$results` will have the updated value after the corresponding async_wait.

When converting from `tt.load` to `triton_gpu.insert_slice_async`, the `$evict`, `$cache`, and `$isVolatile` fields
might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend,
and `$evict` and `$isVolatile` are ignored because they apply to L1 cache only.

The insert_slice_async operation supports the following arguments:

* src: the tensor that is inserted.
* dst: the tensor into which the `$src` tensor is inserted.
* index: the index of the `$src` tensor at the given `$axis` from which the `$dst` tensor is inserted into
* mask: optional tensor-rank number of boolean masks which specify which
        elements of the `$src` tensor are inserted into the `$dst` tensor.
* other: optional tensor-rank number of other tensors which specify what
        values are inserted into the `$dst` tensor if the corresponding
        element of the `$mask` tensor is false.

In the future, we may decompose this operation into a sequence of:

* `async` operation to specify a sequence of asynchronous operations
* `load` operation to load a tensor from global memory
* `insert_slice` operations to insert the `$src` tensor into the `$dst` tensor

Example:

```
%1 = triton_gpu.alloc_tensor : tensor<2x32xf32>
%2 = triton_gpu.insert_slice_async %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr<f32>, #AL> -> tensor<2x32xf32, #A>
triiton_gpu.async_wait { num = 0 : i32 }
```

Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding, VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>cache</code></td><td>::mlir::triton::CacheModifierAttr</td><td><details><summary>allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6</summary>{{% markdown %}}Enum cases:
* none (`NONE`)
* ca (`CA`)
* cg (`CG`)
* wb (`WB`)
* cs (`CS`)
* wt (`WT`){{% /markdown %}}</details></td></tr>
<tr><td><code>evict</code></td><td>::mlir::triton::EvictionPolicyAttr</td><td><details><summary>allowed 32-bit signless integer cases: 1, 2, 3</summary>{{% markdown %}}Enum cases:
* evict_normal (`NORMAL`)
* evict_first (`EVICT_FIRST`)
* evict_last (`EVICT_LAST`){{% /markdown %}}</details></td></tr>
<tr><td><code>isVolatile</code></td><td>::mlir::BoolAttr</td><td>bool attribute</td></tr>
<tr><td><code>axis</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `src` | ptr or tensor of ptr values
| `dst` | tensor of floating-point values or tensor of integer values or tensor of ptr values
| `index` | 32-bit signless integer
| `mask` | tensor of 1-bit signless integer values
| `other` | floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values

### `triton_gpu.insert_slice` (triton::gpu::InsertSliceOp)

_Insert slice_

This operation inserts a tensor `$src` into another tensor `$dst` as specified by the operation’s
`$index` argument and `$axis` attribute.

It returns a copy of `$dst` with the proper slice updated with the value of `$src`.

When converting from `tt.load` to `triton_gpu.insert_slice`, the `$evict`, `$cache`, and `$isVolatile` fields
might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend,
and `$evict` and `$isVolatile` are ignored because they apply to L1 cache only.

The insert_slice operation supports the following arguments:

* src: the tensor that is inserted.
* dst: the tensor into which the `$src` tensor is inserted.
* index: the index of the `$src` tensor at the given `$axis` from which the `$dst` tensor is inserted into
* mask: optional tensor-rank number of boolean masks which specify which
        elements of the `$src` tensor are inserted into the `$dst` tensor.
* other: optional tensor-rank number of other tensors which specify what
        values are inserted into the `$dst` tensor if the corresponding
        element of the `$mask` tensor is false.

ttgpu.load_tile_async depracate
triton_gpu.insert_slice might be further lowered into triton_gpu_async for different hardware implementations

like tt.load, ttgpu.insert_slice/insert_slice_async has two modes up to the type of src
mode 1: ptr/src is a tensor of pointers
mode 2: ptr/src is a tensor pointer

Some typical lowering paths are:
in case the load is pipelined by the pipeline pass( load is inside kBlock loop, which means "pipeline pass):
  Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1)
  Non-bulk cp.async                  : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1) -(MaterializeLoad)> ttgpu.insert_slice_async(mode 1) + ttgpu.await-> llvm
  TMA load : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -(Pipeline)-> ttgpu.insert_slice(mode 2) -(MaterializeLoad)> ttgpu.insert_slice_async_v2(mode 2) + ttgpu.await-> llvm

otherwise:
  Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1)
  Non-bulk cp.async                  : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -> ... -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 1) + ttgpu.await -> llvm
  TMA load                           : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -> ... -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 2) + ttgpu.await -> llvm

Example:

```
%1 = triton_gpu.alloc_tensor : tensor<2x32xf32>
%2 = triton_gpu.insert_slice %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr<f32>, #AL> -> tensor<2x32xf32, #A>
```

Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding, VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>cache</code></td><td>::mlir::triton::CacheModifierAttr</td><td><details><summary>allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6</summary>{{% markdown %}}Enum cases:
* none (`NONE`)
* ca (`CA`)
* cg (`CG`)
* wb (`WB`)
* cs (`CS`)
* wt (`WT`){{% /markdown %}}</details></td></tr>
<tr><td><code>evict</code></td><td>::mlir::triton::EvictionPolicyAttr</td><td><details><summary>allowed 32-bit signless integer cases: 1, 2, 3</summary>{{% markdown %}}Enum cases:
* evict_normal (`NORMAL`)
* evict_first (`EVICT_FIRST`)
* evict_last (`EVICT_LAST`){{% /markdown %}}</details></td></tr>
<tr><td><code>isVolatile</code></td><td>::mlir::BoolAttr</td><td>bool attribute</td></tr>
<tr><td><code>axis</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `src` | ptr or tensor of ptr values
| `dst` | tensor of floating-point values or tensor of integer values or tensor of ptr values
| `index` | 32-bit signless integer
| `mask` | tensor of 1-bit signless integer values
| `other` | floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | tensor of floating-point values or tensor of integer values or tensor of ptr values
Attribute	MLIR Type	Description
`static_offsets`	::mlir::DenseI64ArrayAttr	i64 dense array attribute
`static_sizes`	::mlir::DenseI64ArrayAttr	i64 dense array attribute
`static_strides`	::mlir::DenseI64ArrayAttr	i64 dense array attribute
Attribute	MLIR Type	Description
`cache`	::mlir::triton::CacheModifierAttr	allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6 {{% markdown %}}Enum cases: * none (`NONE`) * ca (`CA`) * cg (`CG`) * wb (`WB`) * cs (`CS`) * wt (`WT`){{% /markdown %}}
`evict`	::mlir::triton::EvictionPolicyAttr	allowed 32-bit signless integer cases: 1, 2, 3 {{% markdown %}}Enum cases: * evict_normal (`NORMAL`) * evict_first (`EVICT_FIRST`) * evict_last (`EVICT_LAST`){{% /markdown %}}
`isVolatile`	::mlir::BoolAttr	bool attribute
`axis`	::mlir::IntegerAttr	32-bit signless integer attribute