Summary:
Inductor has the following configurations:
config.comprehensive_padding
config.padding_alignment_bytes
config.padding_stride_threshold
With static shapes, enabling these three options makes Inductor generate code for Flexible layout tensors that tries to pad every stride above config.padding_stride_threshold up to a multiple of config.padding_alignment_bytes. When dynamic shapes are enabled, no padding is done today.
This PR introduces the following configuration, which lets the user request padded strides even for dynamic shape operations. It is a separate opt-in so we don't break the previous behaviour of not padding dynamic shape use cases. config.padding_stride_threshold does not apply here, since the values of the strides are dynamic.
config.pad_dynamic_shapes
In addition, a new mode "python_slow" has been added to the launch grid calculation. It achieves the same ceildiv behaviour in a form that is generally applicable to integer division. This is done to prevent test regressions and make wrapper_fxir codegen more generic.
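As a rough illustration of the ceildiv behaviour mentioned above, the sketch below shows two common integer-only ways to compute a ceiling division in Python and checks that they agree for positive block sizes. The function names are hypothetical, and which expression the "python_slow" mode actually emits is not stated here.

```python
def ceildiv_negated(numel: int, block: int) -> int:
    # Floor-divide by the negated block, then negate the result.
    return -(numel // -block)


def ceildiv_offset(numel: int, block: int) -> int:
    # Add (block - 1) before floor division; plain integer arithmetic.
    return (numel + block - 1) // block


# Both forms agree for non-negative numel and positive block.
for numel in range(0, 200):
    for block in range(1, 17):
        assert ceildiv_negated(numel, block) == ceildiv_offset(numel, block)
```

The second form only relies on addition and floor division, so it is a plausible candidate for a "generally applicable" expression, but that reading is an assumption.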
Test Plan:
CI
Rollback Plan:
Differential Revision: D80468808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997
Approved by: https://github.com/blaine-rister, https://github.com/jansel
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" names for fp32 internal computation data types. Instead, we will name the algorithm directly.
### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
- The names are more informative. 'tf32' is more informative than a simple "high".
- Easier to extend with new algorithms like `tf32x3`.
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms, which the new names do not. However, we can add documentation to cover this.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### The following fp32 compute precision values can be set:
- **"ieee"**: Not allowed to use any other internal computation data type.
- **"tf32"**: Allowed to use tf32 as the internal computation data type.
- **"bf16"**: Allowed to use bf16 as the internal computation data type.
- **"none"**: Precision is not set. It can be overridden by its parent node.
### Overriding Precision Settings
A child node is overridden by its parent node if the child is left at its default ("none").
For current default settings:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".
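The override rule can be sketched as a small tree in which a node left at "none" takes the value of its nearest ancestor that has been set. This is a minimal sketch of one reading of the rule; `PrecisionNode` and `resolve` are hypothetical names and not part of the PyTorch API.

```python
from typing import Optional


class PrecisionNode:
    """Hypothetical stand-in for one backend/operator precision setting."""

    def __init__(self, name: str, parent: Optional["PrecisionNode"] = None,
                 precision: str = "none") -> None:
        self.name = name
        self.parent = parent
        self.precision = precision

    def resolve(self) -> str:
        # Walk up the tree until some node has an explicit setting.
        node: Optional["PrecisionNode"] = self
        while node is not None:
            if node.precision != "none":
                return node.precision
            node = node.parent
        return "none"  # nothing set anywhere on the path to the root


generic = PrecisionNode("generic")
mkldnn = PrecisionNode("mkldnn", parent=generic)
mkldnn_conv = PrecisionNode("mkldnn.conv", parent=mkldnn)
cuda = PrecisionNode("cuda", parent=generic)
cuda_conv = PrecisionNode("cuda.conv", parent=cuda, precision="tf32")

# Setting the root affects every descendant that is still at "none".
generic.precision = "bf16"
print(mkldnn_conv.resolve(), cuda_conv.resolve())
```

In this sketch, a child with an explicit setting (here `cuda.conv`) keeps its own value, which matches the statement that only children left at the default are overridden.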
### Backward Compatibility
Since the new API gives users more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent the state where `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses previous APIs, it will work as previous expectations.
- If the user uses the **new** API to change the status to one that is **un-representable** in the old API, and then tries to read the status through the **old** API, we raise a RuntimeError and point the user to the documentation.
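To make the un-representable case concrete, here is a minimal sketch of how a single legacy boolean could be derived from two fine-grained settings. `legacy_allow_tf32` is a hypothetical helper that takes already-resolved precision strings; it is not the actual PyTorch implementation.

```python
def legacy_allow_tf32(conv_precision: str, rnn_precision: str) -> bool:
    """Hypothetical mapping from two new-style settings to one old-style flag."""
    if conv_precision != rnn_precision:
        # One boolean cannot describe two different per-operator settings.
        raise RuntimeError(
            "conv and rnn use different fp32 precisions; "
            "this state cannot be represented by a single allow_tf32 flag"
        )
    return conv_precision == "tf32"


print(legacy_allow_tf32("tf32", "tf32"))  # both operators agree
try:
    legacy_allow_tf32("tf32", "ieee")
except RuntimeError as err:
    print("error:", err)
```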
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter.
To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used.
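The failure mode can be reproduced in miniature with any cached function whose side effect is a counter. The sketch below is an analogy using `functools.lru_cache`; `compile_graph` and `counters` are hypothetical names and do not reflect Inductor's actual cache.

```python
import functools

counters = {"pad_calls": 0}


@functools.lru_cache(maxsize=None)
def compile_graph(key: str) -> str:
    # The counter is only bumped when the body runs, i.e. on a cache miss.
    counters["pad_calls"] += 1
    return f"compiled:{key}"


def run_test_once() -> int:
    counters["pad_calls"] = 0
    compile_graph("pad_3d_tensor")
    return counters["pad_calls"]


first = run_test_once()       # cache miss: counter is incremented
second = run_test_once()      # cache hit: counter stays at zero
compile_graph.cache_clear()   # a fresh cache, as a per-test fixture would give
third = run_test_once()       # cache miss again
```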
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935
Approved by: https://github.com/Skylion007
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.
# Feature
This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.
The only change to the generated Triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.
These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.
# Test plan
The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@meta.com>
Based on https://github.com/pytorch/pytorch/pull/130956.
Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
- When we pad, it is always aligned to the next multiple of 128 bytes.
- Strides smaller than 1024 are not padded.
- Only intermediate values are padded, not outputs.
The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.
This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
- `config.pad_outputs`: choose whether to pad outputs (default: `False`)
- `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
- `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)
**Test plan**
Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.
These changes should not affect perf, because the defaults are identical to Inductor's current behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314
Co-authored-by: Yueming Hao <yhao@meta.com>
Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
yolo
iirc the a10g/sm86 runners have ~21 GB of space, so we can increase parallelism on them to 3. This results in about 6 GB of CUDA memory per proc. The previous calculation with 2 procs resulted in about 8 GB.
Also fixes the calc for per-proc memory, assuming that the CUDA context plus anything else takes a little under 1 GB of space (previous calc was .11 on about 7.5 - 8 GB <= .9GB)
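The per-proc figure above can be reproduced with a one-line calculation: split the total evenly, then subtract a fixed overhead per process. This is a back-of-the-envelope sketch; `per_proc_budget_gb` is a hypothetical helper, and the 1 GB overhead rounds up the "a little under 1GB" estimate.

```python
def per_proc_budget_gb(total_gb: float, num_procs: int, overhead_gb: float) -> float:
    # Even split of device memory, minus CUDA context and other fixed costs.
    return total_gb / num_procs - overhead_gb


budget = per_proc_budget_gb(21, 3, 1.0)
print(f"{budget:.1f} GB per proc")
```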
Times on main are about 1.9-2.5hr per shard
This commit is around 1.6-2hr per shard
Risks: increase in flaky tests due to OOM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125598
Approved by: https://github.com/huydhn
This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) that tensors with badly aligned shapes get aligned strides, so the GPU can access the memory more efficiently.
By testing BlenderbotSmallForConditionalGeneration I already see a 2.5ms speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758
Approved by: https://github.com/jansel