Commit Graph

32 Commits

Author SHA1 Message Date
Xinya Zhang
e769026bcb [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094
Fixes #157093
Fixes #157092
Fixes #157091
Fixes #157064
Fixes #157063
Fixes #157062
Fixes #157061
Fixes #157042
Fixes #157041
Fixes #157039
Fixes #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 13:53:48 +00:00
PyTorch MergeBot
66308fb470 Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)"
This reverts commit cef815dc2c.

Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))
2025-09-16 20:39:41 +00:00
Xinya Zhang
cef815dc2c [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-16 12:48:45 +00:00
nandesuka
9491d289b3 Support generic dynamic shape with padding (#160997)
Summary:
Inductor has the following configurations:

config.comprehensive_padding
config.padding_alignment_bytes
config.padding_stride_threshold

In the static-shape case, enabling these three options makes Inductor generate code for FlexibleLayout tensors that pads each stride dimension up to a multiple of config.padding_alignment_bytes, for strides above config.padding_stride_threshold. When dynamic shapes are enabled, no padding is done today.
This PR introduces the following configuration, which lets the user request padded strides even for dynamic-shape operations. It is opt-in so that we do not break the previous behaviour of not padding dynamic-shape use cases. config.padding_stride_threshold does not apply here, since the stride values are dynamic.

config.pad_dynamic_shapes

In addition, a new mode "python_slow" has been added for launch-grid calculation; it achieves the ceildiv behaviour that is generally applicable to integer division. This prevents test regressions and makes the wrapper_fxir codegen more generic.
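
A minimal sketch of opting in (`pad_dynamic_shapes` comes from this PR; treating it as a plain boolean on `torch._inductor.config` is an assumption):

```python
# Hedged sketch: enable padding for dynamic-shape graphs. The existing knobs
# are real inductor configs; `pad_dynamic_shapes` is the one added by this PR.
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True
inductor_config.padding_alignment_bytes = 128
inductor_config.pad_dynamic_shapes = True  # new: also pad when shapes are dynamic

@torch.compile(dynamic=True)
def matmul(x, y):
    return x @ y
```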

Test Plan:
CI

Differential Revision: D80468808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997
Approved by: https://github.com/blaine-rister, https://github.com/jansel
2025-09-03 15:58:18 +00:00
PyTorch MergeBot
710514a2a5 Revert "Enable output padding when only outermost dim is dynamic (#159404)"
This reverts commit f15ada5c6f.

Reverted https://github.com/pytorch/pytorch/pull/159404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159404#issuecomment-3216517032))
2025-08-23 07:17:30 +00:00
nandesuka
f15ada5c6f Enable output padding when only outermost dim is dynamic (#159404)
Summary: When the output tensor's shape has a dynamic outermost dim, the strides can still be padded to conform to the configured alignment if required.

Test Plan:
CI

Differential Revision: D79146886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159404
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-08-13 01:28:22 +00:00
Xuehai Pan
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
haozhe.zhu
53e0b9c393 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will use the algorithm name directly.

### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
 - The names are more informative: "tf32" says more than a generic "high".
 - Easier to extend to new algorithms like `tf32x3`.
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision of the algorithms; we can add documentation to discuss those trade-offs instead.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide 3 fp32 compute precisions that can be set, plus an unset value:
 - **"ieee"**: not allowed to use any other internal computation data type.
 - **"tf32"**: allowed to use tf32 as the internal computation data type.
 - **"bf16"**: allowed to use bf16 as the internal computation data type.
 - **"none"**: precision is not set; it can be overridden by its parent node.

### Overriding Precision Settings
A child node is overridden by its parent node if the child is set to the default ("none").
The current default settings are:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = mkldnn, op = all, precision setting = none
        backend = mkldnn, op = conv, precision setting = none
        backend = mkldnn, op = rnn, precision setting = none
        backend = mkldnn, op = matmul, precision setting = none
```
 - If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
 - If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16" (see the sketch below).
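
A minimal sketch of the override behaviour, assuming the attribute names land exactly as described in this PR:

```python
# Hedged sketch of the layered fp32_precision API described above
# (attribute names from this PR; exact read-time semantics assumed).
import torch

torch.backends.mkldnn.fp32_precision = "bf16"  # set the parent node

# Children still at "none" inherit from the parent, so these are expected
# to report "bf16" per the override rules above:
print(torch.backends.mkldnn.matmul.fp32_precision)
print(torch.backends.mkldnn.conv.fp32_precision)

# An explicitly set child is not overridden:
torch.backends.mkldnn.rnn.fp32_precision = "ieee"
print(torch.backends.mkldnn.rnn.fp32_precision)  # "ieee"
```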

### Backward Compatibility
The new API allows more fine-grained control, so there can be states the old API cannot represent. For example, the previous `torch.backends.cudnn.allow_tf32` cannot represent `torch.backends.cudnn.rnn.fp32_precision="ieee"` together with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
 - If the user only uses the previous APIs, everything works as before.
 - If the user uses the **new** API to reach a state that is **un-representable** in the old API and then queries it through the **old** API, we raise a RuntimeError and point the user to the documentation (see the sketch below).
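
A minimal sketch of that second point (the error type is stated above; the exact message is an assumption):

```python
# Hedged sketch: a per-op state the old boolean cannot represent.
import torch

torch.backends.cudnn.conv.fp32_precision = "tf32"
torch.backends.cudnn.rnn.fp32_precision = "ieee"

try:
    _ = torch.backends.cudnn.allow_tf32  # legacy API query
except RuntimeError as e:
    print("unrepresentable by the legacy API:", e)
```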

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-06-26 10:32:20 +00:00
David Berard
c014e9d7cd [inductor][test] test_padding.py: use inductor TestCase instead of dynamo TestCase (#154935)
test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter.

To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used.
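
A minimal sketch of the switch (module paths as they exist in PyTorch; the test body is illustrative):

```python
# Before: from torch._dynamo.test_case import TestCase, run_tests
from torch._inductor.test_case import TestCase, run_tests  # fresh inductor cache per run


class PaddingSmokeTest(TestCase):
    def test_runs_twice_cleanly(self):
        # With the inductor TestCase, repeated runs start from a fresh cache,
        # so counters incremented during compilation behave deterministically.
        self.assertTrue(True)


if __name__ == "__main__":
    run_tests()
```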

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935
Approved by: https://github.com/Skylion007
2025-06-03 18:36:44 +00:00
xinan.lin
1fac47702e [Break XPU][Inductor UT] Generalize device-bias code introduced by #146866. (#148534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148534
Approved by: https://github.com/nandesuka
2025-03-06 04:39:50 +00:00
nandesuka
52078154f2 Add support for no-op concat with padded output (#146866)
Add support for no-op concat with padded output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146866
Approved by: https://github.com/shunting314
2025-03-03 21:10:46 +00:00
Aaron Orenstein
99dbc5b0e2 PEP585 update - test (#145176)
See #145101 for details.
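
For reference, PEP 585 replaces `typing` aliases with builtin generics; an illustrative before/after:

```python
# Before (pre-PEP 585):
# from typing import Dict, List
# def count(xs: List[str]) -> Dict[str, int]: ...

# After (PEP 585): builtin collection types used directly as generics.
def count(xs: list[str]) -> dict[str, int]:
    out: dict[str, int] = {}
    for x in xs:
        out[x] = out.get(x, 0) + 1
    return out
```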

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176
Approved by: https://github.com/bobrenjc93
2025-01-22 04:48:28 +00:00
Blaine Burton Rister
520ba556cd [Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.

The only change to the generated Triton code is renaming `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables, like `rindex = r0_index`. If we generated 2D reductions (which this PR does not do), the aliases would be more complicated and would collapse 2D multi-indices into linear indices. See some example kernels in the parent PR.

These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.
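
An illustrative sketch of the rename and compatibility alias (hand-written, not actual Inductor output):

```python
import triton
import triton.language as tl


@triton.jit
def sum_kernel(in_ptr, out_ptr, r0_numel, R0_BLOCK: tl.constexpr):
    r0_index = tl.arange(0, R0_BLOCK)  # was: rindex = tl.arange(0, RBLOCK)
    rindex = r0_index  # compatibility alias, eliminated by the Triton compiler
    mask = rindex < r0_numel
    vals = tl.load(in_ptr + rindex, mask=mask, other=0.0)
    tl.store(out_ptr, tl.sum(vals, axis=0))
```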

# Test plan

The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-12 17:22:20 +00:00
Isuru Fernando
44186a0a4e Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-26 18:11:00 +00:00
PyTorch MergeBot
f23621ec56 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit c25b201583.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978))
2024-11-22 15:43:39 +00:00
Isuru Fernando
c25b201583 Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-22 02:04:36 +00:00
Xingyuan Li
cc6c248919 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 2) (#136856)
[Inductor UT] Generalize newly introduced inductor UTs for Intel GPU, reusing:
- `test/inductor/test_inductor_freezing.py`
- `test/inductor/test_layout_optim.py`
- `test/inductor/test_loop_ordering.py`
- `test/inductor/test_memory_planning.py`
- `test/inductor/test_padding.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136856
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/jansel
2024-10-18 03:58:00 +00:00
Blaine Burton Rister
3daca187aa [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory-alignment requirements, or for applications where performance is more important than conformity with eager mode.

This PR surfaces the padding parameters up to Inductor's config module so the user can control them (see the sketch after this list):
 - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
 - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
 - `config.padding_stride_threshold`: choose the smallest stride that we will pad; for example, setting this to 0 will pad all unaligned strides (default: `1024`)
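
A minimal sketch of setting these knobs (config names from this PR; the values are arbitrary examples):

```python
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True   # existing master switch
inductor_config.pad_outputs = True             # new: also pad graph outputs
inductor_config.padding_alignment_bytes = 64   # align padded strides to 64 bytes
inductor_config.padding_stride_threshold = 0   # pad all unaligned strides
```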

**Test plan**
Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

These changes should not affect perf, because the defaults are identical to Inductor's current behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-02 05:56:33 +00:00
PyTorch MergeBot
86e03a64e1 Revert "[Inductor] Allow customizing the padding format (#133939)"
This reverts commit 8b258b3b14.

Reverted https://github.com/pytorch/pytorch/pull/133939 on behalf of https://github.com/ZainRizvi due to sorry but this PR is causing issues with diff train imports reverting it for now but it can be merged back in as-is ([comment](https://github.com/pytorch/pytorch/pull/133939#issuecomment-2322635388))
2024-08-31 00:38:30 +00:00
Blaine Burton Rister
8b258b3b14 [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory-alignment requirements, or for applications where performance is more important than conformity with eager mode.

This PR surfaces the padding parameters up to Inductor's config module so the user can control them:
 - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
 - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
 - `config.padding_stride_threshold`: choose the smallest stride that we will pad; for example, setting this to 0 will pad all unaligned strides (default: `1024`)

**Test plan**
Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

These changes should not affect perf, because the defaults are identical to Inductor's current behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-08-30 20:34:11 +00:00
Nicolas Macchioni
5cb05a82b4 [BC breaking] move benchmarking + prefer inductor path (#132827)
Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking utilities.
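
A minimal sketch of the preferred path (the module comes from this commit; the `benchmarker` entry point and `benchmark_gpu` signature are assumptions):

```python
import torch
from torch._inductor.runtime.benchmarking import benchmarker  # assumed entry point

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

ms = benchmarker.benchmark_gpu(lambda: a @ b)  # median runtime in ms (assumed)
print(f"matmul: {ms:.3f} ms")
```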

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
2024-08-08 00:47:45 +00:00
Oguz Ulgen
920f0426ae Add None return type to init -- tests rest (#132376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335, #132351, #132352
2024-08-01 15:44:51 +00:00
Shunting Zhang
980bb54361 [BE][Inductor] fix failures in test_padding.py (#131417)
The failure only happens [internally](https://www.internalfb.com/tasks/?t=195598864) because the main block was not executed when the tests are run internally.

Differential Revision: [D60083954](https://our.internmc.facebook.com/intern/diff/D60083954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131417
Approved by: https://github.com/eellison
2024-07-23 21:53:59 +00:00
Xuehai Pan
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
PyTorch MergeBot
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fcc.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
Xuehai Pan
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
dshi7
9df2f8687f cprofile every compile id [x/y] to keep consistent with tlparse (#125659)
This PR moves the cprofile decorator to keep it consistent with `torch_inductor_stats` logging; it is needed by the fbcode diffs enabling profiling in internal e2e jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125659
Approved by: https://github.com/ezyang
2024-05-14 17:09:28 +00:00
Catherine Lee
bef7d650c4 [CI] 3 procs on sm86 (#125598)
yolo
iirc the a10g/sm86 runners have ~21 GB of memory, so we can increase parallelism on them to 3 processes. This results in about 6 GB of CUDA memory per proc; the previous calculation with 2 procs resulted in about 8 GB.

Also fixes the calc for per-proc memory, assuming that the CUDA context + anything else takes a little under 1 GB (the previous calc was 0.11 of about 7.5-8 GB, i.e. <= 0.9 GB).
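
A quick sanity check of that arithmetic (illustrative numbers taken from the text above):

```python
total_gb = 21        # approximate memory on an A10G (sm86) runner
overhead_gb = 1      # CUDA context + misc, a little under 1 GB per proc
procs = 3

per_proc_gb = total_gb / procs - overhead_gb
print(f"~{per_proc_gb:.0f} GB of CUDA memory per proc")  # ~6 GB, matching the estimate
```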

Times on main are about 1.9-2.5 hr per shard; with this commit they are around 1.6-2 hr per shard.

Risks: increase in flaky tests due to OOM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125598
Approved by: https://github.com/huydhn
2024-05-10 18:48:43 +00:00
Jason Ansel
7fd8870e6b [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 18:46:24 +00:00
PyTorch MergeBot
0b90af0bf5 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)"
This reverts commit fcf28b0ad5.

Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
Jason Ansel
fcf28b0ad5 [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 04:51:15 +00:00
Shunting Zhang
fb6f6270d6 [inductor] comprehensive padding (#120758)
This PR adds the ability to pad tensor strides during lowering. The goal is to ensure (where possible) that tensors with awkward shapes get aligned strides so the GPU can access the memory more efficiently.

Testing BlenderbotSmallForConditionalGeneration already shows a 2.5 ms speedup.
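
A minimal sketch of turning the feature on (the config name is real; the workload is illustrative):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True  # pad strides during lowering

@torch.compile
def f(x):
    # An odd inner dim produces "bad" strides that padding can realign.
    return torch.relu(x @ x.transpose(-1, -2))

out = f(torch.randn(64, 1023, device="cuda"))
```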

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758
Approved by: https://github.com/jansel
2024-04-15 19:05:51 +00:00