pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Xinya Zhang	e769026bcb	[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998 ) A few UT failures are caused by `HIPBLASLT_ALLOW_TF32` Fixes #157094 Fixes #157093 Fixes #157092 Fixes #157091 Fixes #157064 Fixes #157063 Fixes #157062 Fixes #157061 Fixes #157042 Fixes #157041 Fixes #157039 Fixes #157004 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-18 13:53:48 +00:00
PyTorch MergeBot	66308fb470	Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998 )" This reverts commit `cef815dc2c`. Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))	2025-09-16 20:39:41 +00:00
Xinya Zhang	cef815dc2c	[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998 ) A few UT failures are caused by `HIPBLASLT_ALLOW_TF32` Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-16 12:48:45 +00:00
nandesuka	9491d289b3	Support generic dynamic shape with padding (#160997 ) Summary: Inductor has the following configurations: config.comprehensive_padding config.padding_alignment_bytes config.padding_stride_threshold In the case of static shape by enabling these three options Inductor will generate code for Flexible layout tensors that tries to pad up all stride dimension to be a multiple of config.padding_alignment_bytes for strides above: config.padding_stride_threshold. In the case where dynamic shapes is enabled no padding is done today. This PR introduces the following configuration which allows the user to specify they wish to generated a padded stride even in the case of dynamic shape operations. This is mainly done so we don't break the previous behaviour of not padding up dynamic shape use cases. The config.padding_stride_threshold does not apply since the values of the strides are dynamic. config.pad_dynamic_shapes In addition to this a new mode "python_slow" has been added to launch grid calculation which achieves the same ceildiv behaviour that is generally applicable to integer division. This is done to prevent test regressions and make wrapper_fxir codegen more generic. Test Plan: CI Rollback Plan: Differential Revision: D80468808 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997 Approved by: https://github.com/blaine-rister, https://github.com/jansel	2025-09-03 15:58:18 +00:00
PyTorch MergeBot	710514a2a5	Revert "Enable output padding when only outermost dim is dynamic (#159404 )" This reverts commit `f15ada5c6f`. Reverted https://github.com/pytorch/pytorch/pull/159404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159404#issuecomment-3216517032))	2025-08-23 07:17:30 +00:00
nandesuka	f15ada5c6f	Enable output padding when only outermost dim is dynamic (#159404 ) Summary: When the shape of the output tensor has a dynamic outer most dim, the stride can still be padded to conform to configured alignment if required. Test Plan: CI Rollback Plan: Differential Revision: D79146886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159404 Approved by: https://github.com/blaine-rister, https://github.com/eellison	2025-08-13 01:28:22 +00:00
Xuehai Pan	17687eb792	[BE][4/6] fix typos in test/ (test/inductor/) (#157638 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638 Approved by: https://github.com/yewentao256, https://github.com/jansel	2025-07-06 06:34:25 +00:00
haozhe.zhu	53e0b9c393	refine fp32 precision api (#125888 ) Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" to represent fp32 internal computation data types . Instead, we will directly use the algorithm to represent it. ### Design Choice: Directly use algorithms name like "TF32", "BF16". #### Pros - The names are more informative. 'tf32' is more informative than a simple "high". - Easier to extend new algorithm like `tf32x3` #### Cons - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them. ### We provide a layered structure for backends/operators. ('f32' is short for 'fp32_precision') ![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067) ### We provide 3 fp32 compute precision can be set: - "ieee": Not allowed to use any other internal computation data types . - "tf32": Allowed to use tf32 as internal computation data types. - "bf16": Allowed to use bf16 as internal computation data types. - "none": Precision's are not set. Can be override by its father node. ### Overriding Precision Settings Child node can be override by its father node if it is set to default. For current default settings: ``` backend = generic, op = all, precision setting = none backend = cuda, op = all, precision setting = none backend = cuda, op = conv, precision setting = tf32 backend = cuda, op = rnn, precision setting = tf32 backend = cuda, op = matmul, precision setting = none backend = matmul, op = all, precision setting = none backend = matmul, op = conv, precision setting = none backend = matmul, op = rnn, precision setting = none backend = matmul, op = matmul, precision setting = none ``` - If the user set `torch.backends.mkldnn.fp32_precision="bf16"`, his child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be override to "bf16". - If the user set `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and his child nodes will also we override to "bf16". ### Backward Compatible Since new API allow user to have more fine-grained control. There will be some conflict. For example, previous `torch.backends.cudnn.allow_tf32` are not enough to represent the status for `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goal for backward compatible is - If the user only uses previous APIs, it will work as previous expectations. - If the user use new API to change the status to an un-representable status for old API, and try to access the status by old API. We will raise Runtime Error and point the document for user. ### Test Plan ``` python test/test_cuda.py -k test_fp32_precision_with_tf32 python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision python test/test_cuda.py -k test_invalid_status_for_legacy_api python test/test_mkldnn.py -k test_mlkdnn_get_set python test/test_mkldnn.py -k test_generic_precision python test/test_mkldnn.py -k test_invalid python test/test_mkldnn.py -k test_default_use_parent ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888 Approved by: https://github.com/jgong5, https://github.com/albanD Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>	2025-06-26 10:32:20 +00:00
David Berard	c014e9d7cd	[inductor][test] test_padding.py: use inductor TestCase instead of dynamo TestCase (#154935 ) test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter. To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935 Approved by: https://github.com/Skylion007	2025-06-03 18:36:44 +00:00
xinan.lin	1fac47702e	[Break XPU][Inductor UT] Generalize device-bias code introduced by #146866 . (#148534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148534 Approved by: https://github.com/nandesuka	2025-03-06 04:39:50 +00:00
nandesuka	52078154f2	Add support for no-op concat with padded output (#146866 ) Add support for no-op concat with padded output Pull Request resolved: https://github.com/pytorch/pytorch/pull/146866 Approved by: https://github.com/shunting314	2025-03-03 21:10:46 +00:00
Aaron Orenstein	99dbc5b0e2	PEP585 update - test (#145176 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176 Approved by: https://github.com/bobrenjc93	2025-01-22 04:48:28 +00:00
Blaine Burton Rister	520ba556cd	[Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. # Feature This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land. The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibilty with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR. These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf. # Test plan The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.` to `r0_.`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@meta.com>	2024-12-12 17:22:20 +00:00
Isuru Fernando	44186a0a4e	Move Sympy printers to torch/utils/_sympy/printers.py (#140597 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-11-26 18:11:00 +00:00
PyTorch MergeBot	f23621ec56	Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597 )" This reverts commit `c25b201583`. Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978))	2024-11-22 15:43:39 +00:00
Isuru Fernando	c25b201583	Move Sympy printers to torch/utils/_sympy/printers.py (#140597 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-11-22 02:04:36 +00:00
Xingyuan Li	cc6c248919	[Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 2) (#136856 ) [Inductor UT] Generalize Newly introduced inductor UTs for intel GPU reuse `test/inductor/test_inductor_freezing.py` reuse `test/inductor/test_layout_optim.py` reuse `test/inductor/test_loop_ordering.py` reuse `test/inductor/test_memory_planning.py` reuse `test/inductor/test_padding.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136856 Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/jansel	2024-10-18 03:58:00 +00:00
Blaine Burton Rister	3daca187aa	[Inductor] Allow customizing the padding format (#133939 ) Based on https://github.com/pytorch/pytorch/pull/130956. Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs: - When we pad, it is always aligned to the next multiple of 128 bytes. - Strides smaller than 1024 are not padded. - Only intermediate values are padded, not outputs. The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode. This PR surfaces padding parameters up to Inductor's config module, so the user can control them. - `config.pad_outputs`: choose whether to pad outputs (default: `False`) - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`) - `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`) Test plan Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations. These changes should not affect perf, because the defaults are identical to Inductor's current behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939 Approved by: https://github.com/shunting314 Co-authored-by: Yueming Hao <yhao@meta.com>	2024-09-02 05:56:33 +00:00
PyTorch MergeBot	86e03a64e1	Revert "[Inductor] Allow customizing the padding format (#133939 )" This reverts commit `8b258b3b14`. Reverted https://github.com/pytorch/pytorch/pull/133939 on behalf of https://github.com/ZainRizvi due to sorry but this PR is causing issues with diff train imports reverting it for now but it can be merged back in as-is ([comment](https://github.com/pytorch/pytorch/pull/133939#issuecomment-2322635388))	2024-08-31 00:38:30 +00:00
Blaine Burton Rister	8b258b3b14	[Inductor] Allow customizing the padding format (#133939 ) Based on https://github.com/pytorch/pytorch/pull/130956. Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs: - When we pad, it is always aligned to the next multiple of 128 bytes. - Strides smaller than 1024 are not padded. - Only intermediate values are padded, not outputs. The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode. This PR surfaces padding parameters up to Inductor's config module, so the user can control them. - `config.pad_outputs`: choose whether to pad outputs (default: `False`) - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`) - `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`) Test plan Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations. These changes should not affect perf, because the defaults are identical to Inductor's current behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939 Approved by: https://github.com/shunting314 Co-authored-by: Yueming Hao <yhao@meta.com>	2024-08-30 20:34:11 +00:00
Nicolas Macchioni	5cb05a82b4	[BC breaking] move benchmarking + prefer inductor path (#132827 ) move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827 Approved by: https://github.com/eellison	2024-08-08 00:47:45 +00:00
Oguz Ulgen	920f0426ae	Add None return type to init -- tests rest (#132376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376 Approved by: https://github.com/jamesjwu ghstack dependencies: #132335, #132351, #132352	2024-08-01 15:44:51 +00:00
Shunting Zhang	980bb54361	[BE][Inductor] fix failures in test_padding.py (#131417 ) The failure only happens [internally](https://www.internalfb.com/tasks/?t=195598864) because the main block was not executed when the tests are run internally. Differential Revision: [D60083954](https://our.internmc.facebook.com/intern/diff/D60083954) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131417 Approved by: https://github.com/eellison	2024-07-23 21:53:59 +00:00
Xuehai Pan	134bc4fc34	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 07:49:19 +00:00
PyTorch MergeBot	b732b52f1e	Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 )" This reverts commit `aecc746fcc`. Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))	2024-07-18 06:39:58 +00:00
Xuehai Pan	aecc746fcc	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 05:13:41 +00:00
dshi7	9df2f8687f	cprofile every compile id [x/y] to keep consistent with tlparse (#125659 ) This PR moves cprofile decorator to keep consistent with `torch_inductor_stats` logging and is needed by fbcode diffs of profiling enablement in internal e2e jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125659 Approved by: https://github.com/ezyang	2024-05-14 17:09:28 +00:00
Catherine Lee	bef7d650c4	[CI] 3 procs on sm86 (#125598 ) yolo iirc the a10g/sm86 runners have ~21 GB of space, so we can increase parallelism on it to 3. This results in about 6GB CUDA mem per proc. The previous calculation + 2 procs resulted in about 8 GB Also fixes the the calc for per proc memory, assuming that CUDA context + anything else take about a little under 1GB of space (previous calc was .11 on about 7.5 - 8 GB <= .9GB) Times on main are about 1.9-2.5hr per shard This commit is around 1.6-2hr per shard Risks: increase in flaky tests due to OOM Pull Request resolved: https://github.com/pytorch/pytorch/pull/125598 Approved by: https://github.com/huydhn	2024-05-10 18:48:43 +00:00
Jason Ansel	7fd8870e6b	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 18:46:24 +00:00
PyTorch MergeBot	0b90af0bf5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 )" This reverts commit `fcf28b0ad5`. Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
Jason Ansel	fcf28b0ad5	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 04:51:15 +00:00
Shunting Zhang	fb6f6270d6	[inductor] comprehensive padding (#120758 ) This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently. By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758 Approved by: https://github.com/jansel	2024-04-15 19:05:51 +00:00

32 Commits