Pytest treats every symbol starting with `test_` as a test function and runs it.
`test_compiled_fsdp` is a decorator, but because of the import it gets discovered by pytest as a test.
Rename it to avoid this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144544
Approved by: https://github.com/Skylion007
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests that failed on MI300, as observed in https://github.com/pytorch/pytorch/pull/140989
Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2
Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1
Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda
Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)
Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda
Features:
- inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm-supported FP8 datatypes. It keeps the same test names for CUDA and ROCm and makes it possible to enable Inductor FP8 tests on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.
The below example shows how the incorrect result occurs:
```
a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```
This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.
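For reference, a minimal sketch of the corrected computation with the upcast applied:
```python
import torch

a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64))  # 2
total = torch.sum(a, dtype=torch.int32)                    # 2 (no longer clamped to True)
mean = total / count                                       # 1.0, the correct mean
```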
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
Motivation: Generalize unit tests so that they can be executed on CUDA and non-CUDA devices.
Dependency: #133209, merged now.
There was a previous PR #135242 for these changes, which was closed due to incorrect commits. I have incorporated the changes suggested in the comments.
@kwen2501 @zeshengzong Please review the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
See #144006
```py
__________________________________________ CudaReproTests.test_repeated_masked_load __________________________________________
RuntimeError: First class dim doesn't work with python 3.12
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/home/jansel/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
method(*args, **kwargs)
File "/home/jansel/pytorch/test/inductor/test_cuda_repro.py", line 1678, in test_repeated_masked_load
from functorch.einops import rearrange
File "/home/jansel/pytorch/functorch/einops/__init__.py", line 1, in <module>
from .rearrange import rearrange
File "/home/jansel/pytorch/functorch/einops/rearrange.py", line 7, in <module>
from functorch._C import dim as _C
ImportError: initialization failed
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144006
Approved by: https://github.com/Skylion007
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
This PR is to add `torch._scaled_mm` for CPU backend.
`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related unit tests to support CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139974
Changes:
1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__` prefix to positional-only arguments with the `/` separator in function signatures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.
API Usage: https://github.com/pytorch/pytorch/issues/143289
Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to return NaN early when it is known that the result at this value will be NaN.
According to my short test:
```Python
import torch
device = "cuda"
dtype = torch.float32
x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)
for n in range(1024):
if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
break
for n in range(1024):
if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
break
```
The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```
Surely, it makes sense to increase the limit as a safety margin.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet, https://github.com/eqy
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:
> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.
Currently, the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, they could be the same, leading to undefined behaviour if `accumulate=False`.
This PR changes the input generation to only generate a single index if `accumulate=False`, preventing duplicate indices and undefined behaviour.
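For illustration (example values are mine), duplicate indices are only well-defined when accumulating:
```python
import torch

t = torch.zeros(3)
idx = (torch.tensor([1, 1]),)              # duplicate indices
vals = torch.tensor([10.0, 20.0])
t.index_put_(idx, vals, accumulate=True)   # deterministic: t[1] == 30.0
# With accumulate=False, t[1] could end up as either 10.0 or 20.0, which is
# exactly the undefined behaviour the single-index sample inputs now avoid.
```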
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
This PR fixes some issues with NJT backward / compile backward tests:
1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a LOT of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status. Note: the clone() / detach() stuff is for autograd; can't have two SampleInputs as part of the same autograd graph.
2. Per-sample skips weren't -fully- working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a ton of things but ultimately had to split the `subtest_ctx` into that + a `skip_xfail_ctx` to run the subtests within.
* Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py`
3. Added the appropriate skips / xfails to get everything passing. There are a shitton of bugs to fix!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops.
This PR utilizes PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` provides a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (ie iterates over lists of operands applying the op elementwise). The higher order op is passed through the stack down to inductor where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interfaces for those tests and then run all of the existing foreach tests on foreach_map.
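A rough usage sketch of the `foreach_map` API described above; the import path and exact call signature here are assumptions based on this description rather than authoritative:
```python
import torch
from torch._higher_order_ops import foreach_map  # assumed import location

xs = [torch.randn(8) for _ in range(4)]
ys = [torch.randn(8) for _ in range(4)]

@torch.compile
def fn(xs, ys):
    # applies the single pointwise op elementwise over the operand lists,
    # mirroring torch._foreach_add semantics
    return foreach_map(torch.add, xs, ys)

out = fn(xs, ys)  # list of 4 tensors; inductor can group them as a foreach kernel
```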
TODO before landing:
* Add tests for general functions
* Test warning if unsupported op will block fusion
Followups:
* I need to add tests for backwards (this will be a followup PR because backwards will require other work as well)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098
Approved by: https://github.com/eellison
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`
Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:
1. Nestedtensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
+ The kernel should use top-left alignment, bottom right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
However, users are strongly recommended to update to ROCM 6.2.4, notably for
its firmware updates.
Related unit tests are enabled as well.
Notable related changes from AOTriton 0.8b:
1. AOTriton 0.8b moves the GPU kernel out of libaotriton.so to a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as GPU kernel compression algorithm for better compression ratio: aotriton0.8b (.so + aotriton.images take 350MB) compared to aotriton0.7b .so: 800MB
3. The compression cannot be disabled now, and `liblzma` is a hard run-time dependency.
+ This should not be a problem, since `lzma` is part of the Python Standard Library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
There are a number of cases where pattern matching differs based on the presence of ACL, causing the tests to fail. This adds `TEST_ACL` and `skipIfACL` so that these tests can still run with different values or be entirely skipped if necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141921
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
**Background:** conversion from outer dim -> inner dim makes the (previously valid) assumption that the ragged dim is immediately next to the batch dim. This is no longer the case after #137125.
This PR:
* Updates the outer dim -> inner dim conversion logic to match the actual ragged_idx. Since ragged_idx tells us where the packed ragged / batch dim is, both ragged and batch outer dims should map to this inner dim. The conversion logic must now take in `ragged_idx` to make this possible, so the PR updates all call-sites to pass this (see the sketch after this list).
* Fixes outputs across keepdim settings when reducing over ragged / batch dims.
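A minimal sketch of the conversion rule referenced above (the helper name and exact handling are illustrative, not the actual implementation):
```python
def outer_to_inner_dim(outer_dim: int, ragged_idx: int) -> int:
    # Both the batch dim (0) and the ragged dim map onto the packed
    # ragged/batch dim of the inner values tensor; every other dim shifts
    # down by one, since the values tensor has no separate batch dim.
    if outer_dim in (0, ragged_idx):
        return ragged_idx - 1
    return outer_dim - 1
```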
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142173
Approved by: https://github.com/drisspg
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Large n inputs caused a regression starting in ROCm 6.1: the for loop runs for an excessive number of iterations. The root cause seems to be how static_cast<int64_t> behaves for large float values such as 1e20, which certain unit tests use. The workaround is to break out of the loop once the returned value reaches NaN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141150
Approved by: https://github.com/eqy, https://github.com/malfet
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376
To enable more balanced sharding, I had to push 148ae19935
Added `@xfailIfSM89` to the following tests:
- test_fp8_pattern_2
- test_original_aten_preserved_split_addmm
- test_sparse_semi_structured_scaled_mm
- test_sparse_semi_structured_scaled_mm_fp8
- test_sparse_fp8fp8_mm
Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`
Skipped the following inductor tests (which either OOM flakily or time out):
- test_reduction_fn_std_float64
- test_reduction_fn_var_mean_float64
- test_multi_output_unbacked_custom_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit.
Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
Previously, the reference inputs for div with rounding mode did not supply the rounding_mode keyword argument. This didn't match the sample inputs for this op.
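For illustration, the kind of call the reference inputs now exercise (example values are mine):
```python
import torch

torch.div(torch.tensor([5.0, -5.0]), 2, rounding_mode="floor")  # tensor([ 2., -3.])
```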
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136308
Approved by: https://github.com/albanD
Co-authored-by: Xia, Weiwen <weiwen.xia@intel.com>
Co-authored-by: Bob Ren <bobren@meta.com>
Co-authored-by: Xilun Wu <12968408+XilunWu@users.noreply.github.com>
Co-authored-by: siahuat0727 <tansiahuat@gmail.com>
Summary:
float8 dtypes were missing from this map; this PR adds them.
Test Plan:
CI, and unbreaks debugging in torchao
If there is an existing test I can add this to - lmk
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141385
Approved by: https://github.com/soulitzer
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125
Note that `unsqueeze()` still doesn't support batch dim operation, and arguably should never support this.
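A small example of the new ragged-dim support (constructed here for illustration; it exercises the `(B, j1, D) -> (B, 1, j1, D)` case above):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(5, 8)], layout=torch.jagged
)                       # shape (B, j1, D)
out = nt.unsqueeze(1)   # shape (B, 1, j1, D); relies on the ragged-dim support added here
```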
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736, #140161
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.
Test logic (sample input generation, references) has been updated to utilize this data. It allows for a fairly generic set of sample inputs & a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").
Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`
Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.
I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736
**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).
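For context, the problematic pattern looks roughly like this (a schematic added for illustration, not the exact formula code):
```python
import torch

t = torch.randn(4)
cond = t > 0
# Materialize the scalar `other` with the input's options -- including layout,
# which for an NJT input would request layout=torch.jagged and break:
other = torch.scalar_tensor(0.0, dtype=t.dtype, device=t.device, layout=t.layout)
out = torch.where(cond, t, other)
```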
**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
* Uses limited `broadcast_tensors()` / `broadcast_to()` support
* Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to use `scalar_tensor()` instead of e.g. `zeros({})`
**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of any type (`object`) and returns a value of any type (`object`). However, in reality, `TestFunc` can take any number of arguments, so the correct typing should be `Callable[..., object]`, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`).
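For reference, the corrected annotation in standard typing syntax:
```python
from typing import Callable

TestFunc = Callable[..., object]  # any number of arguments (including zero), any return type
```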
Test Plan: Contbuild & OSS CI
Differential Revision: D66463705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513
Approved by: https://github.com/wz337, https://github.com/Skylion007
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the int8 quantization recipe for the XPU backend.
# Details
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).
We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxxx` methods. Ideally, `XPUInductorQuantizer` would have no class body at all, as the whole implementation could be inherited from the base class.
In this PR, we override the `annotate_xxx` methods for operators that have NOT been implemented. All operators the XPU backend does not implement fall back to the fp32 implementation, since the node in the graph remains a `dq-op-q` pair. This helps provide good out-of-the-box usability for the XPU backend. On the other hand, the implemented operators use the `annotate_op` methods implemented in the base class and can be lowered successfully.
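A minimal sketch of the class shape described above (the specific overridden method and its signature are illustrative assumptions, not the actual override list):
```python
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer


class XPUInductorQuantizer(X86InductorQuantizer):
    """Inherits the X86 int8 recipe; overrides annotation only where XPU lacks kernels."""

    def _annotate_maxpool2d(self, node, quantization_config):
        # Illustrative override: skip annotation so the op stays a dq-op-q
        # pattern and falls back to the fp32 implementation on XPU.
        return
```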
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
# Motivation
This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.
# Details
The implemented operators will be further integrated into the pt2e quant flow. In this PR, we validate the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend the device support for the inductor lowering pass and the inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to different implementations in the oneDNN library.
In this PR, we share the same int8 quantizer as CPU, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer`, which will customize the pt2e behavior for the XPU backend. The capability of `XPUInductorQuantizer` will gradually grow along with the development of quantized operators on XPU.
# Validation
* UT testing
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
-k test_qconv2d_xpu \
-k test_qconv2d_silu_xpu \
-k test_qconv2d_relu6_xpu \
-k test_qconv2d_hardtanh_xpu \
-k test_qconv2d_hardswish_xpu
```
* Runtime exemplification
```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945
#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
Adding a `destroy_pg_upon_exit` property to allow derived test classes to control whether automatic destroy is desired.
(Otherwise, derived test classes would need to rewrite the `_run()` method, leading to duplicated `_run()` code; and if one needs to add things to `_run()` in the future, more code changes would be needed.)
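A hypothetical example of how a derived class might use the new property (class name is illustrative):
```python
from torch.testing._internal.common_distributed import MultiProcessTestCase


class MyDeviceTest(MultiProcessTestCase):
    @property
    def destroy_pg_upon_exit(self) -> bool:
        # This suite tears down its process groups explicitly inside each test.
        return False
```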
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the scope of the loop variables (see the sketch after this list).
* List comprehensions not only often have better typing, but are also 50+% faster than for loops in terms of overhead. They also preserve length information, etc., and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
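Schematically, the rewrite the rule applies (my own illustration):
```python
# before: building a list with an explicit loop
result = []
for item in range(10):
    result.append(item * 2)

# after: the equivalent list comprehension (less overhead, no leaked loop variable)
result = [item * 2 for item in range(10)]
```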
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
### Background
This PR adds the functionality to xfail / skip on a per-`SampleInput` basis for `OpInfo` tests. See #89354 and #82669 for some requests asking for this type of functionality.
This was originally landed for NJT in #138370 and is generalized and slightly tweaked here.
### Design
#### Principles
* Clean separation among `SampleInput` generation logic, test logic that uses the `SampleInput`s, and xfail / skip logic (which will change as bugs are addressed).
* Flexibility in xfail / skip predicate specification - ideally each bug can be handled by a single skip / xfail, even if it surfaces across a specific class of ops.
* This is important in practice for NJT, where it's common to have a bug that affects all binary ops, for example.
* Opt-in with minimal test logic changes + no substantial impact on other tests.
#### Details
The core new concept is a `SampleRule`, which can be either an `XFailRule` or `SkipRule`.
```python
@dataclass
class SampleRule(ABC):
# function to indicate whether the rule applies to this op; return True if so
# NB: str arg of callable is device_type
op_match_fn: Callable[[str, OpInfo], bool] = None
# function to indicate whether the rule applies to this sample; return True if so
sample_match_fn: Callable[[torch.device, SampleInput], bool] = None
# optional name for identifying the rule
name: str = ""
@dataclass
class XFailRule(SampleRule):
# expected error type
error_type: TypeVar = Exception
# expected error message
error_msg: str = ".*"
@dataclass
class SkipRule(SampleRule):
...
```
* See below for example usage details, but at a high level: each test should have a corresponding list of `sample_skips_and_xfails`.
* The list of `sample_skips_and_xfails` is traversed in order, and the first rule that matches (if any) is applied, so order can matter.
* The PR includes a logging mechanism for matched rules accessible by setting the loglevel to `DEBUG`.
* The split between `op_match_fn` and `sample_match_fn` is made to allow pre-filtering of the list of rules to get only those that apply to the op under test.
* Each `SampleInput` is run within a subtest context so they can be individually skipped / xfailed as needed. This also means that a test will no longer stop after the first erroring `SampleInput`; all samples will be run through test logic.
### Example Usage
Consider the following OpInfo test:
```python
class MyTestCase(TestCase):
@ops(op_db)
def test_foo(self, device, dtype, op):
for sample in op.sample_inputs(device, dtype, requires_grad=False):
# do some SampleInput-based test logic
output = op.op(sample.input, *sample.args, **sample.kwargs)
...
```
This is a common pattern for such tests; simply generate a list of `SampleInputs` and run them through the op. Now say you want to xfail one of these `SampleInput`s for a given op. Today, you have to xfail the entire test or hack around this in the test logic.
This PR lets you do this to get very flexible xfail / skips based on op / sample input properties:
```python
# NB: Define rules for per-SampleInput xfails / skips. These can also be defined in-line in the @ops decorator, but
# it can be more readable to maintain these somewhere else. These are attempted to be matched in order and
# the first one that matches applies, so order can matter.
FOO_SKIPS_AND_XFAILS = [
XFailRule(
error_type=ValueError,
error_msg="2D inputs not supported",
op_match_fn=lambda device, op: (
# NB: logic for which ops this rule applies to goes here
op.full_name == "add"
),
sample_match_fn=lambda device, sample: (
# NB: logic which samples this rule applies to goes here
sample.input.dim() == 2
),
# NB: optional rule identifier can help with debugging matched rules
name="add_with_2D_inputs_not_supported",
),
# NB: This follows a similar structure as XFailRule but without error_type / error_msg. Obviously
# this skips a particular SampleInput instead of xfailing :)
SkipRule(...),
...
]
class MyTestCase(TestCase):
@ops(op_db)
@sample_skips_and_xfails(FOO_SKIPS_AND_XFAILS)
# NB: the @ops decorator automatically filters out any rules that don't apply to this op
def test_foo(self, device, dtype, op):
for sample, subtest_ctx in op.sample_inputs(
# NB: use_subtests=True is required for skips / xfails to work. If skips / xfails are defined and use_subtests != True,
# an informative error will be thrown.
device, dtype, requires_grad=False, use_subtests=True
):
# NB: this subtest context manager runs each sample input as a "subtest" and handles skips / xfails appropriately
with subtest_ctx(self):
# do some SampleInput-based test logic
output = op.op(sample.input, *sample.args, **sample.kwargs)
...
```
More examples can be seen in `test/test_nestedtensor.py`, where this system is used in practice.
I also demonstrate usage of syntactic sugar over this system in `test/functorch/test_vmap.py`. Here, a skip for the `to()` operator is replaced with a granular xfail for `test_vmap_exhaustive()`:
```python
...
# pre-existing xfail
xfail("item"),
# new granular xfail using syntactic sugar over the general system
xfailIf(
"to",
lambda sample: (
sample.kwargs["memory_format"] == torch.channels_last
),
),
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140443
Approved by: https://github.com/janeyx99, https://github.com/zou3519
ghstack dependencies: #140160, #138370
Per discussion with @malfet, only allow weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported
(this is slightly weird as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is actually what needs to be imported)
we can't import this from `torch.nested` as this would
- undo dynamo lazy import
- cause circular import
===========================
Redo of https://github.com/pytorch/pytorch/pull/140304 caused issues as `torch.nested._internal.foo` needs to be imported, which causes issues like
```python
torch/_weights_only_unpickler.py", line 339, in load
if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```
**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time**
Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)
@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
# Motivation
This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.
It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784).
This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase which helps abstract out process group creation/deletion and other functionality for a given device.
New generalized content can be added by deriving from this base class.
It also includes other misc changes for Gaudi support.
The second change is to test_functional_api, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
- Functionality has been added to test non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow general calls to accelerators, adding test skips instead of explicitly skipping for multiple GPUs
- skipIfHpu flags have been added to enable skipping a few multithreaded test cases which are not yet supported on HPUs
NOTE: Within test_functional_api, there are tests which require the use of some multithreading functions that are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.
I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.
This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 here as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).
But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.
One HUGE hack we do here is that a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_
metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use a regex to remove the `kernel_idx = 0, constant_args_idx = 1` parts, as they are not relevant to the cache hash. This is horrible, and I'd like to eventually stop using pickle as a hashing alternative, but that is a longer project.
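Illustratively, the normalization amounts to something like the following (the regex and node string here are schematic, not the actual implementation):
```python
import re

node_repr = (
    "torch.ops.higher_order.triton_kernel_wrapper_functional("
    "kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], ...)"
)
# Strip the indices that should not influence the cache key before hashing
normalized = re.sub(r"kernel_idx = \d+, constant_args_idx = \d+, ", "", node_repr)
```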
Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
The implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.
Here is a test case that demonstrates this issue:
```python
torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```
In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` consistently produces the same results as when `output` is contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
Functionally, the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed.
- Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happens only on MacOS13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
@ezyang noticed this exercises a multithreading bug that is causing tests to become disabled:
```
2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3
2024-11-13T21:05:55.8364857Z
2024-11-13T21:05:55.8364974Z Traceback (most recent call last):
2024-11-13T21:05:55.8365491Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
2024-11-13T21:05:55.8366003Z self.run()
2024-11-13T21:05:55.8366371Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run
2024-11-13T21:05:55.8366858Z self._target(*self._args, **self._kwargs)
2024-11-13T21:05:55.8367518Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop
2024-11-13T21:05:55.8368189Z self.loop.run_until_complete(self.task)
2024-11-13T21:05:55.8368774Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2024-11-13T21:05:55.8369348Z return future.result()
2024-11-13T21:05:55.8369980Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker
2024-11-13T21:05:55.8370603Z message = await asyncio.wait_for(
2024-11-13T21:05:55.8371090Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
2024-11-13T21:05:55.8371573Z return await fut
2024-11-13T21:05:55.8372156Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get
2024-11-13T21:05:55.8372613Z await getter
2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop
2024-11-13T21:05:55.8375366Z
2024-11-13T21:05:55.8375603Z warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717
Approved by: https://github.com/ezyang, https://github.com/zxiiro
Summary:
`with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem with a decorator that takes an input argument is that it always has to be invoked, i.e., you have to write `with_comms()` rather than `with_comms`, which is how the majority of existing usages are written.
This diff provides a solution such that we can use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`.
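A generic sketch of the decorator pattern that makes all four spellings work; the setup/teardown hooks here are placeholders, not the real `with_comms` body:
```python
import functools


def with_comms(func=None, *, eager_init=False):
    def decorator(f):
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            self.init_pg(eager_init=eager_init)   # placeholder setup hook
            try:
                return f(self, *args, **kwargs)
            finally:
                self.destroy_pg()                  # placeholder teardown hook
        return wrapper

    if func is not None:      # bare usage: @with_comms
        return decorator(func)
    return decorator          # invoked usage: @with_comms(), @with_comms(eager_init=True)
```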
Test Plan: Contbuild & OSS CI
Differential Revision: D65385700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637
Approved by: https://github.com/wz337
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
* The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
* I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
* Test perf - adding `SampleInput`s is faster than generating entire new tests
* Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed
```python
# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
# expected error type
error_type: TypeVar = Exception
# expected error message
error_msg: str = ".*"
# function to indicate whether the rule applies; return True if so
match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None
# optional name for identifying the rule
name: str = ""
def match(self, device, dtype, op, sample) -> bool:
return self.match_fn(device, dtype, op, sample)
```
Example:
```python
# Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
# tensor with 1 in ragged dim.
XFailRule(
error_type=RuntimeError,
error_msg="cannot call binary pointwise function .* with inputs of shapes",
match_fn=lambda device, dtype, op, sample: (
isinstance(op, BinaryUfuncInfo)
and "noncontig_holes" in sample.name
and "broadcasting 1 over ragged" in sample.name
),
name="binary_noncontig_holes_broadcasting_1_over_ragged",
),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
Summary:
I was looking into why non-standard bool values fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 values so the order will be different (and thus the indices will be different). But msort should work.
After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison treats the operands as signed. I tried lhs=255 and rhs=0: lhs < rhs evaluates like -1 < 0, which is true (but it's supposed to be false).
Therefore we add an explicit type cast.
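A small Python illustration of the effect (the actual fix is in the C++ sort kernel; this just mimics the signed reinterpretation):
```python
import ctypes

raw = 255                            # byte stored in a "bool" slot
print(ctypes.c_int8(raw).value)      # -1: the same byte read as signed
print(ctypes.c_int8(raw).value < 0)  # True, so 255 "sorts before" 0
print(bool(raw) < bool(0))           # False: an explicit bool cast restores the intended order
```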
Test Plan: Remove the test skip
Differential Revision: D65472170
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731
Introducing the ability for a subclass to implement type conversion to `expected_type`.
```
def __coerce_same_metadata_as_tangent__(
    self, expected_metadata: Any, expected_type: Optional[Type] = None
):
```
Here, `expected_type=None` means the same subclass is expected.
E.g., for `DTensor` we may find an `AsyncCollectiveTensor` tangent where we expected a `Tensor` - in this case
the method will be called at runtime with `expected_type=Tensor`.
An implementation is added to `AsyncCollectiveTensor` that simply triggers `wait()`.
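A rough sketch of what such a coercion hook could look like on an `AsyncCollectiveTensor`-style subclass (illustrative only; the class and `wait()` body here are stand-ins, not the code landed in the PR):
```python
from typing import Any, Optional, Type

import torch

class MyAsyncTensor(torch.Tensor):  # stand-in for AsyncCollectiveTensor
    def wait(self) -> torch.Tensor:
        ...  # hypothetical: block until the pending collective completes

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is torch.Tensor:
            # We were expected to be a plain Tensor: finish the collective.
            return self.wait()
        # expected_type=None: the same subclass is expected, nothing to do.
        return self
```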
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
This allows Configs to handle setting their defaults (or overriding
themselves) via environment variables.
The environment variables are resolved at install time (which is usually
import time). This is done 1) to avoid any race conditions between
threads etc., and 2) to encourage people to just go modify the
configs directly, rather than overriding environment variables to change
PyTorch behaviour.
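A toy sketch of the idea (the `Config` class and `env_name_default` parameter here are illustrative assumptions, not necessarily the exact API):
```python
import os
from typing import Optional

class Config:
    """A config knob whose default can be seeded from an environment variable."""

    def __init__(self, default: bool, env_name_default: Optional[str] = None):
        self.default = default
        self.env_name_default = env_name_default

    def resolve(self) -> bool:
        # Resolved once, at install/import time, rather than on every access.
        if self.env_name_default is not None and self.env_name_default in os.environ:
            return os.environ[self.env_name_default] == "1"
        return self.default

# e.g. a knob that can be flipped with TORCH_MY_FEATURE=1 before import
my_feature = Config(default=False, env_name_default="TORCH_MY_FEATURE").resolve()
```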
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
As MacOS 15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.
Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_mps_mem)/1024.0**2:.2f} Mb")
image.save(f'bfloat16.png')
```
Before the change, total memory use was 16Gb and the run needed 65 sec to complete; after it, memory drops down to 14Gb and the run takes 50 sec to finish on an M2 Pro, though the generated image remains the same:

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally, we've added a JK option to this
which resolves the JK.
This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to see the result.
Due to preceding work, basically everything works correctly here; we only
had to update a couple of tests and modify the getattr behaviour.
Note that we are updating the justknobs_check function to support a
default option, so that defaults work.
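A simplified sketch of that lookup order (everything here is a toy stand-in: `justknobs_check` is stubbed, and `_ConfigEntry` / `ConfigModule` are reduced to the relevant fields):
```python
_UNSET = object()

def justknobs_check(name: str, default: bool = True) -> bool:
    # stand-in for the real (internal-only) killswitch lookup
    return default

class _ConfigEntry:
    def __init__(self, default, justknob=None):
        self.default = default
        self.justknob = justknob
        self.user_override = _UNSET

class ConfigModule:
    def __init__(self, entries):
        object.__setattr__(self, "_entries", entries)

    def __getattr__(self, name):
        entry = self._entries[name]
        if entry.user_override is not _UNSET:
            return entry.user_override  # an explicit user setting wins
        if entry.justknob is not None:
            # the JK is consulted lazily, at attribute-access time
            return justknobs_check(entry.justknob, default=entry.default)
        return entry.default

    def __setattr__(self, name, value):
        self._entries[name].user_override = value

cfg = ConfigModule({"use_fast_path": _ConfigEntry(True, justknob="pytorch/compiler:use_fast_path")})
print(cfg.use_fast_path)  # resolved through the (stubbed) justknob
```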
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.
It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`
The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
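To make the identity-element idea concrete, here is a hedged sketch (not the PR's `_apply_reduction` itself) of reducing over the ragged dim by padding with the identity and then reducing densely:
```python
import torch

def reduce_over_ragged_dim(njt, func, identity_element):
    # Pad the ragged dim with the reduction's identity so the padded slots
    # can't affect the result, then reduce as a regular dense tensor.
    padded = torch.nested.to_padded_tensor(njt, padding=identity_element)
    return func(padded, dim=1)  # dim 1 is the ragged dim for (B, j1) NJTs

nt = torch.nested.nested_tensor(
    [torch.ones(2), torch.ones(3)], layout=torch.jagged
)
print(reduce_over_ragged_dim(nt, torch.sum, identity_element=0.0))          # tensor([2., 3.])
print(reduce_over_ragged_dim(nt, torch.amax, identity_element=-torch.inf))  # tensor([1., 1.])
```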
Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar
Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
**Background:** The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example:
```python
class MyTestClass(TestCase):
    @parametrize("x", range(3))
    @parametrize("y", [False, True])
    def test_foo(self, x, y):
        # Invoked with:
        # x=0, y=False
        # x=1, y=False
        # x=2, y=False
        # x=0, y=True
        # x=1, y=True
        # x=2, y=True
        ...
```
Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info *according to what is supported for each op*.
```python
class MyTestClass(TestCase):
    @ops(op_db)
    def test_foo(self, op, device, dtype):
        # Invoked for each OpInfo in the db along with each device / dtype that
        # corresponds with this op according to the OpInfo entry.
        ...
```
Note that this is in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately.
This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. Examples:
```python
# adapter_fn that adds a new arg
def include_is_even_arg(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    new_param_kwargs = dict(param_kwargs)
    new_param_kwargs["is_even"] = is_even
    is_even_suffix = "_even" if is_even else "_odd"
    new_test_name = f"{test_name}{is_even_suffix}"
    yield (new_test_name, new_param_kwargs)

# adapter_fn that excludes certain values
def exclude_odds(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    yield None if not is_even else (test_name, param_kwargs)

class MyTestClass(TestCase):
    @reparametrize(parametrize("x", range(5)), include_is_even_arg)
    def test_foo(self, x, is_even):
        # Invoked with both the x value and the new is_even arg
        ...

    @reparametrize(parametrize("x", range(5)), exclude_odds)
    def test_bar(self, x):
        # Only invoked with even x values
        ...
```
For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis.
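For instance, a hedged sketch of combining `@reparametrize` with `@ops` to add an NJT-specific arg (the import paths, `njt_op_db`, and the exact composition shown here are assumptions for illustration):
```python
from torch.testing._internal.common_device_type import ops          # assumed import path
from torch.testing._internal.common_utils import TestCase, reparametrize  # assumed import path

njt_op_db = []  # placeholder for an NJT-specific OpInfo database

# adapter_fn that expands each op test with a contiguity mode
def add_contiguity_arg(test_name, param_kwargs):
    for contig in ("contig", "noncontig_transposed", "noncontig_holes"):
        new_kwargs = dict(param_kwargs, contiguity=contig)
        yield (f"{test_name}_{contig}", new_kwargs)

class TestNJTOps(TestCase):
    @reparametrize(ops(njt_op_db), add_contiguity_arg)
    def test_op(self, device, dtype, op, contiguity):
        ...
```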
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369
Approved by: https://github.com/janeyx99
We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error:
```
RuntimeError: No backend for the parent process group or its backend does not support splitting
```
This also fixes an issue where a test failure was not asserted when using `with_comms()` or `with_comms(eager_init=False)`.
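A minimal sketch of the guard (the surrounding test scaffolding is omitted; `backend`, `rank`, and `world_size` are assumed to come from the test harness):
```python
import torch
import torch.distributed as dist

def init_pg(backend: str, rank: int, world_size: int) -> None:
    # Only pass a device_id for nccl; with other backends (e.g. gloo) it
    # leads to the splitting error quoted above.
    device_id = (
        torch.device("cuda", rank % torch.cuda.device_count())
        if backend == "nccl"
        else None
    )
    dist.init_process_group(
        backend=backend, rank=rank, world_size=world_size, device_id=device_id
    )
```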
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097
Approved by: https://github.com/XilunWu
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
* `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
* Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.
Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ...  # NJT of shape (B, H, S*, D)
key = ...    # NJT of shape (B, H, S*, D)
value = ...  # NJT of shape (B, H, S*, D)

# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```
TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
* Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
The `dot` reference implementation should be consistent with the CPU / CUDA implementations since it may be used for meta dispatch, i.e.:
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```
However the below does not raise an exception
```python
x.to("meta").dot(y.to("meta"))
```
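A hedged sketch of the kind of dtype check the reference implementation needs so the meta path fails the same way as eager (the function name and message here are illustrative):
```python
import torch

def check_dot_dtypes(x: torch.Tensor, y: torch.Tensor) -> None:
    # Mirror the eager error so meta dispatch raises for mismatched dtypes too.
    torch._check(
        x.dtype == y.dtype,
        lambda: (
            "dot : expected both vectors to have same dtype, but found "
            f"{x.dtype} and {y.dtype}"
        ),
    )
```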
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.
Cleanups
- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.
For release notes, I'm not sure exactly how to communicate this, but something like:
"ConfigModule.to_dict and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed referenced value objects to be modified. If you wish to modify the config object, call load_config explicitly."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
This has the benefit that
1) It's much easier to aggregate test failure repros into, say, a CSV or shell script from Scuba
2) We can do analysis (e.g. take the set difference of the tests run across two PRs)
3) We can get results faster at the test-level granularity instead of the job-level granularity we see in the HUD/GH.
I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9
I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
The logsumexp tensor was considered internal-use only but is apparently exposed to unit tests and Inductor.
The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.
Fixes #131316, #137414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Our matmul support is abysmal - let's at least get this working and do it performantly later.
Bonus: implements `bmm` as well.
jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of the time of writing, this is not part of the PT2 OSS Triton pin (although it has been back-ported internally). The OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for what the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
### Fix 1: Throw async error during init wait
Previously we just busy-waited for `ncclSuccess`; if the nonblocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`.
### Fix 2: Add wait after comm split
```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
// Reason 2:
// as of NCCL 2.23, the ptr value of child comm will not be filled until the
// state of parent comm is ncclSuccess. This may change in the future. See:
// https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, nor does it block until that point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**
(also mentioned here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552))
## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict = {
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0, 1],
            'param_names': ['layer.weight', 'layer.bias']  # (optional)
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.
## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.
## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.
#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.
## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.
## Solution Example:
A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict:
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping -- param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]

    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]

    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initialized with `named_parameters()` to have the 'param_names' key in the dict.
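Usage would then look roughly like this (`saved_state_dict` is a placeholder for a state dict produced by another optimizer that was also initialized with `named_parameters()`):
```python
import torch
import torch.optim as optim

model = torch.nn.Linear(4, 2)
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)

# Remap the loaded ids by parameter name before the state is applied.
optimizer.register_load_state_dict_pre_hook(adapt_state_dict_ids)
optimizer.load_state_dict(saved_state_dict)  # placeholder state dict
```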
### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we go down the following path:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
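A rough sketch of the pynvml-based variant of that check (illustrative; the real test wraps this in the distributed test harness):
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Every process holding a CUDA context on device 0 shows up here; with the
# fix, only the rank that actually owns device 0 should appear.
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
print(len(procs))
pynvml.nvmlShutdown()
```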
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161