Commit Graph

5527 Commits

Author SHA1 Message Date
Simon Fan
b86269fab5 Unify cpp_extension build directory removal (#136059)
Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-03 06:22:11 +00:00
Ke Wen
7631a04081 [c10d] Fix the device query story of ProcessGroup (#136790)
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.

A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`

The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
2024-10-03 01:36:22 +00:00
Joel Schlosser
6374a19a6e Fix wrapper subclass serialization with custom sizes / strides (#137030)
Fixes #130154

This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.

The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.

Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
2024-10-02 18:55:03 +00:00
Benjamin Glass
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00
Jez Ng
99eb47fb6d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet
2024-10-01 20:43:10 +00:00
Tom Ritchford
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
Nikita Shulga
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
ankurneog
22a4129a76 Generalization of FSDP common for non-cuda execution (#133209)
## Motivation
The FSDP common code for FSDP UT execution is mostly written with cuda device in mind. However other devices such the intel Gaudi supports most of the functionality. We are generalizing the base content so that the UT content can be used for non-cuda device execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
2024-09-27 00:38:10 +00:00
Joel Schlosser
991f8f8ec3 Bias gradient calculation for NJT linear backward (#136660)
Previously NYI - @mikaylagawarecki needs it for Transformers.

Fixes #136652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660
Approved by: https://github.com/soulitzer
2024-09-26 21:38:10 +00:00
drisspg
840c6b7a68 [FlexAttention] Add Better error message for cpu tensors (#136673)
Partially address: #136525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673
Approved by: https://github.com/Chillee
2024-09-26 16:40:21 +00:00
Howard Huang
141cae2eb8 [pipelining] Fix more leaks and check leaks in tests (#136584)
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).

This time, also add a test-time check that helped to discover new leaks and ensure we won't accidently regress.

Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.

Uses objgraph for a nice debug utility when a leak is found.

Credit to @H-Huang for pointing out objdump and helping debug the 'param_group["intermediates"]` leak.

I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.

Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:

```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```

rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507

Co-authored-by: Howard Huang <howardhuang@fb.com>
2024-09-26 01:10:40 +00:00
Huy Do
b7a5c7d331 Do not XFAIL test_segfault in fbcode (#136661)
https://github.com/pytorch/pytorch/pull/136252 silence the failure on OSS, but the test actually passed on fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661
Approved by: https://github.com/malfet
2024-09-25 22:26:24 +00:00
IvanKobzarev
370c1c4297 [aotd] Fix rrelu compilation (#136008)
Issues:
https://github.com/pytorch/pytorch/issues/135083
https://github.com/pytorch/pytorch/issues/120292

rrelu decomposition contains mutation, copy_. Decompositions are executed below Functionalization, as a result AOT produces non-functional graph.

Also that decomposition is registered as python_dispatch kernel for AutogradCUDA.
Autograd dispatch happens above Functionalization, so registering it for Autograd to handle all backends makes functionalization running after this.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_rrelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008
Approved by: https://github.com/bdhirsh
2024-09-25 11:26:19 +00:00
Joel Schlosser
888744bd36 NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021)
Related: #132695

This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.

This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
2024-09-24 19:11:49 +00:00
ankurneog
efed357ef5 Add dtypes support in opinfo for Intel Gaudi (#132840)
## Motivation
This is following up on changes introduced in https://github.com/pytorch/pytorch/pull/128584
we are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840
Approved by: https://github.com/albanD
2024-09-24 17:17:15 +00:00
gaopengff
33e10803c8 Fix ut in internal distributed_test.py (#136251)
I have failed with test case of **test_new_subgroups_by_enumeration_input_rank_exceeds_world_size**, and passed with this small change. The expected exception is supposed to be "ValueError" rather than "RuntimeError" according to [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251
Approved by: https://github.com/kwen2501
2024-09-24 15:06:20 +00:00
Joel Schlosser
83a3ee0699 Support embedding_bag() with NJT input (#135888)
Fixes #93843

`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
2024-09-23 17:35:19 +00:00
Isuru Fernando
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
Aaron Gokaslan
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
Isuru Fernando
0c936c3ecb Add decomps for max_unpool (#133146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-20 21:35:25 +00:00
Jeff Daily
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
Igor Sugak
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
Huy Do
db80b98ec4 XFAIL test_segfault (#136252)
Fixes https://github.com/pytorch/pytorch/issues/128551

As this has been failing in trunk for a while and there is no owner yet to fix it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252
Approved by: https://github.com/andrewkho
2024-09-19 04:17:06 +00:00
fduwjj
3efaa016b1 [c10d] Make test compatible for new pytest (#136158)
Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517.

Short-term fix following CPython: 51aefc5bf9/Lib/unittest/case.py (L419-L426)

Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158
Approved by: https://github.com/fegin
2024-09-18 17:10:55 +00:00
CaoE
6a6f5b20c5 Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936)
Fixes #132613.
Add `_addmm_activation` to lower precision cast policy on AutocastCPU.
`_addmm_activation`  https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39 of `transformer_encoder_layer_forward` may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, as `_native_multi_head_attention` is put in lower data type cast policy https://github.com/pytorch/pytorch/pull/107674 and `_addmm_activation` may encounter mixed data types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-09-18 16:31:27 +00:00
Prachi Gupta
b5be4d8c05 Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
skip_if_rocm is used only in multiprocess case (when UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If used for single process UT, it will cause the UT to fail as the process returns a non-zero exit code. Use skipIfRocm in single process UTs.

To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
2024-09-18 11:01:23 +00:00
Joel Schlosser
a8382847f4 Support rms_norm() for NJT (#135872)
`rms_norm()` is a nice-to-have for ViT :)

This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
2024-09-17 18:09:20 +00:00
PyTorch MergeBot
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd3.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
PyTorch MergeBot
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
PyTorch MergeBot
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67d.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
Aaron Gokaslan
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
Aaron Gokaslan
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
Tom Ritchford
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
Tom Ritchford
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
Jez Ng
426580a67d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel
ghstack dependencies: #133408
2024-09-14 21:45:19 +00:00
CaoE
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
PyTorch MergeBot
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c5206133.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
CaoE
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
PyTorch MergeBot
b6d6aa49b8 Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)"
This reverts commit e157ce3ebb.

Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))
2024-09-13 18:06:56 +00:00
Sanskar Modi
e157ce3ebb Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Adding validation checks to check the input types and display better error messages for the same.
Fixes https://github.com/pytorch/pytorch/issues/135463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596
Approved by: https://github.com/malfet
2024-09-12 21:28:37 +00:00
Mayank Mishra
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite small dynamic range it is very easy to overflow while computing `at::pow(input, 2)` , and it happens in real world computation.

I've tried to use `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in Fp16 while still giving good in FP32. I figured out happens due to overflow while computing square of the input tensor.

Original `LLamaRMSNorm` implementation upcasts input to fp32 to prevent this and give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

Proposed commit fixed the issue. FP16 in RMSNorm has to be treated in special way, to be usable in real world implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
Xinya Zhang
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Know Problem:
1. `test/test_transformers.py` will massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support to nestedtenosrs+SDPA but need more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
Tom Ritchford
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
Chien-Chin Huang
1d9fefff19 [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)
Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues.

fixes: https://github.com/pytorch/pytorch/issues/133415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535
Approved by: https://github.com/wz337
2024-09-10 03:10:00 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
Nowtryz
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
Will Feng
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns on that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
Shangdi Yu
590a3e9f8a [export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525)
Summary:
In graph of  TestXNNPACKQuantizer.test_dynamic_linear_with_con test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training ir.

This is because we lift params/buffers before calling make_fx. So previously, for the graph that’s passed to make_fx,`graph.L__self___linear1.weight` is a tensor
now in training ir, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```

Differential Revision: D61364547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2024-09-06 07:06:06 +00:00
Avik Chaudhuri
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
Henry Tsang
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` would make sure the class has attr device_type, which works with`skipCUDAIf` from `torch.testing._internal.common_device_type`.

Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
eqy
4f70b3cfae [CUDA][complex][TF32] Update test_noncontiguous_samples tolerances for complex64 (#134526)
Recent cuDNN heuristics change surfaces same TF32 issue as `float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526
Approved by: https://github.com/ezyang
2024-09-04 23:37:16 +00:00
FFFrog
80a6d60829 Moving _run_autocast_outofplace to basic class named TestAutocast to reduce redundance (#134460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-09-04 10:48:58 +00:00
Nikita Shulga
71383dd3da [MPS] Fix bachnorm_2d for channels last (#134618)
By skipping gather of input tensor if memory_layout is channels_last, which is a first step towards fixing  https://github.com/pytorch/pytorch/issues/134580

Though underlying problem is much more interesting, i.e. MPS does not have a generic support for channels last, but `c10::is_contiguoius()` is true for channels last layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134618
Approved by: https://github.com/albanD
2024-09-03 19:20:11 +00:00
Tobias Ringwald
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
IvanKobzarev
33ba952e31 [subclasses] Do not fakeTensor const prop subclass args (#134855)
The issue:

Const propagation checks only if arguments do not have FakeTensor. If argument is Subclass, it will pass this condition.

As a result Const Propogation execution happens without FakeTensorMode and having tensor factories inside Subclass.__torch_dispatch__ results that this Tensor is not Fakified.

Solution:

If we have subclasses arguments, do not count that const propagation is doable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
2024-09-03 13:31:49 +00:00
chilli
6fce1faa10 change multinomial to use async asserts instead of a synchronization (#134818)
Fixes https://github.com/pytorch/pytorch/issues/134442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818
Approved by: https://github.com/ezyang
ghstack dependencies: #134813
2024-09-03 06:33:24 +00:00
Nikita Shulga
f95085fd91 [BE][MPS] Prefer xfail to skip (#134858)
This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by https://github.com/pytorch/pytorch/pull/128393

Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean

Before the change if run on MacOS 14:
```
 % python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s

OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s

OK (skipped=10, expected failures=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
2024-08-31 00:29:48 +00:00
Masaki Kozuki
e21d7b77ce Update ForeachfuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)
for `test_meta.py` to see more "PASSED" instead of "XFAIL".

`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99
2024-08-30 17:30:50 +00:00
PyTorch MergeBot
f997b2b8e6 Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)"
This reverts commit f685018ea9.

Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be calling maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](f685018ea9) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))
2024-08-28 23:10:07 +00:00
Nowtryz
f685018ea9 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-08-28 21:30:39 +00:00
Chien-Lin Chen
40de63be09 parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749)
Fixes #123451

This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127.
The test failure is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749
Approved by: https://github.com/janeyx99
2024-08-28 16:34:06 +00:00
David Berard
289486d007 Move attention kernels back from fake_impls to meta_registrations (#134288)
See #121528 for additional context.

In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).

Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.

Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.

Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
2024-08-27 21:10:36 +00:00
Bo Li
16b8146c9e Exclude test_transformers and unit tests which require recent GPU arch (#132895)
This PR is to exclude test_transformers on ROCm temporarily and skip some unit tests which require recent GPU arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-08-27 20:40:53 +00:00
Nikita Shulga
8de0d7690c Use newer toAccumulateType signature in Normalization.cpp (#134540)
Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS`  in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests

Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`

Fixes https://github.com/pytorch/pytorch/issues/134423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-27 18:09:20 +00:00
Colin Peppler
13049cd6e5 [aotinductor][UserDefinedTritonKernel] fix case with non-constexpr params declared after autotuned params (#134520)
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
  param0,
  param1,
  param2,
  param3: tl.constexpr,   # autotuned
  param4,                 # non-constexpr
):
  ...
```

This is an edge case because it's a general practice to declare all constexprs params at the end.

And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument...

```
>     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
<     CUdeviceptr var_3;
<     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
<     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```

## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.

79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)

## Fix
We try to fix this, by calculating the right constexprs wrt `raw_args`.

An illustration
```
         raw_args: [arg0, arg1, arg2, arg4]
 kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3]                      # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: []                       # this is correct wrt raw_args
```

Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
2024-08-27 17:20:27 +00:00
Mikayla Gawarecki
d028b810fe Fix flaky GroupNorm ModuleInfo test (#133899)
Fixes https://github.com/pytorch/pytorch/issues/98677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133899
Approved by: https://github.com/albanD
2024-08-27 14:45:51 +00:00
wizzniu
3515090006 Fix TypeError when itering NoneType in instantiate_device_type_tests() (#134457)
Fixes #134454

Fix TypeError introduced by https://github.com/pytorch/pytorch/pull/133082, which uses iter for NoneType of default args ``except_for`` and ``only_for``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134457
Approved by: https://github.com/shink, https://github.com/albanD
2024-08-27 07:13:36 +00:00
Roy Hvaara
1565940114 [MPS] Add test/test_nn.py to test suite (#134184)
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.

Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs.

```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```

Ref #133520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-26 23:48:23 +00:00
Benjamin Glass
55236d0cb7 TestForeach::test_parity: Remove check for error message text (#134251)
Previously, error messages were expected to be string equivalent to
error messages thrown by the ref function.  This check fails for dozens
of torch functions, and doesn't appear to add much value for the end
user.  This commit removes this check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253, #134344
2024-08-26 22:40:54 +00:00
Benjamin Glass
ef8c474fcf Add the fast path for bfloat16 lgamma (#134344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253
2024-08-26 22:40:54 +00:00
Benjamin Glass
3c5883e550 Fix test_parity xfail for sigmoid (#134253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 22:40:54 +00:00
Mengwei Liu
a0e062c6f1 Add mean.dtype_out (#133506)
Give it a try and see if CI is happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506
Approved by: https://github.com/bdhirsh
2024-08-26 19:26:11 +00:00
Wuxun Zhang
1d231ff8ba [HOO] add hints_wrapper to support passing context hints (#132860)
Fixes #126393

The implementation code is based on feedback here (https://github.com/pytorch/pytorch/pull/121639#issuecomment-2223948842).

Hints are passed as kwargs of hints_wrapper op. It also supports nested hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132860
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-26 18:21:22 +00:00
Benjamin Glass
27d97b9649 Remove unnecessary test skip (#134250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 14:34:53 +00:00
Laith Sakka
7b6b10417d Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248)
I had a night mare rewriting tests in test_misc.py specifically :
1. graphs can have comments that refers to my files "/lsakka/.." we really dont care about comments add option to ignore comments.
2. empty lines added when EXPECTTEST_ACCEPT=1  are changed with linter causing tests to fail or linter fail!
add flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text have some not readable characters. those should not effect comparing strings, also those causes weird diffs comments when tests fails. I removed ansi_escape chars https://github.com/pytorch/pytorch/pull/133045

this is used in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
2024-08-26 02:03:44 +00:00
Xu Han
1034f456ef [inductor] fix munge_exc not support windows path (#134348)
Windows file path use `\` as delimiter, it is also a escape character. We need translate all path `\` to `/`. which like Linux.

Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
    fn(self, records)
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
    munge_exc(record.getMessage()),
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
    s = re.sub(file, os.path.basename(file), s)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
    code = _escape(source, this, state)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
    raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')])  # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
2024-08-24 05:51:35 +00:00
Aaron Gokaslan
83b5d449a3 Add full float16/bfloat16 support to MaxUnPool (#133774)
It already supported half so might as well add bfloat16 support for parity

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-22 13:34:43 +00:00
Zitong Zhan
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimum example of usage:
```
import torch

if __name__ == '__main__':
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
Yuanhao Ji
cdb9c7d228 Add support for using privateuse1 backend name in instantiate_device_type_tests() (#133082)
As you can see, 'privateuse1' appears many times in out-of-tree extension codebase. I think that everything about the device type should be as same as other in-tree backends after registering the privateuse1 backend.

For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.

```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```

> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655

The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
2024-08-22 06:17:21 +00:00
chilli
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
xinan.lin
c2e2602ecd [Inductor] Move GPU_TYPE(The runtime avaliable gpu type, cuda or xpu) from (#132740)
Move GPU_TYPE(The runtime avaliable gpu type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`. So that we can use it in Inductor, not limited in test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-08-21 11:18:00 +00:00
krzysztofjordan
2e1830c7c8 Implement 2D version of masked_select for nestedtensors (#133889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133889
Approved by: https://github.com/soulitzer
2024-08-20 21:46:32 +00:00
Chirag Pandya
994fcb9acd Killswitch based rollout for flight recorder (#133237)
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a kilswitch in case we need to kill this feature in production.

Test Plan: Tests pass manually but need futher testing before this is rolled out fully everywhere.

Differential Revision: D61136320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
2024-08-20 04:27:55 +00:00
drisspg
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is  take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but not applying correclty

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.

Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Jiang, Yanbing
215b14530a Add Half for sparse.mm reduce (#133672)
This PR is to add Half support for sparse.mm reduce in CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672
Approved by: https://github.com/Skylion007
2024-08-17 15:20:39 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve the BC for users still using the `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still
working without changing the public imports. So it's safe to land the
changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Li, Xingyuan
dcfa415e6e [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_compiled_optimizers.py (#133083)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_optimizers.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos
2024-08-17 01:15:26 +00:00
Denis Vieriu
861bdf96f4 [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393)
Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):
- Add support for **MPS strided API** (strides/storage offsets etc):
   - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
   - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
   - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
   - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer

OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets.

---

Couple performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
2024-08-16 21:07:50 +00:00
Mikayla Gawarecki
d9576c9440 Fix failures when default is flipped for weights_only (#127627)
Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627
Approved by: https://github.com/albanD
ghstack dependencies: #132349
2024-08-16 00:22:43 +00:00
PyTorch MergeBot
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
Jane Xu
c23dceb8f1 Add Adafactor foreach impl (#132336)
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.

Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
2024-08-15 17:00:33 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statement. This is semanticly safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, a empty function does not need a `pass` statement as placeholder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
Aaron Gokaslan
ec49ce5f8e [CUDA]: Add frexp CUDA bfloat16 support (#133313)
Fixes #133263 Add CUDA bfloat16 support to cuda_frexp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-08-15 15:20:00 +00:00
soulitzer
4af4910b1a Reland "Construct NJT without graph breaks" (#133196)
This reverts commit 154d40ca488e6979ce9c2de89d8a35b53129ebea.

and adds changes from https://github.com/pytorch/pytorch/pull/133061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133196
Approved by: https://github.com/ezyang
ghstack dependencies: #133145
2024-08-14 01:11:13 +00:00
drisspg
caba37e99b Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser, https://github.com/Chillee
2024-08-13 23:37:50 +00:00
Sun, Jiayi
7be77658e9 [Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155)
This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance.
BTW, I fixed the UT of `byte` by setting the range of the sample inputs  to [0, 255] since the range of `torch.uint8` is [0, 255].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130724
2024-08-13 01:12:05 +00:00
soulitzer
05de2b2d0f Revert "Construct NJT without graph breaks" (#133145)
This reverts commit 911154271309667b55dfb963ec6384bd0048019b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133145
Approved by: https://github.com/YuqingJ
2024-08-10 03:11:16 +00:00
drisspg
1434e0b121 Add a private _safe_softmax (#131060)
# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.

Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
2024-08-08 23:09:38 +00:00
Alnis Murtovi
7bd0732cbd Fix flaky internal mixed_mm tests (#133015)
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad which made the mixed_mm pattern not match anymore (instead of cast -> mm, it was cast -> pad -> mm), and the tests also fail when is_big_gpu is false (which I haven't found an explanation for).

Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-08-08 20:32:12 +00:00
Joel Schlosser
75eb66afc0 Support 'non-contiguous with holes' NJTs for contiguous clone() (#132776)
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.

This PR fixes this in simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer term solution involves writing a proper kernel to support this.

NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
2024-08-08 17:08:11 +00:00
Catherine Lee
4fe6a5dc34 Move slow tests to be in repo (#132379)
Move the slow test json to be in the pytorch/pytorch repo and make a job that will update it weekly.  The job uses the same environment as the commit hash.  It uses similar code to the hash updates, but the hash update contains a lot of code that is specific to the hash update, so I chose to pick out the parts that are relevant

Remove references to the old file and set up testing to read from the new file instead

The old update cadence was every day, the new one is every week

The auto slow test infra + the lack of pinning between pytorch and test-infra makes it really hard to tell if a test started failing because of a change or because of the slow test json changing.  While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests like we do for disable issues.

Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)

We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
2024-08-07 18:42:56 +00:00
wz337
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
Apurva Jain
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
Edward Z. Yang
ed224554eb [BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807
Approved by: https://github.com/malfet
2024-08-07 02:15:18 +00:00
PyTorch MergeBot
cbee9c1fd2 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 0e7e61f7ce.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))
2024-08-07 00:05:20 +00:00
soulitzer
f50621989b Construct NJT without graph breaks (#130292)
Combines contributions from https://github.com/pytorch/pytorch/pull/130505

Some context can be found in this large comment block:

a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)

Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)

Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.

Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
2024-08-06 17:03:39 +00:00
Wei Feng
2a8e94347f [TP] verify numeric parity on Transfromers for multiple iterations (#132543)
Before setting up float8 numeric parity test, I have to set up regular TP numeric parity test, preferrably testing 10 iterations

this PR sets a baseline of TP numerics. I can verify fp8 on top of it

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543
Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
2024-08-04 06:43:27 +00:00
Xuehai Pan
0e7e61f7ce Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-03 09:43:38 +00:00
Adnan Akhundov
8ad9f89ccc [inductor] Reland: Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#132562)
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.

We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.

Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Differential Revision: D60701839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
2024-08-03 06:31:28 +00:00
PyTorch MergeBot
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
Pearu Peterson
a4ea776881 Add pinned memory support to sparse COO/CSR/CSC/BSR/BSC tensors (#129645)
As in the title:

To register indices/values of a sparse XYZ tensor with CUDA, the following methods are supported
- `sparse_xyz_tensor(indices, values, pin_memory=True)`
- `sparse_xyz_tensor(indices, values).pin_memory()`
- `sparse_xyz_tensor(indices.pin_memory(), values.pin_memory())`

Fixes https://github.com/pytorch/pytorch/issues/115330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129645
Approved by: https://github.com/amjames, https://github.com/cpuhrsch, https://github.com/eqy
2024-08-02 08:55:55 +00:00
majing
3a355c1891 Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630)
Fixes #130916
As the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), we need an increasing sequence as bins tensor. Random input doesn't make sense for torch.histogram.
The case is a comparison between CPU backend and another backend. When the input is random, kernel implementation in other backends have to totally align with the CPU kernel, or the case fails.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-08-02 01:51:09 +00:00
henrylhtsang
0c3ac428a2 [BE][typing] fix types in common pruning (#132309)
BE task. Add typings and remove mypy errors in torch/testing/_internal/common_pruning.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132309
Approved by: https://github.com/ColinPeppler
2024-08-01 23:34:33 +00:00
Xuehai Pan
30293319a8 [BE][Easy][19/19] enforce style for empty lines in import segments in torch/[o-z]*/ (#129771)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2024-08-01 17:07:14 +00:00
Oguz Ulgen
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
jainapurva
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
Yu, Guangye
45e6a364ee Avoid autocast deprecation warning (#132207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207
Approved by: https://github.com/awgu
2024-07-31 13:13:39 +00:00
ekamiti
9e473fd868 Make adding Buffers more like adding Parameters (#125971)
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.

Fixes #35735

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
2024-07-31 10:32:40 +00:00
Joel Schlosser
524aac413c Initial OpInfo-based testing for NJTs (#131704)
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
    * `test_forward()` - compares forward output to an unbind-based reference
    * `test_backward()` - compares forward output and grads to an unbind-based reference
    * `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
    * `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
    * `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing

TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
2024-07-30 23:02:24 +00:00
Joel Schlosser
d53b11bb6e Strict shape checking for NJTs with TestCase.assertEqual() (#131898)
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.

This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.

Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
2024-07-30 20:05:48 +00:00
PyTorch MergeBot
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00
jainapurva
d039b14207 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-29 21:49:06 +00:00
PyTorch MergeBot
d5e9fbb012 Revert "BE: reset dynamo before each test in test_module.py (#131372)"
This reverts commit 527901f054.

Reverted https://github.com/pytorch/pytorch/pull/131372 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
PyTorch MergeBot
a4723b566f Revert "BE: reset dynamo before each test in test_ops_gradients.py (#131397)"
This reverts commit ca8153ae67.

Reverted https://github.com/pytorch/pytorch/pull/131397 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
Tom Ritchford
bdf5a6dca9 Add decomposition for unsqueeze_copy (#130942)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942
Approved by: https://github.com/peterbell10
2024-07-29 21:13:37 +00:00
Xuehai Pan
5cc34f61d1 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals"
present in the PR comment, the test config filter will set a environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
2024-07-29 18:53:14 +00:00
Xuehai Pan
4694ee1ad2 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-29 18:53:14 +00:00
Shunting Zhang
ca8153ae67 BE: reset dynamo before each test in test_ops_gradients.py (#131397)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_ops_gradients.py` to make it easier to land.

Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
2024-07-29 17:39:23 +00:00
Shunting Zhang
527901f054 BE: reset dynamo before each test in test_module.py (#131372)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_module.py` to make it easier to land.

Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
2024-07-29 17:39:23 +00:00
PyTorch MergeBot
6cf493158e Revert "Enable FlashAttention on Windows (#131906)"
This reverts commit b90bc66766.

Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))
2024-07-29 16:49:23 +00:00
Tom Ritchford
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
PyTorch MergeBot
c35f21e5fc Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 14158d892a.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](14158d892a) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))
2024-07-29 13:19:38 +00:00
PyTorch MergeBot
06fe99a097 Revert "[CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)"
This reverts commit dfa18bf3f3.

Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))
2024-07-29 13:09:41 +00:00
Xuehai Pan
dfa18bf3f3 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals"
present in the PR comment, the test config filter will set a environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
2024-07-29 07:40:42 +00:00
Shunting Zhang
f151f25c0b BE: reset dynamo before each test in test_torch.py (#131388)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_torch.py` to make it easier to land.

Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
2024-07-29 04:57:34 +00:00
Oguz Ulgen
7c29665f77 Remove mypy ignore from torch/testing/_internal/distributed/ (#131870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131870
Approved by: https://github.com/aakhundov
ghstack dependencies: #131786
2024-07-28 17:13:53 +00:00
PyTorch MergeBot
d3c17fea90 Revert "[BE] typing for decorators - _library/custom_ops (#131578)"
This reverts commit c65b197b85.

Reverted https://github.com/pytorch/pytorch/pull/131578 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
Xuehai Pan
14158d892a [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-27 19:39:40 +00:00
Will Feng
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predicable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
Xu Han
a90b8b967a [inductor] enable windows inductor UTs (#131767)
Changes:
1. Add `skipIfWindows` function.
2. Fix `fresh_inductor_cache` raise error on Windows, due to can't delete loaded modules.
3. Disable some UTs, which are not passed on Windows.
4. Enable test_torchinductor in Windows CI.

I have tested passed on my dev machine:
<img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb">

TODO: review and fix the skipped cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:03 +00:00
Luca Wehrstedt
b90bc66766 Enable FlashAttention on Windows (#131906)
Let's just give this a try.

Reland of https://github.com/pytorch/pytorch/pull/131875.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906
Approved by: https://github.com/drisspg
2024-07-26 21:41:56 +00:00
Shunting Zhang
c8626a4e1f [BE] add a list of inductor test files to skip resetting dynamo (#131551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551
Approved by: https://github.com/zou3519
2024-07-26 21:08:15 +00:00
PyTorch MergeBot
0f9bf208ec Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 054d214c50.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))
2024-07-26 19:03:10 +00:00
Xuehai Pan
63374dda69 [BE][Easy] explicitly define global constants in torch.testing._internal.common_utils (#129826)
This appeases IDE warnings like "torch.testing._internal.common_utils has no member TEST_WITH_ROCM".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129826
Approved by: https://github.com/Skylion007
2024-07-26 06:32:08 +00:00
Aaron Orenstein
c65b197b85 [BE] typing for decorators - _library/custom_ops (#131578)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131578
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577
2024-07-25 22:24:19 +00:00
PyTorch MergeBot
f3df7deab8 Revert "Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)"
This reverts commit e9db1b0597.

Reverted https://github.com/pytorch/pytorch/pull/131431 on behalf of https://github.com/clee2000 due to broke internal tests D60211713 ([comment](https://github.com/pytorch/pytorch/pull/131431#issuecomment-2251091957))
2024-07-25 18:00:46 +00:00
Jane Xu
9c4cf866c2 Adafactor forloop basic impl (#129905)
#109581

At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor

Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```

The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
        [ 0.0762539,  0.5377805],
        [ 0.2459473,  0.4662207]])

# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
        [ 0.0762507,  0.5377818],
        [ 0.2459457,  0.4662213]])

# keras
array([[ 0.38076326, -0.39121315],
        [ 0.0762547 ,  0.5377859 ],
        [ 0.24594972,  0.46622536]], dtype=float32)
```

This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.

<details>

Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing

My script repro:

```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)

weight = torch.tensor([[ 0.37697506, -0.39500135],
        [ 0.07246649,  0.53399765],
        [ 0.24216151,  0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)

weight.grad = torch.tensor([[-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974,  1.5422692], dtype=torch.float32)

weight_c = weight.clone()
weight_c.grad = weight.grad.clone()

optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)

optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```

<details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
2024-07-25 13:17:19 +00:00
Xuehai Pan
054d214c50 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-25 10:10:58 +00:00
William Wen
7718024d2b [3.13] support 3.13 multiline traces in munge_exc (#131207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131207
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #131206
2024-07-24 18:22:30 +00:00
William Wen
106c6a49f5 [dynamo] limit number of compiles per frame (#130891)
Fixes https://github.com/pytorch/pytorch/issues/130776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130891
Approved by: https://github.com/anijain2305
2024-07-24 16:43:40 +00:00
Adnan Akhundov
e9db1b0597 Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 05:37:09 +00:00
Will Feng
fc3d2b26cd Use fake PG for test_compute_comm_reordering.py unit tests (#131415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415
Approved by: https://github.com/yifuwang
2024-07-23 22:53:23 +00:00
Aaron Orenstein
5a0068cc69 [BE] mypy: disallow untyped decorators (#131428)
Untyped decorators strip the types from their decorated function so even if the underlying function is fully typed then callers to it don't get any benefit from type annotations.

Step 1 - Enable the error and override in all the offending files.

#131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428
Approved by: https://github.com/justinchuby, https://github.com/oulgen
2024-07-23 21:50:55 +00:00
Shangdi Yu
68c725a094 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 17:48:38 +00:00
Tom Ritchford
16247987a1 Add decomposition for t_copy (#130939)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939
Approved by: https://github.com/peterbell10
2024-07-23 08:29:19 +00:00
Li-Huai (Allan) Lin
99d9b369f4 [Optim] Support tensor lr for all optimizers and check it is 1-element (#131065)
Fixes: #130980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065
Approved by: https://github.com/janeyx99
2024-07-23 04:27:05 +00:00
PyTorch MergeBot
b435d84261 Revert "[custom ops] Add register_vmap for custom ops (#130589)"
This reverts commit 074b420641.

Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))
2024-07-23 01:44:44 +00:00
Shangdi Yu
074b420641 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 00:54:52 +00:00
Thomas Ortner
8ae1963a61 [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I struggle with solving:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`

Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-22 23:18:19 +00:00
Tom Ritchford
500cbb5b90 Add decomposition for view_copy (#130938)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938
Approved by: https://github.com/peterbell10
ghstack dependencies: #130937
2024-07-21 20:39:24 +00:00
Tom Ritchford
f628813066 Fix out_wrapper, _make_copy_from_view to handle all signatures (#130937)
* See #128416 and #129476
* Simplify xskip lists in test/functorch/test_ops.py
* Add supports_out=True to OpInfos for copy ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130937
Approved by: https://github.com/peterbell10
2024-07-21 20:39:24 +00:00
ankurneog
ebc012ace6 Add hooks for execution on intel gaudi devices - 1 (#128584)
## Motivation
This is follow up to PR:https://github.com/pytorch/pytorch/pull/126970  to support Gaudi devices for Pytorch UT execution.

## Changes
We are adding additional hooks to:
1. Add dtype exceptions for Gaudi/HPU
2. Extend onlyNativeDevices decorator  functionality to add additional devices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128584
Approved by: https://github.com/albanD
2024-07-20 05:03:36 +00:00
Isuru Fernando
bb4251213b Add decomposition for channel_shuffle (#118775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775
Approved by: https://github.com/peterbell10
2024-07-20 01:24:41 +00:00
Catherine Lee
31e79aae6a Another follow up to #130260 (#130993)
Another followup to https://github.com/pytorch/pytorch/pull/130260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130993
Approved by: https://github.com/huydhn
2024-07-19 16:43:54 +00:00
kausik
4f60a2e39c Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953)
Earlier the signature of dequantize ops for decomposed quantized Tensor was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization.
Please refer: https://github.com/pytorch/pytorch/pull/121450

However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow.

This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-07-19 04:58:02 +00:00
PyTorch MergeBot
fb3674b1f4 Revert "[Autograd] Cond Higher-Order Operation (#126911)"
This reverts commit f7058b735e.

Reverted https://github.com/pytorch/pytorch/pull/126911 on behalf of https://github.com/clee2000 due to broke lint and functorch/test_aotdispatch f7058b735e Probably a landrace since both the test and lint passed on PR ([comment](https://github.com/pytorch/pytorch/pull/126911#issuecomment-2237703182))
2024-07-18 22:06:40 +00:00
Thomas Bohnstingl
f7058b735e [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I struggle with solving:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`

Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-18 21:09:09 +00:00
PyTorch MergeBot
fff92d4f18 Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494)"
This reverts commit 9f392f8294.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))
2024-07-18 17:42:05 +00:00
drisspg
dd39dca034 Removing some cruff and updating signatures for consistency (#130871)
# Summary

- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some doc strings to helper funcs and fixes some misc typing things
- Forces functions passed to `create_block_mask` to mask_mods and updates tests files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
2024-07-18 13:32:11 +00:00
Michael Lazos
cf3f4285a8 Add recursive metadata guard test (#131002)
Ensures that nested tensors subclasses are guarded properly. It turns out this case is already handled [here](d77af49380/torch/_dynamo/variables/builder.py (L1496)) which will recursively wrap inner tensors adding metadata guards for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131002
Approved by: https://github.com/bdhirsh
2024-07-18 08:24:43 +00:00
Sam Larsen
9f392f8294 Use inductor TestCase for test_replicate_with_compiler.py (#129494)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-18 03:08:32 +00:00
Aidyn-A
bd56bcf0ab [TEST] Fix _scaled_mm tests (#130897)
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
2024-07-18 02:15:00 +00:00
lezcano
af0b5ee924 Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)
We don't need to generate so many samples for these very expensive ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-07-17 16:29:36 +00:00
Xuehai Pan
b29b23137c [Easy] Fix argument name collision in dispatched functions (#129562)
Use positional-only argument to avoid naming collision with aten ops arguments that are named "self".

```python
In [1]: def foo(self, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [2]: def bar(self, /, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [3]: foo(1, 2, self=3)
TypeError: foo() got multiple values for argument 'self'

In [4]: bar(1, 2, self=3)
1
(2,)
{'self': 3}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129562
Approved by: https://github.com/zou3519, https://github.com/fegin
2024-07-17 14:39:56 +00:00
PyTorch MergeBot
8458dc8966 Revert "Use inductor TestCase for distributed tests (#129494)"
This reverts commit 3cd2ae331a.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))
2024-07-17 00:32:48 +00:00
drisspg
2b43d339fe Make FlexAttention API public (#130755)
# Summary

Makes the prototype API flex_attention public

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755
Approved by: https://github.com/Chillee
2024-07-16 16:21:25 +00:00
Jovian Anthony Jaison
e57101d927 Add testing regarding SparseAdam state_dicts (#130645)
Summary:
- Updated SparseAdam to run test_state_dict_deterministic unit test.
- Made gradients sparse while keeping weights dense in the above test.

Test Plan:
- Ran test_optim.py locally.

Fixes #116507

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645
Approved by: https://github.com/janeyx99
2024-07-16 11:29:22 +00:00
Sam Larsen
3cd2ae331a Use inductor TestCase for distributed tests (#129494)
Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-16 01:24:35 +00:00
eqy
5e617d7ef5 [CUDA] Actually bump tolerances for test_grad_pca_lowrank (#130770)
Fixes change in #129902 to actually bump pca rather than svd, thanks @ptrblck for the catch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130770
Approved by: https://github.com/Skylion007
2024-07-16 00:41:10 +00:00
Bilal Khan
54a932b0ac Support for expandable segments with cuda graph trees (#128068)
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.

With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.

As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
2024-07-15 23:23:23 +00:00
Alnis Murtovi
50ef099ad0 Learn a heuristic to decide whether to pad before mm (#128643)
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code.

The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.

The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predicitons on the validation set.

The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.

On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.

The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.

The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.

A100
```
       max_depth  min_samples_leaf dataset  correct  wrong  unsure  total  max_wrong_speedup  gman_wrong_speedup  threshold
15         5.0                10     train     2730      4    3023   5757           1.372220            1.193873   1.702530
16         5.0                10       val      878      0    1042   1920                NaN                 NaN   1.702530
17         5.0                10      test      925      2     993   1920           1.741708            1.354954   1.702530
18         5.0                10  hf-train       14      0      22     36                NaN                 NaN   1.702530
19         5.0                10    hf-inf        7      0       1      8                NaN                 NaN   1.702530
```

The numbers for huggingface only include tensors where each dim is >=512. If all tensors would have been included there would have been the following number of matmuls, where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10

## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").

I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.19 (+/- 0.00)  1.19 (+/- 0.00)         -0.00              40.33 (+/- 1.13)             40.95 (+/- 0.78)                     -0.62                     1.52                   3                         2
  BartForConditionalGeneration   1.53 (+/- 0.06)  1.47 (+/- 0.05)          0.06              81.93 (+/- 5.20)             82.23 (+/- 1.92)                     -0.30                     0.36                   3                         1
    BlenderbotSmallForCausalLM   1.86 (+/- 0.04)  1.86 (+/- 0.00)          0.00              36.76 (+/- 0.49)             37.62 (+/- 1.33)                     -0.87                     2.31                   3                         2
                     CamemBert   2.36 (+/- 0.01)  2.35 (+/- 0.01)          0.01              97.60 (+/- 1.91)             98.69 (+/- 1.35)                     -1.09                     1.11                   2                         1
                   DistillGPT2   2.57 (+/- 0.01)  2.57 (+/- 0.01)          0.00              57.33 (+/- 0.77)             58.26 (+/- 1.41)                     -0.93                     1.59                   3                         2
             PLBartForCausalLM   2.07 (+/- 0.01)  2.06 (+/- 0.01)          0.01              32.54 (+/- 0.83)             34.65 (+/- 0.71)                     -2.11                     6.10                   3                         2
PLBartForConditionalGeneration   1.87 (+/- 0.00)  1.88 (+/- 0.00)         -0.01              58.45 (+/- 1.24)             58.95 (+/- 1.92)                     -0.50                     0.85                   3                         1
            RobertaForCausalLM   2.39 (+/- 0.01)  2.40 (+/- 0.01)         -0.01              97.38 (+/- 1.52)             97.69 (+/- 1.18)                     -0.31                     0.32                   2                         1
              TrOCRForCausalLM   1.70 (+/- 0.00)  1.70 (+/- 0.00)         -0.00              44.79 (+/- 1.33)             45.25 (+/- 1.08)                     -0.46                     1.01                   3                         2

Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```

Results on huggingface inference
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.11 (+/- 0.00)  1.11 (+/- 0.00)          0.00              19.02 (+/- 0.28)             19.40 (+/- 0.35)                     -0.38                     1.95                   3                         2
  BartForConditionalGeneration   1.26 (+/- 0.01)  1.23 (+/- 0.03)          0.03              36.84 (+/- 0.40)             36.55 (+/- 0.75)                      0.30                    -0.81                   3                         1
    BlenderbotSmallForCausalLM   1.87 (+/- 0.02)  1.87 (+/- 0.01)          0.00              17.53 (+/- 0.31)             18.03 (+/- 0.43)                     -0.49                     2.74                   3                         2
                   DistillGPT2   2.50 (+/- 0.02)  2.50 (+/- 0.01)          0.00              16.16 (+/- 0.29)             16.40 (+/- 0.18)                     -0.24                     1.46                   3                         2
             PLBartForCausalLM   1.93 (+/- 0.01)  1.94 (+/- 0.01)         -0.00              15.30 (+/- 0.22)             16.01 (+/- 0.71)                     -0.71                     4.43                   3                         2
PLBartForConditionalGeneration   1.98 (+/- 0.01)  1.98 (+/- 0.01)          0.00              25.90 (+/- 0.32)             26.58 (+/- 0.62)                     -0.67                     2.53                   3                         1
              TrOCRForCausalLM   1.61 (+/- 0.00)  1.62 (+/- 0.00)         -0.01              21.38 (+/- 0.37)             21.85 (+/- 0.16)                     -0.47                     2.16                   3                         2

Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```

For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-07-15 23:04:06 +00:00
eqy
6f32dc0c7b Don't pass error message as places in assertGreaterAlmostEqual (#130648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648
Approved by: https://github.com/awgu
2024-07-15 22:14:49 +00:00
PyTorch MergeBot
9e161af179 Revert "Increase tolerance for tensorsolve tests (#130620)"
This reverts commit 103b6ccab2.

Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))
2024-07-15 17:35:04 +00:00
cyy
28f6ae2718 [9/N] Replace c10::optional with std::optional (#130674)
Follows  #130509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674
Approved by: https://github.com/Skylion007
2024-07-15 00:48:43 +00:00
chilli
f9f85bfc0b [Inductor] FlexAttention supports partial masking (#130415) (#130626)
This is the new version of https://github.com/pytorch/pytorch/pull/130415

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Approved by: https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626
Approved by: https://github.com/drisspg, https://github.com/yanboliang
2024-07-14 00:37:26 +00:00
Tobias Ringwald
e5de25896f Fixed CUDA randint generation for large ranges. (#126066)
Fixes #125224

For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).

This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-07-13 21:42:27 +00:00
rzou
95046c86e3 [HOP] add HOP x torch_dispatch interaction (#130606)
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
  subclass interaction
- (coming soon) we'll offer a way to open register HOP x subclass
  interaction without needing to touch the subclass's
  `__torch_dispatch__` or the HOP's .py_impl.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
2024-07-12 21:51:36 +00:00
albanD
103b6ccab2 Increase tolerance for tensorsolve tests (#130620)
Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D

Since it appeared with https://github.com/pytorch/pytorch/pull/128238 that only updates random seed for the test, I expect this is just bad luck of the draw. Thus increasing tolerance like we do for other tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620
Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet
2024-07-12 20:08:18 +00:00
leslie-fang-intel
2a1f22e57f Change BN to eval before QAT Convert phase (#130598)
**Summary**
In the QAT convert phase, we fold bn into conv and do DCE to this BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default`  for a safe DCE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598
Approved by: https://github.com/jgong5, https://github.com/yushangdi
2024-07-12 16:03:56 +00:00
yan-yhy
c16e90fe06 The device_suffix in a test_name is "privateuse1" sometimes. (#130091)
When run some test cases on the privateuse1 device, the device_suffix in a test_name is 'privateuse1' sometimes.
For examples, a test_name is 'test_Dropout1d_npu', while it would be 'test_Dropout1d_privateuse1' sometimes.
When setUpClass() didn't set it, the device_suffix would be "privateuse1".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
2024-07-12 02:51:40 +00:00
PyTorch MergeBot
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b8.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
PyTorch MergeBot
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb21098.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
Yidi Wu
cd9bae30de Allow kwargs in _remove_effect_tokens_pass (#130491)
Summary: Previously, remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fix it and add a test for it.

Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs

Reviewed By: angelayi

Differential Revision: D59603147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
2024-07-11 19:03:19 +00:00
PyTorch MergeBot
578388bed8 Revert "Support for expandable segments with cuda graph trees (#128068)"
This reverts commit fdc83610f2.

Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))
2024-07-11 18:58:13 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Jiang, Yanbing
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.

Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
Isuru Fernando
5db9bd467e Skip test_nnc_correctness for new op _unsafe_masked_index (#130375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375
Approved by: https://github.com/lezcano
2024-07-11 08:17:16 +00:00
Bilal Khan
fdc83610f2 Support for expandable segments with cuda graph trees (#128068)
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.

With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.

As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
2024-07-11 05:33:09 +00:00
Tom Ritchford
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
Chien-Chin Huang
81ea298600 Wrap the test func with try/except to always call destroy_process_group (#124961)
This can avoid PG warning about not calling destry_pg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124961
Approved by: https://github.com/wanchaol, https://github.com/wz337
2024-07-10 15:36:38 +00:00
Li-Huai (Allan) Lin
99967e1119 [MPS][TYPE_PROMOTION] Fix Clamp (#130226)
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:

```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype  # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype  # torch.float32
```

~Not sure the proper way to handle it, it causes numpy ref tests to fail.~
Reason here, so think I'm gonna xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
2024-07-10 14:27:39 +00:00
PyTorch MergeBot
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
Jiang, Yanbing
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.

Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
rzou
99c68f7bea Refactor TritonKernelVariable's logic so it can be shared (#130177)
TritonKernelVariable's logic tells us how to go from a user-defined
triton kernel and a grid to a call to the triton_kernel_wrapper_mutation
HOP. We want to re-use this in a setting without Dynamo; in the next PR
up, we create a new decorator (capture_triton) that, when applied to a
triton kernel, transforms a call to the triton kernel into a call
to the triton_kernel_wrapper_mutation HOP.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177
Approved by: https://github.com/oulgen, https://github.com/ydwu4
2024-07-10 03:09:29 +00:00
Tom Ritchford
68751799b8 Add decompositions for copy variants of view ops (#128416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 01:39:09 +00:00
Catherine Lee
5b5a1f5202 Add on to Mark some test_decomp tests as slow on win #130260 (#130337)
An add on to https://github.com/pytorch/pytorch/pull/130260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130337
Approved by: https://github.com/malfet
2024-07-09 22:30:53 +00:00
Joel Schlosser
fd43a2ba27 Forward fix for test_compare_cpu_cuda_float32 (#130360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130360
Approved by: https://github.com/malfet
ghstack dependencies: #128238
2024-07-09 22:28:39 +00:00
Jerry Mannil
42f647219a [ROCm] Add int4 support (#129710)
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 gpus for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses `v_and_or_b32` instruction and `__hfma2` instrinsic for unpacking bf16 values
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
  - TorchScript has its own implementation for bfloat16 type
    - Implemented in `__nv_bloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation
    - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify

Fixes #124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-09 19:49:12 +00:00
Catherine Lee
75fa10066d Mark some test_decomp tests as slow on win (#130260)
Auto slow test detection is marking and then un marking these as slow, so permanently mark them as slow on windows.

These tests take >500s on windows.

This is part of the reason why test_decomp keeps failing on windows (ex da66e50e6e)

The other part is something to do with reruns + thresholds that I am still investigating
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-07-09 00:16:31 +00:00
PyTorch MergeBot
d7b7f8b79f Revert "[ROCm] Add int4 support (#129710)"
This reverts commit d0ad13fa42.

Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))
2024-07-08 16:07:53 +00:00
Joel Schlosser
c8ab2e8b63 Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238)
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed).
    * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`:
* Uncovered a bunch of test issues:
    * Test breakdown (>100 total)
        * A lot of tolerance issues (tweaked tolerance values to fix)
        * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
        * 3 actually broken semantics (for masked tensor; added xfails)
        * 4 Jacobian mismatches (added xfails)
        * 2 nan results (skip for now, need fixing)
        * 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.

**before (no seed setting):**
```
real	0m21.306s
user	0m19.053s
sys	0m5.192s
```

**after (with seed setting):**
```
real	0m21.905s
user	0m19.578s
sys	0m5.390s
```

* Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command.

```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
    self.assertFalse(True)
AssertionError: True is not false

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
    fn(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
    raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.037s

FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
2024-07-08 16:06:38 +00:00
Jerry Mannil
d0ad13fa42 [ROCm] Add int4 support (#129710)
Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction.
Only supports CDNA2 and CDNA3 gpus for now.
Fixes #124699

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-07 23:54:22 +00:00
chilli
da66e50e6e Added compile option to create_block_mask (#130106)
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 *trillion* element tensor if fully materialized.

```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
2024-07-06 08:09:56 +00:00
Li-Huai (Allan) Lin
d62d351107 [Optim][BE] Change str(device) to _get_device_type(device) (#129984)
Prevent using vague expressions like `"cuda" in str(device)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129984
Approved by: https://github.com/janeyx99
ghstack dependencies: #129451, #129552
2024-07-04 06:44:48 +00:00
eqy
29ffa20bb1 [CUDA] Bump tolerances for test_grad_pca_lowrank (#129902)
The revert of #127199 seems to surface an additional failure on A100---small tolerance bump to account for this.

I did find what appears to be a race condition in the one of the kernels used in this workload but I'm not sure it's related here...

CC @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129902
Approved by: https://github.com/ezyang
2024-07-02 23:17:02 +00:00
Li-Huai (Allan) Lin
8ec5ba960f [MPS] Add tensor_lr overloads to fused adam & adamw (#129451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129451
Approved by: https://github.com/janeyx99
2024-07-02 19:46:30 +00:00
Aaron Gokaslan
6c2a8b6b38 [Ez][BE]: Enable new stable ruff rules (#129825)
Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825
Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet
2024-07-02 14:47:10 +00:00
eqy
f845a7a91a [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-30 19:22:16 +00:00
leslie-fang-intel
35a197defa [Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output (#128825)
**Summary**
Support int8 GEMM Template with refer MicroInt8GEMM kernel for case:

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16
- Post Op Fusion: without unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [ ] Unary post op fusion
- [ ] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128825
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-30 09:45:43 +00:00
Yanbo Liang
ec47d4d9a8 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-29 04:44:38 +00:00
PyTorch MergeBot
3d96217891 Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 9e1f3ecaa7.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))
2024-06-29 00:47:15 +00:00
Isuru Fernando
c12a4f2e65 Add decomposition for slice_scatter (#123744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123744
Approved by: https://github.com/peterbell10
2024-06-28 17:02:10 +00:00
PyTorch MergeBot
999eec8dea Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit b7e7a4cb01.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))
2024-06-28 06:03:54 +00:00
PyTorch MergeBot
c43923a116 Revert "[Inductor] FlexAttention supports block sparse mask (#129216)"
This reverts commit b9d3cedd64.

Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is still failing in trunk b9d3cedd64, maybe a landrace given that TD has been turned off ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2196182882))
2024-06-28 05:44:46 +00:00
Yanbo Liang
b9d3cedd64 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-28 01:32:54 +00:00
Xuehai Pan
9e1f3ecaa7 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-28 00:35:15 +00:00
Xuehai Pan
d80939e5e9 [BE] enable UFMT for torch/storage.py (#127706)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127706
Approved by: https://github.com/ezyang
2024-06-27 23:16:24 +00:00
PyTorch MergeBot
5ceba6a3cb Revert "[Inductor] FlexAttention supports block sparse mask (#129216)"
This reverts commit 4082759925.

Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/clee2000 due to broke functorch/aot_dispatch and test_proxy_tensor on windows https://github.com/pytorch/pytorch/actions/runs/9691331440/job/26743164471 4082759925 missed on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2195087274))
2024-06-27 15:57:52 +00:00
Yanbo Liang
4082759925 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-27 05:44:27 +00:00
Li-Huai (Allan) Lin
84ad5452f6 [MPS] Fused SGD optimizer (#129350)
```
[-------------------------------------- Fused SGD --------------------------------------]
                                                          |  Fused: True  |  Fused: False
1 threads: ------------------------------------------------------------------------------
      numel: 1024, num_tensors: 100, momentum: True       |        2      |       15
      numel: 1024, num_tensors: 100, momentum: False      |        2      |        5
      numel: 65536, num_tensors: 100, momentum: True      |        3      |       16
      numel: 65536, num_tensors: 100, momentum: False     |        2      |        5
      numel: 1048576, num_tensors: 100, momentum: True    |       11      |       16
      numel: 1048576, num_tensors: 100, momentum: False   |        8      |        6
      numel: 1024, num_tensors: 500, momentum: True       |       29      |       70
      numel: 1024, num_tensors: 500, momentum: False      |       20      |       24
      numel: 65536, num_tensors: 500, momentum: True      |       33      |       76
      numel: 65536, num_tensors: 500, momentum: False     |       22      |       26
      numel: 1048576, num_tensors: 500, momentum: True    |       70      |       80
      numel: 1048576, num_tensors: 500, momentum: False   |       43      |       40
      numel: 1024, num_tensors: 1000, momentum: True      |      108      |      139
      numel: 1024, num_tensors: 1000, momentum: False     |       72      |       48
      numel: 65536, num_tensors: 1000, momentum: True     |      116      |      150
      numel: 65536, num_tensors: 1000, momentum: False    |       77      |       52
      numel: 1048576, num_tensors: 1000, momentum: True   |      190      |      170
      numel: 1048576, num_tensors: 1000, momentum: False  |      120      |       50
```

```python
def profile_fused_sgd():
    from torch.optim.sgd import sgd
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, momentum_buffer_list, fused):
        fn(
            params,
            grads,
            momentum_buffer_list,
            momentum=True if len(momentum_buffer_list) > 0 else False,
            dampening=0.0,
            nesterov=False,
            foreach=False,
            fused=fused,
            lr=1e-3,
            weight_decay=.0,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, momentum in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False]):
        sublabel = f"numel: {numel}, num_tensors: {num_tensors}, momentum: {momentum}"
        print(sublabel)
        params, grads = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(2)]
        momentum_buffer_list = [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] if momentum else []
        fn = sgd

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, momentum_buffer_list, fused)',
                    label='Fused SGD',
                    sub_label=sublabel,
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129350
Approved by: https://github.com/janeyx99
ghstack dependencies: #129006, #129008, #129007, #129105
2024-06-27 04:37:14 +00:00
PyTorch MergeBot
895316119d Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 0314c4c101.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))
2024-06-26 19:03:57 +00:00
Andres Lugo
b9a1c2c991 [ROCm] Enable F8 Inductor Unit tests (#128353)
First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-26 18:30:43 +00:00
Tugsbayasgalan Manlaibaatar
6181e65cd8 Nested tensor subclass support (#127431)
When we have nested tensor subclasses, we need to recursively flatten/unflatten in Fake tensor creation and AOTAUtograd. Most of the PR is about mechanical change which changes today's single level flatten logic to be recursive.

Differential Revision: [D58533224](https://our.internmc.facebook.com/intern/diff/D58533224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127431
Approved by: https://github.com/bdhirsh
2024-06-26 04:45:22 +00:00
Eddie Yan
b7e7a4cb01 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-26 00:49:18 +00:00
Xuehai Pan
0314c4c101 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-25 08:28:38 +00:00
Isuru Fernando
e6bfa2958b Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-25 02:45:02 +00:00
Li-Huai (Allan) Lin
9a7e2519d3 [MPS] Fused Adam & AdamW (#127242)
Summary:

This PR adds fused Adam and AdamW implementations.

Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory:
**Fast math enabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        89
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        90
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        83
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       12      |        94
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       11      |        88
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       12      |        90
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |       100
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       27      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       23      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       27      |       100
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       23      |        98
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       82      |       480
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       72      |       450
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       82      |       450
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       73      |       420
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       91      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       83      |       400
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |       94      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       78      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      170      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      140      |       600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      170      |       600
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      140      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      250      |       890
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      220      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      250      |       830
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      220      |       770
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      270      |       870
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      230      |       840
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      270      |       810
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      240      |       800
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      400      |      1000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      360      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      430      |      2000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      360      |      1300

Times are in milliseconds (ms).
```

**Fast math disabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        79
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       11      |        93
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       10      |        90
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       11      |        91
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |        81
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       34      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       31      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       34      |        95
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       31      |       100
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       94      |       500
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       82      |       430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       92      |       430
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       81      |       390
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       98      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       88      |       430
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |      100      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       88      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      210      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      190      |       610
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      210      |       510
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      190      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      300      |       900
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      260      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      295      |       900
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      260      |       800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      320      |       910
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      280      |       900
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      320      |       900
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      300      |       900
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      500      |      2000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      480      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      540      |      1500
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      480      |      1200

Times are in milliseconds (ms).
```

```python
def profile_fused_adam():
    from torch.optim import adam, adamw
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                    label='Fused Adam',
                    sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242
Approved by: https://github.com/kulinseth, https://github.com/janeyx99
2024-06-18 19:59:50 +00:00
Guilherme Leobas
9818283da1 re-enable jacrev/jacfwd/hessian after #128028 landed (#128622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128622
Approved by: https://github.com/zou3519
2024-06-18 17:08:58 +00:00
Andrew Gu
f2805a0408 [FSDP2] Added APIs for explicit fwd/bwd prefetching (#128884)
This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively.

```
def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]): -> None
def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]): -> None
```

**Motivation**
FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order.

However, there may be cases where with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMIHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)).

The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem since such a limit would generally apply over the entire model even though it possibly should not. Then, expert users will see a specific all-gather that they want to deviate from this limit, and there is little we can do.

Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs).

Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884
Approved by: https://github.com/ckluk2, https://github.com/weifengpy
2024-06-18 13:32:57 +00:00
Chien-Chin Huang
6e43897912 [BE][ptd_fb_test][3/N] Enable TestSlide for MultiThreadedTestCase (#128843)
Enabling testslide for MultiThreadedTestCase, similar to https://github.com/pytorch/pytorch/pull/127512.

Differential Revision: [D58677457](https://our.internmc.facebook.com/intern/diff/D58677457/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128843
Approved by: https://github.com/wz337
2024-06-18 07:05:31 +00:00
cyy
163847b1bb [1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128675
Approved by: https://github.com/r-barnes
2024-06-17 21:25:59 +00:00
xinan.lin
cc518ebd38 [Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 2) (#124147)
Reuse Inductor test case for Intel GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124147
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-06-16 08:07:05 +00:00
Blaine Burton Rister
f1ee3589a1 [Inductor] Emit strided block pointer from ModularIndexing and FloorDiv (#127342)
**Summary**

Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.

Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
  - Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CielDiv(x, y)` rounds up to 1.
  - Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.

Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.

Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077

Example kernel:
```
triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```

**Test Plan**

This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
  - Can we generate strided block pointers for the appropriate shapes?
     - Powers of 2
     - Non-power of 2, but multiple of the maximum block size
     - Arbitrary leading dimensions, with power of 2 inner dimensions
     - Weird strides and offsets
     - Reductions
     - Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
     - Broadcasts (some variables are missing from the indexing expression)
  - Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
     - Unsupported static shapes
     - Unsupported symbolic shapes
  - Mixing and matching these cases:
     - Pointwise and reduction in the same kernel
  - Sanity check the test harness
     - Do we raise an exception if the expected number of block pointers and the actual number are different?

**Follow-ups**

There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
  - Handle non-divisible shapes
      - Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
      - Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
 - Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?

Differential Revision: D56739375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-06-16 07:35:57 +00:00
Michael Lazos
a61939467a Enable passing dynamo-traced complex test (#128771)
Fixes https://github.com/pytorch/pytorch/issues/118159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128771
Approved by: https://github.com/anijain2305
2024-06-16 07:28:09 +00:00
dilililiwhy
574a2cbcb7 Enable UFMT on common_device_type.py and common_dtype.py (#128490)
Part of: https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
> torch/testing/_internal/common_device_type.py
> torch/testing/_internal/common_dtype.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128490
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2024-06-15 00:07:42 +00:00
Tom Ritchford
edb45dce85 Add OpInfo entry for as_strided_copy (#127231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231
Approved by: https://github.com/lezcano
2024-06-13 13:58:47 +00:00
PyTorch MergeBot
817ce6835b Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit 4c971932e8.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))
2024-06-12 18:47:52 +00:00
Tom Ritchford
2386045e4f Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-12 09:39:58 +00:00
ankurneog
a838e90964 Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970)
### Motivation
Intel Gaudi accelerator (device name hpu) is seen to have good pass rate with the pytorch framework UTs , however being an out-of-tree device, we face challenges in adapting the device to natively run the existing pytorch UTs under pytorch/test. The UTs however is a good indicator of the device stack health and as such we run them regularly with adaptations.
Although we can add Gaudi/HPU device to generate the device specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on lot of features such as executing for specific dtypes, skipping and overriding opInfo. With significant changes introduced every Pytorch release maintaining these adaptations become difficult and time consuming.
Hence with this PR  we introduce Gaudi device in common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to introduce Gaudi out-of-tree support as equivalent to in-tree devices

### Changes
Add HPUTestBase of type DeviceTypeTestBase specifying appropriate attributes for Gaudi/HPU.
Include code to check if  intel Gaudi Software library is loaded and if so, add the device to the list of devices considered for instantiation of device type tests

### Additional Context
please refer the following RFC : https://github.com/pytorch/rfcs/pull/63/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
2024-06-11 16:35:17 +00:00
Guilherme Leobas
4460e481bc Disable jacrev/jacfwd/hessian if compiling with dynamo (#128255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128255
Approved by: https://github.com/zou3519
2024-06-10 20:47:53 +00:00
PyTorch MergeBot
90bb510ece Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 348b181a97.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))
2024-06-10 20:44:42 +00:00
PyTorch MergeBot
3b73f5de3a Revert "Add OpInfo entry for alias_copy (#127232) (#128142)"
This reverts commit 04da6aeb61.

Reverted https://github.com/pytorch/pytorch/pull/128142 on behalf of https://github.com/DanilBaibak due to The changes broke the test_output_match_alias_copy_cpu_complex64 test. ([comment](https://github.com/pytorch/pytorch/pull/128142#issuecomment-2158793878))
2024-06-10 16:17:16 +00:00
Isuru Fernando
c993f1b37f Fix edge cases for gather in inductor (#126893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126893
Approved by: https://github.com/peterbell10
ghstack dependencies: #126876
2024-06-10 15:31:03 +00:00
Tom Ritchford
04da6aeb61 Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-10 15:01:53 +00:00
eqy
4c971932e8 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-09 06:53:34 +00:00
Xinya Zhang
d34075e0bd Add Efficient Attention support on ROCM (#124885)
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation

Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.

This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129

`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change.  [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
2024-06-08 22:41:05 +00:00
Pritam Damania
0dd55ee159 Fix bug in _update_process_group API (#128262)
`local_used_map_` was undefined in case of `find_unused_parameters=False`, this resulted in an error when we ran `local_used_map_.fill_(0);`

Added a unit test as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128262
Approved by: https://github.com/awgu
2024-06-08 19:52:24 +00:00
Aaron Orenstein
8db9dfa2d7 Flip default value for mypy disallow_untyped_defs [9/11] (#127846)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127846
Approved by: https://github.com/ezyang
ghstack dependencies: #127842, #127843, #127844, #127845
2024-06-08 18:50:06 +00:00
Chien-Chin Huang
2369c719d4 [DSD][BE] Cleanup unused variables and rename variables to avoid exposure to the users (#128249)
These APIs and variables should not be exposed to users as they are designed to be used internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128249
Approved by: https://github.com/wz337
2024-06-08 17:12:17 +00:00
Xuehai Pan
348b181a97 Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007
2024-06-08 15:25:03 +00:00
Xilun Wu
85758fa5ae [c10d][TCPStore] make TCPStore server use libuv by default (#127957)
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">

We hope this improvement would benefit users from a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.

**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to cause any user notice except significant faster TCPStore startup for large-scale jobs.

One thing to note is, we do not support the initialization approach where user passes in a socket for libuv backend. We plan to support it as a next step but we choose to disable it before fully testing. If you are initializing TCPStore in this approach, you can see the next section to remain using the old TCPStore server.

**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there're 3 ways:

1. If user is directly instantiating TCPStore object, user can pass in argument `use_libuv=False` to use the old TCPStore server backend e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, user can set environment variable `USE_LIBUV` to `"0"` when launching.

These 3 approach are in order of precedence. That being said, if user specifies `use_libuv=0` in `init_method` and also sets environment var `USE_LIBUV="1"`, the former will take effect and the TCPStore backend instantiated will be the old one instead of the one using libuv.

**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

**TODO**
1. Update the doc at

- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization

2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
2024-06-07 16:53:01 +00:00
Ke Wen
01601ebd41 Retire torch.distributed.pipeline (#127354)
Actually retiring module after deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-07 08:11:58 +00:00
laithsakka
68cc63ae27 introduce skipIfNNModuleInlined and skip test_cpu_cuda_module_after_dynamo (#128023)
see the issue https://github.com/pytorch/pytorch/issues/127636 to for details about the issue, TLDR is that
when inlining is enabled, we create a fake tensor while tracing in dynamo and try to perform  aten.add.Tensor between
two tensor of different types, with out inlining we do not hit that operation during tracing.
```
Failed running call_function <built-in function add>(*(FakeTensor(..., size=(20, 20), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(20, 20))), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices cpu, cuda:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128023
Approved by: https://github.com/anijain2305
ghstack dependencies: #127487, #127553
2024-06-07 06:00:33 +00:00
Pritam Damania
e9c5144cbc Fix bug in update_process_group DDP API (#128092)
Fix bug in `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...`

Added a unit test as well that reproduced the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092
Approved by: https://github.com/awgu, https://github.com/fegin
2024-06-06 17:10:42 +00:00
PyTorch MergeBot
c58d3af3b4 Revert "Add OpInfo entry for alias_copy (#127232)"
This reverts commit 457df212e1.

Reverted https://github.com/pytorch/pytorch/pull/127232 on behalf of https://github.com/clee2000 due to broke [onnx](https://github.com/pytorch/pytorch/actions/runs/9397057801/job/25880181144) and [mps](https://github.com/pytorch/pytorch/actions/runs/9397057805/job/25879818705) tests, [hud link](457df212e1) , base is 15 days old, the onnx test xfailed on the pr but the xfail was removed so if you rebase itll surface, mps build failed so no mps tests were run on the pr ([comment](https://github.com/pytorch/pytorch/pull/127232#issuecomment-2152848758))
2024-06-06 15:44:47 +00:00
PyTorch MergeBot
48a54146e7 Revert "[dynamo] Support ndarray.dtype attribute access (#124490)"
This reverts commit 4adee71155.

Reverted https://github.com/pytorch/pytorch/pull/124490 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/124490#issuecomment-2152664749))
2024-06-06 14:21:29 +00:00
Tom Ritchford
457df212e1 Add OpInfo entry for alias_copy (#127232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127232
Approved by: https://github.com/lezcano
2024-06-06 07:46:26 +00:00
Michael Lazos
638f543ac2 Enable single nadam test (#128087)
https://github.com/pytorch/pytorch/issues/117150 has been fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128087
Approved by: https://github.com/xmfan
2024-06-06 06:25:00 +00:00
Adam J. Stewart
80d34217c6 Typo fixes: et al. (#127811)
"et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811
Approved by: https://github.com/janeyx99
2024-06-06 01:03:25 +00:00
Chien-Chin Huang
bb68b54be0 [BE][ptd_fb_test][1/N] Enable testslide (#127512)
This change allows to enable Testslide, which gives us more readable output, import time, etc. The PR is previously stamped https://github.com/pytorch/pytorch/pull/126460 but the old PR has some ghexport issue.

Differential Revision: [D57919583](https://our.internmc.facebook.com/intern/diff/D57919583/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127512
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-06-05 17:45:15 +00:00
willfengg
6454e95824 [FSDP2] enable CI for torch.compile(root Transformer) (#127832)
This CI showcases FSDP2 works with `torch.compile` root model, since FSDP1 can do the same

compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`

compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832
Approved by: https://github.com/awgu
2024-06-05 17:29:46 +00:00
Andrew M. James
4adee71155 [dynamo] Support ndarray.dtype attribute access (#124490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124490
Approved by: https://github.com/lezcano
ghstack dependencies: #125717
2024-06-05 17:20:01 +00:00
Huy Do
8992141dba Restore MPS testing on MacOS 13 and m2 metal (#127853)
The runners are ready now https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13, we want to keep some MacOS 13 runner for mps coverage until MacOS 15 is out.

This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582.

The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853
Approved by: https://github.com/malfet
2024-06-05 14:44:00 +00:00
dan_the_3rd
4a384d813b [SDPA/memeff] Backport changes from xFormers to PT (#127090)
Backporting a few fixes from xFormers:
* Bug fixes for local attention (which is not exposed in PT at the moment)
* Massively reduced memory usage on the BW pass (see also https://github.com/facebookresearch/xformers/pull/1028)

Essentially this will also make xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time
The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127090
Approved by: https://github.com/drisspg
2024-06-05 07:33:27 +00:00
PyTorch MergeBot
0ff60236ab Revert "Retire torch.distributed.pipeline (#127354)"
This reverts commit b9c058c203.

Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit b9c058c203 ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))
2024-06-04 18:19:31 +00:00
Yifu Wang
f325b39303 Introduce Inductor passes to micro-pipeline all-gather-matmul and matmul-reduce-scatter in certain cases (#126598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126598
Approved by: https://github.com/wanchaol
2024-06-04 09:06:56 +00:00
Ke Wen
b9c058c203 Retire torch.distributed.pipeline (#127354)
Actually retiring module after deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-04 07:03:26 +00:00
PyTorch MergeBot
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
James Wu
63d7ffe121 Retry of D58015187 Move AsyncCompile to a different file (#127691)
Summary:
This is a retry of https://github.com/pytorch/pytorch/pull/127545/files
and
D58015187, fixing the internal test that also imported codecache

Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now

Differential Revision: D58054611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691
Approved by: https://github.com/oulgen
2024-06-03 15:29:41 +00:00
Isuru Fernando
f03f8bc901 Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
Xuehai Pan
8b08b0f340 [BE] enable ruff rule Q from flake8-quotes (#127713)
Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes:

- [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003)
- [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713
Approved by: https://github.com/ezyang
2024-06-02 23:25:26 +00:00
Xuehai Pan
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

Resolves #126888

- #126888

This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
PyTorch MergeBot
22f392ba40 Revert "[easy?] Move AsyncCompile to a different file (#127235)"
This reverts commit f58fc16e8f.

Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))
2024-06-01 17:16:16 +00:00
PyTorch MergeBot
d49dc8f4b8 Revert "Add noqa to prevent lint warnings (#127545)"
This reverts commit f9937afd4f.

Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))
2024-06-01 17:12:46 +00:00
Jane Xu
4129c3e596 Let us find out why we wrote foreach meta regs (#127623)
Turns out it was for no reason!...well, after realizing that these ops are all CompositeExplicit, their meta impls come for free.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127623
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #127412
2024-06-01 13:58:18 +00:00
Jane Xu
ac60bdaf01 Allow slow foreach to run for any backend, not just CPU (#127412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412
Approved by: https://github.com/albanD
2024-06-01 13:58:18 +00:00
PyTorch MergeBot
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
cyy
d44daebdbc [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-31 01:20:45 +00:00
feifan
da9fb670d2 Nadam support the flag for "maximize" (#127214)
Fixes https://github.com/pytorch/pytorch/issues/126642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127214
Approved by: https://github.com/janeyx99
2024-05-31 01:11:16 +00:00
PyTorch MergeBot
846f79e61a Revert "Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)"
This reverts commit 18a3f781e6.

Reverted https://github.com/pytorch/pytorch/pull/127199 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MacOS trunk job 18a3f781e6 (25619618844) ([comment](https://github.com/pytorch/pytorch/pull/127199#issuecomment-2140834363))
2024-05-30 20:45:31 +00:00
James Wu
f9937afd4f Add noqa to prevent lint warnings (#127545)
This is to prevent the import from being removed due to unused import. What's annoying about this is that it's not consistently running: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545
Approved by: https://github.com/masnesral
2024-05-30 17:56:49 +00:00
lezcano
18a3f781e6 Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)
We don't need to generate so many samples for these very expensive ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199
Approved by: https://github.com/peterbell10, https://github.com/zou3519
ghstack dependencies: #125580
2024-05-30 14:45:58 +00:00
lezcano
48538d3d14 Implement svd_lowrank and pca_lowrank for complex numbers (#125580)
We fix a number of bugs previously present in the complex
implementation.

We also heavily simplify the implementation, using, among
other things, that we now have conjugate views.

I saw there is a comment regarding how slow some checks on this
function are. As such, I removed quite a few of the combinations of inputs
to make the OpInfo lighter. I still left a couple relevant examples to not regress
coverage though.

Fixes https://github.com/pytorch/pytorch/issues/122188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125580
Approved by: https://github.com/pearu, https://github.com/peterbell10
2024-05-30 14:45:58 +00:00
James Wu
f58fc16e8f [easy?] Move AsyncCompile to a different file (#127235)
By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes.

To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral
2024-05-30 02:43:02 +00:00
SandishKumarHN
da39461d61 [optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418)
this PR address the comments in this PR #124904

- Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py
- Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers
- Move to OptimizerInfo framework.
- For failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32, Adam_cuda_float32
    - Added toleranceOverride in this PR
    - created a issue #127000

```
> (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'lr': 0.1, 'fused': True}
{'lr': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad'
test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok

----------------------------------------------------------------------
Ran 8 tests in 16.117s

OK (skipped=1)

> lintrunner test/test_cuda.py
----------------------------------------------------------------------
ok No lint issues.

> lintrunner torch/testing/_internal/common_optimizers.py
----------------------------------------------------------------------
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418
Approved by: https://github.com/janeyx99
2024-05-30 01:47:41 +00:00
PyTorch MergeBot
67739d8c6f Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 699db7988d.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))
2024-05-30 01:16:57 +00:00
Darshan Sanghani
15a7916c0e Ability to capture Process Groups information into Execution Traces (#126995)
Contains a method added to the ExecutionTraceObserver class to record the snapshot of the current process group config upon tracing start.

Unit test:

```
(pytorch) [dsang@devgpu021.nha2 ~/github/pytorch-fork (viable/strict)]$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace
/home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
test_ddp_profiling_execution_trace (__main__.TestDistBackendWithSpawn.test_ddp_profiling_execution_trace) ... /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
/home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
NCCL version 2.20.5+cuda12.0
[rank1]:[W523 16:06:01.705774398 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W523 16:06:01.705905760 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W523 16:06:01.715182258 execution_trace_observer.cpp:819] Enabling Execution Trace Observer
printing pg info into trace
[rank0]:[W523 16:06:01.715841805 execution_trace_observer.cpp:819] Enabling Execution Trace Observer
printing pg info into trace
[rank1]:[W523 16:06:01.727881877 execution_trace_observer.cpp:831] Disabling Execution Trace Observer
[rank0]:[W523 16:06:01.728792871 execution_trace_observer.cpp:831] Disabling Execution Trace Observer
Execution trace saved at /tmp/tmpdsov4ngi.et.json
[{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}]
Execution trace saved at /tmp/tmpsdiqy6az.et.json
[{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}]
ok

----------------------------------------------------------------------
Ran 1 test in 24.447s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126995
Approved by: https://github.com/briancoutinho, https://github.com/sraikund16
2024-05-29 23:16:17 +00:00
Jane Xu
05e99154ee Allow int vals to go down the fastpath for _foreach_max (#127303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303
Approved by: https://github.com/albanD
ghstack dependencies: #127187
2024-05-29 19:08:58 +00:00
Jane Xu
601c5e085d Add _foreach_max (#127187)
This PR adds _foreach_max support, the second reduction foreach op we have :D

I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first.

Caveats!
- We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath!
- MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
2024-05-29 19:08:58 +00:00
rzou
28de9143a3 opcheck should be usable without optional dependencies (#127292)
This PR excises opcheck's dependency on
torch.testing._internal.common_utils, (which comes with dependencies on
expecttest and hypothesis). We do this by moving what we need to
torch.testing._utils and adding a test for it.

Fixes #126870, #126871

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127292
Approved by: https://github.com/williamwen42
ghstack dependencies: #127291
2024-05-29 17:17:49 +00:00
Xuehai Pan
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
cyy
699db7988d [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-29 11:58:03 +00:00
William Wen
5359af0c7e [dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341)
Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-05-29 05:18:04 +00:00
PyTorch MergeBot
cdbb2c9acc Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 4fdbaa794f.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))
2024-05-29 03:02:35 +00:00
Andrew Gu
db0a0ecb60 [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated from an ask on Slack :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-28 22:51:36 +00:00
PyTorch MergeBot
c7f6fbfa9d Revert "[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)"
This reverts commit 9117779b0a.

Reverted https://github.com/pytorch/pytorch/pull/127024 on behalf of https://github.com/atalman due to failing in CI ([comment](https://github.com/pytorch/pytorch/pull/127024#issuecomment-2133566325))
2024-05-27 14:12:09 +00:00
feifan
22712ba5c5 Radam support the flag for "maximize" (#126765)
Fixes #[126642](https://github.com/pytorch/pytorch/issues/126642)

I reference the maximize in `Adam` and add `Radam's` maximize flag. If this pr is OK, I will add another pr for `Nadam`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126765
Approved by: https://github.com/janeyx99
2024-05-27 06:34:50 +00:00
cyy
4fdbaa794f [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-27 03:54:03 +00:00
Peter Bell
39de62845a [decomp] Fix default values missing from inplace rrelu decomposition (#126978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126978
Approved by: https://github.com/lezcano
2024-05-26 23:49:40 +00:00
Xuehai Pan
35ea5c6b22 [3/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torchgen (#127124)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123
2024-05-25 19:20:03 +00:00
PyTorch MergeBot
12d11fe4e5 Revert "reset dynamo cache before each test (#126586)"
This reverts commit bd24991f46.

Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/malfet due to Broke tons of tests, see bd24991f46  ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2131365576))
2024-05-25 17:17:19 +00:00
Shunting Zhang
bd24991f46 reset dynamo cache before each test (#126586)
In https://github.com/pytorch/pytorch/issues/125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests.

This PR clear dynamo cache before each unit test so we get more deterministic result for unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586
Approved by: https://github.com/jansel
2024-05-25 04:48:09 +00:00
Jane Xu
665637714f Remove SparseAdam weird allowance of raw Tensor input (#127081)
This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used.

------

# BC Breaking note

As of this PR, SparseAdam will become consistent with the rest of our optimizers in that it will only accept containers of Tensors/Parameters/param groups and fully complete deprecation of this path. Hitherto, the SparseAdam constructor had allowed raw tensors as the params argument to the constructor. Now, if you write the following code, there will be an error similar to every other optim: "params argument given to the optimizer should be an iterable of Tensors or dicts"

```
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)
```

Instead you should replace the last line with
```
optimizer = torch.optim.SparseAdam([param])
```
to no longer error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081
Approved by: https://github.com/soulitzer
2024-05-25 02:58:24 +00:00
Shunting Zhang
db9c6aeec6 Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit 0a9c6e92f8.

enable the test since it's fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594
Approved by: https://github.com/huydhn
ghstack dependencies: #126593
2024-05-25 01:27:02 +00:00
Andrew Gu
9117779b0a [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated from an ask on Slack :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #127004
2024-05-24 17:09:12 +00:00
Nikita Shulga
4ff9113e3d [MPS] Add _weight_int8pack_mm tests (#127041)
As well as extend the test to cover MV cases (where A matrix is 1xM) Limit int8 op testing to 32x32 matrix sizes for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041
Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales
2024-05-24 16:08:06 +00:00
William Wen
d11e44c0d0 Reset grad state across unittests (#126345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126345
Approved by: https://github.com/ezyang
2024-05-23 21:16:39 +00:00
PyTorch MergeBot
2c90b99267 Revert "reset dynamo cache before each test (#126586)"
This reverts commit 43f2f43eb3.

Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
PyTorch MergeBot
df4b7cb5f7 Reapply "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit ce6e36bf8b.

Reverted https://github.com/pytorch/pytorch/pull/126594 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
eqy
ebbd431d9e [CPU] Bump test_complex_2d thresholds for LBFGS on complex64 (#126358)
Is this supposed to be bitwise identical? Wasn't sure how to interpret the comment but it seems to be giving mismatches like:
```
Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 4.6372413635253906e-05 at index (1,) (up to 1e-05 allowed)
Greatest relative difference: 3.4600801882334054e-05 at index (1,) (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
     python test/test_optim.py -k test_complex_2d_LBFGS_cpu_complex64
```

on Neoverse-N2 SBSA ARM CPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126358
Approved by: https://github.com/lezcano, https://github.com/janeyx99
2024-05-23 00:16:45 +00:00
Shunting Zhang
ce6e36bf8b Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit 0a9c6e92f8.

enable the test since it's fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594
Approved by: https://github.com/huydhn
ghstack dependencies: #126586, #126593
2024-05-22 22:43:09 +00:00
Shunting Zhang
43f2f43eb3 reset dynamo cache before each test (#126586)
In https://github.com/pytorch/pytorch/issues/125967, we found test results depend on test order. The root cause is due to earlier tests populate dynamo cache and affect the later tests.

This PR clear dynamo cache before each unit test so we get more deterministic result for unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586
Approved by: https://github.com/jansel
2024-05-22 22:43:09 +00:00
Masaki Kozuki
0939b68980 Support dtype kwarg in _foreach_norm (#125665)
Fixes #125040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125665
Approved by: https://github.com/janeyx99
2024-05-22 20:27:50 +00:00
chilli
51e707650f Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-22 17:28:46 +00:00
PyTorch MergeBot
8a4597980c Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)"
This reverts commit 831efeeadf.

Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
Andrew Gu
636e79991c [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-22 00:29:13 +00:00
chilli
831efeeadf Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-20 23:40:56 +00:00
PyTorch MergeBot
cb69c51b6f Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)"
This reverts commit cf35a591b9.

Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))
2024-05-20 12:14:22 +00:00
jayanth domalapalli
cf35a591b9 Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)
This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure.

Lintrunner passed:
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Tests passed:
```
>python test_cuda.py -k test_graph_optims
Ran 19 tests in 7.463s

OK (skipped=9)

>python test_cuda.py -k test_graph_scaling_fused_optimizers
Ran 6 tests in 2.800s

OK (skipped=3)
```
Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers.

I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called:
```
{'foreach': False, 'maximize': True, 'weight_decay': 0}
{ 'foreach': True, 'maximize': True, 'weight_decay': 0}
```
I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned:
```
{'amsgrad': True}
```

Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127
Approved by: https://github.com/janeyx99
2024-05-20 06:20:45 +00:00
cyy
853081a8e7 Replace torch.library.impl_abstract with torch.library.register_fake (#126606)
To remove the disrupting warning
```
      warnings.warn("torch.library.impl_abstract was renamed to "
                    "torch.library.register_fake. Please use that instead; "
                    "we will remove torch.library.impl_abstract in a future "
                    "version of PyTorch.",
                    DeprecationWarning, stacklevel=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126606
Approved by: https://github.com/ezyang
2024-05-19 13:21:39 +00:00
PyTorch MergeBot
d782e43464 Revert "[FSDP2] Fixed 2D clip grad norm test (#126497)"
This reverts commit 3f28906311.

Reverted https://github.com/pytorch/pytorch/pull/126497 on behalf of https://github.com/jeanschmidt due to reverting to check if might have introduced inductor cuda 12 issues ([comment](https://github.com/pytorch/pytorch/pull/126497#issuecomment-2118338716))
2024-05-17 20:29:20 +00:00
briancoutinho
de42af4b00 Add coms metadata to execution trace (ET) (#126317)
Add Execution Trace communication collective meta data.
For specification see https://github.com/pytorch/pytorch/issues/124674

New fields look like
```
    {
      "id": 80, "name": "record_param_comms", "ctrl_deps": 79,
      "inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]},                             "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]},
      "attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""},
  {"name": "collective_name", "type": "string", "value": "allreduce"},
  {"name": "dtype", "type": "string", "value": "Float"},
  {"name": "in_msg_nelems", "type": "uint64", "value": 100},
  {"name": "out_msg_nelems", "type": "uint64", "value": 100},
  {"name": "in_split_size", "type": "string", "value": "[]"},
  {"name": "out_split_size", "type": "string", "value": "[]"},
  {"name": "global_rank_start", "type": "uint64", "value": 0},
  {"name": "global_rank_stride", "type": "uint64", "value": 1},
  {"name": "pg_name", "type": "string", "value": "0"},
  {"name": "pg_desc", "type": "string", "value": "default_pg"},
  {"name": "pg_size", "type": "uint64", "value": 2}]
 }
```

## Unit Test
Added a new unit test to check the execution trace collected has right attributes

`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace`

```
STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
[rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
[rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
[rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Execution trace saved at /tmp/tmpy01ngc3w.et.json
Execution trace saved at /tmp/tmptf8543k4.et.json
ok

----------------------------------------------------------------------
```

Also run profilerunit test
`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler`

```
STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Trace saved to /tmp/tmpdrw_cmcu.json
Trace saved to /tmp/tmpnio7ec9j.json
ok

----------------------------------------------------------------------
Ran 1 test in 19.772s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126317
Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise
2024-05-17 19:08:55 +00:00
Matthew Hoffman
81277baa0c Remove removed ruff rule TRY200 (#126256)
My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema.

From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/

> This rule has been removed and its documentation is only available for historical reasons.
>
> This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.

and we are currently explicitly ignoring B904.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256
Approved by: https://github.com/Skylion007
2024-05-17 16:31:05 +00:00
David Chiu
7e166e8057 [optim] Fix: wrong ASGD implementation (#126375)
This PR is based on #125440, additionally merging the latest main branch and fixing the lint failures from #126361.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126375
Approved by: https://github.com/janeyx99
2024-05-17 15:46:39 +00:00
Andrew Gu
3f28906311 [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-17 13:38:31 +00:00
drisspg
762ce6f062 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need too build 3 subgraphs for fwd + bwd. 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes the parts of the forward (more efficiently since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
- The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didnt correctly register the decomp table + IndexMode when I landed: https://github.com/pytorch/pytorch/pull/123902, this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated and then add a test specific for "future causal" attention.
- This PRs but the main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake' but I dont think we want to land that and likely want to just have a mutated 'dq' and a stored_output 'dk'
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-17 00:41:55 +00:00
Simon Fan
4cd4463c1c [compiled autograd] Fix LoggingTensor flaky test (#126144)
LoggingTensor fails consistently when root logger level is INFO or lower
By default, root logger should be WARNING
But, triton driver initialization will overwrite root logger to INFO, which causes flakiness: https://github.com/pytorch/pytorch/issues/126143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126144
Approved by: https://github.com/jansel
2024-05-16 22:23:02 +00:00
PyTorch MergeBot
0716f75cfb Revert "Add Lowering for FlexAttention Backwards (#125515)"
This reverts commit 95b9e981c3.

Reverted https://github.com/pytorch/pytorch/pull/125515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the newly added test runs out of memory 95b9e981c3 ([comment](https://github.com/pytorch/pytorch/pull/125515#issuecomment-2114084869))
2024-05-16 05:52:13 +00:00
Wanchao Liang
d0dfcd2c34 fix the device type for with_comms decorator (#125798)
found by @yifuwang, it looks like we are wrongly using
self.device_type="cuda" for gloo backend, which are triggering some
flakiness. i.e. https://github.com/pytorch/pytorch/issues/125366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125798
Approved by: https://github.com/yifuwang
2024-05-16 03:40:19 +00:00
drisspg
95b9e981c3 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need too build 3 subgraphs for fwd + bwd. 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes the parts of the forward (more efficiently since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
- The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didnt correctly register the decomp table + IndexMode when I landed: https://github.com/pytorch/pytorch/pull/123902, this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated and then add a test specific for "future causal" attention.
- This PRs but the main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake' but I dont think we want to land that and likely want to just have a mutated 'dq' and a stored_output 'dk'
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-16 03:14:27 +00:00
PyTorch MergeBot
e3c5d1b7d7 Revert "[optim] Fix: wrong ASGD implementation (#125440)"
This reverts commit 2c5ad9a3d7.

Reverted https://github.com/pytorch/pytorch/pull/125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](https://github.com/pytorch/pytorch/pull/125440#issuecomment-2113833108))
2024-05-16 02:12:29 +00:00
haozhe.zhu
f9d107af66 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-16 01:11:51 +00:00
David Chiu
2c5ad9a3d7 [optim] Fix: wrong ASGD implementation (#125440)
> previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor.

- [X] Ill assumption that every param will have the same step.
- [x] DIfferent implementation between `foreach=Ture` and `foreach=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440
Approved by: https://github.com/janeyx99
2024-05-15 22:52:15 +00:00
Andres Lugo-Reyes
38b8b614a2 [ROCm] Implement forward AD for miopen_batch_norm (#125069)
Implements forward automatic differentiation support for miopen_batch_norm as well as unskips the associated unit tests. Also fixes a class of functorch related unit tests that fail due to failing a contiguous tensor assertion in BatchNorm_miopen.cpp. Solution was to just limit tensors to miopen_batch_norm that have at least 3 dimensions. The exact restriction already existed in the cudnn path and is why the tests in question only failed on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069
Approved by: https://github.com/jeffdaily, https://github.com/andrewor14
2024-05-14 19:09:50 +00:00
Andrew Gu
4ded666535 [FSDP2] Factored out MLPStack to de-dup code (#126070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126070
Approved by: https://github.com/wanchaol
ghstack dependencies: #126067
2024-05-14 18:13:51 +00:00
lezcano
37596769d8 Autocast vdot (#125697)
Fixes https://github.com/pytorch/pytorch/issues/125544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125697
Approved by: https://github.com/jbschlosser
2024-05-14 12:05:02 +00:00
briancoutinho
c312cd8890 add simple test for nccl metadata (#125317)
Add a few test cases to verify newly added NCCL metadata in profiler events
The test looks at the following blocks record_param_comms
```
{
  "ph": "X",
  "cat": "cpu_op",
  "name": "record_param_comms",
  "pid": 2840966,
  "tid": 2844581,
  "ts": 2424859.045,
  "dur": 203.866,
  "args": {
    "Collective name": "allreduce",
    "Process Group Description": "default_pg",
    "dtype": "Float",
    "In msg nelems": 100,
    "Global rank start": 0,
    "Group size": 2,
    "Process Group Ranks": "[0, 1]",
    "Record function id": 0,
    "Out msg nelems": 100,
    "Global rank stride": 1,
    "Process Group Name": "0",
  }
}
```

## Unit test

```
>$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler

test_ddp_profiling_torch_profiler (__main__.TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler) ... NCCL version 2.20.5+cuda12.0
STAGE:2024-05-01 16:41:15 2840966:2840966 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-01 16:41:15 2840965:2840965 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-01 16:41:17 2840965:2840965 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:17 2840966:2840966 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:17 2840966:2840966 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-01 16:41:17 2840965:2840965 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:316] Completed Stage: Warm STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Trace saved to /tmp/tmpvwivp7mo.json
Trace saved to /tmp/tmpvwvsc1fy.json
ok
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125317
Approved by: https://github.com/LucasLLC, https://github.com/kwen2501
2024-05-14 06:20:50 +00:00
PyTorch MergeBot
bd3cbdba2f Revert "[optim] add fused_adagrad support for CPU device (#124905)"
This reverts commit 1c3fe84033.

Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk 1c3fe84033 ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))
2024-05-13 20:53:22 +00:00
haozhe.zhu
1c3fe84033 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-13 01:16:20 +00:00
Aaron Gokaslan
34910f87f0 [BE]: Update ruff to v0.4.4 (#125031)
Update ruff version to 0.4.2. This version mostly has bugfixes for the new parser and also updates the f-string rule to be able to apply more fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125031
Approved by: https://github.com/albanD, https://github.com/malfet
2024-05-12 20:02:37 +00:00
Michael Lazos
b24ad7eab5 Enable dynamo traced test_param_group_with_lrscheduler_goes_right_direction (#124544)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124544
Approved by: https://github.com/janeyx99
ghstack dependencies: #125825, #125826
2024-05-11 06:29:59 +00:00
Huy Do
0a9c6e92f8 Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)
Skipping the test in the context of https://github.com/pytorch/pytorch/issues/125967 until the issue is root caused and fixed properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125970
Approved by: https://github.com/clee2000
2024-05-11 04:11:18 +00:00
cyy
45628e3b66 Remove Caffe2 python (#125143)
This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 python build scripts were removed and some tensorboard code using Caffe2 was removed too.
To be noted, this was inspired and is co-dev with @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125143
Approved by: https://github.com/r-barnes, https://github.com/albanD
2024-05-10 21:15:43 +00:00
Catherine Lee
b08072f645 [CI] Relax per proc memory by a little bit, mark a test as serial (#125960)
test failure is here https://github.com/pytorch/pytorch/actions/runs/9036789873/job/24836020415

* OOMs etc rel to https://github.com/pytorch/pytorch/pull/125598
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125960
Approved by: https://github.com/huydhn
2024-05-10 21:11:39 +00:00
Masaki Kozuki
59f2e716cc Test foreach functions with all dtypes except qints (#125527)
Set `dtypes` and the others to all dtypes except qints, with some required xfails

Related to #124726.

Co-authored-by: janeyx99 <janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125527
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-05-10 18:56:37 +00:00
Catherine Lee
bef7d650c4 [CI] 3 procs on sm86 (#125598)
yolo
iirc the a10g/sm86 runners have ~21 GB of space, so we can increase parallelism on it to 3.  This results in about 6GB CUDA mem per proc.  The previous calculation + 2 procs resulted in about 8 GB

Also fixes the the calc for per proc memory, assuming that CUDA context + anything else take about a little under 1GB of space (previous calc was .11 on about 7.5 - 8 GB  <= .9GB)

Times on main are about 1.9-2.5hr per shard
This commit is around 1.6-2hr per shard

Risks: increase in flaky tests due to OOM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125598
Approved by: https://github.com/huydhn
2024-05-10 18:48:43 +00:00
David Berard
1ecea513b6 Fix common_methods_invocations example inputs to _efficient_attention_forward (#125788)
Fixes #120693

this tries to fix the sample input in common_methods_invocations.py:
* I think the arange was intended to be skipping every other integer in the range. Previously, we'd have one length that was -1.
* k, v tensors were too small - updated the sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125788
Approved by: https://github.com/drisspg, https://github.com/Aidyn-A
2024-05-09 20:08:49 +00:00
Can Balioglu
6ea226b99c Fix DDP no_sync when find_unused_parameters is True (#124193)
Fixes #69031, #42793

This PR fixes the bug introduced in #54981 where parameters used within a `no_sync` scope are not respected when `find_unused_parameters` is set to `True`. The `local_used_map_` and `numGradHooksTriggeredMap_` variables should be updated regardless of the `no_sync` state.

Tested and verified with fairseq2 and wav2vec2 ASR finetuning recipe. All gradients are correctly synced across workers as expected after applying this fix.

Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124193
Approved by: https://github.com/rohan-varma
2024-05-09 17:33:33 +00:00
xinan.lin
78a1693266 [Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 1) (#122866)
Reuse Inductor test suite for Intel GPU including:
test_torchinductor.py
test_triton_wrapper.py
test_metrics.py
test_codecache.py
test_codegen_triton.py
test_kernel_benchmark.py
test_triton_heuristics.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122866
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-05-09 00:51:35 +00:00
Ke Wen
04bf7713e8 [c10d] Reduce test time by reusing ProcessGroup (#125648)
## Problem this PR resolves
Today, most of distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?

If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()

def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are "**test-scope fixtures"**, meaning, they are called per test, each test will have brand new processes. Of course we'd have to recreate ProcessGroup every time.

## How we are fixing it
First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called **"class-scope fixtures."** It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class.  Therefore, we would do:
```
@classmethod
def setUpClass(self):
    dist.init_process_group(...)

@classmethod
def tearDownClass(self):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**

Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...` or
2. use `mp.spawn()` under `if __name__ == "__main__":`.

Point is, launch the processes only once.

## Result
We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.

## Limitation
For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout.

## Next step
Migrate the tests of distributed that would fit with this test style!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648
Approved by: https://github.com/wconstab
2024-05-08 22:33:40 +00:00
Tianyu Liu
faf0015052 [dtensor] run transformer sdpa in dtensor (#122997)
Now that efficient attention is supported in dtensor, we can modify the transformer test to use dtensor in SDPA and get rid of the manual num_head adjustments.

Caveat: Efficient attention is supported only with bf16/fp32 (not fp64) and has other constraints. If any of the constraints are not satisfied, the SDPA would fall back to the math decomposed attention, which will break as it does not fully work with dtensor (it creates a `torch.Tensor` mask in the middle). I considered adding some checks like in P1202254918 but that needs to be added everywhere this Transformer is used. Is it necessary if the current CI machines can run efficient attention?

Test files containing this Transformer:
- `test/distributed/tensor/parallel/test_tp_examples.py`
- `test/distributed/_composable/fsdp/test_fully_shard_training.py`
- `test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122997
Approved by: https://github.com/XilunWu
ghstack dependencies: #122995, #122996
2024-05-08 17:08:47 +00:00
Michael Lazos
e3d5afc60a Enable dynamo'd test for 116499 (#123469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123469
Approved by: https://github.com/janeyx99
ghstack dependencies: #123619
2024-05-07 22:17:01 +00:00
Michael Lazos
f0c6d6100b Enable dynamo-traced optimizer peak memory tests (#124543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124543
Approved by: https://github.com/yf225, https://github.com/janeyx99
2024-05-07 08:21:50 +00:00
ydwu4
461ffaaaf3 [dynamo] support torchbind object input (#124978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124978
Approved by: https://github.com/jansel
2024-05-07 03:02:00 +00:00
Joel Schlosser
b98c689261 Better repro command: include test class + fix paths for py3.8 (#125498)
Fixes #117850

This PR:
* Adds the class name in the repro command
* Fixes the path to the test file for python 3.8 jobs (apparently `inspect.getfile(class_type)` returns a relative path in this older python version)

Before (in python 3.8):
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test_autograd.py -k test_foo
```

After:
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_foo
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125498
Approved by: https://github.com/huydhn, https://github.com/janeyx99
2024-05-06 22:19:12 +00:00
ydwu4
0302dc68bf [Reland] Fakify script object inputs and attributes for non-strict ex… (#125490)
A re-land of #124239.

This PR fakify ScriptObject inputs and attributes in export non-strict mode by default.

The basic idea is to only fakify the script object during tracing (i.e. aot_export). After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.

Concretely, we

fakify all the script object inputs, and module attributes (gathered by constant_attrs).
patch the module's attributes with fakified script object
right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125490
Approved by: https://github.com/angelayi
2024-05-04 02:39:42 +00:00
Andrew Gu
996bb74077 [FSDP2] Added HSDP grad acc tests and some minor changes (#125479)
This adds HSDP to the existing gradient accumulation tests and includes some minor changes to simplify things a tiny bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125479
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431
2024-05-03 23:44:05 +00:00
Nikita Shulga
30610251ec [MPS] And naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().profiler.isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files, in the current working directory, which can later be loaded and used to debug or profiler the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">

- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
Masaki Kozuki
a99ada5b27 call super().__post_init__ in ForeachFuncinfo.__post_init__ (#125457)
obviously the current main branch's `ForeachFuncInfo`'s dunder post init doesn't `super().__post_init__()` which does some setup including setting `dtypesIfCUDA` and `dtypesIfROCM`.

Fixes #125295
related: #125001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125457
Approved by: https://github.com/janeyx99
2024-05-03 14:54:15 +00:00
Jeff Daily
da991fac22 [ROCm][CI] upgrade CI to ROCm 6.1 (#124300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124300
Approved by: https://github.com/malfet
2024-05-02 17:16:02 +00:00
Masaki Kozuki
aa7be72cc5 Convert ForeachFuncInfo to dataclass (#125001)
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99, https://github.com/jeffdaily
2024-05-02 04:19:09 +00:00
PyTorch MergeBot
f1f142c44f Revert "Fakify script object inputs and attributes for non-strict export (#124239)"
This reverts commit ecc2e034f7.

Reverted https://github.com/pytorch/pytorch/pull/124239 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124239#issuecomment-2089305447))
2024-05-01 23:56:00 +00:00
yan-yhy
6cfb55dd5d Add a variable for some testcases. (#124708)
Some testcases can use 'TEST_PRIVATEUSE1_DEVICE_TYPE' to make adapting these testcases on others device more convenient.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124708
Approved by: https://github.com/albanD
2024-05-01 23:19:12 +00:00
Michael Lazos
787afc5180 Add LR as tensor tests (#123750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750
Approved by: https://github.com/janeyx99
2024-05-01 04:46:49 +00:00
Isuru Fernando
4d5f8070c4 add a decomposition for select_scatter (#124426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124426
Approved by: https://github.com/peterbell10
2024-05-01 03:23:18 +00:00
Sam Larsen
74e8817311 [inductor] Minor fixes to various tests before enabling fx graph caching in OSS by default (#125258)
Summary: Discovered breakages by enabling codecache by default and doing a CI run. I'll commit these fixes first and eventually enabling caching by default will (hopefully) be a one-liner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258
Approved by: https://github.com/eellison
2024-05-01 02:34:01 +00:00
drisspg
25691558d9 Change templated_attention -> flex_attention (#125251)
# Summary

Change all the names

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125251
Approved by: https://github.com/Chillee, https://github.com/yanboliang
2024-05-01 01:08:48 +00:00
PyTorch MergeBot
75fa54a9d1 Revert "Convert ForeachFuncInfo to dataclass (#125001)"
This reverts commit 9466335ae4.

Reverted https://github.com/pytorch/pytorch/pull/125001 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is breaking on ROCm 9466335ae4 ([comment](https://github.com/pytorch/pytorch/pull/125001#issuecomment-2086640674))
2024-04-30 19:05:53 +00:00
Wanchao Liang
00df0d3e94 [dtensor] implement shard dim change with alltoall (#124872)
as titled, we implement a dedicated communication op to allow efficient
sharding dimension change using alltoall, to replace our previous
allgather + local chunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #124871
2024-04-30 18:30:34 +00:00
Masaki Kozuki
9466335ae4 Convert ForeachFuncInfo to dataclass (#125001)
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99
2024-04-30 16:19:42 +00:00
ydwu4
ecc2e034f7 Fakify script object inputs and attributes for non-strict export (#124239)
This PR fakify ScriptObject inputs and attributes in export non-strict mode by default.

The basic idea is to `only fakify the script object during tracing (i.e. aot_export)`. After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.

Concretely, we
1. fakify all the script object inputs, and module attributes (gathered by constant_attrs).
2. patch the module's attributes with fakified script object
3. right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124239
Approved by: https://github.com/zou3519
2024-04-30 15:57:25 +00:00
drisspg
8c219251c5 Add backwards support to FlexAttention (#123902)
# Summary
This is part one of adding backwards support to FlexAttention.

This PR focuses on the eager implementation and wiring up enough of the templated_attention_backward(name change soon 😉) to get through aot_eager.

Notably this does not actually wire up the triton template just yet in order to make this PR easier to review. That will be the next follow up PR.

#### Structure
We pass both the forward and backward graph to the backwardsHOP since these are both needed to be inlined into the calculation for backwards:
- the forward graph is needed in order to re-compute the scores
- the joint graph is needed in order to construct the correct gradients  post softmax_grad calc

### Attatched AOT Graph
https://gist.github.com/drisspg/ce4c041f8df8a5a7983c5174705cf2b5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123902
Approved by: https://github.com/Chillee
2024-04-29 22:34:22 +00:00
rzou
c6b7504d47 Fix torch.library.register_fake's module reporting (#125037)
torch.library.register_fake reports the python module the fake impl is
located in. This is used to check against
`m.set_python_module("foo.bar")` calls in C++.

The module reporting logic was wrong in most cases. This PR fixes it.

Test Plan:
- exhaustive tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125037
Approved by: https://github.com/williamwen42
2024-04-26 20:53:33 +00:00
chilli
392dc45597 Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799
Approved by: https://github.com/drisspg
ghstack dependencies: #124444
2024-04-26 17:22:13 +00:00
Aaron Gokaslan
2f3b0befed [BE]: Apply ruff FURB 118. (#124743)
Replaces various lambdas with operator.itemgetter which is more efficient (as it's a builtin function). Particularly useful for when lambdas are used as 'key' functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124743
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-26 14:34:52 +00:00
PyTorch MergeBot
e913f77c60 Revert "Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)"
This reverts commit 9bccafc31c.

Reverted https://github.com/pytorch/pytorch/pull/124799 on behalf of https://github.com/clee2000 due to broke tests but only on crossref https://github.com/pytorch/pytorch/actions/runs/8841521519/job/24279075171, added no td label so itll actually run this time ([comment](https://github.com/pytorch/pytorch/pull/124799#issuecomment-2078530797))
2024-04-26 02:35:14 +00:00
chilli
9bccafc31c Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799
Approved by: https://github.com/drisspg
ghstack dependencies: #124444
2024-04-26 01:02:28 +00:00
PyTorch MergeBot
678662a557 Revert "Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)"
This reverts commit acc4cbea39.

Reverted https://github.com/pytorch/pytorch/pull/124799 on behalf of https://github.com/jeanschmidt due to checking if this diff introduced regressions on linux-focal-py3.11-clang10 and linux-focal-py3.8-clang10 ([comment](https://github.com/pytorch/pytorch/pull/124799#issuecomment-2076756876))
2024-04-25 09:29:57 +00:00
chilli
acc4cbea39 Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799
Approved by: https://github.com/drisspg
2024-04-25 06:19:55 +00:00
Masaki Kozuki
b21bf5e4e4 [foreach] Use same dtypes when dtypesIfCUDA is None (#124813)
in order to avoid accidentally testing cuda path with fewer dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124813
Approved by: https://github.com/janeyx99
2024-04-25 04:42:24 +00:00
rzou
4f398eed0b [custom_op] register_autograd supports non-tensor kwargonly-args (#124806)
The user does not need to return gradients for these args.

We also change how setup_context works to adapt to kwargonly-args. If
the user's op has no kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, output)`: we require that the
arguments have the same names.

If the user's op has kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, keyword_only_inputs, output)`.
We require that the arguments have the same names.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124806
Approved by: https://github.com/albanD, https://github.com/williamwen42
ghstack dependencies: #124637, #124805
2024-04-25 01:51:02 +00:00
Jane Xu
fb69eef1b4 Add int testing for foreach_copy on cuda (#124703)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124703
Approved by: https://github.com/crcrpar, https://github.com/albanD
2024-04-24 13:11:56 +00:00
rzou
4ceb44c40d Add torch.library.opcheck (#124496)
This PR:
- exposes torch.testing._internal.optests.opcheck as
  torch.library.opcheck
- Adds support for CustomOpDef (aka functions decorated with
  torch.library.custom_op) to opcheck.

Test Plan:
- Updated tests
- We validated opcheck's design internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496
Approved by: https://github.com/williamwen42
2024-04-23 21:48:00 +00:00
haozhe.zhu
3c964ad1ca add fused_sgd_kernel support for CPU device (#123629)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9
https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_sgd time: 0.2301 seconds
_fused_sgd time: 0.0925 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_sgd time: 2.6195 seconds
_fused_sgd time: 1.7543 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
Looks like we already have some PRs under this issue https://github.com/pytorch/pytorch/issues/123451 to unified the UTs, I did not modified UT in this PR.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-23 08:28:19 +00:00
Jeff Daily
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
Chen, Zejun
b1984237a0 [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi
2024-04-22 01:26:55 +00:00
Aaron Gokaslan
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime exception, value errors, type errors or some other errors. There are hundreds of instance of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get commiters to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed overtime and our exception typing can be improved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Michael Lazos
0d0b5b2655 Enable dynamo rosenbrock sparse tests (#124542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124542
Approved by: https://github.com/yf225
ghstack dependencies: #124540, #124541
2024-04-20 05:54:41 +00:00