78584 Commits

Author SHA1 Message Date
Jason Ansel
a0207c8471 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-18 04:47:51 +00:00
Nikita Shulga
9aa22eabe7 [CI] Make linux-aarch64 shards actually running different tests (#136208)
Non-functional sharding was introduced in https://github.com/pytorch/pytorch/pull/125255 but each shard in that case was running the same tests...
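
For illustration only, one common way to make shards run disjoint subsets is a round-robin split over a sorted test list. This is a sketch under that assumption, not the mechanism used by this PR; `select_tests_for_shard` and its parameters are hypothetical names.

```python
def select_tests_for_shard(tests, shard_index, num_shards):
    """Return the tests assigned to a 1-based shard using a round-robin split."""
    if not 1 <= shard_index <= num_shards:
        raise ValueError("shard_index must be in [1, num_shards]")
    # Sorting first makes the assignment deterministic across shards.
    return sorted(tests)[shard_index - 1 :: num_shards]


all_tests = [f"test_{i}" for i in range(10)]
for shard in range(1, 5):
    print(shard, select_tests_for_shard(all_tests, shard, 4))
```

Every test lands in exactly one shard, so the shards no longer repeat each other's work.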
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136208
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/atalman
2024-09-18 03:10:21 +00:00
Kiuk Chung
8895f69d12 [torch/numpy][numpy2.0 compat] Additional changes for tests to run under numpy-2.0 (#136152)
Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0.

Changes in this PR:
1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x.
2. Do the same for `numpy.exceptions.VisibleDeprecationWarning`
3. Use `np.sort(..., axis=0)` over `np.msort()` (`np.msort()` was removed in numpy-2.0)
4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0)
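
The namespace fallback described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `get_compat_attr` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins that mimic the numpy layouts so the helper can be exercised without numpy installed.

```python
import types


def get_compat_attr(np_module, name):
    """Prefer np_module.exceptions.<name>; fall back to np_module.<name>."""
    exceptions_ns = getattr(np_module, "exceptions", None)
    if exceptions_ns is not None and hasattr(exceptions_ns, name):
        return getattr(exceptions_ns, name)
    return getattr(np_module, name)


# Stand-ins for the two numpy layouts (hypothetical, for demonstration only).
fake_new_numpy = types.SimpleNamespace(
    exceptions=types.SimpleNamespace(ComplexWarning="from numpy.exceptions")
)
fake_old_numpy = types.SimpleNamespace(ComplexWarning="from top-level numpy")

print(get_compat_attr(fake_new_numpy, "ComplexWarning"))
print(get_compat_attr(fake_old_numpy, "ComplexWarning"))

# With a real numpy install, the same helper and the replacement calls look like this.
try:
    import numpy as np
except ImportError:
    np = None

if np is not None:
    ComplexWarning = get_compat_attr(np, "ComplexWarning")
    a = np.array([[3, 1], [2, 4]])
    sorted_cols = np.sort(a, axis=0)  # replaces np.msort(a)
    padded = np.pad(a, 1)  # replaces np.lib.pad(a, 1)
```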

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152
Approved by: https://github.com/atalman
2024-09-18 02:11:22 +00:00
Nikita Shulga
6682327c75 [BE] Make NestedTensorTransformerFunctions.cu compilable without warnings (#136222)
Before the change compilation produced following warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In function ‘std::tuple<dim3, dim3, at::native::StackArray<long int> > at::native::check_shape_and_partition_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:584:22: warning: comparison of integer expressions of different signedness: ‘const int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
  584 |   TORCH_CHECK(num_jagged_dim <= kStackArrayMaxDims);
      |       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1061: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                         ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1985: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                         ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In instantiation of ‘void at::native::jagged_dense_elementwise_jagged_output_opt_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&, const at::Tensor&, F) [with scalar_t = c10::Half; F = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(const at::Tensor&, c10::ArrayRef<at::Tensor>, std::optional<c10::SymInt>), at::native::_fbgemm_dense_to_jagged_forward_symint, c10::Half, 1> >]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1515:1:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2006: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                         ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2113: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                         ^
```
After the change, the file compiles without warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136222
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-09-18 01:24:05 +00:00
leslie-fang-intel
b18ba9419e [AO][Inductor] Enable WOQ fusion pattern with permute (#135928)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/135831 and https://github.com/pytorch/ao/issues/890. The root cause of the numerical failure was that the customized woq-int8 kernel was not triggered due to changes in the pattern. After re-adding the fusion pattern, the accuracy check now passes. I will open a separate TorchAO PR to enable these unit tests in TorchAO.

**Test Plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135928
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-09-18 00:56:16 +00:00
Chirag Pandya
cccf500193 [c10d] remove sleep from watchdogHandler (#135760)
Summary:
Remove sleep from the `watchdogHandler` function. This sleep unnecessarily slows things down during an NCCL timeout.
Flight recorder is configured to take a minute, at most, to dump out its buffer.
This sleep ends up waiting for `8` minutes before destroy is called.

Test Plan: Unit tests.

Differential Revision: D62529875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
2024-09-18 00:55:01 +00:00
Nikita Shulga
f6f1504d39 [MPS] Fix 5D+ reductions over negative dimentions (#136198)
This fixes a bug introduced by https://github.com/pytorch/pytorch/pull/99856, which attempts to speed up reductions for 5D+ tensors when the trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped (negative) dimensions.

Added a regression test case to `TestMPS.test_sum`.

Fixes https://github.com/pytorch/pytorch/issues/136132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198
Approved by: https://github.com/albanD
2024-09-17 21:53:31 +00:00
Banit Agrawal
a575ce0dc6 [PyTorch Pinned Allocator] Add support of background thread to process events (#135524)
Summary: Currently we process events in the regular allocation path, calling cudaEventQuery to check on the events, and this path can take some locks in the libcuda driver. It is not strictly necessary to process events in the allocation path; we can move this to a background thread that keeps processing events regularly and puts freed blocks on the free list.
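
The idea of moving event processing off the allocation path can be sketched in plain Python. This is a conceptual illustration only, not the allocator's implementation: `BackgroundEventProcessor`, `record`, and `allocate` are hypothetical names, and an `is_done` callable stands in for a cudaEventQuery-style check.

```python
import threading


class BackgroundEventProcessor:
    """Polls pending events on a worker thread and moves finished blocks to a free list."""

    def __init__(self, poll_interval_s=0.001):
        self._pending = []  # (is_done callable, block) pairs awaiting completion
        self._free_list = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._poll_interval_s = poll_interval_s
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def record(self, is_done, block):
        """Register a block that becomes reusable once is_done() returns True."""
        with self._lock:
            self._pending.append((is_done, block))

    def allocate(self):
        """Allocation path: only pops from the free list, never queries events."""
        with self._lock:
            return self._free_list.pop() if self._free_list else None

    def _run(self):
        while not self._stop.is_set():
            with self._lock:
                batch, self._pending = self._pending, []
            # Query events outside the lock so allocate() is not blocked by the queries.
            done = [item for item in batch if item[0]()]
            not_done = [item for item in batch if item not in done]
            with self._lock:
                self._free_list.extend(block for _, block in done)
                self._pending.extend(not_done)
            self._stop.wait(self._poll_interval_s)

    def close(self):
        self._stop.set()
        self._thread.join()
```

The allocation path here touches only the free list, which is the property the summary is after; how often to poll is a tuning choice that this sketch leaves as a parameter.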

Differential Revision: D62396585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
2024-09-17 21:08:10 +00:00
Banit Agrawal
48d18fbd4c [PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174)
Summary:
This diff adds an option to round the non-split blocks in caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.

For example, if we specify the max_split memory size as 400MB, then allocations larger than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-2-division, that is 512MB, and add the default kLargeBuffer of 20MB, giving 532MB. Since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation; instead a new 512MB block will be created. In this diff, we make the kLargeBuffer value used for rounding a configurable option, max_non_split_rounding_size: if 512MB + max_non_split_rounding_size is greater than 1024MB, we will use the 1024MB block and won't create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks and reuse them as much as possible, without stalling on calls to cudaMalloc.
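
The arithmetic in the example above can be checked with a small sketch. It is a simplification, not the allocator's code: sizes are plain integers in MB, rounding is reduced to "next power of two" (an assumption standing in for power-2-division rounding), and `can_reuse_cached_block` is a hypothetical name.

```python
def next_power_of_two(size_mb):
    """Round a positive size up to the next power of two."""
    return 1 << (size_mb - 1).bit_length()


def can_reuse_cached_block(cached_mb, request_mb, max_non_split_rounding_mb=20):
    """A cached non-split block is reused only if it is not much larger than the rounded request."""
    rounded = next_power_of_two(request_mb)
    return rounded <= cached_mb < rounded + max_non_split_rounding_mb


# Default slack of 20MB: 512 + 20 = 532 < 1024, so the cached block is not reused.
print(can_reuse_cached_block(1024, 500))
# Larger slack: 512 + 600 = 1112 > 1024, so the cached 1024MB block is reused.
print(can_reuse_cached_block(1024, 500, max_non_split_rounding_mb=600))
```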

Differential Revision: D62758758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
2024-09-17 19:08:44 +00:00
eqy
e3aa5e2f64 [NCCL] Don't override waitUntilInitialized's setting of comm->initialized_ (#136155)
#133630 sets `initialized_` to `true`, which causes previous wait codepaths to skip necessary waits; see also https://github.com/pytorch/pytorch/issues/136151

CC @shuqiangzhang @wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155
Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang
2024-09-17 18:50:12 +00:00
Huanyu He
a4e9a1c90b [TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between EBC and KTRegroupAsDict modules (#136045)
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* the basic idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs is very special and a must in most cases due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.

# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBC/fpEBC and KTRegroupAsDict modules in an unflattened module, retrieves their fqns, and sorts them into in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because the fpEBC is currently swapped as a whole, we do some extra fqn logic to filter out an EBC that belongs to an up-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.

WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added for the cases where a `KTRegroupAsDict` module can't be found or `finalize_interpreter_modules` is not `True`.

# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` to closely mimic the APF model structure.

Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
  test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler.            --_ |""---__
Executed 1 example in 203.1s:                               |'.|  ||  .    """|
  Successful: 1                                             | ||  || /|\""-.  |
  Failed: 0                                                 | ||  ||  |    |  |
  Skipped: 0                                                | ||  ||  |   \|/ |
  Not executed: 8                                           |."|  ||  --"" '__|
https://testslide.readthedocs.io/                              --" |__---"""
```

Differential Revision: D62606738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
2024-09-17 18:42:56 +00:00
angelayi
ea10c072f3 [export] Deserialize args with python keyword names (#136036)
Currently when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in python because `from` is a python keyword. So the solution here is to not deserialize it as a kwarg.
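
A rough sketch of the problem and one possible workaround, using only the standard library. This is not the deserializer's code: `split_args_for_call` and `is_valid_call` are hypothetical helpers, and the rule used here (pass everything up to the last keyword-named argument positionally) is an assumption made for illustration.

```python
import keyword


def split_args_for_call(arg_names, arg_values):
    """Pass arguments positionally up to the last one named like a Python keyword."""
    last_kw = max(
        (i for i, name in enumerate(arg_names) if keyword.iskeyword(name)),
        default=-1,
    )
    positional = list(arg_values[: last_kw + 1])
    kwargs = dict(zip(arg_names[last_kw + 1 :], arg_values[last_kw + 1 :]))
    return positional, kwargs


def is_valid_call(source):
    """Check whether a call expression parses as valid Python."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(is_valid_call("uniform(x, from=0, to=1)"))  # `from` as a kwarg name does not parse
print(is_valid_call("uniform(x, 0, to=1)"))
print(split_args_for_call(["self", "from", "to"], ["x", 0, 1]))
```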
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036
Approved by: https://github.com/zhxchen17
2024-09-17 18:13:14 +00:00
Joel Schlosser
a8382847f4 Support rms_norm() for NJT (#135872)
`rms_norm()` is a nice-to-have for ViT :)

This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
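
For orientation, RMS normalization over a non-ragged feature dimension can be sketched in pure Python, with nested lists standing in for a jagged batch. This only illustrates the math (x / sqrt(mean(x^2) + eps)); it is not the NJT implementation, and the function names and the `eps` default are assumptions.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector so that its root mean square is approximately 1."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in vec]


def rms_norm_jagged(batch, eps=1e-6):
    """Normalize over the last (fixed-size) dim; the ragged sequence dim is left untouched."""
    return [[rms_norm(row, eps) for row in seq] for seq in batch]


# Two sequences of different lengths (ragged dim), each row with 2 features.
batch = [[[3.0, 4.0]], [[1.0, 1.0], [0.0, 2.0]]]
print(rms_norm_jagged(batch))
```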
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
2024-09-17 18:09:20 +00:00
Nikita Shulga
785e98783b Delete links to non-existing run_plan_mpi.cc (#136204)
The file was deleted by https://github.com/pytorch/pytorch/pull/125092

Fixes https://github.com/pytorch/pytorch/issues/136199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-09-17 17:51:56 +00:00
Trung Truong
cc365fdd7b [MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889)
Summary:
Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn

At the moment, both the major and minor versions are just 0.

Test Plan:
Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/

Differential Revision: D62595296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889
Approved by: https://github.com/egienvalue
2024-09-17 17:42:56 +00:00
Xintong Hu
8e5bb356e0 [PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527)
Summary: as title

Test Plan: new UT

Differential Revision: D62398390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135527
Approved by: https://github.com/frank-wei
2024-09-17 17:26:53 +00:00
Nikhil Gupta
63dc5dff10 [Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857)
Regression PR : https://github.com/pytorch/cpuinfo/pull/255

Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-09-17 16:50:17 +00:00
Justin Chu
67b14ce8bd [ONNX] Fix numpy method to return the correct type (#136162)
The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior.

This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-17 15:51:00 +00:00
Mauricio Villegas
ece8267d2c Add back optim type hints that were lost when *.pyi files were removed (#136185)
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
2024-09-17 15:45:15 +00:00
Edward Z. Yang
913f97e878 Don't run reshape pattern match on dynamic shape size tensor (#136100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136100
Approved by: https://github.com/mengluy0125
2024-09-17 15:08:55 +00:00
PyTorch MergeBot
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd3.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
PyTorch MergeBot
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
PyTorch MergeBot
3b5e2689a1 Revert "Optimize dict reconstruct to not codegen untouched values (#134876)"
This reverts commit a1a57a424d.

Reverted https://github.com/pytorch/pytorch/pull/134876 on behalf of https://github.com/jeanschmidt due to new introduced test test_reconstruct.py::ReconstructTest::test_functional_call_reconstruct is breaking internally. @zou3519 may you help get those changes merged back to main? ([comment](https://github.com/pytorch/pytorch/pull/134876#issuecomment-2355697685))
2024-09-17 13:00:01 +00:00
ankurneog
e248c1d7eb Update real device in FSDP state_dict_utils (#134994)
## Motivation
The device returned by tensor.device, for both sharded and non-sharded tensors, defaults to cuda. Hence, while running the FSDP UTs, we see the following errors. This change updates the device to the actual device type of the created tensor.

```
[rank3]   File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3]     sharded_tensor_sd = ref_model.state_dict()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3]     hook_result = hook(self, destination, prefix, local_metadata)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]     return func(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3]     tensor.device,
[rank3]   File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3]     return arg(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3]     return dispatch(st_instance, func)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3]     return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3]     return wrapped_func(types, args, kwargs, process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3]     dev = torch.device(torch.cuda.current_device())
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3]     _lazy_init()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3]     raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
2024-09-17 04:39:08 +00:00
wz337
408fe41a45 [DSD][EZ] Minor update in _state_dict_utils.py (#136165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165
Approved by: https://github.com/kwen2501
ghstack dependencies: #135725, #135763
2024-09-17 04:32:43 +00:00
Brian Hirsh
dc82d274e6 make view.dtype always return an alias (#136074)
Fixes https://github.com/pytorch/pytorch/issues/136064

In the linked repro, the issue was that there was some code like this:
```
# x has dtype torch.float32
def f(x):
    y = x.view(torch.float32)
    y.copy_(...)
```

Where because `view.dtype` is implemented today to potentially directly return its input, we would end up directly clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input.

Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`).

This does **not** happen, though, if you are executing the kernel from within a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set.

This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a `__torch_dispatch__` call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input.

I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #136041
2024-09-17 03:40:54 +00:00
Brian Hirsh
d463a81c27 inductor: dont use default_dtype during rng functionalization (#136041)
Fixes https://github.com/pytorch/pytorch/issues/119162

See context at https://github.com/pytorch/pytorch/issues/119162#issuecomment-2349849469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136041
Approved by: https://github.com/eellison
2024-09-17 03:40:54 +00:00
Zhijing Li (Accelerator Enablement)
3f74310784 Back out "Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)" (#136160)
Test Plan: make train-hstu-cint-publish-bf16-tgif-local

Differential Revision: D62766335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136160
Approved by: https://github.com/muchulee8
2024-09-17 01:06:10 +00:00
PyTorch MergeBot
37a08b33bb Revert "fix compiled_autograd deadlock throw (#135795)"
This reverts commit 00dc7d4356.

Reverted https://github.com/pytorch/pytorch/pull/135795 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135795#issuecomment-2354233619))
2024-09-16 23:59:56 +00:00
Laith Sakka
071da87cd7 use csv extention for test report in order for it to be uploaded to s3 (#136128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136128
Approved by: https://github.com/clee2000
2024-09-16 21:47:46 +00:00
Justin Chu
c12536b3c0 [ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#136153)
Since https://github.com/pytorch/pytorch/pull/135080, the CompositeImplicitAutograd (CIA) ops are only decomposed when a decomp function is provided in a table. There is no longer a need to distinguish CIA ops like Upsample and preserve them explicitly. On the ONNX Script torchlib side I will unregister some ops from the following list to make sure some CIA ops are still decomposed.

```
<OpOverload(op='aten.__and__', overload='Scalar')>,
 <OpOverload(op='aten.__and__', overload='Tensor')>,
 <OpOverload(op='aten.__or__', overload='Scalar')>,
 <OpOverload(op='aten.__or__', overload='Tensor')>,
 <OpOverload(op='aten.__xor__', overload='Scalar')>,
 <OpOverload(op='aten.__xor__', overload='Tensor')>,
 <OpOverload(op='aten._add_batch_dim', overload='default')>,
 <OpOverload(op='aten._assert_tensor_metadata', overload='default')>,
 <OpOverload(op='aten._backward', overload='default')>,
 <OpOverload(op='aten._batch_norm_impl_index_backward', overload='default')>,
 <OpOverload(op='aten._cast_Byte', overload='default')>,
 <OpOverload(op='aten._cast_Char', overload='default')>,
 <OpOverload(op='aten._cast_Double', overload='default')>,
 <OpOverload(op='aten._cast_Float', overload='default')>,
 <OpOverload(op='aten._cast_Half', overload='default')>,
 <OpOverload(op='aten._cast_Int', overload='default')>,
 <OpOverload(op='aten._cast_Long', overload='default')>,
 <OpOverload(op='aten._cast_Short', overload='default')>,
 <OpOverload(op='aten._choose_qparams_per_tensor', overload='default')>,
 <OpOverload(op='aten._convolution', overload='deprecated')>,
 <OpOverload(op='aten._convolution_double_backward', overload='default')>,
 <OpOverload(op='aten._convolution_mode', overload='default')>,
 <OpOverload(op='aten._cufft_clear_plan_cache', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_size', overload='default')>,
 <OpOverload(op='aten._cufft_set_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._debug_has_internal_overlap', overload='default')>,
 <OpOverload(op='aten._dim_arange', overload='default')>,
 <OpOverload(op='aten._embedding_bag_sparse_backward', overload='default')>,
 <OpOverload(op='aten._gather_sparse_backward', overload='default')>,
 <OpOverload(op='aten._grid_sampler_2d_cpu_fallback_backward', overload='default')>,
 <OpOverload(op='aten._has_compatible_shallow_copy_type', overload='default')>,
 <OpOverload(op='aten._is_zerotensor', overload='default')>,
 <OpOverload(op='aten._lu_with_info', overload='default')>,
 <OpOverload(op='aten._nnpack_available', overload='default')>,
 <OpOverload(op='aten._pack_padded_sequence_backward', overload='default')>,
 <OpOverload(op='aten._pad_circular', overload='default')>,
 <OpOverload(op='aten._pad_enum', overload='default')>,
 <OpOverload(op='aten._pad_packed_sequence', overload='default')>,
 <OpOverload(op='aten._propagate_xla_data', overload='default')>,
 <OpOverload(op='aten._remove_batch_dim', overload='default')>,
 <OpOverload(op='aten._reshape_from_tensor', overload='default')>,
 <OpOverload(op='aten._rowwise_prune', overload='default')>,
 <OpOverload(op='aten._saturate_weight_to_fp16', overload='default')>,
 <OpOverload(op='aten._scaled_dot_product_attention_math', overload='default')>,
 <OpOverload(op='aten._shape_as_tensor', overload='default')>,
 <OpOverload(op='aten._sobol_engine_draw', overload='default')>,
 <OpOverload(op='aten._sparse_bsc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_bsr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_compressed_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_coo_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_mm', overload='default')>,
 <OpOverload(op='aten._sparse_mm', overload='reduce')>,
 <OpOverload(op='aten._sparse_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_sum', overload='default')>,
 <OpOverload(op='aten._sparse_sum', overload='dim_dtype')>,
 <OpOverload(op='aten._sparse_sum', overload='dtype')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='a')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='b')>,
 <OpOverload(op='aten._test_autograd_multiple_dispatch', overload='ntonly')>,
 <OpOverload(op='aten._test_check_tensor', overload='default')>,
 <OpOverload(op='aten._test_serialization_subcmul', overload='default')>,
 <OpOverload(op='aten._test_string_default', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_gru_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_fused_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._to_cpu', overload='default')>,
 <OpOverload(op='aten._upsample_bicubic2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_bilinear2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='vec')>,
 <OpOverload(op='aten._use_cudnn_rnn_flatten_weight', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsr_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_compressed_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_coo_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csr_tensor_args', overload='default')>,
 <OpOverload(op='aten._version', overload='default')>,
 <OpOverload(op='aten._weight_norm', overload='default')>,
 <OpOverload(op='aten._weight_norm_differentiable_backward', overload='default')>,
 <OpOverload(op='aten.absolute', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool1d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool2d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool3d', overload='default')>,
 <OpOverload(op='aten.adaptive_max_pool1d', overload='default')>,
 <OpOverload(op='aten.affine_grid_generator_backward', overload='default')>,
 <OpOverload(op='aten.align_as', overload='default')>,
 <OpOverload(op='aten.align_tensors', overload='default')>,
 <OpOverload(op='aten.all', overload='dimname')>,
 <OpOverload(op='aten.any', overload='dimname')>,
 <OpOverload(op='aten.arccos', overload='default')>,
 <OpOverload(op='aten.arccosh', overload='default')>,
 <OpOverload(op='aten.arcsin', overload='default')>,
 <OpOverload(op='aten.arcsinh', overload='default')>,
 <OpOverload(op='aten.arctan', overload='default')>,
 <OpOverload(op='aten.arctan2', overload='default')>,
 <OpOverload(op='aten.arctanh', overload='default')>,
 <OpOverload(op='aten.argsort', overload='default')>,
 <OpOverload(op='aten.argsort', overload='dimname')>,
 <OpOverload(op='aten.argsort', overload='stable')>,
 <OpOverload(op='aten.argwhere', overload='default')>,
 <OpOverload(op='aten.atleast_1d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_2d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_3d', overload='Sequence')>,
 <OpOverload(op='aten.avg_pool1d', overload='default')>,
 <OpOverload(op='aten.bilinear', overload='default')>,
 <OpOverload(op='aten.broadcast_tensors', overload='default')>,
 <OpOverload(op='aten.can_cast', overload='default')>,
 <OpOverload(op='aten.cat', overload='names')>,
 <OpOverload(op='aten.cdist', overload='default')>,
 <OpOverload(op='aten.chain_matmul', overload='default')>,
 <OpOverload(op='aten.chalf', overload='default')>,
 <OpOverload(op='aten.choose_qparams_optimized', overload='default')>,
 <OpOverload(op='aten.clip', overload='Tensor')>,
 <OpOverload(op='aten.clip', overload='default')>,
 <OpOverload(op='aten.column_stack', overload='default')>,
 <OpOverload(op='aten.combinations', overload='default')>,
 <OpOverload(op='aten.concat', overload='default')>,
 <OpOverload(op='aten.concat', overload='names')>,
 <OpOverload(op='aten.concatenate', overload='default')>,
 <OpOverload(op='aten.concatenate', overload='names')>,
 <OpOverload(op='aten.conv1d', overload='default')>,
 <OpOverload(op='aten.conv1d', overload='padding')>,
 <OpOverload(op='aten.conv2d', overload='default')>,
 <OpOverload(op='aten.conv2d', overload='padding')>,
 <OpOverload(op='aten.conv3d', overload='default')>,
 <OpOverload(op='aten.conv3d', overload='padding')>,
 <OpOverload(op='aten.conv_tbc_backward', overload='default')>,
 <OpOverload(op='aten.conv_transpose1d', overload='default')>,
 <OpOverload(op='aten.conv_transpose2d', overload='input')>,
 <OpOverload(op='aten.conv_transpose3d', overload='input')>,
 <OpOverload(op='aten.corrcoef', overload='default')>,
 <OpOverload(op='aten.cosine_embedding_loss', overload='default')>,
 <OpOverload(op='aten.cosine_similarity', overload='default')>,
 <OpOverload(op='aten.cov', overload='default')>,
 <OpOverload(op='aten.cross', overload='default')>,
 <OpOverload(op='aten.cross_entropy_loss', overload='default')>,
 <OpOverload(op='aten.ctc_loss', overload='IntList')>,
 <OpOverload(op='aten.ctc_loss', overload='Tensor')>,
 <OpOverload(op='aten.cudnn_is_acceptable', overload='default')>,
 <OpOverload(op='aten.cummax', overload='dimname')>,
 <OpOverload(op='aten.cummaxmin_backward', overload='default')>,
 <OpOverload(op='aten.cummin', overload='dimname')>,
 <OpOverload(op='aten.cumprod', overload='dimname')>,
 <OpOverload(op='aten.cumprod_backward', overload='default')>,
 <OpOverload(op='aten.cumsum', overload='dimname')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='dx')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='x')>,
 <OpOverload(op='aten.data', overload='default')>,
 <OpOverload(op='aten.det', overload='default')>,
 <OpOverload(op='aten.diag', overload='default')>,
 <OpOverload(op='aten.diagflat', overload='default')>,
 <OpOverload(op='aten.diff', overload='default')>,
 <OpOverload(op='aten.divide', overload='Scalar')>,
 <OpOverload(op='aten.divide', overload='Scalar_mode')>,
 <OpOverload(op='aten.divide', overload='Tensor')>,
 <OpOverload(op='aten.divide', overload='Tensor_mode')>,
 <OpOverload(op='aten.dstack', overload='default')>,
 <OpOverload(op='aten.einsum', overload='default')>,
 <OpOverload(op='aten.embedding_backward', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='padding_idx')>,
 <OpOverload(op='aten.embedding_sparse_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='tensor_qparams')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_quantize_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_gemm_matrix_fp16', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='KN')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='default')>,
 <OpOverload(op='aten.fft_fft', overload='default')>,
 <OpOverload(op='aten.fft_fft2', overload='default')>,
 <OpOverload(op='aten.fft_fftn', overload='default')>,
 <OpOverload(op='aten.fft_fftshift', overload='default')>,
 <OpOverload(op='aten.fft_hfft', overload='default')>,
 <OpOverload(op='aten.fft_hfft2', overload='default')>,
 <OpOverload(op='aten.fft_hfftn', overload='default')>,
 <OpOverload(op='aten.fft_ifft', overload='default')>,
 <OpOverload(op='aten.fft_ifft2', overload='default')>,
 <OpOverload(op='aten.fft_ifftn', overload='default')>,
 <OpOverload(op='aten.fft_ifftshift', overload='default')>,
 <OpOverload(op='aten.fft_ihfft', overload='default')>,
 <OpOverload(op='aten.fft_ihfft2', overload='default')>,
 <OpOverload(op='aten.fft_ihfftn', overload='default')>,
 <OpOverload(op='aten.fft_irfft', overload='default')>,
 <OpOverload(op='aten.fft_irfft2', overload='default')>,
 <OpOverload(op='aten.fft_irfftn', overload='default')>,
 <OpOverload(op='aten.fft_rfft', overload='default')>,
 <OpOverload(op='aten.fft_rfft2', overload='default')>,
 <OpOverload(op='aten.fft_rfftn', overload='default')>,
 <OpOverload(op='aten.fix', overload='default')>,
 <OpOverload(op='aten.flatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.fliplr', overload='default')>,
 <OpOverload(op='aten.flipud', overload='default')>,
 <OpOverload(op='aten.float_power', overload='Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Tensor')>,
 <OpOverload(op='aten.frobenius_norm', overload='dim')>,
 <OpOverload(op='aten.gather', overload='dimname')>,
 <OpOverload(op='aten.gather_backward', overload='default')>,
 <OpOverload(op='aten.ger', overload='default')>,
 <OpOverload(op='aten.gradient', overload='array')>,
 <OpOverload(op='aten.gradient', overload='scalararray')>,
 <OpOverload(op='aten.gradient', overload='scalarint')>,
 <OpOverload(op='aten.gradient', overload='scalarrayarray')>,
 <OpOverload(op='aten.gradient', overload='scalarrayint')>,
 <OpOverload(op='aten.gradient', overload='tensorarray')>,
 <OpOverload(op='aten.gradient', overload='tensorarrayint')>,
 <OpOverload(op='aten.greater', overload='Scalar')>,
 <OpOverload(op='aten.greater', overload='Tensor')>,
 <OpOverload(op='aten.greater_equal', overload='Scalar')>,
 <OpOverload(op='aten.greater_equal', overload='Tensor')>,
 <OpOverload(op='aten.grid_sampler', overload='default')>,
 <OpOverload(op='aten.group_norm', overload='default')>,
 <OpOverload(op='aten.gru', overload='data')>,
 <OpOverload(op='aten.gru', overload='input')>,
 <OpOverload(op='aten.gru_cell', overload='default')>,
 <OpOverload(op='aten.hinge_embedding_loss', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='TensorList_bins')>,
 <OpOverload(op='aten.histogramdd', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='int_bins')>,
 <OpOverload(op='aten.hstack', overload='default')>,
 <OpOverload(op='aten.index_add', overload='dimname')>,
 <OpOverload(op='aten.index_copy', overload='dimname')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Scalar')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Tensor')>,
 <OpOverload(op='aten.index_select', overload='dimname')>,
 <OpOverload(op='aten.index_select_backward', overload='default')>,
 <OpOverload(op='aten.infinitely_differentiable_gelu_backward', overload='default')>,
 <OpOverload(op='aten.inner', overload='default')>,
 <OpOverload(op='aten.instance_norm', overload='default')>,
 <OpOverload(op='aten.inverse', overload='default')>,
 <OpOverload(op='aten.is_complex', overload='default')>,
 <OpOverload(op='aten.is_conj', overload='default')>,
 <OpOverload(op='aten.is_distributed', overload='default')>,
 <OpOverload(op='aten.is_floating_point', overload='default')>,
 <OpOverload(op='aten.is_inference', overload='default')>,
 <OpOverload(op='aten.is_leaf', overload='default')>,
 <OpOverload(op='aten.is_neg', overload='default')>,
 <OpOverload(op='aten.is_nonzero', overload='default')>,
 <OpOverload(op='aten.is_signed', overload='default')>,
 <OpOverload(op='aten.is_vulkan_available', overload='default')>,
 <OpOverload(op='aten.isclose', overload='default')>,
 <OpOverload(op='aten.isfinite', overload='default')>,
 <OpOverload(op='aten.isreal', overload='default')>,
 <OpOverload(op='aten.istft', overload='default')>,
 <OpOverload(op='aten.item', overload='default')>,
 <OpOverload(op='aten.kl_div', overload='default')>,
 <OpOverload(op='aten.kron', overload='default')>,
 <OpOverload(op='aten.kthvalue', overload='dimname')>,
 <OpOverload(op='aten.l1_loss', overload='default')>,
 <OpOverload(op='aten.layer_norm', overload='default')>,
 <OpOverload(op='aten.ldexp', overload='Tensor')>,
 <OpOverload(op='aten.less', overload='Scalar')>,
 <OpOverload(op='aten.less', overload='Tensor')>,
 <OpOverload(op='aten.less_equal', overload='Scalar')>,
 <OpOverload(op='aten.less_equal', overload='Tensor')>,
 <OpOverload(op='aten.linalg_cholesky', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='p_str')>,
 <OpOverload(op='aten.linalg_det', overload='default')>,
 <OpOverload(op='aten.linalg_eigh', overload='default')>,
 <OpOverload(op='aten.linalg_eigvals', overload='default')>,
 <OpOverload(op='aten.linalg_eigvalsh', overload='default')>,
 <OpOverload(op='aten.linalg_inv', overload='default')>,
 <OpOverload(op='aten.linalg_ldl_factor', overload='default')>,
 <OpOverload(op='aten.linalg_lu_factor', overload='default')>,
 <OpOverload(op='aten.linalg_matmul', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='str_ord')>,
 <OpOverload(op='aten.linalg_matrix_power', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_tensor')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='tol_tensor')>,
 <OpOverload(op='aten.linalg_multi_dot', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='ord_str')>,
 <OpOverload(op='aten.linalg_pinv', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_pinv', overload='default')>,
 <OpOverload(op='aten.linalg_pinv', overload='rcond_tensor')>,
 <OpOverload(op='aten.linalg_slogdet', overload='default')>,
 <OpOverload(op='aten.linalg_solve', overload='default')>,
 <OpOverload(op='aten.linalg_solve_ex', overload='default')>,
 <OpOverload(op='aten.linalg_svd', overload='default')>,
 <OpOverload(op='aten.linalg_svdvals', overload='default')>,
 <OpOverload(op='aten.linalg_tensorinv', overload='default')>,
 <OpOverload(op='aten.linalg_tensorsolve', overload='default')>,
 <OpOverload(op='aten.linalg_vander', overload='default')>,
 <OpOverload(op='aten.linalg_vecdot', overload='default')>,
 <OpOverload(op='aten.linear', overload='default')>,
 <OpOverload(op='aten.log_sigmoid', overload='default')>,
 <OpOverload(op='aten.log_softmax', overload='Dimname')>,
 <OpOverload(op='aten.log_softmax', overload='int')>,
 <OpOverload(op='aten.logcumsumexp', overload='dimname')>,
 <OpOverload(op='aten.logdet', overload='default')>,
 <OpOverload(op='aten.logsumexp', overload='names')>,
 <OpOverload(op='aten.lstm', overload='data')>,
 <OpOverload(op='aten.lstm', overload='input')>,
 <OpOverload(op='aten.lstm_cell', overload='default')>,
 <OpOverload(op='aten.lu_solve', overload='default')>,
 <OpOverload(op='aten.margin_ranking_loss', overload='default')>,
 <OpOverload(op='aten.masked_select_backward', overload='default')>,
 <OpOverload(op='aten.matmul', overload='default')>,
 <OpOverload(op='aten.matrix_exp', overload='default')>,
 <OpOverload(op='aten.matrix_exp_backward', overload='default')>,
 <OpOverload(op='aten.matrix_power', overload='default')>,
 <OpOverload(op='aten.max', overload='names_dim')>,
 <OpOverload(op='aten.max', overload='other')>,
 <OpOverload(op='aten.max_pool1d', overload='default')>,
 <OpOverload(op='aten.max_pool1d_with_indices', overload='default')>,
 <OpOverload(op='aten.max_pool2d', overload='default')>,
 <OpOverload(op='aten.max_pool3d', overload='default')>,
 <OpOverload(op='aten.mean', overload='names_dim')>,
 <OpOverload(op='aten.median', overload='names_dim')>,
 <OpOverload(op='aten.meshgrid', overload='default')>,
 <OpOverload(op='aten.meshgrid', overload='indexing')>,
 <OpOverload(op='aten.min', overload='names_dim')>,
 <OpOverload(op='aten.min', overload='other')>,
 <OpOverload(op='aten.mish_backward', overload='default')>,
 <OpOverload(op='aten.mode', overload='dimname')>,
 <OpOverload(op='aten.msort', overload='default')>,
 <OpOverload(op='aten.multilabel_margin_loss', overload='default')>,
 <OpOverload(op='aten.multiply', overload='Scalar')>,
 <OpOverload(op='aten.multiply', overload='Tensor')>,
 <OpOverload(op='aten.nanmean', overload='default')>,
 <OpOverload(op='aten.nanmedian', overload='names_dim')>,
 <OpOverload(op='aten.nanquantile', overload='default')>,
 <OpOverload(op='aten.nanquantile', overload='scalar')>,
 <OpOverload(op='aten.native_channel_shuffle', overload='default')>,
 <OpOverload(op='aten.negative', overload='default')>,
 <OpOverload(op='aten.nested_to_padded_tensor', overload='default')>,
 <OpOverload(op='aten.nll_loss', overload='default')>,
 <OpOverload(op='aten.nll_loss2d', overload='default')>,
 <OpOverload(op='aten.nll_loss_nd', overload='default')>,
 <OpOverload(op='aten.nonzero_numpy', overload='default')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim_dtype')>,
 <OpOverload(op='aten.norm_except_dim', overload='default')>,
 <OpOverload(op='aten.not_equal', overload='Scalar')>,
 <OpOverload(op='aten.not_equal', overload='Tensor')>,
 <OpOverload(op='aten.nuclear_norm', overload='default')>,
 <OpOverload(op='aten.nuclear_norm', overload='dim')>,
 <OpOverload(op='aten.one_hot', overload='default')>,
 <OpOverload(op='aten.orgqr', overload='default')>,
 <OpOverload(op='aten.outer', overload='default')>,
 <OpOverload(op='aten.output_nr', overload='default')>,
 <OpOverload(op='aten.pad', overload='default')>,
 <OpOverload(op='aten.pad_sequence', overload='default')>,
 <OpOverload(op='aten.pairwise_distance', overload='default')>,
 <OpOverload(op='aten.pdist', overload='default')>,
 <OpOverload(op='aten.pinverse', overload='default')>,
 <OpOverload(op='aten.poisson_nll_loss', overload='default')>,
 <OpOverload(op='aten.prelu', overload='default')>,
 <OpOverload(op='aten.prod', overload='dim_Dimname')>,
 <OpOverload(op='aten.promote_types', overload='default')>,
 <OpOverload(op='aten.qr', overload='default')>,
 <OpOverload(op='aten.quantile', overload='default')>,
 <OpOverload(op='aten.quantile', overload='scalar')>,
 <OpOverload(op='aten.quantized_gru_cell', overload='default')>,
 <OpOverload(op='aten.quantized_lstm_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.relu6', overload='default')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_Tensor')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_int')>,
 <OpOverload(op='aten.result_type', overload='Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Tensor')>,
 <OpOverload(op='aten.result_type', overload='Tensor')>,
 <OpOverload(op='aten.retains_grad', overload='default')>,
 <OpOverload(op='aten.rms_norm', overload='default')>,
 <OpOverload(op='aten.rnn_relu', overload='data')>,
 <OpOverload(op='aten.rnn_relu', overload='input')>,
 <OpOverload(op='aten.rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.rnn_tanh', overload='data')>,
 <OpOverload(op='aten.rnn_tanh', overload='input')>,
 <OpOverload(op='aten.rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.row_stack', overload='default')>,
 <OpOverload(op='aten.rrelu', overload='default')>,
 <OpOverload(op='aten.scaled_dot_product_attention', overload='default')>,
 <OpOverload(op='aten.scatter', overload='dimname_src')>,
 <OpOverload(op='aten.scatter', overload='dimname_value')>,
 <OpOverload(op='aten.scatter_add', overload='dimname')>,
 <OpOverload(op='aten.selu', overload='default')>,
 <OpOverload(op='aten.silu_backward', overload='default')>,
 <OpOverload(op='aten.size', overload='Dimname')>,
 <OpOverload(op='aten.size', overload='int')>,
 <OpOverload(op='aten.slogdet', overload='default')>,
 <OpOverload(op='aten.slow_conv3d', overload='default')>,
 <OpOverload(op='aten.smm', overload='default')>,
 <OpOverload(op='aten.softmax', overload='Dimname')>,
 <OpOverload(op='aten.softmax', overload='int')>,
 <OpOverload(op='aten.sort', overload='dimname')>,
 <OpOverload(op='aten.sort', overload='dimname_stable')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices_size')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.special_digamma', overload='default')>,
 <OpOverload(op='aten.special_erf', overload='default')>,
 <OpOverload(op='aten.special_erfc', overload='default')>,
 <OpOverload(op='aten.special_erfinv', overload='default')>,
 <OpOverload(op='aten.special_exp2', overload='default')>,
 <OpOverload(op='aten.special_expit', overload='default')>,
 <OpOverload(op='aten.special_expm1', overload='default')>,
 <OpOverload(op='aten.special_gammainc', overload='default')>,
 <OpOverload(op='aten.special_gammaincc', overload='default')>,
 <OpOverload(op='aten.special_gammaln', overload='default')>,
 <OpOverload(op='aten.special_i0', overload='default')>,
 <OpOverload(op='aten.special_log1p', overload='default')>,
 <OpOverload(op='aten.special_log_softmax', overload='default')>,
 <OpOverload(op='aten.special_logit', overload='default')>,
 <OpOverload(op='aten.special_logsumexp', overload='default')>,
 <OpOverload(op='aten.special_multigammaln', overload='default')>,
 <OpOverload(op='aten.special_ndtr', overload='default')>,
 <OpOverload(op='aten.special_polygamma', overload='default')>,
 <OpOverload(op='aten.special_psi', overload='default')>,
 <OpOverload(op='aten.special_round', overload='default')>,
 <OpOverload(op='aten.special_sinc', overload='default')>,
 <OpOverload(op='aten.special_softmax', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='other_scalar')>,
 <OpOverload(op='aten.special_xlogy', overload='self_scalar')>,
 <OpOverload(op='aten.square', overload='default')>,
 <OpOverload(op='aten.sspaddmm', overload='default')>,
 <OpOverload(op='aten.std', overload='correction_names')>,
 <OpOverload(op='aten.std', overload='default')>,
 <OpOverload(op='aten.std', overload='dim')>,
 <OpOverload(op='aten.std', overload='names_dim')>,
 <OpOverload(op='aten.std_mean', overload='correction_names')>,
 <OpOverload(op='aten.std_mean', overload='default')>,
 <OpOverload(op='aten.std_mean', overload='dim')>,
 <OpOverload(op='aten.std_mean', overload='names_dim')>,
 <OpOverload(op='aten.stft', overload='center')>,
 <OpOverload(op='aten.stft', overload='default')>,
 <OpOverload(op='aten.stride', overload='Dimname')>,
 <OpOverload(op='aten.stride', overload='int')>,
 <OpOverload(op='aten.subtract', overload='Scalar')>,
 <OpOverload(op='aten.subtract', overload='Tensor')>,
 <OpOverload(op='aten.sum', overload='dim_DimnameList')>,
 <OpOverload(op='aten.sum_to_size', overload='default')>,
 <OpOverload(op='aten.svd', overload='default')>,
 <OpOverload(op='aten.sym_size', overload='int')>,
 <OpOverload(op='aten.sym_stride', overload='int')>,
 <OpOverload(op='aten.take_along_dim', overload='default')>,
 <OpOverload(op='aten.tensordot', overload='default')>,
 <OpOverload(op='aten.thnn_conv2d', overload='default')>,
 <OpOverload(op='aten.tile', overload='default')>,
 <OpOverload(op='aten.to_dense', overload='default')>,
 <OpOverload(op='aten.to_dense_backward', overload='default')>,
 <OpOverload(op='aten.to_mkldnn_backward', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='sparse_dim')>,
 <OpOverload(op='aten.to_sparse_bsc', overload='default')>,
 <OpOverload(op='aten.to_sparse_bsr', overload='default')>,
 <OpOverload(op='aten.to_sparse_csc', overload='default')>,
 <OpOverload(op='aten.to_sparse_csr', overload='default')>,
 <OpOverload(op='aten.trace_backward', overload='default')>,
 <OpOverload(op='aten.trapezoid', overload='dx')>,
 <OpOverload(op='aten.trapezoid', overload='x')>,
 <OpOverload(op='aten.trapz', overload='dx')>,
 <OpOverload(op='aten.trapz', overload='x')>,
 <OpOverload(op='aten.triplet_margin_loss', overload='default')>,
 <OpOverload(op='aten.true_divide', overload='Scalar')>,
 <OpOverload(op='aten.true_divide', overload='Tensor')>,
 <OpOverload(op='aten.type_as', overload='default')>,
 <OpOverload(op='aten.unflatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.upsample_bicubic2d', overload='vec')>,
 <OpOverload(op='aten.upsample_bilinear2d', overload='vec')>,
 <OpOverload(op='aten.upsample_linear1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='vec')>,
 <OpOverload(op='aten.upsample_trilinear3d', overload='vec')>,
 <OpOverload(op='aten.value_selecting_reduction_backward', overload='default')>,
 <OpOverload(op='aten.vander', overload='default')>,
 <OpOverload(op='aten.var', overload='correction_names')>,
 <OpOverload(op='aten.var', overload='default')>,
 <OpOverload(op='aten.var', overload='dim')>,
 <OpOverload(op='aten.var', overload='names_dim')>,
 <OpOverload(op='aten.var_mean', overload='correction_names')>,
 <OpOverload(op='aten.var_mean', overload='default')>,
 <OpOverload(op='aten.var_mean', overload='dim')>,
 <OpOverload(op='aten.var_mean', overload='names_dim')>,
 <OpOverload(op='aten.vstack', overload='default')>,
 <OpOverload(op='aten.where', overload='Scalar')>,
 <OpOverload(op='aten.where', overload='ScalarOther')>,
 <OpOverload(op='aten.where', overload='ScalarSelf')>,
 <OpOverload(op='aten.where', overload='default')>,
 <OpOverload(op='aten.wrapped_linear_prepack', overload='default')>,
 <OpOverload(op='aten.wrapped_quantized_linear_prepacked', overload='default')>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136153
Approved by: https://github.com/xadupre, https://github.com/gramalingam
2024-09-16 21:28:54 +00:00
Pearu Peterson
b76d1b79e6 Add scaling arguments to bsr_dense_addmm (#136104)
As in the title.

Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413

The PR assumes that the existing tuning parameters remain good when scaling arguments are used. This needs to be verified as a follow-up task.

Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This now allows zero strides, which previously triggered a `contiguous` call even though the underlying memory buffer was contiguous.
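
As a rough illustration of the redefinition above, here is a minimal sketch that works on plain stride tuples rather than tensors. It assumes that "strides not larger than 1" means the smallest stride is at most 1. The helper name `is_triton_contiguous` is hypothetical and is not taken from the PR.

```python
def is_triton_contiguous(strides):
    """Assumed reading of the rule: the smallest stride is at most 1.

    `strides` is a non-empty tuple of ints, as a tensor's stride() would return.
    """
    return min(strides) <= 1


# Ordinary row-major layout: the last stride is 1.
row_major = is_triton_contiguous((8, 1))
# Broadcast (expanded) dimension: a zero stride is accepted under this rule.
zero_stride = is_triton_contiguous((0, 0))
# Every stride is larger than 1, so a contiguous copy would still be needed.
strided_view = is_triton_contiguous((16, 2))
```

Under this reading, a layout whose strides are all zero counts as triton-contiguous, so no copy would be requested for it.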

Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to the code (torch or triton?) that implements the element-wise/chunk-wise copy, so that we can verify that allowing zero strides does not in fact trigger element-wise copies. At the moment, the performance increase in ViT-H benchmarks (which involve zero strides) is evidence that allowing zero strides does not lead to slow-downs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
2024-09-16 20:26:54 +00:00
PyTorch MergeBot
bfbcdf4967 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit c64ae601ba.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, we need to skip the new tests on py3.10 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2353909010))
2024-09-16 20:26:35 +00:00
Dan Johnson
3c97b0ab00 Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499)
NCCL does not have an API for ncclAllToAll and ncclAllToAllv, so PyTorch does point-to-point send/recv. Expose this API if it is supported.

Differential Revision: [D61683836](https://our.internmc.facebook.com/intern/diff/D61683836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134499
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-09-16 20:08:06 +00:00
Kiuk Chung
abd16a8c64 [torch/multiprocessing] Use multiprocessing.reduction.register instead of ForkingPickler.register to register custom tensor and storage reductions (#135030)
Right now `multiprocessing.reduction.register()` is simply an alias to `multiprocessing.reduction.ForkingPickler.register()` (https://github.com/python/cpython/blame/main/Lib/multiprocessing/reduction.py#L56), but the top-level `register()` function exposes fewer of the internal details of the `multiprocessing.reduction` module.
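
For illustration, a minimal sketch of registering a custom reduction through the top-level `register()` function. The `Handle` class and the `rebuild_handle` / `reduce_handle` helpers are hypothetical stand-ins for the tensor and storage reductions; they are not PyTorch code.

```python
import pickle
from multiprocessing import reduction


class Handle:
    """Hypothetical stand-in for a tensor or storage object."""

    def __init__(self, name, rebuilt=False):
        self.name = name
        self.rebuilt = rebuilt


def rebuild_handle(name):
    # Runs on the receiving side to reconstruct the object.
    return Handle(name, rebuilt=True)


def reduce_handle(handle):
    # Returns (callable, args), the same shape as a __reduce__ result.
    return rebuild_handle, (handle.name,)


# Top-level function; no need to reach into ForkingPickler directly.
reduction.register(Handle, reduce_handle)

# ForkingPickler is what multiprocessing uses to serialize objects.
payload = reduction.ForkingPickler.dumps(Handle("weights"))
restored = pickle.loads(payload)
print(restored.name, restored.rebuilt)
```

The `rebuilt` flag is only there to show that the registered reducer, rather than default pickling, produced the restored object.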
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135030
Approved by: https://github.com/albanD
2024-09-16 20:07:29 +00:00
fduwjj
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up that cleans up the options of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG.

Also, PGNCCL is using the option class from ProcessGroup, but it should actually use the Option from the backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature of the public API, so it still uses the arg named "pg_options".

We need to change the tests to align them with this change.

This is an attempt to reland D62008954 with the internal errors fixed.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
James Wu
7537f74277 Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491)
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to look up the key in the cache
- post compile, where we do some logging and run post compile steps

Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.
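
The three-phase split can be sketched with a toy in-memory cache. This is an illustration under stated assumptions, not the FxGraphCache implementation: `ToyGraphCache`, `BypassCache`, the string "graphs", and the `compile_fn` callback are all hypothetical.

```python
import hashlib


class BypassCache(Exception):
    """Raised by prepare_key when an input is not cacheable."""


class ToyGraphCache:
    def __init__(self):
        self._store = {}
        self.log = []

    def prepare_key(self, source):
        # Phase 1: check cacheability and derive the key, or bypass.
        if not isinstance(source, str):
            raise BypassCache("only string inputs are cacheable in this toy")
        return hashlib.sha256(source.encode()).hexdigest()

    def load_with_key(self, key):
        # Phase 2: plain lookup; a second cache could call this directly.
        return self._store.get(key)

    def post_compile(self, key, artifact, hit):
        # Phase 3: logging and any post-compile steps.
        self.log.append((key[:8], "hit" if hit else "miss"))
        return artifact

    def load(self, source, compile_fn):
        try:
            key = self.prepare_key(source)
        except BypassCache:
            return compile_fn(source)  # uncacheable: compile without caching
        artifact = self.load_with_key(key)
        hit = artifact is not None
        if not hit:
            artifact = compile_fn(source)
            self._store[key] = artifact
        return self.post_compile(key, artifact, hit)


cache = ToyGraphCache()
print(cache.load("x + y", str.upper))  # first call compiles and stores
print(cache.load("x + y", str.upper))  # second call is served from the cache
print([status for _, status in cache.log])
```

Keeping `load_with_key` free of key construction is what lets another caller reuse the lookup once it already holds a key.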

Differential Revision: D62314862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
2024-09-16 19:48:08 +00:00
Aaron Gokaslan
31715be72a [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.2 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-16 19:44:11 +00:00
Nikita Shulga
38caf10411 [EZ] Fix spelling typo (#136157)
s/toosl/tools/ (spotted by @louie-tsai)
Also, capitalize CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157
Approved by: https://github.com/kit1980
2024-09-16 19:30:30 +00:00
Ke Wen
c977bb7d03 [Distributed] fix FileSystemWriter __init__ (#136135)
Fixes #135608.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135
Approved by: https://github.com/Skylion007
2024-09-16 19:11:08 +00:00
eugenekoran
717fca2cac Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146)
Fixes #125920

[Running clang-tidy](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#running-clang-tidy) section is misleading and outdated. C++ lint is done with lintrunner and covered in [local-linting](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#local-linting) section.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136146
Approved by: https://github.com/janeyx99
2024-09-16 19:02:21 +00:00
Alexander Kurakin
f89ce4dfbb torch.nn.MultiheadAttention: docs: improvement (#136111)
`torch.nn.MultiheadAttention`: docs: improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136111
Approved by: https://github.com/janeyx99
2024-09-16 18:52:20 +00:00
Nikita Shulga
d3647d15e6 Remove accidentally committed code (#136154)
Accidentally left in during a rebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136154
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-09-16 18:34:20 +00:00
PyTorch MergeBot
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b47.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
PyTorch MergeBot
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67d.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
Aaron Gokaslan
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
Suresh Babu Kolla
5193f23469 [Pytorch] Cleanup Strobelight URL and shorten for readability (#136102)
Summary:
- Converted strobelight URL prefix to more readable and editable json
- Dump shortened URLs when possible for easier readability

Test Plan:
```
python ./torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py
```

Differential Revision: D62690292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136102
Approved by: https://github.com/laithsakka
2024-09-16 18:10:33 +00:00
PyTorch MergeBot
0199fd4d7e Revert "[inductor] More fixes on the keys of constants and signature dictionaries (#135406)"
This reverts commit e54b559e88.

Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))
2024-09-16 17:58:02 +00:00
Aaron Gokaslan
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
Justin Chu
0aa41eb52f [ONNX] Run type promotion test in CI and update the table (#135915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135915
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-16 16:46:13 +00:00