Commit Graph

104 Commits

Author SHA1 Message Date
Edward Z. Yang
c2bccfd431 [BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132675
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
ghstack dependencies: #132674
2024-08-06 18:13:22 +00:00
Chien-Chin Huang
bc510916fa Only make wait_tensor as a side_effect op (#132341)
Summary:
https://github.com/pytorch/pytorch/pull/131023 added all the collective ops to the side-effect list. However, only wait_tensor needs to be a side_effect op, because every collective op should have a corresponding wait_tensor.

We should switch to using high_order effect tokens.
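
As a rough illustration of the side-effect list idea, here is a toy dead-code-elimination pass in plain Python. It is a sketch, not PyTorch's FX implementation: `Node`, `dce`, and `SIDE_EFFECT_OPS` are hypothetical names invented for this example. Nodes whose op is in the side-effect set are treated as extra roots, so they (and the collectives feeding them) survive even when no output uses them.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the set of ops that DCE must never drop.
SIDE_EFFECT_OPS = {"wait_tensor"}


@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)


def dce(nodes, outputs, side_effect_ops=SIDE_EFFECT_OPS):
    """Keep nodes reachable from the outputs or from side-effectful ops."""
    by_name = {n.name: n for n in nodes}
    # Roots: graph outputs plus every side-effectful node.
    stack = list(outputs) + [n.name for n in nodes if n.op in side_effect_ops]
    live = set()
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        stack.extend(by_name[name].inputs)
    return [n for n in nodes if n.name in live]


graph = [
    Node("x", "placeholder"),
    Node("ar", "all_reduce", ["x"]),
    Node("w", "wait_tensor", ["ar"]),  # result is not used by the output
    Node("y", "relu", ["x"]),
]
kept = [n.name for n in dce(graph, outputs=["y"])]
kept_without_list = [n.name for n in dce(graph, ["y"], side_effect_ops=set())]
```

In this toy graph, marking only `wait_tensor` is enough to keep `all_reduce` alive as well, because the wait node depends on it.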

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132341
Approved by: https://github.com/yf225
2024-08-02 01:24:40 +00:00
YangQun1
c2f3266c8e Not remove collective ops in dce since they have side-effect (#131023)
Fixes #130918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131023
Approved by: https://github.com/yf225
2024-07-26 03:03:32 +00:00
Xuehai Pan
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
PyTorch MergeBot
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c41.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
Xuehai Pan
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Xuehai Pan
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
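
A minimal sketch of the fallback described above, using only the standard library: the call to `warnings.warn` passes `category=FutureWarning` explicitly. The functions `old_api` and `new_api` are hypothetical and exist only for illustration; the decorator form from `typing_extensions` is not shown, to keep the example dependency-free.

```python
import warnings


def new_api():
    return 42


def old_api():
    # FutureWarning is shown to end users by default, whereas
    # DeprecationWarning is hidden unless triggered from __main__.
    warnings.warn(
        "old_api() is deprecated, use new_api() instead",
        category=FutureWarning,
        stacklevel=2,
    )
    return new_api()


# Capture the warning to confirm its category.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()
```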

Resolves #126888

- #126888

This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
PyTorch MergeBot
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
Xuehai Pan
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
Will Feng
4333e122d4 [Traceable FSDP2] Add all_gather_into_tensor out variant (#126334)
This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`.

It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather, and makes input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, AllGather op will then create a brand-new output buffer (instead of reusing), thus significantly increasing the memory usage.

The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126334
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
2024-05-16 10:27:06 +00:00
Brian Hirsh
e28d9947a1 AsyncCollectiveTensor: prevent wait_tensor() calls on graph inputs from getting DCEd (#125677)
@wanchaol was seeing the loss eventually become NaN when compiling individual transformer blocks in torchtitan - with this patch I no longer see the NaN loss.

The problem is the following:

(1) It is possible to have graph inputs to a compiled region that are AsyncCollectiveTensors. In particular: when we compile individual transformer blocks in the llama model, the first layer (embedding layer) is run in eager mode, and it outputs an AsyncCollectiveTensor that is fed to the first transformer block

(2) ideally, we would like that AsyncCollectiveTensor graph input to desugar into a `wait_tensor()` op that shows up at the beginning of the graph.

(3) the way this is supposed to happen is: AOTAutograd traces through the __torch_dispatch__ of AsyncCollectiveTensor, tracing out a `wait_tensor()` call before dispatching to any of the other ops in the function we are tracing

(4) however: `trigger_wait()` was getting called in a way where we would ignore its output (and return `self.elem` directly), which would cause the `wait_tensor` ops to get DCE'd.
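
Step (4) can be illustrated with a toy tracer in plain Python. This is only a sketch under stated assumptions: `ToyAsyncTensor`, `record`, `live_nodes`, and the two `unwrap_*` helpers are hypothetical names, not the actual AsyncCollectiveTensor code. It shows why dropping the result of `trigger_wait()` leaves the wait node with no users, which is what makes it eligible for dead-code elimination.

```python
# Toy tracer: every op call is recorded as (name, op, input_names).
trace = []


def record(op, *inputs):
    name = f"{op}_{len(trace)}"
    trace.append((name, op, inputs))
    return name


def live_nodes(output):
    """Names of recorded nodes reachable from `output`."""
    deps = {name: inputs for name, _, inputs in trace}
    live, stack = set(), [output]
    while stack:
        n = stack.pop()
        if n in deps and n not in live:
            live.add(n)
            stack.extend(deps[n])
    return live


class ToyAsyncTensor:
    """Hypothetical stand-in for AsyncCollectiveTensor."""

    def __init__(self, elem):
        self.elem = elem

    def trigger_wait(self):
        # Return the waited value so callers can depend on it.
        return record("wait_tensor", self.elem)


def unwrap_buggy(t):
    t.trigger_wait()  # result dropped ...
    return t.elem     # ... so nothing downstream uses the wait node


def unwrap_fixed(t):
    return t.trigger_wait()


t = ToyAsyncTensor("graph_input")
buggy_out = record("add", unwrap_buggy(t), "other")
fixed_out = record("add", unwrap_fixed(t), "other")
```

With the buggy unwrap, only the `add` node is reachable from the output; with the fixed unwrap, the `wait_tensor` node is reachable too and therefore survives.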

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125677
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #125676
2024-05-08 15:54:01 +00:00
Yifu Wang
d4a1b3e093 Make c10d_functional ops call into _c10d_functional ops (#124979)
This PR removes the legacy impls of c10d_functional ops, which are now irrelevant. For backward compatibility, c10d_functional ops now call into _c10d_functional ops.

We also changed c10d_functional ops to be CompositeExplicitAutograd, so that when traced, only _c10d_functional ops appear in the graph. After this, we'll be able to remove the Inductor IR for the legacy functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124979
Approved by: https://github.com/wanchaol
2024-04-27 08:08:02 +00:00
Tristan Rice
1ec05c769b all_gather and reduce_scatter autograd (#123989)
This adds `all_gather_tensor_autograd` and `reduce_scatter_tensor_autograd` to the functional_collectives library.

This only supports `sum` mode for `reduce_scatter` but should be easy to extend in the future.

The backwards implementations match the behavior in https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py
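
The commit message does not spell out the backward formulas. Assuming the usual adjoint relationship (the backward of `all_gather` is a sum `reduce_scatter`, and vice versa), the following single-process simulation checks that relationship numerically. The helpers `all_gather`, `reduce_scatter_sum`, and `dot` are hypothetical pure-Python stand-ins that model each rank as a list; they are not the functional_collectives API.

```python
def all_gather(per_rank):
    """Every rank receives the concatenation of all ranks' shards."""
    full = [v for shard in per_rank for v in shard]
    return [list(full) for _ in per_rank]


def reduce_scatter_sum(per_rank):
    """Sum across ranks, then hand rank r the r-th equal chunk."""
    world = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    chunk = len(total) // world
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def dot(a, b):
    """Inner product summed over all ranks."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


x = [[1.0, 2.0], [3.0, 4.0]]                      # one shard per rank
g = [[0.5, 1.0, 1.5, 2.0], [2.5, 3.0, 3.5, 4.0]]  # upstream grad per rank

# Candidate backward of all_gather: sum-reduce-scatter the upstream grads.
grad_x = reduce_scatter_sum(g)

# Adjoint check: <all_gather(x), g> must equal <x, grad_x>.
lhs = dot(all_gather(x), g)
rhs = dot(x, grad_x)
```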

This follows the pattern of #123599 .

Test plan:

```sh
pytest test/distributed/test_functional_api.py -k Autograd
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123989
Approved by: https://github.com/wanchaol
2024-04-17 21:32:22 +00:00
Yifu Wang
2a2e1d8e4f [functional collective] change the Python APIs to only use the native funcol ops (#123777)
## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that run on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777
Approved by: https://github.com/wanchaol
ghstack dependencies: #123776
2024-04-13 03:08:36 +00:00
Tristan Rice
358ace1a1b functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599)
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs.

This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions.

This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering.

To avoid cudaStream issues this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can be alleviated potentially using `compile`.

Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py

Test plan:

```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
2024-04-12 01:48:49 +00:00
Yifu Wang
3bede14fa7 Don't create world pg variable out of thin air when rewriting c10d collectives (#122561)
Fixes https://github.com/pytorch/pytorch/issues/122404

Previously, when rewriting c10d collectives, if the group argument is
unspecified or None, we create a world pg variable out of thin air and
pass it to the rewrite target. The approach was problematic, as it
assumes the symbol `torch` is available in the scope (see #122404).

After #120560, dynamo can now trace dist.group.WORLD. If the group
argument is unspecified, we can just set it with dist.group.WORLD in the
rewrite target.

Testing

pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce

Also verified with the repro provided in #122404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561
Approved by: https://github.com/wconstab
ghstack dependencies: #120560
2024-03-26 20:12:08 +00:00
Yifu Wang
71d0202627 [dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-03-09 08:28:22 +00:00
albanD
6791b0c09e Change default torch_function behavior to be disabled when torch_dispatch is defined (take 2) (#120632)
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
2024-03-09 01:08:37 +00:00
Yifu Wang
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
Oleg Khabinov
4b18ab869f [torch.export] Support is_compiling() flag for non-strict mode (#119602)
Summary: In the non-strict mode of torch.export(), we didn't set the `is_compiling()` flags to `True`, which some models need.

Test Plan: Unit tests and manual testing.

Differential Revision: D53624452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
2024-02-29 05:52:51 +00:00
Sergii Dymchenko
d341b66e96 Revert [dynamo] support group=None when rewriting collectives (#120118) (#120677)
This reverts commit 298c686d3f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120677
Approved by: https://github.com/yifuwang, https://github.com/huydhn
2024-02-27 00:33:35 +00:00
Yifu Wang
298c686d3f [dynamo] support group=None when rewriting collectives (#120118)
Resolves case 2 in #120082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120118
Approved by: https://github.com/wconstab
ghstack dependencies: #120370
2024-02-25 03:12:10 +00:00
Yifu Wang
11e4a9266d Temporarily support ranks + tag as pg identifier in native funcol (#120226)
As communicated in https://github.com/pytorch/pytorch/issues/93173#issuecomment-1907095208, although we are dropping `(ranks, tag)` as group identifier in funcols, there will be a grace period for migration. This PR adds temporary `(ranks, tag)` support in native funcols. It also helps us decouple the py funcol -> native funcol transition from the API change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120226
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #120042, #120043, #120070
2024-02-22 20:24:16 +00:00
Yifu Wang
dd6b5e236e Prepare test_inductor_collectives.py for native funcol migration (#120025)
There are some tests in this file that are impl specific, e.g. verifying generated code via `FileCheck`. These tests are covered for native funcol in test_c10d_functional_native.py, therefore marking them with `@run_with_legacy_funcol`.

Other tests are marked with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120025
Approved by: https://github.com/wanchaol
ghstack dependencies: #119982
2024-02-21 00:46:25 +00:00
Yifu Wang
637cf4a3f2 Test parametrization utils for native funcol migration (#119950)
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.

run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).

run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.

run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```

This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
2024-02-19 02:46:03 +00:00
IvanKobzarev
006eead7d2 [dynamo][functional_collectives] Add all_to_all_single, all_gather_list, reduce_scatter_list to dynamo remapping (#119683)
Differential Revision: [D53758434](https://our.internmc.facebook.com/intern/diff/D53758434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119683
Approved by: https://github.com/ezyang
2024-02-16 21:28:39 +00:00
Yifu Wang
4ac857f94e Support broadcast in native funcol (#119229)
### Summary

@LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol.

- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with Python funcol broadcast and `AsyncCollectiveTensor`
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
2024-02-16 21:01:34 +00:00
Yifu Wang
8f82a44a5b Run device mesh tests with native funcol enabled (#118437)
### Summary

Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to running them with it disabled.

All tests except `test_tp_compile_comm_reordering` pass. This is expected because the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. This test is disabled for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
2024-02-04 04:11:11 +00:00
Yifu Wang
697ca4f292 Preliminary DeviceMesh + native c10d functional integration (#118423)
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.

### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).
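
The commit message does not describe how the finalizer is implemented. One common Python pattern for this kind of cleanup is `weakref.finalize`, sketched below under that assumption. `ToyCollectiveResult`, `pending`, and `_wait` are hypothetical names for illustration only.

```python
import gc
import weakref

# Hypothetical registry of in-flight work, keyed by an integer id.
pending = {}
log = []


def _wait(work_id):
    """Complete the work if nobody has waited on it yet."""
    if pending.pop(work_id, None) is not None:
        log.append(f"waited on {work_id}")


class ToyCollectiveResult:
    """Hypothetical result handle for an asynchronous collective."""

    _next_id = 0

    def __init__(self):
        self.work_id = ToyCollectiveResult._next_id
        ToyCollectiveResult._next_id += 1
        pending[self.work_id] = "in-flight"
        # The callback must not reference `self`, or the object never dies.
        weakref.finalize(self, _wait, self.work_id)

    def wait(self):
        _wait(self.work_id)


a = ToyCollectiveResult()
a.wait()        # explicit wait
b = ToyCollectiveResult()
del b           # never waited: the finalizer cleans up
gc.collect()
```

Because `_wait` pops the registry entry, a result that was already waited on is not waited on a second time when its finalizer later runs.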

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
2024-01-31 04:36:12 +00:00
atalman
15702a8027 Fix lint after #118533 (#118633)
Fixes lint after https://github.com/pytorch/pytorch/pull/118533
Adds ignore ``possibly-undefined`` to more places

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118633
Approved by: https://github.com/DanilBaibak
2024-01-30 14:07:16 +00:00
Yifu Wang
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collectives` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
Roger Lam
2c5488d719 Match all_gather_into_tensor args names in remapping (#117224)
Fixes #114179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117224
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-01-17 03:50:29 +00:00
Yifu Wang
718b576e2c Port all_to_all_single to native c10d_functional (#113438)
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
2023-12-22 08:12:13 +00:00
Lucas Pasqualin
d749b4a152 Implements permute_tensor in functional collectives (#115078)
Implementation of `permute_tensor` as per @yifuwang 's suggestion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115078
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
2023-12-19 18:33:28 +00:00
Lucas Pasqualin
8452f41305 Adds allreduce to inductor remap (#115950)
Fixes #115728

Implements a rewrite path for allreduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950
Approved by: https://github.com/wconstab
2023-12-18 22:00:22 +00:00
Chien-Chin Huang
54d552e991 [funcol] Directly import DeviceMesh to avoid circular dependency (#115649)
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:

- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.

Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.

==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.

The original import causes some issues when compiling DDP with compiled_autograd. The root cause of compilation failure is not identified but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.

Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
2023-12-13 20:44:58 +00:00
Chien-Chin Huang
50db2aa70a [funcol][BE] Apply ufmt to _functional_collectives.py and turn on lintrunner for functional_collective (#115648)
No logic change, just formatting.

Differential Revision: [D51857236](https://our.internmc.facebook.com/intern/diff/D51857236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115648
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302
2023-12-13 11:19:29 +00:00
Wanchao Liang
b6de337d16 [funcol] a few optimizations to funcol (#113324)
Apply a few optimizations to funcol:

- allgather on non-0 dim: the resulting tensor already needs to access
data in order to do torch.cat, so we sync wait here so that we don't
need to go through ACT dispatch for chunk + cat altogether
- have a fast return logic for aten.view, as it's a commonly hit op for
view related ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
2023-12-06 19:25:35 +00:00
Joel Schlosser
22704426c3 Expand dynamic dims support for traceable subclasses (#114311)
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).

Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
    * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
    * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
    * Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
    * Signatures now:
    ```python
    # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
    # ctx is anything useful for rebuilding the class we want to guard on
    attrs, ctx = x.__tensor_flatten__()
    ...
    # inner_tensors is a dict of {attr -> tensor}
    # ctx is taken unmodified from flattening and (eventually) guarded on
    # outer_size is the expected size of the output; possibly symbolic
    # outer_stride is the expected strides of the output; possibly symbolic
    y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

    # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
    # the assert simplifies symbols when there are relationships between outer and inner symbols
    ```
    * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
    * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
    * Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-12-05 21:09:25 +00:00
PyTorch MergeBot
4534cf102a Revert "[funcol] a few optimizations to funcol (#113324)"
This reverts commit 7117bffff9.

Reverted https://github.com/pytorch/pytorch/pull/113324 on behalf of https://github.com/huydhn due to Sorry for reverting your change here, but it is failing internal test ([comment](https://github.com/pytorch/pytorch/pull/113324#issuecomment-1813317913))
2023-11-15 21:53:23 +00:00
Wanchao Liang
7117bffff9 [funcol] a few optimizations to funcol (#113324)
Apply a few optimizations to funcol:

- allgather on non-0 dim: the resulting tensor already needs to access
data in order to do torch.cat, so we sync wait here so that we don't
need to go through ACT dispatch for chunk + cat altogether
- have a fast return logic for aten.view, as it's a commonly hit op for
view related ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
ghstack dependencies: #113323
2023-11-14 09:28:09 +00:00
Wanchao Liang
b16e3b5373 [funcol] add two APIs: wait() and numpy() (#113323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113323
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/wconstab
2023-11-14 09:27:45 +00:00
PyTorch MergeBot
23e0923c74 Revert "[pytree] reorganize submodule structure for C++ and Python pytree (#112278)"
This reverts commit eeeb40b327.

Reverted https://github.com/pytorch/pytorch/pull/112278 on behalf of https://github.com/PaliC due to Reverting this pr as the one under it in the stack is causing regressions in torchrec ([comment](https://github.com/pytorch/pytorch/pull/112278#issuecomment-1806044435))
2023-11-10 16:30:36 +00:00
Xuehai Pan
eeeb40b327 [pytree] reorganize submodule structure for C++ and Python pytree (#112278)
Reorganized the two C++ and Python pytree submodules into a subpackage. I think this would make it easier to implement the abstract `PyTreeAPI` class with two implementations. It would also make it much easier for the user to switch between the two implementations.

Before:

```text
torch
├── utils
│   ├── _pytree.py
│   ├── _cxx_pytree.py
│   ...
...
```

After:

```text
torch
├── utils
│   ├── _pytree
│   │   ├── __init__.py
│   │   └── api
│   │       ├── __init__.py
│   │       ├── cxx.py
│   │       └── python.py
│   ...
...
```

The `torch.utils._pytree` module will import all APIs from `torch.utils._pytree.api.python`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112278
Approved by: https://github.com/zou3519
ghstack dependencies: #112111
2023-11-10 05:41:32 +00:00
PyTorch MergeBot
bf452dcde6 Revert "[pytree] reorganize submodule structure for C++ and Python pytree (#112278)"
This reverts commit fa895da968.

Reverted https://github.com/pytorch/pytorch/pull/112278 on behalf of https://github.com/PaliC due to in the bottom diff in the stack changing _register_pytree_node's signature is bc breaking, please revert the signature and reland ([comment](https://github.com/pytorch/pytorch/pull/112278#issuecomment-1804870560))
2023-11-10 00:12:52 +00:00
Lucas Pasqualin
1d56e7b5af Adds broadcast to functional collectives (#112668)
Adds `broadcast` to functional collectives, including inductor support.

Test with `python test_inductor_collectives.py -- TestCollectivesMultiProc.test_broadcast_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112668
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2023-11-09 15:47:52 +00:00
Yifu Wang
625958d8bc Inductor support for native c10d_functional (#112439)
This PR adds Inductor support for [native c10d_functional ops](https://github.com/pytorch/pytorch/pull/110570).

The Inductor IRs introduced in this PR will replace the existing `CollectiveKernel` IR hierarchy. Compared to the existing collective IRs, the new IRs:
- Are target language agnostic and support AOTInductor.
- Express the constraints solely with read/write deps. This maximizes the potential for buffer reuse.
- Address an issue where out-of-place collective's input buffers could be mutated while being volatilely read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112439
Approved by: https://github.com/Chillee
2023-11-08 23:40:21 +00:00
Xuehai Pan
fa895da968 [pytree] reorganize submodule structure for C++ and Python pytree (#112278)
Reorganized the two C++ and Python pytree submodules into a subpackage. I think this would make it easier to implement the abstract `PyTreeAPI` class with two implementations. It would also make it much easier for the user to switch between the two implementations.

Before:

```text
torch
├── utils
│   ├── _pytree.py
│   ├── _cxx_pytree.py
│   ...
...
```

After:

```text
torch
├── utils
│   ├── _pytree
│   │   ├── __init__.py
│   │   └── api
│   │       ├── __init__.py
│   │       ├── cxx.py
│   │       └── python.py
│   ...
...
```

The `torch.utils._pytree` module will import all APIs from `torch.utils._pytree.api.python`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112278
Approved by: https://github.com/zou3519
ghstack dependencies: #112111
2023-11-08 06:05:39 +00:00
rzou
a06832f911 Grandfather in c10d_functional ops to pt2_compliant (#113049)
This PR also adds the ability to specify Tags for more `m.def(`
overloads.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113049
Approved by: https://github.com/williamwen42
2023-11-07 12:55:05 +00:00