@wanchaol was seeing the loss eventually become NaN when compiling individual transformer blocks in torchtitan - with this patch I no longer see the NaN loss.
The problem is the following:
(1) It is possible to have graph inputs to a compiled region that are AsyncCollectiveTensors. In particular, when we compile individual transformer blocks in the Llama model, the first layer (the embedding layer) runs in eager mode and outputs an AsyncCollectiveTensor that is fed to the first transformer block.
(2) ideally, we would like that AsyncCollectiveTensor graph input to desugar into a `wait_tensor()` op that shows up at the beginning of the graph.
(3) the way this is supposed to happen is: AOTAutograd traces through the __torch_dispatch__ of AsyncCollectiveTensor, tracing out a `wait_tensor()` call before dispatching to any of the other ops in the function we are tracing
(4) however: `trigger_wait()` was getting called in a way where we would ignore its output (and return `self.elem` directly), which would cause the `wait_tensor` ops to get DCE'd.
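The failure mode can be sketched in plain Python (names like `FakeAsyncTensor` and the stand-in `wait_tensor` below are illustrative, not the real implementation):

```python
def wait_tensor(t):
    # Stand-in for torch.ops._c10d_functional.wait_tensor; a tracer
    # records this call as a graph node.
    return t

class FakeAsyncTensor:
    def __init__(self, elem):
        self.elem = elem        # result of the collective, not yet waited on
        self.completed = False

    def trigger_wait(self):
        waited = wait_tensor(self.elem)
        self.completed = True
        return waited

# Buggy pattern: the wait result is discarded, so when traced, the
# wait_tensor node has no users and gets dead-code-eliminated.
def unwrap_buggy(act):
    act.trigger_wait()
    return act.elem

# Fixed pattern: return the output of trigger_wait(), keeping the
# wait_tensor node alive in the traced graph.
def unwrap_fixed(act):
    return act.trigger_wait()
```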
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125677
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #125676
This PR removes the legacy impls of c10d_functional ops, which are now irrelevant. For backward compatibility purposes, c10d_functional ops now call into _c10d_functional ops.
We also changed c10d_functional ops to be CompositeExplicitAutograd, so that when traced, only _c10d_functional ops appear in the graph. After this, we'll be able to remove the Inductor IR for the legacy functional collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124979
Approved by: https://github.com/wanchaol
## Summary
After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that run on both native funcol and legacy funcol through the Python API to only run with native funcol.
## Test Changes
`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)
`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests previously ran with both native funcol and legacy funcol. Changed them to run only with native funcol.
`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777
Approved by: https://github.com/wanchaol
ghstack dependencies: #123776
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs.
This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions.
This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering.
To avoid cudaStream issues, this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can potentially be alleviated using `compile`.
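The reason the backward of all_to_all_single is itself an all_to_all can be illustrated without torch; this toy model over nested lists (an assumption for illustration, not the real tensor-based signature) shows that swapping sender/receiver roles routes each gradient chunk back to the rank that produced the corresponding input:

```python
def all_to_all_single(shards):
    # Toy model: shards[rank][peer] = data that `rank` sends to `peer`.
    # After the exchange, out[rank][peer] = data `rank` received from `peer`.
    world = len(shards)
    return [[shards[src][dst] for src in range(world)] for dst in range(world)]

def all_to_all_backward(grad_shards):
    # The backward is the same exchange with sender and receiver swapped,
    # which is exactly another all_to_all on the gradient shards.
    return all_to_all_single(grad_shards)
```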
Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py
Test plan:
```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
Fixes https://github.com/pytorch/pytorch/issues/122404
Previously, when rewriting c10d collectives, if the group argument was
unspecified or None, we created a world pg variable out of thin air and
passed it to the rewrite target. This approach was problematic, as it
assumed the symbol `torch` was available in the scope (see #122404).
After #120560, dynamo can now trace dist.group.WORLD. If the group
argument is unspecified, we can just set it with dist.group.WORLD in the
rewrite target.
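In spirit, the fix moves default-group resolution into the rewrite target itself; a hedged stand-in sketch (here `WORLD` is a placeholder for `dist.group.WORLD`, and `rewrite_allreduce` is a hypothetical name, not the actual rewrite target):

```python
WORLD = object()  # placeholder for dist.group.WORLD, now traceable by dynamo

def rewrite_allreduce(tensor, group=None):
    if group is None:
        # Previously the rewrite synthesized a world pg variable in the
        # caller's scope, which assumed `torch` was importable there.
        group = WORLD
    return tensor, group
```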
Testing
pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce
Also verified with the repro provided in #122404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561
Approved by: https://github.com/wconstab
ghstack dependencies: #120560
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
Summary: In non-strict mode of torch.export(), we didn't set the various `is_compiling()` flags to `True`, which is needed by some models.
Test Plan: Unit tests and manual testing.
Differential Revision: D53624452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
There are some tests in this file that are impl specific, e.g. verifying generated code via `FileCheck`. These tests are covered for native funcol in test_c10d_functional_native.py, therefore marking them with `@run_with_legacy_funcol`.
Other tests are marked with `@run_with_both_funcol_impls`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120025
Approved by: https://github.com/wanchaol
ghstack dependencies: #119982
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.
run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).
run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.
run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```
This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.
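A hedged sketch of what `run_with_both_funcol_impls_with_arg` might look like (the real helper also toggles the funcol feature flag between runs; the details here are assumptions):

```python
import functools

def run_with_both_funcol_impls_with_arg(test_fn):
    # Parametrize a test to run under both implementations, passing
    # `enable_native_funcol` so impl-specific checks can branch on it.
    @functools.wraps(test_fn)
    def wrapper(self, *args, **kwargs):
        for enable_native_funcol in (False, True):
            # A real implementation would also flip the funcol feature
            # flag / env var here before each run.
            test_fn(self, *args, enable_native_funcol=enable_native_funcol, **kwargs)
    return wrapper
```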
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
### Summary
@LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol.
- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with the Python funcol broadcast and `AsyncCollectiveTensor`
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
### Summary
Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to running them with it disabled.
All tests except `test_tp_compile_comm_reordering` pass. This is expected: the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. This test is disabled for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.
### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.
NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:
- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.
Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.
==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.
The original import causes some issues when compiling DDP with compiled_autograd. The root cause of compilation failure is not identified but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.
Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).
Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
* Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
* Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
* Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
* Signatures now:
```python
# attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
# ctx is anything useful for rebuilding the class we want to guard on
attrs, ctx = x.__tensor_flatten__()
...
# inner_tensors is a dict of {attr -> tensor}
# ctx is taken unmodified from flattening and (eventually) guarded on
# outer_size is the expected size of the output; possibly symbolic
# outer_stride is the expected strides of the output; possibly symbolic
y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)
# at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
# the assert simplifies symbols when there are relationships between outer and inner symbols
```
* Size info is needed for `NestedTensor` at least, and stride info for `DTensor` at least
* Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
* Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors
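A minimal, torch-free skeleton of the updated contract (plain Python objects stand in for tensors; the attribute names are illustrative):

```python
class MySubclass:
    def __init__(self, inner, metadata):
        self.inner = inner          # the single inner "tensor"
        self.metadata = metadata    # anything needed to rebuild / guard on

    def __tensor_flatten__(self):
        # attrs: names of inner-tensor attributes; ctx: rebuild metadata
        return ["inner"], self.metadata

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride):
        # outer_size / outer_stride are the (possibly symbolic) shape PT2
        # expects; the call site asserts the returned value matches them.
        return MySubclass(inner_tensors["inner"], ctx)
```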
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
Apply a few optimizations to funcol:
- For allgather on a non-0 dim, the resulting tensor already needs to access
the data in order to do torch.cat, so we wait synchronously here so that we
don't need to go through ACT dispatch for the chunk + cat altogether
- Added a fast-return path to aten.view, as it's a commonly hit op among
view-related ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
ghstack dependencies: #113323
Reorganized the two C++ and Python pytree submodules into a subpackage. This should make it easier to implement the abstract `PyTreeAPI` class with two implementations, and much easier for users to switch between the two implementations.
Before:
```text
torch
├── utils
│ ├── _pytree.py
│ ├── _cxx_pytree.py
│ ...
...
```
After:
```text
torch
├── utils
│ ├── _pytree
│ │ ├── __init__.py
│ │ └── api
│ │ ├── __init__.py
│ │ ├── cxx.py
│ │ └── python.py
│ ...
...
```
The `torch.utils._pytree` module will import all APIs from `torch.utils._pytree.api.python`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112278
Approved by: https://github.com/zou3519
ghstack dependencies: #112111
This PR adds Inductor support for [native c10d_functional ops](https://github.com/pytorch/pytorch/pull/110570).
The Inductor IRs introduced in this PR will replace the existing `CollectiveKernel` IR hierarchy. Compared to the existing collective IRs, the new IRs:
- Are target language agnostic and support AOTInductor.
- Express the constraints solely with read/write deps. This maximizes the potential for buffer reuse.
- Address an issue where out-of-place collective's input buffers could be mutated while being volatilely read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112439
Approved by: https://github.com/Chillee
This PR introduces a native version of c10d_functional ops. The main goal is to add collective support in AOTInductor and allow collective ops to work in multi-threaded native runtimes.
The native version also incorporated API improvements we wished to implement in Python c10d_functional:
- Removed `ranks` and `group_size` from collective op signatures which were proven to be redundant.
- Use tensor storage as opposed to `void*` to resolve in-flight work.
The native process group registration/resolution mechanism is only used for native c10d_functional in this PR. It will become the single source of truth in upcoming PRs.
The upcoming PRs will implement Inductor/AOTInductor support for c10d_functional, after which native c10d_functional will replace Python c10d_functional.
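The storage-keyed work resolution can be sketched as a toy registry in plain Python (the real registry is C++ and keys on actual tensor storage; the names here are assumptions):

```python
class WorkRegistry:
    """Toy model: map a tensor's storage id to its pending async work."""

    def __init__(self):
        self._pending = {}

    def register(self, storage_id, work):
        self._pending[storage_id] = work

    def wait(self, storage_id):
        # wait_tensor(t) resolves in-flight work via t's storage, so
        # views/aliases of t resolve to the same pending work.
        work = self._pending.pop(storage_id, None)
        if work is not None:
            work.wait()
```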
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110570
Approved by: https://github.com/wanchaol
This PR updates DTensor to support torch.compile
Cool stuff: there are some new tests in `test_dtensor.py` that show both the forward and backward graphs that we can send to inductor when running a matmul with DTensors. In particular, for this user code:
```python
def fn(x, y):
    dt = DTensor.from_local(x.reshape(2, 4), mesh, [Shard(0)], run_check=False)
    dt2 = DTensor.from_local(y.reshape(4, 2), mesh, [Shard(1)], run_check=False)
    dt_out = torch.matmul(dt, dt2)
    dt_out_redistribute = dt_out.redistribute(mesh, [Replicate()])
    return dt_out.to_local()
```
We generate the following fw and backward graphs.
Forward graph:
```python
def forward(self, primals_1, primals_2):
    view = torch.ops.aten.view.default(primals_1, [2, 4]); primals_1 = None
    _to_copy = torch.ops.aten._to_copy.default(view, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0)); view = None
    detach = torch.ops.aten.detach.default(_to_copy); _to_copy = None
    detach_1 = torch.ops.aten.detach.default(detach); detach = None
    view_1 = torch.ops.aten.view.default(primals_2, [4, 2]); primals_2 = None
    _to_copy_1 = torch.ops.aten._to_copy.default(view_1, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0)); view_1 = None
    detach_2 = torch.ops.aten.detach.default(_to_copy_1); _to_copy_1 = None
    detach_3 = torch.ops.aten.detach.default(detach_2); detach_2 = None
    detach_4 = torch.ops.aten.detach.default(detach_1)
    all_gather_into_tensor = torch.ops.c10d_functional.all_gather_into_tensor.default(detach_3, 'ptd:0', [0, 1], 2)
    wait_tensor = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor); all_gather_into_tensor = None
    split = torch.ops.aten.split.Tensor(wait_tensor, 4); wait_tensor = None
    getitem = split[0]
    getitem_1 = split[1]; split = None
    cat = torch.ops.aten.cat.default([getitem, getitem_1], 1); getitem = getitem_1 = None
    detach_5 = torch.ops.aten.detach.default(cat); cat = None
    mm = torch.ops.aten.mm.default(detach_4, detach_5); detach_4 = detach_5 = None
    detach_6 = torch.ops.aten.detach.default(mm); mm = None
    detach_9 = torch.ops.aten.detach.default(detach_6); detach_6 = None
    detach_10 = torch.ops.aten.detach.default(detach_9); detach_9 = None
    t = torch.ops.aten.t.default(detach_1); detach_1 = None
    detach_13 = torch.ops.aten.detach.default(t); t = None
    t_1 = torch.ops.aten.t.default(detach_3); detach_3 = None
    detach_15 = torch.ops.aten.detach.default(t_1); t_1 = None
    clone = torch.ops.aten.clone.default(detach_15, memory_format = torch.contiguous_format); detach_15 = None
    return [detach_10, detach_13, clone]
```
Backward graph:
```python
def forward(self, detach_13, clone, tangents_1):
    detach_11 = torch.ops.aten.detach.default(tangents_1); tangents_1 = None
    detach_12 = torch.ops.aten.detach.default(detach_11); detach_11 = None
    mm_1 = torch.ops.aten.mm.default(detach_13, detach_12); detach_13 = None
    detach_14 = torch.ops.aten.detach.default(mm_1); mm_1 = None
    detach_16 = torch.ops.aten.detach.default(detach_12); detach_12 = None
    all_gather_into_tensor_2 = torch.ops.c10d_functional.all_gather_into_tensor.default(clone, 'ptd:0', [0, 1], 2); clone = None
    wait_tensor_2 = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor_2);
    detach_17 = torch.ops.aten.detach.default(wait_tensor_2); wait_tensor_2 = None
    mm_2 = torch.ops.aten.mm.default(detach_16, detach_17); detach_16 = detach_17 = None
    detach_18 = torch.ops.aten.detach.default(mm_2); mm_2 = None
    split_1 = torch.ops.aten.split.Tensor(detach_14, 2, 1); detach_14 = None
    getitem_2 = split_1[0]
    getitem_3 = split_1[1]; split_1 = None
    cat_1 = torch.ops.aten.cat.default([getitem_2, getitem_3]); getitem_2 = getitem_3 = None
    reduce_scatter_tensor = torch.ops.c10d_functional.reduce_scatter_tensor.default(cat_1, 'SUM', 'ptd:0', [0, 1], 2); cat_1 = None
    wait_tensor_3 = torch.ops.c10d_functional.wait_tensor.default(reduce_scatter_tensor); reduce_scatter_tensor = None
    detach_19 = torch.ops.aten.detach.default(wait_tensor_3); wait_tensor_3 = None
    detach_20 = torch.ops.aten.detach.default(detach_19); detach_19 = None
    detach_21 = torch.ops.aten.detach.default(detach_20); detach_20 = None
    detach_22 = torch.ops.aten.detach.default(detach_21); detach_21 = None
    _to_copy_2 = torch.ops.aten._to_copy.default(detach_22, dtype = torch.float32, layout = torch.strided, device = device(type='cpu')); detach_22 = None
    view_2 = torch.ops.aten.view.default(_to_copy_2, [8]); _to_copy_2 = None
    detach_23 = torch.ops.aten.detach.default(detach_18); detach_18 = None
    detach_24 = torch.ops.aten.detach.default(detach_23); detach_23 = None
    _to_copy_3 = torch.ops.aten._to_copy.default(detach_24, dtype = torch.float32, layout = torch.strided, device = device(type='cpu')); detach_24 = None
    view_3 = torch.ops.aten.view.default(_to_copy_3, [8]); _to_copy_3 = None
    return [view_3, view_2]
```
Some of the stuff in this graph looks kind of silly though (e.g. an unnecessary split() + cat(), and all the extra detach() calls).
Stuff that's broken:
- functionalization is pretty horribly broken. In particular, the original strategy I used in this stack was to have functionalization run **above** subclass desugaring. But that doesn't play well with the way we want to compile DTensor. DTensor has a few APIs like `.redistribute()`, `.to_local()`, and the `DTensor()` constructor that we want to put directly into the graph so that we can compile them (e.g. redistribute() will desugar into collective ops). Doing this requires functionalization to run **underneath** the subclass though. I hacked around this for now, by forcing these functions to run functionalization first if they need to.
- the backward test that I have is... wrong. The backward graph that we trace out looks kind of reasonable, but it gives incorrect gradients on one of the two inputs. This needs further debugging (presumably we should be able to stare at the graph and identify which part of it is wrong?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105236
Approved by: https://github.com/wanchaol
optree recently landed and provides quite good perf; this PR conditionally imports optree if it is installed.
Some numbers from testing an MLP layer with TP + functional collectives:
before this PR: 10.390ms
after this PR: 9.189ms
so around a 10% end-to-end CPU overhead reduction
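The conditional-import pattern described above can be sketched as follows (a hedged sketch; the module-level flag name is an assumption, not the one used in the PR):

```python
# Prefer the C-backed optree for tree flatten/unflatten when available,
# falling back to the pure-Python pytree implementation otherwise.
try:
    import optree
    HAS_OPTREE = True
except ImportError:
    HAS_OPTREE = False
```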
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110670
Approved by: https://github.com/fegin
There is already some support for plumbing `__torch_dispatch__` tensor subclasses through dynamo, but this PR beefs it up a bit and adds a test. In particular:
(1) Fakeifying tensor subclasses didn't properly set autograd metadata (requires_grad, is_leaf) on the newly fakeified wrapper subclass. I don't actually have a test for this in this PR, but it's tested pretty heavily later in my aot autograd tests
(2) Fakeifying tensor subclasses didn't properly track source information for dynamic shapes on the inner tensors. I added a new `WrapperSubclassFieldSource` subclass, that represents a source coming from a tensor field on a wrapper subclass, which I use in the fakeifying logic, and again in symbolic_shapes.py to generate proper guards.
(3) `_make_wrapper_subclass()` marginally updated this code to work better with dynamic shapes. One thing that's a bit weird about `_make_wrapper_subclass`: it has two overloads, and the first explicitly does not support dynamic shapes (and the second.. does not support kwargs). I think that later we probably want to consolidate / at least make the first overload work with dynamic shapes, but I didn't want to handle that in this PR (so these smaller changes seemed like a strict improvement).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107415
Approved by: https://github.com/ezyang
As the title says, I was trying to test the functional collectives, and when printing the resulting tensors, sometimes they wouldn't have finished the async operation yet. According to the comments in the file, the "AsyncTensor wrapper applied to returned tensor, which issues wait_tensor() at the time of first use". This is true in most cases, but not when print() is your first use. This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107808
Approved by: https://github.com/fduwjj
We cannot use inner tensors for finalizers, as they remain uncollectable until waited on.
This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffold for us to test code for its waitiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives API's. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.
Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))
AsyncCollectiveTensor shouldn't need to do a synchronization if you try to detach() it though - in fact, it should be fine to avoid synchronizing if you perform any view ops on it (which just require viewing metadata, but not actual data). This PR tries to update `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op.
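A hedged, torch-free sketch of the delayed-wait idea (the op names and the single-method dispatch shape are simplifications of the real `__torch_dispatch__`):

```python
VIEW_OPS = {"detach", "view", "reshape", "transpose"}

def wait_tensor(t):
    return t  # stand-in for the real synchronization op

class AsyncTensorSketch:
    def __init__(self, elem):
        self.elem = elem      # "unsynchronized" result of the collective
        self.waited = False

    def dispatch(self, op_name):
        if op_name in VIEW_OPS and not self.waited:
            # Metadata-only op: propagate without synchronizing, and keep
            # the result wrapped so later data access still triggers wait.
            return AsyncTensorSketch(self.elem)
        # First real data use: synchronize, then run the op.
        self.elem = wait_tensor(self.elem)
        self.waited = True
        return self.elem
```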
Added some light testing, that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab