pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
ankurneog	e248c1d7eb	Update real device in FSDP state_dict_utils (#134994 ) ## Motivation `tensor.device` defaults to cuda for both sharded and non-sharded tensors, so the FSDP unit tests fail with the errors below when CUDA is not available. This change sets the device type from the tensor that was actually created. ``` [rank3] File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical [rank3] sharded_tensor_sd = ref_model.state_dict() [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict [rank3] hook_result = hook(self, destination, prefix, local_metadata) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context [rank3] return func(args, kwargs) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook [rank3] tensor.device, [rank3] File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper [rank3] return arg(args, **kwargs) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__ [rank3] return dispatch(st_instance, func) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch [rank3] return _SHARDED_OPS[func](types, args, kwargs, st._process_group) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper [rank3] return wrapped_func(types, args, kwargs, process_group) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device [rank3] dev = torch.device(torch.cuda.current_device()) [rank3] File 
"/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device [rank3] _lazy_init() [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init [rank3] raise AssertionError("Torch not compiled with CUDA enabled") [rank3] AssertionError: Torch not compiled with CUDA enabled ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994 Approved by: https://github.com/fegin	2024-09-17 04:39:08 +00:00
wz337	408fe41a45	[DSD][EZ] Minor update in _state_dict_utils.py (#136165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165 Approved by: https://github.com/kwen2501 ghstack dependencies: #135725, #135763	2024-09-17 04:32:43 +00:00
fduwjj	a0c7029a75	[c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 ) (#135653 ) We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up that cleans up the options of a ProcessGroup and asks users either to set the timeout or backend later on, or to create the backend directly after creating a PG. Also, PGNCCL was using the option class from ProcessGroup when it should use the Options class from the backend. This PR therefore aligns the types and names with what we do on the cpp side. The signature of the public API is unchanged, so it still uses the argument named "pg_options". The tests are updated to match the change. This tries to reland D62008954 by fixing internal errors. Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653 Approved by: https://github.com/wz337, https://github.com/H-Huang	2024-09-16 19:56:42 +00:00
Aaron Gokaslan	31715be72a	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-16 19:44:11 +00:00
Ke Wen	c977bb7d03	[Distributed] fix FileSystemWriter __init__ (#136135 ) Fixes #135608. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135 Approved by: https://github.com/Skylion007	2024-09-16 19:11:08 +00:00
PyTorch MergeBot	3117f2cf67	Revert "[BE]: Update mypy to 1.11.2 (#133816 )" This reverts commit `55299cfc22`. Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))	2024-09-16 09:11:16 +00:00
Aaron Gokaslan	55299cfc22	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-14 21:40:36 +00:00
CaoE	f96a073c9d	Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 ) Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (see https://github.com/pytorch/pytorch/pull/109281). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232 Approved by: https://github.com/ezyang	2024-09-14 09:53:17 +00:00
Will Feng	5a2be192d1	[Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824 ) During enablement of Traceable FSDP2 on internal models, sometimes the user applies torch.compile to only some of the FSDP2 instances rather than all of them. Such a mixed usage pattern is not supported by compiled autograd. Here we try to catch this usage pattern and throw an error, so that the user can fix the usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824 Approved by: https://github.com/awgu	2024-09-14 06:30:12 +00:00
wz337	0cdc6a8dcd	[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 ) Fix https://github.com/pytorch/pytorch/issues/134095 This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725 Approved by: https://github.com/fegin	2024-09-13 03:26:36 +00:00
PyTorch MergeBot	3e1a4ea132	Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 )" This reverts commit `83c594ebd6`. Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](`83c594ebd6`) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))	2024-09-12 21:47:38 +00:00
wz337	83c594ebd6	[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 ) Fix https://github.com/pytorch/pytorch/issues/134095 This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725 Approved by: https://github.com/fegin	2024-09-12 17:43:57 +00:00
Xilun Wu	de8a8653c0	[dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554 ) Summary 1. This PR removes the public API `compute_local_shape` and replaces its uses with the more general API `compute_local_shape_and_global_offset`. 2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards, which is more aligned with DTensor's semantics on non-participating ranks. Test `pytest test/distributed/_tensor/test_dtensor.py` `pytest test/distributed/_tensor/test_init.py` `pytest test/distributed/_tensor/test_tensor_ops.py` Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-09-12 06:30:09 +00:00
Wei Feng	d270e2d240	[FSDP2] better error msg for cpu offloading (#135156 ) When cpu offloading is enabled and the user loads a gpu state dict, FSDP2 throws a less obvious error at backward: ``` RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device ``` This PR throws the error more explicitly by specifying which parameters should be moved because of cpu offloading: ``` FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight'] ``` `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156 Approved by: https://github.com/awgu	2024-09-12 00:05:07 +00:00
Will Feng	94d2471d1f	[Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730 ) Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics (i.e. if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copied into via copy_; whereas if we just swap out the unsharded_param storage via set_, that user-saved alias will not get updated, which is not good). This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern. ------ Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_` - `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries` - `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients` - `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching` - `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager` - `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32` - `python test/distributed/test_inductor_collectives.py 
TestCollectivesInductor.test_backwards` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730 Approved by: https://github.com/bdhirsh	2024-09-11 23:01:05 +00:00
Sathyanarayanan Saravanamuthu	34dc8f69a1	Adding entry-point based support for out-of-tree rendezvous plugins (#132633 ) Fixes #127519 Currently in torchrun rendezvous, only two rendezvous backends are supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages. #### AUTHORING NEW PLUGIN Any new plugin will be a Python package exposing entry-points. For example, the structure of a redis plugin is as follows: ``` plugin_root |_ pyproject.toml |_ src |_ redis |_ __init__.py |_ redis_store.py |_ redis_backend.py ``` The contents of the `pyproject.toml` should indicate that it exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows: ``` [project] name = "redis" version = "0.0.1" [project.entry-points.'torchrun.plugins'] redis = 'redis' ``` The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows: ``` def getPluginHandler(): def _create_redis_handler(params: RendezvousParameters): from redis_rendezvous_backend import create_backend backend, store = create_backend(params) return create_handler(store, backend, params) return _create_redis_handler ``` The files `redis_store` and `redis_backend` contain the implementation of [Store](`41189b0da4/torch/_C/_distributed_c10d.pyi (L171)`) and [RendezvousBackend](`e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)`) respectively. #### USER EXPERIENCE Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin on the local file system can be installed using `pip3 install -e <plugin-location>`. 
Once installed, the new backend can be used in torchrun as follows: ``` torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633 Approved by: https://github.com/fduwjj	2024-09-11 03:35:02 +00:00
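The entry-point mechanism described in the commit above can be illustrated with the standard library alone. The following is a minimal sketch of how a host program might look up packages registered under the `torchrun.plugins` group; it is not torchrun's actual discovery code, and the function name `discover_plugins` is hypothetical.

```python
from importlib.metadata import entry_points


def discover_plugins(group: str = "torchrun.plugins") -> dict:
    """Map entry-point names to (unloaded) entry points registered under `group`.

    Hypothetical helper for illustration only; torchrun's real lookup may differ.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports select(group=...).
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}
```

Calling `ep.load()` on a returned entry point imports the object it refers to; with the `pyproject.toml` above installed, the `redis` entry would resolve to the `redis` package.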
Andrew Gu	c932b39739	[FSDP2] Added `_set_unshard_async_op` (#135523 ) This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation. If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute. Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523 Approved by: https://github.com/weifengpy	2024-09-10 19:28:02 +00:00
Chien-Chin Huang	1d9fefff19	[DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535 ) Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issue. fixes: https://github.com/pytorch/pytorch/issues/133415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535 Approved by: https://github.com/wz337	2024-09-10 03:10:00 +00:00
Chien-Chin Huang	21241bfeee	[CP] Extend CP to support load-balancing shards (#132442 ) This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets `rank` and `(world_size * 2 - rank - 1)` shards. The data re-shuffling is done in the `context_parallel` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442 Approved by: https://github.com/wconstab	2024-09-09 18:04:38 +00:00
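The shard assignment stated in the commit above can be written out directly. This is a minimal sketch of the index arithmetic only, not the `context_parallel` implementation; the function names are hypothetical, and the work estimate assumes causal attention, where query shard `i` attends to key shards `0..i`.

```python
from typing import Tuple


def load_balanced_shards(rank: int, world_size: int) -> Tuple[int, int]:
    """Return the two shard indices held by `rank` out of 2 * world_size shards."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return rank, 2 * world_size - rank - 1


def causal_work(rank: int, world_size: int) -> int:
    """Count key shards attended to by this rank's two query shards.

    Assumption for illustration: causal masking, so shard i attends to i + 1 shards.
    """
    first, second = load_balanced_shards(rank, world_size)
    return (first + 1) + (second + 1)
```

Pairing an early shard with a late one gives every rank the same count under that assumption, `2 * world_size + 1`, which is the balancing effect the commit title refers to.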
Wanchao Liang	cfc227ad43	[reland][dtensor] move DTensor to public namespace (#134203 ) reland of https://github.com/pytorch/pytorch/pull/133113 I have to create a new PR because the previous reverted PR could neither be rebased nor imported successfully :( ---- Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (adding in the next PRs) * To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module The BC preservation is evidenced by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203 Approved by: https://github.com/tianyu-l	2024-09-08 17:08:40 +00:00
Tristan Rice	196748d491	[elastic] support local_addr across all rendezvous impls (#135262 ) Summary: There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used. This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests. Test Plan: Added tests of `local_addr` for the common rendezvous implementations to prevent future regressions. ``` buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3 ``` To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism. Differential Revision: D62256407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262 Approved by: https://github.com/fduwjj, https://github.com/wz337	2024-09-06 17:55:43 +00:00
Yiwen Shi	3a9e33dca8	[torchelastic] Don't do signal handling when off the main thread (#135088 ) Summary: In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error: > "ValueError('signal only works in main thread of the main interpreter')" To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling. Test Plan: Before this change, MAST job failed: https://fburl.com/mlhub/iq2m10v8 With this change, MAST job succeeded: https://fburl.com/mlhub/q6kb8343 Differential Revision: D62166943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088 Approved by: https://github.com/d4l3k	2024-09-06 14:47:03 +00:00
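The main-thread check described in the commit above can be sketched with the standard `signal` and `threading` modules. This is an illustration of the general technique, not the torchelastic code; the helper name `install_sigterm_handler` is hypothetical.

```python
import signal
import threading


def install_sigterm_handler(handler) -> bool:
    """Install `handler` for SIGTERM only when called from the main thread.

    Returns True if the handler was installed and False if installation was
    skipped. Calling signal.signal() off the main thread raises ValueError,
    so the thread is checked first.
    """
    if threading.current_thread() is not threading.main_thread():
        return False
    signal.signal(signal.SIGTERM, handler)
    return True
```

A caller running in a worker thread gets `False` back and can continue without signal handling instead of failing with the `ValueError` quoted in the commit message.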
wz337	67f98a99a4	[DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271 Approved by: https://github.com/fduwjj	2024-09-06 06:23:20 +00:00
wz337	c83cdf068b	[DTensor] Fix view op replicating on tensor dim when the size of the tensor dim = 1 (#135054 ) We found a corner case that when a tensor dimension is 1, calling `view(1)` would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly-shardable across the mesh dimension, it won't cause an implicit replication behind the scenes if view doesn't change the size of the given tensor dimension (see case 2 and 3). When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518 ``` # uneven case where the size of the tensor dimension to shard is 1 p = torch.randn(1,2) mesh = init_device_mesh("cuda", (2,)) dtensor = distribute_tensor(p, mesh, [Shard(0)]) t = dtensor.view(1, 2) # this would result in replication, meaning t is now replicated across all ranks. # uneven case where the size of the tensor dimension to shard is not 1 p = torch.randn(3, 2) mesh = init_device_mesh("cuda", (2,)) dtensor = distribute_tensor(p, mesh, [Shard(0)]) t = dtensor.view(3, 2) # this would not result in replication, meaning t stays as sharded. # even case p = torch.randn(2,2) dtensor = distribute_tensor(p, mesh, [Shard(0)]) t = dtensor.view(2, 2) # this would not result in replication, meaning t stays as sharded. ``` Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054 Approved by: https://github.com/tianyu-l, https://github.com/wanchaol	2024-09-06 00:03:54 +00:00
mori360	b1f72e2984	Gradient scaler for DTensor (#132816 ) Addresses the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798). Enables DTensor input in the gradient scaler's APIs, especially `.unscale_()`. A related dispatch strategy is added to accept DTensor input. To let found_inf be reduced across devices, we add an allreduce at dispatch on the args, after the dispatch strategy and kernel. Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale as the arg[0] will be modified in place, so redesigning a strategy or refactoring the kernel would not help. The test files cover 4 checks under the 1-d (dp) and 2-d (dp, tp) cases: 1. whether the non-inf values are unscaled 2. whether all DTensors on each device can find inf even when it is not on their device 3. if inf is not found, whether new parameters are generated 4. if inf is found, whether the scale is updated Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816 Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol	2024-09-05 16:44:32 +00:00
Will Feng	8fb1281db9	[Traceable FSDP2] Skip _backward_prefetch under compile, and rely on compiler pass to have prefetching (#135163 ) Before this PR, when traceable FSDP2 + AC is run, an error would be thrown: ``` File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem return args[0].call_method(tx, "__getitem__", args[1:], kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method return super().call_method(tx, name, args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method return super().call_method(tx, name, args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method return self.getitem_const(tx, value) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const return self.items[index] Error: Index out of bound from user code: File "<eval_with_key>.5", line 105, in forward aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34); aot0_tangents_1 = None File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke return _trace_wrapped_op(args, dyn_kwargs, kwargs) File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state return getattr(bw_state, hook_name)(args, **kwargs) File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward self._fsdp_param_group.pre_backward(default_prefetch) File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward self._backward_prefetch() File 
"/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index] ``` Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and have a compiler pass (in future PR) to achieve the equivalent prefetching overlap. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163 Approved by: https://github.com/awgu	2024-09-05 03:32:04 +00:00
Wei Feng	724faac260	[FSDP] casting input args with dataclass(frozen=True) (#135067 ) resolve: https://github.com/pytorch/pytorch/pull/135029 When mixed precision is enabled, FSDP casts input args to the desired dtype by calling `_apply_to_tensors`. When the input args contain a `dataclass(frozen=True)` instance, we hit the following runtime error because `_apply_to_tensors` uses `setattr`: `dataclasses.FrozenInstanceError: cannot assign to field 'some_key'`. The fix is to use the dataclasses API `dataclasses.replace`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135067 Approved by: https://github.com/awgu	2024-09-05 01:19:53 +00:00
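The difference between `setattr` and `dataclasses.replace` on a frozen dataclass can be shown without PyTorch. The sketch below is not FSDP's `_apply_to_tensors`; it uses plain floats as a stand-in for tensors, and the names `apply_to_floats` and `Batch` are hypothetical.

```python
import dataclasses
from typing import Any, Callable


def apply_to_floats(fn: Callable[[float], float], obj: Any) -> Any:
    """Return a copy of `obj` with `fn` applied to every float found inside it."""
    if isinstance(obj, float):
        return fn(obj)
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        # Build a new instance instead of calling setattr, which raises
        # FrozenInstanceError on dataclass(frozen=True) objects.
        changes = {
            f.name: apply_to_floats(fn, getattr(obj, f.name))
            for f in dataclasses.fields(obj)
        }
        return dataclasses.replace(obj, **changes)
    if isinstance(obj, list):
        return [apply_to_floats(fn, item) for item in obj]
    if isinstance(obj, dict):
        return {key: apply_to_floats(fn, value) for key, value in obj.items()}
    return obj


@dataclasses.dataclass(frozen=True)
class Batch:
    some_key: float
    label: str
```

Because `dataclasses.replace` constructs a new object, the original frozen instance is left untouched. Note that `replace` rejects fields declared with `init=False`, which this sketch does not handle.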
Howard Huang	b3ef0c99f5	[PP] Fix zero bubble composability with DP (#134052 ) Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers. FSDP was not working with zero bubble PP because there will be twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients. Fixes the tests: `python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False` `python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052 Approved by: https://github.com/kwen2501	2024-09-04 23:46:29 +00:00
Ke Wen	9810ce9ca7	[PP] Go back to export instead of _export (#134299 ) Reverts https://github.com/pytorch/pytorch/pull/130998 because FakeTensor + real device suffice to work around the autocast issue in HF. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134299 Approved by: https://github.com/lessw2020	2024-09-04 23:25:17 +00:00
Xilun Wu	ed06772e35	[TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062 ) Summary Extend the warning message to be more self-explanatory Pull Request resolved: https://github.com/pytorch/pytorch/pull/135062 Approved by: https://github.com/shuqiangzhang	2024-09-04 22:05:51 +00:00
Saurabh Mishra	dd7cd182ab	[AIInfra][DCP] All gather keys checkpoint utils bug fix (#135045 ) Summary: All gather keys checkpoint utils bug fix. `dist.get_world_size` should have the process group passed in to avoid an inconsistent world size in case the process group has changed. This is common in the tests. Test Plan: UTs Reviewed By: Saiteja64 Differential Revision: D61578832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135045 Approved by: https://github.com/MeetVadakkanchery, https://github.com/LucasLLC	2024-09-04 18:49:34 +00:00
CK Luk	ffd1e214df	Back out "[FSDP2] Set `ctx.set_materialize_grads(False)` for post-backward (#133498 )" (#135059 ) Summary: Original commit changeset: 96513cbc425f Original Phabricator Diff: D61291210 There is some evidence that FB-FM-v4 has better NE with Set ctx.set_materialize_grads(False), especially when pairing up with prefetching. See https://www.internalfb.com/intern/anp/view/?id=5732259 Test Plan: export NUM_WORKERS=128 export BATCH_SIZE=1024 export CONFIG_FILE="mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2.yaml" export ENTITLEMENT=ads_global_tc_2k_training_large_short buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -c fbcode.platform010_cuda_version=12 -c hpc_comms.use_nccl=2.17.1 -- mode=${CONFIG_FILE} launcher.tags='[ads_ranking_taxonomy_monetization_genai]' launcher.data_project=pytorch_at_scale launcher.max_retries=10 launcher.fbl_entitlement=${ENTITLEMENT} launcher.oncall=pytorch_training_enablement launcher.hardware=GRANDTETON launcher.num_workers=${NUM_WORKERS} data_loader.dataset.batch_size=${BATCH_SIZE} training.planner.proposer=dynamic_col_dim training.planner.proposer.optim_target=hbm 2>&1 | tee ~/tmp/log.mast Differential Revision: D62009163 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135059 Approved by: https://github.com/awgu	2024-09-04 04:50:32 +00:00
Xilun Wu	e7731b3f8a	[TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882 ) D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via 1) explicit argument passing in user code when instantiating `MastRendezvousHandler` 2) passing the `--use_libuv` command line argument to `torchrun`. The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think it's better for torch elastic not to know the TCPStore backend type and instead to rely on `c10d`'s mechanism to decide which backend to use for torch elastic as well. As a result, the TCPStore backend type used by torch elastic will be identical to that in pytorch. PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type: when `USE_LIBUV="0"`, the non-libuv backend will be used; when `USE_LIBUV="1"`, the libuv backend will be used, and this is the default option. Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882 Approved by: https://github.com/shuqiangzhang	2024-09-03 19:43:21 +00:00
PyTorch MergeBot	351ba3e67c	Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 )" This reverts commit `65864d0134`. Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))	2024-08-30 16:27:40 +00:00
Xilun Wu	a645a18d2e	[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509 ) Summary reland of https://github.com/pytorch/pytorch/pull/134294 Fixes #131446 Fixes #126852 Fixes #126868 Fixes #126493 The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294. Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green. See the error message below: ``` 2024-08-24T13:42:01.3228990Z ==================================== RERUNS ==================================== 2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3229710Z Unexpected success 2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3230407Z Unexpected success 2024-08-24T13:42:01.3230594Z =================================== FAILURES =================================== 2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3231296Z Unexpected success ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-08-30 02:13:45 +00:00
fduwjj	65864d0134	[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 ) We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the option of a ProcessGroup and asks users to either set the timeout or backend later on, or to directly create the backend after creating a PG. Also, PGNCCL is using the option class from ProcessGroup, but we should actually use the Option from the backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature for the public API, so they still use args named "pg_options". We need to make changes to the test to make it aligned with the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931 Approved by: https://github.com/H-Huang	2024-08-29 22:40:12 +00:00
PyTorch MergeBot	ab646cd805	Revert "[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509 )" This reverts commit `ba5aec88c6`. Reverted https://github.com/pytorch/pytorch/pull/134509 on behalf of https://github.com/ZainRizvi due to Sorry but this fails internally. For details see D61953754 ([comment](https://github.com/pytorch/pytorch/pull/134509#issuecomment-2318323161))	2024-08-29 16:39:19 +00:00
wz337	cfb642bb6b	[DTensor] Extend implicit replication to replicate DTensor for foreach ops so model doesn't have to be fully tp-ed when using 2D (#134551 ) Fixes [134212](https://github.com/pytorch/pytorch/issues/134212) Currently, when we use 2D FSDP with TP, `optimizer.step()` would fail if the model were not fully tensor parallelized. If we don't have the entire model tensor parallelized when doing 2D, we would have both 1D and 2D DTensor parameters. As foreach is turned on by default, `optimizer.step()` would fail as cross mesh op is not allowed. Error as follows: ``` NotImplementedError: aten._foreach_mul_.Scalar: DTensor does not support cross-mesh operation yet!Got meshes: DeviceMesh('cuda', [[0, 1], [2, 3]], mesh_dim_names=('dp', 'tp')) DeviceMesh('cuda', [1, 3], mesh_dim_names=('dp',)) ``` In this PR, we extend implicit_replication to replicate DTensor in missing dimensions for foreach ops. If users don't want to fully tensor parallelize the model when using 2D, they have the option of using the `implicit_replication()` context manager for `optimizer.step()`. In this case, we would swap out the 1D DTensorSpec and replace it with 2D DTensorSpec. However, we don't want to turn this on by default yet, as we want the users to be aware that the tp dimension is replicated if a layer is not tp-ed. With implicit replication turned on, replicating the DTensor spec in the missing dimension works for most foreach cases, except when the first DTensor in the list is one that also needs to be replicated. This is currently a limitation for which I don't have a good solution yet. With this change, we can handle most cases; the exception is when the first DTensor's ndim is not the largest. ``` [2D_DTensor, 1D_DTensor...] ---> Implicit_replication() can handle this. [1D_DTensor, 2D_DTensor...] ---> Implicit_replication() can't handle this. 
``` This change doesn't affect the existing default behavior, as `implicit_replication()` is not turned on by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134551 Approved by: https://github.com/tianyu-l	2024-08-29 09:01:31 +00:00
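The limitation stated in the commit above (the first DTensor in the foreach list must have the largest ndim) can be modelled with plain integers. This is a minimal sketch under that reading of the commit message, not DTensor code: the function name `implicit_replication_can_handle` is hypothetical, and each list entry stands in for the mesh dimensionality of one DTensor argument.

```python
from typing import Sequence


def implicit_replication_can_handle(mesh_ndims: Sequence[int]) -> bool:
    """Hypothetical check: True when the first entry has the largest ndim.

    Each integer stands in for the ndim of one DTensor in the foreach list.
    """
    if not mesh_ndims:
        # Nothing to replicate in an empty list (an assumption of this sketch).
        return True
    return mesh_ndims[0] == max(mesh_ndims)
```

For the two cases listed in the commit message, `implicit_replication_can_handle([2, 1])` corresponds to `[2D_DTensor, 1D_DTensor...]` and `implicit_replication_can_handle([1, 2])` to `[1D_DTensor, 2D_DTensor...]`.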
Will Feng	578b8d75e5	[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539 ) The previous PR https://github.com/pytorch/pytorch/pull/133532 caused stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539 Approved by: https://github.com/ckluk2, https://github.com/yanboliang	2024-08-29 06:28:16 +00:00
wz337	b0a6d9ad27	[DTensor] Add pointwise ops strategy for aten.isinf, aten.isneginf, aten.isposinf (#134699 ) Fixes #ISSUE_NUMBER Need it for https://github.com/facebookresearch/optimizers/blob/main/distributed_shampoo/utils/shampoo_preconditioner_list.py#L671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134699 Approved by: https://github.com/tianyu-l	2024-08-29 06:01:12 +00:00
PyTorch MergeBot	25531eb735	Revert "[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539 )" This reverts commit `26e392132d`. Reverted https://github.com/pytorch/pytorch/pull/134539 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134539#issuecomment-2316568257))	2024-08-29 01:59:02 +00:00
Sanket Purandare	de35d3062f	Runtime Estimator for estimating GPU compute time (#134243 ) This PR adds a basic Runtime Estimator for single-device models. It estimates the GPU runtime in milliseconds using various estimation methods under the ``FakeTensorMode``. It provides a ``TorchDispatchMode`` based context manager that can estimate the eager runtime of PyTorch functions. It supports two estimation modes, benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`). For modules executed under this context manager, it aggregates the forward and backward operation runtimes and records their execution orders. ``` import torch from torch import nn, optim from torch._subclasses.fake_tensor import FakeTensorMode from torch.distributed._tools.runtime_estimator import RuntimeEstimator from torch.testing._internal.distributed._tensor.common_dtensor import ( ModelArgs, Transformer, ) if __name__ == "__main__": def _train_step( model: nn.Module, optimizer: optim.Optimizer, inp: torch.Tensor, ): out = model(inp) loss = out.sum() loss.backward() optimizer.step() optimizer.zero_grad() dev = torch.cuda.current_device() vocab_size = 8192 bsz, seq_len = 32, 1024 model_args = ModelArgs( n_layers=4, n_heads=12, vocab_size=vocab_size, max_seq_len=seq_len, dim=768, dropout_p=0.1, ) runtime_estimator = RuntimeEstimator() with FakeTensorMode(): with torch.device(dev): model = Transformer(model_args) optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True) inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev) with runtime_estimator("operator-level-benchmark"): _train_step(model, optimizer, inp) with runtime_estimator("operator-level-cost-model"): _train_step(model, optimizer, inp) # Actual model runtime with torch.device(dev): model = Transformer(model_args) optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True) inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), 
device=dev) warmup_iters, actual_iters = 2, 5 start_event = torch.cuda.Event(enable_timing=True) end_event = torch.cuda.Event(enable_timing=True) for _ in range(warmup_iters): _train_step(model, optimizer, inp) start_event.record() for _ in range(actual_iters): _train_step(model, optimizer, inp) end_event.record() torch.cuda.synchronize() measured_time = start_event.elapsed_time(end_event) / actual_iters print(f"Actual total_time: {measured_time:.3f} ms") ``` <img width="506" alt="Screenshot 2024-08-26 at 11 27 15 PM" src="https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c"> @weifengpy @xuanzhang816 @gnadathur Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243 Approved by: https://github.com/weifengpy	2024-08-28 20:06:54 +00:00
Andrew Gu	aa31e7019a	[FSDP] Made `clip_grad_norm_` norm compute order deterministic (#134673 ) Fixes https://github.com/pytorch/pytorch/issues/134393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134673 Approved by: https://github.com/weifengpy ghstack dependencies: #134152	2024-08-28 18:44:11 +00:00
Xilun Wu	ba5aec88c6	[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509 ) Summary reland of https://github.com/pytorch/pytorch/pull/134294 Fixes #131446 Fixes #126852 Fixes #126868 Fixes #126493 The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green. See the error message below: ``` 2024-08-24T13:42:01.3228990Z ==================================== RERUNS ==================================== 2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3229710Z Unexpected success 2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3230407Z Unexpected success 2024-08-24T13:42:01.3230594Z =================================== FAILURES =================================== 2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3231296Z Unexpected success ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-08-28 17:51:44 +00:00
Chien-Chin Huang	c7338f457c	[DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 (#134158 ) The original DCP doesn't flatten all the containers, which can cause issues. https://github.com/pytorch/pytorch/pull/125335 intended to solve this by flattening all the dictionaries. Unfortunately, it breaks checkpoints that were saved before 2.4. This also exposes some issues with DCP: 1. DCP should record the version in the metadata. 2. DCP should have a nice way to load an old state_dict. 3. DCP should unflatten all containers (map, list), not just maps. This PR only addresses issue 2 to unblock users. Issues 1 and 3 need to be addressed in the future. @pradeepfn Please let me know if this summary matches our discussion. Fixes https://github.com/pytorch/pytorch/issues/133923 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134158 Approved by: https://github.com/wz337, https://github.com/pradeepfn	2024-08-28 16:31:44 +00:00
PyTorch MergeBot	d52aff3e73	Revert "Adding entry-point based support for out-of-tree rendezvous plugins (#132633 )" This reverts commit `136b19b062`. Reverted https://github.com/pytorch/pytorch/pull/132633 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing internal tests to fail with the error `ImportError: cannot import name '_register_out_of_tree_handlers' from 'torch.distributed.elastic.rendezvous.registry'` ([comment](https://github.com/pytorch/pytorch/pull/132633#issuecomment-2315716201))	2024-08-28 15:49:18 +00:00
Will Feng	26e392132d	[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539 ) The previous PR https://github.com/pytorch/pytorch/pull/133532 caused stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539 Approved by: https://github.com/ckluk2, https://github.com/yanboliang	2024-08-28 08:57:56 +00:00
Xilun Wu	0159ebb654	[dtensor] add test for local_map decorator (#127752 ) Summary This PR is a follow-up of #126924 to address reviewer's comments: 1) add a test case to show the use of `local_map` as a function decorator. 2) simplify the logic of handling different data types of `out_placements`. 3) correct variable naming in test cases to match math formulas. Test see #126924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127752 Approved by: https://github.com/wanchaol	2024-08-27 18:22:23 +00:00
Jessica Vandebon	68b1a09422	Integrate device agnostic APIs in FSDP library [1/n] (#134337 ) Summary: For MTIA FSDP support, we need to ensure the FSDP library code handles accelerator devices not limited to CUDA. Test Plan: CI Reviewed By: hanzlfs Differential Revision: D60587415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134337 Approved by: https://github.com/LucasLLC, https://github.com/awgu	2024-08-27 17:31:11 +00:00
wz337	761cf91e3c	[DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275 ) Adding a private helper method for Shampoo HSDP use cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275 Approved by: https://github.com/XilunWu	2024-08-27 14:51:19 +00:00

3508 Commits