Given that standalone generates the rendezvous args anyway, it seems more convenient for it to explicitly use a random port by default instead of trying to use 29400.
That way users can directly go with `--standalone` instead of having to spell out `--rdzv-backend=c10d --rdzv-endpoint=localhost:0`.
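For reference, here is a minimal sketch of why `localhost:0` amounts to a random free port (this only illustrates the OS behavior, not torchrun's actual rendezvous code):
```python
import socket

def get_free_port() -> int:
    # Binding to port 0 asks the OS to pick any currently free port, which is
    # what an endpoint of localhost:0 relies on.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]

print(get_free_port())
```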
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107734
Approved by: https://github.com/H-Huang
As the title says, I was trying to test the functional collectives, and, when printing the resulting tensors, sometimes the async operation hadn't finished yet. According to the comments in the file, the "AsyncTensor wrapper applied to returned tensor, which issues wait_tensor() at the time of first use". This is true in most cases, but not when print() is your first use. This PR fixes that.
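For intuition, a conceptual sketch of the lazy-wait pattern described above (this is not the real `AsyncCollectiveTensor`; it only illustrates why `print()` must also count as a "first use"):
```python
# Conceptual sketch only: a wrapper that synchronizes lazily, including when
# repr()/print() happens to be the first use of the result.
class LazyAsyncTensor:
    def __init__(self, work, tensor):
        self._work = work      # e.g. the handle returned by an async collective
        self._tensor = tensor  # data is only valid after the work is waited on
        self._waited = False

    def _wait(self):
        if not self._waited:
            self._work.wait()  # block until the collective has produced valid data
            self._waited = True
        return self._tensor

    def __repr__(self):
        # print() calls __repr__, so it must also trigger the wait
        return repr(self._wait())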
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107808
Approved by: https://github.com/fduwjj
**Overview**
This PR runs the HSDP all-reduce as async so that it can overlap with both all-gather and reduce-scatter, which can lead to slight end-to-end speedups when the sharding process group is fully intra-node. Previously, the all-reduce serialized with the reduce-scatter, so it could only overlap with one all-gather.
For some clusters (e.g. our AWS cluster), `NCCL_CROSS_NIC=1` improves inter-node all-reduce times when overlapped with intra-node all-gather/reduce-scatter.
**Experiment**
<details>
<summary> Example 'before' trace </summary>
<img width="559" alt="hsdp_32gpus_old" src="https://github.com/pytorch/pytorch/assets/31054793/15222b6f-2b64-4e0b-a212-597335f05ba5">
</details>
<details>
<summary> Example 'after' trace </summary>
<img width="524" alt="hsdp_32gpus_new" src="https://github.com/pytorch/pytorch/assets/31054793/94f63a1d-4255-4035-9e6e-9e10733f4e44">
</details>
For the 6-encoder-layer, 6-decoder-layer transformer with `d_model=8192`, `nhead=64` on 4 nodes / 32 40 GB A100s via AWS, the end-to-end iteration times are as follows (with AG == all-gather, RS == reduce-scatter, AR == all-reduce; bandwidth reported as algorithmic bandwidth):
- Reference FSDP:
- **1160 ms / iteration**
- ~23 ms / encoder AG/RS --> 24.46 GB/s bandwidth
- ~40 ms / decoder AG/RS --> 26.5 GB/s bandwidth
- 50 GB/s theoretical inter-node bandwidth
- Baseline 8-way HSDP (only overlap AR with AG) -- intra-node AG/RS, inter-node AR:
- **665 ms / iteration**
- ~3 ms / encoder AG/RS --> 187.5 GB/s bandwidth
- ~5 ms / decoder AG/RS --> 212 GB/s bandwidth
- ~30 ms / encoder AR --> 2.34 GB/s bandwidth
- ~55 ms / decoder AR --> 2.65 GB/s bandwidth
- 300 GB/s theoretical intra-node bandwidth
- New 8-way HSDP (overlap AR with AG and RS) -- intra-node AG/RS, inter-node AR:
- **597 ms / iteration**
- ~3 ms / encoder AG/RS --> 187.5 GB/s bandwidth
- ~6.2 ms / decoder AG/RS --> 170.97 GB/s bandwidth (slower)
- ~23 ms / encoder AR (non-overlapped) --> 3.057 GB/s bandwidth (faster)
- ~49 ms / decoder AR (non-overlapped) --> 2.70 GB/s bandwidth (faster)
- ~100 ms / decoder AR (overlapped) --> 1.325 GB/s bandwidth (slower)
- Overlapping with reduce-scatter reduces all-reduce bandwidth utilization even though the all-reduce is inter-node and reduce-scatter is intra-node!
- New 8-way HSDP (overlap AR with AG and RS) with `NCCL_CROSS_NIC=1`:
- **556 ms / iteration**
- Speedup comes from faster overlapped AR
Thus, for this particular workload, the async all-reduce enables a 16% iteration-time speedup compared to the existing HSDP and a 52% speedup compared to FSDP. These speedups are pronounced because the workload is communication-bound, so any reduction in communication time translates directly into speedup.
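For illustration, a hedged sketch of the overlap pattern (not FSDP's internal code): the intra-node reduce-scatter runs first, and the inter-node all-reduce is issued with `async_op=True` so later collectives can overlap with it. The helper name and the assumption that the gradient numel divides evenly across the sharding group are mine.
```python
import torch
import torch.distributed as dist

def hsdp_reduce_grad(unsharded_grad: torch.Tensor, shard_pg, replicate_pg):
    # intra-node reduce-scatter over the sharding group
    shard_size = unsharded_grad.numel() // dist.get_world_size(shard_pg)
    sharded_grad = torch.empty(
        shard_size, dtype=unsharded_grad.dtype, device=unsharded_grad.device
    )
    dist.reduce_scatter_tensor(sharded_grad, unsharded_grad, group=shard_pg)
    # inter-node all-reduce over the replica group, issued asynchronously so
    # the next all-gather/reduce-scatter can overlap with it
    work = dist.all_reduce(sharded_grad, group=replicate_pg, async_op=True)
    return sharded_grad, work  # caller waits on `work` before the optimizer step
```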
**Unit Test**
This requires >= 4 GPUs:
```
python -m pytest test/distributed/fsdp/test_fsdp_hybrid_shard.py -k test_fsdp_hybrid_shard_parity
```
Differential Revision: [D47852456](https://our.internmc.facebook.com/intern/diff/D47852456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106080
Approved by: https://github.com/ezyang
ghstack dependencies: #106068
The post-backward hook has some complexity due to the different paths: {no communication hook, communication hook} x {`NO_SHARD`, `FULL_SHARD`/`SHARD_GRAD_OP`, `HYBRID_SHARD`/`_HYBRID_SHARD_ZERO2`} plus some options like CPU offloading and `use_orig_params=True` (requiring using sharded gradient views).
The PR following this one that adds async all-reduce for HSDP further complicates this since the bottom-half after all-reduce must still be run in the separate all-reduce stream, making it more unwieldy to unify with the existing bottom-half.
Nonetheless, this PR breaks up the post-backward hook into smaller logical functions to hopefully help readability.
Differential Revision: [D47852461](https://our.internmc.facebook.com/intern/diff/D47852461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106068
Approved by: https://github.com/ezyang, https://github.com/fegin
The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` roughly doubles the memory consumption, which means only objects occupying at most half of the device memory can be broadcast. This PR improves usability by skipping the `torch.cat` operation when the object_list has only a single element.
Before (30G tensor):
<img width="607" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944">
After (46G tensor):
<img width="600" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5">
Test Code:
```python
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4)  # ~30 GiB of fp32
    if dist.get_rank() == 0:
        state_dict = {"fake_tensor": fake_tensor}
    else:
        state_dict = {}
    object_list = [state_dict]
    dist.broadcast_object_list(object_list, src=0)
    print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys())
    dist.barrier()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509
Approved by: https://github.com/awgu
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so I am enabling the rule to keep it that way. :)
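For context, the pattern RUF017 flags looks like this (a generic example, not taken from the PyTorch codebase):
```python
lists = [[1, 2], [3, 4], [5, 6]]
flat_slow = sum(lists, [])                     # quadratic: each + copies the accumulator
flat_fast = [x for sub in lists for x in sub]  # linear alternative
assert flat_slow == flat_fast
```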
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
This PR fixes how requires_grad is set when calling distribute_tensor: we should set the requires_grad of the local tensor after the detach call to make sure we create the leaf correctly; otherwise it raises warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107606
Approved by: https://github.com/fduwjj
This PR is the first change of a series of refactors to the op dispatch logic to:
1. remove the redundant logic in the op dispatch and simplify the error checking
2. reduce the number of tree_map/tree_flatten/unflatten calls to cut the overhead coming from those operations
3. remove the CachedShardingPropagator by using functools.lru_cache directly; this not only helps TP, but also makes general DTensor operations faster (see the sketch after this list)
4. change the view-op behavior: in-place changing the op_schema is dangerous for sharding-prop caching, so model view ops as one type of resharding too
5. enrich the output sharding to include whether the op needs redistribute, so that we don't need an explicit op schema comparison to know it.
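As referenced in item 3, a minimal sketch of the `lru_cache` idea (assumed shape, not DTensor's actual propagator; it relies on the op schema being hashable):
```python
from functools import lru_cache

class ShardingPropagator:
    @lru_cache(maxsize=None)
    def propagate_op_sharding(self, op_schema):
        # Expensive sharding propagation runs only on a cache miss; both
        # `self` and `op_schema` must be hashable for lru_cache to key on them.
        ...
```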
This should help further reduce the CPU overhead. Benchmark results:
before (without this change), aten.addmm latency: 0.476 ms

after (with this change), aten.addmm latency: 0.341 ms

overall, one layer of MLP time reduced from 13.535 ms -> 9.665 ms
Apart from the overhead reduction, this PR simplifies the op dispatching logic and the resharding logic (more refactoring is needed to make things cleaner, which will be done in later PRs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107305
Approved by: https://github.com/fduwjj
We cannot use inner tensors for finalizers, as they are uncollectable until waited on.
This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffold for us to test whether tensors have been waited on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
This allows infra/trainers to get detailed stats about communication
efficiency without knowing anything about what model or distributed
training paradigm has been used. This is helpful as the infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainers have access to all
collectives used by the model authors.
This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233
Approved by: https://github.com/kumpera
https://github.com/pytorch/pytorch/pull/106524 was merged so fast that we didn't figure out that we should hash both stride and dtype in DTensorSpec. This is a forward fix.
An analysis of why using just the shape is not enough:
1. We use the hash value for the sharding propagation cache, and the output sharding contains the stride and size of the output DTensor. If we don't consider stride, we will see errors.
2. One example can be seen below:
```
OpSchema(func_schema=aten::t(Tensor(a) self) -> Tensor(a), args_schema=(DTensorSpec(mesh=DeviceMesh:([0, 1, 2, 3, 4, 5, 6, 7]), placements=(Shard(dim=0),), tensor_meta=TensorMetadata(shape=torch.Size([64, 128]), dtype=torch.float32, requires_grad=False, stride=(128, 1), memory_format=None, is_quantized=False, qparams={})),), kwargs_schema={})
```
```
OpSchema(func_schema=aten::t(Tensor(a) self) -> Tensor(a), args_schema=(DTensorSpec(mesh=DeviceMesh:([0, 1, 2, 3, 4, 5, 6, 7]), placements=(Shard(dim=0),), tensor_meta=TensorMetadata(shape=torch.Size([64, 128]), dtype=torch.float32, requires_grad=False, stride=(1, 64), memory_format=None, is_quantized=False, qparams={})),), kwargs_schema={})
```
The only difference between the two op_schemas is the tensor stride:
<img width="151" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/161335df-bdfb-47c5-ba79-82616d070d15">
This makes the transpose op generate a wrong result, which leads to the add_/addmm_ ops failing with errors:
```
Traceback (most recent call last):
  File "/data/users/fduwjj/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/fduwjj/pytorch/benchmarks/distributed/tensor/tp_benchmark.py", line 210, in run_tp
    output.sum().backward()
  File "/data/users/fduwjj/pytorch/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/data/users/fduwjj/pytorch/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/api.py", line 252, in __torch_dispatch__
    return op_dispatch.operator_dispatch(
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/dispatch.py", line 116, in operator_dispatch
    out, _, _ = _operator_dispatch(op_call, args, kwargs, sharding_propagator)
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/dispatch.py", line 246, in _operator_dispatch
    local_results = op_call(*local_tensor_args, **local_tensor_kwargs)
  File "/data/users/fduwjj/pytorch/torch/_ops.py", line 435, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: The size of tensor a (64) must match the size of tensor b (8) at non-singleton dimension 1
```
The same applies to dtype: if we use DTensor in a mixed-precision environment, we will run into similar situations.
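A hedged sketch of the fix's idea (not the real `DTensorSpec` code): make stride and dtype part of the hashed spec so two specs that differ only in layout or dtype are not treated as the same cache entry.
```python
from dataclasses import dataclass
from typing import Tuple

import torch

@dataclass(frozen=True)
class MiniTensorMeta:
    shape: Tuple[int, ...]
    stride: Tuple[int, ...]   # participates in hashing/equality
    dtype: torch.dtype        # participates in hashing/equality

a = MiniTensorMeta((64, 128), (128, 1), torch.float32)
b = MiniTensorMeta((64, 128), (1, 64), torch.float32)  # transposed layout
assert a != b  # specs differing only in stride must not collide in the cache
```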
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107181
Approved by: https://github.com/wanchaol
ghstack dependencies: #106524
This allows infra/trainers to get detailed stats about communication
efficiency without knowing anything about what model or distributed
training paradigm has been used. This is helpful as the infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainers have access to all
collectives used by the model authors.
This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
When init_process_group has not been called beforehand, DeviceMesh automatically calls init_process_group without specifying the backend. This is a problem when a third-party device wants to use DeviceMesh without calling init_process_group first. This PR adds a default_device_backend_map: third-party device users can add their backend to this map when they first register it with PyTorch. When init_process_group is then called without the backend parameter, it initializes the backends in this map, so a third-party user can call init_process_group without specifying the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113
Approved by: https://github.com/wanchaol
Summary:
When loading a CPU state_dict with a pg initialized with
cpu:gloo,cuda:nccl, we hit a gloo crash since the dest tensor is on GPU while the input
is on CPU.
As a workaround, just enforce that if local_tensor.is_cpu, the dest tensor is
also on CPU.
Test Plan: CI
Differential Revision: D48324752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107172
Approved by: https://github.com/fegin
Move the remaining collectives to a separate file to prepare device mesh
to become a public distributed API.
For those remaining utils, we need to upstream them to functional
collectives with a proper implementation; a TODO is added there for a follow-up
PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107012
Approved by: https://github.com/fduwjj
AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives APIs. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.
Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))
AsyncCollectiveTensor shouldn't need to synchronize if you detach() it, though - in fact, it should be fine to avoid synchronizing if you perform any view ops on it (which only require viewing metadata, not actual data). This PR updates `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op.
Added some light testing, that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab
This fixes a pretty vicious bug relating to `SHARD_GRAD_OP`, mixed precision, EMA, and eval.
**Bug Explanation**
The model has a main module and an EMA module, where the main module is used for training and the EMA module is used for eval. The model has FSDP's fp16 mixed precision enabled. The flow consists of (1) training forward/backward/optimizer -> (2) EMA update (copy main module to EMA module) -> (3) eval forward in `torch.no_grad()`, and this repeats for many iterations.
Consider the _second_ iteration.
- From the first iteration's eval forward, the EMA module has the fp16 unsharded parameters in memory (not freed due to `SHARD_GRAD_OP`).
- In this second iteration's step (2), we perform the EMA update under the `summon_full_params()` context, where FSDP specially forces full precision. This means that the EMA module now uses fp32 unsharded parameters, distinct from the fp16 unsharded parameters still in memory. The EMA update modifies those fp32 parameters, and upon exiting the context, FSDP correctly writes the modifications back to the fp32 sharded parameters.
- In the second iteration's step (3) (eval forward), FSDP checks whether it needs to run the unshard op (including all-gather) but sees it does not since the fp16 unsharded parameters are still in memory. Thus, FSDP uses those fp16 unsharded parameters directly without all-gather. However, these fp16 unsharded parameters are stale and do not include the EMA update!
- In other words, at this point, the fp32 sharded parameters are correct, the fp16 unsharded parameters are stale, and FSDP chooses _not_ to re-all-gather since the fp16 unsharded parameters are in memory.
**Fix Explanation**
This PR fixes this by freeing the fp16 unsharded parameters if they are still allocated when forcing full precision, i.e. using fp32 unsharded parameters in `summon_full_params()`. This ensures that any modifications written back to the fp32 sharded parameters will be persisted via the next all-gather.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106858
Approved by: https://github.com/kumpera
ghstack dependencies: #106857
Issue resolved: https://github.com/pytorch/pytorch/issues/97791
Before this PR, mixed_precision applied to buffers from ignored modules; see ```test_state_dict_with_ignored_modules(mixed_precision=True)``` to reproduce.
After this PR, we avoid applying mixed_precision semantics to buffers from ignored modules:
* step 1 initialization: state._ignored_buffer_names contains all the buffers from ignored modules
* step 2 lazy init at runtime: skip ignored buffers in ```_get_buffers_and_dtypes_for_computation```
* step 3 skip upcasting in the state_dict hook: avoid upcasting for ignored buffers in ```_get_buffers_and_dtypes_for_computation```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106766
Approved by: https://github.com/awgu
Currently, DCP treats tensors as duplicates and only saves them on rank 0. This won't work for PiPPy, as PiPPy does have unique tensors across different ranks. With the current setup, we would only be saving the tensors on rank 0 (the coordinator rank).
In this PR, we change to letting each rank create its own WriteItem for tensors. For the tensors that are replicated across different ranks, we handle them through dedup_tensors(), which dedups the replicated WriteItems so we only do the actual write once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106415
Approved by: https://github.com/wz337
This PR adds a new `CustomPolicy` that acts like the existing `lambda_auto_wrap_policy` except it (1) leverages the new auto wrapping infrastructure and (2) allows overriding FSDP kwargs for particular instances. (1) gives it access to the validation checks (like for frozen parameters), and (2) makes it as expressive as manual wrapping. This should allow us to effectively deprecate manual wrapping if desired.
The API is as follows:
```
def lambda_fn(module: nn.Module) -> Union[bool, Dict[str, Any]]:
    ...

policy = CustomPolicy(lambda_fn)
```
The `lambda_fn` can return:
- `False` or `{}` to indicate no wrapping
- `True` to indicate wrapping while inheriting the root's FSDP kwargs
- Non-empty `dict` to indicate wrapping while overriding the specified FSDP kwargs and inheriting the rest from the root
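A hedged usage sketch based on the description above; the module classes and the `sharding_strategy` override are illustrative, and the import path assumes `CustomPolicy` is exposed under `torch.distributed.fsdp.wrap`:
```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import CustomPolicy

def lambda_fn(module: nn.Module):
    if isinstance(module, nn.TransformerEncoderLayer):
        return True   # wrap, inheriting the root's FSDP kwargs
    if isinstance(module, nn.TransformerDecoderLayer):
        # wrap, overriding this kwarg and inheriting the rest from the root
        return {"sharding_strategy": ShardingStrategy.SHARD_GRAD_OP}
    return False      # do not wrap

model = nn.Transformer(d_model=32, nhead=4)
# Requires an initialized process group, as with any FSDP construction.
sharded_model = FSDP(model, auto_wrap_policy=CustomPolicy(lambda_fn))
```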
---
After this PR, the follow-up work items for auto wrapping are:
1. Add shared parameter validation
2. (Longer-term / exploratory) Add a policy that provides a reasonable auto wrapping with "minimal" user input
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104986
Approved by: https://github.com/ezyang
ghstack dependencies: #104427, #104967, #104999, #104969
This PR makes some code-organization improvements.
- It renames `_FSDPPolicy` to `_Policy` to show that it is not only for FSDP but for any module-level API.
- It formalizes the contract that such a policy should return something like `target_module_to_kwargs: Dict[nn.Module, Dict[str, Any]]` that maps each module to wrap to its kwargs. It does so by requiring a `_run_policy` abstract method (this time private since users do not need to care about it). Then, our auto wrapping can just call `_run_policy()` to generate the dict and do any validation or post-processing.
This PR is technically BC-breaking because it removes the public `ModuleWrapPolicy.policy`. However, I do not think anyone was using that anyway, so this is a pretty safe breakage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104969
Approved by: https://github.com/rohan-varma
ghstack dependencies: #104427, #104967, #104999
This PR adds improved error/warning messaging when auto wrapping with `ModuleWrapPolicy` in the presence of frozen parameters.
- For `use_orig_params=False`, FSDP requires uniform `requires_grad` for each FSDP instance. This PR adds a `ValueError` at wrapping time with a message that mentions the violating module and the frozen/non-frozen parameter names.
- For `use_orig_params=True`, FSDP allows non-uniform `requires_grad` for each FSDP instance. However, it will result in higher-than-expected gradient memory usage. This PR adds a `UserWarning` at wrapping time with a message that mentions the violating module, how much extra gradient memory will be used (in units of numel), and the frozen/non-frozen parameter names.
- There is a possibility that this warning will be spammy/verbose, but my current thinking is that it is okay for now unless users complain.
<details>
<summary> Why DFS via named_children() vs. Using named_modules()</summary>
```
LoraModel(
(embed_tokens): Embedding(100, 32)
(layers): ModuleList(
(0-3): 4 x LoraDecoder(
(attn): LoraAttention(
(q_proj): Linear(in_features=32, out_features=32, bias=False)
(lora_A): Linear(in_features=32, out_features=8, bias=False)
(lora_B): Linear(in_features=8, out_features=32, bias=False)
(k_proj): Linear(in_features=32, out_features=32, bias=False)
(v_proj): Linear(in_features=32, out_features=32, bias=False)
(o_proj): Linear(in_features=32, out_features=32, bias=False)
)
(mlp): LoraMLP(
(proj1): Linear(in_features=32, out_features=128, bias=False)
(proj2): Linear(in_features=128, out_features=32, bias=False)
)
(inp_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
(post_attn_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
)
)
(norm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
)
```
Reverse topological order with stack-based DFS via `named_children()`:
```
[
'embed_tokens',
'layers.0.attn.q_proj', 'layers.0.attn.lora_A', 'layers.0.attn.lora_B', 'layers.0.attn.k_proj', 'layers.0.attn.v_proj', 'layers.0.attn.o_proj', 'layers.0.attn', 'layers.0.mlp.proj1', 'layers.0.mlp.proj2', 'layers.0.mlp', 'layers.0.inp_layernorm', 'layers.0.post_attn_layernorm', 'layers.0',
'layers.1.attn.q_proj', 'layers.1.attn.lora_A', 'layers.1.attn.lora_B', 'layers.1.attn.k_proj', 'layers.1.attn.v_proj', 'layers.1.attn.o_proj', 'layers.1.attn', 'layers.1.mlp.proj1', 'layers.1.mlp.proj2', 'layers.1.mlp', 'layers.1.inp_layernorm', 'layers.1.post_attn_layernorm', 'layers.1',
'layers.2.attn.q_proj', 'layers.2.attn.lora_A', 'layers.2.attn.lora_B', 'layers.2.attn.k_proj', 'layers.2.attn.v_proj', 'layers.2.attn.o_proj', 'layers.2.attn', 'layers.2.mlp.proj1', 'layers.2.mlp.proj2', 'layers.2.mlp', 'layers.2.inp_layernorm', 'layers.2.post_attn_layernorm', 'layers.2',
'layers.3.attn.q_proj', 'layers.3.attn.lora_A', 'layers.3.attn.lora_B', 'layers.3.attn.k_proj', 'layers.3.attn.v_proj', 'layers.3.attn.o_proj', 'layers.3.attn', 'layers.3.mlp.proj1', 'layers.3.mlp.proj2', 'layers.3.mlp', 'layers.3.inp_layernorm', 'layers.3.post_attn_layernorm', 'layers.3',
'layers', 'norm', ''
]
```
Reverse topological order with `named_modules()`:
```
[
'norm',
'layers.3.post_attn_layernorm', 'layers.3.inp_layernorm', 'layers.3.mlp.proj2', 'layers.3.mlp.proj1', 'layers.3.mlp', 'layers.3.attn.o_proj', 'layers.3.attn.v_proj', 'layers.3.attn.k_proj', 'layers.3.attn.lora_B', 'layers.3.attn.lora_A', 'layers.3.attn.q_proj', 'layers.3.attn', 'layers.3',
'layers.2.post_attn_layernorm', 'layers.2.inp_layernorm', 'layers.2.mlp.proj2', 'layers.2.mlp.proj1', 'layers.2.mlp', 'layers.2.attn.o_proj', 'layers.2.attn.v_proj', 'layers.2.attn.k_proj', 'layers.2.attn.lora_B', 'layers.2.attn.lora_A', 'layers.2.attn.q_proj', 'layers.2.attn', 'layers.2',
'layers.1.post_attn_layernorm', 'layers.1.inp_layernorm', 'layers.1.mlp.proj2', 'layers.1.mlp.proj1', 'layers.1.mlp', 'layers.1.attn.o_proj', 'layers.1.attn.v_proj', 'layers.1.attn.k_proj', 'layers.1.attn.lora_B', 'layers.1.attn.lora_A', 'layers.1.attn.q_proj', 'layers.1.attn', 'layers.1', 'layers.0.post_attn_layernorm', 'layers.0.inp_layernorm', 'layers.0.mlp.proj2', 'layers.0.mlp.proj1', 'layers.0.mlp', 'layers.0.attn.o_proj', 'layers.0.attn.v_proj', 'layers.0.attn.k_proj', 'layers.0.attn.lora_B', 'layers.0.attn.lora_A', 'layers.0.attn.q_proj', 'layers.0.attn', 'layers.0',
'layers', 'embed_tokens', ''
]
```
With the stack-based DFS via `named_children()`, reversing the topological order gives us each level in the module tree in the registered order, whereas with `named_modules()`, reversing the topological order gives us each level in reverse. Both are valid orders, but we prefer the former since it allows us to error/warn on the _first-registered_ module that violates the frozen/non-frozen condition.
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104427
Approved by: https://github.com/ezyang
This PR should not make any functional difference. It:
- adds clearer documentation
- clarifies a type
- revises minor typos
- swaps a .keys for a .items call on a dictionary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106069
Approved by: https://github.com/awgu
### Background: Gradient Pre-Divide
Consider $N$ data parallel workers. Define $g_i$ to be the $i$ th worker's local unsharded gradient. Data parallel gradient reduction computes $\overline g = \frac{1}{N} \sum_{i \in [N]} g_i$.
$\sum_{i \in [N]} g_i$ increases the magnitude by a factor of $N$, which may overflow for fp16. However, if we pre-divide and compute $\sum_{i \in [N]} \frac{g_i}{N}$, then the $\frac{g_i}{N}$ may underflow. The current solution from Myle for FSDP is to pre-divide by $\sqrt{N}$ and post-divide by $\sqrt{N}$:
$$\overline{g} = \frac{1}{\sqrt{N}} \sum_{i \in [N]} \frac{g_i}{\sqrt{N}}.$$
Now, consider HSDP with $N = S \cdot R$ data parallel workers, sharding over $S$ workers and replicating over $R$ workers. Define $g_{i,j}$ to be the $i \cdot S + j$ th worker's local unsharded gradient (so sharding indexes with $i$ and replication indexes with $j$). The existing implementation computes
$$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}} } \sum_{i \in [S]} \frac{g_i}{\sqrt{S}},$$
where the $\frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}}$ involves two separate `aten::div_` kernels.
### Revisiting Pre-Divide for HSDP
A minor optimization that we can do is with this intermediate `div_`. There are two options:
1. Compute $\overline{g}$ in the same way as FSDP:
$$\overline{g} = \frac{1}{\sqrt{N}} \sum_{j \in [R]} \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{N}}.$$
2. Compute $\overline{g}$ still with an intermediate division for rescaling but coalescing the two `divs_` into one:
$$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{N}} } \sum_{i \in [S]} \frac{g_i}{\sqrt{S}}$$
This PR goes with the 1st approach, prioritizing performance, because (1) it matches the existing FSDP behavior and (2) it avoids a memory-bandwidth-bound `div_` kernel that blocks the all-reduce launch.
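A small numeric sketch of the chosen factors (illustrative only, not FSDP's implementation): with $N = S \cdot R$ workers, approach 1 pre- and post-divides by $\sqrt{N}$.
```python
import math

def divide_factors(shard_world_size: int, replicate_world_size: int):
    # N = S * R total data-parallel workers
    n = shard_world_size * replicate_world_size
    pre = post = math.sqrt(n)
    return pre, post  # grad.div_(pre) before reduction, result.div_(post) after

print(divide_factors(8, 4))  # (~5.657, ~5.657) for S=8, R=4
```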
### Implementation Details
In order to accommodate this, we need to refactor the communication hook logic that baked the gradient pre/post-division into the default hook.
- We raise an error if registering a communication hook for HSDP since the current implementation would only apply the hook to the reduce-scatter, not the all-reduce, which may be unexpected.
- We change it so that `state._comm_hook is not None` iff a communication hook is registered. This makes the collectives and the pre/post-division in the default no-communication-hook path more visible in the code.
Differential Revision: [D47852459](https://our.internmc.facebook.com/intern/diff/D47852459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106034
Approved by: https://github.com/rohan-varma
With distributed checkpointing in PyTorch/XLA SPMD, the WriteItem index hints should not be modified when creating the global plan. In order to reuse the default planner logic for checkpoint metadata creation, we need to make the behavior of rewriting index hints optional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105861
Approved by: https://github.com/kumpera
- This PR rewords the `BackwardPrefetch` docs to make the tradeoffs clear in the first sentence of each with more technical details after.
- The only supported `_FSDPPolicy` is `ModuleWrapPolicy` at the time of writing this PR. We may add others in the future such as in my other PR stack. This PR removes `_FSDPPolicy` from the public docs.
- This provides some more details around `MixedPrecision` such as explaining that layer norm and batch norm accumulate in fp32.
Follow-ups:
- Why do we force batch norm modules to have FSDP applied separately? (E.g. was this because before batch norm kernels did not support fp16/bf16?) Like layer norm, this just means that the affine parameters are in fp32. Both already accumulate in fp32 even with fp16/bf16 inputs.
- Check the `param_init_fn` + `sync_module_states=True` usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105847
Approved by: https://github.com/rohan-varma
This PR adds initial dynamo support for DTensor. In particular, it:
- allows DTensor to be passed into a compiled function and allows fakifying DTensor during dynamo tracing by turning the inner local tensor into a meta tensor.
- uses `allow_in_graph` to include `DTensor` and `DTensor.from_local` so they are represented as `TorchVariable`
- the DTensor created becomes a normal `TensorVariable`, and it inserts any tensor operations into the output graph just like torch.Tensor
- note that DTensor has a new instance method `redistribute` compared to a plain tensor, and we currently special-case it in `TensorVariable`
`from_local` and `redistribute` both accept some non-trivial metadata as arguments (i.e. DeviceMesh, Placement) which fx.Graph does not support. In order to let these two APIs appear in the dynamo-captured graph, we encode the metadata into a new function (like `functools.partial`), and the new function only accepts prim args (i.e. tensors); then we put a `call_function` with this new function into the graph. This was suggested by @ezyang. The underlying rationale is that the metadata will not change across graph invocations, so it is safe to encode it.
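A hedged sketch of that "bake the metadata into a new callable" trick (illustrative only, not dynamo's actual implementation):
```python
from torch.distributed._tensor import DTensor

def make_prim_from_local(mesh, placements):
    # mesh and placements do not change across graph invocations, so we bake
    # them into the closure; the returned fn takes only prim (tensor) args,
    # which fx.Graph can represent as a call_function node.
    def prim_from_local(local_tensor, run_check=False):
        return DTensor.from_local(local_tensor, mesh, placements, run_check=run_check)
    return prim_from_local
```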
Captured graph:
```
def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:685, code: dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
    prim_from_local = torch__dynamo_variables_torch_prim_from_local(l_x_, run_check = False); l_x_ = None
    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:686, code: return dt.redistribute(mesh, [Replicate()]).to_local() + 2
    prim_redistribute = torch__dynamo_variables_tensor_prim_redistribute(prim_from_local); prim_from_local = None
    to_local = prim_redistribute.to_local(); prim_redistribute = None
    add = to_local + 2; to_local = None
    return (add,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103146
Approved by: https://github.com/voznesenskym
This fixes https://github.com/pytorch/pytorch/issues/104504.
- When not using full-precision eval, the relevant fix is to force `_use_sharded_views()` calls if needed in `SUMMON_FULL_PARAMS` training state.
- When using full-precision in eval, the relevant fix is tracking what was the unsharded flat parameter from which the unsharded views were computed and using that instead of determining the unsharded flat parameter from the calling context via `_get_padded_unsharded_flat_param()`.
This also fixes https://github.com/pytorch/pytorch/issues/104770.
<details>
<summary> Print output showing parity </summary>
```
Key: 0
Model 1: [-1.5, 6.40625, -0.9453125, -0.3828125, 0.16015625, -1.5078125]
Model 2: [-1.5, 6.40625, -0.9453125, -0.3828125, 0.16015625, -1.5078125]
Key: 1
Model 1: [0.0157470703125, -0.8828125, 5.65625, 1.1328125, 0.275390625, 0.11181640625]
Model 2: [0.0157470703125, -0.8828125, 5.65625, 1.1328125, 0.275390625, 0.11181640625]
Key: 2
Model 1: [0.1689453125, -0.00567626953125, -0.09375, 7.34375, -0.18359375, -0.09521484375]
Model 2: [0.1689453125, -0.00567626953125, -0.09375, 7.34375, -0.18359375, -0.09521484375]
Key: 3
Model 1: [0.546875, -0.8984375, 0.228515625, 0.7578125, 6.0625, 0.435546875]
Model 2: [0.546875, -0.8984375, 0.228515625, 0.7578125, 6.0625, 0.435546875]
Key: 4
Model 1: [-0.66796875, -0.88671875, 0.30078125, 0.06494140625, 0.412109375, 6.9375]
Model 2: [-0.66796875, -0.88671875, 0.30078125, 0.06494140625, 0.412109375, 6.9375]
Key: 5
Model 1: [0.07763671875, 0.8671875, -0.43359375, 0.5703125, 0.76171875, -0.0089111328125]
Model 2: [0.07763671875, 0.8671875, -0.43359375, 0.5703125, 0.76171875, -0.0089111328125]
Key: 6
Model 1: [-0.283203125, -0.361328125, 0.474609375, 0.10205078125, 1.125, -0.0859375]
Model 2: [-0.283203125, -0.361328125, 0.474609375, 0.10205078125, 1.125, -0.0859375]
Key: 7
Model 1: [1.140625, 0.62890625, -0.07568359375, -1.0390625, -0.2578125, -0.053955078125]
Model 2: [1.140625, 0.62890625, -0.07568359375, -1.0390625, -0.2578125, -0.053955078125]
Key: 8
Model 1: [0.68359375, -1.09375, 0.59375, 1.0, -0.23828125, 0.578125]
Model 2: [0.68359375, -1.09375, 0.59375, 1.0, -0.23828125, 0.578125]
Key: 9
Model 1: [0.515625, 0.296875, -0.1826171875, -0.12890625, -0.51953125, -0.3359375]
Model 2: [0.515625, 0.296875, -0.1826171875, -0.12890625, -0.51953125, -0.3359375]
```
</details>
Follow-ups:
- I suspect that for `SHARD_GRAD_OP`, train forward -> eval forward when using full-precision in eval will not free the low-precision unsharded parameters from the train forward, resulting in 1.5x unsharded parameter memory.
Differential Revision: [D47527597](https://our.internmc.facebook.com/intern/diff/D47527597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105346
Approved by: https://github.com/fegin, https://github.com/rohan-varma
This PR adds the necessary plumbing through torchdynamo to allow tensor
subclasses with a certain contract (i.e. with `__tensor_flatten__` and
`__tensor_unflatten__`) to go through the dynamo fakification pass by
fakifying the tensor subclass's internal components.
Some of the tensor subclass contract logic is mostly borrowed from
https://github.com/pytorch/pytorch/pull/97540
Added some tests to verify that simply passing a tensor subclass
(i.e. DTensor) through dynamo eager works as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105308
Approved by: https://github.com/ezyang
This PR canonicalizes the detach call site so that detach is only called
from `distribute_tensor`. Other call sites are changed to view_as, and the
detach call in the tensor constructor is removed.
This is so that we don't detach the local tensor on every op run when
rewrapping the DTensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105239
Approved by: https://github.com/albanD
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
That were reverted due to the conflict with internal source repo.
Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None but the type is not annotated as optional)
Plus a few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add a missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add an assert in `torch/optim/optimizer.py` that an Optional list is not None
TODO (in a follow-up PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add a hack to squash the older libstdc++ from the conda environment in favor of the one from the OS in `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where that is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
constraints:
1. No support for gradient accumulation
2. CPU offload runs step() on CPU. In future PRs ideally we'd run this on GPU.
3. When CPU offload + optimizer overlap, we have to copy the flat_param grad to CPU with non_blocking=False; otherwise step() might run on invalid data.
4. Step is waited on in the post-backward final callback, when in theory it could wait until the next forward.
Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98667
Approved by: https://github.com/awgu, https://github.com/fegin
Purely out of preference, this PR renames the streams to `_unshard_stream` instead of `_streams_unshard` etc. since the former reads more naturally. The PR also removes some duplicated comments and adds back a unit test that streams are shared.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104966
Approved by: https://github.com/rohan-varma
When creating a DeviceMesh, _init_process_group() validates that all calling ranks pass in the same `mesh` argument. In FSDP, we currently create the DeviceMesh based on the pg of the root state, so the mesh will always be valid. This PR adds the flag to DeviceMesh so we can skip the all_gather_tensor used for validation during construction time.
_validate_mesh defaults to True, but we manually flip it to False when initializing the device mesh in FSDP's _runtime_utils.py.
In a follow-up PR, we will skip pg creation if one already exists for both the 1D and 2D cases and then delete the _init_process_groups flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104807
Approved by: https://github.com/wanchaol
Not sure how it worked before, but arguments must be annotated as Optional if they default to None.
Towards enabling mypy-1.4.1 in lintrunner.
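For reference, the PEP 484 rule being enforced:
```python
from typing import Optional

def bad(timeout: int = None):             # PEP 484 violation: implicit Optional
    ...

def good(timeout: Optional[int] = None):  # annotated correctly
    ...
```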
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105022
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn, https://github.com/Skylion007
Originally, we didn't enable BWD for colwise embedding because we thought it was just for inference, but it turns out that we do need it for training. So, let's enable it for now; a unit test is also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104820
Approved by: https://github.com/fegin
Summary:
This diff does the following:
1. re-enable single_file_per_rank for FsspecWriter, as the file-slicing error is resolved by [https://github.com/pytorch/pytorch/pull/99167]
2. remove sync_files from FsspecWriter as there is no fsspec equivalent.
3. remove the internal implementation of FsspecWriter/Reader, as it has been upstreamed to PyTorch OSS
4. keep the internal test for manifold inside internal as we can only test it in fb environment
5. consolidate test to remove duplicates
6. remove unnecessary TARGETS
Test Plan:
```
buck test @//mode/dev-nosan //caffe2/test/distributed/checkpoint/fb:test_fsspec_filesystem -- --print-passing-details
----------------------------------------------------------------------
Ran 1 test in 54.894s
OK
/usr/local/fbcode/platform010/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpzomokvh6'>
_warnings.warn(warn_message, ResourceWarning)
Buck UI: https://www.internalfb.com/buck2/4cb722a2-3ee7-48f2-a9ef-55ee6fb1a498
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8725724447995201
Network: Up: 8.8 MiB Down: 1.5 GiB (reSessionID-04c29f56-ae94-4187-8a1a-c812f432674d)
Jobs completed: 209847. Time elapsed: 1:56.5s.
Cache hits: 100%. Commands: 85687 (cached: 85687, remote: 0, local: 0)
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D47266068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104724
Approved by: https://github.com/fegin, https://github.com/fduwjj
When using KeyedOptimizer.init_state(), some optimizers initialize the states even if the param is empty (size() == 0), while some optimizers avoid initializing the states. There is no way FSDP can tell which. Instead, FSDP should look up `optim.state`. Fortunately, `optim.state` does not rely on FQNs, which some internal users change.
Differential Revision: [D47285562](https://our.internmc.facebook.com/intern/diff/D47285562/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104765
Approved by: https://github.com/fduwjj
The "for now" is because we still have the issue that when using the parameter `ignored_states` path, we do not recover the ignored modules, so FSDP still wraps those as empty shells (no managed parameters), which is not ideal. This is not a blocking issue as far as I know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104418
Approved by: https://github.com/rohan-varma
This moves `fully_shard` to use `_auto_wrap()` just like `FullyShardedDataParallel`. This means that `fully_shard` goes through the `_init_param_handle_from_module()` path (i.e. 1 `fully_shard` per "wrap"), removing the need for `_init_param_handles_from_module()` (which was 1 `fully_shard` for all "wraps" of a given policy). `_auto_wrap()` simply calls `fully_shard` on target submodules.
This includes several important fixes:
- We should register the pre/post-forward hooks on the module regardless of whether it has managed parameters.
- We can permit `_module_handles` to return `[]` in the composable path (for when the module has no managed parameters).
- We should unify the paths for `_get_buffers_and_dtypes_for_computation()` (previously, composable path was buggy in some cases).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104408
Approved by: https://github.com/rohan-varma
This PR is the first in refactoring the auto wrapping, only affecting `ModuleWrapPolicy` for wrapper `FullyShardedDataParallel`. The end goal is to improve the auto wrapping infra to support:
- Checking valid frozen parameters (uniform frozenness per FSDP)
- Checking valid shared parameters (shared parameters assigned to their lowest-common-ancestor module or higher)
- Writing auto wrapping policies that may take multiple passes over the module tree
- Specifying different FSDP kwargs per FSDP instance (instead of enforcing the same for all FSDP instances constructed via an auto wrap policy)
The way I envision achieving this is that we decouple the actual "wrapping" (which is `_post_order_apply()` in this PR) from constructing the wrapping targets and kwargs (which is `target_module_to_kwargs` in this PR). In that way, a policy reduces to just constructing that latter `target_module_to_kwargs` mapping.
I do not personally recommend the size-based policy, but if we wanted to implement that under this new organization, the tracking of wrapped/nonwrapped numel should be done in the pass over the module tree prior to the actual "wrapping". This modularization keeps the actual "wrapping" part simple.
The change to how `old_dtype` is handled is mainly to avoid keeping a reference to the `_override_module_mixed_precision()` function closure in each hook and to allow the function to take in all module classes at once to return which ones actually got overridden for the downstream error message. (We can directly store the global state as a mapping.)
To-do in follow-ups (not in order):
- Add frozen parameter check before `_post_order_apply()`
- Add shared parameter check before `_post_order_apply()`
- Expose wrapping policy that allows per module / per module class kwarg customization (where any unspecified kwarg adopts the root's kwarg)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104346
Approved by: https://github.com/rohan-varma, https://github.com/fegin
In https://github.com/pytorch/pytorch/pull/97645 and some follow-up diffs, we made FSDP run in full precision in eval mode, even if mixed precision was specified.
However, this is probably not the best idea, and we should provide a flag so users have a bit more control over this. This PR adds an env var FSDP_FULL_PREC_IN_EVAL, defaulting it to off; users who want to run eval in fp32 can toggle it before wrapping the model in FSDP:
os.environ["FSDP_FULL_PREC_IN_EVAL"] = "1"
Verified that unittests, APS workflow, TNT workloads can run eval appropriately with this change.
Differential Revision: [D47246556](https://our.internmc.facebook.com/intern/diff/D47246556/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104682
Approved by: https://github.com/awgu
This allows us to use use_dtensor=True for ShardedStateDictConfig() before calling model.load_state_dict(). It only works for offload_to_cpu=False for now.
The next PR will make use_dtensor=True work with offload_to_cpu=True for load_state_dict().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104087
Approved by: https://github.com/fegin
This addresses https://github.com/pytorch/pytorch/issues/104187.
After this PR, the contract with the user is that:
- If passing `param_init_fn=None`, each `nn.Module.reset_parameters()` should only initialize its own parameters/buffers (like `parameters(recurse=False)`/`buffers(recurse=False)`).
- If passing `param_init_fn` not equal to `None`, then similarly, one call to `param_init_fn(module)` should only initialize `module`'s own parameters/buffers.
With this contract and this PR's changes, meta device initialization through either `reset_parameters()` or `param_init_fn` should be correct. Those functions will run on the original parameter/buffer shapes allowing for correct shape-dependent computations like for fan-in/fan-out, and there will not be any re-initialization of any module.
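A hedged illustration of that contract (the module is made up): `reset_parameters()` touches only the module's own parameters, and children initialize themselves.
```python
import torch
import torch.nn as nn

class MyBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        self.child = nn.Linear(dim, dim)  # has its own reset_parameters()

    def reset_parameters(self) -> None:
        # Initialize only this module's own parameters (like parameters(recurse=False));
        # the init is shape-dependent (fan-in/fan-out), so it needs the real shapes.
        nn.init.kaiming_uniform_(self.weight)
```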
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104189
Approved by: https://github.com/rohan-varma
Since we do not call `_FSDPState.__init__()` and only use it for typing, it is not possible for these attributes to be `None`. The purpose of these `assert`s is to make sure that these attributes are set by `_init_process_group_state_for_hybrid_shard()`. If we care to make that explicit, I would posit that we should be using `hasattr` checks, not `is not None` checks, because if indeed `_init_process_group_state_for_hybrid_shard()` did not set these attributes, then even checking that it is not `None` would lead to an `AttributeError`. I do not include these `hasattr` checks for now since `_init_process_group_state_for_hybrid_shard()` is short enough that we can quickly tell by inspection that it sets the desired attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104274
Approved by: https://github.com/rohan-varma
This checks that `ignored_modules` and `ignored_states` have the expected type and provides a reasonable error message if not. Otherwise, if someone passes a mix of modules and parameters to `ignored_states` for example, then our code may be silently incorrect.
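A hedged sketch of the kind of validation described (not FSDP's exact code or error message):
```python
import torch.nn as nn

def check_ignored_states(ignored_states):
    states = list(ignored_states)
    if not states:
        return
    if all(isinstance(s, nn.Parameter) for s in states):
        return
    if all(isinstance(s, nn.Module) for s in states):
        return
    raise ValueError(
        "ignored_states must contain all nn.Parameters or all nn.Modules, "
        "not a mix of the two"
    )
```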
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104273
Approved by: https://github.com/rohan-varma
This fixes https://github.com/pytorch/pytorch/issues/104148 (unfreezing parameters after `n` steps).
- This fixes a bug where we did not delete the post-backward hook state properly for the `requires_grad=False` case.
- This makes the `already_resharded` correct for `SHARD_GRAD_OP`.
- This generalizes `_clear_grads_if_needed()` to `_reset_flat_param_grad_info_if_needed()` to additionally include propagating the original parameters' `requires_grad` to the flat parameter.
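A hedged sketch of the unfreezing-after-`n`-steps use case from the linked issue; the names (`model.decoder`, `n_freeze_steps`, etc.) are illustrative, not taken from the issue or FSDP internals.
```python
def train(model, optim, loader, n_freeze_steps: int):
    for step, batch in enumerate(loader):
        if step == n_freeze_steps:
            # Unfreeze a previously frozen submodule; FSDP must propagate the
            # original parameters' requires_grad to the flat parameter.
            for p in model.decoder.parameters():
                p.requires_grad_(True)
        loss = model(batch).sum()
        loss.backward()
        optim.step()
        optim.zero_grad()
```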
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104186
Approved by: https://github.com/rohan-varma, https://github.com/fegin
# Change
This PR adds two classes to DTensor:
1. `CudaRNGStateTracker`: `CudaRNGStateTracker` stores Random Number Generator (RNG) state (a `ByteTensor` object) in a `dict`, mapping from a corresponding tag to each state tensor. It also provides a set of convenient utility methods to help access/modify the state tensors. The most important interface is `_distribute_region` which will be used when DTensor executes a random op (an operator that calls RNG).
2. `OffsetBasedRNGTracker`: This subclass of `CudaRNGStateTracker` defines the default policy of how RNG states should be shared and synchronized among all ranks to respect the semantics of DTensor random operators.
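A conceptual sketch of the state-tracking idea (this is not the actual DTensor implementation; it only shows saving and restoring the CUDA RNG state around a region):
```python
import contextlib
import torch

class SimpleRNGTracker:
    def __init__(self):
        self.rng_states = {}  # tag -> ByteTensor CUDA RNG state

    @contextlib.contextmanager
    def _distribute_region(self, tag: str):
        # Requires a CUDA device; restore the tracked state before the random
        # op and capture the advanced state afterwards.
        if tag in self.rng_states:
            torch.cuda.set_rng_state(self.rng_states[tag])
        try:
            yield
        finally:
            self.rng_states[tag] = torch.cuda.get_rng_state()
```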
# Warning
- With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` will be shared among threads (ranks) and cause issues. We need to figure out a compatible solution for that.
- The RNG state may get out of sync on ranks outside the participating ranks. This is harmless in our current submesh use case, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235
Approved by: https://github.com/wanchaol
Summary:
Details in T133020932
First commit of the collective utils library. Ported over from model store; removed scuba logging, error_trait, and all dependencies on modelstore.
Test Plan: In the following diffs.
Differential Revision: D45545970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101037
Approved by: https://github.com/H-Huang