pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-08 07:39:33 +01:00

Author	SHA1	Message	Date
Wanchao Liang	848cfe8d45	[reland] unflatten_tensor on compute stream for DTensorExtension (#117020 ) reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted by internal. The underlying reason for the revert is that the torch.dynamo.disable can't be used by the pytorch codebase, as it's conflicting with some torch.deploy together, although the later one only run some inference, but it somehow take that weird dependency on fsdp.. We have seen this issue with our functional collectives that we can't use any dynamo components otherwise torch.deploy would complain.. verified internally that after removing torch.dynamo.disable the test passed again Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020 Approved by: https://github.com/awgu	2024-01-09 21:25:15 +00:00
Qinfan Wu	b847290ddd	Back out "[2d] unflatten_tensor on compute stream for DTensorExtension (#116559 )" (#116939 ) Summary: Original commit changeset: 65298112f3db Original Phabricator Diff: D52530451 Differential Revision: D52583345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116939 Approved by: https://github.com/842974287	2024-01-07 03:53:40 +00:00
Wanchao Liang	d9c0e37bab	[2d] unflatten_tensor on compute stream for DTensorExtension (#116559 ) Context: Existing FSDPExtension have some bug in the case when the unflatten tensor involves some compute/communications in cuda stream, the current logic of FSDPExtension unflatten tensor happens in the unshard stream, which makes runtime lost sync with the compute stream, and if there're some dependencies between the compute stream and the unflatten tensor logic, currently it would lose sync point, which could possibly lead to NaN. This PR make the FSDPExtension to record the compute stream and let DTensorExtension to directly use the compute stream for unflatten_tensor. In long term we might want to directly make the FSDP runtime logic to only make the unshard happen in unshard stream, and use unshard views to happen in the compute stream. We currently fix this in the Extension directly as this is the simplest thing to do without affecting FSDP runtime logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559 Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang ghstack dependencies: #116426	2024-01-03 07:29:08 +00:00
Iris Zhang (PyTorch)	23fa9621e4	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) (#115193 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available(). Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/115099 Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above. Test Plan: CI. Differential Revision: D51861018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193 Approved by: https://github.com/fegin	2023-12-08 08:44:32 +00:00
Nikita Shulga	a827ac71f2	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 )" This reverts commit `eaa64339d6`.	2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)	eaa64339d6	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/114991 It was failing because failing a public module binding tests in MacOS, and this is due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since this original import would still work, we remove the changes in this file. Test Plan: CI. Differential Revision: D51825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099 Approved by: https://github.com/wanchaol, https://github.com/fegin	2023-12-05 05:44:52 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit `729ac7317a`. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00
wz337	7b3e45be59	[DeviceMesh] Rename get_dim_groups to get_group (#114708 ) Rename get_dim_groups to get_group and update all callsites. Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin	2023-11-30 23:40:14 +00:00
wz337	febbc48f43	[DeviceMesh] Make our mesh_dim kwarg naming consistent (#114707 ) Changing size(self, dim: Optional[int] = None) to def size(self, mesh_dim: Optional[int] = None) so it is consistent with the rest of our APIs. We also update this API usage change in both PT and internal (pyper, APS). Differential Revision: [D51602986](https://our.internmc.facebook.com/intern/diff/D51602986/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114707 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin	2023-11-29 19:43:23 +00:00
Iris Zhang	72ce5dd13e	[2D] Remove enable_2d_with_fsdp() API and make remove_enable_2d_with_fsdp private (#112473 ) As we have our new 2D flow out, we want to remove `enable_2d_with_fsdp()`. In addition, we change pre_dp_module_transform to private, as we may need to change the UX later on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112473 Approved by: https://github.com/fegin, https://github.com/wanchaol	2023-11-16 01:14:00 +00:00
NVS Abhilash	44c0521e8c	fix: docstring error in torch/distributed module (#113241 ) Fixes: #113193 `pydocstyle <all_files_in_issue> --count` - Before: 345 - After: 130 For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241 Approved by: https://github.com/kit1980	2023-11-09 19:10:20 +00:00
Iris Zhang	12c1465d76	[DeviceMesh] Make mesh_resources private (#112294 ) This is to prepare moving DeviceMesh as a standalone distributed package. `_mesh_resources` should only be used in torch.distributed package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112294 Approved by: https://github.com/fegin	2023-10-28 17:28:46 +00:00
Iris Z	185e76238d	[2D][Documentation] Add some comments to _chunk_dtensor (#111775 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111775 Approved by: https://github.com/awgu	2023-10-23 20:43:03 +00:00
wz337	80dfc974dd	[2D] Enable 2D FSDP+TP model.load_state_dict() (#110925 ) This PR adds a all_gather_dtensor() method to fsdp/_fsdp_extensions.py and the actual implementation in tensor/parallel/fsdp.py. This enables FSDP to load 2D DTensor state_dict into model when calling `model.load_state_dict()`. cc. @fegin Pull Request resolved: https://github.com/pytorch/pytorch/pull/110925 Approved by: https://github.com/fegin ghstack dependencies: #110831, #110846	2023-10-11 18:22:20 +00:00
wz337	6c136c3302	[2D] Enable 2D DTensor state_dict for FSDP + TP (#110846 ) This PR adds a `chunk_dtensor()` method to fsdp/_fsdp_extensions.py and the actual implementation of `chunk_dtensor()` in tensor/parallel/fsdp.py. This enables FSDP to return 2D DTensor state_dict when composing FSDP with TP. cc. @fegin Pull Request resolved: https://github.com/pytorch/pytorch/pull/110846 Approved by: https://github.com/fegin, https://github.com/wanchaol ghstack dependencies: #110831	2023-10-11 17:40:39 +00:00
wz337	8140494afd	[3/N][2D] Enable training with new 2D flow (#110034 ) Replacing https://github.com/pytorch/pytorch/pull/109553 as it gets reverted. This PR enables training with new 2D flow and adds associated test. In addition, this PR moves the tensor/parallel/_data_parallel_utils.py that are fsdp specific back to tensor/parallel/fsdp.py to avoid circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py. state_dict related changes would be in later PRs. cc. @fegin, @fduwjj, @wanchaol, @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034 Approved by: https://github.com/fduwjj	2023-09-26 09:14:15 +00:00
PyTorch MergeBot	f5886bf352	Revert "[3/N][2D] Enable training with new 2D flow (#109553 )" This reverts commit `217b37c023`. Reverted https://github.com/pytorch/pytorch/pull/109553 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but those distributed failures look legit and they are failing in trunk https://hud.pytorch.org/pr/109553 ([comment](https://github.com/pytorch/pytorch/pull/109553#issuecomment-1734100546))	2023-09-25 16:37:19 +00:00
wz337	217b37c023	[3/N][2D] Enable training with new 2D flow (#109553 ) This PR enables training with new 2D flow and adds associated test. state_dict related changes would be in later PRs. cc. @fegin, @fduwjj, @wanchaol, @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/109553 Approved by: https://github.com/fegin, https://github.com/awgu	2023-09-25 05:32:07 +00:00
Chien-Chin Huang	1b3e5b53f3	[FSDP][optim_state_dict] Add device to _shard_utils.py to explicitly use the device from fsdp_state (#109631 ) _get_pg_default_device does not always get the device we want. This PR let the user explicitly tell use the correct device. Differential Revision: [D49425743](https://our.internmc.facebook.com/intern/diff/D49425743/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109631 Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/wz337	2023-09-20 01:59:38 +00:00
Wanchao Liang	da765995fb	[2d] remove ShardedTensor from fsdp extension (#107472 ) 2D Parallel won't use ShardedTensor, and it causes headable for dynamo to recoginize it, removing it from the runtime flatten/unflatten path Pull Request resolved: https://github.com/pytorch/pytorch/pull/107472 Approved by: https://github.com/fduwjj	2023-08-21 17:16:07 +00:00
fduwjj	f3b0d83fe3	[EZ][TP] Refactor FSDP 2D integration extension code so that it can re-used (#107313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107313 Approved by: https://github.com/wz337	2023-08-16 22:01:17 +00:00
Wanchao Liang	d31707a257	Get rid of dim_groups attribute from DeviceMesh (#103105 ) This PR get rids of the dim_groups attribute from DeviceMesh, the main motivation behind this is that we should let c10d store the process groups during its creation instead of DeviceMesh, DeviceMesh should just handle ranks correctly. This could enable DTensor becomes picklable! (torch.save/load could be possible), which I will give it a try in the next PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/103105 Approved by: https://github.com/XilunWu, https://github.com/fduwjj	2023-06-09 04:11:15 +00:00
fduwjj	d4380edb9b	[TP] Add API logging for TP high level API (#102209 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102209 Approved by: https://github.com/wz337, https://github.com/wanchaol	2023-05-25 03:33:00 +00:00
Wanchao Liang	261eb46ddd	[dtensor] refactor get_coordiniate (#95457 ) This refactor get_coordinate to return a optional[list] instead of directly the coordinate on dim, this is so that we can check if the rank is inside the mesh easily Differential Revision: [D43643579](https://our.internmc.facebook.com/intern/diff/D43643579) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95457 Approved by: https://github.com/XilunWu	2023-02-28 17:54:26 +00:00
fduwjj	39511697d4	[PT-D][BE] Update 2D parallelism API name and docs (#94771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94771 Approved by: https://github.com/wanchaol	2023-02-14 08:13:15 +00:00
Wanchao Liang	9b5e6b029f	[tp] umft distributed.tensor.parallel (#89969 ) cmd: `ufmt format torch/distributed/tensor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89969 Approved by: https://github.com/fduwjj	2022-12-01 20:58:16 +00:00
Wanchao Liang	4451eb24e6	Move tensor_parallel out to distributed.tensor folder (#89878 ) This PR moves tensor parallel from torch.distributed._tensor.parallel to torch.distributed.tensor.parallel, to prepare for beta release Pull Request resolved: https://github.com/pytorch/pytorch/pull/89878 Approved by: https://github.com/fduwjj	2022-11-30 22:13:10 +00:00

28 Commits