Commit Graph

31 Commits

Author SHA1 Message Date
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still importing from `torch.distributed._tensor`,
  I added a shim script that redirects the old import paths to the new module

BC preservation is evidenced by the fact that all DTensor tests still pass
without changing the public imports, so it's safe to land the changes.
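
A minimal sketch of what such a redirect shim might look like (the file contents and warning text below are assumptions for illustration, not the PR's code):

```
# Illustrative shim for the old `torch.distributed._tensor` path: warn callers
# to migrate, then re-export everything from the new public module.
import warnings

warnings.warn(
    "torch.distributed._tensor is deprecated; import from "
    "torch.distributed.tensor instead.",
    FutureWarning,
    stacklevel=2,
)

from torch.distributed.tensor import *  # noqa: E402,F401,F403
```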

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
wz337
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```

We can always reconstruct the mesh we need from the mesh dim names, as long as the dims come from the same root. For simplicity, we do not see a need to build a tree structure to represent the child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv`, so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment, as the small check sketched below illustrates:
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh was created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh, not created through slicing.
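
A hypothetical helper illustrating the two cases above (the helper name is made up for this sketch, and `_mesh_resources` is private API):

```
# Hypothetical helper, for illustration only: a mesh that differs from its root
# was created by slicing; a mesh equal to its root is a root mesh.
def is_sliced_submesh(device_mesh) -> bool:
    from torch.distributed.device_mesh import _mesh_resources
    return device_mesh != _mesh_resources.get_root_mesh(device_mesh)
```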

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
Xuehai Pan
b25ef91bf1 [BE][Easy][18/19] enforce style for empty lines in import segments in torch/d*/ (#129770)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129770
Approved by: https://github.com/wconstab
2024-08-01 04:22:50 +00:00
Jeeja
16e0868a3d [FSDP] Add hpu device to _get_remote_device_str (#132120)
In `_create_chunk_sharded_tensor`, `_get_remote_device_str` is used. By default it uses the node count to determine the `device:instance` string. For HPU, we need to use the current device to get the device instance.
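
A sketch of the special-casing described above (the signature and the HPU call are assumptions for illustration, not copied from the diff; `torch.hpu` is provided by the Habana plugin):

```
import torch

def _get_remote_device_str(rank: int, device_type: str, num_devices_per_node: int) -> str:
    if device_type.lower() == "cpu":
        # CPU placements carry no device index.
        return f"rank:{rank}/{device_type}"
    elif device_type.lower() == "hpu":
        # Use the currently selected device instead of rank % devices-per-node.
        return f"rank:{rank}/{device_type}:{torch.hpu.current_device()}"
    else:
        return f"rank:{rank}/{device_type}:{rank % num_devices_per_node}"
```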

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132120
Approved by: https://github.com/awgu
2024-07-30 14:24:24 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Shawn Xu
5b8c81eb82 [PT] [FSDP] fix HSDP sharding placement (#123778)
Summary:
https://github.com/pytorch/pytorch/pull/123230 formalized the contract for `ShardedTensor` sub-group rank placement validation by making sure the placement rank is the global rank, in line with the general `torch.distributed` convention.

The current HSDP allows for both `ShardedTensor` and `DTensor`. While `DTensor` will eventually replace `ShardedTensor`, it is still in use, and there's at least one test verifying the state dict with ST output.

This got broken because the test only runs periodically, so it didn't block the other PR.
Fixes [#123749](https://github.com/pytorch/pytorch/issues/123749)
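
A minimal sketch of the convention being restored (the helper is hypothetical; `dist.get_global_rank` is the public API for mapping a subgroup rank to a global rank):

```
import torch.distributed as dist

# Hypothetical helper: ShardedTensor placements built from a sharding subgroup
# must record the global rank, not the rank within the subgroup.
def _shard_placement_str(shard_pg, group_rank: int, device: str) -> str:
    global_rank = dist.get_global_rank(shard_pg, group_rank)
    return f"rank:{global_rank}/{device}"
```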

Test Plan: CI

Differential Revision: D55991256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123778
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-04-12 00:05:49 +00:00
Wanchao Liang
0fef82b3df [dcp] fix fsdp state_dict to use run_check=False (#114995)
`from_local` with a replicate placement runs `mesh_broadcast` when
`run_check=True`, which is the default. For the FSDP state_dict case we know
for sure that these tensors are already replicas, so we don't need to check or
force-check them.
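
A sketch of the resulting call pattern (`local_tensor` and `device_mesh` are placeholders):

```
from torch.distributed._tensor import DTensor, Replicate

# `local_tensor` is this rank's replica and `device_mesh` the FSDP mesh.
dtensor = DTensor.from_local(
    local_tensor,
    device_mesh,
    placements=[Replicate()],
    run_check=False,  # skip mesh_broadcast; FSDP already guarantees replicas
)
```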

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114995
Approved by: https://github.com/fegin, https://github.com/XilunWu, https://github.com/wz337
2023-12-02 04:16:37 +00:00
Chien-Chin Huang
a66f2a1b99 [state_dict] Move _gather_state_dict to dcp module (#112835)
This API is used by more than just FSDP, so this PR moves it to the DCP module.

Differential Revision: [D50962966](https://our.internmc.facebook.com/intern/diff/D50962966/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112835
Approved by: https://github.com/wz337
2023-11-08 19:42:56 +00:00
Iris Zhang
c84dbd2c03 [2D] Enable 2D optimizer set_state_dict() (#111778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111778
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #111774
2023-10-27 04:33:00 +00:00
PyTorch MergeBot
d8e19bb03a Revert "[2D] Enable 2D optimizer set_state_dict() (#111778)"
This reverts commit 52eec50d31.

Reverted https://github.com/pytorch/pytorch/pull/111778 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing the multigpu test in trunk 52eec50d31 ([comment](https://github.com/pytorch/pytorch/pull/111778#issuecomment-1780227820))
2023-10-26 00:18:30 +00:00
wz337
52eec50d31 [2D] Enable 2D optimizer set_state_dict() (#111778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111778
Approved by: https://github.com/fegin
ghstack dependencies: #111774
2023-10-25 04:27:13 +00:00
wz337
80dfc974dd [2D] Enable 2D FSDP+TP model.load_state_dict() (#110925)
This PR adds an all_gather_dtensor() method to fsdp/_fsdp_extensions.py and the actual implementation in tensor/parallel/fsdp.py. This enables FSDP to load a 2D DTensor state_dict into the model when calling `model.load_state_dict()`.
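
A rough sketch of what such a hook has to do (illustrative only, not the PR's implementation):

```
import torch
from torch.distributed._tensor import DTensor, Replicate

# Illustrative only: gather a 2D (FSDP x TP) DTensor back into a full local
# tensor so FSDP can copy it into the flat parameter during load_state_dict().
def all_gather_dtensor_sketch(tensor: DTensor) -> torch.Tensor:
    placements = [Replicate()] * tensor.device_mesh.ndim
    return tensor.redistribute(tensor.device_mesh, placements).to_local()
```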

cc. @fegin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110925
Approved by: https://github.com/fegin
ghstack dependencies: #110831, #110846
2023-10-11 18:22:20 +00:00
wz337
6c136c3302 [2D] Enable 2D DTensor state_dict for FSDP + TP (#110846)
This PR adds a `chunk_dtensor()` method to fsdp/_fsdp_extensions.py and the actual implementation of `chunk_dtensor()` in tensor/parallel/fsdp.py. This enables FSDP to return a 2D DTensor state_dict when composing FSDP with TP.
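
A rough sketch of the shape of such a hook (illustrative only; the placements below assume the parameter is row-wise sharded on both mesh dims, whereas the real extension derives them from the TP layout):

```
import torch
from torch.distributed._tensor import DTensor, Shard

# Illustrative only: wrap this rank's local FSDP shard of a TP-sharded
# parameter as a DTensor over the 2D (dp, tp) mesh, so the sharded state_dict
# entry records both levels of sharding.
def chunk_dtensor_sketch(local_shard: torch.Tensor, mesh_2d) -> DTensor:
    return DTensor.from_local(
        local_shard, mesh_2d, placements=[Shard(0), Shard(0)], run_check=False
    )
```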

cc. @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110846
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #110831
2023-10-11 17:40:39 +00:00
wz337
d9eb5a57aa [FSDP] Change _create_chunk_dtensor in fsdp/_shard_utils.py to use public API from DTensor (#110831)
This PR:
1) updates _create_chunk_dtensor() in _shard_utils.py to use public APIs from DTensor. This avoids the global_size calculation error from using DTensor.from_local() for uneven-sharded parameters, as described in https://github.com/pytorch/pytorch/issues/110762 and illustrated in the sketch after this list.
2) updates test/distributed/fsdp/test_fsdp_dtensor_state_dict.py to include a unit test for a model with uneven sharding.
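
As referenced in item 1, a toy illustration of the uneven-sharding pitfall (this is not the PR's code; import paths follow current PyTorch and a 2-rank process group is assumed to be initialized):

```
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# For a length-5 parameter sharded over 2 ranks, inferring the global shape
# from the local shard (3 rows on rank 0) would give 6 rows instead of 5.
# Distributing from the full tensor keeps the true global shape.
device_mesh = init_device_mesh("cuda", (2,))
full_param = torch.arange(5.0)
dtensor = distribute_tensor(full_param, device_mesh, placements=[Shard(0)])
assert dtensor.shape == full_param.shape  # global shape preserved
```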

cc. @wanchaol, @fegin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110831
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-10-10 21:04:27 +00:00
Chien-Chin Huang
88616349d7 [state_dict][1/N] Implement the basic functions of distributed.checkpoint._state_dict (#105902)
This PR implements the basic functions of distributed.checkpoint._state_dict. It currently also contains the flattening of the optimizer state_dict, which makes the PR quite large; a later version may split it into two for easier code review.

Differential Revision: [D47647719](https://our.internmc.facebook.com/intern/diff/D47647719/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D47647719/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105902
Approved by: https://github.com/wz337
2023-10-05 20:04:15 +00:00
Chien-Chin Huang
1b3e5b53f3 [FSDP][optim_state_dict] Add device to _shard_utils.py to explicitly use the device from fsdp_state (#109631)
_get_pg_default_device does not always return the device we want. This PR lets the user explicitly tell us the correct device.

Differential Revision: [D49425743](https://our.internmc.facebook.com/intern/diff/D49425743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109631
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/wz337
2023-09-20 01:59:38 +00:00
wz337
66af4f6ec7 [HSDP] Add device_mesh to FSDP kwarg and add dtensor state_dict support for HSDP (#107533)
This PR:
1) Adds a device_mesh kwarg to FSDP and removes init_device_mesh() from _runtime_utils.py, as the device_mesh is now passed in by the user as a kwarg (see the usage sketch after this list).
2) Changes the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If a device_mesh is used with a sharded model/optim state dict, the _use_dtensor flag is set to True and the model/optim state dict returns a DTensor state_dict. Otherwise, the _use_dtensor flag is set to False and the model/optim state dict returns a sharded_tensor state_dict.
3) Updates _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return a 2D DTensor state_dict.
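
A usage sketch of item 1 (mesh shape, dim names, and `model` are placeholders; import paths follow current PyTorch):

```
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# 2 replica groups x 4 shard ranks; passing the 2D mesh enables HSDP and makes
# sharded model/optim state_dicts come back as DTensors.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
sharded_model = FSDP(
    model,  # placeholder: the module to wrap
    device_mesh=mesh_2d,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```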

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
2023-09-05 21:21:21 +00:00
Chien-Chin Huang
591cb776af [FSDP][state_dict][optim_state_dict] Log slow optim and model state_dict paths (#108290)
This PR adds SimpleProfiler for FSDP state_dict/load_state_dict logging purposes. SimpleProfiler uses class variables to record profiling results and does everything in Python, which can be slow, so it is only suitable for logging slow actions such as initialization and state_dict/load_state_dict.

This PR uses SimpleProfiler to log some critical/slow paths of the model and optimizer state_dict/load_state_dict.
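
A minimal sketch of the kind of class-variable profiler described above (illustrative, not the actual implementation):

```
import time
from collections import defaultdict
from contextlib import contextmanager

# Results accumulate in class variables and all timing happens in Python, so
# this is only appropriate for slow paths such as initialization and
# state_dict/load_state_dict.
class SimpleProfiler:
    results = defaultdict(float)

    @classmethod
    @contextmanager
    def profile(cls, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            cls.results[name] += time.monotonic() - start
```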

Differential Revision: [D48774406](https://our.internmc.facebook.com/intern/diff/D48774406/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108290
Approved by: https://github.com/wz337
2023-09-01 06:57:59 +00:00
PyTorch MergeBot
ab5b4c4419 Revert "[HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533)"
This reverts commit cc220e45a8.

Reverted https://github.com/pytorch/pytorch/pull/107533 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it is failing in trunk with the same failure on test_dynamo_distributed cc220e45a8 ([comment](https://github.com/pytorch/pytorch/pull/107533#issuecomment-1701983247))
2023-09-01 01:26:30 +00:00
wz337
cc220e45a8 [HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533)
This PR:
1) Adds a device_mesh kwarg to FSDP and removes init_device_mesh() from _runtime_utils.py, as the device_mesh is now passed in by the user as a kwarg.
2) Changes the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If a device_mesh is used with a sharded model/optim state dict, the _use_dtensor flag is set to True and the model/optim state dict returns a DTensor state_dict. Otherwise, the _use_dtensor flag is set to False and the model/optim state dict returns a sharded_tensor state_dict.
3) Updates _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return a 2D DTensor state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
2023-09-01 00:15:00 +00:00
Iris
6b2d48e78c [8/n][FSDP] make use_dtensor=True work with offload_to_cpu=True for optim.load_state_dict() (#105690)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105690
Approved by: https://github.com/fegin
2023-07-21 18:55:01 +00:00
Iris
613970eb05 [5/n][FSDP] Update _sharded_post_state_dict_hook to use DTensor when use_dtensor=True in state_dict_config (#103921)
This allows us to use use_dtensor=True in ShardedStateDictConfig() before calling model.state_dict().

The load_state_dict hook updates will be in the next PR.
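
A usage sketch of the flag described above (kwarg spelling taken from this description; a later PR in this list makes the flag private, and `model` is a placeholder for an FSDP-wrapped module):

```
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardedStateDictConfig,
    StateDictType,
)

FSDP.set_state_dict_type(
    model,
    StateDictType.SHARDED_STATE_DICT,
    state_dict_config=ShardedStateDictConfig(use_dtensor=True),
)
sharded_sd = model.state_dict()  # values come back as DTensors
```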

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103921
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-06-22 08:32:19 +00:00
Rodrigo Kumpera
f83ebfe1bb [FSDP] Improve support for CPU tensors. (#103171)
Don't emit a device index when using CPU devices.
Don't call Tensor::record_stream, as it's a CUDA-only op.
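
A sketch of the kind of guard the second point implies (illustrative):

```
import torch

def _maybe_record_stream(tensor: torch.Tensor, stream: torch.cuda.Stream) -> None:
    # record_stream is a CUDA-only op, so skip it for CPU tensors.
    if tensor.is_cuda:
        tensor.record_stream(stream)
```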
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103171
Approved by: https://github.com/rohan-varma, https://github.com/wz337
2023-06-20 21:08:19 +00:00
Iris
d991ce6da3 [FSDP][3/N]_shard_utils update for dtensor state_dict support (#103479)
Same as https://github.com/pytorch/pytorch/pull/102545 (this branch is corrupted so have to re-submit).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103479
Approved by: https://github.com/fegin
2023-06-14 06:45:28 +00:00
medivh-xp
8b7bd81902 determined collective device by _get_pg_default_device rather than explicit cuda (#101533)
There are many communication operations for ShardedTensor in the FSDP state dict. They use the externally passed-in pg (or the default pg), which currently supports CUDA devices. Before communication, the memory is moved to CUDA, and this is implicit (because it is essentially moving data to the memory type required by the pg, not to the computing device type). Similarly, when users run FSDP on a custom backend, they pass in a custom pg (which does not support CUDA devices), which may cause FSDP to not work properly in some cases. This PR obtains the memory type supported by the pg through _get_pg_default_device during communication and moves the data to it when needed.
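
A sketch of the pattern (`_get_pg_default_device` is private API; `process_group` and `local_tensor` are placeholders):

```
from torch.distributed.distributed_c10d import _get_pg_default_device

# Move the tensor to whatever device the process group communicates on,
# instead of hard-coding CUDA.
pg_device = _get_pg_default_device(process_group)
if local_tensor.device != pg_device:
    local_tensor = local_tensor.to(pg_device)
```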
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101533
Approved by: https://github.com/awgu
2023-05-24 13:48:43 +00:00
Chien-Chin Huang
4f62e7cb10 [FSDP][BE] Remove unused code (#99731)
Remove the unused code. https://github.com/pytorch/pytorch/pull/99675 is a duplicate, so we should land this PR instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99731
Approved by: https://github.com/wz337
2023-04-21 23:11:37 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Andrew Gu
e3cf81e0a7 [FSDP] ufmt /fsdp (#87811)
This applies `ufmt` to all of the FSDP files in the `torch/distributed/fsdp/` directory.

**Test Plan**
CI

**Notes**
For VSCode users,
- Install `ufmt`: https://pypi.org/project/ufmt/
- Install VSCode `ufmt` extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt
- Include in `settings.json`:
```
{
    "[python]": {
        "editor.defaultFormatter": "omnilib.ufmt",
        "editor.formatOnSave": true,
    },
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87811
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2022-10-27 04:25:55 +00:00
Andrew Gu
ff71f45788 [FSDP] Add FSDPExtensions for TP support (#85039)
This adds `FSDPExtensions` to enable TP + FSDP composability. To be agnostic to both `ShardedTensor` and `DistributedTensor`, the design relies on customizable hooks.

Some notes:
- I preferred the `_ext` prefix (short for "extension") over `_param_extension` simply because it is shorter. It should not matter much because it is purely internal facing.
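
A rough sketch of what a hook-based extension can look like (the method names below are illustrative, not copied from the PR):

```
from abc import ABC, abstractmethod
from typing import Any, Optional, Tuple

import torch

# A TP implementation of these hooks decides how a parameter is transformed
# before flattening and how it is chunked for the sharded state_dict, keeping
# FSDP itself agnostic to ShardedTensor vs. DistributedTensor.
class FSDPExtensions(ABC):
    @abstractmethod
    def pre_flatten_transform(self, tensor: torch.Tensor) -> Tuple[torch.Tensor, Optional[Any]]:
        """Strip the TP wrapper off a parameter and remember how to restore it."""

    @abstractmethod
    def post_unflatten_transform(self, tensor: torch.Tensor, extension: Any) -> torch.Tensor:
        """Re-apply the TP wrapper to an unflattened view."""

    @abstractmethod
    def chunk_tensor(self, tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        """Chunk a tensor for this rank's entry in the sharded state_dict."""
```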
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85039
Approved by: https://github.com/kumpera, https://github.com/fegin
2022-09-28 18:34:17 +00:00
Chien-Chin Huang
3e1fc85b23 [FSDP] Implement sharded_optim_state_dict and flatten_sharded_optim_state_dict. (#77628)
As title

Differential Revision: [D36436496](https://our.internmc.facebook.com/intern/diff/D36436496/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77628
Approved by: https://github.com/awgu
2022-08-18 16:38:58 +00:00
Chien-Chin Huang
244690205f [FSDP] Use _init_from_local_tensor to create ShardedTensor to avoid communication overhead (#82911)
FSDP originally uses `_init_from_local_shards_and_global_metadata()` to create a ShardedTensor for sharded_state_dict(). We have seen some non-trivial overhead when the number of tensors is large. Using `_init_from_local_tensor` can significantly reduce the overhead. For a model with ~250 tensors in the state_dict trained with 16 GPUs, the original `sharded_state_dict` takes ~1.7 seconds and this PR reduces the overhead to ~0.6 seconds.

Differential Revision: [D38452170](https://our.internmc.facebook.com/intern/diff/D38452170/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82911
Approved by: https://github.com/awgu
2022-08-17 16:40:20 +00:00