Commit Graph

18 Commits

Author SHA1 Message Date
Wanchao Liang
7f71f2a997 [dtensor] improve docs and comments (#132683)
As titled: fix typos in various comments and improve the
public documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132683
Approved by: https://github.com/XilunWu
ghstack dependencies: #131210, #132682
2024-08-08 09:24:58 +00:00
Xuehai Pan
cec31050b4 [BE][Easy] enable UFMT for torch/distributed/{tensor,_tensor}/ (#128868)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128868
Approved by: https://github.com/fegin
2024-06-18 21:49:02 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Wanchao Liang
2c9a420da3 [dtensor] move some modules to private namespace (#127339)
As titled: move some modules that are mainly for DTensor's private use
into a private namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127339
Approved by: https://github.com/awgu
ghstack dependencies: #127338
2024-05-29 05:18:47 +00:00
Mark Saroufim
3407899ba1 DTensor Fused ADAM (#125369)
Fixes https://github.com/pytorch/pytorch/issues/124633 https://github.com/pytorch/ao/issues/205

```
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adamw_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 9 deselected / 1 selected
Running 1 items in this shard

test/distributed/_tensor/test_optimizers.py .

=============================================================================== 1 passed, 9 deselected in 5.95s ================================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adam_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 7 deselected / 3 selected
Running 3 items in this shard

test/distributed/_tensor/test_optimizers.py ...

=============================================================================== 3 passed, 7 deselected in 10.79s ===============================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125369
Approved by: https://github.com/wanchaol
2024-05-07 00:08:09 +00:00
Wanchao Liang
08460f4bae [tp] remove deprecated tp_mesh_dim arg (#121432)
This PR removes the deprecated tp_mesh_dim arg to prepare for release.
The arg has been deprecated for a while (it emitted deprecation
messages), so we should remove it before the release.
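
The deprecate-then-remove cycle described above can be sketched roughly as follows. This is a minimal illustration, not the actual Tensor Parallel code: the function names `parallelize_with_deprecated_arg` and `parallelize_after_removal` and their bodies are hypothetical, and only the `tp_mesh_dim` argument name comes from the commit message.

```python
import warnings


def parallelize_with_deprecated_arg(module_name, device_mesh, tp_mesh_dim=None):
    """Hypothetical API during the deprecation phase."""
    if tp_mesh_dim is not None:
        # The argument is still accepted, but the caller is warned.
        warnings.warn(
            "tp_mesh_dim is deprecated and will be removed in a future release",
            DeprecationWarning,
            stacklevel=2,
        )
    return f"{module_name} parallelized over {device_mesh}"


def parallelize_after_removal(module_name, device_mesh):
    """Hypothetical API after removal: tp_mesh_dim is no longer a parameter."""
    return f"{module_name} parallelized over {device_mesh}"
```

Once the parameter is gone, a caller that still passes `tp_mesh_dim=...` gets a `TypeError` for an unexpected keyword argument, which is why the warning phase comes first.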

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
2024-03-08 17:46:44 +00:00
Iris Zhang (PyTorch)
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public classes and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available (torch.distributed.is_available()).

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals had passed. Shipit added the "ci/trunk" label to the PR but did NOT wait for it and went ahead with committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
Nikita Shulga
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d6.
2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS because of the change in import order in torch/distributed/fsdp/_common_utils.py. Since the original import still works, we removed the changes to this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00
PyTorch MergeBot
3a2e2044cd Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)"
This reverts commit 729ac7317a.

Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))
2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)
729ac7317a [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)
Summary:

Same content of changes as https://github.com/pytorch/pytorch/pull/114710

Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.
ghstack-source-id: 208980207
exported-using-ghexport

Test Plan: CI.

Reviewed By: wanchaol

Differential Revision: D51629761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-12-02 04:39:41 +00:00
alanhe151220037
1afbc985fe Make RNGStateTracker support cuda-like device (#106771)
Replace `CudaRNGStateTracker` with `RNGStateTracker` by rewriting some CUDA-bound code to use `device_handle`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106771
Approved by: https://github.com/wanchaol
2023-08-10 19:14:33 +00:00
Xilun Wu
e799f565eb [DTensor][TP][Random] Introduce TensorParallelRNGTracker to integrate parallel RNG state with Tensor Parallel (#103910)
This PR enables the automatic use of `TensorParallelRNGTracker` in the Tensor Parallel API. Unit tests covering this will be added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103910
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-06-30 08:06:41 +00:00
Xilun Wu
a66107a30c [DTensor][Random] Introduce CudaRNGStateTracker to maintain parallel RNG state for DTensor (#103235)
# Change
This PR adds two classes to DTensor:

1. `CudaRNGStateTracker`:  `CudaRNGStateTracker` stores Random Number Generator (RNG) state (a `ByteTensor` object) in a `dict`, mapping from a corresponding tag to each state tensor. It also provides a set of convenient utility methods to help access/modify the state tensors. The most important interface is `_distribute_region` which will be used when DTensor executes a random op (an operator that calls RNG).

2. `OffsetBasedRNGTracker`: This subclass of `CudaRNGStateTracker` defines the default policy of how RNG states should be shared and synchronized among all ranks to respect the semantics of DTensor random operators.
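
The tag-to-state bookkeeping described in item 1 can be sketched as below. This is a toy model under loud assumptions: it uses Python's stdlib `random` generator in place of CUDA RNG state, and the class `ToyRNGStateTracker` and its method names are hypothetical (only loosely modeled on `_distribute_region`). It does not implement the offset-based sharing policy of item 2.

```python
import contextlib
import random


class ToyRNGStateTracker:
    """Toy tracker: maps a tag to a saved RNG state (stdlib `random`, not CUDA)."""

    def __init__(self):
        self.rng_states = {}

    def set_seed(self, tag, seed):
        # Seed a scratch generator and record its state under the tag.
        self.rng_states[tag] = random.Random(seed).getstate()

    @contextlib.contextmanager
    def distribute_region(self, tag):
        # Swap in the tracked state, let the random op run, then store the
        # advanced state back under the tag and restore the caller's state.
        outer_state = random.getstate()
        random.setstate(self.rng_states[tag])
        try:
            yield
        finally:
            self.rng_states[tag] = random.getstate()
            random.setstate(outer_state)


tracker = ToyRNGStateTracker()
tracker.set_seed("parallel-rng", 1234)
with tracker.distribute_region("parallel-rng"):
    sample = random.random()  # drawn from the tracked state, not the global one
```

Because the tracked state is written back on exit, a second region under the same tag continues the same random stream, while random calls outside the region are unaffected.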

# Warning

- With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` will be shared among threads (ranks) and cause issues. We need to figure out a compatible solution for that.

- The RNG state may be out of sync on ranks outside the participating ranks. This is harmless in our current submesh use case, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235
Approved by: https://github.com/wanchaol
2023-06-27 19:00:25 +00:00
shaoyf42
17737f9d0e [DTensor] Allow DTensor support cuda-like device (#102468)
Allow DTensor to support cuda-like devices; fixes https://github.com/pytorch/pytorch/issues/102442

Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example https://github.com/pytorch/pytorch/pull/101914 and https://github.com/pytorch/pytorch/issues/101911. However, that support covers only a portion of third-party devices and does not serve third-party cuda-like devices well. Therefore, we would like to extend DTensor to support cuda-like devices; after all, cuda is so popular!

1. Similar to what is done here, we need to initialize the communication backend for the device set by DeviceMesh, so `_default_backend_for_device` is added to `Backend`. Note that when registering a new backend for a device other than cpu and cuda, we also need to add a new default backend for that device.
2. Add `_device_handle` to `DeviceMesh` for cuda-like devices, similar to what is set in FSDP. When `_device_handle` is not None, the device behaves similarly to `cuda`. Accordingly, calls like `torch.cuda.device_count()` need to be changed to `device_mesh._device_handle.device_count()`.
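
The two points above can be illustrated with a small, torch-free sketch. Everything here is an assumption made for illustration: `fake_torch` stands in for the `torch` module, `my_device` and `my_ccl` are made-up names, and the helper functions are hypothetical. Only the idea of a per-device default backend table and a device handle that replaces hard-coded `torch.cuda` calls comes from the commit message.

```python
from types import SimpleNamespace

# Stand-in for the `torch` module; device names and counts are made up.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(device_count=lambda: 8),
    my_device=SimpleNamespace(device_count=lambda: 4),
)

# Default communication backend per device type. "my_device" is not listed
# until a backend is registered for it.
default_backend_for_device = {"cpu": "gloo", "cuda": "nccl"}


def register_backend(device_type, backend_name):
    # Registering a backend for a new device also records its default.
    default_backend_for_device[device_type] = backend_name


def get_device_handle(device_type):
    # None means the device has no cuda-like module (e.g. "cpu").
    return getattr(fake_torch, device_type, None)


def device_count(device_type):
    # Device-agnostic replacement for a hard-coded cuda device_count() call.
    handle = get_device_handle(device_type)
    return handle.device_count() if handle is not None else None


register_backend("my_device", "my_ccl")
```

With the handle looked up by device type, the same calling code works for `cuda` and for any third-party device that exposes a cuda-like module.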
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102468
Approved by: https://github.com/wanchaol
2023-06-07 23:13:53 +00:00
Wanchao Liang
ff58d19c89 DeviceMesh use dispatchable PG to support custom backend (#102336)
This PR switches DeviceMesh to use a dispatchable process group instead.
This could make backend integration easier, as users only need to
integrate with the c10d process group custom backend, without needing to
change DeviceMesh to plug in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102336
Approved by: https://github.com/fduwjj
2023-05-30 19:22:37 +00:00
Xilun Wu
e686a1e1b3 [DTensor][2/N] add Philox offset adjustment logic in operator_dispatch (#98199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98199
Approved by: https://github.com/wanchaol
2023-04-10 23:57:04 +00:00
Xilun Wu
67963c32bd [DTensor][1/N] add DTensor RNG state APIs (#98198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98198
Approved by: https://github.com/wanchaol
2023-04-10 23:57:00 +00:00