Commit Graph

2331 Commits

Author SHA1 Message Date
Rohan Varma
de370eb313 [Distributed] Small nits to apply_optimizer_in_backward (#110903)
Clarify a few things in the documentation.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110903
Approved by: https://github.com/janeyx99
2023-10-11 07:45:45 +00:00
wz337
a614281ea9 Add current_device() to torch.cpu (#110987)
To better support device-agnostic code, add a `current_device()` to torch.cpu that returns "cpu", so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
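A minimal device-agnostic sketch of the pattern this enables (the helper name here is illustrative, not part of the PR):

```
import torch

def current_device_agnostic(device_type: str):
    # torch.cpu.current_device() now returns "cpu"; torch.cuda.current_device()
    # returns an integer index as before.
    mod = torch.cuda if device_type == "cuda" else torch.cpu
    return mod.current_device()

print(current_device_agnostic("cpu"))  # -> "cpu"
```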

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
2023-10-11 05:13:10 +00:00
PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
wz337
d9eb5a57aa [FSDP] Change _create_chunk_dtensor in fsdp/_shard_utils.py to use public API from DTensor (#110831)
This PR:
1) updates _create_chunk_dtensor() in _shard_utils.py to use public APIs from DTensor, which avoids the global_size calculation error from using DTensor.from_local() for unevenly sharded parameters, as described in https://github.com/pytorch/pytorch/issues/110762 (see the sketch after this list)
2) updates test/distributed/fsdp/test_fsdp_dtensor_state_dict.py to include a unit test for a model with uneven sharding.
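One plausible shape of the public-API approach, as a hedged sketch; distribute_tensor is shown for illustration, and the actual helper in _shard_utils.py may differ:

```
import torch
from torch.distributed._tensor import distribute_tensor, Shard

# Shard a full tensor along dim 0 via the public API; distribute_tensor
# computes the correct global size even for unevenly divisible shapes,
# unlike hand-rolled DTensor.from_local() bookkeeping.
# Assumes `device_mesh` is an initialized 1-D DeviceMesh.
def create_chunk_dtensor(full_tensor: torch.Tensor, device_mesh):
    return distribute_tensor(full_tensor, device_mesh, [Shard(0)])
```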

cc. @wanchaol, @fegin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110831
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-10-10 21:04:27 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D so that our users can detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads events from a C++ queue and dispatches them to the registered callbacks.

Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown and run it as a background thread, which is not possible with more conventional choices such as a condition variable.
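A hypothetical usage sketch; the hook signatures are not shown in this log, so the event argument below is an assumption:

```
import torch.distributed.hooks as dist_hooks

def on_collective_start(event):
    # `event` contents (op name, PG, ranks, ...) are assumed, not confirmed here.
    print("collective started:", event)

def on_pg_created(event):
    print("process group created:", event)

dist_hooks.register_collective_start_hook(on_collective_start)
dist_hooks.register_process_group_hook(on_pg_created)
```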

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
Chien-Chin Huang
7b25c2b90e [FSDP][optim_state_dict] Move local optimizer state to FSDP compute_device (#110929)
This will ensure all the tensors are on the FSDP compute_device.

Differential Revision: [D50059492](https://our.internmc.facebook.com/intern/diff/D50059492/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110929
Approved by: https://github.com/wz337
2023-10-10 10:34:31 +00:00
Michael Voznesensky
fb68aa0a92 [Easy] Remove unused return type from utils (#110887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110887
Approved by: https://github.com/ezyang
2023-10-10 09:02:11 +00:00
Edward Z. Yang
de3ae93e9b Include rank of default PG in C++ log messages (#110623)
I tested by adding some warning logs in C++, running a distributed program, and checking that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging, so I couldn't easily add a unit test.

The implementation strategy is to set up a global variable in C++ and then poke it when we initialize a process group. This was the simplest approach I could think of that would work.

This PR only works for non-glog logging. We probably need to come up with some other strategy for glog, e.g., a custom prefix, but we need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, so I will leave it as follow-up work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-10 00:26:52 +00:00
Wanchao Liang
28d7d7fc42 device agnostic: torch.cpu.set_device (#110716)
To support device-agnostic code, add a dummy `set_device` placeholder in torch.cpu.
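A short sketch of what the placeholder enables, assuming it is a no-op on CPU (the helper name is illustrative):

```
import torch

def set_device_agnostic(device_type: str, index: int) -> None:
    mod = torch.cuda if device_type == "cuda" else torch.cpu
    mod.set_device(index)  # no-op on CPU, switches the active device on CUDA

set_device_agnostic("cpu", 0)
```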

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110716
Approved by: https://github.com/albanD
2023-10-09 23:00:15 +00:00
Wanchao Liang
2a76c7f018 [dtensor] skip move to device when device_type match (#110774)
Skip tensor.to() in from_local and distribute_tensor when the device_type of the device mesh matches the tensor's device type. Since from_local is on the critical path of TP, this should also reduce some overhead.
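A minimal sketch of the optimization, assuming `tensor` and `device_mesh` as they appear inside from_local:

```
# Only pay for the device move when the device types actually differ.
if tensor.device.type != device_mesh.device_type:
    tensor = tensor.to(device_mesh.device_type)
```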
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110774
Approved by: https://github.com/fduwjj
2023-10-09 19:39:11 +00:00
Kazuaki Ishizaki
b5f9696d81 Fix typo under torch directory (#110824)
This PR fixes occurrences of the typo `the the` in comments and exception messages in files under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824
Approved by: https://github.com/H-Huang
2023-10-09 19:16:43 +00:00
Wanchao Liang
459cef8649 switch dtensor and functional collective to use optree (#110670)
optree recently landed and provides quite good perf; conditionally import optree if it is installed.

Some numbers testing an MLP layer with TP + functional collectives:
before this PR: 10.390ms
after this PR: 9.189ms

That is around a 10% end-to-end CPU overhead reduction.
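A hedged sketch of the conditional-import pattern described; the fallback module and flag names here are illustrative, not the actual PyTorch source:

```
try:
    import optree as _tree  # C-accelerated pytree, if available
    HAS_OPTREE = True
except ImportError:
    import torch.utils._pytree as _tree  # pure-Python fallback
    HAS_OPTREE = False

leaves, spec = _tree.tree_flatten({"w": 1, "b": [2, 3]})
```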

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110670
Approved by: https://github.com/fegin
2023-10-08 03:05:39 +00:00
fduwjj
2dc5e166a5 [TP][Inference] Enable DTensor TP inference (#110751)
In https://github.com/pytorch/pytorch/pull/109977, we observed that in inference mode, aten.Linear does not get decomposed. So instead of enabling sharding propagation for the linear op, we use func.decompose so that it gets decomposed into matmul and mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110751
Approved by: https://github.com/bdhirsh, https://github.com/wanchaol
2023-10-07 18:57:27 +00:00
Chien-Chin Huang
90bf6e3938 [FSDP][optim_state_dict] Enable cpu_offload config for optimzer state_dict (#108434)
We had the cpu_offload option but never used it, as the optimizer state_dict offloads the tensors to CPU by default. This is usually what users want, since the tensors eventually have to be moved to CPU anyway. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets the optimizer state_dict read the flag.
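A hedged sketch of turning offloading off for debugging via the FSDP state-dict configs (exact flag defaults may differ by version; `model` and `optim` are assumed from the surrounding setup):

```
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import (
    StateDictType,
    FullStateDictConfig,
    FullOptimStateDictConfig,
)

# Keep optimizer states on their original device instead of offloading to CPU.
FSDP.set_state_dict_type(
    model,
    StateDictType.FULL_STATE_DICT,
    FullStateDictConfig(offload_to_cpu=False),
    FullOptimStateDictConfig(offload_to_cpu=False),
)
osd = FSDP.optim_state_dict(model, optim)
```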

Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
2023-10-07 01:14:49 +00:00
Wanchao Liang
1c97808f81 [dtensor] support lt/gt op (#110585)
This PR enables the lt/gt aten ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110585
Approved by: https://github.com/fduwjj
ghstack dependencies: #110584
2023-10-07 00:06:36 +00:00
Wanchao Liang
9378a2ceda [dtensor] support aten.where and enable implicit scalar promotion (#110584)
This PR adds support for aten.where and enables implicit scalar promotion: when we meet scalar tensors in the dispatching logic, we implicitly convert them to replicated DTensors.

The latter also enables a bunch of ops in the op db to pass.
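An illustrative sketch of the behavior, assuming `mesh` is an initialized 1-D DeviceMesh in a running process group:

```
import torch
from torch.distributed._tensor import distribute_tensor, Shard

x = distribute_tensor(torch.randn(8), mesh, [Shard(0)])
# The plain scalar tensor below is implicitly promoted to a replicated DTensor.
y = torch.where(x > 0, x, torch.tensor(0.0))
```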
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110584
Approved by: https://github.com/fduwjj
2023-10-07 00:06:36 +00:00
Yue Dong
e3bf5000a7 Hide the contiguous requirement for user input mesh when initializing DeviceMesh (#110628)
Summary:
As title, this diff hides the contiguous requirement for the user-provided input mesh when initializing DeviceMesh.

In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp"),
)
```

Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh

Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B  Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp"),
)
```
Without the change: exception thrown.
After this change: initialized successfully.

Differential Revision: D49942839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110628
Approved by: https://github.com/wanchaol, https://github.com/xw285cornell, https://github.com/fduwjj
2023-10-06 23:54:13 +00:00
PyTorch MergeBot
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
Rodrigo Kumpera
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D so that our users can detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads events from a C++ queue and dispatches them to the registered callbacks.

Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown and run it as a background thread, which is not possible with more conventional choices such as a condition variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
Jon Chuang
d279979102 perf(inductor): improve Adam compile times by shortcutting for loops (via has_complex) (#110607)
Adam part of: https://github.com/pytorch/pytorch/issues/110506

TODO:
- If this approach is validated as a good one, it can also be applied to all other optimizers that convert `complex` via list comprehensions

### Results:
`NUM_PARAMS=200, foreach=True`
- main: dynamo: 43s, inductor: 31s, total: 74s
- this PR: dynamo: 3.5s, inductor: 30s, total: 34s (dynamo speedup: 12.3x; overall speedup: 2.1x)

`NUM_PARAMS=1000, foreach=True, has_complex shortcut`:

```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function                              Runtimes (s)
------------------------------------  -------------------------------
_compile.<locals>.compile_inner       0.0329, 50.0806, 0.0041
OutputGraph.call_user_compiler        44.9924
```

`NUM_PARAMS=1000, foreach=True`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function                              Runtimes (s)
------------------------------------  -------------------------------
_compile.<locals>.compile_inner       0.0389, 58.6069, 0.0043
OutputGraph.call_user_compiler        44.1425
```

### Discussion
- The `has_complex` shortcut provides an additional 2x dynamo speedup; it is not, however, necessary for achieving a significant overall speedup. A sketch of the pattern follows.
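A hedged sketch of the shortcut pattern (names are illustrative, not the exact torch.optim source):

```
import torch

def _view_complex_as_real(tensors, has_complex: bool):
    # Shortcut: skip the per-tensor scan entirely when no param is complex,
    # so dynamo does not have to trace the list comprehension.
    if not has_complex:
        return tensors
    return [torch.view_as_real(t) if torch.is_complex(t) else t for t in tensors]
```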

CC: @janeyx99 @mlazos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110607
Approved by: https://github.com/janeyx99, https://github.com/lezcano
2023-10-06 05:08:49 +00:00
Jon Chuang
57e9969021 feat(optim): Add adadelta multi_tensor support for complex, with has_complex shortcut (#110631)
Partial fix: https://github.com/pytorch/pytorch/issues/110606

More on `has_complex` shortcut: https://github.com/pytorch/pytorch/pull/110613#issuecomment-1749314805

CC: @janeyx99, @mlazos, @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110631
Approved by: https://github.com/lezcano
2023-10-06 03:34:41 +00:00
Gufan Yin
5d963474aa Replace enforce_dtype with dtype in ShardedTensor.gather (#110561)
Summary:
Sometimes local_shards are empty on some ranks while out.dtype is float16, which will cause an error if enforce_dtype is True, because `data` will be float32.

Callers know best what dtype they want, so we can just let them decide.

Temporarily keep enforce_dtype for backward compatibility.
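A hedged usage sketch; the full new signature is not shown in this log, and `rank` and `sharded_tensor` are assumed from the surrounding setup:

```
import torch

# On rank 0, allocate the output in the dtype the caller wants; pass the same
# dtype to gather so empty-shard ranks do not fall back to float32.
out = torch.empty(16, 16, dtype=torch.float16, device="cuda") if rank == 0 else None
sharded_tensor.gather(dst=0, out=out, dtype=torch.float16)
```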

Test Plan: Run local and MAST job

Reviewed By: uciyc123

Differential Revision: D46886551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110561
Approved by: https://github.com/wanchaol, https://github.com/malfet
2023-10-05 23:16:23 +00:00
Edward Z. Yang
f274c7b32c Add functional collective all_to_all_single and support it in Inductor (#110195)
Copy of https://github.com/pytorch/pytorch/pull/106655 from yf225, rebased on top of the item() support changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110195
Approved by: https://github.com/Skylion007
2023-10-05 23:11:51 +00:00
Wanchao Liang
c95cf4b4c9 [dtensor] add grad placements kwarg to to_local API (#110629)
When we convert to a local tensor, DTensor can no longer track autograd or the gradient layout of the local tensor. If the user does something unexpected, there needs to be a way for the user to hint at the gradient layout of the local tensor.
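A minimal sketch of the new kwarg, assuming a 4-rank initialized process group and a 1-D mesh; the Shard(0) hint is illustrative:

```
import torch
from torch.distributed._tensor import DeviceMesh, distribute_tensor, Shard

mesh = DeviceMesh("cuda", torch.arange(4))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
# Hint DTensor that the gradient flowing back into the local tensor is
# laid out as a dim-0 shard.
local = dt.to_local(grad_placements=[Shard(0)])
```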
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
2023-10-05 21:34:01 +00:00
Chien-Chin Huang
88616349d7 [state_dict][1/N] Implement the basic functions of distributed.checkpoint._state_dict (#105902)
This PR implements the basic functions of distributed.checkpoint._state_dict. It currently includes the flattening of the optimizer state_dict, which makes the PR quite large; a later version may split it in two for easier code review.

Differential Revision: [D47647719](https://our.internmc.facebook.com/intern/diff/D47647719/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D47647719/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105902
Approved by: https://github.com/wz337
2023-10-05 20:04:15 +00:00
Chien-Chin Huang
1a729618ef [FSDP][optim_state_dict] Make the new optimizer allgather fusion work with fine-tuning models (#110540)
With use_orig_params=True, it is possible that some parameters belonging to the same FlatParameter are in the optimizer while other parameters are frozen. This PR makes the allgather fusion logic support this case.

Differential Revision: [D49922028](https://our.internmc.facebook.com/intern/diff/D49922028/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110540
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-10-05 15:17:10 +00:00
Mihir Patel
95c59b30b8 Update fully_sharded_data_parallel to fix typing (#110545)
Fixes typing so that the linter does not complain when using CustomPolicy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110545
Approved by: https://github.com/awgu, https://github.com/Skylion007
2023-10-05 00:00:10 +00:00
Fabrice Pont
053367b1ed fix: flake8-bugbear code B024 (#107265)
See #106571 item B024

This fix concerns the addition of `abstractmethod` to methods declared inside abstract classes.

Should I also include PEP8-compliant reformatting of the files I had to modify?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107265
Approved by: https://github.com/kit1980
2023-10-04 23:52:52 +00:00
Howard Huang
0949d97c16 fix batch_isend_irecv example incorrect usage (#110408)
Mismatched dtypes silently lead to wrong outputs in NCCL:

```
1:recv_tensor=tensor([0., 0.], device='cuda:1')
0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0')
```
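A hedged two-rank sketch with matched dtypes, in the spirit of the fixed example; assumes an initialized NCCL process group:

```
import torch
import torch.distributed as dist

rank = dist.get_rank()
# Both sides use float32; a dtype mismatch would corrupt values silently.
send_tensor = torch.arange(2, dtype=torch.float32, device=f"cuda:{rank}") + 2 * rank
recv_tensor = torch.zeros(2, dtype=torch.float32, device=f"cuda:{rank}")
send_op = dist.P2POp(dist.isend, send_tensor, (rank + 1) % 2)
recv_op = dist.P2POp(dist.irecv, recv_tensor, (rank - 1) % 2)
for req in dist.batch_isend_irecv([send_op, recv_op]):
    req.wait()
print(f"{rank}:recv_tensor={recv_tensor}")
```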

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408
Approved by: https://github.com/awgu, https://github.com/Neilblaze
2023-10-04 22:57:03 +00:00
Rohan Varma
40be6b72e1 [ez] Type function in distributed_c10d (#110435)
This function returns a `torch.device`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435
Approved by: https://github.com/awgu
2023-10-03 17:54:04 +00:00
Chien-Chin Huang
cdde899a73 [FSDP][optim_state_dict] Fuse allgather for optim_state_dict when use_orig_params is True (#108298)
The original implementation of `_gather_orig_param_state` is naive: it performs one allgather_object and two allgathers (if the optimizer is Adam) per FQN. This can be slow and makes `_optim_state_dict` a bottleneck.

This PR rewrites the implementation and fuses all the `allgather_object`s into one. The `allgather`s are fused based on the information in the FlatParameters, so there will be 2N `allgather`s, where N is the number of FlatParameters and 2 is due to Adam having 2 states per FQN.

One experiment on 8 A100 GPUs shows that the execution of the gathering improved from 3 seconds to 0.3 seconds.

Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
2023-10-02 20:57:08 +00:00
Wanchao Liang
26900d21c2 [dtensor] skip pytree when not necessary (#110132)
pytree is a great tool, but it is sometimes considered harmful for tensor subclasses: it is useful for implementing a subclass quickly, but it:
* exposes non-trivial CPU overhead
* is unneeded by many ops; only the ones with list/dict args need it
* has semantic issues for inplace/out ops when used to blindly re-wrap

This PR avoids using pytree for most ops during torch_dispatch and only enables it for certain ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110132
Approved by: https://github.com/fduwjj
2023-10-02 17:44:34 +00:00
wz337
a588648759 [DCP] Fix 'torch.cpu' has no attribute 'current_device' in checkpoint/optimizer.py (#110299)
When running on the "gloo" or "cpu:gloo,cuda:nccl" backend, it runs into the following error.

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
    optim_state = load_sharded_optimizer_state_dict(
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
    _alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
    device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```

This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit tests to include this backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
2023-10-01 21:54:13 +00:00
Rohan Varma
24e5d61af8 Log usage of optimizer in backward (#110206)
This will allow us to inspect and aggregate jobs that use optimizer-in-backward.

Differential Revision: [D48674740](https://our.internmc.facebook.com/intern/diff/D48674740/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110206
Approved by: https://github.com/awgu
2023-09-29 11:00:07 +00:00
Edwiv
7f5737392d [FSDP] fix: fix for fsdp exec order pre fwd record (#110138)
When sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch=True, during a validation-only run self.is_first_iter will always be True (because training=False, iter+1 is never executed). Additionally, the _pre_forward_order_index of the first handle entering record_pre_forward is 0, which causes that handle to evaluate the if condition at line 166 to False on subsequent entries into record_pre_forward (the expected value is True, because _pre_forward_order_index has in fact already been assigned). As a result, the first handle is repeatedly added to handles_pre_forward_order, leading to an incorrect prefetch order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110138
Approved by: https://github.com/awgu
2023-09-28 15:45:05 +00:00
Brian
e20c35a53b Allow public access for imports (#108914)
Fixes #108776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108914
Approved by: https://github.com/wanchaol
2023-09-28 06:05:59 +00:00
Matthew Hoffman
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes pylance issues for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00
Wanchao Liang
27443eadeb [dtensor][7/n] remove reduction rule (#109144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109144
Approved by: https://github.com/fduwjj
ghstack dependencies: #108263, #108264
2023-09-26 22:24:50 +00:00
Wanchao Liang
2dd9a79d22 [dtensor][6/n] refactor reduction to use op strategy (#108264)
This PR refactors the reduction op to use strategy based propagation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108264
Approved by: https://github.com/fduwjj
ghstack dependencies: #108263
2023-09-26 22:24:50 +00:00
Wanchao Liang
986d255db2 [dtensor][5/n] switch random ops to op strategy (#108263)
This PR switches the random ops to use op strategy instead of rule-based propagation; this is the first in a series of PRs to refactor ops after the op dispatch logic refactor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108263
Approved by: https://github.com/fduwjj
2023-09-26 22:24:42 +00:00
wz337
8140494afd [3/N][2D] Enable training with new 2D flow (#110034)
Replacing https://github.com/pytorch/pytorch/pull/109553 as it gets reverted.

This PR enables training with the new 2D flow and adds an associated test. In addition, it moves the FSDP-specific parts of tensor/parallel/_data_parallel_utils.py back to tensor/parallel/fsdp.py to avoid a circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py.

state_dict-related changes will come in later PRs.

cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034
Approved by: https://github.com/fduwjj
2023-09-26 09:14:15 +00:00
Aaron Gokaslan
6b39cf863f Fix invalid arg to getLogger in torch distributed checkpoint (#110008)
Ran the experimental LOG002 ruff check and found a bug in our codebase: a logger should not be instantiated from `__file__`; it should be instantiated from `__name__`.

https://docs.astral.sh/ruff/rules/invalid-get-logger-argument/
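A minimal illustration of the fix:

```
import logging

logger = logging.getLogger(__name__)   # correct: dotted module name
# logger = logging.getLogger(__file__) # bug: uses a file path as the name
```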
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110008
Approved by: https://github.com/ezyang
2023-09-25 18:21:18 +00:00
PyTorch MergeBot
f5886bf352 Revert "[3/N][2D] Enable training with new 2D flow (#109553)"
This reverts commit 217b37c023.

Reverted https://github.com/pytorch/pytorch/pull/109553 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but those distributed failures look legit and they are failing in trunk https://hud.pytorch.org/pr/109553 ([comment](https://github.com/pytorch/pytorch/pull/109553#issuecomment-1734100546))
2023-09-25 16:37:19 +00:00
wz337
217b37c023 [3/N][2D] Enable training with new 2D flow (#109553)
This PR enables training with the new 2D flow and adds an associated test.

state_dict-related changes will come in later PRs.

cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109553
Approved by: https://github.com/fegin, https://github.com/awgu
2023-09-25 05:32:07 +00:00
Ed Pizzi
c13177f2cb [FSDP] Propagate requires_grad attribute to unsharded params (#109892)
Summary:
This preserves `requires_grad` in the case where all parameters within a `FlatParameter` have the same `requires_grad` value.

Currently, unsharded parameters have `requires_grad=True` in some cases where the `FlatParameter` and all original parameters have `requires_grad=False`.

This could be extended to support `FlatParameters` with a mix of `requires_grad` states by extending `ParamInfo` to capture `requires_grad` for each parameter.

Test Plan: test added

Differential Revision: D49517155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109892
Approved by: https://github.com/awgu
2023-09-24 01:30:50 +00:00
wz337
b89ce814c0 [FSDP] Remove _set_use_dtensor in post_load_state_dict_hook (#109924)
This is a follow up for https://github.com/pytorch/pytorch/pull/109767.
We only need _set_use_dtensor in pre_state_dict_hook() and pre_load_state_dict_hook(); we do not need it in _post_load_state_dict_hook(). This PR removes _set_use_dtensor from _post_load_state_dict_hook().

In addition, this PR adjusts the test cases in test_hsdp_dtensor_state_dict.py to capture the changes in https://github.com/pytorch/pytorch/pull/109767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109924
Approved by: https://github.com/fegin
2023-09-23 22:34:36 +00:00
Rodrigo Kumpera
c26270c733 [C10D] Even more store scalability work. (#109218)
Fix a bug in socket.cpp's timeout detection that only shows up with 10k ranks.

Make the minimum wait time in _store_based_barrier adaptive based on the number of ranks.

Longer timeouts give the store more room to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
2023-09-22 21:27:09 +00:00
wz337
a5145364d9 [FSDP] Fix _use_dtensor not automatically turn on for model state dict when using DeviceMesh (#109767)
Fixes #109648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109767
Approved by: https://github.com/fegin
2023-09-21 15:15:45 +00:00
Howard Huang
600d0d0284 Add "cuda" to MPI backend capabilities (#109614)
Summary: Fixes https://github.com/pytorch/pytorch/issues/109543

Test Plan: We need to run CUDA-aware MPI in PyTorch to actually test this change; we currently have no MPI tests.

Differential Revision: D49420438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614
Approved by: https://github.com/XilunWu
2023-09-21 13:34:58 +00:00
Rodrigo Kumpera
881bfbf21d [c10d] Add tests for usig libuv through init_process_group. (#108661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108661
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-09-20 16:02:20 +00:00