To better support device-agnostic code, add a `current_device()` to torch.cpu that returns "cpu", so that we no longer run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
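A minimal sketch of the device-agnostic pattern this enables; the device selection logic here is illustrative, not from the PR:
```
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
device_module = getattr(torch, device_type)
# with this change, torch.cpu.current_device() returns "cpu" instead of raising AttributeError
current = device_module.current_device()
```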
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
This reverts commit ff0358b038.
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D so that users can detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.
This PR introduces a new module, torch.distributed.hooks, that exposes the following methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread that reads events from a C++ queue and dispatches them.
Queue notification is, somewhat unusually, done with a pipe: this is needed so Python can abort the thread on shutdown while keeping it a background thread, which is not possible with more conventional choices such as a condvar.
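A hedged sketch of registering one of these hooks; the callback signature and event payload are assumptions, not part of this description:
```
import torch.distributed.hooks as dhooks

def on_collective_start(event):
    # the fields carried by `event` are an assumption here
    print(f"collective started: {event}")

dhooks.register_collective_start_hook(on_collective_start)
```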
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
I tested by adding some warning logs in C++, running a distributed program, and confirming that the messages now had `[rank0]:` in them. There is no existing test infra for C++ logging, so I couldn't easily add a unit test.
The implementation strategy is to set up a global variable in C++ and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.
This PR only works for non-glog logging. We probably need to come up with some other strategy for glog, e.g., a custom prefix, but we need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, so I will leave it as follow-up work.
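The Python side of that strategy might look roughly like the following; the binding name is hypothetical and only illustrates "poking" the global state at process group initialization:
```
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
# hypothetical binding: stamp this process's rank into the C++ logging state
torch._C._distributed_c10d._set_global_rank(dist.get_rank())
```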
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
Skip tensor.to() in from_local and distribute_tensor when the device type of the device mesh matches the tensor's device type. Since from_local is on the critical path of TP, this may also reduce some overhead.
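A minimal sketch of the skip, assuming a `device_mesh` and a local `tensor` supplied by the caller:
```
import torch
from torch.distributed._tensor import DeviceMesh

def maybe_move(tensor: torch.Tensor, device_mesh: DeviceMesh) -> torch.Tensor:
    # only pay for tensor.to() when the device types actually differ
    if tensor.device.type != device_mesh.device_type:
        tensor = tensor.to(device_mesh.device_type)
    return tensor
```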
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110774
Approved by: https://github.com/fduwjj
optree recently landed and provides quite good performance; conditionally import optree when it is installed.
Some numbers testing an MLP layer with TP + functional collectives:
before this PR: 10.390 ms
after this PR: 9.189 ms
so around a 10% end-to-end CPU overhead reduction.
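A hedged sketch of the conditional import pattern; the selection logic is an assumption, not the exact code:
```
import torch.utils._pytree as python_pytree

try:
    import optree  # C++ pytree implementation with lower CPU overhead
    HAS_OPTREE = True
except ImportError:
    HAS_OPTREE = False

def flatten(tree):
    # prefer optree when available, fall back to the pure-Python pytree otherwise;
    # note the returned specs differ between the two backends
    if HAS_OPTREE:
        return optree.tree_flatten(tree, none_is_leaf=True)
    return python_pytree.tree_flatten(tree)
```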
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110670
Approved by: https://github.com/fegin
We had the option but never used cpu_offload, as optimizer state_dict offloads the tensors to CPU by default. This is usually what most users want, since the tensors eventually need to be moved to CPU anyway. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets optimizer state_dict read the flag.
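A hedged example of opting out of CPU offload for debugging, assuming the flag surfaces as `offload_to_cpu` on the optimizer state dict config and that `model` is FSDP-wrapped with optimizer `optim`:
```
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.api import (
    ShardedOptimStateDictConfig,
    ShardedStateDictConfig,
    StateDictType,
)

FSDP.set_state_dict_type(
    model,
    StateDictType.SHARDED_STATE_DICT,
    ShardedStateDictConfig(offload_to_cpu=False),
    # keep optimizer state on the accelerator while debugging
    ShardedOptimStateDictConfig(offload_to_cpu=False),
)
osd = FSDP.optim_state_dict(model, optim)
```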
Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
This PR adds support for aten.where and supports implicit scalar promotion: when we encounter scalar tensors in the dispatching logic, we implicitly convert them to replicated DTensors.
The latter also enables a bunch of ops in the op db to pass.
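A hedged sketch of what this enables; the mesh construction and `world_size` are assumed to be set up elsewhere:
```
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", list(range(world_size)))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
# the Python scalar becomes a scalar tensor that is implicitly treated as replicated
out = torch.where(dt > 0, dt, 0.0)
```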
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110584
Approved by: https://github.com/fduwjj
Summary:
As title, this diff hides the contiguous requirement for user input mesh when initializing DeviceMesh.
In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp")
)
```
Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp")
)
```
Without the change: exception.
After this change: initialized successfully.
Differential Revision: D49942839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110628
Approved by: https://github.com/wanchaol, https://github.com/xw285cornell, https://github.com/fduwjj
Expose a set of observability hooks into C10D so that users can detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.
This PR introduces a new module, torch.distributed.hooks, that exposes the following methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread that reads events from a C++ queue and dispatches them.
Queue notification is, somewhat unusually, done with a pipe: this is needed so Python can abort the thread on shutdown while keeping it a background thread, which is not possible with more conventional choices such as a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Adam part of: https://github.com/pytorch/pytorch/issues/110506
TODO:
- If this approach is validated as a good one, it can also be applied to all other optimizers that convert `complex` via list comprehensions
### Results:
`NUM_PARAMS=200, foreach=True`
- main: dynamo: 43s, inductor: 31s, total: 74s
- this PR: dynamo: 3.5s, inductor: 30s, total: 34s (dynamo speedup: 12.3x, overall speedup: 2.1x)
`NUM_PARAMS=1000, foreach=True, has_complex shortcut`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0329, 50.0806, 0.0041
OutputGraph.call_user_compiler 44.9924
```
`NUM_PARAMS=1000, foreach=True`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0389, 58.6069, 0.0043
OutputGraph.call_user_compiler 44.1425
```
### Discussion
- The `has_complex` shortcut provides an additional ~2x dynamo speedup. It is not necessary to achieve a significant overall speedup.
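A hedged sketch of the kind of shortcut described above; the helper is illustrative, not the exact optimizer code:
```
import torch

def maybe_view_complex_as_real(params, has_complex):
    # when no complex params exist, skip the per-tensor checks entirely
    if not has_complex:
        return params
    return [torch.view_as_real(p) if torch.is_complex(p) else p for p in params]
```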
CC: @janeyx99 @mlazos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110607
Approved by: https://github.com/janeyx99, https://github.com/lezcano
Summary:
Sometimes local_shards are empty on some ranks and out.dtype is float16, which causes an error when enforce_dtype is True because `data` will be float32.
Callers know best what dtype they want, so we can just let callers decide.
Temporarily keep enforce_dtype for backward compatibility.
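A hedged sketch of the direction described in the summary; names and signature are hypothetical:
```
from typing import List, Optional

import torch

def gather_shard_data(
    local_shards: List[torch.Tensor],
    out: torch.Tensor,
    dtype: Optional[torch.dtype] = None,
) -> torch.Tensor:
    # the caller chooses the dtype; fall back to out.dtype only when unspecified
    target = dtype if dtype is not None else out.dtype
    data = torch.cat(local_shards) if local_shards else out.new_empty(0)
    return data.to(target)
```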
Test Plan: Run local and MAST job
Reviewed By: uciyc123
Differential Revision: D46886551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110561
Approved by: https://github.com/wanchaol, https://github.com/malfet
When we convert to a local tensor, the DTensor can no longer track autograd or the gradient layout of the local tensor. If the user does something unexpected, there needs to be a way for the user to hint about the gradient layout of the local tensor.
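A hedged sketch of how such a hint could look, assuming it surfaces as a `grad_placements` argument on `to_local` (mesh setup and `world_size` are assumed):
```
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", list(range(world_size)))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])
# hint that the gradient of the local tensor is laid out as Shard(0) on the mesh
local = dt.to_local(grad_placements=[Shard(0)])
```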
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
The original implementation of `_gather_orig_param_state` is naive. It performs one allgather_object and two allgathers (if the optimizer is Adam) per FQN. This can be slow and makes `_optim_state_dict` a bottleneck.
This PR rewrites the implementation and fuses all the `allgather_object`s into one. As for `allgather`, it is fused based on the information of the FlatParameters, so there will be 2N `allgather`s, where N is the number of FlatParameters and 2 is due to Adam having 2 states per FQN.
One experiment on 8 A100 GPUs shows that the execution time of the gathering improves from 3 seconds to 0.3 seconds.
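A hedged sketch of the object-collective fusion (the helper name is illustrative): one `all_gather_object` over the entire FQN-to-state mapping replaces one call per FQN:
```
import torch.distributed as dist

def gather_all_fqn_states(local_states: dict, pg) -> list:
    # a single all_gather_object for the whole FQN -> state mapping,
    # instead of one collective per FQN
    gathered = [None] * dist.get_world_size(pg)
    dist.all_gather_object(gathered, local_states, group=pg)
    return gathered
```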
Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
pytree is a great tool, but it is sometimes considered harmful for tensor subclasses: it's useful for implementing a subclass quickly, but it:
* exposes non-trivial CPU overhead
* is unnecessary for many ops; only the ones with list/dict args need it
* has semantic issues for inplace/out ops when blindly used to re-wrap results
This PR avoids using pytree for most ops during torch_dispatch and only enables it for certain ops.
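A hedged sketch of the fast path, assuming a DTensor-like subclass and an `unwrap` callable that extracts the local tensor; names are illustrative:
```
from torch.utils._pytree import tree_map

def unwrap_args(args, kwargs, subclass_type, unwrap):
    # fast path: no containers, so plain comprehensions avoid pytree overhead
    flat_only = not any(isinstance(a, (list, tuple, dict)) for a in args) and not any(
        isinstance(v, (list, tuple, dict)) for v in kwargs.values()
    )
    if flat_only:
        new_args = tuple(unwrap(a) if isinstance(a, subclass_type) else a for a in args)
        new_kwargs = {k: unwrap(v) if isinstance(v, subclass_type) else v for k, v in kwargs.items()}
        return new_args, new_kwargs
    # slow path: fall back to pytree for nested list/dict arguments
    fn = lambda a: unwrap(a) if isinstance(a, subclass_type) else a
    return tree_map(fn, args), tree_map(fn, kwargs)
```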
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110132
Approved by: https://github.com/fduwjj
When running on "gloo" and "cpu:gloo,cuda:nccl" backend, it will run into the following error.
```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
optim_state = load_sharded_optimizer_state_dict(
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
_alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```
This PR fixes the error in optimizer.py. We will follow up to add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit test to include this backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
When the sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch=True, during a validation-only run self.is_first_iter will always be True (because training=False, so iter + 1 is never executed). Additionally, the _pre_forward_order_index of the first handle entering record_pre_forward is 0. This causes the if condition at line 166 to evaluate to False when that handle enters record_pre_forward again (the expected value is True, because _pre_forward_order_index has in fact already been assigned a value). As a result, the first handle is repeatedly added to handles_pre_forward_order, leading to an incorrect prefetching order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110138
Approved by: https://github.com/awgu
Replacing https://github.com/pytorch/pytorch/pull/109553 as it got reverted.
This PR enables training with the new 2D flow and adds the associated test. In addition, this PR moves the FSDP-specific parts of tensor/parallel/_data_parallel_utils.py back to tensor/parallel/fsdp.py to avoid a circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py.
state_dict related changes will come in later PRs.
cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034
Approved by: https://github.com/fduwjj
Summary:
This preserves `requires_grad` in the case where all parameters within a `FlatParameter` have the same `requires_grad` value.
Currently, unsharded parameters have `requires_grad=True` in some cases where the `FlatParameter` and all original parameters have `requires_grad=False`.
This could be extended to support `FlatParameters` with a mix of `requires_grad` states by extending `ParamInfo` to capture `requires_grad` for each parameter.
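A hedged sketch of the uniform case described above; the helper and its call site are hypothetical:
```
from typing import List

import torch.nn as nn

def flat_param_requires_grad(original_params: List[nn.Parameter]) -> bool:
    # preserve requires_grad only when every original parameter agrees;
    # otherwise fall back to the current default of True
    flags = [p.requires_grad for p in original_params]
    return flags[0] if flags and len(set(flags)) == 1 else True
```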
Test Plan: test added
Differential Revision: D49517155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109892
Approved by: https://github.com/awgu
Fix a bug in socket.cpp's timeout detection that only shows up with 10k ranks.
Make the minimum wait time in _store_based_barrier adaptive based on the number of ranks.
Longer timeouts give the store more room to do productive work when swamped.
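A hedged sketch of an adaptive wait; the constants and scaling are assumptions, not the values used in the PR:
```
def store_barrier_wait_seconds(world_size: int, base: float = 0.01, cap: float = 10.0) -> float:
    # poll less aggressively as the job grows so the store can do productive work
    return min(cap, base * max(1, world_size // 100))
```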
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217