Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in https://github.com/pytorch/pytorch/issues/65813, I added the QR factorization to powerSGD_hook.py
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half-precision. Moreover, in my tests, Gram-Schmidt is faster when the rank is lower than 3.
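For reference, the Gram-Schmidt step can be sketched in plain Python. This is a minimal illustration on lists of column vectors, not the code in powerSGD_hook.py; the function name `orthogonalize_columns` and the `eps` guard are assumptions made for this example.

```python
import math
from typing import List


def orthogonalize_columns(columns: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Return (near-)orthonormal vectors spanning the same space as `columns`.

    Uses modified Gram-Schmidt: each column is stripped of its components
    along the vectors already accepted, then normalized.
    """
    basis: List[List[float]] = []
    for col in columns:
        v = list(col)
        # Remove the component of v along every vector already in the basis.
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        # eps avoids a division by zero for (nearly) dependent columns.
        basis.append([a / (norm + eps) for a in v])
    return basis


q1, q2 = orthogonalize_columns([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
dot = sum(a * b for a, b in zip(q1, q2))
```

Here `dot` should be close to zero and both vectors should have a norm close to one, up to the small bias introduced by `eps`.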
This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. In my tests, its performance is similar to that of _torch.linalg.qr_.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72043
Reviewed By: albanD
Differential Revision: D34042781
Pulled By: cbalioglu
fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72084
Make the fsdp folder public.
ghstack-source-id: 148173447
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D33903417
fbshipit-source-id: 7852a2adc4af09af48a5ffa52ebf210489f834d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71970
- Provide default arg for power SGD convenience wrapper that matches the main API default
Test Plan: CI
Reviewed By: H-Huang
Differential Revision: D33837457
fbshipit-source-id: 8f4efab4992b3fff09456a18db2c83e087c25bdf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620
Remove from_functional_optim and make it the default constructor, since
that is now the only way _OptimizerHookState is built. Also, the
create_functional_optim helper function no longer needs to be exposed.
ghstack-source-id: 147577174
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33700593
fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608
Per title
ghstack-source-id: 147577178
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33696382
fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604
Implement 2 helper functions:
- as_functional_optim which takes in a torch.optim class type and arguments and
creates the corresponding functional optimizer.
- create_functional_optim which takes in the functional optimizer class type
and constructs it. Note that as_functional_optim calls into
create_functional_optim.
The first will be used in future PRs as described in
https://github.com/pytorch/pytorch/issues/67570 to create a functional
optimizer from a traditional optimizer. The latter is used in
_OptimizerHookState to create a functional optimizer.
Both new helper functions are covered by unittests.
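A minimal sketch of how the two helpers relate, in plain Python. The classes `SGD` and `FunctionalSGD`, the mapping `_FUNCTIONAL_OPTIM_MAP`, and the exact signatures are hypothetical stand-ins for illustration, not the code added in this PR.

```python
from typing import Any, Dict, List


class SGD:
    """Hypothetical stand-in for a traditional torch.optim class."""


class FunctionalSGD:
    """Hypothetical stand-in for the corresponding functional optimizer."""

    def __init__(self, params: List[Any], lr: float = 0.01) -> None:
        self.params = params
        self.lr = lr


# Assumed lookup table from traditional optimizer class to functional class.
_FUNCTIONAL_OPTIM_MAP: Dict[type, type] = {SGD: FunctionalSGD}


def create_functional_optim(functional_cls: type, *args: Any, **kwargs: Any) -> Any:
    """Construct a functional optimizer from its class type."""
    # The sketch starts with an empty parameter list; parameters are
    # assumed to be supplied later by the caller.
    return functional_cls([], *args, **kwargs)


def as_functional_optim(optim_cls: type, *args: Any, **kwargs: Any) -> Any:
    """Map a traditional optimizer class to its functional counterpart."""
    try:
        functional_cls = _FUNCTIONAL_OPTIM_MAP[optim_cls]
    except KeyError:
        raise ValueError(f"No functional optimizer registered for {optim_cls}")
    # as_functional_optim delegates construction to create_functional_optim.
    return create_functional_optim(functional_cls, *args, **kwargs)


opt = as_functional_optim(SGD, lr=0.5)
```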
ghstack-source-id: 147577170
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33688995
fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71602
The design in https://github.com/pytorch/pytorch/issues/67570 requires
`_OptimizerHookState` to be created directly from a functional optimizer. Add
support and tests for this. Also refactor a few tests.
ghstack-source-id: 147577175
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33687477
fbshipit-source-id: f3c789aa77773f918e01a8d0cf08739b2edf07b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71601
Moves current prototype optimizer overlap to its own file for a better
namespace. No code changes besides a few comment fixes. Note that this code is
still prototype and not expected to be used by an end user.
ghstack-source-id: 147458826
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33662678
fbshipit-source-id: 3cc931323230a4b66c02b9e6f744aaf5c48d4d34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621
Moves this feature to beta as discussed, and cleans up some docs.
Synced offline with wayi1 who mentioned that the current names are preferred
as he works to prototype hierarchical allreduce as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325.
ghstack-source-id: 147382940
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D33700444
fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067
Summary:
Using 20 iterations here is a bad example, since the warmup step is set to 100; 200 iterations makes much more sense.
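The arithmetic behind this can be checked with a small sketch; the helper name `steps_after_warmup` is hypothetical and only counts how many iterations run once the warmup period has ended.

```python
def steps_after_warmup(total_iterations: int, warmup_steps: int) -> int:
    """Number of iterations that run after the warmup period has ended."""
    return max(0, total_iterations - warmup_steps)


print(steps_after_warmup(20, 100))   # 0: the loop never leaves warmup
print(steps_after_warmup(200, 100))  # 100
```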
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974
Reviewed By: dagitses
Differential Revision: D33474576
Pulled By: rohan-varma
fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165
Implements activation offload support in checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offload + checkpoint using the save_on_cpu hook for the
former.
ghstack-source-id: 146078900
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D33228820
fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164
Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
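A framework-free sketch of the design choice, assuming a placeholder `run_with_checkpoint` function and a toy `Scale` module (both hypothetical). The wrapper holds the inner module and routes calls through the checkpoint function, so the inner module's `forward` is never overwritten.

```python
from typing import Any, Callable


def run_with_checkpoint(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Placeholder: a real checkpoint function would recompute activations in backward."""
    return fn(*args, **kwargs)


class Scale:
    """Toy stand-in for an nn.Module."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def forward(self, x: float) -> float:
        return self.factor * x

    def __call__(self, x: float) -> float:
        return self.forward(x)


class CheckpointWrapper:
    """Wraps a module instead of patching its forward method."""

    def __init__(self, module: Any, checkpoint_fn: Callable[..., Any] = run_with_checkpoint) -> None:
        self.module = module
        self.checkpoint_fn = checkpoint_fn

    def forward(self, *args: Any, **kwargs: Any) -> Any:
        # Every call goes through the checkpoint function.
        return self.checkpoint_fn(self.module, *args, **kwargs)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.forward(*args, **kwargs)


inner = Scale(3.0)
wrapped = CheckpointWrapper(inner)
result = wrapped(2.0)
```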
ghstack-source-id: 146011215
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D33214696
fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955
Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so users won't have to call checkpoint() every time they want to checkpoint the module.
Currently, only reentrant-based checkpointing is supported, and it is only tested with FSDP to unblock a use case.
Future work is to add support for the new checkpointing API, add more tests, and upstream to torch.utils.checkpoint.
ghstack-source-id: 145811242
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D33107276
fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401
Some minor changes to dist quantization, mainly changing the namespace and adding some notes for future code dedup.
ghstack-source-id: 143910067
Test Plan: wait for ci
Reviewed By: mrshenli
Differential Revision: D31979269
fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66649
Some minor changes to dist quantization, mainly changing the namespace and adding some notes for future code dedup.
ghstack-source-id: 141336191
Test Plan: wait for ci
Reviewed By: cbalioglu
Differential Revision: D31663043
fbshipit-source-id: 2f96b7346e9c90df5ab2536767f8301eb86a9c79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197
1. The constructor accepts a local optimizer instance instead of the inputs of local optimizer constructor and the class type.
2. The parameters are read from local optimizer's param_groups instead of a separate input.
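A plain-Python sketch of the two constructor changes listed above. `LocalOptimizer`, `PostLocalSGDOptimizerSketch`, and the `averager` callable are hypothetical stand-ins; only the idea of wrapping an existing optimizer instance and reusing its `param_groups` comes from this change.

```python
from typing import Any, Callable, Dict, List


class LocalOptimizer:
    """Toy stand-in for a torch.optim optimizer instance."""

    def __init__(self, params: List[Any], lr: float) -> None:
        self.param_groups: List[Dict[str, Any]] = [{"params": list(params), "lr": lr}]
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


class PostLocalSGDOptimizerSketch:
    """Wraps an already constructed local optimizer instance."""

    def __init__(self, optim: LocalOptimizer, averager: Callable[[List[Any]], None]) -> None:
        self.optim = optim
        # Parameters are read from the local optimizer, not passed separately.
        self.param_groups = optim.param_groups
        self.averager = averager  # assumed: a callable that averages the parameters

    def step(self) -> None:
        self.optim.step()
        params = [p for group in self.param_groups for p in group["params"]]
        self.averager(params)


seen: List[List[Any]] = []
local = LocalOptimizer(params=["w", "b"], lr=0.1)
wrapper = PostLocalSGDOptimizerSketch(local, averager=seen.append)
wrapper.step()
```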
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 138307226
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity
Reviewed By: rohan-varma
Differential Revision: D31007439
fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63154
The collective quantization API now supports alltoall, alltoall_single, and allscatter. The test is also included.
ghstack-source-id: 136856877
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/algorithms/quantization:DistQuantizationTests_nccl -- test_all_to_all_single_bfp16
Reviewed By: wanchaol
Differential Revision: D30255251
fbshipit-source-id: 856f4fa12de104689a03a0c8dc9e3ecfd41cad29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63059
Add the BFP16 quantization method to OSS. Currently, only CPU is supported.
ghstack-source-id: 136639528
Test Plan: Imported from OSS
Reviewed By: wanchaol
Differential Revision: D30194538
fbshipit-source-id: ac248567ad8028457c2a91b77ef2ce81709fce53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895
When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.
The replacement is `tensor.set_`.
ghstack-source-id: 136593433
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity
Reviewed By: SciPioneer
Differential Revision: D30526178
fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867
When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.
The replacement is `tensor.set_`.
ghstack-source-id: 136531233
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
Reviewed By: SciPioneer
Differential Revision: D30513613
fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260
Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
Reviewed By: SciPioneer
Differential Revision: D30238317
fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277
`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must be initialized. However, this imposes a restriction that a hook state can only be provided after distributed environment initialization, which is not compatible with lightning DDP plugin setup where hook state should be provided before distributed environment initialization.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135848575
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: cbalioglu
Differential Revision: D30325041
fbshipit-source-id: 7b870166d096d306c3f2f7c69816a705cec0bebd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62142
Created a wrapper that takes the collective op and a quantization type as arguments. It quantizes the input, performs the collective op, and then dequantizes the result.
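A minimal, single-process sketch of the quantize, communicate, dequantize pattern. Everything here is an illustrative assumption: `auto_quantize`, the FP16 round trip through `struct`, and `fake_all_to_all` (a transpose over in-memory lists) stand in for the real wrapper, quantization type, and collective op.

```python
import struct
from typing import Callable, List

Payload = List[List[bytes]]


def quantize_fp16(values: List[float]) -> List[bytes]:
    """Pack each float into 2 bytes (IEEE half precision)."""
    return [struct.pack("<e", v) for v in values]


def dequantize_fp16(payload: List[bytes]) -> List[float]:
    """Unpack half-precision bytes back into Python floats."""
    return [struct.unpack("<e", b)[0] for b in payload]


def auto_quantize(collective: Callable[[Payload], Payload]) -> Callable[[List[List[float]]], List[List[float]]]:
    """Wrap a collective so that it runs on quantized data."""

    def wrapper(per_rank_inputs: List[List[float]]) -> List[List[float]]:
        quantized = [quantize_fp16(v) for v in per_rank_inputs]  # 1. quantize
        exchanged = collective(quantized)                         # 2. collective op
        return [dequantize_fp16(p) for p in exchanged]            # 3. dequantize

    return wrapper


def fake_all_to_all(per_rank_inputs: Payload) -> Payload:
    """Simulated all-to-all: rank j receives the j-th chunk from every rank."""
    return [list(chunks) for chunks in zip(*per_rank_inputs)]


quantized_all_to_all = auto_quantize(fake_all_to_all)
out = quantized_all_to_all([[1.0, 2.0], [3.0, 4.0]])
```

The values used here are exactly representable in half precision, so the round trip is lossless; other values would come back rounded.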
Test Plan:
Tested through distributed_gloo_fork.
e.g., buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_all_to_all_quantized
Reviewed By: wanchaol
Differential Revision: D29682812
fbshipit-source-id: 79c39105ff11270008caa9f566361452fe82a92e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods get_tensor() and set_tensor(.) in the Python-exposed API from the C++ logic with buffer() and set_buffer(.) for a cleaner interface.
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605
Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.
`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```
Reviewed By: mrshenli
Differential Revision: D30055544
Pulled By: andwgu
fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592
Reland #62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
ghstack-source-id: 134848352
Test Plan: unit test
Reviewed By: andwgu
Differential Revision: D30049431
fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532
This method is not stable at this time, so avoid releasing it when the DDP communication hook feature is released as a stable feature.
ghstack-source-id: 134787831
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl
Reviewed By: rohan-varma
Differential Revision: D30031222
fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
Test Plan:
Local run comprehensive test with following results:
https://pxl.cl/1Ml8b
The two test cases that fail with timeouts are most likely environment-related; they fail on my devserver.
Reviewed By: SciPioneer
Differential Revision: D30024161
fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.
Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157
Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`
These were tested on the AI AWS cluster.
An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.
Both approaches have been verified using an internal accuracy benchmark.
Reviewed By: mrshenli
Differential Revision: D29971046
Pulled By: andwgu
fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457
Specify `Future[torch.Tensor]` as DDP communication hook return type, which should be explicitly a single tensor. The previous API takes a list that has a single tensor.
Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type
Reviewed By: rohan-varma
Differential Revision: D30007390
fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389
Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.
Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280
Additionally, the formatter is applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork
Reviewed By: rohan-varma
Differential Revision: D29982485
fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62392
The constructor of `PeriodicModelAverager` does not need to accept parameters.
ghstack-source-id: 134626245
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29986446
fbshipit-source-id: 6a8b709e4383a3c44b9e60955fbb067cd2868e76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111
This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms.
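A plain-Python sketch of what such a base class could look like. The class names, the `average_parameters` signature, and the in-process list of replicas (standing in for parameters held by different ranks) are assumptions for illustration only.

```python
from abc import ABC, abstractmethod
from typing import List


class ModelAverager(ABC):
    """Base class: concrete averagers decide when and how to average."""

    def __init__(self) -> None:
        self.step = 0

    @abstractmethod
    def average_parameters(self, replicas: List[List[float]]) -> None:
        """Average the parameters in place across all replicas."""


class PeriodicAverager(ModelAverager):
    """Averages every `period` steps once `warmup_steps` have passed."""

    def __init__(self, period: int, warmup_steps: int = 0) -> None:
        super().__init__()
        self.period = period
        self.warmup_steps = warmup_steps

    def average_parameters(self, replicas: List[List[float]]) -> None:
        past_warmup = self.step >= self.warmup_steps
        if past_warmup and (self.step - self.warmup_steps) % self.period == 0:
            count = len(replicas)
            means = [sum(values) / count for values in zip(*replicas)]
            for replica in replicas:
                replica[:] = means  # overwrite in place
        self.step += 1


replicas = [[1.0, 2.0], [3.0, 4.0]]
averager = PeriodicAverager(period=2)
averager.average_parameters(replicas)  # step 0: averages
```

Because the optimizer only depends on the base class, a different averaging algorithm can be swapped in by passing another subclass.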
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134489187
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29884954
fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079
Adds support for keyword arguments in the functional optimizer running as a
hook.
ghstack-source-id: 134330379
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29838127
fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2