Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64472
Sometimes, user module can reassign tensor buffer, as in:
```
self.buffer = torch.randn(1, 2) # in init
self.buffer += 1 # in forward
```
in this case, `self.modules_buffers` will become outdated and we should
repopulate self.modules_buffers if we need to sync module buffers.
See https://github.com/pytorch/pytorch/issues/63916 for full description of the
issue.
ghstack-source-id: 137526309
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D30745921
fbshipit-source-id: 25eb1edbf445703a481802e07f3058d38ea6fc64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260
Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
Reviewed By: SciPioneer
Differential Revision: D30238317
fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62996
No need to set this because autograd engine already propagates TLS
states.
ghstack-source-id: 135438220
Test Plan: CI
Reviewed By: albanD
Differential Revision: D30202078
fbshipit-source-id: e5e917269a03afd7a6b8e61f28b45cdb71ac3e64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61753
Reland of https://github.com/pytorch/pytorch/pull/57081.
Main difference is that the former diff moved `prepare_for_backward` check into `DDPSink` backward, but that resulted in issues due to potential autograd engine races. The original diff moved `prepare_for_backward` into `DDPSink` as part of a long-term plan to always call it within `DDPSink`.
In particular this doesn't work because `prepare_for_backward` sets `expect_autograd_hooks=true` which enables autograd hooks to fire, but there were several use cases internally where autograd hooks were called before DDPSink called `prepare_for_backward`, resulting in errors/regression.
We instead keep the call to `prepare_for_backward` in the forward pass, but still run outputs through `DDPSink` when find_unused_parameters=True. As a result, outputs that are not used when computing loss have `None` gradients and we don't touch them if they are globally `None`. Note that the hooks still fire with a undefined gradient which is how we avoid the Reducer erroring out with the message that some hooks did not fire.
Added the unittests that were part of the reverted diff.
ghstack-source-id: 135388925
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29726179
fbshipit-source-id: 54c8819e0aa72c61554104723a5b9c936501e719
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62748
Previously after buckets were rebuilt the first bucket size was always
defaulted to 1MB, this diff allows first bucket to be tuned like the rest of
the bucket sizes can.
Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.
ghstack-source-id: 135074696
Test Plan: CI
Reviewed By: SciPioneer, wanchaol
Differential Revision: D30110041
fbshipit-source-id: 96f76bec012de129d1645e7f50e266d4b255ec66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods set_tensor(.) and get_tensor() in the python exposed API from the C++ logic with buffer() and set_buffer(.) to be a cleaner interface.
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605
Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.
`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```
Reviewed By: mrshenli
Differential Revision: D30055544
Pulled By: andwgu
fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.
Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157
Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`
These were tested on the AI AWS cluster.
An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.
Both approaches have been verified using an internal accuracy benchmark.
Reviewed By: mrshenli
Differential Revision: D29971046
Pulled By: andwgu
fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457
Specify `Future[torch.Tensor]` as DDP communication hook return type, which should be explicitly a single tensor. The previous API takes a list that has a single tensor.
Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type
Reviewed By: rohan-varma
Differential Revision: D30007390
fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232
Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. Will be used to verify how changing bucket sizes in DDP affects perf.
Based on the test, we can see inconsistency where the "first" bucket size actually is (last before rebuild buckets, first after).
ghstack-source-id: 134663867
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29922299
fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231
`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits, return this information from the function so that we are aware of size limits for each bucket.
This is currently not being consumed, but will be in the next diff when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
ghstack-source-id: 134480575
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D29919056
fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61992
This test previously was not enabled for static graph but to ensure
this feature is supported with DDPSink, enable it for static graph which
currently passes outputs to DDPSink.
ghstack-source-id: 134471406
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29830887
fbshipit-source-id: 2d3f750d9eb4289558ed21acccd172d83d9b82cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61497
Reverts [DDP] Support not all outputs used in loss calculation
ghstack-source-id: 133589153
Test Plan: CI, ping authors to run their workflow on this diff
Reviewed By: zhaojuanmao
Differential Revision: D29642892
fbshipit-source-id: 81a15b9ab3329602f34d3758bb0799005a053d4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61401
Reverts https://github.com/pytorch/pytorch/pull/59359, which is causing a few internal issues in DDP training. We will evaluate the internal use cases and reland it after reconsidering the design.
Also moves `prepare_for_backward` back into forward pass instead of DDP Sink for `find_unused_parameters`. This ensures that hooks will always fire in the backwards pass, which is behavior that internal training workloads rely on. Calling `prepare_for_backward` in DDPSink autograd function is not the best solution since other autograd threads may have been executing which can cause races.
ghstack-source-id: 133589152
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29608948
fbshipit-source-id: f060f41cd103573ddff8da50cdbb6c56768dab46
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.
**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)
The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.
This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_uneven_inputs=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.
Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.
**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).
**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555
Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```
Reviewed By: zou3519
Differential Revision: D29690359
Pulled By: andwgu
fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.
**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)
There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.
The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all process have joined with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.
The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined process must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757
Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```
Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.
Reviewed By: iramazanli, mrshenli
Differential Revision: D29624636
Pulled By: andwgu
fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61376
After SPMD is retired, the API of `get_tensors` becomes `get_tensor`. Fix some comments that refer to the obsolete API.
The `allreduce` hook example does not do division inside, which actually is incorrect.
ghstack-source-id: 133174272
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D29596857
fbshipit-source-id: 2046b185225cd6d1d104907b5f9b4009b6e87c99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61020
Makes uneven input support with `join` context manager work with
custom communication hooks. This will ensure that the two features can work
well together. Added relevant unittests to test allreduce and powerSGD hooks.
Instead of calling `allreduce`, the join manager now calls into `_run_reduction_hook` which will automatically run whatever hook is installed.
ghstack-source-id: 132950108
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29480028
fbshipit-source-id: c91dc467a62c5f1e0ec702a2944ae3deb10f93f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61019
Changes uneven input logic of running allreduce to using `GradBucket` structure. This is to enable support for comm. hook with join in the next diff.
ghstack-source-id: 132950107
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D29480027
fbshipit-source-id: 7c42c53653052f71b86a75e14a5fc7ae656433f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61017
Removes SPMD nested vector logic from this codepath. This is mostly in preparation for the next diffs in this stack which enable support for join with comm. hook.
ghstack-source-id: 132924223
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29477360
fbshipit-source-id: f8132a94b1abfe28586aa78ac47e13a7ce6bb137
Summary:
We recently landed a change to ensure that when running under ``find_unused_parameters=True``, not all module outputs have to be used in loss computation and DDP will work as expected. Mention this update in the documentation and add some additional clarification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60275
Reviewed By: SciPioneer
Differential Revision: D29502609
Pulled By: rohan-varma
fbshipit-source-id: ddb3129cff9492018e61813413b30711af212309
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60882
Fixes https://github.com/pytorch/pytorch/issues/60733, which
identified an issue with a previous PR that resulted in DDP no longer
supporting cases where newly created tensors are returned that don't have a
grad_fn. The result of this is the grad_fn is set to that of the `DDPSink`
custom backward which results in errors during the backwards pass.
This PR fixes the issue by ensuring we don't touch the `grad_fn` of the tensors
if it is `None`. Added relevant tests as well.
ghstack-source-id: 132632515
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29423822
fbshipit-source-id: a9e01046c7be50aa43ffb955f6e0f48fef4bc881
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59359
Move `prepare_for_backward` into `_DDPSink` backward instead of calling it in DDP forward pass so that we can run multiple backwards in DDP with `retain_graph=True`.
ghstack-source-id: 131774159
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28855226
fbshipit-source-id: 6b7b25d75b7696f5b5629078233433f97663d61c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118
Pipe + DDP has a few issues:
1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since https://github.com/pytorch/pytorch/pull/55248
2) when find_unused_parameters=True, also does not results in gradient synchronization. does not work since https://github.com/pytorch/pytorch/pull/57081
The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording.
To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908.
to test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks.
ghstack-source-id: 131688187
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29167283
fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8
Summary: We updated the training scripts and re-trained the Resnext model with msuru_suru_union and ig_msuru_suru_union datasets
Test Plan:
Main command line to run:
*./deeplearning/projects/classy_vision/fb/projects/msuru_suru/scripts/train_cluster.sh*
Config we used is *msuru_suru_config.json*, which is "Normal ResNeXt101 with finetunable head".
Experiments:
- msuru_suru_union f279939874
- Train/test split
- msuru_suru_union_dataset_train_w_shard: 143,632,674 rows
- msuru_suru_union_dataset_test_w_shard: 1,831,236 rows
- Results
{F625232741}
{F625232819}
- ig_msuru_suru_union f279964200
- Train/test split
- ig_msuru_suru_union_dataset_train_w_shard: 241,884,760 rows
- ig_msuru_suru_union_dataset_test_w_shard: 3,477,181 rows
- Results
{F625234126}
{F625234457}
Differential Revision: D29154971
fbshipit-source-id: d534d830020f4f8e596bb6b941966eb84a1e8adb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284
Logs a few python-side errors to DDP logging.
TODO: Most python errors actually have to do with user input correctness, so they throw before reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28820290
fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281
Adds ability to log when reducer/ddp encounters an error. We add fields "has_error" and "error" to indicate that an error has
occured in this iteration, and the other fields (performance stats) are not
guaranteed to be updated.
Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28652717
fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59065
Cleaner to use work.result() instead of sending back the tensor from
this function.
ghstack-source-id: 130338813
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28551203
fbshipit-source-id: d871fed78be91f0647687ea9d6fc86e576dc53a6
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57878.
This adds `NCCL_ASYNC_ERROR_HANDLING` as a DDP relevant environment variable and includes a check for that variable in the test `test_dump_DDP_relevant_env_vars()`. Notably, the modified test now checks for the new variable but does not check for any of the other previously-existing relevant environment variables that were not already tested for (e.g. `NCCL_BLOCKING_WAIT`).
The change was tested via the following on an AI AWS cluster:
`WORLD_SIZE=2 BACKEND=nccl gpurun pytest test/distributed/test_distributed_spawn.py -k test_dump_DDP_relevant_env_vars -vs`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59109
Reviewed By: H-Huang, SciPioneer
Differential Revision: D28761148
Pulled By: andwgu
fbshipit-source-id: 7be4820e61a670b001408d0dd273f65029b1d2fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58595
No longer needed since this list is always of size 1.
ghstack-source-id: 129498229
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28548426
fbshipit-source-id: 7d6dba92fff685ec7f52ba7a3d350e36405e2578
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081
Changes in this diff:
Enable passthrough autograd function when find_unused_parameters=True.
With above, move prepare_for_backward which does unused parameter checking logic to beginning of backwards pass, only when find_unused_parameters=True.
Enhance process of unused parameter checking to account for outputs not being used in loss.
The way (3) is implemented is by triggering the autograd hook corresponding to parameters that did not participate in loss computation. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.
Tested by ensuring that when a model output is not used in loss, the corresponding grad is not modified. Also verified that the grads are the same in local vs DDP training case. Also verified that gradients are not touched in this case, i.e. if grad is originally None, it stays as None, not zero, after.
Note that in this diff we are not enabling the pass through autograd function for regular case find_unused_parameters=False because that has a much bigger blast radius and needs additional careful analysis especially with regard to the performance.
ghstack-source-id: 129425139
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28048628
fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58089
make ddp logging api to be private
ghstack-source-id: 128796419
Test Plan: unit test
Reviewed By: rohan-varma
Differential Revision: D28365412
fbshipit-source-id: 374c01d443ffb47a3706f59e296d6e47eb5f4c85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58105
When find_unused_parameters=True but static_graph is also set, static graph handles unused parameter accounting, so this code path is not needed
ghstack-source-id: 128736289
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28371954
fbshipit-source-id: 0b42a9c0fd2bba26a0de288436e9c7139e292578
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57073
Enhances use of DDPSink to work for all output types DDP supports as per https://github.com/pytorch/pytorch/issues/55876.
TODO: Add additional testing for tuple, list, dict return types
ghstack-source-id: 128726768
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27756985
fbshipit-source-id: 2e0408649fb2d6a46d6c33155a24c4c1723dd799
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57999
make ddp logging api to be private
ghstack-source-id: 128607185
Test Plan: unit test
Reviewed By: rohan-varma
Differential Revision: D28338485
fbshipit-source-id: bd2ae7c78904e93eed88be91876f5a832b5b7886
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248
This PR provides enable static graph training when users call _set_static_graph(). This can help support more use cases in DDP without performance regression, also can potentially improve performance when there are unused parameters in the graph.
1. first iteration records graph states like how many times a grad is calculated, whether the grad is used or not. then first iteration queues a delay_all_reduce call back to all reduce grads.
2. Since autograd call back is associated with current target graph task, the delay_all_all call back should be associated with out-most backward graph task. A DDP sink layer is added in DDP forward loop so that we can queue the delay_all_reduce call back in the sink layer.
3. after first iterations, DDP will use the saved graph states to determine whether a grad is used or not. whether a grad is ready for communication.
4. rebuilt bucket is called in second iteration, after graph states are recorded in first iteration.
5. if the graph states change, DDP will throw errors
ghstack-source-id: 128599464
Test Plan: unit tests. adding more tests
Reviewed By: rohan-varma
Differential Revision: D27539964
fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57771
This mapping didn't work properly when certain parameters didn't
require grad. Fixed that and added a test.
ghstack-source-id: 128527537
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28265636
fbshipit-source-id: 7b342ce012b2b7e33058b4c619ffb98992ed05b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56755
Rehash of https://github.com/pytorch/pytorch/pull/47488
Adds a flag to ddp join() context manager that enables throwing a
StopIteration across all ranks when this flag is specified.
To do this, we implement the design in #47250. When running with this flag, we schedule an additional allreduce in the case that a joined rank needs to throw a StopIteration. In non-joined ranks forward pass, we match this allreduce and if at least one rank tells us to throw, we raise a StopIteration.
Tested by modifying existing tests, as well as adding additional tests validating that this works with SyncBatchNorm models and a model with custom collectives in the forward pass.
Currently running perf benchmarks, will post when those are available, but we expect a small (~2%) perf reduction when enabling this feature due to the blocking allreduce. Hence we will only recommend it for models with collective comm.
ghstack-source-id: 127883115
Test Plan: Ci
Reviewed By: SciPioneer
Differential Revision: D27958369
fbshipit-source-id: c26f7d315d95f17bbdc28b4a0561916fcbafb7ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995
provide an DDP private API to explicitly set the training is static, also set this flag in logger
ghstack-source-id: 127755713
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D27444965
fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56730
add a test to verify DDP with torch map will result in the same results when using grad_as_bucket_view=true and false.
torch.amp scale factor does not have dependencies on old gradients, thus it is not affected by grad_as_bucket_view=true or false, see
how torch.amp is implemeted here https://github.com/pytorch/pytorch/pull/33366/files.
This diff verified ddp can work as expected with amp.GradScaler and amp.autocast when when using grad_as_bucket_view=true and false.
ghstack-source-id: 127526358
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D27950132
fbshipit-source-id: 8ed26935fdcb4514fccf01bb510e31bf6aedac69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641
currently ddpLoggingData is flat struct, which requires internal DDP developers and external users to know about the struct field names. This is not flexible to delete or add new fields in the future. also it is hard to access ddpLoggingData.
With maps/dict, developers and users can easily access the fields without knowing the field names, also easier to add/remove a new/old field.
Since C++ does not support map values to be different types, right now ddpLoggingData containes two types of maps.
ghstack-source-id: 127482694
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D27923723
fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55946
As `ddp_gpu_size` field of `SyncBatchNorm` will always be 1 for GPU modules, remove this field and the relevant code.
ghstack-source-id: 126883498
Test Plan: waitforbuildbot
Reviewed By: zhaojuanmao
Differential Revision: D27746021
fbshipit-source-id: b4518c07e6f0c6943fbd7a7548500a7d4337126c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075
Constructs and passes in a mapping with parameter names to Reducer to log information about unused parameters in error messages about unused parameters/not all parameters getting gradient.
Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.
Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
ghstack-source-id: 126581051
Test Plan: CI, UT
Reviewed By: zhaojuanmao
Differential Revision: D27356394
fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353
Remove all the code branches that will only be executed when `device_ids > 1`.
Some helper functions are also removed:
1. `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`
The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27552201
fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54454
According to the pitch in https://github.com/pytorch/pytorch/issues/47012
1. Let DDP error out if `device_ids` contains multiple devices.
2. If device_ids is not specified, DDP will use the provided model (module argument in DDP constructor) as-is, regardless if the model is on one GPU or multiple GPUs or on CPU.
3. Remove the assertion that prevents SPMD in DDP `join()` method, because now SPMD is already forbidden by the constructor. Also remove the relevant unit test `test_ddp_uneven_inputs_replicated_error`.
#Closes: https://github.com/pytorch/pytorch/issues/47012
ghstack-source-id: 125644392
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_cuda
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_rnn
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_ids_not_allowed
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_single_device_module_device_ids_None
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_module_device_ids_None
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config
waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D27226092
fbshipit-source-id: 3ee1e4bc46e5e362fc82cf7a24b2fafb34fcf1b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919
Log the use of uneven inputs API for better tracking and use case
detection.
ghstack-source-id: 125446499
Test Plan: CI, added ut
Reviewed By: zhaojuanmao, SciPioneer
Differential Revision: D27410764
fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54219
There is no need for this ``pass``.
ghstack-source-id: 125124311
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27105234
fbshipit-source-id: 95496fa785fdc66a6c3c8ceaa14af565588325df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54159
See https://github.com/pytorch/pytorch/issues/54059 for discussion.
In short, users might want to run evaluation on a single rank
in `torch.no_grad()` mode. When this happens, we need to make
sure that we skip all rebuild bucket logics, as the forward only
runs on one rank and not all peers can sure the bucket configuration
sync communication.
Test Plan: Imported from OSS
Reviewed By: zhaojuanmao
Differential Revision: D27119666
Pulled By: mrshenli
fbshipit-source-id: 4b2f8cce937cdd893e89d8d10c9267d255ba52ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53795
There are 4 calls in ddp implementation to dist.get_rank(), move these
to a helper property to ensure that users don't actually call `dist.get_rank()`
instead of `dist.get_rank(self.process_group)`.
Keeping API private for now because not sure if there is a user need to call `model.distributed_rank`, but can make it public if we think it's a useful api.
ghstack-source-id: 123640713
Test Plan: Ci
Reviewed By: mrshenli
Differential Revision: D26972368
fbshipit-source-id: a5f1cac243bca5c6f90a44f74d39cfffcc2b9a5a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53793
This call should pass in the process group so it works appropriately
for subgroups instead of whole world being passed into DDP.
Aside: This wasn't caught by tests since we don't have good testing around
passing subgroups into DDP, I believe nearly all tests use the entire world.
Should we add better testing for subgroups which may potentially bring up more
subtle bugs?
ghstack-source-id: 123640712
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D26972367
fbshipit-source-id: 8330bd51e2ad66841e4c12e96b67d3e78581ec74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145
add a new API to allow users to set sample rate for runtime stats, also add per iteration latency breakdowns to DDPLoggingData struct. e.g.
if users set sample rate to be 1, they can analyze per iteration latency change over time (not avged)
ghstack-source-id: 123443369
Test Plan: unit test
Reviewed By: SciPioneer
Differential Revision: D26763957
fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966
Logs registerd comm hook if there is one, else logs
"builtin_allreduce"
ghstack-source-id: 123174803
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D26709388
fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887
This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.
There were a few things that could be improved with the way we verify model across processes in DDP initialization:
1. We should do this check before syncing module states in DDP init, otherwise with Gloo backend this will throw but we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating reducer, 2
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP
In follow up changes we will add the ability to detect which rank had inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which ranks(s) had errors).
ghstack-source-id: 123171877
Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks
Reviewed By: zhaojuanmao
Differential Revision: D26565290
fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.
We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which i guess is one potential consequence of multiple DDP hooks on a given param, not sure why, i'd have to dig further).
This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929
Reviewed By: zou3519
Differential Revision: D26625807
Pulled By: zhaojuanmao
fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386
add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data
1. gpu time stats are not collected for single process multiple devices in this diff, as that requires events are created and recorded on multiple devices
2. use at::cuda::event API for safer calls
3. events may not be created in autograd hook if hook is not triggered in user's codes, e.g., users runs in non-sync mode in some iterations. So we checked events are created or not before synchronizing, also skipped invalid results.
4. users may not set device upfront, so explicitly set proper device before creating events in our prepare_forward() and prepare_backward() calls
ghstack-source-id: 121933566
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D26158645
fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031
Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, before this wouldn't work as the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26356192
fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50180
Resolves the regression in
https://github.com/pytorch/pytorch/issues/49819 by adding copy over background
stream similar to scatter. For internal use cases, this is gated with an env var that maintains the previous behavior when it is off.
Test Plan: CI
Reviewed By: mrshenli, ngimel
Differential Revision: D25818170
fbshipit-source-id: e50c76c035504b2a44e2be084701cee45c90df75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622
1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. Unit test tested the logging data can be set and got as expected
5. Follow up will add more logging fields such as perf stats, internal states, env variables and etc
ghstack-source-id: 120275870
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D25930527
fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49908
As described in https://github.com/pytorch/pytorch/issues/49891, DDP +
Pipe doesn't work with find_unused_parameters.
This PR adds a simple fix to enable this functionality. This only currently
works for Pipe within a single host and needs to be re-worked once we support
cross host Pipe.
ghstack-source-id: 119573413
Test Plan:
1) unit tests added.
2) waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D25719922
fbshipit-source-id: 948bcc758d96f6b3c591182f1ec631830db1b15c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394
This is a preliminary refactor for the next diff that will add an
additional flag to control whether we throw a StopIteration or not. We
basically move the flags for ddp uneven inputs to a simple class.
ghstack-source-id: 116428177
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24739509
fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
Summary:
Sorry for my previous inaccurate [PR](https://github.com/pytorch/pytorch/pull/42471#issue-462329192 ).
Here are some toy code to illustrate my point:
* non-DistributedDataParallel version
```python
import torch
if __name__ == "__main__":
torch.manual_seed(0)
inp = torch.randn(1,16)
inp = torch.cat([inp, inp], dim=0)
model = torch.nn.Linear(16, 2)
loss_func = torch.nn.CrossEntropyLoss()
opti = torch.optim.SGD(model.parameters(), lr=0.001)
opti.zero_grad()
loss = loss_func(model(inp), torch.tensor([0, 0]))
loss.backward()
opti.step()
print("grad:", model.weight.grad)
print("updated weight:\n", model.weight)
```
* DistributedDataParallel version
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process
def run(rank, size):
torch.manual_seed(0)
x = torch.randn(1,16)
model = torch.nn.Linear(16, 2)
model = torch.nn.parallel.DistributedDataParallel(model)
loss_func = torch.nn.CrossEntropyLoss()
opti = torch.optim.SGD(model.parameters(), lr=0.001)
opti.zero_grad()
y = model(x)
label = torch.tensor([0])
loss = loss_func(y, label)
loss.backward()
opti.step()
if rank == 0:
print("grad:", model.module.weight.grad)
print("updated weight:\n", model.module.weight)
def init_process(rank, size, fn, backend="gloo"):
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
dist.init_process_group(backend, rank=rank, world_size=size)
fn(rank, size)
if __name__ == "__main__":
size = 2
process = []
for rank in range(size):
p = Process(target=init_process, args=(rank, size, run))
p.start()
process.append(p)
for p in process:
p.join()
```
Both of these two pieces of code have the same output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47156
Reviewed By: mruberry
Differential Revision: D24675199
Pulled By: mrshenli
fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158
1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)
The new style can be used for testing any new comm hook like PowerSGD easily.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 116012600
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
Reviewed By: rohan-varma
Differential Revision: D24669639
fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270
This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153
I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways.
To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file.
To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617
Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
//arvr/projects/eye_tracking/Masquerade:python_test
USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install
Reviewed By: mrshenli
Differential Revision: D24700959
fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
Reviewed By: pritamdamania87
Differential Revision: D24471910
fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304
In the case that a single process operates only on one GPU, we can
avoid this scatter and instead replace it with a recursive version of `to`
which transfers the input tensors to the correct device.
The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
ghstack-source-id: 114896677
Test Plan: Added unittest, and CI
Reviewed By: pritamdamania87
Differential Revision: D24296377
fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD
This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.
The main differences with the previous PR are two;
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting the training or inference of the actual module. Making this much simpler to use from the user side.
As we discussed offline, there was the suggestion of not using a mixin, but changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, by the time being we need `LazyLinear` to be a `torch.nn.Module` subclass since there are many checks that rely on the modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
def __init__(self):
super(MyNetwork, self).__init__()
self.conv = torch.nn.Conv2d(20, 4, 2)
self.linear = torch.nn.LazyLinear(10)
def forward(self, x):
y = self.conv(x).clamp(min=0)
return self.linear(y)
```
Here, when the __setattr__ function is called at the time LazyLinear is registered, it won't be added to the child modules of `MyNetwork`, so we have to manually do it later, but currently there is no way to do such thing as we can't access the parent module from LazyLinear once it becomes the Linear module. (We can add a workaround to this if needed).
TODO:
Add convolutions once the design is OK
Fix docstrings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538
Reviewed By: ngimel
Differential Revision: D24162854
Pulled By: albanD
fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826
As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.
To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass in that parameter to reducer if that parameter is in the given list.
ghstack-source-id: 113210109
Test Plan: Added unittest
Reviewed By: xw285cornell, mrshenli
Differential Revision: D23740639
fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238
Adds a warning when there is much higher than expected amount of
discrepancy of inputs across different processes when running with uneven
inputs. This is because a skew in the thousands can reduce performance a
nontrivial amount as shown in benchmarks, and it was proposed to add this
warning as a result. Tested by running the tests so the threshold is hit and
observing the output.
ghstack-source-id: 112773552
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23719270
fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
Summary:
We currently are fetching an allreduced tensor from Python in C++ in, where we are storing the resulting tensor in a struct's parameter. This PR removes extra tensor paratemeter in the function parameter and fetch from a single place.
Fixes https://github.com/pytorch/pytorch/issues/43960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914
Reviewed By: rohan-varma
Differential Revision: D23798888
Pulled By: bugra
fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798
[test all]
Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well.
Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112279261
ghstack-source-id: 112279261
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D23735185
fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326
Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112011490
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D23583017
fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622
Remove an extra empty line in the warning comments.Remove an extra empty line.
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D23674070
fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427
Closes https://github.com/pytorch/pytorch/issues/44425
DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
ghstack-source-id: 111786479
Reviewed By: pritamdamania87
Differential Revision: D23609070
fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
Summary:
Polishes DDP join api docstrings and makes a few minor cosmetic changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973
Reviewed By: zou3519
Differential Revision: D23467238
Pulled By: rohan-varma
fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577
Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:
#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined process contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example we schedule the same allreduces in the same order as the backward pass, but with zeros
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy
We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.
#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for a varying number of different iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)
#### Performance implications
###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s
###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s
###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s
###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s
###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s
The 2 above uneven inputs benchmark was conducted 32 GPUs and 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all), while the other 28 iterating as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, when there is a single node (8 GPUs), this does not reproduce.
#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819
Reviewed By: mrshenli
Differential Revision: D22893859
fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43557
backout the diff that caused some errors in pytext distributed training
Test Plan: Tested by rayhou who verified reverting the diff works
Differential Revision: D23320238
fbshipit-source-id: caa0fe74404059e336cd95fdb41373f58ecf486e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869
We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.
The main problem was as we call `work.wait()` before invoking `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.
In that PR, we ensure that FutureNCCL's `then` callback is not stalling the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from device's pool using `at::cuda::getStreamFromPool` to run `callback` and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream by callback's stream not the default stream.
ghstack-source-id: 110208431
Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.
Reviewed By: pritamdamania87
Differential Revision: D23055807
fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42400
mcarilli spotted that in the original DDP communication hook design described in [39272](https://github.com/pytorch/pytorch/issues/39272), the hooks receive grads that are already predivided by world size.
It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea.
We also included a warning in the register_comm_hook API as:
> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
ghstack-source-id: 109548696
**Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`.
Test Plan: python test/distributed/test_c10d.py and perf benchmark tests.
Reviewed By: ezyang
Differential Revision: D22883905
fbshipit-source-id: 3277323fe9bd7eb6e638b7ef0535cab1fc72f89e
Summary:
Add space between double back quotes and left curly bracket
Otherwise doc generation failed with `Inline literal start-string without end-string.`
This regression was introduced by b56db305cf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42559
Reviewed By: glaringlee
Differential Revision: D22931527
Pulled By: malfet
fbshipit-source-id: 11c04a92dbba48592505f704d77222cf92a81055
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41283
in optimizer.zero_grad(), detach_ is useful to avoid memory leak only when grad has grad_fn, so add check to call grad.detach_ only when the grad has grad_fn in zero_grad() function
ghstack-source-id: 108702289
Test Plan: unit test
Reviewed By: mrshenli
Differential Revision: D22487315
fbshipit-source-id: 861909b15c8497f1da57f092d8963d4920c85e38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596
We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).
1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.
`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
ghstack-source-id: 108409748
Test Plan:
Run old python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result result with the case of no hook registered. Without the then callback, the future_value in reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.
As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests
OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).
Reviewed By: izdeby
Differential Revision: D22583690
fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41297
GH issue: https://github.com/pytorch/pytorch/issues/40105
Add a helper function to DDP to print out all relevant env vars for debugging
Test Plan:
test through unittest, example output:
---
env:RANK=3
env:LOCAL_RANK=N/A
env:WORLD_SIZE=N/A
env:MASTER_PORT=N/A
env:MASTER_ADDR=N/A
env:CUDA_VISIBLE_DEVICES=N/A
env:GLOO_SOCKET_IFNAME=N/A
env:GLOO_DEVICE_TRANSPORT=N/A
env:NCCL_SOCKET_IFNAME=N/A
env:NCCL_BLOCKING_WAIT=N/A
...
---
Reviewed By: mrshenli
Differential Revision: D22490486
fbshipit-source-id: 5dc7d2a18111e5a5a12a1b724d90eda5d35acd1c
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common devices support. Torch.cuda.comm is kept as is for backward compatibility
- Provide common APIs to arbitrary device types without changing existing CUDA APIs in torch.cuda space.
- Replace the torch.cuda calls in DataParellel/DistributedDataParallel with the new APIs.
Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454
Differential Revision: D22051557
Pulled By: mrshenli
fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
Summary:
1. In reducer.cpp, we have a new boolean `find_unused_param_` and its value is set in `Reducer::prepare_for_backward`.
If `!find_unused_param_`, then it avoids `allreduce(local_used_maps_dev_)`.
2. Solves issue [38942](https://github.com/pytorch/pytorch/issues/38942).
3. Fixes incorrect `find_unused_parameters_` passing like checking `outputs.empty()` or `unused_parameters_.empty()`.
ghstack-source-id: 106693089
Test Plan:
1. Run `test/distributed/test_c10d.py` and make sure all tests pass.
2. A new test case `test_find_unused_parameters_when_unused_parameters_empty` is included. Old `reducer.cpp` was failing in that unit test because it was checking `find_unused_parameters_` by `unused_parameters_.empty()`. Current `reducer.cpp` passes this unit test.
3. Two test cases were failing `test_forward_backward_unused_parameters` and `test_forward_backward_optimizer` , because `find_unused_parameter_` of their `reducer` object was not set properly. I fixed that as well.
Imported from OSS
**Output of version 14:**
```
................s.....s...............................................test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
.test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
.....s...............................
----------------------------------------------------------------------
Ran 108 tests in 214.210s
OK (skipped=3)
```
Differential Revision: D22176231
fbshipit-source-id: b5d15f034e13a0915a474737779cc5aa8e068836
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.
This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) supporting debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`).
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master. In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.
Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358
Differential Revision: D22165785
Pulled By: albanD
fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
Summary:
Currently, whether `AccumulateGrad` [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contigous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.
The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).
Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads. This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.
For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout). The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes. I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.
Channels last params will be a major beneficiary of this PR, but I don't see it as channels-last-specific fix. The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.
This PR also updates autograd docs to describe potential BC-breaking changes below.
## BC notes
ngimel albanD gchanan
#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change. Any user code that was accustomed to `view(-1)`ing these grads will break.
Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed. In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params. Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous. IMO this is a mild BC breakage. Param backward hooks still see grads come in with whatever format the backward kernel gave them. The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`. Any such users hopefully know they're off the edge of the map and understand how to update their expectations.
#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place. Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout. After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`. This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.
One limitation (present before this PR and unchanged by this PR): Presetting `param.grad` does not ensure in-place accumulation all the time. For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~ PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately. Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904
Differential Revision: D20496044
Pulled By: albanD
fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35137
bucket order is rebuilt dynamically in the first reduction backward pass when find_unused_parameters = false
ghstack-source-id: 104794018
Test Plan: unit test
Differential Revision: D20128537
fbshipit-source-id: fad73de965cdcb59a51c0a12b248271344584b9f
Summary:
We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063
Differential Revision: D20549621
Pulled By: ngimel
fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
Summary:
Scatter is unnecessary if only using one device, and it breaks on some custom data structures like namedtuple, so would like to avoid :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22384
Differential Revision: D16428208
Pulled By: soumith
fbshipit-source-id: eaa3876b2b95c1006ccaaacdb62f54c5280e730c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22037
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out of band signal
is needed whether or not a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15926383
fbshipit-source-id: 39c0d5dbd95bf0534314fdf4d44b2385d5321aaf
Summary:
The first attempt and more discussions are available in https://github.com/pytorch/pytorch/issues/19577
#### Goal
Allow toggling DDP gradient synchronization across iterations. With this feature, users may accumulate grads in module variables, and only kick off expensive grad synchronize every a few iterations.
#### Concerns
Our first attempt in https://github.com/pytorch/pytorch/issues/19577 tries to do it using a variable or a function. But apaszke made a good point that it will not be error prone, and favors a context manager instead.
#### Proposed Solution
Instead of providing a `accumulate_grads` variable/function/context, we provide a `DistributedDataParallel.no_sync()` context manager. And it does exactly what the name suggests, i.e., disable DDP grad synchronization within the context. Note that `accumulate_grads` means `no_sync` + no optimizer step, where the latter is not controlled by DDP.
It is true that users need to call another `model(input).backward()` after exiting the context, and this is indeed more verbose. But I think it is OK as one major concern in the previous discussion is to prevent users from running into errors without knowing it. This API should reaffirm the expected behavior, and does not mess up with other use cases if accumulating grads is not required..
The application would then look like:
```python
with ddp.no_sync():
for input in inputs:
ddp(input).backward()
ddp(one_more_input).backward()
optimizer.step()
```
chenyangyu1988 myleott
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21736
Differential Revision: D15805215
Pulled By: mrshenli
fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19443
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out of band signal
is needed whether or not a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15007365
fbshipit-source-id: f298e83fd3ca828fae9e80739e1db89d045c99ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236
Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.
Closes#17757.
Reviewed By: mrshenli
Differential Revision: D15245428
fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19897
During validation, gradient reduction is not needed, and autograd is
never called. The model output will always be a detached tensor. After
the new reducer was merged, this meant that it would find all model
parameters unused, and kick off reduction for them. When #19799 and
output where no parameters are used and it tries to kick off reduction
of zeroed gradients. Test for `torch.is_grad_enabled()` and
`self.training` before calling into the reducer.
Reviewed By: mrshenli
Differential Revision: D15118726
fbshipit-source-id: b0208f632a61cbe8110fa626fa427937b7f05924
Summary:
As DDP in previous releases does not support unused params, turning off `find_unused_parameters` by default to derisk new reducer.
CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895
Reviewed By: pietern
Differential Revision: D15118563
Pulled By: mrshenli
fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58