Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62392
The constructor of `PeriodicModelAverager` does not need to accept parameters.
ghstack-source-id: 134626245
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29986446
fbshipit-source-id: 6a8b709e4383a3c44b9e60955fbb067cd2868e76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111
This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134489187
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29884954
fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079
Adds support for passing keyword arguments to a functional optimizer running as a hook.
ghstack-source-id: 134330379
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29838127
fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62177
Reland of https://github.com/pytorch/pytorch/pull/61678
Fix the CI failure by including the torchvision model only when torchvision is available.
ghstack-source-id: 134282165
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29904101
fbshipit-source-id: 47e799eb4a90acbbda91c5857ea00de3045d49f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61678
This diff makes the following changes:
- Add a `step_param` method to the `_FunctionalSGD` class, written similarly to `step` but for a single param
- Implement a communication hook wrapper that runs a given comm. hook and then applies a functional SGD step
- Verify that this is equal to regular allreduce + SGD optimizer
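A minimal single-process sketch of the idea, assuming plain Python lists stand in for tensors and a list of per-rank gradients stands in for the communication. The names `sgd_step_param`, `allreduce_mean_hook`, and `hook_then_step` are hypothetical and are not the actual `_FunctionalSGD` or hook wrapper API:

```python
from typing import Callable, List


def sgd_step_param(param: List[float], grad: List[float], lr: float) -> None:
    """Plain SGD update for a single parameter, in place (no momentum, no weight decay)."""
    for i, g in enumerate(grad):
        param[i] -= lr * g


def allreduce_mean_hook(per_rank_grads: List[List[float]]) -> List[float]:
    """Stand-in for a comm hook: average one parameter's gradient across ranks."""
    world_size = len(per_rank_grads)
    return [sum(column) / world_size for column in zip(*per_rank_grads)]


def hook_then_step(hook: Callable, param: List[float], lr: float) -> Callable:
    """Wrap `hook` so the per-param optimizer step runs right after communication."""
    def wrapped(per_rank_grads: List[List[float]]) -> List[float]:
        reduced = hook(per_rank_grads)
        sgd_step_param(param, reduced, lr)
        return reduced
    return wrapped


grads = [[1.0, 2.0], [3.0, 6.0]]  # gradients of one parameter on two ranks

# Path 1: the optimizer step is fused into the hook.
fused_param = [10.0, 20.0]
hook_then_step(allreduce_mean_hook, fused_param, lr=0.5)(grads)

# Path 2: regular allreduce followed by a separate SGD step.
baseline_param = [10.0, 20.0]
sgd_step_param(baseline_param, allreduce_mean_hook(grads), lr=0.5)
```

Both paths should leave the parameter in the same state, which is the parity property the tests in this diff check.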
ghstack-source-id: 134263399
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29701447
fbshipit-source-id: 183954593b82a092414623292f9b10e675fef96e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62105
This prepares for wrapping the averager as an optimizer, which can only accept parameters rather than a module.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134213572
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters
Reviewed By: rohan-varma
Differential Revision: D29883693
fbshipit-source-id: 474ba924a0b05068b12f163fb74582bccf314964
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62074
Since SPMD mode is retired, the comm hook result will always be a single tensor.
This improves the comm hook developer experience, because there is no longer any need to add an extra `[0]` to the precursor future result.
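A small illustration of the difference in hook code, assuming `concurrent.futures.Future` as a stand-in for the future returned by the precursor communication and Python lists as stand-ins for tensors. The helper names are hypothetical:

```python
from concurrent.futures import Future


def completed_future(value):
    """Build an already-completed future holding `value` (stand-in only)."""
    fut = Future()
    fut.set_result(value)
    return fut


def divide_old_style(fut, world_size):
    # Old convention: the result was a one-element list, hence the extra `[0]`.
    return [g / world_size for g in fut.result()[0]]


def divide_new_style(fut, world_size):
    # New convention: the result is the single tensor itself.
    return [g / world_size for g in fut.result()]


old_result = divide_old_style(completed_future([[2.0, 4.0]]), world_size=2)
new_result = divide_new_style(completed_future([2.0, 4.0]), world_size=2)
```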
#Closes: https://github.com/pytorch/pytorch/issues/61914
ghstack-source-id: 134164593
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork
Reviewed By: rohan-varma
Differential Revision: D29864732
fbshipit-source-id: 59fe6dd78b66214b1788514ad4d236039d9bda31
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.
**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)
The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.
This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("Dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager: (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_early_termination=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.
Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process has not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.
**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).
**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Runtime arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.
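The recap can be sketched as a single-process mock, using the unprefixed names from the list above. This is an illustration of the class relationships only, not the actual implementation: there are no collective communications, the string `"work-handle"` is a placeholder for the async all-reduce handle, and `Model` and the `scale` keyword argument are hypothetical:

```python
from abc import ABC, abstractmethod


class JoinHook:
    """Hooks run for a joined process; both default to no-ops."""
    def main_hook(self):
        pass

    def post_hook(self, is_last_joiner):
        pass


class JoinConfig:
    """Per-context fields that `Join` stores on each `Joinable`."""
    def __init__(self, enable, throw_on_early_termination, is_first_joinable):
        self.enable = enable
        self.throw_on_early_termination = throw_on_early_termination
        self.is_first_joinable = is_first_joinable


class Joinable(ABC):
    def __init__(self):
        self.join_config = None  # filled in by `Join`

    @abstractmethod
    def join_hook(self, **kwargs): ...

    @property
    @abstractmethod
    def join_device(self): ...

    @property
    @abstractmethod
    def join_process_group(self): ...


class Join:
    def __init__(self, joinables, enable=True, throw_on_early_termination=False, **kwargs):
        # Hook-specific keyword arguments are forwarded to every join_hook().
        self.hooks = [j.join_hook(**kwargs) for j in joinables]
        for index, joinable in enumerate(joinables):
            joinable.join_config = JoinConfig(
                enable, throw_on_early_termination, is_first_joinable=(index == 0)
            )

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Single-process stand-in: there are no peers to shadow, so the main
        # hooks are skipped and only the post-hooks run.
        if exc_type is None:
            for hook in self.hooks:
                hook.post_hook(is_last_joiner=True)
        return False

    @staticmethod
    def notify_join_context(joinable):
        config = joinable.join_config
        if config is None or not config.enable or not config.is_first_joinable:
            return None
        return "work-handle"  # placeholder for the async all-reduce handle


class Model(Joinable):
    """Hypothetical class `C` from the recap."""
    def __init__(self):
        Joinable.__init__(self)
        self.notifications = 0
        self.post_hook_scale = None

    def join_hook(self, **kwargs):
        owner, scale = self, kwargs.get("scale", 1)

        class ModelJoinHook(JoinHook):
            def post_hook(self, is_last_joiner):
                owner.post_hook_scale = scale

        return ModelJoinHook()

    @property
    def join_device(self):
        return "cpu"

    @property
    def join_process_group(self):
        return None

    def step(self):
        # One-line addition before the per-iteration collectives.
        if Join.notify_join_context(self) is not None:
            self.notifications += 1


first, second = Model(), Model()
with Join([first, second], scale=3):
    for _ in range(2):
        first.step()
        second.step()
```

Only `first` receives a work handle from `notify_join_context()`, while both instances see the forwarded `scale` keyword argument in their post-hooks.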
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555
Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```
Reviewed By: zou3519
Differential Revision: D29690359
Pulled By: andwgu
fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61207
The model averager must now be combined with the post-localSGD DDP communication hook. It skips model averaging for the first K steps, because the post-localSGD communication hook runs global gradient averaging during this phase.
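A sketch of the resulting schedule only, assuming averaging happens every `period` steps once the first `warmup_steps` (the K above) have passed. The class name, argument names, and exact schedule are assumptions for illustration, and the actual averaging call is left out:

```python
class PeriodicAveragerSchedule:
    """Tracks when model averaging would run; the averaging itself is omitted."""

    def __init__(self, period, warmup_steps=0):
        if period < 1:
            raise ValueError("period must be a positive integer")
        self.period = period
        self.warmup_steps = warmup_steps
        self.step = 0

    def should_average(self):
        # Skip the first `warmup_steps` steps entirely; the communication hook
        # is doing global gradient averaging during that phase.
        past_warmup = self.step >= self.warmup_steps
        return past_warmup and (self.step - self.warmup_steps) % self.period == 0

    def advance(self):
        """Return whether this step averages, then move to the next step."""
        averaged = self.should_average()
        self.step += 1
        return averaged


schedule = PeriodicAveragerSchedule(period=2, warmup_steps=3)
averaged_steps = [step for step in range(8) if schedule.advance()]
```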
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 133371335
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: pritamdamania87
Differential Revision: D29523738
fbshipit-source-id: 3fa9611046e1c0afa4bda78aa3ba200fa2a5fa4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61206
Create a communication hook to run post-local SGD. This will be combined with the model averager component to better support local SGD.
In contrast to the previous approach, which runs local gradient averaging plus global model averaging at each step for the first K steps, the new plan is to run only global gradient averaging at each of the first K steps, just like normal DDP. This gives us two advantages:
1) For some optimizers, model averaging can cause discrepancy in optimizer states. If we still do global gradient averaging for the first K steps, we can defer such discrepancy until we actually start local SGD.
2) Gradient averaging during the first K steps runs only one allreduce, which overlaps with the backward pass, so it should also be more efficient.
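The two phases can be sketched with a single-process simulation, assuming one scalar gradient per rank and assuming that after the first K steps gradients are averaged only within each machine. The function name and the `machines` grouping are hypothetical:

```python
def mean(values):
    return sum(values) / len(values)


def post_local_sgd_grads(grads, machines, step, start_local_sgd_step):
    """Return the gradient each rank would see after the hook runs.

    grads: dict mapping rank -> scalar gradient.
    machines: list of rank lists, one per machine (the local subgroups).
    """
    if step < start_local_sgd_step:
        # First K steps: global gradient averaging, just like normal DDP.
        global_mean = mean(list(grads.values()))
        return {rank: global_mean for rank in grads}
    # Afterwards: average only within each machine's subgroup.
    result = {}
    for ranks in machines:
        local_mean = mean([grads[rank] for rank in ranks])
        for rank in ranks:
            result[rank] = local_mean
    return result


grads = {0: 1.0, 1: 3.0, 2: 5.0, 3: 7.0}
machines = [[0, 1], [2, 3]]
during_warmup = post_local_sgd_grads(grads, machines, step=0, start_local_sgd_step=10)
after_warmup = post_local_sgd_grads(grads, machines, step=10, start_local_sgd_step=10)
```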
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 133371322
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: pritamdamania87
Differential Revision: D29523292
fbshipit-source-id: 3f215f7150f2917c2781278fad759530c685ea2c
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.
**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)
There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.
The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined and receives the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.
The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.
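To make the protocol concrete, here is a single-process simulation of the hook call pattern, assuming each rank has a fixed number of batches and that one "round" corresponds to one per-iteration all-reduce of ones. It counts how often each rank's `main_hook()` would run and which rank would see `is_last_joiner=True`. The function name is hypothetical and no real collectives are involved:

```python
def simulate_join(batches_per_rank):
    world_size = len(batches_per_rank)
    main_hook_calls = [0] * world_size
    joined_at = [None] * world_size
    round_index = 0
    while True:
        active = [r for r in range(world_size) if round_index < batches_per_rank[r]]
        # Models the all-reduce of ones: the sum is the number of non-joined ranks.
        num_nonjoined = len(active)
        for rank in range(world_size):
            if rank not in active and joined_at[rank] is None:
                joined_at[rank] = round_index
        if num_nonjoined == 0:
            break
        # Joined ranks shadow the collectives of the ranks still training.
        for rank in range(world_size):
            if rank not in active:
                main_hook_calls[rank] += 1
        round_index += 1
    last_round = max(joined_at)
    is_last_joiner = [joined == last_round for joined in joined_at]
    return main_hook_calls, is_last_joiner


main_hook_calls, is_last_joiner = simulate_join([2, 4, 3])
```

With 2, 4, and 3 batches, rank 0 shadows two iterations, rank 2 shadows one, and rank 1 is the last joiner.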
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757
Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```
Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.
Reviewed By: iramazanli, mrshenli
Differential Revision: D29624636
Pulled By: andwgu
fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60891
This fix is particularly useful for local SGD when the averaging period is very small, which may cause a conflict between the gradient allreduce within the per-machine subgroup and the global parameter allreduce over the communication world.
ghstack-source-id: 132564252
Test Plan:
f281873295 (#Try1) failed due to the conflict between global process group and subgroup.
```
<Thread(configerator-monitor-singleton, started 139839806633728)>
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/tmp/jetter.gson7tr3/configerator/client.py", line 348, in _monitor_loop
self._parent_thread.join(self._interval_ms / 1000)
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1015, in join
self._wait_for_tstate_lock(timeout=max(timeout, 0))
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
```
Fixed after adding an explicit sync: f282044866, f282241800
Reviewed By: rohan-varma
Differential Revision: D29434597
fbshipit-source-id: a4f777fc26f379639f85fda32de425cd3b337b33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60320
This averager can be used for post-local SGD.
ghstack-source-id: 131908011
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29249850
fbshipit-source-id: 09675d6bb1edfb8ffbeb94510d91962532d8ca3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60303
The util function can be used for averaging parameters.
More optimizations can be done in the future.
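A single-process sketch of what such a util computes, assuming each rank's parameters are flattened into a Python list and that the all-reduce is modeled as a sum over ranks. The function below is illustrative and is not the actual util:

```python
def average_parameters(replicas):
    """Average parameters across ranks, in place.

    replicas[rank][i] is the i-th flattened parameter element on `rank`.
    """
    world_size = len(replicas)
    # Scale by 1 / world_size first, then sum across ranks (the all-reduce).
    averaged = [
        sum(value / world_size for value in column) for column in zip(*replicas)
    ]
    for params in replicas:
        params[:] = averaged


replicas = [[1.0, 4.0], [3.0, 8.0]]
average_parameters(replicas)
```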
ghstack-source-id: 132214212
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_average_parameters
Reviewed By: rohan-varma
Differential Revision: D29242806
fbshipit-source-id: 76fb5a92adb4bdc6151a9f411e366a0ed2a31f47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59576
If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.
This fix is applied to both C++ and Python comm hooks.
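A small numeric illustration of the overflow, assuming a simplified model of FP16 that captures only its finite range (largest finite value 65504) and ignores rounding. The helper names are hypothetical:

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def clamp_to_fp16_range(x):
    """Model only FP16 overflow: values beyond the finite range become inf."""
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x


def allreduce_sum_fp16(values):
    """Sum across ranks, overflowing to inf as an FP16 accumulator would."""
    total = 0.0
    for value in values:
        total = clamp_to_fp16_range(total + clamp_to_fp16_range(value))
    return total


grads = [30000.0, 30000.0, 30000.0, 30000.0]  # one gradient element on 4 ranks
world_size = len(grads)

# Division after allreduce: the intermediate sum exceeds the FP16 range.
divide_after = clamp_to_fp16_range(allreduce_sum_fp16(grads) / world_size)

# Division before allreduce: every intermediate value stays in range.
divide_before = allreduce_sum_fp16(
    [clamp_to_fp16_range(g / world_size) for g in grads]
)
```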
ghstack-source-id: 130754510
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D28941327
fbshipit-source-id: 932e8ddbdb2bfd609a78943f6dc390d3d6ca333f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522
If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.
This fix is applied to both C++ and Python comm hooks.
ghstack-source-id: 130686229
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D28922548
fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57410
FP16 gradient compression may run into an 'inf' issue. Switching to division before allreduce can avoid this problem.
ghstack-source-id: 127877083
Test Plan:
before change:
f268909897
after change:
f270950609
If you still see 'grad_norm = inf' after enabling the fp16 hook, you can resume the training with the hook turned off.
Reviewed By: SciPioneer
Differential Revision: D28128628
fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
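For illustration, the check can be approximated in Python with a regular expression. This is a sketch of the idea rather than the actual `git grep` lint, and the pattern is an assumption about what counts as a properly qualified `noqa`:

```python
import re

# Flags `# noqa` unless it is immediately followed by a colon and an error
# code, e.g. `# noqa: F401`.
UNQUALIFIED_NOQA = re.compile(r"#\s*noqa(?!:\s*[A-Z]+[0-9]+)")


def has_unqualified_noqa(line):
    return UNQUALIFIED_NOQA.search(line) is not None


# Expected verdict for each sample line (True means the lint should complain).
samples = {
    "x = compute()  # noqa": True,
    'print(f"format blank") # noqa F541': True,  # missing colon
    "import os  # noqa: F401": False,
}
results = {line: has_unqualified_noqa(line) for line in samples}
```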
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55738
Per title, and use 0 as the default value.
It turns out that setting this epsilon as 0 can accelerate convergence and improve accuracy for some use cases.
Test Plan:
unit tests
f264687105
f264675194
Reviewed By: shuyingsunshine21
Differential Revision: D27694971
fbshipit-source-id: b61528c6c817127974acdc4635bccf607532287f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55666
{F590513307}
Some code is not properly displayed due to an extra whitespace ahead of `(num_rows + num_cols)`.
ghstack-source-id: 126148569
Test Plan: Locally viewed
Reviewed By: rohan-varma
Differential Revision: D27673663
fbshipit-source-id: 603ae4ddbe86ceaefc311885b82b0f6b48b57b27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55295
Update `_powerSGD_comm_hook_wrapper` to expose only the 2 most critical hyperparameters, making this API clearer to future users (although the second hyperparameter, `start_powerSGD_iter`, is not in use yet).
Test Plan: waitforbuildbot
Reviewed By: shuyingsunshine21
Differential Revision: D27561734
fbshipit-source-id: b661981cc033b109f4f2fc92b435567a184a7fb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55272
1. Set 1K as the default value of `start_powerSGD_iter` for practicability. The original default value 10 is usually too small for real use cases. The new default value 1K is also consistent with PyTorch Lightning.
2. Update the docstring of `start_powerSGD_iter` to remind the users to set a value no less than the warm-up steps if any.
3. Update some unit tests to start PowerSGD early.
ghstack-source-id: 125707662
Test Plan: waitforbuildbot
Reviewed By: shuyingsunshine21
Differential Revision: D27553388
fbshipit-source-id: 40076419bc85755c0c0b64b79ba914b241085fcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253
Previously, DDP communication hooks took a tensor list as the input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.
The next step is to limit Reducer to a single model replica.
ghstack-source-id: 125677637
Test Plan: waitforbuildbot
Reviewed By: zhaojuanmao
Differential Revision: D27533898
fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55103
Previously, the compression rate was reported only in the PowerSGD hook. Also report this metric for comprehensive experimentation.
It is very easy to compute the sizes before and after compression, because there is only one matrix factorization per bucket, and no accumulation within the bucket is needed.
1) The size before compression is the input tensor size.
2) The size after compression is the size of P + Q, where each has a size of `square_side_length * state.matrix_approximation_rank`.
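As a worked example of the two sizes, assuming that `square_side_length` is the ceiling of the square root of the bucket's element count (an assumption, since its definition is not given here) and using the ratio of uncompressed to compressed size as the rate. The function name is hypothetical:

```python
import math


def batched_compression_stats(numel, matrix_approximation_rank):
    """Return (size before, size after, ratio) for one bucket."""
    # Assumption: the flattened bucket is viewed as a square matrix.
    square_side_length = math.ceil(math.sqrt(numel))
    size_before = numel
    # P and Q each hold square_side_length * rank elements.
    size_after = 2 * square_side_length * matrix_approximation_rank
    return size_before, size_after, size_before / size_after


before, after, rate = batched_compression_stats(
    numel=1_000_000, matrix_approximation_rank=1
)
```

For a bucket of one million elements at rank 1, P and Q together hold 2000 elements.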
ghstack-source-id: 125399028
Test Plan: Tested by running scripts/wayi/torch/power_sgd.py locally.
Reviewed By: deadlybulb
Differential Revision: D27474295
fbshipit-source-id: a2225e85be03ab20238f01014d5ec9ae1787c4fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54838
It turns out that an explicit sync is still needed for the batched PowerSGD hook: a job failure is fixed by this change.
The sync was once removed by #54482.
Test Plan:
f260900882
f260899693
Reviewed By: rohan-varma
Differential Revision: D27384738
fbshipit-source-id: 3efd738b9fd375e2ceb36ed3a6bf99cd8ce8ff95
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54647
Regularly log stats showing effect of gradient compression when using the PowerSGD DDP communication hook.
Test Plan:
buck run mode/dev-nosan scripts/wayi/torch:power_sgd
Play with the layer sizes of the input model (you can just use linear layers for convenience), and check the log that shows compression stats. For convenience, you can change `logging.info` to `print` locally.
You can create some test diffs on top of this diff, to show that the compression stats are correct in different cases.
Run with power_sgd script:
{F537381542}
Diff with example using a simple linear model: D27299934
sample output:
{F538486535}
Reviewed By: SciPioneer
Differential Revision: D27240254
fbshipit-source-id: 9e142b2f7957cc874804f799b7bb3bffdf824858
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53780
Update the comment, because the input data type of `fp16_compress_hook` does not have to be FP32. For example, the input dtype can also be FP64, as long as it can be cast to FP16.
ghstack-source-id: 123680621
Test Plan: N/A
Reviewed By: iseessel
Differential Revision: D26967224
fbshipit-source-id: 26d79a3629a597e6335b6f59c97d25a764a8ed80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52979
Compression rate = uncompressed size / compressed size, so the compression rate is usually greater than 1.
Previously the compression rate was perceived as compressed size / uncompressed size, which can be very confusing.
ghstack-source-id: 122996272
Test Plan: unit tests
Reviewed By: zhaojuanmao
Differential Revision: D26713349
fbshipit-source-id: 83b7f8908c101954cf01f56a22161047fbfeaa53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010
To determine the boundary between different iterations in a DDP communication hook, currently the user code needs `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hook.
Create an API to hide the details and improve the usability before publishing GradBucket APIs.
ghstack-source-id: 122723081
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D26720813
fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009
It can be a common operation to apply layer-wise operations over per-parameter tensors in a DDP communication hook.
Create a util method in GradBucket class before publishing GradBucket APIs.
ghstack-source-id: 122833594
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
f254364097
Reviewed By: rohan-varma
Differential Revision: D26717893
fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
Summary:
Fixes #52034
- Add a minimum compression rate threshold to `PowerSGDState`
- Use the threshold to determine whether to compress high-rank tensors or not
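One way such a threshold check could look, assuming a `num_rows x num_cols` gradient matrix is compressed into two factors holding `(num_rows + num_cols) * rank` elements in total. The function name and the exact inequality are assumptions for illustration, not necessarily what this PR implements:

```python
def should_compress(num_rows, num_cols, rank, min_compression_rate):
    """Compress only if the size reduction meets the minimum compression rate."""
    uncompressed_size = num_rows * num_cols
    compressed_size = (num_rows + num_cols) * rank
    return compressed_size * min_compression_rate < uncompressed_size


# A large matrix compresses well at rank 8; a small one does not.
large = should_compress(1024, 1024, rank=8, min_compression_rate=2)
small = should_compress(16, 16, rank=8, min_compression_rate=2)
```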
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52541
Test Plan:
No performance regression using rank-8 compression:
baseline: f253000411
updated one: f253010955
Reviewed By: rohan-varma
Differential Revision: D26594862
Pulled By: SciPioneer
fbshipit-source-id: 2859a91b4ca6bd1862bf6cd6441dc2a89badb2d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52593
This hook is not used at all, and it probably can only be used for demonstrating that allgather is slower than allreduce, so it should never be used in practice.
However, this hook and its helper function stay with the communication hook public APIs in the same file. It will be better to make the public API file as concise as possible.
Since I don't think we will use this hook in the future, prefer deleting it to moving it to a separate file.
ghstack-source-id: 122180969
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26575318
fbshipit-source-id: b258154a7c92e33236c34104bd79bc244ecdb158