6 Commits

Author SHA1 Message Date
Yi Wang
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type, defined in caffe2/torch/csrc/distributed/c10d/init.cpp, cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type, ReduceOp, defined in the same file, but this did not work, because that C++ enum class is defined in the torch/lib/c10d library, whereas BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to the distributed package, which could be a new project in itself, I simply removed the argument type of BuiltinCommHookType in this file.
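
A minimal sketch of the conditional-import pattern described above, not the actual code from distributed.py. The names DIST_AVAILABLE and describe_hook are hypothetical, and the getattr lookup on torch.distributed is an assumption made so that the sketch also runs on builds where the enum is not exposed (or where torch is not installed at all):

```python
# Guard distributed-only names so that a build without distributed support
# (for example USE_DISTRIBUTED=0) can still import this module.
try:
    import torch.distributed as dist
    DIST_AVAILABLE = dist.is_available()
except ImportError:  # torch itself is not installed
    dist = None
    DIST_AVAILABLE = False

# Stays None whenever the C++ bindings that define the enum are absent.
BuiltinCommHookType = None
if DIST_AVAILABLE:
    # Assumption: the enum is reachable from torch.distributed; getattr keeps
    # the sketch safe on builds that do not expose it there.
    BuiltinCommHookType = getattr(dist, "BuiltinCommHookType", None)


def describe_hook(hook_type):
    """Hypothetical helper with an untyped argument.

    No annotation refers to the possibly missing enum, so defining this
    function never fails on a non-distributed build.
    """
    if not DIST_AVAILABLE:
        return "distributed unavailable"
    return f"built-in hook: {hook_type}"
```

Leaving the argument unannotated means the module body never has to evaluate the enum name at import time, which is the property the fix relies on.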

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
Yi Wang
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
Yi Wang
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
Yi Wang
ee3d3e6dba [pytorch][PR][Gradient Compression] Reduce the peak memory of fp16 compression provided by ddp comm hook (#46078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46078

The peak memory usage of the DDP comm hook has increased due to an extra copy of the gradient tensors. To reduce the memory usage, decompress the fp16 tensor in place, into the tensor stored in the gradient bucket.
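
The idea can be illustrated with a torch-free sketch, assuming a float32 bucket modelled by array("f") and half precision modelled by the struct format "e". The helper names compress_fp16 and decompress_in_place are hypothetical and are not the hook's real functions:

```python
import struct
from array import array


def compress_fp16(bucket):
    """Pack the float32 bucket into half-precision bytes (2 bytes per value)."""
    return struct.pack(f"<{len(bucket)}e", *bucket)


def decompress_in_place(bucket, compressed):
    """Write the decompressed values back into the existing bucket buffer.

    No second float32 buffer is allocated: each half-precision value is
    unpacked and stored over the corresponding slot of the bucket.
    """
    for i, (value,) in enumerate(struct.iter_unpack("<e", compressed)):
        bucket[i] = value
    return bucket


bucket = array("f", [0.5, -1.25, 3.0, 0.1])
address_before = bucket.buffer_info()[0]  # address of the underlying buffer

compressed = compress_fp16(bucket)
result = decompress_in_place(bucket, compressed)
```

In this sketch the returned object is the original bucket, so the only extra memory held at any time is the half-size compressed payload.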

#Closes: https://github.com/pytorch/pytorch/issues/45968
ghstack-source-id: 113996453

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d  -- test_accumulate_gradients_no_sync_allreduce_hook

Also verified the decrease in memory consumption with some toy modeling examples.

Reviewed By: pritamdamania87

Differential Revision: D24178118

fbshipit-source-id: 453d0b52930809bd836172936b77abd69610237a
2020-10-12 16:15:38 -07:00
Yi Wang
022ba5a78b Make ddp_comm_hook_wrapper a private method. (#44643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44643

This method is not used anywhere else.

Also formatted the file.

Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks

Reviewed By: pritamdamania87

Differential Revision: D23675945

fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
2020-09-24 13:29:48 -07:00
Sinan Nasir
1a79d7bb28 DDP communication hook examples (#43310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43310

In this diff, we prepared some example DDP communication hooks [#40848](https://github.com/pytorch/pytorch/pull/40848):

1\. `allreduce_hook`: This DDP communication hook simply calls ``allreduce`` on the ``GradBucket`` tensors. Once the gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If the user registers this hook, the DDP results are expected to be the same as when no hook is registered. Hence, it does not change the behavior of DDP, and users can use it as a reference or modify it to log useful information, or for any other purpose, without affecting DDP behavior.

2\. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers the ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather, compared to O(log W) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors, whose type is assumed to be ``torch.float32``, to half-precision floating point format (``torch.float16``). It then allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, called ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook`: Performs per-tensor quantization, using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send scale and zero_point (two floats per rank) before the quantized tensors.

5\. `quantization_perchannel_hook`: Performs per-channel quantization, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor that contains 6 million floats, dividing the tensor into smaller channels (chunks of 512 floats) and quantizing each independently may significantly increase the resolution and result in lower error.
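
The resolution argument behind the last two hooks can be checked with a toy, torch-free sketch. It is an illustration under stated assumptions, not the hooks' implementation: it uses a simplified affine quantizer with a floating-point offset instead of an integer zero_point, 256 levels, and a made-up 2048-element gradient whose 512-float chunks have very different magnitudes. All function names are hypothetical.

```python
import random


def quantize_dequantize(values, levels=256):
    """Quantize to `levels` uniform steps over [min, max], then dequantize."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid a zero scale on constant input
    quantized = [round((v - lo) / scale) for v in values]
    return [q * scale + lo for q in quantized]


def per_tensor(values):
    """One scale and offset for the whole tensor."""
    return quantize_dequantize(values)


def per_channel(values, chunk=512):
    """Independent scale and offset for every chunk of `chunk` floats."""
    restored = []
    for start in range(0, len(values), chunk):
        restored.extend(quantize_dequantize(values[start:start + chunk]))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
# Toy gradient: four chunks of 512 floats with magnitudes 1, 0.1, 0.01, 0.001.
gradient = [
    random.uniform(-magnitude, magnitude)
    for magnitude in (1.0, 0.1, 0.01, 0.001)
    for _ in range(512)
]

err_tensor = mean_abs_error(gradient, per_tensor(gradient))
err_channel = mean_abs_error(gradient, per_channel(gradient))
```

With a single scale, the small-magnitude chunks fall between a handful of quantization steps sized for the largest values; with one scale per chunk, each chunk gets all 256 levels over its own range, so err_channel comes out lower than err_tensor on this data.
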
ghstack-source-id: 110923269

Test Plan:
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
Couldn't download test skip set, leaving all tests enabled...
.....
----------------------------------------------------------------------
Ran 4 tests in 26.724s

OK

Internal testing:
```
buck run mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
```

Reviewed By: malfet

Differential Revision: D22937999

fbshipit-source-id: 274452e7932414570999cb978ae77a97eb3fb0ec
2020-08-28 18:59:14 -07:00