169 Commits

Author SHA1 Message Date
Rohan Varma
a0ac80ec76 [DDP] Don't find tensors if static graph (#58105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58105

When find_unused_parameters=True and static_graph is also set, static graph handles unused-parameter accounting, so this code path is not needed.
ghstack-source-id: 128736289

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28371954

fbshipit-source-id: 0b42a9c0fd2bba26a0de288436e9c7139e292578
2021-05-12 11:40:18 -07:00
Rohan Varma
c52700dbcd [wip] enhance DDPSink to work for general outputs (#57073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57073

Enhances use of DDPSink to work for all output types DDP supports as per https://github.com/pytorch/pytorch/issues/55876.

TODO: Add additional testing for tuple, list, dict return types
ghstack-source-id: 128726768

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27756985

fbshipit-source-id: 2e0408649fb2d6a46d6c33155a24c4c1723dd799
2021-05-12 09:45:10 -07:00
Kimish Patel
ad4cd6ef89 Revert D28338485: make ddp logging api to be private
Test Plan: revert-hammer

Differential Revision:
D28338485 (ac44569b0d)

Original commit changeset: bd2ae7c78904

fbshipit-source-id: d383f42a2051457147dec42ea273ed4fa82ffa1f
2021-05-11 12:12:51 -07:00
Yanli Zhao
ac44569b0d make ddp logging api to be private (#57999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57999

make ddp logging api to be private
ghstack-source-id: 128607185

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28338485

fbshipit-source-id: bd2ae7c78904e93eed88be91876f5a832b5b7886
2021-05-11 10:37:03 -07:00
Yanli Zhao
ea421fb249 enable static graph training in DDP (#55248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248

This PR enables static graph training when users call _set_static_graph(). This can help support more use cases in DDP without performance regression, and can potentially improve performance when there are unused parameters in the graph.
1. The first iteration records graph states, such as how many times a grad is calculated and whether the grad is used or not. The first iteration then queues a delay_all_reduce callback to all-reduce grads.
2. Since an autograd callback is associated with the current target graph task, the delay_all_reduce callback should be associated with the outermost backward graph task. A DDP sink layer is added in the DDP forward loop so that we can queue the delay_all_reduce callback in the sink layer.
3. After the first iteration, DDP uses the saved graph states to determine whether a grad is used or not, and whether a grad is ready for communication.
4. Rebuilding buckets is called in the second iteration, after graph states are recorded in the first iteration.
5. If the graph states change, DDP throws errors.
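
The record-then-verify idea in the steps above can be sketched in plain Python. This is only an illustrative model, not DDP's Reducer implementation: `StaticGraphTracker` and its method names are hypothetical, and the callback and bucket-rebuild machinery is left out.

```python
from collections import Counter


class StaticGraphTracker:
    """Toy model: record per-parameter gradient counts in iteration 1, verify afterwards."""

    def __init__(self):
        self.recorded = None      # graph state saved after the first iteration
        self.current = Counter()  # gradient-ready counts for the running iteration

    def mark_grad_ready(self, param_index):
        # Called once each time a gradient for this parameter is computed.
        self.current[param_index] += 1

    def finish_iteration(self):
        if self.recorded is None:
            # First iteration: save the observed graph state.
            self.recorded = dict(self.current)
        elif dict(self.current) != self.recorded:
            # Later iterations: a different pattern means the graph is not static.
            raise RuntimeError("graph state changed; static graph assumption violated")
        self.current = Counter()

    def is_unused(self, param_index):
        # After the first iteration, unused parameters are known from the saved state.
        return self.recorded is not None and param_index not in self.recorded
```

Once the first iteration has been recorded, `is_unused` can answer from the saved state alone, which is roughly why no per-iteration search for unused parameters is needed in this mode.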
ghstack-source-id: 128599464

Test Plan: unit tests. adding more tests

Reviewed By: rohan-varma

Differential Revision: D27539964

fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
2021-05-11 10:23:25 -07:00
Rohan Varma
fe3c63d9d3 [DDP] fix param to name mapping (#57771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57771

This mapping didn't work properly when certain parameters didn't
require grad. Fixed that and added a test.
ghstack-source-id: 128527537

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28265636

fbshipit-source-id: 7b342ce012b2b7e33058b4c619ffb98992ed05b7
2021-05-10 11:47:46 -07:00
Rohan Varma
d115e81a32 Fix document around DDP uneven inputs (#57448)
Summary:
Typo fix and additional clarifications on the API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57448

Reviewed By: SciPioneer

Differential Revision: D28153264

Pulled By: rohan-varma

fbshipit-source-id: 9bd35d918299ad7e080785d755f97b966f826615
2021-05-10 09:33:59 -07:00
Rohan Varma
57f72b8433 [DDP] Uneven inputs: option to throw early (#56755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56755

Rehash of https://github.com/pytorch/pytorch/pull/47488

Adds a flag to ddp join() context manager that enables throwing a
StopIteration across all ranks when this flag is specified.

To do this, we implement the design in #47250. When running with this flag, we schedule an additional allreduce in the case that a joined rank needs to throw a StopIteration. In the forward pass of non-joined ranks, we match this allreduce, and if at least one rank tells us to throw, we raise a StopIteration.
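
A rough single-process simulation of that agreement step, assuming each rank contributes a 0/1 flag to a sum-allreduce. `simulated_allreduce_sum` and `ranks_that_stop` are hypothetical helpers for illustration; real DDP uses a collective on the process group and raises StopIteration where this sketch returns True.

```python
def simulated_allreduce_sum(values):
    """Stand-in for a collective allreduce: every rank receives the sum."""
    total = sum(values)
    return [total for _ in values]


def ranks_that_stop(joined_flags):
    """Return, per rank, whether it would stop this iteration.

    joined_flags[i] is True when rank i has exhausted its inputs (joined).
    """
    # Joined ranks contribute 1 to signal "throw"; active ranks contribute 0.
    contributions = [1 if joined else 0 for joined in joined_flags]
    reduced = simulated_allreduce_sum(contributions)
    # Every rank sees the same total, so all ranks agree on whether to stop.
    return [total > 0 for total in reduced]
```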

Tested by modifying existing tests, as well as adding additional tests validating that this works with SyncBatchNorm models and a model with custom collectives in the forward pass.

Currently running perf benchmarks, will post when those are available, but we expect a small (~2%) perf reduction when enabling this feature due to the blocking allreduce. Hence we will only recommend it for models with collective comm.
ghstack-source-id: 127883115

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27958369

fbshipit-source-id: c26f7d315d95f17bbdc28b4a0561916fcbafb7ca
2021-05-02 15:41:50 -07:00
Yanli Zhao
3f81912885 static graph api skeleton (#54995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995

Provide a private DDP API to explicitly mark training as static; also set this flag in the logger.
ghstack-source-id: 127755713

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27444965

fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
2021-04-30 11:07:26 -07:00
Yanli Zhao
2c8ea63cbb add a test for grad view with torch amp (#56730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56730

Add a test to verify that DDP with torch.amp produces the same results when using grad_as_bucket_view=true and false.

The torch.amp scale factor does not depend on old gradients, so it is not affected by grad_as_bucket_view=true or false; see
how torch.amp is implemented here: https://github.com/pytorch/pytorch/pull/33366/files.

This diff verifies that DDP works as expected with amp.GradScaler and amp.autocast when using grad_as_bucket_view=true and false.
ghstack-source-id: 127526358

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27950132

fbshipit-source-id: 8ed26935fdcb4514fccf01bb510e31bf6aedac69
2021-04-29 10:06:07 -07:00
Yanli Zhao
1e77ba36db change ddpLoggingData struct to map or dict (#56641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641

Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This makes it inflexible to add or delete fields in the future, and it also makes ddpLoggingData hard to access.

With maps/dicts, developers and users can easily access the fields without knowing the field names, and it is easier to add or remove fields.

Since C++ does not support map values of different types, ddpLoggingData currently contains two types of maps.
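
As a rough illustration of the two-map layout, here is a small Python container with one map per value type. The class name `LoggingData`, its attributes, and the example keys are assumptions made for this sketch, not the actual ddpLoggingData definition.

```python
from typing import Dict, Union


class LoggingData:
    """Toy container with one map per value type, mirroring the two-map layout."""

    def __init__(self) -> None:
        self.strs_map: Dict[str, str] = {}
        self.ints_map: Dict[str, int] = {}

    def set(self, key: str, value: Union[str, int]) -> None:
        # Route each value to the map that matches its type.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(f"unsupported value type for {key!r}")
        if isinstance(value, str):
            self.strs_map[key] = value
        else:
            self.ints_map[key] = value

    def as_dict(self) -> Dict[str, Union[str, int]]:
        # Merge both maps so callers can iterate without knowing field names.
        return {**self.ints_map, **self.strs_map}
```

Adding a field then means writing a new key rather than changing a struct definition.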
ghstack-source-id: 127482694

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923723

fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
2021-04-28 06:43:25 -07:00
Yi Wang
07653b7fe0 [SPMD] Remove ddp_gpu_size field from SyncBatchNorm (#55946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55946

As `ddp_gpu_size` field of `SyncBatchNorm` will always be 1 for GPU modules, remove this field and the relevant code.
ghstack-source-id: 126883498

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27746021

fbshipit-source-id: b4518c07e6f0c6943fbd7a7548500a7d4337126c
2021-04-19 21:41:29 -07:00
Mike Guo
5b4c3a9da1 record Torch DP and DDP modules forward (#55578)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55578

Reviewed By: gdankel

Differential Revision: D27862392

Pulled By: ilia-cher

fbshipit-source-id: 18545d23e35a97c8f760707fecb696a24d47dc0a
2021-04-19 17:52:59 -07:00
Michael Carilli
a24b17248f Short circuits DistributedDataParallel._recursive_to's copy and stream syncs if input is already on the right device (#55624)
Summary:
^

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55624

Reviewed By: pbelevich, agolynski

Differential Revision: D27836170

Pulled By: rohan-varma

fbshipit-source-id: 954bf336d70f9e80c045a6a96c1d8843c7f1cf2c
2021-04-18 14:08:08 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs a mapping of parameter names and passes it to Reducer, so that error messages about unused parameters (or not all parameters getting a gradient) can log which parameters were unused.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name, so that we don't have to copy the entire tensor. Instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs the param index -> param name mapping and passes it into Reducer. The name is the fully qualified name: f"{module_name}:{param_name}".
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
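
The index-based mapping and set difference described above can be sketched as follows. Both function names are hypothetical, and the "ready" set stands in for the indices on which `mark_variable_ready` was called; the real bookkeeping lives in the C++ Reducer.

```python
def build_param_index_to_name(named_params):
    """Map parameter index -> fully qualified name, without holding the tensors."""
    return {index: name for index, (name, _param) in enumerate(named_params)}


def find_unused_param_names(index_to_name, ready_indices):
    """Names of parameters whose index was never marked ready this iteration."""
    unused = set(index_to_name) - set(ready_indices)
    return [index_to_name[i] for i in sorted(unused)]
```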
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Yi Wang
d398a705c6 Clang-format batchnorm.py and distributed.py (#55971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55971

Per title
ghstack-source-id: 126454339

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752315

fbshipit-source-id: 64ca5dea7b2689037594a6bd9a75641a9bb817c1
2021-04-13 18:40:23 -07:00
Yi Wang
4b09756d26 [SPMD] Move a comment (#55877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55877

Address a comment in: 10bc1dae40 (r610930244)
ghstack-source-id: 126369525

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27729567

fbshipit-source-id: 5509ebfba2b741cd3532c69044227e5af0fb54fc
2021-04-13 05:53:31 -07:00
Yi Wang
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that will only be executed when `device_ids > 1`.

Some helper functions are also removed:
1.  `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
Yi Wang
b986a76d91 Clang-format distributed.py (#55254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55254

ghstack-source-id: 125680320

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27542846

fbshipit-source-id: 700c3e59a9df98233fdb27054b472f5cb33eb604
2021-04-05 16:48:22 -07:00
Yi Wang
e589247a19 [SPMD] Change assertions to raising value errors in distributed.py (#54825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54825

These assertions are tested in test_c10d.py

Context: https://github.com/pytorch/pytorch/pull/54454#discussion_r602657818
ghstack-source-id: 125602462

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

Reviewed By: rohan-varma

Differential Revision: D27381649

fbshipit-source-id: 9b994e9c2acf796770c2f2af2cebdd5561834d14
2021-04-02 15:13:45 -07:00
Yi Wang
6a40339920 [SPMD] Error out SPMD mode (#54454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54454

According to the pitch in https://github.com/pytorch/pytorch/issues/47012

1. Let DDP error out if `device_ids` contains multiple devices.
2. If device_ids is not specified, DDP will use the provided model (module argument in DDP constructor) as-is, regardless if the model is on one GPU or multiple GPUs or on CPU.
3. Remove the assertion that prevents SPMD in DDP `join()` method, because now SPMD is already forbidden by the constructor. Also remove the relevant unit test `test_ddp_uneven_inputs_replicated_error`.
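
A minimal sketch of the constructor-side check in points 1 and 2, assuming `device_ids` is either None or a sequence. `check_device_ids` is a hypothetical helper, and raising ValueError (rather than asserting) is a choice made for this sketch.

```python
def check_device_ids(device_ids):
    """Reject multi-device device_ids; None means use the module as-is."""
    if device_ids is None:
        return None
    if len(device_ids) > 1:
        raise ValueError(
            f"device_ids can only be None or contain a single element, got {device_ids}"
        )
    return list(device_ids)
```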

#Closes: https://github.com/pytorch/pytorch/issues/47012

ghstack-source-id: 125644392

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_cuda
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_rnn

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_ids_not_allowed
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_single_device_module_device_ids_None
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_module_device_ids_None

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27226092

fbshipit-source-id: 3ee1e4bc46e5e362fc82cf7a24b2fafb34fcf1b9
2021-04-02 15:11:59 -07:00
Rohan Varma
3575e71be8 [DDP Logging] Log use of uneven inputs API (#54919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919

Log the use of uneven inputs API for better tracking and use case
detection.
ghstack-source-id: 125446499

Test Plan: CI, added ut

Reviewed By: zhaojuanmao, SciPioneer

Differential Revision: D27410764

fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
2021-04-01 16:22:32 -07:00
Rohan Varma
8c13dde458 [DDP] Remove redundant pass statement (#54219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54219

There is no need for this ``pass``.
ghstack-source-id: 125124311

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27105234

fbshipit-source-id: 95496fa785fdc66a6c3c8ceaa14af565588325df
2021-03-29 14:15:39 -07:00
Yi Wang
6e7a3c1fdd Clang-format distributed.py (#54402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54402

ghstack-source-id: 124497872

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27225942

fbshipit-source-id: 277f466554fbc034fb76de161bf4b3b7c431daf7
2021-03-22 11:39:58 -07:00
Shen Li
ef9ee46756 Avoid modifying rebuild buckets state in no_grad context (#54159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54159

See https://github.com/pytorch/pytorch/issues/54059 for discussion.

In short, users might want to run evaluation on a single rank
in `torch.no_grad()` mode. When this happens, we need to make
sure that we skip all rebuild bucket logic, as the forward pass only
runs on one rank and not all peers can join the bucket configuration
sync communication.

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D27119666

Pulled By: mrshenli

fbshipit-source-id: 4b2f8cce937cdd893e89d8d10c9267d255ba52ea
2021-03-17 19:50:29 -07:00
Rohan Varma
e09e97ebf9 [DDP] add _distributed_rank helper function (#53795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53795

There are 4 calls to dist.get_rank() in the DDP implementation. Move these
to a helper property so that callers use `dist.get_rank(self.process_group)`
rather than `dist.get_rank()`.

Keeping API private for now because not sure if there is a user need to call `model.distributed_rank`, but can make it public if we think it's a useful api.
ghstack-source-id: 123640713

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26972368

fbshipit-source-id: a5f1cac243bca5c6f90a44f74d39cfffcc2b9a5a
2021-03-11 21:20:05 -08:00
Rohan Varma
0c2fe02ec1 [DDP] Fix wrong call to dist.get_rank() (#53793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53793

This call should pass in the process group so it works appropriately
for subgroups instead of whole world being passed into DDP.

Aside: This wasn't caught by tests since we don't have good testing around
passing subgroups into DDP, I believe nearly all tests use the entire world.
Should we add better testing for subgroups which may potentially bring up more
subtle bugs?
ghstack-source-id: 123640712

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26972367

fbshipit-source-id: 8330bd51e2ad66841e4c12e96b67d3e78581ec74
2021-03-11 21:18:31 -08:00
Yi Wang
d726ce6668 Support loading a non-DP/DDP model from a DP/DDP state_dict (#53224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53224

Loading a DP/DDP dict just needs to strip the module prefix from all items in the state dict and the metadata.

One existing example is here: https://github.com/facebookresearch/fvcore/blob/master/fvcore/common/checkpoint.py#L239.
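
For illustration only, stripping the prefix from a plain dict of keys could look like the sketch below. `strip_prefix` is a hypothetical helper, and it does not touch the state dict's metadata, which the summary above says also needs the prefix removed.

```python
def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    stripped = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        stripped[new_key] = value
    return stripped
```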

#Closes: https://github.com/pytorch/pytorch/issues/41048/
ghstack-source-id: 123722976

Test Plan:
buck test mode/dev-nosan caffe2/test:nn -- test_load_state_dict
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_save_load_checkpoint

Reviewed By: rohan-varma, mrshenli

Differential Revision: D26798495

fbshipit-source-id: 035c7d0907d7ae8f0d7ca21ec71f7f96ef8df6c8
2021-03-11 18:43:33 -08:00
Yanli Zhao
a08fc1a7fc allow users to set sample rate and add per iteration latency breakdowns (#53145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145

Add a new API that allows users to set the sample rate for runtime stats, and add per-iteration latency breakdowns to the DDPLoggingData struct. E.g.,
if users set the sample rate to 1, they can analyze per-iteration latency changes over time (not averaged).
ghstack-source-id: 123443369

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26763957

fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
2021-03-10 11:35:18 -08:00
Michael Carilli
e787872a47 [RELAND] Deduplicate shared params before constructing Reducer in DDP (#53279)
Summary:
Original PR https://github.com/pytorch/pytorch/pull/51929 seemed to trigger failures in `pytorch_linux_xenial_py3_clang5_asan_test2`. Resubmitting to figure out why, and hopefully reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53279

Reviewed By: mrshenli

Differential Revision: D26916701

Pulled By: zhaojuanmao

fbshipit-source-id: 75c74c8ad8ad24154eb59eddb2b222da0a09897e
2021-03-10 07:56:20 -08:00
Rohan Varma
14fa47631b [DDP Logging] Log comm. hook in ddp logging (#52966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966

Logs the registered comm hook if there is one, else logs
"builtin_allreduce"
ghstack-source-id: 123174803

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26709388

fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
2021-03-05 11:23:26 -08:00
Rohan Varma
68134374cb Refactor/fix DDP model check during init (#52887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887

This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.

There were a few things that could be improved with the way we verify model across processes in DDP initialization:

1. We should do this check before syncing module states in DDP init, otherwise with Gloo backend this will throw but we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating reducer
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP

In follow-up changes we will add the ability to detect which rank had an inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which rank(s) had errors).
ghstack-source-id: 123171877

Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks

Reviewed By: zhaojuanmao

Differential Revision: D26565290

fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
2021-03-05 11:21:45 -08:00
Mike Ruberry
30a8a13a7d Revert D26625807: [pytorch][PR] Deduplicate shared params before constructing Reducer in DDP
Test Plan: revert-hammer

Differential Revision:
D26625807 (5c15a5bb46)

Original commit changeset: f5f5959fef90

fbshipit-source-id: c875cc86b8fd21d9d64f934559f8e3126ed1d23d
2021-03-03 20:05:47 -08:00
Yi Wang
68b62493b8 [Gradient Compression] Make GradBucket class public (#53099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53099

Publish GradBucket APIs for publishing DDP communication hooks.

s/_GradBucket/GradBucket
ghstack-source-id: 123030921

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26721121

fbshipit-source-id: ee5f68e33095b9965b51937b86cdeb331fd2419a
2021-03-03 19:22:15 -08:00
Michael Carilli
5c15a5bb46 Deduplicate shared params before constructing Reducer in DDP (#51929)
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.

We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which i guess is one potential consequence of multiple DDP hooks on a given param, not sure why, i'd have to dig further).

This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.
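
The deduplication idea can be sketched as an order-preserving filter keyed on object identity. `dedupe_by_identity` is a hypothetical name for illustration, and the actual diff may be structured differently.

```python
def dedupe_by_identity(params):
    """Keep the first occurrence of each object, preserving order."""
    seen = set()
    unique = []
    for p in params:
        # Compare by identity, not value: two equal but distinct tensors are both kept.
        if id(p) not in seen:
            seen.add(id(p))
            unique.append(p)
    return unique
```

Keying on identity rather than equality matters here, since two separate parameters can hold equal values and still need their own hooks.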

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929

Reviewed By: zou3519

Differential Revision: D26625807

Pulled By: zhaojuanmao

fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
2021-03-03 10:13:24 -08:00
Shen Li
d697090260 Add a note in DDP doc to point to ZeroRedundancyOptimizer (#53113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53113

Test Plan: Imported from OSS

Reviewed By: blefaudeux

Differential Revision: D26752339

Pulled By: mrshenli

fbshipit-source-id: 7a082f1007bc550eabb82b559d020bbe717fa497
2021-03-02 14:18:06 -08:00
Yanli Zhao
d0795ab358 log newly added construction and runtime stats at randomly selected iterations (#51394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51394

log newly added construction and runtime stats at randomly selected iterations
ghstack-source-id: 121934040

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26161885

fbshipit-source-id: add6e02c1a03e6f74f08b9a9aecf90fa81631d60
2021-02-19 00:15:04 -08:00
Yanli Zhao
c75fa39b6c add stats that can only be collected at runtime (#51386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. GPU time stats are not collected for single-process multiple-device mode in this diff, as that requires events to be created and recorded on multiple devices.
2. Use the at::cuda::event API for safer calls.
3. Events may not be created in the autograd hook if the hook is not triggered in the user's code, e.g., when users run in non-sync mode in some iterations. So we check whether events were created before synchronizing, and skip invalid results.
4. Users may not set the device upfront, so explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls.

ghstack-source-id: 121933566

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26158645

fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
2021-02-19 00:13:11 -08:00
Rohan Varma
6dabe0b291 [Dist Profiling] Enable dist profiling for DDP (gloo only) (#52031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031

Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, before this wouldn't work as the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26356192

fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
2021-02-17 12:21:37 -08:00
Rohan Varma
a86027ded3 Use side-stream in CPU to GPU copies in DDP (#50180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50180

Resolves the regression in
https://github.com/pytorch/pytorch/issues/49819 by adding copy over background
stream similar to scatter. For internal use cases, this is gated with an env var that maintains the previous behavior when it is off.

Test Plan: CI

Reviewed By: mrshenli, ngimel

Differential Revision: D25818170

fbshipit-source-id: e50c76c035504b2a44e2be084701cee45c90df75
2021-02-13 00:57:32 -08:00
Yanli Zhao
18e0a61388 add more logging fields that can be set in construction time (#51260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
2021-02-09 21:58:58 -08:00
Yi Wang
4b3c99ce4a [Resubmission] Add a documentation page for DDP communication hooks (#51773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51773

Resubmission of #51715.

Minor changes:
1) Removed "Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]" in powerSGD_hook.py.

2) Removed the duplicate description of `torch.nn.parallel.DistributedDataParallel.register_comm_hook` in ddp_comm_hooks.rst, because it is already covered by distributed.rst.

Also updated the doc based on the comments from PowerSGD paper author Thijs Vogels.

It seems that `python_doc_test` was flaky. The previous error message was not informative:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270682/workflows/8d186a3c-d682-46bf-b617-ad4eef5991e2/jobs/10739143, and all the warnings did also appear on the master branch.

Rebasing to a new master branch seems to get this fixed:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270696/workflows/1a3adbea-6443-4876-b87b-e17d90d41428/jobs/10740021/steps

ghstack-source-id: 121199613

Test Plan: View locally

Reviewed By: mingzhe09088

Differential Revision: D26272687

fbshipit-source-id: 6677db496a68171798940a80343f4d9a508e15db
2021-02-06 21:22:04 -08:00
Natalia Gimelshein
d3023d86ba Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks
Test Plan: revert-hammer

Differential Revision:
D26249330 (e62aabac43)

Original commit changeset: ab973390ddb7

fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd
2021-02-04 22:38:06 -08:00
Yi Wang
e62aabac43 [Gradient Compression] Add a documentation page for DDP communication hooks (#51715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51715

Add a documentation page for DDP communication hooks.

Test Plan: View locally

Reviewed By: pritamdamania87

Differential Revision: D26249330

fbshipit-source-id: ab973390ddb785c5191f587a1b2b6de7d229e50e
2021-02-04 18:53:53 -08:00
Yanli Zhao
250c71121b Create a DDPLoggingData and expose it to python interface (#50622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622

1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. A unit test verifies that the logging data can be set and read back as expected
5. Follow-ups will add more logging fields such as perf stats, internal states, env variables, etc.
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
2021-01-25 15:23:07 -08:00
Pritam Damania
f39f258dfd Ensure DDP + Pipe works with find_unused_parameters. (#49908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49908

As described in https://github.com/pytorch/pytorch/issues/49891, DDP +
Pipe doesn't work with find_unused_parameters.

This PR adds a simple fix to enable this functionality. This only currently
works for Pipe within a single host and needs to be re-worked once we support
cross host Pipe.
ghstack-source-id: 119573413

Test Plan:
1) unit tests added.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25719922

fbshipit-source-id: 948bcc758d96f6b3c591182f1ec631830db1b15c
2021-01-11 16:52:37 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Rohan Varma
c9f6e70c09 Refactor DDP uneven inputs control flags (#47394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394

This is a preliminary refactor for the next diff that will add an
additional flag to control whether we throw a StopIteration or not. We
basically move the flags for ddp uneven inputs to a simple class.
ghstack-source-id: 116428177
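
The kind of refactor described in this summary, grouping related control flags into one small class, might look roughly like the sketch below. The class name, field names, and the helper function are all assumptions made for illustration; they are not the names used in the actual diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnevenInputsConfig:
    """Hypothetical container for the DDP uneven-inputs control flags."""

    # Whether uneven-input handling is active at all (assumed name).
    enabled: bool = False
    # Whether gradients are divided by the initial world size (assumed name).
    divide_by_initial_world_size: bool = True
    # Assumed name for the extra flag the follow-up diff is said to add.
    throw_on_early_termination: bool = False


def should_raise_stop_iteration(config: UnevenInputsConfig, rank_exhausted: bool) -> bool:
    """Return True when an exhausted rank should raise instead of continuing."""
    return config.enabled and config.throw_on_early_termination and rank_exhausted


config = UnevenInputsConfig(enabled=True, throw_on_early_termination=True)
print(should_raise_stop_iteration(config, rank_exhausted=True))
```

Keeping the flags in one immutable object means a new flag can be added in one place instead of being threaded through as another separate attribute.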

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24739509

fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
2020-11-11 16:51:56 -08:00
Zhicheng Chen
3dd266304c Fix inaccurate note in DistributedDataParallel (#47156)
Summary:
Sorry for my previous inaccurate [PR](https://github.com/pytorch/pytorch/pull/42471#issue-462329192).

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1, 16)
    # Duplicate the sample to form a batch of 2, matching the two DDP ranks below.
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    # Every rank uses the same seed, so each process sees the same single-sample input.
    torch.manual_seed(0)
    x = torch.randn(1, 16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both pieces of code produce the same output.
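
One way to see why the outputs agree: with a mean-reduced loss, averaging the per-rank gradients over equally sized shards gives the same result as the gradient over the combined batch. The sketch below checks that arithmetic in plain Python, without `torch` or a process group. The weights, samples, and helper names are made up for illustration, and the "ranks" are simulated by slicing a list.

```python
import math


def cross_entropy_grad(weight, x, label):
    """Gradient of softmax cross-entropy w.r.t. the weights of a bias-free linear layer."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    # d(loss)/d(weight[k][i]) = (softmax_k - onehot_k) * x_i
    return [[(e / total - (1.0 if k == label else 0.0)) * xi for xi in x]
            for k, e in enumerate(exps)]


def average(grads):
    """Element-wise mean of a list of equally shaped gradient matrices."""
    n = len(grads)
    return [[sum(g[r][c] for g in grads) / n for c in range(len(grads[0][0]))]
            for r in range(len(grads[0]))]


def batch_grad(weight, batch):
    """Gradient of the mean loss over one batch."""
    return average([cross_entropy_grad(weight, x, y) for x, y in batch])


# Illustrative values only.
weight = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
data = [([1.0, 2.0, -1.0], 0), ([0.5, -1.5, 2.0], 1),
        ([-0.3, 0.8, 0.1], 0), ([2.0, 0.0, -0.7], 1)]

# Single process: one batch of four samples.
single_process = batch_grad(weight, data)

# Two simulated ranks: each handles two samples, then the gradients are averaged.
per_rank = [batch_grad(weight, data[:2]), batch_grad(weight, data[2:])]
ddp_style = average(per_rank)

max_diff = max(abs(a - b)
               for row_a, row_b in zip(single_process, ddp_style)
               for a, b in zip(row_a, row_b))
print("largest difference:", max_diff)
```

The equality relies on every simulated rank holding the same number of samples; with unequal shards, the mean of per-rank means would no longer match the mean over the full batch.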

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
2020-11-09 17:42:57 -08:00
Yi Wang
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish the register_comm_hook API for the DDP module. (This should have been done in a separate diff, but got merged unintentionally.)
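
To give a rough sense of what an FP16 compression hook trades off compared with a plain all-reduce, the sketch below rounds each gradient value to half precision before averaging across simulated ranks. It is a conceptual illustration in plain Python, not the PyTorch implementation, and it does not call `register_comm_hook`. The function names are hypothetical, and the sum is taken in double precision after rounding, which is a simplifying assumption.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def allreduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks (a stand-in for all-reduce)."""
    world_size = len(per_rank_grads)
    return [sum(vals) / world_size for vals in zip(*per_rank_grads)]


def fp16_compress_allreduce(per_rank_grads):
    """Round each rank's gradients to half precision, then average them."""
    compressed = [[to_fp16(g) for g in grads] for grads in per_rank_grads]
    return allreduce_mean(compressed)


# Illustrative gradients for two simulated ranks.
grads = [[0.1, -2.5, 3.0], [0.3, 1.5, -1.0]]
exact = allreduce_mean(grads)
approx = fp16_compress_allreduce(grads)

for e, a in zip(exact, approx):
    print(f"exact={e!r} fp16={a!r} abs_error={abs(e - a):.3e}")
```

Values that half precision can represent exactly, such as -2.5 or 3.0, pass through unchanged, while values like 0.1 pick up a small rounding error in exchange for halving the bytes communicated.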

The new style can be used for testing any new comm hook like PowerSGD easily.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00