Commit Graph

122 Commits

Author SHA1 Message Date
Rohan Varma
c9f6e70c09 Refactor DDP uneven inputs control flags (#47394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394

This is a preliminary refactor for the next diff, which will add an
additional flag to control whether we throw a StopIteration or not. We
move the flags for DDP uneven inputs into a simple class.
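For reference, a minimal sketch of what such a flags class might look like; the class and field names here are hypothetical, not the exact ones added in the PR:

```python
from typing import NamedTuple

class UnevenInputsConfig(NamedTuple):
    # Hypothetical fields: whether join() is active, and whether allreduce
    # results are divided by the initial or the effective world size.
    join_enabled: bool = False
    divide_by_initial_world_size: bool = True
```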
ghstack-source-id: 116428177

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24739509

fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
2020-11-11 16:51:56 -08:00
Zhicheng Chen
3dd266304c Fix inaccurate note in DistributedDataParallel (#47156)
Summary:
Sorry for my previous inaccurate [PR](https://github.com/pytorch/pytorch/pull/42471#issue-462329192 ).

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both pieces of code produce the same output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
2020-11-09 17:42:57 -08:00
Yi Wang
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)

The new style makes it easy to test any new comm hook, such as PowerSGD.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
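For context, a rough sketch of how a default hook is registered on a DDP model; the module path and hook name follow current PyTorch and are an assumption for the code as of this commit:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def wrap_with_fp16_compress(model: nn.Module) -> DDP:
    # Assumes torch.distributed.init_process_group(...) has already run.
    ddp_model = DDP(model)
    # state=None means "use the default process group"; the hook compresses
    # gradients to FP16 for the allreduce and decompresses the result.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
    return ddp_model
```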

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00
Yi Wang
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/csrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type, ReduceOp, defined in the same file, but that did not work, because that C++ enum class is defined in the torch/lib/c10d library, while BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to the distributed package, which could be a project of its own, I simply removed the BuiltinCommHookType arg type annotation in this file.
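The guard in question looks roughly like the sketch below; the exact import path of BuiltinCommHookType is an assumption for illustration:

```python
import torch.distributed as dist

# Only import the pybind-defined enum when the distributed package was built;
# otherwise the extension module that defines it does not exist.
# (Import path below is illustrative and may not match the actual file.)
if dist.is_available():
    from torch._C._distributed_c10d import BuiltinCommHookType
```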

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
Yi Wang
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
Yi Wang
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
Rohan Varma
ecdbea77bc Fix DDP documentation (#46861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46861

Noticed that in the DDP documentation:
https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel
there were some examples using `torch.nn.DistributedDataParallel`; fix these to
read `torch.nn.parallel.DistributedDataParallel`.
ghstack-source-id: 115453703

Test Plan: ci

Reviewed By: pritamdamania87, SciPioneer

Differential Revision: D24534486

fbshipit-source-id: 64b92dc8a55136c23313f7926251fe825a2cb7d5
2020-10-29 09:13:47 -07:00
Rohan Varma
7245d2c939 Avoid scatter for single-device case in DDP (#46304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304

In the case where a single process operates on only one GPU, we can
avoid the scatter entirely and replace it with a recursive version of `to`
which transfers the input tensors to the correct device.

The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
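A simplified sketch of the recursive move-to-device idea (the real `_recursive_to` also mirrors scatter's handling of namedtuples and custom types, which is omitted here):

```python
import torch

def recursive_to(obj, device):
    # Move tensors, and tensors nested inside lists/tuples/dicts, to `device`.
    # Anything else is returned untouched, matching the convention that custom
    # types do not have their tensors moved.
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, tuple):
        return tuple(recursive_to(o, device) for o in obj)
    if isinstance(obj, list):
        return [recursive_to(o, device) for o in obj]
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj
```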
ghstack-source-id: 114896677

Test Plan: Added unittest, and CI

Reviewed By: pritamdamania87

Differential Revision: D24296377

fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
2020-10-22 08:29:37 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)
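For illustration, the kind of rewrite applied throughout (variable names made up):

```python
import torch

tensors = [torch.randn(3), torch.randn(5)]

# Before: list(map(...))
sizes = list(map(lambda t: t.numel(), tensors))

# After: equivalent, more readable list comprehension
sizes = [t.numel() for t in tensors]
```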

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Emilio Castillo
d38a71d579 torch.nn.modules.LazyModuleMixin and torch.nn.LazyLinear (Shape Inference II) (#44538)
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD

This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.

There are two main differences from the previous PR:
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting training or inference of the actual module, which makes this much simpler to use from the user side.

As we discussed offline, there was a suggestion of not using a mixin, but instead changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass, since there are many checks that rely on modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```python
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)
    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the `__setattr__` function is called at the time `LazyLinear` is registered, it won't be added to the child modules of `MyNetwork`, so we would have to add it manually later; but currently there is no way to do that, as we can't access the parent module from `LazyLinear` once it becomes the `Linear` module. (We can add a workaround for this if needed.)
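A small usage sketch of the lazy-initialization behaviour (shapes chosen arbitrarily):

```python
import torch

lazy = torch.nn.LazyLinear(out_features=10)  # in_features not known yet
x = torch.randn(4, 16)
y = lazy(x)                                  # first forward materializes the weight
print(lazy.weight.shape)                     # torch.Size([10, 16])
```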

TODO:

Add convolutions once the design is OK
Fix docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538

Reviewed By: ngimel

Differential Revision: D24162854

Pulled By: albanD

fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
2020-10-19 13:13:54 -07:00
Rohan Varma
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass a parameter to the reducer if it is in the given list.
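A rough sketch of how the upfront ignore list can be supplied; the static helper name below is how this is exposed in later PyTorch versions and should be treated as an assumption for this PR:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_with_ignored_params(model: nn.Module, names_to_ignore):
    # names_to_ignore holds fully qualified parameter/buffer names for which
    # DDP should not install allreduce hooks (assumed helper, see above).
    DDP._set_params_and_buffers_to_ignore_for_model(model, names_to_ignore)
    return DDP(model)
```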
ghstack-source-id: 113210109

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
Shen Li
c5ade5f698 Fix no_sync docs (#45455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45455

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973365

Pulled By: mrshenli

fbshipit-source-id: 87c9878cdc7310754670b83efa65ae6f877f86fb
2020-09-28 20:48:09 -07:00
Shen Li
6967e6295e Fix DDP docs (#45454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45454

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973367

Pulled By: mrshenli

fbshipit-source-id: 11f20d51d0d0f92f199e4023f02b86623867bae0
2020-09-28 20:43:22 -07:00
Yanli Zhao
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add one argument to the DDP API to enable/disable letting grads point to views. When it is disabled, behavior is the same as DDP today; when it is enabled, make both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we
made changes in #41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
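In current PyTorch this toggle is the `gradient_as_bucket_view` argument to DDP; whether the argument carried that exact name in this diff is an assumption:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_with_grad_views(model: nn.Module) -> DDP:
    # Grads become views into DDP's bucket buffers, saving one full copy of the
    # gradients; optimizer.zero_grad() must handle grad views (see #41283).
    return DDP(model, gradient_as_bucket_view=True)
```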
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
Rohan Varma
e57a08119b Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238

Adds a warning when the discrepancy in the number of inputs across different
processes is much higher than expected when running with uneven inputs.
A skew in the thousands can reduce performance by a nontrivial amount, as
shown in benchmarks, so it was proposed to add this warning. Tested by
running the tests so that the threshold is hit and observing the output.
ghstack-source-id: 112773552

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23719270

fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
2020-09-24 09:50:44 -07:00
Bugra Akyildiz
1b059f2c6d Directly use work.result() to retrieve tensor rather than passing as a separate argument (#44914)
Summary:
We currently fetch an allreduced tensor from Python in C++, storing the resulting tensor in a field of a struct. This PR removes the extra tensor parameter from the function signature and fetches the tensor from a single place.

Fixes https://github.com/pytorch/pytorch/issues/43960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914

Reviewed By: rohan-varma

Differential Revision: D23798888

Pulled By: bugra

fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
2020-09-22 06:28:47 -07:00
Yanli Zhao
e14b2080be [reland] move rebuild buckets from end of first iteration to beginning of second iteration (#44798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798

[test all]

Update for relanding: in ddp.join(), moved _rebuild_buckets from the end of backward to the beginning of forward as well.

Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration.
ghstack-source-id: 112279261

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D23735185

fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
2020-09-17 17:10:21 -07:00
Ailing Zhang
fb085d90e3 Revert D23583017: move rebuild buckets from end of first iteration to beginning of second iteration
Test Plan: revert-hammer

Differential Revision:
D23583017 (f5d231d593)

Original commit changeset: ef67f79437a8

fbshipit-source-id: fd914b7565aba6a5574a32b31403525abb80ff07
2020-09-15 15:10:52 -07:00
Yanli Zhao
f5d231d593 move rebuild buckets from end of first iteration to beginning of second iteration (#44326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326

Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration.
ghstack-source-id: 112011490

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583017

fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
2020-09-15 09:51:33 -07:00
Yi Wang
ace81b6794 Remove an extra empty line in the warning comments. (#44622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622

Remove an extra empty line in the warning comments.

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23674070

fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
2020-09-14 11:15:35 -07:00
Rohan Varma
41f62b17e7 Fix DDP join() API in the case of model.no_sync() (#44427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427

Closes https://github.com/pytorch/pytorch/issues/44425

DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
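A sketch of the scenario being fixed: `no_sync()` used for gradient accumulation inside the `join()` context (loop structure and `sync_interval` are illustrative):

```python
import contextlib

def train_uneven(ddp_model, optimizer, loader, sync_interval=4):
    with ddp_model.join():
        for step, inp in enumerate(loader):
            sync_now = (step + 1) % sync_interval == 0
            # Skip gradient synchronization except on every sync_interval-th step.
            ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
            with ctx:
                loss = ddp_model(inp).sum()  # toy loss
                loss.backward()
            if sync_now:
                optimizer.step()
                optimizer.zero_grad()
```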
ghstack-source-id: 111786479

Reviewed By: pritamdamania87

Differential Revision: D23609070

fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
2020-09-10 18:31:40 -07:00
Rohan Varma
3806c939bd Polish DDP join API docstrings (#43973)
Summary:
Polishes DDP join api docstrings and makes a few minor cosmetic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973

Reviewed By: zou3519

Differential Revision: D23467238

Pulled By: rohan-varma

fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
2020-09-03 13:39:45 -07:00
Rohan Varma
4e4626a23d Join-based API to support DDP uneven inputs (#42577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577

Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:

#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined processes contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example, we schedule the same allreduces in the same order as in the backward pass, but with zeros.
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy

We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.
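For reference, the intended usage pattern as a minimal sketch (using the method form `ddp_model.join()` as exposed in later releases; each rank may have a different number of batches):

```python
with ddp_model.join():
    for inp in per_rank_loader:      # loaders may have different lengths per rank
        loss = ddp_model(inp).sum()  # toy loss
        loss.backward()              # joined ranks shadow these collectives
        optimizer.step()
        optimizer.zero_grad()
```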

#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for a varying number of different iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)

#### Performance implications

###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s

###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s

###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s

###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s

###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s

The two uneven-inputs benchmarks above were conducted with 32 GPUs, with 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all) while the other 28 iterated as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, with a single node (8 GPUs), this does not reproduce.

#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819

Reviewed By: mrshenli

Differential Revision: D22893859

fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
2020-08-31 13:29:03 -07:00
Haoran Li
f35e069622 Back out "Make grad point to bucket buffer in DDP to save memory usage" (#43557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43557

Back out the diff that caused some errors in pytext distributed training

Test Plan: Tested by rayhou who verified reverting the diff works

Differential Revision: D23320238

fbshipit-source-id: caa0fe74404059e336cd95fdb41373f58ecf486e
2020-08-25 18:04:39 -07:00
Yanli Zhao
97d594b9f7 Make grad point to bucket buffer in DDP to save memory usage (#41954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41954
Make both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we
made changes in https://github.com/pytorch/pytorch/pull/41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
ghstack-source-id: 110260297

Test Plan:
unit tests,

For roberta_base model with ~1GB parameters, peak memory dropped ~1GB (8250MB-7183MB).  Per iteration latency (0.982s ->0.909s), 8% speed up
https://www.internalfb.com/intern/fblearner/details/211713882?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211772923?tab=operator_details

For resnet model with ~97M parameters, peak memory dropped ~100MB (3089MB -> 2988MB). Per iteration latency has no change (0.122s -> 0.123s)
https://www.internalfb.com/intern/fblearner/details/211713577?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211712582?tab=operator_details

accuracy benchmark is expected as well
https://www.internalfb.com/intern/fblearner/details/213237067?tab=Outputs

Reviewed By: mrshenli

Differential Revision: D22707857

fbshipit-source-id: b5e767cfb34ccb3d067db2735482a86d59aea7a4
2020-08-20 15:33:44 -07:00
Sinan Nasir
6e1127ea3f [NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

The main problem was that, because we call `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than the default stream.
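The shape of the callback under discussion, as a hedged sketch; the bucket accessor and return convention follow the current comm-hook API and may differ from this diff's exact interface:

```python
import torch.distributed as dist

def allreduce_then_average_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer()  # assumed accessor for the flat bucket tensor
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()

    def average(fut):
        # With this change, the callback runs on a side stream picked by
        # FutureNCCL instead of blocking the default stream.
        return fut.value()[0] / dist.get_world_size(group=group)

    return fut.then(average)
```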

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
2020-08-19 19:42:22 -07:00
Sinan Nasir
752f433a24 DDP communication hook: skip dividing grads by world_size if hook registered. (#42400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42400

mcarilli spotted that in the original DDP communication hook design described in [39272](https://github.com/pytorch/pytorch/issues/39272), the hooks receive grads that are already predivided by world size.

It makes sense to skip the divide completely if a hook is registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea.

We also included a warning in the register_comm_hook API as:
> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
ghstack-source-id: 109548696

**Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`.

Test Plan: python test/distributed/test_c10d.py and perf benchmark tests.

Reviewed By: ezyang

Differential Revision: D22883905

fbshipit-source-id: 3277323fe9bd7eb6e638b7ef0535cab1fc72f89e
2020-08-10 13:55:42 -07:00
Sinan Nasir
0a804be47d [NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback (#42335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](https://github.com/pytorch/pytorch/pull/41596).
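A minimal sketch of what the new API enables (NCCL backend assumed, since it is the only one supporting `getFuture()` in this diff):

```python
import torch
import torch.distributed as dist

def async_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    # Launch an async allreduce and obtain a future tied to its completion.
    work = dist.all_reduce(tensor, async_op=True)
    fut = work.get_future()  # raises on backends without support
    fut.wait()               # or chain work with fut.then(...)
    return tensor
```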

ghstack-source-id: 109461507

Test Plan:
```(pytorch) [sinannasir@devgpu017.ash6 ~/local/pytorch] python test/distributed/test_c10d.py
Couldn't download test skip set, leaving all tests enabled...
..............................s.....................................................s................................
----------------------------------------------------------------------
Ran 117 tests in 298.042s

OK (skipped=2)
```
### Facebook Internal:
2\. HPC PT trainer run to validate no regression. Check the QPS number:
**Master:** QPS after 1000 iters: around ~34100
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_master" --trainers 16 --trainer-version 1c53912
```
```
[0] I0806 142048.682 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950479 0.953704], lifetime NE: [0.963963 0.950479 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34199
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_mastwarm.trainer.trainer%2F0&ta_tab=logs)

**getFuture/new design:** QPS after 1000 iters: around ~34030
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160149.197 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963959 0.950477 0.953704], lifetime NE: [0.963959 0.950477 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34018
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/new design Run 2:** QPS after 1000 iters: around ~34200
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"test2video_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160444.650 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950482 0.953706], lifetime NE: [0.963963 0.950482 0.953706], loss: [0.243456 0.235225 0.248375], QPS: 34201
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtest2video_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/old design (Regression):** QPS after 1000 iters: around ~31150
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER”testvideo_OLDgetFutureD22583690 (d904ea5972)" --trainers 16 --trainer-version 1cb5cbb
```
```
priv3_global/mast_hpc/hpc.sinannasirtestvideo_OLDgetFutureD22583690 (d904ea5972).trainer.trainer/0 [0] I0805 101320.407 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963964 0.950482 0.953703], lifetime NE: [0.963964 0.950482 0.953703], loss: [0.243456 0.235225 0.248375], QPS: 31159
```
3\. `flow-cli` tests; roberta_base; world_size=4:
**Master:** f210039922
```
total:
  32 GPUs -- 32 GPUs: p25:  0.908    35/s  p50:  1.002    31/s  p75:  1.035    30/s  p90:  1.051    30/s  p95:  1.063    30/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   452/s  p50:  0.071   449/s  p75:  0.072   446/s  p90:  0.072   445/s  p95:  0.072   444/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.821    38/s  p50:  0.915    34/s  p75:  0.948    33/s  p90:  0.964    33/s  p95:  0.976    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2035/s  p75:  0.016  2027/s  p90:  0.016  2019/s  p95:  0.016  2017/s
```
**getFuture new design:** f210285797
```
total:
  32 GPUs -- 32 GPUs: p25:  0.952    33/s  p50:  1.031    31/s  p75:  1.046    30/s  p90:  1.055    30/s  p95:  1.070    29/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   449/s  p50:  0.072   446/s  p75:  0.072   445/s  p90:  0.072   444/s  p95:  0.072   443/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.865    37/s  p50:  0.943    33/s  p75:  0.958    33/s  p90:  0.968    33/s  p95:  0.982    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2033/s  p75:  0.016  2022/s  p90:  0.016  2018/s  p95:  0.016  2017/s

```

Reviewed By: ezyang

Differential Revision: D22833298

fbshipit-source-id: 1bb268d3b00335b42ee235c112f93ebe2f25b208
2020-08-07 18:48:35 -07:00
Nikita Shulga
56fc7d0345 Fix doc build (#42559)
Summary:
Add space between double back quotes and left curly bracket

Otherwise doc generation failed with `Inline literal start-string without end-string.`

This regression was introduced by b56db305cf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42559

Reviewed By: glaringlee

Differential Revision: D22931527

Pulled By: malfet

fbshipit-source-id: 11c04a92dbba48592505f704d77222cf92a81055
2020-08-04 15:15:15 -07:00
Zhicheng Chen
b56db305cf Improve the documentation of DistributedDataParallel (#42471)
Summary:
Fixes #{issue number}

The statement that 'gradients from each node are averaged' in the documentation of DistributedDataParallel is not clear. Many people, including me, had a totally wrong understanding of this part. I added a note to the documentation to make it more straightforward and more user friendly.

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version
    ```python
    import torch
    import torch.nn as nn

    x = torch.tensor([-1, 2, -3, 4], dtype=torch.float).view(-1, 1)
    print("input:", x)

    model = nn.Linear(in_features=1, out_features=1, bias=False)
    model.weight.data.zero_()
    model.weight.data.add_(1.0)

    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.zeros(4, 1, dtype=torch.float)
    loss = torch.sum((y - label)**2)

    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)

    # OUTPUT
    # $ python test.py
    # input: tensor([[-1.],
    #         [ 2.],
    #         [-3.],
    #         [ 4.]])
    # grad: tensor([[60.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9400]], requires_grad=True)
    ```

* DistributedDataParallel version
    ```python
    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.multiprocessing import Process

    def run(rank, size):
        x = torch.tensor([-(1 + 2 * rank), 2 + 2 * rank], dtype=torch.float).view(-1, 1)
        print("input:", x)

        model = nn.Linear(in_features=1, out_features=1, bias=False)
        model.weight.data.zero_()
        model.weight.data.add_(1.0)
        model = torch.nn.parallel.DistributedDataParallel(model)

        opti = torch.optim.SGD(model.parameters(), lr=0.001)
        opti.zero_grad()

        y = model(x)

        label = torch.zeros(2, 1, dtype=torch.float)
        loss = torch.sum((y.view(-1, 1) - label)**2)

        loss.backward()
        opti.step()

        if rank == 0:
            print("grad:", model.module.weight.grad)
            print("updated weight:\n", model.module.weight)

    def init_process(rank, size, fn, backend="gloo"):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)

    if __name__ == "__main__":
        size = 2
        process = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run))
            p.start()
            process.append(p)

        for p in process:
            p.join()

    # OUTPUT
    # $ python test_d.py
    # input: tensor([[-3.],
    #         [ 4.]])input: tensor([[-1.],
    #         [ 2.]])

    # grad: tensor([[30.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9700]], requires_grad=True)
    ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42471

Reviewed By: glaringlee

Differential Revision: D22923340

Pulled By: mrshenli

fbshipit-source-id: 40b8c8ba63a243f857cd5976badbf7377253ba82
2020-08-04 08:36:42 -07:00
Yanli Zhao
79cfd85987 grad detach_ only when it has grad_fn in zero_grad call (#41283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41283

In optimizer.zero_grad(), detach_ is useful to avoid a memory leak only when the grad has a grad_fn, so add a check so that grad.detach_ is called only when the grad has a grad_fn.
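In effect, the zero_grad loop becomes something like this sketch (simplified; the exact optimizer code may differ):

```python
def zero_grad(params):
    for p in params:
        if p.grad is not None:
            if p.grad.grad_fn is not None:
                # Only detach when the grad is attached to a graph; this is the
                # case where detach_ prevents the memory leak mentioned above.
                p.grad.detach_()
            else:
                p.grad.requires_grad_(False)
            p.grad.zero_()
```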
ghstack-source-id: 108702289

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D22487315

fbshipit-source-id: 861909b15c8497f1da57f092d8963d4920c85e38
2020-07-29 11:40:13 -07:00
Jongsoo Park
73ff252913 Back out "[NCCL] DDP communication hook: getFuture()" (#42152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42152

Original commit changeset: 8c059745261d

Test Plan: .

Reviewed By: ajtulloch, jianyuh

Differential Revision: D22786183

fbshipit-source-id: 51155389d37dc82ccb4d2fa20d350f9d14abeaca
2020-07-28 10:05:35 -07:00
Shen Li
c76fada4a8 Let DDP.train() return self to stay consistent with nn.Module (#42131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42131

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D22775311

Pulled By: mrshenli

fbshipit-source-id: ac9e6cf8b2381036a2b6064bd029dca361a81777
2020-07-27 18:22:13 -07:00
Sinan Nasir
d904ea5972 [NCCL] DDP communication hook: getFuture() (#41596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
ghstack-source-id: 108409748

Test Plan:
Run old  python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case where no hook is registered. Without the then callback, the future_value in reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
2020-07-24 11:22:44 -07:00
Sinan Nasir
d5ae4a07ef DDP Communication Hook Main Structure (#40848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40848

Sub-tasks 1 and 2 of [39272](https://github.com/pytorch/pytorch/issues/39272)
ghstack-source-id: 107787878

Test Plan:
1\. Perf tests to validate that the new code (`if` conditions before `allreduce`) doesn't slow down today's DDP. Execute the following command with the diff patched/unpatched (with V25):

* **Unpatched Runs:**
```
hg checkout D22514243
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_masterD22514243 --run-as-secure-group pytorch_distributed
```
* **Run 1 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 59 s
f204539235
```
sum:
8 GPUs: p25:  0.156   205/s  p50:  0.160   200/s  p75:  0.164   194/s  p90:  0.169   189/s  p95:  0.173   185/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1006/s  p75:  0.032  1000/s  p90:  0.032   992/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.121   265/s  p50:  0.125   256/s  p75:  0.129   248/s  p90:  0.134   239/s  p95:  0.137   232/s
opts:
8 GPUs: p25:  0.003  11840/s  p50:  0.003  11550/s  p75:  0.004  8037/s  p90:  0.006  5633/s  p95:  0.007  4631/s
```
* **Run 2 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 1 s
f204683840
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.157   204/s
fwds:
8 GPUs: p25:  0.032  1015/s  p50:  0.032  1009/s  p75:  0.032  1002/s  p90:  0.032   994/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   297/s  p50:  0.111   288/s  p75:  0.115   278/s  p90:  0.119   268/s  p95:  0.122   262/s
opts:
8 GPUs: p25:  0.003  11719/s  p50:  0.004  9026/s  p75:  0.006  5160/s  p90:  0.009  3700/s  p95:  0.010  3184/s
```

* **Patched Runs:**
```
hg checkout D22328310
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_localD22328310 --run-as-secure-group pytorch_distributed
```
* **Run 1 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 30 s
f204544541
```
sum:
8 GPUs: p25:  0.148   216/s  p50:  0.152   210/s  p75:  0.156   205/s  p90:  0.160   200/s  p95:  0.163   196/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1005/s  p75:  0.032   999/s  p90:  0.032   991/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.112   286/s  p50:  0.116   275/s  p75:  0.120   265/s  p90:  0.125   256/s  p95:  0.128   250/s
opts:
8 GPUs: p25:  0.003  11823/s  p50:  0.003  10948/s  p75:  0.004  7225/s  p90:  0.007  4905/s  p95:  0.008  3873/s
```
* **Run 2 (patched):** `elastic_gang:benchmark_single.elastic_operator`
Ran for 3 mins 14 s
f204684520
```
sum:
8 GPUs: p25:  0.146   219/s  p50:  0.147   217/s  p75:  0.150   214/s  p90:  0.152   210/s  p95:  0.153   208/s
fwds:
8 GPUs: p25:  0.032  1013/s  p50:  0.032  1008/s  p75:  0.032  1002/s  p90:  0.032   996/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   299/s  p50:  0.110   290/s  p75:  0.114   280/s  p90:  0.117   274/s  p95:  0.119   269/s
opts:
8 GPUs: p25:  0.003  11057/s  p50:  0.005  6490/s  p75:  0.008  4110/s  p90:  0.010  3309/s  p95:  0.010  3103/s
```
* **Run 3 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 54 s
f204692872
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.156   204/s
fwds:
8 GPUs: p25:  0.032  1001/s  p50:  0.032   995/s  p75:  0.032   988/s  p90:  0.033   980/s  p95:  0.033   973/s
bwds:
8 GPUs: p25:  0.108   295/s  p50:  0.111   287/s  p75:  0.114   280/s  p90:  0.119   269/s  p95:  0.121   264/s
opts:
8 GPUs: p25:  0.003  11706/s  p50:  0.003  9257/s  p75:  0.005  6333/s  p90:  0.008  4242/s  p95:  0.009  3554/s
```

* **Memory:**
   * Unpatched:
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     430    |     396    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     222    |     203    |
|===========================================================================|

```
   * Patched:
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     431    |     397    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     223    |     204    |
|===========================================================================|

```

2\. As of v18: `python test/distributed/test_c10d.py`
```
....................s.....s.....................................................s................................
----------------------------------------------------------------------
Ran 114 tests in 215.983s

OK (skipped=3)

```

3\. Additional tests in `python test/distributed/test_c10d.py`:
* `test_ddp_comm_hook_future_passing_cpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `_test_ddp_comm_hook_future_passing_gpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `test_ddp_comm_hook_future_passing_gpu_gloo`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using gloo backend.
* `test_ddp_comm_hook_future_passing_gpu_nccl`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using nccl backend.
* `test_ddp_invalid_comm_hook_init`: This unit test makes sure that register_comm_hook properly checks the format of hook defined by user. The Python hook must be callable. This test also checks whether bucket annotation checked properly if defined.
* `test_ddp_invalid_comm_hook_return_type`: This test checks whether return annotation checked properly if defined. It also checks whether an internal error is thrown if return type is incorrect and user hasn't specified any return type annotation.
* `test_ddp_comm_hook_register_just_once`: DDP communication hook can only be registered once. This test validates whether the error is thrown properly when register_comm_hook is called more than once.

Reviewed By: ezyang

Differential Revision: D22328310

fbshipit-source-id: 77a6a71808e7b6e947795cb3fcc68c8c8f024549
2020-07-15 11:25:29 -07:00
Yi Huang (PyTorch)
4196605776 helper function to print out all DDP-relevant env vars (#41297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41297

GH issue: https://github.com/pytorch/pytorch/issues/40105

Add a helper function to DDP to print out all relevant env vars for debugging
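A sketch of such a helper, printing the variables shown in the sample output below (the function and list names here are illustrative, not the actual internal names):

```python
import os

DDP_RELEVANT_ENV_VARS = [
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_PORT", "MASTER_ADDR",
    "CUDA_VISIBLE_DEVICES", "GLOO_SOCKET_IFNAME", "GLOO_DEVICE_TRANSPORT",
    "NCCL_SOCKET_IFNAME", "NCCL_BLOCKING_WAIT",
]

def dump_ddp_relevant_env_vars():
    # Print each variable as "env:NAME=value", or N/A if unset.
    for name in DDP_RELEVANT_ENV_VARS:
        print(f"env:{name}={os.environ.get(name, 'N/A')}")
```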

Test Plan:
test through unittest, example output:
 ---
env:RANK=3
env:LOCAL_RANK=N/A
env:WORLD_SIZE=N/A
env:MASTER_PORT=N/A
env:MASTER_ADDR=N/A
env:CUDA_VISIBLE_DEVICES=N/A
env:GLOO_SOCKET_IFNAME=N/A
env:GLOO_DEVICE_TRANSPORT=N/A
env:NCCL_SOCKET_IFNAME=N/A
env:NCCL_BLOCKING_WAIT=N/A
...
 ---

Reviewed By: mrshenli

Differential Revision: D22490486

fbshipit-source-id: 5dc7d2a18111e5a5a12a1b724d90eda5d35acd1c
2020-07-13 14:03:04 -07:00
Shen Li
0edbe6b063 Add a link in RPC doc page to point to PT Distributed overview (#41108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-08 14:00:05 -07:00
chengjun
8d570bc708 Decouple DataParallel/DistributedDataParallel from CUDA (#38454)
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility
- Provide common APIs for arbitrary device types without changing existing CUDA APIs in the torch.cuda space.
- Replace the torch.cuda calls in DataParallel/DistributedDataParallel with the new APIs.

Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454

Differential Revision: D22051557

Pulled By: mrshenli

fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
2020-07-07 12:48:16 -07:00
Sinan Nasir
15864d1703 Skip allreducing local_used_maps_dev_ when find_unused_param=False
Summary:
1. In reducer.cpp, we have a new boolean `find_unused_param_` and its value is set in `Reducer::prepare_for_backward`.
If `!find_unused_param_`, then it avoids `allreduce(local_used_maps_dev_)`.
2. Solves issue [38942](https://github.com/pytorch/pytorch/issues/38942).
3. Fixes incorrect handling of `find_unused_parameters_`, which was previously inferred from checks like `outputs.empty()` or `unused_parameters_.empty()` (see the usage sketch after this list).
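
A usage sketch of the flag this change optimizes for (assumes the default process group is already initialized and one GPU per process):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(16, 2).cuda()

# find_unused_parameters=False (the default): every parameter is expected to
# receive a gradient each iteration, so the reducer can skip the extra
# allreduce of the locally-used-parameters map described above.
ddp_model = DDP(model, device_ids=[0])

# find_unused_parameters=True is only needed when some parameters may not get
# gradients in a given iteration, at the cost of that extra allreduce.
# ddp_model = DDP(model, device_ids=[0], find_unused_parameters=True)
```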

ghstack-source-id: 106693089

Test Plan:
1. Run `test/distributed/test_c10d.py` and make sure all tests pass.
2. A new test case `test_find_unused_parameters_when_unused_parameters_empty` is included. The old `reducer.cpp` failed that unit test because it checked `find_unused_parameters_` via `unused_parameters_.empty()`; the current `reducer.cpp` passes it.
3. Two test cases, `test_forward_backward_unused_parameters` and `test_forward_backward_optimizer`, were failing because `find_unused_parameter_` of their `reducer` object was not set properly. I fixed that as well.

Imported from OSS

**Output of version 14:**
```
................s.....s...............................................test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
.test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
.....s...............................
----------------------------------------------------------------------
Ran 108 tests in 214.210s

OK (skipped=3)
```

Differential Revision: D22176231

fbshipit-source-id: b5d15f034e13a0915a474737779cc5aa8e068836
2020-06-26 19:20:59 -07:00
Michael Carilli
8066fba226 [RELAND2] Change AccumulateGrad to yield .grads that match weights' memory layout (#40358)
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.

This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) support debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`):
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master.  In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.

Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358

Differential Revision: D22165785

Pulled By: albanD

fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
2020-06-22 17:13:21 -07:00
Shen Li
30364f0b01 Remove obsolete warning message from DDP (#40190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40190

Fixed by #36503

Test Plan: Imported from OSS

Differential Revision: D22101516

Pulled By: mrshenli

fbshipit-source-id: 9abd6dce602530c11b7fe623ac0f4d556dccc961
2020-06-17 17:58:21 -07:00
Alban Desmaison
08227fea4f Revert D22079377: [pytorch][PR] [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D22079377

Original commit changeset: 9bd2b7e0c34f

fbshipit-source-id: c22cc349d790caa574eace0d63980854c33e5a59
2020-06-17 10:17:27 -07:00
Michael Carilli
1ec8ece2b9 [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout (#40129)
Summary:
https://github.com/pytorch/pytorch/pull/34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
2020-06-17 09:02:54 -07:00
Pritam Damania
15823ac6d5 Enhance DDP docstrings for DDP + RPC support. (#39916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39916

ghstack-source-id: 105999275

Test Plan: waitforbuildbot

Differential Revision: D22013190

fbshipit-source-id: be3bb12b2281579610581b809c822ab6b027fa71
2020-06-16 20:05:13 -07:00
Alban Desmaison
f1e575a0bf Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D20496044

Original commit changeset: 248d680f4b1b

fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d
2020-06-16 10:38:40 -07:00
Michael Carilli
2beb9690c3 Change AccumulateGrad to yield .grads that match weights' memory layout (#34904)
Summary:
Currently, whether `AccumulateGrad`  [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout.  If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient.  This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads.  This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout).  The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes.  I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general (see the sketch after this list):
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.
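
As a quick check of that contract, a minimal sketch (CPU-only, assumes a post-PR build):

```python
import torch

# Under the layout contract above, the stashed .grad of a channels_last
# parameter should itself be channels_last rather than rowmajor contiguous.
conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)
x = torch.randn(2, 8, 16, 16).to(memory_format=torch.channels_last)
conv(x).sum().backward()
print(conv.weight.grad.is_contiguous(memory_format=torch.channels_last))  # expected: True
```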

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change.  Any user code that was accustomed to `view(-1)`ing these grads will break.
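
For example, a minimal illustration of that breakage (the grad is simulated here with a dense, non-rowmajor-contiguous tensor):

```python
import torch

# A channels_last tensor is dense but not rowmajor contiguous, so view(-1)
# on such a grad now raises; reshape(-1), which may copy, is the replacement.
g = torch.empty(2, 8, 4, 4).to(memory_format=torch.channels_last)
try:
    g.view(-1)
except RuntimeError as err:
    print("view(-1) failed:", err)
flat = g.reshape(-1)  # works (copies if necessary)
```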

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed.  In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params.  Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous.  IMO this is a mild BC breakage.  Param backward hooks still see grads come in with whatever format the backward kernel gave them.  The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`.  Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At albanD's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place.  Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout.  After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`.  This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point (see the sketch after the limitation note below).

One limitation (present before this PR and unchanged by this PR):  Presetting `param.grad` does not ensure in-place accumulation all the time.  For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
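
A minimal sketch of that escape hatch, assuming the common in-place accumulation path (create_graph=False, dense grads): here rowmajor-contiguous grads are forced for a channels_last module by presetting `.grad`.

```python
import torch

conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)

# Preset .grad with the desired strides; in-place accumulation keeps them.
for p in conv.parameters():
    p.grad = torch.zeros_like(p, memory_format=torch.contiguous_format)

x = torch.randn(2, 8, 16, 16).to(memory_format=torch.channels_last)
conv(x).sum().backward()
print(conv.weight.grad.is_contiguous())  # expected: True (rowmajor layout kept)
```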

----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~  PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately.  Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
2020-06-16 08:43:31 -07:00
Yanli Zhao
b98948e6dd implement dynamic bucket order in DDP (#35137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35137

bucket order is rebuilt dynamically in the first reduction backward pass when find_unused_parameters = false
ghstack-source-id: 104794018

Test Plan: unit test

Differential Revision: D20128537

fbshipit-source-id: fad73de965cdcb59a51c0a12b248271344584b9f
2020-05-28 12:59:52 -07:00
Shen Li
8d6a8d2b3f Fix DDP bug in single process multiple device use cases (#36503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36503

Test Plan: Imported from OSS

Differential Revision: D21179274

Pulled By: mrshenli

fbshipit-source-id: 0afce30ae0ddda753d1e240584a0f80df9aec4c2
2020-04-22 15:06:28 -07:00
Shen Li
5afd816793 Add a warning for Single-Process Multi-GPU DDP (#36656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36656

Test Plan: Imported from OSS

Differential Revision: D21042537

Pulled By: mrshenli

fbshipit-source-id: fa3501dc2bba14550ec4f254612a80f61fe86a4a
2020-04-15 12:43:50 -07:00
Xiang Gao
df8d6eeb19 Update docs about DP and DDP for CUDA (#35063)
Summary:
We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063

Differential Revision: D20549621

Pulled By: ngimel

fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
2020-03-20 20:06:37 -07:00