Commit Graph

278 Commits

Author SHA1 Message Date
Ailing
1499a604cf fix assertion error when input size is smaller than the number of module_copies (#6252) 2018-04-04 12:05:34 +02:00
Ailing
f5aa8d55ad fix in-place detach error in DDP (#5829)
* fix detach in DDP

* fix typo

* make lint happy
2018-03-16 09:22:04 -04:00
Teng Li
579de82bcf DDP: 10% NCCL backend perf improvement with mixed-precision support (#5064) 2018-02-21 23:59:52 +01:00
Teng Li
4b8f4fc259 Added mixed-precision support in distributed training (#4891) 2018-02-21 14:29:39 +01:00
Richard Zou
cac3026b35 Fix typo in DataParallel docs (#5268) 2018-02-15 23:02:26 +01:00
Teng Li
d7b6a61a54 DDP: coalescing many little broadcasts to improve performance (#4978) 2018-02-12 16:41:33 +01:00
Tongzhou Wang
805639906a Broadcast output requires_grad only if the corresponding input requires_grad (#5061) 2018-02-05 23:38:35 -05:00
Teng Li
ae28411af8 Slightly improve DDP single-GPU multi-process dist training performance 2018-01-27 12:15:44 +01:00
Teng Li
154038e318 Removing NCCL clear_group_cache workaround with one more check in new_group (#4766) 2018-01-23 11:03:52 +01:00
Sam Gross
d605058212
Replace Variable.volatile with torch.no_grad() (#3970)
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().

In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()

Fixes #3627
2017-12-18 15:46:13 -05:00
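
As a quick illustration of the API described in the commit above (a minimal sketch in current PyTorch syntax; the tensor is just a placeholder):

    import torch

    x = torch.randn(3, requires_grad=True)  # placeholder input

    # Inference without recording the autograd graph (replaces volatile=True).
    with torch.no_grad():
        y = x * 2
    assert not y.requires_grad

    # The same thread-local flag can also be toggled explicitly.
    with torch.set_grad_enabled(False):
        z = x * 2
    assert not z.requires_grad
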
ngimel
7f41149e14 handle requires_grad when creating buckets for distributed (#4044) 2017-12-18 02:13:53 -05:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL determine the NCCL version

* Let NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Cached the sockets, fixed bugs, refactored, and addressed Adam's comments

* Make BcastNcclID take a single param and fix a bug in all_gather

* Removed the barrier function, added a warning for users, and stopped exposing the experimental function to users

* Use the simplest working single-bucket solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
2017-11-29 15:57:02 -05:00
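
For reference, a hedged sketch of how the NCCL backend and the distributed data parallel model added in the commit above are typically driven (current PyTorch syntax, one process per GPU; the address, world size, rank, and model are placeholders):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # One process per GPU; rank/world_size normally come from the launcher.
    dist.init_process_group(backend='nccl',
                            init_method='tcp://127.0.0.1:23456',
                            world_size=2, rank=0)

    device = torch.device('cuda', 0)
    model = torch.nn.Linear(10, 10).to(device)            # placeholder model
    ddp_model = DistributedDataParallel(model, device_ids=[device.index])

    # Gradients are all-reduced across processes during backward().
    out = ddp_model(torch.randn(8, 10, device=device))
    out.sum().backward()
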
SsnL
01be4d6b20 sparse broadcast_coalesced and reduce_add_coalesced 2017-10-28 18:52:35 -04:00
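
A small sketch of the coalesced communication helpers named in the commit above, from torch.cuda.comm (assumes at least two CUDA devices; the tensors are placeholders):

    import torch
    import torch.cuda.comm as comm

    # A mix of dense and sparse tensors on the source device.
    tensors = [torch.randn(4, 4, device='cuda:0'),
               torch.randn(2, 2, device='cuda:0').to_sparse()]

    # Copy every tensor to each device, coalescing small tensors into large transfers.
    copies = comm.broadcast_coalesced(tensors, devices=[0, 1])

    # Sum the per-device copies back onto device 0, again coalescing transfers.
    summed = comm.reduce_add_coalesced(copies, destination=0)
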
SsnL
de1f4e69dd raw text (#3327) 2017-10-28 01:24:02 +05:30
Luca Antiga
6743d59513 Add missing import. Add return to __getstate__ 2017-10-08 11:07:10 -04:00
Sergey Kolesnikov
5f8bab47c8 bugfix for issue #2428 (#3000) 2017-10-06 09:20:12 -04:00
jekbradbury
7aa6bc516f add "Basics" section to distributed docs (#2433) 2017-08-24 17:07:20 -04:00
Robert Kirby
5d09fcd028 Make DistributedDataParallel threads Daemon threads to allow clean process exit (#2524) 2017-08-24 06:32:29 -04:00
Christian Sarofeen
4c69697d2a Distributed bug fixes. (#2434) 2017-08-23 14:46:52 -04:00
LuoweiZhou
5c43fcda8d Support params that don’t require grad in DistributedDataParallel (#2464) 2017-08-19 11:22:20 -04:00
Robert Kirby
9199c954f1 Fix typo in DistributedDataParallel (#2320) 2017-08-08 21:53:42 -04:00
Adam Paszke
dc17fb68e4 Fix minor bug in parallel_apply (#2193) 2017-07-25 03:45:00 +05:30
Adam Paszke
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
Adam Paszke
4af40e3471 Let parallel_apply accept arbitrary inputs 2017-07-20 01:45:57 -04:00
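
A minimal sketch of the parallel_apply behavior referenced in the commit above, paired with replicate (assumes two CUDA devices; the module and inputs are placeholders):

    import torch
    from torch.nn.parallel import replicate, parallel_apply

    module = torch.nn.Linear(4, 4).cuda(0)          # placeholder module
    replicas = replicate(module, [0, 1])            # one copy per device

    # Each replica receives its own argument tuple; one device per replica.
    inputs = [(torch.randn(2, 4, device='cuda:0'),),
              (torch.randn(3, 4, device='cuda:1'),)]
    outputs = parallel_apply(replicas, inputs, devices=[0, 1])
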
Sam Gross
10e23943b3 Fix missing _forward_pre_hooks in serialized modules (#2057) 2017-07-11 18:23:35 -04:00
Leonid Vlasenkov
46a868dab7 [Ready] Limit docs line length (#1900)
* some docs are ready

* docs

* docs

* fix some more

* fix some more
2017-07-10 10:24:54 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00