Commit Graph

78 Commits

Author SHA1 Message Date
Max Wang
c5845c4482 Add support for reduce-scatter in c10d (#18844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18844
ghimport-source-id: c6b2f0032c7c2212be2000a9c1f262f63d878a97

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18844 Add support for reduce-scatter in c10d**
* #18820 Refactor ProcessGroupNCCL collective primitives

Reviewed By: mrshenli

Differential Revision: D14768369

fbshipit-source-id: a9def7a0da6e9cd995e982371cc1e22f3df1a156
2019-04-26 13:46:57 -07:00
Kutta Srinivasan
b7323a94ad Cleanup init_process_group (#19033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19033

torch.distributed.init_process_group() has had many parameters added, but the contract isn't clear. Adding documentation, asserts, and explicit args should make this clearer to callers and more strictly enforced.

Reviewed By: mrshenli

Differential Revision: D14813070

fbshipit-source-id: 80e4e7123087745bed436eb390887db9d1876042
2019-04-18 09:37:38 -07:00
Pieter Noordhuis
ce166d949d ProcessGroupMPI exists only if it is valid (#14809)
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group and they had a
special groupRank and groupSize field to capture the _real_ rank.

This also meant assymetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group and an additional check was needed to verify
whether or not a process was indeed part of a process group or not.

This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809

Differential Revision: D14887937

Pulled By: pietern

fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
2019-04-10 21:36:35 -07:00
Shen Li
8f9b11cf33 Propagate ProcessGroup timeout to Store (#16571)
Summary:
closes #16520

Hi pietern, I am not sure if this is the expected way to pass timeout to `Store`, could you please help take a look? Thanks!

Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571

Differential Revision: D13954527

Pulled By: mrshenli

fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
2019-04-09 12:36:28 -07:00
Pieter Noordhuis
7a19d3c9e1 Allow override of backend in dist.new_group() (#18595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18595

There is no need to force the backend to be the same as the global
process group, as long as the backend is "nccl" or "gloo".

Reviewed By: mrshenli

Differential Revision: D14657204

fbshipit-source-id: 868817b9f219e3be8db0761a487f0027ed46663b
2019-04-04 14:23:03 -07:00
Shen Li
c0ad6747a9 Highlight NCCL all_reduce and all_gather requirements (#18741)
Summary:
See #18689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18741

Differential Revision: D14726874

Pulled By: mrshenli

fbshipit-source-id: a92404c653e3c62fc23fa3ccacfb3b2959b2e307
2019-04-03 09:50:29 -07:00
Igor Fedan
36237c4893 Fix flake8 issues in gragrad test
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18727

Differential Revision: D14724887

Pulled By: ifedan

fbshipit-source-id: 8c1db6460303e746e4aea0142302b8d61277c067
2019-04-02 12:45:18 -07:00
Pieter Noordhuis
bdfdf6c2b9 C++ handler for gradient reduction (#18251)
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.

This should enable the following:

* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.

This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.

Also see #17757 and #13273.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251

Reviewed By: mrshenli

Differential Revision: D14571899

Pulled By: pietern

fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
2019-04-01 14:30:02 -07:00
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
Brian Johnson
fd04073e61 Fixed a formatting issue in doc comments (#17505)
Summary:
for torch.distributed.broadcast_multigpu per issue #17243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17505

Reviewed By: janewangfb

Differential Revision: D14373865

Pulled By: pietern

fbshipit-source-id: 6d7e91a3da50a7c9ba417ad852f7746eb5200043
2019-03-12 09:55:29 -07:00
Jane Wang
a2b9f7f484 add elastic zeus handler (#16746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16746

as titled. We use a special url schem elasticzeus for elastic zeus so that we dont need to change the public interface of init_process_group.

Reviewed By: aazzolini, soumith

Differential Revision: D13948151

fbshipit-source-id: 88939dcfa0ad93467dabedad6905ec32e6ec60e6
2019-02-27 11:29:59 -08:00
hysts
cbefd0323b Fix typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17521

Differential Revision: D14237482

Pulled By: soumith

fbshipit-source-id: 636e0fbe2c667d15fcb649136a65ae64937fa0cb
2019-02-26 20:23:34 -08:00
Andy Wei
5eee0670ab Pass torch.distributed launch process local rank as environment variable instead of argument (#16360)
Summary:
In `torch.distributed.launch.py`, it passes `local_rank` as argument and requires user's program to parse it. However, it would be more flexible for users and consistent with other variables, e.g. `RANK`, `MASTER_PORT`, `WORLD_SIZE`, if passing through environment variables.

265ed8ff45/torch/distributed/launch.py (L200-L212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16360

Differential Revision: D14070372

Pulled By: ezyang

fbshipit-source-id: c3f6a8e55ab513918cad09d1326eccdedb4d98c9
2019-02-15 14:52:55 -08:00
Andy Wei
4cf76574b9 Raise CalledProcessError when torch.distributed launch process not return 0 (#16069)
Summary:
`torch.distributed.launch.py` will not raise error when `subprocess.Popen` is not return 0.
For better debugging it should always raise an error if processes launched have unusual behavior
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16069

Differential Revision: D13709467

Pulled By: ezyang

fbshipit-source-id: 31d32a5ec8fed7bccd62d845bfba0e670ed3fe20
2019-01-22 08:50:47 -08:00
Teng Li
b4bc55beef TCP init method race condition fix (#15684)
Summary:
This PR fixes a race condition for TCP init method, when master rank can exit earlier than slave ranks and thus the TCP daemon thread gets shutdown before other slaves are able to access it.

This will let every rank (process) write a special key to the store to mark that they are completed (and thus about to exit).  The master rank (who is the server) will always wait until all the ranks to complete before complete itself.

This should fix: https://github.com/pytorch/pytorch/issues/15638

Tested using the repro of https://github.com/pytorch/pytorch/issues/15638 and works fine. Also test_distributed and test_c10d should have already had this coverage.

I had to make rendezvous test in c10d the world size of 1, since it is a single process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684

Differential Revision: D13570904

Pulled By: teng-li

fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
2019-01-18 02:29:38 -08:00
Derek Kim
19717224c5 Miscellaneous broken RSTs fixed (#16033)
Summary:
https://pytorch.org/docs/master/tensors.html#torch.Tensor.bernoulli_
https://pytorch.org/docs/master/torch.html#torch.addmm
https://pytorch.org/docs/master/distributed_deprecated.html#torch.distributed.deprecated.reduce_multigpu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16033

Differential Revision: D13671202

Pulled By: soumith

fbshipit-source-id: 276e10e610affe205376573e7f0f9894695d218d
2019-01-15 09:50:12 -08:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, it is designed on not letting users use the default_group directly on any functions.  It should really be private.

All collectives are supposed to either use group.WORLD, or anything that comes out of new_group. That was the initial design.

We need to make a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Pieter Noordhuis
11ef5191ff Enable tests for CPU tensors in test_distributed.py (#14572)
Summary:
These were not enabled after adding support in the Gloo backend. The
argument checks in ProcessGroupGloo raised an error in two cases:

* If the input tensor list to scatter was ``[None]`` on processes other
  than the source process.
* If the output tensor list to gather was ``[None]`` on processes other
  than the destination process.

This commit prepares these arguments explicitly instead of boxing them
at the process group call site.

This fixes #14536.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14572

Differential Revision: D13272812

Pulled By: pietern

fbshipit-source-id: 12cb0d85ec92f175365cbada585260f89330aad8
2018-11-29 21:39:02 -08:00
Teng Li
9127ab3866 Fixed new_group won't work for two or more different rank groups (#14529)
Summary:
This fixed two things:

(1) NCCL group doesn't support 2 or more groups, this is because, we need a group name in ProcessGroupNCCL class to keep track of the ProcessGroup ID within that group name, and also the NCCL unique ID within that group name and process group ID.  Otherwise, different processes will create different NCCL PG in different orders and can clash on these names.  This will fix the NCCL problem.

(2)  When using new_group, each rank should enter this function and update its global group name counter to ensure that every rank always operates on the same group name.

With both fixes: repro code in: https://github.com/pytorch/pytorch/issues/14528 should work with both NCCL and Gloo backends.

```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529

Differential Revision: D13253434

Pulled By: teng-li

fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
2018-11-29 19:57:47 -08:00
Teng Li
0d3cb91d8c Make env init_method support both env and args for rank and size (#14494)
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446

This was a supported behavior in old torch.distributed. We want to support it in the new release.

Test should cover all combination of scenario when we have either env or arg set up for rank or size or both
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494

Differential Revision: D13253433

Pulled By: teng-li

fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
2018-11-29 18:48:20 -08:00
Pieter Noordhuis
4ec6bd7356 Add sourceRank() to ProcessGroup::Work (#14453)
Summary:
This function is only implemented for the subclasses where it makes
sense. If it's not overridden it will throw an error. Having this
function removes the need for a pointer passing hack to pass the
source rank of a recv operation back to the caller. Instead, the
caller can now call `source_rank` on the work object and achieve
the same result.

Closes #11804.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14453

Differential Revision: D13230898

Pulled By: pietern

fbshipit-source-id: ef38f48bfaca8ef9a364e5be122951bafc9f8e49
2018-11-29 09:16:53 -08:00
Pieter Noordhuis
0f62af4ab1 Add timeout kwarg to init_process_group (#14435)
Summary:
This applies to the gloo backend only. Timeout support for the NCCL and
MPI backends is tracked in issues #14371 and #14372 respectively.

When creating a new process group (either the global one or any subgroup
created through `new_group`) you can specify a timeout keyword
argument (of type datetime.timedelta). This timeout applies to all
collective operations executed against that process group, such that any
operation taking longer than the timeout will throw a runtime error.
Using a different, better catchable error type is tracked in #14433.

This fixes #14376.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14435

Differential Revision: D13234317

Pulled By: pietern

fbshipit-source-id: 973993b67994dc64861c0977cbb6f051ec9d87f6
2018-11-28 11:35:01 -08:00
Teng Li
b807970aea Tensor type checking and informative error messages for torch.distributed (#14204)
Summary:
This will address https://github.com/pytorch/pytorch/issues/13574

This error message should be more informative to the user for all the non-multiGPU ops, since we python binding to multi-gpu ops always.

test_distributed should cover all. Also tested both RunTime errors.

```
>>> a = torch.ByteTensor([])
>>> b = [a, a]
>>> dist.all_reduce(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 809, in all_reduce
    _check_single_tensor(tensor, "tensor")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 207, in _check_single_tensor
    "to be a torch.Tensor type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor to be a torch.Tensor type

>>> b = ["b"]
>>> dist.all_gather(b, a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 1006, in all_gather
    _check_tensor_list(tensor_list, "tensor_list")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 225, in _check_tensor_list
    "to be a List[torch.Tensor] type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor_list to be a List[torch.Tensor] type
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14204

Differential Revision: D13131526

Pulled By: teng-li

fbshipit-source-id: bca3d881e41044a013a6b90fa187e722b9dd45f2
2018-11-19 18:30:54 -08:00
Teng Li
97036d3c30 FileStore auto deletes file and FileStore::add bug fix (#13708)
Summary:
This addressed: https://github.com/pytorch/pytorch/issues/11874

and we will have the identical file init_method behavior as the previous THD file init.

Also the FileStore::add bug is pretty annoying.

Two bugs:
(1) Add doesn't append to the end of the file.
(2) Cache doesn't get updated.

Both are fixed and tests are covered.

I examined the /tmp to ensure that all temp files are auto deleted after test_c10d.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708

Reviewed By: pietern

Differential Revision: D12972810

Pulled By: teng-li

fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
2018-11-14 01:34:22 -08:00
Tongzhou Wang
044d00516c Rename DistBackend -> Backend (#11830)
Summary:
Also add docs for get_backend, Backend, and reduce_op

fixes #11803

cc The controller you requested could not be found. pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830

Differential Revision: D9927991

Pulled By: SsnL

fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
2018-11-07 11:58:12 -08:00
Teng Li
1b64c0f8fe Error msg on TCP backend (#13596)
Summary:
Clean it up from my queue:

https://github.com/pytorch/pytorch/issues/12721

```
>>> torch.distributed.init_process_group(backend="tcp")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 275, in init_process_group
    backend = DistBackend(backend)
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 55, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backends for collective operations on CPU tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13596

Differential Revision: D12931196

Pulled By: teng-li

fbshipit-source-id: bb739b107ad7454e2e0a17430087161fedd4c392
2018-11-05 16:40:02 -08:00
Pieter Noordhuis
526460fc8b Use default timeout of 30 minutes for gloo backend (#13056)
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056

Reviewed By: teng-li

Differential Revision: D10558746

Pulled By: pietern

fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
2018-10-25 16:35:53 -07:00
Edward Yang
058a31839d Warn about local_rank not being globally unique. (#12370)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

CC deepakn94
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12370

Differential Revision: D10220135

Pulled By: ezyang

fbshipit-source-id: 6d1a8a383951ae52753e4f75a14b8080bf02b815
2018-10-05 17:38:41 -07:00
Edward Yang
dfa03e94eb Fix mispelling of AVAILABLE. (#12016)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12016

Reviewed By: pietern

Differential Revision: D10010808

Pulled By: ezyang

fbshipit-source-id: ff6394ae9a53f7fdad2cadb4e019e09ac63bba96
2018-09-24 20:46:41 -07:00
Pieter Noordhuis
52472508e9 Add env:// rendezvous test (#11782)
Summary:
A missing environment variable raised a missing key error. Now it
raises a more descriptive error of the actual problem, for example:

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782

Differential Revision: D9888962

Pulled By: pietern

fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
2018-09-19 09:56:06 -07:00
Tongzhou Wang
540ef9b1fc Add distributed get_backend (#11715)
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.

cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715

Reviewed By: pietern

Differential Revision: D9889646

Pulled By: SsnL

fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
2018-09-18 10:56:24 -07:00
Pieter Noordhuis
7535d98ec4 Add message tag parameter to send/recv
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11490

Reviewed By: teng-li

Differential Revision: D9828116

Pulled By: pietern

fbshipit-source-id: 98be1ae84b6763ffb329e63c030c5e3ec0e748b7
2018-09-14 10:55:37 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
Teng Li
7726b36489 Full-fledged group testings and fixes for c10d frontend APIs (#11318)
Summary:
Fixed a few bugs that were not tested in the c10d frontend APIs, including
get_rank, get_world_size, and destroy_process_group of a given group.

These APIs are added to the CI tests.

Also added all the group related tests, including full-group, and partial groups (existing ones), since both will hit different code paths.

Also removed experimental APIs for c10d initially used in DDP, now we don't use it anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11318

Reviewed By: pietern

Differential Revision: D9675896

Pulled By: teng-li

fbshipit-source-id: a2eac2c57933effa2d139855f786e64919a95bfc
2018-09-06 18:26:11 -07:00
Teng Li
3791bd12c8 PT1 Release Milestone No.2 MPI Group Support with all tests passed (#11128)
Summary:
Added MPI group support.
And this will make all previous group test cases of MPI passed.

Also, release the MPI thread level support by serializing different PG's MPI ops. This is required.

The build is fixed too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128

Differential Revision: D9602188

Pulled By: teng-li

fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
2018-08-31 12:39:56 -07:00
Teng Li
56539f5fe1 PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests (#10871)
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py' is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d

**Now all the distributed test including c10d DDP can pass with the c10d frontend API**

TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871

Differential Revision: D9554514

Pulled By: teng-li

fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
2018-08-29 12:55:57 -07:00
Tongzhou Wang
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00
Tongzhou Wang
db7b7f1359 fix typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10686

Differential Revision: D9399874

Pulled By: SsnL

fbshipit-source-id: 28130992d2416721552f72cfa835ff0358caeefa
2018-08-20 10:40:55 -07:00
Tongzhou Wang
3f603eeee8 some improvements on distributed docs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10666

Differential Revision: D9395242

Pulled By: SsnL

fbshipit-source-id: 952326b9c5a1a974a1c33a0e12738e1e21ad9956
2018-08-19 17:40:28 -07:00
Ailing Zhang
371a786b18 Errors out when Openmpi < 2.x.x with distributed. (#10015)
Summary:
This PR fixes #9418 .
Openmpi 1.10 segfaults in MPI_Bcast with CUDA buffer. And it's a retired openmpi version.
I've tested on 2.1.1 and 3.0.0 and they work well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10015

Reviewed By: soumith

Differential Revision: D9088103

Pulled By: ailzhang

fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
2018-07-31 12:24:40 -07:00
Teng Li
5b7951057d Distributed Data Parallel Module Implementation (#8584)
Summary:
This is an initial implementation of Distributed Data Parallel module for c10d GLOO and NCCL backend.

Have done performance testing and made sure that both single GPU / process and multi-GPU / process are able to overlap communication with BW computation

The idea is, DDP will bucket parameters and do all reduce in the reverse order of the bucket. Since all C10D ops are async ops, no more dedicated thread is needed and we simply queue the all-reduce kernels once the bucket is ready following the deterministic reduction order.

Tested with 8 nodes 64 GPUs, ResNet 50, hit the required accuracy within 90 epochs
Closes https://github.com/pytorch/pytorch/pull/8584

Reviewed By: goldsborough

Differential Revision: D8678696

Pulled By: teng-li

fbshipit-source-id: 440341b804befc6762e92acece2759ba47157cea
2018-06-28 17:25:40 -07:00
Pieter Noordhuis
dc209ed963
[c10d] Rendezvous skeleton (#8294)
* [c10d] Rendezvous skeleton

The rendezvous function takes an URL and produces a triplet of a store,
a process rank, and the process group size.

For the file and TCP handlers, the rank and size must be specified, but
other handlers may discover these parameters dynamically.

It returns a generator function, such that if a rendezvous handler
supports rerendezvous, you can write:

for store, rank, size in c10d.rendezvous(...):
  pg = c10d.ProcessGroup(store, rank, size)
  while the process group is valid:
    # Do stuff with process group

* Add Python 2 fallback for urlparse library

* Import X as Y

* Relative import seems to fix it

* Spelling

* Gate import on c10d availability
2018-06-13 15:27:32 -07:00
Pieter Noordhuis
695d40efc2
Create initial Python bindings for c10d (#8119)
* Build and install c10d from tools/build_pytorch_libs.sh

* Create initial Python bindings for c10d

* clang-format

* Switch link order to include more symbols

* Add bindings and tests for ProcessGroupGloo

* Add broadcast test

* Separate build flag for c10d

* Explicit PIC property

* Skip c10d tests if not available

* Remove c10d from Windows blacklist

Let it skip by itself because it won't be available anyway.

* Make lint happy

* Comments

* Move c10d module into torch.distributed

* Close tempfile such that it is deleted
2018-06-08 12:59:51 -07:00
Kaiyu Shi
db6e4576da Use customized python interpreter (#7520) 2018-05-12 13:06:39 -04:00
Edgar Andrés Margffoy Tuay
4ab6ea5b1f Add unbuffered flag to distributed node launcher (#7226) 2018-05-03 11:49:06 +02:00
Teng Li
f5beff334b Added distributed docs on NCCL2 backend/functions and launch module (#6579) 2018-04-15 21:53:10 -04:00
Teng Li
37059ba0ec Added torch.distributed.launch module for easier multi-proc/node distributed job launching (#5348) 2018-03-13 12:04:38 +01:00
Sam Gross
48a3349c29
Delete dead Tensor code paths (#5417)
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.

This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
2018-02-27 17:58:09 -05:00
Teng Li
5c65466b86 Release NCCL distributed backend from experimental (#4921)
* Release NCCL distributed backend from experimental

* fix typo
2018-01-30 16:21:21 +01:00
Teng Li
a3b098dcf9 Adding is process_group initialized support (#4618) 2018-01-12 22:56:54 +01:00