Commit Graph

50 Commits

Author SHA1 Message Date
Edward Yang
dfa03e94eb Fix misspelling of AVAILABLE. (#12016)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12016

Reviewed By: pietern

Differential Revision: D10010808

Pulled By: ezyang

fbshipit-source-id: ff6394ae9a53f7fdad2cadb4e019e09ac63bba96
2018-09-24 20:46:41 -07:00
Pieter Noordhuis
52472508e9 Add env:// rendezvous test (#11782)
Summary:
A missing environment variable used to surface as a bare missing-key error. Now it
raises a more descriptive error that names the actual problem, for example:

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
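
For reference, a minimal sketch of what the env:// rendezvous expects (a sketch against the c10d frontend in torch.distributed; the values below are illustrative):

    import os
    import torch.distributed as dist

    # The env:// handler reads these variables; leaving any of them unset now
    # produces a descriptive ValueError like the one shown above.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    os.environ["RANK"] = "0"
    os.environ["WORLD_SIZE"] = "1"

    dist.init_process_group(backend="gloo", init_method="env://")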

Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782

Differential Revision: D9888962

Pulled By: pietern

fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
2018-09-19 09:56:06 -07:00
Tongzhou Wang
540ef9b1fc Add distributed get_backend (#11715)
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.

cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715

Reviewed By: pietern

Differential Revision: D9889646

Pulled By: SsnL

fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
2018-09-18 10:56:24 -07:00
Pieter Noordhuis
7535d98ec4 Add message tag parameter to send/recv
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11490

Reviewed By: teng-li

Differential Revision: D9828116

Pulled By: pietern

fbshipit-source-id: 98be1ae84b6763ffb329e63c030c5e3ec0e748b7
2018-09-14 10:55:37 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use the c10d DDP
Now `torch.distributed` will use the c10d frontend API
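
A rough sketch of what the move means for user code (import paths as described above; treat this as a sketch, not an exhaustive migration guide):

    import torch.distributed as dist                         # now the c10d frontend API
    from torch.nn.parallel import DistributedDataParallel    # now backed by the c10d DDP

    # The previous implementations remain under the deprecated namespaces noted
    # above: torch.distributed.deprecated and torch.nn.parallel.deprecated.
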
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
Teng Li
7726b36489 Full-fledged group testings and fixes for c10d frontend APIs (#11318)
Summary:
Fixed a few previously untested bugs in the c10d frontend APIs, including
get_rank, get_world_size, and destroy_process_group for a given group.

These APIs are added to the CI tests.

Also added all the group-related tests, covering both full groups and partial groups (the existing ones), since the two hit different code paths.

Also removed the experimental c10d APIs initially used in DDP, since they are no longer used.
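
A minimal sketch of the per-group calls these tests exercise (assuming the c10d frontend in torch.distributed and a job launched with WORLD_SIZE >= 2 via env://):

    import torch.distributed as dist

    dist.init_process_group(backend="gloo", init_method="env://")

    # A partial group containing ranks 0 and 1; new_group must be called by every rank.
    group = dist.new_group(ranks=[0, 1])

    rank_in_group = dist.get_rank(group)       # returns -1 on ranks outside the group
    group_size = dist.get_world_size(group)

    dist.destroy_process_group(group)
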
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11318

Reviewed By: pietern

Differential Revision: D9675896

Pulled By: teng-li

fbshipit-source-id: a2eac2c57933effa2d139855f786e64919a95bfc
2018-09-06 18:26:11 -07:00
Teng Li
3791bd12c8 PT1 Release Milestone No.2 MPI Group Support with all tests passed (#11128)
Summary:
Added MPI group support, which makes all previous MPI group test cases pass.

Also handled the MPI thread-level support requirement by serializing the MPI ops of different process groups; this serialization is required.

The build is fixed as well.
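
A rough sketch of how the MPI backend is typically driven (the launch command and rank discovery are common MPI usage, not taken from this commit):

    # Launched under MPI, e.g.:  mpirun -n 4 python script.py
    import torch.distributed as dist

    # With the MPI backend, rank and world size are discovered from the MPI runtime.
    dist.init_process_group(backend="mpi")

    # Subgroups now work as well, which is the group support added here.
    group = dist.new_group(ranks=[0, 1])
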
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128

Differential Revision: D9602188

Pulled By: teng-li

fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
2018-08-31 12:39:56 -07:00
Teng Li
56539f5fe1 PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests (#10871)
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) The DDP CPU test is removed since c10d doesn't have this support yet, but it becomes a very easy test once DDP CPU's dependency is moved to torch.distributed.c10d.
(7) CI config to test the MPI, NCCL, and Gloo backends of c10d

**Now all the distributed tests, including c10d DDP, can pass with the c10d frontend API**

TODO (in a separate PR):
MPI subgroup support; once this is added, the CI group tests will be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871

Differential Revision: D9554514

Pulled By: teng-li

fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
2018-08-29 12:55:57 -07:00
Tongzhou Wang
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group`
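
A small sketch of the ordering the updated docs emphasize (one process per GPU; the LOCAL_RANK variable is illustrative, not part of this change):

    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # illustrative device index

    # Commit 1: torch.cuda.* now also accepts device objects, not just ints.
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)

    # Commit 2: set the device *before* initializing the process group.
    dist.init_process_group(backend="nccl", init_method="env://")
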
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00
Tongzhou Wang
db7b7f1359 fix typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10686

Differential Revision: D9399874

Pulled By: SsnL

fbshipit-source-id: 28130992d2416721552f72cfa835ff0358caeefa
2018-08-20 10:40:55 -07:00
Tongzhou Wang
3f603eeee8 some improvements on distributed docs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10666

Differential Revision: D9395242

Pulled By: SsnL

fbshipit-source-id: 952326b9c5a1a974a1c33a0e12738e1e21ad9956
2018-08-19 17:40:28 -07:00
Ailing Zhang
371a786b18 Errors out when Openmpi < 2.x.x with distributed. (#10015)
Summary:
This PR fixes #9418.
OpenMPI 1.10 segfaults in MPI_Bcast with a CUDA buffer, and it is a retired OpenMPI version.
I've tested 2.1.1 and 3.0.0 and they work well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10015

Reviewed By: soumith

Differential Revision: D9088103

Pulled By: ailzhang

fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
2018-07-31 12:24:40 -07:00
Teng Li
5b7951057d Distributed Data Parallel Module Implementation (#8584)
Summary:
This is an initial implementation of the Distributed Data Parallel module for the c10d GLOO and NCCL backends.

Performance testing confirms that both the single-GPU-per-process and multi-GPU-per-process configurations overlap communication with backward computation.

The idea is that DDP buckets the parameters and runs all-reduce in reverse bucket order. Since all c10d ops are asynchronous, no dedicated thread is needed; we simply queue the all-reduce kernels as each bucket becomes ready, following the deterministic reduction order.

Tested with 8 nodes (64 GPUs) on ResNet-50, hitting the required accuracy within 90 epochs.
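
For context, a hedged sketch of how the module is used (toy model and rank-to-device mapping are illustrative):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = dist.get_rank() % torch.cuda.device_count()   # illustrative mapping
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).to(torch.device("cuda", local_rank))  # toy model
    # DDP buckets the parameters and, during backward, queues an asynchronous
    # all-reduce per bucket in reverse bucket order, as described above.
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])

    out = ddp_model(torch.randn(4, 10, device=torch.device("cuda", local_rank)))
    out.sum().backward()   # gradient all-reduce overlaps with the backward pass
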
Closes https://github.com/pytorch/pytorch/pull/8584

Reviewed By: goldsborough

Differential Revision: D8678696

Pulled By: teng-li

fbshipit-source-id: 440341b804befc6762e92acece2759ba47157cea
2018-06-28 17:25:40 -07:00
Pieter Noordhuis
dc209ed963
[c10d] Rendezvous skeleton (#8294)
* [c10d] Rendezvous skeleton

The rendezvous function takes a URL and produces a triplet of a store,
a process rank, and the process group size.

For the file and TCP handlers, the rank and size must be specified, but
other handlers may discover these parameters dynamically.

It returns a generator function, such that if a rendezvous handler
supports rerendezvous, you can write:

for store, rank, size in c10d.rendezvous(...):
  pg = c10d.ProcessGroup(store, rank, size)
  while the process group is valid:
    # Do stuff with process group
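
A concrete sketch of calling the handler described above (module path and URL are illustrative; the commit message itself refers to the function as c10d.rendezvous):

    import torch.distributed as c10d   # assumed location of rendezvous()

    # For the tcp:// handler, rank and size must be given explicitly.
    url = "tcp://127.0.0.1:23456?rank=0&world_size=2"
    store, rank, world_size = next(c10d.rendezvous(url))
    # The other rank performs the same call with rank=1; the resulting store,
    # rank, and size are then used to construct a process group as sketched above.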

* Add Python 2 fallback for urlparse library

* Import X as Y

* Relative import seems to fix it

* Spelling

* Gate import on c10d availability
2018-06-13 15:27:32 -07:00
Pieter Noordhuis
695d40efc2
Create initial Python bindings for c10d (#8119)
* Build and install c10d from tools/build_pytorch_libs.sh

* Create initial Python bindings for c10d

* clang-format

* Switch link order to include more symbols

* Add bindings and tests for ProcessGroupGloo

* Add broadcast test

* Separate build flag for c10d

* Explicit PIC property

* Skip c10d tests if not available

* Remove c10d from Windows blacklist

Let it skip by itself because it won't be available anyway.

* Make lint happy

* Comments

* Move c10d module into torch.distributed

* Close tempfile such that it is deleted
2018-06-08 12:59:51 -07:00
Kaiyu Shi
db6e4576da Use customized python interpreter (#7520) 2018-05-12 13:06:39 -04:00
Edgar Andrés Margffoy Tuay
4ab6ea5b1f Add unbuffered flag to distributed node launcher (#7226) 2018-05-03 11:49:06 +02:00
Teng Li
f5beff334b Added distributed docs on NCCL2 backend/functions and launch module (#6579) 2018-04-15 21:53:10 -04:00
Teng Li
37059ba0ec Added torch.distributed.launch module for easier multi-proc/node distributed job launching (#5348) 2018-03-13 12:04:38 +01:00
Sam Gross
48a3349c29
Delete dead Tensor code paths (#5417)
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.

This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
2018-02-27 17:58:09 -05:00
Teng Li
5c65466b86 Release NCCL distributed backend from experimental (#4921)
* Release NCCL distributed backend from experimental

* fix typo
2018-01-30 16:21:21 +01:00
Teng Li
a3b098dcf9 Adding is process_group initialized support (#4618) 2018-01-12 22:56:54 +01:00
Christian Sarofeen
6db9f6dc78 Enable half communication for distributed (#4091) 2017-12-13 13:00:12 +01:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL determine the NCCL version

* Let the NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Caching the sockets, bug fix, refactoring, and addressed Adam's comments

* Make BcastNcclID take a single param and fix a bug in all_gather

* Removed the barrier function, added a warning for users, and stopped exposing experimental functions to users

* Use the simplest single-bucket working solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
2017-11-29 15:57:02 -05:00
Luca Antiga
af58bfbb1b Make integer parameters and buffers immune to float(), double() and half() (#3820)
* Avoid casting integer params and buffers to float(), double() and half()

* Add test for immune integer buffers

* Fix documentation for float(), double() and half()

* Fix test
2017-11-22 18:34:53 -05:00
Adam Paszke
2a8603c5e1 Make distributed recv return sender rank 2017-09-25 12:11:52 -04:00
Scott Sievert
dd27997aeb DOC: adding note about distributed MPI backend (#2750) 2017-09-15 13:47:35 -04:00
Zhou Mo
2c07f88ea3 Fix typos. 2017-08-25 14:27:07 -04:00
Gregory Chanan
50c208a50b Revert "Fix typos."
This reverts commit 4622b33952.
2017-08-10 13:57:00 -04:00
Zhou Mo
4622b33952 Fix typos. 2017-08-08 11:05:38 -04:00
Adam Paszke
575a4a98e0 Remove assertions with side effects 2017-07-20 01:45:57 -04:00
Adam Paszke
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
714351ff39 Officially enable process-group mode 2017-06-12 22:02:11 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00
Adam Paszke
5a0d5ec058 Add more checks in torch.distributed 2017-06-12 21:58:38 -04:00
Adam Paszke
c6c9e61169 Implement THD tensor copies 2017-06-02 23:42:11 +02:00
Janusz Marcinkiewicz
34804e9600 Refactor file and tcp init methods
* Add sanity checks
* Refactor InitMethodFile and TCPInitMethod into more logical functions
* Update a few error messages
* Allow passing parameters via **kwargs, so the order of parameters is no longer relevant
* Review comments
2017-06-02 23:42:11 +02:00
Janusz Marcinkiewicz
c41555fb0a Add rank parameter; Fix MW mode initialization 2017-06-02 23:42:11 +02:00
Janusz Marcinkiewicz
e685277299 Add address discovery; Bug fixes; 2017-06-02 23:42:11 +02:00
Janusz Marcinkiewicz
09c0d9c51c Add multiple initialization methods for DataChannels 2017-06-02 23:42:11 +02:00
Adam Paszke
a3e11d606b Fix linter errors 2017-01-31 01:58:09 +01:00
Adam Paszke
79232c24e2 Fixes after rebase 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
962084c8e8 Add Data Channel receive from any source (#52) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
76520512e7 DataChannel tests rewrite (#42); DataChannel isend and irecv implementation (#44) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
ac1f68127a Add barrier, scatter, gather and allGather implementations + groups (#34) 2017-01-31 01:58:09 +01:00
Adam Paszke
60d1852c7b Major improvements to master-worker mode
* Fixed all undefined symbol errors
* Implemented storage interface and THStorage class
* RPC improvements
* Code refactor
2017-01-31 01:58:09 +01:00
Adam Paszke
ea876eb6d5 Add initial bindings for master-worker mode 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
5e6fcd02b5 Implement data channel groups (#25) 2017-01-31 01:58:09 +01:00
Adam Paszke
55632d81d2 Add Python wrappers for process group mode 2017-01-31 01:58:09 +01:00