Summary:
A missing environment variable used to surface as a generic missing-key error. Now it
raises a more descriptive error that states the actual problem, for example:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
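For illustration, the validation involved looks roughly like the following sketch (the helper names are hypothetical, not the actual implementation; the environment variables are the standard ones the env:// handler expects):

    import os

    def _env_error(var):
        # Hypothetical helper: build the descriptive error shown above.
        return ValueError(
            "Error initializing torch.distributed using env:// rendezvous: "
            "environment variable %s expected, but not set" % var
        )

    def _get_env_or_raise(var):
        value = os.environ.get(var)
        if value is None:
            raise _env_error(var)
        return value

    # Parameters required by the env:// rendezvous handler.
    rank = int(_get_env_or_raise("RANK"))
    world_size = int(_get_env_or_raise("WORLD_SIZE"))
    master_addr = _get_env_or_raise("MASTER_ADDR")
    master_port = int(_get_env_or_raise("MASTER_PORT"))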
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782
Differential Revision: D9888962
Pulled By: pietern
fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.
cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715
Reviewed By: pietern
Differential Revision: D9889646
Pulled By: SsnL
fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
Summary:
The old `torch.distributed` moves to `torch.distributed.deprecated`.
The old DDP moves to `torch.nn.parallel.deprecated`.
`torch.nn.parallel.DDP` now uses the c10d DDP.
`torch.distributed` now uses the c10d frontend API.
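As a rough sketch of what the renaming means for user code (single-process gloo setup, for illustration only):

    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel  # now backed by c10d

    # The familiar frontend API, now implemented on top of c10d.
    dist.init_process_group(backend="gloo",
                            init_method="tcp://127.0.0.1:23456",
                            rank=0, world_size=1)

    # The old implementations remain importable under the deprecated names:
    #   import torch.distributed.deprecated
    #   from torch.nn.parallel.deprecated import DistributedDataParallel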
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
Summary:
Fixed a few previously untested bugs in the c10d frontend APIs, including
get_rank, get_world_size, and destroy_process_group for a given group.
These APIs are now covered by the CI tests.
Also added all the group-related tests, including full groups and partial groups (the existing ones), since the two hit different code paths.
Also removed the experimental c10d APIs initially used in DDP, since they are no longer used.
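For reference, a minimal sketch of the group-aware calls these tests exercise (assuming an env:// gloo job with at least two processes):

    import torch.distributed as dist

    dist.init_process_group(backend="gloo", init_method="env://")

    # Partial group (ranks 0 and 1) and full group hit different code paths.
    partial = dist.new_group(ranks=[0, 1])
    full = dist.new_group(ranks=list(range(dist.get_world_size())))

    if dist.get_rank() < 2:
        # Rank and world size relative to the given group.
        print(dist.get_rank(partial), dist.get_world_size(partial))

    dist.destroy_process_group(partial)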
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11318
Reviewed By: pietern
Differential Revision: D9675896
Pulled By: teng-li
fbshipit-source-id: a2eac2c57933effa2d139855f786e64919a95bfc
Summary:
Added MPI group support.
This makes all the previous MPI group test cases pass.
Also relaxed the required MPI thread support level by serializing the MPI ops issued by different process groups; this is required.
The build is fixed as well.
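The serialization happens inside the MPI process group; conceptually it amounts to something like the following Python sketch (illustrative only, not the actual implementation):

    import threading

    # One global lock shared by every MPI process group, so MPI calls issued
    # by different process groups never run concurrently and a serialized
    # MPI thread support level is sufficient.
    _mpi_global_lock = threading.Lock()

    def _run_mpi_op(op, *args, **kwargs):
        # Hypothetical helper: every MPI operation is funneled through here.
        with _mpi_global_lock:
            return op(*args, **kwargs)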
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128
Differential Revision: D9602188
Pulled By: teng-li
fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
Summary:
The PR includes:
(1) `torch.distributed.c10d`, which now includes the complete backward-compatible frontend API for `torch.distributed`
(2) `env://` init method functionality (see the sketch at the end of this summary)
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) The DDP CPU test is removed since c10d doesn't have this support yet, but it will be a very easy test to add back once DDP CPU's dependency moves to `torch.distributed.c10d`.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d
**Now all the distributed tests, including c10d DDP, pass with the c10d frontend API**
TODO (in a separate PR):
MPI subgroup support; once this is added, the CI group tests will be enabled.
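A minimal sketch of the `env://` init method from (2), assuming a single-process gloo job with the environment set inline rather than by a launcher:

    import os
    import torch.distributed as dist

    # env:// reads the rendezvous info from environment variables,
    # normally provided by the job launcher.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    os.environ["RANK"] = "0"
    os.environ["WORLD_SIZE"] = "1"

    dist.init_process_group(backend="gloo", init_method="env://")
    assert dist.get_rank() == 0 and dist.get_world_size() == 1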
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
Summary:
This PR fixes #9418.
OpenMPI 1.10 segfaults in MPI_Bcast with CUDA buffers, and it is a retired OpenMPI version.
I've tested 2.1.1 and 3.0.0 and they work well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10015
Reviewed By: soumith
Differential Revision: D9088103
Pulled By: ailzhang
fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
Summary:
This is an initial implementation of the Distributed Data Parallel module for the c10d Gloo and NCCL backends.
Performance testing has confirmed that both single-GPU-per-process and multi-GPU-per-process setups overlap communication with backward computation.
The idea is that DDP buckets the parameters and all-reduces the buckets in reverse order. Since all c10d ops are async, no dedicated reduction thread is needed; we simply queue the all-reduce kernels once a bucket is ready, following the deterministic reduction order.
Tested with 8 nodes / 64 GPUs on ResNet-50; it hit the required accuracy within 90 epochs.
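A much-simplified sketch of the bucketing idea, written against today's `torch.distributed` API with hypothetical helper names (the real module drives this from autograd hooks and handles multi-GPU replicas):

    import torch
    import torch.distributed as dist

    def bucket_parameters(params, bucket_bytes=25 * 1024 * 1024):
        # Hypothetical helper: walk parameters in reverse so gradients that
        # become ready first (the last layers) land in the earliest buckets.
        buckets, current, size = [], [], 0
        for p in reversed(list(params)):
            current.append(p)
            size += p.numel() * p.element_size()
            if size >= bucket_bytes:
                buckets.append(current)
                current, size = [], 0
        if current:
            buckets.append(current)
        return buckets

    def allreduce_gradients(buckets):
        # c10d collectives are async: queue one all-reduce per bucket, then
        # wait and scatter the averaged gradients back.
        pending = []
        for bucket in buckets:
            flat = torch.cat([p.grad.view(-1) for p in bucket])
            pending.append((dist.all_reduce(flat, async_op=True), bucket, flat))
        for work, bucket, flat in pending:
            work.wait()
            flat.div_(dist.get_world_size())
            offset = 0
            for p in bucket:
                n = p.numel()
                p.grad.copy_(flat[offset:offset + n].view_as(p.grad))
                offset += n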
Closes https://github.com/pytorch/pytorch/pull/8584
Reviewed By: goldsborough
Differential Revision: D8678696
Pulled By: teng-li
fbshipit-source-id: 440341b804befc6762e92acece2759ba47157cea
* [c10d] Rendezvous skeleton
The rendezvous function takes a URL and produces a triplet of a store,
a process rank, and the process group size.
For the file and TCP handlers, the rank and size must be specified, but
other handlers may discover these parameters dynamically.
It returns a generator, such that if a rendezvous handler
supports re-rendezvous, you can write:
    for store, rank, size in c10d.rendezvous(...):
        pg = c10d.ProcessGroup(store, rank, size)
        # Do stuff with the process group while it remains valid
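For example, a file-based rendezvous for a two-process job might look like this (the query parameters follow from the requirement above that rank and size be specified explicitly for the file handler):

    # Rank and world size must be given explicitly for the file handler.
    url = "file:///tmp/rendezvous_file?rank=0&world_size=2"
    for store, rank, size in c10d.rendezvous(url):
        pg = c10d.ProcessGroupGloo(store, rank, size)
        # ... use pg; iterate again only if the handler supports re-rendezvous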
* Add Python 2 fallback for urlparse library
* Import X as Y
* Relative import seems to fix it
* Spelling
* Gate import on c10d availability
* Build and install c10d from tools/build_pytorch_libs.sh
* Create initial Python bindings for c10d
* clang-format
* Switch link order to include more symbols
* Add bindings and tests for ProcessGroupGloo
* Add broadcast test
* Separate build flag for c10d
* Explicit PIC property
* Skip c10d tests if not available
* Remove c10d from Windows blacklist
Let it skip by itself because it won't be available anyway.
* Make lint happy
* Comments
* Move c10d module into torch.distributed
* Close tempfile such that it is deleted
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.
This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL determine the NCCL version
* Let the NCCL2 backend use ATen instead of the deprecated THPP
* Let the distributed parallel model use a single reduction thread for the NCCL backend
* Cached the sockets, fixed a bug, refactored, and addressed Adam's comments
* Made BcastNcclID take a single param and fixed a bug in all_gather
* Removed the barrier function, added a warning for users, and stopped exposing the experimental func to users
* Used the simplest single-bucket working solution for the distributed data parallel model, with rebase
* Cleanup, fixes, and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
* Avoid casting integer params and buffers to float(), double() and half() (see the sketch after these notes)
* Add test for immune integer buffers
* Fix documentation for float(), double() and half()
* Fix test
* Add sanity checks
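A small sketch of the intended behavior, using a BatchNorm module whose `num_batches_tracked` buffer is integral:

    import torch
    import torch.nn as nn

    m = nn.BatchNorm1d(4)
    m.double()

    # Floating-point parameters and buffers are converted...
    assert m.weight.dtype == torch.float64
    assert m.running_mean.dtype == torch.float64
    # ...while integer buffers keep their original dtype.
    assert m.num_batches_tracked.dtype == torch.int64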
* Refactor InitMethodFile and TCPInitMethod into more logical functions
* Update a few error messages
* Add passing parameters by `**kwargs`, so the order of parameters is no longer relevant
* Review comments