Commit Graph

14 Commits

Author SHA1 Message Date
Min Si
1ad0048b64 Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the hipified header files are generated in other directories. Moreover, using absolute paths for header inclusion is already the convention in most PyTorch components. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
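
For illustration, a minimal before/after of the include style this refactor targets, using Store.hpp as an example:

```cpp
// Before: path relative to torch/csrc/distributed, which breaks when hipified
// copies of the headers are generated in other directories.
#include <c10d/Store.hpp>

// After: absolute path from the repository root.
#include <torch/csrc/distributed/c10d/Store.hpp>
```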

See D39835774 for more details about the Meta-internal build complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options, so use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
PyTorch MergeBot
a50d8864fc Revert "Refactor distributed to use absolute header path (#85780)"
This reverts commit 668082718a.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to: Sorry for reverting your PR, but it breaks the build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
Min Si
668082718a Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the hipified header files are generated in other directories. Moreover, using absolute paths for header inclusion is already the convention in most PyTorch components. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about the Meta-internal build complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options, so use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
Howard Huang
74ead61944 [2/N] [Dispatchable Collectives] Extract ProcessGroup::Work into a separate class and update references (#83680)
### Changes
- Move ProcessGroup::Work into its own class and update all references to it and the corresponding header includes.

#### Motivation
In future PRs, we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and dispatch to them based on tensor type. This change prevents a circular dependency in which ProcessGroup depends on Backend and Backend depends on ProcessGroup::Work.
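
A rough sketch of the layering this extraction enables (simplified names and standard-library types for illustration only; the real code uses c10::intrusive_ptr and lives under torch/csrc/distributed/c10d):

```cpp
#include <memory>
#include <vector>

namespace sketch {

// Formerly nested as ProcessGroup::Work; as a standalone class it can be
// included by backends without pulling in ProcessGroup.
class Work {
 public:
  virtual ~Work() = default;
  virtual bool wait() = 0;
};

// Backend implementations (NCCL/Gloo/UCC) depend only on Work.
class Backend {
 public:
  virtual ~Backend() = default;
  virtual std::shared_ptr<Work> allreduce(/* tensors, options */) = 0;
};

// ProcessGroup can now hold Backends and dispatch to them (e.g. by tensor
// device type) without creating a ProcessGroup <-> Backend cycle.
class ProcessGroup {
 public:
  std::vector<std::shared_ptr<Backend>> backends_;
};

}  // namespace sketch
```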

Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
2022-09-14 13:05:58 +00:00
Rodrigo Kumpera
f6a45f7984 [distributed] Make DDP work with python process group (#79176)
This PR enables Python process group usage with DDP by doing the following:

- Surface PG::Work::getFuture() as overridable.
- Use Work::getFuture() to retrieve values from a PG.
- Add a _create_work_from_future Python method that creates a Work object wrapping a Future.

To test these changes, we use both strategies to run DDP with a Python-based PG.

The reason for offering two methods is that each has shortcomings.

The wrapper method is harder to troubleshoot, as there is no visibility into how the future is used.

The subclass method has memory management issues, as can be seen in the test suite, where Work instances must be kept alive by storing them in PG fields.
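
A conceptual C++-style sketch of the "Work wraps a Future" idea (not the PyTorch implementation, which exposes _create_work_from_future to Python; this only shows the wrapping pattern with std::future):

```cpp
#include <future>
#include <utility>

// A Work-like handle whose completion is defined entirely by a wrapped future.
class FutureBackedWork {
 public:
  explicit FutureBackedWork(std::future<void> fut) : fut_(std::move(fut)) {}

  // Waiting on the work is just waiting on the underlying future, so any
  // component that can produce a future can hand back a usable work handle.
  void wait() { fut_.wait(); }

 private:
  std::future<void> fut_;
};
```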

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79176
Approved by: https://github.com/rohan-varma
2022-06-28 17:14:21 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Scott Wolchok
52af4fc5ba [PyTorch] Make RecordFunction store inputs as ArrayRef (#72484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72484

Stepping stone toward stack-allocating the array of inputs.

Funnily enough, this seems to improve performance too.
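
A hedged sketch of why an ArrayRef-style parameter helps here (illustrative signature, not the actual RecordFunction API):

```cpp
#include <ATen/core/ivalue.h>
#include <c10/util/ArrayRef.h>
#include <array>

// Accepting a non-owning view means callers can pass inputs that live on the
// stack (e.g. a small std::array) without first materializing a heap-allocated
// std::vector<c10::IValue>.
void record_inputs(c10::ArrayRef<c10::IValue> inputs) {
  for (const auto& input : inputs) {
    (void)input;  // e.g. store or inspect the input for profiling
  }
}

void caller() {
  std::array<c10::IValue, 2> inputs{c10::IValue(1), c10::IValue(2.0)};
  record_inputs(inputs);  // ArrayRef is constructible from contiguous containers
}
```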
ghstack-source-id: 155492056

Test Plan:
1) CI
2) Framework overhead benchmark with --stressTestRecordFunction --captureRecordFunctionInputs goes from 0.76 to 0.72 usec/iter.

Reviewed By: chaekit, robieta

Differential Revision: D34061169

fbshipit-source-id: 073fedf1d3d162f927c4e9867cfda7dbfabba215
(cherry picked from commit dae77cf1cd8813d902d73999ad97133a3ef8e291)
2022-05-05 21:38:42 +00:00
Ke Wen
d6f22abbcc [PyTorch Distributed] Fix batch_isend_irecv
Summary:
`batch_isend_irecv` previously worked only for two-rank cases; otherwise it
would hang, e.g. pytorch/pytorch#73960. This diff extends `batch_isend_irecv`
to support more than two ranks. The fix treats the operation more like a
collective than a two-rank P2P when selecting the communicator, since more
ranks can participate in the batch call than just "my" rank and "my" peer (see the sketch after the rules below).

Rules:
- If `batch_isend_irecv` is the first collective call (including collectives and
  all-to-all) in the `group` given as the argument, then all ranks of the
  `group` are expected to participate in this call.
- Otherwise, if it is not the first collective call in the `group` (i.e., the
  communicator has been initialized), then batched P2P communication involving
  only a subset of the processes of the `group` is allowed.
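
A hypothetical sketch of the communicator-selection rule described above (simplified; not the actual ProcessGroupNCCL code, and the key formats are assumptions):

```cpp
#include <algorithm>
#include <string>

// Pick the key used to cache/create the communicator for an operation.
std::string commKey(bool isBatchP2P, int myRank, int peerRank, const std::string& deviceKey) {
  if (isBatchP2P) {
    // Batched P2P behaves like a collective: every participating rank in the
    // group must agree on (and help initialize) the same communicator, so use
    // the collective-style device key rather than a rank-pair key.
    return deviceKey;
  }
  // Plain two-rank P2P: key on the ordered pair of ranks.
  const int lo = std::min(myRank, peerRank);
  const int hi = std::max(myRank, peerRank);
  return std::to_string(lo) + ":" + std::to_string(hi);
}
```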

Test Plan:
Added p2p_tests.py testing the following patterns:
+    sendrecv_neighbor(input, output)       # Ring like neighbor exchange
+    sendrecv_ripple(input, output)         # Exchange with growing distance (pytorch/pytorch#73960)
+    sendrecv_P2P(input, output)            # Single P2P operation
+    isendrecv_P2P(input, output)           # Single non-blocking P2P operation
+    isendrecv_P2P_batch(input, output, 0)  # batched P2P between only two ranks
+    isendrecv_P2P_batch(input, output, 1)  # batched P2P within a new group created for two ranks

Differential Revision: D35122664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74701
Approved by: https://github.com/mingzhe09088, https://github.com/osalpekar
2022-04-13 05:55:00 +00:00
Pritam Damania
3b30c8da88 Add logging for ProcessGroup backends. (#73702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73702

This PR adds logging to track usage of ProcessGroup backends. The
change involves adding an `init()` method to the ProcessGroup base class and
modifying all the subclasses to call it.

We can't add logging directly in the ProcessGroup constructor since
`getBackendName()` is pure virtual and can't be called from the base class
constructor. See
https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctors for
more details.
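
A minimal sketch of this constraint and the init() workaround (simplified names, not the actual c10d classes):

```cpp
#include <iostream>
#include <string>

class ProcessGroupBase {
 public:
  // Logging cannot go here: getBackendName() is pure virtual, and during the
  // base-class constructor the derived part of the object does not exist yet.
  ProcessGroupBase() = default;
  virtual ~ProcessGroupBase() = default;

  virtual std::string getBackendName() const = 0;

 protected:
  // Subclasses call this once their own construction is far enough along for
  // virtual dispatch to reach their override.
  void init() {
    std::cout << "ProcessGroup created, backend=" << getBackendName() << "\n";
  }
};

class ProcessGroupGloo final : public ProcessGroupBase {
 public:
  ProcessGroupGloo() { init(); }  // safe: dispatches to ProcessGroupGloo::getBackendName
  std::string getBackendName() const override { return "gloo"; }
};
```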
ghstack-source-id: 150381852

Test Plan: check logging

Reviewed By: cbalioglu

Differential Revision: D34596295

fbshipit-source-id: 5baa7bfb53bdcd1b4032292c4e7715467f983288
(cherry picked from commit b2344144fe0c25db02c2e958f2acd772e9cd4dc8)
2022-03-08 17:45:42 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
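
A hedged sketch of how a TORCH_DISTRIBUTED_DEBUG-style level can be resolved from the environment (illustrative only; not the actual c10d implementation):

```cpp
#include <cstdlib>
#include <string>

enum class DebugLevel { Off, Info, Detail };

// Map the TORCH_DISTRIBUTED_DEBUG environment variable (OFF/INFO/DETAIL,
// defaulting to OFF) onto a debug level that a set_debug_level_from_env()-style
// API could install at runtime.
DebugLevel debug_level_from_env() {
  const char* env = std::getenv("TORCH_DISTRIBUTED_DEBUG");
  const std::string value = env ? env : "OFF";
  if (value == "DETAIL") return DebugLevel::Detail;
  if (value == "INFO") return DebugLevel::Info;
  return DebugLevel::Off;
}
```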
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Nolan O'Brien
0fdb90da5e [warning] Fix TORCH_INTERNAL_ASSERT calls missing condition to check 1/x (#71711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71711

This fixes a large number of broken asserts that should fire but never actually do.
All of them would have been caught with `-Wstring-conversion` warnings enabled.
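
The call-site pattern being fixed, shown with the real TORCH_INTERNAL_ASSERT macro (the surrounding function and messages are illustrative):

```cpp
#include <c10/util/Exception.h>

void handle_unexpected_branch() {
  // Bug pattern: the message string is passed as the condition. A string
  // literal converts to a non-null pointer, i.e. "true", so this assert can
  // never fire; -Wstring-conversion flags exactly this conversion.
  TORCH_INTERNAL_ASSERT("this branch should be unreachable");

  // Fixed pattern: an explicit condition plus the message.
  TORCH_INTERNAL_ASSERT(false, "this branch should be unreachable");
}
```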

Test Plan: CI Pass

Differential Revision: D33743605

fbshipit-source-id: 062641f9d5d02c6e317c5a286fd01017cf77237f
(cherry picked from commit 639b42e04b)
2022-01-25 15:45:21 +00:00
Nikita Shulga
a5ad2cdab5 Cleanup ProcessGroup.cpp (#69706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69706

Mostly code modernization; also, do not capture the unused `this` in the
end_handler functor.
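
A tiny illustration of the capture cleanup mentioned above (hypothetical code, not the actual ProcessGroup.cpp lambda):

```cpp
#include <functional>

struct Cleanup {
  std::function<void()> make_end_handler() {
    // Before: [this] was captured even though the body never touched members,
    // needlessly tying the callback's validity to this object's lifetime.
    // After: capture nothing.
    return []() { /* finalization that needs no object state */ };
  }
};
```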

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32997009

Pulled By: malfet

fbshipit-source-id: ac907f0c6889ad06d4fb0171964cb05133e5e610
2021-12-10 09:11:39 -08:00
Shen Li
facff2ec65 Update ProcessGroup collective C++ APIs to be non-pure virtual functions (#64943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64943

Most ProcessGroup collective APIs are pure virtual. As a result, c10d extensions need to override all of them and throw an error for the APIs they don't need. This is too verbose for users. This commit changes those collective APIs to plain virtual functions that throw an error by default. Note that ProcessGroup is still an abstract class, as `getBackendName` is a pure virtual function that all subclasses must override.
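
A hedged sketch of the new default behavior (simplified signatures with standard-library types; the real class returns c10::intrusive_ptr<Work> and lives under torch/csrc/distributed/c10d):

```cpp
#include <memory>
#include <stdexcept>
#include <string>

class Work;

class ProcessGroup {
 public:
  virtual ~ProcessGroup() = default;

  // Still pure virtual, so ProcessGroup remains an abstract class.
  virtual std::string getBackendName() const = 0;

  // Collectives are now plain virtual with a throwing default, so a c10d
  // extension only overrides the operations it actually supports.
  virtual std::shared_ptr<Work> allreduce(/* tensors, options */) {
    throw std::runtime_error(getBackendName() + " does not support allreduce");
  }
};
```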

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D30906866

Pulled By: mrshenli

fbshipit-source-id: c4df8962d60350a44d2df72fd04f9dd6eadb9fa6
2021-09-26 19:19:43 -07:00
Luca Wehrstedt
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00