Commit Graph

49 Commits

Author SHA1 Message Date
Sheng Fu
71d2827eeb Code Refactoring for getting start and stride from global ranks (#147230)
Summary: Code Refactoring for getting start and stride from global ranks, this function can be used in different collective backend.

Differential Revision: D69555405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147230
Approved by: https://github.com/kwen2501
2025-02-21 10:02:50 +00:00
cyyever
ef28df5c9e [Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593)
Reland of #137843 , after checking the code again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140593
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-28 20:51:49 +00:00
cyy
00b3b61076 Add and use thread-safe strerror (#140472)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140472
Approved by: https://github.com/ezyang
2024-11-19 04:24:17 +00:00
PyTorch MergeBot
a58a565819 Revert "[Environment Variable][6/N] Use thread-safe getenv functions (#140200)"
This reverts commit 7d4f5f7508.

Reverted https://github.com/pytorch/pytorch/pull/140200 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140200#issuecomment-2473956859))
2024-11-13 15:33:23 +00:00
PyTorch MergeBot
c6a29fc3d8 Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 82eb09aafd.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2473709760))
2024-11-13 14:06:52 +00:00
cyy
7d4f5f7508 [Environment Variable][6/N] Use thread-safe getenv functions (#140200)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/ezyang
2024-11-09 15:05:51 +00:00
cyy
82eb09aafd [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-21 02:58:59 +00:00
PyTorch MergeBot
a1899b5a9e Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 239ad73cb1.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/yf225 due to Sorry for reverting your PR but I believe this PR breaks the binary builds. Example: https://ossci-raw-job-status.s3.amazonaws.com/log/31790258895, with error message: `getenv is not a member of c10::utils`, might be easier to search for `not a member of` in the log ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2425192780))
2024-10-20 19:48:14 +00:00
cyy
239ad73cb1 [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-20 13:05:04 +00:00
Ke Wen
3645634f3c [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels.
Also so that on visualizers, they show up in the NCCL stream line. Otherwise if they show up in the compute line, user may get confused (my code does not have these kernels).

The check is thus moved after the point where we depend NCCL stream from the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 08:28:49 +00:00
PyTorch MergeBot
cbf5ba1e97 Revert "[1/N] Move NaN check onto NCCL stream (#134300)"
This reverts commit 94caba4899.

Reverted https://github.com/pytorch/pytorch/pull/134300 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
Ke Wen
94caba4899 [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels.
Also so that on visualizers, they show up in the NCCL stream line. Otherwise if they show up in the compute line, user may get confused (my code does not have these kernels).

The check is thus moved after the point where we depend NCCL stream from the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-27 16:02:27 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
Tristan Rice
cd70ac884f c10d/Utils: better error message on 0 bytes (#130056)
This improves the error messages on 0 bytes sent/received. We currently log it as a connection reset when it's caused by other reasons.

Test plan:

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130056
Approved by: https://github.com/kurman, https://github.com/rsdcastro
2024-07-04 00:48:20 +00:00
PyTorch MergeBot
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
Rohan Varma
2cb6f20867 Warn env vars only once during program (#127046)
This avoids logs being excessively noisy in some training runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127046
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-30 19:10:53 +00:00
Shuqiang Zhang
c860df5a9d [c10d] Add an option for NAN check on every collective (#125726)
Summary:
The NAN CHECK is done through device side assert without copying needed
from GPU to CPU
Test Plan:
Unit test for collectives that should experience run time error

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

.
----------------------------------------------------------------------
Ran 1 test in 7.723s

OK

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726
Approved by: https://github.com/kwen2501
2024-05-16 04:35:15 +00:00
PyTorch MergeBot
20aa7cc678 Revert "[c10d] Add an option for NAN check on every collective (#125726)"
This reverts commit 6db3271007.

Reverted https://github.com/pytorch/pytorch/pull/125726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new test is failing on both multigpu and rocm distributed, i.e. c712b0f8a3 ([comment](https://github.com/pytorch/pytorch/pull/125726#issuecomment-2110646075))
2024-05-14 16:26:34 +00:00
Shuqiang Zhang
6db3271007 [c10d] Add an option for NAN check on every collective (#125726)
Summary:
The NAN CHECK is done through device side assert without copying needed
from GPU to CPU
Test Plan:
Unit test for collectives that should experience run time error

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

.
----------------------------------------------------------------------
Ran 1 test in 7.723s

OK

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726
Approved by: https://github.com/kwen2501
2024-05-14 00:05:41 +00:00
cyy
4457cd9a30 [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-05-11 00:03:52 +00:00
PyTorch MergeBot
724c7491d0 Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)"
This reverts commit b3fd94d15e.

Reverted https://github.com/pytorch/pytorch/pull/124987 on behalf of https://github.com/ezyang due to broke downstream extensions ([comment](https://github.com/pytorch/pytorch/pull/124987#issuecomment-2083956511))
2024-04-30 00:37:53 +00:00
cyy
b3fd94d15e [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, libfmt dependency is added in CMake code to enable using it in the headers. The libfmt has to be added as private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp which uses libfmt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-04-27 07:22:27 +00:00
cyy
1ac402a96c [Distributed] [6/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124701)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124043.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124701
Approved by: https://github.com/ezyang
2024-04-25 11:39:23 +00:00
cyy
ea61c9cb29 [Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043
Approved by: https://github.com/ezyang
2024-04-23 00:43:50 +00:00
Richard Barnes
98e5238ad8 [codemod][lowrisk] Remove unused exception parameter from caffe2/caffe2/image/image_input_op.h (#123056)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D55548497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123056
Approved by: https://github.com/Skylion007
2024-04-04 17:24:43 +00:00
fduwjj
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG = DETAIL` to turn a debug feature which calculate the hashing for input tensors and output results of c10d collective in NCCL.  This is a debugging feature so that we can rule out the bug from c10d level.

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
Pavan Balaji
00ae299016 [c10d] Remove unused function (#114341)
Summary: As the title suggests

Test Plan: OSS CI

Differential Revision: D51386619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114341
Approved by: https://github.com/Skylion007
2023-11-22 17:31:20 +00:00
Pavan Balaji
958f3b0df6 [nccl-pg] Migrate to getCvar* functions for env variable checking (#113797)
Summary:
The getCvar* functions allow us to provide multiple environment variables for the same value.  This allows us to deprecate some variables in favor of others, while still allowing users to temporarily use the old variables for some time.

Test Plan: OSS CI

Reviewed By: fduwjj, XilunWu

Differential Revision: D51225487

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113797
Approved by: https://github.com/fduwjj
2023-11-19 03:48:58 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
Shen Li
dd6319198d Apply clang-format to distributed/c10d folder (#107140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107140
Approved by: https://github.com/H-Huang
2023-08-14 23:16:38 +00:00
Rohan Varma
e4c8c75583 [PG NCCL] Add TDD, NCCL_DEBUG log (#97692)
Prints these env var setting during setup for easier debug.

Differential Revision: [D44430875](https://our.internmc.facebook.com/intern/diff/D44430875/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97692
Approved by: https://github.com/kumpera
2023-04-06 20:37:46 +00:00
Wanchao Liang
c9b1e09958 [c10d] delete lengths offset checks (#98368)
According to @kwen2501, NCCL supports up to size_t send counts, so
PGNCCL shouldn't restrict it

A follow up is to think about whether we should add overflow protection
of offset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98368
Approved by: https://github.com/kwen2501
2023-04-05 21:06:49 +00:00
Aaron Gokaslan
0247ed27cc Apply Clang-Tidy readability-container-size-empty (#93236)
Not only is this change usually shorter and more readable, it also can yield better performance. size() is not always a constant time operation (such as on LinkedLists), but empty() always is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
2023-01-29 23:28:19 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use dispatcher to calls backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` how returns a generic `ProcessGroup` object, it contains a list of backends (the ones stated above) which it will dispatch operations to.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Min Si
1ad0048b64 Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
PyTorch MergeBot
a50d8864fc Revert "Refactor distribuetd to use absolute header path (#85780)"
This reverts commit 668082718a.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
Min Si
668082718a Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
Ke Wen
3276b51243 Add environment parse function that supports default value (#85563)
We use "-2" to represent an unset environment variable.
Now adding a util function to attach default value if environment variable is unset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85563
Approved by: https://github.com/rohan-varma, https://github.com/H-Huang
2022-09-28 02:56:50 +00:00
PyTorch MergeBot
b360d66391 Revert "Add environment parse function that supports default value (#85563)"
This reverts commit 784f4ba1ce.

Reverted https://github.com/pytorch/pytorch/pull/85563 on behalf of https://github.com/huydhn due to Fail test_DistributedDataParallel (main.TestDistBackendWithSpawn)
2022-09-27 02:55:59 +00:00
Ke Wen
784f4ba1ce Add environment parse function that supports default value (#85563)
We use "-2" to represent an unset environment variable.
Now adding a util function to attach default value if environment variable is unset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85563
Approved by: https://github.com/rohan-varma, https://github.com/H-Huang
2022-09-27 00:34:50 +00:00
Ke Wen
c4e0c927e3 [c10d] Add a soft error handling mode (#84386)
Adding new value "2" to env `NCCL_ASYNC_ERROR_HANDLING` standing for a "CleanUpOnly" error handling mode.
Comparing to `NCCL_ASYNC_ERROR_HANDLING=1`, the "CleanUpOnly" mode will just abort the collectives and NCCL communicators, and will not tear down the process.
User will have the chance to query the state of the process group (in a later PR) and abort the process group (in a later PR), and re-create the process group if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84386
Approved by: https://github.com/rohan-varma
2022-09-07 04:48:02 +00:00
Nikita Shulga
ef56497ea0 Fix sign-compare in c10d/Utils.hpp
Prerequisite change for enabling `-Werror=sign-compare` across PyTorch repo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75081

Approved by: https://github.com/atalman
2022-04-04 23:05:31 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Amir Khojaste
748790588c Upgrading the loop to use irange (#70326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70326

See D24145988 for context: it allows loops such as for(int i=0;i<10;i++) to be expressed as for(const auto i : c10::irange(10)). This is nice because it auto-types the loops and adds const-safety to the iteration variable.

Test Plan: buck run //caffe2/torch/fb/sparsenn:test

Reviewed By: r-barnes

Differential Revision: D33243400

fbshipit-source-id: b1f1b4163f4bf662031baea9e5268459b40c69a3
2022-01-06 07:06:53 -08:00
Can Balioglu
6e640a0acf Revise the socket implementation of c10d (#68226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226

**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**

This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:

 - significantly better error handling and much more verbose logging (see the example output below)
 - explicit support for IPv6 and dual-stack sockets
 - correct handling of signal interrupts
 - better Windows support

A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.

## Example Output

```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987

Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.

Reviewed By: Babar, wilson100hong, mrshenli

Differential Revision: D32372333

fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
2021-11-16 20:49:25 -08:00
Luca Wehrstedt
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Since now c10d is part of libtorch, it would also be nice if the sources lived all in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00