Commit Graph

20 Commits

Author SHA1 Message Date
fduwjj
f663aa4e81 [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987)
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.

Added a unit test to test this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-10 18:48:57 +00:00
Richard Barnes
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
cyy
94e12f97dc [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404)
Follows #137072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404
Approved by: https://github.com/Skylion007
2024-10-10 18:05:34 +00:00
fduwjj
911a43f930 [TCPStore] Remove deprecated constructor (#136004)
While looking at TCPStore code again and found it confusing that we still keep the deprecated constructor for TCPStore in cpp while we don't expose it in python via pybind already. I checked both internal and external, all use cases in cpp (aside from unit test fixed in this PR) already moved to using option. So let's remove this legacy constructor to avoid confusion.

Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-09-14 04:25:47 +00:00
Chirag Pandya
40767e8468 [BE] rename testHelperPrefix test (#132916)
Summary:
Re-enable testHelperPrefix test that was erroneously disabled in CI.
Fixes #50701

Test Plan:
Test passes locally:
```
❯ ./TCPStoreTest --gtest_filter=TCPStoreTest.testHelperPrefix
Running main() from
/data/users/cpio/pytorch/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TCPStoreTest.testHelperPrefix
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TCPStoreTest
[ RUN      ] TCPStoreTest.testHelperPrefix
[W807 12:01:31.531576727 socket.cpp:462] [c10d] waitForInput: poll for
socket SocketImpl(fd=6, addr=[localhost]:37984,
remote=[localhost]:37171) returned 0, likely a timeout
[W807 12:01:31.531663710 socket.cpp:487] [c10d] waitForInput: socket
SocketImpl(fd=6, addr=[localhost]:37984, remote=[localhost]:37171) timed
out after 100ms
[       OK ] TCPStoreTest.testHelperPrefix (314 ms)
[----------] 1 test from TCPStoreTest (314 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (314 ms total)
[  PASSED  ] 1 test.
╭─ ~/local/pytorch/build/bin  main *1 +1 ···················· ✔
/home/cpio/local/a/pytorch-env   cpio@devgpu011 ─╮
╰─
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132916
Approved by: https://github.com/Skylion007
2024-08-08 20:54:52 +00:00
cyy
28f6ae2718 [9/N] Replace c10::optional with std::optional (#130674)
Follows  #130509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674
Approved by: https://github.com/Skylion007
2024-07-15 00:48:43 +00:00
Xilun Wu
0b0b9b3275 [c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141)
**Test Plan**
1. build pytorch
2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory.

without the change:
<img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa">

with the change:
<img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116141
Approved by: https://github.com/wconstab
2023-12-20 18:37:55 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
Rodrigo Kumpera
fe2cda64dc [C10D] Implement new libuv backend for TCPStore. (#108066)
The new backend is currently under a flag 'use_libuv' in TCPStore constructor
to reduce the impact on existing users as we test it.

This is a reland of #105870 with a fix for a bad test.

Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2023-08-29 14:55:14 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
PyTorch MergeBot
d3f92ca9e9 Revert "[C10D] Implement new libuv backend for TCPStore. (#105870)"
This reverts commit 3c841163ce.

Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))
2023-08-17 23:41:00 +00:00
Rodrigo Kumpera
3c841163ce [C10D] Implement new libuv backend for TCPStore. (#105870)
The new backend is currently under a flag 'use_libuv' in TCPStore constructor
to reduce the impact on existing users as we test it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870
Approved by: https://github.com/H-Huang
2023-08-17 20:40:32 +00:00
Rodrigo Kumpera
174b0c22cb [C10D] Remove watchKey functionality from the Store. (#105014)
The feature was never fully finished and never got any adoption but
TCPStore pays the cost of twice the number of tcp connections anyway.

While the cost of all those idle connections is minimal is doesn't come for free:

- It increases the likelyhood of a connection refused failure during the initialization stampede.
- TCPStore uses poll for checking for socket availability which scales linearly on the number of sockets regardless of their status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105014
Approved by: https://github.com/fduwjj
2023-07-21 21:18:55 +00:00
Min Si
1ad0048b64 Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
PyTorch MergeBot
a50d8864fc Revert "Refactor distribuetd to use absolute header path (#85780)"
This reverts commit 668082718a.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
Min Si
668082718a Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Nikita Shulga
3bb20ae49f Make c10d tests -Werror clean (#69703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69703

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D32997001

Pulled By: malfet

fbshipit-source-id: 38b5f195c04f2b3b920e6883a96fe9a36345b9d2
2021-12-09 22:10:04 -08:00
Luca Wehrstedt
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Since now c10d is part of libtorch, it would also be nice if the sources lived all in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00