Commit Graph

36 Commits

Author SHA1 Message Date
Nikita Shulga
c4d1ff02f8 [Lint] Update clang-format to 19.1.4 (#153889)
All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-20 14:12:46 +00:00
Yuanyuan Chen
ed5f4a4fa8 Replace size() checks with empty() (#153805)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153805
Approved by: https://github.com/nareshrajkumar866, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-05-19 16:20:57 +00:00
Tristan Rice
bb60e82672 c10d/Store: add queues (#150969)
This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943.

This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue.

This implements queues for only: HashStore, TCPStore w/ libuv. FileStore and the legacy backends are not supported.

`wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them.

This also has a few cleanups to error types/documentation in related code.

Example trace:

```
[I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816
[I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816
[I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816
[I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816
[I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816
[I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816
[I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816
[W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout
[W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms
[I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816
```

Test plan:

```
$ pytest test/distributed/test_store.py -k queue -v -s

test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues)
test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s]
test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues)
test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues)
test/distributed/test_store.py::LibUvTCPStoreTest::test_queues PASSED [0.0014s]
test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 19:24:17 +00:00
fduwjj
320914f1b6 [c10d][libuv] Add back correct EOF case check (#151052)
We removed the wrong EOF case in https://github.com/pytorch/pytorch/pull/150987, and we added the correct one back in this PR. Since https://github.com/pytorch/pytorch/pull/150987 is a fix, so we merge that PR first and use this PR as a follow-up to further makes the logic more complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151052
Approved by: https://github.com/XilunWu
2025-04-11 01:37:30 +00:00
fduwjj
f663aa4e81 [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987)
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.

Added a unit test to test this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-10 18:48:57 +00:00
Tristan Rice
29b3fdab01 TCPStoreLibUvBackend: support masterListenFd (#150215)
This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine.

This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend.

Test plan:
```
pytest -v test/distributed/test_store.py -k test_take_over_listen_socket
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215
Approved by: https://github.com/fduwjj
2025-03-29 01:58:07 +00:00
Tristan Rice
8eb400ef66 [BE] TCPStore: use typed errors for assertions (#147647)
This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend  to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python.

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
2025-02-24 20:58:10 +00:00
Tristan Rice
ba214ab56c TCPStore: soft fail bind when agent store active (#147465)
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.

This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical.

This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.

https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
2025-02-21 03:02:26 +00:00
Chirag Pandya
bdf6dfa17d [chore][ez] change alloc buffer size from 4000 to 4096 (#145759)
Summary:
Allocations typically happen as a power of 2 anyway.
Change the default alloc size to 4096 so eek out a bit more perf.

Test:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756, #145757
2025-01-28 09:14:07 +00:00
Chirag Pandya
78f02bf07c [bug] handle case when remote peer closes connection (#145757)
Summary:
In the case where remote peer closes the connection, nread returns 0. In
this case, we still want to free up the allocated buffer.
Also, reorder the if so that the likely success cases (nread > 0) is at
the top of the function with an early return.

Test Plan:
unit tests

Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756
2025-01-28 03:06:38 +00:00
Chirag Pandya
5534c270db [chore] fix new linter (#145756)
Summary:
Fix new linter that's complaining when I made changes to this file:
class 'LibUVStoreDaemon' defines a non-default destructor but does not
define a copy constructor, a copy assignment operator, a move
constructor or a move assignment operator

Test Plan:
make lint passes

Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756
Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-27 22:48:12 +00:00
Chirag Pandya
f8a4f16634 [c10d] fix memory leak on shutdown (#145507)
Summary:
Fix memory leak on shutdown when socket is closed.
We still need to free the buffer to make valgrind happy.

Test Plan:
Use `mtiavm`.
Repro steps provided by cristianlume.

on window 1:
```
vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2
```
on window 2:
```
vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1
```

without the fix:
```
==8766==ERROR: LeakSanitizer: detected memory leaks
```
With fix, no leak

Differential Revision: D68566104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507
Approved by: https://github.com/XilunWu, https://github.com/d4l3k
2025-01-23 23:36:15 +00:00
cyy
b4c0973b59 [2/N] Apply bugprone-unchecked-optional-access (#141091)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141091
Approved by: https://github.com/Skylion007, https://github.com/albanD

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 19:30:19 +00:00
cyy
1ec76dd1dc Enable clang-tidy on torch/csrc/distributed (#139043)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139043
Approved by: https://github.com/Skylion007
2024-10-28 13:56:54 +00:00
cyy
2bcfbf2505 [Distributed] [17/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138465)
Follows  #137404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138465
Approved by: https://github.com/ezyang
2024-10-24 04:58:49 +00:00
Tristan Rice
bebf5302ba TCPStoreLibUvBackend: trace operations (#136320)
Summary:
This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values

This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead

Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```

```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```

Differential Revision: D62924454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
2024-09-20 00:53:21 +00:00
fduwjj
c608b17f60 [PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496)
While designing something else when TCPStore is needed. I spent some time digging into the codebase of TCPStore and found that the code is a little bit challenging to understand without proper documents. Although people from OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road.

Also for libuv, we need to make private variables with a "_", so it's a pure renaming of private variables such as `tcpServer`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496
Approved by: https://github.com/wconstab
2024-09-11 04:42:25 +00:00
Tristan Rice
ef9d9be236 TCPStoreLibUvBackend: log port on error (#130797)
Adds better error messages when a socket fails to bind in libuv.

New format:
```
The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied
```

Old format:

```
The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
```

Test plan:

Added test in `test_store.py`

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797
Approved by: https://github.com/kurman
2024-07-17 01:34:15 +00:00
Tristan Rice
9ee8c18309 TCPStore: add ping to verify network connectivity on connect (#129985)
This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent.

This adds support for PING to both the libuv and legacy backend.

Example error:
```
[trainer85612|12]:W0701 13:41:43.421574  4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer
[trainer85612|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first):
...
[trainer85612|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637
[trainer85612|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868
[trainer85612|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775
```

Test plan:

```
python test/distributed/test_store.py -v
```

```
tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py
starting pool
started 90000
started 30000
started 70000
started 20000
started 80000
started 60000
started 0
[W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
init 20000
set 20000
init 80000
set 80000
init 70000
set 70000
init 60000
set 60000
init 30000
set 30000
init 90000
set 90000
started 40000
init 40000
set 40000
started 50000
init 50000
set 50000
started 10000
init 10000
set 10000
init 0
set 0
run finished 617.2992351055145
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985
Approved by: https://github.com/rsdcastro, https://github.com/kurman
2024-07-03 02:09:44 +00:00
Tristan Rice
b50c0e94c2 TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY (#128739)
This adjusts the settings of the libuv backend to match the older TCPStore.

* DEFAULT_BACKLOG: setting this to -1 will enable using the host somaxconn value instead of a hardcoded 16k value. When going over this limit with `tcp_abort_on_overflow` set it results in connections being reset.
* TCP_NODELAY: Since TCPStore primarily sends small messages there's no benefit to using Nargle's algorithm and it may add additional latency for store operations.

Test plan:

```
python test/distributed/test_store.py -v -k LibUv
```

Benchmark script:
```
import time
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])

store = dist.TCPStore(
    host_name="<server>",
    port=29500,
    world_size=2,
    is_master=(rank == 0),
    use_libuv=True,
)

if rank == 1:
    total_iters = 0
    total_dur = 0
    for iter in range(10):
        iters = 500000
        start = time.perf_counter()
        for i in range(iters):
            store.set(f"key_{i}", f"value_{i}")
        dur = time.perf_counter() - start
        print(f"{iter}. {iters} set, qps = {iters/dur}")
        total_iters += iters
        total_dur += dur

    print(f"overall qps = {total_iters/total_dur}")
else:
    print("sleeping")
    time.sleep(1000000000)
```

Performance seems to be negligible difference between TCP_NODELAY and not for a single host

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739
Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o
2024-06-15 07:40:18 +00:00
Tristan Rice
52d4442a00 [c10d] Socket, TCPStore: add better logging (#128673)
This adds better logging of errors to the socket and TCPStore classes.

All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged.

It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.

Test plan:

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673
Approved by: https://github.com/c-p-i-o
2024-06-14 23:08:29 +00:00
Tristan Rice
7c370d2fb0 expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process.

Threads named:

* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread

Test plan:

```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45   00:00:02 pt_elastic
3264281 3267950 pts/45   00:00:00 pt_elastic
```

dataloading

```py
import torch
import time

from torch.utils.data import (
    DataLoader,
    Dataset,
)

class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10

dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    time.sleep(10000)
```

```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00
Xilun Wu
85758fa5ae [c10d][TCPStore] make TCPStore server use libuv by default (#127957)
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">

We hope this improvement would benefit users from a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.

**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to cause any user notice except significant faster TCPStore startup for large-scale jobs.

One thing to note is, we do not support the initialization approach where user passes in a socket for libuv backend. We plan to support it as a next step but we choose to disable it before fully testing. If you are initializing TCPStore in this approach, you can see the next section to remain using the old TCPStore server.

**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there're 3 ways:

1. If user is directly instantiating TCPStore object, user can pass in argument `use_libuv=False` to use the old TCPStore server backend e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, user can set environment variable `USE_LIBUV` to `"0"` when launching.

These 3 approach are in order of precedence. That being said, if user specifies `use_libuv=0` in `init_method` and also sets environment var `USE_LIBUV="1"`, the former will take effect and the TCPStore backend instantiated will be the old one instead of the one using libuv.

**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

**TODO**
1. Update the doc at

- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization

2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
2024-06-07 16:53:01 +00:00
Xilun Wu
6c824cd9fb [BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend (#127956)
**Summary**
The use of TORCH_ERROR in TCPStore libuv backend code needs update.

Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127956
Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever
2024-06-07 16:53:01 +00:00
cyy
be7be9fa16 [Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following https://github.com/pytorch/pytorch/pull/124987.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125102
Approved by: https://github.com/ezyang
2024-05-30 16:19:53 +00:00
cyy
70d8bc2da1 Fix various errors in TCPStoreLibUvBackend.cpp (#127230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127230
Approved by: https://github.com/Skylion007
2024-05-27 19:14:01 +00:00
cyy
b60af92c17 [Distributed] [3/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123312)
This PR continues to fix some clang-tidy warnings in distributed code, following https://github.com/pytorch/pytorch/pull/122892.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123312
Approved by: https://github.com/Skylion007
2024-04-13 11:45:00 +00:00
Xilun Wu
0b0b9b3275 [c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141)
**Test Plan**
1. build pytorch
2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory.

without the change:
<img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa">

with the change:
<img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116141
Approved by: https://github.com/wconstab
2023-12-20 18:37:55 +00:00
Juncheng Gu
7c4e49ec80 [Fix] add validation logics to TCPStore queries (#107607)
This PR fixes #106294.

Due to the lack of request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces the very first query from a client is a validation query with a predefined magic number. If the validation fails, the server will terminate the connection.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607
Approved by: https://github.com/cbalioglu, https://github.com/XilunWu
2023-11-07 18:36:25 +00:00
PyTorch MergeBot
c63693ca27 Revert "[Fix] add validation logics to TCPStore queries (#107607)"
This reverts commit 50a9981217.

Reverted https://github.com/pytorch/pytorch/pull/107607 on behalf of https://github.com/huydhn due to For some reason, lint job was not run on the PR and now start failing trunk, please rebase and fix lint before relanding 50a9981217 ([comment](https://github.com/pytorch/pytorch/pull/107607#issuecomment-1791702818))
2023-11-02 23:34:08 +00:00
Juncheng Gu
50a9981217 [Fix] add validation logics to TCPStore queries (#107607)
This PR fixes #106294.

Due to the lack of request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces the very first query from a client is a validation query with a predefined magic number. If the validation fails, the server will terminate the connection.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607
Approved by: https://github.com/cbalioglu, https://github.com/XilunWu
2023-11-02 22:12:45 +00:00
Rodrigo Kumpera
92de1d3222 [C10D] Push store scalability a bit further. (#109217)
This is a bunch of small changes to improve store scalability:

- stagger client connection to avoid a stampede.
- warn if somaxconn is too small.
- increase the backlog to 16k.

Differential Revision: [D49238587](https://our.internmc.facebook.com/intern/diff/D49238587)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109217
Approved by: https://github.com/XilunWu
2023-09-22 17:23:46 +00:00
Rodrigo Kumpera
fe2cda64dc [C10D] Implement new libuv backend for TCPStore. (#108066)
The new backend is currently under a flag 'use_libuv' in TCPStore constructor
to reduce the impact on existing users as we test it.

This is a reland of #105870 with a fix for a bad test.

Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2023-08-29 14:55:14 +00:00
PyTorch MergeBot
d3f92ca9e9 Revert "[C10D] Implement new libuv backend for TCPStore. (#105870)"
This reverts commit 3c841163ce.

Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))
2023-08-17 23:41:00 +00:00
Rodrigo Kumpera
3c841163ce [C10D] Implement new libuv backend for TCPStore. (#105870)
The new backend is currently under a flag 'use_libuv' in TCPStore constructor
to reduce the impact on existing users as we test it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870
Approved by: https://github.com/H-Huang
2023-08-17 20:40:32 +00:00
Rodrigo Kumpera
2636751fb9 [C10d] Add skeleton of LibUV backend. (#105672)
This commit hooks up tcpstore creation and build flags.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105672
Approved by: https://github.com/fduwjj
2023-07-28 13:19:06 +00:00