pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Nikita Shulga	c4d1ff02f8	[Lint] Update clang-format to 19.1.4 (#153889 ) All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889 Approved by: https://github.com/cyyever, https://github.com/atalman	2025-05-20 14:12:46 +00:00
Yuanyuan Chen	ed5f4a4fa8	Replace size() checks with empty() (#153805 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/153805 Approved by: https://github.com/nareshrajkumar866, https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-05-19 16:20:57 +00:00
Tristan Rice	bb60e82672	c10d/Store: add queues (#150969 ) This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943. This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue. This implements queues for only: HashStore, TCPStore w/ libuv. FileStore and the legacy backends are not supported. `wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them. This also has a few cleanups to error types/documentation in related code. Example trace: ``` [I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816 [I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816 [I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816 [I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816 [I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816 [I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816 [I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 
keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816 [W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout [W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms [I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816 ``` Test plan: ``` $ pytest test/distributed/test_store.py -k queue -v -s test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues) test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s] test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues) test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues) test/distributed/test_store.py::LibUvTCPStoreTest::test_queues 
PASSED [0.0014s] test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969 Approved by: https://github.com/XilunWu, https://github.com/fduwjj	2025-04-11 19:24:17 +00:00
fduwjj	320914f1b6	[c10d][libuv] Add back correct EOF case check (#151052 ) We removed the wrong EOF case in https://github.com/pytorch/pytorch/pull/150987, and this PR adds the correct one back. Because https://github.com/pytorch/pytorch/pull/150987 is a fix, we merged that PR first and use this PR as a follow-up to make the logic more complete. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151052 Approved by: https://github.com/XilunWu	2025-04-11 01:37:30 +00:00
fduwjj	f663aa4e81	[c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987 ) While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally closed the socket when nread == 0, on the assumption that this meant the connection was closed. This is not true. According to the libuv docs (https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb): > nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2). We found this bug while debugging a broken pipe issue that occurs when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently. Added a unit test to test this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987 Approved by: https://github.com/d4l3k, https://github.com/XilunWu	2025-04-10 18:48:57 +00:00
Tristan Rice	29b3fdab01	TCPStoreLibUvBackend: support masterListenFd (#150215 ) This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine. This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend. Test plan: ``` pytest -v test/distributed/test_store.py -k test_take_over_listen_socket ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215 Approved by: https://github.com/fduwjj	2025-03-29 01:58:07 +00:00
Tristan Rice	8eb400ef66	[BE] TCPStore: use typed errors for assertions (#147647 ) This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python. Test plan: ``` pytest test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647 Approved by: https://github.com/fduwjj	2025-02-24 20:58:10 +00:00
Tristan Rice	ba214ab56c	TCPStore: soft fail bind when agent store active (#147465 ) This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`. This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical. This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests. https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau Test plan: ``` pytest test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465 Approved by: https://github.com/fduwjj	2025-02-21 03:02:26 +00:00
Chirag Pandya	bdf6dfa17d	[chore][ez] change alloc buffer size from 4000 to 4096 (#145759 ) Summary: Allocations typically happen as a power of 2 anyway. Change the default alloc size to 4096 to eke out a bit more perf. Test: unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759 Approved by: https://github.com/XilunWu, https://github.com/fduwjj ghstack dependencies: #145756, #145757	2025-01-28 09:14:07 +00:00
Chirag Pandya	78f02bf07c	[bug] handle case when remote peer closes connection (#145757 ) Summary: In the case where the remote peer closes the connection, nread returns 0. In this case, we still want to free up the allocated buffer. Also, reorder the if so that the likely success case (nread > 0) is at the top of the function with an early return. Test Plan: unit tests Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757 Approved by: https://github.com/XilunWu, https://github.com/fduwjj ghstack dependencies: #145756	2025-01-28 03:06:38 +00:00
Chirag Pandya	5534c270db	[chore] fix new linter (#145756 ) Summary: Fix new linter that's complaining when I made changes to this file: class 'LibUVStoreDaemon' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator Test Plan: make lint passes Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756 Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj	2025-01-27 22:48:12 +00:00
Chirag Pandya	f8a4f16634	[c10d] fix memory leak on shutdown (#145507 ) Summary: Fix memory leak on shutdown when socket is closed. We still need to free the buffer to make valgrind happy. Test Plan: Use `mtiavm`. Repro steps provided by cristianlume. on window 1: ``` vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 ``` on window 2: ``` vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1 ``` without the fix: ``` ==8766==ERROR: LeakSanitizer: detected memory leaks ``` With fix, no leak Differential Revision: D68566104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507 Approved by: https://github.com/XilunWu, https://github.com/d4l3k	2025-01-23 23:36:15 +00:00
cyy	b4c0973b59	[2/N] Apply bugprone-unchecked-optional-access (#141091 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141091 Approved by: https://github.com/Skylion007, https://github.com/albanD Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-12-09 19:30:19 +00:00
cyy	1ec76dd1dc	Enable clang-tidy on torch/csrc/distributed (#139043 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139043 Approved by: https://github.com/Skylion007	2024-10-28 13:56:54 +00:00
cyy	2bcfbf2505	[Distributed] [17/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138465 ) Follows #137404 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138465 Approved by: https://github.com/ezyang	2024-10-24 04:58:49 +00:00
Tristan Rice	bebf5302ba	TCPStoreLibUvBackend: trace operations (#136320 ) Summary: This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead Test Plan: ``` TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo" ``` ``` I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500. I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500). I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500. I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500). 
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500 I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646 I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646 I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646 I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646 I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646 I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646 I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646 I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646 ``` Differential Revision: D62924454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320 Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu	2024-09-20 00:53:21 +00:00
fduwjj	c608b17f60	[PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496 ) While designing something else that needs TCPStore, I spent some time digging into the TCPStore codebase and found the code a little challenging to understand without proper documentation. Although people from the OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road. Also, for libuv, private variables need to be marked with a "_", so this is a pure renaming of private variables such as `tcpServer`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496 Approved by: https://github.com/wconstab	2024-09-11 04:42:25 +00:00
Tristan Rice	ef9d9be236	TCPStoreLibUvBackend: log port on error (#130797 ) Adds better error messages when a socket fails to bind in libuv. New format: ``` The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied ``` Old format: ``` The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use ``` Test plan: Added test in `test_store.py` ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797 Approved by: https://github.com/kurman	2024-07-17 01:34:15 +00:00
Tristan Rice	9ee8c18309	TCPStore: add ping to verify network connectivity on connect (#129985 ) This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent. This adds support for PING to both the libuv and legacy backend. Example error: ``` [trainer85612\|12]:W0701 13:41:43.421574 4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer [trainer85612\|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first): ... [trainer85612\|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637 [trainer85612\|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868 [trainer85612\|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775 ``` Test plan: ``` python test/distributed/test_store.py -v ``` ``` tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py starting pool started 90000 started 30000 started 70000 started 20000 started 80000 started 60000 started 0 [W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it. 
init 20000 set 20000 init 80000 set 80000 init 70000 set 70000 init 60000 set 60000 init 30000 set 30000 init 90000 set 90000 started 40000 init 40000 set 40000 started 50000 init 50000 set 50000 started 10000 init 10000 set 10000 init 0 set 0 run finished 617.2992351055145 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985 Approved by: https://github.com/rsdcastro, https://github.com/kurman	2024-07-03 02:09:44 +00:00
Tristan Rice	b50c0e94c2	TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY (#128739 ) This adjusts the settings of the libuv backend to match the older TCPStore. * DEFAULT_BACKLOG: setting this to -1 will enable using the host somaxconn value instead of a hardcoded 16k value. When going over this limit with `tcp_abort_on_overflow` set it results in connections being reset. * TCP_NODELAY: Since TCPStore primarily sends small messages there's no benefit to using Nagle's algorithm and it may add additional latency for store operations. Test plan: ``` python test/distributed/test_store.py -v -k LibUv ``` Benchmark script: ``` import time import os import torch.distributed as dist rank = int(os.environ["RANK"]) store = dist.TCPStore( host_name="<server>", port=29500, world_size=2, is_master=(rank == 0), use_libuv=True, ) if rank == 1: total_iters = 0 total_dur = 0 for iter in range(10): iters = 500000 start = time.perf_counter() for i in range(iters): store.set(f"key_{i}", f"value_{i}") dur = time.perf_counter() - start print(f"{iter}. {iters} set, qps = {iters/dur}") total_iters += iters total_dur += dur print(f"overall qps = {total_iters/total_dur}") else: print("sleeping") time.sleep(1000000000) ``` The performance difference with and without TCP_NODELAY appears negligible on a single host. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739 Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o	2024-06-15 07:40:18 +00:00
Tristan Rice	52d4442a00	[c10d] Socket, TCPStore: add better logging (#128673 ) This adds better logging of errors to the socket and TCPStore classes. All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged. It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky. Test plan: ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673 Approved by: https://github.com/c-p-i-o	2024-06-14 23:08:29 +00:00
Tristan Rice	7c370d2fb0	expose set_thread_name to Python and set thread names (#128448 ) This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process. Threads named: * torchrun/elastic * PyTorch dataloader worker processes + pin memory thread * TCPStore * ProcessGroupNCCL background threads * WorkerServer httpserver thread Test plan: ``` $ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_' 3264281 3264281 pts/45 00:00:02 pt_elastic 3264281 3267950 pts/45 00:00:00 pt_elastic ``` dataloading ```py import torch import time from torch.utils.data import ( DataLoader, Dataset, ) class NoopDataset(Dataset): def __getitem__(self, index): return index def __len__(self): return 10 dataloader = DataLoader(NoopDataset(), num_workers=2) for i, x in enumerate(dataloader): print(i, x) time.sleep(10000) ``` ``` $ python3 ~/scripts/dataloader_test.py $ ps -eL | grep pt_ 1228312 1228312 pts/45 00:00:02 pt_main_thread 1228312 1230058 pts/45 00:00:00 pt_main_thread 1228312 1230059 pts/45 00:00:00 pt_main_thread 1230052 1230052 pts/45 00:00:00 pt_data_worker 1230052 1230198 pts/45 00:00:00 pt_data_worker 1230052 1230740 pts/45 00:00:00 pt_data_worker 1230055 1230055 pts/45 00:00:00 pt_data_worker 1230055 1230296 pts/45 00:00:00 pt_data_worker 1230055 1230759 pts/45 00:00:00 pt_data_worker ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448 Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro	2024-06-13 16:38:23 +00:00
Xilun Wu	85758fa5ae	[c10d][TCPStore] make TCPStore server use libuv by default (#127957 ) Summary This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability: <img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02"> We hope this improvement will give users a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one. What it changes This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to cause any user-visible difference except significantly faster TCPStore startup for large-scale jobs. One thing to note is that we do not support the initialization approach where the user passes in a socket for the libuv backend. We plan to support it as a next step but we chose to disable it until it is fully tested. If you are initializing TCPStore with this approach, see the next section to keep using the old TCPStore server. Fallback/Remain using the old TCPStore server For users who want to stay with the old TCPStore backend, there are 3 ways: 1. If the user is directly instantiating a TCPStore object, the user can pass in the argument `use_libuv=False` to use the old TCPStore server backend e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`. 2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")` 3. Or, the user can set the environment variable `USE_LIBUV` to `"0"` when launching. These 3 approaches are listed in order of precedence. 
That being said, if user specifies `use_libuv=0` in `init_method` and also sets environment var `USE_LIBUV="1"`, the former will take effect and the TCPStore backend instantiated will be the old one instead of the one using libuv. Operating Systems Compatibility From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label. Test Plan `pytest test/distributed/test_store.py` <img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588"> note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time. `test/distributed/elastic/utils/distributed_test.py` <img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6"> TODO 1. Update the doc at - https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store - https://pytorch.org/docs/stable/distributed.html#tcp-initialization 2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman 3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`. 
Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957 Approved by: https://github.com/kurman ghstack dependencies: #127956	2024-06-07 16:53:01 +00:00
Xilun Wu	6c824cd9fb	[BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend (#127956 ) Summary The use of TORCH_ERROR in TCPStore libuv backend code needs update. Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127956 Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever	2024-06-07 16:53:01 +00:00
cyy	be7be9fa16	[Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102 ) This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following https://github.com/pytorch/pytorch/pull/124987. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125102 Approved by: https://github.com/ezyang	2024-05-30 16:19:53 +00:00
cyy	70d8bc2da1	Fix various errors in TCPStoreLibUvBackend.cpp (#127230 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127230 Approved by: https://github.com/Skylion007	2024-05-27 19:14:01 +00:00
cyy	b60af92c17	[Distributed] [3/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123312 ) This PR continues to fix some clang-tidy warnings in distributed code, following https://github.com/pytorch/pytorch/pull/122892. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123312 Approved by: https://github.com/Skylion007	2024-04-13 11:45:00 +00:00
Xilun Wu	0b0b9b3275	[c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141 ) Test Plan 1. build pytorch 2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory. without the change: <img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa"> with the change: <img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116141 Approved by: https://github.com/wconstab	2023-12-20 18:37:55 +00:00
Juncheng Gu	7c4e49ec80	[Fix] add validation logics to TCPStore queries (#107607 ) This PR fixes #106294. Due to the lack of request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces the very first query from a client is a validation query with a predefined magic number. If the validation fails, the server will terminate the connection. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607 Approved by: https://github.com/cbalioglu, https://github.com/XilunWu	2023-11-07 18:36:25 +00:00
PyTorch MergeBot	c63693ca27	Revert "[Fix] add validation logics to TCPStore queries (#107607 )" This reverts commit `50a9981217`. Reverted https://github.com/pytorch/pytorch/pull/107607 on behalf of https://github.com/huydhn due to For some reason, lint job was not run on the PR and now start failing trunk, please rebase and fix lint before relanding `50a9981217` ([comment](https://github.com/pytorch/pytorch/pull/107607#issuecomment-1791702818))	2023-11-02 23:34:08 +00:00
Juncheng Gu	50a9981217	[Fix] add validation logics to TCPStore queries (#107607 ) This PR fixes #106294. Due to the lack of request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces the very first query from a client is a validation query with a predefined magic number. If the validation fails, the server will terminate the connection. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607 Approved by: https://github.com/cbalioglu, https://github.com/XilunWu	2023-11-02 22:12:45 +00:00
Rodrigo Kumpera	92de1d3222	[C10D] Push store scalability a bit further. (#109217 ) This is a bunch of small changes to improve store scalability: - stagger client connection to avoid a stampede. - warn if somaxconn is too small. - increase the backlog to 16k. Differential Revision: [D49238587](https://our.internmc.facebook.com/intern/diff/D49238587) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109217 Approved by: https://github.com/XilunWu	2023-09-22 17:23:46 +00:00
Rodrigo Kumpera	fe2cda64dc	[C10D] Implement new libuv backend for TCPStore. (#108066 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. This is a reland of #105870 with a fix for a bad test. Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066 Approved by: https://github.com/H-Huang, https://github.com/fduwjj	2023-08-29 14:55:14 +00:00
PyTorch MergeBot	d3f92ca9e9	Revert "[C10D] Implement new libuv backend for TCPStore. (#105870 )" This reverts commit `3c841163ce`. Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))	2023-08-17 23:41:00 +00:00
Rodrigo Kumpera	3c841163ce	[C10D] Implement new libuv backend for TCPStore. (#105870 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870 Approved by: https://github.com/H-Huang	2023-08-17 20:40:32 +00:00
Rodrigo Kumpera	2636751fb9	[C10d] Add skeleton of LibUV backend. (#105672 ) This commit hooks up tcpstore creation and build flags. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105672 Approved by: https://github.com/fduwjj	2023-07-28 13:19:06 +00:00
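
The fallback order described in the #127957 message (an explicit `use_libuv` argument, then `use_libuv` in the `init_method` query string, then the `USE_LIBUV` environment variable, with libuv as the default) can be sketched as a small resolver. This is only an illustration under stated assumptions: `resolve_use_libuv` is a hypothetical helper, not a PyTorch function, and treating the string `"1"` as "on" and anything else as "off" is an assumption, not something the commit message specifies.

```python
import os
from typing import Mapping, Optional
from urllib.parse import parse_qs, urlparse


def resolve_use_libuv(
    explicit: Optional[bool] = None,
    init_method: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical helper: decide whether the libuv server backend is used.

    Precedence follows the commit message: explicit argument first, then the
    `use_libuv` query parameter of init_method, then the USE_LIBUV variable.
    """
    env = os.environ if env is None else env
    if explicit is not None:
        return explicit
    if init_method is not None:
        query = parse_qs(urlparse(init_method).query)
        if "use_libuv" in query:
            # Assumption: "1" enables libuv, any other value disables it.
            return query["use_libuv"][-1] == "1"
    if "USE_LIBUV" in env:
        return env["USE_LIBUV"] == "1"
    return True  # libuv is the default server backend after #127957


# The init_method setting wins over the environment variable.
chosen = resolve_use_libuv(
    init_method="tcp://localhost:29500?use_libuv=0",
    env={"USE_LIBUV": "1"},
)
print("libuv backend" if chosen else "legacy backend")
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.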

36 Commits
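
The queue semantics described in the #150969 message (push always succeeds because the queue is unbounded, pop blocks until the key is ready or a timeout expires, and a push wakes a single waiter rather than all of them) can be modelled in plain Python. This is a rough in-process sketch, not the c10d implementation: `QueueStoreSketch` is a hypothetical class, FIFO ordering is assumed, and the use of one condition variable per key is a design choice made for this example only.

```python
import threading
from collections import defaultdict, deque


class QueueStoreSketch:
    """Hypothetical in-process model of blocking queue_push / queue_pop."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._queues = defaultdict(deque)
        # One condition per key, all sharing one lock, so a push only
        # wakes a thread that is waiting on that same key.
        self._conds = defaultdict(lambda: threading.Condition(self._lock))

    def queue_push(self, key: str, value: bytes) -> None:
        with self._lock:
            self._queues[key].append(value)  # unbounded, so this never blocks
            self._conds[key].notify()  # wake one waiter, not all of them

    def queue_pop(self, key: str, timeout: float) -> bytes:
        with self._lock:
            ready = self._conds[key].wait_for(
                lambda: len(self._queues[key]) > 0, timeout
            )
            if not ready:
                raise TimeoutError(f"queue_pop timed out waiting for {key!r}")
            return self._queues[key].popleft()


store = QueueStoreSketch()
store.queue_push("jobs", b"first")
store.queue_push("jobs", b"second")
print(store.queue_pop("jobs", timeout=0.1))
print(store.queue_pop("jobs", timeout=0.1))
```

Because each push adds exactly one item, waking one waiter per push is enough to hand every item to some consumer, which is why a broadcast wake-up is not needed in this model.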