This makes TCPStore `wait` print actually useful info on timeout instead of a generic `Socket Timeout` message.
Bonus:
* fix weirdness where `connect_timeout` only supported seconds, unlike the rest of our timeouts (thus the minimum timeout was 1s)
* Fixed tests that used a 10s timeout (test_store now only takes 20s instead of 40s)
Ex:
```
DistStoreError: wait timeout after 100ms, keys: /the_key
```
Test plan:
```
python test/distributed/test_store.py
python test/distributed/test_c10d_gloo.py -v -k timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131808
Approved by: https://github.com/kurman
This does a round-trip request on socket connect -- this allows detecting connection resets etc. and retrying before the non-retryable application requests are sent.
This adds support for PING to both the libuv and legacy backends.
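To sketch the idea (illustration only -- the opcode value, wire format, and retry policy below are made up and are not the actual C++ implementation): the client performs one cheap request/response round trip right after `connect()` and retries the whole connect if that round trip fails, so failures like the example error below are caught while retrying is still safe.
```py
import socket
import time


def connect_with_ping(host, port, attempts=5, delay=0.5):
    """Connect and immediately do a PING round trip so stale or reset
    connections are caught while the request is still safely retryable."""
    PING = PONG = b"\x0d"  # hypothetical 1-byte opcode, for illustration only
    last_err = None
    for _ in range(attempts):
        try:
            sock = socket.create_connection((host, port), timeout=5)
            sock.sendall(PING)            # round trip on connect
            if sock.recv(1) != PONG:      # server is expected to echo the ping
                raise ConnectionError("unexpected PING reply")
            return sock                   # connection is known-good
        except OSError as e:              # covers reset, refused, timeout, bad reply
            last_err = e
            time.sleep(delay)             # safe to retry: no app request was sent yet
    raise last_err
```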
Example error:
```
[trainer85612|12]:W0701 13:41:43.421574 4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer
[trainer85612|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first):
...
[trainer85612|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637
[trainer85612|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868
[trainer85612|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775
```
Test plan:
```
python test/distributed/test_store.py -v
```
```
tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py
starting pool
started 90000
started 30000
started 70000
started 20000
started 80000
started 60000
started 0
[W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
init 20000
set 20000
init 80000
set 80000
init 70000
set 70000
init 60000
set 60000
init 30000
set 30000
init 90000
set 90000
started 40000
init 40000
set 40000
started 50000
init 50000
set 50000
started 10000
init 10000
set 10000
init 0
set 0
run finished 617.2992351055145
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985
Approved by: https://github.com/rsdcastro, https://github.com/kurman
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with resets being enabled, as well as long init times.
This PR does a few things:
* Retry the connect and validate calls up to the specified timeout.
* Use exponential backoff with jitter for the retry logic instead of a fixed 1s sleep (a small sketch of the idea follows this list).
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141
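A minimal sketch of the backoff policy described above (constants and function names here are assumptions, not the actual C++ backoff implementation exercised by `BackoffTest`):
```py
import random
import time


def backoff_sleeps(base=0.1, cap=5.0, max_attempts=10):
    """Yield sleep durations: exponential growth, capped, with full jitter."""
    for attempt in range(max_attempts):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))


def retry_with_backoff(op, timeout_s=60.0):
    """Retry `op` (e.g. connect + validate) until it succeeds or time runs out."""
    deadline = time.monotonic() + timeout_s
    for sleep_s in backoff_sleeps():
        try:
            return op()
        except ConnectionError:
            if time.monotonic() + sleep_s >= deadline:
                raise
            time.sleep(sleep_s)  # jitter avoids synchronized retry stampedes
    raise TimeoutError("retries exhausted")
```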
Test plan:
```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```
Will do internal testing with some large scale jobs to ensure TCPStore works correctly.
At 4k scale: 4x improvement
```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (pytorch-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 1.98 secs fish external
usr time 0.93 secs 91.00 micros 0.93 secs
sys time 1.98 secs 954.00 micros 1.97 secs
tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10 (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (torchdrive-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 8.20 secs fish external
usr time 2.15 secs 0.00 micros 2.15 secs
sys time 2.76 secs 843.00 micros 2.76 secs
```
```py
import time
import os
import threading
from multiprocessing import Pool

WORLD_SIZE = 10000

import torch.distributed as dist


def run(rank):
    should_log = rank % (WORLD_SIZE // 10) == 0
    if should_log:
        print(f"started {rank}")
    store = dist.TCPStore(
        host_name="devvm4382.nao0.facebook.com",
        port=29500,
        world_size=WORLD_SIZE,
        is_master=rank == 0,
        use_libuv=True,
    )
    if should_log:
        print(f"init {rank}")
    store.set(f"key{rank}", "1234")
    if should_log:
        print(f"set {rank}")
    del store


def noop(rank):
    pass


print("starting pool")
with Pool(WORLD_SIZE) as pool:
    pool.map(noop, range(WORLD_SIZE), 1)
    print("pool hot")

    start = time.time()
    pool.map(run, range(WORLD_SIZE), 1)
    print("run finished", time.time()-start)
```
```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
This adds better logging of errors to the socket and TCPStore classes.
All socket operations should now include the local and remote addresses, and we now actually log errors from TCPStoreBackend::run as well as TCPStoreBackendUV, which were previously emitted as INFO messages and thus not actually logged.
It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.
Test plan:
```
python test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673
Approved by: https://github.com/c-p-i-o
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">
We hope this improvement benefits users through a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.
**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` unless users explicitly specify that they want the old TCPStore server. This change should not be noticeable to users, apart from significantly faster TCPStore startup for large-scale jobs.
One thing to note: we do not yet support the initialization approach where the user passes in a socket for the libuv backend. We plan to support it as a next step, but chose to disable it until it is fully tested. If you initialize TCPStore this way, see the next section for how to keep using the old TCPStore server.
**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there're 3 ways:
1. If user is directly instantiating TCPStore object, user can pass in argument `use_libuv=False` to use the old TCPStore server backend e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, user can set environment variable `USE_LIBUV` to `"0"` when launching.
These 3 approaches are listed in order of precedence. That is, if a user specifies `use_libuv=0` in `init_method` and also sets the environment variable `USE_LIBUV="1"`, the former takes effect and the TCPStore backend instantiated will be the old one instead of the libuv one. The sketch below puts the three options side by side.
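For reference, a rough side-by-side of the three opt-out options (hostname, port, backend, and world size are placeholders; each option on its own would block waiting for peers, so treat this as a reference snippet rather than a runnable script):
```py
import os
from datetime import timedelta
import torch.distributed as dist

# 1. Direct TCPStore construction: opt out with use_libuv=False.
store = dist.TCPStore("localhost", 29500, 2, True,
                      timeout=timedelta(seconds=30), use_libuv=False)

# 2. Default process group init: opt out via the init_method query string.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://localhost:29500?use_libuv=0",
    rank=0,
    world_size=2,
)

# 3. Environment variable, consulted when neither of the above is specified.
os.environ["USE_LIBUV"] = "0"
```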
**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.
**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.
`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">
**TODO**
1. Update the doc at
- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization
2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.
Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
This PR fixes #106294.
Due to the lack of a request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces that the very first query from a client is a validation query with a predefined magic number. If the validation fails, the server terminates the connection.
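Conceptually the handshake looks like the sketch below (pure illustration -- the real protocol, magic-number value, and query encoding live in the C++ TCPStore code and differ from this):
```py
import struct

VALIDATE_MAGIC = 0x3C85F7CE  # hypothetical constant, for illustration only


def handle_new_client(conn):
    """Server side: the very first message must be the validation query."""
    header = conn.recv(4)
    if len(header) < 4 or struct.unpack("!I", header)[0] != VALIDATE_MAGIC:
        conn.close()   # e.g. an nmap probe: drop it before doing any work
        return False
    return True        # regular TCPStore queries may follow


def client_handshake(conn):
    """Client side: send the validation query before any other request."""
    conn.sendall(struct.pack("!I", VALIDATE_MAGIC))
```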
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607
Approved by: https://github.com/cbalioglu, https://github.com/XilunWu
Currently, if the master_store does not have all clients join within `timeout`, it just continues silently, which could lead to errors down the road. However, if a client does not connect to the master within the specified time, an exception is raised. This change makes the master_store error out if not all clients have joined, making server and client behavior consistent with each other.
Since this is changing the default behavior of master store I am open to suggestions.
Example:
```python
import torch.distributed as dist
import torch.multiprocessing as mp
from datetime import timedelta


def main(rank, world_size):
    if rank == 0:
        print("creating store")
        # world size is 2 so this eventually times out
        store = dist.TCPStore("localhost", 1234, 2, True, timeout=timedelta(seconds=5))
        print("finished creating store")


if __name__ == "__main__":
    world_size = 2
    mp.spawn(main, (world_size,), nprocs=world_size)
```
Previous
```
print("creating store")
print("finished creating store")
```
Now
```
print("creating store")
torch.distributed.DistStoreError: Timed out after 6 seconds waiting for workers. 1/2 workers joined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111805
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types (a usage sketch follows the list):
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
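With this hierarchy, callers can replace the string matching above with ordinary exception handling, roughly like the sketch below (hostname, port, and key are placeholders):
```py
import torch.distributed as dist

try:
    store = dist.TCPStore("localhost", 29500, 2, False)
    store.wait(["some_key"])
except dist.DistStoreError as e:    # errors originating from the store (e.g. wait timeout)
    print(f"store error: {e}")
except dist.DistNetworkError as e:  # socket-level errors (reset, broken pipe, ...)
    print(f"network error: {e}")
except dist.DistBackendError as e:  # process group backend errors (e.g. NCCL)
    print(f"backend error: {e}")
except dist.DistError as e:         # any other distributed error
    print(f"distributed error: {e}")
```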
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
The feature was never fully finished and never got any adoption, but TCPStore pays the cost of twice the number of TCP connections anyway.
While the cost of all those idle connections is minimal, it doesn't come for free:
- It increases the likelihood of a connection-refused failure during the initialization stampede.
- TCPStore uses poll for checking socket availability, which scales linearly with the number of sockets regardless of their status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105014
Approved by: https://github.com/fduwjj
Current TCPStore wait logic leaves the client socket in a bad state if waiting times out.
This happens because all recv functions raise an exception on timeout and that's it; the problem is that on timeout we also need to unregister the wait.
We implement this with client-side cancelation by adding a new CANCEL_WAIT instruction.
So, if no data arrives before the deadline, the client sends a CANCEL_WAIT command, and the server always replies with a WAIT_CANCELED response.
This leaves one last issue: there is a race between timing out, canceling the wait, and the wait completing, so the client needs to handle the server sending a STOP_WAITING followed by a WAIT_CANCELED answer.
This ensures client and server state stay synchronized regardless of whether the wait times out or not.
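In pseudocode, the client-side wait with cancelation described above looks roughly like this (the opcodes, framing, and socket helpers are placeholders, not the real wire protocol):
```py
import select
import time

CANCEL_WAIT = b"X"                        # placeholder 1-byte opcodes,
STOP_WAITING, WAIT_CANCELED = b"W", b"C"  # not the real wire format


def wait_with_timeout(sock, timeout_s):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        remaining = deadline - time.monotonic()
        ready, _, _ = select.select([sock], [], [], max(remaining, 0))
        if ready and sock.recv(1) == STOP_WAITING:
            return                        # keys became available in time
    # Timed out: cancel the registered wait so the socket stays usable.
    sock.sendall(CANCEL_WAIT)
    while True:
        reply = sock.recv(1)
        if not reply:
            raise ConnectionError("connection closed during cancelation")
        if reply == WAIT_CANCELED:        # server always acknowledges the cancelation
            raise TimeoutError("wait timed out")
        # Otherwise this was a racing STOP_WAITING; drain it and keep reading
        # until the WAIT_CANCELED acknowledgement arrives.
```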
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100594
Approved by: https://github.com/H-Huang
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.
Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).
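A rough sketch of the race-free pattern (the `master_listen_fd` and `wait_for_workers` keywords are how I'd expect this to be exposed on the Python `TCPStore` constructor; treat them as assumptions and check the unit test added in this PR for the exact API):
```py
import socket
from datetime import timedelta
import torch.distributed as dist

# Bind once and keep the socket open: the chosen port can no longer be stolen.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("0.0.0.0", 0))     # port 0: let the OS pick a free port
listener.listen()
port = listener.getsockname()[1]

# Hand the already-listening socket to the TCPStore server instead of closing
# it and re-binding, which is where the old race lived.
store = dist.TCPStore(
    "localhost", port, world_size=1, is_master=True,
    timeout=timedelta(seconds=30),
    wait_for_workers=False,
    master_listen_fd=listener.fileno(),  # assumed keyword, see above
)
```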
Test Plan:
Added new unit test:
buck2 test caffe2/test/distributed:store
Differential Revision: D46622317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
Accumulate data in a local buffer prior to sending it. This reduces the number of syscalls and network packets.
We flush every 1440 bytes to cap the amount of temporary memory used.
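The buffering idea, sketched in Python (the 1440-byte threshold comes from the description above; everything else is illustrative):
```py
FLUSH_THRESHOLD = 1440  # roughly one TCP segment worth of payload


class BufferedSender:
    """Coalesce small writes and flush them as one send() call."""

    def __init__(self, sock):
        self.sock = sock
        self.buf = bytearray()

    def append(self, data: bytes):
        self.buf += data
        if len(self.buf) >= FLUSH_THRESHOLD:   # cap temporary memory
            self.flush()

    def flush(self):
        if self.buf:
            self.sock.sendall(self.buf)        # one syscall instead of many
            self.buf.clear()
```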
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100742
Approved by: https://github.com/fduwjj
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.
The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)
Exclude several files from sign-compare if flash attention is used, due to the violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Do not try to fix sign compare violations in caffe2 codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
Not only is this change usually shorter and more readable, it can also yield better performance: size() is not always a constant-time operation (e.g., on linked lists), but empty() always is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
Fixes broken header filters from #90699 and applies a few more clang-tidy fixes that are relevant from c10 and c10d. The header filter pattern was actually broken and the clang-tidy include pattern was redundant. Also fixed a few bugs in torch/distributed/c10d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91178
Approved by: https://github.com/ezyang
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be gracefully handled by Meta internal builds when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components in PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about the Meta internal complication.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify we don't miss any relative-path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be gracefully handled by Meta internal builds when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components in PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about the Meta internal complication.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify we don't miss any relative-path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
Summary:
The main thread establishes a dedicated stop-signal pipe for each TCPStore background thread. Before joining a background thread, the main thread would close the write end of the corresponding pipe, expecting the background thread to receive POLLHUP. Upon receiving POLLHUP, the background thread would break the loop and exit gracefully.
Although we haven't found any documentation or literature backing this, we have evidence that under certain circumstances the read end of the pipe won't receive POLLHUP when the write end is closed. However, under the same circumstances, writing to the pipe guarantees that POLLIN is received on the read end.
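The workaround, sketched in Python with `os.pipe` and `select.poll` (the thread and loop structure is illustrative, not the actual C++ background-thread code):
```py
import os
import select
import threading


def background_loop(stop_read_fd):
    """Background thread: poll until the main thread signals shutdown."""
    poller = select.poll()
    poller.register(stop_read_fd, select.POLLIN | select.POLLHUP)
    while True:
        for fd, events in poller.poll():
            if fd == stop_read_fd and events & (select.POLLIN | select.POLLHUP):
                return  # graceful exit on either signal


stop_read, stop_write = os.pipe()
t = threading.Thread(target=background_loop, args=(stop_read,))
t.start()

# Shutdown: write a byte (guaranteed to show up as POLLIN on the read end)
# instead of relying solely on close() delivering POLLHUP.
os.write(stop_write, b"\0")
os.close(stop_write)
t.join()
os.close(stop_read)
```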
Test Plan: Manually tested
Differential Revision: D36208897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76973
Approved by: https://github.com/cbalioglu
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73595
On some platforms, this might be a type distinct from a 64-bit signed integer, and the compiler might warn about implicit coercion.
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D34558344
Pulled By: dagitses
fbshipit-source-id: 7a07d0723688390a7f27e6e71480d22c1e077200
(cherry picked from commit 8b807c4431f909c20b3133a6522b58d7b9ab0d85)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69590
The variable `callbackRegisteredData_` was written to without
synchronization.
ghstack-source-id: 145066862
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D32938979
fbshipit-source-id: bc9a11a70680db45ece95880ae19ce2026e8a88e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68499
TCPStore is actually accessed from multiple threads (e.g., the NCCL watchdog thread), but has no mutex protection, while FileStore and HashStore do. Since enabling desync root cause analysis makes store calls more frequent, the race condition in TCPStore was always triggered when creating another process group such as gloo. This adds a mutex to TCPStore, matching FileStore and HashStore.
Test Plan:
DDP benchmark with desync debug enabled, no perf regression
https://www.internalfb.com/intern/fblearner/details/309398285?tab=Outputs
W/o this diff
https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs
Reviewed By: mingzhe09088
Differential Revision: D32482254
fbshipit-source-id: e8f466e1c6fdcab6cfa170f44b9be70395935fb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226
**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**
This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:
- significantly better error handling and much more verbose logging (see the example output below)
- explicit support for IPv6 and dual-stack sockets
- correct handling of signal interrupts
- better Windows support
A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.
## Example Output
```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987
Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.
Reviewed By: Babar, wilson100hong, mrshenli
Differential Revision: D32372333
fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66257
Used `clang-format -i` for these two files.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D31762737
Pulled By: H-Huang
fbshipit-source-id: e94e301d0b013dbb8f2cef19ff140bac5811738f
Summary:
Found out by triggering builds against clang on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62953
Reviewed By: gchanan
Differential Revision: D30191300
Pulled By: ezyang
fbshipit-source-id: d929119768298084c41d70dbc3a78aacd64fb715
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6