We have a plethora of error types for the various errors raised from c10d, including `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError`, etc.
This results in messy error-handling code like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types (a usage sketch follows the list):
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and refers to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
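With the hierarchy in place, callers can catch typed exceptions instead of matching substrings. A minimal sketch, assuming these types are exposed on `torch.distributed` (verify the exact import location against the PR):
```
import torch.distributed as dist

try:
    dist.init_process_group("nccl")
except dist.DistStoreError:
    ...  # e.g. store-based barrier timed out
except dist.DistNetworkError:
    ...  # e.g. broken pipe, connection reset by peer
except dist.DistBackendError:
    ...  # backend (e.g. NCCL) failure
except dist.DistError:
    ...  # any other distributed error
```
Since `DistError` is the base of the hierarchy, the subclass handlers must come before it, as in the sketch.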
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
The feature was never fully finished and never got any adoption, yet TCPStore pays the cost of twice the number of TCP connections anyway.
While the cost of all those idle connections is minimal, it doesn't come for free:
- It increases the likelihood of a connection-refused failure during the initialization stampede.
- TCPStore uses `poll` to check for socket availability, which scales linearly with the number of sockets regardless of their status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105014
Approved by: https://github.com/fduwjj
The current TCPStore wait logic leaves the client socket in a bad state if waiting times out.
This happens because all recv functions simply raise an exception on timeout;
the problem is that on timeout we also need to unregister the wait.
We implement this with client-side cancellation by adding a new CANCEL_WAIT instruction.
If no data arrives before the deadline, the client sends a CANCEL_WAIT command,
and the server always replies to it with a WAIT_CANCELED response.
This gets us down to the last issue: there is a race between timing out,
canceling the wait, and the wait completing. The client needs to handle the server
sending a STOP_WAITING followed by a WAIT_CANCELED answer.
This ensures client and server state stay synchronized regardless of whether the wait
times out or not.
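A minimal sketch of the client side of this handshake in Python; `_send` and `_recv_response` are hypothetical framing helpers, not the actual TCPStore API:
```
import select

def wait_with_cancel(sock, keys, timeout):
    _send(sock, "WAIT", keys)
    ready, _, _ = select.select([sock], [], [], timeout)
    if ready:
        # The normal path: the wait completed before the deadline.
        assert _recv_response(sock) == "STOP_WAITING"
        return
    # Timed out: cancel the wait instead of leaving the socket dirty.
    _send(sock, "CANCEL_WAIT")
    response = _recv_response(sock)
    if response == "STOP_WAITING":
        # Race: the wait completed just before the cancellation arrived;
        # the server still always follows up with WAIT_CANCELED.
        response = _recv_response(sock)
    assert response == "WAIT_CANCELED"
    raise TimeoutError(f"wait({keys}) timed out after {timeout}s")
```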
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100594
Approved by: https://github.com/H-Huang
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.
Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).
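A minimal sketch of the race-free flow, assuming the Python `TCPStore` constructor exposes the new file descriptor parameter as `master_listen_fd` (check the unit test in the PR for the exact API):
```
import socket
from torch.distributed import TCPStore

# Bind to port 0 so the kernel assigns a free port, and keep the socket
# open so no other process can claim that port in the meantime.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("localhost", 0))
port = sock.getsockname()[1]

# Hand the already-bound socket to the TCPStore server instead of closing
# it and re-binding, which is where the race used to live.
store = TCPStore("localhost", port, 1, is_master=True,
                 master_listen_fd=sock.fileno())
```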
Test Plan:
Added new unit test:
buck2 test caffe2/test/distributed:store
Differential Revision: D46622317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
Accumulate data in a local buffer prior to sending it. This reduces
the number of syscalls and network packets.
We flush every 1440 bytes to cap the amount of temporary memory.
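A minimal sketch of the idea in Python; `SendBuffer` here is a hypothetical stand-in for the C++ buffer, not the actual implementation:
```
FLUSH_THRESHOLD = 1440  # bytes buffered before an eager flush

class SendBuffer:
    def __init__(self, sock):
        self.sock = sock
        self.buf = bytearray()

    def append(self, data: bytes):
        # Accumulate locally; only hit the socket at the flush threshold.
        self.buf += data
        if len(self.buf) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        # One syscall (and typically one packet) for many small appends.
        if self.buf:
            self.sock.sendall(bytes(self.buf))
            self.buf.clear()
```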
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100742
Approved by: https://github.com/fduwjj
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.
The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)
Exclude several files from sign-compare if flash attention is used, due to the violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Do not try to fix sign compare violations in caffe2 codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
Not only is this change usually shorter and more readable, it can also yield better performance: `size()` is not always a constant-time operation (e.g., on linked lists), but `empty()` always is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
…c10d
Fixes a broken header filter from #90699 and applies a few more relevant clang-tidy fixes from c10 and c10d. The header filter pattern was actually broken and the clang-tidy include pattern was redundant. Also fixes a few bugs in torch/distributed/c10d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91178
Approved by: https://github.com/ezyang
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., `<c10d/...>`. However, relative paths cannot be gracefully handled by Meta's internal build when the NCCL PG is hipified to support AMD/RCCL, because the hipified header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute, e.g., `<torch/csrc/distributed/c10d/...>`.
See D39835774 for more details about the Meta-internal complications.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify that no relative-path uses of torch/csrc/distributed headers remain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
Summary:
The main thread establishes a dedicated stop-signal pipe for each TCPStore
background thread. Before joining a background thread, the main thread would
close the write end of the corresponding pipe, expecting the background
thread to receive POLLHUP. Upon receiving POLLHUP, the background thread would
break out of its loop and exit gracefully.
Although we haven't found any documentation or literature backing this, we have
evidence that under certain circumstances the read end of the pipe won't
receive POLLHUP when the write end is closed. However, under the same
circumstances, writing to the pipe guarantees that POLLIN is received on the
read end.
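A minimal sketch of the fix in Python (a stand-in for the C++ logic; names are hypothetical): the main thread writes a byte to the pipe before joining, so the background thread is woken up by POLLIN instead of relying on POLLHUP.
```
import os
import select

stop_read_fd, stop_write_fd = os.pipe()

def background_loop(listen_fd):
    poller = select.poll()
    poller.register(stop_read_fd, select.POLLIN)
    poller.register(listen_fd, select.POLLIN)
    while True:
        for fd, _event in poller.poll():
            if fd == stop_read_fd:
                return  # stop signal received: exit gracefully
            # ... otherwise service the ready client socket ...

def stop(thread):
    # Write to the pipe (guaranteed POLLIN on the read end) instead of
    # closing the write end and hoping for POLLHUP.
    os.write(stop_write_fd, b"\0")
    thread.join()
```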
Test Plan: Manually tested
Differential Revision: D36208897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76973
Approved by: https://github.com/cbalioglu
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73595
On some platforms, this might be a type distinct from a 64-bit signed
integral type, and the compiler might warn about the implicit coercion.
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D34558344
Pulled By: dagitses
fbshipit-source-id: 7a07d0723688390a7f27e6e71480d22c1e077200
(cherry picked from commit 8b807c4431f909c20b3133a6522b58d7b9ab0d85)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69590
The variable `callbackRegisteredData_` was written to without
synchronization.
ghstack-source-id: 145066862
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D32938979
fbshipit-source-id: bc9a11a70680db45ece95880ae19ce2026e8a88e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68499
TCPStore is actually accessed from multiple threads (e.g., the NCCL watchdog thread) but has no mutex protection, while FileStore and HashStore do. Since enabling desync root-cause analysis makes store calls more frequent, the race condition in TCPStore was always triggered when creating another process group such as gloo. This diff adds a mutex to TCPStore, bringing it in line with FileStore and HashStore.
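A minimal sketch of the pattern in Python (a hypothetical analogue of the C++ change; `_send_cmd`/`_recv_value` are stand-in helpers): a single mutex serializes every operation on the shared connection.
```
import threading

class Store:
    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()  # serializes all store operations

    def set(self, key: str, value: bytes) -> None:
        with self._lock:
            self._send_cmd(b"SET", key, value)

    def get(self, key: str) -> bytes:
        with self._lock:
            self._send_cmd(b"GET", key)
            return self._recv_value()
```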
Test Plan:
DDP benchmark with desync debug enabled, no perf regression
https://www.internalfb.com/intern/fblearner/details/309398285?tab=Outputs
W/o this diff
https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs
Reviewed By: mingzhe09088
Differential Revision: D32482254
fbshipit-source-id: e8f466e1c6fdcab6cfa170f44b9be70395935fb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226
**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**
This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:
- significantly better error handling and much more verbose logging (see the example output below)
- explicit support for IPv6 and dual-stack sockets
- correct handling of signal interrupts
- better Windows support
A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.
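A minimal sketch in Python of two of the behaviors listed above, dual-stack listening and retrying while the server socket is not yet listening (a stand-in for the C++ `Socket` class, not the actual implementation):
```
import socket
import time

def listen_dual_stack(port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    # Clear IPV6_V6ONLY so the IPv6 socket also accepts IPv4 clients.
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s.bind(("::", port))
    s.listen()
    return s

def connect_with_retry(host: str, port: int, timeout: float) -> socket.socket:
    deadline = time.monotonic() + timeout
    while True:
        try:
            return socket.create_connection((host, port))
        except ConnectionRefusedError:
            # Server not yet listening (errno 111); retry until the deadline.
            if time.monotonic() >= deadline:
                raise
            time.sleep(1)
```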
## Example Output
```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987
Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.
Reviewed By: Babar, wilson100hong, mrshenli
Differential Revision: D32372333
fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66257
Used `clang-format -i` for these two files.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D31762737
Pulled By: H-Huang
fbshipit-source-id: e94e301d0b013dbb8f2cef19ff140bac5811738f
Summary:
Found by triggering builds with clang on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62953
Reviewed By: gchanan
Differential Revision: D30191300
Pulled By: ezyang
fbshipit-source-id: d929119768298084c41d70dbc3a78aacd64fb715
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Now that c10d is part of libtorch, it would also be nice if its sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6