pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-08 07:39:33 +01:00

Author	SHA1	Message	Date
Yuanyuan Chen	5103ecc5d8	[1/N] Fix clang-tidy readability checks (#164561 ) Check all `.cpp` files except `jit` files for readability thoroughly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164561 Approved by: https://github.com/Skylion007	2025-10-04 09:40:38 +00:00
cyy	b0be30dd79	[19/N] Fix extra warnings brought by clang-tidy-17 (#144448 ) Apply more clang-tidy fixes. There was a bug introduced by #144014 due to incorrect namespace concatenation which is reverted here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144448 Approved by: https://github.com/albanD	2025-01-09 15:58:05 +00:00
cyy	1ec76dd1dc	Enable clang-tidy on torch/csrc/distributed (#139043 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139043 Approved by: https://github.com/Skylion007	2024-10-28 13:56:54 +00:00
Tristan Rice	9ee8c18309	TCPStore: add ping to verify network connectivity on connect (#129985 ) This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent. This adds support for PING to both the libuv and legacy backend. Example error: ``` [trainer85612\|12]:W0701 13:41:43.421574 4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer [trainer85612\|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first): ... [trainer85612\|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637 [trainer85612\|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868 [trainer85612\|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775 ``` Test plan: ``` python test/distributed/test_store.py -v ``` ``` tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py starting pool started 90000 started 30000 started 70000 started 20000 started 80000 started 60000 started 0 [W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it. 
init 20000 set 20000 init 80000 set 80000 init 70000 set 70000 init 60000 set 60000 init 30000 set 30000 init 90000 set 90000 started 40000 init 40000 set 40000 started 50000 init 50000 set 50000 started 10000 init 10000 set 10000 init 0 set 0 run finished 617.2992351055145 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985 Approved by: https://github.com/rsdcastro, https://github.com/kurman	2024-07-03 02:09:44 +00:00
Tristan Rice	52d4442a00	[c10d] Socket, TCPStore: add better logging (#128673 ) This adds better logging of errors to the socket and TCPStore classes. All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged. It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky. Test plan: ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673 Approved by: https://github.com/c-p-i-o	2024-06-14 23:08:29 +00:00
Tristan Rice	7c370d2fb0	expose set_thread_name to Python and set thread names (#128448 ) This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process. Threads named: * torchrun/elastic * PyTorch dataloader worker processes + pin memory thread * TCPStore * ProcessGroupNCCL background threads * WorkerServer httpserver thread Test plan: ``` $ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL \| grep pt_' 3264281 3264281 pts/45 00:00:02 pt_elastic 3264281 3267950 pts/45 00:00:00 pt_elastic ``` dataloading ```py import torch import time from torch.utils.data import ( DataLoader, Dataset, ) class NoopDataset(Dataset): def __getitem__(self, index): return index def __len__(self): return 10 dataloader = DataLoader(NoopDataset(), num_workers=2) for i, x in enumerate(dataloader): print(i, x) time.sleep(10000) ``` ``` $ python3 ~/scripts/dataloader_test.py $ ps -eL \| grep pt_ 1228312 1228312 pts/45 00:00:02 pt_main_thread 1228312 1230058 pts/45 00:00:00 pt_main_thread 1228312 1230059 pts/45 00:00:00 pt_main_thread 1230052 1230052 pts/45 00:00:00 pt_data_worker 1230052 1230198 pts/45 00:00:00 pt_data_worker 1230052 1230740 pts/45 00:00:00 pt_data_worker 1230055 1230055 pts/45 00:00:00 pt_data_worker 1230055 1230296 pts/45 00:00:00 pt_data_worker 1230055 1230759 pts/45 00:00:00 pt_data_worker ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448 Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro	2024-06-13 16:38:23 +00:00
cyy	b60af92c17	[Distributed] [3/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123312 ) This PR continues to fix some clang-tidy warnings in distributed code, following https://github.com/pytorch/pytorch/pull/122892. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123312 Approved by: https://github.com/Skylion007	2024-04-13 11:45:00 +00:00
mantaionut	b0e65dd1b4	Fix TCP Store Windows (#118860 ) https://github.com/pytorch/pytorch/pull/107607 added a new Validate flow, but on Windows it did not call addMiscellaneousSocket. This adds the missing call to addMiscellaneousSocket on Windows. Fixes #118737 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118860 Approved by: https://github.com/awgu, https://github.com/malfet	2024-02-01 18:46:18 +00:00
Juncheng Gu	7c4e49ec80	[Fix] add validation logics to TCPStore queries (#107607 ) This PR fixes #106294. Because it lacks a request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution requires the very first query from a client to be a validation query with a predefined magic number. If the validation fails, the server terminates the connection. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607 Approved by: https://github.com/cbalioglu, https://github.com/XilunWu	2023-11-07 18:36:25 +00:00
PyTorch MergeBot	c63693ca27	Revert "[Fix] add validation logics to TCPStore queries (#107607 )" This reverts commit `50a9981217`. Reverted https://github.com/pytorch/pytorch/pull/107607 on behalf of https://github.com/huydhn due to For some reason, lint job was not run on the PR and now start failing trunk, please rebase and fix lint before relanding `50a9981217` ([comment](https://github.com/pytorch/pytorch/pull/107607#issuecomment-1791702818))	2023-11-02 23:34:08 +00:00
Juncheng Gu	50a9981217	[Fix] add validation logics to TCPStore queries (#107607 ) This PR fixes #106294. Because it lacks a request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution requires the very first query from a client to be a validation query with a predefined magic number. If the validation fails, the server terminates the connection. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607 Approved by: https://github.com/cbalioglu, https://github.com/XilunWu	2023-11-02 22:12:45 +00:00
cyy	3ec33957eb	[1/N] Enable Wunused-result and Wunused-variable in torch targets (#110722 ) They are useful for checking results of function calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110722 Approved by: https://github.com/Skylion007	2023-10-08 23:43:45 +00:00
Pritam Damania	704b0b3c67	[RESUBMIT] Standardize on error types for distributed errors. (#108191 ) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy error-handling code, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, this PR adds these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191 Approved by: https://github.com/H-Huang	2023-08-30 21:47:39 +00:00
Rodrigo Kumpera	fe2cda64dc	[C10D] Implement new libuv backend for TCPStore. (#108066 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. This is a reland of #105870 with a fix for a bad test. Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066 Approved by: https://github.com/H-Huang, https://github.com/fduwjj	2023-08-29 14:55:14 +00:00
PyTorch MergeBot	d4ff06ec84	Revert "Standardize on error types for distributed errors. (#107651 )" This reverts commit `0e2317479b`. Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))	2023-08-28 23:58:33 +00:00
Pritam Damania	0e2317479b	Standardize on error types for distributed errors. (#107651 ) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy error-handling code, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, this PR adds these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651 Approved by: https://github.com/H-Huang	2023-08-28 21:58:15 +00:00
PyTorch MergeBot	d3f92ca9e9	Revert "[C10D] Implement new libuv backend for TCPStore. (#105870 )" This reverts commit `3c841163ce`. Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))	2023-08-17 23:41:00 +00:00
Rodrigo Kumpera	3c841163ce	[C10D] Implement new libuv backend for TCPStore. (#105870 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870 Approved by: https://github.com/H-Huang	2023-08-17 20:40:32 +00:00
Rodrigo Kumpera	c9c66819a1	Move more TCPStore state from BackgroundThread to TCPStoreMasterDaemon as it won't be used by the libuv backend. (#105674 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105674 Approved by: https://github.com/H-Huang ghstack dependencies: #105163, #105164, #105184, #105672	2023-07-31 20:10:16 +00:00
Rodrigo Kumpera	a361fceef3	[C10d] Move TCPStoreMasterDaemon to TCPStoreBackend. (#105184 ) This makes TCPServer interface with the store server through BackgroundThread. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105184 Approved by: https://github.com/fduwjj	2023-07-25 21:59:12 +00:00
Rodrigo Kumpera	fe284b0d97	[C10D] Extract some bits of TCPStore into TCPStoreBackend. (#105163 ) This moves BackgroundThread to TCPStoreBackend.hpp. This will eventually be the interface shared between the current TCPStore backend and the new libuv one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105163 Approved by: https://github.com/fduwjj, https://github.com/H-Huang	2023-07-25 17:59:15 +00:00
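
Commits 0e2317479b and 704b0b3c67 above describe replacing string matching on exception messages with dedicated error types. The following is a minimal sketch of how such a hierarchy lets callers branch on type. The classes here are local stand-ins defined for illustration, not imports from torch; the helper `classify` is hypothetical, and having every subtype derive directly from `DistError` is an assumption based on the description of `DistError` as "the base type of all distributed errors".

```python
# Stand-in classes for illustration only; they are NOT imported from torch.
class DistError(Exception):
    """Base type of all distributed errors (assumed root of the hierarchy)."""


class DistBackendError(DistError):
    """Errors from a process group backend."""


class DistStoreError(DistError):
    """Errors originating from the store."""


class DistNetworkError(DistError):
    """General network errors coming from the socket layer."""


def classify(exc: Exception) -> str:
    """Hypothetical helper: map an exception to a coarse category by type."""
    try:
        raise exc
    except DistNetworkError:
        return "network"
    except DistStoreError:
        return "store"
    except DistBackendError:
        return "backend"
    except DistError:
        # Any other distributed error falls through to the base type.
        return "distributed"
    except Exception:
        return "other"
```

With types like these, a handler can use `except DistNetworkError:` to decide whether to retry instead of searching the message for strings such as "Connection reset by peer".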

21 Commits
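
Commit 7c4e49ec80 above describes requiring the first query from a client to be a validation query carrying a predefined magic number, and terminating the connection otherwise. The sketch below illustrates that idea only. The command code `VALIDATE`, the value of `MAGIC`, the wire layout, and both function names are hypothetical and are not taken from the TCPStore source.

```python
import struct

# Hypothetical values and layout, chosen only for this sketch.
VALIDATE = 0x01          # assumed command code for a validation query
MAGIC = 0x1234ABCD       # assumed magic number
HEADER = struct.Struct("<BI")  # 1-byte command + 4-byte little-endian magic


def encode_validate() -> bytes:
    """Build the first message a well-behaved client would send."""
    return HEADER.pack(VALIDATE, MAGIC)


def accept_first_query(payload: bytes) -> bool:
    """Return True if the first bytes form a valid validation query.

    A server following this scheme would close the connection on False.
    """
    if len(payload) < HEADER.size:
        return False
    command, magic = HEADER.unpack(payload[:HEADER.size])
    return command == VALIDATE and magic == MAGIC
```

Traffic from a port scanner, for example bytes beginning with `GET /`, fails this check, so the server never goes on to interpret it as a store query.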