Summary:
This diff periodically (e.g., every 30s) logs critical collective progress
status to a Scuba table, starting with a few metrics such as the last
enqueued seq ID.
With the Scuba table, we hope to easily detect the straggler of a PG,
e.g., a rank whose seq_ has not progressed for X seconds while other ranks in the same PG have a larger seq_.
The implementation makes sure that Scuba is used only for FB-internal use
cases.
For OSS, we still provide a generic logger data struct and a logger that can be
easily extended. If users do not register the logger, nothing will be logged.
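For illustration, a minimal sketch of the kind of extension point described above; the type and function names here are hypothetical, not the actual c10d API:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class WorkProgressData:
    """Generic progress record (illustrative fields only)."""
    pg_name: str = ""
    rank: int = -1
    last_enqueued_seq: int = -1
    extra: Dict[str, str] = field(default_factory=dict)

class WorkProgressLogger:
    """Base class users could extend with their own backend (e.g., Scuba internally)."""
    def log(self, data: WorkProgressData) -> None:
        raise NotImplementedError

_registered_logger: Optional[WorkProgressLogger] = None

def register_work_progress_logger(logger: WorkProgressLogger) -> None:
    global _registered_logger
    _registered_logger = logger

def maybe_log(data: WorkProgressData) -> None:
    # If no logger has been registered, this is a no-op: nothing is logged.
    if _registered_logger is not None:
        _registered_logger.log(data)
```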
Test Plan:
Re-used the existing unit tests for the FB side of operations, such as
test_register_and_dump in test_c10d_manifold, and changed the dump period to a
very small number, e.g., 1ms. Verified that the logs are correctly shown in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy
Reviewed By: wconstab
Differential Revision: D54556219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta-internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is already the convention in most PyTorch components. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about the Meta-internal complications.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73731
During bucket rebuilding, in addition to syncing bucket_indices, per_bucket_limits would also need to be synced before calling initialize_buckets(). Syncing per_bucket_limits would increase communication volume as well as code complexity. After taking a closer look at the code, the per_bucket_limits used inside initialize_buckets() is actually not useful: it only assigns the bucket_size_limit property of the bucket struct, and that property is not used anywhere. So it is good to remove this property and avoid syncing per_bucket_limits.
Differential Revision: [D34605513](https://our.internmc.facebook.com/intern/diff/D34605513/)
Approved by: https://github.com/rohan-varma
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72456
It is easier to log whether static graph is set at construction time now that it is natively supported in the DDP constructor, as opposed to waiting for the first iteration to finish. In some failure cases we're seeing, the first iteration does not finish, and so we don't have this data, which is valuable for debugging.
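A minimal, single-process sketch of the construction-time usage this refers to (the gloo backend, world size 1, and the toy model are only for illustration):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)
# static_graph is passed at construction, so it can be logged immediately
# instead of only after the first iteration completes.
ddp_model = DDP(model, static_graph=True)

dist.destroy_process_group()
```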
ghstack-source-id: 148840679
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34045204
fbshipit-source-id: 72a187c1ce031db217de4b3ad20a64f2a74995bc
(cherry picked from commit 1d622c88f3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67059
While debugging some workflows, sometimes the training does not finish, but I
still want to know whether or not the graph was static. Also, log 0 for the
unused parameter size if no unused params were found.
ghstack-source-id: 141428950
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D31846669
fbshipit-source-id: 21763fcdc1b244ba829117da1f15b2271d966983
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680
Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target them for perf
optimization.
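For reference, a minimal sketch of the pattern this logging is meant to detect (module names and shapes are arbitrary):

```python
import torch

# A model whose BatchNorm layers are converted to SyncBatchNorm before DDP
# wrapping; this is the kind of workload the new logging field identifies.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# In a real job the converted model would then be moved to a GPU and wrapped, e.g.:
#   ddp_model = DistributedDataParallel(model.cuda(rank), device_ids=[rank])
# (SyncBatchNorm layers require GPU modules, typically with the NCCL backend.)
print(model)
```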
ghstack-source-id: 140875182
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D31679477
fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772
Looking at some workloads, it would be useful to have this info.
ghstack-source-id: 140555200
Test Plan: CI
Reviewed By: zhaojuanmao, wayi1
Differential Revision: D31224417
fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62751
This will help us determine whether the gradient ready order and bucket
indices are aligned across all ranks. This should always be true for rank 0,
since we determine the rebuilt bucket order from the gradient ready order on rank 0,
but it would be interesting to see this on different workloads for the other ranks.
ghstack-source-id: 135104369
Test Plan: CI
Reviewed By: SciPioneer, wanchaol
Differential Revision: D30111833
fbshipit-source-id: a0ab38413a45022d953da76384800bee53cbcf9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232
Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. This will be used to verify how changing bucket sizes in DDP affects perf.
Based on the test, we can see an inconsistency in where the "first" bucket actually is (it is the last bucket before buckets are rebuilt, and the first one after).
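A minimal, single-process sketch of how the bucket size config is set and how the logged data can be inspected; _get_ddp_logging_data() is a private helper and the exact field names (e.g., "bucket_sizes") may vary across PyTorch versions:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)
# bucket_cap_mb controls the target gradient-bucket size; the logged bucket
# sizes record what a given workflow actually ran with.
ddp_model = DDP(model, bucket_cap_mb=10)

# Dump the collected DDP logging fields (keys depend on the PyTorch version).
logging_data = ddp_model._get_ddp_logging_data()
print({k: v for k, v in logging_data.items() if "bucket" in k})

dist.destroy_process_group()
```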
ghstack-source-id: 134663867
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29922299
fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61871
When set_static_graph=False, the only type of dynamism we really
support in DDP is a dynamic set of unused parameters, which must be explicitly
enabled with find_unused_parameters=True. However, some workflows have a static
set of unused parameters, so it would be good to detect this and add it to the logging to
identify workflows that are candidates for the static graph optimization.
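A minimal, single-process sketch of such a workload: the same parameter is unused on every iteration, so the unused set is static even though find_unused_parameters=True is required (backend, world size, and the toy module are only for illustration):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)

class PartiallyUsed(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # never touched in forward()

    def forward(self, x):
        return self.used(x)

# find_unused_parameters=True is required because self.unused never receives a
# gradient, yet the unused set is identical every iteration (i.e., static), so
# this model is a candidate for the static graph optimization.
ddp_model = DDP(PartiallyUsed(), find_unused_parameters=True)
ddp_model(torch.randn(2, 4)).sum().backward()

dist.destroy_process_group()
```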
ghstack-source-id: 134371429
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29773962
fbshipit-source-id: 1f741984c6e6f8e3e55cf69ca719b1e25a485b13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6