Commit Graph

29 Commits

Author SHA1 Message Date
Yuanyuan Chen
36871622f1 [2/N] Mark unused parameters in C++ code (#165121)
This is a follow-up to #164912, marking unused C++ parameters to improve code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165121
Approved by: https://github.com/Skylion007
2025-10-15 03:04:39 +00:00
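
For context, here is a minimal standalone sketch of the two common idioms for marking an unused C++ parameter (leaving it unnamed, or annotating it with C++17's `[[maybe_unused]]`). This is illustrative only, not the actual diff from #165121; either form silences -Wunused-parameter while making the intent explicit to readers.

```cpp
#include <cstddef>

// Idiom 1: leave the parameter unnamed. The signature is unchanged, but the
// compiler no longer warns about the unused argument, and readers can see
// the omission is intentional.
void callback(int /*event_id*/, void* /*user_data*/) {
  // nothing to do in this backend
}

// Idiom 2: keep the name (useful for documentation or conditional use) and
// annotate it with [[maybe_unused]] (C++17).
int process(std::size_t count, [[maybe_unused]] bool verbose) {
  return static_cast<int>(count) * 2;
}

int main() {
  callback(0, nullptr);
  return process(21, /*verbose=*/true) == 42 ? 0 : 1;
}
```
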
Xuehai Pan
d55dc00f84 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-23 02:57:50 +00:00
PyTorch MergeBot
4b55871e06 Revert "[BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)"
This reverts commit c95f7fa874.

Reverted https://github.com/pytorch/pytorch/pull/156321 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156321#issuecomment-2994163667))
2025-06-22 12:27:36 +00:00
Xuehai Pan
c95f7fa874 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-22 08:43:49 +00:00
fduwjj
ff92b42fc3 [c10d][gloo] Integrate vendor generic FR into gloo (#152614)
This is a first quick prototype of FR integration for Gloo. A few feature gaps remain:
- Input/output numels for each collective
- Whether to use c10::Event, and where to use it
- Where to dump the FR traces (the dump API is provided in this PR)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
2025-06-03 16:12:54 +00:00
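
For context, a hedged standalone sketch of the general flight-recorder idea referenced above: a fixed-size ring buffer of recent collective records plus a dump API. The type and field names here are illustrative, not the actual c10d/Gloo implementation.

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>

// A vendor-agnostic "flight recorder" keeps the last N collective records in
// a ring buffer so they can be dumped for debugging after a hang.
struct CollectiveRecord {
  std::size_t seq_id = 0;       // per-process-group sequence number
  std::string op_name;          // e.g. "allreduce", "broadcast"
  std::size_t input_numel = 0;  // one of the gaps noted in the PR message
  bool completed = false;
};

template <std::size_t N>
class FlightRecorder {
 public:
  void record(CollectiveRecord entry) {
    entries_[next_ % N] = std::move(entry);
    ++next_;
  }

  // Dump API: print the retained records, oldest first.
  void dump(std::ostream& os) const {
    const std::size_t count = next_ < N ? next_ : N;
    const std::size_t start = next_ < N ? 0 : next_ - N;
    for (std::size_t i = 0; i < count; ++i) {
      const auto& e = entries_[(start + i) % N];
      os << "seq=" << e.seq_id << " op=" << e.op_name
         << " numel=" << e.input_numel << " completed=" << e.completed << '\n';
    }
  }

 private:
  std::array<CollectiveRecord, N> entries_{};
  std::size_t next_ = 0;
};

int main() {
  FlightRecorder<4> fr;
  for (std::size_t i = 0; i < 6; ++i) {
    fr.record({i, i % 2 ? "allreduce" : "broadcast", 1024, i < 5});
  }
  fr.dump(std::cout);  // only the most recent 4 records survive
}
```
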
Chirag Pandya
1cd70e7e23 [fr][c10d] log trace capture enabled or not in flight recorder (#143865)
Summary:
Refactor logging for the flight recorder so we can log whether the capture ran with or without stack trace capture enabled.
We introduce a new column ('trace_enabled') in the logger.

Test Plan:
Tested on a local job and noted that the correct output was produced.
Internal link: https://fburl.com/scuba/c10d_flight_recorder/ulhqnmhg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143865
Approved by: https://github.com/fduwjj
2024-12-27 03:07:55 +00:00
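
A small illustrative sketch of the change described above: the logged row gains a boolean column recording whether stack-trace capture was enabled. The struct and field names are hypothetical, not the internal logger schema.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical logged row for a flight-recorder capture; the point of the
// change is the extra boolean column recording whether stack-trace capture
// was enabled when the traces were collected.
struct FlightRecorderLogRow {
  int64_t rank = 0;
  int64_t num_entries = 0;
  bool trace_enabled = false;  // new column
};

int main() {
  FlightRecorderLogRow row{/*rank=*/3, /*num_entries=*/128, /*trace_enabled=*/true};
  std::cout << "rank=" << row.rank << " num_entries=" << row.num_entries
            << " trace_enabled=" << row.trace_enabled << '\n';
}
```
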
cyy
ea61c9cb29 [Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043
Approved by: https://github.com/ezyang
2024-04-23 00:43:50 +00:00
cyy
c2596fd3e0 [Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032
Approved by: https://github.com/Skylion007
2024-04-16 00:42:18 +00:00
Shuqiang Zhang
443444dc7f [c10d] Add generic scuba logging capability into c10d (#121859)
Summary:
This diff tries to periodically (e.g., every 30s) log critical collective progress status to a Scuba table, starting with a few metrics such as the last enqueued seq id.

With the Scuba table, we hope to easily detect the straggler of a PG, e.g., a rank whose seq_ has not progressed for X seconds while other ranks in the same PG have a larger seq_.

The implementation needs to make sure that Scuba will be used only for FB internal use
cases.

For OSS, we still provide a generic logger data struct and logger that can be
easily extended. If users do not register the logger, nothing will be logged.

Test Plan:
Re-use the existing unit test for the FB side of operations, such as test_register_and_dump in test_c10d_manifold, and change the dump period to a very small number, e.g., 1ms. Verified that the logs are correctly shown in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy

Reviewed By: wconstab

Differential Revision: D54556219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
2024-03-14 16:03:45 +00:00
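
A minimal sketch of the register-a-logger-or-log-nothing pattern described above, using illustrative names rather than the actual c10d API: an internal build could register a Scuba-backed logger, while OSS users register their own or none at all. Keeping registration optional means the periodic path pays only a null check when no logger is installed.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <utility>

// Data that gets logged periodically; illustrative fields only.
struct C10dLoggingData {
  std::string pg_name;
  int64_t rank = 0;
  int64_t last_enqueued_seq = 0;  // used to spot stragglers within a PG
};

// Extension point: internal builds can register a Scuba-backed logger,
// OSS users can register their own (or none).
class C10dLogger {
 public:
  virtual ~C10dLogger() = default;
  virtual void log(const C10dLoggingData& data) = 0;
};

std::unique_ptr<C10dLogger>& registeredLogger() {
  static std::unique_ptr<C10dLogger> logger;
  return logger;
}

void registerC10dLogger(std::unique_ptr<C10dLogger> logger) {
  registeredLogger() = std::move(logger);
}

// If no logger is registered, this is just a null check and nothing is logged.
void maybeLog(const C10dLoggingData& data) {
  if (auto& logger = registeredLogger()) {
    logger->log(data);
  }
}

// Example OSS logger that writes to stderr.
class StderrLogger : public C10dLogger {
 public:
  void log(const C10dLoggingData& data) override {
    std::cerr << "pg=" << data.pg_name << " rank=" << data.rank
              << " last_enqueued_seq=" << data.last_enqueued_seq << '\n';
  }
};

int main() {
  maybeLog({"pg0", 0, 7});  // no logger registered: silently skipped
  registerC10dLogger(std::make_unique<StderrLogger>());
  maybeLog({"pg0", 0, 8});  // now emitted
}
```
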
Shen Li
dd6319198d Apply clang-format to distributed/c10d folder (#107140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107140
Approved by: https://github.com/H-Huang
2023-08-14 23:16:38 +00:00
Kazuaki Ishizaki
2973994259 fix typo in comments under torch/csrc/distributed (#96062)
This PR fixes typos in comments and messages of `.cpp` and `.hpp` files under the `torch/csrc/distributed` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96062
Approved by: https://github.com/ngimel
2023-03-07 02:56:41 +00:00
Min Si
1ad0048b64 Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by Meta's internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the standard practice in most PyTorch components. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about the Meta-internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify that we don't miss any relative-path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
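
Illustratively, the refactor turns the first include form below into the second (assuming the PyTorch source root is on the include path); `<c10d/Store.hpp>` is the relative form mentioned in the revert entry below.

```cpp
// Old form, resolved only because "-I./torch/csrc/distributed" was passed:
// #include <c10d/Store.hpp>

// New form, rooted at the repository top level, so the same line still works
// when the header is hipified into another output directory:
#include <torch/csrc/distributed/c10d/Store.hpp>
```
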
PyTorch MergeBot
a50d8864fc Revert "Refactor distributed to use absolute header path (#85780)"
This reverts commit 668082718a.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
Min Si
668082718a Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by Meta's internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the standard practice in most PyTorch components. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about the Meta-internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify that we don't miss any relative-path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
yanlizhao
1c35f37c9f remove bucket_size_limit property from bucket struct
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73731

During bucket rebuilding, in addition to syncing bucket_indices, per_bucket_limits would need to be synced as well before calling initialize_buckets(). Syncing per_bucket_limits would increase communication volume as well as code complexity. After taking a further look at the code, the per_bucket_limits used inside initialize_buckets() is actually not useful: it only assigns the bucket_size_limit property to the bucket struct, but that property is not used anywhere. So it is good to remove this property and avoid syncing per_bucket_limits.

Differential Revision: [D34605513](https://our.internmc.facebook.com/intern/diff/D34605513/)

Approved by: https://github.com/rohan-varma
2022-04-25 15:30:09 +00:00
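
A simplified sketch of the shape of this change (illustrative names, not the real reducer code): the unused per-bucket limit field is dropped, so rebuilding buckets only needs the synced bucket indices.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified sketch (not the real reducer code): the per-bucket size limit
// used to be stored on each bucket but was never read, so the field and the
// per_bucket_limits argument that fed it can both be dropped.
struct Bucket {
  std::vector<std::size_t> variable_indices;  // which parameters land here
  // std::size_t bucket_size_limit;           // removed: assigned but unused
};

// Rebuilding buckets now only needs the synced bucket indices; there is no
// per_bucket_limits vector to sync alongside them.
std::vector<Bucket> initialize_buckets(
    std::vector<std::vector<std::size_t>> bucket_indices) {
  std::vector<Bucket> buckets;
  buckets.reserve(bucket_indices.size());
  for (auto& indices : bucket_indices) {
    buckets.push_back(Bucket{std::move(indices)});
  }
  return buckets;
}

int main() {
  auto buckets = initialize_buckets({{0, 1}, {2, 3, 4}});
  return buckets.size() == 2 ? 0 : 1;
}
```
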
Andrew Gu
a37d54b6d1 [Easy][c10d][DDP] (Reland) Minor fixes (#73569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73569

Reland https://github.com/pytorch/pytorch/pull/73299 and https://github.com/pytorch/pytorch/pull/73318 together.

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D34552418

Pulled By: awgu

fbshipit-source-id: 95088d2c1c67cd4fb9bbb115e15ba6b26ae06bdb
(cherry picked from commit 695ebc3dc0ccb08a167445588c293b3a6c3c00b7)
2022-03-03 14:30:54 +00:00
Yanli Zhao
fb6977cbd5 Back out "[DDP][BE] Fix clang-tidy" (#73522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73522

Original commit changeset: 41bcb2b9f617

Original Phabricator Diff: D34422578 (590685dc6e)
ghstack-source-id: 150128504

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D34527890

fbshipit-source-id: 33ca3c4d66cbae29a9d05e45e5886a4bd2c55b02
(cherry picked from commit 0e7f7299a2f851a899c667cdba7d4e6cb8f84fde)
2022-03-01 04:35:29 +00:00
Andrew Gu
590685dc6e [DDP][BE] Fix clang-tidy (#73299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73299

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D34422578

Pulled By: awgu

fbshipit-source-id: 41bcb2b9f617d7a8c446e72250d415e05f8e8b31
(cherry picked from commit 0e534171f05dbc55c4b4496888b25ab1a494b97c)
2022-02-25 15:37:29 +00:00
Rohan Varma
4feef6c970 Log static graph in constructor if it is set (#72456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72456

It is easier to log whether static graph is set at construction time, now that it is natively supported in the DDP constructor, as opposed to waiting for the first iteration to finish. In some failure cases we see that the first iteration does not finish, and thus we don't have this data, which is valuable for debugging.
ghstack-source-id: 148840679

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34045204

fbshipit-source-id: 72a187c1ce031db217de4b3ad20a64f2a74995bc
(cherry picked from commit 1d622c88f3)
2022-02-11 15:55:09 +00:00
Rohan Varma
2ca552160b [DDP] logging improvements (#67059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67059

When debugging some workflows, the training sometimes does not finish, but I still want to know whether or not the graph was static. Also, log 0 for the unused parameter size if no unused params were found.
ghstack-source-id: 141428950

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31846669

fbshipit-source-id: 21763fcdc1b244ba829117da1f15b2271d966983
2021-10-26 13:18:00 -07:00
Rohan Varma
bff64e84cd [DDP] Track models with sync bn (#66680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680

Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models with sync BN so we can find workflows that use them and target them for perf optimization.
ghstack-source-id: 140875182

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31679477

fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
2021-10-18 22:31:52 -07:00
Rohan Varma
1e47181c47 [DDP Logging] Add iteration in error reporting (#65772)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772

Looking at some workloads, it would be useful to have this info.
ghstack-source-id: 140555200

Test Plan: CI

Reviewed By: zhaojuanmao, wayi1

Differential Revision: D31224417

fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
2021-10-14 22:29:36 -07:00
Rohan Varma
44fad84bca [DDP] Add host-side time to CUDATimer (#62770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62770

Adding timing of the forward pass, backward computation, backward communication, etc. will help detect desynchronization issues.
ghstack-source-id: 135195680

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30115585

fbshipit-source-id: 509bf341c5c92dcc63bdacd3c1e414da4eb4f321
2021-08-06 18:41:40 -07:00
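
A hedged sketch of the host-side timing idea, using plain std::chrono rather than the actual CUDATimer type: recording host timestamps around each DDP phase helps spot a rank that is stuck in a phase.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Host-side timestamps (microseconds) around each DDP phase; a phase whose
// end never gets stamped points at the rank that is stuck.
struct HostTimer {
  using Clock = std::chrono::steady_clock;

  static int64_t nowUs() {
    return std::chrono::duration_cast<std::chrono::microseconds>(
               Clock::now().time_since_epoch())
        .count();
  }

  int64_t forward_start_us = 0;
  int64_t forward_end_us = 0;
  int64_t backward_comm_start_us = 0;
  int64_t backward_comm_end_us = 0;
};

int main() {
  HostTimer t;
  t.forward_start_us = HostTimer::nowUs();
  std::this_thread::sleep_for(std::chrono::milliseconds(2));  // stand-in for forward
  t.forward_end_us = HostTimer::nowUs();
  std::cout << "forward host time (us): "
            << (t.forward_end_us - t.forward_start_us) << '\n';
}
```
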
Rohan Varma
9ac56ef0fc [DDP] log gradient ready order and bucket indices (#62751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62751

This will help us determine whether the gradient ready order and bucket indices are aligned amongst all the ranks. This should always be true for rank 0, as we determine the rebuilt bucket order by the gradient ready order on rank 0, but we would be interested to see this on different workloads for the other ranks.
ghstack-source-id: 135104369

Test Plan: CI

Reviewed By: SciPioneer, wanchaol

Differential Revision: D30111833

fbshipit-source-id: a0ab38413a45022d953da76384800bee53cbcf9f
2021-08-05 16:36:25 -07:00
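
A tiny illustrative check of what this logging enables (hypothetical data, not the reducer's internals): comparing the per-rank gradient ready order with the rebuilt bucket order derived from rank 0.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Sketch of the check this logging enables: compare the order in which
// parameter gradients became ready on this rank (recorded by autograd hooks)
// with the rebuilt bucket order, which DDP derives from rank 0's ready order.
int main() {
  // Indices of parameters, in the order their gradients became ready here.
  std::vector<std::size_t> grad_ready_order = {4, 3, 1, 0, 2};
  // Flattened parameter order implied by the rebuilt buckets.
  std::vector<std::size_t> rebuilt_bucket_order = {4, 3, 1, 0, 2};

  const bool aligned = (grad_ready_order == rebuilt_bucket_order);
  std::cout << "ready order matches rebuilt bucket order: " << aligned << '\n';
  // On rank 0 this should always hold; on other ranks a mismatch points to a
  // different gradient ready order than the one the buckets were built from.
}
```
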
Rohan Varma
4d5607bb25 [Reland][DDP] log bucket sizes (#62625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62625

reland of https://github.com/pytorch/pytorch/pull/62232 which ran into a land race.

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30058217

fbshipit-source-id: 1454dd481e630f3de9ec6111b3f2e18cd8976091
2021-08-03 10:55:46 -07:00
Eli Uriegas
bd9f35313a Revert D29922299: [DDP] log bucket sizes
Test Plan: revert-hammer

Differential Revision:
D29922299 (5429f68f00)

Original commit changeset: 538b331c96e7

fbshipit-source-id: 3595fe04e8dea38bc9d05e8c70f2dcd2ad496ced
2021-07-30 20:27:36 -07:00
Rohan Varma
5429f68f00 [DDP] log bucket sizes (#62232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232

Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. This will be used to verify how changing bucket sizes in DDP affects perf.

Based on the test, we can see an inconsistency in where the "first" bucket size actually is (it is the last one before rebuilding buckets and the first one after).
ghstack-source-id: 134663867

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29922299

fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
2021-07-30 18:07:04 -07:00
Rohan Varma
2cbc0ede7d [DDP] Log if graph is static at end of training (#61871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61871

When set_static_graph=False, the only type of dynamism we really support in DDP is a dynamic set of unused parameters, which must be explicitly enabled with find_unused_parameters=True. However, some workflows have a static set of unused parameters; it would be good to detect this and add it to the logging to identify workflows that are candidates for static graph optimization.
ghstack-source-id: 134371429

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29773962

fbshipit-source-id: 1f741984c6e6f8e3e55cf69ca719b1e25a485b13
2021-07-27 09:23:43 -07:00
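
A minimal sketch of the detection being described, under the assumption that the reducer records the set of unused parameter indices each iteration (illustrative code, not the real implementation).

```cpp
#include <iostream>
#include <set>
#include <vector>

// If the set of unused parameter indices is identical on every iteration,
// the graph is effectively static and the workflow is a candidate for
// static-graph optimization.
int main() {
  std::vector<std::set<int>> unused_per_iteration = {
      {2, 5}, {2, 5}, {2, 5}};  // collected each iteration during training

  bool graph_static = true;
  for (const auto& unused : unused_per_iteration) {
    if (unused != unused_per_iteration.front()) {
      graph_static = false;
      break;
    }
  }
  std::cout << "graph static across training: " << graph_static << '\n';
}
```
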
Luca Wehrstedt
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00