Commit Graph

15 Commits

Author SHA1 Message Date
Shivam Raikundalia
816061843a [Distributed/Profiler] Fix input/output dimension overflow (#134360)
Summary: When using ParamCommsDebugInfo, the input and output element counts are stored as `int` instead of `int64_t`, which can overflow for large collectives.

Test Plan: Run HTA with the newly output values and make sure no overflow occurs
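
As a hedged illustration (a standalone sketch, not the actual ParamCommsDebugInfo code), an element count above INT32_MAX is lost when narrowed to a 32-bit `int` but preserved in `int64_t`:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical element count for a large collective: 3 billion elements,
  // which exceeds INT32_MAX (2,147,483,647).
  int64_t in_elements = 3000000000LL;

  // Old behavior: narrowing to int overflows (implementation-defined result,
  // typically a negative value). Fixed behavior: int64_t keeps the count intact.
  int narrowed = static_cast<int>(in_elements);
  int64_t widened = in_elements;

  std::cout << "as int:     " << narrowed << "\n";
  std::cout << "as int64_t: " << widened << "\n";
  return 0;
}
```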

Reviewed By: fengxizhou

Differential Revision: D61728747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134360
Approved by: https://github.com/fengxizhou, https://github.com/jeanschmidt
2024-08-25 16:25:56 +00:00
cyy
1ac402a96c [Distributed] [6/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124701)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124043.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124701
Approved by: https://github.com/ezyang
2024-04-25 11:39:23 +00:00
cyy
ea61c9cb29 [Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043
Approved by: https://github.com/ezyang
2024-04-23 00:43:50 +00:00
Shengbao Zheng
9fa922c2ed [profiler] Log process group name instead of pg uid (#124035)
Summary:
As part of the work to unify process group identifiers, log <group_name, group_desc> instead of the pg uid in the profiler (see the sketch below):
- group_name remains the unique identifier, e.g. "0", "1"
- group_desc is the user-specified name, e.g. "fsdp".
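
A minimal sketch of the pair being logged (the struct and field names here are hypothetical, not the actual profiler schema):

```cpp
#include <iostream>
#include <string>

// Hypothetical holder for the two identifiers described above.
struct PgInfo {
  std::string group_name;  // unique identifier, e.g. "0", "1"
  std::string group_desc;  // user-specified name, e.g. "fsdp"
};

int main() {
  PgInfo info{"0", "fsdp"};
  std::cout << "Process Group Name: " << info.group_name
            << ", Process Group Description: " << info.group_desc << "\n";
  return 0;
}
```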

Reviewed By: aaronenyeshi, kwen2501

Differential Revision: D55610682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
2024-04-15 21:49:06 +00:00
Shengbao Zheng
5b9e5f854b [profiler] Log process group id instead of backend id (#120475)
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.

However, it is inconvenient to correlate collectives with the backend id. Using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.

Differential Revision: D53558257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
2024-02-29 15:04:33 +00:00
Pavan Balaji
ffc826bf10 [nccl-pg] Store PG global rank information in tracing logs (#115730)
Storing the list of global ranks associated with each PG allows us to correlate traces across different ranks.
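
A hedged sketch of the idea (the ranks below are made up): once the trace records which global ranks a PG covers, a collective logged against local rank 1 of that PG can be matched to the right global rank's timeline.

```cpp
#include <iostream>
#include <vector>

int main() {
  // Hypothetical sub-group spanning global ranks 4..7.
  std::vector<int> pg_global_ranks = {4, 5, 6, 7};

  // A collective traced on local rank 1 of this PG belongs to global rank 5,
  // which is what lets traces from different ranks be lined up.
  int local_rank = 1;
  std::cout << "local rank " << local_rank << " -> global rank "
            << pg_global_ranks[local_rank] << "\n";
  return 0;
}
```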

Test Plan:

OSS CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115730
Approved by: https://github.com/fduwjj
2023-12-14 00:59:17 +00:00
Pavan Balaji
aa390cec21 [profiler] Fix description to use nelems rather than size (#114735)
We were storing the number of elements in the tensor, not the actual size in bytes, so the description should say nelems rather than size.
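
The two quantities differ by the element size; a small standalone sketch (float32 is only an example):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // A hypothetical float32 tensor with 1,048,576 elements.
  int64_t nelems = 1048576;                      // what the field actually records
  int64_t element_size_bytes = 4;                // bytes per float32 element
  int64_t nbytes = nelems * element_size_bytes;  // what a field named "size" would imply

  std::cout << "nelems: " << nelems << ", bytes: " << nbytes << "\n";
  return 0;
}
```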

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114735
Approved by: https://github.com/aaronenyeshi, https://github.com/yoyoyocmu, https://github.com/kwen2501, https://github.com/fduwjj
2023-12-01 06:21:47 +00:00
Yue Dong
ed15fa7cc2 [Kineto][NCCL][3/n] Get the NCCL communication info from PARAM_COMMS_INFO (#111846)
This diff enables getting the NCCL communication metadata from `c10::DebugInfoKind::PARAM_COMMS_INFO`, which is available in `ThreadLocalDebugInfo`.

To keep the overhead lightweight and avoid comparing the function name on each op, we add the method `bool isNcclMeta()`, which is decided during initialization.
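
A minimal sketch of that caching pattern (the class and the prefix check are illustrative, not the actual profiler code): the decision is made once at construction, so the per-op path only reads a cached bool instead of comparing strings.

```cpp
#include <iostream>
#include <string>

class MetaCollector {
 public:
  explicit MetaCollector(const std::string& op_name)
      // Decide once, during initialization, instead of on every op.
      : is_nccl_meta_(op_name.rfind("nccl:", 0) == 0) {}

  // Hot path: a cheap bool read rather than a per-op string comparison.
  bool isNcclMeta() const { return is_nccl_meta_; }

 private:
  bool is_nccl_meta_;
};

int main() {
  MetaCollector c("nccl:all_reduce");
  std::cout << std::boolalpha << c.isNcclMeta() << "\n";  // prints: true
  return 0;
}
```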

Differential Revision: [D50439211](https://our.internmc.facebook.com/intern/diff/D50439211/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111846
Approved by: https://github.com/aaronenyeshi
ghstack dependencies: #111842, #111843
2023-10-25 20:35:06 +00:00
Yue Dong
43d0ae4822 [Kineto][NCCL][1/n] Add the world size info in NCCL metadata (#111842)
This diff adds the world size info to the NCCL metadata, as we need it to calculate the algorithmic bandwidth and bus bandwidth.
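
For context, the usual NCCL-style definitions that need the world size (following the nccl-tests conventions; the numbers below are made up): algorithmic bandwidth is payload bytes over time, and for all-reduce the bus bandwidth scales it by 2*(n-1)/n.

```cpp
#include <iostream>

int main() {
  // Hypothetical all-reduce: 1 GiB payload across 8 ranks, taking 25 ms.
  double bytes = 1024.0 * 1024.0 * 1024.0;
  double seconds = 0.025;
  int world_size = 8;

  double alg_bw = bytes / seconds / 1e9;                         // GB/s
  double bus_bw = alg_bw * 2.0 * (world_size - 1) / world_size;  // all-reduce factor

  std::cout << "algbw: " << alg_bw << " GB/s, busbw: " << bus_bw << " GB/s\n";
  return 0;
}
```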

Differential Revision: [D50439185](https://our.internmc.facebook.com/intern/diff/D50439185/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111842
Approved by: https://github.com/aaronenyeshi, https://github.com/fduwjj
2023-10-25 03:48:55 +00:00
Nikita Shulga
fb18c29486 [BE] Tweak Meta copyright headers (#90805)
s/Facebook, Inc./Meta Platforms, Inc/
s/Confidential and proprietary./This source code is licensed under the BSD-style license/

Per https://www.internalfb.com/intern/wiki/Open_Source/Licenses/Straight_BSD/

Also, add a linter that prevents adding those headers in the future

Fixes https://github.com/pytorch/pytorch/issues/90187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90805
Approved by: https://github.com/zpao
2022-12-14 20:30:31 +00:00
Min Si
1ad0048b64 Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the established practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
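
As an isolated illustration of the change (Store.hpp is just one example header; this is a snippet, not a standalone program):

```cpp
// Before: relative to an extra -I./torch/csrc/distributed include directory,
// which breaks when hipified copies are generated in other directories.
// #include <c10d/Store.hpp>

// After: absolute path from the repository root, resolved the same way everywhere.
#include <torch/csrc/distributed/c10d/Store.hpp>
```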

See D39835774 for more details about the Meta-internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify we don't miss any relative-path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
PyTorch MergeBot
a50d8864fc Revert "Refactor distributed to use absolute header path (#85780)"
This reverts commit 668082718a.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks the build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
Min Si
668082718a Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the established practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about the Meta-internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options; use it to verify we don't miss any relative-path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Luca Wehrstedt
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00