sanshang
249152475d
Fix sequence number for group (#134578)
...
Summary:
Fix the sequence number in the execution trace dump so that collective/p2p ops can be matched with their corresponding wait during execution trace replay.
`ProcessGroupNCCL` has two sequence number counters, `seqCollective_` and `seqP2P_`.
b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L1188-L1191)
However, `WorkNCCL` only has one sequence number member, `seq_`. b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L387)
We therefore need to match collective and p2p ops with their waits separately; see the sketch below.
29b5a462dc
Depend on: https://github.com/pytorch/pytorch/pull/135132
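As a rough illustration of the matching problem, here is a minimal standalone C++ sketch; the class and member names are simplified stand-ins, not the actual `ProcessGroupNCCL`/`WorkNCCL` code. Each work item must carry both its op kind and its per-kind sequence number, because the two counters advance independently.
```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-ins for ProcessGroupNCCL / WorkNCCL; illustrative only.
struct Work {
  bool isP2P;    // which counter this work belongs to
  uint64_t seq;  // sequence number within that counter's stream
};

class ProcessGroup {
 public:
  Work collective() { return Work{false, ++seqCollective_}; }
  Work p2p()        { return Work{true,  ++seqP2P_}; }

 private:
  uint64_t seqCollective_ = 0;  // counts collectives only
  uint64_t seqP2P_ = 0;         // counts point-to-point ops only
};

int main() {
  ProcessGroup pg;
  std::vector<Work> trace = {pg.collective(), pg.p2p(), pg.collective()};
  // A replayer must match a wait to the pair (isP2P, seq), not seq alone:
  // here the first collective and the first p2p both have seq=1, so a wait
  // recorded with only a single seq value is ambiguous.
  for (const auto& w : trace)
    std::cout << (w.isP2P ? "p2p" : "coll") << " seq=" << w.seq << '\n';
}
```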
Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test
Differential Revision:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
2024-10-10 04:24:06 +00:00
Shivam Raikundalia
816061843a
[Distributed/Profiler] Fix input/output dimension overflow (#134360)
...
Summary: When using ParamCommsDebugInfo, the input and output element counts are stored as `int` instead of `int64_t`, which can overflow for large tensors.
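As a hedged illustration of the failure mode (plain C++, no PyTorch types): once an element count exceeds `INT_MAX`, narrowing to `int` corrupts the value, while `int64_t` preserves it.
```cpp
#include <cstdint>
#include <iostream>

int main() {
  // 2^31 elements (e.g. a very large all-gather payload) exceeds INT_MAX.
  int64_t nelems = int64_t{1} << 31;
  int narrow = static_cast<int>(nelems);  // narrowing wraps the value
  int64_t wide = nelems;                  // preserved exactly
  std::cout << "int:     " << narrow << '\n';  // typically -2147483648
  std::cout << "int64_t: " << wide << '\n';    // 2147483648
}
```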
Test Plan: Run HTA with the newly output values and confirm that overflow does not occur.
Reviewed By: fengxizhou
Differential Revision: D61728747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134360
Approved by: https://github.com/fengxizhou, https://github.com/jeanschmidt
2024-08-25 16:25:56 +00:00
cyy
c2596fd3e0
[Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032)
...
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032
Approved by: https://github.com/Skylion007
2024-04-16 00:42:18 +00:00
Shengbao Zheng
9fa922c2ed
[profiler] Log process group name instead of pg uid (#124035)
...
Summary:
As part of the work to unify process group identifiers, log `<group_name, group_desc>` instead of the pg uid in the profiler; a sketch of the resulting metadata follows the list below.
- group_name remains the unique identifier, e.g. "0", "1".
- group_desc is the user-specified name, e.g. "fsdp".
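A minimal sketch of what such trace metadata could look like; the key strings here are illustrative assumptions, not the exact fields the profiler emits.
```cpp
#include <iostream>
#include <map>
#include <string>

int main() {
  // Hypothetical trace-event metadata; key names are illustrative.
  std::map<std::string, std::string> metadata = {
      {"Process Group Name", "0"},            // unique identifier
      {"Process Group Description", "fsdp"},  // user-specified label
  };
  for (const auto& [key, value] : metadata)
    std::cout << key << ": " << value << '\n';
}
```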
Reviewed By: aaronenyeshi, kwen2501
Differential Revision: D55610682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
2024-04-15 21:49:06 +00:00
Shengbao Zheng
5b9e5f854b
[profiler] Log process group id instead of backend id (#120475)
...
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.
However, it is inconvenient to correlate collectives with the backend id; using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID information exposed in `record_param_comms` from `backend_id` to `pg_id`.
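As a rough sketch of the idea (the record shape here is an illustrative assumption, not the actual `record_param_comms` schema): the stable pg uid appears directly in each comms record, so correlation needs no extra lookup through an internal backend object.
```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Illustrative record shape only; not the actual record_param_comms fields.
struct CommsRecord {
  int64_t pg_id;           // process-group uid: stable and user-visible
  std::string collective;  // e.g. "allreduce"
};

int main() {
  // The same pg uid appears wherever the group is referenced, so a trace
  // consumer can group records by pg_id directly.
  CommsRecord r{0, "allreduce"};
  std::cout << "pg " << r.pg_id << ": " << r.collective << '\n';
}
```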
Differential Revision: D53558257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
2024-02-29 15:04:33 +00:00
Pavan Balaji
ffc826bf10
[nccl-pg] Store PG global rank information in tracing logs (#115730)
...
Storing the list of global ranks associated with each PG allows us to correlate traces across different ranks.
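A minimal standalone sketch of the mapping being stored (names are illustrative): a list indexed by group rank yields the global rank, which is what lets traces from different ranks be lined up.
```cpp
#include <iostream>
#include <vector>

int main() {
  // Hypothetical: a sub-group of 4 out of 8 global ranks. Recording this
  // mapping lets a trace event from group rank 2 be attributed to global
  // rank 5 when correlating traces across ranks.
  std::vector<int> pgGlobalRanks = {1, 3, 5, 7};  // index = group rank
  int groupRank = 2;
  std::cout << "group rank " << groupRank
            << " -> global rank " << pgGlobalRanks[groupRank] << '\n';
}
```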
Test Plan:
OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115730
Approved by: https://github.com/fduwjj
2023-12-14 00:59:17 +00:00
Pavan Balaji
aa390cec21
[profiler] Fix description to use nelems rather than size (#114735)
...
We were storing the number of elements in the tensor, rather than the actual bytes.
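A small sketch of why the distinction matters (plain C++, values hypothetical): the same element count corresponds to different byte counts depending on dtype width, so labeling an element count as a size is misleading.
```cpp
#include <cstdint>
#include <iostream>

int main() {
  // nelems alone is ambiguous; the byte count depends on the dtype width.
  int64_t nelems = 1024;
  int64_t elemSizeFloat32 = 4, elemSizeFloat16 = 2;
  std::cout << "fp32 bytes: " << nelems * elemSizeFloat32 << '\n';  // 4096
  std::cout << "fp16 bytes: " << nelems * elemSizeFloat16 << '\n';  // 2048
}
```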
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114735
Approved by: https://github.com/aaronenyeshi, https://github.com/yoyoyocmu, https://github.com/kwen2501, https://github.com/fduwjj
2023-12-01 06:21:47 +00:00
Yue Dong
ed15fa7cc2
[Kineto][NCCL][3/n] Get the NCCL communication info from PARAM_COMMS_INFO (#111846)
...
This diff enables getting the NCCL communication metadata from `c10::DebugInfoKind::PARAM_COMMS_INFO`, available in `ThreadLocalDebugInfo`.
To keep the overhead lightweight and avoid comparing the function name on each op, we add the method `bool isNcclMeta()`, which is decided once during initialization.
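A minimal sketch of the one-time-check pattern (class and string names are illustrative assumptions, not the actual profiler code): the name comparison happens once at construction, and the per-op query is a cached boolean.
```cpp
#include <iostream>
#include <string>

// Illustrative observer: the string comparison happens once, at
// construction, instead of on every profiled op.
class OpObserver {
 public:
  explicit OpObserver(const std::string& name)
      : isNcclMeta_(name == "record_param_comms") {}  // one-time check

  bool isNcclMeta() const { return isNcclMeta_; }     // cheap per-op query

 private:
  const bool isNcclMeta_;
};

int main() {
  OpObserver comms("record_param_comms"), matmul("aten::mm");
  std::cout << comms.isNcclMeta() << ' ' << matmul.isNcclMeta() << '\n';  // 1 0
}
```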
Differential Revision: [D50439211](https://our.internmc.facebook.com/intern/diff/D50439211/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111846
Approved by: https://github.com/aaronenyeshi
ghstack dependencies: #111842, #111843
2023-10-25 20:35:06 +00:00
Yue Dong
43d0ae4822
[Kineto][NCCL][1/n] Add the world size info in NCCL metadata (#111842)
...
This diff adds the world size info to the NCCL metadata, as we need it to calculate the algorithmic bandwidth and bus bandwidth.
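As a hedged illustration of why world size is needed, here is the standard NCCL-style all-reduce bandwidth math in a standalone sketch (the payload and timing values are hypothetical):
```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Standard NCCL-style bandwidth math for all-reduce; the world size n is
  // required for the bus-bandwidth correction factor 2*(n-1)/n.
  int64_t bytes = int64_t{1} << 30;      // 1 GiB payload (hypothetical)
  double seconds = 0.025;                // measured op time (hypothetical)
  int n = 8;                             // world size from the metadata
  double algbw = bytes / seconds / 1e9;  // GB/s delivered to the caller
  double busbw = algbw * 2.0 * (n - 1) / n;
  std::cout << "algbw=" << algbw << " GB/s, busbw=" << busbw << " GB/s\n";
}
```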
Differential Revision: [D50439185](https://our.internmc.facebook.com/intern/diff/D50439185/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111842
Approved by: https://github.com/aaronenyeshi, https://github.com/fduwjj
2023-10-25 03:48:55 +00:00
Shen Li
dd6319198d
Apply clang-format to distributed/c10d folder (#107140)
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107140
Approved by: https://github.com/H-Huang
2023-08-14 23:16:38 +00:00
Louis Feng
55479fe80e
Enable capturing of comm collective parameters (#98) (#85368)
...
Summary:
X-link: https://github.com/facebookresearch/torch_ucc/pull/98
Add tensor input, output, and other metadata for PyTorch comms.
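A rough sketch of what such per-collective metadata might contain; the struct and field names are illustrative assumptions, not the actual ParamCommsDebugInfo layout.
```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Illustrative shape of per-collective debug metadata; field names are
// assumptions, not the real ParamCommsDebugInfo fields.
struct CollectiveMeta {
  std::string opName;             // e.g. "allreduce"
  std::vector<int64_t> inSizes;   // input tensor shape
  std::vector<int64_t> outSizes;  // output tensor shape
  std::string dtype;              // element type
};

int main() {
  CollectiveMeta m{"allreduce", {1024, 1024}, {1024, 1024}, "float32"};
  std::cout << m.opName << " in=[" << m.inSizes[0] << ',' << m.inSizes[1]
            << "] dtype=" << m.dtype << '\n';
}
```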
Test Plan: P517138779
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85368
Approved by: https://github.com/H-Huang
2022-10-11 04:38:26 +00:00
Peter Bell
21017ad1a1
Dispatch.h: Avoid including ivalue (#64165)
...
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64165
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30728587
Pulled By: ezyang
fbshipit-source-id: d0d2e97491d9d5e2d2fc2d6e51420a4467c1bba4
2021-09-15 12:16:44 -07:00
Luca Wehrstedt
a016150163
Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Since c10d is now part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00