Commit Graph

61 Commits

Author SHA1 Message Date
fduwjj
ff92b42fc3 [c10d][gloo] Integrate vendor generic FR into gloo (#152614)
This is a first quick prototyping for FR integration for gloo. Few features gaps:
- Input/Output numels for each collective
- Whether to use c10::Event or where to use it.
- Where to dump the FR traces. (The dump api is provided in this PR)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
2025-06-03 16:12:54 +00:00
Aaron Gokaslan
49b9efdf1f [BE]: Cleanup traceutils with fmtlib (#152265)
Simplify code and make it faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152265
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-04 22:27:19 +00:00
cyy
41bd0c900a [1/N] Deprecate c10::string_view and at::string (#151972)
The calls of `c10::string_view` in the code base are replaced by `std::string_view`. The calls of `at::string` are replaced by `std::string`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151972
Approved by: https://github.com/malfet
2025-04-29 07:23:52 +00:00
Shuqiang Zhang
c0d642a295 [pgnccl][simple] log started work numel (#139773)
Summary:
We saw some cases that the same work was started on multiple ranks, but
did not complete. This info could give us more info if the numel matches
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139773
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-11-05 23:11:19 +00:00
cyy
94e12f97dc [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404)
Follows #137072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404
Approved by: https://github.com/Skylion007
2024-10-10 18:05:34 +00:00
fduwjj
e20fb5e975 [PTD][c10d] Include PG status into flight recorder (#131268)
We are considering consolidating data source for logging and flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we want to first ensure that the PG status is also included in flight recorder. Also, we can leverage this information to validate our FR dump as well. Because the dump is not synced so we might potentially see some variants in the dump.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
2024-07-25 01:01:00 +00:00
FFFrog
e49525275d Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-19 09:06:49 +00:00
PyTorch MergeBot
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
PyTorch MergeBot
5001f41b90 Revert "Make TraceUtils.h to be device-agnostic (#126969)"
This reverts commit 648625b230.

Reverted https://github.com/pytorch/pytorch/pull/126969 on behalf of https://github.com/clee2000 due to failing internal builds D58443769 ([comment](https://github.com/pytorch/pytorch/pull/126969#issuecomment-2163462600))
2024-06-12 16:32:57 +00:00
FFFrog
648625b230 Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-11 08:38:07 +00:00
Chirag Pandya
ab3a0b192a [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add timeout value field on every collected record.

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-10 17:12:57 +00:00
Chirag Pandya
0bf2fe522a [RFC] Provide optional switches to _dump_nccl_trace (#127651)
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. Default as before returns everything
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
    False (i.e. send everything)

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
2024-06-09 14:00:57 +00:00
PyTorch MergeBot
02a901f1e9 Revert "[RFC] Provide optional switches to _dump_nccl_trace (#127651)"
This reverts commit 0a761f0627.

Reverted https://github.com/pytorch/pytorch/pull/127651 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/127651#issuecomment-2156076838))
2024-06-08 15:30:04 +00:00
PyTorch MergeBot
57a24c4fdb Revert "[RFC] add per-collective timeout value in flight recorder (#128190)"
This reverts commit 09cccbc1c7.

Reverted https://github.com/pytorch/pytorch/pull/128190 on behalf of https://github.com/atalman due to Sorry need to revert this, in conflict with https://github.com/pytorch/pytorch/pull/127651 that needs reverting ([comment](https://github.com/pytorch/pytorch/pull/128190#issuecomment-2156075318))
2024-06-08 15:25:27 +00:00
Chirag Pandya
09cccbc1c7 [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add timeout value field on every collected record.

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-07 23:29:35 +00:00
Chirag Pandya
0a761f0627 [RFC] Provide optional switches to _dump_nccl_trace (#127651)
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. Default as before returns everything
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
    False (i.e. send everything)

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
2024-06-06 21:59:09 +00:00
dilililiwhy
f471482eb2 Try to include NCCL related header file with macro USE_C10D_NCCL (#127501)
Fixes #ISSUE_NUMBER
Try to include NCCL related header file with macro USE_C10D_NCCL, so that third-party device compilation will not be interrupted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127501
Approved by: https://github.com/ezyang
2024-05-30 21:33:41 +00:00
Chirag Pandya
ae66c94eaa Capture dtype in Flight Recorder (#126581)
Summary:
Capture dtype in flight recorder.
Mismatched dtypes can lead to hangs.

Newly added logs to job show mismatching DTYPE of op, which affects data
size.  Even though the sizes match and we don't see the dtype on the FR
log.

We end up capturing the type as follows:
```
{'entries': [{'record_id': 0, 'pg_id': 0, 'process_group': ('0', 'default_pg'), 'collective_seq_id': 1, 'p2p_seq_id': 0, 'op_id': 1, 'profiling_name': 'nccl:all_reduce', 'time_created_ns': 1715989097552775261, 'duration_ms': 6.697696208953857, 'input_sizes': [[3, 4]], 'input_dtypes': [6], 'output_sizes': [[3, 4]], 'output_dtypes': [6], 'state': 'completed', 'time_discovered_started_ns': 1715989097593778240, 'time_discovered_completed_ns': 1715989097593778461, 'retired': True,
```
Notice the new fields:
input_dtypes: [6]
output_dtypes: [6]

Test Plan:
unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/issues/126554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126581
Approved by: https://github.com/wconstab
2024-05-22 03:38:09 +00:00
Chirag Pandya
a83e745356 [BE] split seq_id to collective_seq_id and p2p_seq_id (#125727)
Summary:
Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id` and therefore it makes it easier to spot if one of machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync.

Resolves issue: https://github.com/pytorch/pytorch/issues/125173

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727
Approved by: https://github.com/zdevito
2024-05-21 03:26:49 +00:00
Chirag Pandya
1485621ccb [BE] Abstract out strings to top of file (#125640)
Summary:
Move const strings to top of file. This is in preparation of tooling to
make use of shared constants (e.g. version string). A non-functional change.
Ideally we want these const strings to be available from both C++ and Python - but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change.

Test Plan:
python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125640
Approved by: https://github.com/wconstab
2024-05-15 03:38:30 +00:00
Richard Barnes
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
Shengbao Zheng
4e9094533e [c10d/nccl-pg] allow user to pass process group description (#123472)
Summary:
We need a way to allow user set a customized description for a process group, e.g. FSDP, PP.

Here are several use cases of user specified group_desc:
- Logging: we can easily match a log line and understand what's this collective/pg is used to.
- Pytorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analysis, benchmarks will be able to easily differentiate PG purpose like FSDP, PP.
- Lower layer collectives(e.g. NCCL) debug: we will be able to expose PG desc to NCCL communicator so NCCL layer operations can be easily correlated to a PG.

Solution: Add a group_desc field to c10d

Differential Revision: D55781850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
2024-04-12 08:44:21 +00:00
Shengbao Zheng
ae6f8d923c Pass and record process_group_name when creating ProcessGroupNCCL (#123117)
Summary:
Pass python c10d group_name to c++ ProcessGroupNCCL so that the pg name will be consistent across different layers.
Also record pg_name in flight recorder entry.

Differential Revision: D55597200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
2024-04-05 18:57:45 +00:00
zdevito
530e13cf3d Revert "[c10d] disable compute_duration by default (#122138)" (#122539)
This reverts commit bf18e967b4.

It is stacked after a fix to elapsed_time that will resolve the memory issues that required in the introduction of this flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122539
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #122538
2024-03-27 21:53:28 +00:00
Shuqiang Zhang
bf18e967b4 [c10d] disable compute_duration by default (#122138)
Summary:
Compute duration would invoke additional cuda overhead and possibly
GPU mem increase and possible hang, so we want to disable it by default and enable it only
when needed, or at least when timing is enabled.

Test Plan:
Test with existing unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
2024-03-21 04:45:37 +00:00
Will Constable
581fe26792 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
2024-03-01 23:45:43 +00:00
PyTorch MergeBot
76d3a6bb4a Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)"
This reverts commit 381a7ad3f1.

Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))
2024-02-29 22:06:13 +00:00
Will Constable
381a7ad3f1 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
ghstack dependencies: #120724, #120270
2024-02-29 01:03:31 +00:00
Will Constable
7f4d673885 [C10D] Add record_id to flight recorder (#120724)
In cases where sequence number is shared between events (e.g. coalesced
collectives) we want to ensure a unique (and ordered) ID per record.

Note: the records are already in a list, so their ID could be implicitly
observed.  But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
2024-02-29 01:03:31 +00:00
Shengbao Zheng
11de40f82f [flight recorder] record process group configuration (#120262)
Summary: Record process group configuration (i.e. ranks involved in a process group) to facilitate NCCL related debugging.

Differential Revision: D53792087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120262
Approved by: https://github.com/shuqiangzhang
2024-02-28 20:31:08 +00:00
Shuqiang Zhang
a24cba35b0 [c10d][flight recorder] dump additinal NCCL debug info (#120063)
Summary:
This PR is mainly about flight recorder side of changes that takes a
map of maps as input, and dump it as picklable. Also add functions that
should be compiled only when NCCL_COMM_DUMP is defined
Test Plan:
Integration tests with NCCL would be done later, here we only do the
c10d side of dump test, aka,NCCLTraceTest

Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for them. So we still use the Python NCCLTraceTest with
the python binding of _dump_nccl_trace(), we manually fed the
dump_nccl_trace with a map of test info, and assert the pickle result and
print the converted python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
2024-02-21 16:35:23 +00:00
Shuqiang Zhang
a45c627f27 [c10d][flight recorder] store a copy of string in entry (#119837)
Summary:
Previously, we just store the char pointer in entry, the string is a
temp object and will be destructed when we want to dump/access it.

A quick fix is to store a copy of the string, but without changing the
upstream char*.

An alternative is to change every profilingTitle into std:string, this
however would needs comprehensive overhall of the code up to the
c10d::work layer above workNCCL and RecordFunction etc.

We chose the first option for this change

Resolve #119808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
2024-02-14 09:13:56 +00:00
Shuqiang Zhang
abadbbc4b0 [c10d][flight recorder] remove unintended assignment of entry (#119748)
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The above line of code has unintended consequence of invoking copy/assignment
of entry objects as ref itself cannot be re-assigned.

Also what could cause the crash is that the entry ref could become invalid if entries_ are
resized by other threads. and this could result in 'copy to a garbage
location'. The fix is to use a pointer which can be re-assigned after
re-acquiring the lock

Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
2024-02-13 20:18:58 +00:00
Ke Wen
b2043c0543 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-12 18:45:49 +00:00
PyTorch MergeBot
0342b227e5 Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)"
This reverts commit f3e7d80993.

Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))
2024-02-12 07:34:20 +00:00
Ke Wen
f3e7d80993 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-09 20:23:20 +00:00
zdevito
7f05c72864 [nccl flight recorder] record time we discover start and complete (#119249)
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed eventhough
that complete happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change, so even if eventually the collective gets marked complete,
we can observe it happened minutes after it was schedule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
2024-02-08 16:48:33 +00:00
willfengg
63fd6883fd [c10d] logging utility for cpp-python stacktrace (#118924)
user may not know which line of code called collectives in a big code base. When debugging, we can print python-cpp stacktrace in case user call ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce``

```
LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: "
                       << get_python_cpp_trace();
```

output (using _allgather_base as an example): one example python-part trace is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838``
```
ProcessGroupNCCL::_allgather_base stacktrace: #0 torch::unwind::unwind() from ??:0
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0
#2 c10d::get_python_cpp_trace[abi:cxx11]() from :0
#3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0
#4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0
#5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#6 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0
#8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#9 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543
#11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838
#15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75
#18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399
#21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308
#24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332
#27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448
#30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520
#39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431
#44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494
#45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#47 inner from /data/users/weif/pytorch/run_fsdp.py:72
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#50 run from /data/users/weif/pytorch/run_fsdp.py:76
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#53 main from /data/users/weif/pytorch/run_fsdp.py:133
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#56 <module> from /data/users/weif/pytorch/run_fsdp.py:137
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134
#59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291
#60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312
#61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208
#62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456
#63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90
#64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357
#65 Py_BytesMain from /usr/local/src/conda/python-3.10.12/Modules/main.c:1090
#66 __libc_start_call_main from ??:0
#67 <unwind unsupported> from ??:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118924
Approved by: https://github.com/kwen2501
2024-02-02 23:49:18 +00:00
Will Constable
455bba38f4 [C10D] Make Flight Recorder report time_created in ns (#118047)
Addresses (6) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
2024-01-23 08:18:08 +00:00
Will Constable
5df92a9244 [C10D] Add version tag to NCCL Flight Recorder Dump (#118046)
Addresses (3) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
2024-01-23 08:18:08 +00:00
Will Constable
dace1fda2e [C10D] Make NCCL Flight Recorder dump produce a dict (#118044)
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.

Addresses 1 and 2 from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118044
Approved by: https://github.com/zdevito
2024-01-23 08:18:08 +00:00
dilililiwhy
924ed91612 Move getDurationFromFirstEvent to USE_C10D_NCCL ifdef (#117738)
Fixes #117517

Try to move nccl related function *getDurationFromFirstEvent* to USE_C10D_NCCL ifdef (Related to https://github.com/pytorch/pytorch/issues/114575)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117738
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-01-19 04:28:47 +00:00
Will Constable
7f1f0b1135 [C10D] Add duration_ms to flight recorder (#114817)
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.

duration_ms will be an optional field, since it only works when
timing is enabled.  Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement.  Duration is also
not available for collectives not in a completed state.

Note: computing duration can lead to a hang due to calling cudaEventDuration when
the cuda driver queue is full.

We don't ever want dump() api to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible during dump() call, some of the most recent
collectives durations won't have been computed yet at time of dump.  We
make this tradeoff to ensure that dump() itself will never hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
2024-01-12 23:34:11 +00:00
Will Constable
10509dac85 [C10D] Rename flightrecorder key vars to avoid confusion (#116905)
Key vars are strings used as dict keys (e.g. duration_s was a string
"duration_ms")

_s confused me with time (seconds) since duration_s was a key string and
duration_ms is another variable holding a time value.

Now duration_key is "duration_ms".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
2024-01-11 02:57:04 +00:00
fduwjj
ca4df16fdd [c10d] Make DebugInfoWriter Singleton across all PG objects (#116489)
Previously, we have the writer register to each NCCL PG(backend), so for every pg, we have a NCCL PG instance, so if we use some customized writer when multiple sub-PGs are used, we need to ensure user to register the writer for every backend which indicates a bad UX. Furthermore, the debug info is global, so it does not make sense to have the writer for each instance. We even have a static mutex in the `dumpDebuggingInfo` to ensure we serialize the write, that makes it more obvious that we can make the writer a singleton so that we only have one writer instance for all PG instances.

Although the rationale is clear, the implementation may vary a lot. So this PR is RFC for now to see if this implementation makes sense or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
2024-01-03 03:42:54 +00:00
zdevito
66b04e3cb7 [nccl flight recorder] nullptr profiling name (#115851)
Sometimes profiling name can be a nullptr, which
throws on conversion to std::string. This adds a check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115851
Approved by: https://github.com/wconstab
2023-12-14 23:40:54 +00:00
PyTorch MergeBot
f101426790 Revert "Move class definition of DebugInfoWriter to TraceUtil as well (#114901)"
This reverts commit fb325bbd46.

Reverted https://github.com/pytorch/pytorch/pull/114901 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114901#issuecomment-1838815178))
2023-12-04 14:55:39 +00:00
Will Constable
8a51845b38 [C10D] Add filename to dump finished log (#114957)
Just shows you where to look..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114957
Approved by: https://github.com/fduwjj
2023-12-01 20:38:02 +00:00
fduwjj
fb325bbd46 Move class definition of DebugInfoWriter to TraceUtil as well (#114901)
Since we moved the implementation of the class to TraceUtils in https://github.com/pytorch/pytorch/pull/114367, maybe we also want to move the implementation here as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114901
Approved by: https://github.com/XilunWu
2023-12-01 03:28:16 +00:00