We are considering consolidating the data sources for logging and the flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we first want to ensure that the PG status is also included in the flight recorder. We can also leverage this information to validate our FR dump; because the dump is not synchronized, we might see some variance in it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed from it and split out into the NCCLUtils files.
In addition, some common functions still remain in TraceUtils.h since I'm not sure whether other devices will use them later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
Summary:
Data from PyTorch distributed is mostly useful during the initial stages of model development.
Provide options to reduce the amount of data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. The defaults, as before, return everything (a usage sketch follows the list below).
- `includeCollectives`: also include collectives. Default is True.
- `includeStacktraces`: include stack traces in collectives. Default is True.
- `onlyActive`: only send active collective work, i.e. not completed. Default is
False (i.e. send everything)
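For illustration only, a minimal usage sketch of the three switches, assuming the binding is exposed as `torch._C._distributed_c10d._dump_nccl_trace` and returns pickled bytes as in the existing NCCLTraceTest usage:
```
# Hypothetical usage sketch; assumes the pybind location and that the dump is
# returned as pickled bytes.
import pickle
from torch._C._distributed_c10d import _dump_nccl_trace

# Default: everything (collectives, stack traces, completed and active work).
full = pickle.loads(_dump_nccl_trace())

# Smaller dump: keep collectives but drop stack traces and completed work.
small = pickle.loads(
    _dump_nccl_trace(
        includeCollectives=True,
        includeStacktraces=False,
        onlyActive=True,
    )
)
print(len(full.get("entries", [])), len(small.get("entries", [])))
```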
Test Plan:
Unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
Summary:
Capture the dtype in the flight recorder.
Mismatched dtypes can lead to hangs.
Newly added job logs show a mismatching op dtype, which affects the data size; the sizes themselves match, and the dtype was not previously visible in the FR log.
We now capture the dtype as follows:
```
{'entries': [{'record_id': 0, 'pg_id': 0, 'process_group': ('0', 'default_pg'), 'collective_seq_id': 1, 'p2p_seq_id': 0, 'op_id': 1, 'profiling_name': 'nccl:all_reduce', 'time_created_ns': 1715989097552775261, 'duration_ms': 6.697696208953857, 'input_sizes': [[3, 4]], 'input_dtypes': [6], 'output_sizes': [[3, 4]], 'output_dtypes': [6], 'state': 'completed', 'time_discovered_started_ns': 1715989097593778240, 'time_discovered_completed_ns': 1715989097593778461, 'retired': True,
```
Notice the new fields:
input_dtypes: [6]
output_dtypes: [6]
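As an illustrative sketch (not part of this change), a dump consumer could use these fields to flag collectives whose recorded dtypes disagree across ranks, with field names as in the entry shown above:
```
# Illustrative sketch: given per-rank dumps already unpickled into dicts with
# an 'entries' list as shown above, flag collectives whose recorded
# input/output dtypes disagree across ranks.
from collections import defaultdict

def find_dtype_mismatches(per_rank_dumps):
    signatures = defaultdict(set)  # collective_seq_id -> {(in_dtypes, out_dtypes)}
    for dump in per_rank_dumps:
        for e in dump.get("entries", []):
            sig = (tuple(e.get("input_dtypes", [])), tuple(e.get("output_dtypes", [])))
            signatures[e.get("collective_seq_id")].add(sig)
    return {seq: sigs for seq, sigs in signatures.items() if len(sigs) > 1}
```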
Test Plan:
unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/issues/126554
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126581
Approved by: https://github.com/wconstab
Summary:
Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id`, making it easier to spot when one of the machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receiver(s) are in sync.
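A rough sketch of the kind of cross-rank check this enables (illustrative only, assuming per-rank FR dumps have already been collected and unpickled):
```
# Illustrative sketch: a collective_seq_id that is missing on some rank
# suggests that rank is not handling the collective.
def missing_collectives(per_rank_dumps):
    per_rank_ids = [
        {e["collective_seq_id"] for e in dump.get("entries", [])}
        for dump in per_rank_dumps
    ]
    all_ids = set().union(*per_rank_ids) if per_rank_ids else set()
    return {
        rank: sorted(all_ids - ids)
        for rank, ids in enumerate(per_rank_ids)
        if all_ids - ids
    }
```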
Resolves issue: https://github.com/pytorch/pytorch/issues/125173
Test Plan:
Unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727
Approved by: https://github.com/zdevito
Summary:
Move const strings to the top of the file. This is in preparation for tooling that
makes use of shared constants (e.g. the version string). A non-functional change.
Ideally we want these const strings to be available from both C++ and Python, but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change.
Test Plan:
python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125640
Approved by: https://github.com/wconstab
Summary:
We need a way to allow users to set a customized description for a process group, e.g. FSDP, PP.
Here are several use cases for a user-specified group_desc:
- Logging: we can easily match a log line and understand what this collective/PG is used for.
- PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analysis and benchmarks will be able to easily differentiate PG purposes like FSDP and PP.
- Lower-layer collective (e.g. NCCL) debugging: we will be able to expose the PG desc to the NCCL communicator so that NCCL-layer operations can be easily correlated to a PG.
Solution: add a group_desc field to c10d
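For illustration, a hypothetical usage sketch assuming the Python API surfaces this as a `group_desc` argument on `new_group` (the exact Python plumbing may differ from this c10d-side change):
```
# Hypothetical usage sketch; assumes new_group() accepts a group_desc kwarg
# that flows down to the c10d group_desc field added here.
import torch.distributed as dist

# dist.init_process_group("nccl", ...) is assumed to have been called already.
fsdp_pg = dist.new_group(ranks=[0, 1, 2, 3], group_desc="FSDP")
pp_pg = dist.new_group(ranks=[0, 1, 2, 3], group_desc="PP")
```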
Differential Revision: D55781850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
Summary:
Pass the Python c10d group_name to the C++ ProcessGroupNCCL so that the PG name is consistent across different layers.
Also record the pg_name in the flight recorder entry.
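As a small illustration of why a consistent name helps, a dump consumer can bucket entries by the process_group field recorded in each FR entry (sketch only; the field appears as ('0', 'default_pg') in the dump shown earlier):
```
# Illustrative sketch: group flight recorder entries by their recorded
# process_group tuple.
from collections import defaultdict

def entries_by_pg(dump):
    buckets = defaultdict(list)
    for e in dump.get("entries", []):
        buckets[tuple(e.get("process_group", ()))].append(e)
    return buckets
```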
Differential Revision: D55597200
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
Summary:
Computing duration incurs additional CUDA overhead, a possible GPU memory increase, and a possible hang, so we want to disable it by default and enable it only when needed, or at least only when timing is enabled.
Test Plan:
Test with existing unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
In cases where a sequence number is shared between events (e.g. coalesced
collectives), we want to ensure a unique (and ordered) ID per record.
Note: the records are already in a list, so their ID could be observed
implicitly. But (1) it's a ring buffer, so the absolute ID is lost once the
buffer rolls over, and (2) users may sort, process, or filter their
flight records, so having the ID as an explicit member of an entry is
still useful.
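For example (sketch only), with an explicit record_id a consumer can filter entries and still recover a stable global order, even after the ring buffer has wrapped:
```
# Illustrative sketch: filter entries, then restore global order via the
# explicit record_id field.
def active_in_order(dump):
    active = [e for e in dump.get("entries", []) if e.get("state") != "completed"]
    return sorted(active, key=lambda e: e["record_id"])
```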
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
Summary:
This PR is mainly about the flight recorder side of changes: it takes a
map of maps as input and dumps it as a picklable object. It also adds functions that
should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only do the
c10d side of the dump test, aka NCCLTraceTest.
Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of _dump_nccl_trace(): we manually feed
dump_nccl_trace with a map of test info, assert the pickle result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
Summary:
Previously, we just stored the char pointer in the entry; the string it points to is a
temporary object and will be destructed by the time we want to dump/access it.
A quick fix is to store a copy of the string, without changing the
upstream char*.
An alternative is to change every profilingTitle into std::string; this,
however, would need a comprehensive overhaul of the code up to the
c10d::Work layer above WorkNCCL, RecordFunction, etc.
We chose the first option for this change.
Resolves #119808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The second line above has the unintended consequence of invoking the copy/assignment
operator of the entry object, since a reference itself cannot be re-bound.
Also, what could cause the crash is that the entry reference could become invalid if entries_ is
resized by other threads, and this could result in a 'copy to a garbage
location'. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.
Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
Some APIs like ncclCommAbort can cause NCCL kernels to finish even if
they were previously stuck. Because we may gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
that completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change; even if the collective eventually gets marked complete,
we can observe that it happened minutes after it was scheduled.
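Illustrative sketch of how the discovery timestamps can be used when reading a dump (field names as in the entry shown earlier; the 10-minute threshold is an assumed example, not a value from this change):
```
# Illustrative sketch: a collective marked 'completed' whose completion was
# only *discovered* long after it was created likely finished because of an
# abort rather than completing in time.
ASSUMED_TIMEOUT_NS = 10 * 60 * 1_000_000_000  # assumed example threshold

def suspiciously_late(dump, timeout_ns=ASSUMED_TIMEOUT_NS):
    late = []
    for e in dump.get("entries", []):
        done = e.get("time_discovered_completed_ns")
        if e.get("state") == "completed" and done is not None:
            if done - e["time_created_ns"] > timeout_ns:
                late.append(e)
    return late
```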
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.
duration_ms will be an optional field, since it only works when
timing is enabled. Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement. Duration is also
not available for collectives not in a completed state.
Note: computing duration can lead to a hang due to calling cudaEventDuration when
the cuda driver queue is full.
We never want the dump() API to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible that during a dump() call some of the most recent
collectives' durations won't have been computed yet. We
make this tradeoff to ensure that dump() itself never hangs.
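Consumers of the dump should therefore treat duration_ms as optional; a minimal sketch:
```
# Illustrative sketch: duration_ms may be absent (timing disabled, collective
# not completed, or duration not yet computed by the watchdog at dump time).
def durations_ms(dump):
    return [
        (e["record_id"], e["duration_ms"])
        for e in dump.get("entries", [])
        if e.get("duration_ms") is not None
    ]
```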
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
Key vars are strings used as dict keys (e.g. duration_s was the string
"duration_ms").
The _s suffix confused me with time in seconds, since duration_s was a key string while
duration_ms is another variable holding a time value.
Now duration_key is "duration_ms".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
Previously, we had the writer registered on each NCCL PG (backend), so for every PG there is an NCCL PG instance; if a customized writer is used while multiple sub-PGs are in use, the user has to register the writer for every backend, which is bad UX. Furthermore, the debug info is global, so it does not make sense to have a writer per instance. We even have a static mutex in `dumpDebuggingInfo` to serialize the writes, which makes it more obvious that we can make the writer a singleton, so that we have only one writer instance for all PG instances.
Although the rationale is clear, the implementation may vary a lot, so this PR is an RFC for now to see whether this implementation makes sense.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501