pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
soulitzer	b3861ac8e7	[reland] Warn if AccumulateGrad stream does not match producer node stream (#166136 ) ghstack-source-id: 59641aa32dc6fd027abf3276017432b693aa71f8 Pull-Request-resolved: https://github.com/pytorch/pytorch/pull/165065 Fixes #ISSUE_NUMBER Opening a new PR for codev Pull Request resolved: https://github.com/pytorch/pytorch/pull/166136 Approved by: https://github.com/ngimel	2025-11-01 12:33:48 +00:00
Yuanyuan Chen	f0745ddb11	Replace c10::call_once with static initialization (#166381 ) This PR replaces c10::call_once calls with static initialization when possible. C++11 semantics guarantees that static initialization is atomic. Static initialization also has lower cost than using c10::call_once. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166381 Approved by: https://github.com/malfet	2025-11-01 07:09:40 +00:00
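The trade-off described in the commit above can be sketched in standard C++. This is an illustration only, not the PyTorch change itself: `std::call_once` stands in for `c10::call_once`, and `make_config`, `init_count`, and both accessor names are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical counter recording how many times the initializer has run.
inline int& init_count() {
  static int count = 0;
  return count;
}

// Hypothetical stand-in for an expensive one-time initializer.
inline std::string make_config() {
  ++init_count();
  return "config";
}

// Before: an explicit once-flag guards the initialization.
// std::call_once is used here as a stand-in for c10::call_once.
inline const std::string& config_with_call_once() {
  static std::once_flag flag;
  static std::string value;
  std::call_once(flag, [] { value = make_config(); });
  return value;
}

// After: a function-local static. Since C++11 the language guarantees that
// the initializer runs exactly once, even with concurrent callers.
inline const std::string& config_with_static() {
  static const std::string value = make_config();
  return value;
}

// Calls the static-initialized accessor twice and checks that the second
// call returns the same object without re-running the initializer.
inline void check_static_runs_once() {
  const std::string& a = config_with_static();
  const int after_first = init_count();
  const std::string& b = config_with_static();
  assert(&a == &b);
  assert(init_count() == after_first);
}
```

The static form only applies when the value can be produced by a single initializer expression, which is presumably why the commit says "when possible".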
Yuanyuan Chen	e2dc32f4ba	Replace decltype(auto) with auto (#166537 ) This PR replaces `decltype(auto)` with `auto` for C++ return type deduction and simplifies some templates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166537 Approved by: https://github.com/Skylion007	2025-11-01 00:30:23 +00:00
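As a rough illustration of when such a replacement is safe, the sketch below contrasts the two deduction rules. The function names are hypothetical and are not taken from the PR; the point is only that `auto` and `decltype(auto)` agree when a local is returned by name, and differ when the returned expression is a reference.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

using Vec = std::vector<int>;

// Returning a local by name: both spellings deduce a plain `int`,
// so `auto` is the simpler choice.
inline auto sum_auto(const Vec& v) {
  int total = 0;
  for (int x : v) total += x;
  return total;
}

inline decltype(auto) sum_decltype_auto(const Vec& v) {
  int total = 0;
  for (int x : v) total += x;
  return total;
}

// Returning an expression that yields a reference: here the two differ.
inline decltype(auto) first_ref(Vec& v) { return v[0]; }  // deduces int&
inline auto first_copy(Vec& v) { return v[0]; }           // deduces int

static_assert(std::is_same<decltype(sum_auto(std::declval<const Vec&>())), int>::value,
              "auto deduces a value type");
static_assert(std::is_same<decltype(sum_decltype_auto(std::declval<const Vec&>())), int>::value,
              "decltype(auto) deduces the same value type for a named local");
static_assert(std::is_same<decltype(first_ref(std::declval<Vec&>())), int&>::value,
              "decltype(auto) keeps the reference");
static_assert(std::is_same<decltype(first_copy(std::declval<Vec&>())), int>::value,
              "auto drops the reference");

// Shows the runtime consequence of the reference/value difference.
inline void check_reference_semantics() {
  Vec v{1, 2, 3};
  first_ref(v) = 10;         // writes through the returned reference
  int copy = first_copy(v);  // independent copy of v[0]
  copy = 99;
  assert(v[0] == 10);
  assert(copy == 99);
}
```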
PyTorch MergeBot	694d205143	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `311ea0dec0`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3469308568))	2025-10-30 17:52:29 +00:00
linhaifeng	369f2d6951	[3/N] fix typo in other folders (#166606 ) fix typo in other folders #166374 #166126 _typos.toml ```bash [files] extend-exclude = ["tools/linter/dictionary.txt"] [default.extend-words] nd = "nd" arange = "arange" Nd = "Nd" GLOBALs = "GLOBALs" hte = "hte" iy = "iy" PN = "PN" Dout = "Dout" optin = "optin" gam = "gam" PTD = "PTD" Sur = "Sur" nin = "nin" tme = "tme" inpt = "inpt" mis = "mis" Raison = "Raison" ouput = "ouput" nto = "nto" Onwer = "Onwer" callibrate = "callibrate" ser = "ser" Metdata = "Metdata" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/166606 Approved by: https://github.com/ezyang	2025-10-30 10:30:40 +00:00
Bruce Chang	311ea0dec0	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/kwen2501	2025-10-30 01:50:54 +00:00
Scott Wolchok	7d16fcf2df	Re-re-re-re-apply "C++-accessible Placements via pybind11 (#163030 )" (#166132 ) Was reverted (again!) due to a merge conflict that crept in sometime during the "export to github -> land internally -> merge on github" process. D85096233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166132 Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/malfet	2025-10-27 21:19:32 +00:00
Natalia Gimelshein	2efcf3ca98	Reverts #163712 and forces allgather/scatter inputs/outputs to be contiguous (#166181 ) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/166181 Approved by: https://github.com/kwen2501	2025-10-25 02:43:10 +00:00
PyTorch MergeBot	75b8295868	Revert "Warn if AccumulateGrad stream does not match producer node stream (#165065 )" This reverts commit `12f742941d`. Reverted https://github.com/pytorch/pytorch/pull/165065 on behalf of https://github.com/clee2000 due to broke internal builds D85273204 usages of TORCH_API void add need to be updated? ([comment](https://github.com/pytorch/pytorch/pull/165065#issuecomment-3438061854))	2025-10-23 17:02:49 +00:00
soulitzer	12f742941d	Warn if AccumulateGrad stream does not match producer node stream (#165065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165065 Approved by: https://github.com/ngimel	2025-10-22 17:33:27 +00:00
zhudada	2998abd777	[Code Clean] Better error handling in torch/csrc/distributed (#165053 ) Replace the runtime_error of the vanilla C++ exceptions with TORCH_CHECK. Including: torch/csrc/distributed/* Partially fixes #148114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165053 Approved by: https://github.com/FFFrog, https://github.com/albanD	2025-10-22 01:40:36 +00:00
Han Chao	a1005427bf	[xpu] Support high stream for ProcessGroupXCCL (#163049 ) Add high priority stream support for ProcessGroupXCCL. Just like CUDA, XPU streams also support execution with higher priority compared to other streams. Implementation in https://github.com/intel/torch-xpu-ops/pull/1715, add register here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163049 Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD	2025-10-22 00:54:25 +00:00
Zhaoqi Zhu	04adfe5ba9	Make Backend::setGroupUid virtual (#165957 ) As titled, so that we may customize this function in custom backends Pull Request resolved: https://github.com/pytorch/pytorch/pull/165957 Approved by: https://github.com/d4l3k	2025-10-21 21:33:24 +00:00
PyTorch MergeBot	ad4dc52bf6	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `4e643422f6`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3429426503))	2025-10-21 20:24:14 +00:00
Bruce Chang	4e643422f6	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/kwen2501	2025-10-21 19:47:33 +00:00
Yuanyuan Chen	99c8640b5d	[1/N] Change C-style casts to static_cast or reinterpret_cast (#165750 ) This series of changes tries to convert C-style casts into C++ alternatives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750 Approved by: https://github.com/Skylion007	2025-10-20 23:27:13 +00:00
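A minimal sketch of the kind of rewrite this series describes, using hypothetical helper names rather than code from the PR. The named casts make the intent explicit: `static_cast` for value conversions, `reinterpret_cast` for reinterpreting a pointer.

```cpp
#include <cassert>
#include <cstdint>

// static_cast: a value conversion between arithmetic types.
inline int to_int(double x) {
  return static_cast<int>(x);  // C-style equivalent: (int)x
}

// reinterpret_cast: view a pointer as an integer address.
inline std::uintptr_t address_of(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p);  // C-style equivalent: (std::uintptr_t)p
}

// Exercises both helpers.
inline void check_casts() {
  int value = 0;
  assert(to_int(3.9) == 3);         // conversion truncates toward zero
  assert(address_of(&value) != 0);  // a live object has a non-null address
}
```

Unlike a C-style cast, each named cast refuses conversions outside its category at compile time, so an accidental pointer reinterpretation cannot hide behind what looks like a numeric conversion.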
PyTorch MergeBot	ab82456c16	Revert "[1/N] Change C-style casts to static_cast or reinterpret_cast (#165750 )" This reverts commit `e1e8491b31`. Reverted https://github.com/pytorch/pytorch/pull/165750 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165750#issuecomment-3422413890))	2025-10-20 14:51:58 +00:00
Yuanyuan Chen	e1e8491b31	[1/N] Change C-style casts to static_cast or reinterpret_cast (#165750 ) This series of changes tries to convert C-style casts into C++ alternatives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750 Approved by: https://github.com/Skylion007	2025-10-20 04:36:19 +00:00
PyTorch MergeBot	633a3b7f67	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `fa0db212e7`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3419893217))	2025-10-19 19:20:45 +00:00
Bruce Chang	fa0db212e7	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/kwen2501	2025-10-19 18:00:08 +00:00
Yuanyuan Chen	0f0b4bf029	[1/N] Remove unused header inclusion (#165763 ) This PR removes unused header inclusion in C++ files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165763 Approved by: https://github.com/Skylion007	2025-10-18 05:23:11 +00:00
orangeH25	e9f4999985	[Code Clean] Replace std::runtime_error with TORCH_CHECK (#165305 ) Fixes part of #148114 Including: - torch/csrc/distributed Pull Request resolved: https://github.com/pytorch/pytorch/pull/165305 Approved by: https://github.com/FFFrog, https://github.com/albanD	2025-10-18 01:08:44 +00:00
PyTorch MergeBot	fae74cd52f	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `a032510db3`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3416718767))	2025-10-17 18:55:53 +00:00
Bruce Chang	a032510db3	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/Skylion007, https://github.com/syed-ahmed, https://github.com/kwen2501	2025-10-17 17:55:03 +00:00
Tushar Jain	7e150467f7	allow providing full fr trace path (#165639 ) Summary: - allow users to specify the full path instead of fr suffixing the rank id - this will be used by torchft to provide the global rank id across all replicas - we can't just prefix the replica id because analysis tool expects the file name to provide a unique integer --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/165639). * #165638 * #165640 * #165677 * #165642 * __->__ #165639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165639 Approved by: https://github.com/fduwjj	2025-10-17 04:43:44 +00:00
Catherine Lee	0aa7ebaf03	Fix periodic debug tests failing due to FakeProcessGroup things (#165479 ) These happen when building with CMAKE_BUILD_TYPE=RelWithAssert This should fix two types of failures that started with https://github.com/pytorch/pytorch/pull/163665 Disclaimer that I used a lot of AI since I don't how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now) The first one type is Truncated: ``` default_pg, _ = _new_process_group_helper( File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper backend_class = creator_fn(dist_backend_opts, backend_options) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg return FakeProcessGroup._create_internal( RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero. 
Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const, char const, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 #7 c10::detail::torchInternalAssertFail(char const, char const, unsigned int, char const, char const) from ??:0 #8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance, void const) from init.cpp:0 #9 pybind11::detail::type_caster_generic::cast(void const, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const, void* ()(void const), void* ()(void const), void const) from :0 #10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object, _object)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, 
pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object, _object)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > ()(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 ``` and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the ref count handling instead. However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR The other one type is ``` Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import self.assertEqual(out, "") File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual raise error_metas.pop()[0].to_error( # type: ignore[index] AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != '' - /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode. 
- if is_available() and not torch._C._c10d_init(): To execute this test, run the following from the base repo dir: python test/test_testing.py TestImports.test_no_warning_on_import ``` which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one? Pull Request resolved: https://github.com/pytorch/pytorch/pull/165479 Approved by: https://github.com/ezyang	2025-10-15 18:16:08 +00:00
Yuanyuan Chen	36871622f1	[2/N] Mark unused parameters in C++ code (#165121 ) This is follow-up of #164912 to mark unused C++ parameters to improve code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165121 Approved by: https://github.com/Skylion007	2025-10-15 03:04:39 +00:00
Yuanyuan Chen	ecb53078fa	Turn some const strings into constexpr in C++ code (#165203 ) This PR turns more const strings into constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165203 Approved by: https://github.com/Skylion007	2025-10-13 20:25:20 +00:00
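The change above can be pictured with a small sketch. The constant names below are hypothetical and do not come from the PR; the sketch assumes C++17 for `std::string_view`.

```cpp
#include <cassert>
#include <string_view>

// Before: a const pointer to a string literal (hypothetical name).
const char* const kBackendNameConst = "nccl";

// After: constexpr strings, usable in constant expressions (hypothetical names).
constexpr const char* kBackendName = "nccl";
constexpr std::string_view kBackendNameView = "nccl";

// With constexpr, properties of the string can be checked by the compiler.
static_assert(kBackendNameView.size() == 4, "length is known at compile time");
static_assert(kBackendNameView[0] == 'n', "contents are known at compile time");

// All three spellings refer to the same text at run time.
inline void check_strings_agree() {
  assert(kBackendNameView == kBackendNameConst);
  assert(kBackendNameView == kBackendName);
}
```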
Dzmitry Huba	5e58420dff	LocalTensor (#164537 ) A LocalTensor is a tensor subclass which simulates a tensor that is distributed across SPMD ranks. A LocalTensor might be size N, but in fact there are world_size shards/replicas of it stored internally. When you do a plain PyTorch operation on it, we apply the operation to each shard; when you do a collective, we do the mathematically equivalent operation on the local shards. A LocalTensor is associated with a list of ranks which specify which ranks it holds local tensors for. NB, this is NOT a DataParallel like abstraction where you can run operations on multiple different GPUs. It is intended purely for debugging purposes, the overhead is almost certainly too high to keep eight GPUs (even the C++ autograd needs multithreading to keep up!) (It might potentially be possible to trace through this with torch.compile and then compile it with CUDA graphs but this is currently a non-goal.) In order to handle MPMD, we provide a helper decorator that allows you to run a function with no side effects for each LocalTensor shard and combine results back into LocalTensor or LocalIntNode. Note: This PR converts all DTensor ops and some DTensor tests to illustrate intended usage and ensure correctness. In subsequent PRs, more tests will be converted. During test conversion we aim to share as much as possible of test logic between multi-process / multi-threaded and local tensor tests. We would like developers to be able to run both flavors of the tests. Note: This work is based on the original proposal by @ezyang (WIP PR https://github.com/pytorch/pytorch/pull/162753). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164537 Approved by: https://github.com/ezyang	2025-10-12 20:06:41 +00:00
PyTorch MergeBot	f975bd58af	Revert "Warn if AccumulateGrad stream does not match producer node stream (#165065 )" This reverts commit `a70ef954b9`. Reverted https://github.com/pytorch/pytorch/pull/165065 on behalf of https://github.com/izaitsevfb due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/165065#issuecomment-3391387386))	2025-10-10 17:29:29 +00:00
soulitzer	a70ef954b9	Warn if AccumulateGrad stream does not match producer node stream (#165065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165065 Approved by: https://github.com/ngimel ghstack dependencies: #162815	2025-10-10 16:46:01 +00:00
Seonmyeong Bak	263db92563	Add knobs in FR dump by watchdog (stacktrace and only active collectives) and trigger FR even on any exceptions (#164591 ) This PR includes a couple of changes to extend FlightRecorder dump by PyTorch watchdog - New knobs to control FR dump as suggested in the public documentation even for watchdog (TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE) - Trigger the flight recorder dump on exceptions which could be triggered by any CUDA / host side error (TORCH_NCCL_EXTRA_DUMP_ON_EXEC) -> Can be used as a snapshot of the workload progress for post-mortem analysis Pull Request resolved: https://github.com/pytorch/pytorch/pull/164591 Approved by: https://github.com/fduwjj	2025-10-09 05:33:35 +00:00
fduwjj	8ca986ee60	[fr] Enable reset the FR recording for fault tolerance (#164988 ) We also want to have a python side API for users to reset FR recording for FR entries. We don't need to reset the PGNCCL's member counter since we are creating new PGNCCL anyway. FR is a global ring buffer, so we need to reset it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988 Approved by: https://github.com/tushar00jain ghstack dependencies: #164752	2025-10-09 01:03:01 +00:00
fduwjj	0b15f7ae05	[fr] Enable dynamic path write for FR dump when it comes to torchft (#164752 ) When it comes to FR dump, in the case of fault tolerance, users want to set the dump path to a different one when there is restart, so we just enable this case for users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164752 Approved by: https://github.com/tushar00jain	2025-10-08 21:36:32 +00:00
Lakshay Garg	a4110fedcf	Use insert_or_assign instead of erase+emplace (#164868 ) insert_or_assign does effectively the same thing as erase+emplace but more efficiently since the search does not need to be repeated Pull Request resolved: https://github.com/pytorch/pytorch/pull/164868 Approved by: https://github.com/eqy	2025-10-08 19:13:49 +00:00
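The equivalence claimed in the commit above can be sketched as follows, assuming C++17. `Registry` and the two setter names are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using Registry = std::unordered_map<std::string, int>;

// Before: two lookups. The erase is needed because emplace alone
// leaves an existing entry untouched.
inline void set_erase_emplace(Registry& m, const std::string& key, int value) {
  m.erase(key);
  m.emplace(key, value);
}

// After: one lookup. Returns true if the key was newly inserted,
// false if an existing value was overwritten.
inline bool set_insert_or_assign(Registry& m, const std::string& key, int value) {
  return m.insert_or_assign(key, value).second;
}

// Applies the same writes through both helpers and compares the results.
inline void check_equivalence() {
  Registry a;
  Registry b;
  set_erase_emplace(a, "rank", 1);
  set_erase_emplace(a, "rank", 2);
  const bool inserted = set_insert_or_assign(b, "rank", 1);
  const bool overwritten = !set_insert_or_assign(b, "rank", 2);
  assert(inserted);
  assert(overwritten);
  assert(a == b);
}
```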
Ke Wen	5c827a4133	[SymmMem] Multi-root tile reduction (#164757 ) Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom): Perform multiple tile reductions concurrently, with each tile reduced to a separate root. - The number of concurrent reductions can be smaller than world size, i.e. roots can be a subset of all ranks. But all ranks are still required to call into this API. - Currently supports NVLink SHARP scope only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164757 Approved by: https://github.com/weifengpy, https://github.com/fegin ghstack dependencies: #162243	2025-10-08 17:28:00 +00:00
Ke Wen	19bf67be32	multimem reduce (#164517 ) Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op. The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517 Approved by: https://github.com/ngimel	2025-10-08 05:25:16 +00:00
Ke Wen	d444384003	[SymmMem] Tiled reduce (#162243 ) Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)` For now supports only: - NVSHMEM backed symmetric tensor; - 2D tensor and tile; - torch.float. Testing on bottom-right quadrant: ``` rank 0: tensor([[0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0') PASSED ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162243 Approved by: https://github.com/ngimel	2025-10-08 02:03:04 +00:00
PyTorch MergeBot	f505caa71b	Revert "multimem reduce (#164517 )" This reverts commit `d1cbb74fb1`. Reverted https://github.com/pytorch/pytorch/pull/164517 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164517#issuecomment-3378529654))	2025-10-07 20:12:38 +00:00
PyTorch MergeBot	df640df68a	Revert "Reapply "C++-accessible Placements via pybind11 (#163030 )" (#164519 )" This reverts commit `8c0bc879b9`. Reverted https://github.com/pytorch/pytorch/pull/164519 on behalf of https://github.com/malfet due to Still breaks internal workflows ([comment](https://github.com/pytorch/pytorch/pull/164519#issuecomment-3378469432))	2025-10-07 19:46:17 +00:00
Lakshay Garg	9ecd092bd9	Add python bindings for NCCL CTA policies (#164309 ) NCCLConfig can now be constructed with non-default [cta policies][1] ```python import torch from torch.distributed import ProcessGroupNCCL as nccl config = nccl.NCCLConfig() config.cta_policy = nccl.NCCL_CTA_POLICY_ZERO # NCCL version >= 2.28 ``` [1]: https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2283/user-guide/docs/api/flags.html#nccl-communicator-cta-policy-flags Pull Request resolved: https://github.com/pytorch/pytorch/pull/164309 Approved by: https://github.com/eqy	2025-10-07 18:16:20 +00:00
Scott Wolchok	8c0bc879b9	Reapply "C++-accessible Placements via pybind11 (#163030 )" (#164519 ) This makes Placement data representation available in C++ via pybind11. Reapply with fix for internal errors. D83788896 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164519 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2025-10-06 23:19:14 +00:00
Yuanyuan Chen	9fff8155c3	[2/N] Fix clang-tidy readability checks (#164652 ) This PR applies clang-tidy readability checks to jit sources and all headers in the code base. `readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652 Approved by: https://github.com/Skylion007	2025-10-06 01:06:01 +00:00
PyTorch MergeBot	2c5ed6e7c0	Revert "[2/N] Fix clang-tidy readability checks (#164652 )" This reverts commit `3c5ca685d6`. Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707))	2025-10-05 21:36:57 +00:00
Yuanyuan Chen	3c5ca685d6	[2/N] Fix clang-tidy readability checks (#164652 ) This PR applies clang-tidy readability checks to jit sources and all headers in the code base. `readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652 Approved by: https://github.com/Skylion007	2025-10-05 07:05:11 +00:00
Yuanyuan Chen	5103ecc5d8	[1/N] Fix clang-tidy readability checks (#164561 ) Check all `.cpp` files except `jit` files for readability thoroughly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164561 Approved by: https://github.com/Skylion007	2025-10-04 09:40:38 +00:00
Ke Wen	d1cbb74fb1	multimem reduce (#164517 ) Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op. The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517 Approved by: https://github.com/ngimel	2025-10-03 02:41:10 +00:00
ankushwahaRH	ece5e0f01b	Fake process group Direct construction error (#163665 ) Fixes #162129. Added validation in `_rank_not_in_group()` to check if `FakeProcessGroup` is properly initialized before use, raising a clear error message if `torch.distributed.init_process_group(backend='fake')` hasn't been called first. This prevents silent failures and ensures proper dispatch system integration for all distributed operations. Added test case `test_fake_process_group_direct_usage_error()` that validates the error is raised for `all_reduce` and `all_to_all_single` operations. Please let me know if additional distributed operators should be tested or if any other updates are needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665 Approved by: https://github.com/ezyang	2025-10-02 22:19:26 +00:00
Yuanyuan Chen	115af42e9d	Fix readibility checks in TIDY and apply them (#164475 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/164475 Approved by: https://github.com/albanD, https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-10-02 20:34:49 +00:00
PyTorch MergeBot	f6f7676756	Revert "C++-accessible Placements via pybind11 (#163030 )" This reverts commit `3e03deab6f`. Reverted https://github.com/pytorch/pytorch/pull/163030 on behalf of https://github.com/swolchok due to doesn't pass pyre ([comment](https://github.com/pytorch/pytorch/pull/163030#issuecomment-3362450379))	2025-10-02 18:25:24 +00:00

2592 Commits