Commit Graph

15373 Commits

Author SHA1 Message Date
PyTorch MergeBot
4926bd6004 Revert "Fix the Problems About Defining Static Variable in Inline Function (#147095)"
This reverts commit 3da14d38bd.

Reverted https://github.com/pytorch/pytorch/pull/147095 on behalf of https://github.com/atalman due to breaks internally ([comment](https://github.com/pytorch/pytorch/pull/147095#issuecomment-2787129770))
2025-04-08 17:10:36 +00:00
Luca Wehrstedt
1791b4150b Clarify behavior of TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK (#150682)
I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later.

I'm reworking the code a bit by using a few helper functions whose names should make this behavior clearer.
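
A minimal sketch of what one such helper might look like (the helper name and `Segment` type are hypothetical; `ncclCommRegister` is NCCL's buffer-registration API, available in NCCL >= 2.19):

```
#include <nccl.h>

#include <cstddef>
#include <vector>

struct Segment {
  void* ptr;
  size_t size;
};

// Register every allocator segment with one communicator, keeping the
// returned handles for a later ncclCommDeregister.
void registerAllSegmentsWithComm(const std::vector<Segment>& segments,
                                 ncclComm_t comm,
                                 std::vector<void*>& handles) {
  for (const auto& seg : segments) {
    void* handle = nullptr;
    if (ncclCommRegister(comm, seg.ptr, seg.size, &handle) == ncclSuccess) {
      handles.push_back(handle);
    }
  }
}
```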

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
2025-04-08 13:00:59 +00:00
Luca Wehrstedt
3649e2e7bd Safer bookkeeping of NCCL communicators (#150681)
This consists mainly of two changes:
- ensure we can reliably obtain the device from an `NCCLComm` object (there was one constructor that didn't set the device)
- use an RAII pattern for acquiring the lock to the global dictionary of `NCCLComm`s, which ensures the lock is released in case of exceptions (see the sketch below)
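
A minimal sketch of the RAII locking pattern, assuming a mutex-guarded global map (the map's exact shape is illustrative, not PyTorch's actual bookkeeping):

```
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

class NCCLComm;  // opaque here; the real class lives in ProcessGroupNCCL

std::mutex commsMutex;
std::unordered_map<std::string, std::shared_ptr<NCCLComm>> comms;

void eraseComm(const std::string& key) {
  // std::lock_guard releases commsMutex when it goes out of scope, even if
  // an exception propagates out of this function.
  std::lock_guard<std::mutex> lock(commsMutex);
  comms.erase(key);
}
```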

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150681
Approved by: https://github.com/kwen2501
2025-04-08 11:12:37 +00:00
FFFrog
3da14d38bd Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more information

- Remove unused header files
- Move the inline function that defines the static variable to a .cc file (illustrated below)
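
A simplified illustration of the problem and the fix (`Registry`/`getRegistry` are hypothetical stand-ins, not the PR's actual code):

```
// registry.h -- problematic version: an inline function in a header that
// defines a static local can end up with one copy of `instance` per shared
// library, breaking the single-instance assumption.
//
//   inline Registry& getRegistry() {
//     static Registry instance;
//     return instance;
//   }

// registry.h -- fixed version: declare only.
struct Registry { /* ... */ };
Registry& getRegistry();

// registry.cc -- define the static variable in exactly one translation unit.
Registry& getRegistry() {
  static Registry instance;
  return instance;
}
```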

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-04-08 10:23:02 +00:00
Hexin Wang
91173ff89a Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Details of the issue:

If PyTorch issues send/recv on two 2-rank comms, and these comms are managed by a single ProcessGroupNCCL instance, then the comms must be aborted either in issue order or as a group. I.e., the following out-of-order sequential abort will cause a hang in NCCL:

```
recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);
```
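
A sketch of the grouped-abort option (assuming NCCL's group semantics extend to `ncclCommAbort`, per the "in group" wording above; this is not necessarily the PR's exact fix):

```
#include <nccl.h>

// Abort both comms inside one NCCL group so teardown is coordinated,
// rather than aborting them one at a time out of issue order.
void abortBoth(ncclComm_t comm0, ncclComm_t comm1) {
  ncclGroupStart();
  ncclCommAbort(comm0);
  ncclCommAbort(comm1);
  ncclGroupEnd();
}
```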

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501
2025-04-07 23:20:49 +00:00
PyTorch MergeBot
bf1132c196 Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit d86c14156d.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing periodic test: python test/test_cpp_extensions_mtia_backend.py TestCppExtensionMTIABackend.test_device_context ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2784506104))
2025-04-07 20:09:53 +00:00
fduwjj
06e9deabb6 [c10d][fr] Improve FR dump robustness with all watchdog broadcast wait and more frequent store check (#150652)
While debugging missing FR dumps and missing dump logs, I had a couple of initial findings:
1. On the same rank, if a second watchdog timeout triggers on a different PG (or sub-PG), that watchdog thread will immediately throw an exception instead of sleeping. We fix that by making the watchdog thread still wait for 1 min.
2. The FR dump takes about 900ms to 1200ms, so we are not checking the store frequently enough. But instead of changing the check frequency from 1 sec to 300 ms, we decided to just let all ranks sleep for 1 min universally rather than using a promise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150652
Approved by: https://github.com/kwen2501
2025-04-07 16:33:27 +00:00
Shivam Raikundalia
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add an API to actually take the snapshot when the client interface is called.
2. Add ifdefs to the builds so that Kineto hooks the snapshot correctly.

Design philosophy: there is one interesting part of this implementation, and it is export. For export we call the Python implementation rather than the C++ one, even though we are already in C++, because it is better to have one export path rather than two. Personally, I want there to be parity between auto-trace and on-demand, so if we can limit the side paths we will have an easier time maintaining this relationship.
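
A sketch of what routing the C++ on-demand path through the Python exporter might look like (a pybind11 illustration; `torch.cuda.memory._dump_snapshot` is the public snapshot dumper, but the profiler's actual call site may differ):

```
#include <pybind11/pybind11.h>

#include <string>

namespace py = pybind11;

// Call the Python exporter from C++ so auto-trace and on-demand share a
// single export implementation.
void exportSnapshot(const std::string& path) {
  py::gil_scoped_acquire gil;  // we may be called from a non-Python thread
  py::module_::import("torch.cuda.memory").attr("_dump_snapshot")(path);
}
```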

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
Zhengxu Chen
24aadb40fb [precompile] Serialization for GlobalStateGuard (#150636)
Summary: To preserve global state guards we need to make the C++ type serializable. We use JSON because it's easy to do and we don't have a lot of data in the global state.
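
A minimal sketch of the JSON round trip, assuming nlohmann::json and illustrative field names (the real GlobalStateGuard tracks more state):

```
#include <nlohmann/json.hpp>

#include <string>

struct GlobalStateGuardState {
  bool grad_mode;
  bool torch_function;
};

std::string serialize(const GlobalStateGuardState& s) {
  nlohmann::json j;
  j["grad_mode"] = s.grad_mode;
  j["torch_function"] = s.torch_function;
  return j.dump();
}

GlobalStateGuardState deserialize(const std::string& text) {
  auto j = nlohmann::json::parse(text);
  return {j["grad_mode"].get<bool>(), j["torch_function"].get<bool>()};
}
```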

Test Plan: test_dynamo -k test_global_state_guard_serialization

Differential Revision: D72410611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150636
Approved by: https://github.com/williamwen42
2025-04-07 03:10:03 +00:00
Yu, Guangye
d86c14156d Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different device backends.
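
A sketch of the generalized pattern, assuming the usual POSIX pthread_atfork-based poisoning (simplified relative to the PR):

```
#include <pthread.h>

#include <mutex>

static bool poisoned_in_child = false;

// Register the atfork handler exactly once per process; after a fork, the
// child is "poisoned" so later device use there can raise a clear error
// instead of crashing.
void poison_fork() {
  static std::once_flag once;
  std::call_once(once, [] {
    pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr,
                   /*child=*/[] { poisoned_in_child = true; });
  });
}
```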

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-07 02:06:21 +00:00
Natalia Gimelshein
55e62ff74a bf16 grouped gemm (#150374)
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. For now I factored out only a helper kernel that prepares arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
Mu-Chu Lee
60a45eb862 [AOTInductor] Introduce MaybeOwningAtenTensorHandle for ConstantMap (#150275)
Summary:
We used RAIIAtenTensorHandle for ConstantMap, where RAIIAtenTensorHandle is a unique_ptr-style owner, meaning all memory handling is done internally by AOTInductor.

In this PR, we introduce ConstantAtenTensorHandle, which replaces RAIIAtenTensorHandle. This class holds a raw AtenTensorHandle and also owns a RAIIAtenTensorHandle if the user decides to delegate memory management to AOTInductor.

This is a prerequisite for user-managed buffers; this PR, however, only introduces the class and makes sure it works with the existing AOTInductor, with default behavior identical to using RAIIAtenTensorHandle.
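
A minimal sketch of the idea (simplified; assumes RAIIAtenTensorHandle is default-constructible and exposes get(), as in the AOTI runtime headers):

```
#include <torch/csrc/inductor/aoti_runtime/utils.h>

using torch::aot_inductor::RAIIAtenTensorHandle;

// Holds a raw handle; optionally also owns the tensor via a RAII handle when
// the user delegates memory management to AOTInductor.
class ConstantAtenTensorHandle {
 public:
  // Non-owning: the user manages the tensor's lifetime.
  explicit ConstantAtenTensorHandle(AtenTensorHandle handle)
      : handle_(handle) {}

  // Owning: lifetime is delegated to AOTInductor via the RAII handle.
  explicit ConstantAtenTensorHandle(RAIIAtenTensorHandle owned)
      : handle_(owned.get()), owned_(std::move(owned)) {}

  AtenTensorHandle get() const { return handle_; }

 private:
  AtenTensorHandle handle_;     // declared first: initialized before the move below
  RAIIAtenTensorHandle owned_;  // empty in the non-owning case
};
```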

Test Plan:
Existing tests. No change should be introduced within this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150275
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-05 06:00:35 +00:00
Zhengxu Chen
07d439e782 [aoti] Split ConstantType definition out of model.h (#150545)
Summary:
Splitting the type definition of ConstantType into a separate header because it's needed by Sigmoid OSS, but including the entire model.h header causes the following compilation error:
```
2025-04-01T18:12:42.0391272Z FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp.o
2025-04-01T18:12:42.0417705Z /opt/cache/bin/sccache /opt/cache/bin/clang++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_ENABLE_LLVM -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -DXNN_LOG_LEVEL=0 -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/nlohmann -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-3.0.2 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. -I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/fbgemm/include -I/
2025-04-01T18:12:42.0444143Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp:5:
2025-04-01T18:12:42.0445081Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTIDelegateExecutor.h:6:
2025-04-01T18:12:42.0446002Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTInductorModelImpl.h:5:
2025-04-01T18:12:42.0447549Z /var/lib/jenkins/workspace/torch/csrc/inductor/aoti_runtime/model.h:78:13: error: function 'RAII_cpuMalloc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
2025-04-01T18:12:42.0448656Z RAIIDataPtr RAII_cpuMalloc(size_t num_bytes) {
```

model.h defines the RAII_malloc functions directly in an anonymous namespace, which seems pretty sad. We should do something about it, but maybe not in the current diff.

Test Plan: CI

Differential Revision: D72320413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150545
Approved by: https://github.com/desertfire
2025-04-04 15:48:45 +00:00
PyTorch MergeBot
4854926aeb Revert "Add torch._scaled_mm for CPU (#150410)"
This reverts commit 3b02f795c5.

Reverted https://github.com/pytorch/pytorch/pull/150410 on behalf of https://github.com/malfet due to It breaks ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/150410#issuecomment-2777704212))
2025-04-04 06:52:54 +00:00
fduwjj
bd9c42ebfb [c10d] Surface error type when we unlink and create named pipe for DumpPipe (#150648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150648
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-04-04 02:12:32 +00:00
Richard Howell
76994d48f4 [pytorch] add experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150537)
Summary: Add an experimental feature to defer PyTorch library initialization cost until after startup. As noted, this feature is not thread safe; it requires the client to maintain thread safety at library load time.

Reviewed By: zou3519

Differential Revision: D71917841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150537
Approved by: https://github.com/zou3519
2025-04-03 22:36:17 +00:00
Jason Ansel
d41c22b578 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)" (#150542)
Reverts #148261 due to a possible memory leak

This reverts commit 5d4e7d58b4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150542
Approved by: https://github.com/clee2000
2025-04-03 21:15:38 +00:00
Jiang, Yanbing
3b02f795c5 Add torch._scaled_mm for CPU (#150410)
This PR is a duplicate of https://github.com/pytorch/pytorch/pull/139975.

This PR adds torch._scaled_mm for the CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-03 19:43:45 +00:00
Luca Wehrstedt
440c07e56a Fix detection of GPU multicast (#150563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150563
Approved by: https://github.com/kwen2501
2025-04-03 15:31:15 +00:00
Isuru Fernando
ff783f062a Fix shape guard failure to be valid python (#149149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149149
Approved by: https://github.com/anijain2305
2025-04-03 14:36:17 +00:00
Junjie Wang (PyTorch)
fc674b45d4 [c10d] Add logging for desync debug report (#150513)
Summary: We want to add logging to first understand the distribution of desync debug reports.

Test Plan: Test with logger staging

Differential Revision: D72249281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150513
Approved by: https://github.com/kwen2501
2025-04-03 06:42:06 +00:00
Shivam Raikundalia
a677b491c9 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on the description of that PR, it seems we need to add C calls for each starting Python event that has a callable, so that when tracing exits we have a matching enter for any given exit. At worst this adds some unnecessary events, but it prevents segfaults/failures. My PR just cleans up some refcount implementation and logging.

Contributors: @arjun-choudhry

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 22:25:46 +00:00
PyTorch MergeBot
532530be34 Revert "[Profiler] Fix Empty C Call Queue (#150370)"
This reverts commit 5734909f34.

Reverted https://github.com/pytorch/pytorch/pull/150370 on behalf of https://github.com/clee2000 due to broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/14211763078/job/39822158330) [HUD commit link](3ac5a499dd) ([comment](https://github.com/pytorch/pytorch/pull/150370#issuecomment-2773146070))
2025-04-02 16:40:54 +00:00
Benjamin Glass
75f38dfd4e cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.
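
A sketch of the RAIIPyObject additions described above (simplified relative to the PR; copy operations are omitted for brevity):

```
#include <Python.h>

class RAIIPyObject {
 public:
  RAIIPyObject() : obj_(nullptr) {}
  explicit RAIIPyObject(PyObject* obj) : obj_(obj) {}  // takes ownership

  // Move constructor: steal the reference so the moved-from object
  // no longer decrefs it.
  RAIIPyObject(RAIIPyObject&& other) noexcept : obj_(other.obj_) {
    other.obj_ = nullptr;
  }
  RAIIPyObject& operator=(RAIIPyObject&& other) noexcept {
    if (this != &other) {
      Py_XDECREF(obj_);
      obj_ = other.obj_;
      other.obj_ = nullptr;
    }
    return *this;
  }

  ~RAIIPyObject() { Py_XDECREF(obj_); }

  // operator bool: allows `if (obj) ...` null checks.
  explicit operator bool() const { return obj_ != nullptr; }
  PyObject* get() const { return obj_; }

 private:
  PyObject* obj_;
};
```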

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
2025-04-02 09:54:27 +00:00
Shivam Raikundalia
5734909f34 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on the description of that PR, it seems we need to add C calls for each starting Python event that has a callable, so that when tracing exits we have a matching enter for any given exit. At worst this adds some unnecessary events, but it prevents segfaults/failures. My PR just cleans up some refcount implementation and logging.

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 02:44:50 +00:00
eqy
f09513e515 [CUDA][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)

This currently breaks `test_intra_node_comm_all_reduce`.
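
The conversion rule, as a minimal sketch:

```
#include <optional>
#include <string>

// An empty string at the interface boundary is treated as "no value".
std::optional<std::string> asOptional(const std::string& s) {
  return s.empty() ? std::nullopt : std::optional<std::string>(s);
}
```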

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-04-02 02:41:07 +00:00
cyy
e872c38eb3 Remove cppcoreguidelines-pro-type-member-init_fix suppression (#148638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148638
Approved by: https://github.com/zou3519
2025-04-02 01:33:20 +00:00
Junjie Wang (PyTorch)
d22e3d5efe [fr] Add logger config for flight record in PGNCCL (#150356)
Summary: We want to move from Scuba-based direct logging to logger-config-based logging. Most changes are internal, but we need to change `exception` to `exception_msg`.

Test Plan: Following https://www.internalfb.com/wiki/Server_Logging/Getting_Started_with_Logging/Onboarding_Existing_Scribe-Based_Logging_(Alpha)/ to test it.

Differential Revision: D72198171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150356
Approved by: https://github.com/fegin
2025-04-01 23:54:07 +00:00
Tristan Rice
6aea4d90fb gloo: use shared Stores (#150230)
Summary:
X-link: https://github.com/facebookincubator/gloo/pull/423

This modifies `connectFullMesh` to take a `shared_ptr<IStore>` instead of a reference. This is an API-breaking change, but fairly easy to work around.

To keep backwards compatibility in PyTorch during the commit phase, we add a new ifdef, `GLOO_SHARED_STORE`, until we update the pinned Gloo version in the PyTorch OSS repo.

This also adds a new `wait_get` method to `IStore`, which allows a more efficient operation in PyTorch's TCPStore. PyTorch's `Store::get` automatically waits, so we want to avoid waiting twice to reduce network traffic.
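
An assumed shape of the `IStore` addition (the signature is illustrative, based on the description above, not copied from Gloo):

```
#include <string>
#include <vector>

class IStore {
 public:
  virtual ~IStore() = default;
  virtual std::vector<char> get(const std::string& key) = 0;
  virtual void wait(const std::vector<std::string>& keys) = 0;
  // wait_get: one call instead of wait() followed by get(), so stores whose
  // get() already blocks (like PyTorch's TCPStore) don't wait twice.
  virtual std::vector<char> wait_get(const std::string& key) = 0;
};
```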

This change will land simultaneously in PyTorch and Gloo repos.

Test Plan:
```
buck2 test //gloo/... //caffe2/caffe2/contrib/gloo:
```

Differential Revision: D72084111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150230
Approved by: https://github.com/fduwjj
2025-04-01 23:37:25 +00:00
IvanKobzarev
a37afd23fa [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-04-01 18:45:48 +00:00
Sriram Kumar
a19b667bca [ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)
Alter the flag used to select the correct streamType in the CUDAPluggableAllocator class for ROCm GPUs. The TORCH_HIP_VERSION flag does not work as intended for ROCm, so it is replaced with USE_ROCM. This impacts Distributed Fused Adam in ROCm/APEX when using the nccl_ub feature. This has been tested with rocm/apex.

See PR https://github.com/ROCm/apex/pull/184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010
Approved by: https://github.com/jeffdaily
2025-04-01 16:49:03 +00:00
Ke Wen
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the HIP case) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetime against recycling (sketched after this list).
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
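
A minimal sketch of the CPU-side stashing from item 2 (an assumed shape; the real TensorShelf in ProcessGroupNCCL is more involved):

```
#include <ATen/ATen.h>

#include <mutex>
#include <vector>

// Keep strong references to the tensors a collective uses until the
// collective is known to be complete, instead of calling record_stream.
class TensorShelf {
 public:
  void stash(std::vector<at::Tensor> tensors) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto& t : tensors) {
      shelf_.push_back(std::move(t));
    }
  }

  // Called from work.wait() -- or by the watchdog as a safety net -- once
  // the collective has completed; dropping the references lets the caching
  // allocator recycle the memory safely.
  void unstash() {
    std::lock_guard<std::mutex> lock(mutex_);
    shelf_.clear();
  }

 private:
  std::mutex mutex_;
  std::vector<at::Tensor> shelf_;
};
```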

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
this helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU/GPU time from **230us to 195us (~15%)**

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
Natalia Gimelshein
1700599266 Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if inputs are not registered. A separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:43 +00:00
Natalia Gimelshein
414b9ae016 enable out variant of 2-shot reduction (#150153)
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified by the end (similar to what the fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:04 +00:00
Nikita Shulga
6470b373c1 torch.backends.mkldnn.flags() CM should not warn (#150358)
By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined
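
A hypothetical sketch of the pattern (the real function lives in PyTorch's Python bindings; the backend query under USE_XPU is elided):

```
#include <Python.h>

static PyObject* allowTF32OneDNN(PyObject* /*self*/, PyObject* /*noargs*/) {
#ifdef USE_XPU
  // ... query and return the actual backend setting here ...
  Py_RETURN_FALSE;
#else
  // None signals "no such setting", so the flags() context manager does not
  // warn when restoring it; previously this branch returned False.
  Py_RETURN_NONE;
#endif
}
```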

Added regression test

Fixes https://github.com/pytorch/pytorch/issues/149829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358
Approved by: https://github.com/atalman
2025-04-01 01:33:40 +00:00
Nikita Shulga
c75dac5f5c Fix typo (#150363)
Fixes https://github.com/pytorch/pytorch/issues/150339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150363
Approved by: https://github.com/atalman, https://github.com/kwen2501
2025-03-31 23:58:37 +00:00
Pian Pawakapan
284b766898 [dynamic shapes] C++ bindings for guard_or_false/true (#150148)
C++ version. I'd like to add it in one place to prove it works, but couldn't find one that doesn't expose a chain of data-dependent changes... so I'm just going to put up the base implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150148
Approved by: https://github.com/laithsakka, https://github.com/jingsh
2025-03-31 17:04:25 +00:00
PyTorch MergeBot
57fa99c5c3 Revert "enable out variant of 2-shot reduction (#150153)"
This reverts commit cdeb32d2d1.

Reverted https://github.com/pytorch/pytorch/pull/150153 on behalf of https://github.com/clee2000 due to failing internal builds D72083877 ([comment](https://github.com/pytorch/pytorch/pull/150153#issuecomment-2766633712))
2025-03-31 15:43:24 +00:00
PyTorch MergeBot
e57fa18b40 Revert "Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)"
This reverts commit 8a872261dc.

Reverted https://github.com/pytorch/pytorch/pull/150129 on behalf of https://github.com/clee2000 due to breaking internal builds D72080428 ([comment](https://github.com/pytorch/pytorch/pull/150129#issuecomment-2766619006))
2025-03-31 15:37:54 +00:00
Yichen Yan
bbb9b2476b Unify use of enableCollectiveHashDebug_ and trivial updates (#142865)
Use `enableCollectiveHashDebug_` instead of checking env ad-hoc when `TORCH_DISTRIBUTED_DEBUG = DETAIL`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142865
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-03-31 12:23:30 +00:00
fduwjj
9456738edf [c10d][fr] Allow multiple writer registration with warnings (#150232)
The life span of the writer is actually the whole program, which is sub-optimal, but it is a practical compromise so that writer registration can happen outside PG creation.

So we decided to allow multiple writer registrations, with warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150232
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-03-31 04:43:43 +00:00
sanshang
879a293db8 fix et trace collection of all_to_all (#149485)
![image](https://github.com/user-attachments/assets/1e602dec-24a4-4f47-88c0-9311737e217b)
![image](https://github.com/user-attachments/assets/c48a3273-43fb-4a7f-9341-b90cb6b10785)

Fix ET trace collection for all_to_all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149485
Approved by: https://github.com/shengfukevin, https://github.com/kwen2501
2025-03-29 20:17:24 +00:00
PyTorch MergeBot
3b00ff8850 Revert "[Profiler] Give non-zero default values to start events (#149757)"
This reverts commit bc72420bcb.

Reverted https://github.com/pytorch/pytorch/pull/149757 on behalf of https://github.com/malfet due to Broke windows builds, which were also the signal on the HUD ([comment](https://github.com/pytorch/pytorch/pull/149757#issuecomment-2763461365))
2025-03-29 15:08:55 +00:00
Max Calman
bc72420bcb [Profiler] Give non-zero default values to start events (#149757)
The intent of the existing code is to

> // Assign system TIDs to start events based on the system TID of the next
> // observed event with the same Python TID.

However, if there are start events that don't share the same Python TID as later observed events, then they are left with the default initialization of DeviceAndResource and assigned values of `0`. This is problematic because Kineto uses `device=0, resource=0` for the first GPU (or other backend) device.

This PR maintains the previous logic of using TIDs from later events if any are present, but defaults to the current process and system thread IDs if there aren't later events to reference.

This issue was discovered while working to implement a custom backend, when some CPU start events appeared on the same process and thread as the device in the trace.
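
A sketch of the fallback (Linux-specific and illustrative; the real DeviceAndResource type comes from the profiler/Kineto):

```
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>

struct DeviceAndResource {
  uint64_t device;    // interpreted here as the process id
  uint64_t resource;  // interpreted here as the system thread id
};

// Default to the current process/thread rather than value-initialized
// {0, 0}, which Kineto reserves for the first device.
DeviceAndResource currentIds() {
  return {static_cast<uint64_t>(getpid()),
          static_cast<uint64_t>(syscall(SYS_gettid))};
}
```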
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149757
Approved by: https://github.com/sraikund16
2025-03-29 06:29:25 +00:00
Shangdi Yu
5e787bf3e5 [reland] Support torchbind in OSS proxy executor (#150196)
Summary:
The original Diff D69500038 was reverted due to a false alarm on trunk health.

Implement torchbind support in OSSProxyExecutor.

Exactly the same as the implementation in FbProxyExecutor.

D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method
D70746626 - Support None output type

Other changes:

- When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object should be the same as the torchbind method's torchbind object arg (instead of `obj`).

- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to the `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is consistent with how files are packaged internally (more details in the internal Diff summary).

Note on using `filesystem`:

It seems there will be [issues](https://github.com/pytorch/pytorch/pull/137209) with using the `filesystem` header on Linux, so here I use string manipulation instead of `filesystem::path`.
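
The kind of string-based path handling meant here, as a small illustrative sketch (not the PR's exact helpers):

```
#include <string>

// Join a directory and a file name without std::filesystem::path.
std::string joinPaths(const std::string& dir, const std::string& name) {
  if (dir.empty() || dir.back() == '/') {
    return dir + name;
  }
  return dir + "/" + name;
}
```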

Test Plan:
```
test/inductor:torchbind -- -r torchbind_aoti
test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D72063691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150196
Approved by: https://github.com/hl475, https://github.com/desertfire
2025-03-29 03:36:55 +00:00
Mu-Chu Lee
03313c6619 [AOTInductor] Add function for users to extract constants in container (#150163)
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor

Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`

`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`

Differential Revision: D72020400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
2025-03-29 03:36:12 +00:00
Nichols A. Romero
7a470c9320 [ROCm] change preferred blas lib defaults (#150212)
Fixes #148883
Fixes #150155

Also adds at::BlasBackend::Default. Instinct cards prefer hipBLASLt; everything else prefers rocBLAS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150212
Approved by: https://github.com/jeffdaily
2025-03-29 03:33:07 +00:00
Tristan Rice
29b3fdab01 TCPStoreLibUvBackend: support masterListenFd (#150215)
This supports `masterListenFd`, which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call, and now it works just fine.

This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend.

Test plan:
```
pytest -v test/distributed/test_store.py -k test_take_over_listen_socket
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215
Approved by: https://github.com/fduwjj
2025-03-29 01:58:07 +00:00
Natalia Gimelshein
cdeb32d2d1 enable out variant of 2-shot reduction (#150153)
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified by the end (similar to what the fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-03-28 19:06:03 +00:00
PyTorch MergeBot
e691fcae0e Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)"
This reverts commit 2b20d1433f.

Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00