Commit Graph

640 Commits

Author SHA1 Message Date
Sriram Kumar
a19b667bca [ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)
Alter the guard so that the CUDAPluggableAllocator class uses the correct streamType on ROCm GPUs. The TORCH_HIP_VERSION flag does not work as intended for ROCm, so it is replaced with USE_ROCM. The bug impacts Distributed Fused Adam in ROCm/APEX when using the nccl_ub feature. This has been tested with rocm/apex.
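For context, this is the allocator configured through the Python pluggable-allocator API; a minimal usage sketch, with hypothetical library and symbol names:

```python
import torch

# CUDAPluggableAllocator loads user-defined alloc/free hooks from a
# shared library; on ROCm builds the stream handle forwarded to those
# hooks is a HIP stream, which is what the USE_ROCM guard fix ensures.
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./my_allocator.so", "my_malloc", "my_free"  # hypothetical .so and symbols
)
torch.cuda.memory.change_current_allocator(allocator)
```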

See PR https://github.com/ROCm/apex/pull/184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010
Approved by: https://github.com/jeffdaily
2025-04-01 16:49:03 +00:00
Shivam Raikundalia
a11538aa46 [GPU Snapshot] Add Clear History Flag (#149352)
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This happens because they usually attach an OOM logger without knowing it, and when they go to collect the actual snapshot, it includes all of the OOM logger's contents. Since the OOM logger and regular snapshots use the same backend, we currently don't have the infra in place to split these snapshots.

As a solution, we add a flag to the snapshot frontend that clears out the history when auto-trace memory history recording starts.
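A hedged usage sketch; the `clear_history` keyword name is an assumption based on this summary:

```python
import torch

# Assumed keyword: clear_history drops previously accumulated events
# (e.g. from an OOM logger attached earlier) before recording starts.
torch.cuda.memory._record_memory_history(max_entries=100_000, clear_history=True)
# ... run the workload under investigation ...
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```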

A more thorough solution would be to have the user pass in a handle and keep per-handle snapshots to separate the events. However, this would likely be more complicated than it is worth, as we would have to change the callbacks in the caching allocator and pass these objects between Python and C++.

Test Plan:
See diff below

Differential Revision: D71159720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
2025-03-19 21:44:20 +00:00
Marko Radmilac
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. In the past, we have often had inexplicable slowdowns that would have been much easier to diagnose with some host memory characteristics at hand.

This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanisms, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.

As mentioned before, this is the first PR; it introduces the concept so we can see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locking, like requested memory, have been punted for now. I also tried to reuse the Stat structure used by the CUDA caching allocator, in order to maintain symmetry.
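A hedged sketch of how the new stats might be read from Python; the accessor name is assumed by analogy with `torch.cuda.memory_stats()`:

```python
import torch

# Give the CachingHostAllocator some activity to report.
t = torch.empty(1 << 20, pin_memory=True)

# Assumed accessor, named by analogy with the device-side memory_stats():
stats = torch.cuda.host_memory_stats()
for name, value in stats.items():
    print(name, value)
```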

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
Richard Barnes
5301710b15 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp +4 (#147555)
Summary:
LLVM has a warning, `-Wunused-value`, which we treat as an error because it is so often diagnostic of a code issue. Unused values often indicate a programming mistake, but they can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69945678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-01 19:46:13 +00:00
PyTorch MergeBot
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
Marko Radmilac
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. In the past, we have often had inexplicable slowdowns that would have been much easier to diagnose with some host memory characteristics at hand.

This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanisms, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.

As mentioned before, this is the first PR; it introduces the concept so we can see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locking, like requested memory, have been punted for now. I also tried to reuse the Stat structure used by the CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
x41lakazam
30375cb326 Fix minor typo in python_nccl (#148088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148088
Approved by: https://github.com/Skylion007
2025-02-28 00:47:09 +00:00
PyTorch MergeBot
9a883007a2 Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)"
This reverts commit c7515da7b0.

Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))
2025-02-13 18:04:26 +00:00
Daniel Galvez
c7515da7b0 Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)
This is a new PR for #130386, which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361

I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534

Since I started this work, torch.cond and torch.while_loop have apparently gained support for backward passes. I will look into what it might take to support that.
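A hedged sketch of the pattern this enables; whether this exact eager-capture form works (and the warm-up it needs, omitted here) is an assumption:

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(8, device="cuda")
pred = torch.tensor(True, device="cuda")

# Capture data-dependent control flow into a CUDA graph, then replay;
# the branch taken may depend on `pred` at replay time.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = torch.cond(pred, true_fn, false_fn, (x,))
g.replay()
```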

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979
Approved by: https://github.com/eqy, https://github.com/eellison
2025-02-11 18:16:15 +00:00
cyy
25aa7ca62d Cleanup CallOnce.h (#146700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700
Approved by: https://github.com/albanD
2025-02-07 16:44:45 +00:00
PyTorch MergeBot
00dc5b10f6 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 2fd1b6b361.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))
2025-02-03 22:04:28 +00:00
cyy
2fd1b6b361 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-02-01 12:33:41 +00:00
PyTorch MergeBot
284f217011 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 97b3b73f3e.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))
2025-01-29 18:24:29 +00:00
cyyever
97b3b73f3e [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-01-28 15:21:12 +00:00
cyy
29f52e3972 [2/N] Remove unnecessary once flag usage (#145057)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
Will Constable
2859b11bdb [pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (#145045)
Summary:
PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided
by NCCLX and RCCL) assumes that a single contiguous buffer is used.
These are fundamentally mismatched.  The list of tensors might not be
contiguous or even ordered (buffer addresses might not be in
increasing order).

This patch removes the ncclAllToAllv specialization for PTD
all_to_all and instead lets it call ncclSend/ncclRecv directly.
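For reference, this is the list-based collective in question, run with one process per GPU (minimal sketch):

```python
import torch
import torch.distributed as dist

# Each rank exchanges one tensor with every other rank. The per-peer
# buffers are independent allocations -- not necessarily contiguous or
# address-ordered -- which is why a single contiguous-buffer
# ncclAllToAllv cannot back this API, while pairwise send/recv can.
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
world = dist.get_world_size()
inputs = [torch.full((4,), rank, dtype=torch.float32, device="cuda") for _ in range(world)]
outputs = [torch.empty(4, dtype=torch.float32, device="cuda") for _ in range(world)]
dist.all_to_all(outputs, inputs)
```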

Co-authored-by: @pavanbalaji
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145045
Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang
2025-01-18 05:26:55 +00:00
Tom Ritchford
13d35ea67a [BE] Add missing throw of std::runtime_error in csrc/cuda/utils.cpp (#144962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144962
Approved by: https://github.com/amjames, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 17:35:39 +00:00
cyy
b0be30dd79 [19/N] Fix extra warnings brought by clang-tidy-17 (#144448)
Apply more clang-tidy fixes. This also reverts a bug introduced by #144014 due to incorrect namespace concatenation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144448
Approved by: https://github.com/albanD
2025-01-09 15:58:05 +00:00
Michal Gallus
37e9da0687 [ROCm][Windows] Disable roctracer-related code (#143329)
Currently, roctracer is not available for Windows. This PR disables any use of it on Windows and adds dummy functions that keep compatibility with existing code but warn the user that roctracer is unavailable on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
2025-01-03 01:51:01 +00:00
Yu, Guangye
3848de55ed Add get_stream_from_external API for CUDA backend (#143799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123
2024-12-31 11:15:59 +00:00
Richard Barnes
9283c40ba8 [codemod] Decorate unused variables with [[maybe_unused]] (#143381)
Summary:
LLVM-15 has a warning, `-Wunused-variable`, which we treat as an error because it is so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143381
Approved by: https://github.com/malfet
2024-12-17 20:36:03 +00:00
Nichols A. Romero
c0a39ad35a [ROCm] Fix TunableOp UTs: Rotating Buffer (#143172)
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls it is sticky, so a Python API is introduced to modify this value (see the sketch after the list below).

Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.
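A hedged sketch of the new Python API; the function names and units are assumptions based on this description:

```python
import torch.cuda.tunable as tunable

tunable.enable(True)
tunable.set_rotating_buffer_size(64)       # assumed name; buffer size value
print(tunable.get_rotating_buffer_size())  # assumed matching getter
tunable.set_rotating_buffer_size(-1)       # < 0 falls back to the default, per above
```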

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
Peter Bell
96c3b2c388 Expose remaining sharedMem cudaDeviceProps to python (#143226)
I was a bit too fast with my earlier PR: `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`.

I also expose `sharedMemPerBlock` for completeness.
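A hedged reading sketch; the snake_case attribute names are assumed from the binding's existing convention:

```python
import torch

props = torch.cuda.get_device_properties(0)
# Assumed attribute names mirroring the cudaDeviceProp fields:
print(props.shared_memory_per_block)           # sharedMemPerBlock
print(props.shared_memory_per_block_optin)     # usable opt-in max per block
print(props.shared_memory_per_multiprocessor)  # includes system-reserved memory
```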

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226
Approved by: https://github.com/ezyang
2024-12-14 06:13:28 +00:00
Peter Bell
82a45d19b4 Expose sharedMemPerMultiprocessor device property to python (#143119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119
Approved by: https://github.com/ezyang
2024-12-13 16:53:57 +00:00
Alex Denisov
539286a67b Inductor annotations (#130429)
Add NVTX annotations around training phases and buffer computations

RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224

(Screenshot: profiler timeline showing the NVTX annotation ranges.)
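The ranges are of the kind sketched below, shown manually with made-up names here; Inductor emits comparable annotations automatically:

```python
import torch

# Manual NVTX phase ranges for illustration; range names are made up.
torch.cuda.nvtx.range_push("training:forward")
# ... forward pass ...
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("training:backward")
# ... backward pass ...
torch.cuda.nvtx.range_pop()
```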

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429
Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD

Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>
2024-12-10 08:53:39 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was available. The failures ultimately resulted from an internal memory limit that was not queryable through the API. This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory-availability checks; they are updated with measured memory requirements.
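A hedged sketch of the kind of query this adds; the accessor name is an assumption:

```python
import torch

# Device-wide numbers: free/total bytes as reported by the driver.
free, total = torch.cuda.mem_get_info()

# Assumed accessor for the per-process cap this PR makes queryable
# (the setter, torch.cuda.set_per_process_memory_fraction, already exists):
fraction = torch.cuda.get_per_process_memory_fraction()
print(f"process cap: {fraction * total / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
```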

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Jesse Cai
5accae4197 [sparse] add extra options to _cslt_sparse_mm (#137427)
Summary:

Splitting this PR into two: one for the cuSPARSELt improvements and one
for the inductor lowering.

This PR adds the additional cuSPARSELt bindings to PyTorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()` (see the sketch after this list)

* fixed meta registrations for float8
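A quick probe of the new binding (sketch; `is_available()` and `version()` are pre-existing backend queries):

```python
import torch

# Query cuSPARSELt support and the maximum algorithm id exposed above.
if torch.backends.cusparselt.is_available():
    print(torch.backends.cusparselt.version())
    print(torch.backends.cusparselt.get_max_alg_id())
```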

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-27 05:32:45 +00:00
PyTorch MergeBot
5318bf8baf Revert "[sparse] add extra options to _cslt_sparse_mm (#137427)"
This reverts commit f1451163ec.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to This looks like the test is still failing, plz do a rebase ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2499918590))
2024-11-26 08:01:24 +00:00
Jesse Cai
f1451163ec [sparse] add extra options to _cslt_sparse_mm (#137427)
Summary:

Splitting this PR into two: one for the cuSPARSELt improvements and one
for the inductor lowering.

This PR adds the additional cuSPARSELt bindings to PyTorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-25 23:45:41 +00:00
Xiaodong Wang
4a378d77d4 [AMD] Add ncclRemoteError back (#141461)
Summary:
It looks like RCCL does support these two error types, ncclRemoteError and ncclInProgress: https://github.com/ROCm/rccl/blob/develop/src/nccl.h.in#L57. And I do see my job throwing those errors, but PyTorch just said:
```
RuntimeError: Unconvertible NCCL type
```

Even though NCCL says:
```
develop/src/init.cc.hip:502 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
```

Therefore, this change just enables them.

Test Plan: CI

Differential Revision: D66434341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141461
Approved by: https://github.com/eqy
2024-11-25 17:49:34 +00:00
PyTorch MergeBot
cc90ba8924 Revert "[sparse] add extra options to _cslt_sparse_mm (#137427)"
This reverts commit 45b30a5aec.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_sparse_semi_structured is failing in trunk after it lands ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2494047577))
2024-11-22 15:40:21 +00:00
Jesse Cai
45b30a5aec [sparse] add extra options to _cslt_sparse_mm (#137427)
Summary:

Splitting this PR into two: one for the cuSPARSELt improvements and one
for the inductor lowering.

This PR adds the additional cuSPARSELt bindings to PyTorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-21 23:37:36 +00:00
PyTorch MergeBot
614e727191 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit cd942d00dd.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/izaitsevfb due to causes crash internally during test listing ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2492328790))
2024-11-21 21:05:22 +00:00
cyyever
cd942d00dd [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-11-21 00:25:20 +00:00
cyy
00b3b61076 Add and use thread-safe strerror (#140472)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140472
Approved by: https://github.com/ezyang
2024-11-19 04:24:17 +00:00
Richard Barnes
c8be6f1196 [codemod] Remove unused-variable in pytorch (#140569)
Summary:
LLVM-15 has a warning, `-Wunused-variable`, which we treat as an error because it is so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

#buildsonlynotests - Builds are sufficient

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D65833225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140569
Approved by: https://github.com/Skylion007
2024-11-13 20:38:03 +00:00
PyTorch MergeBot
4a18e26ff5 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit a3cff4bbd4.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2473709246))
2024-11-13 14:05:01 +00:00
cyy
a3cff4bbd4 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-11-12 18:49:51 +00:00
cyy
bf1b8adee6 Turn static inline into static function (#139843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139843
Approved by: https://github.com/ezyang
2024-11-07 23:58:18 +00:00
PyTorch MergeBot
68f1b52d8a Revert "Turn static inline into static function (#139843)"
This reverts commit 72d3f5b26d.

Reverted https://github.com/pytorch/pytorch/pull/139843 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing tests to fail on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11729669425/job/32675829894) [HUD commit link](72d3f5b26d) ([comment](https://github.com/pytorch/pytorch/pull/139843#issuecomment-2463354131))
2024-11-07 22:29:45 +00:00
cyy
72d3f5b26d Turn static inline into static function (#139843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139843
Approved by: https://github.com/ezyang
2024-11-07 19:08:41 +00:00
cyy
64d9ee88d7 [11/N] Fix extra warnings brought by clang-tidy-17 (#139599)
Follows #139385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139599
Approved by: https://github.com/sraikund16
2024-11-04 23:57:41 +00:00
cyy
7f387fa612 [10/N] Fix extra warnings brought by clang-tidy-17 (#139385)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139385
Approved by: https://github.com/Skylion007
2024-11-04 00:47:19 +00:00
cyy
f95c71867e [9/N] Fix extra warnings brought by clang-tidy-17 (#139286)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139286
Approved by: https://github.com/ezyang
2024-10-31 05:20:31 +00:00
cyyever
456c87c8a2 [8/N] Fix extra warnings brought by clang-tidy-17 (#139151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139151
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-30 14:20:08 +00:00
cyy
4d9b5a87e4 [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796
Approved by: https://github.com/ezyang
2024-10-28 10:53:11 +00:00
PyTorch MergeBot
9ca749d6cd Revert "[3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)"
This reverts commit 7cb3cef05f.

Reverted https://github.com/pytorch/pytorch/pull/138796 on behalf of https://github.com/wdvr due to reverting since this started failing a windows test ([comment](https://github.com/pytorch/pytorch/pull/138796#issuecomment-2440710865))
2024-10-28 07:06:00 +00:00
cyyever
7cb3cef05f [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796
Approved by: https://github.com/ezyang
2024-10-28 01:38:02 +00:00
Ke Wen
ee11e2da1e [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
(Figure: timeline showing communicator initialization overlapping with model initialization.)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. neither passed in by the user nor set via env -- `ProcessGroupNCCL` can apply some preferred logic. And the torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (that is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we will block there waiting for the comm to be ready, so the effect is the same as blocking init, with no "opening" compared to eager mode.
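A minimal eager-init sketch; `build_model` is a hypothetical placeholder:

```python
import torch
import torch.distributed as dist

# Passing device_id opts into eager ProcessGroupNCCL initialization; with
# the non-blocking default, communicator setup can overlap with the model
# construction that follows.
dist.init_process_group("nccl", device_id=torch.device("cuda:0"))
model = build_model()  # hypothetical helper; runs while comms initialize
```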

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
2024-10-27 17:40:43 +00:00
PyTorch MergeBot
144d75d934 Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527)"
This reverts commit 07e30eae2a.

Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2440070035))
2024-10-27 15:39:33 +00:00