pytorch/test/cpp
Ke Wen 2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When `async_op=False`, we launch the collective directly on the "current" stream, instead of launching it on a trampoline stream and joining back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor lifetime management when `async_op=False`; only use it when `async_op=True`.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after it detects completion of collectives, so that we do not hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in `async_op=False` mode will look different -- collective kernels will show up on the same line as compute kernels.
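
The lifetime rules in items 2-4 can be illustrated with a minimal, PyTorch-free sketch. Every name below (`FakeTensor`, `SketchWork`, `watchdog_sweep`, `demo`) is hypothetical and is not part of `ProcessGroupNCCL`; the sketch only models who holds references to tensors and when they are dropped, and it assumes that "stashing" simply means keeping a CPU-side reference.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a tensor; only its lifetime matters here."""

    def __init__(self, name):
        self.name = name


class SketchWork:
    """Hypothetical model of a work handle; not the real c10d Work class."""

    def __init__(self, tensors, async_op):
        # Items 2-3: hold CPU-side references only when async_op=True.
        self._stash = list(tensors) if async_op else []
        self._completed = False

    def mark_completed(self):
        # Stand-in for the collective finishing on the device.
        self._completed = True

    def wait(self):
        # Real code would also synchronize streams; this sketch only unstashes.
        self._stash.clear()

    def stashed(self):
        return len(self._stash)


def watchdog_sweep(works):
    """Item 4: drop stashed references for collectives seen as completed."""
    for work in works:
        if work._completed:
            work._stash.clear()


def demo():
    """Return whether the tensor is alive before and after the watchdog sweep."""
    tensor = FakeTensor("grad")
    ref = weakref.ref(tensor)
    work = SketchWork([tensor], async_op=True)
    del tensor  # the user drops their reference and never calls wait()
    alive_before = ref() is not None
    work.mark_completed()
    watchdog_sweep([work])
    gc.collect()
    alive_after = ref() is not None
    return alive_before, alive_after
```

In this model the stash keeps the tensor alive while the collective is in flight, and the sweep releases it once completion is observed, even though `wait()` was never called.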

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
aoti_abi_check [AOTI] Fix complex64 not defined (#132810) 2024-08-08 18:08:23 +00:00
aoti_inference [AOTInductor] Add standalone test for compilation from ExportedProgram (#142327) 2024-12-10 06:50:09 +00:00
api [ROCm][CI] upgrade CI to ROCm 6.3 (#142152) 2025-01-09 17:14:16 +00:00
c10d [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590) 2025-03-09 07:32:23 +00:00
common [AOTI] Add ABI-compatiblity tests (#123848) 2024-04-19 00:51:24 +00:00
dist_autograd Set RUNPATH so installed tests can find the required shared libraries (#136627) 2024-10-25 09:38:08 +00:00
jit Revert "Fix poision child process issue when call getAccelerator() (#144368)" 2025-01-10 23:36:43 +00:00
lazy Introduce cache clearing APIs for the lazy graph executor (#144489) 2025-01-29 17:38:01 +00:00
lite_interpreter_runtime Add None return type to init -- tests (#132352) 2024-08-01 15:44:51 +00:00
monitor
profiler [codemod] Fix a few unused-variable issues in pytorch (#143517) 2024-12-19 00:18:08 +00:00
rpc [rpc] Fix unit test after c10::nullopt removal (#143690) 2024-12-20 23:36:07 +00:00
tensorexpr Fix floating point literals in IRPrinter (#142119) 2024-12-18 21:59:48 +00:00
__init__.py