This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):

1. When `async_op=False`, we launch the collective directly on the "current" stream, instead of on a trampoline stream and joining back (see the usage sketch below).
   - Resolves #147729
   - Resolves #146881
   - Also saves two event syncs (which have overhead on HIP) and one pybind call when we invoke `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetimes against allocator recycling.
   - Resolves #147168
3. Remove tensor lifetime management when `async_op=False`; only use it when `async_op=True`.
4. To guard against the user never calling `work.wait()`, the watchdog unstashes tensors once it detects completion of the collectives, so that we do not hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in `async_op=False` mode will look different: collective kernels show up on the same line as compute kernels.

Joint work with @cenzhaometa, who wants to remove the event-sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590

Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
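To make the two code paths concrete, here is a minimal sketch (not part of the PR) of how the `async_op=False` and `async_op=True` modes described above look from the user side; the process-group setup, tensor shape, and launcher environment variables are assumptions for illustration only.

```python
# Minimal sketch of the two modes this PR distinguishes.
# Assumes a torchrun-style launch (RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR/MASTER_PORT set in the environment).
import os

import torch
import torch.distributed as dist


def run() -> None:
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    x = torch.ones(1024, device="cuda")

    # async_op=False: after this PR, the collective is launched on the
    # current stream, so there is no trampoline stream to join back from
    # and no CPU-side stashing of `x`.
    dist.all_reduce(x, op=dist.ReduceOp.SUM, async_op=False)

    # async_op=True: returns a Work handle; `x` is stashed on the CPU side
    # until wait() (or, as a safety net, until the watchdog sees the
    # collective complete).
    work = dist.all_reduce(x, op=dist.ReduceOp.SUM, async_op=True)
    work.wait()  # releases the stashed reference and syncs the stream

    dist.destroy_process_group()


if __name__ == "__main__":
    run()
```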