pytorch/torch/testing/_internal/distributed
Ke Wen 35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to a merge conflict.

This PR makes multiple changes to `ProcessGroupNCCL` (which are unfortunately intertwined):
1. When async_op=False, we launch the collective directly on the "current" stream, instead of launching it on a trampoline stream and joining back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we invoke `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream`, and use CPU-side stashing to manage tensor lifetimes against allocator recycling (a minimal sketch of the two approaches follows this list).
- Resolves #147168
3. Remove tensor lifetime management when async_op=False; only use it when async_op=True.
4. To guard against the user never calling `work.wait()`, the watchdog unstashes tensors once it detects that a collective has completed, so that we do not hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different: collective kernels now show up on the same line as compute kernels.
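
As a clarifying aside, here is a minimal, self-contained Python sketch (not the actual C++ `ProcessGroupNCCL` code) of the two tensor-lifetime strategies in items 2-4: the old `record_stream` hint to the caching allocator versus CPU-side stashing on a shelf that is emptied when the user calls `work.wait()`, or by the watchdog as a safety net. The toy `TensorShelf` class below is only a Python stand-in for the C++ `TensorShelf` mentioned in the squashed contents.

```python
# Toy illustration only; the real logic lives in C++ inside ProcessGroupNCCL.
import torch


def keep_alive_record_stream(tensor: torch.Tensor, comm_stream: torch.cuda.Stream) -> None:
    """Old approach: hint the CUDA caching allocator that `tensor` is still in use
    on `comm_stream`, so its memory block is not recycled until that stream catches up."""
    tensor.record_stream(comm_stream)


class TensorShelf:
    """New approach (toy stand-in): hold a reference on the CPU side so the storage
    cannot be freed, and drop it ("unstash") when the user calls work.wait(), or
    from the watchdog once the collective is observed complete."""

    def __init__(self) -> None:
        self._tensors: list[torch.Tensor] = []

    def stash(self, t: torch.Tensor) -> None:
        self._tensors.append(t)

    def unstash(self) -> None:
        # Dropping the references lets the caching allocator reuse the memory.
        self._tensors.clear()
```

With item 3, the shelf is only used when async_op=True; in the async_op=False case the kernel now runs on the current stream, so ordinary same-stream ordering already covers the input's lifetime.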

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD's current workflow:
- PTD creates its own dedicated `ncclStream` for communication operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default]: use the current stream as the nccl stream and avoid the stream-sync overhead
- async=True: retain the existing logic (create a new nccl stream and let it wait on the current stream to ensure tensors are ready)
- pass async down from c10d to NCCL-PG
This shaves off 50% of the CPU overhead **(70us -> 35us)**, reducing total CPU+GPU time from **230us to 195us** (a 15% reduction). A stream-handling sketch, with dummy ops standing in for the NCCL kernels, follows.
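
Below is a hedged, minimal sketch of that stream handling, using plain element-wise CUDA ops as stand-ins for the NCCL kernels; the function names are illustrative, not the actual ProcessGroupNCCL implementation.

```python
import torch


def collective_old(t: torch.Tensor, nccl_stream: torch.cuda.Stream) -> None:
    # Old path: the dedicated nccl stream first waits on the current (compute)
    # stream, runs the "collective", then the current stream waits back on it.
    nccl_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(nccl_stream):
        t.add_(1)  # stand-in for the NCCL kernel
    torch.cuda.current_stream().wait_stream(nccl_stream)


def collective_new_sync(t: torch.Tensor) -> None:
    # New async=False path: launch directly on the current stream, so no
    # cross-stream dependency or event sync is needed.
    t.add_(1)  # stand-in for the NCCL kernel


if torch.cuda.is_available():
    x = torch.ones(8, device="cuda")
    collective_old(x, torch.cuda.Stream())
    collective_new_sync(x)
    torch.cuda.synchronize()
```

Removing the `wait_stream`/event round-trip on the async=False path is where the CPU-overhead saving quoted above comes from.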

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
_shard [BE][CI] bump ruff to 0.8.4 (#143753) 2024-12-24 12:24:10 +00:00
_tensor Enable FSDP tests on XPU device (#147518) 2025-03-04 23:49:37 +00:00
nn Migrate from Tuple -> tuple in torch/testing (#144256) 2025-01-10 06:37:55 +00:00
rpc [BE]: Enable ruff rule SIM113 (#147290) 2025-02-16 22:41:16 +00:00
__init__.py
checkpoint_utils.py [dcp] Add ZStandard transformer (#143360) 2025-01-25 00:14:07 +00:00
common_state_dict.py PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
ddp_under_dist_autograd_test.py
distributed_test.py [Reland] Launch kernel on current stream & remove record_stream entirely (#150398) 2025-04-01 16:46:07 +00:00
distributed_utils.py
fake_pg.py
multi_threaded_pg.py PEP585 update - torch/testing (#145200) 2025-01-20 22:42:42 +00:00
rpc_utils.py PEP585 update - torch/testing (#145200) 2025-01-20 22:42:42 +00:00