pytorch/torch/testing/_internal/distributed
Ke Wen 35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to a merge conflict.

This PR makes multiple changes to `ProcessGroupNCCL` (which are unfortunately intertwined):
1. When async_op=False, we launch the collective directly on the "current" stream, instead of launching it on a trampoline stream and joining back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we invoke `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream`, and use CPU-side stashing to manage tensor lifetimes against allocator recycling (a minimal sketch of the two approaches follows this list).
- Resolves #147168
3. Remove tensor lifetime management when async_op=False; only use it when async_op=True.
4. To guard against the user never calling `work.wait()`, the watchdog unstashes tensors once it detects that a collective has completed, so that we do not hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different: collective kernels now show up on the same line as compute kernels.
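
As a clarifying aside, here is a minimal, self-contained Python sketch (not the actual C++ `ProcessGroupNCCL` code) of the two tensor-lifetime strategies in items 2-4: the old `record_stream` hint to the caching allocator versus CPU-side stashing on a shelf that is emptied when the user calls `work.wait()`, or by the watchdog as a safety net. The toy `TensorShelf` class below is only a Python stand-in for the C++ `TensorShelf` mentioned in the squashed contents.

```python
# Toy illustration only; the real logic lives in C++ inside ProcessGroupNCCL.
import torch


def keep_alive_record_stream(tensor: torch.Tensor, comm_stream: torch.cuda.Stream) -> None:
    """Old approach: hint the CUDA caching allocator that `tensor` is still in use
    on `comm_stream`, so its memory block is not recycled until that stream catches up."""
    tensor.record_stream(comm_stream)


class TensorShelf:
    """New approach (toy stand-in): hold a reference on the CPU side so the storage
    cannot be freed, and drop it ("unstash") when the user calls work.wait(), or
    from the watchdog once the collective is observed complete."""

    def __init__(self) -> None:
        self._tensors: list[torch.Tensor] = []

    def stash(self, t: torch.Tensor) -> None:
        self._tensors.append(t)

    def unstash(self) -> None:
        # Dropping the references lets the caching allocator reuse the memory.
        self._tensors.clear()
```

With item 3, the shelf is only used when async_op=True; in the async_op=False case the kernel now runs on the current stream, so ordinary same-stream ordering already covers the input's lifetime.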

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD's current workflow:
- PTD creates its own dedicated `ncclStream` for communication operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default]: use the current stream as the nccl stream and avoid the stream-sync overhead
- async=True: retain the existing logic (create a new nccl stream and let it wait on the current stream to ensure tensors are ready)
- pass async down from c10d to NCCL-PG
This shaves off 50% of the CPU overhead **(70us -> 35us)**, reducing total CPU+GPU time from **230us to 195us** (a 15% reduction). A stream-handling sketch, with dummy ops standing in for the NCCL kernels, follows.
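
Below is a hedged, minimal sketch of that stream handling, using plain element-wise CUDA ops as stand-ins for the NCCL kernels; the function names are illustrative, not the actual ProcessGroupNCCL implementation.

```python
import torch


def collective_old(t: torch.Tensor, nccl_stream: torch.cuda.Stream) -> None:
    # Old path: the dedicated nccl stream first waits on the current (compute)
    # stream, runs the "collective", then the current stream waits back on it.
    nccl_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(nccl_stream):
        t.add_(1)  # stand-in for the NCCL kernel
    torch.cuda.current_stream().wait_stream(nccl_stream)


def collective_new_sync(t: torch.Tensor) -> None:
    # New async=False path: launch directly on the current stream, so no
    # cross-stream dependency or event sync is needed.
    t.add_(1)  # stand-in for the NCCL kernel


if torch.cuda.is_available():
    x = torch.ones(8, device="cuda")
    collective_old(x, torch.cuda.Stream())
    collective_new_sync(x)
    torch.cuda.synchronize()
```

Removing the `wait_stream`/event round-trip on the async=False path is where the CPU-overhead saving quoted above comes from.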

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
_shard [BE][CI] bump ruff to 0.8.4 (#143753) 2024-12-24 12:24:10 +00:00
_tensor Enable FSDP tests on XPU device (#147518) 2025-03-04 23:49:37 +00:00
nn Migrate from Tuple -> tuple in torch/testing (#144256) 2025-01-10 06:37:55 +00:00
rpc [BE]: Enable ruff rule SIM113 (#147290) 2025-02-16 22:41:16 +00:00
__init__.py
checkpoint_utils.py [dcp] Add ZStandard transformer (#143360) 2025-01-25 00:14:07 +00:00
common_state_dict.py PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
ddp_under_dist_autograd_test.py
distributed_test.py [Reland] Launch kernel on current stream & remove record_stream entirely (#150398) 2025-04-01 16:46:07 +00:00
distributed_utils.py
fake_pg.py
multi_threaded_pg.py PEP585 update - torch/testing (#145200) 2025-01-20 22:42:42 +00:00
rpc_utils.py PEP585 update - torch/testing (#145200) 2025-01-20 22:42:42 +00:00