pytorch/torch/testing/_internal
Ke Wen 35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR makes multiple changes to `ProcessGroupNCCL` (which are unfortunately interrelated):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of launching it on an internal (trampoline) stream and joining back (see the usage sketch after this list).
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the HIP case) and one pybind call when we invoke `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetimes against allocator recycling (a simplified sketch of both strategies also follows this list).
- Resolves #147168
3. Remove tensor lifetime management when async_op=False; only use it when async_op=True.
4. To guard against the user never calling `work.wait()`, we ask the watchdog to unstash tensors once it detects completion of the collectives, so that we do not hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels now show up on the same line as compute kernels.
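
For illustration, here is a minimal sketch of the user-facing behavior described in items 1-4, assuming a standard NCCL setup with one GPU per rank; the bootstrap (`MASTER_ADDR`/`MASTER_PORT`, `run_rank`) is assumed for the example and is not part of this PR:

```python
import os
import torch
import torch.distributed as dist

def run_rank(rank: int, world_size: int) -> None:
    # Assumed bootstrap; any standard init method works.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.ones(1024, device="cuda")
    y = x * 2  # compute kernel on the current stream

    # async_op=False: the collective kernel is now enqueued on the *current*
    # stream right after the compute kernel, so ordinary stream ordering
    # guarantees the input is ready -- no trampoline stream, no event syncs,
    # and no tensor stashing.
    dist.all_reduce(y, async_op=False)

    # async_op=True: keeps the old behavior -- an internal NCCL stream waits
    # on the current stream, a Work handle is returned, and the tensor is
    # stashed on the CPU side until wait() (or, as a safety net, the
    # watchdog) releases it.
    work = dist.all_reduce(x, async_op=True)
    # ... overlap other compute here ...
    work.wait()  # current stream waits for the collective; stash is released

    dist.destroy_process_group()
```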
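
And a simplified sketch of the two tensor-lifetime strategies from items 2-4. The real bookkeeping lives in the C++ `ProcessGroupNCCL`; the `TensorShelf` class below is only a Python stand-in for the CPU-side stash (the name comes from the squashed commits, the methods are illustrative):

```python
import threading
import torch

# Old approach: mark the tensor as "in use" on the internal NCCL stream so
# the caching allocator will not recycle its memory until that stream's
# pending work finishes. This is the public Tensor.record_stream API.
def keep_alive_via_record_stream(t: torch.Tensor, nccl_stream: torch.cuda.Stream) -> None:
    t.record_stream(nccl_stream)

# New approach: hold a CPU-side reference to the tensor for the lifetime of
# the collective and drop it when work.wait() runs -- or when the watchdog
# notices the collective has completed, as a safety net.
class TensorShelf:
    """Illustrative stand-in; the mutex is folded in as in #150130."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._tensors: list[torch.Tensor] = []

    def stash(self, t: torch.Tensor) -> None:
        with self._mutex:
            self._tensors.append(t)

    def unstash(self) -> None:
        with self._mutex:
            self._tensors.clear()  # drop references; allocator may now recycle
```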

Joint work with @cenzhaometa, who wants to remove the event-sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD's current workflow:
- PTD creates its own dedicated `ncclStream` for comm operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default]: use the current stream as the nccl stream and avoid the stream-sync overhead
- async=True: retain the existing logic: create a new nccl stream and let it wait on the current stream to ensure tensors are ready
- pass the async flag down from c10d to the NCCL process group

This shaves off 50% of the CPU overhead **(70us -> 35us)**, reducing the total CPU+GPU time from **230us to 195us (~15%)**; a rough before/after sketch follows.
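
A rough before/after illustration of the stream handling (requires a CUDA device to run; the real change is inside ProcessGroupNCCL's C++ code, and the Python below only mimics the two patterns with public CUDA stream APIs, with `do_collective` a hypothetical stand-in for the NCCL launch):

```python
import torch

def do_collective() -> None:
    pass  # stand-in for the NCCL kernel launch inside ProcessGroupNCCL

compute_stream = torch.cuda.current_stream()

# Before (async=False): a dedicated internal NCCL stream first waits on the
# current stream via an event, then the collective is launched on it. The
# event record/wait is pure CPU overhead (~70us measured above).
nccl_stream = torch.cuda.Stream()
nccl_stream.wait_stream(compute_stream)
with torch.cuda.stream(nccl_stream):
    do_collective()

# After (async=False): launch directly on the current stream; stream program
# order already guarantees the input tensors are ready, so the extra stream
# and the event sync go away (CPU overhead roughly halved, 70us -> 35us).
do_collective()
```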

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
codegen
data
distributed [Reland] Launch kernel on current stream & remove record_stream entirely (#150398) 2025-04-01 16:46:07 +00:00
generated
opinfo [MPS] Fix dot/mm for conj_tensors (#150157) 2025-03-28 20:36:44 +00:00
optests
test_module
__init__.py
autocast_test_lists.py
autograd_function_db.py
check_kernel_launches.py
common_cuda.py [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
common_device_type.py [ROCm] Enable several fsdp related UTs (#149369) 2025-03-31 16:15:57 +00:00
common_dist_composable.py
common_distributed.py Enable FSDP tests on XPU device (#147518) 2025-03-04 23:49:37 +00:00
common_dtype.py
common_fsdp.py Enable FSDP tests on XPU device (#147518) 2025-03-04 23:49:37 +00:00
common_jit.py
common_methods_invocations.py Implement aten.select.int sharding strategy (#149842) 2025-03-27 20:49:00 +00:00
common_mkldnn.py [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
common_modules.py Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940) 2025-03-13 18:02:50 +00:00
common_nn.py
common_optimizers.py Remove code for Python < 3.9 (#147097) 2025-02-14 03:22:49 +00:00
common_pruning.py
common_quantization.py [BE]: Enable ruff rule SIM113 (#147290) 2025-02-16 22:41:16 +00:00
common_quantized.py wire torch._scaled_mm with fp4 operands to the cublas nvfp4 kernel (#148792) 2025-03-27 17:32:20 +00:00
common_subclass.py
common_utils.py Revert "[fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)" 2025-03-27 22:31:54 +00:00
composite_compliance.py [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257) 2025-04-01 10:40:43 +00:00
custom_op_db.py
custom_tensor.py
dist_utils.py
dynamo_test_failures.py
fake_config_module.py
fake_config_module2.py
fake_config_module3.py
hop_db.py Support torch.compile rng selective activation checkpointing with cudagraph (#146878) 2025-02-28 00:47:03 +00:00
hypothesis_utils.py
inductor_utils.py Revert "cpp_wrapper: Fix even more tests (#147225)" 2025-03-28 17:07:52 +00:00
jit_metaprogramming_utils.py
jit_utils.py Fix linter F821 error (#146665) 2025-02-08 07:19:37 +00:00
logging_tensor.py
logging_utils.py Update ruff linter for PEP585 (#147540) 2025-02-22 04:45:17 +00:00
quantization_torch_package_models.py
static_module.py
subclasses.py [subclass] testing WrapperSubclass respect outer_size, outer_stride (#146897) 2025-02-13 15:21:19 +00:00
torchbind_impls.py Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529) 2025-03-21 18:58:28 +00:00
triton_utils.py [Inductor] Use real input to autotune user defined triton kernels (#149553) 2025-03-26 16:42:48 +00:00
two_tensor.py Support subclass constructor capturing in export (#147014) 2025-03-16 18:19:19 +00:00