pytorch/torch/testing/_internal
Ke Wen 35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR makes multiple changes to `ProcessGroupNCCL` (which are unfortunately interrelated):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of launching it on an internal (trampoline) stream and joining back (see the usage sketch after this list).
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the HIP case) and one pybind call when we invoke `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetimes against allocator recycling (a simplified sketch of both strategies also follows this list).
- Resolves #147168
3. Remove tensor lifetime management when async_op=False; only use it when async_op=True.
4. To guard against the user never calling `work.wait()`, we ask the watchdog to unstash tensors once it detects completion of the collectives, so that we do not hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels now show up on the same line as compute kernels.
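
For illustration, here is a minimal sketch of the user-facing behavior described in items 1-4, assuming a standard NCCL setup with one GPU per rank; the bootstrap (`MASTER_ADDR`/`MASTER_PORT`, `run_rank`) is assumed for the example and is not part of this PR:

```python
import os
import torch
import torch.distributed as dist

def run_rank(rank: int, world_size: int) -> None:
    # Assumed bootstrap; any standard init method works.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.ones(1024, device="cuda")
    y = x * 2  # compute kernel on the current stream

    # async_op=False: the collective kernel is now enqueued on the *current*
    # stream right after the compute kernel, so ordinary stream ordering
    # guarantees the input is ready -- no trampoline stream, no event syncs,
    # and no tensor stashing.
    dist.all_reduce(y, async_op=False)

    # async_op=True: keeps the old behavior -- an internal NCCL stream waits
    # on the current stream, a Work handle is returned, and the tensor is
    # stashed on the CPU side until wait() (or, as a safety net, the
    # watchdog) releases it.
    work = dist.all_reduce(x, async_op=True)
    # ... overlap other compute here ...
    work.wait()  # current stream waits for the collective; stash is released

    dist.destroy_process_group()
```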
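
And a simplified sketch of the two tensor-lifetime strategies from items 2-4. The real bookkeeping lives in the C++ `ProcessGroupNCCL`; the `TensorShelf` class below is only a Python stand-in for the CPU-side stash (the name comes from the squashed commits, the methods are illustrative):

```python
import threading
import torch

# Old approach: mark the tensor as "in use" on the internal NCCL stream so
# the caching allocator will not recycle its memory until that stream's
# pending work finishes. This is the public Tensor.record_stream API.
def keep_alive_via_record_stream(t: torch.Tensor, nccl_stream: torch.cuda.Stream) -> None:
    t.record_stream(nccl_stream)

# New approach: hold a CPU-side reference to the tensor for the lifetime of
# the collective and drop it when work.wait() runs -- or when the watchdog
# notices the collective has completed, as a safety net.
class TensorShelf:
    """Illustrative stand-in; the mutex is folded in as in #150130."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._tensors: list[torch.Tensor] = []

    def stash(self, t: torch.Tensor) -> None:
        with self._mutex:
            self._tensors.append(t)

    def unstash(self) -> None:
        with self._mutex:
            self._tensors.clear()  # drop references; allocator may now recycle
```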

Joint work with @cenzhaometa, who wants to remove the event-sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD's current workflow:
- PTD creates its own dedicated `ncclStream` for comm operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default]: use the current stream as the nccl stream and avoid the stream-sync overhead
- async=True: retain the existing logic: create a new nccl stream and let it wait on the current stream to ensure tensors are ready
- pass the async flag down from c10d to the NCCL process group

This shaves off 50% of the CPU overhead **(70us -> 35us)**, reducing the total CPU+GPU time from **230us to 195us (~15%)**; a rough before/after sketch follows.
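
A rough before/after illustration of the stream handling (requires a CUDA device to run; the real change is inside ProcessGroupNCCL's C++ code, and the Python below only mimics the two patterns with public CUDA stream APIs, with `do_collective` a hypothetical stand-in for the NCCL launch):

```python
import torch

def do_collective() -> None:
    pass  # stand-in for the NCCL kernel launch inside ProcessGroupNCCL

compute_stream = torch.cuda.current_stream()

# Before (async=False): a dedicated internal NCCL stream first waits on the
# current stream via an event, then the collective is launched on it. The
# event record/wait is pure CPU overhead (~70us measured above).
nccl_stream = torch.cuda.Stream()
nccl_stream.wait_stream(compute_stream)
with torch.cuda.stream(nccl_stream):
    do_collective()

# After (async=False): launch directly on the current stream; stream program
# order already guarantees the input tensors are ready, so the extra stream
# and the event sync go away (CPU overhead roughly halved, 70us -> 35us).
do_collective()
```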

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
codegen
data
distributed [Reland] Launch kernel on current stream & remove record_stream entirely (#150398) 2025-04-01 16:46:07 +00:00
generated
opinfo [MPS] Fix dot/mm for conj_tensors (#150157) 2025-03-28 20:36:44 +00:00
optests
test_module
__init__.py
autocast_test_lists.py
autograd_function_db.py
check_kernel_launches.py
common_cuda.py [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
common_device_type.py [ROCm] Enable several fsdp related UTs (#149369) 2025-03-31 16:15:57 +00:00
common_dist_composable.py
common_distributed.py Enable FSDP tests on XPU device (#147518) 2025-03-04 23:49:37 +00:00
common_dtype.py
common_fsdp.py Enable FSDP tests on XPU device (#147518) 2025-03-04 23:49:37 +00:00
common_jit.py
common_methods_invocations.py Implement aten.select.int sharding strategy (#149842) 2025-03-27 20:49:00 +00:00
common_mkldnn.py [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
common_modules.py Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940) 2025-03-13 18:02:50 +00:00
common_nn.py
common_optimizers.py Remove code for Python < 3.9 (#147097) 2025-02-14 03:22:49 +00:00
common_pruning.py
common_quantization.py [BE]: Enable ruff rule SIM113 (#147290) 2025-02-16 22:41:16 +00:00
common_quantized.py wire torch._scaled_mm with fp4 operands to the cublas nvfp4 kernel (#148792) 2025-03-27 17:32:20 +00:00
common_subclass.py
common_utils.py Revert "[fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)" 2025-03-27 22:31:54 +00:00
composite_compliance.py [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257) 2025-04-01 10:40:43 +00:00
custom_op_db.py
custom_tensor.py
dist_utils.py
dynamo_test_failures.py
fake_config_module.py
fake_config_module2.py
fake_config_module3.py
hop_db.py Support torch.compile rng selective activation checkpointing with cudagraph (#146878) 2025-02-28 00:47:03 +00:00
hypothesis_utils.py
inductor_utils.py Revert "cpp_wrapper: Fix even more tests (#147225)" 2025-03-28 17:07:52 +00:00
jit_metaprogramming_utils.py
jit_utils.py Fix linter F821 error (#146665) 2025-02-08 07:19:37 +00:00
logging_tensor.py
logging_utils.py Update ruff linter for PEP585 (#147540) 2025-02-22 04:45:17 +00:00
quantization_torch_package_models.py
static_module.py
subclasses.py [subclass] testing WrapperSubclass respect outer_size, outer_stride (#146897) 2025-02-13 15:21:19 +00:00
torchbind_impls.py Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529) 2025-03-21 18:58:28 +00:00
triton_utils.py [Inductor] Use real input to autotune user defined triton kernels (#149553) 2025-03-26 16:42:48 +00:00
two_tensor.py Support subclass constructor capturing in export (#147014) 2025-03-16 18:19:19 +00:00