Summary:
Avoid in-place updates and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs; this was also manifested in the Ads 2K experiment. Here are the results from the TextRay model in Mitra:
#### Control job with deepcopy regression:
- First save: ~24.8s
- Global step latency: ~7-8s
#### Test job with the new fix to avoid deepcopy:
- First save: ~21s
- Global step latency: ~2s
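As a rough illustration (a self-contained sketch, not the actual DCP planner code; `WriteItem` below is a stand-in for the real plan entries), deep-copying every plan item scales with the number of FQNs, while rebuilding a filtered list of references does not:
```python
# Hypothetical benchmark sketch: dedupe with copy.deepcopy vs. dedupe that only
# rebuilds a list of references. All names here are illustrative only.
import copy
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class WriteItem:          # stand-in for a DCP plan entry
    fqn: str
    shard_index: int

def dedupe_with_deepcopy(items):
    items = copy.deepcopy(items)          # O(N) object copies before filtering
    seen, out = set(), []
    for it in items:
        if it.fqn not in seen:
            seen.add(it.fqn)
            out.append(it)
    return out

def dedupe_without_copy(items):
    seen = set()                          # keep references, copy nothing
    return [it for it in items if it.fqn not in seen and not seen.add(it.fqn)]

if __name__ == "__main__":
    plans = [WriteItem(f"layer.{i % 50_000}.weight", i % 4) for i in range(200_000)]
    for fn in (dedupe_with_deepcopy, dedupe_without_copy):
        t0 = time.perf_counter()
        fn(plans)
        print(f"{fn.__name__}: {time.perf_counter() - t0:.3f}s")
```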
Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822
Differential Revision: D71245218
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
This is a trivial rule that isn't needed in most cases, but if we want to treat the input data as `Shard(0)` (instead of `Replicate()`, as is currently assumed), then we need this rule.
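For context, here is a minimal single-rank sketch of the two placements in question (assuming a recent PyTorch where `torch.distributed.tensor` is the public DTensor module; the gloo/single-process setup is only for illustration):
```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))

x = torch.randn(8, 4)
# Input treated as sharded along dim 0 across the mesh...
x_shard = distribute_tensor(x, mesh, [Shard(0)])
# ...versus the current assumption that it is replicated on every rank.
x_repl = distribute_tensor(x, mesh, [Replicate()])
print(x_shard.placements, x_repl.placements)

dist.destroy_process_group()
```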
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1215 in get_backend

    1212     if _rank_not_in_group(pg):
    1213         raise ValueError("Invalid process group specified")
    1214     pg_store = _world.pg_map[pg] if pg in _world.pg_map else None
❱   1215     return Backend(not_none(pg_store)[0])

/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.py:13 in not_none

    11 def not_none(obj: Optional[T]) -> T:
    12     if obj is None:
❱   13         raise TypeError("Invariant encountered: value was None when it should not be")
    14     return obj

TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Since this message can confuse developers, the purpose of this PR is to add additional detail to the error to help clarify the situation.
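For reference, a hedged reproduction sketch of how this opaque `TypeError` can surface (this is my reading of the traceback above, not taken from the PR): calling `get_backend` on a process group that has already been destroyed leaves no entry in `_world.pg_map`, so `not_none` trips.
```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)
pg = dist.group.WORLD          # keep a handle to the default group
dist.destroy_process_group()   # e.g. during teardown / __del__ ordering issues

try:
    dist.get_backend(pg)       # pg is no longer tracked in _world.pg_map
except (TypeError, ValueError) as exc:
    print(f"{type(exc).__name__}: {exc}")
```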
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
Fixes #138842
`device` is always the device of the `local_state_dict`, which may be CPU, a device type the NCCL backend does not support.
Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`.
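A minimal sketch of that pattern (a hypothetical helper, not the actual fix; it assumes `pg._device_types` is available, as referenced above):
```python
import torch
import torch.distributed as dist

def broadcast_via_pg_device(pg: dist.ProcessGroup, local: torch.Tensor, src: int = 0) -> torch.Tensor:
    # Pick a device the ProcessGroup actually supports (e.g. cuda for NCCL).
    pg_device = pg._device_types[0] if pg._device_types else torch.device("cpu")
    buf = local.to(pg_device)
    dist.broadcast(buf, src=src, group=pg)
    # Move the result back if local_state_dict's device wasn't usable by the PG.
    return buf if buf.device == local.device else buf.to(local.device)
```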
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865
Approved by: https://github.com/mori360
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of launching it on a trampoline stream and joining back (see the usage sketch after this list).
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after it detects completion of the collectives, so we don't hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
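A brief usage sketch of the two modes described above (assumes an already-initialized process group; it illustrates the user-facing contract, not the internal stream changes):
```python
import torch
import torch.distributed as dist

def allreduce_sync_and_async(t: torch.Tensor) -> torch.Tensor:
    # async_op=False: launched on the current stream; the result is safe to use
    # immediately after the call returns.
    dist.all_reduce(t, op=dist.ReduceOp.SUM, async_op=False)

    # async_op=True: the caller must call work.wait() before reading the result;
    # the watchdog unstashing above is only a safety net if wait() is skipped.
    work = dist.all_reduce(t, op=dist.ReduceOp.SUM, async_op=True)
    work.wait()
    return t
```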
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
# Fix typo errors across PyTorch codebase
This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.
## Changes Made
### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
- `setup.py`: Build system documentation
- `torch/_library/triton.py`: AOT compilation comments
- `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
- `torch/export/_unlift.py`: Pass population comments
- `torch/export/exported_program.py`: Decomposition table notes
### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
- `test/mobile/test_lite_script_module.py`: Exception handling comments
- `torch/export/_draft_export.py`: Error message text
- `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
- `torch/csrc/utils/python_numbers.h`: Overflow handling comment
- `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
- `torch/_dynamo/symbolic_convert.py`: Error explanation
### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
- `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
- `torch/distributed/distributed_c10d.py`
## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.
## Test Plan
No testing required as these changes only affect comments and documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of launching it on a trampoline stream and joining back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after it detects completion of the collectives, so we don't hold references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.
This will be useful for torchft and upcoming versions of NCCL, which will handle abort correctly. Currently `torchft` would have to call the internal `_abort` method on the PGNCCL object directly, but with this change we can simply call `.abort()` and have it work for any PG implementation.
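A minimal usage sketch of the new surface (assuming `abort()`/`shutdown()` land on `ProcessGroup` as described; backends without a real implementation fall back to the default no-op):
```python
import torch.distributed as dist

def teardown(pg: dist.ProcessGroup, *, graceful: bool = True) -> None:
    # Uniform across backends: PGs without a real implementation just no-op.
    if graceful:
        pg.shutdown()
    else:
        pg.abort()
```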
Test plan:
```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.
It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do.
For memory tracking, we now enable the DTensor dispatcher for custom dispatch functions like `entropy_loss`.
The dispatcher is only enabled for the memory tracking part and is disabled as soon as it is done.
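A hedged sketch of the capability this enables (the fake store/backend setup follows the usual testing pattern; exact behavior of individual collectives under fake mode may vary):
```python
import torch
import torch.distributed as dist
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.testing._internal.distributed.fake_pg import FakeStore

# A single process pretends to be rank 0 of an 8-rank job; no real comms happen.
dist.init_process_group("fake", rank=0, world_size=8, store=FakeStore())

with FakeTensorMode():
    t = torch.empty(1024, 1024)
    dist.all_reduce(t)                # non-functional collective on fake tensors
    print(type(t).__name__, t.shape)  # no real memory or kernels were used

dist.destroy_process_group()
```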
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
Summary:
Add the new HF storage class to the torch.distributed package so that it can be imported by customers.
The HF storage reader/writer were added as DCP storage components so that DCP load and save can interact directly with the Hugging Face format and storage.
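A hedged usage sketch of how these components plug into DCP (the class names, import path, and constructor arguments below are assumptions based on the description above and may differ from the actual API; `dcp.save`/`dcp.load` are the standard entry points):
```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.hf_storage import (  # assumed module path
    HuggingFaceStorageReader,
    HuggingFaceStorageWriter,
)

def save_and_load_hf(state_dict, path: str):
    # Save a DCP state_dict directly in Hugging Face storage/format...
    dcp.save(state_dict, storage_writer=HuggingFaceStorageWriter(path=path))
    # ...and load it back through the same DCP entry points.
    dcp.load(state_dict, storage_reader=HuggingFaceStorageReader(path=path))
    return state_dict
```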
Test Plan: ensure signals pass
Differential Revision: D70495399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361
Approved by: https://github.com/MeetVadakkanchery
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with a unit test. This should allow Context Parallel and Tensor Parallel to use cuDNN SDPA.
### Test
`pytest test/distributed/tensor/test_attention.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
As titled, this PR moves the same-mesh check from the sharding propagation level down to the individual operator level.
This gives each individual operator more flexibility to decide whether it can run on the given meshes. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run a `foreach` operator on them individually, it would error out with a cross-mesh error. But for foreach computation, the DTensors may live on different meshes, as long as the meshes match in a "zipped" way (see the sketch below).
This should also fix https://github.com/pytorch/pytorch/issues/134212
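A conceptual sketch of the "zipped" requirement (a hypothetical wrapper, not PyTorch code; it assumes the public `torch.distributed.tensor` module):
```python
from typing import List

import torch
from torch.distributed.tensor import DTensor

def zipped_foreach_add_(params_a: List[DTensor], params_b: List[DTensor]) -> None:
    # Each zipped pair must share a mesh; across pairs the meshes may differ.
    for a, b in zip(params_a, params_b):
        assert a.device_mesh == b.device_mesh, "zipped pair lives on different meshes"
    torch._foreach_add_(params_a, params_b)
```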
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:
In the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU utilization to suffer significantly for the E2E checkpoint duration.
### Solution:
Introduce async save via a checkpoint daemon process. This daemon process is created once (during the first save attempt) and can serve async checkpoint requests for the remainder of the training lifetime.
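For orientation, the public entry point this sits behind looks roughly like the sketch below (generic `async_save` usage; any flag for selecting the process-based daemon is internal or assumed and may differ):
```python
import torch
import torch.distributed.checkpoint as dcp

def checkpoint_async(model: torch.nn.Module, step: int):
    # Kicks off the save in the background and returns a future; training
    # continues on the main thread/GPU while the checkpoint is written.
    future = dcp.async_save(model.state_dict(), checkpoint_id=f"ckpt/step_{step}")
    return future  # call future.result() later to ensure the save finished
```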
Test Plan: Added E2E UTs for process based async save.
Differential Revision: D69272583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
1. My company uses privateuseone to integrate a new hardware device and needs the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only available for CUDA, so I added a `supports_coalescing` property to `c10d::Backend` to determine whether a backend supports coalescing (see the usage sketch after this list).
2. If `pg._has_hooks` returns True, we don't need to check whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
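A usage sketch of the coalesced P2P API in question (assumes an initialized process group on a backend that reports coalescing support, e.g. a privateuseone backend after this change):
```python
import torch
import torch.distributed as dist

def ring_exchange(tensor: torch.Tensor, send_to: int, recv_from: int) -> torch.Tensor:
    recv_buf = torch.empty_like(tensor)
    ops = [
        dist.P2POp(dist.isend, tensor, send_to),
        dist.P2POp(dist.irecv, recv_buf, recv_from),
    ]
    # The sends/recvs are issued as one coalesced batch on the backend.
    for work in dist.batch_isend_irecv(ops):
        work.wait()
    return recv_buf
```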
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key `hpu`.
With this change we add entries for Gaudi's distributed backend `hccl` to the c10d Backend data structures.
This ensures there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.
Out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)
Successful registration adds the backend name to the list:
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)
We bind the process group creator constructs at run time, so if there are other distributed backends with the same device name, they can safely add the device type to the dictionary
fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)
and add another entry to the dictionary with the same backend name (but a different device name):
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)
In addition, out-of-tree devices can use the `backend_list` to check for successful backend registration, e.g. in APIs like `is_hccl_available`.
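A hedged sketch of the out-of-tree registration path referenced above (the creator function is a hypothetical placeholder normally supplied by the device plugin; the exact interplay with the new in-tree `hccl` entries may differ):
```python
import torch.distributed as dist

def _create_hccl_pg(store, rank, world_size, timeout):
    # Placeholder: the real constructor comes from the Intel Gaudi software stack.
    raise NotImplementedError

dist.Backend.register_backend("hccl", _create_hccl_pg, devices=["hpu"])

# Successful registration appends the name to the backend list, which helpers
# such as is_hccl_available() can consult.
print("hccl" in dist.Backend.backend_list)
```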
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
Enables support for this:
```python
from torch.distributed.launcher.api import LaunchConfig
config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```
These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.
Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many internal packages didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py, so the HFStorage components don't get imported internally and no fsspec import happens. I did the removal from __init__.py in D70286926 to fix the failing tests, but the revert was done concurrently. I'll add the classes to __init__.py once I figure out a better way to get fsspec added as a dependency everywhere.
Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
Differential Revision: D70324090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr