The original `_all_gather_keys` call was a safety check, but it could become costly at scale, and it blocks the CPU.
Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API must have the same set of keys on every rank; otherwise the API may hang.
In addition, we move the check to a utility function: `utils.assert_same_keys`. Users who are unsure whether their state dict keys match across ranks can optionally call this API to check.
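For example, a cautious caller could do something like the following (a minimal sketch; `assert_same_keys` is the utility named in this PR, while the exact import path, the `dcp.load` arguments, and the checkpoint path are assumptions):

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import utils

# Assumes the process group is already initialized (e.g. via torchrun).
model = torch.nn.Linear(8, 8)
state_dict = model.state_dict()

# Optional safety check: verifies every rank passed the same set of keys.
# This is itself a collective, so either all ranks call it or none do.
utils.assert_same_keys(state_dict)

dcp.load(state_dict, checkpoint_id="/tmp/ckpt")
```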
Resolves #145965 (as a workaround).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
`_ReaderView` doesn't work correctly when a read extends past the end of the view.
`read(-1)` would call `read(-1)` on the `base_stream`, which would consume the entire underlying stream, even if the view ended before that.
`read(n)` would read `n` bytes, even if the view ended before that.
The new implementation clamps the read size to the remaining size of the view.
`readinto(b)` would read `len(b)` bytes, even if the view ended before that.
Since the interface depends on the size of `b`, we use a (potentially) shortened memoryview into `b` to avoid a copy. If the view doesn't contain enough data to fill the buffer, this appears as end-of-stream to the caller, which is the desired behavior.
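The clamping idea, schematically (an illustrative sketch, not the actual helper; the class and names here are hypothetical):

```python
import io

class ReaderViewSketch:
    """Illustrative window over [offset, offset + length) of a base stream."""

    def __init__(self, base_stream, offset: int, length: int):
        self.base_stream = base_stream
        self.end = offset + length
        base_stream.seek(offset)

    def _remaining(self) -> int:
        return max(0, self.end - self.base_stream.tell())

    def read(self, size: int = -1) -> bytes:
        # Clamp to the view: read(-1) must not consume the whole base stream,
        # and read(n) must not run past the end of the view.
        remaining = self._remaining()
        size = remaining if size < 0 else min(size, remaining)
        return self.base_stream.read(size)

    def readinto(self, b) -> int:
        # Shorten the destination via a memoryview (no copy); a partial fill
        # looks like end-of-stream to the caller, which is what we want.
        view = memoryview(b)[: self._remaining()]
        return self.base_stream.readinto(view)

base = io.BytesIO(b"0123456789")
view = ReaderViewSketch(base, offset=2, length=4)
assert view.read(-1) == b"2345"  # clamped; the trailing bytes stay unread
```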
This fix should not be user-facing: the bug is in an internal helper and is only visible with new code further down the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143357
Approved by: https://github.com/saumishr
Changes:
1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__` prefix to positional-only arguments with the `/` separator in function signatures (see the sketch below).
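Illustrative before/after shapes for changes 2 and 3 (hypothetical snippets, not actual call sites from this PR):

```python
# Change 2: %-formatting -> f-string
name, count = "foo", 3
msg = "%s has %d items" % (name, count)  # before
msg = f"{name} has {count} items"        # after

# Change 3: `__`-prefixed parameters -> positional-only via `/`
def get(self, __key):   # before: the dunder prefix discourages keyword use
    ...

def get(self, key, /):  # after: `/` makes `key` genuinely positional-only
    ...
```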
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
Summary:
Same as D57688538, recreated because of GH issues.
This diff introduces `LocalShardsWrapper`, which is crucial to migrating the TRec state dict representation from `ShardedTensor` to `DTensor`, along with the changes needed in PT-D and ModelStore to support it.
It allows us to extend `DTensor` to support multiple shards on a rank, as well as empty shards on a rank, as needed by TRec sharding logic.
This diff also extends `LocalShardsWrapper` to be used in conjunction with `DTensor` in checkpointing cases (ModelStore and DCP).
See D54375878 for how it is used.
**LocalShardsWrapper supports the following torch ops:**
+ torch.ops._c10d_functional.all_gather_into_tensor.default
+ aten._to_copy.default
+ aten.view.default
+ aten.equal.default
+ aten.detach.default
The set is extensible, so more ops can be added as use cases require; a schematic of the dispatch pattern follows.
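A skeletal illustration of the dispatch pattern (a hand-written sketch, not the actual LocalShardsWrapper implementation; the class name and handler details are assumptions):

```python
import torch

class ShardsWrapperSketch(torch.Tensor):
    """Toy wrapper holding a list of local shards, for dispatch illustration."""

    @staticmethod
    def __new__(cls, shards, global_size):
        # A real implementation would also derive strides, device, etc.
        r = torch.Tensor._make_wrapper_subclass(cls, global_size, dtype=shards[0].dtype)
        r._shards = shards
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Route each supported op to a handler over the local shards.
        if func is torch.ops.aten.detach.default:
            (self,) = args
            return ShardsWrapperSketch([s.detach() for s in self._shards], self.shape)
        if func is torch.ops.aten.equal.default:
            a, b = args
            return len(a._shards) == len(b._shards) and all(
                torch.equal(x, y) for x, y in zip(a._shards, b._shards)
            )
        raise NotImplementedError(f"{func} is not handled by this sketch")
```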
See https://docs.google.com/document/d/16Ptl50mGFJW2cljdF2HQ6FwsiA0scwbAbjx_4dhabJw/edit?usp=drivesdk for more info regarding design and approach.
NOTE: This version of LocalShardsWrapper does not support empty shards; that support is added in the next diff, D57063512, which enables CW.
Test Plan:
`buck test mode/opt -c python.package_style=inplace aiplatform/modelstore/client/tests_gpu:dist_checkpoint_save_load_with_stateful_tests -- --print-passing-details`
`buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_tensor_configs -- --print-passing-details`
Sandcastle
Reviewed By: XilunWu, wanchaol
Differential Revision: D58570479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129150
Approved by: https://github.com/XilunWu
Summary:
# Introduce Checkpointable interface for DCP to support arbitrary tensor subclasses for checkpointing
**Authors:**
* zainhuda
## **Summary**
This diff adds a `CheckpointableTensor` interface to allow future compatibility between any tensor subclass and DCP in a clean and maintainable way.
## **Motivation**
For the TorchRec sharding migration from `ShardedTensor` to `DTensor`, we create a tensor subclass that is stored by `DTensor` to support TorchRec's sharding schemes (e.g., empty shards, multiple shards on a rank).
## **Proposed Implementation**
View the `CheckpointableTensor` interface implementation, in which we introduce the minimal set of methods needed to be compatible with DCP. These methods are expected to be implemented by any tensor subclass, and such subclasses are then checkpointable by DCP.
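Roughly, the shape of such an interface (a sketch only; the method names and signatures below are illustrative assumptions, and the authoritative set is in the PR itself):

```python
from abc import ABC, abstractmethod
from typing import Any, List

import torch

class CheckpointableTensorSketch(ABC):
    """Illustrative minimal contract a tensor subclass might satisfy for DCP."""

    @abstractmethod
    def __create_write_items__(self, fqn: str, obj: Any) -> List[Any]:
        """Describe how this tensor's data should be written, one item per shard."""

    @abstractmethod
    def __create_chunk_list__(self) -> List[Any]:
        """Enumerate the chunks (offsets and sizes) this tensor contributes."""

    @abstractmethod
    def __get_tensor_shard__(self, index: Any) -> torch.Tensor:
        """Return the local tensor data backing a given chunk index."""
```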
## **Drawbacks**
No drawbacks; it extends functionality in a clean and maintainable way.
## **Alternatives**
The alternative design was to add code paths that check for particular attributes on tensor subclasses, which can get messy, hard to maintain, and hard to understand why a given check was there in the first place.
Test Plan:
Sandcastle
cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC
Differential Revision: D57970603
Pulled By: iamzainhuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127628
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/fegin
Accounts for the case where `state_dict` keys may be present in different orders across ranks. Since users may be calling collectives in their `state_dict` and `load_state_dict` calls, differently ordered keys could cause a deadlock. This is mostly a defensive move, meant to match the feature in TSS.
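The defensive pattern, in spirit (a sketch, not the PR's actual diff): iterate keys in a deterministic order so that any per-key collectives fire in the same sequence on every rank.

```python
def visit_state_dict_in_order(state_dict, visit_fn):
    # Sorting gives every rank the same iteration order even if the dicts
    # were populated differently, so per-key collectives stay aligned.
    for key in sorted(state_dict.keys()):
        visit_fn(key, state_dict[key])
```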
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114304
Approved by: https://github.com/fegin, https://github.com/wz337
Fixes: #113193
Error count from `pydocstyle <all_files_in_issue> --count`:
- Before: 345
- After: 130
For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
Moved `SlicedBufferedReader` to utils and renamed it to `_ReaderView`.
It no longer depends on file handles and is a pure wrapper. This makes it general enough to handle non-`io` stream objects like fsspec's.
Should help with #98386
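Illustration of the generality (`io.BytesIO` stands in for any seekable, readable object such as an fsspec file handle; the import path and constructor signature are assumptions based on this PR's description):

```python
import io
from torch.distributed.checkpoint.utils import _ReaderView  # assumed location

# Any seekable, readable object works -- no OS file handle required.
base = io.BytesIO(b"headerPAYLOADtrailer")
view = _ReaderView(base, 6, 7)  # assumed (base_stream, offset, length) order
assert view.read(7) == b"PAYLOAD"
```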
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
This PR moves nested_tensors to torch.distributed.checkpoint. This is a prerequisite for enabling 2D checkpointing.
It flattens sharded tensors in the state_dict, and is used when saving and loading FSDP's SHARDED_STATE_DICT.
Docstrings, individual tests, and an integration test will be added in follow-up PRs.
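The flattening idea, schematically (a toy sketch with dot-joined keys; the real mapping logic lives in the moved module):

```python
def flatten_state_dict_sketch(state_dict, prefix="", out=None):
    # Recursively flatten nested dicts into {"a.b.c": tensor} form so that
    # each (possibly sharded) leaf gets a single fully qualified name.
    out = {} if out is None else out
    for key, value in state_dict.items():
        fqn = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flatten_state_dict_sketch(value, fqn, out)
        else:
            out[fqn] = value
    return out
```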
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89501
Approved by: https://github.com/wanchaol