To better support device-agnostic code, add a `current_device()` to torch.cpu that returns "cpu", so that we no longer run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
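A minimal sketch of the device-agnostic pattern this enables; the device selection logic here is illustrative, not from the PR:
```
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
device_module = getattr(torch, device_type)
# with this change, torch.cpu.current_device() returns "cpu" instead of raising AttributeError
current = device_module.current_device()
```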
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
This reverts commit ff0358b038.
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D so that users can detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.
This PR introduces a new module, torch.distributed.hooks, that exposes the following methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread that reads events from a C++ queue and dispatches them.
Queue notification is, somewhat unusually, done with a pipe: this is needed so Python can abort the thread on shutdown while keeping it a background thread, which is not possible with more conventional choices such as a condvar.
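A hedged sketch of registering one of these hooks; the callback signature and event payload are assumptions, not part of this description:
```
import torch.distributed.hooks as dhooks

def on_collective_start(event):
    # the fields carried by `event` are an assumption here
    print(f"collective started: {event}")

dhooks.register_collective_start_hook(on_collective_start)
```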
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
I tested by adding some warning logs in C++, running a distributed program, and confirming that the messages now had `[rank0]:` in them. There is no existing test infra for C++ logging, so I couldn't easily add a unit test.
The implementation strategy is to set up a global variable in C++ and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.
This PR only works for non-glog logging. We probably need to come up with some other strategy for glog, e.g., a custom prefix, but we need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, so I will leave it as follow-up work.
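The Python side of that strategy might look roughly like the following; the binding name is hypothetical and only illustrates "poking" the global state at process group initialization:
```
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
# hypothetical binding: stamp this process's rank into the C++ logging state
torch._C._distributed_c10d._set_global_rank(dist.get_rank())
```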
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
Skip tensor.to() in from_local and distribute_tensor when the device type of the device mesh matches the tensor's device type. Since from_local is on the critical path of TP, this may also reduce some overhead.
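A minimal sketch of the skip, assuming a `device_mesh` and a local `tensor` supplied by the caller:
```
import torch
from torch.distributed._tensor import DeviceMesh

def maybe_move(tensor: torch.Tensor, device_mesh: DeviceMesh) -> torch.Tensor:
    # only pay for tensor.to() when the device types actually differ
    if tensor.device.type != device_mesh.device_type:
        tensor = tensor.to(device_mesh.device_type)
    return tensor
```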
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110774
Approved by: https://github.com/fduwjj
optree recently landed and provides quite good performance; conditionally import optree when it is installed.
Some numbers testing an MLP layer with TP + functional collectives:
before this PR: 10.390 ms
after this PR: 9.189 ms
so around a 10% end-to-end CPU overhead reduction.
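A hedged sketch of the conditional import pattern; the selection logic is an assumption, not the exact code:
```
import torch.utils._pytree as python_pytree

try:
    import optree  # C++ pytree implementation with lower CPU overhead
    HAS_OPTREE = True
except ImportError:
    HAS_OPTREE = False

def flatten(tree):
    # prefer optree when available, fall back to the pure-Python pytree otherwise;
    # note the returned specs differ between the two backends
    if HAS_OPTREE:
        return optree.tree_flatten(tree, none_is_leaf=True)
    return python_pytree.tree_flatten(tree)
```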
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110670
Approved by: https://github.com/fegin
We had the option but never used cpu_offload, as optimizer state_dict offloads the tensors to CPU by default. This is usually what most users want, since the tensors eventually need to be moved to CPU anyway. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets optimizer state_dict read the flag.
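A hedged example of opting out of CPU offload for debugging, assuming the flag surfaces as `offload_to_cpu` on the optimizer state dict config and that `model` is FSDP-wrapped with optimizer `optim`:
```
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.api import (
    ShardedOptimStateDictConfig,
    ShardedStateDictConfig,
    StateDictType,
)

FSDP.set_state_dict_type(
    model,
    StateDictType.SHARDED_STATE_DICT,
    ShardedStateDictConfig(offload_to_cpu=False),
    # keep optimizer state on the accelerator while debugging
    ShardedOptimStateDictConfig(offload_to_cpu=False),
)
osd = FSDP.optim_state_dict(model, optim)
```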
Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
This PR adds support for aten.where and supports implicit scalar promotion: when we encounter scalar tensors in the dispatching logic, we implicitly convert them to replicated DTensors.
The latter also enables a bunch of ops in the op db to pass.
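A hedged sketch of what this enables; the mesh construction and `world_size` are assumed to be set up elsewhere:
```
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", list(range(world_size)))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
# the Python scalar becomes a scalar tensor that is implicitly treated as replicated
out = torch.where(dt > 0, dt, 0.0)
```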
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110584
Approved by: https://github.com/fduwjj
Summary:
As title, this diff hides the contiguous requirement for user input mesh when initializing DeviceMesh.
In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp")
)
```
Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp")
)
```
Without the change: exception.
After this change: initialized successfully.
Differential Revision: D49942839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110628
Approved by: https://github.com/wanchaol, https://github.com/xw285cornell, https://github.com/fduwjj
Expose a set of observability hooks into C10D so that users can detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.
This PR introduces a new module, torch.distributed.hooks, that exposes the following methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread that reads events from a C++ queue and dispatches them.
Queue notification is, somewhat unusually, done with a pipe: this is needed so Python can abort the thread on shutdown while keeping it a background thread, which is not possible with more conventional choices such as a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Adam part of: https://github.com/pytorch/pytorch/issues/110506
TODO:
- If this approach is validated as a good one, it can also be applied to all other optimizers that convert `complex` via list comprehensions
### Results:
`NUM_PARAMS=200, foreach=True`
- main: dynamo: 43s, inductor: 31s, total: 74s
- this PR: dynamo: 3.5s, inductor: 30s, total: 34s (dynamo speedup: 12.3x, overall speedup: 2.1x)
`NUM_PARAMS=1000, foreach=True, has_complex shortcut`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0329, 50.0806, 0.0041
OutputGraph.call_user_compiler 44.9924
```
`NUM_PARAMS=1000, foreach=True`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0389, 58.6069, 0.0043
OutputGraph.call_user_compiler 44.1425
```
### Discussion
- The `has_complex` shortcut provides an additional ~2x dynamo speedup. It is not necessary to achieve a significant overall speedup.
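A hedged sketch of the kind of shortcut described above; the helper is illustrative, not the exact optimizer code:
```
import torch

def maybe_view_complex_as_real(params, has_complex):
    # when no complex params exist, skip the per-tensor checks entirely
    if not has_complex:
        return params
    return [torch.view_as_real(p) if torch.is_complex(p) else p for p in params]
```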
CC: @janeyx99 @mlazos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110607
Approved by: https://github.com/janeyx99, https://github.com/lezcano
Summary:
Sometimes local_shards are empty on some ranks and out.dtype is float16, which causes an error when enforce_dtype is True because `data` will be float32.
Callers know best what dtype they want, so we can just let callers decide.
Temporarily keep enforce_dtype for backward compatibility.
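A hedged sketch of the direction described in the summary; names and signature are hypothetical:
```
from typing import List, Optional

import torch

def gather_shard_data(
    local_shards: List[torch.Tensor],
    out: torch.Tensor,
    dtype: Optional[torch.dtype] = None,
) -> torch.Tensor:
    # the caller chooses the dtype; fall back to out.dtype only when unspecified
    target = dtype if dtype is not None else out.dtype
    data = torch.cat(local_shards) if local_shards else out.new_empty(0)
    return data.to(target)
```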
Test Plan: Run local and MAST job
Reviewed By: uciyc123
Differential Revision: D46886551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110561
Approved by: https://github.com/wanchaol, https://github.com/malfet
When we convert to a local tensor, the DTensor can no longer track autograd or the gradient layout of the local tensor. If the user does something unexpected, there needs to be a way for the user to hint about the gradient layout of the local tensor.
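A hedged sketch of how such a hint could look, assuming it surfaces as a `grad_placements` argument on `to_local` (mesh setup and `world_size` are assumed):
```
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", list(range(world_size)))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])
# hint that the gradient of the local tensor is laid out as Shard(0) on the mesh
local = dt.to_local(grad_placements=[Shard(0)])
```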
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
The original implementation of `_gather_orig_param_state` is naive. It performs one allgather_object and two allgathers (if the optimizer is Adam) per FQN. This can be slow and makes `_optim_state_dict` a bottleneck.
This PR rewrites the implementation and fuses all the `allgather_object`s into one. As for `allgather`, it is fused based on the information of the FlatParameters, so there will be 2N `allgather`s, where N is the number of FlatParameters and 2 is due to Adam having 2 states per FQN.
One experiment on 8 A100 GPUs shows that the execution time of the gathering improves from 3 seconds to 0.3 seconds.
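A hedged sketch of the object-collective fusion (the helper name is illustrative): one `all_gather_object` over the entire FQN-to-state mapping replaces one call per FQN:
```
import torch.distributed as dist

def gather_all_fqn_states(local_states: dict, pg) -> list:
    # a single all_gather_object for the whole FQN -> state mapping,
    # instead of one collective per FQN
    gathered = [None] * dist.get_world_size(pg)
    dist.all_gather_object(gathered, local_states, group=pg)
    return gathered
```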
Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
pytree is a great tool, but it is sometimes considered harmful for tensor subclasses: it's useful for implementing a subclass quickly, but it:
* exposes non-trivial CPU overhead
* is unnecessary for many ops; only the ones with list/dict args need it
* has semantic issues for inplace/out ops when blindly used to re-wrap results
This PR avoids using pytree for most ops during torch_dispatch and only enables it for certain ops.
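A hedged sketch of the fast path, assuming a DTensor-like subclass and an `unwrap` callable that extracts the local tensor; names are illustrative:
```
from torch.utils._pytree import tree_map

def unwrap_args(args, kwargs, subclass_type, unwrap):
    # fast path: no containers, so plain comprehensions avoid pytree overhead
    flat_only = not any(isinstance(a, (list, tuple, dict)) for a in args) and not any(
        isinstance(v, (list, tuple, dict)) for v in kwargs.values()
    )
    if flat_only:
        new_args = tuple(unwrap(a) if isinstance(a, subclass_type) else a for a in args)
        new_kwargs = {k: unwrap(v) if isinstance(v, subclass_type) else v for k, v in kwargs.items()}
        return new_args, new_kwargs
    # slow path: fall back to pytree for nested list/dict arguments
    fn = lambda a: unwrap(a) if isinstance(a, subclass_type) else a
    return tree_map(fn, args), tree_map(fn, kwargs)
```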
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110132
Approved by: https://github.com/fduwjj
When running on "gloo" and "cpu:gloo,cuda:nccl" backend, it will run into the following error.
```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
optim_state = load_sharded_optimizer_state_dict(
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
_alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```
This PR fixes the error in optimizer.py. We will follow up to add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit test to include this backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
When the sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch=True, during a validation-only run self.is_first_iter will always be True (because training=False, so iter + 1 is never executed). Additionally, the _pre_forward_order_index of the first handle entering record_pre_forward is 0. This causes the if condition at line 166 to evaluate to False when that handle enters record_pre_forward again (the expected value is True, because _pre_forward_order_index has in fact already been assigned a value). As a result, the first handle is repeatedly added to handles_pre_forward_order, leading to an incorrect prefetching order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110138
Approved by: https://github.com/awgu
Replacing https://github.com/pytorch/pytorch/pull/109553 as it got reverted.
This PR enables training with the new 2D flow and adds the associated test. In addition, this PR moves the FSDP-specific parts of tensor/parallel/_data_parallel_utils.py back to tensor/parallel/fsdp.py to avoid a circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py.
state_dict related changes will come in later PRs.
cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034
Approved by: https://github.com/fduwjj
Summary:
This preserves `requires_grad` in the case where all parameters within a `FlatParameter` have the same `requires_grad` value.
Currently, unsharded parameters have `requires_grad=True` in some cases where the `FlatParameter` and all original parameters have `requires_grad=False`.
This could be extended to support `FlatParameters` with a mix of `requires_grad` states by extending `ParamInfo` to capture `requires_grad` for each parameter.
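A hedged sketch of the uniform case described above; the helper and its call site are hypothetical:
```
from typing import List

import torch.nn as nn

def flat_param_requires_grad(original_params: List[nn.Parameter]) -> bool:
    # preserve requires_grad only when every original parameter agrees;
    # otherwise fall back to the current default of True
    flags = [p.requires_grad for p in original_params]
    return flags[0] if flags and len(set(flags)) == 1 else True
```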
Test Plan: test added
Differential Revision: D49517155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109892
Approved by: https://github.com/awgu
Fix a bug in socket.cpp's timeout detection that only shows up with 10k ranks.
Make the minimum wait time in _store_based_barrier adaptive based on the number of ranks.
Longer timeouts give the store more room to do productive work when swamped.
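A hedged sketch of an adaptive wait; the constants and scaling are assumptions, not the values used in the PR:
```
def store_barrier_wait_seconds(world_size: int, base: float = 0.01, cap: float = 10.0) -> float:
    # poll less aggressively as the job grows so the store can do productive work
    return min(cap, base * max(1, world_size // 100))
```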
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217