pytorch/torch/distributed
Chien-Chin Huang c4fc5d372f [FSDP][state_dict][1/N] Moving state_dict logic to pre_state_dict_hook (#87900)
This is one step toward the ultimate goal: removing the overridden `state_dict` in FSDP. All the logic should live in either `pre_state_dict_hook` or `post_state_dict_hook`.

Since the current `nn.Module` does not support `pre_state_dict_hook`, this PR mimics `pre_state_dict_hook` by calling the pre hook inside the post hook, effectively discarding all the work done by `nn.Module.state_dict`. Once `pre_state_dict_hook` is supported by `nn.Module`, these pre hook calls can be moved out of the post hooks and registered with `nn.Module.pre_state_dict_hook`.
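A minimal sketch of this pattern, not the actual FSDP implementation: `_pre_state_dict_hook` and the parameter rebuild below are placeholders. The pre hook runs at the top of the registered post hook, and the entries that `nn.Module.state_dict` already wrote for this module are redone afterward.

```python
import torch.nn as nn

def _pre_state_dict_hook(module, prefix):
    # Placeholder pre-hook: in FSDP this is where state would be
    # prepared (e.g., unsharding parameters) before export.
    pass

def _post_state_dict_hook(module, state_dict, prefix, local_metadata):
    # Mimic a pre-hook by invoking it first, even though
    # nn.Module.state_dict has already populated state_dict.
    _pre_state_dict_hook(module, prefix)
    # Rebuild the entries for this module, discarding the work done
    # by nn.Module.state_dict.
    for name, param in module.named_parameters(recurse=False):
        state_dict[prefix + name] = param.detach().clone()
    return state_dict

module = nn.Linear(4, 4)
# _register_state_dict_hook is nn.Module's private post-hook
# registration API; the hook's return value replaces the state dict.
module._register_state_dict_hook(_post_state_dict_hook)
state = module.state_dict()
```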

The major issue with this temporary solution is that `post_state_dict_hook` is called from the leaf modules up to the root. This makes calling `module._lazy_init()` from the hooks invalid, as FSDP assumes `_lazy_init()` is first called on the root module. As a result, `FSDP.state_dict` currently contains only one piece of logic: calling `module._lazy_init()`.
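A minimal sketch of the resulting override, assuming a hypothetical FSDP-like wrapper with a `_lazy_init()` method: `state_dict()` is entered at the root before any hooks fire, so the root-first initialization has to live there for now.

```python
import torch.nn as nn

class FSDPLike(nn.Module):
    def _lazy_init(self):
        # Root-first initialization; FSDP assumes this runs on the
        # root module before any descendant needs it.
        pass

    def state_dict(self, *args, **kwargs):
        # The post hooks fire leaf-to-root, too late for root-first
        # setup, so this is the one piece of logic left in the override.
        self._lazy_init()
        return super().state_dict(*args, **kwargs)
```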

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87900
Approved by: https://github.com/rohan-varma
2022-11-11 03:41:40 +00:00
_composable [FSDP()][Easy] Make fully_shard() only FULL_SHARD (#88260) 2022-11-03 13:41:54 +00:00
_shard rename DisableTorchFunction to DisableTorchFunctionSubclass (#88218) 2022-11-10 14:51:13 +00:00
_sharded_tensor
_sharding_spec Add __all__ for a few distributed modules plus a little typing (reland) (#84872) 2022-09-13 21:57:49 +00:00
_spmd Remove eager mode support from CommTensor (#84978) 2022-09-14 17:23:23 +00:00
algorithms Fix typos used in documents under torch directory (#88300) 2022-11-02 09:38:13 +00:00
autograd Integrate xdoctest - Rebased (#82797) 2022-08-12 02:08:01 +00:00
benchmarks Fix typo under torch directory (#87274) 2022-10-21 14:22:20 +00:00
elastic Make TorchElastic timer importable on Windows (#88522) 2022-11-10 17:42:20 +00:00
fsdp [FSDP][state_dict][1/N] Moving state_dict logic to pre_state_dict_hook (#87900) 2022-11-11 03:41:40 +00:00
launcher
nn [nn] add remove_duplicate flag to named_parameters (#759) (#88090) 2022-11-09 00:09:20 +00:00
optim Upstream apply_optim_in_backward from TorchRec (#87397) (#88539) 2022-11-05 18:28:07 +00:00
pipeline Deprecate TypedStorage, its derived classes, and all of their public methods (#85303) 2022-11-08 18:11:01 +00:00
rpc [Python] refactor slices on sorted (#86995) 2022-10-25 04:07:19 +00:00
__init__.py Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134) 2022-11-08 13:26:42 +00:00
argparse_util.py
c10d_error_logger.py [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988) 2022-10-29 04:38:34 +00:00
constants.py
CONTRIBUTING.md
distributed_c10d.py [14/N] Refactor _new_process_group_helper() to remove repeated code (#88351) 2022-11-10 19:27:17 +00:00
launch.py Integrate xdoctest - Rebased (#82797) 2022-08-12 02:08:01 +00:00
logging_handlers.py [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988) 2022-10-29 04:38:34 +00:00
remote_device.py
rendezvous.py
run.py Integrate xdoctest - Rebased (#82797) 2022-08-12 02:08:01 +00:00
utils.py [DDP] Add PackedSequence support when device_ids is specified (#86614) 2022-10-10 21:50:59 +00:00