pytorch/torch/distributed
Rohan Varma a0b3814433 Clean prefixes when searching for params / buffers to ignore (#78278)
Co-authored with: @awgu

When `state_dict` has a prefix attached to it, the current logic for ignoring parameters and buffers does not work since it doesn't account for this prefix. To fix this, we make the following changes:

- Clean the key if it starts with the prefix (see the sketch after this list). Note that not all keys will start with the prefix: for example, when the current module's state_dict post-hook runs, a sibling module at the same level of the hierarchy may already have contributed its `state_dict` entries.
- This prefixing means it is no longer correct to override a child module's ignored params and buffers with the root FSDP instance's (that would break, for example, if a child FSDP instance had ignored modules but the root did not). We fix this by having each parent know about the ignored modules of its children and by computing fully qualified names for ignored params and buffers.
- As a result, each FSDP instance knows the fully qualified names of its own and its children's ignored params and buffers. It does not know about its parents' ignored params and buffers, but it does not need to store that data.
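
To make the prefix handling concrete, here is a minimal sketch of the key-cleaning idea, assuming an illustrative post-hook signature and an `ignored_fqns` set; the names and structure are assumptions for illustration, not the PR's actual implementation.

```python
# Minimal sketch (not the actual FSDP code): a state_dict post-hook that strips
# this module's prefix from keys before matching them against the fully
# qualified names of ignored params/buffers. `ignored_fqns` and
# `_post_state_dict_hook` are illustrative names, not real FSDP internals.
from typing import Dict, Set

import torch


def clean_key(key: str, prefix: str) -> str:
    # Not every key carries this module's prefix: sibling modules' entries may
    # already be present in `state_dict`, so only strip when the prefix matches.
    return key[len(prefix):] if key.startswith(prefix) else key


def _post_state_dict_hook(
    state_dict: Dict[str, torch.Tensor],
    prefix: str,
    ignored_fqns: Set[str],  # FQNs ignored by this instance and its children
) -> Dict[str, torch.Tensor]:
    for key in list(state_dict.keys()):
        if clean_key(key, prefix) in ignored_fqns:
            # Leave ignored entries untouched; skip FSDP post-processing.
            continue
        # ... FSDP-specific post-processing of managed params would go here ...
    return state_dict
```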
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78278
Approved by: https://github.com/awgu
2022-05-26 02:43:03 +00:00
_shard [PT-D] Enable nan_to_num op for sharded tensor 2022-05-25 18:03:42 +00:00
_sharded_tensor [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
_sharding_spec [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
algorithms CheckpointWrapper state_dict fix (#77224) 2022-05-17 03:39:31 +00:00
autograd
benchmarks
elastic [lint] upgrade mypy to latest version 2022-05-03 20:51:34 +00:00
fsdp Clean prefixes when searching for params / buffers to ignore (#78278) 2022-05-26 02:43:03 +00:00
launcher (torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598) 2022-03-03 08:35:50 +00:00
nn [Reland] load_state_dict post hook (#77392) 2022-05-14 06:06:23 +00:00
optim Adding maximize to Adamax (#77409) 2022-05-16 17:34:44 +00:00
pipeline Add type hints for a few random functions/classes 2022-05-04 13:53:00 +00:00
rpc [RPC small change] Improving logging for store.wait error 2022-05-05 18:23:17 +00:00
__init__.py [Dynamic RPC] Allow for optional world_size argument in init_rpc (#73372) 2022-03-24 16:19:28 +00:00
argparse_util.py
constants.py
CONTRIBUTING.md Update distributed contributing guide to show how to run one test in test_distributed_spawn (#67801) 2021-11-04 08:54:31 -07:00
distributed_c10d.py [lint] upgrade mypy to latest version 2022-05-03 20:51:34 +00:00
launch.py Introduce the torchrun entrypoint (#64049) 2021-08-26 20:17:48 -07:00
remote_device.py Rewrite ShardedTensor.gather to use dist.gather instead of gather_object (#77272) 2022-05-17 02:14:40 +00:00
rendezvous.py Improving typing and typing-related performance in rendezvous.py 2022-04-24 21:49:51 +00:00
run.py (torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598) 2022-03-03 08:35:50 +00:00
utils.py FSDP parameter sync 2022-05-17 19:58:49 +00:00