pytorch/test/distributed
Rohan Varma a0b3814433 Clean prefixes when searching for params / buffers to ignore (#78278)
Co-authored with: @awgu

When `state_dict` has a prefix attached to it, the current logic for ignoring parameters and buffers does not work since it doesn't account for this prefix. To fix this, we make the following changes:

- Clean the key if it starts with the prefix. Note that not all keys necessarily start with the prefix, e.g. when the current module's state_dict_post_hook is running and a previous module's `state_dict` has already been computed, with that previous module at the same level of the hierarchy as the current module.
- Because of this prefixing, it is not correct to override a child module's ignored params and buffers with the root FSDP instance's (this wouldn't work if child FSDP instances had ignored modules and the root didn't, for example). We fix this by having each parent know about the ignored modules of its children, and by computing fully qualified names for ignored params and buffers.
- This means that a particular FSDP instance knows the fully qualified names of its own and its children's params and buffers that it needs to ignore. It doesn't know about its parents' ignored params and buffers, but it doesn't need to store this data.
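
The changes listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the FSDP implementation: `clean_key`, `qualify`, and `drop_ignored` are hypothetical helper names, and the state dict is modeled as an ordinary `dict` keyed by strings.

```python
from typing import Any, Dict, Iterable, Set


def clean_key(key: str, prefix: str) -> str:
    """Strip `prefix` from `key`, but only when the key actually starts with it.

    Keys written earlier by a sibling module may not carry this prefix,
    so they are returned unchanged.
    """
    if prefix and key.startswith(prefix):
        return key[len(prefix):]
    return key


def qualify(child_path: str, names: Iterable[str]) -> Set[str]:
    """Build fully qualified names for ignored params/buffers of a child.

    `child_path` is the dotted path from the current instance to the child
    (empty string for the instance itself). Hypothetical helper.
    """
    return {f"{child_path}.{name}" if child_path else name for name in names}


def drop_ignored(
    state_dict: Dict[str, Any], prefix: str, ignored_fqns: Set[str]
) -> Dict[str, Any]:
    """Return a copy of `state_dict` without entries whose cleaned key is ignored."""
    return {
        key: value
        for key, value in state_dict.items()
        if clean_key(key, prefix) not in ignored_fqns
    }


# The instance ignores its own "bias" and its child's "weight".
ignored = qualify("", ["bias"]) | qualify("child", ["weight"])

state_dict = {
    "module.bias": 0,          # ignored: own entry, carries the prefix
    "module.child.weight": 1,  # ignored: child entry, fully qualified
    "module.child.bias": 2,    # kept
    "sibling.bias": 3,         # kept: no prefix, and "sibling.bias" is not ignored
}
filtered = drop_ignored(state_dict, "module.", ignored)
```

Because each instance only stores names relative to itself, the same filter works at any level of the hierarchy as long as the caller passes the prefix that was attached to that instance's keys.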
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78278
Approved by: https://github.com/awgu
2022-05-26 02:43:03 +00:00
..
_shard [PT-D] Enable nan_to_num op for sharded tensor 2022-05-25 18:03:42 +00:00
algorithms [BE] move init_multigpu_helper to common_distributed (#67050) 2021-10-22 17:16:11 -07:00
bin Add test owner to distributed files starting with test_ (#66797) 2021-10-19 10:55:20 -07:00
elastic [torch][elastic] Make final agent barrier to shutdown properly 2022-04-15 20:29:05 +00:00
fsdp Clean prefixes when searching for params / buffers to ignore (#78278) 2022-05-26 02:43:03 +00:00
launcher [torchelastic][1/n] Fix caffe2.test.distributed.launcher.api_test flaky tests (#68624) 2021-11-19 15:23:30 -08:00
nn/jit Have test classes extend from common_utils.TestCase, not unittest.TestCase (#66900) 2021-10-19 16:54:05 -07:00
optim Convert DDP parameters to ReplicatedTensor during forward pass. 2022-04-18 03:27:23 +00:00
pipeline/sync [skip ci] set more tests with owners for distributed and elastic (#67583) 2021-11-01 12:26:03 -07:00
rpc Add test owner to distributed files starting with test_ (#66797) 2021-10-19 10:55:20 -07:00
argparse_util_test.py [skip ci] set more tests with owners for distributed and elastic (#67583) 2021-11-01 12:26:03 -07:00
test_c10d_common.py Fix SyncBatchNorm for empty inputs (#74944) 2022-04-01 23:48:30 +00:00
test_c10d_gloo.py ROCm: unskip c10 gloo tests 2022-04-25 14:28:56 +00:00
test_c10d_nccl.py Validate that tensors are contiguous in ProcessGroupNCCL 2022-05-19 17:48:22 +00:00
test_c10d_spawn_gloo.py [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786) 2021-12-06 13:38:58 -08:00
test_c10d_spawn_nccl.py [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786) 2021-12-06 13:38:58 -08:00
test_c10d_spawn.py [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786) 2021-12-06 13:38:58 -08:00
test_data_parallel.py no longer coalesce sparse COO tensors before comparison (#69751) 2022-02-17 02:33:08 +00:00
test_distributed_spawn.py Add test owner to distributed files starting with test_ (#66797) 2021-10-19 10:55:20 -07:00
test_launcher.py Add test owner to distributed files starting with test_ (#66797) 2021-10-19 10:55:20 -07:00
test_nccl.py [NCCL] Patch bfloat16 support (#67843) 2021-11-09 13:46:13 -08:00
test_pg_wrapper.py Add test owner to distributed files starting with test_ (#66797) 2021-10-19 10:55:20 -07:00
test_store.py [Bootcamp] Set default value of TCPStore world_size to None in pybind definition (#77277) 2022-05-12 18:48:48 +00:00