# pytorch/torch/distributed

Latest commit: misc. fixes to unflatten (#141066) by Avik Chaudhuri (`8b4ae29b1b`)
Handling of nested modules in unflatten had several bugs, which were caught when trying to preserve module call signatures for nested modules:
* A module `k` that was first encountered through a call to `k.n()`, before `k()` itself was called, used to become an empty nn module, so information was dropped when `k()` was eventually called. Relatedly, we would also lose call counts for `k.n()` reached through different paths (say, when `k()` itself calls `n()`). A minimal repro of this call pattern is sketched after this list.
* Deleting call-indexed modules and patching up their call sites was broken for nested modules when creating dispatcher modules, because their fully qualified names (FQNs) were mishandled.

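Below is a minimal sketch of the call pattern from the first bullet. The module classes and shapes are hypothetical illustrations, not taken from the test suite; `torch.export.export` with `preserve_module_call_signature` and `torch.export.unflatten` are the actual entry points that exercise these paths:

```python
import torch

class N(torch.nn.Module):
    def forward(self, x):
        return x + 1

class K(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.n = N()

    def forward(self, x):
        # k() reaches n() through a second path, so call counts for k.n
        # must be aggregated across both paths
        return self.n(x) * 2

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.k = K()

    def forward(self, x):
        y = self.k.n(x)   # k.n() is called before k() itself,
        return self.k(y)  # so unflatten first sees `k` only as a prefix

ep = torch.export.export(
    M(),
    (torch.randn(3),),
    preserve_module_call_signature=("k", "k.n"),
)
# Before these fixes, unflattening could materialize `k` as an empty module
unflat = torch.export.unflatten(ep)
print(unflat(torch.randn(3)))
```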
An interesting aside: we used random graph generation to test some of these changes. A future PR will add the infrastructure to create tests from these random graphs.

Differential Revision: D66192799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141066
Approved by: https://github.com/angelayi
Committed: 2024-11-23 07:31:51 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| `_composable/` | Fix type-safety of torch.nn.Module instances (#141240) | 2024-11-22 00:05:05 +00:00 |
| `_shard/` | [BE]: Apply PERF401 autofixes from ruff (#140980) | 2024-11-20 17:52:07 +00:00 |
| `_sharded_tensor/` | | |
| `_sharding_spec/` | | |
| `_symmetric_memory/` | [SymmetricMemory] introduce user-facing APIs empty() and rendezvous() (#139677) | 2024-11-17 20:51:50 +00:00 |
| `_tensor/` | | |
| `_tools/` | ILP for auto FSDP wrapping (#140298) | 2024-11-11 22:02:39 +00:00 |
| `algorithms/` | [C10D] support group_src/dst in broadcast/reduce ops (#140843) | 2024-11-19 01:23:08 +00:00 |
| `autograd/` | | |
| `benchmarks/` | [BE]: Apply PERF401 autofixes from ruff (#140980) | 2024-11-20 17:52:07 +00:00 |
| `checkpoint/` | Fix the use of fsspec transactions (#135541) | 2024-11-22 00:03:19 +00:00 |
| `elastic/` | [Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702) | 2024-11-05 15:28:03 +00:00 |
| `examples/` | | |
| `fsdp/` | Fix type-safety of torch.nn.Module instances (#141240) | 2024-11-22 00:05:05 +00:00 |
| `launcher/` | | |
| `nn/` | [BE]: Apply PERF401 autofixes from ruff (#140980) | 2024-11-20 17:52:07 +00:00 |
| `optim/` | [BE]: Apply PERF401 autofixes from ruff (#140980) | 2024-11-20 17:52:07 +00:00 |
| `pipelining/` | misc. fixes to unflatten (#141066) | 2024-11-23 07:31:51 +00:00 |
| `rpc/` | | |
| `tensor/` | [BE]: Apply PERF401 autofixes from ruff (#140980) | 2024-11-20 17:52:07 +00:00 |
| `__init__.py` | | |
| `_checkpointable.py` | | |
| `_composable_state.py` | [FSDP2] Make module-to-state mapping use weakrefs (#139650) | 2024-11-05 02:16:52 +00:00 |
| `_functional_collectives_impl.py` | | |
| `_functional_collectives.py` | [aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095) | 2024-11-07 16:24:48 +00:00 |
| `_state_dict_utils.py` | [DSD] Fix loading uneven full tensor into sharded state dict (#136365) | 2024-09-23 16:35:58 +00:00 |
| `argparse_util.py` | | |
| `c10d_logger.py` | [c10d] Switch all timer logging in c10d to wait_counter (#141154) | 2024-11-21 01:10:11 +00:00 |
| `collective_utils.py` | | |
| `constants.py` | | |
| `CONTRIBUTING.md` | | |
| `device_mesh.py` | [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945) | 2024-10-29 17:56:56 +00:00 |
| `distributed_c10d.py` | API to retrieve default distributed backend from device (#140536) | 2024-11-22 11:01:53 +00:00 |
| `launch.py` | | |
| `logging_handlers.py` | | |
| `remote_device.py` | | |
| `rendezvous.py` | [reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768) | 2024-09-26 17:37:07 +00:00 |
| `run.py` | [BE]: Use proper logger in torch.distributed.run (#140547) | 2024-11-14 14:49:17 +00:00 |
| `utils.py` | Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)" | 2024-10-16 16:38:29 +00:00 |