pytorch/torch/distributed
Howard Huang e8e65764d1 [pipelining] Improve schedule csv loading (#142009)
Add small changes based on feedback from Less when testing out https://github.com/pytorch/torchtitan/pull/707
- expose `validate_schedule` as a function
- handle spaces around actions in csv file
- add error arrow to `_format_pipeline_schedule()` to better show where the step errored
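
As a rough illustration of the last two bullets, the sketch below parses a schedule CSV while ignoring spaces around each action, then renders the steps with an arrow on the one that failed. It is not the code from the PR: `load_schedule_rows` and `format_with_error_arrow` are hypothetical names, the action strings (`0F0`, `1B0`, ...) are treated as opaque placeholders, and the one-row-per-rank layout is an assumption.

```python
import csv
import io


def load_schedule_rows(csv_text):
    """Parse one list of action strings per CSV row (assumed: one row per rank)."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        # Strip whitespace so "0F0, 0F1" and "0F0,0F1" load identically.
        # Empty cells are kept as "" so step positions stay aligned.
        rows.append([cell.strip() for cell in row])
    return rows


def format_with_error_arrow(rows, error_step):
    """Render one line per step across all rows, marking the failing step."""
    num_steps = max((len(r) for r in rows), default=0)
    lines = []
    for step in range(num_steps):
        cells = [r[step] if step < len(r) else "" for r in rows]
        line = f"step {step}: " + " | ".join(cells)
        if step == error_step:
            line += "  <-- error"
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    rows = load_schedule_rows("0F0, 0F1 ,0B0\n 1F0,1F1, 1B0\n")
    print(format_with_error_arrow(rows, error_step=1))
```

Stripping each cell at load time keeps later parsing and validation free of whitespace handling, which is presumably why the change is made in the loader.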

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142009
Approved by: https://github.com/lessw2020
2024-12-04 04:15:34 +00:00
_composable Remove old FSDP1 fully_shard (#141875) 2024-12-03 17:00:47 +00:00
_shard [BE]: Apply PERF401 autofixes from ruff (#140980) 2024-11-20 17:52:07 +00:00
_sharded_tensor
_sharding_spec
_symmetric_memory [SymmetricMemory] introduce user-facing APIs empty() and rendezvous() (#139677) 2024-11-17 20:51:50 +00:00
_tensor
_tools Revert "ILP for auto FSDP wrapping (#140298)" 2024-12-02 14:08:04 +00:00
algorithms [BE]: Update mypy to 1.13.0 (#140808) 2024-12-03 02:50:10 +00:00
autograd
benchmarks [BE]: Apply PERF401 autofixes from ruff (#140980) 2024-11-20 17:52:07 +00:00
checkpoint Initialize lr as a tensor if it is originally a tensor (#141620) 2024-12-03 18:10:23 +00:00
elastic [Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702) 2024-11-05 15:28:03 +00:00
examples
fsdp [BE]: Update mypy to 1.13.0 (#140808) 2024-12-03 02:50:10 +00:00
launcher
nn [BE]: Apply PERF401 autofixes from ruff (#140980) 2024-11-20 17:52:07 +00:00
optim [BE]: Apply PERF401 autofixes from ruff (#140980) 2024-11-20 17:52:07 +00:00
pipelining [pipelining] Improve schedule csv loading (#142009) 2024-12-04 04:15:34 +00:00
rpc
tensor [BE]: Update mypy to 1.13.0 (#140808) 2024-12-03 02:50:10 +00:00
__init__.py
_checkpointable.py
_composable_state.py [FSDP2] Make module-to-state mapping use weakrefs (#139650) 2024-11-05 02:16:52 +00:00
_functional_collectives_impl.py
_functional_collectives.py [aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095) 2024-11-07 16:24:48 +00:00
_state_dict_utils.py
argparse_util.py
c10d_logger.py [c10d] Switch all timer logging in c10d to wait_counter (#141154) 2024-11-21 01:10:11 +00:00
collective_utils.py
constants.py
CONTRIBUTING.md
device_mesh.py [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945) 2024-10-29 17:56:56 +00:00
distributed_c10d.py [BE]: Update mypy to 1.13.0 (#140808) 2024-12-03 02:50:10 +00:00
launch.py
logging_handlers.py
remote_device.py
rendezvous.py
run.py [BE]: Use proper logger in torch.distributed.run (#140547) 2024-11-14 14:49:17 +00:00
utils.py Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)" 2024-10-16 16:38:29 +00:00