pytorch/torch/distributed
Will Constable 84416618a6 [Pipelining] Update schedules to use I, B actions. (#138886)
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.

Previously, schedules would issue a 'B' action and leave it ambiguous
whether that meant BACKWARD_INPUT or FULL_BACKWARD; a separate flag
(use_full_backward) passed to the schedule class determined which
behavior was taken at runtime.

Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR.  The logic for 'use_full_backward' is removed
from the runtime.
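
To make the distinction concrete, here is a small self-contained Python
sketch (not the pipelining internals; the real action types live in
torch.distributed.pipelining and may be spelled differently) of what
unambiguous IR looks like once backward work is split into I/W vs. B:

    from dataclasses import dataclass
    from enum import Enum

    class ComputationType(Enum):
        FORWARD = "F"
        BACKWARD_INPUT = "I"   # input grads only; weight grads deferred
        BACKWARD_WEIGHT = "W"  # the deferred weight-grad step
        FULL_BACKWARD = "B"    # input and weight grads together

    @dataclass(frozen=True)
    class Action:
        stage: int
        computation: ComputationType
        microbatch: int

        def __repr__(self) -> str:
            return f"{self.stage}{self.computation.value}{self.microbatch}"

    # Old style: emit 'B' everywhere and let use_full_backward decide at
    # runtime whether it meant BACKWARD_INPUT or FULL_BACKWARD.
    # New style: the schedule itself emits the unambiguous action.
    split_backward_rank0 = [
        Action(0, ComputationType.FORWARD, 0),
        Action(0, ComputationType.BACKWARD_INPUT, 0),   # "0I0"
        Action(0, ComputationType.BACKWARD_WEIGHT, 0),  # "0W0"
    ]
    full_backward_rank0 = [
        Action(0, ComputationType.FORWARD, 0),
        Action(0, ComputationType.FULL_BACKWARD, 0),    # "0B0"
    ]
    print(split_backward_rank0, full_backward_rank0)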

_validate_pipeline_order is replaced with _simulate_comms_compute. Both
serve a similar purpose: validating the correctness of a schedule IR.
'validate' operates on compute-only IR, while 'simulate' operates on
compute + comm IR. To convert from 'validate' to 'simulate', you first
have to insert comm actions via '_add_send_recv' (see the sketch below).
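
A rough migration sketch for a test, assuming the private helpers take a
rank-keyed dict of compute actions plus a stage-to-rank mapping (these
are internal APIs and the exact signatures may differ):

    from torch.distributed.pipelining.schedules import (  # private helpers named in this PR
        _add_send_recv,           # inserts send/recv actions into compute-only IR
        _simulate_comms_compute,  # steps through the compute + comm IR and checks it
    )

    num_stages, world_size = 4, 4
    stage_to_rank = lambda stage_idx: stage_idx % world_size

    # compute_sched: rank -> ordered list of compute actions, i.e. the
    # compute-only IR that _validate_pipeline_order used to consume
    # (assumed to come from a multi-stage schedule built elsewhere in the test).
    compute_sched = schedule.pipeline_order

    # Old: _validate_pipeline_order(compute_sched, ...)
    # New: lower to compute + comm IR first, then simulate it.
    comm_sched = _add_send_recv(
        compute_sched, stage_to_rank=stage_to_rank, num_stages=num_stages
    )
    _simulate_comms_compute(
        comm_sched, stage_to_rank=stage_to_rank, num_stages=num_stages
    )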

'simulate' was written inefficiently before this PR and had to be
optimized so it runs quickly even for the very large schedules (>32
ranks and microbatches per rank) used in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
2024-11-01 03:54:06 +00:00
_composable [Device] Replace hardcoded devices with 'torch._C._get_accelerator()' (#139032) 2024-10-29 04:51:47 +00:00
_shard Remove unused Python variables in torch/[b-z]* (#136963) 2024-10-19 16:45:22 +00:00
_sharded_tensor
_sharding_spec
_symmetric_memory get_symm_mem_workspace(): print helpful error during graph capture (#138028) 2024-10-30 18:11:09 +00:00
_tensor [reland][dtensor] move DTensor to public namespace (#134203) 2024-09-08 17:08:40 +00:00
_tools ILP for Auto SAC (Selective Activation Checkpointing) (#137908) 2024-10-18 12:45:37 +00:00
algorithms Make DDP Quantization hooks backend Agnostic (#138816) 2024-10-29 15:02:45 +00:00
autograd
benchmarks Remove unused Python variables in torch/[b-z]* (#136963) 2024-10-19 16:45:22 +00:00
checkpoint [DCP] Unit Test to validate the stateful and non-stateful loads (#139251) 2024-10-31 01:12:51 +00:00
elastic [SJD] [RFC] force setting last progress time (#138615) 2024-10-23 15:29:00 +00:00
examples
fsdp Remove unused Python variables in torch/[b-z]* (#136963) 2024-10-19 16:45:22 +00:00
launcher
nn Revert "added persistent option to buffers and namedbuffers (#132994)" 2024-08-09 18:14:53 +00:00
optim [BE]: Update mypy to 1.11.2 (#133816) 2024-09-16 19:44:11 +00:00
pipelining [Pipelining] Update schedules to use I, B actions. (#138886) 2024-11-01 03:54:06 +00:00
rpc [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) 2024-08-15 15:50:19 +00:00
tensor [DTensor][Bug Fix]Fix 2D DTensor mm with mesh_shape (1, n) or (n, 1) (#139134) 2024-10-30 08:09:39 +00:00
__init__.py
_checkpointable.py
_composable_state.py
_functional_collectives_impl.py
_functional_collectives.py [c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763) 2024-10-29 03:31:19 +00:00
_state_dict_utils.py [DSD] Fix loading uneven full tensor into sharded state dict (#136365) 2024-09-23 16:35:58 +00:00
argparse_util.py
c10d_logger.py
collective_utils.py
constants.py
CONTRIBUTING.md
device_mesh.py [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945) 2024-10-29 17:56:56 +00:00
distributed_c10d.py [c10d] allow sub group to be eagerly inited even if default one is not (#138665) 2024-10-24 23:51:28 +00:00
launch.py
logging_handlers.py
remote_device.py
rendezvous.py [reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768) 2024-09-26 17:37:07 +00:00
run.py
utils.py Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)" 2024-10-16 16:38:29 +00:00