pytorch/torch/distributed

Latest commit 13070e2753 (Lucas Pasqualin, 2024-04-11 21:09:38 +00:00): [DCP] Adds better handling in logging of specific kwargs (#123658)

Adds additional signpost integrations to the DCP logger to support MLU and metric collection.

Differential Revision: [D55803461](https://our.internmc.facebook.com/intern/diff/D55803461/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123658
Approved by: https://github.com/fegin
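The latest commit above concerns how the DCP logger handles specific kwargs before emitting them. As a purely hypothetical sketch (this is not the actual `c10d_logger.py` implementation, and `loggable_kwargs` is an invented name), one common pattern for "better handling" of arbitrary kwargs in logging is to probe each value for JSON-serializability and fall back to its type name otherwise:

```python
import json


# Hypothetical sketch, not PyTorch's code: build a JSON-safe payload from
# arbitrary kwargs so a logger never chokes on non-serializable values.
def loggable_kwargs(**kwargs):
    payload = {}
    for key, value in kwargs.items():
        try:
            json.dumps(value)  # probe: raises TypeError if not JSON-serializable
            payload[key] = value
        except TypeError:
            payload[key] = type(value).__name__  # fall back to the type name
    return payload


print(loggable_kwargs(checkpoint_id="ckpt-1", planner=object()))
# → {'checkpoint_id': 'ckpt-1', 'planner': 'object'}
```

A payload built this way can be passed straight to a structured-logging or metrics sink without risking serialization errors at log time.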
| Name | Last commit | Date |
| --- | --- | --- |
| _composable/ | [Compile FSDP2][3/n] Check all_gather_work is distributed_c10d.Work before calling .wait() (#123491) | 2024-04-10 01:23:54 +00:00 |
| _shard/ | [PT] [ST] support non contiguous rank validation in sharded tensor (#123230) | 2024-04-05 21:05:01 +00:00 |
| _sharded_tensor/ | | |
| _sharding_spec/ | | |
| _spmd/ | [dtensor] refactor schema suggestions in output sharding (#122929) | 2024-04-01 17:39:39 +00:00 |
| _tensor/ | DTensor: add ring attention for _scaled_dot_product_flash_attention (#122460) | 2024-04-03 06:45:00 +00:00 |
| _tools/ | | |
| algorithms/ | [DDP] Use compiled_autograd to trace DDP backward allreduce (#110662) | 2024-02-08 03:03:15 +00:00 |
| autograd/ | | |
| benchmarks/ | Move doc links to point to main (#121823) | 2024-03-15 19:49:37 +00:00 |
| checkpoint/ | [DCP] Adds better handling in logging of specific kwargs (#123658) | 2024-04-11 21:09:38 +00:00 |
| elastic/ | Adding health check server hook in torch elastic (#122750) (#123504) | 2024-04-11 19:10:56 +00:00 |
| examples/ | | |
| fsdp/ | [FSDP1] fix _same_storage check for DTensor (#123617) | 2024-04-10 10:26:12 +00:00 |
| launcher/ | [Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942) | 2024-03-02 08:07:52 +00:00 |
| nn/ | Fix get_rank under a non-default group. (#120481) | 2024-03-11 05:40:54 +00:00 |
| optim/ | Add tensor step and capturable support to rprop (#122261) | 2024-03-28 23:31:18 +00:00 |
| pipeline/ | [c10d] Deprecate torch.distributed.pipeline (#121464) | 2024-03-08 19:55:02 +00:00 |
| rpc/ | | |
| tensor/ | [TP] Add wildcard support (#122968) | 2024-04-02 21:23:39 +00:00 |
| __init__.py | | |
| _composable_state.py | | |
| _functional_collectives_impl.py | [funcol] add deprecation warning for the legacy backend (#122666) | 2024-04-03 00:27:06 +00:00 |
| _functional_collectives.py | Don't create world pg variable out of thin air when rewriting c10d collectives (#122561) | 2024-03-26 20:12:08 +00:00 |
| _state_dict_utils.py | [DCP] Adds ability to create a CPU state dict that is both shared and pinned (#122338) | 2024-04-03 20:05:01 +00:00 |
| argparse_util.py | Add --local-ranks-filter to torchrun: allow logs filtering by rank (#118562) | 2024-02-07 04:29:54 +00:00 |
| c10d_logger.py | [DCP] Adds better handling in logging of specific kwargs (#123658) | 2024-04-11 21:09:38 +00:00 |
| collective_utils.py | | |
| constants.py | | |
| CONTRIBUTING.md | | |
| device_mesh.py | [DeviceMesh] Make dtype of mesh tensor from init_device_mesh() consistent with directly calling DeviceMesh() (#123677) | 2024-04-10 09:14:34 +00:00 |
| distributed_c10d.py | Fix example in torch.distributed.new_subgroups docstring (#123492) | 2024-04-10 03:33:07 +00:00 |
| launch.py | [BE] minor logging cleanup in distributed (#122921) | 2024-03-29 03:34:01 +00:00 |
| logging_handlers.py | | |
| remote_device.py | | |
| rendezvous.py | Enable local_partial_types (#118467) | 2024-01-28 13:38:22 +00:00 |
| run.py | [BE] minor logging cleanup in distributed (#122921) | 2024-03-29 03:34:01 +00:00 |
| utils.py | | |