pytorch/torch/distributed/elastic
Tristan Rice 196748d491 [elastic] support local_addr across all rendezvous impls (#135262)
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
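
For context, a minimal sketch of the restored code path, assuming the post-fix signature `RendezvousStoreInfo.build(rank, store, local_addr)` and using a local `TCPStore` purely for illustration (not the PR diff itself):

```python
# Minimal sketch (illustration only): after this fix, rendezvous handlers
# pass the user-supplied `local_addr` into RendezvousStoreInfo.build so
# rank 0 advertises that address instead of its auto-detected hostname.
from torch.distributed import TCPStore
from torch.distributed.elastic.rendezvous.api import RendezvousStoreInfo

store = TCPStore("localhost", 29500, is_master=True, wait_for_workers=False)

# On rank 0, a non-None local_addr becomes the advertised master address;
# a free port is picked and published through the store alongside it.
info = RendezvousStoreInfo.build(rank=0, store=store, local_addr="10.0.0.1")
print(info.master_addr, info.master_port)  # expected: 10.0.0.1 <some free port>
```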

This also fixes a number of tests so they can run in parallel, which greatly speeds up the testing cycle since this change touches many different rendezvous implementations. That required a few fixes in unrelated tests.

Test Plan:
Added tests for the common rendezvous implementations verifying that `local_addr` is propagated, to prevent future regressions (a sketch of this kind of check follows the test plan).

```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```

To vet the parallelism changes, I ran each suite with 3 stress runs to surface any flakiness introduced by running in parallel.
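
For illustration, a hypothetical shape of such a regression check (not the actual test added here), assuming the `c10d` backend, the registry helpers `_register_default_handlers` / `get_rendezvous_handler`, and the `RendezvousInfo.bootstrap_store_info` API:

```python
# Hypothetical single-node check (illustration only): the handler should
# advertise the caller-supplied local_addr as the bootstrap master address.
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous import registry as rdzv_registry

rdzv_registry._register_default_handlers()  # torchrun normally does this
params = RendezvousParameters(
    backend="c10d",
    endpoint="localhost:29400",
    run_id="local_addr_regression_check",  # hypothetical run id
    min_nodes=1,
    max_nodes=1,
    local_addr="10.0.0.1",  # hypothetical address under test
)
handler = rdzv_registry.get_rendezvous_handler(params)
rdzv_info = handler.next_rendezvous()
assert rdzv_info.bootstrap_store_info.master_addr == "10.0.0.1"
handler.shutdown()
```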

Differential Revision: D62256407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-09-06 17:55:43 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| agent | Use JK for mast rdzv handler tcpstore handling and additional logging (#129603) | 2024-06-27 03:34:52 +00:00 |
| events | [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) | 2024-06-18 13:51:53 +00:00 |
| metrics | [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) | 2024-06-18 13:51:53 +00:00 |
| multiprocessing | [torchelastic] Don't do signal handling when off the main thread (#135088) | 2024-09-06 14:47:03 +00:00 |
| rendezvous | [elastic] support local_addr across all rendezvous impls (#135262) | 2024-09-06 17:55:43 +00:00 |
| timer | [elastic] support local_addr across all rendezvous impls (#135262) | 2024-09-06 17:55:43 +00:00 |
| utils | [TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062) | 2024-09-04 22:05:51 +00:00 |
| __init__.py | | |
| control_plane.py | [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) | 2024-06-18 13:51:53 +00:00 |