18 Commits

Author SHA1 Message Date
Aaron Orenstein
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
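
For illustration, the `Tuple` -> `tuple` migration named in the title amounts to swapping the `typing` alias for the builtin generic (PEP 585, Python 3.9+). The function names below are hypothetical and do not come from `torch/distributed`; this is only a sketch of the pattern.

```python
from typing import Tuple  # pre-PEP 585 spelling, kept here only for comparison


def split_endpoint_old(endpoint: str) -> Tuple[str, int]:
    """Old annotation style: typing.Tuple."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def split_endpoint_new(endpoint: str) -> tuple[str, int]:
    """New annotation style: builtin tuple used as a generic."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)
```

Both functions behave identically at runtime; only the annotation spelling differs.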
2025-01-10 08:34:54 +00:00
Xuehai Pan
e6d4451ae8 [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866
Approved by: https://github.com/fegin
2024-06-18 13:51:53 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Kazuaki Ishizaki
91973e1c31 Issue113185 (#113523)
Fixes #113185

I have fixed the given docstring errors. The following are the outputs with numbers before and after the changes:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113523
Approved by: https://github.com/kit1980
2023-11-14 22:25:28 +00:00
Chris Zheng
b309599d1b Add catch socket.gaierror for _matches_machine_hostname (#91119)
Summary: Add catch `socket.gaierror` for _matches_machine_hostname

Test Plan: Unit tests again

Reviewed By: kurman

Differential Revision: D42152245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91119
Approved by: https://github.com/kurman
2022-12-20 00:57:53 +00:00
Chris Zheng
f833880b2e Fix torch.distributed.run init connect timeout by comparing host with the current IP list (#90221)
Summary:
Pull Request: https://github.com/pytorch/pytorch/issues/79388

Fix torch.distributed.run init connect timeout by comparing `host` with the current IP list.
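
A minimal sketch of the idea, assuming the check resolves the local hostname to a set of addresses and compares `host` against it. The helper name `matches_local_machine` is hypothetical and this is not the actual `_matches_machine_hostname` implementation; it also catches `socket.gaierror`, as the follow-up commit b309599d1b describes.

```python
import ipaddress
import socket


def matches_local_machine(host: str) -> bool:
    """Hypothetical helper: True if `host` appears to refer to this machine."""
    if host == "localhost":
        return True

    # A literal loopback address always refers to the local machine.
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None
    if addr is not None and addr.is_loopback:
        return True

    this_host = socket.gethostname()
    if host == this_host:
        return True

    try:
        # Addresses that the local hostname resolves to.
        local_ips = {
            info[4][0]
            for info in socket.getaddrinfo(this_host, None, proto=socket.IPPROTO_TCP)
        }
        if addr is not None:
            return str(addr) in local_ips
        # `host` is a name: resolve it and look for any address in common.
        host_ips = {
            info[4][0]
            for info in socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        }
    except socket.gaierror:
        # An unresolvable name is treated as "not this machine".
        return False
    return bool(host_ips & local_ips)
```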

Test Plan: unit tests

Differential Revision: D41373962

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90221
Approved by: https://github.com/d4l3k
2022-12-19 12:58:23 +00:00
PyTorch MergeBot
14a7cf79c1 Add __all__ to torch.distributed and tensorboard submodules (#80444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80444
Approved by: https://github.com/rohan-varma
2022-06-28 16:33:22 +00:00
Can Balioglu
1b745efbe8 [14/n] Introduce a name attribute to _PeriodicTimer (#57143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57143

This PR introduces a `name` attribute in `_PeriodicTimer` for testing and debugging purposes.
ghstack-source-id: 127684751

Test Plan: Run the new and updated unit tests.

Reviewed By: tierex

Differential Revision: D28059045

fbshipit-source-id: 9eb067300aea21a99577e6cd8a354f7eb749f4a6
2021-05-03 11:37:05 -07:00
Can Balioglu
df91eb924c [5/n] [torch/elastic] Introduce the delay utility function (#56533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56533

This PR introduces a small utility function to delay the execution of the current thread.
ghstack-source-id: 126979035
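
As a rough sketch of what such a utility could look like: the `delay` name, the `(min, max)` range form, and the return value are assumptions made for illustration, not details stated in this commit.

```python
import random
import time
from typing import Union


def delay(seconds: Union[float, tuple[float, float]]) -> float:
    """Block the current thread and return the number of seconds slept.

    A (min, max) tuple picks a random duration in that range, which can help
    spread out retries from many threads (an assumed use case).
    """
    if isinstance(seconds, tuple):
        seconds = random.uniform(*seconds)
    if seconds > 0:
        time.sleep(seconds)
    return float(seconds)
```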

Test Plan: Run the associated unit tests.

Reviewed By: H-Huang

Differential Revision: D27889671

fbshipit-source-id: aae93b624bd4704da7a48004f50d130cec64969d
2021-04-21 16:01:00 -07:00
Can Balioglu
76ca1eeeb8 [4/n] [torch/elastic] Fix the finalizer of PeriodicTimer (#56532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56532

This PR fixes a subtle issue with the finalizer implementation of `_PeriodicTimer`.

We avoid using a regular finalizer (a.k.a. `__del__`) for stopping the timer as joining a daemon thread during the interpreter shutdown can cause deadlocks. The `weakref.finalize` is a superior alternative that provides a consistent behavior regardless of the GC implementation.
ghstack-source-id: 126978904
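
A minimal sketch of the `weakref.finalize` pattern described above, assuming the background thread only holds a small context object rather than the timer itself (otherwise the timer could never be collected). The `Timer` and `_Context` names are hypothetical; this is not the `_PeriodicTimer` source.

```python
import threading
import weakref
from typing import Callable


class _Context:
    """State shared with the thread; deliberately holds no reference to Timer."""

    def __init__(self, interval: float, function: Callable[[], None]) -> None:
        self.interval = interval
        self.function = function
        self.stop_event = threading.Event()


class Timer:
    def __init__(self, interval: float, function: Callable[[], None]) -> None:
        self._ctx = _Context(interval, function)
        self._thread = threading.Thread(
            target=Timer._run, args=(self._ctx,), daemon=True
        )
        self._thread.start()
        # The callback and its arguments must not reference `self`.
        self._finalizer = weakref.finalize(self, Timer._stop, self._thread, self._ctx)

    def cancel(self) -> None:
        # Calling a finalizer runs its callback at most once.
        self._finalizer()

    @staticmethod
    def _run(ctx: _Context) -> None:
        # Event.wait returns False on timeout and True once the event is set.
        while not ctx.stop_event.wait(ctx.interval):
            ctx.function()

    @staticmethod
    def _stop(thread: threading.Thread, ctx: _Context) -> None:
        ctx.stop_event.set()
        thread.join()
```

With this layout the thread is stopped either by an explicit `cancel()` or when the `Timer` object is garbage collected, without relying on `__del__`.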

Test Plan: Run the existing unit tests as there is no behavioral change.

Reviewed By: H-Huang

Differential Revision: D27889289

fbshipit-source-id: a248cf6fd1abc4da8bef90e160fa9669a4961fa5
2021-04-21 15:59:19 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.
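
To illustrate what "unqualified" means here, the sketch below flags `# type: ignore` comments that lack a bracketed error code. The regex and function name are assumptions for illustration; the actual lint added by this PR may be implemented differently.

```python
import re

# Matches "# type: ignore" that is NOT immediately followed by "[code]".
UNQUALIFIED_IGNORE = re.compile(r"#\s*type:\s*ignore(?!\[)")


def find_unqualified_ignores(source: str) -> list[int]:
    """Return the 1-based line numbers that contain an unqualified ignore."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if UNQUALIFIED_IGNORE.search(line)
    ]
```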

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Can Balioglu
512c744f2e [torch/elastic] Introduce PeriodicTimer (#55919)
Summary:
This PR introduces a basic timer type that periodically calls a specified function. Its main use in the upcoming `DynamicRendezvousHandler` implementation will be to send periodic keep-alive updates in a background thread.
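
A minimal sketch of such a timer, assuming a daemon thread that wakes up on a fixed interval until cancelled. The class below is illustrative only and is not the implementation introduced by this PR.

```python
import threading
from typing import Any, Callable


class PeriodicTimer:
    """Calls `function(*args, **kwargs)` every `interval` seconds in a background thread."""

    def __init__(
        self, interval: float, function: Callable[..., Any], *args: Any, **kwargs: Any
    ) -> None:
        self._interval = interval
        self._function = function
        self._args = args
        self._kwargs = kwargs
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def cancel(self) -> None:
        # Wake the thread up and wait for it to exit.
        self._stop_event.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the loop ticks once per interval
        # and exits as soon as cancel() sets the event.
        while not self._stop_event.wait(self._interval):
            self._function(*self._args, **self._kwargs)
```

A keep-alive sender would pass its update function as `function` and call `cancel()` on shutdown.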

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55919

Reviewed By: tierex

Differential Revision: D27740823

Pulled By: cbalioglu

fbshipit-source-id: e46fc848ab033995946a38a29c01d67d387a4cf5
2021-04-15 14:51:14 -07:00
Aliaksandr Ivanou
8f663170bd [17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687

The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env

The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0.

The diff modifies caffe2/rendezvous: if the worker process is launched with the torchelastic agent, the worker processes will create a PrefixStore("worker/") from a TCPStore without a listener.

The diff adds macros functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27643206

fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
2021-04-14 19:33:26 -07:00
Can Balioglu
339d3bf394 [2/n] [torch/elastic] Introduce C10dRendezvousBackend. (#55636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636

This diff introduces:

- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
ghstack-source-id: 126312162
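
To make the non-existent-key case concrete, here is an in-memory sketch of compare-and-set semantics. It assumes the convention that an empty expected value means "set only if the key is absent"; the class is hypothetical, and its return values are chosen for the sketch rather than taken from `TCPStore`.

```python
import threading
from typing import Optional


class InMemoryStore:
    """Toy key-value store with an atomic compare-and-set (illustration only)."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
        self._lock = threading.Lock()

    def compare_set(self, key: str, expected: bytes, desired: bytes) -> Optional[bytes]:
        """Return the value stored under `key` after the call, or None if absent."""
        with self._lock:
            current = self._data.get(key)
            if current is None:
                # Missing key: only an empty `expected` is allowed to create it.
                if expected == b"":
                    self._data[key] = desired
                    return desired
                return None
            if current == expected:
                self._data[key] = desired
                return desired
            # Mismatch: leave the stored value untouched.
            return current
```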

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654492

fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
2021-04-12 22:20:27 -07:00
Can Balioglu
b3dd8cde61 [1/n] [torch/elastic] Introduce DynamicRendezvousHandler and RendezvousBackend. (#55635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635

This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.

`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
ghstack-source-id: 126304697
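
The delegation described above can be sketched as follows. The `Backend`, `InMemoryBackend`, and `Handler` classes and their methods (`get_state`, `set_state`, `join`) are hypothetical stand-ins chosen for illustration; they are not the interfaces introduced by this diff.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Backend(ABC):
    """Stand-in for a rendezvous backend: only knows how to load and store state."""

    @abstractmethod
    def get_state(self) -> Optional[bytes]: ...

    @abstractmethod
    def set_state(self, state: bytes) -> None: ...


class InMemoryBackend(Backend):
    """Trivial concrete backend; a real one would talk to a store."""

    def __init__(self) -> None:
        self._state: Optional[bytes] = None

    def get_state(self) -> Optional[bytes]:
        return self._state

    def set_state(self, state: bytes) -> None:
        self._state = state


class Handler:
    """Stand-in for the backend-agnostic handler; the backend is a constructor argument."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def join(self, node: str) -> list[str]:
        # Core logic lives here; persistence is delegated to the backend.
        raw = self._backend.get_state()
        nodes = raw.decode().split(",") if raw else []
        if node not in nodes:
            nodes.append(node)
        self._backend.set_state(",".join(nodes).encode())
        return nodes
```

Swapping `InMemoryBackend` for another `Backend` subclass changes where state lives without touching `Handler`.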

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654478

fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
2021-04-12 22:18:49 -07:00
Can Balioglu
7f06c65a4c [torch/elastic] Improve the implementation of the utility functions and add their unit tests. (#54804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54804

Improve the implementation of the utility functions to handle more edge cases and also have a new set of unit tests to cover their usage.

Test Plan: Run the existing and newly introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27327898

fbshipit-source-id: 96b6fe2d910e3de69f44947a0e8a9f687ab50633
2021-04-05 23:38:25 -07:00
Kiuk Chung
ba75cedfc5 [1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#53172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172

Pull Request resolved: https://github.com/pytorch/elastic/pull/141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there are 4-5 test dockers, one for each platform, so it is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     2. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think it's better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic - T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
2021-03-05 11:27:57 -08:00