Commit Graph

15 Commits

Author SHA1 Message Date
Sergii Dymchenko
365071c73c Fix non-existing parameters in docstrings in torch/distributed (#91116)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91116
Approved by: https://github.com/huydhn
2022-12-22 02:37:31 +00:00
Chris Zheng
5d37890b8e Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified (#88922)
Summary:
Update dynamic renderzvous nodes to use rendezvous hostname if provided.
For PR: https://github.com/pytorch/pytorch/issues/85300

Before:
For dynamic renderzvous, it always grab the `fqdn` from socket for each node even if user specified the address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
```
return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)
```

Now:
If user specifies the hostname, each node will respect the given hostname.
For example, `socket.getfqdn(<hostname>) `

Test Plan: Unit tests.

Differential Revision: D41204028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88922
Approved by: https://github.com/d4l3k
2022-12-21 03:55:01 +00:00
Ram Rachum
351d73b97f Fix exception causes all over the codebase (#90271)
This is the continuation to #90134 and hopefully the final PR in this series.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271
Approved by: https://github.com/kit1980
2022-12-07 04:29:00 +00:00
Can Balioglu
028f2f62ac [torch/elastic] Update the rendezvous docs (#58160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160

This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.
ghstack-source-id: 128783809

Test Plan: Visually verified the correct

Reviewed By: tierex

Differential Revision: D28384996

fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
2021-05-12 16:54:28 -07:00
Kimish Patel
b7d674eb21 Revert D28331386: [pytorch][PR] [torch/elastic] Update the rendezvous docs
Test Plan: revert-hammer

Differential Revision:
D28331386 (e4418b67c7)

Original commit changeset: 95dd32146222

fbshipit-source-id: 5522d4a09bc06ac42943eec9aa8bf5292cc778b2
2021-05-11 18:10:46 -07:00
Can Balioglu
e4418b67c7 [torch/elastic] Update the rendezvous docs (#57973)
Summary:
This PR updates the rendezvous documentation for the Torch Distributed Elastic section of PyTorch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57973

Reviewed By: kiukchung

Differential Revision: D28331386

Pulled By: cbalioglu

fbshipit-source-id: 95dd32146222aaeff246bd3c3d2caf0036a9011b
2021-05-11 15:32:50 -07:00
Can Balioglu
731dcd75f5 [torch/elastic] Revise the note section of RendezvousHandler doc (#57723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57723

Updated the note section of `RendezvousHandler`:

- Removed the experimental API warning.
- Recommended using the C10d Store instead of etcd for most users.

Test Plan: N/A

Reviewed By: kiukchung

Differential Revision: D28253828

fbshipit-source-id: c4f34dffd1a3cc132977029fe449b6d63ddc877b
2021-05-07 13:10:22 -07:00
Can Balioglu
b3dd8cde61 [1/n] [torch/elastic] Introduce DynamicRendezvousHandler and RendezvousBackend. (#55635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635

This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.

`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
ghstack-source-id: 126304697

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654478

fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
2021-04-12 22:18:49 -07:00
Can Balioglu
493a233c04 [torch/elastic] Revise the rendezvous handler registry logic. (#55466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466

Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27623215

fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
2021-04-07 20:43:20 -07:00
Nikita Shulga
add49e7e4e Enforce PEP263 for PyTorch python codebase (#55346)
Summary:
All python files containing non-ASCII characters should be correctly annotated with `# -*- coding: utf-8 -*-` comment

Delete number of superfluous UTF-8 characters, most commonly UTF-8 opening closing quotation mark U+2019 (’) instead of ascii apostrophe ', for example `Module’s`->`Module's`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55346

Reviewed By: samestep

Differential Revision: D27582044

Pulled By: malfet

fbshipit-source-id: c1cd89655915858ff3a41f675cdfffff795a8e44
2021-04-06 18:31:38 -07:00
Brian Hirsh
bf70fe69ae Revert D27442325: [torch/elastic] Revise the rendezvous handler registry logic.
Test Plan: revert-hammer

Differential Revision:
D27442325 (df299dbd7d)

Original commit changeset: 8519a2caacbe

fbshipit-source-id: f10452567f592c23ae79ca31556a2a77546726b1
2021-04-06 06:17:14 -07:00
Can Balioglu
df299dbd7d [torch/elastic] Revise the rendezvous handler registry logic.
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27442325

fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
2021-04-05 23:38:29 -07:00
Can Balioglu
359d0a0205 [torch/elastic] Improve the implementation of RendezvousParameters and add its unit tests. (#146)
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54807

Improve the implementation and the unit test coverage of `RendezvousParameters`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27342444

fbshipit-source-id: 88de356c0a799844a739eb9105185bb8c1acf11f
2021-04-05 23:38:27 -07:00
Can Balioglu
bad8d34780 [torch/elastic] Revise the rendezvous exception types. (#54803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54803

Revise the rendezvous exception types to align their naming convention more closely with the standard Python exception types.

Test Plan: Run the existing test suite.

Reviewed By: H-Huang

Differential Revision: D27327505

fbshipit-source-id: 862c59222f9ca61a0e5afde89ae8f226090b4f92
2021-04-05 23:36:50 -07:00
Kiuk Chung
ba75cedfc5 [1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#53172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172

Pull Request resolved: https://github.com/pytorch/elastic/pull/141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
2021-03-05 11:27:57 -08:00