14 Commits

Author SHA1 Message Date
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.
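
A minimal sketch of what the conversion changes in practice. The commit message does not state its motivation; this assumes the usual one, that %-style arguments are formatted only when the record is actually emitted. The `CountingStr` helper and the logger name are hypothetical and not part of the codemod.

```python
import logging

logger = logging.getLogger("fstring_example")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # DEBUG records are discarded


class CountingStr:
    """Hypothetical helper that counts how often it is converted to str."""

    def __init__(self) -> None:
        self.calls = 0

    def __str__(self) -> str:
        self.calls += 1
        return "expensive"


value = CountingStr()

# f-string: the message is built eagerly, even though DEBUG is disabled.
logger.debug(f"value is {value}")
eager_calls = value.calls

# %-style: the argument is passed through and never formatted,
# because the DEBUG record is dropped before formatting happens.
logger.debug("value is %s", value)
lazy_calls = value.calls - eager_calls
```

With the logger set to WARNING, only the f-string variant pays the formatting cost.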

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under the `torch/distributed` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Jeffrey Dunn
d779dadda1 Remove stack trace captures from import (#97274)
Summary:
Calls to this function without an argument capture a stack trace at
import time. This is expensive; we can skip it by passing in a value.
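
A rough sketch of the pattern described above, not the actual PyTorch function: `get_logger` here is a hypothetical helper that inspects the call stack only when no name is supplied, so module-level callers can avoid that cost by passing `__name__`.

```python
import inspect
import logging
from typing import Optional


def get_logger(name: Optional[str] = None) -> logging.Logger:
    """Hypothetical helper: derive the caller's module name if none is given."""
    if name is None:
        # Expensive path: inspect.stack() materializes every frame on the stack.
        caller = inspect.stack()[1]
        module = inspect.getmodule(caller.frame)
        name = module.__name__ if module is not None else "unknown"
    return logging.getLogger(name)


# Cheap path: passing a value skips the stack inspection at import time.
log = get_logger(__name__)
explicit = get_logger("my.package.module")
```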

Test Plan: Wait for tests

Differential Revision: D44244345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
2023-03-22 18:34:13 +00:00
Aaron Gokaslan
0444a6c90a [BE] Remove deprecated logging warn method (#94708)
Swaps all logging.warn calls to logging.warning since the former is deprecated and even raises a deprecation warning now.
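
A small illustration of the swap, using only the standard library. The logger name and the `ListHandler` class are made up for this example; the deprecated call is shown only as a comment.

```python
import logging

records = []


class ListHandler(logging.Handler):
    """Hypothetical handler that collects records for inspection."""

    def emit(self, record: logging.LogRecord) -> None:
        records.append(record)


logger = logging.getLogger("warn_example")  # hypothetical logger name
logger.addHandler(ListHandler())
logger.propagate = False  # keep the example output off the root logger

# Before: logger.warn("disk usage at %d%%", 91)   (deprecated alias)
# After:
logger.warning("disk usage at %d%%", 91)

message = records[0].getMessage()
level = records[0].levelname
```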

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94708
Approved by: https://github.com/ezyang
2023-02-13 18:24:52 +00:00
Chris Zheng
5d37890b8e Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified (#88922)
Summary:
Update dynamic rendezvous nodes to use the rendezvous hostname if provided.
For PR: https://github.com/pytorch/pytorch/issues/85300

Before:
For dynamic rendezvous, each node always takes the `fqdn` from the socket, even if the user specified an address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
```
return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)
```

Now:
If the user specifies a hostname, each node will respect it.
For example, `socket.getfqdn(<hostname>)`
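
The before/after behavior can be sketched roughly as follows. The function name `resolve_node_hostname` is an assumption for illustration; only `socket.getfqdn` and the `local_addr` parameter name come from the commit.

```python
import socket
from typing import Optional


def resolve_node_hostname(local_addr: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the described behavior."""
    if local_addr:
        # Now: respect the address the user supplied.
        return socket.getfqdn(local_addr)
    # Before (and still the fallback): look up this machine's own FQDN.
    return socket.getfqdn()


hostname = resolve_node_hostname("localhost")
```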

Test Plan: Unit tests.

Differential Revision: D41204028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88922
Approved by: https://github.com/d4l3k
2022-12-21 03:55:01 +00:00
Kazuaki Ishizaki
2ddefbdc3c Fix typos used in documents under torch directory (#88300)
This PR fixes typos in comments of Python files, found using the search box at https://pytorch.org/docs/master/search.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300
Approved by: https://github.com/lezcano
2022-11-02 09:38:13 +00:00
PyTorch MergeBot
9db3c517de Add __all__ for torch.nn.modules, torch.distributed.elastic, torch.nn.utils submodules (#80240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80240
Approved by: https://github.com/rohan-varma
2022-06-27 17:11:12 +00:00
Kiuk Chung
f6402c469e (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67749

Fixes: https://github.com/pytorch/pytorch/issues/67742

Test Plan:
Added unittests.

Validated manually:

```
# start agent 0
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# start agent 1
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# kill agent 0
CTRL+C (SIGINT) or kill -15 (SIGTERM)

# restart it
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
```

Reviewed By: cbalioglu

Differential Revision: D32129005

fbshipit-source-id: db292268250ef6f1e06f5b4c5bd67124d8dfd325
2021-11-05 12:18:46 -07:00
Yanli Zhao
adb85b32d3 minor fix for elastic doc (#64531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64531

fix #64530

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D30760879

fbshipit-source-id: 94ed1476e886513427d928a36f5be6b9bfff0826
2021-09-07 09:31:01 -07:00
Aliaksandr Ivanou
0c55f1bdec [torchelastic] Improve process termination logic (#61602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602

The diff introduces signal handlers and a `SignalException` that is raised when the agent process receives SIGTERM or SIGINT.

When either signal is received, the termination handler raises the `SignalException`, which is then processed by the main agent loop. The loop invokes `shutdown(signum)`, which propagates the received signal to the child processes. A default timeout of 30 seconds is introduced: if the child processes cannot terminate gracefully within this timeout, the agent process kills them via SIGKILL.

Test Plan: unittests, sandcastle

Reviewed By: cbalioglu

Differential Revision: D29671783

fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
2021-07-23 11:00:15 -07:00
Aliaksandr Ivanou
8f663170bd [17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687

The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env

The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0.

The diff modifies caffe2/rendezvous: if the worker processes are launched with the torchelastic agent, they will create a PrefixStore("worker/") from a TCPStore without a listener.

The diff adds macros functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.
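
One way such a macro could work is sketched below with `string.Template`. The placeholder spelling `${local_rank}` and the `substitute` function are assumptions made for illustration; the commit message only says that the macro helps resolve `local_rank`.

```python
from string import Template
from typing import Any, List

# Hypothetical placeholder; the real macro name may differ.
LOCAL_RANK_MACRO = "${local_rank}"


def substitute(args: List[Any], local_rank: str) -> List[Any]:
    """Replace the local_rank placeholder in every string argument."""
    resolved = []
    for arg in args:
        if isinstance(arg, str):
            # safe_substitute leaves unrelated placeholders untouched.
            resolved.append(Template(arg).safe_substitute(local_rank=local_rank))
        else:
            resolved.append(arg)
    return resolved


worker_args = substitute(["--local_rank", LOCAL_RANK_MACRO, 8], "3")
```

The launcher can then pass the same argument list to every worker and let each one receive its own rank.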

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27643206

fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
2021-04-14 19:33:26 -07:00
Ralf Gommers
48ddc9762b Upgrade mypy to version 0.812 (#55712)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211

This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.

I ignored one file, `torch/distributed/elastic/agent/server/api.py` that had ~8 errors that were hard to figure out. This can be done in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712

Reviewed By: walterddr

Differential Revision: D27694976

Pulled By: malfet

fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68
2021-04-12 18:08:28 -07:00
Aliaksandr Ivanou
77ccd4f9a3 [5/n][torch/elastic][upstream] Move torchelastic/agent to torch/distributed/elastic/agent (#54343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54343

Move torchelastic/agent to torch/distributed/elastic/agent

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...

Reviewed By: kiukchung, wilson100hong

Differential Revision: D27173271

fbshipit-source-id: 26761acc3f962af2afffcc3c7a237f3b6d65e531
2021-03-22 23:15:37 -07:00