Commit Graph

4 Commits

Author SHA1 Message Date
Aliaksandr Ivanou
0c55f1bdec [torchelastic] Improve process termination logic (#61602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602

The diff introduces signal handlers and SignalException that is raised when the agent process receives SIGTERM or SIGINT.

When any of these signals received, the termination handler will raise the `SignalException`. The exception will then be processed by the main agent loop. The `shutdown(signum)` will be invoked, that would propagate the received signal to the child processes. The default 30 seconds timeout introduced: if child processes will not be able gracefully terminate during this timeout, the agent process would kill the processes via SIGKILL.

Test Plan: unittests, sandcastle

Reviewed By: cbalioglu

Differential Revision: D29671783

fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
2021-07-23 11:00:15 -07:00
Aliaksandr Ivanou
8f663170bd [17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687

The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env

The diff implement StaticTCPRendezvous that creates a store with listener on agent rank #0

The diff modifies caffe2/rendezvous: If the worker process launched with torchelastic agent, the worker processes will create a PrefixStore("worker/") from TCPStore without listener.

The diff adds macros functionality to torch/distributed/ealstic/utils that helps to resolve local_rank parameter.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27643206

fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
2021-04-14 19:33:26 -07:00
Ralf Gommers
48ddc9762b Upgrade mypy to version 0.812 (#55712)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211

This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.

I ignored one file, `torch/distributed/elastic/agent/server/api.py` that had ~8 errors that were hard to figure out. This can be done in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712

Reviewed By: walterddr

Differential Revision: D27694976

Pulled By: malfet

fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68
2021-04-12 18:08:28 -07:00
Aliaksandr Ivanou
77ccd4f9a3 [5/n][torch/elastic][upstream] Move torchelastic/agent to torch/distributed/elastic/agent (#54343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54343

Move torchelastic/agent to torch/distributed/elastic/agent

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
      buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...

Reviewed By: kiukchung, wilson100hong

Differential Revision: D27173271

fbshipit-source-id: 26761acc3f962af2afffcc3c7a237f3b6d65e531
2021-03-22 23:15:37 -07:00