Summary:
Calling this function without an argument collects a stack trace at
import time. This is expensive; we can skip it by passing in a value.
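A minimal sketch of the pattern, for illustration only. The helper name `get_logger` and its signature are assumptions, not taken from this diff; the point is that the default path inspects the call stack, while an explicit argument avoids it.

```python
import inspect
import logging
from typing import Optional


def get_logger(name: Optional[str] = None) -> logging.Logger:
    # Hypothetical helper (name and signature are assumptions).
    # With no name given, the caller's module is derived from the call
    # stack; inspect.stack() is the expensive part.
    if name is None:
        caller = inspect.stack()[1]
        module = inspect.getmodule(caller.frame)
        name = module.__name__ if module else "__main__"
    return logging.getLogger(name)


# Passing an explicit name skips the stack inspection entirely.
logger = get_logger(__name__)
```

When such a call sits at module level, the stack walk runs on every import, which is why passing the value explicitly is the cheaper choice.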
Test Plan: Wait for tests
Differential Revision: D44244345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602
The diff introduces signal handlers and a `SignalException` that is raised when the agent process receives SIGTERM or SIGINT.
When either signal is received, the termination handler raises the `SignalException`. The exception is then processed by the main agent loop, which invokes `shutdown(signum)` to propagate the received signal to the child processes. A default timeout of 30 seconds is introduced: if the child processes do not terminate gracefully within this timeout, the agent process kills them via SIGKILL.
Test Plan: unittests, sandcastle
Reviewed By: cbalioglu
Differential Revision: D29671783
fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0.
The diff modifies caffe2/rendezvous: if the worker processes are launched with the torchelastic agent, they create a PrefixStore("worker/") from a TCPStore without a listener.
The diff adds macros functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.
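A rough sketch of how such a macro could be resolved, assuming a `${local_rank}` placeholder in string arguments. The constant `LOCAL_RANK_MACRO` and the function `substitute` are illustrative names, not necessarily the ones added by this diff.

```python
from string import Template
from typing import Any, List

# Hypothetical placeholder that users put into worker arguments.
LOCAL_RANK_MACRO = "${local_rank}"


def substitute(args: List[Any], local_rank: str) -> List[Any]:
    # Replace the placeholder in string arguments; leave other values
    # untouched. safe_substitute ignores unknown placeholders instead
    # of raising.
    resolved = []
    for arg in args:
        if isinstance(arg, str):
            resolved.append(
                Template(arg).safe_substitute(local_rank=local_rank)
            )
        else:
            resolved.append(arg)
    return resolved
```

For example, `substitute(["--local_rank=" + LOCAL_RANK_MACRO, 3], "2")` yields `["--local_rank=2", 3]`, so each worker receives its own rank without the user knowing it ahead of time.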
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211
This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.
I ignored one file, `torch/distributed/elastic/agent/server/api.py`, which had ~8 errors that were hard to figure out. This can be done in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712
Reviewed By: walterddr
Differential Revision: D27694976
Pulled By: malfet
fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68