pytorch/torch/distributed/elastic/multiprocessing
Yiwen Shi 3a9e33dca8 [torchelastic] Don't do signal handling when off the main thread (#135088)
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"

To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.

Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8

With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343

Differential Revision: D62166943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
2024-09-06 14:47:03 +00:00
..
errors [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) 2024-06-18 13:51:53 +00:00
subprocess_handler Fix stdout / stderr typing in SubprocessHandler (#132071) 2024-07-31 02:51:11 +00:00
__init__.py [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) 2024-06-18 13:51:53 +00:00
api.py [torchelastic] Don't do signal handling when off the main thread (#135088) 2024-09-06 14:47:03 +00:00
redirects.py [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) 2024-06-18 13:51:53 +00:00
tail_log.py [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) 2024-06-18 13:51:53 +00:00