pytorch/torch/testing
Pritam Damania f71a0daeb7 Use faulthandler to dump traceback of timed out processes in unit tests. (#54818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818

Several flaky tests fail due to some sort of timeout, and it isn't
clear from the error message in CI where exactly each process is stuck. In this
PR, I've added a mechanism to dump the entire Python traceback of all Python
threads when we encounter a timeout.

Example traceback:

```
Process 3 timed out with traceback:
Current thread 0x00007ff3363ff700 (most recent call first):
  File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
  File "threading.py", line 870 in run
  File "threading.py", line 932 in _bootstrap_inner
  File "threading.py", line 890 in _bootstrap

Thread 0x00007ff406132180 (most recent call first):
  File "torch/distributed/distributed_c10d.py", line 2477 in barrier
  File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
  File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
  File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
  File "torch/testing/_internal/common_distributed.py", line 409 in run_test
  File "torch/testing/_internal/common_distributed.py", line 393 in _run
  File "multiprocessing/process.py", line 108 in run
  File "multiprocessing/process.py", line 315 in _bootstrap
  File "multiprocessing/popen_fork.py", line 75 in _launch
  File "multiprocessing/popen_fork.py", line 19 in __init__
  File "multiprocessing/context.py", line 277 in _Popen
  File "multiprocessing/process.py", line 121 in start
```
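
For readers unfamiliar with `faulthandler`, the following is a minimal sketch of how the standard library can produce a per-thread dump like the one above. It is not the code from this PR: the names `stuck_worker`, `dump_all_thread_tracebacks`, and `demo` are hypothetical, and the sketch dumps the threads of the current process rather than those of a timed-out child process.

```python
import faulthandler
import tempfile
import threading


def stuck_worker(started: threading.Event, release: threading.Event) -> None:
    # Hypothetical stand-in for a test that blocks on a barrier or RPC.
    started.set()
    release.wait()


def dump_all_thread_tracebacks() -> str:
    """Return the Python traceback of every thread in this process."""
    # faulthandler writes through a raw file descriptor, so it needs a
    # real file object (one with fileno()) rather than io.StringIO.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


def demo() -> str:
    started = threading.Event()
    release = threading.Event()
    worker = threading.Thread(
        target=stuck_worker, args=(started, release), daemon=True
    )
    worker.start()
    started.wait()  # make sure the worker is inside stuck_worker
    try:
        # In a real harness this would run only after a timeout expired.
        return dump_all_thread_tracebacks()
    finally:
        release.set()  # unblock the worker so it can exit
        worker.join()


if __name__ == "__main__":
    print(demo())
```

The dump lists the calling thread first ("Current thread ...") followed by the other threads, each with its stack in most-recent-call-first order, which is the same layout as the example traceback above.
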
ghstack-source-id: 125323810

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27378764

fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e
2021-03-31 11:38:30 -07:00
_internal Use faulthandler to dump traceback of timed out processes in unit tests. (#54818) 2021-03-31 11:38:30 -07:00
__init__.py [testing] OpInfo for sgn and sign (#53885) 2021-03-22 09:39:40 -07:00
asserts.py initial draft for assert_tensors_(equal|allclose) in torch.testing (#53820) 2021-03-18 20:32:03 -07:00
check_kernel_launches.py Fix check_kernel_launches.py for macros and provide extended context (#49365) 2020-12-14 22:09:33 -08:00