Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57296
It seems many trainers disable print(), so we cannot see the thread dumps from CompleteInTimeOrDie(). Emit them via log.info() as well.
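Roughly what the dump path looks like (a minimal sketch with illustrative names, not the exact caffe2 code): each thread's stack goes out through both channels, so the dump survives even when stdout is silenced.
```
import logging
import sys
import traceback

log = logging.getLogger(__name__)


def dump_all_thread_stacks():
    for thread_id, frame in sys._current_frames().items():
        stack = "".join(traceback.format_stack(frame))
        msg = "Thread %d:\n%s" % (thread_id, stack)
        print(msg)     # may be a no-op if the trainer disabled print()...
        log.info(msg)  # ...so also emit through the logging framework
```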
Test Plan: sandcastle
Reviewed By: aalmah
Differential Revision: D28098738
fbshipit-source-id: dfdca8801bacf5c7bccecc2387cb7ef41dadfa46
Summary:
The `2to3` tool has a `future` fixer that removes these redundant `from __future__` imports; the `caffe2` directory has the most of them:
```
2to3 -f future -w caffe2
```
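For context, the `future` fixer only strips `from __future__` imports, which are no-ops on Python 3 (illustrative snippet, not any specific caffe2 file):
```
# Typical line the `future` fixer deletes -- it has no effect on Python 3:
from __future__ import absolute_import, division, print_function, unicode_literals

# After running `2to3 -f future -w caffe2` the line above is simply removed;
# the rest of the module is left untouched.
```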
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary: CompleteInTimeOrDie was added to detect deadlocks and proactively exit. In addition, call os.abort() to generate a core dump so that the error is actionable.
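A minimal sketch of the mechanism, with illustrative names rather than the actual timeout_guard source: a daemon watchdog thread waits for the deadline and aborts the process if the work has not finished, so SIGABRT leaves a core dump behind.
```
import os
import threading


def complete_in_time_or_die(timeout_sec):
    finished = threading.Event()

    def watchdog():
        if not finished.wait(timeout_sec):
            # Deadline passed without completion: abort so SIGABRT produces
            # a core dump (subject to the core-size ulimit).
            os.abort()

    threading.Thread(target=watchdog, daemon=True).start()
    return finished  # the caller sets this event once the work completes
```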
Reviewed By: bmaurer
Differential Revision: D6938343
fbshipit-source-id: 8bd36da4f4bb1195bd3398f25d133a6ebf1c66ad
Summary: vigneshr has been randomly seeing processes that never exit at the end. We don't know what causes this, so this helps in two ways: (1) putting timeout_guard.EuthanizeIfNecessary(600) at the end of the operator ensures the process is killed within 10 minutes, allowing for a retry; (2) the kill causes Python stack traces to be dumped, which helps debug the real issue.
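A hedged usage sketch of (1); `timeout_guard.EuthanizeIfNecessary(600)` is the call named above, while the surrounding function and comments are illustrative assumptions:
```
from caffe2.python import timeout_guard


def run_operator_body():
    # ... the operator's actual work would happen here ...
    # If the process has not exited within 600 s (10 min) of this call,
    # it gets killed, and Python stack traces are dumped on the way out.
    timeout_guard.EuthanizeIfNecessary(600)
```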
Differential Revision: D4635781
fbshipit-source-id: b558418c80671c00effdd514e4ddc01e935c95df
Summary: See http://bugs.python.org/issue6721. Since everstore loaders use ProcessPoolExecutor, which is based on forks, and there was perhaps an update to numpy or some unrelated library, we started getting subprocesses stuck at np.random.randint(). Also changed logging calls to prints, since logging is known to have issues with multiprocessing. See https://www.prod.facebook.com/groups/fbpython/permalink/1438647216176641/
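For reference, a minimal repro of the issue6721 failure mode (illustrative, not the everstore loader code): a lock held by another thread at fork() time is inherited by the child in the locked state, so the child hangs the first time it tries to take it; the logging module keeps such internal locks, which is why switching to print() helps.
```
import os
import threading
import time

lock = threading.Lock()


def holder():
    with lock:
        time.sleep(5)  # keep the lock held across the fork below


threading.Thread(target=holder, daemon=True).start()
time.sleep(0.5)  # make sure the holder thread owns the lock

pid = os.fork()  # POSIX only, mirroring the fork-based ProcessPoolExecutor
if pid == 0:
    # Child: the lock was inherited in the locked state, but the thread that
    # held it does not exist here, so a plain acquire() would block forever.
    if not lock.acquire(timeout=2):
        print("child: deadlocked on a lock inherited across fork")
    os._exit(0)
else:
    os.waitpid(pid, 0)
```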
Differential Revision: D4633725
fbshipit-source-id: ae948a1827c71a3a2119d6a3248706728984df31
Summary:
This at least partly fixes a recurring problem when using everstore data input (or any other data input with multiprocessing): if the main process dies violently, the child processes are not killed. One cause was the TimeoutGuard(), which called os._exit(1) and thereby prevented any cleanup from happening. I changed it to send SIGINT to the PID and, if the process is still alive after 10 seconds, to call os._exit(1). In my tests, this works well.
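A minimal sketch of the new shutdown sequence, with illustrative names rather than the actual TimeoutGuard code: arm the hard-exit fallback first, then deliver SIGINT so normal cleanup gets a chance to run.
```
import os
import signal
import threading


def soft_then_hard_exit(grace_period_sec=10):
    # Arm the hard-exit fallback first: if the process is still alive once
    # the grace period expires, give up on cleanup and exit like before.
    killer = threading.Timer(grace_period_sec, lambda: os._exit(1))
    killer.daemon = True
    killer.start()

    # Then ask nicely: SIGINT raises KeyboardInterrupt in the main thread,
    # letting atexit handlers and multiprocessing cleanup run.
    os.kill(os.getpid(), signal.SIGINT)
```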
Did some other cleanup:
- improved logging of inputs/sec in data_workers
- removed redundant atexit() handling as the multiprocessing pool does it itself
Differential Revision: D4602550
fbshipit-source-id: 64d4526a2a3625d163d23f078286e719d56998f4