pytorch/test/distributed/rpc
Pritam Damania 54c05fa34e Add basic GPU support to distributed autograd. (#40312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40312

As part of https://github.com/pytorch/pytorch/issues/40255, we
realized that GPU support for distributed autograd was broken by our
multithreaded autograd change.

To fix this in the short term for 1.6, this PR includes the following changes:

1) Add a long-lived CPU thread to DistEngine to execute GPU->CPU continuations
in the autograd graph.
2) The long-lived CPU thread has its own ready_queue, and this queue is used
for all GraphTasks created by DistEngine.
3) In thread_main(), the long-lived CPU thread added in 1) must not exit once
a GraphTask is done processing.
4) To resolve this, thread_main() now takes a `device_thread` parameter instead
of `reentrant_thread`. When device_thread is True, the caller is expected to be
a long-lived device thread that does not exit.
5) When device_thread is False, thread_main() is expected to run a GraphTask
and return once done.
ghstack-source-id: 106391329

Test Plan: waitforbuildbot

Differential Revision: D22146183

fbshipit-source-id: dd146b7a95f55db75f6767889b7255e9d62d5825
2020-06-23 07:49:00 -07:00
faulty_agent [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636) 2020-03-20 20:07:47 -07:00
jit Support rpc_async call with timeout in JIT (#37884) 2020-05-14 12:44:26 -07:00
tensorpipe Add basic GPU support to distributed autograd. (#40312) 2020-06-23 07:49:00 -07:00
test_dist_autograd_spawn.py Add missing test launchers for JitRpcTest and JitDistAutogradTest (#32891) 2020-02-24 21:42:47 -08:00
test_dist_optimizer_spawn.py Move pytorch distributed tests to separate folder for contbuild. (#30445) 2020-01-22 21:16:59 -08:00
test_rpc_spawn.py Add missing test launchers for JitRpcTest and JitDistAutogradTest (#32891) 2020-02-24 21:42:47 -08:00