pytorch/torch/distributed
Pritam Damania 82dd01150c Fix race during RPC shutdown. (#36113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36113

As part of debugging https://github.com/pytorch/pytorch/issues/35863,
I discovered that the unit test would time out during clean shutdown.

Looking into this further, I found a race in
`_on_leader_follower_report_shutdown_intent` when multiple followers call
this method on the leader concurrently.

To fix this, I've added an appropriate lock in
`_on_leader_follower_report_shutdown_intent` to guard against concurrent access.
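
For illustration, here is a minimal sketch of the locking pattern the fix
applies. The class name (`_ShutdownLeader`), its attributes, and the
`expected_followers` parameter are hypothetical and do not reflect the actual
`torch.distributed.rpc` internals; see the diff for the real change.

```python
import threading


class _ShutdownLeader:
    def __init__(self, expected_followers):
        # Shared state mutated by concurrent RPC calls from followers.
        self._lock = threading.Lock()
        self._intent_worker_ids = set()
        self._expected_followers = expected_followers
        self._all_reported = threading.Event()

    def _on_leader_follower_report_shutdown_intent(self, worker_id):
        # Multiple followers may invoke this concurrently on the leader.
        # Without the lock, the set mutation and the size check below can
        # race, so the "all followers reported" event may never fire.
        with self._lock:
            self._intent_worker_ids.add(worker_id)
            if len(self._intent_worker_ids) == self._expected_followers:
                self._all_reported.set()
```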

I ran the test 500 times to validate that this fix works.

Closes #35863
ghstack-source-id: 101641463

Test Plan:
1) waitforbuildbot
2) Ran the test 500 times.

Differential Revision: D20884373

fbshipit-source-id: 9d580e9892adffc0c9a4c2e832881fb291a1ff16
2020-04-08 14:12:33 -07:00
autograd Fix dist autograd context Example block format (#34921) 2020-03-17 17:44:14 -07:00
optim Fix example block format in Distributed Optimizer API doc (#34919) 2020-03-17 17:44:09 -07:00
rpc Fix race during RPC shutdown. (#36113) 2020-04-08 14:12:33 -07:00
__init__.py Scope pybind11 functions to torch.distributed.{autograd,rpc} 2019-11-05 06:25:22 -08:00
constants.py Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434) 2020-02-19 17:17:17 -08:00
distributed_c10d.py add c10d dynamic loading mechanism and unit test (#28068) 2020-04-02 15:46:51 -07:00
launch.py Fix typos, via a Levenshtein-type corrector (#31523) 2020-01-17 16:03:19 -08:00
rendezvous.py Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434) 2020-02-19 17:17:17 -08:00