pytorch/torch/distributed
Pritam Damania 82dd01150c Fix race during RPC shutdown. (#36113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36113

As part of debugging https://github.com/pytorch/pytorch/issues/35863,
I discovered that the unit test would time out during clean shutdown.

Looking into this further, I found a race in
`_on_leader_follower_report_shutdown_intent` when multiple followers call
this method on the leader concurrently.

To fix this, I've added an appropriate lock in
`_on_leader_follower_report_shutdown_intent` to guard against concurrent access.
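
For illustration, here is a minimal sketch of the locking pattern the fix
applies. The class name (`_ShutdownLeader`), its attributes, and the
`expected_followers` parameter are hypothetical and do not reflect the actual
`torch.distributed.rpc` internals; see the diff for the real change.

```python
import threading


class _ShutdownLeader:
    def __init__(self, expected_followers):
        # Shared state mutated by concurrent RPC calls from followers.
        self._lock = threading.Lock()
        self._intent_worker_ids = set()
        self._expected_followers = expected_followers
        self._all_reported = threading.Event()

    def _on_leader_follower_report_shutdown_intent(self, worker_id):
        # Multiple followers may invoke this concurrently on the leader.
        # Without the lock, the set mutation and the size check below can
        # race, so the "all followers reported" event may never fire.
        with self._lock:
            self._intent_worker_ids.add(worker_id)
            if len(self._intent_worker_ids) == self._expected_followers:
                self._all_reported.set()
```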

I ran the test 500 times to validate that this fix works.

Closes #35863
ghstack-source-id: 101641463

Test Plan:
1) waitforbuildbot
2) Ran the test 500 times.

Differential Revision: D20884373

fbshipit-source-id: 9d580e9892adffc0c9a4c2e832881fb291a1ff16
2020-04-08 14:12:33 -07:00
autograd Fix dist autograd context Example block format (#34921) 2020-03-17 17:44:14 -07:00
optim Fix example block format in Distributed Optimizer API doc (#34919) 2020-03-17 17:44:09 -07:00
rpc Fix race during RPC shutdown. (#36113) 2020-04-08 14:12:33 -07:00
__init__.py Scope pybind11 functions to torch.distributed.{autograd,rpc} 2019-11-05 06:25:22 -08:00
constants.py Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434) 2020-02-19 17:17:17 -08:00
distributed_c10d.py add c10d dynamic loading mechanism and unit test (#28068) 2020-04-02 15:46:51 -07:00
launch.py Fix typos, via a Levenshtein-type corrector (#31523) 2020-01-17 16:03:19 -08:00
rendezvous.py Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434) 2020-02-19 17:17:17 -08:00