pytorch/torch/testing
Luca Wehrstedt f58cc4b444 [RPC] Fix flaky test by waiting for async rref calls (#39012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012

The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.

On line 1826, the test cleared a global variable on the remote side that was holding a rref. Although the RPC call that unset the global variable was synchronous, the messages that the rref context sends around to delete that rref are asynchronous. As a result, when we reached line 1845, the following check sometimes failed:
```
        self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.

The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.
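
As a rough illustration of that "wait for the pending futures, then barrier" pattern, here is a minimal standalone sketch that uses plain Python threads instead of the RPC framework. Everything in it (`run_worker`, `delete_async`, the `owner_rrefs` set, the worker count) is a hypothetical stand-in and not the actual test or rref context code; it only shows why the count is safe to read after both synchronization steps.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

NUM_WORKERS = 3  # assumed worker count, chosen only for this illustration


def run_worker(rank, barrier, pool, results):
    # Hypothetical stand-in for the rrefs a worker owns.
    owner_rrefs = {"a", "b", "c"}
    lock = threading.Lock()

    def delete_async(name):
        # Simulates the asynchronous deletion message arriving late.
        time.sleep(0.01 * (rank + 1))
        with lock:
            owner_rrefs.discard(name)

    # The "synchronous" call returns here, but the deletion is still pending.
    pending = [pool.submit(delete_async, "c")]

    # Step 1: wait for this worker's own pending futures to complete.
    wait(pending)
    # Step 2: barrier, so no worker proceeds until every worker has done step 1.
    barrier.wait()

    # Only now is it safe to inspect the count.
    results[rank] = len(owner_rrefs)


def main():
    barrier = threading.Barrier(NUM_WORKERS)
    results = {}
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        threads = [
            threading.Thread(target=run_worker, args=(rank, barrier, pool, results))
            for rank in range(NUM_WORKERS)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results


if __name__ == "__main__":
    print(main())
```

Without the `wait(pending)` call, a worker could read the count while the deletion is still in flight, which mirrors the stale `num_owner_rrefs` value described above.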
ghstack-source-id: 104810738

Test Plan: The test isn't flaky anymore.

Differential Revision: D21716070

fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
2020-05-28 10:48:34 -07:00
..
_internal [RPC] Fix flaky test by waiting for async rref calls (#39012) 2020-05-28 10:48:34 -07:00
__init__.py Updates assertEqual to use torch.isclose-like logic (#37294) 2020-05-15 16:24:03 -07:00