pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Rohan Varma	7390c333d6	[CI] fix test_distributed for python 3.8+ (#36542) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36542 Python 3.8 set the default multiprocessing start mode to spawn, but we need fork in these tests, otherwise there are some pickling issues. Test: Ensure that these tests succeed when run with python 3.8 ghstack-source-id: 102093824 Test Plan: Ensure success with python 3.8 Differential Revision: D21007753 fbshipit-source-id: 4b39844c6ba76a53293c0dfde7c98ec5a78fe113	2020-04-14 11:38:33 -07:00
Rohan Varma	b553e6911a	[distributed] quicker exit in the case of failed tests in distributed (#34150) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34150 In the distributed setting we commonly have tests in which there are errors where one process exits but the others do not (since they are, for example, waiting for work from the process that exited). Currently, when this situation happens we do not handle it well, and wait for process 0 to time out. This results in wasted time waiting for test errors and a less helpful "Process 0 timed out..." error message when the error was actually something else. This diff fixes the issue by checking for exited subprocesses and terminating the test when we see a subprocess that has exited uncleanly. We still enforce timeouts and return when all processes have exited cleanly in the happy path. ghstack-source-id: 99921462 Test Plan: All distributed tests + tested by writing tests that should trigger the unclean subprocess detection, and verified that we exit quickly instead of waiting for the entire timeout. Differential Revision: D20231032 fbshipit-source-id: 3e0d4a20925b7d1098ec4c40ffcc66845425dd62	2020-03-11 11:27:17 -07:00
Jithun Nair	3c4cec56aa	Enable test_distributed for ROCm but only with nccl backend [REDUX] (#32551) Summary: This is a redux of the original PR https://github.com/pytorch/pytorch/issues/28814 which was reverted in PR https://github.com/pytorch/pytorch/issues/29736 because test_DistributedDataParallel was suspected of being flaky. Further investigation revealed it wasn't flakiness, but a bug in the PyTorch source code which has now been fixed in PR https://github.com/pytorch/pytorch/issues/32356. This PR is another attempt at enabling the test_distributed unit test suite only for the nccl backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/32551 Differential Revision: D19729966 Pulled By: bddppq fbshipit-source-id: 12a0d850991a903cc7723d63693b6157071d7115	2020-02-10 12:42:36 -08:00
Pritam Damania	f050b16dd9	Move pytorch distributed tests to separate folder for contbuild. (#30445) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445 Create distributed and rpc directories under caffe/test for better management of unit tests. Differential Revision: D18702786 fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606	2020-01-22 21:16:59 -08:00

4 Commits
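
The two most recent commits describe techniques that are easy to illustrate in isolation: requesting the fork start method explicitly rather than relying on the interpreter default (the Python documentation records the 3.8 default change to spawn for macOS), and polling subprocesses so that an unclean exit ends the test early instead of running out the full timeout. The following is a minimal sketch of that idea, not the code from either commit. The names `worker`, `run_workers`, `world_size`, `timeout`, and `poll_interval` are hypothetical, and the fork start method is assumed to be available (it is not on Windows).

```python
import multiprocessing as mp
import sys
import time


def worker(rank):
    # Hypothetical worker: rank 1 fails immediately, while the other ranks
    # block as if they were waiting for work from the process that exited.
    if rank == 1:
        sys.exit(3)
    time.sleep(30)


def run_workers(world_size=3, timeout=10.0, poll_interval=0.05):
    # Request "fork" explicitly instead of relying on the platform default.
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=worker, args=(rank,)) for rank in range(world_size)]
    for p in procs:
        p.start()

    result = ("timeout",)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        codes = [p.exitcode for p in procs]  # None means still running
        unclean = [(rank, c) for rank, c in enumerate(codes) if c not in (None, 0)]
        if unclean:
            # Stop as soon as any subprocess has exited with a non-zero code.
            rank, code = unclean[0]
            result = ("unclean_exit", rank, code)
            break
        if all(c == 0 for c in codes):
            # Happy path: every subprocess exited cleanly.
            result = ("ok",)
            break
        time.sleep(poll_interval)

    # Terminate any subprocess that is still running, then reap all of them.
    for p in procs:
        if p.is_alive():
            p.terminate()
    for p in procs:
        p.join()
    return result
```

Calling `run_workers()` with the worker above returns shortly after rank 1 exits, and the result names the failing rank and its exit code, which is more informative than a timeout reported against process 0.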