pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Luca Wehrstedt	f083cea227	[RPC tests] Fix file descriptor leak (#40913 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40913 Summary of the entire stack: -- This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems: - Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one. - These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given. - Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents. - Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`). - There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS. - Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts. - There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out). - All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste. This refactoring aims to address these problems by: - Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite. - Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to. - Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here). It provides further advantages: - It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe. - It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ... Summary of this commit -- Once we start merging multiple test suites in a single file (which we'll happen in the next diffs in the stack) the OSX tests on CircleCI start failing due to "too many open files". This indicates a file descriptor leak. I then managed to repro it on Linux too by lowering the limit on open file descriptors (`ulimit -n 500`). Each test method that unittest runs is run on a new instance of the Testcase class. With our multiprocessing wrappers, this instance contains a list of child processes. Even after these processes are terminated, it appears they still hold some open file descriptor (for example a pipe to communicate with the subprocess). It also appears unittest is keeping these Testcase instances alive until the entire suite completes, which I suspect is what leads to this "leak" of file descriptors. Based on that guess, in this diff I am resetting the list of subprocesses during shutdown, and this seems to fix the problem. ghstack-source-id: 107045908 Test Plan: Sandcastle and CircleCI Differential Revision: D22356784 fbshipit-source-id: c93bb9db60fde72cae0b0c735a50c17e427580a6	2020-07-03 06:22:40 -07:00
Mike Ruberry	13120bf677	Updates assertEqual to require atol and rtol, removes positional atol (#38872 ) Summary: This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replace with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument. In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872 Differential Revision: D21740237 Pulled By: mruberry fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042	2020-05-27 06:31:07 -07:00
Rohan Varma	63e545e0fe	Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol Test Plan: revert-hammer Differential Revision: D21717199 Original commit changeset: 9feb856f94ee fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259	2020-05-26 18:23:59 -07:00
Mike Ruberry	6ddca30b2d	Updates assertEqual to require atol and rtol, removes positional atol (#38872 ) Summary: This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replace with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument. In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872 Differential Revision: D21717199 Pulled By: mruberry fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a	2020-05-26 08:30:23 -07:00
Rohan Varma	906c50eb69	Remove dead code in ddp.{h, cpp} (#37990 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37990 The code in `ddp.{h, cpp}` and the corresponding pybind implementations are no longer used. The pybinded calls were all private APIs and only ran in unittests, so we should remove these unused APIs. https://github.com/pytorch/pytorch/pull/20234 from a year ago also mentioned that we should delete `_dist_broadcast_coalesced` Verified that all tests pass with cuda by running `test_c10d` on a gpu-enabled machine. ghstack-source-id: 103885383 Test Plan: CI Differential Revision: D21443879 fbshipit-source-id: 764d8681ca629056bfe2c260ffab47fa5bdf07ff	2020-05-12 12:41:09 -07:00
Rohan Varma	4501083306	dedupe test skipping in common_distributed and test_distributed (#38078 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38078 `common_distributed` and `test_distributed` have some error codes that overlap but are for different reasons, for example, code 75 in `test_distributed` is "no cuda available" but in common_distributed it is "need at least 2 CUDA devices". This is an issue because the tests in `test_distributed` now use the utils in `common_distributed`, so we could get the wrong reason for skipping tests. It is also the source of test failures in https://github.com/pytorch/pytorch/pull/37990. This diff makes it so that the test skipping logic is deduped and put into `common_distributed.py`, where it can be reused and then imported into `test_distributed` ghstack-source-id: 103782583 Test Plan: CI Differential Revision: D21466768 fbshipit-source-id: 53b5af36672ebd8b51ba8b42709d87e96cadef20	2020-05-08 23:19:26 -07:00
Rohan Varma	57dc4cd0f8	[MultiProcessTestCase] Improve the error message when a process terminates (#37627 ) Summary: When a subprocess terminates with an exception in a distributed test, log the process number as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/37627 Differential Revision: D21366149 Pulled By: rohan-varma fbshipit-source-id: 132c4b4c1eb336761c2be26d034d8b739ae19691	2020-05-04 17:46:36 -07:00
Shen Li	989341c0c6	Add comments to explain how MultiProcessTestCase works (#37179 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37179 Test Plan: Imported from OSS Differential Revision: D21211196 Pulled By: mrshenli fbshipit-source-id: 86dc8183b4def7e95a236f6c5c73ef67466f4ddb	2020-04-23 14:17:12 -07:00
Nikita Shulga	9763db3031	`MultiProcessTestCase` to use instance rather than class method wrappers (#36826 ) Summary: This makes its wrappers stackable with `common_utils.TestCase` ones Pull Request resolved: https://github.com/pytorch/pytorch/pull/36826 Test Plan: CI Differential Revision: D21178217 Pulled By: mrshenli fbshipit-source-id: f80dd4aa175e20bd338b38b2c42c3118258f45dc	2020-04-23 08:40:02 -07:00
David Reiss	e75fb4356b	Remove (most) Python 2 support from Python code (#35615 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615 Python 2 has reached end-of-life and is no longer supported by PyTorch. Now we can clean up a lot of cruft that we put in place to support it. These changes were all done manually, and I skipped anything that seemed like it would take more than a few seconds, so I think it makes sense to review it manually as well (though using side-by-side view and ignoring whitespace change might be helpful). Test Plan: CI Differential Revision: D20842886 Pulled By: dreiss fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed	2020-04-22 09:23:14 -07:00
Nikita Shulga	3b832ee2bf	Use Python3 `super()` throughout `torch.testing.` (#37024 ) Summary: Hattip to ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/37024 Differential Revision: D21173244 Pulled By: malfet fbshipit-source-id: 7079703e28777d873f69bf9fd4dcbad8d53a2682	2020-04-22 09:00:28 -07:00
Rohan Varma	4efef475d7	[WIP] make test_distributed gloo test use MultiProcessTestCase (#36970 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36970 We would like to move all distributed testing to use the existing multiprocessing tooling defined in common_distributed.py. With this change, we make `TestDistBackend` inherit from `MultiProcessTestCase` and enable fork mode multiprocessing. In the next step, we can enable spawn mode for these tests which will give us TSAN coverage. ghstack-source-id: 102553801 Test Plan: Unittests Differential Revision: D21146947 fbshipit-source-id: 608fa2cb93e88f8de6a5ac87c523e2c4e4fede1b	2020-04-21 13:13:48 -07:00
Rohan Varma	7390c333d6	[CI] fix test_distributed for python 3.8+ (#36542 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36542 Python 3.8 set the default multiprocessing start mode to spawn, but we need fork in these tests, otherwise there are some pickling issues. Test: Ensure that these tests succeed when run with python 3.8 ghstack-source-id: 102093824 Test Plan: Ensure success with python 3.8 Differential Revision: D21007753 fbshipit-source-id: 4b39844c6ba76a53293c0dfde7c98ec5a78fe113	2020-04-14 11:38:33 -07:00
Rohan Varma	b553e6911a	[distributed] quicker exit in the case of failed tests in distributed (#34150 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34150 In the distributed setting we commonly have tests in which there are errors where one process exits but the other do not (since they are for example waiting for work from the process that exited). Currently, when this situation happens we do not handle this well, and wait for process 0 to timeout. This results in wasted time waiting for test errors and a less helpful "Process 0 timed out..." error message when the error was actually something else. This diff fixes the issue by checking for exited subprocesses and terminating the test when we see a subprocess that has exited uncleanly. We still enforce timeouts and return when all processes have exited cleantly in the happy path. ghstack-source-id: 99921462 Test Plan: All distributed tests + tested by writing tests that should trigger the unclean subprocess detection, and verified that we exit quickly instead of waiting for the entire timeout. Differential Revision: D20231032 fbshipit-source-id: 3e0d4a20925b7d1098ec4c40ffcc66845425dd62	2020-03-11 11:27:17 -07:00
Jithun Nair	3c4cec56aa	Enable test_distributed for ROCm but only with nccl backend [REDUX] (#32551 ) Summary: This is a redux of the original PR https://github.com/pytorch/pytorch/issues/28814 which was reverted in PR https://github.com/pytorch/pytorch/issues/29736 due to test_DistributedDataParallel being suspected as being flaky. Further investigation revealed it wasn't flakiness, but a bug in the PyTorch source code which has been now fixed in PR https://github.com/pytorch/pytorch/issues/32356. This PR is another attempt at enabling the test_distributed unit test suite only for the nccl backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/32551 Differential Revision: D19729966 Pulled By: bddppq fbshipit-source-id: 12a0d850991a903cc7723d63693b6157071d7115	2020-02-10 12:42:36 -08:00
Pritam Damania	f050b16dd9	Move pytorch distributed tests to separate folder for contbuild. (#30445 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445 Create distributed and rpc directories under caffe/test for better management of unit tests. Differential Revision: D18702786 fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606	2020-01-22 21:16:59 -08:00

16 Commits