Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029
This PR implements scatter and adds it to ProcessGroupNCCL.
NCCL doesn't directly provide a scatter primitive, so it has to be implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened, then outputTensors and inputFlattened are passed by the collective class to the scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank uses a for loop to send (distribute) the inputTensors to each rank, and all ranks receive their inputTensor from the root rank.
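For illustration, a minimal Python sketch of the same scatter-on-top-of-send/recv pattern, written against torch.distributed point-to-point ops rather than the actual ncclSend/ncclRecv calls in nccl.cpp; the helper name and the root's local copy are assumptions, not the PR's code:
```
import torch.distributed as dist

def scatter_via_send_recv(output, inputs, root, group=None):
    # Root distributes inputs[i] to rank i; every other rank receives its tensor.
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    if rank == root:
        for dst in range(world_size):
            if dst == root:
                output.copy_(inputs[dst])        # local copy, no network traffic
            else:
                dist.send(inputs[dst], dst, group=group)
    else:
        dist.recv(output, src=root, group=group)
```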
ghstack-source-id: 147754837
Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks
Reviewed By: pritamdamania87
Differential Revision: D33154823
fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745
This PR implements gather and adds it to ProcessGroupNCCL using NCCL's send/recv API.
NCCL doesn't directly provide a gather primitive, so it has to be implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to the gather() function in nccl.cpp.
2. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all ranks send their inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors.
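And the mirror-image pattern for gather, again as a hedged Python sketch using torch.distributed point-to-point ops rather than the C++ implementation in nccl.cpp:
```
import torch.distributed as dist

def gather_via_send_recv(input, outputs, root, group=None):
    # Every rank sends its input to the root; the root receives into outputs[i].
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    if rank == root:
        for src in range(world_size):
            if src == root:
                outputs[src].copy_(input)        # local copy for the root's own slice
            else:
                dist.recv(outputs[src], src=src, group=group)
    else:
        dist.send(input, dst=root, group=group)
```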
ghstack-source-id: 147754838
Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress
Reviewed By: pritamdamania87
Differential Revision: D29616361
fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70628
The event_listener thread is used to log process tracebacks when a timed
out process sends it a request for its traceback. However, this thread is
created in the `_run` function, which is overridden by some classes such as
`TestDistBackend`, so those tests did not have this feature. Move the
event_listener setup logic to `run_test`, which is called by all distributed
test classes, enabling it for all distributed tests. Also modify the logger
setup to ensure that logging.info calls are printed in the subprocess.
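A hedged sketch of the intended shape (argument and attribute names assumed; the real run_test in common_distributed.py has more plumbing):
```
import logging
import threading

def run_test(self, test_name, parent_pipe):
    # Configure logging in the subprocess so logging.info output is visible.
    logging.basicConfig(level=logging.INFO)
    # Start the traceback listener here rather than in _run, which some
    # subclasses (e.g. TestDistBackend) override; shutdown handling omitted.
    listener = threading.Thread(target=self._event_listener, args=(parent_pipe,))
    listener.start()
    getattr(self, test_name)()
```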
ghstack-source-id: 146714642
Test Plan: CI
Reviewed By: jaceyca, fduwjj
Differential Revision: D33410613
fbshipit-source-id: aa616d69d251bc9d04e45781c501d2244f011843
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67256
To control which tests can be run in various cases, the check logic should be moved into functions and variables that can be changed.
One challenge here is that decorators are not dynamic: if something is read in at import time and changed afterwards, the change has no effect. This means we need to separate out the variables that need to be changed for our use case.
Those are put into common_distributed.py and can be changed before importing the distributed_test.py code.
The use case is to add new backends to the tests and split them into tests that can be run on demand as a separate instance. To do so, you would import DistTestSkipCases into a launcher or a setup script, change it there, and then load distributed_test (see the sketch below).
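A toy illustration of the import-time problem and the fix; none of these names are the real ones (DistTestSkipCases itself lives in common_distributed.py). The point is that the skip condition must be read when the test runs, not captured at import:
```
import functools
import unittest

SKIPPED_BACKENDS = {"some_backend"}   # stand-in for the configurable skip cases

def skip_if_backend_disabled(backend):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Read the variable lazily, so a launcher that mutates it after
            # import (but before the test runs) still takes effect.
            if backend in SKIPPED_BACKENDS:
                raise unittest.SkipTest(f"{backend} is disabled in this run")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```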
Test Plan: Check the signals
Reviewed By: mrshenli
Differential Revision: D31906947
fbshipit-source-id: 45e3258c55f4dc34e12a468bed65280f4c25748f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.
Instead, check whether it is a wrapped PG, and then check the wrapped pg to see if it's NCCL.
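A hedged sketch of that check; the `wrapped_pg` attribute and helper name are assumptions based on this description rather than the exact code in distributed_c10d.py:
```
import torch.distributed as dist

def _is_nccl_backend(pg):
    # DETAIL debug mode wraps the real process group; unwrap it (attribute
    # name assumed here) before checking whether the inner backend is NCCL.
    while hasattr(pg, "wrapped_pg"):
        pg = pg.wrapped_pg
    return dist.is_nccl_available() and isinstance(pg, dist.ProcessGroupNCCL)
```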
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67050
This PR moves init_multi_gpu_helper to common_distributed so that it could be shared by different distributed tests.
ghstack-source-id: 141370119
Test Plan: wait for ci.
Reviewed By: mrshenli
Differential Revision: D31842644
fbshipit-source-id: c7bad25d6cef9bdce7ad1fb6c60c1cad4b765702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772
Looking at some workloads, it would be useful to have this info.
ghstack-source-id: 140555200
Test Plan: CI
Reviewed By: zhaojuanmao, wayi1
Differential Revision: D31224417
fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65391
TSAN tests are much slower than the usual dev/opt modes, by about 5-10x.
As a result, for TSAN build mode we use a much higher timeout for distributed
tests.
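An illustrative sketch of the kind of change involved (the constants and environment variable check are assumptions, not the exact values in common_distributed.py):
```
import os

DEFAULT_TIMEOUT = 300        # seconds per distributed test
TSAN_MULTIPLIER = 10         # assumed factor covering the 5-10x slowdown

def distributed_test_timeout():
    running_with_tsan = os.environ.get("PYTORCH_TEST_WITH_TSAN") == "1"
    return DEFAULT_TIMEOUT * (TSAN_MULTIPLIER if running_with_tsan else 1)
```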
ghstack-source-id: 138584613
Test Plan: waitforbuildbot
Reviewed By: cbalioglu
Differential Revision: D31076575
fbshipit-source-id: 44a485f07101deac536470ceeff2a52cac4f9e0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63711
This removes `_fork_process` from common_distributed.py and fixes all
other call sites to use `spawn_process` instead.
ghstack-source-id: 136395719
Test Plan: waitforbuildbot
Reviewed By: xush6528
Differential Revision: D30463834
fbshipit-source-id: 0c09e8a996d0e5b912c8cdd45488a39951bac4db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63361
Python multiprocessing doesn't support LSAN and causes false positives
instead. As a result, disable LSAN for these tests so that we can still run
them with opt-asan.
ghstack-source-id: 135962489
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D30352269
fbshipit-source-id: f6ab5abce7bdef00cd5e1f5977424d2b151174af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051
The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.
This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id: 135523770
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29854514
fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61991
Continuation of https://github.com/pytorch/pytorch/pull/61887 and
removing unittest.skip as much as possible.
ghstack-source-id: 134759368
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29831860
fbshipit-source-id: fe57a7d56d4423924a2dec10bb670137ace0c9a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests just get reported as passed on Sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.
The overall goal is to avoid using skips, since Sandcastle tags these tests as
continuously skipping.
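A hedged sketch of such a decorator; the environment check is an assumption (compare the `SANDCASTLE=1` Test Plan of D29779699 below):
```
import functools
import os
import unittest

def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if os.environ.get("SANDCASTLE") == "1":
                    # On Sandcastle, return instead of skipping so the test is
                    # reported as passed rather than continuously skipping.
                    print(f"Passing {fn.__name__} on Sandcastle: {reason}")
                    return None
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```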
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61876
In the sandcastle environment, avoid skipping tests and instead just
"pass" these tests to avoid a large number of tasks being created which are not
actionable.
ghstack-source-id: 133846232
Test Plan: Test with `SANDCASTLE=1 TW_JOB_USER=sandcastle`
Reviewed By: rohan-varma
Differential Revision: D29779699
fbshipit-source-id: add71008830dfa6f456ce2365a2d70436b7b7a31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61409
We used a multiprocessing.Manager in order to share TEST_SKIPS between the parent and the child processes. TEST_SKIPS is a global variable that defines a unique error code for each "error type", so that the parent can figure out the reason a child exited.

While originally this mapping was immutable, at some point we allowed children to modify the parent's value of that mapping so they could update the message for the `multi-gpu` error to make it reflect how many GPUs were really needed. This occurred in D23285790 (2a4d312027). Since then this Manager has proved to be quite problematic, especially around thread safety, races, TSAN, ... (see D22753459 (f0c46878c6), D23641618 (567c51cce9), D28490129, D28794321 (0128eb9a85) and D29585862).

This seems like an awful lot of trouble for such a small piece of functionality. Here I propose we drop the Manager and instead get the same result by using separate error codes for each number of GPUs. It should be much simpler and thus more robust.
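A sketch of the proposed shape; the TestSkip structure and exit codes here are placeholders rather than the actual values in common_distributed.py:
```
from typing import NamedTuple

class TestSkip(NamedTuple):
    exit_code: int
    message: str

# One immutable entry -- and hence one distinct child exit code -- per GPU
# count, instead of mutating a shared "multi-gpu" entry through a Manager.
TEST_SKIPS = {
    f"multi-gpu-{n}": TestSkip(80 + n, f"Need at least {n} CUDA devices")
    for n in range(2, 9)
}
```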
ghstack-source-id: 133236447
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29612614
fbshipit-source-id: 8ad0fedcb7796e5832a0eb196f8fdc147e02b3df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59238
Creating a `multiprocessing.Manager()` launches a new process using the `fork` method (because it's the default one), and then in that subprocess it launches a new thread. TSAN really doesn't like this (and rightly so!) because we already had threads in the parent process, and intermixing threads and forks is dangerous. The proper way to deal with this is to `exec` inside the child process or, in other words, use the `spawn` method.
Note that the method used to launch the Manager is entirely unrelated to the method used to launch our "own" subprocesses, hence we were using `fork` for the Manager even though we were using `spawn` for our own subprocesses.
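Presumably the fix amounts to creating the Manager from an explicit spawn context, along these lines (a sketch, not the exact diff):
```
import multiprocessing

# Use a "spawn" context so the Manager's helper process execs a fresh
# interpreter instead of forking an already-threaded parent.
manager = multiprocessing.get_context("spawn").Manager()
shared_skips = manager.dict()
```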
ghstack-source-id: 130240724
Test Plan: Reverted the silencing introduced in D28490129, ran the `test_init_rpc_then_pg` test from the TensorPipe suite and saw the original TSAN failure. Then applied my fix, re-ran the test, and the failure was gone.
Reviewed By: zhaojuanmao
Differential Revision: D28794321
fbshipit-source-id: 12242e69be399a7f02a40a0ebb3d92f92e00ce73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60917
The second line of the error log didn't use an f-string properly.
Before fix:
```
exiting process with exit code: {MultiProcessTestCase.TEST_ERROR_EXIT_CODE}
```
After fix:
```
exiting process 3 with exit code: 10
```
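In other words, the second log line was simply missing the `f` prefix; a minimal reproduction of the before/after behavior:
```
rank, exit_code = 3, 10
print("exiting process {rank} with exit code: {exit_code}")    # before: braces printed literally
print(f"exiting process {rank} with exit code: {exit_code}")   # after: values interpolated
```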
ghstack-source-id: 132618199
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D29446574
fbshipit-source-id: f806ef0470cb6aa86fe3c404e1c895514abb6488
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60167
We were getting errors such as this on Windows in our c10d ProcessGroup test suite:
```
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Jenkins\Miniconda3\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "C:\Jenkins\Miniconda3\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 471, in _event_listener
if pipe.poll(None):
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 257, in poll
return self._poll(timeout)
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 330, in _poll
return bool(wait([self], timeout))
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 883, in wait
ov.cancel()
OSError: [WinError 6] The handle is invalid
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=000001EFDF228CE0)
Thread 0x00001f68 (most recent call first):
File "C:\Jenkins\Miniconda3\lib\threading.py", line 1202 in invoke_excepthook
File "C:\Jenkins\Miniconda3\lib\threading.py", line 934 in _bootstrap_inner
File "C:\Jenkins\Miniconda3\lib\threading.py", line 890 in _bootstrap
Current thread 0x00000f94 (most recent call first):
<no Python frame>
FAIL (5.009s)
```
And the process would then exit with error code 3221226505.
See: https://app.circleci.com/pipelines/github/pytorch/pytorch/337351/workflows/ad919a3e-fe9a-4566-8ad6-8b0a252f730c/jobs/14170191/steps
By looking at [the code of `_event_listener` in `common_distributed.py`](eb36f67dcc/torch/testing/_internal/common_distributed.py (L467-L489)) I think that the first exception (the one about the handle being invalid) is "expected" as it results from another thread purposely closing the pipe while that thread is polling it.
The relevant part of the problem seems to be the "could not acquire lock" one. I think this stems from the event listener thread being launched as a daemon thread, which means the interpreter will not wait for that thread to complete before shutting down. When the interpreter shuts down it instantly aborts all other threads. If the event listener thread was aborted _while_ it was logging to stderr then that thread was holding the lock but never got to release it. This is probably what the error is complaining about. This seems to be intended/expected behavior for CPython: https://bugs.python.org/issue42717.
The solution thus is simple: don't make that thread a daemon thread and explicitly wait for it to terminate before shutting down.
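A self-contained sketch of that shutdown discipline (simplified relative to common_distributed.py):
```
import multiprocessing
import threading

def _event_listener(pipe):
    # Poll until the parent closes its end, then return cleanly instead of
    # being killed mid-write at interpreter shutdown.
    while True:
        try:
            if pipe.poll(None):
                pipe.recv()
        except (EOFError, OSError):
            return

parent_conn, child_conn = multiprocessing.Pipe()
listener = threading.Thread(target=_event_listener, args=(child_conn,))  # not a daemon
listener.start()
# ... test body runs here ...
parent_conn.close()   # wakes the listener's poll()
listener.join()       # explicitly wait for it before the interpreter shuts down
child_conn.close()
```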
ghstack-source-id: 132293710
Test Plan: Will see...
Reviewed By: pritamdamania87
Differential Revision: D29193014
fbshipit-source-id: 4aabe1fc74bf9c54ca605e7a702ac99655489780
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
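For example, once a newer `mypy` understands the pattern, `warn_unused_ignores = True` makes it report the stale suppression itself so it can be deleted:
```
def parse_port(raw: str) -> int:
    # Hypothetical example: an older mypy needed the ignore below; once mypy
    # handles the pattern, warn_unused_ignores flags it as an unused comment.
    return int(raw)  # type: ignore[return-value]
```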
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284
Logs a few python-side errors to DDP logging.
TODO: Most python errors actually have to do with user input correctness, so they throw before the reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in Scuba?
ghstack-source-id: 130412973
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28820290
fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717
The signal_handler was under the caffe2 namespace but was being used
by PyTorch as well.
I've fixed this by moving it to the c10 namespace where now both C2 and PyTorch
can use it.
The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatibility for C2, but most of the common code is moved to c10.
ghstack-source-id: 127446929
Test Plan: waitforbuildbot
Reviewed By: ezyang
Differential Revision: D27946738
fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56410
Changes:
- Move the create_tcp_store() helper function to a common file
- Update test_jit_c10d to retry TCPStore creation in case the allocated port becomes used
fixes https://github.com/pytorch/pytorch/issues/55053
Test Plan: Imported from OSS
Reviewed By: heitorschueroff
Differential Revision: D27869560
Pulled By: H-Huang
fbshipit-source-id: f4a6613049bb25e6f6f194214379a380968bb19c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075
Constructs a mapping of parameter names and passes it into Reducer so that error messages about unused parameters/not all parameters getting gradient can say which parameters were unused.
Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.
Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
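A hedged Python sketch of the mapping built in step 2 of the implementation details above; the helper name and traversal are assumptions, only the `f"{module_name}:{param_name}"` format comes from the description:
```
import torch

def build_param_index_to_name(module: torch.nn.Module, params):
    # Map each parameter object to its fully qualified name ...
    param_to_name = {}
    for module_name, submodule in module.named_modules():
        for param_name, param in submodule.named_parameters(recurse=False):
            param_to_name[param] = f"{module_name}:{param_name}"
    # ... then key the mapping by the Reducer's variable index instead of the
    # tensor itself, so no parameter storage needs to be copied or held.
    return {idx: param_to_name[param] for idx, param in enumerate(params)}
```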
ghstack-source-id: 126581051
Test Plan: CI, UT
Reviewed By: zhaojuanmao
Differential Revision: D27356394
fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54769
Follow-up to #53820. This
- makes the `asserts.py` module private as per suggestion from rgommers in https://github.com/pytorch/pytorch/pull/53820#issuecomment-802661387. With this the functions should only be accessible through `torch.testing`, giving us the option the change the underlying structure later.
- moves the code from `torch/testing/__init__.py` to `torch/testing/_core.py` (happy to accept other name suggestions). Otherwise we can't import the new `_asserts.py` in `torch/testing/__init__.py` due to circular imports.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D27438451
Pulled By: mruberry
fbshipit-source-id: c7292b4d5709185b42b4aac8016648562688040e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003
Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in
distributed tests to set a signal handler that dumps the state of all threads
for all processes when it receives a FATAL signal. This would help in debugging
tests further.
I had to revert all the python faulthandler code since only one signal handler
function is supported, so running python faulthandler with
`setPrintStackTracesOnFatalSignal` doesn't work.
Sample output:
```
SIGSEGV(11), PID: 3492872, Thread 3492872:
[0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so
[1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so
[2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so
[3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so
[4] ???(0x7fa7ce683150) in libpthread.so.0
[5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so
[6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so
[7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so
[8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so
[9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so
[10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so
```
ghstack-source-id: 125630551
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D27419714
fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818
Several flaky tests fail due to some sort of timeout and it isn't
clear from the error message in CI where exactly each process is stuck. In this
PR, I've added a mechanism to dump the entire python traceback of all python
threads when we encounter a timeout.
Example traceback:
```
Process 3 timed out with traceback:
Current thread 0x00007ff3363ff700 (most recent call first):
File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
File "threading.py", line 870 in run
File "threading.py", line 932 in _bootstrap_inner
File "threading.py", line 890 in _bootstrap
Thread 0x00007ff406132180 (most recent call first):
File "torch/distributed/distributed_c10d.py", line 2477 in barrier
File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
File "torch/testing/_internal/common_distributed.py", line 409 in run_test
File "torch/testing/_internal/common_distributed.py", line 393 in _run
File "multiprocessing/process.py", line 108 in run
File "multiprocessing/process.py", line 315 in _bootstrap
File "multiprocessing/popen_fork.py", line 75 in _launch
File "multiprocessing/popen_fork.py", line 19 in __init__
File "multiprocessing/context.py", line 277 in _Popen
File "multiprocessing/process.py", line 121 in start
```
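The per-thread dump itself can be produced with standard-library facilities; a sketch (the parent/child pipe plumbing that triggers it is omitted):
```
import sys
import traceback

def all_thread_tracebacks() -> str:
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(f"Thread {thread_id:#x} (most recent call last):\n")
        chunks.append("".join(traceback.format_stack(frame)))
    return "".join(chunks)
```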
ghstack-source-id: 125323810
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27378764
fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54740
Adds a simple helper decorator to set/unset NCCL blocking wait for
tests. This is easier than having to manually set/unset the os.environ vars
every time.
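A hedged sketch of such a decorator; the helper name is assumed, while NCCL_BLOCKING_WAIT is the environment variable this refers to:
```
import functools
import os

def with_nccl_blocking_wait(fn):
    # Turn on NCCL_BLOCKING_WAIT for the duration of the test, then restore
    # whatever value (or absence) was there before.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        prev = os.environ.get("NCCL_BLOCKING_WAIT")
        os.environ["NCCL_BLOCKING_WAIT"] = "1"
        try:
            return fn(*args, **kwargs)
        finally:
            if prev is None:
                os.environ.pop("NCCL_BLOCKING_WAIT", None)
            else:
                os.environ["NCCL_BLOCKING_WAIT"] = prev
    return wrapper
```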
ghstack-source-id: 125233693
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27277222
fbshipit-source-id: c289b9d05e2f6328d672810b07501979b6e177c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54531
Enabling faulthandler will intercept signals like SIGSEGV, SIGFPE,
SIGABRT, SIGBUS and SIGILL and dump the entire python traceback before the
process goes down.
This can help us in debugging flaky tests where a process crashes and we need
to debug what happened.
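Enabling it is a one-line standard-library call, run early in each test process; for example:
```
import faulthandler
import sys

# Dump the Python traceback of all threads if the process receives a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.stderr, all_threads=True)
```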
ghstack-source-id: 125045894
Test Plan:
1) Tested locally to see traceback is produced.
2) waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27271048
fbshipit-source-id: ca12125a9da6cdfc7bac5619ad1c7e116666014b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52632
Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails the parent only prints the following:
```
Process 0 exited with error code 10
```
The child process also logs its own exception, but it is cumbersome to go
through the logs and track this down.
To alleviate this, I've added a set of pipes, one for each child process, so that
the child process writes its error to the pipe before exiting and the parent
process can read the appropriate error from the pipe and display it.
The new output printed by the parent is as follows:
```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
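A simplified, self-contained sketch of the mechanism (MultiProcessTestCase does this per test process; the helper names here are made up):
```
import multiprocessing
import sys
import traceback

TEST_ERROR_EXIT_CODE = 10

def _child(test_fn, conn):
    try:
        test_fn()
        conn.send(None)
    except Exception:
        conn.send(traceback.format_exc())   # ship the traceback to the parent
        conn.close()
        sys.exit(TEST_ERROR_EXIT_CODE)
    conn.close()

def run_in_subprocess(test_fn):
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=_child, args=(test_fn, child_conn))
    proc.start()
    proc.join()
    error = parent_conn.recv() if parent_conn.poll() else None
    if proc.exitcode != 0:
        raise RuntimeError(
            f"Process exited with error code {proc.exitcode} and exception:\n{error}"
        )
```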
ghstack-source-id: 122273793
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26589274
fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5
Summary:
Take 2 of https://github.com/pytorch/pytorch/issues/50914
This change moves the early termination logic into common_utils.TestCase class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52126
Test Plan: CI with ci-all tag
Reviewed By: malfet
Differential Revision: D26391762
Pulled By: walterddr
fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4
Summary:
This is a follow-up to https://github.com/pytorch/pytorch/issues/49869.
Previously, CUDA early termination only happened for generic test classes that extend from `DeviceTypeTestBase`. However, JIT test cases, which extend from common_utils.TestCase, could not benefit from the early termination.
This change moves the early termination logic into the common_utils.TestCase class.
- All tests extended from common_utils.TestCase should now terminate early if a CUDA assert occurs.
- For TestCases that extend from common_device_type.DeviceTypeTestBase, still only do torch.cuda.synchronize() when an RTE is thrown.
- For TestCases that extend common_utils.TestCase, regardless of whether a test case uses the GPU or not, it will always synchronize CUDA as long as `torch.cuda.is_initialized()` returns true.
- Disable this in common_distributed.py.
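A hedged sketch of the check, assuming a post-test hook on the TestCase; the real logic in common_utils.TestCase differs in detail:
```
import torch

def _check_for_cuda_assert(testcase):
    # Only touch CUDA if something in the process has already initialized it.
    if torch.cuda.is_initialized():
        try:
            torch.cuda.synchronize()
        except RuntimeError as err:
            # A device-side assert poisons the CUDA context; fail loudly so the
            # remaining tests in this process can be terminated early.
            testcase.fail(f"CUDA context is in a bad state: {err}")
```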
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50914
Reviewed By: malfet
Differential Revision: D26019289
Pulled By: walterddr
fbshipit-source-id: ddc7c1c0d00db4d073a6c8bc5b7733637a7e77d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44418
This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process same-machine GPU-to-GPU communication. On the sender
side, `TensorPipeAgent` grabs a stream for each device used by the
message, lets these streams wait for the current streams, and passes
the streams to TensorPipe's `CudaBuffer`. On the receiver side, it
also grabs a stream for each device used in the message, and uses
these streams to receive tensors and run user functions. After that,
these streams are then used for sending the response back to the
sender. When receiving the response, the sender will grab a new set
of streams and use them for TensorPipe's `CudaBuffer`.
If device maps are provided, `TensorPipeAgent::send` will return a
derived class of `CUDAFuture`, which is specifically tailored for
RPC Messages.
TODOs:
1. Enable sending CUDA RPC to the same process.
2. Add a custom CUDA stream pool.
3. When TensorPipe addressed the error for `cudaPointerGetAttributes()`,
remove `cuda:0` context initialization code in `backend_registry.py`.
4. When TensorPipe can detect availability of peer access, enable all
tests on platforms without peer access.
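A rough Python analogue of the stream handling described above (the real logic is C++ inside TensorPipeAgent; this just illustrates grabbing a fresh stream per device and ordering it after pending work):
```
import torch

def streams_for_message(devices):
    streams = []
    for device in devices:
        stream = torch.cuda.Stream(device=device)
        # Order the new stream after whatever is already queued on the device.
        stream.wait_stream(torch.cuda.current_stream(device))
        streams.append(stream)
    return streams
```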
Differential Revision: D23626207
Test Plan: Imported from OSS
Reviewed By: lw
Pulled By: mrshenli
fbshipit-source-id: d30e89e8a98bc44b8d237807b84e78475c2763f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372
Currently, in `_run_function`, we catch an exception from the python
function which is run, and report it back to the master. However in some large
scale training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.
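A hedged sketch of the idea (function body and names assumed, not the actual agent code): log locally before reporting the failure back to the master.
```
import logging

logger = logging.getLogger(__name__)

def _run_function(python_function):
    try:
        return python_function()
    except Exception as exc:
        # Log on the trainer itself for faster debugging, then still propagate
        # the failure back to the master as before.
        logger.error("Exception while running function: %s", exc, exc_info=True)
        raise
```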
Test Plan: Added unittest.
Reviewed By: pritamdamania87
Differential Revision: D24324578
fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642
Prior to https://github.com/pytorch/pytorch/pull/45181, initializing a
NCCL process group would work even if no GPUs were present. However, now that
init_process_group calls `barrier()`, this fails.
In general the problem was that we could initialize ProcessGroupNCCL without
GPUs, and then if we called a method like `barrier()` the process would crash,
since we do % numGPUs, resulting in a division by zero.
ghstack-source-id: 113490343
Test Plan: waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D24038839
fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
Summary:
Enabled type checking in common_distributed by using tensors of ints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821
Test Plan: Run python test/test_type_hints.py; errors are no longer ignored by mypy.ini
Reviewed By: walterddr
Differential Revision: D23747466
Pulled By: alanadakotashine
fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525
Since `TEST_SKIPS` is a global multiprocessing.Manager, this was causing
issues when one test would fail and make the rest of the tests fail during
setup due to networking errors.
See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.
To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict, and redo the conversion to a `multiprocessing.Manager` in the next test, which prevents these errors.
ghstack-source-id: 111844724
Test Plan: CI
Reviewed By: malfet
Differential Revision: D23641618
fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89