Add sequence number support for UCC, mostly following the format of ProcessGroupNCCL.
Pass new test: `test_all_gather_object_subgroup`
Add skips for gather tests: `test_gather_object` and `test_gather_object_subgroup`
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047
Approved by: https://github.com/kwen2501
Move a bunch of globals to instance methods and replace all uses of them.
We move all PG-related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a thread-local
and enables per-thread PGs.
This almost gets DDP working; the PG is just missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and having the default _World wrap them.
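A minimal sketch of the kind of extension point this opens up, assuming `_World` and `_world` live in `torch.distributed.distributed_c10d` as described above; the thread-local wrapper itself is hypothetical and untested:
```
import threading

import torch.distributed.distributed_c10d as c10d

class _ThreadLocalWorld:
    """Hypothetical: give every thread its own _World, i.e. per-thread PGs."""

    _state = threading.local()

    def _get_world(self):
        if not hasattr(self._state, "world"):
            self._state.world = c10d._World()
        return self._state.world

    def __getattr__(self, name):
        # Delegate every attribute lookup to this thread's _World instance.
        return getattr(self._get_world(), name)

# Swapping the singleton would then route all c10d state through it:
# c10d._world = _ThreadLocalWorld()
```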
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
Move a bunch of globals to instance methods and replace all uses of them.
We move all PG-related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a thread-local
and enables per-thread PGs.
This almost gets DDP working; the PG is just missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
- [x] Direct dependency on UCX is completely removed; the UCC active set API is always enabled
- [x] Remove `TORCH_UCC_PROFILING_ENABLE`, always enable profiling
- [x] Fix profiling of `recv` and `all_gather`
- [x] Use the NCCL TL of UCC on CUDA, as the UCP TL is not well supported on CUDA
Most tests are passing, but there are a few skipped tests:
- `scatter` and `gather` are not supported by the UCP TL of UCC on CPU tensors
- A few flaky tests in PyTorch's CI environment
- Profiler-related failures, some of them will be fixed by @Fuzzkatt in https://github.com/pytorch/pytorch/pull/84368
After this PR is merged, I will continue to work on these skipped failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83285
Approved by: https://github.com/vtlam, https://github.com/malfet, https://github.com/kwen2501
When running on clusters, the startup time for the subprocesses might be much higher, which leads to spurious failures.
So increase this timeout to 300s, similar to torch/testing/_internal/distributed/distributed_test.py.
Also introduces `DISTRIBUTED_TESTS_DEFAULT_TIMEOUT` as suggested by @malfet in #55896
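A hedged sketch of the shared default (only the constant's name and the 300s value come from this PR; the env-var override is an assumption for illustration):
```
import os

# One shared default that individual distributed test suites can read instead of
# hard-coding their own timeouts; 300s matches distributed_test.py.
DISTRIBUTED_TESTS_DEFAULT_TIMEOUT = int(
    os.getenv("DISTRIBUTED_TESTS_DEFAULT_TIMEOUT", "300")
)
```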
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80330
Approved by: https://github.com/malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753
As per the design in https://github.com/pytorch/pytorch/issues/72138,
convert DDP parameters to ReplicatedTensor during its forward pass. Concretely,
this is done as follows:
1) Create a separate `_replicated_tensor_module` which is a copy of self.module
without creating copies of the Tensors themselves.
2) Use `_replicated_tensor_module` instead of `self.module` during the forward
pass.
3) Gate this behind a context manager `_ddp_replicated_tensor`, since
certain edge cases can fail where self.module is changed out of band, resulting
in a discrepancy between self.module and `_replicated_tensor_module`.
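For step 1, one way to copy a module without duplicating its tensors is to seed `copy.deepcopy`'s memo with the original parameters and buffers; this is only a sketch of the idea, not necessarily how the PR implements it:
```
import copy

import torch.nn as nn

def _copy_module_share_tensors(module: nn.Module) -> nn.Module:
    # Pre-populating the memo makes deepcopy reuse the original tensors, so the
    # copy shares storage with self.module instead of duplicating it.
    memo = {id(t): t for t in list(module.parameters()) + list(module.buffers())}
    return copy.deepcopy(module, memo)

m = nn.Linear(4, 4)
replica = _copy_module_share_tensors(m)
assert replica.weight is m.weight  # same Parameter object, no extra memory
```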
Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/)
Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842
**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.
The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values (see the sketch after this list)
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
- For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
- The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
- A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
- The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
- There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
- This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
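A minimal sketch of the test pattern referenced in the list above (`parametrize` + `skip_if_no_gpu`); the test name, parameter, and body are illustrative only:
```
from torch.testing._internal import common_distributed, common_utils

class TestZeroRedundancyOptimizerDistributed(common_distributed.MultiProcessTestCase):
    @common_distributed.skip_if_no_gpu
    @common_utils.parametrize("overlap_with_ddp", [False, True])
    def test_example(self, overlap_with_ddp: bool):
        # skip_if_no_gpu guarantees enough devices, so using
        # torch.device(self.rank) as self.device is safe here.
        ...

common_utils.instantiate_parametrized_tests(TestZeroRedundancyOptimizerDistributed)
```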
**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34675709
Pulled By: awgu
fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
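A brief usage sketch of the three new APIs, assuming they are exposed under `torch.distributed` alongside the existing `DebugLevel` enum:
```
import torch.distributed as dist

# Read the level parsed from TORCH_DISTRIBUTED_DEBUG at startup.
level = dist.get_debug_level()

# Raise verbosity at runtime, e.g. before a collective that is misbehaving.
dist.set_debug_level(dist.DebugLevel.DETAIL)

# Re-parse TORCH_DISTRIBUTED_DEBUG, e.g. after the environment was modified.
dist.set_debug_level_from_env()
```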
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029
This PR implements NCCL scatter and adds scatter to ProcessGroupNCCL.
NCCL doesn’t directly provide primitives for scatter, so it needs to be implemented on top of NCCL’s send/recv API.
1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened, then outputTensors and inputFlattened are passed by the collective class to the scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank uses a for loop to send (distribute) the inputTensors to each rank, and all the other ranks receive their inputTensor from the root rank.
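A hedged Python-level schematic of the send/recv pattern in step 2, written against the public P2P API purely for illustration (the actual implementation is the ncclSend/ncclRecv loop in nccl.cpp):
```
import torch
import torch.distributed as dist

def scatter_via_send_recv(output: torch.Tensor, inputs, src: int = 0):
    rank = dist.get_rank()
    if rank == src:
        for peer, tensor in enumerate(inputs):
            if peer == src:
                output.copy_(tensor)          # the root keeps its own shard locally
            else:
                dist.send(tensor, dst=peer)   # the root distributes the other shards
    else:
        dist.recv(output, src=src)            # every other rank receives its shard
```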
ghstack-source-id: 147754837
Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks
Reviewed By: pritamdamania87
Differential Revision: D33154823
fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745
This PR implements NCCL gather and adds gather to ProcessGroupNCCL using the NCCL send/recv API.
NCCL doesn’t directly provide primitives for gather, so it needs to be implemented on top of NCCL’s send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to the gather() function in nccl.cpp.
2. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all the ranks send their inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors.
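A hedged Python-level schematic of the pattern described above, again using the public P2P API only for illustration (not the actual nccl.cpp code):
```
import torch
import torch.distributed as dist

def gather_via_send_recv(input: torch.Tensor, outputs, dst: int = 0):
    rank = dist.get_rank()
    if rank == dst:
        for peer in range(dist.get_world_size()):
            if peer == dst:
                outputs[peer].copy_(input)          # the root keeps its own tensor
            else:
                dist.recv(outputs[peer], src=peer)  # receive each rank's tensor
    else:
        dist.send(input, dst=dst)                   # everyone else sends to the root
```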
ghstack-source-id: 147754838
Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress
Reviewed By: pritamdamania87
Differential Revision: D29616361
fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70628
The event_listener thread is used to log process tracebacks when a timed-out
process sends it a request for its traceback. However, this thread is
created in the `_run` function, which is overridden by some classes such as
`TestDistBackend`, so those tests did not have this feature. Move the
event_listener setup logic to `run_test`, which is called by all distributed
test classes, enabling it for all distributed tests. Also modify the logger
setup to ensure that logging.info calls are printed in the subprocess.
ghstack-source-id: 146714642
Test Plan: CI
Reviewed By: jaceyca, fduwjj
Differential Revision: D33410613
fbshipit-source-id: aa616d69d251bc9d04e45781c501d2244f011843
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67256
To change what tests can be run in various cases, the check logic should be moved to functions and variables that can be changed.
One challenge here is that decorators are not dynamic: if a value is read at import time and then changed afterwards, the decorator will not pick up the change. This means we need to separate out the variables that need to be changed for our use case.
Those are put into common_distributed.py and can be changed before importing the distributed_test.py code.
The use case is to add new backends to the tests and split them into tests that can be run on demand as a separate instance. To do so, you would change DistTestSkipCases after importing it in a launcher or a setup script and then load distributed_test, as sketched below.
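A hedged sketch of such a launcher; `DistTestSkipCases` comes from this diff, but the attribute mutated below is purely illustrative:
```
# Hypothetical launcher/setup script.
from torch.testing._internal import common_distributed

# Illustrative attribute only: adjust the skip configuration for a new backend
# before the test module is imported.
common_distributed.DistTestSkipCases.skipped_tests.add("test_foo")

# Import afterwards, since the decorators are evaluated at import time and would
# otherwise capture the unmodified values.
from torch.testing._internal.distributed import distributed_test  # noqa: F401
```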
Test Plan: Check the signals
Reviewed By: mrshenli
Differential Revision: D31906947
fbshipit-source-id: 45e3258c55f4dc34e12a468bed65280f4c25748f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.
Instead, check whether it is a wrapped PG, and if so check the wrapped pg to see if it is NCCL.
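Roughly, the check becomes something like the sketch below (the wrapper class name, `wrapped_pg` attribute, and `_GLOO_AVAILABLE` guard are assumptions for illustration):
```
import torch.distributed as dist
from torch.distributed import distributed_c10d as c10d

def _pg_uses_nccl(pg) -> bool:
    # Unwrap the DETAIL-mode process group wrapper before inspecting the real
    # backend, instead of looking only at the outermost object.
    if c10d._GLOO_AVAILABLE and isinstance(pg, c10d._ProcessGroupWrapper):
        pg = pg.wrapped_pg
    return dist.is_nccl_available() and isinstance(pg, dist.ProcessGroupNCCL)
```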
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67050
This PR moves init_multi_gpu_helper to common_distributed so that it can be shared by different distributed tests.
ghstack-source-id: 141370119
Test Plan: wait for ci.
Reviewed By: mrshenli
Differential Revision: D31842644
fbshipit-source-id: c7bad25d6cef9bdce7ad1fb6c60c1cad4b765702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772
Looking at some workloads, it would be useful to have this info.
ghstack-source-id: 140555200
Test Plan: CI
Reviewed By: zhaojuanmao, wayi1
Differential Revision: D31224417
fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65391
TSAN tests are about 5-10x slower than the usual dev/opt mode.
As a result, for TSAN build mode we use a much higher timeout for distributed
tests.
ghstack-source-id: 138584613
Test Plan: waitforbuildbot
Reviewed By: cbalioglu
Differential Revision: D31076575
fbshipit-source-id: 44a485f07101deac536470ceeff2a52cac4f9e0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63711
This removes `_fork_process` from common_distributed.py and fixes all
other callpoints to use `spawn_process` instead.
ghstack-source-id: 136395719
Test Plan: waitforbuildbot
Reviewed By: xush6528
Differential Revision: D30463834
fbshipit-source-id: 0c09e8a996d0e5b912c8cdd45488a39951bac4db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63361
Python multiprocessing doesn't support LSAN and triggers false positives
instead. As a result, disable LSAN for these tests so that we can still run them
with opt-asan.
ghstack-source-id: 135962489
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D30352269
fbshipit-source-id: f6ab5abce7bdef00cd5e1f5977424d2b151174af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051
The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.
This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id: 135523770
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29854514
fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61991
Continuation of https://github.com/pytorch/pytorch/pull/61887 and
removing unittest.skip as much as possible.
ghstack-source-id: 134759368
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29831860
fbshipit-source-id: fe57a7d56d4423924a2dec10bb670137ace0c9a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests are simply reported as passing on sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`
The overall goal is to avoid using skips, since sandcastle tags these tests as
continuously skipping.
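A hedged sketch of what such a decorator can look like; the decorator name matches the PR, but the body and environment check are illustrative:
```
import os
import unittest
from functools import wraps

IS_SANDCASTLE = (
    os.getenv("SANDCASTLE") == "1" or os.getenv("TW_JOB_USER") == "sandcastle"
)

def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if IS_SANDCASTLE:
                    # Report as passing so sandcastle doesn't flag the test as
                    # continuously skipping.
                    print(f"Skipping {fn.__name__} on sandcastle: {reason}")
                    return
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```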
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61876
In the sandcastle environment, avoid skipping tests and instead just
"pass" these tests, to avoid creating a large number of tasks that are not
actionable.
ghstack-source-id: 133846232
Test Plan: Test with `SANDCASTLE=1 TW_JOB_USER=sandcastle`
Reviewed By: rohan-varma
Differential Revision: D29779699
fbshipit-source-id: add71008830dfa6f456ce2365a2d70436b7b7a31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61409
We used a multiprocessing.Manager in order to share TEST_SKIPS between the parent and the child processes. TEST_SKIPS is a global variable that defines a unique error code for each "error type", so that the parent can figure out the reason a child exited.
While originally this mapping was immutable, at some point we allowed children to modify the parent's value of that mapping so they could update the message for the `multi-gpu` error to make it reflect how many GPUs were really needed. This occurred in D23285790 (2a4d312027). Since then this Manager proved to be quite problematic, especially around thread safety, races, TSAN, ... (see D22753459 (f0c46878c6), D23641618 (567c51cce9), D28490129, D28794321 (0128eb9a85) and D29585862).
This seems like an awful lot of trouble for such a small piece of functionality. Here I propose we drop the Manager and instead get the same result by using separate error codes for each number of GPUs. It should be much simpler and thus more robust.
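A hedged sketch of the per-GPU-count error codes (the exit-code values and the shape of the entries are illustrative, not the real table):
```
from collections import namedtuple

TestSkip = namedtuple("TestSkip", ["exit_code", "message"])

# One immutable entry per required GPU count; a child that needs N GPUs simply
# exits with TEST_SKIPS[f"multi-gpu-{N}"].exit_code, and the parent maps the code
# back to the message without any shared, mutable Manager state.
TEST_SKIPS = {
    f"multi-gpu-{n}": TestSkip(80 + n, f"Need at least {n} CUDA devices")
    for n in range(1, 9)
}
```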
ghstack-source-id: 133236447
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29612614
fbshipit-source-id: 8ad0fedcb7796e5832a0eb196f8fdc147e02b3df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59238
Creating a `multiprocessing.Manager()` launches a new process using the `fork` method (because it's the default one), and then in that subprocess it launches a new thread. TSAN really doesn't like this (and rightly so!) because we already had threads in the superprocess, and intermixing threads and forks is dangerous. The proper way to deal with this is to `exec` inside the child process or, in other words, use the `spawn` method.
Note that the method used to launch the Manager is entirely unrelated from the method used to launch our "own" subprocesses, hence we were using `fork` for the Manager even though we were using `spawn` for our own subprocesses.
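A sketch of the fix, assuming the Manager creation site simply switches to a spawn context:
```
import multiprocessing

# Build the Manager from a "spawn" context so its helper process is exec'd
# rather than forked from an already multi-threaded parent.
ctx = multiprocessing.get_context("spawn")
manager = ctx.Manager()
shared_state = manager.dict()  # safe to hand to the spawned test subprocesses
```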
ghstack-source-id: 130240724
Test Plan: Reverted the silencing introduced in D28490129, ran the `test_init_rpc_then_pg` test from the TensorPipe suite and saw the original TSAN failure. Then applied my fix, re-ran the test, and the failure was gone.
Reviewed By: zhaojuanmao
Differential Revision: D28794321
fbshipit-source-id: 12242e69be399a7f02a40a0ebb3d92f92e00ce73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60917
The second line of the error log didn't use an f-string properly.
Before fix:
```
exiting process with exit code: {MultiProcessTestCase.TEST_ERROR_EXIT_CODE}
```
After fix:
```
exiting process 3 with exit code: 10
```
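In other words, the log line was missing its `f` prefix; a minimal before/after sketch (variable names in the fixed line are illustrative):
```
import logging

rank, exit_code = 3, 10  # illustrative values

# Before: a plain string, so the braces were printed literally.
logging.info("exiting process with exit code: {MultiProcessTestCase.TEST_ERROR_EXIT_CODE}")

# After: an f-string (plus the rank), so the values are interpolated.
logging.info(f"exiting process {rank} with exit code: {exit_code}")
```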
ghstack-source-id: 132618199
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D29446574
fbshipit-source-id: f806ef0470cb6aa86fe3c404e1c895514abb6488
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60167
We were getting errors such as this on Windows in our c10d ProcessGroup test suite:
```
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Jenkins\Miniconda3\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "C:\Jenkins\Miniconda3\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 471, in _event_listener
if pipe.poll(None):
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 257, in poll
return self._poll(timeout)
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 330, in _poll
return bool(wait([self], timeout))
File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 883, in wait
ov.cancel()
OSError: [WinError 6] The handle is invalid
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=000001EFDF228CE0)
Thread 0x00001f68 (most recent call first):
File "C:\Jenkins\Miniconda3\lib\threading.py", line 1202 in invoke_excepthook
File "C:\Jenkins\Miniconda3\lib\threading.py", line 934 in _bootstrap_inner
File "C:\Jenkins\Miniconda3\lib\threading.py", line 890 in _bootstrap
Current thread 0x00000f94 (most recent call first):
<no Python frame>
FAIL (5.009s)
```
And the process would then exit with error code 3221226505.
See: https://app.circleci.com/pipelines/github/pytorch/pytorch/337351/workflows/ad919a3e-fe9a-4566-8ad6-8b0a252f730c/jobs/14170191/steps
By looking at [the code of `_event_listener` in `common_distributed.py`](eb36f67dcc/torch/testing/_internal/common_distributed.py (L467-L489)) I think that the first exception (the one about the handle being invalid) is "expected" as it results from another thread purposely closing the pipe while that thread is polling it.
The relevant part of the problem seems to be the "could not acquire lock" one. I think this stems from the event listener thread being launched as a daemon thread, which means the interpreter will not wait for that thread to complete before shutting down. When the interpreter shuts down it instantly aborts all other threads. If the event listener thread was aborted _while_ it was logging to stderr then that thread was holding the lock but never got to release it. This is probably what the error is complaining about. This seems to be intended/expected behavior for CPython: https://bugs.python.org/issue42717.
The solution thus is simple: don't make that thread a daemon thread and explicitly wait for it to terminate before shutting down.
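A self-contained sketch of the fix (the listener body is a stand-in for the real `_event_listener` loop):
```
import multiprocessing
import threading

def listen(pipe):
    # Stand-in for _event_listener: block on the pipe and exit on a sentinel.
    while True:
        if pipe.poll(None) and pipe.recv() is None:
            return

parent_conn, child_conn = multiprocessing.Pipe()
# Non-daemon thread: the interpreter will not tear it down mid-write.
listener = threading.Thread(target=listen, args=(parent_conn,), daemon=False)
listener.start()
# ... run the test ...
child_conn.send(None)  # tell the listener to finish ...
listener.join()        # ... and explicitly wait for it before shutting down
```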
ghstack-source-id: 132293710
Test Plan: Will see...
Reviewed By: pritamdamania87
Differential Revision: D29193014
fbshipit-source-id: 4aabe1fc74bf9c54ca605e7a702ac99655489780
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed or not. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out in case it encounters a `type: ignore` that is no longer needed.
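For example, with that option enabled a stale suppression on an otherwise-correct line is reported rather than silently accepted (a small illustration, with the mypy message paraphrased):
```
# With warn_unused_ignores = True, mypy reports roughly
# 'Unused "type: ignore" comment' for the suppression below if the current
# mypy version already accepts the line without it.
x: int = 1  # type: ignore[assignment]
```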
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284
Logs a few python-side errors to DDP logging.
TODO: Most python errors actually have to do with user input correctness, so they throw before the reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28820290
fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717
The signal_handler was under the caffe2 namespace but was being used
by PyTorch as well.
I've fixed this by moving it to the c10 namespace, where now both C2 and PyTorch
can use it.
The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatibility for C2, but most of the common code is moved to c10.
ghstack-source-id: 127446929
Test Plan: waitforbuildbot
Reviewed By: ezyang
Differential Revision: D27946738
fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56410
Changes:
- Move the create_tcp_store() helper function to a common file
- Update test_jit_c10d to retry TCPStore creation in case the allocated port becomes used
Fixes https://github.com/pytorch/pytorch/issues/55053
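A hedged sketch of the retry pattern described above (the helper's real signature and retry count may differ):
```
import torch.distributed as dist
from torch.testing._internal.common_utils import find_free_port

def create_tcp_store(addr="localhost", world_size=1, is_master=True, retries=3):
    # Retry with a fresh port if the one we picked gets taken before TCPStore
    # manages to bind to it.
    for attempt in range(retries):
        port = find_free_port()
        try:
            return dist.TCPStore(addr, port, world_size, is_master)
        except RuntimeError as err:
            if attempt == retries - 1:
                raise
            print(f"TCPStore on port {port} failed ({err}); retrying")
```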
Test Plan: Imported from OSS
Reviewed By: heitorschueroff
Differential Revision: D27869560
Pulled By: H-Huang
fbshipit-source-id: f4a6613049bb25e6f6f194214379a380968bb19c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075
Constructs and passes a mapping of parameter names into the Reducer, to log which parameters went unused in error messages about unused parameters / not all parameters getting gradient.
Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.
Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor; instead we map param_index -> param_name and use the existing concept of variable_index in the Reducer to look up parameter names.
2) DDP constructs the param index -> param name mapping, where the name is the fully qualified name f"{module_name}:{param_name}", and passes it into the Reducer (see the sketch after this list).
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
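A hedged sketch of the mapping built in step 2 (the helper name and exact index ordering are illustrative; only the f"{module_name}:{param_name}" format comes from the description above):
```
import torch.nn as nn

def _build_param_index_to_name(module: nn.Module):
    # Map each parameter's index to its fully qualified name so the Reducer can
    # report names without holding on to the tensors themselves.
    module_name = module.__class__.__name__
    return {
        i: f"{module_name}:{param_name}"
        for i, (param_name, _) in enumerate(module.named_parameters())
    }
```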
ghstack-source-id: 126581051
Test Plan: CI, UT
Reviewed By: zhaojuanmao
Differential Revision: D27356394
fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54769
Follow-up to #53820. This
- makes the `asserts.py` module private as per suggestion from rgommers in https://github.com/pytorch/pytorch/pull/53820#issuecomment-802661387. With this the functions should only be accessible through `torch.testing`, giving us the option the change the underlying structure later.
- moves the code from `torch/testing/__init__.py` to `torch/testing/_core.py` (happy to accept other name suggestions). Otherwise we can't import the new `_asserts.py` in `torch/testing/__init__.py` due to circular imports.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D27438451
Pulled By: mruberry
fbshipit-source-id: c7292b4d5709185b42b4aac8016648562688040e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003
Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in
distributed tests to set a signal handler that dumps the state of all threads
for all processes when it receives a FATAL signal. This would help in debugging
tests further.
I had to revert all the python faulthandler code since only one signal handler
function is supported, so running python faulthandler with
`setPrintStackTracesOnFatalSignal` doesn't work.
Sample output:
```
SIGSEGV(11), PID: 3492872, Thread 3492872:
[0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so
[1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so
[2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so
[3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so
[4] ???(0x7fa7ce683150) in libpthread.so.0
[5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so
[6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so
[7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so
[8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so
[9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so
[10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so
```
ghstack-source-id: 125630551
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D27419714
fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489