Commit Graph

71 Commits

Wanchao Liang
6feba4bc7e Implement scatter primitive for ProcessGroupNCCL (#70029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029

This PR implements NCCL scatter and adds scatter to ProcessGroupNCCL.

NCCL doesn't directly provide a scatter primitive, so it needs to be implemented on top of NCCL's send/recv API.

1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened, then outputTensors and inputFlattened are passed by the collective class to the scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank uses a for loop to send (distribute) the inputTensors to each rank, and every rank receives its inputTensor from the root rank (see the sketch below).
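For illustration, here is a minimal Python sketch of the same pattern built from torch.distributed point-to-point ops; the helper name `scatter_via_send_recv` and the plain send/recv calls are stand-ins for the grouped ncclSend/ncclRecv calls in nccl.cpp (assumes an already-initialized process group), not the actual implementation.

```
import torch
import torch.distributed as dist

def scatter_via_send_recv(output: torch.Tensor, inputs, root: int = 0) -> None:
    """Scatter built only from point-to-point send/recv (illustrative sketch)."""
    rank = dist.get_rank()
    if rank == root:
        # The root walks over all ranks and distributes one input tensor to each.
        for dst, tensor in enumerate(inputs):
            if dst == root:
                output.copy_(tensor)  # the root keeps its own slice locally
            else:
                dist.send(tensor, dst=dst)
    else:
        # Every non-root rank receives its slice from the root.
        dist.recv(output, src=root)
```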
ghstack-source-id: 147754837

Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks

Reviewed By: pritamdamania87

Differential Revision: D33154823

fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
2022-01-27 19:37:55 +00:00
Wanchao Liang
9b53d3194c Implement gather primitive for ProcessGroupNCCL (#66745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745

This PR implements NCCL gather and adds gather to ProcessGroupNCCL using the NCCL send/recv API.

NCCL doesn't directly provide a gather primitive, so it needs to be implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to the gather() function in nccl.cpp.
2. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all ranks send their inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors (see the sketch below).
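A companion Python sketch for the gather direction, with the same caveats (hypothetical helper name, point-to-point ops standing in for ncclSend/ncclRecv, initialized process group assumed):

```
import torch
import torch.distributed as dist

def gather_via_send_recv(input: torch.Tensor, outputs, root: int = 0) -> None:
    """Gather built only from point-to-point send/recv (illustrative sketch)."""
    rank = dist.get_rank()
    if rank == root:
        # The root receives one tensor from every rank into the output list.
        for src in range(dist.get_world_size()):
            if src == root:
                outputs[src].copy_(input)  # the root contributes its own input locally
            else:
                dist.recv(outputs[src], src=src)
    else:
        # Every non-root rank sends its input to the root.
        dist.send(input, dst=root)
```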
ghstack-source-id: 147754838

Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress

Reviewed By: pritamdamania87

Differential Revision: D29616361

fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
2022-01-27 19:37:55 +00:00
Rohan Varma
2bed616e0f [Dist tests] Make event_listener work for all dist tests (#70628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70628

The event_listener thread is used to log process tracebacks when a timed
out process sends it a request for its traceback. However, this thread was
created in the `_run` function, which is overridden by some classes such as
`TestDistBackend`, so those tests did not have this feature. Move the
event_listener setup logic to `run_test`, which is called by all distributed
test classes; this enables it for all distributed tests. Also modify the logger
setup to ensure that logging.info calls are printed in the subprocess.
ghstack-source-id: 146714642

Test Plan: CI

Reviewed By: jaceyca, fduwjj

Differential Revision: D33410613

fbshipit-source-id: aa616d69d251bc9d04e45781c501d2244f011843
2022-01-09 14:54:09 -08:00
Bryan Reese
51b6981c36 [PyTorch Tests] Split out skip logic, make changes for plugins (#67256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67256

To change which tests can be run in various cases, the skip-check logic should be moved into functions and variables that can be changed at runtime.

One challenge here is that decorators are not dynamic: a condition that is read when the module is imported and changed afterwards has no effect on the decorated test (see the sketch below). This means we need to separate out the variables that need to be changed for our use case.

Those are put into common_distributed.py and can be changed before importing the distributed_test.py code.

The use case is to add new backends to the tests and to split them into tests that can be run on demand as a separate instance. To do so, you would change DistTestSkipCases after importing it into a launcher or a setup script and then load distributed_test.
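A small sketch of the import-time pitfall described above; the names (`BACKEND_SUPPORTED`, the two decorators) are illustrative, not the actual helpers in common_distributed.py:

```
import unittest

BACKEND_SUPPORTED = False  # illustrative stand-in for a skip-cases variable

def skip_if_unsupported_static(fn):
    # Evaluated once, at import time: flipping BACKEND_SUPPORTED later has no effect.
    return unittest.skipUnless(BACKEND_SUPPORTED, "backend not supported")(fn)

def skip_if_unsupported_dynamic(fn):
    # Deferring the check to call time lets a launcher mutate the flag before tests run.
    def wrapper(*args, **kwargs):
        if not BACKEND_SUPPORTED:
            raise unittest.SkipTest("backend not supported")
        return fn(*args, **kwargs)
    return wrapper
```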

Test Plan: Check the signals

Reviewed By: mrshenli

Differential Revision: D31906947

fbshipit-source-id: 45e3258c55f4dc34e12a468bed65280f4c25748f
2021-12-08 12:23:15 -08:00
Rohan Varma
d44d59aa70 [BE] Enable C++ stacktraces for MultiProcessTestCase (#69175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69175

Shows C++ stacktraces for python distributed tests that inherit from
MultiProcessTestCase. Closes https://github.com/pytorch/pytorch/issues/69168
ghstack-source-id: 145085858

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D32736872

fbshipit-source-id: 743e870eefa7a9e77c5791d0936e2ebd5c9b1016
2021-12-08 11:57:51 -08:00
Rohan Varma
cb14a258a2 [c10d] Fix object-based collectives for debug mode (#68223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223

DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.

Instead, check if it is a wrapped PG, and then check the wrapped PG to see if it's NCCL.
ghstack-source-id: 143242023

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32366840

fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
2021-11-13 04:18:31 -08:00
Wanchao Liang
cf3a5160f8 [BE] move init_multigpu_helper to common_distributed (#67050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67050

This PR moves init_multigpu_helper to common_distributed so that it can be shared by different distributed tests.
ghstack-source-id: 141370119

Test Plan: wait for ci.

Reviewed By: mrshenli

Differential Revision: D31842644

fbshipit-source-id: c7bad25d6cef9bdce7ad1fb6c60c1cad4b765702
2021-10-22 17:16:11 -07:00
Rohan Varma
1e47181c47 [DDP Logging] Add iteration in error reporting (#65772)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772

Looking at some workloads, it would be useful to have this info.
ghstack-source-id: 140555200

Test Plan: CI

Reviewed By: zhaojuanmao, wayi1

Differential Revision: D31224417

fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
2021-10-14 22:29:36 -07:00
Pritam Damania
c245632e2e Use higher timeout for TSAN tests. (#65391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65391

TSAN tests are much slower than the usual dev/opt mode, about 5-10x
slower.

As a result, for TSAN build mode we use a much higher timeout for distributed
tests.
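A hedged sketch of the idea; `TEST_WITH_TSAN` is the existing flag in common_utils, but the constant and multiplier below are illustrative values, not the ones used in this diff:

```
from torch.testing._internal.common_utils import TEST_WITH_TSAN

DEFAULT_TIMEOUT = 300  # seconds; illustrative value
TSAN_SLOWDOWN = 10     # TSAN builds are roughly 5-10x slower

def process_timeout() -> int:
    # Use a much larger per-process timeout when the build is instrumented with TSAN.
    return DEFAULT_TIMEOUT * TSAN_SLOWDOWN if TEST_WITH_TSAN else DEFAULT_TIMEOUT
```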
ghstack-source-id: 138584613

Test Plan: waitforbuildbot

Reviewed By: cbalioglu

Differential Revision: D31076575

fbshipit-source-id: 44a485f07101deac536470ceeff2a52cac4f9e0b
2021-09-21 12:08:27 -07:00
Pritam Damania
d6133b2fe6 Remove _fork_processes from common_distributed.py (#63711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63711

This removes `_fork_processes` from common_distributed.py and fixes all
other call sites to use `spawn_process` instead.
ghstack-source-id: 136395719

Test Plan: waitforbuildbot

Reviewed By: xush6528

Differential Revision: D30463834

fbshipit-source-id: 0c09e8a996d0e5b912c8cdd45488a39951bac4db
2021-08-22 18:57:12 -07:00
Pritam Damania
2d671ca41b [8/N] Remove c10d/ddp fork tests. (#63454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63454

Continuation of https://github.com/pytorch/pytorch/pull/63443, this
PR removes all fork tests from torch.distributed.
ghstack-source-id: 136285511

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30387872

fbshipit-source-id: f6d6313db126ae7b95b86f78a1e0726887c5c513
2021-08-20 12:23:18 -07:00
Pritam Damania
f8a84a80cd [5/N] Run opt-asan with detect_leaks=0 (#63361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63361

Python multiprocessing doesn't support LSAN and causes false positives
instead. As a result, we disable LSAN for these tests so that we can still run
with opt-asan.
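For context, `LSAN_OPTIONS` is the standard LeakSanitizer environment variable; setting it in the launcher before the sanitized subprocesses start is one way to suppress leak reports while keeping ASAN itself active (a sketch of the idea, not necessarily how the internal build wires it up):

```
import os

# Disable only the leak checker; other ASAN checks remain enabled in the child
# processes that inherit this environment.
os.environ.setdefault("LSAN_OPTIONS", "detect_leaks=0")
```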
ghstack-source-id: 135962489

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D30352269

fbshipit-source-id: f6ab5abce7bdef00cd5e1f5977424d2b151174af
2021-08-18 01:59:56 -07:00
Pritam Damania
f7611b31aa [4/N] Enable opt-asan for distributed unit tests. (#62051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051

The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.

This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id: 135523770

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29854514

fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
2021-08-10 22:38:31 -07:00
Yi Wang
72295da6c3 Reformat (#62456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62456

as title
ghstack-source-id: 134771417

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D30006493

fbshipit-source-id: 1d1dc9cfff69a9b4fa31470177c1f4fa206a94ef
2021-07-30 20:50:19 -07:00
Pritam Damania
2006dc6316 [3/N] Remove unittest.skip from torch/testing/_internal distributed files. (#61991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61991

Continuation of https://github.com/pytorch/pytorch/pull/61887 and
removing unittest.skip as much as possible.
ghstack-source-id: 134759368

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29831860

fbshipit-source-id: fe57a7d56d4423924a2dec10bb670137ace0c9a4
2021-07-30 16:40:43 -07:00
Pritam Damania
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests simply pass on sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.

The overall goal is to avoid using skips, since sandcastle tags these tests as
continuously skipping.
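A minimal sketch of the decorator idea (simplified signature and environment check; the real `sandcastle_skip_if` may differ):

```
import os
import unittest
from functools import wraps

def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if os.environ.get("SANDCASTLE") == "1":
                    # On sandcastle, report a pass instead of a skip so the test
                    # is not tagged as continuously skipping.
                    return None
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```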
ghstack-source-id: 134382237

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
Pritam Damania
a8f6b5a80a [1/N] Avoid skipping tests in sandcastle. (#61876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61876

In the sandcastle environment, avoid skipping tests and instead just
"pass" them, to avoid creating a large number of tasks that are not
actionable.
ghstack-source-id: 133846232

Test Plan: Test with `SANDCASTLE=1 TW_JOB_USER=sandcastle`

Reviewed By: rohan-varma

Differential Revision: D29779699

fbshipit-source-id: add71008830dfa6f456ce2365a2d70436b7b7a31
2021-07-21 14:31:17 -07:00
Luca Wehrstedt
14f63763c1 Avoid using mp.Manager to report #GPUs needed in dist tests (#61409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61409

We used a multiprocessing.Manager in order to share TEST_SKIPS between the parent and the child processes. TEST_SKIPS is a global variable that defines a unique error code for each "error type", so that the parent can figure out the reason a child exited. While originally this mapping was immutable, at some point we allowed children to modify the parent's value of that mapping so they could update the message for the `multi-gpu` error to make it reflect how many GPUs were really needed. This occurred in D23285790 (2a4d312027). Since then this Manager proved to be quite problematic, especially around thread safety, races, TSAN, ... (see D22753459 (f0c46878c6), D23641618 (567c51cce9), D28490129, D28794321 (0128eb9a85) and D29585862). This seems like an awful lot of trouble for such a small functionality. Here I propose we drop Manager and instead get the same result by using separate error codes for each number of GPUs. It should be much simpler and thus more robust.
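A sketch of the "one exit code per required GPU count" scheme; the `TestSkip` shape mirrors the existing TEST_SKIPS entries, but the exact names and codes below are illustrative:

```
from typing import NamedTuple

class TestSkip(NamedTuple):
    exit_code: int
    message: str

# One immutable entry per GPU count: a child that needs N GPUs but doesn't have them
# exits with the matching code, and the parent maps the code back to a precise message
# without any shared mutable state (no multiprocessing.Manager needed).
TEST_SKIPS = {
    f"multi-gpu-{n}": TestSkip(80 + n, f"Need at least {n} CUDA devices")
    for n in range(2, 9)
}
```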
ghstack-source-id: 133236447

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29612614

fbshipit-source-id: 8ad0fedcb7796e5832a0eb196f8fdc147e02b3df
2021-07-09 01:29:35 -07:00
Luca Wehrstedt
0128eb9a85 Fix TSAN issue in distributed tests (#59238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59238

Creating a `multiprocessing.Manager()` launches a new process using the `fork` method (because it's the default one), and then in that subprocess it launches a new thread. TSAN really doesn't like this (and rightly so!) because we already had threads in the parent process, and intermixing threads and forks is dangerous. The proper way to deal with this is to `exec` inside the child process or, in other words, use the `spawn` method.

Note that the method used to launch the Manager is entirely unrelated to the method used to launch our "own" subprocesses, hence we were using `fork` for the Manager even though we were using `spawn` for our own subprocesses.
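The fix boils down to creating the Manager from a "spawn" context, sketched below:

```
import multiprocessing

# A Manager created from the "spawn" context exec's a fresh helper process instead of
# forking the already-threaded parent, which is what TSAN was complaining about.
spawn_ctx = multiprocessing.get_context("spawn")
manager = spawn_ctx.Manager()
shared_state = manager.dict()
```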
ghstack-source-id: 130240724

Test Plan: Reverted the silencing introduced in D28490129, ran the `test_init_rpc_then_pg` test from the TensorPipe suite and saw the original TSAN failure. Then applied my fix, re-ran the test, and the failure was gone.

Reviewed By: zhaojuanmao

Differential Revision: D28794321

fbshipit-source-id: 12242e69be399a7f02a40a0ebb3d92f92e00ce73
2021-07-01 11:53:01 -07:00
Pritam Damania
fbd4cb1cd7 Fix error logging in common_distributed. (#60917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60917

The second line of the error log was missing the f-string prefix, so the placeholders were printed literally instead of being interpolated.

Before fix:
```
exiting process with exit code: {MultiProcessTestCase.TEST_ERROR_EXIT_CODE}
```

After fix:
```
exiting process 3 with exit code: 10
```
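In other words, the log call was missing the `f` prefix; a sketch with illustrative variable names:

```
import logging

logger = logging.getLogger(__name__)
rank, exit_code = 3, 10

# Before: a plain string, so the braces are emitted literally.
logger.info("exiting process {rank} with exit code: {exit_code}")
# After: the f-string interpolates the values as intended.
logger.info(f"exiting process {rank} with exit code: {exit_code}")
```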
ghstack-source-id: 132618199

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D29446574

fbshipit-source-id: f806ef0470cb6aa86fe3c404e1c895514abb6488
2021-06-28 19:32:17 -07:00
Luca Wehrstedt
4aff267072 Fix Windows error in distributed (#60167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60167

We were getting errors such as this on Windows in our c10d ProcessGroup test suite:
```
  test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 471, in _event_listener
    if pipe.poll(None):
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 257, in poll
    return self._poll(timeout)
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 330, in _poll
    return bool(wait([self], timeout))
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 883, in wait
    ov.cancel()
OSError: [WinError 6] The handle is invalid
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=000001EFDF228CE0)

Thread 0x00001f68 (most recent call first):
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 1202 in invoke_excepthook
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 934 in _bootstrap_inner
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 890 in _bootstrap

Current thread 0x00000f94 (most recent call first):
<no Python frame>
FAIL (5.009s)
```
And the process would then exit with error code 3221226505.
See: https://app.circleci.com/pipelines/github/pytorch/pytorch/337351/workflows/ad919a3e-fe9a-4566-8ad6-8b0a252f730c/jobs/14170191/steps

By looking at [the code of `_event_listener` in `common_distributed.py`](eb36f67dcc/torch/testing/_internal/common_distributed.py (L467-L489)) I think that the first exception (the one about the handle being invalid) is "expected" as it results from another thread purposely closing the pipe while that thread is polling it.

The relevant part of the problem seems to be the "could not acquire lock" one. I think this stems from the event listener thread being launched as a daemon thread, which means the interpreter will not wait for that thread to complete before shutting down. When the interpreter shuts down it instantly aborts all other threads. If the event listener thread was aborted _while_ it was logging, then that thread was holding the lock but never got to release it. This is probably what the error is complaining about. This seems to be intended/expected behavior for CPython: https://bugs.python.org/issue42717.

The solution is thus simple: don't make that thread a daemon thread, and explicitly wait for it to terminate before shutting down (see the sketch below).
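A sketch of that shape (hypothetical names and stop protocol; the real `_event_listener` does more):

```
import threading
from multiprocessing import Pipe

parent_conn, child_conn = Pipe()

def event_listener(pipe):
    while True:
        if pipe.poll(None):          # block until the parent sends something
            if pipe.recv() == "stop":
                return               # clean exit instead of being aborted at shutdown
            # ... otherwise dump a traceback, etc.

# daemon=False: the interpreter waits for this thread, so it can never be killed
# while holding the stderr/stdout lock.
listener = threading.Thread(target=event_listener, args=(child_conn,), daemon=False)
listener.start()
# ... run the test ...
parent_conn.send("stop")
listener.join()
```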
ghstack-source-id: 132293710

Test Plan: Will see...

Reviewed By: pritamdamania87

Differential Revision: D29193014

fbshipit-source-id: 4aabe1fc74bf9c54ca605e7a702ac99655489780
2021-06-24 10:35:38 -07:00
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.

With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see whether they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
Rohan Varma
eb55b086b7 [DDP] Log some python-side errors (#59284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284

Logs a few python-side errors to DDP logging.

TODO: Most Python errors actually have to do with user input correctness, so they throw before the reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28820290

fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
2021-06-02 19:49:26 -07:00
Alexander Golynski
2b6c09c11e Add futures to ProcessGroupMPI work (but not including Send/Recv) and python DDP comm hook testing (#57214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57214

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28200791

Pulled By: agolynski

fbshipit-source-id: 83f814abd4f2eea70e383ed373b04aae8291be55
2021-05-04 16:04:45 -07:00
Pritam Damania
dc8a8cea79 Move caffe2 signal_handler to c10. (#56717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717

The signal_handler was under the caffe2 namespace but was being used
by PyTorch as well.

I've fixed this by moving it to the c10 namespace, where now both C2 and PyTorch
can use it.

The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatibility for C2, but most of the common code is moved to c10.
ghstack-source-id: 127446929

Test Plan: waitforbuildbot

Reviewed By: ezyang

Differential Revision: D27946738

fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
2021-04-26 23:08:12 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Richard Barnes
af7775ba26 Types for caffe2/torch/testing/_internal/common_distributed.py (#55338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55338

Test Plan: Sandcastle

Reviewed By: pritamdamania87, ngimel

Differential Revision: D27575367

fbshipit-source-id: ca8eb77967af71ce2734408b8e2e15bf64a5ab4a
2021-04-20 16:26:53 -07:00
Howard Huang
b2dae294b6 Fix distributed.test_jit_c10d flaky tests (#56410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56410

Changes:
- Move create_tcp_store() helper function to common file
- Update test_jit_c10d to retry TCP Store creation in case allocated port becomes used

fixes https://github.com/pytorch/pytorch/issues/55053

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D27869560

Pulled By: H-Huang

fbshipit-source-id: f4a6613049bb25e6f6f194214379a380968bb19c
2021-04-20 09:28:27 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs and passes a parameter-name mapping into Reducer, so that error messages about unused parameters / not all parameters getting gradient can name the offending parameters.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in a DDP error in the next iteration.
2) The next forward pass calls `Reducer::ensure_prior_reduction_finished()`, where we check that all params got a gradient in the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains the mapping and tracks used parameters, computes which parameters did not get a gradient, and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor; instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs the param index -> param name mapping, where the name is the fully qualified name f"{module_name}:{param_name}", and passes it into Reducer.
3) Reducer maintains a per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params (see the sketch below).
5) Unit tests for the logged unused params, including nested modules, are added.
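A Python sketch of the set-difference bookkeeping in steps 3 and 4 (the real Reducer does this in C++ over variable indices; names and the error text here are illustrative):

```
param_index_to_name = {0: "net1.weight", 1: "net1.bias", 2: "net2.weight"}
used_indices = set()

def mark_variable_ready(index: int) -> None:
    used_indices.add(index)

# Backward pass: only some parameters receive gradients this iteration.
mark_variable_ready(0)
mark_variable_ready(1)

unused = set(param_index_to_name) - used_indices
if unused:
    names = sorted(param_index_to_name[i] for i in unused)
    raise RuntimeError(f"Parameters that did not receive a gradient: {names}")
```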
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Philip Meier
f4967d68f5 make torch.testing asserts importable (#54769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54769

Follow-up to #53820. This

- makes the `asserts.py` module private as per suggestion from rgommers in https://github.com/pytorch/pytorch/pull/53820#issuecomment-802661387. With this the functions should only be accessible through `torch.testing`, giving us the option the change the underlying structure later.
- moves the code from `torch/testing/__init__.py` to `torch/testing/_core.py` (happy to accept other name suggestions). Otherwise we can't import the new `_asserts.py` in `torch/testing/__init__.py` due to circular imports.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27438451

Pulled By: mruberry

fbshipit-source-id: c7292b4d5709185b42b4aac8016648562688040e
2021-04-07 23:53:02 -07:00
Pritam Damania
e3691be2d9 Dump C++ stack traces of all threads for distributed tests. (#55003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003

Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in
distributed tests to set a signal handler that dumps the state of all threads
for all processes when it receives a FATAL signal. This would help in debugging
tests further.

I had to revert all the python faulthandler code since only one signal handler
function is supported, so running python faulthandler with
`setPrintStackTracesOnFatalSignal` doesn't work.

Sample output:
```
SIGSEGV(11), PID: 3492872, Thread 3492872:
[0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so
[1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so
[2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so
[3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so
[4] ???(0x7fa7ce683150) in libpthread.so.0
[5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so
[6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so
[7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so
[8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so
[9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so
[10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so
```
ghstack-source-id: 125630551

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D27419714

fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489
2021-04-03 13:59:56 -07:00
Howard Huang
5610e8271b Fix skip_if_not_multigpu decorator (#54916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54916

Fixes https://github.com/pytorch/pytorch/issues/54887

`skip_if_not_multigpu` was skipping all the tests that use it.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27412193

Pulled By: H-Huang

fbshipit-source-id: 28d6697bd8cc6b6784cdb038ccb3ff138d0610eb
2021-04-01 18:01:33 -07:00
Pritam Damania
f71a0daeb7 Use faulthandler to dump traceback of timed out processes in unit tests. (#54818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818

Several flaky tests fail due to some sort of timeout, and it isn't
clear from the error message in CI where exactly each process is stuck. In this
PR, I've added a mechanism to dump the entire Python traceback of all Python
threads when we encounter a timeout (a sketch follows the example traceback below).

Example traceback:

```
Process 3 timed out with traceback:
Current thread 0x00007ff3363ff700 (most recent call first):
  File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
  File "threading.py", line 870 in run
  File "threading.py", line 932 in _bootstrap_inner
  File "threading.py", line 890 in _bootstrap

Thread 0x00007ff406132180 (most recent call first):
  File "torch/distributed/distributed_c10d.py", line 2477 in barrier
  File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
  File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
  File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
  File "torch/testing/_internal/common_distributed.py", line 409 in run_test
  File "torch/testing/_internal/common_distributed.py", line 393 in _run
  File "multiprocessing/process.py", line 108 in run
  File "multiprocessing/process.py", line 315 in _bootstrap
  File "multiprocessing/popen_fork.py", line 75 in _launch
  File "multiprocessing/popen_fork.py", line 19 in __init__
  File "multiprocessing/context.py", line 277 in _Popen
  File "multiprocessing/process.py", line 121 in start
```
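The dumping itself can be done with the standard-library faulthandler module; a hedged sketch of the listener side (simplified relative to the real `_event_listener`):

```
import faulthandler
import sys

def event_listener(pipe):
    # Block until the parent signals that this process has timed out, then dump the
    # Python traceback of every thread, producing output like the example above.
    while True:
        if pipe.poll(None):
            pipe.recv()
            faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```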
ghstack-source-id: 125323810

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27378764

fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e
2021-03-31 11:38:30 -07:00
Rohan Varma
0e543b2b00 Provide a decorator to set/unset nccl blocking wait for tests (#54740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54740

Adds a simple helper decorator to set/unset NCCL blocking wait for
tests. This is easier than having to manually set/unset the
os.environ vars every time.
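A hedged sketch of such a decorator; NCCL_BLOCKING_WAIT is the env var PyTorch reads, but the helper name and exact save/restore behavior here are illustrative:

```
import os
from functools import wraps

def with_nccl_blocking_wait(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        prev = os.environ.get("NCCL_BLOCKING_WAIT")
        os.environ["NCCL_BLOCKING_WAIT"] = "1"   # set for the duration of the test
        try:
            return fn(*args, **kwargs)
        finally:
            # Restore whatever was there before, so tests don't leak state.
            if prev is None:
                os.environ.pop("NCCL_BLOCKING_WAIT", None)
            else:
                os.environ["NCCL_BLOCKING_WAIT"] = prev
    return wrapper
```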
ghstack-source-id: 125233693

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27277222

fbshipit-source-id: c289b9d05e2f6328d672810b07501979b6e177c6
2021-03-30 15:31:30 -07:00
Pritam Damania
65781f94ad Enable faulthandler for distributed tests. (#54531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54531

Enabling faulthandler will intercept signals like SIGSEGV, SIGFPE,
SIGABRT, SIGBUS and SIGILL and dump the entire Python traceback before the
process goes down.

This can help us in debugging flaky tests where a process crashes and we need
to debug what happened.
ghstack-source-id: 125045894

Test Plan:
1) Tested locally to see traceback is produced.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27271048

fbshipit-source-id: ca12125a9da6cdfc7bac5619ad1c7e116666014b
2021-03-27 00:43:58 -07:00
Pritam Damania
1c63cb2c0f Pass child error to parent in distributed tests. (#52632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52632

Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails the parent only prints the following:

```
Process 0 exited with error code 10
```

The child process also logs its own exception, but it is cumbersome to go
through the logs and track this down.

To alleviate this, I've added a pipe for each child process so that
the child writes the error to the pipe before exiting, and the parent
process can read the appropriate error from the pipe and display it.
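A self-contained sketch of the pipe protocol (hypothetical test body and exit code):

```
import sys
import traceback
from multiprocessing import Pipe, Process

def child(conn):
    try:
        raise ValueError("invalid root rank: -1")  # stand-in for a failing test body
    except Exception:
        conn.send(traceback.format_exc())          # write the error to the pipe...
        sys.exit(10)                               # ...before exiting with an error code

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=child, args=(child_conn,))
    p.start()
    p.join()
    if p.exitcode != 0 and parent_conn.poll():
        # The parent can now show the child's traceback next to the exit code.
        print(f"Process exited with error code {p.exitcode} and exception:\n{parent_conn.recv()}")
```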

The new output printed by the parent is as follows:

```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
ghstack-source-id: 122273793

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26589274

fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5
2021-02-23 11:50:25 -08:00
Rong Rong (AI Infra)
e8ab58bfc7 [reland] Early terminate CUDA on common_utils TestCases (#52126)
Summary:
Take 2 of https://github.com/pytorch/pytorch/issues/50914
This change moves the early termination logic into common_utils.TestCase class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52126

Test Plan: CI with ci-all tag

Reviewed By: malfet

Differential Revision: D26391762

Pulled By: walterddr

fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4
2021-02-12 07:32:42 -08:00
Nikita Shulga
9f1f5636d7 Revert D26019289: [pytorch][PR] Early terminate CUDA on common_utils TestCases
Test Plan: revert-hammer

Differential Revision:
D26019289 (c1b7ca8062)

Original commit changeset: ddc7c1c0d00d

fbshipit-source-id: 6902d03fa06cda5d03191846bc4dd98af501b594
2021-02-10 17:29:10 -08:00
Rong Rong (AI Infra)
c1b7ca8062 Early terminate CUDA on common_utils TestCases (#50914)
Summary:
This is a follow-up to https://github.com/pytorch/pytorch/issues/49869.

Previously, CUDA early termination only happened for generic test classes that extend from `DeviceTypeTestBase`. However, JIT test cases, which extend from common_utils.TestCase, could not benefit from the early termination.

This change moves the early termination logic into the common_utils.TestCase class (a minimal sketch follows this list):
- All tests extending common_utils.TestCase now early terminate if a CUDA assert occurs.
- For TestCases that extend common_device_type.DeviceTypeTestBase, torch.cuda.synchronize() is still only done when an RTE is thrown.
- For TestCases extending common_utils.TestCase, regardless of whether a test case uses the GPU or not, CUDA is always synchronized as long as `torch.cuda.is_initialized()` returns true.
- This is disabled in common_distributed.py.
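A minimal sketch of the teardown-side idea (illustrative class; the real logic lives in common_utils.TestCase, and common_distributed.py opts out of it):

```
import unittest
import torch

class CudaEarlyTerminationTestCase(unittest.TestCase):
    def tearDown(self):
        # If any test touched CUDA, synchronize so a device-side assert surfaces in the
        # test that caused it instead of poisoning later, unrelated tests.
        if torch.cuda.is_initialized():
            torch.cuda.synchronize()
```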

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50914

Reviewed By: malfet

Differential Revision: D26019289

Pulled By: walterddr

fbshipit-source-id: ddc7c1c0d00db4d073a6c8bc5b7733637a7e77d1
2021-02-10 07:15:40 -08:00
Shen Li
a3b8cbcdfc Let TensorPipe detect peer access (#50676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50676

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D25941962

Pulled By: mrshenli

fbshipit-source-id: 7d4fd3b4fbd5ae5a0c50ad65605ced9db10ede4a
2021-01-20 08:04:51 -08:00
Shen Li
30e45bb133 Enable GPU-to-GPU comm in TensorPipeAgent (#44418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44418

This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process same-machine GPU-to-GPU communication. On the sender
side, `TensorPipeAgent` grabs a stream to each device used by the
message, let these streams wait for current streams, and passes
the streams to TensorPipe `CudaBuffer`. On the receiver side, it
also grabs a stream for each device used in the message, and uses
these streams to receive tensors and run user functions. After that,
these streams are then used for sending the response back to the
sender. When receiving the response, the sender will grab a new set
of streams and use them for TensorPipe's `CudaBuffer`.

If device maps are provided, `TensorPipeAgent::send` will return a
derived class of `CUDAFuture`, which is specifically tailored for
RPC Messages.

TODOs:
1. Enable sending CUDA RPC to the same process.
2. Add a custom CUDA stream pool.
3. When TensorPipe addressed the error for `cudaPointerGetAttributes()`,
remove `cuda:0` context initialization code in `backend_registry.py`.
4. When TensorPipe can detect availability of peer access, enable all
tests on platforms without peer access.

Differential Revision: D23626207

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: mrshenli

fbshipit-source-id: d30e89e8a98bc44b8d237807b84e78475c2763f0
2021-01-14 13:55:41 -08:00
Jerry Zhang
fadec77c30 [quant][fx][graphmode] Renable torchvision test (#48602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48602

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25224917

fbshipit-source-id: efc73f425253c4eb7ae51064b6760416097f0437
2020-12-04 10:13:38 -08:00
Rohan Varma
25dc0056f2 [RPC] print exception message on workers that run python functions (#46372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372

Currently, in `_run_function`, we catch an exception from the Python
function being run and report it back to the master. However, in some
large-scale training jobs, it would be valuable to also log the error on the
trainer itself for faster debugging.

Test Plan: Added unittest.

Reviewed By: pritamdamania87

Differential Revision: D24324578

fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
2020-10-22 09:44:15 -07:00
Pritam Damania
b5a2f04089 Disallow creation of ProcessGroupNCCL without GPUs. (#45642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642

Prior to https://github.com/pytorch/pytorch/pull/45181, initializing a
NCCL process group would work even if no GPUs were present. However, now that
init_process_group calls `barrier()`, this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs, and then if we called a method like `barrier()` the process would crash,
since we do `% numGPUs`, resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
2020-10-05 12:05:48 -07:00
gunandrose4u
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
Fixes #{issue number}
This is a resubmit of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
Mike Ruberry
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
gunandrose4u
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

Test cases will be committed to this PR later.

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
alanashine
ba6534ae2b enable type check common_distributed (#44821)
Summary:
Enabled type checking in common_distributed by using tensors of ints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821

Test Plan: Run python test/test_type_hints.py; errors are no longer ignored by mypy.ini

Reviewed By: walterddr

Differential Revision: D23747466

Pulled By: alanadakotashine

fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
2020-09-16 19:19:36 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Rohan Varma
567c51cce9 In common_distributed, fix TEST_SKIPS multiprocessing manager (#44525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525

Since `TEST_SKIPS` is a global multiprocessing.Manager, this was causing
issues when one test would fail and make the rest of the tests fail during
setup due to networking errors.

See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.

To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict, and redo the conversion to a `multiprocessing.Manager` in the next test, which prevents these errors.
ghstack-source-id: 111844724

Test Plan: CI

Reviewed By: malfet

Differential Revision: D23641618

fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89
2020-09-11 09:16:33 -07:00