Commit Graph

50 Commits

Author SHA1 Message Date
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but `mypy` doesn't recognize this. This often stems from the fact, that the used `mypy` version wasn't able to handle the used pattern.

With every new release `mypy` gets better at handling complex code. In addition to fix all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed or not. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out in case it encounters an `type: ignore` that is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
Rohan Varma
eb55b086b7 [DDP] Log some python-side errors (#59284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284

Logs a few python-side errors to DDP logging.

TODO: Most python errors actually have to do with user input correctness, so they throw before reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28820290

fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
2021-06-02 19:49:26 -07:00
Alexander Golynski
2b6c09c11e Add futures to ProcessGroupMPI work (but not including Send/Recv) and python DDP comm hook testing (#57214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57214

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28200791

Pulled By: agolynski

fbshipit-source-id: 83f814abd4f2eea70e383ed373b04aae8291be55
2021-05-04 16:04:45 -07:00
Pritam Damania
dc8a8cea79 Move caffe2 signal_handler to c10. (#56717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717

The signal_handler was under the caffe2 namespacee but was being used
by PyTorch as well.

I've fixed this my moving it to the c10 namespace where now both C2 and PyTorch
can use it.

The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatiblity for C2, but most of the commmon code is moved to c10.
ghstack-source-id: 127446929

Test Plan: waitforbuildbot

Reviewed By: ezyang

Differential Revision: D27946738

fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
2021-04-26 23:08:12 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Richard Barnes
af7775ba26 Types for caffe2/torch/testing/_internal/common_distributed.py (#55338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55338

Test Plan: Sandcastle

Reviewed By: pritamdamania87, ngimel

Differential Revision: D27575367

fbshipit-source-id: ca8eb77967af71ce2734408b8e2e15bf64a5ab4a
2021-04-20 16:26:53 -07:00
Howard Huang
b2dae294b6 Fix distributed.test_jit_c10d flaky tests (#56410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56410

Changes:
- Move create_tcp_store() helper function to common file
- Update test_jit_c10d to retry TCP Store creation in case allocated port becomes used

fixes https://github.com/pytorch/pytorch/issues/55053

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D27869560

Pulled By: H-Huang

fbshipit-source-id: f4a6613049bb25e6f6f194214379a380968bb19c
2021-04-20 09:28:27 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs and passes in a mapping with parameter names to Reducer to log information about unused parameters in error messages about unused parameters/not all parameters getting gradient.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Philip Meier
f4967d68f5 make torch.testing asserts importable (#54769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54769

Follow-up to #53820. This

- makes the `asserts.py` module private as per suggestion from rgommers in https://github.com/pytorch/pytorch/pull/53820#issuecomment-802661387. With this the functions should only be accessible through `torch.testing`, giving us the option the change the underlying structure later.
- moves the code from `torch/testing/__init__.py` to `torch/testing/_core.py` (happy to accept other name suggestions). Otherwise we can't import the new `_asserts.py` in `torch/testing/__init__.py` due to circular imports.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27438451

Pulled By: mruberry

fbshipit-source-id: c7292b4d5709185b42b4aac8016648562688040e
2021-04-07 23:53:02 -07:00
Pritam Damania
e3691be2d9 Dump C++ stack traces of all threads for distributed tests. (#55003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003

Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in
distributed tests to set a signal handler that dumps the state of all threads
for all processes when it receives a FATAL signal. This would help in debugging
tests further.

I had to revert all the python faulthandler code since only one signal handler
function is supported, so running python faulthandler with
`setPrintStackTracesOnFatalSignal` doesn't work.

Sample output:
```
SIGSEGV(11), PID: 3492872, Thread 3492872:
[0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so
[1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so
[2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so
[3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so
[4] ???(0x7fa7ce683150) in libpthread.so.0
[5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so
[6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so
[7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so
[8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so
[9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so
[10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so
```
ghstack-source-id: 125630551

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D27419714

fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489
2021-04-03 13:59:56 -07:00
Howard Huang
5610e8271b Fix skip_if_not_multigpu decorator (#54916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54916

Fixes https://github.com/pytorch/pytorch/issues/54887

`skip_if_not_multigpu` was skipping all the tests that use it.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27412193

Pulled By: H-Huang

fbshipit-source-id: 28d6697bd8cc6b6784cdb038ccb3ff138d0610eb
2021-04-01 18:01:33 -07:00
Pritam Damania
f71a0daeb7 Use faulthandler to dump traceback of timed out processes in unit tests. (#54818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818

Several flaky tests fail due to some sort of timeout and it isn't
clear from the error message in CI where exactly each process is stuck. In this
PR, I've added mechanism to dump the entire python traceback of all python
threads when we encounter a timeout.

Example traceback:

```
Process 3 timed out with traceback:
Current thread 0x00007ff3363ff700 (most recent call first):
  File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
  File "threading.py", line 870 in run
  File "threading.py", line 932 in _bootstrap_inner
  File "threading.py", line 890 in _bootstrap

Thread 0x00007ff406132180 (most recent call first):
  File "torch/distributed/distributed_c10d.py", line 2477 in barrier
  File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
  File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
  File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
  File "torch/testing/_internal/common_distributed.py", line 409 in run_test
  File "torch/testing/_internal/common_distributed.py", line 393 in _run
  File "multiprocessing/process.py", line 108 in run
  File "multiprocessing/process.py", line 315 in _bootstrap
  File "multiprocessing/popen_fork.py", line 75 in _launch
  File "multiprocessing/popen_fork.py", line 19 in __init__
  File "multiprocessing/context.py", line 277 in _Popen
  File "multiprocessing/process.py", line 121 in start
```
ghstack-source-id: 125323810

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27378764

fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e
2021-03-31 11:38:30 -07:00
Rohan Varma
0e543b2b00 Provide a decorator to set/unset nccl blocking wait for tests (#54740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54740

Adds a simple helper decorator to set/unset nccl blocking wait for
tests. This will make it easier than having to manually set/unset the
os.environ vars every time.
ghstack-source-id: 125233693

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27277222

fbshipit-source-id: c289b9d05e2f6328d672810b07501979b6e177c6
2021-03-30 15:31:30 -07:00
Pritam Damania
65781f94ad Enable faulthandler for distributed tests. (#54531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54531

Enabling faulthandler will intercept signals like SIGSEGV, SIGFPE,
SIGABRT, SIGBUS and SIGKILL and dump the entire python traceback before the
process goes down.

This can help us in debugging flaky tests where a process crashes and we need
to debug what happened.
ghstack-source-id: 125045894

Test Plan:
1) Tested locally to see traceback is produced.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27271048

fbshipit-source-id: ca12125a9da6cdfc7bac5619ad1c7e116666014b
2021-03-27 00:43:58 -07:00
Pritam Damania
1c63cb2c0f Pass child error to parent in distributed tests. (#52632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52632

Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails the parent only prints the following:

```
Process 0 exited with error code 10
```

The child process also logs its own exception, but it is cumberson to go
through the logs and track this down.

To alleviate this, I've added a bunch of pipes for each child process so that
the child process writes the error to the pipe before exiting and the parent
process can read the appropriate error from the pipe and display it.

The new output printed by the parent is as follows:

```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
ghstack-source-id: 122273793

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26589274

fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5
2021-02-23 11:50:25 -08:00
Rong Rong (AI Infra)
e8ab58bfc7 [reland] Early terminate CUDA on common_utils TestCases (#52126)
Summary:
Take 2 of https://github.com/pytorch/pytorch/issues/50914
This change moves the early termination logic into common_utils.TestCase class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52126

Test Plan: CI with ci-all tag

Reviewed By: malfet

Differential Revision: D26391762

Pulled By: walterddr

fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4
2021-02-12 07:32:42 -08:00
Nikita Shulga
9f1f5636d7 Revert D26019289: [pytorch][PR] Early terminate CUDA on common_utils TestCases
Test Plan: revert-hammer

Differential Revision:
D26019289 (c1b7ca8062)

Original commit changeset: ddc7c1c0d00d

fbshipit-source-id: 6902d03fa06cda5d03191846bc4dd98af501b594
2021-02-10 17:29:10 -08:00
Rong Rong (AI Infra)
c1b7ca8062 Early terminate CUDA on common_utils TestCases (#50914)
Summary:
This is a follow up on https://github.com/pytorch/pytorch/issues/49869.

Previously CUDA early termination only happens for generic test classes that extends from `DeviceTypeTestBase`. However, JIT test cases which extends from common_utils.TestCase cannot benefit from the early termination.

This change moves the early termination logic into common_utils.TestCase class.
- all tests extended from common_utils.TestCase now should early terminate if CUDA assert occurs.
- For TestCases that extends from common_device_type.DeviceTypeTestBase, still only do torch.cuda.synchronize() when RTE is thrown.
- For TestCases extends common_utils.TestCase, regardless of whether a test case uses GPU or not, it will always synchronize CUDA as long as `torch.cuda.is_initialize()` returns true.
- Disabling this on common_distributed.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50914

Reviewed By: malfet

Differential Revision: D26019289

Pulled By: walterddr

fbshipit-source-id: ddc7c1c0d00db4d073a6c8bc5b7733637a7e77d1
2021-02-10 07:15:40 -08:00
Shen Li
a3b8cbcdfc Let TensorPipe detect peer access (#50676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50676

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D25941962

Pulled By: mrshenli

fbshipit-source-id: 7d4fd3b4fbd5ae5a0c50ad65605ced9db10ede4a
2021-01-20 08:04:51 -08:00
Shen Li
30e45bb133 Enable GPU-to-GPU comm in TensorPipeAgent (#44418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44418

This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process same-machine GPU-to-GPU communication. On the sender
side, `TensorPipeAgent` grabs a stream to each device used by the
message, let these streams wait for current streams, and passes
the streams to TensorPipe `CudaBuffer`. On the receiver side, it
also grabs a stream for each device used in the message, and uses
these streams to receive tensors and run user functions. After that,
these streams are then used for sending the response back to the
sender. When receiving the response, the sender will grab a new set
of streams and use them for TensorPipe's `CudaBuffer`.

If device maps are provided, `TensorPipeAgent::send` will return a
derived class of `CUDAFuture`, which is specifically tailored for
RPC Messages.

TODOs:
1. Enable sending CUDA RPC to the same process.
2. Add a custom CUDA stream pool.
3. When TensorPipe addressed the error for `cudaPointerGetAttributes()`,
remove `cuda:0` context initialization code in `backend_registry.py`.
4. When TensorPipe can detect availability of peer access, enable all
tests on platforms without peer access.

Differential Revision: D23626207

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: mrshenli

fbshipit-source-id: d30e89e8a98bc44b8d237807b84e78475c2763f0
2021-01-14 13:55:41 -08:00
Jerry Zhang
fadec77c30 [quant][fx][graphmode] Renable torchvision test (#48602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48602

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25224917

fbshipit-source-id: efc73f425253c4eb7ae51064b6760416097f0437
2020-12-04 10:13:38 -08:00
Rohan Varma
25dc0056f2 [RPC] print exception message on workers that run python functions (#46372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372

Currently, in `_run_function`, we catch an exception from the python
function which is run, and report it back to the master. However in some large
scale training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.

Test Plan: Added unittest.

Reviewed By: pritamdamania87

Differential Revision: D24324578

fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
2020-10-22 09:44:15 -07:00
Pritam Damania
b5a2f04089 Disallow creation of ProcessGroupNCCL without GPUs. (#45642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642

Prior to https://github.com/pytorch/pytorch/pull/45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
2020-10-05 12:05:48 -07:00
gunandrose4u
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
Fixes #{issue number}
This is resubmit for PR https://github.com/pytorch/pytorch/issues/42897 . Together with fix for Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
Mike Ruberry
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
gunandrose4u
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

For test case part will be committed to this PR later

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
alanashine
ba6534ae2b enable type check common_distributed (#44821)
Summary:
Enabled type checking in common_distributed by using tensors of ints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821

Test Plan: Run python test/test_type_hints.py, errors are no longer ingnored by mypy.ini

Reviewed By: walterddr

Differential Revision: D23747466

Pulled By: alanadakotashine

fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
2020-09-16 19:19:36 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Rohan Varma
567c51cce9 In common_distributed, fix TEST_SKIPS multiprocessing manager (#44525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525

Since `TEST_SKIPS` is a global multiprocessing.manager, this was causing
issues when one test would fail and make the rest of the tests fail during
setup due to networking errors.

See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.

To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict, and redo the conversion to a `mulitiprocessing.Manager` in the next test, which prevents these errors.
ghstack-source-id: 111844724

Test Plan: CI

Reviewed By: malfet

Differential Revision: D23641618

fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89
2020-09-11 09:16:33 -07:00
Rohan Varma
b22abbe381 Enable test_distributed to work with spawn mode (#41769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41769

Currently the tests in `test_distributed` only work with the `fork` mode multiprocessing, this PR introduces support for `spawn` mode multiprocessing as well (while keeping the `fork` mode intact).

Motivations for the change:
1) Spawn multiprocessing is the default on MacOS, so it better emulates how MacOS users would use distributed
2) With python 3.8+, spawn is the default on linux, so we should have test coverage for this
3) PT multiprocessing suggests using spawn/forkserver over fork, for sharing cuda tensors: https://pytorch.org/docs/stable/multiprocessing.html
4) Spawn is better supported with respect to certain sanitizers such as TSAN, so adding this sanitizer coverage may help us uncover issues.

How it is done:
1) Move `test_distributed` tests in `_DistTestBase` class to a shared file `distributed_test` (similar to how the RPC tests are structured)
2) For `Barrier`, refactor the setup of temp directories, as the current version did not work with spawn, each process would get a different randomly generated directory and thus would write to different barriers.
3) Add all the relevant builds to run internally and in OSS.
Running test_distributed with spawn mode in OSS can be done with:
`python test/run_test.py -i distributed/test_distributed_spawn -v`

Reviewed By: izdeby

Differential Revision: D22408023

fbshipit-source-id: e206be16961fd80438f995e221f18139d7e6d2a9
2020-09-08 23:11:12 -07:00
Rohan Varma
4e4626a23d Join-based API to support DDP uneven inputs (#42577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577

Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:

#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined process contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example we schedule the same allreduces in the same order as the backward pass, but with zeros
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy

We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.

#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for a varying number of different iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)

#### Performance implications

###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s

###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s

###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s

###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s

###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s

The 2 above uneven inputs benchmark was conducted 32 GPUs and 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all), while the other 28 iterating as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, when there is a single node (8 GPUs), this does not reproduce.

#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819

Reviewed By: mrshenli

Differential Revision: D22893859

fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
2020-08-31 13:29:03 -07:00
Rohan Varma
2a4d312027 Allow GPU skip decorators to report the right number of GPUs required in (#43468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43468

Closes https://github.com/pytorch/pytorch/issues/41378.
https://github.com/pytorch/pytorch/pull/41973 enhanced the skip decorators to
report the right no. of GPUs required, but this information was not passed to
the main process where the message is actually displayed. This PR uses a
`multiprocessing.Manager()` so that the dictionary modification is reflected
correctly in the main process.
ghstack-source-id: 110684228

Test Plan:
With this diff, we can run a test in such as in https://github.com/pytorch/pytorch/pull/42577 that requires 4 GPUs on a 2 GPU machine, and we get the expected message:

```
test_ddp_uneven_inputs_replicated_error (test_distributed.TestDistBackend) ... skipped 'Need at least 4 CUDA devices'
```

Reviewed By: mrshenli

Differential Revision: D23285790

fbshipit-source-id: ac32456ef3d0b1d8f1337a24dba9f342c736ca18
2020-08-26 11:44:13 -07:00
Rohan Varma
f22aa601ce All Gather and gather APIs for Python Objects (#42189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189

Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.

As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:

`allgather_object` and `gather_object` to support gather/allgather of generic, pickable Python objects. This has been a long-requested feature so PyTorch should provide these helpers built-in.

The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load

Note that the API is designed to match other than supporting `async_op`. For now, it is a blocking call. If we see demand to support `async_op`, we will have to make more progress on merging work/future to support this.

If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
ghstack-source-id: 109322433

Reviewed By: mrshenli

Differential Revision: D22785387

fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
2020-08-06 13:30:25 -07:00
chengjinfang
f0c46878c6 Fix the issue GPU skip message(#41378) (#41973)
Summary:
Related https://github.com/pytorch/pytorch/issues/41378

Fix the issue GPU skip message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41973

Reviewed By: pbelevich

Differential Revision: D22753459

Pulled By: mrshenli

fbshipit-source-id: d24b531926e28b860ae90b9ae07e8ca3438d21db
2020-07-28 08:28:31 -07:00
Luca Wehrstedt
f083cea227 [RPC tests] Fix file descriptor leak (#40913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40913

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
Once we start merging multiple test suites in a single file (which we'll happen in the next diffs in the stack) the OSX tests on CircleCI start failing due to "too many open files". This indicates a file descriptor leak. I then managed to repro it on Linux too by lowering the limit on open file descriptors (`ulimit -n 500`). Each test method that unittest runs is run on a new instance of the Testcase class. With our multiprocessing wrappers, this instance contains a list of child processes. Even after these processes are terminated, it appears they still hold some open file descriptor (for example a pipe to communicate with the subprocess). It also appears unittest is keeping these Testcase instances alive until the entire suite completes, which I suspect is what leads to this "leak" of file descriptors. Based on that guess, in this diff I am resetting the list of subprocesses during shutdown, and this seems to fix the problem.
ghstack-source-id: 107045908

Test Plan: Sandcastle and CircleCI

Differential Revision: D22356784

fbshipit-source-id: c93bb9db60fde72cae0b0c735a50c17e427580a6
2020-07-03 06:22:40 -07:00
Mike Ruberry
13120bf677 Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replace with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21740237

Pulled By: mruberry

fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
2020-05-27 06:31:07 -07:00
Rohan Varma
63e545e0fe Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol
Test Plan: revert-hammer

Differential Revision:
D21717199

Original commit changeset: 9feb856f94ee

fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259
2020-05-26 18:23:59 -07:00
Mike Ruberry
6ddca30b2d Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replace with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21717199

Pulled By: mruberry

fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
2020-05-26 08:30:23 -07:00
Rohan Varma
906c50eb69 Remove dead code in ddp.{h, cpp} (#37990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37990

The code in `ddp.{h, cpp}` and the corresponding pybind implementations are no longer used. The pybinded calls were all private APIs and only ran in unittests, so we should remove these unused APIs.

https://github.com/pytorch/pytorch/pull/20234 from a year ago also mentioned that we should delete `_dist_broadcast_coalesced`

Verified that all tests pass with cuda by running `test_c10d` on a gpu-enabled machine.
ghstack-source-id: 103885383

Test Plan: CI

Differential Revision: D21443879

fbshipit-source-id: 764d8681ca629056bfe2c260ffab47fa5bdf07ff
2020-05-12 12:41:09 -07:00
Rohan Varma
4501083306 dedupe test skipping in common_distributed and test_distributed (#38078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38078

`common_distributed` and `test_distributed` have some error codes that overlap but are for different reasons, for example, code 75 in `test_distributed` is "no cuda available" but in common_distributed it is "need at least 2 CUDA devices".

This is an issue because the tests in `test_distributed` now use the utils in `common_distributed`, so we could get the wrong reason for skipping tests.

It is also the source of test failures in https://github.com/pytorch/pytorch/pull/37990.

This diff makes it so that the test skipping logic is deduped and put into `common_distributed.py`, where it can be reused and then imported into `test_distributed`
ghstack-source-id: 103782583

Test Plan: CI

Differential Revision: D21466768

fbshipit-source-id: 53b5af36672ebd8b51ba8b42709d87e96cadef20
2020-05-08 23:19:26 -07:00
Rohan Varma
57dc4cd0f8 [MultiProcessTestCase] Improve the error message when a process terminates (#37627)
Summary:
When a subprocess terminates with an exception in a distributed test, log the process number as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37627

Differential Revision: D21366149

Pulled By: rohan-varma

fbshipit-source-id: 132c4b4c1eb336761c2be26d034d8b739ae19691
2020-05-04 17:46:36 -07:00
Shen Li
989341c0c6 Add comments to explain how MultiProcessTestCase works (#37179)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37179

Test Plan: Imported from OSS

Differential Revision: D21211196

Pulled By: mrshenli

fbshipit-source-id: 86dc8183b4def7e95a236f6c5c73ef67466f4ddb
2020-04-23 14:17:12 -07:00
Nikita Shulga
9763db3031 MultiProcessTestCase to use instance rather than class method wrappers (#36826)
Summary:
This makes its wrappers stackable with `common_utils.TestCase` ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36826

Test Plan: CI

Differential Revision: D21178217

Pulled By: mrshenli

fbshipit-source-id: f80dd4aa175e20bd338b38b2c42c3118258f45dc
2020-04-23 08:40:02 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
Nikita Shulga
3b832ee2bf Use Python3 super() throughout torch.testing. (#37024)
Summary:
Hattip to ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37024

Differential Revision: D21173244

Pulled By: malfet

fbshipit-source-id: 7079703e28777d873f69bf9fd4dcbad8d53a2682
2020-04-22 09:00:28 -07:00
Rohan Varma
4efef475d7 [WIP] make test_distributed gloo test use MultiProcessTestCase (#36970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36970

We would like to move all distributed testing to use the existing
multiprocessing tooling defined in common_distributed.py. With this change, we
make `TestDistBackend` inherit from `MultiProcessTestCase` and enable fork mode
multiprocessing. In the next step, we can enable spawn mode for these tests
which will give us TSAN coverage.
ghstack-source-id: 102553801

Test Plan: Unittests

Differential Revision: D21146947

fbshipit-source-id: 608fa2cb93e88f8de6a5ac87c523e2c4e4fede1b
2020-04-21 13:13:48 -07:00
Rohan Varma
7390c333d6 [CI] fix test_distributed for python 3.8+ (#36542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36542

Python 3.8 set the default multiprocessing start mode to spawn, but we
need fork in these tests, otherwise there are some pickling issues.
Test: Ensure that these tests succeed when run with python 3.8
ghstack-source-id: 102093824

Test Plan: Ensure success with python 3.8

Differential Revision: D21007753

fbshipit-source-id: 4b39844c6ba76a53293c0dfde7c98ec5a78fe113
2020-04-14 11:38:33 -07:00
Rohan Varma
b553e6911a [distributed] quicker exit in the case of failed tests in distributed (#34150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34150

In the distributed setting we commonly have tests in which there are errors where one process
exits but the other do not (since they are for example waiting for work from
the process that exited). Currently, when this situation happens we do not
handle this well, and wait for process 0 to timeout. This results in wasted
time waiting for test errors and a less helpful "Process 0 timed out..." error
message when the error was actually something else.

This diff fixes the issue by checking for exited subprocesses and terminating
the test when we see a subprocess that has exited uncleanly. We still enforce
timeouts and return when all processes have exited cleantly in the happy path.
ghstack-source-id: 99921462

Test Plan:
All distributed tests + tested by writing tests that should trigger
the unclean subprocess detection, and verified that we exit quickly instead of
waiting for the entire timeout.

Differential Revision: D20231032

fbshipit-source-id: 3e0d4a20925b7d1098ec4c40ffcc66845425dd62
2020-03-11 11:27:17 -07:00
Jithun Nair
3c4cec56aa Enable test_distributed for ROCm but only with nccl backend [REDUX] (#32551)
Summary:
This is a redux of the original PR https://github.com/pytorch/pytorch/issues/28814 which was reverted in PR https://github.com/pytorch/pytorch/issues/29736 due to test_DistributedDataParallel being suspected as being flaky. Further investigation revealed it wasn't flakiness, but a bug in the PyTorch source code which has been now fixed in PR https://github.com/pytorch/pytorch/issues/32356. This PR is another attempt at enabling the test_distributed unit test suite only for the nccl backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32551

Differential Revision: D19729966

Pulled By: bddppq

fbshipit-source-id: 12a0d850991a903cc7723d63693b6157071d7115
2020-02-10 12:42:36 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00