Commit Graph

14 Commits

Rohan Varma
63e545e0fe Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol
Test Plan: revert-hammer

Differential Revision:
D21717199

Original commit changeset: 9feb856f94ee

fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259
2020-05-26 18:23:59 -07:00
Mike Ruberry
6ddca30b2d Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to require that either both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it as a keyword should be clear, and the signature can easily be updated later.
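The both-or-neither rule described above can be sketched as follows. This is a hypothetical standalone checker, not the actual PyTorch test-suite implementation, and the default tolerances are assumptions for illustration.

```python
import math

# Hypothetical sketch of the "both or neither" atol/rtol rule; not the
# actual implementation in the PyTorch test suite.
def assert_close(actual, expected, *, atol=None, rtol=None, msg=None):
    # Require that atol and rtol are either both given or both omitted.
    if (atol is None) != (rtol is None):
        raise ValueError("atol and rtol must be specified together")
    if atol is None:
        atol, rtol = 1e-8, 1e-5  # assumed defaults, for illustration only
    if not math.isclose(actual, expected, rel_tol=rtol, abs_tol=atol):
        raise AssertionError(msg or f"{actual} != {expected}")

assert_close(1.0, 1.0 + 1e-9)                # both omitted: defaults apply
assert_close(1.0, 1.01, atol=0.1, rtol=0.0)  # both given explicitly: fine
```

Passing only one of the two tolerances raises immediately, which is what makes precision handling explicit at each call site.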
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21717199

Pulled By: mruberry

fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
2020-05-26 08:30:23 -07:00
Rohan Varma
906c50eb69 Remove dead code in ddp.{h, cpp} (#37990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37990

The code in `ddp.{h, cpp}` and the corresponding pybind implementations are no longer used. The pybind-exposed calls were all private APIs and were only exercised in unit tests, so we should remove these unused APIs.

https://github.com/pytorch/pytorch/pull/20234 from a year ago also mentioned that we should delete `_dist_broadcast_coalesced`.

Verified that all tests pass with CUDA by running `test_c10d` on a GPU-enabled machine.
ghstack-source-id: 103885383

Test Plan: CI

Differential Revision: D21443879

fbshipit-source-id: 764d8681ca629056bfe2c260ffab47fa5bdf07ff
2020-05-12 12:41:09 -07:00
Rohan Varma
4501083306 dedupe test skipping in common_distributed and test_distributed (#38078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38078

`common_distributed` and `test_distributed` have some exit codes that overlap but mean different things: for example, code 75 in `test_distributed` means "no CUDA available", but in `common_distributed` it means "need at least 2 CUDA devices".

This is an issue because the tests in `test_distributed` now use the utils in `common_distributed`, so we could get the wrong reason for skipping tests.

It is also the source of test failures in https://github.com/pytorch/pytorch/pull/37990.

This diff dedupes the test skipping logic into `common_distributed.py`, where it can be reused and imported into `test_distributed`.
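A single shared table of skip exit codes, as in the dedupe described above, could look roughly like this. The table keys, code values, and decorator name here are assumptions, not the actual `common_distributed.py` contents.

```python
import sys

# Illustrative shared table: one authoritative mapping from skip reason to
# exit code, so parent and child test processes agree on what a code means.
TEST_SKIPS = {
    "no_cuda": (74, "CUDA is not available"),
    "multi_gpu": (75, "Need at least 2 CUDA devices"),
}

def skip_if(reason, condition):
    """Exit the test subprocess with a well-known code when `condition` holds."""
    code, message = TEST_SKIPS[reason]
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if condition():
                sys.exit(code)  # the parent maps this code back to `message`
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@skip_if("multi_gpu", lambda: False)  # condition is False here, so the test runs
def test_allreduce():
    return "ran"
```

Because both modules import the same table, an exit code can no longer be reported with the wrong reason.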
ghstack-source-id: 103782583

Test Plan: CI

Differential Revision: D21466768

fbshipit-source-id: 53b5af36672ebd8b51ba8b42709d87e96cadef20
2020-05-08 23:19:26 -07:00
Rohan Varma
57dc4cd0f8 [MultiProcessTestCase] Improve the error message when a process terminates (#37627)
Summary:
When a subprocess terminates with an exception in a distributed test, log the process number as well.
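A minimal sketch of the improved message, assuming a small helper along these lines (the helper name and wording are illustrative, not the actual code):

```python
# Hypothetical helper: report which subprocess failed, not just that one did.
def format_subprocess_error(rank, exitcode):
    return (f"Process {rank} terminated with exit code {exitcode}, "
            "terminating remaining processes.")

print(format_subprocess_error(2, 1))
```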
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37627

Differential Revision: D21366149

Pulled By: rohan-varma

fbshipit-source-id: 132c4b4c1eb336761c2be26d034d8b739ae19691
2020-05-04 17:46:36 -07:00
Shen Li
989341c0c6 Add comments to explain how MultiProcessTestCase works (#37179)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37179

Test Plan: Imported from OSS

Differential Revision: D21211196

Pulled By: mrshenli

fbshipit-source-id: 86dc8183b4def7e95a236f6c5c73ef67466f4ddb
2020-04-23 14:17:12 -07:00
Nikita Shulga
9763db3031 MultiProcessTestCase to use instance rather than class method wrappers (#36826)
Summary:
This makes its wrappers stackable with `common_utils.TestCase` ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36826

Test Plan: CI

Differential Revision: D21178217

Pulled By: mrshenli

fbshipit-source-id: f80dd4aa175e20bd338b38b2c42c3118258f45dc
2020-04-23 08:40:02 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
Nikita Shulga
3b832ee2bf Use Python3 super() throughout torch.testing. (#37024)
Summary:
Hat tip to ezyang
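The change amounts to replacing the Python 2-compatible call form with the zero-argument Python 3 form, e.g.:

```python
class Base:
    def __init__(self):
        self.ready = True

class OldStyle(Base):
    def __init__(self):
        super(OldStyle, self).__init__()  # Python 2-compatible spelling

class NewStyle(Base):
    def __init__(self):
        super().__init__()  # Python 3: no explicit class/instance arguments

assert OldStyle().ready and NewStyle().ready
```

Both forms behave identically at runtime in Python 3; the zero-argument form is just shorter and cannot go stale if the class is renamed.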
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37024

Differential Revision: D21173244

Pulled By: malfet

fbshipit-source-id: 7079703e28777d873f69bf9fd4dcbad8d53a2682
2020-04-22 09:00:28 -07:00
Rohan Varma
4efef475d7 [WIP] make test_distributed gloo test use MultiProcessTestCase (#36970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36970

We would like to move all distributed testing to use the existing
multiprocessing tooling defined in common_distributed.py. With this change, we
make `TestDistBackend` inherit from `MultiProcessTestCase` and enable fork mode
multiprocessing. In the next step, we can enable spawn mode for these tests
which will give us TSAN coverage.
ghstack-source-id: 102553801

Test Plan: Unittests

Differential Revision: D21146947

fbshipit-source-id: 608fa2cb93e88f8de6a5ac87c523e2c4e4fede1b
2020-04-21 13:13:48 -07:00
Rohan Varma
7390c333d6 [CI] fix test_distributed for python 3.8+ (#36542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36542

Python 3.8 changed the default multiprocessing start method to spawn (on macOS), but we
need fork in these tests; otherwise there are some pickling issues.
Test: Ensure that these tests succeed when run with Python 3.8
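Requesting fork explicitly, independent of the interpreter's default start method, can be sketched like this. This is a standalone illustration, not the test-suite code itself; note that fork contexts are unavailable on Windows.

```python
import multiprocessing

def child(queue):
    queue.put("forked")

def run_with_fork():
    # Ask for the fork start method explicitly instead of relying on the
    # platform default (spawn on macOS as of Python 3.8, always on Windows).
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=child, args=(queue,))
    proc.start()
    result = queue.get()
    proc.join()
    return result
```

With fork, the child inherits the parent's memory directly, so the target function and its arguments never need to be pickled; that is what sidesteps the pickling issues mentioned above.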
ghstack-source-id: 102093824

Test Plan: Ensure success with python 3.8

Differential Revision: D21007753

fbshipit-source-id: 4b39844c6ba76a53293c0dfde7c98ec5a78fe113
2020-04-14 11:38:33 -07:00
Rohan Varma
b553e6911a [distributed] quicker exit in the case of failed tests in distributed (#34150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34150

In the distributed setting we commonly have tests in which one process
exits with an error while the others do not (since they are, for example, waiting
for work from the process that exited). Currently we do not handle this
situation well: we wait for process 0 to time out. This results in wasted
time waiting for test errors and a less helpful "Process 0 timed out..." error
message when the actual failure was something else.

This diff fixes the issue by checking for exited subprocesses and terminating
the test as soon as we see a subprocess that has exited uncleanly. We still enforce
timeouts and return when all processes have exited cleanly in the happy path.
ghstack-source-id: 99921462

Test Plan:
All distributed tests + tested by writing tests that should trigger
the unclean subprocess detection, and verified that we exit quickly instead of
waiting for the entire timeout.

Differential Revision: D20231032

fbshipit-source-id: 3e0d4a20925b7d1098ec4c40ffcc66845425dd62
2020-03-11 11:27:17 -07:00
Jithun Nair
3c4cec56aa Enable test_distributed for ROCm but only with nccl backend [REDUX] (#32551)
Summary:
This is a redux of the original PR https://github.com/pytorch/pytorch/issues/28814, which was reverted in PR https://github.com/pytorch/pytorch/issues/29736 because test_DistributedDataParallel was suspected of being flaky. Further investigation revealed it wasn't flakiness but a bug in the PyTorch source code, which has now been fixed in PR https://github.com/pytorch/pytorch/issues/32356. This PR is another attempt at enabling the test_distributed unit test suite, only for the nccl backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32551

Differential Revision: D19729966

Pulled By: bddppq

fbshipit-source-id: 12a0d850991a903cc7723d63693b6157071d7115
2020-02-10 12:42:36 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe2/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00