Commit Graph

555 Commits

Author SHA1 Message Date
Catherine Lee
e88e92e7a2 Update to reruns + timeouts in run_test.py (#100412)
https://github.com/pytorch/pytorch/pull/100200/files made unknown tests (those lacking test times) more likely to fail because they still have timeouts; this fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100412
Approved by: https://github.com/huydhn
2023-05-01 21:51:53 +00:00
pbialecki
73645a8412 Add CUDA 12.1 CI workflows (#98832)
Adds CUDA 12.1 CI workflows, removes CUDA 11.7.
CC @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98832
Approved by: https://github.com/atalman
2023-05-01 16:25:53 +00:00
PyTorch MergeBot
9075e3c2c6 Revert "Run test_fx_to_onnx_with_onnxruntime serially (#100298)"
This reverts commit 3a3f781f6c.

Reverted https://github.com/pytorch/pytorch/pull/100298 on behalf of https://github.com/huydhn due to No need as https://github.com/pytorch/pytorch/pull/100297 has been landed ([comment](https://github.com/pytorch/pytorch/pull/100298#issuecomment-1528476786))
2023-04-29 02:07:39 +00:00
Huy Do
3a3f781f6c Run test_fx_to_onnx_with_onnxruntime serially (#100298)
This test started to fail out of nowhere in trunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100298
Approved by: https://github.com/kit1980
2023-04-29 00:51:25 +00:00
Catherine Lee
6ab9453ea9 File level rerun changes (#100200)
Fixes #ISSUE_NUMBER
* change the hook so that a test still gets saved by --sc when it fails in test setup (this previously caused an off-by-one error because setup is called before the logreport hook)
* allow reruns for all tests now that --sc is used
* increase the number of reruns now that --sc is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100200
Approved by: https://github.com/huydhn
2023-04-28 20:57:49 +00:00
Catherine Lee
ae5e1819a5 stepcurrent (#98035)
* add a stepcurrent flag (--sc), based on the stepwise flag, that saves the currently running test so that a run can resume from the last successful test after a segfault; it takes a key argument so that different test runs don't overwrite each other (a hedged sketch of the idea follows below)
* send SIGINT to the process on timeout so that the XML report can still be generated

* add a currently unused stepcurrent-skip flag (--scs), based on the stepwise skip flag, that skips the failing test; it was going to be used for the keep-going label, but is having trouble with CI
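
For illustration, here is a minimal conftest.py-style sketch of the stepwise-like idea described above; the cache key and the resume behavior are assumptions for this sketch, not the actual PyTorch stepcurrent plugin.

```python
# Hypothetical conftest.py: save the currently running test so a crashed run
# can resume from it. CACHE_KEY stands in for the key passed via --sc.
CACHE_KEY = "cache/stepcurrent-example"

def pytest_runtest_protocol(item, nextitem):
    # Record the test about to run before pytest executes it.
    item.config.cache.set(CACHE_KEY, item.nodeid)
    return None  # fall through to pytest's default test protocol

def pytest_collection_modifyitems(config, items):
    # On the next invocation, drop everything before the last recorded test.
    last = config.cache.get(CACHE_KEY, None)
    if last is not None:
        node_ids = [it.nodeid for it in items]
        if last in node_ids:
            del items[: node_ids.index(last)]
```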
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98035
Approved by: https://github.com/huydhn
2023-04-25 20:56:04 +00:00
Huy Do
96d3f3dee3 Discover and run C++ tests with run_test.py (#99559)
This depends on [pytest-cpp](https://github.com/pytest-dev/pytest-cpp) to discover and run C++ tests with pytest. C++ tests are built under `${WORKSPACE}/build/bin` directory and copied to the test job under the same path.

* To expose them to `run_test`, I chose to use the mock path prefix `cpp`; for example `build/bin/c10_Array_test` is named `cpp/c10_Array_test`, and `python test/run_test.py --cpp -i cpp/c10_Array_test` runs the test in the same way as other Python tests (a sketch of this mapping appears further below).  I could copy them from `build/bin` to `test/cpp`, but they would then be mixed with the source code and CMake files, so this looks easier
* Some executables under `build/bin` are not C++ tests and are excluded, for example `build/bin/torch_shm_manager`
* C++ tests need to be run with pytest directly, as the python command doesn't understand them
* The change is gated by the new `--cpp` argument to `run_test.py`; for example `python test/run_test.py --cpp` runs all available C++ tests
* The tests can be run in parallel
* Failing tests can be retried with `--reruns=2` and `--sw`

```
============================= test session starts ==============================
platform darwin -- Python 3.9.15, pytest-7.2.0, pluggy-1.0.0 -- /Users/huydo/miniconda3/envs/py3.9/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/huydo/Storage/mine/pytorch/test/.hypothesis/examples')
rootdir: /Users/huydo/Storage/mine/pytorch, configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, rerunfailures-10.3, shard-0.1.2, flakefinder-1.1.0, hypothesis-6.56.4, xdist-3.0.2, repeat-0.9.1
collecting ... collected 3 items / 2 deselected / 1 selected
Running 1 items in this shard: build/bin/scalar_tensor_test::TestScalarTensor.TestScalarTensorMPS
stepwise: skipping 2 already passed items.

../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS FAILED [100%]
```

* `--import-slow-tests` and `--import-disabled-tests` won't work for now; that's fine to leave as a future task.

I also added `pytest-cpp==2.3.0` to the Linux Docker, macOS, and Windows images.
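
For illustration, a hedged sketch of the `cpp/` mock-prefix mapping described in the list above; the helper name and the exclusion set are assumptions, not the actual run_test.py code.

```python
# Hypothetical sketch: map executables under build/bin to "cpp/<name>" test
# identifiers, skipping binaries that are not C++ tests.
import os
from pathlib import Path

CPP_TEST_PREFIX = "cpp"
NON_TEST_EXECUTABLES = {"torch_shm_manager"}  # assumed exclusion set

def discover_cpp_tests(build_bin: str = "build/bin") -> list:
    tests = []
    for exe in sorted(Path(build_bin).iterdir()):
        if exe.is_file() and os.access(exe, os.X_OK) and exe.name not in NON_TEST_EXECUTABLES:
            tests.append(f"{CPP_TEST_PREFIX}/{exe.name}")
    return tests

# e.g. ["cpp/c10_Array_test", ...], which can then be passed to run_test.py -i
```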

### Testing

Build PyTorch and run `python test/run_test.py --cpp` on my laptop.  The CI change will come later in a separate PR.  Also, running `python test/run_test.py --help` now shows all C++ tests discovered under `build/bin`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99559
Approved by: https://github.com/clee2000
2023-04-22 00:23:31 +00:00
Zain Rizvi
7546972565 [BE] Refactoring test execution and improving comments (#99467)
Share code between the paths that handle test results in parallel vs. serial mode.

Note that the original version of this code had an inconsistency between the two versions where it would execute `print_to_stderr(err_message)` on every test that ran in parallel, but for serial tests it would only invoke `print_to_stderr(err_message)` if `continue_on_error` was also specified.  By sharing code, this PR changes that behavior to be consistent between the two modes.

Also adding some comments.

### <samp>🤖 Generated by Copilot at 029342c</samp>

> _Sing, O Muse, of the skillful coder who refined_
> _The PyTorch testing script, `run_test.py`, and shined_
> _A light on its obscure logic, with docstrings and comments_
> _And made it run more smoothly, with better error contents_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99467
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-19 19:29:07 +00:00
BowenBao
d41aa448b8 [ONNX] Run ONNX tests as part of standard run_test script (#99215)
### <samp>🤖 Generated by Copilot at dcbf7e2</samp>

### Summary

1. Simplify the `./scripts/onnx/test.sh` script
2. Refactor the `test/onnx/dynamo/test_exporter_api.py` file
3. Add the `--onnx` flag to `test/run_test.py` and update the `TESTS` list
This pull request improves the ONNX testing infrastructure in PyTorch by refactoring the test code, normalizing the scope names, adding a flag to run only the ONNX tests, and simplifying the test script.

> _To export PyTorch models to ONNX_
> _We refactored some scripts and contexts_
> _We used `common_utils`_
> _And normalized the scopes_
> _And added a flag to run the tests_

### Walkthrough
*  Simplify `./scripts/onnx/test.sh` to use `run_test.py` with `--onnx` flag instead of `pytest` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-0017f5b22ae1329acb0f54af8d9811c9b6180a72dac70d7a5b89d7c23c958198L44-R46))
*  Remove `onnx` test from `TESTS` list in `test/run_test.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7L127-R127)). Replace with `onnx_caffe2`.
*  Add `onnx/test_pytorch_onnx_onnxruntime_cuda` and `onnx/test_models` tests to `blocklisted_tests` list in `test/run_test.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R154-R155))
*  Add `ONNX_SERIAL_LIST` list to `test/run_test.py` to specify ONNX tests that must run serially ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R296-R301))
*  Add `ONNX_TESTS` list to `test/run_test.py` to store all ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R370))
*  Add `--onnx` flag to `parse_args` function in `test/run_test.py` to run only ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R920-R928))
*  Include `ONNX_SERIAL_LIST` in `must_serial` function in `test/run_test.py` to run ONNX tests serially or in parallel based on memory usage ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R1120))
*  Filter selected tests based on `--onnx` flag in `get_selected_tests` function in `test/run_test.py` to exclude non-ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R1158-R1165))

### Other minor changes to accommodate this change
*  Replace `unittest` module with `common_utils.TestCase` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L4), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L29-R28), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L71-R70), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L147-R146))
*  Import `TemporaryFileName` class from `common_utils` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L19-R18))
*  Use `common_utils.TemporaryFileName` instead of `TemporaryFileName` in `TestDynamoExportAPI` class in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L92-R91), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L110-R109), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L129-R128))
*  Use `common_utils.run_tests` instead of `unittest.main` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L155-R154))
*  Add `re` module to `test/onnx/test_utility_funs.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7R6))
*  Add `_remove_test_environment_prefix_from_scope_name` function to `test/onnx/test_utility_funs.py` to normalize scope names of ONNX nodes ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7R32-R58))
*  Use `_remove_test_environment_prefix_from_scope_name` function to compare scope names of ONNX nodes in `TestUtilityFuns` class in `test/onnx/test_utility_funs.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1099-R1133), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1119-R1152), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1170-R1188), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1181-R1199), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1220-R1239), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1235-R1258))

Fixes #98626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99215
Approved by: https://github.com/huydhn, https://github.com/titaiwangms
2023-04-19 06:17:47 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.
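
As a hedged usage example of the feature above (the `expandable_segments` allocator-config key follows this feature's description; treat the exact spelling as an assumption):

```python
# Opt in to expandable segments before the CUDA caching allocator is first used.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # assumed key

import torch

# Growing allocations can now extend the last block of a segment instead of
# forcing brand-new segments, which is the fragmentation fix described above.
for n in (1024, 2048, 4096):
    x = torch.empty(n, n, device="cuda")
```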

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
Richard Zou
d5120ff18a [torch.library] Add ability to create library fragments (#98439)
In C++ we have TORCH_LIBRARY_FRAGMENT. This PR adds the same
functionality to the Python torch.library API.

The motivation for this is: for the simple custom op API, we don't want
users to need to deal with Library objects. One way to hide this from
users is to create library fragments.
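
A hedged sketch of what the Python fragment API enables (the `mylib` namespace and the operator schemas are made-up examples):

```python
import torch
from torch.library import Library

# Two fragments of the same namespace, each defining and implementing its own op.
frag_a = Library("mylib", "FRAGMENT")
frag_a.define("add_one(Tensor x) -> Tensor")
frag_a.impl("add_one", lambda x: x + 1, "CPU")

frag_b = Library("mylib", "FRAGMENT")
frag_b.define("times_two(Tensor x) -> Tensor")
frag_b.impl("times_two", lambda x: x * 2, "CPU")

t = torch.ones(3)
print(torch.ops.mylib.add_one(t), torch.ops.mylib.times_two(t))
```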

Test Plan:
- tests that you can create multiple fragments and def+impl operators on each.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98439
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-10 18:04:53 +00:00
BowenBao
4f9dbc17a4 [ONNX] Enable xdoctests in CI (#98546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98546
Approved by: https://github.com/justinchuby, https://github.com/kit1980
2023-04-07 22:20:18 +00:00
PyTorch MergeBot
55724a5ec9 Revert "[experiment] More procs in CI (#98098)"
This reverts commit 9fd3eba6ce.

Reverted https://github.com/pytorch/pytorch/pull/98098 on behalf of https://github.com/clee2000 due to I think there's a bug
2023-04-07 19:50:54 +00:00
Catherine Lee
9fd3eba6ce [experiment] More procs in CI (#98098)
Experiment with more procs, but only on master so PRs don't get affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98098
Approved by: https://github.com/huydhn
2023-04-07 17:21:32 +00:00
Fuzzkatt
481ecffb5e Add test c10d ucc tests (#88110)
Creates the equivalent c10d test for ucc for https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_gloo.py and https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_nccl.py. Uses test_c10d_gloo.py as the reference and adds all the common ops. More detailed comparison of available ops here: https://docs.google.com/document/d/1yPsa_X9EiEiqo-j2Yn7ierhccBtEjwoqC-B7-amI0MI/edit?usp=sharing

Also removes an extra line for the ProcessGroupUCC.cpp barrier blocking wait that got duplicated when merging https://github.com/pytorch/pytorch/pull/85047.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88110
Approved by: https://github.com/zasdfgbnm, https://github.com/kit1980, https://github.com/kwen2501, https://github.com/malfet
2023-04-06 23:51:27 +00:00
Catherine Lee
0d73cfb3e9 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-31 18:36:53 +00:00
Catherine Lee
c797c7bc8b Clean up duplicate function run_test.py (#97914)
AFAICT they're the same thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97914
Approved by: https://github.com/huydhn
2023-03-31 06:31:17 +00:00
PyTorch MergeBot
675dfd2c1f Revert "Retry at test file level (#97506)"
This reverts commit 7d5d5beba2.

Reverted https://github.com/pytorch/pytorch/pull/97506 on behalf of https://github.com/clee2000 due to test_jit_cuda_fuser having a rough time
2023-03-31 06:22:14 +00:00
Catherine Lee
7d5d5beba2 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-30 17:12:19 +00:00
Kazuaki Ishizaki
f7fe6e148e [test] Make environment variable name better (#97356)
This PR intends to use better (or correct?) environment variable name (`TORCH_DOCTEST_ANOMALY` instead of `TORCH_DOCTEST_ANOMOLY`) in test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97356
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-03-30 06:21:28 +00:00
Huy Do
4c0dce50fd [BE] Apply ufmt to run_test and GitHub Python util scripts (#97588)
This has been bugging me for a while as I'm working on these Python scripts and they are not tracked by the ufmt linter.  So I added these scripts to that linter.

```
[[linter]]
code = 'UFMT'
include_patterns = [
    '.github/**/*.py',
    'test/run_test.py',
```

This change should just work and not break anything as ufmt (black + usort) linter is very safe to use for standalone util scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97588
Approved by: https://github.com/kit1980
2023-03-26 04:52:55 +00:00
Catherine Lee
29c061bb90 Remove non existent files in multigpu tests (#97393)
They were removed in https://github.com/pytorch/pytorch/pull/96989/files and https://github.com/pytorch/pytorch/pull/96985/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97393
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/fduwjj, https://github.com/malfet
2023-03-23 17:00:29 +00:00
Huy Do
244736a5a5 Mark ROCm tests as flaky (#97259)
Before https://github.com/pytorch/pytorch/pull/96464, ROCm tests in trunk were already quite flaky https://hud.pytorch.org/reliability/pytorch/pytorch?jobName=trunk%20%2F%20linux-focal-rocm5.4.2-py3.8%20%2F%20test%20(default).

After https://github.com/pytorch/pytorch/pull/96464, there is a new group of flaky failures coming from functorch.  So let's mark these tests as flaky to monitor them without impacting trunk.

Two flaky tests currently seen in trunk are:

* https://github.com/pytorch/pytorch/issues/97256
* `functorch/test_memory_efficient_fusion.py` OOM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97259
Approved by: https://github.com/malfet, https://github.com/zou3519
2023-03-21 16:55:00 +00:00
Richard Zou
5acf403088 Run functorch tests in default shards; delete functorch-specific shards (#96464)
Fixes #96347

This PR:

- Makes the functorch tests run as a part of the "default" shards
- Delete the functorch CI shard from all CI job configurations (if it exists)
- Increase the "default" shard count by 1 for each job, unless it was
previously set to 1, to accommodate the new functorch tests and not
regress time-to-signal.
- Adds a bunch of skips for ROCM and torchdynamo configurations. We can
investigate them later.

NB: I might go through some more iterations to figure out what other
skips need to be added, but this iteration of the PR seems to pass most of the CI
suite.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96464
Approved by: https://github.com/huydhn
2023-03-21 13:53:01 +00:00
Huy Do
270b42d279 Fix test_schema_check CUDA illegal memory access (#97062)
I'm seeing some recent [CUDA illegal memory access](https://hud.pytorch.org/failure/FAILED%20test_schema_check.py%3A%3ATestSchemaCheckModeOpInfoCUDA%3A%3Atest_schema_correctness_fft_fft_cuda_bool%20-%20RuntimeError%3A%20CUDA%20error%3A%20an%20illegal%20memory%20access%20was%20encountered) error related to this test.  So a cheap fix is to run it serially.

Fixes https://github.com/pytorch/pytorch/issues/95749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97062
Approved by: https://github.com/clee2000
2023-03-20 20:57:27 +00:00
Huy Do
db2c1ea8c8 Re-enable test_ops_jit on Windows (#96859) (#96931)
Fixes https://github.com/pytorch/pytorch/issues/96858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96931
Approved by: https://github.com/kit1980
2023-03-17 22:42:22 +00:00
Catherine Lee
8c2341c1b9 Remove pytest block list (#96698)
Enables the last few files under pytest.

xdist was causing problems with `test_source_multithreaded` in `profiler/test_profiler` because it creates extra threads.  Luckily we don't use xdist, so we can disable it with `-p no:xdist`; however, this is incompatible with pytest-rerunfailures==10.2, so upgrade to 10.3.  I'd update the Windows AMI but I don't know how.

`dynamo/test_optimizers` and `dynamo/test_repros` both had tests that used skip_if_pytest.  https://github.com/pytorch/pytorch/pull/93251/files suggests that it is due to pytest assertion rewriting, so I added `PYTEST_DONT_REWRITE` to their module docstrings to prevent pytest from rewriting assertions.
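
For reference, a minimal sketch of that opt-out; the owner annotation is a made-up example, and the marker relies on pytest's documented behavior of skipping assertion rewriting for modules whose docstring contains it.

```python
# Hypothetical test module that opts out of pytest assertion rewriting.
"""Owner(s): ["module: dynamo"]

PYTEST_DONT_REWRITE
"""

def test_plain_assert():
    # Without rewriting, a failure here raises a plain AssertionError instead
    # of pytest's rewritten, introspected message.
    assert 1 + 1 == 2
```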

Disabling tests by issue in `dynamo/test_dynamic_shapes` seems sane.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96698
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-16 04:22:42 +00:00
albanD
7c525823c7 Remove un-used list. And disable pytest for public binding test. (#96684)
This contains a temporary change to make sure the test fails nicely now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96684
Approved by: https://github.com/clee2000
2023-03-15 22:12:00 +00:00
Huy Do
6339ee5d23 Temporarily disable test_ops_jit on Windows (#96859)
See https://github.com/pytorch/pytorch/issues/96858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96859
Approved by: https://github.com/kit1980
2023-03-15 17:51:32 +00:00
Huy Do
51b8ab7879 Clean up references to test_megatron_prototype (#96431)
This test has been deleted in #96254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96431
Approved by: https://github.com/clee2000, https://github.com/fduwjj
2023-03-10 23:50:32 +00:00
Catherine Lee
4519228f60 Reduce pytest blocklist part 2 (#96397)
Enable pytest for a few unique files.  pytest runs tests in a different order than unittest (but still a consistent ordering with respect to itself) and some tests change global state, causing other tests to fail.

`test_transpose_non_contiguous` in `test_torchinductor.py` gets impacted by some other test, but I'm not sure which one, so my solution is to reset the metrics before the rest of the test is run.

`test_register_patterns` in `test_quantize_fx.py` adds extra keys to global variables, so remove them when the test is done via unittest's `addCleanup`, which also works under pytest.
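
A hedged sketch of that cleanup pattern (the registry dict and key are assumed stand-ins, not the actual test_quantize_fx globals):

```python
# Hypothetical example: undo a mutation of global state with addCleanup,
# which runs under both unittest and pytest.
import unittest

GLOBAL_PATTERNS = {}  # stands in for the module-level registry the test mutates

class TestRegisterPatterns(unittest.TestCase):
    def test_register_patterns(self):
        GLOBAL_PATTERNS["my_pattern"] = object()
        # Remove the key again once the test finishes, pass or fail.
        self.addCleanup(GLOBAL_PATTERNS.pop, "my_pattern", None)
        self.assertIn("my_pattern", GLOBAL_PATTERNS)

if __name__ == "__main__":
    unittest.main()
```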

pytest doesn't really have an equivalent for `load_tests`, so change it to be like `test_jit`, which imports all the classes.  I also attempted to import them dynamically, but I failed.

`test_public_api_surface` in `test_fx.py` checks for a backwards compatibility classification.  There is a different test in test_fx that results in `fuser_utils` being imported.  pytest runs this test before `test_public_api_surface` while unittest runs it after, so pytest sees `fuser_utils` when crawling through the modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96397
Approved by: https://github.com/huydhn
2023-03-10 19:10:43 +00:00
Xiao Wang
cf3d3a583e Add env PYTORCH_TEST_DO_NOT_USE_PYTEST as an option to not use pytest in unit testing (#96444)
Set environment variable
```
PYTORCH_TEST_DO_NOT_USE_PYTEST=1
```
to not use pytest in pytorch unit testing.

This change is related to some recent changes, e.g. #96210, #96016, #95844, #95659, that enabled the use of pytest in many test modules. Those test modules were passing normally before, but failed immediately once pytest was used. A sample stack trace:

```python
root@8e3168a83ee2:/opt/pytorch/pytorch# python test/run_test.py -v -i test_optim -- -v --save-xml
Ignoring disabled issues:  []
/opt/pytorch/pytorch/test/run_test.py:1225: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if torch.version.cuda is not None and LooseVersion(torch.version.cuda) >= "11.6":
Selected tests:
 test_optim
parallel (file granularity) tests:
 test_optim
serial (file granularity) tests:

Ignoring disabled issues:  []
Ignoring disabled issues:  []
Running test_optim ... [2023-03-09 12:51:59.358110]
Executing ['/usr/local/bin/python', '-bb', 'test_optim.py', '-v', '--save-xml', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2'] ... [2023-03-09 12:51:59.358810]

Test results will be stored in test-reports/python-pytest/test_optim/test_optim-5e41643c8bac8ace.xml
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_optim.py", line 4581, in <module>
    run_tests()
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 796, in run_tests
    exit_code = pytest.main(args=pytest_args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 148, in main
    config = _prepareconfig(args, plugins)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 329, in _prepareconfig
    config = pluginmanager.hook.pytest_cmdline_parse(
  File "/usr/local/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/usr/local/lib/python3.10/site-packages/_pytest/helpconfig.py", line 103, in pytest_cmdline_parse
    config: Config = outcome.get_result()
  File "/usr/local/lib/python3.10/site-packages/pluggy/_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1060, in pytest_cmdline_parse
    self.parse(args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1348, in parse
    self._preparse(args, addopts=addopts)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1231, in _preparse
    self.pluginmanager.load_setuptools_entrypoints("pytest11")
  File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 287, in load_setuptools_entrypoints
    plugin = ep.load()
  File "/usr/local/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/usr/local/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/usr/local/lib/python3.10/site-packages/xdist/looponfail.py", line 16, in <module>
    import execnet
  File "/usr/local/lib/python3.10/site-packages/execnet/__init__.py", line 14, in <module>
    from .gateway_base import DataFormatError
  File "/usr/local/lib/python3.10/site-packages/execnet/gateway_base.py", line 1138, in <module>
    FLOAT_FORMAT_SIZE = struct.calcsize(FLOAT_FORMAT)
BytesWarning: Comparison between bytes and string
FINISHED PRINTING LOG FILE of test_optim (/opt/pytorch/pytorch/test/test-reports/test_optim_1pnlesrz.log)

test_optim failed!
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/run_test.py", line 1428, in <module>
    main()
  File "/opt/pytorch/pytorch/test/run_test.py", line 1386, in main
    raise RuntimeError(
RuntimeError: test_optim failed!

Tip: You can keep running tests even on failure by passing --keep-going to run_test.py.
If running on CI, add the 'keep-going' label to your PR and rerun your jobs.
```

I'd like to propose this option, which allows users to run their tests in CI with the good old Python unittest runner instead of pytest.
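
A minimal sketch of how such a gate might look inside a run_tests helper; the body below is an illustration under that assumption, not the actual common_utils implementation.

```python
# Hypothetical gate: fall back to plain unittest when the env var is set.
import os
import sys
import unittest

def run_tests(argv=None):
    if os.getenv("PYTORCH_TEST_DO_NOT_USE_PYTEST") == "1":
        unittest.main(argv=argv or sys.argv)
    else:
        import pytest
        sys.exit(pytest.main(args=argv))
```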

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96444
Approved by: https://github.com/malfet
2023-03-10 01:32:15 +00:00
Catherine Lee
a7fe11dec0 --subprocess for pytest (#96210)
Implements the --subprocess flag for pytest, which previously only worked with unittest.

Pretty much all the tests in the custom handler list use --subprocess
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96210
Approved by: https://github.com/huydhn
2023-03-08 21:04:50 +00:00
BowenBao
bdb076ab43 [ONNX] Skip doctest torch.onnx._internal.fx if ImportError (#95686)
Need to use `exclude` to skip the module altogether, because xdoctest
triggers `ImportError` when trying to import the module, so the whole
test fails regardless of whether a skip was added in the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95686
Approved by: https://github.com/kit1980, https://github.com/titaiwangms
2023-03-07 22:05:27 +00:00
Catherine Lee
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects tests based on subclassing unittest.TestCase, so setting TestCase = object removes them from unittest test collection.  pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (see the bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions).
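
A hedged sketch of the `NoTest` idiom described above (the guard flag and class names here are made up):

```python
import unittest

BACKEND_AVAILABLE = False  # assumed guard; real files check an actual capability

class NoTest:
    # pytest skips collecting classes whose __test__ attribute is False;
    # unittest skips them because they don't subclass unittest.TestCase.
    __test__ = False

TestCase = unittest.TestCase if BACKEND_AVAILABLE else NoTest

class TestNeedsBackend(TestCase):
    def test_something(self):
        pass
```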
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
Catherine Lee
7f5f0b3665 Run _nvfuser/test_torchscript serially (#95951)
Started at ce4cbac914 (11734276291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95951
Approved by: https://github.com/huydhn
2023-03-03 17:41:09 +00:00
Catherine Lee
d21577f28c Run more tests through pytest (#95844)
Run more tests through pytest.

Use a block list for tests that shouldn't run through pytest.  As far as I can tell, the number of tests run, skipped, and xfailed for those not on the blocklist are the same.

Regarding the main module:

Usually when tests are run in CI, we call `python <test file>`, which causes the file to be imported under the module name `__main__`.  However, pytest searches for the module to be imported under the file name, so the file is re-imported.  This can cause issues for tests that run module-level code and change global state, like test_nn, which modifies lists imported from another file, or tests in test/lazy, which initialize a backend that cannot coexist with a second copy of itself.
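
A hedged illustration of that double import (the file name and pytest flags are arbitrary):

```python
# Save as test_reimport_demo.py and run "python test_reimport_demo.py": the
# print below fires once under "__main__" and again when pytest re-imports the
# file under its real module name, so module-level side effects happen twice.
import sys

print("module-level initialization, imported as", __name__)

if __name__ == "__main__":
    import pytest
    sys.exit(pytest.main([__file__, "-q", "-s"]))
```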

My workaround for this is to run tests from the `__main__` module.  However, this results in pytest being unable to rewrite assertions (and possibly other things but I don't know what other things pytest does right now).  A better solution might be to call `pytest <test file>` directly and move all the code in run_tests(argv) to be module level code or put it in a hook in conftest.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95844
Approved by: https://github.com/huydhn
2023-03-03 17:32:26 +00:00
Bin Bao
9835c93aba [CI] Change the way tests are triggered with dynamo and inductor (#94539)
Summary: Currently running PyTorch tests with dynamo and inductor is
controlled by environment variables, and CI sets them based on test
config name matching. Change them to use options of run_test.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94539
Approved by: https://github.com/huydhn
2023-03-01 13:06:23 +00:00
Catherine Lee
e3c5c369ba Run tests in USE_PYTEST_LIST through run_tests (#95659)
Part of my effort to move everything to pytest and decrease the number of testrunner frameworks in ci

Gives XMLs, but they might look a bit weird because of module-level tests vs. tests in classes.

Doesn't give skip/disable test infra because those are tied to classes. (for future ref, could either put tests in classes or move the check_if_enable stuff into a pytest hook)

Tested in CI and checked that the same number of tests are run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95659
Approved by: https://github.com/huydhn
2023-02-28 22:09:01 +00:00
Catherine Lee
5272d6e6e5 Remove mentions of distributed/_shard/test_replicated_tensor (#95632)
The file was removed in https://github.com/pytorch/pytorch/pull/95453, which caused some issues with the multigpu job in periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95632
Approved by: https://github.com/huydhn
2023-02-27 22:41:02 +00:00
Huy Do
9b7abc4fac Run slow gradcheck tests sequentially (#95494)
Also redo https://github.com/pytorch/pytorch/pull/95246 as there are many more tests that still run OOM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95494
Approved by: https://github.com/clee2000
2023-02-26 00:44:25 +00:00
Zain Rizvi
c97275acf6 Fix OOMing periodic shards (#95246)
Tests have been consistently failing on the following shards with the error `RuntimeError: CUDA error: out of memory`:
- `periodic / linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck / test (default, 1, 2, linux.4xlarge.nvidia.gpu)`
- `periodic / linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck / test (default, 2, 2, linux.4xlarge.nvidia.gpu)`

Seeing if serializing those test files makes the periodic jobs succeed again.  This feels a bit off since there are so many different test files that have failed and need to be serialized, indicating a potential perf regression somewhere

Failures on hud: https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=100&name_filter=periodic%20%2F%20linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck%20%2F%20test%20(default%2C%20
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95246
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2023-02-23 03:50:56 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
Huy Do
d51ca38ef0 Run test_serialization serially (for 2xlarge runners) (#94613)
Fixes https://github.com/pytorch/pytorch/issues/92746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94613
Approved by: https://github.com/clee2000
2023-02-11 00:15:10 +00:00
Vasiliy Kuznetsov
f15ab8a7f2 AO migration: replace torch internal callsites (#94170)
Summary:

Do the following renames:
`torch.quantization` -> `torch.ao.quantization`
`torch.nn.quantized` -> `torch.ao.nn.quantized`
`torch.nn.quantizable` -> `torch.ao.nn.quantizable`
`torch.nn.qat` -> `torch.ao.nn.qat`
`torch.nn.intrinsic` -> `torch.ao.nn.intrinsic`

And then, do
`torch.ao.nn.quantized._reference` -> `torch.ao.nn.quantized.reference` to clean up the aftermath of https://github.com/pytorch/pytorch/pull/84974

Then, manually update `test/test_module_init.py` to fix hanging whitespace due to the replace.

Run this script to do the replacements: https://gist.github.com/vkuzo/7f7afebf8c31b9ba48306223e68a1c82

This is for https://github.com/pytorch/pytorch/issues/81667

Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94170
Approved by: https://github.com/jerryzh168
2023-02-07 02:32:23 +00:00
Nikita Shulga
a07d1291cf Re-enable compilation tests (#92333)
As CUDA-11.5 is no longer supported, just remove the check

Fixes https://github.com/pytorch/pytorch/issues/69460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92333
Approved by: https://github.com/atalman
2023-02-06 20:06:12 +00:00
Jane Xu
0ecb071fc4 [BE][CI] change references from .jenkins to .ci (#92624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92624
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-01-30 22:50:07 +00:00
PyTorch MergeBot
8b3e35ea4a Revert "Run dynamo/test_dynamic_shapes serially (#92215)"
This reverts commit ea1007b89c.

Reverted https://github.com/pytorch/pytorch/pull/92215 on behalf of https://github.com/huydhn due to This is not needed anymore as https://github.com/pytorch/pytorch/issues/92196 has been root caused to test ordering
2023-01-20 18:54:13 +00:00
Catherine Lee
25e530083e [ci] Run test_decomp parallel (#92566)
run test_decomp in parallel with itself since it now takes 2+ hours on some architectures https://docs.google.com/spreadsheets/d/1o0W4WjOYIyPSzBSl3lelvKcQyLOiv8pMijiGUDoPuBU/edit#gid=0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92566
Approved by: https://github.com/huydhn
2023-01-19 20:47:27 +00:00
Huy Do
ea1007b89c Run dynamo/test_dynamic_shapes serially (#92215)
Per my findings in https://github.com/pytorch/pytorch/issues/92196#issuecomment-1383029544

> The test itself dynamo/test_dynamic_shapes is not flaky and all passes when I try to run it locally. However, this test is set to run in parallel with other tests on the runner (2 tests at a times). After many tries, I can only reproduce the issue once when dynamo/test_dynamic_shapes is run in parallel with test_comparison_utils

After many retries, I could reproduce the issue once locally when running (https://paste.sh/_mFImq6V#FgbKq6IQBg65PKUFA08Ah_Vb)

```
python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests -i test_comparison_utils dynamo/test_dynamic_shapes
```

So setting this test to run serially to avoid further flakiness while the root cause is investigated.

Here are some example flaky failures:

* https://github.com/pytorch/pytorch/issues/92196
* https://github.com/pytorch/pytorch/issues/92178
* https://github.com/pytorch/pytorch/issues/92042
* https://github.com/pytorch/pytorch/issues/92210

The test takes 30s or so to finish, so its duration is not a concern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92215
Approved by: https://github.com/clee2000
2023-01-17 17:54:39 +00:00
Wanchao Liang
801d831d7a [dtensor] enable op db tests by using multithreaded test case (#92198)
Time comparison between using MultithreadedTestCase and MultiProcessTestCase on op db tests is amazing!

using MultiThreadedTestCase on an AWS dev node:
```
time pytest test/distributed/_tensor/test_dtensor_ops.py

============= 175 passed, 42 skipped, 397 xfailed in 80.30s (0:01:20) =======

real    1m22.330s
user    1m38.782s
sys     0m18.762s
```
MultiProcessTestCase takes from 40 minutes to more than an hour, even when using pytest parallel testing tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92198
Approved by: https://github.com/XilunWu
2023-01-17 03:26:38 +00:00
Jeff Daily
ce50a8de75 [CI][ROCm] add test_dataloader to CI_SERIAL_LIST (#91895)
Still working towards solving #90940 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91895
Approved by: https://github.com/huydhn
2023-01-10 16:32:39 +00:00
Catherine Lee
e67f5ab6cc Print and zip remaining test logs (#91510)
When CI times out or gets cancelled, the code to print and delete logs for currently running tests doesn't get run, which makes it hard to debug what's going on, so print the logs in a new step and also zip them into the usage-log zip (which should probably get a name change at some point)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91510
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-09 17:31:36 +00:00
Jeff Daily
f44946289b [CI][ROCm] fix device visibility, again (#91813)
The previous PR #91137 was incomplete.  Though it successfully queried for the number of available GPUs, it still resulted in test files sharing the same GPU.  This PR lifts the maxtasksperchild=1 restriction so that Pool workers will always use the same GPU.  This also adds a Note in run_test.py for future reference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91813
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/malfet
2023-01-06 22:19:07 +00:00
Jeff Daily
c18e8c68d8 [ROCm] fix parallel test runners and device visibility (#91137)
Fixes #90940.  This PR revamps how tests are run in parallel as well as device visibility at the docker container and within the run_test.py test runner.

First, running multiple test modules concurrently on the same GPU was causing instability for ROCm runners manifesting as timeouts.  ROCm runners have at least 1 GPU each, but often 2 or more.  This PR allows NUM_PROCS to be set equal to the number of devices available, but also takes care to set HIP_VISIBLE_DEVICES to avoid oversubscribing any GPU.

Second, we had introduced env vars `-e ROCR_VISIBLE_DEVICES` (#91031) to prepare for two GHA runners per CI node, to split up the GPU visibility at the docker level between the two runners.  This effort wasn't fully realized; to date, we haven't had more than one runner per CI host.  We abandon this effort in favor of all GPUs being visible to a single runner and managing GPU resources as stated above.
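
A hedged sketch of the per-worker device pinning described above; the pool wiring and helper names are assumptions, not the actual run_test.py code.

```python
# Hypothetical sketch: give each persistent pool worker its own GPU via
# HIP_VISIBLE_DEVICES so concurrently running test files never share a device.
import os
import multiprocessing as mp

def _pin_worker_to_gpu(gpu_ids):
    # Runs once per worker (no maxtasksperchild), taking one GPU id off a queue;
    # test subprocesses launched by this worker inherit the environment.
    os.environ["HIP_VISIBLE_DEVICES"] = str(gpu_ids.get())

def make_test_pool(num_gpus):
    gpu_ids = mp.Manager().Queue()
    for i in range(num_gpus):
        gpu_ids.put(i)
    return mp.Pool(processes=num_gpus, initializer=_pin_worker_to_gpu, initargs=(gpu_ids,))
```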

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91137
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/pruthvistony
2023-01-04 19:40:05 +00:00
Huy Do
316ba9e6fc Run jit legacy tests sequentially (#91518)
Fixes https://github.com/pytorch/pytorch/issues/91457.  I have been re-running the 2 tests `test_jit_legacy` and `test_jit_fuser_legacy` in the `jit_legacy` shard multiple times (100+) without finding any flaky issues.  I suspect that we might have test parallelization flakiness here.  So this PR runs these 2 tests serially.  They take less than 5 minutes to finish, so running them sequentially won't be an issue (https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=jit_legacy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91518
Approved by: https://github.com/clee2000
2023-01-04 04:13:01 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Huy Do
8f16524598 Run test_spectral_ops serially to fix CUDA illegal memory access (#91264)
Fixes https://github.com/pytorch/pytorch/issues/88916

* Running this test sequentially is not flaky after 1000 reruns `pytest --verbose test_spectral_ops.py -k test_fft_round_trip_cuda_float32 --flake-finder --flake-runs=1000`
* On the other hand, the curious thing is that when I run this same command on an active runner with some testing processes running in the background, the reruns could fail with a CUDA illegal memory access error (hard to reproduce though) https://paste.sh/6sZdRn95#pve73riXC5XehCLqxlCbnjea.  This points to the fact that running the `test_spectral_ops` test in parallel with others might be the surface-level cause of flakiness

So this PR adds the test to the serial list instead.  This shouldn't cause any issue w.r.t TTS because the test takes only half a minute at most to finish.

```
+---------------------+-------------------------------------------------+-------------+---------------------+
| file                | base_name                                       | test_config | time                |
+---------------------+-------------------------------------------------+-------------+---------------------+
| "test_spectral_ops" | "cuda11.6-py3.10-gcc7-sm86"                     | "default"   | 5.991666666666661   |
| "test_spectral_ops" | "cuda11.6-py3.10-gcc7-sm86"                     | "slow"      | 0.18433333333333346 |
| "test_spectral_ops" | "linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck" | "default"   | 9.866000000000003   |
| "test_spectral_ops" | "linux-bionic-cuda11.6-py3.10-gcc7"             | "default"   | 10.591333333333337  |
| "test_spectral_ops" | "linux-bionic-cuda11.6-py3.7-gcc7-debug"        | "default"   | 11.395000000000003  |
| "test_spectral_ops" | "linux-bionic-cuda11.7-py3.10-gcc7"             | "default"   | 9.424               |
| "test_spectral_ops" | "linux-bionic-cuda11.7-py3.7-gcc7-debug"        | "default"   | 8.889000000000003   |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9"                     | "crossref"  | 6.280333333333329   |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9"                     | "default"   | 12.182999999999998  |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9"                     | "dynamo"    | 11.124999999999984  |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9-slow"                | "slow"      | 0.1916666666666668  |
| "test_spectral_ops" | "linux-focal-py3.7-clang7-asan"                 | "default"   | 20.899666666666658  |
| "test_spectral_ops" | "linux-focal-py3.7-gcc7"                        | "default"   | 5.097999999999996   |
| "test_spectral_ops" | "linux-focal-rocm5.3-py3.8-slow"                | "slow"      | 0.23700000000000018 |
| "test_spectral_ops" | "macos-12-py3-arm64"                            | "default"   | 2.8396666666666626  |
| "test_spectral_ops" | "macos-12-py3-x86-64"                           | "default"   | 8.838999999999997   |
| "test_spectral_ops" | "parallelnative-linux-focal-py3.7-gcc7"         | "default"   | 5.016999999999998   |
| "test_spectral_ops" | "win-vs2019-cpu-py3"                            | "default"   | 8.351666666666665   |
| "test_spectral_ops" | "win-vs2019-cuda11.6-py3"                       | "default"   | 27.121666666666687  |
| "test_spectral_ops" | "win-vs2019-cuda11.7-py3"                       | "default"   | 24.567000000000025  |
+---------------------+-------------------------------------------------+-------------+---------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91264
Approved by: https://github.com/clee2000
2022-12-22 02:39:33 +00:00
Xiao Wang
649d0b6ae7 Add env var PYTORCH_TEST_RUN_EVERYTHING_IN_SERIAL=1 that allows running unit test suites in serial (#90981)
Running unit test suites in parallel sometimes creates unexpected errors. This PR adds an option that allows unit test suites to be executed in serial, by setting PYTORCH_TEST_RUN_EVERYTHING_IN_SERIAL=1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90981
Approved by: https://github.com/malfet, https://github.com/ptrblck
2022-12-20 21:20:59 +00:00
clee2000
34717b3ea8 nn/test_convolution to run in serial (#91113)
Unfortunately it takes 50 minutes on slow gradcheck, but that's on periodic.

It ends up taking >6000 MB of memory (7440 MB available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91113
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2022-12-20 19:12:43 +00:00
Edward Z. Yang
ffd0b15a49 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 06:47:06 +00:00
PyTorch MergeBot
82a191313e Revert "Add support for keep-going label (#90902)"
This reverts commit 855f4b7d24.

Reverted https://github.com/pytorch/pytorch/pull/90902 on behalf of https://github.com/huydhn due to This change breaks trunk where, unlike PR, there is no label
2022-12-16 05:07:49 +00:00
Edward Z. Yang
855f4b7d24 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 04:03:52 +00:00
soulitzer
dc4d18d47d Remove hack to hard code test times (#90720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90720
Approved by: https://github.com/janeyx99
2022-12-13 04:28:01 +00:00
Huy Do
177621a0b2 Use pytest-flakefinder to rerun tests multiple times (#89106)
Per title. The way re-run is handled in https://github.com/pytorch/pytorch/pull/88646 only applies to unittest.
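
A hedged usage example; the flag names come from pytest-flakefinder, and the target file is just the one named in this PR's testing notes.

```python
# Run the same test selection several times in one session via pytest-flakefinder.
import sys
import pytest

sys.exit(pytest.main(["test_ops.py", "--flake-finder", "--flake-runs=3", "-q"]))
```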

### Testing

* https://github.com/pytorch/pytorch/actions/runs/3484930558
* https://github.com/pytorch/pytorch/actions/runs/3484930319

Manually download the test report artifacts and verify that pytest test_ops is called multiple times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89106
Approved by: https://github.com/clee2000
2022-11-18 00:11:44 +00:00
Wanchao Liang
ac0a6f381d [dtensor] disable op db tests for now (#89162)
context: https://github.com/pytorch/pytorch/issues/89160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89162
Approved by: https://github.com/fduwjj
2022-11-17 02:31:23 +00:00
Fuzzkatt
8ba62bdff5 add test_c10d_spawn_ucc.py (#86508)
Initial PR to create UCC equivalent of https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_gloo.py and
https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_nccl.py. Currently only added common ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86508
Approved by: https://github.com/kwen2501
2022-11-16 22:50:11 +00:00
Peter Bell
7b0adc290a Run tests from test/inductor in inductor CI job (#88957)
CUDA inductor tests are currently not run in CI because the only jobs
that have triton installed don't actually run these test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88957
Approved by: https://github.com/ngimel, https://github.com/seemethere
2022-11-16 17:54:13 +00:00
Huy Do
21dd311077 Add a mode to rerun all disabled tests (without running anything else) (#88646)
Rerun all disabled tests to gather their latest results so that we can close disabled-test tickets automatically. When running under this mode (RERUN_DISABLED_TESTS=true), only disabled tests are run while the rest are skipped `<skipped message="Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run" type="skip"/>`

The logic is roughly as follows; each test is run multiple times (n=50), and a sketch of the resulting decision rule appears after this list:

* If the disabled test passes, and it's flaky, do nothing because it's still flaky.  In the test report, we'll see the test passes with the following skipped message:
```
<testcase classname="TestMultiprocessing" file="test_multiprocessing.py" line="357" name="test_fs" time="0.000" timestamp="0001-01-01T00:00:00">
    <skipped message="{&quot;flaky&quot;: True, &quot;num_red&quot;: 4, &quot;num_green&quot;: 0, &quot;max_num_retries&quot;: 3, &quot;rerun_disabled_test&quot;: true}" type="skip"/>
</testcase>
```

* If the disabled test passes every single time, and it is not flaky anymore, mark it so that it can be closed later.  We will see the test runs and passes, i.e.
```
<testcase classname="TestCommonCUDA" name="test_out_warning_linalg_lu_factor_cuda" time="0.170" file="test_ops.py" />
```

* If the disabled test fails after all retries, this is also expected. So only report it but don't fail the job (because we don't care about red signals here); we'll see that the test is skipped (without the `flaky` field), i.e.
```
<testcase classname="TestMultiprocessing" file="test_multiprocessing.py" line="357" name="test_fs" time="0.000" timestamp="0001-01-01T00:00:00">
    <skipped message="{&quot;num_red&quot;: 4, &quot;num_green&quot;: 0, &quot;max_num_retries&quot;: 3, &quot;rerun_disabled_test&quot;: true}" type="skip"/>
</testcase>
```
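
A hedged sketch of the three-way rule above (function and counter names are assumptions, not the actual reporting code):

```python
# Hypothetical classification of a disabled test after its n=50 reruns.
def classify_disabled_test(num_green: int, num_red: int) -> str:
    if num_red == 0:
        return "no longer flaky: close the disabled-test issue"
    if num_green > 0:
        return "still flaky: keep the issue open, do nothing"
    return "still failing: report it, but don't fail the job"
```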

This runs on the same schedule as `mem_leak_check` (daily).  The change to update test stats and (potentially) grouping on HUD will come in separate PRs.

### Testing

* pull https://github.com/pytorch/pytorch/actions/runs/3447434434
* trunk https://github.com/pytorch/pytorch/actions/runs/3447434928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88646
Approved by: https://github.com/clee2000
2022-11-15 05:08:26 +00:00
Catherine Lee
de38c87698 Use run_test in MPS (#88829)
Run MPS tests through run_test to get the disabled-test infra, XML file creation (which can then be used for flakiness detection), and reruns.

Also added the workflow steps for uploading the xml files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88829
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-11-10 21:32:41 +00:00
soulitzer
4c20c0509d Split out forward AD tests from test_ops_gradients and reenable slow gradcheck CI (#88216)
Fixes: https://github.com/pytorch/pytorch/issues/88010

This PR does a couple things to stop slow gradcheck from timing out:
- Splits out test_ops_fwd_gradients from test_ops_gradients, and factors out TestFwdGradients and TestBwdGradients which both inherit from TestGradients, now situated in common_utils (maybe there is a better place?)
- Skips CompositeCompliance (and several other test files) for slow gradcheck CI since they do not use gradcheck
- because test times for test_ops_fwd_gradients and test_ops_gradients are either unknown or wrong, we hardcode them for now to prevent the two from being put on the same shard. We can undo the hack after we see that actual test times are updated. (`calculate_shards` divides tests with unknown test times in a round-robin fashion; a hedged sketch of that fallback follows this list.)
- Updates references to test_ops_gradients and TestGradients
- Test files that are skipped for slow gradcheck CI are now centrally located in run_test.py; this reduces how fine-grained we can be with the skips, so for some skips (one so far) we still use the old skipping mechanism, e.g. for test_mps
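
A hedged sketch of the round-robin fallback mentioned above for tests with unknown times (the helper name is an assumption):

```python
# Hypothetical round-robin assignment for test files whose duration is unknown.
def round_robin_shards(tests, num_shards):
    shards = [[] for _ in range(num_shards)]
    for i, test in enumerate(tests):
        shards[i % num_shards].append(test)
    return shards

# example: 3 test files distributed across 2 shards
print(round_robin_shards(["test_ops_gradients", "test_ops_fwd_gradients", "test_mps"], 2))
```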

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88216
Approved by: https://github.com/albanD
2022-11-03 00:20:45 +00:00
Iris Zhang
7bd04fb09f [1/N][C10D] Add a customized ScubaLogHandler implementation for internal FB use (#86699) (#87123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86699

This diff does the following:
1. **c10d_error_logger.py**: Add an API to create a logger with a specific logging handler based on the destination.
2. The API from above would get a logging handler based on the destination provided (a minimal sketch follows this list).
-  **caffe2/torch/distributed/logging_handlers.py**: For OSS, we simply use a NullHandler() for now.
3. Add associated test files for 1 and 2.
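A minimal sketch of items 1 and 2 above, with illustrative names only (the internal Scuba-backed handler is not shown):
```
# Hedged sketch, not the internal implementation.
import logging

# For OSS, the destination table maps to a NullHandler (item 2 above);
# internally a Scuba-backed handler would be registered instead.
_log_handlers = {
    "default": logging.NullHandler(),
}

def _get_or_create_logger(destination="default"):
    logger = logging.getLogger(f"c10d-{destination}")
    handler = _log_handlers.get(destination, logging.NullHandler())
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.propagate = False
    return logger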

Test Plan:
## Unit Test
```
buck test @//mode/dev-nosan //caffe2/test/distributed:test_c10d_error_logger -- --print-passing-details
```
```
File changed: fbcode//caffe2/test/distributed/test_c10d_error_logger.py
File changed: fbsource//xplat/caffe2/test/distributed/TARGETS
9 additional file changes
waiting for all tests to finish...
✓ Listing success: caffe2/test/distributed:test_c10d_error_logger (0.2s)
Found 1 tests
✓ Pass: caffe2/test/distributed:test_c10d_error_logger - test_get_or_create_logger (caffe2.test.distributed.test_c10d_error_logger.C10dErrorLoggerTest) (0.2s)

stdout:

stderr:

Buck UI:      https://www.internalfb.com/buck2/b975f6b0-77e9-4287-8722-f95b48036181
Test Session: https://www.internalfb.com/intern/testinfra/testrun/1407375150206593
RE: reSessionID-4d7ab8ca-1051-48e9-a5a8-6edbe15d1fe4  Up: 124 B  Down: 0 B
Jobs completed: 5. Time elapsed: 3.5s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. 0 builds failed
```

Differential Revision: D39920391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87123
Approved by: https://github.com/fduwjj, https://github.com/H-Huang
2022-10-21 18:45:38 +00:00
atalman
40d0fa5314 Reenable aot tests on windows for cuda 11.7 and up (#87193)
Reenable aot tests on windows for cuda 11.7 and up

Issue: https://github.com/pytorch/pytorch/issues/69460 seems to be mitigated in CUDA 11.7, hence this test is re-enabled

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87193
Approved by: https://github.com/malfet
2022-10-19 17:09:37 +00:00
Catherine Lee
9a786202b7 [ci] fix log printing (#87223)
Not sure how I missed this.

example https://github.com/pytorch/pytorch/actions/runs/3275717751/jobs/5391093040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87223
Approved by: https://github.com/malfet, https://github.com/kit1980, https://github.com/janeyx99
2022-10-18 20:57:27 +00:00
Catherine Lee
18cc00d399 [ci] put more logs in a folded group (#86138)
Fixes a request to not print the entire log file, but only the last chunk of lines, since those are probably the most relevant

All but the last 300 lines of a failing test's log get put into a folded group (see the sketch below)
example https://github.com/pytorch/pytorch/actions/runs/3177200444/jobs/5177703202
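A small sketch of the folding using GitHub Actions' `::group::` workflow command; the function name and file handling here are assumptions, not the exact run_test.py code:
```
# Hedged sketch: fold all but the last 300 lines of a failing test's log
# into a collapsible GitHub Actions group.
def print_log_with_folded_head(log_path, keep_tail=300):
    with open(log_path, errors="ignore") as f:
        lines = f.readlines()
    head, tail = lines[:-keep_tail], lines[-keep_tail:]
    if head:
        # Everything but the tail goes inside a folded group on the job page.
        print(f"::group::Earlier output from {log_path} ({len(head)} lines)")
        print("".join(head), end="")
        print("::endgroup::")
    # The most relevant part, the tail, stays visible by default.
    print("".join(tail), end="")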
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86138
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/lezcano
2022-10-17 22:10:23 +00:00
Huy Do
d393a463ff Fix functorch test selection logic (#86944)
I realize that `run_test.py` doesn't take the functorch test selection logic into account at the moment; for example, `python test/run_test.py --functorch -i functorch/test_ops --verbose` still runs all functorch tests
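A rough sketch of the intended behavior, with an assumed helper name (not the actual run_test.py code):
```
# Hedged sketch: keep only the explicitly included functorch test files.
def select_functorch_tests(all_functorch_tests, include):
    if not include:
        return list(all_functorch_tests)
    return [t for t in all_functorch_tests if t in include]

# select_functorch_tests(["functorch/test_ops", "functorch/test_vmap"],
#                        ["functorch/test_ops"])
# -> ["functorch/test_ops"], instead of running every functorch test.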
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86944
Approved by: https://github.com/clee2000, https://github.com/malfet
2022-10-14 17:26:52 +00:00
Huy Do
5224906749 Spread distributed backends among all distributed shards (#86837)
So that they can be run in parallel without stepping on each other's toes
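A minimal sketch of the idea, with assumed names: interleave (backend, test) combinations across shards instead of stacking every backend on the same shard:
```
# Hedged sketch: round-robin (backend, test) pairs across distributed shards.
from itertools import product

def spread_backend_configs(backends, tests, num_shards):
    shards = [[] for _ in range(num_shards)]
    for i, (backend, test) in enumerate(product(backends, tests)):
        shards[i % num_shards].append({"backend": backend, "test": test})
    return shards

# spread_backend_configs(["nccl", "gloo"],
#                        ["distributed/test_distributed_spawn"], 2)
# puts the nccl run on shard 0 and the gloo run on shard 1, so the two
# backends run in parallel without stepping on each other.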
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86837
Approved by: https://github.com/clee2000
2022-10-13 03:31:33 +00:00
Huy Do
be81f3d8d4 Revert distributed test parallelization (#86756)
Revert an old commit and resolve some conflicts

Fixes https://github.com/pytorch/pytorch/issues/86418
Fixes https://github.com/pytorch/pytorch/issues/86419
Fixes https://github.com/pytorch/pytorch/issues/86415
Fixes https://github.com/pytorch/pytorch/issues/86420
Fixes https://github.com/pytorch/pytorch/issues/86416
Fixes https://github.com/pytorch/pytorch/issues/86392
Fixes https://github.com/pytorch/pytorch/issues/86391
Fixes https://github.com/pytorch/pytorch/issues/86397
Fixes https://github.com/pytorch/pytorch/issues/86390
Fixes https://github.com/pytorch/pytorch/issues/86398
Fixes https://github.com/pytorch/pytorch/issues/86396
Fixes https://github.com/pytorch/pytorch/issues/86395
Fixes https://github.com/pytorch/pytorch/issues/86393
Fixes https://github.com/pytorch/pytorch/issues/86394
Fixes https://github.com/pytorch/pytorch/issues/86440
Fixes https://github.com/pytorch/pytorch/issues/86442
Fixes https://github.com/pytorch/pytorch/issues/86439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86756
Approved by: https://github.com/mrshenli
2022-10-12 21:17:28 +00:00
Daniel Dale
ce56ee11fd Extend torch.cuda.is_available() to attempt an NVML-based CUDA availability assessment when explicitly requested by the user (#85951)
Fixes #83973 (This is a substitute PR for https://github.com/pytorch/pytorch/pull/85024)

First of all, thanks for your invaluable contributions to PyTorch everyone!

Given how extensively `torch.cuda.is_available` is used in the PyTorch ecosystem, IMHO it's worthwhile to provide downstream libraries/frameworks/users the ability to alter the default behavior of `torch.cuda.is_available` in the context of their PyTorch usage.

I'm confident there are many current and future such use cases which could benefit from leveraging a weakened, NVML-based `torch.cuda.is_available` assessment at a downstream framework's explicit direction (thanks @malfet 81da50a972 !). Though one could always patch out the `torch.cuda.is_available` function with another implementation in a downstream library, I think this environment-variable-based configuration option is more convenient and the cost of including it is quite low.

As discussed in https://github.com/pytorch/pytorch/pull/85024#issuecomment-1261542045, this PR gates the new non-default NVML-based CUDA behavior behind an environment variable (PYTORCH_NVML_BASED_CUDA_CHK) that allows a user/framework to opt into the NVML-based `is_available()` assessment if desired.
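A short usage sketch of the opt-in described above; the environment variable name comes from the PR text, everything else is illustrative:
```
# Hedged usage sketch: set the variable before availability is queried.
import os

os.environ["PYTORCH_NVML_BASED_CUDA_CHK"] = "1"

import torch

# With the variable set, is_available() performs the weakened, NVML-based
# assessment instead of the default CUDA runtime check.
print(torch.cuda.is_available())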

Thanks again for your work everyone!
@ngimel @malfet @awaelchli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85951
Approved by: https://github.com/ngimel
2022-10-12 18:37:50 +00:00
Richard Zou
109f4d4453 Move functorch tests from functorch/test/* to test/functorch/* (#86623)
This is the first step described in
https://github.com/pytorch/pytorch/issues/86618 . test/functorch/* is
the final location for these tests.

Test Plan:
- Check that the functorch shards in CI are still running tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86623
Approved by: https://github.com/huydhn
2022-10-11 17:20:45 +00:00
Catherine Lee
7941b042a7 parallelize at file granularity (#85770)
part two of https://github.com/pytorch/pytorch/pull/84961

Tests files in parallel at the test-file granularity

* 2 procs at a time
* number of tests run changed by <200, possibly due to more tests being added on master between the base commit and the head commit of the PR
* may cause flakiness, but I haven't seen it in my small sample size of this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85770
Approved by: https://github.com/huydhn
2022-10-03 16:59:39 +00:00
Catherine Lee
401a358817 [ci] two procs for parallelization (#85985)
Hitting OOMs on Linux CUDA, so use 2 procs instead of 3

https://github.com/pytorch/pytorch/issues/85939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85985
Approved by: https://github.com/huydhn
2022-09-30 20:44:13 +00:00
Catherine Lee
7e5105dd11 [ci] expand log file if failed (#85927)
As in the title: expand the logs if the test file failed

ex https://github.com/pytorch/pytorch/actions/runs/3155045945/jobs/5133566508
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85927
Approved by: https://github.com/huydhn, https://github.com/janeyx99
2022-09-30 16:51:28 +00:00
Huy Do
4d3acf1203 Enable pytest-shard for functorch (#85321)
This extends https://github.com/pytorch/pytorch/pull/84961 to support functorch tests with pytest-shard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85321
Approved by: https://github.com/samdow, https://github.com/clee2000
2022-09-24 01:17:04 +00:00
Catherine Lee
49e10c1598 [ci] test_ops in parallel, ci tests log to file (#85528)
part one of splitting up https://github.com/pytorch/pytorch/pull/84961 into (probably 2) parts

contains
* logging to file
* testing test_ops in parallel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85528
Approved by: https://github.com/huydhn
2022-09-23 20:45:20 +00:00
Nikita Shulga
46a6a50f4e Skip MPS test from generic M1 testsuite (#85500)
As there is separate Run MPS shard right now

See if this reduces the number of crashes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85500
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn
2022-09-22 22:13:47 +00:00
PyTorch MergeBot
3dce26635f Revert "test in parallel at file granularity (#84961)"
This reverts commit 8107666c6a.

Reverted https://github.com/pytorch/pytorch/pull/84961 on behalf of https://github.com/clee2000 due to it making test_forward_ad_nn_functional_max_unpool2d_cuda_float32 flakily pass as an unexpected success
2022-09-21 20:21:25 +00:00
Catherine Lee
8107666c6a test in parallel at file granularity (#84961)
run tests in parallel at the test file granularity

Runs 3 files in parallel using a multiprocessing pool; output goes to a file, which is then printed when the test file finishes.  Some tests cannot be run in parallel (usually due to lacking memory), so we run those afterwards.  Sharding is changed to attempt to mask large files with other large files / run them on the same shard.
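A simplified sketch of that mechanism; the file paths and helper names are assumptions, not the actual run_test.py code:
```
# Hedged sketch: run test files in parallel, logging each to its own file,
# then run the serial-only files afterwards.
import multiprocessing
import subprocess
import sys

NUM_PROCS = 3  # files run concurrently

def run_test_file(test_file):
    """Run one test file; print its log only if it failed."""
    log_path = f"{test_file.replace('/', '_')}.log"
    with open(log_path, "w") as log:
        ret = subprocess.run(
            [sys.executable, f"test/{test_file}.py"],
            stdout=log, stderr=subprocess.STDOUT,
        ).returncode
    if ret != 0:
        with open(log_path) as log:
            print(log.read())
    return ret

def run_all(parallel_files, serial_files):
    with multiprocessing.Pool(NUM_PROCS) as pool:
        codes = pool.map(run_test_file, parallel_files)
    codes += [run_test_file(f) for f in serial_files]
    return max(codes, default=0)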

test_ops* gets a custom handler to run it because it is simply too big (2hrs on windows) and linalg_cholesky fails (I would really like a solution to this if possible, but until then we use the custom handler).

reduces cuda tests by a lot, reduces total windows test time by ~1hr

Ref. https://github.com/pytorch/pytorch/issues/82894
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84961
Approved by: https://github.com/huydhn
2022-09-21 16:58:11 +00:00
Arindam Roy
a185dc2e63 [ROCm] re-enable tensorexpr and test_openmp (#81367)
The following tests are being re-enabled for ROCm:
- test_openmp.py
- TestTensorExprPyBind tests in test_tensorexpr_pybind.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81367
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-09-14 00:41:16 +00:00
Xiang Gao
08c4f8c7a7 ProcessGroupUCC tests (#83285)
- [x] Direct dependency on UCX is completely removed, UCC active set API always enabled
- [x] Remove `TORCH_UCC_PROFILING_ENABLE`, always enable profiling
- [x] Fixes profiling of `recv` and `all_gather`
- [x] Use the NCCL TL of UCC on CUDA, as  the UCP TL is not well supported on CUDA

Most tests are passing, but there are a few skipped tests:
- `scatter` and `gather` are not supported by the UCP TL of UCC on CPU tensors
- A few flaky tests in PyTorch's CI environment
- Profiler-related failures, some of them will be fixed by @Fuzzkatt in https://github.com/pytorch/pytorch/pull/84368

After this PR is merged, I will continue to work on these skipped failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83285
Approved by: https://github.com/vtlam, https://github.com/malfet, https://github.com/kwen2501
2022-09-10 10:56:05 +00:00
Huy Do
90d6112a94 Test distributed backends in parallel (#84034)
This allows multiple backends (nccl, gloo) to be tested in parallel and speeds up the process. The improvement is mainly in the 1st distributed CUDA shard where the long pole `distributed/test_distributed_spawn` test is executed:

* [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596825?check_suite_focus=true#logs) takes 1h24m. This is better than the current average expectation of 2h12m

On the other hand, there is no improvement for the following two jobs:

* [linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)](https://github.com/pytorch/pytorch/runs/8007417353?check_suite_focus=true#logs) takes 1h47m
* [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596870?check_suite_focus=true#logs) takes 1h40m

This is still a gain, though, because it allows us to add more shards for distributed tests if needed.

Issue https://github.com/pytorch/pytorch/issues/83694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84034
Approved by: https://github.com/wanchaol
2022-09-01 03:48:54 +00:00
PyTorch MergeBot
772721a4b7 Revert "Test distributed backends in parallel (#84034)"
This reverts commit 3ae5be74ac.

Reverted https://github.com/pytorch/pytorch/pull/84034 on behalf of https://github.com/huydhn due to This somehow revives the flaky test https://github.com/pytorch/pytorch/issues/76428
2022-08-30 21:01:25 +00:00
Huy Do
3ae5be74ac Test distributed backends in parallel (#84034)
This allows multiple backends (nccl, gloo) to be tested in parallel and speeds up the process. The improvement is mainly in the 1st distributed CUDA shard where the long pole `distributed/test_distributed_spawn` test is executed:

* [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596825?check_suite_focus=true#logs) takes 1h24m. This is better than the current average expectation of 2h12m

On the other hand, there is no improvement for the following two jobs:

* [linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)](https://github.com/pytorch/pytorch/runs/8007417353?check_suite_focus=true#logs) takes 1h47m
* [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596870?check_suite_focus=true#logs) takes 1h40m

This is still a gain, though, because it allows us to add more shards for distributed tests if needed.

Issue https://github.com/pytorch/pytorch/issues/83694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84034
Approved by: https://github.com/wanchaol
2022-08-30 19:06:49 +00:00
joncrall
b136f3f310 More doctest refinements. (#83317)
Follow up to #82797

Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way.

@ezyang @vadimkantorov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317
Approved by: https://github.com/ezyang
2022-08-22 20:07:26 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
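For illustration, a hypothetical docstring marked with the directive so that xdoctest skips the failing example while the rest of the dashboards stay green:
```
# Hypothetical function and doctest, shown only to illustrate the directive.
def double(x):
    """
    Return ``x`` doubled.

    Example:
        >>> # xdoctest: +SKIP
        >>> double(some_undefined_tensor)  # would fail today, so it is skipped
        tensor([2., 4.])
    """
    return x * 2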

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Mateusz Sypniewski
916def84d4 CUDA trace Python hooks (#82824)
### Description
This adds Python hooks into PyTorch that allow the user to register their own callbacks for events such as tensor allocation, stream allocation, event record / wait etc.
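Purely for illustration, the general shape of such a hook registry; these names are not the torch API added by this PR:
```
# Hedged sketch of a generic callback registry for traced CUDA-style events.
_callbacks = {}

def register_callback(event, fn):
    """Register fn to run whenever `event` (e.g. "tensor_allocation",
    "stream_allocation", "event_record", "event_wait") is traced."""
    _callbacks.setdefault(event, []).append(fn)

def fire(event, *args):
    for fn in _callbacks.get(event, []):
        fn(*args)

# Example: log every traced tensor allocation.
register_callback("tensor_allocation", lambda ptr: print(f"allocated {ptr:#x}"))
fire("tensor_allocation", 0x7F00DEAD0000)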
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82824
Approved by: https://github.com/lw, https://github.com/ezyang, https://github.com/malfet
2022-08-11 10:21:40 +00:00
pbialecki
b4f7e22640 Enable periodic builds for CUDA 11.7 (#81688)
CC @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81688
Approved by: https://github.com/atalman
2022-08-10 00:03:51 +00:00
Nikita Shulga
b9cdd6d0ac Do not assume what is in os.environ (#82495)
`os.environ['FOO']` will raise `KeyError` if `FOO` is not found, while `os.environ.get('FOO')` would simply return `None`
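Concretely:
```
import os

print(os.environ.get("FOO"))  # prints None when FOO is unset

try:
    os.environ["FOO"]         # raises KeyError when FOO is unset
except KeyError:
    print("FOO is not set")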

Fixes https://github.com/pytorch/pytorch/issues/82492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82495
Approved by: https://github.com/huydhn, https://github.com/kit1980
2022-07-29 22:57:18 +00:00