Extend the metric library to allow setting global metrics at the process level that are always emitted.
The current use case is to include shard information every time a metric is emitted by run_test.py.
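A minimal sketch of the idea, purely for illustration (the function and variable names below are assumptions, not necessarily the real upload_stats_lib API): a process-wide store of global metrics that gets merged into every emitted metric.
```python
from typing import Any, Dict

# Hypothetical module-level store; the real library's internals may differ.
_global_metrics: Dict[str, Any] = {}

def add_global_metric(name: str, value: Any) -> None:
    """Register a metric that is attached to every metric emitted by this process."""
    _global_metrics[name] = value

def emit_metric(metric_name: str, metrics: Dict[str, Any]) -> None:
    # Per-call values win on key conflicts; print is a stand-in for the real upload.
    payload = {**_global_metrics, **metrics}
    print(f"emitting {metric_name}: {payload}")

# run_test.py would then call this once per process, e.g.:
# add_global_metric("shard", 3)
# add_global_metric("num_shards", 5)
```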
### <samp>🤖 Generated by Copilot at 0cae92c</samp>
> _`run_test` refactored_
> _Sharding metrics in Rockset_
> _Autumn of testing_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
When a test fails, we will now emit fine-grained details about how accurately heuristics predicted the relevance of that test.
## Context
Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not that test fails on it. Green tests don't tell us whether the success was due to the code being good or the test being irrelevant. This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.
## What's measured?
The metrics this PR collects are designed to answer the following questions:
### How comprehensive are the heuristics?
- What's the false negative rate, i.e. the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per-heuristic level)
### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % was prioritized overall?
- How relevant was a failed test considered to be? (Both at an aggregate level and at a per-heuristic level)
- What % of the time did a given heuristic prioritize a failing test higher than any other heuristic?
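To make the aggregate vs. per-heuristic distinction above concrete, here is a small illustrative computation of the false negative rate question (not the PR's actual code; the test and heuristic names are made up):
```python
from typing import Dict, Set

def false_negative_rate(failed_tests: Set[str], prioritized: Set[str]) -> float:
    """% of failures that should have been prioritized but weren't."""
    if not failed_tests:
        return 0.0
    missed = failed_tests - prioritized
    return 100.0 * len(missed) / len(failed_tests)

failed = {"test_ops", "test_autograd", "test_nn"}
per_heuristic: Dict[str, Set[str]] = {
    "edited_files": {"test_ops"},
    "historical_correlation": {"test_ops", "test_nn"},
}
overall = set().union(*per_heuristic.values())
print(false_negative_rate(failed, overall))          # aggregate level
for name, picks in per_heuristic.items():
    print(name, false_negative_rate(failed, picks))  # per-heuristic level
```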
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117
This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are:
- Some metric emissions were changed to comply with the new TD format
- Some logging changes
- We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general)
Refactorings done:
- Moves all test reordering code to the new TD framework
- Refactors run_test.py to cleanly support multiple levels of test priorities
- Deletes some dead code that was originally written for logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071
Approved by: https://github.com/clee2000, https://github.com/huydhn
Make it so that scripts can import and run the `emit_metrics` function even if they don't have boto3 installed, in which case it will still validate the inputs but skip the actual metric emission.
It's purely a refactor without any real logic changes.
Motivation: so that run_test.py and the target determination code can use this library easily without worrying about whether it was imported or whether its dependencies are installed.
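The shape of the refactor is the standard optional-dependency pattern; a minimal sketch with simplified names and validation (the real code in tools/stats/upload_stats_lib.py is more thorough):
```python
from typing import Any, Dict, Optional

try:
    import boto3  # type: ignore[import]
except ImportError:
    boto3 = None  # emission becomes a no-op, validation still runs

def emit_metrics(metric_name: str, metrics: Optional[Dict[str, Any]]) -> None:
    # Inputs are always validated, even without boto3 ...
    if not metric_name:
        raise ValueError("metric_name must be a non-empty string")
    if metrics is None:
        raise ValueError("You didn't ask to upload any metrics!")
    # ... but the actual upload is skipped when boto3 isn't installed.
    if boto3 is None:
        return
    # The real code would build the payload and upload it to DynamoDB here.
```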
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107070
Approved by: https://github.com/huydhn
Update distutils.Version to packaging.version due to the deprecation warning.
```python
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17138: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17140: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
```
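For reference, the replacement pattern looks roughly like this (illustrative, not the exact diff; note that both sides of the comparison need to be parsed versions):
```python
import scipy
from packaging import version

TEST_SCIPY = True  # stand-in for the real common_utils flag

# Before (deprecated):
#   active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"
# After:
active_if = TEST_SCIPY and version.parse(scipy.__version__) < version.parse("1.4.0")
```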
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107207
Approved by: https://github.com/soulitzer
Companion with https://github.com/pytorch/test-infra/pull/4424
Uses the file ratings generated by the test-infra PR to reorder tests. For each test file, sum the file ratings from the changed files in the PR, and order the tests by that sum.
A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.
Sharding is done twice, once on the prioritized tests, and once on the general/non-prioritized tests. Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.
I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.
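Roughly, the reordering described above works like the following sketch (data shapes and names are made up for illustration):
```python
from typing import Dict, List

# ratings[changed_file][test_file] -> correlation score from the test-infra PR
ratings: Dict[str, Dict[str, float]] = {
    "torch/nn/modules/linear.py": {"test_nn": 0.9, "test_ops": 0.2},
    "torch/autograd/function.py": {"test_autograd": 0.8},
}
changed_files = ["torch/nn/modules/linear.py", "torch/autograd/function.py"]
all_tests = ["test_ops", "test_nn", "test_autograd", "test_jit"]

scores = {test: 0.0 for test in all_tests}
for changed in changed_files:
    for test, score in ratings.get(changed, {}).items():
        scores[test] += score

# Anything with a rating > 0 is prioritized and ordered by its summed score.
prioritized: List[str] = sorted(
    (t for t in all_tests if scores[t] > 0), key=lambda t: scores[t], reverse=True
)
general = [t for t in all_tests if scores[t] == 0]
print(prioritized, general)  # ['test_nn', 'test_autograd', 'test_ops'] ['test_jit']
```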
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
Fixes https://github.com/pytorch/pytorch/issues/106754
This PR:
- moves test/autograd/test_fallback.py to test_autograd_fallback.py and removes it from test_autograd.py (necessary for the next step)
- adds test_autograd_fallback.py to the parallel test blocklist.
- lintrunner really wanted to make changes to the files, but other than that, it is a move.
The problem is that we set a global option (the autograd fallback mode) during these tests, which may cause the tests to interfere with each other.
Test Plan:
- python test/run_test.py -i test_autograd_fallback
NOTE to diff train oncall:
- You'll also need to modify the test/autograd/test_fallback.py TARGET in
caffe2/test/TARGETS since we renamed the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106866
Approved by: https://github.com/soulitzer
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.
The main changes are:
- Add `--no-translation-validation` as an option in _test/run_test.py_
- Set `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts
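The `PYTORCH_TEST_WITH_TV` / `TEST_WITH_TV` plumbing mentioned in the list above can be sketched as follows (the exact code in common_utils.py may differ; the decorator name here is hypothetical):
```python
import os
import unittest

# Mirrors the usual TEST_WITH_* flag pattern; exact parsing may differ.
TEST_WITH_TV = os.getenv("PYTORCH_TEST_WITH_TV", "0") == "1"

def skipIfNoTranslationValidation(fn):
    """Hypothetical decorator: skip a test unless translation validation is on."""
    return unittest.skipUnless(TEST_WITH_TV, "translation validation disabled")(fn)
```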
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
Prevents the following cryptic error if one attempts to use `run_test.py` on a system that also has torchaudio installed in dev mode (as `tools` from https://github.com/pytorch/audio might take precedence, but this is not how the script should behave):
```
Unable to import test_selections from tools/testing. Running without test selection stats.... Reason: No module named 'tools.stats'
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1673, in <module>
main()
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1604, in main
selected_tests = get_selected_tests(options)
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1418, in get_selected_tests
path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
NameError: name 'TEST_TIMES_FILE' is not defined
```
But make sure to remove it at the end; otherwise it will not work if torch is installed from a wheel but the tests run from a clean repo checkout.
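The fix boils down to the pattern sketched below (simplified; the exact imports that run_test.py pulls in differ): push the repo root onto `sys.path`, import from the repo's own `tools/`, then remove the entry again.
```python
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent  # test/run_test.py -> repo root

sys.path.insert(0, str(REPO_ROOT))
try:
    # Illustrative import; the real script pulls several helpers from tools/.
    from tools.testing import test_selections  # noqa: F401
    HAVE_TEST_SELECTION_TOOLS = True
except ImportError:
    HAVE_TEST_SELECTION_TOOLS = False  # run without test selection stats
finally:
    # Remove the entry again, otherwise this breaks when torch is installed
    # from a wheel but the tests run from a clean repo checkout.
    sys.path.remove(str(REPO_ROOT))
```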
### <samp>🤖 Generated by Copilot at dd52521</samp>
> _Sing, O Muse, of the cunning code review_
> _That fixed the tests of the `tools` module_
> _By adding and removing the root path_
> _As a shepherd guides his flock to and fro._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104214
Approved by: https://github.com/kit1980
There is a `HAVE_TEST_SELECTION_TOOLS` conditional, but it turns out it does not really work, so fix it by defining all the missing prototypes and making it work as a single-shard instance.
Add a lint rule to test that it would succeed for running only test_cuda with a released version of PyTorch.
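A sketch of what "defining all missing prototypes" can look like (names and signatures are illustrative, not the exact run_test.py code): when the import fails, the same names are stubbed out so the rest of the script behaves like a single shard.
```python
from typing import Dict, List, Tuple

try:
    from tools.testing.test_selections import calculate_shards  # type: ignore[import]
    HAVE_TEST_SELECTION_TOOLS = True
except ImportError:
    HAVE_TEST_SELECTION_TOOLS = False

    def calculate_shards(
        num_shards: int, tests: List[str], test_times: Dict[str, float]
    ) -> List[Tuple[float, List[str]]]:
        # Single-shard fallback: all tests land in the first shard.
        return [(0.0, list(tests))] + [(0.0, []) for _ in range(num_shards - 1)]
```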
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104111
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
Because we always run tests with pytest now.
Marking it as `bc-breaking` as there could technically be some scripts depending on it somewhere...
### <samp>🤖 Generated by Copilot at 1760568</samp>
> _`pytest` option gone_
> _simpler test runner script_
> _autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104125
Approved by: https://github.com/seemethere
Added a feature to upload test statistics to DynamoDB and Rockset using a new function `emit_metric` in `tools/stats/upload_stats_lib.py`.
Added metrics to measure test reordering effectiveness in `tools/testing/test_selections.py`.
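The intended call pattern looks roughly like this (the metric name and fields are made-up examples, not the exact metrics the PR emits, and the precise signature may differ):
```python
from tools.stats.upload_stats_lib import emit_metric

# Hypothetical reordering metric; field names are illustrative only.
emit_metric(
    "test_reordering_effectiveness",
    {
        "num_tests": 500,
        "num_prioritized_tests": 12,
        "position_of_first_failure": 3,
    },
)
```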
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102691
Approved by: https://github.com/malfet
Sharding on ROCm is broken; I can't replicate it on dummy PRs even though it seems to happen pretty often on main, so I'm adding this to increase my sample size. Hopefully this is enough print statements...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102713
Approved by: https://github.com/huydhn
Currently file-level reruns + stepcurrent are incompatible, and that's making PRs green when they are actually red, so turn off stepcurrent + file-level reruns when keep-going is used, until I figure out a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102569
Approved by: https://github.com/huydhn
The console log blows up too much when running in rerun disabled tests mode (x50) e132f09e88. Each log is around 1GB and the whole uncompressed log set is ~50GB. After compression, it will be around 1GB, still too big. The increase comes mainly from the multiple SKIPPED messages for non-disabled tests, which is expected due to how SkipTest and pytest-flakefinder currently work.
I update `test/conftest.py` to completely ignore skipped tests when rerunning disabled tests, instead of collecting and then skipping 50 copies of each. The benefit of doing this is much greater than I originally expected:
* Rerun disabled tests jobs now finish in less than half an hour, as they should
* Fix the OOM runner crash caused by too many collected tests
* Fix the verbosity issue, as now only disabled tests are run x50 times. There are only a few hundred of them atm
* Fix the timeout issue when rerunning disabled distributed and ASAN tests. They are just too slow when run at x50
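A very rough sketch of the conftest.py idea described above (the hook shown is a plausible shape, not necessarily the actual implementation in test/conftest.py): when `--rerun-disabled-tests` is active, deselect tests that would only be collected in order to be skipped, instead of keeping 50 copies of each around.
```python
import pytest

def pytest_collection_modifyitems(config, items):
    if not config.getoption("--rerun-disabled-tests", default=False):
        return
    kept, deselected = [], []
    for item in items:
        # Anything already marked as skipped is not a disabled test being
        # rerun, so don't keep 50 copies of it in the collection.
        if item.get_closest_marker("skip") or item.get_closest_marker("skipif"):
            deselected.append(item)
        else:
            kept.append(item)
    if deselected:
        config.hook.pytest_deselected(items=deselected)
        items[:] = kept
```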
### Testing
When rerunning disabled tests https://github.com/pytorch/pytorch/actions/runs/5084508614, only disabled tests on the platform are run. For example, `test_ops_jit` on https://ossci-raw-job-status.s3.amazonaws.com/log/13770164954 only ran 100 test instances (`test_variant_consistency_jit_linalg_lu_cuda_float32` + `test_variant_consistency_jit_linalg_lu_factor_cuda_complex64`, each x50).
```
Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_jit.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--sc=test_ops_jit_1', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2023-05-25 21:32:49.763856]
Expand the folded group to see the log file of test_ops_jit 2/2
##[group]PRINTING LOG FILE of test_ops_jit 2/2 (/var/lib/jenkins/workspace/test/test-reports/test_ops_jit_h2wr_t2c.log)
Test results will be stored in test-reports/python-pytest/test_ops_jit/test_ops_jit-51a83bd44549074e.xml
============================= test session starts ==============================
platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /opt/conda/envs/py_3.10/bin/python
cachedir: .pytest_cache
hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow]
rootdir: /var/lib/jenkins/workspace
configfile: pytest.ini
plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-11.1.2, shard-0.1.2, xdist-3.3.0, xdoctest-1.1.0
collecting ... collected 1084 items
Running 100 items in this shard: test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 (x50), test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 (x50)
stepcurrent: Cannot find last run test, not skipping
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 PASSED [2.1876s] [ 1%]
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 PASSED [4.5615s] [ 2%]
```
* [pull](https://github.com/pytorch/pytorch/actions/runs/5093566864)
* [trunk](https://github.com/pytorch/pytorch/actions/runs/5095364311)
* [periodic](https://github.com/pytorch/pytorch/actions/runs/5095378850)
* [slow](https://github.com/pytorch/pytorch/actions/runs/5095390285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102107
Approved by: https://github.com/clee2000, https://github.com/malfet
Per title. I extracted this part out of the draft PR I'm working on (https://github.com/pytorch/pytorch/pull/102107) because the remaining issues with rerun disabled tests (log size and unexpected runner failures) require some further investigation, while this one is clearly breaking trunk atm.
Until we can support disabling C++ tests, there is no need to run them in rerun disabled tests mode.
### Testing
Coming from https://github.com/pytorch/pytorch/pull/102107, for example https://github.com/pytorch/pytorch/actions/runs/5062224659/jobs/9087747981
```
2023-05-23T22:46:50.1953318Z Running cpp/basic 1/1 ... [2023-05-23 22:46:50.195077]
2023-05-23T22:46:50.1953847Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:50.2066032Z Running cpp/atest 1/1 ... [2023-05-23 22:46:50.206348]
2023-05-23T22:46:50.2066435Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:52.2666743Z No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2023-05-23T22:46:52.2691817Z Ignoring disabled issues: []
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102132
Approved by: https://github.com/clee2000
Not very elegant.
Checked on a separate conda env that doesn't have the usual CI dependencies.
The two pytest extensions at fault are pytest-rerunfailures and pytest-shard; also included pytest-flakefinder just in case.
No idea if this is a good way to do this.
Could also check individually and add flags based on that, but was told that requiring all the CI dependencies to be downloaded was also ok.
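One plausible shape of that check (purely illustrative; the actual run_test.py logic may differ) is to verify the optional plugins are importable before passing the flags that need them:
```python
import importlib.util
from typing import List

REQUIRED_PYTEST_PLUGINS = ("pytest_rerunfailures", "pytest_shard", "pytest_flakefinder")

def missing_pytest_plugins() -> List[str]:
    # A plugin counts as available if its module can be found on sys.path.
    return [m for m in REQUIRED_PYTEST_PLUGINS if importlib.util.find_spec(m) is None]

missing = missing_pytest_plugins()
if missing:
    raise RuntimeError(
        f"Missing pytest plugins required by run_test.py: {', '.join(missing)}. "
        "Install them with: pip install pytest-rerunfailures pytest-shard pytest-flakefinder"
    )
```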
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100916
Approved by: https://github.com/huydhn
After an investigation, running C++ tests with https://github.com/pytest-dev/pytest-cpp is just slower than running them directly, plain and simple. I'm curious about the exact root cause, but that's a story for another day.
`time build/bin/test_lazy` takes half a minute to run 610 tests on `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)` while `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v` takes 20+ minutes on the same runner. This is a very costly price to pay.
The saving grace here is that https://github.com/pytest-dev/pytest-cpp supports pytest-xdist to run tests in parallel with `-n auto`, so `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v -n auto` takes only 3 minutes. This is still not as fast as running C++ tests directly, but it's an order of magnitude faster than running them sequentially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101440
Approved by: https://github.com/clee2000
After https://github.com/pytorch/pytorch/pull/99559, we can now run C++ tests with `run_test.py`. Although advanced features such as `--import-slow-tests` and `--import-disabled-tests` won't work for now, there will still be a gain in reliability and performance, as C++ tests can now be retried and run in parallel.
This covers all C++ tests in the CI including aten, libtorch, and Vulkan C++ tests across all platforms: Linux, Windows, and macOS.
Notes:
* To support C++ test discovery, the env variable `CPP_TESTS_DIR` can be set to where the C++ test binaries are located
* Support the pytest `-k` argument via run_test.py, as this is used by pytest-cpp to replace `--gtest-filter`
* The XML output is in pytest format, but it's ok for now because we don't have slow test or flaky test support for C++ tests yet
* ~~I need to figure out why conftest.py doesn't work when I invoke pytest directly for C++ tests, so `--sc` is not available for C++ tests at the moment. Proper pytest plugins like stepwise work fine though. I'll investigate and fix it in a separate PR~~ Found the cause: `conftest.py` is per directory and needs to be in any arbitrary directory that holds C++ tests
* Two tests, `test_api` and `test_tensorexpr`, timed out on ASAN. I suspect that ASAN is now used on top of the Python executable, which is slower than running native C++ code. IMO, it's ok to run these tests as before on ASAN for now
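As a rough illustration of the `CPP_TESTS_DIR` discovery and the `-k` passthrough mentioned in the notes above (simplified; the real run_test.py wiring differs):
```python
import os
import subprocess
from pathlib import Path
from typing import Sequence

def run_cpp_test(test_name: str, extra_pytest_args: Sequence[str] = ()) -> int:
    cpp_tests_dir = Path(os.environ.get("CPP_TESTS_DIR", "build/bin"))
    binary = cpp_tests_dir / test_name  # e.g. build/bin/test_lazy
    # pytest-cpp picks the binary up as a gtest executable; -n auto parallelizes it.
    cmd = ["pytest", str(binary), "-v", "-n", "auto", *extra_pytest_args]
    return subprocess.call(cmd)

# -k works the same way it does for Python tests, replacing --gtest-filter:
# run_cpp_test("test_lazy", ["-k", "Shape"])
```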
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99956
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
Today, we prioritize running test files that were edited in the user's PR, with the idea being to run them before we run any other test.
Except, if the modified test is supposed to run serially, then we still end up running it after all the parallelized tests have finished running.
This PR fixes that to _always_ run the prioritized tests before the regular tests, regardless of whether the test is supposed to run serially or in parallel.
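A small sketch of the resulting ordering (bucket names and data shapes are invented for illustration):
```python
from typing import List, Set

def execution_order(
    tests: List[str], prioritized: Set[str], must_run_serially: Set[str]
) -> List[List[str]]:
    buckets = {
        ("prioritized", "parallel"): [],
        ("prioritized", "serial"): [],
        ("regular", "parallel"): [],
        ("regular", "serial"): [],
    }
    for t in tests:
        prio = "prioritized" if t in prioritized else "regular"
        mode = "serial" if t in must_run_serially else "parallel"
        buckets[(prio, mode)].append(t)
    # Prioritized tests always come first, regardless of serial/parallel.
    return [
        buckets[("prioritized", "parallel")],
        buckets[("prioritized", "serial")],
        buckets[("regular", "parallel")],
        buckets[("regular", "serial")],
    ]
```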
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100748
Approved by: https://github.com/huydhn
I think `get_reordered_tests` has been broken since the master -> main switch.
Add typing for some functions.
Checked for `prioritized` in the logs.
Limited testing because I only care about one very small part of the log that's near the beginning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100752
Approved by: https://github.com/huydhn