#98035 added logic to `wait_for_process` that catches a timeout exception and sends `SIGINT` to the process before waiting on it again with a timeout. However, if that second wait also times out, the wait call in the `finally` block (which has no timeout) can hang indefinitely.
This PR kills the process if a second timeout exception occurs after the `SIGINT` signal is sent.
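For illustration, here is a minimal sketch of the resulting escalation pattern (a hypothetical standalone helper, not the actual implementation in PyTorch's test utilities):
```python
import signal
import subprocess

def wait_for_process(p: subprocess.Popen, timeout: float = 60.0) -> int:
    try:
        return p.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # First timeout: ask the process to exit cleanly (e.g. so it can emit its XML report).
        p.send_signal(signal.SIGINT)
        try:
            return p.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Second timeout: the process ignored SIGINT, so kill it outright.
            # Without this, the untimed wait in `finally` could hang forever.
            p.kill()
            raise
    finally:
        p.wait()  # reap the process; safe now that a stuck child has been killed
```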
CC @clee2000 @ptrblck @xwang233 @kwen2501
Also hoping that this has the potential to reduce turnaround time for distributed timeouts like those seen in https://hud.pytorch.org/pr/pytorch/pytorch/105274#15148799113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105625
Approved by: https://github.com/ezyang
Fixes: #105143
In summary, the changes are:
- Check whether Z3 is installed when the module is loaded (see the sketch below)
- Consistently use the name "translation validation" (not "validator")
- Skip tests if Z3 is not installed
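A minimal sketch of the import-time check and skip pattern (names here are illustrative, not necessarily the ones the PR uses):
```python
import unittest

# Probe for Z3 once, when the module is loaded.
try:
    import z3  # noqa: F401
    HAS_Z3 = True
except ImportError:
    HAS_Z3 = False

@unittest.skipIf(not HAS_Z3, "translation validation requires Z3")
class TestTranslationValidation(unittest.TestCase):
    def test_smoke(self):
        self.assertTrue(HAS_Z3)
```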
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105168
Approved by: https://github.com/ezyang
MS2 of the Reproducible Testing BE initiative. For context, this is the ask:
```
Another thing that would be really great as we start to have more dependent
systems or types of tests (functorch, dynamo, crossref) would be to have a
minimally reproducible version of the test (something at the end of the HUD
comment like: "Run python test/test_file.py -k test_name" but also if you need
flags, like crossref it would be like "Run <flag to run crossref> python test/..." ). I'll
often go through the test infra to find the flags that I need to pass when
something only breaks crossref/dynamo tests.
```
Implementation details:
* Adds a new flag `PRINT_REPRO_ON_FAILURE` that is settable through the environment variable `PYTORCH_PRINT_REPRO_ON_FAILURE=1`
* **Default is ON but I can be persuaded otherwise**
* When the flag is enabled, our base `TestCase` will wrap the test method in a context manager that catches any non-skip exceptions and appends a repro string to the exception message (a minimal sketch of this wrapping follows the list below). The repro includes setting the necessary test flags through env vars. Example:
```
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_CROSSREF=1 python test/test_ops.py -k test_foo_add_cuda_float32
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
* To keep track of flag settings, this PR introduces a new `TestEnvironment` class that defines global flags by querying related environment variables. Flag and env var names are purposefully kept searchable via full names. Example usages:
```python
TestEnvironment.def_flag("TEST_WITH_TORCHINDUCTOR", env_var="PYTORCH_TEST_WITH_INDUCTOR")
# can track implication relationships to avoid adding unnecessary flags to the repro
TestEnvironment.def_flag(
    "TEST_WITH_TORCHDYNAMO",
    env_var="PYTORCH_TEST_WITH_DYNAMO",
    implied_by_fn=lambda: TEST_WITH_TORCHINDUCTOR or TEST_WITH_AOT_EAGER)
# can use include_in_repro=False to keep the flag from appearing in the repro command
TestEnvironment.def_flag(
    "DISABLE_RUNNING_SCRIPT_CHK", env_var="PYTORCH_DISABLE_RUNNING_SCRIPT_CHK", include_in_repro=False)
# the default for `default` is False, but this can be changed
TestEnvironment.def_flag(
    "PRINT_REPRO_ON_FAILURE", env_var="PYTORCH_PRINT_REPRO_ON_FAILURE", default=(not IS_FBCODE), include_in_repro=False)
```
* AFAICT it is only feasible to achieve this from within the test framework rather than at the CI level. This is because CI / `run_test.py` are unaware of individual test cases. Implementing it in our base `TestCase` class has the broadest area of effect, as it's not isolated to e.g. OpInfo tests.
* I couldn't find an easy way to test the logic via `test_testing.py`, as the logic for extracting the test filename doesn't work for generated test classes. I'm open to ideas on testing this, however.
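For reference, a minimal sketch of the context-manager wrapping mentioned above (hypothetical helper; the actual `TestCase` logic is more involved and hooks the framework differently):
```python
import contextlib
import unittest

@contextlib.contextmanager
def print_repro_on_failure(repro_str):
    try:
        yield
    except unittest.SkipTest:
        raise  # skips pass through untouched
    except Exception as e:
        # Attach the repro string to the exception message and re-raise.
        e.args = (f"{e}\n\n{repro_str}",) + e.args[1:]
        raise

class ReproTestCase(unittest.TestCase):
    def _callTestMethod(self, method):
        # `_callTestMethod` is a unittest internal (Python 3.8+), used here
        # only to illustrate where the wrapping happens.
        repro = f"python {__file__} -k {self._testMethodName}"  # simplified repro command
        with print_repro_on_failure(repro):
            super()._callTestMethod(method)
```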
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104537
Approved by: https://github.com/ezyang, https://github.com/janeyx99, https://github.com/huydhn
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.
The main changes are:
- Add `--no-translation-validation` as an option in _test/run_test.py_
- Set the `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
Skip all CUDA graph-related unit tests by setting the env var `PYTORCH_TEST_SKIP_CUDAGRAPH=1`
This PR refactors the `TEST_CUDA` Python variable out of test_cuda.py into common_utils.py. It also creates a new Python variable, `TEST_CUDA_GRAPH`, in common_utils.py, with an env var switch to turn off all CUDA graph-related tests.
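A minimal sketch of the switch (illustrative; the actual definitions in common_utils.py may differ):
```python
import os
import torch

# True when CUDA is available at all.
TEST_CUDA = torch.cuda.is_available()

# CUDA graph tests run only when CUDA is available *and* the opt-out env var
# is unset; PYTORCH_TEST_SKIP_CUDAGRAPH=1 disables them wholesale.
TEST_CUDA_GRAPH = TEST_CUDA and os.environ.get("PYTORCH_TEST_SKIP_CUDAGRAPH", "0") != "1"
```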
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103032
Approved by: https://github.com/malfet
There are some I can't easily switch due to reasons like:
- Dynamo modelling the guard
- BC concerns (for torch.autograd.set_multithreading_enabled)
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102642
Approved by: https://github.com/albanD
Enables the hipSOLVER backend for ROCm builds
--------------------------------------------------------------------------
- Minimum ROCm version requirement: 5.3
- Introduces a new macro USE_LINALG_SOLVER that controls enablement of both cuSOLVER and hipSOLVER
- Adds the hipSOLVER API to the hipification process
- Combines hipSOLVER and hipSPARSE mappings into a single SPECIAL map that takes priority over normal mappings
- Torch APIs to be moved to the hipSOLVER backend (as opposed to MAGMA) include: torch.svd(), torch.geqrf(), torch.orgqr(), torch.ormqr()
- Will enable 100+ linalg unit tests for ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97370
Approved by: https://github.com/malfet
The console log blows up too much when running in rerun-disabled-tests mode (x50) e132f09e88. Each log is around 1GB and the whole set of uncompressed logs is ~50GB. After compression it is around 1GB, which is still too big. The increase comes mainly from the multiple SKIPPED messages for non-disabled tests, which is expected given how SkipTest and pytest-flakyfinder currently work.
I updated `test/conftest.py` to completely ignore skipped tests when rerunning disabled tests, instead of collecting each of them and then skipping it 50 times. The benefit of doing this is much greater than I originally expected:
* Rerun disabled tests jobs now finish in less than half an hour, as they should
* Fix the OOM runner crash caused by too many collected tests
* Fix the verbosity issue, as now only disabled tests are run x50 times. There are only a few hundred of them at the moment
* Fix the timeout issue when rerunning disabled distributed and ASAN tests, which are just too slow when run at x50
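Concretely, the filtering happens at collection time. A simplified sketch of the conftest.py idea (hypothetical code; the real test/conftest.py identifies skipped tests differently):
```python
# conftest.py -- when --rerun-disabled-tests is set, deselect every test that
# would merely be skipped instead of collecting 50 copies of it.
def pytest_collection_modifyitems(config, items):
    if not config.getoption("--rerun-disabled-tests", default=False):
        return
    kept, deselected = [], []
    for item in items:
        # Non-disabled tests show up with a skip marker in this mode; drop
        # them entirely rather than emitting 50 SKIPPED lines for each.
        (deselected if item.get_closest_marker("skip") else kept).append(item)
    if deselected:
        config.hook.pytest_deselected(items=deselected)
        items[:] = kept
```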
### Testing
When rerunning disabled tests https://github.com/pytorch/pytorch/actions/runs/5084508614, only disabled tests on the platform are run; for example, `test_ops_jit` on https://ossci-raw-job-status.s3.amazonaws.com/log/13770164954 ran only 100 test instances (`test_variant_consistency_jit_linalg_lu_cuda_float32` and `test_variant_consistency_jit_linalg_lu_factor_cuda_complex64`, x50 each).
```
Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_jit.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--sc=test_ops_jit_1', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2023-05-25 21:32:49.763856]
Expand the folded group to see the log file of test_ops_jit 2/2
##[group]PRINTING LOG FILE of test_ops_jit 2/2 (/var/lib/jenkins/workspace/test/test-reports/test_ops_jit_h2wr_t2c.log)
Test results will be stored in test-reports/python-pytest/test_ops_jit/test_ops_jit-51a83bd44549074e.xml
============================= test session starts ==============================
platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /opt/conda/envs/py_3.10/bin/python
cachedir: .pytest_cache
hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow]
rootdir: /var/lib/jenkins/workspace
configfile: pytest.ini
plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-11.1.2, shard-0.1.2, xdist-3.3.0, xdoctest-1.1.0
collecting ... collected 1084 items
Running 100 items in this shard: test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 (x50), test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 (x50)
stepcurrent: Cannot find last run test, not skipping
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 PASSED [2.1876s] [ 1%]
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 PASSED [4.5615s] [ 2%]
```
* [pull](https://github.com/pytorch/pytorch/actions/runs/5093566864)
* [trunk](https://github.com/pytorch/pytorch/actions/runs/5095364311)
* [periodic](https://github.com/pytorch/pytorch/actions/runs/5095378850)
* [slow](https://github.com/pytorch/pytorch/actions/runs/5095390285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102107
Approved by: https://github.com/clee2000, https://github.com/malfet
Notes:
- No segfaults were observed in any CI tests: dynamo unittests, inductor unittests, dynamo-wrapped pytorch tests. So we remove the warning that using dynamo with Python 3.11 may result in segfaults.
- Some dynamo-wrapped pytorch tests hang. They will be skipped in the dynamo-wrapped test suite and addressed in a future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99180
Approved by: https://github.com/malfet
Beefing up the docs with a discussion of when to use `instantiate_device_type_tests()` vs. `instantiate_parametrized_tests()`, plus a description of what each does.
Spoiler: use only one - the former for device-specific tests and the latter for device-agnostic tests. Both support `@parametrize`.
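A minimal sketch of both entry points (hypothetical tests; the imports are from PyTorch's internal test utilities):
```python
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests)

# Device-agnostic: generates one variant per @parametrize value.
class TestAgnostic(TestCase):
    @parametrize("x", [1, 2, 3])
    def test_double(self, x):
        self.assertEqual(2 * x, x + x)

# Device-specific: generates per-device classes (e.g. TestPerDeviceCPU,
# TestPerDeviceCUDA); test methods receive a `device` argument.
class TestPerDevice(TestCase):
    @parametrize("x", [1, 2])
    def test_identity(self, device, x):
        self.assertEqual(x, x)

instantiate_parametrized_tests(TestAgnostic)
instantiate_device_type_tests(TestPerDevice, globals())

if __name__ == "__main__":
    run_tests()
```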
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100905
Approved by: https://github.com/janeyx99
Previously, when using `self.assertRaisesRegex` to test a raised exception and its message, the regex wasn't actually compared because mps was not in `NATIVE_DEVICES`. This PR fixes that by enabling exception regex comparison for the mps device.
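A hypothetical test showing the kind of check that now works (the exact error message here is an assumption, and the test requires an mps device):
```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests

class TestMPSErrors(TestCase):
    def test_cat_zero_dim_raises(self):
        # Before this fix, the regex below was silently ignored on mps;
        # now it is actually matched against the raised message.
        with self.assertRaisesRegex(RuntimeError, "zero-dimensional.*cannot be concatenated"):
            torch.cat([torch.tensor(1.0, device="mps"),
                       torch.tensor(2.0, device="mps")])

if __name__ == "__main__":
    run_tests()
```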
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100367
Approved by: https://github.com/albanD
* Add a stepcurrent flag (`--sc`), based on pytest's stepwise flag, that saves the currently running test so that a run can resume from the last successful test after a segfault; it takes a key argument so that different test runs don't overwrite each other (a sketch follows this list)
* Send SIGINT to the process on timeout so that the XML report can still be generated
* Add a currently unused stepcurrent-skip flag (`--scs`), based on the stepwise skip flag, that skips the failing test; it was going to be used for the keep-going label, but that is having trouble in CI
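A minimal sketch of the stepcurrent idea as a conftest.py plugin (hypothetical code; the real implementation in test/conftest.py differs in details):
```python
# conftest.py -- record the test about to run under a user-supplied key, and
# on the next run resume from that point instead of starting over.
def pytest_addoption(parser):
    parser.addoption("--sc", action="store", default=None,
                     help="key under which the currently running test is recorded")

def pytest_collection_modifyitems(config, items):
    key = config.getoption("--sc")
    if key is None:
        return
    last = config.cache.get(f"stepcurrent/{key}", None)
    if last is None:
        return
    # Resume at the recorded test: everything before it already passed.
    for i, item in enumerate(items):
        if item.nodeid == last:
            del items[:i]
            break

def pytest_runtest_protocol(item, nextitem):
    key = item.config.getoption("--sc")
    if key is not None:
        # Record the test about to run; if the process segfaults mid-test,
        # this marker survives and the next run resumes here.
        item.config.cache.set(f"stepcurrent/{key}", item.nodeid)
    return None  # defer to pytest's default run protocol
```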
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98035
Approved by: https://github.com/huydhn
In edge cases in CI, SLOW_TESTS_FILE is defined but does not point to an existing file.
I'm guessing this is due to a test case that spawns a subprocess and changes its working directory but doesn't clean up its env.
We shouldn't make importing common_utils fail in that case, so issue a warning and proceed.
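A minimal sketch of the guard (illustrative names; the real check in common_utils.py may differ):
```python
import os
import warnings

slow_tests_file = os.getenv("SLOW_TESTS_FILE")
if slow_tests_file:
    if os.path.exists(slow_tests_file):
        with open(slow_tests_file) as f:
            slow_tests = f.read()  # load the slow-test list as before
    else:
        # Don't fail the import: the env var can leak in from a subprocess
        # that changed directories. Warn and continue without the list.
        warnings.warn(f"SLOW_TESTS_FILE={slow_tests_file} does not exist")
```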
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99659
Approved by: https://github.com/ezyang, https://github.com/malfet
Planning to do a full writeup later. The short story is, sometimes the following chain of events happens:
1. We turn on Dynamo's custom frame handler
2. GC triggers (and all of the finalizers run under Dynamo)
3. GC hits a GeneratorExit frame
4. You end up in the custom frame handler with throw_flag == TRUE and PyErr_Occurred() != NULL
If this happens and we blindly call into other Python functions (like the Python callback), the executed Python code will immediately raise an exception (because there's already an ambient exception set.) This is very, very confusing. The fix is to defer to the regular handler when throw_flag is TRUE.
I triggered this locally with
```
PYTHONUNBUFFERED=1 pytest test/dynamo/test_dynamic_shapes.py -k 'Unspec and export and not dupes and not reorder' -v -x -s
```
But I also have some tests which trigger the problem synthetically.
Fixes https://github.com/pytorch/pytorch/issues/93781
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96488
Approved by: https://github.com/albanD
Run more tests through pytest.
Use a blocklist for tests that shouldn't run through pytest. As far as I can tell, the number of tests run, skipped, and xfailed is the same for tests not on the blocklist.
Regarding the main module:
Usually when tests are run in CI, we call `python <test file>`, which causes the file to be imported under the module name `__main__`. However, pytest searches for the module to import under the file name, so the file gets reimported. This can cause issues for tests that run module-level code and change global state, like test_nn, which modifies lists imported from another file, or the tests in test/lazy, which initialize a backend that cannot coexist with a second copy of itself.
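For illustration, a minimal file (hypothetical name test_demo.py) that shows the double import:
```python
# test_demo.py -- running `python test_demo.py` imports this file as __main__;
# pytest.main([__file__]) then reimports it under the name test_demo, so any
# module-level code (including global-state mutations) runs a second time.
import sys
import pytest

print(f"module-level code ran as {__name__!r}")

def test_trivial():
    assert True

if __name__ == "__main__":
    sys.exit(pytest.main([__file__]))
```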
My workaround for this is to run tests from the `__main__` module. However, this results in pytest being unable to rewrite assertions (and possibly breaks other pytest features I'm not aware of). A better solution might be to call `pytest <test file>` directly and move all the code in run_tests(argv) to module-level code, or put it in a hook in conftest.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95844
Approved by: https://github.com/huydhn
Summary: Currently, running PyTorch tests with dynamo and inductor is controlled by environment variables, and CI sets them based on test config name matching. Change this to use run_test.py options instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94539
Approved by: https://github.com/huydhn
The expression `argv + [f'--junit-xml-reruns={test_report_path}'] if TEST_SAVE_XML else []` evaluates to the empty list when `TEST_SAVE_XML` is false, because the conditional expression binds more loosely than `+`; parentheses would be needed around the conditional part.
Instead simplify the code by appending the argument when required directly where `test_report_path` is set.
Note that `.append()` may not be used, as that would modify `argv` and in turn `UNITTEST_ARGS`, which might have undesired side effects.
Without this patch, `pytest.main()` would be called with no arguments, which tries to discover all tests in the current working directory and ultimately leads to (many) failures.
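For illustration, a small self-contained demo of the precedence pitfall (the values are made up):
```python
# The conditional expression binds more loosely than `+`, so the
# unparenthesized form parses as `(argv + [flag]) if TEST_SAVE_XML else []`
# and silently drops argv when the flag is off.
TEST_SAVE_XML = False
argv = ["-v", "-k", "test_foo"]
flag = "--junit-xml-reruns=report.xml"

buggy = argv + [flag] if TEST_SAVE_XML else []
fixed = argv + ([flag] if TEST_SAVE_XML else [])

assert buggy == []    # argv was lost
assert fixed == argv  # argv preserved when the flag is off
```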
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94589
Approved by: https://github.com/clee2000, https://github.com/Neilblaze