Commit Graph

717 Commits

Catherine Lee
946b50c788 [ez][TD] Increase logging (#124082)
Increase logging during TD
Generate an artifact that says which tests got excluded
Fix a minor bug where the filter test configs script couldn't get commit messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124082
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:18:28 +00:00
Catherine Lee
3cd06f56b1 [ez] test_profiler in serial (#123665)
Add test_profiler to the serial list since we keep needing to reopen disable issues, and I think it's due to being incompatible with parallelism
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123665
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-04-11 20:24:47 +00:00
William Wen
4bee4c7c25 [3.12] enable inductor unittests (#123654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123654
Approved by: https://github.com/jansel
2024-04-10 20:51:43 +00:00
Catherine Lee
61be8843c9 [TD] Use label to configure td on distributed for rollout (#122976)
Gate TD on distributed behind label

TODO:
auto add label to certain people's prs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122976
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-08 15:53:55 +00:00
William Wen
d59c5d7353 [dynamo, 3.12] enable dynamo on 3.12, enable most dynamo unittests on 3.12 (#123216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123216
Approved by: https://github.com/jansel, https://github.com/malfet
2024-04-04 20:00:54 +00:00
Catherine Lee
b5bef9bbfd Fix cpp tests not running + failing to surface (#122845)
The comment in the code should have the information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122845
Approved by: https://github.com/huydhn
2024-03-29 22:41:45 +00:00
Catherine Lee
03184a82dd [TD] TD on ASAN PR jobs (#122332)
Low impact CPU jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332
Approved by: https://github.com/huydhn
2024-03-22 22:32:51 +00:00
eellison
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:

- We select an extern kernel, but an epilogue like relu() exists such that choosing a triton template + relu would have been faster.
- We select a triton template and fuse the epilogue, but register spilling occurs, making it slower than not fusing the epilogue.

In this PR we wait to select either the Triton Template or the Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice, we finalize the MultiTemplateBuffer as that specific template. If no fusion occurs, we finalize the MultiTemplateBuffer after the fusion pass.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogues are worth fusing. We could potentially defer choosing the template in this case in a follow-up, at the expense of compile time.

Gives a 4% HF training win and a 10% TIMM inference win. It increases compilation time, which I will try to address further in follow-up PRs.
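
Roughly, the deferred choice boils down to something like the following sketch (plain Python, not inductor's actual classes; `benchmark` and the candidate list are illustrative only):

```python
from typing import Callable, Optional


def choose_kernel(
    candidates: list[Callable],                 # e.g. an extern kernel plus several triton templates
    benchmark: Callable[[Callable], float],     # returns a runtime for the given callable
    epilogue: Optional[Callable] = None,        # e.g. relu, or None if no fusion was found
) -> Callable:
    # No epilogue to fuse: fall back to picking the fastest standalone kernel.
    if epilogue is None:
        return min(candidates, key=benchmark)

    def fused(kernel: Callable) -> Callable:
        return lambda *args: epilogue(kernel(*args))

    # Benchmark each candidate together with the epilogue; a template that loses
    # standalone may win once the epilogue is fused, and vice versa if registers spill.
    return min(candidates, key=lambda k: benchmark(fused(k)))
```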

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
Kai Londenberg
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
Catherine Lee
fac06a12c8 CI sanity check test for env vars (#120519)
Make a test that fails on purpose in order to trigger retries. It checks the opposite of success (i.e. that the env vars exist).

It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since that's confusing.
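
A rough sketch of what such a sanity test could look like (the env var names and test class are hypothetical, and the intentional first-attempt failure that triggers the rerun is omitted):

```python
import os
import unittest

# Illustrative list only; the real check would use whatever env vars CI is expected to set.
REQUIRED_ENV_VARS = ["CI", "BUILD_ENVIRONMENT"]


class TestEnvVarsSurviveRerun(unittest.TestCase):
    def test_env_vars_exist(self):
        # Verify that the CI env vars are still visible to the (re)run test process.
        for var in REQUIRED_ENV_VARS:
            self.assertIn(var, os.environ, f"expected {var} to be set in CI")


if __name__ == "__main__":
    unittest.main()
```
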
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-11 15:35:45 +00:00
PyTorch MergeBot
2c2d6ce515 Revert "CI sanity check test for env vars (#120519)"
This reverts commit f43b9c56c5.

Reverted https://github.com/pytorch/pytorch/pull/120519 on behalf of https://github.com/clee2000 due to broken on slow d27509c384 https://github.com/pytorch/pytorch/actions/runs/8208843198/job/22453617568 ([comment](https://github.com/pytorch/pytorch/pull/120519#issuecomment-1986480624))
2024-03-08 22:01:35 +00:00
Catherine Lee
f43b9c56c5 CI sanity check test for env vars (#120519)
Make a test that fails on purpose in order to trigger retries. It checks the opposite of success (i.e. that the env vars exist).

It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since that's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-08 20:28:50 +00:00
Catherine Lee
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this one job's artifact; they will always be in sync with each other and we no longer need to worry about consistency issues

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move TD calculation to its own file that will create a JSON file with the final results (sketched below)
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (e.g. it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
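
A toy sketch of the kind of artifact this job could produce (file name, fields, and scores are illustrative, not the actual format):

```python
import json


def write_td_rankings(scores: dict[str, float], path: str = "td_rankings.json") -> None:
    # Rank every discovered test file by its TD score and write the full ranking
    # to a JSON artifact that each test shard can download and filter.
    ranking = sorted(scores, key=scores.get, reverse=True)
    with open(path, "w") as f:
        json.dump({"ranking": ranking, "scores": scores}, f, indent=2)


# Each shard then keeps only the tests it actually owns, e.g. default vs. distributed.
write_td_rankings({"test_ops": 0.9, "distributed/test_c10d": 0.7, "test_nn": 0.2})
```
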
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
Catherine Lee
0290fe65bd Test TD (test removal) on crossref (#119426)
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut, which makes the top 25% take really long (still 90+ min)

The original plan was to test on rocm, but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min. Crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
2024-02-29 18:53:43 +00:00
albanD
30625ae582 Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) and also retry it with our own logic?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-26 22:21:14 +00:00
Catherine Lee
c39bbd6def Numbers based TD (#119901)
Convert from a list/bucket based TD system to just a numbers based TD system. It looks like a massive change, but a decent amount of it is tests and removing code.

The main file of interest is interface.py, which GitHub collapses by default due to its size

The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.

Other notable changes:
* Use frozenset to make TestRun hashable (see the sketch below)
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
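
A minimal sketch of why frozenset matters here (the real TestRun has more fields; this only shows the hashability angle):

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TestRun:
    # Illustrative shape only: a test file plus an optional frozenset of the classes
    # included in this run. frozenset (unlike set) is hashable, so the frozen
    # dataclass built from it can be used as a dict key or stored in sets.
    test_file: str
    included: Optional[FrozenSet[str]] = None


runs = {TestRun("test_ops", frozenset({"TestCommonCPU"})): 0.9}
assert TestRun("test_ops", frozenset({"TestCommonCPU"})) in runs
```
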
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2024-02-26 17:01:19 +00:00
Catherine Lee
cfddfce0d3 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests

Move serial tests to run first

If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside other parallel tests as much as possible, instead of interleaving serial and parallel tests (which decreases the effectiveness of parallelization), while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
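
A rough sketch of the packing described above, assuming per-test times in minutes (greedy and simplified compared to the real sharding code):

```python
def shard_tests(serial: dict[str, float], parallel: dict[str, float],
                num_shards: int, procs: int) -> list[list[str]]:
    # Effective total time counts parallel tests at 1/procs of their runtime.
    total = sum(serial.values()) + sum(parallel.values()) / procs
    target = total / num_shards
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    loads = [0.0] * num_shards

    # Pack serial tests onto as few shards as possible, filling each to the target.
    i = 0
    for test, minutes in sorted(serial.items(), key=lambda kv: -kv[1]):
        if loads[i] >= target and i < num_shards - 1:
            i += 1
        shards[i].append(test)
        loads[i] += minutes

    # Spread parallel tests over the least-loaded shards (mostly the non-serial ones).
    for test, minutes in sorted(parallel.items(), key=lambda kv: -kv[1]):
        j = loads.index(min(loads))
        shards[j].append(test)
        loads[j] += minutes / procs
    return shards
```
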
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-21 16:40:27 +00:00
Catherine Lee
af765dbdfd [ez] Explicit env for run_test (#120251)
env=None (which is the default) inherits the env from the calling process. Explicitly set the env to the calling process's env so that things can be added to it later

Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
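
The gist of the change, sketched (the actual call site in run_test.py differs):

```python
import os
import subprocess

# env=None would inherit the parent environment implicitly; copying os.environ makes
# the inheritance explicit so extra variables can be layered on before launching tests.
env = os.environ.copy()
env["PYTORCH_EXTRA_SETTING"] = "1"   # hypothetical variable added later
subprocess.run(["python", "test/run_test.py"], env=env, check=True)
```
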
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
2024-02-21 00:40:19 +00:00
PyTorch MergeBot
dfb83df889 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit 47182a8f4b.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/clee2000 due to iirc the default setting of env to None causes it to inherit the env of the calling process, I'll make a PR that makes it so that the old env vars don't disappear, and then re merge this on top of it.  Reverting this because I think some important env vars are disappearing (specifically CI) ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1955128676))
2024-02-20 21:28:13 +00:00
PyTorch MergeBot
9b38ee2343 Revert "Alternate sharding (#119078)"
This reverts commit 861acda205.

Reverted https://github.com/pytorch/pytorch/pull/119078 on behalf of https://github.com/clee2000 due to failing 861acda205 ([comment](https://github.com/pytorch/pytorch/pull/119078#issuecomment-1946583857))
2024-02-15 16:59:50 +00:00
Catherine Lee
861acda205 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests

Move serial tests to run first

If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside other parallel tests as much as possible, instead of interleaving serial and parallel tests (which decreases the effectiveness of parallelization), while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-15 01:32:44 +00:00
atalman
244b124bb8 Add linux cpu test for 3.12 (#117853)
This is a continuation of the work in https://github.com/pytorch/pytorch/pull/113987

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853
Approved by: https://github.com/albanD
2024-02-14 20:52:23 +00:00
albanD
47182a8f4b Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) and also retry it with our own logic?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-14 18:40:23 +00:00
Catherine Lee
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
PyTorch MergeBot
8182fce769 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit fbe6f6236e.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))
2024-02-08 17:20:39 +00:00
albanD
fbe6f6236e Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) and also retry it with our own logic?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-08 00:54:16 +00:00
Huy Do
3ed9df36a9 Clean up some obsolete TODOs in run_test and several test files (#119113)
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` don't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen. I have never seen a flaky C++ test that needed to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
2024-02-03 23:54:30 +00:00
Joel Schlosser
3b41793412 Purge redundant module init tests (#119028)
Fixes #118784

This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028
Approved by: https://github.com/zou3519
2024-02-02 20:17:00 +00:00
Catherine Lee
8b729fb826 [ez] Fix CI log file piping error (#118807)
Fixes https://github.com/pytorch/pytorch/issues/118764

Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-02-02 03:07:56 +00:00
Catherine Lee
9391af9796 Merging heuristics (#118029)
Every day I move closer and closer to just using numbers

* Number of heuristics that marked it as high, probable, low, none, etc.
* Order of heuristics in the `__init__` file, as well as how each heuristic ordered the tests
* Put the "historical edited files" and "profiling" heuristics as not trial mode
* Briefly sanity checked that all shards of the larger test files (e.g. test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
2024-01-31 20:00:10 +00:00
Catherine Lee
2eefbc02a0 [ez] Discover tests without importing torch (#118574)
Moves test discovery into a file that doesn't have import torch so test listing can be done without having torch installed.

Helpful when you don't have torch installed (aka me when I'm feeling lazy).
I want to move TD into its own job that doesn't need to wait for the build to finish, so this is part of that.

The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
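
A sketch of torch-free discovery (the paths and glob pattern are illustrative, not the actual implementation):

```python
from pathlib import Path


def discover_tests(test_dir: str = "test") -> list[str]:
    # List test modules purely from the filesystem, so this can run before
    # (or entirely without) a torch build.
    files = Path(test_dir).rglob("test_*.py")
    return sorted(
        str(p.relative_to(test_dir).with_suffix("")).replace("\\", "/")
        for p in files
    )


print(discover_tests()[:5])
```
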
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
2024-01-30 03:02:29 +00:00
Catherine Lee
84251d1d71 [ez] Windows log printing + save successful test logs (#118124)
When doing print(f.read().decode(...)) it prints an extra newline, so manually splitlines and strip to see if that helps.

My guess is Windows line-ending differences.

Also always save the log file, regardless of success or failure.

See 476b81a9bf for what it looks like now

Swapped to opening in text mode instead of binary, seems to be ok now.

See 42483193bf024983060a234dc0262f4840aef4b8 for an example.
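
Roughly the difference, sketched (file name illustrative):

```python
# Reading in text mode normalizes Windows "\r\n" line endings; splitting and
# stripping avoids print() adding a blank line after each already-terminated line.
with open("test.log", encoding="utf-8") as f:   # text mode, not "rb"
    for line in f.read().splitlines():
        print(line.rstrip())
```
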
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
2024-01-26 21:14:25 +00:00
Catherine Lee
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long).

Adds settings for no timeout (the step timeout still applies; this only gets rid of the ~30 min timeout for a shard of a test file) and for not piping logs / extra-verbose test logs (good for debugging deadlocks, but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name], as in the test above.
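
A sketch of how bracketed settings could be pulled out of a PR body (the regex and setting names are illustrative):

```python
import re

# Hypothetical: treat any [bracketed name] in the PR body as if the label were set.
KNOWN_SETTINGS = {"ci-verbose-test-logs", "ci-no-timeout", "ci-no-td"}  # illustrative


def settings_from_pr_body(body: str) -> set[str]:
    found = {m.lower() for m in re.findall(r"\[([\w-]+)\]", body)}
    return found & KNOWN_SETTINGS


print(settings_from_pr_body("Testing a deadlock [ci-verbose-test-logs] please"))
```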

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
Catherine Lee
364728b27b Reduce pytest prints (#117069)
* Custom pytest-shard so I can control the verbosity (also index shards from 1, since 0-indexing is confusing)
* Normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before yours segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-23 18:39:30 +00:00
chuanqiw
40890ba8e7 [CI] Add python test skip logic for XPU (#117621)
Add python test skip logic for XPU

For test purposes, cherry-pick #116833 & #116850 first, and the xpu test passed https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. They are reverted now.

Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621
Approved by: https://github.com/huydhn
2024-01-23 08:20:42 +00:00
Catherine Lee
cef5b93f28 [ez] Serial when NUM_PROCS is 1 (#117977)
Makes it easier to understand what's going on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977
Approved by: https://github.com/huydhn
2024-01-22 23:11:41 +00:00
ydwu4
f96b7d06d7 [export] skip export tests when test with dynamo in ci (#117988)
Fixes https://github.com/pytorch/pytorch/issues/117947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988
Approved by: https://github.com/suo, https://github.com/zou3519
2024-01-22 22:14:36 +00:00
PyTorch MergeBot
f684e44fd6 Revert "Reduce pytest prints (#117069)"
This reverts commit 40dbd567e0.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))
2024-01-19 23:07:51 +00:00
Catherine Lee
40dbd567e0 Reduce pytest prints (#117069)
* Custom pytest-shard so I can control the verbosity (also index shards from 1, since 0-indexing is confusing)
* Normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before yours segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-19 18:42:12 +00:00
Catherine Lee
6c5c2121b1 Run some OOMing tests serially (#117759)
They were disabled for being flaky due to OOMs but got renamed. Seeing if running them serially helps

I kind of want to keep this test disabled since the rest of the file is probably fine...

Issues in question: #113132 #113136 #113140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 16:45:35 +00:00
PyTorch MergeBot
77cfacab55 Revert "Reduce pytest prints (#117069)"
This reverts commit 2f89ef2300.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))
2024-01-19 00:27:03 +00:00
Catherine Lee
2f89ef2300 Reduce pytest prints (#117069)
* Custom pytest-shard so I can control the verbosity (also index shards from 1, since 0-indexing is confusing)
* Normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before yours segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-18 23:30:59 +00:00
rzou
5aa895e53e Don't run inductor tests in Dynamo shard (#117747)
In theory we could, but these get really slow once we turn on strict
mode, so we're not going to for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117729
2024-01-18 17:43:30 +00:00
Jack Taylor
db79ceb110 [ROCm] Enabling additional UTs on ROCm (#115738)
Unskips, mostly for dynamo/inductor UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-01-09 08:36:07 +00:00
Catherine Lee
d455c33cca [ez][td] Pipe TD logs to log file (#116796)
It is a bit annoying to have them come up when searching through the logs. They're also surprisingly long
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796
Approved by: https://github.com/huydhn
2024-01-05 19:05:12 +00:00
Catherine Lee
c52b78ebc2 [ez] Remove some args from run_test.py (#115459)
Don't think anyone uses these
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115459
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:56:37 +00:00
Sijia Chen
641ec2115f [AOTI] move model runner into a library (#115220)
Summary: So that we can import it in fbcode and do some AOTI runs in a Python env

Test Plan: existing AOTI tests

Reviewed By: chenyang78

Differential Revision: D51780021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115220
Approved by: https://github.com/desertfire
2023-12-09 19:03:32 +00:00
Catherine Lee
3b7d60b6ff Fix keep-going (#112098)
New function for continue on error

Another solution might be to run the entire suite to the end and use last-failed, but I'm worried about concurrent processes writing to the same last-failed cache entry. It's also a bit different from the usual test-rerunning strategy we use, especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (e.g. a cuda error that causes sync to fail).

Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d

TODO: continue on error for --subprocess and test_distributed isn't fully working
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098
Approved by: https://github.com/huydhn
2023-11-30 04:01:57 +00:00
Jithun Nair
2ea2421b44 Skip unit tests that fail on MI210 runners (#114613)
Taken from https://github.com/pytorch/pytorch/pull/105980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613
Approved by: https://github.com/malfet
2023-11-27 22:25:35 +00:00
Philip Meier
2aa486de9b vendor packaging.version (#114108)
Fixes #113940. This vendors the relevant parts of `packaging==23.2.0` to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`.

I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise.
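
Usage then looks roughly like this (the exact import path is an assumption based on the vendoring location described above):

```python
# Assumed import path following the vendoring location described above.
from torch._vendor.packaging.version import InvalidVersion, Version

try:
    if Version("12.1") >= Version("11.6"):
        print("new enough")
except InvalidVersion:
    print("could not parse version string")
```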

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108
Approved by: https://github.com/malfet, https://github.com/albanD
2023-11-21 11:51:23 +00:00
Zain Rizvi
ec20c9044e [TD] Fix metric emission for split test files (#113789)
Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789
Approved by: https://github.com/clee2000
2023-11-16 23:19:40 +00:00
Catherine Lee
87aeb248c9 More random stepcurrent (#113620)
Distributed tests for different backends have the same name, so they end up clashing under the current stepcurrent key, and tests were not being run.

Disabled the following tests because they are failing:
test_ddp_has_finalized

test_broadcast_object_list
<details>

```

2023-11-14T06:44:01.0428686Z
2023-11-14T06:44:01.0430447Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_broadcast_object_list <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.0431048Z [1699943450.893723] [99f90b6e6ff3:10028:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0431625Z [1699943450.914385] [99f90b6e6ff3:10029:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0432314Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0433178Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0434677Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0435435Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0436895Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0437500Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0438917Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0439637Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0441122Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0441873Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0443340Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0444077Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0445769Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0446732Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0448433Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0449187Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0450553Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0451621Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0453161Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0454065Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0455441Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0456183Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0457775Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0458649Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0460923Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 1][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0461471Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0462430Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0463552Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0464082Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0465136Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0465945Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.0466605Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0467303Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0467972Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0468743Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0470233Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0471106Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0472581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0473162Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0474581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0475314Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0476776Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0477535Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0478993Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0479886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0481593Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0482429Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0484145Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0484886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0486271Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0487018Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0488559Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0489470Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0491078Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0491912Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0493369Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0494419Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0496679Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0497211Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0498198Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0499291Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0499838Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0500881Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0501667Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.0502343Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0503024Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0503411Z ('RERUN', {'yellow': True}) [6.1102s] [100%]
```
</details>

test_ddp_sync_bn_training_vs_eval

<details>

```

2023-11-14T06:44:01.1494815Z
2023-11-14T06:44:01.1496630Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_sync_bn_training_vs_eval <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.1497290Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1498119Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1498808Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1499465Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1500160Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1500820Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1501556Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502239Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502952Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1503678Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1504350Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1505119Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1506729Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1507492Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1508992Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1509578Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1510994Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1511725Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1513193Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1513962Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1515697Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1516529Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1518019Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1518910Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1520177Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1521062Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1522238Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1523099Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1523923Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1524470Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1525481Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1526632Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1527180Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1528223Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1529029Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.1529786Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1530576Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1532383Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1533127Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1534608Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1535194Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1536817Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1537575Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1539036Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1539800Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1541531Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1542388Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1544015Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1544907Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1546061Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1546944Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1548142Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1548991Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1549806Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1550350Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1551304Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1552462Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1553095Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1554166Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1554976Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.1555235Z ('RERUN', {'yellow': True}) [6.6107s] [100%]
```
</details>

test_backend_full_group
<details>

```
2023-11-14T22:51:56.4502470Z FAILED [5.2125s] distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_backend_full_group - RuntimeError: Process 0 exited with error code 10 and exception:
2023-11-14T22:51:56.4502665Z Traceback (most recent call last):
2023-11-14T22:51:56.4503603Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4503796Z     getattr(self, test_name)()
2023-11-14T22:51:56.4504710Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4504845Z     fn()
2023-11-14T22:51:56.4505737Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4505896Z     method(*args, **kwargs)
2023-11-14T22:51:56.4506823Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4506992Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4508285Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4508640Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4509798Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4510104Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4510629Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4510650Z
2023-11-14T22:51:56.4510987Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4511525Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4511545Z
2023-11-14T22:51:56.4511970Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4511989Z
2023-11-14T22:51:56.4512242Z Process 1 exited with error code 10 and exception:
2023-11-14T22:51:56.4512454Z Traceback (most recent call last):
2023-11-14T22:51:56.4513380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4513687Z     getattr(self, test_name)()
2023-11-14T22:51:56.4514612Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4514746Z     fn()
2023-11-14T22:51:56.4515633Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4515791Z     method(*args, **kwargs)
2023-11-14T22:51:56.4516708Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4516895Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4518008Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4518352Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4519509Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4519813Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4520334Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4520355Z
2023-11-14T22:51:56.4528843Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4529492Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4529681Z
2023-11-14T22:51:56.4530122Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4530423Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
```
</details>

Pretty sure the solution for this one is to add ucc in _test_group_override_backend
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430019
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113620
Approved by: https://github.com/huydhn
2023-11-15 21:56:10 +00:00
Catherine Lee
0c448526a4 [experiment][TD] Rating number system (#112676)
Emits an excessive amount of heuristic info, but that just means I can do more with it later?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112676
Approved by: https://github.com/ZainRizvi
2023-11-07 19:40:11 +00:00
Nikita Shulga
e2e5897269 [CI] Do not use packaging in run_tests.py (#112873)
It used to check that CUDA is newer than 11.6, but all of them are now

Yet another mitigation towards missing `packaging` on MacOS
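
If such a check were ever needed again without `packaging`, a plain tuple comparison is enough (a sketch, not what run_test.py does now):

```python
def version_tuple(v: str) -> tuple[int, ...]:
    # Compare dotted version strings numerically without importing packaging.
    return tuple(int(part) for part in v.split(".")[:2])


assert version_tuple("12.1") > version_tuple("11.6")
assert version_tuple("11.8") > version_tuple("11.6")
```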

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112873
Approved by: https://github.com/huydhn
2023-11-03 17:22:46 +00:00
Zain Rizvi
4e67c69a7d [TD] Support downgrading test relevance (#112671)
Allow heuristics to actually downgrade the relevance of a test. Note that NONE/UNLIKELY tests will still get executed, but they will be run at the end of CI.

The Relevance chosen affects the outcome when Heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance:

HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED

Given that we currently assume ordering based on the list in __init__ (since the lists are appended), do a similar thing for UNLIKELY and NONE.
For example, with HEURISTICS = [a, b, c, d]:
currently everything in b.high is appended after a.high
if b.none includes things in a.high, a.high trumps
if b.none includes things in a.probable, then b.none trumps, since NONE is stronger than PROBABLE
if b.unlikely includes things from a.high/probable, a.high/probable trumps, since HIGH and PROBABLE rank higher than UNLIKELY
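
A small sketch of the confidence ordering described above (the enum and resolver are illustrative, not the actual TD interface code):

```python
from enum import IntEnum


class Relevance(IntEnum):
    # Higher value = higher confidence in the declared relevance,
    # matching HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED above.
    UNRANKED = 0
    UNLIKELY = 1
    PROBABLE = 2
    NONE = 3
    HIGH = 4


def resolve(a: Relevance, b: Relevance) -> Relevance:
    # When two heuristics disagree about a test, the higher-confidence relevance wins.
    return max(a, b)


assert resolve(Relevance.NONE, Relevance.PROBABLE) is Relevance.NONE
assert resolve(Relevance.HIGH, Relevance.NONE) is Relevance.HIGH
```
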
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671
Approved by: https://github.com/clee2000
2023-11-02 21:02:40 +00:00
Zain Rizvi
a5641bc56b [TD] Enable Test Class granularity on heuristics (#112161)
Changes the heuristic framework to support prioritizing individual classes within a test file.

Components of this included:
- Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file it would pass in the test's name; now, to prioritize a class within a test, it uses the notation "test::classname"
- Changes are fully backwards compatible with existing heuristics
- Test sharding now supports sharding individual tests (for when they're prioritized)
- When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest
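
A rough sketch of how the "test::classname" notation could translate into pytest arguments (the helper is illustrative, not the framework's actual code):

```python
def build_pytest_args(entry: str):
    if "::" in entry:
        test_file, test_class = entry.split("::", 1)
        # prioritizing a single class: run only that class via -k
        return [f"{test_file}.py", "-k", test_class]
    return [f"{entry}.py"]

print(build_pytest_args("test_ops::TestCommonCPU"))  # ['test_ops.py', '-k', 'TestCommonCPU']
print(build_pytest_args("test_ops"))                 # ['test_ops.py']
```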

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161
Approved by: https://github.com/huydhn
2023-10-31 18:11:05 +00:00
Catherine Lee
3b5b7ebd09 [ci] Save various json files from test infra into folder (#111516)
We pull a lot of files from https://github.com/pytorch/test-infra/blob/generated-stats/stats and name them separately when we add them to the artifacts in the build, so stick them in a folder and just add that instead.

Slow test and disabled test jsons remain as they were since they are pulled during the test step and do not need to be included in the artifacts during build since they are not used for sharding.

Sanity checked that test times could be found for linux, mac, windows, and rocm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111516
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-10-23 20:38:25 +00:00
Nikita Shulga
e9a51a6a07 [BE] Revive test_typing (#111428)
`test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time.

In this PR, the same functionality is re-written using the unittest framework and `parametrize` from `torch.testing._internal._common_utils`.

Validated `test_typing.py` with ufmt.

Disable `fail/bitwise_ops.py` and `pass/jit.py`, as they regressed at some point, as well as one of the examples in `namedtuple.py`, since the `torch.linalg.qr` type is no longer revealed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428
Approved by: https://github.com/clee2000
2023-10-18 02:19:49 +00:00
Jack Taylor
6b92c367c5 Add test_jit_cuda_fuser to ROCM_BLOCKLIST (#110440)
Adds the nvfuser-related unit test suite to ROCM_BLOCKLIST, as it should not be run on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110440
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/lezcano
2023-10-06 08:47:15 +00:00
Catherine Lee
8a09fe4a05 [ez] Remove print in heuristics aggregation (#110621)
Move the print to the beginning instead, because putting it at the end means you have to scroll past it when debugging, and nothing in that function indicates that it should be printing anything.

Also move the line for printing disabled issues out of the for loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
2023-10-06 02:04:53 +00:00
Catherine Lee
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest.  Zip the log into an artifact.  The line listing all the test names is really long, but if you view the source of the raw logs, it will not wrap, so it will only be one line.  The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs.  Do not zip.

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

Possibly controversial: should I include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
Catherine Lee
f69e9c8c91 run_tests.py minor logging changes (#110188)
Minor logging changes that just kind of annoyed me:
* prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving the import within the function (not sure if this is ok)
* prevent constant printing of `Ignoring disabled issues:  ['']` (no idea why it was not gated behind a function or main)
* change all prints in run_test.py to go through stderr so there's no weird interleaving (although if everything goes through stderr, might as well just print everything through stdout...)
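
A small sketch of both changes; the specific noisy import shown is an assumption, not necessarily the one run_test.py actually defers:

```python
import sys

def print_to_stderr(*args):
    # route run_test.py's own output through stderr to avoid interleaving
    print(*args, file=sys.stderr)

def build_extension_args():
    # importing lazily keeps "No CUDA runtime is found ..." from printing
    # every time run_test.py is merely imported
    from torch.utils import cpp_extension  # hypothetical noisy import
    return cpp_extension.include_paths()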
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2023-10-03 01:22:47 +00:00
Zain Rizvi
1277d0e834 [BE] Add sharding data by default to metrics (#110035)
Extend metric library to allow setting global metrics on a process level which will always be emitted.

Current use case for them is to include shard information every time a metric is emitted by run_test.py
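A minimal sketch of the process-level global metrics idea (function names and payload shape are assumptions, not the real library):

```python
_global_metrics: dict = {}

def add_global_metric(key, value):
    _global_metrics[key] = value

def emit_metric(name, metrics):
    # every emitted metric picks up the process-level globals automatically
    payload = {**_global_metrics, **metrics, "metric_name": name}
    print(payload)  # the real implementation would upload this instead

add_global_metric("shard", 2)
add_global_metric("num_shards", 5)
emit_metric("td_experiment", {"tests_reordered": 12})
# {'shard': 2, 'num_shards': 5, 'tests_reordered': 12, 'metric_name': 'td_experiment'}
```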

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 0cae92c</samp>

> _`run_test` refactored_
> _Sharding metrics in Rockset_
> _Autumn of testing_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
2023-09-26 17:06:49 +00:00
Catherine Lee
47adcd412f Increase timeout for slow tests (#109206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109206
Approved by: https://github.com/huydhn
2023-09-26 16:18:38 +00:00
jjsjann123
0d3db1048a remove nvfuser test in upstream pytorch (#109918)
Removing nvfuser related tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109918
Approved by: https://github.com/msaroufim
2023-09-24 13:49:37 +00:00
Catherine Lee
fe198f3141 inductor/test_max_autotune serial in CI (#109209)
Fixes #ISSUE_NUMBER
Trying to figure out why this keeps timing out; wondering if it's due to parallelization weirdness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109209
Approved by: https://github.com/huydhn
2023-09-13 17:04:43 +00:00
Catherine Lee
a4138b1f99 [ez] Fix small type error in run_test (#109036)
This is really small but it has tripped me up at least 3 times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109036
Approved by: https://github.com/kit1980
2023-09-11 21:11:20 +00:00
Catherine Lee
c67ebae344 Put logging in run_tests (#107987)
Logging regarding which tests are serial + parallel + what tests actually get run on the shard got removed, which can be pretty helpful, so this adds it back in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107987
Approved by: https://github.com/huydhn, https://github.com/Neilblaze
2023-09-01 20:23:30 +00:00
Zain Rizvi
5727b07ac6 TD: logging bugfix (#108288)
Fix bug where logging metrics don't get emitted unless the 'keep-going' label is specified on the PR

Also adds some extra logging to make debugging easier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108288
Approved by: https://github.com/Skylion007
2023-08-31 16:51:49 +00:00
Zain Rizvi
238cc84af9 [TD] Emit metrics to compare heuristic quality (#108192)
When a test fails, we will now emit fine grained details about how accurately heuristics predicted the relevance of that test.

## Context
Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not a test fails on it. Green tests don't tell us if the success was due to the code being good vs being irrelevant.  This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.

## What's measured?
The metrics this PR collects are designed to answer the following questions

### How comprehensive are the heuristics?
- What's the false negative rate, the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per heuristic level)

### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % was prioritized overall?
- How relevant was a failed test considered to be? (Both at an aggregate level and at a per-heuristic level)
- What % of time was a given heuristic prioritizing a failing test higher than any other heuristic?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117
2023-08-30 18:28:18 +00:00
Zain Rizvi
620d267ef3 Refactor TestPrioritizations to support more priorities and reduce risk of accidental mutations (#108117)
Refactor TD code to make it easier to add additional categories later and also support the changes required to enable the metrics needed for TD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108117
Approved by: https://github.com/huydhn
2023-08-30 04:14:28 +00:00
Zain Rizvi
36399d067a Port existing heuristics to TD framework (#107071)
This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are:
- Some metric emissions were changed to comply with the new TD format
- Some logging changes
- We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general)

Refactorings done:
- Moves all test reordering code to the new TD framework
- Refactors run_test.py to cleanly support multiple levels of test priorities
- Deletes some dead code that was originally written for logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-08-23 21:23:23 +00:00
Catherine Lee
e0238577b6 Always import test selection tools (#107644)
https://github.com/pytorch/pytorch/pull/107070 made emit_metrics importable without boto3, so we could just import all the files without the try catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107644
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-08-22 16:36:20 +00:00
Zain Rizvi
5ddb8ef827 Make emit_metrics importable without having boto3 installed (#107070)
Make it so that scripts can import and run the `emit_metrics` function even if they don't have boto3 installed, in which case it will still validate the inputs but skip the actual metric emission part.

It's purely a refactor without any real logic changes

Motivation: So that run_test.py and the target determination code can use this library easily without worrying about whether it can be imported or whether its dependencies are installed.
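
A sketch of the optional-dependency pattern this describes (not the exact code):

```python
try:
    import boto3  # type: ignore[import]
    HAS_BOTO3 = True
except ImportError:
    HAS_BOTO3 = False

def emit_metrics(metrics: dict) -> None:
    if not isinstance(metrics, dict):
        raise ValueError("metrics must be a dict")  # validation still runs
    if not HAS_BOTO3:
        return  # silently skip emission when boto3 is unavailable
    # ... upload to the metrics backend here ...
```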

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107070
Approved by: https://github.com/huydhn
2023-08-21 21:13:01 +00:00
Catherine Lee
3b2c5d47c0 Use default build env and test config for test times (#107325)
Redo of #107312

Pairs with https://github.com/pytorch/test-infra/pull/4476

If the build env and test config combo cannot be found in the test times, use the default.  Then we don't have to manually change test-times.json when a new job is added or when we update the jobs.
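
A rough sketch of the fallback, assuming a nested build-env -> test-config -> test layout for the JSON (the layout is an assumption):

```python
def query_test_times(test_times: dict, build_env: str, test_config: str, test: str) -> float:
    env_times = test_times.get(build_env, test_times.get("default", {}))
    config_times = env_times.get(test_config, env_times.get("default", {}))
    return config_times.get(test, 0.0)

times = {"default": {"default": {"test_ops": 3600.0}}}
print(query_test_times(times, "linux-new-job", "crossref", "test_ops"))  # 3600.0
```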
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107325
Approved by: https://github.com/huydhn
2023-08-21 18:39:55 +00:00
FFFrog
e108f33299 Update distutils.Version to packaging.version due to the deprecation … (#107207)
Update distutils.Version to packaging.version due to the deprecation warning.

```python
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17138: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17140: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
```
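
For illustration, the migration looks roughly like this (assuming `packaging` is installed; the version string stands in for `scipy.__version__`):

```python
from packaging import version

# equivalent of: LooseVersion(scipy.__version__) < "1.4.0"
scipy_version = "1.10.1"
print(version.parse(scipy_version) < version.parse("1.4.0"))  # False
```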
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107207
Approved by: https://github.com/soulitzer
2023-08-17 11:19:44 +00:00
Catherine Lee
f16be5e0d4 Reordering tests experiment (#106347)
Companion with https://github.com/pytorch/test-infra/pull/4424

Uses the file rating generated by the test infra PR to re order tests.  For each test file, sum the file ratings from the changed files in the PR, and put the tests in order of sum.

A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.

Sharding is done twice, once on the prioritized tests, and once on the general/non prioritized tests.  Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.
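
A toy sketch of the reordering idea (the data shapes here are assumptions, not the generated file's actual format):

```python
def reorder_tests(tests, ratings, changed_files):
    # sum the per-test ratings contributed by each changed file in the PR
    def score(test):
        return sum(ratings.get(f, {}).get(test, 0.0) for f in changed_files)
    prioritized = sorted((t for t in tests if score(t) > 0), key=score, reverse=True)
    general = [t for t in tests if score(t) == 0]
    return prioritized, general

ratings = {"torch/nn/functional.py": {"test_nn": 0.9, "test_ops": 0.2}}
print(reorder_tests(["test_nn", "test_ops", "test_jit"], ratings, ["torch/nn/functional.py"]))
# (['test_nn', 'test_ops'], ['test_jit'])
```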

I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
2023-08-16 18:23:09 +00:00
PyTorch MergeBot
9858edd99f Revert "Reordering tests experiment (#106347)"
This reverts commit 7dfab082be.

Reverted https://github.com/pytorch/pytorch/pull/106347 on behalf of https://github.com/clee2000 due to probably broke sharding ([comment](https://github.com/pytorch/pytorch/pull/106347#issuecomment-1675542738))
2023-08-11 23:59:48 +00:00
Richard Zou
b9ad7bc533 Don't run test/autograd/test_fallback.py in parallel (#106866)
Fixes https://github.com/pytorch/pytorch/issues/106754

This PR:
- moves test/autograd/test_fallback.py to test_autograd_fallback.py and
removes it from test_autograd.py (necessary for the next step)
- adds test_autograd_fallback.py to parallel test blocklist.
- lintrunner really wanted to make changes to the files, but other than
that, it is a move.

The problem is that we set a global option (the autograd fallback mode)
during these tests which may cause the tests to interfere with each
other.

Test Plan:
- python test/run_test.py -i test_autograd_fallback

NOTE to diff train oncall:
- You'll also need to modify the test/autograd/test_fallback.py TARGET in
caffe2/test/TARGETS since we renamed the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106866
Approved by: https://github.com/soulitzer
2023-08-10 00:26:23 +00:00
Catherine Lee
7dfab082be Reordering tests experiment (#106347)
Companion with https://github.com/pytorch/test-infra/pull/4424

Uses the file rating generated by the test infra PR to re order tests.  For each test file, sum the file ratings from the changed files in the PR, and put the tests in order of sum.

A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.

Sharding is done twice, once on the prioritized tests, and once on the general/non prioritized tests.  Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.

I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
2023-08-09 20:11:11 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unusued loop values in python dictionary iteration. Automated fix from Ruff master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Joel Schlosser
ece19bf018 Update run_test.py to use TEST_WITH_SLOW_GRADCHECK flag (#104819)
Finishes the job from #104537. See https://github.com/pytorch/pytorch/pull/104537#pullrequestreview-1520065008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104819
Approved by: https://github.com/huydhn
2023-07-11 21:58:46 +00:00
Yukio Siraichi
40b8d10d5e Re-land: Turn translation validation on for tests and accuracy runs by default. (#104467)
Re-landing: #103611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104467
Approved by: https://github.com/malfet
2023-07-05 19:01:50 +00:00
Nikita Shulga
ddd7da7546 Enable more tests (#104437)
Remove `test_segment_reductions` from the list of blocklisted tests. Remove the `@onlyCPU` qualifier from test_segment_reductions as it has CUDA-specific parts.

Fixes https://github.com/pytorch/pytorch/issues/104410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104437
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-06-30 16:26:11 +00:00
PyTorch MergeBot
a2a8b4d415 Revert "Turn translation validation on for tests and accuracy runs by default. (#103611)"
This reverts commit e311bed2a8.

Reverted https://github.com/pytorch/pytorch/pull/103611 on behalf of https://github.com/malfet due to Broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/103611#issuecomment-1614850276))
2023-06-30 15:54:18 +00:00
Yukio Siraichi
e311bed2a8 Turn translation validation on for tests and accuracy runs by default. (#103611)
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.

The main changes are:

- Add `--no-translation-validation` as an option in _test/run_tests.py_
    - Set `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
2023-06-30 01:32:21 +00:00
Nikita Shulga
c40f5edf7b Change tools search order (#104214)
Prevents the following cryptic error if one attempts to use `run_tests.py` on a system that also has torchaudio installed in dev mode (as `tools` from https://github.com/pytorch/audio might take precedence, which is not how the script should behave):
```
Unable to import test_selections from tools/testing. Running without test selection stats.... Reason: No module named 'tools.stats'
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1673, in <module>
    main()
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1604, in main
    selected_tests = get_selected_tests(options)
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1418, in get_selected_tests
    path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
NameError: name 'TEST_TIMES_FILE' is not defined
```

But make sure to remove it in the end; otherwise it will not work if torch is installed from a wheel but tests are running from a clean repo checkout.
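
A sketch of the search-order fix, with a stand-in path for the repo root (the exact mechanics are an assumption):

```python
import sys

REPO_ROOT = "/path/to/pytorch"          # stand-in for the real repo root
sys.path.insert(0, REPO_ROOT)           # repo checkout's `tools` takes precedence
try:
    from tools.testing import test_selections  # noqa: F401
except ImportError:
    test_selections = None
finally:
    sys.path.remove(REPO_ROOT)          # avoid leaking the path for wheel installs
```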

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at dd52521</samp>

> _Sing, O Muse, of the cunning code review_
> _That fixed the tests of the `tools` module_
> _By adding and removing the root path_
> _As a shepherd guides his flock to and fro._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104214
Approved by: https://github.com/kit1980
2023-06-27 15:54:34 +00:00
Nikita Shulga
925f0a01c7 Do not pass stepcurrent option unless in CI (#104135)
Should allow one to run the same tests multiple times on local machine
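
A minimal sketch of the conditional flag, with illustrative names:

```python
import os

IS_CI = bool(os.getenv("CI"))

def build_pytest_args(test_module: str):
    args = [f"{test_module}.py", "-v"]
    if IS_CI:
        args.append(f"--sc={test_module}")  # stepcurrent cache key, CI only
    return args

print(build_pytest_args("test_cuda"))
```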

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 740a92d</samp>

> _`pytest_args` change_
> _Only add `--sc` on CI_
> _Avoid conflicts - fall_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104135
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-06-24 09:34:14 +00:00
Nikita Shulga
63f66d19ea [Tests] Make run_test.py usable without boto3 (#104111)
There is a `HAVE_TEST_SELECTION_TOOLS` conditional, but it turns out it does not really work, so fix it by defining all missing prototypes and making it work as a single-shard instance.

Add a lint rule to test that it would succeed when running only test_cuda with a released version of PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104111
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-06-24 03:10:49 +00:00
Nikita Shulga
98d513cabf [BE][Test] Remove --pytest option from run_test.py (#104125)
Because we always run tests with pytest now.

Marking it as `bc-breaking` as there could technically be some scripts depending on it somewhere...

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 1760568</samp>

> _`pytest` option gone_
> _simpler test runner script_
> _autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104125
Approved by: https://github.com/seemethere
2023-06-24 00:20:20 +00:00
Catherine Lee
7ac1c64bc4 Exclude _nvfuser from test collection (#104003)
The three files in this folder should instead be run by test_jit_cuda_fuser.py, test_nvfuser_dynamo.py, and test_nvfuser_frontend.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104003
Approved by: https://github.com/huydhn, https://github.com/jjsjann123
2023-06-22 19:46:45 +00:00
Zain Rizvi
c3d3165f16 Enable uploading metrics and upload Test Reordering metrics to dynamodb (#102691)
Added a feature to upload test statistics to DynamoDB and Rockset using a new function `emit_metric` in `tools/stats/upload_stats_lib.py`.

Added metrics to measure test reordering effectiveness in `tools/testing/test_selections.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102691
Approved by: https://github.com/malfet
2023-06-12 23:01:53 +00:00
PyTorch MergeBot
b52ee80cdc Revert "Add print statements to debug sharding error (#102713)"
This reverts commit c7873522c2.

Reverted https://github.com/pytorch/pytorch/pull/102713 on behalf of https://github.com/clee2000 due to issue should be resolved now ([comment](https://github.com/pytorch/pytorch/pull/102713#issuecomment-1583334560))
2023-06-08 21:02:17 +00:00
Aidyn-A
591134f2a5 [CI] Enable UCC in CI (#100395)
UCC was temporarily disabled in #98832. This PR re-enables it with the necessary fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100395
Approved by: https://github.com/atalman
2023-06-08 19:01:22 +00:00
Catherine Lee
c7873522c2 Add print statements to debug sharding error (#102713)
Sharding on ROCm is broken. I can't replicate it on dummy PRs even though it seems to happen pretty often on main, so I'm adding this to increase my sample size.  Hopefully this is enough print statements...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102713
Approved by: https://github.com/huydhn
2023-06-01 22:38:28 +00:00
Zain Rizvi
c84f246c83 Improve time savings calculation math for test reordering (#102411)
Use a more accurate method that accounts for tests being run in parallel

Right now we still log results to the console, but later it'll get logged to Rockset for better tracking
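
A toy model of the more accurate math, assuming a fixed number of parallel procs and simple wave-based scheduling (both are simplifying assumptions):

```python
NUM_PROCS = 2

def time_to_finish(order, times, target):
    # with N parallel procs, a test finishes when its "wave" finishes,
    # not after the sum of everything scheduled before it
    waves = [order[i:i + NUM_PROCS] for i in range(0, len(order), NUM_PROCS)]
    elapsed = 0.0
    for wave in waves:
        elapsed += max(times[t] for t in wave)
        if target in wave:
            return elapsed

times = {"test_a": 600, "test_b": 30, "test_c": 600, "test_d": 30}
print(time_to_finish(["test_a", "test_b", "test_c", "test_d"], times, "test_d"))  # 1200.0
print(time_to_finish(["test_d", "test_b", "test_a", "test_c"], times, "test_d"))  # 30.0
```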
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102411
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-05-31 23:51:27 +00:00
Catherine Lee
a5ddb72aec Quick fix for keep-going + reruns (#102569)
Currently file level reruns + stepcurrent are incompatible and it's making PRs green when they are actually red, so turn off stepcurrent + file level reruns when keep-going is used until I figure out a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102569
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-05-31 04:46:25 +00:00
PyTorch MergeBot
1a6ab8a5dc Revert "Quick fix for keep-going + reruns (#102569)"
This reverts commit 7f6edcf422.

Reverted https://github.com/pytorch/pytorch/pull/102569 on behalf of https://github.com/clee2000 due to broke a ton of stuff ([comment](https://github.com/pytorch/pytorch/pull/102569#issuecomment-1569167673))
2023-05-30 22:04:27 +00:00
Catherine Lee
7f6edcf422 Quick fix for keep-going + reruns (#102569)
Currently file level reruns + stepcurrent are incompatible and it's making PRs green when they are actually red, so turn off stepcurrent + file level reruns when keep-going is used until I figure out a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102569
Approved by: https://github.com/huydhn
2023-05-30 21:29:56 +00:00
Huy Do
6e3e3dd477 Do not collect and skip non-disabled tests when rerunning disabled tests (#102107)
The console log blows up too much when running in rerun disabled tests mode (x50) e132f09e88.  Each log is around 1GB and the whole uncompressed log is ~50GB.  After compression, it will be around 1GB, still too big.  The increase comes mainly from the multiple SKIPPED messages for non-disabled tests, which is expected due to how SkipTest and pytest-flakefinder currently work.

I update `test/conftest.py` to completely ignore skipped tests when rerunning disabled tests instead of collecting and then skipping each of them 50 times.  The benefit of doing this is much more than I originally expected:
  * Rerun disabled tests jobs now finish in less than half an hour as they should be
  * Fix OOM runner crash because of too many collected tests
  * Fix verbosity issue as now only disabled tests are run x50 times.  There are only a few hundred of them atm
  * Fix timed out issue when rerunning disabled distributed and ASAN tests.  They are just too slow when running at x50
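
A rough conftest.py-level sketch of the change (an assumption for illustration, not the actual implementation; the env var name is also assumed):

```python
import os

RERUN_DISABLED_TESTS = os.getenv("PYTORCH_TEST_RERUN_DISABLED_TESTS") == "1"

def pytest_collection_modifyitems(config, items):
    if not RERUN_DISABLED_TESTS:
        return
    kept, deselected = [], []
    for item in items:
        # anything already carrying a skip marker is not a disabled test we care about
        if item.get_closest_marker("skip") or item.get_closest_marker("skipif"):
            deselected.append(item)
        else:
            kept.append(item)
    if deselected:
        config.hook.pytest_deselected(items=deselected)
        items[:] = kept
```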

### Testing

When rerunning disabled tests https://github.com/pytorch/pytorch/actions/runs/5084508614, only disabled tests on the platform are run, for example `test_ops_jit` on https://ossci-raw-job-status.s3.amazonaws.com/log/13770164954 only ran 100 tests (`test_variant_consistency_jit_linalg_lu_cuda_float32` + `test_variant_consistency_jit_linalg_lu_factor_cuda_complex64`) x50.

```
Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_jit.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--sc=test_ops_jit_1', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2023-05-25 21:32:49.763856]

Expand the folded group to see the log file of test_ops_jit 2/2
##[group]PRINTING LOG FILE of test_ops_jit 2/2 (/var/lib/jenkins/workspace/test/test-reports/test_ops_jit_h2wr_t2c.log)
Test results will be stored in test-reports/python-pytest/test_ops_jit/test_ops_jit-51a83bd44549074e.xml
============================= test session starts ==============================
platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /opt/conda/envs/py_3.10/bin/python
cachedir: .pytest_cache
hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow]
rootdir: /var/lib/jenkins/workspace
configfile: pytest.ini
plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-11.1.2, shard-0.1.2, xdist-3.3.0, xdoctest-1.1.0
collecting ... collected 1084 items
Running 100 items in this shard: test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 (x50), test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 (x50)
stepcurrent: Cannot find last run test, not skipping

test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 PASSED [2.1876s] [  1%]
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 PASSED [4.5615s] [  2%]
```

* [pull](https://github.com/pytorch/pytorch/actions/runs/5093566864)
* [trunk](https://github.com/pytorch/pytorch/actions/runs/5095364311)
* [periodic](https://github.com/pytorch/pytorch/actions/runs/5095378850)
* [slow](https://github.com/pytorch/pytorch/actions/runs/5095390285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102107
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-05-27 12:10:36 +00:00
Catherine Lee
2232cce69c No cpp + step current (#102001)
stepcurrent cannot handle xdist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102001
Approved by: https://github.com/huydhn
2023-05-24 17:39:32 +00:00
Huy Do
d06802778e No need to run C++ tests under rerun disabled tests mode (#102132)
Per title.  I extract this part out of the draft PR that I'm working on https://github.com/pytorch/pytorch/pull/102107 because
the remaining issues with rerun disabled tests: log size and unexpected runner failures requires some further investigations while this one is clearing breaking in trunk atm.

Until we can support disable C++ tests, there is no need to run them in rerun disabled tests mode.

### Testing

Coming from https://github.com/pytorch/pytorch/pull/102107, for example https://github.com/pytorch/pytorch/actions/runs/5062224659/jobs/9087747981

```
2023-05-23T22:46:50.1953318Z Running cpp/basic 1/1 ... [2023-05-23 22:46:50.195077]
2023-05-23T22:46:50.1953847Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:50.2066032Z Running cpp/atest 1/1 ... [2023-05-23 22:46:50.206348]
2023-05-23T22:46:50.2066435Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:52.2666743Z No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2023-05-23T22:46:52.2691817Z Ignoring disabled issues:  []
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102132
Approved by: https://github.com/clee2000
2023-05-24 07:45:48 +00:00
Huy Do
d26c8f26d1 Lower xdist processes from auto to NUM_PROCS (#102124)
This is to avoid CUDA OOM issues when running C++ tests both regularly and in memory leak check mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102124
Approved by: https://github.com/clee2000
2023-05-24 06:50:55 +00:00
Catherine Lee
f3fc531eee Check for pytest extensions in run_test (#100916)
Not very elegant.

Checked on a separate conda env that doesn't have the usual CI dependencies.

The two pytest extensions at fault are pytest-rerunfailures and pytest-shard; also included pytest-flakefinder just in case.

No idea if this is a good way to do this.

Could also check individually and add flags based on that, but was told that requiring all the CI dependencies to be downloaded was also OK.
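
A sketch of the availability check; the helper is illustrative, and the plugin module names are taken from the extensions listed above:

```python
import importlib.util

REQUIRED_PYTEST_PLUGINS = ["pytest_rerunfailures", "pytest_shard", "pytest_flakefinder"]

def missing_pytest_plugins():
    return [p for p in REQUIRED_PYTEST_PLUGINS if importlib.util.find_spec(p) is None]

missing = missing_pytest_plugins()
if missing:
    print(f"Missing pytest extensions: {missing}; falling back to plain pytest args")
```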
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100916
Approved by: https://github.com/huydhn
2023-05-17 20:27:55 +00:00
Catherine Lee
e3c9a1e5c4 Run dynamo tests in parallel (#101432)
Cuts off ~30 min per shard (2 shards and 2 Python versions, so ~2 hours total).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101432
Approved by: https://github.com/huydhn, https://github.com/desertfire, https://github.com/ZainRizvi
2023-05-17 20:26:24 +00:00
Huy Do
552b712f80 Run C++ testcases in parallel with pytest-xdist (#101440)
After an investigation, running C++ tests with https://github.com/pytest-dev/pytest-cpp is just slower than running them directly, plain and simple. I'm curious on the exact root cause, but that's a story for another day.

`time build/bin/test_lazy` takes half a minute to run 610 tests on `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)` while `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v` takes 20+ minutes on the same runner.  This is a very costly price to pay.

The saving grace here is that https://github.com/pytest-dev/pytest-cpp supports pytest-xdist to run tests in parallel with `-n auto`, so `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v -n auto` takes only 3 minutes.  This is still not as fast as running C++ tests directly, but it's order of magnitude faster than running them sequentially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101440
Approved by: https://github.com/clee2000
2023-05-16 21:52:36 +00:00
Huy Do
35834a405c Run C++ tests on CI with run_test.py (#99956)
After https://github.com/pytorch/pytorch/pull/99559, we can now run C++ tests with `run_test.py`.  Although advanced features such as `--import-slow-tests` and `--import-disabled-tests` won't work for now, there will still be a gain in reliability and performance as C++ tests can now be retried and run in parallel.

This covers all C++ tests in the CI including aten, libtorch, and Vulkan C++ tests across all platforms Linux, Windows, MacOS.

Notes:
* To support C++ test discovery, the env variable `CPP_TESTS_DIR` can be set to where the C++ test binaries is located
* Support pytest -k argument via run_test as this is used by pytest-cpp to replace `--gtest-filter`
* The XML output is in pytest format, but it's ok now because we don't have slow test or flaky test support for C++ test yet
* ~~I need to figure out why conftest.py doesn't work when I invoke pytest directly for C++ test, so `--sc` is not available for C++ tests at the moment.  Proper pytest plugin like stepwise works fine though.  I'll investigate and fix it in a separate PR~~ Found the cause, `conftest.py` is per directory and needs to be in any arbitrary directory that holds C++ test
* Two tests `test_api` and `test_tensorexpr` timed out on ASAN, I suspect that ASAN is now used on top of the python executable, which is slower than running native C++ code.  IMO, it's ok to run these tests as before on ASAN for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99956
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-05-09 21:24:12 +00:00
Ramin Azarmehr
cecfcf1e17 [MPS] Handle MPS failures of test_modules.py in common_modules.py (#95334)
- Also cleaned up `test_modules.py` from skipMPS code.
- Added `skipMPS` for unsupported or failing tests on MPS backend in common_modules.py.
   (We'll remove `skipMPS` from those tests once a fix is available for them.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95334
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-05-09 03:55:16 +00:00
Zain Rizvi
95f191a248 Always run prioritized tests first, even if they're expected to run serially (#100748)
Today, we prioritize running test files that were edited in the user's PR, with the idea being to run them before we run any other test.

Except, if the modified test is supposed to run serially, then we still end up running it after all the parallelized tests have finished running.

This PR fixes that to _always_ run the prioritized tests before the regular tests, regardless of if the test is supposed to run serially or in parallel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100748
Approved by: https://github.com/huydhn
2023-05-08 20:23:46 +00:00
Catherine Lee
a1f318daba Fix get_reordered_tests in run_test.py (#100752)
I think get_reordered_tests has been broken since the master -> main switch.

add typing for some functions

checked for `prioritized` in the logs

limited testing because I only care about one very small part of the log that's near the beginning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100752
Approved by: https://github.com/huydhn
2023-05-05 22:46:56 +00:00
Catherine Lee
e88e92e7a2 Update to reruns + timeouts in run_test.py (#100412)
https://github.com/pytorch/pytorch/pull/100200/files made unknown tests more likely to fail because they lack test times but still have timeouts, so fix that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100412
Approved by: https://github.com/huydhn
2023-05-01 21:51:53 +00:00
pbialecki
73645a8412 Add CUDA 12.1 CI workflows (#98832)
Adds CUDA 12.1 CI workflows, removes CUDA 11.7.
CC @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98832
Approved by: https://github.com/atalman
2023-05-01 16:25:53 +00:00
PyTorch MergeBot
9075e3c2c6 Revert "Run test_fx_to_onnx_with_onnxruntime serially (#100298)"
This reverts commit 3a3f781f6c.

Reverted https://github.com/pytorch/pytorch/pull/100298 on behalf of https://github.com/huydhn due to No need as https://github.com/pytorch/pytorch/pull/100297 has been landed ([comment](https://github.com/pytorch/pytorch/pull/100298#issuecomment-1528476786))
2023-04-29 02:07:39 +00:00
Huy Do
3a3f781f6c Run test_fx_to_onnx_with_onnxruntime serially (#100298)
This test starts to fail out of nowhere in trunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100298
Approved by: https://github.com/kit1980
2023-04-29 00:51:25 +00:00
Catherine Lee
6ab9453ea9 File level rerun changes (#100200)
Fixes #ISSUE_NUMBER
* change the hook so that the test still gets saved in --sc when it fails in test setup (this caused an off-by-one error due to setup being called before the logreport hook)
* allow reruns for all tests now that --sc is used
* increase number of reruns now that --sc is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100200
Approved by: https://github.com/huydhn
2023-04-28 20:57:49 +00:00
Catherine Lee
ae5e1819a5 stepcurrent (#98035)
* add a stepcurrent flag (--sc), based off the stepwise flag, that saves the currently running test so that the test run can resume from the last successful test after segfaults; it takes an argument for a key so that different test runs don't overwrite each other
* send SIGINT to the process on timeout so that the XML can still be made

* add a currently unused stepcurrent-skip flag (--scs), based off the stepwise skip flag, that skips the failing test; was going to use it for the keep-going label but was having trouble with CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98035
Approved by: https://github.com/huydhn
2023-04-25 20:56:04 +00:00
Huy Do
96d3f3dee3 Discover and run C++ tests with run_test.py (#99559)
This depends on [pytest-cpp](https://github.com/pytest-dev/pytest-cpp) to discover and run C++ tests with pytest. C++ tests are built under `${WORKSPACE}/build/bin` directory and copied to the test job under the same path.

* To expose them to `run_test`, I choose to use the mock path prefix `cpp`, for example `build/bin/c10_Array_test` would be named as `cpp/c10_Array_test` and the `python test/run_test.py --cpp -i cpp/c10_Array_test` would run the test in the same way as other Python tests.  I could copy them from `build/bin` to `test/cpp`, but it will be mixed with the source code and CMake file.  So this looks easier
* Some executables under `build/bin` are not C++ tests, and they are excluded, for example `build/bin/torch_shm_manager`
* C++ tests need to run with pytest directly as the python command doesn't understand them
* The change is gated by the new `--cpp` argument to `run_test.py`, for example `python test/run_test.py --cpp` will run all available C++ tests
* The tests can be run in parallel
* Failing tests can be retried with `--reruns=2` and `--sw`

```
============================= test session starts ==============================
platform darwin -- Python 3.9.15, pytest-7.2.0, pluggy-1.0.0 -- /Users/huydo/miniconda3/envs/py3.9/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/huydo/Storage/mine/pytorch/test/.hypothesis/examples')
rootdir: /Users/huydo/Storage/mine/pytorch, configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, rerunfailures-10.3, shard-0.1.2, flakefinder-1.1.0, hypothesis-6.56.4, xdist-3.0.2, repeat-0.9.1
collecting ... collected 3 items / 2 deselected / 1 selected
Running 1 items in this shard: build/bin/scalar_tensor_test::TestScalarTensor.TestScalarTensorMPS
stepwise: skipping 2 already passed items.

../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS FAILED [100%]
```

* `--import-slow-tests` and `--import-disabled-tests` won't work for now and that's ok to have it as a future task.

I also add `pytest-cpp==2.3.0` to Linux Docker, MacOS, and Windows.

### Testing

Build PyTorch and run `python test/run_test.py --cpp` on my laptop.  CI change would come later in a separate PR.  Also running `python test/run_test.py --help` now shows all C++ test discovered under `build/bin`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99559
Approved by: https://github.com/clee2000
2023-04-22 00:23:31 +00:00
Zain Rizvi
7546972565 [BE] Refactoring test execution and improving comments (#99467)
Sharing code between the code that handles test results in parallel vs serial mode.

Note that the original version of this code had an inconsistency between the two versions where it would execute `print_to_stderr(err_message)` on every test that ran in parallel, but for serial tests it would only invoke `print_to_stderr(err_message)` if `continue_on_error` was also specified.  By sharing code, this PR changes that behavior to be consistent between the two modes.

Also adding some comments.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 029342c</samp>

> _Sing, O Muse, of the skillful coder who refined_
> _The PyTorch testing script, `run_test.py`, and shined_
> _A light on its obscure logic, with docstrings and comments_
> _And made it run more smoothly, with better error contents_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99467
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-19 19:29:07 +00:00
BowenBao
d41aa448b8 [ONNX] Run ONNX tests as part of standard run_test script (#99215)
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at dcbf7e2</samp>

### Summary
📝🧹🚩

<!--
1.  📝 for simplifying the `./scripts/onnx/test.sh` script
2.  🧹 for refactoring the `test/onnx/dynamo/test_exporter_api.py` file
3.  🚩 for adding the `--onnx` flag to `test/run_test.py` and updating the `TESTS` list
-->
This pull request improves the ONNX testing infrastructure in PyTorch by refactoring the test code, normalizing the scope names, adding a flag to run only the ONNX tests, and simplifying the test script.

> _To export PyTorch models to ONNX_
> _We refactored some scripts and contexts_
> _We used `common_utils`_
> _And normalized the scopes_
> _And added a flag to run the tests_

### Walkthrough
*  Simplify `./scripts/onnx/test.sh` to use `run_test.py` with `--onnx` flag instead of `pytest` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-0017f5b22ae1329acb0f54af8d9811c9b6180a72dac70d7a5b89d7c23c958198L44-R46))
*  Remove `onnx` test from `TESTS` list in `test/run_test.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7L127-R127)). Replace with `onnx_caffe2`.
*  Add `onnx/test_pytorch_onnx_onnxruntime_cuda` and `onnx/test_models` tests to `blocklisted_tests` list in `test/run_test.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R154-R155))
*  Add `ONNX_SERIAL_LIST` list to `test/run_test.py` to specify ONNX tests that must run serially ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R296-R301))
*  Add `ONNX_TESTS` list to `test/run_test.py` to store all ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R370))
*  Add `--onnx` flag to `parse_args` function in `test/run_test.py` to run only ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R920-R928))
*  Include `ONNX_SERIAL_LIST` in `must_serial` function in `test/run_test.py` to run ONNX tests serially or parallelly based on memory usage ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R1120))
*  Filter selected tests based on `--onnx` flag in `get_selected_tests` function in `test/run_test.py` to exclude non-ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R1158-R1165))

### Other minor changes to accommodate this change
*  Replace `unittest` module with `common_utils.TestCase` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L4), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L29-R28), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L71-R70), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L147-R146))
*  Import `TemporaryFileName` class from `common_utils` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L19-R18))
*  Use `common_utils.TemporaryFileName` instead of `TemporaryFileName` in `TestDynamoExportAPI` class in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L92-R91), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L110-R109), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L129-R128))
*  Use `common_utils.run_tests` instead of `unittest.main` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L155-R154))
*  Add `re` module to `test/onnx/test_utility_funs.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7R6))
*  Add `_remove_test_environment_prefix_from_scope_name` function to `test/onnx/test_utility_funs.py` to normalize scope names of ONNX nodes ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7R32-R58))
*  Use `_remove_test_environment_prefix_from_scope_name` function to compare scope names of ONNX nodes in `TestUtilityFuns` class in `test/onnx/test_utility_funs.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1099-R1133), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1119-R1152), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1170-R1188), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1181-R1199), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1220-R1239), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1235-R1258))

Fixes #98626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99215
Approved by: https://github.com/huydhn, https://github.com/titaiwangms
2023-04-19 06:17:47 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However the issue with allocating a block upfront is that is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
Richard Zou
d5120ff18a [torch.library] Add ability to create library fragments (#98439)
In C++ we have TORCH_LIBRARY_FRAGMENT. This PR adds the same
functionality to the Python torch.library API.

The motivation for this is: for the simple custom op API, we don't want
users to need to deal with Library objects. One way to hide this from
users is to create library fragments.
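
For illustration, a minimal usage sketch of the Python fragment API (the op names and schemas are made up; behavior is assumed to mirror TORCH_LIBRARY_FRAGMENT):

```python
import torch
from torch.library import Library

# two independent pieces of code can each grab a fragment of the same namespace
# and register their own ops, instead of fighting over a single DEF library
frag_a = Library("mylib", "FRAGMENT")
frag_a.define("foo(Tensor x) -> Tensor")
frag_a.impl("foo", lambda x: x + 1, "CPU")

frag_b = Library("mylib", "FRAGMENT")
frag_b.define("bar(Tensor x) -> Tensor")
frag_b.impl("bar", lambda x: x * 2, "CPU")

print(torch.ops.mylib.foo(torch.ones(2)))  # tensor([2., 2.])
```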

Test Plan:
- tests that you can create multiple fragments and def+impl operators on each.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98439
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-10 18:04:53 +00:00
BowenBao
4f9dbc17a4 [ONNX] Enable xdoctests in CI (#98546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98546
Approved by: https://github.com/justinchuby, https://github.com/kit1980
2023-04-07 22:20:18 +00:00
PyTorch MergeBot
55724a5ec9 Revert "[experiment] More procs in CI (#98098)"
This reverts commit 9fd3eba6ce.

Reverted https://github.com/pytorch/pytorch/pull/98098 on behalf of https://github.com/clee2000 due to I think theres a bug
2023-04-07 19:50:54 +00:00
Catherine Lee
9fd3eba6ce [experiment] More procs in CI (#98098)
Experiment with more procs, but only on master so PRs don't get affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98098
Approved by: https://github.com/huydhn
2023-04-07 17:21:32 +00:00
Fuzzkatt
481ecffb5e Add test c10d ucc tests (#88110)
Creates the equivalent c10d test for ucc for https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_gloo.py and https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_nccl.py. Uses test_c10d_gloo.py as the reference and adds all the common ops. More detailed comparison of available ops here: https://docs.google.com/document/d/1yPsa_X9EiEiqo-j2Yn7ierhccBtEjwoqC-B7-amI0MI/edit?usp=sharing

Also removes extra line for ProcessGroupUCC.cpp barrier blocking wait that got duplicated from merging https://github.com/pytorch/pytorch/pull/85047.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88110
Approved by: https://github.com/zasdfgbnm, https://github.com/kit1980, https://github.com/kwen2501, https://github.com/malfet
2023-04-06 23:51:27 +00:00
Catherine Lee
0d73cfb3e9 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-31 18:36:53 +00:00
Catherine Lee
c797c7bc8b Clean up duplicate function run_test.py (#97914)
AFAICT they're the same thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97914
Approved by: https://github.com/huydhn
2023-03-31 06:31:17 +00:00
PyTorch MergeBot
675dfd2c1f Revert "Retry at test file level (#97506)"
This reverts commit 7d5d5beba2.

Reverted https://github.com/pytorch/pytorch/pull/97506 on behalf of https://github.com/clee2000 due to test_jit_cuda_fuser having a rough time
2023-03-31 06:22:14 +00:00
Catherine Lee
7d5d5beba2 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-30 17:12:19 +00:00
Kazuaki Ishizaki
f7fe6e148e [test] Make environment variable name better (#97356)
This PR intends to use better (or correct?) environment variable name (`TORCH_DOCTEST_ANOMALY` instead of `TORCH_DOCTEST_ANOMOLY`) in test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97356
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-03-30 06:21:28 +00:00
Huy Do
4c0dce50fd [BE] Apply ufmt to run_test and GitHub Python util scripts (#97588)
This has been bugging me for a while as I'm working on these Python scripts and they are not tracked by ufmt linter.  So I add these script into that linter.

```
[[linter]]
code = 'UFMT'
include_patterns = [
    '.github/**/*.py',
    'test/run_test.py',
```

This change should just work and not break anything as ufmt (black + usort) linter is very safe to use for standalone util scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97588
Approved by: https://github.com/kit1980
2023-03-26 04:52:55 +00:00
Catherine Lee
29c061bb90 Remove non existent files in multigpu tests (#97393)
They were removed in https://github.com/pytorch/pytorch/pull/96989/files and https://github.com/pytorch/pytorch/pull/96985/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97393
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/fduwjj, https://github.com/malfet
2023-03-23 17:00:29 +00:00
Huy Do
244736a5a5 Mark ROCm tests as flaky (#97259)
Before https://github.com/pytorch/pytorch/pull/96464, ROCm tests in trunk are already quite flaky https://hud.pytorch.org/reliability/pytorch/pytorch?jobName=trunk%20%2F%20linux-focal-rocm5.4.2-py3.8%20%2F%20test%20(default).

After https://github.com/pytorch/pytorch/pull/96464, there is a new group of flaky failures coming from functorch.  So let's mark the test as flaky to monitor without impacting trunk.

Two flaky tests currently seeing in trunk are:

* https://github.com/pytorch/pytorch/issues/97256
* `functorch/test_memory_efficient_fusion.py` OOM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97259
Approved by: https://github.com/malfet, https://github.com/zou3519
2023-03-21 16:55:00 +00:00
Richard Zou
5acf403088 Run functorch tests in default shards; delete functorch-specific shards (#96464)
Fixes #96347

This PR:

- Makes the functorch tests run as part of the "default" shards
- Deletes the functorch CI shard from all CI job configurations (if it exists)
- Increases the "default" shard count by 1 for each job, unless it was
previously set to 1, to accommodate the new functorch tests without
regressing time-to-signal.
- Adds a bunch of skips for the ROCm and torchdynamo configurations. We can
investigate them later.

NB: I might go through some more iterations to figure out what other
skips need to be added, but this iteration of the PR seems to pass most of the CI
suite.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96464
Approved by: https://github.com/huydhn
2023-03-21 13:53:01 +00:00
Huy Do
270b42d279 Fix test_schema_check CUDA illegal memory access (#97062)
I'm seeing some recent [CUDA illegal memory access](https://hud.pytorch.org/failure/FAILED%20test_schema_check.py%3A%3ATestSchemaCheckModeOpInfoCUDA%3A%3Atest_schema_correctness_fft_fft_cuda_bool%20-%20RuntimeError%3A%20CUDA%20error%3A%20an%20illegal%20memory%20access%20was%20encountered) errors related to this test, so a cheap fix is to run it serially.

Fixes https://github.com/pytorch/pytorch/issues/95749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97062
Approved by: https://github.com/clee2000
2023-03-20 20:57:27 +00:00
Huy Do
db2c1ea8c8 Re-enable test_ops_jit on Windows (#96859) (#96931)
Fixes https://github.com/pytorch/pytorch/issues/96858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96931
Approved by: https://github.com/kit1980
2023-03-17 22:42:22 +00:00
Catherine Lee
8c2341c1b9 Remove pytest block list (#96698)
Enables running the last few files under pytest.

xdist was causing problems with `test_source_multithreaded` in `profiler/test_profiler` because it creates extra threads. Luckily we don't use xdist, so we can disable it with `-p no:xdist`, but this is incompatible with pytest-rerunfailures==10.2, so upgrade to 10.3. I'd update the Windows AMI, but I don't know how.
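
For reference, `-p no:<plugin>` is pytest's standard way of refusing to load a plugin; a minimal sketch of passing it through `pytest.main` (the file path is a stand-in, and the actual plumbing in run_test.py may differ):

```python
import pytest

# Disable the xdist plugin for this invocation so no extra workers are spawned.
# "test/profiler/test_profiler.py" is illustrative of whatever file is being run.
exit_code = pytest.main(["-p", "no:xdist", "-v", "test/profiler/test_profiler.py"])
print("pytest exit code:", exit_code)
```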

`dynamo/test_optimizers` and `dynamo/test_repros` both had tests that used `skip_if_pytest`. https://github.com/pytorch/pytorch/pull/93251/files suggests that this is due to pytest assertion rewriting, so I added `PYTEST_DONT_REWRITE` to their module docstrings to prevent pytest from rewriting assertions.
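
pytest skips assertion rewriting in any module whose docstring contains the marker string `PYTEST_DONT_REWRITE`; a minimal sketch of such a module (illustrative, not one of the actual dynamo test files):

```python
"""Illustrative test module.

PYTEST_DONT_REWRITE
"""
import unittest


class ReproTest(unittest.TestCase):
    def test_plain_assert(self):
        # Because of the marker above, pytest leaves this assert statement
        # unrewritten instead of instrumenting it for richer failure output.
        assert (1 + 1) == 2


if __name__ == "__main__":
    unittest.main()
```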

Disabling tests by issue in `dynamo/test_dynamic_shapes` seems sane.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96698
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-16 04:22:42 +00:00
albanD
7c525823c7 Remove un-used list. And disable pytest for public binding test. (#96684)
This contains a temporary change to make sure the test fails nicely now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96684
Approved by: https://github.com/clee2000
2023-03-15 22:12:00 +00:00
Huy Do
6339ee5d23 Temporarily disable test_ops_jit on Windows (#96859)
See https://github.com/pytorch/pytorch/issues/96858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96859
Approved by: https://github.com/kit1980
2023-03-15 17:51:32 +00:00
Huy Do
51b8ab7879 Clean up references to test_megatron_prototype (#96431)
This test has been deleted in #96254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96431
Approved by: https://github.com/clee2000, https://github.com/fduwjj
2023-03-10 23:50:32 +00:00
Catherine Lee
4519228f60 Reduce pytest blocklist part 2 (#96397)
Enable pytest for a few unique files.  pytest runs tests in a different order than unittest (but still a consistent ordering with respect to itself) and some tests change global state, causing other tests to fail.

`test_transpose_non_contiguous` in `test_torchinductor.py` gets impacted by some other test, but I'm not sure which one, so my solution is to reset the metrics before the rest of the test is run.

`test_register_patterns` in `test_quantize_fx.py` adds extra keys to global variables, so remove them when the test is done via unittest's `addCleanup`, which also works under pytest.
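
A minimal sketch of that cleanup pattern, using illustrative names rather than the actual `test_quantize_fx` globals:

```python
import unittest

CUSTOM_PATTERNS = {}  # stand-in for the global registry the real test mutates


class TestRegisterPatterns(unittest.TestCase):
    def test_register_patterns(self):
        CUSTOM_PATTERNS["my_pattern"] = object()
        # addCleanup runs after the test (pass or fail) under both unittest and
        # pytest, so later tests see the registry in its original state.
        self.addCleanup(CUSTOM_PATTERNS.pop, "my_pattern", None)
        self.assertIn("my_pattern", CUSTOM_PATTERNS)


if __name__ == "__main__":
    unittest.main()
```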

pytest doesn't really have an equivalent of `load_tests`, so change it to work like `test_jit`, which imports all the classes. I also attempted to import them dynamically, but failed.

`test_public_api_surface` in `test_fx.py` checks for a backwards-compatibility classification. A different test in `test_fx` results in `fuser_utils` being imported; pytest runs that test before `test_public_api_surface` while unittest runs it after, so pytest sees `fuser_utils` when crawling through the modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96397
Approved by: https://github.com/huydhn
2023-03-10 19:10:43 +00:00
Xiao Wang
cf3d3a583e Add env PYTORCH_TEST_DO_NOT_USE_PYTEST as an option to not use pytest in unit testing (#96444)
Set the environment variable
```
PYTORCH_TEST_DO_NOT_USE_PYTEST=1
```
to run PyTorch unit tests without pytest.

This change is related to some recent changes, e.g. #96210, #96016, #95844, #95659, that enabled the use of pytest in many test modules. Those test modules passed normally before, but failed immediately once pytest was used. A sample stack trace:

```python
root@8e3168a83ee2:/opt/pytorch/pytorch# python test/run_test.py -v -i test_optim -- -v --save-xml
Ignoring disabled issues:  []
/opt/pytorch/pytorch/test/run_test.py:1225: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if torch.version.cuda is not None and LooseVersion(torch.version.cuda) >= "11.6":
Selected tests:
 test_optim
parallel (file granularity) tests:
 test_optim
serial (file granularity) tests:

Ignoring disabled issues:  []
Ignoring disabled issues:  []
Running test_optim ... [2023-03-09 12:51:59.358110]
Executing ['/usr/local/bin/python', '-bb', 'test_optim.py', '-v', '--save-xml', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2'] ... [2023-03-09 12:51:59.358810]

Test results will be stored in test-reports/python-pytest/test_optim/test_optim-5e41643c8bac8ace.xml
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_optim.py", line 4581, in <module>
    run_tests()
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 796, in run_tests
    exit_code = pytest.main(args=pytest_args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 148, in main
    config = _prepareconfig(args, plugins)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 329, in _prepareconfig
    config = pluginmanager.hook.pytest_cmdline_parse(
  File "/usr/local/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/usr/local/lib/python3.10/site-packages/_pytest/helpconfig.py", line 103, in pytest_cmdline_parse
    config: Config = outcome.get_result()
  File "/usr/local/lib/python3.10/site-packages/pluggy/_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1060, in pytest_cmdline_parse
    self.parse(args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1348, in parse
    self._preparse(args, addopts=addopts)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1231, in _preparse
    self.pluginmanager.load_setuptools_entrypoints("pytest11")
  File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 287, in load_setuptools_entrypoints
    plugin = ep.load()
  File "/usr/local/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/usr/local/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/usr/local/lib/python3.10/site-packages/xdist/looponfail.py", line 16, in <module>
    import execnet
  File "/usr/local/lib/python3.10/site-packages/execnet/__init__.py", line 14, in <module>
    from .gateway_base import DataFormatError
  File "/usr/local/lib/python3.10/site-packages/execnet/gateway_base.py", line 1138, in <module>
    FLOAT_FORMAT_SIZE = struct.calcsize(FLOAT_FORMAT)
BytesWarning: Comparison between bytes and string
FINISHED PRINTING LOG FILE of test_optim (/opt/pytorch/pytorch/test/test-reports/test_optim_1pnlesrz.log)

test_optim failed!
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/run_test.py", line 1428, in <module>
    main()
  File "/opt/pytorch/pytorch/test/run_test.py", line 1386, in main
    raise RuntimeError(
RuntimeError: test_optim failed!

Tip: You can keep running tests even on failure by passing --keep-going to run_test.py.
If running on CI, add the 'keep-going' label to your PR and rerun your jobs.
```

I'd like to propose this option so that users can run their tests in CI with the good old Python unittest runner instead of pytest.
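
A hedged sketch of how such an opt-out could be consulted when building the per-file command; the real logic lives in test/run_test.py and torch.testing._internal.common_utils and may look different:

```python
import os


def pytest_enabled():
    # Opting out: PYTORCH_TEST_DO_NOT_USE_PYTEST=1 falls back to plain unittest.
    return os.getenv("PYTORCH_TEST_DO_NOT_USE_PYTEST", "0") != "1"


def build_command(test_file):
    cmd = ["python", "-bb", test_file, "-v"]
    if pytest_enabled():
        cmd.append("--use-pytest")  # the flag visible in the executed command above
    return cmd


print(build_command("test_optim.py"))
```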

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96444
Approved by: https://github.com/malfet
2023-03-10 01:32:15 +00:00
Catherine Lee
a7fe11dec0 --subprocess for pytest (#96210)
Implements the `--subprocess` flag for pytest; previously it only worked with unittest.

Pretty much all the tests in the custom handler list use `--subprocess`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96210
Approved by: https://github.com/huydhn
2023-03-08 21:04:50 +00:00
BowenBao
bdb076ab43 [ONNX] Skip doctest torch.onnx._internal.fx if ImportError (#95686)
We need to use `exclude` to skip the module altogether, because xdoctest
triggers an `ImportError` when trying to import the module, so the whole
test fails regardless of whether a skip was added in the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95686
Approved by: https://github.com/kit1980, https://github.com/titaiwangms
2023-03-07 22:05:27 +00:00
Catherine Lee
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects tests based on subclassing `unittest.TestCase`, so setting `TestCase = object` removes a class from unittest test collection. pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions)
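
A rough sketch of the idea, illustrative rather than the exact PyTorch helper: a class with `__test__ = False` is skipped by pytest's name-based collection, and swapping out the `unittest.TestCase` base keeps unittest from collecting it as well:

```python
import unittest


class NoTest:
    __test__ = False  # pytest ignores classes (and subclasses) with __test__ = False


HAS_REQUIRED_BACKEND = False  # stand-in for the real capability check

# Previously spelled `TestCase = object`; this form disables collection under
# both unittest (not a TestCase subclass) and pytest (__test__ is False).
TestCase = unittest.TestCase if HAS_REQUIRED_BACKEND else NoTest


class TestBackendFeature(TestCase):
    def test_smoke(self):
        assert True  # never collected when the backend is unavailable
```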
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
Catherine Lee
7f5f0b3665 Run _nvfuser/test_torchscript serially (#95951)
Started at ce4cbac914 (11734276291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95951
Approved by: https://github.com/huydhn
2023-03-03 17:41:09 +00:00