pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Catherine Lee	0290fe65bd	Test TD (test removal) on crossref (#119426 ) Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut. test_ops, functorch/test_ops, and test_decomp and other really long running test files are not getting cut and make the top 25% to take really long (still 90+ min) The original plan was to test on rocm but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min, and crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426 Approved by: https://github.com/huydhn	2024-02-29 18:53:43 +00:00
albanD	30625ae582	Add cpp stack traces to our own reruns (#119408 ) Note that I'm not sure why we both have pytest rerun the failing test twice via `81abc2b249/test/run_test.py (L966)` before our own logic retries it as well? The failing test is only here to make sure it works as expected in the CI env. Will remove before landing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408 Approved by: https://github.com/huydhn	2024-02-26 22:21:14 +00:00
Catherine Lee	c39bbd6def	Numbers based TD (#119901 ) Convert from a list/bucket based TD system to just a numbers based TD system. Looks like a massive change but a decent amount of it is tests and removing code. Main file of interest is interface.py, which Github is collapsing by default due to size The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant. Other notable changes: * Use Frozenset to make TestRun hashable * Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901 Approved by: https://github.com/osalpekar, https://github.com/huydhn	2024-02-26 17:01:19 +00:00
Catherine Lee	cfddfce0d3	Alternate sharding (#119078 ) Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of which likely ending up on the non serial shards Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines -> 8 + 20/2 = 18 total minutes of tests -> 18 / 6 machines = 3 min per machine -> all serial tests should fit on 3 machines (3min, 3 min, 2min) -> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests Move serial tests to run first If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective. See `73e816ee80` for example logs Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078 Approved by: https://github.com/huydhn	2024-02-21 16:40:27 +00:00
Catherine Lee	af765dbdfd	[ez] Explicit env for run_test (#120251 ) env=None (which is the default) inherits the env from the calling process. Explicitly set the env to the calling process env so that things can be added to it later Tested in: `e7b4d8ec88` Checked that test-reports (which depend on the CI env var) get made. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251 Approved by: https://github.com/huydhn	2024-02-21 00:40:19 +00:00
PyTorch MergeBot	dfb83df889	Revert "Add cpp stack traces to our own reruns (#119408 )" This reverts commit `47182a8f4b`. Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/clee2000 due to iirc the default setting of env to None causes it to inherit the env of the calling process, I'll make a PR that makes it so that the old env vars don't disappear, and then re merge this on top of it. Reverting this because I think some important env vars are disappearing (specifically CI) ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1955128676))	2024-02-20 21:28:13 +00:00
PyTorch MergeBot	9b38ee2343	Revert "Alternate sharding (#119078 )" This reverts commit `861acda205`. Reverted https://github.com/pytorch/pytorch/pull/119078 on behalf of https://github.com/clee2000 due to failing `861acda205` ([comment](https://github.com/pytorch/pytorch/pull/119078#issuecomment-1946583857))	2024-02-15 16:59:50 +00:00
Catherine Lee	861acda205	Alternate sharding (#119078 ) Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of which likely ending up on the non serial shards Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines -> 8 + 20/2 = 18 total minutes of tests -> 18 / 6 machines = 3 min per machine -> all serial tests should fit on 3 machines (3min, 3 min, 2min) -> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests Move serial tests to run first If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective. See `73e816ee80` for example logs Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078 Approved by: https://github.com/huydhn	2024-02-15 01:32:44 +00:00
atalman	244b124bb8	Add linux cpu test for 3.12 (#117853 ) This is continuation of work: https://github.com/pytorch/pytorch/pull/113987 Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853 Approved by: https://github.com/albanD	2024-02-14 20:52:23 +00:00
albanD	47182a8f4b	Add cpp stack traces to our own reruns (#119408 ) Note that I'm not sure why we both have pytest rerun the failing test twice via `81abc2b249/test/run_test.py (L966)` before our own logic retries it as well? The failing test is only here to make sure it works as expected in the CI env. Will remove before landing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408 Approved by: https://github.com/huydhn	2024-02-14 18:40:23 +00:00
Catherine Lee	5d6e323549	No TD (test removal) option in CI (#118808 ) It currently doesn't do anything, but I will want these env vars later. Maybe I should start using ghstack Intention: --enable-td actually gets rid of tests I am open to better names Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808 Approved by: https://github.com/huydhn, https://github.com/osalpekar	2024-02-09 16:42:27 +00:00
PyTorch MergeBot	8182fce769	Revert "Add cpp stack traces to our own reruns (#119408 )" This reverts commit `fbe6f6236e`. Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))	2024-02-08 17:20:39 +00:00
albanD	fbe6f6236e	Add cpp stack traces to our own reruns (#119408 ) Note that I'm not sure why we both have pytest rerun the failing test twice via `81abc2b249/test/run_test.py (L966)` before our own logic retries it as well? The failing test is only here to make sure it works as expected in the CI env. Will remove before landing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408 Approved by: https://github.com/huydhn	2024-02-08 00:54:16 +00:00
Huy Do	3ed9df36a9	Clean up some obsolete TODOs in run_test and several test files (#119113 ) * The TODOs in `test/test_nestedtensor.py` has been mitigated, I keep the issue for reference. * ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~ * The TODOs in `run_test.py` to support disabling C++ tests is probably not going to happen. I have never seen a flaky C++ test that needs to be disabled before. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113 Approved by: https://github.com/kit1980	2024-02-03 23:54:30 +00:00
Joel Schlosser	3b41793412	Purge redundant module init tests (#119028 ) Fixes #118784 This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028 Approved by: https://github.com/zou3519	2024-02-02 20:17:00 +00:00
Catherine Lee	8b729fb826	[ez] Fix CI log file piping error (#118807 ) Fixes https://github.com/pytorch/pytorch/issues/118764 Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere	2024-02-02 03:07:56 +00:00
Catherine Lee	9391af9796	Merging heuristics (#118029 ) Everyday I move closer and closer to just using numbers * number of heuristics that marked it as high, probable, low, none etc * order of heuristics in the `__init__` file as well as how the heuristic ordered the tests * put heuristics historical edited files and profiling as not trial mode * briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029 Approved by: https://github.com/huydhn	2024-01-31 20:00:10 +00:00
Catherine Lee	2eefbc02a0	[ez] Discover tests without importing torch (#118574 ) Moves test discovery into a file that doesn't have import torch so test listing can be done without having torch installed. Helpful when you don't have torch installed (aka me when I'm feeling lazy) I want to move TD into it's own job that doesn't need to wait for build to finish, so this is part of that. The first commit is a nothing more than a copy paste of the selected functions/vars into a new file, the second commit has various changes that should be checked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574 Approved by: https://github.com/huydhn	2024-01-30 03:02:29 +00:00
Catherine Lee	84251d1d71	[ez] Windows log printing + save successful test logs (#118124 ) when doing print(f.read().decode etc etc) it prints an extra new line, so manually splitlines and strip to see if that helps My guess is windows line ending differences Also always save log file regardless of success or failure See `476b81a9bf` for what it looks like now Swapped to opening in text mode instead of binary, seems to be ok now. 42483193bf024983060a234dc0262f4840aef4b8 for example Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124 Approved by: https://github.com/huydhn	2024-01-26 21:14:25 +00:00
Catherine Lee	de9ddd19a5	Various CI settings (#117668 ) Test [ci-verbose-test-logs] (this worked, the test logs printing while running and interleaved and are really long) Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs). Also allows these to be set via pr body if the label name is in brackets ex [label name] or the test above. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668 Approved by: https://github.com/huydhn	2024-01-26 00:17:29 +00:00
Catherine Lee	364728b27b	Reduce pytest prints (#117069 ) * custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing) * normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497 "items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069 Approved by: https://github.com/huydhn	2024-01-23 18:39:30 +00:00
chuanqiw	40890ba8e7	[CI] Add python test skip logic for XPU (#117621 ) Add python test skip logic for XPU For test purpose, cherry-pick #116833 & #116850 firstly, and the xpu test passed https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. Revert them now. Works for #114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621 Approved by: https://github.com/huydhn	2024-01-23 08:20:42 +00:00
Catherine Lee	cef5b93f28	[ez] Serial when NUM_PROCS is 1 (#117977 ) Makes it easier to understand whats going on Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977 Approved by: https://github.com/huydhn	2024-01-22 23:11:41 +00:00
ydwu4	f96b7d06d7	[export] skip export tests when test with dynamo in ci (#117988 ) Fixes https://github.com/pytorch/pytorch/issues/117947. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988 Approved by: https://github.com/suo, https://github.com/zou3519	2024-01-22 22:14:36 +00:00
PyTorch MergeBot	f684e44fd6	Revert "Reduce pytest prints (#117069 )" This reverts commit `40dbd567e0`. Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))	2024-01-19 23:07:51 +00:00
Catherine Lee	40dbd567e0	Reduce pytest prints (#117069 ) * custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing) * normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497 "items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069 Approved by: https://github.com/huydhn	2024-01-19 18:42:12 +00:00
Catherine Lee	6c5c2121b1	Run some OOMing tests serially (#117759 ) They were disabled due to being flaky due to OOMs but got renamed. Seeing if running serially helps I kind of want to keep this test disabled since the rest of the file is probably fine... Issues in question: #113132 #113136 #113140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-01-19 16:45:35 +00:00
PyTorch MergeBot	77cfacab55	Revert "Reduce pytest prints (#117069 )" This reverts commit `2f89ef2300`. Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))	2024-01-19 00:27:03 +00:00
Catherine Lee	2f89ef2300	Reduce pytest prints (#117069 ) * custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing) * normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497 "items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069 Approved by: https://github.com/huydhn	2024-01-18 23:30:59 +00:00
rzou	5aa895e53e	Don't run inductor tests in Dynamo shard (#117747 ) In theory we could, but these get really slow once we turn on strict mode, so we're not going to for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747 Approved by: https://github.com/bdhirsh ghstack dependencies: #117729	2024-01-18 17:43:30 +00:00
Jack Taylor	db79ceb110	[ROCm] Enabling additional UTs on ROCm (#115738 ) Unskips mostly for dynamo/inductor UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-01-09 08:36:07 +00:00
Catherine Lee	d455c33cca	[ez][td] Pipe TD logs to log file (#116796 ) It is a bit annoying have them come up when searching through the logs. They're also surprisingly long Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796 Approved by: https://github.com/huydhn	2024-01-05 19:05:12 +00:00
Catherine Lee	c52b78ebc2	[ez] Remove some args from run_test.py (#115459 ) Don't think anyone uses these Pull Request resolved: https://github.com/pytorch/pytorch/pull/115459 Approved by: https://github.com/malfet, https://github.com/huydhn	2023-12-11 19:56:37 +00:00
Sijia Chen	641ec2115f	[AOTI] move model runner into a library (#115220 ) Summary: So that we can import it in fbcode and do some AOTI run in py env Test Plan: existed AOTI tests Reviewed By: chenyang78 Differential Revision: D51780021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115220 Approved by: https://github.com/desertfire	2023-12-09 19:03:32 +00:00
Catherine Lee	3b7d60b6ff	Fix keep-going (#112098 ) New function for continue on error Another solution might be to run the entire suite to the end and use last failed, but I'm worried about concurrent processes writing to the same last failed cache entry, it's a bit different than the usual test rerunning strategy we use especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (ex cuda error that causes sync to fail). Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d TODO: continue on error for --subprocess and test_distributed aren't working fully Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098 Approved by: https://github.com/huydhn	2023-11-30 04:01:57 +00:00
Jithun Nair	2ea2421b44	Skip unit tests that fail on MI210 runners (#114613 ) Taken from https://github.com/pytorch/pytorch/pull/105980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613 Approved by: https://github.com/malfet	2023-11-27 22:25:35 +00:00
Philip Meier	2aa486de9b	vendor packaging.version (#114108 ) Fixes #113940. This vendors the relevant parts of [`packaging==23.2.0`]() to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`. I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108 Approved by: https://github.com/malfet, https://github.com/albanD	2023-11-21 11:51:23 +00:00
Zain Rizvi	ec20c9044e	[TD] Fix metric emission for split test files (#113789 ) Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789 Approved by: https://github.com/clee2000	2023-11-16 23:19:40 +00:00
Catherine Lee	87aeb248c9	More random stepcurrent (#113620 ) Distributed tests for different backends have the same name, so they end up clashing using the current stepcurrent key, so tests were not being run. Disabled the following tests because they are failing: test_ddp_has_finalized test_broadcast_object_list <details> ``` 2023-11-14T06:44:01.0428686Z 2023-11-14T06:44:01.0430447Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_broadcast_object_list <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init 2023-11-14T06:44:01.0431048Z [1699943450.893723] [99f90b6e6ff3:10028:0] ucc_context.c:402 UCC ERROR failed to create tl context for cuda 2023-11-14T06:44:01.0431625Z [1699943450.914385] [99f90b6e6ff3:10029:0] ucc_context.c:402 UCC ERROR failed to create tl context for cuda 2023-11-14T06:44:01.0432314Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Caught exception: 2023-11-14T06:44:01.0433178Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last): 2023-11-14T06:44:01.0434677Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test 2023-11-14T06:44:01.0435435Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] getattr(self, test_name)() 2023-11-14T06:44:01.0436895Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper 2023-11-14T06:44:01.0437500Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] fn() 2023-11-14T06:44:01.0438917Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper 2023-11-14T06:44:01.0439637Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] method(args, kwargs) 2023-11-14T06:44:01.0441122Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper 2023-11-14T06:44:01.0441873Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.0443340Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper 2023-11-14T06:44:01.0444077Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] ret = func(args, *kwargs) 2023-11-14T06:44:01.0445769Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list 2023-11-14T06:44:01.0446732Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] return self._test_broadcast_object_list() 2023-11-14T06:44:01.0448433Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list 2023-11-14T06:44:01.0449187Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] dist.broadcast_object_list( 2023-11-14T06:44:01.0450553Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper 2023-11-14T06:44:01.0451621Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.0453161Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list 2023-11-14T06:44:01.0454065Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] broadcast(object_sizes_tensor, src=src, group=group) 2023-11-14T06:44:01.0455441Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper 2023-11-14T06:44:01.0456183Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.0457775Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast 2023-11-14T06:44:01.0458649Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] work = default_pg.broadcast([tensor], opts) 2023-11-14T06:44:01.0460923Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 1][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2 2023-11-14T06:44:01.0461471Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.0462430Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir: 2023-11-14T06:44:01.0463552Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list 2023-11-14T06:44:01.0464082Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.0465136Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2023-11-14T06:44:01.0465945Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] exiting process 1 with exit code: 10 2023-11-14T06:44:01.0466605Z [1699943451.005633] [99f90b6e6ff3:10029:0] parser.c:2034 UCX WARN unused environment variables: UCX_COMMIT; UCX_HOME 2023-11-14T06:44:01.0467303Z [1699943451.005633] [99f90b6e6ff3:10029:0] parser.c:2034 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning) 2023-11-14T06:44:01.0467972Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Caught exception: 2023-11-14T06:44:01.0468743Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last): 2023-11-14T06:44:01.0470233Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test 2023-11-14T06:44:01.0471106Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] getattr(self, test_name)() 2023-11-14T06:44:01.0472581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper 2023-11-14T06:44:01.0473162Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] fn() 2023-11-14T06:44:01.0474581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper 2023-11-14T06:44:01.0475314Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] method(args, *kwargs) 2023-11-14T06:44:01.0476776Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper 2023-11-14T06:44:01.0477535Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.0478993Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper 2023-11-14T06:44:01.0479886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] ret = func(args, *kwargs) 2023-11-14T06:44:01.0481593Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list 2023-11-14T06:44:01.0482429Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] return self._test_broadcast_object_list() 2023-11-14T06:44:01.0484145Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list 2023-11-14T06:44:01.0484886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] dist.broadcast_object_list( 2023-11-14T06:44:01.0486271Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper 2023-11-14T06:44:01.0487018Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.0488559Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list 2023-11-14T06:44:01.0489470Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] broadcast(object_sizes_tensor, src=src, group=group) 2023-11-14T06:44:01.0491078Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper 2023-11-14T06:44:01.0491912Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.0493369Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast 2023-11-14T06:44:01.0494419Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] work = default_pg.broadcast([tensor], opts) 2023-11-14T06:44:01.0496679Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2 2023-11-14T06:44:01.0497211Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.0498198Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir: 2023-11-14T06:44:01.0499291Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list 2023-11-14T06:44:01.0499838Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.0500881Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2023-11-14T06:44:01.0501667Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] exiting process 0 with exit code: 10 2023-11-14T06:44:01.0502343Z [1699943451.002362] [99f90b6e6ff3:10028:0] parser.c:2034 UCX WARN unused environment variables: UCX_COMMIT; UCX_HOME 2023-11-14T06:44:01.0503024Z [1699943451.002362] [99f90b6e6ff3:10028:0] parser.c:2034 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning) 2023-11-14T06:44:01.0503411Z ('RERUN', {'yellow': True}) [6.1102s] [100%] ``` </details> test_ddp_sync_bn_training_vs_eval <details> ``` 2023-11-14T06:44:01.1494815Z 2023-11-14T06:44:01.1496630Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_sync_bn_training_vs_eval <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init 2023-11-14T06:44:01.1497290Z [1699943779.976037] [99f90b6e6ff3:10758:0] parser.c:2034 UCX WARN unused environment variables: UCX_COMMIT; UCX_HOME 2023-11-14T06:44:01.1498119Z [1699943779.976037] [99f90b6e6ff3:10758:0] parser.c:2034 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning) 2023-11-14T06:44:01.1498808Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:314] Completed Stage: Warm Up 2023-11-14T06:44:01.1499465Z [1699943779.970792] [99f90b6e6ff3:10757:0] parser.c:2034 UCX WARN unused environment variables: UCX_COMMIT; UCX_HOME 2023-11-14T06:44:01.1500160Z [1699943779.970792] [99f90b6e6ff3:10757:0] parser.c:2034 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning) 2023-11-14T06:44:01.1500820Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:314] Completed Stage: Warm Up 2023-11-14T06:44:01.1501556Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:320] Completed Stage: Collection 2023-11-14T06:44:01.1502239Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:320] Completed Stage: Collection 2023-11-14T06:44:01.1502952Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:324] Completed Stage: Post Processing 2023-11-14T06:44:01.1503678Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:324] Completed Stage: Post Processing 2023-11-14T06:44:01.1504350Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Caught exception: 2023-11-14T06:44:01.1505119Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last): 2023-11-14T06:44:01.1506729Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test 2023-11-14T06:44:01.1507492Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] getattr(self, test_name)() 2023-11-14T06:44:01.1508992Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper 2023-11-14T06:44:01.1509578Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] fn() 2023-11-14T06:44:01.1510994Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper 2023-11-14T06:44:01.1511725Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] method(args, *kwargs) 2023-11-14T06:44:01.1513193Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper 2023-11-14T06:44:01.1513962Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.1515697Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval 2023-11-14T06:44:01.1516529Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] self.assertNotEqual([], all_gather_calls) 2023-11-14T06:44:01.1518019Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual 2023-11-14T06:44:01.1518910Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] with self.assertRaises(AssertionError, msg=msg): 2023-11-14T06:44:01.1520177Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__ 2023-11-14T06:44:01.1521062Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] self._raiseFailure("{} not raised".format(exc_name)) 2023-11-14T06:44:01.1522238Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure 2023-11-14T06:44:01.1523099Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] raise self.test_case.failureException(msg) 2023-11-14T06:44:01.1523923Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised 2023-11-14T06:44:01.1524470Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.1525481Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir: 2023-11-14T06:44:01.1526632Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval 2023-11-14T06:44:01.1527180Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.1528223Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2023-11-14T06:44:01.1529029Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] exiting process 0 with exit code: 10 2023-11-14T06:44:01.1529786Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Caught exception: 2023-11-14T06:44:01.1530576Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last): 2023-11-14T06:44:01.1532383Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test 2023-11-14T06:44:01.1533127Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] getattr(self, test_name)() 2023-11-14T06:44:01.1534608Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper 2023-11-14T06:44:01.1535194Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] fn() 2023-11-14T06:44:01.1536817Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper 2023-11-14T06:44:01.1537575Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] method(args, *kwargs) 2023-11-14T06:44:01.1539036Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper 2023-11-14T06:44:01.1539800Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] return func(args, *kwargs) 2023-11-14T06:44:01.1541531Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval 2023-11-14T06:44:01.1542388Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] self.assertNotEqual([], all_gather_calls) 2023-11-14T06:44:01.1544015Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual 2023-11-14T06:44:01.1544907Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] with self.assertRaises(AssertionError, msg=msg): 2023-11-14T06:44:01.1546061Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__ 2023-11-14T06:44:01.1546944Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] self._raiseFailure("{} not raised".format(exc_name)) 2023-11-14T06:44:01.1548142Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure 2023-11-14T06:44:01.1548991Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] raise self.test_case.failureException(msg) 2023-11-14T06:44:01.1549806Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised 2023-11-14T06:44:01.1550350Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.1551304Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir: 2023-11-14T06:44:01.1552462Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval 2023-11-14T06:44:01.1553095Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] 2023-11-14T06:44:01.1554166Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2023-11-14T06:44:01.1554976Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] exiting process 1 with exit code: 10 2023-11-14T06:44:01.1555235Z ('RERUN', {'yellow': True}) [6.6107s] [100%] ``` </details> test_backend_full_group <details> ``` 2023-11-14T22:51:56.4502470Z FAILED [5.2125s] distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_backend_full_group - RuntimeError: Process 0 exited with error code 10 and exception: 2023-11-14T22:51:56.4502665Z Traceback (most recent call last): 2023-11-14T22:51:56.4503603Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test 2023-11-14T22:51:56.4503796Z getattr(self, test_name)() 2023-11-14T22:51:56.4504710Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper 2023-11-14T22:51:56.4504845Z fn() 2023-11-14T22:51:56.4505737Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper 2023-11-14T22:51:56.4505896Z method(args, *kwargs) 2023-11-14T22:51:56.4506823Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper 2023-11-14T22:51:56.4506992Z return func(args, *kwargs) 2023-11-14T22:51:56.4508285Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group 2023-11-14T22:51:56.4508640Z self._test_group_override_backend(self._init_full_group_test) 2023-11-14T22:51:56.4509798Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend 2023-11-14T22:51:56.4510104Z group, group_id, rank = initializer(backend=new_backend) 2023-11-14T22:51:56.4510629Z UnboundLocalError: local variable 'new_backend' referenced before assignment 2023-11-14T22:51:56.4510650Z 2023-11-14T22:51:56.4510987Z To execute this test, run the following from the base repo dir: 2023-11-14T22:51:56.4511525Z python test/distributed/test_distributed_spawn.py -k test_backend_full_group 2023-11-14T22:51:56.4511545Z 2023-11-14T22:51:56.4511970Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2023-11-14T22:51:56.4511989Z 2023-11-14T22:51:56.4512242Z Process 1 exited with error code 10 and exception: 2023-11-14T22:51:56.4512454Z Traceback (most recent call last): 2023-11-14T22:51:56.4513380Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test 2023-11-14T22:51:56.4513687Z getattr(self, test_name)() 2023-11-14T22:51:56.4514612Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper 2023-11-14T22:51:56.4514746Z fn() 2023-11-14T22:51:56.4515633Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper 2023-11-14T22:51:56.4515791Z method(args, *kwargs) 2023-11-14T22:51:56.4516708Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper 2023-11-14T22:51:56.4516895Z return func(args, **kwargs) 2023-11-14T22:51:56.4518008Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group 2023-11-14T22:51:56.4518352Z self._test_group_override_backend(self._init_full_group_test) 2023-11-14T22:51:56.4519509Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend 2023-11-14T22:51:56.4519813Z group, group_id, rank = initializer(backend=new_backend) 2023-11-14T22:51:56.4520334Z UnboundLocalError: local variable 'new_backend' referenced before assignment 2023-11-14T22:51:56.4520355Z 2023-11-14T22:51:56.4528843Z To execute this test, run the following from the base repo dir: 2023-11-14T22:51:56.4529492Z python test/distributed/test_distributed_spawn.py -k test_backend_full_group 2023-11-14T22:51:56.4529681Z 2023-11-14T22:51:56.4530122Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2023-11-14T22:51:56.4530423Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! ``` </details> pretty sure the solution for this one is to add ucc in _test_group_override_backend https://ossci-raw-job-status.s3.amazonaws.com/log/18651430019 https://ossci-raw-job-status.s3.amazonaws.com/log/18651430132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113620 Approved by: https://github.com/huydhn	2023-11-15 21:56:10 +00:00
Catherine Lee	0c448526a4	[experiment][TD] Rating number system (#112676 ) Emit excessive amount of heuristic info emitted, but that just means I can do more with it later? Pull Request resolved: https://github.com/pytorch/pytorch/pull/112676 Approved by: https://github.com/ZainRizvi	2023-11-07 19:40:11 +00:00
Nikita Shulga	e2e5897269	[CI] Do not use `packaging` in run_tests.py (#112873 ) It used to check that CUDA is newer than 11.6, but all of them are Yet another mitigation towards missing `packaging` on MacOS Pull Request resolved: https://github.com/pytorch/pytorch/pull/112873 Approved by: https://github.com/huydhn	2023-11-03 17:22:46 +00:00
Zain Rizvi	4e67c69a7d	[TD] Support downgrading test relevance (#112671 ) Allow heuristics to actually downgrade the relevance of a test. Note that NONE/UNLIKELY tests will still get executed, but they will be ran at the end of the CI The Relevance chosen affects the outcome when Heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance: HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED Given that we assume ordering based on the list in init right now since the lists are appended, do a similar thing for UNLIKELY and NONE ex HEURISTICS = [a, b, c, d] currently all things in b.high and added after a.high if b.none includes things in a.high, a.high trumps if b.none includes things in a.probable, then b.none trumps since none is stronger than probable if b.unlikely includes things from a.high/probable, a.high/probable trumps since unlikely and probable are at a higher strength Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671 Approved by: https://github.com/clee2000	2023-11-02 21:02:40 +00:00
Zain Rizvi	a5641bc56b	[TD] Enable Test Class granularity on heuristics (#112161 ) Changes the heuristic framework to support multiple prioritizing individual classes within a test file. Components of this included: - Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file it would pass in the test's name, now to prioritize a class within a test it uses the notation "test::classname" - Changes are fully backwards compatible with existing heuristics - Test sharding now supports sharding individual tests (for when they're prioritized) - When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161 Approved by: https://github.com/huydhn	2023-10-31 18:11:05 +00:00
Catherine Lee	3b5b7ebd09	[ci] Save various json files from test infra into folder (#111516 ) We pull a lot of files from https://github.com/pytorch/test-infra/blob/generated-stats/stats and name them separately when we add them to the artifacts in the build, so stick them in a folder and just add that instead. Slow test and disabled test jsons remain as they were since they are pulled during the test step and do not need to be included in the artifacts during build since they are not used for sharding. Sanity checked that test times could be found for linux, mac, windows, and rocm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111516 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2023-10-23 20:38:25 +00:00
Nikita Shulga	e9a51a6a07	[BE] Revive test_typing (#111428 ) `test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time. In this PR, same functionality is re-written using unittest framework, and `parametrize` from `torch.testing._internal._common_utils`. Valid `test_typing.py` with ufmt Disable `fail/bitwise_ops.py` and `pass/jit.py` as it regressed at some point as well as one of examples in `namedtuple.py` as `torch.linalg.qr` type is no longer revealed correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428 Approved by: https://github.com/clee2000	2023-10-18 02:19:49 +00:00
Jack Taylor	6b92c367c5	Add test_jit_cuda_fuser to ROCM_BLOCKLIST (#110440 ) Adds the nvfuser related unit test suite to ROCM_BLOCKLIST as should not be run on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110440 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/lezcano	2023-10-06 08:47:15 +00:00
Catherine Lee	8a09fe4a05	[ez] Remove print in heuristics aggregation (#110621 ) Move print to the beginning instead because putting it at the end makes it so you have to scroll through when debugging, and nothing in that function indicates that it should be printing anything Also the line for printing disabled issues out of the for loop Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621 Approved by: https://github.com/huydhn	2023-10-06 02:04:53 +00:00
Catherine Lee	d6e5898e8d	Quieter logs in CI (#110033 ) To reduce the amount of logs * for successes, only print the part that says what tests ran and don't print the rest. Zip the log into an artifact. The line listing al the test names is really long, but if you view source of the raw logs, it will not wrap so it will only be one line. The log classifier can also be configured to ignored this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [ 9%]` * for failures/reruns, print logs. Do not zip. Also * change log artifact name Examples of various logs: `a074db0f7f` failures `1b439e24c4` failures possibly controversial haha should i include an option for always printing? Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033 Approved by: https://github.com/huydhn	2023-10-05 16:40:37 +00:00
Catherine Lee	f69e9c8c91	run_tests.py minor logging changes (#110188 ) Minor logging changes that just kind of annoyed me: * prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving import within the function (idk if this is ok) * prevent constant printing of `Ignoring disabled issues: ['']` (no idea why it was not gated behind a function or main) * change all prints in run_tests.py to be through stderr so theres no weird interleaving (although if everything goes through stderr, might as well just print everything through stdout...) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi	2023-10-03 01:22:47 +00:00
Zain Rizvi	1277d0e834	[BE] Add sharding data by default to metrics (#110035 ) Extend metric library to allow setting global metrics on a process level which will always be emitted. Current use case for them is to include shard information every time a metric is emitted by run_test.py <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at 0cae92c</samp> > _`run_test` refactored_ > _Sharding metrics in Rockset_ > _Autumn of testing_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035 Approved by: https://github.com/clee2000	2023-09-26 17:06:49 +00:00

1 2 3 4 5 ...

604 Commits