604 Commits

Author SHA1 Message Date
Catherine Lee
0290fe65bd Test TD (test removal) on crossref (#119426)
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut, so the top 25% still takes a really long time (90+ min).
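
As a rough illustration of what "cut the bottom 75%" means, here is a minimal sketch. The function name, the score dictionary, and the score values are made up for the example and are not taken from this PR.

```python
import math


def keep_top_fraction(scores, keep_fraction=0.25):
    """Keep the highest-scoring test files; everything else would be cut."""
    # Rank file names from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep_count = math.ceil(len(ranked) * keep_fraction)
    return ranked[:keep_count]


# Hypothetical scores, for illustration only.
scores = {"test_ops": 0.9, "test_nn": 0.4, "test_jit": 0.2, "test_fx": 0.1}
kept = keep_top_fraction(scores)
print(kept)  # ['test_ops']
```

Note that the cut is by file count, not by runtime, which is why a few very long files in the kept 25% can still dominate the total.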

The original plan was to test on rocm but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min, and crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
2024-02-29 18:53:43 +00:00
albanD
30625ae582 Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-26 22:21:14 +00:00
Catherine Lee
c39bbd6def Numbers based TD (#119901)
Convert from a list/bucket based TD system to just a numbers based TD system.  Looks like a massive change but a decent amount of it is tests and removing code.

Main file of interest is interface.py, which Github is collapsing by default due to size

The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.

Other notable changes:
* Use Frozenset to make TestRun hashable
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
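
A minimal sketch of why a frozenset helps with hashability. `ExampleTestRun` and its field names are hypothetical stand-ins, not the actual TestRun class from this PR.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ExampleTestRun:
    # Hypothetical stand-in for TestRun; the field names are assumptions.
    test_file: str
    included: FrozenSet[str] = frozenset()


a = ExampleTestRun("test_ops", frozenset({"TestCommon"}))
b = ExampleTestRun("test_ops", frozenset({"TestCommon"}))

# Equal instances hash equally, so they collapse to a single dict key.
scores = {a: 0.5}
scores[b] = 0.9
print(len(scores))  # 1
```

With a plain `set` field the generated hash would fail at lookup time, because mutable sets are unhashable.
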
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2024-02-26 17:01:19 +00:00
Catherine Lee
cfddfce0d3 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible.  Parallel tests are then distributed across all shards, most of which will likely end up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
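
The arithmetic in this example can be reproduced with a short sketch. The function name and the greedy fill are illustrative assumptions, not the sharding code from this PR.

```python
def plan_serial_shards(serial_min, parallel_min, procs_per_machine, machines):
    """Return (per-machine time budget, serial minutes placed on each machine)."""
    # Parallel tests run on several processes at once, so their wall time is divided.
    total = serial_min + parallel_min / procs_per_machine
    budget = total / machines

    # Fill machines with serial tests one at a time until none are left.
    serial_per_machine = []
    remaining = serial_min
    while remaining > 0:
        chunk = min(budget, remaining)
        serial_per_machine.append(chunk)
        remaining -= chunk
    return budget, serial_per_machine


budget, serial = plan_serial_shards(8, 20, 2, 6)
print(budget)  # 3.0
print(serial)  # [3.0, 3.0, 2.0]
```

The third machine has 1 minute of budget left after its serial tests, which is why it is shared with parallel tests.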

Move serial tests to run first

If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-21 16:40:27 +00:00
Catherine Lee
af765dbdfd [ez] Explicit env for run_test (#120251)
env=None (which is the default) inherits the env from the calling process.  Explicitly set the env to the calling process env so that things can be added to it later
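
A minimal sketch of the same idea using `subprocess.run` directly. The variable name `EXAMPLE_EXTRA_VAR` is hypothetical, and this is not the code in run_test.py.

```python
import os
import subprocess
import sys

# Copy the calling process's environment so extra variables can be added
# without mutating os.environ itself.
env = dict(os.environ)
env["EXAMPLE_EXTRA_VAR"] = "1"  # hypothetical variable, for illustration only

# The child prints the variable it received from the explicit env.
child_code = "import os; print(os.environ.get('EXAMPLE_EXTRA_VAR'))"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print(child_value)  # 1
```

Passing a dict that contains only the new variables would drop everything else (including CI), which is the failure mode an explicit copy avoids.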

Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
2024-02-21 00:40:19 +00:00
PyTorch MergeBot
dfb83df889 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit 47182a8f4b.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/clee2000 due to iirc the default setting of env to None causes it to inherit the env of the calling process, I'll make a PR that makes it so that the old env vars don't disappear, and then re merge this on top of it.  Reverting this because I think some important env vars are disappearing (specifically CI) ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1955128676))
2024-02-20 21:28:13 +00:00
PyTorch MergeBot
9b38ee2343 Revert "Alternate sharding (#119078)"
This reverts commit 861acda205.

Reverted https://github.com/pytorch/pytorch/pull/119078 on behalf of https://github.com/clee2000 due to failing 861acda205 ([comment](https://github.com/pytorch/pytorch/pull/119078#issuecomment-1946583857))
2024-02-15 16:59:50 +00:00
Catherine Lee
861acda205 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible.  Parallel tests are then distributed across all shards, most of which will likely end up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests

Move serial tests to run first

If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-15 01:32:44 +00:00
atalman
244b124bb8 Add linux cpu test for 3.12 (#117853)
This is a continuation of the work in https://github.com/pytorch/pytorch/pull/113987

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853
Approved by: https://github.com/albanD
2024-02-14 20:52:23 +00:00
albanD
47182a8f4b Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-14 18:40:23 +00:00
Catherine Lee
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
PyTorch MergeBot
8182fce769 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit fbe6f6236e.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))
2024-02-08 17:20:39 +00:00
albanD
fbe6f6236e Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-08 00:54:16 +00:00
Huy Do
3ed9df36a9 Clean up some obsolete TODOs in run_test and several test files (#119113)
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I'm keeping the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~
* The TODO in `run_test.py` to support disabling C++ tests is probably not going to happen.  I have never seen a flaky C++ test that needed to be disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
2024-02-03 23:54:30 +00:00
Joel Schlosser
3b41793412 Purge redundant module init tests (#119028)
Fixes #118784

This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028
Approved by: https://github.com/zou3519
2024-02-02 20:17:00 +00:00
Catherine Lee
8b729fb826 [ez] Fix CI log file piping error (#118807)
Fixes https://github.com/pytorch/pytorch/issues/118764

Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-02-02 03:07:56 +00:00
Catherine Lee
9391af9796 Merging heuristics (#118029)
Every day I move closer and closer to just using numbers

* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* put heuristics historical edited files and profiling as not trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
2024-01-31 20:00:10 +00:00
Catherine Lee
2eefbc02a0 [ez] Discover tests without importing torch (#118574)
Moves test discovery into a file that doesn't have import torch so test listing can be done without having torch installed.
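
A rough sketch of torch-free discovery, assuming tests can be found purely by file name. `discover_tests` and the `test_*.py` pattern are illustrative assumptions, not the functions moved in this PR.

```python
import tempfile
from pathlib import Path


def discover_tests(test_dir):
    """List test modules under test_dir by file name only; nothing is imported."""
    root = Path(test_dir)
    return sorted(
        p.relative_to(root).with_suffix("").as_posix()
        for p in root.rglob("test_*.py")
    )


# Build a throwaway directory tree to demonstrate the listing.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "distributed").mkdir()
    (Path(tmp) / "test_ops.py").touch()
    (Path(tmp) / "distributed" / "test_store.py").touch()
    (Path(tmp) / "helper.py").touch()
    found = discover_tests(tmp)

print(found)  # ['distributed/test_store', 'test_ops']
```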

Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for the build to finish, so this is part of that.

The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
2024-01-30 03:02:29 +00:00
Catherine Lee
84251d1d71 [ez] Windows log printing + save successful test logs (#118124)
when doing print(f.read().decode etc etc) it prints an extra new line, so manually splitlines and strip to see if that helps

My guess is windows line ending differences

Also always save log file regardless of success or failure

See 476b81a9bf for what it looks like now

Swapped to opening in text mode instead of binary, seems to be ok now.
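
The difference between the two modes can be seen in a small sketch. The file contents are made up, and this is not the log-printing code from the PR.

```python
import os
import tempfile

# Write a log with Windows-style line endings.
with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as f:
    f.write(b"first\r\nsecond\r\n")
    path = f.name

with open(path, "rb") as f:
    from_binary = f.read().decode("utf-8")  # "\r" characters survive decoding

with open(path, "r", encoding="utf-8") as f:
    from_text = f.read()  # universal newlines: "\r\n" becomes "\n"

os.remove(path)

print(repr(from_binary))  # 'first\r\nsecond\r\n'
print(repr(from_text))    # 'first\nsecond\n'
```

The leftover `\r` in the decoded bytes is a plausible source of the extra blank lines, consistent with the line-ending guess above.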

42483193bf024983060a234dc0262f4840aef4b8 for example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
2024-01-26 21:14:25 +00:00
Catherine Lee
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long)

Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs).

Also allows these to be set via pr body if the label name is in brackets ex [label name] or the test above.
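
One way to picture the bracket matching is the sketch below. `labels_from_pr_body` and the regex are assumptions for illustration, not the parsing code from this PR.

```python
import re


def labels_from_pr_body(body, known_labels):
    """Return the known labels that appear in square brackets in the PR body."""
    bracketed = set(re.findall(r"\[([^\[\]]+)\]", body))
    return {label for label in known_labels if label in bracketed}


body = "Test [ci-verbose-test-logs] (see logs)"
# The second label name is hypothetical, used only to show a non-match.
known = {"ci-verbose-test-logs", "example-other-label"}
found_labels = labels_from_pr_body(body, known)
print(found_labels)  # {'ci-verbose-test-logs'}
```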

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
Catherine Lee
364728b27b Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-23 18:39:30 +00:00
chuanqiw
40890ba8e7 [CI] Add python test skip logic for XPU (#117621)
Add python test skip logic for XPU

For testing purposes, #116833 & #116850 were cherry-picked first, and the xpu test passed https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. They have now been reverted.

Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621
Approved by: https://github.com/huydhn
2024-01-23 08:20:42 +00:00
Catherine Lee
cef5b93f28 [ez] Serial when NUM_PROCS is 1 (#117977)
Makes it easier to understand what's going on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977
Approved by: https://github.com/huydhn
2024-01-22 23:11:41 +00:00
ydwu4
f96b7d06d7 [export] skip export tests when test with dynamo in ci (#117988)
Fixes https://github.com/pytorch/pytorch/issues/117947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988
Approved by: https://github.com/suo, https://github.com/zou3519
2024-01-22 22:14:36 +00:00
PyTorch MergeBot
f684e44fd6 Revert "Reduce pytest prints (#117069)"
This reverts commit 40dbd567e0.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))
2024-01-19 23:07:51 +00:00
Catherine Lee
40dbd567e0 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-19 18:42:12 +00:00
Catherine Lee
6c5c2121b1 Run some OOMing tests serially (#117759)
They were disabled due to being flaky due to OOMs but got renamed.  Seeing if running serially helps

I kind of want to keep this test disabled since the rest of the file is probably fine...

Issues in question: #113132 #113136 #113140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 16:45:35 +00:00
PyTorch MergeBot
77cfacab55 Revert "Reduce pytest prints (#117069)"
This reverts commit 2f89ef2300.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))
2024-01-19 00:27:03 +00:00
Catherine Lee
2f89ef2300 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-18 23:30:59 +00:00
rzou
5aa895e53e Don't run inductor tests in Dynamo shard (#117747)
In theory we could, but these get really slow once we turn on strict
mode, so we're not going to for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117729
2024-01-18 17:43:30 +00:00
Jack Taylor
db79ceb110 [ROCm] Enabling additional UTs on ROCm (#115738)
Unskips mostly for dynamo/inductor UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-01-09 08:36:07 +00:00
Catherine Lee
d455c33cca [ez][td] Pipe TD logs to log file (#116796)
It is a bit annoying to have them come up when searching through the logs.  They're also surprisingly long
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796
Approved by: https://github.com/huydhn
2024-01-05 19:05:12 +00:00
Catherine Lee
c52b78ebc2 [ez] Remove some args from run_test.py (#115459)
Don't think anyone uses these
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115459
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:56:37 +00:00
Sijia Chen
641ec2115f [AOTI] move model runner into a library (#115220)
Summary: So that we can import it in fbcode and do some AOTI run in py env

Test Plan: existing AOTI tests

Reviewed By: chenyang78

Differential Revision: D51780021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115220
Approved by: https://github.com/desertfire
2023-12-09 19:03:32 +00:00
Catherine Lee
3b7d60b6ff Fix keep-going (#112098)
New function for continue on error

Another solution might be to run the entire suite to the end and use last failed, but I'm worried about concurrent processes writing to the same last failed cache entry. It's also a bit different from the usual test rerunning strategy we use, especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (ex cuda error that causes sync to fail).

Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d

TODO: continue on error for --subprocess and test_distributed isn't working fully
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098
Approved by: https://github.com/huydhn
2023-11-30 04:01:57 +00:00
Jithun Nair
2ea2421b44 Skip unit tests that fail on MI210 runners (#114613)
Taken from https://github.com/pytorch/pytorch/pull/105980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613
Approved by: https://github.com/malfet
2023-11-27 22:25:35 +00:00
Philip Meier
2aa486de9b vendor packaging.version (#114108)
Fixes #113940. This vendors the relevant parts of `packaging==23.2.0` to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`.

I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108
Approved by: https://github.com/malfet, https://github.com/albanD
2023-11-21 11:51:23 +00:00
Zain Rizvi
ec20c9044e [TD] Fix metric emission for split test files (#113789)
Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789
Approved by: https://github.com/clee2000
2023-11-16 23:19:40 +00:00
Catherine Lee
87aeb248c9 More random stepcurrent (#113620)
Distributed tests for different backends have the same name, so they end up clashing using the current stepcurrent key, so tests were not being run.

Disabled the following tests because they are failing:
test_ddp_has_finalized

test_broadcast_object_list
<details>

```

2023-11-14T06:44:01.0428686Z
2023-11-14T06:44:01.0430447Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_broadcast_object_list <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.0431048Z [1699943450.893723] [99f90b6e6ff3:10028:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0431625Z [1699943450.914385] [99f90b6e6ff3:10029:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0432314Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0433178Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0434677Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0435435Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0436895Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0437500Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0438917Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0439637Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0441122Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0441873Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0443340Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0444077Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0445769Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0446732Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0448433Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0449187Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0450553Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0451621Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0453161Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0454065Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0455441Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0456183Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0457775Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0458649Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0460923Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 1][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0461471Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0462430Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0463552Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0464082Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0465136Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0465945Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.0466605Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0467303Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0467972Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0468743Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0470233Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0471106Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0472581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0473162Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0474581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0475314Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0476776Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0477535Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0478993Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0479886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0481593Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0482429Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0484145Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0484886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0486271Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0487018Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0488559Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0489470Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0491078Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0491912Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0493369Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0494419Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0496679Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0497211Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0498198Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0499291Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0499838Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0500881Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0501667Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.0502343Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0503024Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0503411Z ('RERUN', {'yellow': True}) [6.1102s] [100%]
```
</details>

test_ddp_sync_bn_training_vs_eval

<details>

```

2023-11-14T06:44:01.1494815Z
2023-11-14T06:44:01.1496630Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_sync_bn_training_vs_eval <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.1497290Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1498119Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1498808Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1499465Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1500160Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1500820Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1501556Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502239Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502952Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1503678Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1504350Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1505119Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1506729Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1507492Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1508992Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1509578Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1510994Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1511725Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1513193Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1513962Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1515697Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1516529Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1518019Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1518910Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1520177Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1521062Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1522238Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1523099Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1523923Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1524470Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1525481Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1526632Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1527180Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1528223Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1529029Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.1529786Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1530576Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1532383Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1533127Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1534608Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1535194Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1536817Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1537575Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1539036Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1539800Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1541531Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1542388Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1544015Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1544907Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1546061Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1546944Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1548142Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1548991Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1549806Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1550350Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1551304Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1552462Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1553095Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1554166Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1554976Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.1555235Z ('RERUN', {'yellow': True}) [6.6107s] [100%]
```
</details>

test_backend_full_group
<details>

```
2023-11-14T22:51:56.4502470Z FAILED [5.2125s] distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_backend_full_group - RuntimeError: Process 0 exited with error code 10 and exception:
2023-11-14T22:51:56.4502665Z Traceback (most recent call last):
2023-11-14T22:51:56.4503603Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4503796Z     getattr(self, test_name)()
2023-11-14T22:51:56.4504710Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4504845Z     fn()
2023-11-14T22:51:56.4505737Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4505896Z     method(*args, **kwargs)
2023-11-14T22:51:56.4506823Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4506992Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4508285Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4508640Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4509798Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4510104Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4510629Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4510650Z
2023-11-14T22:51:56.4510987Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4511525Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4511545Z
2023-11-14T22:51:56.4511970Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4511989Z
2023-11-14T22:51:56.4512242Z Process 1 exited with error code 10 and exception:
2023-11-14T22:51:56.4512454Z Traceback (most recent call last):
2023-11-14T22:51:56.4513380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4513687Z     getattr(self, test_name)()
2023-11-14T22:51:56.4514612Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4514746Z     fn()
2023-11-14T22:51:56.4515633Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4515791Z     method(*args, **kwargs)
2023-11-14T22:51:56.4516708Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4516895Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4518008Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4518352Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4519509Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4519813Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4520334Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4520355Z
2023-11-14T22:51:56.4528843Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4529492Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4529681Z
2023-11-14T22:51:56.4530122Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4530423Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
```
</details>

Pretty sure the solution for this one is to add ucc in `_test_group_override_backend`.
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430019
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430132
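
The `UnboundLocalError` in the `test_backend_full_group` log is what Python raises when an if/elif chain has no branch for the current backend, so the variable is never bound. A minimal sketch of that failure mode and one possible fix follows. The function names are hypothetical, and mapping `ucc` to `gloo` is an assumption for illustration, not necessarily what `_test_group_override_backend` should do.

```python
def pick_override_backend(backend: str) -> str:
    """Buggy shape: no branch binds new_backend for an unlisted backend."""
    if backend == "gloo":
        new_backend = "nccl"
    elif backend == "nccl":
        new_backend = "gloo"
    # "ucc" falls through, so the next line raises UnboundLocalError.
    return new_backend


def pick_override_backend_fixed(backend: str) -> str:
    """Possible fix: cover ucc explicitly and fail loudly on anything else."""
    if backend == "gloo":
        new_backend = "nccl"
    elif backend in ("nccl", "ucc"):  # assumption: ucc overrides to gloo
        new_backend = "gloo"
    else:
        raise ValueError(f"no override backend configured for {backend!r}")
    return new_backend
```

The explicit `else` turns any future missing backend into a readable error instead of an `UnboundLocalError` further down.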
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113620
Approved by: https://github.com/huydhn
2023-11-15 21:56:10 +00:00
Catherine Lee
0c448526a4 [experiment][TD] Rating number system (#112676)
Emits an excessive amount of heuristic info, but that just means I can do more with it later?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112676
Approved by: https://github.com/ZainRizvi
2023-11-07 19:40:11 +00:00
Nikita Shulga
e2e5897269 [CI] Do not use packaging in run_tests.py (#112873)
It used to check that CUDA is newer than 11.6, but now all CUDA versions are.

Yet another mitigation towards missing `packaging` on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112873
Approved by: https://github.com/huydhn
2023-11-03 17:22:46 +00:00
Zain Rizvi
4e67c69a7d [TD] Support downgrading test relevance (#112671)
Allow heuristics to actually downgrade the relevance of a test.  Note that NONE/UNLIKELY tests will still get executed, but they will be run at the end of the CI

The Relevance chosen affects the outcome when Heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance:

HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED

Given that we assume ordering based on the list in init right now since the lists are appended, do a similar thing for UNLIKELY and NONE
e.g. HEURISTICS = [a, b, c, d]
currently all things in b.high are added after a.high
if b.none includes things in a.high, a.high trumps
if b.none includes things in a.probable, then b.none trumps since none is stronger than probable
if b.unlikely includes things from a.high/probable, a.high/probable trumps since high and probable are at a higher strength than unlikely
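
The conflict rules above can be sketched as a merge over per-heuristic predictions. This is a minimal sketch, not the code from the PR: `Relevance` values and `merge_predictions` are hypothetical names, and it assumes each heuristic's output can be flattened into a `test -> Relevance` mapping processed in `HEURISTICS` order.

```python
from enum import IntEnum
from typing import Dict, List


class Relevance(IntEnum):
    # Larger value = more confidence in the declared relevance:
    # HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED
    HIGH = 4
    NONE = 3
    PROBABLE = 2
    UNLIKELY = 1
    UNRANKED = 0


def merge_predictions(predictions: List[Dict[str, Relevance]]) -> Dict[str, Relevance]:
    """Combine predictions in heuristic order; the stronger relevance wins."""
    merged: Dict[str, Relevance] = {}
    for prediction in predictions:
        for test, relevance in prediction.items():
            current = merged.setdefault(test, Relevance.UNRANKED)
            # Only a strictly stronger relevance replaces the current one,
            # so on a tie the earlier heuristic keeps its say.
            if relevance > current:
                merged[test] = relevance
    return merged
```

With `a = {"x": HIGH, "y": PROBABLE, "z": PROBABLE}` and `b = {"x": NONE, "y": NONE, "z": UNLIKELY}`, the merge keeps `x` at HIGH, moves `y` to NONE, and keeps `z` at PROBABLE, matching the three cases listed above.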
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671
Approved by: https://github.com/clee2000
2023-11-02 21:02:40 +00:00
Zain Rizvi
a5641bc56b [TD] Enable Test Class granularity on heuristics (#112161)
Changes the heuristic framework to support prioritizing individual classes within a test file.

Components of this included:
- Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file it would pass in the test's name; now, to prioritize a class within a test, it uses the notation "test::classname"
- Changes are fully backwards compatible with existing heuristics
- Test sharding now supports sharding individual tests (for when they're prioritized)
- When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest
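
As a rough illustration of the "test::classname" notation and the "-k" flags mentioned above, here is a minimal sketch. `parse_test_spec` and `build_pytest_args` are hypothetical helpers, not functions from the PR, and joining class names with `or` is an assumption based on how pytest's `-k` expressions work.

```python
from typing import Dict, List, Optional, Tuple


def parse_test_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split "test::classname" into (test, classname); a bare name has no class."""
    test_file, sep, class_name = spec.partition("::")
    return test_file, (class_name if sep else None)


def build_pytest_args(specs: List[str]) -> Dict[str, List[str]]:
    """Map each test file to the extra pytest arguments needed to select it."""
    selected: Dict[str, Optional[List[str]]] = {}
    for spec in specs:
        test_file, class_name = parse_test_spec(spec)
        if class_name is None:
            # A bare file name means "run the whole file" and overrides classes.
            selected[test_file] = None
        elif test_file not in selected:
            selected[test_file] = [class_name]
        elif selected[test_file] is not None:
            selected[test_file].append(class_name)
    return {
        test_file: [] if classes is None else ["-k", " or ".join(classes)]
        for test_file, classes in selected.items()
    }
```

Plain file names produce no extra arguments, which is one way to keep existing file-level heuristics working unchanged.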

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161
Approved by: https://github.com/huydhn
2023-10-31 18:11:05 +00:00
Catherine Lee
3b5b7ebd09 [ci] Save various json files from test infra into folder (#111516)
We pull a lot of files from https://github.com/pytorch/test-infra/blob/generated-stats/stats and name them separately when we add them to the artifacts in the build, so stick them in a folder and just add that instead.

Slow test and disabled test jsons remain as they were since they are pulled during the test step and do not need to be included in the artifacts during build since they are not used for sharding.

Sanity checked that test times could be found for linux, mac, windows, and rocm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111516
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-10-23 20:38:25 +00:00
Nikita Shulga
e9a51a6a07 [BE] Revive test_typing (#111428)
`test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time.

In this PR, the same functionality is rewritten using the unittest framework and `parametrize` from `torch.testing._internal.common_utils`.

Validate `test_typing.py` with ufmt

Disable `fail/bitwise_ops.py` and `pass/jit.py` as they regressed at some point, as well as one of the examples in `namedtuple.py`, as the `torch.linalg.qr` type is no longer revealed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428
Approved by: https://github.com/clee2000
2023-10-18 02:19:49 +00:00
Jack Taylor
6b92c367c5 Add test_jit_cuda_fuser to ROCM_BLOCKLIST (#110440)
Adds the nvfuser related unit test suite to ROCM_BLOCKLIST as it should not be run on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110440
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/lezcano
2023-10-06 08:47:15 +00:00
Catherine Lee
8a09fe4a05 [ez] Remove print in heuristics aggregation (#110621)
Move print to the beginning instead because putting it at the end makes it so you have to scroll through when debugging, and nothing in that function indicates that it should be printing anything

Also move the line for printing disabled issues out of the for loop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
2023-10-06 02:04:53 +00:00
Catherine Lee
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest.  Zip the log into an artifact.  The line listing all the test names is really long, but if you view source of the raw logs, it will not wrap so it will only be one line.  The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs.  Do not zip.

Also
* change log artifact name
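
The success/failure split described above could look roughly like this. It is only a sketch: `report_test_log` and the `summary_prefix` marker used to spot the "what tests ran" lines are hypothetical, and the real change may decide what to print differently.

```python
import sys
import zipfile
from pathlib import Path
from typing import Optional, TextIO


def report_test_log(
    log_path: Path,
    failed: bool,
    out: Optional[TextIO] = None,
    summary_prefix: str = "Running ",  # hypothetical marker for summary lines
) -> Optional[Path]:
    """Print the whole log on failure; on success print the summary and zip the log."""
    out = sys.stderr if out is None else out
    text = log_path.read_text()

    if failed:
        # Failures and reruns: print everything and leave the log unzipped.
        out.write(text)
        return None

    # Successes: print only the lines that say which tests ran.
    for line in text.splitlines():
        if line.startswith(summary_prefix):
            print(line, file=out)

    # Keep the full log available as a zipped artifact next to the original.
    zip_path = log_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(log_path, arcname=log_path.name)
    return zip_path
```

An "always print" option would be a small extension: treat the log as failed whenever that flag is set.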

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
Catherine Lee
f69e9c8c91 run_tests.py minor logging changes (#110188)
Minor logging changes that just kind of annoyed me:
* prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving import within the function (idk if this is ok)
* prevent constant printing of `Ignoring disabled issues:  ['']` (no idea why it was not gated behind a function or main)
* change all prints in run_tests.py to be through stderr so there's no weird interleaving (although if everything goes through stderr, might as well just print everything through stdout...)
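
A small sketch of the two patterns above: deferring an import into the function that needs it, and routing every print through one stream. The names are hypothetical, and `platform` is only a stand-in for a module that prints a warning at import time.

```python
import sys
from typing import Optional, TextIO


def log(message: str, stream: Optional[TextIO] = None) -> None:
    """Write one message to stderr (looked up at call time) or to the given stream."""
    print(message, file=sys.stderr if stream is None else stream)


def describe_interpreter() -> str:
    # Deferred import: the module is loaded only when this function is called,
    # so merely importing this file does not trigger any import-time output.
    import platform

    return platform.python_implementation()


def main(stream: Optional[TextIO] = None) -> int:
    # All prints live inside main(), so nothing is printed at import time.
    log(f"Interpreter: {describe_interpreter()}", stream)
    return 0
```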
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2023-10-03 01:22:47 +00:00
Zain Rizvi
1277d0e834 [BE] Add sharding data by default to metrics (#110035)
Extend metric library to allow setting global metrics on a process level which will always be emitted.

Current use case for them is to include shard information every time a metric is emitted by run_test.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
2023-09-26 17:06:49 +00:00