New function for continue on error
Another solution might be to run the entire suite to the end and use last-failed, but I'm worried about concurrent processes writing to the same last-failed cache entry. It's also a bit different from the usual test-rerunning strategy we use, especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (e.g. a CUDA error that causes sync to fail).
Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d
TODO: continue-on-error for `--subprocess` and test_distributed is not yet fully working
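For context, a minimal sketch of the continue-on-error shape being described, with hypothetical helper names (this is not the actual run_test.py code):

```python
import subprocess
import sys

# Hypothetical sketch: keep going after a test module fails, but start a
# fresh process when the failure looks like a hard crash (e.g. a segfault
# or a CUDA error that breaks synchronization).
HARD_CRASH_RETURN_CODES = {-11}  # SIGSEGV surfaces as a negative return code

def run_test_module(test_module: str) -> int:
    """Run one test module in its own process and return its exit code."""
    return subprocess.run([sys.executable, f"test/{test_module}.py"]).returncode

def run_all(test_modules):
    failures = []
    for test_module in test_modules:
        returncode = run_test_module(test_module)
        if returncode == 0:
            continue
        failures.append(test_module)
        if returncode in HARD_CRASH_RETURN_CODES:
            # A hard crash invalidates any in-process state (e.g. a
            # last-failed cache), so rerun this module in a brand new process.
            run_test_module(test_module)
    return failures
```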
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098
Approved by: https://github.com/huydhn
Fixes #113940. This vendors the relevant parts of `packaging==23.2.0` to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`.
I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108
Approved by: https://github.com/malfet, https://github.com/albanD
Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789
Approved by: https://github.com/clee2000
Allow heuristics to actually downgrade the relevance of a test. Note that NONE/UNLIKELY tests will still get executed, but they will be run at the end of CI.
The Relevance chosen affects the outcome when heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance:
HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED
Given that right now we assume ordering based on the list in init (since the lists are appended in that order), do a similar thing for UNLIKELY and NONE.
e.g. `HEURISTICS = [a, b, c, d]`
Currently everything in b.high is added after a.high.
If b.none includes things in a.high, a.high trumps.
If b.none includes things in a.probable, then b.none trumps since NONE is stronger than PROBABLE.
If b.unlikely includes things from a.high/probable, a.high/probable trumps since both are at a higher strength than UNLIKELY.
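In other words, conflicts resolve by relevance strength, with earlier heuristics winning ties. A rough sketch under that assumption (the enum and helper below are illustrative, not the actual framework code):

```python
from enum import IntEnum
from typing import Optional

# Illustrative strength table matching the ordering above
# (higher value = higher confidence in the declared relevance).
class Relevance(IntEnum):
    UNRANKED = 0
    UNLIKELY = 1
    PROBABLE = 2
    NONE = 3
    HIGH = 4

def resolve(current: Optional[Relevance], proposed: Relevance) -> Relevance:
    """Keep whichever relevance was declared with higher confidence.

    Ties go to the earlier heuristic, since its relevance was recorded first.
    """
    if current is None or proposed > current:
        return proposed
    return current

# b.none beats a.probable because NONE is stronger than PROBABLE ...
assert resolve(Relevance.PROBABLE, Relevance.NONE) is Relevance.NONE
# ... but a.high beats b.none, and a.probable beats b.unlikely.
assert resolve(Relevance.HIGH, Relevance.NONE) is Relevance.HIGH
assert resolve(Relevance.PROBABLE, Relevance.UNLIKELY) is Relevance.PROBABLE
```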
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671
Approved by: https://github.com/clee2000
Changes the heuristic framework to support prioritizing individual classes within a test file.
Components of this included:
- Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file it would pass in the test's name; now, to prioritize a class within a test file, it uses the notation "test::classname"
- Changes are fully backwards compatible with existing heuristics
- Test sharding now supports sharding individual tests (for when they're prioritized)
- When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest (see the sketch below)
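A hedged sketch of that last point; the helper name and exact flag handling are assumptions, not the real sharding code:

```python
# Translate a prioritized "test_file::TestClass" entry into pytest arguments.
def build_pytest_args(spec: str) -> list:
    if "::" in spec:
        test_file, test_class = spec.split("::", 1)
        return [f"{test_file}.py", "-k", test_class]
    return [f"{spec}.py"]

assert build_pytest_args("test_ops::TestCommon") == ["test_ops.py", "-k", "TestCommon"]
assert build_pytest_args("test_ops") == ["test_ops.py"]
```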
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161
Approved by: https://github.com/huydhn
`test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time.
In this PR, the same functionality is rewritten using the unittest framework and `parametrize` from `torch.testing._internal.common_utils`.
Reformat `test_typing.py` with `ufmt`.
Disable `fail/bitwise_ops.py` and `pass/jit.py`, which regressed at some point, as well as one of the examples in `namedtuple.py`, since the `torch.linalg.qr` type is no longer revealed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428
Approved by: https://github.com/clee2000
Move the print to the beginning instead, because putting it at the end makes it so you have to scroll through it when debugging, and nothing in that function indicates that it should be printing anything.
Also move the line for printing disabled issues out of the for loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
To reduce the amount of logs:
* for successes, only print the part that says which tests ran and don't print the rest. Zip the log into an artifact. The line listing all the test names is really long, but if you view the source of the raw logs it will not wrap, so it will only be one line. The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [ 9%]`
* for failures/reruns, print logs. Do not zip.
Also
* change log artifact name
Examples of various logs:
a074db0f7f failures
1b439e24c4 failures
Possibly controversial, haha.
Should I include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
Minor logging changes that just kind of annoyed me:
* prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving the import inside the function (not sure if this is OK)
* prevent constant printing of `Ignoring disabled issues: ['']` (no idea why it was not gated behind a function or main)
* change all prints in run_test.py to go through stderr so there's no weird interleaving (although if everything goes through stderr, we might as well just print everything through stdout...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
Extend the metric library to allow setting global metrics at the process level, which will always be emitted.
The current use case for them is to include shard information every time a metric is emitted by run_test.py.
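A minimal sketch of that pattern, assuming hypothetical helper names (the real implementation lives in the metrics library):

```python
# Illustrative process-level "global" metrics; names are assumptions,
# not the actual upload_stats_lib API.
_global_metrics = {}

def add_global_metric(name, value):
    """Register a key/value pair that is merged into every emitted metric."""
    _global_metrics[name] = value

def emit_metric(metric_name, metrics):
    payload = {"metric_name": metric_name, **_global_metrics, **metrics}
    # ... validate and upload the payload to the metrics backend ...
    return payload

# run_test.py would register its shard info once at startup, so every
# subsequent metric carries it automatically:
add_global_metric("shard", 1)
add_global_metric("num_shards", 4)
assert emit_metric("td_example", {"value": 3})["num_shards"] == 4
```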
### <samp>🤖 Generated by Copilot at 0cae92c</samp>
> _`run_test` refactored_
> _Sharding metrics in Rockset_
> _Autumn of testing_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
When a test fails, we will now emit fine grained details about how accurately heuristics predicted the relevance of that test.
## Context
Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not a test fails on it. Green tests don't tell us if the success was due to the code being good vs being irrelevant. This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.
## What's measured?
The metrics this PR collects are designed to answer the following questions
### How comprehensive are the heuristics?
- What's the false negative rate, the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per heuristic level)
### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % was prioritized overall?
- How relevant was a failed test considered to be? (Both at an aggregate level and at a per-heuristic level)
- What % of time was a given heuristic prioritizing a failing test higher than any other heuristic?
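As an example of the first question, the aggregate false-negative rate boils down to something like this (illustrative only, not the emitted metric's exact definition):

```python
def false_negative_rate(failed_tests, prioritized_tests):
    """Fraction of failed tests that no heuristic prioritized."""
    failed = set(failed_tests)
    if not failed:
        return 0.0
    missed = failed - set(prioritized_tests)
    return len(missed) / len(failed)

# One of the two failures was never prioritized -> 50% false negatives.
assert false_negative_rate(["test_a", "test_b"], ["test_b", "test_c"]) == 0.5
```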
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117
This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are:
- Some metric emissions were changed to comply with the new TD format
- Some logging changes
- We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general)
Refactorings done:
- Moves all test reordering code to the new TD framework
- Refactors run_test.py to cleanly support multiple levels of test priorities
- Deletes some dead code that was originally written for logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071
Approved by: https://github.com/clee2000, https://github.com/huydhn
Make it so that scripts can import and run the `emit_metrics` function even if they don't have boto3 installed, in which case it will still validate the inputs but skip the actual metric emission part.
It's purely a refactor without any real logic changes
Motivation: So that run_test.py and the target determination code can use this library easily without worrying about whether it was imported or whether its dependencies are installed.
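Roughly, the import pattern this enables looks like the following (a simplified sketch; the table name and validation details are assumptions):

```python
try:
    import boto3  # type: ignore[import]

    HAS_BOTO3 = True
except ImportError:
    HAS_BOTO3 = False

def emit_metrics(metric_name, metrics):
    # Inputs are always validated, even when the upload is skipped.
    if not isinstance(metric_name, str) or not isinstance(metrics, dict):
        raise ValueError("metric_name must be a str and metrics must be a dict")
    if not HAS_BOTO3:
        return  # boto3 is not installed: skip the actual emission
    # Table name is illustrative.
    boto3.resource("dynamodb").Table("torchci-metrics").put_item(Item=metrics)
```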
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107070
Approved by: https://github.com/huydhn
Update distutils.Version to packaging.version due to the deprecation warning.
```python
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17138: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17140: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
```
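For reference, the replacement is a mechanical swap along these lines (a sketch using the standalone `packaging` distribution):

```python
# Compare parsed versions with packaging.version instead of the
# deprecated distutils LooseVersion.
from packaging import version

def scipy_older_than_1_4(scipy_version: str) -> bool:
    # Replaces: LooseVersion(scipy.__version__) < "1.4.0"
    return version.parse(scipy_version) < version.parse("1.4.0")

assert scipy_older_than_1_4("1.3.3")
assert not scipy_older_than_1_4("1.11.0")
```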
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107207
Approved by: https://github.com/soulitzer
Companion with https://github.com/pytorch/test-infra/pull/4424
Uses the file rating generated by the test-infra PR to reorder tests. For each test file, sum the file ratings from the changed files in the PR, and put the tests in order of that sum.
A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.
Sharding is done twice: once on the prioritized tests and once on the general/non-prioritized tests. Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.
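Conceptually the reordering looks like this (a sketch; the ratings-file format and function name are assumptions, not the exact test_selections.py code):

```python
from collections import defaultdict

def reorder_tests(ratings, changed_files, all_tests):
    """ratings: {changed_file: {test_file: score}} from the test-infra job."""
    totals = defaultdict(float)
    for changed_file in changed_files:
        for test_file, score in ratings.get(changed_file, {}).items():
            totals[test_file] += score
    # Anything with a rating > 0 counts as prioritized and is ordered by sum;
    # everything else stays "general" and keeps its original order.
    prioritized = sorted(
        (t for t in all_tests if totals[t] > 0),
        key=lambda t: totals[t],
        reverse=True,
    )
    general = [t for t in all_tests if totals[t] == 0]
    return prioritized, general
```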
I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
Fixes https://github.com/pytorch/pytorch/issues/106754
This PR:
- moves test/autograd/test_fallback.py to test_autograd_fallback.py and
removes it from test_autograd.py (necessary for the next step)
- adds test_autograd_fallback.py to parallel test blocklist.
- lintrunner really wanted to make changes to the files, but other than
that, it is a move.
The problem is that we set a global option (the autograd fallback mode)
during these tests which may cause the tests to interfere with each
other.
Test Plan:
- python test/run_test.py -i test_autograd_fallback
NOTE to diff train oncall:
- You'll also need to modify the test/autograd/test_fallback.py TARGET in
caffe2/test/TARGETS since we renamed the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106866
Approved by: https://github.com/soulitzer
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.
The main changes are:
- Add `--no-translation-validation` as an option in _test/run_test.py_
- Set the `PYTORCH_TEST_WITH_TV` environment variable
- Add a `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_ (see the sketch after this list)
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts
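A hedged sketch of how the environment variable and the `TEST_WITH_TV` flag fit together; the exact value check is an assumption:

```python
import os

# Simplified sketch of the usual TEST_WITH_* pattern; the real definition
# lives in torch/testing/_internal/common_utils.py. run_test.py presumably
# sets the environment variable for test processes unless
# --no-translation-validation is passed.
TEST_WITH_TV = os.getenv("PYTORCH_TEST_WITH_TV") == "1"

def translation_validation_enabled() -> bool:
    return TEST_WITH_TV
```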
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
Prevents the following cryptic error if one attempts to use `run_test.py` on a system that also has torchaudio installed in dev mode (as `tools` from https://github.com/pytorch/audio might take precedence, but this is not how the script should behave):
```
Unable to import test_selections from tools/testing. Running without test selection stats.... Reason: No module named 'tools.stats'
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1673, in <module>
main()
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1604, in main
selected_tests = get_selected_tests(options)
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1418, in get_selected_tests
path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
NameError: name 'TEST_TIMES_FILE' is not defined
```
But make sure to remove it at the end, otherwise it will not work if torch is installed from a wheel but the tests are run from a clean repo checkout.
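Conceptually the fix looks something like this (a sketch; the guarded import and error message are approximations of what run_test.py does):

```python
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent  # test/run_test.py -> repo root

# Make sure `tools` resolves to pytorch's own tools/ package, not e.g.
# torchaudio's, by putting the repo root first on sys.path for the import.
sys.path.insert(0, str(REPO_ROOT))
try:
    from tools.testing import test_selections  # noqa: F401
except ImportError as err:
    print(f"Running without test selection stats. Reason: {err}", file=sys.stderr)
    test_selections = None
finally:
    # Remove the path again; leaving it in place breaks the case where torch
    # is installed from a wheel but tests run from a clean repo checkout.
    sys.path.remove(str(REPO_ROOT))
```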
### <samp>🤖 Generated by Copilot at dd52521</samp>
> _Sing, O Muse, of the cunning code review_
> _That fixed the tests of the `tools` module_
> _By adding and removing the root path_
> _As a shepherd guides his flock to and fro._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104214
Approved by: https://github.com/kit1980
There is a `HAVE_TEST_SELECTION_TOOLS` conditional, but it turns out it does not really work, so fix it by defining all the missing prototypes and making it work as a single-shard instance.
Add a lint rule to test that it would succeed when running only test_cuda with a released version of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104111
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
Because we always run tests with pytest now.
Marking it as `bc-breaking` as there could technically be some scripts depending on it somewhere...
### <samp>🤖 Generated by Copilot at 1760568</samp>
> _`pytest` option gone_
> _simpler test runner script_
> _autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104125
Approved by: https://github.com/seemethere
Added a feature to upload test statistics to DynamoDB and Rockset using a new function `emit_metric` in `tools/stats/upload_stats_lib.py`.
Added metrics to measure test reordering effectiveness in `tools/testing/test_selections.py`.
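A hedged example of what such an emission might look like from `tools/testing/test_selections.py`; the metric name, fields, and exact `emit_metric` signature here are assumptions:

```python
from tools.stats.upload_stats_lib import emit_metric

# Illustrative call; the actual metric names and fields may differ.
emit_metric(
    "test_reordering_prioritized_tests",
    {
        "number_of_prioritized_tests": 12,
        "number_of_total_tests": 300,
    },
)
```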
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102691
Approved by: https://github.com/malfet