Commit Graph

786 Commits

Author SHA1 Message Date
Yidi Wu
7eda06b366 skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of tests in test_aotdispatch.py are not meaningful (from a user's perspective) when run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
ghstack dependencies: #141610
2024-12-10 17:33:57 +00:00
Mark Saroufim
e24190709f [BE] Remove Model Dump utility (#141540)
So I found this utility by accident while trying to find how many html files we have in the repo so I could convert them to markdown.

It turns out we package some html and js files in pytorch to visualize torchscript models. This seems kind of strange and probably shouldn't be in core, so I removed the tests I could find. Maybe some internal tests will break, but considering torchscript is being superseded, it might make sense to do this.

The last meaningful update to the test for this file was about 2 years ago by @digantdesai; since then it's been a bunch of routine upgrades.

It seems like this package is unused: https://github.com/search?type=code&auto_enroll=true&q=torch.utils.model_dump&p=1 I skimmed through 5 pages of these, and the only time it shows up in code search is when someone is either cloning pytorch or checking their venv into github.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141540
Approved by: https://github.com/malfet
2024-11-27 22:52:55 +00:00
Aleksei Nikiforov
a82bab6419 Run only listed tests on s390x (#140265)
Skip tests that are failing

This was previously part of https://github.com/pytorch/pytorch/pull/125401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140265
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-20 22:53:09 +00:00
Catherine Lee
0db21a6b23 Remove most rockset references (#139922)
Remove most references to rockset:
* replace comments and docs with a generic "backend database"
* Delete `upload_to_rockset`, so we no longer need to install the package.
* Also stop uploading perf stats to rockset (we should be completely on DynamoDB now, right @huydhn?)

According to VSCode, it went from 41 -> 7 instances of "rockset" in the repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139922
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-11-12 21:17:43 +00:00
Catherine Lee
cc93c1e5e4 Upload artifacts during test run (#125799)
Zip and upload artifacts while run_test is running
Upgrade boto3 because I get errors about not having `botocore.vendored.six.move` if I don't
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125799
Approved by: https://github.com/huydhn
2024-10-22 16:48:57 +00:00
Will Feng
e4ad02892f Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-20 23:48:54 +00:00
PyTorch MergeBot
24ee4af86b Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit 2b7c7a20b9.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))
2024-10-16 20:05:38 +00:00
Catherine Lee
f173623bb2 [td] try catch exception, do not run td if not results (#138087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138087
Approved by: https://github.com/wdvr
2024-10-16 18:04:25 +00:00
Ke Wen
2b7c7a20b9 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 16:42:57 +00:00
PyTorch MergeBot
78632b97b1 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit f43c4d28b8.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))
2024-10-16 07:26:34 +00:00
Ke Wen
f43c4d28b8 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 05:03:08 +00:00
Ke Wen
56cc22eb01 [CI][Distributed] Not to test distributed_test.py with UCC (#137932)
Some UCC tests became unstable recently, with or without the M60 to T4 upgrade.
See for example: #137855 (without upgrade), #137161 (with upgrade).
So I am extracting the disablement from #137161 here.

Failure signature:
```
RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:496] [Rank 0][ProcessGroupUCC-0][READY]failed to post triggered collective, error code -6: Unhandled error, system error code 0
```

Earlier discussed here:
https://github.com/pytorch/pytorch/pull/137161/files#r1797353294

Cc: @Aidyn-A @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137932
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/eqy
2024-10-15 07:22:57 +00:00
Jagadish Krishnamoorthy
674d59359d [ROCm] Enable dist sharded_tensor test suites (#137724)
The following test suites are enabled on ROCm:
test_sharded_tensor
test_sharded_tensor_reshard
test_sharding_plan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137724
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-10-14 20:20:57 +00:00
eellison
47af7cc962 Add compiler bisector (#131936)
This is a utility to aid torch.compile debugging. You provide a function that returns True on success and False on failure, or you do something out of process and run `bisect_helper good | bad`.

The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example of how to hook it up for the aot_eager_decomp_partition backend and the decomposition subsystem:

```
    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented
```

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND`, `TORCH_BISECT_SUBSYSTEM`, and `TORCH_BISECT_MAX`.

We could add further options as an automated way of going through a checklist for checking divergence - e.g., the mode to emulate amp casts.
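As a hedged illustration of those limits, here is a hypothetical way to pin a previously reported bisection result when rerunning a repro; the specific values are made up for this example.

```python
import os

# Hypothetical: pin the bisector to limits it reported in an earlier run.
# The backend/subsystem names come from the list above; "12" is made up.
os.environ["TORCH_BISECT_BACKEND"] = "aot_eager_decomp_partition"
os.environ["TORCH_BISECT_SUBSYSTEM"] = "decomposition"
os.environ["TORCH_BISECT_MAX"] = "12"

# ...then import torch and run the original repro with these limits applied.
```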

Fix for https://github.com/pytorch/pytorch/issues/126546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
2024-10-09 20:34:11 +00:00
Siddharth Kotapati
e27c0048db Enable additional tests for MPS CI runs (#134356)
As part of the follow-up for https://github.com/pytorch/pytorch/issues/133520, this adapts existing unused tests for use in MPS CI runs, focusing on NHWC and other memory-format tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn
2024-10-04 21:52:38 +00:00
Sergii Dymchenko
a619ced5ed Revert "Update run_test.py"
This reverts commit 193073b491.
2024-09-26 17:34:52 -07:00
Sergii Dymchenko
193073b491 Update run_test.py
2024-09-26 16:56:29 -07:00
Xinya Zhang
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. Users should enable Navi31 support with the env var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Known problem:
1. `test/test_transformers.py` will show massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` fixes the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support for nested tensors + SDPA, but it needs more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
Bo Li
16b8146c9e Exclude test_transformers and unit tests which require recent GPU arch (#132895)
This PR temporarily excludes test_transformers on ROCm and skips some unit tests which require a recent GPU arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-08-27 20:40:53 +00:00
Roy Hvaara
1565940114 [MPS] Add test/test_nn.py to test suite (#134184)
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.

Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs.

```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```

Ref #133520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-26 23:48:23 +00:00
Aidyn-A
28a4db84f2 [ARM] Fix infinite recursion in unwind (#134387)
Fixes #119905

The `TORCH_SHOW_CPP_STACKTRACES=1` setting on ARM causes infinite recursive unwind because on failure a `StackTraceFetcher` attempts to unwind the <ins>failed instruction</ins>: 5ad759ca33/torch/csrc/profiler/combined_traceback.cpp (L25)
then the unwind itself fails:
5ad759ca33/torch/csrc/profiler/unwind/unwind.cpp (L10-L12)
and it causes another attempt to unwind the failure in `unwind()`...

In summary, the executed instruction is equivalent to:
```C++
std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}
```
This PR replaces `TORCH_CHECK` with `TORCH_WARN_ONCE`, as the latter does not cause an uncontrolled recursion. The only side effect would be an empty backtrace.

Huge thanks to @nWEIdia who found the root cause!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134387
Approved by: https://github.com/eqy, https://github.com/nWEIdia, https://github.com/malfet
2024-08-26 21:02:31 +00:00
Edward Z. Yang
99cf567714 Make SCRIBE_GRAPHQL_ACCESS_TOKEN available to test jobs running on main (#133536)
It is possible to write to Meta's internal in-memory database Scuba via the Scribe Graph API: https://www.internalfb.com/intern/wiki/Scribe/users/Knowledge_Base/Interacting_with_Scribe_categories/Graph_API/ This is currently being used by pytorch/benchmark repo to upload torchbench performance results.

I want to make this API generally available to all jobs running on CI in a semi-trusted context. To talk to Scribe, you need a secret access token. I have initially configured an environment prod-branch-main which contains `SCRIBE_GRAPHQL_ACCESS_TOKEN`, and switched a single class of jobs (linux-test) to use this environment when they are running on the main branch. Because we require approvals for running CI on untrusted contributions, we could potentially allow all jobs to run in this environment, including jobs on PRs, but I don't need this for my use case (per-PR benchmark result reporting, and miscellaneous statistics on main.)

If this works, I'll push out this environment to the rest of our test jobs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133536
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/albanD
2024-08-15 19:53:17 +00:00
hippocookie
a6ad834fa8 Fix counting execution time in run_test.py (#133199)
Counting `elapsed_time` immediately after `start_time` does not reflect the real execution time of the `test_batch`.

Move the `elapsed_time` calculation and the print statement after the `run_tests` call to fix it.
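A minimal sketch of the bug and the fix, with illustrative names rather than the actual run_test.py code:

```python
import time

def run_batch(test_batch_name, run_tests):
    start_time = time.perf_counter()
    # Bug: computing elapsed_time here, right after start_time, measures ~0s
    # because run_tests has not executed yet.

    run_tests()

    # Fix: compute and print the elapsed time only after run_tests returns.
    elapsed_time = time.perf_counter() - start_time
    print(f"Running test batch '{test_batch_name}' cost {elapsed_time:.2f} seconds")
```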

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133199
Approved by: https://github.com/clee2000
2024-08-15 15:29:44 +00:00
chuanqiw
72f2b29bb0 [CI] disable xpu kineto build (#133069)
The XPU Kineto support PR https://github.com/pytorch/pytorch/pull/130811 has landed, but the XPU CI infra is not ready yet. Disable the Kineto build as a temporary workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133069
Approved by: https://github.com/seemethere
2024-08-09 23:58:50 +00:00
Xuehai Pan
4226ed1585 [BE] Format uncategorized Python files with ruff format (#132576)
Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #132574
2024-08-04 17:13:31 +00:00
Xuehai Pan
5cc34f61d1 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is
present in the PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
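A rough sketch of how the environment variable could be consumed downstream; the variable name is from the message above, and the surrounding code is illustrative:

```python
import os

def extra_verbosity_args(use_pytest: bool) -> list:
    # Set by the test config filter when the ci-test-showlocals label is present.
    if os.environ.get("TEST_SHOWLOCALS", "0") not in ("", "0"):
        return ["--showlocals", "--tb=long"] if use_pytest else ["--locals"]
    return []
```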

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
2024-07-29 18:53:14 +00:00
Xuehai Pan
4694ee1ad2 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-29 18:53:14 +00:00
PyTorch MergeBot
c35f21e5fc Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 14158d892a.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](14158d892a) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))
2024-07-29 13:19:38 +00:00
PyTorch MergeBot
06fe99a097 Revert "[CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)"
This reverts commit dfa18bf3f3.

Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))
2024-07-29 13:09:41 +00:00
Xuehai Pan
dfa18bf3f3 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is
present in the PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
2024-07-29 07:40:42 +00:00
Xuehai Pan
14158d892a [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-27 19:39:40 +00:00
PyTorch MergeBot
0f9bf208ec Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 054d214c50.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))
2024-07-26 19:03:10 +00:00
Xuehai Pan
054d214c50 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-25 10:10:58 +00:00
Xuehai Pan
ba48cf6535 [BE][Easy][6/19] enforce style for empty lines in import segments in test/ (#129757)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757
Approved by: https://github.com/ezyang
2024-07-17 06:42:37 +00:00
Xuehai Pan
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
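For reference, SIM103 flags returning a bool through an explicit if/else; a minimal before/after pair (illustrative, not taken from the diff):

```python
# Before (flagged by SIM103, needless-bool):
def is_even_verbose(n: int) -> bool:
    if n % 2 == 0:
        return True
    else:
        return False

# After:
def is_even(n: int) -> bool:
    return n % 2 == 0
```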
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00
Yuanhao Ji
312652c325 [RFC] Add support for device extension autoloading (#127074)
Fixes #122468

- Load device extensions at the end of `torch/__init__.py`
- Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0`
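
A minimal sketch of what such an autoload hook could look like; the entry-point group name `torch.backends` and the discovery API used here are assumptions for illustration, not necessarily what this PR implements:

```python
import os
from importlib.metadata import entry_points  # requires Python >= 3.10 for group=

def _autoload_device_extensions() -> None:
    # Honor the opt-out described above.
    if os.getenv("TORCH_DEVICE_BACKEND_AUTOLOAD", "1") == "0":
        return
    # Assumed entry-point group; each extension exposes an init callable.
    for ep in entry_points(group="torch.backends"):
        ep.load()()
```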

run test:

```bash
python test/run_test.py -i test_autoload_enable
python test/run_test.py -i test_autoload_disable
```

doc:

https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html

co-author:  @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding

Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074
Approved by: https://github.com/albanD, https://github.com/jgong5
2024-07-09 06:14:13 +00:00
Catherine Lee
91a8376d47 run_test: Unset cpp stacktraces after reruns (#129004)
Rerun the failing test singly with the env var set.  If it succeeds, start a new process without the cpp stack traces env var

We don't want to waste time generating these if we don't have to

They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these

Adds a new --rs (run single) option to be used the same way --scs and --sc are. It will only run the single test in the stepcurrent file.
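
A rough sketch of the rerun flow described above; the file and test names are just examples, and the real logic lives in run_test.py:

```python
import os
import subprocess
import sys

def run_single(test_file: str, test_name: str, cpp_stacktraces: bool) -> int:
    env = dict(os.environ)
    env["TORCH_SHOW_CPP_STACKTRACES"] = "1" if cpp_stacktraces else "0"
    cmd = [sys.executable, test_file, "-k", test_name]
    return subprocess.run(cmd, env=env).returncode

# Rerun the failing test alone with C++ stack traces enabled; if it now passes,
# continue in a fresh process with the env var disabled again.
if run_single("test/test_autograd.py", "test_checkpoint_valid", cpp_stacktraces=True) == 0:
    run_single("test/test_autograd.py", "test_checkpointing_without_reentrant_early_free", cpp_stacktraces=False)
```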

https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs

In the above:
* test_checkpoint_valid failed, then passed in another subprocess.  The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free)
* test_format_traceback_short failed consistently, but it continued to run because keep-going was set

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004
Approved by: https://github.com/PaliC
2024-07-03 01:50:15 +00:00
Xuehai Pan
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
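The change is of this minimal shape (illustrative):

```python
# Before:
import pathlib
path = pathlib.Path("test") / "run_test.py"

# After:
from pathlib import Path
path = Path("test") / "run_test.py"
```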
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
PyTorch MergeBot
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef1.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
Xuehai Pan
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
Catherine Lee
8892ddaacc [TD] Test removal on sm86 (#127131)
Yolo

I'm excited to break CI :')
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127131
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-06-07 20:19:18 +00:00
Howard Huang
baaa914bf7 [small] test clean up (#128079)
remove unnecessary line: https://github.com/pytorch/pytorch/issues/123733
add `main` so the test can be run with `python ...`: https://github.com/pytorch/pytorch/issues/124906
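
For the second item, the usual pattern is the standard PyTorch test-file main guard; a sketch assuming the common_utils helpers that most test files already use:

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class ExampleTest(TestCase):
    def test_trivial(self):
        self.assertTrue(True)

if __name__ == "__main__":
    # Lets the file be run directly with `python test/test_example.py`,
    # not only through run_test.py.
    run_tests()
```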

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128079
Approved by: https://github.com/awgu
2024-06-06 21:21:40 +00:00
chuanqiw
627d2cd87d [CI] disable td for xpu ci test by default (#127611)
Because TD has been enabled by default for the XPU CI test, a lot of test cases (75%) have been skipped in CI tests. This allowed some failures to escape the CI tests, for example issue #127539. This PR depends on PR #127595 landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127611
Approved by: https://github.com/etaf, https://github.com/atalman
2024-06-04 17:15:10 +00:00
Catherine Lee
a31a60d85b Change run_test.py arg parsing to handle additional args better (#126709)
Do not inherit parser from common_utils
* I don't think we use any variables in run_test that depend on those, and I think all tests except doctests run in a subprocess so they will parse the args in common_utils and set the variables.  I don't think doctests wants any of those variables?

Parse known args, collect the unrecognized args as extras, and pass them along to the subprocess
Removes the first instance of `--`

I think I will miss run_test telling me if an arg is valid or not
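
A minimal sketch of the parse-known-args approach (illustrative, not the actual run_test.py code):

```python
import argparse
import sys

parser = argparse.ArgumentParser(description="run_test-style launcher")
parser.add_argument("--include", nargs="+", default=[])
# Known args are consumed here; anything unrecognized ends up in `extra`.
args, extra = parser.parse_known_args()

# Remove the first instance of "--", then forward the rest to the test subprocess.
if "--" in extra:
    extra.remove("--")
cmd = [sys.executable, "test/test_example.py", *extra]
print(cmd)
```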

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126709
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/Flamefire
2024-05-23 21:08:12 +00:00
Catherine Lee
ac2c547838 [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest, and pytest crashes when a test segfaults, so in both cases the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-21 16:29:31 +00:00
PyTorch MergeBot
8bca0847c2 Revert "[TD] Upload names of failures to s3 for pytest cache (#126315)"
This reverts commit 655038687a.

Reverted https://github.com/pytorch/pytorch/pull/126315 on behalf of https://github.com/clee2000 due to broke inductor ([comment](https://github.com/pytorch/pytorch/pull/126315#issuecomment-2121133045))
2024-05-20 20:15:08 +00:00
Catherine Lee
655038687a [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest, and pytest crashes when a test segfaults, so in both cases the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-20 17:36:30 +00:00
drisspg
762ce6f062 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwd kernel.
- The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didn't correctly register the decomp table + IndexMode when I landed https://github.com/pytorch/pytorch/pull/123902; this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specific to "future causal" attention.
- This PR works, but the main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-17 00:41:55 +00:00
Jithun Nair
14d8e3aec0 Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (#126336)
Fixes #125504
Fixes #126252
Fixes #126296
Fixes #126330

This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126336
Approved by: https://github.com/huydhn, https://github.com/pruthvistony
2024-05-16 16:38:09 +00:00
Catherine Lee
48f98bcdfc [TD] Enable test removal on most default configs + distributed CUDA for everyone (#125931)
yolo

Add the longest jobs in pull:
* default cpu configs
* non sm86 cuda
* distributed cuda for everyone

Still excluding
* slow, inductor, rocm, onnx, mac, dynamo
* distributed cpu
* windows cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125931
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-14 17:35:12 +00:00
Catherine Lee
6f619cc727 [ez] functorch/test_vmap and test_dataloader to run in parallel (#125597)
Also mark test_svd serial in linalg to see if it helps with the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125597
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-08 15:37:29 +00:00
Huy Do
0e57bbb6d7 Set timeout for C++ tests (#125517)
Looking at the unrelated Windows timeout failure on https://github.com/pytorch/pytorch/pull/125199, it looks like we don't have a timeout value set for C++ tests atm.  In this case, a C++ test on Windows timed out after 2+ hours.

```
2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021]
2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137]
2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2:
2024-05-03T02:45:33.7927201Z Traceback (most recent call last):
2024-05-03T02:45:33.7928032Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
2024-05-03T02:45:33.7928722Z     self.run()
2024-05-03T02:45:33.7929722Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run
2024-05-03T02:45:33.7931639Z     self._target(*self._args, **self._kwargs)
2024-05-03T02:45:33.7932435Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker
2024-05-03T02:45:33.7933338Z     task = get()
2024-05-03T02:45:33.7933946Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get
2024-05-03T02:45:33.7935219Z     res = self._reader.recv_bytes()
2024-05-03T02:45:33.7935897Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes
2024-05-03T02:45:33.7936609Z     buf = self._recv_bytes(maxlength)
2024-05-03T02:45:33.7937302Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes
2024-05-03T02:45:33.7938316Z     waitres = _winapi.WaitForMultipleObjects(
2024-05-03T02:45:33.7938766Z KeyboardInterrupt
```

Retrying was working, but it was already too late to finish the job.  I'm setting the same default `THRESHOLD * 3` timeout value here for C++ tests.
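For illustration, a minimal sketch of applying such a ceiling when launching a test binary; `THRESHOLD` here is only a stand-in for run_test.py's per-shard timeout constant, and the helper name is made up:

```
import subprocess

THRESHOLD = 1800  # stand-in for run_test.py's per-shard timeout, in seconds

def run_cpp_test(cmd):
    # Give C++ test binaries the same THRESHOLD * 3 ceiling as Python shards
    # instead of letting a hung binary run for hours.
    try:
        return subprocess.run(cmd, timeout=THRESHOLD * 3).returncode
    except subprocess.TimeoutExpired:
        print(f"{cmd} timed out after {THRESHOLD * 3}s")
        return 1
```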
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125517
Approved by: https://github.com/clee2000
2024-05-07 16:41:38 +00:00
Catherine Lee
848fce35b5 [CI][ez] Don't retry when it says don't retry (#125643)
default arg for retry_shell is retries=1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125643
Approved by: https://github.com/huydhn
2024-05-07 16:20:00 +00:00
Catherine Lee
1b3fd83ab2 [TD] Enable TD on AVX related configs (#125482)
On test configs `nogpu_AVX512` and `nogpu_NO_AVX2`, which are the next longest jobs on trunk after windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125482
Approved by: https://github.com/huydhn
2024-05-06 22:02:16 +00:00
Catherine Lee
d4727fd4eb [TD][ez] Better check for is pr or not (#125485)
You can trigger ciflow tags on main branch commits, so we should be more conservative when checking to see if a workflow is a PR/on the main branch.

get_pr_number checks for the PR number based on the PR_NUMBER env var or a tag of the form `ciflow/workflow/pr number`

If we fail to find something like this, then assume it is on the main branch
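A hedged sketch of that check (names and regex are illustrative, not the actual CI helper):

```
import os
import re
from typing import Optional

def get_pr_number(tag: str = "") -> Optional[int]:
    # Prefer the explicit PR_NUMBER env var set on pull_request events.
    pr_number = os.environ.get("PR_NUMBER", "")
    if pr_number.isdigit():
        return int(pr_number)
    # Otherwise accept a tag of the form ciflow/<workflow>/<pr number>.
    match = re.fullmatch(r"ciflow/\w+/(\d+)", tag)
    if match:
        return int(match.group(1))
    # Neither found: treat the run as a main-branch workflow.
    return None
```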

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485
Approved by: https://github.com/huydhn
2024-05-04 03:08:44 +00:00
Catherine Lee
e16f1ee4cc [ez][CI] Move test_modules and test_schema_check off CI_SERIAL_LIST (#125193)
* Related https://github.com/pytorch/pytorch/pull/124085

As in title, move test_modules and test_schema_check off CI_SERIAL_LIST
If things fail, they can get the serialTest decorator instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125193
Approved by: https://github.com/huydhn
2024-05-01 15:48:48 +00:00
PyTorch MergeBot
e7631d6eae Revert "CI: add aarch64 linux workflow (#121284)"
This reverts commit 32cf04cb7f.

Reverted https://github.com/pytorch/pytorch/pull/121284 on behalf of https://github.com/malfet due to Test only changes has not been reverted ([comment](https://github.com/pytorch/pytorch/pull/121284#issuecomment-2083925890))
2024-04-30 00:24:11 +00:00
Catherine Lee
4d717cd7c3 [TD] Enable td on cpu windows (#125049)
yolo

Also
* Ensure that at least 1 test always gets run (`//` does truncation, which results in 0 if too few tests are discovered; see the sketch below)
* Don't run test removal on slow tests - I'm not touching that yet
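A tiny sketch of that truncation point (hypothetical helper, not the actual TD code):

```
def num_tests_to_keep(num_discovered, keep_fraction=0.25):
    # Integer truncation can round a small test list down to zero,
    # so clamp to at least one test.
    return max(1, int(num_discovered * keep_fraction))

assert num_tests_to_keep(2) == 1    # 2 * 0.25 truncates to 0 without the clamp
assert num_tests_to_keep(100) == 25
```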

I am avoiding everything other than pull + trunk workflows, so I'm not doing this on Windows CUDA, which runs on periodic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 23:39:54 +00:00
Catherine Lee
faee0e5ee8 [ez][CI] Move test_linalg and test_sparse_csr off CI_SERIAL_LIST (#125068)
* https://github.com/pytorch/pytorch/pull/124649 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125068
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 21:22:35 +00:00
Sunita Nadampalli
32cf04cb7f CI: add aarch64 linux workflow (#121284)
aarch64 linux workflow is triggered for ciflow/aarch64 tags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284
Approved by: https://github.com/atalman, https://github.com/malfet
2024-04-29 18:25:40 +00:00
egienvalue
8461e7ed9e Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device guard and hooks. Since we added a fake device backend, it is mutually exclusive with other backends. Tests will be skipped if TEST_CUDA or TEST_ROCM is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-26 16:17:54 +00:00
PyTorch MergeBot
4a1299cc0e Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)"
This reverts commit 355dc34f86.

Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
Catherine Lee
4f29103749 [ez][CI] Move test_cuda off CI_SERIAL_LIST (#124649)
Tag test cases with large tensors as serial; also tag a few more that failed on a previous iteration of this PR

Move test_cuda and test_cuda_expandable_segments off the serial list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124649
Approved by: https://github.com/ZainRizvi
2024-04-24 22:04:23 +00:00
egienvalue
355dc34f86 Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device guard and hooks.

Differential Revision: [D56443358](https://our.internmc.facebook.com/intern/diff/D56443358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-24 20:51:20 +00:00
Catherine Lee
8fe0b8b6a8 No CPP or xdist process level reruns (#124798)
xdist doesn't play well with the current process-level rerun scheme
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124798
Approved by: https://github.com/huydhn
2024-04-24 19:44:51 +00:00
PyTorch MergeBot
52da03edeb Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)"
This reverts commit b6f0159db0.

Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
egienvalue
b6f0159db0 Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device guard and hooks.
@exported-using-ghexport

Differential Revision: [D55902506](https://our.internmc.facebook.com/intern/diff/D55902506/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-18 17:40:13 +00:00
Catherine Lee
025387f4dd [ez][CI] Reduce CI_SERIAL_LIST pt2 (#124298)
#124085

Add @serialTest() to some tests

slow gradcheck already runs serially

Doing this slowly so it's easier to check for flaky issues that might get created

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124298
Approved by: https://github.com/kit1980
2024-04-18 00:13:36 +00:00
Catherine Lee
0abd3f60fd [CI] Reduce CI_SERIAL_LIST list (#124085)
Add serial marker for individual tests so the test file can be removed from the ci serial list
Run serial marked tests first in serial
Run all other tests afterwards in parallel

Slowly reduce list and mark individual tests as serial instead

Hope # of serial tests is small so sharding evenness doesn't get too messed up

Hopefully can do 3 procs for sm86 and cpu?

serial no longer looks like a real word to me

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:23:47 +00:00
Catherine Lee
946b50c788 [ez][TD] Increase logging (#124082)
increase logging during td
generate an artifact that says which tests got excluded
fix minor bug where filter test configs couldn't get commit messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124082
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:18:28 +00:00
Catherine Lee
3cd06f56b1 [ez] test_profiler in serial (#123665)
Add test_profiler to the serial list since we keep needing to reopen disable issues and I think it's due to being incompatible with parallelism
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123665
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-04-11 20:24:47 +00:00
William Wen
4bee4c7c25 [3.12] enable inductor unittests (#123654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123654
Approved by: https://github.com/jansel
2024-04-10 20:51:43 +00:00
Catherine Lee
61be8843c9 [TD] Use label to configure td on distributed for rollout (#122976)
Gate TD on distributed behind label

TODO:
auto add label to certain people's prs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122976
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-08 15:53:55 +00:00
William Wen
d59c5d7353 [dynamo, 3.12] enable dynamo on 3.12, enable most dynamo unittests on 3.12 (#123216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123216
Approved by: https://github.com/jansel, https://github.com/malfet
2024-04-04 20:00:54 +00:00
Catherine Lee
b5bef9bbfd Fix cpp tests not running + failing to surface (#122845)
The comment in the code should have the information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122845
Approved by: https://github.com/huydhn
2024-03-29 22:41:45 +00:00
Catherine Lee
03184a82dd [TD] TD on ASAN PR jobs (#122332)
Low impact CPU jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332
Approved by: https://github.com/huydhn
2024-03-22 22:32:51 +00:00
eellison
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:

- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.

In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice, we finalize the MultiTemplateBuffer as that specific template. If no fusion occurs, we finalize the MultiTemplateBuffer once the fusion passes complete.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogues are worth fusing. We could potentially defer choosing the template in this case in a follow-up, at the expense of compile time.
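A rough sketch of the selection rule described above, with made-up names and timings; the actual inductor implementation is considerably more involved:

```
def finalize_choice(unfused_times, fused_times):
    # unfused_times: benchmark of each choice (triton templates and extern kernels) alone.
    # fused_times:   benchmark of each fusible choice with the epilogue folded in.
    best_unfused = min(unfused_times, key=unfused_times.get)
    if fused_times:
        best_fused = min(fused_times, key=fused_times.get)
        # Only commit to a template if fusing actually beats the best standalone
        # choice (register spilling can make the fused kernel slower).
        if fused_times[best_fused] < unfused_times[best_unfused]:
            return best_fused
    return best_unfused

# finalize_choice({"extern_mm": 95.0, "triton_mm_1": 100.0},
#                 {"triton_mm_1 + relu": 90.0}) -> "triton_mm_1 + relu"
```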

Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
Kai Londenberg
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
Catherine Lee
fac06a12c8 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail on the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-11 15:35:45 +00:00
PyTorch MergeBot
2c2d6ce515 Revert "CI sanity check test for env vars (#120519)"
This reverts commit f43b9c56c5.

Reverted https://github.com/pytorch/pytorch/pull/120519 on behalf of https://github.com/clee2000 due to broken on slow d27509c384 https://github.com/pytorch/pytorch/actions/runs/8208843198/job/22453617568 ([comment](https://github.com/pytorch/pytorch/pull/120519#issuecomment-1986480624))
2024-03-08 22:01:35 +00:00
Catherine Lee
f43b9c56c5 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail on the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-08 20:28:50 +00:00
Catherine Lee
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this one job artifact and they will always be in sync with each other; no longer need to worry about consistency issues

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move TD calculation to its own file that will create a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (ex it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
Catherine Lee
0290fe65bd Test TD (test removal) on crossref (#119426)
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut, which makes the top 25% take really long (still 90+ min)

The original plan was to test on rocm, but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min. Crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
2024-02-29 18:53:43 +00:00
albanD
30625ae582 Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-26 22:21:14 +00:00
Catherine Lee
c39bbd6def Numbers based TD (#119901)
Convert from a list/bucket based TD system to just a numbers based TD system.  Looks like a massive change but a decent amount of it is tests and removing code.

Main file of interest is interface.py, which Github is collapsing by default due to size

The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.

Other notable changes:
* Use frozenset to make TestRun hashable (see the sketch below)
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
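A tiny sketch of the hashability point from the first bullet (the real TestRun carries more state):

```
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TestRun:
    test_file: str
    # frozenset rather than set/list so the frozen dataclass stays hashable
    included: Optional[frozenset] = None

runs = {TestRun("test_ops", frozenset({"TestCommonCUDA"})), TestRun("test_ops")}
assert TestRun("test_ops") in runs
```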
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2024-02-26 17:01:19 +00:00
Catherine Lee
cfddfce0d3 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible.  Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
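A minimal sketch of that packing idea, assuming we already have per-file time estimates (the real sharding logic in run_test.py handles many more cases):

```
def shard(serial, parallel, num_shards, procs_per_shard):
    # serial/parallel map test file -> estimated minutes.
    costs = dict(serial)
    costs.update({t: d / procs_per_shard for t, d in parallel.items()})
    target = sum(costs.values()) / num_shards
    shards = [[] for _ in range(num_shards)]
    loads = [0.0] * num_shards
    idx = 0
    # Pack serial tests onto as few shards as possible, in order.
    for t in serial:
        if loads[idx] + costs[t] > target and idx < num_shards - 1:
            idx += 1
        shards[idx].append(t)
        loads[idx] += costs[t]
    # Spread parallel tests greedily over the least loaded shards, which
    # mostly ends up being the shards without serial tests.
    for t in sorted(parallel, key=parallel.get, reverse=True):
        i = loads.index(min(loads))
        shards[i].append(t)
        loads[i] += costs[t]
    return shards
```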

Move serial tests to run first

If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-21 16:40:27 +00:00
Catherine Lee
af765dbdfd [ez] Explicit env for run_test (#120251)
env=None (which is the default) inherits the env from the calling process.  Explicitly set the env to the calling process env so that things can be added to it later
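A short illustration of the difference, with a placeholder command:

```
import os
import subprocess

env = os.environ.copy()          # start from the calling process env (what env=None gives implicitly)
env["PYTORCH_TEST_EXTRA"] = "1"  # now extra variables can be layered on top

subprocess.run(["python", "--version"], env=env, check=True)
```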

Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
2024-02-21 00:40:19 +00:00
PyTorch MergeBot
dfb83df889 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit 47182a8f4b.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/clee2000 due to iirc the default setting of env to None causes it to inherit the env of the calling process, I'll make a PR that makes it so that the old env vars don't disappear, and then re merge this on top of it.  Reverting this because I think some important env vars are disappearing (specifically CI) ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1955128676))
2024-02-20 21:28:13 +00:00
PyTorch MergeBot
9b38ee2343 Revert "Alternate sharding (#119078)"
This reverts commit 861acda205.

Reverted https://github.com/pytorch/pytorch/pull/119078 on behalf of https://github.com/clee2000 due to failing 861acda205 ([comment](https://github.com/pytorch/pytorch/pull/119078#issuecomment-1946583857))
2024-02-15 16:59:50 +00:00
Catherine Lee
861acda205 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible.  Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests

Move serial tests to run first

If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-15 01:32:44 +00:00
atalman
244b124bb8 Add linux cpu test for 3.12 (#117853)
This is continuation of work: https://github.com/pytorch/pytorch/pull/113987

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853
Approved by: https://github.com/albanD
2024-02-14 20:52:23 +00:00
albanD
47182a8f4b Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-14 18:40:23 +00:00
Catherine Lee
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
PyTorch MergeBot
8182fce769 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit fbe6f6236e.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))
2024-02-08 17:20:39 +00:00
albanD
fbe6f6236e Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-08 00:54:16 +00:00
Huy Do
3ed9df36a9 Clean up some obsolete TODOs in run_test and several test files (#119113)
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen.  I have never seen a flaky C++ test that needed to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
2024-02-03 23:54:30 +00:00
Joel Schlosser
3b41793412 Purge redundant module init tests (#119028)
Fixes #118784

This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028
Approved by: https://github.com/zou3519
2024-02-02 20:17:00 +00:00
Catherine Lee
8b729fb826 [ez] Fix CI log file piping error (#118807)
Fixes https://github.com/pytorch/pytorch/issues/118764

Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-02-02 03:07:56 +00:00
Catherine Lee
9391af9796 Merging heuristics (#118029)
Every day I move closer and closer to just using numbers

* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* mark the historical edited files and profiling heuristics as not in trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
2024-01-31 20:00:10 +00:00
Catherine Lee
2eefbc02a0 [ez] Discover tests without importing torch (#118574)
Moves test discovery into a file that doesn't import torch, so test listing can be done without having torch installed.

Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for build to finish, so this is part of that.

The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
2024-01-30 03:02:29 +00:00
Catherine Lee
84251d1d71 [ez] Windows log printing + save successful test logs (#118124)
When doing `print(f.read().decode(...))` it prints an extra new line, so manually splitlines and strip each line to see if that helps

My guess is Windows line-ending differences
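A small illustration of the suspected issue, using a made-up log blob:

```
raw = b"collected 3 items\r\npassed\r\n"  # pretend log written on Windows

# Printing the whole blob adds print()'s own newline after the file's
# trailing newline, so an extra blank line shows up in the CI log.
print(raw.decode())

# splitlines() + strip() normalizes \r\n vs \n and drops the stray blank line.
for line in raw.decode().splitlines():
    print(line.strip())
```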

Also always save log file regardless of success or failure

See 476b81a9bf for what it looks like now

Swapped to opening in text mode instead of binary, seems to be ok now.

42483193bf024983060a234dc0262f4840aef4b8 for example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
2024-01-26 21:14:25 +00:00
Catherine Lee
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long)

Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name is written in brackets, e.g. [label name], as in the test above.
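A hedged sketch of pulling bracketed options out of a PR body (the option names besides ci-verbose-test-logs are illustrative):

```
import re

KNOWN_OPTIONS = {"ci-verbose-test-logs", "ci-no-timeout", "ci-no-test-timeout"}  # illustrative set

def options_from_pr_body(body):
    # Anything written as [label-name] in the PR description counts,
    # mirroring the behavior of the corresponding GitHub labels.
    return {m.group(1) for m in re.finditer(r"\[([\w-]+)\]", body)} & KNOWN_OPTIONS

assert options_from_pr_body("Test [ci-verbose-test-logs] please") == {"ci-verbose-test-logs"}
```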

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
Catherine Lee
364728b27b Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index shards starting at 1 since 0-indexing is confusing; see the sketch below)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries
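A minimal sketch of 1-indexed shard selection from the first bullet (not the custom plugin itself):

```
def select_shard(items, shard_id, num_shards):
    # shard_id runs from 1 to num_shards, matching how shards are labelled
    # in CI, instead of the 0-indexed upstream pytest-shard convention.
    assert 1 <= shard_id <= num_shards
    return items[shard_id - 1::num_shards]

assert select_shard(list(range(10)), 1, 3) == [0, 3, 6, 9]
assert select_shard(list(range(10)), 3, 3) == [2, 5, 8]
```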

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-23 18:39:30 +00:00
chuanqiw
40890ba8e7 [CI] Add python test skip logic for XPU (#117621)
Add python test skip logic for XPU

For testing purposes, cherry-picked #116833 & #116850 first, and the xpu test passed https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. They have now been reverted.

Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621
Approved by: https://github.com/huydhn
2024-01-23 08:20:42 +00:00
Catherine Lee
cef5b93f28 [ez] Serial when NUM_PROCS is 1 (#117977)
Makes it easier to understand what's going on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977
Approved by: https://github.com/huydhn
2024-01-22 23:11:41 +00:00
ydwu4
f96b7d06d7 [export] skip export tests when test with dynamo in ci (#117988)
Fixes https://github.com/pytorch/pytorch/issues/117947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988
Approved by: https://github.com/suo, https://github.com/zou3519
2024-01-22 22:14:36 +00:00
PyTorch MergeBot
f684e44fd6 Revert "Reduce pytest prints (#117069)"
This reverts commit 40dbd567e0.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))
2024-01-19 23:07:51 +00:00
Catherine Lee
40dbd567e0 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index shards starting at 1 since 0-indexing is confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-19 18:42:12 +00:00
Catherine Lee
6c5c2121b1 Run some OOMing tests serially (#117759)
They were disabled for being flaky due to OOMs but got renamed.  Seeing if running them serially helps

I kind of want to keep this test disabled since the rest of the file is probably fine...

Issues in question: #113132 #113136 #113140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 16:45:35 +00:00
PyTorch MergeBot
77cfacab55 Revert "Reduce pytest prints (#117069)"
This reverts commit 2f89ef2300.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))
2024-01-19 00:27:03 +00:00
Catherine Lee
2f89ef2300 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index shards starting at 1 since 0-indexing is confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-18 23:30:59 +00:00
rzou
5aa895e53e Don't run inductor tests in Dynamo shard (#117747)
In theory we could, but these get really slow once we turn on strict
mode, so we're not going to for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117729
2024-01-18 17:43:30 +00:00
Jack Taylor
db79ceb110 [ROCm] Enabling additional UTs on ROCm (#115738)
Unskips tests, mostly dynamo/inductor UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-01-09 08:36:07 +00:00
Catherine Lee
d455c33cca [ez][td] Pipe TD logs to log file (#116796)
It is a bit annoying to have them come up when searching through the logs.  They're also surprisingly long
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796
Approved by: https://github.com/huydhn
2024-01-05 19:05:12 +00:00
Catherine Lee
c52b78ebc2 [ez] Remove some args from run_test.py (#115459)
Don't think anyone uses these
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115459
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:56:37 +00:00
Sijia Chen
641ec2115f [AOTI] move model runner into a library (#115220)
Summary: So that we can import it in fbcode and do some AOTI runs in a Python env

Test Plan: existing AOTI tests

Reviewed By: chenyang78

Differential Revision: D51780021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115220
Approved by: https://github.com/desertfire
2023-12-09 19:03:32 +00:00
Catherine Lee
3b7d60b6ff Fix keep-going (#112098)
New function for continue on error

Another solution might be to run the entire suite to the end and use last-failed, but I'm worried about concurrent processes writing to the same last-failed cache entry. It's also a bit different than the usual test rerunning strategy we use, especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (ex cuda error that causes sync to fail).

Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d

TODO: continue on error for --subprocess and test_distributed aren't working fully
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098
Approved by: https://github.com/huydhn
2023-11-30 04:01:57 +00:00
Jithun Nair
2ea2421b44 Skip unit tests that fail on MI210 runners (#114613)
Taken from https://github.com/pytorch/pytorch/pull/105980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613
Approved by: https://github.com/malfet
2023-11-27 22:25:35 +00:00
Philip Meier
2aa486de9b vendor packaging.version (#114108)
Fixes #113940. This vendors the relevant parts of `packaging==23.2.0` to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`.

I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise.
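Hypothetical usage, assuming the vendored copy keeps upstream's `packaging.version` module layout:

```
# The exact import path is an assumption based on the description above.
from torch._vendor.packaging.version import InvalidVersion, Version

assert Version("2.1.0") < Version("2.2.0a0")
try:
    Version("not-a-version")
except InvalidVersion:
    pass
```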

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108
Approved by: https://github.com/malfet, https://github.com/albanD
2023-11-21 11:51:23 +00:00
Zain Rizvi
ec20c9044e [TD] Fix metric emission for split test files (#113789)
Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789
Approved by: https://github.com/clee2000
2023-11-16 23:19:40 +00:00
Catherine Lee
87aeb248c9 More random stepcurrent (#113620)
Distributed tests for different backends have the same name, so they end up clashing on the current stepcurrent key, and tests were not being run.

Disabled the following tests because they are failing:
test_ddp_has_finalized

test_broadcast_object_list
<details>

```

2023-11-14T06:44:01.0428686Z
2023-11-14T06:44:01.0430447Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_broadcast_object_list <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.0431048Z [1699943450.893723] [99f90b6e6ff3:10028:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0431625Z [1699943450.914385] [99f90b6e6ff3:10029:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0432314Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0433178Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0434677Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0435435Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0436895Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0437500Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0438917Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0439637Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0441122Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0441873Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0443340Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0444077Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0445769Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0446732Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0448433Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0449187Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0450553Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0451621Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0453161Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0454065Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0455441Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0456183Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0457775Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0458649Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0460923Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 1][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0461471Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0462430Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0463552Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0464082Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0465136Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0465945Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.0466605Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0467303Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0467972Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0468743Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0470233Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0471106Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0472581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0473162Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0474581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0475314Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0476776Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0477535Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0478993Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0479886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0481593Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0482429Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0484145Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0484886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0486271Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0487018Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0488559Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0489470Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0491078Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0491912Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0493369Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0494419Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0496679Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0497211Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0498198Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0499291Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0499838Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0500881Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0501667Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.0502343Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0503024Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0503411Z ('RERUN', {'yellow': True}) [6.1102s] [100%]
```
</details>

test_ddp_sync_bn_training_vs_eval

<details>

```

2023-11-14T06:44:01.1494815Z
2023-11-14T06:44:01.1496630Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_sync_bn_training_vs_eval <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.1497290Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1498119Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1498808Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1499465Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1500160Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1500820Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1501556Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502239Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502952Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1503678Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1504350Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1505119Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1506729Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1507492Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1508992Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1509578Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1510994Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1511725Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1513193Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1513962Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1515697Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1516529Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1518019Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1518910Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1520177Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1521062Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1522238Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1523099Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1523923Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1524470Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1525481Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1526632Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1527180Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1528223Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1529029Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.1529786Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1530576Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1532383Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1533127Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1534608Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1535194Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1536817Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1537575Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1539036Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1539800Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1541531Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1542388Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1544015Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1544907Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1546061Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1546944Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1548142Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1548991Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1549806Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1550350Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1551304Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1552462Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1553095Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1554166Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1554976Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.1555235Z ('RERUN', {'yellow': True}) [6.6107s] [100%]
```
</details>

test_backend_full_group
<details>

```
2023-11-14T22:51:56.4502470Z FAILED [5.2125s] distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_backend_full_group - RuntimeError: Process 0 exited with error code 10 and exception:
2023-11-14T22:51:56.4502665Z Traceback (most recent call last):
2023-11-14T22:51:56.4503603Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4503796Z     getattr(self, test_name)()
2023-11-14T22:51:56.4504710Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4504845Z     fn()
2023-11-14T22:51:56.4505737Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4505896Z     method(*args, **kwargs)
2023-11-14T22:51:56.4506823Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4506992Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4508285Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4508640Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4509798Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4510104Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4510629Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4510650Z
2023-11-14T22:51:56.4510987Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4511525Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4511545Z
2023-11-14T22:51:56.4511970Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4511989Z
2023-11-14T22:51:56.4512242Z Process 1 exited with error code 10 and exception:
2023-11-14T22:51:56.4512454Z Traceback (most recent call last):
2023-11-14T22:51:56.4513380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4513687Z     getattr(self, test_name)()
2023-11-14T22:51:56.4514612Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4514746Z     fn()
2023-11-14T22:51:56.4515633Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4515791Z     method(*args, **kwargs)
2023-11-14T22:51:56.4516708Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4516895Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4518008Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4518352Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4519509Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4519813Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4520334Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4520355Z
2023-11-14T22:51:56.4528843Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4529492Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4529681Z
2023-11-14T22:51:56.4530122Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4530423Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
```
</details>

Pretty sure the solution for this one is to add ucc handling in `_test_group_override_backend`; a sketch of the fix follows the log links below.
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430019
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430132
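
A minimal sketch of the failure mode and the proposed fix, assuming the helper picks an override backend with an if/elif chain (the helper body and the chosen override backend here are assumptions, not the actual PyTorch source):

```python
# Hypothetical sketch, not the real helper: new_backend is only assigned for
# backends the chain knows about, so a ucc run falls through and the later
# reference raises the UnboundLocalError seen in the logs above.
def _pick_override_backend(current_backend: str) -> str:
    if current_backend == "gloo":
        new_backend = "nccl"
    elif current_backend == "nccl":
        new_backend = "gloo"
    elif current_backend == "ucc":  # the suggested addition
        new_backend = "gloo"
    return new_backend  # UnboundLocalError for any backend without a branch


print(_pick_override_backend("ucc"))  # works once the ucc branch exists
```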
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113620
Approved by: https://github.com/huydhn
2023-11-15 21:56:10 +00:00
Catherine Lee
0c448526a4 [experiment][TD] Rating number system (#112676)
Emits an excessive amount of heuristic info, but that just means I can do more with it later?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112676
Approved by: https://github.com/ZainRizvi
2023-11-07 19:40:11 +00:00
Nikita Shulga
e2e5897269 [CI] Do not use packaging in run_tests.py (#112873)
It used to check that CUDA is newer than 11.6, but all of the CUDA versions we use now are.

Yet another mitigation for the missing `packaging` module on macOS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112873
Approved by: https://github.com/huydhn
2023-11-03 17:22:46 +00:00
Zain Rizvi
4e67c69a7d [TD] Support downgrading test relevance (#112671)
Allow heuristics to actually downgrade the relevance of a test.  Note that NONE/UNLIKELY tests will still get executed, but they will be run at the end of CI.

The Relevance chosen affects the outcome when Heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance:

HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED

Given that we currently assume ordering based on the list in init (since the lists are appended), do a similar thing for UNLIKELY and NONE; a sketch of the resolution order follows the example below.
For example, with HEURISTICS = [a, b, c, d]:
- currently, everything in b.high is added after a.high
- if b.none includes things in a.high, a.high trumps
- if b.none includes things in a.probable, b.none trumps, since NONE is stronger than PROBABLE
- if b.unlikely includes things from a.high/probable, a.high/probable trumps, since HIGH and PROBABLE are at a higher strength than UNLIKELY
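
A rough sketch of that resolution order, using assumed names rather than the real TD classes:

```python
from enum import IntEnum


class Relevance(IntEnum):
    # higher value = higher confidence in the declared relevance
    UNRANKED = 0
    UNLIKELY = 1
    PROBABLE = 2
    NONE = 3
    HIGH = 4


def resolve(predictions):
    """predictions: list of (heuristic, {test: Relevance}) in HEURISTICS order."""
    final = {}
    for _, ranking in predictions:
        for test, relevance in ranking.items():
            # a later heuristic only wins if it is more confident
            if test not in final or relevance > final[test]:
                final[test] = relevance
    return final


preds = [
    ("a", {"test_x": Relevance.HIGH, "test_y": Relevance.PROBABLE}),
    ("b", {"test_x": Relevance.NONE, "test_y": Relevance.NONE}),
]
print(resolve(preds))  # test_x stays HIGH, test_y is downgraded to NONE
```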
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671
Approved by: https://github.com/clee2000
2023-11-02 21:02:40 +00:00
Zain Rizvi
a5641bc56b [TD] Enable Test Class granularity on heuristics (#112161)
Changes the heuristic framework to support prioritizing individual classes within a test file.

Components of this included:
- Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file it would pass in the test's name; now, to prioritize a class within a test, it uses the notation "test::classname"
- Changes are fully backwards compatible with existing heuristics
- Test sharding now supports sharding individual tests (for when they're prioritized)
- When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest
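
A hedged sketch of that last step, with an assumed helper name (the real run_test.py plumbing is more involved):

```python
# Turn a prioritized "test_file::TestClass" entry into pytest arguments.
def build_pytest_args(entry: str) -> list[str]:
    if "::" in entry:
        test_file, test_class = entry.split("::", 1)
        # -k restricts collection to the prioritized class within the file
        return [f"{test_file}.py", "-k", test_class]
    return [f"{entry}.py"]


print(build_pytest_args("test_ops::TestCommonCPU"))
# ['test_ops.py', '-k', 'TestCommonCPU']
```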

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161
Approved by: https://github.com/huydhn
2023-10-31 18:11:05 +00:00
Catherine Lee
3b5b7ebd09 [ci] Save various json files from test infra into folder (#111516)
We pull a lot of files from https://github.com/pytorch/test-infra/blob/generated-stats/stats and name them separately when we add them to the artifacts in the build, so stick them in a folder and just add that instead.

Slow test and disabled test jsons remain as they were since they are pulled during the test step and do not need to be included in the artifacts during build since they are not used for sharding.

Sanity checked that test times could be found for linux, mac, windows, and rocm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111516
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-10-23 20:38:25 +00:00
Nikita Shulga
e9a51a6a07 [BE] Revive test_typing (#111428)
`test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time.

In this PR, the same functionality is re-written using the unittest framework and `parametrize` from `torch.testing._internal._common_utils`.

Valid `test_typing.py` with ufmt

Disable `fail/bitwise_ops.py` and `pass/jit.py`, as they regressed at some point, as well as one of the examples in `namedtuple.py`, since the `torch.linalg.qr` type is no longer revealed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428
Approved by: https://github.com/clee2000
2023-10-18 02:19:49 +00:00
Jack Taylor
6b92c367c5 Add test_jit_cuda_fuser to ROCM_BLOCKLIST (#110440)
Adds the nvfuser-related unit test suite to ROCM_BLOCKLIST, as it should not be run on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110440
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/lezcano
2023-10-06 08:47:15 +00:00
Catherine Lee
8a09fe4a05 [ez] Remove print in heuristics aggregation (#110621)
Move the print to the beginning instead, because putting it at the end means you have to scroll through it when debugging, and nothing in that function indicates that it should be printing anything.

Also move the line for printing disabled issues out of the for loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
2023-10-06 02:04:53 +00:00
Catherine Lee
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest.  Zip the log into an artifact.  The line listing all the test names is really long, but if you view source of the raw logs, it will not wrap so it will only be one line.  The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs.  Do not zip.
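
A small sketch of the behavior described above (function name and path handling are assumptions, not the actual run_test.py code):

```python
import zipfile


def handle_test_log(test_name: str, log_path: str, succeeded: bool) -> None:
    if succeeded:
        print(f"{test_name} passed")  # keep CI output short on success
        with zipfile.ZipFile(f"{log_path}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(log_path)  # full log still available as an artifact
    else:
        with open(log_path) as f:
            print(f.read())  # print everything for failures/reruns
```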

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

Possibly controversial, haha: should I include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
Catherine Lee
f69e9c8c91 run_tests.py minor logging changes (#110188)
Minor logging changes that just kind of annoyed me:
* prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving import within the function (idk if this is ok)
* prevent constant printing of `Ignoring disabled issues:  ['']` (no idea why it was not gated behind a function or main)
* change all prints in run_tests.py to go through stderr so there's no weird interleaving (although if everything goes through stderr, might as well just print everything through stdout...)
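
For the stderr point, a trivial illustration of the pattern (not the actual run_tests.py code):

```python
import sys

# Route status messages through stderr so they don't interleave with the
# test runner's stdout output.
print("Running serial tests first", file=sys.stderr)
```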
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2023-10-03 01:22:47 +00:00
Zain Rizvi
1277d0e834 [BE] Add sharding data by default to metrics (#110035)
Extend the metric library to allow setting global, process-level metrics that will always be emitted.

Current use case for them is to include shard information every time a metric is emitted by run_test.py
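
A hedged sketch of the idea (names assumed; not the real metrics library API):

```python
_global_metrics: dict = {}


def add_global_metric(name: str, value) -> None:
    # set once per process, e.g. shard info; merged into every emission
    _global_metrics[name] = value


def emit_metric(name: str, info: dict) -> None:
    payload = {**_global_metrics, **info, "metric_name": name}
    print(payload)  # stand-in for the real backend upload


add_global_metric("shard", 2)
add_global_metric("num_shards", 4)
emit_metric("td_experiment", {"duration_s": 12.3})
```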

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
2023-09-26 17:06:49 +00:00
Catherine Lee
47adcd412f Increase timeout for slow tests (#109206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109206
Approved by: https://github.com/huydhn
2023-09-26 16:18:38 +00:00
jjsjann123
0d3db1048a remove nvfuser test in upstream pytorch (#109918)
Removing nvfuser related tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109918
Approved by: https://github.com/msaroufim
2023-09-24 13:49:37 +00:00
Catherine Lee
fe198f3141 inductor/test_max_autotune serial in CI (#109209)
Trying to figure out why this keeps timing out; wondering if it's due to parallelization weirdness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109209
Approved by: https://github.com/huydhn
2023-09-13 17:04:43 +00:00
Catherine Lee
a4138b1f99 [ez] Fix small type error in run_test (#109036)
This is really small but it has tripped me up at least 3 times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109036
Approved by: https://github.com/kit1980
2023-09-11 21:11:20 +00:00
Catherine Lee
c67ebae344 Put logging in run_tests (#107987)
Logging regarding which tests are serial vs. parallel and which tests actually get run on the shard got removed; it can be pretty helpful, so this adds it back in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107987
Approved by: https://github.com/huydhn, https://github.com/Neilblaze
2023-09-01 20:23:30 +00:00
Zain Rizvi
5727b07ac6 TD: logging bugfix (#108288)
Fix bug where logging metrics don't get emitted unless the 'keep-going' label is specified on the PR

Also adds some extra logging to make debugging easier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108288
Approved by: https://github.com/Skylion007
2023-08-31 16:51:49 +00:00
Zain Rizvi
238cc84af9 [TD] Emit metrics to compare heuristic quality (#108192)
When a test fails, we will now emit fine grained details about how accurately heuristics predicted the relevance of that test.

## Context
Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not a test fails on it. Green tests don't tell us if the success was due to the code being good vs being irrelevant.  This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.

## What's measured?
The metrics this PR collects are designed to answer the following questions

### How comprehensive are the heuristics?
- What's the false negative rate, the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per heuristic level)

### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % was prioritized overall?
- How relevant was a failed test considered to be? (Both at an aggregate level and at a per-heuristic level)
- What % of time was a given heuristic prioritizing a failing test higher than any other heuristic?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117
2023-08-30 18:28:18 +00:00
Zain Rizvi
620d267ef3 Refactor TestPrioritizations to support more priorities and reduce risk of accidental mutations (#108117)
Refactor TD code to make it easier to add additional categories later and also support the changes required to enable the metrics needed for TD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108117
Approved by: https://github.com/huydhn
2023-08-30 04:14:28 +00:00
Zain Rizvi
36399d067a Port existing heuristics to TD framework (#107071)
This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are:
- Some metric emissions were changed to comply with the new TD format
- Some logging changes
- We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general)

Refactorings done:
- Moves all test reordering code to the new TD framework
- Refactors run_test.py to cleanly support multiple levels of test priorities
- Deletes some dead code that was originally written for logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-08-23 21:23:23 +00:00
Catherine Lee
e0238577b6 Always import test selection tools (#107644)
https://github.com/pytorch/pytorch/pull/107070 made emit_metrics importable without boto3, so we could just import all the files without the try catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107644
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-08-22 16:36:20 +00:00
Zain Rizvi
5ddb8ef827 Make emit_metrics importable without having boto3 installed (#107070)
Make it so that scripts can import and run the `emit_metrics` function even if they don't have boto3 installed, in which case it will still validate the inputs but skip the actual metric emission part.

It's purely a refactor without any real logic changes

Motivation: So that run_test.py and the target determination code can use this library easily without worrying about whether it was imported or whether its dependencies are installed.
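
A sketch of the import pattern being described (the helper body here is an assumption):

```python
try:
    import boto3  # optional dependency

    HAS_BOTO3 = True
except ImportError:
    HAS_BOTO3 = False


def emit_metrics(metrics: dict) -> None:
    if not isinstance(metrics, dict):
        raise TypeError("metrics must be a dict")  # validation always runs
    if not HAS_BOTO3:
        return  # skip the actual emission when boto3 isn't installed
    # real emission via a boto3 client would happen here
```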

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107070
Approved by: https://github.com/huydhn
2023-08-21 21:13:01 +00:00
Catherine Lee
3b2c5d47c0 Use default build env and test config for test times (#107325)
Redo of #107312

Pairs with https://github.com/pytorch/test-infra/pull/4476

If the build env and test config combo cannot be found in the test times, use the default.  Then we don't have to go manually change the test-times.json when a new job is added or the jobs are updated.
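
A minimal sketch of the lookup-with-fallback (the JSON layout here is an assumption):

```python
def get_test_times(test_times: dict, build_env: str, test_config: str) -> dict:
    # fall back to the "default" entry when the exact combo isn't present
    env_times = test_times.get(build_env, test_times.get("default", {}))
    return env_times.get(test_config, env_times.get("default", {}))


times = {"default": {"default": {"test_ops": 1200.0}}}
print(get_test_times(times, "linux-new-job", "shiny-config"))
# {'test_ops': 1200.0}
```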
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107325
Approved by: https://github.com/huydhn
2023-08-21 18:39:55 +00:00
FFFrog
e108f33299 Update distutils.Version to packaging.version due to the deprecation … (#107207)
Update distutils.Version to packaging.version due to the deprecation warning.

```python
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17138: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17140: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
```
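
The migration the warning asks for looks roughly like this, assuming `packaging` is installed:

```python
from packaging.version import Version  # replaces distutils.version.LooseVersion

scipy_version = "1.3.1"  # stand-in for scipy.__version__
active_if = Version(scipy_version) < Version("1.4.0")
print(active_if)  # True
```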
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107207
Approved by: https://github.com/soulitzer
2023-08-17 11:19:44 +00:00
Catherine Lee
f16be5e0d4 Reordering tests experiment (#106347)
Companion with https://github.com/pytorch/test-infra/pull/4424

Uses the file rating generated by the test infra PR to reorder tests.  For each test file, sum the file ratings from the changed files in the PR, and put the tests in order of that sum.

A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.

Sharding is done twice, once on the prioritized tests, and once on the general/non prioritized tests.  Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.
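
A rough sketch of the scoring and split described above (the data shapes are assumptions):

```python
def reorder_tests(test_files, changed_files, ratings):
    # ratings: {changed_file: {test_file: rating}} from the test-infra job
    def score(test):
        return sum(ratings.get(cf, {}).get(test, 0.0) for cf in changed_files)

    prioritized = sorted(
        (t for t in test_files if score(t) > 0), key=score, reverse=True
    )
    general = [t for t in test_files if score(t) == 0]
    return prioritized, general  # each group is then sharded separately


ratings = {"torch/utils/foo.py": {"test_utils": 3.0, "test_ops": 0.5}}
print(reorder_tests(["test_ops", "test_utils", "test_nn"],
                    ["torch/utils/foo.py"], ratings))
# (['test_utils', 'test_ops'], ['test_nn'])
```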

I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
2023-08-16 18:23:09 +00:00
PyTorch MergeBot
9858edd99f Revert "Reordering tests experiment (#106347)"
This reverts commit 7dfab082be.

Reverted https://github.com/pytorch/pytorch/pull/106347 on behalf of https://github.com/clee2000 due to probably broke sharding ([comment](https://github.com/pytorch/pytorch/pull/106347#issuecomment-1675542738))
2023-08-11 23:59:48 +00:00
Richard Zou
b9ad7bc533 Don't run test/autograd/test_fallback.py in parallel (#106866)
Fixes https://github.com/pytorch/pytorch/issues/106754

This PR:
- moves test/autograd/test_fallback.py to test_autograd_fallback.py and
removes it from test_autograd.py (necessary for the next step)
- adds test_autograd_fallback.py to parallel test blocklist.
- lintrunner really wanted to make changes to the files, but other than
that, it is a move.

The problem is that we set a global option (the autograd fallback mode)
during these tests which may cause the tests to interfere with each
other.

Test Plan:
- python test/run_test.py -i test_autograd_fallback

NOTE to diff train oncall:
- You'll also need to modify the test/autograd/test_fallback.py TARGET in
caffe2/test/TARGETS since we renamed the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106866
Approved by: https://github.com/soulitzer
2023-08-10 00:26:23 +00:00
Catherine Lee
7dfab082be Reordering tests experiment (#106347)
Companion with https://github.com/pytorch/test-infra/pull/4424

Uses the file rating generated by the test infra PR to reorder tests.  For each test file, sum the file ratings from the changed files in the PR, and put the tests in order of that sum.

A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.

Sharding is done twice, once on the prioritized tests, and once on the general/non prioritized tests.  Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.

I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
2023-08-09 20:11:11 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.
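
An illustration of the kind of change this applies:

```python
d = {"a": 1, "b": 2}

# before: the value is bound but never used
for key, value in d.items():
    print(key)

# after the automated fix: iterate over the keys directly
for key in d:
    print(key)
```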

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00