pytorch/test
Colin Peppler fe285b9560 [aoti] fix corner case in unbacked replacements for atomically_apply_size_hint (#153768)
## PR
There are a few cases that my previous PR (#153220) didn't cover.
1. The order of the LHS and RHS matters. Today, `torch._check(lhs == rhs)` shows up as a deferred runtime assert of the form `Eq(lhs, rhs)`.
2. There can be transitive replacements. For example, expr1 -> expr2 -> u0. `test_size_with_unbacked_add_expr_transitive` tests for this.
3. An unbacked symint expr may not have a replacement that's purely a symbol; the replacement could be another expression. `test_size_with_unbacked_add_and_mul_expr` tests for this.
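
The transitive case in item 2 can be sketched in plain Python. This is only an illustration, not Inductor's implementation: the `resolve` helper and the string-keyed `replacements` table are hypothetical, and `expr1`, `expr2`, and `u0` are the placeholder names from the example chain above.

```python
def resolve(expr, replacements):
    """Follow a chain of replacements until no further replacement applies.

    Hypothetical helper: `replacements` maps an expression (a plain string
    here) to the expression that should stand in for it.
    """
    seen = {expr}
    while expr in replacements:
        expr = replacements[expr]
        if expr in seen:
            # Guard against cycles such as a -> b -> a.
            raise ValueError(f"cyclic replacement involving {expr!r}")
        seen.add(expr)
    return expr


# The chain from item 2: expr1 -> expr2 -> u0.
replacements = {"expr1": "expr2", "expr2": "u0"}
resolved = resolve("expr1", replacements)
print(resolved)
```

A single lookup would stop at `expr2`; following the chain to its end is what lets the size hint be computed from `u0`.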

## Device assertion msg

```
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
```

## Autotuning code setup
This is the autotuning code for a concat kernel, which takes the input tensors (`in_buf0`, `in_buf1`, ...) and writes them to the output tensor (`out_buf`).

It's important to note that the sizes of `in_buf0` and `in_buf1` don't match along dim=0. This is bad because all concat inputs must share the same size for each dim except for the concat dim (here that's dim=1).
```
in_buf0 = generate_example_value(size=(u1 + s0, 256))   # concrete size is (17900, 256)
in_buf1 = generate_example_value(size=(u0, 10))         # concrete size is (8192, 10)
...
out_buf = generate_example_value(size=(u1 + s0, 266))   # concrete size is (17900, 256+10)
triton_poi_fused_cat_1.run(in_buf0, in_buf1, ..., out_buf, xnumel=(u1 + s0) * 266 ...)
```
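
The shape rule above can be checked with a few lines of plain Python. This is a minimal sketch, assuming all inputs have the same rank; `find_cat_mismatches` is a hypothetical helper, not a PyTorch API, and the shapes are the concrete sizes from the comments above.

```python
def find_cat_mismatches(shapes, dim):
    """Return (input_index, dim_index, expected, actual) for every size that
    differs from the first input on a dimension other than the concat dim.

    Hypothetical helper; assumes every shape has the same number of dims.
    """
    reference = shapes[0]
    mismatches = []
    for i, shape in enumerate(shapes[1:], start=1):
        for d, (expected, actual) in enumerate(zip(reference, shape)):
            if d != dim and expected != actual:
                mismatches.append((i, d, expected, actual))
    return mismatches


# Concrete sizes of in_buf0 and in_buf1, concatenated along dim=1.
mismatches = find_cat_mismatches([(17900, 256), (8192, 10)], dim=1)
print(mismatches)  # [(1, 0, 17900, 8192)]
```

The single reported mismatch is on dim=0 of the second input, which is exactly the inconsistency between `u1 + s0` and `u0` in the example values.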

If you look into the kernel code, you'll see that `tmp9` loads `in_buf1` (our incorrectly shaped input tensor). There is also a mask to prevent OOB loads.
- `tmp6` makes sure we only load when the column index `x0` is in the range 256 to 264.
- `xmask` makes sure we only load when `xindex` is within `xnumel`.
- `tmp6 & xmask` together essentially checks `0 ≤ x1 < u1 + s0` and `256 ≤ x0 < 264`.

The mask logic is correct. However, `in_buf1` has the shape `[8192, 10]`, which means any load where the row index satisfies `8192 ≤ x1 < u1 + s0` will be an OOB load.
```
def triton_poi_fused_cat_1(in_buf0, in_buf1, ..., out_buf, xnumel, XBLOCK):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    x0 = (xindex % 264)
    x1 = xindex // 264
    ...
    tmp6 = x0 >= tl.full([1], value=256)
    tmp9 = tl.load(in_buf1 + (x1), tmp6 & xmask)
    # device assertion is thrown here
    tl.device_assert(((0 <= tl.broadcast_to(tmp13, [XBLOCK])) & (tl.broadcast_to(tmp13, [XBLOCK]) < ks0)) | ~(xmask & tmp6), "index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0")
```
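
To see why a correct mask still admits OOB loads, the indexing can be emulated on the CPU. This is a sketch with small hypothetical sizes rather than the real ones, and `find_oob_loads` is an illustrative helper, not generated Inductor code. Following the kernel's naming, `x0` is the column index and `x1` is the row index.

```python
def find_oob_loads(out_rows, out_cols, first_col, in_rows):
    """List (row, col) positions that pass the kernel-style mask but whose
    row does not exist in the second input tensor.

    Hypothetical helper mirroring the xmask / tmp6 logic in pure Python.
    """
    xnumel = out_rows * out_cols
    oob = []
    for xindex in range(xnumel):
        xmask = xindex < xnumel   # always true here, kept to mirror the kernel
        x0 = xindex % out_cols    # column in the output
        x1 = xindex // out_cols   # row in the output
        tmp6 = x0 >= first_col    # column belongs to the second input
        if tmp6 and xmask and x1 >= in_rows:
            oob.append((x1, x0))
    return oob


# Output is 4 x 6; the second input supplies columns 4..5 but has only 2 rows.
oob = find_oob_loads(out_rows=4, out_cols=6, first_col=4, in_rows=2)
print(oob)  # [(2, 4), (2, 5), (3, 4), (3, 5)]
```

The mask only knows about the output's extent, so every row past the end of the undersized input is still loaded. With matching row counts (`in_rows=4` here) the list is empty.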

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153768
Approved by: https://github.com/jingsh
2025-05-22 02:05:37 +00:00
ao/sparsity [BE]: Update ruff to 0.11.8 (#153249) 2025-05-12 18:30:52 +00:00
autograd
backends/xeon
benchmark_utils PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
bottleneck_test
cpp [nativert] Move GraphSignature to pytorch core (#152969) 2025-05-20 21:49:56 +00:00
cpp_api_parity
cpp_extensions Remove janky (though at times useful) dlclose test (#153975) 2025-05-20 23:26:42 +00:00
custom_backend [Cmake] Make PyTorch buildable by CMake-4.x (#150203) 2025-03-29 01:39:13 +00:00
custom_operator [Cmake] Make PyTorch buildable by CMake-4.x (#150203) 2025-03-29 01:39:13 +00:00
distributed Revert "[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)" 2025-05-21 01:45:20 +00:00
distributions Fix support of MixtureSameFamily [bugfix]. (#151317) 2025-05-14 19:24:36 +00:00
dynamo Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506) 2025-05-22 01:02:11 +00:00
dynamo_expected_failures remove TestCustomOp.test_impl_device_cpu from dynamo expected failures (#154049) 2025-05-21 23:20:30 +00:00
dynamo_skips [dynamo] context manager/decorator for dynamo config patching during tracing (#150586) 2025-04-23 09:12:13 +00:00
edge Fix some CMake issues (#153686) 2025-05-19 00:31:34 +00:00
error_messages
expect Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)" 2025-05-14 20:53:49 +00:00
export [export] Remove unused constants (#153800) 2025-05-20 03:15:27 +00:00
forward_backward_compatibility API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536) 2025-05-14 23:36:53 +00:00
functorch [map] add inductor support by lowering to while_loop (#150971) 2025-05-21 22:19:47 +00:00
fx [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/ (#149595) 2025-04-03 23:50:13 +00:00
higher_order_ops [hop_schema] support gen_schema for invoke_subgraph (#152984) 2025-05-21 18:55:46 +00:00
inductor [aoti] fix corner case in unbacked replacements for atomically_apply_size_hint (#153768) 2025-05-22 02:05:37 +00:00
inductor_expected_failures [dynamo] Support Tensor subclass that has dynamic attributes or calls Parameter.__torch_function__ (#149482) 2025-04-02 20:56:43 +00:00
inductor_skips [BE] Remove test_ops from FIXME_inductor_dont_reset_dynamo (#145307) 2025-01-27 18:12:39 +00:00
jit [JIT] Optimize DCE by storing a MemoryLocations for an entire set<Value*> (#153645) 2025-05-19 21:04:59 +00:00
jit_hooks [Cmake] Make PyTorch buildable by CMake-4.x (#150203) 2025-03-29 01:39:13 +00:00
lazy
mobile Fix some CMake issues (#153686) 2025-05-19 00:31:34 +00:00
nn [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101) 2025-05-20 20:19:03 +00:00
onnx [ONNX] Support float4 (#151069) 2025-05-18 03:19:35 +00:00
optim Add lr_lambda type check in MultiplicativeLR (#151973) 2025-04-29 08:21:41 +00:00
package Remove outdated test skipif conditions for Python3.9 (#146144) 2025-01-31 19:01:04 +00:00
profiler Add memory reporting for XPU to Memory Profiler (#152842) 2025-05-21 01:19:19 +00:00
quantization [Quant][X86] add an op to compute uint8 batch norm 2d (#152811) 2025-05-16 06:13:40 +00:00
scripts
strobelight/examples Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549) 2025-02-22 03:44:53 +00:00
test_img
torch_np Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
typing Revert "Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)" 2025-02-18 19:01:27 +00:00
xpu [Intel GPU] scalar tensor case handling in addmm, baddmm (#153051) 2025-05-21 12:24:37 +00:00
_test_bazel.py
allowlist_for_publicAPI.json Refactor torch/utils/data/datapipes/gen_pyi.py with torchgen (#150626) 2025-05-17 06:21:41 +00:00
bench_mps_ops.py [MPS][Testing] Benchmark reduction ops (#150452) 2025-04-02 01:06:27 +00:00
conftest.py Apply ruff fixes to tests (#146140) 2025-02-04 05:41:01 +00:00
create_dummy_torchscript_model.py
delete.py
hi.py
HowToWriteTestsUsingFileCheck.md
linear.py
load_torchscript_model.py
minioptest_failures_dict.json
mkl_verbose.py
mkldnn_verbose.py
pytest_shard_custom.py
run_doctests.sh
run_test.py Support independent builds for cpp extension tests + apply to libtorch_agnostic tests (#153264) 2025-05-20 19:18:09 +00:00
simulate_nccl_errors.py [BE]: Update ruff to 0.11.8 (#153249) 2025-05-12 18:30:52 +00:00
slow_tests.json Update slow tests (#153815) 2025-05-19 11:15:25 +00:00
test_accelerator.py [Easy] Fix the function signature of torch.Event (#151221) 2025-04-26 13:51:56 +00:00
test_ao_sparsity.py
test_appending_byte_serializer.py Check integrity of bytes in AppendingByteSerializer (#152139) 2025-04-26 18:10:58 +00:00
test_autocast.py Enable TemporaryFileName tests on Windows (#146311) 2025-02-07 06:06:18 +00:00
test_autograd_fallback.py
test_autograd.py Fix test_side_stream_backward_overlap flakiness (#153963) 2025-05-20 21:02:56 +00:00
test_autoload.py
test_binary_ufuncs.py Fix lerp weight type promotion (#141117) 2025-01-24 01:18:20 +00:00
test_bundled_images.py
test_bundled_inputs.py
test_ci_sanity_check_fail.py
test_comparison_utils.py
test_compile_benchmark_util.py
test_complex.py
test_content_store.py torch.utils._content_store: fix error in hash_storage on XPU (#147785) 2025-02-26 23:57:59 +00:00
test_cpp_api_parity.py Enable C++ API parity tests on AArch64 (#145370) 2025-01-30 22:42:49 +00:00
test_cpp_extensions_aot.py Make python_agnostic cpp extension tests standalone (#153274) 2025-05-20 19:18:09 +00:00
test_cpp_extensions_jit.py xpu: get xpu arch flags at runtime in cpp_extensions (#152192) 2025-05-09 05:43:50 +00:00
test_cpp_extensions_mtia_backend.py Revert "Generalize poison fork logic for each device backend (#144664)" 2025-04-10 21:02:14 +00:00
test_cpp_extensions_open_device_registration.py [Openreg][PrivateUse1] Improve openreg module capabilities (#151000) 2025-04-12 17:21:35 +00:00
test_cpp_extensions_stream_and_event.py [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404) 2025-04-25 20:15:04 +00:00
test_cuda_expandable_segments.py
test_cuda_multigpu.py [CUDA] try to abate some flakiness in test_stream_event_nogil (#148796) 2025-03-12 19:12:50 +00:00
test_cuda_nvml_based_avail.py
test_cuda_primary_ctx.py Remove outdated skipIfRocmVersionLessThan decorations (#148941) 2025-03-11 18:37:40 +00:00
test_cuda_sanitizer.py
test_cuda_trace.py
test_cuda.py make use_mem_pool threadlocal (#153356) 2025-05-13 00:16:07 +00:00
test_custom_ops.py Inductor respects exact strides on custom ops by default (#150511) 2025-05-03 00:02:24 +00:00
test_dataloader.py Enable more nightly tests on s390x (#148452) 2025-03-18 16:09:39 +00:00
test_datapipe.py Remove unactivated test (#146233) 2025-02-04 05:26:04 +00:00
test_decomp.py Update ruff linter for PEP585 (#147540) 2025-02-22 04:45:17 +00:00
test_deploy.py
test_determination.py
test_dispatch.py [BE][CI] bump ruff to 0.9.0: string quote styles (#144569) 2025-02-24 19:56:09 +00:00
test_dlpack.py
test_dynamic_shapes.py Support using SymInt shapes for torch.baddbmm no-broadcast case (#153112) 2025-05-08 21:34:24 +00:00
test_expanded_weights.py
test_extension_utils.py Move privateuse1 test out of test_utils and make them serial (#145380) 2025-01-23 00:31:39 +00:00
test_fake_tensor.py Revert "Fix fake tensor caching when output has unbacked (#153034)" 2025-05-20 06:02:38 +00:00
test_file_check.py
test_flop_counter.py Build RowwiseScaledMM.cu for SM89 (#145676) 2025-02-01 11:44:58 +00:00
test_foreach.py Synchronize in foreach tests after profiling (#152857) 2025-05-06 00:56:48 +00:00
test_function_schema.py
test_functional_autograd_benchmark.py Enable Windows tests (#146666) 2025-02-08 00:55:20 +00:00
test_functional_optim.py
test_functionalization_of_rng_ops.py
test_functionalization.py
test_futures.py
test_fx_experimental.py PEP585: Add noqa to necessary tests (#146391) 2025-02-12 15:29:50 +00:00
test_fx_passes.py
test_fx_reinplace_pass.py
test_fx.py [BE]: Update ruff to 0.11.8 (#153249) 2025-05-12 18:30:52 +00:00
test_hop_infra.py Support torch.compile rng selective activation checkpointing with cudagraph (#146878) 2025-02-28 00:47:03 +00:00
test_hub.py
test_import_stats.py
test_indexing.py [ROCm] Improve backwards indexing when stride is not one (#147630) 2025-03-11 19:02:48 +00:00
test_itt.py
test_jit_autocast.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_jit_disabled.py
test_jit_fuser_legacy.py
test_jit_fuser_te.py [BE][CI] bump ruff to 0.9.0: string quote styles (#144569) 2025-02-24 19:56:09 +00:00
test_jit_fuser.py
test_jit_legacy.py
test_jit_llga_fuser.py
test_jit_profiling.py
test_jit_simple.py
test_jit_string.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_jit.py [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
test_jiterator.py
test_kernel_launch_checks.py
test_legacy_vmap.py
test_license.py Fix license check for setuptools>=77 (#151158) 2025-04-12 13:41:12 +00:00
test_linalg.py torch.tensordot: performance improvements when contracting to a scalar. (#145936) 2025-05-13 10:57:30 +00:00
test_logging.py
test_masked.py
test_maskedtensor.py
test_matmul_cuda.py [CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655) 2025-05-20 16:18:35 +00:00
test_meta.py [BE] Migrate dtype_abbrs into one location (#152229) 2025-04-28 03:52:47 +00:00
test_metal.py
test_mkl_verbose.py
test_mkldnn_fusion.py
test_mkldnn_verbose.py
test_mkldnn.py Support fp8 output of _scaled_mm for CPU (#153600) 2025-05-22 01:15:39 +00:00
test_mobile_optimizer.py
test_model_exports_to_core_aten.py [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/ (#149595) 2025-04-03 23:50:13 +00:00
test_module_tracker.py
test_modules.py Disable slow gradcheck for nn.Transformer ModuleInfo (#145531) 2025-01-25 00:58:03 +00:00
test_monitor.py
test_mps.py [MPS] Fix float64 scalar tensor handling (#153582) 2025-05-15 05:15:14 +00:00
test_multiprocessing_spawn.py Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
test_multiprocessing.py Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
test_namedtensor.py
test_namedtuple_return_api.py
test_native_functions.py
test_native_mha.py
test_nestedtensor.py Rewrite autograd producer consumer stream sync logic (#151079) 2025-05-16 15:42:22 +00:00
test_nn.py [CUDA][cuDNN] Fix handling of CPU side input and target length tensors in CTCLoss (#152745) 2025-05-07 22:01:18 +00:00
test_nnapi.py
test_numba_integration.py Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
test_numpy_interop.py
test_openmp.py
test_openreg.py [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914) 2025-05-04 09:42:08 +00:00
test_ops_fwd_gradients.py
test_ops_gradients.py Enable more nightly tests on s390x (#148452) 2025-03-18 16:09:39 +00:00
test_ops_jit.py
test_ops.py [ROCm] unkip test_non_standard_bool except for failings ops (#152956) 2025-05-13 15:55:42 +00:00
test_optim.py Fix test/test_optim.py error message. (#153076) 2025-05-07 22:46:05 +00:00
test_out_dtype_op.py
test_overrides.py [dynamo] Remove traceable_tensor_subclasses-related code (#151062) 2025-04-15 03:55:35 +00:00
test_package.py
test_per_overload_api.py
test_prims.py
test_proxy_tensor.py Support C++ statically_known_true (#151346) 2025-04-18 06:42:12 +00:00
test_pruning_op.py
test_public_bindings.py Remove public_allowlist from TestPublicBindings.test_correct_module_names and ensure private_allowlist-ed things are actually private (#145620) 2025-01-27 17:30:02 +00:00
test_python_dispatch.py Make DispatchKeySet serializable; add __eq__ (#152732) 2025-05-03 14:40:06 +00:00
test_pytree.py [pytree] Register normal class to register_dataclass (#147752) 2025-04-01 23:28:20 +00:00
test_quantization.py [BE]: Update ruff to 0.11.8 (#153249) 2025-05-12 18:30:52 +00:00
test_reductions.py Treat dim=[] same as dim=None (#153570) 2025-05-20 22:44:29 +00:00
test_scatter_gather_ops.py Reland fast gather and index implementation (#151917) 2025-04-23 19:13:13 +00:00
test_schema_check.py
test_segment_reductions.py
test_serialization.py Make torch.serialization.skip_data work with torch.load (#148018) 2025-03-06 12:04:46 +00:00
test_set_default_mobile_cpu_allocator.py
test_shape_ops.py [Quant] flip: throw runtime error for QUInt4x2 and QUInt2x4 input (#147430) 2025-02-25 03:47:40 +00:00
test_show_pickle.py
test_sort_and_select.py Fix linter F821 error (#146665) 2025-02-08 07:19:37 +00:00
test_sparse_csr.py [ROCm] improve sparse addmm, enable complex (#153262) 2025-05-19 22:23:18 +00:00
test_sparse_semi_structured.py API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536) 2025-05-14 23:36:53 +00:00
test_sparse.py [ROCm] improve sparse addmm, enable complex (#153262) 2025-05-19 22:23:18 +00:00
test_spectral_ops.py Re-add stft option to align window for center = false (#146379) 2025-02-06 14:07:13 +00:00
test_stateless.py
test_static_runtime.py
test_subclass.py
test_sympy_utils.py [Inductor] Expand Identity ops prior to block pattern matching (#146000) 2025-02-08 18:11:53 +00:00
test_tensor_creation_ops.py [Inductor] Add input value checking to randint meta function (#147191) 2025-02-25 02:18:16 +00:00
test_tensorboard.py
test_tensorexpr_pybind.py
test_tensorexpr.py
test_testing.py [Torch] Fix crash when comparing fp8 tensors that have more than 1 dimension (#153508) 2025-05-15 08:41:46 +00:00
test_throughput_benchmark.py Fix Throughputbenchmark issue (#144669) 2025-01-26 03:37:20 +00:00
test_torch.py convert guard_size_oblivious to runtime check in infer_size_impl (#148872) 2025-05-13 00:32:28 +00:00
test_transformers_privateuse1.py [OpenReg] Move SDPA to OpenReg from open_registration_extension.cpp (#153309) 2025-05-13 03:49:19 +00:00
test_transformers.py Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)" 2025-05-14 20:53:49 +00:00
test_type_hints.py
test_type_info.py
test_type_promotion.py
test_typing.py
test_unary_ufuncs.py Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
test_utils_config_module.py Add check that envvar configs are boolean (#145454) 2025-02-05 19:40:10 +00:00
test_utils_filelock.py
test_utils.py [utils] add try_import method for importing optional modules (#145528) 2025-01-25 00:14:07 +00:00
test_view_ops.py Fix overflow in checkInBoundsForStorage (#147352) 2025-02-27 15:48:50 +00:00
test_vulkan.py
test_weak.py Consistently use load_torchbind_test_lib in tests (#148082) 2025-03-03 19:37:28 +00:00
test_xnnpack_integration.py [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
test_xpu.py Record the XPU and XCCL build settings in the compiled binary (#147161) 2025-05-20 09:21:39 +00:00