Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 12:21:27 +01:00
## PR

There are a few cases that my previous PR (#153220) didn't cover.

1. The LHS/RHS matters. Today, if you do `torch._check(lhs == rhs)`, it shows up as a deferred runtime assert with `Eq(lhs, rhs)`.
2. There can be transitive replacements, for example `expr1 -> expr2 -> u0`. `test_size_with_unbacked_add_expr_transitive` tests for this.
3. An unbacked symint expr may not have a replacement that's purely a symbol; the replacement could be another expression. `test_size_with_unbacked_add_and_mul_expr` tests for this.

## Device assertion msg

```
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
```

## Autotuning code setup

This is the autotuning code for a concat kernel that takes input tensors (`in_buf`) and writes them to the output buffer (`out_buf`). It's important to note that the sizes of `in_buf0` and `in_buf1` don't match along dim=0. This is bad because all concat inputs must share the same size for each dim except for the concat dim (here that's dim=1).

```
in_buf0 = generate_example_value(size=(u1 + s0, 256)) # concrete size is (17900, 256)
in_buf1 = generate_example_value(size=(u0, 10)) # concrete size is (8192, 10)
...
out_buf = generate_example_value(size=(u1 + s0, 266)) # concrete size is (17900, 256+10)
triton_poi_fused_cat_1.run(in_buf0, in_buf1, ..., out_buf, xnumel=(u1 + s0) * 266 ...)
```

If we look into the kernel code, you'll see that `tmp9` loads `in_buf1` (our incorrectly shaped input tensor). There is also a mask to prevent OOB loads.

- `tmp6` makes sure we're only loading with the `xindex` from 256 to 264.
- `xmask` makes sure we're only loading with the `xindex` within `xnumel`.
- `tmp6 & xmask` together is essentially checking `256 ≤ x0 < 264` and `0 ≤ x1 < u1 + s0`.

The mask logic is correct. However, `in_buf1` has the shape `[8192, 10]`, which means any load where `8192 ≤ x1 < u1 + s0` will be an OOB load.

```
def triton_poi_fused_cat_1(in_buf0, in_buf1, ..., out_buf, xnumel, XBLOCK):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    x0 = (xindex % 264)
    x1 = xindex // 264
    ...
    tmp6 = x0 >= tl.full([1], value=256)
    tmp9 = tl.load(in_buf1 + (x1), tmp6 & xmask)
    # device assertion is thrown here
    tl.device_assert(((0 <= tl.broadcast_to(tmp13, [XBLOCK])) & (tl.broadcast_to(tmp13, [XBLOCK]) < ks0)) | ~(xmask & tmp6), "index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153768
Approved by: https://github.com/jingsh
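
To make the out-of-bounds argument concrete, here is a minimal pure-Python sketch of the mask arithmetic. It is not Inductor or Triton code: `mask_allows` and `first_oob_index` are hypothetical helpers, the index names follow the kernel snippet (`x0 = xindex % 264`, `x1 = xindex // 264`), and the sizes are the concrete values quoted in the autotuning setup.

```python
# Hypothetical stand-ins for the kernel's index math; constants are copied
# from the kernel snippet in the commit message.
ROW_STRIDE = 264   # divisor used to compute x0 and x1
COL_START = 256    # tmp6 is true for x0 >= 256


def mask_allows(xindex, xnumel):
    """Mirror `tmp6 & xmask`: index is within xnumel and x0 >= 256."""
    x0 = xindex % ROW_STRIDE
    return x0 >= COL_START and xindex < xnumel


def first_oob_index(rows_out, rows_in1):
    """Return the first flat index the mask allows whose row x1 lies past
    the rows actually allocated for in_buf1, or None if there is none."""
    xnumel = rows_out * ROW_STRIDE
    # Rows below rows_in1 are always in bounds, so start at row rows_in1.
    for xindex in range(rows_in1 * ROW_STRIDE, xnumel):
        if mask_allows(xindex, xnumel):
            return xindex
    return None


# Mismatched sizes from the autotuning setup: u1 + s0 = 17900, u0 = 8192.
bad = first_oob_index(rows_out=17900, rows_in1=8192)
print(bad, bad // ROW_STRIDE, bad % ROW_STRIDE)  # 2162944 8192 256

# With matching row counts the mask never admits an out-of-range row.
print(first_oob_index(rows_out=17900, rows_in1=17900))  # None
```

The mask only bounds the index against the output shape, so it cannot protect a buffer that was allocated with fewer rows than the output. The sketch suggests the problem lies in the mismatched example sizes rather than in the mask itself.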
| | | |
|---|---|---|
| autoheuristic | ||
| codegen | ||
| compile_worker | ||
| fx_passes | ||
| kernel | ||
| package | ||
| runtime | ||
| __autotune_main__.py | ||
| __init__.py | ||
| analyze_preserves_zero_mask.py | ||
| aoti_eager.py | ||
| async_compile.py | ||
| autotune_process.py | ||
| bounds.py | ||
| choices.py | ||
| codecache.py | ||
| comm_analysis.py | ||
| comm_lowering.py | ||
| comms.py | ||
| compile_fx_async.py | ||
| compile_fx_ext.py | ||
| compile_fx_subproc.py | ||
| compile_fx.py | ||
| compiler_bisector.py | ||
| config.py | ||
| constant_folding.py | ||
| cpp_builder.py | ||
| cpu_vec_isa.py | ||
| cudagraph_trees.py | ||
| cudagraph_utils.py | ||
| custom_graph_pass.py | ||
| debug.py | ||
| decomposition.py | ||
| dependencies.py | ||
| dtype_propagation.py | ||
| exc.py | ||
| extern_node_serializer.py | ||
| freezing_utils.py | ||
| freezing.py | ||
| fuzzer.py | ||
| fx_utils.py | ||
| graph.py | ||
| hooks.py | ||
| index_propagation.py | ||
| inductor_prims.py | ||
| ir.py | ||
| jagged_lowerings.py | ||
| loop_body.py | ||
| lowering.py | ||
| memory.py | ||
| metrics.py | ||
| mkldnn_ir.py | ||
| mkldnn_lowerings.py | ||
| mock_cache.py | ||
| ops_handler.py | ||
| optimize_indexing.py | ||
| output_code.py | ||
| pattern_matcher.py | ||
| quantized_lowerings.py | ||
| remote_cache.py | ||
| scheduler.py | ||
| script.ld | ||
| select_algorithm.py | ||
| sizevars.py | ||
| standalone_compile.py | ||
| subgraph_lowering.py | ||
| template_heuristics.py | ||
| test_case.py | ||
| test_operators.py | ||
| triton_bundler.py | ||
| utils.py | ||
| virtualized.py | ||
| wrapper_benchmark.py | ||