Commit Graph

327 Commits

Author SHA1 Message Date
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Edward Z. Yang
efa36ef092 Natively support int truncation, don't guard on positive/negative (#122827)
This doesn't entirely fix the original problem that prompted this, but
it seems to just be getting stuck in export constraint formatting now
which seems like progress to me.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827
Approved by: https://github.com/avikchaudhuri
2024-04-11 15:22:32 +00:00
leslie-fang-intel
bac2a39aee [Inductor] [ReImplement] Outer Loop Fusion for CPP Backend (#121625)
**Summary**
Re-implementation of https://github.com/pytorch/pytorch/pull/121064

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_outer_loop_fusion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121625
Approved by: https://github.com/lezcano, https://github.com/jgong5
2024-04-05 06:24:57 +00:00
Gao Tianlin
aaef246c74 remove log2 decomposition; add log2 lowering (#123112)
Same reason as `log10`: `log2` is a core aten op, so we should not decompose it. As https://github.com/pytorch/pytorch/pull/110882 suggested, it often maps to a hardware intrinsic. Furthermore, decomposing it will negatively impact the numerical precision of the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123112
Approved by: https://github.com/peterbell10
2024-04-02 16:16:26 +00:00
Peter Bell
09c72eaa3f [inductor] Remove identity from ops.scan (#119727)
Currently scan has an `init` argument which must be the identity of the
combine function. This isn't strictly necessary if we are more careful about
keeping track of the first element and avoid combining it with anything.

This does additionally require that there are no active load masks, since we can't
do the `where_cond` any more. However, this shouldn't be possible anyway since
scans are always realized and only fused via the scheduler.
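
A minimal Python sketch of the idea, not Inductor's actual codegen: an inclusive scan that seeds the accumulator with the first element instead of requiring an `init` identity. The helper name `scan_without_identity` is hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def scan_without_identity(combine: Callable[[T, T], T], values: Iterable[T]) -> List[T]:
    """Inclusive scan that never needs an identity element for `combine`."""
    out: List[T] = []
    have_acc = False
    acc = None
    for v in values:
        if have_acc:
            acc = combine(acc, v)
        else:
            # First element: there is nothing to combine with, so pass it through.
            acc = v
            have_acc = True
        out.append(acc)
    return out


print(scan_without_identity(max, [3, 1, 4, 1, 5]))              # [3, 3, 4, 4, 5]
print(scan_without_identity(lambda a, b: a + b, [1, 2, 3, 4]))  # [1, 3, 6, 10]
```

Because the first element is tracked explicitly, the same routine works for combine functions such as `max` without having to supply a value like negative infinity.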

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119727
Approved by: https://github.com/lezcano
2024-04-01 22:47:26 +00:00
Edward Z. Yang
3178ba0dc9 Don't use sympy Float functions, use an opaque one with no reasoning (#122823)
Sympy simplifications don't obey floating point semantics, so don't
use Sympy for this.  Keep them as is, only evaluate with the reference
implementations when all arguments are known.
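
As a small illustration of why algebraic simplification and floating point evaluation can disagree (plain Python, not code from this PR; the function name is made up for the example):

```python
def add_then_subtract(x: float, big: float) -> float:
    """Evaluate (x + big) - big in IEEE-754 double precision."""
    return (x + big) - big


x = 1.0
big = 1e16

# Algebraically, (x + big) - big simplifies to x.
algebraic = x
# In double precision, x is absorbed when it is added to big,
# so the subtraction returns 0.0 instead of 1.0.
evaluated = add_then_subtract(x, big)

print(algebraic, evaluated)  # 1.0 0.0
```

A symbolic engine that rewrites the expression to `x` would therefore change the numerical result, which is the kind of mismatch the opaque functions avoid.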

This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.

This doesn't actually depend on the earlier diffs on the stack and I can detach it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
2024-03-29 19:13:55 +00:00
eellison
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:

- We select an extern kernel, but an epilogue like relu() exists such that choosing a triton template + relu would have been faster.
- We select a triton template and fuse the epilogue, and register spilling occurs, making it slower than not fusing the epilogue.

In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogues are worth fusing. We could potentially defer choosing the template in this case in a follow-up, at the expense of compile time.

Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
Isuru Fernando
409b1a6081 Add lowering for cummax, cummin (#120429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120429
Approved by: https://github.com/peterbell10
2024-03-15 19:04:38 +00:00
eellison
6ca9ae4f86 Express y grid > 2^16 in terms of z grid (#121554)
CUDA has a max y_grid of 65535. If we need a y grid larger than that, we can express it in terms of the z grid, which is currently unused in inductor codegen.
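
One way to picture the decomposition, as a sketch under assumptions rather than the code in this PR (`split_y_grid` and `linear_y_index` are hypothetical helper names, and `y_blocks` is assumed to be at least 1):

```python
from typing import Tuple

MAX_Y_GRID = 65535  # CUDA limit on the y grid dimension, per the message above


def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division without going through floats."""
    return -(-a // b)


def split_y_grid(y_blocks: int) -> Tuple[int, int]:
    """Return (y_grid, z_grid) with y_grid <= MAX_Y_GRID and y_grid * z_grid >= y_blocks."""
    z_grid = ceildiv(y_blocks, MAX_Y_GRID)
    y_grid = ceildiv(y_blocks, z_grid)
    return y_grid, z_grid


def linear_y_index(y_pid: int, z_pid: int, y_grid: int) -> int:
    """Recover the logical y block index from the (y, z) program ids."""
    return z_pid * y_grid + y_pid


print(split_y_grid(65535))  # (65535, 1)
print(split_y_grid(65536))  # (32768, 2)
```

Since `y_grid * z_grid` can slightly exceed `y_blocks`, a kernel using this scheme would still need to mask out logical indices that fall past the end.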

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554
Approved by: https://github.com/aakhundov
2024-03-12 02:36:19 +00:00
Elias Ellison
5b5d423c2e Benchmark templates (#118880)
Adding support for benchmarking templates in `benchmark_fusion`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880
Approved by: https://github.com/shunting314
2024-03-11 23:55:13 +00:00
Peter Bell
168a04e752 [inductor] Changes to support newer triton pin (#121267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121267
Approved by: https://github.com/lezcano
ghstack dependencies: #121438
2024-03-09 18:17:36 +00:00
Peter Bell
459c5bca58 [inductor] Refactor common triton imports into one function (#121438)
This means when codegen depends on a particular import we only need to
add it in one place and it's applied to all triton kernels.

This also changes codegen slightly so instead of generating
`@pointwise` we now generate `@triton_heuristics.pointwise` just so
the imports are the same for all kernel types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
2024-03-09 18:17:36 +00:00
Peter Bell
8887c95004 [inductor] Skip welford combine on first reduction loop iteration (#121488)
On the first iteration we short circuit `welford_reduce` since we know
the accumulators are filled with the default values.

This is split out from #120330 to hopefully avoid the meta-internal failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121488
Approved by: https://github.com/lezcano
2024-03-08 23:40:48 +00:00
Oguz Ulgen
6566b3db67 Add an autotune cache for inductor generated kernels (#120963)
Summary: Inductor currently has a best config cache for kernels that it generates. This is a local cache done via writing to the file system. This diff takes this local cache to remote by reusing the existing triton caching mechanism built via Memcache internally and Redis externally.

Test Plan:
tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1`

Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk

The plan is to land this diff with this turned off and gradually introduce this.

Differential Revision: D54398076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963
Approved by: https://github.com/jansel
2024-03-04 16:58:37 +00:00
PyTorch MergeBot
0b924d7cde Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 7eb7ac815f.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/kit1980 due to Broke internal tests, see D54230858 ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1971878323))
2024-02-29 20:12:50 +00:00
Adnan Akhundov
0a46102b37 Add equal_to_1 to triton_meta for user-written Triton kernels (#120579)
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.

Fixes #120478. The repro from the issue, on A100:

Before this PR:

```
Triton matmul:           0.0167 seconds
Triton matmul compiled:  0.0751 seconds
```

After this PR:

```
Triton matmul:           0.0168 seconds
Triton matmul compiled:  0.0072 seconds
```

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k  test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
2024-02-29 05:19:39 +00:00
Peter Bell
7eb7ac815f [inductor] Optimize welford reduction (#120330)
This does two things,
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.
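
The two changes can be sketched on scalars in plain Python. This is an illustration of the technique only, not the Triton helper itself; the signature of `welford_step` here is an assumption made for the example.

```python
from typing import Iterable, Tuple


def welford_step(
    value: float, mean: float, m2: float, weight: float, first_iteration: bool
) -> Tuple[float, float, float]:
    """One streaming Welford update returning (mean, m2, weight)."""
    if first_iteration:
        # The accumulators still hold their defaults, so the result is known
        # without reading them: mean = value, m2 = 0, weight = 1.
        return value, 0.0, 1.0
    delta = value - mean
    new_weight = weight + 1.0
    inv_weight = 1.0 / new_weight          # reciprocal computed once
    new_mean = mean + delta * inv_weight   # multiply instead of divide
    new_m2 = m2 + delta * (value - new_mean)
    return new_mean, new_m2, new_weight


def welford(values: Iterable[float]) -> Tuple[float, float, float]:
    mean, m2, weight = 0.0, 0.0, 0.0
    for i, v in enumerate(values):
        mean, m2, weight = welford_step(v, mean, m2, weight, first_iteration=(i == 0))
    return mean, m2, weight


mean, m2, weight = welford([1.0, 2.0, 3.0, 4.0])
print(mean, m2 / weight)  # mean and population variance
```

For a reduction with only a handful of elements, the first iteration is a large share of the work, which is why skipping the accumulator there matters most for small `rnumel`.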

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-26 17:01:47 +00:00
Isuru Fernando
b7df3bba62 add decomposition for frexp (#119217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119217
Approved by: https://github.com/peterbell10
ghstack dependencies: #119284, #120027
2024-02-23 21:52:42 +00:00
PyTorch MergeBot
2892d2f31b Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 4c6ba16f82.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/jeffdaily due to broke ROCm CI while ROCm was in unstable status ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1961623739))
2024-02-23 16:24:52 +00:00
Peter Bell
4c6ba16f82 [inductor] Optimize welford reduction (#120330)
This does two things,
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-22 23:54:24 +00:00
wangjiangben-hw
26610175d2 pass device_str for async_compile.triton function (#120202)
Fixes #120203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120202
Approved by: https://github.com/jansel
2024-02-21 03:48:57 +00:00
Shunting Zhang
800e9acd43 [inductor] fix bandwidth estimation for StarDep (#120266)
A lot of HF models fail when inductor_config.benchmark_kernel is enabled. The reason is that the bandwidth estimation code assumes every dependency has an index, but StarDep does not. An exception is raised when StarDep.index is accessed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120266
Approved by: https://github.com/eellison, https://github.com/jansel
2024-02-21 03:33:45 +00:00
wangjiangben-hw
20f7e5a719 Remove dependency of triton during inductor codegen (#120193)
Fixes #120192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120193
Approved by: https://github.com/jansel
2024-02-21 01:09:48 +00:00
Jason Ansel
d74bdd5042 [inductor] Always allow 64 bit in next_power_of_2 (#120164)
see #120153 #120152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120164
Approved by: https://github.com/yanboliang
2024-02-18 03:22:46 +00:00
wangjiangben-hw
0c972c7c4e enhance next_power_of_2 function (#120153)
Fixes #120152

cc  @ezyang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120153
Approved by: https://github.com/jansel
2024-02-17 20:18:46 +00:00
Shunting Zhang
36e118b810 [inductor] logging meta data for inductor generated triton kernel (#120048)
I want to log metadata for inductor generated triton kernels for a couple of purposes:
1. With this metadata, it should be convenient to find unaligned reduction kernels and try the idea here https://github.com/pytorch/pytorch/issues/119929 . I think it's nice to try it on kernels that are used in real models.
2. I'm thinking that based on the collected kernel metadata, I can build a simple offline tool that benchmarks each kernel with ncu and augments each kernel's metadata with: latency, theoretical membw (estimated memory access / latency), and actually achieved membw. Hopefully this can point us to some good optimization opportunities.

Command:
```
TORCHINDUCTOR_CACHE_DIR=`realpath ~/inductor-caches/kernel-metadata-log` TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training
```

The best practice here is to point inductor cache to a folder outside of /tmp so that one can always run the kernel again based on the path stored in kernel metadata. (folders under /tmp may get removed by the system)

Here are the first 1000 rows of collected metadata for huggingface: https://gist.github.com/shunting314/cf4ebdaaaa7e852efcaa93524c868e5f

And here are all 10K kernels collected for huggingface. The gist cannot be rendered as a csv since it's too large: https://gist.github.com/shunting314/7f841528e2debdc2ae05dece4ac591be .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120048
Approved by: https://github.com/jansel
2024-02-17 02:09:27 +00:00
Adnan Akhundov
e5f46a1d35 Check alignment of ReinterpretView args of custom Triton kernels (#119649)
Summary: Currently, when a custom (user-written) Triton kernel has a ReinterpretView argument in IR, we're always skipping the alignment checking for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (via setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where, albeit reinterpreted, the argument of the Triton kernel (the data pointer) can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2x--3x.

In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.
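
As a rough sketch of what an offset-based check amounts to (an illustration under assumptions, not the code in `codegen/common.py`; `view_is_aligned` and its parameters are hypothetical names):

```python
ALIGNMENT = 16  # bytes, matching the "aligned to 16" wording above


def view_is_aligned(base_is_aligned: bool, offset_elems: int, itemsize: int) -> bool:
    """A reinterpreted view is aligned if its base buffer is aligned and the
    byte offset of the view into that buffer is a multiple of ALIGNMENT."""
    byte_offset = offset_elems * itemsize
    return base_is_aligned and byte_offset % ALIGNMENT == 0


# float32 (4 bytes): an offset of 4 elements is 16 bytes, so alignment is kept.
print(view_is_aligned(True, offset_elems=4, itemsize=4))  # True
# An offset of 1 element is 4 bytes, which breaks 16-byte alignment.
print(view_is_aligned(True, offset_elems=1, itemsize=4))  # False
```

Size and stride do not appear in the check, which matches the note above that they do not change the alignment of the data pointer.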

Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
2024-02-11 20:21:17 +00:00
Pearu Peterson
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
Peter Bell
c0f1183eb4 [inductor] Fix compile error on scan with no mask (#119555)
Fixes #119591

Currently this results in invalid syntax:
```python
tmp4 = tl.where(, tmp1, tmp2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119555
Approved by: https://github.com/lezcano
2024-02-10 12:38:40 +00:00
Elias Ellison
bf8a5a11be Fix Inductor CSE Across Separate Reductions (#119410)
We were CSE'ing a load across two separate reduction loop bodies. This is because we were examining an indirect indexing that did not have an explicit rindex in its load. I've commented with more details and other potentials on the fix.

Tried using the minifier unsuccessfully and hand-minified some, but could do more.

Fix for https://github.com/pytorch/pytorch/issues/119327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119410
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-02-09 19:34:57 +00:00
Peter Bell
88429a8084 [inductor] Add split scan kernel (#117992)
This PR adds a new type of triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I have to be able to block fusions of split scan
operations with reductions, so I chose to add a new `ir.SplitScan` node which
is identical but allows for differentiation in the scheduler.

The split scan kernel is also the first to require an additional workspace buffer
which is used to communicate between cuda blocks. This is slightly tricky as
the exact scratch space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum rblock size and always allocating
to the maximum possible grid size for a given input tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
2024-02-09 01:56:00 +00:00
Peter Bell
01edb8a559 [inductor] Refactor triton range_tree handling (#117991)
Currently the dimension handling in triton kernels has various special cases e.g.
- handling "r" for non-reduction vs persistent reduction vs non-persistent reduction.
- handling "x" when `no_x_dim` is set

This adds three new properties to the range tree objects which capture the
same information in a more generic way:
- `is_loop`: true for the "r" dimension of a non-persistent reduction
- `tensor_dim`: Optional index of the triton tensor dimension
- `grid_dim`: Optional index of the triton grid dimension

The motivation here is I want to add a new split scan kernel type which is:
- not a persistent reduction, yet has `is_loop=False` for the "r" dimension
- Has a `grid_dim` for the "r" dimension

These flags now only need to be set once in `initialize_range_trees`, instead of having
to infer them throughout the code based on the tree prefix and various other kernel flags.
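
To make the three properties concrete, here is a hypothetical stand-in written as a dataclass. The class name `RangeTreeInfo` and the specific index values are assumptions for illustration; only the property names and the combinations for the "r" dimension come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangeTreeInfo:
    """Hypothetical container for the per-dimension flags described above."""

    prefix: str                # "x", "y", "r", ...
    is_loop: bool              # dimension is iterated by a loop inside the kernel
    tensor_dim: Optional[int]  # index of the triton tensor dimension, if any
    grid_dim: Optional[int]    # index of the triton grid dimension, if any


# Illustrative settings for the "r" dimension (index values are assumed).
R_NON_PERSISTENT = RangeTreeInfo("r", is_loop=True, tensor_dim=1, grid_dim=None)
R_PERSISTENT = RangeTreeInfo("r", is_loop=False, tensor_dim=1, grid_dim=None)
R_SPLIT_SCAN = RangeTreeInfo("r", is_loop=False, tensor_dim=1, grid_dim=1)


def uses_program_id(tree: RangeTreeInfo) -> bool:
    """Codegen can ask the tree directly instead of inferring from its prefix."""
    return tree.grid_dim is not None


for tree in (R_NON_PERSISTENT, R_PERSISTENT, R_SPLIT_SCAN):
    print(tree.is_loop, uses_program_id(tree))
```

The split scan case is the one that the old prefix-based checks could not express: no loop over "r", yet "r" is still spread across the grid.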

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117991
Approved by: https://github.com/lezcano
2024-02-09 01:56:00 +00:00
Pearu Peterson
7ec6ac89e8 Add lowering to special.modified_bessel_i0 (#118993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118993
Approved by: https://github.com/peterbell10
2024-02-08 18:42:40 +00:00
Andrew M. James
884b6d2a67 [inductor] Implementing missing magic methods on IR values. (#118933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118933
Approved by: https://github.com/peterbell10
2024-02-06 05:50:26 +00:00
Yang Chen
b2e0f8d82d [mypy] added type annotations to codegen_nodes methods (#119080)
added correct type annotations to scheduler and backends'
codegen_nodes methods

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119080
Approved by: https://github.com/eellison
2024-02-05 18:33:52 +00:00
Edward Z. Yang
abc09b27b9 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete" but I in particular wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing if you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-02-04 00:19:00 +00:00
Pearu Peterson
a69016a741 Add lowering to special.bessel_j1 (#118992)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118992
Approved by: https://github.com/peterbell10
2024-02-02 20:16:08 +00:00
PyTorch MergeBot
dbba1d4bf5 Revert "Some minor type stub improvements (#118529)"
This reverts commit c978f38bd4.

Reverted https://github.com/pytorch/pytorch/pull/118529 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118529#issuecomment-1922362331))
2024-02-01 22:18:36 +00:00
Yang Chen
61b572ed56 [inductor] more accurate throughput calculations for kernel benchmarks (#118858)
Our current throughput calculations for kernel benchmarks have some issues,
particularly when we slice inputs in the kernel. In such cases, we count
the original inputs as part of the memory traffic passed across the kernel.
This is incorrect because it may result in a much larger throughput
calculation, which can even exceed the theoretical bandwidth.

Instead, we should only count the size of the "slices" that contribute to
the actual memory traffic.
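
A small arithmetic sketch of the difference, with made-up sizes and a made-up latency (none of these numbers come from the PR, and `throughput_gbps` is a hypothetical helper):

```python
def throughput_gbps(num_bytes: int, latency_ms: float) -> float:
    """Achieved bandwidth in GB/s for `num_bytes` moved in `latency_ms`."""
    return num_bytes / (latency_ms * 1e-3) / 1e9


ITEMSIZE = 4                                # float32
full_input_bytes = 1024 * 1024 * ITEMSIZE   # whole buffer that the slice views into
slice_bytes = 1024 * 128 * ITEMSIZE         # the part the kernel actually reads
output_bytes = slice_bytes                  # assume the output matches the slice size
latency_ms = 0.01                           # made-up measurement

overcounted = throughput_gbps(full_input_bytes + output_bytes, latency_ms)
corrected = throughput_gbps(slice_bytes + output_bytes, latency_ms)

print(f"counting the full input: {overcounted:.1f} GB/s")
print(f"counting only the slice: {corrected:.1f} GB/s")
```

With these numbers the overcounted figure is 4.5 times the corrected one, which shows how counting the original input can push the reported throughput past what the hardware can deliver.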

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858
Approved by: https://github.com/jansel
2024-02-01 21:42:14 +00:00
Andrew M. James
9c2b43cc50 [inductor] Handle special values correctly in ir.Scan codegen (#118788)
Special values (`NaN`/`+/-Inf`) are not handled correctly during codegen for `ir.Scan` nodes. This
is a fairly minor bugfix that has not come up since the only two scan
ops with lowerings use "normal" values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118788
Approved by: https://github.com/peterbell10
2024-02-01 14:54:20 +00:00
Edward Z. Yang
c978f38bd4 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete" but I in particular wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing if you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-01-31 20:56:56 +00:00
Pearu Peterson
2327879fb6 Add lowering to special.bessel_j0 (2nd try) (#118565)
This PR is a copy of https://github.com/pytorch/pytorch/pull/118464 that was merged without using pytorchbot. Sorry for the noise!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118565
Approved by: https://github.com/peterbell10
2024-01-30 15:26:59 +00:00
Edward Z. Yang
cad79bd0bb Remove follow_imports = skip from sympy (#118469)
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.

The suppressions were added automatically with the following script generated by GPT-4:

```python
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
2024-01-28 13:38:38 +00:00
laith sakka
708e6241ed Fix sympy_subs to preserve integer and non-negative properties. (#118150)
This diff introduces the following changes:
1. Fix sympy_subs to preserve the integer and non-negative properties of the replaced symbol when the replacement is a string.
Why is this needed?
I was compiling the expression x*abs(y) where y = -2.
The expression is passed as ``s1*abs(s0)``, then s0 is replaced with ks0 by a call to sympy_subs.
But sympy_subs used to replace s0 (integer=False, nonnegative=False) with ks0 (integer=True, nonnegative=True),
resulting in ``x*abs(ks0) = x*ks0``, which is wrong.

2. Rename sympy_symbol to sympy_index_symbol to make it explicit.
3. Add an assertion that the replaced expression is not passed as a string but always as a sympy expression.

Fixes https://github.com/pytorch/pytorch/issues/117757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
2024-01-25 20:54:55 +00:00
Yang Chen
1565d58ad9 [inductor] correctly generate grid info for benchmark_kernel (#118202)
Previously, we generated the grid argument with tree.numel for
a benchmark TritonKernel. This was not correct, because it
didn't match the launch config used for profiling and running.

This PR fixed the issue by emitting the grid value computed
by the kernel's grid_fn, which is used by the profiler and
the kernel's runner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-01-25 20:37:44 +00:00
Edward Z. Yang
903e1913ff Rename unbacked SymInt prefix to u (#117859)
Currently, it conflicts with Inductor's naming convention for index
variables

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117859
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/avikchaudhuri
2024-01-22 20:53:47 +00:00
Jeff Daily
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
Edward Z. Yang
df4e3d9d08 Document OpsHandler protocol (#117790)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117790
Approved by: https://github.com/jansel
2024-01-21 07:20:53 +00:00
PyTorch MergeBot
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e1362499.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
Jeff Daily
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00