This doesn't entirely fix the original problem that prompted it, but it now seems to be getting stuck in export constraint formatting instead, which seems like progress to me.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827
Approved by: https://github.com/avikchaudhuri
**Summary**
Refactor the `Scheduler.fuse_nodes` changes from https://github.com/pytorch/pytorch/pull/121625. In the previous implementation of `Scheduler.fuse_nodes` in https://github.com/pytorch/pytorch/pull/121625, we used the `enable_outer_loop_fusion` context to ensure `OuterLoopFusion` happens after all the normal fusions.
There is a discussion in https://github.com/pytorch/pytorch/pull/121625/files#r1527177141 about reusing the current `score_fusion` mechanism. However, given that [fuse_nodes](f4ff063c33/torch/_inductor/scheduler.py (L1679-L1698)) invokes `fuse_nodes_once` 10 times, we are concerned that the score-based approach could disrupt pairs of regular fusion nodes in the 2nd invocation of `fuse_nodes_once` if they have already been picked up by the outer loop fusion in the 1st invocation.
In this PR, we propose adding an abstraction, `filter_possible_fusions_by_priority`. In each invocation of `fuse_nodes_once`, the possible fusions are grouped by the priority reported by the backend, and only the group with the highest priority is fused in that invocation. This way, we can ensure `OuterLoopFusion` happens after all the normal fusions.
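A minimal sketch of the filtering idea, assuming a backend hook that scores a fusion pair (the hook name and the "smaller value means higher priority" convention are illustrative, not the exact Inductor API):
```python
from collections import defaultdict

def filter_possible_fusions_by_priority(possible_fusions, backend):
    # Group candidate (node1, node2) pairs by the priority the backend
    # assigns them; here a smaller value is treated as a higher priority.
    groups = defaultdict(list)
    for node1, node2 in possible_fusions:
        priority = backend.get_fusion_pair_priority(node1, node2)  # assumed hook
        groups[priority].append((node1, node2))
    if not groups:
        return []
    # Only the highest-priority group is fused in this invocation of
    # fuse_nodes_once; the remaining pairs wait for a later invocation.
    return groups[min(groups)]
```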
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123067
Approved by: https://github.com/lezcano, https://github.com/jgong5
ghstack dependencies: #121625
This PR unifies the vectorized conversion under `at::vec::convert` for all vectorized data types. The intrinsics implementations become specializations and are moved to their own arch-specific files. The vectorized conversion logic in the cpp Inductor backend is simplified accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979
Approved by: https://github.com/jansel, https://github.com/malfet
Sympy simplifications don't obey floating-point semantics, so don't use Sympy for this. Keep the expressions as-is; only evaluate them with the reference implementations when all arguments are known.
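A small standalone illustration of why symbolic simplification and floating-point evaluation disagree (not code from this PR):
```python
import sympy

x = sympy.Symbol("x")

# Sympy cancels the two constants symbolically, giving back plain `x` ...
expr = (x + sympy.Float(1e20)) - sympy.Float(1e20)
print(expr)  # x

# ... but under IEEE-754 float semantics the small term is absorbed by the
# large addend, so the "same" computation on concrete floats returns 0.0.
print((1.0 + 1e20) - 1e20)  # 0.0
```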
This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.
This doesn't actually depend on the earlier diffs on the stack and I can detach it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
This PR adds vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices via the new `indirect_assert` method on `Kernel`: the base implementation handles scalar indices and is overridden in `CppVecKernel` for vectorized indices.
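For reference, a minimal example (assumed for illustration, not taken from this PR) of the kind of indirect-indexing pattern this enables the CPU backend to vectorize; the exact codegen depends on the machine's vector ISA:
```python
import torch

def gather_rows(x, idx):
    # idx supplies data-dependent (indirect) indices into x
    return x[idx] + 1.0

x = torch.randn(128, 64)
idx = torch.randint(0, 128, (32,))

compiled = torch.compile(gather_rows)
torch.testing.assert_close(compiled(x, idx), gather_rows(x, idx))
```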
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655
Approved by: https://github.com/jansel
ghstack dependencies: #119654
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>`, which cannot work for operations on other data types. This PR generalizes them with the new `VecMask` template class that can represent masks for any vectorized data type. The intrinsics implementations in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as specializations for `VecMask` and moved to the corresponding header files.
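A user-level sketch (assumed for illustration, not from the PR) of a masked computation over a non-float dtype that exercises this path through the CPU backend of `torch.compile`:
```python
import torch

def masked_relu(x):
    mask = x > 0  # boolean mask over an integer tensor
    return torch.where(mask, x, torch.zeros_like(x))

x = torch.randint(-10, 10, (1024,), dtype=torch.int64)
compiled = torch.compile(masked_relu)
torch.testing.assert_close(compiled(x), masked_relu(x))
```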
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 focused on enabling it on macOS, but quickly turned into landing only very limited platform-specific acceleration at this time (i.e. this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).
Enabling the test harness uncovered a number of latent issues in the CPU inductor that were fixed in the following PRs:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608
The following was added/changed to enable the vectorization code to work on macOS:
- Added a `VecNEON` class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
- Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limited it to 8-bit types
- Changed the 64-bit integral type mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on macOS `int64_t` is a `long long` rather than a `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)
See the table below for perf changes with and without `torch.compile`, using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on an M2 Pro:
| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-n-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which is what the original calculation yields when the data type is a 64-bit value such as int64).
Discovered while trying to enable the aarch64 vectorized inductor.
Minimal reproducer (reproducible on ARMv8 or any x86_64 machine that does not support AVX512):
```python
import torch

def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x = torch.ones(10, 10, dtype=torch.int64)
y = torch.tensor([1, 2, -8, 8, 5, 5, -7, -8, 7, 0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
Summary: Previously, we only supported a torch.Tensor boolean scalar predicate for `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicates, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.
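For illustration, a small sketch (not part of the test plan below) where the predicate is a Python bool in eager mode and a SymBool under `torch.compile` with dynamic shapes:
```python
import torch

def f(x):
    # x.shape[0] > 4 is a plain Python bool in eager mode and a SymBool
    # when the size is treated as dynamic by torch.compile.
    return torch.cond(x.shape[0] > 4, lambda x: x.cos(), lambda x: x.sin(), (x,))

x = torch.randn(8, 3)
compiled = torch.compile(f, dynamic=True)
torch.testing.assert_close(compiled(x), x.cos())
```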
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s
OK
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
Fix https://github.com/pytorch/pytorch/issues/115260.
This issue is triggered by `FusedSchedulerNode` cases.
We always store the low-precision buffer to the `store_cache`, then load it from the `store_cache` and convert it to float before the compute ops.
Now we also generate an entry `{key: the to(float32) expr, value: the float32 cse variable that existed before the to_dtype and store}` in `cse.cache`.
The `to_dtype(float32)` after the `load` then hits this cache entry and does not generate a new variable with extra cast code.
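A schematic of the caching idea in plain Python (not the actual Inductor `CSE` code; the expression strings and helper names are illustrative):
```python
# CSE cache: expression text -> name of the variable already holding its value.
cse_cache = {}

def emit_lowp_store(fp32_var: str, lowp_buffer: str) -> str:
    # The kernel downcasts the fp32 value before storing it, as before.
    lowp_var = f"to_bf16({fp32_var})"
    # New behavior: also record that upcasting the stored value back to fp32
    # is simply the original fp32 variable.
    cse_cache[f"to_float32(load({lowp_buffer}))"] = fp32_var
    return lowp_var

def emit_to_float32(load_expr: str) -> str:
    # A later to_dtype(float32) after the load hits the cache and reuses the
    # existing fp32 variable instead of emitting new cast code.
    return cse_cache.get(f"to_float32({load_expr})", f"to_float32({load_expr})")
```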
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118365
Approved by: https://github.com/jgong5, https://github.com/jansel
In some cases where we have TORCH_CHECK in loops, the host compiler may spend hours optimizing the run_impl function. This PR mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK, where we force the underlying assert function to be noinline.
If forcing noinline causes any serious perf regression, we could either add an option to turn noinline on/off, or add an option to turn AOTI_CHECK into a no-op, similar to the `assert` macro from cassert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
This pull request aims to complete most of the support for vectorizing the int32 and int64 data types, except for indirect indexing and masks. Basic data type support for uint32 and uint64 is also added, but without vectorization. More vectorized conversion functions are added between integer and float. In order to support int64 vectors, a new `VectorizedN` class is introduced to handle vectors of arbitrary length. Below are the details:
1. Complete most of the int32 and int64 vectorization support, including load, store, reduction, constant and conversion (a user-level sketch follows this list). Indirect indexing and masks will be addressed in follow-up PRs, after which the legality checking logic in `CppVecKernelChecker` can be further simplified.
2. Util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, these should be moved from cpp_prefix.h to ATen vec to simplify cpp_prefix.h; this will be addressed in follow-up PRs.
3. Introduced a new template class `VectorizedN`, designed to handle vectors of arbitrary length by encapsulating multiple `Vectorized<T>` instances. This class supports most of the operations of `Vectorized<T>` and makes int64 vectorization support simpler. I will also apply it to bf16/fp16/int8 in follow-up PRs for better efficiency. For example, bf16 currently only uses half of the vector lanes; with `VectorizedN`, we can use all of the lanes and map a bf16 vector to `VectorizedN<float, 2>` on conversion.
4. Basic data type support is added for uint32 and uint64 (in graph.py). Vectorization support will be added later but is not of high priority due to fewer usages.
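As the user-level sketch referenced in item 1 (an assumed illustration, not from the PR): an int64 pointwise-plus-reduction kernel can now take the vectorized CPU path when the machine's vector ISA supports it.
```python
import torch

def int64_kernel(x):
    # Pointwise arithmetic and a reduction over int64 inputs; with this PR
    # the CPU Inductor backend can vectorize these ops instead of falling
    # back to scalar code (exact codegen depends on the vector ISA).
    return (x * 3 + 1).sum()

x = torch.randint(-1000, 1000, (4096,), dtype=torch.int64)
compiled = torch.compile(int64_kernel)
assert torch.equal(compiled(x), int64_kernel(x))
```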
Next steps:
- [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors.
- [ ] Fully utilize vector lanes for bfloat16/float16/int8.
- [ ] Support indirect indexing with vectorized index via scalarization.
- [ ] Clean up `CppVecKernelChecker`.
- [ ] Simplify `cpp_prefix.h` including refactoring vector conversion logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119001
Approved by: https://github.com/peterbell10, https://github.com/jansel
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but I particularly wanted feedback on whether people like making ValueRanges Generic; distinguishing whether you have an Expr ValueRanges or a SympyBoolean ValueRanges is a lot of trouble downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRanges[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
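A minimal sketch of the pattern being described, with simplified stand-in types rather than the actual symbolic_shapes / value_ranges code (requires Python 3.10+ for `typing.TypeGuard`):
```python
from typing import Generic, TypeGuard, TypeVar, Union, cast

import sympy
from sympy.logic.boolalg import Boolean as SympyBoolean

_T = TypeVar("_T", sympy.Expr, SympyBoolean)

class ValueRanges(Generic[_T]):
    """A closed interval [lower, upper] over numeric or boolean sympy expressions."""
    def __init__(self, lower: _T, upper: _T) -> None:
        self.lower = lower
        self.upper = upper

def is_bool_range(
    vr: Union[ValueRanges[sympy.Expr], ValueRanges[SympyBoolean]],
) -> TypeGuard[ValueRanges[SympyBoolean]]:
    # TypeGuard lets callers refine the generic parameter after this check.
    return isinstance(vr.lower, SympyBoolean)

def or_ranges(a, b):
    if is_bool_range(a) and is_bool_range(b):
        # Even after the refinement, building the result still needs an
        # explicit cast back to ValueRanges[SympyBoolean] (mypy issue 14425).
        return cast(
            ValueRanges[SympyBoolean],
            ValueRanges(sympy.Or(a.lower, b.lower), sympy.Or(a.upper, b.upper)),
        )
    raise TypeError("expected boolean ranges")
```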
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007