Currently there is `test_vertical_fusion1`, which fuses entirely during
the lowering stage, so no buffers are realized. This adds
`test_scheduler_vertical_fusion1`, which is the same test but with
several intermediate calculations realized, leaving the scheduler
to do the fusion.
To support the test, this PR also adds:
- `metrics.ir_nodes_pre_fusion`, which, compared with
`generated_kernel_count`, tells us how many nodes were fused.
- `torch._test_inductor_realize`, an identity operator in eager mode
which, under inductor, also forces its input to be realized (usage
sketched below).
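A hedged sketch of how these fit together (the function and the check are illustrative, not the actual test; `torch.compile` is shown for brevity, though these PRs predate the 2.0 API):
```
import torch
from torch._inductor import metrics

def f(x):
    a = torch._test_inductor_realize(x + 1)  # force `a` to be realized
    b = torch._test_inductor_realize(a * 2)  # force `b` to be realized
    return b - 3

torch.compile(f)(torch.randn(8))
# Comparing ir_nodes_pre_fusion against generated_kernel_count tells us
# how many IR nodes the scheduler fused.
print(metrics.ir_nodes_pre_fusion, metrics.generated_kernel_count)
```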
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90014
Approved by: https://github.com/jansel
Relands #89031
Per title. We now set strides from the fx graph only for convolutions and mm. This is a hack, but bmm caused an extra copy in some cases and there is no obvious way to fix that; we should rethink the strides anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530
Approved by: https://github.com/Chillee
1. `aten.div.Tensor_mode` should allow broadcasting
2. `div` can use `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT`
3. `prims.div` on integers should be truncating division (see the sketch below)
4. Add a lowering for `true_divide`, which is aliased to `div`
5. Register a lowering for the in-place version of `div_mode`
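A minimal eager-mode sketch of the semantics these lowerings must match:
```
import torch

a = torch.tensor([7, -7])
b = torch.tensor([2])                          # broadcasts against `a` (item 1)
print(torch.div(a, b))                         # int inputs promote to float (item 2)
print(torch.div(a, b, rounding_mode="trunc"))  # truncating division (item 3)
print(torch.true_divide(a, b))                 # alias of `div` (item 4)
a.div_(b, rounding_mode="trunc")               # in-place `div_mode` (item 5)
```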
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88603
Approved by: https://github.com/ngimel
By itself, the libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips removed, because our decomposition wasn't accurate enough for fp64, but the libdevice version is.
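A quick fp64 accuracy check of the kind those skips guarded (`torch.compile` shown for brevity; these PRs predate the 2.0 API):
```
import torch

@torch.compile
def erf(x):
    return torch.erf(x)

x = torch.randn(1024, dtype=torch.float64, device="cuda")
torch.testing.assert_close(erf(x), torch.erf(x))
```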
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
The core problem that we often have with contiguous/channels-last layouts and convolutions is that Inductor often doesn't do a great job of "preserving" the eager-mode layouts.
So, for example, we'll often have something like
```
# a: channels-last
b = foo(a)
c = convolution(a)
```
In eager-mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel.
However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but in others). Then, not only will we do a transpose within `foo`, but we'll immediately transpose it back to do the convolution (and then again once the convolution is done).
This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels last tensors and channels-last tensors.
The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager-mode. This ensures that we'll never do extra transposes *within* `aten.convolution`, which are particularly bad (since Inductor can't fuse them).
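A minimal repro sketch of the pattern (shapes and the pointwise op are illustrative; `torch.compile` shown for brevity):
```
import torch
import torch.nn.functional as F

def f(a, w):
    b = torch.relu(a)   # pointwise op whose output layout could be lost
    c = F.conv2d(a, w)  # convolution wants the eager-mode layout
    return b, c

a = torch.randn(8, 64, 32, 32, device="cuda").to(memory_format=torch.channels_last)
w = torch.randn(64, 64, 3, 3, device="cuda").to(memory_format=torch.channels_last)
# With the constraint, the compiled graph hands aten.convolution a
# channels-last input, matching eager and avoiding NCHW<->NHWC transposes.
out = torch.compile(f)(a, w)
```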
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031
Approved by: https://github.com/ngimel, https://github.com/jansel
Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel.
Improves perf of coat_lite_mini from 1.484 -> 2.011.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534
Approved by: https://github.com/ngimel
Summary: Currently the fallback kernel in inductor assumes its output is
either a tensor or a tuple/list of tensors. This PR makes it handle more
generic output data structures.
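A minimal sketch of the idea (not the actual FallbackKernel code): walk the output structure generically and wrap only the tensor leaves, instead of special-casing Tensor vs. tuple/list:
```
import torch
import torch.utils._pytree as pytree

def wrap_fallback_output(output, wrap_tensor):
    # `output` may be any nesting of dicts/lists/tuples; only the tensor
    # leaves need to be wrapped into inductor IR.
    return pytree.tree_map(
        lambda x: wrap_tensor(x) if isinstance(x, torch.Tensor) else x,
        output,
    )
```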
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88495
Approved by: https://github.com/jansel
Ref pytorch/torchdynamo#327
The use of as_strided does require in-memory manipulation; however, this
lowering allows those memory ops to be fused with any preceding calculations.
e.g.
```
def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10,
        b * 2 - 4,
        size=(a.numel() // 2,),
        stride=(2,))
```
Before this PR, the example compiles to two kernels plus a call to `aten.as_strided_scatter`;
with this PR, it compiles to just two kernels and no additional operator calls.
In theory I think this could be a decomposition, but in practice I saw the
`output_view.copy_(src)` being optimized out in some cases when this was
implemented as a decomposition.
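For reference, a hedged sketch of what that decomposition would look like (hypothetical; the PR implements a lowering instead):
```
import torch

def as_strided_scatter_decomp(inp, src, size, stride, storage_offset=0):
    out = inp.clone()
    output_view = torch.as_strided(out, size, stride, storage_offset)
    output_view.copy_(src)  # the copy_ that was sometimes optimized out
    return out
```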
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88379
Approved by: https://github.com/jansel
In cases where a tensor kwarg is actually `out=`, raise a nicer error message than this:
```
Traceback (most recent call last):
  File "/fsx/users/binbao/pytorch/torch/_inductor/graph.py", line 241, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/fsx/users/binbao/pytorch/torch/_inductor/lowering.py", line 168, in wrapped
    assert not any(isinstance(x, TensorBox) for x in kwargs.values())
AssertionError
```
https://github.com/pytorch/torchdynamo/issues/1798
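A hypothetical repro of the kind of call that hits this path (before this PR it tripped the bare assert above):
```
import torch

def f(a, b):
    out = torch.empty_like(a)
    torch.add(a, b, out=out)  # tensor passed via an out= kwarg
    return out
```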
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88367
Approved by: https://github.com/desertfire
Workaround for https://github.com/pytorch/torchdynamo/issues/1775; calling sqrt is better in any case. `libdevice.pow` still, for some reason, doesn't work if both arguments are scalars.
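A minimal sketch of the rewrite, using a hypothetical helper (the real change lives in the pow lowering):
```
import torch

def pow_like(x, exponent):
    if exponent == 0.5:
        return torch.sqrt(x)  # avoids libdevice.pow entirely
    if exponent == 2:
        return x * x
    return torch.pow(x, exponent)
```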
@mreso, can you please check if that takes you further with diffusers?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87912
Approved by: https://github.com/desertfire
Porting over [torchdynamo/#1633](https://github.com/pytorch/torchdynamo/pull/1633)
`torch/_inductor/codegen/triton.py` now defines `libdevice_<function>` variants
of some functions. You can request dispatch to those for
float64 dtypes when using `register_pointwise` by setting
`use_libdevice_for_f64=True`.
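A sketch of how a pointwise op opts in (illustrative op choice; the real call sites are in `torch/_inductor/lowering.py`):
```
import torch
from torch._inductor.lowering import register_pointwise

aten = torch.ops.aten
register_pointwise(
    aten.sqrt,
    use_libdevice_for_f64=True,  # fp64 dispatches to the libdevice_ variant
)
```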
Other minor changes:
- In triton, sigmoid now codegens `tl.sigmoid`
- silu now comes from a decomp, not a lowering
- Some test skips are no longer necessary; removed or made xfails
Switching to `tl.sigmoid` has exactly the same performance.
Moving `silu` to a decomp does not change anything; the same triton code is generated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87189
Approved by: https://github.com/ngimel
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.
Actually getting CUDA to load in the workers is probably not desired: CUDA initialization takes O(seconds), and having multiple processes using the same device will slow things down.
This just moves the needed properties from the main trainer process to the workers.
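A minimal sketch of the approach, with a hypothetical helper name: snapshot the properties in the main trainer process, before forking, and hand the snapshot to the workers:
```
import torch

def snapshot_device_properties():
    # Runs in the parent, where CUDA is already initialized; workers
    # receive this snapshot instead of initializing CUDA themselves.
    return [
        torch.cuda.get_device_properties(i)
        for i in range(torch.cuda.device_count())
    ]
```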
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith