## Context
Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:
```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```
For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead like was done for other `refs` implementations previously. cc: @peterbell10 @lezcano
Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
Summary:
## Context
Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for other non generic variants of these operators to express them using the most generic variant.
Note that to register these decomps, the reference implementation under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10
Test Plan: Github CI + Meta Internal CI
Differential Revision: D49965952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
## Context
Introduce a core decomposition for `aten.floor_divide` into other `aten` ops, and add it to the core ATen decomposition table.
This replaces the decomposition of `floor_divide` that was used by Inductor. I noticed there was a note on that decomposition
```
# TorchInductor-only decomposition. It should not be taken to core.
# See https://github.com/pytorch/torchdynamo/pull/1120
```
but couldn't discern the reason why this is the case. cc: @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110046
Approved by: https://github.com/peterbell10
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that have out parameters. Additionally, Python decompositions may register `OpOverloadPacket`'s so decompositions need to be tested to ensure all `OpOverloads` still function for the `Meta` key (e.g. if a python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the python function needs to support receiving out arguments)
- Add out parameter wrappers to python decomps for aten ops that have out overloads
CC. @ezyang @albanD @lezcano
Fixes#107713
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
The "safety" aspect refers to the output not being registered as aliasing the
input, but after AOTAutograd I don't think this distinction matters. However,
we shouldn't use the same decomposition as the safe variant in case the backend
doesn't want to decompose split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109668
Approved by: https://github.com/lezcano
ghstack dependencies: #109667
Summary: See the comment in code for the reasons of the change
Test Plan:
buck2 test executorch/examples/export/test:test_export --
test_vit_export_to_executorch
Differential Revision: D48992180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108608
Approved by: https://github.com/larryliu0820
Summary:
Earlier decomp was routing _flash* variant to _match variant and this
was result in failure during torch.export, for some reason that I
couldnt trace.
However, it seems that we should really have a decomp for
scaled_dot_product_attention, instead of
scaled_dot_product_flash_attention. Right?
This diff adds that. Plus it adds a test to check if the model exported
via two stage export, has decomposed the op. This test needs improvement
to figur eout what the core aten opset is and check for anything that is
not inside.
Test Plan:
test_model_exports_to_core_aten
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D48917461](https://our.internmc.facebook.com/intern/diff/D48917461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108371
Approved by: https://github.com/larryliu0820
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
ProxyTorchDispatchMode handling in order to resolve decompositions. We
can change this in the future so that they do not need to do this.
Next, we add an inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype to the backend
for other use cases (i.e. executorch).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math` and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.
However recent PRs (#103826#105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.
Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
Summary: Before this change, the tensor_indices_or_sections variant of aten.tensor_split causes a `RuntimeError: The tensor has a non-zero number of elements` due to that operation needing to introspect data. Decomposing into one of the other two tensor_split variants fixes the problem.
Test Plan:
Enabled tensor_split tests in test/inductor/test_torchinductor_opinfo.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107251
Approved by: https://github.com/ezyang, https://github.com/eellison
Summary:
(From Brian Hirsh)
Description copied from what I put in a comment in this PR: https://github.com/pytorch/pytorch/pull/106329
So, the slightly-contentious idea behind this PR is that lower in the stack, I updated torch._decomps.get_decomps() to check not only the decomp table to see if a given op has a decomposition available, but to also check the dispatcher for any decomps registered to the CompositeImplicitAutograd key (link: https://github.com/pytorch/pytorch/pull/105865/files#diff-7008e894af47c01ee6b8eb94996363bd6c5a43a061a2c13a472a2f8a9242ad43R190)
There's one problem though: we don't actually make any hard guarantees that a given key in the dispatcher points does or does not point to a decomposition. We do rely pretty heavily, however, on the fact that everything registered to the CompositeImplicitAutograd key is in fact a decomposition into other ops.
QAT would like this API to faithfully return "the set of all decomps that would have run if we had traced through the dispatcher". However, native_batch_norm is an example of an op that has a pre-autograd decomp registered to it (through op.py_impl(), but the decomp is registered directly to the Autograd key instead of being registered to the CompositeImplicitAutograd key.
If we want to provide a guarantee to QAT that they can programatically access all decomps that would have run during tracing, then we need to make sure that every decomp we register to the Autograd key is also registered to the CompositeImplicitAutograd key.
This might sound kind of painful (since it requires auditing), but I think in practice this basically only applies to native_batch_norm.
Test Plan: python test/test_decomp.py
Differential Revision: D48607575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107791
Approved by: https://github.com/jerryzh168, https://github.com/SherlockNoMad
When exporting dropout with cpu tensor, we get following graph module
```
class GraphModule(torch.nn.Module):
def forward(self, arg0_1: f32[512, 10]):
empty_memory_format: f32[512, 10] = torch.ops.aten.empty.memory_format([512, 10], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False, memory_format = torch.contiguous_format)
bernoulli_p: f32[512, 10] = torch.ops.aten.bernoulli.p(empty_memory_format, 0.9); empty_memory_format = None
div_scalar: f32[512, 10] = torch.ops.aten.div.Scalar(bernoulli_p, 0.9); bernoulli_p = None
mul_tensor: f32[512, 10] = torch.ops.aten.mul.Tensor(arg0_1, div_scalar); arg0_1 = div_scalar = None
return (mul_tensor,)
```
In addition, if we export with eval() mode, we will have an empty graph.
However, when exporting with cuda tensor, we got
```
class GraphModule(torch.nn.Module):
def forward(self, arg0_1: f32[512, 10]):
native_dropout_default = torch.ops.aten.native_dropout.default(arg0_1, 0.1, True); arg0_1 = None
getitem: f32[512, 10] = native_dropout_default[0]; native_dropout_default = None
return (getitem,)
```
and exporting under eval() mode will still have a dropout node in graph.
This PR make exporting with CPU tensor also produce aten.native_dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106274
Approved by: https://github.com/ezyang
This pattern shows up in torchrec KeyedJaggedTensor. Most
of the change in this PR is mechanical: whenever we failed
an unbacked symint test due to just error checking, replace the
conditional with something that calls expect_true (e.g.,
torch._check or TORCH_SYM_CHECK).
Some of the changes are a bit more nuanced, I've commented on the PR
accordingly.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106788
Approved by: https://github.com/lezcano
ghstack dependencies: #106720
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.
Also add it to the list of core decompositions.
Test Plan: contbuild and OSS CI (similar to D40075277)
Differential Revision: D48163393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
- Enabled LSTM weight prepack in inductor.
- Added a mkldnn decomposition for lstm which won't change for different `seq_lens`. With the previous decomposition, for dynamic shapes use case where `seq_lens` changes, the graph will be different.
- Extended several inductor utility functions to support `List(Tensor`) as input. Previously those functions only supported `Tensor` input.
**Update 2023-07-26:**
- https://github.com/pytorch/pytorch/pull/103851 has moved CPU weight packing to be after AOTAutograd. Fixed the support in this PR to follow the same way (mainly in 3b207f7f1c (diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3)).
LSTM is decomposed in `aten.mkldnn_rnn_layer` by layer and by direction. The weight prepack is done at the `mkldnn_rnn_layer` level.
- Add a fix in rnn `__get_state__` function in case we need to recompile an `LSTM` module.
When compiling the module, the weights tensors which are the `named_parameters` of the module are converted to `functional_tensor` here:
76fb72e24a/torch/nn/utils/stateless.py (L125-L128)
The forward function of LSTM will be called:
76fb72e24a/torch/_functorch/aot_autograd.py (L3379-L3381)
In the forward function, the `_flat_weights` are updated to be the same as the weights, thus becoming `functional_tensor`:
76fb72e24a/torch/nn/modules/rnn.py (L775-L778)
The weights tensors are converted back to the original tensors (which are not `functional_tensor` anymore) before exiting the `_reparametrize_module` context here:
76fb72e24a/torch/nn/utils/stateless.py (L130-L142)
But since `_flat_weights` is not in the `named_parameters` of the module, it's still `functional_tensor` ([link of the parameters that will be converted to functional and reverted back](76fb72e24a/torch/_functorch/aot_autograd.py (L3695-L3698))).
At this moment, if we need to recompile the model, `deepcopy` will be called:
76fb72e24a/torch/_dynamo/utils.py (L915-L917)
And it will report `UnImplemented` since we have `functional_tensor` (`_flat_weights`) and will trigger graph break which is not what we expect:
76fb72e24a/torch/_subclasses/meta_utils.py (L514)
Added a fix in the `__get_state__` to update the `_flat_weights` if ever weights have changed to fix this issue. The fix is covered in the `test_lstm_packed` UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103071
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only on `max_autotune_gemm=False` and when the activation is fusible.
Test Plan:
$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v
(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok
----------------------------------------------------------------------
Ran 1 test in 13.415s
OK
Reviewers: @eellison
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
Currently calling the fill.Tensor overload under `torch.compile` results in a
`DataDependentOutputException` due to the `.item()` call. This instead does a
device-device copy which can then be inlined into subsequent inductor kernels as
you would expect, e.g.
```python
def fn(a):
result = torch.deg2rad(a).sin()
return torch.empty((128, 128), device=a.device).fill_(result)
```
generates the single kernel
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 16384
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (0))
tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
tmp2 = 0.017453292519943295
tmp3 = tmp1 * tmp2
tmp4 = tl.sin(tmp3)
tl.store(out_ptr0 + (x0), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103880
Approved by: https://github.com/Chillee
Fix https://github.com/pytorch/pytorch/issues/99686, for eager mode, if the given sizes is not meet requirements, it will report an error, but inductor can run, I think we need align inductor behavior with eager mode, the behavior will be like after this PR:
```
Traceback (most recent call last):
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1267, in run_node
return node.target(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/functional.py", line 189, in split
return tensor.split(split_size_or_sections, dim)
File "/home/xiaobing/pytorch-offical/torch/_tensor.py", line 804, in split
return torch._VF.split_with_sizes(self, split_size, dim)
File "/home/xiaobing/pytorch-offical/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1095, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1259, in dispatch
return decomposition_table[func](*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_decomp/decompositions.py", line 1102, in split_with_sizes
raise ValueError(
ValueError: Split sizes don't add up to the tensor's size in the given dimension
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1215, in get_fake_value
return wrap_fake_exception(
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
return fn()
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1216, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1279, in run_node
raise RuntimeError(
RuntimeError: Failed running call_function <function split at 0x7f45b8402ee0>(*(FakeTensor(..., size=(1, 5)), [2, 1, 1]), **{'dim': 1}):
Split sizes don't add up to the tensor's size in the given dimension
(scroll up for backtrace)
The above exception was the direct cause of the following exception:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99702
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/jansel
The meta implementation for these _like function is wrong whenever device != "meta" (it doesn't fill the memory!).
zeros_like is special due to sparse and is fixed directly by always filling it with zeros.
Every other one is CompositeExplicit implementation, I went with removing their meta registration and tweaking code to avoid infinite recursions.
I can do the same as zeros_like (and add the proper filling for each) but that would duplicate the c++ logic and make the meta registrations non trivial. I can do it if you prefer to removal.
test_meta works fine with these fixes, relying on CI to see if other tests are breaking as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98160
Approved by: https://github.com/ezyang
Fixes#94841
This fixes the error messages in the following files, the same as those referenced in the linked issue. I was not able to find any additional examples, but am happy to add commits for any that I may have missed!
```
aten/src/ATen/native/Blas.cpp: "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
torch/_decomp/decompositions.py: lambda: f"size mismatch, got {self.size(0)}x{self.size(1)},{vec.size(0)}",
```
Example output for `Blas.cpp` before:
```
size mismatch, got 3, 3x4,1
```
The new error messages have the following format:
```
aten/src/ATen/native/Blas.cpp: "size mismatch, got bias (", self.size(0), "), matrix (", mat.size(0), "x", mat.size(1), "), vector (", vec.size(0), ")");
torch/_decomp/decompositions.py: lambda: f"size mismatch, got matrix ({self.size(0)}x{self.size(1)}), vector ({vec.size(0)})",
```
Example output for `Blas.cpp` after:
```
size mismatch, got bias (3), matrix (3x4), vector (1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96863
Approved by: https://github.com/albanD
Fix https://github.com/pytorch/pytorch/issues/96042
### before
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
__main__:1: UserWarning: An output with one or more elements was resized since it had shape [], which does not match the required output shape [1]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:24.)
torch.return_types.aminmax(
min=tensor([1]),
max=tensor([1]))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
```
### after
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
```
Marked the following test as expected_fail:
`test_vmap.py TestVmapOperatorsOpInfoCPU.test_op_has_batch_rule_aminmax_cpu_float32`
Given input shape of (2), the loop out is shape (2), the batched vmap out is (2, 1), which mismatched.
The loop out will calculate twice on a tensor shape of ( ): without this patch, the output is (1), and then stacked into (2, 1); with this patch, the output is ( ), then stacked into (2).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96171
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/zou3519
Fixes#95794
This is a hotfix for decomposition only (that is currently used by inductor), reference still accesses invalid indices. Perhaps `_nll_loss_nd` and this decomp should be unified, cc @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95833
Approved by: https://github.com/lezcano, https://github.com/Chillee
Add _int_mm primitive that binds cuBLAS int8@int8 -> int32 matmul and that translates to Triton based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is a not a user facing API, but an internal primitive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
This allows unet to be compiled with symbolic shapes (but it still fails accuracy, lol).
Output sizes are always integer, there's no need to pretend they are ever float. Recomputing scale factors still used nominally float sizes converted to int, we might as well do it from the start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94123
Approved by: https://github.com/ezyang
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims` which also has the effect of reducing the lowered graph size in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
`log1p` offers better precision near zero since `(1 + x) - 1` truncates any
values less than the float epsilon to zero. For `soft_margin_loss` this also
requires one fewer kernel invocation which for numel=1e7 gives me a 1.2x speedup
on CUDA and a 1.1x speedup on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92114
Approved by: https://github.com/ngimel, https://github.com/lezcano
Description:
- output memory format is matching input for bicubic2d
Problem: output tensor's memory format does not match input format for bicubic2d
```python
import torch
i = torch.rand(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
assert i.is_contiguous(memory_format=torch.channels_last)
o = torch.nn.functional.interpolate(i, size=(4, 4), mode="bicubic")
assert o.is_contiguous(memory_format=torch.channels_last), f"Should be channels last but given channels first ({o.is_contiguous(memory_format=torch.contiguous_format)})"
> AssertionError: Should be channels last but given channels first (True)
```
Related PR fixing bilinear ops: https://github.com/pytorch/pytorch/pull/53535 (cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @bdhirsh )
Discovered together with @NicolasHug while working on https://github.com/pytorch/pytorch/tree/interpolate_uint8_images_linear_cpu_support_dev
- Updated code to match grad input / output memory formats
- temporary tensor creation matches memory format in `separable_upsample_generic_Nd_kernel_impl`
- Updated tests
- Added missing forward AD support for bicubic with antialiasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90470
Approved by: https://github.com/NicolasHug, https://github.com/lezcano
The eager implementation of softmax supports computation along zero dimensions, but many of the other implementations did not, including:
* decompositions & refs (this was causing dynamo failures)
* forward AD for logsumexp
* MPS log_softmax_backward
This PR handles the `input.numel() == 0` cases separately to avoid running `amax()`, which fails for zero dimensions, and updates opinfos.
example of "computation along zero dimensions":
```python
# example of where
import torch
t = torch.rand((4, 0, 0))
print("~")
print(torch.nn.functional.softmax(t, dim=-1)) # this passes
print("~")
torch._refs.softmax(t, dim=-1) # this fails
print("~")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91322
Approved by: https://github.com/lezcano
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims` which also has the effect of reducing the lowered graph size in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
This PR moves the definitions for:
* `sym_int`
* `sym_ceil` (used only for `sym_int`)
* `sym_floor` (used only for `sym_int`)
* `sym_float`
from `torch/fx/experimental/symbolic_shapes.py` to `torch/__init__.py`, where `SymInt` and `SymFloat` are already defined.
This removes the need for several in-line imports, and enables proper JIT script gating for #91318. I'm very open to doing this in a better way!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91317
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Use Prims to implement group_norm, group_norm_backward and mean_var
Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in
order to be able to make them importable from `torch/backend/mps/__init__.py` as this alias is defined in
15af4b1cee/torch/__init__.py (L1095)
is executed last during init process.
Add `__all__` to `torch/backends/mps/__init__.py` as well as alias all imports as private
Add `TestNNMPS.test_group_norm_backward` that validates no NaNs are generated during the backward pass
Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
Using the same repro from the issue (but with BatchNorm2D)
Rectifies native_batch_norm schema by splitting the schema into 2:
1. one will have NON-optional alias-able running_mean and running_var inputs
2. the other will just not have those parameters at all (no_stats variation)
**Calling for name suggestions!**
## test plan
I've added tests in test_functionalization.py as well as an entry in common_method_invocations.py for `native_batch_norm_legit`
CI should pass.
## next steps
Because of bc/fc reasons, we reroute native_batch_norm to call our new schemas ONLY through the python dispatcher, but in 2 weeks or so, we should make `native_batch_norm_legit` the official batch_norm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88697
Approved by: https://github.com/albanD
We add most in-place references in a generic way. We also implement a
wrapper to implement the annoying interface that `nn.functional`
nonlinearities have.
We fix along the way a couple decompositions for some non-linearities by
extending the arguments that the references have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88117
Approved by: https://github.com/mruberry
This is a policy update for meta registration. **We now prefer python meta implementation over C++ meta function.** This is a flip of the previous policy, where we prefer C++ meta function over python meta function if they both exist.
Here's the meta registration process:
1. register_meta and register_decomposition will place the python meta/decomp functions into the `global_decomp_table`. However, they will NOT register them into dispatcher.
2. After global_decomp_table is populated, we will compile an `active_meta_table`. For a given op, we pick the most specific decomp function from `global_decomp_table` in the preference order of Meta > PostAutograd > PreAutograd.
3. We will unconditionally register all of them into python dispatcher. And register them into C++ dispatcher, unless it one of the following 3 cases
- 1. the op is a CompositeImplicitAutograd, and should rely on decomposed op's meta
- 2. the op is a view op, as the MetaTensor doesn't support aliased storage
- 3. the op is in the blocklist (due to UT failures, and we will burn down this list op by op)
Over the long run, we wish to implement all meta functions in python. With this PR, 321 op_overloads will have cpp meta overridden by python meta. There are still 400 op_overloads is using cpp meta. The exact list can be found here https://gist.github.com/SherlockNoMad/d20bb736178df8eebd3b054c8bb7cdc5
cc @ngimel @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87426
Approved by: https://github.com/ezyang, https://github.com/jansel