Commit Graph

355 Commits

Author SHA1 Message Date
drisspg
c46fc46dba expose mem-eff to autograd (#110495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110495
Approved by: https://github.com/jbschlosser
2023-11-13 17:47:40 +00:00
Peter Bell
44f1c6e41c [inductor] Handle variance corrections larger than number of data points (#113284)
Fixes #113167

When correction is larger than the number of data points, we should return a nan
by dividing by zero, as is done in the eager implementation.

5ea76f1760/aten/src/ATen/native/SharedReduceOps.h (L137)
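
A hypothetical repro sketch (not the test added in this PR) of the behavior being aligned with eager:

```python
import torch

def fn(x):
    return torch.var(x, correction=10)  # correction exceeds the element count

x = torch.randn(4)
print(fn(x))                 # eager: divides by zero -> non-finite result
print(torch.compile(fn)(x))  # after this fix, the compiled result matches eager
```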

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113284
Approved by: https://github.com/lezcano
2023-11-13 11:16:17 +00:00
Edward Z. Yang
9752ef595c [BE] Consistently use the sym_stride lowering, instead of short-circuiting before (#113071)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113071
Approved by: https://github.com/voznesenskym
2023-11-10 21:19:12 +00:00
Roger Lam
289d887a41 Fix ZeroDivisionError when unfolding a zero-dimension tensor in compile mode (#113259)
Fixes #113026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113259
Approved by: https://github.com/peterbell10
2023-11-09 17:25:36 +00:00
Lucas Pasqualin
1d56e7b5af Adds broadcast to functional collectives (#112668)
Adds `broadcast` to functional collectives, including inductor support.

Test with `python test_inductor_collectives.py -- TestCollectivesMultiProc.test_broadcast_inductor`
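
A hedged usage sketch of the new functional `broadcast` (the signature is assumed to mirror the other functional collectives such as `all_reduce`; check `torch.distributed._functional_collectives` for the exact API). It runs under a single-rank gloo group so it is self-contained:

```python
import os

import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

# Single-rank gloo group so the sketch is runnable in one process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

x = torch.arange(4, dtype=torch.float32)
# Functional collectives return a new tensor instead of mutating in place;
# broadcast from rank 0 over the world process group.
y = funcol.broadcast(x, 0, dist.group.WORLD)
print(y + 0)  # arithmetic forces the pending collective to complete

dist.destroy_process_group()
```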

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112668
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2023-11-09 15:47:52 +00:00
Oguz Ulgen
fbf7866ac9 [Inductor] Fallback scatter when src dtype is bf16 (#113204)
With this fallback in place, the basic_gnn_gcn, basic_gnn_gin, and basic_gnn_sage benchmarks now pass.
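
A hypothetical repro sketch of the op shape involved (illustrative only, not taken from the listed benchmarks): an out-of-place scatter whose `src` is bf16, which inductor now handles via a fallback rather than codegen.

```python
import torch

def fn(x, index, src):
    # scatter along dim 0 with a bfloat16 source tensor
    return torch.scatter(x, 0, index, src)

x = torch.zeros(4, dtype=torch.bfloat16)
index = torch.tensor([0, 2])
src = torch.ones(2, dtype=torch.bfloat16)
print(torch.compile(fn)(x, index, src))
```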

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113204
Approved by: https://github.com/eellison
2023-11-09 03:43:11 +00:00
eellison
325e0fdfdd Enable masked_scatter_backward for inductor (#109642)
masked_scatter_backward was previously implemented as a
CompositeExplicitAutograd, which involved a decomp that calls
masked_select, and masked_select in general produces data-dependent
shapes that inductor doesn't support. But masked_scatter_backward
reshapes the return value of masked_select such that the end result has
a static shape again.

I have converted masked_scatter_backward into an aten op to avoid this
issue.
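
For context, a small eager-mode illustration of why `masked_select` produces data-dependent shapes (not part of the PR):

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0, -4.0])
mask = x > 0
# The output length equals the number of True entries in the mask, so the
# shape is only known once the data itself is inspected at runtime.
print(torch.masked_select(x, mask))  # tensor([1., 3.])
```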

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109642
Approved by: https://github.com/ezyang
ghstack dependencies: #108170
2023-11-09 01:27:57 +00:00
Yifu Wang
625958d8bc Inductor support for native c10d_functional (#112439)
This PR adds Inductor support for [native c10d_functional ops](https://github.com/pytorch/pytorch/pull/110570).

The Inductor IRs introduced in this PR will replace the existing `CollectiveKernel` IR hierarchy. Compared to the existing collective IRs, the new IRs:
- Are target language agnostic and support AOTInductor.
- Express the constraints solely with read/write deps. This maximizes the potential for buffer reuse.
- Address an issue where an out-of-place collective's input buffers could be mutated while the collective was still reading them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112439
Approved by: https://github.com/Chillee
2023-11-08 23:40:21 +00:00
Oguz Ulgen
611a7457ca [Inductor] Kill MutationLayout from ir.py (#112925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112925
Approved by: https://github.com/jansel
2023-11-07 17:03:52 +00:00
Edward Z. Yang
10a829b85d Retarget sym_size/sym_stride lowerings to their .int overloads (#113054)
Fixes https://github.com/pytorch/pytorch/issues/112913

The new logging looks like this:

```
[2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg0_1 : [num_users=0] = placeholder[target=arg0_1]
[2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
[2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG] lowering %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, 1), kwargs = {})
[2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG]   via <function make_pointwise.<locals>.inner at 0x7f0abed28ee0>
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %sym_stride_int : [num_users=1] = call_function[target=torch.ops.aten.sym_stride.int](args = (%add, 0), kwargs = {}) sym_stride
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %sym_stride_int), kwargs = {})
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG]   via <function mul at 0x7f0abec8bd00>
[2023-11-06 12:48:57,744] [0/0] torch._inductor.graph: [DEBUG] lowering return (mul,)
```

Notice that `sym_stride` no longer hits the lowering. This was the behavior before I broke it. A better refactor is coming soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113054
Approved by: https://github.com/davidberard98
2023-11-07 04:15:38 +00:00
Peter Bell
718035791d Prefer e.is_number over not e.free_symbols in SymPy (#112688)
We spend somewhere on the order of 1% of time in `sympy.Expr.free_symbols`, as it is called millions of times.
Most of the time we actually just want to know "is this a constant", but `e.is_constant()` is
horribly slow. It turns out, though, that there is another property, `is_number`, that does what we want.

> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.

Even further, we also avoid the overhead of building the unnecessary set object.
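
A small illustration of the property being relied on (standard SymPy API):

```python
import sympy

x = sympy.Symbol("x")
const = sympy.Integer(2) * 3
expr = x + 1

print(const.is_number, not const.free_symbols)  # True True
print(expr.is_number, not expr.free_symbols)    # False False
# is_number short-circuits at the first free symbol or undefined function,
# whereas free_symbols always materializes a set we would immediately discard.
```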

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
2023-11-06 20:05:13 +00:00
vfdev
59e003d159 Fixed cat uint8 lowering (#112753)
Description:
- Fixed cat uint8 lowering

Without this fix, the repro code below raises an error:
```python
def func(x):
    batch_shape = x.shape[:1]
    out = torch.cat([x.new_zeros(1).expand(batch_shape + (1,)), x], dim=-1)
    return out

cfunc = torch.compile(func)

x = torch.randint(0, 256, size=(3, 255), dtype=torch.uint8)
out = cfunc(x)
```
Error message:
```
  File "/pytorch/torch/_inductor/lowering.py", line 1037, in <genexpr>
    if all(len(input.layout.size) == 4 for input in inputs):
  File "/pytorch/torch/_inductor/ir.py", line 5795, in __getattr__
    fn = getattr(self.data, name)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AttributeError: 'ExpandView' object has no attribute 'layout'
  target: aten.cat.default
  args[0]: [TensorBox(
    ExpandView(data=StorageBox(
      ComputedBuffer(name='buf0', layout=FlexibleLayout('cpu', torch.uint8, size=[1], stride=[1]), data=Pointwise(
        'cpu',
        torch.uint8,
        def inner_fn(index):
            _ = index
            tmp0 = ops.constant(0, torch.uint8)
            return tmp0
        ,
        ranges=[1],
        origin_node=full,
        origins={full}
      ))
    ), size=[3, 1])
  ), TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cpu', torch.uint8, size=[3, 255], stride=[255, 1]))
  ))]
  args[1]: 1

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Context: torch.compile is currently failing on torchvision's `F.equalize` op: https://github.com/pytorch/vision/issues/8056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112753
Approved by: https://github.com/peterbell10
2023-11-06 19:42:04 +00:00
Oguz Ulgen
001573b687 [Inductor] Support one node creating multiple mutations in scheduler (#112547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112547
Approved by: https://github.com/Chillee
2023-11-03 16:01:31 +00:00
leslie-fang-intel
a53d29cc18 Enable oneDNN QLinear FP32/BF16 output (#112126)
**Summary**
- PR 2 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable QLinear (relu) with BFloat16 or Float32 output.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112126
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #112010
2023-11-03 08:20:54 +00:00
leslie-fang-intel
b6fc7af8a0 Enable oneDNN QConv FP32/BF16 output (#112010)
**Summary**

- PR 1 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable QConv (relu, add, add_relu) with BFloat16 or Float32 output.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qconv1d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv3d_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112010
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2023-11-03 08:16:45 +00:00
Jiong Gong
e061144aaf [inductor] replace ops.div with ops.truediv (#112243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112243
Approved by: https://github.com/lezcano
ghstack dependencies: #112234
2023-11-01 05:50:51 +00:00
chilli
3cee033b98 Reland of a bunch of pattern matcher + indexing fixes (#112476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112476
Approved by: https://github.com/oulgen
2023-11-01 02:13:44 +00:00
Jon Chuang
9bfebf754f [dynamo] fix graph break, improve hygiene - enforce using ConstantVariable for torch.device, torch.dtype (#112416)
Fixes https://github.com/pytorch/pytorch/pull/112332/files#r1375690808

Simplify code paths, fix graph break

```
torch._dynamo.exc.InternalTorchDynamoError: TorchVariable() has no type
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112416
Approved by: https://github.com/lezcano
2023-11-01 00:19:52 +00:00
Peter Bell
66c32d099a Use pytree.arg_tree_leaves everywhere (#112394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112394
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393
2023-10-31 15:57:06 +00:00
drisspg
40569b28f4 Constrain fx_stride order for scaled_mm (#112430)
# Summary
cuBLASLt requires row_major @ col_major ordering for scaled_mm. Without adding this constraint, it is possible for the inputs not to respect it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112430
Approved by: https://github.com/eellison
2023-10-31 00:02:35 +00:00
Chengkun Du
67638d4dad torch.compile: fix bug of fallback_randn when 'generator' is None (#112240)
When I run Stable Diffusion in [Huggingface/Diffusers](https://github.com/huggingface/diffusers), an error occurred:
```
LoweringException: AssertionError: should have been handled in replace_random.py.
   target:  aten.randn.generator
   args[0]:  [1, 4, 64, 64]
   kwargs: {'generator': None, 'dtype': torch.float16, 'layout': torch.strided, 'device': device(type='cuda', index=0), 'pin_memory': False}
```
It looks like a dynamo bug, and you can reproduce it like this:
```python
import torch
def model(shape, generator):
      return torch.randn(shape, generator=generator, device="cuda:0")
model = torch.compile(model)
x = model((1, 3, 64, 64), None)
print(x)
```
The error occurs because `None` is passed as the `generator` argument, so dynamo traces `torch.randn` into the fx node `torch.ops.aten.randn.generator`.
`aten.randn.generator` is not handled by a decomposition; it is handled by the lowering in [torch/_inductor/lowering.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/lowering.py#L1815), where `randn.generator` is processed like this:
```python
@register_lowering(aten.randn)
def randn(*args, **kwargs):
    if kwargs.get("generator", None) is not None:
        return fallback_randn_generator(*args, **kwargs)
    elif config.fallback_random:
        return fallback_randn_default(*args, **kwargs)
    raise AssertionError("should have been handled in replace_random.py")
```
As you can see, because `generator` is None it does not take the `fallback_randn_generator` path, and if you don't enable `config.fallback_random` it does not take the `fallback_randn_default` path either. Actually, when `generator` is None the call could just as well be handled as `aten.randn.default`. And then the AssertionError will be thrown; I will not discuss here how that case should be handled and will open a separate issue for it.

`config.fallback_random` offers a way to debug randn in [config.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py#L190), so I tried enabling it to debug my model. But when I enable it with:
```python
# fallback to eager for random/dropout, this is slow but useful for debugging
fallback_random = True
```
Another error occurs!
```
LoweringException: RuntimeError: Unknown keyword argument 'generator' for operator 'aten::randn'. Schema: aten::randn(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor
```
Obviously, `aten::randn` does not support `kwargs: {generator: None}`, so the `generator` entry should be popped before the kwargs are fed into `fallback_randn_default`.
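
A minimal sketch of the described fix applied to the lowering quoted above (assumed shape only; see the actual change in this PR for the real implementation):

```python
@register_lowering(aten.randn)
def randn(*args, **kwargs):
    if kwargs.get("generator", None) is not None:
        return fallback_randn_generator(*args, **kwargs)
    elif config.fallback_random:
        # aten::randn has no 'generator' kwarg, so drop the explicit None
        # before falling back to the default overload.
        kwargs.pop("generator", None)
        return fallback_randn_default(*args, **kwargs)
    raise AssertionError("should have been handled in replace_random.py")
```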

That's all I'm going to say. Thanks for reading carefully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112240
Approved by: https://github.com/jansel
2023-10-30 21:10:54 +00:00
PyTorch MergeBot
fc0b0820fc Revert "Readded device_assert skipping in index and index_put (and also added (#112093)"
This reverts commit b110d87ac2.

Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1785922905))
2023-10-30 19:45:41 +00:00
PyTorch MergeBot
052f7a3edc Revert "Added patterns for randperm + index_add (#112102)"
This reverts commit 1ff0b82be9.

Reverted https://github.com/pytorch/pytorch/pull/112102 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112102#issuecomment-1785916704))
2023-10-30 19:41:29 +00:00
Peter Bell
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.
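
An illustrative before/after of the pattern being replaced (using the pytree helpers from this stack):

```python
import torch.utils._pytree as pytree

data = {"a": [1, 2], "b": (3,)}

# Before: flatten and throw away the spec
leaves, _ = pytree.tree_flatten(data)

# After: ask for the leaves directly
leaves = pytree.tree_leaves(data)
print(leaves)  # [1, 2, 3]
```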

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
chilli
1ff0b82be9 Added patterns for randperm + index_add (#112102)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112102
Approved by: https://github.com/lezcano
ghstack dependencies: #112093, #112101
2023-10-28 01:26:52 +00:00
chilli
b110d87ac2 Readded device_assert skipping in index and index_put (and also added (#112093)
copy to noop pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093
Approved by: https://github.com/oulgen, https://github.com/lezcano
2023-10-27 18:23:49 +00:00
Elias Ellison
7cb72704cc Constrain sdpa to fx strides (#111721)
Fix for https://github.com/pytorch/pytorch/issues/109607. sdpa requires last dimension strides to be 1. Add constraint so that we run the op with the strides we observed in tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111721
Approved by: https://github.com/drisspg, https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #111976
2023-10-27 03:23:27 +00:00
PyTorch MergeBot
0a3199dd7e Revert "Readded device_assert skipping in index and index_put (and also added (#112093)"
This reverts commit e38347f490.

Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/izaitsevfb due to Sorry, trying to resolve a conflict with intern, and unblock the revert of #108690 ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1782154814))
2023-10-27 01:37:33 +00:00
lezcano
47ccf04885 Split SymNode into its own file (#112037)
This PR:

- Moves TrueDiv, LShift, RShift, IsNonOverlappingAndDenseIndicator to `_sympy.functions.py`
- Moves SymNode to `fx.experimental.sym_node`.
  - This file does not have any SymPy dependencies at import time
  - It installs the magic methods in Sym{Bool,Int,Float}.
  - N.b. With this split, we may be able to move Sym{Bool,Int,Float} to this file, and remove quite a few of the hacks around these classes
- Imports `sym_node` in `torch/__init__.py` rather than the whole `symbolic_shapes.py`.
  This breaks the import-time dependency between torch and SymPy (see the import sketch below).
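
A short import sketch of the new layout (paths as described in the bullets above):

```python
# SymNode now lives in its own module with no import-time SymPy dependency.
from torch.fx.experimental.sym_node import SymNode  # noqa: F401

# The magic methods it installs target the public symbolic types.
from torch import SymBool, SymFloat, SymInt  # noqa: F401
```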

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112037
Approved by: https://github.com/peterbell10
ghstack dependencies: #112035, #112036
2023-10-26 23:32:27 +00:00
PyTorch MergeBot
55ab9932f5 Revert "Constrain sdpa to fx strides (#111721)"
This reverts commit 8a7c3cec78.

Reverted https://github.com/pytorch/pytorch/pull/111721 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking ROCm job in trunk 8a7c3cec78 ([comment](https://github.com/pytorch/pytorch/pull/111721#issuecomment-1782064133))
2023-10-26 23:27:57 +00:00
Elias Ellison
8a7c3cec78 Constrain sdpa to fx strides (#111721)
Fix for https://github.com/pytorch/pytorch/issues/109607. sdpa requires last dimension strides to be 1. Add constraint so that we run the op with the strides we observed in tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111721
Approved by: https://github.com/drisspg, https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #111976
2023-10-26 20:21:55 +00:00
chilli
e38347f490 Readded device_assert skipping in index and index_put (and also added (#112093)
copy to noop pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093
Approved by: https://github.com/oulgen, https://github.com/lezcano
ghstack dependencies: #111990
2023-10-26 07:54:44 +00:00
Oleg Bulatov
192477b5ba Enable flake8-bugbear B020 lint (#110823)
Fixes part of https://github.com/pytorch/pytorch/issues/106571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110823
Approved by: https://github.com/Skylion007
2023-10-24 22:43:47 +00:00
Hongtao Yu
6977ba6e3c [inductor] decomposition for complex addition (#110740)
Tracks https://github.com/pytorch/pytorch/issues/98161

Complex number support in PyTorch isn't ideal today, as complex operations mostly fall back to the aten runtime, except for `torch.angle`, which is handled in [105609](https://github.com/pytorch/pytorch/pull/105609). In general, a better approach could be to decompose complex operations first, so that more fusion opportunities are exposed, and then have Triton handle non-contiguous (strided) tensor operations more efficiently. This change adds support for decomposing complex additions.

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```
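
For reference, a hedged repro sketch of the kind of compiled addition that produces a kernel like the one above (assumes a CUDA build, since the generated code is Triton):

```python
import torch

def fn(a, b):
    return a + b

# Three complex64 elements decompose into 6 float32 values, matching
# xnumel = 6 in the generated kernel above.
a = torch.randn(3, dtype=torch.complex64, device="cuda")
b = torch.randn(3, dtype=torch.complex64, device="cuda")
print(torch.compile(fn)(a, b))
```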

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110740
Approved by: https://github.com/jansel
2023-10-24 03:41:24 +00:00
Oguz Ulgen
2b2b6caf8f [inductor] Implement clone removal for user defined triton kernel via reinplace_scatters (#111627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111627
Approved by: https://github.com/jansel
ghstack dependencies: #111434
2023-10-22 22:28:00 +00:00
Oguz Ulgen
977d3bcc46 [Inductor] Support user defined triton kernels in inductor (#111434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434
Approved by: https://github.com/jansel
2023-10-22 17:04:19 +00:00
Jason Ansel
a1154e673b [Compiled Autograd] Turn accumulate_grad into an op (#111700)
Relands #111271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111700
Approved by: https://github.com/voznesenskym
2023-10-21 17:31:09 +00:00
Elias Ellison
0a147fd112 Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233)
Improves perf of llama_v2 locally from 1.55 -> 1.57

The initial heuristic is to lower to pointwise if # of inputs is <= 4, and all the inputs are pointwise or cannot be memory planned away, or if all the outputs are pointwise.

The perf run was +3% on inference. There are definitely instances where we should be lowering to foreach_kernels, but they are less flexible for fusion. The motivating example was:

```
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota =  torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)

    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]);  unsqueeze = None

    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

Also not sure if I should be more worried about concatting reduction->pointwise inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
2023-10-21 02:34:05 +00:00
Jane Xu
93a9b1314b Make step() faster by passing in a tensor vs scalar 1 (#111084)
This is the culmination of https://github.com/pytorch/pytorch/pull/110954#issuecomment-1758520411.

We are making the code slightly more complicated to gain some perf by minimizing calls to `.copy_()` and `.to()`.

### Code
```
import torch
with torch.cuda.device(0):
    steps = [torch.zeros((), device="cpu", dtype=torch.float32) for i in range(1000)]

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ]
    ) as p:
        # New code:
        # step_device = steps[0].device
        # one = torch.tensor(1.0, device=step_device) if str(step_device) == "cpu" else 1
        # torch._foreach_add_(steps, one, 1.0)

        # Old code:
        torch._foreach_add_(steps, 1)

    print(p.key_averages().table(sort_by="cpu_time_total"))
```

### Profiles
**with old code**
```
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
      aten::_foreach_add_        35.31%      52.089ms        99.99%     147.495ms     147.495ms             1
               aten::add_        25.05%      36.949ms        64.68%      95.406ms      95.406us          1000
                 aten::to         3.97%       5.852ms        39.63%      58.457ms      58.457us          1000
           aten::_to_copy        10.11%      14.917ms        35.66%      52.605ms      52.605us          1000
              aten::copy_        21.65%      31.939ms        21.65%      31.939ms      31.939us          1000
      aten::empty_strided         3.90%       5.749ms         3.90%       5.749ms       5.749us          1000
    cudaDeviceSynchronize         0.01%      18.000us         0.01%      18.000us      18.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 147.513ms
```

**with new code**
```
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
      aten::_foreach_add_        55.06%      49.963ms        99.86%      90.625ms      90.625ms             1
               aten::add_        44.81%      40.662ms        44.81%      40.662ms      40.662us          1000
            aten::detach_         0.01%       8.000us         0.05%      45.000us      45.000us             1
                  detach_         0.04%      37.000us         0.04%      37.000us      37.000us             1
              aten::empty         0.03%      30.000us         0.03%      30.000us      30.000us             1
                 aten::to         0.03%      23.000us         0.03%      23.000us      23.000us             1
    cudaDeviceSynchronize         0.02%      22.000us         0.02%      22.000us      22.000us             1
         aten::lift_fresh         0.01%       6.000us         0.01%       6.000us       6.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 90.751ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111084
Approved by: https://github.com/albanD
ghstack dependencies: #111079
2023-10-20 01:34:08 +00:00
Ying Zhang
dc31dbbcab Optimize reduction + amax fusion (#111122)
This PR optimizes cases like layer_norm + fp8 quant fusion (which includes the amax reduction and the fp8 cast) when the amax is split into multiple reduction kernels.

Benchmark:
```
python test/inductor/test_fp8.py -k test_layernorm_fp8_quant_benchmark

Before this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.13262102689486555ms, Eager: 0.8211962616822429ms, LN only Inductor: 0.09606276150627614ms.

After this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.08281274131274131ms, Eager: 0.8217452830188678ms, LN only Inductor: 0.09586902286902287ms.
```

LN + fp8 quant is even faster than LN itself. The reason could be that LN + fp8 outputs fp8 while LN outputs fp16.

From Inductor nightly benchmark test:
There are perf differences in cuda_graph / cuda_graph_dynamic / default runs, but no difference in inductor_max_autotune. So it seems to me that the perf differences are most likely fluctuations.

![Screenshot 2023-10-18 at 4 58 55 PM](https://github.com/pytorch/pytorch/assets/10527447/6640474a-1e1d-4d33-97e9-0a60d0bc9f1f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111122
Approved by: https://github.com/jansel
2023-10-19 20:53:50 +00:00
Michael Lazos
543dc75746 [Reland] horizontal concat fusion (#111437)
Reland https://github.com/pytorch/pytorch/pull/108115

The main fix is to disallow nop nodes from being included in foreach scheduler nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111437
Approved by: https://github.com/yanboliang
2023-10-18 17:09:01 +00:00
PyTorch MergeBot
3eb5cae3af Revert "[Compiled Autograd] Turn accumulate_grad into an op (#111271)"
This reverts commit 04b04c0686.

Reverted https://github.com/pytorch/pytorch/pull/111271 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111271#issuecomment-1768527932))
2023-10-18 14:02:34 +00:00
Sherlock Huang
1aad6d803a [Reland][Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567) (#111396)
This is a reland of #110567 with additional fbcode fixes.

Summary:
In ABI-compatible mode, we always need op_overload.schema for FallbackKernel.

Approved by: https://github.com/jansel

Test Plan: contbuild & OSS CI, see 37a0265992

Differential Revision: D50339346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111396
Approved by: https://github.com/chenyang78
2023-10-17 18:53:38 +00:00
Jason Ansel
04b04c0686 [Compiled Autograd] Turn accumulate_grad into an op (#111271)
Rather than baking the behavior of `AccumulateGrad` nodes into the generated graph (either as `+=` or as a return value of the graph), this creates a new `accumulate_grad_` dispatcher op that is included in the generated graph like:
```
def forward(self, inputs, sizes, hooks):
    getitem = inputs[0]
    getitem_1 = inputs[1]
    getitem_2 = inputs[2]
    getitem_3 = inputs[3]
    getitem_4 = inputs[4]
    getitem_5 = inputs[5]
    getitem_6 = inputs[6]
    getitem_7 = inputs[7]
    getitem_8 = inputs[8]
    getitem_9 = inputs[9];  inputs = None
    expand = torch.ops.aten.expand.default(getitem, [2, 4]);  getitem = None
    threshold_backward = torch.ops.aten.threshold_backward.default(expand, getitem_1, 0);  expand = getitem_1 = None
    t = torch.ops.aten.t.default(getitem_3);  getitem_3 = None
    mm = torch.ops.aten.mm.default(threshold_backward, t);  t = None
    t_1 = torch.ops.aten.t.default(threshold_backward)
    mm_1 = torch.ops.aten.mm.default(t_1, getitem_2);  t_1 = getitem_2 = None
    t_2 = torch.ops.aten.t.default(mm_1);  mm_1 = None
    sum_1 = torch.ops.aten.sum.dim_IntList(threshold_backward, [0], True);  threshold_backward = None
    view = torch.ops.aten.view.default(sum_1, [4]);  sum_1 = None
    t_3 = torch.ops.aten.t.default(t_2);  t_2 = None
    accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, t_3);  getitem_4 = t_3 = None
    threshold_backward_1 = torch.ops.aten.threshold_backward.default(mm, getitem_5, 0);  mm = getitem_5 = None
    t_4 = torch.ops.aten.t.default(threshold_backward_1)
    mm_2 = torch.ops.aten.mm.default(t_4, getitem_6);  t_4 = getitem_6 = None
    t_5 = torch.ops.aten.t.default(mm_2);  mm_2 = None
    sum_2 = torch.ops.aten.sum.dim_IntList(threshold_backward_1, [0], True);  threshold_backward_1 = None
    view_1 = torch.ops.aten.view.default(sum_2, [4]);  sum_2 = None
    t_6 = torch.ops.aten.t.default(t_5);  t_5 = None
    accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_7, t_6);  getitem_7 = t_6 = None
    accumulate_grad__2 = torch.ops.inductor.accumulate_grad_.default(getitem_8, view_1);  getitem_8 = view_1 = None
    accumulate_grad__3 = torch.ops.inductor.accumulate_grad_.default(getitem_9, view);  getitem_9 = view = None
    return []

```

The motivation here is that `AccumulateGrad` nodes are causing trouble in FSDP tracing, since FSDP resizes parameters and parameter storage in place inside hooks. We will model this mutation in dynamo, but not during the initial compiled autograd capture. This allows us to bypass failing shape checks in the initial capture.
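
For intuition, an eager-mode sketch of the semantics the new op carries (a hedged approximation; the real op is registered under `torch.ops.inductor` and is not shown here):

```python
import torch

def accumulate_grad_(variable: torch.Tensor, new_grad: torch.Tensor) -> None:
    # Mirror AccumulateGrad: initialize .grad on first use, otherwise add into it.
    if variable.grad is None:
        variable.grad = new_grad.detach().clone()
    else:
        variable.grad.add_(new_grad)

p = torch.nn.Parameter(torch.zeros(2))
accumulate_grad_(p, torch.ones(2))
accumulate_grad_(p, torch.ones(2))
print(p.grad)  # tensor([2., 2.])
```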

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111271
Approved by: https://github.com/voznesenskym
2023-10-16 21:16:17 +00:00
Adnan Akhundov
d7317d8a11 Fix size_hint call sites failing on unbacked SymInts (#110520)
Summary: Unbacked SymInts can't get a `sizevars.size_hint` because they are data-dependent. #109893 added a new `fallback` parameter to `sizevars.size_hint` to specify the fallback value in cases like unbacked SymInts. In this PR we add the fallback to more call sites.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110520
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-10-14 08:10:09 +00:00
Edward Z. Yang
d38472c176 Don't sympify reflection_pad2d ranges (#111212)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111212
Approved by: https://github.com/eellison
2023-10-13 21:36:30 +00:00
Edward Z. Yang
d24539ee6a Improve reflection_pad2d lowering for dynamic shapes (#110988)
Fixes https://github.com/pytorch/pytorch/issues/110696

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110988
Approved by: https://github.com/jansel, https://github.com/lezcano
2023-10-13 13:38:46 +00:00
Oguz Ulgen
fd4ba806f6 Implement tensor slice in inductor to stop falling back for aten.index (#111015)
Fixes #110711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111015
Approved by: https://github.com/Chillee
2023-10-11 17:53:24 +00:00
eellison
c5f06b9753 Re-enable test_copy_transpose_math_view, neg_view/dce fix (#110651)
- neg view can just be lowered to neg() post-functionalization
- We were treating all fallback kernels as having no side effects. We shouldn't DCE mutating fallback kernels - either mutations introduced by the reinplacing pass or clone_ with unsupported arguments (complex).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110651
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/malfet, https://github.com/Skylion007
2023-10-10 16:34:01 +00:00
PyTorch MergeBot
19ecb5d0d5 Revert "[Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567)"
This reverts commit 37a0265992.

Reverted https://github.com/pytorch/pytorch/pull/110567 on behalf of https://github.com/kit1980 due to breaking internal builds, see D50091340 ([comment](https://github.com/pytorch/pytorch/pull/110567#issuecomment-1754308982))
2023-10-10 03:49:20 +00:00