Commit Graph

151 Commits

Author SHA1 Message Date
Ying Zhang
a1e3c50165 A small fix for do_bench_using_profiling (#113611)
ATT, there are cases where multiple kernel invocations have the same kernel name, and key_averages() will wrongly average results across different invocations. This fix uses cuda_time_total / n_repeat instead.
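A minimal sketch of the approach described here (illustrative only, not the exact code in the PR): run the function n_repeat times under the profiler and divide the summed CUDA time by n_repeat, rather than relying on the per-entry averages from key_averages().

```python
import torch
from torch.profiler import profile, ProfilerActivity

def bench_using_profiling(fn, n_repeat=10):
    fn()                          # warm up so compilation doesn't pollute the timing
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(n_repeat):
            fn()
        torch.cuda.synchronize()
    # Sum the total CUDA time across all events, then divide by the number of
    # invocations; this avoids averaging across distinct invocations that happen
    # to share a kernel name.
    total_us = sum(e.cuda_time_total for e in prof.key_averages())
    return total_us / n_repeat / 1000.0  # average milliseconds per invocation
```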

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113611
Approved by: https://github.com/chenyang78
2023-11-14 06:31:22 +00:00
Jez Ng
6805d1e1d6 [inductor] Make graph.py pass follow_imports typechecking (#113518)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113518
Approved by: https://github.com/Skylion007
ghstack dependencies: #113413
2023-11-11 22:15:46 +00:00
Jez Ng
5a9f08feb5 [inductor] Make {joint_graph,inductor_prims,utils}.py pass follow_imports typechecking (#113410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113410
Approved by: https://github.com/lezcano
ghstack dependencies: #113409
2023-11-10 19:58:08 +00:00
Peter Bell
718035791d Prefer e.is_number over not e.free_symbols in SymPy (#112688)
We spend somewhere on the order of 1% of time in `sympy.Expr.free_symbols`, as it is called millions of times.
Most of the time we actually just want to know "is this a constant?"; however, `e.is_constant()` is
horribly slow. It turns out there is another property, `is_number`, that does what we want.

> property is_number:
>
> Returns True if `self` has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than `if not self.free_symbols`, however, since `is_number` will fail as soon as it hits a free symbol or undefined
> function.

Even further, we also avoid the overhead of building the unnecessary set object.
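For illustration (not from the PR), the two checks agree on what counts as a constant, but `is_number` can short-circuit while `free_symbols` always builds a set first:

```python
import sympy

x = sympy.Symbol("x")

const = sympy.Integer(2) * sympy.pi
assert const.is_number and not const.free_symbols

expr = x + 1
assert expr.free_symbols == {x}   # materializes a set before we can test emptiness
assert not expr.is_number         # bails out as soon as it sees the free symbol x
```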

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
2023-11-06 20:05:13 +00:00
Bin Bao
bd9be877e4 [aotinductor] Move cache_dir to utils.py (#112728)
Summary: Some tests can utilize cache_dir()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112728
Approved by: https://github.com/jansel, https://github.com/chenyang78
ghstack dependencies: #112651
2023-11-06 03:42:10 +00:00
Jez Ng
ae85ba820f [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
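A toy sketch of greedy-by-size planning (not the actual Inductor implementation; alignment and the space/time tree described above are omitted): sort buffers by estimated size, then place each one at the lowest offset that does not collide with an already-placed buffer whose live range overlaps.

```python
def plan_greedy_by_size(buffers):
    """buffers: list of (name, size_bytes, (live_start, live_end)) in node order.
    Returns {name: offset} within one shared allocation pool."""
    placed = []    # (offset, size, live_range)
    offsets = {}
    for name, size, live in sorted(buffers, key=lambda b: -b[1]):
        offset = 0
        moved = True
        while moved:               # rescan until the candidate offset is conflict-free
            moved = False
            for p_off, p_size, p_live in placed:
                overlaps_time = live[0] < p_live[1] and p_live[0] < live[1]
                overlaps_space = offset < p_off + p_size and p_off < offset + size
                if overlaps_time and overlaps_space:
                    offset = p_off + p_size   # bump past the conflicting buffer
                    moved = True
        placed.append((offset, size, live))
        offsets[name] = offset
    return offsets

# Buffers with disjoint live ranges can share the same offset:
print(plan_greedy_by_size([("a", 1024, (0, 3)), ("b", 1024, (3, 5)), ("c", 512, (1, 4))]))
# -> {'a': 0, 'b': 0, 'c': 1024}
```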

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-11-02 07:39:13 +00:00
Edward Z. Yang
5b0840c71b Guarantee expr is a sympy.Expr before xreplace'ing it (#112619)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112619
Approved by: https://github.com/eellison, https://github.com/voznesenskym
2023-11-01 21:26:27 +00:00
PyTorch MergeBot
74e6c877e9 Revert "[inductor] Memory planning (#112178)"
This reverts commit f64a97c6f8.

Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too f64a97c6f8 ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))
2023-11-01 00:03:56 +00:00
Jez Ng
f64a97c6f8 [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-31 20:02:30 +00:00
Ying Zhang
128f4db77e A small fix in "do_bench_using_profiling" (#112223)
This is a small fix in "do_bench_using_profiling()".
When CUDA kernels are executed on a non-default CUDA stream and cuda.synchronize() is called, a CUDA kernel named "Context Sync" is launched on the default stream to wait until all other streams are finished. This kernel has "CUDA time" but is not a real kernel to profile. This fix excludes "Context Sync" when calculating the total kernel time.
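Illustratively (a sketch, not the exact PR code), the exclusion is just a filter by event name over the profiler results:

```python
def total_kernel_time_us(prof):
    # `prof` is a finished torch.profiler profile, as in the sketch under commit
    # a1e3c50165 above. Skip the synthetic "Context Sync" event that
    # cuda.synchronize() launches on the default stream: it carries CUDA time
    # but is not a real kernel.
    events = [e for e in prof.key_averages() if e.key != "Context Sync"]
    return sum(e.cuda_time_total for e in events)
```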

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112223
Approved by: https://github.com/int3, https://github.com/chenyang78
2023-10-27 23:08:38 +00:00
angelayi
b126adcdee [aotinductor] Pass TorchIR to AOTInductor (#110020)
Updates `_export.aot_compile` to pass a Torch IR graph to Inductor, allowing Inductor to run the pre_grad_passes and reuse more of Inductor's code.
Also updates the API to only return the `so_path`, rather than also returning the exported program. The pytree call spec is now serialized and placed inside the generated model code. When calling the model, because there is no C++ pytree implementation linked yet, we can access the call specs through `get_call_spec()` and call pytree flatten/unflatten in Python.
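A rough sketch of the calling pattern described here (a hypothetical wrapper; `runner`, `in_spec`, and `out_spec` stand in for whatever the loaded .so and its serialized call specs expose):

```python
import torch.utils._pytree as pytree

def call_compiled_model(runner, in_spec, out_spec, *args, **kwargs):
    # Flatten the (args, kwargs) tree in Python, since no C++ pytree is linked yet.
    flat_inputs, actual_in_spec = pytree.tree_flatten((args, kwargs))
    assert actual_in_spec == in_spec, "inputs do not match the serialized call spec"
    flat_outputs = runner(flat_inputs)            # run the AOT-compiled .so
    # Rebuild the original output structure from the serialized output spec.
    return pytree.tree_unflatten(flat_outputs, out_spec)
```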

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
2023-10-26 15:54:31 +00:00
Bin Bao
ce48d36324 [aotinductor] Update test utility to use AOTIModelRunner (#111657)
Summary: Use AOTIModelRunner provided by libtorch instead of the custom-written RAIIModelContainer for testing. This change also makes running AOTInductor benchmarks on CPU possible.

Differential Revision: [D50560764](https://our.internmc.facebook.com/intern/diff/D50560764)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111657
Approved by: https://github.com/chenyang78
2023-10-23 18:21:27 +00:00
Oguz Ulgen
977d3bcc46 [Inductor] Support user defined triton kernels in inductor (#111434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434
Approved by: https://github.com/jansel
2023-10-22 17:04:19 +00:00
Elias Ellison
0a147fd112 Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233)
Improves perf of llama_v2 locally from 1.55 -> 1.57

The initial heuristic is to lower the cat to a pointwise kernel if the number of inputs is <= 4 and all the inputs are pointwise or cannot be memory-planned away, or if all the outputs are pointwise.

Perf run was +3% on inference. There are definitely instances where we should be lowering to foreach kernels, but those are less flexible for fusion. The motivating example was:

```
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota =  torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)

    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]);  unsqueeze = None

    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

Also not sure if I should be more worried about concatting reduction->pointwise inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
2023-10-21 02:34:05 +00:00
Aaron Gokaslan
cb856b08b2 [BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496)
Did some easy fixes from enabling TRY200. Most of these seem like oversights rather than intentional. The proper way to silence intentional cases is `raise ... from None`, which notes that you considered whether the exception should carry a cause and decided against it.
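For reference, the two patterns look like this (a generic example, not code from the PR):

```python
def read_version(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError as err:
        # TRY200-style fix: chain the original exception so the traceback keeps its cause.
        raise RuntimeError(f"cannot read version from {path}") from err
    except ValueError:
        # Intentional suppression: `from None` documents that the cause was
        # considered and deliberately dropped.
        raise RuntimeError(f"{path} does not contain an integer") from None
```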

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
2023-10-19 21:56:36 +00:00
Michael Lazos
543dc75746 [Reland] horizontal concat fusion (#111437)
Reland https://github.com/pytorch/pytorch/pull/108115

The main fix is to disallow nop nodes to be included in foreach scheduler nodes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111437
Approved by: https://github.com/yanboliang
2023-10-18 17:09:01 +00:00
Will Feng
b28cb43f5c Intra-graph reordering pass on Inductor scheduler IR (based on #100762) (#108091)
This PR implements an intra-graph communication reordering pass on the Inductor scheduler IR, based on Horace's previous PR #100762.

Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Moves computes following simple heuristics to improve overlap (a toy sketch of steps 1 and 2 follows).
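A minimal sketch of the first two greedy steps, assuming a simplified node representation (this is not the actual scheduler pass):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                  # "compute", "comm", or "wait"
    deps: set = field(default_factory=set)     # names of nodes this node reads from

def sink_waits(order):
    """Greedily move each wait node just before its first user."""
    result = list(order)
    for node in [n for n in order if n.kind == "wait"]:
        result.remove(node)
        first_use = next((i for i, n in enumerate(result) if node.name in n.deps),
                         len(result))
        result.insert(first_use, node)
    return result

def raise_comms(order):
    """Greedily move each comm node just after the last node it depends on."""
    result = list(order)
    for node in [n for n in order if n.kind == "comm"]:
        result.remove(node)
        last_dep = max((i for i, n in enumerate(result) if n.name in node.deps),
                       default=-1)
        result.insert(last_dep + 1, node)
    return result
```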

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2023-10-14 14:51:24 +00:00
Jack Taylor
94c9dbff22 Disable cutlass_template on ROCm (#111132)
Fixes #111066 #111065 #111064

Currently `use_cutlass_template` returns True on ROCm even though the feature is not supported; fix it to return False on ROCm. I considered adding this change to `try_import_cutlass` instead, but the comments hinted that this function would be removed at some point.
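The gist of the fix, sketched (the real `use_cutlass_template` checks more conditions than this):

```python
import torch

def use_cutlass_template() -> bool:
    # On ROCm builds torch.version.hip is set; CUTLASS templates are unsupported there.
    if torch.version.hip is not None:
        return False
    # ... remaining CUDA / CUTLASS availability and config checks ...
    return True
```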

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111132
Approved by: https://github.com/jansel
2023-10-12 17:14:07 +00:00
eellison
c5f06b9753 Re-enable test_copy_transpose_math_view, neg_view/dce fix (#110651)
- neg view can just be lowered to neg() post-functionalization
- we were treating all fallback kernels as not having side effects; we shouldn't DCE mutating fallback kernels - either mutations induced by the reinplacing pass or clone_ with unsupported arguments (complex)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110651
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/malfet, https://github.com/Skylion007
2023-10-10 16:34:01 +00:00
chilli
f767a6c57a Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110504
Approved by: https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: #110501
2023-10-05 15:47:30 +00:00
PyTorch MergeBot
1e4c0641ce Revert "Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)"
This reverts commit 9648df1a6a.

Reverted https://github.com/pytorch/pytorch/pull/110504 on behalf of https://github.com/PaliC due to temporarily will revert as it's causing problems with difftrain import ([comment](https://github.com/pytorch/pytorch/pull/110504#issuecomment-1749132253))
2023-10-05 15:28:23 +00:00
Kazuaki Ishizaki
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
chilli
9648df1a6a Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110504
Approved by: https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: #110501
2023-10-05 01:34:57 +00:00
Yang Chen
da63c7f2c3 [AOTInductor] remove CUDA dependency for cpp backend (#110409)
Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues for cases where the inference platform does not
have GPUs. This diff removes the CUDA dependency for the cpp backend.

Reviewed By: bertmaher, muchulee8, mikekgfb

Differential Revision: D49800712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
2023-10-03 18:36:00 +00:00
chilli
13681382d5 Add heuristic for when evict_first should be set (and some other minor things) (#108841)
Example of when the `evict_first` heuristic helps.
```
@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
from torch._inductor.utils import do_bench
print(do_bench(lambda: f(*inps)))
```

This generates code like this: http://ix.io/4HFs

```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-01 17:06:12 +00:00
Edward Z. Yang
d1a13129bb Add support for item() and nonzero() codegen in Inductor (#109893)
This is another version of
https://github.com/pytorch/pytorch/pull/109262 that I think is more
harmonious with inductor design.
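As a user-facing illustration (not from the PR; the config flag names are assumptions based on that era's dynamo API), this is the kind of program such codegen enables:

```python
import torch
import torch._dynamo.config as dynamo_config

# Capture data-dependent scalars/shapes as unbacked symbolic ints instead of graph-breaking.
dynamo_config.capture_scalar_outputs = True
dynamo_config.capture_dynamic_output_shape_ops = True

@torch.compile
def f(x):
    n = int(x.sum().item())        # data-dependent scalar
    idx = torch.nonzero(x > 0)     # data-dependent output shape
    return torch.ones(n, device=x.device), idx
```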

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109893
Approved by: https://github.com/jansel
2023-09-28 23:37:31 +00:00
ydwu4
5f7eff0adb Replace node.meta source_fn with source_fn_stack (#108595)
A resubmit of https://github.com/pytorch/pytorch/pull/108447. Copying over the description:

This is a follow-up to the discussion in https://github.com/pytorch/pytorch/pull/108356, where we want to replace source_fn with source_fn_stack.

Before this PR, for the following example:
```python
backend = EagerAndRecordGraphs()

@torch.compile(backend=backend, fullgraph=True)
def cond_f(pred, pred2, x, y):
    def true_fn(pred2, x, y):
        return x + y

    def false_fn(pred2, x, y):
        def true_fn2(x, y):
            return x.sin() - y.cos()

        def false_fn2(x, y):
            return x.cos() - y.sin()

        return control_flow.cond(pred2, true_fn2, false_fn2, (x, y))

    return control_flow.cond(pred, true_fn, false_fn, (pred2, x, y))
```
The graph captured is shown below:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_pred_ : torch.Tensor, L_pred2_ : torch.Tensor, L_x_ : torch.Tensor, L_y_ : torch.Tensor):
        l_pred_ = L_pred_
        l_pred2_ = L_pred2_
        l_x_ = L_x_
        l_y_ = L_y_

        cond_true_1 = self.cond_true_1
        cond_false_1 = self.cond_false_1
        cond = torch.ops.higher_order.cond(l_pred_, cond_true_1, cond_false_1, [l_pred2_, l_x_, l_y_]);  l_pred_ = cond_true_1 = cond_false_1 = l_pred2_ = l_x_ = l_y_ = None
        return (cond,)

    class GraphModule(torch.nn.Module):
        def forward(self, l_pred2_, l_x_, l_y_):
            add = l_x_ + l_y_;  l_x_ = l_y_ = None
            return add

    class GraphModule(torch.nn.Module):
        def forward(self, l_pred2_, l_x_, l_y_):
            cond_true_0 = self.cond_true_0
            cond_false_0 = self.cond_false_0
            cond = torch.ops.higher_order.cond(l_pred2_, cond_true_0, cond_false_0, [l_x_, l_y_]);  l_pred2_ = cond_true_0 = cond_false_0 = l_x_ = l_y_ = None
            return cond

        class GraphModule(torch.nn.Module):
            def forward(self, l_x_, l_y_):
                sin = l_x_.sin();  l_x_ = None
                cos = l_y_.cos();  l_y_ = None
                sub = sin - cos;  sin = cos = None
                return sub

        class GraphModule(torch.nn.Module):
            def forward(self, l_x_, l_y_):
                cos = l_x_.cos();  l_x_ = None
                sin = l_y_.sin();  l_y_ = None
                sub = cos - sin;  cos = sin = None
                return sub
```
the source_fn for inner cond, sin, cos will be a (name, target) tuple:
```
('cond', <torch._ops.HigherOrderOperator object at xxx>)
('sin', 'sin')
('cos', 'cos')
('sub', <built-in function sub>)
```

After this PR, the source_fn_stack will be a list of (name, target) tuples. The bottom of the stack is the end of the list.
```
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>)],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sin', 'sin')],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cos', 'cos')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sub', <built-in function sub>)]
```

Test Plan:
See added tests in test_higher_order_ops.py and modify existing test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108595
Approved by: https://github.com/angelayi, https://github.com/zou3519
2023-09-28 18:18:36 +00:00
Bin Bao
4bf1cd6961 [aotinductor] Rename aot_runtime to aoti_runtime (#110007)
Summary: Make the naming more explicit

Differential Revision: D49593528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110007
Approved by: https://github.com/houseroad
2023-09-26 00:46:54 +00:00
Yu, Guangye
e9c9b1ed59 [Inductor] Generalize inductor triton backend device agnostic (#109486)
# Motivation
@jansel As discussed before, we would like to generalize some CUDA-specific code. This makes Inductor friendlier to third-party backends, so they can leverage Inductor code as much as possible.

# Solution
To implement this, we introduce a device runtime abstraction. We wrap the runtime calls inside `DeviceInterface` and use `register_interface_for_device` to register each kind of device with Inductor, then use `get_interface_for_device` to fetch the corresponding runtime for a device type. Usage looks like this:
```python
device_interface = get_interface_for_device("xpu")
device_interface.is_available()   # check whether XPU is available
device_interface.device_count()   # check how many XPU devices are available
```
`DeviceInterface` is a simple abstraction that enables third-party backends implementing CUDA-like semantics to be integrated with Inductor. It prevents third-party backends from having to monkey-patch utility functions, like `decode_device`, that are hard-coded for CUDA.

# Additional Context
The main code changes:
- Make AsyncCompile device-agnostic so it can be leveraged by other backends
- Make some utility functions device-agnostic to avoid monkey patches

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109486
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/EikanWang
2023-09-24 07:49:20 +00:00
Oguz Ulgen
1df14f1bf8 Move has_triton to top level triton utils so that dynamo can also access (#109832)
it without creating cyclic dependencies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109832
Approved by: https://github.com/zou3519
2023-09-22 19:33:41 +00:00
Bin Bao
8856c1628e [inductor] Change AOTInductor to return output tensors (#109790)
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:

* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases where we still have to codegen extra copy_ ops to fill the pre-allocated output tensors, which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve reliability.

This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.

Differential Revision: D49502318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
2023-09-22 02:31:52 +00:00
Angela Yi
f7ddc54503 [aotinductor] Update performance benchmark code (109560) (#109820)
Summary: Same as #109560, made a new PR because we need to land from internal

Previously, during performance benchmark testing, we would create an AOTInductorModelContainerHandle every time the compiled function was run with new inputs. However, after https://github.com/pytorch/pytorch/pull/108473 we now load the constants needed by the runtime when initializing the AOTInductorModelContainerHandle. This resulted in our benchmarks displaying a ~0.4x speedup.

This diff moves the initialization of AOTInductorModelContainerHandle outside of the code where we run the compiled function with different inputs.

For example,
```
python benchmarks/dynamo/huggingface.py --performance --cold-start-latency --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda --total-partitions 3 --partition-id 0 --only AlbertForMaskedLM
```
results in `1.359x` speedup.

Specifically, this adds `create_container_handle` and `delete_container_handle` functions which need to be called around `run`: we call `create_container_handle` to initialize the AOTInductorModelContainerHandle, `run` to run the compiled .so with different inputs, and then `delete_container_handle` to delete it.
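Illustrative flow (the three function names come from the description above; `so_path` and `benchmark_inputs` are hypothetical, and the exact signatures are assumptions):

```python
handle = create_container_handle(so_path)        # initialize once, outside the timed region
try:
    for example_inputs in benchmark_inputs:
        outputs = run(handle, example_inputs)    # only these calls are benchmarked
finally:
    delete_container_handle(handle)              # tear down after all runs
```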

[Updated dashboard results](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2013%20Sep%202023%2021%3A03%3A55%20GMT&stopTime=Wed%2C%2020%20Sep%202023%2021%3A03%3A55%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/aot_inductor_benchmark&lCommit=f9aa49c4c9a1a140b6f0c4520d1d6d99b57e12fa&rBranch=main&rCommit=015be4cedba357eb931e24bf188479235db7c5c8)

Test Plan: CI

Differential Revision: D49513934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109820
Approved by: https://github.com/desertfire
2023-09-21 20:49:41 +00:00
Bin Bao
9c2715bbb2 [inductor] Clean up AOTInductor runtime ABI (#109678)
Summary: Change the AOTInductor runtime interface to avoid referring to aten data structures directly, mostly at::Tensor and ProxyExecutor. This is a combination of https://github.com/pytorch/pytorch/pull/109436,  https://github.com/pytorch/pytorch/pull/109498, https://github.com/pytorch/pytorch/pull/109450, https://github.com/pytorch/pytorch/pull/109606, plus a few internal build changes.

Reviewed By: frank-wei

Differential Revision: D49374820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109678
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2023-09-21 00:25:24 +00:00
Bert Maher
e87bd9f588 [aot inductor] Make unit tests work on CPU (#109625)
Summary: AOT Inductor is only sort-of supported on CPU right now, but it works with a few hacks (the .so needs to be compiled and run with CUDA present, because we haven't excised the CUDA deps; also, there's an `is_cpu` flag that needs to be plumbed into the call, or else all the weights are erroneously allocated on GPU).

But with those hacks in place, it currently works, so it's worth keeping the current state working (and at some point we'll remove the hacks).

Test Plan:
```
python test_aot_inductor -k test_simple_cpu
```

Reviewers: binbao

Differential Revision: [D49427400](https://our.internmc.facebook.com/intern/diff/D49427400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109625
Approved by: https://github.com/mikekgfb, https://github.com/chenyang78, https://github.com/desertfire
2023-09-20 14:51:44 +00:00
Yang Chen
9cd4548f01 AOTInductor dynamic shape (#109012)
Summary: This PR adds dynamic-shape support for AOTInductor

* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the inference run
method will assign the current real dimensional value to each dynamic
dimension before executing any kernel.

* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.

Differential Revision: D49100472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
2023-09-14 08:00:30 +00:00
Jez Ng
d2d36aad6f Enable typechecking for _inductor/virtualized.py (#108916)
Also add a few more type annotations to utils.py (some of its functions
are called from virtualized.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108916
Approved by: https://github.com/eellison
2023-09-13 13:04:51 +00:00
Ying Zhang
a2d5f13310 [Inductor CUTLASS backend] Step 5: Gemm CUTLASS templates (#108015)
This is step 5 of adding CUTLASS as an alternative Inductor backend.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108015
Approved by: https://github.com/kadeng, https://github.com/jansel, https://github.com/aakhundov
ghstack dependencies: #107802, #107847, #107901, #107931
2023-09-12 17:44:38 +00:00
Ying Zhang
097fd43f8c [Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931)
This is step 4 of adding CUTLASS as an alternative Inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
2023-09-12 17:44:38 +00:00
Ying Zhang
b2d764ece0 [Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest (#107901)
This is step 3 of adding CUTLASS as an alternative Inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107901
Approved by: https://github.com/jansel, https://github.com/aakhundov, https://github.com/kadeng
ghstack dependencies: #107802, #107847
2023-09-12 17:44:36 +00:00
Ying Zhang
102fefac21 [Inductor CUTLASS backend] Step 2: CUDACodeCache (#107847)
This is step 2 of adding CUTLASS as an alternative Inductor backend.
Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107847
Approved by: https://github.com/jansel, https://github.com/kadeng, https://github.com/aakhundov
ghstack dependencies: #107802
2023-09-12 17:44:34 +00:00
David Berard
ed7f9cac91 [inductor] Add CPU-side profiler event names for templates and foreach kernels (#108449)
This passes in the descriptive kernel name as part of the triton_meta dict that gets passed to the CachingAutotuner, for foreach kernels and templates.

Before: [profiler trace screenshot](https://github.com/pytorch/pytorch/assets/5067123/c14e13fc-0d9e-425a-a08b-613ef42aa264)

After: [profiler trace screenshot](https://github.com/pytorch/pytorch/assets/5067123/551bb9a9-865b-401e-b6e0-8ebbe5431565)

This PR also refactors the "magic strings" (KERNEL_NAME and DESCRIPTIVE_KRNL_NAME) into an enum in utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108449
Approved by: https://github.com/jansel
2023-09-09 02:11:13 +00:00
Bin Bao
e91f66471c [reland][inductor] Switch to use the runtime interface for AOTInductor testing (#108878)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/108663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108878
Approved by: https://github.com/muchulee8
2023-09-08 17:58:35 +00:00
Michael Lazos
6c7260407b Back out "Horizontally fuse input concatenation (#108115)" (#108793)
Summary:
Original commit changeset: f15956d96311

Original Phabricator Diff: D48996091

Test Plan: Reverting to unbreak tests

Differential Revision: D49065517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108793
Approved by: https://github.com/Chillee
2023-09-08 05:14:57 +00:00
PyTorch MergeBot
428f5f9e7e Revert "[inductor] Switch to use the runtime interface for AOTInductor testing (#108663)"
This reverts commit 366ce589d0.

Reverted https://github.com/pytorch/pytorch/pull/108663 on behalf of https://github.com/Chillee due to Sorry :'( Need to revert to resolve merge conflict for another revert ([comment](https://github.com/pytorch/pytorch/pull/108663#issuecomment-1711076411))
2023-09-08 05:01:27 +00:00
Bin Bao
366ce589d0 [inductor] Switch to use the runtime interface for AOTInductor testing (#108663)
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime. The interface scrutiny will come in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
2023-09-07 23:38:11 +00:00
chunyuan
ca9f4222e1 Inductor cpp wrapper: fix codegen of positional args with default value (#108552)
Fixes https://github.com/pytorch/pytorch/issues/108323.
The cpp wrapper has a functionality regression on `llama` and `tnt_s_patch16_224` due to the recent support of scaled dot product flash attention in Inductor.

The schema of this OP is as follows:
```
- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
```

For `llama` and `tnt_s_patch16_224`, the op is called as shown below, where the three positional args with default values (`float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False`) are not passed.
```python
y = torch.ops.aten._scaled_dot_product_flash_attention.default(x0, x1, x2, scale = 0.125)
```

This PR fixes the cpp wrapper support for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108552
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-09-06 13:15:12 +00:00
Michael Lazos
96d74073f8 Horizontally fuse input concatenation (#108115)
Fixes https://github.com/pytorch/pytorch/issues/106688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108115
Approved by: https://github.com/jansel
2023-09-05 16:55:32 +00:00
PyTorch MergeBot
2c1f0772d5 Revert "Horizontally fuse input concatenation (#108115)"
This reverts commit 5911faeb8f.

Reverted https://github.com/pytorch/pytorch/pull/108115 on behalf of https://github.com/osalpekar due to Broke internal benchmarking job. See [D48890838](https://www.internalfb.com/diff/D48890838) ([comment](https://github.com/pytorch/pytorch/pull/108115#issuecomment-1703546520))
2023-09-02 00:19:00 +00:00
Michael Lazos
5911faeb8f Horizontally fuse input concatenation (#108115)
Fixes https://github.com/pytorch/pytorch/issues/106688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108115
Approved by: https://github.com/jansel
2023-08-30 21:57:11 +00:00
chilli
39130c7433 Add reinplacing pass for scatters + incremental fake tensor updating (#106192)
mutation for params)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106192
Approved by: https://github.com/jansel, https://github.com/eellison
2023-08-30 20:41:37 +00:00