Commit Graph

223 Commits

Author SHA1 Message Date
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions were added automatically with the following script:

```
import re

# Parse mypy output lines of the form "path/to/file.py:123:45: error: ... [error-code]".
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Collect, per file, a mapping from line number to the error code reported there.
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Append "# type: ignore[<code>]" to each offending line, rewriting the files in place.
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.
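
For reference, a minimal config sketch of forcing the flag for regular mypy runs (this assumes an ini-style mypy config; PyTorch's actual config files and surrounding options may differ):

```
# Hypothetical mypy.ini excerpt; only the relevant flag is shown.
[mypy]
local_partial_types = True
```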

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Isuru Fernando
978faf1fa2 Use an op counter to decide when to realize a kernel (#117030)
Count ops instead of checking the number of bytes in the string representation of the kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117030
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-01-27 05:28:46 +00:00
eellison
b95c45fbf7 add stack trace to device skip (#118112)
Log the stack trace of the offending CPU use if it causes cudagraphs to be disabled. Also refactors `disable_cudagraphs: bool` and `disable_cudagraphs_reason: str` into a single `disable_cudagraphs_reason: Optional[str]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118112
Approved by: https://github.com/bdhirsh
2024-01-26 22:33:48 +00:00
Nikita Shulga
bd99115276 [AOTI] Enable for MacOS (#118076)
- Add `darwin` to the list of supported platforms
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor Linux specific constant compilation logic to `_compile_consts_linux`
- Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library
   - Patch the file using magic to avoid converting bytes to a large hexadecimal string
- Generate integer constants with the `LL` suffix on MacOS (which corresponds to the `int64_t` definition)
- Enable test_aot_inductor.py tests on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
2024-01-24 14:24:05 +00:00
Jeff Daily
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
James Wu
afabed6ae6 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-21 18:47:01 +00:00
PyTorch MergeBot
10923f8720 Revert "[inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)"
This reverts commit 1967394690.

Reverted https://github.com/pytorch/pytorch/pull/117298 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing in MacOS 1967394690, may be due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/117298#issuecomment-1901594120))
2024-01-20 02:14:58 +00:00
James Wu
1967394690 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-20 01:37:28 +00:00
PyTorch MergeBot
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e1362499.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
Jeff Daily
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00
Bin Bao
fad7734fa7 [AOTI] Remove caching for compiled model.so (#117087)
Summary: Oleg found that the model.so caching does not include the model weights when computing the hash key, which can cause incorrect model.so reuse. Since caching is not really necessary in the AOT mode, let's just remove it.

Test Plan: CI

Differential Revision: D52647555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117087
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-10 19:53:27 +00:00
Jack Taylor
5046b4981d [ROCm] Add opt-in option for inductor's layout optimisation on ROCm (#116329)
Disabling layout optimisation in inductor for ROCm (https://github.com/pytorch/pytorch/pull/111474) was a bit shortsighted.

If there are workloads that heavily use NHWC, we will see a perf drop from the additional transpose ops. Instead of disabling this entirely on ROCm, it is now an opt-in feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116329
Approved by: https://github.com/jansel, https://github.com/eellison
2024-01-10 13:56:27 +00:00
Oleg Khabinov
5377b994da [aot_inductor] Retrieve original FQNs for weights (#116157)
Differential Revision: D52303882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116157
Approved by: https://github.com/frank-wei
2024-01-05 21:30:36 +00:00
Bin Bao
e5bcfe205e [inductor] fix cpp_wrapper inputs mismatch (#116197)
Summary: fixes https://github.com/pytorch/pytorch/issues/115035, where in the cpp_wrapper JIT inductor, the input args should contain the lifted parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116197
Approved by: https://github.com/jansel
2023-12-26 21:41:47 +00:00
Bin Bao
f4230ec9fd [inductor] Remove the float16 restriction for cpu cpp_wrapper (#116205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116205
Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel
2023-12-26 16:01:20 +00:00
etaf
7a6cb9fdfb [Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020)
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is step 1 of adding the Intel GPU backend as an alternative inductor backend.

### Design
Typically, in order to integrate Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation **scattered** in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)

2a44034895/torch/_inductor/codegen/triton.py (L1996)

So we abstract the device-biased code scattered in WrapperCodegen and TritonScheduling and provide a unified interface, "DeviceOpOverrides". This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface.

Currently, `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover Cpp wrapper code generation on demand.
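
A rough sketch of the pattern described above (the class and method names here are illustrative assumptions, not the actual inductor interface): device-specific snippets are routed through a small overrides object, so a new backend only implements that object instead of subclassing the whole wrapper codegen.

```
# Hypothetical sketch of the DeviceOpOverrides idea; names are assumptions.
class DeviceOpOverrides:
    def import_get_raw_stream_as(self, name: str) -> str:
        raise NotImplementedError

    def set_device(self, device_idx: int) -> str:
        raise NotImplementedError


class CUDADeviceOpOverrides(DeviceOpOverrides):
    def import_get_raw_stream_as(self, name):
        return f"from torch._C import _cuda_getCurrentRawStream as {name}"

    def set_device(self, device_idx):
        return f"torch.cuda.set_device({device_idx})"


class IntelGPUDeviceOpOverrides(DeviceOpOverrides):
    def import_get_raw_stream_as(self, name):
        return f"from torch._C import _xpu_getCurrentRawStream as {name}"

    def set_device(self, device_idx):
        return f"torch.xpu.set_device({device_idx})"


# The generic wrapper codegen asks the overrides object for device-specific
# lines instead of hard-coding CUDA calls.
def emit_prologue(overrides: DeviceOpOverrides, device_idx: int) -> str:
    return "\n".join([
        overrides.import_get_raw_stream_as("get_raw_stream"),
        overrides.set_device(device_idx),
    ])
```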

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-12-22 08:42:51 +00:00
CK Luk
3b70bd3970 Take 2 of "Add an option to log the source of the Triton kernels generated by torch._inductor (#115979)
Summary: This is useful for comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking whether serial compile and parallel compile generate identical Triton kernels).

Test Plan:
Unit test:
buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test
PyPer Mast job:
https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e
See the *.py files generated in:
pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623

Differential Revision: D52221500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979
Approved by: https://github.com/yanboliang
2023-12-18 18:16:44 +00:00
Bin Bao
0fc04e274d [inductor] Fix an aliased output bug (#115373)
Summary: for https://github.com/pytorch/pytorch/issues/97083, when

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-12 01:18:59 +00:00
PyTorch MergeBot
5fe2b138e3 Revert "[inductor] Fix an aliased output bug (#115373)"
This reverts commit 1310f0bf38.

Reverted https://github.com/pytorch/pytorch/pull/115373 on behalf of https://github.com/atalman due to Sorry for reverting your change it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/115373#issuecomment-1850792869))
2023-12-11 20:02:15 +00:00
Bin Bao
1310f0bf38 [inductor] Fix an aliased output bug (#115373)
Summary: for https://github.com/pytorch/pytorch/issues/97083, when

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-10 23:52:39 +00:00
Jason Ansel
c370450f02 [inductor] Remove hashing of tensor data for constants (#115356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115356
Approved by: https://github.com/eellison
2023-12-08 18:05:34 +00:00
Bin Bao
e06bff8bbe [AOTI] Handle empty input args (#114682)
Summary: When the model takes no inputs, AOTInductor relies on checking weights to figure out which device to compile the model into. Currently recording buffer device type happens too late, and this PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114682
Approved by: https://github.com/chenyang78
2023-12-05 15:02:17 +00:00
Jez Ng
f1fd02503b Reland #113487 and #112527 (sdpa shim & fp8 AOTInductor support) (#114974)
This is a backout of #113747 which reverted the above two commits. Now that
#113997 has landed, this diff can be landed safely without breaking ABI compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114974
Approved by: https://github.com/chenyang78
2023-12-02 03:25:51 +00:00
Elias Ellison
7692595834 Use different conv layout optimization heuristics for inference (#114600)
While many models regress in training when converted to channels last, in inference the results are quite different. Almost all of the models experienced a speedup when converted to channels last. There were a few big regressions in torchbench - `timm_regnet` from `1.4343 → 1.0573` and `timm_resnet` from `1.7484 → 1.2868`.

 I used a modified version of the operator benchmark script [here](https://gist.github.com/eellison/e11dc645412f52e8b45fb26ba6f9f6a1) to measure the average speedup of convolutions across all of the input shapes found in torchbench, according to the existing classifications that @shunting314 used - grouped convs, small-channel convs, and convolutions with a larger in-channel than out-channel count. Only grouped convolutions benchmarked as a slowdown in inference.

I updated the inference heuristic to multiply the flops of each conv with its predicted speedup/slowdown in channels last. With this heuristic the two previously regressing models no longer regress.

Speeds up inference for torchbench ~8% and timm ~6%. The motivating model here was SDXL which now hits channels last and improves 10%.

There were some models that were sped up in training when forcing channels last (along with a number of regressions). It's possible there is some speedup in training to be had with additional heuristics. We could also have more granular classification/predictions which might benefit both training and inference.
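
A simplified sketch of the kind of heuristic described above (the names and the structure of the conv metadata are assumptions for illustration; the real inductor heuristic differs in detail): weight each convolution's FLOPs by its predicted channels-last speedup and compare against the unweighted total.

```
# Hypothetical sketch: decide whether to use channels-last for inference by
# multiplying each conv's FLOPs with its predicted channels-last speedup.
from dataclasses import dataclass

@dataclass
class ConvInfo:
    flops: float
    predicted_speedup: float  # > 1.0 means channels-last is faster for this shape class

def prefer_channels_last(convs: list[ConvInfo]) -> bool:
    total_flops = sum(c.flops for c in convs)
    weighted = sum(c.flops * c.predicted_speedup for c in convs)
    # If the speedup-weighted total exceeds the plain total, channels-last wins overall.
    return weighted > total_flops

# Example: two convs that benefit and one grouped conv that regresses.
convs = [ConvInfo(1e9, 1.3), ConvInfo(5e8, 1.2), ConvInfo(2e8, 0.8)]
print(prefer_channels_last(convs))  # True -> convert to channels-last
```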

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114600
Approved by: https://github.com/jansel, https://github.com/shunting314
2023-11-29 07:53:59 +00:00
Jez Ng
87925789ae Make V.graph properly typed (#114025)
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
2023-11-21 02:14:29 +00:00
Bin Bao
5a96a42cea [AOTI] Improve the two-pass wrapper codegen (#114067)
Summary: For the second pass, we don't have to rerun the whole inductor flow again. This PR moves that second pass to codegen time. This change not only speeds up compilation, but also removes kernel scheduling inconsistency between the two passes. Another future improvement is to make the second pass reuse the scheduler and do only the wrapper codegen.

This is a copy of https://github.com/pytorch/pytorch/pull/113762 to land in github first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114067
Approved by: https://github.com/chenyang78
2023-11-19 23:30:36 +00:00
eellison
a9134fa99a Skip cudagraphs when there is sparsity (#113791)
Fix for dlrm training

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113791
Approved by: https://github.com/Chillee
2023-11-17 01:36:03 +00:00
Wei Wei
b19cf868e8 Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527)" (#113747)
Test Plan: sandcastle

Differential Revision: D51330618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-11-15 22:42:22 +00:00
Aaron Gokaslan
b7b2178204 [BE]: Remove useless lambdas (#113602)
Applies PLW0108, which removes useless lambdas in Python. The rule is in preview, so it is not ready to be enabled by default just yet. These are the autofixes from the rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
2023-11-14 20:06:48 +00:00
Edward Z. Yang
9752ef595c [BE] Consistently use the sym_stride lowering, instead of short-circuiting before (#113071)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113071
Approved by: https://github.com/voznesenskym
2023-11-10 21:19:12 +00:00
Jez Ng
297c26bb8e Support fp8 in AOTInductor + support optional<> in C ABI (#112527)
This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358

It turns out that we need to add support for optional types in order to
support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface
can't support optional<> directly, I am passing in optional types via
pointer instead.

`AtenTensorHandle`s are already pointers, so nothing needs to change
there. Only value types need to change.

We decided on this approach instead of adding an extra `bool` param to
the callee because this simplifies things. Having the same number of
arguments regardless of whether we are emitting Python / C++ /
ABI-compatible C++ makes codegen easier.

There are a number of existing ABI-compatible functions that have
optional-typed value parameters. Previously, they just assumed they
would never be passed a `nullopt` / `None` at runtime. Changing them to
use pointer types now would break ABI stability, so I have created an
exclude list for those functions.

Finally, I think the current implementation is kind of messy, and only
works for FallbackKernels, even though technically ExternKernels could
also have the same issue. It also doesn't support optional types nested
in lists. I've left FIXME comments for both issues.

Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-08 22:56:48 +00:00
Jason Ansel
3914566c73 [dynamo] Refactor OrderedDict to dict (#113234)
In Python 3, all dicts preserve insertion order.
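
For reference, a quick illustration (plain Python; insertion order for built-in dicts is guaranteed since CPython 3.7):

```
from collections import OrderedDict

d = {"b": 1, "a": 2, "c": 3}
od = OrderedDict([("b", 1), ("a", 2), ("c", 3)])
assert list(d) == list(od) == ["b", "a", "c"]  # both preserve insertion order
```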

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113234
Approved by: https://github.com/oulgen, https://github.com/lezcano
2023-11-08 09:27:08 +00:00
Edward Z. Yang
10a829b85d Retarget sym_size/sym_stride lowerings to their .int overloads (#113054)
Fixes https://github.com/pytorch/pytorch/issues/112913

The new logging looks like this:

```
[2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg0_1 : [num_users=0] = placeholder[target=arg0_1]
[2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
[2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG] lowering %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, 1), kwargs = {})
[2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG]   via <function make_pointwise.<locals>.inner at 0x7f0abed28ee0>
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %sym_stride_int : [num_users=1] = call_function[target=torch.ops.aten.sym_stride.int](args = (%add, 0), kwargs = {}) sym_stride
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %sym_stride_int), kwargs = {})
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG]   via <function mul at 0x7f0abec8bd00>
[2023-11-06 12:48:57,744] [0/0] torch._inductor.graph: [DEBUG] lowering return (mul,)
```

Notice that `sym_stride` no longer hits the lowering. This was the behavior before I broke it. A better refactor is coming soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113054
Approved by: https://github.com/davidberard98
2023-11-07 04:15:38 +00:00
Peter Bell
718035791d Prefer e.is_number over not e.free_symbols in SymPy (#112688)
We spend somewhere on the order of 1% of time in `sympy.Expr.free_symbols`, as it is called millions of times.
Most of the time we actually just want to know "is this a constant"; however, `e.is_constant()` is
horribly slow. It turns out that there is another property, `is_number`, that does what we want.

> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.

Even further, we also avoid the overhead of building the unnecessary set object.
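
A small illustration of the two checks using plain SymPy (an added example, not from the PR):

```
import sympy

x = sympy.Symbol("x")
const = sympy.Integer(2) * 3 + 1

# Both report whether the expression is a constant, but is_number short-circuits
# on the first free symbol instead of materializing the whole set.
assert const.is_number and not const.free_symbols
assert not (x + 1).is_number and (x + 1).free_symbols == {x}
```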

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
2023-11-06 20:05:13 +00:00
Kai Londenberg
bdfde62e54 [Inductor CUTLASS backend] Epilogue fusion codegen (Step 1) (#110890)
Summary:

This PR adds epilogue fusion code generation support for the new experimental
[Inductor Cutlass backend](https://github.com/pytorch/pytorch/pull/108015).

Details:

A fusion happens on the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul Kernel template
and adding a custom template functor based on Cutlass's new "Epilogue Visitor Trees" (EVT) on top, which represents and
performs the computation of the fused pointwise/elementwise computation nodes.

This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu),
which is currently the only documentation and example of Cutlass Epilogue Visitor Trees.

This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform.
A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments
to each of the functor subexpressions.

Step 1 functionality:

 * End-to-end code generation is possible using the above approach.
 * Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants)
   after a matmul.
 * Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum, etc.
 * Examples / unit tests include ReLU and ReLU6 fusion.
 * Support for fp16 and fp16 with fp32 accumulation data types.
 * Generates SM90 (Hopper) based CUDA kernels (as Cutlass up to 3.2.0 only supported EVT for SM90)

The following is not yet supported, and is left for future work:

 * Full operation support (e.g. the full set of all ops usually handled via V.ops handlers)
 * Cutlass EVT with SM80 support (possible in Cutlass 3.2.1 according to release notes, but not yet documented)
 * Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature
 * Add support for additional (auxiliary) outputs ( requires support for full computation graphs )
 * Add support for reduction operations and operations which use different output layouts than the input
 * Add support for additional dtypes ( as far as Cutlass allows )

This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features
for the inductor backend.

See also Cutlass release notes:
https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2

Notable changes in Cutlass 3.2.1 include:
 * Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which allows
   preventing namespace clashes without resorting to monkey-patching (which was done earlier).
 * Support for SM80 epilogue visitor trees (according to the release notes, not tried yet)
 * Small API changes to the cutlass_library API (requires adapting the inductor backend code)

Notable changes in Cutlass 3.2.2 include:
 * Fix for a bug that led to CUDA illegal memory access in some PyTorch unit tests involving flash attention

 Test Plan:
  * CI
  * pytest test/inductor/test_max_autotune.py

Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled.

Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890
Approved by: https://github.com/jansel
ghstack dependencies: #112762
2023-11-06 19:42:10 +00:00
Ken Jin
674c104d12 Fix RecursionError in Inductor for large for loops (#112320)
Fixes https://github.com/pytorch/pytorch/issues/111686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112320
Approved by: https://github.com/peterbell10
2023-11-05 13:12:54 +00:00
Jez Ng
ae85ba820f [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
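
A minimal sketch of the greedy-by-size idea under static sizes (an illustration only; it glosses over the space/time allocation tree, the symbolic-shape expressions, and the `align` sympy Expr described above, and all names are assumptions):

```
# Hypothetical greedy-by-size planner: place each allocation at the lowest
# aligned offset that does not overlap a time-overlapping allocation.
ALIGN = 64

def align(n: int) -> int:
    return (n + ALIGN - 1) // ALIGN * ALIGN

def plan(allocs):
    """allocs: list of (name, size_bytes, (start_step, end_step)) live ranges."""
    placed = []   # (offset, size, live_range) of already-planned allocations
    offsets = {}
    # Greedy by size: plan the largest allocations first.
    for name, size, live in sorted(allocs, key=lambda a: a[1], reverse=True):
        offset = 0
        for other_off, other_size, other_live in sorted(placed):
            overlaps_time = not (live[1] <= other_live[0] or other_live[1] <= live[0])
            overlaps_space = offset < other_off + other_size and other_off < offset + size
            if overlaps_time and overlaps_space:
                offset = align(other_off + other_size)  # bump past the conflicting block
        placed.append((offset, size, live))
        offsets[name] = offset
    pool_size = max((off + size for off, size, _ in placed), default=0)
    return offsets, pool_size

offsets, pool = plan([("a", 1000, (0, 3)), ("b", 512, (1, 2)), ("c", 2048, (4, 6))])
print(offsets, pool)  # "a" and "c" can share offset 0 because their live ranges do not overlap
```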

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-11-02 07:39:13 +00:00
PyTorch MergeBot
74e6c877e9 Revert "[inductor] Memory planning (#112178)"
This reverts commit f64a97c6f8.

Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too f64a97c6f8 ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))
2023-11-01 00:03:56 +00:00
Jez Ng
f64a97c6f8 [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-31 20:02:30 +00:00
Elias Ellison
6a99291546 Removing sdpa conv layout constraint (#112045)
Previously, layout opt with sdpa would cause failures because we would pass a non-dense last dim to sdpa. Those layout constraints have been added in prior PRs. Now we can do conv layout opt with sdpa.

Improves twins_pcpvt_base 1.4622 → 1.5351, xcit_large_24_p8_224 3.0681 → 3.1839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112045
Approved by: https://github.com/shunting314
ghstack dependencies: #111976, #111721
2023-10-27 05:40:43 +00:00
lezcano
47ccf04885 Split SymNode into its own file (#112037)
This PR:

- Moves TrueDiv, LShift, RShift, IsNonOverlappingAndDenseIndicator to `_sympy.functions.py`
- Moves SymNode to `fx.experimental.sym_node`.
  - This file does not have any SymPy dependencies at import time
  - It installs the magic methods in Sym{Bool,Int,Float}.
  - N.b. With this split, we may be able to move Sym{Bool,Int,Float} to this file, and remove quite a few of the hacks around these classes
- Imports `sym_node` in `torch/__init__.py` rather than the whole `symbolic_shapes.py`.
  This breaks the import-time dependency between torch and SymPy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112037
Approved by: https://github.com/peterbell10
ghstack dependencies: #112035, #112036
2023-10-26 23:32:27 +00:00
Andrew Hu
8253e0524c Add "device not supported" assert to inductor (#112001)
Fixes #111999

Adds an assert that provides a more informative error message

For example, when running a compiled function with mps (currently unsupported):
```
...
  File "/Users/andrew.hu/Desktop/pytorch/torch/_inductor/graph.py", line 927, in init_wrapper_code
    assert wrapper_code_gen_cls is not None, f"Device {device_type} not supported"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: Device mps not supported
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112001
Approved by: https://github.com/peterbell10
2023-10-25 14:19:37 +00:00
Oguz Ulgen
977d3bcc46 [Inductor] Support user defined triton kernels in inductor (#111434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434
Approved by: https://github.com/jansel
2023-10-22 17:04:19 +00:00
Elias Ellison
0a147fd112 Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233)
Improves perf of llama_v2 locally from 1.55 -> 1.57

The initial heuristic is to lower to pointwise if the number of inputs is <= 4 and all of the inputs are pointwise or cannot be memory-planned away, or if all of the outputs are pointwise (see the sketch below).
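
A restatement of that condition as code (a hypothetical sketch; the predicate names are assumptions, not the actual inductor API):

```
# Hypothetical sketch of the cat-lowering heuristic described above.
def should_lower_cat_to_pointwise(inputs, outputs, max_inputs: int = 4) -> bool:
    inputs_ok = len(inputs) <= max_inputs and all(
        inp.is_pointwise() or not inp.can_be_planned_away() for inp in inputs
    )
    outputs_ok = all(out.is_pointwise() for out in outputs)
    return inputs_ok or outputs_ok
```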

Perf run was +3% on inference. There are definitely instances where we should be lowering to foreach_kernels, but it's less flexible for fusion. The motivating example was:

```
import torch

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota = torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = torch.device(type='cuda', index=0), requires_grad = False)

    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]);  unsqueeze = None

    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

Also not sure if I should be more worried about concatting reduction->pointwise inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
2023-10-21 02:34:05 +00:00
Jack Taylor
619ae87a1d Disable inductor layout_opt on ROCm (#111474)
Previously we disabled this option on non-MI200 GPUs (https://github.com/pytorch/pytorch/pull/107812) due to worse NHWC conv performance on some cards. This PR disables the feature for all GPUs to make behavior uniform on ROCm, and due to the perf regressions noted in https://github.com/pytorch/pytorch/pull/110319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111474
Approved by: https://github.com/jithunnair-amd, https://github.com/eellison
2023-10-20 09:31:01 +00:00
Sherlock Huang
1aad6d803a [Reland][Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567) (#111396)
This is a reland of #110567 with additional fbcode fixes.

Summary:
In ABI-compatible mode, we always need op_overload.schema for FallbackKernel.

Approved by: https://github.com/jansel

Test Plan: contbuild & OSS CI, see 37a0265992

Differential Revision: D50339346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111396
Approved by: https://github.com/chenyang78
2023-10-17 18:53:38 +00:00
Sam Larsen
0dfa354570 [inductor] Implement Fx graph caching to improve warm compilation time. (#103453)
Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR.
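
As a rough illustration of the on-disk caching pattern described in the summary (a generic sketch, not the actual FX graph cache implementation; the cache location, key contents, and serialization details are assumptions):

```
# Hypothetical on-disk cache keyed by a hash of the graph and relevant config.
import hashlib
import os
import pickle

CACHE_DIR = "/tmp/fx_graph_cache"  # assumed location for illustration

def cache_key(graph_repr: str, config: dict) -> str:
    payload = pickle.dumps((graph_repr, sorted(config.items())))
    return hashlib.sha256(payload).hexdigest()

def load_cached(key: str):
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None  # cache miss -> compile, then save_cached()

def save_cached(key: str, compiled_artifact) -> None:
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(os.path.join(CACHE_DIR, key), "wb") as f:
        pickle.dump(compiled_artifact, f)
```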

Test Plan:
* New unit tests exercising saving and load from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.

Differential Revision: [D50255289](https://our.internmc.facebook.com/intern/diff/D50255289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison, https://github.com/Chillee
2023-10-13 13:33:56 +00:00
Oleg Khabinov
8209bbbd06 [AOTInductor] Improve validation for C++ wrapper codegen (#111102)
It's a reimplementation of #111089

1. When using fake inputs, make sure they are on the same device as the original inputs.
2. Don't change the value of self.cpp_wrapper from True to False if a C++ wrapper can't be generated; instead, check and fail early to avoid producing Python code for the C++ compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111102
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w
2023-10-13 08:46:17 +00:00