Commit Graph

24 Commits

Author SHA1 Message Date
PyTorch MergeBot
6581063583 Revert "Dynamo, FX, Inductor Progress Bars (#88384)"
This reverts commit db0ce4acf3.

Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board
2022-12-09 16:32:25 +00:00
Mark Saroufim
db0ce4acf3 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now:
1. Dynamo: macro-level config covering dynamo, AOT, and inductor
2. FX: a progress bar for each pass, labeled with the pass name
3. Inductor
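
For reference, a minimal sketch of a labeled progress bar using the standard `tqdm` API; the pass names here are illustrative, and the actual bars in this PR are created inside dynamo/FX/inductor and gated by the configs above:
```python
# Illustrative only: a labeled tqdm progress bar of the kind these configs enable.
from tqdm import tqdm

passes = ["dead_code_elimination", "fuse_ops", "lower_to_inductor"]  # hypothetical pass names
for name in passes:
    for _ in tqdm(range(100), desc=f"FX pass: {name}"):
        pass  # the work for this pass would happen here
```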

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-12-09 04:32:31 +00:00
William Wen
d224ac7f77 Remove logging.CODE (#90234)
Fixes https://github.com/pytorch/torchdynamo/issues/1932

Discussed with @mlazos: if we still want separate streams for code logging and the rest of the info logging, we can use a separate logger object with a unique name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90234
Approved by: https://github.com/ezyang
2022-12-06 22:24:43 +00:00
Jean Schmidt
f62e54df8f Reland "Dynamo, FX, Inductor Progress Bars (#88384)" … (#90055)
This commit had an inconsistent internal land and PR merge. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.

Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-12-02 13:28:00 +00:00
PyTorch MergeBot
cf3c3f2280 Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)"
This reverts commit bcf4292f04.

Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to the landed internal commit not matching this one, causing a merge conflict and preventing importing and landing new commits
2022-12-02 09:57:31 +00:00
Wang, Eikan
0bde810572 Add more debug information for Inductor (#90008)
- Add a graph index to the profile information of the Inductor kernel for better debuggability.

  The generated code for different graphs can produce kernels with the same name. As a side effect, it is hard to attribute the E2E performance of these kernels because the profiler aggregates results by kernel name regardless of which graph they came from. Hence, this PR adds the graph index to the profile information to address this limitation.

- Label arbitrary code ranges for `eager` and `opt` modes for better debuggability.

  The profile information of the dynamo benchmarks mixes the eager and opt modes, making it hard to separate the ranges for each mode. This PR adds `eager` and `opt` marks to the profile information to address this limitation.
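
For illustration, labeling `eager` and `opt` code ranges can be done with the standard `torch.profiler` API; this is a minimal sketch in that spirit, not the benchmark-harness change itself:
```python
import torch
from torch.profiler import profile, record_function

model = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)

with profile() as prof:
    with record_function("eager"):
        model(x)   # eager-mode run
    with record_function("opt"):
        model(x)   # in the benchmarks this would be the compiled model

# Both ranges show up as separate rows in the profile output.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```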

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 09:34:48 +00:00
Eli Uriegas
bcf4292f04 Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)
This breaks in environments that use the fake tqdm in torch/hub.py (015b05af18, L26), which doesn't support the 'desc' kwarg and is not iterable

The original revert attempt using pytorchbot did not go through because of a merge conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489

This reverts commit 011452a2a1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018
Approved by: https://github.com/drisspg, https://github.com/dbort
2022-12-01 20:17:07 +00:00
Wu, Chunyuan
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```
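
A minimal sketch of exercising this path, assuming the `torch._dynamo.optimize("inductor")` entry point used at the time (illustrative, not code from this PR):
```python
import torch
import torch._dynamo as dynamo
from torch._inductor import config

config.cpp_wrapper = True  # opt in; the cpp wrapper is off by default

def f(a, b):
    return a + b, a * b

compiled = dynamo.optimize("inductor")(f)
out = compiled(torch.randn(8, 8), torch.randn(8, 8))
```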

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```

### Next steps
The items below will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
Mark Saroufim
011452a2a1 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now:
1. Dynamo: macro-level config covering dynamo, AOT, and inductor
2. FX: a progress bar for each pass, labeled with the pass name
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-11-30 06:07:14 +00:00
Natalia Gimelshein
a188f05e8c Reland #89031 Added conv constraint that infers layouts (#89530)
Relands #89031
Per title. We now set strides from the fx graph only for convolutions and mm, which is a hack, but bmm in some cases caused an extra copy and there is no obvious way to fix that; we should rethink the strides anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530
Approved by: https://github.com/Chillee
2022-11-23 20:18:54 +00:00
Brian Hirsh
57353c9608 first draft of input mutation handling for aot autograd (#88817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88817
Approved by: https://github.com/ezyang, https://github.com/wconstab
2022-11-23 19:20:11 +00:00
Animesh Jain
120d200620 Revert "Added conv constraint that infers layouts (#89031)" (#89451)
This reverts commit 716f70f19a.

Fixes a performance regression and a compilation latency increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89451
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-22 02:20:50 +00:00
Horace He
419ef2cdcf Added utility to count memory reads/written in Inductor (#89203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89203
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-11-19 04:18:26 +00:00
Horace He
716f70f19a Added conv constraint that infers layouts (#89031)
The core problem we often have with contiguous/channels-last layouts and convolutions is that Inductor doesn't always do a great job of "preserving" the eager-mode layouts.

So, for example, we'll often have something like
```
a: channels-last
b = foo(a)
c = convolution(a)
```

In eager mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel.

However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but others). Then, not only will we do a transpose within `foo`, but we'll then immediately transpose it back to do the convolution (and then again once the convolution is done).

This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels last tensors and channels-last tensors.

The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager mode. This ensures that we'll never do extra transposes *within* `aten.convolution`, which are particularly bad (since Inductor can't fuse them).
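
For context, a small eager-mode illustration of the layout behavior described above, using the standard PyTorch API (not code from this PR):
```python
import torch

conv = torch.nn.Conv2d(8, 8, 3, padding=1).to(memory_format=torch.channels_last)
a = torch.randn(1, 8, 32, 32).to(memory_format=torch.channels_last)

b = torch.relu(a)  # pointwise ops preserve the channels-last layout in eager mode
c = conv(b)        # so the convolution consumes NHWC directly, with no transpose copies

print(b.is_contiguous(memory_format=torch.channels_last))  # True
print(c.is_contiguous(memory_format=torch.channels_last))  # True
```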

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-17 01:52:35 +00:00
Jongsoo Park
0544a32ba3 [inductor] fix could not find as_strided with config.triton.mm=triton (#88946)
Summary: ReinterpretView doesn't seem to be handled properly with matrix multiply Triton kernels

Reviewed By: bertmaher

Differential Revision: D40836677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88946
Approved by: https://github.com/jansel
2022-11-15 00:48:49 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds a config to cap the number of ops in the kernel name so that names don't get too long

Caveats:
- The ordering in the name may not match the order in which the ops are executed in the kernel
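
A hypothetical sketch of the naming scheme described above; the helper and cap are illustrative, not Inductor's actual code:
```python
MAX_OPS_IN_KERNEL_NAME = 4  # assumed config knob capping the name length

def fused_kernel_name(origin_ops, prefix="triton_"):
    # Concatenate the origin op names, truncated to the configured cap.
    return prefix + "_".join(origin_ops[:MAX_OPS_IN_KERNEL_NAME])

print(fused_kernel_name(["add", "mul", "relu", "sum", "div"]))
# triton_add_mul_relu_sum
```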

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
Natalia Gimelshein
73c9911fc0 always realize output regardless of the number of reads (#88046)
This improves hf_Bert from 1.139x to 1.21x. Currently lowmem dropout doesn't work for the nn.Dropout module, so before this change we were recomputing all the dropout masks in a very inefficient kernel. This change pushes the dropout masks to be saved in the dropout kernels where they are first computed.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88046
Approved by: https://github.com/Chillee
2022-11-01 15:47:43 +00:00
Edward Z. Yang
1ff52225f1 Unify SymIntNode and SymFloatNode into SymNode (#87817)
This refactor was prompted by challenges handling mixed int/float
operations in C++. A previous version of this patch added overloads for
each permutation of int/float and was unwieldy
(https://github.com/pytorch/pytorch/pull/87722/). This PR takes a different
approach.

The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode. This type is erased; we no
longer know statically in C++ whether we have an int or a float, and have to
test it with the is_int()/is_float() virtual methods. This has a number of
knock-on effects.

- We no longer have C++ classes to bind to Python.  Instead, we take an
  entirely new approach to our Python API, where we have SymInt/SymFloat
  classes defined entirely in Python, each holding a SymNode (which corresponds
  to the C++ SymNode).  However, SymNode is not pybind11-bound; instead,
  it lives as-is in Python, and is wrapped into a C++ SymNode using PythonSymNode
  when it goes into C++.  This implies a userland rename.

  In principle, it is also possible for the canonical implementation of SymNode
  to be written in C++, and then bound to Python with pybind11 (we have
  this code, although it is commented out.)  However, I did not implement
  this as we currently have no C++ implementations of SymNode.

  Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
  code needs to know how to find these classes.  Currently, this is done
  just by manually importing torch and getting the attributes.

- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
  takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
  __torch_dispatch__ works.
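
A conceptual sketch of this wrapper arrangement (a Python SymInt holding a SymNode); the class bodies here are illustrative only, not the actual torch implementation:
```python
class SymNode:
    """Type-erased node; the real one lives behind the C++ SymNode interface."""
    def __init__(self, expr):
        self.expr = expr
    def is_int(self):
        return True  # the real node answers is_int()/is_float() dynamically
    def add(self, other):
        return SymNode(f"({self.expr} + {other.expr})")

class SymInt:
    """User-facing Python class; it just holds a SymNode."""
    def __init__(self, node):
        self.node = node
    def __add__(self, other):
        return SymInt(self.node.add(other.node))
    def __repr__(self):
        return f"SymInt({self.node.expr})"

print(SymInt(SymNode("s0")) + SymInt(SymNode("s1")))  # SymInt((s0 + s1))
```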

Some miscellaneous improvements:

- SymInt now has a constructor that takes SymNode.  Note that this
  constructor is ambiguous if you pass in a subclass of SymNode,
  so an explicit downcast is necessary.  This means toSymFloat/toSymInt
  are no more.  This is a mild optimization as it means rvalue reference
  works automatically.

- We uniformly use the caster for c10::SymInt/SymFloat, rather than
  going the long way via the SymIntNode/SymFloatNode.

- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
  functions; pretty sure these didn't do anything.

- guard_int is now a free function, since to guard on an int you cannot
  assume the method exists.  A function can handle both int and SymInt
  inputs.

- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
  ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
  plain methods; this is to help avoid confusion between the two types.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
2022-10-27 20:56:02 +00:00
William Wen
a605a30732 Fix CODE level usage in dynamo config.py (#87522)
Fixes https://github.com/pytorch/torchdynamo/issues/1718.

Tested by changing `log_level = logging.WARNING` in config.py to `log_level = logging.CODE` and running a test script that doesn't touch `log_level`.
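
For background, a generic illustration of registering a custom level like CODE with the standard `logging` module; the numeric value is an assumption, and this is not the dynamo config code itself:
```python
import logging

CODE = 15  # assumed to sit between DEBUG (10) and INFO (20)
logging.addLevelName(CODE, "CODE")

logging.basicConfig(level=CODE)
log = logging.getLogger("example")
log.log(CODE, "generated code would be emitted at this level")
```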

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87522
Approved by: https://github.com/mlazos
2022-10-25 22:47:54 +00:00
Animesh Jain
c4fecff97d [inductor] Prevent aggressive fusion during inductor lowering (#87447)
Fixes https://github.com/pytorch/torchdynamo/issues/1599

Inductor performs aggressive fusion of ops during the lowering of the Fx graph into IR nodes. Note that this fusion is different from the fusion we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (well after lowering). This PR, instead, ensures that we don't accumulate too many ops in a single IR node to begin with.

In the case of the hf_t5_large backward graph, we previously generated a kernel with hundreds of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds.

Note that this could affect performance. I doubt it will lead to a really large dip, though. In my toy examples, even if the lowering creates multiple IR nodes, later fusion still creates one node if it's a simple fusion.
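
A hypothetical sketch of the capping idea described above; the threshold and helper are illustrative, not Inductor's actual lowering code:
```python
MAX_OPS_PER_IR_NODE = 64  # assumed cap on ops accumulated into one IR node

def group_ops(ops):
    # Start a new group once the current one reaches the cap.
    groups, current = [], []
    for op in ops:
        current.append(op)
        if len(current) >= MAX_OPS_PER_IR_NODE:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

print(len(group_ops(list(range(300)))))  # 5 smaller IR nodes instead of one huge one
```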

I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least the HF models to be enabled in CI before merging this one.

@ngimel @jansel @Chillee

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447
Approved by: https://github.com/jansel
2022-10-24 21:53:17 +00:00
Yanbo Liang
9ba632253a [Inductor] Convert 0d CPU tensor to scalar during triton codegen (#87329)
This is a follow-up to address [this](https://github.com/pytorch/torchdynamo/pull/1284#pullrequestreview-1130319129). We revised to use the codegen approach to handle 0d CPU tensors, which will no longer support cudagraphs.
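
For illustration, a 0d CPU tensor and its plain scalar value, the kind of value the codegen change converts (standard PyTorch API, not the codegen itself):
```python
import torch

t = torch.tensor(3.5)      # 0d CPU tensor
print(t.dim(), t.item())   # 0 3.5
```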

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87329
Approved by: https://github.com/ngimel
2022-10-21 01:24:00 +00:00
Horace He
2418ddb1ec Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161
Approved by: https://github.com/jansel
2022-10-19 04:50:34 +00:00
Jason Ansel
054a2fd6c2 Sync changes from pytorch/torchdynamo (#87013)
This updates to:
6380959be2

Generated with:
https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013
Approved by: https://github.com/voznesenskym
2022-10-15 21:00:57 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538
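
In user code the move amounts to an import rename; a minimal before/after sketch (the old out-of-tree imports stop applying once the in-tree packages are used):
```python
# Before (out-of-tree packages):
#   import torchdynamo
#   import torchinductor.config as inductor_config
# After (in PyTorch core):
import torch._dynamo as dynamo
import torch._inductor.config as inductor_config
```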

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00