Commit Graph

35 Commits

Author SHA1 Message Date
Bert Maher
d3d85e1c3b Emit torch.cuda.synchronize() after every kernel call in inductor (#90472)
Debugging illegal memory access is hard; even CUDA_LAUNCH_BLOCKING=1
and using C10_CUDA_KERNEL_LAUNCH_CHECK don't necessarily guarantee that you'll
get a stack trace pointing to the right kernel. This diff adds a config option
to force a CUDA synchronize after every kernel call in inductor, for debugging
those tricky cases.
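
A minimal sketch of turning such a debug mode on; the flag name `config.triton.debug_sync_kernel` is an assumption about the inductor config, not something stated in this message:

```python
import torch
from torch._inductor import config

# Assumed flag name for the option described above: force a synchronize after
# every Triton kernel call so faults surface at the offending kernel.
config.triton.debug_sync_kernel = True

@torch.compile
def f(x):
    return x.sin().cos()

f(torch.randn(1024, device="cuda"))
```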

Differential Revision: [D41744967](https://our.internmc.facebook.com/intern/diff/D41744967/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90472
Approved by: https://github.com/jansel
2022-12-12 04:35:10 +00:00
Jiawen Liu
4a1633ca69 [Inductor] GEMM Shape Padding Optimization (#90425)
Summary:
Optimize the shape padding in the following perspectives:
- Add BFloat16 support for AMP training and Float16 support for inference
- Optimize the microbenchmark to avoid peak-memory issues, and include profiling of memory ops to make a more accurate padding decision
- Add a flag to turn padding of dims N and M in `torch.bmm` on/off, since the memory copy from `.contiguous` is expensive and can cause peak-memory issues in internal models
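
For intuition, shape padding in this sense pads GEMM operands to friendlier sizes and slices the result back; a rough sketch of the idea (hypothetical helper, not the PR's implementation):

```python
import torch
import torch.nn.functional as F

def padded_mm(a, b, multiple=8):
    # Hypothetical illustration: pad the K and N dims of a matmul up to a friendlier
    # multiple (zero padding keeps the math exact), then slice the result back.
    m, k = a.shape
    k2, n = b.shape
    pad_k = (-k) % multiple
    pad_n = (-n) % multiple
    a_p = F.pad(a, (0, pad_k))            # (m, k + pad_k)
    b_p = F.pad(b, (0, pad_n, 0, pad_k))  # (k + pad_k, n + pad_n)
    return (a_p @ b_p)[:, :n]

out = padded_mm(torch.randn(64, 50), torch.randn(50, 30))  # same result as a @ b
```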

Test Plan: CI

Differential Revision: D41724868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90425
Approved by: https://github.com/jianyuh
2022-12-09 22:48:02 +00:00
Michael Lazos
730e44bbc7 Add logging for aot autograd and unified debug flag (#88987)
- Adds `log_level` to aot's config
- Outputs the log to `<graph_name>_<log_level>.log` in the aot_torchinductor subfolder of the debug directory
- Modifies the Inductor debug context to use the graph name when naming the folder instead of the os pid
- Adds a `TORCH_COMPILE_DEBUG` flag to enable it (as well as separate dynamo and inductor logs)
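
Illustrative usage of the flag (a sketch; setting the environment variable before importing/compiling has the same effect as `TORCH_COMPILE_DEBUG=1` on the command line):

```python
import os
os.environ["TORCH_COMPILE_DEBUG"] = "1"  # equivalent to TORCH_COMPILE_DEBUG=1 python script.py

import torch

@torch.compile
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))
# The dynamo/aot/inductor debug logs then land in the debug directory described above.
```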

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88987
Approved by: https://github.com/Chillee
2022-12-09 17:28:10 +00:00
PyTorch MergeBot
6581063583 Revert "Dynamo, FX, Inductor Progress Bars (#88384)"
This reverts commit db0ce4acf3.

Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board
2022-12-09 16:32:25 +00:00
Mark Saroufim
db0ce4acf3 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-12-09 04:32:31 +00:00
Bert Maher
26d1dbc4f8 [inductor] More correct check for fbcode environment (#90312)
Summary:
importing torch.fb seemed like a good idea, but we don't always have
torch.fb inside fbcode.  Testing for torch.version.git_version is more
reliable, since we'll never have a git_version inside fbcode, which is an hg
repo.
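
A small sketch of the check described here (illustrative; the exact predicate in the PR may differ):

```python
import torch

def _is_fbcode() -> bool:
    # An OSS git checkout exposes torch.version.git_version; an hg-based fbcode build does not.
    return not hasattr(torch.version, "git_version")
```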

Test Plan: `buck2 run mode/dev-nosan //caffe2/test/inductor:smoke`

Reviewed By: soumith, jansel

Differential Revision: D41777058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90312
Approved by: https://github.com/soumith
2022-12-07 04:50:11 +00:00
Michael Voznesensky
5423c2f0e2 Light refactor to how we get shape_env for graph lowering (#90139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90139
Approved by: https://github.com/ezyang
2022-12-05 18:35:30 +00:00
Jean Schmidt
f62e54df8f Reland "Dynamo, FX, Inductor Progress Bars (#88384)" … (#90055)
This commit's internal land and the merged PR were inconsistent. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.

Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-12-02 13:28:00 +00:00
PyTorch MergeBot
cf3c3f2280 Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)"
This reverts commit bcf4292f04.

Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to the landed internal commit not matching this one, which caused a merge conflict and prevented importing and landing new commits
2022-12-02 09:57:31 +00:00
Nikita Shulga
768bd3fb4a Add torch.compile implementation (#89607)
`torch.compile` can be used either as decorator or to optimize model directly, for example:
```
@torch.compile
def foo(x):
  return torch.sin(x) + x.max()
```
or
```
mod = torch.nn.ReLU()
optimized_mod = torch.compile(mod, mode="max-autotune")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89607
Approved by: https://github.com/soumith
2022-12-01 20:17:52 +00:00
Eli Uriegas
bcf4292f04 Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)
This breaks in environments that use the fake `tqdm` shim in `torch/hub.py` (015b05af18, line 26), which doesn't support the `desc` kwarg and is not iterable

Original try using pytorchbot did not go through because of a merge
conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489

This reverts commit 011452a2a1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018
Approved by: https://github.com/drisspg, https://github.com/dbort
2022-12-01 20:17:07 +00:00
Bert Maher
6317311e61 [inductor] Disable parallel compilation inside fbcode (#89926)
Forking python processes using `multiprocessing` doesn't play nicely
with certain aspects of FB infra, so let's disable it until we find a better
solution.

Differential Revision: [D41618774](https://our.internmc.facebook.com/intern/diff/D41618774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89926
Approved by: https://github.com/desertfire
2022-12-01 02:33:45 +00:00
Wu, Chunyuan
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```

### Next steps
The below items will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
Mark Saroufim
011452a2a1 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-11-30 06:07:14 +00:00
Jiong Gong
c75434ed4f [Inductor] Add an option to mark wrapper call in PyTorch profiler (#89674)
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.
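
A rough sketch of using the option together with the PyTorch profiler (the flag name is taken from this description; the rest is illustrative):

```python
import torch
from torch._inductor import config

config.profiler_mark_wrapper_call = True

@torch.compile
def f(x):
    return x.sin() + x.cos()

x = torch.randn(1024)
f(x)  # compile once outside the profiled region
with torch.profiler.profile() as prof:
    f(x)
print(prof.key_averages().table(sort_by="cpu_time_total"))
# The wrapper call shows up as its own named region in the profiler output.
```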

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
2022-11-29 00:58:46 +00:00
Jiong Gong
bb77accb4c [Inductor] Record cpp kernel in PyTorch Profiler (#89367)
Add an option `config.cpp.enable_kernel_profile` to record individual cpp kernel time in PyTorch Profiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89367
Approved by: https://github.com/jansel
2022-11-26 14:06:44 +00:00
Natalia Gimelshein
3e20d023b1 put descriptive kernel names behind config (#89697)
Per title, generated kernel names are often long and confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89697
Approved by: https://github.com/Chillee
2022-11-26 03:08:23 +00:00
Natalia Gimelshein
61a3fe4b64 make inductor correctly propagate nans for maximum and minimum (#89612)
Partially fixes https://github.com/pytorch/torchdynamo/issues/594
Also, small cleanup for `where` codegen
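
For reference, the eager semantics the compiled kernels should now match (a small illustration, not code from the PR):

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 0.0])

print(torch.maximum(a, b))  # tensor([2., nan]) -- NaN propagates
print(torch.minimum(a, b))  # tensor([1., nan])
```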

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89612
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-25 19:42:38 +00:00
Jiong Gong
6796979ee1 [Inductor] Limit the number of compile threads to the available cpu cores (#89377)
`config.compile_threads` gets the number of compile threads via `min(32, os.cpu_count())`, but `os.cpu_count()` is the total number of cpu cores in the system, not the number actually available. This causes compile-thread contention when fewer cpu cores are available than `min(32, os.cpu_count())`, e.g. when the available cores are limited with numactl or taskset, making compilation very slow. This PR uses `len(os.sched_getaffinity(0))` when `os.sched_getaffinity` is available, which returns the number of available cpu cores.
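
A minimal sketch of the resulting heuristic (assumed to mirror the description above, not copied from the PR):

```python
import os

def _available_cpu_count() -> int:
    # sched_getaffinity reflects numactl/taskset restrictions; cpu_count does not.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

compile_threads = min(32, _available_cpu_count())
```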

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89377
Approved by: https://github.com/soumith
2022-11-21 14:20:36 +00:00
Jiawen Liu
5270122773 [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#89118)
Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor
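
For intuition, the kind of pattern targeted by this fusion looks roughly like the following (illustrative shapes and code, not from the PR):

```python
import torch
import torch.nn.functional as F

w = torch.randn(256, 64)

def f(x):  # x: (batch, seq, 64)
    # A linear followed by a permute on its output -- the candidate pattern for vertical fusion.
    return F.linear(x, w).permute(0, 2, 1)

f(torch.randn(32, 128, 64))
```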

For an internal Ads model: **1.15x -> 1.36x speedup**

Test Plan: CI

Reviewed By: bertmaher, jansel, jianyuh

Differential Revision: D41071665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89118
Approved by: https://github.com/jianyuh
2022-11-16 10:37:30 +00:00
PyTorch MergeBot
9f0b2c73f3 Revert "[Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88859)"
This reverts commit d60abe4b95.

Reverted https://github.com/pytorch/pytorch/pull/88859 on behalf of https://github.com/kit1980 due to Broke Mac OS testing, which was clearly shown in CI
2022-11-16 01:13:00 +00:00
Jiawen Liu
d60abe4b95 [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88859)
Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor

For an internal Ads model: **1.15x -> 1.36x speedup**

Differential Revision: D41071665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88859
Approved by: https://github.com/jianyuh, https://github.com/jansel
2022-11-15 19:34:38 +00:00
Jiawen Liu
55b88cde0a [Inductor] Build Shape Padding in Inductor (#88709)
Summary: Build shape padding for matmul/bmm/addmm in Inductor

Differential Revision: D41071282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88709
Approved by: https://github.com/bertmaher, https://github.com/Chillee
2022-11-15 03:10:36 +00:00
Nikita Shulga
f39cad50b7 Make InductorCPU usable internally (#88870)
Test Plan: `buck2 test mode/opt //caffe2/test:test_inductor -- --exact 'caffe2/test:test_inductor - test_dtype_mismatch_issue_cuda (caffe2.test.inductor.test_torchinductor.CudaTests)'`

Differential Revision: D41206109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88870
Approved by: https://github.com/izaitsevfb
2022-11-11 22:07:34 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long

Caveats:
- The ordering in the name may not match the order that the ops are executed in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
PyTorch MergeBot
29550e2c1d Revert "[Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88566)"
This reverts commit 48b58930cb.

Reverted https://github.com/pytorch/pytorch/pull/88566 on behalf of https://github.com/huydhn due to This change breaks trunk 48b58930cb
2022-11-10 20:56:30 +00:00
Jiawen Liu
48b58930cb [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88566)
Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor

For an internal Ads model: 1.15x -> 1.36x speedup

Differential Revision: D41071665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88566
Approved by: https://github.com/jansel, https://github.com/jianyuh
2022-11-10 18:32:25 +00:00
Animesh Jain
c4fecff97d [inductor] Prevent aggressive fusion during inductor lowering (#87447)
Fixes https://github.com/pytorch/torchdynamo/issues/1599

Inductor performs aggressive fusion of ops during the lowering of Fx graph into IR nodes. Note that this fusion is different from the fusion that we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (way after lowering). This PR, instead, ensures that we don't accumulate too many ops in the IR node to begin with.

In the case of hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds.

Note that this could affect performance. I doubt that it will lead to a really large dip, though. In my toy examples, even if the lowering creates multiple IR nodes, if it's a simple fusion, the later fusion still creates one node.

I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least the HF models to be enabled in CI before merging this one.

@ngimel @jansel @Chillee

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447
Approved by: https://github.com/jansel
2022-10-24 21:53:17 +00:00
Zachary DeVito
db83a0578c [inductor] force 'fork' method for processes, cleanup (#87411)
To cooperate with other multithreading methods, this
forces the process pool to use 'fork' even if others have set it
differently. We require fork because otherwise an `if __name__ == "__main__":` guard
needs to be present, which we do not control as a library.

Furthermore this adds code to cleanup worker processes if
the parent exits abnormally (e.g. segfault). Previously we would leave
live but inactive workers around.
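
A compact sketch of the start-method part (assumed shape of the change, not the PR's exact code):

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor

# Force 'fork' so callers don't need an `if __name__ == "__main__":` guard,
# which a library cannot require of its users.
ctx = multiprocessing.get_context("fork")
pool = ProcessPoolExecutor(max_workers=min(32, os.cpu_count() or 1), mp_context=ctx)
```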

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87411
Approved by: https://github.com/soumith, https://github.com/anijain2305
2022-10-21 17:06:56 +00:00
Horace He
2418ddb1ec Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161
Approved by: https://github.com/jansel
2022-10-19 04:50:34 +00:00
Zachary DeVito
d36c284d14 [triton] allow cuda properties to be queried from workers (#87101)
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.

Actually attempting to get CUDA to load in the workers is probably not desired: cuda initialization takes O(seconds). Having multiple processes using the same device will slow things down.

This just moves the needed properties from the main trainer process to the workers.
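
An illustrative sketch of the idea (names are hypothetical): query the properties once in the parent, before any fork, and let the workers use the cached plain values instead of touching `torch.cuda` themselves.

```python
import torch

# Queried once in the parent process, before the compile workers are forked.
_DEVICE_CAPABILITY = torch.cuda.get_device_capability()

def worker_supports_triton() -> bool:
    # Uses the pre-fork value; calling torch.cuda.* inside a forked worker would
    # try to re-initialize CUDA and fail.
    return _DEVICE_CAPABILITY >= (7, 0)
```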

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
2022-10-18 04:48:29 +00:00
Jiong Gong
78e2289005 [TorchInductor] enable inplace buffers by default (#87037)
This PR enables the inplace_buffers configuration by default after fixing issue: https://github.com/pytorch/torchdynamo/issues/1670. UT is added to cover the fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87037
Approved by: https://github.com/jansel
2022-10-17 06:05:30 +00:00
Jason Ansel
0379af681b [inductor] Disable parallel compile (#87048)
https://github.com/pytorch/pytorch/pull/87032 seems to have an issue that breaks our benchmark script; it might have to do with the benchmark script also using subprocess.

Before this PR:
```
$ ./benchmarks/dynamo/torchbench.py --performance --inductor --raise --training --float16
...
Traceback (most recent call last):
  File "/home/jansel/conda/envs/pytorch/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 239, in _worker_compile
    kernel = TritonCodeCache.load(source_code)
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 234, in load
    mod = PyCodeCache.load(source_code)
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 212, in load
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_jansel/ij/cij7smji4sw2a56i4yz45bjkrosd2sb2raqnxzsxxpg4kwzuo2ta.py", line 5, in <module>
    from torch._inductor.triton_ops.autotune import reduction
  File "/home/jansel/pytorch/torch/_inductor/triton_ops/__init__.py", line 3, in <module>
    if has_triton():
  File "/home/jansel/pytorch/torch/_inductor/utils.py", line 38, in has_triton
    return triton is not None and torch.cuda.get_device_capability() >= (7, 0)
  File "/home/jansel/pytorch/torch/cuda/__init__.py", line 368, in get_device_capability
    prop = get_device_properties(device)
  File "/home/jansel/pytorch/torch/cuda/__init__.py", line 382, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/jansel/pytorch/torch/cuda/__init__.py", line 228, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```

cc @zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87048
Approved by: https://github.com/soumith
2022-10-17 01:02:43 +00:00
Zachary DeVito
2b7236a0e1 [torchdynamo] Use ProcessPoolExecutor for triton compiles (#87032)
This patch significantly improves the parallel compilation performance for compiling triton kernels
by using ProcessPoolExecutor to create a persistent pool of compilation workers.

Previously os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces
the worker threads with a pool of processes to do the raw compilation, and does serial work on the main thread
for everything else. This other work couldn't be parallelized anyway since it is mostly in python.

In cold start situations, the time to get the worker threads started can be a significant portion of the time.
This patch starts the workers earlier so they are ready to perform compilation (see code comments) when dynamo
gets to that point.

Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation.

```
39.613s - warm
41.290s - cold, this patch

2m53.197s - cold, single threaded
1m7.092s - cold, old setup n = 8 (its best config)
```
(Cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032
Approved by: https://github.com/soumith, https://github.com/jansel
2022-10-16 21:58:26 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538
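
Concretely, downstream code switches its imports to the in-core packages, along these lines:

```python
# was: import torchdynamo / import torchinductor
import torch._dynamo as dynamo
import torch._inductor as inductor
```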

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00