Commit Graph

21 Commits

Author SHA1 Message Date
Max Podkorytov
ca30704f0b [Inductor][ROCm][CK] Add standalone runner (#139441)
Generate standalone executable to debug and profile CK gemm instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139441
Approved by: https://github.com/ColinPeppler
2024-11-07 06:21:27 +00:00
James Wu
f4ee5a243d Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402)
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.

In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.

In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.

The `triton_compile` events is a bit questionable: do we need a row for each triton compile event? It might eat up on our already low retention, so I might just remove that. Will discuss with @slarsen.

Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
2024-11-04 06:37:18 +00:00
Sam Larsen
06b5330674 [easy] Log subproc pool creation (#138642)
Summary: Request from internal to log subproc pool creation

Test Plan:
```
$ TORCH_LOGS=+torch._inductor.async_compile python ~/add.py
I1022 14:12:41.915000 444394 torch/_inductor/async_compile.py:165] Creating subprocess pool with 32 workers
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138642
Approved by: https://github.com/eellison
2024-10-23 02:41:42 +00:00
Max Podkorytov
52ba40c6f6 [ROCm][AOTI] add CK backend (#135641)
Companion to #134379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135641
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78

Co-authored-by: Colin Peppler <colinpeppler@meta.com>
2024-10-07 23:53:58 +00:00
Colin Peppler
42adadf2f2 [aotinductor] enable CUTLASS backend (#134379)
### Context
This PR allows CUTLASS kernels usage in AOTI. It does this by:
* For any CUTLASS kernels that win during autotuning, compile them as a .so & .o
* When creating the final model .so, link all the CUTLASS kernels .o files
* Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel)

### Example
https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549

```
TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134379
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-04 17:32:41 +00:00
Sam Larsen
1028cedf71 [inductor] Enable parallel compile by default in fbcode (#136246)
Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. Some jankiness so we can avoid evaluating the justknob at import.

Test Plan: Ran codecache tests with JK on, then canaried locally with JK off

Differential Revision: D62913998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246
Approved by: https://github.com/eellison
2024-09-24 18:10:01 +00:00
Sam Larsen
bf8d0e3107 [inductor] Enable subprocess parallel compile internally with killswitch (#132467)
Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467
Approved by: https://github.com/eellison
2024-09-10 19:05:46 +00:00
Sam Larsen
a2db22e6bb [inductor] Catch BrokenProcessPool and print a more helpful message. (#135120)
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like:
```
...
  File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
    async_compile.wait(globals())
  File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
    raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
2024-09-07 16:33:37 +00:00
Sam Larsen
362ecd9817 [inductor] Skip the sub-process pool until it's ready (#133508)
Summary: Torch-compiling a quick script can be a bit slower than it needs to be: even though we initialize the subprocess pool early, it still might not be ready by the time we try to compile the first Triton kernel. Instead, let's use the single-threaded path until the pool has successfully completed a no-op job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133508
Approved by: https://github.com/Chillee
2024-09-04 03:26:55 +00:00
Xuehai Pan
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
Sam Larsen
358da54be5 [inductor] Better messaging when triton version is too old (#130403)
Summary:
If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior:
1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem.
2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..."

Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403
Approved by: https://github.com/eellison
2024-07-10 23:45:50 +00:00
Sam Larsen
87d14ad419 [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-26 18:34:48 +00:00
Max Podkorytov
79959d707c [Inductor][ROCm] Composable Kernel backend for Inductor (#125453)
This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection.

The implementation is heavily influenced by the series of PRs which adds CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are
 (1) customizing compiler for the ROCm platform
 (2) customizing template code generation for Composable Kernel Universal GEMM instances.

We provide config tuning knobs for balancing between instance sources compilation time and finding the best instance.

### Testing
Install the ck library
```
pip install git+https://github.com/rocm/composable_kernel@develop
```
Run the test
```
TORCH_LOGS=+torch._inductor \
pytest --capture=tee-sys test/inductor/test_ck_backend.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-25 20:54:14 +00:00
PyTorch MergeBot
ad76da6c16 Revert "[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)"
This reverts commit 7b57ddd38c.

Reverted https://github.com/pytorch/pytorch/pull/129257 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 4c1e4c5f30, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))
2024-06-25 16:48:32 +00:00
Sam Larsen
7b57ddd38c [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-24 23:39:43 +00:00
Aaron Orenstein
ea614fb2b1 Flip default value for mypy disallow_untyped_defs [2/11] (#127839)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839
Approved by: https://github.com/oulgen
2024-06-08 18:23:08 +00:00
Sam Larsen
e8e0bdf541 [inductor] parallel-compile: call triton_key() before forking (#127639)
Summary:
A user reported severe slowdown on a workload when using parallel compile. The issue is that in some environments, the process affinity changes after forking such that all forked subprocesses use a single logical processor. Described here: https://github.com/pytorch/pytorch/issues/99625. That requires a separate fix, but during debuging we noticed that we can at least optimize the expensive call to triton_key() before forking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127639
Approved by: https://github.com/eellison, https://github.com/anijain2305
2024-06-07 04:12:57 +00:00
Aaron Gokaslan
2d47385f0f [BE]: Enable ruff TCH rules and autofixes for better imports (#127688)
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future, this will make the initial PyTorch import faster and will reduce cyclic dependencies.

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
2024-06-06 16:55:58 +00:00
James Wu
63d7ffe121 Retry of D58015187 Move AsyncCompile to a different file (#127691)
Summary:
This is a retry of https://github.com/pytorch/pytorch/pull/127545/files
and
D58015187, fixing the internal test that also imported codecache

Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now

Differential Revision: D58054611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691
Approved by: https://github.com/oulgen
2024-06-03 15:29:41 +00:00
PyTorch MergeBot
22f392ba40 Revert "[easy?] Move AsyncCompile to a different file (#127235)"
This reverts commit f58fc16e8f.

Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))
2024-06-01 17:16:16 +00:00
James Wu
f58fc16e8f [easy?] Move AsyncCompile to a different file (#127235)
By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes.

To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral
2024-05-30 02:43:02 +00:00