pytorch/torch
leslie-fang-intel 98929ceae3 [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend's [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with the BF16 data type is significantly slower than the [ATen implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Comparing the generated code with ATen's, the performance gap appears to come from ATen's use of a [local buffer](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor stores and loads temporary results (such as `exp`) through an output buffer of the Kernel Group Args, since that buffer corresponds to a `SchedulerNode`. Each thread accesses its portion of this output buffer via indexing. However, because this buffer (taking `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach.
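
To make the bottleneck concrete, below is a minimal sketch of this shared-temporary pattern for a row-wise softmax. It is illustrative plain C++/OpenMP, not the actual Inductor-generated code; the function and buffer names are assumptions:

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pre-PR pattern: the exp intermediate is staged in a shared buffer sized
// like the output, and each thread indexes into its own slice of it.
void softmax_shared_tmp(const float* in, float* exp_buf, float* out,
                        int64_t rows, int64_t cols) {
  #pragma omp parallel for
  for (int64_t r = 0; r < rows; ++r) {
    const float* row = in + r * cols;
    float max_val = *std::max_element(row, row + cols);
    float sum = 0.0f;
    for (int64_t c = 0; c < cols; ++c) {
      // The temporary lives in a rows * cols global buffer, so the kernel
      // streams far more memory than one thread's working set needs.
      exp_buf[r * cols + c] = std::exp(row[c] - max_val);
      sum += exp_buf[r * cols + c];
    }
    for (int64_t c = 0; c < cols; ++c) {
      out[r * cols + c] = exp_buf[r * cols + c] / sum;
    }
  }
}
```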

In this PR, we introduce the `LocalBuffer` optimization. With this enhancement, the [newly generated Inductor code with a local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` shows significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) for this BF16 `Softmax` case on a Xeon 8480 server shows comparable performance between the Inductor CPP backend and the ATen implementation.
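
For contrast, a sketch of the thread-local variant this PR moves toward (again illustrative, assuming an OpenMP-style parallel loop as in the generated code): each thread reuses one row-sized scratch vector, so the `exp` intermediate stays cache-resident instead of streaming a `rows * cols` temporary through memory.

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Local-buffer pattern: each thread allocates one row-sized scratch vector
// and reuses it across all the rows that thread processes.
void softmax_local_tmp(const float* in, float* out,
                       int64_t rows, int64_t cols) {
  #pragma omp parallel
  {
    std::vector<float> tmp(cols);  // thread-local buffer, one row wide
    #pragma omp for
    for (int64_t r = 0; r < rows; ++r) {
      const float* row = in + r * cols;
      float max_val = *std::max_element(row, row + cols);
      float sum = 0.0f;
      for (int64_t c = 0; c < cols; ++c) {
        tmp[c] = std::exp(row[c] - max_val);  // exp staged locally
        sum += tmp[c];
      }
      for (int64_t c = 0; c < cols; ++c) {
        out[r * cols + c] = tmp[c] / sum;
      }
    }
  }
}
```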

**Test Plan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-07 05:34:57 +00:00
_awaits
_C [BE] annotate torch.autograd.graph (#129558) 2024-07-06 18:14:16 +00:00
_C_flatbuffer Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_custom_op [custom_ops] Mark older custom ops prototypes as deprecated (#130032) 2024-07-03 21:11:05 +00:00
_decomp Add decomposition for slice_scatter (#123744) 2024-06-28 17:02:10 +00:00
_dispatch Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_dynamo [dynamo][user-defined] Support method descriptors (#130159) 2024-07-06 02:03:09 +00:00
_export Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
_functorch Have torch_key hash entire torch directory (#129250) 2024-07-05 15:37:16 +00:00
_higher_order_ops Added compile option to create_block_mask (#130106) 2024-07-06 08:09:56 +00:00
_inductor [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967) 2024-07-07 05:34:57 +00:00
_lazy Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_library [custom ops] Support factory function (#129978) 2024-07-04 00:10:52 +00:00
_logging Enable TORCH_TRACE by default on Conda on Mast (#129988) 2024-07-03 03:35:45 +00:00
_numpy [Ez][BE]: Enable new stable ruff rules (#129825) 2024-07-02 14:47:10 +00:00
_prims [Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247) 2024-06-28 01:04:49 +00:00
_prims_common Make are_strides_like_channels_last size oblivious (#129677) 2024-07-02 11:05:20 +00:00
_refs check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839) 2024-07-02 09:20:49 +00:00
_strobelight Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_subclasses Revert "Fix the SDPA AOT export issue (#130164)" 2024-07-06 05:59:49 +00:00
_vendor
amp Revert "[MPS] Add support for autocast in MPS (#99272)" 2024-07-02 12:29:51 +00:00
ao Revert "Change numeric_debug_handle to store per-node id (#129811)" 2024-07-05 18:14:02 +00:00
autograd [BE] annotate torch.autograd.graph (#129558) 2024-07-06 18:14:16 +00:00
backends [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343) 2024-06-30 19:22:16 +00:00
compiler Flip default value for mypy disallow_untyped_defs [4/11] (#127841) 2024-06-08 18:36:48 +00:00
contrib Flip default value for mypy disallow_untyped_defs [4/11] (#127841) 2024-06-08 18:36:48 +00:00
cpu [inductor][cpp] BF16 AMX micro-gemm support (#127195) 2024-06-21 07:21:47 +00:00
csrc [CI] Enable build with asserts (#129924) 2024-07-06 13:14:32 +00:00
cuda Revert "[MPS] Add support for autocast in MPS (#99272)" 2024-07-02 12:29:51 +00:00
distributed Back out "Pass device to is_pinned call inside TensorProperties.create_from_tensor" (#129972) 2024-07-06 01:07:32 +00:00
distributions [BE]: Update mypy to 1.10.0 (#127717) 2024-06-13 15:57:13 +00:00
export Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
fft
func
futures Flip default value for mypy disallow_untyped_defs [7/11] (#127844) 2024-06-08 18:49:45 +00:00
fx Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
jit [BE][Easy] replace import pathlib with from pathlib import Path (#129426) 2024-06-30 01:36:07 +00:00
legacy
lib [Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328) 2024-05-29 04:33:56 +00:00
linalg Added sorting notes for eig/eigvals (#127492) 2024-05-30 18:13:22 +00:00
masked [BE] update type annotations for basic utilities in torch/__init__.py (#129001) 2024-06-24 18:04:38 +00:00
monitor
mps Add support in Python API for the recommended max working set size. (#128289) 2024-06-12 16:03:57 +00:00
mtia [MTIA] Fix synchronize API (#128714) 2024-06-17 21:58:46 +00:00
multiprocessing Enable sharing meta tensors between processes (#129520) 2024-07-04 20:29:48 +00:00
nested Default to input tensor device for as_nested_tensor(t) (#130050) 2024-07-05 17:50:08 +00:00
nn Added compile option to create_block_mask (#130106) 2024-07-06 08:09:56 +00:00
onnx Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
optim [MPS] Add tensor_lr overloads to fused adam & adamw (#129451) 2024-07-02 19:46:30 +00:00
package [BE] enforce style for empty lines in import segments (#129751) 2024-06-29 14:15:24 +00:00
profiler [Profiler] Clean up use_mtia to follow standard use_device instead (#126284) 2024-06-18 21:01:03 +00:00
quantization Flip default value for mypy disallow_untyped_defs [9/11] (#127846) 2024-06-08 18:50:06 +00:00
signal Flip default value for mypy disallow_untyped_defs [9/11] (#127846) 2024-06-08 18:50:06 +00:00
sparse Flip default value for mypy disallow_untyped_defs [9/11] (#127846) 2024-06-08 18:50:06 +00:00
special Fix global flake8 issues (#124771) 2024-04-26 15:35:53 +00:00
testing Added compile option to create_block_mask (#130106) 2024-07-06 08:09:56 +00:00
utils Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
xpu Flip default value for mypy disallow_untyped_defs [10/11] (#127847) 2024-06-08 18:50:06 +00:00
__config__.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
__future__.py
__init__.py Make sympify'ing SymInt/etc produce their sympy expression (#130166) 2024-07-06 03:56:45 +00:00
_appdirs.py
_classes.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_compile.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_custom_ops.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_deploy.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_guards.py Evaluate symexprs on load path of cache not write (#128997) 2024-06-20 08:55:12 +00:00
_jit_internal.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_linalg_utils.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_lobpcg.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
_lowrank.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_meta_registrations.py Make _embedding_bag_backward explicitly dispatch to CPU and CUDA. (#129691) 2024-07-03 21:54:49 +00:00
_namedtensor_internals.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_ops.py Torchbind call method + effects support (#128397) 2024-06-14 21:28:17 +00:00
_python_dispatcher.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_size_docs.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_sources.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_storage_docs.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_streambase.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_tensor_docs.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_tensor_str.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_tensor.py added type hints for __contains__ (#129653) 2024-06-30 11:49:11 +00:00
_torch_docs.py [Easy] Add whitespace after comma when re-rendering tuple default value in schema (#129884) 2024-07-03 11:45:24 +00:00
_utils_internal.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_utils.py Remove dependency on private _compat_pickle in CPython (#129509) 2024-06-26 14:20:27 +00:00
_VF.py
_vmap_internals.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_weights_only_unpickler.py Improve error message for weights_only load (#129705) 2024-06-28 19:36:31 +00:00
abi-check.cpp
CMakeLists.txt [CI] Enable build with asserts (#129924) 2024-07-06 13:14:32 +00:00
custom_class_detail.h Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)" 2024-06-15 01:58:20 +00:00
custom_class.h [2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878) 2024-07-04 00:39:28 +00:00
extension.h
functional.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
hub.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
library.h Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)" 2024-06-15 01:58:20 +00:00
library.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
overrides.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
py.typed
quasirandom.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
random.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
README.txt
return_types.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
script.h
serialization.py Fix test test_type_hints.py::TestTypeHints::test_doc_examples (#129829) 2024-07-01 13:28:37 +00:00
storage.py [BE] enable UFMT for torch/storage.py (#127706) 2024-06-27 23:16:24 +00:00
torch_version.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
types.py [BE] enable UFMT for torch/storage.py (#127706) 2024-06-27 23:16:24 +00:00
version.py.tpl

Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers.  These headers serve double duty as *internal implementation
detail* headers, whose contents should largely not be used by external
clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.
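
For illustration, a minimal sketch of the boundary this note draws;
`THTensorLike`, `tensor_size`, and the field layout below are hypothetical
stand-ins, not the real TH declarations:

```
#include <cstdint>

// Hypothetical sketch: every name here is an illustrative stand-in.

// What the public C header (THTensor.h style) is meant to expose: accessor
// functions over a type whose layout clients should not touch.
struct THTensorLike {
  int64_t sizes[8];  // in the real code, the layout lives in the .hpp
  int ndim;
};

int64_t tensor_size(const THTensorLike* t, int dim) { return t->sizes[dim]; }

// Preferred: manipulate the struct only through the public function.
int64_t rows_ok(const THTensorLike* t) { return tensor_size(t, 0); }

// Abstraction violation (the pattern the marked torch/csrc sites use):
// reach past the public API into the internal layout directly.
int64_t rows_violation(const THTensorLike* t) { return t->sizes[0]; }
```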