pytorch/torch
leslie-fang-intel 98929ceae3 [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend's [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with the BF16 data type is significantly slower than the [ATen implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Comparing the generated code with ATen's, the performance gap appears to come from ATen's use of a [local buffer](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor stores and loads temporary results (such as `exp`) through an output buffer of the Kernel Group Args, since that buffer corresponds to a `SchedulerNode`. Each thread accesses its portion of this output buffer via indexing. However, because this buffer (taking `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach.
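
To make the bottleneck concrete, below is a minimal sketch of this shared-temporary pattern for a row-wise softmax. It is illustrative plain C++/OpenMP, not the actual Inductor-generated code; the function and buffer names are assumptions:

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pre-PR pattern: the exp intermediate is staged in a shared buffer sized
// like the output, and each thread indexes into its own slice of it.
void softmax_shared_tmp(const float* in, float* exp_buf, float* out,
                        int64_t rows, int64_t cols) {
  #pragma omp parallel for
  for (int64_t r = 0; r < rows; ++r) {
    const float* row = in + r * cols;
    float max_val = *std::max_element(row, row + cols);
    float sum = 0.0f;
    for (int64_t c = 0; c < cols; ++c) {
      // The temporary lives in a rows * cols global buffer, so the kernel
      // streams far more memory than one thread's working set needs.
      exp_buf[r * cols + c] = std::exp(row[c] - max_val);
      sum += exp_buf[r * cols + c];
    }
    for (int64_t c = 0; c < cols; ++c) {
      out[r * cols + c] = exp_buf[r * cols + c] / sum;
    }
  }
}
```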

In this PR, we introduce the `LocalBuffer` optimization. With this enhancement, the [newly generated Inductor code with a local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` shows significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) for this BF16 `Softmax` case on a Xeon 8480 server shows comparable performance between the Inductor CPP backend and the ATen implementation.
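
For contrast, a sketch of the thread-local variant this PR moves toward (again illustrative, assuming an OpenMP-style parallel loop as in the generated code): each thread reuses one row-sized scratch vector, so the `exp` intermediate stays cache-resident instead of streaming a `rows * cols` temporary through memory.

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Local-buffer pattern: each thread allocates one row-sized scratch vector
// and reuses it across all the rows that thread processes.
void softmax_local_tmp(const float* in, float* out,
                       int64_t rows, int64_t cols) {
  #pragma omp parallel
  {
    std::vector<float> tmp(cols);  // thread-local buffer, one row wide
    #pragma omp for
    for (int64_t r = 0; r < rows; ++r) {
      const float* row = in + r * cols;
      float max_val = *std::max_element(row, row + cols);
      float sum = 0.0f;
      for (int64_t c = 0; c < cols; ++c) {
        tmp[c] = std::exp(row[c] - max_val);  // exp staged locally
        sum += tmp[c];
      }
      for (int64_t c = 0; c < cols; ++c) {
        out[r * cols + c] = tmp[c] / sum;
      }
    }
  }
}
```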

**Test Plan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-07 05:34:57 +00:00
_awaits
_C [BE] annotate torch.autograd.graph (#129558) 2024-07-06 18:14:16 +00:00
_C_flatbuffer Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_custom_op [custom_ops] Mark older custom ops prototypes as deprecated (#130032) 2024-07-03 21:11:05 +00:00
_decomp Add decomposition for slice_scatter (#123744) 2024-06-28 17:02:10 +00:00
_dispatch Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_dynamo [dynamo][user-defined] Support method descriptors (#130159) 2024-07-06 02:03:09 +00:00
_export Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
_functorch Have torch_key hash entire torch directory (#129250) 2024-07-05 15:37:16 +00:00
_higher_order_ops Added compile option to create_block_mask (#130106) 2024-07-06 08:09:56 +00:00
_inductor [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967) 2024-07-07 05:34:57 +00:00
_lazy Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_library [custom ops] Support factory function (#129978) 2024-07-04 00:10:52 +00:00
_logging Enable TORCH_TRACE by default on Conda on Mast (#129988) 2024-07-03 03:35:45 +00:00
_numpy [Ez][BE]: Enable new stable ruff rules (#129825) 2024-07-02 14:47:10 +00:00
_prims [Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247) 2024-06-28 01:04:49 +00:00
_prims_common Make are_strides_like_channels_last size oblivious (#129677) 2024-07-02 11:05:20 +00:00
_refs check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839) 2024-07-02 09:20:49 +00:00
_strobelight Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_subclasses Revert "Fix the SDPA AOT export issue (#130164)" 2024-07-06 05:59:49 +00:00
_vendor
amp Revert "[MPS] Add support for autocast in MPS (#99272)" 2024-07-02 12:29:51 +00:00
ao Revert "Change numeric_debug_handle to store per-node id (#129811)" 2024-07-05 18:14:02 +00:00
autograd [BE] annotate torch.autograd.graph (#129558) 2024-07-06 18:14:16 +00:00
backends [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343) 2024-06-30 19:22:16 +00:00
compiler Flip default value for mypy disallow_untyped_defs [4/11] (#127841) 2024-06-08 18:36:48 +00:00
contrib Flip default value for mypy disallow_untyped_defs [4/11] (#127841) 2024-06-08 18:36:48 +00:00
cpu [inductor][cpp] BF16 AMX micro-gemm support (#127195) 2024-06-21 07:21:47 +00:00
csrc [CI] Enable build with asserts (#129924) 2024-07-06 13:14:32 +00:00
cuda Revert "[MPS] Add support for autocast in MPS (#99272)" 2024-07-02 12:29:51 +00:00
distributed Back out "Pass device to is_pinned call inside TensorProperties.create_from_tensor" (#129972) 2024-07-06 01:07:32 +00:00
distributions [BE]: Update mypy to 1.10.0 (#127717) 2024-06-13 15:57:13 +00:00
export Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
fft
func
futures Flip default value for mypy disallow_untyped_defs [7/11] (#127844) 2024-06-08 18:49:45 +00:00
fx Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
jit [BE][Easy] replace import pathlib with from pathlib import Path (#129426) 2024-06-30 01:36:07 +00:00
legacy
lib [Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328) 2024-05-29 04:33:56 +00:00
linalg Added sorting notes for eig/eigvals (#127492) 2024-05-30 18:13:22 +00:00
masked [BE] update type annotations for basic utilities in torch/__init__.py (#129001) 2024-06-24 18:04:38 +00:00
monitor
mps Add support in Python API for the recommended max working set size. (#128289) 2024-06-12 16:03:57 +00:00
mtia [MTIA] Fix synchronize API (#128714) 2024-06-17 21:58:46 +00:00
multiprocessing Enable sharing meta tensors between processes (#129520) 2024-07-04 20:29:48 +00:00
nested Default to input tensor device for as_nested_tensor(t) (#130050) 2024-07-05 17:50:08 +00:00
nn Added compile option to create_block_mask (#130106) 2024-07-06 08:09:56 +00:00
onnx Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
optim [MPS] Add tensor_lr overloads to fused adam & adamw (#129451) 2024-07-02 19:46:30 +00:00
package [BE] enforce style for empty lines in import segments (#129751) 2024-06-29 14:15:24 +00:00
profiler [Profiler] Clean up use_mtia to follow standard use_device instead (#126284) 2024-06-18 21:01:03 +00:00
quantization Flip default value for mypy disallow_untyped_defs [9/11] (#127846) 2024-06-08 18:50:06 +00:00
signal Flip default value for mypy disallow_untyped_defs [9/11] (#127846) 2024-06-08 18:50:06 +00:00
sparse Flip default value for mypy disallow_untyped_defs [9/11] (#127846) 2024-06-08 18:50:06 +00:00
special Fix global flake8 issues (#124771) 2024-04-26 15:35:53 +00:00
testing Added compile option to create_block_mask (#130106) 2024-07-06 08:09:56 +00:00
utils Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)" 2024-07-06 07:20:05 +00:00
xpu Flip default value for mypy disallow_untyped_defs [10/11] (#127847) 2024-06-08 18:50:06 +00:00
__config__.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
__future__.py
__init__.py Make sympify'ing SymInt/etc produce their sympy expression (#130166) 2024-07-06 03:56:45 +00:00
_appdirs.py
_classes.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_compile.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_custom_ops.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_deploy.py Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_guards.py Evaluate symexprs on load path of cache not write (#128997) 2024-06-20 08:55:12 +00:00
_jit_internal.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_linalg_utils.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_lobpcg.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
_lowrank.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_meta_registrations.py Make _embedding_bag_backward explicitly dispatch to CPU and CUDA. (#129691) 2024-07-03 21:54:49 +00:00
_namedtensor_internals.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_ops.py Torchbind call method + effects support (#128397) 2024-06-14 21:28:17 +00:00
_python_dispatcher.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_size_docs.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_sources.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_storage_docs.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_streambase.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_tensor_docs.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_tensor_str.py Flip default value for mypy disallow_untyped_defs [3/11] (#127840) 2024-06-08 18:28:01 +00:00
_tensor.py added type hints for __contains__ (#129653) 2024-06-30 11:49:11 +00:00
_torch_docs.py [Easy] Add whitespace after comma when re-rendering tuple default value in schema (#129884) 2024-07-03 11:45:24 +00:00
_utils_internal.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_utils.py Remove dependency on private _compat_pickle in CPython (#129509) 2024-06-26 14:20:27 +00:00
_VF.py
_vmap_internals.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_weights_only_unpickler.py Improve error message for weights_only load (#129705) 2024-06-28 19:36:31 +00:00
abi-check.cpp
CMakeLists.txt [CI] Enable build with asserts (#129924) 2024-07-06 13:14:32 +00:00
custom_class_detail.h Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)" 2024-06-15 01:58:20 +00:00
custom_class.h [2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878) 2024-07-04 00:39:28 +00:00
extension.h
functional.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
hub.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
library.h Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)" 2024-06-15 01:58:20 +00:00
library.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
overrides.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
py.typed
quasirandom.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
random.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
README.txt
return_types.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
script.h
serialization.py Fix test test_type_hints.py::TestTypeHints::test_doc_examples (#129829) 2024-07-01 13:28:37 +00:00
storage.py [BE] enable UFMT for torch/storage.py (#127706) 2024-06-27 23:16:24 +00:00
torch_version.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
types.py [BE] enable UFMT for torch/storage.py (#127706) 2024-06-27 23:16:24 +00:00
version.py.tpl

Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers.  These headers serve double duty as *internal implementation
detail* headers, whose contents should largely not be used by external
clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.
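
For illustration, a minimal sketch of the boundary this note draws;
`THTensorLike`, `tensor_size`, and the field layout below are hypothetical
stand-ins, not the real TH declarations:

```
#include <cstdint>

// Hypothetical sketch: every name here is an illustrative stand-in.

// What the public C header (THTensor.h style) is meant to expose: accessor
// functions over a type whose layout clients should not touch.
struct THTensorLike {
  int64_t sizes[8];  // in the real code, the layout lives in the .hpp
  int ndim;
};

int64_t tensor_size(const THTensorLike* t, int dim) { return t->sizes[dim]; }

// Preferred: manipulate the struct only through the public function.
int64_t rows_ok(const THTensorLike* t) { return tensor_size(t, 0); }

// Abstraction violation (the pattern the marked torch/csrc sites use):
// reach past the public API into the internal layout directly.
int64_t rows_violation(const THTensorLike* t) { return t->sizes[0]; }
```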