pytorch/torch
Wu, Chunyuan a8319698b3 [inductor] [cpp] improve cache blocking with CPU info (#129348)
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated so that the following conditions hold (a rough sketch of the computation follows the list):
- `size_of_B < L1`
- `size_of_A < 0.5 * L2`
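
As a rough illustration (not the actual template code), the sketch below derives block sizes from these two constraints; `Nr`, `Kr`, `Mr`, `dtype_bytes`, `l1_size`, and `l2_size` are hypothetical stand-ins for the register-block sizes, the element size, and the per-core cache sizes.

```python
def cache_aware_blocking(M, K, Nr, Kr, Mr, dtype_bytes, l1_size, l2_size):
    # Keep one B panel (Kc x Nr) resident in L1: size_of_B < L1.
    Kc = min(K, l1_size // (Nr * dtype_bytes))
    Kc = max(Kr, (Kc // Kr) * Kr)  # round down to a multiple of the register block

    # Keep the A block (Mc x Kc) within half of L2: size_of_A < 0.5 * L2.
    Mc = min(M, (l2_size // 2) // (Kc * dtype_bytes))
    Mc = max(Mr, (Mc // Mr) * Mr)

    # Return the blocking as counts of register blocks, analogous to
    # Mc_blocks / Kc_blocks in the GEMM template.
    return Mc // Mr, Kc // Kr
```

For example, with FP32 (4 bytes per element), a 48 KB L1, and `Nr = 32`, `Kc` would be capped at 48 * 1024 / (32 * 4) = 384 elements along K.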

For the multi-thread case, the task decomposition among threads needs to be tuned together with the cache blocking. We have disabled the cache-blocking change for multi-threading for now and will submit a follow-up PR with multi-thread optimizations.

## Performance
No regressions were observed. Models with a > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

### Next step
The E2E-level improvement is limited for the following reasons:

- For several HF models, we observe a kernel-level performance improvement for the GEMM template kernel, but the kernel is either still slower than the ATen kernel (and thus not selected during autotuning) or has only improved from slower than ATen to on par with ATen, so there is no E2E-level performance change.

- There are models where the GEMM template kernel gains > 10% performance from this PR, but since the kernel time is only about 3% of the E2E time, we do not observe a significant E2E-level improvement.

We will continue to explore further optimizations of the GEMM template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
2024-07-20 06:53:31 +00:00
_awaits
_C [inductor] [cpp] improve cache blocking with CPU info (#129348) 2024-07-20 06:53:31 +00:00
_C_flatbuffer
_custom_op Revert "Tighten torch.library.infer_schema input types (#130705)" 2024-07-16 12:57:11 +00:00
_decomp Add decomposition for channel_shuffle (#118775) 2024-07-20 01:24:41 +00:00
_dispatch
_dynamo Ensure invariant that all inputs have tensor dict (#131249) 2024-07-20 04:40:58 +00:00
_export [export] fix zero arg export in training_ir (#130990) 2024-07-20 02:35:13 +00:00
_functorch [functorch] saved tensor hooks error should only apply to grad, vjp transforms. (#131191) 2024-07-19 23:16:27 +00:00
_higher_order_ops Revert "[Autograd] Cond Higher-Order Operation (#126911)" 2024-07-18 22:06:40 +00:00
_inductor [inductor] [cpp] improve cache blocking with CPU info (#129348) 2024-07-20 06:53:31 +00:00
_lazy [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
_library Revert "Tighten torch.library.infer_schema input types (#130705)" 2024-07-16 12:57:11 +00:00
_logging Write trace_structured events to scuba (#130955) 2024-07-19 06:02:47 +00:00
_numpy Make hashing a SymInt raise an error again (#130548) 2024-07-16 18:30:30 +00:00
_prims Revert "Add decompositions for copy variants of view ops (#128416)" 2024-07-11 22:09:23 +00:00
_prims_common [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) 2024-07-14 08:17:52 +00:00
_refs Add decomposition for channel_shuffle (#118775) 2024-07-20 01:24:41 +00:00
_strobelight
_subclasses [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) 2024-07-14 08:17:52 +00:00
_vendor
amp Revert "[MPS] Add support for autocast in MPS (#99272)" 2024-07-02 12:29:51 +00:00
ao Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953) 2024-07-19 04:58:02 +00:00
autograd [autograd] Support GradientEdge as output for torch.autograd.grad (#127766) 2024-07-16 21:46:19 +00:00
backends [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343) 2024-06-30 19:22:16 +00:00
compiler
contrib
cpu [inductor][cpp] BF16 AMX micro-gemm support (#127195) 2024-06-21 07:21:47 +00:00
csrc [inductor] [cpp] improve cache blocking with CPU info (#129348) 2024-07-20 06:53:31 +00:00
cuda [ROCm] Return correct AMDSMI socket_power metric (#130331) 2024-07-17 01:58:58 +00:00
distributed [c10] add an option to pg_config split share (#130877) 2024-07-19 21:11:26 +00:00
distributions [BE]: Update mypy to 1.10.0 (#127717) 2024-06-13 15:57:13 +00:00
export [export] fix zero arg export in training_ir (#130990) 2024-07-20 02:35:13 +00:00
fft
func
futures
fx Revert "[Autograd] Cond Higher-Order Operation (#126911)" 2024-07-18 22:06:40 +00:00
jit [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
legacy
lib
linalg
masked [BE] update type annotations for basic utilities in torch/__init__.py (#129001) 2024-06-24 18:04:38 +00:00
monitor
mps Add support in Python API for the recommended max working set size. (#128289) 2024-06-12 16:03:57 +00:00
mtia [MTIA] Fix synchronize API (#128714) 2024-06-17 21:58:46 +00:00
multiprocessing Enable sharing meta tensors between processes (#129520) 2024-07-04 20:29:48 +00:00
nested [NestedTensor] Integrate sum along the jagged dimension into NestedTensor (#130425) 2024-07-18 10:48:18 +00:00
nn add some description on create_block_mask and mask mods (#131209) 2024-07-20 04:40:48 +00:00
onnx [ONNX] Run ruff pyupgrade to update type annotations (#130657) 2024-07-19 05:09:44 +00:00
optim fix the use of initial learning rate in the OneCycleLR example (#130306) 2024-07-09 18:58:07 +00:00
package [BE] enforce style for empty lines in import segments (#129751) 2024-06-29 14:15:24 +00:00
profiler [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
quantization
signal
sparse Enable UFMT on all of torch/sparse (#130545) 2024-07-15 22:35:52 +00:00
special
testing Add hooks for execution on intel gaudi devices - 1 (#128584) 2024-07-20 05:03:36 +00:00
utils [BE] bump optree version to 0.12.1 (#130139) 2024-07-20 02:41:10 +00:00
xpu
__config__.py
__future__.py
__init__.py Make hashing a SymInt raise an error again (#130548) 2024-07-16 18:30:30 +00:00
_appdirs.py
_classes.py
_compile.py
_custom_ops.py Revert "Tighten torch.library.infer_schema input types (#130705)" 2024-07-16 12:57:11 +00:00
_deploy.py
_guards.py [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
_jit_internal.py [torchscript] Add logging for model id. (#130118) 2024-07-09 22:24:16 +00:00
_linalg_utils.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_lobpcg.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
_lowrank.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_meta_registrations.py Add decomposition for channel_shuffle (#118775) 2024-07-20 01:24:41 +00:00
_namedtensor_internals.py
_ops.py [Easy] Fix argument name collision in dispatched functions (#129562) 2024-07-17 14:39:56 +00:00
_python_dispatcher.py
_size_docs.py
_sources.py
_storage_docs.py
_streambase.py
_tensor_docs.py
_tensor_str.py fix tensor print behavior for XPU (#130523) 2024-07-17 02:03:32 +00:00
_tensor.py [easy] Small rendering fix in Tensor.module_load doc (#130489) 2024-07-12 22:12:53 +00:00
_torch_docs.py Introduce the concept of Accelerators to PyTorch doc (#129363) 2024-07-15 14:24:46 +00:00
_utils_internal.py Write trace_structured events to scuba (#130955) 2024-07-19 06:02:47 +00:00
_utils.py Remove dependency on private _compat_pickle in CPython (#129509) 2024-06-26 14:20:27 +00:00
_VF.py
_vmap_internals.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_weights_only_unpickler.py Add torch.serialization.safe_globals context manager (#127939) 2024-07-12 20:38:43 +00:00
abi-check.cpp
CMakeLists.txt [BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409) 2024-07-10 15:50:47 +00:00
custom_class_detail.h [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301) 2024-07-08 07:03:53 +00:00
custom_class.h [2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878) 2024-07-04 00:39:28 +00:00
extension.h
functional.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
hub.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
library.h [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301) 2024-07-08 07:03:53 +00:00
library.py [custom_ops] expose torch.library.register_torch_dispatch (#130261) 2024-07-12 14:13:01 +00:00
overrides.py [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
py.typed
quasirandom.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
random.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
README.txt
return_types.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
script.h
serialization.py Add torch.serialization.safe_globals context manager (#127939) 2024-07-12 20:38:43 +00:00
storage.py typing: storage (#130669) 2024-07-16 14:31:35 +00:00
torch_version.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
types.py typing fake_tensor.py (#128041) 2024-07-13 06:07:40 +00:00
version.py.tpl

Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers.  These headers serve double duty as *internal implementation
detail* headers, whose contents should largely not be used by external
clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.