mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 12:21:27 +01:00
## Description
For the single-thread case, this PR improves cache blocking in the CPP GEMM template by using CPU cache information (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are chosen to satisfy the following conditions:
- size_of_B < L1
- size_of_A < 0.5 * L2
For the multi-thread case, the task decomposition among threads needs to be tuned together with cache blocking. The cache blocking change is disabled for multi-thread for now; a follow-up PR will cover the multi-thread optimizations.
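The two single-thread conditions can be sketched as a small sizing routine. This is a minimal illustration, not the PR's implementation: the helper name `pick_cache_blocking`, the register-block sizes `mr`, `nr`, `kr`, and the assumption that the B tile is `Kc x Nr` and the A tile is `Mc x Kc` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class CacheBlocking:
    mc_blocks: int  # register blocks (of mr rows) per cache block along M
    kc_blocks: int  # register blocks (of kr columns) per cache block along K


def pick_cache_blocking(m_blocks, k_blocks, mr, nr, kr,
                        dtype_bytes, l1_bytes, l2_bytes):
    """Hypothetical helper: pick the largest block counts such that
    size_of_B < L1 and size_of_A < 0.5 * L2.

    Assumed tile shapes (not taken from the PR):
      B tile: (kc_blocks * kr) x nr elements
      A tile: (mc_blocks * mr) x (kc_blocks * kr) elements
    """
    # Largest kc_blocks with kc_blocks * (kr * nr * dtype_bytes) < l1_bytes.
    b_bytes_per_k_block = kr * nr * dtype_bytes
    kc_blocks = (l1_bytes - 1) // b_bytes_per_k_block
    kc_blocks = max(1, min(k_blocks, kc_blocks))

    # Largest mc_blocks with 2 * mc_blocks * a_bytes_per_m_block < l2_bytes,
    # which is the same as size_of_A < 0.5 * L2 in integer arithmetic.
    a_bytes_per_m_block = mr * kc_blocks * kr * dtype_bytes
    mc_blocks = (l2_bytes - 1) // (2 * a_bytes_per_m_block)
    mc_blocks = max(1, min(m_blocks, mc_blocks))

    return CacheBlocking(mc_blocks=mc_blocks, kc_blocks=kc_blocks)


# Example with made-up sizes: FP32, 48 KiB L1, 2 MiB L2.
blocking = pick_cache_blocking(
    m_blocks=100, k_blocks=64, mr=8, nr=16, kr=32,
    dtype_bytes=4, l1_bytes=48 * 1024, l2_bytes=2 * 1024 * 1024,
)
print(blocking)
```

The block counts are clamped to at least 1 so that a tile is always produced, even when a single register block would already exceed the cache budget; a real implementation may handle that case differently.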
## Performance
No regressions. Models with > 3% performance speedup are listed below:
### BF16 single thread (measured on CPU with AMX support)
- static shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |
- dynamic shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |
### FP32 single thread (measured on Ice Lake)
- static shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |
- dynamic shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |
### Next step
The E2E-level improvement is limited for the following reasons:
- For several HF models, the gemm template kernel improves at the kernel level, but E2E performance does not change: the kernel is either still slower than the ATen kernel (and therefore not selected during autotuning) or has only improved from slower than ATen to on par with ATen.
- For some models, the gemm template kernel gets > 10% faster with this PR, but because the kernel accounts for only about 3% of the E2E time, the E2E-level improvement is not significant.
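The second point is an instance of Amdahl's law. A rough check, assuming "> 10% improvement" means the kernel runs about 1.10x faster and that everything else in the model is unchanged (the function name `e2e_speedup` is made up for this illustration):

```python
def e2e_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only a fraction of the runtime
    (kernel_fraction) is accelerated by a factor of kernel_speedup."""
    remaining = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup
    return 1.0 / remaining


# Kernel is ~3% of E2E time and becomes ~1.10x faster (assumed reading).
gain = e2e_speedup(kernel_fraction=0.03, kernel_speedup=1.10)
print(f"E2E speedup: {gain:.4f}x")
```

Under these assumptions the E2E gain is roughly 0.3%, well below the 3% threshold used for the tables above.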
We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
| Name |
|---|
| _awaits |
| _C |
| _C_flatbuffer |
| _custom_op |
| _decomp |
| _dispatch |
| _dynamo |
| _export |
| _functorch |
| _higher_order_ops |
| _inductor |
| _lazy |
| _library |
| _logging |
| _numpy |
| _prims |
| _prims_common |
| _refs |
| _strobelight |
| _subclasses |
| _vendor |
| amp |
| ao |
| autograd |
| backends |
| compiler |
| contrib |
| cpu |
| csrc |
| cuda |
| distributed |
| distributions |
| export |
| fft |
| func |
| futures |
| fx |
| jit |
| legacy |
| lib |
| linalg |
| masked |
| monitor |
| mps |
| mtia |
| multiprocessing |
| nested |
| nn |
| onnx |
| optim |
| package |
| profiler |
| quantization |
| signal |
| sparse |
| special |
| testing |
| utils |
| xpu |
| __config__.py |
| __future__.py |
| __init__.py |
| _appdirs.py |
| _classes.py |
| _compile.py |
| _custom_ops.py |
| _deploy.py |
| _guards.py |
| _jit_internal.py |
| _linalg_utils.py |
| _lobpcg.py |
| _lowrank.py |
| _meta_registrations.py |
| _namedtensor_internals.py |
| _ops.py |
| _python_dispatcher.py |
| _size_docs.py |
| _sources.py |
| _storage_docs.py |
| _streambase.py |
| _tensor_docs.py |
| _tensor_str.py |
| _tensor.py |
| _torch_docs.py |
| _utils_internal.py |
| _utils.py |
| _VF.py |
| _vmap_internals.py |
| _weights_only_unpickler.py |
| abi-check.cpp |
| CMakeLists.txt |
| custom_class_detail.h |
| custom_class.h |
| extension.h |
| functional.py |
| hub.py |
| library.h |
| library.py |
| overrides.py |
| py.typed |
| quasirandom.py |
| random.py |
| README.txt |
| return_types.py |
| script.h |
| serialization.py |
| storage.py |
| torch_version.py |
| types.py |
| version.py.tpl |
## Note [TH abstraction violation]

TH/THC provide some hpp headers, which are proper C++ headers rather than C headers. These headers serve double duty as *internal implementation detail* headers, whose contents should largely not be used by external clients.

Ideally, we would not install these headers at all; instead, you should use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`) to manipulate these structs. However, there are a few places in torch/csrc where we violate this abstraction. They are marked with a pointer to this note. Each of those sites will have to be refactored when we refactor the guts of THTensor and related structures.