pytorch/torch
Wu, Chunyuan a8319698b3 [inductor] [cpp] improve cache blocking with CPU info (#129348)
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated so that the following conditions hold (a rough sketch of the computation follows the list):
- `size_of_B < L1`
- `size_of_A < 0.5 * L2`
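
As a rough illustration (not the actual template code), the sketch below derives block sizes from these two constraints; `Nr`, `Kr`, `Mr`, `dtype_bytes`, `l1_size`, and `l2_size` are hypothetical stand-ins for the register-block sizes, the element size, and the per-core cache sizes.

```python
def cache_aware_blocking(M, K, Nr, Kr, Mr, dtype_bytes, l1_size, l2_size):
    # Keep one B panel (Kc x Nr) resident in L1: size_of_B < L1.
    Kc = min(K, l1_size // (Nr * dtype_bytes))
    Kc = max(Kr, (Kc // Kr) * Kr)  # round down to a multiple of the register block

    # Keep the A block (Mc x Kc) within half of L2: size_of_A < 0.5 * L2.
    Mc = min(M, (l2_size // 2) // (Kc * dtype_bytes))
    Mc = max(Mr, (Mc // Mr) * Mr)

    # Return the blocking as counts of register blocks, analogous to
    # Mc_blocks / Kc_blocks in the GEMM template.
    return Mc // Mr, Kc // Kr
```

For example, with FP32 (4 bytes per element), a 48 KB L1, and `Nr = 32`, `Kc` would be capped at 48 * 1024 / (32 * 4) = 384 elements along K.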

For the multi-thread case, the task decomposition among threads needs to be tuned together with the cache blocking. We have disabled the cache-blocking change for multi-threading for now and will submit a follow-up PR with multi-thread optimizations.

## Performance
No regressions were observed. Models with a > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

### Next step
The E2E-level improvement is limited for the following reasons:

- For several HF models, we observe a kernel-level performance improvement for the GEMM template kernel, but the kernel is either still slower than the ATen kernel (and thus not selected during autotuning) or has only improved from slower than ATen to on par with ATen, so there is no E2E-level performance change.

- There are models where the GEMM template kernel gains > 10% performance from this PR, but since the kernel time is only about 3% of the E2E time, we do not observe a significant E2E-level improvement.

We will continue to explore further optimizations of the GEMM template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
2024-07-20 06:53:31 +00:00
_awaits
_C [inductor] [cpp] improve cache blocking with CPU info (#129348) 2024-07-20 06:53:31 +00:00
_C_flatbuffer
_custom_op Revert "Tighten torch.library.infer_schema input types (#130705)" 2024-07-16 12:57:11 +00:00
_decomp Add decomposition for channel_shuffle (#118775) 2024-07-20 01:24:41 +00:00
_dispatch
_dynamo Ensure invariant that all inputs have tensor dict (#131249) 2024-07-20 04:40:58 +00:00
_export [export] fix zero arg export in training_ir (#130990) 2024-07-20 02:35:13 +00:00
_functorch [functorch] saved tensor hooks error should only apply to grad, vjp transforms. (#131191) 2024-07-19 23:16:27 +00:00
_higher_order_ops Revert "[Autograd] Cond Higher-Order Operation (#126911)" 2024-07-18 22:06:40 +00:00
_inductor [inductor] [cpp] improve cache blocking with CPU info (#129348) 2024-07-20 06:53:31 +00:00
_lazy [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
_library Revert "Tighten torch.library.infer_schema input types (#130705)" 2024-07-16 12:57:11 +00:00
_logging Write trace_structured events to scuba (#130955) 2024-07-19 06:02:47 +00:00
_numpy Make hashing a SymInt raise an error again (#130548) 2024-07-16 18:30:30 +00:00
_prims Revert "Add decompositions for copy variants of view ops (#128416)" 2024-07-11 22:09:23 +00:00
_prims_common [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) 2024-07-14 08:17:52 +00:00
_refs Add decomposition for channel_shuffle (#118775) 2024-07-20 01:24:41 +00:00
_strobelight
_subclasses [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) 2024-07-14 08:17:52 +00:00
_vendor
amp Revert "[MPS] Add support for autocast in MPS (#99272)" 2024-07-02 12:29:51 +00:00
ao Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953) 2024-07-19 04:58:02 +00:00
autograd [autograd] Support GradientEdge as output for torch.autograd.grad (#127766) 2024-07-16 21:46:19 +00:00
backends [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343) 2024-06-30 19:22:16 +00:00
compiler
contrib
cpu [inductor][cpp] BF16 AMX micro-gemm support (#127195) 2024-06-21 07:21:47 +00:00
csrc [inductor] [cpp] improve cache blocking with CPU info (#129348) 2024-07-20 06:53:31 +00:00
cuda [ROCm] Return correct AMDSMI socket_power metric (#130331) 2024-07-17 01:58:58 +00:00
distributed [c10] add an option to pg_config split share (#130877) 2024-07-19 21:11:26 +00:00
distributions [BE]: Update mypy to 1.10.0 (#127717) 2024-06-13 15:57:13 +00:00
export [export] fix zero arg export in training_ir (#130990) 2024-07-20 02:35:13 +00:00
fft
func
futures
fx Revert "[Autograd] Cond Higher-Order Operation (#126911)" 2024-07-18 22:06:40 +00:00
jit [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
legacy
lib
linalg
masked [BE] update type annotations for basic utilities in torch/__init__.py (#129001) 2024-06-24 18:04:38 +00:00
monitor
mps Add support in Python API for the recommended max working set size. (#128289) 2024-06-12 16:03:57 +00:00
mtia [MTIA] Fix synchronize API (#128714) 2024-06-17 21:58:46 +00:00
multiprocessing Enable sharing meta tensors between processes (#129520) 2024-07-04 20:29:48 +00:00
nested [NestedTensor] Integrate sum along the jagged dimension into NestedTensor (#130425) 2024-07-18 10:48:18 +00:00
nn add some description on create_block_mask and mask mods (#131209) 2024-07-20 04:40:48 +00:00
onnx [ONNX] Run ruff pyupgrade to update type annotations (#130657) 2024-07-19 05:09:44 +00:00
optim fix the use of initial learning rate in the OneCycleLR example (#130306) 2024-07-09 18:58:07 +00:00
package [BE] enforce style for empty lines in import segments (#129751) 2024-06-29 14:15:24 +00:00
profiler [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
quantization
signal
sparse Enable UFMT on all of torch/sparse (#130545) 2024-07-15 22:35:52 +00:00
special
testing Add hooks for execution on intel gaudi devices - 1 (#128584) 2024-07-20 05:03:36 +00:00
utils [BE] bump optree version to 0.12.1 (#130139) 2024-07-20 02:41:10 +00:00
xpu
__config__.py
__future__.py
__init__.py Make hashing a SymInt raise an error again (#130548) 2024-07-16 18:30:30 +00:00
_appdirs.py
_classes.py
_compile.py
_custom_ops.py Revert "Tighten torch.library.infer_schema input types (#130705)" 2024-07-16 12:57:11 +00:00
_deploy.py
_guards.py [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
_jit_internal.py [torchscript] Add logging for model id. (#130118) 2024-07-09 22:24:16 +00:00
_linalg_utils.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_lobpcg.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
_lowrank.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_meta_registrations.py Add decomposition for channel_shuffle (#118775) 2024-07-20 01:24:41 +00:00
_namedtensor_internals.py
_ops.py [Easy] Fix argument name collision in dispatched functions (#129562) 2024-07-17 14:39:56 +00:00
_python_dispatcher.py
_size_docs.py
_sources.py
_storage_docs.py
_streambase.py
_tensor_docs.py
_tensor_str.py fix tensor print behavior for XPU (#130523) 2024-07-17 02:03:32 +00:00
_tensor.py [easy] Small rendering fix in Tensor.module_load doc (#130489) 2024-07-12 22:12:53 +00:00
_torch_docs.py Introduce the concept of Accelerators to PyTorch doc (#129363) 2024-07-15 14:24:46 +00:00
_utils_internal.py Write trace_structured events to scuba (#130955) 2024-07-19 06:02:47 +00:00
_utils.py Remove dependency on private _compat_pickle in CPython (#129509) 2024-06-26 14:20:27 +00:00
_VF.py
_vmap_internals.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
_weights_only_unpickler.py Add torch.serialization.safe_globals context manager (#127939) 2024-07-12 20:38:43 +00:00
abi-check.cpp
CMakeLists.txt [BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409) 2024-07-10 15:50:47 +00:00
custom_class_detail.h [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301) 2024-07-08 07:03:53 +00:00
custom_class.h [2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878) 2024-07-04 00:39:28 +00:00
extension.h
functional.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
hub.py [BE] enable UFMT for torch/nn/*.py (#128593) 2024-06-23 16:05:13 +00:00
library.h [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301) 2024-07-08 07:03:53 +00:00
library.py [custom_ops] expose torch.library.register_torch_dispatch (#130261) 2024-07-12 14:13:01 +00:00
overrides.py [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
py.typed
quasirandom.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
random.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
README.txt
return_types.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
script.h
serialization.py Add torch.serialization.safe_globals context manager (#127939) 2024-07-12 20:38:43 +00:00
storage.py typing: storage (#130669) 2024-07-16 14:31:35 +00:00
torch_version.py [BE] enable UFMT for top-level files torch/*.py (#127707) 2024-06-12 20:15:05 +00:00
types.py typing fake_tensor.py (#128041) 2024-07-13 06:07:40 +00:00
version.py.tpl

Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers.  These headers serve double duty as *internal implementation
detail* headers, whose contents should largely not be used by external
clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.