pytorch/torch
Pearu Peterson e1c872e009 Add optimal triton kernel parameters to bsr_dense_mm and scatter_mm for bfloat16 and float32 dtypes (#113553)
As in the title.

This PR is a follow-up to https://github.com/pytorch/pytorch/pull/112737, addressing the bfloat16 and float32 dtype cases. The speedups measured on an `NVIDIA A100-SXM4-80GB` are listed below (a usage sketch comes after the list):

- bsr_scatter_mm and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 29%/75%.
  - for blocksize 32x32, the average/maximum speedup is about 23%/58%.
  - for blocksize 64x64, the average/maximum speedup is about 27%/66%.
  - for blocksize 128x128, the average/maximum speedup is about 33%/72%.
- bsr_dense_mm and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 47%/61%.
  - for blocksize 32x32, the average/maximum speedup is about 29%/43%.
  - for blocksize 64x64, the average/maximum speedup is about 21%/41%.
  - for blocksize 128x128, the average/maximum speedup is about 12%/29%.
- bsr_dense_mm and float32
  - for blocksize 16x16, the average/maximum speedup is about 35%/49%.
  - for blocksize 32x32, the average/maximum speedup is about 2%/5%.
  - for blocksize 64x64, the average/maximum speedup is about 2%/21%.
  - for blocksize 128x128, the average/maximum speedup is about 79%/84%.

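For reference, here is a minimal sketch of how these kernels can be exercised. `torch.sparse._triton_ops` is a private module, so the import path and the call shown below are assumptions based on the current tree, not a stable public API:

```python
# Minimal sketch (assumed API, not taken from this PR): exercising the BSR
# Triton kernels tuned here. torch.sparse._triton_ops is private, so the
# import path and signature may change between releases.
import torch
from torch.sparse._triton_ops import bsr_dense_mm

# Operands in one of the newly tuned dtypes, on a CUDA device.
a = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
b = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")

# Convert the left operand to BSR with one of the benchmarked blocksizes.
a_bsr = a.to_sparse_bsr(blocksize=(32, 32))

# The tuned launch parameters are selected per dtype and blocksize, which
# is why the results above are broken down along those two axes.
out = bsr_dense_mm(a_bsr, b)
```
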
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113553
Approved by: https://github.com/cpuhrsch
2023-11-14 00:47:59 +00:00
_awaits
_C Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
_C_flatbuffer
_custom_op Use pytree.tree_leaves everywhere (#112324) 2023-10-30 03:39:04 +00:00
_decomp Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
_dispatch
_dynamo [HigherOrderOp] add pytree operands tests for cond (#112661) 2023-11-13 23:09:46 +00:00
_export Revert "[pytree] register pytree node type in both C++ pytree and Python pytree (#112111)" 2023-11-10 17:24:40 +00:00
_functorch Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)" 2023-11-13 21:46:57 +00:00
_higher_order_ops [HigherOrderOp] add pytree operands tests for cond (#112661) 2023-11-13 23:09:46 +00:00
_inductor [aotinductor] add versions for the sdpa shim api (#113487) 2023-11-13 20:18:58 +00:00
_lazy
_library torch.library: Create helper function is_functional_schema (#111660) 2023-10-27 15:20:25 +00:00
_logging Support logging aliases to list of modules (#113567) 2023-11-13 23:35:18 +00:00
_numpy Avoid calling as_tensor twice (#112866) 2023-11-07 16:10:59 +00:00
_prims Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
_prims_common Allow inferring divisibility on unbacked SymInts and do replacement trick (#113165) 2023-11-10 21:28:02 +00:00
_refs Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
_subclasses Implement narrow from a regular tensor to jagged tensor (#112770) 2023-11-13 19:09:59 +00:00
amp
ao [quant][pt2e] Add transform_for_annotation method in Quantizer (#113115) 2023-11-09 20:23:29 +00:00
autograd Fix docstring errors in reductions.py, spawn.py, pool.py, parameter.py, cpp.py, grad.py, __init__.py, profiler.py, queue.py, graph.py (#113052) 2023-11-10 21:19:17 +00:00
backends docs: fix docstring errors in quantized modules and others (#112695) 2023-11-07 23:52:16 +00:00
compiler Fix torch.compiler.cudagraph_mark_step_begin example (#112807) 2023-11-07 04:15:31 +00:00
contrib Fixed docstring errors in _fuser.py, _state.py, __init__.py, _freeze.py, _async.py, _recursive.py, _tensorboard_vis.py, _trace.py, _await.py, _check.py, _serialization.py, _script.py, annotations.py, _monkeytype_config.py (#113371) 2023-11-12 03:19:02 +00:00
cpu [Dist] Enable FSDP on CPU (#112145) 2023-11-07 01:37:02 +00:00
csrc [2/N] Enable clang-tidy checks in torch/csrc/profiler (#113439) 2023-11-14 00:39:54 +00:00
cuda Fixed docstring errors inside torch/cuda/ and torch/optim/ (Docathon H2) (#112964) 2023-11-13 22:16:44 +00:00
distributed [dtensor] refactor op dispatch and fix is_same_size/equal (#112927) 2023-11-13 22:46:31 +00:00
distributions Add inverse gamma distribution and fix sign bug in PowerTransform. (#104501) 2023-11-01 02:26:25 +00:00
export [pytree] align function signature between C++ and Python pytree (#112482) 2023-11-10 02:37:48 +00:00
fft
func
futures
fx [Dynamo] Match closures by code ID (#109427) 2023-11-12 08:20:14 +00:00
jit Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
legacy
lib
linalg [Docs] fix typo in example of torch.linalg.solve_triangular (#112361) 2023-10-30 10:33:14 +00:00
masked docs: Add docstring for torch.masked._ops.logaddexp (#113206) 2023-11-08 22:45:35 +00:00
monitor
mps
multiprocessing Fix docstring errors in reductions.py, spawn.py, pool.py, parameter.py, cpp.py, grad.py, __init__.py, profiler.py, queue.py, graph.py (#113052) 2023-11-10 21:19:17 +00:00
nested Implement narrow from a regular tensor to jagged tensor (#112770) 2023-11-13 19:09:59 +00:00
nn Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
onnx Add inheritance to ONNX's InputAdaptStep and OutputAdaptSet impl (#113476) 2023-11-13 21:27:44 +00:00
optim Fixed docstring errors inside torch/cuda/ and torch/optim/ (Docathon H2) (#112964) 2023-11-13 22:16:44 +00:00
package Add file name and size to the serialization metadata logging (#113077) 2023-11-09 11:14:24 +00:00
profiler [Profiler][Easy] Make timestamps in memory timelines be in microseconds (us) (#112772) 2023-11-03 00:41:41 +00:00
quantization
signal
sparse Add optimal triton kernel parameters to bsr_dense_mm and scatter_mm for bfloat16 and float32 dtypes (#113553) 2023-11-14 00:47:59 +00:00
special
testing Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
utils Fixed error with cuda_ver in cpp_extension.py (#113555) 2023-11-14 00:12:22 +00:00
__config__.py
__future__.py
__init__.py Make dynamo configs more amenable to static type checking (#112130) 2023-11-08 21:17:45 +00:00
_appdirs.py
_classes.py
_compile.py
_custom_ops.py
_deploy.py
_guards.py [inductor] Make {output_graph,pad_mm}.py pass follow_imports typechecking (#113413) 2023-11-11 22:15:46 +00:00
_jit_internal.py
_linalg_utils.py
_lobpcg.py
_lowrank.py
_meta_registrations.py expose mem-eff to autograd (#110495) 2023-11-13 17:47:40 +00:00
_namedtensor_internals.py
_ops.py Update impl_abstract_pystub to be less boilerplatey (#113182) 2023-11-08 00:39:00 +00:00
_python_dispatcher.py
_sources.py
_storage_docs.py Document torch.from_file and fix UntypedStorage.from_file docs (#111688) 2023-10-25 19:28:11 +00:00
_streambase.py [dynamo][stream]support device-agnostic stream in dynamo and capture stream/event method in fx graph (#108312) 2023-10-22 13:22:58 +00:00
_tensor_docs.py Rewrite docs so that it is OK to use record_stream before uses (#113282) 2023-11-08 21:24:50 +00:00
_tensor_str.py
_tensor.py [dynamo] Make {testing,debug_utils,utils}.py pass follow_imports typechecking (#113519) 2023-11-11 22:15:46 +00:00
_torch_docs.py Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377) 2023-11-01 16:10:09 +00:00
_utils_internal.py Update impl_abstract_pystub to be less boilerplatey (#113182) 2023-11-08 00:39:00 +00:00
_utils.py Fix torch.load(..., weights_only=True) for NT (#112516) 2023-11-02 14:41:04 +00:00
_VF.py
_vmap_internals.py
_weights_only_unpickler.py Fix torch.load(..., weights_only=True) for NT (#112516) 2023-11-02 14:41:04 +00:00
abi-check.cpp
CMakeLists.txt Revert "[BE] [cuDNN] Always build assuming cuDNN >= 8.0 (#95722)" 2023-11-10 17:26:36 +00:00
custom_class_detail.h
custom_class.h
extension.h
functional.py Improve torch.unique docs (#113424) 2023-11-10 16:36:30 +00:00
hub.py
library.h [fbgemm_gpu] add pt2_compliant tag to some ops (#113201) 2023-11-10 00:32:30 +00:00
library.py Update impl_abstract_pystub to be less boilerplatey (#113182) 2023-11-08 00:39:00 +00:00
overrides.py Add support for torch.Generator type in TorchScript (#110413) 2023-11-13 23:18:14 +00:00
py.typed
quasirandom.py
random.py
README.txt
return_types.py
script.h
serialization.py added 'weights_only' param in torch.load examples (#112860) 2023-11-06 21:17:36 +00:00
storage.py Fix pydocstyle errors listed in issue 112589 (#113227) 2023-11-13 22:05:45 +00:00
torch_version.py
types.py Unify torch.SymInt and torch.types.SymInt (#110573) 2023-10-24 16:17:23 +00:00
version.py.tpl

README.txt:

Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TH/THC provide some .hpp headers, which are proper C++ headers rather than
C headers.  These headers do double duty: they are installed alongside the
public headers, but they are *internal implementation detail* headers whose
contents should largely not be used by external clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.