As in the title.
This PR is a follow-up to https://github.com/pytorch/pytorch/pull/112737, addressing the bfloat16 and float32 dtype cases. The performance increase is as follows (measured on an `NVIDIA A100-SXM4-80GB`; a usage sketch of the affected ops is given after the list):
- `bsr_scatter_mm` and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 29%/75%.
  - for blocksize 32x32, the average/maximum speedup is about 23%/58%.
  - for blocksize 64x64, the average/maximum speedup is about 27%/66%.
  - for blocksize 128x128, the average/maximum speedup is about 33%/72%.
- `bsr_dense_mm` and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 47%/61%.
  - for blocksize 32x32, the average/maximum speedup is about 29%/43%.
  - for blocksize 64x64, the average/maximum speedup is about 21%/41%.
  - for blocksize 128x128, the average/maximum speedup is about 12%/29%.
- `bsr_dense_mm` and float32
  - for blocksize 16x16, the average/maximum speedup is about 35%/49%.
  - for blocksize 32x32, the average/maximum speedup is about 2%/5%.
  - for blocksize 64x64, the average/maximum speedup is about 2%/21%.
  - for blocksize 128x128, the average/maximum speedup is about 79%/84%.
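For context, here is a minimal sketch of how the affected kernels can be exercised. `bsr_dense_mm` lives in the internal `torch.sparse._triton_ops` module, so the import path may change between releases; the shapes, dtype, and blocksize below are only illustrative:

```python
import torch
from torch.sparse._triton_ops import bsr_dense_mm  # internal module; path may change

# Convert a dense matrix to BSR format with one of the benchmarked blocksizes.
dense = torch.randn(256, 256, dtype=torch.bfloat16, device="cuda")
bsr = dense.to_sparse_bsr((32, 32))

# Right-hand dense operand; bsr_dense_mm computes bsr @ x through the
# Triton-backed path that this PR speeds up.
x = torch.randn(256, 128, dtype=torch.bfloat16, device="cuda")
y = bsr_dense_mm(bsr, x)
```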
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113553
Approved by: https://github.com/cpuhrsch
Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TH/THC provide some `.hpp` headers, which are proper C++ headers rather than
C headers. These headers do double duty: they are installed alongside the
public headers, but their contents are *internal implementation details*
that should largely not be used by external clients.
Ideally, we would not install these headers at all; instead, external code
should manipulate these structs only through the public functions (in
headers like `THTensor.h`, NOT `THTensor.hpp`). However, there are a few
places in torch/csrc where we violate this abstraction. Each such site is
marked with a pointer to this note and will have to be refactored when we
refactor the guts of THTensor and related structures.