pytorch/torch
eellison 4c57aec5b9 Dont exclude constant_pad_nd in prologue fusion (#149947)
Originally, I excluded constant_pad_nd from prologue fusion to be conservative about compilation time. But benchmarking shows you do occasionally get speedups by fusing it. This also includes a fix that gives each prologue a single, contiguous dependency.

For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.

```
import torch
import torch.nn.functional as F
torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100

tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd (via F.pad) to zero-pad the invalid rows
    padded_input = F.pad(input, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```
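For context, here is a minimal sketch of how one might time the compiled call; the timing harness below is not part of the PR and is only an illustration using `torch.utils.benchmark`:

```
# Minimal timing sketch (an assumption, not from the PR): measures the
# steady-state latency of the compiled masked_linear defined above.
from torch.utils.benchmark import Timer

# Warm up so compilation/autotuning is excluded from the measurement.
for _ in range(3):
    masked_linear(tensor1, tensor2, n_pad_rows)

t = Timer(
    stmt="masked_linear(tensor1, tensor2, n_pad_rows)",
    globals={
        "masked_linear": masked_linear,
        "tensor1": tensor1,
        "tensor2": tensor2,
        "n_pad_rows": n_pad_rows,
    },
)
# blocked_autorange handles CUDA synchronization and chooses the number
# of iterations automatically.
print(t.blocked_autorange())
```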

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg
2025-03-27 22:26:30 +00:00
_awaits
_C [StaticCudaLauncher] Support sharedMemBytes > 48KB (#149657) 2025-03-27 17:00:18 +00:00
_C_flatbuffer
_custom_op
_decomp Remove aten.elu core ATen decomp because it is now core ATen (#149780) 2025-03-25 01:59:57 +00:00
_dispatch [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549) 2025-02-28 03:03:53 +00:00
_dynamo [ca] introduce RuntimeState to support c++ hooks via graph breaks (#149987) 2025-03-27 05:05:34 +00:00
_export fix range constraints for expr (#150103) 2025-03-27 22:11:39 +00:00
_functorch [AOTAutogradCache] Allow Custom Autograd functions behind a flag (#149751) 2025-03-24 21:12:11 +00:00
_higher_order_ops Revert "[triton] Warp specialization support in torchinductor (#148503)" 2025-03-27 16:06:42 +00:00
_inductor Dont exclude constant_pad_nd in prologue fusion (#149947) 2025-03-27 22:26:30 +00:00
_lazy
_library [graph partition] support splitting on custom ops (#149782) 2025-03-27 16:23:07 +00:00
_logging [export] Beef up guard_added logs (#149465) 2025-03-20 23:02:07 +00:00
_numpy
_prims Support torch.compile rng selective activation checkpointing with cudagraph (#146878) 2025-02-28 00:47:03 +00:00
_prims_common PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
_refs [export] fix stft decomp and making it consistent with cpp impl. (#149232) 2025-03-19 18:40:35 +00:00
_strobelight Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549) 2025-02-22 03:44:53 +00:00
_subclasses added fake tensor support for foreach_copy (#149127) 2025-03-27 09:26:23 +00:00
_vendor
accelerator Move get accelerator to use build time flags when possible (#146098) 2025-03-10 13:17:58 +00:00
amp [MAIA] [Autocast] Enable autocast on MAIA device (#148511) 2025-03-18 03:46:22 +00:00
ao [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
autograd Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)" 2025-03-14 23:13:34 +00:00
backends PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
compiler [ez] include config as part of __all__ in torch.compiler (#148978) 2025-03-11 21:58:38 +00:00
contrib
cpu [CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935) 2025-02-20 18:50:55 +00:00
csrc [StaticCudaLauncher] Support sharedMemBytes > 48KB (#149657) 2025-03-27 17:00:18 +00:00
cuda [ROCm][TunableOp] Fix offline tuning for ScaledGEMM. (#149677) 2025-03-22 02:22:13 +00:00
distributed Implement aten.select.int sharding strategy (#149842) 2025-03-27 20:49:00 +00:00
distributions [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
export [export] Save unflattened gm (#150030) 2025-03-27 02:01:51 +00:00
fft
func Add torch.func.debug_unwrap (#146528) 2025-02-06 18:48:09 +00:00
futures PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
fx Revert "Use source hashing to generate consistent symbolic ids (#149665)" 2025-03-27 16:02:27 +00:00
jit scriptfunction: Make sure we have valid __name__ and __qualname__ (#147906) 2025-02-28 23:25:47 +00:00
legacy
lib [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/manager.cpp +1 (#148393) 2025-03-04 04:20:04 +00:00
linalg Implement gradient for the residuals of torch.linalg.lstsq (#148526) 2025-03-10 12:35:09 +00:00
masked
monitor add WaitCounter type interface and get rid of type errors (#146175) 2025-02-01 23:24:52 +00:00
mps [MPS] Make torch.mps.compile_shader public (#148972) 2025-03-11 20:20:58 +00:00
mtia [MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340) 2025-03-18 15:15:12 +00:00
multiprocessing
nested [aotd] Guess tangents stride as output strides (#144579) 2025-03-20 15:41:36 +00:00
nn Fix broken LazyLinear init (#149693) 2025-03-25 23:49:49 +00:00
onnx [ONNX] Annotate None inputs in symbolic ops (#150038) 2025-03-27 00:01:09 +00:00
optim Convert Tensor lr to 0-dim as needed for the optimizer to normally work (#145674) 2025-03-17 23:07:05 +00:00
package Remove code for Python < 3.9 (#147097) 2025-02-14 03:22:49 +00:00
profiler [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190) 2025-03-06 20:37:06 +00:00
quantization
signal
sparse Fix spelling (#149277) 2025-03-20 01:02:32 +00:00
special
testing Implement aten.select.int sharding strategy (#149842) 2025-03-27 20:49:00 +00:00
utils [Build] Remove pre-CXX11 ABI logic from build script (#149888) 2025-03-25 03:17:16 +00:00
xpu xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431) 2025-02-24 01:35:54 +00:00
__config__.py
__future__.py
__init__.py Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808) 2025-03-25 23:03:47 +00:00
_appdirs.py
_classes.py
_compile.py
_custom_ops.py
_deploy.py
_environment.py
_guards.py [dynamic shapes] add backed_size_oblivious option (#148696) 2025-03-11 21:52:34 +00:00
_jit_internal.py [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546) 2025-02-27 20:46:16 +00:00
_linalg_utils.py
_lobpcg.py [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546) 2025-02-27 20:46:16 +00:00
_lowrank.py
_meta_registrations.py meta registration for torch._scaled_mm with mxfp8 (#148461) 2025-03-27 02:32:40 +00:00
_namedtensor_internals.py
_ops.py [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546) 2025-02-27 20:46:16 +00:00
_python_dispatcher.py
_size_docs.py
_sources.py
_storage_docs.py
_streambase.py
_tensor_docs.py Add link to non_blocking/pinmem tutorial in Tensor.to docstrings (#145651) 2025-02-18 20:38:01 +00:00
_tensor_str.py add torch.float4_e2m1fn_x2 to PyTorch (#148791) 2025-03-27 17:32:20 +00:00
_tensor.py Revert "Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)" 2025-02-18 19:01:27 +00:00
_thread_safe_fork.py
_torch_docs.py Optimize torch.equal description (#149618) 2025-03-21 03:44:49 +00:00
_utils_internal.py [ROCm] OCP FP8 Support for new GPUs (#146632) 2025-02-24 22:47:52 +00:00
_utils.py Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786) 2025-03-06 12:04:32 +00:00
_VF.py
_vmap_internals.py
_weights_only_unpickler.py Add sparse tensors constructed via legacy constructor to _sparse_tensors_to_validate (#147759) 2025-02-25 23:51:12 +00:00
CMakeLists.txt Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748) 2025-02-11 15:49:01 +00:00
custom_class_detail.h
custom_class.h Remove unneeded Clang-tidy suppression (#148246) 2025-03-01 16:51:54 +00:00
extension.h
functional.py Fix invalid nested int guarding in broadcast_shapes() (#145957) 2025-03-11 00:53:13 +00:00
hub.py [BE][CI][Easy] bump ruff to 0.9.0: long statements in docstrings (#146509) 2025-02-24 19:56:08 +00:00
library.h Remove trivial dispatch_key_allowlist_check function (#146169) 2025-01-31 19:59:40 +00:00
library.py [Docs] Make torch.Library's kind have no default value to be consistent with the code (#149390) 2025-03-21 04:42:10 +00:00
overrides.py Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00
py.typed
quasirandom.py
random.py
README.md Rename README.txt to README.md (#149811) 2025-03-24 22:33:33 +00:00
return_types.py
script.h
serialization.py Move get accelerator to use build time flags when possible (#146098) 2025-03-10 13:17:58 +00:00
storage.py add torch.float4_e2m1fn_x2 to PyTorch (#148791) 2025-03-27 17:32:20 +00:00
torch_version.py [BE]: Enable ruff SLOT checks (#146276) 2025-02-04 19:18:23 +00:00
types.py
version.py.tpl

Note [TH abstraction violation]

TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers. Although these headers are installed alongside the public ones,
they serve double duty as *internal implementation detail* headers, whose
contents should largely not be used by external clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.