pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
arkadip-maitra	a0d2d84846	Handling overflow for long int overflow for the product of kernel_hei… (#155989 ) …ght and kernel_width that overflows to be exactly 0 Fixes [#155981](https://github.com/pytorch/pytorch/issues/155981) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155989 Approved by: https://github.com/malfet	2025-09-19 18:15:01 +00:00
PyTorch MergeBot	468c1f9e9d	Revert "[nn] Assert parsed iterable arguments are an appropriate length (#162340 )" This reverts commit `b5e6e58050`. Reverted https://github.com/pytorch/pytorch/pull/162340 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break an MPS tests on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/162340#issuecomment-3282676242))	2025-09-11 21:22:57 +00:00
Jeff Daily	d65ffdef3d	[ROCm] fix miopen batchnorm changing output format (#162112 ) It was found that the integration of miopen batchnorm was causing the output to always be in default contig memory format even when the input was channels last. This also unskips a number of related unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162112 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com> Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2025-09-11 19:37:48 +00:00
Benjamin Glass	b5e6e58050	[nn] Assert parsed iterable arguments are an appropriate length (#162340 ) Fixes #162327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162340 Approved by: https://github.com/Skylion007	2025-09-10 15:15:49 +00:00
mansiag05	5927a70934	NLLLoss: validate target is 0D when input is 1D (#161412 ) Add a shape check in nll_loss_forward to error out when both input and target are 1D. Added a unit test to cover the incompatible 1D/1D case. Fixes #157420 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161412 Approved by: https://github.com/ngimel	2025-09-06 20:58:42 +00:00
Yu, Guangye	3a20a20e70	Fix largeTensorTest malfunction on XPU (#161988 ) # Motivation https://github.com/pytorch/pytorch/pull/143553/files#diff-6492991193449e118ff0c8d42ca544cc38a73604e505ff246a3c711aeab91748R1345 makes `largeTensorTest` malfunction on XPU. This PR aims to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161988 Approved by: https://github.com/EikanWang, https://github.com/albanD	2025-09-04 16:10:03 +00:00
Jeff Daily	99f356fa58	[ROCm] revamp miopen integration (#161687 ) Update sources under ATen/miopen and ATen/native/miopen to align with best practices. Avoid reshape_ calls inside backward operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161687 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-03 22:28:09 +00:00
Rohit Singh Rathaur	fca2601c9d	Improve error message for unsupported padding config (#160866 ) Fixes #160053 The previous error message `Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now` was not clear now we have ``` python3 Python 3.13.5 \| packaged by conda-forge \| (main, Jun 16 2025, 08:27:50) [GCC 13.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch ... import torch.nn.functional as F ... a = torch.empty(2,2,2,2) ... F.pad(a, (1,1), mode="circular") ... Traceback (most recent call last): File "<python-input-0>", line 4, in <module> F.pad(a, (1,1), mode="circular") ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/rrathaur/Desktop/pytorch/torch/nn/functional.py", line 5294, in pad return torch._C._nn.pad(input, pad, mode, value) ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ NotImplementedError: Padding size 2 is not supported for 4D input tensor. Supported combinations for non-constant padding: - 2D or 3D input: padding size = 2 (pads last dimension) - 3D or 4D input: padding size = 4 (pads last 2 dimensions) - 4D or 5D input: padding size = 6 (pads last 3 dimensions) >>> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160866 Approved by: https://github.com/mikaylagawarecki	2025-09-02 07:15:59 +00:00
Kurt Mohler	6382302990	[MPS] Add `grid_sampler_3d` for MPS (#160541 ) This PR adds support for `grid_sampler_3d` for MPS with "bilinear" interpolation. NOTE: "nearest" interpolation is not yet supported Fixes #159882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160541 Approved by: https://github.com/malfet	2025-08-15 16:19:25 +00:00
Nikita Shulga	e06b110f73	[Testing] Add MPS to NATIVE_DEVICES (#153835 ) This would allow me to enable more opinfo tests against MPS device eventually and supposed to be a very simple test, but actually required minor adjustments to lots of test files, namely: - Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64` - Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())` - Skip `test_from_dlpack_noncontinguous` as it currently crashes (need to be fixed) - Add lots of `expectedFailureIfMPS` - Delete all `@onlyNativeDeviceTypesAnd("mps")` <sarcasm> I love how well documented this variable are </sarcasm> Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835 Approved by: https://github.com/Skylion007	2025-08-05 18:57:35 +00:00
PyTorch MergeBot	356ac3103a	Revert "Stop parsing command line arguments every time common_utils is imported. (#156703 )" This reverts commit `310f901a71`. Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3152337518))	2025-08-04 20:37:39 +00:00
Anthony Barbier	310f901a71	Stop parsing command line arguments every time common_utils is imported. (#156703 ) Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs: https://github.com/pytorch/pytorch/pull/154612 https://github.com/pytorch/pytorch/pull/154628 https://github.com/pytorch/pytorch/pull/154715 https://github.com/pytorch/pytorch/pull/154716 https://github.com/pytorch/pytorch/pull/154725 https://github.com/pytorch/pytorch/pull/154728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703 Approved by: https://github.com/clee2000	2025-08-02 16:38:54 +00:00
linmin	2a286cbdf4	Allow register_buffer with Tensor-like object (#159455 ) As torch allows extending the tensor with `__torch_function__`, it would be desirable to allow registering it as a buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159455 Approved by: https://github.com/mikaylagawarecki	2025-08-01 15:31:38 +00:00
AaronWang04	dc286aef61	Fused RMSNorm Housekeeping (#159317 ) Small PR to address comments that were made from the original fused rmsnorm PR that were not landed Changes: - Warning message when input.dtype doesn't match weight.dtype - Ensure default epsilon value is correct Comments: https://github.com/pytorch/pytorch/pull/153666#discussion_r2114735005 https://github.com/pytorch/pytorch/pull/153666#discussion_r2223518064 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159317 Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy	2025-07-29 22:39:18 +00:00
eqy	8573a2beda	[CUDA] Fix missing `__syncthreads` in MultiMarginLoss backward (#158994 ) Turns out issue in #158921 is detectable with a simple unit test and adding the missing sync fixes it Pull Request resolved: https://github.com/pytorch/pytorch/pull/158994 Approved by: https://github.com/malfet, https://github.com/Skylion007 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-07-24 20:47:29 +00:00
Jiang, Yanbing	f4d8bc46c7	Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520 ) ### Description This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`. ``` torch.backends.mkldnn.matmul.fp32_precision = 'tf32' torch.backends.mkldnn.conv.fp32_precision = "tf32" ``` Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520 Approved by: https://github.com/mingfeima, https://github.com/jansel	2025-07-17 08:57:34 +00:00
Xuehai Pan	fc0376e8b1	[BE][2/6] fix typos in test/ (test/test_*.py) (#157636 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636 Approved by: https://github.com/yewentao256, https://github.com/mlazos ghstack dependencies: #156311, #156609	2025-07-09 11:02:23 +00:00
zeshengzong	f41d017aa6	Add device check in `mse_loss` (#155089 ) Fixes #154978 ## Test Result ```python >>> import torch >>> import numpy as np >>> import torch.nn as nn >>> import torch.distributions.normal as norm >>> device = torch.device(('cuda' if torch.cuda.is_available() else 'cpu')) >>> print('Using {}'.format(device)) Using cuda >>> m = nn.Sequential(nn.Linear(1, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh()) >>> m.to(device, dtype=None, non_blocking=False) Sequential( (0): Linear(in_features=1, out_features=128, bias=True) (1): Tanh() (2): Linear(in_features=128, out_features=128, bias=True) (3): Tanh() (4): Linear(in_features=128, out_features=128, bias=True) (5): Tanh() ) >>> opt = torch.optim.Adam(m.parameters(), lr=0.001) >>> print('Number of trainable parameters: ', sum((p.numel() for p in m.parameters() if p.requires_grad))) Number of trainable parameters: 33280 >>> input_tensor = torch.tensor(77.0, device=device) >>> target = torch.tensor(66.0) >>> loss_function = nn.MSELoss() >>> print('Loss Function: ', loss_function) Loss Function: MSELoss() >>> loss = loss_function(input_tensor, target) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl return self._call_impl(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl return forward_call(args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 610, in forward return F.mse_loss(input, target, reduction=self.reduction) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zong/code/pytorch/torch/nn/functional.py", line 3903, in mse_loss return torch._C._nn.mse_loss( ^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155089 Approved by: https://github.com/cyyever, https://github.com/albanD	2025-07-04 12:37:48 +00:00
Ahmad Sharif	36dd598bda	layernorm tests: Tweak test thresholds for comparing tensors (#156699 ) After I landed this PR: https://github.com/pytorch/pytorch/pull/156600, this test was failing internally on large tensors because the differences were greater than tolerances on some cuda devices. We now raise the tolerances for larger tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156699 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-07-02 19:33:38 +00:00
Dmitry Nikolaev	fe1f1a38df	add test_batchnorn_2D and 3D tests (#156498 ) New set of batchnorm tests to verify NCHW 2D/3D BatchNorm This test also allows to add and configure different BatchNorm tests (dtypes, NCHW/NHWC, Mixed) in the future based on: - Train [test_batchnorm_cudnn_nhwc](`1051b93192/test/test_nn.py (L4985)`) - Inference [test_batchnorm_nhwc_cuda](`1051b93192/test/test_nn.py (L5130)`) ``` test_batchnorm_3D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_float32) ... ok (0.113s) test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.057s) test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.063s) test_batchnorm_3D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_float32) ... ok (0.059s) test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.006s) test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16) ... ok (0.006s) test_batchnorm_3D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_float32) ... ok (0.007s) test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.005s) test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.005s) test_batchnorm_3D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_float32) ... ok (0.003s) test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s) test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16) ... skip: 3D float16 NCHW train failed on ROCm<7.0 (0.001s) test_batchnorm_2D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_float32) ... ok (0.016s) test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.003s) test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.003s) test_batchnorm_2D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_float32) ... ok (0.054s) test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.002s) test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16) ... ok (0.001s) test_batchnorm_2D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_float32) ... ok (0.007s) test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.004s) test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.004s) test_batchnorm_2D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_float32) ... ok (0.003s) test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s) test_batchnorm_2D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_float16) ... ok (0.002s) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156498 Approved by: https://github.com/jeffdaily	2025-06-25 20:38:02 +00:00
Ahmad Sharif	899d3d3e9e	Don't call `sum()` on a tensor that is not summable in layer_norm (#156600 ) Don't call `sum()` on a tensor that is default constructed. Previously we could call `sum()` on a tensor that was default-contructed. That would lead to an error like this: ``` Traceback (most recent call last): File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper method(args, kwargs) File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps ln_out_cuda.backward(grad_output_cuda) File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward torch.autograd.backward( File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward _engine_run_backward( File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: tensor does not have a device Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const, char const, unsigned int, char const) from ??:0 #7 at::TensorBase::options() const from :0 #8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0 #9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const, float const, float const, float const, long, long, at::Tensor, at::Tensor, CUstream_st) from ??:0 #14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor, at::Tensor, at::Tensor) from ??:0 #15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor, at::Tensor, at::Tensor) from ??:0 #16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0 #17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0 ``` Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-06-24 05:00:42 +00:00
Nikita Shulga	c28e74e457	[MPS] Add nearest_3d forward and backward (#156090 ) Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS Delete `upsample_nearest3d` MPS fallback and replace it with proper shader Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090 Approved by: https://github.com/kulinseth, https://github.com/dcci ghstack dependencies: #156256	2025-06-18 04:48:15 +00:00
atalman	c199a4d0fd	Move non inductor workflows cuda 12.6->cuda 12.8 (#155234 ) Move non inductor workflows cuda 12.6->cuda 12.8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234 Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet	2025-06-12 12:42:34 +00:00
PyTorch MergeBot	40fefe2871	Revert "[BE] Update cudnn to 9.10.1.4 (#155122 )" This reverts commit `73220d52fd`. Reverted https://github.com/pytorch/pytorch/pull/155122 on behalf of https://github.com/atalman due to wrong pr description ([comment](https://github.com/pytorch/pytorch/pull/155122#issuecomment-2960592004))	2025-06-10 21:13:18 +00:00
Eddie Yan	35e8f2593c	[CUDA] Fix missing bounds check in `Softmax.cu` (#154778 ) Uncovered by @ngimel, same as change in #144009 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154778 Approved by: https://github.com/ngimel, https://github.com/cyyever, https://github.com/malfet	2025-06-10 20:03:54 +00:00
Aaron Gokaslan	73220d52fd	[BE] Update cudnn to 9.10.1.4 (#155122 ) Follow up to #152782 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155122 Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/eqy	2025-06-10 16:59:00 +00:00
zeshengzong	f12d8d60b1	Add hint message when parameters is empty in clip_grad_norm_ (#151529 ) Fixes #148259 ## Changes - Add print warning message when `parameters` generator exhausted ## Test Result ### print warning ```python import torch import torch.nn as nn import torch.optim as optim class SimpleModel(nn.Module): def __init__(self): super(SimpleModel, self).__init__() self.fc = nn.Linear(10, 1) def forward(self, x): return self.fc(x) model = SimpleModel() criterion = nn.MSELoss() optimizer = optim.SGD(model.parameters(), lr=0.01) inputs = torch.randn(16, 10) targets = torch.randn(16, 1) outputs = model(inputs) loss = criterion(outputs, targets) optimizer.zero_grad() loss.backward() params_to_clip = model.parameters() for p in params_to_clip: print(p.shape) max_norm = 1.0 norm_type = 2.0 total_norm = nn.utils.clip_grad_norm_(params_to_clip, max_norm, norm_type) print(f"total_norm: {total_norm}") ``` ```bash /home/zong/code/pytorch/torch/nn/utils/clip_grad.py:222: UserWarning: `parameters` is an empty generator, no gradient clipping will occur. warnings.warn( total_norm: 0.0 ``` ### UT ```bash pytest test/test_nn.py -k test_clip_grad_norm ``` ![image](https://github.com/user-attachments/assets/0aa0f06c-e0a5-43cf-9a97-d7c2747c9180) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151529 Approved by: https://github.com/jbschlosser	2025-05-22 11:23:39 +00:00
Eddie Yan	cecfc7dc53	[CUDA][cuDNN] Fix handling of `CPU` side input and target length tensors in `CTCLoss` (#152745 ) https://github.com/pytorch/pytorch/pull/128271 migrated to cuDNN V8 CTCLoss which expects input and target length tensors to be on `CUDA` rather than `CPU` without adding the logic to account for the edge case of them being on `CPU` see also #152421 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152745 Approved by: https://github.com/Skylion007	2025-05-07 22:01:18 +00:00
iupaikov-amd	730a077d48	[ROCm] Unskipped test_rnn_dropout_state for ROCm (#152339 ) Unskipping the test, should work fine now. Related PR: https://github.com/pytorch/pytorch/pull/144572 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152339 Approved by: https://github.com/jeffdaily	2025-05-02 22:02:30 +00:00
abcarlisle	a1a4fee3b8	Native channel shuffle floating point exception (#144010 ) Fixes #142453 Added TORCH_CHECKS to prevent the user from using the native_channel_shuffle function incorrectly and getting a "Floating point exception (core dumped)" Pull Request resolved: https://github.com/pytorch/pytorch/pull/144010 Approved by: https://github.com/albanD	2025-04-29 23:38:54 +00:00
Jagadish Krishnamoorthy	0d99b4e9e2	ROCm: Enable tf32 testing on test_nn (#148945 ) Add tf32 support for ROCm tests. test command: python test/test_nn.py -v Pull Request resolved: https://github.com/pytorch/pytorch/pull/148945 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-04-28 23:01:04 +00:00
eqy	34b0de50a3	[TF32][CUDA] account for TF32 in `test_linear_autograd` (#152216 ) Abate some more noise seen on blackwell Pull Request resolved: https://github.com/pytorch/pytorch/pull/152216 Approved by: https://github.com/Skylion007	2025-04-28 21:00:17 +00:00
Anthony Shoumikhin	e2f9759bd0	Fix broken URLs (#152237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237 Approved by: https://github.com/huydhn, https://github.com/malfet	2025-04-27 09:56:42 +00:00
eqy	6efc572221	[CUDA][CPU] Bump system memory requirement for `test_cross_entropy_large_tensor` (#151812 ) `/usr/bin/time` seems to show max resident pages at 119GiB Pull Request resolved: https://github.com/pytorch/pytorch/pull/151812 Approved by: https://github.com/colesbury	2025-04-24 19:25:29 +00:00
zeshengzong	01f226bfb8	Add check for ctc_loss targets param (#150981 ) Fixes #150835 ## Test Result ```python # cuda >>> import torch >>> import torch.nn.functional as F >>> device = "cuda" # "cpu" is fine >>> num_classes = 4 >>> log_probs = torch.rand(0, 0, num_classes, device=device) >>> targets = torch.tensor([], device=device, dtype=torch.long) >>> input_lengths = torch.tensor([], device=device, dtype=torch.long) >>> target_lengths = torch.tensor([], device=device, dtype=torch.long) >>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss return torch.ctc_loss( ^^^^^^^^^^^^^^^ RuntimeError: log_probs tensor must not be empty # cpu >>> device = "cpu" >>> num_classes = 4 >>> log_probs = torch.rand(0, 0, num_classes, device=device) >>> targets = torch.tensor([], device=device, dtype=torch.long) >>> input_lengths = torch.tensor([], device=device, dtype=torch.long) >>> target_lengths = torch.tensor([], device=device, dtype=torch.long) >>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss return torch.ctc_loss( ^^^^^^^^^^^^^^^ RuntimeError: log_probs tensor must not be empty ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150981 Approved by: https://github.com/eqy	2025-04-14 07:24:30 +00:00
Ahmad Sharif	73b4938f7c	[cuda] Add new faster gammabeta backward kernel (#148605 ) (Reapply with launch bounds) (#150625 ) # Changes over the previous PR This reverts commit `61a1f09` and adds `__launch_bounds__` to the kernel. Previously I merged `114d404` that did not work on Blackwell because it consumed too many registers. It got reverted in `61a1f09`. For more context see: https://github.com/pytorch/pytorch/issues/150266. This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added: ``` git diff HEAD^ diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu index 0d63a2f979c..3ce2c24c18e 100644 --- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu @@ -657,6 +657,7 @@ bool aligned_grid > __global__ void +__launch_bounds__(block_dim_x * block_dim_y) GammaBetaBackwardCUDAKernelTemplate( int64_t M, int64_t N, ``` I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg <details> <summary> Repro script that fails on Blackwell </summary> ``` import torch from torch.nn import init # from transformer_nuggets import init_logging # from transformer_nuggets.utils.benchmark import profiler # from pathlib import Path # init_logging() class PermuteModule(torch.nn.Module): def __init__(self, permutation): super(PermuteModule, self).__init__() self.permutation = permutation def forward(self, x:torch.Tensor) -> torch.Tensor: assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!" return x.permute(self.permutation) def test(n_layers:int, conv_stride:int): _sequence = [] for _ in range(n_layers): # Conv1d inputs are (N x C x L), LayerNorm expects ( x C). Dims must be permuted between modules. _sequence += [ PermuteModule((0,2,1)), torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False), PermuteModule((0,2,1)), torch.nn.LayerNorm(512), torch.nn.ReLU() ] model = torch.nn.Sequential(_sequence).to(device="cuda") data = torch.randn((100,2048,512), device="cuda") out = model(data) loss = torch.nn.functional.mse_loss(out, torch.rand_like(out)) loss.backward() torch.autograd.set_detect_anomaly(True) print(f"Torch version: {torch.__version__}") # with profiler(Path("conv")): # # print(f"layers=1, stride=1") # # test(n_layers=1, conv_stride=1) # # print(f"layers=2, stride=1") # # test(n_layers=2, conv_stride=1) # # print(f"layers=1, stride=2") # # test(n_layers=1, conv_stride=2) # print(f"layers=2, stride=2") # test(n_layers=2, conv_stride=2) print(f"layers=2, stride=2") test(n_layers=2, conv_stride=2) # we will not reach this print statement. print("DONE.") ``` </details> I also re-ran my performance benchmark and found no regressions over the previous PR. # Full description of the old PR Original PR: https://github.com/pytorch/pytorch/pull/148605 This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way. To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables: 1. dtype in {half, float} 2. M in `2k, 2k - 1, 2k + 1 for k in range(...)` 3. N in `2k, 2k - 1, 2k + 1 for k in range(...)` 4. Whether we flush the L2 cache before running the backward pass Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster). In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the backward pass* being 1.42x faster than the old backward pass. Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old: M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.35 Elapsed Cycles cycle 27,526 Memory Throughput % 2.21 DRAM Throughput % 0.54 Duration us 20.42 L1/TEX Cache Throughput % 4.31 L2 Cache Throughput % 2.62 SM Active Cycles cycle 1,475.02 Compute (SM) Throughput % 0.29 ----------------------- ----------- ------------ ``` M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.34 Elapsed Cycles cycle 10,920 Memory Throughput % 5.64 DRAM Throughput % 1.35 Duration us 8.13 L1/TEX Cache Throughput % 1.92 L2 Cache Throughput % 6.89 SM Active Cycles cycle 3,554.41 Compute (SM) Throughput % 0.67 ----------------------- ----------- ------------ ``` Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following: <img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" /> There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made: ![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738) For dtype=float32, we get a similar chart: <img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" /> The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension). The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough. I am including the regressions here for completeness' sake: <img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" /> To see this better: 1. Click the image 2. Right click the expanded image and open in a new tab 3. Go to that tab and left click once to zoom in If you want to see the full data, here it is: ![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1) I also measured binary size and compile time since those are important for developers: Binary size comparison ![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51) ``` # Original -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so # This PR -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so ``` The diff in bytes is 302kB which is about a 0.1% increase. Compile time difference: ``` # Original real 0m10.931s user 0m9.676s sys 0m1.004s # this PR real 0m16.720s user 0m15.514s sys 0m1.066s # Command I ran time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o ``` So the new PR is 6 seconds longer compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625 Approved by: https://github.com/ngimel, https://github.com/atalman	2025-04-08 02:39:41 +00:00
CaoE	d7f3cd0ac3	Add Half support for weight_norm on CPU (#148878 ) Fixes #148867. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148878 Approved by: https://github.com/leslie-fang-intel, https://github.com/cyyever, https://github.com/albanD	2025-04-08 01:12:29 +00:00
PyTorch MergeBot	f443035f10	Revert "[cuda] Add new faster gammabeta backward kernel (#148605 ) (Reapply with launch bounds) (#150625 )" This reverts commit `c6defa9443`. Reverted https://github.com/pytorch/pytorch/pull/150625 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/150625#issuecomment-2779183414))	2025-04-04 16:05:18 +00:00
Ahmad Sharif	c6defa9443	[cuda] Add new faster gammabeta backward kernel (#148605 ) (Reapply with launch bounds) (#150625 ) # Changes over the previous PR This reverts commit `61a1f09` and adds `__launch_bounds__` to the kernel. Previously I merged `114d404` that did not work on Blackwell because it consumed too many registers. It got reverted in `61a1f09`. For more context see: https://github.com/pytorch/pytorch/issues/150266. This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added: ``` git diff HEAD^ diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu index 0d63a2f979c..3ce2c24c18e 100644 --- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu @@ -657,6 +657,7 @@ bool aligned_grid > __global__ void +__launch_bounds__(block_dim_x * block_dim_y) GammaBetaBackwardCUDAKernelTemplate( int64_t M, int64_t N, ``` I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg <details> <summary> Repro script that fails on Blackwell </summary> ``` import torch from torch.nn import init # from transformer_nuggets import init_logging # from transformer_nuggets.utils.benchmark import profiler # from pathlib import Path # init_logging() class PermuteModule(torch.nn.Module): def __init__(self, permutation): super(PermuteModule, self).__init__() self.permutation = permutation def forward(self, x:torch.Tensor) -> torch.Tensor: assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!" return x.permute(self.permutation) def test(n_layers:int, conv_stride:int): _sequence = [] for _ in range(n_layers): # Conv1d inputs are (N x C x L), LayerNorm expects ( x C). Dims must be permuted between modules. _sequence += [ PermuteModule((0,2,1)), torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False), PermuteModule((0,2,1)), torch.nn.LayerNorm(512), torch.nn.ReLU() ] model = torch.nn.Sequential(_sequence).to(device="cuda") data = torch.randn((100,2048,512), device="cuda") out = model(data) loss = torch.nn.functional.mse_loss(out, torch.rand_like(out)) loss.backward() torch.autograd.set_detect_anomaly(True) print(f"Torch version: {torch.__version__}") # with profiler(Path("conv")): # # print(f"layers=1, stride=1") # # test(n_layers=1, conv_stride=1) # # print(f"layers=2, stride=1") # # test(n_layers=2, conv_stride=1) # # print(f"layers=1, stride=2") # # test(n_layers=1, conv_stride=2) # print(f"layers=2, stride=2") # test(n_layers=2, conv_stride=2) print(f"layers=2, stride=2") test(n_layers=2, conv_stride=2) # we will not reach this print statement. print("DONE.") ``` </details> I also re-ran my performance benchmark and found no regressions over the previous PR. # Full description of the old PR Original PR: https://github.com/pytorch/pytorch/pull/148605 This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way. To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables: 1. dtype in {half, float} 2. M in `2k, 2k - 1, 2k + 1 for k in range(...)` 3. N in `2k, 2k - 1, 2k + 1 for k in range(...)` 4. Whether we flush the L2 cache before running the backward pass Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster). In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the backward pass* being 1.42x faster than the old backward pass. Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old: M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.35 Elapsed Cycles cycle 27,526 Memory Throughput % 2.21 DRAM Throughput % 0.54 Duration us 20.42 L1/TEX Cache Throughput % 4.31 L2 Cache Throughput % 2.62 SM Active Cycles cycle 1,475.02 Compute (SM) Throughput % 0.29 ----------------------- ----------- ------------ ``` M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.34 Elapsed Cycles cycle 10,920 Memory Throughput % 5.64 DRAM Throughput % 1.35 Duration us 8.13 L1/TEX Cache Throughput % 1.92 L2 Cache Throughput % 6.89 SM Active Cycles cycle 3,554.41 Compute (SM) Throughput % 0.67 ----------------------- ----------- ------------ ``` Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following: <img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" /> There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made: ![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738) For dtype=float32, we get a similar chart: <img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" /> The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension). The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough. I am including the regressions here for completeness' sake: <img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" /> To see this better: 1. Click the image 2. Right click the expanded image and open in a new tab 3. Go to that tab and left click once to zoom in If you want to see the full data, here it is: ![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1) I also measured binary size and compile time since those are important for developers: Binary size comparison ![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51) ``` # Original -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so # This PR -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so ``` The diff in bytes is 302kB which is about a 0.1% increase. Compile time difference: ``` # Original real 0m10.931s user 0m9.676s sys 0m1.004s # this PR real 0m16.720s user 0m15.514s sys 0m1.066s # Command I ran time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o ``` So the new PR is 6 seconds longer compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625 Approved by: https://github.com/ngimel	2025-04-03 22:07:43 +00:00
PyTorch MergeBot	61a1f09b5b	Revert "[cuda] Add new faster gammabeta backward kernel (#148605 )" This reverts commit `114d404b07`. Reverted https://github.com/pytorch/pytorch/pull/148605 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/150266#issuecomment-2773907902 for more details ([comment](https://github.com/pytorch/pytorch/pull/148605#issuecomment-2773928838))	2025-04-02 23:14:11 +00:00
Ahmad Sharif	114d404b07	[cuda] Add new faster gammabeta backward kernel (#148605 ) This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way. To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables: 1. dtype in {half, float} 2. M in `2k, 2k - 1, 2k + 1 for k in range(...)` 3. N in `2k, 2k - 1, 2k + 1 for k in range(...)` 4. Whether we flush the L2 cache before running the backward pass Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster). In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the backward pass being 1.42x faster than the old backward pass. Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old: M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.35 Elapsed Cycles cycle 27,526 Memory Throughput % 2.21 DRAM Throughput % 0.54 Duration us 20.42 L1/TEX Cache Throughput % 4.31 L2 Cache Throughput % 2.62 SM Active Cycles cycle 1,475.02 Compute (SM) Throughput % 0.29 ----------------------- ----------- ------------ ``` M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.34 Elapsed Cycles cycle 10,920 Memory Throughput % 5.64 DRAM Throughput % 1.35 Duration us 8.13 L1/TEX Cache Throughput % 1.92 L2 Cache Throughput % 6.89 SM Active Cycles cycle 3,554.41 Compute (SM) Throughput % 0.67 ----------------------- ----------- ------------ ``` Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following: <img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" /> There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made: ![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738) For dtype=float32, we get a similar chart: <img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" /> The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension). The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough. I am including the regressions here for completeness' sake: <img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" /> To see this better: 1. Click the image 2. Right click the expanded image and open in a new tab 3. Go to that tab and left click once to zoom in If you want to see the full data, here it is: ![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1) I also measured binary size and compile time since those are important for developers: Binary size comparison ![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51) ``` # Original -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so # This PR -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so ``` The diff in bytes is 302kB which is about a 0.1% increase. Compile time difference: ``` # Original real 0m10.931s user 0m9.676s sys 0m1.004s # this PR real 0m16.720s user 0m15.514s sys 0m1.066s # Command I ran time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o ``` So the new PR is 6 seconds longer compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148605 Approved by: https://github.com/ngimel	2025-03-27 03:01:53 +00:00
Dmitry Nikolaev	c99efc08fb	[ROCm] skip test_RNN_dropout_state (#149446 ) PR to skip test_nn.py::TestNN::test_RNN_dropout_state Currently ROCm doesn't support dropout value for RNN PR to enable RNN dropout on ROCm still in review and blocked pytorch/pytorch#144572 Fixes: https://github.com/pytorch/pytorch/issues/68849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149446 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-03-20 17:22:39 +00:00
Nikita Shulga	2e0c98ff05	[MPS] Add `bicubic2d_aa` (#149378 ) Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287 Mostly done by refactoring `upsample_bilinear2d_aa` to accept Functor as one of the template arguments, which closely ideas from `eec43cfbc0/src/libImaging/Resample.c` as well as `bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)` Populate unit tests by copying upsample_bilinear_2d_aa and reusing it as upsample_bicubic2d_aa At that point, only difference between upsample_bilinear2d_aa and upsample_bicubic2d_aa are convolution kernel function and size: for bilinear it's 3x3, for bicubic it's 5x5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378 Approved by: https://github.com/dcci	2025-03-18 05:35:41 +00:00
zeshengzong	a7f8de2198	Add `nn.Bilinear` param validation (#149018 ) Fixes #103425 ## Changes - Add doc description size value `must be > 0` - Add validation for `in1_features` param Currently, only `in1_features` will cause runtime error, if add checks for `in2_features` and `out_features` as well, might be kind of BC breaking. ```python import torch from torch import nn class lenet(nn.Module): def __init__(self): super(lenet, self).__init__() self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1) # Error, `in1_features=1, in2_features=0, out_features=0` no error self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0) def forward(self, x): # 1st block x = self.conv(x) x = self.linear(x) return x if __name__ == '__main__': net = lenet() ``` ## Test Result ```bash pytest test/test_nn.py -k test_bilinear -vv ``` ![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43) ![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018 Approved by: https://github.com/mikaylagawarecki	2025-03-14 19:26:12 +00:00
zeshengzong	97272e4b49	Fix `torch.nn.functional.hardswish` gradients corner case (#148049 ) Fixes #147801 ## Changes - Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) - Enable cuda for test `test_hardswish_grad_corner` - Add test case for value=-3 ## Test Result ```bash pytest test/test_nn.py -k test_hardswish pytest test/test_unary_ufuncs.py -k test_hardswish pytest test/inductor/test_torchinductor.py -k test_hardswish ``` ![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d) ![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8) ![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049 Approved by: https://github.com/soulitzer	2025-03-14 18:53:10 +00:00
cyy	970fefcc53	Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940 ) Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940 Approved by: https://github.com/mikaylagawarecki	2025-03-13 18:02:50 +00:00
riccardofelluga	8f71d4563e	Fix rms_norm in fp16/bf16 (#147203 ) Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done. Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want tho is to avoid down-casting and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203 Approved by: https://github.com/eqy	2025-03-08 04:43:18 +00:00
PyTorch MergeBot	abcca2fcbb	Revert "Fix `torch.nn.functional.hardswish` gradients corner case (#148049 )" This reverts commit `29b28e9d9f`. Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))	2025-03-07 16:05:56 +00:00
zeshengzong	29b28e9d9f	Fix `torch.nn.functional.hardswish` gradients corner case (#148049 ) Fixes #147801 ## Changes - Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) - Enable cuda for test `test_hardswish_grad_corner` - Add test case for value=-3 ## Test Result ```bash pytest test/test_nn.py -k test_hardswish pytest test/test_unary_ufuncs.py -k test_hardswish pytest test/inductor/test_torchinductor.py -k test_hardswish ``` ![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d) ![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8) ![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049 Approved by: https://github.com/soulitzer	2025-03-06 19:04:52 +00:00
CaoE	8b818ab58f	Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353 ) Fixes #147303. Use float data type for Half sum in fallback implementation of batchnorm backward on CPU as the representation range of Half is small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353 Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch	2025-02-21 01:33:33 +00:00

1 2 3 4 5 ...

1583 Commits