Commit Graph

1598 Commits

Author SHA1 Message Date
Ahmad Sharif
73b4938f7c [cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)
# Changes over the previous PR

This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.

Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.

This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:

```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
 >
 __global__
 void
+__launch_bounds__(block_dim_x * block_dim_y)
  GammaBetaBackwardCUDAKernelTemplate(
     int64_t M,
     int64_t N,
```

I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg

<details>
<summary> Repro script that fails on Blackwell </summary>

```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path

# init_logging()

class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation
    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")

# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)

print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```

</details>

I also re-ran my performance benchmark and found no regressions over the previous PR.

# Full description of the old PR

Original PR: https://github.com/pytorch/pytorch/pull/148605

This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass

Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR is 6 seconds longer compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-04-08 02:39:41 +00:00
CaoE
d7f3cd0ac3 Add Half support for weight_norm on CPU (#148878)
Fixes #148867.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148878
Approved by: https://github.com/leslie-fang-intel, https://github.com/cyyever, https://github.com/albanD
2025-04-08 01:12:29 +00:00
PyTorch MergeBot
f443035f10 Revert "[cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)"
This reverts commit c6defa9443.

Reverted https://github.com/pytorch/pytorch/pull/150625 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/150625#issuecomment-2779183414))
2025-04-04 16:05:18 +00:00
Ahmad Sharif
c6defa9443 [cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)
# Changes over the previous PR

This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.

Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.

This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:

```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
 >
 __global__
 void
+__launch_bounds__(block_dim_x * block_dim_y)
  GammaBetaBackwardCUDAKernelTemplate(
     int64_t M,
     int64_t N,
```

I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg

<details>
<summary> Repro script that fails on Blackwell </summary>

```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path

# init_logging()

class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation
    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")

# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)

print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```

</details>

I also re-ran my performance benchmark and found no regressions over the previous PR.

# Full description of the old PR

Original PR: https://github.com/pytorch/pytorch/pull/148605

This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass

Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR is 6 seconds longer compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel
2025-04-03 22:07:43 +00:00
PyTorch MergeBot
61a1f09b5b Revert "[cuda] Add new faster gammabeta backward kernel (#148605)"
This reverts commit 114d404b07.

Reverted https://github.com/pytorch/pytorch/pull/148605 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/150266#issuecomment-2773907902 for more details ([comment](https://github.com/pytorch/pytorch/pull/148605#issuecomment-2773928838))
2025-04-02 23:14:11 +00:00
Ahmad Sharif
114d404b07 [cuda] Add new faster gammabeta backward kernel (#148605)
This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass

Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR is 6 seconds longer compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148605
Approved by: https://github.com/ngimel
2025-03-27 03:01:53 +00:00
Dmitry Nikolaev
c99efc08fb [ROCm] skip test_RNN_dropout_state (#149446)
PR to skip test_nn.py::TestNN::test_RNN_dropout_state
Currently ROCm doesn't support dropout value for RNN

PR to enable RNN dropout on ROCm still in review and blocked pytorch/pytorch#144572

Fixes: https://github.com/pytorch/pytorch/issues/68849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149446
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-03-20 17:22:39 +00:00
Nikita Shulga
2e0c98ff05 [MPS] Add bicubic2d_aa (#149378)
Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287

Mostly done by refactoring `upsample_bilinear2d_aa` to accept Functor as one of the template arguments, which closely ideas from eec43cfbc0/src/libImaging/Resample.c as well as
bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)

Populate unit tests by copying upsample_bilinear_2d_aa and reusing it as upsample_bicubic2d_aa

At that point, only difference between upsample_bilinear2d_aa and upsample_bicubic2d_aa are convolution kernel function and size: for bilinear it's 3x3, for bicubic it's 5x5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378
Approved by: https://github.com/dcci
2025-03-18 05:35:41 +00:00
zeshengzong
a7f8de2198 Add nn.Bilinear param validation (#149018)
Fixes #103425

## Changes

- Add doc description size value `must be > 0`
- Add validation for `in1_features` param

Currently, only `in1_features` will cause runtime error, if add checks for `in2_features` and `out_features` as well, might be kind of BC breaking.

```python
import torch
from torch import nn

class lenet(nn.Module):
    def __init__(self):
        super(lenet, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1)

        # Error, `in1_features=1, in2_features=0, out_features=0` no error
        self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0)

    def forward(self, x):
        # 1st block
        x = self.conv(x)
        x = self.linear(x)

        return x

if __name__ == '__main__':
    net = lenet()

```

## Test Result

```bash
pytest test/test_nn.py -k test_bilinear -vv
```

![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43)

![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018
Approved by: https://github.com/mikaylagawarecki
2025-03-14 19:26:12 +00:00
zeshengzong
97272e4b49 Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable cuda for test `test_hardswish_grad_corner`
- Add test case for value=-3

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-14 18:53:10 +00:00
cyy
970fefcc53 Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940)
Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940
Approved by: https://github.com/mikaylagawarecki
2025-03-13 18:02:50 +00:00
riccardofelluga
8f71d4563e Fix rms_norm in fp16/bf16 (#147203)
Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done.

Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want tho is to avoid down-casting and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203
Approved by: https://github.com/eqy
2025-03-08 04:43:18 +00:00
PyTorch MergeBot
abcca2fcbb Revert "Fix torch.nn.functional.hardswish gradients corner case (#148049)"
This reverts commit 29b28e9d9f.

Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))
2025-03-07 16:05:56 +00:00
zeshengzong
29b28e9d9f Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable cuda for test `test_hardswish_grad_corner`
- Add test case for value=-3

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-06 19:04:52 +00:00
CaoE
8b818ab58f Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353)
Fixes #147303.
Use float data type for Half sum in fallback implementation of batchnorm backward on CPU as the representation range of Half is small.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353
Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch
2025-02-21 01:33:33 +00:00
Ahmad Sharif
ae5cc19ba7 [pytorch][cuda] Improve softmax backward pass native CUDA implementation (#145866)
This PR is similar to https://github.com/pytorch/pytorch/pull/122970, but works on the softmax backward pass.

Specifically, it uses shared memory to cache the gradOutput when it can fit in shared memory. Before this PR we were reading gradOutput twice.

On my H100 this seems to improve the softmax backward pass performance by about 5% for problem sizes that fit within shared memory. (Note that this is not the only kernel that runs when you call softmax backward pass -- there is an elementwise kernel that runs before this; optimizing that can be a separate PR).

**Important Note**: Currently the softmax backward pass consists of an [element-wise multiply operator](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1216)), followed by [this function](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1062)) which calls the `cunn_SoftMaxBackward` kernel. With my change the kernel time reduces by about 12% (see screenshot below), while the total time (including the elementwise) reduces by about 5%.

```
Baseline						This PR
N	size	FP32 bandwidth	FP16 bandwidth		N	size	FP32 bandwidth	FP16 bandwidth		fp32 diff	fp16 diff
0	256	134.340966	70.042039		0	256	133.70146	70.342753		-0.48%	0.43%
1	512	233.501185	129.945803		1	512	234.057145	132.933066		0.24%	2.30%
2	1024	340.667966	229.280464		2	1024	338.833265	226.441699		-0.54%	-1.24%
3	2048	379.643726	337.452058		3	2048	399.559017	338.432284		5.25%	0.29%
4	4096	416.597537	383.625364		4	4096	428.252403	396.137506		2.80%	3.26%
5	6000	431.198241	384.384384		5	6000	457.744577	406.06275		6.16%	5.64%
6	8192	462.811252	427.292573		6	8192	474.791032	428.281563		2.59%	0.23%
7	10000	464.258731	429.050294		7	10000	483.7643	446.849381		4.20%	4.15%
8	10013	465.199701	429.824179		8	10013	464.904407	428.72184		-0.06%	-0.26%
9	10240	477.07359	428.853737		9	10240	485.317024	444.902586		1.73%	3.74%
10	11000	473.038785	430.778663		10	11000	488.161438	453.462162		3.20%	5.27%
11	12000	474.342475	432.594814		11	12000	490.532418	458.427653		3.41%	5.97%
12	16384	487.468854	473.611576		12	16384	488.154406	476.264631		0.14%	0.56%
13	20000	482.029793	465.666186		13	20000	482.147092	483.886193		0.02%	3.91%
14	24000	478.368093	474.159464		14	24000	478.364948	491.447921		0.00%	3.65%
15	32000	476.523796	473.18868		15	32000	476.523796	474.398962		0.00%	0.26%
16	32768	476.104723	477.493634		16	32768	476.704463	477.330606		0.13%	-0.03%
17	36864	477.900663	475.472787		17	36864	477.973279	475.728454		0.02%	0.05%
18	40960	477.707561	475.559064		18	40960	478.445017	476.088067		0.15%	0.11%
19	45056	479.169812	475.865134		19	45056	479.143266	475.878202		-0.01%	0.00%
20	49152	477.804907	475.382982		20	49152	477.868404	475.976377		0.01%	0.12%
21	65536	481.274125	478.171806		21	65536	481.537733	478.703926		0.05%	0.11%
22	66000	481.64652	480.095457		22	66000	481.856013	480.466388		0.04%	0.08%
23	68608	481.745774	479.034704		23	68608	481.917596	478.856209		0.04%	-0.04%
24	80000	483.409361	480.356529		24	80000	483.330481	480.375277		-0.02%	0.00%
25	98304	480.736301	481.396882		25	98304	480.789858	481.320143		0.01%	-0.02%
```

NCU profiler shows lower DRAM fetches with the new kernel:

![image](https://github.com/user-attachments/assets/f3606725-d8fc-4ea5-ae6d-9c188bf32d72)

NCU reports about 12% elapsed time reduction in this kernel alone compared to baseline (and because of other kernels that are run, the overall backward pass time as seen by the user gets reduced by 5%).

I compared the binary size increase by running `python setup.py develop` before and after and diffing the .so files:

![image](https://github.com/user-attachments/assets/8e6cee2e-3c7a-4fa4-8836-954047ce8ffc)

libtorch_cuda.so goes from 274,752,224 bytes to 274,787,072 bytes. The increase in size is 34kB which is about 0.01%.

I measured the compilation time for incremental development:

```
touch ./aten/src/ATen/native/cuda/SoftMax.cu
time python setup.py develop
real    0m10.083s
user    0m8.197s
sys     0m3.149s
```

Note that this uses `ccache` and does a bunch of copies and is not just measuring the `nvcc` time. I measured the `nvcc` time separately by capturing the `nvcc` command shown in [1] below and running it on the baseline and modified kernels:

```
# baseline nvcc time for SoftMax.cu
real    0m35.341s
user    0m33.801s
sys     0m1.289s

# this PR's nvcc time for SoftMax.cu
real    0m36.513s
user    0m34.722s
sys     0m1.408s
```

So the `nvcc` time increases by about 1 second, or ~3% of the baseline.

[1] `nvcc` command is here:
```
# This is the nvcc command
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/torch/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/torch/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145866
Approved by: https://github.com/ngimel
2025-02-12 07:54:41 +00:00
Nikita Shulga
611ca163fd [MPS] Add bilineard2d_aa implementation (#145526)
Interesting quirk of the algorithm, that is not very well documented, is that value of align_corners is ignored in antialias mode, see arguments of
e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)

Error out on  uint8 implementation(as it relies on a very fragile integer integer arithmetic), as it's not implemented on any other Accelerator devices at the moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526
Approved by: https://github.com/dcci
2025-02-10 15:03:14 +00:00
Danial Javady
bb4964013f Add determinmistic kernel for reflection2d (#136241)
Adds feature for #98925

Tests pass for both existing reflectionpad2d and the new one I inserted.

**Summary of the work:**

Simple conditional check for deterministic mode that will dispatch to a different kernel. This kernel does not use any atomic operations, and will lead to deterministic results as instead of going from the output to input(1:1) relationship, I am doing the opposite. I am going from input -> all outputs, which is 1 to many. These operations are done in the same order every execution as I simply traverse the data set with a grid stride loop and use simple linearized indexing into the input tensor.

So each thread will compute the 4 conditionals, which are then used to see if the input has an output in the 8 regions. These 8 regions are top left, top, top right, left, right, bottom left, bottom, bottom right`.

I did not focus on performance for this PR as that would expand the scope heavily. If there are any performance questions though i can answer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241
Approved by: https://github.com/eqy, https://github.com/albanD
2025-01-29 20:34:03 +00:00
Benjamin Glass
5aa5a5763e [inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684)
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:

1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.

To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
2025-01-28 22:01:08 +00:00
CaoE
a9bfc5f70c Fix boundary conditions for hardswish backward (#143899)
Fixes #136345.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143899
Approved by: https://github.com/jgong5, https://github.com/ezyang
2025-01-16 20:26:27 +00:00
Mario Vasilev
49bdc418be Add strict kwarg to nn.Module.set_submodule and fix bug for non dot delineated strings (#143455)
Before fixing set_submodule, it used to create leaf modules when the target was not a dot-delimited string. After the fix it will not create a new attribute if target is a non-dot-delimited string. If you want to create leaf nodes of `nn.Module` parent nodes, you can use `replace_or_create_new_leaf_module`.

Fixes https://github.com/pytorch/pytorch/issues/143441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143455
Approved by: https://github.com/mikaylagawarecki
2025-01-16 05:06:33 +00:00
eqy
63569d9745 [CUDA][TF32] Add some missing TF32 decorators to test_nn.py (#144592)
Original authored by @bilal2vec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144592
Approved by: https://github.com/Skylion007
2025-01-11 16:20:59 +00:00
hongxyan
aa69d73e6b [ROCm] fix torch.layer_norm invalid configuration problem when input is large tensor (#144007)
Fixes #136291

This PR is to fix the `invalid configuration argument` problem happened on ROCm when input is a large tensor when calling `torch.layer_norm`.

```
 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm
    return torch.layer_norm
RuntimeError: HIP error: invalid configuration argument
```

After investigation, I found that the reason why this error happened is: The amd compute language runtime checks whether  `gridDim.x * blockDim.x` is greater than `std::numeric_limits<uint32_t>::max()` or not. If yes, it will error out with the "invalid configuration argument" message.

The fix is to split the whole task to several chunks so that each chunk will not trigger the failure condition. This will ensure the correctness and completeness given the current kernel implementation logic of `vectorized_layer_norm_kernel`.

Also added a largeTensor layer_norm unit test `test_layer_norm_large_tensor` with the same shape `[16, 3000, 3000, 16]` as the one used by the pytorch issue #136291 so that the unit test can check the expected output value to ensure correctness.

The future work may include performance optimization of layer_norm and CK layer_norm integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144007
Approved by: https://github.com/eqy
2025-01-07 19:17:02 +00:00
eqy
7e3cd0e488 [CUDA] Check size calculation in ilpReduce for softmax (#144009)
For #143644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144009
Approved by: https://github.com/Skylion007
2025-01-04 02:31:15 +00:00
eqy
dbdda654af [64-bit][CUDA] Upsample2D 64-bit indexing fix attempt 2 (#141923)
#141831
Block/thread math requires a cast...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141923
Approved by: https://github.com/ngimel
2025-01-04 02:30:38 +00:00
Tom Ritchford
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
Sun, Jiayi
863e6e4567 Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#141670)
Fix https://github.com/pytorch/pytorch/issues/141447.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141670
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-18 17:46:26 +00:00
albanD
792f1c47e9 No actual change, just remove variable contain Tensors from global scope (#143225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143225
Approved by: https://github.com/ezyang
2024-12-17 16:14:25 +00:00
isalia20
37fe8015ac softshrink nan fixes (#138421)
Fixes #138385 .

Currently contains fixes for cpu and cuda. Will add fixes to mps as well soon if my mac can build it from source.(Had some issues with building it on my linux pc due to limited memory)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138421
Approved by: https://github.com/mikaylagawarecki
2024-11-21 23:06:08 +00:00
Michael Diggin
723498aab8 Gaussian nll loss scalar variance support (#138931)
Fixes #138747

Adds support for `variance` being a Tensor or a float in `gaussian_nll_loss` to avoid a cpu-gpu sync point in the loss function, when the variance is a static tensor like `<scalar>*torch.ones_like(input)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138931
Approved by: https://github.com/mikaylagawarecki
2024-11-21 18:20:09 +00:00
PyTorch MergeBot
109f8274a8 Revert "Add NHWC support for group normalization (#126635)"
This reverts commit ed0e63e938.

Reverted https://github.com/pytorch/pytorch/pull/126635 on behalf of https://github.com/kit1980 due to Reverted internally at Meta, see D65979564 ([comment](https://github.com/pytorch/pytorch/pull/126635#issuecomment-2480130943))
2024-11-15 23:38:15 +00:00
Nikita Shulga
0f739b8f66 [Codemod] skipIfMps->skipIfMPS (#140562)
As `MPS` is an acronym that stands for Metal Performance Shaders
Also to closer align with `skipCUDAIf` not `skipCudaIf`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140562
Approved by: https://github.com/ZainRizvi, https://github.com/r-barnes
2024-11-13 19:45:08 +00:00
zeshengzong
cb71bcc542 Replace clone.detach with detach.clone (#140264)
Fixes #64532

As state in issue, replace `clone.detach` by `detach.clone`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
2024-11-13 07:01:02 +00:00
Mikayla Gawarecki
2ee91db03d Add APIs to separate norm calculation and gradient scaling in nn.utils.clip_grad_norm_ (#139662)
Fixes https://github.com/pytorch/pytorch/issues/139467

Refactor `nn.utils.clip_grad_norm_` into `nn.utils.get_total_norm` and then `nn.utils.clip_grads_with_norm_` . `clip_grad_norm_` now calls into these two new ops,

`get_total_norm` is generalized (rather than `get_grad_norm` due to the discussion on the issue from @awgu)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139662
Approved by: https://github.com/H-Huang
2024-11-07 23:13:23 +00:00
Danial Javady
ed0e63e938 Add NHWC support for group normalization (#126635)
Fixes #111824

Currently it is the case that if the user specifies their group normalization to be of NHWC format, pytorch will default to NCHW tensors and convert. This  conversion is not immediately obvious to the user unless they check the format themselves which is not intuitive. This PR adds suppor for NHWC for cuda by adding necessary kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126635
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-11-07 01:12:08 +00:00
Donald Tolley
c1e7d85ce6 Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss (#132049)
#### Summary
This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, important for imbalanced data or when certain samples are more significant than others.

#### Changes
- **`weighted_huber_loss`**: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter.
- **`wmse_loss`** (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks.
- **`wmae_loss`** (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers.

#### Code Details
- **Input Validation**: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors.
- **Reduction Options**: Supports `none`, `mean`, and `sum` reductions to suit various computational needs.
- **Backward Compatibility**: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument.

#### Usage Example
```python
import torch
input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32)
target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32)
weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32)

loss = weighted_huber_loss(input, target, weights, delta=1.0)
print(loss)
```
---

Feedback on these implementations is welcome; please let me know if further modifications are required.

Resolves #132465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-31 21:59:43 +00:00
Siddharth Kotapati
e27c0048db Enable additional tests for MPS CI runs (#134356)
As part of the follow up for https://github.com/pytorch/pytorch/issues/133520, adapting existing unused tests for use in MPS CI runs. Focusing on nhwc & other memory formatting tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn
2024-10-04 21:52:38 +00:00
Nikita Shulga
c6192f32f1 [MPS] Add upsample_bicubic2d as Metal op (#136123)
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
 - Switch from 2D dispatch to 1D one (to match CUDA behavior)
 - Added batch + channel loops
 - Fixed scale computation to match align corners behavior
 - Added backward implementation

Backward implementation again, mimics CUDA, so it has issues precision issue for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare and exchange of the pair of adjacent values, i.e.
```metal
emplate <typename T>
static inline void atomic_add_helper(
    device atomic<int>* data,
    long offset,
    float value) {
  auto ptr = data + (offset >> 1);
  auto old = atomic_load_explicit(ptr, memory_order_relaxed);
  union {
    int i;
    T t[2];
  } val;
  do {
    val.i = old;
    val.t[offset & 1] += static_cast<T>(value);
  } while (!atomic_compare_exchange_weak_explicit(
      ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
2024-09-24 18:58:11 +00:00
Xinya Zhang
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Know Problem:
1. `test/test_transformers.py` will massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support to nestedtenosrs+SDPA but need more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
Roy Hvaara
23b1486185 [MPS] Allow nan mean reduction in nll_loss (#135434)
This PR allows results from `nn_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162.

Fixes #134431

Ref #64572 #119108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434
Approved by: https://github.com/malfet
2024-09-10 08:37:59 +00:00
Guilherme Leobas
136e28f616 Enable forward AD in functional.affine_grid (#135494)
Fixes #121411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494
Approved by: https://github.com/zou3519, https://github.com/soulitzer
2024-09-10 00:07:07 +00:00
CaoE
dfb2b661f7 Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525)
Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-09 15:31:38 +00:00
Roy Hvaara
5a69e0ebbe [MPS] Update decorator comments with issue ref (#135448)
Updating the comments with references to better places for context now that the bugs have been identified.

xref #135442 #135447 #134184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448
Approved by: https://github.com/ezyang
2024-09-09 15:18:52 +00:00
Nikita Shulga
f95085fd91 [BE][MPS] Prefer xfail to skip (#134858)
This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by https://github.com/pytorch/pytorch/pull/128393

Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean

Before the change if run on MacOS 14:
```
 % python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s

OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s

OK (skipped=10, expected failures=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
2024-08-31 00:29:48 +00:00
Nikita Shulga
8de0d7690c Use newer toAccumulateType signature in Normalization.cpp (#134540)
Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS`  in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests

Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`

Fixes https://github.com/pytorch/pytorch/issues/134423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-27 18:09:20 +00:00
Roy Hvaara
1565940114 [MPS] Add test/test_nn.py to test suite (#134184)
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.

Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs.

```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```

Ref #133520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-26 23:48:23 +00:00
drisspg
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is  take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but not applying correclty

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.

Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
Mikayla Gawarecki
d9576c9440 Fix failures when default is flipped for weights_only (#127627)
Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627
Approved by: https://github.com/albanD
ghstack dependencies: #132349
2024-08-16 00:22:43 +00:00
PyTorch MergeBot
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
drisspg
caba37e99b Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser, https://github.com/Chillee
2024-08-13 23:37:50 +00:00
PyTorch MergeBot
4cca18d5b6 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit e61def65d5.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/albanD due to Broke forward AD tests in main ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2286432628))
2024-08-13 14:44:08 +00:00
drisspg
e61def65d5 Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser
2024-08-13 00:51:55 +00:00
PyTorch MergeBot
31ef900a65 Revert "added persistent option to buffers and namedbuffers (#132994)"
This reverts commit 8707c6dfac.

Reverted https://github.com/pytorch/pytorch/pull/132994 on behalf of https://github.com/PaliC due to breaking internal pyre tests ([comment](https://github.com/pytorch/pytorch/pull/132994#issuecomment-2278487672))
2024-08-09 18:14:53 +00:00
Randolf Scholz
8707c6dfac added persistent option to buffers and namedbuffers (#132994)
Fixes #85235

Alternative to PR https://github.com/pytorch/pytorch/pull/129655, implements 3-valued option (None or bool).

- adds keyword only argument `persistent: Optional[bool] = None` to `nn.Module.buffers`
- updated docstrings slightly.
- added test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132994
Approved by: https://github.com/mikaylagawarecki
2024-08-08 21:39:01 +00:00
Oguz Ulgen
221350e3a4 Add None return type to init -- tests (#132352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352
Approved by: https://github.com/ezyang
ghstack dependencies: #132335, #132351
2024-08-01 15:44:51 +00:00
ekamiti
9e473fd868 Make adding Buffers more like adding Parameters (#125971)
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.

Fixes #35735

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
2024-07-31 10:32:40 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Mikayla Gawarecki
dfd1d1971e Fix warning when pickle.load torch.Storage (#130246)
Fixes https://github.com/pytorch/pytorch/issues/130242

Since `torch.save` does not use pickle for storages, the `torch.load` in `_load_from_bytes` should not ever be called when `torch.load`-ing a checkpoint. Setting weights_only=False explicitly in `_load_from_bytes` to avoid the weights_only warning when using the pickle module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130246
Approved by: https://github.com/albanD
2024-07-11 02:40:29 +00:00
Huy Do
f78b79daaa Forward fix the missing torch.nn.Module.set_submodule from D59140215 (#130075)
Summary: This is to forward fix D59140215 from a PyTorch open source contributor T194074371. On PyTorch side, we need to use isinstance instead of type when checking for nn.Module.  This is the same way get_submodule is currently implemented.

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/core/tests:module_test`

Differential Revision: D59254638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130075
Approved by: https://github.com/mikaylagawarecki
2024-07-04 17:46:56 +00:00
Eddie Yan
8755e035d2 [CUDA][Pooling] Fix 64-bit indexing in avg_pool_2d backward attempt 2 (#129818)
Somehow the original PR was missing the `CUDA_KERNEL_LOOP_TYPE` change???

Thanks @johnc-keen @Chillee for the great repro! (#129785)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129818
Approved by: https://github.com/Chillee, https://github.com/Skylion007
2024-06-30 16:52:33 +00:00
Yang Cao
9f29a2291c Feat: Updated torch.nn.Modules.set_submodules() (#127714)
modified:   torch/nn/modules/module.py

Implemented feature request by #127712.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127714
Approved by: https://github.com/mikaylagawarecki
2024-06-27 06:38:54 +00:00
Mikayla Gawarecki
303ad8d7f5 Add warning for weights_only (#129239)
Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239
Approved by: https://github.com/albanD, https://github.com/malfet
2024-06-26 14:20:19 +00:00
PyTorch MergeBot
b1f486aff9 Revert "Add warning for weights_only (#129239)"
This reverts commit 381ce0821c.

Reverted https://github.com/pytorch/pytorch/pull/129239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm 381ce0821c, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))
2024-06-25 19:30:07 +00:00
eqy
8bfd9e9815 [cuDNN] Graph-capturable cuDNN CTCLoss (#128271)
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant

~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-25 06:01:50 +00:00
Mikayla Gawarecki
381ce0821c Add warning for weights_only (#129239)
Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239
Approved by: https://github.com/albanD
ghstack dependencies: #129244, #129251
2024-06-25 04:19:44 +00:00
PyTorch MergeBot
1c75ddff35 Revert "[cuDNN] Graph-capturable cuDNN CTCLoss (#128271)"
This reverts commit 40e8675fcb.

Reverted https://github.com/pytorch/pytorch/pull/128271 on behalf of https://github.com/malfet due to This makes PyTorch buildable only with CuDNN v9 ([comment](https://github.com/pytorch/pytorch/pull/128271#issuecomment-2183576996))
2024-06-21 23:29:20 +00:00
Eddie Yan
40e8675fcb [cuDNN] Graph-capturable cuDNN CTCLoss (#128271)
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant

~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang
2024-06-21 21:40:23 +00:00
Mikayla Gawarecki
a2d4fea872 [easy] Move state_dict hooks tests to test_module_hooks and decorate tests that call load_state_dict with swap (#126906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126906
Approved by: https://github.com/albanD
2024-06-10 21:50:17 +00:00
laithsakka
3a2d0755a4 enable test_ParameterList with dynamo if nn module inlining enabled only (#128308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128308
Approved by: https://github.com/anijain2305
2024-06-10 21:25:40 +00:00
Mikayla Gawarecki
65aa16f968 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)" (#128170)
https://github.com/pytorch/pytorch/issues/128165 :(

This reverts commit a7b1dd82ff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128170
Approved by: https://github.com/drisspg, https://github.com/albanD
2024-06-07 01:44:14 +00:00
Mikayla Gawarecki
a7b1dd82ff Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
ghstack dependencies: #127313
2024-06-04 21:40:49 +00:00
PyTorch MergeBot
17dea09b15 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)"
This reverts commit bfdec93395.

Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))
2024-06-01 18:46:16 +00:00
Mikayla Gawarecki
bfdec93395 Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
ghstack dependencies: #127313
2024-05-30 18:28:13 +00:00
Mikayla Gawarecki
cd06ae0cb8 Relax use_count constraints for swap_tensors when AccumulateGrad holds a reference (#127313)
### Before this PR:
`torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward
# torch.utils.swap_tensors(a, b)
del out
# Calling swap_tensors here would pass
torch.utils.swap_tensors(a, b)
```
### After this PR:
`torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1 or 2 IF the second reference is held by `AccumulateGrad`

A pre-hook will be registered on the `AccumulateGrad` node so that it will fail if it is called (i.e. if user attempts to backward through the graph).

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here is ok
torch.utils.swap_tensors(a, b)
# If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors
```

### Application to `nn.Module`

This issue is especially pertinent in context of `nn.Module` where parameters will have `AccumulateGrad` nodes initialized after forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, this would fail at the `m.cpu()` but we want users to be able to do something like the following, and instead raise an error if the user ever attempts to backward through the poisoned `AccumulateGrad` node

```python
import torch
import torch.nn as nn
m = nn.Linear(3, 5)
inp = torch.randn(2, 3)
out = m(inp)
out.sum().backward()
m.cpu()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313
Approved by: https://github.com/soulitzer
2024-05-30 07:06:55 +00:00
JackCaoG
38a33c3202 don't call .item in onehot for XLA (#127335)
We found that `nn.function.one_hot` will cause a graph break due to the item call in the native implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127335
Approved by: https://github.com/ezyang
2024-05-29 20:37:26 +00:00
dan_the_3rd
c133665d4a [CUDA] Parallelize upsampling OPS across the batch/channel dimension. (#127082)
This can make this operation 200x+ faster on modern GPUs for small grid sizes, as otherwise this kernel is scheduled with a single block (!)

Tested on A100 with:
```
python test/test_nn.py TestNNDeviceTypeCUDA
```

**Benchmarks FW**
Ran on A100 / bf16
## Forward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 5855us (0.07 GB/s) | 38us (10 GB/s) | 154x |
| 768 | 16x16 | 7x7 | 5214us (0.08 GB/s) | 37us (11 GB/s) | 138x |
| 768 | 16x16 | 14x14 | 2314us (0.27 GB/s) | 36us (17 GB/s) | 63x |
| 768 | 16x16 | 16x16 | 1232us (0.59 GB/s) | 33us (21 GB/s) | 36x |
| 768 | 32x32 | 6x6 | 19442us (0.07 GB/s) | 98us (15 GB/s) | 197x |
| 768 | 32x32 | 7x7 | 16918us (0.09 GB/s) | 89us (17 GB/s) | 188x |
| 768 | 32x32 | 14x14 | 6023us (0.28 GB/s) | 69us (25 GB/s) | 86x |
| 768 | 32x32 | 16x16 | 3455us (0.52 GB/s) | 55us (32 GB/s) | 62x |
| 768 | 48x48 | 6x6 | 38597us (0.08 GB/s) | 179us (18 GB/s) | 214x |
| 768 | 48x48 | 7x7 | 34700us (0.09 GB/s) | 163us (20 GB/s) | 211x |
| 768 | 48x48 | 14x14 | 10647us (0.33 GB/s) | 112us (31 GB/s) | 94x |
| 768 | 48x48 | 16x16 | 7388us (0.49 GB/s) | 100us (36 GB/s) | 73x |
| 768 | 64x64 | 6x6 | 76288us (0.07 GB/s) | 310us (19 GB/s) | 246x |
| 768 | 64x64 | 7x7 | 54981us (0.1 GB/s) | 257us (23 GB/s) | 213x |
| 768 | 64x64 | 14x14 | 16565us (0.37 GB/s) | 169us (36 GB/s) | 97x |
| 768 | 64x64 | 16x16 | 12037us (0.51 GB/s) | 141us (43 GB/s) | 84x |
| 1024 | 16x16 | 6x6 | 8123us (0.06 GB/s) | 44us (12 GB/s) | 183x |
| 1024 | 16x16 | 7x7 | 7017us (0.08 GB/s) | 45us (12 GB/s) | 155x |
| 1024 | 16x16 | 14x14 | 3150us (0.27 GB/s) | 45us (18 GB/s) | 69x |
| 1024 | 16x16 | 16x16 | 1695us (0.57 GB/s) | 41us (23 GB/s) | 40x |
| 1024 | 32x32 | 6x6 | 25918us (0.07 GB/s) | 120us (16 GB/s) | 214x |
| 1024 | 32x32 | 7x7 | 22622us (0.09 GB/s) | 108us (18 GB/s) | 208x |
| 1024 | 32x32 | 14x14 | 8245us (0.28 GB/s) | 87us (26 GB/s) | 94x |
| 1024 | 32x32 | 16x16 | 4599us (0.53 GB/s) | 68us (35 GB/s) | 67x |
| 1024 | 48x48 | 6x6 | 51486us (0.08 GB/s) | 219us (20 GB/s) | 234x |
| 1024 | 48x48 | 7x7 | 46501us (0.09 GB/s) | 202us (22 GB/s) | 229x |
| 1024 | 48x48 | 14x14 | 14280us (0.33 GB/s) | 145us (32 GB/s) | 98x |
| 1024 | 48x48 | 16x16 | 9877us (0.49 GB/s) | 125us (39 GB/s) | 79x |
| 1024 | 64x64 | 6x6 | 101731us (0.07 GB/s) | 378us (20 GB/s) | 268x |
| 1024 | 64x64 | 7x7 | 73465us (0.1 GB/s) | 320us (24 GB/s) | 229x |
| 1024 | 64x64 | 14x14 | 22109us (0.37 GB/s) | 218us (37 GB/s) | 101x |
| 1024 | 64x64 | 16x16 | 16081us (0.51 GB/s) | 178us (46 GB/s) | 90x |
| 1536 | 16x16 | 6x6 | 12546us (0.06 GB/s) | 61us (13 GB/s) | 205x |
| 1536 | 16x16 | 7x7 | 11064us (0.07 GB/s) | 63us (13 GB/s) | 175x |
| 1536 | 16x16 | 14x14 | 4839us (0.26 GB/s) | 62us (20 GB/s) | 77x |
| 1536 | 16x16 | 16x16 | 2630us (0.55 GB/s) | 59us (24 GB/s) | 44x |
| 1536 | 32x32 | 6x6 | 38898us (0.07 GB/s) | 170us (17 GB/s) | 227x |
| 1536 | 32x32 | 7x7 | 34079us (0.09 GB/s) | 155us (19 GB/s) | 219x |
| 1536 | 32x32 | 14x14 | 12632us (0.27 GB/s) | 124us (28 GB/s) | 101x |
| 1536 | 32x32 | 16x16 | 6900us (0.53 GB/s) | 98us (37 GB/s) | 70x |
| 1536 | 48x48 | 6x6 | 77272us (0.08 GB/s) | 316us (21 GB/s) | 243x |
| 1536 | 48x48 | 7x7 | 70153us (0.09 GB/s) | 291us (23 GB/s) | 240x |
| 1536 | 48x48 | 14x14 | 21500us (0.33 GB/s) | 208us (34 GB/s) | 103x |
| 1536 | 48x48 | 16x16 | 14851us (0.49 GB/s) | 181us (40 GB/s) | 81x |
| 1536 | 64x64 | 6x6 | 152669us (0.07 GB/s) | 548us (21 GB/s) | 278x |
| 1536 | 64x64 | 7x7 | 110348us (0.1 GB/s) | 466us (25 GB/s) | 236x |
| 1536 | 64x64 | 14x14 | 33350us (0.36 GB/s) | 316us (38 GB/s) | 105x |
| 1536 | 64x64 | 16x16 | 24173us (0.51 GB/s) | 263us (47 GB/s) | 91x |
| 4096 | 16x16 | 6x6 | 34638us (0.06 GB/s) | 138us (16 GB/s) | 249x |
| 4096 | 16x16 | 7x7 | 31590us (0.07 GB/s) | 144us (16 GB/s) | 218x |
| 4096 | 16x16 | 14x14 | 13203us (0.26 GB/s) | 149us (23 GB/s) | 88x |
| 4096 | 16x16 | 16x16 | 7328us (0.53 GB/s) | 143us (27 GB/s) | 51x |
| 4096 | 32x32 | 6x6 | 103802us (0.07 GB/s) | 405us (19 GB/s) | 256x |
| 4096 | 32x32 | 7x7 | 91354us (0.08 GB/s) | 372us (22 GB/s) | 245x |
| 4096 | 32x32 | 14x14 | 34501us (0.26 GB/s) | 312us (29 GB/s) | 110x |
| 4096 | 32x32 | 16x16 | 18465us (0.52 GB/s) | 247us (39 GB/s) | 74x |
## Backward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 78656us (0.0 GB/s) | 323us (1 GB/s) | 243x |
| 768 | 16x16 | 7x7 | 67167us (0.0 GB/s) | 292us (1 GB/s) | 230x |
| 768 | 16x16 | 14x14 | 27478us (0.02 GB/s) | 229us (2 GB/s) | 119x |
| 768 | 16x16 | 16x16 | 131us (5.59 GB/s) | 56us (13 GB/s) | 2x |
| 768 | 32x32 | 6x6 | 271752us (0.0 GB/s) | 888us (1 GB/s) | 305x |
| 768 | 32x32 | 7x7 | 224110us (0.0 GB/s) | 813us (1 GB/s) | 275x |
| 768 | 32x32 | 14x14 | 85365us (0.02 GB/s) | 450us (3 GB/s) | 189x |
| 768 | 32x32 | 16x16 | 67700us (0.02 GB/s) | 360us (5 GB/s) | 187x |
| 768 | 48x48 | 6x6 | 593709us (0.0 GB/s) | 1988us (1 GB/s) | 298x |
| 768 | 48x48 | 7x7 | 485566us (0.0 GB/s) | 1694us (1 GB/s) | 286x |
| 768 | 48x48 | 14x14 | 164059us (0.02 GB/s) | 897us (3 GB/s) | 182x |
| 768 | 48x48 | 16x16 | 134317us (0.02 GB/s) | 674us (5 GB/s) | 199x |
| 768 | 64x64 | 6x6 | 1026651us (0.0 GB/s) | 3360us (1 GB/s) | 305x |
| 768 | 64x64 | 7x7 | 770901us (0.0 GB/s) | 2584us (2 GB/s) | 298x |
| 768 | 64x64 | 14x14 | 277850us (0.02 GB/s) | 1556us (3 GB/s) | 178x |
| 768 | 64x64 | 16x16 | 236245us (0.02 GB/s) | 1144us (5 GB/s) | 206x |
| 1024 | 16x16 | 6x6 | 106638us (0.0 GB/s) | 341us (1 GB/s) | 312x |
| 1024 | 16x16 | 7x7 | 90886us (0.0 GB/s) | 314us (1 GB/s) | 288x |
| 1024 | 16x16 | 14x14 | 36572us (0.02 GB/s) | 292us (2 GB/s) | 124x |
| 1024 | 16x16 | 16x16 | 171us (5.69 GB/s) | 56us (17 GB/s) | 3x |
| 1024 | 32x32 | 6x6 | 356900us (0.0 GB/s) | 936us (2 GB/s) | 380x |
| 1024 | 32x32 | 7x7 | 299139us (0.0 GB/s) | 870us (2 GB/s) | 343x |
| 1024 | 32x32 | 14x14 | 113205us (0.02 GB/s) | 576us (4 GB/s) | 196x |
| 1024 | 32x32 | 16x16 | 90886us (0.02 GB/s) | 458us (5 GB/s) | 198x |
| 1024 | 48x48 | 6x6 | 786896us (0.0 GB/s) | 2127us (2 GB/s) | 369x |
| 1024 | 48x48 | 7x7 | 640515us (0.0 GB/s) | 1837us (2 GB/s) | 348x |
| 1024 | 48x48 | 14x14 | 218720us (0.02 GB/s) | 1152us (4 GB/s) | 189x |
| 1024 | 48x48 | 16x16 | 178827us (0.02 GB/s) | 863us (5 GB/s) | 207x |
| 1024 | 64x64 | 6x6 | 1379991us (0.0 GB/s) | 3589us (2 GB/s) | 384x |
| 1024 | 64x64 | 7x7 | 1047466us (0.0 GB/s) | 2774us (2 GB/s) | 377x |
| 1024 | 64x64 | 14x14 | 370139us (0.02 GB/s) | 1999us (4 GB/s) | 185x |
| 1024 | 64x64 | 16x16 | 316501us (0.02 GB/s) | 1470us (5 GB/s) | 215x |
| 1536 | 16x16 | 6x6 | 159057us (0.0 GB/s) | 477us (1 GB/s) | 332x |
| 1536 | 16x16 | 7x7 | 135578us (0.0 GB/s) | 441us (1 GB/s) | 306x |
| 1536 | 16x16 | 14x14 | 53002us (0.02 GB/s) | 400us (3 GB/s) | 132x |
| 1536 | 16x16 | 16x16 | 252us (5.79 GB/s) | 55us (26 GB/s) | 4x |
| 1536 | 32x32 | 6x6 | 545653us (0.0 GB/s) | 1323us (2 GB/s) | 412x |
| 1536 | 32x32 | 7x7 | 447491us (0.0 GB/s) | 1248us (2 GB/s) | 358x |
| 1536 | 32x32 | 14x14 | 173491us (0.02 GB/s) | 787us (4 GB/s) | 220x |
| 1536 | 32x32 | 16x16 | 136395us (0.02 GB/s) | 633us (5 GB/s) | 215x |
| 1536 | 48x48 | 6x6 | 1198639us (0.0 GB/s) | 3057us (2 GB/s) | 392x |
| 1536 | 48x48 | 7x7 | 985549us (0.0 GB/s) | 2645us (2 GB/s) | 372x |
| 1536 | 48x48 | 14x14 | 331419us (0.02 GB/s) | 1581us (4 GB/s) | 209x |
| 1536 | 48x48 | 16x16 | 270972us (0.02 GB/s) | 1186us (6 GB/s) | 228x |
| 1536 | 64x64 | 6x6 | 2094282us (0.0 GB/s) | 5214us (2 GB/s) | 401x |
| 1536 | 64x64 | 7x7 | 1593449us (0.0 GB/s) | 4086us (2 GB/s) | 389x |
| 1536 | 64x64 | 14x14 | 559244us (0.02 GB/s) | 2828us (4 GB/s) | 197x |
| 1536 | 64x64 | 16x16 | 469471us (0.02 GB/s) | 2057us (6 GB/s) | 228x |
| 4096 | 16x16 | 6x6 | 430494us (0.0 GB/s) | 1008us (2 GB/s) | 427x |
| 4096 | 16x16 | 7x7 | 360346us (0.0 GB/s) | 1015us (2 GB/s) | 354x |
| 4096 | 16x16 | 14x14 | 142868us (0.02 GB/s) | 988us (3 GB/s) | 144x |
| 4096 | 16x16 | 16x16 | 658us (5.93 GB/s) | 56us (69 GB/s) | 11x |
| 4096 | 32x32 | 6x6 | 1425928us (0.0 GB/s) | 2796us (2 GB/s) | 509x |
| 4096 | 32x32 | 7x7 | 1188862us (0.0 GB/s) | 2906us (2 GB/s) | 409x |
| 4096 | 32x32 | 14x14 | 464286us (0.02 GB/s) | 1965us (4 GB/s) | 236x |
| 4096 | 32x32 | 16x16 | 363903us (0.02 GB/s) | 1588us (6 GB/s) | 229x |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127082
Approved by: https://github.com/fmassa
2024-05-24 21:17:12 +00:00
PyTorch MergeBot
b36e390b6c Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)"
This reverts commit eb41ed5d90.

Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))
2024-05-23 17:43:06 +00:00
Mikayla Gawarecki
eb41ed5d90 Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
2024-05-23 15:43:32 +00:00
Simon Fan
7530cfe7e4 [dynamo][flaky tests] test_conv_empty_input_* (#126790)
Run CI, maybe fixes https://github.com/pytorch/pytorch/issues/126178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126790
Approved by: https://github.com/mikaylagawarecki
2024-05-22 03:14:21 +00:00
Martim Mendes
d54c28e7fc Added error checks for invalid inputs on thnn_conv2d (#121906)
Fixes #121188
Prevent Segmentation Fault in 'torch._C._nn.thnn_conv2d'

Previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using Torch Check). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format.

Additionally, this commit includes tests to cover the three referenced cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121906
Approved by: https://github.com/janeyx99
2024-05-17 23:41:48 +00:00
vfdev-5
415a8f6398 Fixed issue in affine_grid_backward when grad_grid is non-contiguous (#124370)
Description:
- replaced .view with .reshape to fix the problem when grad_grid is channels last 2d/3d
- added a consistency test

Fixes #124154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124370
Approved by: https://github.com/lezcano
2024-04-18 16:30:10 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Aaron Gokaslan
1d6c5972c1 [BE]: Optimize min/max/sum comprehensions C419 (#123960)
Automatic fixes that replaces certain list comprehensions with generator ones where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419 and it was automatically applied.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
2024-04-12 23:54:15 +00:00
eqy
2f0fc04fa3 [CUDA][64-bit indexing] Bump large tensor threshold of test_cross_entropy_large_tensor to 70GiB (#123772)
`torch.cuda.max_memory_reserved()` here shows 68729962496 (about 65546 MiB).

CC @malfet @crcrpar

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123772
Approved by: https://github.com/mikaylagawarecki
2024-04-12 19:18:20 +00:00
David Yan
63c24f73ef Upsample2d backwards to int64_t (#123682)
Summary: To unblock training where upsamplenearest2d involves input or output tensors which are larger than 2^31. Comes up frequently in image & video applications.

Test Plan:
```
buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_upsamplingnearest2d_backward_64bit_indexing
```

Benchmarking (N5207417):
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
118.03993721008301 124.09385920000001 98.72685525972494
# after changes
118.05780944824218 124.10893509999994 98.71190944734577
```

Differential Revision: D55625666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123682
Approved by: https://github.com/ezyang
2024-04-10 20:26:08 +00:00
Tailing Yuan
041be901b3 fix ctc_loss zero-length/neg-length corner cases (#123193)
Fixes #84827, fixes #86596, fixes #88047, fixes #89208.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123193
Approved by: https://github.com/mikaylagawarecki
2024-04-09 20:39:39 +00:00
eqy
d5b5012dc4 [CUDA] Raise softmax_forward_64bit_indexing GPU memory requirement (#116075)
printing `torch.cuda.memory_summary()` shows ~41GiB reserved at the end of this test, not sure how it was passing previously on CUDA.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116075
Approved by: https://github.com/ptrblck, https://github.com/malfet
2024-03-21 00:03:17 +00:00
David Yan
6915a5be70 Increase numel limit to 2^63 for replicatepad1d (#122199)
Summary: As title

Test Plan:
```
CUDA_VISIBLE_DEVICES=5 buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_replicatepad_64bit_indexing
```

Also benchmarked in N5106027
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
11.058772478103638 18.912256770000006 735.4118906278957
# after changes
10.621162576675415 18.58972748 765.7121070725207
```

Differential Revision: D55030372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122199
Approved by: https://github.com/ezyang
2024-03-19 21:55:34 +00:00
jmarin
a2854ae904 Bugfix consume_prefix_in_state_dict_if_present function to keep the order of the state_dict (#117464)
This PR proposes to keep the original order as the original state_dict, as the issue creator proposed. It also removes a bug concerning how ``_metadata`` is handled (see below), as well as other small changes to properly remove the prefix when is present.

In the original code, ``_metadata`` was handled as a ``key``.

```
    # also strip the prefix in metadata if any.
    if "_metadata" in state_dict:
```

This is not the case, ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:

```
    # also strip the prefix in metadata if any.
    if hasattr(state_dict, "_metadata"):
```

This PR also includes the necessary test.

Fixes #106942

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
2024-03-07 04:00:49 +00:00
Eddie Yan
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
Lourencom
69cedc16c5 Add padding dimension checks and tests (#121298)
Fixes #121093

Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad3d, torch._C._nn.replication_pad3d
```

To fix, added condition checking to raise a runtime error with a debug message instead, specifying the correct dimensions necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
2024-03-06 21:55:34 +00:00
lancerts
099ff51d45 torch check the division by zero in batch_norm_update_stats (#120882)
Fixes #120803

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120882
Approved by: https://github.com/CaoE, https://github.com/malfet
2024-03-06 05:40:21 +00:00
mingfeima
34e3f6f3c9 fix segfault in torch.native_channel_shuffle when input is empty (#121199)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

fix https://github.com/pytorch/pytorch/issues/121092

`torch.channel_shuffle` could handle empty inputs correctly. `torch.native_channel_shuffle` bypassed the `numel == 0` check, this causes divided by zero in underlying kernel.

* __->__ #121199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121199
Approved by: https://github.com/malfet
2024-03-06 00:46:36 +00:00
yuanx749
e317e39a02 Fix nonlinearity arg issue in RNN (#120234)
Fixes #114617

This PR fix the the issue with `nonlinearity`, so that it can be passed as arg or kwarg.

Alternatively, if making `nonlinearity` kwarg-only is preferred, I can revert to another commit. cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120234
Approved by: https://github.com/mikaylagawarecki
2024-02-28 20:53:18 +00:00
Tobias Ringwald
d9a1b25807 Fixed an issue where nn.Linear would cause an internal int underflow … (#119221)
…when trying to reshape a scalar input.

Fixes #119161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119221
Approved by: https://github.com/albanD
2024-02-08 21:06:34 +00:00
lancerts
b51b27922b Add to_empty() suggestion in the error message (#119353)
Fixes #119293, the comprehensive documentation is [here](0f478d9d61/docs/source/meta.rst (id11)).
Just added the suggestion into the error message so it is more informative to user.

@albanD

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119353
Approved by: https://github.com/mikaylagawarecki
2024-02-08 18:30:02 +00:00
Mikayla Gawarecki
d5a718d27b Add swap_tensors path to nn.Module._apply (#117167)
Added `torch.__future__.{get/set}_swap_module_params_on_conversion` that defaults to `False` for now, but we probably want to modify  to override this and default to `True` in `nn.Module._apply` if input is a tensor subclass.

From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1.  The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**

***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.

If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.

**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the`swap_tensors` path as of now**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
2024-02-07 18:55:44 +00:00
Mikayla Gawarecki
b92819a039 Move nn.Module.load_state_dict tests from test_nn.py to separate file (#118028)
Move these tests out so in https://github.com/pytorch/pytorch/pull/117913 where we can to run these tests with both `torch.nn.utils.set_swap_module_params_on_conversion({True/False})`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118028
Approved by: https://github.com/albanD
2024-02-05 20:17:28 +00:00
Tobias Ringwald
2de327cedc Fixed an illegal memory access in cross entropy loss when using an index that is not a valid class (#117561)
…dex that is not a valid class.

Fixes #117532.

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117561
Approved by: https://github.com/mikaylagawarecki
2024-02-02 11:03:16 +00:00
PyTorch MergeBot
df048f4da4 Revert "[RELAND] Remove deprecated fbgemm operators (#112153)"
This reverts commit 19e8ba95e5.

Reverted https://github.com/pytorch/pytorch/pull/112153 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112153#issuecomment-1921965780))
2024-02-01 18:35:19 +00:00