Commit Graph

86290 Commits

Animesh Jain
173f126068 [invoke_subgraph] Preserve node meta (#150782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150782
Approved by: https://github.com/bdhirsh
ghstack dependencies: #150666
2025-04-08 16:57:39 +00:00
PyTorch MergeBot
4447352e64 Revert "[CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)"
This reverts commit 5228986c39.

Reverted https://github.com/pytorch/pytorch/pull/150705 on behalf of https://github.com/atalman due to break periodic tests ([comment](https://github.com/pytorch/pytorch/pull/150705#issuecomment-2787017751))
2025-04-08 16:29:05 +00:00
ikalinic
97f34f0125 [ROCm][Windows] Include AOTriton dependent sources in Windows build (#150521)
Includes ATen native transformers hipified sources in the ROCm+Windows build. These sources were previously removed because Triton is not available on Windows, but removing them causes further linker errors. Setting `USE_FLASH_ATTENTION=0` and `USE_MEM_EFF_ATTENTION=0` during the build mitigates the missing headers without causing linker errors, so we use this approach for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150521
Approved by: https://github.com/jeffdaily
2025-04-08 16:18:15 +00:00
Yuanhao Ji
1239260a0e [Accelerator][Chore] Use existing acc when raising an error (#150829)
As the title says: `acc` already exists, so we just use it instead of calling `current_accelerator()` again.
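
A minimal sketch of the pattern, assuming the public `torch.accelerator` API (the exact call site in the PR may differ):

```python
import torch

# Reuse the `acc` already in hand instead of querying
# current_accelerator() a second time inside the error message.
acc = torch.accelerator.current_accelerator()
if acc is None:
    raise RuntimeError("no accelerator found")
if not torch.accelerator.is_available():
    raise RuntimeError(f"{acc} is present but not available")  # reuses `acc`
```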

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150829
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2025-04-08 16:05:06 +00:00
Nikita Shulga
ec5f2e3028 [Build] Fix fbgemm build with gcc-12+ (#150847)
By suppressing more warnings

TODO: fbgemm pin really needs to get updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150847
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-04-08 16:03:40 +00:00
ZhiweiYan-96
52d172eafd Facilitate at::_weight_int4pack_mm_with_scale_and_zeros related registration (#147962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147962
Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang
ghstack dependencies: #137566

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:07 +00:00
Yan Zhiwei
da7322548b [Intel GPU] int4 WOQ gemm XPU Support (#137566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137566
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:06 +00:00
FFFrog
05365e380d Remove torch functions that do not support device arguments from _device_constructor (#150290)
As the title states.

In addition:
- Checked all the functions in `_device_constructor` and found that `torch.vander` also does not support a device argument
- Removed duplicated functions such as `torch.ones` and `torch.asarray`
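
A quick illustration of the `torch.vander` claim (a hypothetical snippet, not from the PR):

```python
import torch

# torch.vander takes no `device` argument, unlike genuine device
# constructors such as torch.ones.
x = torch.tensor([1, 2, 3])
torch.vander(x, N=3)              # fine
torch.ones(3, device="cpu")       # device constructors accept `device`
# torch.vander(x, device="cpu")   # TypeError: unexpected keyword argument
```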

Related issue: https://github.com/pytorch/pytorch/issues/150284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150290
Approved by: https://github.com/albanD
2025-04-08 15:13:55 +00:00
FFFrog
a402c2f203 Remove redundant code in cuda/__init__.py (#150529)
As the title states.

Follow: https://github.com/pytorch/pytorch/pull/147078
Fix issue: https://github.com/pytorch/pytorch/issues/150519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150529
Approved by: https://github.com/eqy
2025-04-08 15:03:21 +00:00
Guilherme Leobas
ad516180e0 Update CPython tests for ctx manager to use unittest (#146501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146501
Approved by: https://github.com/zou3519
ghstack dependencies: #146500
2025-04-08 14:55:17 +00:00
Guilherme Leobas
f3b2fb6c66 Allow trace through unittest (#146500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146500
Approved by: https://github.com/anijain2305
2025-04-08 14:55:17 +00:00
Luca Wehrstedt
1791b4150b Clarify behavior of TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK (#150682)
I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later.

I'm reworking the code a bit, by using a few helper functions, whose name should make this behavior clearer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
2025-04-08 13:00:59 +00:00
Luca Wehrstedt
3649e2e7bd Safer bookkeeping of NCCL communicators (#150681)
This consists mainly of two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComms` (which ensures the lock is released in case of exceptions)
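
The PR itself is C++; below is a Python analogy of the RAII locking pattern it adopts (the C++ version would use `std::lock_guard`), not the actual code:

```python
import threading

_comms_map_lock = threading.Lock()  # stands in for the global NCCLComm map lock

def update_comm(comms, key, comm):
    # The `with` block releases the lock even if an exception propagates,
    # instead of relying on an explicit unlock that an early throw would skip.
    with _comms_map_lock:
        comms[key] = comm
```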

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150681
Approved by: https://github.com/kwen2501
2025-04-08 11:12:37 +00:00
FFFrog
3da14d38bd Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more information.

- Remove unused header files
- Move the inline functions that define static variables into .cc files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-04-08 10:23:02 +00:00
FFFrog
881d99495d Add more check for torch.ormqr (#150759)
As the title states.

Please refer to https://github.com/pytorch/pytorch/issues/150674 for more info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150759
Approved by: https://github.com/lezcano
2025-04-08 08:26:05 +00:00
fengqing.lu
a106842ea8 [XPU] Fix XPU unit test on Windows (#150520)
This PR resolves the issue reported in https://github.com/intel/torch-xpu-ops/issues/1478

Two test cases are failing in our Windows CI enablement.

- **test_xpu.py::TestXpuXPU::test_lazy_init_xpu** needs an `if __name__ == '__main__':` guard on Windows when using multiprocessing (see the sketch after this list). Refer to https://stackoverflow.com/a/18205006
```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 24, in <module>
    test_multi_process(model, input)
  File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 16, in test_multi_process
    assert p.exitcode == 0
AssertionError
```

- **test_xpu.py::TestXpuXPU::test_wrong_xpu_fork_xpu** is a Linux-only test case, so we skip it on Windows. Refer to 248487f455/test/test_multiprocessing.py (L609)
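
A minimal sketch of the Windows-safe idiom from the first case (illustrative, not the PR's test code):

```python
import torch
import torch.multiprocessing as mp

def _worker(model, inp):
    model(inp)

# On Windows, processes start via `spawn`, which re-imports the main module,
# so process creation must sit behind the __main__ guard shown in the error.
if __name__ == "__main__":
    model = torch.nn.Linear(4, 4)
    inp = torch.randn(2, 4)
    p = mp.get_context("spawn").Process(target=_worker, args=(model, inp))
    p.start()
    p.join()
    assert p.exitcode == 0
```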

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150520
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-04-08 07:02:40 +00:00
xinan.lin
58ede0cca3 [Inductor XPU] Refine test_mkldnn_pattern_matcher.py to be reusable for XPU. (#150286)
This PR extracts some test cases from TestPatternMatcher into a newly created TestPatternMatcherGeneric, and uses instantiate_device_type_tests to make them reusable across multiple devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150286
Approved by: https://github.com/jansel
2025-04-08 05:42:44 +00:00
FFFrog
f8aa6404ac Refactor: add initialization of math.lcm into torch_c_binding_in_graph_functions (#150766)
As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150766
Approved by: https://github.com/aorenste, https://github.com/jansel
2025-04-08 04:12:26 +00:00
zeshengzong
c9c0f8eae3 Add plot for torch.nn.Threshold and torch.nn.GLU (#150171)
Fixes #150170

## Changes

- Add plot for `torch.nn.Threshold` and `torch.nn.GLU`
- Add example output to make it easier for users to see the expected results

## Test Result

![image](https://github.com/user-attachments/assets/f6c5bc46-f9b7-4db7-9797-e08d8423d1b3)

![image](https://github.com/user-attachments/assets/ad4e6c84-7b29-44f1-b7bd-9c81e4a92ef8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150171
Approved by: https://github.com/albanD
2025-04-08 03:55:37 +00:00
zeshengzong
7e11089fe5 Optimize dataloader Self typing (#146816)
Optimize `dataloader.py` method return types with `Self` typing
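
A generic illustration of the `Self`-typing pattern (not the actual dataloader code):

```python
from typing_extensions import Self  # typing.Self on Python 3.11+

class BaseLoader:
    def with_batch_size(self, n: int) -> Self:  # not "-> BaseLoader"
        self.batch_size = n
        return self

class MapLoader(BaseLoader):
    pass

# Type-checks because `Self` binds to MapLoader, not BaseLoader.
loader: MapLoader = MapLoader().with_batch_size(8)
```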

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146816
Approved by: https://github.com/albanD
2025-04-08 03:52:23 +00:00
atalman
836955bdbd [Manylinux 2.28] Correct Linux aarch64 cuda binaries wheel name (#150786)
Related to: https://github.com/pytorch/pytorch/issues/149044#issuecomment-2784044555
For CPU binaries we run auditwheel; however, for CUDA binaries auditwheel produces invalid results, so we need to rename the file instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150786
Approved by: https://github.com/malfet
2025-04-08 02:58:28 +00:00
Ahmad Sharif
73b4938f7c [cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)
# Changes over the previous PR

This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.

Previously I merged 114d404, which did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context, see: https://github.com/pytorch/pytorch/issues/150266.

This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:

```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
 >
 __global__
 void
+__launch_bounds__(block_dim_x * block_dim_y)
  GammaBetaBackwardCUDAKernelTemplate(
     int64_t M,
     int64_t N,
```

I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg

<details>
<summary> Repro script that fails on Blackwell </summary>

```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path

# init_logging()

class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation
    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")

# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)

print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```

</details>

I also re-ran my performance benchmark and found no regressions over the previous PR.

# Full description of the old PR

Original PR: https://github.com/pytorch/pytorch/pull/148605

This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass

Summary: The new code performs better than the old code, especially for powers of 2. For the M >> N case, it performs very well (the kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel for different values of M, N, and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis, and color-coded points where green shows performance improvement and red shows regression. For example, `m=32 n=2048 1.42x` in the heatmap would indicate that the normalized shape had 32 elements, the leading dimensions' product was 2048 elements, and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can sometimes be very large while the total backward pass improvement is more modest. For example, for the dtype=torch.half, M=32, N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302 kB, which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR adds about 6 seconds of compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-04-08 02:39:41 +00:00
morotti
c0991b0316 README: anaconda license violation / no longer recommend anaconda since it's no longer free to use (#150619)
hello,

I was going over the documentation to build pytorch from source.
Unfortunately, the first thing that comes up is that you strongly recommend using anaconda, which shouldn't be used because it's no longer free to use.
Could you please remove that from the doc?

I don't know if you are aware, but anaconda is no longer free.
They changed their terms of service in 2020 to restrict commercial usage.
They changed their terms of service in 2024 to forbid downloading anaconda and to forbid educational and non-profit usage too.
The download is open and doesn't require any registration, but if you download anaconda they will sue you ^^

They started raining lawsuits on users last year. You may have heard about anaconda vs intel in the news. They have started another 5 or so in the last few months.
https://www.reuters.com/legal/litigation/intel-sued-copyright-infringement-over-ai-software-2024-08-09/

You may need to adjust more docs and your build system. The free-to-use alternative is miniforge with the conda-forge channel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150619
Approved by: https://github.com/seemethere
2025-04-08 02:10:31 +00:00
CaoE
d7f3cd0ac3 Add Half support for weight_norm on CPU (#148878)
Fixes #148867.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148878
Approved by: https://github.com/leslie-fang-intel, https://github.com/cyyever, https://github.com/albanD
2025-04-08 01:12:29 +00:00
Nikita Shulga
5228986c39 [CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)
By addressing a feedback requested at https://github.com/pytorch/pytorch/pull/145746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150705
Approved by: https://github.com/atalman
2025-04-08 00:46:13 +00:00
Akash Verma
e9e5682a4a [ROCm] Build Pytorch extensions with amdclang++ (#150451)
The following modifications were made to cpp_extension.py:
1) Changed the compiler flag to use `--version`.
2) Added a feature to convert the alphanumeric version string returned by the compiler to a numeric string. This was the source of the error, as the parser was failing on the alphanumeric version string.
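
A hedged sketch of the version-string idea (illustrative only, not the PR's implementation):

```python
import re

def to_numeric(component: str) -> int:
    # Keep the leading digits of each component so an alphanumeric version
    # string like "17.0.0git" still parses as a version.
    m = re.match(r"\d+", component)
    return int(m.group()) if m else 0

print([to_numeric(c) for c in "17.0.0git".split(".")])  # [17, 0, 0]
```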

Built with the following PyTorch extensions: Apex, TorchVision, TorchAudio & DeepSpeed.
Unit tested with the following PyTorch extensions: Apex, TorchVision.

(cherry picked from commit c873aeac35851a7d5000eb7f24561d3f56c2ffbd)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150451
Approved by: https://github.com/jeffdaily
2025-04-07 23:31:29 +00:00
Hexin Wang
91173ff89a Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Details of the issue:

If PyTorch issues send/recv to each 2-rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then the comms need to abort either in sequence or as a group.

I.e. the following sequential abort will cause a hang in NCCL:

```
recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);
```

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501
2025-04-07 23:20:49 +00:00
Animesh Jain
6ea5514e04 [invoke_subgraph] Lazy backward (#150666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150666
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-07 22:44:43 +00:00
Ankita George
78fe079c97 Support having no metadata file for HuggingFaceStorageReader (#150701)
Summary: If there is only one safetensors file, we don't need users to have a metadata file; we can just construct it from the keys of that file. This is a use case for some HuggingFace models, so this adds support for it.
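
A hedged sketch of the idea using the `safetensors` API (the helper name is illustrative):

```python
from safetensors import safe_open

def keys_from_single_file(path: str) -> list[str]:
    # With a single .safetensors file, the checkpoint keys can be read
    # straight from the file header, so no separate metadata index is needed.
    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```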

Test Plan:
ensure existing tests pass
tested e2e in a notebook

Differential Revision: D72472490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
2025-04-07 22:10:39 +00:00
Nikita Shulga
fbccbfedaf [BE] Fix Amp.metal compilation warning (#150783)
Deleting unused `uint tid` fixes
```
[114/1416] Compiling /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal to Amp_30.air
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal:70:10: warning: unused parameter 'tid' [-Wunused-parameter]
    uint tid [[thread_position_in_grid]]) {
         ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150783
Approved by: https://github.com/wdvr, https://github.com/atalman
2025-04-07 22:05:00 +00:00
Max Ren
eba05e2d3e [AO] Refactor convert and add QuantAffinePlaceholderObserver (#150644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150644
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642, #150643
2025-04-07 20:52:45 +00:00
Max Ren
5653fb3525 [AO] Add Moving Average Affine Observer (#150643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150643
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642
2025-04-07 20:52:45 +00:00
Max Ren
ed0dea3e24 [AO] update port_metadata_pass to support quant_affine ops (#150642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150642
Approved by: https://github.com/jerryzh168
2025-04-07 20:52:44 +00:00
PyTorch MergeBot
bf1132c196 Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit d86c14156d.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing periodic test: python test/test_cpp_extensions_mtia_backend.py TestCppExtensionMTIABackend.test_device_context ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2784506104))
2025-04-07 20:09:53 +00:00
Pian Pawakapan
f8b53f4a75 [export] raise when Dim.DYNAMIC 0/1 specializes (#150716)
Previously we didn't catch this; mark_dynamic() just doesn't allocate a symbol for it.
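
A hypothetical repro of the behavior described, assuming the `torch.export` `Dim.DYNAMIC` API:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Tracing with size 1 on a dimension marked Dim.DYNAMIC forces a 0/1
# specialization; after this PR that should raise instead of silently
# producing a static dimension.
export(M(), (torch.randn(1, 4),), dynamic_shapes={"x": {0: Dim.DYNAMIC}})
```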

Differential Revision: D72486930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150716
Approved by: https://github.com/angelayi
2025-04-07 18:58:42 +00:00
Sam Larsen
2a1e2b88ed [logging] Add pgo remote get/put timings to dynamo_compile (#150322)
Test Plan: https://fburl.com/scuba/dynamo_compile/sandbox/xf950tw8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150322
Approved by: https://github.com/ppanchalia
2025-04-07 18:08:26 +00:00
Annop Wongwathanarat
6fcffd8cd1 Optimize SVE embedding performance (#150176)
Change the loop unrolling strategy. Previously, the code only unrolled the inner loop over block_size when the block size was a multiple of the vector length. This version instead unrolls the outer loop, which reduces the number of loads/stores for accumulation into the output array and improves performance for cases where the block size is not a multiple of the vector length.

Benchmarking script:
```python
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
import numpy as np
import time
import sys

np.random.seed(0)
torch.manual_seed(0)

num_embeddings = 400000
embedding_dim = int(sys.argv[1])
multi_hot = 100
batch_size = 400
nrun = 1000

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)).to(torch.float16)

        # Defining the EmbeddingBag layer
        self.embedding_bag = torch.nn.EmbeddingBag(num_embeddings, embedding_dim, _weight=weights,
                                                   mode='sum', include_last_offset=True, dtype=torch.float32)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result32 = self.embedding_bag(input, offsets, per_sample_weights=None)
        return result32

# Instantiate the model
model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

# Example input
input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    output32 = model(input_tensor, offsets)

    ti = time.time_ns()
    for i in range(nrun):
        _ = model(input_tensor, offsets)
    tf = time.time_ns()
    print("{:3d} {:.3E}".format(embedding_dim, (tf-ti)/nrun/1.e6))
```
Speedup on Neoverse V1 with 1 thread
![embedding](https://github.com/user-attachments/assets/16e567ed-b9a5-4db3-90b8-dec66d5414a7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150176
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-04-07 18:01:54 +00:00
Saurabh Mishra
7d2411d30e [DCP][OSS] Introduce barrier util in the DistWrapper for rank local checkpointing (#150748)
Summary: Introduce a barrier util in the DistWrapper for rank-local checkpointing. This barrier will be used at the end of rank-local checkpointing to ensure all ranks synchronize.
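
Illustrative only (not the DCP code): the role the barrier plays at the end of rank-local checkpointing, assuming an initialized process group:

```python
import torch
import torch.distributed as dist

def save_rank_local(state_dict, path_for_this_rank):
    # Each rank writes its own shard...
    torch.save(state_dict, path_for_this_rank)  # stand-in for the shard write
    # ...then all ranks synchronize, so no rank returns before every
    # shard has been persisted.
    dist.barrier()
```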

Test Plan: UTs

Differential Revision: D72541431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150748
Approved by: https://github.com/MeetVadakkanchery
2025-04-07 17:33:07 +00:00
Isuru Fernando
957faaadca Avoid overflow in vector_norm for scalar input (#144073)
Fixes https://github.com/pytorch/pytorch/issues/143960, where torch.dist gave different results from eager due to vector_norm overflowing; eager mode avoids the overflow for single-element reductions by not computing the power and then the root.
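
A small sketch of the overflow in question (illustrative; exact outputs depend on the PyTorch version):

```python
import torch

# norm(x, 2) computed as (sum |x|**2) ** 0.5 overflows float32 when
# |x|**2 exceeds ~3.4e38, even though the exact single-element answer
# is just |x|.
x = torch.tensor([1e30], dtype=torch.float32)
print(x.abs().pow(2).sum().pow(0.5))  # inf: 1e60 overflows float32
print(torch.linalg.vector_norm(x))    # 1e30 once the pow/root path is skipped
```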
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144073
Approved by: https://github.com/eellison, https://github.com/laithsakka
2025-04-07 17:10:10 +00:00
fduwjj
06e9deabb6 [c10d][fr] Improve FR dump robustness with all watchdog broadcast wait and more frequent store check (#150652)
When debugging FR missing dumps and missing dump logs, I had a couple of initial findings:
1. On the same rank, if a second watchdog timeout triggers on a different PG (or subPG), that watchdog thread will immediately throw an exception instead of sleeping. We fix that by still making the watchdog thread wait for 1 min.
2. The FR dump takes about 900 ms to 1200 ms, so we are not checking the store frequently enough. But instead of changing the check frequency from 1 sec to 300 ms, we finally decided to just let all ranks sleep for 1 min universally rather than using a promise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150652
Approved by: https://github.com/kwen2501
2025-04-07 16:33:27 +00:00
jpvillam
56ab71de98 [ROCm] Expand workspace size for gfx95 (#150632)
Use the same workspace size for gfx95* as for gfx94*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-04-07 16:05:56 +00:00
shiyang-weng
0ad2c5d7e2 Add RECORD_FUNCTION for AOTI (#150150)
Only adds RECORD_FUNCTION for shim_fn for now.
The next step is to add RECORD_FUNCTION for all the aoti_torch_* functions.

Fixes https://github.com/pytorch/pytorch/issues/148650

Some code generated by AOTI:
```c++
    AtenTensorHandle buf1_handle;
    AtenTensorHandle buf2_handle;
    AtenTensorHandle buf3_handle;
    AtenTensorHandle buf4_handle;
    {RECORD_FUNCTION("aoti_torch_cpu__embedding_bag", c10::ArrayRef<c10::IValue>());AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__embedding_bag(L__self___sparse_arch_embedding_bag_collection_embedding_bags_t_cat_0_weight, arg80_1, arg81_1, 0, 0L, 0, nullptr, 1, -1L, &buf1_handle, &buf2_handle, &buf3_handle, &buf4_handle));}
    RAIIAtenTensorHandle buf1(buf1_handle);
    RAIIAtenTensorHandle buf2(buf2_handle);
    RAIIAtenTensorHandle buf3(buf3_handle);
    RAIIAtenTensorHandle buf4(buf4_handle);
    arg80_1.reset();
    arg81_1.reset();
```

In the trace:
```
{
  "name": "aoti_torch_cpu__embedding_bag",
  "ph": "X",
  "ts": 68874.450000,
  "dur": 361.291000,
  "tid": 2,
  "pid": "CPU Functions",
  "args": {}
},
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150150
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-04-07 15:12:29 +00:00
Benjamin Glass
f813d64f54 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #150671, #150672
2025-04-07 14:20:06 +00:00
Benjamin Glass
f0abbabac1 AOTI fallback ops: sort alphabetically (#150672)
This is just a housekeeping task that makes the listed fallback op order match what's in the generated C shim files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150672
Approved by: https://github.com/desertfire
ghstack dependencies: #150671
2025-04-07 14:20:06 +00:00
Benjamin Glass
5e3c8214b5 cpp_wrapper: Re-enable code disabled for forward compatibility (#150671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150671
Approved by: https://github.com/desertfire
2025-04-07 14:20:06 +00:00
Shivam Raikundalia
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add an API to actually take the snapshot when the client interface is called
2. Add ifdefs to the builds so that kineto hooks the snapshot correctly.

Design Philosophy: There is one interesting part of this implementation, and it is during export. For export we are calling the Python impl of the export rather than CPP, even though we are already in CPP. This is because it is better to simply have one path of export rather than 2. Personally, I want there to be parity between auto-trace and on-demand, so if we can limit the side paths then we will have an easier time maintaining this relationship.

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
Zain Huda
e209625334 [torchrec] update local_shards_wrapper to latest version (#150469)
Summary: Adding new ops, support for empty shards, and fixed initializations for downstream checkpointing.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_shards_wrapper

Differential Revision: D72271275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150469
Approved by: https://github.com/XilunWu
2025-04-07 13:00:52 +00:00
PyTorch UpdateBot
cdf3b63e32 Update slow tests (#150283)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150283
Approved by: https://github.com/pytorchbot
2025-04-07 11:49:59 +00:00
PyTorch UpdateBot
25662d38d5 [xla hash update] update the pinned xla hash (#132021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132021
Approved by: https://github.com/pytorchbot
2025-04-07 11:35:56 +00:00
Kurt Mohler
164d2c887b Add check in test_cow_input to ensure COW data is never changed (#150723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150723
Approved by: https://github.com/Skylion007
2025-04-07 04:35:00 +00:00