Commit Graph

88238 Commits

xinan.lin
a9adc9a9b6 [Linter] Add linter to detect device-bias hard code in test cases. (#152948)
Since XPU does not gate community pull requests, we’ve observed that contributors often hardcode "cuda" in functions decorated with @requires_gpu() when adding new test cases. This causes the tests to fail on XPU and breaks XPU CI.
This PR adds a linter to detect such issues automatically. An example is shown below.

```
  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode device='cuda'

        11670  |                .contiguous()
        11671  |            )
        11672  |
    >>> 11673  |        inp = torch.rand((64, 64), device="cuda") * 2 - 1
        11674  |        boundaries = torch.tensor([-0.9, -0.8, 0.1, 0.2, 0.5, 0.9])
        11675  |
        11676  |        self.common(fn, (inp, boundaries), check_lowp=False)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .cuda() call

        11700  |            self.assertEqual(ref, res)
        11701  |
        11702  |            for offset2 in (0, 1, 2, 3, 4):
    >>> 11703  |                base2 = torch.randn(64 * 64 + 64, dtype=torch.float32).cuda()
        11704  |                inp2 = torch.as_strided(base2, (64, 64), (64, 1), offset2)
        11705  |                ref2 = fn(inp2)
        11706  |                res2 = fn_c(inp2)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode torch.device('cuda:0')

        11723  |            return x.sin() + x.cos()
        11724  |
        11725  |        base = torch.randn(
    >>> 11726  |            64 * 64 + 64, dtype=torch.float32, device=torch.device("cuda:0")
        11727  |        )
        11728  |
        11729  |        inp1 = torch.as_strided(base, (32, 32), (32, 1), 4)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .to('cuda') call

        11771  |            torch.manual_seed(42)
        11772  |            base = torch.randn(64 * 64 + 64, dtype=torch.float32, device=self.device)
        11773  |            torch.manual_seed(42)
    >>> 11774  |            base_ref = torch.randn(64 * 64 + 64, dtype=torch.float32).to("cuda")
        11775  |
        11776  |            inp = torch.as_strided(base, size, stride, offset)
        11777  |            inp_ref = torch.as_strided(base_ref, size, stride, offset)
```
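
A minimal sketch of the device-agnostic pattern the linter nudges tests toward. The `GPU_TYPE` helper import is assumed to match the existing inductor test utilities, and the function is illustrative rather than part of this PR:

```py
import torch
from torch.testing._internal.inductor_utils import GPU_TYPE  # resolves to "cuda" or "xpu"


def make_bucketize_inputs():
    # Use the generic GPU type instead of hardcoding device="cuda",
    # so the same test also runs on XPU CI.
    inp = torch.rand((64, 64), device=GPU_TYPE) * 2 - 1
    boundaries = torch.tensor([-0.9, -0.8, 0.1, 0.2, 0.5, 0.9])
    return inp, boundaries
```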

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152948
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/malfet, https://github.com/jansel
2025-05-16 08:03:54 +00:00
Ti-Tai Wang
658d17dfb5 [ONNX] Add test for decomp_table update (#153671)
Added a test to strengthen the case for cherry-picking #153168. The original PR didn’t include this test since the fix for decomp_table and the registry was already covered by existing tests. However, it's reasonable to include a dedicated test for the specific issue (https://github.com/pytorch/pytorch/issues/150367) when considering the cherry-pick.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153671
Approved by: https://github.com/justinchuby
2025-05-16 08:00:16 +00:00
angelayi
3fe42d4d5d [export] Dynamo symint support (#152677)
Basically adds native _IntWrapper support to dynamo. Here's my process of trying to make symint input support work on dynamo, and how I ended up with this approach [(doc)](https://docs.google.com/document/d/1GvNRQd8BnxlMay_hrEVgEta6VUeUW_hcFeRuB7q1nDY/edit?tab=t.0).

What I did was, before passing inputs to dynamo.export, I first wrap them with a class, `_IntWrapper`. When processing dynamic shapes, I will then add the corresponding dynamic shape specification to the `dynamism` field stored on the `_IntWrapper`. If there is no dynamism specified, then this will get unwrapped back to an integer. When dynamo tracing, when we encounter an `_IntWrapper`, we will convert this to a symint if the dynamism was specified as `Dim.DYNAMIC/AUTO`. Dynamo will then trace a graph that contains symint inputs, which will get passed to AOTAutograd and so on.
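
A rough sketch of the wrapper idea described above; the real `_IntWrapper` lives in the export/dynamo internals and its exact fields and handling may differ:

```py
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class _IntWrapper:
    val: int
    # Dynamic-shape spec (e.g. Dim.DYNAMIC / Dim.AUTO) attached while processing
    # dynamic_shapes; None means "treat as a plain static int".
    dynamism: Optional[Any] = None


def maybe_unwrap(x: Any) -> Any:
    # Ints with no dynamism spec are unwrapped back to plain Python ints,
    # so dynamo only sees _IntWrapper when a symint is actually requested.
    if isinstance(x, _IntWrapper) and x.dynamism is None:
        return x.val
    return x
```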

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152677
Approved by: https://github.com/pianpwk
2025-05-16 07:51:50 +00:00
Eddie Yan
d965fa2c4b [CUDA][cuBLAS] Remove IS_ARM64 skip in test_matmul_cuda.py (#153660)
The original skip seems stale, and the test appears to run fine on Grace + Hopper and Grace + Blackwell.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153660
Approved by: https://github.com/Skylion007
2025-05-16 07:31:16 +00:00
Chien-Chin Huang
1503b3f897 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently pops tensors if they are on the meta device. This forbids use cases where users would like to let DCP directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which relies on the behavior above; that test is not realistic and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-16 07:18:39 +00:00
Xia, Weiwen
1a722f62c2 [Quant][X86] add an op to compute uint8 batch norm 2d (#152811)
**Summary**
This PR adds a new op, `onednn.qbatch_norm2d`, which accepts uint8 inputs on CPU device (instead of QuantizedCPU).
The new op is implemented with AVX512 instructions and provides performance similar to its counterpart for the QuantizedCPU device, `quantized.batch_norm2d`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_batch_norm_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152811
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #152411
2025-05-16 06:13:40 +00:00
Daniel Vega-Myhre
7e16cb99b6 [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

When I brought this to the attention of triton developer @davidberard98, he identified the memory layout of the tensor in HBM as the cause of non-pipelined loads into SRAM, causing the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new locations in the col-major layout in SRAM. Thus the load is not a candidate for pipelining with cp.async and instead moves data to registers, then performs a series of single-byte stores.

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the memory layout needed to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs (a minimal sketch of the layout enforcement is shown below)
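
A minimal sketch (not the PR's actual implementation) of forcing V into a column-major layout in its last two dims so the `tl.load` of V blocks can be pipelined:

```py
import torch


def to_column_major(v: torch.Tensor) -> torch.Tensor:
    # transpose -> contiguous -> transpose back keeps the logical shape unchanged
    # but stores the last two dims column-major in memory.
    return v.transpose(-2, -1).contiguous().transpose(-2, -1)


v = torch.randn(2, 8, 1024, 64, dtype=torch.bfloat16).to(torch.float8_e4m3fn)
v_cm = to_column_major(v)
assert v_cm.shape == v.shape and v_cm.stride(-2) == 1
```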

## Benchmarks
Rerunning the repro, we see the fp8 runtime is reduced from 120% of the bf16 runtime to 76%.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-16 04:56:50 +00:00
Angela Yi
459ce6c12a [export] Flatten frame local logs (#153627)
Summary:
Some new errors have been showing up on the PT2 dashboard with
```
Invalid type for lengths: Expected BlobReference or torch.Tensor, got: Tensor(shape: torch.Size([10]), stride: (1,), storage_offset: 0)
```
This is caused by [this piece of code](https://fburl.com/code/5nbi9on7), which maps over a set of nodes (in this case of type `IDListFeatureListField`) and turns the results into strings to be displayed later. However, during pytree.tree_map we call pytree.tree_unflatten, which calls the class's init function, which calls `assert_blob` (https://fburl.com/code/h3ainrn9). Because we've mapped over the values and converted them to strings, the assert_blob fails.

I initially thought to disable the assert_blob while tracing (D74684309), but I now think we should actually flatten the list first, because tlparse expects just string outputs instead of the actual structure.
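
A toy illustration of the fix direction (the container and values here are made up, not the fbcode types involved): flatten first and stringify the leaves, instead of tree_map-ing to strings and letting tree_unflatten rebuild the container through its validating constructor:

```py
import torch
from torch.utils import _pytree as pytree

frame_locals = {"lengths": torch.arange(10), "ids": [torch.ones(2), torch.zeros(3)]}

# Flatten, then convert leaves to display strings without reconstructing the container,
# so no constructor-time validation (like assert_blob) ever sees the stringified values.
leaves, _spec = pytree.tree_flatten(frame_locals)
flat_strings = [str(leaf) for leaf in leaves]
print(flat_strings)
```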

Test Plan: `buck2 run mode/opt sigmoid/inference/ts_migration:pt2i_readiness_main -- --test_suite ads_all --mode test_full_model --model_id 542947220` fails with something else 😅

Differential Revision: D74744326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153627
Approved by: https://github.com/yiming0416
2025-05-16 04:45:09 +00:00
Mengwei Liu
7ed377f577 Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)
This reverts commit ae0e8f0c73.

Keep android/libs/fbjni because it's being used by other components of
PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
Raymond Li
56e1c236bf [Dynamo] Catch unserialisable NN modules (#153503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153503
Approved by: https://github.com/c00w, https://github.com/jansel
2025-05-16 02:55:28 +00:00
Simon Fan
d1f1ff8610 [ddp] propagate use_python_reducer to C++ reducer (#152735)
The C++ Reducer is silently incorrect under CA; its implementation no-ops the collective. I'm guessing it was no-op'd because in DDP + python reducer, the C++ reducer is still being initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
2025-05-16 01:38:03 +00:00
Simon Fan
1b4749f748 [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-16 01:38:03 +00:00
Simon Fan
5aea57d653 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap disable on the recomputation hook; otherwise the partitioner may save fewer/more activations and mismatch the expected eager count in checkpoint. See the code comment `Note: [compiled autograd and checkpoint unpack hook]`.
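
A rough sketch of the idea, assuming a hypothetical `recompute_fn`; the real hook lives in the torch.utils.checkpoint internals:

```py
import torch


def make_unpack_hook(recompute_fn):
    # Disabling dynamo on the unpack hook keeps the recomputation in eager, so the
    # partitioner's saved-activation count matches what eager checkpointing expects.
    @torch._dynamo.disable
    def unpack_hook(saved):
        return recompute_fn(saved)

    return unpack_hook
```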

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up the internal state tracking of checkpointing, with some tests raising _StopRecomputationError and others raising the same count-mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-16 01:37:48 +00:00
cyy
9d3b6ee4c1 [submodule] Update gtest to v1.17.0 (#153618)
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
Tristan Rice
d1dd2c1fc8 gloo: cuda (#153406)
This enables Gloo CUDA when used with a backend that supports GPUDirect, which currently is only the IBVERBS backend.

This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441

Since we're now depending on gloo_cuda we need to split ProcessGroupGloo into two pieces, one with the CPU bits (libtorch_cpu) and one with CUDA kernels in libtorch_cuda. This unfortunately requires some major refactoring as some CPU code is shared across both.

The gloo submodule is updated to depend on the new Gloo changes.

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
Nikita Shulga
ab757dcddc [MPS][Testing] Add GoogleFnet, YituTechConvBert and Super_SloMo to benchmarks (#153658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153658
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/cyyever
ghstack dependencies: #153657
2025-05-16 01:09:31 +00:00
Nikita Shulga
754b758ea1 [BE] Extend empty_gpu_cache to mps (#153657)
And replace `if: elif:` with `getattr()`
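
A sketch of the getattr-based dispatch this refers to (the function name and exact signature here are illustrative, not the repo's code):

```py
import torch


def empty_gpu_cache(device: str) -> None:
    # getattr-based dispatch replaces an explicit if/elif chain over device modules;
    # works for any device module exposing empty_cache (cuda, xpu, mps, ...).
    getattr(torch, device).empty_cache()


# e.g. empty_gpu_cache("cuda"), empty_gpu_cache("xpu"), empty_gpu_cache("mps")
```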

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
Deep Shah
2489b6470b [c10d] Allow split_group to work with non nccl backends (#152175)
Summary:
Currently things are hardcoded to only work with the NCCL backend. Extend it to allow NCCL + custom plugin backends.

The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.

I am open to making them part of the base Backend if folks prefer.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2025-05-16 00:15:29 +00:00
Aaron Orenstein
cb5f31a4a1 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
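
A conceptual sketch of the new rule (not the actual FakeTensorMode code; the symbol sets are an abstraction of what the dispatch sees):

```py
def should_cache(input_symbols: set, output_symbols: set) -> bool:
    # An unbacked symbol that appears only in the output has no stable identity
    # across calls, so caching that op's result would be meaningless: bypass the cache.
    return output_symbols <= input_symbols
```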

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-15 23:18:52 +00:00
Daniel Vega-Myhre
e7a40fb301 [Async TP] Fix dim swapping before reduction in fused_scaled_matmul_reduce_scatter (#153595)
## Summary
- The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was run manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`.
- This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction (see the toy sketch after this list).
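
A toy sketch of that dim handling (illustrative only, not the custom op's code):

```py
import torch


def swap_scatter_dim_to_front(t: torch.Tensor, scatter_dim: int) -> torch.Tensor:
    # Swap dims 0 <=> scatter_dim up front for more efficient cross-device data movement.
    return t.transpose(0, scatter_dim).contiguous()


def swap_back_before_reduction(t: torch.Tensor, scatter_dim: int) -> torch.Tensor:
    # Swap back before reducing so numerics stay correct when scatter_dim != 0.
    return t.transpose(0, scatter_dim)


x = torch.randn(4, 6, 8)
assert torch.equal(swap_back_before_reduction(swap_scatter_dim_to_front(x, 1), 1), x)
```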

## Test plan
- I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing.
- I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188))
- I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))

## Next steps
1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case.
2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have a separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async-TP cases.
3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595
Approved by: https://github.com/fegin
2025-05-15 21:44:57 +00:00
Scott Wolchok
ea17cd067d Add vec_reduce_all specialization for std::plus on AArch64 (#152388)
AArch64 has an instruction for this.

Differential Revision: [D73817183](https://our.internmc.facebook.com/intern/diff/D73817183/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152388
Approved by: https://github.com/Skylion007
ghstack dependencies: #152365, #152366
2025-05-15 21:26:18 +00:00
Scott Wolchok
b972435158 vec::map: directly process reduced-precision floats when reasonable (#152366)
The immediate motivation is to make map support match
ExecuTorch so we can delete ExecuTorch-specific mapping functions, but
this should also straightforwardly improve performance.

Testing: there is existing coverage for this in
vec_test_all_types.cpp. Verified that it really does cover the newly
enabled "don't convert through float" paths by temporarily adding a
TORCH_INTERNAL_ASSERT(false).

Differential Revision: [D73802126](https://our.internmc.facebook.com/intern/diff/D73802126/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152366
Approved by: https://github.com/malfet
ghstack dependencies: #152365
2025-05-15 21:26:18 +00:00
Jithun Nair
e4adf5df39 [ROCm] cpp_extension allow user to override default flags (#152432)
We need -fgpu-rdc for projects such as DeepEP + rocSHMEM. The default of -fno-gpu-rdc doesn't work for such cases.

As per https://github.com/pytorch/pytorch/pull/152432#issuecomment-2840899088:
"rocshmem shares the same global variable in different files, as deepEP uses CUDAExtention to build the project 65e2a700f0/setup.py (L51) and depends on rocshmem, this -fgpu-rdc is needed. The current logic in Pytorch prevents users from overriding this flag."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152432
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-15 21:06:18 +00:00
Catherine Lee
b8fad785d5 Change trigger for autoformat, use --all-files (#153289)
Change the trigger for autoformat to be pull_request because the reusable action used gets the PR number from the pull_request event context, but only run it if ciflow/autoformat is attached to the PR. Tested this on a different PR, and it seems to be working.

Changed the tag name because ciflow-prefixed labels have special handling.

Also changed it to run on all files so it mimics the normal CI lintrunner call, and because lintrunner, either by itself or using -m mergebase, can miss some things. I don't know if it would miss anything for format, but it does for checking lint. Format seems to take less time than normal lint. I don't know if the comment about making suggestions on non-edited file changes is a concern; I didn't really test this part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153289
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 20:38:33 +00:00
Sam Larsen
90deff6d59 Refactor tests in test_max_autotune into a few separate test cases. (#153486)
Summary: To support running a subset of these tests with the remote autotuning utilities, I've split out some of the tests into separate classes so that I can derive from the "main" TestMaxAutotune class when creating new tests for remote. I'm not 100% sure what some of these tests do, so please suggest if another grouping / naming might make more sense. The remaining tests in TestMaxAutotune all smelled relevant to me.

Test Plan: existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153486
Approved by: https://github.com/eellison
2025-05-15 20:35:22 +00:00
Scott Wolchok
a2e2f908fd add is_vec_specialized_for (#152365)
Let people detect at compile time whether Vectorized is specialized for a given type. See vec_base.h.

Differential Revision: [D73802129](https://our.internmc.facebook.com/intern/diff/D73802129/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152365
Approved by: https://github.com/jgong5, https://github.com/malfet
2025-05-15 20:21:48 +00:00
PyTorch MergeBot
ae0e8f0c73 Revert "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)"
This reverts commit b22f01fcb9.

Reverted https://github.com/pytorch/pytorch/pull/153633 on behalf of https://github.com/malfet due to But libtorch build regressions are real, fbjni is still used for C++ builds ([comment](https://github.com/pytorch/pytorch/pull/153633#issuecomment-2884951805))
2025-05-15 20:16:05 +00:00
Yang Wang
b03e4f53d2 [Monitoring] enable windows monitoring test (#153453)
Enable utilization monitoring for Windows tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153453
Approved by: https://github.com/huydhn
2025-05-15 20:03:07 +00:00
Tristan Rice
f7ecc091a0 c10d/TCPStore: better logs on remote shutdown (#153586)
This makes it more obvious what's going on when TCPStore shuts down while waiting on a remote key and also shows the remote address.

Test plan:

```
[W514 18:33:36.536327028 TCPStore.cpp:138] [c10d] recvValueWithTimeout failed on SocketImpl(fd=3, addr=[localhost]:34658, remote=[localhost]:1234): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
```

```py
import os
rank = int(os.environ["RANK"])

import time
from torch import distributed as dist

store = dist.TCPStore(
    host_name="localhost",
    port=1234,
    is_master=(rank == 0),
    wait_for_workers=False,
)

time.sleep(1)

print("starting")

if rank != 0:
    store.get("foo")
else:
    time.sleep(1)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153586
Approved by: https://github.com/XilunWu
2025-05-15 20:02:51 +00:00
Yang Wang
064f4c18f9 [Monitoring] Enable perf tests (#153452)
Enable monitoring for more perf tests. Currently, for perf, we collect usage data every 4 seconds and aggregate every 15 seconds.

We can reduce the interval further if the monitoring does not affect the perf tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153452
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2025-05-15 19:19:19 +00:00
Xuehai Pan
a4c828199e [BE] Add __all__ to torch/nn/functional.pyi and torch/return_types.pyi (#150729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150729
Approved by: https://github.com/aorenste
2025-05-15 19:01:57 +00:00
Mengwei Liu
b22f01fcb9 Delete TorchScript based Android demo app and point to ExecuTorch (#153633)
Delete TorchScript demo app and point people to ExecuTorch demo app.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153633
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/atalman, https://github.com/janeyx99, https://github.com/seemethere
2025-05-15 18:43:59 +00:00
Catherine Lee
00e5cb3db3 [ez][trymerge] Edit revert message for reverted ghstack PRs (#153573)
Change comment about successful revert so it also contains info about the original PR that got the comment (if it is a ghstacked PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153573
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 18:23:20 +00:00
Shuai Yang
480ae2dab8 Add needs_contiguous_strides to more collective ops (#153523)
Differential Revision: D74705770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153523
Approved by: https://github.com/fmassa
2025-05-15 17:27:37 +00:00
Aditya Tewari
cfee9046b6 cpu: enable gemm-bf16f32 for SDPA BF16 (#140159)
This PR enables gemm-bf16f32 for SDPA BF16 on aarch64. This will enable faster inference for models with attention layers in autocast (bf16) mode.

Benchmark results from  [PyTorch CI HUD - branch](https://hud.pytorch.org/benchmark/huggingface/inductor_no_cudagraphs?dashboard=torchinductor&startTime=Fri%2C%2028%20Mar%202025%2021%3A26%3A20%20GMT&stopTime=Fri%2C%2004%20Apr%202025%2020%3A26%3A20%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/gemm_bf16f32&lCommit=d5aeab452e4b1f0580a4636b15a604c77a02c57b&rBranch=main&rCommit=bc72420bcb37390af3fced885e019903e6e425bd)
Overall geometric mean speedup in the HUD dashboard: for Huggingface `[0.48x → 0.58x]` and for Blueberries `[0.88x → 1.13x]`.

Benchmark numbers for `torch.nn.functional.scaled_dot_product_attention` on Neoverse™ V1.

`batch_size = 1, num_attention_heads = 64, sequence_length = 512, attention_head_size = 128`
 `threads=16`
<img width="319" alt="Screenshot 2024-12-20 at 16 23 22" src="https://github.com/user-attachments/assets/c863f97d-0761-4fb8-aa6c-fc67b22ac3f9" />

Script to benchmark & profile SDPA:

    import torch
    import torch.nn as nn
    import time
    import numpy as np
    from torch.profiler import profile, record_function, ProfilerActivity
    class SimpleAttentionModel(nn.Module):
        def __init__(self, query, key, value):
            super(SimpleAttentionModel, self).__init__()
            self.query = query
            self.key = key
            self.value = value

        def forward(self, attn_mask=None):
            torch.nn.functional.scaled_dot_product_attention(
                        self.query,
                        self.key,
                        self.value,
                        attn_mask=attn_mask)

    #batch_size = 1, num_attention_heads = 64, sequence_length = 512, hidden_size = 128
    def bench_sdpa(batch_size = 1, num_attention_heads = 64, sequence_length = 512, query_sequence_length = 128 , hidden_size=128, precision=torch.float32):
        with torch.no_grad():
            attention_head_size = int(hidden_size / num_attention_heads)
            query = torch.rand(size=(batch_size, num_attention_heads, query_sequence_length, attention_head_size), dtype=precision)
            key = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)
            value = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)

            model = SimpleAttentionModel(query, key, value)
            model.eval()
            for _ in range(10):
                model()
            times = []
            n_iters = 100
            for _ in range(n_iters):
                s = time.time_ns()
                model()
                times.append((time.time_ns() - s) / 1e3)
            min_times = np.min(times)
            mean_times = np.mean(times)
            print(f"Min Times = {min_times} us")
            print(f"Mean Times = {mean_times} us")
            print("Times = ", times)

    print("BF16 mode:")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            bench_sdpa(precision=torch.bfloat16)
    profile_data = prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total")
    print(profile_data)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140159
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/nikhil-arm, https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/cfRod, https://github.com/fadara01
2025-05-15 17:21:18 +00:00
PyTorch MergeBot
236b08cbf8 Revert "[ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)"
This reverts commit 4863e5c843.

Reverted https://github.com/pytorch/pytorch/pull/153300 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:52 +00:00
PyTorch MergeBot
2327c9eedc Revert "[ca][dtensor] run real PG dtensor tests under CA (#152689)"
This reverts commit b297e01f4b.

Reverted https://github.com/pytorch/pytorch/pull/152689 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:51 +00:00
Nikita Shulga
db26aeaec2 [MPSInductor] Support numpy scalars handling (#153598)
By default, numpy computes results in float64 format, but when such a result is passed as an argument to an MPS function it must be implicitly converted to float32. This pattern naturally occurs in some networks, for example in speech_transformer.
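
A small illustration of the pattern (assumes an MPS-capable build with recent MPS inductor support; the op and shapes are illustrative):

```py
import numpy as np
import torch

scale = np.float64(1.0) / np.sqrt(512)  # numpy math yields a float64 scalar by default


@torch.compile
def f(t, s):
    # The float64 numpy scalar has to be handled as float32 when lowered for MPS.
    return t * s


x = torch.randn(4, device="mps")
print(f(x, scale))
```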

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153598
Approved by: https://github.com/cyyever, https://github.com/dcci
ghstack dependencies: #153582
2025-05-15 16:48:25 +00:00
Catherine Lee
0cb48633d9 [ez][CI] Add linux aarch64 to upload test stats, change format of trigger for upload test stats (#153505)
Change from inline list to yml list
Add linux aarch64 for list of triggering workflows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153505
Approved by: https://github.com/Skylion007
2025-05-15 15:33:59 +00:00
Animesh Jain
fa8543454a [dynamo][torch-function] Prevent unnecessary __torch_function__ tracing (#153551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153551
Approved by: https://github.com/mlazos
2025-05-15 14:06:17 +00:00
Aaron Gokaslan
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant. A few places have slipped in logging.error calls inside try/except since I last did a cleanup here, and the rule is stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few errors based on whether we immediately reraised the exception or didn't print any exception information where it could be useful.
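
The pattern the TRY400 rule enforces, shown as a small standalone example (the function and config file are made up for illustration):

```py
import json
import logging

logger = logging.getLogger(__name__)


def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except OSError:
        # TRY400: logger.exception logs the active traceback automatically;
        # logger.error would drop it unless exc_info=True is passed.
        logger.exception("failed to load config from %s", path)
        return {}
```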

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
sanchitintel
7482eb217c [Inductor-CPU] Faster int8 WoQ GEMM for small M with explicit prefetching and different outer loops (#149373)
### Summary

Fixes #148494

Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.

Some of this code (the outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is used to avoid branching that would cause a slowdown due to unnecessary prefetching.

\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance

### Performance data with BS 1

Machine: 32 cores of one socket of a Intel Xeon SP Gen 5 machine

| Model | input tokens | output tokens | next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms|  9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms|  7.01% |

While the input shapes of the GEMMs corresponding to linear layers in next-token computation remain the same regardless of the number of input & output tokens, the difference in next-token latency is due to attention for those cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen

Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
2025-05-15 11:55:58 +00:00
cyy
e5e06d9cab [submodule] Update kleidiai to v1.8.0 (#153592)
And cleans up some CMake instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153592
Approved by: https://github.com/malfet
2025-05-15 10:14:05 +00:00
Xuehai Pan
22b124335e [BE] Update .pyi stub template to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150728)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150728
Approved by: https://github.com/cyyever, https://github.com/aorenste
ghstack dependencies: #150726, #150727
2025-05-15 09:36:42 +00:00
Xuehai Pan
f7a5aa1d8d [torchgen] Refactor and simplify gen_pyi.py to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150727)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150727
Approved by: https://github.com/aorenste
ghstack dependencies: #150726
2025-05-15 09:36:42 +00:00
Jerry Mannil
129a2976a8 [ROCm] Improvements to non-vectorized elementwise kernels (#153184)
* Unroll loops manually to hide memory access latency

Co-authors: @akadutta @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153184
Approved by: https://github.com/jeffdaily
2025-05-15 09:14:43 +00:00
Pat Vignola
6e107899da [Torch] Fix crash when comparing fp8 tensors that have more than 1 dimension (#153508)
Summary: `torch.nonzero` returns one index per dimension for each nonzero element, so we shouldn't expect a single element for the indices.
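
A quick illustration of the `torch.nonzero` shape point above:

```py
import torch

t = torch.tensor([[0.0, 1.0], [2.0, 0.0]])
idx = torch.nonzero(t)  # shape (num_nonzero, ndim) == (2, 2)
print(idx)              # tensor([[0, 1], [1, 0]]) -- one index per dimension, not a single element
```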

Test Plan: CI

Differential Revision: D74539233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153508
Approved by: https://github.com/exclamaforte
2025-05-15 08:41:46 +00:00
Simon Fan
b297e01f4b [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-15 08:10:35 +00:00
Simon Fan
4863e5c843 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap disable on the recomputation hook; otherwise the partitioner may save fewer/more activations and mismatch the expected eager count in checkpoint. See the code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up the internal state tracking of checkpointing, with some tests raising _StopRecomputationError and others raising the same count-mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-15 08:10:35 +00:00
PyTorch MergeBot
71027b13b2 Revert "[FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)"
This reverts commit 881a598a1e.

Reverted https://github.com/pytorch/pytorch/pull/153357 on behalf of https://github.com/jeanschmidt due to Might have introduced regressions in rocm testing for main: https://github.com/pytorch/pytorch/actions/runs/15035410497/job/42257000513 feel free to re-merge if this was a mistake ([comment](https://github.com/pytorch/pytorch/pull/153357#issuecomment-2882915691))
2025-05-15 07:58:27 +00:00