Commit Graph

37298 Commits

Author SHA1 Message Date
Eddie Yan
f911d64750 [CUDA] xFail max-autotune grouped gemm tests on devices with insufficient SM count (#165921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165921
Approved by: https://github.com/ngimel
2025-10-30 20:05:07 +00:00
Chien-Chin Huang
56838bad5f [CP][BE][1/2] Refactor the code structure (#166456)
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keep the import so that the existing users of CP won't be affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166456
Approved by: https://github.com/Skylion007
2025-10-30 19:46:49 +00:00
Yu, Guangye
0ec0549823 Introduce a new API torch.xpu.get_per_process_memory_fraction (#165511)
# Motivation
Aligned with other backends, this PR introduces a new API torch.xpu.get_per_process_memory_fraction to allow user to retrieve the allowed memory fraction per a single process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165511
Approved by: https://github.com/EikanWang, https://github.com/ezyang
ghstack dependencies: #165508, #165509, #165510
2025-10-30 19:30:09 +00:00
Austin Morton
b939de26d1 Avoid writing temporary modules to disk (#157713)
In some cases the warning from #147744 still gets emitted because [atexit hooks aren't called](https://github.com/python/cpython/pull/114279).

Even in those cases, if the atexit hooks _were_ called you could end up with issues due to the directory being deleted in one process, but still being used elsewhere.

It's better all round to load these modules entirely in-memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157713
Approved by: https://github.com/xush6528
2025-10-30 19:11:16 +00:00
Edward Z. Yang
4acc66f119 Make PT2 compile backprop through custom op without autograd key a hard error (#166367)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166367
Approved by: https://github.com/bdhirsh
2025-10-30 18:43:07 +00:00
PyTorch MergeBot
8f40a0c634 Revert "address DDE in matmul decomp (#166541)"
This reverts commit 90519402c2.

Reverted https://github.com/pytorch/pytorch/pull/166541 on behalf of https://github.com/atalman due to breaks internal test ([comment](https://github.com/pytorch/pytorch/pull/166541#issuecomment-3469382334))
2025-10-30 18:11:33 +00:00
PyTorch MergeBot
694d205143 Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit 311ea0dec0.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3469308568))
2025-10-30 17:52:29 +00:00
eellison
629293f568 bucket all reduce (#166528)
Bucket all reduce in bucketer, thanks to @IvanKobzarev's earlier pr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166528
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #166527
2025-10-30 17:12:34 +00:00
eellison
c37802a8c4 use multi-dtype bucketing (#166527)
Make the bucketer use multi-dtype bucketing for all gathers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166527
Approved by: https://github.com/IvanKobzarev, https://github.com/ezyang
2025-10-30 16:54:49 +00:00
PyTorch MergeBot
0a3ac47c0a Revert "[user-streams] Fix stream graph output semantics (#164819)"
This reverts commit f5cb9a4c68.

Reverted https://github.com/pytorch/pytorch/pull/164819 on behalf of https://github.com/atalman due to breaks CI ([comment](https://github.com/pytorch/pytorch/pull/164819#issuecomment-3469018283))
2025-10-30 16:53:32 +00:00
Simon Layton
fb545fb068 Add MXFP4 grouped gemm support via. FBGEMM kernels (#166530)
Summary:

* Extend `_scaled_grouped_mm_v2` to include MXFP4 support
* Add testing to existing grouped routines

Test Plan:

```
pytest -svv -k "mxfp4 and group" test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166530
Approved by: https://github.com/drisspg
2025-10-30 16:46:11 +00:00
Bin Bao
2df2c316e2 [devx] Fix invalid symbol definition emitted in fx_graph_runnable.py (#166529)
Summary: When emitting symbolic variable definition in fx_graph_runnable.py, we need to check if a SymNode is actually an expression, so that we won't generate something like "s27*s53**2 = 36".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166529
Approved by: https://github.com/mlazos
ghstack dependencies: #166432
2025-10-30 16:40:12 +00:00
Bin Bao
08b0a8f11a [Inductor] Fix an inductor_provenance bug (#166432)
Summary: Fix an inductor_provenance related error seen when running TORCH_COMPILE_DEBUG generated fx_graph_runnable.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166432
Approved by: https://github.com/mlazos
2025-10-30 16:40:12 +00:00
PyTorch MergeBot
3f1824742c Revert "Fix comparing inductor actual strides vs bw graph for activations should not throw DDE. (#166277)"
This reverts commit b2a0f90501.

Reverted https://github.com/pytorch/pytorch/pull/166277 on behalf of https://github.com/atalman due to Breaks internal executorch tests ([comment](https://github.com/pytorch/pytorch/pull/166277#issuecomment-3468696623))
2025-10-30 15:49:23 +00:00
Isuru Fernando
bbb7d2270b [inductor] print 0.0 as 0 for triton (#164291)
Fixes https://github.com/pytorch/pytorch/issues/164157
Fixes https://github.com/pytorch/pytorch/issues/164086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164291
Approved by: https://github.com/bobrenjc93, https://github.com/mlazos
2025-10-30 15:15:25 +00:00
Shawn Xu
ad559072db [triton][sigmoid] Fix kernel cache and serialization issue for triton sigmoid + CUDA kernel bug (#166568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166568
Approved by: https://github.com/minjang
2025-10-30 14:54:54 +00:00
eellison
7563f61cc8 Make bucketing aware of collective LIFO semantics (#166324)
In the initial pr for overlapping preserving bucketing, for a graph like:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

We would add dependencies from mm -> ag, and wait from wait -> hiding_compute, to prevent bucketing reordering these collectives so that overlap no long occurred. however, there is an additional way for bucketing to prevent overlap.

If we were to reorder another collective so the graph looked like:

```
def foo(...):
     ag = all_gather(...)
     ar = all_reduce(...)
     wait(ar)
     hiding_compute = mm(...)
     wait(ag)
```

Overlap would not occur, because the wait for the all reduce would also force realization of every collective enqueued on the same stream prior to the all reduce. NCCL uses a single stream per process group.

To model, we set a set a strict ordering of all collective starts, waits, and hiding compute initially when bucketing. Then, when trying to add a collective to a bucket, we will see if we interfere with overlap for all of the following possible bucketings:

[move collective start to bucket start, move bucket start to collective start] x [move collective wait to bucket wait x move bucket wait to collective wait].

For any of these positions, we check if overlap would have been interfered with because of stream queue semantics. Then, if not, we remove the moving start and wait from the constrained ordering of collectives, and see if it's topologically valid to merge the nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166324
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #166309
2025-10-30 13:37:00 +00:00
PyTorch MergeBot
fa8e073a4e Revert "[triton][sigmoid] Fix kernel cache and serialization issue for triton sigmoid + CUDA kernel bug (#166568)"
This reverts commit d46d8d6f54.

Reverted https://github.com/pytorch/pytorch/pull/166568 on behalf of https://github.com/atalman due to Failed test/test_extension_utils.py::TestExtensionUtils::test_external_module_register_with_renamed_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18931754443/job/54050880312) [HUD commit link](d46d8d6f54) ([comment](https://github.com/pytorch/pytorch/pull/166568#issuecomment-3468008894))
2025-10-30 13:31:47 +00:00
PyTorch MergeBot
95b5534773 Revert "[user-streams] Track symbolic current stream (#165212)"
This reverts commit a5335263d3.

Reverted https://github.com/pytorch/pytorch/pull/165212 on behalf of https://github.com/atalman due to test/test_rename_privateuse1_to_existing_device.py::TestRenamePrivateuseoneToExistingBackend::test_external_module_register_with_existing_backend [GH job link](https://github.com/pytorch/pytorch/actions/runs/18930365446/job/54046768884) [HUD commit link](a5335263d3) ([comment](https://github.com/pytorch/pytorch/pull/165212#issuecomment-3467968796))
2025-10-30 13:24:56 +00:00
libohao1201
2829d48bd1 [xpu][test][1/N] Port 3 fsdp distributed test cases to Intel GPU (#161476)
For https://github.com/pytorch/pytorch/issues/114850, we will port 3 distributed tests to Intel GPU.
We could enable Intel GPU with the following methods and try the best to keep the original code styles:

- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use "requires_accelerator_dist_backend" to enable "xccl"
- enabled XPU for some test path
- skip some test cases that Intel GPU does not support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161476
Approved by: https://github.com/weifengpy, https://github.com/guangyey
2025-10-30 07:30:04 +00:00
Shawn Xu
d46d8d6f54 [triton][sigmoid] Fix kernel cache and serialization issue for triton sigmoid + CUDA kernel bug (#166568)
Differential Revision: D85792537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166568
Approved by: https://github.com/minjang
2025-10-30 06:17:39 +00:00
Michael Lazos
a5335263d3 [user-streams] Track symbolic current stream (#165212)
merge into stream tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165212
Approved by: https://github.com/anijain2305
ghstack dependencies: #164304, #164522, #164819, #165211
2025-10-30 04:58:53 +00:00
Michael Lazos
f5cb9a4c68 [user-streams] Fix stream graph output semantics (#164819)
Preivously, we would stash a single stream value we constructed at trace time in a global and return the same value from repeated calls to the graph.

With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164819
Approved by: https://github.com/anijain2305
ghstack dependencies: #164304, #164522
2025-10-30 04:58:46 +00:00
Angel Li
476b149a00 bwd pass (#164504)
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.

**Benchmarking**

To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.

Settings:

- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs

|        | Variable Length API | SDPA     |
|--------|--------------------|----------|
| Runtime | 0.8189142608642578 ms       | 3.263883056640625 ms  |
| TFLOPs | 268.652       | 158.731  |

We can see that runtime for Varlen is >3x faster

**Testing**

Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.

For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164504
Approved by: https://github.com/drisspg
2025-10-30 03:46:37 +00:00
xinan.lin
0918bf321c [xpu][test] Reuse native_mm and mix_order_reduction for Intel GPU. (#166384)
This PR reused native_mm and mix_order_reduction for Intel GPU and enabled the corresonding test.
Fixes #165370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166384
Approved by: https://github.com/jansel
2025-10-30 03:38:35 +00:00
Laith Sakka
90519402c2 address DDE in matmul decomp (#166541)
Address https://github.com/pytorch/pytorch/issues/165081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166541
Approved by: https://github.com/mlazos
2025-10-30 03:19:29 +00:00
Dzmitry Huba
791ca80d3a Enable local tensor mode for DTensor attention and convolution tests (#166406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166406
Approved by: https://github.com/ezyang
2025-10-30 02:48:02 +00:00
Yuanyuan Chen
5cbdade914 Fix a syntactic error in test_indexing.py (#166390)
This PR fixes a syntactic error in test_indexing.py by a misplaced `if else` expression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166390
Approved by: https://github.com/jerryzh168
2025-10-30 02:28:01 +00:00
Bruce Chang
311ea0dec0 shrink_group implementation to expose ncclCommShrink API (#164518)
Closes #164529

To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-30 01:50:54 +00:00
Ruben Rodriguez Buchillon
e380028a51 [inductor][choices] lookup table choices 1/3 (#164978)
\# why

- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting
  them for the inputs

\# what

- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing

- currently only supports templates that go through
  `V.choices.get_template_configs`

\# testing

```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```

Differential Revision: [D85685743](https://our.internmc.facebook.com/intern/diff/D85685743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
2025-10-30 01:28:01 +00:00
Sherlock Huang
f4d05feb7a Repro dynamo issue for union typed annotation (#166443)
when nested function has type annotation using "|",  it fails.

it works fine with `Union[torch.Tensor, DTensor]` tho.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166443
Approved by: https://github.com/anijain2305
2025-10-30 01:05:15 +00:00
Laith Sakka
b2a0f90501 Fix comparing inductor actual strides vs bw graph for activations should not throw DDE. (#166277)
Fix https://github.com/pytorch/pytorch/issues/163894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166277
Approved by: https://github.com/Lucaskabela
2025-10-30 00:34:05 +00:00
eellison
14d4a77495 disable current modes instead of no dispatch in estimation (#166571)
otherwise, the custom estimation's TorchDispatchModes will be disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166571
Approved by: https://github.com/SherlockNoMad, https://github.com/bdhirsh
2025-10-29 23:24:41 +00:00
eellison
c3d205d598 helper function for replacing nodes in aug graph (#166309)
When we do bucketing, we replace starts and waits with new nodes. This pr adds a helper to transfer the augmented graph additional deps.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166309
Approved by: https://github.com/IvanKobzarev
2025-10-29 23:08:33 +00:00
Michael Lazos
c54e2c5b41 [User-streams] Make torch.Event weakref compatible (#164522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164522
Approved by: https://github.com/williamwen42
ghstack dependencies: #164304
2025-10-29 23:06:31 +00:00
Michael Lazos
c3047938a0 [user-streams] Make device-agnostic streams weakref compatible (#164304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164304
Approved by: https://github.com/williamwen42, https://github.com/colesbury
2025-10-29 23:06:31 +00:00
Shangdi Yu
d2eff5d454 Add python stack trace to AOTI generated code (#160539)
Summary:
We add a thread_local KernelContext object so Strobelight (and other potential profilers) can read the stack trace information of the running kernel.

This will bring extra overhead, so we guard this behind the `cpp.enable_kernel_profile` flag.

Example output code:

```cpp
#include <torch/csrc/inductor/aoti_runtime/kernel_context_tls.h>
namespace torch::aot_inductor {
thread_local KernelContext* tls_kernel_context = nullptr;
}
// Other code .....
void AOTInductorModel::run_impl(
    AtenTensorHandle*
        input_handles, // array of input AtenTensorHandle; handles
                        // are stolen; the array itself is borrowed
    AtenTensorHandle*
        output_handles, // array for writing output AtenTensorHandle; handles
                        // will be stolen by the caller; the array itself is
                        // borrowed
    DeviceStreamType stream,
    AOTIProxyExecutorHandle proxy_executor
) {
    __check_inputs_outputs(input_handles, output_handles);
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 4);
    auto arg2_1 = std::move(inputs[0]);
    auto arg3_1 = std::move(inputs[1]);
    auto arg4_1 = std::move(inputs[2]);
    auto arg5_1 = std::move(inputs[3]);
    [[maybe_unused]] auto& fc1_weight = constants_->at(0);
    [[maybe_unused]] auto& fc1_bias = constants_->at(1);
    inputs.clear();
    [[maybe_unused]] auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    static constexpr int64_t int_array_0[] = {8L, 16L};
    static constexpr int64_t int_array_1[] = {16L, 1L};
    AtenTensorHandle buf0_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(2, int_array_0, int_array_1, cached_torch_dtype_float32, cached_torch_device_type_cpu, this->device_idx_, &buf0_handle));
    RAIIAtenTensorHandle buf0(buf0_handle);
    // Topologically Sorted Source Nodes: [linear], Original ATen: [aten.t, aten.addmm]
    // [Provenance debug handles] aoti_torch_cpu_addmm_out:1
    static constexpr int64_t int_array_2[] = {10L, 16L};
    static constexpr int64_t int_array_3[] = {1L, 10L};
    {
    KernelContextGuard _ctx("aoti_torch_cpu_addmm_out", R"(
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/caffe2/test/inductor/test_provenance_tracing.py", line 829, in forward
    x = self.fc1(x)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/torch/nn/modules/linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
)");
    RAIIAtenRecordFunctionHandle record_aoti_torch_cpu_addmm_out_("aoti_torch_cpu_addmm_out", nullptr);
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu_addmm_out(buf0, fc1_bias, arg2_1, wrap_with_raii_handle_if_needed(reinterpret_tensor_wrapper(fc1_weight, 2, int_array_2, int_array_3, 0L)), 1L, 1L));
    }
    arg2_1.reset();
    auto buf1 = std::move(buf0);  // reuse
    static constexpr int64_t int_array_4[] = {10L, 20L};
    static constexpr int64_t int_array_5[] = {20L, 1L};
    AtenTensorHandle buf2_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(2, int_array_4, int_array_5, cached_torch_dtype_float32, cached_torch_device_type_cpu, this->device_idx_, &buf2_handle));
    RAIIAtenTensorHandle buf2(buf2_handle);
    // [Provenance debug handles] cpp_fused_mul_relu_sigmoid_0:2
    {
    KernelContextGuard _ctx("cpp_fused_mul_relu_sigmoid_0", R"(
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/caffe2/test/inductor/test_provenance_tracing.py", line 831, in forward
    x = self.sigmoid(x)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/torch/nn/modules/activation.py", line 359, in forward
    return torch.sigmoid(input)
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/caffe2/test/inductor/test_provenance_tracing.py", line 830, in forward
    x = self.relu(x)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/torch/nn/modules/activation.py", line 144, in forward
    return F.relu(input, inplace=self.inplace)
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/caffe2/test/inductor/test_provenance_tracing.py", line 832, in forward
    d = a * 3.14
)");
    cpp_fused_mul_relu_sigmoid_0((float*)(buf1.data_ptr()), (const float*)(arg3_1.data_ptr()), (float*)(buf2.data_ptr()));
    }
    arg3_1.reset();
    static constexpr int64_t int_array_6[] = {10L, 30L};
    static constexpr int64_t int_array_7[] = {30L, 1L};
    AtenTensorHandle buf3_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(2, int_array_6, int_array_7, cached_torch_dtype_float32, cached_torch_device_type_cpu, this->device_idx_, &buf3_handle));
    RAIIAtenTensorHandle buf3(buf3_handle);
    // Topologically Sorted Source Nodes: [mul, addmm], Original ATen: [aten.mul, aten.addmm]
    // [Provenance debug handles] aoti_torch_cpu_addmm_out:3
    {
    KernelContextGuard _ctx("aoti_torch_cpu_addmm_out", R"(
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/caffe2/test/inductor/test_provenance_tracing.py", line 833, in forward
    y = torch.addmm(c, d, b)
)");
    RAIIAtenRecordFunctionHandle record_aoti_torch_cpu_addmm_out_("aoti_torch_cpu_addmm_out", nullptr);
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu_addmm_out(buf3, arg5_1, buf2, arg4_1, 1L, 1L));
    }
    arg4_1.reset();
    arg5_1.reset();
    buf2.reset();
    auto buf4 = std::move(buf3);  // reuse
    // [Provenance debug handles] cpp_fused_gelu_1:4
    {
    KernelContextGuard _ctx("cpp_fused_gelu_1", R"(
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/cba6f4fb5faa5f79/caffe2/test/inductor/__provenance_tracing__/provenance_tracing#link-tree/caffe2/test/inductor/test_provenance_tracing.py", line 834, in forward
    z = torch.nn.functional.gelu(y)
)");
    cpp_fused_gelu_1((float*)(buf4.data_ptr()));
    }
    output_handles[0] = buf1.release();
    output_handles[1] = buf4.release();
} // AOTInductorModel::run_impl
```

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r  stack_traces
```

Rollback Plan:

Differential Revision: D78436007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160539
Approved by: https://github.com/yiming0416
2025-10-29 22:47:52 +00:00
PyTorch MergeBot
972030fe2e Revert "[pytree] add treespec_{leaf,tuple,dict} functions for args_spec modification (#160843)"
This reverts commit 284716a691.

Reverted https://github.com/pytorch/pytorch/pull/160843 on behalf of https://github.com/atalman due to failing internal torchrec test' ([comment](https://github.com/pytorch/pytorch/pull/160843#issuecomment-3464647878))
2025-10-29 22:46:48 +00:00
Jeff Daily
d401e4e70a [ROCm][CUDA] add unit test utility busy_wait_for_flag (#166218)
torch.cuda._busy_wait_for_flag() will launch a kernel that spins until a flag is set by a corresponding torch.cuda._clear_flag(). These **must** be run on separate streams or it will deadlock.

When used correctly these kernels will put work on the GPU that is more predictable than torch.cuda._sleep() in cases where the unit test is depending on the GPU being busy.

Fixes #120318.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166218
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-29 22:40:23 +00:00
Rob Timpe
e0604d3170 [dynamo] Fix ListIterator tracking mutations to original list (#166350)
Currently ListIteratorVariable copies the underlying list, which prevents it
from seeing mutations to the original list.  Remove the copy to match cpython behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166350
Approved by: https://github.com/guilhermeleobas
ghstack dependencies: #166349, #162768
2025-10-29 21:54:37 +00:00
Rob Timpe
8101fd46d4 [dynamo] Implement iter with a polyfill (#162768)
Currently most variable trackers implement `iter` via `_call_iter_tuple_list`.
This makes it difficult to customize the behavior of `iter` for different
variable types.  Instead, implement `iter` via a polyfill, which will delegate
to the appropriate `__iter__` method.

While this method is more flexible, it increases the overhead of dynamo tracing.
For example, `iter(x)` will generate 9x more instructions than the current
implementation for common iterable types.  Microbenchmarking shows a ~6x
slowdown for this operation.  I suspect this would be much less for realistic
workloads, but more work would be needed to get specific numbers.  If the
performance is a concern we could also consider adding a fast path for types
that are known to correctly implement `__iter__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162768
Approved by: https://github.com/guilhermeleobas
ghstack dependencies: #166349
2025-10-29 21:54:37 +00:00
Rob Timpe
3d4a2d8a93 [dynamo] Add __iter__ for iterable VariableTrackers (#166349)
This is in preparation for implementing iter with a polyfill

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166349
Approved by: https://github.com/guilhermeleobas
2025-10-29 21:54:37 +00:00
Boyuan Feng
bebabd7fce [Graph Partition] move custom rules to inductor config (#166458)
This PR adds `custom_should_partition_ops: list[str]` to specify the name of custom ops upon which graph partition happens. It works with cache since it is a `list[str]` in the config file. The op name should be of format "mylib::baz".

Close: #165341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166458
Approved by: https://github.com/ProExpertProg, https://github.com/eellison, https://github.com/zou3519
2025-10-29 21:43:58 +00:00
Sean McGovern
56a809aa07 [DTensor] Fix torch.all() using incorrect reduction operator (#165924)
Fixes #165923
Corrects the reduction operation to be product.

Enables "all" in the boolean tensor tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165924
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-10-29 20:58:35 +00:00
Yuanyuan Chen
b33762bd2f Fix incomplete test_memory_plots_metadata (#166508)
The different context cases were not fully tested before this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166508
Approved by: https://github.com/Skylion007
2025-10-29 20:55:00 +00:00
Yuanyuan Chen
a3fe1825aa Fix incomplete torch.cdist tests (#166507)
Because the `p` value is not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166507
Approved by: https://github.com/Skylion007
2025-10-29 17:11:07 +00:00
PyTorch MergeBot
d7040e6d75 Revert "[dynamo][guards] 1/N Guard selectively for DTensor (#165824)"
This reverts commit ee7434be82.

Reverted https://github.com/pytorch/pytorch/pull/165824 on behalf of https://github.com/anijain2305 due to internal job failed ([comment](https://github.com/pytorch/pytorch/pull/165824#issuecomment-3462667536))
2025-10-29 16:52:31 +00:00
PyTorch MergeBot
35f3572fa4 Revert "[ROCm] Enable group gemm through CK (#166334)"
This reverts commit 1fa520ea65.

Reverted https://github.com/pytorch/pytorch/pull/166334 on behalf of https://github.com/atalman due to Internal build failures ([comment](https://github.com/pytorch/pytorch/pull/166334#issuecomment-3462640668))
2025-10-29 16:45:02 +00:00
anwang
bc5111cd8d [Inductor] Prevent kernel fusion with too many unique inputs and outputs (#166275)
MTIA triton currently has a limit that it can't support the cases when there are too many input/output buffers. This PR adds the limitation to prevent large fusion with many input/output buffer.

Differential Revision: [D85509351](https://our.internmc.facebook.com/intern/diff/D85509351/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166275
Approved by: https://github.com/eellison
ghstack dependencies: #166274
2025-10-29 16:41:34 +00:00
Millie Chen
398fdd32bb [Inductor] Lower fallback nodes annotated with "should_fallback" (#166339)
Summary:
This PR introduces an inductor-level fallback mechanism that gives users control over which operations or subgraphs Inductor should lower and which should fall back to preexisting kernels. This has similar motivation as #164776 in providing flexibility to selectively disable Inductor lowering for specific nodes.

The implementation simply adds a check for the `"should_fallback"` metadata annotation on FX graph nodes. If this is set to `True`, the lowerer falls back before attempting the normal lowering path. Note that since these are user-directed fallbacks dependent upon specific, customized conditions, use `add_to_fallback_set=False` to avoid permanent overwrites of inductor's lowering/fallback rules.

Simple example marking nodes for fallback based on custom predicates:

```
def should_fallback_predicate(node: torch.fx.Node, pred: Callable[torch.fx.Node, bool]):
    # Apply predicate and mark for fallback if needed
    if self.predicate(node):
         node.meta["should_fallback"] = True
```

Test Plan: added a CI test

Differential Revision: D85347587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166339
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-10-29 16:33:55 +00:00
PyTorch MergeBot
5fd1d41e62 Revert "[user-streams] Make device-agnostic streams weakref compatible (#164304)"
This reverts commit bfc2050db9.

Reverted https://github.com/pytorch/pytorch/pull/164304 on behalf of https://github.com/atalman due to Breaks periodic: test/dynamo/test_streams.py::TestStreams::test_stream_weakref [GH job link](https://github.com/pytorch/pytorch/actions/runs/18909552619/job/53979171605) [HUD commit link](cde81e92b9) ([comment](https://github.com/pytorch/pytorch/pull/164304#issuecomment-3462489278))
2025-10-29 16:09:54 +00:00
PyTorch MergeBot
5cdbcb5233 Revert "[User-streams] Make torch.Event weakref compatible (#164522)"
This reverts commit cde81e92b9.

Reverted https://github.com/pytorch/pytorch/pull/164522 on behalf of https://github.com/atalman due to Breaks periodic: test/dynamo/test_streams.py::TestStreams::test_stream_weakref [GH job link](https://github.com/pytorch/pytorch/actions/runs/18909552619/job/53979171605) [HUD commit link](cde81e92b9) ([comment](https://github.com/pytorch/pytorch/pull/164522#issuecomment-3462450571))
2025-10-29 16:03:03 +00:00
PyTorch MergeBot
d6d6fa26f5 Revert "bwd pass (#164504)"
This reverts commit f36f372acc.

Reverted https://github.com/pytorch/pytorch/pull/164504 on behalf of https://github.com/jeffdaily due to CI had been clean for both cuda and rocm before merge, broke post merge? ([comment](https://github.com/pytorch/pytorch/pull/164504#issuecomment-3462116676))
2025-10-29 15:10:40 +00:00
Way Wang
4a94591321 filter out alloc-free pairs from trace plot (#165752)
Summary:
When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze.
This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view.
The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752
Approved by: https://github.com/zdevito
2025-10-29 12:44:54 +00:00
Xuehai Pan
284716a691 [pytree] add treespec_{leaf,tuple,dict} functions for args_spec modification (#160843)
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.

Changes:

1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
2025-10-29 09:16:24 +00:00
Yuanyuan Chen
8b188647cf [2/N] Fix unused loop variables (#166500)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166500
Approved by: https://github.com/mlazos
2025-10-29 08:30:35 +00:00
etaf
1b655a87ef [xpu][test] Enable more UTs for Intel GPU. (#166047)
This PR enables additional Inductor unit tests for Intel GPU. Due to the increased number of test cases, the number of runners has been extended from 8 to 12 to prevent CI timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166047
Approved by: https://github.com/jansel

Co-authored-by: Deng, Daisy <daisy.deng@intel.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-10-29 06:25:36 +00:00
Amin Sedaghat
17d5aa4767 disable jiterator for complex tan and tanh (#165250)
Fixes #100842

Disable jiterator for complex tan and tanh kernels due to accuracy issues, matching the existing approach used for acos, acosh, asin, and asinh. Reverts to thrust implementation which provides better numerical accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165250
Approved by: https://github.com/ezyang
2025-10-29 04:59:01 +00:00
Michael Lazos
cde81e92b9 [User-streams] Make torch.Event weakref compatible (#164522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164522
Approved by: https://github.com/williamwen42
ghstack dependencies: #162903, #164343, #164344, #164507, #162901, #164304
2025-10-29 04:57:23 +00:00
Michael Lazos
bfc2050db9 [user-streams] Make device-agnostic streams weakref compatible (#164304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164304
Approved by: https://github.com/williamwen42, https://github.com/colesbury
ghstack dependencies: #162903, #164343, #164344, #164507, #162901
2025-10-29 04:57:23 +00:00
Justin Chu
c5701d0ab5 [ONNX] Create fake implementations for onnx ops; fix boolean mask in attention (#165780)
Previously we rely on the concreate implementation to generate fake implementation. This makes the fake implementation overly complicated and breaks in some cases when there are dynamic shapes.

This PR updates onnx op registration to instead take a dedicated fake implementation.

**Also fixed: When boolean mask is supplied to torch sdpa, it was previously taken the negation, which is incorrect.**

Fix https://github.com/pytorch/pytorch/issues/164909 Also taken changes from https://github.com/pytorch/pytorch/pull/156635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165780
Approved by: https://github.com/titaiwangms
2025-10-29 04:51:49 +00:00
Michael Lazos
23669d02a6 [user-cuda-streams] Add cuda streams test suite (#162901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162901
Approved by: https://github.com/williamwen42
ghstack dependencies: #162903, #164343, #164344, #164507
2025-10-29 04:46:08 +00:00
Michael Lazos
e8d887ae3f [user-streams] Support streams as contexts (#164507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164507
Approved by: https://github.com/williamwen42
ghstack dependencies: #162903, #164343, #164344
2025-10-29 04:46:08 +00:00
Yuanyuan Chen
0e19561e23 Add back Windows and macOS to tensorboard tests (#166389)
This PR adds back tensorboard tests on Windows and macOS because the dependency issue is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166389
Approved by: https://github.com/Skylion007
2025-10-29 04:34:57 +00:00
Jagadish Krishnamoorthy
1fa520ea65 [ROCm] Enable group gemm through CK (#166334)
Fixes #161366
All the 4 types of dimension matrix are supported.
2d-2d, 2d-3d, 3d-3d, 3d-2d. The corresponding test cases in test_matmul_cuda are working
for both forward and backward pass.
The CK path is enabled for gfx942, gfx950.
ToDo: Need to enable support on gfx90a since the ck kernel used in this commit produces gpu error,
might require a different CK kernel config, based on the profiler result on gfx90a.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-10-29 04:32:38 +00:00
PyTorch MergeBot
924482a6f6 Replace NUMA inheritance approach (#166026)
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child process to inherit them.

However, this would not work correctly if `start_method="forkserver"`, because the subprocesses would actually inherit their bindings from the forkserver middleman process. In this case, the inherited affinity would actually be incorrect for all but the first subprocess (because the forkserver process would get created lazily, and hence inherit and then stick with the bindings intended for the first subprocess).

# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.

Hopefully this will be the last necessary iteration.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Verified flops/sec and memory locality wins on several different types of jobs
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn

More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)

# Later PR
Update all the documentation when we're confident this has stabilized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k

Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
2025-10-29 03:58:44 +00:00
Sun, Jiayi
20be077085 [Inductor] support masked vectorization for the tail_loop for float64 datatype (#163316)
**Summary:**
Support masked vectorization for the tail_loop for float64 datatype.

**Example:**
```
import torch

def fn(x):
    return x * x

x = torch.randn((22, 22), dtype=torch.double)
with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x)
```

**Generated code:**

- Before
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(480L);x0_tail < static_cast<int64_t>(484L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = double(tmp0 * tmp0);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp1;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163316
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-10-29 03:30:38 +00:00
thenumberouscode
94eaeb9cb8 [Conv1d] Check overflow before we compute padding size. (#162363)
Fixes https://github.com/pytorch/pytorch/issues/161877
also fixes https://github.com/pytorch/pytorch/issues/161875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162363
Approved by: https://github.com/jbschlosser
2025-10-29 03:27:20 +00:00
Yu, Guangye
753d9bd806 Introduce a new API torch.xpu.set_per_process_memory_fraction (#165510)
# Motivation
Aligned with other backends, this PR introduces a new API `torch.xpu.set_per_process_memory_fraction` to allow user to customize the allowed memory per a single process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165510
Approved by: https://github.com/EikanWang, https://github.com/ezyang
ghstack dependencies: #165508, #165509
2025-10-29 03:24:52 +00:00
linhaifeng
695cb0d342 [2/N][Fix] Fix typo in test folder (#166374)
Fix typo in test folder.

_typos.toml
```bash
[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166374
Approved by: https://github.com/cyyever, https://github.com/ezyang
2025-10-29 03:02:07 +00:00
can-gaa-hou
c201a1cab1 [OpenReg] Update Installation in README.md (#166235)
It is recommended to use `python -m pip install --no-build-isolation .` instead of `pip3 install --no-build-isolation .` because most of us use a virtual environment, and the latter probably relies on the system `pip3` rather than the conda or uv. We need to make it consistent with the Python we use, and it is also consistent with how `torch` is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166235
Approved by: https://github.com/fffrog, https://github.com/ezyang
2025-10-29 02:57:26 +00:00
Nicolas Macchioni
f8b4c00294 intfs + unit tests (#164723)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Differential Revision: D83727222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164723
Approved by: https://github.com/aorenste
2025-10-29 02:32:19 +00:00
Laith Sakka
adedf26e21 Support python slicing with tensor inputs. (#165074)
when the slice is tensor, we decompose it to .item() call and pass the unbacked symbol to the slice to avoid DDE.
the diff also fix an existing bug in codegen_dynamic_slice_size in the cpp wrapper.  a +1 should be -1 making it match
python codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165074
Approved by: https://github.com/Lucaskabela
2025-10-29 01:18:45 +00:00
Iris Zhang
48e672d149 [dcp][state_dict] Make _flatten_optim_state_dict and _unflatten_optim_state_dict handle arbitrary-level of nested optim dictionaries by recursion (#165071)
Summary:
This updates the internal helper function of ` _flatten_optim_state_dict` and `_unflatten_optim_state_dict` to handle arbitrary level of nested dictionaries. With this, it can handle optimizer like Shampoo has multiple level of nested dictionary. We parametrized the `shampoo_checkpoint_test.py` to test both for `flatten_optimizer_state_dict=True` or `False`.

Example shampoo nested dictionary:
```
{
    "state": {
        0: {
            "block_0": {
                "shampoo": {
                    "factor_matrices": {
                        0: torch.tensor([[0.0, 0.0], [0.0, 0.0]]),
                        1: torch.tensor([[0.0, 0.0], [0.0, 0.0]]),
                    },
                    "factor_matrix_indices": {},
                    "inv_factor_matrices": {
                        0: torch.tensor([[1.0, 0.0], [0.0, 1.0]]),
                        1: torch.tensor([[1.0, 0.0], [0.0, 1.0]]),
                    },
                },
            },
        },
    },
    "param_groups": [
        {
            "lr": 0.01,
            "betas": (0.9, 1.0),
            "beta3": 0.9,
            "epsilon": 1e-12,
            "momentum": 0.9,
            "dampening": 0.0,
            "weight_decay": 0.0,
            "max_preconditioner_dim": 5,
            "precondition_frequency": 1,
            "start_preconditioning_step": 1,
            "use_nesterov": False,
            "use_bias_correction": True,
            "use_decoupled_weight_decay": True,
            "grafting_config": AdaGradPreconditionerConfig(epsilon=0.001),
            "use_pin_memory": False,
            "distributed_config": SingleDeviceDistributedConfig(
                target_parameter_dimensionality=2
            ),
            "preconditioner_config": self._preconditioner_config,
            "params": [0],
        }
    ],
}
```

With this update, shampoo optimizers can be used with torchtitan without any modification in torchtitan side.

Also, we ensure it is still backward compatible with other torch optimizers like Adam.

Test Plan:
Shampoo test:
```
[irisz@devvm5551.cco0 ~/fbsource/fbcode (49fd905c0b)]$ buck2 test @//mode/opt //hpc/optimizers/distributed_shampoo/dev/distributor/gpu_tests:shampoo_checkpoint_test
Buck UI: https://www.internalfb.com/buck2/ff5e0f02-637d-4a73-b990-c0792a460216
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199373078880
Network: Up: 0B  Down: 0B
Executing actions. Remaining     0/5
Command: test.
Time elapsed: 27.3s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

torch.checkpoint.state_dict test.
```
[irisz@devvm5551.cco0 ~/fbsource/fbcode (49fd905c0b)]$  buck2 test @//mode/opt  //caffe2/test/distributed/checkpoint:test_state_dict
Buck UI: https://www.internalfb.com/buck2/bf367c2c-4d17-4d13-b6c6-f6058211bcf2
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13792273976572052
Network: Up: 0B  Down: 11GiB  (reSessionID-9662acf0-f3de-4993-b4fe-880c33f91f78)
Executing actions. Remaining     0/5
Command: test.
Time elapsed: 5:31.9s
Tests finished: Pass 26. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D83619435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165071
Approved by: https://github.com/fegin
2025-10-29 01:00:38 +00:00
zhxchen17
56afad4eb3 [precompile] Pickle and check closure variable properly. (#166351)
Summary:

Previously we didn't correctly handle closure tuple when there's content in it. Adding additional code for serializing the tuple and merge it with guard manager local scope.

Test Plan:

pytest test/dynamo/test_aot_compile.py

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166351
Approved by: https://github.com/Lucaskabela
2025-10-29 00:28:21 +00:00
Sarthak Tandon
2a058bfecf [ROCm][tunableop] Fixed Offline Tuning file writing (#166074)
- Fixes issue with offline tuning mode, we want to append to the existing file, not delete it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166074
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-29 00:25:45 +00:00
Michael Lazos
6d5e651a50 [user-streams] update stream context to use fork/join (#162903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162903
Approved by: https://github.com/anijain2305
2025-10-28 23:12:05 +00:00
Yiming Zhou
3cc5949dc2 Remove global pytree registration for blockmask (#166434)
The global pytree registration of `BlockMask` was added in https://github.com/pytorch/pytorch/pull/166045

In general ppl assume `BlockMask` is a leaf, so the global registration  could lead to some unexpected failure when calling `tree_map()` on a `BlockMask` since now it will flatten all the way down.

Therefore, we remove the global registration but keep the `_flatten()` and `_unflatten()` classmethod. Users could do a local registration easily when it is needed.

in pytorch
```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export
```

in torchtitan
```
python -m tests.integration_tests.run_tests ./outputs --test_suite features --ngpu 8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166434
Approved by: https://github.com/wwwjn
2025-10-28 23:11:52 +00:00
Shangdi Yu
f167fd09fa [annotation] Override metadata on regenerated node in functional mode (#166200)
Fixes #165810

If we regenerate a node during functionalization, we override the "stack_trace", "custom", and "seq_nr" metadata of the regenerated node with the node meta of the original node.

```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_replay_view
python test/functorch/test_aotdispatch.py TestAOTAutogradWithDynamo.test_duplicated_arguments_on_tensor_overlap
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166200
Approved by: https://github.com/bdhirsh
2025-10-28 22:59:39 +00:00
Angel Li
f36f372acc bwd pass (#164504)
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.

**Benchmarking**

To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.

Settings:

- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs

|        | Variable Length API | SDPA     |
|--------|--------------------|----------|
| Runtime | 0.8189142608642578 ms       | 3.263883056640625 ms  |
| TFLOPs | 268.652       | 158.731  |

We can see that runtime for Varlen is >3x faster

**Testing**

Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.

For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164504
Approved by: https://github.com/drisspg
2025-10-28 22:35:11 +00:00
Scott Wolchok
572cc12b42 Move MaskPartial to placement_types to improve discoverability (#164414)
Had trouble finding this one myself in #163030.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164414
Approved by: https://github.com/ezyang
2025-10-28 21:56:02 +00:00
Angel Li
08ae55021e support batch size=0 for flash attention (#166318)
Fixes #165944

**Summary**

Today, if we attempt to run flash attention with batch_size 0, we get error `Runtime Error: batch size must be positive`. This PR fixes this by returning early with empty tensors in the fwd and bwd.

**Test plan**
`python test/test_transformers.py -k test_scaled_dot_product_attention` - added case for batch_size=0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166318
Approved by: https://github.com/drisspg
2025-10-28 21:53:48 +00:00
Simon Layton
b5189e269e NVFP4 grouped gemm support via. FBGEMM kernels (#166308)
Summary:

* Add NVFP4 (1x16 block e4m3, tensor-wise fp32) scaled grouped gemm
* Extend testing to add nvfp4 support

Test Plan:

```
pytest -svv -k grouped test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166308
Approved by: https://github.com/ngimel
2025-10-28 20:32:53 +00:00
Shunting Zhang
3895ce093f [inductor] add in-kernel nan-check (#166008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166008
Approved by: https://github.com/eellison
2025-10-28 20:19:10 +00:00
Catherine Lee
8aa087a29d [ez] Fix print for failing test when entire file fails (#166420)
Was previously printing "FAILED CONSISTENTLY: ul" since it was null,
This changes it so it prints the test_file by moving some logic for checking this to be earlier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166420
Approved by: https://github.com/Skylion007
2025-10-28 20:13:58 +00:00
Ved Thorat
21b48f8dfa Fixes torch.compile(nn.ModuleList()) changes bool() behavior (#159208)
Fixes #159139
## The Cause

The bug occurs because the OptimizedModule wrapper in torch._dynamo.eval_frame doesn't call the len method. This causes Python's bool() check to fall back to the default object truthiness (always True) instead of correctly evaluating containers with len() == 0 as False.
## The Fix

A very easy fix . I just added the len method to OptimizedModule in torch._dynamo.eval_frame class to delegate the call to the original module
```python
def __len__(self):
    """
    Proxy the len() call to the original module to fix truthiness checks.
    """
    return len(self._orig_mod)
```
This successfully fixes the issue . The script now works as expected.
## Reproduction Script
```python
import torch
import torch.nn as nn

# Create an empty nn.ModuleList
original = nn.ModuleList()

# Compile it using torch.compile
compiled = torch.compile(original)

# Compare their boolean evaluations
print(f"bool(original): {bool(original)}")
print(f"bool(compiled): {bool(compiled)}")

# Trigger failure if they differ
assert bool(original) == bool(compiled), "BUG: truthiness behavior mismatch after compilation"
```
## Output

bool(original): False
bool(compiled): False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159208
Approved by: https://github.com/andrewboldi, https://github.com/Lucaskabela

Co-authored-by: pushkar-hue <pushkarsharma.rtm@gmail.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
2025-10-28 19:21:24 +00:00
Elana
e3e93c7107 [MPS] Fix random in-place ops on non-contiguous tensors (#165267)
Random in-place operations (normal_, uniform_, exponential_, bernoulli_, random_) were silently failing on non-contiguous tensors on macOS < 15.0.

* Added needsGather check and scatter-back logic to handle non-contiguous output tensors, following the pattern used in PointwiseOps.

* Adds test to confirm these now work
* Remove pre-macOS15 xfail for test_Dropout

Fixes #165257 and #124029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165267
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-28 18:43:22 +00:00
Bin Bao
687c15c0b3 [AOTI][BE] Change test_aoti_inference to one-pass build (#164277)
Summary: To fix https://github.com/pytorch/pytorch/issues/159400. Currently, test_aoti_abi_check and test_aoti_inference need to be built in two passes, first build pytorch using the regular `pythonsetup.py develop` and then build with `CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 python setup.py devleop`. This is cumbersome. Fix by rewriting CMakeLists.txt for test_aoti_inference to one-pass build which runs AOTI to compile models at the test time. Also update CI test script to get rid of two-pass build. For test_aoti_abi_check, it is not AOTI specific, so we make it not guarded by BUILD_AOT_INDUCTOR_TEST.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164277
Approved by: https://github.com/janeyx99
2025-10-28 17:43:22 +00:00
Jeff Daily
895795f07c [ROCm][CI] forward fix kineto submodule bump (#166421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166421
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-28 17:40:23 +00:00
Prachi Gupta
ac841267a1 [ROCm] skip AsyncTP test class as AsyncTP is not supported on ROCm (#166316)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166316
Approved by: https://github.com/jeffdaily
2025-10-28 16:23:46 +00:00
drisspg
5016e7b2eb [FlexAttention] Add mechanism to get optimal autotune decision (#165817)
Script: https://github.com/meta-pytorch/attention-gym/pull/169

Feels directionally okay but there is some bike shedding / this could be quite prone to collision of keys depending on mask mod and score mod changes and simple cache key.

Usecase: https://github.com/meta-pytorch/attention-gym/pull/169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165817
Approved by: https://github.com/Chillee
2025-10-28 15:50:12 +00:00
johannes
3041ede082 Improve eig tests in preparation for new eig backends (#166322)
### Summary
Improves validation of `torch.linalg.eig` results by verifying the eigen decomposition identity **A v − v λ = 0**.

### Motivation
Eigenvectors are not unique, and numerical differences between backends (cuSOLVER, MAGMA, CPU)
can cause false test failures. This PR replaces direct elementwise comparisons with a mathematical
identity check, improving robustness across devices.

### Details
- Introduces `fulfills_eigen_decomposition_identity()` in `test_eig_compare_backends()` to validate the eigen equation.
- Uses CPU matmul for high-precision verification.
- Handles zero-sized matrices explicitly.
- Tolerances derived from numerical comparisons between cuSOLVER and NumPy.
  See discussion: [dev-discuss.pytorch.org link](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/6)

### Impact
- Improves test stability and correctness across eig backends.
- No change to public API.
- All tests pass; lintrunner reports no issues.
- Enables introduction of new eig backends without false test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166322
Approved by: https://github.com/lezcano
2025-10-28 14:42:47 +00:00
Sherlock Huang
34d6ef7022 Update gm.print_readable to include Annotation (#165397)
Sample output
```
[rank0]:        # Annotation: {'compile_with_inductor': 'flex_attention'} File: /data/users/bahuang/pytorch/torch/nn/attention/flex_attention.py:1490 in flex_attention, code: out, lse, max_scores = flex_attention_hop(
[rank0]:        score_mod_2 = self.score_mod_2
[rank0]:        mask_fn_2 = self.mask_fn_2
[rank0]:        flex_attention_1 = torch.ops.higher_order.flex_attention(xq_5, xk_5, xv_3, score_mod_2, (2048, 2048, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_indices, 128, 128, mask_fn_2), 0.25, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True, 'OUTPUT_MAX': False}, (), (g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___mask_mod___closure___0_cell_contents,));  xq_5 = xk_5 = xv_3 = score_mod_2 = mask_fn_2 = None
[rank0]:        out_2: "bf16[8, 4, 2048, 16]" = flex_attention_1[0];  flex_attention_1 = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165397
Approved by: https://github.com/yushangdi, https://github.com/anijain2305, https://github.com/mlazos
2025-10-28 13:54:38 +00:00
PyTorch MergeBot
110efe4df4 Revert "[inductor][choices] lookup table choices 1/3 (#164978)"
This reverts commit b44423bbb4.

Reverted https://github.com/pytorch/pytorch/pull/164978 on behalf of https://github.com/atalman due to failing internal test on newly added tests: Test when there's no lookup table entry with different autotune modes ([comment](https://github.com/pytorch/pytorch/pull/164978#issuecomment-3456400126))
2025-10-28 13:12:55 +00:00
Roman Krasavtsev
e137cd0a10 docs: fix typos (#164879)
Correct typos in the comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164879
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/cyyever
2025-10-28 12:00:36 +00:00
William Wen
32fe4f681e [dynamo] fix keyerror in resume_execution (again) (#166040)
Fixes https://github.com/pytorch/pytorch/issues/166176

The error I attempted to fix in https://github.com/pytorch/pytorch/pull/162318 was still appearing internally.

Surprised that this wasn't caught anywhere 😰

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166040
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166036
2025-10-28 07:04:29 +00:00
William Wen
ebb2b2e894 [dynamo] fix store attr graph break in with block (#166036)
Fixes https://github.com/pytorch/pytorch/issues/166033

Differential Revision: [D85198055](https://our.internmc.facebook.com/intern/diff/D85198055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166036
Approved by: https://github.com/Lucaskabela
2025-10-28 07:04:29 +00:00
KarhouTam
13413b3b07 [AMP][Refactor] Autocast dtype handling to simplify device-specific c… (#165221)
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.

**The former PR #163446 was merged but reverted due to failed CI test on `openreg` related tests.**

This RR additionally slightly modified some test assertions for passing the CI tests. CI failed due to assertion for the exactly same error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
    with self.assertWarnsRegex(
        AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```

Sorry for the inconvenience again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/albanD
2025-10-28 06:21:29 +00:00
Shunting Zhang
5d0b3e28dc [inductor] generate fused rms/layer norm bwd (#165370)
RMS/Layer norm backward would generated 2 kind of reductions:
- the reduction computing dx which reduce across the hidden dimension (in the context of transformer)
- the reduction computing dw/db which reduce across the BxT (batch size , sequence length) dimension.

These 2 set of reductions have common input buffers but inductor can not fuse them because of different loop orders.

There are multiple sources of custom kernels that implement fused version of such kernel (Liger-Kernel, quack, Paul Zhang's internal post). This PR enable Inductor to generate such kernels automatically.

The generated kernel is very similar to 33924d20b6/src/liger_kernel/ops/rms_norm.py (L114) .

To make the implementation simple and performing, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here to make sure a loaded tensor does not need to be reload.

To make sure the inner reduction (computing dx) and outer reductions (computing dw/db) being fusible, the PR does the following:
1. convert the outer reductions to pointwise by replacing 'reduction' & 'store_reduction' node with a new type of node 'parital_accumulate'. The new node will collect the reduction type, buffer name, input of reduction etc, which is essential for proper codegening.
2. do loop reordering (rely on the earlier loop ordering after fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction
3. there can be epilogues that need to be added in the end. E.g. the outer reduction may be followed by a division for mean , or followed by a down cast if dw/db is in low precision (fp16/bf16).

Some early benchmarking on H100 shows about 2X speedup for both RMSNorm and LayerNorm backward for shape (1152 * 500, 384 ) used in some internal model. Note that, I manually disable split reduction in this benchmarking since otherwise the fusion will be skipped right now. The next PR will make the mix-order-reduction compose better with split reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370
Approved by: https://github.com/jansel
ghstack dependencies: #166204
2025-10-28 05:53:52 +00:00
Zhengxu Chen
f93ea7dab1 [export] Update dynamo_graph_capture_for_export to return GraphModule. (#166091)
Make dynamo_graph_capture_for_export return a more compatible GraphModule object which is closer the the original behavior of dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166091
Approved by: https://github.com/tugsbayasgalan
2025-10-28 04:23:28 +00:00