Commit Graph

95128 Commits

Author SHA1 Message Date
PyTorch MergeBot
d7040e6d75 Revert "[dynamo][guards] 1/N Guard selectively for DTensor (#165824)"
This reverts commit ee7434be82.

Reverted https://github.com/pytorch/pytorch/pull/165824 on behalf of https://github.com/anijain2305 due to internal job failed ([comment](https://github.com/pytorch/pytorch/pull/165824#issuecomment-3462667536))
2025-10-29 16:52:31 +00:00
PyTorch MergeBot
35f3572fa4 Revert "[ROCm] Enable group gemm through CK (#166334)"
This reverts commit 1fa520ea65.

Reverted https://github.com/pytorch/pytorch/pull/166334 on behalf of https://github.com/atalman due to Internal build failures ([comment](https://github.com/pytorch/pytorch/pull/166334#issuecomment-3462640668))
2025-10-29 16:45:02 +00:00
anwang
bc5111cd8d [Inductor] Prevent kernel fusion with too many unique inputs and outputs (#166275)
MTIA triton currently has a limit that it can't support the cases when there are too many input/output buffers. This PR adds the limitation to prevent large fusion with many input/output buffer.

Differential Revision: [D85509351](https://our.internmc.facebook.com/intern/diff/D85509351/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166275
Approved by: https://github.com/eellison
ghstack dependencies: #166274
2025-10-29 16:41:34 +00:00
Millie Chen
398fdd32bb [Inductor] Lower fallback nodes annotated with "should_fallback" (#166339)
Summary:
This PR introduces an inductor-level fallback mechanism that gives users control over which operations or subgraphs Inductor should lower and which should fall back to preexisting kernels. This has similar motivation as #164776 in providing flexibility to selectively disable Inductor lowering for specific nodes.

The implementation simply adds a check for the `"should_fallback"` metadata annotation on FX graph nodes. If this is set to `True`, the lowerer falls back before attempting the normal lowering path. Note that since these are user-directed fallbacks dependent upon specific, customized conditions, use `add_to_fallback_set=False` to avoid permanent overwrites of inductor's lowering/fallback rules.

Simple example marking nodes for fallback based on custom predicates:

```
def should_fallback_predicate(node: torch.fx.Node, pred: Callable[torch.fx.Node, bool]):
    # Apply predicate and mark for fallback if needed
    if self.predicate(node):
         node.meta["should_fallback"] = True
```

Test Plan: added a CI test

Differential Revision: D85347587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166339
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-10-29 16:33:55 +00:00
PyTorch MergeBot
5fd1d41e62 Revert "[user-streams] Make device-agnostic streams weakref compatible (#164304)"
This reverts commit bfc2050db9.

Reverted https://github.com/pytorch/pytorch/pull/164304 on behalf of https://github.com/atalman due to Breaks periodic: test/dynamo/test_streams.py::TestStreams::test_stream_weakref [GH job link](https://github.com/pytorch/pytorch/actions/runs/18909552619/job/53979171605) [HUD commit link](cde81e92b9) ([comment](https://github.com/pytorch/pytorch/pull/164304#issuecomment-3462489278))
2025-10-29 16:09:54 +00:00
PyTorch MergeBot
c594950e86 Revert "nn.Linear: nD contiguous input + bias -- dispatch to addmm also when weight is sparse (#166071)"
This reverts commit 467c21ad9a.

Reverted https://github.com/pytorch/pytorch/pull/166071 on behalf of https://github.com/atalman due to Multiple CI breakages: test/profiler/test_profiler_tree.py::TestProfilerTree::test_profiler_experimental_tree_with_stack_and_modules [GH job link](https://github.com/pytorch/pytorch/actions/runs/18909087335/job/53976915830) [HUD commit link](467c21ad9a) ([comment](https://github.com/pytorch/pytorch/pull/166071#issuecomment-3462458968))
2025-10-29 16:05:30 +00:00
Laith Sakka
14102fb1f3 add new line in log (#164240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164240
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007, https://github.com/ezyang
ghstack dependencies: #164075
2025-10-29 16:03:32 +00:00
PyTorch MergeBot
5cdbcb5233 Revert "[User-streams] Make torch.Event weakref compatible (#164522)"
This reverts commit cde81e92b9.

Reverted https://github.com/pytorch/pytorch/pull/164522 on behalf of https://github.com/atalman due to Breaks periodic: test/dynamo/test_streams.py::TestStreams::test_stream_weakref [GH job link](https://github.com/pytorch/pytorch/actions/runs/18909552619/job/53979171605) [HUD commit link](cde81e92b9) ([comment](https://github.com/pytorch/pytorch/pull/164522#issuecomment-3462450571))
2025-10-29 16:03:03 +00:00
Mikayla Gawarecki
eae701cad0 Add scaffolding for StableIValue FC/BC (no PoC) (#164332)
1. Add `extension_build_version` and `is_internal` to `FromImpl`/`ToImpl` (this will be useful for future if we need to break the BC of any type) #163832 has the PoC of how we would actually use this system
2. Add `aoti_torch_library_impl_v2` that takes in an additional `extension_build_version` argument, updates callsite in `torch/csrc/stable/library.h` to always pass `TORCH_ABI_VERSION` for this argument
3. Add `extension_build_version` to `from_ivalue` and `to_ivalue` and update all callsites
4. Add a private `_from` and `_to` that pass `is_internal=True` to `FromImpl`/`ToImpl`, making it easier to reason about what is being called from libtorch-land / extension-land

**Note: This PR does not include a linter that tells the user to update from/to if changing the ABI of a type in headeronly, which I intend to do in https://github.com/pytorch/pytorch/pull/163998**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164332
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356, #166373, #163683
2025-10-29 15:41:45 +00:00
Mikayla Gawarecki
8f51556daa Add scaffolding for aoti_torch_call_dispatcher BC with native ops (#163683)
Part 1 of plan in https://docs.google.com/document/d/1MaX51H5aEQE5XnOlnZIpf9oCYwzGrTWkgBACxNzsmWE/edit?usp=sharing

- Upgrade `aoti_torch_call_dispatcher` to v2 with an `extension_build_version`
- Allow registration of StableIValue stack  --> IValue stack adapters for schema changes

#### Note: This PR does not include a linter that tells the user to add the upgrader if the schema changes, which is an important piece that will be added in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163683
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356, #166373
2025-10-29 15:41:45 +00:00
Mikayla Gawarecki
c0bbda37e8 Move static from_ivalue/to_ivalue to new shim_common.cpp (#166373)
Move `from_ivalue` and `to_ivalue` and their dependents `StableIValueBoxedKernel`, `aoti_torch_library_impl` `aoti_torch_call_dispatcher` into new (non-aoti shim_common.cpp)

This is in prep for the above PRs where I add v2s (`torch_call_dispatcher` and `torch_library_impl`) that are versioning aware

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166373
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356
2025-10-29 15:41:36 +00:00
Mikayla Gawarecki
fefb546b91 Add TORCH_TARGET_VERSION for stable ABI (#164356)
And update it so comparisons can be done by the preprocessor

**Note: We also need to gate in shim.h and figure out how to enforce this**

Differential Revision: [D85683549](https://our.internmc.facebook.com/intern/diff/D85683549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164356
Approved by: https://github.com/janeyx99
2025-10-29 15:41:28 +00:00
PyTorch MergeBot
d6d6fa26f5 Revert "bwd pass (#164504)"
This reverts commit f36f372acc.

Reverted https://github.com/pytorch/pytorch/pull/164504 on behalf of https://github.com/jeffdaily due to CI had been clean for both cuda and rocm before merge, broke post merge? ([comment](https://github.com/pytorch/pytorch/pull/164504#issuecomment-3462116676))
2025-10-29 15:10:40 +00:00
Nikita Vedeneev
467c21ad9a nn.Linear: nD contiguous input + bias -- dispatch to addmm also when weight is sparse (#166071)
As per title.

It seems safe to be able to generalize to arbitrary contiguous inputs since `at::matmul` is likely to do the flattening to avoid `baddmm`.

Additionally, we guard for bias to be 1D and contiguous which is guaranteed to be fused with no copies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166071
Approved by: https://github.com/ngimel
2025-10-29 13:13:40 +00:00
Way Wang
4a94591321 filter out alloc-free pairs from trace plot (#165752)
Summary:
When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze.
This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view.
The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752
Approved by: https://github.com/zdevito
2025-10-29 12:44:54 +00:00
PyTorch MergeBot
5e7272b60a Revert "[BE] Move GreenContext implementation details to cpp (#166462)"
This reverts commit afaaaa314c.

Reverted https://github.com/pytorch/pytorch/pull/166462 on behalf of https://github.com/atalman due to multiple internal build failures ([comment](https://github.com/pytorch/pytorch/pull/166462#issuecomment-3461145801))
2025-10-29 11:59:41 +00:00
PyTorch MergeBot
1dd6b76914 Revert "[1/N] Remove unused loop variables (#166258)"
This reverts commit 76b2c37045.

Reverted https://github.com/pytorch/pytorch/pull/166258 on behalf of https://github.com/atalman due to breaks test/distributed/test_serialization.py::TestSerialization::test_weights_only [GH job link](https://github.com/pytorch/pytorch/actions/runs/18894311802/job/53929321703) [HUD commit link](76b2c37045) ([comment](https://github.com/pytorch/pytorch/pull/166258#issuecomment-3460964612))
2025-10-29 11:10:37 +00:00
Xuehai Pan
284716a691 [pytree] add treespec_{leaf,tuple,dict} functions for args_spec modification (#160843)
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.

Changes:

1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
2025-10-29 09:16:24 +00:00
Yuanyuan Chen
8b188647cf [2/N] Fix unused loop variables (#166500)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166500
Approved by: https://github.com/mlazos
2025-10-29 08:30:35 +00:00
Aaron Gokaslan
96b61844a7 [BE]: Update nvshmem to 3.4.5 (#164046)
Release notes can be found here: https://docs.nvidia.com/nvshmem/release-notes-install-guide/release-notes/release-3405.html main difference is the addition of a CPU assisted IBGDA fallback which should allow NVSHMEM IBGDA to work on way more systems without admin intervention and without using GDRCopy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164046
Approved by: https://github.com/ezyang, https://github.com/kwen2501
2025-10-29 07:32:05 +00:00
etaf
1b655a87ef [xpu][test] Enable more UTs for Intel GPU. (#166047)
This PR enables additional Inductor unit tests for Intel GPU. Due to the increased number of test cases, the number of runners has been extended from 8 to 12 to prevent CI timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166047
Approved by: https://github.com/jansel

Co-authored-by: Deng, Daisy <daisy.deng@intel.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-10-29 06:25:36 +00:00
fffrog
cb6966704c Add merge rule for PrivateUse1 Module (#166394)
Add merge rights for the following people:
- albanD
- fffrog
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166394
Approved by: https://github.com/ezyang
2025-10-29 06:13:44 +00:00
Amin Sedaghat
17d5aa4767 disable jiterator for complex tan and tanh (#165250)
Fixes #100842

Disable jiterator for complex tan and tanh kernels due to accuracy issues, matching the existing approach used for acos, acosh, asin, and asinh. Reverts to thrust implementation which provides better numerical accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165250
Approved by: https://github.com/ezyang
2025-10-29 04:59:01 +00:00
Michael Lazos
cde81e92b9 [User-streams] Make torch.Event weakref compatible (#164522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164522
Approved by: https://github.com/williamwen42
ghstack dependencies: #162903, #164343, #164344, #164507, #162901, #164304
2025-10-29 04:57:23 +00:00
Michael Lazos
bfc2050db9 [user-streams] Make device-agnostic streams weakref compatible (#164304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164304
Approved by: https://github.com/williamwen42, https://github.com/colesbury
ghstack dependencies: #162903, #164343, #164344, #164507, #162901
2025-10-29 04:57:23 +00:00
Justin Chu
c5701d0ab5 [ONNX] Create fake implementations for onnx ops; fix boolean mask in attention (#165780)
Previously we rely on the concreate implementation to generate fake implementation. This makes the fake implementation overly complicated and breaks in some cases when there are dynamic shapes.

This PR updates onnx op registration to instead take a dedicated fake implementation.

**Also fixed: When boolean mask is supplied to torch sdpa, it was previously taken the negation, which is incorrect.**

Fix https://github.com/pytorch/pytorch/issues/164909 Also taken changes from https://github.com/pytorch/pytorch/pull/156635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165780
Approved by: https://github.com/titaiwangms
2025-10-29 04:51:49 +00:00
Michael Lazos
23669d02a6 [user-cuda-streams] Add cuda streams test suite (#162901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162901
Approved by: https://github.com/williamwen42
ghstack dependencies: #162903, #164343, #164344, #164507
2025-10-29 04:46:08 +00:00
Michael Lazos
e8d887ae3f [user-streams] Support streams as contexts (#164507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164507
Approved by: https://github.com/williamwen42
ghstack dependencies: #162903, #164343, #164344
2025-10-29 04:46:08 +00:00
Junjie Wang (PyTorch)
774abb018e [ptd] Fix test config in destroy_pg (#166463)
Summary: When device_type is CPU we will not use device id from CUDA which is enabled in https://github.com/pytorch/pytorch/pull/161015. However, we should not exclude the case when the accelerator itself is CPU. This PR fixes it.

Test Plan: UT

Differential Revision: D85714901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166463
Approved by: https://github.com/mori360, https://github.com/fegin
2025-10-29 04:35:04 +00:00
Yuanyuan Chen
0e19561e23 Add back Windows and macOS to tensorboard tests (#166389)
This PR adds back tensorboard tests on Windows and macOS because the dependency issue is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166389
Approved by: https://github.com/Skylion007
2025-10-29 04:34:57 +00:00
Jagadish Krishnamoorthy
1fa520ea65 [ROCm] Enable group gemm through CK (#166334)
Fixes #161366
All the 4 types of dimension matrix are supported.
2d-2d, 2d-3d, 3d-3d, 3d-2d. The corresponding test cases in test_matmul_cuda are working
for both forward and backward pass.
The CK path is enabled for gfx942, gfx950.
ToDo: Need to enable support on gfx90a since the ck kernel used in this commit produces gpu error,
might require a different CK kernel config, based on the profiler result on gfx90a.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-10-29 04:32:38 +00:00
PaulZhang12
c2e3cc7aed [Inductor] No longer throw error in bmm out_dtype lowering due to template heuristics (#166457)
Fixes https://github.com/pytorch/pytorch/issues/165892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166457
Approved by: https://github.com/coconutruben
2025-10-29 04:27:13 +00:00
PyTorch UpdateBot
5849eea129 [vision hash update] update the pinned vision hash (#166356)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166356
Approved by: https://github.com/pytorchbot
2025-10-29 04:14:16 +00:00
PyTorch MergeBot
924482a6f6 Replace NUMA inheritance approach (#166026)
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child process to inherit them.

However, this would not work correctly if `start_method="forkserver"`, because the subprocesses would actually inherit their bindings from the forkserver middleman process. In this case, the inherited affinity would actually be incorrect for all but the first subprocess (because the forkserver process would get created lazily, and hence inherit and then stick with the bindings intended for the first subprocess).

# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.

Hopefully this will be the last necessary iteration.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Verified flops/sec and memory locality wins on several different types of jobs
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn

More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)

# Later PR
Update all the documentation when we're confident this has stabilized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k

Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
2025-10-29 03:58:44 +00:00
Sun, Jiayi
20be077085 [Inductor] support masked vectorization for the tail_loop for float64 datatype (#163316)
**Summary:**
Support masked vectorization for the tail_loop for float64 datatype.

**Example:**
```
import torch

def fn(x):
    return x * x

x = torch.randn((22, 22), dtype=torch.double)
with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x)
```

**Generated code:**

- Before
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(480L);x0_tail < static_cast<int64_t>(484L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = double(tmp0 * tmp0);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp1;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163316
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-10-29 03:30:38 +00:00
thenumberouscode
94eaeb9cb8 [Conv1d] Check overflow before we compute padding size. (#162363)
Fixes https://github.com/pytorch/pytorch/issues/161877
also fixes https://github.com/pytorch/pytorch/issues/161875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162363
Approved by: https://github.com/jbschlosser
2025-10-29 03:27:20 +00:00
Yu, Guangye
753d9bd806 Introduce a new API torch.xpu.set_per_process_memory_fraction (#165510)
# Motivation
Aligned with other backends, this PR introduces a new API `torch.xpu.set_per_process_memory_fraction` to allow user to customize the allowed memory per a single process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165510
Approved by: https://github.com/EikanWang, https://github.com/ezyang
ghstack dependencies: #165508, #165509
2025-10-29 03:24:52 +00:00
Yuanyuan Chen
dd1fe7c22f Remove clang-tidy type conversion suppressions (#166398)
This PR fixes and removes type conversion suppressions of clang-tidy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166398
Approved by: https://github.com/Skylion007
2025-10-29 03:21:16 +00:00
linhaifeng
695cb0d342 [2/N][Fix] Fix typo in test folder (#166374)
Fix typo in test folder.

_typos.toml
```bash
[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166374
Approved by: https://github.com/cyyever, https://github.com/ezyang
2025-10-29 03:02:07 +00:00
linhaifeng
1764f3a9c8 [Fix] fix gramma error in PyTorch docs (#166158)
Fix several gramma errors in PyTorch docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166158
Approved by: https://github.com/yewentao256, https://github.com/cyyever, https://github.com/ezyang
2025-10-29 03:01:07 +00:00
Yu, Guangye
c9eabadc5e Suppress std::hardware_destructive_interference_size warning on GCC 13+ (#166297)
# Motivation
In https://github.com/pytorch/pytorch/pull/145591, `std::hardware_destructive_interference_size` was introduced in CUDACachingAllocator. Later, https://github.com/pytorch/pytorch/pull/160067 moved it to `c10/core/alignment.h` for code reuse.
However, on **GCC 13+** using `std::hardware_destructive_interference_size` triggers the following warning:
```bash
warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size]
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: if this use is part of a public ABI, change it to instead use a constant variable you define
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: the default value for the current CPU tuning is 64 bytes
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’
```

# Solution
- Solution 1: Replace `c10::hardware_destructive_interference_size` with a constant 64.
```cpp
constexpr std::size_t hardware_destructive_interference_size = 64;
```

- Solution 2: adding `-Wno-interference-size’ to 8d4e48831e/cmake/public/utils.cmake (L386) to suppress the warning.

# Additional Context
The current implementation uses the second approach. If the reviewers prefer the first approach, I am happy to update it accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166297
Approved by: https://github.com/ezyang
2025-10-29 02:57:46 +00:00
can-gaa-hou
c201a1cab1 [OpenReg] Update Installation in README.md (#166235)
It is recommended to use `python -m pip install --no-build-isolation .` instead of `pip3 install --no-build-isolation .` because most of us use a virtual environment, and the latter probably relies on the system `pip3` rather than the conda or uv. We need to make it consistent with the Python we use, and it is also consistent with how `torch` is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166235
Approved by: https://github.com/fffrog, https://github.com/ezyang
2025-10-29 02:57:26 +00:00
Michael Lazos
e105a47575 [user-streams] Have StreamVariable inherit from StreamContextVariable (#164344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164344
Approved by: https://github.com/williamwen42
ghstack dependencies: #162903, #164343
2025-10-29 02:49:54 +00:00
Michael Lazos
aab27b051a [user-streams] Move StreamContextVariable into streams module (#164343)
finish moving

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164343
Approved by: https://github.com/williamwen42, https://github.com/fxdawnn
ghstack dependencies: #162903
2025-10-29 02:49:54 +00:00
Nicolas Macchioni
f8b4c00294 intfs + unit tests (#164723)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Differential Revision: D83727222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164723
Approved by: https://github.com/aorenste
2025-10-29 02:32:19 +00:00
Nikita Shulga
877f126e35 [MPS] Improve index_select error checking (#166468)
Just copy-n-paste overlap checks from
0d4992c170/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L1620-L1622)

Very similar to https://github.com/pytorch/pytorch/pull/166425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166468
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-10-29 02:23:12 +00:00
Maggie Moss
4fada51ada Fix existing Pyrefly errors (#166439)
Trying to keep main as clean of type errors as possible until we are able to swtich to just one checker.

This adds suppressions for existing type errors on main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166439
Approved by: https://github.com/Skylion007
2025-10-29 02:08:02 +00:00
Yuanyuan Chen
76b2c37045 [1/N] Remove unused loop variables (#166258)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-29 01:34:15 +00:00
Laith Sakka
adedf26e21 Support python slicing with tensor inputs. (#165074)
when the slice is tensor, we decompose it to .item() call and pass the unbacked symbol to the slice to avoid DDE.
the diff also fix an existing bug in codegen_dynamic_slice_size in the cpp wrapper.  a +1 should be -1 making it match
python codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165074
Approved by: https://github.com/Lucaskabela
2025-10-29 01:18:45 +00:00
Nicolas De Carli
bea89d6060 [PyTorch] Improve conversion from/to bool on aarch64+sve (#166330)
Summary:
We are adding autovec routines to convert to/from boolean values

We observed the following performance improvements when compiling targeting armv9-a+sve2+fp16+bf16

before:

bool->uint8->bool ===> 447.854us
bool->int8->bool ===> 445.609us
bool->int16->bool ===> 312.425us
bool->int32->bool ===> 324.368us
bool->float->bool ===> 320.929us
bool->float16->bool ===> 290.825us
bool->bfloat16->bool ===> 437.250us

after

bool->uint8->bool ===> 78.988us ----> 467% higher throughput
bool->int8->bool ===> 78.494us -----> 468% higher throughput
bool->int16->bool ===> 107.993us ----> 189% higher throughput
bool->int32->bool ===> 186.887us -----> 74% higher throughput
bool->float->bool ===> 188.048us ------> 71% higher throughput
bool->float16->bool ===> 102.789us --> 183% higher throughput
bool->bfloat16->bool ===> 105.809us -> 313% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85533284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166330
Approved by: https://github.com/mcfi
2025-10-29 01:09:34 +00:00