Commit Graph

49941 Commits

Author SHA1 Message Date
dolpm
51a708ffc6 [nativert] libtorch kernel registry (#157150)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77451703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157150
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-07-16 12:36:55 +00:00
Hari Krishna Sai Kodali
9d184bda2f add device generalization support for distributed tests (#156796)
MOTIVATION
Generalize distributed test cases to run on non-CUDA devices.

CHANGES

- test/distributed/checkpoint/test_fsspec.py
- test/distributed/checkpoint/test_state_dict.py
- test/distributed/test_multi_threaded_pg.py

Replaced hard-coded device names with torch.accelerator.current_accelerator (see the sketch below)

- torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py

Added support for the hccl backend
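
A minimal sketch of the pattern, assuming a test that previously hard-coded "cuda" (the helper below is illustrative, not part of the PR):

```python
import torch

# torch.accelerator.current_accelerator() returns a torch.device or None,
# so fall back to "cpu" when no accelerator is present.
def current_device_type() -> str:
    acc = torch.accelerator.current_accelerator()
    return acc.type if acc is not None else "cpu"

x = torch.ones(4, device=current_device_type())
```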

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-07-16 09:37:03 +00:00
NikhilAPatel
ea74fdd24a [Inductor][Triton] Update TMA Compatibility Requirements (#157881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157881
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-07-16 09:31:44 +00:00
Manuel Candales
fb9a5d248f Fix torch._numpy to match NumPy when empty ellipsis causes advanced indexing separation (#158297)
Fixes #141563

In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch, an empty ellipsis doesn't cause a separation, which leads to differing behavior between NumPy and PyTorch in this edge case.

This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> import torch
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```

As with #157676, this PR doesn't change PyTorch's behavior; it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch's behavior isn't modified.

Note that there are still other bugs in PyTorch's advanced indexing that need to be fixed, mainly regarding proper accounting of dimensions when multidimensional boolean masks are present; those must be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
2025-07-16 08:11:53 +00:00
Huamin Li
ddf502c988 [AOTI] add -lstdc++ into aoti link cmd for Meta internal (#158325)
Differential Revision: D78123716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158325
Approved by: https://github.com/desertfire
2025-07-16 07:55:08 +00:00
FFFrog
555f356254 [Easy] Show a clear error when torch.ops.load_library fails. (#157524)
**Background**:

```Shell
torch       2.5.1+cpu
torchvision 0.20.1
```

```Python
import torch
import torchvision

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```

**Cause**:

torchvision's .so file lacks some symbol definitions because those symbols come from CUDA, and the current environment has neither CUDA nor a GPU. The resulting error message is very confusing.
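
A minimal sketch of surfacing a clearer hint on load failure, assuming plain ctypes semantics (illustrative only, not the actual patch):

```python
import ctypes

def load_library_with_hint(path: str):
    try:
        return ctypes.CDLL(path)
    except OSError as e:
        # Undefined symbols frequently mean the extension was built against a
        # different torch/CUDA configuration than the current environment.
        raise OSError(
            f"Could not load this library: {path}. If the error mentions "
            "undefined symbols, check that the library was built for the "
            "installed torch/CUDA configuration."
        ) from e
```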
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
2025-07-16 07:33:22 +00:00
Kaichao You
59f9b25f3c [cuda][cupy] Improve cupy device placement when device is provided (#158320)
This is an improvement over https://github.com/pytorch/pytorch/pull/132595. That PR improves the case where `device` is not given. This PR improves the case where `device` is given but the initial auto-inference of the device from `cudaPointerGetAttributes` can be wrong (undesired). See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.

I think this is a reasonable improvement, as people expect `torch.as_tensor` + cupy to be zero-copy whenever possible. However, it does change some behavior, because previously this path might incur a device-to-device copy.

I will leave it to pytorch developers to see if the improvement is worthwhile.
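
An illustrative usage sketch (assumes a CUDA build with cupy installed; the device index is arbitrary):

```python
import cupy as cp
import torch

with cp.cuda.Device(0):
    a = cp.ones((4, 4))

# With an explicit device that matches the allocation, this should stay
# zero-copy; after this PR the given device is respected rather than
# re-inferred from cudaPointerGetAttributes.
t = torch.as_tensor(a, device="cuda:0")
```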

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
2025-07-16 07:12:36 +00:00
drisspg
5484890539 Add better typing to available kernel options for flex attention (#158383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158383
Approved by: https://github.com/joydddd, https://github.com/BoyuanFeng
2025-07-16 06:06:29 +00:00
Denghui Dong
e92e3eaf4e [Profiler] the doc of _ExperimentalConfig is incorrectly truncated by commas (#156586)
Hi team,

Please help review this trivial fix.

Without this change:

``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

    capture_overload_names (bool) : whether to include ATen overload names in the profile
```

With this change:

```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
    profiler_metrics : a list of CUPTI profiler metrics used
       to measure GPU performance events.
       If this list contains values Kineto runs in CUPTI profiler mode
    profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
       or for the entire measurement duration.
    verbose (bool) : whether the trace file has `Call stack` field or not.
    performance_events : a list of profiler events to be used for measurement.
    enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
       that expose CUDA device, stream and event synchronization activities. This feature is new
       and currently disabled by default.
    adjust_profiler_step (bool) : whether to adjust the profiler step to
       match the parent python event duration. This feature is new and currently disabled by default.
    disable_external_correlation (bool) : whether to disable external correlation
    profile_all_threads (bool) : whether to profile all threads
    capture_overload_names (bool) : whether to include ATen overload names in the profile

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-16 04:10:49 +00:00
Will Constable
0a9d450168 [DTensor] implement histc (#158298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158298
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-16 04:10:32 +00:00
Edward Z. Yang
e265b719bd Extract out prepare_aot_module_simplified for use in next PR (#158319)
Also a small amount of extra code cleanup.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158319
Approved by: https://github.com/jingsh
ghstack dependencies: #158149, #158150, #158173, #158176, #158213, #158251
2025-07-16 03:59:41 +00:00
Edward Z. Yang
7637c9718a Move functions from torch._functorch.aot_autograd that are not frontend functions to frontend_utils (#158251)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158251
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176, #158213
2025-07-16 03:59:41 +00:00
Edward Z. Yang
49d0332cef Introduce stages to aot_dispatch (#158213)
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it.  Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
exactly the functions return and triggering aot_dispatch to go to a
different "export" implementation, but I've found this difficult to
understand, and it has led to a bit of duplicate code for the export path.

So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
  * create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
    * One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
      * Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
      * Do some base/export/autograd specific post-processing on the graph

Notice that the pattern of nested function invocations means there is no easy way to get the graph-capture result in the autograd case; furthermore, the export path is "bolted on", forcing the entire chain of functions to have a different return result than normal, with no way to *resume* the rest of the post-processing to actually get a callable.

Here is the new structure:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
  * Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
  * Call create_aot_state to do initial setup and setup all the context managers on stack. These context managers do NOT exit upon return of this.
  * Call aot_stage1_graph_capture to do the graph capture
  * Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want

With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.
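
A self-contained sketch of the ExitStack-based staging pattern described above (dummy stage functions; the real AOTAutograd signatures differ):

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def tracing_mode(name):
    print(f"enter {name}")
    yield
    print(f"exit {name}")

def create_state(stack: ExitStack) -> dict:
    # context managers pushed onto `stack` stay live across later stages
    stack.enter_context(tracing_mode("fake_tensor_mode"))
    return {"graph": None}

def stage1_graph_capture(state: dict) -> dict:
    state["graph"] = "joint_graph"  # a caller could early-exit with this
    return state

def stage2_compile(state: dict):
    return lambda *args: f"compiled({state['graph']})"

def pipeline():
    with ExitStack() as stack:
        state = create_state(stack)
        state = stage1_graph_capture(state)
        return stage2_compile(state)  # contexts exit when the block ends

fn = pipeline()
print(fn())
```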

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
2025-07-16 03:59:32 +00:00
Edward Z. Yang
84dec060b7 Hoist choose_dispatcher to top level, remove unnecessary returns (#158176)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158176
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173
2025-07-16 03:56:25 +00:00
Edward Z. Yang
5b0df2565e Pipeline _create_aot_dispatcher_function (#158173)
Two main things of note:

- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
  stages, I am using the ExitStack trick: there is an ExitStack which is
  in scope for the entire pipeline, and inside of the individual
  pipeline stages we push context managers onto this stack when we want
  them to survive into the next pipeline stage.  This is not obviously
  what the best final form of the code is, but
  create_aot_dispatcher_function is called from multiple locations so I
  can't just inline the context managers into the call site.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
2025-07-16 03:56:25 +00:00
Songhao Jia
0cb36e2d62 cache dict and string rep for better perf (#158372)
Summary: NodeSource should not be updated after creation, so we can cache its dict and string representations for better perf.
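
A minimal sketch of the caching idea, assuming the object is immutable after creation (field names are hypothetical; the real torch.fx NodeSource differs):

```python
from functools import cached_property

class NodeSource:
    def __init__(self, name: str, lineno: int):
        self._name, self._lineno = name, lineno

    @cached_property
    def as_dict(self) -> dict:
        # computed once on first access, then reused
        return {"name": self._name, "lineno": self._lineno}

    @cached_property
    def as_str(self) -> str:
        return f"{self._name}:{self._lineno}"
```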

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78298501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
2025-07-16 02:15:32 +00:00
Xu Han
584a0510b3 [inductor] fix windows path for fresh cache. (#158324)
`normalize_path_separator` for windows path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158324
Approved by: https://github.com/jansel
2025-07-16 01:54:35 +00:00
yuchengliu1
9768d393fa add sfdp pattern (#155792)
Add an sfdp pattern for MBartForCausalLM/PLBartForCausalLM in transformers==4.44.2 to improve the inference performance of these models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155792
Approved by: https://github.com/Valentine233, https://github.com/jansel
2025-07-16 01:52:05 +00:00
PyTorch MergeBot
03852ddc22 Revert "[ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)"
This reverts commit 1ea9cde598.

Reverted https://github.com/pytorch/pytorch/pull/156903 on behalf of https://github.com/atalman due to Breaks torchao and torchtitan nightly builds ([comment](https://github.com/pytorch/pytorch/pull/156903#issuecomment-3076423488))
2025-07-16 01:28:46 +00:00
Xuan Zhang
8554c8007d [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large amount of reads, which leads to a significant increase in peak memory utilization. Imagine the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory-efficient, as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is runtime-efficient, for large `N` and/or large `r` it is not memory-efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads would be accumulated. This is in addition to existing logic in torch.compile:
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size_ of the buffers that can be accumulated (see the usage sketch below).
* During scheduling (i.e., `scheduler.py`), additional fusion is performed, so we also need to capture this pattern there. The decisions are implemented in `choices.py`.
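
A hedged usage sketch of the two configs named above (set ahead of compilation; the unit of the size threshold is an assumption, check the config docstring):

```python
import torch

# cap the number of accumulated buffers per operator (pre-existing config)
torch._inductor.config.realize_acc_reads_threshold = 8
# also cap the accumulated size (introduced by this PR); assumed to be bytes
torch._inductor.config.realize_acc_reads_size_threshold = 64 * 1024**2

def running_sum(xs):
    total = xs[0]
    for x in xs[1:]:
        total = total + x
    return total

compiled = torch.compile(running_sum)
```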

**Results:**
For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-16 01:05:25 +00:00
Yidi Wu
651b4a68f2 [hop][dynamo] track run-ahead sym variables in side effects (#158273)
Before the PR, for code like this:
```
        class Example2(torch.nn.Module):
            def forward(self, x, trigger, target):
                return torch.cond(
                    trigger == 1,
                    lambda: x + target,
                    lambda: x * target,
                    (),
                )

        m = Example2()
        x = torch.randn(2)
        trigger = 0
        target = 2
        args = (x, trigger, target)
        ep = torch.export.export(
            m, args, dynamic_shapes=(None, Dim.DYNAMIC, Dim.DYNAMIC)
        )
```
dynamo will wrap "target" (a SymInt) twice: the first time when we speculate the first lambda, find that target is a SymInt, and decide to wrap it, creating a new SymNodeVariable and a placeholder input to the top-level graph.

The second time happens when we speculate the second lambda. Tensors are de-duplicated by checking tracked side effects to make sure objects with the same id (though different sources) are mapped to the same TensorVariable. For SymInts, two things are missing:
1. they're not in the _can_lift_attrs_to_input list (the change in builder.py)
2. they're not tracked by runahead_side_effects, so when speculate_subgraph finishes, they're discarded (the change in side_effects.py)

Note: the auto-lifting mechanism for HOPs happens at the proxy level when we trace the subgraph, which is after the SymNodeVariables are created (they're created when realizing the args and binding them to the subgraph). At that point, the builder has created two unique SymNodeVariables for the same SymInt, so the auto-lifting in HOPs cannot de-dup them.

Differential Revision: [D78298163](https://our.internmc.facebook.com/intern/diff/D78298163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158273
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-07-15 23:48:20 +00:00
dolpm
144965ca9a [BE][S538760] get rid of TORCH_CHECK_.* and CHECK macros (#158269)
Summary: a failed check will crit, causing the program to exit, which is quite dangerous

Test Plan:
CI

Rollback Plan:

Differential Revision: D78050595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158269
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-07-15 22:04:12 +00:00
Ti-Tai Wang
3f83e3eeca [ONNX] Remove legacy registration and dispatcher (#158283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158283
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262, #158282
2025-07-15 21:00:49 +00:00
Yiming Zhou
0640cfa38c [2/n] Remove references to TorchScript in PyTorch docs (#158306)
Summary: Removed jit_language_reference.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158306
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 20:57:23 +00:00
Ti-Tai Wang
e4c17d5e1c [ONNX] Remove fx_onnx_interpreter.py (#158282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158282
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262
2025-07-15 20:46:06 +00:00
Animesh Jain
cc0faeb80f [dynamo][guards] Instruction count for guard eval for development work (#158214)
It's turned off by default, and the code is hidden behind a preprocessor define flag. It will be used only for development work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158214
Approved by: https://github.com/StrongerXi
ghstack dependencies: #158215
2025-07-15 20:29:23 +00:00
Ti-Tai Wang
205241a0d5 [ONNX] Remove legacy dynamo graph extractor (#158262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158262
Approved by: https://github.com/justinchuby
ghstack dependencies: #158258
2025-07-15 20:21:49 +00:00
Sam Larsen
dbf7d421da [BE][testing] fix aot_inductor_package internally (#158270)
Summary: We have internal test failures for several aot_inductor_package tests. It looks like we're translating args like:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor/script.ld
```

To:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor//tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```

This PR changes them to strings like:
```
-Wl,--script=/tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
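
A hedged sketch of the normalization the strings above suggest (hypothetical helper; the actual fix lives in the internal packaging glue):

```python
def normalize_script_arg(arg: str) -> str:
    prefix = "-Wl,--script="
    if not arg.startswith(prefix):
        return arg
    path = arg[len(prefix):]
    # if an absolute runtime path was concatenated onto a link-tree path
    # (".../_inductor//tmp/..."), keep only the trailing absolute path
    idx = path.find("//")
    if idx != -1:
        path = path[idx + 1:]
    return prefix + path
```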

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:aot_inductor_package --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158270
Approved by: https://github.com/desertfire
2025-07-15 20:15:18 +00:00
Animesh Jain
b86d5cef68 [dynamo][tensor] Skip HASATTR attribute on tensor guards (#158215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158215
Approved by: https://github.com/StrongerXi
2025-07-15 20:10:47 +00:00
Jane Xu
30587195d3 Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035)
Summary: As titled; also cleans up a number of the build files.

Test Plan:
internal and external CI

did run buck2 build fbcode//caffe2:torch and it succeeded

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78016591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
2025-07-15 19:52:59 +00:00
Aaron Orenstein
250ae2531c Fix types in graphs.py (#158192)
Added type annotations for torch/cuda/graphs.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158192
Approved by: https://github.com/oulgen
2025-07-15 19:49:38 +00:00
Songhao Jia
011026205a make node source hashable (#158322)
Summary: as title

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78296410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158322
Approved by: https://github.com/yushangdi
2025-07-15 19:31:00 +00:00
Menglu Yu
4657a84bc5 [Optimus][fp8_activation_quantization] Only log when there's some node to be quantized (#158129)
Summary:
We add an extra check for whether any node has been marked as should-quantize; otherwise we skip the quantization and the tlparse log (sketched below).
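
A minimal sketch of the guard (hypothetical node representation; the real pass operates on FX graph nodes):

```python
def maybe_quantize_activations(nodes: list[dict]) -> bool:
    marked = [n for n in nodes if n.get("should_quantize", False)]
    if not marked:
        # nothing to quantize: skip the pass and the tlparse log entirely
        return False
    # ... quantize activations for `marked` and emit the tlparse log ...
    return True
```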

Rollback Plan:

Differential Revision: D78173788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158129
Approved by: https://github.com/Skylion007, https://github.com/avicizhu
2025-07-15 19:22:26 +00:00
Ti-Tai Wang
5606c516fd [ONNX] Remove legacy Dort (#158258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158258
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-07-15 19:14:06 +00:00
Edward Z. Yang
7afb834f93 Inline dispatch_and_compile into its call site. (#158150)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158150
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149
2025-07-15 19:08:55 +00:00
Edward Z. Yang
148789ddd8 Avoid AOTAutogradCache.load in stack trace on cache miss path (#158149)
The general context for the upcoming stack of commits is that I am attempting
to "pipeline" AOTAutograd.  Instead of having function f call function g
which is the next "stage" of compilation, instead f should return with
its outputs, which are then piped to g for the next stage.  This will
make it easier to implement early exit / resume of the pipeline without forcing
a callback structure, which is good for export-style use cases.  It also
reduces the size of our stack traces, which makes tools like Perfetto
happy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
2025-07-15 19:08:55 +00:00
Shangdi Yu
cf3247b74a Standalone compile API in _Exporter (#158139)
Given a `package: _ExportPackage`, users can get a ready-to-use workspace in `tmp_dir` by calling:
```python
package._compiled_and_package(
                tmp_dir + "/pt2_package_name.pt2", True, package_example_inputs = True
            )
```

`tmp_dir` will contain:
- `main.cpp` (an example cpp file that creates the models; if package_example_inputs is True, it also loads the example inputs and runs the models)
- `CMakeLists.txt`
- `pt2_package_name/` (this is where the models are)
- `pt2_package_name.pt2`
- `inputs.pt` files if package_example_inputs is True

Remaining TODOs:
- support loading constants/weights
- the `package_example_inputs = True` option only supports a list of Tensors for now
- eventually we should remove the `torch` dependency and use `SlimTensor`/`StableIValue` instead

Test Plan:
```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter
```

Example generated `main.cpp`:

```cpp
#include <dlfcn.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <torch/torch.h>
#include <vector>
#include <torch/csrc/inductor/aoti_torch/tensor_converter.h>
#include "package/data/aotinductor/Plus__default/Plus__default.h"
#include "package/data/aotinductor/Minus__default/Minus__default.h"

using torch::aot_inductor::AOTInductorModelPlus__default;
using torch::aot_inductor::AOTInductorModelMinus__default;
using torch::aot_inductor::ConstantHandle;
using torch::aot_inductor::ConstantMap;

int main(int argc, char* argv[]) {
    std::string device_str = "cpu";
    try {
        c10::Device device(device_str);
        // Load input tensors for model Plus__default
        std::vector<at::Tensor> input_tensors1;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Plus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors1.push_back(ivalue.toTensor().to(device));
        }

        // Load input tensors for model Minus__default
        std::vector<at::Tensor> input_tensors2;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Minus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors2.push_back(ivalue.toTensor().to(device));
        }

// Create array of input handles
        auto input_handles1 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors1);
        auto input_handles2 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors2);

// Create array for output handles
        AtenTensorHandle output_handle1;
        AtenTensorHandle output_handle2;

// Create and load models
        auto constants_map1 = std::make_shared<ConstantMap>();
        auto constants_array1 = std::make_shared<std::vector<ConstantHandle>>();
        auto model1 = AOTInductorModelPlus__default::Create(
            constants_map1, constants_array1, device_str,
            "package/data/aotinductor/Plus__default/");
        model1->load_constants();
        auto constants_map2 = std::make_shared<ConstantMap>();
        auto constants_array2 = std::make_shared<std::vector<ConstantHandle>>();
        auto model2 = AOTInductorModelMinus__default::Create(
            constants_map2, constants_array2, device_str,
            "package/data/aotinductor/Minus__default/");
        model2->load_constants();

// Run the models
        torch::aot_inductor::DeviceStreamType stream1 = nullptr;
        model1->run(&input_handles1[0], &output_handle1, stream1, nullptr);
        torch::aot_inductor::DeviceStreamType stream2 = nullptr;
        model2->run(&input_handles2[0], &output_handle2, stream2, nullptr);

// Convert output handles to tensors
        auto output_tensor1 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle1, 1);
        auto output_tensor2 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle2, 1);

// Validate outputs
        std::cout << "output_tensor1" << output_tensor1 << std::endl;
        std::cout << "output_tensor2" << output_tensor2 << std::endl;
        return 0;
    } catch (const std::exception &e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
}

```

Rollback Plan:

Differential Revision: D78124705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158139
Approved by: https://github.com/desertfire
2025-07-15 18:47:56 +00:00
PyTorch MergeBot
ea5f88dca6 Revert "Deprecate overlap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit e40ade5182.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))
2025-07-15 18:24:36 +00:00
Menglu Yu
243b12e565 [Optimus] add einsum_to_pointwise_pass pattern (#155666)
Summary: More context: https://docs.google.com/document/d/1ipiskqG13ZKNX1SGygB3QnHcSyXNQ8pACazPIcS4bnI/edit?tab=t.0

Test Plan:
### how to enable

```
torch._inductor.config.pre_grad_fusion_options={
            "einsum_to_pointwise_pass": {},
        },
```

### unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test 'fbcode//mode/dev-nosan' //caffe2/test/inductor:kernel_optimization
```
Buck UI: https://www.internalfb.com/buck2/267263ff-6f5b-4fff-bfc0-d8f013440ba0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5629499820839168
Network: Up: 61KiB  Down: 675KiB  (reSessionID-fda8edfc-6eef-4bf0-b268-0f8d2e666571)
Loading targets.   Remaining     0/1                                                            1 dirs read, 2310 targets declared
Analyzing targets. Remaining     0/345                                                          284 actions, 329 artifacts declared
Executing actions. Remaining     0/18334                                                        8.0s exec time total
Command: test.     Finished 6 local
Time elapsed: 1:15.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### local reproduce

baseline:

| Metric                | Value       |
|:----------------------|:------------|
| Batch size            | 4096        |
| GPU type              | H100        |
| Latency               | 196.06 ms   |
| Model size            | 1205.21 MB  |
| Flops                 | 7671.30 G   |
| Flops/example         | 1.87 G      |
| TFLOPS/sec            | 39.13       |
| MFU                   | 4.89%       |
| Activation/example    | 1.51 MB     |
| CPU time total        | 602.28 ms   |
| GPU time total        | 798.60 ms   |
| Estimated avg BW      | 234.62 GB/s |
| Estimated avg BW util | 9.78%       |
Trace link: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/efficient_module_suite/fused_attention_mlp.Jun_09_22_12_38_trace.json.gz&bucket=pyper_traces

with the pattern:

| Metric                | Value       |
|:----------------------|:------------|
| Batch size            | 4096        |
| GPU type              | H100        |
| Latency               | 184.94 ms   |
| Model size            | 1205.21 MB  |
| Flops                 | 7671.30 G   |
| Flops/example         | 1.87 G      |
| TFLOPS/sec            | 41.48       |
| MFU                   | 5.18%       |
| Activation/example    | 1.15 MB     |
| CPU time total        | 562.44 ms   |
| GPU time total        | 754.36 ms   |
| Estimated avg BW      | 201.40 GB/s |
| Estimated avg BW util | 8.39%       |
Trace link: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/efficient_module_suite/fused_attention_mlp.Jun_10_22_03_34_trace.json.gz&bucket=pyper_traces

### E2E

baseline: f713998364
with pattern:

Rollback Plan:

Differential Revision: D76400889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155666
Approved by: https://github.com/Yuzhen11
2025-07-15 17:50:23 +00:00
vfdev
b7b1109f49 Expose opt_einsum in torch.backends (#157740)
Fixes the following issue:
```
:/tmp# python -c "import torch; print(torch.__version__)"
2.7.1+cu126
:/tmp# python -c "import torch; print(torch.backends.opt_einsum.is_available())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'torch.backends' has no attribute 'opt_einsum'
```
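
After the fix, the backend should be reachable directly from torch.backends; a quick usage sketch:

```python
import torch

print(torch.backends.opt_einsum.is_available())  # True if opt_einsum is installed
torch.backends.opt_einsum.enabled = True         # existing flag on this backend
print(torch.backends.opt_einsum.strategy)        # e.g. "auto"
```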

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157740
Approved by: https://github.com/Skylion007, https://github.com/benjaminglass1
2025-07-15 17:46:43 +00:00
PyTorch MergeBot
26807dcf27 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit c062550a35.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main c062550a35, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test).  Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, its to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))
2025-07-15 16:35:55 +00:00
PyTorch MergeBot
4f36743f5e Revert "[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)"
This reverts commit 5a54db14e3.

Reverted https://github.com/pytorch/pytorch/pull/158062 on behalf of https://github.com/clee2000 due to sorry I want to revert something else and this is causing a merge conflict, all you should need to do is rebase and remerged ([comment](https://github.com/pytorch/pytorch/pull/158062#issuecomment-3074342140))
2025-07-15 16:31:13 +00:00
dsashidh
05d7288e31 Fix incorrect bin edge description in histogramdd docs (#158275)
Fixes #124435

This updates the torch.histogramdd documentation to correctly state that bins are inclusive of their left edges, not exclusive as currently written. A previous PR addressed this but was closed due to inactivity; this picks that up and applies the fix.
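
A quick check of the documented semantics (each bin includes its left edge; the last bin also includes its right edge):

```python
import torch

x = torch.tensor([[0.0], [1.0], [2.0]])
hist, edges = torch.histogramdd(x, bins=[2], range=[0.0, 2.0])
# 0.0 -> first bin [0, 1); 1.0 -> second bin [1, 2]; 2.0 -> also second bin
print(hist)      # tensor([1., 2.])
print(edges[0])  # tensor([0., 1., 2.])
```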
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158275
Approved by: https://github.com/albanD
2025-07-15 16:25:01 +00:00
IvanKobzarev
5a54db14e3 [simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-15 14:27:57 +00:00
Aleksandar Samardžić
90618581e9 Fix grouped MM output strides when compiled but not max-autotuned (#158143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158143
Approved by: https://github.com/ngimel
2025-07-15 11:53:13 +00:00
Andrey Talman
4e13eca713 [BE] Remove CUDA 11.8 artifacts (#158303)
We include cufile by default in all CUDA 12+ builds. Since CUDA 11.8 support has been removed, we can safely remove this code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158303
Approved by: https://github.com/Camyll, https://github.com/cyyever
2025-07-15 11:52:08 +00:00
Xiangyang (Mark) Guo
156a377f4c [AOTI][CPP] add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL (#157949)
Summary: Add the flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force-inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force-inlining may increase build time.
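
A usage sketch, assuming the environment variable is read when the inductor config module is imported:

```python
import os
os.environ["TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL"] = "1"  # opt in explicitly

import torch

@torch.compile
def f(x):
    return x.sin() + 1

f(torch.randn(8))  # CPP kernels built during this call are force-inlined
```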

Differential Revision: D77915987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
2025-07-15 10:51:43 +00:00
Yu, Guangye
e40ade5182 Deprecate overlap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #150312
2025-07-15 10:14:35 +00:00
Huamin Li
7f9fc7e67c [Inductor] Add CPU_MAX_FIRST_DIMENSION_DECOMPOSITION and CPU_MAX_OTHER_DIMENSION_DECOMPOSITION for decompose_mm_pass (#158183)
Differential Revision: D78209993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158183
Approved by: https://github.com/houseroad
2025-07-15 10:07:25 +00:00
wengshiy
c8c221c0b3 [Inductor][Float8] Add float8_e4m3fn into assertion dtype list. (#157684)
Fix an assert issue by adding float8_e4m3fn to the assertion's dtype list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157684
Approved by: https://github.com/Xia-Weiwen, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-15 06:02:01 +00:00