47182 Commits

Author SHA1 Message Date
Isuru Fernando
66b0a0b61a [inductor] support dilation in max_pool2d lowering (#148209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148209
Approved by: https://github.com/eellison
2025-03-24 13:00:12 +00:00
James Wu
de3aca3311 [StaticCudaLauncher] Support any number of kernel arguments (#149442)
Fixes #149450

This PR adds fallback support on StaticCudaLauncher for any number of kernel arguments. Above MAX_ARGS, we can do a heap allocation/malloc instead.

For 0 arguments, Triton technically invokes undefined behavior by allocating a 0-byte array and passing it to cuLaunchKernel. In reality, cuLaunchKernel never accesses the pointer if the signature of the cubin has no parameters, so we can just pass nullptr directly.

We could technically use `alloca` to stack allocate instead of heap allocate, though in my tests it didn't noticeably improve runtime performance on benchmarks, and alloca has portability issues, so I'd rather just stick with something simpler for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149442
Approved by: https://github.com/jansel
2025-03-23 22:43:47 +00:00
Justin Chu
2dccd70ef0 [ONNX] Clean up legacy dynamo export code (#149745)
Clean up code that is unused and obsolete. The public `torch.onnx.dynamo_export` is kept for now but the legacy implementation is removed.

Remove public option classes and OnnxRegistry that have been deprecated.

Users: use torch.onnx.export(…, dynamo=True).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149745
Approved by: https://github.com/titaiwangms, https://github.com/cyyever
2025-03-23 19:35:16 +00:00
Nikita Shulga
8bece88655 [BE] Eliminate TODO for 2022 (#149557)
Need to think a bit more about what types.h includes

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149557
Approved by: https://github.com/albanD
2025-03-23 05:35:54 +00:00
fzyzcjy
85027ef74a Super tiny fix typo (#149109)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149109
Approved by: https://github.com/malfet
2025-03-23 03:02:53 +00:00
James Wu
fe954cdcbf Use correct boxed_forward_device_index when running CompiledFxGraph.post_compile (#148130)
This PR threads through the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. This allows us to correctly update BoxedDeviceIndex from cache hits.

We don't actually need to save `boxed_forward_device_index` in CompiledFXGraph because its value is in the cache key, so it always matches to the ambient one anyway. On forward with cudagraphs enabled, derive `boxed_forward_device_index`'s value from `device_idxs`.

Testing:

```
python benchmarks/dynamo/cachebench.py --mode training --benchmark torchbench --model BERT_pytorch --device cuda --repeat 1 --dynamic --output="dynamic.json"
```

Now cache hits properly on FXGraphCache. AOTAutogradCache has a guard failure. Will look into that as a followup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148130
Approved by: https://github.com/eellison
2025-03-23 02:57:58 +00:00
Mark Saroufim
539db4af4b load_inline no_implicit_headers mode (#149480)
In the kernelBot leaderboard we support people competing with custom cuda extensions via `load_inline()`, however even on toy kernels this can result in cold starts of up to 90s - this feature is primarily responsible for us having to double our timeout values

I performed an investigation here https://github.com/msaroufim/load_inline_slow and the primary cause was that torch/extension.h and torch/types.h add in about 5,000 header files https://github.com/msaroufim/load_inline_slow/blob/main/header-analysis

So we introduce a mode `no_implicit_headers` which forces users to be explicit about exactly what they want to add. There's a proper test meant to be used in a CLI and a pytest test that's not terribly helpful

Then there's still an open question around what's the most minimal example implementation we can provide. For the baseline kernel we're showing here, it takes about 1 min to compile
1. There's using TensorBase.h (finicky to get right but can get compilation times down to 7s)
2. Just using Tensor.h (down to 15s)
3. Using Shim.h (did not try yet since the syntax is verbose relative to cuda)

This is my take so far https://gist.github.com/msaroufim/079a8d08ffebd0f91a1c2247eb0ce9e0 for a minimal implementation at 15s but @malfet has a simpler one at only 5s

There's more things I'd like to try moving forward like nvrtc and fancier compilation flags. Typical advice around using precompiled headers does not apply to us because we are mostly interested in cold starts where we tear down the machine after running a kernel

Also, in a future PR I'd like to fix some issues I've noticed with load_inline:
1. It needs a force recompilation mode, I was using this quite a bit myself
2. The cache does not take into account changes in environment so the best way to force a recompilation is to change some string in the file
3. Instead of relying on pybind, can we use TORCH_LIBRARY instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149480
Approved by: https://github.com/malfet
2025-03-22 19:21:29 +00:00
Davide Italiano
2b848ab192 [MPS/inductor] Add support for modified_scaled_bessel_k{0,1} (#149794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149794
Approved by: https://github.com/malfet
2025-03-22 15:41:40 +00:00
Animesh Jain
6bbe8dbd63 [dynamo][hooks] config to wrap the top frame in a wrapper (#149758)
This should be done by default but there are too many issues. This PR is a
workaround.

https://github.com/pytorch/pytorch/issues/117584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149758
Approved by: https://github.com/yf225
ghstack dependencies: #149712
2025-03-22 07:17:01 +00:00
bobrenjc93
621c801f78 fix dynamic float when dynamic=True (#149564)
Fixes https://github.com/pytorch/pytorch/issues/149406#issuecomment-2738111733. Previously we would only make floats dynamic via automatic dynamic; now, if you set dynamic=True, we make floats dynamic on the first compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149564
Approved by: https://github.com/laithsakka
2025-03-22 05:58:59 +00:00
Nichols A. Romero
01b1d1f91b [ROCm][TunableOp] Fix offline tuning for ScaledGEMM. (#149677)
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that the ParamSignature matches.
- We raise an error if submatrices are encountered as this is only supported in online tuning mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily
2025-03-22 02:22:13 +00:00
Tugsbayasgalan Manlaibaatar
021b3e23ec Fix is_nonzero for more than one elem tensors (#149637)
Differential Revision: [D71560442](https://our.internmc.facebook.com/intern/diff/D71560442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149637
Approved by: https://github.com/pianpwk
2025-03-22 02:08:28 +00:00
Xintong Hu
9d02b3993f [PT2] Port use_triton_lce to PT2 pre_grad passes (#149702)
Summary:
`use_triton_lce_replace_simple_LCE` and `use_triton_lce_replace_normal_LCE`

code is mostly the same, some minor changes to support aten IR

Test Plan:
```
scripts/aetk/aetk -L
%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```

will verify the qps after everything done in the stack

Reviewed By: frank-wei

Differential Revision: D68909857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149702
Approved by: https://github.com/frank-wei
2025-03-22 00:36:58 +00:00
Animesh Jain
d320af0663 [dynamo] Ensure placeholder name is not an intermediate node name (#149712)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1615671879071017/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149712
Approved by: https://github.com/zou3519
2025-03-21 22:24:45 +00:00
Brian Hirsh
7f836b747f partitioner: ensure collectives saved by SAC that are actually unused in the bw are properly not saved (#149652)
This PR fixes one of the issues described here: https://github.com/pytorch/torchtitan/issues/866#issuecomment-2726015248

I spent some time trying to write a unit test and ultimately failed. If folks are interested I can spend more time trying to, but otherwise I have an E2E test with torchtitan. command:
```
CUDA_VISIBLE_DEVICES=1,2,3,4 NGPU=4 CONFIG_FILE="./torchtitan/models/llama/train_configs/llama3_8b.toml" tlp ./run_train.sh --training.steps=30  --training.tensor_parallel_degree=2 --training.compile --experimental.enable_async_tensor_parallel
```

here's the backward graph generated prior to the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/f7d17388-42c2-4d7e-8a55-a00387341ecb/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

and new backward graph with the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/ab8576fc-98c1-4915-af47-699aa8e2557e/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The main difference is that the input arg `reduce_scatter_tensor_1` is dead code in the bw graph, causing us to unnecessarily save a giant `reduce_scatter` for bw. With the PR, we properly ensure that it is not saved for backward.

More comments in the PR, but the main thing going on is that:

(1) We have some existing logic that checks for activations that are actually dead code in the backward, and removes them

(2) Collectives are not properly handled by this code. Why? A collective is **always** followed by a `wait_tensor()` call, so we need to go one node further and check if the "dead" code has a wait_tensor user that is also dead
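
The check described in (1) and (2) can be sketched on a toy graph. This is only an illustration under stated assumptions: `Node`, `is_unused`, and `is_unused_collective` are hypothetical names, and the real partitioner works on torch.fx graphs rather than this stand-in class.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a graph node (hypothetical; not torch.fx.Node)."""
    name: str
    target: str
    users: list = field(default_factory=list)


def is_unused(node):
    """Plain dead-code check: a node with no users is dead."""
    return len(node.users) == 0


def is_unused_collective(node):
    """Go one node further: a collective is also dead when every user is a
    wait_tensor call that itself has no users."""
    if is_unused(node):
        return True
    return all(u.target == "wait_tensor" and is_unused(u) for u in node.users)


# Dead case: the collective's only user is a wait_tensor nobody consumes.
dead_wait = Node("wait_tensor_1", "wait_tensor")
dead_rs = Node("reduce_scatter_tensor_1", "reduce_scatter_tensor", users=[dead_wait])

# Live case: the wait_tensor result feeds a real consumer.
consumer = Node("mul_1", "mul")
live_wait = Node("wait_tensor_2", "wait_tensor", users=[consumer])
live_rs = Node("reduce_scatter_tensor_2", "reduce_scatter_tensor", users=[live_wait])
```

Here the plain check misses `dead_rs` because it has one user, while the collective-aware check flags it and still treats `live_rs` as live.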

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149652
Approved by: https://github.com/zou3519
ghstack dependencies: #149514
2025-03-21 22:09:19 +00:00
Brian Hirsh
1c6b517e19 DTensor: more generically support CompositeImplicitAutograd ops under inference mode (#149514)
Today, if you run DTensor (or any tensor subclass) under __torch_dispatch__, you will start seeing `CompositeImplicitAutograd` ops show up in the torch_dispatch.

"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.

It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.

The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:

```
        def fast_function():
            return torch._C._dispatch_has_kernel_for_dispatch_key(op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd)
        import timeit
        time_taken = timeit.timeit(fast_function, number=1000)
        # printed 0.12..., aka 1.2us
        print(f'func={str(op_call)}, time={str(time_taken)}')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
2025-03-21 22:09:19 +00:00
Wei Feng
d46c16fca6 [FSDP2] warning that reshard_after_forward=1 and True are different (#149750)
People complain about spending time debugging reshard_after_forward=1 when what they actually want is reshard_after_forward=True. 1 and True can generally be used interchangeably in programming, so add a one-time warning to remind users that they are different here:
* reshard_after_forward=1 means resharding parameters to world size 1, by keeping unsharded parameters from forward to backward
* reshard_after_forward=True means reshard parameters to FSDP mesh

From the FSDP2 perspective, our docstring is clear about int vs bool: https://pytorch.org/docs/main/distributed.fsdp.fully_shard.html
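
The confusion is easy to hit because Python's `bool` is a subclass of `int`, so `1 == True` holds and an `isinstance(value, int)` check accepts both. A minimal sketch of telling them apart, assuming a hypothetical helper `describe_reshard_setting` (this is not FSDP2's actual code):

```python
def describe_reshard_setting(value):
    """Hypothetical helper: classify a reshard_after_forward-style argument."""
    # bool must be checked first, because isinstance(True, int) is also True.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    raise TypeError(f"expected bool or int, got {type(value).__name__}")


print(1 == True)                       # the two values compare equal
print(describe_reshard_setting(True))  # classified as a bool
print(describe_reshard_setting(1))     # classified as an int
```

Checking `bool` before `int` is the design choice that matters; in the opposite order, `True` would be classified as an int.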

<img width="764" alt="Screenshot 2025-03-21 at 11 02 55 AM" src="https://github.com/user-attachments/assets/6675f7a4-95a0-4421-8dbf-f47e9fdeca26" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149750
Approved by: https://github.com/mori360, https://github.com/msaroufim, https://github.com/wconstab
2025-03-21 22:05:20 +00:00
angelayi
ff020d32b6 [export] Patch dynamo configs when nonstrict tracing (#149295)
Differential Revision: [D71298929](https://our.internmc.facebook.com/intern/diff/D71298929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149295
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-03-21 21:44:54 +00:00
Avik Chaudhuri
fb07fe6f36 pretty print graph signature (#149710)
Fixes #141243

Differential Revision: [D71604218](https://our.internmc.facebook.com/intern/diff/D71604218/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149710
Approved by: https://github.com/angelayi
2025-03-21 21:31:58 +00:00
eellison
5757aa8773 Cudagraph fix + comment cleanup (#149741)
Cudagraphs is careful to not allow any memory recorded to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try to gc.collect() to deallocate it. From a hard to repro internal use case, this was fixed by an additional `cuda.synchronize()`.

I also snuck in the removal of an outdated comment and a duplicate line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
2025-03-21 21:12:36 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
c5deacc27a Fix subclass access custom op bug (#149698)
Summary: Under torch.inference_mode, we seem to skip the Autograd key, so the custom op that export uses is not decomposed properly before subclass dispatching starts. We fix this by force-desugaring this op at the Python key.

Test Plan: test

Differential Revision: D71599541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149698
Approved by: https://github.com/bdhirsh
2025-03-21 19:42:56 +00:00
Avik Chaudhuri
09aa63ea2c preserve custom meta in placeholders (#149661)
Fixes #147338

Differential Revision: [D71573533](https://our.internmc.facebook.com/intern/diff/D71573533/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149661
Approved by: https://github.com/junpeiz, https://github.com/angelayi
2025-03-21 19:09:38 +00:00
Aaron Orenstein
0eb3ac9349 Make sure to write to caches atomically (#149654)
This is an attempt to fix #119698

I was unable to reproduce the original described problem on the latest trunk but the proposed fix makes sense. Instead of adding locks like the original (unlanded) fix I changed a few of the cache writes to be atomic file swaps (write to temp file, rename file) which should have the same effect without blocking reads.
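
The write-to-temp-then-rename pattern can be sketched as below. This is a generic illustration, not the code from this PR; `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to `path` so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the destination directory so the final
    # rename stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace swaps the new file into place in one step,
        # overwriting any existing file at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that opens `path` at any point sees either the old contents or the new contents, and no lock is needed on the read side.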

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149654
Approved by: https://github.com/eellison
2025-03-21 18:59:41 +00:00
Shangdi Yu
46dd226702 Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529)
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the registered fake implementation works properly.

- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`

Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.

Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.

Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.

Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind

buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms

buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc

```

Differential Revision: D70013257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
2025-03-21 18:58:28 +00:00
Pradeep Fernando
1b08aaeafe Supporting non-tensor-data write_size in planner write items. (#149699)
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. The planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way the writer layer can get hold of this property (for non-tensor data) is to first do a lookup into the actual tensor/bytes and then calculate the nbytes.

This change introduces a way to capture non-tensor data size within a write-plan item.

Test Plan: Existing UT.

Differential Revision: D71599725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699
Approved by: https://github.com/MeetVadakkanchery
2025-03-21 18:09:14 +00:00
Ding, Yi1
f7d1b966c2 [Inductor] Unify the data type propagation between Triton and CPP Backend (#146970)
Fixes #144246

Use `DtypePropagationOpsHandler` for CSE variables of CPP backend. In addition, add static type checking for the generated CPP code similar to the `config.test_configs.runtime_triton_dtype_assert`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel
2025-03-21 17:52:51 +00:00
Ryo Suzuki
d072254eae Extend vec backend with BF16 SVE intrinsics (#143666)
- Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/swolchok, https://github.com/aditew01

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
2025-03-21 10:55:11 +00:00
Nikita Shulga
68dfd44e50 Do not depend on numpy during the import (#149683)
But a good followup would be to use torch primitives instead of numpy here
Fixes https://github.com/pytorch/pytorch/issues/149681

Test plan: Monkey-patch 2.7.0-rc and run `python -c "import torch;print(torch.compile(lambda x:x.sin() + x.cos())(torch.rand(32)))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149683
Approved by: https://github.com/seemethere
2025-03-21 08:14:57 +00:00
Michael Lazos
34743678b9 [Dynamo] Cleanup state management for ctx managers (#149689)
Removes state indirection for ctx managers. This isn't needed anymore since VTs are mutable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149689
Approved by: https://github.com/StrongerXi
2025-03-21 07:18:33 +00:00
Davide Italiano
0ed34210b2 [MPS] Add support for modified_bessel_k1 to eager and inductor. (#149687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149687
Approved by: https://github.com/malfet
2025-03-21 04:59:06 +00:00
Yuanhao Ji
0a396a8160 [Docs] Make torch.Library's kind have no default value to be consistent with the code (#149390)
Fixes #149389

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149390
Approved by: https://github.com/janeyx99
2025-03-21 04:42:10 +00:00
Yuanhao Ji
bf6621d08f [Distributed] Add repr methods for ParallelStyles (#149478)
Fixes #149470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149478
Approved by: https://github.com/wanchaol
2025-03-21 03:59:25 +00:00
zeshengzong
732f9d7435 Optimize torch.equal description (#149618)
Fixes #149222

## Test Result

![image](https://github.com/user-attachments/assets/559a376f-2dd0-4474-bbd5-9299d9df51e3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149618
Approved by: https://github.com/zou3519
2025-03-21 03:44:49 +00:00
Xia, Weiwen
64bd889660 [Inductor][CPP] rename shim_mkldnn.h/.cpp to shim_cpu.h/.cpp (#149372)
**Summary**
Previous discussion is here: https://github.com/pytorch/pytorch/pull/148907#issuecomment-2712795600
Rename these files because
- they may hold mkldnn-unrelated code for CPU
- filenames are aligned with files for CUDA and XPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149372
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2025-03-21 03:42:12 +00:00
Justin Chu
a39bf846f5 [ONNX] Add draft_export as a strategy (#147529)
Create draft_export strategy.

The strategy is added before jit and after strict=True, as the third fallback. Since it is specializing tensors it should not be less robust than the jit trace strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147529
Approved by: https://github.com/titaiwangms
2025-03-21 03:05:17 +00:00
Hollow Man
0692301e25 Catch OSError in general when writing files (#149464)
Redundant exception types in `except (PermissionError, OSError):`.  Write `except OSError:`, which catches exactly the same exceptions.

https://github.com/pytorch/pytorch/actions/runs/13935844871/job/39141062991

When hipifying files or writing cProfile files, catching PermissionError is not enough if the file is located somewhere that is not writable at all, or if other OS errors occur while writing.

This fix makes the code more robust.
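
A short illustration of why `except OSError:` is sufficient. `try_write` is a hypothetical helper for demonstration, not the function changed in this PR.

```python
# PermissionError is a subclass of OSError, so listing both is redundant.
print(issubclass(PermissionError, OSError))


def try_write(path, text):
    """Return True on success, False if the OS refuses the write for any reason."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return True
    except OSError:
        # Covers PermissionError, read-only file systems (errno 30),
        # missing parent directories, and other OS-level failures.
        return False
```

For instance, writing into a directory that does not exist raises FileNotFoundError, which is also an OSError subclass, so `try_write` returns False rather than propagating the exception.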

Example error log:
```log
  File "deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 540, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 587, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 1597, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 2031, in _jit_compile
    hipify_result = hipify_python.hipify(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 1167, in hipify
    preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 940, in preprocessor
    output_source = RE_QUOTE_HEADER.sub(mk_repl('#include "{0}"', True), output_source)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 919, in repl
    preprocess_file_and_save_result(output_directory,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 986, in preprocessor
    with clean_ctx.open(fout_path, 'w', encoding='utf-8') as fout:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 123, in open
    return open(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 30] Read-only file system: 'deepspeed/ops/csrc/adam/multi_tensor_apply_hip.cuh'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149464
Approved by: https://github.com/janeyx99
2025-03-21 02:42:50 +00:00
Justin Chu
362b40939d [ONNX] Improve docstring of onnx symbolic ops (#149668)
Better examples
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149668
Approved by: https://github.com/titaiwangms
2025-03-21 01:57:39 +00:00
Simon Fan
e481615bc7 [aot] always lower the backward with a deepcopy (#149229)
FIXES https://github.com/pytorch/pytorch/issues/149105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149229
Approved by: https://github.com/bdhirsh
2025-03-21 01:47:13 +00:00
Xintong Hu
5ebc283f2c [PT2] Port use_triton_dot_compress to PT2 pre_grad passes (#148517)
Summary: add use_triton_dot_compress in pre_grad

Test Plan:
```
scripts/aetk/aetk -L

%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```

Reviewed By: frank-wei

Differential Revision: D68909838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148517
Approved by: https://github.com/frank-wei
2025-03-21 01:42:32 +00:00
James Wu
c2ada9d77b [easy] Do not logspam if static cuda launcher is disabled (#149669)
No need to log.info every time someone runs with StaticCudaLauncher disabled.

Test plan: Run any benchmark and see that we don't spam the bypass message in logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149669
Approved by: https://github.com/oulgen, https://github.com/jansel
ghstack dependencies: #148890
2025-03-21 01:22:26 +00:00
Han, Xu
2975664fb0 add python root bin to windows load path. (#146573)
This PR adds the Python root bin path to the DLL load list.
This makes PyTorch more robust and compatible with more dependency libraries, such as `intel-pti`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146573
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-03-21 00:48:43 +00:00
Zhengxu Chen
f47aa08130 [export] Support python assertion with symints. (#149444)
Summary: This diff ports a technique from torch.fx symbolic tracing to trace through Python asserts when we run into data-dependent symbolic shape assertions, so that we can achieve the same effect as torch dynamo and automatically turn asserts into torch.check()s.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_python_asserts_with_sym_int
Differential Revision: D71425360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149444
Approved by: https://github.com/tugsbayasgalan
2025-03-20 23:07:45 +00:00
angelayi
bf34e228c5 [export] Beef up guard_added logs (#149465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149465
Approved by: https://github.com/pianpwk
2025-03-20 23:02:07 +00:00
Michael Lazos
1d3c50fcc5 [Dynamo] Support the torch._C.DisableTorchFunction ctx manager (#149491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149491
Approved by: https://github.com/StrongerXi
ghstack dependencies: #149489, #149490
2025-03-20 22:19:55 +00:00
Michael Lazos
ce5adc5c05 [Dynamo] add support for torch._C._is_torch_function_all_disabled (#149490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149490
Approved by: https://github.com/StrongerXi
ghstack dependencies: #149489
2025-03-20 22:19:55 +00:00
Michael Lazos
f64c361860 [Dynamo] Refactor DisableTorchFunction ctx manager (#149489)
Refactors the DisableTorchFunction ctx manager to properly model the eager code (no args to the context manager).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149489
Approved by: https://github.com/StrongerXi
2025-03-20 22:19:55 +00:00
zhc7
a268c29b9f [distributed] fix: use group rank instead of global rank when possible (#149488)
Fixes #149200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149488
Approved by: https://github.com/wconstab
2025-03-20 21:47:03 +00:00
Isuru Fernando
b07b819912 [inductor] Add a helper for convert index_dtype to torch dtype (#149531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149531
Approved by: https://github.com/eellison
2025-03-20 21:33:29 +00:00
Zhuoran Zhao
a703107f7b [AOTInductor] Fix skip cpp wrapper unit test (#149606)
Summary: as title

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test -- --exact 'deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_aoti_ep_called (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest)'
```
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --exact 'caffe2/test/inductor:cudagraph_trees_expandable_segments - test_skip_cpp_wrapper (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)'
```

https://www.internalfb.com/phabricator/paste/view/P1758059197

Reviewed By: henryoier

Differential Revision: D71528281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149606
Approved by: https://github.com/desertfire
2025-03-20 20:55:33 +00:00
Guilherme Leobas
406d464d97 Add is_batchedtensor to dynamo builder (#149541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149541
Approved by: https://github.com/zou3519
2025-03-20 20:46:15 +00:00
Kai Londenberg
f17ae3f7b7 [Inductor Cutlass backend] Fix imports and compilation of Cutlass SM100 Kernels (#149515)
Summary: Fixes the import and compilation of Cutlass SM100 Kernels.

Test Plan: Cutlass backend unit tests, running benchmarks/inductor_backends/cutlass.py

Differential Revision: D71196747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149515
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-03-20 20:35:18 +00:00
PyTorch MergeBot
24176f6e32 Revert "[cond] don't trace fw and bw graph in autograd key (#148930)"
This reverts commit 6e843a51dd.

Reverted https://github.com/pytorch/pytorch/pull/148930 on behalf of https://github.com/ydwu4 due to Test failure is legit ([comment](https://github.com/pytorch/pytorch/pull/148930#issuecomment-2741585315))
2025-03-20 20:28:29 +00:00
Yidi Wu
4a4a71a73c [inductor]lowering scan to while_loop (#148580)
This PR adds a pass in post_grad that lowers scan to while_loop. See the comment before the pass for how this is implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148580
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-20 20:21:02 +00:00
Yidi Wu
6e843a51dd [cond] don't trace fw and bw graph in autograd key (#148930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148930
Approved by: https://github.com/zou3519
2025-03-20 20:18:29 +00:00
Guilherme Leobas
18435945af Set __context__/__cause__ when generator raise StopIteration (#148765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148765
Approved by: https://github.com/zou3519
ghstack dependencies: #146505
2025-03-20 19:59:30 +00:00
Guilherme Leobas
44e6464914 Allow setting attribute to NestedUserFunctionVariable (#146505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146505
Approved by: https://github.com/zou3519
2025-03-20 19:59:30 +00:00
Dominic Binks
aae4c0729e Fix broken build within xplat/caffe2 (#149403)
Summary:
Following a pull from open source, the build within xplat is broken
due to not finding <autograd/function.h>.

Within the python_function.cpp there seems to be a convention of using the
torch/csrc prefix.

This change includes that prefix to enable the build to proceed.

Test Plan:
Build a binary using torch.

https://www.internalfb.com/buck2/83122485-d3c3-43f4-97b4-81bb90450b3b

Unit tests run too

https://www.internalfb.com/intern/testinfra/testrun/13229323975828416

Further testing in CI and elsewise expected.

Reviewed By: malfet

Differential Revision: D70331539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149403
Approved by: https://github.com/izaitsevfb

Co-authored-by: Dominic Binks <dbinks@meta.com>
2025-03-20 19:27:55 +00:00
Yi Wang
ffa085334c Specify the default PyTorch Distributed backend for MPS (#149538)
Fixes #149537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149538
Approved by: https://github.com/d4l3k, https://github.com/malfet
2025-03-20 18:54:03 +00:00
William Wen
6285a71aba [dynamo] fix bug where non-recursive disable modifies the original function (#148896)
Fixes https://github.com/pytorch/pytorch/issues/148787.

We fix this by:
- Wrapping the original function instead of directly modifying it
- When we detect that the previous frame is the non-recursive disable wrapper, then skip tracing this frame (non-recursive disable wrapper will always be skipped, so that frame will be present in the traceback)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148896
Approved by: https://github.com/jansel
2025-03-20 18:33:54 +00:00
Sergey Zimin
b99fc9d29f [MTIA] Support loading Tensors on mtia:0 for pytorch code (#149327)
Summary: The diff includes updates to the PyTorch code to enable loading tensors to MTIA.

Reviewed By: PatriceVignola

Differential Revision: D71176848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149327
Approved by: https://github.com/ezyang
2025-03-20 18:05:15 +00:00
James Wu
7bb9c36784 Hook StaticCudaLauncher up to torch.compile (cold start) (#148890)
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.

Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)
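
The selection rules above can be sketched as a single predicate. This is an illustration only: `KernelInfo`, its fields, and `can_use_static_launcher` are hypothetical names and not inductor's actual data structures. Only the 50-argument limit and the four rules themselves come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

MAX_STATIC_ARGS = 50  # limit taken from the rules above


@dataclass
class KernelInfo:
    """Hypothetical summary of a compiled triton kernel (not an inductor type)."""
    device_type: str
    cubin_path: Optional[str]
    num_args: int
    has_launch_hooks: bool = False
    needs_large_shared_memory: bool = False
    user_defined: bool = False


def can_use_static_launcher(k: KernelInfo) -> bool:
    """Return True only if every rule in the list holds."""
    return (
        k.device_type == "cuda"
        and k.cubin_path is not None          # a cubin file was found
        and k.num_args < MAX_STATIC_ARGS      # fewer than 50 arguments
        and not k.has_launch_hooks            # no special features
        and not k.needs_large_shared_memory
        and not k.user_defined                # user-defined kernels come later
    )
```

Any kernel that fails one of the checks would fall back to the regular triton launcher.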

We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.

Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).

For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime.

Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.

Fixes #149448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
2025-03-20 17:32:20 +00:00
Avik Chaudhuri
6237495fcf torch.Size input (#149414)
Summary: Support for `torch.Size` inputs was patchy before because `unflatten_fn` for this type returned a tuple. This PR cleans this up.

Fixes #149158

Test Plan: added test

Differential Revision: D71403635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149414
Approved by: https://github.com/yushangdi
2025-03-20 16:23:13 +00:00
IvanKobzarev
2c4bc65366 [aotd] Guess tangents stride as output strides (#144579)
When AOTDispatch prepares the AOT backward graph, it does not know the real tangents that the user will supply when running backward.

AOTD therefore guesses the tangents. Previously, we guessed that each tangent would have the same memory format as its corresponding output. If the tangents supplied at runtime do not have the memory format guessed during compilation, AOTD coerces (copies) them to the guessed memory_format.

But as Horace found, there are popular use cases where the outputs of the compiled region are in a specific memory_format, e.g. a 4D tensor with dims 1 and 2 transposed.

https://github.com/karpathy/nanoGPT/blob/master/model.py#L57

This PR changes the logic so that AOTD expects tangents to have the same "strideness" as the outputs. As a result, it avoids coercion in the transposed-dims case.
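
To make the difference concrete, here is a small pure-Python sketch of the strides involved. The shape is made up for illustration, and it assumes that a memory-format guess for a transposed output resolves to a plain contiguous layout; the helper names are hypothetical and no torch APIs are used.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a contiguous tensor of `shape`."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def transpose(shape, strides, d0, d1):
    """Swap two dims of a strided view without moving any data."""
    shape, strides = list(shape), list(strides)
    shape[d0], shape[d1] = shape[d1], shape[d0]
    strides[d0], strides[d1] = strides[d1], strides[d0]
    return tuple(shape), tuple(strides)


# Illustrative sizes only: (batch, seq, heads, head_dim).
shape = (2, 8, 4, 16)
out_shape, out_strides = transpose(shape, contiguous_strides(shape), 1, 2)

# A contiguous guess for the tangent of that output.
guessed_strides = contiguous_strides(out_shape)

print(out_strides)      # strides of the transposed output
print(guessed_strides)  # strides of a contiguous tensor of the same shape
```

When the two stride tuples differ, a tangent that matches the output layout would have to be copied to fit the contiguous guess; guessing the output strides directly avoids that copy.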

Limitations:
We keep guessing memory_format for:
1/ Dynamic shapes (needs more changes)
2/ Tensor subclasses (needs more changes)

Other changes:
test_torchinductor was always creating contiguous tangents via `torch.randn()`; these now use `torch.randn_like()` so that computations are compared with the same strideness.

(E.g. for cuda float16 strideness affects numerics for fft ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579
Approved by: https://github.com/bdhirsh
2025-03-20 15:41:36 +00:00
Zhengxu Chen
80dfce2cc3 [export] Handle non OpNamespace type during decomposition. (#149431)
Summary:
It turns out that torch.ops._dir can contain objects that are not OpNamespace instances.

We should simply skip those during iteration.

Test Plan: eyes

Differential Revision: D71417992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149431
Approved by: https://github.com/tugsbayasgalan
2025-03-20 15:36:15 +00:00
Shuai Yang
00a2c68f67 Fix a typo "trochrec" to "torchrec" (#149542)
Summary: As titled, the path is incorrect due to the typo

Test Plan: CI

Differential Revision: D71490709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149542
Approved by: https://github.com/williamwen42
2025-03-20 10:14:23 +00:00
William Wen
a66a9581da [dynamo] support Python 3.13t (#149549)
A few bug fixes to get Dynamo mostly working with 3.13 nogil. Dynamo encounters internal CPython assert errors in older versions of 3.13. The fix has been landed on [CPython's 3.13 branch](https://github.com/python/cpython/tree/3.13) and will be included in 3.13.3 (https://peps.python.org/pep-0719/ - april 8). If you wish to try `torch.compile` on the latest 3.13 branch, you can comment out the error checking (i.e. 70b6cd4e11/torch/__init__.py (L2535) and 70b6cd4e11/torch/_dynamo/eval_frame.py (L899)).

We will work on getting PyTorch CI up for Dynamo/dynamo-wrapped/inductor once 3.13.3 is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149549
Approved by: https://github.com/jansel
2025-03-20 09:49:27 +00:00
Blaine Burton Rister
970ac2d907 [Inductor] Improve memory locality by iterating over y dimension before x (#149339)
# Feature

Fixes https://github.com/pytorch/pytorch/issues/148718 by reordering the tensor dims to `(z, y, x)`.

As a bonus refactor, block pointers no longer need the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it as opposed to updating the logic for the new iteration order.

# Perf impact

It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).)

Training (all neutral or positive):
![image](https://github.com/user-attachments/assets/57f1ef1d-60b4-446f-baf3-aca87a26b81b)

Inference (one positive, one very small negative):
![image](https://github.com/user-attachments/assets/679aa057-af23-47f1-8d8e-8520daf1bd92)

As reported in https://github.com/pytorch/pytorch/issues/148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html):

> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).

I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.

> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.

The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
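
As a toy model of the effect (not inductor code), the sketch below lists the linear offsets touched when walking a row-major 2D buffer in two loop orders. Consecutive list positions stand in for consecutive thread IDs; the function name and sizes are assumptions for illustration.

```python
def offsets(order, ydim, xdim):
    """Linear offsets into a row-major (ydim, xdim) buffer, in visiting order.

    `order` names the loop nest from outermost to innermost.
    """
    if order == "yx":  # x varies fastest: neighbours touch adjacent addresses
        return [y * xdim + x for y in range(ydim) for x in range(xdim)]
    if order == "xy":  # y varies fastest: neighbours are xdim elements apart
        return [y * xdim + x for x in range(xdim) for y in range(ydim)]
    raise ValueError(f"unknown order: {order}")


print(offsets("yx", 2, 4))  # contiguous walk
print(offsets("xy", 2, 4))  # strided walk
```

The first order produces unit-stride accesses, which is the pattern that hardware coalescing and DMA engines are designed for.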

Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.

# Test plan
 - Updated expected code on CI tests.
 - Added a new test checking the {x,y,z}indices and block pointers on a 3D pointwise kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149339
Approved by: https://github.com/jansel
2025-03-20 08:12:00 +00:00
Bin Bao
3647711a89 [AOTI][refactor] Remove dead code (#149287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149287
Approved by: https://github.com/cyyever, https://github.com/yushangdi
2025-03-20 07:29:27 +00:00
PyTorch MergeBot
90ef7a9561 Revert "Supporting non-tensor-data write_size in planner write items. (#149434)"
This reverts commit 1442230a26.

Reverted https://github.com/pytorch/pytorch/pull/149434 on behalf of https://github.com/izaitsevfb due to breaking docs build ([comment](https://github.com/pytorch/pytorch/pull/149434#issuecomment-2739378287))
2025-03-20 06:52:02 +00:00
Rachel Guo
c4d59e6279 [Inductor] Fix combo_kernel logging error (#149575)
Summary:
Fix logging error like:
```
in combinable_nodes
    log.debug(
Message: 'ComboKernels: %d template nodes are filtered'
Arguments: (OrderedSet([8]),)
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/data/users/guorachel/fbsource/buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark#link-tree/torch/_logging/_internal.py", line 818, in format
    record.message = record.getMessage()
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: %d format: a real number is required, not OrderedSet
```

Encountered when running a prod model with the combo kernel feature enabled.
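
A minimal reproduction of this class of error, using only the standard library. A plain `set` stands in for inductor's `OrderedSet`, the logger name is hypothetical, and the two `log.debug` calls show possible fixes rather than the exact change made in this PR.

```python
import logging

log = logging.getLogger("combo_kernels_example")  # hypothetical logger name

template_nodes = {8}  # plain set standing in for OrderedSet([8])

# Broken pattern: %d needs a number, so formatting a set raises TypeError.
try:
    "ComboKernels: %d template nodes are filtered" % (template_nodes,)
except TypeError as exc:
    error = str(exc)

# Possible fix 1: log the count, which is what %d expects.
log.debug("ComboKernels: %d template nodes are filtered", len(template_nodes))

# Possible fix 2: keep the contents and switch the placeholder to %s.
log.debug("ComboKernels: %s template nodes are filtered", template_nodes)
```

Because the `logging` module formats lazily, the mismatch only surfaces when the record is actually emitted, which is why it shows up as a "Logging error" at runtime rather than at the call site.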

Test Plan: CI

Differential Revision: D71512220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149575
Approved by: https://github.com/ColinPeppler
2025-03-20 06:09:44 +00:00
Davide Italiano
595293316d [MPS/Inductor] Add support for modified_bessel_k0. (#149593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149593
Approved by: https://github.com/jansel
2025-03-20 04:51:44 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
9a184b1074 Monkeypatch fake mode so it errors on invalid custom ops (#149410)
Internal version: [D71294776](https://www.internalfb.com/diff/D71294776)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149410
Approved by: https://github.com/gmagogsfm
2025-03-20 04:50:57 +00:00
Menglu Yu
fe94d7da1a [Inductor][Optimus] Add move view after cat aten pattern (#149178)
Summary:
Add aten pattern to move the view/reshape out of split cat, further reduce the number of kernels.

context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0

Test Plan:
### how to enable
Add the following patterns to the post grad
```
        post_grad_fusion_options={
            "normalization_aten_pass": {},
            "move_view_after_cat_aten_pass": {},
        },
```

### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_move_view_after_cat_aten
```

Buck UI: https://www.internalfb.com/buck2/3c5451be-c63a-4794-8d6b-103ecac78905
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449704507267

### local reproduce

```
buck2 run mode/opt scripts/shuaiyang:test -- --flow_id 691990503 --use_synthetic_data --optimus
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2025-03-13-20-59-34/trace.json.gz&bucket=gpu_traces

### E2E

baseline

f691990503

proposal

Differential Revision: D71177004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149178
Approved by: https://github.com/Yuzhen11
2025-03-20 04:07:25 +00:00
bobrenjc93
cca46a0b6f Fix score_mod.py dynamic max autotune (#148991)
python benchmarks/transformer/score_mod.py --dynamic --max-autotune

previously would crash with

```
"/home/bobren/local/a/pytorch/torch/_inductor/select_algorithm.py", line 2306, in key_of
    node.get_device().type,

```

but with this change no longer does

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148991
Approved by: https://github.com/drisspg
2025-03-20 03:28:51 +00:00
Xu Han
bc1b8730a4 [Windows][inductor] fix blank space break windows file path (#149388)
Fixes #149310

From the original error message:
```cmd
Command:
cl /I C:/Program Files/Python310/Include /I c:/code/.env/lib/site-packages/torch/include /I c:/code/.env/lib/site-packages/torch/include/torch/csrc/api/include /I c:/code/.env/lib/site-packages/torch/include/TH /I c:/code/.env/lib/site-packages/torch/include/THC /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D C10_USING_CUSTOM_GENERATED_MACROS /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp /LD /FeC:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.pyd /link /LIBPATH:c:/code/.env/Scripts/libs /LIBPATH:c:/code/.env/lib/site-packages/torch/lib torch.lib torch_cpu.lib torch_python.lib sleef.lib

Output:
Microsoft (R) C/C++ Optimizing Compiler Version 19.43.34809 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
cl : Command line warning D9024 : unrecognized source file type 'Files/Python310/Include', object file assumed
coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp
C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp(21): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
```
Python is installed under `C:/Program Files/Python310`, and the blank space breaks the file path.

Solution:
Wrap Windows file paths in quotes. After that:
```cmd
cl /I "C:/Users/Xuhan/.conda/envs/new_build/Include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include/torch/csrc/api/include"  /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D  C10_USING_CUSTOM_GENERATED_MACROS /D CPU_CAPABILITY_AVX512  /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental  C:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.cpp  /arch:AVX512  /FeC:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.pyd /LD /link /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/libs" /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/lib"  "torch.lib" "torch_cpu.lib" "torch_python.lib" "sleef.lib"
```
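
The idea can be sketched in a few lines of Python. `quote_path` and `include_flags` are hypothetical helpers written for illustration; they are not the functions inductor's C++ builder uses.

```python
def quote_path(path: str) -> str:
    """Wrap a path in double quotes so embedded spaces stay in one argument."""
    return f'"{path}"'


def include_flags(include_dirs):
    """Build `/I "<dir>"` arguments for a cl command line."""
    return " ".join(f"/I {quote_path(d)}" for d in include_dirs)


unquoted = "/I C:/Program Files/Python310/Include"
print(unquoted.split())  # the space splits the path into two pieces

flags = include_flags(["C:/Program Files/Python310/Include"])
print(flags)
```

Without quotes, the second piece (`Files/Python310/Include`) is what `cl` reported as an unrecognized source file in the output above.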

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149388
Approved by: https://github.com/jansel
2025-03-20 03:10:30 +00:00
Tugsbayasgalan Manlaibaatar
3b7bd6c63d Fix dynamic shapes repordering bug (#149528)
When we create constraints, we look at the ordering of kwargs according to the model signature. But when we trace, we use the ordering created by how the user passes in their kwargs. As a result, constraints and dynamic shapes end up in a different order, causing issues when they have different dynamic tensor specs.
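
One way to make two such orderings agree is to normalize kwargs to signature order before using them. The sketch below is an assumption about the general technique, not the code in this PR; `order_like_signature` and `forward` are hypothetical names, and the helper ignores `**kwargs`-style parameters.

```python
import inspect


def order_like_signature(fn, kwargs):
    """Return a dict with the same items as `kwargs`, in fn's parameter order."""
    params = inspect.signature(fn).parameters
    return {name: kwargs[name] for name in params if name in kwargs}


def forward(x, y=None, z=None):  # stand-in for a model's forward
    return x


user_kwargs = {"z": 3, "x": 1}  # order the caller happened to use
ordered = order_like_signature(forward, user_kwargs)
print(list(ordered))
```

With both sides keyed in the same order, a positional pairing of constraints and dynamic-shape specs lines up again.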

Differential Revision: [D71478578](https://our.internmc.facebook.com/intern/diff/D71478578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149528
Approved by: https://github.com/ydwu4
2025-03-20 01:57:44 +00:00
Sam Larsen
1e30192b19 [logging] Add python version to dynamo_compile table (#149419)
Summary: This adds a version field like the following: `3.10.9+fb (3.10:1dd9be6, May  4 2022, 01:23:45) [Clang 15.0.7 (mononoke://mononoke.internal.tfbnw.net/fbsource 5d1601b0eed7426ac`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149419
Approved by: https://github.com/c00w
2025-03-20 01:48:34 +00:00
Pradeep Fernando
1442230a26 Supporting non-tensor-data write_size in planner write items. (#149434)
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. The planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way for the writer layer to get hold of this property (for non-tensor data) is to:

- first look up the actual tensor/bytes
- then calculate the nbytes.

This change introduces a way to capture the non-tensor data size within a write-plan item.

Reviewed By: daulet-askarov

Differential Revision: D70497442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149434
Approved by: https://github.com/MeetVadakkanchery
2025-03-20 01:22:05 +00:00
Theodore Ehrenborg
02e21c7854 Fix spelling (#149277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149277
Approved by: https://github.com/zou3519
2025-03-20 01:02:32 +00:00
Ethan Wee
6cbf97ede8 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/izaitsevfb

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 23:42:35 +00:00
Shivam Raikundalia
a11538aa46 [GPU Snapshot] Add Clear History Flag (#149352)
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually attach an OOM logger without knowing and when they go to collect the actual snapshot, it adds all the OOM logger contents. Since OOM and regular snapshot use the same backend, we currently don't have the infra in place to split these snapshots.

As a solution we add a flag to the snapshot frontend to clear out the history when starting the auto-trace record memory history.

A more thorough solution would be to have a user pass in a handle and to have snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth as we would have to change the callbacks in the caching allocator and pass these objects between python and cpp.

Test Plan:
See diff below

Differential Revision: D71159720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
2025-03-19 21:44:20 +00:00
PyTorch MergeBot
e1d143cb7b Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit ee1a2b7810.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/izaitsevfb due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2738115728))
2025-03-19 21:12:13 +00:00
Jessica Vandebon
71daeddde2 [MTIA] Ensure correct stream behavior for input_buffer add autograd on MTIA (#149433)
Test Plan: CI

Differential Revision: D71414498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149433
Approved by: https://github.com/albanD
2025-03-19 20:19:18 +00:00
Yanan Cao (PyTorch)
fae79e91a0 Remove torch.export.export_for_inference (#149078)
Summary: Remove torch.export.export_for_inference; it is redundant and can always be replaced with torch.export.export_for_training() + run_decompositions()

Test Plan: unit tests

Differential Revision: D71069057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149078
Approved by: https://github.com/tugsbayasgalan
2025-03-19 19:57:18 +00:00
Shangdi Yu
05fee772e5 Fix with effect lowering for list return type (#149510)
Summary: For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of a list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_aot_compile

buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r list_return

buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind # tested together with D70013257

buck run fbcode//mode/dev-nosan //caffe2/test:test_export  -- -r test_custom_obj
```

Reviewed By: angelayi

Differential Revision: D71346024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149510
Approved by: https://github.com/zou3519
2025-03-19 19:35:08 +00:00
Scott Ramsby
842a072fd3 [codemod] Fix clang-tidy command line doc comments (#149524)
Summary:
Fixes the comments to match the latest updates to the checked-in tools.

Search/replace applied in this order:
* `# /fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`
* `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`
* `fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`

Test Plan: CI

Reviewed By: johnkearney

Differential Revision: D71431516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149524
Approved by: https://github.com/janeyx99
2025-03-19 19:22:11 +00:00
Pian Pawakapan
96828a2155 [export] refactor DimHints for type errors (#149424)
Differential Revision: D71414367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149424
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri
2025-03-19 18:51:07 +00:00
Yidi Wu
9ec9f4740c [export] fix stft decomp and making it consistent with cpp impl. (#149232)
Summary: We change the fake impl of stft to follow its cpp implementation more closely [here](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/SpectralOps.cpp#L951-L963)

where `n_frames = 1 + (len - n_fft) / hop_length;` is also an integer division.
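
In Python terms, the frame count has to use floor division to match. The function name and the sample values below are made up for illustration; for non-negative operands, Python's `//` agrees with C++ integer division.

```python
def stft_n_frames(length: int, n_fft: int, hop_length: int) -> int:
    """Frame count using integer division, mirroring the quoted C++ expression."""
    return 1 + (length - n_fft) // hop_length


# Illustrative values: (16000 - 400) / 160 = 97.5, so true division
# would give a non-integer frame count.
print(stft_n_frames(16000, 400, 160))
print(1 + (16000 - 400) / 160)
```

Using `/` in a fake impl would yield a fractional value here, which cannot be a valid tensor dimension.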

Test Plan: Existing tests and buck2 build --flagfile fbcode//mode/dev fbcode//executorch/examples/models/fb/llama4:speech_transform.pte

Differential Revision: D71209142

edit: we kept the original path un-changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149232
Approved by: https://github.com/jackzhxng
2025-03-19 18:40:35 +00:00
Bin Bao
94d761fbf0 [AOTI][reland] Update test runner to use the new APIs (#149412)
Summary: Reland https://github.com/pytorch/pytorch/pull/147105. Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; internal test refactoring will follow.

Differential Revision: [D71470265](https://our.internmc.facebook.com/intern/diff/D71470265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149412
Approved by: https://github.com/yushangdi
2025-03-19 17:56:44 +00:00
IvanKobzarev
d686d04c2f [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-03-19 17:16:57 +00:00
Ze Sheng
e98afa0f89 [Sigmoid] Remove magic method in CapabilityBasedPartitioner (#149400)
Summary: As title.

Test Plan: CI

Differential Revision: D70575197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149400
Approved by: https://github.com/jfix71
2025-03-19 16:02:43 +00:00
Shangdi Yu
1bf443e2f2 [aoti x with_effect token] Unbacked symint and register lowering (#147656)
Differential Revision: D70022208

- When resolving unbacked symints in ExternKernel for with_effect, we need to ignore the first item in the binding path, because the `example_output` doesn't contain the effect token, but the binding paths do.
- Similarly, `node.meta["val"]` contains the effect token, so when we compute_unbacked_bindings, we need to remove that effect token

- For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of a list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147656
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-03-19 14:38:30 +00:00
Aaron Orenstein
2fcfae72b4 async fx compile (#146135)
Adds the ability to run the selected out-of-process fx compile scheme in async mode - where we kick off the compile and then run eagerly until the compile is finished.
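
The control flow can be illustrated with a small standard-library sketch. This is not the PR's implementation: a thread pool stands in for the out-of-process compile scheme, and `AsyncCompiled`, `slow_compile`, and `eager` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncCompiled:
    """Call the eager function until a background compile has finished."""

    def __init__(self, eager_fn, compile_fn, executor):
        self._eager_fn = eager_fn
        # Kick off compilation immediately; do not wait for it.
        self._future = executor.submit(compile_fn, eager_fn)

    def __call__(self, *args):
        if self._future.done():
            # Compilation finished: switch to the compiled callable.
            return self._future.result()(*args)
        return self._eager_fn(*args)


gate = threading.Event()  # lets the demo control when "compilation" ends


def slow_compile(fn):
    gate.wait()  # pretend compilation takes a while
    return lambda x: ("compiled", fn(x)[1])


def eager(x):
    return ("eager", x * 2)


with ThreadPoolExecutor(max_workers=1) as pool:
    f = AsyncCompiled(eager, slow_compile, pool)
    before = f(3)        # compile is still blocked on the gate
    gate.set()
    f._future.result()   # demo only: wait until the compile is done
    after = f(3)

print(before, after)
```

The caller never blocks on compilation; it only pays the cost of a `done()` check per call until the compiled version is ready.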

Added a test which runs a tiny model in a loop making sure that we execute it both eagerly and then compiled.

Differential Revision: [D71135546](https://our.internmc.facebook.com/intern/diff/D71135546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146135
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-03-19 14:07:51 +00:00
FFFrog
466d5295c1 Fixed abnormal behavior of LazyLinear when using LayzLinear and load_state together (#147599)
Update Points:
- Update the logic of ``initialize_parameters``
- Add new testcases

The ISSUE Related:
https://github.com/pytorch/pytorch/issues/147389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147599
Approved by: https://github.com/mikaylagawarecki
2025-03-19 10:01:12 +00:00
fduwjj
8bf3f3fc43 [c10d] Add a collective time estimator for NCCL comms (#149343)
We want to upstream the feature from new nccl for users to estimate comm time.

Resolves #147753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149343
Approved by: https://github.com/kwen2501
2025-03-19 07:54:02 +00:00
Riham Selim
b963d96bad [Torchscript] Add a flag to use mangled names instead of demangled (#148906)
Summary: Optionally keep mangled names when expanding torchscript stacks

Test Plan:
```
buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_generate --show-full-output

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/0bd9d136228ad8a7/scripts/rihams/LearnPyTorch/__torch_script_generate__/torch_script_generate.par

buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_execute --show-full-output
```

- With `--torch_jit_expanded_stacks_mangled` Flag:

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute fbcode/model.pt  --torch_jit_expanded_stacks_mangled --torch_jit_enable_expanded_stacks

https://fburl.com/scuba/strobelight_function_tracer/8die4rvm


Without Flag:

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute ./model.pt   --torch_jit_enable_expanded_stacks

https://fburl.com/scuba/strobelight_function_tracer/x3nladpf


Reviewed By: bbus

Differential Revision: D70905872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148906
Approved by: https://github.com/zdevito
2025-03-19 07:53:02 +00:00
Menglu Yu
981807cfcb [Inductor][Optimus] split cat aten pass (#149027)
Summary:
We add the aten pattern to optimize big cat node with arbitrary order of inputs to support APS jobs

context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0

Test Plan:
### how to enable
Add the following patterns to the post grad
```
        post_grad_fusion_options={
            "normalization_aten_pass": {},
            "split_cat_aten_pass": {"threshold_to_cat": 10},
        },
```
You can tune threshold_to_cat to achieve the best performance. If no value is given, the default value of 10 is used.

### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/9e52168d-c107-4be8-a46b-b9d239f5c50d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923605061752
Network: Up: 112KiB  Down: 132KiB  (reSessionID-915796e0-4a8f-486a-9f63-afb1e191d24a)
Executing actions. Remaining     0/3                                                                                   1.0s exec time total
Command: test.     Finished 2 local
Time elapsed: 4:57.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

baseline

f691990503

proposal

Differential Revision: D71017436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149027
Approved by: https://github.com/Yuzhen11
2025-03-19 06:01:05 +00:00
Simon Fan
f123f2c077 [ca] fix dce for side-effects (#149336)
The AOT backward may contain side-effectful ops, so we can't DCE them. CA now also calls the default fx.Node.is_impure, which covers some of the existing cases.
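
A toy dead-code-elimination pass shows why impurity has to be checked. The `Node` class and `dce` function below are hypothetical stand-ins and do not use torch.fx; the `impure` flag plays the role of the `is_impure` check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node; not torch.fx.Node."""
    name: str
    inputs: list = field(default_factory=list)
    impure: bool = False  # e.g. mutates state or performs I/O


def dce(nodes, outputs):
    """Keep nodes reachable from the outputs or from any impure node."""
    by_name = {n.name: n for n in nodes}
    live = set()
    # Impure nodes are treated as roots, just like the graph outputs.
    stack = list(outputs) + [n.name for n in nodes if n.impure]
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        stack.extend(by_name[name].inputs)
    return [n for n in nodes if n.name in live]


graph = [
    Node("a"),
    Node("unused", ["a"]),
    Node("log_side_effect", ["a"], impure=True),
    Node("out", ["a"]),
]
kept = [n.name for n in dce(graph, ["out"])]
print(kept)
```

Neither `unused` nor `log_side_effect` feeds the output, but only the pure one is safe to drop.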

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149336
Approved by: https://github.com/jansel
2025-03-19 05:56:47 +00:00
Pat Vignola
42bd4a09a3 [MTIA] Add _mtia_getCurrentRawStream to MTIA module (#149436)
Summary: The FlexAttention path generates code that uses this function. Although streams are not used yet in Triton-MTIA, adding it now means we do not have to branch and generate different code just for MTIA.

Test Plan: CI

Reviewed By: chaos5958

Differential Revision: D70072057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149436
Approved by: https://github.com/chaos5958
2025-03-19 05:17:51 +00:00
Ethan Wee
ee1a2b7810 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 03:59:55 +00:00
Avik Chaudhuri
20874a1f46 debug ival swap (#149206)
Summary:
Recall that we use "ivals" to track intermediate values of mutations during unflattening. Previously, for each such intermediate value, we would create a hidden shared attribute that would be updated / read by respective submodules.

Unfortunately this scheme doesn't work when some but not all of those submodules are swapped out. This is because the swapped in submodules have no knowledge of these hidden attributes. Thus the submodules that are not swapped out end up reading / updating dangling state.

This PR does away with these hidden attributes. Instead, we directly read the underlying buffer or placeholder that was updated, and update those underlying buffers and placeholders in place. This makes the graphs look much closer to their eager origins.

Test Plan: added some tests, ensured existing tests pass

Differential Revision: D71203469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149206
Approved by: https://github.com/tugsbayasgalan
2025-03-19 03:43:30 +00:00
Jun Luo
14dc6e732d Cache the get_device_module result (#149207)
Summary: As title.

Test Plan: OSS CIs.

Reviewed By: chaos5958

Differential Revision: D71084180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149207
Approved by: https://github.com/jansel
2025-03-19 03:20:38 +00:00
angelayi
01a57981aa [export] Add TracingContext (#149294)
TracingContext is added to all tracing locations -- in torch.export this is where we call make_fx (for training IR) and aot_export_module (for inference IR), and in run_decompositions where we call aot_export_module

Differential Revision: [D71298927](https://our.internmc.facebook.com/intern/diff/D71298927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149294
Approved by: https://github.com/ydwu4
2025-03-19 03:11:08 +00:00
Animesh Jain
a3c286677b [compile] Switch off inference mode during compilation (#149321)
This PR does the following:
* Sets `inference_mode` to False and enables `no_grad` for `convert_frame` if inference_mode is on globally.
* Turns off inference_mode for fake tensor prop. This ensures that converting from real inference tensor to a fake tensor removes the inference-ness.
* Graph breaks on is_inference and is_inference_mode_enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149321
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-03-19 02:45:27 +00:00
Bin Bao
04e251a7dd [AOTI] Add num_runners to AOTIModelPackageLoader (#149364)
Summary: AOTIModelContainerRunner takes a num_runners argument for multi-threaded inference, but AOTIModelPackageLoader did not take the same parameter, even though its run() API already accepts an optional cudaStream_t parameter for multi-threaded inference.

Differential Revision: [D71357418](https://our.internmc.facebook.com/intern/diff/D71357418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149364
Approved by: https://github.com/angelayi
2025-03-19 02:28:06 +00:00
Richard Barnes
536c0c7a47 [codemod][lowrisk] Remove unused exception parameter from caffe2/aten/src/ATen/cuda/CUDABlas.cpp (#149328)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-19 02:05:33 +00:00
Nichols A. Romero
11d4438a5f [ROCm][TunableOp] More TF32 support. (#149088)
This PR includes additional enhancements to TF32 support in TunableOp.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 00:26:20 +00:00
tvukovic-amd
268de64005 [ROCm][Windows] Enable torchvision build with ROCm on Windows (#147382)
- Updated HIP flags for Windows (removed non-Windows flags in the Windows case, added runtime library)
- Set hipcc call for Windows case
- Removed CUDA flags (not used in ROCm) on Windows
- Updated Windows compiler (added case when using ROCm on Windows)
- Fixed path issue in hipify_python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-18 23:37:05 +00:00
Angela Yi
ea02aac2ca [export] Update remove runtime asserts pass (#149198)
Test Plan: CI -- Removing asserts should be a noop

Differential Revision: D69566851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149198
Approved by: https://github.com/pianpwk
2025-03-18 23:07:25 +00:00
Zhuoran Zhao
45fec7843d Fix local compilation and hipification (#149384)
Summary:
As the title says, we need to fix the issue introduced by
https://github.com/pytorch/pytorch/pull/148305

Test Plan: CI and e2e https://docs.google.com/document/d/1Bu-MxJCkN7WaRkKJLVBQvnSp8yV0v3Aeb3Y9R5sjeHw/edit?tab=t.0

Differential Revision: D71373001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149384
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/chenyang78
2025-03-18 22:56:02 +00:00
Shivam Raikundalia
0d804dec0f [Profiler/Easy] Pass Overload Names To Kineto (#149333)
Summary: Right now we get overload names and forward them to the Event List frontend for the profiler, but we do not forward anything to Kineto. This diff checks whether each CPU op has an overload name and, if so, appends it to the op name.
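
A small sketch of the naming rule described above. The helper `kineto_op_name` is hypothetical, and the `.` separator is an assumption made for illustration; the message does not state the exact format.

```python
def kineto_op_name(name: str, overload_name: str = "") -> str:
    """Append the overload name to the op name when one exists.

    The "." separator is an assumption, not confirmed by the commit message.
    """
    return f"{name}.{overload_name}" if overload_name else name


print(kineto_op_name("aten::add", "Tensor"))
print(kineto_op_name("aten::relu"))
```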

Test Plan: Added test in CI

Differential Revision: D71326670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149333
Approved by: https://github.com/aaronenyeshi
2025-03-18 22:15:51 +00:00
angelayi
3b48c72141 [export] Minor refactor to trace.py (#149240)
Minor refactor to trace.py
* Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export`
* Matched the APIs of `_strict_export` and `_non_strict_export`
    * Instead of a `lower_to_aten_callback` which is a callable, or `dispatch_tracing_mode`, both functions take in a `_to_aten_func` which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240
Approved by: https://github.com/pianpwk
2025-03-18 21:40:30 +00:00
Justin Chu
010963032c [ONNX] Create onnx_symbolic (#148905)
In the old exporter we allow users to define a symbolic() method to bypass JIT tracing for a block of logic. We can allow users to do similar things by creating symbolic ops at export.

This PR implements `torch.onnx.ops.symbolic` and `torch.onnx.ops.symbolic_multi_out` to allow users to create ONNX nodes symbolically with pt2 & fx. The custom PyTorch ops are designed so that the attributes are encoded as part of a valid fx op. Users provide the shape and dtype so that the meta function can produce the correct fake tensor during export.

An example is

![image](https://github.com/user-attachments/assets/c62f5f21-e038-456e-a71d-b9a5d0a7cd9d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148905
Approved by: https://github.com/titaiwangms
2025-03-18 21:32:06 +00:00
Yuxin Wu
d80a70b58a Avoid unnecessary clone in torch.cuda.set_rng_state (#149283)
Clone has a performance issue according to f49c3eb6e6/megatron/core/tensor_parallel/random.py (L77-L80)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149283
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-18 20:47:57 +00:00
Thomas Bohnstingl
cd5c13d8f0 [hop] Rework the check of Metadata in the functionalization key (#148789)
This PR is a mostly cosmetic rework of the metadata check performed by some HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148789
Approved by: https://github.com/ydwu4
2025-03-18 20:30:59 +00:00
Brian Hirsh
f06e366532 partitioner: treat inputs with static indices as free to save (#148922)
Fixes https://github.com/pytorch/pytorch/issues/141881

internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1538435030128036/?comment_id=1556782068293332

I tried to make a test case out of the code linked in that github issue. The setup + bad outcome today was as follows:

(1) you have a graph where one of its inputs is a model weight

(2) in the backward, you do some downstream compute on `weight`, `tmp = f(weight)`, where (a) `tmp` is of a smaller size than `weight`, and (b) the compute is trivially fusible into other kernels (so the partitioner thinks it is "free" to recompute)

(3) since `sizeof(tmp) < sizeof(weight)` and the recompute is free, the partitioner decides that it would be strictly better to save `tmp` for backward instead of weight

(4) this is bad: `weight` is a static tensor that sits in GPU memory for the duration of your entire training loop, so saving it for backward has no negative impact on peak memory.  Since we're saving `tmp` instead, we end up unnecessarily increasing peak memory. In particular - the repro involves an autograd.Function in eager that saves the weight for bw, so we end up hitting higher peak memory in compile

The fix I'm trying out in this PR is to tell the partitioner that graph inputs that we know have static addresses (aka parameters) are "free" to save.
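
The decision can be sketched with a toy cost model in plain Python. This is not the real partitioner, which solves a more general problem; `Candidate`, `save_cost` and `choose_to_save` are hypothetical names, and the byte sizes follow the 64x64 bf16 weight and float8 `t_8` from the repro.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    nbytes: int
    is_static_input: bool = False  # e.g. a parameter with a static address


def save_cost(c, treat_static_as_free):
    """Memory cost of saving a tensor for backward in this toy model."""
    if treat_static_as_free and c.is_static_input:
        return 0  # already resident for the whole training loop
    return c.nbytes


def choose_to_save(candidates, treat_static_as_free):
    # Pick whichever candidate is cheapest to keep alive until backward.
    return min(candidates, key=lambda c: save_cost(c, treat_static_as_free))


weight = Candidate("primals_2", 64 * 64 * 2, is_static_input=True)  # bf16
tmp = Candidate("t_8", 64 * 64 * 1)                                 # float8

before = choose_to_save([weight, tmp], treat_static_as_free=False).name
after = choose_to_save([weight, tmp], treat_static_as_free=True).name
print(before, after)
```

Without the change the smaller `t_8` looks cheaper; once static inputs cost nothing to save, `primals_2` is chosen instead.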

Below is the fw/bw graph before my change, where you can see that instead of `primals_2` being saved for backward, we save `t_8` (which involves some low-precision downstream compute on `primals_2` that is only needed in the backward).

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1])
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3);  div_3 = None
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, unsqueeze_8, t_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", t_8: "f8e4m3fn[64, 64][1, 64]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)

```

With the change, we save `primals_2` for backward instead:

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, primals_2, unsqueeze_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3)
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148922
Approved by: https://github.com/zou3519
2025-03-18 20:08:11 +00:00
Nichols A. Romero
c0566e0dbf [ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245)
Fixes https://github.com/ROCm/hip/issues/3764.

Fixes and improvements to CUDA->HIP flag conversion for CPP extensions

- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance.
- Fix case where nvcc key may not exist
- Fix case where hipify should ignore flag values and only touch the flag itself
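
The rules in the list above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the function names `convert_cuda_flag_to_hip` and `convert_flags`, and the plain `CUDA` -> `HIP` substitution, are assumptions made for the example.

```python
import logging

logger = logging.getLogger(__name__)


def convert_cuda_flag_to_hip(flag: str) -> str:
    """Hypothetical helper: rewrite one compiler flag from CUDA to HIP naming."""
    # Include paths may legitimately contain "CUDA"; leave -I flags untouched.
    if flag.startswith("-I"):
        return flag
    # Split "-DNAME=value" so that only the flag itself is rewritten,
    # never the value after "=".
    name, sep, value = flag.partition("=")
    # count=1 replaces only the first occurrence of "CUDA".
    converted = name.replace("CUDA", "HIP", 1) + sep + value
    if converted != flag:
        # Log the conversion for debugging purposes.
        logger.debug("Converted flag %s -> %s", flag, converted)
    return converted


def convert_flags(extra_compile_args: dict) -> list:
    """Hypothetical helper: convert the nvcc flags, tolerating a missing key."""
    return [convert_cuda_flag_to_hip(f) for f in extra_compile_args.get("nvcc", [])]
```

With this sketch, `-I/opt/CUDA/include` passes through unchanged, while `-DUSE_CUDA=/usr/local/CUDA` becomes `-DUSE_HIP=/usr/local/CUDA`.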

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
2025-03-18 18:01:07 +00:00
eellison
585fd972b8 Iterate over dense dim first in split reduction reindexing (#147229)
Fix for https://github.com/pytorch/pytorch/issues/144431.

Improves perf from 0.29963893827160504 -> 0.0396331632970453.

In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.
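
To illustrate why the iteration order matters, here is a small pure-Python sketch (not Inductor code). The helper `visited_offsets` and the example shape and strides are assumptions chosen for the illustration: a 4x3 tensor whose dense (stride-1) dimension is dim 0 rather than the last dimension.

```python
from itertools import product


def visited_offsets(shape, strides, loop_order):
    """Return memory offsets in the order nested loops would visit them.

    loop_order lists dimensions from outermost to innermost loop.
    """
    ranges = [range(shape[d]) for d in loop_order]
    offsets = []
    for idx in product(*ranges):
        offsets.append(sum(i * strides[d] for i, d in zip(idx, loop_order)))
    return offsets


shape, strides = (4, 3), (1, 4)  # dim 0 is the dense dimension

# Last dimension innermost: memory is accessed with a stride of 4.
naive = visited_offsets(shape, strides, loop_order=(0, 1))

# Dense dimension innermost: memory is accessed contiguously.
dense_first = visited_offsets(shape, strides, loop_order=(1, 0))
```

In this sketch `naive` starts with offsets 0, 4, 8, while `dense_first` walks offsets 0 through 11 in order, which is the access pattern the re-indexing aims for.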

This PR also gives evidence for the general need for reduction tiling, e.g. for cooperative reduction handling of this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
2025-03-18 17:35:21 +00:00
mori360
ee3a2c6ee2 [State_dict] Remove functools.cache and add unit test (#149354)
Fixes https://github.com/pytorch/pytorch/issues/149100

@functools.cache would keep 'self' alive, leading to unexpected memory performance. (e.g. in the issue linked, if the model is deleted, the model's memory is still occupied.)
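
A minimal sketch of the effect, independent of the state_dict code: the classes `Cached` and `Uncached` and the helper `is_freed_after_del` are made up for this illustration. The cache lives on the function object and uses `self` as part of the key, so it holds a strong reference to every instance the method was called on.

```python
import functools
import gc
import weakref


class Cached:
    @functools.cache  # cache key includes `self`, so the cache references the instance
    def compute(self):
        return 42


class Uncached:
    def compute(self):
        return 42


def is_freed_after_del(cls) -> bool:
    """Create an instance, call compute(), delete it, and report whether it was freed."""
    obj = cls()
    obj.compute()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None
```

Here `is_freed_after_del(Uncached)` is true, while `is_freed_after_del(Cached)` is false until `Cached.compute.cache_clear()` is called.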

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149354
Approved by: https://github.com/fegin
2025-03-18 17:30:41 +00:00
mori360
5b8cc4709a [FSDP2] Add set_reshard_after_forward (#149103)
Fixes https://github.com/pytorch/pytorch/issues/149029

Add `set_reshard_after_forward` to set `post_forward_mesh_info` so as to decide `_reshard_after_forward`

Add a unit test similar to `test_fully_shard_communication_count`: after `set_reshard_after_forward(True)` the FSDPModule should behave as with `_reshard_after_forward=True`, and likewise after setting it to False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149103
Approved by: https://github.com/awgu
2025-03-18 17:21:54 +00:00
Aleksei Nikiforov
d5b1d99f78 Enable more nightly tests on s390x (#148452)
Also enable some tests which probably were accidentally disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148452
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-18 16:09:39 +00:00
Saurabh Mishra
381d0cb239 [DCP] Avoid in-place update and deepcopy during dudpe (#149320)
Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive with models having a huge number of FQNs. This was manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

#### Control job with deepcopy regression:
First save ~24.8s
Global step latency is ~7-8s

#### Test job with the new fix to avoid deepcopy:
First save is ~21s
Global step latency is ~2s

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
Justin Chu
fdacf3c920 [ONNX] Update types in VerificationInfo (#149377)
torch.types.Number was rendered as is in the documentation and can be confusing. We write the original types instead to reduce confusion for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms
2025-03-18 15:37:39 +00:00
PyTorch MergeBot
405025778d Revert "[AOTI] Update test runner to use the new APIs (#147105)"
This reverts commit 9a78513c3c.

Reverted https://github.com/pytorch/pytorch/pull/147105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147105#issuecomment-2733656413))
2025-03-18 15:25:40 +00:00
Pat Vignola
213eea216a [MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340)
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
2025-03-18 15:15:12 +00:00
Andrey Talman
bc88f6faa1 Use TorchVersion for triton version check (#149136)
Followup after https://github.com/pytorch/pytorch/pull/149092#issuecomment-2721990321
To use TorchVersion for triton version parsing
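
As a rough illustration of why a parsed version is preferable to raw string comparison, here is a pure-Python sketch. It does not use `TorchVersion` itself; the helper `parse_release` is a hypothetical stand-in that only handles simple dotted release numbers.

```python
import re


def parse_release(version: str) -> tuple:
    """Hypothetical helper: turn '3.2.0+git0d4682f0' into (3, 2, 0)."""
    release = version.split("+", 1)[0]  # drop any local version suffix
    numbers = []
    for piece in release.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


# Comparing raw strings is lexicographic, so "3.10.0" sorts before "3.9.0".
string_compare_is_wrong = "3.10.0" < "3.9.0"

# Comparing parsed tuples gives the expected ordering.
parsed_compare_is_right = parse_release("3.10.0") > parse_release("3.9.0")
```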

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149136
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 13:48:46 +00:00
Francisco Massa
9b92828d4b Add batch dim sharding rule to sdpa (#149253)
This is a trivial rule that for most cases isn't needed, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicated()` as it is currently assumed), then we need this rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
2025-03-18 07:54:02 +00:00
Davide Italiano
9cd52da45c [MPS/inductor] Add support for modified_bessel_i1. (#149379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149379
Approved by: https://github.com/malfet
2025-03-18 06:02:33 +00:00
Rachel Guo
b8f91bcb14 [pt2_provenance_tracking] add support for cpp kernel (#149185)
Summary:
As title.

Add inductor cpp kernel to post grad graph node mapping
& UT.

Context:
Raised as a feature request for AOTI CPU case.

https://fb.workplace.com/groups/1028545332188949/permalink/1169020841474730/

Differential Revision: D71181284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149185
Approved by: https://github.com/jingsh
2025-03-18 04:43:07 +00:00
Shangdi Yu
7869196482 Fix torchbind schema str generation (#149239)
Summary: Fix Torchbind HOP schema generation when there's no input

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
```

Differential Revision: D71231164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149239
Approved by: https://github.com/zou3519
2025-03-18 04:29:56 +00:00
Wei-Sheng Chin
bca75fe97a [MAIA] [Autocast] Enable autocast on MAIA device (#148511)
Fixes #148510.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148511
Approved by: https://github.com/albanD
2025-03-18 03:46:22 +00:00
Mu-Chu Lee
bb42e4d137 [AOTInductor] Add function to free buffer (#149161)
Summary:
We add a function that allows users to free the unused buffer.

Test Plan:
Testing correctness:
    python test/inductor/test_aot_inductor.py -k free_inactive

Testing memory consumption:
    LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
    /home/$USER/local/pytorch/build/bin/test_aoti_inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
Jane Xu
988827cdfb Use schema as source of truth + support ones_like/empty_like (#149052)
This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema as the source of truth. This is important because IValue types are overloaded and can be converted ambiguously or incorrectly. For example, a MemoryFormat looks like an int and would get converted to an int64_t instead of a MemoryFormat!

(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
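
The ambiguity in (a) can be sketched in plain Python. Everything here is hypothetical and only mirrors the idea: the `MemoryFormat` enum values, the type-name strings, and `decode_with_schema` are assumptions for the example, not the actual shim code.

```python
from enum import IntEnum


class MemoryFormat(IntEnum):
    # Illustrative values only.
    CONTIGUOUS = 0
    PRESERVE = 1
    CHANNELS_LAST = 2


def decode_by_value(raw_args):
    """Value-based decoding: every integer-like argument is treated as an int."""
    return [int(a) for a in raw_args]


def decode_with_schema(raw_args, schema_types):
    """Schema-based decoding: the declared type decides how each value is read."""
    decoders = {"int": int, "MemoryFormat": MemoryFormat}
    return [decoders[t](a) for a, t in zip(raw_args, schema_types)]


raw = [2, 2]  # the same raw value can mean two different things
by_value = decode_by_value(raw)
by_schema = decode_with_schema(raw, ["int", "MemoryFormat"])
```

In this sketch the second element of `by_schema` is a `MemoryFormat`, while value-based decoding cannot tell the two arguments apart.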

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD
2025-03-18 02:40:54 +00:00
Justin Chu
ebabd0efdd [ONNX] Expose verification utilities (#148603)
Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms
2025-03-18 02:10:34 +00:00
Sun, Jiayi
c36ac16da1 [Inductor] optimize welford reduction (#145061)
Fix https://github.com/pytorch/pytorch/issues/141541.
Fix https://github.com/pytorch/pytorch/issues/142839.
Fix https://github.com/pytorch/pytorch/issues/143182.

**Summary:**
To fix the insufficient accuracy of Welford reduction, we follow the eager implementation and combine the Welford algorithm with cascade summation to improve numerical stability. Specifically:
1. Use Welford algorithm to compute mean and variance.
2. Use cascade summation when computing sum over input for both mean and variance.
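
The two steps above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the generated C++ kernel or the `WelfordHelper` logic: it computes Welford statistics per fixed-size chunk and merges the partial results pairwise, which is one way to realize a cascade. The names `welford_chunk`, `cascade_reduce`, `mean_var`, and the chunk size are made up for the example.

```python
def welford_combine(a, b):
    """Merge two partial results given as (count, mean, m2) tuples."""
    na, mean_a, m2_a = a
    nb, mean_b, m2_b = b
    n = na + nb
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2_a + m2_b + delta * delta * na * nb / n
    return (n, mean, m2)


def welford_chunk(xs):
    """Run the Welford update over one chunk of values."""
    state = (0, 0.0, 0.0)
    for x in xs:
        state = welford_combine(state, (1, float(x), 0.0))
    return state


def cascade_reduce(parts):
    """Merge partial results pairwise, level by level, instead of left to right."""
    parts = list(parts)
    if not parts:
        return (0, 0.0, 0.0)
    while len(parts) > 1:
        merged = [
            welford_combine(parts[i], parts[i + 1])
            for i in range(0, len(parts) - 1, 2)
        ]
        if len(parts) % 2:
            merged.append(parts[-1])  # odd element carries over to the next level
        parts = merged
    return parts[0]


def mean_var(xs, chunk=4):
    """Return (mean, population variance) of a non-empty sequence."""
    parts = [welford_chunk(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    n, mean, m2 = cascade_reduce(parts)
    return mean, m2 / n
```

Merging partial results in a tree keeps the operands of each merge at similar magnitudes, which is the usual motivation for cascade summation over a single running accumulator.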

I ran the Inductor benchmarks with this PR on CPU; no performance gains or regressions were seen.

**Example:**
Take https://github.com/pytorch/pytorch/issues/141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
c_model = torch.compile(model)
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**logs**

- Before
```
tensor(7.0095e-05)
False
```
- After
```
tensor(9.5367e-07)
True
```

- On CUDA
```
tensor(1.4305e-06, device='cuda:0', grad_fn=<MaxBackward1>)
True
```

**Generated code:**
- Before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(131072L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                        }
                    }
                }
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```
- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ln/clnlak27xpvmq3klpqyj6xzyq2thf4ecrezve5ddy4f4xaz4sb7w.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                WelfordHelper<at::vec::Vectorized<float>> welford_helper0(static_cast<int64_t>(131072L));
                static WelfordHelper<at::vec::Vectorized<float>> masked_welford_helper0(static_cast<int64_t>(0L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &welford_helper0);
                        }
                    }
                }
                tmp_acc0_vec = welford_combine(tmp_acc0_vec, &welford_helper0);
                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, &masked_welford_helper0);
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145061
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2025-03-18 02:05:35 +00:00
Su, Tong
60523540f1 Force build to conform C++ standard on windows by adding /permissive- flag (#149035)
Fixes #147366

1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard.
2. Fix the error when trying to assign a string literal to a non-const ptr.

The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks),
>  By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions.
> The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option.

Thus, it is reasonable to add this flag to the existing project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:51:46 +00:00
Xia, Weiwen
c1dd75e4dc Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031)
**Summary**
The previous implementation of the shim did not align with the design and was removed by https://github.com/pytorch/pytorch/pull/148907
This PR adds it back in the MKLDNN backend files and re-enables the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
2025-03-18 01:33:13 +00:00
cyy
425c6d8eba Replace c10::is_pod with std::is_trivial (#149286)
These remaining c10::is_pod calls can be replaced without compromising the semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149286
Approved by: https://github.com/zou3519
2025-03-18 01:33:01 +00:00
Animesh Jain
f9a787224c [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves the readability of recompilation messages: previously, the recompile message would show only the `id`.
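
To illustrate why dropping `id` helps, here is a minimal sketch of the two guard styles. The dictionary layout and the helper names (`make_id_guard`, `make_constant_guard`, `make_guard`) are hypothetical and are not Dynamo's actual guard representation; the point is only that an `id`-based check embeds a process-specific address, while a check against `None`, `True`, or `False` can be written as plain text.

```python
import json


def make_id_guard(name, value):
    # Hypothetical: embeds id(value), which is only meaningful inside the
    # process that created the object, so it cannot be reloaded elsewhere.
    return {"kind": "id_match", "expr": f"id({name}) == {id(value)}"}


def make_constant_guard(name, value):
    # Hypothetical: None/True/False are singletons in every interpreter,
    # so the check can be stored as readable text.
    return {"kind": "constant_match", "expr": f"{name} is {value!r}"}


def make_guard(name, value):
    """Pick the constant-style guard for bool and None, id-style otherwise."""
    if value is None or isinstance(value, bool):
        return make_constant_guard(name, value)
    return make_id_guard(name, value)


print(json.dumps(make_guard("flag", True)))
print(make_guard("obj", object())["kind"])
```

The first guard serializes to a readable expression (`flag is True`), which is also what makes a recompilation message easier to interpret than a raw address.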

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-18 01:25:37 +00:00
Aaron Gokaslan
a0ac63cbd9 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-18 00:46:07 +00:00
Bin Bao
9a78513c3c [AOTI] Update test runner to use the new APIs (#147105)
Summary: Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; a follow-up will refactor the internal tests.

Differential Revision: [D69609685](https://our.internmc.facebook.com/intern/diff/D69609685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147105
Approved by: https://github.com/jingsh
2025-03-18 00:27:09 +00:00
PyTorch MergeBot
b52a8bef01 Revert "[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)"
This reverts commit 5905bbe745.

Reverted https://github.com/pytorch/pytorch/pull/149228 on behalf of https://github.com/malfet due to I wonder if this will fix the pr-time-benchmark regressions ([comment](https://github.com/pytorch/pytorch/pull/149228#issuecomment-2731237949))
2025-03-18 00:10:50 +00:00
Mayank Mishra
37c914ca0c fix simple-spec crash (#147723)
found an issue while running `python torchgen/fuse/gen_patterns.py`

exact error:
```shell
Traceback (most recent call last):
  File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module>
    joint_graph.lazy_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init
    result = fn()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init
    _pad_mm_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init
    gen_register_replacement(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement
    pat = _serialize_pattern(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern
    file_template = get_file_template()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template
    if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)):
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class
```

This PR fixes this issue.
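
The traceback shows `issubclass` raising even though the preceding `isinstance(attr, type)` check passed. One plausible trigger, stated here as an assumption, is a parameterized generic such as `list[int]`, which is reported as an instance of `type` on Python 3.9 and 3.10. The following is a minimal sketch of a defensive check; `is_pattern_class` and the stand-in classes are hypothetical, and this is not necessarily the change made in the PR.

```python
import inspect
from abc import ABC


class PatternExpr(ABC):
    """Stand-in for the real base class; hypothetical, for illustration only."""


class CallFunction(PatternExpr):
    """Stand-in subclass; hypothetical."""


def is_pattern_class(attr, bases=(PatternExpr,)):
    """Return True only when attr is a real class deriving from one of bases."""
    if not inspect.isclass(attr):
        return False
    try:
        return issubclass(attr, bases)
    except TypeError:
        # Some objects pass the class check yet are rejected by the ABC
        # subclass check with "issubclass() arg 1 must be a class".
        return False


candidates = [CallFunction, list[int], 42, "text", dict]
print([is_pattern_class(c) for c in candidates])
```

Catching `TypeError` keeps the attribute scan going when a module exports an object that only looks like a class.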

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@meta.com>
2025-03-17 23:25:48 +00:00
Tony-Y
78715a181f Convert Tensor lr to 0-dim as needed for the optimizer to normally work (#145674)
Fixes #145461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145674
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-17 23:07:05 +00:00
Mu-Chu Lee
1157367c78 [AOTInductor] [BE] Add macro for loading symbols in aoti runner (#149249)
Summary:
Add macro for loading symbols in aoti runner

Test Plan:
Existing tests

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149249
Approved by: https://github.com/chenyang78
2025-03-17 23:02:01 +00:00
PyTorch MergeBot
24cfeec2c7 Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)"
This reverts commit bfee141666.

Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see 8bc7bd94a5/1 ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))
2025-03-17 22:57:00 +00:00
PyTorch MergeBot
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f2.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
Yanan Cao (PyTorch)
a16ada41b9 Fix outdated docstring of torch.export.export regarding strict flag (#149077)
Summary: Fix outdated docstring of torch.export.export regarding strict flag

Test Plan: None, doc only change

Differential Revision: D71068215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149077
Approved by: https://github.com/zhxchen17
2025-03-17 22:29:20 +00:00
Sheng Qin
d25617255c Fix AOTI update_constant_buffer issue. (#149243)
Summary:
In D69553929 we changed the logic of constant and buffer updates in AOTI. However, this is incompatible with the current Sigmoid runtime because it uses different logic to pass in buffers, resulting in errors like
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
    @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
    @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/fibers/GuardPageAllocator.cpp:237
    @ 000000000004455f (unknown)
                       /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
                       -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
    @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```

Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par  --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor  --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```

2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false --predictor_hardware_type=1 --disableStaticRuntime=true
```

Reviewed By: hl475

Differential Revision: D71236710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475, https://github.com/jingsh
2025-03-17 22:10:57 +00:00
Davide Italiano
e4f6e4ac84 [MPS] Add inductor support for modified_bessel_i0. (#149342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149342
Approved by: https://github.com/malfet
2025-03-17 21:45:51 +00:00
Benjamin Glass
e8dd58b8cf cpp_wrapper: Precompile device-specific header files (#146928)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests.  Total OpInfo test runtime is down about 2x from this change alone.

Relands #144002, with changes needed by fbcode internals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928
Approved by: https://github.com/desertfire
2025-03-17 20:40:15 +00:00
Shunting Zhang
6c7d8419e3 fix two accuracy regression (#149172)
There are 2 accuracy regressions in the 3/12 nightly perf run. I cannot reproduce them locally, so there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check.

- error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316
- error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-17 19:34:00 +00:00
Pat Vignola
769f19bf95 [MTIA] Add _mtia_exchangeDevice to MTIA module (#149322)
Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
2025-03-17 19:31:10 +00:00
Isuru Fernando
c41c2130be Fix printing INT64_MIN (#149148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149148
Approved by: https://github.com/anijain2305
2025-03-17 17:57:18 +00:00
Rachel Guo
aaa4c3d60b [mm_logs] make aten mm info readable (#148800)
Summary:
As titled: make the aten mm info into a table, for example (see also the picture in the test plan):

| Name     | M   | N   | K   | Count |
| aten.mm | 16  | 6   |  16 |     1     |
...
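
A table like the one above can be produced by counting identical `(name, M, N, K)` records. This is a minimal sketch under that assumption; `format_mm_table` and the sample records are hypothetical and do not reflect how the PR collects or prints the data.

```python
from collections import Counter


def format_mm_table(calls):
    """Render (name, M, N, K) records as a table with one row per shape."""
    counts = Counter(calls)
    header = ("Name", "M", "N", "K", "Count")
    rows = [
        (name, str(m), str(n), str(k), str(count))
        for (name, m, n, k), count in sorted(counts.items())
    ]
    table = [header, *rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    lines = []
    for row in table:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


calls = [("aten.mm", 16, 6, 16), ("aten.mm", 32, 8, 16), ("aten.mm", 16, 6, 16)]
print(format_mm_table(calls))
```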

Test Plan: {F1975907876}
<img width="1090" alt="Screenshot 2025-03-11 at 3 13 00 PM" src="https://github.com/user-attachments/assets/ffae8c56-e32c-49cc-bbfb-5b8d216b8657" />

Differential Revision: D70825664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148800
Approved by: https://github.com/henrylhtsang
2025-03-17 17:00:58 +00:00
Xinya Zhang
2a011ca904 [ROCm] testing: enable MEFF/FA unittests for gfx1100 (#148911)
Include gfx1100, and optionally enable gfx1201/gfx950 according to env var TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148911
Approved by: https://github.com/jeffdaily
2025-03-17 16:41:15 +00:00
PyTorch MergeBot
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
Sun, Jiayi
b2862f1435 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias`, and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed over the range [N, C, *],
3. out = out * weight + bias, computed over the range [N, C, *].

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, computed over the range [N, C],
3. out = x * new_weight + new_bias, computed over the range [N, C, *].

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) improve by about 25%, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) improve by about 2%. On A100, no performance gains or regressions were seen.
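
The two decompositions are algebraically equivalent, since (x - mean) * rstd * weight + bias = x * (rstd * weight) + (-mean * rstd * weight + bias). The sketch below checks this in plain Python for a single sample with one group; the function names and the list-of-lists layout are assumptions for illustration and are not the actual decomposition code.

```python
import math


def group_stats(x, eps=1e-5):
    """Mean and reciprocal std over every element of one group."""
    flat = [v for channel in x for v in channel]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return mean, 1.0 / math.sqrt(var + eps)


def original_decomposition(x, weight, bias, eps=1e-5):
    mean, rstd = group_stats(x, eps)
    # Two passes over the full [C, *] range: normalize, then apply the affine.
    normalized = [[(v - mean) * rstd for v in channel] for channel in x]
    return [[v * w + b for v in channel]
            for channel, w, b in zip(normalized, weight, bias)]


def new_decomposition(x, weight, bias, eps=1e-5):
    mean, rstd = group_stats(x, eps)
    # Fold mean and rstd into per-channel scalars: work proportional to C.
    new_weight = [rstd * w for w in weight]
    new_bias = [-mean * rstd * w + b for w, b in zip(weight, bias)]
    # A single pass over the full [C, *] range.
    return [[v * nw + nb for v in channel]
            for channel, nw, nb in zip(x, new_weight, new_bias)]


x = [[0.5, -1.0, 2.0, 3.5], [1.5, 0.25, -2.0, 4.0]]  # C=2, inner size 4
weight, bias = [1.2, 0.8], [0.1, -0.3]
old = original_decomposition(x, weight, bias)
new = new_decomposition(x, weight, bias)
print(all(math.isclose(a, b, abs_tol=1e-9)
          for ro, rn in zip(old, new) for a, b in zip(ro, rn)))
```

The saving comes from the element-wise work: the original form touches every inner element twice, while the new form touches it once and does the extra arithmetic only per channel.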

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-17 09:27:01 +00:00
zeshengzong
1cc5f6b623 Optimize MaxPool1d param ceil_mode description (#148869)
Fixes #148123

Add output shape formula based on `ceil_mode` value, according to

00199acdb8/aten/src/ATen/native/Pool.h (L61-L75)
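
As a rough illustration of how `ceil_mode` changes the output length, the sketch below follows the commonly documented pooling shape rule, including the adjustment that keeps the last window starting inside the padded input. The function name `pooling_output_size` is hypothetical, and the code is a reading of the rule rather than a copy of the referenced header.

```python
def pooling_output_size(input_size, kernel_size, stride,
                        padding=0, dilation=1, ceil_mode=False):
    """Output length of a 1-D pooling window sweep."""
    numerator = input_size + 2 * padding - dilation * (kernel_size - 1) - 1
    if ceil_mode:
        # Adding stride - 1 before floor division rounds the quotient up.
        numerator += stride - 1
    output_size = numerator // stride + 1
    if ceil_mode and (output_size - 1) * stride >= input_size + padding:
        # The last window would start past the input and left padding.
        output_size -= 1
    return output_size


print(pooling_output_size(5, kernel_size=2, stride=2, ceil_mode=False))  # 2
print(pooling_output_size(5, kernel_size=2, stride=2, ceil_mode=True))   # 3
print(pooling_output_size(2, kernel_size=1, stride=2, ceil_mode=True))   # 1
```

The last call shows the adjustment at work: rounding up alone would give 2, but the second window would start outside the input, so the result is reduced to 1.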

## Test Result

### Before

![image](https://github.com/user-attachments/assets/0a175178-a104-4348-a14b-516e866d533a)

### After

![image](https://github.com/user-attachments/assets/ce621d4b-1986-41fb-bd71-2b03c0aa996e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148869
Approved by: https://github.com/mikaylagawarecki
2025-03-17 08:50:40 +00:00
Aaron Gokaslan
bfee141666 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-16 23:52:58 +00:00
Tugsbayasgalan Manlaibaatar
6b1b95ad2a Support subclass constructor capturing in export (#147014)
Notable TODOs:
1. Need to implement AutogradHOP to get rid of subclasses before serializing
2. Need to implement mechanism to figure out what subclasses will be used in export when they are not expressed in the inputs

Differential Revision: [D69640673](https://our.internmc.facebook.com/intern/diff/D69640673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147014
Approved by: https://github.com/bdhirsh
2025-03-16 18:19:19 +00:00
Animesh Jain
5905bbe745 [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves the readability of recompilation messages: previously, the recompile message would show only the `id`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-16 15:56:17 +00:00
Sam Larsen
acf42b0048 Fix memory leak in subproc_pool future (#149259)
Summary: The future holds a reference to the callback, and the callback captures the outer future. Seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves.
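
The kind of cycle described above can be reproduced in isolation. The sketch below, which assumes CPython and uses hypothetical helper names, builds a future whose done-callback closes over the future itself, then shows that a weak reference inside the callback avoids the cycle. It does not reproduce the code in subproc_pool, and the actual fix in the PR may differ.

```python
import gc
import weakref
from concurrent.futures import Future


def make_cyclic():
    fut = Future()

    def on_done(_):
        fut.note = "done"  # the closure holds a strong reference to fut

    fut.add_done_callback(on_done)  # fut -> callback -> fut: a cycle
    return fut


def make_acyclic():
    fut = Future()
    ref = weakref.ref(fut)

    def on_done(_):
        target = ref()  # only a weak reference is captured
        if target is not None:
            target.note = "done"

    fut.add_done_callback(on_done)
    return fut


def survives_refcounting(factory):
    """True if the object is still alive after its last name is deleted."""
    gc.collect()
    gc.disable()  # rely on reference counting alone
    try:
        fut = factory()
        probe = weakref.ref(fut)
        del fut
        return probe() is not None
    finally:
        gc.enable()


print(survives_refcounting(make_cyclic), survives_refcounting(make_acyclic))
```

With the cycle in place, the future can only be reclaimed by the cyclic collector; without it, reference counting frees the future as soon as the last name goes away.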

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259
Approved by: https://github.com/Skylion007
2025-03-15 20:26:30 +00:00
James Wu
a9c55277d7 [Reland] First version of statically compiled launcher for triton compiled CUDA kernels (#149238)
This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure

Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.

This diff does not add the launcher to torch, but introduces a basic test suite.

A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of the triton C++ generated code that I haven't handled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
2025-03-15 15:06:46 +00:00
Sam Larsen
c83c711da8 Remove some memory overhead in parallel compile workers (#149168)
Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead.

Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu:
* After importing torch in a subproc: 371M
* Without this PR, after compiling 15k kernels: 825M
* With this PR, after compiling 15k kernels: 531M
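
For readers who want to take similar measurements, here is a minimal sketch of using `tracemalloc` around a workload. The helper `measure_allocations` and the synthetic workload are hypothetical and are not the test script used for the numbers above.

```python
import tracemalloc


def measure_allocations(fn):
    """Run fn and report (result, bytes still allocated, peak bytes)."""
    tracemalloc.start()
    try:
        result = fn()
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


# Synthetic workload: keep 1000 one-kilobyte buffers alive.
_, current, peak = measure_allocations(lambda: [bytes(1024) for _ in range(1000)])
print(f"retained: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```

Note that `tracemalloc` only tracks memory allocated through Python's allocators, so native allocations made by extension modules may not show up in these figures.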

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168
Approved by: https://github.com/jansel
2025-03-15 14:20:40 +00:00
Huamin Li
e7e477c1f9 Not generate custom obj json when it's empty (#149246)
Summary: as title.

See internal Diff summary for more context.

Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated

Differential Revision: D71241676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246
Approved by: https://github.com/houseroad

Co-authored-by: Huamin Li <huaminli@meta.com>
2025-03-15 13:00:48 +00:00
Lirong
4482a65fef Add side_effect to avoid dce custom op in CA graph (#149181)
We found that in compiled_autograd, a user-defined custom op is removed by DCE in the backward graph. We added a side-effect condition to the DCE function so that custom ops with side effects are not eliminated from the CA graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
2025-03-15 04:15:49 +00:00
Wenjie Yang
115fc98cc0 Migrate aten.split.Tensor from using Sharding Rule to Sharding Strategy (#149106)
Summary:
Use Sharding Strategy for aten.split.Tensor instead of sharding rule

Test Plan:
pytest test/distributed/tensor/test_dtensor_ops.py -s -k split

Reviewers:
xilunwu

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149106
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-03-15 04:03:40 +00:00
Jane Xu
740ce0fa5f op should NOT be static in aoti_torch_call_dispatcher (#149208)
aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet
2025-03-15 01:47:11 +00:00
Simon Fan
578160c875 [ca] don't inline accumulate grad op (#149014)
we use dummy tensors in our initial trace, so we should never inline. the subclass dispatch might not support the dummy tensor, e.g. DTensor accumulate grad will check that both param and grad are DTensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
2025-03-15 01:10:54 +00:00
Simon Fan
f4368d8872 [ca] clean up aot node deduping (#149064)
rename the AOT nodes as we copy paste them into the CA graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149064
Approved by: https://github.com/jansel
2025-03-15 01:10:54 +00:00
yifanmao
7537b19c73 [FSDP2] Update ignored_params docstring and add unit test (#149074)
Fixes https://github.com/pytorch/pytorch/issues/148242

ignored_params won't be moved to devices in full_shard(); update the docstring accordingly.
Add unit test `test_move_states_to_device_ignored_param_device` to show that ignored_params won't be moved during full_shard(), but would be after `model.cuda()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149074
Approved by: https://github.com/awgu
2025-03-15 00:23:09 +00:00
bobrenjc93
eb7bf4202d Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
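
The general pattern can be sketched as follows: when scanning an object's attributes, treat a property that raises `NotImplementedError` as absent instead of letting the exception escape. The helper `collect_readable_attrs` and the `Model` class are hypothetical and only illustrate the idea; they are not the dynamism code itself.

```python
def collect_readable_attrs(obj):
    """Collect public attributes, skipping properties that are not implemented."""
    values = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            values[name] = getattr(obj, name)
        except NotImplementedError:
            # The property exists on the class but has no implementation.
            continue
    return values


class Model:
    def __init__(self):
        self.hidden_size = 16

    @property
    def device(self):
        raise NotImplementedError("subclasses must define device")


print(sorted(collect_readable_attrs(Model())))  # ['hidden_size']
```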

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-14 23:38:19 +00:00
PyTorch MergeBot
f9b4856989 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit c95a6b416b.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))
2025-03-14 23:13:34 +00:00
PyTorch MergeBot
643aaea133 Revert "[RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)"
This reverts commit 5a843f8973.

Reverted https://github.com/pytorch/pytorch/pull/148561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148561#issuecomment-2725969268))
2025-03-14 23:01:26 +00:00
cz2h
05f2cbfe19 Add meta function for out variants of ones,zeros,empty (#149098)
Open another PR to fix merge conflicts. Fixes https://github.com/pytorch/pytorch/issues/135832

For aten.ones, aten.zeros, followed this [link](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.64r4npvq0w0) to register meta functions.

For aten.empty.out, followed this [part](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.iy9lxhxhtl5v) to register a decomp for empty that handles the FakeTensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149098
Approved by: https://github.com/williamwen42
2025-03-14 22:17:30 +00:00
Nikita Shulga
d7d9a71e19 [MPSInductor] Add support for atan2 (#149216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149216
Approved by: https://github.com/dcci
2025-03-14 21:53:03 +00:00
Davide Italiano
0bd863a62f [MPS] Add inductor support for i1e. (#149221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149221
Approved by: https://github.com/malfet
2025-03-14 21:18:38 +00:00
albanD
1bdbf12672 Update as strided doc (#149146)
Make it clearer why it is not recommended to use it and when the resulting Tensor will have undefined behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149146
Approved by: https://github.com/gchanan, https://github.com/jbschlosser
2025-03-14 19:49:57 +00:00
Um Changyong
69aeb87eca update error message in get_backend() more detail_ (#141796)
Fixes #ISSUE_NUMBER
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distribut │
                             │ ed_c10d.py:1215 in get_backend                                                                                            │
                             │                                                                                                                           │
                             │   1212 │   if _rank_not_in_group(pg):                                                                                     │
                             │   1213 │   │   raise ValueError("Invalid process group specified")                                                        │
                             │   1214 │   pg_store = _world.pg_map[pg] if pg in _world.pg_map else None                                                  │
                             │ ❱ 1215 │   return Backend(not_none(pg_store)[0])                                                                          │
                             │   1216                                                                                                                    │
                             │   1217                                                                                                                    │
                             │   1218 def _get_process_group_uid(pg: ProcessGroup) -> int:                                                               │
                             │                                                                                                                           │
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.p │
                             │ y:13 in not_none                                                                                                          │
                             │                                                                                                                           │
                             │   10                                                                                                                      │
                             │   11 def not_none(obj: Optional[T]) -> T:                                                                                 │
                             │   12 │   if obj is None:                                                                                                  │
                             │ ❱ 13 │   │   raise TypeError("Invariant encountered: value was None when it should not be")                               │
                             │   14 │   return obj                                                                                                       │
                             │   15                                                                                                                      │
                             ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
                             TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Since this message can cause confusion for multiple developers, the purpose of this PR is to suggest additional details to help clarify the situation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
2025-03-14 19:42:42 +00:00
Qiongwen Zhang
5e79b61e8a add PrivateUse1 backend in fsdp collecitves (#147260)
add PrivateUse1 backend in fsdp collectives

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147260
Approved by: https://github.com/weifengpy
2025-03-14 19:41:41 +00:00
henrylhtsang
fe01af2242 [AOTI][debug logger] small fix for intermediate value debugger for jit when arg is not tensor (#149007)
repro:
```
import torch
import torch._inductor.config as config

config.aot_inductor.debug_intermediate_value_printer = "2"
config.aot_inductor.filtered_kernel_names = "triton_poi_fused__to_copy_add_0"

class Model(torch.nn.Module):
    def forward(self, x):
        x = x.to(torch.float)
        return x + 1

model = Model().cuda()
x = torch.randn(10).cuda().to(torch.float8_e4m3fn)
_ = torch.compile(model, fullgraph=True)(x)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149007
Approved by: https://github.com/jingsh
2025-03-14 19:40:41 +00:00
zeshengzong
a7f8de2198 Add nn.Bilinear param validation (#149018)
Fixes #103425

## Changes

- Add a doc description stating that the size values `must be > 0`
- Add validation for the `in1_features` param

Currently, only `in1_features` causes a runtime error; adding checks for `in2_features` and `out_features` as well might be BC-breaking.

```python
import torch
from torch import nn

class lenet(nn.Module):
    def __init__(self):
        super(lenet, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1)

        # Error, `in1_features=1, in2_features=0, out_features=0` no error
        self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0)

    def forward(self, x):
        # 1st block
        x = self.conv(x)
        x = self.linear(x)

        return x

if __name__ == '__main__':
    net = lenet()

```

## Test Result

```bash
pytest test/test_nn.py -k test_bilinear -vv
```

![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43)

![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018
Approved by: https://github.com/mikaylagawarecki
2025-03-14 19:26:12 +00:00
James Wu
5a843f8973 [RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)
Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.
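
To make the memory tradeoff concrete, here is a rough Python sketch of packing every argument into a fixed 8-byte slot. The real launcher is C++ inside `torch._C`; the names `SLOT_BYTES`, `MAX_ARGS`, and `pack_args` are hypothetical and exist only for this illustration.

```python
import struct

SLOT_BYTES = 8   # fixed slot size per argument, as described above
MAX_ARGS = 10    # argument limit of this first version


def pack_args(args):
    """Pack (format, value) pairs into one buffer, one 8-byte slot per argument."""
    if len(args) > MAX_ARGS:
        raise ValueError(f"at most {MAX_ARGS} arguments supported, got {len(args)}")
    buf = bytearray(SLOT_BYTES * len(args))
    for i, (fmt, value) in enumerate(args):
        # Only struct.calcsize(fmt) bytes are written; the rest of the slot stays zero.
        struct.pack_into("<" + fmt, buf, i * SLOT_BYTES, value)
    return bytes(buf)


# A 4-byte int, a 4-byte float, and an 8-byte int occupy 24 bytes in total,
# although their natural sizes add up to only 16 bytes.
packed = pack_args([("i", 7), ("f", 1.5), ("q", 1 << 40)])
print(len(packed))
```

The uniform slot size is what lets a single precompiled launcher handle arguments of any type, at the cost of some padding for arguments narrower than 8 bytes.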

This diff does not add the launcher to torch, but introduces a basic test suite.

TODOs that are not yet complete and will be done in separate diffs:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z. With https://github.com/pytorch/pytorch/pull/147583, we should be able to handle all of the grid logic directly in _StaticCudaLauncher.launch_kernel, and get rid of the python evaluation.
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Hooking it up with a config to inductor
- Testing harness to test against torch generated triton kernels

Differential Revision: [D69926783](https://our.internmc.facebook.com/intern/diff/D69926783/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148561
Approved by: https://github.com/aorenste, https://github.com/syed-ahmed
2025-03-14 19:12:13 +00:00
zeshengzong
97272e4b49 Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change the hardswish gradient computation condition to match [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable cuda for test `test_hardswish_grad_corner`
- Add test case for value=-3

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-14 18:53:10 +00:00
Ethan Wee
2e02c07a5d [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily
2025-03-14 18:21:27 +00:00
Nikita Shulga
42e468d9b0 [MPSInductor] Adjust check_bounds (#147205)
To make upper bound inclusive, which fixes `test_vectorized_ops_masked` and results in the following code
```python
mps_lib_0 = compile_mps_shader("""
    #include <c10/metal/random.h>
    #include <c10/metal/special_math.h>
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = (xindex) % (64);
        int x1 = (xindex) / (64);
        auto tmp5 = in_ptr0[x0 + 63*x1];
        int x2 = xindex;
        auto tmp0 = x0;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 63;
        auto tmp3 = tmp1 < tmp2;
        if (x0 > 63) return;
        auto tmp6 = tmp3 ? tmp5 : 7;
        out_ptr0[x2] = static_cast<float>(tmp6);
    }
""")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147205
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #147211
2025-03-14 17:26:00 +00:00
Davide Italiano
f2ea77c099 [MPS] Add inductor support for i0e. (#149180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149180
Approved by: https://github.com/malfet
2025-03-14 16:15:52 +00:00
PyTorch MergeBot
71795f159e Revert "[AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)"
This reverts commit bea181ff7e.

Reverted https://github.com/pytorch/pytorch/pull/149167 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D71177501 for the failure. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149167#issuecomment-2725001232))
2025-03-14 15:16:21 +00:00
Xuehai Pan
c95a6b416b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` that can be used as the registration key for PyStructSequence types, just as `namedtuple` is for named tuple types.
3. Change `is_namedtuple` to treat subclasses of namedtuple classes as namedtuples. Before this PR, only classes created directly by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes, and their subclasses did not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.
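
The distinctions listed above can be illustrated with a standard-library-only sketch. The helpers `is_namedtuple_class` and `is_structseq_class` below are simplified heuristics written for this example; they are assumptions and not the implementation in `torch.utils._pytree`.

```python
import time
from collections import namedtuple


def is_namedtuple_class(cls):
    """Heuristic: a tuple subclass whose `_fields` is a tuple of strings."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    fields = getattr(cls, "_fields", None)
    return isinstance(fields, tuple) and all(isinstance(f, str) for f in fields)


def is_structseq_class(cls):
    """Heuristic: a tuple subclass exposing the C-level `n_sequence_fields` counter."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    return isinstance(getattr(cls, "n_sequence_fields", None), int)


Point = namedtuple("Point", ["x", "y"])


class LabeledPoint(Point):
    """Subclass of a namedtuple class; it inherits `_fields` from Point."""
    __slots__ = ()


print(is_namedtuple_class(Point))            # created directly by collections.namedtuple
print(is_namedtuple_class(LabeledPoint))     # subclass, accepted by this heuristic too
print(is_structseq_class(time.struct_time))  # a PyStructSequence type from the stdlib
```

Because the check relies on inherited attributes rather than on how the class was created, subclasses of a namedtuple class are accepted as well.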

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-14 08:50:30 +00:00
Sheng Fu
05ac99042f Clean up grid in execution trace (#149159)
Summary: The diff https://www.internalfb.com/diff/D70471332 removed the "grid" input when calling a triton kernel. The PyTorch execution trace (ET) needs the corresponding change, both when capturing an ET and when replaying it.

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda  -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_with_pt2_cuda

buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay

Differential Revision: D71152464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149159
Approved by: https://github.com/sraikund16, https://github.com/jansel
2025-03-14 07:12:16 +00:00
Nikita Shulga
e162758051 [MPSInductor] Add bessel_[jy][01] ops (#149179)
By simply calling corresponding special functions

Followup TODO: tweak bessel_y0 to match CPU implementation for `torch.half` dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149179
Approved by: https://github.com/dcci
ghstack dependencies: #149123
2025-03-14 06:33:30 +00:00
Huamin Li
d4496346b9 Update logic when producing key name for keep_original_weights (#149171)
Differential Revision: D71160718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149171
Approved by: https://github.com/houseroad
2025-03-14 05:29:54 +00:00
Isuru Fernando
9e6b2ca58d Fix sympy float printing (#147552)
Fixes https://github.com/pytorch/pytorch/pull/147261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147552
Approved by: https://github.com/bobrenjc93, https://github.com/cyyever
2025-03-14 05:07:06 +00:00
Mu-Chu Lee
bea181ff7e [AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)
Summary:
We add swap_constant_buffer in pybind to add tests.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_update_inactive_constant_buffer

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149167
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-14 04:12:48 +00:00
fduwjj
aed0b7a742 [c10d] Add param recording for uniqueID broadcasting and allgather (#149166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149166
Approved by: https://github.com/kwen2501
2025-03-14 03:51:30 +00:00
Dmitry Rogozhkin
c179971bfc xpu: update filter out of dg2 AOT target (#148677)
torch-xpu-ops has updated its list of AOT targets and now uses `dg2` instead of `dg2-g10`. This requires an update in cpp_extension.py, which currently filters out `dg2-` prefixed AOT targets.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148677
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD
2025-03-14 02:24:06 +00:00
Pat Vignola
e8d36019d4 [c10d] Make getDefaultBackend more fault tolerant without relying on exceptions (#149152)
Summary: no-except builds are terminating when this exception is thrown. We should proactively check if a backend is available before calling has_hooks, instead of trying and failing.

Test Plan: CI

Differential Revision: D71144456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149152
Approved by: https://github.com/kwen2501
2025-03-14 01:27:52 +00:00
Yiming Zhou
15cd6921a5 [export] Fix tensor_constant and buffer naming conflicts in TS converter (#148803)
Summary: In TS converter, tensor constants are traced as BUFFER and later we will convert them back to CONSTANT_TENSOR. So we need to prevent naming conflicts during lift constant pass.

Test Plan: CI

Differential Revision: D70826426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148803
Approved by: https://github.com/angelayi
2025-03-14 00:38:12 +00:00
Zhenghao Hu
e5fccb2bab [pytorch] Fix duplicated Malloc/Free insertation when using IRBuilderBase::CreateMalloc/CreateFree in LLVM 18+ (#149058)
Summary:
A PyTorch unit test hangs when jitting the Tensor kernel. The problem exists for LLVM version >= 18 due to this upstream change: 45bb45f2ae

`IRBuilderBase::CreateCall` will insert the instruction into the BasicBlock by default. And we don't need to explicitly insert the instruction when compiling the tensor kernel.

Test Plan:
## Test with the release toolchain
```
buck test 'mode/dev' //caffe2/test:jit -- --exact 'caffe2/test:jit - test_concat_invariant (test_jit_fuser_te.TestTEFuserDynamic)'
```
## Test with the Buckified toolchain
Apply this D71046097 to select the LLVM libraries.
```
# Build tests
buck build 'mode/dev-asan' //caffe2/test:jit --show-output
```
```
# Run test (Change HASH and paths accordingly)
HASH="b755f1c435832a1e"

ENABLE_FLATBUFFER=0 FB_OVERRIDE_PYBIND11_GIL_INCREF_DECREF_CHECK=1 MKL_NUM_THREADS=1 NO_MULTIPROCESSING_SPAWN=0 OMP_NUM_THREADS=1 PYTORCH_TEST=1 PYTORCH_TEST_FBCODE=1 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_DEV_DBG_ASAN=1 PYTORCH_TEST_WITH_TSAN=0 PYTORCH_TEST_WITH_UBSAN=1 SKIP_TEST_BOTTLENECK=1 TENSORPIPE_TLS_DATACENTER=test_dc TEST_PILOT=True TPX_IS_TEST_EXECUTION=true TPX_TIMEOUT_SEC=6000 \
buck-out/v2/gen/$HASH/caffe2/test/__jit__/jit.par --test-filter test_jit_fuser_te.TestTEFuserDynamic.test_concat_invariant
```

Differential Revision: D71046799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149058
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-03-13 20:37:47 +00:00
Eddie Yan
0dcd482e54 [SDPA] Respect sdpa_kernel's priority_order setting in torch.compile (#147768)
[https://github.com/pytorch/pytorch/pull/140467](https://github.com/pytorch/pytorch/pull/140467) added the option to specify a priority order for SDPA but the `torch.compile` path silently ignored this setting as I wasn't aware of the separate context manager handling on `torch.compile`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147768
Approved by: https://github.com/drisspg
2025-03-13 18:52:34 +00:00
Joel Schlosser
5e1b715dda BC fix for AOTIModelPackageLoader() constructor defaults (#149082)
The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire
2025-03-13 18:40:53 +00:00
cyy
970fefcc53 Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940)
Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940
Approved by: https://github.com/mikaylagawarecki
2025-03-13 18:02:50 +00:00
Simon Fan
7c87ec1b50 [ca] always do initial trace with dynamic shapes (#148801)
HUD: https://fburl.com/wzvx6tax no regressions (ignore the pass rate improvements, those come from #149030)
<img width="864" alt="image" src="https://github.com/user-attachments/assets/d7598f98-b378-4abb-a0c7-e4311162f681" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148801
Approved by: https://github.com/jansel
ghstack dependencies: #148799, #149030
2025-03-13 17:30:29 +00:00
Simon Fan
b263b272fa [ca] fix lazily compiled aot bwd (#149030)
FIXES https://github.com/pytorch/pytorch/issues/137372

Sometimes the AOT backward is lowered lazily, so the bw_module we saved in CompiledFunction._lazy_backward_info hasn't gone through the post-grad passes, specifically the view_to_reshape pass. Running it directly will then sometimes error, because the AOT forward has already changed its views to reshapes, and that is reflected in the gradients we see in CA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149030
Approved by: https://github.com/bdhirsh
ghstack dependencies: #148799
2025-03-13 17:30:29 +00:00
Simon Fan
e6f560a262 [ca] support for dynamic shapes CopySlices (#148799)
I'm changing the CA initial trace to always trace as dynamic, which fixes these errors:
```python
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2139s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_autograd_python_custom_function_inplace - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_autograd_python_custom_function_inplace
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0057s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_copy_slices_graph_task_updates - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_copy_slices_graph_task_updates
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.9662s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_inplace_on_view_weak_grad_fn - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_inplace_on_view_weak_grad_fn
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0077s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_leaf_assignment - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_leaf_assignment
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [5.0485s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_setitem_mask - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_setitem_mask
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0102s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_tensor_hooks_inplace_over_view - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_tensor_hooks_inplace_over_view
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148799
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-03-13 17:30:20 +00:00
Boyuan Feng
3e605fe46d [CUDAGraph] Graph Partition (#147648)
This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR.

## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x,y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")

for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```

w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
   File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
    y_cpu = y1.cpu() + 1 # 3
```

w/ graph partition, we can see two cudagraphify under the same torch-compiled region:
![image](https://github.com/user-attachments/assets/4e22d428-2687-433d-b92a-0814a2201b25)

## Design

PR #147038 splits the `def call(args)` function into multiple `def partition_id(args)` functions. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is that `cudagraphify` takes metadata such as static_input_idxs, and we need to provide such metadata for each graph partition. However, we previously had such metadata only for the original graph, not for graph partitions.

The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`.
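
The first two steps above can be sketched as follows. This is a simplified illustration under assumptions: `PartitionMapSketch` and `partition_static_input_idxs` are hypothetical stand-ins, and the real `GraphPartitionMap` and `get_partition_cudagraph_metadata()` carry more information than a single input mapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionMapSketch:
    # Hypothetical stand-in for GraphPartitionMap: entry i is the graph-level
    # input index that feeds partition input i, or None when the partition
    # input is an intermediate value produced by an earlier partition.
    input_index_mapping: list[Optional[int]]


def partition_static_input_idxs(graph_static_idxs, pmap):
    """Translate graph-level static input indices into partition-level indices."""
    graph_static = set(graph_static_idxs)
    return [
        part_idx
        for part_idx, graph_idx in enumerate(pmap.input_index_mapping)
        if graph_idx is not None and graph_idx in graph_static
    ]


# Graph inputs 0 and 2 are static. The partition takes graph input 2,
# an intermediate value, and graph input 1, in that order.
pmap = PartitionMapSketch(input_index_mapping=[2, None, 1])
print(partition_static_input_idxs([0, 2], pmap))
```

Only the partition input that maps back to a static graph input is reported as static; intermediate values and non-static graph inputs are left out.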

#### Q: How does it work with codecache?

While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later.

## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.

## Edge Case 2
Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

![image](https://github.com/user-attachments/assets/03d08ce0-f1d1-4d1d-8432-805a07e1dd40)

To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
2025-03-13 16:00:21 +00:00
atalman
65d19a5699 Remove runtime dependency on packaging (#149092)
Looks like after https://github.com/pytorch/pytorch/pull/148924
We are seeing this error in nightly test:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623

```
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
    from .lowering import fallback_node_due_to_unsupported_type
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
    from . import kernel
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
    from . import mm, mm_common, mm_plus_mm
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```

Hence removing runtime dependency on packaging since it may not be installed by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98
2025-03-13 14:53:13 +00:00
taoyang
f59064f2b7 [FIX] remove the duplicate key in DEFAULT_STATIC_QUANT_MODULE_MAPPINGS (#149043)
nn.Dropout appeared at line 81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149043
Approved by: https://github.com/jingsh
2025-03-13 12:42:33 +00:00
Bin Bao
bdf57fb8f7 [AOTI][refactor] Split MiniArrayRef into a separate header (#149073)
Summary: MiniArrayRef is a common utility and will be used by the libtorch-free AOTI.

Differential Revision: [D71064657](https://our.internmc.facebook.com/intern/diff/D71064657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149073
Approved by: https://github.com/yushangdi
2025-03-13 11:57:32 +00:00
Andrew Gu
a8b1767ae5 [DTensor] Fix local_map with multi-threading (#149070)
Using `nonlocal device_mesh` is not safe with multi-threading
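
A small standard-library sketch of why rebinding an enclosing variable is unsafe across threads. The function names are hypothetical and this does not reproduce `local_map`; it only shows the general hazard and one way to avoid it with a plain local variable.

```python
import threading


def make_unsafe():
    current = None  # one variable shared by every call to `run`

    def run(value, barrier):
        nonlocal current
        current = value   # each thread overwrites the shared variable
        barrier.wait()    # all threads have written before anyone reads
        return current    # so every thread reads the last write

    return run


def make_safe():
    def run(value, barrier):
        current = value   # plain local: one per call, hence one per thread
        barrier.wait()
        return current

    return run


def call_from_threads(fn, values):
    """Call `fn` once per value, each call on its own thread, and collect the results."""
    barrier = threading.Barrier(len(values))
    results = [None] * len(values)

    def worker(i):
        results[i] = fn(values[i], barrier)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


unsafe_results = call_from_threads(make_unsafe(), [1, 2, 3])
safe_results = call_from_threads(make_safe(), [1, 2, 3])
print(unsafe_results, safe_results)
```

In the unsafe variant all threads end up with the same value, whichever was written last, while the safe variant returns each thread's own input.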

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149070
Approved by: https://github.com/wanchaol
2025-03-13 10:58:59 +00:00
Shangdi Yu
df60500ab8 Fix too big to optimize in test, actually use O0 when aot_inductor.compile_wrapper_with_O0 is set (#148714)
Summary:
1. Check against the "0" char instead

2. We got the following error when using anything other than O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]` So we use O0 flag in wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`.

Test Plan:
```
 buck run  'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest
```

Differential Revision: D70670957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714
Approved by: https://github.com/desertfire
2025-03-13 10:22:06 +00:00
wdziurdz
75c8b7d972 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/albanD
2025-03-13 08:03:52 +00:00
eqy
ec93aa7f84 fix cuDNN SDPA meta registration (#148921)
Update the `cuDNN SDPA` meta registration to match the memory layout behavior in: https://github.com/pytorch/pytorch/pull/138354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148921
Approved by: https://github.com/drisspg, https://github.com/jbschlosser
2025-03-13 07:33:16 +00:00
Shangdi Yu
2a7d583452 Consolidate torchbind fake class registration (#149063)
Summary: Remove duplicated fake class registration

Test Plan: CI

Differential Revision: D71052419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149063
Approved by: https://github.com/angelayi
2025-03-13 06:57:13 +00:00
Yuanhao Ji
c208f21791 [Dynamo] Replace unimplemented withunimplemented_v2 in torch/_dynamo/variables/base.py (#148177)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/base.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148177
Approved by: https://github.com/williamwen42
2025-03-13 06:35:51 +00:00
xinan.lin
037d7af778 [Inductor UT] Enable PYTORCH_TESTING_DEVICE_ONLY_FOR test case filter for test_torchinductor.py (#149023)
The environment variable PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices
in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU to
make sure cases are only enabled for the devices specified in PYTORCH_TESTING_DEVICE_ONLY_FOR.
E.g., enable only GPU cases and not CPU cases, even when HAS_CPU is set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-13 05:15:28 +00:00
Sam Larsen
7cdbb913e7 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-13 03:50:58 +00:00
Brian Hirsh
3646d4dbc8 [partitioner] always ban compiler-driven recompute of collectives by default (#147561)
This should fix the hang in https://fb.workplace.com/groups/1075192433118967/permalink/1603268720311333/

The argument here is that:

(1) in general, it is not safe for the partitioner to sometimes choose to recompute collectives in the backward. Why? If we are running a distributed job, where many ranks are compiling at the same time, we need every rank to make a consistent decision about which collectives are recomputed for backward. If we let each compiler instance make its own choice without any cross-rank communication, they can make different choices and cause NCCL hangs (see the link above)

(2) later on, we'll want an `spmd_mode` flag that causes the compiler to issue collectives and communicate info across ranks. Once we have such a config, then turning it on should make it safe for the partitioner to potentially choose to recompute collectives (and agree on the binary "recompute-or-save" choice across all ranks)

(3) even without an `spmd_mode`, users can override this choice by using `torch.utils.checkpoint()` in their user code. User checkpointing generally always overrides the partitioner, and this should be safe because we expect the user to apply checkpointing consistently across ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147561
Approved by: https://github.com/zou3519
2025-03-13 03:36:13 +00:00
Zhou, Lingzhi
4a12777ffe [Partitioner] Remove unnecessary upstream nodes in dependency viewer (#146580)
We iterate over upstream nodes to update the partition map, but this does nothing: we visit nodes in reverse topological order (https://github.com/pytorch/pytorch/pull/136608/files#diff-f2f9dd3903fd99955732eb694941fea0cb7301a58d59554787f3311d417e5615L193), so no upstream node is in the assignment yet. Remove the loop to avoid for-loop overhead of up to O(N * N) complexity.
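
The ordering argument can be checked with a small sketch. The graph, the function name, and the use of direct inputs as "upstream nodes" are assumptions made for illustration; this is not the partitioner's code.

```python
def upstream_assigned_during_reverse_walk(topo_order, inputs):
    """Walk nodes in reverse topological order and record, for each node,
    which of its direct input nodes already have a partition assignment."""
    assignment = {}
    already_assigned = {}
    for node in reversed(topo_order):
        already_assigned[node] = [u for u in inputs.get(node, []) if u in assignment]
        assignment[node] = len(assignment)  # hypothetical partition id
    return already_assigned


# a -> b -> c -> d, plus an extra edge a -> c
topo_order = ["a", "b", "c", "d"]
inputs = {"b": ["a"], "c": ["a", "b"], "d": ["c"]}
result = upstream_assigned_during_reverse_walk(topo_order, inputs)
print(result)
```

Every recorded list is empty: an input always precedes its user in topological order, so the reverse walk reaches the user first and there is nothing upstream to update.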

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146580
Approved by: https://github.com/Skylion007, https://github.com/jerome-habana
2025-03-13 01:42:10 +00:00
PyTorch MergeBot
e51615cb73 Revert "[Profiler][HPU] Fix incorrect availabilities for HPU (#148663)"
This reverts commit 28b78800b9.

Reverted https://github.com/pytorch/pytorch/pull/148663 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD, could you please help get this relanded? See D71052806 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148663#issuecomment-2719297055))
2025-03-12 22:52:11 +00:00
PyTorch MergeBot
b1980b2405 Revert "Make dynamism code robust to NotImplementedException (#148823)"
This reverts commit 60576419a2.

Reverted https://github.com/pytorch/pytorch/pull/148823 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D71042206 for details. To validate your fixes internally before relanding, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148823#issuecomment-2719287467))
2025-03-12 22:45:39 +00:00
Yanan Cao (PyTorch)
ab45aaca97 Set non-strict export as default mode (#148790)
Summary:
- Flip the default value of strict argument in torch.export.export from True to False
- Update test infra to cope with the change; some tests assumed strict mode as the default
- Disabled some tests that fail in non-strict mode

Test Plan: Sandcastle

Differential Revision: D70228628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148790
Approved by: https://github.com/angelayi
2025-03-12 21:10:58 +00:00
Matthew Hoffman
e3ebf61589 Create and send full_tensor on ProcessGroup-supported device in _broadcast_tensors (#148865)
Fixes #138842

`device` is always the device of the `local_state_dict`, which may be CPU, and CPU is not supported by the NCCL backend.

Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865
Approved by: https://github.com/mori360
2025-03-12 20:56:31 +00:00
Thomas Bohnstingl
86bc154d61 [scan] Flattened output of HOP scan (#148955)
This is required because downstream operations expect HOPs to return a flattened list of output elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148955
Approved by: https://github.com/ydwu4
2025-03-12 18:27:27 +00:00
Tugsbayasgalan Manlaibaatar
fb0e9cb0a0 Remove warnings on non-buffer tensor constants (#148483)
Export already registers tensor constants directly in the graph, and this is also true for Torchbind objects. This removes a warning that pollutes the output.

Differential Revision: [D70577856](https://our.internmc.facebook.com/intern/diff/D70577856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148483
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
ghstack dependencies: #148364
2025-03-12 18:20:04 +00:00
Shangdi Yu
01e9036bd2 skip torchbind in constant folding (#148993)
Summary:
Do not fold torchbind objects in constant folding

Any operation on these torchbind objects can have arbitrary side effects, so we can't effectively constant fold anything torchbind-obj-related anyway.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
```

Reviewed By: angelayi

Differential Revision: D69946541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148993
Approved by: https://github.com/angelayi
2025-03-12 18:08:08 +00:00
Yidi Wu
923ce10f6c [while_loop] require stride to be the same as input for body_fn (#148002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148002
Approved by: https://github.com/zou3519
2025-03-12 17:15:10 +00:00
wdziurdz
28b78800b9 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/albanD
2025-03-12 17:06:57 +00:00
Jason Ansel
b040dc3a53 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.
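
The grid expression in the launcher above is a ceiling division by a power-of-two block size. A small sketch of that arithmetic in plain Python follows; `grid_for` and the `xblock` parameter are hypothetical names for illustration, not part of the generated code.

```python
def grid_for(xnumel: int, xblock: int = 1024) -> tuple[int, int, int]:
    """Ceil-divide xnumel by a power-of-two block size, as in the launcher above."""
    shift = xblock.bit_length() - 1  # log2(xblock) when xblock is a power of two
    assert 1 << shift == xblock, "xblock must be a power of two"
    return ((xnumel + xblock - 1) >> shift, 1, 1)


# For xblock = 1024 this is ((xnumel + 1023) >> 10).
one_block = grid_for(1024)
two_blocks = grid_for(1025)
```

Because the block size is fixed per config, the shift and the added constant can be baked into each specialized launcher instead of being computed by a shared closure.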

Differential [disconnected] Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-12 15:52:16 +00:00
PyTorch MergeBot
5ada4e6a53 Revert "Reland: [inductor] Simplify grid handling (#148305)"
This reverts commit 8d08b49015.

Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))
2025-03-12 14:58:43 +00:00
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This new check was introduced in Clang-Tidy 18 and is available now that Clang-Tidy was recently updated to version 19.

The check marks functions and variables used only in the translation unit as static. Therefore undesired symbols are not leaked into other units, more link time optimisations are possible and the resulting binaries may be smaller.

The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included. Some declarations were still wrong and have been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
leslie-fang-intel
f349304c08 [Inductor][CPP] Fix expr issue in loop split (#148882)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/148058. In this case, `indexing_expr` is an integer, which has no `find` method.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_148058
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148882
Approved by: https://github.com/jgong5
2025-03-12 11:08:07 +00:00
lingzhi98
81aee3c9c4 [Partitioner] Reduce time consuming of partitions merger (#146582)
This patch optimizes the `maybe_merge_partition` function in three ways:

- Remove an unnecessary copy https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L99. The number of copied nodes is large if all of the nodes of the graph can be merged into one partition.
- Record the users of each partition to avoid repeated iteration over nodes https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L133. The trip count of this loop may be very large.
- Merge the smaller partition into the larger one https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L145. Partition sizes are often unbalanced: one partition has n nodes while the other has a single node, so merging the smaller into the larger reduces the time spent.
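
The last point, merging the smaller partition into the larger one, can be sketched with plain Python sets. `merge_partitions` is a hypothetical helper for illustration and is not the partitioner's actual code.

```python
def merge_partitions(a: set[str], b: set[str]) -> set[str]:
    """Merge the smaller partition into the larger one, in place."""
    if len(a) < len(b):
        a, b = b, a
    a.update(b)  # only the smaller partition's nodes are copied
    return a


big = {f"n{i}" for i in range(1000)}
small = {"extra"}
merged = merge_partitions(small, big)
```

Here a single node is copied into the existing 1000-node set, rather than 1000 nodes being copied into the single-node set.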

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146582
Approved by: https://github.com/jerome-habana, https://github.com/Skylion007
2025-03-12 09:24:38 +00:00
Nikita Shulga
7b78a2c415 [MPSInductor] Fix argmin/argmax long reductions (#149021)
By adding an additional indexes array for aggregates and populating it when performing partial reductions.

And with that I can finally `torch.compile` TinyStories and get 600+ tokens/sec vs <200 on eager

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149021
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004, #149020
2025-03-12 04:39:29 +00:00
Nikita Shulga
fe22db9cc3 [MPSInductor] Fix min/max reductions over large dims (#149004)
Simple followup after sum/prod

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149004
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975
2025-03-12 04:39:19 +00:00
bobrenjc93
60576419a2 Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
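
A minimal sketch of the failure mode and one way to tolerate it, assuming the code walks an object's attributes with `getattr`. `Model` and `readable_attrs` are hypothetical names; this is not the actual dynamism code.

```python
class Model:
    batch_size = 8

    @property
    def unsupported(self):
        raise NotImplementedError("not available on this model")


def readable_attrs(obj) -> dict:
    """Collect public attributes, skipping getters that raise NotImplementedError."""
    found = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            found[name] = getattr(obj, name)
        except NotImplementedError:
            # The property exists but cannot be evaluated; ignore it.
            continue
    return found


attrs = readable_attrs(Model())
```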

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-12 01:01:57 +00:00
Boyuan Feng
5b60749e9e [cudagraph] add log for skip reasons (#148797)
Summary: Add skip reasons to dynamo_compile so we can know popular skip reasons for cudagraph

Test Plan: {F1975906635}

Differential Revision: D70820791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148797
Approved by: https://github.com/masnesral
2025-03-11 23:31:48 +00:00
Nikita Shulga
98a2d905bf [MPSInductor] Fix large prod and sum reductions (#148975)
After this change, if the reduction dimension is larger than `max_threadgroup_size`, emit a `for` loop from `codegen_iteration_ranges_entry` and wrap it up in `codegen_body()`.
I.e., after this change the following command
```
% TORCH_LOGS=output_code python -c "import torch;print(torch.compile(lambda x:(x[0::2].sin()+(x[1::2] + .4).cos()).sum(dim=0) - 3.14)(torch.rand(4096, device='mps')))" 2>&1|cut -c 86-
```
will emit the following shader
```metal
#include <c10/metal/random.h>
#include <c10/metal/special_math.h>
#include <c10/metal/utils.h>
#include <c10/metal/reduction_utils.h>
kernel void generated_kernel(
    device float* out_ptr1,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[1024];
    tmp_acc_0[r0_index] = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        if (r0_0 >= 2047) break;
        auto tmp0 = in_ptr0[2*r0_0];
        auto tmp2 = in_ptr0[1 + 2*r0_0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp3 = 0.4;
        auto tmp4 = tmp2 + tmp3;
        auto tmp5 = metal::precise::cos(tmp4);
        auto tmp6 = tmp1 + tmp5;
        tmp_acc_0[r0_index] += tmp6;
    }
    auto tmp7 = c10::metal::threadgroup_sum(tmp_acc_0, 1024);
    auto tmp8 = 3.14;
    auto tmp9 = tmp7 - tmp8;
    out_ptr1[0] = static_cast<float>(tmp9);
}
```
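The looping scheme in the shader can be simulated on the CPU to check the indexing. The sketch below is plain Python under stated assumptions: `looped_threadgroup_sum` is a hypothetical name, the `partial` list stands in for the threadgroup array, and the built-in `sum` stands in for the threadgroup reduction.

```python
import math


def looped_threadgroup_sum(values: list[float], max_threads: int = 1024) -> float:
    """Sum `values` the way a fixed-size threadgroup would, with a per-thread loop."""
    n = len(values)
    per_thread = math.ceil(n / max_threads)  # loop trip count for each thread
    partial = [0.0] * max_threads  # one accumulator slot per thread
    for tid in range(max_threads):
        for cnt in range(per_thread):
            idx = per_thread * tid + cnt
            if idx >= n:
                break  # this thread has run past the end of the reduction
            partial[tid] += values[idx]
    return sum(partial)  # final reduction across the threadgroup


total = looped_threadgroup_sum([float(i) for i in range(2048)])
```

Each simulated thread handles a contiguous run of `per_thread` elements, and the bounds check covers reduction sizes that are not a multiple of the thread count.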
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148975
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #148969
2025-03-11 22:46:41 +00:00
bobrenjc93
2dcdb4ba78 [ez] include config as part of __all__ in torch.compiler (#148978)
Right now we are susceptible to a race condition: if torch.compiler.config is not implicitly imported via dynamo/builder.py, we throw an error when trying to set compiler configs. This fixes it by including config in `__all__`.

Previous
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
```

Now
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
```
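The underlying Python behavior can be reproduced without torch: a submodule only becomes an attribute of its package once something imports it. The sketch below builds a throwaway package on disk; `submodule_visible` and the `demo_pkg_*` names are hypothetical. Listing a name in `__all__` does not import it on its own, so the sketch assumes the package `__init__` also imports the submodule.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def submodule_visible(init_source: str, pkg_name: str) -> bool:
    """Create a package with a `config` submodule and report whether
    `pkg.config` resolves right after `import pkg`."""
    with tempfile.TemporaryDirectory() as root:
        pkg = Path(root) / pkg_name
        pkg.mkdir()
        (pkg / "__init__.py").write_text(init_source)
        (pkg / "config.py").write_text("dynamic_sources = ''\n")
        sys.path.insert(0, root)
        try:
            module = importlib.import_module(pkg_name)
            return hasattr(module, "config")
        finally:
            # Undo the import-system changes so the demo leaves no trace.
            sys.path.remove(root)
            for name in (pkg_name, f"{pkg_name}.config"):
                sys.modules.pop(name, None)


lazy = submodule_visible("", "demo_pkg_lazy")
eager = submodule_visible("from . import config\n__all__ = ['config']\n", "demo_pkg_eager")
```

With an empty `__init__`, the attribute is missing until some other code happens to import the submodule, which is the order-dependent behavior the message describes.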

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148978
Approved by: https://github.com/bdhirsh, https://github.com/laithsakka
2025-03-11 21:58:38 +00:00
Pian Pawakapan
a6459afb0e [dynamic shapes] add backed_size_oblivious option (#148696)
Adds option `torch.fx.experimental._config.backed_size_oblivious = True` to allocate `[0, inf]` instead of `[2, inf]` ranges for size backed symbols, and to opt into size-oblivious semantics for them.

Helps in a number of cases like
- Keeps `[0, inf]` bounds for unbacked symbols, when we make an unbacked -> backed replacement
- More sound handling for 0/1 inputs at runtime when we lower from export
- Avoids ends-of-bounds, sys.maxsize constraint violations for exporting with named Dims (https://github.com/pytorch/pytorch/issues/146315, https://github.com/pytorch/pytorch/issues/146046)

May look towards turning this on globally for export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148696
Approved by: https://github.com/bobrenjc93
2025-03-11 21:52:34 +00:00
Jane Xu
e6ef0620cc Add shim.h C API to call dispatcher on our own aten ops (#148832)
This PR still needs testing through some cpp extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148832
Approved by: https://github.com/albanD, https://github.com/atalman
ghstack dependencies: #148124
2025-03-11 21:02:04 +00:00
Shangdi Yu
cf19efd3d9 Support basic TorchBind in aot_compile and aoti_compile_and_package (#148506)
Summary:
**Codegen**

- Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in proxy executor, so we do not need to declare torchbind args in cpp code
- Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`.

**Serialization**
Add support for torchbind object in serialization

- For CallTorchBind HOP, we need to handle it specially because of its schema. The output serialized args are in the format of `(obj, method, *args, **kwargs)`.
- it.TorchBindObject inputs are serialized to `as_custom_obj` Argument.

**Packaging**

Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`.

The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive.

The torchbind objects are stored in data/constants/ folder in pt2 archive.
The format of torchbind object file names is `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`, e.g. `custom_obj_0`.
CustomClassHolder objects implement their own pickle methods.

Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json.
The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code.

This is required for both internal and OSS torchbind support.
For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output.

**Work Left**
We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69490718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506
Approved by: https://github.com/angelayi
2025-03-11 20:55:18 +00:00
PyTorch MergeBot
b54cf1a281 Revert "[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)"
This reverts commit 73c8068cf8.

Reverted https://github.com/pytorch/pytorch/pull/148693 on behalf of https://github.com/ZainRizvi due to This is breaking lint on trunk. Please rebase these changes before merging them back in. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13796723235/job/38590020554) [HUD commit link](73c8068cf8) ([comment](https://github.com/pytorch/pytorch/pull/148693#issuecomment-2715671875))
2025-03-11 20:50:23 +00:00
Nikita Shulga
c18858d633 [MPS] Make torch.mps.compile_shader public (#148972)
It was a private method in 2.6, but nothing changed in its API for 2.7
and it will likely remain the same in 2.8, so it is time to remove the underscore
from its name

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148972
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/seemethere, https://github.com/albanD, https://github.com/dcci
2025-03-11 20:20:58 +00:00
cat-state
abcec55532 gracefully handle tokenize.TokenError in funcname parser. Adds support for non-Python source (#148737)
This change allows defining Python functions in non-Python source and having them compiled by torch.compile. The existing implementation already returns None for the case where the file couldn't be read, so returning None (by making an empty funcname cache) makes sense for the case of non-Python source code too.

Example [basilisp](https://github.com/basilisp-lang/basilisp):
```clojure
(import torch)
(import [torch.nn.functional :as F])
(torch/rand 10)

(defn f {:decorators [torch/compile]} [x]
  (* (F/relu x) x))

(f (-> (torch/randn 100)
       (.cuda)))
```
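A minimal sketch of the approach, assuming a cache that maps line numbers to function names. `funcname_cache` here is a hypothetical stand-in, not the actual parser in torch; the only real API it relies on is the standard `tokenize` module.

```python
import io
import tokenize


def funcname_cache(source: str) -> dict[int, str]:
    """Map the line of each `def` to its function name.

    Returns an empty cache if the source cannot be tokenized as Python.
    """
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    except tokenize.TokenError:
        return {}
    cache: dict[int, str] = {}
    for tok, nxt in zip(tokens, tokens[1:]):
        if tok.type == tokenize.NAME and tok.string == "def" and nxt.type == tokenize.NAME:
            cache[tok.start[0]] = nxt.string
    return cache


python_src = "def f(x):\n    return x * x\n"
lisp_src = "(defn f [x]\n  (* x x)"  # unbalanced paren, not Python source

python_cache = funcname_cache(python_src)
lisp_cache = funcname_cache(lisp_src)
```

An empty cache means name lookups simply miss, which matches the existing behavior for files that cannot be read.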

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148737
Approved by: https://github.com/williamwen42
2025-03-11 19:49:28 +00:00
Sam Larsen
73c8068cf8 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-11 19:38:40 +00:00
Jane Xu
971606befa Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it?
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. The others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
2025-03-11 19:12:46 +00:00
Guilherme Leobas
daff65d671 Correctly propagate exception to parent tx (#146502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502
Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #146504, #146499
2025-03-11 18:55:45 +00:00
Guilherme Leobas
fb53e9e514 Add __context/cause/suppress_context/traceback__ to Exception (#146499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499
Approved by: https://github.com/zou3519, https://github.com/anijain2305
ghstack dependencies: #146504
2025-03-11 18:55:45 +00:00