Commit Graph

47182 Commits

Author SHA1 Message Date
PyTorch MergeBot
e57fa18b40 Revert "Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)"
This reverts commit 8a872261dc.

Reverted https://github.com/pytorch/pytorch/pull/150129 on behalf of https://github.com/clee2000 due to breaking internal builds D72080428 ([comment](https://github.com/pytorch/pytorch/pull/150129#issuecomment-2766619006))
2025-03-31 15:37:54 +00:00
Yichen Yan
bbb9b2476b Unify use of enableCollectiveHashDebug_ and trivial updates (#142865)
Use `enableCollectiveHashDebug_` instead of checking env ad-hoc when `TORCH_DISTRIBUTED_DEBUG = DETAIL`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142865
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-03-31 12:23:30 +00:00
Kavya Govindarajan
4aded85e79 Fix space typo in warning message (#143473)
Warning shows up like this (no space between willbe):
```
/home/xxx/.local/lib/python3.11/site-packages/torch/distributed/fsdp/_state_dict_utils.py:827:
UserWarning: When using ``NO_SHARD`` for ``ShardingStrategy``, full_state_dict willbe returned.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143473
Approved by: https://github.com/mikaylagawarecki, https://github.com/kwen2501
2025-03-31 07:38:02 +00:00
Matthew Hoffman
c976321541 Use variadic length tuple for torch.masked.DimOrDims (#149870)
`tuple[int]` means only a tuple of length 1, which is not what was intended.

```python
loss = torch.masked.mean(loss, mask=mask, dim=(-1, -2))  # Argument of type "tuple[Literal[-1], Literal[-2]]" cannot be assigned to parameter "dim" of type "DimOrDims"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149870
Approved by: https://github.com/Skylion007
2025-03-31 07:06:58 +00:00
Vlad K
f1b74037b1 Fix bug when Inductor include path contains spaces (#148271)
This PR fixes a bug with how include directories with spaces are handled on Windows. I ran into an edge case with torch.compile() - it will error out with an exception on Windows. In particular, it will try to execute the following: `cl /I C:/Program Files/Python311/Include ...`, where `C:/Program` will be treated as separate from `Files/Python311/Include`.

I looked into using something like `shlex.quote` or `pathlib.Path`, but I didn't find those options to be suitable (shlex is POSIX shell only, pathlib.Path does not escape spaces).

There is another place in the function that also deals with escaping spaces. My fix follows the same style. 0ff2e6a85a/torch/_inductor/cpp_builder.py (L1464)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148271
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-03-31 06:46:05 +00:00
Yuanhao Ji
4f14224dc8 [Inductor] Fix torch.polygamma() when n == 1 (#147453)
Fixes #147450

Be consistent with cpu kernel:

77dbd28535/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L433-L444)

Got this in the case:

```
Eager: tensor([1.2914e+15]), dtype: torch.float32
Compile: tensor([1.2914e+15]), dtype: torch.float32
Expected: tensor([6.5808e+32], dtype=torch.float64), dtype: torch.float64
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147453
Approved by: https://github.com/eellison
2025-03-31 05:27:46 +00:00
fduwjj
9456738edf [c10d][fr] Allow multiple writer registration with warnings (#150232)
The life span of writer is actually the whole program which is sub-optimal but it is a practical compromise so that the registration of writer can happen outside PG creation.

So we decide to allow multiple writer registrations with warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150232
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-03-31 04:43:43 +00:00
Luca Arnaboldi
c3bb174bb2 SubsetRandomSampler - changed iteration over tensor to iteration over list (#149126)
Digging further the problem at https://github.com/UKPLab/sentence-transformers/pull/3261, it boils down to this expensive loop over a torch tensor. Looping over a list, like in RandomSampler, solves the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149126
Approved by: https://github.com/divyanshk, https://github.com/cyyever
2025-03-31 04:33:35 +00:00
dscamiss
59abb8c7a2 Fix documentation build errors caused by unsupported section titles (#150205)
Fixes #150134

Build with `make html` looks OK now:
```shell
reading sources... [100%] torch.compiler_get_started .. xpu
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 80%] generated/torch.nn.Softsign .. generated/torch.nn.modules.module.register_module_full_backward_writing output... [ 86%] generated/torch.nn.modules.module.register_module_module_registration_hook .. generated/torch.rwriting output... [100%] generated/torch.xpu.get_rng_state .. xpu
generating indices... genindex done
highlighting module code... [100%] typing
writing additional pages... search done
copying images... [100%] _static/img/torch_cuda_memory/allocator_state_history.png
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build succeeded.

The HTML pages are in build/html.
```

New rendering looks like this:

![image](https://github.com/user-attachments/assets/af7e23a5-9dfd-4cb6-9333-a9e8cfe47ea0)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150205
Approved by: https://github.com/albanD
2025-03-31 04:27:44 +00:00
jj hunt
46c8f2e965 Update docstring to match code. (#148455)
Very tiny fix to doc string. Pass grid_size=None results in an Exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148455
Approved by: https://github.com/mikaylagawarecki
2025-03-31 04:16:11 +00:00
Nichols A. Romero
ca2ffc23ab [ROCm][TunableOp] Stricter unit tests for online and offline tuning (#150142)
Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details:
- Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough so we now compare OpSig+ParamSig.
- The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB.
- Offline tuning does not support some tensor shapes. Emit warning and skip tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-31 04:12:08 +00:00
Daniel Vega-Myhre
157bff22f7 [Async TP] Fuse matmul-reduce-scatters when reduce scatters have multiple users, and save fused node for backward instead of reduce_scatter node (#149946)
Fixes #149876

## Stack
- [previous PR in stack] https://github.com/pytorch/pytorch/pull/149247

## TL;DR
This PR implements support in async TP for saving the reduce-scatter result for backward, which previously would break the torchtitan AC policies: no AC, per op SAC, and per layer SAC.

## Context
In torchtitan's LLama3 per op SAC policy, we want to save the output of `reduce_scatter` ops for backward, which is useful for TP. The reduce_scatter op is also saved for No AC (since all activations are saved) and per layer SAC (since we save the activations for N full layers, which do contain reduce-scatters for TP.

However, doing this causes incompatibility with Async TP for the AC policies above, for 2 reasons:

1) The graph pattern matching specifically only matches on reduce scatter nodes with 1 user, but reduce_scatter nodes saved for backwards will have 2 users (the 2nd one being the return/output node, which saves it for backward).

2) The subgraph replacement logic which replaces the users of the `wait_tensor` after the reduce-scatter with the new fused node has no mechanism to save the fused_node for backward instead of the reduce-scatter node. This means we cannot directly replace the subgraph, since we can't delete nodes which still have users (in this case, the output node is still using the reduce-scatter node).

To fix this, we do 2 things:

1) Add additional pattern matching logic to also match reduce-scatter nodes with 2 users, so we also perform fusion when reduce-scatter is saved for backward.

2) When replacing the subgraph with the fused node, detect if the reduce-scatter was saved for backward, and if so, save the result of the fused node for backward instead. This enables us to properly erase the subgraph and prevent the memory leak which occurred in #149876

## Other changes
- Continue to throw an error if we don't find any candidate all-gathers or reduce-scatters for fusion (since TP should have both) but DON'T throw an error if we don't fuse any matmul-reduce-scatters. This is because I've found there are actually valid graphs where we do fuse reduce scatters in the forward graph but not the backward graph (in the backward pass there are reduce-scatters but the producer op is an "add" not a mm/scaled_mm).

## Test plan

1. All unit tests are passing
2. Visualized the graphs and verified the fusion is occurring properly.
3. Verified via manual torchtitan runs there is no memory leak / OOM occurring anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149946
Approved by: https://github.com/fegin
2025-03-30 19:05:47 +00:00
James Wu
cbc0964636 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-30 17:51:11 +00:00
Prajesh Praveen Anchalia
005c9b2f4f Fix _Waitcounter decorator and dd backward pass wait counter (#150235)
Summary:
This will log a wait counter with for backward compile and fixes weirdness with nested context managers.

Since the old wait counters added through dynamo_timed were never created with the nesting issue. I am also changing the key nomenclature from `pytorch.dynamo_timed` to `pytorch.wait_counter`. We want to use the same nomenclature, to make it easy to find keys.

Reviewed By: jamesjwu

Differential Revision: D72032055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150235
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-03-30 05:20:12 +00:00
Shangdi Yu
cc58ecceea Move dump location to avoid dumping twice (#150219)
Summary:
If we put the dumping code in codegen, we might get a separate node_mapping dump for the constant folded graph (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L1119).

We move it into compile_fx.py so there's only one node_mapping dump.

Test Plan: CI

Reviewed By: YUNQIUGUO

Differential Revision: D72068715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150219
Approved by: https://github.com/YUNQIUGUO
2025-03-30 03:35:38 +00:00
Horace He
3140565db6 Update type of create_block_mask to more accurately reflect things (#150244)
Fixes some mypy issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150244
Approved by: https://github.com/drisspg
2025-03-29 21:55:57 +00:00
sanshang
879a293db8 fix et trace collection of all_to_all (#149485)
![image](https://github.com/user-attachments/assets/1e602dec-24a4-4f47-88c0-9311737e217b)
![image](https://github.com/user-attachments/assets/c48a3273-43fb-4a7f-9341-b90cb6b10785)

fix ET trace collection to all_to_all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149485
Approved by: https://github.com/shengfukevin, https://github.com/kwen2501
2025-03-29 20:17:24 +00:00
Nikita Shulga
965784eb9b [MPSInductor] Specify max_total_threads_per_threadgroup (#150247)
When generating reduction kernel, otherwise compiler can unroll loops too much that kernel could not be launched for the intended threadgroup size

Extend `c10:🤘:max` to accept different dtypes

Together this fixes `test_large_broadcast_reduction`

TODO:
  - Explore different threadgroup_sizes for best perf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150247
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150246
2025-03-29 19:37:15 +00:00
PyTorch MergeBot
3b00ff8850 Revert "[Profiler] Give non-zero default values to start events (#149757)"
This reverts commit bc72420bcb.

Reverted https://github.com/pytorch/pytorch/pull/149757 on behalf of https://github.com/malfet due to Broke windows builds, which were also the signal on the HUD ([comment](https://github.com/pytorch/pytorch/pull/149757#issuecomment-2763461365))
2025-03-29 15:08:55 +00:00
PaulZhang12
b8ef642f04 Enable TMA persistent GEMM Template by default (#149427)
Previously, this was unable to be landed given there was limited H100 for CI testing. Benchmarking on H100 CI looks good now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149427
Approved by: https://github.com/drisspg
2025-03-29 07:32:42 +00:00
Max Calman
bc72420bcb [Profiler] Give non-zero default values to start events (#149757)
The intent of the existing code is to

> // Assign system TIDs to start events based on the system TID of the next
    // observed event with the same Python TID.

However, if there are start events that don't share the same Python TID as later observed events, then they are left with the default initialization of DeviceAndResource and assigned values of `0`. This is problematic because Kineto uses `device=0, resource=0` for the first GPU (or other backend) device.

This PR maintains the previous logic of using TIDs from later events if any are present, but defaults to the current process and system thread IDs if there aren't later events to reference.

This issue was discovered while working to implement a custom backend and some CPU start events were appearing on the same process and thread as the device in the trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149757
Approved by: https://github.com/sraikund16
2025-03-29 06:29:25 +00:00
Shangdi Yu
5e787bf3e5 [reland] Support torchbind in OSS proxy executor (#150196)
Summary:
The original Diff D69500038 is reverted due to a false alarm on trunk health.

Implement torchbind support in OSSProxyExecutor.

Exactly the same as the implementation in FbProxyExecutor.

D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method
D70746626 - Support None output type

Other changes:

- When generating the schema of the CallTrochBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).

- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent of how files are packaged internally (more details in internal Diff summary).

Note on using `filesystem`:

Seems like there'll be [issues](https://github.com/pytorch/pytorch/pull/137209) with using`filesystem` header in linux, so here I use string manipulation instead of `filesystem::path`.

Test Plan:
```
test/inductor:torchbind -- -r torchbind_aoti
test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D72063691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150196
Approved by: https://github.com/hl475, https://github.com/desertfire
2025-03-29 03:36:55 +00:00
Mandar Deshpande
0861af2596 [pytorch][triton] Warp specialization support in TritonTemplate for torchinductor (#148503) (#150122)
Summary:
Currently only `num_warps` and `num_stages` are supported as one of the kernel options for inductor auto-tuning using `TritonTemplate`.

In order to allow warp-specialization kernel options should allow specifying `num_consumer_groups` and `num_buffers_warp_spec` as well.

NOTE: Currently gating changes to FBCODE using HAS_WARP_SPEC which is only available on triton/release-3.3.x

Test Plan:
## Unit test
Added tests for `test_triton_template_warp_specialization` to verify generated kenrnel contains configs for  `num_consumer_groups` and `num_buffers_warp_spec`.

## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention

from triton.testing import do_bench

make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()

flex_compiled = torch.compile(flex_attention, fullgraph=True)

print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```

triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422

## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685

Differential Revision: D71982587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150122
Approved by: https://github.com/eellison, https://github.com/zou3519, https://github.com/jansel
2025-03-29 03:36:50 +00:00
Mu-Chu Lee
03313c6619 [AOTInductor] Add function for users to extract constants in container (#150163)
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor

Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`

`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`

Differential Revision: D72020400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
2025-03-29 03:36:12 +00:00
Nichols A. Romero
7a470c9320 [ROCm] change preferred blas lib defaults (#150212)
Fixes #148883
Fixes #150155

Also adds at::BlasBackend:Default. Instinct cards prefer hipBLASLt, everything else prefers rocBLAS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150212
Approved by: https://github.com/jeffdaily
2025-03-29 03:33:07 +00:00
Tristan Rice
29b3fdab01 TCPStoreLibUvBackend: support masterListenFd (#150215)
This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine.

This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend.

Test plan:
```
pytest -v test/distributed/test_store.py -k test_take_over_listen_socket
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215
Approved by: https://github.com/fduwjj
2025-03-29 01:58:07 +00:00
zeshengzong
cb83850a24 Fix docs format error in torch.nn (#150156)
Fixes #150152

Fix format error in [torch.nn.CosineSimilarity](https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html#torch.nn.CosineSimilarity), [torch.nn.KLDivLoss](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss) and other pages.

## Test Result

### Before

#### torch.nn.CosineSimilarity

![Image](https://github.com/user-attachments/assets/1ad633d9-dfaf-43f0-a536-9035a24bf858)

#### torch.nn.KLDivLoss

![Image](https://github.com/user-attachments/assets/20a001b0-1f66-414e-b554-11934d65a4bf)

### After
#### torch.nn.CosineSimilarity
![image](https://github.com/user-attachments/assets/a2d9ea8d-5637-4604-a0e4-9231a4deee44)

#### torch.nn.KLDivLoss
![image](https://github.com/user-attachments/assets/d0e319f9-a3b3-47a7-b2f8-060d46d53bc7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150156
Approved by: https://github.com/cyyever, https://github.com/malfet
2025-03-28 20:54:09 +00:00
Nikita Shulga
7c65911b11 [MPS] Fix dot/mm for conj_tensors (#150157)
- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter  `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve conj property when gathering the views, that fixes `cov` operator

Fixes https://github.com/pytorch/pytorch/issues/148156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci
2025-03-28 20:36:44 +00:00
Natalia Gimelshein
cdeb32d2d1 enable out variant of 2-shot reduction (#150153)
Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-03-28 19:06:03 +00:00
PyTorch MergeBot
cf7447ae99 Revert "cpp_wrapper: Fix even more tests (#147225)"
This reverts commit d25acac357.

Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00
PyTorch MergeBot
e691fcae0e Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)"
This reverts commit 2b20d1433f.

Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00
Animesh Jain
a469ddc663 [inductor] No type promotion for slice_scatter (#150090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150090
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #149087, #149667, #150036, #148953
2025-03-28 17:02:01 +00:00
Michael Lazos
d2c0c65ea1 [Dynamo] Add debug linting option for graph dedupe (#150053)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150053
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-03-28 14:27:09 +00:00
IvanKobzarev
25309a17f0 [aotd] Config to guess_tangents_stride (#150035)
Differential Revision: [D71907684](https://our.internmc.facebook.com/intern/diff/D71907684)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150035
Approved by: https://github.com/ilyas409, https://github.com/seemethere
2025-03-28 13:54:19 +00:00
PyTorch MergeBot
7c4e49750e Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit c16af5d798.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/jamesjwu due to Sorry I forgot to fix one last test ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2761381443))
2025-03-28 13:35:07 +00:00
James Wu
c16af5d798 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-28 13:28:05 +00:00
Yuanhao Ji
d4da0e955e [Dynamo] Fix is_compile_supported() when device_type contains device index (#147837)
Fixes #147826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147837
Approved by: https://github.com/anijain2305
2025-03-28 07:16:29 +00:00
Pian Pawakapan
103bf64a3c [export] refactor _Dim into Dim (#149891)
Summary: forward fix T218515233

Test Plan: test_export

Differential Revision: D71769231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149891
Approved by: https://github.com/jingsh, https://github.com/angelayi
2025-03-28 06:19:03 +00:00
bobrenjc93
f649ee73ce Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-28 05:36:32 +00:00
Tugsbayasgalan Manlaibaatar
c49315e645 Improve attr mismatch msg (#149576)
Differential Revision: [D71513041](https://our.internmc.facebook.com/intern/diff/D71513041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149576
Approved by: https://github.com/avikchaudhuri
2025-03-28 05:10:56 +00:00
Animesh Jain
c9ebf517c2 [dynamo][invoke_subgraph] Input aliasing and mutation check in Dynamo (#148953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148953
Approved by: https://github.com/zou3519
ghstack dependencies: #149087, #149667, #150036
2025-03-28 03:50:07 +00:00
eellison
c18e2ce53b Ignore meta ops in inductor (#150137)
Fix for https://github.com/pytorch/pytorch/issues/144607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150137
Approved by: https://github.com/BoyuanFeng
2025-03-28 03:01:57 +00:00
PyTorch MergeBot
ddb1e97839 Revert "Support torchbind in OSS proxy executor (#149747)"
This reverts commit aa70d62041.

Reverted https://github.com/pytorch/pytorch/pull/149747 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149747#issuecomment-2760040741))
2025-03-28 02:48:02 +00:00
Colin L. Rice
2f785ab208 dynamo_compile: Log all compilation time under all_compilation_types (#149664)
This counter is designed to include all compilation pytorch does (triton +
dynamo_compile). However this wasn't including all of dynamo compilation, since
it was put in at the fx_codegen_and_compile spot.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149664
Approved by: https://github.com/masnesral
2025-03-28 02:27:48 +00:00
Natalia Gimelshein
8a872261dc Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-03-28 02:14:27 +00:00
Sam Larsen
1e55b9c0b5 Fix autotune pool shutdown (#149890)
Summary: A couple follow-ups noted in review from https://github.com/pytorch/pytorch/pull/149700:
1. Make sure we correctly signal _all_ subproces to shutdown, even in the case where some processes are currently benchmarking.
2. Change how the pool singleton is created. That also allows us to fully initialize the object in the ctor and remove a bunch of asserts.

Test Plan: existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149890
Approved by: https://github.com/aorenste
ghstack dependencies: #149700
2025-03-28 02:09:51 +00:00
Sam Larsen
266bd22b44 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue think that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find find is maybe a little log spew?, e.g., ` ResourceWarning: subprocess 2987680 is still running`

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-28 01:06:39 +00:00
Shivam Raikundalia
8b04364914 [Easy/Profiler] Set Duration to -1 for unfinished CPU events (#150131)
Summary: Some OSS Kineto users were requesting that we allow for 0 duration events in Kineto even though they won't be seen on the trace. To allow this we changed the handling of said events in D71510383. However this causes unfinished events in collection to never be post processed; this diff fixes said issue.

Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1743102222/localhost/libkineto_activities_631490.json.gz&bucket=gpu_traces

Differential Revision: D71993609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150131
Approved by: https://github.com/chuanhaozhuge, https://github.com/xw285cornell
2025-03-28 00:29:22 +00:00
Shangdi Yu
aa70d62041 Support torchbind in OSS proxy executor (#149747)
Summary:
Implement torchbind support in OSSProxyExecutor.

Exactly the same as the implementation in FbProxyExecutor.

D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method

Other changes:

- When generating the schema of the CallTrochBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).

- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent of how files are packaged internally

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r torchbind_aoti

buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69500038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149747
Approved by: https://github.com/desertfire
2025-03-28 00:04:19 +00:00
Taras
d670df356c Improve error handling when checking CUDA version in case nvcc is not found (#148671)
Fixes:
- https://github.com/pytorch/pytorch/issues/101138

**Description**
The PR enhances error handling in `_check_cuda_version` by verifying the existence of the `nvcc` executable before invoking `subprocess.check_output`. If `nvcc` is missing, a `FileNotFoundError` is raised with a clear message, guiding users to check their CUDA installation and path configuration.

**Testing**
Manually tested with and without `nvcc` present in the expected path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148671
Approved by: https://github.com/malfet
2025-03-27 23:04:59 +00:00
Benjamin Glass
2b20d1433f cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
ghstack dependencies: #147225
2025-03-27 23:00:01 +00:00
PyTorch MergeBot
1a3bd894ff Revert "[fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)"
This reverts commit 6eac3a0068.

Reverted https://github.com/pytorch/pytorch/pull/149744 on behalf of https://github.com/malfet due to Broke tests, see 80aa88f907/1 ([comment](https://github.com/pytorch/pytorch/pull/149744#issuecomment-2759676260))
2025-03-27 22:31:54 +00:00
eellison
4c57aec5b9 Dont exclude constant_pad_nd in prologue fusion (#149947)
Originally, I excluded constant_pad_nd from fusing to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. Also includes a fix for making single, contiguous dep for prologues.

For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.

```
import torch
import torch.nn.functional as F
torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100

K, N = 2048, 4096

tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd to pad with zeros for the invalid rows
    padded_input = F.pad(tensor1, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg
2025-03-27 22:26:30 +00:00
PyTorch MergeBot
80aa88f907 Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit ac91f8765b.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/yangw-dev due to This is breaking ROCM tests on trunk. hud.pytorch.org/ ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2759604301))
2025-03-27 22:15:40 +00:00
Avik Chaudhuri
21bcbbfb5e fix range constraints for expr (#150103)
During tracing it is possible for a `s1: VR[2, inf]` to be replaced by a `s0: VR[3, inf]` (note smaller range) by the shape env. But after export, unfortunately we'd previously record `range_constraints[s0] = VR[2, inf]` (note larger range), which is incorrect.

This is because we'd map `s1.node.expr` (`s0`) to the `var_to_range` of `s1.node._expr` (`s1`) when creating `range_constraints`. The comment surrounding this code suggests this predated `bound_sympy`, but now we can do better.

For users, this means that when using `Dim.DYNAMIC` previously they wouldn't get input constraints checked sufficiently, now they do (shifting errors early).

Differential Revision: D71962694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150103
Approved by: https://github.com/zhxchen17
2025-03-27 22:11:39 +00:00
Keke Zhai
68414512e6 Implement aten.select.int sharding strategy (#149842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149842
Approved by: https://github.com/XilunWu
2025-03-27 20:49:00 +00:00
Benjamin Glass
d25acac357 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
2025-03-27 19:21:03 +00:00
Shangdi Yu
0ed0b7fa96 [aoti] Better error message when torchbind object is used as a graph input in AOTI (#149965)
Summary: Given an explicit error when torchbind object is used as input to AoTI

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_input
```

Differential Revision: D69490915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149965
Approved by: https://github.com/desertfire
2025-03-27 18:48:55 +00:00
vasiliy
01cb3519b3 wire torch._scaled_mm with fp4 operands to the cublas nvfp4 kernel (#148792)
Summary:

When `a` and `b` have dtype `torch.float4_e2m1fn_x2` and `a_scale` and `b_scale` have dtype `torch.float8_e4m3fn`, makes

```python
c = torch._scaled_mm(a, b, a_scale, b_scale, out_dtype=torch.bfloat16)
```

call the cuBLAS fp4 gemm kernel, as specified in https://docs.nvidia.com/cuda/cublas/index.html?highlight=fp4#d-block-scaling-for-fp8-and-fp4-data-types

note: output scale (`scale_in_D` from the cuBLAS docs) is not tested in this PR - we can enable in a follow-up.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k mxfp8_nvfp4
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148792
Approved by: https://github.com/eqy
ghstack dependencies: #148791
2025-03-27 17:32:20 +00:00
vasiliy
e33bc41958 add torch.float4_e2m1fn_x2 to PyTorch (#148791)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/146578 to get around
rebase conflicts.

Test Plan:

```
pytest test/quantization/core/experimental/test_floatx.py -s
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148791
Approved by: https://github.com/drisspg, https://github.com/eqy, https://github.com/jeffdaily
2025-03-27 17:32:20 +00:00
James Wu
ac91f8765b Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
ghstack dependencies: #149657
2025-03-27 17:14:44 +00:00
Danfeng Wang
6eac3a0068 [fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)
Summary:

To align with thrift-python, we are adding the int base class for `non-Flag` enums. In order to not break production code, the annotation `python.NoIntBaseClassDeprecated` is added to opt-out some enums

After the related customer code logic changes, we can now safely remove the annotations that were added earlier.

Our ultimate goal is to unconditionally add the `int` base to `thrift-py3` enums.

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)'
```

Reviewed By: ahilger

Differential Revision: D71446522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149744
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2025-03-27 17:11:26 +00:00
James Wu
14f0cd7630 [StaticCudaLauncher] Support sharedMemBytes > 48KB (#149657)
Triton does some special handling when requesting more than 48 KB of shared memory: specifically it queries the device for maximum device memory, then sets the maximum amount of dynamic memory to be the difference between static and dynamic memory.

See corresponding implementation in triton land here:
https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L128-L143

Test plan:
- New unit test requesting more than 48 KB of memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149657
Approved by: https://github.com/jansel
2025-03-27 17:00:18 +00:00
Ankita George
85e4e51a7d Fix bug in _load_state_dict_from_keys method (#150058)
Summary:
The _load_state_dict_from_keys method specifies that `Loads any key specified in this set. If no keys are specified, the entire checkpoint is loaded.`
But this isn't happening right now, because an empty keys arg is passed in as a set() to `_load_state_dict` and keys is expected to be None for it to actually be included in the state_dict https://fburl.com/code/l8yzojyx. So with the set() argument, the state_dict is always going to be empty

Test Plan: ensure existing tests pass

Differential Revision: D71930712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150058
Approved by: https://github.com/saumishr
2025-03-27 16:36:00 +00:00
Boyuan Feng
c830d750e6 [graph partition] support splitting on custom ops (#149782)
This PR adds support for graph partition on custom ops. Land after #149458.

### API
This PR provides a new API to register/unregister custom ops for graph partition.

```python
def register_custom_op_support_cudagraph(
    operator: torch._library.custom_ops.CustomOpDef,
    is_cudagraphable: bool,
) -> None
```

Example usage:

```python
from torch._inductor.utils import register_custom_op_partition

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1) * 2
    return cropped_img.cuda() / 255.0

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
```

### Example
In this example, 1 torch-compiled region has 3 cudagraphs after splitting on 2 custom ops.

![image](https://github.com/user-attachments/assets/6d07355b-6690-4cde-89ef-e4aff6b0079c)

Code to repro:
```python
import torch
from torch._inductor.utils import register_custom_op_support_cudagraph

torch._inductor.config.graph_partition = True

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1)*2
    return cropped_img.cuda() / 255.

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::modify", mutates_args=())
def modify(pic: torch.Tensor) -> torch.Tensor:
    pic1 = pic + 1
    pic1_cpu = (pic1.cpu() + 1) * 2
    return pic1_cpu.cuda() + pic

@modify.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::transform", mutates_args=())
def transform(pic: torch.Tensor) -> torch.Tensor:
    return (pic + 1) * 2

@transform.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
register_custom_op_support_cudagraph(modify, is_cudagraphable=False)

img = torch.randn(3, 64, 64, device="cuda")

def f(img):
    x = (img + 10) * 2
    y = movement(x)
    z = y + 1
    u = transform(z)
    v = 2*u + 1
    out = modify(v)
    return out + 1

compiled_f = torch.compile(f, mode="reduce-overhead", fullgraph=True)

eager_out = f(img)

for _ in range(3):
    compiled_out = compiled_f(img)
    assert torch.allclose(eager_out, compiled_out)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149782
Approved by: https://github.com/zou3519
2025-03-27 16:23:07 +00:00
PyTorch MergeBot
efc975feb2 Revert "[triton] Warp specialization support in torchinductor (#148503)"
This reverts commit 36183215e8.

Reverted https://github.com/pytorch/pytorch/pull/148503 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148503#issuecomment-2758590645))
2025-03-27 16:06:42 +00:00
PyTorch MergeBot
af7719a2fa Revert "Use source hashing to generate consistent symbolic ids (#149665)"
This reverts commit 1f92348dc6.

Reverted https://github.com/pytorch/pytorch/pull/149665 on behalf of https://github.com/malfet due to Broke trunk, see 6eb3c2e282/1 ([comment](https://github.com/pytorch/pytorch/pull/149665#issuecomment-2758578187))
2025-03-27 16:02:27 +00:00
Mandar Deshpande
36183215e8 [triton] Warp specialization support in torchinductor (#148503)
Summary:
Currently only `num_warps` and `num_stages` are supported as one of the kernel options for inductor auto-tuning using `TritonTemplate`. In order to allow warp-specialization kernel options should allow specifying `num_consumer_groups` and `num_buffers_warp_spec` as well.

Test Plan:
## Unit test
Added tests for `test_triton_template_warp_specialization` to verify generated kenrnel contains configs for  `num_consumer_groups` and `num_buffers_warp_spec`.

## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention

from triton.testing import do_bench

make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()

flex_compiled = torch.compile(flex_attention, fullgraph=True)

print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```

triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422

## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685

Differential Revision: D70212243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148503
Approved by: https://github.com/eellison
2025-03-27 13:07:50 +00:00
_githubsgi
f0e1a0838c Enabling xpu in OffsetBasedRNGTracker . (#148360)
Else torch.distributed breaks on xpu devices.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148360
Approved by: https://github.com/zhangxiaoli73, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/XilunWu, https://github.com/kwen2501

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-03-27 10:55:05 +00:00
Laith Sakka
6cbcdee944 Introduce guard_or_true, guard_or_false (#148430)
some context in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj

But TLDR;
`guard_or_true`, `guard_or_false` are better than `guard_size_oblivious` due to :
- Easier to reason about what assumptions we are making while reading the code.
- Avoid size_oblivious complexity that is not needed.
- Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when its not true for some vaue `a` during runtime.
- Less data dependent errors for some cases: ex, when doing `guard_size_oblivious(a==1)` and we know `a` is a tensor size, if it's traced with `a=u1-u2` `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_else_false` will just return `False`.

### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict input space to evaluate cond. if it fails to evaluate due to data dependent error will throw an exception (that could be converted to graph break in some situations).

**`statically_known_true(cond)`:** would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints.

**`guard_or_true(cond)`/`guard_or_false(cond)`:** Those would be used in situations you prefer to guard and know the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false.
Some reasons you might be ok with returning true/false instead could be:
1. It's an optimization I do not want to fail for not performing optimization.
2. I am willing to deviate from the normal semantics when I have unbacked for the benefit of not failing (See the doc above for more details).

**`definitely_true(cond)`**: same as `guard_or_false(cond)` except does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false` or make it alias to `guard_or_false`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
2025-03-27 09:34:05 +00:00
pralay
a9ee797e41 added fake tensor support for foreach_copy (#149127)
Fixes #149111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149127
Approved by: https://github.com/jansel, https://github.com/jeromean
2025-03-27 09:26:23 +00:00
Mu-Chu Lee
e6afb51805 [AOTInductor] Free folded constants that's managed by AOTInductor (#149825)
internally.

Summary:
This diff allows freeing the usage of folded constants that's created by
AOTInductor through CUDACachingAllocator instead of the constant blob
from cudaMalloc directly.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh
2025-03-27 06:05:50 +00:00
PyTorch MergeBot
e080bac533 Revert "Introduce guard_or_true, guard_or_false (#148430)"
This reverts commit d5593ea31c.

Reverted https://github.com/pytorch/pytorch/pull/148430 on behalf of https://github.com/laithsakka due to need to fix stuff ([comment](https://github.com/pytorch/pytorch/pull/148430#issuecomment-2756701436))
2025-03-27 05:10:20 +00:00
Simon Fan
748252378d [ca] introduce RuntimeState to support c++ hooks via graph breaks (#149987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149987
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709, #149651, #149897
2025-03-27 05:05:34 +00:00
Simon Fan
dcb378cff2 [ca] support anomly mode nan checks with different semantics than eager (#149897)
see note in code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149897
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709, #149651
2025-03-27 05:05:34 +00:00
bobrenjc93
1f92348dc6 Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-27 03:39:27 +00:00
Daniel Vega-Myhre
ae29f054f5 [Async TP] More robust support for rowwise scales when fusing matmul reduce-scatter (#149247)
Part of https://github.com/pytorch/torchtitan/issues/866

## Context
- Async TP needs to support the "reshape -> scaled_mm -> reshape" pattern because scaled mm only supports 2D input tensors and 2D scales.
    - (a,b,c) => (a*b,c)
    - (a\*b,c) @ (c,d) = (a\*b,d)
    - (a\*b,d) => (a,b,d)

- Currently the implementation does not support scaled mm with rowwise scales **for all cases** of the reshape -> scaled_mm -> reshape pattern. The minimal example of this pattern is confirmed to work via this [unit test](00a2c68f67/test/distributed/tensor/parallel/test_micro_pipeline_tp.py (L406)), but more involved e2e examples in torchtitan fail silently (more context in final bullet point).
- Previously, the "A tensor" **node** referenced in the async TP graph manipulation code is the 3D+ node before the reshape, but the "A_scale" node is the 2d node from after the reshape, so they are incompatible.
- I previously implemented a simpler solution to this problem in https://github.com/pytorch/pytorch/pull/148001, with a [unit test](https://github.com/pytorch/pytorch/pull/148001/files#diff-115f1d0852382c9b58f22640d80999d879b33618e5f6c633fc9e4d0ca9781cecR406) confirming the fused node is indeed in the graph for the minimal example of the reshape->mm->reshape pattern. I also confirmed via manual e2e testing w/ torchtitan that the crash I was fixing no longer occurred. However, it turns out due to this [bug in torchtitan](https://github.com/pytorch/torchtitan/issues/866)  it was causing async TP to fail silently and fall back to vanilla TP, hiding the fact that this original solution fixed the crash but the fusion would not occur for rowwise scales. Thus, more robust solution is needed to support all cases.

## Solution TL;DR
- Use the 2D 'A' tensor and corresponding 2D scales as input to the fused_matmul_reduce_scatter implementation, instead of the 3D+ tensor/scales.
- Track the "pre mm reshape" and "post mm reshape" separately, to be referenced in the `fused_scaled_matmul_reduce_scatter` implementation, to update the scatter dim through the pre-mm reshape, and apply the post-mm reshape before applying the reduce scatter and returning the output tensor.
- Separate the `fused_matmul_reduce_scatter` and the `fused_scaled_matmul_reduce_scatter` code paths, to simplify them both.
- By fixing the bug in torchtitan (PR https://github.com/pytorch/torchtitan/pull/965) and implementing support for rowwise scales in pytorch in this PR, together these changes will solve the problem of how to support rowwise scales with all types of AC.

## Additional details for reviewers
To use the 2D A tensor while also supporting the "reshape -> mm -> reshape" pattern, the following other changes were needed:
- Track the pre-mm reshape, as it will affect the scatter dim used in the fused_matmul_reduce_scatter impementation.
- Track the post-mm reshape, as it will affect the output shape used in the fused_matmul_reduce_scatter impementation
- Based on the pre-mm reshape and the original scatter dim, calculate the new scatter dim for the 2D tensor. This is needed because during the pipelined producer mm implementation, the scatter dim is moved to dim 0 (so it can be sharded along the first dim and then get chunks to do mm ops on by indexing into the first dim), then moved back to it's original place before the reduce-scatter.
- Use the tracked post-mm reshape to reshape the stacked partial 2D outputs of the mm ops into 3D outputs needed for 1) the reduce-scatter w/ the original scatter dim, and 2) the expected output shape to prevent shape errors with subsequent ops.

## Test plan
- All existing unit tests passing.
- Expand unit tests for rowwise scales to test more scatter dims
- Added unit tests enforcing that async TP fails fast / throws an error if it fails to perform any fusions. Previously it just "failed silently" (fell back to vanilla TP without the user knowing) which has led to confusion, so this will improve the UX.
- Compared loss curves of bf16 vs float8 w/ rowwise scales to confirm integrity of numerics
- Confirmed via manual testing with torchtitan and inspecting the compile graph that the fusion is working as intended for:
    - bfloat16
    - float8 with tensorwise scales
    - float8 with rowwise scales

## Loss curves

Loss curves are virtually identical for bf16 + vanilla TP versus float8 with rowwise scales + async TP:

<img width="1017" alt="loss_async_tp" src="https://github.com/user-attachments/assets/4995db78-7012-490f-a370-f4fecc289a22" />

## Performance

#### Per op SAC
Performance benchmarks for torchtitan Llama3 8b training runs on 4 H100s with per op SAC, using FSDP degree=2, TP degree=2:
- bf16 (vanilla TP): TPS 5161.5, peak memory 50.53 GB
- bf16 (async TP): TPS  5229.5, peak memory 50.68 GB
- float8 tensorwise (vanilla TP): TPS: 5959.5, peak memory: 50.47 GB
- float8 tensorwise (async TP): TPS 5964.5, peak memory 50.47 GB
- float8 rowwise (vanilla TP): TPS: 4962.0, peak memory: 50.55 GB
- float8 rowwise (async TP): TPS 4966.5, peak memory 50.65 GB

#### Full AC
Llama3 70b training runs on 128 H100s with full AC, using FSDP=16, TP=8
- bf16 (vanilla TP): 598 TPS, peak memory 71.51 GB
- bf16 (async TP): TPS  673, peak memory 71.08 (+12.54% TPS vs vanilla TP)
- float8 tensorwise (vanilla TP): 820 TPS, peak memory  55.26 GB
- float8 tensorwise (async TP): 950 TPS, peak memory 55.91 GB (+15.85% TPS vs vanilla TP)
- float8 rowwise (vanilla TP): TPS: 540 TPS, peak memory 71.46 GB
- float8 rowwise (async TP): 560 TPS, peak memory 70.65 GB (+3.7% TPS vs vanilla TP but still unexpectedly lower than bf16)

As you can see, float8 rowwise is working but performance needs to be improved further.

## Other changes
- Added logging so the user will know why fusion failed if it does.
- Remove logic which inserted a reshape node targeting "A scale" to get it to be in 3D like the "A tensor" since it's no longer needed.

## Long term plan
- Add a `scaled_matmul` op in pytorch, which will natively support a 3D+ "A tensor" and allow us to simplify the async TP implementation by avoiding the reshape -> scaled_mm -> reshape pattern and the special handling for it.

## Visualizing fused nodes in graphs for torchtitan training runs

Below are examples of the visualized graph generated by torch compile for torchtitan llama3 8b training runs with per op SAC. These graphs provide additional evidence (beyond the new unit tests added) that the implementation is working correctly.

### bf16

<img width="900" alt="bf16-fusion" src="https://github.com/user-attachments/assets/a3bed917-28eb-4a56-8d6e-2d2bf498385c" />

### float8 with tensorwise scales

<img width="900" alt="tensorwise-node" src="https://github.com/user-attachments/assets/b212ec4a-1899-44de-a4de-18c74e1de68a" />

### float8 with rowwise scales

<img width="900" alt="rowwise" src="https://github.com/user-attachments/assets/ed3354a3-894b-4ec9-86d0-f80364bf3d83" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149247
Approved by: https://github.com/kwen2501
2025-03-27 03:15:30 +00:00
Yidi Wu
b2b9aaf0ad Fix non-strict export doesn't turn on dynamo for hop (#149903)
Somehow the torch._dynamo.is_compiling is changed to torch.compiler.is_compiling(), which also checks whether we're exporting. This is not caught by cI because we don't have an export test for scan.

Changing to torch.compiler.is_dynamo_compiling and added a test.

edit: piggyback the re-tracing support in this PR. Related code in combine_fn_is_normalized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149903
Approved by: https://github.com/zou3519
2025-03-27 02:38:05 +00:00
vasiliy
dad0854d48 meta registration for torch._scaled_mm with mxfp8 (#148461)
Summary:

Adds the meta registration logic for torch.compile to work with
`torch._scaled_mm` with mxfp8.  Thanks to @eellison  for the pointer to make inductor work with this.

Test Plan:

```
pytest test/test_matmul_cuda.py -k test_blockwise_mxfp8_compile -s
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148461
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-03-27 02:32:40 +00:00
Laith Sakka
d5593ea31c Introduce guard_or_true, guard_or_false (#148430)
some context in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj

But TLDR;
`guard_or_true`, `guard_or_false` are better than `guard_size_oblivious` due to :
- Easier to reason about what assumptions we are making while reading the code.
- Avoid size_oblivious complexity that is not needed.
- Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when its not true for some vaue `a` during runtime.
- Less data dependent errors for some cases: ex, when doing `guard_size_oblivious(a==1)` and we know `a` is a tensor size, if it's traced with `a=u1-u2` `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_else_false` will just return `False`.

### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict input space to evaluate cond. if it fails to evaluate due to data dependent error will throw an exception (that could be converted to graph break in some situations).

**`statically_known_true(cond)`:** would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints.

**`guard_or_true(cond)`/`guard_or_false(cond)`:** Those would be used in situations you prefer to guard and know the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false.
Some reasons you might be ok with returning true/false instead could be:
1. It's an optimization I do not want to fail for not performing optimization.
2. I am willing to deviate from the normal semantics when I have unbacked for the benefit of not failing (See the doc above for more details).

**`definitely_true(cond)`**: same as `guard_or_false(cond)` except does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false` or make it alias to `guard_or_false`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
2025-03-27 02:22:20 +00:00
Ahmad Sarvmeily
c2b8fead43 Allow TritonTemplate subclasses to override kernel type (#150018)
Allows subclasses of `TritonTemplate` to override the kernel type, e.g.
```
class MyTritonTemplate(TritonTemplate):
    kernel_type = MyTritonTemplateKernel
```

This means that all of the logic in `TritonTemplate` class doesn't need to be duplicated in subclasses if the only required change is the kernel type.

Note that there is precedent for doing this - see `SIMDScheduling` in `torch/_inductor/codegen/simd.py`:

```
class SIMDScheduling(BaseScheduling):
    kernel_type: type[Any] = SIMDKernel  # override in subclass
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150018
Approved by: https://github.com/jansel
2025-03-27 02:16:40 +00:00
Angela Yi
8d1cfb63b5 [export] Save unflattened gm (#150030)
Summary: Reland of D71082652

Test Plan:
https://www.internalfb.com/intern/testinfra/testrun/8444249558423545
https://www.internalfb.com/intern/testinfra/testrun/7318349652864293
https://www.internalfb.com/intern/testinfra/testrun/13229323980143778
https://www.internalfb.com/intern/testinfra/testrun/11540474119884081

Differential Revision: D71902033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150030
Approved by: https://github.com/pianpwk
2025-03-27 02:01:51 +00:00
Laith Sakka
128b32f363 cache loaded python modules (#149910)
I am splitting caching the loading of modules from the caching the codegen since its trivial and much easier.
Module loading is 50% of the cost, and codegen is 50%  of maybe_append choice on full graph model. which is 40% of total compile time.

<img width="434" alt="Screenshot 2025-03-24 at 4 35 12 PM" src="https://github.com/user-attachments/assets/aa851c6a-bde9-43f8-b12d-e439504ef62c" />

running mm_loop benchmark,
before this change:
67947323682

after this change:
25845073249

2.6X faster.

it seems that the cache was there then got dropped. I added benchmark so it wont be dropped again by mistake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149910
Approved by: https://github.com/eellison, https://github.com/aorenste
ghstack dependencies: #149932
2025-03-27 00:45:09 +00:00
Rachel Guo
48cff64a54 [pt2_provenance_tracing] add combo kernel nodes post_grad nodes origin info (#149598)
Summary: found it helpful when running prod model with combo_kernel feature enabled

Test Plan: CI

Differential Revision: D71513304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149598
Approved by: https://github.com/yushangdi
2025-03-27 00:26:24 +00:00
Animesh Jain
999fa15ba8 [invoke_subgraph][fake tensor cache] Add a finalizer for id hashed objects (#149667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149667
Approved by: https://github.com/zou3519
ghstack dependencies: #149087
2025-03-27 00:01:39 +00:00
Animesh Jain
a7596b4b34 [invoke_subgraph] Fake tensor prop caching (#149087)
Redoing https://github.com/pytorch/pytorch/pull/137808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149087
Approved by: https://github.com/zou3519
2025-03-27 00:01:39 +00:00
Justin Chu
3efa211e48 [ONNX] Annotate None inputs in symbolic ops (#150038)
Add `None` to type annotations of `torch.onnx.ops.symbolic*` ops and improve tests to test support for optional inputs. Previously it was omitted mistakenly even though the implementation supports it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150038
Approved by: https://github.com/titaiwangms
2025-03-27 00:01:09 +00:00
Nikita Shulga
6aca002d82 [MPS] Add chebyshev_polynomial_[uvw] (#150060)
For both eager and inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150060
Approved by: https://github.com/dcci, https://github.com/jansel
2025-03-26 23:35:05 +00:00
PyTorch MergeBot
185aaaaf8e Revert "Improve subproc autotuning implementation (#149700)"
This reverts commit 8cd6a133f2.

Reverted https://github.com/pytorch/pytorch/pull/149700 on behalf of https://github.com/yangw-dev due to This is breaking servicelab_benchmark_pyper_local_runner internally ([comment](https://github.com/pytorch/pytorch/pull/149700#issuecomment-2755975959))
2025-03-26 23:17:01 +00:00
Pat Vignola
625913eefc [MTIA] [Triton] Set codename of MTIA device in triton heuristics (#149860)
Summary: Triton-MTIA expects the codename of the device as the arch when querying the module map, not the compute capability. This diff gets rid of the following error: `No libdevice is provided for arch (0, 0)`

Test Plan: CI

Reviewed By: Myrthan

Differential Revision: D70072095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149860
Approved by: https://github.com/jansel
2025-03-26 20:58:12 +00:00
Boyuan Feng
039ebdc192 [Graph Partition] Support symbol inputs (#149458)
This PR supports symbol inputs to graph partition functions. Before this PR, we rely on `node.read_writes` to get partition inputs. However, this does not cover symbol inputs.

In this PR, for each graph partition, we collect all symbol inputs which are required to be in scope to successfully         perform codegen, including:
- free symbols used in partition nodes.
- free symbols in partition input/node shapes, strides, and offsets. This is needed for recording cudagraphs for tensors with dynamic shapes.

### Note1: MutationLayout
In this example, node.layout is MutationLayoutSHOULDREMOVE. The symint from index `n` does not appear in the size, offset, stridese of node.layout. This symint appear in node.layout.target. So we need extra handle for it.

```python
x = torch.zeros(7, device="cuda")

def fn(n, a):
    a[n] = -1
    return a

opt_fn = torch.compile(fn, fullgraph=True)

for n in range(2, x.shape[0]):
    opt_fn(n, x)
```

### Note2: Composability with Padded Tensor Subclass

W/o graph partition, Padded Tensor subclass lifts outer shapes to input arguments (i.e., arg0_1 for s0, arg1_1 for s1) but does not lift inner shapes (i.e., s2 and s3). Since cudagraph cache relies on integer inputs, it will cache on outer shapes and ignore inner shapes, which is bad.

```
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    arg2_1_size = arg2_1.size()
    s2 = arg2_1_size[0]
    s3 = arg2_1_size[1]
    assert_size_stride(arg2_1, (s2, s3), (s3, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s2, s3), (s3, 1), torch.float32)
        # Topologically Sorted Source Nodes: [x1, mul], Original ATen: [aten.add, aten.mul]
        triton_poi_fused_add_mul_0_xnumel = s2*s3
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_mul_0.run(arg2_1, buf0, triton_poi_fused_add_mul_0_xnumel, stream=stream0)
        del arg2_1
    return (buf0, s0, s1, s1, )
```

w/ graph partition, the partition function only includes tensor and inner shapes as inputs, to make sure the cudagraph caching is correct. Full Comparison: [code](https://www.internalfb.com/intern/diffing/?paste_number=1761674743)
```python
   def call(self, args):
        arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
        args.clear()
        s0 = arg0_1
        s1 = arg1_1
        arg2_1_size = arg2_1.size()
        s2 = arg2_1_size[0]
        s3 = arg2_1_size[1]
        assert_size_stride(arg2_1, (s2, s3), (s3, 1))
        partition0_args = [arg2_1, s2, s3]
        del arg2_1
        (buf0,) = self.partitions[0](partition0_args)
        del partition0_args
        return (buf0, s0, s1, s1, )
```

The number of cudagraphs is validated below: (also added to test)
```python
import torch

from padded_tensor import PaddedTensor

# Turning off graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=6
# at the end, which is wrong.
# torch._inductor.config.graph_partition = False

# Turning on graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=4
# at the end, which is correct.
torch._inductor.config.graph_partition = True

def f(x):
    x1 = x + 1
    return x1 * 2

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(shape):
    x = torch.randn(*shape, device="cuda")
    pad_x = PaddedTensor.from_tensor(x, multipliers={0:4, 1:4})
    assert hasattr(pad_x, "multipliers"), breakpoint()
    eager_out = f(pad_x)

    for _ in range(3):
        compiled_out = compiled_f(pad_x)
    compiled_out = compiled_f(pad_x)

    assert eager_out.shape == compiled_out.shape
    assert eager_out.tensor.shape == compiled_out.tensor.shape
    assert torch.allclose(eager_out.tensor, compiled_out.tensor)

# static shape. record a NEW cudagraph. 1 cudagraph in total now.
run((2,3))
# outer shape is dynamic, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 2 cudagraphs in total now
run((3,4))
# outer shape changed but inner shape does not change
# so NO new cudagraph is recorded
run((2,2))
# inner shape is dynamic now, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 3 cudagraphs in total now
run((5,6))
# does NOT record a new cudagraph
run((7,8))
# record a NEW cudagraph. 4 cudagraphs in total now
run((10,11))

assert torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149458
Approved by: https://github.com/eellison
2025-03-26 17:21:30 +00:00
Mu-Chu Lee
a0253d2840 [Inductor] Use real input to autotune user defined triton kernels (#149553)
Summary:
User defined Triton kernel sometimes rely on real inputs to determine
the path of execution. We need real inputs to invoke the correct
behavior of the user defined triton kernels (see example in test case,
where we have an early return for random inputs)

Test Plan:
Included in the commit.
python test/inductor/test_aot_inductor.py -k triton_autotuning
python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-03-26 16:42:48 +00:00
Jack Taylor
32299e5f9a Reland "Introduce new template heuristic for triton autotune configs" (#147452)
This change was reverted in https://github.com/pytorch/pytorch/pull/147388 for regressing an internal workload.

I have removed the additional ir.device_type calls in mm_scaled and unpack_mixed_mm.py which could be contributing to the additional compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147452
Approved by: https://github.com/jansel
2025-03-26 15:47:06 +00:00
Ankita George
8a40fca9a1 Support huggingface reading and writing for multi rank case (#148189)
Summary: This diff adds the ability for HF reader/writer to read/write in a distributed way. We do this by sending all the tensors meant for the same file to the same rank.

Test Plan:
ensure existing tests pass
I also ran a full end to end test on my devserver to read/write from my HF repo

Differential Revision: D70096439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148189
Approved by: https://github.com/joecummings, https://github.com/saumishr
2025-03-26 14:47:31 +00:00
cyy
79e8a69257 Enable move warnings for torch targets (#149923)
This PR enables more move warnings for torch targets and fixes some code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149923
Approved by: https://github.com/malfet
2025-03-26 08:38:13 +00:00
Justin Chu
6ae8eb881c [ONNX] Clean up the diagnostics module (#149864)
Remove the diagnostics/SARIF module from ONNX exporter because it is obsolete unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149864
Approved by: https://github.com/titaiwangms
2025-03-26 05:58:32 +00:00
PyTorch MergeBot
d256b2dcb2 Revert "[custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)"
This reverts commit d686d04c2f.

Reverted https://github.com/pytorch/pytorch/pull/148555 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148555#issuecomment-2753283221))
2025-03-26 05:27:52 +00:00
Shangdi Yu
819b23e0b4 Support None return type in torchbind and Add more AOTI torchbind e2e tests (#149749)
Summary:
- Add more tests for torchbind in aoti

**FallBackKernel**
- In FallbackKernel.find_device, do not check the device of torchbind obj because they don't have a fixed "device"
- If no device found for CallTorchBindObject, use cpu
- handle None output in `export_extern_kernel_node`

Test Plan:
```
buck run //sigmoid/inference/test:e2e_test_cpu -- -r CustomClassHolderConstantDynamic
```

Differential Revision: D70746626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149749
Approved by: https://github.com/desertfire
2025-03-26 04:20:14 +00:00
Isuru Fernando
71acb1bb42 [inductor] Fix division by zero error in fractional max (#148729)
Fixes https://github.com/pytorch/pytorch/issues/148152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148729
Approved by: https://github.com/eellison
2025-03-26 04:18:50 +00:00
eqy
9108d153ce [CUDA]][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
this is a "temporary" fix as current internal API requires strings at some interfaces instead of `std::optional` and empty strings are presumably used in-lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)

this currently breaks `test_intra_node_comm_all_reduce`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-03-26 03:59:43 +00:00