Commit Graph

47182 Commits

Author SHA1 Message Date
Sukchul Cho
0da8127f77 Compare device name of profiler dynamically (#150396)
Compare self.use_device of torch.autograd.profiler.profiler with _get_privateuse1_backend_name(), since privateuse1 backend can be renamed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150396
Approved by: https://github.com/sraikund16
2025-04-02 06:06:06 +00:00
Rebecca Chen
c65de03196 Add Any return annotation to __getattr__ methods that return a union of types. (#150204)
Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150204
Approved by: https://github.com/malfet
2025-04-02 05:25:07 +00:00
Nikita Shulga
dee016ceb7 [MPSInductor] Add store_reduce method (#150457)
That restrict the store operation to 0th thread, which should be much better, shouldn't it
(Though I don't observe it in the benchmark)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150457
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150452
2025-04-02 05:12:49 +00:00
William Wen
3ac5a499dd [dynamo] add dynamo disable reasons to codebase (#150440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150440
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #150341
2025-04-02 04:26:48 +00:00
William Wen
25eff6e991 [dynamo] add reason field to torch.compiler.disable (#150341)
Implements https://github.com/pytorch/pytorch/issues/146445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150341
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-04-02 04:26:48 +00:00
Shivam Raikundalia
5734909f34 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging.

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 02:44:50 +00:00
eqy
f09513e515 [CUDA]][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
this is a "temporary" fix as current internal API requires strings at some interfaces instead of `std::optional` and empty strings are presumably used in-lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)

this currently breaks `test_intra_node_comm_all_reduce`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-04-02 02:41:07 +00:00
Animesh Jain
61ebe999cc [invoke_subgraph] Do not cache fake tensors for AOTDispatcher first pass (#150450)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150450
Approved by: https://github.com/zou3519
ghstack dependencies: #150082
2025-04-02 02:31:54 +00:00
Animesh Jain
b060fedfa8 [invoke_subgraph] Support None in the fwd output (#150082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150082
Approved by: https://github.com/zou3519
2025-04-02 02:31:54 +00:00
Rithesh Baradi
0ae75ca2de assert on all_reduce_event only if it's not CPU device. (#150316)
Summary: For CPU based runs, `all_reduce_event` would be None since this is the result of the `all_reduce_stream.record_event()`, which does not do much other than returning None when device type is CPU.

Test Plan: CI

Differential Revision: D72176406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150316
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/mori360
2025-04-02 01:54:35 +00:00
cyy
e872c38eb3 Remove cppcoreguidelines-pro-type-member-init_fix suppression (#148638)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148638
Approved by: https://github.com/zou3519
2025-04-02 01:33:20 +00:00
vasiliy
c974b5322a enable torch.compile for torch._scaled_mm nvfp4 recipe (#150462)
Summary:

Updates the meta registration for `torch._scaled_mm` to work for the
nvfp4 recipe.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_blockwise_nvfp4
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150462
Approved by: https://github.com/eellison
2025-04-02 01:08:40 +00:00
tvukovic-amd
db32093192 [ROCm][Windows] Fix torchvision build with ROCm 6.4 on windows (#150180)
Since with HIP SDK 6.4 hipcc files and calls and restructured, the case for calling hipcc.exe is added in case of building torchvision with HIP SDK 6.4 on Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150180
Approved by: https://github.com/malfet, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-02 00:35:47 +00:00
Junjie Wang (PyTorch)
d22e3d5efe [fr] Add logger config for flight record in PGNCCL (#150356)
Summary: We want to move from a scuba based direct logging to a logger config based logging. Mostly changes are internal but we need to change the exception to exception_msg.

Test Plan: Following https://www.internalfb.com/wiki/Server_Logging/Getting_Started_with_Logging/Onboarding_Existing_Scribe-Based_Logging_(Alpha)/ to test it.

Differential Revision: D72198171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150356
Approved by: https://github.com/fegin
2025-04-01 23:54:07 +00:00
Tristan Rice
6aea4d90fb gloo: use shared Stores (#150230)
Summary:
X-link: https://github.com/facebookincubator/gloo/pull/423

This modifies `connectFullMesh` to take in a shared_ptr<IStore> instead of a reference. This is an API breaking change but fairly easy to work around.

To have backwards compatibility in PyTorch during the commit phase we add a new ifdef `GLOO_SHARED_STORE` which can provide backwards compatibility until we update the pinned Gloo version in pytorch OSS repo.

This also adds a new `wait_get` method to `IStore` which will allow us to do a more efficient operation in PyTorch TCPStore. PyTorch's `Store::get` automatically waits so we want to make sure we can avoid waiting twice to reduce network traffic.

This change will land simultaneously in PyTorch and Gloo repos.

Test Plan:
```
buck2 test //gloo/... //caffe2/caffe2/contrib/gloo:
```

Differential Revision: D72084111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150230
Approved by: https://github.com/fduwjj
2025-04-01 23:37:25 +00:00
Nick Riasanovsky
4934a83347 [AMD] [TRITON] [INDUCTOR] Add tl.assume to enable bufferops on AMD (#150373)
Summary: Update the GEMM template to include the necessary `tl.assume` annotations to enable bufferops with AMD.

Test Plan: Tested manually with a simple matmul run with torch.complie(f, mode="max-autotune") the environment variables TRITON_ALWAYS_COMPILE=1 AMDGCN_ENABLE_DUMP=1 AMDGCN_USE_BUFFER_OPS=1.
Inspecting the generated AMDGCN all loads/stores use bufferops.
Note: Since inductor is loading constants for many of the shape values assumes are generally not needed for the stride/shape information, but pid calculations are generally a gap in Triton's inference capability.

Differential Revision: D71922698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150373
Approved by: https://github.com/eellison
2025-04-01 23:29:39 +00:00
angelayi
60fe0922f6 [pytree] Register normal class to register_dataclass (#147752)
Fixes https://github.com/pytorch/pytorch/pull/147532#discussion_r1964365330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147752
Approved by: https://github.com/zou3519
2025-04-01 23:28:20 +00:00
Will Feng
80ab233786 [Inductor] Hide reinplace_fsdp_all_gather pass behind skip_fsdp_hooks config (#150436)
The `reinplace_fsdp_all_gather` pass is currently only for Traceable FSDP2 and doesn't work together with SimpleFSDP. We should hide the pass behind `skip_fsdp_hooks` config which makes it only apply to Traceable FSDP2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150436
Approved by: https://github.com/BoyuanFeng
2025-04-01 22:56:06 +00:00
Avik Chaudhuri
b70d105c77 infer dynamic shapes through additional inputs (#150144)
Summary:
Instead of explicitly specifying dynamic shapes, it is possible to infer them from additional example inputs. Together with the example inputs provided to export, we can basically make any varying dim dynamic and keep any fixed dim static. This should be useful for prod scenarios that have access to tests and/or profiling data, yet are somewhat removed from the model authoring process.

However this alone is not satisfactory: the exported program by design has only one graph, representing one path through the model, and we cannot necessarily guarantee that this graph works for the additional example inputs because different guards might have been created if we had exported with them instead (corresponding to different traced paths). However, checking that the additional example inputs satisfy the guards created by the original export should be sufficient for generalization.

Now, while we don't preserve all guards in the exported program, we do check a subset of them as part of input matching. So we add a verification step at the end of export when such additional example inputs are provided. This should be enough for now.

Test Plan: added test (positive and negative cases)

Differential Revision: D72001771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150144
Approved by: https://github.com/bobrenjc93
2025-04-01 21:13:39 +00:00
Michael Lazos
0d44a8aea1 [Hierarchical Compile] Apply deduplication after output node creation (#150306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150306
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304, #150305
2025-04-01 20:54:18 +00:00
Michael Lazos
8740ffa760 [Hierarchical Compile] Add cycle detection to graph region expansion (#150305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150305
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304
2025-04-01 20:54:18 +00:00
Michael Lazos
a2300aff94 [Hierarchical Compile] Add cycle detection function for debug (#150304)
Remove print

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150304
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303
2025-04-01 20:54:10 +00:00
Michael Lazos
99fd96c10b [Hierarchical Compile] Remove spammy debug log (#150303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150303
Approved by: https://github.com/williamwen42
2025-04-01 20:54:03 +00:00
Tianyu Liu
d2ad9aa2f2 [dtensor][tp] add a ParallelStyle PrepareModuleInputOutput (#150372)
Needed this class for because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied at the same time.

The `PrepareModuleInputOutput` in this PR initializes two variables `prepare_module_input` and `prepare_module_output` and uses them to process module / inputs / outputs.

I had another implementation which put all code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit the monolithic `PrepareModuleInputOutput`. But it is
1. less cleaner
2. conceptually abusing inheritance because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol
2025-04-01 19:15:43 +00:00
Tianyu Liu
5d6ac2dced [dtensor] add op support for select_backward and slice_backward (#150357)
Inheriting and rebasing @awgu 's PR https://github.com/pytorch/pytorch/pull/149071
- fixed an issue for `select_backward` and an issue for `slice_backward`
- removed `_experimental_ops.py` as it becomes empty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150357
Approved by: https://github.com/awgu, https://github.com/XilunWu
2025-04-01 19:15:25 +00:00
IvanKobzarev
a37afd23fa [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-04-01 18:45:48 +00:00
Jason Ansel
37ebb0b56a [inductor] Fix inductor windows linker error (#150256)
Fixes #149889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150256
Approved by: https://github.com/anijain2305, https://github.com/eellison
2025-04-01 18:30:55 +00:00
PyTorch MergeBot
f04cf13bdd Revert "Merge Triton ScaledMM as epilogue to MM template (#150045)"
This reverts commit 981048854d.

Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))
2025-04-01 17:54:28 +00:00
Will Feng
b0c560ef2a [dynamo][hooks] use wrap_top_frame config for functions (#150209)
When torch.compile is applied to a module via `mod.compile(...)`, it's equivalent to `torch.compile(mod._call_impl)` which takes a different path than `OptimizedModule`. This PR ensures that the `wrap_top_frame` config can also take effect for the `torch.compile(mod._call_impl)` use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150209
Approved by: https://github.com/anijain2305
2025-04-01 17:41:23 +00:00
Xia, Weiwen
3b0cd9b542 [Quant][PT2E] add a lowering pass for x86 backend (#149708)
**Summary**
This PR adds a lowering pass for x86 backend
- Patterns of `dequantize -> conv/linear (-> quantize)` are fused to corresponding quantized onednn ops.
- Weights are prepacked ahead of time.
- Post ops of conv/linear are fused if supported.
- The pass returns a `GraphModule` with the modifications mentioned above.

**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_lowering_to_x86
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149708
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-01 17:32:41 +00:00
Nikita Shulga
f94ac263af [MPSInductor] Fix neg for unsigned types (#150412)
By more-or-less copy-n-pasting the fix from https://github.com/pytorch/pytorch/pull/94035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150412
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150382, #150386
2025-04-01 16:52:41 +00:00
Sriram Kumar
a19b667bca [ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)
Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR https://github.com/ROCm/apex/pull/184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010
Approved by: https://github.com/jeffdaily
2025-04-01 16:49:03 +00:00
Ke Wen
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
such stream synchronization become expensive in Inference world (cpu overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
this helps shave off 50% CPU overhead **(70us -> 35us)**, which reduce total CPU/GPU from **230us to 195us by 15%**

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
Nikita Shulga
428234bc28 [MPSInductor] torch.complex128 is unsupported on MPS (#150386)
Same as torch.float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150386
Approved by: https://github.com/dcci
ghstack dependencies: #150382
2025-04-01 15:19:10 +00:00
Xuehai Pan
a10b765bf1 [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class.

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-04-01 10:40:43 +00:00
Prajesh Praveen Anchalia
48e9ffc873 Unify on dynamo_compile as the overall wait counter (#150293)
Summary:
dynamo_compile for the most part has been accounting for compile time except autotuning.

all_compilation_types had earlier been injected on fx_codegen_and_compile, which was incorrect.

Add autotuining to dynamo and deprcate all_compilation_types counter.

Differential Revision: D72145447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293
Approved by: https://github.com/masnesral, https://github.com/jamesjwu
2025-04-01 08:55:51 +00:00
Natalia Gimelshein
1700599266 Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:43 +00:00
Natalia Gimelshein
414b9ae016 enable out variant of 2-shot reduction (#150153)
Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:04 +00:00
Tugsbayasgalan Manlaibaatar
7e7e5698cc Suppress more warnings (#149833)
Differential Revision: [D71702307](https://our.internmc.facebook.com/intern/diff/D71702307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149833
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-04-01 05:33:04 +00:00
William Wen
790d459f85 [dynamo] add error message for unsupported LOAD_BUILD_CLASS (#150323)
Improved error message for https://github.com/pytorch/pytorch/issues/128942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150323
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-04-01 05:03:50 +00:00
Nikita Shulga
6470b373c1 torch.backends.mkldnn.flags() CM should not warn (#150358)
By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined

Added regression test

Fixes https://github.com/pytorch/pytorch/issues/149829

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358
Approved by: https://github.com/atalman
2025-04-01 01:33:40 +00:00
Sun, Jiayi
5cb5675f13 [Inductor] optimize the heuristics of parallel reduction (#149614)
Fix https://github.com/pytorch/pytorch/issues/148639.

Summary:
Optimize the heuristics of parallel reduction: When the number of steps of the first inner loop beyond the maximum parallel depth is much larger than the number of steps of all outer loops within the maximum parallel depth, change the starting depth of parallelism to the first inner loop and recalculate the maximum parallel depth. I ran the Inductor benchmark with this PR on CPU. A timm model poolformer_m36 BF16 has about 25% performance improvement, and no performance regression is seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149614
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-01 01:31:00 +00:00
Nikita Shulga
c75dac5f5c Fix typo (#150363)
Fixes https://github.com/pytorch/pytorch/issues/150339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150363
Approved by: https://github.com/atalman, https://github.com/kwen2501
2025-03-31 23:58:37 +00:00
PaulZhang12
981048854d Merge Triton ScaledMM as epilogue to MM template (#150045)
Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor.

Currently, there is still a separate template for TMA+persistent version of scaled_mm. The current mm lowering has a separate template for TMA + Persistent version. Will hopefully consolidate the extra scaled_mm TMA+persistent template when the consolidation for the mm template is done.
TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045
Approved by: https://github.com/drisspg
2025-03-31 23:20:14 +00:00
Pian Pawakapan
925fd4aa2e [export] min/max ranges for dim hints (#149590)
Differential Revision: D71522032

Adds min/max ranges to Dim.AUTO/DYNAMIC/STATIC, so users can do `Dim.AUTO(min=2, max=2048)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149590
Approved by: https://github.com/tugsbayasgalan
2025-03-31 21:32:20 +00:00
angelayi
5e34758cef [invoke_subgraph] Support unbacked (#149298)
Differential Revision: [D71420641](https://our.internmc.facebook.com/intern/diff/D71420641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149298
Approved by: https://github.com/zou3519
2025-03-31 17:25:09 +00:00
Pian Pawakapan
284b766898 [dynamic shapes] C++ bindings for guard_or_false/true (#150148)
C++ version. Would like to add it in one place to prove it works, but couldn't find one that doesn't expose a chain of data-dependent changes... so just gonna put up the base implementation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150148
Approved by: https://github.com/laithsakka, https://github.com/jingsh
2025-03-31 17:04:25 +00:00
Prachi Gupta
47cdad2995 [ROCm] Enable several fsdp related UTs (#149369)
Enabling 26 UTs for ROCm in the following files:

-  distributed._shard.sharded_optim.test_sharded_optim - 2 UTs
-  distributed._shard.sharded_tensor.ops.test_binary_cmp - 4 UTs
-  distributed._shard.sharded_tensor.ops.test_init - 3 UTs
-  distributed._shard.sharded_tensor.ops.test_embedding - 2 UTs
-  distributed._shard.sharded_tensor.ops.test_embedding_bag - 2 UTs
-  distributed._composable.test_replicate_with_compiler - 4 UTs
-  distributed._composable.fsdp.test_fully_shard_grad_scaler - 1 UTs
-  distributed.tensor.test_attention - 4 UTs
-  distributed.tensor.test_matrix_ops - 1 UTs
-  distributed.tensor.test_tensor_ops - 1 UTs
-  distributed.fsdp.test_fsdp_grad_acc - 2 UTs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149369
Approved by: https://github.com/jeffdaily
2025-03-31 16:15:57 +00:00
PyTorch MergeBot
7c858066ae Revert "Enable TMA persistent GEMM Template by default (#149427)"
This reverts commit b8ef642f04.

Reverted https://github.com/pytorch/pytorch/pull/149427 on behalf of https://github.com/clee2000 due to failing tests internally D72116141 ([comment](https://github.com/pytorch/pytorch/pull/149427#issuecomment-2766672200))
2025-03-31 15:58:34 +00:00
PyTorch MergeBot
57fa99c5c3 Revert "enable out variant of 2-shot reduction (#150153)"
This reverts commit cdeb32d2d1.

Reverted https://github.com/pytorch/pytorch/pull/150153 on behalf of https://github.com/clee2000 due to failing internal builds D72083877 ([comment](https://github.com/pytorch/pytorch/pull/150153#issuecomment-2766633712))
2025-03-31 15:43:24 +00:00