Commit Graph

46836 Commits

Author SHA1 Message Date
Sam Larsen
73c8068cf8 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-11 19:38:40 +00:00
Jane Xu
971606befa Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t)` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use unint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
2025-03-11 19:12:46 +00:00
Guilherme Leobas
daff65d671 Correctly propagate exception to parent tx (#146502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502
Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #146504, #146499
2025-03-11 18:55:45 +00:00
Guilherme Leobas
fb53e9e514 Add __context/cause/suppress_context/traceback__ to Exception (#146499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499
Approved by: https://github.com/zou3519, https://github.com/anijain2305
ghstack dependencies: #146504
2025-03-11 18:55:45 +00:00
Guilherme Leobas
4e7d264cf8 Introduce UserDefinedExceptionClassVariable (#146504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504
Approved by: https://github.com/anijain2305
2025-03-11 18:55:45 +00:00
Jason Ansel
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00
PyTorch MergeBot
c916a8efc5 Revert "Use the device interface for detecting Triton availability (#139171)"
This reverts commit 940b60db97.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))
2025-03-11 18:49:21 +00:00
drisspg
57ee821a41 fix dynamo ide (#148849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148849
Approved by: https://github.com/bobrenjc93
2025-03-11 18:43:30 +00:00
Ke Wen
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
Nikita Shulga
b366f33606 [MPSInductor] Prep for mutlistage reductions (#148969)
----

- Move reduction variable initialization from `loads` to  `indexing_code`
- Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops do  barriers explicitly inside the respective reduction functions)
- Use `self.compute` instead of `self.body` for all compute operations

Checked that number of before/after failures stays at `164 failed, 616 passed, 53 skipped`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969
Approved by: https://github.com/dcci
2025-03-11 18:35:23 +00:00
Nichols A. Romero
dcc502f376 [ROCm][TunableOp] Add bias data type to params signature. (#146227)
Add bias vector data type in TunableOp params signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227
Approved by: https://github.com/jeffdaily
2025-03-11 18:31:22 +00:00
Chien-Chin Huang
52acc1f955 [DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/140898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918
Approved by: https://github.com/fduwjj, https://github.com/mori360
ghstack dependencies: #148825
2025-03-11 18:24:12 +00:00
Jason Ansel
09029010e5 [inductor] Fix create_specialize_impl error in latest Triton (#148933)
```py
$ python test/inductor/test_triton_kernels.py KernelTests.test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1
WARNING:torch._dynamo:Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 715, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 289, in generate_ttir
    specialization = _get_specialization(ordered_args.values())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 262, in _get_specialization
    specialize_impl = triton.runtime.jit.create_specialize_impl()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_specialize_impl() missing 1 required positional argument: 'specialize_extra'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148933
Approved by: https://github.com/yanboliang, https://github.com/davidberard98
2025-03-11 15:54:47 +00:00
Animesh Jain
f1787ee0f7 [dynamo] Remove L scoping for recompilation messages (#148917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148917
Approved by: https://github.com/williamwen42
2025-03-11 14:26:26 +00:00
Animesh Jain
992838e702 [dynamo][guards] Do not ID_MATCH on numpy tensors (#148923)
Might help with https://github.com/pytorch/pytorch/issues/148535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148923
Approved by: https://github.com/jansel
2025-03-11 14:20:26 +00:00
David Berard
9ad64ce795 [triton 3.3] Forward-fix mm template selection logic (#148924)
Follow-up from https://github.com/pytorch/pytorch/pull/148662.

The logic from https://github.com/pytorch/pytorch/pull/148662 is incorrect; what we want is "choose the second template 'AMD-specific template' only if we're on hip AND triton version < 3.3" - negating it, the code should be "choose the cirst template if we're NOT on hip OR triton version >= 3.3".

Tested locally to verify that it fixes the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148924
Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/eellison
2025-03-11 09:05:44 +00:00
eellison
2bcc3acb90 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2025-03-11 08:02:30 +00:00
Gabriel Ferns
41e4728f74 update types on dynamo configs (#146873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146873
Approved by: https://github.com/williamwen42
2025-03-11 05:33:48 +00:00
Gabriel Ferns
1fcc4bc109 Don't look at TESTING_ONLY in fuzzer (#146870)
Lots of configs aren't meant to be set because they're testing only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146870
Approved by: https://github.com/masnesral
2025-03-11 05:32:25 +00:00
Bin Bao
ecfbfe1603 [AOTI] Remove aoti_torch_cpu__weight_int4pack_mm_cpu_tensor (#148907)
Summary: shim.h is only meant for generic tensor util shim functions. We should switch to use the auto fallback generation, but it will need some extra care on the op schema.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148907
Approved by: https://github.com/janeyx99
2025-03-11 04:41:05 +00:00
George White
940b60db97 Use the device interface for detecting Triton availability (#139171)
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.

This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
2025-03-11 03:56:11 +00:00
Brian Hirsh
621dadd4ca partitioner: when materializing unbacked tensor intermediates, apply hint to symbol, not expr (#144097)
Fixes https://github.com/pytorch/pytorch/issues/144095

open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because:

(1) we use the same guess for every unbacked symint (both symbols, and compound expressions)
(2) the user may have established some relationship between some unbacked symints that we are not taking into account.

I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal?

Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op, that had some logic like this:
```
def custom_op(x: [u0], y: [u0 + 1]):
    assert x.shape[0] = y.shape[0] - 1
    ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097
Approved by: https://github.com/laithsakka
2025-03-11 02:11:57 +00:00
Simon Fan
457ff9b7ae [reland][ca] side-effect free inital trace: compiled_args (#148376)
This reverts commit ea12fc8a9f.
Reland https://github.com/pytorch/pytorch/pull/147804, there was a bad import inserted by my linter.

Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376
Approved by: https://github.com/jansel
2025-03-11 01:57:36 +00:00
Shuai Yang
9fddbf3417 Update the comment (#148726)
Differential Revision: D70747931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148726
Approved by: https://github.com/yf225
2025-03-11 01:19:14 +00:00
bobrenjc93
c297c09a37 Fix invalid nested int guarding in broadcast_shapes() (#145957)
Fixes #145874

This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially.

Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-03-11 00:53:13 +00:00
cyy
295f2ed4d1 Fix "invalid application of 'sizeof' to an incomplete type" (#148854)
Fixes with C++23 and constexpr std::unique_ptr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148854
Approved by: https://github.com/Skylion007
2025-03-11 00:40:00 +00:00
drisspg
b215841ebb [MM] Add sm carevout to lowerings (#148793)
# Summary

See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using
the following to verify, need to figure out how to do proper guarding

This does  do the correct thing if we compile w/ sm carvout already set but since we dont guard on it just yet we dont recompile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793
Approved by: https://github.com/lw, https://github.com/eellison
2025-03-10 23:49:26 +00:00
Brian Hirsh
492f3fd5cf replace usages of upload_graph in inductor with tlparse (v2) (#148720)
Reland of https://github.com/pytorch/pytorch/pull/148703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720
Approved by: https://github.com/mengluy0125
2025-03-10 22:47:58 +00:00
PyTorch MergeBot
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c684.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
Qiaochu Yuan
12a95390ae [Minimizer] allow overriding of ShapeProp logic by subclasses of _MinimizerBase (#148784)
Summary:
The changes contained in this diff
- allow subclass Minimizer implementations to override the default shape propagation logic with custom logic
- copies over the meta attribute on get_attr graph nodes during the graph splitting step
- for both changes, behavior for existing classes do not change

Test Plan: CI

Differential Revision: D70799942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784
Approved by: https://github.com/blaine-rister
2025-03-10 22:22:16 +00:00
Jane Xu
fcb633fafa Introduce TORCH_ABI_VERSION and a runtime aoti_torch_abi_version C shim ABI (#148892)
Importable https://github.com/pytorch/pytorch/pull/148836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148892
Approved by: https://github.com/albanD
2025-03-10 22:22:10 +00:00
Boyuan Feng
98b3f1db9f [Flex Attention] support num_heads > 1 in block_mask (#148857)
Previously flex decoding errors when block mask has num_heads > 1. So users have to use num_heads=1, or explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`.

This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block mask with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as flex attention kernel.

When using GQA (i.e., Hq != Hkv), we support block mask with num_heads = 1. When num_heads = Hq, we fall back to flex attention kernel so user don't need to explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore.

Why fallback? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA requires support different kv num blocks for different query heads in the same thread block, leading to lots of redundant workload. So we should better use the main flex_attention kernel where each query head is handled by a separate block.

Fixes #148527
Fixes #147267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857
Approved by: https://github.com/drisspg
2025-03-10 22:02:50 +00:00
Mandar Deshpande
6ef15c7f46 [pytorch] Update flexattention bwd config generation (#148600)
Summary: Currently `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `'bwd_config` returned at the end.

Test Plan: CI. Existing tests.

Differential Revision: D70649316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600
Approved by: https://github.com/Skylion007
2025-03-10 22:00:56 +00:00
Chien-Chin Huang
ed969d1236 [DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used (#148825)
Summary:
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148825
Approved by: https://github.com/fduwjj, https://github.com/mori360
2025-03-10 20:04:36 +00:00
Tristan Rice
494abeff8a CUDACachingAllocator,c10d: fixes for IPC release performance (#148805)
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.

1. release the IpcMutex when deleting the `ExpandableSegements` object to avoid synchronizing under the lock
2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there

Test plan:

Run with torchft + torchtitan

```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_par
allel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096

...

[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61  loss:  7.4825  memory: 79.73GiB(83.89%)  tps: 317  tflops: 16.34  mfu: 1.65%
```

Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors

![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d)
![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
2025-03-10 19:47:04 +00:00
clr
6b0fd741d1 dynamo: Count number of opcodes processes (#147149)
This gives us a decent proxy for how big of a graph we functionally had to parse.

Note that this is a cummulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
2025-03-10 19:20:09 +00:00
Wanchao Liang
3129faf8be Optimize shard_dim_alltoall to use alltoall_single (#148868)
as titled, previously the shard_dim_alltoall uses `all_to_all`, which essentially could incur lots of copies if the tensor become non-contiguous during splits, and alltoall itself also incur copies

This PR uses alltoall_single instead, so that we could minimize tensor copies.

tested on all the shard dim change tests and it works properly:
```
pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868
Approved by: https://github.com/tianyu-l
2025-03-10 18:38:12 +00:00
Benjamin Glass
ed7e964f2b codecache.py: use str.format rather than % formatting (#148691)
Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691
Approved by: https://github.com/desertfire
2025-03-10 18:33:58 +00:00
Han, Xu
00cabd4235 [Inductor][Windows] add env_var switch to turn all Windows inductor UTs. (#148733)
For timeout reason, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927
And without the UTs, we can't ensure Windows inductor quality.

Intel team will do some local test for Windows inductor, but we still need to add a switch to turn on the full Windows inductor UTs.

The switch is an environment variable:
```cmd
set TORCHINDUCTOR_WINDOWS_TESTS=1
```
After setup this environment variable, we can turn on all Windows inductor UTs, It will not affect to PyTorch CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-03-10 18:25:29 +00:00
eellison
4c13a859e5 Workaround no triton float8_e8m0fnu support in inductor (#148722)
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting, and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.

The one question i'm not sure of is whether or not we need to explicitly disable triton template fusion since it would fuse in these dtypes as uint8..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
2025-03-10 17:37:39 +00:00
PyTorch MergeBot
ebd087e4b5 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit f08146b67b.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))
2025-03-10 17:19:21 +00:00
PyTorch MergeBot
2ec9aceaeb Revert "Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)"
This reverts commit 3680e666d8.

Reverted https://github.com/pytorch/pytorch/pull/148834 on behalf of https://github.com/janeyx99 due to sorry I don't think I want this PR in before the branch cut, as it'd freeze the API in the file when it should really be in a different header ([comment](https://github.com/pytorch/pytorch/pull/148834#issuecomment-2711162193))
2025-03-10 16:29:40 +00:00
Jason Ansel
a60b4ed623 [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148288
2025-03-10 16:06:19 +00:00
Jason Ansel
8f858e226b [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261
2025-03-10 16:06:19 +00:00
Jason Ansel
5d4e7d58b4 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-10 16:06:11 +00:00
Jason Ansel
bf752c36da [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-10 16:06:02 +00:00
Jason Ansel
bec7bdad47 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-10 16:05:53 +00:00
Kalpit Munot
31625b08b8 Add ccode for FloorDiv (#148727)
Summary: Add ccode for FloorDiv

Test Plan: CIs

Differential Revision: D70749021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148727
Approved by: https://github.com/bobrenjc93
2025-03-10 14:00:18 +00:00
albanD
68c12ecfe2 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-10 13:17:58 +00:00
Xuehai Pan
098494e9cb [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-10 13:14:05 +00:00
Xinyuan Zhao
59f14d19ae Implement gradient for the residuals of torch.linalg.lstsq (#148526)
Fixes #147543.

I have written some tests in python using `gradcheck`. Please advise where I should put these tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526
Approved by: https://github.com/lezcano
2025-03-10 12:35:09 +00:00
Francisco Massa
ea86b8d315 Fix redistribution cost for all-reduce (#148761)
This issue seems to have been introduced in https://github.com/pytorch/pytorch/pull/119897. With the current implementation, it might be more favorable to perform a reduce_scatter followed by an all-gather than simply an all-reduce.

Thanks @lw for the helpful discussions on getting this PR out!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148761
Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/tianyu-l, https://github.com/fegin
2025-03-10 12:13:11 +00:00
Mwiza Kunda
00199acdb8 [inductor][triton] Block ptr analysis fix assert on matched index expression (#148446)
If dynamic shapes are enabled, then block analysis may create new precomputed size replacements from the index which can lead to an assertion failure when the matched index is compared with the original index. For example the below assertion fails, despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply removing the replacements when the expressions are tested to be equal - the latter option is implemented in this PR.

```
       torch._inductor.exc.InductorError: AssertionError:
E       Invalid match!
E       Index: 3*ps0*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E       Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E
```

This PR fixes the test below when `config.triton.use_block_ptr=True`:
```
python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446
Approved by: https://github.com/jansel
2025-03-10 05:26:55 +00:00
Jane Xu
3680e666d8 Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)
I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++.

Following the thread there I am thinking there are two possible solutions:
(1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut
(2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do in case there's something I haven't considered, so this PR does option 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834
Approved by: https://github.com/desertfire
2025-03-10 03:23:48 +00:00
henrylhtsang
7ae0ce6360 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:36 +00:00
henrylhtsang
b47d81682d [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:24 +00:00
PyTorch MergeBot
275a7c5dbb Revert "Add a stable TORCH_LIBRARY to C shim (#148124)"
This reverts commit 327e07ac1d.

Reverted https://github.com/pytorch/pytorch/pull/148124 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but somehow it caused test failures in newly introduced tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pull%20%2F%20linux-focal-cuda12.6-py3.10-gcc11-sm89%20%2F%20test%20(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148124#issuecomment-2709057833))
2025-03-09 20:44:56 +00:00
PyTorch MergeBot
19a39a7a06 Revert "[dynamo] allow global import from collections import deque in user code (#148676)"
This reverts commit 685fb37713.

Reverted https://github.com/pytorch/pytorch/pull/148676 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see f1444f006c/1(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148676#issuecomment-2709057326))
2025-03-09 20:42:03 +00:00
Zhenghao Hu
f1444f006c [caffe2/torch] Fixup upstream LLVM (major version 21) API changes (#148833)
Latest LLVM introduced two changes related to the `Triple` usage that causes build failures when building pytorch.

## Failure in llvm_codegen.cpp:
Triple is stored in Modules instead of the string: 979c275097

## Failure in llvm_jit.cpp:
Triple argument is removed from LLJITBuilder::... : b18e5b6a36

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833
Approved by: https://github.com/Skylion007
2025-03-09 18:58:36 +00:00
Aditya Tiwari
bb9c426024 Typo Errors fixed in multiple files (#148262)
# Fix typo errors across PyTorch codebase

This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.

## Changes Made

### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
  - `setup.py`: Build system documentation
  - `torch/_library/triton.py`: AOT compilation comments
  - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
  - `torch/export/_unlift.py`: Pass population comments
  - `torch/export/exported_program.py`: Decomposition table notes

### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
  - `test/mobile/test_lite_script_module.py`: Exception handling comments
  - `torch/export/_draft_export.py`: Error message text
  - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
  - `torch/csrc/utils/python_numbers.h`: Overflow handling comment
  - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
  - `torch/_dynamo/symbolic_convert.py`: Error explanation

### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
  - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
  - `torch/distributed/distributed_c10d.py`

## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.

## Test Plan
No testing required as these changes only affect comments and documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-09 12:21:40 +00:00
Jane Xu
327e07ac1d Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t)` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use unint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-03-09 10:07:25 +00:00
Xuehai Pan
685fb37713 [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-09 09:35:29 +00:00
William Wen
6566d67bd3 [dynamo] show stack above dynamo in graph break user tracebacks (#148401)
Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+).

Also collapses resume frames together (use config.verbose to see full stack trace - for developers).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-03-09 07:37:38 +00:00
Ke Wen
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
PyTorch MergeBot
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
Ke Wen
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
Nino Risteski
5245304f1e Update decompositions_for_jvp.py (#148821)
small typo thing that got my eye

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148821
Approved by: https://github.com/Skylion007
2025-03-08 19:08:42 +00:00
Tristan Rice
7ffadff286 c10d/ProcessGroup: cleanup abort and shutdown (#148798)
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.

This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation.

Test plan:

```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
2025-03-08 18:33:18 +00:00
Sanket Purandare
9841f0ddcf Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566)
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.

It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do.

For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`.
Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
2025-03-08 18:00:49 +00:00
David Berard
c3b05c4a27 [triton 3.3] support both specialize_impl and create_specialize_impl (#148806)
After https://github.com/triton-lang/triton/pull/6099, we sometimes need to do `from triton.runtime.jit import specialize impl` and sometimes do `triton.runtime.jit.create_specialize_impl()`. This should fix a bunch of the new errors that appeared with the triton 3.3 / pytorch 2.7 integration (e.g. `python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_float_arg_dynamic_False_cuda`, failing at https://hud.pytorch.org/pr/pytorch/pytorch/148684#38392501220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148806
Approved by: https://github.com/drisspg
2025-03-08 09:31:52 +00:00
Zhuoran Zhao
3745da18f4 [AOTI] Swith to local cpp compile for fbcode (#148592)
Summary: as title, otherwise we can not find lamdhip64

Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431

Differential Revision: D70637798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592
Approved by: https://github.com/hl475
2025-03-08 08:38:26 +00:00
Simon Fan
666508eb17 [aot cache][ca] remove restriction on caching ca's aot inference graph (#148491)
but still can't cache CA's aot inference graph yet: the CA functional ops aren't serializable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148381
2025-03-08 06:08:26 +00:00
Simon Fan
c16cd25cf5 [ca] remove compiled_autograd_tracing (#148381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148381
Approved by: https://github.com/jansel
2025-03-08 06:08:26 +00:00
cyy
f7c0c230b0 Fix compile errors (#148758)
Fix
```
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
     91 |         static_assert(sizeof(_Tp)>0,
        |                       ^~~~~~~~~~~
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
    399 |           get_deleter()(std::move(__ptr));
        |           ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
    200 | AliasDb::~AliasDb() = default;
        |          ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
    200 | AliasDb::~AliasDb() = default;
        |                       ^
  ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
    298 |   struct WriteRegistry;
        |          ^
  1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
2025-03-08 04:56:42 +00:00
riccardofelluga
8f71d4563e Fix rms_norm in fp16/bf16 (#147203)
Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done.

Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want tho is to avoid down-casting and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203
Approved by: https://github.com/eqy
2025-03-08 04:43:18 +00:00
Joel Schlosser
85467ed063 Fix for AOTI + CUDAGraphs when calling from Python (#148601)
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.

**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
2025-03-08 02:44:14 +00:00
Sampsa
9f170d9d13 [Triton 3.3] Remove ROCm specific mm gemm template (#148662)
Fixes: https://github.com/pytorch/pytorch/issues/147121
Since triton 3.3.x fixes the problem

Needs to be handled in none BC breaking way, so we will conditionalise this change on triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148662
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2025-03-08 01:24:40 +00:00
drisspg
a89e7c2da9 [Upstream] Wrap log_2_e in tl.constexpr for new 3.3 bump (#148785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148785
Approved by: https://github.com/davidberard98
2025-03-08 01:09:28 +00:00
Lukas Pfahler
179b7a0abc Do not crash when compiling quantized LORA models (#148435)
Fixes #148072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148435
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel
2025-03-08 00:02:08 +00:00
Gabriel Ferns
24085db082 Don't clear feedback_saver_fns after cache clear (#148723)
Summary:
Since feedback_saver_fns are used for logging, I don't think it makes sense to clear them, and this resulted in weird behavior in user code where disabling caches caused logging code to break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148723
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-03-07 23:43:59 +00:00
Justin Chu
d96c85558a [ONNX] Use torch export to get dynamic shapes for JIT convert strategy (#148627)
Use torch export to get dynamic shapes for JIT converted graph. I just realized we can retrace a converted jit graph with `torch.export` and produce dynamic shapes using `torch.export`.

-	**Prior:** The exporter will produce a **static graph silently** even when dynamic_shapes are provided.
-	**Proposed:** When `dynamic_shapes` is provided and when the strategy is able to handle it, it will succeed

## Why are we still keeping the JIT strategy?

It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. Sometimes also useful when there are JIT scripted modules in the nn module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627
Approved by: https://github.com/titaiwangms
2025-03-07 23:41:50 +00:00
Sam Larsen
187d5c0eb1 [logging] Log cudagraphify timings to dynamo_timed (#143220)
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz

`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
2025-03-07 23:07:13 +00:00
iupaikov-amd
f2dfe2d99c [Triton 3.3] [ROCm] Enabled split_scan support for ROCm builds (#147619)
Fixes issue https://github.com/pytorch/pytorch/issues/133228

Enabled split_scan support for ROCm builds.

Must be handled in a non BC breaking way so this functionality is enabled conditionalised on triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147619
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: David Berard <davidberard98@gmail.com>
2025-03-07 23:06:21 +00:00
PyTorch MergeBot
0f852641c2 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit d35a4ddae2.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/henrylhtsang due to mistakes when writing the tests ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2707637965))
2025-03-07 22:42:13 +00:00
David Berard
755965d2e4 [inductor] fix matmul w/ torch.bucketize epilogue (#148769)
See https://github.com/pytorch/pytorch/issues/148764.

Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist.

As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape.

This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it.

Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769
Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi
2025-03-07 22:34:13 +00:00
Xinya Zhang
67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
Nichols A. Romero
08baaa7d63 [Docs][TunableOp] TunableOp documentation update (#148384)
This PR aligns documentation to what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-03-07 21:02:49 +00:00
PyTorch MergeBot
bb94b65da7 Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 2fb654676f.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2707440106))
2025-03-07 20:58:28 +00:00
Nikita Shulga
6602e632cd Suppress build warnings when gcc-11 is used (#148763)
By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")`
that will suppress following (when building against ancient llvm-9)
```
In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24:
/opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type*, llvm::Value*, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]':
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
 1581 |     return Insert(new LoadInst(Ty, Ptr), Name);
      |                   ^~~~~~~~~~~~~~~~~~~~~
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void* llvm::UnaryInstruction::operator new(size_t)'
```

Probably a reasonable followup will be to disable NNC testing all-together, as project has been in a maintenance mode for a while now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman
ghstack dependencies: #148739
2025-03-07 20:43:35 +00:00
Justin Chu
d36391307f [ONNX] Handle error in verification interpreter (#148730)
Use a simple try catch to handle onnx runtime errors in the verification interpreter when that happens. One example is ort will sometimes produce a list of None for some nodes. I am not sure how that happens yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730
Approved by: https://github.com/titaiwangms
ghstack dependencies: #148706
2025-03-07 20:24:49 +00:00
Xuehai Pan
aebd2e411f [pytree][easy] lock global registry containers properly for thread-safety (#148750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148750
Approved by: https://github.com/StrongerXi
2025-03-07 20:04:52 +00:00
bobrenjc93
6b44a91a62 use statically_known_true instead of guard_size_oblivious in pattern matcher (#147557)
We shouldn't add guards here. Use statically_known_true instead. Internal xref: https://fb.workplace.com/groups/1075192433118967/?multi_permalinks=1609560723015466&comment_id=1610040026300869&notif_id=1740082892544333&notif_t=work_feedback_reaction_generic&ref=notif

Differential Revision: [D69950122](https://our.internmc.facebook.com/intern/diff/D69950122/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147557
Approved by: https://github.com/eellison
2025-03-07 19:17:25 +00:00
PyTorch MergeBot
b246cd7b82 Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 17302b4bc8.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))
2025-03-07 18:59:58 +00:00
eqy
18c6e00c7b [CUDA Graphs][NCCL] Set event queries to happen under thread-local mode in ProcessGroupNCCL.cpp (#148594)
Should mean we don't need to coordinate the watchdog with CUDAGraph captures anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148594
Approved by: https://github.com/kwen2501
2025-03-07 18:39:02 +00:00
Jack Taylor
8059ead823 [ROCm] Incorporate ROCm triton specific tuning parameters (#148437)
Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above.

A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437
Approved by: https://github.com/eellison, https://github.com/jansel
2025-03-07 18:09:47 +00:00
Aaron Orenstein
a3b77d434a Subprocess compile (attempt 2) (#148635)
Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test based which runs the test_torchinductor tests with subprocess compiling turned on.

Fixed the test which caused the previous version (#146134) to be reverted:
```
$ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635
Approved by: https://github.com/jamesjwu
2025-03-07 17:50:14 +00:00
xinan.lin
50c9f6d83b [Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323)
In `fresh_inductor_cache` remove pyd files will raise permission error
on Windows because they are still used by the process.
So we clear the references to the loaded pyd libray obj and unload them
from the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323
Approved by: https://github.com/jansel
ghstack dependencies: #148534, #148538, #147727
2025-03-07 17:19:59 +00:00
Saurabh Mishra
136b8165d1 [DCP] Save Plan Caching: Fix the missing all_plans update in the cache. (#148577)
Summary: Save Plan Caching: Fix the missing all_plans update in the cache.

Test Plan:
```
buck2 test //aiplatform/modelstore/experimental/integration_tests/tests/nosan:checkpoint_dist_save_load_test
```

https://www.internalfb.com/intern/testinfra/testrun/17451448626323264

Reviewed By: MeetVadakkanchery

Differential Revision: D70229019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148577
Approved by: https://github.com/MeetVadakkanchery
2025-03-07 17:00:59 +00:00
PyTorch MergeBot
abcca2fcbb Revert "Fix torch.nn.functional.hardswish gradients corner case (#148049)"
This reverts commit 29b28e9d9f.

Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))
2025-03-07 16:05:56 +00:00
albanD
17302b4bc8 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-07 15:19:34 +00:00
Anant Gulati
372ad7b181 Enable FSDP2 on HPU device (#148667)
The motivation of this PR is to enable FSDP2 collectives for HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148667
Approved by: https://github.com/wconstab
2025-03-07 14:33:43 +00:00
Luca Wehrstedt
f80aad62fa Improve Pareto frontier plot for AutoAC (#148678)
This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values.

However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range thus missed many threshold values) and in distributed settings.

Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points), plus I add the rank to the filename and store it in a user-define directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678
Approved by: https://github.com/Chillee, https://github.com/fmassa
2025-03-07 13:22:29 +00:00
Avik Chaudhuri
6cf360be04 fix lost input mutations with export_tracepoint (#148709)
Preserving module call signatures in the presence of input mutation cause incorrect results. The root cause turned out to be that export tracepoints would unwrap / wrap functional args that would lose mutation info on those args.

Differential Revision: [D70734821](https://our.internmc.facebook.com/intern/diff/D70734821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148709
Approved by: https://github.com/angelayi
2025-03-07 09:36:18 +00:00
FFFrog
416ea1c71c Code Clean: Remove unnecessary code (#148735)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148735
Approved by: https://github.com/jingsh, https://github.com/cyyever
2025-03-07 08:15:37 +00:00
Rachel Guo
3f069e7679 [mm_logs] enhance the printing for overview info (#148716)
Summary:
previously the dynamo counters does not print the counts information automatically.

explicitly added a log msg to print after lowering for overview info for inductor aten mms

it will look like:

the name is in `{aten_op_name}_{m}_{n}_{k}`
```
torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx
```

 {F1975874802}

Test Plan:
```
TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D70739912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716
Approved by: https://github.com/henrylhtsang
2025-03-07 05:23:49 +00:00
Wei Feng
c0f1557285 [FSDP2][doc] highlight equivalence of set_requires_gradient_sync and no_sync (#148715)
we got asked a few times about FSDP2's equivalence of no_sync. highlight
set_requires_gradient_sync as the equivalence in docstring

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148715
Approved by: https://github.com/mori360
2025-03-07 04:34:46 +00:00
Nitin Singh
fe4b88f6aa [HPU] Add hpu to fused kernels supported devices (#148666)
This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring
compatibility with HPU backend.

Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with:

RuntimeError: fused=True requires all the params to be floating point Tensors
of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone']
but torch.float32 and hpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666
Approved by: https://github.com/albanD
2025-03-07 04:28:33 +00:00
Nichols A. Romero
33f8ab2f58 [ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238)
This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM.

There are few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
2025-03-07 04:12:48 +00:00
Wei-Sheng Chin
9c9b05bc4f Expose functions used in custom backend in torch_python dll (#148213)
Fixes #148208. There are solutions for exposing symbols implicitly from inline functions (i.e., inline function A calls non-inline function B in foo.h. Code includes foo.h has to see the symbol B in DLL).

Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but will slow down dispatching a lot --- drop inline keyword and move implementation to .cc file.

Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.

Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213
Approved by: https://github.com/malfet
2025-03-07 02:34:37 +00:00
Zhuoran Zhao
dfb4094b9c Skip buffer in dense update (#148533)
Summary:
as title.

PyTorch Module buffer will not be published in delta publishing.  In Quinn's previous diff, constant type annotations have been introduced.

In addition to skip constant, we also need to skip buffer if it is not found in the user-provided delta weights list

Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn

Differential Revision: D69553929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533
Approved by: https://github.com/22quinn, https://github.com/jingsh
2025-03-07 01:59:58 +00:00
ZhiweiYan-96
00cd6c07b9 [Intel GPU][pt2e] Enable quantized grouped convolution at XPU (#148522)
# Motivation&Details
This PR fix a bug that blocked quantized group convolution before. The bug is caused by that, grouped convolution requires setting weight scale mask on both group dimension and output channel dimension. This PR fixs the wrong mask in integration and add grouped conv in UT.

# UT
` python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu`

# Runtime exemplification
```onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785``
The verbose shows that we successfully run into quantized convolution, where weight is `abcde` format(group conv).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel
ghstack dependencies: #148423
2025-03-07 01:57:45 +00:00
Ryan Guo
c8cd8f68bd [dynamo] Properly account for non-list instances in list comparison (#148470)
As title; this patch also removes an unused `list_compare` method.

Fixes #148179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148470
Approved by: https://github.com/anijain2305
2025-03-07 01:29:30 +00:00
eellison
a7fe685be8 Add cpp wrapper skip to cudagraph logs (#148700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148700
Approved by: https://github.com/jbschlosser
2025-03-07 01:02:40 +00:00
Justin Chu
e3087f6d76 [ONNX] Improve verify_onnx_program to use VerificationInterpreter (#148706)
I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users.

I added a `compare_intermediates` option to `verify_onnx_program`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706
Approved by: https://github.com/titaiwangms
2025-03-07 00:40:54 +00:00
Richard Barnes
33a285379a [codemod] Remove unused-variable in caffe2/torch/csrc/distributed/c10d/cuda/AsyncMM.cu (#148501)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
2025-03-07 00:33:39 +00:00
Xilun Wu
e2a0296e80 [dtensor] add CuDNN SDPA op support to DTensor (#148537)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` and `_scaled_dot_product_cudnn_attention_backward` to DTensor ops

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148537
Approved by: https://github.com/drisspg, https://github.com/fegin
2025-03-06 23:44:40 +00:00
Blaine Burton Rister
75d29443e7 [Docs] update bucketize documentaion (#148400)
Fixes #144504

Clarify the documentation for `torch.bucketize` by referencing the existing table. The current version includes a somewhat confusing explanation for the `right` kwarg, whereas the existing table is much clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148400
Approved by: https://github.com/benjaminglass1, https://github.com/eellison, https://github.com/albanD
2025-03-06 22:07:52 +00:00
henrylhtsang
2fb654676f [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:26 +00:00
henrylhtsang
d35a4ddae2 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:19 +00:00
lanzongwei.lan
3d62e81a1e [DCP] fix dcp gather_object/scatter_object_list (#147675)
gather_object/scatter_object_list's dst is `Destination rank on global process group (regardless of group argument)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147675
Approved by: https://github.com/MeetVadakkanchery
2025-03-06 21:20:38 +00:00
Ryan Guo
1d7fc0c681 [dynamo] Remove dead code path around functools.partial objects (#148683)
This removes the code paths added in #98120, which has then been
superceded by #108846.

More importantly, it makes `EQUALS_MATCH`'s `ok_mutable_types` (added in #134016)
easier to reason about, i.e., no need to worry about `dict` types, which
was only needed for #98120.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148683
Approved by: https://github.com/yanboliang
2025-03-06 21:20:04 +00:00
Shunting Zhang
262411e48b [inductor] online softmax (#127011)
Softmax need do some preparation work that access the input tensor in two passes
- compute amax of each row
- compute (x - amax).exp.sum for each row

When the row size is large, cache can not hold all the active data and accessing the input multiple passes increases execution time since the kernel is membw bounded.

Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).

Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54

## Microbenchmark

- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
  - eager_ms=6.671296119689941
  - opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
  - eager_ms=6.634047985076904
  - opt_ms=6.230591773986816

Ideally, online softmax should save about 2ms here. We saves about 1.84ms in practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
zeshengzong
1add61c242 Replace unimplemented with unimplemented_v2' in codegen.py` (#148069)
Fixes #147913

- replace `unimplemented` in `codegen.py`
- remove unused import `unimplemented`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148069
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-03-06 20:42:37 +00:00
Aaron Gokaslan
edd640a95a [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190)
Often makes the code more readable, more efficient, and adds support for infinite iterables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190
Approved by: https://github.com/jansel, https://github.com/malfet
2025-03-06 20:37:06 +00:00
zeshengzong
29b28e9d9f Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable cuda for test `test_hardswish_grad_corner`
- Add test case for value=-3

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-06 19:04:52 +00:00
Xuehai Pan
f08146b67b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class.

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-06 18:59:02 +00:00
PyTorch MergeBot
96176e32a9 Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433)"
This reverts commit 8af79b7ec8.

Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))
2025-03-06 18:32:48 +00:00
Ke Wen
d91a634edf [c10d] Make getDefaultBackend more fault tolerant (#148596)
This is a forward fix for #135338.
It hits error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
    if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```

When users call `init_process_group(nothing)`, default backend is not set, or set to `undefined`. Thus the above signature. Triggered by the `_has_hooks()` call.

The fix wraps `getDefaultBackend` with a try-catch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
2025-03-06 18:07:43 +00:00
PyTorch MergeBot
28b68b46bc Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 4aeca28137.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2704534995))
2025-03-06 17:45:49 +00:00
PyTorch MergeBot
841451af9f Revert "[Inductor] Avoid tensor slice overflow for large step (#147433)"
This reverts commit 1d7397a2d0.

Reverted https://github.com/pytorch/pytorch/pull/147433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/147433#issuecomment-2704506627))
2025-03-06 17:33:08 +00:00
Rachel Guo
679e7d257e [mm_logs] follow up to add count info based on shape for inductor aten.mms (#148623)
Summary:
as title.

when enable `TORCH_LOGS="+inductor"`, you can get logs at the end such as

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), **(('aten_addmm', (16, 6, 16)), 1)**, ('extern_calls', 1), ('async_compile_cache_miss', 1)]
graph_break []

Test Plan: follow up to add proper logging test.

Differential Revision: D70665104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623
Approved by: https://github.com/henrylhtsang
2025-03-06 16:20:04 +00:00
Benjamin Glass
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
Mikayla Gawarecki
d5184901c4 Make torch.serialization.skip_data work with torch.load (#148018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148018
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787, #147788
2025-03-06 12:04:46 +00:00
Mikayla Gawarecki
be0ceee1c3 Make record/storage alignment in torch.save configurable (#147788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787
2025-03-06 12:04:46 +00:00
Mikayla Gawarecki
209977e6e5 Add information about checkpoint offset to untyped storages when torch.load under FakeTensorMode (#147787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147787
Approved by: https://github.com/albanD
ghstack dependencies: #147786
2025-03-06 12:04:39 +00:00
Mikayla Gawarecki
bdcc1b579b Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786)
This only fixes _rebuild_tensor_v2 and _rebuild_tensor_v3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147786
Approved by: https://github.com/albanD
2025-03-06 12:04:32 +00:00
rzou
79aa17489c [dynamo] ctx_manager.py: replace unimplemented with unimplemented_v2 (#148570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148570
Approved by: https://github.com/williamwen42
ghstack dependencies: #148454
2025-03-06 07:46:31 +00:00
titaiwangms
e7bc1d1791 [ONNX] Update saved exported program in debugging report if the exporting passes run_decomposition() (#148617)
Previous to this PR, if the exporting passes run_decomposition(), the report still shows the exported_program before decomposition, which adds the difficulties to our users when they want to check the exported program that are used to translate to ONNX graph.

The following example is what we see before this PR:

```
# PyTorch ONNX Conversion Report

```
 Obtain model graph with `torch.export.export(..., strict=False)`
 Obtain model graph with `torch.export.export(..., strict=True)`
 Obtain model graph with `torch.jit.trace`
 Decompose operators for ONNX compatibility
 Translate the graph into ONNX
 Run `onnx.checker` on the ONNX model
 Execute the model with ONNX Runtime
 Validate model output accuracy
```

## Error messages

```pytb

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph
    _handle_call_function_node_with_lowering(

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering
    raise _errors.DispatchError(

torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. Failure message: No decompositions registered for the complex-valued input

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export
    onnx_program = _exported_program_to_onnx_program(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program
    values = _translate_fx_graph(
             ^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph
    raise _errors.ConversionError(

torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information.

```

## Exported program

```python
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[3, 4]"):
             # File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64)
            to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64);  x = None

             # File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2]
            slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807);  to = None
            slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2);  slice_1 = None
            return (slice_2,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)])
Range constraints: {}

```

## Analysis

PyTorch ONNX Conversion Analysis

## Model Information

The model has 0 parameters and 0 buffers (non-trainable parameters).
Number of parameters per dtype:
```python
defaultdict(<class 'int'>, {})
```
Number of buffers per dtype:
```python
defaultdict(<class 'int'>, {})
```

Inputs:
- `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})`

Outputs:
- `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})`

The FX graph has 5 nodes in total. Number of FX nodes per op:
- `placeholder`: 1
- `call_function`: 3
- `output`: 1

Of the call_function nodes, the counts of operators used are:

- `aten.slice.Tensor`: 2
- `aten.to.dtype`: 1

## ONNX Conversion Information

The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher.

Errors grouped by operator:

- `aten.to.dtype`:     No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]`
- `aten.slice.Tensor`:     No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]`

## Decomposition comparison

Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']`

Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']`

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617
Approved by: https://github.com/justinchuby
2025-03-06 07:03:45 +00:00
PyTorch MergeBot
ae6bb58483 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit ad49cfc9f0.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/davidberard98 due to broke lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/13690720601/job/38283359447) [HUD commit link](ad49cfc9f0) ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2702980028))
2025-03-06 06:59:39 +00:00
titaiwangms
f057206fca [ONNX] Support complex comparison when verify=True (#148619)
Previously, the comparison of complex numbers was not supported when `verify=True`.

NOTE: This PR can be extended to support more complex comparison cases if there are other places in onnx codebase needed to be changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619
Approved by: https://github.com/justinchuby
2025-03-06 04:38:43 +00:00
bobrenjc93
8b65d522e1 refactor delayed compile to use code context (#148530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148530
Approved by: https://github.com/williamwen42
ghstack dependencies: #148509
2025-03-06 04:02:30 +00:00
henrylhtsang
ad49cfc9f0 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 03:42:55 +00:00
James Wu
8728d4b815 Clear triton kernels after parent make_launcher (#148604)
Before, we were clearing the cache only after inductor compile. But inductor may not **always** compile, i.e. on AOTAutogradCache hit.

So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856

Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604
Approved by: https://github.com/masnesral
2025-03-06 03:28:38 +00:00
maybeLee
43e1284c96 Fix empty matrix handling of addmv in inductor (#143792)
This is a resubmission of my previous PR that I accidentally deleted, apology in advance if any inconvenience caused. Below are details of this PR.

Fix an issue when torch.addmv behaves inconsistent between torch.compile mode and eager mode. Here is the code to reproduce:

```
import torch
import numpy as np

@torch.compile
def test_optimized(input, mat, vec):
    return torch.addmv(input, mat, vec)

def test(input, mat, vec):
    return torch.addmv(input, mat, vec)

input = torch.tensor([2], dtype=torch.int32)
mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32)
vec = torch.tensor([])
origin_out = test(input, mat, vec)
optimized_out = test_optimized(input, mat, vec)
print(origin_out)  # tensor([2.])
print(optimized_out)  # tensor([])
```

According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when matrix and vector is empty, returning `[2.]` seems more reasonable to me.

Following the cpu implementation of this API:e97b97af56/aten/src/ATen/native/Blas.cpp (L62)

I add an additional branch to handle empty matrix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 02:09:27 +00:00
Pat Vignola
38b3375a81 [MTIA] Use "ieee" instead of "tf32" for MTIA's default precision in FlexAttention (#148565)
Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD.

Test Plan: CI

Reviewed By: mortzur

Differential Revision: D70072064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565
Approved by: https://github.com/mortzur
2025-03-06 02:07:18 +00:00
Ruben Rodriguez Buchillon
32715a2311 [inductor][ck] add kBatch_sweep to config.rocm (#148223)
Summary:
# Why

enable testing and users to specify a set of kBatches to try rather than relying on our hand written heuristic

# What

add rocm.kBatch_sweep as a list of kBatches to try out. These will generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime

Test Plan: n/a

Reviewed By: chenyang78

Differential Revision: D70226055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223
Approved by: https://github.com/ColinPeppler
2025-03-06 01:14:33 +00:00
Shivam Raikundalia
63fbc738dc [Easy/Profiler] Add last entry to truncated values (#148576)
Summary: Since the ranks of a PG are usually in a consecutive range it is useful to print the last values when truncating metadata

Test Plan:
Manually changed truncate length to 2 and ran 4 gpu graph to get the following trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu003.rva5.facebook.com/rank-1.Mar_05_09_48_21.1280355.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D70637461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148576
Approved by: https://github.com/davidberard98
2025-03-06 01:14:15 +00:00
Thomas Bohnstingl
23441492f6 [scan] Refactoring of input checking and dynamo invocation (#142125)
This PR does a refactoring of the way dynamo is invoked and how the input shapes are checked for scan and for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142125
Approved by: https://github.com/ydwu4
2025-03-06 01:06:54 +00:00
Shunting Zhang
6cc3e69103 [inductor] use eager stride for custom op if no tags (#148367)
Fix https://github.com/pytorch/pytorch/issues/148356

This is some sort of short term fix to recover the default behavior to apply layout constraint for custom ops when there are no tags.

A longer term attempt to make sure Inductor always gets correct eager strides is here: https://github.com/pytorch/pytorch/pull/148104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148367
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-03-06 00:58:00 +00:00
Prachi Gupta
703176e538 [ROCm] Fix sort for non-standard bool (#147459)
When converting from uint8 to bool using `view` op, we get a bool that has 0 for false and a non-zero value for true. However, these kinds of bool have undefined behavior. We only read the last bit as 0 or 1 to convert to false or true.

In this fix, we convert bools to uint8, which will convert false to 0 and non-zero value to 1. Essentially, converting non-standard bool to a standard bool and fixing the sort op for non-standard bool.

Fixes #139972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-03-06 00:23:02 +00:00