pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Sam Larsen	73c8068cf8	[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693 ) Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` * tlparse: https://fburl.com/e71yn6uc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693 Approved by: https://github.com/eellison	2025-03-11 19:38:40 +00:00
Jane Xu	971606befa	Add a stable TORCH_LIBRARY to C shim (#148124 ) This PR adds two main parts: - shim.h stable C APIs into torch::Library APIs - a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with. Subplots resolved: - Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (fn)(void , int64_t, int64_t)` into it - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only. - Should I use uint64_t as the common denominator instead of void to support 32bit architectures better? - Yes, and done - Should I add a stable `def` and `fragment` when those can be done in python? - I think we do want these --- and now they're done - Where should library_stable_impl.cpp live? -- no longer relevant - I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc. - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124 Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman	2025-03-11 19:12:46 +00:00
Guilherme Leobas	daff65d671	Correctly propagate exception to parent tx (#146502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502 Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #146504, #146499	2025-03-11 18:55:45 +00:00
Guilherme Leobas	fb53e9e514	Add `__context/cause/suppress_context/traceback__` to Exception (#146499 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499 Approved by: https://github.com/zou3519, https://github.com/anijain2305 ghstack dependencies: #146504	2025-03-11 18:55:45 +00:00
Guilherme Leobas	4e7d264cf8	Introduce `UserDefinedExceptionClassVariable` (#146504 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504 Approved by: https://github.com/anijain2305	2025-03-11 18:55:45 +00:00
Jason Ansel	8d08b49015	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-11 18:51:06 +00:00
PyTorch MergeBot	c916a8efc5	Revert "Use the device interface for detecting Triton availability (#139171 )" This reverts commit `940b60db97`. Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))	2025-03-11 18:49:21 +00:00
drisspg	57ee821a41	fix dynamo ide (#148849 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148849 Approved by: https://github.com/bobrenjc93	2025-03-11 18:43:30 +00:00
Ke Wen	ef6296e7f2	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-11 18:36:12 +00:00
Nikita Shulga	b366f33606	[MPSInductor] Prep for mutlistage reductions (#148969 ) ---- - Move reduction variable initialization from `loads` to `indexing_code` - Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops do barriers explicitly inside the respective reduction functions) - Use `self.compute` instead of `self.body` for all compute operations Checked that number of before/after failures stays at `164 failed, 616 passed, 53 skipped` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969 Approved by: https://github.com/dcci	2025-03-11 18:35:23 +00:00
Nichols A. Romero	dcc502f376	[ROCm][TunableOp] Add bias data type to params signature. (#146227 ) Add bias vector data type in TunableOp params signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227 Approved by: https://github.com/jeffdaily	2025-03-11 18:31:22 +00:00
Chien-Chin Huang	52acc1f955	[DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/140898 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918 Approved by: https://github.com/fduwjj, https://github.com/mori360 ghstack dependencies: #148825	2025-03-11 18:24:12 +00:00
Jason Ansel	09029010e5	[inductor] Fix create_specialize_impl error in latest Triton (#148933 ) ```py $ python test/inductor/test_triton_kernels.py KernelTests.test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1 WARNING:torch._dynamo:Encountered an exception in identify_mutated_tensors, assuming every input is mutated Traceback (most recent call last): File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 715, in identify_mutated_tensors ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 289, in generate_ttir specialization = _get_specialization(ordered_args.values()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 262, in _get_specialization specialize_impl = triton.runtime.jit.create_specialize_impl() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: create_specialize_impl() missing 1 required positional argument: 'specialize_extra' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148933 Approved by: https://github.com/yanboliang, https://github.com/davidberard98	2025-03-11 15:54:47 +00:00
Animesh Jain	f1787ee0f7	[dynamo] Remove L scoping for recompilation messages (#148917 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148917 Approved by: https://github.com/williamwen42	2025-03-11 14:26:26 +00:00
Animesh Jain	992838e702	[dynamo][guards] Do not ID_MATCH on numpy tensors (#148923 ) Might help with https://github.com/pytorch/pytorch/issues/148535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148923 Approved by: https://github.com/jansel	2025-03-11 14:20:26 +00:00
David Berard	9ad64ce795	[triton 3.3] Forward-fix mm template selection logic (#148924 ) Follow-up from https://github.com/pytorch/pytorch/pull/148662. The logic from https://github.com/pytorch/pytorch/pull/148662 is incorrect; what we want is "choose the second template 'AMD-specific template' only if we're on hip AND triton version < 3.3" - negating it, the code should be "choose the first template if we're NOT on hip OR triton version >= 3.3". Tested locally to verify that it fixes the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148924 Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/eellison	2025-03-11 09:05:44 +00:00
eellison	2bcc3acb90	Update low prec codegen for div/mod (#142350 ) Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350 Approved by: https://github.com/blaine-rister	2025-03-11 08:02:30 +00:00
Gabriel Ferns	41e4728f74	update types on dynamo configs (#146873 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146873 Approved by: https://github.com/williamwen42	2025-03-11 05:33:48 +00:00
Gabriel Ferns	1fcc4bc109	Don't look at TESTING_ONLY in fuzzer (#146870 ) Lots of configs aren't meant to be set because they're testing only Pull Request resolved: https://github.com/pytorch/pytorch/pull/146870 Approved by: https://github.com/masnesral	2025-03-11 05:32:25 +00:00
Bin Bao	ecfbfe1603	[AOTI] Remove aoti_torch_cpu__weight_int4pack_mm_cpu_tensor (#148907 ) Summary: shim.h is only meant for generic tensor util shim functions. We should switch to use the auto fallback generation, but it will need some extra care on the op schema. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148907 Approved by: https://github.com/janeyx99	2025-03-11 04:41:05 +00:00
George White	940b60db97	Use the device interface for detecting Triton availability (#139171 ) This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present. This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171 Approved by: https://github.com/jansel	2025-03-11 03:56:11 +00:00
Brian Hirsh	621dadd4ca	partitioner: when materializing unbacked tensor intermediates, apply hint to symbol, not expr (#144097 ) Fixes https://github.com/pytorch/pytorch/issues/144095 open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because: (1) we use the same guess for every unbacked symint (both symbols, and compound expressions) (2) the user may have established some relationship between some unbacked symints that we are not taking into account. I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal? Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op, that had some logic like this: ``` def custom_op(x: [u0], y: [u0 + 1]): assert x.shape[0] == y.shape[0] - 1 ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097 Approved by: https://github.com/laithsakka	2025-03-11 02:11:57 +00:00
Simon Fan	457ff9b7ae	[reland][ca] side-effect free inital trace: compiled_args (#148376 ) This reverts commit `ea12fc8a9f`. Reland https://github.com/pytorch/pytorch/pull/147804, there was a bad import inserted by my linter. Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376 Approved by: https://github.com/jansel	2025-03-11 01:57:36 +00:00
Shuai Yang	9fddbf3417	Update the comment (#148726 ) Differential Revision: D70747931 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148726 Approved by: https://github.com/yf225	2025-03-11 01:19:14 +00:00
bobrenjc93	c297c09a37	Fix invalid nested int guarding in broadcast_shapes() (#145957 ) Fixes #145874 This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially. Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better? Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957 Approved by: https://github.com/bobrenjc93 Co-authored-by: bobrenjc93 <bobren@meta.com>	2025-03-11 00:53:13 +00:00
cyy	295f2ed4d1	Fix "invalid application of 'sizeof' to an incomplete type" (#148854 ) Fixes with C++23 and constexpr std::unique_ptr Pull Request resolved: https://github.com/pytorch/pytorch/pull/148854 Approved by: https://github.com/Skylion007	2025-03-11 00:40:00 +00:00
drisspg	b215841ebb	[MM] Add sm carevout to lowerings (#148793 ) # Summary See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using the following to verify; I still need to figure out how to do proper guarding. This does the correct thing if we compile with SM carveout already set, but since we don't guard on it yet, we don't recompile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793 Approved by: https://github.com/lw, https://github.com/eellison	2025-03-10 23:49:26 +00:00
Brian Hirsh	492f3fd5cf	replace usages of upload_graph in inductor with tlparse (v2) (#148720 ) Reland of https://github.com/pytorch/pytorch/pull/148703 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720 Approved by: https://github.com/mengluy0125	2025-03-10 22:47:58 +00:00
PyTorch MergeBot	a95eb0c0a7	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `2149f6c684`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))	2025-03-10 22:38:40 +00:00
Qiaochu Yuan	12a95390ae	[Minimizer] allow overriding of ShapeProp logic by subclasses of _MinimizerBase (#148784 ) Summary: The changes contained in this diff - allow subclass Minimizer implementations to override the default shape propagation logic with custom logic - copies over the meta attribute on get_attr graph nodes during the graph splitting step - for both changes, behavior for existing classes do not change Test Plan: CI Differential Revision: D70799942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784 Approved by: https://github.com/blaine-rister	2025-03-10 22:22:16 +00:00
Jane Xu	fcb633fafa	Introduce TORCH_ABI_VERSION and a runtime aoti_torch_abi_version C shim ABI (#148892 ) Importable https://github.com/pytorch/pytorch/pull/148836 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148892 Approved by: https://github.com/albanD	2025-03-10 22:22:10 +00:00
Boyuan Feng	98b3f1db9f	[Flex Attention] support num_heads > 1 in block_mask (#148857 ) Previously flex decoding errors when block mask has num_heads > 1. So users have to use num_heads=1, or explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`. This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block mask with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as flex attention kernel. When using GQA (i.e., Hq != Hkv), we support block mask with num_heads = 1. When num_heads = Hq, we fall back to flex attention kernel so user don't need to explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore. Why fallback? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA requires support different kv num blocks for different query heads in the same thread block, leading to lots of redundant workload. So we should better use the main flex_attention kernel where each query head is handled by a separate block. Fixes #148527 Fixes #147267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857 Approved by: https://github.com/drisspg	2025-03-10 22:02:50 +00:00
Mandar Deshpande	6ef15c7f46	[pytorch] Update flexattention bwd config generation (#148600 ) Summary: Currently `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `bwd_config` returned at the end. Test Plan: CI. Existing tests. Differential Revision: D70649316 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600 Approved by: https://github.com/Skylion007	2025-03-10 22:00:56 +00:00
Chien-Chin Huang	ed969d1236	[DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used (#148825 ) Summary: As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148825 Approved by: https://github.com/fduwjj, https://github.com/mori360	2025-03-10 20:04:36 +00:00
Tristan Rice	494abeff8a	CUDACachingAllocator,c10d: fixes for IPC release performance (#148805 ) This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL. 1. release the IpcMutex when deleting the `ExpandableSegements` object to avoid synchronizing under the lock 2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there Test plan: Run with torchft + torchtitan ``` REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096 ... [rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61 loss: 7.4825 memory: 79.73GiB(83.89%) tps: 317 tflops: 16.34 mfu: 1.65% ``` Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors ![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d) ![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805 Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito	2025-03-10 19:47:04 +00:00
clr	6b0fd741d1	dynamo: Count number of opcodes processes (#147149 ) This gives us a decent proxy for how big of a graph we functionally had to parse. Note that this is a cumulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149 Approved by: https://github.com/jansel	2025-03-10 19:20:09 +00:00
Wanchao Liang	3129faf8be	Optimize shard_dim_alltoall to use alltoall_single (#148868 ) As titled. Previously, shard_dim_alltoall used `all_to_all`, which could incur many copies if the tensor becomes non-contiguous during splits, and alltoall itself also incurs copies. This PR uses alltoall_single instead to minimize tensor copies. Tested on all the shard dim change tests and it works properly: ``` pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868 Approved by: https://github.com/tianyu-l	2025-03-10 18:38:12 +00:00
Benjamin Glass	ed7e964f2b	codecache.py: use str.format rather than % formatting (#148691 ) Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691 Approved by: https://github.com/desertfire	2025-03-10 18:33:58 +00:00
Han, Xu	00cabd4235	[Inductor][Windows] add env_var switch to turn all Windows inductor UTs. (#148733 ) Because of timeouts, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927 Without the UTs, we can't ensure Windows inductor quality. The Intel team will run some local tests for Windows inductor, but we still need a switch to turn on the full set of Windows inductor UTs. The switch is an environment variable: ```cmd set TORCHINDUCTOR_WINDOWS_TESTS=1 ``` Setting this environment variable turns on all Windows inductor UTs. It does not affect PyTorch CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-03-10 18:25:29 +00:00
eellison	4c13a859e5	Workaround no triton float8_e8m0fnu support in inductor (#148722 ) Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting, and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873. The one question i'm not sure of is whether or not we need to explicitly disable triton template fusion since it would fuse in these dtypes as uint8.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722 Approved by: https://github.com/vkuzo ghstack dependencies: #148450	2025-03-10 17:37:39 +00:00
PyTorch MergeBot	ebd087e4b5	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit `f08146b67b`. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))	2025-03-10 17:19:21 +00:00
PyTorch MergeBot	2ec9aceaeb	Revert "Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834 )" This reverts commit `3680e666d8`. Reverted https://github.com/pytorch/pytorch/pull/148834 on behalf of https://github.com/janeyx99 due to sorry I don't think I want this PR in before the branch cut, as it'd freeze the API in the file when it should really be in a different header ([comment](https://github.com/pytorch/pytorch/pull/148834#issuecomment-2711162193))	2025-03-10 16:29:40 +00:00
Jason Ansel	a60b4ed623	[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292 ) Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261, #148288	2025-03-10 16:06:19 +00:00
Jason Ansel	8f858e226b	[fx] Optimizations for node name generation (#148288 ) Before: ![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe) After: ![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261	2025-03-10 16:06:19 +00:00
Jason Ansel	5d4e7d58b4	[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` after: ``` 20003454 function calls (19203257 primitive calls) in 8.936 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260	2025-03-10 16:06:11 +00:00
Jason Ansel	bf752c36da	[fx] Move Node._update_args_kwargs to C++ (#148260 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 25203549 function calls (24403352 primitive calls) in 12.090 seconds ``` after: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260 Approved by: https://github.com/oulgen ghstack dependencies: #148243	2025-03-10 16:06:02 +00:00
Jason Ansel	bec7bdad47	[fx] Move map_aggregate to C++ (#148243 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 30603618 function calls (29403419 primitive calls) in 13.744 seconds ``` after: ``` 25203549 function calls (24403352 primitive calls) in 12.090 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243 Approved by: https://github.com/oulgen	2025-03-10 16:05:53 +00:00
Kalpit Munot	31625b08b8	Add ccode for FloorDiv (#148727 ) Summary: Add ccode for FloorDiv Test Plan: CIs Differential Revision: D70749021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148727 Approved by: https://github.com/bobrenjc93	2025-03-10 14:00:18 +00:00
albanD	68c12ecfe2	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic - Use the newly added isBuilt for accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-10 13:17:58 +00:00
Xuehai Pan	098494e9cb	[dynamo] allow global import `from collections import deque` in user code (#148676 ) See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676 Approved by: https://github.com/jansel	2025-03-10 13:14:05 +00:00
Xinyuan Zhao	59f14d19ae	Implement gradient for the `residuals` of `torch.linalg.lstsq` (#148526 ) Fixes #147543. I have written some tests in python using `gradcheck`. Please advise where I should put these tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526 Approved by: https://github.com/lezcano	2025-03-10 12:35:09 +00:00
Francisco Massa	ea86b8d315	Fix redistribution cost for all-reduce (#148761 ) This issue seems to have been introduced in https://github.com/pytorch/pytorch/pull/119897. With the current implementation, it might be more favorable to perform a reduce_scatter followed by an all-gather than simply an all-reduce. Thanks @lw for the helpful discussions on getting this PR out! Pull Request resolved: https://github.com/pytorch/pytorch/pull/148761 Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/tianyu-l, https://github.com/fegin	2025-03-10 12:13:11 +00:00
Mwiza Kunda	00199acdb8	[inductor][triton] Block ptr analysis fix assert on matched index expression (#148446 ) If dynamic shapes are enabled, then block analysis may create new precomputed size replacements from the index which can lead to an assertion failure when the matched index is compared with the original index. For example the below assertion fails, despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply removing the replacements when the expressions are tested to be equal - the latter option is implemented in this PR. ``` torch._inductor.exc.InductorError: AssertionError: E Invalid match! E Index: 3ps0((yindex//3)) + (ModularIndexing(yindex, 1, 3)) E Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3)) E ``` This PR fixes the test below when `config.triton.use_block_ptr=True`: ``` python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446 Approved by: https://github.com/jansel	2025-03-10 05:26:55 +00:00
Jane Xu	3680e666d8	Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834 ) I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++. Following the thread there I am thinking there are two possible solutions: (1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut (2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do in case there's something I haven't considered, so this PR does option 2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834 Approved by: https://github.com/desertfire	2025-03-10 03:23:48 +00:00
henrylhtsang	7ae0ce6360	[cutlass backend] fix assertion that prevent self multiplication (#148233 ) # Problem: In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one. # Solution Just use whichever. Since they are the same. # Question What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233 Approved by: https://github.com/ColinPeppler	2025-03-10 00:21:36 +00:00
henrylhtsang	b47d81682d	[cutlass backend] Forward fix for less aligned gemm shapes (#148521 ) Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/) 1. Check if config name filtering still works. Tested, it works 2. do we get C++ compile error Yes, potentially we need to filter them out manually. Here we get this. ``` static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0)); ``` We need to move some assertions to gemm_template.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521 Approved by: https://github.com/ColinPeppler	2025-03-10 00:21:24 +00:00
PyTorch MergeBot	275a7c5dbb	Revert "Add a stable TORCH_LIBRARY to C shim (#148124 )" This reverts commit `327e07ac1d`. Reverted https://github.com/pytorch/pytorch/pull/148124 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but somehow it caused test failures in newly introduced tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pull%20%2F%20linux-focal-cuda12.6-py3.10-gcc11-sm89%20%2F%20test%20(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148124#issuecomment-2709057833))	2025-03-09 20:44:56 +00:00
PyTorch MergeBot	19a39a7a06	Revert "[dynamo] allow global import `from collections import deque` in user code (#148676 )" This reverts commit `685fb37713`. Reverted https://github.com/pytorch/pytorch/pull/148676 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see `f1444f006c/1`(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148676#issuecomment-2709057326))	2025-03-09 20:42:03 +00:00
Zhenghao Hu	f1444f006c	[caffe2/torch] Fixup upstream LLVM (major version 21) API changes (#148833 ) Latest LLVM introduced two changes related to the `Triple` usage that causes build failures when building pytorch. ## Failure in llvm_codegen.cpp: Triple is stored in Modules instead of the string: `979c275097` ## Failure in llvm_jit.cpp: Triple argument is removed from LLJITBuilder::... : `b18e5b6a36` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833 Approved by: https://github.com/Skylion007	2025-03-09 18:58:36 +00:00
Aditya Tiwari	bb9c426024	Typo Errors fixed in multiple files (#148262 ) # Fix typo errors across PyTorch codebase This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability. ## Changes Made ### Documentation Fixes - Changed "seperate" to "separate" in multiple files: - `setup.py`: Build system documentation - `torch/_library/triton.py`: AOT compilation comments - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation - `torch/export/_unlift.py`: Pass population comments - `torch/export/exported_program.py`: Decomposition table notes ### Code Comments and Error Messages - Changed "occured" to "occurred" in: - `test/mobile/test_lite_script_module.py`: Exception handling comments - `torch/export/_draft_export.py`: Error message text - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment - `torch/csrc/utils/python_numbers.h`: Overflow handling comment - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation - `torch/_dynamo/symbolic_convert.py`: Error explanation ### API Documentation - Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py` - Changed "accross" to "across" in: - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp` - `torch/distributed/distributed_c10d.py` ## Motivation These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR. ## Test Plan No testing required as these changes only affect comments and documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-09 12:21:40 +00:00
Jane Xu	327e07ac1d	Add a stable TORCH_LIBRARY to C shim (#148124 ) This PR adds two main parts: - shim.h stable C APIs into torch::Library APIs - a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with. Subplots resolved: - Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (fn)(void , int64_t, int64_t)` into it - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only. - Should I use uint64_t as the common denominator instead of void to support 32bit architectures better? - Yes, and done - Should I add a stable `def` and `fragment` when those can be done in python? - I think we do want these --- and now they're done - Where should library_stable_impl.cpp live? -- no longer relevant - I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc. - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124 Approved by: https://github.com/albanD, https://github.com/zou3519	2025-03-09 10:07:25 +00:00
Xuehai Pan	685fb37713	[dynamo] allow global import `from collections import deque` in user code (#148676 ) See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676 Approved by: https://github.com/jansel	2025-03-09 09:35:29 +00:00
William Wen	6566d67bd3	[dynamo] show stack above dynamo in graph break user tracebacks (#148401 ) Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+). Also collapses resume frames together (use config.verbose to see full stack trace - for developers). Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-03-09 07:37:38 +00:00
Ke Wen	2149f6c684	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up on the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-09 07:32:23 +00:00
PyTorch MergeBot	9cb25f0ea2	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `17dbeb11db`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))	2025-03-09 03:01:55 +00:00
Ke Wen	17dbeb11db	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up on the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-08 20:00:12 +00:00
Nino Risteski	5245304f1e	Update decompositions_for_jvp.py (#148821 ) small typo thing that got my eye Pull Request resolved: https://github.com/pytorch/pytorch/pull/148821 Approved by: https://github.com/Skylion007	2025-03-08 19:08:42 +00:00
Tristan Rice	7ffadff286	c10d/ProcessGroup: cleanup abort and shutdown (#148798 ) This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs. This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation. Test plan: ``` pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798 Approved by: https://github.com/kwen2501	2025-03-08 18:33:18 +00:00
Sanket Purandare	9841f0ddcf	Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566 ) This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation. It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do. For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`. Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566 Approved by: https://github.com/weifengpy	2025-03-08 18:00:49 +00:00
David Berard	c3b05c4a27	[triton 3.3] support both specialize_impl and create_specialize_impl (#148806 ) After https://github.com/triton-lang/triton/pull/6099, we sometimes need to do `from triton.runtime.jit import specialize_impl` and sometimes do `triton.runtime.jit.create_specialize_impl()`. This should fix a bunch of the new errors that appeared with the triton 3.3 / pytorch 2.7 integration (e.g. `python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_float_arg_dynamic_False_cuda`, failing at https://hud.pytorch.org/pr/pytorch/pytorch/148684#38392501220) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148806 Approved by: https://github.com/drisspg	2025-03-08 09:31:52 +00:00
Zhuoran Zhao	3745da18f4	[AOTI] Switch to local cpp compile for fbcode (#148592 ) Summary: as title, otherwise we cannot find lamdhip64 Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431 Differential Revision: D70637798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592 Approved by: https://github.com/hl475	2025-03-08 08:38:26 +00:00
Simon Fan	666508eb17	[aot cache][ca] remove restriction on caching ca's aot inference graph (#148491 ) but still can't cache CA's aot inference graph yet: the CA functional ops aren't serializable Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491 Approved by: https://github.com/jamesjwu ghstack dependencies: #148381	2025-03-08 06:08:26 +00:00
Simon Fan	c16cd25cf5	[ca] remove compiled_autograd_tracing (#148381 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148381 Approved by: https://github.com/jansel	2025-03-08 06:08:26 +00:00
cyy	f7c0c230b0	Fix compile errors (#148758 ) Fix ``` /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry' 91 | static_assert(sizeof(_Tp)>0, | ^~~~~~~~~~~ /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here 399 | get_deleter()(std::move(__ptr)); | ^ ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here 200 | AliasDb::~AliasDb() = default; | ^ ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here 200 | AliasDb::~AliasDb() = default; | ^ ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry' 298 | struct WriteRegistry; | ^ 1 error generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758 Approved by: https://github.com/Skylion007	2025-03-08 04:56:42 +00:00
riccardofelluga	8f71d4563e	Fix rms_norm in fp16/bf16 (#147203 ) Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done. Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want, though, is to avoid the intermediate down-casting, and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203 Approved by: https://github.com/eqy	2025-03-08 04:43:18 +00:00
Joel Schlosser	85467ed063	Fix for AOTI + CUDAGraphs when calling from Python (#148601 ) Background: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead. When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get: ``` Error: operation not permitted when stream is capturing ``` @desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs. Fix: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details: * Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)` * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction. * C++ side introduces single-threaded alternatives to model running and model container running: * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed. * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`. I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed. 
Future work: * Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist. * Compose with cudagraph trees as opposed to manual cuda graph wrapping Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601 Approved by: https://github.com/desertfire	2025-03-08 02:44:14 +00:00
Sampsa	9f170d9d13	[Triton 3.3] Remove ROCm specific mm gemm template (#148662 ) Fixes: https://github.com/pytorch/pytorch/issues/147121 The template is removed since Triton 3.3.x fixes the problem. This needs to be handled in a non-BC-breaking way, so we will conditionalise this change on the Triton version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148662 Approved by: https://github.com/davidberard98 Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>	2025-03-08 01:24:40 +00:00
drisspg	a89e7c2da9	[Upstream] Wrap log_2_e in tl.constexpr for new 3.3 bump (#148785 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148785 Approved by: https://github.com/davidberard98	2025-03-08 01:09:28 +00:00
Lukas Pfahler	179b7a0abc	Do not crash when compiling quantized LORA models (#148435 ) Fixes #148072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148435 Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel	2025-03-08 00:02:08 +00:00
Gabriel Ferns	24085db082	Don't clear feedback_saver_fns after cache clear (#148723 ) Summary: Since feedback_saver_fns are used for logging, I don't think it makes sense to clear them, and this resulted in weird behavior in user code where disabling caches caused logging code to break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148723 Approved by: https://github.com/henrylhtsang, https://github.com/eellison	2025-03-07 23:43:59 +00:00
Justin Chu	d96c85558a	[ONNX] Use torch export to get dynamic shapes for JIT convert strategy (#148627 ) Use torch export to get dynamic shapes for JIT converted graph. I just realized we can retrace a converted jit graph with `torch.export` and produce dynamic shapes using `torch.export`. - Prior: The exporter will produce a static graph silently even when dynamic_shapes are provided. - Proposed: When `dynamic_shapes` is provided and when the strategy is able to handle it, it will succeed ## Why are we still keeping the JIT strategy? It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. Sometimes also useful when there are JIT scripted modules in the nn module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627 Approved by: https://github.com/titaiwangms	2025-03-07 23:41:50 +00:00
Sam Larsen	187d5c0eb1	[logging] Log cudagraphify timings to dynamo_timed (#143220 ) Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note: * Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries * The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id * I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`: * tlparse: https://fburl.com/21yrdn8h * scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` * tlparse: https://fburl.com/r9mp7uiv * scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220 Approved by: https://github.com/eellison	2025-03-07 23:07:13 +00:00
iupaikov-amd	f2dfe2d99c	[Triton 3.3] [ROCm] Enabled split_scan support for ROCm builds (#147619 ) Fixes issue https://github.com/pytorch/pytorch/issues/133228 Enabled split_scan support for ROCm builds. This must be handled in a non-BC-breaking way, so the functionality is enabled conditionally based on the Triton version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147619 Approved by: https://github.com/davidberard98 Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: David Berard <davidberard98@gmail.com>	2025-03-07 23:06:21 +00:00
PyTorch MergeBot	0f852641c2	Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521 )" This reverts commit `d35a4ddae2`. Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/henrylhtsang due to mistakes when writing the tests ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2707637965))	2025-03-07 22:42:13 +00:00
David Berard	755965d2e4	[inductor] fix matmul w/ torch.bucketize epilogue (#148769 ) See https://github.com/pytorch/pytorch/issues/148764. Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist. As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape. This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it. Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769 Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi	2025-03-07 22:34:13 +00:00
Xinya Zhang	67742128b7	[ROCm] Bump AOTriton to 0.9.2b (#148433 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b: * Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore. * `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs * `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs * The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so` + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten. * The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead. * Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433 Approved by: https://github.com/jeffdaily	2025-03-07 22:10:07 +00:00
Nichols A. Romero	08baaa7d63	[Docs][TunableOp] TunableOp documentation update (#148384 ) This PR aligns documentation to what is in the README file: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md and removes the prototype NOTE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384 Approved by: https://github.com/jeffdaily, https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-03-07 21:02:49 +00:00
PyTorch MergeBot	bb94b65da7	Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233 )" This reverts commit `2fb654676f`. Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2707440106))	2025-03-07 20:58:28 +00:00
Nikita Shulga	6602e632cd	Suppress build warnings when gcc-11 is used (#148763 ) By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")`, which suppresses the following (when building against ancient llvm-9) ``` In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24: /opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type, llvm::Value, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]': /opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete] 1581 | return Insert(new LoadInst(Ty, Ptr), Name); | ^~~~~~~~~~~~~~~~~~~~~ /opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void llvm::UnaryInstruction::operator new(size_t)' ``` Probably a reasonable followup will be to disable NNC testing altogether, as the project has been in maintenance mode for a while now Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763 Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman ghstack dependencies: #148739	2025-03-07 20:43:35 +00:00
Justin Chu	d36391307f	[ONNX] Handle error in verification interpreter (#148730 ) Use a simple try catch to handle onnx runtime errors in the verification interpreter when that happens. One example is ort will sometimes produce a list of None for some nodes. I am not sure how that happens yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730 Approved by: https://github.com/titaiwangms ghstack dependencies: #148706	2025-03-07 20:24:49 +00:00
Xuehai Pan	aebd2e411f	[pytree][easy] lock global registry containers properly for thread-safety (#148750 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148750 Approved by: https://github.com/StrongerXi	2025-03-07 20:04:52 +00:00
bobrenjc93	6b44a91a62	use statically_known_true instead of guard_size_oblivious in pattern matcher (#147557 ) We shouldn't add guards here. Use statically_known_true instead. Internal xref: https://fb.workplace.com/groups/1075192433118967/?multi_permalinks=1609560723015466&comment_id=1610040026300869&notif_id=1740082892544333&notif_t=work_feedback_reaction_generic&ref=notif Differential Revision: [D69950122](https://our.internmc.facebook.com/intern/diff/D69950122/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147557 Approved by: https://github.com/eellison	2025-03-07 19:17:25 +00:00
PyTorch MergeBot	b246cd7b82	Revert "Move get accelerator to use build time flags when possible (#146098 )" This reverts commit `17302b4bc8`. Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))	2025-03-07 18:59:58 +00:00
eqy	18c6e00c7b	[CUDA Graphs][NCCL] Set event queries to happen under thread-local mode in `ProcessGroupNCCL.cpp` (#148594 ) Should mean we don't need to coordinate the watchdog with CUDAGraph captures anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/148594 Approved by: https://github.com/kwen2501	2025-03-07 18:39:02 +00:00
Jack Taylor	8059ead823	[ROCm] Incorporate ROCm triton specific tuning parameters (#148437 ) Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above. A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437 Approved by: https://github.com/eellison, https://github.com/jansel	2025-03-07 18:09:47 +00:00
Aaron Orenstein	a3b77d434a	Subprocess compile (attempt 2) (#148635 ) Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer). Added a test which runs the test_torchinductor tests with subprocess compiling turned on. Fixed the test which caused the previous version (#146134) to be reverted: ``` $ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635 Approved by: https://github.com/jamesjwu	2025-03-07 17:50:14 +00:00
xinan.lin	50c9f6d83b	[Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323 ) In `fresh_inductor_cache`, removing pyd files raises a permission error on Windows because they are still in use by the process. So we clear the references to the loaded pyd library objects and unload them from the process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323 Approved by: https://github.com/jansel ghstack dependencies: #148534, #148538, #147727	2025-03-07 17:19:59 +00:00
Saurabh Mishra	136b8165d1	[DCP] Save Plan Caching: Fix the missing all_plans update in the cache. (#148577 ) Summary: Save Plan Caching: Fix the missing all_plans update in the cache. Test Plan: ``` buck2 test //aiplatform/modelstore/experimental/integration_tests/tests/nosan:checkpoint_dist_save_load_test ``` https://www.internalfb.com/intern/testinfra/testrun/17451448626323264 Reviewed By: MeetVadakkanchery Differential Revision: D70229019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148577 Approved by: https://github.com/MeetVadakkanchery	2025-03-07 17:00:59 +00:00
PyTorch MergeBot	abcca2fcbb	Revert "Fix `torch.nn.functional.hardswish` gradients corner case (#148049 )" This reverts commit `29b28e9d9f`. Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))	2025-03-07 16:05:56 +00:00
albanD	17302b4bc8	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic - Use the newly added isBuilt for accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-07 15:19:34 +00:00
Anant Gulati	372ad7b181	Enable FSDP2 on HPU device (#148667 ) The motivation of this PR is to enable FSDP2 collectives for HPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/148667 Approved by: https://github.com/wconstab	2025-03-07 14:33:43 +00:00
Luca Wehrstedt	f80aad62fa	Improve Pareto frontier plot for AutoAC (#148678 ) This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values. However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range thus missed many threshold values) and in distributed settings. Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points), plus I add the rank to the filename and store it in a user-define directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678 Approved by: https://github.com/Chillee, https://github.com/fmassa	2025-03-07 13:22:29 +00:00
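The recursive binary search described in the commit above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `find_thresholds` and the toy `memory_at` function are hypothetical names, and the sketch assumes the measured quantity is a monotone, piecewise-constant function of the budget.

```python
def find_thresholds(f, lo, hi, resolution=1e-3):
    """Return the points (to within `resolution`) where f changes value on [lo, hi]."""
    if f(lo) == f(hi):
        # Assumption: f is monotone, so equal endpoints mean no change in between.
        return []
    if hi - lo <= resolution:
        # The interval is narrow enough; report its upper end as the threshold.
        return [hi]
    mid = (lo + hi) / 2
    return (find_thresholds(f, lo, mid, resolution)
            + find_thresholds(f, mid, hi, resolution))


def memory_at(budget):
    """Hypothetical step function: predicted memory for a given budget."""
    if budget < 0.25:
        return 40
    if budget < 0.7:
        return 60
    return 100


thresholds = find_thresholds(memory_at, 0.0, 1.0)
print(thresholds)
```

Unlike sampling a fixed grid of points, the recursion only spends evaluations on intervals whose endpoints differ, so every step is located to the requested resolution.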
Avik Chaudhuri	6cf360be04	fix lost input mutations with export_tracepoint (#148709 ) Preserving module call signatures in the presence of input mutation caused incorrect results. The root cause turned out to be that export tracepoints would unwrap / wrap functional args, which lost the mutation info on those args. Differential Revision: [D70734821](https://our.internmc.facebook.com/intern/diff/D70734821/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148709 Approved by: https://github.com/angelayi	2025-03-07 09:36:18 +00:00
FFFrog	416ea1c71c	Code Clean: Remove unnecessary code (#148735 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148735 Approved by: https://github.com/jingsh, https://github.com/cyyever	2025-03-07 08:15:37 +00:00
Rachel Guo	3f069e7679	[mm_logs] enhance the printing for overview info (#148716 ) Summary: Previously the dynamo counters did not print the count information automatically. This explicitly adds a log message, printed after lowering, with overview info for inductor aten mms. It will look like the following, where the name is in `{aten_op_name}_{m}_{n}_{k}` format: ``` torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx ``` Test Plan: ``` TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda ``` Differential Revision: D70739912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716 Approved by: https://github.com/henrylhtsang	2025-03-07 05:23:49 +00:00
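The `{aten_op_name}_{m}_{n}_{k}` naming scheme from the commit above can be illustrated with a small counter. This is a sketch under assumptions, not the inductor implementation: `record_mm` and `overview` are hypothetical helpers, and only the key format and the `(name: count)` layout are taken from the log line quoted in the message.

```python
from collections import Counter

mm_counts = Counter()


def record_mm(op_name, m, n, k):
    """Count one matmul call under a key of the form {op}_{m}_{n}_{k}."""
    mm_counts[f"{op_name}_{m}_{n}_{k}"] += 1


def overview(counts):
    """Render the counters as '(name: count), (name: count), ...'."""
    return ", ".join(f"({name}: {count})" for name, count in sorted(counts.items()))


record_mm("aten.addmm", 16, 6, 16)
record_mm("aten.mm", 32, 32, 8)
record_mm("aten.addmm", 16, 6, 16)

summary = overview(mm_counts)
print(summary)
```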
Wei Feng	c0f1557285	[FSDP2][doc] highlight equivalence of set_requires_gradient_sync and no_sync (#148715 ) We have been asked a few times about FSDP2's equivalent of no_sync. Highlight set_requires_gradient_sync as the equivalent in the docstring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148715 Approved by: https://github.com/mori360	2025-03-07 04:34:46 +00:00
Nitin Singh	fe4b88f6aa	[HPU] Add hpu to fused kernels supported devices (#148666 ) This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring compatibility with HPU backend. Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with: RuntimeError: fused=True requires all the params to be floating point Tensors of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone'] but torch.float32 and hpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666 Approved by: https://github.com/albanD	2025-03-07 04:28:33 +00:00
Nichols A. Romero	33f8ab2f58	[ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238 ) This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM. There are few other items included in this PR as well: - Fixes for offline tuning of scaled GEMM - Simplification of existing offline UT - Update existing online UT to also test rowwise versus tensorwise scaled GEMM - New UT for offline scaled GEMM Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238 Approved by: https://github.com/jeffdaily	2025-03-07 04:12:48 +00:00
Wei-Sheng Chin	9c9b05bc4f	Expose functions used in custom backend in torch_python dll (#148213 ) Fixes #148208. There are solutions for exposing symbols implicitly from inline functions (i.e., inline function A calls non-inline function B in foo.h; code that includes foo.h has to see the symbol B in the DLL). Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but will slow down dispatching a lot --- drop the inline keyword and move the implementation to a .cc file. Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h. Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213 Approved by: https://github.com/malfet	2025-03-07 02:34:37 +00:00
Zhuoran Zhao	dfb4094b9c	Skip buffer in dense update (#148533 ) Summary: As title. PyTorch Module buffers will not be published in delta publishing. In Quinn's previous diff, constant type annotations were introduced. In addition to skipping constants, we also need to skip a buffer if it is not found in the user-provided delta weights list. Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn Differential Revision: D69553929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533 Approved by: https://github.com/22quinn, https://github.com/jingsh	2025-03-07 01:59:58 +00:00
ZhiweiYan-96	00cd6c07b9	[Intel GPU][pt2e] Enable quantized grouped convolution at XPU (#148522 ) # Motivation&Details This PR fixes a bug that previously blocked quantized grouped convolution. The bug is caused by the fact that grouped convolution requires setting the weight scale mask on both the group dimension and the output channel dimension. This PR fixes the wrong mask in the integration and adds grouped conv to the UT. # UT ` python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu` # Runtime exemplification ```onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785``` The verbose output shows that we successfully run quantized convolution, where the weight is in `abcde` format (group conv). Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522 Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel ghstack dependencies: #148423	2025-03-07 01:57:45 +00:00
Ryan Guo	c8cd8f68bd	[dynamo] Properly account for non-list instances in list comparison (#148470 ) As title; this patch also removes an unused `list_compare` method. Fixes #148179. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148470 Approved by: https://github.com/anijain2305	2025-03-07 01:29:30 +00:00
eellison	a7fe685be8	Add cpp wrapper skip to cudagraph logs (#148700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148700 Approved by: https://github.com/jbschlosser	2025-03-07 01:02:40 +00:00
Justin Chu	e3087f6d76	[ONNX] Improve verify_onnx_program to use VerificationInterpreter (#148706 ) I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users. I added a `compare_intermediates` option to `verify_onnx_program`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706 Approved by: https://github.com/titaiwangms	2025-03-07 00:40:54 +00:00
Richard Barnes	33a285379a	[codemod] Remove unused-variable in caffe2/torch/csrc/distributed/c10d/cuda/AsyncMM.cu (#148501 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: dtolnay Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501 Approved by: https://github.com/Skylion007	2025-03-07 00:33:39 +00:00
Xilun Wu	e2a0296e80	[dtensor] add CuDNN SDPA op support to DTensor (#148537 ) ### Summary This PR adds `_scaled_dot_product_cudnn_attention` and `_scaled_dot_product_cudnn_attention_backward` to DTensor ops ### Test `pytest test/distributed/tensor/test_attention.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148537 Approved by: https://github.com/drisspg, https://github.com/fegin	2025-03-06 23:44:40 +00:00
Blaine Burton Rister	75d29443e7	[Docs] update bucketize documentaion (#148400 ) Fixes #144504 Clarify the documentation for `torch.bucketize` by referencing the existing table. The current version includes a somewhat confusing explanation for the `right` kwarg, whereas the existing table is much clearer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148400 Approved by: https://github.com/benjaminglass1, https://github.com/eellison, https://github.com/albanD	2025-03-06 22:07:52 +00:00
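The effect of the `right` kwarg mentioned in the commit above can be illustrated without PyTorch. The sketch below is a pure-Python stand-in, not the library implementation: the local `bucketize` helper is hypothetical, and it assumes the documented rule that `right=False` returns `i` with `boundaries[i-1] < v <= boundaries[i]` while `right=True` returns `i` with `boundaries[i-1] <= v < boundaries[i]`, which matches `bisect_left` and `bisect_right` respectively.

```python
from bisect import bisect_left, bisect_right


def bucketize(values, boundaries, right=False):
    """Pure-Python stand-in for the bucket lookup; `boundaries` must be sorted."""
    find = bisect_right if right else bisect_left
    return [find(boundaries, v) for v in values]


boundaries = [1, 3, 5, 7, 9]
values = [3, 6, 9]

default_idx = bucketize(values, boundaries)            # right=False
right_idx = bucketize(values, boundaries, right=True)  # right=True
print(default_idx, right_idx)
```

The two modes only differ for values that land exactly on a boundary (3 and 9 here); 6 falls strictly between boundaries and gets the same index either way.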
henrylhtsang	2fb654676f	[cutlass backend] fix assertion that prevent self multiplication (#148233 ) # Problem: In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one. # Solution Just use whichever. Since they are the same. # Question What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233 Approved by: https://github.com/ColinPeppler	2025-03-06 22:02:26 +00:00
henrylhtsang	d35a4ddae2	[cutlass backend] Forward fix for less aligned gemm shapes (#148521 ) Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/) 1. Check if config name filtering still works. Tested, it works 2. do we get C++ compile error Yes, potentially we need to filter them out manually. Here we get this. ``` static_assert(threads_minor == 0 \|\| (TileSizeK % threads_minor == 0)); ``` We need to move some assertions to gemm_template.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521 Approved by: https://github.com/ColinPeppler	2025-03-06 22:02:19 +00:00
lanzongwei.lan	3d62e81a1e	[DCP] fix dcp gather_object/scatter_object_list (#147675 ) gather_object/scatter_object_list's dst is `Destination rank on global process group (regardless of group argument)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147675 Approved by: https://github.com/MeetVadakkanchery	2025-03-06 21:20:38 +00:00
Ryan Guo	1d7fc0c681	[dynamo] Remove dead code path around `functools.partial` objects (#148683 ) This removes the code paths added in #98120, which has since been superseded by #108846. More importantly, it makes `EQUALS_MATCH`'s `ok_mutable_types` (added in #134016) easier to reason about, i.e., no need to worry about `dict` types, which was only needed for #98120. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148683 Approved by: https://github.com/yanboliang	2025-03-06 21:20:04 +00:00
Shunting Zhang	262411e48b	[inductor] online softmax (#127011 ) Softmax needs some preparation work that accesses the input tensor in two passes: - compute amax of each row - compute (x - amax).exp.sum for each row When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound. Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ). Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54 ## Microbenchmark - `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax - eager_ms=6.671296119689941 - opt_ms=8.06931209564209 - `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax - eager_ms=6.634047985076904 - opt_ms=6.230591773986816 Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011 Approved by: https://github.com/jansel	2025-03-06 21:07:18 +00:00
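The single-pass max-and-sum reduction described in the commit above can be sketched in plain Python. This is a scalar illustration of the idea from the linked paper, not the Triton kernel inductor generates; `online_max_and_sum` and `online_softmax` are hypothetical names, and the inputs are assumed to be finite floats.

```python
import math


def online_max_and_sum(row):
    """One pass over `row`: returns (max, sum(exp(x - max)))."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the partial sum to the new maximum before adding the new term.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(x - new_max))
        running_max = new_max
    return running_max, running_sum


def online_softmax(row):
    """Softmax using the statistics gathered in a single pass."""
    row_max, row_sum = online_max_and_sum(row)
    return [math.exp(x - row_max) / row_sum for x in row]


probs = online_softmax([1.0, 3.0, 2.0, 1000.0, 999.0])
print(probs)
```

The rescaling factor `exp(running_max - new_max)` is what lets the sum be corrected whenever a larger element appears, so the row never has to be re-read to compute the normalizer.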
zeshengzong	1add61c242	Replace `unimplemented` with `unimplemented_v2` in `codegen.py` (#148069 ) Fixes #147913 - replace `unimplemented` in `codegen.py` - remove unused import `unimplemented` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148069 Approved by: https://github.com/Skylion007, https://github.com/williamwen42	2025-03-06 20:42:37 +00:00
Aaron Gokaslan	edd640a95a	[BE][Ez]: Use itertools.chain.from_iterable when possible (#148190 ) Often makes the code more readable, more efficient, and adds support for infinite iterables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190 Approved by: https://github.com/jansel, https://github.com/malfet	2025-03-06 20:37:06 +00:00
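As a small illustration of the claim in the commit above, the sketch below flattens a nested list and then an unbounded generator of pairs. The example data and the `pairs` generator are made up for this sketch and are not from the PR.

```python
from itertools import chain, count, islice

# Flatten a finite nested list lazily.
nested = [[1, 2], [3], [], [4, 5]]
flat = list(chain.from_iterable(nested))


def pairs():
    """An infinite stream of tuples: (0, 1), (2, 3), (4, 5), ..."""
    for start in count(0, 2):
        yield (start, start + 1)


# chain(*pairs()) would never return, because unpacking has to exhaust
# the generator first; chain.from_iterable consumes it one item at a time.
first_five = list(islice(chain.from_iterable(pairs()), 5))
print(flat, first_five)
```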
zeshengzong	29b28e9d9f	Fix `torch.nn.functional.hardswish` gradients corner case (#148049 ) Fixes #147801 ## Changes - Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) - Enable cuda for test `test_hardswish_grad_corner` - Add test case for value=-3 ## Test Result ```bash pytest test/test_nn.py -k test_hardswish pytest test/test_unary_ufuncs.py -k test_hardswish pytest test/inductor/test_torchinductor.py -k test_hardswish ``` ![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d) ![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8) ![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049 Approved by: https://github.com/soulitzer	2025-03-06 19:04:52 +00:00
Xuehai Pan	f08146b67b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for named tuple types. 3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-06 18:59:02 +00:00
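The subclass behaviour described in the commit above can be illustrated with a common duck-typing check. This is a sketch only and is not the implementation in `torch.utils._pytree`: `looks_like_namedtuple_class` is a hypothetical helper that assumes a namedtuple class is any `tuple` subclass carrying a `_fields` tuple of strings, which subclasses inherit.

```python
from collections import namedtuple


def looks_like_namedtuple_class(cls):
    """Heuristic: a tuple subclass whose `_fields` is a tuple of strings."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    fields = getattr(cls, "_fields", None)
    return isinstance(fields, tuple) and all(isinstance(f, str) for f in fields)


Point = namedtuple("Point", ["x", "y"])


class LabeledPoint(Point):
    """A subclass of a namedtuple class; it inherits `_fields` from Point."""

    def norm1(self):
        return abs(self.x) + abs(self.y)


print(looks_like_namedtuple_class(Point))         # directly created class
print(looks_like_namedtuple_class(LabeledPoint))  # subclass
print(looks_like_namedtuple_class(tuple))         # plain tuple
```

A check based on inherited attributes accepts subclasses automatically, whereas a check that requires the class to derive directly from `tuple` would reject `LabeledPoint`.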
PyTorch MergeBot	96176e32a9	Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433 )" This reverts commit `8af79b7ec8`. Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))	2025-03-06 18:32:48 +00:00
Ke Wen	d91a634edf	[c10d] Make getDefaultBackend more fault tolerant (#148596 ) This is a forward fix for #135338. It hits error like this: ``` "distributed_c10d.py", line 2156, in destroy_process_group if type(pg) == ProcessGroup and pg._has_hooks(): RuntimeError: Could not find the default backend type 0 for Process Group with name undefined. ``` When users call `init_process_group(nothing)`, default backend is not set, or set to `undefined`. Thus the above signature. Triggered by the `_has_hooks()` call. The fix wraps `getDefaultBackend` with a try-catch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596 Approved by: https://github.com/LucasLLC, https://github.com/fduwjj	2025-03-06 18:07:43 +00:00
PyTorch MergeBot	28b68b46bc	Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233 )" This reverts commit `4aeca28137`. Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2704534995))	2025-03-06 17:45:49 +00:00
PyTorch MergeBot	841451af9f	Revert "[Inductor] Avoid tensor slice overflow for large step (#147433 )" This reverts commit `1d7397a2d0`. Reverted https://github.com/pytorch/pytorch/pull/147433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/147433#issuecomment-2704506627))	2025-03-06 17:33:08 +00:00
Rachel Guo	679e7d257e	[mm_logs] follow up to add count info based on shape for inductor `aten.mm`s (#148623 ) Summary: as title. when enable `TORCH_LOGS="+inductor"`, you can get logs at the end such as stats [('calls_captured', 1), ('unique_graphs', 1)] inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), (('aten_addmm', (16, 6, 16)), 1), ('extern_calls', 1), ('async_compile_cache_miss', 1)] graph_break [] Test Plan: follow up to add proper logging test. Differential Revision: D70665104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623 Approved by: https://github.com/henrylhtsang	2025-03-06 16:20:04 +00:00
Benjamin Glass	b160dda743	cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403 ) This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode. Changes: 1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for majority of the memory usage reductions. 2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns. 3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive. The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression. 
Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403 Approved by: https://github.com/desertfire	2025-03-06 16:08:16 +00:00
Mikayla Gawarecki	d5184901c4	Make torch.serialization.skip_data work with torch.load (#148018 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148018 Approved by: https://github.com/albanD ghstack dependencies: #147786, #147787, #147788	2025-03-06 12:04:46 +00:00
Mikayla Gawarecki	be0ceee1c3	Make record/storage alignment in torch.save configurable (#147788 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788 Approved by: https://github.com/albanD ghstack dependencies: #147786, #147787	2025-03-06 12:04:46 +00:00
Mikayla Gawarecki	209977e6e5	Add information about checkpoint offset to untyped storages when torch.load under FakeTensorMode (#147787 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147787 Approved by: https://github.com/albanD ghstack dependencies: #147786	2025-03-06 12:04:39 +00:00
Mikayla Gawarecki	bdcc1b579b	Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786 ) This only fixes _rebuild_tensor_v2 and _rebuild_tensor_v3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147786 Approved by: https://github.com/albanD	2025-03-06 12:04:32 +00:00
rzou	79aa17489c	[dynamo] ctx_manager.py: replace unimplemented with unimplemented_v2 (#148570 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148570 Approved by: https://github.com/williamwen42 ghstack dependencies: #148454	2025-03-06 07:46:31 +00:00
titaiwangms	e7bc1d1791	[ONNX] Update saved exported program in debugging report if the exporting passes run_decomposition() (#148617 ) Previous to this PR, if the exporting passes run_decomposition(), the report still shows the exported_program before decomposition, which adds the difficulties to our users when they want to check the exported program that are used to translate to ONNX graph. The following example is what we see before this PR: ``` # PyTorch ONNX Conversion Report ``` ✅ Obtain model graph with `torch.export.export(..., strict=False)` ⚪ Obtain model graph with `torch.export.export(..., strict=True)` ⚪ Obtain model graph with `torch.jit.trace` ✅ Decompose operators for ONNX compatibility ❌ Translate the graph into ONNX ⚪ Run `onnx.checker` on the ONNX model ⚪ Execute the model with ONNX Runtime ⚪ Validate model output accuracy ``` ## Error messages ```pytb Traceback (most recent call last): File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph _handle_call_function_node_with_lowering( File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering raise _errors.DispatchError( torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. 
Failure message: No decompositions registered for the complex-valued input The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export onnx_program = _exported_program_to_onnx_program( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program values = _translate_fx_graph( ^^^^^^^^^^^^^^^^^^^^ File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph raise _errors.ConversionError( torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information. ``` ## Exported program ```python ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, x: "f32[3, 4]"): # File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64) to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64); x = None # File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2] slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807); to = None slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2); slice_1 = None return (slice_2,) Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)]) Range constraints: {} ``` ## Analysis PyTorch ONNX Conversion Analysis ## Model Information The model has 0 parameters and 0 buffers (non-trainable parameters). 
Number of parameters per dtype: ```python defaultdict(<class 'int'>, {}) ``` Number of buffers per dtype: ```python defaultdict(<class 'int'>, {}) ``` Inputs: - `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})` Outputs: - `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})` The FX graph has 5 nodes in total. Number of FX nodes per op: - `placeholder`: 1 - `call_function`: 3 - `output`: 1 Of the call_function nodes, the counts of operators used are: - `aten.slice.Tensor`: 2 - `aten.to.dtype`: 1 ## ONNX Conversion Information The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher. Errors grouped by operator: - `aten.to.dtype`: No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]` - `aten.slice.Tensor`: No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]` ## Decomposition comparison Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']` Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']` ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617 Approved by: https://github.com/justinchuby	2025-03-06 07:03:45 +00:00
PyTorch MergeBot	ae6bb58483	Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521 )" This reverts commit `ad49cfc9f0`. Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/davidberard98 due to broke lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/13690720601/job/38283359447) [HUD commit link](`ad49cfc9f0`) ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2702980028))	2025-03-06 06:59:39 +00:00
titaiwangms	f057206fca	[ONNX] Support complex comparison when verify=True (#148619 ) Previously, the comparison of complex numbers was not supported when `verify=True`. NOTE: This PR can be extended to support more complex comparison cases if there are other places in onnx codebase needed to be changed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619 Approved by: https://github.com/justinchuby	2025-03-06 04:38:43 +00:00
bobrenjc93	8b65d522e1	refactor delayed compile to use code context (#148530 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148530 Approved by: https://github.com/williamwen42 ghstack dependencies: #148509	2025-03-06 04:02:30 +00:00
henrylhtsang	ad49cfc9f0	[cutlass backend] Forward fix for less aligned gemm shapes (#148521 ) Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/) 1. Check if config name filtering still works. Tested, it works 2. do we get C++ compile error Yes, potentially we need to filter them out manually. Here we get this. ``` static_assert(threads_minor == 0 \|\| (TileSizeK % threads_minor == 0)); ``` We need to move some assertions to gemm_template.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521 Approved by: https://github.com/ColinPeppler	2025-03-06 03:42:55 +00:00
James Wu	8728d4b815	Clear triton kernels after parent make_launcher (#148604 ) Before, we were clearing the cache only after inductor compile. But inductor may not always compile, i.e. on AOTAutogradCache hit. So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856 Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604 Approved by: https://github.com/masnesral	2025-03-06 03:28:38 +00:00
maybeLee	43e1284c96	Fix empty matrix handling of addmv in inductor (#143792 ) This is a resubmission of my previous PR that I accidentally deleted; apologies in advance for any inconvenience caused. Below are the details of this PR. Fix an issue where torch.addmv behaves inconsistently between torch.compile mode and eager mode. Here is the code to reproduce: ``` import torch import numpy as np @torch.compile def test_optimized(input, mat, vec): return torch.addmv(input, mat, vec) def test(input, mat, vec): return torch.addmv(input, mat, vec) input = torch.tensor([2], dtype=torch.int32) mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32) vec = torch.tensor([]) origin_out = test(input, mat, vec) optimized_out = test_optimized(input, mat, vec) print(origin_out) # tensor([2.]) print(optimized_out) # tensor([]) ``` According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when the matrix and vector are empty, returning `[2.]` seems more reasonable to me. Following the CPU implementation of this API: `e97b97af56/aten/src/ATen/native/Blas.cpp (L62)`, I add an additional branch to handle the empty matrix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-06 02:09:27 +00:00
Pat Vignola	38b3375a81	[MTIA] Use "ieee" instead of "tf32" for MTIA's default precision in FlexAttention (#148565 ) Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD. Test Plan: CI Reviewed By: mortzur Differential Revision: D70072064 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565 Approved by: https://github.com/mortzur	2025-03-06 02:07:18 +00:00
Ruben Rodriguez Buchillon	32715a2311	[inductor][ck] add kBatch_sweep to config.rocm (#148223 ) Summary: Why: enable testing and users to specify a set of kBatches to try rather than relying on our hand-written heuristic. What: add rocm.kBatch_sweep as a list of kBatches to try out. These will generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime. Test Plan: n/a Reviewed By: chenyang78 Differential Revision: D70226055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223 Approved by: https://github.com/ColinPeppler	2025-03-06 01:14:33 +00:00
Shivam Raikundalia	63fbc738dc	[Easy/Profiler] Add last entry to truncated values (#148576 ) Summary: Since the ranks of a PG are usually in a consecutive range it is useful to print the last values when truncating metadata Test Plan: Manually changed truncate length to 2 and ran 4 gpu graph to get the following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu003.rva5.facebook.com/rank-1.Mar_05_09_48_21.1280355.pt.trace.json.gz&bucket=gpu_traces Differential Revision: D70637461 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148576 Approved by: https://github.com/davidberard98	2025-03-06 01:14:15 +00:00
Thomas Bohnstingl	23441492f6	[scan] Refactoring of input checking and dynamo invocation (#142125 ) This PR does a refactoring of the way dynamo is invoked and how the input shapes are checked for scan and for associative_scan Pull Request resolved: https://github.com/pytorch/pytorch/pull/142125 Approved by: https://github.com/ydwu4	2025-03-06 01:06:54 +00:00
Shunting Zhang	6cc3e69103	[inductor] use eager stride for custom op if no tags (#148367 ) Fix https://github.com/pytorch/pytorch/issues/148356 This is some sort of short term fix to recover the default behavior to apply layout constraint for custom ops when there are no tags. A longer term attempt to make sure Inductor always gets correct eager strides is here: https://github.com/pytorch/pytorch/pull/148104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148367 Approved by: https://github.com/eellison, https://github.com/zou3519	2025-03-06 00:58:00 +00:00
Prachi Gupta	703176e538	[ROCm] Fix sort for non-standard bool (#147459 ) When converting from uint8 to bool using `view` op, we get a bool that has 0 for false and a non-zero value for true. However, these kinds of bool have undefined behavior. We only read the last bit as 0 or 1 to convert to false or true. In this fix, we convert bools to uint8, which will convert false to 0 and non-zero value to 1. Essentially, converting non-standard bool to a standard bool and fixing the sort op for non-standard bool. Fixes #139972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony	2025-03-06 00:23:02 +00:00
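
Note on the addmv entry (#143792): the equation that commit message cites is `out = beta * input + alpha * (mat @ vec)`. The sketch below is a plain-Python illustration of why an empty inner dimension leaves only the `beta * input` term. It is not the inductor or ATen code; `addmv_reference` is a hypothetical helper, and it uses a 1x0 matrix rather than the 0x0 matrix of the original repro, because nested lists do not broadcast the way tensors do.

```python
def addmv_reference(inp, mat, vec, beta=1, alpha=1):
    """Hypothetical reference for beta * inp + alpha * (mat @ vec) on Python lists."""
    if len(inp) != len(mat):
        raise ValueError("inp must have one entry per row of mat")
    out = []
    for i, row in enumerate(mat):
        if len(row) != len(vec):
            raise ValueError("each row of mat must match the length of vec")
        # With zero columns this is an empty sum, which is 0.
        dot = sum(m * v for m, v in zip(row, vec))
        out.append(beta * inp[i] + alpha * dot)
    return out


# A 1x0 matrix times a length-0 vector contributes nothing,
# so the result is just beta * inp.
empty_case = addmv_reference([2.0], [[]], [])

# An ordinary 2x2 case for comparison.
dense_case = addmv_reference([1.0, 1.0], [[1, 2], [3, 4]], [1, 1], beta=2)
```

Under these assumptions the empty product is treated as zero, so the input term survives instead of the output collapsing to an empty result.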


46836 Commits