Change /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk to /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk in `cmake/Modules/FindvecLib.cmake`. The unversioned path is more general (the MacOSX10.9 SDK is no longer shipped), and without this change vecLib can't be found on macOS 26.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165881
Approved by: https://github.com/ezyang
Summary:
Original commit changeset: d6d62d0c96dd
Original Phabricator Diff: D84468451 and D84613184
D84468451 caused a CUDA OutOfMemoryError in a model.
Test Plan:
D84468451 was found through a bisect. Also double-checked on recent trunk (9866939225248c2adc307be7a804b26db0b9b555): f815887517
With this diff, which backs out D84468451 and D84613184: f816114560
Differential Revision: D85025378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165910
Approved by: https://github.com/clee2000
This PR introduces an `aot` flag to standalone_compile that uses BundledAOTAutogradCacheEntry, and then allows regional_inductor to use it so that we can start AOT compiling regional compiler graphs. The diff above this will attempt to allow GraphPickler to fully serialize graphs that have regionally compiled subgraphs.
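For illustration, a hypothetical sketch of how the flag might be used; the `aot` keyword placement and the shape of the returned artifact are assumptions based on the description above, not verified API:

```python
import torch
from torch._inductor import standalone_compile

def f(x):
    return x.sin() + 1

gm = torch.fx.symbolic_trace(f)
example_inputs = [torch.randn(4)]

# aot=True routes the result through BundledAOTAutogradCacheEntry
# (assumed keyword, per the description above).
artifact = standalone_compile(gm, example_inputs, aot=True)
```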
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165843
Approved by: https://github.com/oulgen
Resolves #164972
All `torch.utils._cxx_pytree` functions are based on `optree` functions with hardcoded `none_is_leaf=True` and `namespace="torch"`. This PR changes the polyfills to target the generic `optree` functions with those arguments unhardcoded. This means `torch.utils._cxx_pytree` functions remain traceable, while community `optree` usages additionally get dynamo support.
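As an illustration of the relationship (the `optree` calls below are plain `optree` API; their dynamo traceability is what this PR adds):

```python
import optree

tree = {"a": (1, 2), "b": None}

# Community usage: by default None is not a leaf, so the function never sees it.
doubled = optree.tree_map(lambda x: 2 * x, tree)

# torch.utils._cxx_pytree semantics: None is a leaf, and lookups use the
# "torch" registry namespace.
kept = optree.tree_map(lambda x: x, tree, none_is_leaf=True, namespace="torch")
```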
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165860
Approved by: https://github.com/Lucaskabela
# Feature
This PR supports compiling `Tensor.item` with Inductor's FX backend. This maps to a custom WrapperCodeGen method called `codegen_dynamic_scalar`.
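A minimal example of the pattern being compiled (selecting Inductor's FX wrapper backend is configuration not shown here):

```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile
def f(x, y):
    s = x.item()  # dynamic scalar -> codegen_dynamic_scalar
    return y * s

print(f(torch.tensor(2.0), torch.ones(4)))
```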
# Implementation
The implementation is fairly mechanical, following the usual flow for these types of PRs.
1. Introduce a new Wrapper IR line for this, called `DynamicScalarLine`.
2. Split `PythonWrapperCodegen.codegen_dynamic_scalar` into 2 parts: a public method which generates the Wrapper IR line, and a private one generating Python from Wrapper IR.
3. Implement an FX codegen method for the wrapper IR line. This one calls `aten.where.Scalar` to handle code like `1 if x.item() else 0`, which is a bit tricky. It also calls `aten.item.default` to convert tensors to scalars.
# Test plan
Added CI tests mirroring the AOTI ones. They test float, int, and bool types, the last of which takes a distinct codegen path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165599
Approved by: https://github.com/angelayi, https://github.com/jansel
Add a new 'reduction' tag to tags.yaml and apply it to 98 reduction operator variants across 21 operator families (sum, mean, min, max, argmin, argmax, amin, amax, aminmax, prod, all, any, norm, var, std, std_mean, var_mean, nansum, logsumexp, count_nonzero, linalg_vector_norm). This tag categorizes operators that perform reduction operations, computing aggregate values across one or more dimensions of the input tensor(s).
Based on PR #153342, co-written with @AlonSardas.
Just as we have the pointwise tag, this can be useful for compiler passes or for opting into sharding rules.
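For example, a pass could consume the tag like this (a sketch assuming the tag is exposed as `torch.Tag.reduction`, mirroring the existing `torch.Tag.pointwise`):

```python
import torch

op = torch.ops.aten.sum.dim_IntList
if torch.Tag.reduction in op.tags:  # tag assumed to be added by this PR
    print(f"{op} performs a reduction")
```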
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165155
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/mlazos
* Moves rocm.yml from the persistent non-ARC runners of the combined MI2xx (MI210 + MI250) cluster to the ARC runners of the MI250 cluster. This halves the number of nodes but provides access to approximately 4x the runners, since every 8-GPU MI250 node now provides eight 1-GPU runners. This should help with concurrent capacity and queueing for the MI2xx jobs.
Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165481
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
The `-amdgpu-coerce-illegal-types=1` flag applies to the LLVM shipped in ROCm 6.3, 6.4, 7.0, and 7.1; it will not be in ROCm 7.2. It was added to enable performance improvements for composable kernel. ROCm 7.2 and newer changed the compiler so that the flag is no longer needed to achieve those improvements, and keeping the flag with ROCm 7.2 breaks the PyTorch build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165789
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
To avoid polluting the global namespace, we move the `from`/`to` APIs into `torch::stable::detail`. Following our normal deprecation cycle, we choose to keep exposing the global `from`/`to` for the time being, as people who onboarded their extensions onto 2.9 would otherwise not be able to build with 2.10.
Note that this means that within libtorch we do not get the luxury of tacking on a `using torch::stable::detail::from`, because that leads to ambiguous calls at build time: both the global and the namespaced APIs are exposed, and the compiler cannot pick between them. That is why every local call site is updated.
Note that the update is _not_ necessary from a custom op writer's point of view. FA3 can continue to build on torch nightlies without changing any code. (Since this is a header change, this PR has no implication on runtime; a previously built FA3 ABI-stable wheel will continue to work fine with newer torch versions after this PR.)
Once TORCH_BOX lands, we will be free to remove these global APIs when the deprecation cycle is up (April 2026) and encourage people to use TORCH_BOX, avoiding `from`/`to` entirely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164956
Approved by: https://github.com/malfet
ghstack dependencies: #164882
Pybind's API entails a small unnecessary overhead when working with args. (Similarly, we should probably be using vectorcall, but that's a bigger change for both us and pybind11.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164625
Approved by: https://github.com/albanD
ghstack dependencies: #164624
Fixes #165911
- Add a message to the AttributeError so we see `Developer debug context: raised exception AttributeError(["'Linear' object has no attribute 'w'"])` instead of just `Developer debug context: raised exception AttributeError([])`
- Add a stack trace to `ObservedException` so we display the innermost error's stack trace back to user code
Output:
```
/data/users/shangdiy/pytorch/torch/__init__.py:2641: UserWarning: You are calling torch.compile inside torch.export region. To capture an useful graph, we will implicitly switch to torch.compile(backend=eager)
warnings.warn(
Traceback (most recent call last):
File "/data/users/shangdiy/pytorch/torch/_dynamo/variables/user_defined.py", line 1385, in var_getattr
subobj = self._getattr_static(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/shangdiy/pytorch/torch/_dynamo/variables/user_defined.py", line 1256, in _getattr_static
subobj = type(self.value).__getattribute__(self.value, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Linear' object has no attribute 'w'
During handling of the above exception, another exception occurred:
torch._dynamo.exc.ObservedAttributeError: 'Linear' object has no attribute 'w'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/users/shangdiy/pytorch/test.py", line 34, in <module>
mod = torch._dynamo.functional_export._dynamo_graph_capture_for_export(Model())(x)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/shangdiy/pytorch/torch/_dynamo/functional_export.py", line 481, in inner
out = fullgraph_capture(
^^^^^^^^^^^^^^^^^^
File "/data/users/shangdiy/pytorch/torch/_dynamo/convert_frame.py", line 1053, in fullgraph_capture
return _fullgraph_capture_frame(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/shangdiy/pytorch/torch/_dynamo/convert_frame.py", line 1115, in _fullgraph_capture_frame
raise e.with_traceback(None) from e.__cause__ # User compiler error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Observed exception
Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Developer debug context: raised exception AttributeError(["'Linear' object has no attribute 'w'"])
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0088.html
from user code:
File "/data/users/shangdiy/pytorch/torch/_dynamo/functional_export.py", line 171, in forward
res = self._export_root(*args, **kwargs)
File "/data/users/shangdiy/pytorch/test.py", line 31, in forward
weight = self.linear.w
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165930
Approved by: https://github.com/anijain2305
Fixes https://github.com/pytorch/pytorch/issues/163343.
After some consideration, I propose we remove the anonymous namespace around from/to in favor of:
1. Adding `inline` to the function implementations, assuming they will not change in the near future.
2. If we do decide to change them, wrapping the code in inline versioned namespaces so that the implementations within any given versioned namespace are guaranteed identical.
Note that:
- We eventually intend to abstract away usage of `from`/`to` (related: @lw's TORCH_BOX work)
- The from/to implementations are now powered by class template specializations, where adding a specialization does not change the from/to signatures
I subsequently plan to deprecate the top-level from/to in favor of torch::stable::detail::from/to, so we can stop polluting the global namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164882
Approved by: https://github.com/lw, https://github.com/albanD
Replaces 78 assert statements across 10 files in torch.autograd with explicit if-checks raising AssertionError, so the checks cannot be disabled with the Python -O flag. This ensures error checking remains active in optimized builds.
Partially fixes #164878.
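For illustration, a hypothetical call site showing the rewrite pattern:

```python
def backward_check(grad, out):
    # Before: `assert grad.shape == out.shape, "grad shape mismatch"`,
    # which is stripped entirely under `python -O`.
    # After: an explicit check that survives optimized mode.
    if grad.shape != out.shape:
        raise AssertionError("grad shape mismatch")
```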
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165627
Approved by: https://github.com/albanD
The scheduler relies on `node.last_usage` to free buffers. `last_usage` may contain a buffer that was allocated in a previous graph partition and is not directly accessed in the current graph partition.
## Example
```python
def f(x):
    y = x + 1
    z = torch.ops.aten.view.dtype(y, torch.float8_e4m3fn)
    z_cpu = z.cpu()
    u_cuda = z_cpu.cuda()
    return u_cuda
```
In the generated code, we have
```python
def partition_0(args):
    ...
    # Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.add, aten.view]
    buf1 = torch.ops.aten.view.dtype(buf0, torch.float8_e4m3fn)  # <------ buf1 is a view of buf0
    buf2 = buf1  # <------- buf2 is buf1
    assert_size_stride(buf2, (8, ), (1, ), 'torch.ops.aten.view.dtype')
    assert_alignment(buf2, 16, 'torch.ops.aten.view.dtype')
    return (buf2, )

def call(self, args):
    ...
    (buf2,) = self.partitions[0](partition0_args)
    ...
    buf3.copy_(buf2, False)
    del buf0
    del buf1
    del buf2  # <---- `del buf2` leads to `del buf0`. BUT `buf0` was not returned from partition_0.
    ...
```
Note: view is treated as a fallback kernel due to its special dtype.
de09bab4b6/torch/_inductor/lowering.py (L841-L843)
## Fix
This PR fixes the issue by also returning such buffers from the partition so they can be freed later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165815
Approved by: https://github.com/eellison
Initial autotuning support for foreach kernels, giving a 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. Removes `num_warps` from the kernel definition to enable autotune support in the generated wrapper code.
Before:
```
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |
```
After:
```
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |
```
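For context, a small foreach workload of the kind that produces `triton_for_fused_*` kernels when compiled (an illustrative sketch, not the internal workload that was measured):

```python
import torch

params = [torch.randn(1024, device="cuda") for _ in range(8)]
grads = [torch.randn(1024, device="cuda") for _ in range(8)]

@torch.compile
def sgd_step(params, grads):
    # Inductor lowers this horizontally fused update to a foreach kernel.
    torch._foreach_add_(params, grads, alpha=-0.1)

sgd_step(params, grads)
```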
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162053
Approved by: https://github.com/mlazos, https://github.com/naromero77amd