Fixes #141563
In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch, an empty ellipsis doesn't cause a separation. This leads to differing behavior between NumPy and PyTorch in this edge case.
This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> import torch
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```
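For reference, NumPy's rule can be seen in isolation (a minimal illustration using only NumPy, not the PR's code): when the advanced indices are separated, even by an ellipsis matching zero dimensions, the broadcast result dimensions move to the front of the output; when they are adjacent, they stay in place.
```python
import numpy as np

a = np.ones((3, 4, 5))

# Advanced indices separated by an ellipsis that matches zero dimensions:
# the broadcast result dimensions are moved to the front of the output.
print(a[:, (0, 1), ..., (0, 1)].shape)  # (2, 3)

# The same advanced indices with no separator between them:
# the result dimensions stay in place.
print(a[:, (0, 1), (0, 1)].shape)  # (3, 2)
```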
Similarly to #157676, this PR doesn't change PyTorch's behavior, but it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch behavior isn't modified.
Notice that there are still some other bugs in PyTorch's advanced indexing (mainly regarding proper accounting of dimensions when multidimensional boolean masks are present), but those need to be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
**Background**:
```Shell
torch 2.5.1+cpu
torchvision 0.20.1
```
```Python
import torch
import torchvision
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
handle = entry.fake_impl.register(func_to_register, source)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```
**Cause**:
torchvision's `.so` file lacks some symbol definitions because those symbols come from CUDA, and the current environment has neither CUDA nor a GPU. The resulting error message is very confusing.
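A hedged sketch (not the change made in this PR; `check_torchvision_import` is an illustrative helper) of how the failure could be surfaced with a clearer message:
```python
def check_torchvision_import():
    try:
        import torchvision  # noqa: F401
    except RuntimeError as e:
        # The confusing message above is the symptom; point at the likely cause.
        if "torchvision::nms does not exist" in str(e):
            raise RuntimeError(
                "torchvision's C++ extension failed to register its operators; "
                "this build may depend on CUDA symbols that are unavailable in "
                "a CPU-only environment"
            ) from e
        raise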
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
This is an improvement over https://github.com/pytorch/pytorch/pull/132595. That PR improves the case where `device` is not given. This PR tries to improve the case where `device` is given but the first step of auto-inferring the device from `cudaPointerGetAttributes` can be wrong (undesired). See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.
I think this is a reasonable improvement, as people expect `torch.as_tensor` + CuPy to be zero-copy as much as possible. However, it does change some behavior, because previously it might incur a device-to-device copy.
I will leave it to PyTorch developers to decide whether the improvement is worthwhile.
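For context, a hedged usage sketch of the case this PR targets (the device index and array contents are arbitrary; zero-copy sharing is the expected behavior, not something this sketch guarantees):
```python
import cupy as cp
import torch

with cp.cuda.Device(0):
    x = cp.arange(10)

# device is given explicitly and matches where the CuPy array lives;
# after this PR this path should stay zero-copy instead of copying.
t = torch.as_tensor(x, device="cuda:0")
t[0] = 42
print(x[0])  # prints 42 if the storage is shared (zero-copy)
```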
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
Hi team,
Please help review this trivial fix.
Without this change:
``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
With this change:
```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
profiler_metrics : a list of CUPTI profiler metrics used
to measure GPU performance events.
If this list contains values Kineto runs in CUPTI profiler mode
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
or for the entire measurement duration.
verbose (bool) : whether the trace file has `Call stack` field or not.
performance_events : a list of profiler events to be used for measurement.
enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
that expose CUDA device, stream and event synchronization activities. This feature is new
and currently disabled by default.
adjust_profiler_step (bool) : whether to adjust the profiler step to
match the parent python event duration. This feature is new and currently disabled by default.
disable_external_correlation (bool) : whether to disable external correlation
profile_all_threads (bool) : whether to profile all threads
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
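For completeness, a usage sketch (keyword names taken from the docstring above; the workload is an arbitrary illustration) showing how this config is typically passed to the profiler:
```python
import torch
from torch.profiler import ProfilerActivity, profile

cfg = torch._C._profiler._ExperimentalConfig(
    verbose=True,
    capture_overload_names=True,
)

with profile(activities=[ProfilerActivity.CPU], experimental_config=cfg) as prof:
    torch.mm(torch.randn(8, 8), torch.randn(8, 8))

print(prof.key_averages().table(row_limit=5))
```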
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it. Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
the functions return and triggered aot_dispatch to go to a different
"export" implementation, but I've found this difficult to understand,
and it has led to a bit of duplicate code for the export path.
So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
* create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
* One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
* Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
* Do some base/export/autograd specific post-processing on the graph
Notice that this pattern of nested function invocations means there is no way to easily get the graph capture result in the autograd case; furthermore, the export path is "bolted on", forcing the entire chain of functions to have a different return result than normal, with no way to *resume* the rest of the post-processing to actually get a callable.
Here is the new structure:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
* Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
* Call create_aot_state to do initial setup and set up all the context managers on the stack. These context managers do NOT exit when this function returns.
* Call aot_stage1_graph_capture to do the graph capture
* Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want
With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.
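To make the new flow concrete, here is a generic, runnable sketch of the staged orchestration (the function names mirror the description above, but the bodies are placeholders, not the real AOTAutograd implementation):
```python
from contextlib import ExitStack

def create_aot_state(fn, args, stack: ExitStack):
    # the real code also pushes dispatch/tracing context managers onto `stack`
    return {"fn": fn, "args": args}

def aot_stage1_graph_capture(aot_state):
    return f"graph({aot_state['fn'].__name__})"  # placeholder for the captured graph

def aot_stage2_compile(aot_state, graph):
    return lambda *a: aot_state["fn"](*a)  # placeholder "compiled" callable

def aot_function_like(fn, args):
    with ExitStack() as stack:
        aot_state = create_aot_state(fn, args, stack)
        graph = aot_stage1_graph_capture(aot_state)
        # a caller could stop here, inspect or modify `graph`, then resume:
        return aot_stage2_compile(aot_state, graph)

def add_one(x):
    return x + 1

print(aot_function_like(add_one, (1,))(41))  # 42
```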
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
Two main things of note:
- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
  stages, I am using the ExitStack trick: there is an ExitStack that is
  in scope for the entire pipeline, and inside the individual pipeline
  stages we push context managers onto this stack when we want them to
  survive into the next pipeline stage (a generic sketch of the trick
  follows after this list). This is not obviously the best final form
  of the code, but create_aot_dispatcher_function is called from
  multiple locations, so I can't just inline the context managers into
  the call site.
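As referenced above, a minimal, generic sketch of the ExitStack trick (illustrative names, not the AOTAutograd ones): a stage pushes a context manager onto a stack owned by the caller, so it stays active for the stages that run afterwards.
```python
from contextlib import ExitStack, contextmanager

@contextmanager
def some_mode(tag):
    print(f"enter {tag}")
    try:
        yield
    finally:
        print(f"exit {tag}")

def stage_one(stack: ExitStack):
    # entered on the caller's stack, so it survives past this function's return
    stack.enter_context(some_mode("stage-one mode"))
    return "graph"

def stage_two(graph):
    print(f"compiling {graph} while stage-one mode is still active")

with ExitStack() as stack:
    graph = stage_one(stack)
    stage_two(graph)
# "exit stage-one mode" prints only here, after the whole pipeline has finished
```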
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
Summary: NodeSource should not be updated after creation, so we can cache its dict and string representations for better performance.
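A generic sketch of the caching pattern described (an illustrative class, not the actual NodeSource implementation): because the object is immutable after construction, its dict and string forms can be computed once and reused.
```python
from functools import cached_property

class ImmutableRecord:
    """Stand-in for an object that is never mutated after __init__."""

    def __init__(self, name: str, target: str):
        self.name = name
        self.target = target

    @cached_property
    def as_dict(self) -> dict:
        # computed on first access, then cached on the instance
        return {"name": self.name, "target": self.target}

    @cached_property
    def as_str(self) -> str:
        return f"{self.name} -> {self.target}"
```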
Test Plan:
ci
Rollback Plan:
Reviewed By: yushangdi
Differential Revision: D78298501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
**Problem:**
Fusion can accumulate a large amount of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```python
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient, as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```python
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusion in cases where a large amount of reads would be accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size_ of the buffers that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusions are performed, so we also need to catch this pattern there. The decisions are implemented under `choices.py`. A usage sketch of the new config follows after this list.
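A hedged usage sketch of the knobs described above (the knob names follow this PR; the size value and its unit are placeholders, not recommended settings):
```python
import torch._inductor.config as inductor_config

# existing knob: maximum number of accumulated reads per operator (default 8)
inductor_config.realize_acc_reads_threshold = 8
# new knob from this PR: cap on the accumulated read size (value/unit are illustrative)
inductor_config.realize_acc_reads_size_threshold = 2**20
```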
**Results:**
For a small example similar to the one in the test case (but with a larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
Before the PR, for code like this:
```python
import torch
from torch.export import Dim


class Example2(torch.nn.Module):
    def forward(self, x, trigger, target):
        return torch.cond(
            trigger == 1,
            lambda: x + target,
            lambda: x * target,
            (),
        )


m = Example2()
x = torch.randn(2)
trigger = 0
target = 2
args = (x, trigger, target)
ep = torch.export.export(
    m, args, dynamic_shapes=(None, Dim.DYNAMIC, Dim.DYNAMIC)
)
```
dynamo will wrap "target" (i.e. a SymInt) twice: once when we speculate the first lambda, find that target is a SymInt, and decide to wrap it, creating a new SymNodeVariable and a placeholder input to the top-level graph.
The second time happens when we speculate the second lambda. Tensors are de-duplicated by checking tracked side effects to make sure objects with the same id (though different sources) are mapped to the same TensorVariable. For SymInts, two things are missing:
1. it's not in the _can_lift_attrs_to_input list (the change in builder.py)
2. it's not tracked by runahead_side_effects, so when speculate_subgraph finishes, they're discarded (the change in side_effects.py)
Note: the auto-lifting mechanism for HOPs happens at the proxy level when we trace the subgraph, which is after the SymNodeVariables are created (they're created when realizing the args and binding them to the subgraph). At that point, the builder has created two unique SymNodeVariables for the same SymInt, so the auto-lifting in HOPs cannot de-dup them.
Differential Revision: [D78298163](https://our.internmc.facebook.com/intern/diff/D78298163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158273
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
Summary: We have internal test failures for several aot_inductor_package tests. It looks like we're translating args like:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor/script.ld
```
To:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor//tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
This PR changes the translation to produce strings like:
```
-Wl,--script=/tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:aot_inductor_package --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158270
Approved by: https://github.com/desertfire
Summary: As above; this also changes a number of the build files for the better.
Test Plan:
Internal and external CI.
Also ran `buck2 build fbcode//caffe2:torch` and it succeeded.
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D78016591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
The general context for the upcoming stack of commits is that I am
attempting to "pipeline" AOTAutograd. Instead of having function f call
function g, which is the next "stage" of compilation, f should return
its outputs, which are then piped to g for the next stage. This will
make it easier to implement an early-exit / resume pipeline without
forcing a callback structure, which is good for export-style use cases.
It also reduces the size of our stack traces, which makes tools like
Perfetto happy.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
Fixes #124435
This updates the torch.histogramdd documentation to correctly state that bins are inclusive of their left edges, not exclusive as currently written. There was a previous PR addressing this but it was closed due to inactivity. This picks that up and applies the fix.
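A quick illustration of the documented behavior (values chosen arbitrarily): every bin includes its left edge, and the last bin also includes its right edge.
```python
import torch

values = torch.tensor([[0.0], [0.5], [1.0]])
hist, edges = torch.histogramdd(values, bins=[2], range=[0.0, 1.0])
print(edges)  # one bin-edges tensor per dimension: 0.0, 0.5, 1.0
print(hist)   # tensor([1., 2.]): 0.0 lands in [0.0, 0.5); 0.5 and 1.0 land in [0.5, 1.0]
```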
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158275
Approved by: https://github.com/albanD
Summary: Add a flag, TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL, to force-inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force inlining may increase build time.
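A hedged usage sketch (the compiled function below is an arbitrary illustration; the flag name follows this PR): set the environment variable before compiling so the Inductor C++ backend picks it up.
```python
import os
os.environ["TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL"] = "1"  # opt in to force inlining

import torch

@torch.compile
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))  # CPU tensor, so the C++ backend generates the kernel
```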
Differential Revision: D77915987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire