pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
rzou	fddaa2958b	[SkipFiles] Some more cleanup (#147012 ) I think these are all no-ops. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/147012 Approved by: https://github.com/yanboliang ghstack dependencies: #147016	2025-02-13 01:18:47 +00:00
rzou	87ebd77b34	Add some more docs to trace_rules.py (#147016 ) After discussing with Yanbo we wanted to record the behavior down so we don't need to rederive them in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147016 Approved by: https://github.com/yanboliang	2025-02-13 01:18:39 +00:00
Animesh Jain	d6513f3246	[dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819 ) This PR adds support for list subclasses. Among other things are 1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it. 2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`). 3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods. If this PR is hard to review, please let me know. I can break it into several small PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819 Approved by: https://github.com/StrongerXi, https://github.com/jansel	2025-02-12 17:46:02 +00:00
rzou	5235a18cd6	[SkipFiles] remove some more stuff from MOD_SKIPLIST (#146876 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146876 Approved by: https://github.com/anijain2305 ghstack dependencies: #146854	2025-02-11 15:00:56 +00:00
rzou	a7fe384d0e	Remove torch._higher_order_ops from MOD_SKIPLIST (#146853 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146853 Approved by: https://github.com/williamwen42	2025-02-11 04:38:26 +00:00
rzou	275c034b16	[SkipFiles] remove some stuff from MOD_SKIPLIST (#146854 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146854 Approved by: https://github.com/yanboliang, https://github.com/anijain2305	2025-02-11 01:34:46 +00:00
Guilherme Leobas	8603a1c870	Suport generators (#141055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141055 Approved by: https://github.com/zou3519	2025-02-08 22:42:12 +00:00
eellison	92b7e610ab	[Inductor changes] Invoke Quant (#139102 ) Adds a `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0). The primary motivations are - Unifying scattered reasoning for quant operators throughout the code base - Easy of pattern matching - see this very large pattern match expression [here](`949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426)`. Compared to the pattern I have in the tests: ``` @register_graph_pattern( CallFunction( torch.ops.aten.mm, CallFunction( torch.ops.higher_order.invoke_quant, Ignored(), Ignored(), Ignored(), scheme="nf4", ), Arg(), ), pass_dict=test_pass, ) ``` - Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul. Example graph: ``` Python ===== AFTER POST GRAD ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(args, kwargs, quant_options=self) # type: ignore[call-arg] repeated_subgraph0 = self.repeated_subgraph0 invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4'); repeated_subgraph0 = arg0_1 = arg1_1 = None return (invoke_quant,) class repeated_subgraph0(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(args, *kwargs, quant_options=self) # type: ignore[call-arg] mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1); arg0_1 = None add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1); mul = arg1_1 = None return add ``` The schema for `invoke_quant` is `torch.ops.higher_order.invoke_quant(subgraph, args, scheme=None)` where the scheme will not always be present. I wasn't sure exactly how the inductor specific configurations like `codgen_in_low_precision` should be passed through. I didnt want to stuff them all in as kwargs, and I didn't want to have them affect pattern matching. So they will be stored as meta of the node itself. And, following that, I wanted the invocation of the hop to match how it will show up in the graph. So I decided to have it be an object that is then invoked for the tracing. ``` invoke_quant = InvokeQuant(codegen_low_precision=True) invoke_quant(gn, (x, y), scheme="nf4") ``` Todo - not require the packing of args in a tuple, will do following https://github.com/pytorch/pytorch/pull/139162. Feedback welcome. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102 Approved by: https://github.com/Chillee	2025-02-08 19:30:19 +00:00
Animesh Jain	5f53889850	[dynamo][builtin-skipfiles-cleanup] Remove inspect (#146116 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146116 Approved by: https://github.com/williamwen42, https://github.com/zou3519, https://github.com/jansel ghstack dependencies: #146322	2025-02-04 03:36:07 +00:00
Nikita Shulga	e56dcf2772	[CPUInductor] Fix SVE256 detection (#146207 ) This PR removes `torch.cpu._is_arm_sve_supported()` and replaces is with stable `torch.backends.cpu.get_cpu_capability()` I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced duplicate, but slightly different API for detecting CPU architectures, which resulted in runtime crashes on system that do support SVE128, rather than SVE256 Fixes https://github.com/pytorch/pytorch/issues/145441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207 Approved by: https://github.com/angelayi	2025-02-01 18:51:34 +00:00
Animesh Jain	781aceee9c	[dynamo] Revert abc change due to internal failures (#146177 ) xref - https://www.internalfb.com/tasks/?t=191383874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146177 Approved by: https://github.com/StrongerXi ghstack dependencies: #146141	2025-01-31 21:28:06 +00:00
Animesh Jain	667b94d1c2	[hotfix][dynamo] Skip linecache due to a flaky issue (#146141 ) A large number of jit + dynamo wrapped tests fail in linecache tracing. We need further debugging. Skipping for now to stem the bleeding. https://github.com/pytorch/pytorch/issues/146076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146141 Approved by: https://github.com/StrongerXi	2025-01-31 17:45:06 +00:00
Animesh Jain	4499d60d56	[dynamo][builin-skipfiles-cleanup] Remove types (#145909 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909 Approved by: https://github.com/zou3519 ghstack dependencies: #145856, #145875, #145878, #145892	2025-01-29 16:47:02 +00:00
Animesh Jain	3f77002b96	[dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875, #145878	2025-01-29 05:30:06 +00:00
Animesh Jain	236793684d	[dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875	2025-01-29 05:30:06 +00:00
Animesh Jain	a479656cd2	[dynamo][builtin-skipfiles-removal] Remove logging (#145875 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875 Approved by: https://github.com/williamwen42 ghstack dependencies: #145856	2025-01-29 05:29:58 +00:00
Animesh Jain	64ee57847b	[dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856 ) [dynamo][builtin-skipfiles-cleanup] Remove more builtins Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856 Approved by: https://github.com/zou3519	2025-01-29 05:29:47 +00:00
Animesh Jain	80a0412b76	[dynamo][builtin-skipfiles-cleanup] Remove posixpath (#145828 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145828 Approved by: https://github.com/zou3519 ghstack dependencies: #145744, #145753, #145826	2025-01-28 16:14:34 +00:00
Animesh Jain	6824a4a75d	[dynamo][builtin-skipfiles-cleanup] Remove re (#145826 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145826 Approved by: https://github.com/zou3519 ghstack dependencies: #145744, #145753	2025-01-28 16:14:34 +00:00
Animesh Jain	4307e6c008	[dynamo][builtin-skipfile-cleanup] Remove signal (#145753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145753 Approved by: https://github.com/zou3519 ghstack dependencies: #145744	2025-01-28 16:14:23 +00:00
Animesh Jain	5c5306e8bc	[dynamo][builtin-skiplist-cleanup] Remove weakref (#145744 ) WeakKeyDictionary already works very nicely with the UserDefinedObject Variable Tracker. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145744 Approved by: https://github.com/jansel	2025-01-28 07:55:12 +00:00
rzou	ea141d8134	functional compiled autograd (#144707 ) This PR squashes together the following commits: https://github.com/pytorch/pytorch/pull/144115 https://github.com/pytorch/pytorch/pull/143417 https://github.com/pytorch/pytorch/pull/143405 https://github.com/pytorch/pytorch/pull/143387 https://github.com/pytorch/pytorch/pull/143304 https://github.com/pytorch/pytorch/pull/143296 This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses. For more information, please read the commit messages for each PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707 Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel	2025-01-27 05:20:56 +00:00
Animesh Jain	53fc921ce2	[dynamo][trace-rules-cleanup] Remove functools from the Builtins skiplist (#145519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145519 Approved by: https://github.com/yanboliang, https://github.com/zou3519	2025-01-24 06:02:03 +00:00
PyTorch MergeBot	3f6cfd0156	Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417 )" This reverts commit `99dd1bf1b9`. Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:12 +00:00
Nikhil Gupta	41b38f755c	Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392 )" (#145505 ) https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue. 1. This reverts commit `0940eb6d44` (https://github.com/pytorch/pytorch/pull/145392 )and Fixes KleidiAI mirror issue. 2. KleidiAI is now cloned from github mirror instead of arm gitlab Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2 Fixes https://github.com/pytorch/pytorch/issues/145273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505 Approved by: https://github.com/malfet	2025-01-23 18:50:59 +00:00
Animesh Jain	5a18f1e1eb	[dynamo] Support fx map_aggregate (#145351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145351 Approved by: https://github.com/zou3519	2025-01-23 03:19:30 +00:00
rzou	99dd1bf1b9	[compiled autograd] stop specializing on metadata during initial trace (#143417 ) The previous PRs built up to this. We change compiled autograd's initial trace to stop baking in metadata. While tracing, we allocate some weirdly shaped tensors that we can put proxies on. The initial trace should not be accessing any metadata of these tensors (it will likely error out if it does because of how weird the shapes are). This involved fixing some various sites where we do specialize on the metadata, like: - we change CopySlices's apply_with_saved to proxy some calls into the graph (this change is fairly hard to split out by itself). - we stop calling InputBuffer::add - we delete the weird metadata from the graph so that no graph passes can make use of it. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417 Approved by: https://github.com/jansel, https://github.com/xmfan ghstack dependencies: #143296, #143304, #143387, #143405	2025-01-22 21:51:07 +00:00
albanD	0940eb6d44	Reverting the PR adding Kleidiai-based int4 kernels (#145392 ) Mitigation for https://github.com/pytorch/pytorch/issues/145273 Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392 Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai	2025-01-22 20:11:49 +00:00
rzou	1e8d6d6f0e	[SkipFiles] New modules added to torch.* are inlined by default (#145279 ) This PR: - makes it so that new modules added to torch are inlined by default - adds a list of the previously "skipped by default" modules to avoid regressing anything. This is a new MOD_SKIPLIST list that is consulted in trace_rules.check_file. - Follow-up work will go through this list, one-by-one, and try to delete modules. I think we should be able to delete almost everything, except for torch._dynamo. Test Plan - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145279 Approved by: https://github.com/yanboliang	2025-01-21 23:24:12 +00:00
Aaron Orenstein	a79100ab11	PEP585 update - torch/_dynamo (#145105 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105 Approved by: https://github.com/bobrenjc93	2025-01-18 20:47:11 +00:00
Yanbo Liang	43a00d73b3	[Trace Python Dispatcher] Support FuncTorchInterpreter (#144444 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144444 Approved by: https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #144439	2025-01-17 02:26:37 +00:00
James Wu	7d71ddbe5d	Add non_c_binding torch functions to allowlist for AOTAutogradCache, confirm no special handlers for them (#144802 ) Differential Revision: [D68173093](https://our.internmc.facebook.com/intern/diff/D68173093/) This diff allows any function in torch_non_c_binding_in_graph_functions to be safe to cache. These functions should be safe to cache because they are part of the torch API, and do not save global state (or if they do, dynamo creates unique guards around the constants they return). A function that's allowed in a dynamo graph is safe to cache for AOTAutograd purposes as long as: - It's functional (i.e. does not access global state); - or its value is constant folded away (and guarded against by dynamo) The tricky cases are functions that dynamo uses special handlers to track. These special handlers can sometimes close over stuff that's safe for dynamo locally, but isn't encoded anywhere when cached across processes. An example of this is `DTensor.from_local`, where various DeviceMesh information doesn't change in the same dynamo process, but can change across multiple processes. The handler for `DTensor.from_local` closes over these and dynamo creates a proxy for the function call. This is not safe to cache. That said, most special handlers are in fact functional and safe. So I add a unit test to test_trace_rules.py that confirms that any function with special handlers in dynamo added to this list needs to be audited to be safe to cache. The list of safe handlers there either: - Don't access global state; - Guard on global state; or - Always returns a constant that never changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/144802 Approved by: https://github.com/bdhirsh	2025-01-15 05:41:36 +00:00
Yanbo Liang	430d54ee20	[Dynamo] Add functorch C++ bindings as in graph functions (#144309 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144309 Approved by: https://github.com/williamwen42 ghstack dependencies: #144306, #144307, #144308	2025-01-07 22:25:01 +00:00
Yanbo Liang	d146763f6f	[Dynamo] Inline functions in torch._ops (#144308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144308 Approved by: https://github.com/williamwen42 ghstack dependencies: #144306, #144307	2025-01-07 22:25:01 +00:00
Yanbo Liang	242a4a3f83	[Dynamo] Inline functions in torch._functorch.pyfunctorch (#144307 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144307 Approved by: https://github.com/williamwen42 ghstack dependencies: #144306	2025-01-07 22:24:53 +00:00
Yanbo Liang	4417be65e5	[Dynamo] Inline functions in torch._functorch.autograd_function (#144306 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144306 Approved by: https://github.com/williamwen42	2025-01-07 22:24:46 +00:00
Sam Ginzburg	ec1f56fdcf	[user triton] add support for prune_configs_by in @triton.autotune (#142207 ) This PR adds support for prune_configs_by in the @triton.autotune decorator [docs](https://triton-lang.org/main/python-api/generated/triton.autotune.html#triton.autotune). Supporting this lets users reduce autotuning time by running user-supplied code (early_config_prune, perf_model) to prune the provided list of configs. We implement this by realizing args/kwargs in call_triton_kernel(...), and then calling kernel.prune_configs(...). Pull Request resolved: https://github.com/pytorch/pytorch/pull/142207 Approved by: https://github.com/zou3519, https://github.com/aakhundov	2025-01-04 03:50:28 +00:00
Animesh Jain	087c625261	[dynamo] Trace torch.typename (#144163 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144163 Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #144129, #144130, #144141, #144158	2025-01-04 02:52:58 +00:00
Xiaodong Wang	0a94bb432e	[ROCm] CK Flash Attention Backend (#143695 ) Replace https://github.com/pytorch/pytorch/pull/138947 for re-import. Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695 Approved by: https://github.com/malfet Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com>	2025-01-03 22:01:36 +00:00
Nikhil Gupta	94737e8a2a	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-20 19:32:03 +00:00
Guilherme Leobas	673cc88fd6	Add support for `contextmanager` in Dynamo (#136033 ) Fixes #130559 * Intro This PR adds support for `@contextmanager` in Dynamo. We chose to limit the scope of this work to only `@contextmanager` and plan to handle generators fully in #141055 (still in draft). * Motivation Dynamo lacks support for generator functions. When it encounters one, it traces it as if it were a regular function. This is problematic because it can lead to incorrect behavior. To illustrate, consider the test case below: ```python import torch import contextlib @contextlib.contextmanager def set_default_dtype(dtype): old_dtype = torch.get_default_dtype() try: torch.set_default_dtype(dtype) yield finally: torch.set_default_dtype(old_dtype) @torch.compile(backend="eager", fullgraph=True) def fn(): with set_default_dtype(torch.float64): x = torch.tensor([3.0, 3.0 + 5.0j]) return x ``` Before this work, Dynamo would not stop at the `yield`, and the graph produced would contain both calls to `set_default_dtype` executed one after the other. This is incorrect because the context manager should execute code before and after the `yield`. * List of changes `YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE` behaves differently in a generator function. Unlike regular functions, where `RETURN_VALUE` indicates the final result, in generators it signifies that the generator is exhausted and implicitly raises `StopIteration`. A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable` was introduced to handle `@contextmanager`. This variable tracker acts not just as a wrapper for the original function but also maintains an internal `tx` (InstructionTranslator) object to suspend and return control flow to the parent tracer when a `yield` is encountered. * Corner cases Returning a context manager from a compiled function is not supported. This would require PyTorch to synchronize the generator state between Dynamo and the interpreter. Any attempt to return it will result in an `IncorrectUsage` exception. Graph breaks require special handling as well. In the event of a graph break, the frame associated with the context manager is skipped, and the context manager runs in eager mode. * This PR is breaking my code There is a configuration flag (`enable_trace_contextlib`) that can be set to `False` to disable tracing context managers. If this still causes crashes, please revert this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033 Approved by: https://github.com/zou3519	2024-12-20 12:02:20 +00:00
PyTorch MergeBot	8136daff5a	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit `4b82251011`. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))	2024-12-19 23:33:17 +00:00
Nikhil Gupta	4b82251011	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-19 18:51:26 +00:00
PyTorch MergeBot	14fe1f7190	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit `d3ff2d42c2`. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))	2024-12-19 01:05:11 +00:00
Nikhil Gupta	d3ff2d42c2	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-18 22:30:07 +00:00
Yidi Wu	1e201422ed	[export] add is_exporting flag (#142425 ) We added an is_export flag under torch.compiler.is_exporting. This comes handy when we try to do some special logic in user-level and system-level (e.g. in upper of the stack). In increasing-scope: - `_is_fx_tracing` is set to True when we use under symbolic_trace or make_fx. - `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and set _is_fx_tracing to be True. - `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425 Approved by: https://github.com/avikchaudhuri	2024-12-18 21:36:28 +00:00
PyTorch MergeBot	969b07b96f	Revert "[ROCm] CK Flash Attention Backend (#138947 )" This reverts commit `500d02921b`. Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))	2024-12-17 16:46:57 +00:00
Andy Lugo	500d02921b	[ROCm] CK Flash Attention Backend (#138947 ) Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947 Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian Co-authored-by: Xiaodong Wang <xw285@cornell.edu>	2024-12-17 02:18:07 +00:00
Michael Lazos	9701c50bdc	[Dynamo] Add missing tensor builtins to allowed functions (#142841 ) Fixes https://github.com/pytorch/pytorch/issues/141232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142841 Approved by: https://github.com/yanboliang	2024-12-12 06:38:19 +00:00
Michael Lazos	de313f1155	[foreach_map] Initial foreach map HOP impl for inference (#142098 ) This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops. This PR utilizes PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` provides a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (ie iterates over lists of operands applying the op elementwise). The higher order op is passed through the stack down to inductor where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interfaces for those tests and then run all of the existing foreach tests on foreach_map. TODO before landing: * Add tests for general functions * Test warning if unsupported op will block fusion Followups: * I need to add tests for backwards (this will be a followup PR because backwards will require other work as well) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098 Approved by: https://github.com/eellison	2024-12-11 21:32:11 +00:00

1 2 3 4 5 ...

329 Commits