As title: without this patch, fake-tensor propagation of a `nonstrict_trace`-ed function fails in AOTAutograd.
Tweaking the `allow_non_fake_inputs` flag on the fake tensor mode doesn't
quite work for AOTAutograd, which also needs to fake-tensor-propagate the
`nonstrict_trace`-ed function, but that happens _after_ Dynamo has handled the
`nonstrict_trace` processing and put the `flat_apply(...)` node into the graph.
So we can't easily enable the `allow_non_fake_inputs` flag on the current fake
mode just for the time when AOTAutograd processes a `flat_apply` node that came
from Dynamo's `nonstrict_trace` handling. After discussing with zou3519, I
decided to add a global `FakeTensorTLS` that contains an
`allow_non_fake_inputs_override` flag, and to patch the `nonstrict_trace`-ed
function so that it temporarily tweaks this flag during its execution.
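A minimal sketch of the mechanism described above, assuming names that follow this description (the actual implementation in `torch._subclasses.fake_tensor` may differ):
```python
# Minimal sketch, not the actual PyTorch implementation.
import contextlib
import threading

class FakeTensorTLS(threading.local):
    # When not None, overrides a fake mode's allow_non_fake_inputs setting.
    allow_non_fake_inputs_override = None

fake_tensor_tls = FakeTensorTLS()

@contextlib.contextmanager
def _allow_non_fake_inputs():
    # Temporarily flip the thread-local override, restoring it afterwards.
    prev = fake_tensor_tls.allow_non_fake_inputs_override
    fake_tensor_tls.allow_non_fake_inputs_override = True
    try:
        yield
    finally:
        fake_tensor_tls.allow_non_fake_inputs_override = prev

def patch_nonstrict_traced(fn):
    # Wrap the nonstrict_trace-ed function so the override is only active
    # while it runs (e.g., during AOTAutograd's fake-tensor propagation).
    def wrapper(*args, **kwargs):
        with _allow_non_fake_inputs():
            return fn(*args, **kwargs)
    return wrapper
```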
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147572
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950, #147571
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it more clearly conveys the intended behavior.
1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)
## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.
The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards). This patch takes a middle ground (a simple implementation with a bit of user labor; see the usage sketch after the list below), by
1. asking the user to provide pytree registration for non-proxy-able
input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
`torch._higher_order_ops.flat_apply` (which unflattens the inputs and
invokes the underlying function).
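The following is a hypothetical usage sketch based on the design above; the exact pytree registration requirements and decorator behavior may differ from the final API.
```python
# Hypothetical usage sketch; exact requirements may differ.
import torch
import torch.utils._pytree as pytree

class Config:
    def __init__(self, scale):
        self.scale = scale

# (1) Register the non-proxy-able input type with pytree.
pytree.register_pytree_node(
    Config,
    lambda c: ((c.scale,), None),               # flatten
    lambda children, _ctx: Config(*children),   # unflatten
)

# (2)/(3) Mark the function; Dynamo traces through the flattening and emits a
# flat_apply(...) node carrying the TreeSpec as a constant graph attribute.
@torch._dynamo.nonstrict_trace
def scale_tensor(x, cfg):
    return x * cfg.scale

@torch.compile(fullgraph=True)
def f(x):
    return scale_tensor(x, Config(2.0))
```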
## Next Steps
In subsequent patches, we will try to support the following:
- annotating class methods
- reads of global tensors
- inputs that contain `pytree.register_constant`-ed instances
- functions as inputs
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` instances as inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
This patch enables `flat_apply` to support certain non-Tensor output
types like containers and graphable types. This will in turn enable the
upcoming `mark_traceable` to support more output types.
The patch also exposes a `func_to_graphable` helper rather than having
users call the lower-level `pytree.flatten(ConstantFunction(...))`.
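For illustration, a rough sketch of the two forms; the import path and return values here are assumptions based on the description above, not a verified API.
```python
# Rough sketch; import path and return values are assumptions.
import torch.utils._pytree as pytree
from torch._higher_order_ops.flat_apply import ConstantFunction, func_to_graphable

def f(x):
    return x + 1

# Lower-level form users would otherwise write:
leaves, spec = pytree.tree_flatten(ConstantFunction(f))

# Convenience helper exposed by this patch:
leaves, spec = func_to_graphable(f)
```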
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146714
Approved by: https://github.com/zou3519
The bug was reported by an internal user.
AOTD classifies outputs that are aliases of graph intermediates into different categories:
...
- output is an alias of an intermediate whose base is already an output
- output is an alias of an intermediate whose base is not an output
If we look at this fn:
```python
def fn(x):
    ix = x + 1
    a = ix.transpose(0, 1)
    return a.detach(), a
```
Output 0 is a detach view of alias `a`, where `a` is already an output.
Output 1 is an alias of intermediate `ix`, so an additional output `ix` will be added internally.
Output 0's base is `TensorAlias(a)` in this case, but it could also be a plain `Tensor`.
Adding runtime unwrapping solves this problem.
Alternatively, we could track the base of `a.detach()` all the way back to `ix`; in that case the base would always be a `Tensor`, never a `TensorAlias`.
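A minimal sketch of the runtime unwrapping idea, where `TensorAlias` stands in for AOTD's internal alias wrapper (the field name and placement are assumptions, not the actual AOTD code):
```python
# Minimal sketch; TensorAlias and its `alias` field are assumptions.
def unwrap_base(base):
    # The recorded base of an aliased output may itself be wrapped in a
    # TensorAlias when that base is also a graph output; unwrap it at runtime
    # before regenerating the aliased view from it.
    return base.alias if isinstance(base, TensorAlias) else base
```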
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638
Approved by: https://github.com/bdhirsh
Summary:
Support the same functionality with acc_tracer disabled: add a new config for pre_grad add/remove_passes; at the front end it still uses the same interface.
Some minor updates in the pre_grad passes to make sure the passes run in the desired order; after the added passes, we still run passes like remove_noops at the end.
Test Plan: added a new UT; please see the stacked diff for the add-pass tests (TODO: update diff link)
Differential Revision: D68909278
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064
Approved by: https://github.com/frank-wei
- Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into the `c10/metal/indexing.h` header
- The initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel something as simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
  StridedTensor<T> input(input_data, sizes, input_strides, ndim);
  input[thread_index] = val;
}
```
But though it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on the M2 GPU are really slow.
- Solved the performance issue by implementing 3 flavors of the same shader: `dense`, which is used when both input and mask are dense tensors of the same size; `broadcast`, which is used when `mask` is expandable into the input tensor along its leading dimensions; and `strided`, which is a general-purpose fallback that still computes the position in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors. (A rough sketch of the dispatch logic follows below.)
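A rough sketch of how the three flavors might be selected (illustrative Python pseudocode; the real dispatch lives in the MPS backend and the exact conditions are assumptions):
```python
# Illustrative dispatch sketch; conditions are assumptions, not the real MPS code.
def pick_masked_fill_shader(inp, mask):
    if inp.is_contiguous() and mask.is_contiguous() and inp.shape == mask.shape:
        return "dense"      # both tensors dense and of the same size
    if mask.shape == inp.shape[inp.dim() - mask.dim():]:
        return "broadcast"  # mask expandable over the input's leading dims
    return "strided"        # general-purpose fallback
```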
Performance measured on an M2 Pro through different iterations of the same shader:
| dtype | MPS | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| ------ | ------ | ------ | ------ | ------ | ------ |
| float32 | 2.8 msec | 41.6 msec | 26.9 msec | 5 msec | 2.4 msec |
| float16 | 1.86 msec | 38.2 msec | 26.6 msec | 4.6 msec | 1.9 msec |
| bfloat16 | 1.86 msec | 38.3 msec | 26.6 msec | 4.6 msec | 1.9 msec |
And the benchmark script:
```python
import torch
from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer
def bench_mask_fill(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt="x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        globals={'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()


if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
Fixes #147924
We were using the wrong FunctionalTensorMode to construct
FunctionalTensors. FunctionalTensors modify the FunctionalTensorMode on
construction, so that led to the wrong FunctionalTensorMode being
modified. This PR threads the FunctionalTensorMode through correctly.
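For illustration, the shape of the bug and of the fix (all names below are hypothetical placeholders, not the actual functionalization internals):
```python
# Hypothetical illustration of the bug pattern; names are placeholders.

# Buggy shape: grab whatever FunctionalTensorMode happens to be active, so
# construction mutates the wrong mode's state.
def wrap_buggy(tensor):
    mode = current_functional_mode()           # hypothetical lookup
    return make_functional_tensor(tensor, mode)

# Fixed shape: thread the intended mode through explicitly from the caller.
def wrap_fixed(tensor, mode):
    return make_functional_tensor(tensor, mode)
```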
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147925
Approved by: https://github.com/bdhirsh
The default checkout action doesn't use more processes, possibly because most GitHub-provided runners only have 2 CPUs, but our runners have more than that, so we might as well use them.
Generally this cuts maybe 1 min off of checkout time.
Changed the checkout from pytorch/pytorch@main to pytorch/pytorch@my branch to test on 249a936998e66cc0d6ad8664e0e93ec1b9432a8b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147652
Approved by: https://github.com/ZainRizvi
Resolves https://github.com/pytorch/pytorch/issues/146767.
May also resolve https://github.com/pytorch/pytorch/issues/147584.
### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:
1. If the user does not use random ops on DTensor, there's no need to init the DTensor RNG, which currently requires a CUDA device to be present.
2. This complies with the zero-communication semantics of `src_data_rank=None` shard distribution.
Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor.
### Consequence
DTensor RNG initialization is delayed until the first DTensor random op call or a call to `torch.distributed.tensor.random.manual_seed` (see the sketch below).
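For illustration, a minimal sketch of the delayed-initialization pattern this describes (names and placement are illustrative, not the actual DTensor internals):
```python
# Illustrative sketch only; not the actual torch.distributed.tensor code.
_rng_tracker = None

def _get_rng_tracker(device_mesh):
    global _rng_tracker
    if _rng_tracker is None:
        # Constructed only on the first DTensor random op (or manual_seed),
        # so distribute_tensor alone never requires a CUDA device.
        _rng_tracker = OffsetBasedRNGTracker(device_mesh)
    return _rng_tracker
```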
### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`
Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
Triton introduced checks for bitcasts where the casted value does not fit into the target type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type used for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare-and-swap-with-index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed, assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast, I suppose there could be.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
## Before
Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making them a no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The Python hook and the hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
    def unpack(self):
        if self.hook:
            return self.hook(self.packed_data)
        else:
            return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```
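For context, here is a small example of the kind of saved-tensor-hooks usage this benefits; `torch.autograd.graph.saved_tensors_hooks` is the public API, and the CPU-offloading hooks are only an illustration:
```python
import torch

def pack_to_cpu(t):
    # Runs at forward time when a tensor is saved for backward.
    return (t.device, t.detach().to("cpu"))

def unpack_from_cpu(packed):
    # Runs at backward time; with this change CA can capture the hook in its
    # graph and execute it progressively, instead of unpacking all saved
    # tensors up front.
    device, t = packed
    return t.to(device)

x = torch.randn(1024, 1024, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x * x).sum()
y.backward()
```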
When directly executing the CA graph (without torch.compile-ing it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than the eager autograd engine's, since we can now delay the unpacking of saved activations into the AOT backward's execution.
All tests pass when running the CA graph directly; the remaining issues are in Dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
Resubmission of #144974 which was reverted for unrelated reasons.
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, and thus this needs to be taken care of in software.
Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs: with one block launched per SM but a few SMs occupied, the kernel runs as one nearly full wave plus a nearly empty second wave. This "wave quantization" can double the latency of the matmul kernels.
While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.
I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
Split test_transformers.py into test_transformers.py and test_transformers_privateuser1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with cuda test cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441
Approved by: https://github.com/drisspg
This is part of the goal "for some large number Z, make sure the error messages are readable English": beginning to audit all `unimplemented` sites and making sure that all messages are at least readable English. Hints may not necessarily be provided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385
Approved by: https://github.com/jansel
Summary:
# Why
Enable us to set the kBatch parameter, rather than baking it in.
Especially for larger splitK scenarios, this can yield very good performance (up to 1.5x vs hipblaslt from initial tests).
## Why like this
The obvious question is: why not add this to the op itself, and maybe even into the template/kernel? That would simplify the code.
The choice to have it as a "runtime" param that we fix is to be able to reuse the compiled CK `.so` libraries: multiple choices of kBatch can now be used with the exact same `.so`, since the shared library does not depend on kBatch but takes it as a parameter.
# What
- Copy the CUTLASS approach for swizzle: have a "runtime" arg that we pass in but that is really choice-dependent
- Pipe everything through from the template and kernel
- Hard-code it to kBatch=1 for now (same as before, just now settable)
This is part of a series of Diffs, where next we need to figure out
1. how to filter out ops + kBatch combinations that don't work
2. how to set this better for splitK scenarios (hand-written heuristic)
Test Plan:
(with minor modifications)
```
# show it working with AOTI
buck2 run mode/opt-amd-gpu //scripts/henrylhtsang/repros:aot
```
```
# show it working with inductor only
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```
Differential Revision: D70200008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147885
Approved by: https://github.com/ColinPeppler