Summary:
When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze.
This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view.
The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues.
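For context, a minimal sketch of the capture workflow that produces such a trace (the freed-allocation filter itself is applied in the trace viewer):
```python
import torch

# Record allocator events, create and free an allocation, then dump a
# snapshot for the visualizer. Freed allocations like `x` are the ones
# the new filter can hide from the plot.
torch.cuda.memory._record_memory_history()
x = torch.randn(1024, 1024, device="cuda")
del x
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```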
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752
Approved by: https://github.com/zdevito
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
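A minimal sketch of the new helpers (exact argument conventions assumed from the description above):
```python
import torch.utils._pytree as pytree

leaf = pytree.treespec_leaf()                     # replaces LeafSpec()
args_spec = pytree.treespec_tuple([leaf, leaf])   # treespec for *args
kwargs_spec = pytree.treespec_dict({"x": leaf})   # treespec for **kwargs

assert leaf.is_leaf()                # instead of isinstance(spec, LeafSpec)
assert args_spec.num_children == 2   # instead of len(spec.children_specs)
```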
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
This PR enables additional Inductor unit tests for Intel GPU. Due to the increased number of test cases, the number of runners has been extended from 8 to 12 to prevent CI timeouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166047
Approved by: https://github.com/jansel
Co-authored-by: Deng, Daisy <daisy.deng@intel.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Fixes #100842
Disable jiterator for complex tan and tanh kernels due to accuracy issues, matching the existing approach used for acos, acosh, asin, and asinh. This reverts to the thrust implementation, which provides better numerical accuracy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165250
Approved by: https://github.com/ezyang
Fixes #161366
All four dimension combinations are supported: 2d-2d, 2d-3d, 3d-3d, and 3d-2d. The corresponding test cases in test_matmul_cuda pass for both the forward and backward pass.
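For illustration, the four combinations exercised (a hypothetical sketch, not the actual test code):
```python
import torch

def check(a_shape, b_shape):
    # One forward and backward pass through matmul for a shape pair.
    a = torch.randn(*a_shape, device="cuda", requires_grad=True)
    b = torch.randn(*b_shape, device="cuda", requires_grad=True)
    (a @ b).sum().backward()

check((8, 16), (16, 32))        # 2d-2d
check((8, 16), (4, 16, 32))     # 2d-3d
check((4, 8, 16), (4, 16, 32))  # 3d-3d
check((4, 8, 16), (16, 32))     # 3d-2d
```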
The CK path is enabled for gfx942 and gfx950.
ToDo: enable support on gfx90a. The CK kernel used in this commit produces a GPU error there and, based on profiler results on gfx90a, may require a different CK kernel config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child process to inherit them.
However, this did not work correctly with `start_method="forkserver"`, because the subprocesses actually inherit their bindings from the forkserver middleman process. In that case, the inherited affinity is incorrect for all but the first subprocess: the forkserver process is created lazily, so it inherits and then sticks with the bindings intended for the first subprocess.
# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.
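A minimal sketch of the `Callable` wrapping approach (hypothetical helper name; the real logic lives in the NUMA binding utilities):
```python
import os
from typing import Callable

def _bind_then_call(entrypoint: Callable, cpu_ids: frozenset[int]) -> Callable:
    # Set affinity inside the child process itself, so it no longer matters
    # which process (spawn parent or forkserver) the child inherits from.
    def wrapped(*args, **kwargs):
        os.sched_setaffinity(0, cpu_ids)  # 0 = the calling process
        return entrypoint(*args, **kwargs)
    return wrapped
```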
Hopefully this will be the last necessary iteration.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Verified flops/sec and memory locality wins on several different types of jobs
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn
More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)
# Later PR
Update all the documentation when we're confident this has stabilized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
It is recommended to use `python -m pip install --no-build-isolation .` instead of `pip3 install --no-build-isolation .`: most of us work in a virtual environment, and the latter may resolve to the system `pip3` rather than the pip of the active conda or uv environment. Using `python -m pip` keeps pip consistent with the Python interpreter in use, and also matches how `torch` is installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166235
Approved by: https://github.com/fffrog, https://github.com/ezyang
When the slice index is a tensor, we decompose it into an `.item()` call and pass the resulting unbacked symbol to the slice to avoid a DDE (data-dependent error).
The diff also fixes an existing bug in `codegen_dynamic_slice_size` in the cpp wrapper: a `+1` should be a `-1`, making it match the Python codegen.
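A small sketch of the pattern this enables (shapes assumed for illustration):
```python
import torch

@torch.compile(fullgraph=True)
def f(x, end):
    # `end` is a 0-d tensor: it is decomposed into an .item() call and the
    # resulting unbacked symbol is passed to the slice, avoiding a DDE.
    return x[:end]

print(f(torch.randn(16), torch.tensor(5)))
```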
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165074
Approved by: https://github.com/Lucaskabela
Summary:
Previously we did not correctly handle the closure tuple when it had content. This adds code for serializing the tuple and merging it into the guard manager's local scope.
Test Plan:
pytest test/dynamo/test_aot_compile.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166351
Approved by: https://github.com/Lucaskabela
The global pytree registration of `BlockMask` was added in https://github.com/pytorch/pytorch/pull/166045
In general, people assume `BlockMask` is a leaf, so the global registration could lead to unexpected failures when calling `tree_map()` on a `BlockMask`, since it now flattens all the way down.
Therefore, we remove the global registration but keep the `_flatten()` and `_unflatten()` classmethods. Users can easily do a local registration when needed, as sketched below.
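A sketch of such a local registration (assuming the retained classmethods):
```python
import torch.utils._pytree as pytree
from torch.nn.attention.flex_attention import BlockMask

# Opt in explicitly instead of relying on a global registration.
pytree.register_pytree_node(
    BlockMask,
    BlockMask._flatten,
    BlockMask._unflatten,
)
```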
In pytorch:
```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export
```
In torchtitan:
```
python -m tests.integration_tests.run_tests ./outputs --test_suite features --ngpu 8
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166434
Approved by: https://github.com/wwwjn
Fixes #165810
If we regenerate a node during functionalization, we override the "stack_trace", "custom", and "seq_nr" metadata of the regenerated node with the node meta of the original node.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_replay_view
python test/functorch/test_aotdispatch.py TestAOTAutogradWithDynamo.test_duplicated_arguments_on_tensor_overlap
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166200
Approved by: https://github.com/bdhirsh
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.819 ms | 3.264 ms |
| TFLOPs | 268.652 | 158.731 |
We can see that the Varlen runtime is roughly 4x faster than padded SDPA.
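For illustration, the variable-length batch under these settings might be constructed as follows (a hypothetical sketch; the benchmark script is not shown here):
```python
import torch

torch.manual_seed(0)
batch_size, max_seq_len = 8, 2048

# Random sequence lengths in multiples of 64, up to max_seq_len.
seq_lens = torch.randint(1, max_seq_len // 64 + 1, (batch_size,)) * 64

# Cumulative offsets into the packed (padding-free) batch.
cu_seqlens = torch.zeros(batch_size + 1, dtype=torch.int32)
cu_seqlens[1:] = seq_lens.cumsum(0)
```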
**Testing**
Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.
For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164504
Approved by: https://github.com/drisspg
Fixes #165944
**Summary**
Today, if we attempt to run flash attention with `batch_size=0`, we get the error `Runtime Error: batch size must be positive`. This PR fixes that by returning early with empty tensors in the forward and backward.
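A minimal sketch of the now-working call (shapes assumed for illustration):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# batch_size=0: previously raised "batch size must be positive" on the
# flash-attention path; after this PR it returns empty tensors.
q = torch.randn(0, 16, 128, 64, device="cuda", dtype=torch.float16)
k, v = q.clone(), q.clone()
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([0, 16, 128, 64])
```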
**Test plan**
`python test/test_transformers.py -k test_scaled_dot_product_attention` - added case for batch_size=0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166318
Approved by: https://github.com/drisspg
Fixes #159139
## The Cause
The bug occurs because the `OptimizedModule` wrapper in `torch._dynamo.eval_frame` doesn't forward the `__len__` method. This causes Python's `bool()` check to fall back to default object truthiness (always `True`) instead of correctly evaluating containers with `len() == 0` as `False`.
## The Fix
A very simple fix: I added a `__len__` method to the `OptimizedModule` class in `torch._dynamo.eval_frame` that delegates the call to the original module:
```python
def __len__(self):
"""
Proxy the len() call to the original module to fix truthiness checks.
"""
return len(self._orig_mod)
```
This fixes the issue; the reproduction script below now works as expected.
## Reproduction Script
```python
import torch
import torch.nn as nn
# Create an empty nn.ModuleList
original = nn.ModuleList()
# Compile it using torch.compile
compiled = torch.compile(original)
# Compare their boolean evaluations
print(f"bool(original): {bool(original)}")
print(f"bool(compiled): {bool(compiled)}")
# Trigger failure if they differ
assert bool(original) == bool(compiled), "BUG: truthiness behavior mismatch after compilation"
```
## Output
```
bool(original): False
bool(compiled): False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159208
Approved by: https://github.com/andrewboldi, https://github.com/Lucaskabela
Co-authored-by: pushkar-hue <pushkarsharma.rtm@gmail.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Random in-place operations (normal_, uniform_, exponential_, bernoulli_, random_) were silently failing on non-contiguous tensors on macOS < 15.0.
* Added a `needsGather` check and scatter-back logic to handle non-contiguous output tensors, following the pattern used in PointwiseOps.
* Added tests to confirm these ops now work on non-contiguous tensors.
* Removed the pre-macOS15 xfail for test_Dropout.
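A small sketch of the previously broken pattern (illustrative shapes):
```python
import torch

x = torch.zeros(4, 4, device="mps")
col = x[:, 1]   # non-contiguous view
col.normal_()   # previously a silent no-op on macOS < 15.0; now scattered back
print(x[:, 1])
```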
Fixes #165257 and #124029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165267
Approved by: https://github.com/kulinseth, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Summary: To fix https://github.com/pytorch/pytorch/issues/159400. Currently, test_aoti_abi_check and test_aoti_inference need to be built in two passes: first build pytorch using the regular `python setup.py develop`, then build with `CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop`. This is cumbersome. Fix by rewriting CMakeLists.txt for test_aoti_inference as a one-pass build that runs AOTI to compile models at test time. Also update the CI test script to get rid of the two-pass build. Since test_aoti_abi_check is not AOTI specific, it is no longer guarded by BUILD_AOT_INDUCTOR_TEST.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164277
Approved by: https://github.com/janeyx99
### Summary
Improves validation of `torch.linalg.eig` results by verifying the eigen decomposition identity **A v − v λ = 0**.
### Motivation
Eigenvectors are not unique, and numerical differences between backends (cuSOLVER, MAGMA, CPU)
can cause false test failures. This PR replaces direct elementwise comparisons with a mathematical
identity check, improving robustness across devices.
### Details
- Introduces `fulfills_eigen_decomposition_identity()` in `test_eig_compare_backends()` to validate the eigen equation.
- Uses CPU matmul for high-precision verification.
- Handles zero-sized matrices explicitly.
- Tolerances derived from numerical comparisons between cuSOLVER and NumPy.
See discussion: [dev-discuss.pytorch.org link](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/6)
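A condensed sketch of the identity check (tolerances and helper body illustrative, not the PR's exact values):
```python
import torch

A = torch.randn(5, 5, device="cuda")
w, V = torch.linalg.eig(A)

# Verify A v - v lambda = 0 column-wise, i.e. A @ V == V @ diag(w).
# The matmul is done on CPU for high-precision verification.
lhs = A.cpu().to(V.dtype) @ V.cpu()
rhs = V.cpu() * w.cpu().unsqueeze(-2)  # scales column j of V by w[j]
torch.testing.assert_close(lhs, rhs, rtol=1e-4, atol=1e-5)
```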
### Impact
- Improves test stability and correctness across eig backends.
- No change to public API.
- All tests pass; lintrunner reports no issues.
- Enables introduction of new eig backends without false test failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166322
Approved by: https://github.com/lezcano
This PR refactors the autocast context manager in `autocast_mode.py` to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types; now a single mapping, `device_supported_dtypes`, associates device types with their supported dtypes, and the validation logic is unified.
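A minimal sketch of the unified check (the dtype sets here are illustrative, not the authoritative lists from `autocast_mode.py`):
```python
import warnings
import torch

device_supported_dtypes = {
    "cpu": {torch.bfloat16, torch.float16},
    "cuda": {torch.bfloat16, torch.float16, torch.float32},
}

def check_autocast_dtype(device_type: str, dtype: torch.dtype) -> bool:
    # Single validation path shared by all device types.
    if dtype not in device_supported_dtypes.get(device_type, set()):
        warnings.warn(
            f"In {device_type} autocast, but the target dtype is not "
            "supported. Disabling autocast."
        )
        return False
    return True
```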
**The former PR #163446 was merged but reverted due to failing `openreg`-related CI tests.**
This PR additionally makes slight modifications to some test assertions so the CI tests pass; CI failed on assertions that matched the exact error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
with self.assertWarnsRegex(
AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```
Sorry for the inconvenience again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/albanD
RMS/layer norm backward generates two kinds of reductions:
- the reduction computing dx, which reduces across the hidden dimension (in the context of a transformer)
- the reductions computing dw/db, which reduce across the BxT (batch size x sequence length) dimension
These two sets of reductions share common input buffers, but Inductor cannot fuse them because of their different loop orders.
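For intuition, the two reduction patterns written out in eager PyTorch (shapes and formulas simplified):
```python
import torch

B, T, H = 32, 128, 384          # batch, sequence length, hidden dim
dy = torch.randn(B * T, H)
x_hat = torch.randn(B * T, H)   # normalized input
w = torch.randn(H)

# Inner reduction (dx path): reduces across the hidden dimension, per row.
row_sums = (dy * w * x_hat).sum(dim=-1, keepdim=True)

# Outer reductions (dw/db): reduce across the B*T dimension, per column.
dw = (dy * x_hat).sum(dim=0)
db = dy.sum(dim=0)
```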
There are multiple sources of custom kernels that implement fused versions of this pattern (Liger-Kernel, quack, Paul Zhang's internal post). This PR enables Inductor to generate such kernels automatically.
The generated kernel is very similar to 33924d20b6/src/liger_kernel/ops/rms_norm.py (L114) .
To keep the implementation simple and performant, we enable this fusion only if the inner reduction (computing dx) is a persistent reduction. This should hold for representative inputs. The persistent reduction is critical for perf here: it ensures a loaded tensor does not need to be reloaded.
To make the inner reduction (computing dx) and the outer reductions (computing dw/db) fusible, the PR does the following:
1. Convert the outer reductions to pointwise by replacing the 'reduction' & 'store_reduction' nodes with a new type of node, 'partial_accumulate'. The new node collects the reduction type, buffer name, input of the reduction, etc., which is essential for proper codegen.
2. Do loop reordering (relying on the earlier loop-ordering-after-fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction.
3. Add the epilogues that may be needed at the end. E.g. the outer reduction may be followed by a division for a mean, or by a downcast if dw/db is in low precision (fp16/bf16).
Early benchmarking on H100 shows about a 2x speedup for both RMSNorm and LayerNorm backward for the shape (1152 * 500, 384) used in some internal models. Note that I manually disabled split reduction in this benchmarking, since otherwise the fusion would be skipped right now. The next PR will make the mix-order reduction compose better with split reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370
Approved by: https://github.com/jansel
ghstack dependencies: #166204
Summary:
Support deepseek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for the fine-grained mixed-precision framework that uses FP8 to train [DeepSeek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results.
DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency:
- Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`)
- Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`)
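For concreteness, the scale-tensor shapes implied by this scheme (layout assumed for illustration):
```python
import torch

M, N, K = 4096, 768, 512  # matching the test-plan shapes below

A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)  # activations
B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn)  # weights

scale_A = torch.ones(M, K // 128, device="cuda")         # one scale per 1x128 tile
scale_B = torch.ones(N // 128, K // 128, device="cuda")  # one scale per 128x128 block
```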
This diff enables Inductor users to replicate past successes with deepseek-style scaling and achieve higher numerical stability while increasing training efficiency.
NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, deepseek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run deepseek-style scaling.
NOTE: Accuracy for FP8 is unstable, even with high tolerances, which is why TritonBench benchmarks are unlikely to be accurate against a `torch` implementation.
Test Plan:
In OSS PyTorch, run
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/deepseek_style/deepseek_bench.log
```
Differential Revision: D83609850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404
Approved by: https://github.com/slayton58