pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Natalia Gimelshein	37c6087334	Add split-K control to cuBLAS reduced-precision settings (#164766 ) ## Summary - add a CuBLASReductionOption enum so the CUDA context can track reduced-precision and split-K options - extend the Python bindings, backend helpers, and docs to accept an optional allow_splitk argument for fp16/bf16 matmul controls - update cuBLAS/cuBLASLt call sites plus dynamo guards and tests to respect the new combinations ## Testing - python test/test_cuda.py TestCuda.test_cublas_allow_fp16_reduced_precision_reduction_get_set -v (fails: ModuleNotFoundError: No module named 'psutil') ------ https://chatgpt.com/codex/tasks/task_e_68e404623178832f8a3e1d34e1e175da Pull Request resolved: https://github.com/pytorch/pytorch/pull/164766 Approved by: https://github.com/malfet, https://github.com/albanD	2025-10-08 18:48:45 +00:00
Laith Sakka	0b85236477	Fix refine_ranges corner case (#164075 ) (#164846 ) Summary: Addresses https://github.com/pytorch/pytorch/issues/161360. u0>0 should update the range of u0 to start from [1, ..]; previously it did not, and this change fixes that. Test Plan: contbuild & OSS CI, see `27234792ad` D84038721 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164846 Approved by: https://github.com/izaitsevfb, https://github.com/ezyang	2025-10-08 18:42:37 +00:00
Janani Sriram	4c0fec3e4d	[Max Autotune][B200] Skip carveout tests (#164435 ) Summary: Skip sm `carveout` tests on B200, as carveout is currently unsupported. Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_honor_sm_carveout_with_triton_tma ``` Differential Revision: D83395610 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164435 Approved by: https://github.com/eellison	2025-10-08 18:39:43 +00:00
Ke Wen	5c827a4133	[SymmMem] Multi-root tile reduction (#164757 ) Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom): Perform multiple tile reductions concurrently, with each tile reduced to a separate root. - The number of concurrent reductions can be smaller than world size, i.e. roots can be a subset of all ranks. But all ranks are still required to call into this API. - Currently supports NVLink SHARP scope only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164757 Approved by: https://github.com/weifengpy, https://github.com/fegin ghstack dependencies: #162243	2025-10-08 17:28:00 +00:00
Sean McGovern	f332017294	C++ API handle optimizer defaults (#161825 ) Fixes #141884 This fixes the issue for all optimizers and parameter options. A member function `overwrite_from` is added to the optimizer base class. Each optimizer then implements this function for comparing their accepted parameters to defaults. A SFINAE approach to handle the different optimizer parameters generically (in optimizer.h only) was evaluated, but I think this is easier to review and maintain. This mirrors the Python API up to one edge case. An example of the edge case is provided below. Python can distinguish between 1) Key not present in dict = "not specified" and 2) Key present in dict = "explicitly set". The C++ implementation cannot. The issue hinges on whether or not to track if a particular parameter was set by the user explicitly or not (discrepancy in the case when the constructor default is explicitly passed in). To track this seems like it will take more intervention than would be worth it (modify TORCH_ARG to keep track, use std::optional for the parameter types, use bitset tracking) and was not pursued in the current PR. I'm happy to alter the design if appropriate. ### Example of edge case hinging on CONSTRUCTOR DEFAULTS vs OPTIMIZER DEFAULTS 1. CONSTRUCTOR DEFAULTS: These are the values you get when calling AdamOptions() AdamOptions().lr() = 0.001 AdamOptions().weight_decay() = 0 AdamOptions().eps() = 1e-08 2. OPTIMIZER DEFAULTS: These are the values the user chose when creating the optimizer User's optimizer defaults: optimizer.lr() = 0.005 optimizer.weight_decay() = 0.1 optimizer.eps() = 1e-07 3. THE PROBLEM SCENARIO: User wants to add a parameter group with explicit weight_decay=0.0 User sets: weight_decay(0) 4. THE CONFUSION: Constructor default weight_decay: 0 User's explicit weight_decay: 0 Are they equal? YES Since they're equal, our overwrite_from() logic thinks: "User didn't set weight_decay explicitly, use optimizer default" 5. CURRENT BEHAVIOR: Final weight_decay: 0.1 User expected: 0 Match? ❌ NO === KEY INSIGHT === Constructor defaults are built into the C++ class definition. Optimizer defaults are chosen by the user at runtime. We want to respect the user intention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161825 Approved by: https://github.com/janeyx99	2025-10-08 16:40:45 +00:00
eqy	0d39ecb2ce	[cuDNN][RNN] cuDNN RNN supports BFloat16 inputs since 9.13 (#164411 ) seems to work Pull Request resolved: https://github.com/pytorch/pytorch/pull/164411 Approved by: https://github.com/Skylion007	2025-10-08 15:26:50 +00:00
PyTorch MergeBot	fd4bde430a	Revert "list_stored_sd_metadata API. (#160610 )" This reverts commit `da903b6a8b`. Reverted https://github.com/pytorch/pytorch/pull/160610 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but flaky also on CUDA CI https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(distributed%2C%202%2C%203%2C%20linux.rocm.gpu.mi250.4%2C%20module%3Arocm%2C%20oncall%3Adistributed)&jobName=undefined&failureCaptures=distributed%2Fcheckpoint%2Ftest_list_stored_state_dict.py%3A%3ATestListStateDict%3A%3Atest_list_stored_sd_metadata ([comment](https://github.com/pytorch/pytorch/pull/160610#issuecomment-3382023022))	2025-10-08 15:10:38 +00:00
PyTorch MergeBot	b5e93ffdcf	Revert "Limit path search within range (#164581 )" This reverts commit `415e641572`. Reverted https://github.com/pytorch/pytorch/pull/164581 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))	2025-10-08 14:56:21 +00:00
PyTorch MergeBot	20082d7136	Revert "fix flex attention eager bwd: more rounding (#164317 )" This reverts commit `41808b2ba9`. Reverted https://github.com/pytorch/pytorch/pull/164317 on behalf of https://github.com/jeffdaily due to inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_score_mod4_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18330774537/job/52207370954) [HUD commit link](`41808b2ba9`) ([comment](https://github.com/pytorch/pytorch/pull/164317#issuecomment-3381812090))	2025-10-08 14:29:10 +00:00
Laith Sakka	7158aa22e8	remove more (#164753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164753 Approved by: https://github.com/aorenste, https://github.com/mlazos ghstack dependencies: #164664, #164665, #164667, #164668	2025-10-08 14:23:38 +00:00
Laith Sakka	2035f6b2e6	use check_size instead of check_is_size in ops.py (#164668 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164668 Approved by: https://github.com/angelayi ghstack dependencies: #164664, #164665, #164667	2025-10-08 14:23:38 +00:00
Mwiza Kunda	2b58adc3bd	[inductor][templates] Distinguish between kernel input nodes and codegen input nodes (#163752 ) If there is a single autotuner choice, the wrong type of input node is used to instantiate `TritonTemplateBuffer` through `TritonTemplateCaller.output_node`. This PR distinguishes the input nodes used in `AlgorithmSelectorCache.__call__` between the actual inputs passed to the kernel at runtime, vs the possibly viewed inputs that influence scheduling behaviour (e.g. `MemoryDeps`) and codegen. See the added unit test for more detail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163752 Approved by: https://github.com/eellison	2025-10-08 14:12:14 +00:00
angelayi	322091d8d8	[opaque_obj] Add make_fx tracing support (#163278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163278 Approved by: https://github.com/zou3519 ghstack dependencies: #163279, #163277	2025-10-08 09:09:16 +00:00
angelayi	2bb4e6876c	[opaque obj] Error for torch.library.custom_op infer_schema (#163277 ) Unsure how we can get infer_schema to infer the scriptObject type from just the type annotation, so for now will just error clearly and ask users to specify a schema. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163277 Approved by: https://github.com/zou3519 ghstack dependencies: #163279	2025-10-08 09:09:16 +00:00
angelayi	56ef7743fc	[opaque_obj] Add __eq__ and __deepcopy__ (#163279 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163279 Approved by: https://github.com/zou3519	2025-10-08 09:09:16 +00:00
Yuanyuan Chen	64108bdbed	[BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml (#164641 ) This PR removes `torch._cast_XXX` from generated OPs. They were deprecated in PyTorch 1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164641 Approved by: https://github.com/albanD, https://github.com/justinchuby	2025-10-08 08:27:58 +00:00
Maggie Moss	c855f8632e	Pyrefly suppressions 7/n (#164913 ) Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Almost there! Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164913 Approved by: https://github.com/oulgen	2025-10-08 07:27:17 +00:00
morrison-turnansky	12d2ef557f	Update round size with 1 division behavior (#162203 ) Have round size return the nearest power of 2 greater than or equal to size when there is 1 division. Fixes #161139 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162203 Approved by: https://github.com/ezyang	2025-10-08 06:41:46 +00:00
Ke Wen	19bf67be32	multimem reduce (#164517 ) Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op. The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517 Approved by: https://github.com/ngimel	2025-10-08 05:25:16 +00:00
Nicolas Macchioni	184817c7a8	locks + unit tests (#164636 ) Test Plan: ``` buck test fbcode//mode/opt caffe2/test/inductor:caching ``` Reviewed By: aorenste D83714690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164636 Approved by: https://github.com/aorenste	2025-10-08 04:34:22 +00:00
Pradeep Fernando	da903b6a8b	list_stored_sd_metadata API. (#160610 ) Summary: 1. Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load. 2. These use cases include data loader checkpoints and reading data for post-processing (when the original model definition is not available). 3. There, we have to use the saved checkpoint (metadata) as our source of truth. 4. This RFC proposal exposes the checkpoint metadata using a public API. In this proposal we expose the stored state-dict metadata (minus associated storage/chunk metadata). Chunk/storage details should not be exposed to users and are an implementation detail of the storage writer/reader. Test Plan: UT. Rollback Plan: Differential Revision: D80231457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610 Approved by: https://github.com/saumishr	2025-10-08 04:33:51 +00:00
Maggie Moss	086dec3235	Pyrefly suppressions 6/n (#164877 ) Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Almost there! Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (5,064 ignored) Only four directories left to enable Pull Request resolved: https://github.com/pytorch/pytorch/pull/164877 Approved by: https://github.com/oulgen	2025-10-08 02:30:57 +00:00
Ke Wen	d444384003	[SymmMem] Tiled reduce (#162243 ) Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)` For now supports only: - NVSHMEM backed symmetric tensor; - 2D tensor and tile; - torch.float. Testing on the bottom-right quadrant: ``` rank 0: tensor([[0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0') PASSED ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162243 Approved by: https://github.com/ngimel	2025-10-08 02:03:04 +00:00
PyTorch MergeBot	3040a5d294	Revert "[dynamo] Support torch.fx.traceback.annotate (#164678 )" This reverts commit `801e282f39`. Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to breaks executorch internally, see [D84068062](https://www.internalfb.com/diff/D84068062?entry_point=16) ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3379281844))	2025-10-08 01:49:34 +00:00
PyTorch MergeBot	97463d4cf3	Revert "Fix double dispatch to Python for detach (#163671 )" This reverts commit `c32118dc3e`. Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3379281422))	2025-10-08 01:46:45 +00:00
Howard Huang	c813617c53	[PP] Migrate other schedules to use PipelineScheduleRuntime (#164777 ) Second fix for https://github.com/pytorch/pytorch/issues/164756 Making all schedules execute using the same runtime has been a TODO. After this change, schedules use the same logic as `_PipelineScheduleRuntime`, which adds `UNSHARD` and `RESHARD` operations to the schedules and fixes the issue mentioned above. A test was failing after the conversion: - Fixed a gradient scaling issue for dWeight Pull Request resolved: https://github.com/pytorch/pytorch/pull/164777 Approved by: https://github.com/fegin ghstack dependencies: #164775	2025-10-08 01:45:57 +00:00
Howard Huang	e659661ffa	[PP] Fix FSDP unshard/reshard (#164775 ) First fix for https://github.com/pytorch/pytorch/issues/164756 In the pipeline IR we call `UNSHARD` and `RESHARD`, but there is a bug: `module.unshard()` does not recursively unshard the nested FSDP modules, which sometimes leads to an allgather being called before the module forward. Since we want the pipeline IR to explicitly handle this, we can call `group.unshard` instead, which ensures that all the modules are unsharded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775 Approved by: https://github.com/weifengpy	2025-10-08 01:45:57 +00:00
Markus Hoehnerbach	41808b2ba9	fix flex attention eager bwd: more rounding (#164317 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164317 Approved by: https://github.com/drisspg ghstack dependencies: #163986	2025-10-08 01:17:45 +00:00
Xilun Wu	c0510dc447	[ContextParallel] add `_LoadBalancer` classes, and load-balance interface to Context Parallel APIs (#161062 ) Summary: This PR provides an interface for users to specify how to load-balance the attention input. The load-balance is essentially a rearrangement of the input tensor(s) over the seq_dim before sharding and can be specified via an index tensor `rearrange` such that Q[rearrange] is the balanced Q users want (i.e. `rearrange[i] == j` where `i` is the new index of `Q[j]` in the balanced Q). An example is the `_generate_round_robin_indices()` added in https://github.com/pytorch/pytorch/pull/155442. New `_LoadBalancer` classes: The new `_LoadBalancer` class (defined in `torch/distributed/tensor/experimental/_load_balancer.py`) provides one interface for defining load-balance behavior: `_generate_indices(self, restore: bool = False)`. When `restore == False`, this method should output an index Tensor (namely `rearrange_idx`) such that QKV will be transformed into Q' K' V' in a way that `Q'[i] == Q[rearrange_idx[i]]` (same applies to K and V). When `restore == True`, this method outputs an index Tensor (namely `restore_idx`) such that `Q'[restore_idx] == Q` (same applies to K and V). Impact: 2 public CP APIs and 1 private CP API are modified. This PR should be backward-compatible: - For uses w/ SDPA, existing users must be using the `context_parallel()` API, which does not take in the extra `load_balancer` argument and solely determines the behavior from the global var `_cp_options.enable_load_balance`. - For new users, including those who want to try `flex_attention()`, we require using the new API `_context_parallel_buffers` to explicitly shard the QKV input instead of using `context_parallel()`, because we no longer rely on TorchDispatchMode nor TorchFunctionMode for op replacement. We also require users to explicitly pass in a `load_balancer` argument if load-balancing is demanded. Load-Balance Behavior: `context_parallel_unshard()` and `create_cp_block_mask()` APIs now take an extra optional argument `load_balancer`. This argument is optional because of backward compatibility, but we require new users to explicitly pass in a `load_balancer` if load-balancing is demanded: - if `load_balancer == None` and `_cp_options.enable_load_balance == False`, CP performs no load-balancing on input Tensors. - if `load_balancer == None` and `_cp_options.enable_load_balance == True`, CP performs head-tail load-balancing (e.g. split a Tensor into 2N chunks; the first N are called head and the rest are called tail. Place the first head chunk and the last tail chunk on rank 0, the second head chunk along with the second-to-last tail chunk on rank 1, and so on). `_context_parallel_buffers()` also takes the extra optional argument `load_balancer`, but the behavior is slightly different from the other 2 APIs -- it doesn't branch on `_cp_options.enable_load_balance`: - if `load_balancer == None`, no load-balancing will be performed - otherwise, apply load-balancing using `load_balancer._generate_indices()` before sharding. Changes: This PR moves the index Tensor generation logic into a set of LoadBalancer classes and makes LoadBalancer the common interface for Context Parallel APIs that leverage load-balancing: * _context_parallel_buffers * context_parallel_unshard * create_cp_block_mask The `_LoadBalancer` classes added are: - `_LoadBalancer`: the abstract base class that provides the `_generate_indices` interface for index Tensor generation. - `_HeadTailLoadBalancer`: Implements head-tail balancing logic. - `_PerDocumentHeadTailLoadBalancer`: Supports per-document head-tail balancing for batched sequences. Test: `pytest test/distributed/tensor/test_attention.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161062 Approved by: https://github.com/fegin	2025-10-08 01:09:14 +00:00
Nicolas Macchioni	9ec10dc26a	utils + unit tests (#164551 ) Test Plan: ``` buck test fbcode//mode/opt caffe2/test/inductor:caching ``` Reviewed By: aorenste Differential Revision: D83714691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164551 Approved by: https://github.com/aorenste	2025-10-08 01:05:45 +00:00
Pian Pawakapan	bd3b98a8a5	[dynamic shapes] make backed_size_oblivious behavior consistent b/w symbolic_shapes/inductor (#164796 ) Summary: call guard_or_ directly to enable backed_size_obl in inductor calls to guard_or Test Plan: CI and unit test added. Differential Revision: D84009392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164796 Approved by: https://github.com/laithsakka	2025-10-08 00:19:06 +00:00
Scott Wolchok	c32118dc3e	Fix double dispatch to Python for detach (#163671 ) This fixes #71725. Differential Revision: [D83857880](https://our.internmc.facebook.com/intern/diff/D83857880) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163671 Approved by: https://github.com/ezyang, https://github.com/albanD	2025-10-07 23:34:37 +00:00
Chien-Chin Huang	e3ae80fc03	[PP] Let PP split BlockMask into micro-BlockMask (#164111 ) BlockMask has batch dimension information. So PP has to split it as well just like all other tensors. All the tensors in BlockMask have the batch dimension, so we can just split it without too many issues. However, `mask_mod` requires the batch index as the input, which the value is going to be changed after the split. So we have to wrap it inside a closure to modify the batch index. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164111 Approved by: https://github.com/H-Huang	2025-10-07 23:25:34 +00:00
Aaron Gokaslan	d1a62c8036	[BE][Ez]: Enable RUF007 Prefer itertools.pairwise over zip slicing (#164856 ) Now that our min version is 3.10 we can support this rule. This is more concise, readable, and efficient than the previous zip slicing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164856 Approved by: https://github.com/williamwen42	2025-10-07 22:51:17 +00:00
amdfaa	955f21dc2c	[ROCm][CI] Add support for gfx1100 in rocm workflow + test skips (#148355 ) This PR adds infrastructure support for gfx1100 in the rocm workflow. Nodes have been allocated for this effort. @dnikolaev-amd contributed all the test skips. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148355 Approved by: https://github.com/jeffdaily Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-07 22:36:25 +00:00
Mwiza Kunda	2e027e8742	[inductor] Improve bound on the number of dims to match for the block (#163755 ) - Removes redundant broadcast code when `len(kernel.range_tree_nodes)` is much larger than `len(range_tree.nodes)`. For example: ```python # before, the broadcast is to [1, 1, XBLOCK, R0_BLOCK] tmp0 = tl.reshape(tl.broadcast_to(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last')[:, None, :, :], [(511 + XBLOCK) // 512, ((1) * ((1) <= ((511 + XBLOCK) // 512)) + ((511 + XBLOCK) // 512) * (((511 + XBLOCK) // 512) < (1))), ((512) * ((512) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (512))), R0_BLOCK]), [XBLOCK, R0_BLOCK]) # after tmp0 = tl.reshape(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last'), [XBLOCK, R0_BLOCK]) ``` - Fix: also save range_tree_nodes per subgraph Pull Request resolved: https://github.com/pytorch/pytorch/pull/163755 Approved by: https://github.com/eellison, https://github.com/blaine-rister	2025-10-07 21:02:37 +00:00
PyTorch MergeBot	1e42fde45e	Revert "[CUDA] Add experimental green context support for SM carveout (#159104 )" This reverts commit `746fe78ecd`. Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))	2025-10-07 20:51:22 +00:00
PyTorch MergeBot	f505caa71b	Revert "multimem reduce (#164517 )" This reverts commit `d1cbb74fb1`. Reverted https://github.com/pytorch/pytorch/pull/164517 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164517#issuecomment-3378529654))	2025-10-07 20:12:38 +00:00
PyTorch MergeBot	df640df68a	Revert "Reapply "C++-accessible Placements via pybind11 (#163030 )" (#164519 )" This reverts commit `8c0bc879b9`. Reverted https://github.com/pytorch/pytorch/pull/164519 on behalf of https://github.com/malfet due to Still breaks internal workflows ([comment](https://github.com/pytorch/pytorch/pull/164519#issuecomment-3378469432))	2025-10-07 19:46:17 +00:00
zhxchen17	4c3c0ef2f1	[precompile] Load source cache for AOT compile as well. (#164773 ) Adding source_get_cache also to AOT compile case. Since the guard manager loader code can be shared between AOT and caching, we added a new function load_guard_manager to avoid code duplication between two workflows, for loading guards. Test Plan: test_guard_serialization.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/164773 Approved by: https://github.com/yiming0416, https://github.com/dolpm	2025-10-07 18:47:09 +00:00
Parshant Sharma	bc33b10202	fix copy_ for scalar in inductor (#164167 ) Fixes #158437 ### Summary - TorchInductor was not properly handling scalar copy operations `(tensor.copy_(scalar_value))` - Ensured scalar sources are converted to appropriate tensor representations with correct dtype and device ### Impact - Enables compilation of models using ` tensor.copy_(scalar) `patterns - module: inductor Pull Request resolved: https://github.com/pytorch/pytorch/pull/164167 Approved by: https://github.com/shunting314	2025-10-07 18:31:37 +00:00
Yuanyuan Chen	ee5389d520	Enable batch samples in sparse tests (#164677 ) The test cases are enabled because the issue was fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164677 Approved by: https://github.com/albanD	2025-10-07 15:58:37 +00:00
Animesh Jain	801e282f39	[dynamo] Support torch.fx.traceback.annotate (#164678 ) Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation. The existing implementation lets Dynamo trace into `torch.fx.traceback.annotate`, but that's not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every Fx node created in the Dynamo Fx graph has the custom meta node. This does not work with graph breaks yet, but we can solve that problem, if needed, in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678 Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan	2025-10-07 14:54:26 +00:00
Nicolas Macchioni	1fb072ac2a	exceptions + unit tests (#164550 ) Test Plan: ``` buck test fbcode//mode/opt caffe2/test/inductor:caching ``` Reviewed By: aorenste Differential Revision: D83714688 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164550 Approved by: https://github.com/aorenste	2025-10-07 10:04:58 +00:00
Animesh Jain	cac5e13e13	[dynamo] Inline nn module calls using __call__ methods (#164817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164817 Approved by: https://github.com/SherlockNoMad, https://github.com/mlazos	2025-10-07 08:57:20 +00:00
Laith Sakka	ef7e2ca77e	remove check_is_size from test_misc.py (#164667 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164667 Approved by: https://github.com/angelayi ghstack dependencies: #164664, #164665	2025-10-07 07:33:50 +00:00
Laith Sakka	cdaaf3e4a3	remove size-like based size-oblivious special max simplifications (#164665 ) As we removed guard_size_oblivious this simplification is no longer relevant, this is part of the process of deprecation for guard_size_oblivious and its dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164665 Approved by: https://github.com/aorenste ghstack dependencies: #164664	2025-10-07 07:33:50 +00:00
Laith Sakka	0ea59c3c55	do not suggest torch._check_is_size() (#164664 ) size like concept for data dependency is not relevant anymore as we removed all guard_size_oblivious calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164664 Approved by: https://github.com/angelayi, https://github.com/mlazos	2025-10-07 07:33:50 +00:00
Nicolas Macchioni	8f705d019a	context + unit tests (#164549 ) Summary: the context module provides configurable context selection + isolation key hashing; context selection is broken into runtime and compile context. runtime context is decided at call time (inductor configs, precision configs, etc.) and compile context is decided at compile time (hardware type, software hashes). callees will be given access to SelectedRuntimeContext and SelectedCompileContext, which they can use to determine and select what context is necessary with regards to the function which is being cached. these selected contexts are wrapped in an IsolationSchema, which denotes what context should be taken into consideration when producing an isolation key. The isolation key is essentially a salt of the function signature key, which says that some function signature key result is valid under a given context (isolation schema) Test Plan: ``` buck test fbcode//mode/opt caffe2/test/inductor:caching ``` Reviewed By: aorenste D83714689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164549 Approved by: https://github.com/aorenste	2025-10-07 06:02:10 +00:00
Tugsbayasgalan Manlaibaatar	4725871a81	Return fake mode from export graph capture API (#164730 ) This PR is to temporarily unblock various experiments that reuse the Dynamo-created fake mode. Note that this is still not what we want as the end state. The end state should look something like: ``` out = fullgraph_capture(mod, inputs) fake_mode = out.backend_inputs.fake_mode gm = out.module() ``` This doesn't work today because export needs to wrap the original module to set up a flat module to trace, for easier handling of pytree. As a result, we would need to carry an export-specific flag in fullgraph_capture, which seems not ideal. Regardless, the end state is that we need to give downstream users a graph module and a fake mode in some form, so for now _dynamo_graph_capture_for_export returns a fake mode within the graph module itself via gm.meta Pull Request resolved: https://github.com/pytorch/pytorch/pull/164730 Approved by: https://github.com/avikchaudhuri	2025-10-07 03:42:46 +00:00
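
The head-tail load-balancing described in commit c0510dc447 above (split into 2N chunks, pair the i-th head chunk with the i-th-from-last tail chunk on rank i) can be illustrated with a small sketch. This is plain Python on lists, not the PyTorch implementation: the function names `head_tail_indices` and `invert_indices` are hypothetical, and the divisibility requirement is an assumption made to keep the example short. It only demonstrates the two index relations stated in the commit message, `Q'[i] == Q[rearrange_idx[i]]` and `Q'[restore_idx] == Q`.

```python
def head_tail_indices(seq_len: int, world_size: int) -> list[int]:
    """Hypothetical helper: build rearrange_idx for head-tail balancing.

    The sequence is split into 2 * world_size equal chunks. Rank r receives
    head chunk r followed by tail chunk (2 * world_size - 1 - r).
    """
    num_chunks = 2 * world_size
    if seq_len % num_chunks != 0:
        # Assumption for this sketch only: no padding logic is modeled.
        raise ValueError("seq_len must be divisible by 2 * world_size")
    chunk = seq_len // num_chunks

    chunk_order = []
    for rank in range(world_size):
        chunk_order.append(rank)                   # head chunk for this rank
        chunk_order.append(num_chunks - 1 - rank)  # matching tail chunk

    # Expand chunk ids into element indices.
    return [c * chunk + off for c in chunk_order for off in range(chunk)]


def invert_indices(rearrange_idx: list[int]) -> list[int]:
    """Hypothetical helper: build restore_idx, the inverse permutation."""
    restore_idx = [0] * len(rearrange_idx)
    for new_pos, old_pos in enumerate(rearrange_idx):
        restore_idx[old_pos] = new_pos
    return restore_idx


q = list("abcdefgh")                       # stand-in for Q along seq_dim
rearrange_idx = head_tail_indices(len(q), world_size=2)
balanced = [q[i] for i in rearrange_idx]   # Q'[i] == Q[rearrange_idx[i]]
restore_idx = invert_indices(rearrange_idx)
restored = [balanced[i] for i in restore_idx]  # Q'[restore_idx] == Q
```

With two ranks, the first half of `balanced` (chunks 0 and 3) would go to rank 0 and the second half (chunks 1 and 2) to rank 1, so each rank holds one early and one late part of the sequence.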


36737 Commits
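
The edge case described in commit f332017294 above (constructor defaults vs. optimizer defaults) can be reproduced with a small model. The sketch below is Python, not the C++ API: `AdamOptions` here is a stand-in dataclass carrying the constructor defaults quoted in the commit message, and the `overwrite_from` and `merge_python_style` signatures are assumptions made for illustration. It shows why comparing against constructor defaults cannot tell "left unset" apart from "explicitly set to the default", whereas a dict-based merge can.

```python
from dataclasses import dataclass, fields


@dataclass
class AdamOptions:
    # Constructor defaults, as listed in the commit message.
    lr: float = 0.001
    weight_decay: float = 0.0
    eps: float = 1e-08


def overwrite_from(group: AdamOptions, optimizer_defaults: AdamOptions) -> AdamOptions:
    """Assumed model of the C++ logic: a field equal to its constructor
    default is treated as "not specified" and inherits the optimizer default."""
    ctor_defaults = AdamOptions()
    merged = {}
    for f in fields(AdamOptions):
        value = getattr(group, f.name)
        if value == getattr(ctor_defaults, f.name):
            merged[f.name] = getattr(optimizer_defaults, f.name)
        else:
            merged[f.name] = value
    return AdamOptions(**merged)


def merge_python_style(group: dict, optimizer_defaults: dict) -> dict:
    """Dict-based merge: a key that is present counts as explicitly set."""
    return {**optimizer_defaults, **group}


# Optimizer defaults chosen by the user at runtime.
optimizer_defaults = AdamOptions(lr=0.005, weight_decay=0.1, eps=1e-07)

# The user adds a parameter group with an explicit weight_decay of 0.
result = overwrite_from(AdamOptions(weight_decay=0.0), optimizer_defaults)
print(result.weight_decay)  # the optimizer default wins, not the explicit 0

python_result = merge_python_style(
    {"weight_decay": 0.0},
    {"lr": 0.005, "weight_decay": 0.1, "eps": 1e-07},
)
print(python_result["weight_decay"])  # the explicit 0 is preserved
```

Tracking explicit assignment (for example with optional-typed fields) would remove the ambiguity; the commit message notes that this was considered and not pursued in that PR.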