pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Author	SHA1	Message	Date
Mikayla Gawarecki	c0bbda37e8	Move static from_ivalue/to_ivalue to new shim_common.cpp (#166373 ) Move `from_ivalue` and `to_ivalue` and their dependents `StableIValueBoxedKernel`, `aoti_torch_library_impl`, and `aoti_torch_call_dispatcher` into a new, non-AOTI shim_common.cpp. This is in preparation for the PRs above, where I add v2s (`torch_call_dispatcher` and `torch_library_impl`) that are versioning-aware. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166373 Approved by: https://github.com/janeyx99 ghstack dependencies: #164356	2025-10-29 15:41:36 +00:00
Mikayla Gawarecki	fefb546b91	Add TORCH_TARGET_VERSION for stable ABI (#164356 ) And update it so comparisons can be done by the preprocessor Note: We also need to gate in shim.h and figure out how to enforce this Differential Revision: [D85683549](https://our.internmc.facebook.com/intern/diff/D85683549) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164356 Approved by: https://github.com/janeyx99	2025-10-29 15:41:28 +00:00
PyTorch MergeBot	d6d6fa26f5	Revert "bwd pass (#164504 )" This reverts commit `f36f372acc`. Reverted https://github.com/pytorch/pytorch/pull/164504 on behalf of https://github.com/jeffdaily due to CI had been clean for both cuda and rocm before merge, broke post merge? ([comment](https://github.com/pytorch/pytorch/pull/164504#issuecomment-3462116676))	2025-10-29 15:10:40 +00:00
Nikita Vedeneev	467c21ad9a	`nn.Linear`: nD contiguous input + bias -- dispatch to addmm also when weight is sparse (#166071 ) As per title. It seems safe to be able to generalize to arbitrary contiguous inputs since `at::matmul` is likely to do the flattening to avoid `baddmm`. Additionally, we guard for bias to be 1D and contiguous which is guaranteed to be fused with no copies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166071 Approved by: https://github.com/ngimel	2025-10-29 13:13:40 +00:00
Way Wang	4a94591321	filter out alloc-free pairs from trace plot (#165752 ) Summary: When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze. This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view. The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752 Approved by: https://github.com/zdevito	2025-10-29 12:44:54 +00:00
PyTorch MergeBot	5e7272b60a	Revert "[BE] Move GreenContext implementation details to cpp (#166462 )" This reverts commit `afaaaa314c`. Reverted https://github.com/pytorch/pytorch/pull/166462 on behalf of https://github.com/atalman due to multiple internal build failures ([comment](https://github.com/pytorch/pytorch/pull/166462#issuecomment-3461145801))	2025-10-29 11:59:41 +00:00
PyTorch MergeBot	1dd6b76914	Revert "[1/N] Remove unused loop variables (#166258 )" This reverts commit `76b2c37045`. Reverted https://github.com/pytorch/pytorch/pull/166258 on behalf of https://github.com/atalman due to breaks test/distributed/test_serialization.py::TestSerialization::test_weights_only [GH job link](https://github.com/pytorch/pytorch/actions/runs/18894311802/job/53929321703) [HUD commit link](`76b2c37045`) ([comment](https://github.com/pytorch/pytorch/pull/166258#issuecomment-3460964612))	2025-10-29 11:10:37 +00:00
Xuehai Pan	284716a691	[pytree] add `treespec_{leaf,tuple,dict}` functions for args_spec modification (#160843 ) The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class. Changes: 1. Add function `treespec_leaf()` to replace `LeafSpec()`. 2. Add functions `treespec_tuple(...)` and `treespec_dict(...)` to create treespecs for `tuple` / `dict`, which are used for `*args` / `**kwargs`. This avoids direct modification of `treespec` instances that relies on the implementation details of the `PyTreeSpec` class. 3. Change `len(spec.children_specs)` to `spec.num_children`. 4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`. ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843 Approved by: https://github.com/mlazos	2025-10-29 09:16:24 +00:00
Yuanyuan Chen	8b188647cf	[2/N] Fix unused loop variables (#166500 ) This PR removes unused loop variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166500 Approved by: https://github.com/mlazos	2025-10-29 08:30:35 +00:00
Aaron Gokaslan	96b61844a7	[BE]: Update nvshmem to 3.4.5 (#164046 ) Release notes can be found here: https://docs.nvidia.com/nvshmem/release-notes-install-guide/release-notes/release-3405.html The main difference is the addition of a CPU-assisted IBGDA fallback, which should allow NVSHMEM IBGDA to work on many more systems without admin intervention and without using GDRCopy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164046 Approved by: https://github.com/ezyang, https://github.com/kwen2501	2025-10-29 07:32:05 +00:00
etaf	1b655a87ef	[xpu][test] Enable more UTs for Intel GPU. (#166047 ) This PR enables additional Inductor unit tests for Intel GPU. Due to the increased number of test cases, the number of runners has been extended from 8 to 12 to prevent CI timeouts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166047 Approved by: https://github.com/jansel Co-authored-by: Deng, Daisy <daisy.deng@intel.com> Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-10-29 06:25:36 +00:00
fffrog	cb6966704c	Add merge rule for PrivateUse1 Module (#166394 ) Add merge rights for the following people: - albanD - fffrog Pull Request resolved: https://github.com/pytorch/pytorch/pull/166394 Approved by: https://github.com/ezyang	2025-10-29 06:13:44 +00:00
Amin Sedaghat	17d5aa4767	disable jiterator for complex tan and tanh (#165250 ) Fixes #100842 Disable jiterator for complex tan and tanh kernels due to accuracy issues, matching the existing approach used for acos, acosh, asin, and asinh. Reverts to thrust implementation which provides better numerical accuracy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165250 Approved by: https://github.com/ezyang	2025-10-29 04:59:01 +00:00
Michael Lazos	cde81e92b9	[User-streams] Make torch.Event weakref compatible (#164522 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164522 Approved by: https://github.com/williamwen42 ghstack dependencies: #162903, #164343, #164344, #164507, #162901, #164304	2025-10-29 04:57:23 +00:00
Michael Lazos	bfc2050db9	[user-streams] Make device-agnostic streams weakref compatible (#164304 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164304 Approved by: https://github.com/williamwen42, https://github.com/colesbury ghstack dependencies: #162903, #164343, #164344, #164507, #162901	2025-10-29 04:57:23 +00:00
Justin Chu	c5701d0ab5	[ONNX] Create fake implementations for onnx ops; fix boolean mask in attention (#165780 ) Previously we relied on the concrete implementation to generate the fake implementation. This made the fake implementation overly complicated and broke in some cases when there were dynamic shapes. This PR updates onnx op registration to instead take a dedicated fake implementation. Also fixed: when a boolean mask was supplied to torch sdpa, its negation was previously taken, which is incorrect. Fix https://github.com/pytorch/pytorch/issues/164909 Also includes changes from https://github.com/pytorch/pytorch/pull/156635 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165780 Approved by: https://github.com/titaiwangms	2025-10-29 04:51:49 +00:00
Michael Lazos	23669d02a6	[user-cuda-streams] Add cuda streams test suite (#162901 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162901 Approved by: https://github.com/williamwen42 ghstack dependencies: #162903, #164343, #164344, #164507	2025-10-29 04:46:08 +00:00
Michael Lazos	e8d887ae3f	[user-streams] Support streams as contexts (#164507 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164507 Approved by: https://github.com/williamwen42 ghstack dependencies: #162903, #164343, #164344	2025-10-29 04:46:08 +00:00
Junjie Wang (PyTorch)	774abb018e	[ptd] Fix test config in destroy_pg (#166463 ) Summary: When device_type is CPU we will not use device id from CUDA which is enabled in https://github.com/pytorch/pytorch/pull/161015. However, we should not exclude the case when the accelerator itself is CPU. This PR fixes it. Test Plan: UT Differential Revision: D85714901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166463 Approved by: https://github.com/mori360, https://github.com/fegin	2025-10-29 04:35:04 +00:00
Yuanyuan Chen	0e19561e23	Add back Windows and macOS to tensorboard tests (#166389 ) This PR adds back tensorboard tests on Windows and macOS because the dependency issue is resolved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166389 Approved by: https://github.com/Skylion007	2025-10-29 04:34:57 +00:00
Jagadish Krishnamoorthy	1fa520ea65	[ROCm] Enable group gemm through CK (#166334 ) Fixes #161366 All 4 dimension combinations of the input matrices are supported: 2d-2d, 2d-3d, 3d-3d, 3d-2d. The corresponding test cases in test_matmul_cuda are working for both forward and backward pass. The CK path is enabled for gfx942, gfx950. ToDo: Need to enable support on gfx90a; the CK kernel used in this commit produces a GPU error there and might require a different CK kernel config, based on the profiler result on gfx90a. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony	2025-10-29 04:32:38 +00:00
PaulZhang12	c2e3cc7aed	[Inductor] No longer throw error in bmm out_dtype lowering due to template heuristics (#166457 ) Fixes https://github.com/pytorch/pytorch/issues/165892 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166457 Approved by: https://github.com/coconutruben	2025-10-29 04:27:13 +00:00
PyTorch UpdateBot	5849eea129	[vision hash update] update the pinned vision hash (#166356 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166356 Approved by: https://github.com/pytorchbot	2025-10-29 04:14:16 +00:00
PyTorch MergeBot	924482a6f6	Replace NUMA inheritance approach (#166026 ) # Context Previously, we would modify the parent process's NUMA bindings in order to force child process to inherit them. However, this would not work correctly if `start_method="forkserver"`, because the subprocesses would actually inherit their bindings from the forkserver middleman process. In this case, the inherited affinity would actually be incorrect for all but the first subprocess (because the forkserver process would get created lazily, and hence inherit and then stick with the bindings intended for the first subprocess). # This PR * `str` entrypoints: Use `numactl` CLI * `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it. Hopefully this will be the last necessary iteration. # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual Verified flops/sec and memory locality wins on several different types of jobs * `Callable` with forkserver * `str` entrypoint with spawn * `Callable` entrypoint with spawn More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64) # Later PR Update all the documentation when we're confident this has stabilized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026 Approved by: https://github.com/d4l3k Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>	2025-10-29 03:58:44 +00:00
Sun, Jiayi	20be077085	[Inductor] support masked vectorization for the tail_loop for float64 datatype (#163316 ) Summary: Support masked vectorization for the tail_loop for float64 datatype. Example: ``` import torch def fn(x): return x * x x = torch.randn((22, 22), dtype=torch.double) with torch.no_grad(): compiled_fn = torch.compile(fn) compiled_fn(x) ``` Generated code: - Before ``` cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double', 'double'], r''' #include <torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const double* in_ptr0, double* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L))) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); auto tmp1 = tmp0 * tmp0; tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L))) { for (int64_t x0_tail = static_cast<int64_t>(480L);x0_tail < static_cast<int64_t>(484L); x0_tail++) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)]; auto tmp1 = double(tmp0 * tmp0); out_ptr0[static_cast<int64_t>(x0_tail)] = tmp1; } } } } } } ''') async_compile.wait(globals()) del async_compile class Runner: def __init__(self, partitions): self.partitions = partitions def recursively_apply_fns(self, fns): new_callables = [] for fn, c in zip(fns, self.partitions): new_callables.append(fn(c)) self.partitions = new_callables def call(self, args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (22, 22), (22, 1)) buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64) # [Provenance debug handles] cpp_fused_mul_0:1 cpp_fused_mul_0(arg0_1, buf0) del arg0_1 return (buf0, ) ``` - After ``` cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double', 'double'], r''' #include 
<torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const double* in_ptr0, double* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L))) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); auto tmp1 = tmp0 * tmp0; tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L))) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L)); auto tmp1 = tmp0 * tmp0; tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L)); } } } } } ''') async_compile.wait(globals()) del async_compile class Runner: def __init__(self, partitions): self.partitions = partitions def recursively_apply_fns(self, fns): new_callables = [] for fn, c in zip(fns, self.partitions): new_callables.append(fn(c)) self.partitions = new_callables def call(self, args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (22, 22), (22, 1)) buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64) # [Provenance debug handles] cpp_fused_mul_0:1 cpp_fused_mul_0(arg0_1, buf0) del arg0_1 return (buf0, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163316 Approved by: https://github.com/mingfeima, https://github.com/jansel	2025-10-29 03:30:38 +00:00
thenumberouscode	94eaeb9cb8	[Conv1d] Check overflow before we compute padding size. (#162363 ) Fixes https://github.com/pytorch/pytorch/issues/161877 also fixes https://github.com/pytorch/pytorch/issues/161875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162363 Approved by: https://github.com/jbschlosser	2025-10-29 03:27:20 +00:00
Yu, Guangye	753d9bd806	Introduce a new API torch.xpu.set_per_process_memory_fraction (#165510 ) # Motivation Aligned with other backends, this PR introduces a new API `torch.xpu.set_per_process_memory_fraction` that lets users customize the memory allowed for a single process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165510 Approved by: https://github.com/EikanWang, https://github.com/ezyang ghstack dependencies: #165508, #165509	2025-10-29 03:24:52 +00:00
Yuanyuan Chen	dd1fe7c22f	Remove clang-tidy type conversion suppressions (#166398 ) This PR fixes and removes type conversion suppressions of clang-tidy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166398 Approved by: https://github.com/Skylion007	2025-10-29 03:21:16 +00:00
linhaifeng	695cb0d342	[2/N][Fix] Fix typo in test folder (#166374 ) Fix typo in test folder. _typos.toml ```bash [default.extend-words] nd = "nd" arange = "arange" Nd = "Nd" GLOBALs = "GLOBALs" hte = "hte" iy = "iy" PN = "PN" Dout = "Dout" optin = "optin" gam = "gam" PTD = "PTD" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/166374 Approved by: https://github.com/cyyever, https://github.com/ezyang	2025-10-29 03:02:07 +00:00
linhaifeng	1764f3a9c8	[Fix] fix gramma error in PyTorch docs (#166158 ) Fix several grammar errors in PyTorch docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166158 Approved by: https://github.com/yewentao256, https://github.com/cyyever, https://github.com/ezyang	2025-10-29 03:01:07 +00:00
Yu, Guangye	c9eabadc5e	Suppress std::hardware_destructive_interference_size warning on GCC 13+ (#166297 ) # Motivation In https://github.com/pytorch/pytorch/pull/145591, `std::hardware_destructive_interference_size` was introduced in CUDACachingAllocator. Later, https://github.com/pytorch/pytorch/pull/160067 moved it to `c10/core/alignment.h` for code reuse. However, on GCC 13+ using `std::hardware_destructive_interference_size` triggers the following warning: ```bash warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size] /home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags /home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: if this use is part of a public ABI, change it to instead use a constant variable you define /home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: the default value for the current CPU tuning is 64 bytes /home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’ ``` # Solution - Solution 1: Replace `c10::hardware_destructive_interference_size` with a constant 64. ```cpp constexpr std::size_t hardware_destructive_interference_size = 64; ``` - Solution 2: Add `-Wno-interference-size` to `8d4e48831e/cmake/public/utils.cmake (L386)` to suppress the warning. # Additional Context The current implementation uses the second approach. If the reviewers prefer the first approach, I am happy to update it accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166297 Approved by: https://github.com/ezyang	2025-10-29 02:57:46 +00:00
can-gaa-hou	c201a1cab1	[OpenReg] Update Installation in README.md (#166235 ) It is recommended to use `python -m pip install --no-build-isolation .` instead of `pip3 install --no-build-isolation .` because most of us use a virtual environment, and the latter probably relies on the system `pip3` rather than the conda or uv. We need to make it consistent with the Python we use, and it is also consistent with how `torch` is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166235 Approved by: https://github.com/fffrog, https://github.com/ezyang	2025-10-29 02:57:26 +00:00
Michael Lazos	e105a47575	[user-streams] Have StreamVariable inherit from StreamContextVariable (#164344 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164344 Approved by: https://github.com/williamwen42 ghstack dependencies: #162903, #164343	2025-10-29 02:49:54 +00:00
Michael Lazos	aab27b051a	[user-streams] Move StreamContextVariable into streams module (#164343 ) finish moving Pull Request resolved: https://github.com/pytorch/pytorch/pull/164343 Approved by: https://github.com/williamwen42, https://github.com/fxdawnn ghstack dependencies: #162903	2025-10-29 02:49:54 +00:00
Nicolas Macchioni	f8b4c00294	intfs + unit tests (#164723 ) Test Plan: ``` buck test fbcode//mode/opt caffe2/test/inductor:caching ``` Differential Revision: D83727222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164723 Approved by: https://github.com/aorenste	2025-10-29 02:32:19 +00:00
Nikita Shulga	877f126e35	[MPS] Improve index_select error checking (#166468 ) Just copy-n-paste overlap checks from `0d4992c170/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L1620-L1622)` Very similar to https://github.com/pytorch/pytorch/pull/166425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166468 Approved by: https://github.com/dcci, https://github.com/Skylion007	2025-10-29 02:23:12 +00:00
Maggie Moss	4fada51ada	Fix existing Pyrefly errors (#166439 ) Trying to keep main as clean of type errors as possible until we are able to switch to just one checker. This adds suppressions for existing type errors on main. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166439 Approved by: https://github.com/Skylion007	2025-10-29 02:08:02 +00:00
Yuanyuan Chen	76b2c37045	[1/N] Remove unused loop variables (#166258 ) This PR removes unused loop variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos	2025-10-29 01:34:15 +00:00
Laith Sakka	adedf26e21	Support python slicing with tensor inputs. (#165074 ) When the slice is a tensor, we decompose it into an `.item()` call and pass the unbacked symbol to the slice to avoid DDE. The diff also fixes an existing bug in `codegen_dynamic_slice_size` in the cpp wrapper: a +1 should have been -1, which makes it match the python codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165074 Approved by: https://github.com/Lucaskabela	2025-10-29 01:18:45 +00:00
Nicolas De Carli	bea89d6060	[PyTorch] Improve conversion from/to bool on aarch64+sve (#166330 ) Summary: We are adding autovec routines to convert to/from boolean values We observed the following performance improvements when compiling targeting armv9-a+sve2+fp16+bf16 before: bool->uint8->bool ===> 447.854us bool->int8->bool ===> 445.609us bool->int16->bool ===> 312.425us bool->int32->bool ===> 324.368us bool->float->bool ===> 320.929us bool->float16->bool ===> 290.825us bool->bfloat16->bool ===> 437.250us after bool->uint8->bool ===> 78.988us ----> 467% higher throughput bool->int8->bool ===> 78.494us -----> 468% higher throughput bool->int16->bool ===> 107.993us ----> 189% higher throughput bool->int32->bool ===> 186.887us -----> 74% higher throughput bool->float->bool ===> 188.048us ------> 71% higher throughput bool->float16->bool ===> 102.789us --> 183% higher throughput bool->bfloat16->bool ===> 105.809us -> 313% higher throughput Test Plan: Correctness: buck2 test mode/opt //caffe2/test:test_ops buck2 test mode/opt //caffe2/test:torch Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test Reviewed By: mcfi Differential Revision: D85533284 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166330 Approved by: https://github.com/mcfi	2025-10-29 01:09:34 +00:00
Iris Zhang	48e672d149	[dcp][state_dict] Make `_flatten_optim_state_dict` and `_unflatten_optim_state_dict` handle arbitrary-level of nested optim dictionaries by recursion (#165071 ) Summary: This updates the internal helper functions `_flatten_optim_state_dict` and `_unflatten_optim_state_dict` to handle arbitrary levels of nested dictionaries. With this, they can handle optimizers like Shampoo that have multiple levels of nested dictionaries. We parametrized the `shampoo_checkpoint_test.py` to test both `flatten_optimizer_state_dict=True` and `False`. Example shampoo nested dictionary: ``` { "state": { 0: { "block_0": { "shampoo": { "factor_matrices": { 0: torch.tensor([[0.0, 0.0], [0.0, 0.0]]), 1: torch.tensor([[0.0, 0.0], [0.0, 0.0]]), }, "factor_matrix_indices": {}, "inv_factor_matrices": { 0: torch.tensor([[1.0, 0.0], [0.0, 1.0]]), 1: torch.tensor([[1.0, 0.0], [0.0, 1.0]]), }, }, }, }, }, "param_groups": [ { "lr": 0.01, "betas": (0.9, 1.0), "beta3": 0.9, "epsilon": 1e-12, "momentum": 0.9, "dampening": 0.0, "weight_decay": 0.0, "max_preconditioner_dim": 5, "precondition_frequency": 1, "start_preconditioning_step": 1, "use_nesterov": False, "use_bias_correction": True, "use_decoupled_weight_decay": True, "grafting_config": AdaGradPreconditionerConfig(epsilon=0.001), "use_pin_memory": False, "distributed_config": SingleDeviceDistributedConfig( target_parameter_dimensionality=2 ), "preconditioner_config": self._preconditioner_config, "params": [0], } ], } ``` With this update, shampoo optimizers can be used with torchtitan without any modification on the torchtitan side. Also, we ensure it is still backward compatible with other torch optimizers like Adam. 
Test Plan: Shampoo test: ``` [irisz@devvm5551.cco0 ~/fbsource/fbcode (49fd905c0b)]$ buck2 test @//mode/opt //hpc/optimizers/distributed_shampoo/dev/distributor/gpu_tests:shampoo_checkpoint_test Buck UI: https://www.internalfb.com/buck2/ff5e0f02-637d-4a73-b990-c0792a460216 Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199373078880 Network: Up: 0B Down: 0B Executing actions. Remaining 0/5 Command: test. Time elapsed: 27.3s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` torch.checkpoint.state_dict test. ``` [irisz@devvm5551.cco0 ~/fbsource/fbcode (49fd905c0b)]$ buck2 test @//mode/opt //caffe2/test/distributed/checkpoint:test_state_dict Buck UI: https://www.internalfb.com/buck2/bf367c2c-4d17-4d13-b6c6-f6058211bcf2 Test UI: https://www.internalfb.com/intern/testinfra/testrun/13792273976572052 Network: Up: 0B Down: 11GiB (reSessionID-9662acf0-f3de-4993-b4fe-880c33f91f78) Executing actions. Remaining 0/5 Command: test. Time elapsed: 5:31.9s Tests finished: Pass 26. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D83619435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165071 Approved by: https://github.com/fegin	2025-10-29 01:00:38 +00:00
Nikita Shulga	afaaaa314c	[BE] Move GreenContext implementation details to cpp (#166462 ) - Remove all complex defines logic from the header - Make GreenContext constructor private, as it should only be created via the static method as singleton - Delete unused `getContext` and `getGreenContext` methods - Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462 Approved by: https://github.com/ngimel, https://github.com/eqy	2025-10-29 00:40:11 +00:00
Maggie Moss	84fe848503	Fix pyrefly error syntax (2/n) (#166448 ) Ensures pyrefly ignores only silence one error code. After this, only ~40 files are left to clean up. pyrefly check lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/166448 Approved by: https://github.com/Skylion007	2025-10-29 00:36:40 +00:00
zhxchen17	56afad4eb3	[precompile] Pickle and check closure variable properly. (#166351 ) Summary: Previously we didn't correctly handle the closure tuple when there was content in it. This adds code to serialize the tuple and merge it with the guard manager's local scope. Test Plan: pytest test/dynamo/test_aot_compile.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/166351 Approved by: https://github.com/Lucaskabela	2025-10-29 00:28:21 +00:00
Sarthak Tandon	2a058bfecf	[ROCm][tunableop] Fixed Offline Tuning file writing (#166074 ) - Fixes an issue with offline tuning mode: we want to append to the existing file, not delete it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166074 Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily	2025-10-29 00:25:45 +00:00
Maggie Moss	31e42eb732	Fix pyrefly ignore syntax (#166438 ) Reformats pyrefly ignore suppressions so they only ignore one error code. pyrefly check lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/166438 Approved by: https://github.com/Skylion007	2025-10-29 00:02:21 +00:00
jainapurva	a9b29caeae	Add attention benchmarking numbers to pytorch operator microbenchmarks (#164155 ) This pull request introduces a standardized YAML-based configuration system for transformer attention benchmarks, making it easier to run and manage comprehensive performance tests. It adds example configs, and a wrapper script to convert YAML configs into CLI arguments for the benchmark runner. #### Next Steps: CI Enablement: This change would further lead to running the attention ops in CI for regression tracking. #### Developer flow: (Run locally) `python score_mod.py --config configs/config_test.yaml` #### Enabling CI run: https://github.com/pytorch/pytorch/pull/165915 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164155 Approved by: https://github.com/jbschlosser	2025-10-28 23:46:04 +00:00
Animesh Jain	0d4992c170	[dynamo][easy] Use CONSTANT_MATCH for __code__ guard (#166445 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166445 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166437, #166444	2025-10-28 23:19:42 +00:00
Animesh Jain	b060e5c131	[dynamo] Move more FUNCTION_MATCH to CLOSURE_MATCH (#166444 ) Closure match is more relaxed than function match, which is an id match. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166444 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166437	2025-10-28 23:19:42 +00:00
Michael Lazos	6d5e651a50	[user-streams] update stream context to use fork/join (#162903 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162903 Approved by: https://github.com/anijain2305	2025-10-28 23:12:05 +00:00
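
The recursion described in commit 48e672d149 (flattening and unflattening optimizer state dictionaries nested to an arbitrary depth) can be illustrated with a minimal sketch. This is not the PyTorch implementation: `flatten_nested` and `unflatten_nested` are hypothetical names, tuple paths are an assumed key format, and plain lists stand in for tensors.

```python
from typing import Any


def flatten_nested(d: dict, prefix: tuple = ()) -> dict:
    """Flatten a nested dict into {path_tuple: leaf} by recursion.

    Hypothetical helper for illustration only. Tuple paths keep the
    original key types (e.g. int parameter ids) intact.
    """
    flat: dict = {}
    for key, value in d.items():
        path = prefix + (key,)
        if isinstance(value, dict) and value:
            # Non-empty dict: recurse one level deeper.
            flat.update(flatten_nested(value, path))
        else:
            # Leaf value; an empty dict is kept as a leaf so it
            # survives the round trip.
            flat[path] = value
    return flat


def unflatten_nested(flat: dict) -> dict:
    """Rebuild the nested dict from the {path_tuple: leaf} mapping."""
    nested: dict = {}
    for path, value in flat.items():
        node: Any = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested


# Shape loosely modeled on the nested Shampoo state in the commit message.
state = {
    0: {
        "block_0": {
            "shampoo": {
                "factor_matrices": {0: [[0.0, 0.0]], 1: [[0.0, 0.0]]},
                "factor_matrix_indices": {},
            }
        }
    }
}
flat = flatten_nested(state)
restored = unflatten_nested(flat)
```

Keeping empty dictionaries as leaves is a design choice of this sketch; without it, an entry such as `factor_matrix_indices` would disappear when the structure is rebuilt.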


95118 Commits