# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
# Motivation
While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `ExpandableSegment` is defined in a .cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356
Approved by: https://github.com/ngimel, https://github.com/albanD
ghstack dependencies: #159159
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).
Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- Because we re-expose the header-only APIs through torch::headeronly, there is an extra chunk of code in the new torch::headeronly files to do that.
Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter, since there is only one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer: sadly, it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is the following, which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. It introduces the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env-variable config key-value pairs.
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
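As a usage sketch of the documented format (a minimal illustration; the specific keys shown are existing CUDA allocator options reused here as examples, not an authoritative list for the new class):
```python
import os

# Illustrative only: set the generic allocator config before torch initializes
# the allocator. Format: comma-separated key:value pairs; list values use [].
os.environ["PYTORCH_ALLOC_CONF"] = "max_split_size_mb:256,expandable_segments:True"

import torch  # the accelerator allocator parses PYTORCH_ALLOC_CONF lazily on first use
```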
Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.
It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
# Motivation
While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `AllocParams` is defined in a .cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159
Approved by: https://github.com/cyyever, https://github.com/albanD
This PR makes some less risky changes to PyObjectSlot, as there is a lot of stuff we do not need since there is only one interpreter. Specifically, `check_interpreter` and `has_pyobj_nonhermetic` are removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158290, #158291
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter, since there is only one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer: sadly, it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is the following, which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
This pull request refactors the `parse_type` function in `c10/core/Device.cpp` to improve the handling of the `PrivateUse1` device type. The main change involves reordering the logic to check for the `PrivateUse1` device type earlier in the function for better clarity and efficiency.
This helps existing backends migrate to PrivateUse1 smoothly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157609
Approved by: https://github.com/jgong5, https://github.com/albanD
As popularly requested in user groups.
Test plan:
```
import torch
a = torch.randn(10000)
device = torch.device('cuda:1')
a = a.to(device)
```
Before:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
After:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158395
Approved by: https://github.com/aorenste
Co-authored-by: Aaron Orenstein <aorenste@fb.com>
Summary: As above, also changes a bunch of the build files to be better
Test Plan:
internal and external CI
did run buck2 build fbcode//caffe2:torch and it succeeded
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D78016591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
This fixes `index_put(..., accumulate=True)` for all dtypes.
The int64 operation is not truly atomic, but it is eventually consistent from the `index_put_accumulate` kernel's point of view: by the end of the operation, the results in global memory are indeed the accumulation of the operands at the given indices.
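A small usage sketch of the semantics in question (illustrative values, assuming an MPS device is available; `accumulate=True` sums every contribution that targets the same index):
```python
import torch

x = torch.zeros(4, dtype=torch.int64, device="mps")
idx = torch.tensor([0, 1, 1, 3], device="mps")
vals = torch.tensor([1, 2, 3, 4], dtype=torch.int64, device="mps")
# index 1 receives 2 + 3 = 5
x.index_put_((idx,), vals, accumulate=True)
print(x)  # -> [1, 5, 0, 4]
```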
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064, #158178
Now, instead of erroring out on an `empty_cache` call during graph capture or under a mempool context, we just silently do nothing. This used to be the behavior for mempools; cudagraphs used to error out, but it's fine to just ignore the call.
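A hedged sketch of the new behavior (assuming a CUDA device; before this change the call inside capture raised an error):
```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(8, device="cuda")
with torch.cuda.graph(g):
    y = x + 1
    torch.cuda.empty_cache()  # during capture this is now silently ignored
g.replay()
```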
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
Move the `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract the `iter_tensor_offset` method, which returns an offset from the start of the storage associated with the given tensor inside the iterator.
Migrated `index` and `index_put[_accumulate][_serial]` to the new paradigm, which requires neither an additional tensor for indices nor special handling for 32- vs 64-bit offsets; this resulted in an almost 2x perf gain for a 2000x2000 tensor. See the results below. Before:
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 383.5 | 379.8 | 470.9 | 1232.9 | 4410.3
__getitem__ (torch.float16, torch.int64) | 379.6 | 354.5 | 533.2 | 1290.3 | 4442.2
__getitem__ (torch.float32, torch.int64) | 360.8 | 338.6 | 478.6 | 1348.9 | 4870.4
Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 349.8 | 330.5 | 432.6 | 764.5 | 1961.2
__getitem__ (torch.float16, torch.int64) | 342.5 | 330.7 | 434.7 | 741.0 | 1969.4
__getitem__ (torch.float32, torch.int64) | 332.2 | 326.1 | 445.4 | 751.3 | 1972.6
Times are in microseconds (us).
```
While migrating, also fixed `index_put_accumulate` for boolean types by using a compare-and-exchange trick over uint.
Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. It introduces the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env-variable config key-value pairs.
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. It introduces the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env-variable config key-value pairs.
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
They might have been slow on CUDA-11.3, but that version of CUDA is long gone. The more fundamental underlying issue was the linear complexity of the recursive polynomial definitions for higher-order polynomials; for example, see this loop from the implementation of the Chebyshev polynomial of the first kind:
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which were tested by `test_compare_cpu` using the following values (as sample index 16):
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)
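For intuition, here is a minimal Python sketch (not the ATen code) of the three-term recurrence; its cost grows linearly with the order n, which is what made samples with n on the order of 1e6 so slow:
```python
def chebyshev_t(x: float, n: int) -> float:
    # T_0(x) = 1, T_1(x) = x, T_{k+1}(x) = 2*x*T_k(x) - T_{k-1}(x)
    if n == 0:
        return 1.0
    t_prev, t_cur = 1.0, x
    for _ in range(n - 1):  # O(n) iterations -> slow for n ~ 1e6
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur
```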
Luckily, Chebyshev polynomials for absolute argument values greater than 1 quickly reach infinity; see below:
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
This is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7.
Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s
OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s
OK (skipped=72, expected failures=8)
```
Fixes https://github.com/pytorch/pytorch/issues/79528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
Summary:
When we compute contiguity for a tensor with dynamic shapes, we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called on it to avoid data-dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
In this PR I only handle default contiguity; I will follow up with changes for other formats like channel_last.
We use this pattern in several locations in this PR to avoid DDEs.
Test Plan:
contbuild & OSS CI,
Rollback Plan:
Reviewed By: malfet
Differential Revision: D77639021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157472
Approved by: https://github.com/aorenste
When we compute contiguity for a tensor with dynamic shapes, we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called on it to avoid data-dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
In this PR I only handle default contiguity; I will follow up with changes for other formats like channel_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
Changes needed for ROCm7.0:
* `warpSize` is _not_ a compile-time constant on device-side compilation for ROCm anymore
* `warpSize` is _not_ defined on host-side compilation, hence `at::cuda::warp_size()` must be used to query warpsize at runtime
* Redefining `C10_WARP_SIZE` to be a compile-time constant, with a reasonable value for device-side compilation, but an unreasonable value of 1 for host-side compilation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156979
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary:
D76832520 switched Executorch to use the caffe c10 headers. This copy contains a variable shadowing, which is treated as an error in certain embedded compile flows.
A simple rename avoids it.
Test Plan:
CI
Rollback Plan:
Differential Revision: D77446104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157107
Approved by: https://github.com/Skylion007
When we compute contiguity for a tensor with dynamic shapes, we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called on it to avoid data-dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
In this PR I only handle default contiguity; I will follow up with changes for other formats like channel_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
Summary: Undo the high-level BUCKification in favor of something more organized by moving it to the dir itself.
Test Plan:
CI
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D76920013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156503
Approved by: https://github.com/swolchok
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as CUDA::nvperf_host, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (whose current shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
Introduce `c10::metal::remainder` and call it from both the inductor and eager implementations, with an integer specialization, which should make it much faster than before while remaining compliant with the Python way of rounding negative numbers.
This allows one to remove the complex type-detection logic from the MPS codegen and rely on the Metal (C++) type system to figure out the input and output types.
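For reference, the Python (floored) rounding convention being matched, where the result takes the sign of the divisor:
```python
import torch

print((-3) % 5)                              # 2
print(torch.remainder(torch.tensor(-3), 5))  # tensor(2), matches Python semantics
```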
This fixes compilation of something like
```python
@torch.compile
def f(x, y):
return x[y % 5]
```
which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
#include <c10/metal/utils.h>
kernel void generated_kernel(
device float* out_ptr0,
constant long* in_ptr0,
constant float* in_ptr1,
uint xindex [[thread_position_in_grid]]
) {
int x0 = xindex;
auto tmp0 = in_ptr0[x0];
auto tmp1 = 12;
auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
auto tmp3 = 1024;
auto tmp4 = static_cast<long>(tmp3);
auto tmp5 = tmp2 + tmp4;
auto tmp6 = tmp2 < 0;
auto tmp7 = tmp6 ? tmp5 : tmp2;
if ((tmp7 < 0) && (tmp7 > 1024)) return;
auto tmp9 = in_ptr1[tmp7];
out_ptr0[x0] = static_cast<float>(tmp9);
}
with program_source:372:28: error: array subscript is not an integer
auto tmp9 = in_ptr1[tmp7];
^~~~~
```
This fixes a failure to compile for the GPT2ForSequenceClassification Hugging Face model using `transformers==4.44.2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
Fixes #136849
## Test Result
```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
streamdata = torch._C._cuda_getCurrentStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.default_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
streamdata = torch._C._cuda_getDefaultStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.set_per_process_memory_fraction(0.5, device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
This PR adds support for XPU devices to the distributed MemoryTracker tool, including a unit test for XPU.
Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device-agnostic by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in progress.)
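A rough sketch of the device-agnostic pattern described above (the backend check and the stat key shown are illustrative assumptions, not the tool's actual code):
```python
import torch

# Pick whichever accelerator backend is present, then query its allocator stats
# through the backend module rather than hard-coding torch.cuda.
dev_type = "xpu" if torch.xpu.is_available() else "cuda"
backend = torch.get_device_module(dev_type)   # e.g. torch.xpu or torch.cuda
stats = backend.memory_stats()
print(stats.get("allocated_bytes.all.current", 0))
```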
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
Summary:
Revive https://github.com/pytorch/pytorch/pull/138406. Only limit the scope to files in c10.
Summary from the original PR,
```
Looking in the code I see
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
But looking at the MSVC C++ support table I see that the [[deprecated]] attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 or later.
Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support [[deprecated]].
Therefore, since we are finished deprecating old MSVCs we can deprecate C10_DEPRECATED.
```
Test Plan: CI
Differential Revision: D72762767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151058
Approved by: https://github.com/r-barnes
Implements the forward and backward hardshrink operators as Metal kernels.
In order to support the lambda parameter, we extend the `exec_unary_kernel` and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar.
We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature.
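As a plain-PyTorch reference for the semantics the kernels implement (a sketch, not the Metal code; `lambd` stands in for the lambda parameter mentioned above):
```python
import torch

def hardshrink_ref(x: torch.Tensor, lambd: float = 0.5) -> torch.Tensor:
    # forward: keep values whose magnitude exceeds lambd, zero out the rest
    return torch.where(x.abs() > lambd, x, torch.zeros_like(x))

def hardshrink_backward_ref(grad_out: torch.Tensor, x: torch.Tensor, lambd: float = 0.5) -> torch.Tensor:
    # backward: gradients pass through only where the input was kept
    return torch.where(x.abs() > lambd, grad_out, torch.zeros_like(grad_out))
```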
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304
Approved by: https://github.com/malfet
Metal arguments must be 8-byte aligned (or possibly 16-byte aligned), so running any strided (or typecast) binary op with MTL_DEBUG_LAYER enabled leads to an exception:
```
% MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
2025-06-05 15:41:34.201 Python[86653:16826825] Metal API Validation Enabled
test_output_match_add_mps_bfloat16 (__main__.TestConsistencyMPS.test_output_match_add_mps_bfloat16) ...
validateComputeFunctionArguments:1083: failed assertion `Compute Function(add_strided_bfloat_bfloat): argument ndim[0] from buffer(7) with offset(0) and length(12) has space for 12 bytes, but argument has a length(16).'
zsh: abort MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
```
Extend it to 4 elements and pass the output dtype, which will be used by binary_op later on anyway.
Test plan: Run the abovementioned command with `MTL_DEBUG_LAYER=1` and make sure everything passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155272
Approved by: https://github.com/angelayi, https://github.com/dcci, https://github.com/cyyever
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.
Test Plan: CI
Differential Revision: D75756619
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
The new `AcceleratorError` inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA should contain the error returned by `cudaGetLastError`.
`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references
Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'
import torch
x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
x[y] = 2
print(x)
except torch.AcceleratorError as e:
print("Exception was raised", e.args[0])
print("Captured error code is ", e.error_code)
```
which produces following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0
Captured error code is 710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
Summary: AMD streams are lazily initialized, and sometimes (e.g. when we just want to do event recording on the stream) we might not be setting the device guard while it's initializing, which would lead to an invalid configuration error.
Differential Revision: D75456460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154433
Approved by: https://github.com/jeffdaily
Removes MemPoolContext from custom user mempools. The ground truth for which pool should be used is graph_pools' active pool, and MemPoolContext just introduced an opportunity for the pool pointed to by MemPoolContext and the active pool in graph_pools to go out of sync (see all the asserts in the code guarding against that, and yet it could still happen in a multithreaded scenario; see my recent PR #153990).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154042
Approved by: https://github.com/albanD, https://github.com/syed-ahmed
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors; in that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in that case we probably want to use the definitely_contiguous API.
This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute saying the tensor is contiguous when it is always contiguous, and we store that only if definitely_contiguous is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors; in that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in that case we probably want to use the definitely_contiguous API.
This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute saying the tensor is contiguous when it is always contiguous, and we store that only if definitely_contiguous is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
This accomplishes the following:
- Fixes correctness problem with large integer types (though probably makes it slower, but this could not be avoided if one wants to compute accurate answer)
- Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph)
- Eliminates need for several correctness workarounds
Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
Summary:
To add PT2 information to the memory snapshot, we piggyback off of the Kineto implementation using record_function, similar to how the user annotations are added. To do this we add the following:
1. A stack implementation that we instantiate to keep track of which compile context we are currently in (the top element of the stack). The stack is per device and thread-local, since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl, since no two threads will touch a given stack (see the sketch after this list).
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope, which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op, so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag that is off by default, so existing jobs will be unaffected.
3. Piping for compile context to pickle output
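A minimal Python sketch of the per-device, thread-local stack idea from item 1 (illustrative only; the real implementation lives in the C++ profiler code):
```python
import threading
from collections import defaultdict

class CompileContextStacks(threading.local):
    # threading.local re-runs __init__ for each thread that touches the object,
    # so every thread sees its own per-device stacks and no mutex is needed.
    def __init__(self):
        self.stacks = defaultdict(list)  # device index -> stack of context names

    def push(self, device: int, ctx: str) -> None:
        self.stacks[device].append(ctx)

    def pop(self, device: int) -> str:
        return self.stacks[device].pop()

    def current(self, device: int):
        s = self.stacks[device]
        return s[-1] if s else None
```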
Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}
Differential Revision: D74028214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
Summary: - Expose `c10_retrieve_device_side_assertion_info()` for use by external code. The motivating use case is FBGEMM kernel launcher utilities, which add FBGEMM-specific context to the errors coming out of Torch DSA
Test Plan: OSS CI
Differential Revision: D74432771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153211
Approved by: https://github.com/Skylion007
Fixes: https://github.com/intel/torch-xpu-ops/issues/1503
The `sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler-internal header. It is not documented for direct use (see the extension specification linked below) and is not guaranteed to exist. Instead, the documented usage of the extension is to include `sycl/sycl.hpp`, which in turn includes the `bfloat16.hpp` header (an implementation detail).
We ran into issues by explicitly including the `bfloat16.hpp` sycl header within a user-facing production environment where the `intel-sycl-rt` wheel is installed (which is a dependency of the `torch` wheel package built and publicly available for xpu). The compiler includes this file from `intel-sycl-rt`, and due to the `#pragma once` usage its content is included again, giving redefinitions of the symbols in this file (the previous inclusion comes from `sycl/sycl.hpp`):
```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
| ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
|
```
While the SYCL header files themselves can be improved (`#pragma once` dropped), we still must correct the usage of the sycl `bfloat16.hpp` header in pytorch, i.e. drop it. Fortunately, this helps to address the reported redefinition issue, though a follow-up on the compiler side is still required.
Also, using `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS` to guard the inclusion of `sycl/sycl.hpp` does not make sense, since that macro is defined in this very header. Thus, we should use `SYCL_LANGUAGE_VERSION` instead, which is defined at the compiler level.
See: f958dce280/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc
CC: @EikanWang, @guangyey, @gujinghui
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152562
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
What was initially supposed to be a very straightforward change resulted in a small refactor of the binary op tensor generators when invoked for mixed dtypes, which surfaced via the `test_output_grad_match_sinc_mps_float16` test failure.
If the operands are of different dtypes (in particular a float16 tensor and a float32 scalar), one must perform the operation with `opmath_t` (or `TensorIterator::common_dtype()`) precision, rather than casting both operands to the output dtype and performing it then, as demonstrated by the following example:
```
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926, 8.5938, 5.9766], dtype=torch.half).mul(torch.pi)
tensor([ -5.8555, 19.4844, -7.0703, -10.6562, 27.0000, 18.7812],
dtype=torch.float16)
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926, 8.5938, 5.9766], dtype=torch.half).mul(torch.tensor(torch.pi, dtype=torch.float16))
tensor([ -5.8516, 19.4844, -7.0664, -10.6562, 26.9844, 18.7656],
dtype=torch.float16)
```
Solve this problem for now by introducing `REGISTER_OPMATH_BINARY_OP`, which indicates that operands must be cast to `opmath_t` before performing the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152515
Approved by: https://github.com/Skylion007, https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #152663
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
The standard requires that the argument to functions like `isdigit`, `isalpha`, and similar must be either `EOF` or an `unsigned char`; otherwise, the behavior is undefined (UB).
To avoid out-of-bounds reads, modern implementations of some libraries (such as glibc) deliberately pad their internal tables to guarantee valid memory access even for negative values. However, this is implementation-specific, and other libraries may not do this.
Properly casting the argument to `unsigned char` is good practice to avoid potential issues on some platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152360
Approved by: https://github.com/cyyever, https://github.com/Skylion007
First of all, extend `c10::metal::cast_to` to work correctly with complex dtypes by introducing two more specializations: one that casts complex to scalar, and another that casts scalar to complex (as the default Metal typecast would turn `float x` into `float2(x, x)`).
Add ComplexHalf and ComplexFloat enum values to `c10::metal::ScalarTypes` and handle them in `val_at_offs(ptr, offs, type)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152504
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466, #152479
Summary:
We suspect that switching the NVCC host compiler from GCC to Clang, while targeting multiple architectures, is causing issues because only `__CUDA_ARCH_LIST__` is being passed, without `__CUDA_ARCH__`.
To resolve this c10 compilation error, we should first fix the problem and then switch the NVCC host compiler from GCC to Clang. Once this is done, the errors no longer occur.
Test Plan: CI
Reviewed By: zhuhan0
Differential Revision: D73383236
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152030
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
MemPool is a separate pool of memory handled by the caching allocator. This PR adds the option let the caching allocator try to use this pool as a last resort instead of OOMing by associating a use_on_oom bool with each MemPool.
Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.
```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```
Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.
The main behavioral change introduced in this diff is on CheckFunctionManager:
```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")
guards_state: bytes = check_fn_manager.guards_state
```
Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state`, which should contain all the information needed for deserializing the guards later.
When we load the guards state back, we will set `guards_serialization_mode` to `load`:
```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```
# TENSOR_MATCH
Since we have many types of guards to support, we will break the work into small diffs instead of a single diff supporting every guard type.
We kick off the work with TENSOR_MATCH in this diff.
# Testing
For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs at both the reference and the loaded guard manager to see if their behaviors match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.
The main behavioral change introduced in this diff is on CheckFunctionManager:
```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")
guards_state: bytes = check_fn_manager.guards_state
```
Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state`, which should contain all the information needed for deserializing the guards later.
When we load the guards state back, we will set `guards_serialization_mode` to `load`:
```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```
# TENSOR_MATCH
Since we have many types of guards to support, we will break the work into small diffs instead of a single diff supporting every guard type.
We kick off the work with TENSOR_MATCH in this diff.
# Testing
For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs at both the reference and the loaded guard manager to see if their behaviors match.
Differential Revision: [D72987485](https://our.internmc.facebook.com/intern/diff/D72987485/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305