pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Laith Sakka	39df901b2a	introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432 ) When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors; in that case we cannot really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in those places we probably want to use the definitely_contiguous API instead. This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute saying that the tensor is contiguous only when it is always contiguous, i.e. we store it only if definitely_contiguous is true. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432 Approved by: https://github.com/bobrenjc93	2025-05-28 03:41:26 +00:00
bobrenjc93	919a1a17e3	[ez] Replace misleading implementations with NYI (#154440 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154440 Approved by: https://github.com/Skylion007, https://github.com/pianpwk	2025-05-28 02:21:56 +00:00
Nikita Shulga	f472ea63bb	[BE] Fix typos in SyntaxError description (#154436 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154436 Approved by: https://github.com/seemethere, https://github.com/wdvr, https://github.com/ZainRizvi	2025-05-27 18:08:58 +00:00
PyTorch MergeBot	11a51a11af	Revert "introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432 )" This reverts commit `5c6d7caaaa`. Reverted https://github.com/pytorch/pytorch/pull/153432 on behalf of https://github.com/malfet due to Looks like it broke flex attention tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=g6.4xlarge&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/153432#issuecomment-2912562570))	2025-05-27 13:42:34 +00:00
Laith Sakka	5c6d7caaaa	introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432 ) When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors; in that case we cannot really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in those places we probably want to use the definitely_contiguous API instead. This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute saying that the tensor is contiguous only when it is always contiguous, i.e. we store it only if definitely_contiguous is true. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432 Approved by: https://github.com/bobrenjc93	2025-05-27 08:54:31 +00:00
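The difference between asking "is it contiguous?" and "is it definitely contiguous?" can be sketched in plain Python. This is a toy model, not PyTorch's implementation: `None` stands in for an unbacked symbolic size or stride, and the names `contiguous_strides` and `definitely_contiguous_sketch` are hypothetical.

```python
from typing import List, Optional, Sequence

Dim = Optional[int]  # None models an unbacked (unknown) size or stride


def contiguous_strides(sizes: Sequence[int]) -> List[int]:
    """Row-major strides for fully known sizes."""
    strides, running = [], 1
    for size in reversed(sizes):
        strides.append(running)
        running *= size
    return list(reversed(strides))


def definitely_contiguous_sketch(sizes: Sequence[Dim], strides: Sequence[Dim]) -> bool:
    """Return True only when contiguity can be proven; 'unknown' maps to False."""
    if any(s is None for s in sizes) or any(s is None for s in strides):
        # Cannot decide: answer False so the caller takes the general path
        # instead of failing while trying to evaluate the condition.
        return False
    expected = contiguous_strides(sizes)
    # A dimension of size 1 may carry any stride without affecting layout.
    return all(sz == 1 or st == ex for sz, st, ex in zip(sizes, strides, expected))
```

A caller would use the result only to pick a fast path; a `False` answer means "use the general path", not "the tensor is non-contiguous".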
Nikita Shulga	975bbc63db	[MPS][BE] Move fmod/remainder to Metal ops (#154280 ) This accomplishes following: - Fixes correctness problem with large integer types (though probably makes it slower, but this could not be avoided if one wants to compute accurate answer) - Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph) - Eliminates need for several correctness workarounds Fixes https://github.com/pytorch/pytorch/issues/154171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280 Approved by: https://github.com/dcci ghstack dependencies: #154275, #154290	2025-05-24 01:45:33 +00:00
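As background for the two ops named above: `fmod` follows the C convention (the result takes the sign of the dividend) while `remainder` follows the Python `%` convention (the result takes the sign of the divisor). The sketch below shows both on integers, and also shows how routing a large integer through floating point can change the answer. That last point is an assumption about one plausible source of large-integer errors; the commit message does not state the cause. The helper names are hypothetical.

```python
def fmod_int(a: int, b: int) -> int:
    """C-style remainder: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def remainder_int(a: int, b: int) -> int:
    """Python-style remainder: the result takes the sign of the divisor."""
    return a % b


# Exact integer arithmetic versus a detour through float64.
big = 2**53 + 1                     # not representable exactly as a float64
exact = big % 3                     # computed on integers
via_float = int(float(big) % 3.0)   # float(big) rounds to 2**53 first
```

Here `exact` and `via_float` disagree, which is why an integer kernel may be slower but is needed for an accurate result.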
Natalia Gimelshein	401fa87ace	make only current thread allocate to pool in NcclPG (#153990 ) follow up to #153356 that fixes nccl allocation to pool Pull Request resolved: https://github.com/pytorch/pytorch/pull/153990 Approved by: https://github.com/kwen2501	2025-05-21 21:57:37 +00:00
Frost Mitchell	fe49b11e09	Add memory reporting for XPU to Memory Profiler (#152842 ) Adds support for XPU profile_memory in Pytorch Profiler. Currently, when `profile_memory=True` is passed to `torch.profiler.profile`, there is no XPU memory reported. For example, the profiling table printed by the code below is missing any `XPU Mem` columns:

profiling.py:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, ProfilerActivity


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.conv1 = nn.Conv1d(20, 20, 15, padding="same")
        self.flatten = nn.Flatten()
        self.net1 = nn.Linear(2048, 4096)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(4096, 5)

    def forward(self, x):
        res = self.conv1(x)
        res = self.flatten(res)
        res = self.net1(res)
        return self.net2(self.relu(res))


def demo_basic():
    model = ToyModel().to("xpu")
    loss_fn = nn.MSELoss().to("xpu")
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU], profile_memory=True) as prof:
        for epoch in range(10):
            optimizer.zero_grad()
            outputs = model(torch.randn(20, 2048).to("xpu"))
            labels = torch.randn(20, 5).to("xpu")
            loss_fn(outputs, labels).backward()
            optimizer.step()
    print(prof.key_averages().table(max_name_column_width=100, sort_by="xpu_time_total", row_limit=100))


if __name__ == "__main__":
    demo_basic()
```

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Name                                                       Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
gemm_kernel                                                     0.00%       0.000us         0.00%       0.000us       0.000us       1.501ms        44.73%       1.501ms      25.024us           0 b           0 b            60
autograd::engine::evaluate_function: AddmmBackward0             0.12%       1.067ms        30.47%     260.929ms      13.046ms       0.000us         0.00%       1.009ms      50.448us           0 b           0 b            20
AddmmBackward0                                                  0.09%     744.983us        15.99%     136.944ms       6.847ms       0.000us         0.00%     784.640us      39.232us           0 b           0 b            20
aten::mm                                                       15.41%     131.956ms        15.79%     135.167ms       3.379ms     784.640us        23.37%     784.640us      19.616us           0 b           0 b            40
aten::linear                                                    0.02%     156.361us        20.58%     176.187ms       8.809ms       0.000us         0.00%     741.760us      37.088us           0 b           0 b            20
aten::addmm                                                    20.25%     173.371ms        20.52%     175.723ms       8.786ms     741.760us        22.10%     741.760us      37.088us           0 b           0 b            20
Optimizer.step#SGD.step                                         0.40%       3.429ms         5.55%      47.509ms       4.751ms       0.000us         0.00%     488.960us      48.896us           0 b           0 b            10
aten::_foreach_add_                                             4.81%      41.162ms         5.15%      44.080ms       4.408ms     488.960us        14.57%     488.960us      48.896us           0 b           0 b            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     422.880us        12.60%     422.880us      42.288us           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.03%     280.041us         4.36%      37.328ms       3.733ms       0.000us         0.00%     356.320us      35.632us           0 b           0 b            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 856.227ms
Self XPU time total: 3.357ms
```

This PR updates the XPUCachingAllocator.cpp to report allocation events to the Profiler, and causes these to be printed in the table:

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Name                                                       Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem       XPU Mem  Self XPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
gemm_kernel                                                     0.00%       0.000us         0.00%       0.000us       0.000us       1.436ms        43.64%       1.436ms      23.939us           0 b           0 b           0 b           0 b            60
autograd::engine::evaluate_function: AddmmBackward0             0.13%       1.186ms        29.92%     262.875ms      13.144ms       0.000us         0.00%       1.005ms      50.272us           0 b           0 b     320.94 Mb      -4.69 Mb            20
AddmmBackward0                                                  0.09%     815.288us        16.48%     144.802ms       7.240ms       0.000us         0.00%     790.720us      39.536us           0 b           0 b     325.47 Mb           0 b            20
aten::mm                                                       15.86%     139.342ms        16.26%     142.875ms       3.572ms     790.720us        24.03%     790.720us      19.768us           0 b           0 b     325.47 Mb     325.47 Mb            40
aten::linear                                                    0.02%     182.856us        20.46%     179.775ms       8.989ms       0.000us         0.00%     669.440us      33.472us           0 b           0 b       3.13 Mb           0 b            20
aten::addmm                                                    20.10%     176.607ms        20.40%     179.210ms       8.961ms     669.440us        20.34%     669.440us      33.472us           0 b           0 b       3.13 Mb       3.13 Mb            20
Optimizer.step#SGD.step                                         0.42%       3.692ms         5.61%      49.267ms       4.927ms       0.000us         0.00%     486.640us      48.664us           0 b           0 b           0 b           0 b            10
aten::_foreach_add_                                             4.83%      42.439ms         5.19%      45.574ms       4.557ms     486.640us        14.79%     486.640us      48.664us           0 b           0 b           0 b     -20.00 Kb            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     420.960us        12.79%     420.960us      42.096us           0 b           0 b           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.04%     310.719us         4.47%      39.279ms       3.928ms       0.000us         0.00%     339.520us      33.952us           0 b           0 b      -2.89 Mb      -3.12 Mb            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 878.627ms
Self XPU time total: 3.291ms
```

These XPU memory numbers match the same profiling results on CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152842 Approved by: https://github.com/guangyey, https://github.com/sraikund16	2025-05-21 01:19:19 +00:00
Nikita Shulga	58dc80dff6	[MPSInductor] Fix indexing calculation (#153997 ) By using the `c10::metal::floor_divide` primitive, which fixes the `test_flip_cat_mps` test and makes `doctr_reco_predictor` and `doctr_det_predictor` pass accuracy checks (at least locally; a workflow dispatch was scheduled to validate it in CI). Before this change, the following script generated different compile and eager results:

```python
import torch


def foo(unsqueeze, unsqueeze_1):
    cat_1 = torch.ops.aten.cat.default([unsqueeze, unsqueeze_1], 1)
    view = torch.ops.aten.view.default(cat_1, [4])
    slice_5 = torch.ops.aten.slice.Tensor(view, 0, 0, 3)
    rev_1 = torch.ops.aten.flip.default(slice_5, [0])
    return rev_1


if __name__ == "__main__":
    x = torch.arange(1.0, 3.0, device='mps').reshape(2, 1)
    y = torch.arange(5.0, 7.0, device='mps').reshape(2, 1)

    rc, (kernel,) = torch._inductor.utils.run_and_get_kernels(torch.compile(foo), x, y)
    print(kernel)
    print("Compile: ", rc)
    print("Eager: ", foo(x, y))
```

After this change:

```
'''
#include <c10/metal/utils.h>
kernel void generated_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    constant float* in_ptr1,
    uint xindex [[thread_position_in_grid]]
) {
    int x0 = xindex;
    auto tmp6 = in_ptr0[1 + (c10::metal::floor_divide((-1)*x0, 2))];
    auto tmp11 = in_ptr1[1 + (c10::metal::floor_divide((-1)*x0, 2))];
    auto tmp0 = (2 + ((-1)*x0)) % (2);
    auto tmp1 = static_cast<long>(tmp0);
    auto tmp2 = 0;
    auto tmp3 = tmp1 >= tmp2;
    auto tmp4 = 1;
    auto tmp5 = tmp1 < tmp4;
    auto tmp7 = tmp5 ? tmp6 : 0.0;
    auto tmp8 = tmp1 >= tmp4;
    auto tmp9 = 2;
    auto tmp10 = tmp1 < tmp9;
    auto tmp12 = tmp8 ? tmp11 : 0.0;
    auto tmp13 = tmp5 ? tmp7 : tmp12;
    out_ptr0[x0] = static_cast<float>(tmp13);
}
'''
Compile: tensor([2., 5., 1.], device='mps:0')
Eager: tensor([2., 5., 1.], device='mps:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153997 Approved by: https://github.com/dcci ghstack dependencies: #153970, #153971	2025-05-21 00:03:46 +00:00
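Why floor division matters for this index can be checked with a small Python emulation of the generated kernel. This is a sketch under assumptions: it supposes the earlier mismatch came from C-style division that rounds toward zero (the commit message does not show the old kernel), and the helper names `c_trunc_div` and `emulate_kernel` are hypothetical.

```python
def c_trunc_div(a: int, b: int) -> int:
    """Integer division as in C-family languages: rounds toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floor_div(a: int, b: int) -> int:
    """Floor division: rounds toward negative infinity."""
    return a // b


def emulate_kernel(in0, in1, div):
    """Mirror the index arithmetic of the kernel above for x0 = 0, 1, 2."""
    out = []
    for x0 in range(3):
        idx = 1 + div(-1 * x0, 2)        # index into whichever input is selected
        tmp0 = (2 + (-1) * x0) % 2       # 0 selects in_ptr0, 1 selects in_ptr1
        out.append(in0[idx] if tmp0 < 1 else in1[idx])
    return out


x = [1.0, 2.0]  # torch.arange(1.0, 3.0)
y = [5.0, 6.0]  # torch.arange(5.0, 7.0)
with_floor = emulate_kernel(x, y, floor_div)    # matches the eager result
with_trunc = emulate_kernel(x, y, c_trunc_div)  # differs at x0 == 1
```

The two variants differ only for negative, non-divisible numerators: `-1 // 2` is `-1` under floor division but `0` under truncation, which shifts the read by one element.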
pbialecki	e8f8baf71f	set CUDA_MODULE_LOADING for older drivers only (#152695 ) `CUDA_MODULE_LOADING=LAZY` is the default for all drivers shipped with CUDA >=12.2 and we should check the driver version before setting the env variable. (the `LOG(WARNING)` has to be removed before merging) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152695 Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/nWEIdia	2025-05-20 19:34:40 +00:00
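The version gate described above can be sketched as follows. This is illustrative only: the actual change lives in PyTorch's C++ initialization code, and the function name, the tuple representation of the driver version, and the choice to leave an existing user-provided value untouched are all assumptions of this sketch.

```python
import os
from typing import MutableMapping, Optional, Tuple

# Per the commit message, lazy loading is the default from CUDA 12.2 drivers on.
LAZY_DEFAULT_SINCE: Tuple[int, int] = (12, 2)


def maybe_set_lazy_loading(
    driver_version: Tuple[int, int],
    env: Optional[MutableMapping[str, str]] = None,
) -> bool:
    """Set CUDA_MODULE_LOADING=LAZY only for drivers older than 12.2.

    Returns True when the variable was considered for setting.
    """
    env = os.environ if env is None else env
    if driver_version >= LAZY_DEFAULT_SINCE:
        return False  # newer drivers already default to lazy loading
    # Assumption: an explicit user setting should win, hence setdefault.
    env.setdefault("CUDA_MODULE_LOADING", "LAZY")
    return True
```

Passing a plain dict as `env` keeps the sketch testable without touching the real process environment.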
Nikita Shulga	c4d1ff02f8	[Lint] Update clang-format to 19.1.4 (#153889 ) All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889 Approved by: https://github.com/cyyever, https://github.com/atalman	2025-05-20 14:12:46 +00:00
cyy	a8986963da	Fix some CMake issues (#153686 ) These issues were discovered when trying CMake 3.27: 1. set C++ language on HIP sources. 2. add missing link to gtest_main. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153686 Approved by: https://github.com/Skylion007	2025-05-19 00:31:34 +00:00
cyy	9d3b6ee4c1	[submodule] Update gtest to v1.17.0 (#153618 ) And remove some outdated CMake code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618 Approved by: https://github.com/malfet	2025-05-16 01:24:19 +00:00
redwrasse	f7798d8645	Checks kv pair indexing in OrderedPreservingDictTest.test_range_insert (#148136 ) `OrderedPreservingDictTest.test_range_insert` has an [unused loop variable `j`](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L186), I think taken from the [inspired project](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L165) testcase for range inserts, where it [checks kv pair indexing/order](https://github.com/Tessil/ordered-map/blob/master/tests/ordered_map_tests.cpp#L136) for the ordered dict. This just adds in that functionality to the test case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148136 Approved by: https://github.com/eellison	2025-05-14 06:05:23 +00:00
Scott Wolchok	e8662e836a	Remove std::is_arithmetic specialization from c10/util/strong_type.h (#153424 ) Specializing std::is_arithmetic has undefined behavior (and breaks builds with -Winvalid-specialization). Should fix #150901 Differential Revision: [D74614724](https://our.internmc.facebook.com/intern/diff/D74614724/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153424 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-05-14 02:01:32 +00:00
TJ Yin	81719ebde3	[caffe2] Make c10::str work with scoped enum (#152705 ) (#152714 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/152705 Test Plan: ``` buck2 test fbcode//caffe2/c10/test:util_base_tests --fail-fast ``` Differential Revision: D74087796 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152714 Approved by: https://github.com/Skylion007	2025-05-13 21:05:36 +00:00
Shivam Raikundalia	dbb4444ce3	[Memento] Add PT2 to Memory Snapshot (#152707 ) Summary: To add PT2 information to memory snapshot we piggyback off of the Kineto implementation using record_function similar to adding the user annotations. To do this we add the following: 1. Stack implementation that we instantiate to keep track of which compile context stack we are currently in (top element of the stack). The stack will be per device and thread-local since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl since no two threads will touch a given stack 2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in the fact that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag set to off on default so existing jobs will be unaffected 3. Piping for compile context to pickle output Test Plan: In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658} Differential Revision: D74028214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707 Approved by: https://github.com/eqy	2025-05-12 21:12:51 +00:00
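Points 1 and 2 above (a per-device, thread-local stack fed by function enter/exit hooks that ignore everything except compiled regions) can be sketched in Python. The real implementation is C++ built on RecordFunction; every name below is hypothetical and the hook signatures are an assumption.

```python
import threading
from collections import defaultdict
from typing import Optional

_PREFIX = "Torch-Compiled Region"
_tls = threading.local()  # each thread gets its own stacks, so no mutex is needed


def _stacks():
    """Return this thread's mapping of device -> compile-context stack."""
    if not hasattr(_tls, "stacks"):
        _tls.stacks = defaultdict(list)
    return _tls.stacks


def on_function_enter(name: str, device: int) -> bool:
    """Enter hook: push compiled regions, treat everything else as a no-op."""
    if not name.startswith(_PREFIX):
        return False
    _stacks()[device].append(name)
    return True


def on_function_exit(name: str, device: int) -> None:
    """Exit hook: pop the matching compiled region, if any."""
    stack = _stacks()[device]
    if name.startswith(_PREFIX) and stack:
        stack.pop()


def current_compile_context(device: int) -> Optional[str]:
    """Top of the stack is the compile context allocations are attributed to."""
    stack = _stacks()[device]
    return stack[-1] if stack else None
```

Because the storage is thread-local, a second thread that has not entered a compiled region sees no context even while the first thread is inside one.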
Benson Ma	639793c17e	[pytorch] Expose `c10_retrieve_device_side_assertion_info()` for use by external code (#153211 ) Summary: - Expose `c10_retrieve_device_side_assertion_info()` for use by external code. The motivating use case is FBGEMM kernel launcher utilities, which add FBGEMM-specific context to the errors coming out of Torch DSA Test Plan: OSS CI Differential Revision: D74432771 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153211 Approved by: https://github.com/Skylion007	2025-05-10 01:08:45 +00:00
Natalia Gimelshein	9ae722cdb4	allocate cuMem memory with rdma flag (#153261 ) to be able to register memory with ibverbs Pull Request resolved: https://github.com/pytorch/pytorch/pull/153261 Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/Skylion007	2025-05-09 21:48:48 +00:00
Dmitry Rogozhkin	10234ccefe	xpu: rely on sycl/sycl.hpp to include bfloat16.hpp (#152562 ) Fixes: https://github.com/intel/torch-xpu-ops/issues/1503 The `sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler internal header. It is not documented for usage (see the extension specification linked below) and is not guaranteed to exist. Instead, the documented usage of the extension is to include `sycl/sycl.hpp`, which in turn includes the `bfloat16.hpp` header (an implementation detail). We ran into issues by explicitly including the `bfloat16.hpp` SYCL header within a user-facing production environment when the `intel-sycl-rt` wheel is installed (it is a dependency of the publicly available `torch` wheel package built for xpu). The compiler includes this file from `intel-sycl-rt`, and because it relies on `#pragma once` its content is included as well, giving redefinitions of the symbols in this file (the previous inclusion comes from `sycl/sycl.hpp`):

```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |                       ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |
```

While the SYCL header files themselves can be improved (`#pragma once` dropped), we must still correct the usage of the SYCL `bfloat16.hpp` header in PyTorch, i.e. drop it. This fortunately addresses the reported redefinition issue, though a follow-up on the compiler side is still required. Also, using `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS` to guard the inclusion of `sycl/sycl.hpp` does not make sense, since it is defined in this very header. Thus, we should use `SYCL_LANGUAGE_VERSION` instead, which is defined at the compiler level.
See: `f958dce280/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc` CC: @EikanWang, @guangyey, @gujinghui Pull Request resolved: https://github.com/pytorch/pytorch/pull/152562 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD	2025-05-09 02:25:44 +00:00
cyy	d291fa8ecc	Avoid std::chrono::system_clock (#153135 ) This PR replaces most `std::chrono::system_clock` with `std::chrono::steady_clock` if the duration is used in condition variables. Ideally system clocks should be used only to log wall-clock times. Some `high_resolution_clock` are also changed to `steady_clock` because its resolution is not required in the context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153135 Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet	2025-05-08 16:30:29 +00:00
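The same guideline applies outside C++. As a Python analogue (not the code touched by this PR), `time.time()` plays roughly the role of `std::chrono::system_clock` and `time.monotonic()` that of `std::chrono::steady_clock`. The sketch below computes a condition-variable deadline on the monotonic clock so that wall-clock adjustments cannot lengthen or shorten the wait; the helper name `wait_for` is hypothetical.

```python
import threading
import time
from typing import Callable


def wait_for(cond: threading.Condition, predicate: Callable[[], bool], timeout: float) -> bool:
    """Wait until predicate() holds or `timeout` seconds elapse.

    The deadline is tracked with time.monotonic(), which never jumps
    backwards or forwards when the system clock is adjusted.
    """
    deadline = time.monotonic() + timeout
    with cond:
        while not predicate():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False  # timed out
            cond.wait(remaining)
        return True
```

Wall-clock time remains the right choice for timestamps written to logs, as the message notes.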
Yiming Zhou	13fbf21a76	[nativert] Port string join and split to c10/util (#152873 ) Summary: Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72 Port string utils functions join and split to c10/util Test Plan: Added tests in `string_util_test.cpp` buck2 run mode/opt caffe2/c10/test:util_base_tests Differential Revision: D74202473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152873 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-05-07 03:58:11 +00:00
dolpm	a766c1d117	[nativert] move intrusive list to c10/util (#152754 ) Summary: nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed. This diff moves intrusive list to c10/util Test Plan: CI Differential Revision: D74104595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152754 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-05-05 18:49:56 +00:00
Nikita Shulga	e889937850	[MPS] Migrate `div` to Metal (#152743 ) TODOs: - Verify accuracy of `metal::dot` vs `x.x*x.x + y.y*y.y` Pull Request resolved: https://github.com/pytorch/pytorch/pull/152743 Approved by: https://github.com/dcci, https://github.com/Skylion007 ghstack dependencies: #152663, #152515, #152737	2025-05-04 00:56:19 +00:00
rzou	762844355e	Make DispatchKeySet serializable; add `__eq__` (#152732 ) These seem like reasonable things to add. Also fixes a bug in vLLM for me. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/152732 Approved by: https://github.com/bdhirsh	2025-05-03 14:40:06 +00:00
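What "serializable plus `__eq__`" means for a set-like value type can be sketched with a small bitset class. This is a hypothetical stand-in, not the real `DispatchKeySet` binding: the class name, the integer representation, and the use of `__reduce__` are assumptions of the sketch.

```python
class KeySetSketch:
    """Toy bitset standing in for a dispatch key set."""

    def __init__(self, bits: int = 0):
        self.bits = bits

    def add(self, key_index: int) -> "KeySetSketch":
        """Return a new set with the given key bit switched on."""
        return KeySetSketch(self.bits | (1 << key_index))

    def has(self, key_index: int) -> bool:
        return bool(self.bits & (1 << key_index))

    def __eq__(self, other: object) -> bool:
        # Value equality: two sets are equal when they hold the same keys.
        if not isinstance(other, KeySetSketch):
            return NotImplemented
        return self.bits == other.bits

    def __hash__(self) -> int:
        # Defining __eq__ removes the default hash, so restore a consistent one.
        return hash(self.bits)

    def __reduce__(self):
        # Tells pickle how to rebuild the object: call the class with the raw bits.
        return (KeySetSketch, (self.bits,))
```

With value equality in place, a round trip through serialization can be checked by comparing the rebuilt object to the original rather than comparing identities.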
Nikita Shulga	792736f9ac	[BE][MPS] Pass `alpha` by reference (#152737 ) As it's always a scalar Pull Request resolved: https://github.com/pytorch/pytorch/pull/152737 Approved by: https://github.com/dcci ghstack dependencies: #152663, #152515	2025-05-03 08:31:45 +00:00
Nikita Shulga	34e9f0b5c6	[MPS] Migrate mul to TensorIterator (#152515 ) What was initially supposed to be a very straightforward change resulted in a small refactor of the binary op tensor generators when invoked for mixed dtypes, which surfaced via the `test_output_grad_match_sinc_mps_float16` test failure. If the operands are of different dtypes (in particular a float16 tensor and a float32 scalar), one must perform the operation with `opmath_t` (or `TensorIterator::common_dtype()`) precision, rather than casting both operands to the output dtype and performing it then, which can be demonstrated via the following example:

```
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926, 8.5938, 5.9766], dtype=torch.half).mul(torch.pi)
tensor([ -5.8555, 19.4844, -7.0703, -10.6562, 27.0000, 18.7812], dtype=torch.float16)
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926, 8.5938, 5.9766], dtype=torch.half).mul(torch.tensor(torch.pi, dtype=torch.float16))
tensor([ -5.8516, 19.4844, -7.0664, -10.6562, 26.9844, 18.7656], dtype=torch.float16)
```

Solve this problem for now by introducing `REGISTER_OPMATH_BINARY_OP`, which indicates that the operands must be cast to `opmath_t` before performing the computation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152515 Approved by: https://github.com/Skylion007, https://github.com/kulinseth, https://github.com/dcci ghstack dependencies: #152663	2025-05-03 02:35:03 +00:00
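The precision difference can be reproduced without a GPU or PyTorch by rounding to IEEE float16 with the standard library's `struct` module. In this sketch a Python float (float64) stands in for the higher-precision `opmath_t`, which is an assumption; the helper names are hypothetical.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mul_opmath(x_half: float, scalar: float) -> float:
    """Multiply in higher precision, then round the result to float16 once."""
    return to_half(x_half * scalar)


def mul_cast_first(x_half: float, scalar: float) -> float:
    """Round the scalar to float16 first, then multiply and round again."""
    return to_half(x_half * to_half(scalar))


x = to_half(-1.8633)                   # first element of the example tensor
opmath = mul_opmath(x, math.pi)        # -5.85546875, printed as -5.8555 above
cast_first = mul_cast_first(x, math.pi)  # -5.8515625, printed as -5.8516 above
```

The gap comes from rounding pi to 3.140625 before the multiply, an error that the final rounding to float16 cannot undo.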
Laith Sakka	376529c78b	consolidate guard_or_x and definitely_x (#152463 ) definitely_true is almost same as guard_or_false, the potential differences are not meaningful to a degree that justify the existence of both. same for definitely_false, it can be expressed with guard_or_true and guard_or_false. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463 Approved by: https://github.com/bobrenjc93	2025-05-02 18:08:11 +00:00
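The equivalence claimed above can be illustrated with a tri-state model in which `None` means "cannot be decided without adding a guard". This is a toy sketch, not the symbolic-shapes implementation, and the exact way `definitely_false` is rewritten here (as the negation of `guard_or_true`) is an assumption.

```python
from typing import Optional

TriState = Optional[bool]  # True, False, or None for "undecidable"


def guard_or_false(value: TriState) -> bool:
    """Return the value if known, otherwise fall back to False."""
    return False if value is None else value


def guard_or_true(value: TriState) -> bool:
    """Return the value if known, otherwise fall back to True."""
    return True if value is None else value


def definitely_true(value: TriState) -> bool:
    """True only when the condition is known to hold."""
    return guard_or_false(value)


def definitely_false(value: TriState) -> bool:
    """True only when the condition is known not to hold."""
    return not guard_or_true(value)
```

In this model both `definitely_*` helpers answer `False` for an undecidable input, which is why they add nothing beyond the `guard_or_*` pair.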
cyy	e9e1aacef8	Enable -Wunused on torch targets (#150077 ) For GCC, ``-Wunused`` contains:

```
-Wunused-function
  Warn whenever a static function is declared but not defined or a non-inline static function is unused.
-Wunused-label
  Warn whenever a label is declared but not used. To suppress this warning use the unused attribute.
-Wunused-parameter
  Warn whenever a function parameter is unused aside from its declaration. To suppress this warning use the unused attribute.
-Wunused-variable
  Warn whenever a local variable or non-constant static variable is unused aside from its declaration. To suppress this warning use the unused attribute.
```

For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:

```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument), [-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable), [-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function), [-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture), [-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef), [-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field), [-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar), [-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```

These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077 Approved by: https://github.com/zou3519, https://github.com/wdvr	2025-05-02 07:14:19 +00:00
PyTorch MergeBot	6dadfc4457	Revert "Enable -Wunused on torch targets (#150077 )" This reverts commit `688adc9941`. Reverted https://github.com/pytorch/pytorch/pull/150077 on behalf of https://github.com/wdvr due to failing internally with use of undeclared identifier ([comment](https://github.com/pytorch/pytorch/pull/150077#issuecomment-2846499828))	2025-05-02 06:53:20 +00:00
cyy	ce94b212c7	[Environment Variable][Rebase] Use thread-safe getenv functions (#140200 ) Use our thread-safe getenv wrappers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200 Approved by: https://github.com/kwen2501, https://github.com/eqy	2025-05-02 00:41:49 +00:00
dolpm	a765e2ddda	[nativert] port enumerate from folly to c10::util (#152481 ) Summary: nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed. This diff ports an enumeration util from folly into c10. Test Plan: CI Differential Revision: D73881042 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152481 Approved by: https://github.com/Skylion007, https://github.com/zhxchen17, https://github.com/cyyever	2025-05-01 21:41:05 +00:00
cyy	688adc9941	Enable -Wunused on torch targets (#150077 ) For GCC, ``-Wunused`` contains:

```
-Wunused-function
  Warn whenever a static function is declared but not defined or a non-inline static function is unused.
-Wunused-label
  Warn whenever a label is declared but not used. To suppress this warning use the unused attribute.
-Wunused-parameter
  Warn whenever a function parameter is unused aside from its declaration. To suppress this warning use the unused attribute.
-Wunused-variable
  Warn whenever a local variable or non-constant static variable is unused aside from its declaration. To suppress this warning use the unused attribute.
```

For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:

```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument), [-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable), [-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function), [-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture), [-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef), [-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field), [-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar), [-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```

These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077 Approved by: https://github.com/zou3519	2025-05-01 04:09:06 +00:00
Zhengxu Chen	5a66c1d921	[nativert] Add utility function to convert strings into numbers. (#151467 ) Summary: nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed. This diff adds a small library to convert strings into numbers which will later be used for parsing graph IR. Differential Revision: D73133034 ## Test Plan c10 unittests Pull Request resolved: https://github.com/pytorch/pytorch/pull/151467 Approved by: https://github.com/cyyever, https://github.com/albanD	2025-04-30 21:20:52 +00:00
io-no	d88e0ceb64	Cast to unsigned char to avoid UB (#152360 ) The standard requires that the argument to functions like `isdigit`, `isalpha`, and similar must be either `EOF` or an `unsigned char`; otherwise, the behavior is undefined (UB). To avoid out-of-bounds reads, modern implementations of some libraries (such as glibc) deliberately pad their internal tables to guarantee valid memory access even for negative values. However, this is implementation-specific, and other libraries may not do this. Properly casting the argument to `unsigned char` is good practice to avoid potential issues on some platforms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152360 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-04-30 15:09:13 +00:00
Nikita Shulga	a2c553cac6	[Metal] Extend typecasted op support to complex dtypes (#152504 ) First of all, by extending `c10::metal::cast_to` to work correctly with complex dtypes, by introducing two more specializations: one that casts complex to scalar, and another that casts scalar to complex (as default metal typecast will turn `float x` into `float2(x, x)`) Add ComplexHalf and ComplexFloat enum values to `c10::metal::ScalarTypes` and handle them in `val_at_offs(ptr, offs, type)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/152504 Approved by: https://github.com/dcci ghstack dependencies: #152443, #152466, #152479	2025-04-30 05:32:07 +00:00
Nikita Shulga	9bfdf57572	[MPS][BE] Introduce `c10::metal::mul` (#152466 ) Which multiplies two arguments for either scalar or complex data types This allows one to get rid of bunch of complex specialization in BinaryOps Pull Request resolved: https://github.com/pytorch/pytorch/pull/152466 Approved by: https://github.com/dcci ghstack dependencies: #152443	2025-04-30 04:45:47 +00:00
Dan Johnson	8e2e06b7ea	Fix shadow local variables (#152429 ) Summary: Fixing shadow local variables error: P1798875650 Test Plan: CI Differential Revision: D73853605 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152429 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-04-29 18:50:18 +00:00
Siddharth Kotapati	663bcb68ba	Implement metal kernel for basic MPS arithmetic ops using TensorIterator (#147644 ) Add metal kernels for add, subtract, & lerp ops using TensorIterator. Should help resolve: https://github.com/pytorch/pytorch/issues/143874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147644 Approved by: https://github.com/malfet	2025-04-29 14:24:49 +00:00
cyy	41bd0c900a	[1/N] Deprecate c10::string_view and at::string (#151972 ) The calls of `c10::string_view` in the code base are replaced by `std::string_view`. The calls of `at::string` are replaced by `std::string` Pull Request resolved: https://github.com/pytorch/pytorch/pull/151972 Approved by: https://github.com/malfet	2025-04-29 07:23:52 +00:00
Grace Cheng	8e65310d49	[caffe2/c10/util/TypeIndex] Add '__CUDA_ARCH_LIST__' check (#152030 ) Summary: We suspect that switching the NVCC host compiler from GCC to Clang, while targeting multiple architectures, is causing issues because only `__CUDA_ARCH_LIST__` is being passed, without `__CUDA_ARCH__`. To resolve this c10 compilation error, we should first fix the problem and then switch the NVCC host compiler from GCC to Clang. Once this is done, the errors no longer occur. Test Plan: CI Reviewed By: zhuhan0 Differential Revision: D73383236 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152030 Approved by: https://github.com/cyyever, https://github.com/ZainRizvi	2025-04-28 20:31:23 +00:00
Anthony Shoumikhin	e2f9759bd0	Fix broken URLs (#152237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237 Approved by: https://github.com/huydhn, https://github.com/malfet	2025-04-27 09:56:42 +00:00
Dan Johnson	d22c4cc353	Add option to use mempool on OOM (#151487 ) MemPool is a separate pool of memory handled by the caching allocator. This PR adds the option to let the caching allocator try to use this pool as a last resort instead of OOMing by associating a use_on_oom bool with each MemPool. Usage: Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing. ``` pool = torch.cuda.MemPool(allocator, use_on_oom=True) with torch.cuda.use_mem_pool(pool): a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda") del a # at the memory limit, this will succeed by using pool's memory in order to avoid the oom b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda") ``` Testing: ``` python test/test_cuda.py -k test_mempool_limited_memory_with_allocator ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487 Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel	2025-04-26 04:04:57 +00:00
Davide Italiano	e28864fc0f	[MPS/inductor] Fix the approximation of polygamma for n == 0. (#152214 ) Fixes #152205 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152214 Approved by: https://github.com/malfet	2025-04-25 22:42:45 +00:00
FFFrog	2c5c793085	[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404 ) As the title stated Changes: - Add record, query and enable_timing check - Add related tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404 Approved by: https://github.com/albanD	2025-04-25 20:15:04 +00:00
PyTorch MergeBot	67f75244ea	Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404 )" This reverts commit `c91acad73a`. Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD can you please help it get relanded? To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2830829368))	2025-04-25 16:08:27 +00:00
zhxchen17	a34c28e0d2	[dynamo] Add guard serialization for tensor matches. (#151318 ) This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes. The main behavioral change introduced in this diff is on CheckFunctionManager: ``` check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save") guards_state: bytes = check_fn_manager.guards_state ``` Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state` which should contain all the information needed for deserializing guards later. When we load back guards state, we will set `guards_serialization_mode` to `load`: ``` output_graph_state = pickle.loads(guards_state) check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load") ``` # TENSOR_MATCH Since we have many types of guards to support, we will break the work into small diffs instead of a single diff to support every guard. We kick off the work from TENSOR_MATCH from this diff. # Testing For each type of guard we will test it like the following: 1. Use guard_filter_fn to select 1 type of guard each time. 2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager) 3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager) 4. Throw a set of example inputs to both reference and loaded guard manager to see if their behavior matches. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318 Approved by: https://github.com/jansel, https://github.com/anijain2305	2025-04-25 14:16:23 +00:00
PyTorch MergeBot	b1d055fd6a	Revert "[dynamo] Add guard serialization for tensor matches. (#151318 )" This reverts commit `81c4369d81`. Reverted https://github.com/pytorch/pytorch/pull/151318 on behalf of https://github.com/zhxchen17 due to macos test failing ([comment](https://github.com/pytorch/pytorch/pull/151318#issuecomment-2828638168))	2025-04-24 19:22:45 +00:00
zhxchen17	81c4369d81	[dynamo] Add guard serialization for tensor matches. (#151318 ) This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes. The main behavioral change introduced in this diff is on CheckFunctionManager: ``` check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save") guards_state: bytes = check_fn_manager.guards_state ``` Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state` which should contain all the information needed for deserializing guards later. When we load back guards state, we will set `guards_serialization_mode` to `load`: ``` output_graph_state = pickle.loads(guards_state) check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load") ``` # TENSOR_MATCH Since we have many types of guards to support, we will break the work into small diffs instead of a single diff to support every guard. We kick off the work from TENSOR_MATCH from this diff. # Testing For each type of guard we will test it like the following: 1. Use guard_filter_fn to select 1 type of guard each time. 2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager) 3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager) 4. Throw a set of example inputs to both reference and loaded guard manager to see if their behavior matches. Differential Revision: [D72987485](https://our.internmc.facebook.com/intern/diff/D72987485/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318 Approved by: https://github.com/jansel, https://github.com/anijain2305	2025-04-24 18:07:01 +00:00
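The testing recipe in the commit above (serialize, deserialize, then compare the reference and loaded guards on the same inputs) can be sketched without Dynamo. Everything here is a toy under stated assumptions: `ToyTensorMatchGuard`, `to_state`, `from_state`, and `behaviors_match` are hypothetical names, tensors are reduced to a `(shape, dtype)` pair, and the real `CheckFunctionManager` is not involved.

```python
import pickle


class ToyTensorMatchGuard:
    """Hypothetical guard: passes when an input has the recorded shape and dtype."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype

    def check(self, shape, dtype) -> bool:
        return tuple(shape) == self.shape and dtype == self.dtype

    def to_state(self) -> bytes:
        # "save" direction: only plain data goes into the byte string.
        return pickle.dumps({"shape": self.shape, "dtype": self.dtype})

    @classmethod
    def from_state(cls, state: bytes) -> "ToyTensorMatchGuard":
        # "load" direction: rebuild the guard from the saved bytes.
        data = pickle.loads(state)
        return cls(data["shape"], data["dtype"])


def behaviors_match(reference, loaded, examples) -> bool:
    """Step 4 of the recipe: both guards must agree on every example input."""
    return all(reference.check(*ex) == loaded.check(*ex) for ex in examples)


reference = ToyTensorMatchGuard((2, 3), "float32")
guards_state = reference.to_state()
loaded = ToyTensorMatchGuard.from_state(guards_state)

examples = [((2, 3), "float32"), ((3, 2), "float32"), ((2, 3), "float64")]
print(behaviors_match(reference, loaded, examples))
```

The example set deliberately mixes one passing input with two failing ones, so agreement between the two guards is checked on both outcomes.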
dolpm	4ac2ee573d	[sigmoid] memory planner C10 deps (#151275 ) Summary: perf-sensitive util functions for use in our memory planner Test Plan: CI Differential Revision: D73002726 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151275 Approved by: https://github.com/georgiaphillips	2025-04-24 01:46:32 +00:00

2875 Commits