Yuanyuan Chen
f9953e0f61
Enable PLC0414 on ruff ( #165828 )
...
This PR enables `PLC0414`, which fixes redundant import aliases.
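For context, a minimal illustration (not taken from this PR's diff) of the pattern the rule flags and its autofix:
```python
# Before (flagged by PLC0414, "useless-import-alias"): the alias repeats the imported name.
from torch import nn as nn

# After `ruff check --fix`, the redundant alias is dropped:
from torch import nn
```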
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165828
Approved by: https://github.com/albanD
2025-10-22 04:56:52 +00:00
Jagadish Krishnamoorthy
34ed7a8f0d
[ROCm] Skip test_blockwise_nvfp4_with_global_scale ( #165968 )
...
Disable the fp4 global_scale test until the feature is enabled on ROCm.
Fixes #166027 .
Not really a fix, but we're trading an open issue for a test skip decorator since the test is parameterized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165968
Approved by: https://github.com/jeffdaily , https://github.com/drisspg
2025-10-22 04:23:05 +00:00
Jeff Daily
2fde10d914
[ROCm] fix test_allocator_backend ( #166035 )
...
Fixes #165872 .
Forward fix for PR #165298 : hipify was causing some symbols to be replaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166035
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-22 03:46:23 +00:00
Tugsbayasgalan Manlaibaatar
0a93295da0
Update doc ( #166024 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166024
Approved by: https://github.com/yiming0416
2025-10-22 03:41:31 +00:00
Ketan Ambati
4b898b51b9
[12/n][take2] : Remove fbandroid_compiler_flags platform args ( #165916 )
...
Summary: This diff removes `fbandroid_compiler_flags`, merges its contents into `compiler_flags`, and wraps them in an Android select. My first attempt at this got reverted - D84626885.
Test Plan:
CI and failing builds are now passing
```
buck2 build --target-universe fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_postprocessed_repack_resign @//fbandroid/mode/nosan @//fbandroid/mode/opt @//fbandroid/mode/milan_build_rdk @//fbandroid/mode/relr-relocations fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_postprocessed_repack_resign fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_genrule fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore-mobileconfig-definition-resource-gen fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore
File changed: fbsource//tools/build_defs/fb_xplat_cxx_library.bzl
Buck UI: https://www.internalfb.com/buck2/509c0b7b-ada3-421a-8c32-2f1d3a7babdd
Network: Up: 1.3MiB Down: 293MiB (reSessionID-17f73b81-3c34-4c01-9f6c-2b4f3c8332e3)
Loading targets. Remaining 0/1311 292986 targets declared
Analyzing targets. Remaining 0/13515 216715 actions, 359204 artifacts declared
Executing actions. Remaining 0/40415 6:33.3s exec time total
Command: build. Finished 40 local, 790 remote
Time elapsed: 32.0s
BUILD SUCCEEDED
```
Reviewed By: jaejunku
Differential Revision: D84868234
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165916
Approved by: https://github.com/malfet
2025-10-22 03:01:55 +00:00
Rob Timpe
550e3e6efb
[dynamo] Fix MATCH_KEYS for dict pattern matching ( #165956 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165956
Approved by: https://github.com/guilhermeleobas , https://github.com/cyyever
2025-10-22 02:52:07 +00:00
inventshah
715449ca76
[MPS] Fix parity between CPU and MPS on singular matrices in linalg.lu_factor ( #165871 )
...
Fixes #165870 . Follow up from #165254 .
This PR [a] removes the MPS specific version of `lu_factor` in favor of the version in BatchedLinearAlgebra.cpp which uses `lu_factor_ex`, and [b] updates `lu_factor_ex` error codes to match expectations.
When `lu_factor` was first implemented for MPS (#99269 ), it bypassed the implementation in BatchedLinearAlgebra.cpp since we did not have `lu_factor_ex`. Since #144651 implements `lu_factor_ex`, we can now remove the MPS specific wrapper.
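A small sketch of the behavior being aligned, assuming the documented error-checking semantics of these ops (`lu_factor` raises on singular inputs, `lu_factor_ex` reports them through `info` instead):
```python
import torch

a = torch.zeros(3, 3)  # singular matrix

# lu_factor checks the info code and raises on singular inputs; after this PR
# the MPS path goes through the same shared code path as CPU/CUDA.
try:
    torch.linalg.lu_factor(a)
except RuntimeError as e:
    print("lu_factor raised:", e)

# lu_factor_ex skips the error check and returns the info code instead.
lu, pivots, info = torch.linalg.lu_factor_ex(a)
print("info =", info.item())  # nonzero signals a singular factorization
```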
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165871
Approved by: https://github.com/kulinseth , https://github.com/albanD
2025-10-22 02:48:40 +00:00
arkadip-maitra
84d8d06fc3
Fixes floating point exception in torch.nn.PixelShuffle ( #163154 )
...
Fixes #162251
**Previous Output:**
`Floating point exception (core dumped)`
**New Output:**
`RuntimeError: upscale factor is too large, (upscale_factor}^2 overflowed: upscale_factor=545460846592`
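A minimal repro sketch based on the error text above (the trigger value comes from issue #162251):
```python
import torch

# Before this fix, an upscale_factor whose square overflows caused a hard crash
# (floating point exception); now it raises a RuntimeError instead.
shuffle = torch.nn.PixelShuffle(upscale_factor=545460846592)
try:
    shuffle(torch.randn(1, 1, 1, 1))
except RuntimeError as e:
    print(e)
```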
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163154
Approved by: https://github.com/cyyever , https://github.com/albanD
2025-10-22 02:22:16 +00:00
Animesh Jain
60992d98b2
[dynamo][remaining] Replace UserFunctionVariable with VariableTracker build ( #165896 )
...
Audit: To prevent future issues with functools.partial or callable objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165896
Approved by: https://github.com/Lucaskabela
2025-10-22 02:13:00 +00:00
Yuanyuan Chen
59e015e3a1
Remove outdated CUB macros ( #164656 )
...
This PR removes `CUB_SUPPORTS_NV_BFLOAT16` and `CUB_SUPPORTS_FUTURE_VALUE` because they are always true on CUDA >=12 installations with its CUB version. Their branches are also removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164656
Approved by: https://github.com/albanD , https://github.com/eqy , https://github.com/jeffdaily
2025-10-22 02:02:50 +00:00
Yu, Guangye
8904a5a7c9
Move allocation size config to AllocatorConfig for cross-allocator sharing ( #159553 )
...
# Motivation
Make CUDA and XPU share the same config and code, and allow other backends to reuse them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159553
Approved by: https://github.com/albanD
ghstack dependencies: #160067
2025-10-22 01:48:56 +00:00
Guilherme Leobas
f5df9ca03a
Fix creation of BINARY_SUBSCR in Python 3.14+ ( #165864 )
...
Python 3.14 replaced `BINARY_SUBSCR` with `BINARY_OP(NB_SUBSCR)`.
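A quick way to observe the change (illustrative; the disassembly depends on the interpreter version):
```python
import dis
import sys

def subscript(d):
    return d["key"]

# On Python <= 3.13 the subscript compiles to BINARY_SUBSCR; on 3.14 it becomes
# BINARY_OP with the NB_SUBSCR argument, which the bytecode generation now has
# to emit instead.
print(sys.version_info)
dis.dis(subscript)
```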
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165864
Approved by: https://github.com/williamwen42
2025-10-22 01:43:03 +00:00
zhudada
2998abd777
[Code Clean] Better error handling in torch/csrc/distributed ( #165053 )
...
Replace vanilla C++ `std::runtime_error` exceptions with `TORCH_CHECK`.
Including:
torch/csrc/distributed/*
Partially fixes #148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165053
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-22 01:40:36 +00:00
Artem Kuzmitckii
e13580e41c
[AMD] Run int4_mm tests only for compatible arch ( #165630 )
...
Such tests should be skipped on the remaining architectures, including gfx1100 (Navi3x).
Part of the CI HUD fixes for gfx1100.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165630
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-10-22 01:38:55 +00:00
Artem Kuzmitckii
f3b8e15f20
[AMD][gfx1100] test_decompose_mem_bound_mm.py tolerance increase ( #165625 )
...
Increase tolerances in test_decompose_mem_bound_mm.py for Navi3x (gfx11x).
(cherry picked from commit 03c7da05f61890bbf5ae41e23c8df6d5f6805bac)
Part of the CI HUD fixes for gfx1100.
Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165625
Approved by: https://github.com/jeffdaily
Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>
Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-22 01:38:48 +00:00
Nikita Shulga
5211f4c108
[MPS] Fix SDPA fp16 overflow ( #165961 )
...
Do not cast the intermediate result back to the lower-precision data type until
softmax is finished, otherwise it might produce NaN.
Adjust the test to use 256 as the filler value rather than 64.
Fixes https://github.com/pytorch/pytorch/issues/160841
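A rough sketch of the failure mode (shapes are assumed; the MPS device and the 256 filler value come from the test change above, so this needs an Apple-silicon machine to run):
```python
import torch

# With fp16 inputs filled with a large constant, keeping intermediates in fp16
# can overflow during the attention score computation and propagate NaN through
# softmax; keeping the intermediate in higher precision until softmax completes
# avoids this.
q = torch.full((1, 8, 32, 64), 256.0, dtype=torch.float16, device="mps")
out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print(out.isnan().any())  # expected False after this fix
```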
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165961
Approved by: https://github.com/dcci , https://github.com/Skylion007
ghstack dependencies: #165960
2025-10-22 01:29:42 +00:00
Nikita Shulga
ad9027b80d
[BE] Remove unused 'rows' parameter from spmm_bmm_coo_rows_grouped ( #166041 )
...
Fixes the following compilation warnings:
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/sparse/mps/kernels/Mul.metal:76:14: warning: unused variable 'B' [-Wunused-variable]
const uint B = dims.x;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/sparse/mps/kernels/Mul.metal:65:26: warning: unused parameter 'rows' [-Wunused-parameter]
device const long* rows [[buffer(0)]],
^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166041
Approved by: https://github.com/Skylion007
2025-10-22 00:59:41 +00:00
Han Chao
a1005427bf
[xpu] Support high stream for ProcessGroupXCCL ( #163049 )
...
Add high-priority stream support for ProcessGroupXCCL. Just like CUDA, XPU streams support execution at higher priority than other streams. The implementation is in https://github.com/intel/torch-xpu-ops/pull/1715 ; this PR adds the registration here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163049
Approved by: https://github.com/guangyey , https://github.com/gujinghui , https://github.com/EikanWang , https://github.com/albanD
2025-10-22 00:54:25 +00:00
Yuanyuan Chen
35153d0846
Simplify c10::guts::apply ( #164566 )
...
There is only one call site of `c10::guts::apply` that can be replaced by `std::apply`, except for ROCm. This PR therefore simplifies the implementation of `c10::guts::apply`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164566
Approved by: https://github.com/Aidyn-A , https://github.com/albanD
2025-10-22 00:47:43 +00:00
PyTorch MergeBot
7773a22cdb
Revert "[AMP][Refactor] Autocast dtype handling to simplify device-specific c… ( #165221 )"
...
This reverts commit 4be1e3bf92 .
Reverted https://github.com/pytorch/pytorch/pull/165221 on behalf of https://github.com/clee2000 due to I think this broke test_openreg [GH job link](https://github.com/pytorch/pytorch/actions/runs/18698271058/job/53322459496 ) [HUD commit link](4be1e3bf92 ) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/165221#issuecomment-3430012693 ))
2025-10-22 00:26:57 +00:00
Yuanyuan Chen
7cb467a169
[CI] Update ONNX CI packages to latest ( #165883 )
...
This PR updates ONNX related packages to their latest versions used in CI environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165883
Approved by: https://github.com/justinchuby , https://github.com/albanD
2025-10-22 00:25:35 +00:00
KarhouTam
12aac12b8d
[Code Clean] Replace std::runtime_error with TORCH_CHECK ( #165209 )
...
Including:
1. `aten/src/ATen/core`
2. `c10/core`
Fixes part of #148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165209
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-22 00:05:22 +00:00
jainapurva
2b748d0a56
Add operator name to output json ( #164583 )
...
On the benchmarks dashboard, model_name needs to be grouped with operator_name. This PR passes an additional operator_name argument into the output JSON for grouping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164583
Approved by: https://github.com/yangw-dev
2025-10-21 23:58:39 +00:00
Shangdi Yu
16745a882a
[aoti][win] add support for a list of shim libraries ( #165914 )
...
As titled: support passing in a list of shim libraries when cross-compiling artifacts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165914
Approved by: https://github.com/desertfire
2025-10-21 22:55:17 +00:00
PyTorch MergeBot
8daef35cf1
Revert "[Code Clean] Clean asserts in torch/ao/quantization (root, quantizer, backend_config) ( #165433 )"
...
This reverts commit df64c0c464 .
Reverted https://github.com/pytorch/pytorch/pull/165433 on behalf of https://github.com/clee2000 due to I think this broke some quantization tests ([comment](https://github.com/pytorch/pytorch/pull/165433#issuecomment-3429741770 ))
2025-10-21 22:10:19 +00:00
Nicolas De Carli
51319ca090
[Pytorch] Add NEON Vectorized<uint> family of translation layers ( #165690 )
...
Summary:
Adding NEON specializations of Vectorized<T> for uint8, uint16, uint32 and uint64.
Correctness has been checked using test_ops.py
operator_benchmark_test.py, which uses the PyTorch API, shows significant enhancements in some operations:
Before:
uint8 mul: 1460.751us
uint8 add: 2359.565us
uint8 lsl: 2151.206us
After:
uint8 mul: 194.792us ---> 650% higher throughput
uint8 add: 195.609us ---> 1100% higher throughput
uint8 lsl: 186.249us ---> 1055% higher throughput
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Reviewed By: mcfi
Differential Revision: D84770153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165690
Approved by: https://github.com/malfet
2025-10-21 21:46:55 +00:00
Guang Yang
d311a3d1dc
A temporary fix to autotune out of range and related IMA ( #165943 )
...
Summary:
Autotune issue during lowering w/ AOTI:
```
setStorage: sizes [1536, 32, 8192], strides [8192, 8192, 1], storage offset 0, and itemsize 2 requiring a storage size of 25673728 are out of bounds for storage of size 25362432
```
Need a hack to create a new base tensor with sufficient storage.
Test Plan: Finally able to see the e2e test pass on CI. See the detailed Test Plan in D83520844
Differential Revision: D84872792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165943
Approved by: https://github.com/laithsakka
2025-10-21 21:40:20 +00:00
Zhaoqi Zhu
04adfe5ba9
Make Backend::setGroupUid virtual ( #165957 )
...
As titled, so that we may customize this function in custom backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165957
Approved by: https://github.com/d4l3k
2025-10-21 21:33:24 +00:00
KarhouTam
4be1e3bf92
[AMP][Refactor] Autocast dtype handling to simplify device-specific c… ( #165221 )
...
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.
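A minimal sketch of the idea, with illustrative names and dtype sets (not the actual torch internals):
```python
import warnings
import torch

# One central mapping from device type to the dtypes its autocast supports,
# instead of repeating per-device branches. Entries here are illustrative.
device_supported_dtypes = {
    "cuda": (torch.float16, torch.bfloat16),
    "cpu": (torch.bfloat16, torch.float16),
}

def validate_autocast_dtype(device_type: str, dtype: torch.dtype) -> bool:
    supported = device_supported_dtypes.get(device_type, ())
    if dtype not in supported:
        # Mirrors the warning text quoted in the CI failure below.
        warnings.warn(
            f"In {device_type} autocast, but the target dtype is not supported. "
            "Disabling autocast."
        )
        return False
    return True
```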
**The former PR #163446 was merged but reverted due to failed CI test on `openreg` related tests.**
This PR additionally makes slight modifications to some test assertions so the CI tests pass. CI had failed because an assertion expected the exact same error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
with self.assertWarnsRegex(
AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```
Sorry for the inconvenience again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-21 21:32:12 +00:00
Catherine Lee
e7592f4005
[CI] Move the periodic debug tests to newer runner ( #165158 )
...
Previously g3 = NVIDIA Tesla M60
Now g6 = NVIDIA L4
Also change cuda arch list accordingly
Pros:
More memory, newer GPU
Cons:
That was one of the few remaining tests on g3 runners, so we probably lost coverage?
We can probably run more tests in parallel now but I'm not going to do that here
Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware? They are now failing with
```
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper
method(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper
with policy():
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__
self.beforeStreams[-1].synchronize()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize
super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0
#7 THCPStream_synchronize(_object*, _object*) from Stream.cpp:0
#8 cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489
#9 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#10 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
#11 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#12 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
```
When run with CUDA_LAUNCH_BLOCKING=1, I got a ton of output like
```
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165158
Approved by: https://github.com/seemethere
2025-10-21 21:28:12 +00:00
Isalia20
d334c3649d
[CUDA] fix reflection padding for large batch size ( #165942 )
...
Fixes [#165861 ](https://github.com/pytorch/pytorch/issues/165861 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165942
Approved by: https://github.com/eqy
2025-10-21 21:07:38 +00:00
Jerry Mannil
9f82535c5a
[ROCm] [Normalization] Update block size ( #165941 )
...
* Seeing up to 6x improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165941
Approved by: https://github.com/jeffdaily
2025-10-21 20:53:05 +00:00
Ivan Zaitsev
5b35fc8777
Support multiple commits on push events in trunk tagging workflow ( #165937 )
...
Context:
* this workflow is used to create tags like `trunk/{sha}` for all `main` commits
* those tags are used by [autorevert](https://github.com/pytorch/test-infra/blob/main/aws/lambda/pytorch-auto-revert/README.md ) to rerun selected workflows
Problem: currently the workflow creates only a single tag per push event, while ghstack pushes multiple commits per single push.
This PR supports tag creation for all commits in the push event.
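A simplified sketch of the multi-commit tagging logic (illustrative only; the real implementation is the GitHub Actions workflow linked below, and the event-payload path here is assumed):
```python
import json
import subprocess

# A push event payload lists every commit in the push; tag each one as
# trunk/{sha} instead of tagging only the head commit.
with open("push_event.json") as f:  # path assumed for illustration
    event = json.load(f)

for commit in event.get("commits", []):
    sha = commit["id"]
    tag = f"trunk/{sha}"
    subprocess.run(["git", "tag", tag, sha], check=True)
    subprocess.run(["git", "push", "origin", tag], check=True)
```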
Complementary autorevert PR: https://github.com/pytorch/test-infra/pull/7291
---
### Testing
I created an identical copy of this workflow in my personal repo: https://github.com/izaitsevfb/pr-head-test/actions/workflows/trunk-tagging.yml
See action runs there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165937
Approved by: https://github.com/huydhn
2025-10-21 20:52:34 +00:00
Nikita Vedeneev
2f38eece7c
[CUDA][cuBLAS] addmm -- some refactoring for easier navigation between the Lt and non-Lt paths ( #163955 )
...
As per title. Additionally, some Lt selection conditions are revisited, and some redundancy removed (especially in the ROCm vs non-ROCm paths).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163955
Approved by: https://github.com/ngimel , https://github.com/eqy
2025-10-21 20:48:12 +00:00
Animesh Jain
830e789a55
[dynamo][annotate] Graph break cleanly on fx.traceback.annotate reconstruction ( #166006 )
...
This avoids generating bad bytecode that leads to a really confusing
error. I am not sure why we can't reconstruct cleanly; it has to do with
the input being a dict, while other supported ctx managers take bools.
Fixing that is for another day. Let's give a good error message for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166006
Approved by: https://github.com/yushangdi , https://github.com/SherlockNoMad
2025-10-21 20:48:04 +00:00
PyTorch MergeBot
ad4dc52bf6
Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"
...
This reverts commit 4e643422f6 .
Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3429426503 ))
2025-10-21 20:24:14 +00:00
dependabot[bot]
dac9ed9790
Bump uv from 0.8.6 to 0.9.5 in /.ci/lumen_cli ( #166017 )
...
Bumps [uv](https://github.com/astral-sh/uv ) from 0.8.6 to 0.9.5.
- [Release notes](https://github.com/astral-sh/uv/releases )
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md )
- [Commits](https://github.com/astral-sh/uv/compare/0.8.6...0.9.5 )
---
updated-dependencies:
- dependency-name: uv
dependency-version: 0.9.5
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-21 13:16:30 -07:00
linhaifeng
1c7fe8f861
[BugFix] chunk_size should always be int64_t ( #165971 )
...
Inspired by https://github.com/pytorch/pytorch/pull/156872
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165971
Approved by: https://github.com/albanD
2025-10-21 19:52:47 +00:00
Bruce Chang
4e643422f6
shrink_group implementation to expose ncclCommShrink API ( #164518 )
...
Closes #164529
To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink ) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-21 19:47:33 +00:00
Jason Ansel
3c3b278872
[reland][fx] Move Node._prepend/Node._remove_from_list to C++ ( #165882 )
...
Relands #148261, which was reverted by #150542.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165882
Approved by: https://github.com/ezyang
2025-10-21 19:43:55 +00:00
Nikita Shulga
0bd12c1168
[CI] Extend test_transfomers to MPS ( #165960 )
...
Just skip grad_checks as they need float64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165960
Approved by: https://github.com/Skylion007
2025-10-21 19:27:44 +00:00
PyTorch MergeBot
ce8a7764e2
Revert "[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )"
...
This reverts commit 1290b077f2 .
Reverted https://github.com/pytorch/pytorch/pull/165707 on behalf of https://github.com/clee2000 due to failing internal tests D85160820 ([comment](https://github.com/pytorch/pytorch/pull/165707#issuecomment-3429084393 ))
2025-10-21 19:25:03 +00:00
Tushar Jain
d1269a0434
update fr trace analysis ( #165994 )
...
Summary:
- allow empty entries from ranks
- allow dumps to be missing from some ranks
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com ). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/165994 ).
* #165638
* #165640
* #165642
* __->__ #165994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165994
Approved by: https://github.com/fduwjj
2025-10-21 19:14:33 +00:00
Pearu Peterson
c87cf1be32
Update workaround to old CUDA bug ( #164354 ) ( #165984 )
...
The workaround cannot be removed because of BC. Here we update the
PyTorch code base to stop using the workaround.
See https://github.com/pytorch/pytorch/pull/164354 for the BC breakage issue.
Resolves https://github.com/pytorch/pytorch/issues/164348 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165984
Approved by: https://github.com/janeyx99
2025-10-21 19:09:43 +00:00
Tugsbayasgalan Manlaibaatar
2fc5e45a41
better error message when there is no pytree impl ( #165955 )
...
Differential Revision: [D85117597](https://our.internmc.facebook.com/intern/diff/D85117597 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165955
Approved by: https://github.com/avikchaudhuri
2025-10-21 18:49:22 +00:00
Shivam Raikundalia
f9022ba93b
[PyTorch] Add user_metadata display to memory visualizer ( #165939 )
...
Summary: Enhanced the PyTorch CUDA memory visualizer to display user_metadata alongside stack frames when inspecting allocations. The user_metadata field is now shown in all views (Allocator State History, Active Memory Timeline, etc.) with consistent formatting. The implementation handles both string and object metadata types, displaying strings directly and objects as key-value pairs.
Test Plan:
1. Generate a memory snapshot with user_metadata
2. Open the memory visualizer in a browser
3. Load the snapshot file
4. Verify user_metadata appears
5. Test with both string metadata ("testing") and object metadata ({"key": "value"})
6. Verify formatting shows "User Metadata:\n <value>" for strings
{F1982860439}
Differential Revision: D85095152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165939
Approved by: https://github.com/yushangdi
2025-10-21 18:48:33 +00:00
Tony Targonski
ff8be889ad
Remove unused exception parameter from some files, to work with -Wunused-exception-parameter ( #165770 )
...
Summary: Address compiler complaints that were coming up, in order to unblock the build.
Test Plan:
before the change
```
aten/src/ATen/native/LinearAlgebra.cpp:3623:36: error: unused exception parameter 'e' [-Werror,-Wunused-exception-parameter]
3623 | } catch (const std::exception& e) {
|
```
after: targets build with `-Wunused-exception-parameter`
Differential Revision: D84876246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165770
Approved by: https://github.com/Skylion007 , https://github.com/cyyever
Co-authored-by: Tony Targonski <tony.targonski@meta.com>
2025-10-21 18:30:29 +00:00
Wang, Chuanqi
292454942e
[CD] Introduce windows.12xlarge runners for CD Windows build ( #165287 )
...
Follows https://github.com/pytorch/test-infra/pull/7174 . Windows CD build time comparison below:
|Runner|cpu|cuda|xpu|
|-|-|-|-|
|windows.4xlarge|1.5h| 4.0h| 5.5h|
|windows.12xlarge|0.5h|1.5h|2.5h|
Fixes #162962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165287
Approved by: https://github.com/zxiiro , https://github.com/malfet , https://github.com/seemethere
2025-10-21 18:28:23 +00:00
PyTorch MergeBot
6c4412f72b
Revert "[Inductor] support masked vectorization for the tail_loop for float64 datatype ( #163316 )"
...
This reverts commit e9d8973427 .
Reverted https://github.com/pytorch/pytorch/pull/163316 on behalf of https://github.com/clee2000 due to seems to have broken some no_gpu tests? test/inductor/test_cpu_repro.py::CPUReproTests::test_double_reduction_vec [GH job link](https://github.com/pytorch/pytorch/actions/runs/18689033019/job/53290772740 ) [HUD commit link](e9d8973427 ) ([comment](https://github.com/pytorch/pytorch/pull/163316#issuecomment-3428210509 ))
2025-10-21 17:44:42 +00:00
PyTorch MergeBot
78bf6186f2
Revert "[Inductor] support masked vectorization for the tail_loop for fp8 datatype ( #163324 )"
...
This reverts commit e8cb34dd52 .
Reverted https://github.com/pytorch/pytorch/pull/163324 on behalf of https://github.com/clee2000 due to seems to have broken some no_gpu tests? test/inductor/test_cpu_repro.py::CPUReproTests::test_double_reduction_vec [GH job link](https://github.com/pytorch/pytorch/actions/runs/18689033019/job/53290772740 ) [HUD commit link](e9d8973427 ) ([comment](https://github.com/pytorch/pytorch/pull/163316#issuecomment-3428210509 ))
2025-10-21 17:44:42 +00:00