Yuanyuan Chen
f9953e0f61
Enable PLC0414 on ruff ( #165828 )
...
This PR enables `PLC0414`, which fixes redundant import aliases.
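For context, a minimal illustration (not taken from this PR's diff) of the pattern the rule flags and its autofix:
```python
# Before (flagged by PLC0414, "useless-import-alias"): the alias repeats the imported name.
from torch import nn as nn

# After `ruff check --fix`, the redundant alias is dropped:
from torch import nn
```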
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165828
Approved by: https://github.com/albanD
2025-10-22 04:56:52 +00:00
Jagadish Krishnamoorthy
34ed7a8f0d
[ROCm] Skip test_blockwise_nvfp4_with_global_scale ( #165968 )
...
Disable the fp4 global_scale test until the feature is enabled on ROCm.
Fixes #166027 .
Not really a fix, but we're trading an open issue for a test skip decorator since the test is parameterized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165968
Approved by: https://github.com/jeffdaily , https://github.com/drisspg
2025-10-22 04:23:05 +00:00
Jeff Daily
2fde10d914
[ROCm] fix test_allocator_backend ( #166035 )
...
Fixes #165872 .
Forward fix for PR #165298 : hipify was causing some symbols to be replaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166035
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-22 03:46:23 +00:00
Tugsbayasgalan Manlaibaatar
0a93295da0
Update doc ( #166024 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166024
Approved by: https://github.com/yiming0416
2025-10-22 03:41:31 +00:00
Ketan Ambati
4b898b51b9
[12/n][take2] : Remove fbandroid_compiler_flags platform args ( #165916 )
...
Summary: This diff removes `fbandroid_compiler_flags`, merges its contents into `compiler_flags`, and wraps them in an Android select. My first attempt at this got reverted - D84626885.
Test Plan:
CI and failing builds are now passing
```
buck2 build --target-universe fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_postprocessed_repack_resign @//fbandroid/mode/nosan @//fbandroid/mode/opt @//fbandroid/mode/milan_build_rdk @//fbandroid/mode/relr-relocations fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_postprocessed_repack_resign fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_genrule fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore-mobileconfig-definition-resource-gen fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore
File changed: fbsource//tools/build_defs/fb_xplat_cxx_library.bzl
Buck UI: https://www.internalfb.com/buck2/509c0b7b-ada3-421a-8c32-2f1d3a7babdd
Network: Up: 1.3MiB Down: 293MiB (reSessionID-17f73b81-3c34-4c01-9f6c-2b4f3c8332e3)
Loading targets. Remaining 0/1311 292986 targets declared
Analyzing targets. Remaining 0/13515 216715 actions, 359204 artifacts declared
Executing actions. Remaining 0/40415 6:33.3s exec time total
Command: build. Finished 40 local, 790 remote
Time elapsed: 32.0s
BUILD SUCCEEDED
```
Reviewed By: jaejunku
Differential Revision: D84868234
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165916
Approved by: https://github.com/malfet
2025-10-22 03:01:55 +00:00
Rob Timpe
550e3e6efb
[dynamo] Fix MATCH_KEYS for dict pattern matching ( #165956 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165956
Approved by: https://github.com/guilhermeleobas , https://github.com/cyyever
2025-10-22 02:52:07 +00:00
inventshah
715449ca76
[MPS] Fix parity between CPU and MPS on singular matrices in linalg.lu_factor ( #165871 )
...
Fixes #165870 . Follow up from #165254 .
This PR [a] removes the MPS specific version of `lu_factor` in favor of the version in BatchedLinearAlgebra.cpp which uses `lu_factor_ex`, and [b] updates `lu_factor_ex` error codes to match expectations.
When `lu_factor` was first implemented for MPS (#99269 ), it bypassed the implementation in BatchedLinearAlgebra.cpp since we did not have `lu_factor_ex`. Since #144651 implements `lu_factor_ex`, we can now remove the MPS specific wrapper.
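A small sketch of the behavior being aligned, assuming the documented error-checking semantics of these ops (`lu_factor` raises on singular inputs, `lu_factor_ex` reports them through `info` instead):
```python
import torch

a = torch.zeros(3, 3)  # singular matrix

# lu_factor checks the info code and raises on singular inputs; after this PR
# the MPS path goes through the same shared code path as CPU/CUDA.
try:
    torch.linalg.lu_factor(a)
except RuntimeError as e:
    print("lu_factor raised:", e)

# lu_factor_ex skips the error check and returns the info code instead.
lu, pivots, info = torch.linalg.lu_factor_ex(a)
print("info =", info.item())  # nonzero signals a singular factorization
```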
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165871
Approved by: https://github.com/kulinseth , https://github.com/albanD
2025-10-22 02:48:40 +00:00
arkadip-maitra
84d8d06fc3
Fixes floating point exception in torch.nn.PixelShuffle ( #163154 )
...
Fixes #162251
**Previous Output:**
`Floating point exception (core dumped)`
**New Output:**
`RuntimeError: upscale factor is too large, (upscale_factor}^2 overflowed: upscale_factor=545460846592`
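A minimal repro sketch based on the error text above (the trigger value comes from issue #162251):
```python
import torch

# Before this fix, an upscale_factor whose square overflows caused a hard crash
# (floating point exception); now it raises a RuntimeError instead.
shuffle = torch.nn.PixelShuffle(upscale_factor=545460846592)
try:
    shuffle(torch.randn(1, 1, 1, 1))
except RuntimeError as e:
    print(e)
```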
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163154
Approved by: https://github.com/cyyever , https://github.com/albanD
2025-10-22 02:22:16 +00:00
Animesh Jain
60992d98b2
[dynamo][remaining] Replace UserFunctionVariable with VariableTracker build ( #165896 )
...
Audit: To prevent future issues with functools.partial or callable objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165896
Approved by: https://github.com/Lucaskabela
2025-10-22 02:13:00 +00:00
Yuanyuan Chen
59e015e3a1
Remove outdated CUB macros ( #164656 )
...
This PR removes `CUB_SUPPORTS_NV_BFLOAT16` and `CUB_SUPPORTS_FUTURE_VALUE` because they are always true on CUDA >=12 installations with its CUB version. Their branches are also removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164656
Approved by: https://github.com/albanD , https://github.com/eqy , https://github.com/jeffdaily
2025-10-22 02:02:50 +00:00
Yu, Guangye
8904a5a7c9
Move allocation size config to AllocatorConfig for cross-allocator sharing ( #159553 )
...
# Motivation
Make CUDA and XPU share the same config and code, and allow other backends to reuse them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159553
Approved by: https://github.com/albanD
ghstack dependencies: #160067
2025-10-22 01:48:56 +00:00
Guilherme Leobas
f5df9ca03a
Fix creation of BINARY_SUBSCR in Python 3.14+ ( #165864 )
...
Python 3.14 replaced `BINARY_SUBSCR` with `BINARY_OP(NB_SUBSCR)`.
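A quick way to observe the change (illustrative; the disassembly depends on the interpreter version):
```python
import dis
import sys

def subscript(d):
    return d["key"]

# On Python <= 3.13 the subscript compiles to BINARY_SUBSCR; on 3.14 it becomes
# BINARY_OP with the NB_SUBSCR argument, which the bytecode generation now has
# to emit instead.
print(sys.version_info)
dis.dis(subscript)
```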
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165864
Approved by: https://github.com/williamwen42
2025-10-22 01:43:03 +00:00
zhudada
2998abd777
[Code Clean] Better error handling in torch/csrc/distributed ( #165053 )
...
Replace vanilla C++ `std::runtime_error` exceptions with `TORCH_CHECK`.
Including:
torch/csrc/distributed/*
Partially fixes #148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165053
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-22 01:40:36 +00:00
Artem Kuzmitckii
e13580e41c
[AMD] Run int4_mm tests only for compatible arch ( #165630 )
...
Such tests should be skipped on the remaining architectures, including gfx1100 (Navi3x).
Part of the CI HUD fixes for gfx1100.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165630
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-10-22 01:38:55 +00:00
Artem Kuzmitckii
f3b8e15f20
[AMD][gfx1100] test_decompose_mem_bound_mm.py tolerance increase ( #165625 )
...
Increase tolerances in test_decompose_mem_bound_mm.py for Navi3x (gfx11x).
(cherry picked from commit 03c7da05f61890bbf5ae41e23c8df6d5f6805bac)
Part of the CI HUD fixes for gfx1100.
Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165625
Approved by: https://github.com/jeffdaily
Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>
Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-22 01:38:48 +00:00
Nikita Shulga
5211f4c108
[MPS] Fix SDPA fp16 overflow ( #165961 )
...
Do not cast the intermediate result back to the lower-precision data type until
softmax is finished, otherwise it might produce NaN.
Adjust the test to use 256 as the filler value rather than 64.
Fixes https://github.com/pytorch/pytorch/issues/160841
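A rough sketch of the failure mode (shapes are assumed; the MPS device and the 256 filler value come from the test change above, so this needs an Apple-silicon machine to run):
```python
import torch

# With fp16 inputs filled with a large constant, keeping intermediates in fp16
# can overflow during the attention score computation and propagate NaN through
# softmax; keeping the intermediate in higher precision until softmax completes
# avoids this.
q = torch.full((1, 8, 32, 64), 256.0, dtype=torch.float16, device="mps")
out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print(out.isnan().any())  # expected False after this fix
```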
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165961
Approved by: https://github.com/dcci , https://github.com/Skylion007
ghstack dependencies: #165960
2025-10-22 01:29:42 +00:00
Nikita Shulga
ad9027b80d
[BE] Remove unused 'rows' parameter from spmm_bmm_coo_rows_grouped ( #166041 )
...
Fixes the following compilation warnings:
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/sparse/mps/kernels/Mul.metal:76:14: warning: unused variable 'B' [-Wunused-variable]
const uint B = dims.x;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/sparse/mps/kernels/Mul.metal:65:26: warning: unused parameter 'rows' [-Wunused-parameter]
device const long* rows [[buffer(0)]],
^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166041
Approved by: https://github.com/Skylion007
2025-10-22 00:59:41 +00:00
Han Chao
a1005427bf
[xpu] Support high stream for ProcessGroupXCCL ( #163049 )
...
Add high-priority stream support for ProcessGroupXCCL. Just like CUDA, XPU streams support execution at higher priority than other streams. The implementation is in https://github.com/intel/torch-xpu-ops/pull/1715 ; this PR adds the registration here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163049
Approved by: https://github.com/guangyey , https://github.com/gujinghui , https://github.com/EikanWang , https://github.com/albanD
2025-10-22 00:54:25 +00:00
Yuanyuan Chen
35153d0846
Simplify c10::guts::apply ( #164566 )
...
There is only one call site of `c10::guts::apply` that can be replaced by `std::apply`, except for ROCm. This PR therefore simplifies the implementation of `c10::guts::apply`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164566
Approved by: https://github.com/Aidyn-A , https://github.com/albanD
2025-10-22 00:47:43 +00:00
PyTorch MergeBot
7773a22cdb
Revert "[AMP][Refactor] Autocast dtype handling to simplify device-specific c… ( #165221 )"
...
This reverts commit 4be1e3bf92 .
Reverted https://github.com/pytorch/pytorch/pull/165221 on behalf of https://github.com/clee2000 due to I think this broke test_openreg [GH job link](https://github.com/pytorch/pytorch/actions/runs/18698271058/job/53322459496 ) [HUD commit link](4be1e3bf92 ) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/165221#issuecomment-3430012693 ))
2025-10-22 00:26:57 +00:00
Yuanyuan Chen
7cb467a169
[CI] Update ONNX CI packages to latest ( #165883 )
...
This PR updates ONNX related packages to their latest versions used in CI environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165883
Approved by: https://github.com/justinchuby , https://github.com/albanD
2025-10-22 00:25:35 +00:00
KarhouTam
12aac12b8d
[Code Clean] Replace std::runtime_error with TORCH_CHECK ( #165209 )
...
Including:
1. `aten/src/ATen/core`
2. `c10/core`
Fixes part of #148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165209
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-22 00:05:22 +00:00
jainapurva
2b748d0a56
Add operator name to output json ( #164583 )
...
On the benchmarks dashboard, model_name needs to be grouped with operator_name. This PR passes an additional operator_name argument into the output JSON for grouping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164583
Approved by: https://github.com/yangw-dev
2025-10-21 23:58:39 +00:00
Shangdi Yu
16745a882a
[aoti][win] add support for a list of shim libraries ( #165914 )
...
As titled: support passing in a list of shim libraries when cross-compiling artifacts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165914
Approved by: https://github.com/desertfire
2025-10-21 22:55:17 +00:00
PyTorch MergeBot
8daef35cf1
Revert "[Code Clean] Clean asserts in torch/ao/quantization (root, quantizer, backend_config) ( #165433 )"
...
This reverts commit df64c0c464 .
Reverted https://github.com/pytorch/pytorch/pull/165433 on behalf of https://github.com/clee2000 due to I think this broke some quantization tests ([comment](https://github.com/pytorch/pytorch/pull/165433#issuecomment-3429741770 ))
2025-10-21 22:10:19 +00:00
Nicolas De Carli
51319ca090
[Pytorch] Add NEON Vectorized<uint> family of translation layers ( #165690 )
...
Summary:
Adding NEON specializations of Vectorized<T> for uint8, uint16, uint32 and uint64.
Correctness has been checked using test_ops.py
operator_benchmark_test.py, which uses the PyTorch API, shows significant enhancements in some operations:
Before:
uint8 mul: 1460.751us
uint8 add: 2359.565us
uint8 lsl: 2151.206us
After:
uint8 mul: 194.792us ---> 650% higher throughput
uint8 add: 195.609us ---> 1100% higher throughput
uint8 lsl: 186.249us ---> 1055% higher throughput
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Reviewed By: mcfi
Differential Revision: D84770153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165690
Approved by: https://github.com/malfet
2025-10-21 21:46:55 +00:00
Guang Yang
d311a3d1dc
A temporary fix to autotune out of range and related IMA ( #165943 )
...
Summary:
Autotune issue during lowering w/ AOTI:
```
setStorage: sizes [1536, 32, 8192], strides [8192, 8192, 1], storage offset 0, and itemsize 2 requiring a storage size of 25673728 are out of bounds for storage of size 25362432
```
Need a hack to create a new base tensor with sufficient storage.
Test Plan: Finally able to see the e2e test pass on CI. See the detailed Test Plan in D83520844
Differential Revision: D84872792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165943
Approved by: https://github.com/laithsakka
2025-10-21 21:40:20 +00:00
Zhaoqi Zhu
04adfe5ba9
Make Backend::setGroupUid virtual ( #165957 )
...
As titled, so that we may customize this function in custom backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165957
Approved by: https://github.com/d4l3k
2025-10-21 21:33:24 +00:00
KarhouTam
4be1e3bf92
[AMP][Refactor] Autocast dtype handling to simplify device-specific c… ( #165221 )
...
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.
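A minimal sketch of the idea, with illustrative names and dtype sets (not the actual torch internals):
```python
import warnings
import torch

# One central mapping from device type to the dtypes its autocast supports,
# instead of repeating per-device branches. Entries here are illustrative.
device_supported_dtypes = {
    "cuda": (torch.float16, torch.bfloat16),
    "cpu": (torch.bfloat16, torch.float16),
}

def validate_autocast_dtype(device_type: str, dtype: torch.dtype) -> bool:
    supported = device_supported_dtypes.get(device_type, ())
    if dtype not in supported:
        # Mirrors the warning text quoted in the CI failure below.
        warnings.warn(
            f"In {device_type} autocast, but the target dtype is not supported. "
            "Disabling autocast."
        )
        return False
    return True
```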
**The former PR #163446 was merged but reverted due to failed CI test on `openreg` related tests.**
This PR additionally makes slight modifications to some test assertions so the CI tests pass. CI had failed because an assertion expected the exact same error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
with self.assertWarnsRegex(
AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```
Sorry for the inconvenience again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-21 21:32:12 +00:00
Catherine Lee
e7592f4005
[CI] Move the periodic debug tests to newer runner ( #165158 )
...
Previously g3 = NVIDIA Tesla M60
Now g6 = NVIDIA L4
Also change cuda arch list accordingly
Pros:
More memory, newer GPU
Cons:
That was one of the few remaining tests on g3 runners, so we probably lost coverage?
We can probably run more tests in parallel now but I'm not going to do that here
Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware? They are now failing with
```
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper
method(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper
with policy():
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__
self.beforeStreams[-1].synchronize()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize
super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0
#7 THCPStream_synchronize(_object*, _object*) from Stream.cpp:0
#8 cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489
#9 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#10 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
#11 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#12 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
```
When run with CUDA_LAUNCH_BLOCKING=1, I got a ton of output like
```
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165158
Approved by: https://github.com/seemethere
2025-10-21 21:28:12 +00:00
Isalia20
d334c3649d
[CUDA] fix reflection padding for large batch size ( #165942 )
...
Fixes [#165861 ](https://github.com/pytorch/pytorch/issues/165861 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165942
Approved by: https://github.com/eqy
2025-10-21 21:07:38 +00:00
Jerry Mannil
9f82535c5a
[ROCm] [Normalization] Update block size ( #165941 )
...
* Seeing up to 6x improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165941
Approved by: https://github.com/jeffdaily
2025-10-21 20:53:05 +00:00
Ivan Zaitsev
5b35fc8777
Support multiple commits on push events in trunk tagging workflow ( #165937 )
...
Context:
* this workflow is used to create tags like `trunk/{sha}` for all `main` commits
* those tags are used by [autorevert](https://github.com/pytorch/test-infra/blob/main/aws/lambda/pytorch-auto-revert/README.md ) to rerun selected workflows
Problem: currently the workflow creates only a single tag per push event, while ghstack pushes multiple commits per single push.
This PR supports tag creation for all commits in the push event.
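A simplified sketch of the multi-commit tagging logic (illustrative only; the real implementation is the GitHub Actions workflow linked below, and the event-payload path here is assumed):
```python
import json
import subprocess

# A push event payload lists every commit in the push; tag each one as
# trunk/{sha} instead of tagging only the head commit.
with open("push_event.json") as f:  # path assumed for illustration
    event = json.load(f)

for commit in event.get("commits", []):
    sha = commit["id"]
    tag = f"trunk/{sha}"
    subprocess.run(["git", "tag", tag, sha], check=True)
    subprocess.run(["git", "push", "origin", tag], check=True)
```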
Complementary autorevert PR: https://github.com/pytorch/test-infra/pull/7291
---
### Testing
I created an identical copy of this workflow in my personal repo: https://github.com/izaitsevfb/pr-head-test/actions/workflows/trunk-tagging.yml
See action runs there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165937
Approved by: https://github.com/huydhn
2025-10-21 20:52:34 +00:00
Nikita Vedeneev
2f38eece7c
[CUDA][cuBLAS] addmm -- some refactoring for easier navigation between the Lt and non-Lt paths ( #163955 )
...
As per title. Additionally, some Lt selection conditions are revisited, and some redundancy removed (especially in the ROCm vs non-ROCm paths).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163955
Approved by: https://github.com/ngimel , https://github.com/eqy
2025-10-21 20:48:12 +00:00
Animesh Jain
830e789a55
[dynamo][annotate] Graph break cleanly on fx.traceback.annotate reconstruction ( #166006 )
...
This avoids generating bad bytecode that leads to a really confusing
error. I am not sure why we can't reconstruct cleanly; it has to do with
the input being a dict, while other supported ctx managers take bools.
Fixing that is for another day. Let's give a good error message for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166006
Approved by: https://github.com/yushangdi , https://github.com/SherlockNoMad
2025-10-21 20:48:04 +00:00
PyTorch MergeBot
ad4dc52bf6
Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"
...
This reverts commit 4e643422f6 .
Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3429426503 ))
2025-10-21 20:24:14 +00:00
dependabot[bot]
dac9ed9790
Bump uv from 0.8.6 to 0.9.5 in /.ci/lumen_cli ( #166017 )
...
Bumps [uv](https://github.com/astral-sh/uv ) from 0.8.6 to 0.9.5.
- [Release notes](https://github.com/astral-sh/uv/releases )
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md )
- [Commits](https://github.com/astral-sh/uv/compare/0.8.6...0.9.5 )
---
updated-dependencies:
- dependency-name: uv
dependency-version: 0.9.5
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-21 13:16:30 -07:00
linhaifeng
1c7fe8f861
[BugFix] chunk_size should always be int64_t ( #165971 )
...
Inspired by https://github.com/pytorch/pytorch/pull/156872
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165971
Approved by: https://github.com/albanD
2025-10-21 19:52:47 +00:00
Bruce Chang
4e643422f6
shrink_group implementation to expose ncclCommShrink API ( #164518 )
...
Closes #164529
To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink ) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-21 19:47:33 +00:00
Jason Ansel
3c3b278872
[reland][fx] Move Node._prepend/Node._remove_from_list to C++ ( #165882 )
...
Relands #148261, which was reverted by #150542.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165882
Approved by: https://github.com/ezyang
2025-10-21 19:43:55 +00:00
Nikita Shulga
0bd12c1168
[CI] Extend test_transfomers to MPS ( #165960 )
...
Just skip grad_checks as they need float64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165960
Approved by: https://github.com/Skylion007
2025-10-21 19:27:44 +00:00
PyTorch MergeBot
ce8a7764e2
Revert "[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )"
...
This reverts commit 1290b077f2 .
Reverted https://github.com/pytorch/pytorch/pull/165707 on behalf of https://github.com/clee2000 due to failing internal tests D85160820 ([comment](https://github.com/pytorch/pytorch/pull/165707#issuecomment-3429084393 ))
2025-10-21 19:25:03 +00:00
Tushar Jain
d1269a0434
update fr trace analysis ( #165994 )
...
Summary:
- allow empty entries from ranks
- allow dumps to be missing from some ranks
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com ). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/165994 ).
* #165638
* #165640
* #165642
* __->__ #165994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165994
Approved by: https://github.com/fduwjj
2025-10-21 19:14:33 +00:00
Pearu Peterson
c87cf1be32
Update workaround to old CUDA bug ( #164354 ) ( #165984 )
...
The workaround cannot be removed because of BC. Here we update the
PyTorch code base to stop using the workaround.
See https://github.com/pytorch/pytorch/pull/164354 for the BC breakage issue.
Resolves https://github.com/pytorch/pytorch/issues/164348 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165984
Approved by: https://github.com/janeyx99
2025-10-21 19:09:43 +00:00
Tugsbayasgalan Manlaibaatar
2fc5e45a41
better error message when there is no pytree impl ( #165955 )
...
Differential Revision: [D85117597](https://our.internmc.facebook.com/intern/diff/D85117597 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165955
Approved by: https://github.com/avikchaudhuri
2025-10-21 18:49:22 +00:00
Shivam Raikundalia
f9022ba93b
[PyTorch] Add user_metadata display to memory visualizer ( #165939 )
...
Summary: Enhanced the PyTorch CUDA memory visualizer to display user_metadata alongside stack frames when inspecting allocations. The user_metadata field is now shown in all views (Allocator State History, Active Memory Timeline, etc.) with consistent formatting. The implementation handles both string and object metadata types, displaying strings directly and objects as key-value pairs.
Test Plan:
1. Generate a memory snapshot with user_metadata
2. Open the memory visualizer in a browser
3. Load the snapshot file
4. Verify user_metadata appears
5. Test with both string metadata ("testing") and object metadata ({"key": "value"})
6. Verify formatting shows "User Metadata:\n <value>" for strings
{F1982860439}
Differential Revision: D85095152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165939
Approved by: https://github.com/yushangdi
2025-10-21 18:48:33 +00:00
Tony Targonski
ff8be889ad
Remove unused exception parameter from some files, to work with -Wunused-exception-parameter ( #165770 )
...
Summary: Address compiler complaints that were coming up, in order to unblock the build.
Test Plan:
before the change
```
aten/src/ATen/native/LinearAlgebra.cpp:3623:36: error: unused exception parameter 'e' [-Werror,-Wunused-exception-parameter]
3623 | } catch (const std::exception& e) {
|
```
after: targets build with `-Wunused-exception-parameter`
Differential Revision: D84876246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165770
Approved by: https://github.com/Skylion007 , https://github.com/cyyever
Co-authored-by: Tony Targonski <tony.targonski@meta.com>
2025-10-21 18:30:29 +00:00
Wang, Chuanqi
292454942e
[CD] Introduce windows.12xlarge runners for CD Windows build ( #165287 )
...
Follows https://github.com/pytorch/test-infra/pull/7174 . Windows CD build time comparison below:
|Runner|cpu|cuda|xpu|
|-|-|-|-|
|windows.4xlarge|1.5h| 4.0h| 5.5h|
|windows.12xlarge|0.5h|1.5h|2.5h|
Fixes #162962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165287
Approved by: https://github.com/zxiiro , https://github.com/malfet , https://github.com/seemethere
2025-10-21 18:28:23 +00:00
PyTorch MergeBot
6c4412f72b
Revert "[Inductor] support masked vectorization for the tail_loop for float64 datatype ( #163316 )"
...
This reverts commit e9d8973427 .
Reverted https://github.com/pytorch/pytorch/pull/163316 on behalf of https://github.com/clee2000 due to seems to have broken some no_gpu tests? test/inductor/test_cpu_repro.py::CPUReproTests::test_double_reduction_vec [GH job link](https://github.com/pytorch/pytorch/actions/runs/18689033019/job/53290772740 ) [HUD commit link](e9d8973427 ) ([comment](https://github.com/pytorch/pytorch/pull/163316#issuecomment-3428210509 ))
2025-10-21 17:44:42 +00:00
PyTorch MergeBot
78bf6186f2
Revert "[Inductor] support masked vectorization for the tail_loop for fp8 datatype ( #163324 )"
...
This reverts commit e8cb34dd52 .
Reverted https://github.com/pytorch/pytorch/pull/163324 on behalf of https://github.com/clee2000 due to seems to have broken some no_gpu tests? test/inductor/test_cpu_repro.py::CPUReproTests::test_double_reduction_vec [GH job link](https://github.com/pytorch/pytorch/actions/runs/18689033019/job/53290772740 ) [HUD commit link](e9d8973427 ) ([comment](https://github.com/pytorch/pytorch/pull/163316#issuecomment-3428210509 ))
2025-10-21 17:44:42 +00:00