Commit Graph

95006 Commits

Author SHA1 Message Date
Nikita Shulga
add37bacda [MPS] Better error checking for FFT ops (#166272)
Namely, error out rather than crash when the out dtype is of an unexpected type.
Resize the output tensor to the expected size in the `_out` operation, to prevent a crash when a tensor of an unexpected size is passed.
Preserve symbolic shapes whenever possible
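A minimal sketch of the previously crashing pattern, per the test plan; the exact shapes are assumptions and an MPS device is required:

```
import torch

x = torch.rand(8, device="mps")
out = torch.empty(3, device="mps")  # deliberately wrong-sized out tensor
# Before this change, MPS could crash inside the kernel; now the _out op
# resizes `out` to the expected size (with the usual out= warning) instead.
torch.fft.hfft(x, out=out)
```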

Test plan: Run `python test_ops.py -v -k test_out_warning_fft_hfft_mps` for the MPS device; without this change it crashes with `Error: Invalid KernelDAG, equalShape for destination failed`. Run `python ../test/test_ops.py -v -k test_dtypes_stft_mps`; without this change it crashes with `A complex mlir::Type does not have a corresponding complex MPSDataType` when the input dtype is bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166272
Approved by: https://github.com/kulinseth
2025-10-28 01:31:47 +00:00
karthickai
1425b40f29 [inductor] Fix argmin/argmax returning incorrect indices for non-contiguous tensor (#165983)
Fixes #163929

Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors. When `argmin()` or `argmax()` is called on a transposed tensor, Inductor was returning physical memory indices instead of logical row-major indices, causing incorrect results that don't match eager mode behavior.
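A minimal sketch of the failure mode, assuming a repro along the lines of the linked issue:

```
import torch

def f(x):
    return x.argmax()

x = torch.randn(4, 8)
xt = x.t()  # non-contiguous transposed view, logical shape (8, 4)
# Eager returns the logical row-major index into xt; before this fix,
# the compiled version could return the physical offset into x's storage.
assert torch.compile(f)(xt) == f(xt)
```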

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983
Approved by: https://github.com/shunting314
2025-10-28 01:23:24 +00:00
bobrenjc93
8af9ed0824 [torchfuzz] split, chunk, stack, cat, expand, gather, cumsum, clamp, index_select, split (#166221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166221
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220, #166189, #166190
2025-10-28 01:21:07 +00:00
bobrenjc93
7045aab143 [torchfuzz] add mhaf operator (#166190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166190
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220, #166189
2025-10-28 01:21:07 +00:00
bobrenjc93
7ae8aaf4c0 [torchfuzz] add sdpa operator (#166189)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166189
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220
2025-10-28 01:20:58 +00:00
bobrenjc93
f2450798cd [torchfuzz] make pointwise subclasses defined torch_op_name (#166220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166220
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188
2025-10-28 01:08:34 +00:00
fduwjj
46d17e8871 [Symm mem] Add a unit test for mempool tensor with dist collective (#166206)
We hadn't verified whether c10d collectives work on tensors allocated on NVSHMEM. This PR adds a showcase for it in a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166206
Approved by: https://github.com/ngimel
2025-10-28 00:41:47 +00:00
Shunting Zhang
dc011d3203 [inductor][ez] add overridable env var for disabling fx graph cache (#166138)
I set TORCHINDUCTOR_FX_GRAPH_CACHE=0 a lot to make sure compilation
actually happens, by disabling fx graph caching. I even put this in my .bashrc.
But this causes a simple vllm script to fail:
https://gist.github.com/shunting314/4253b2b5ab5e7d1b0fc9516c84054904

Error log:
https://gist.github.com/shunting314/1d04bbeb58bc486f975684f56d65615d

The root cause:
1. vllm patches inductor_config.fx_graph_cache to True here:
   e255d92990/vllm/compilation/compiler_interface.py (L308)

   The code in vllm relies on the fx graph cache being on (unless
   VLLM_DISABLE_COMPILE_CACHE is overridden to false).
2. Setting TORCHINDUCTOR_FX_GRAPH_CACHE=0 makes
   inductor_config.fx_graph_cache non-overridable.

I added TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT so that we can still use it to skip the fx
graph cache while still allowing projects like vllm to override it.
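A sketch of the intended interaction; the variable names are from this PR, but the exact semantics here are my reading of it:

```
import os
# Soft default: fx graph caching off, but still overridable from code.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT"] = "0"

import torch._inductor.config as inductor_config
# A project like vllm can still flip the cache on, unlike with
# TORCHINDUCTOR_FX_GRAPH_CACHE=0, which pins the value.
inductor_config.fx_graph_cache = True
```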

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166138
Approved by: https://github.com/eellison
2025-10-28 00:27:19 +00:00
Menglu Yu
e95920e3e6 [Optimus] Rename the post_grad_graph tlparse log (#166109)
Summary:
ezyang observed a cache-miss issue; see details in https://github.com/pytorch/pytorch/issues/166012

We thus rename the post_grad_graph tlparse log to resolve the cache issue.

Differential Revision: D85309891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166109
Approved by: https://github.com/jamesjwu
2025-10-28 00:23:01 +00:00
Ting Lu
5e769ff867 [CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470)
13.0 Update 2 is posted; adding it to nightlies.
Why we want to upgrade: CUDA 13.0 Update 2 includes a new cuBLAS release that
1. Enables opt-in fixed-point emulation for FP64 matmuls (D/ZGEMM), which improves performance and power efficiency.
2. Improves performance on NVIDIA [DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/) for FP16/BF16 and FP8 GEMMs.
3. Adds BF16x9 FP32 emulation support for the SYRK and HERK routines.
Reference: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-13-0-update-2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165470
Approved by: https://github.com/atalman
2025-10-28 00:21:47 +00:00
bobrenjc93
0ae3e30621 [torchfuzz] fix group norm operator (#166188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166188
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187
2025-10-28 00:11:04 +00:00
bobrenjc93
47f50cfd45 [torchfuzz] check in more ignore regexes (#166187)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166187
Approved by: https://github.com/pianpwk
2025-10-27 23:58:54 +00:00
Dzmitry Huba
a51f877287 Enable local tensor mode for another set of DTensor tests (#166105)
Enable local tensor mode DTensor tests for the optimizers, op strategy, matrix ops,
math ops, init ops, experimental ops, embedding ops, dynamic, convolution ops, and main API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166105
Approved by: https://github.com/ezyang
2025-10-27 23:58:24 +00:00
Ruben Rodriguez Buchillon
b44423bbb4 [inductor][choices] lookup table choices 1/3 (#164978)
\# why

- enable users to control which choices get used on which inputs
- reduce lowering time and pin kernel selection by selecting
  them for the inputs

\# what

- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing

- currently only supports templates that go through
  `V.choices.get_template_configs`

\# testing

```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
2025-10-27 23:45:16 +00:00
Animesh Jain
8e1e4ee8e0 [reland][dynamo][easy] Support torch.accelerator.current_accelerator (#166327)
Reland https://github.com/pytorch/pytorch/pull/165734
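A small sketch of the kind of code this unblocks; torch.accelerator.current_accelerator() is the public API, the function body is illustrative:

```
import torch

@torch.compile(fullgraph=True)
def f(x):
    # With this reland, dynamo traces the accelerator query (a trace-time
    # constant) instead of graph-breaking on it.
    if torch.accelerator.current_accelerator() is not None:
        return x + 1
    return x - 1

print(f(torch.randn(4)))
```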

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166327
Approved by: https://github.com/Lucaskabela
2025-10-27 23:41:43 +00:00
Isalia20
1e836bc769 [MPS] fix large matmul test device (#166271)
The PR is self-explanatory.
The test was introduced by https://github.com/pytorch/pytorch/pull/143095 and was always running on the CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166271
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-27 22:56:59 +00:00
Millie Chen
9a91486e45 [Inductor-FX] Don't flatten constant args (#166144)
Summary: Fallback kernels are created with flattened constant args, plus an `unflatten` utility to restore them when needed. Apply it in the FXConverter to preserve the original structure.

Test Plan: added new CI tests

Differential Revision: D85347589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166144
Approved by: https://github.com/blaine-rister
2025-10-27 22:33:37 +00:00
Joseph Macaranas
92381a5aa7 [ROCm] Custom OpenBLAS library name (#166333)
- TheRock build system for ROCm builds OpenBLAS from source and uses a custom name for the library.
- Following existing conventions in `FindOpenBLAS.cmake` to support finding a custom named version of OpenBLAS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166333
Approved by: https://github.com/jeffdaily
2025-10-27 22:13:05 +00:00
Eddie Yan
2a5f87decf [cuDNN] Smoke-test runtime cuDNN version matches compile time version in CI (#165922)
Fix and regression test for https://github.com/pytorch/pytorch/issues/165801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165922
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/Skylion007, https://github.com/drisspg

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-27 22:10:45 +00:00
Andrey Talman
840d63c12d Update cuDNN 9.10.2 in Manylinux 2.28 Docker files (#165913)
Fixes https://github.com/pytorch/pytorch/issues/165801
Smoke test: https://github.com/pytorch/pytorch/pull/165922/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165913
Approved by: https://github.com/Camyll, https://github.com/Skylion007
2025-10-27 22:08:06 +00:00
Animesh Jain
2ce894bb1d [dynamo] Dont guard on numpy Cython functions (#166328)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166328
Approved by: https://github.com/Lucaskabela
2025-10-27 22:01:10 +00:00
Tugsbayasgalan Manlaibaatar
47ec1e9990 Support regional inductor with custom config (#166269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166269
Approved by: https://github.com/anijain2305
2025-10-27 21:46:02 +00:00
fduwjj
904abfc2ca Export flex attention with kwargs and DTensor (#166045)
Fixes #165948

Adding registration of the MaskBlock makes flex attention with kwargs exportable (a sketch follows the test commands below).

Also modified the unit tests to accept kwargs.

```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export

python test/inductor/test_flex_attention.py -k test_pytree_
```
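A hedged sketch of the now-exportable pattern; the module, shapes, and mask are illustrative, not taken from the PR:

```
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

class Attn(torch.nn.Module):
    def forward(self, q, k, v, *, block_mask=None):
        return flex_attention(q, k, v, block_mask=block_mask)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

q = k = v = torch.randn(2, 4, 128, 64)
mask = create_block_mask(causal, B=2, H=4, Q_LEN=128, KV_LEN=128, device="cpu")
# block_mask passed as a kwarg is exactly the case this PR makes exportable.
ep = torch.export.export(Attn(), (q, k, v), kwargs={"block_mask": mask})
```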

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166045
Approved by: https://github.com/drisspg, https://github.com/SherlockNoMad

Co-authored-by: fduwjj <fduwjj@gmail.com>
2025-10-27 21:40:40 +00:00
Scott Wolchok
7d16fcf2df Re-re-re-re-apply "C++-accessible Placements via pybind11 (#163030)" (#166132)
Was reverted (again!) due to a merge conflict that crept in sometime during the "export to github -> land internally -> merge on github" process.

D85096233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166132
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/malfet
2025-10-27 21:19:32 +00:00
Anshul Sinha
483845a9c4 [DTensor][Op] fix for DTensor ops with Partial placements (#165962)
**Summary:** For operations on Partial placements, the sharding logic incorrectly determined whether the tensor should be redistributed to Replicate. Because the redistribution was delayed, the operation ran first and the partial reduction second, which produces incorrect results for max, min, gradient norm clipping, and more. We fix this by setting reduction_linear to False whenever a Partial placement is present, forcing the redistribution before the op (illustrated in the sketch below).
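A hedged illustration of the failure mode with toy values; the mesh setup is an assumption (run with e.g. `torchrun --nproc-per-node 2`):

```
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (2,))
# The logical tensor is the SUM of the per-rank locals: [5.0, 2.0].
local = torch.tensor([1.0, 5.0]) if dist.get_rank() == 0 else torch.tensor([4.0, -3.0])
dt = DTensor.from_local(local, mesh, [Partial()])
# max() must be 5.0; reducing per-rank first and then summing the partial
# maxima would give 5.0 + 4.0 = 9.0, the kind of error this PR fixes.
print(dt.max().full_tensor())
```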

**Test Cases**
1. pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops
2. pytest test/distributed/tensor/test_math_ops.py -k test_matching_partial_reduction_ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165962
Approved by: https://github.com/wconstab
2025-10-27 21:17:13 +00:00
Anshul Sinha
60bcb4ee88 [pipeline][be] refactored pipeline composability tests (#165701)
**Summary:** First, I increased the world size to 8, because test_3d_with_tp_dp_pp wouldn't actually run fully_shard with tp = 2 and pp = 2, which left dp = 1. Second, I refactored the tests that use single- and multi-stage schedules so their logic is largely shared: the multi-stage logic from test_replicate_pp_grad determines the start and end indices for a partial model, and virtual_stage is set to 1 when using single-stage schedules. Even if this approach isn't approved, the multi-stage schedule logic in test_3d_with_tp_dp_pp and test_replicate_pp should be changed, as the logic used is incorrect.

**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165701
Approved by: https://github.com/H-Huang
2025-10-27 21:08:57 +00:00
Animesh Jain
ee7434be82 [dynamo][guards] 1/N Guard selectively for DTensor (#165824)
A few internal jobs are observing very high guard overhead for DTensor.
Since we own DTensor, we can make those guards way faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165824
Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh
2025-10-27 20:35:40 +00:00
Nikita Shulga
d049ed2cb1 [BE] Fix metal compilation warnings (#166315)
- Fixes `s/#pragma onces/#pragma once/` typo

All methods in the headers must be inline, otherwise one gets a barrage of warnings like the following:
```
/Users/malfet/git/pytorch/pytorch/c10/metal/utils.h:337:7: warning: unused function 'conj<half __attribute__((ext_vector_type(2)))>' [-Wunused-function]
half2 conj(half2 a) {
      ^
/Users/malfet/git/pytorch/pytorch/c10/metal/utils.h:342:8: warning: unused function 'conj<float __attribute__((ext_vector_type(2)))>' [-Wunused-function]
float2 conj(float2 a) {
       ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166315
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-10-27 20:17:10 +00:00
KarhouTam
9901d44418 [torch/utils][Code Clean] Clean asserts in torch/utils/*.py (#165410)
Including:
- `torch/utils/*.py`

Fixes part of #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165410
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-10-27 19:48:55 +00:00
Tugsbayasgalan Manlaibaatar
6096c0fc74 Export should use aot_export_joint_with_descriptors (#165931)
This diff moves export's run_decompositions to use aot_export_joint_with_descriptors instead of aot_export_module. Doing so, I ran into 2 main bugs:
1) aot_export_joint_with_descriptors doesn't correctly pass in the record_nn_module_stack flag that is needed to populate nn_module_stack when switching the internal tracer.
2) When creating a symint with negative inputs, we need to pass in positive=False. This didn't matter before because aot_autograd directly returned integer inputs instead of creating symints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165931
Approved by: https://github.com/zhxchen17
2025-10-27 19:33:33 +00:00
eun2ce
f6951cb8ea [dynamo] Fix recompilation error message to point to new programming model docs (#165260)
Fixes #163496

Updated troubleshooting_url in torch/_dynamo/utils.py to point to the new programming model documentation.

Changed:
- Old: https://pytorch.org/docs/main/torch.compiler_troubleshooting.html
- New: https://pytorch.org/docs/main/compile/programming_model.recompilation.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165260
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42
2025-10-27 19:31:11 +00:00
Nicolas De Carli
8887a33ede [PyTorch] Improve conversion from/to FP16 on aarch64+sve (#166306)
Summary:
Conversion from/to float16 was not covered by the conversion templates, because these used float16_t as the data type instead of the custom at::Half.

We add a shim that makes the conversion routines use the autovectorized code for float16.

We observed the following performance improvements when compiling for armv9-a+sve2+fp16:

before:

float16_t->uint8->float16_t ===> 657.489us
float16_t->int8->float16_t ===> 656.518us
float16_t->int16->float16_t ===> 668.998us
float16_t->int64->float16_t ===> 618.444us
float16_t->double->float16_t ===> 439.728us

after

float16_t->uint8->float16_t ===> 181.216us ----> 263% higher throughput
float16_t->int8->float16_t ===> 179.821us ----> 265% higher throughput
float16_t->int16->float16_t ===> 183.417us ----> 265% higher throughput
float16_t->int64->float16_t ===> 459.897us ----> 35% higher throughput
float16_t->double->float16_t ===> 351.276us ----> 25% higher throughput
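A rough Python-level analogue of the measured roundtrips; the PR's numbers come from the buck-based operator benchmark, this only mirrors the shape of the measurement on 1M-element tensors:

```
import time
import torch

x = torch.randn(1_000_000).to(torch.float16)

def roundtrip_us(dtype, iters=100):
    for _ in range(10):          # warmup
        x.to(dtype).to(torch.float16)
    t0 = time.perf_counter()
    for _ in range(iters):
        x.to(dtype).to(torch.float16)
    return (time.perf_counter() - t0) / iters * 1e6

for dt in (torch.uint8, torch.int8, torch.int16, torch.int64, torch.float64):
    print(f"float16->{dt}->float16: {roundtrip_us(dt):.1f}us")
```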

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85533271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166306
Approved by: https://github.com/mcfi, https://github.com/ezyang
2025-10-27 19:07:44 +00:00
Maggie Moss
36a48e7e6d Fix existing pyrefly errors on main (#166312)
Silences existing errors on main to keep errors and noise from the type checker to a minimum

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166312
Approved by: https://github.com/Skylion007
2025-10-27 19:03:06 +00:00
Catherine Lee
c6a02eae5b Add XLAHooksInterface to bazel file (#166179)
Differential Revision: D85446553

Internal builds failing after https://github.com/pytorch/pytorch/pull/161369

```
buck-headers/ATen/Context.h:22:10: fatal error: 'ATen/detail/XLAHooksInterface.h' file not found
   22 | #include <ATen/detail/XLAHooksInterface.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
```

Changes similar to that PR also require updating the build_variables file, which I've done here. I'm not sure why this wasn't caught by the bazel build we have.

Sanity checked that some of the previously failing builds pass after this change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166179
Approved by: https://github.com/Camyll
2025-10-27 18:47:06 +00:00
Mikayla Gawarecki
6ecd6b23b6 Document limitations of weights_only in SECURITY.md and torch.load doc (#165645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165645
Approved by: https://github.com/albanD
2025-10-27 18:20:50 +00:00
Sarthak Tandon
3f69b4d9b4 [ROCm][tunableop] Fixes flaky test issue (#166084)
Fixes #165603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166084
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-27 18:13:30 +00:00
Shunting Zhang
a04edcb27a [inductor] a few workspace api change (#166204)
A few workspace API changes:
1. Return the outer name when creating. Usually a caller does not care about the outer name, but for mix-order reduction (stacked PR) we need it to run the next layer of reduction on the workspace tensor.
2. Allow overriding the workspace tensor dtype.
3. Allow delaying the deallocation of workspace tensors in TritonKernel.call_kernel, since they may be used after the call. Their lifetime is only extended a little; they are deallocated once the next-layer reduction is done.

Tested with the stacked PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166204
Approved by: https://github.com/jansel
2025-10-27 18:10:23 +00:00
anwang
eb2bad5bb5 [Inductor] Make combo kernel MAX_NUM_ARGS configurable (#166274)
ComboKernel's MAX_NUM_ARGS is currently a fixed number. We need to tune it to avoid overly large fusions on MTIA, so this makes it configurable.
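A sketch of the intended usage; `combo_kernels` is an existing inductor config flag, but the name of the new max-args knob below is a guess, not taken from the PR:

```
import torch._inductor.config as inductor_config

inductor_config.combo_kernels = True
# Hypothetical name for the newly configurable limit; the PR only says
# that ComboKernel's MAX_NUM_ARGS becomes configurable.
# inductor_config.combo_kernel_max_num_args = 64
```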

Differential Revision: [D85509352](https://our.internmc.facebook.com/intern/diff/D85509352/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166274
Approved by: https://github.com/eellison
2025-10-27 18:06:38 +00:00
Catherine Lee
a076b4d7ac Use std::min for #166021 (#166195)
Summary:
Attempting to forward-fix failures from D85405167 (PR
https://github.com/pytorch/pytorch/pull/166021).

This is Devmate's suggestion and seems to work, but I don't know whether it's a good idea. Devmate says it's getting resolved to at::min, which is host-only, and it doesn't happen in OSS likely because `AT_PER_OPERATOR_HEADERS` is defined in OSS but not internally.

```
In file included from .../ATen/native/hip/Normalization.hip:11:
.../ATen/native/hip/Normalization.cuh:302:37: error: no matching function for call to 'min'
  302 |         v_[u] = input[batch][plane][min(x+u*blockDim.x, input.size(2)-1)];
      |                                     ^~~
```

Differential Revision: D85463674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166195
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/eqy
2025-10-27 17:57:44 +00:00
PyTorch MergeBot
a988510c33 Revert "Simplify the CUPTI CMake check for kineto (#161370)"
This reverts commit e67e3d95f3.

Reverted https://github.com/pytorch/pytorch/pull/161370 on behalf of https://github.com/atalman due to Sorry this is failing libtorch nightly builds [pytorch/pytorch/actions/runs/18800131287/job/53653414136](https://github.com/pytorch/pytorch/actions/runs/18800131287/job/53653414136) ([comment](https://github.com/pytorch/pytorch/pull/161370#issuecomment-3452400982))
2025-10-27 17:05:59 +00:00
Animesh Jain
99e07c39ec [dynamo][misc] Replace UserFunctionVariable with VariableTracker build (#165707)
Audit: To prevent future issues with functools.partial or callable
objects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165707
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166251
2025-10-27 16:47:32 +00:00
Animesh Jain
610c09f8f4 [dynamo] Fix python_type for UserDefinedClassExceptionVariable (#166251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166251
Approved by: https://github.com/Lucaskabela
2025-10-27 16:47:32 +00:00
Animesh Jain
61bad3c1ea [dynamo] Move some FUNCTION_MATCH to CLOSURE_MATCH (#166244)
Closure match is more relaxed than FUNCTION_MATCH (which is ID_MATCH)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166244
Approved by: https://github.com/Lucaskabela
2025-10-27 16:43:46 +00:00
linhaifeng
f89a7e9fe8 [1/N][Fix] Fix typo in aten folder (#166126)
Fix typo in aten folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166126
Approved by: https://github.com/cyyever, https://github.com/slayton58
2025-10-27 15:34:39 +00:00
fduwjj
f2c81635c8 [DeviceMesh][2D] Use concatenate for 2D (FSDP+TP) instead of getting from root mesh (#165492)
With the concatenate API, we can directly combine two meshes rather than deriving the SPMD mesh from the root mesh.

Differential Revision: [D85409698](https://our.internmc.facebook.com/intern/diff/D85409698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165492
Approved by: https://github.com/fegin
ghstack dependencies: #163358
2025-10-27 15:33:21 +00:00
Nicolas De Carli
e214af6ae8 [Pytorch] Improve float32 erf() on aarch64 (#166262)
Summary:
The float32 data type has a vectorized routine that computes erf(). This function currently calls std::exp() individually for each float in the vector being processed.
We now use Sleef's vectorized routine to compute exp, improving the performance of erf.

AVX2/AVX512 also have a custom erf implementation, which uses Sleef to compute exp.

We observed a throughput increase of 25% when tested on tensors containing 1M elements.

Before:
f32 erf: 3175.977us

After:
f32 erf: 2539.446us
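A rough way to reproduce the shape of this measurement; the PR's numbers come from the internal operator benchmark, not this snippet:

```
import time
import torch

x = torch.randn(1_000_000)   # 1M float32 elements, as in the PR
for _ in range(10):          # warmup
    torch.erf(x)
iters = 100
t0 = time.perf_counter()
for _ in range(iters):
    torch.erf(x)
print(f"f32 erf: {(time.perf_counter() - t0) / iters * 1e6:.3f}us")
```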

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85522651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01
2025-10-27 14:55:38 +00:00
Bin Bao
7ce723d21c [AOTI] Remove c10 as linked library (#165489)
Summary: AOTI compilation no longer depends on c10. It should only depend on C shim symbols, which live in libtorch_cpu or libtorch_cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165489
Approved by: https://github.com/yushangdi
2025-10-27 13:53:44 +00:00
PyTorch UpdateBot
4295a9a158 [xla hash update] update the pinned xla hash (#165895)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165895
Approved by: https://github.com/pytorchbot
2025-10-27 11:47:29 +00:00
PyTorch UpdateBot
90d7be35e9 Update slow tests (#165894)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165894
Approved by: https://github.com/pytorchbot
2025-10-27 11:42:14 +00:00
Oguz Ulgen
8d4e48831e Remove JITFunction constexpr and some arg_names (#166280)
https://github.com/triton-lang/triton/pull/8536 breaks torch.compile integration. This PR attempts to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166280
Approved by: https://github.com/jansel
2025-10-27 09:29:03 +00:00