Commit Graph

92627 Commits

rzou
70d36e047d Making batching rule for F.embedding DTensor-aware (#162117)
`vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's
batching rule generates a new tensor via at::arange; at::arange produces
a regular tensor, and DTensor rightfully errors on mixed
DTensor/regular-Tensor operations.

This PR fixes the problem by activating DTensor implicit replication on
just the at::arange and the subsequent add operation.

To accomplish this, I moved the DTensor implicit replication flag
to C++ (most batching rules are in C++).
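
A hedged repro sketch of the failing pattern, assuming a single-rank gloo setup; the DTensor names are from the public `torch.distributed.tensor` API, everything else is illustrative:

```
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))

# batch of 5 embedding tables and index sets, both replicated DTensors
weight = distribute_tensor(torch.randn(5, 10, 4), mesh, [Replicate()])
idx = distribute_tensor(torch.randint(0, 10, (5, 3)), mesh, [Replicate()])

# Previously errored: the batching rule's at::arange produced a plain tensor,
# which DTensor refuses to mix with DTensor operands.  With this PR, implicit
# replication is enabled around just that arange + add, so this succeeds.
out = torch.vmap(F.embedding)(idx, weight)

dist.destroy_process_group()
```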

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162117
Approved by: https://github.com/bdhirsh
2025-09-05 21:40:14 +00:00
Nikita Shulga
a00cdc1e41 [CD][BE] Get rid of SETUPTOOLS and PYYAML extra pins (#162266)
Those weren't really pins to begin with, and requirements.txt
already has them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162266
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162263, #162264
2025-09-05 21:32:52 +00:00
Shunzhi Wen
c10195e723 [C10d][Gloo] Enable complex datatype support in ProcessGroupGloo (#156633)
- Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it (see the sketch below).
- Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo.
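
A minimal sketch of what this enables, assuming a single-rank gloo group for illustration:

```
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.full((4,), 1 + 2j, dtype=torch.complex64)
# Complex tensors are reduced as pairs of real values, mirroring the
# ProcessGroupNCCL handling this change ports over.
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(t)

dist.destroy_process_group()
```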

Fixes #156632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633
Approved by: https://github.com/d4l3k
2025-09-05 21:24:36 +00:00
Boyuan Feng
771f369448 [Inductor] Improve RoPE (#161420)
This PR fuses RoPE from two kernels into one.

Shape:
```
q: [B, Hq, S, D]
k: [B, Hkv, S, D]
```

`Hq=32, Hkv=8, D=128`, following the Llama3 setting.

<img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161420
Approved by: https://github.com/shunting314
2025-09-05 20:55:20 +00:00
henrylhtsang
92a43025e0 [cutlass backend] Add FP8 tests for multiple linears (#160782)
Adding a test that is closer to a real use case. Thanks @mlazos for fixing a few issues so this test works for most cases.

We still have to skip the AOTI and dynamic case due to accuracy issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160782
Approved by: https://github.com/mlazos
2025-09-05 20:23:25 +00:00
Xuehai Pan
2fa0520a64 [BE][pytree] cleanup parameterized pytree tests (#160842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160842
Approved by: https://github.com/Skylion007
2025-09-05 20:15:29 +00:00
Edward Z. Yang
01edcd4df8 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-05 20:15:11 +00:00
Edward Yang
de893e96c7 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-05 20:15:11 +00:00
Nikita Shulga
6087ef41e5 [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-05 19:59:51 +00:00
Nikita Shulga
a3c7f77e50 [EZ][CD] Update MacOS deployment platform to 11.0 (#162264)
Fixes following warning
```
MACOSX_DEPLOYMENT_TARGET is set to a lower value (10.15) than the version on which the Python interpreter was compiled (11.0)
```
Update deployment platform in `README.MD` as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162264
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162263
2025-09-05 19:58:04 +00:00
Justin Chu
3771380f83 [ONNX] Hide draft export under a flag (#162225)
Use `TORCH_ONNX_ENABLE_DRAFT_EXPORT` to control whether draft_export should be used as a strategy in onnx export.

Follow up of https://github.com/pytorch/pytorch/pull/161454
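
A hedged sketch of opting in; the environment variable name comes from this PR, while the export call is a generic `dynamo=True` export:

```
import os
os.environ["TORCH_ONNX_ENABLE_DRAFT_EXPORT"] = "1"  # off by default after this PR

import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

# If the normal export strategies fail, draft_export may now be tried
# only because the flag above is set.
onnx_program = torch.onnx.export(M(), (torch.randn(2, 3),), dynamo=True)
```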

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162225
Approved by: https://github.com/xadupre, https://github.com/titaiwangms
2025-09-05 19:54:50 +00:00
PyTorch MergeBot
adae7f66aa Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit c37103234a.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
PyTorch MergeBot
70f865ac9b Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit ef3be6726f.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
Scott Wolchok
88d94d17e8 Add torch.Tensor._make_dtensor to accelerate DTensor.__new__ further (#161590)
This seems to be a (very roughly) ~8% improvement on a DTensor benchmark very similar to the one from #160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161590
Approved by: https://github.com/albanD
ghstack dependencies: #161466, #161586
2025-09-05 18:43:41 +00:00
Ruben Rodriguez Buchillon
c321111499 [inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348)
# why

- every callsite just executes the generator on the spot
- the previous PR adds the ability to add an override before expensive
  generators are executed, so we don't need this generator anymore

# what

- rather than yielding each ChoiceCaller, just return the list of all
  valid ChoiceCallers (see the sketch below)
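
A hedged sketch of the interface change, with simplified, hypothetical stand-ins for inductor's real types and signatures:

```
from typing import Any, Iterator, List

class ChoiceCaller:  # stand-in for inductor's ChoiceCaller
    def __init__(self, template: Any) -> None:
        self.template = template

# before: a generator that every callsite immediately exhausted anyway
def get_mm_configs_old(templates: List[Any]) -> Iterator[ChoiceCaller]:
    for t in templates:
        yield ChoiceCaller(t)

# after: return the fully materialized list of valid ChoiceCallers, so a
# single override point can replace the whole set up front
def get_mm_configs(templates: List[Any]) -> List[ChoiceCaller]:
    return [ChoiceCaller(t) for t in templates]
```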

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346, #161347
2025-09-05 18:02:53 +00:00
Ruben Rodriguez Buchillon
9a8d454c46 [inductor] add kernel template choice (ktc) (#161347)
# why

- gather everything up to make choices, without running
  potentially expensive generators
- enables overrides where we toss the entire list of configs
  from inductor, without having to enumerate it (expensive)

# what

- add a holding class that just gets all the components necessary
  to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just sets
  the stage (see the sketch below)
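
A hedged sketch of the "kernel template choice" idea: hold everything needed to build a ChoiceCaller, but defer the potentially expensive generation. All names here are illustrative, not the actual inductor classes.

```
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class KernelTemplateChoice:
    template: Any
    input_nodes: List[Any]
    layout: Any
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def to_choice_caller(self):
        # expensive generation happens only when actually requested
        return self.template.generate(self.input_nodes, self.layout, **self.kwargs)

def materialize(ktcs, overrides=None):
    # an override can discard or reorder the whole list before any
    # ChoiceCaller is generated
    if overrides is not None:
        ktcs = overrides(ktcs)
    return [ktc.to_choice_caller() for ktc in ktcs]
```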

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346
2025-09-05 18:02:53 +00:00
Ruben Rodriguez Buchillon
e02e9edb55 [inductor] V.choice.get_mm_configs takes a stack of templates (#161346)
# why

- enables us to just gather relevant templates and get all
  choices at once
- that in turns allows us to make op wide override decisions

# what

- V.choice.get_mm_configs takes a stack of templates
- all callsites just provide a stack of size 1 for now;
  we do not merge everything yet (other features pending)

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520583](https://our.internmc.facebook.com/intern/diff/D81520583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161346
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345
2025-09-05 18:02:46 +00:00
Ruben Rodriguez Buchillon
d63ad53a99 [inductor][ez] return choicecallers directly (#161345)
# why

- remove repeat patterns
- we have everything to make the choicecallers
  - templates
  - input_nodes
  - layouts
  - all the kwargs

# what

- yield a choicecaller directly from V.choices.get_mm_configs

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520577](https://our.internmc.facebook.com/intern/diff/D81520577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161345
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344
2025-09-05 18:02:38 +00:00
Ruben Rodriguez Buchillon
031d79cb51 [inductor] move max-autotune logic inside V.choices.get_mm_configs (#161344)
# why

- heuristics providers now decide whether to add choices (and which ones)
  in the max-autotune case
- enables an eventual override point to gracefully fallback to the
  standard behavior

# what

- max-autotune is determined inside V.choices.get_mm_configs;
  because it's mm-only right now, we can just check
  `config.max_autotune or config.max_autotune_gemm`.
  A TODO indicates that this can change in the future when this
  expands to more templates

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520573](https://our.internmc.facebook.com/intern/diff/D81520573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161344
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343
2025-09-05 18:02:30 +00:00
Ruben Rodriguez Buchillon
a301dc3b60 [inductor][ez] pass template rather than template.uid (#161343)
# why

- simpler interface
- enables future of extracting more things out of the template e.g. a
  hash

# what

V.choices.get_mm_configs now takes the whole template rather than just
the template.uid

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520576](https://our.internmc.facebook.com/intern/diff/D81520576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161343
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342
2025-09-05 18:02:22 +00:00
Ruben Rodriguez Buchillon
af590cb729 [inductor][aten] treat aten like a template in GEMMs (#161342)
# why

- central point to analyze and override all generated choices

# what

- add a pseudo heuristic for aten that just yields a single choice with
  empty kwargs (sketched below)
- add a pseudo heuristic carrying the bias_addmm logic
- add an addmm-specific heuristic that yields a single choice, but
  also expands it with alpha and beta kwargs

- replace all the aten.bind calls with V.choices.get_mm_configs
  using the now matching API for aten
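
A hedged sketch of the pseudo-heuristic shape (illustrative function names; the real ones live in inductor's template-heuristics registry):

```
from typing import Any, Dict, Iterator

def aten_mm_heuristic(kernel_inputs: Any) -> Iterator[Dict[str, Any]]:
    # plain aten has no tuning knobs: a single choice with empty kwargs
    yield {}

def aten_addmm_heuristic(kernel_inputs: Any, alpha: float = 1.0,
                         beta: float = 1.0) -> Iterator[Dict[str, Any]]:
    # addmm expands the single choice with alpha/beta kwargs
    yield {"alpha": alpha, "beta": beta}
```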

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520580](https://our.internmc.facebook.com/intern/diff/D81520580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161342
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341
2025-09-05 18:02:10 +00:00
Ruben Rodriguez Buchillon
4902c76c65 [inductor][ez] add template/externchoice uid (#161341)
# why

- to have a central registry of templates/externkernelchoice
  to match them to heuristics etc, they need unique names
- mm is both the triton template name and the aten_mm name

# what

- add a uid() to KernelTemplate/ExternKernelChoice that returns the name
- override in ExternKernel to prepend "aten::"
- override in TritonTemplate to prepend "triton::"

This id is just used to find template heuristics, so it has no other
impact.
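
A hedged sketch of the naming scheme, with simplified stand-ins for the real classes:

```
class KernelTemplate:
    name = "mm"

    def uid(self) -> str:
        return self.name

class ExternKernelChoice(KernelTemplate):
    def uid(self) -> str:
        return f"aten::{self.name}"    # "aten::mm"

class TritonTemplate(KernelTemplate):
    def uid(self) -> str:
        return f"triton::{self.name}"  # "triton::mm" -- no longer collides
```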

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075, #161340
2025-09-05 18:01:58 +00:00
Ruben Rodriguez Buchillon
9602590b15 [inductor] move scaled_mm input nodes logic (#161340)
# why

- a step towards a unified interface for all choices, where any
  adjustment to nodes (e.g. unsqueezing) happens as part of
  choice specific preprocessing, behind a common point

# what

- move the unsqueeze logic for scaled_mm's triton nodes inside
  the new hook for adjusting the kernel inputs for template
  heuristics

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "scale"
```

Differential Revision: [D81520582](https://our.internmc.facebook.com/intern/diff/D81520582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161340
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075
2025-09-05 18:01:44 +00:00
Ruben Rodriguez Buchillon
2ef665ae19 [inductor][contiguous mm] mild refactor (#162075)
# why

- use the new heuristics logic better to handle kwargs

# what

- move all checks into the heuristics, yielding a single choice, or no
  choices if the decomposition should not be used
- fix `hip` device type, which should be `cuda`
- let heuristics handle the kwarg passing

# testing

in ci

Differential Revision: [D81706776](https://our.internmc.facebook.com/intern/diff/D81706776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162075
Approved by: https://github.com/exclamaforte, https://github.com/jansel
2025-09-05 18:01:07 +00:00
Mikayla Gawarecki
b18bb6796f Add const to stable amax (#162082)
Fixes https://github.com/pytorch/pytorch/issues/161826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162082
Approved by: https://github.com/soulitzer
2025-09-05 17:37:49 +00:00
PyTorch MergeBot
d711f27845 Revert "[ROCm] [CK] Composable Kernel integration for inductor backend (#158747)"
This reverts commit 019fed39aa.

Reverted https://github.com/pytorch/pytorch/pull/158747 on behalf of https://github.com/jithunnair-amd due to Broke linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test: 019fed39aa/1 ... PR didn't have this job run successfully due to CI outage ([comment](https://github.com/pytorch/pytorch/pull/158747#issuecomment-3259212343))
2025-09-05 17:27:45 +00:00
Nikita Shulga
261a84a176 [CD][BE] Remove unnecessary checks for XCode version (#162263)
None of them have worked for a while; PyTorch for Mac is built with
Xcode 15.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162263
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-09-05 17:02:36 +00:00
xinan.lin
98374612fc [Intel GPU] Update Intel triton commit pin to Triton 3.5.x (#161777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161777
Approved by: https://github.com/EikanWang
2025-09-05 16:55:47 +00:00
Eddie Yan
c2a3024617 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905 I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated and this would cause `float8` tests to fail. Also, it appears that `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so that check is removed here as well...
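
A hedged sketch using the private `torch._scaled_mm` API (subject to change; assumes an H100-class GPU and that mixed `e4m3`/`e5m2` inputs are accepted, as this PR un-gates):

```
import torch

M, K, N = 128, 256, 64
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# column-major second operand, e5m2 mixed with the e4m3 above
b = torch.randn(N, K, device="cuda").to(torch.float8_e5m2).t()

scale_a = torch.rand(M, 1, device="cuda")  # rowwise: one fp32 scale per row of a
scale_b = torch.rand(1, N, device="cuda")  # rowwise: one fp32 scale per col of b
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)
```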

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-05 16:55:09 +00:00
Xingyuan Li
b2c7b9ad2d [Intel GPU][FlexAttention] Enable TMA path on Intel GPU (#162138)
The existing `can_use_tma` check has some conditions that are unnecessary for Intel GPUs.
We removed these conditions on the Intel GPU path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162138
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf
2025-09-05 16:54:51 +00:00
PyTorch MergeBot
f3cebec39e Revert "Rename propagate_tensor_meta to make private again (#161744)"
This reverts commit 734ce8eba9.

Reverted https://github.com/pytorch/pytorch/pull/161744 on behalf of https://github.com/jeanschmidt due to seems to break internal tests, see D81657000 for more details ([comment](https://github.com/pytorch/pytorch/pull/161744#issuecomment-3258934519))
2025-09-05 16:20:29 +00:00
Saurabh Mishra
06da7c0730 [DCP][Quantization] Fix for FP8 multiplication during dequantization (#162202)
Summary:
The weight vector needs to be upcast since some FP8 formats (like Float8_e4m3fn) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13

We will use FP32 for the scale vector multiplication and convert to the target dtype.

Upcasting helps with the following (see the sketch after this list):

1.  **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2.  **Numerical stability**: `float32` provides more precision during intermediate calculations
3.  **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions
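
A hedged sketch of this path (illustrative names, not the actual DCP code):

```
import torch

def dequantize(weight_fp8: torch.Tensor, scale: torch.Tensor,
               target_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # Upcast to float32 first: multiplying Float8_e4m3fn directly on CPU
    # would fail, and float32 keeps the intermediate numerically stable.
    return (weight_fp8.to(torch.float32) * scale.to(torch.float32)).to(target_dtype)

w = torch.randn(4, 8).to(torch.float8_e4m3fn)  # quantized weight, on CPU
s = torch.rand(4, 1)                           # per-row scale vector
print(dequantize(w, s).dtype)                  # torch.bfloat16
```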

Test Plan:
UTs

Differential Revision: D81711093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
2025-09-05 16:06:21 +00:00
Edward Yang
2dd529df00 A basic CLAUDE.md based on bad things I see claude code doing (#162163)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162163
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-05 14:52:36 +00:00
Shunting Zhang
a714437093 [ez][inductor] add a few outer dimension reduction cases for LOAF (#162028)
For the unable-to-fuse issue reported here: https://github.com/pytorch/pytorch/issues/93718, LOAF can fuse the outer-dimension softmax into a single kernel, bringing a 1.87x speedup for the example shape mentioned in the issue.
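
A hedged sketch of the outer-dimension softmax pattern in question (reduction along the non-contiguous dim; CUDA device assumed):

```
import torch

@torch.compile
def outer_softmax(x: torch.Tensor) -> torch.Tensor:
    return torch.softmax(x, dim=0)  # reduce along the outer dimension

x = torch.randn(4096, 1024, device="cuda")
y = outer_softmax(x)  # with LOAF enabled, this lowers to a single kernel
```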

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162028
Approved by: https://github.com/jansel, https://github.com/eellison
2025-09-05 09:30:13 +00:00
atalman
bffc7dd1f3 [CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds (#161916)
Related to https://github.com/pytorch/pytorch/issues/159779

Adding CUDA 13.0 libtorch builds, followup after https://github.com/pytorch/pytorch/pull/160956
Removing CUDA 12.9 builds, See https://github.com/pytorch/pytorch/issues/159980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-09-05 07:47:54 +00:00
Zeng, Xiangdong
5c473e9f5e [1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We enable Intel GPU with the following methods, trying our best to keep the original code style (see the sketch after this list):

- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- enable XPU for some test paths
- skip some test cases which Intel GPU does not support
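
A hedged sketch of the device-agnostic pattern (the `current_accelerator` call is from the public API; the backend mapping is illustrative):

```
import torch

acc = torch.accelerator.current_accelerator()  # e.g. device("cuda") or device("xpu")
device_type = acc.type if acc is not None else "cpu"
backend = {"cuda": "nccl", "xpu": "xccl"}.get(device_type, "gloo")
```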

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159118
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-05 05:52:15 +00:00
Pian Pawakapan
5da573c42c [PGO] handle PGO profile merges (#162097)
Avoid merges from an extra PGO key if the same source has a different rank. This is unlikely to happen (it needs a code hash match and the source variable type to change), but we're being safe.

Differential Revision: D81299840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162097
Approved by: https://github.com/bobrenjc93
2025-09-05 04:58:15 +00:00
PyTorch UpdateBot
494878a11b [audio hash update] update the pinned audio hash (#162114)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162114
Approved by: https://github.com/pytorchbot
2025-09-05 04:32:16 +00:00
PyTorch UpdateBot
3bbc2e3e4f [vllm hash update] update the pinned vllm hash (#162226)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162226
Approved by: https://github.com/pytorchbot
2025-09-05 04:32:08 +00:00
Nick Riasanovsky
b67c410398 [BE] [Inductor] Add Kernel name to all coor-desc tuning (#161409)
Summary: When running coordinate descent tuning, the logging is difficult to parse if the results are parallelized at all. This change includes the kernel name in each step so post-processing can unify the results, even when run in parallel.

Test Plan:
NFC. Just a logging change.

Differential Revision: D80942794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161409
Approved by: https://github.com/PaulZhang12
2025-09-05 02:53:13 +00:00
Colin L Reliability Rice
be5b03dde9 Allow for using a dedicated binary for the torch subproc pool. (#162093)
Summary:
The binary that torch is running inside of can be larger than needed, and in
certain situations this can cause memory loss.

Test Plan:
We've manually run tests via
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_WORKER_SUPPRESS_LOGGING=0
make mc8-train-publish-cint-datafm-toy -C
minimal_viable_ai/models/ifr_mtml/main_v1/ 2>&1 | tee ~/run_out
```
and overriding the binary used to be the built fbpkg in /packages.

We've also kicked off manual runs at
```
fire-feid-20250903-1051-ae8c6827
```

Which do show the binary running -  https://fburl.com/scuba/procprint/e6lwv32m

Rollback Plan:
steps:
  - jk.update:
      jk: pytorch/compiler:subproc_worker_binary
      constant_bool: null
      consistent_pass_rate: null
      fractional_host_rollout: null
      sampling_rate: null
  - manual.note:
      content: ''

Differential Revision: D81616624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162093
Approved by: https://github.com/masnesral
2025-09-05 01:43:46 +00:00
Eddie Yan
73eb4511fb [B200][NVFP4] Fix argument passing in test_blockwise_mxfp8_nvfp4_mxfp4_numerics_ (#162185)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162185
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-05 01:24:59 +00:00
Jeffro
29280864d9 Add new parameter for gen_pyi.py to make it more configurable. (#161772)
This is a reposting of PR #128519.
This change is important to how we maintain PyTorch at Google.

From the previous PR:
"
This will make the script more flexible for the directory where it is executed.
...
We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As the input to pyi.py, genrule requires the input file to be explicitly listed out. When we fed the value of tools/autograd/deprecated.yaml to genrule, it failed to resolve, since tools/autograd is a package from blaze's perspective. Any file under a blaze package must be a proper blaze target to be accessed.
"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772
Approved by: https://github.com/albanD

Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
2025-09-05 00:48:15 +00:00
angelayi
5c67426d68 [dynamo] Add support for const prop on .item (#162204)
Fixes some of the errors in https://fb.workplace.com/groups/1028545332188949/permalink/1303030824740397/
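
A hedged sketch of the kind of pattern this enables (hypothetical example; without const prop, the `.item()` call could force a graph break):

```
import torch

@torch.compile
def f(x: torch.Tensor) -> torch.Tensor:
    scale = torch.tensor(2.0).item()  # constant-propagated instead of breaking
    return x * scale

print(f(torch.randn(3)))
```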

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162204
Approved by: https://github.com/williamwen42
2025-09-05 00:28:49 +00:00
Nikita Shulga
d2d4c8e9b2 [BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)
Followup after https://github.com/pytorch/pytorch/pull/154012

Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999
Approved by: https://github.com/drisspg
2025-09-04 23:35:27 +00:00
Eddie Yan
c7e41071a0 [B200][MXFP8] Fix regex in test_blockwise_mxfp8_nvfp4_error_messages_recipe_mxfp8_cuda (#162180)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162180
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/nWEIdia
2025-09-04 23:29:10 +00:00
xinan.lin
9499c8761c [Inductor][Intel GPU] Register triton template heuristic for addmm tma. (#162132)
Fixes #162048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162132
Approved by: https://github.com/jansel
2025-09-04 23:01:57 +00:00
Nan Zhang
3a207816cc Forward fix for user defined triton kernel grid calc (#162162)
Summary:

This change fixes the test inductor:fxir_backend - test_custom_triton_autotune_dynamic, which was broken by https://github.com/pytorch/pytorch/pull/160997

Test Plan:
inductor:fxir_backend - test_custom_triton_autotune_dynamic

Differential Revision: D81679217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162162
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-04 22:51:23 +00:00
Yiming Zhou
09be1890d7 [export] Fix torch.export.load with storage offset (#162172)
Summary: As titled

Test Plan:
CI

Differential Revision: D81687701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162172
Approved by: https://github.com/angelayi
2025-09-04 22:50:33 +00:00
Pian Pawakapan
0d84ff3b78 [PGO] log add_extra_remote PGO to tlparse (#161751)
Summary: log when additional PGO profile is merged in, from added read key

Test Plan:
test_pgo

Differential Revision: D81284190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161751
Approved by: https://github.com/bobrenjc93
2025-09-04 22:47:03 +00:00