Commit Graph

49710 Commits

Author SHA1 Message Date
drisspg
987314aa96 Split batch-num-heads grid dim between y and z (#157745)
for #157018

doesn't totally fix the problem but should help alot

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157745
Approved by: https://github.com/Chillee
2025-07-08 05:17:43 +00:00
cyy
3ee8828c87 [1/N] Don't use CUDA.cmake module (#157188)
Small changes before removing CUDA.cmake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157188
Approved by: https://github.com/ezyang
2025-07-08 03:05:35 +00:00
zpcore
12f9942b10 Fix slice op redistribute_cost compute (#157178)
For slice op backward, my understanding is that the `redistribute_cost` attribute is incorrectly assigned to previous placement strategy: 0decd966af/torch/distributed/tensor/_ops/_tensor_ops.py (L399-L400)

The mistake is hard to be tested since we didn't enforce the `redistribute_cost` for `strategy.strategies` with size one: 2815ade9a8/torch/distributed/tensor/_sharding_prop.py (L491-L499)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157178
Approved by: https://github.com/XilunWu
2025-07-08 02:28:59 +00:00
Ke Wen
c5589074e6 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-08 01:21:59 +00:00
PyTorch MergeBot
19a01382bc Revert "[SymmMem] find_path does not search /usr/local/lib (#157695)"
This reverts commit 3effe0c293.

Reverted https://github.com/pytorch/pytorch/pull/157695 on behalf of https://github.com/kwen2501 due to Changing it to be landable on 2.8 branch ([comment](https://github.com/pytorch/pytorch/pull/157695#issuecomment-3047020152))
2025-07-08 01:12:01 +00:00
zeshengzong
df72078fe1 [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/torch.py (#157344)
Fixes part of #147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157344
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-07-08 00:46:56 +00:00
Bob Ren
60b41de0ca remove allow-untyped-defs from torch/ao/nn/quantized/modules/rnn.py (#157234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157234
Approved by: https://github.com/jingsh
ghstack dependencies: #157231, #157232
2025-07-08 00:11:52 +00:00
Bob Ren
e38a335d7f remove allow-untyped-defs from torch/backends/cusparselt/__init__.py (#157232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157232
Approved by: https://github.com/jingsh
ghstack dependencies: #157231
2025-07-08 00:11:52 +00:00
Bob Ren
9d8cf24b3b remove allow-untyped-defs from torch/_classes.py (#157231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157231
Approved by: https://github.com/jingsh
2025-07-08 00:11:52 +00:00
James Wu
be56a8d7ac Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 23:57:17 +00:00
Ke Wen
3effe0c293 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-07 23:16:45 +00:00
IvanKobzarev
2fde2090d0 [inductor_collectives] Make reorder_collectives_preserve_peak pass grouping nodes (#157706)
Differential Revision: [D77861765](https://our.internmc.facebook.com/intern/diff/D77861765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157706
Approved by: https://github.com/wconstab
2025-07-07 23:13:58 +00:00
rzou
5d8d126249 Fix einops x torch.compile interaction (#157600)
Fixes https://github.com/pytorch/pytorch/issues/157451

If/when einops releases a version greater than 0.8.1, it will just break
(without this patch).

The history is:
- Between 2.6 and 2.7, we tried to delete the einops import (#142847)
- That didn't work so well, so we applied a hotfix in 2.7.1. (#153925)
- The hotfix wasn't completely correct (0.8.1 is the latest version of
  einops, so the condition in the hotfix just always evaluates to True!)
- It turns out we didn't need to delete the einops import. We already
  do not eagerly import einops.
- I reverted the code back to the state it was in in 2.6.
  https://github.com/pytorch/pytorch/blob/release/2.6/torch/_dynamo/decorators.py

Test Plan:
- We have testing in CI for einops 0.6.1, 0.7.0, and 0.8.1. Wait for CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157600
Approved by: https://github.com/guilhermeleobas, https://github.com/anijain2305
ghstack dependencies: #157416
2025-07-07 23:04:02 +00:00
Gabriel Ferns
7e83d50845 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-07-07 22:13:34 +00:00
Shangdi Yu
6f05d58f2b [AOTI] Split aoti_runtime/model.h to prepare for model static linking (#157592)
Summary:
Prepare for https://github.com/pytorch/pytorch/pull/157129.

We split the file so we can re-use `model.h` part for codegen a separate header for each model in static linkage.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77761249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157592
Approved by: https://github.com/desertfire
2025-07-07 22:13:22 +00:00
lucasb-eyer
a7eb153bba [MemoryViz] Add file selector button (#157647)
In some linux desktop environments like mine, there is no drag and dropping of files. Which made the memoryviz impossible for me to use. So this adds a file selector button as an alternative. Tested that it works locally, and also works with multiple files.

![image](https://github.com/user-attachments/assets/dcb61d68-6c6f-42f6-a075-1783d747d1b0)

And the button remains when something is loaded, to allow loading something else, but it moves out of the way to save vertical space:

![image](https://github.com/user-attachments/assets/4239d13c-3d80-4790-9696-0906c75e14e6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157647
Approved by: https://github.com/sraikund16
2025-07-07 22:03:51 +00:00
Zeina Migeed
ed6df0e324 correctly import torch.version (#157584)
The structure is

```
torch/
  __init__.py
  version.py
```

When we import torch, only `torch/__init__.py` is executed by default.

The submodules like `version.py` are not automatically imported or attached to the torch module.

So without anything in `__init__.py`, `torch.version` may not be found. So in this PR, we make the import explicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
2025-07-07 21:43:35 +00:00
Ankita George
5c79a55e7e [oss] Add version to metadata (#155343)
Summary: We want to add versioning to DCP to the metadata so that whenever planner logic changes, we can use the version on save to determine how to load the data

Test Plan:
added a test

Rollback Plan:

Differential Revision: D76135887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
2025-07-07 20:57:30 +00:00
Olga Gerasimova
2efa5eaa65 swa avoid stream sync (#157705)
Summary:
When AveragedModel updates_parameters it calls self.n_averaged == 0 for each parameter, where n_averated is a buffer on GPU. Moving check before the cycle to call sync once

It improves update_parameter from 74ms to 57ms ~22% improvement
{F1980011097}
{F1980011111}

Test Plan:
CI

Rollback Plan:

Differential Revision: D77723025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157705
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/janeyx99
2025-07-07 20:47:35 +00:00
zpcore
c2510fcd86 Fix index_put propagate strategy arg unpack error (#157671)
Fix `index_put` propagate strategy didn't consider optional arg `accumulate`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157671
Approved by: https://github.com/fmassa, https://github.com/wconstab
2025-07-07 20:18:18 +00:00
Kurt Mohler
510c398a4f Add max_pool3d backward pass for MPS (#157498)
Note on backward precision over fp16:

A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024)  \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))

Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.175 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024)  \exp2(-3) \approx 0.002$.

The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024)  \exp2(-2) \approx 0.002$. Same for $e_b=-1$.

So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9  * 0.002$.

That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
2025-07-07 19:46:44 +00:00
fduwjj
63a96eaeb8 [DeviceMesh] Add error when users try to slice non contiguous flattened dim submesh (#157523)
With https://github.com/pytorch/pytorch/issues/157393, we want to first throw a clearer error for users and then fix it in the long-term

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157523
Approved by: https://github.com/fegin
ghstack dependencies: #157501
2025-07-07 19:43:51 +00:00
fduwjj
2b8d3b1b2b [DeviceMesh] Use user set backend and pg option even for the global mesh (#157501)
Short term solution to https://github.com/pytorch/pytorch/issues/156593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157501
Approved by: https://github.com/fegin, https://github.com/lw
2025-07-07 19:43:51 +00:00
Abhishek Nandy
bf1ebe0531 Fix typo: 'paramter' → 'parameter' in dynamo variable comment (#157651)
This PR fixes a minor typo in a comment in `torch/_dynamo/variables/torch.py`, changing 'paramter' to the correct spelling 'parameter'.

These small but meaningful changes help improve code readability and maintain the overall quality of the codebase.

Thanks for your time and review!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157651
Approved by: https://github.com/Skylion007
2025-07-07 19:42:44 +00:00
Sam Larsen
433a247102 [logging] [redo] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156840)
Summary: This is a redo of https://github.com/pytorch/pytorch/pull/156517, but with pt2_compile_events logging disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156840
Approved by: https://github.com/jamesjwu
2025-07-07 19:09:48 +00:00
yifanmao
9e5f4a844c [FSDP2] Fix issue with set_reduce_scatter_divide_factor errors and MixedPrecisionPolicy (#155964)
fix https://github.com/pytorch/pytorch/issues/155223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155964
Approved by: https://github.com/weifengpy
2025-07-07 17:09:29 +00:00
cyy
7c1f627828 Fix 'dllimport attribute ignored on inline function' (#157670)
There are lots of warnings in builds:
```
 2025-07-05T16:59:46.9208806Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5043,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209030Z  5043 | inline at::Tensor & Tensor::less_(const at::Scalar & other) const {
2025-07-05T16:59:46.9209104Z       |                             ^
2025-07-05T16:59:46.9209671Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5048,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209860Z  5048 | inline at::Tensor & Tensor::less_(const at::Tensor & other) const
```
This PR has fixed them and turned the warning into an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157670
Approved by: https://github.com/albanD
2025-07-07 16:57:48 +00:00
PyTorch MergeBot
ae1094b72b Revert "[WIP] Automatically load and save dynamo entries via caching_precompile (#155913)"
This reverts commit e466dab164.

Reverted https://github.com/pytorch/pytorch/pull/155913 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/155913#issuecomment-3045914878))
2025-07-07 16:53:35 +00:00
Guilherme Leobas
eda0a9cc90 [list] Add list.__delitem__ (#156339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156339
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270, #156271
2025-07-07 14:51:32 +00:00
Guilherme Leobas
d74ccf4ffe [list] Add list.__mul__ and list.__imul__ (#156271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156271
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270
2025-07-07 14:51:32 +00:00
Guilherme Leobas
689fba032d Implement list.__add__ and list.__iadd__ (#156270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156270
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242
2025-07-07 14:51:25 +00:00
Guilherme Leobas
c1d69d5dd5 [list] Implement list.remove (#156242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156242
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148
2025-07-07 14:51:17 +00:00
Guilherme Leobas
e49acfc5c5 [list] Raise exception in invalid list method call (#156148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156148
Approved by: https://github.com/zou3519
ghstack dependencies: #153969
2025-07-07 14:51:10 +00:00
Guilherme Leobas
034e996d37 [list] Implement list.count (#153969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153969
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2025-07-07 14:51:03 +00:00
Henry Tsang
54a4d34d10 [fbcode] switch to cutlass-4 (#157579)
Summary: Update cutlass version to 4. For most use cases.

Test Plan:
testing in progress

Rollback Plan:

Differential Revision: D77605011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157579
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 14:12:33 +00:00
James Wu
e466dab164 [WIP] Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 11:56:30 +00:00
haozhe.zhu
815545f2dd [inductor] enable bf32 for mkldnn linear pointwise/binary in inductor (#127294)
When `torch.backends.mkldnn.matmul.fp32_precision=='bf16'`, we also enabled mkldnn linear in inductor path and allow to run with bf16 computation data type.

Testplan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32
python test/inductor/test_mkldnn_pattern_matcher.py -k test_multi_linear_share_same_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127294
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-07 06:03:41 +00:00
Lei
2022588295 Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying (#154369)
Fixes #151223

Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.

Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded to maintain the expected 1‑D shape for writeback operations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
2025-07-06 09:20:31 +00:00
Sv. Lockal
7cda4017dd Fix torch.utils.cpp_extension parser for clang version 20.1.7+libcxx (#157666)
When CC and CXX compiler is set to clang, and clang was compiled with libc++, compilation of torchvision fails with:

```
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions
    compiler_name, compiler_version = self._check_abi()
                                      ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi
    _, version = get_compiler_abi_compatibility_and_version(compiler)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version
    if tuple(map(int, version)) >= minimum_required_version:
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '7+libcxx'
```

Compiler identification is a valid semantic version:
```
$ clang -dumpfullversion -dumpversion
20.1.7+libcxx
```

After adjusting parser of version, clang is able to compile extensions successfully.

Fixes #157665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157666
Approved by: https://github.com/msaroufim
2025-07-06 01:35:00 +00:00
Sandeep Narendranath Karjala
9be5860bc3 [dynamo] Fix dynamic shapes handling in after_aot repro generation (#157136)
Summary:
- Extract symbolic variables directly from graph placeholders and arguments
- Add symbolic variable definitions to generated repro code
- Add unit tests with ToyModel for testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157136
Approved by: https://github.com/xmfan
ghstack dependencies: #157021
2025-07-05 15:38:41 +00:00
bobrenjc93
db00e1699a [pc] introduce ProgressiveCompilationState and clear callback (#157619)
followup from https://github.com/pytorch/pytorch/pull/157305 where
@aorenste correctly suggested clearing callback. this refactor
introduces a new dataclass so we don't need to check nullability for
each field

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157619
Approved by: https://github.com/aorenste
ghstack dependencies: #157305, #157614
2025-07-05 07:55:11 +00:00
bobrenjc93
5ea832e5f6 [pc] migrate progression futures from list to deque (#157614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157614
Approved by: https://github.com/aorenste
ghstack dependencies: #157305
2025-07-05 07:55:03 +00:00
Nikita Shulga
a952956d05 Add isnan exit condition to special ops (#157464)
They might have been slow on CUDA-11.3, but this version of CUDA is long gone. More fundamental underlying issue were linear complexity of the recursive polynomial definitions for higher order polynomials, for example see this loop from implementation of Chebyshev polynomial of the first kind
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which were tested by `test_compare_cpu` using following values (as sample index 16)
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)

Luckily chebyshev polynomials for absolute values higher than 1 pretty quickly reach infinity, see below
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
Which is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7

Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
2025-07-05 04:19:50 +00:00
yewentao256
f7127b9b94 [Refactor] Remove unused variables (#157654)
Fixes #157653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157654
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-05 02:12:15 +00:00
Ke Wen
8a8fac1131 [SymmMem] Move code to where it is used (#157611)
`maybe_initialize_env_vars` and `initialize_nvshmem_with_store` are only used in `NVSHMEMSymmetricMemory.cu`. Moving them there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157611
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-04 23:37:49 +00:00
Tom Ritchford
9968edd002 Fix #153942 (#153943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153943
Approved by: https://github.com/malfet
2025-07-04 18:25:18 +00:00
zhxchen17
7be862ab8f [dynamo] Relax DUPLICATED_INPUT to be serializable. (#157492)
Since we don't actually rely on any real data while building DUPLICATE_INPUT guard, we can safely serialize it with sources and it should be able to reconstruct the guard correctly in the new process. Therefore we don't really need to prevent serializing it.

Differential Revision: [D77683302](https://our.internmc.facebook.com/intern/diff/D77683302/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157492
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-07-04 15:19:34 +00:00
William Wen
52e4e41cbc [dynamo] do not issue lru_cache warning for functions in the top-level torch namespace (#157598)
`lru_cache` usage warning was being raised for `torch.get_device_module()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157598
Approved by: https://github.com/Sidharth123-cpu
2025-07-04 08:17:50 +00:00
Jason Ansel
64f2ec77f8 [inductor] Fix fractional_max_pool2d 3D input causing assertion error (#156912)
Fixes #156682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156912
Approved by: https://github.com/angelayi
2025-07-04 06:09:28 +00:00
Laith Sakka
fdc5b42a8f _broadcast_shapes gso generalizations (#157008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157008
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #155590
2025-07-04 05:56:42 +00:00