This PR enables the misc-XX checks in clang-tidy, excluding those that would require extensive code changes for no immediate benefit. It also adds some further fixes and suppressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110283
Approved by: https://github.com/albanD
Summary: When the grid is computed in terms of unbacked `SymInt`s, it can end up zero-sized. This causes a CUDA error on `cuLaunchKernel` in the AOT Inductor codegen.
In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` call in AOT Inductor's C++ wrapper codegen to make sure that the grid is not zero-sized.
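A rough sketch of the idea in Python, for illustration only. `emit_launch`, its parameters, and the argument list passed to `launchKernel` are hypothetical and do not mirror the actual wrapper codegen. The sketch only shows the shape of the guard that gets emitted when the grid contains unbacked symbols.

```python
def emit_launch(grid, call_args, has_unbacked_symbols):
    """Return C++ wrapper lines for a kernel launch (hypothetical helper).

    grid: three strings holding the C++ grid expressions.
    call_args: strings for the leading launch arguments (made up here).
    """
    gx, gy, gz = grid
    launch = f"launchKernel({', '.join(call_args)}, {gx}, {gy}, {gz});"
    if not has_unbacked_symbols:
        # Grid is known to be non-empty, so no runtime guard is emitted.
        return [launch]
    # Guard the launch so a zero-size grid never reaches the launch call.
    return [
        f"if ({gx} > 0 && {gy} > 0 && {gz} > 0) {{",
        f"    {launch}",
        "}",
    ]


lines = emit_launch(("i0", "1", "1"), ["kernel", "stream"], has_unbacked_symbols=True)
print("\n".join(lines))
```

With static grids the helper returns the bare launch line, so the guard only shows up where the grid size is not known ahead of time.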
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312
Approved by: https://github.com/chenyang78
# Summary
Logging Mode is great, and helped me identify that we are doing an unnecessary slice sometimes.
### Numbers
For small sizes, e.g. (16, 16, 32, 32), this brings the timing from:
`flash_time: 29.344002110883594 micro seconds`
to
`flash_time: 26.971791498363018 micro seconds`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110324
Approved by: https://github.com/cpuhrsch
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/109893, which has an issue with CPU support, as reported in https://github.com/pytorch/pytorch/issues/109897. This fix includes 2 main changes:
- The current implementation of `rename_indexing` at 10c646295d/torch/_inductor/codegen/common.py (L1023) only adds symbols whose names start with `s` or `ps` to `kernel.args.sizevars`. However, unbacked `SymInt` names start with `i`, so we extend `rename_indexing` to support symbols that start with `i`.
- Currently, the internal loop index names also start with `i`. Since `i` is now used for unbacked `SymInt`s, change these names to start with `x`, which should align with Triton.
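The prefix rule from the first bullet can be sketched as below. This is a simplified illustration that works on plain strings; the real `rename_indexing` operates on symbolic expressions, and `SIZE_VAR_PREFIXES`, `is_size_var`, and `collect_size_vars` are hypothetical names.

```python
# Prefixes treated as size variables. "i" is the newly supported prefix
# for unbacked SymInts; "s" and "ps" were already handled.
SIZE_VAR_PREFIXES = ("s", "ps", "i")


def is_size_var(name: str) -> bool:
    """True if the symbol name should be registered as a size variable."""
    return name.startswith(SIZE_VAR_PREFIXES)


def collect_size_vars(symbol_names):
    """Keep only the symbols that qualify as size variables, in order."""
    return [name for name in symbol_names if is_size_var(name)]


# Loop indices now start with "x", so they no longer collide with "i".
print(collect_size_vars(["s0", "ps1", "i2", "x0"]))
```

Renaming the loop indices to `x` is what makes the prefix test unambiguous: without it, an `i`-prefixed loop index would be picked up as a size variable.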
**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
Summary: This diff fixes a heap use-after-free (UAF) found by fuzzing in torch/csrc/jit/mobile/interpreter.cpp.
Test Plan:
CI and
```
arc lionhead crash reproduce 1009060456885023
```
doesn't crash anymore.
Reviewed By: malfet
Differential Revision: D49538326
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110289
Approved by: https://github.com/malfet
# Summary
### <samp>🤖 Generated by Copilot at 318764f</samp>
This pull request implements the CUDA backend of the SDPA kernel for nested tensors, which enables efficient transformer models with variable-length sequences. It adds a new dispatch key, a backward function, a unit test, and some helper functions for the kernel. It modifies `test/test_transformers.py`, `aten/src/ATen/native/native_functions.yaml`, `aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctionsBackward.cpp`, and `aten/src/ATen/native/nested/cuda/NestedTensorTransformerUtils.h`.
### <samp>🤖 Generated by Copilot at ed4a773</samp>
> _Fused kernels of doom, unleash the flash attention_
> _Nested tensors on fire, reshape and pad with caution_
> _Backward pass of power, dispatch the CUDA key_
> _Test the gradients of hell, warn the user if they disagree_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97485
Approved by: https://github.com/jbschlosser
An `ExportedProgram`'s `__call__` signature is different from the original module, so `dynamic_shapes` that follow the original signature would fail when applied to re-export an `ExportedProgram`.
This PR fixes the issue: the original `dynamic_shapes` should now work when re-exporting.
Differential Revision: D49764011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110276
Approved by: https://github.com/tugsbayasgalan
`generate_output` may return non-list/tuple outputs. Let's force
those to be lists, because we will enumerate `kernel.outputs`
later in the codegen.
Also fixed a minor issue in an assertion message.
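A minimal sketch of that normalization, assuming a single non-sequence output is simply wrapped in a one-element list. `as_output_list` is a hypothetical helper name, and the actual change may differ in detail.

```python
def as_output_list(outputs):
    """Normalize kernel outputs to a list so they can be enumerated."""
    if isinstance(outputs, (list, tuple)):
        return list(outputs)
    # A single output object: wrap it so enumerate() sees one entry.
    return [outputs]


for idx, out in enumerate(as_output_list("buf0")):
    print(idx, out)
```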
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110145
Approved by: https://github.com/aakhundov
Caused by #109866.
The test registers a new device module. The above PR checks for xpu, sees that it got registered, and uses it, but it's a dummy module.
This causes every test after it to fail, so I "clean up" the registered module.
Another possible solution would be to run this test last lol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110254
Approved by: https://github.com/huydhn
Generating reference outputs sometimes fails because of type mismatches in the graph,
an issue which was noticed previously for `prims.convert_element_type` and fixed in #92036,
but the same issue happens with other functions such as tensor constructors.
This expands the fix from #92036 to all dtype keyword arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110232
Approved by: https://github.com/ezyang
Fix a bug where the historical correlations heuristic sorts tests in the opposite order, ranking the least relevant tests most highly.
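The intended ordering, as a small sketch. `rank_tests` and the ratings dict are hypothetical; the only point is that the highest-rated tests should come first, which corresponds to `reverse=True` with `sorted`.

```python
def rank_tests(ratings):
    """Return test names ordered from highest to lowest rating."""
    # Without reverse=True, sorted() is ascending and would put the
    # least relevant tests first.
    return sorted(ratings, key=ratings.get, reverse=True)


print(rank_tests({"test_a": 0.1, "test_b": 0.9, "test_c": 0.5}))
```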
### <samp>🤖 Generated by Copilot at 70333d1</samp>
> _`test_files` sorted_
> _by ratings, high to low_
> _a faster spring test_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110257
Approved by: https://github.com/clee2000
Removes the functionality from the nvfuser Python APIs.
Since nvfuser was deprecated before the last release cut, we are removing TorchScript support.
The next PR will actually remove the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110124
Approved by: https://github.com/davidberard98
Summary:
Also added annotation support for conv1d_relu and conv1d in XNNPACKQuantizer. The quantized results still
match the fx quant path (didn't quantize conv1d), so tests are not disabled.
Test Plan: with-proxy buck2 run executorch/examples/quantization:example -- -m=w2l --verify
Differential Revision: D49479546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109830
Approved by: https://github.com/kimishpatel
Summary:
Add a test to make sure we can call the quantize API multiple times.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_reentrant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110125
Approved by: https://github.com/kimishpatel
ghstack dependencies: #110097
Opaque pointer support is disabled in LLVM 14 and enabled by default in LLVM 15 and above.
The `setOpaquePointers` API is deprecated as of LLVM 16, so its usage is removed.
Updated the `CreateMalloc` and `CreateFree` API calls for the latest LLVM release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110200
Approved by: https://github.com/Skylion007