MSVC don't support dynamic array.
Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler
We tried to solutions:
1. use std::vector to instead of it in previous PR: https://github.com/pytorch/pytorch/pull/134140, but it changed variable's type and failed at UTs.
2. Use `std::unique_ptr` to instead of it in PR: https://github.com/pytorch/pytorch/pull/134156, @jansel reviewed and give comments: https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. It is make sense, allocation memory maybe make code run slower.
3. Use fixed size array to instead of it in PR: https://github.com/pytorch/pytorch/pull/134210, fixed size is hard to process the situlation, reserved size if small than CPU number.
> a. Use min() function limited is local test failed: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729
> b. Dynamic select fixed size or dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666 . It makes code too complex to maintains.
Discussed with origin PR(https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe, we think:
1. MSVC it the only one compiler, which not support VLA.
2. MSVC it worse performance than other compilers, use `std::unique_ptr` for MSVC and make it works.
3. For other compilers, keep using current `VLA` code.
4. For Windows users, they can use `clang-cl` or `icx` to get better performance than MSVC.
5. Discussed with @jansel , we need to move compiler check to python side, and make output code cleaner.
Reproduce UT:
```cmd
pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads
```
Error msg:
```cmd
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads'
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable
```
Genarated code:
```c++
#include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h"
extern "C" __declspec(dllexport) void kernel(const float* in_ptr0,
const float* in_ptr1,
float* out_ptr0,
float* out_ptr1)
{
{
{
float tmp_acc0 = 0;
at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
int max_threads = omp_get_max_threads();
float tmp_acc0_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_arr[tid] = 0;
}
at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
}
#pragma omp parallel
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221
Approved by: https://github.com/zhuhaozhe, https://github.com/jansel
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (Fused) scheduler Node:
- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (Fused) scheduler Node:
- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
**Summary**
After enabling more vectorization, we found that vectorization does not always bring performance benefits. For example, a kernel with several non-contiguous index computations or non-contiguous buffer load/store operations can experience performance regression. A typical case is what we observed in the next PR: after fully enabling vectorization of `index_expr`, we saw a performance regression of `hf_BigBird`.
In this PR, we refactor the tiling select into a standalone module to enhance its extensibility for further advanced tiling select heuristic. A standalone class `TilingSelect` with its method `select_tiling` has been added. `select_tiling` accepts the inputs of `fn_list`, `var_sizes_list` and return `tiling_factors`, `tiling_indices`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130892
Approved by: https://github.com/jgong5
This PR fixes the accuracy issues when template_buffer has users other than the epilogue nodes. This will fix the accuracy failure of the below models using max-autotune:
- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- convnext_base
- swin_base_patch4_window7_224
## Issue 1:
Previously we always add `template_buffer` as an alias of `Y`. In case the `template_buffer` has users other than the epilogue nodes, we shouldn't set it as an alias of `Y`. This PR adds the check in such case.
Wrong code before the fix where `tmp4` and `tmp9` are both stored to `Y` while we need 2 different buffers for them since `tmp4` will be used by nodes other than the epilogue node:
```cpp
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; // tmp4 is the output of the template
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; // tmp9 is the output of the epilogue node
```
Correct code after the fix:
```cpp
out_ptr2[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4;
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9;
```
## Issue 2:
When fixing the above issue, we found that there's correctness issue when `bias` is `False`. The root cause is that in the case where `bias` is `False`, the `template_buffer` has users other than the epilogue nodes and the GEMM output buffer is localized, we need to add an extra copy epilogue to ensure that the GEMM output (a local buffer) is stored to the `template_buffer` that will be used later by other nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133073
Approved by: https://github.com/jgong5
ghstack dependencies: #133070
**Summary**
When check the vectorization status among 3 test suit, we found some operators disabled vectorization with message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support of this op.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
**Summary**
Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](096dc444ce/torch/_inductor/codegen/common.py (L1844))), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction:
- We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`.
- To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827
Approved by: https://github.com/jgong5, https://github.com/jansel
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```
We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
in theory there could also be some other casts with fused reduction -> reduction, although i havent seen this in practice as much. could be done as follow up. note: only works with cuda backend right now.
This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store" which might be removed and ignored. This would lead to inconsistent code generated between vec and scalar kernels since we generate scalar kernel first followed by the vector kernel and the store buffer might be removed by the scalar and impacts the vector kernel codegen. This PR move the caching from "store" to the "to_dtype" calls which won't be impacted by the removed buffers.
`pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py`
before
```c++
extern "C" void kernel(const bfloat16* in_ptr0,
bfloat16* out_ptr1)
{
{
for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
auto tmp1 = at::vec::convert<float>(tmp0);
auto tmp2 = tmp1 + tmp1;
auto tmp3 = at::vec::convert<bfloat16>(tmp2);
auto tmp4 = at::vec::convert<float>(tmp3);
auto tmp5 = tmp1 + tmp4;
auto tmp6 = at::vec::convert<bfloat16>(tmp5);
tmp6.store(out_ptr1 + static_cast<long>(x0), 16);
}
#pragma omp simd simdlen(8)
for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp1 = c10::convert<float>(tmp0);
auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
auto tmp3 = c10::convert<bfloat16>(tmp2);
auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
auto tmp5 = c10::convert<bfloat16>(tmp4);
out_ptr1[static_cast<long>(x0)] = tmp5;
}
}
}
```
after
```c++
extern "C" void kernel(const bfloat16* in_ptr0,
bfloat16* out_ptr1)
{
{
for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
auto tmp1 = at::vec::convert<float>(tmp0);
auto tmp2 = tmp1 + tmp1;
auto tmp3 = at::vec::convert<bfloat16>(tmp2);
auto tmp4 = tmp1 + tmp2;
auto tmp5 = at::vec::convert<bfloat16>(tmp4);
tmp5.store(out_ptr1 + static_cast<long>(x0), 16);
}
#pragma omp simd simdlen(8)
for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp1 = c10::convert<float>(tmp0);
auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
auto tmp3 = c10::convert<bfloat16>(tmp2);
auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
auto tmp5 = c10::convert<bfloat16>(tmp4);
out_ptr1[static_cast<long>(x0)] = tmp5;
}
}
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677
Approved by: https://github.com/leslie-fang-intel
**Summary**
Support more than 1 Local Buffer in an outer loop fused node and also the case when multi global buffers sharing usage of same local buffer.
**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```
**Next Step**
- [✓] Support more than one Local Buffer/Global Buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
**Summary**
When check the vectorization status among 3 test suit, we found some operators disabled vectorization with message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support of this op.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #130405
**Summary**
Support more than 1 Local Buffer in an outer loop fused node and also the case when multi global buffers sharing usage of same local buffer.
**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```
**Next Step**
- [✓] Support more than one Local Buffer/Global Buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.
This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano