For embedding lookup, there are indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.
This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors so that vector operations are only generated if needed by `CppVecKernel` when any of the inputs are vectors, otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable` which bookkeeps the relevant itervars to the variable and has a flag to mark whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.
For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0)
{
#pragma omp parallel num_threads(64)
{
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
auto tmp1 = decltype(tmp0)(tmp0 + 64);
auto tmp2 = tmp0 < 0;
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
}
}
}
}
}
```
After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0)
{
#pragma omp parallel num_threads(64)
{
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
auto tmp1 = decltype(tmp0)(tmp0 + 64);
auto tmp2 = tmp0 < 0;
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
auto tmp6 = tmp4 + tmp5;
tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
}
}
}
}
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
For embedding lookup, there are indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.
This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors so that vector operations are only generated if needed by `CppVecKernel` when any of the inputs are vectors, otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable` which bookkeeps the relevant itervars to the variable and has a flag to mark whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.
For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0)
{
#pragma omp parallel num_threads(64)
{
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
auto tmp1 = decltype(tmp0)(tmp0 + 64);
auto tmp2 = tmp0 < 0;
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
}
}
}
}
}
```
After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0)
{
#pragma omp parallel num_threads(64)
{
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
auto tmp1 = decltype(tmp0)(tmp0 + 64);
auto tmp2 = tmp0 < 0;
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
auto tmp6 = tmp4 + tmp5;
tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
}
}
}
}
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
ghstack dependencies: #113950
Fixes#97361
When fused kernel more than 1024 parameters, it should throw error from ctypes.
Limit args number is should be a mechanism to protect stack memory. As we known, CPP is passing args via stack memory, and stack memory has size limitation.
Code change:
1. cpp backend will check the fused nodes' args number, if it is reach the limitation. It will status flush status to ready.
2. scheduler will check `ready_to_flush` API and help backend flush codegen.
3. Add `ready_to_flush` API to `BaseScheduling`, Triton backend will return False due to not support it yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
Previously, floor_div operations were defined in
ATen/native/BinaryOps.h. Since this header was not included under
ABI-compat mode, trying to use those indexing operations would result in
compilation errors.
Technically, it is safe to use aten::native::floor_div_* functions in
ABI-compat mode as they are header-only; we could simply include
BinaryOps.h. However, there are other declarations in BinaryOps.h that
are not binary-compatible, so this is not ideal. Thus, I have moved those
functions into a separate file, and put them under c10/util, since they
don't really have tensor-specific logic.
c10 functions are not all header-only, so this still isn't ideal, but
this still seems like an improvement. Moreover, cpp_prefix.h -- used
when compiling cpp kernels -- already includes c10 header files, so
ABI-compatibility already depends on maintaining some c10 functions as
header-only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358
It turns out that we need to add support for optional types in order to
support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface
can't support optional<> directly, I am passing in optional types via
pointer instead.
`AtenTensorHandle`s are already pointers, so nothing needs to change
there. Only value types need to change.
We decided on this approach instead of adding an extra `bool` param to
the callee because this simplifies things. Having the same number of
arguments regardless of whether we are emitting Python / C++ /
ABI-compatible C++ makes codegen easier.
There are a number of existing ABI-compatible functions that have
optional-typed value parameters. Previously, they just assumed they
would never be passed a `nullopt` / `None` at runtime. Changing them to
use pointer types now would break ABI stability, so I have created an
exclude list for those functions.
Finally, I think the current implementation is kind of messy, and only
works for FallbackKernels, even though technically ExternKernels could
also have the same issue. It also doesn't support optional types nested
in lists. I've left FIXME comments for both issues.
Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527
Approved by: https://github.com/chenyang78, https://github.com/desertfire
`install_config_module` makes a regular module into a ConfigModule with
extra methods defined on it. mypy thinks those extra methods (or module
functions) are undefined since it cannot analyze something so
dynamic. As a workaround, I've created a fake module that defines these
extra functions, which I import into the config modules during type
checking.
As part of this change, I've also added more types to config_utils.py
and enabled typechecking for torch/_dynamo/config.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130
Approved by: https://github.com/jansel
We spend somewhere on the order 1% in `sympy.Expr.free_symbols` as it is called millions of times.
Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is
horribly slow. It turns out though that there is another propery `is_number` that does what we want.
> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.
Even further, we also avoid the overhead of building the unnecessary set object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
**Summary**
Follow up https://github.com/pytorch/pytorch/pull/109893 which has issue in support of CPU as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes:
- Current implementation of `rename_indexing`
10c646295d/torch/_inductor/codegen/common.py (L1023) only add symbol name start with `s` or `ps` into `kernel.args.sizevars`. However, `Unbacked symint` will start as `i`, so we extend the implementation of `rename_indexing` to support symbol start with `i`.
- Currently, the internal loop index also name start as `i`. Since `i` has has been used as `Unbacked symint`, change the name to start with `x` which should align with trition.
**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
Summary:
This is a prototype for running extern fallback kernels with a host side proxy executor.
Sample of generated cpp wrapper call:
```
at::Tensor buf0; // output buffer
void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```
- In my current implementation, proxy executor interprets the raw pointers according to the ops schema.
This assumes that custom op MUST have a valid schema registered to Dispatcher. (I would like to validate this assumption)
- I am using callboxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels.
- These are all the input argument types I have support so far.
union Argument {
# Bool value does not matter
1: bool asNone;
2: TensorArgument asTensor;
3: list<TensorArgument> asTensors;
5: i64 asInt;
7: list<i64> asInts;
8: double asFloat;
9: list<double> asFloats;
10: string asString;
10.5: list<string> asStrings;
11: SymIntArgument asSymInt;
12: list<SymIntArgument> asSymInts;
13: ScalarType asScalarType;
14: MemoryFormat asMemoryFormat;
15: Layout asLayout;
16: Device asDevice;
17: bool asBool;
18: list<bool> asBools;
}
- Need a policy for handling unpopulated argument with default values. Here are the options, and it has BC implications.
1. requires exported fx graph to explicitly populate default values, if users doesn't specify.
2. requires cpp wrapper to explicitly populate default values, if fx graph doesn't specify.
3. Proxy executor look up from opSchema for default values.
For fixing T162112344
Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main
test:
buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops
backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'
Reviewed By: suo
Differential Revision: D48747417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
Inductor kernel codegen previously have the following side effect:
- in `Kernel.__exit__ `, we add local used buffers in graph.removed_buffers
- during codegen, we do memory allocation/free.
These cause doing multiple versions of codegen for the same kernel hard. The PR refactor the code to make kernel codegen not changing graph level states. After codegening a kernel, the graph level state is not changed so we can go on to codegen another version of the kernel if we want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
For max_pooling code:
```
#pragma GCC ivdep
for(long i2=static_cast<long>(0L); i2<static_cast<long>(56L); i2+=static_cast<long>(1L))
{
for(long i3=static_cast<long>(0L); i3<static_cast<long>(64L); i3+=static_cast<long>(16L))
{
auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i1)));
auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
auto tmp2 = to_float_mask(tmp0 >= tmp1);
auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(112));
auto tmp4 = to_float_mask(tmp0 < tmp3);
auto tmp5 = tmp2 & tmp4;
auto tmp6 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i2)));
auto tmp7 = to_float_mask(tmp6 >= tmp1);
auto tmp8 = to_float_mask(tmp6 < tmp3);
auto tmp9 = tmp7 & tmp8;
auto tmp10 = tmp5 & tmp9;
auto tmp11 = [&]
{
auto tmp12 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>((-7232L) + i3 + (128L*i2) + (14336L*i1) + (802816L*i0)), 16);
load
auto tmp13 = cvt_lowp_fp_to_fp32<bfloat16>(tmp12);
return tmp13;
}
;
auto tmp14 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10));
```
the index of ```tmp12 ``` may be a correct index, such as ```i1=0, i2=0, i3=0```, the index is ```-7232L```, it is not a valid index. We may meet segmentation fault error when we call ```tmp11()```, the original behavior is that only the ```tmp10```(index check variable) is true, we can safely get the value, this PR will support masked_load to fixing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107670
Approved by: https://github.com/jgong5, https://github.com/jansel
This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously.
Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
## Summary
This is re-land PR for https://github.com/pytorch/pytorch/pull/100706 to address the compilation latency performance regression.
## Root Cause
Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()` to check vectorization ISA is a time-consuming and one-shot operation. It leads to taking a longer time to import `codegen.cpp` package because the `LoopLevel` of the package is decorated by `@dataclasses.dataclass` while the decorator will invoke `codecache.pick_vec_isa()` to initialize the `simd_nelements` of the `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)
In terms of the Triton backend, it does not need to touch it. But we'd prefer to uniform the code. Therefore, the new design simultaneously registers `CpuScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. It will bring additional overhead to the Triton backend.
```python
def init_backend_registration(self):
if get_scheduling_for_device("cpu") is None:
from .codegen.cpp import CppScheduling
register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)
if get_scheduling_for_device("cuda") is None:
from .codegen.triton import TritonScheduling
register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```
## Solution
To resolve the compilation latency regression for the Triton backend, we changed the `LoopLevel` a little bit([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` to `__post_init__` and the compilation performance would be back.
## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:
- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`
- W/ PR #100706, the compilation latency is about **57~58**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```
- W/O PR #100706, the compilation latency is about **46~47**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```
This PR fixed the compilation performance regression.
- W/ this PR #106874, the compilation latency is about **47~48**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
This PR aims to sort out the data type for `constant`.
The constant should be promoted to float https://github.com/pytorch/pytorch/pull/105440. So there are serval changes to do:
- Data type propagation should propagate constant node to `float` dtype if original dtype is `bfloat16`
- We do not need to insert `to_dtype` after the `constant` node, directly init an `fp32` constant is faster.
```
vectorized<bfloat16> tmp(value);
vectorized <float> tmp1 = cvt_bf16_fp32(tmp);
->
vectorized<float> tmp(value);
```
- move `constant` out of the list for `all operations can support bf16 without converting to fp32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105827
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR intends to extend Inductor to support the third-party backend that only focuses on the code generation just like what C++/OpenMP and Triton backend have done.
Currently, the generated code by Inductor contains two major parts. One is the kernel, and the other is the Python wrapper to glue the kernel. Therefore, the third-party backend needs to customize the two parts to generate its specific code.
- Python wrapper code generation
Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code to glue the kernel. Therefore, it is straightforward for the third-party backend to generate the backend-specific Python wrapper code. It just needs to inherit the `WrapperCodeGen` class and purposely override the particular member functions.
- Kernel code generation
It is driven by different `Scheduling`. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` are for C++/OpenMP and Triton backend, respectively. But there is no common `Scheduling` class. Based on the scheduling invocation, this PR abstracts a common `Scheduling` class containing the following member functions.
- [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
- [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
- [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
- [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
- [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._
- [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
- [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._
The third-party backend needs to inherit from the `Scheduling` class and implement these functions.
Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel