Currently scan has an `init` argument which must be the identity of the
combine function. This isn't strictly necessary if we are more careful about
keeping track of the first element and avoid combining it with anything.
This does additionally require that there are no active load masks, since we can't
do the `where_cond` any more. However, this shouldn't be possible anyway since
scans are always realized and only fused via the scheduler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119727
Approved by: https://github.com/lezcano
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>` which cannot work for operations with other data types. This PR generalizes it with the new `VecMask` template class that can work for masks on any vectorized data types. The intrinsics implementation in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as the specialization for `VecMask` and moved to corresponding header files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel