Commit Graph

6 Commits

chunyuan
1faef895ca Inductor cpp wrapper: support sympy.Expr as input (#101257)
Leverage the logic in https://github.com/pytorch/pytorch/pull/95533 to get the `dtype` of a `sympy.Expr` and support it as a graph input in the cpp wrapper.
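
A hedged sketch of the user-visible scenario this enables, assuming the C++ wrapper is toggled via `torch._inductor.config.cpp_wrapper`:

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # assumed flag enabling cpp wrapper codegen

def fn(x):
    # Under dynamic shapes, x.shape[0] is symbolic and can reach the graph
    # as a sympy.Expr input; the cpp wrapper needs its dtype to declare the
    # matching C++ argument.
    return x * x.shape[0]

compiled = torch.compile(fn, dynamic=True)
print(compiled(torch.randn(8)))
```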

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101257
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/jansel
2023-05-15 23:57:28 +00:00
haozhe.zhu
a1ada050f8 do not insert to_dtype for memory-copy-only buffers (#97147)
Remove redundant `to_dtype` conversions, e.g.
`load_bf16 + to_fp32 + to_bf16 + store_bf16` => `load_bf16 + store_bf16`
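
A hedged sketch of the pattern this targets: a bf16 buffer that is only copied, so the fp32 round-trip is pure overhead:

```python
import torch

def copy_only(x):
    # A pure memory copy: no compute that would require widening to fp32.
    return x.clone()

compiled = torch.compile(copy_only)
out = compiled(torch.randn(1024, dtype=torch.bfloat16))
```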

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97147
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-27 14:55:41 +00:00
Peter Bell
4f44877983 [Inductor] Add test for Scheduler fusions (#90014)
Currently there is `test_vertical_fusion1`, which fuses entirely during
the lowering stage with no buffers realized. This adds
`test_scheduler_vertical_fusion1`, which is the same test but with
several intermediate calculations realized, so the scheduler is left
to do the fusion.

To support the test, this PR also adds:
- `metrics.ir_nodes_pre_fusion`, which, when compared with
`generated_kernel_count`, tells us how many nodes were fused (see the
sketch below).
- `torch._test_inductor_realize`, which is an identity operator in
eager mode but, under Inductor, also forces its input to be realized.
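
A hedged sketch of how these two additions might be used together (illustrative only, not the PR's actual test):

```python
import torch
from torch._inductor import metrics

def fn(x):
    # Force the intermediate buffer to be realized so that fusion is left
    # to the scheduler instead of happening during lowering.
    a = torch._test_inductor_realize(x + 1)
    return a * 2

metrics.reset()
torch.compile(fn)(torch.randn(16))
# The difference between IR nodes before fusion and generated kernels
# counts how many nodes the scheduler fused.
print(metrics.ir_nodes_pre_fusion - metrics.generated_kernel_count)
```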

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90014
Approved by: https://github.com/jansel
2022-12-07 01:33:25 +00:00
Horace He
419ef2cdcf Added utility to count memory reads/writes in Inductor (#89203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89203
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-11-19 04:18:26 +00:00
Wang, Eikan
6541e51ffd Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to improve TorchInductor's vectorization performance. Take `res = torch.exp(torch.add(x, y))` as an example. The generated code is as follows when `config.cpp.simdlen` is 8.

```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
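        // Main loop: vectorized with at::vec::Vectorized<float>, 8 floats per iteration.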
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
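        // Tail loop: scalar fallback for the remaining (ks0*ks1) % 8 elements.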
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}
```

The major pipeline is as follows; a usage sketch follows the list.
- Check whether the loop body can be vectorized by `aten::vec`. The checker consists of two parts: [one](bf66991fc4/torch/_inductor/codegen/cpp.py (L702)) checks whether all the `ops` are supported, and the [other](355326faa3/torch/_inductor/codegen/cpp.py (L672)) checks whether the data access can be vectorized.
  - [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and the original OMP SIMD kernel. The original OMP SIMD kernel serves as the tail loop when the main loop is vectorized.
  - [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
  - [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): the ops supported on top of `aten::vec`
  - Create kernel
    - [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
    - [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
  - [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and the original CPP kernel
    - [Vectorize the innermost loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
    - [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))
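
A hedged usage sketch for the example above, assuming today's `torch.compile` entry point and the `config.cpp.simdlen` knob:

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp.simdlen = 8  # request 8-wide SIMD in the CPP backend

def fn(x, y):
    return torch.exp(torch.add(x, y))

# On CPU, Inductor lowers this to a C++ kernel like the one shown above.
compiled = torch.compile(fn)
print(compiled(torch.randn(64), torch.randn(64)))
```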

Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:24:14 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor` (see the sketch below)
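
Concretely, downstream imports change like this (a minimal before/after sketch):

```python
# Before this PR (standalone repos):
#   import torchdynamo
#   import torchinductor
# After this PR (in PyTorch core):
import torch._dynamo
import torch._inductor
```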

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00