Commit Graph

320 Commits

Jiong Gong
7b6d036c05 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular for reusing the output buffers of the GEMM, template and epilogues. This is not correct since the output buffer is an "output", not an "in-place" buffer, of the template kernel itself. Now we use a dedicated "aliases" dict to manage such buffer reuses, and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.
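For illustration, a minimal sketch of the kind of pattern this targets (assumptions: a CPU run with inductor max-autotune enabled; whether the C++ GEMM template is actually selected depends on the autotuning result):
```python
import torch
import torch._inductor.config as inductor_config

# Hypothetical repro sketch: a bf16 linear followed by a pointwise epilogue (relu),
# i.e. the "gemm + epilogues" pattern that the fusion support described above targets.
inductor_config.max_autotune = True  # assumption: lets the GEMM template compete with ATen

lin = torch.nn.Linear(256, 256).to(torch.bfloat16)
f = torch.compile(lambda x: torch.relu(lin(x)))
print(f(torch.randn(32, 256, dtype=torch.bfloat16)).shape)
```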

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019, #126068
2024-05-23 07:39:29 +00:00
Jiong Gong
31412cb2f2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support, with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.
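Numerically, the micro-gemm described above is roughly equivalent to the following (an illustrative sketch of the semantics only, not the template code):
```python
import torch

# Upcast the bf16 inputs, accumulate the matmul in fp32, then cast the result back:
# the "fused type casting and fp32 computation" the micro-gemm performs.
a = torch.randn(64, 128, dtype=torch.bfloat16)
b = torch.randn(128, 256, dtype=torch.bfloat16)
out = (a.float() @ b.float()).to(torch.bfloat16)
print(out.dtype, out.shape)
```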

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 07:39:29 +00:00
Jiong Gong
08f57b4bff [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds epilogue support for the c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to the c++ vector codegen for this purpose. They are leveraged by the `store_output` method of the template kernel for epilogue codegen and storing to the final result.
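Schematically (pure illustration, not inductor code), retracing the epilogue on sub-slices amounts to re-applying it per output tile with adjusted ranges and offsets:
```python
import torch

def epilogue(t):          # stands in for the traced epilogue IR nodes (e.g. a pointwise op)
    return torch.relu(t)

gemm_out = torch.randn(128, 128)   # pretend this is the GEMM template's output buffer
result = torch.empty_like(gemm_out)
tile = 32
for i in range(0, 128, tile):      # new ranges/offsets per sub-slice of the output
    for j in range(0, 128, tile):
        result[i:i + tile, j:j + tile] = epilogue(gemm_out[i:i + tile, j:j + tile])
```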

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 07:39:29 +00:00
Jiong Gong
9da7efa677 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
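For reference, a minimal usage sketch (assuming the C++ template participates in max-autotune for CPU GEMMs, which is what this infrastructure enables; the exact backend string is an assumption):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True
# Assumption: "CPP" selects the new C++ GEMM template alongside the ATen kernels.
inductor_config.max_autotune_gemm_backends = "CPP,ATEN"

lin = torch.nn.Linear(1024, 1024)     # constant weight (B), prepacked by the template
f = torch.compile(lin)
print(f(torch.randn(8, 1024)).shape)  # static M; dynamic M is also supported
```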

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-23 07:39:29 +00:00
PyTorch MergeBot
4f14282e35 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 2ac33a9f66.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
PyTorch MergeBot
657d39e44c Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 57108d9a49.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
PyTorch MergeBot
205f08140e Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 57c185b4c7.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
Jiong Gong
57c185b4c7 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support, with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 00:12:38 +00:00
Jiong Gong
57108d9a49 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds epilogue support for the c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to the c++ vector codegen for this purpose. They are leveraged by the `store_output` method of the template kernel for epilogue codegen and storing to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 00:07:52 +00:00
Jiong Gong
2ac33a9f66 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-22 23:59:12 +00:00
Aaron Orenstein
e4623de4cf typing scheduler.py [2/2]: Apply types (#126656)
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.

We probably should eventually add a new node class between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would make a lot of the `assert self.node is not None` churn go away, but it was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.
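As a small illustration of what the directive enforces (hypothetical function, not taken from scheduler.py):
```python
# mypy: disallow-untyped-defs

# With the directive above, every `def` in the file must be fully annotated;
# an unannotated helper such as `def helper(x): ...` becomes a mypy error.
def node_count(names: list[str]) -> int:
    return len(names)
```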

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
2024-05-22 20:33:31 +00:00
leslie-fang-intel
1cc9354cb0 Unify the dtype to VecMask<float, N> in ops.masked (#126662)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/126449. For `ops.masked` in the CPP backend, when the input dtype is `bool`, we actually load it as `VecMask<float, N>`. So we should unify the types of `other` and `mask` to `VecMask<float, N>` in order to invoke the `blendv` method.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_ops_masked_with_bool_input
clear && PYTORCH_ALL_SAMPLES=1 python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive__chunk_cat_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126662
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-05-21 20:52:25 +00:00
CaoE
6c503f1dbb save the reciprocal of weights for welford_reduce (#125148)
Save the reciprocals of the weights for `welford_reduce` to avoid redundant divisions and improve performance; a `weight_recps` table is inserted into the generated vec kernel.

Generated code:

- Before:

```
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
}
```

- After:

```
static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
}
```
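Conceptually (a scalar sketch for illustration, not the vectorized kernel), the saving comes from turning the per-element division in Welford's update into a multiply by a cached reciprocal:
```python
# Scalar Welford update using precomputed reciprocals of the running counts.
def welford_combine(state, x, weight_recps):
    mean, m2, count = state
    count += 1
    delta = x - mean
    mean += delta * weight_recps[count - 1]   # instead of delta / count
    m2 += delta * (x - mean)
    return mean, m2, count

weight_recps = [1.0 / (i + 1) for i in range(64)]  # computed once, reused every step
state = (0.0, 0.0, 0)
for v in [1.0, 2.0, 3.0, 4.0]:
    state = welford_combine(state, v, weight_recps)
print(state)  # (2.5, 5.0, 4): mean, M2, count
```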

Performance:

- Single core:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208
var | (56, 384, 1024) | 21.752 | 13.258 | 13.102

- 4 cores:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223
var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125148
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-17 08:20:12 +00:00
PyTorch MergeBot
337830f657 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit f060b0c6e6.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2116415398))
2024-05-17 00:22:40 +00:00
PyTorch MergeBot
4a5ef0b793 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 7844c202b2.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be revert ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
PyTorch MergeBot
59ca0d8c14 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 927e631dc2.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be revert ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
Jiong Gong
927e631dc2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support, with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #126019
2024-05-16 02:05:49 +00:00
Jiong Gong
7844c202b2 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds epilogue support for the c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to the c++ vector codegen for this purpose. They are leveraged by the `store_output` method of the template kernel for epilogue codegen and storing to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
2024-05-16 01:42:29 +00:00
Jiong Gong
f060b0c6e6 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-15 08:14:51 +00:00
PyTorch MergeBot
b6d8b256e6 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 037615b989.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2111318883))
2024-05-14 23:26:15 +00:00
Jiong Gong
037615b989 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-12 07:46:44 +00:00
lezcano
320af5eaa6 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.
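As a schematic example (hypothetical helper names, not inductor's actual value-range machinery), a propagation rule for one op looks like:
```python
from dataclasses import dataclass

@dataclass
class Bound:
    lower: float
    upper: float

# Rule for addition: the bound of a + b follows from the bounds of a and b,
# so a temporary created during codegen gets a bound instead of "unknown".
def add_rule(a: Bound, b: Bound) -> Bound:
    return Bound(a.lower + b.lower, a.upper + b.upper)

print(add_rule(Bound(0, 10), Bound(-1, 1)))  # Bound(lower=-1, upper=11)
```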

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-08 08:14:06 +00:00
Jiong Gong
058e28108f [inductor][cpp] support int64 vertical vec reduction (fix #124821) (#125563)
Fix #124821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125563
Approved by: https://github.com/desertfire
2024-05-07 03:56:22 +00:00
PyTorch MergeBot
2a42c40791 Revert "Compute bounds for the variables created during codegen (#123100)"
This reverts commit bb668c6468.

Reverted https://github.com/pytorch/pytorch/pull/123100 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing inductor tests bb668c6468 ([comment](https://github.com/pytorch/pytorch/pull/123100#issuecomment-2096837821))
2024-05-06 20:23:39 +00:00
PyTorch MergeBot
7ffa5558ee Revert "[FX] Update type hints in torch.fx._compatibility.py (#125469)"
This reverts commit 235b4d6ec2.

Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))
2024-05-06 18:36:43 +00:00
lezcano
bb668c6468 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-06 18:12:15 +00:00
Jiong Gong
68a1f787c8 [inductor][cpp] move some common cpp utils to cpp_utils.py (#125152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125152
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-05-06 04:30:30 +00:00
Xuehai Pan
235b4d6ec2 [FX] Update type hints in torch.fx._compatibility.py (#125469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469
Approved by: https://github.com/Skylion007
ghstack dependencies: #125468
2024-05-05 19:30:22 +00:00
Edward Z. Yang
6f70d22277 Extend torch.utils._sympy.symbol for more Inductor symbols (#125419)
I'm still missing a few, cdzq at least

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125419
Approved by: https://github.com/lezcano
ghstack dependencies: #125395
2024-05-04 09:05:00 +00:00
haozhe.zhu
57790fd088 [inductor] share cse cache during vectorized indirect load (#124597)
Fix https://github.com/pytorch/pytorch/issues/123502

`swap_buffer` is not needed in the vectorized indirect load; remove it so the cse cache can be shared.
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other codes
//
// also store tmp7 here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124597
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-28 01:02:48 +00:00
leslie-fang-intel
2d7f709752 [Inductor] Force the parallel depth as outer loop fusion depth (#123899)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/123801, which reports a performance regression of `pyhpc_turbulent_kinetic_energy` after outer loop fusion.

**Root Cause**

- [Generated Kernel before Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209)
  - Taking below 2 kernels as example:
    - [Kernel 0](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L255-L305) has 2 loop levels with size [200, 200]. Parallelization is not feasible due to an insufficient number of elements, as determined by [`decide_parallel_depth`](aaec97a403/torch/_inductor/codegen/cpp.py (L2145-L2164)). Therefore, the loop code is generated with the `#pragma omp single` directive.
    - [Kernel 1](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L306-L316) has 3 loop levels with size [200, 200, 26], which has enough elements to be parallelized.
- [Generated Kernel after Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887)
  - After outer loop fusion, `Kernel0` and `Kernel1` have been fused into one [OuterLoopFusedKernel](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887#file-pyhpc_turbulent_kinetic_energy-after-outer-loop-fusion-py-L261-L497); the outer loop size is [200, 200], which does not contain enough elements for parallelization.

In this PR, we propose a fix for `loop_nest` involving `OuterLoopFusedKernel`. The fix entails adding a specific heuristic for `OuterLoopFusedKernel` to determine the parallel depth by combining `outer_loop_fusion_depth` with the internal kernels' parallel depth.
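A rough sketch of the heuristic (hypothetical names and simplified logic; the real code lives in the cpp backend's `loop_nest` handling):
```python
# Hypothetical sketch: for an OuterLoopFusedKernel, parallelize at the outer-loop-fusion
# depth whenever any inner kernel would have chosen a nonzero parallel depth on its own,
# instead of falling back to a serial (`#pragma omp single`) outer loop.
def fused_parallel_depth(outer_loop_fusion_depth: int, inner_parallel_depths: list) -> int:
    if any(depth > 0 for depth in inner_parallel_depths):
        return outer_loop_fusion_depth
    return 0
```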

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123899
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-04-25 09:50:46 +00:00
leslie-fang-intel
bffecb5aff [Inductor] Enable VecMask store (#123710)
**Summary**
Enable vectorization of stores with `bool` dtype.

**Test Plan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123710
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #123512
2024-04-23 00:29:47 +00:00
Aaron Gokaslan
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Use set mutation methods (e.g. `update`, `difference_update`) instead of manually reimplementing them.
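For example, the kind of rewrite FURB142 suggests:
```python
# Before: manually re-implementing a bulk set mutation
result = set()
for item in ("a", "b", "c"):
    result.add(item)

# After: use the set mutation method directly
result = set()
result.update(("a", "b", "c"))
```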

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
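For example:
```python
# Flagged by the rule: parentheses are redundant when no arguments are passed
def check_neg(x: int) -> None:
    if x < 0:
        raise ValueError()

# Preferred
def check_neg_fixed(x: int) -> None:
    if x < 0:
        raise ValueError
```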

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Edward Z. Yang
efa36ef092 Natively support int truncation, don't guard on positive/negative (#122827)
This doesn't entirely fix the original problem that prompted this, but
it seems to just be getting stuck in export constraint formatting now
which seems like progress to me.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827
Approved by: https://github.com/avikchaudhuri
2024-04-11 15:22:32 +00:00
vfdev-5
6b7741546b Fixed arange decomp for float dtype (#121013)
## Description:

- [x] Fixed arange decomp for float dtype
- [x] Added a test

## Current state

The arange graph and generated C++ code are not optimal when arange is created directly with the float32 dtype:
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s, dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on `main`:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:8 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64);  iota = None
        mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1);  convert_element_type = None
        add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:9 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10);  convert_element_type_1 = None
        return (add_1,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64);  iota = None
        mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1);  convert_element_type = None
        add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:16 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10);  convert_element_type_1 = None
        return (add_1,)
```
and C++
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<double>(tmp0);   // <---- useless ops
            auto tmp2 = static_cast<double>(1.0);     // <----
            auto tmp3 = decltype(tmp1)(tmp1 * tmp2);  // <----
            auto tmp4 = static_cast<double>(0.0);     // <----
            auto tmp5 = decltype(tmp3)(tmp3 + tmp4);  // <----
            auto tmp6 = c10::convert<float>(tmp5);
            auto tmp7 = static_cast<float>(10.0);
            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
            out_ptr0[static_cast<long>(x0)] = tmp8;
        }
    }
}
```

However, if we manually create the arange as int64 and then convert to float32, the generated graph and C++ code are more natural and benefit from a speed-up.
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s).to(dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on `main`:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:14 in func, code: a = torch.arange(s).to(dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32);  iota = None

        # File: check_arange_decomp.py:15 in func, code: return s + a
        add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:21 in func, code: a = torch.arange(s).to(dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32);  iota = None

        # File: check_arange_decomp.py:22 in func, code: return s + a
        add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add,)
```

C++ on `main`
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = static_cast<float>(10.0);
            auto tmp3 = decltype(tmp1)(tmp1 + tmp2);
            out_ptr0[static_cast<long>(x0)] = tmp3;
        }
    }
}
```

For example, the speed-up seen on upsample_nearest2d on cpu:
```
[----------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu ----------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                |  Eager (2.3.0a0+gitb4324ed) PR  |  Compiled (2.3.0a0+gitb4324ed) PR  |  Compiled (2.3.0a0+git0d1e705) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+git0d1e705) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)      |        287.988 (+-10.399)       |         200.034 (+-8.630)          |            285.143 (+-8.412)            |     1.425 (+-0.000)      |          287.991 (+-11.302)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)          |        697.206 (+-27.033)       |         171.650 (+-7.381)          |            193.280 (+-5.840)            |     1.126 (+-0.000)      |          701.642 (+-26.461)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)    |        149.149 (+-6.045)        |         222.780 (+-6.852)          |            299.968 (+-12.354)           |     1.346 (+-0.000)      |          145.055 (+-7.232)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)        |        596.741 (+-27.970)       |         205.923 (+-8.648)          |            233.912 (+-7.742)            |     1.136 (+-0.000)      |          598.000 (+-25.630)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)      |       1095.734 (+-51.658)       |         700.850 (+-24.852)         |           1044.255 (+-38.216)           |     1.490 (+-0.000)      |         1097.977 (+-35.521)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)          |       2741.813 (+-122.917)      |         583.073 (+-16.998)         |            665.029 (+-36.331)           |     1.141 (+-0.000)      |         2722.388 (+-116.263)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)    |        578.183 (+-37.266)       |         833.295 (+-42.264)         |           1131.341 (+-54.710)           |     1.358 (+-0.000)      |          584.953 (+-45.549)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)        |       2332.508 (+-103.556)      |         840.194 (+-47.664)         |            935.625 (+-47.467)           |     1.114 (+-0.000)      |         2334.314 (+-91.644)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)    |        272.631 (+-11.348)       |         195.988 (+-5.748)          |            274.021 (+-9.475)            |     1.398 (+-0.000)      |          272.752 (+-12.716)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)        |        640.409 (+-25.465)       |         164.773 (+-7.372)          |            185.018 (+-8.349)            |     1.123 (+-0.000)      |          639.390 (+-30.761)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)  |        158.602 (+-6.593)        |         220.478 (+-6.809)          |            286.376 (+-8.981)            |     1.299 (+-0.000)      |          158.557 (+-6.143)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)      |        548.903 (+-22.889)       |         202.788 (+-9.158)          |            227.404 (+-8.995)            |     1.121 (+-0.000)      |          554.096 (+-21.330)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)    |       1036.061 (+-35.285)       |         680.728 (+-30.925)         |            986.254 (+-42.732)           |     1.449 (+-0.000)      |         1038.718 (+-43.070)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)        |       2504.520 (+-125.805)      |         550.067 (+-21.383)         |            628.000 (+-27.589)           |     1.142 (+-0.000)      |         2523.134 (+-113.336)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)  |       1058.188 (+-57.853)       |        1216.427 (+-76.160)         |           1380.231 (+-98.939)           |     1.135 (+-0.000)      |         1057.031 (+-66.075)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)      |       2305.911 (+-116.864)      |        1080.189 (+-79.934)         |           1141.561 (+-67.959)           |     1.057 (+-0.000)      |         2306.606 (+-121.544)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)      |       1689.489 (+-60.579)       |        1077.401 (+-44.948)         |           1634.264 (+-64.340)           |     1.517 (+-0.000)      |         1693.945 (+-67.998)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)          |       4198.368 (+-179.096)      |         886.656 (+-30.355)         |           1028.568 (+-46.310)           |     1.160 (+-0.000)      |         4174.351 (+-141.020)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)    |        716.572 (+-51.954)       |        1175.864 (+-52.191)         |           1674.373 (+-51.815)           |     1.424 (+-0.000)      |          715.724 (+-41.104)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)        |       3604.989 (+-132.489)      |        1096.933 (+-54.290)         |           1270.347 (+-60.932)           |     1.158 (+-0.000)      |         3601.864 (+-140.218)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)      |       6721.610 (+-355.997)      |        4203.213 (+-134.362)        |           6423.763 (+-225.311)          |     1.528 (+-0.000)      |         6715.626 (+-288.233)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)          |      16695.467 (+-709.620)      |        3460.013 (+-149.456)        |           4001.810 (+-218.093)          |     1.157 (+-0.000)      |        16621.138 (+-713.320)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)    |       3020.017 (+-147.314)      |        4743.164 (+-135.850)        |           6709.494 (+-281.025)          |     1.415 (+-0.000)      |         3015.602 (+-105.852)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)        |      14456.688 (+-752.839)      |        5150.893 (+-201.571)        |           5737.315 (+-138.011)          |     1.114 (+-0.000)      |        14464.472 (+-720.027)

Times are in microseconds (us).
```

## PR

This PR improves the arange decomposition so that `arange(s, dtype=torch.float32)` avoids the extra dtype conversion to double:

Code:
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s, dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on this PR:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1);  iota = None
        add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:16 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add_1,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:16 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1);  iota = None
        add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:17 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add_1,)
```
and C++ on this PR:
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = static_cast<float>(10.0);
            auto tmp3 = decltype(tmp1)(tmp1 + tmp2);
            out_ptr0[static_cast<long>(x0)] = tmp3;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121013
Approved by: https://github.com/peterbell10
2024-04-11 09:02:31 +00:00
Jiong Gong
cacc8e27a5 [inductor][cpp] refactor code to use define_kernel and call_kernel similar to CUDA (#123704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123704
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-04-11 06:34:44 +00:00
leslie-fang-intel
9078191666 [Inductor] Add the possible fusions group by priority (#123067)
**Summary**

Refactor the `Scheduler.fuse_nodes` changes in https://github.com/pytorch/pytorch/pull/121625. In the previous implementation of `Scheduler.fuse_nodes` in https://github.com/pytorch/pytorch/pull/121625, we used the `enable_outer_loop_fusion` context to ensure `OuterLoopFusion` happens after all the norm fusions.

There is a discussion in https://github.com/pytorch/pytorch/pull/121625/files#r1527177141 about reusing the current `score_fusion` mechanism. However, given that [fuse_nodes](f4ff063c33/torch/_inductor/scheduler.py (L1679-L1698)) invokes `fuse_nodes_once` 10 times, we are concerned that the score approach may disrupt pairs of regular fusion nodes in the 2nd invocation of `fuse_nodes_once` if they have been picked up by the outer loop fusion in the 1st invocation.

In this PR, we propose adding a `filter_possible_fusions_by_priority` abstraction. In each invocation of `fuse_nodes_once`, the possible fusions are grouped by the priority assigned by the backend, and only the group of possible fusions with the highest priority is fused in that invocation. In this way, we can ensure `OuterLoopFusion` happens after all the norm fusions.
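A schematic sketch of the grouping (illustrative code, not the scheduler implementation; the priority convention is an assumption):
```python
from collections import defaultdict

def filter_possible_fusions_by_priority(possible_fusions, get_priority):
    groups = defaultdict(list)
    for pair in possible_fusions:
        groups[get_priority(pair)].append(pair)
    if not groups:
        return []
    # Assumption: a smaller value means a higher priority; only that group is fused
    # in this invocation of fuse_nodes_once.
    return groups[min(groups)]
```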

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123067
Approved by: https://github.com/lezcano, https://github.com/jgong5
ghstack dependencies: #121625
2024-04-05 06:30:41 +00:00
leslie-fang-intel
bac2a39aee [Inductor] [ReImplement] Outer Loop Fusion for CPP Backend (#121625)
**Summary**
Re-implement of https://github.com/pytorch/pytorch/pull/121064

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_outer_loop_fusion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121625
Approved by: https://github.com/lezcano, https://github.com/jgong5
2024-04-05 06:24:57 +00:00
Gao Tianlin
aaef246c74 remove log2 decomposition; add log2 lowering (#123112)
Same reason as `log10`: `log2` is a core ATen op, so we should not decompose it. As https://github.com/pytorch/pytorch/pull/110882 suggested, it often maps to a hardware intrinsic; furthermore, decomposing it would negatively impact the numerical precision of the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123112
Approved by: https://github.com/peterbell10
2024-04-02 16:16:26 +00:00
Jiong Gong
6f4ed57b8a [inductor][cpp] unified the vectorized conversion with at::vec::convert for all data types (#119979)
This PR unified the vectorized conversion with `at::vec::convert` for all vectorized data types. The intrinsics implementations are implemented as a specialization and moved to their own arch-specific files. The vectorized conversion logic in cpp Inductor is simplified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979
Approved by: https://github.com/jansel, https://github.com/malfet
2024-03-29 21:48:29 +00:00
Edward Z. Yang
3178ba0dc9 Don't use sympy Float functions, use an opaque one with no reasoning (#122823)
Sympy simplifications don't obey floating point semantics, so don't
use Sympy for this.  Keep them as is, only evaluate with the reference
implementations when all arguments are known.

This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.

This doesn't actually depend on the earlier diffs on the stack and I can detach it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
2024-03-29 19:13:55 +00:00
Jiong Gong
105381ea11 [inductor][cpp] simplify CppVecKernelChecker (remove bool/int8 load as mask and load as float flags) (#119734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119734
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #119654, #119655
2024-03-27 11:20:35 +00:00
Jiong Gong
49121603ab [inductor][cpp] support vectorized indirect indexing (#119655)
This PR adds vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices via the new `indirect_assert` method on `Kernel`; the base implementation handles scalar indices and is overridden in `CppVecKernel` for vectorized indices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655
Approved by: https://github.com/jansel
ghstack dependencies: #119654
2024-03-27 10:25:45 +00:00
Jiong Gong
367ec62ae3 [inductor][cpp] generalize vector mask for dtypes (#119654)
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>`, which cannot work for operations on other data types. This PR generalizes it with the new `VecMask` template class that can represent masks over any vectorized data type. The intrinsics implementations in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as specializations of `VecMask` and moved to the corresponding header files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-27 05:33:53 +00:00
Nikita Shulga
dd3f2cb53a [Inductor] Add NEON ISA support on arm64 Macs (#122217)
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 focused on enabling it on macOS, but it quickly turned into landing very limited platform-specific acceleration at this time (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).

Enabling the test harness uncovered a number of latent issues in CPU inductor that were fixed in the following PRs:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608

The following was added/changed to enable the vectorization code to work on macOS:
 - Added a `VecNEON` class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
 - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
2024-03-26 05:07:30 +00:00
Nikita Shulga
cf06189a2d [CPPInductor] Fix another out-of-bounds access (#122580)
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-and-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if the data type is a 64-bit value such as int64).
Discovered while trying to enable the aarch64 vectorized inductor.
Minimal reproducer (reproducible on ARMv8 or any  x86_64 machine that does not support AVX512):
```python
import torch
def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x=torch.ones(10, 10, dtype=torch.int64)
y=torch.tensor([ 1,  2, -8,  8,  5,  5, -7, -8,  7,  0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-03-25 04:49:20 +00:00
vfdev-5
cdc7f0fd3b Fixed failing pyhpc_equation_of_state due to cpp nodes fusion with compatible ranges (#122420)
Fixes #122283

Description:

PR https://github.com/pytorch/pytorch/pull/120077 introduced cpp node fusion with compatible ranges under the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122420
Approved by: https://github.com/lezcano
2024-03-24 00:40:31 +00:00
Adnan Akhundov
456b112dca [inductor] Support non-Tensor predicate in torch.cond (#122378)
Summary: Previously, we only supported a torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicates, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.
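A minimal usage sketch (assuming `torch.compile` with dynamic shapes turns the shape comparison into a SymBool predicate here):

```python
import torch

def true_fn(t):
    return t.sin()

def false_fn(t):
    return t.cos()

def f(x):
    # The predicate is a plain bool / SymBool rather than a boolean tensor.
    return torch.cond(x.shape[0] > 4, true_fn, false_fn, (x,))

compiled = torch.compile(f, dynamic=True)
print(compiled(torch.randn(8, 3)).shape)
print(compiled(torch.randn(2, 3)).shape)
```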

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s

OK

$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-21 14:35:01 +00:00
haozhe.zhu
3bc2bb6781 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside the loop. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing it from memory. If the working set of the loop body is quite large, the gap between accumulating in a register and accumulating in memory is significant.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // accessing the array always goes through memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Descriptions
Follow ATen in using a `two pass reduction` with `omp parallel` for a deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Results (avg speedup):
| |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| huggingface | 1.346  | 1.343 |
| timm_models | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-15 02:03:10 +00:00