Commit Graph

320 Commits

Jiong Gong
7b6d036c05 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular for reusing the output buffers of the GEMM, template and epilogues. This is not correct since the output buffer is an "output", not an "in-place" buffer, of the template kernel itself. Now we use a dedicated "aliases" dict to manage such buffer reuses, and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.
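For illustration, a minimal sketch of the kind of pattern this targets (assumptions: a CPU run with inductor max-autotune enabled; whether the C++ GEMM template is actually selected depends on the autotuning result):
```python
import torch
import torch._inductor.config as inductor_config

# Hypothetical repro sketch: a bf16 linear followed by a pointwise epilogue (relu),
# i.e. the "gemm + epilogues" pattern that the fusion support described above targets.
inductor_config.max_autotune = True  # assumption: lets the GEMM template compete with ATen

lin = torch.nn.Linear(256, 256).to(torch.bfloat16)
f = torch.compile(lambda x: torch.relu(lin(x)))
print(f(torch.randn(32, 256, dtype=torch.bfloat16)).shape)
```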

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019, #126068
2024-05-23 07:39:29 +00:00
Jiong Gong
31412cb2f2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support, with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.
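Numerically, the micro-gemm described above is roughly equivalent to the following (an illustrative sketch of the semantics only, not the template code):
```python
import torch

# Upcast the bf16 inputs, accumulate the matmul in fp32, then cast the result back:
# the "fused type casting and fp32 computation" the micro-gemm performs.
a = torch.randn(64, 128, dtype=torch.bfloat16)
b = torch.randn(128, 256, dtype=torch.bfloat16)
out = (a.float() @ b.float()).to(torch.bfloat16)
print(out.dtype, out.shape)
```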

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 07:39:29 +00:00
Jiong Gong
08f57b4bff [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds epilogue support for the c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to the c++ vector codegen for this purpose. They are leveraged by the `store_output` method of the template kernel for epilogue codegen and storing to the final result.
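Schematically (pure illustration, not inductor code), retracing the epilogue on sub-slices amounts to re-applying it per output tile with adjusted ranges and offsets:
```python
import torch

def epilogue(t):          # stands in for the traced epilogue IR nodes (e.g. a pointwise op)
    return torch.relu(t)

gemm_out = torch.randn(128, 128)   # pretend this is the GEMM template's output buffer
result = torch.empty_like(gemm_out)
tile = 32
for i in range(0, 128, tile):      # new ranges/offsets per sub-slice of the output
    for j in range(0, 128, tile):
        result[i:i + tile, j:j + tile] = epilogue(gemm_out[i:i + tile, j:j + tile])
```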

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 07:39:29 +00:00
Jiong Gong
9da7efa677 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
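For reference, a minimal usage sketch (assuming the C++ template participates in max-autotune for CPU GEMMs, which is what this infrastructure enables; the exact backend string is an assumption):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True
# Assumption: "CPP" selects the new C++ GEMM template alongside the ATen kernels.
inductor_config.max_autotune_gemm_backends = "CPP,ATEN"

lin = torch.nn.Linear(1024, 1024)     # constant weight (B), prepacked by the template
f = torch.compile(lin)
print(f(torch.randn(8, 1024)).shape)  # static M; dynamic M is also supported
```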

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-23 07:39:29 +00:00
PyTorch MergeBot
4f14282e35 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 2ac33a9f66.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
PyTorch MergeBot
657d39e44c Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 57108d9a49.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
PyTorch MergeBot
205f08140e Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 57c185b4c7.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
Jiong Gong
57c185b4c7 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support, with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 00:12:38 +00:00
Jiong Gong
57108d9a49 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds epilogue support for the c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to the c++ vector codegen for this purpose. They are leveraged by the `store_output` method of the template kernel for epilogue codegen and storing to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 00:07:52 +00:00
Jiong Gong
2ac33a9f66 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-22 23:59:12 +00:00
Aaron Orenstein
e4623de4cf typing scheduler.py [2/2]: Apply types (#126656)
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.

We probably should eventually add a new node class between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would make a lot of the `assert self.node is not None` churn go away, but it was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.
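As a small illustration of what the directive enforces (hypothetical function, not taken from scheduler.py):
```python
# mypy: disallow-untyped-defs

# With the directive above, every `def` in the file must be fully annotated;
# an unannotated helper such as `def helper(x): ...` becomes a mypy error.
def node_count(names: list[str]) -> int:
    return len(names)
```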

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
2024-05-22 20:33:31 +00:00
leslie-fang-intel
1cc9354cb0 Unify the dtype to VecMask<float, N> in ops.masked (#126662)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/126449. For `ops.masked` in the CPP backend, when the input dtype is `bool`, we actually load it as `VecMask<float, N>`. So we should unify the types of `other` and `mask` to `VecMask<float, N>` in order to invoke the `blendv` method.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_ops_masked_with_bool_input
clear && PYTORCH_ALL_SAMPLES=1 python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive__chunk_cat_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126662
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-05-21 20:52:25 +00:00
CaoE
6c503f1dbb save the reciprocal of weights for welford_reduce (#125148)
Save the reciprocals of the weights for `welford_reduce` to avoid redundant divisions and improve performance; a `weight_recps` table is inserted into the generated vec kernel.

Generated code:

- Before:

```
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
}
```

- After:

```
static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
}
```
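Conceptually (a scalar sketch for illustration, not the vectorized kernel), the saving comes from turning the per-element division in Welford's update into a multiply by a cached reciprocal:
```python
# Scalar Welford update using precomputed reciprocals of the running counts.
def welford_combine(state, x, weight_recps):
    mean, m2, count = state
    count += 1
    delta = x - mean
    mean += delta * weight_recps[count - 1]   # instead of delta / count
    m2 += delta * (x - mean)
    return mean, m2, count

weight_recps = [1.0 / (i + 1) for i in range(64)]  # computed once, reused every step
state = (0.0, 0.0, 0)
for v in [1.0, 2.0, 3.0, 4.0]:
    state = welford_combine(state, v, weight_recps)
print(state)  # (2.5, 5.0, 4): mean, M2, count
```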

Performance:

- Single core:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208
var | (56, 384, 1024) | 21.752 | 13.258 | 13.102

- 4 cores:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223
var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125148
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-17 08:20:12 +00:00
PyTorch MergeBot
337830f657 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit f060b0c6e6.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2116415398))
2024-05-17 00:22:40 +00:00
PyTorch MergeBot
4a5ef0b793 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 7844c202b2.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be revert ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
PyTorch MergeBot
59ca0d8c14 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 927e631dc2.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be revert ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
Jiong Gong
927e631dc2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support, with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #126019
2024-05-16 02:05:49 +00:00
Jiong Gong
7844c202b2 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds epilogue support for the c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to the c++ vector codegen for this purpose. They are leveraged by the `store_output` method of the template kernel for epilogue codegen and storing to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
2024-05-16 01:42:29 +00:00
Jiong Gong
f060b0c6e6 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-15 08:14:51 +00:00
PyTorch MergeBot
b6d8b256e6 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 037615b989.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2111318883))
2024-05-14 23:26:15 +00:00
Jiong Gong
037615b989 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the micro-gemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. Compared to the ATen kernels (implemented with MKL), perf gains are only observed on a select number of models. The gains are more pronounced with dynamic shapes since MKL only supports packed gemm for static shapes. Details below.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-12 07:46:44 +00:00
lezcano
320af5eaa6 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.
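As a schematic example (hypothetical helper names, not inductor's actual value-range machinery), a propagation rule for one op looks like:
```python
from dataclasses import dataclass

@dataclass
class Bound:
    lower: float
    upper: float

# Rule for addition: the bound of a + b follows from the bounds of a and b,
# so a temporary created during codegen gets a bound instead of "unknown".
def add_rule(a: Bound, b: Bound) -> Bound:
    return Bound(a.lower + b.lower, a.upper + b.upper)

print(add_rule(Bound(0, 10), Bound(-1, 1)))  # Bound(lower=-1, upper=11)
```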

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-08 08:14:06 +00:00
Jiong Gong
058e28108f [inductor][cpp] support int64 vertical vec reduction (fix #124821) (#125563)
Fix #124821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125563
Approved by: https://github.com/desertfire
2024-05-07 03:56:22 +00:00
PyTorch MergeBot
2a42c40791 Revert "Compute bounds for the variables created during codegen (#123100)"
This reverts commit bb668c6468.

Reverted https://github.com/pytorch/pytorch/pull/123100 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing inductor tests bb668c6468 ([comment](https://github.com/pytorch/pytorch/pull/123100#issuecomment-2096837821))
2024-05-06 20:23:39 +00:00
PyTorch MergeBot
7ffa5558ee Revert "[FX] Update type hints in torch.fx._compatibility.py (#125469)"
This reverts commit 235b4d6ec2.

Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))
2024-05-06 18:36:43 +00:00
lezcano
bb668c6468 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-06 18:12:15 +00:00
Jiong Gong
68a1f787c8 [inductor][cpp] move some common cpp utils to cpp_utils.py (#125152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125152
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-05-06 04:30:30 +00:00
Xuehai Pan
235b4d6ec2 [FX] Update type hints in torch.fx._compatibility.py (#125469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469
Approved by: https://github.com/Skylion007
ghstack dependencies: #125468
2024-05-05 19:30:22 +00:00
Edward Z. Yang
6f70d22277 Extend torch.utils._sympy.symbol for more Inductor symbols (#125419)
I'm still missing a few, cdzq at least

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125419
Approved by: https://github.com/lezcano
ghstack dependencies: #125395
2024-05-04 09:05:00 +00:00
haozhe.zhu
57790fd088 [inductor] share cse cache during vectorized indirect load (#124597)
Fix https://github.com/pytorch/pytorch/issues/123502

`swap_buffer` is not needed in the vectorized indirect load; remove it so the cse cache can be shared.
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other codes
//
// also store tmp7 here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124597
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-28 01:02:48 +00:00
leslie-fang-intel
2d7f709752 [Inductor] Force the parallel depth as outer loop fusion depth (#123899)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/123801, which reports a performance regression of `pyhpc_turbulent_kinetic_energy` after outer loop fusion.

**Root Cause**

- [Generated Kernel before Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209)
  - Taking below 2 kernels as example:
    - [Kernel 0](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L255-L305) has 2 loop levels with size [200, 200]. Parallelization is not feasible due to an insufficient number of elements, as determined by [`decide_parallel_depth`](aaec97a403/torch/_inductor/codegen/cpp.py (L2145-L2164)). Therefore, the loop code is generated with the `#pragma omp single` directive.
    - [Kernel 1](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L306-L316) has 3 loop levels with size [200, 200, 26], which has enough elements to be parallelized.
- [Generated Kernel after Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887)
  - After outer loop fusion, `Kernel0` and `Kernel1` have been fused into one [OuterLoopFusedKernel](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887#file-pyhpc_turbulent_kinetic_energy-after-outer-loop-fusion-py-L261-L497); the outer loop size is [200, 200], which does not contain enough elements for parallelization.

In this PR, we propose a fix for `loop_nest` involving `OuterLoopFusedKernel`. The fix entails adding a specific heuristic for `OuterLoopFusedKernel` to determine the parallel depth by combining `outer_loop_fusion_depth` with the internal kernels' parallel depth.
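A rough sketch of the heuristic (hypothetical names and simplified logic; the real code lives in the cpp backend's `loop_nest` handling):
```python
# Hypothetical sketch: for an OuterLoopFusedKernel, parallelize at the outer-loop-fusion
# depth whenever any inner kernel would have chosen a nonzero parallel depth on its own,
# instead of falling back to a serial (`#pragma omp single`) outer loop.
def fused_parallel_depth(outer_loop_fusion_depth: int, inner_parallel_depths: list) -> int:
    if any(depth > 0 for depth in inner_parallel_depths):
        return outer_loop_fusion_depth
    return 0
```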

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123899
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-04-25 09:50:46 +00:00
leslie-fang-intel
bffecb5aff [Inductor] Enable VecMask store (#123710)
**Summary**
Enable vectorization of stores with `bool` dtype.

**Test Plan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123710
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #123512
2024-04-23 00:29:47 +00:00
Aaron Gokaslan
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Use set mutation methods (e.g. `update`, `difference_update`) instead of manually reimplementing them.
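For example, the kind of rewrite FURB142 suggests:
```python
# Before: manually re-implementing a bulk set mutation
result = set()
for item in ("a", "b", "c"):
    result.add(item)

# After: use the set mutation method directly
result = set()
result.update(("a", "b", "c"))
```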

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
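For example:
```python
# Flagged by the rule: parentheses are redundant when no arguments are passed
def check_neg(x: int) -> None:
    if x < 0:
        raise ValueError()

# Preferred
def check_neg_fixed(x: int) -> None:
    if x < 0:
        raise ValueError
```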

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Edward Z. Yang
efa36ef092 Natively support int truncation, don't guard on positive/negative (#122827)
This doesn't entirely fix the original problem that prompted this, but
it seems to just be getting stuck in export constraint formatting now
which seems like progress to me.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827
Approved by: https://github.com/avikchaudhuri
2024-04-11 15:22:32 +00:00
vfdev-5
6b7741546b Fixed arange decomp for float dtype (#121013)
## Description:

- [x] Fixed arange decomp for float dtype
- [x] Added a test

## Current state

The arange graph and generated C++ code are not optimal when arange is created directly with the float32 dtype:
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s, dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on `main`:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:8 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64);  iota = None
        mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1);  convert_element_type = None
        add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:9 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10);  convert_element_type_1 = None
        return (add_1,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64);  iota = None
        mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1);  convert_element_type = None
        add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:16 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10);  convert_element_type_1 = None
        return (add_1,)
```
and C++
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<double>(tmp0);   // <---- useless ops
            auto tmp2 = static_cast<double>(1.0);     // <----
            auto tmp3 = decltype(tmp1)(tmp1 * tmp2);  // <----
            auto tmp4 = static_cast<double>(0.0);     // <----
            auto tmp5 = decltype(tmp3)(tmp3 + tmp4);  // <----
            auto tmp6 = c10::convert<float>(tmp5);
            auto tmp7 = static_cast<float>(10.0);
            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
            out_ptr0[static_cast<long>(x0)] = tmp8;
        }
    }
}
```

However, if we manually create the arange as int64 and then convert to float32, the generated graph and C++ code are more natural and benefit from a speed-up.
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s).to(dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on `main`:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:14 in func, code: a = torch.arange(s).to(dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32);  iota = None

        # File: check_arange_decomp.py:15 in func, code: return s + a
        add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:21 in func, code: a = torch.arange(s).to(dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32);  iota = None

        # File: check_arange_decomp.py:22 in func, code: return s + a
        add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add,)
```

C++ on `main`
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = static_cast<float>(10.0);
            auto tmp3 = decltype(tmp1)(tmp1 + tmp2);
            out_ptr0[static_cast<long>(x0)] = tmp3;
        }
    }
}
```

For example, the speed-up seen on upsample_nearest2d on cpu:
```
[----------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu ----------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                |  Eager (2.3.0a0+gitb4324ed) PR  |  Compiled (2.3.0a0+gitb4324ed) PR  |  Compiled (2.3.0a0+git0d1e705) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+git0d1e705) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)      |        287.988 (+-10.399)       |         200.034 (+-8.630)          |            285.143 (+-8.412)            |     1.425 (+-0.000)      |          287.991 (+-11.302)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)          |        697.206 (+-27.033)       |         171.650 (+-7.381)          |            193.280 (+-5.840)            |     1.126 (+-0.000)      |          701.642 (+-26.461)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)    |        149.149 (+-6.045)        |         222.780 (+-6.852)          |            299.968 (+-12.354)           |     1.346 (+-0.000)      |          145.055 (+-7.232)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)        |        596.741 (+-27.970)       |         205.923 (+-8.648)          |            233.912 (+-7.742)            |     1.136 (+-0.000)      |          598.000 (+-25.630)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)      |       1095.734 (+-51.658)       |         700.850 (+-24.852)         |           1044.255 (+-38.216)           |     1.490 (+-0.000)      |         1097.977 (+-35.521)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)          |       2741.813 (+-122.917)      |         583.073 (+-16.998)         |            665.029 (+-36.331)           |     1.141 (+-0.000)      |         2722.388 (+-116.263)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)    |        578.183 (+-37.266)       |         833.295 (+-42.264)         |           1131.341 (+-54.710)           |     1.358 (+-0.000)      |          584.953 (+-45.549)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)        |       2332.508 (+-103.556)      |         840.194 (+-47.664)         |            935.625 (+-47.467)           |     1.114 (+-0.000)      |         2334.314 (+-91.644)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)    |        272.631 (+-11.348)       |         195.988 (+-5.748)          |            274.021 (+-9.475)            |     1.398 (+-0.000)      |          272.752 (+-12.716)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)        |        640.409 (+-25.465)       |         164.773 (+-7.372)          |            185.018 (+-8.349)            |     1.123 (+-0.000)      |          639.390 (+-30.761)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)  |        158.602 (+-6.593)        |         220.478 (+-6.809)          |            286.376 (+-8.981)            |     1.299 (+-0.000)      |          158.557 (+-6.143)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)      |        548.903 (+-22.889)       |         202.788 (+-9.158)          |            227.404 (+-8.995)            |     1.121 (+-0.000)      |          554.096 (+-21.330)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)    |       1036.061 (+-35.285)       |         680.728 (+-30.925)         |            986.254 (+-42.732)           |     1.449 (+-0.000)      |         1038.718 (+-43.070)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)        |       2504.520 (+-125.805)      |         550.067 (+-21.383)         |            628.000 (+-27.589)           |     1.142 (+-0.000)      |         2523.134 (+-113.336)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)  |       1058.188 (+-57.853)       |        1216.427 (+-76.160)         |           1380.231 (+-98.939)           |     1.135 (+-0.000)      |         1057.031 (+-66.075)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)      |       2305.911 (+-116.864)      |        1080.189 (+-79.934)         |           1141.561 (+-67.959)           |     1.057 (+-0.000)      |         2306.606 (+-121.544)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)      |       1689.489 (+-60.579)       |        1077.401 (+-44.948)         |           1634.264 (+-64.340)           |     1.517 (+-0.000)      |         1693.945 (+-67.998)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)          |       4198.368 (+-179.096)      |         886.656 (+-30.355)         |           1028.568 (+-46.310)           |     1.160 (+-0.000)      |         4174.351 (+-141.020)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)    |        716.572 (+-51.954)       |        1175.864 (+-52.191)         |           1674.373 (+-51.815)           |     1.424 (+-0.000)      |          715.724 (+-41.104)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)        |       3604.989 (+-132.489)      |        1096.933 (+-54.290)         |           1270.347 (+-60.932)           |     1.158 (+-0.000)      |         3601.864 (+-140.218)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)      |       6721.610 (+-355.997)      |        4203.213 (+-134.362)        |           6423.763 (+-225.311)          |     1.528 (+-0.000)      |         6715.626 (+-288.233)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)          |      16695.467 (+-709.620)      |        3460.013 (+-149.456)        |           4001.810 (+-218.093)          |     1.157 (+-0.000)      |        16621.138 (+-713.320)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)    |       3020.017 (+-147.314)      |        4743.164 (+-135.850)        |           6709.494 (+-281.025)          |     1.415 (+-0.000)      |         3015.602 (+-105.852)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)        |      14456.688 (+-752.839)      |        5150.893 (+-201.571)        |           5737.315 (+-138.011)          |     1.114 (+-0.000)      |        14464.472 (+-720.027)

Times are in microseconds (us).
```

## PR

This PR improves the arange decomposition so that `arange(s, dtype=torch.float32)` avoids the extra dtype conversion to double:

Code:
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s, dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on this PR:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1);  iota = None
        add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:16 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add_1,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:16 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1);  iota = None
        add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:17 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add_1,)
```
and C++ on this PR:
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = static_cast<float>(10.0);
            auto tmp3 = decltype(tmp1)(tmp1 + tmp2);
            out_ptr0[static_cast<long>(x0)] = tmp3;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121013
Approved by: https://github.com/peterbell10
2024-04-11 09:02:31 +00:00
Jiong Gong
cacc8e27a5 [inductor][cpp] refactor code to use define_kernel and call_kernel similar to CUDA (#123704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123704
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-04-11 06:34:44 +00:00
leslie-fang-intel
9078191666 [Inductor] Add the possible fusions group by priority (#123067)
**Summary**

Refactor the `Scheduler.fuse_nodes` changes in https://github.com/pytorch/pytorch/pull/121625. In the previous implementation of `Scheduler.fuse_nodes` in https://github.com/pytorch/pytorch/pull/121625, we used the `enable_outer_loop_fusion` context to ensure `OuterLoopFusion` happens after all the norm fusions.

There is a discussion in https://github.com/pytorch/pytorch/pull/121625/files#r1527177141 about reusing the current `score_fusion` mechanism. However, given that [fuse_nodes](f4ff063c33/torch/_inductor/scheduler.py (L1679-L1698)) invokes `fuse_nodes_once` 10 times, we are concerned that the score approach may disrupt pairs of regular fusion nodes in the 2nd invocation of `fuse_nodes_once` if they have been picked up by the outer loop fusion in the 1st invocation.

In this PR, we propose adding a `filter_possible_fusions_by_priority` abstraction. In each invocation of `fuse_nodes_once`, the possible fusions are grouped by the priority assigned by the backend, and only the group of possible fusions with the highest priority is fused in that invocation. In this way, we can ensure `OuterLoopFusion` happens after all the norm fusions.
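A schematic sketch of the grouping (illustrative code, not the scheduler implementation; the priority convention is an assumption):
```python
from collections import defaultdict

def filter_possible_fusions_by_priority(possible_fusions, get_priority):
    groups = defaultdict(list)
    for pair in possible_fusions:
        groups[get_priority(pair)].append(pair)
    if not groups:
        return []
    # Assumption: a smaller value means a higher priority; only that group is fused
    # in this invocation of fuse_nodes_once.
    return groups[min(groups)]
```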

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123067
Approved by: https://github.com/lezcano, https://github.com/jgong5
ghstack dependencies: #121625
2024-04-05 06:30:41 +00:00
leslie-fang-intel
bac2a39aee [Inductor] [ReImplement] Outer Loop Fusion for CPP Backend (#121625)
**Summary**
Re-implement of https://github.com/pytorch/pytorch/pull/121064

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_outer_loop_fusion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121625
Approved by: https://github.com/lezcano, https://github.com/jgong5
2024-04-05 06:24:57 +00:00
Gao Tianlin
aaef246c74 remove log2 decomposition; add log2 lowering (#123112)
Same reason as `log10`: `log2` is a core ATen op, so we should not decompose it. As https://github.com/pytorch/pytorch/pull/110882 suggested, it often maps to a hardware intrinsic; furthermore, decomposing it would negatively impact the numerical precision of the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123112
Approved by: https://github.com/peterbell10
2024-04-02 16:16:26 +00:00
Jiong Gong
6f4ed57b8a [inductor][cpp] unified the vectorized conversion with at::vec::convert for all data types (#119979)
This PR unified the vectorized conversion with `at::vec::convert` for all vectorized data types. The intrinsics implementations are implemented as a specialization and moved to their own arch-specific files. The vectorized conversion logic in cpp Inductor is simplified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979
Approved by: https://github.com/jansel, https://github.com/malfet
2024-03-29 21:48:29 +00:00
Edward Z. Yang
3178ba0dc9 Don't use sympy Float functions, use an opaque one with no reasoning (#122823)
Sympy simplifications don't obey floating point semantics, so don't
use Sympy for this.  Keep them as is, only evaluate with the reference
implementations when all arguments are known.

This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.

This doesn't actually depend on the earlier diffs on the stack and I can detach it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
2024-03-29 19:13:55 +00:00
Jiong Gong
105381ea11 [inductor][cpp] simplify CppVecKernelChecker (remove bool/int8 load as mask and load as float flags) (#119734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119734
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #119654, #119655
2024-03-27 11:20:35 +00:00
Jiong Gong
49121603ab [inductor][cpp] support vectorized indirect indexing (#119655)
This PR adds vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices via the new `indirect_assert` method on `Kernel`; the base implementation handles scalar indices and is overridden in `CppVecKernel` for vectorized indices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655
Approved by: https://github.com/jansel
ghstack dependencies: #119654
2024-03-27 10:25:45 +00:00
Jiong Gong
367ec62ae3 [inductor][cpp] generalize vector mask for dtypes (#119654)
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>`, which cannot work for operations on other data types. This PR generalizes it with the new `VecMask` template class that can represent masks over any vectorized data type. The intrinsics implementations in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as specializations of `VecMask` and moved to the corresponding header files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-27 05:33:53 +00:00
Nikita Shulga
dd3f2cb53a [Inductor] Add NEON ISA support on arm64 Macs (#122217)
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 focused on enabling it on macOS, but it quickly turned into landing very limited platform-specific acceleration at this time (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).

Enabling the test harness uncovered a number of latent issues in CPU inductor that were fixed in the following PRs:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608

The following was added/changed to enable the vectorization code to work on macOS:
 - Added a `VecNEON` class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
 - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
2024-03-26 05:07:30 +00:00
Nikita Shulga
cf06189a2d [CPPInductor] Fix another out-of-bounds access (#122580)
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-and-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if the data type is a 64-bit value such as int64).
Discovered while trying to enable the aarch64 vectorized inductor.
Minimal reproducer (reproducible on ARMv8 or any  x86_64 machine that does not support AVX512):
```python
import torch
def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x=torch.ones(10, 10, dtype=torch.int64)
y=torch.tensor([ 1,  2, -8,  8,  5,  5, -7, -8,  7,  0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-03-25 04:49:20 +00:00
vfdev-5
cdc7f0fd3b Fixed failing pyhpc_equation_of_state due to cpp nodes fusion with compatible ranges (#122420)
Fixes #122283

Description:

PR https://github.com/pytorch/pytorch/pull/120077 introduced cpp node fusion with compatible ranges under the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122420
Approved by: https://github.com/lezcano
2024-03-24 00:40:31 +00:00
Adnan Akhundov
456b112dca [inductor] Support non-Tensor predicate in torch.cond (#122378)
Summary: Previously, we only supported a torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicates, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.
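A minimal usage sketch (assuming `torch.compile` with dynamic shapes turns the shape comparison into a SymBool predicate here):

```python
import torch

def true_fn(t):
    return t.sin()

def false_fn(t):
    return t.cos()

def f(x):
    # The predicate is a plain bool / SymBool rather than a boolean tensor.
    return torch.cond(x.shape[0] > 4, true_fn, false_fn, (x,))

compiled = torch.compile(f, dynamic=True)
print(compiled(torch.randn(8, 3)).shape)
print(compiled(torch.randn(2, 3)).shape)
```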

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s

OK

$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-21 14:35:01 +00:00
haozhe.zhu
3bc2bb6781 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside the loop. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing it from memory. If the working set of the loop body is quite large, the gap between accumulating in a register and accumulating in memory is significant.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // accessing the array always goes through memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Descriptions
Follow ATen in using a `two pass reduction` with `omp parallel` for a deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Results (avg speedup):
| |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| huggingface | 1.346  | 1.343 |
| timm_models | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-15 02:03:10 +00:00