**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445. We can move the optimization of `decomposed quant/dequant` from the Inductor decomposition into the lowering phase to avoid these changes. In this way, we can:
- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.
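For illustration, here is a minimal hedged sketch of the un-decomposed q/dq pair the pattern matcher now sees before lowering (the scale, zero point, quant range, and dtype values are illustrative assumptions, not taken from this PR):
```
import torch
import torch.ao.quantization.fx._decomposed  # registers the quantized_decomposed ops

# Hedged sketch: with the optimization moved to lowering, the pattern matcher
# operates on these un-decomposed quantize/dequantize nodes rather than their
# mul/round/clamp expansion.
def qdq_reference(x, scale, zero_point):
    q = torch.ops.quantized_decomposed.quantize_per_tensor(
        x, scale, zero_point, 0, 255, torch.uint8)
    return torch.ops.quantized_decomposed.dequantize_per_tensor(
        q, scale, zero_point, 0, 255, torch.uint8)
```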
**Changes in this PR**
- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Make corresponding changes in the quantization pattern matcher to ensure nothing is BC-breaking.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
Summary:
Added support for quantized linear on CPU with fbgemm.
Specifically, we decompose `torch.ops.quantized.linear_unpacked_dynamic_fp16` into two steps: packing the weight, and running fbgemm's qlinear with the packed weight.
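As a rough illustration, here is a hedged sketch of the two-step split (the op names are the existing fbgemm-backed quantized ops; the actual registered decomposition may handle arguments differently):
```
import torch

def linear_unpacked_dynamic_fp16_sketch(x, weight, bias):
    # Step 1: pack the fp16 weight (and bias) for fbgemm.
    packed = torch.ops.quantized.linear_prepack_fp16(weight, bias)
    # Step 2: run fbgemm's dynamic fp16 qlinear with the packed weight.
    return torch.ops.quantized.linear_dynamic_fp16(x, packed)
```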
Test Plan: `test_aot_inductor::test_quantized_linear` (included in this commit).
Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069
Approved by: https://github.com/hl475
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the above decomposition hierarchy is CUDA numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model with the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first and then move them to CUDA later may silently see worse accuracy even when cuDNN is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy, obtained by consolidating the existing batch norm ops, looks like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. That will be done after the 2-week FC window, and the ops used in the old stack are planned to be removed after the 6-month BC window.
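As a rough, hedged Python-level sketch of "picks the right kernel based on what is installed" (the real dispatch is implemented in C++ and the signatures may differ):
```
import torch

# Hedged sketch only: routes to cudnn vs. the native kernels by availability.
def batch_norm_with_update_sketch(x, weight, bias, running_mean, running_var,
                                  momentum=0.1, eps=1e-5):
    if x.is_cuda and torch.backends.cudnn.is_available():
        out, save_mean, save_var, reserve = torch.ops.aten.cudnn_batch_norm(
            x, weight, bias, running_mean, running_var, True, momentum, eps)
        return out
    out, save_mean, save_var = torch.ops.aten.native_batch_norm(
        x, weight, bias, running_mean, running_var, True, momentum, eps)
    return out
```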
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
**Description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`, so we treat it as dynamic quantization of the activation plus a statically quantized linear.
The previous implementation of `onednn.qlinear_pointwise` only covers the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variant `onednn.qlinear_pointwise.tensor` to support that case.
This feature is targeting PyTorch 2.3 release.
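For reference, a hedged sketch of the decomposed pattern as it appears before fusion (the quant range, eps, and dtypes are illustrative assumptions):
```
import torch
import torch.ao.quantization.fx._decomposed  # registers the quantized_decomposed ops

# choose_qparams -> getitem -> q -> dq -> linear; only the trailing
# `dq -> linear` is fused into onednn.qlinear_pointwise(.tensor).
def dynamic_qlinear_pattern(x, w_dequantized, bias):
    x_scale, x_zp = torch.ops.quantized_decomposed.choose_qparams.tensor(
        x, -128, 127, torch.finfo(torch.float32).eps, torch.int8)
    x_q = torch.ops.quantized_decomposed.quantize_per_tensor.tensor(
        x, x_scale, x_zp, -128, 127, torch.int8)
    x_dq = torch.ops.quantized_decomposed.dequantize_per_tensor.tensor(
        x_q, x_scale, x_zp, -128, 127, torch.int8)
    # x_scale / x_zp are tensors here, hence the new .tensor variant of the
    # fused op is needed after lowering.
    return torch.nn.functional.linear(x_dq, w_dequantized, bias)
```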
**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```
**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms
Test method: a module with a single Linear layer, dynamically quantized and lowered to X86Inductor.
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.
So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
I need to customize the codegen for this operator, so I decided to implement it directly in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node, which is written carefully so that it doesn't get DCE'd by Inductor.
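For example, a hedged usage sketch (the call site and message are illustrative, not taken from this PR):
```
import torch

# _assert_scalar consumes the (possibly symbolic) boolean directly, so no
# scalar_to_tensor round-trip is needed before asserting.
def check_nonempty(x):
    torch.ops.aten._assert_scalar(x.shape[0] > 0, "expected a non-empty batch")
    return x * 2
```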
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
ghstack dependencies: #120800
By changing runtime symbolic asserts to use `assert_scalar`, the asserts can call into `expect_true` and update the shape env so that we can run the traced graph module with fake tensors. With `assert_async`, the asserts only get hit at runtime, which means that if we run the graph module with fake tensors, the asserts do not affect the shape env, so later data-dependent calls on the fake tensors may result in GuardOnDataDependentSymNode errors.
https://github.com/pytorch/pytorch/issues/119587
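A hedged illustration of the failure mode (the function and check are made up for this example):
```
import torch

# With assert_async-style checks, re-running the traced module with fake
# tensors leaves the ShapeEnv untouched, so the later data-dependent use of
# `n` can raise GuardOnDataDependentSymNode; routing the check through
# assert_scalar / expect_true instead refines the unbacked symbol's range.
def f(x):
    n = x.to(torch.int64).sum().item()  # unbacked SymInt under tracing
    torch._check(n >= 1)                # lowered to a runtime scalar assert
    return torch.ones(n)
```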
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/117361
The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.
This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.
The infra pieces of this PR are:
* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.
The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.
As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)
When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard-on-unbacked-SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor.
The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them wherever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
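For concreteness, a hedged sketch of a framework-style size-oblivious guard (the snippet is illustrative, not code from this PR):
```
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

# Inside the guard, size-like unbacked SymInts are analyzed as if they were
# >= 2, so the branch resolves at compile time without guarding on
# u0 == 0 or u0 == 1.
def maybe_squeeze_last(t):
    if guard_size_oblivious(t.size(-1) == 1):
        return t.squeeze(-1)
    return t
```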
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
Summary:
A follow-up for #117097. In that PR I didn't add `_scaled_dot_product_attention_for_cpu` to the core_aten_decomposition table. This PR does that and also adds a unit test.
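A hedged way to spot-check the registration (the overload name used here is an assumption):
```
import torch
from torch._decomp import core_aten_decompositions

# The CPU SDPA overload should now be covered by the core ATen decomp table.
table = core_aten_decompositions()
print(torch.ops.aten._scaled_dot_product_attention_for_cpu.default in table)
```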
Test Plan: `python test/export/test_export.py -k test_scaled_dot_product_attention`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
Summary: Previously, when the first tensor argument to `aten.cat` was empty and there was only one non-empty tensor argument, the first (empty) tensor was erroneously returned by the `aten.cat` decomposition. Here we fix the bug.
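A hedged minimal repro of the scenario (shapes are illustrative):
```
import torch

# The decomposition used to return the empty first input instead of the
# concatenation when exactly one non-empty tensor was passed.
@torch.compile
def f(a, b):
    return torch.cat([a, b])

empty = torch.empty(0, 3)
full = torch.randn(2, 3)
print(f(empty, full).shape)  # expected: torch.Size([2, 3])
```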
Test Plan:
```
$ python test/inductor/test_torchinductor.py -k test_cat_empty
...
----------------------------------------------------------------------
Ran 2 tests in 5.760s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113514
Approved by: https://github.com/jansel
Tracks https://github.com/pytorch/pytorch/issues/98161
Complex number support in PyTorch isn't ideal today, as complex operations mostly end up being handled by the aten runtime, except for `torch.angle`, which is handled in [105609](https://github.com/pytorch/pytorch/pull/105609). In general, a better way to handle this could be to decompose complex operations first so that more opportunities for fusion are unveiled, and then to have Triton take care of non-contiguous (strided) tensor operations more efficiently. This change adds support for decomposing complex additions.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 6
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = tl.load(in_ptr1 + (x0), xmask)
tmp2 = tmp0 + tmp1
tl.store(out_ptr0 + (x0), tmp2, xmask)
```
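For reference, a hedged Python-level example of the kind of complex addition this decomposition targets (size and dtype are illustrative):
```
import torch

# Complex inputs are decomposed into real/imag components so Inductor can
# fuse the elementwise adds into a single generated kernel like the one above.
@torch.compile
def add_complex(a, b):
    return a + b

a = torch.randn(3, dtype=torch.complex64)
b = torch.randn(3, dtype=torch.complex64)
print(add_complex(a, b))
```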
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110740
Approved by: https://github.com/jansel
## Context
Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:
```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```
For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done previously for other `refs` implementations. cc: @peterbell10 @lezcano
Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.
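A hedged Python sketch of the dim-wise pattern above (not the exact registered decompositions; default, dtype, and correction handling is simplified):
```
import torch

def max_dim_decomp(x, dim, keepdim=False):
    return torch.amax(x, dim, keepdim), torch.argmax(x, dim, keepdim=keepdim)

def min_dim_decomp(x, dim, keepdim=False):
    return torch.amin(x, dim, keepdim), torch.argmin(x, dim, keepdim=keepdim)

def var_mean_dim_decomp(x, dim, correction=1, keepdim=False):
    return (torch.var(x, dim, correction=correction, keepdim=keepdim),
            torch.mean(x, dim, keepdim=keepdim))
```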
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
Summary:
## Context
Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims`, respectively. Add decompositions for the other, non-generic variants of these operators that express them using the most generic variant.
Note that to register these decomps, the reference implementations under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10
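A hedged sketch of what "express them using the most generic variant" means in practice (illustrative, not the exact registered decomps):
```
import torch

def sum_default_decomp(x, dtype=None):
    # aten.sum(Tensor) -> aten.sum.dim_IntList over all dims
    return torch.ops.aten.sum.dim_IntList(x, list(range(x.dim())), False, dtype=dtype)

def squeeze_dim_decomp(x, dim):
    # aten.squeeze.dim -> aten.squeeze.dims
    return torch.ops.aten.squeeze.dims(x, [dim])
```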
Test Plan: Github CI + Meta Internal CI
Differential Revision: D49965952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the FX graph it is passed. That graph is generated by running aot_autograd with decompositions, so if the decompositions produce incorrect strides, so will Inductor.
This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations or b) striding of non-dense outputs. Both of these are lower priority compared to preserving channels-last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout to handle arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in, so that this composes with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
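A hedged illustration of the preserved behavior (shapes and the specific `_like` op are illustrative):
```
import torch

# A channels-last input to a _like factory should keep its channels-last
# layout through the Inductor decomposition.
x = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)

@torch.compile
def f(t):
    return torch.rand_like(t)

print(f(x).is_contiguous(memory_format=torch.channels_last))  # expected: True
```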
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
Example of when the `evict_first` heuristic helps.
```
import torch
from torch._inductor.utils import do_bench

@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

# Benchmark the compiled reduction over transposed (non-contiguous) inputs.
N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
print(do_bench(lambda: f(*inps)))
```
This generates code like this: http://ix.io/4HFs
```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel