Commit Graph

69 Commits

Author SHA1 Message Date
leslie-fang-intel
25de671ea8 [Inductor][CPP] Enable Grouped GEMM Template (#143796)
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012

- Support flexible number of GEMMs
- Share activation across GEMMs
  - The Grouped GEMM Template supports independent activations
  - However, the pattern matcher requires an anchor node, which serves as the shared activation across GEMMs
- Each GEMM can have a unique weight, but all weights must have the same sizes
- Each GEMM can have a unique bias or None
  - Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM can have its own epilogues
  - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```

**Example**
Here is an example; the generated code is linked below:
```
batch_size = 4
in_features = 512
out_features = 1024
dtype = torch.bfloat16

class M(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.linear0 = torch.nn.Linear(in_features, out_features, bias=False)
        self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.linear0(x), self.linear1(x)

if __name__ == "__main__":
    with torch.no_grad():
        input = torch.randn(batch_size, in_features, dtype=dtype)
        m = M(bias=False).to(dtype=dtype).eval()
        cm = torch.compile(m)
        act_res = cm(input)
```

Generated Code:  https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py

**Next Step**

- Support Epilogue fusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
2025-01-14 05:59:07 +00:00
Wu, Chunyuan
d7411c0cc1 [AOTI] add C shim for QConvPointWise (#138540)
This PR adds C shim for `QConvPointWisePT2E` and `QConvPointWiseBinaryPT2E` similar to https://github.com/pytorch/pytorch/pull/138439. Besides that, we aligned the implementation of `qconv_pointwise` with `qlinear_pointwise` in the following aspects:
1. The parameter orders of `qconv_pointwise` and `qlinear_pointwise` were quite different, so we aligned the schema of `qconv_pointwise` to use a parameter order similar to `qlinear_pointwise` for consistency.
2. We always convert `x_scale` and `x_zero_point` to Tensors, just as in the lowering of `qlinear_pointwise`. This avoids the need for two separate C APIs (one for `double x_scale` and `int64_t x_zero_point`, another for `Tensor` versions); a single API taking `Tensor` `x_scale` and `x_zero_point` suffices. If we later add dynamic quantization for qconv (which will use `Tensor` `x_scale` and `x_zero_point`), we can reuse the code from this PR without changing the C shim layer API.
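
A minimal sketch of that normalization, with an illustrative helper name and assumed dtypes (this is not the actual lowering code):
```python
import torch

# Hedged sketch: normalize scalar quantization parameters to Tensors during
# lowering so a single Tensor-based C shim API covers both the static case
# (Python scalars) and a future dynamic-quantization case. The helper name
# and dtype choices are illustrative assumptions.
def to_tensor_quant_params(x_scale, x_zero_point):
    if not isinstance(x_scale, torch.Tensor):
        x_scale = torch.tensor(float(x_scale), dtype=torch.float64)
    if not isinstance(x_zero_point, torch.Tensor):
        x_zero_point = torch.tensor(int(x_zero_point), dtype=torch.int64)
    return x_scale, x_zero_point
```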

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138540
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691, #138806
2024-10-31 02:03:01 +00:00
Wu, Chunyuan
489c66fdb3 [AOTI] fix pointer_to_list (#138806)
Fixes the `pointer_to_list` function to take `*(ptr + i)` instead of `*ptr`.
This fixes the runtime error when running INT8 yolo-v7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138806
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691
2024-10-29 14:33:16 +00:00
Wu, Chunyuan
9af1816974 [AOTI] add C shim for _weight_int8pack_mm (#138691)
Fixes the error when running WOQ-INT8 LLaMA:
```
E           In file included from /home/user/inductor/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3,
E                            from /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:4:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp: In function ‘void inductor_entry_impl(AtenTensorOpaque**, AtenTensorOpaque**)’:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:117:33: error: ‘aoti_torch_cpu__weight_int8pack_mm’ was not declared in this scope
E             117 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int8pack_mm(convert_arrayref_tensor_to_tensor(arg8_1), _frozen_param0, _frozen_param1, &buf0_handle));
E                 |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138691
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-29 13:53:36 +00:00
Wu, Chunyuan
a3aca24ae5 [AOTI] add C shim for QLinearPointwise (#138439)
This PR adds C shim for `QLinearPointwisePT2E` and `QLinearPointwiseBinaryPT2E`.

The below changes are needed:
- We moved the qlinear API out of the anonymous namespace since we need to call it in the shim layer.

- We fixed the code which generated the `inputs` and `constant_args` so that we can directly leverage the `codegen` of the parent class.

- `x_scale` and `x_zp` are ensured to be Tensors during the lowering stage, so we can remove the code that handles the non-Tensor case.
  fb0da32377/torch/_inductor/mkldnn_lowerings.py (L492-L496)

  fb0da32377/torch/_inductor/mkldnn_lowerings.py (L499-L503)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138439
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-26 08:04:15 +00:00
Wu, Chunyuan
de51ed8610 [AOTI] Add C shim for _mkl_linear (#137880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137880
Approved by: https://github.com/desertfire
2024-10-18 16:26:19 +00:00
Bin Bao
6bc57549f9 [AOTI] Remove non-ABI-compatible tests (#137982)
Summary: Remove non-ABI-compatible mode tests since ABI-compatible mode is now on by default. Also clean up tests that explicitly set ABI-compatible to True.

Differential Revision: [D64439673](https://our.internmc.facebook.com/intern/diff/D64439673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137982
Approved by: https://github.com/malfet
2024-10-16 21:35:46 +00:00
Huy Do
df114a447e Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it breaks it up, so individual tests can finish and no longer need to be marked as slow.

Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` error. This change may fix that issue as well (pending CI testing).
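
A minimal sketch of the parametrization approach, using PyTorch's internal test utilities (the parameter names below are illustrative, not the real test's):
```python
import torch
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class LstmPackedTests(TestCase):
    # Each @parametrize axis multiplies the number of generated test methods,
    # so every combination becomes its own independently reported test.
    @parametrize("bidirectional", [True, False])
    @parametrize("batch_first", [True, False])
    def test_lstm_packed(self, bidirectional, batch_first):
        lstm = torch.nn.LSTM(8, 8, bidirectional=bidirectional, batch_first=batch_first)
        x = torch.randn(2, 3, 8)
        out, _ = lstm(x)
        self.assertEqual(out.shape[-1], 16 if bidirectional else 8)

instantiate_parametrized_tests(LstmPackedTests)

if __name__ == "__main__":
    run_tests()
```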

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-09 05:13:53 +00:00
PyTorch MergeBot
5349ee2934 Revert "Parametrize test_lstm_packed (#137447)"
This reverts commit d5493ed579.

Reverted https://github.com/pytorch/pytorch/pull/137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](https://github.com/pytorch/pytorch/pull/137447#issuecomment-2400737602))
2024-10-08 20:15:24 +00:00
Huy Do
d5493ed579 Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it breaks it up, so individual tests can finish and no longer need to be marked as slow.

Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` error. This change may fix that issue as well (pending CI testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-08 15:26:27 +00:00
Bin Bao
0878739b11 [AOTI] Add C shim for MKLDNN _convolution_pointwise (#137269)
Differential Revision: [D63875271](https://our.internmc.facebook.com/intern/diff/D63875271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137269
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-04 19:42:05 +00:00
Henry Tsang
c318bafe9c [inductor mkldnn test][BE] Use parametrize to shorten test run time (#137153)
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish, so split them into smaller tests using `parametrize`.

This test file likely has further refactoring opportunities as well; a next step would be to parametrize the add functions.

Differential Revision: D63723925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
2024-10-02 17:20:27 +00:00
Bin Bao
d6d9183456 [Inductor] Switch cpp_wrapper tests to ABI-compatible (#136904)
Summary: Switch test_cpu_cpp_wrapper and test_cuda_cpp_wrapper to test the ABI-compatible mode only. Also fixed a missing Py_NewRef issue for Python 3.9.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136904
Approved by: https://github.com/Yoggie9477, https://github.com/chenyang78
2024-09-30 05:44:52 +00:00
Bin Bao
1c9a1a2a19 [AOTI] Support MKL linear ops in cpp wrapper (#134974)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support mkl linear in the ABI-compatible mode for cpp-wrapper Inductor.

Differential Revision: [D63322202](https://our.internmc.facebook.com/intern/diff/D63322202)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134974
Approved by: https://github.com/chenyang78, https://github.com/leslie-fang-intel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-09-25 03:53:11 +00:00
Bin Bao
b4c84c3167 [AOTI] Fix a fallback op returning None issue (#135997)
Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor.

Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997
Approved by: https://github.com/chenyang78
2024-09-14 18:12:06 +00:00
Bin Bao
ea2ecab15b [AOTI][reland] Fix assert_function call in cpu autotune template (#135920)
Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Test Plan: CI

Differential Revision: D62500592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920
Approved by: https://github.com/chenyang78
2024-09-13 12:21:57 +00:00
Aaron Orenstein
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
Pushpak Raj Gautam
ee8c5cc1cc For S444023: Back out "deprecate search_autotune_cache (#133628)" (#135186)
Summary: For S444023

Test Plan:
Revert prevented the NaN errors - f639391901
Training job ran for 7767 iterations. NaN errors show up within the first 1k.

Reviewed By: nmacchioni

Differential Revision: D62224747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
2024-09-11 14:08:40 +00:00
PyTorch MergeBot
0a9d55d2ee Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086)"
This reverts commit 16c3b8f87c.

Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))
2024-09-10 19:51:16 +00:00
Bin Bao
16c3b8f87c [AOTI] Fix assert_function call in cpu autotune template (#135086)
Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086
Approved by: https://github.com/chenyang78, https://github.com/angelayi
ghstack dependencies: #134857
2024-09-09 16:54:12 +00:00
Bin Bao
9c6dff4941 [AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857)
Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857
Approved by: https://github.com/angelayi
2024-09-09 16:54:12 +00:00
Bin Bao
1e57ef08fa [AOTI] Support MKLDNN qconv ops in cpp wrapper (#134795)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qconv in the ABI-compatible mode for cpp-wrapper Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134795
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475, #134783
2024-09-06 01:01:53 +00:00
Bin Bao
614b86d602 [AOTI] Support MKLDNN qlinear ops in cpp wrapper (#134783)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qlinear in the ABI-compatible mode for cpp-wrapper Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134783
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475
2024-09-06 01:01:53 +00:00
Bin Bao
0b96dfb736 [AOTI] Support MKLDNN conv ops in cpp wrapper (#134475)
Summary: Partially fix https://github.com/pytorch/pytorch/issues/123040. In the ABI-compatible mode, MKLDNN fallback ops do not have C shim implementations and thus need to go through the custom ops launch path. Other MKLDNN ops will be fixed in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134475
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
2024-09-06 01:01:53 +00:00
Nicolas Macchioni
dd69013c7a deprecate search_autotune_cache (#133628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133628
Approved by: https://github.com/oulgen
2024-08-16 09:29:39 +00:00
Bin Bao
fd874b799f [AOTI][refactor] Update MKLDNN ops cpp wrapper support (#132367)
Summary: Set op_overload for MKLDNN ops so that cpp_kernel_name and python_kernel_name are constructed from there. This is an important step towards supporting those MKLDNN ops in the ABI-compatible mode, because we need to read the schema from op_overload to generate the correct fallback op call in C++.

Differential Revision: [D60909798](https://our.internmc.facebook.com/intern/diff/D60909798)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132367
Approved by: https://github.com/leslie-fang-intel, https://github.com/angelayi
2024-08-08 03:02:29 +00:00
Xu Han
59bbaea3a7 [inductor] disable capture_pre_autograd_graph related UTs on Windows (#132848)
Continuation of https://github.com/pytorch/pytorch/pull/132841

We disable the `capture_pre_autograd_graph`-related UTs on Windows: `test_lstm_packed_change_input_sizes` and `test_multihead_attention`.

**TODO:**
Turn them back on after fixing the `capture_pre_autograd_graph` issue on Windows.

## Local Test:
On Linux, the tests are not skipped:
<img width="1387" alt="image" src="https://github.com/user-attachments/assets/28dfbb4b-d9c0-4d5b-be84-d7b3697bcd3f">

And they are skipped on Windows:
<img width="853" alt="image" src="https://github.com/user-attachments/assets/e96ebcf8-9bf3-43aa-93fd-fb33d3743573">

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132848
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-07 19:38:03 +00:00
Gabriel Ferns
c3ee07c71c add missing profiler include in cpp code generation (#132419)
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the generated code. This requires including the header <ATen/record_function.h>, but the conditional for emitting that include didn't check config.profiler_mark_wrapper_call.
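
A minimal sketch of the intended condition (the function name and structure are simplified assumptions, not the actual Inductor codegen):
```python
from torch._inductor import config

# Hedged sketch: emit the profiling include whenever any option that produces
# RECORD_FUNCTION annotations is enabled, including
# config.profiler_mark_wrapper_call, which the old conditional did not check.
def profiling_header_lines():
    lines = []
    if config.cpp.enable_kernel_profile or config.profiler_mark_wrapper_call:
        lines.append("#include <ATen/record_function.h>")
    return lines
```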

Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.

```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s

OK
```

Fixes https://github.com/pytorch/pytorch/issues/131339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 13:40:47 +00:00
Gabriel Ferns
2138a710eb enable test_max_pool2d6 after resolving empty array (#132219)
Related to Issue: https://github.com/pytorch/pytorch/issues/131335
Resolving PR: https://github.com/pytorch/pytorch/pull/132023

Test output:
```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (enable-test-max-pool2d6)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cpu_cpp_wrapper.py -k test_max_pool2d6
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
.
----------------------------------------------------------------------
Ran 2 tests in 8.668s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132219
Approved by: https://github.com/desertfire
2024-07-31 19:13:54 +00:00
Wu, Chunyuan
30e7fc0fe1 Cpp wrapper: set args to CppWrapperKernelArgs in cpp template kernel (#129557)
Fix the compilation error:
```cpp
/tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16*’ {aka ‘const c10::BFloat16*’}
  401 |     cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1);
      |                        ^~~~~~
      |                        |
      |                        at::Tensor
```

The generated code after the fix will be:
```cpp
cpp_fused_div_mm_0((bfloat16*)(arg2_1.data_ptr()), (bfloat16*)(constant2.data_ptr()), (bfloat16*)(_frozen_param1.data_ptr()), (bfloat16*)(buf1.data_ptr()));
```

Multiple changes are required for ABI-compatible mode; those are separated into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557
Approved by: https://github.com/leslie-fang-intel
2024-07-29 04:01:17 +00:00
Wu, Chunyuan
632910e2a8 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 skips the tests in both ABI-compatible and non-ABI-compatible mode.
They should not be skipped in non-ABI-compatible mode, since they run successfully there and only have issues in ABI-compatible mode.

We leverage the existing `xfail_list` for tests that fail only in ABI-compatible mode (a minimal sketch of this gating follows below).

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail in my local run (with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR, so I didn't add it to the `xfail_list`.
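
A minimal sketch of the gating idea under an assumed harness (the real test file wires this differently; list contents and helper name are illustrative):
```python
import unittest
from torch._inductor import config

# Hedged sketch: apply expected-failure marking only when ABI-compatible mode
# is enabled, so the same tests still run (and can pass) otherwise.
xfail_list = ["test_qlinear_add_cpu"]

def apply_abi_compatible_xfails(test_cls):
    if getattr(config, "abi_compatible", False):
        for name in xfail_list:
            if hasattr(test_cls, name):
                setattr(test_cls, name, unittest.expectedFailure(getattr(test_cls, name)))
    return test_cls
```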

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-21 07:19:28 +00:00
Sam Larsen
571a0db132 [inductor] Fix logging for run_and_get_cpp_code (#128794)
Summary: Found during testing with remote caching: use the same output logger object in graph.py and codecache.py, since it is the one patched by `run_and_get_cpp_code`. That lets us capture any logging produced from the codecache path when using `run_and_get_cpp_code`. Also fixed a few tests that were mistakenly passing because this logging was missing.
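
For context, a minimal usage sketch of `run_and_get_cpp_code` (the model and final check are placeholders, not from this PR):
```python
import torch
from torch._inductor import config
from torch._inductor.utils import run_and_get_cpp_code

def fn(x):
    return x.relu() + 1

# Hedged sketch: run_and_get_cpp_code compiles fn and returns its result
# together with the generated wrapper source captured from the output-code
# logger; tests then assert on strings in that source.
with config.patch(cpp_wrapper=True):
    result, code = run_and_get_cpp_code(torch.compile(fn), torch.randn(8, 8))
assert len(code) > 0  # illustrative check only
```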

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794
Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel
2024-06-19 21:32:34 +00:00
Colin Peppler
3a185778ed [aotinductor] Add torch.polar fallback op for shim v2 (#128722)
Compilation error:
```
$ TORCHINDUCTOR_C_SHIM_VERSION=2 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_LOGS_FORMAT="%(pathname)s:%(lineno)s: %(message)s" TORCH_LOGS="+output_code" python test/inductor/test_cpu_cpp_wrapper.py -k test_polar

/tmp/tmp2sp128xj/dy/cdypvu3hvgg3mwxydwbiuddsnmuoi37it3mrpjktcnu6vt4hr3ki.cpp:59:33: error: ‘aoti_torch_cpu_polar’ was not declared in this scope; did you mean ‘aoti_torch_cpu_topk’?
```

Steps:
1. Add aten.polar
2. run `python torchgen/gen.py --update-aoti-c-shim`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128722
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-19 05:06:58 +00:00
PyTorch MergeBot
a584b2a389 Revert "Add test to xfail_list only for abi_compatible (#128506)"
This reverts commit df85f34a14.

Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to The failure shows up in trunk df85f34a14 ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2177744578))
2024-06-19 04:59:10 +00:00
Wu, Chunyuan
df85f34a14 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 skips the tests in both ABI-compatible and non-ABI-compatible mode.
They should not be skipped in non-ABI-compatible mode, since they run successfully there and only have issues in ABI-compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-19 01:18:37 +00:00
PyTorch MergeBot
c8e9656a12 Revert "Add test to xfail_list only for abi_compatible (#128506)"
This reverts commit 49366b2640.

Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes an inductor test to fail in trunk 49366b2640 ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2166824714))
2024-06-13 21:30:07 +00:00
Wu, Chunyuan
49366b2640 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 skips the tests in both ABI-compatible and non-ABI-compatible mode.
They should not be skipped in non-ABI-compatible mode, since they run successfully there and only have issues in ABI-compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-13 15:32:15 +00:00
Prachi Gupta
fc6e3ff96d [ROCm] Update triton pin to fix libtanh issue (#125396)
There were some internal build issues related to tanh when we moved to upstream triton in ROCm. These issues were fixed by the following triton commit: https://github.com/triton-lang/triton/pull/3810 . This PR moves the triton pin to incorporate that change. Added some skips for unit tests that regressed due to the triton commit bump in this PR.

Needs https://github.com/pytorch/pytorch/pull/127968 since this PR introduces a triton dependency on llnl-hatchet, which doesn't have py3.12 wheels available currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-06-07 16:23:04 +00:00
Catherine Lee
fba21edf5b [CI] Ensure inductor/test_cpu_cpp_wrapper is actually run in inductor_cpp_wrapper_abi_compatible (#126717)
`inductor/test_cpu_cpp_wrapper` is not actually being run in `inductor_cpp_wrapper_abi_compatible` test config

The cpu device type gets removed in d28868c7e8/torch/testing/_internal/common_device_type.py (L733), so d28868c7e8/test/inductor/test_cpu_cpp_wrapper.py (L396) returns false.

Feel free to make a PR with a different way to do this (a better RUN_CPU check?)

Adds a skip for a failing test; I am not equipped to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126717
Approved by: https://github.com/ZainRizvi
2024-06-06 18:23:52 +00:00
PyTorch MergeBot
58b461d57a Revert "[ROCm] Update triton pin to fix libtanh issue (#125396)"
This reverts commit 19333d1eb9.

Reverted https://github.com/pytorch/pytorch/pull/125396 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/125396#issuecomment-2142638237))
2024-05-31 16:51:39 +00:00
Bin Bao
413b81789f [AOTI][refactor] Unify val_to_arg_str and val_to_cpp_arg_str (#126916)
Summary: Now that fallback argument type information is passed through, it's time to unify val_to_arg_str and val_to_cpp_arg_str.

Differential Revision: [D57907751](https://our.internmc.facebook.com/intern/diff/D57907751)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126916
Approved by: https://github.com/chenyang78
2024-05-31 13:56:11 +00:00
Prachi Gupta
19333d1eb9 [ROCm] Update triton pin to fix libtanh issue (#125396)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396
Approved by: https://github.com/pruthvistony, https://github.com/nmacchioni
2024-05-30 19:26:58 +00:00
Xia, Weiwen
45f2d09452 [Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593)
**Description**
Lower the qlinear-binary post-op pattern to Inductor. Use the post op sum (in-place) if the extra input has the same dtype as the output; otherwise, use binary add.
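
A minimal sketch of that selection rule (the helper name is illustrative, not the actual lowering code):
```python
import torch

# Hedged sketch: the in-place "sum" post op accumulates into the extra input's
# buffer, so it is only chosen when that input already has the output's dtype;
# otherwise fall back to an out-of-place binary "add".
def choose_binary_post_op(extra_input_dtype, output_dtype):
    return "sum" if extra_input_dtype == output_dtype else "add"

print(choose_binary_post_op(torch.bfloat16, torch.bfloat16))  # "sum"
print(choose_binary_post_op(torch.float32, torch.bfloat16))   # "add"
```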

**Supported linear-binary(-unary) patterns**
```
    linear(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y

1. int8-mixed-fp32
+---+---------------+-----------+------------------------------+---------+
| # | Add type      | Quant out | Pattern                      | Post op |
+---+---------------+-----------+------------------------------+---------+
| 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
+---+---------------+-----------+------------------------------+---------+
| 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
+---+---------------+-----------+------------------------------+---------+

2. int8-mixed-bf16
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| # | X2 dtype | Add type      | Quant out | Pattern                                          | Post op |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 1 | BF16     | In-/out-place | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 2 | BF16     | In-/out-place | No        | linear + bf16 -> (relu)                          | sum     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 3 | FP32     | Out-place     | Yes       | linear + fp32 -> (relu) -> q                     | add     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 4 | FP32     | Out-place     | No        | linear + fp32 -> (relu)                          | sum     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 5 | FP32     | In-place left | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 6 | FP32     | In-place left | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
```
Note:
(1) The positions of linear and the extra input can be swapped.
(2) By recipe, we don't insert q-dq before the extra input of linear-add. If q-dq is found at the extra input, we don't match that pattern, because we cannot match all these patterns in 3 passes.

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122593
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
2024-05-17 07:46:48 +00:00
Bin Bao
0332b5812e [AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (#126183)
Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes https://github.com/pytorch/pytorch/issues/121809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126183
Approved by: https://github.com/angelayi
ghstack dependencies: #126181, #126182
2024-05-16 17:07:06 +00:00
Bin Bao
ed48ea9997 [AOTI] Refine the C shim autogen mechanism (#125589)
Summary: Based on the discussions in https://github.com/pytorch/pytorch/pull/120513, instead of auto-generating C shim fallback ops for thousands of ops, we maintain a list of fallback ops based on torch/_inductor/lowering.py and only generate C shim functions for those ops. At torchgen time, we re-generate the C shim files and compare the header contents against the existing C shim headers. If there is any change, compilation fails with a prompt on how to proceed. This keeps the ABI-compatible C shim layer small enough to maintain in the long run.

Differential Revision: [D57004046](https://our.internmc.facebook.com/intern/diff/D57004046)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125589
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/albanD, https://github.com/ezyang
2024-05-09 02:48:16 +00:00
Bin Bao
a988b4ed76 [AOTI] Generate mul_Scalar instead of mul_Tensor (#125397)
Summary: Fix https://github.com/pytorch/pytorch/issues/117365. When the second argument to aten.mul.Tensor is a scalar (e.g. a scale factor), the cpp wrapper is expected to generate a call to mul_Scalar when the op falls back (e.g. for the Complex dtype).
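
A minimal sketch of the overload choice (the helper is illustrative; the real cpp wrapper works on IR nodes rather than live values):
```python
import torch

# Hedged sketch: when the second argument of aten.mul is a Python number rather
# than a Tensor, the fallback should target the Scalar overload so the generated
# C++ call matches the actual schema.
def mul_fallback_overload(other):
    if isinstance(other, torch.Tensor):
        return torch.ops.aten.mul.Tensor
    return torch.ops.aten.mul.Scalar
```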

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125397
Approved by: https://github.com/chenyang78
ghstack dependencies: #125329
2024-05-03 18:35:42 +00:00
Bin Bao
e84a5b6cc0 [AOTI] Add missing std::move for constant args (#125329)
Summary: fix https://github.com/pytorch/pytorch/issues/123187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125329
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-05-03 18:35:42 +00:00
Bin Bao
bb37910e30 [AOTI] Fixes ScatterFallback codegen (#124580)
Summary: For https://github.com/pytorch/pytorch/issues/123184. ScatterFallback currently relies on op name matching for codegen, which makes its cpp codegen fragile. Refactor to use op_overload and fix the relevant unit test failures.

Differential Revision: [D56417815](https://our.internmc.facebook.com/intern/diff/D56417815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124580
Approved by: https://github.com/chenyang78
2024-04-22 20:47:26 +00:00
chunyuan
d0211e207c inductor cpp wrapper: add GIL release back (#123897)
Fixes https://github.com/pytorch/pytorch/issues/123517.
This PR adds the GIL release (originally added in https://github.com/pytorch/pytorch/pull/111888) back following the suggestion here: https://github.com/pytorch/pytorch/pull/123897#discussion_r1562509705.
We added a default constructor and an assignment operator to the `RAIIPyObject` class (https://github.com/pytorch/pytorch/pull/123897#discussion_r1566262575) so that `custom_op_wrapper` can be declared outside of the GIL acquisition scope.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123897
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-04-17 07:18:14 +00:00
Aaron Orenstein
2bcc83dfbd Preserve dispatch state across function tracing (#122073)
If we throw an exception in the "wrong" place, we can leave the dispatcher in a bad state, which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane afterwards.

Also, fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense), which affected follow-on code. Fixed that to reset afterwards as well.

Repro:

before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
======== 1 passed, 6173 deselected in 5.21s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
========= 1 skipped, 6172 deselected, 1 error in 5.29s =========
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 6173 deselected in 5.42s =====================
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ======================
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073
Approved by: https://github.com/zou3519
2024-04-10 18:57:01 +00:00