Commit Graph

19 Commits

Author SHA1 Message Date
Jez Ng
178ce1433c Hoist out auxiliary values in optional-typed arguments (#123613)
This fixes #123176, and partially addresses #121814 too. #123176 uses an
optional device arg while #121814 uses an optional list arg.

For optional arguments that have auxiliary info -- specifically, tuples
/ lists with their length parameter, and device types with their device
index -- we need to hoist out the extra argument. E.g. when passing a
device with ID 1, we want to emit

```
auto var_0 = cached_torch_device_type_cpu;
aoti_torch_foo(..., &var_0, 1);
```

instead of the (syntactically incorrect)

```
auto var_0 = cached_torch_device_type_cpu,1;
aoti_torch_foo(..., &var_0);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123613
Approved by: https://github.com/desertfire
2024-04-09 20:17:35 +00:00
Jez Ng
1b9eebb6bb [AOTI] Handle null outputs (#123460)
Summary:

I skipped over the codegen for output handle assignment if the outputs
are null -- in addition to being redundant, it was causing compile
errors.

I also modified the runtime to do the necessary null checks.

Fixes #123173.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123460
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-04-08 23:07:03 +00:00
Adnan Akhundov
63c221b7fa Clone mutated inputs in first pass of CPP wrapper compilation (#123316)
Summary: CPP wrapper compilation is currently done in two passes: in the first pass, a Python wrapper is generated and run to compile Triton kernels as a side effect; in the second pass, a C++ wrapper is generated and compiled. When model inputs are mutated, running the Python wrapper in the first pass mutates the inputs, even though the first pass (including the Python wrapper run) is strictly part of the compilation process and hence must not introduce any side effects on the example inputs.

In this PR, we clone mutated inputs in the first pass to avoid input mutation.
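
As a rough sketch of the idea (the wrapper below is a stand-in for the generated Python wrapper, not actual Inductor output):

```
import torch

def generated_wrapper(args):
    # Stand-in for the generated Python wrapper: it mutates its first
    # input in place while (in the real flow) compiling Triton kernels.
    args[0].add_(1)
    return args[0] * 2

example_inputs = [torch.zeros(4)]

# First pass: run the wrapper on clones of the mutated inputs so the
# caller's example inputs are left untouched.
cloned_inputs = [x.clone() for x in example_inputs]
generated_wrapper(cloned_inputs)

assert torch.equal(example_inputs[0], torch.zeros(4))  # no side effect
```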

Fixes https://github.com/pytorch/pytorch/issues/117364.

Test Plan:

```
$ TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k test_inductor_layout_optimization_input_mutations_cuda
...
.
----------------------------------------------------------------------
Ran 1 test in 6.368s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123316
Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire
2024-04-05 21:47:19 +00:00
Bin Bao
aa063054ce [AOTI] Fix the codegen for aten.randint.low_out (#123346)
Summary: Fixes https://github.com/pytorch/pytorch/issues/123174. There are two problems here:
* Incorrectly calling convert_arrayref_tensor_to_tensor on int arguments. The fix removes the relevant code, since we don't use ArrayRef when there is a fallback op.
* codegen_kwargs generates an argument for the out parameter of ExternKernelOut. The fix is to leave that logic to the corresponding wrapper codegen (see the sketch below).
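
For reference, a minimal Python-level call that exercises this overload (an illustration, not taken from the PR's tests):

```
import torch

# aten.randint.low_out: the randint variant with an explicit `low`
# and an `out` tensor, which the wrapper codegen must handle.
out = torch.empty(4, dtype=torch.int64)
torch.randint(0, 10, (4,), out=out)
print(out)
```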

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123346
Approved by: https://github.com/chenyang78
2024-04-04 23:23:50 +00:00
Bin Bao
0c6e8af257 [AOTI][refactor] Update some test cases (#123093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123093
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
2024-04-03 00:51:11 +00:00
chunyuan
8b7da5b791 Inductor cpp wrapper: fix dtype of ShapeAsConstantBuffer (#122297)
For `at::scalar_tensor`, the default dtype will be `float` ([link to scalar_tensor](0d8e960f74/aten/src/ATen/native/TensorFactories.cpp (L856)), [link to default dtype](0d8e960f74/c10/core/TensorOptions.h (L551))) if we don't set the `dtype` value. However, the input scalar value is not necessarily a `float` value. With `torch::tensor(x)`, the dtype of the tensor is decided according to the dtype of the scalar.
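
The same behavior can be observed from Python (an analogue of the C++ calls above):

```
import torch

# torch.scalar_tensor falls back to the default dtype (float32),
# while torch.tensor infers the dtype from the scalar itself.
print(torch.scalar_tensor(2).dtype)  # torch.float32
print(torch.tensor(2).dtype)         # torch.int64
```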

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122297
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-01 01:32:41 +00:00
Bin Bao
537cd66e73 [Inductor] Support custom op in JIT with cpp wrapper (#122554)
Summary: Calling custom ops in an ABI-compatible way requires doing a boxed call with varargs across the C shim. In JIT mode, we can get around this by calling into Python. https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of the generated code.

Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-26 18:48:45 +00:00
Sam Larsen
535bc71d03 Enable FX graph caching in another batch of inductor tests (#121697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121697
Approved by: https://github.com/eellison
2024-03-15 19:38:51 +00:00
Bin Bao
818b14025a [AOTI][refactor] Remove is_legacy_abi_kernel and abi_compatible_kernel (#121523)
Summary: is_legacy_abi_kernel was used for the _scaled_dot_product_flash_attention fallback. It is now only needed for C shim kernel name matching, and that matching is done with a direct string comparison. Also consolidate the fallback cpp kernel naming logic in CppWrapperCpu.

Differential Revision: [D54727789](https://our.internmc.facebook.com/intern/diff/D54727789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121523
Approved by: https://github.com/chenyang78
2024-03-14 22:05:38 +00:00
Bin Bao
0339f1ca82 [Inductor] Allocate another shard for testing cpp-wrapper JIT (#121310)
Summary: The ABI-compatible mode for cpp wrapper has not been turned on by default, so test it separately. We expect to add more tests to this shard.

Differential Revision: [D54617287](https://our.internmc.facebook.com/intern/diff/D54617287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121310
Approved by: https://github.com/chenyang78
ghstack dependencies: #121309
2024-03-07 14:24:21 +00:00
Xia, Weiwen
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**Description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of the activation + a statically quantized linear.
The previous implementation of `onednn.qlinear_pointwise` was for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we add a variant, `onednn.qlinear_pointwise.tensor`, to support that case.
This feature targets the PyTorch 2.3 release.
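
For intuition, the eager-mode analogue of this flow (a hedged sketch of dynamic quantization at the module level, not the X86Inductor lowering itself):

```
import torch

# The activation is dynamically quantized at runtime and fed to a
# quantized linear kernel, mirroring the pattern described above.
m = torch.nn.Sequential(torch.nn.Linear(32, 32))
qm = torch.ao.quantization.quantize_dynamic(m, {torch.nn.Linear}, dtype=torch.qint8)
print(qm(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```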

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparams` to Inductor**
Before:
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms

After:
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: a module with a single Linear layer, dynamically quantized and lowered to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
Bin Bao
946ea47a4f [inductor] Fix an internal test issue (#118903)
Summary: test_add_complex4, introduced in https://github.com/pytorch/pytorch/pull/117929, fails internally because of a cpp compilation issue on CPU. Specify the right device in the test instead.

Differential Revision: [D53333919](https://our.internmc.facebook.com/intern/diff/D53333919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118903
Approved by: https://github.com/clee2000
2024-02-02 03:18:12 +00:00
hodavand
8026534a2f Add torch.complex128 and torch.complex32 to DTYPE_TO_ATEN dictionary. (#117929)
Fixes #117370
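
For context, a sketch of what such a mapping looks like; the entries below are assumed for illustration, not copied from the PR:

```
import torch

# Assumed shape of Inductor's DTYPE_TO_ATEN mapping; the values here
# are illustrative, not taken from the actual source.
DTYPE_TO_ATEN = {
    torch.complex32: "at::kComplexHalf",
    torch.complex64: "at::kComplexFloat",
    torch.complex128: "at::kComplexDouble",
}
print(DTYPE_TO_ATEN[torch.complex128])
```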

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117929
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-01-31 19:34:58 +00:00
chunyuan
1ae39a372e Inductor cpp wrapper: fix cumsum codegen (#116171)
Fixes https://github.com/pytorch/pytorch/issues/115829

For `cumsum(Tensor self, int dim, *, ScalarType? dtype=None) -> Tensor`, `dim` is not a `kwarg_only` argument, but it can still be provided as a kwarg when calling this op.
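
A minimal illustration of the two call forms:

```
import torch

x = torch.randn(2, 3)
# `dim` is positional-or-keyword in the schema, so both spellings are
# valid; the cpp wrapper codegen must accept the kwarg form as well.
assert torch.equal(torch.cumsum(x, 1), torch.cumsum(x, dim=1))
```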

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116171
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2024-01-03 05:33:17 +00:00
Bin Bao
a81edf9f23 [inductor] Fix cpp_wrapper codegen for ir.ComplexView (#116481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116481
Approved by: https://github.com/htyu
2024-01-02 05:38:58 +00:00
Bin Bao
a597a00c87 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115972)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so they are declared in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.
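
A hedged sketch of the resulting hierarchy; only the two field names come from this summary, the class structure is assumed:

```
# The shared kernel-name fields live on the base class, so both
# subclasses inherit them instead of each declaring their own.
class ExternKernel:
    def __init__(self):
        self.python_kernel_name = None
        self.cpp_kernel_name = None

class ExternKernelAlloc(ExternKernel):
    pass

class ExternKernelOut(ExternKernel):
    pass
```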

This is a reland of https://github.com/pytorch/pytorch/pull/115831

Differential Revision: [D52290900](https://our.internmc.facebook.com/intern/diff/D52290900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115972
Approved by: https://github.com/chenyang78
2023-12-20 03:22:03 +00:00
Jiong Gong
715d663794 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 21:21:10 +00:00
PyTorch MergeBot
66994bca5f Revert "[inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)"
This reverts commit 653acd8fe1.

Reverted https://github.com/pytorch/pytorch/pull/115479 on behalf of https://github.com/desertfire due to will cause land race in fbcode because https://github.com/pytorch/pytorch/pull/115831 is already landed internally ([comment](https://github.com/pytorch/pytorch/pull/115479#issuecomment-1857979948))
2023-12-15 14:35:40 +00:00
Jiong Gong
653acd8fe1 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 04:04:08 +00:00