**BC-breaking note**:
This PR deprecates `torch.lu` in favor of `torch.linalg.lu_factor`.
An upgrade guide is added to the documentation for `torch.lu`.
Note this PR DOES NOT remove `torch.lu`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77636
Approved by: https://github.com/malfet
Summary:
compilation_preference is one of:
ANEURALNETWORKS_PREFER_LOW_POWER = 0
ANEURALNETWORKS_PREFER_FAST_SINGLE_ANSWER = 1
ANEURALNETWORKS_PREFER_SUSTAINED_SPEED = 2
relax_f32_to_f16 calls Model_relaxComputationFloat32toFloat16
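For reference, the three preference values listed above can be written as a Python enum. This is only an illustrative sketch: the class name `CompilationPreference` is hypothetical and is not claimed to be part of the converter; the numeric values are the ones listed in this summary.

```python
from enum import IntEnum


class CompilationPreference(IntEnum):
    """Hypothetical helper mirroring the preference codes listed above."""

    ANEURALNETWORKS_PREFER_LOW_POWER = 0
    ANEURALNETWORKS_PREFER_FAST_SINGLE_ANSWER = 1
    ANEURALNETWORKS_PREFER_SUSTAINED_SPEED = 2


# Look a preference up by its integer code.
print(CompilationPreference(2).name)
```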
Test Plan:
Tested on device with nnapi models
* Works with existing exported models
* Works with new exported models with options
Differential Revision: D36433236
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78758
Approved by: https://github.com/kimishpatel
(reopening due to botched merge)
The cuDNN V8 API (main support merged in https://github.com/pytorch/pytorch/pull/60755) potentially exposes many more kernels with benchmark=True. While these additional kernels can improve performance, it is often unnecessary to run every kernel returned by the heuristic and doing so may degrade the user experience by causing the first model iteration to be very slow. To alleviate this issue, this PR introduces torch.backends.cudnn.benchmark_limit. benchmark_limit specifies the maximum number of working cuDNN kernels to try for a given workload, with the default being 10 (similar to what TensorFlow does). benchmark_limit = 0 yields the current behavior of trying every kernel returned by the heuristic.
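The limit semantics described above can be sketched in plain Python. This is not the actual cuDNN benchmarking code: `pick_fastest` and `time_kernel` are hypothetical names, and the sketch only assumes that failing kernels do not count toward the limit (the limit is on *working* kernels) and that a limit of 0 means every candidate is tried.

```python
from typing import Callable, Optional, Sequence


def pick_fastest(
    candidates: Sequence[str],
    time_kernel: Callable[[str], Optional[float]],
    limit: int = 10,
) -> Optional[str]:
    """Return the fastest working candidate, timing at most `limit` working ones.

    `time_kernel` returns a runtime, or None if the kernel does not work.
    `limit = 0` means no limit: every candidate is tried.
    """
    best_name, best_time, tried = None, float("inf"), 0
    for name in candidates:
        if limit > 0 and tried >= limit:
            break
        elapsed = time_kernel(name)
        if elapsed is None:
            continue  # Failing kernels do not count toward the limit.
        tried += 1
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


# Made-up timings in heuristic order; "b" stands for a kernel that fails.
timings = {"a": 3.0, "b": None, "c": 2.0, "d": 1.0}
print(pick_fastest(list(timings), timings.get, limit=2))
print(pick_fastest(list(timings), timings.get, limit=0))
```

With a small limit the later (possibly faster) candidates are never timed, which is the trade-off between first-iteration latency and steady-state speed that the limit exposes.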
CC @ptrblck @ngimel @xwang233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77002
Approved by: https://github.com/ngimel
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74441
For xirp based segmentation models, we want to support enumerated input shapes. This allows us to support both landscape and portrait mode images without sacrificing the performance. P488118264
ghstack-source-id: 151736964
Test Plan: `buck run coreml:xirp -- --model="/home/taox/xirp/xirp_20a.pt" --out="/home/taox/xirp/xirp_20a_coreml_enumerated.ptl"`
Reviewed By: mcr229
Differential Revision: D34803184
fbshipit-source-id: c462c0783846a1489ca7ce4d5a654aa6927c9c44
(cherry picked from commit 67d418c97531daaf3d03d1000ca4a4ff60de2a95)
Summary:
This PR adds a new quantization backend, ONEDNN, with quantized conv and linear kernels in the same code path as the FBGEMM backend.
The ONEDNN backend is an alternative to the FBGEMM and QNNPACK backends. It takes advantage of features of the latest Intel® CPU products. It supports VNNI on Cascade Lake and the AMX instruction set that will be available on Sapphire Rapids, which has 8X int8 peak TOPS over VNNI.
ONEDNN demonstrates better performance on conv kernels of popular CNN models than FBGEMM. It also supports more fused ops, such as convolution-add-ReLU, than FBGEMM and QNNPACK.
To use this backend, users only need to set the quantization backend to 'onednn' before any calculation without a single change to models.
```python
torch.backends.quantized.engine = 'onednn'
```
## Design docs
https://github.com/pytorch/pytorch/issues/21120#issuecomment-562371983
https://github.com/pytorch/pytorch/pull/67177#issuecomment-963787096
## File changes
**Add ONEDNN to qengine list**
- aten/src/ATen/Context.cpp
- c10/core/QEngine.h
- torch/ao/quantization/qconfig.py
- torch/backends/quantized/\_\_init\_\_.py
**Implement qconv & qlinear for ONEDNN backend**
- aten/src/ATen/native/quantized/cpu/conv_serialization.h
- aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp
- aten/src/ATen/native/quantized/cpu/onednn_utils.h
- aten/src/ATen/native/quantized/cpu/qconv.cpp
- aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp
- aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp
- aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp
- aten/src/ATen/native/quantized/cpu/qlinear.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp
**Skip tests that are not supported by ONEDNN**
- test/ao/sparsity/test_kernels.py
- test/quantization/core/test_quantized_module.py
- test/quantization/core/test_quantized_op.py
## Validation results
This PR has passed `test_quantization.py` and `test_mkldnn.py`.
Below are performance data of int8 2d convolution and linear on the Cascade Lake Xeon® platform:
(Note: Tested with single instance on single core. Using the latest oneDNN library.)
**Table 1. Performance comparison of int8 2d convolution operator**
|No.| Shape| FBGEMM| ONEDNN| Gain|
|-|-|-|-|-|
|1| IC=128, OC=128, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0| 668.310us| 535.630us| 24.8%|
|2| IC=128, OC=128, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0| 290.630us| 281.810us| 3.1%|
|3| IC=128, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0| 1.045ms| 893.010us| 17.0%|
|4| IC=128, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0| 385.320us| 373.720us| 3.1%|
|5| IC=256, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0| 1.876ms| 1.641ms| 14.3%|
|6| IC=256, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0| 660.460us| 638.470us| 3.4%|
**Table 2. Performance comparison of int8 linear operator**
|No.| Shape (m, n, k)| FBGEMM| ONEDNN| Gap|
|-|-|-|-|-|
|1| 64, 800, 320| 80.550us| 96.770us| 20.10%|
|2| 64, 768, 512| 101.230us| 130.720us| 29.10%|
|3| 16, 256, 512| 30.230us| 51.450us| 70.20%|
|4| 128, 128, 128| 33.810us| 50.480us| 49.30%|
|5| 256, 512, 256| 154.490us| 195.050us| 26.30%|
|6| 1024, 1024, 1024| 3.134ms| 3.514ms| 12.10%|
ONEDNN showed advantages over FBGEMM for convolution. However, it has a performance gap relative to FBGEMM for linear ops. The gap is a known issue, and further optimization is in progress in the oneDNN library. On the latest platforms, ONEDNN achieves better performance for both conv and linear.
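The percentage columns in the two tables are consistent with simple time ratios. The formulas below are inferred from the reported numbers rather than stated in this PR, and the function names are hypothetical.

```python
def gain_percent(fbgemm_us: float, onednn_us: float) -> float:
    """Speedup of ONEDNN over FBGEMM (assumed formula for the Gain column)."""
    return (fbgemm_us / onednn_us - 1.0) * 100.0


def gap_percent(fbgemm_us: float, onednn_us: float) -> float:
    """Slowdown of ONEDNN relative to FBGEMM (assumed formula for the Gap column)."""
    return (onednn_us / fbgemm_us - 1.0) * 100.0


# Row 1 of Table 1 and row 1 of Table 2, in microseconds.
print(f"{gain_percent(668.310, 535.630):.1f}%")
print(f"{gap_percent(80.550, 96.770):.1f}%")
```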
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69820
Reviewed By: HDCharles
Differential Revision: D33716039
Pulled By: jerryzh168
fbshipit-source-id: 6f7bb807e85798142dfcffccfca8b8bd652fb3dd
(cherry picked from commit 91526b373560f42ba0ad307f9cccfc0eb5218b1f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70621
PyTorch doesn't have support for qint16 yet. Add an option to handle qint16 via int16 & qint32 data types.
* For qint16 tensors in NNAPI, the user sends a qint32 tensor. We convert the qint32 to int16 for the converter and set the zero point and scale for NNAPI
* Inputs to the model must have a fixed scale and zero point and are only supported for testing
* Added a flag use_int16_for_qint16 which will be used to maintain backwards compatibility in the converter when true qint16 is supported in PyTorch
ghstack-source-id: 146507483
Test Plan: pytest test/test_nnapi.py
Reviewed By: dreiss
Differential Revision: D33285124
fbshipit-source-id: b6376fa1bb18a0b9f6a18c545f600222b650cb66
Summary:
Per title.
This PR introduces a global flag that lets PyTorch prefer one of the many backend implementations when calling linear algebra functions on GPU.
Usage:
```python
torch.backends.cuda.preferred_linalg_library('cusolver')
```
Available options (str): `'default'`, `'cusolver'`, `'magma'`.
Issue https://github.com/pytorch/pytorch/issues/63992 inspired me to write this PR. No heuristic is perfect on all devices, library versions, matrix shapes, workloads, etc. We can obtain better performance if we can conveniently switch linear algebra backends at runtime.
Performance of linear algebra operators after this PR should be no worse than before. The flag is set to **`'default'`** by default, which makes everything the same as before this PR.
The implementation of this PR basically follows that of https://github.com/pytorch/pytorch/pull/67790.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67980
Reviewed By: mruberry
Differential Revision: D32849457
Pulled By: ngimel
fbshipit-source-id: 679fee7744a03af057995aef06316306073010a6
Summary:
https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = `
rather than making it the default behavior.
CC ngimel ptrblck
stas00 Note that the behavior after the previous PR can be replicated with
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946
Reviewed By: zou3519
Differential Revision: D32289896
Pulled By: ngimel
fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe
Summary:
The NNAPI converter previously failed for a pointwise binary op with one constant value and one tensor.
Code suggestions from dreiss
Test Plan:
pytest test/test_nnapi.py::TestNNAPI::test_pointwise_binary
Imported from OSS
Reviewed By: anshuljain1
Differential Revision: D28893881
fbshipit-source-id: 59240373fb03c6fdafa4cb2fa4d8408dd20092f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62225
Rewrote the preprocess function for Android NNAPI delegate.
Previously, `preprocess()` called `convert_model_to_nnapi()` using Pybind and returned a NnapiModule that is serialized for mobile. Now, `preprocess()` calls a sub-function of `convert_model_to_nnapi()` and returns several preprocessed items (that were previously components of NnapiModule).
Dictionary returned contains:
"shape_compute_module": torch::jit::Module,
"ser_model": torch::Tensor,
"weights": List[torch.Tensor],
"inp_mem_fmts": List[int],
"out_mem_fmts": List[int]
**Purpose and Future:**
The purpose of these changes is to move more implementation from bytecode and TorchScript to the delegate API, since bytecode is less efficient.
Now, only the shape computation uses bytecode. In the future, shape computation will be moved out of TorchScript as well.
**nnapi_backend_preprocess.cpp:** preprocess implementation
**prepare.py**: refactored a portion of `convert_model_to_nnapi()` to `process_for_nnapi()`, so preprocess can get components of NnapiModule
**Test:**
Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully
ghstack-source-id: 134444190
Test Plan: Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully
Reviewed By: raziel
Differential Revision: D29922279
fbshipit-source-id: cadcf8908d8a745dc7abbe286e97d6ead937d4ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61796
We can easily handle NNAPI conversion of flatten for NHWC inputs
that have 1 channel or whose H and W are both 1
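A short sketch of why these two cases are easy for flatten (the op exercised by the test plan): when C is 1, or when H and W are both 1, an NHWC buffer stores its elements in the same order as an NCHW buffer, so no data reordering is needed. The helper names below are hypothetical and this is not the converter's code.

```python
from itertools import product


def nchw_order(n, c, h, w):
    """Element indices (n, c, h, w) in the order an NCHW buffer stores them."""
    return [
        (ni, ci, hi, wi)
        for ni, ci, hi, wi in product(range(n), range(c), range(h), range(w))
    ]


def nhwc_order(n, c, h, w):
    """Element indices (n, c, h, w) in the order an NHWC buffer stores them."""
    return [
        (ni, ci, hi, wi)
        for ni, hi, wi, ci in product(range(n), range(h), range(w), range(c))
    ]


def flatten_is_layout_independent(n, c, h, w):
    """True when both layouts store the elements in the same order."""
    return nchw_order(n, c, h, w) == nhwc_order(n, c, h, w)


print(flatten_is_layout_independent(2, 1, 3, 4))  # one channel
print(flatten_is_layout_independent(2, 5, 1, 1))  # H and W are both 1
print(flatten_is_layout_independent(1, 2, 1, 2))  # general case
```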
Test Plan:
pytest test/test_nnapi.py::TestNNAPI::test_flatten
Imported from OSS
Reviewed By: saketh-are
Differential Revision: D29827735
fbshipit-source-id: 65dee4b42fceef1b032bf5dd1c4cc6e020d01e14
Summary:
To add a serializer for custom ops, we can subclass the default serializer
and update ADDER_MAP
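The subclassing pattern can be sketched as follows. `BaseSerializer` is a hypothetical stand-in written for this example, not the real serializer class, and `custom::my_op` is a made-up op name; only the idea of a class-level `ADDER_MAP` that a subclass extends comes from this summary.

```python
class BaseSerializer:
    """Hypothetical stand-in for the default serializer."""

    # Maps an op name to the function that serializes that op.
    ADDER_MAP = {
        "aten::relu": lambda self, node: self.add_relu(node),
    }

    def add_relu(self, node):
        return f"RELU({node})"

    def add_node(self, kind, node):
        adder = self.ADDER_MAP.get(kind)
        if adder is None:
            raise ValueError(f"Unsupported node kind: {kind}")
        return adder(self, node)


class CustomSerializer(BaseSerializer):
    """Subclass that registers one custom op."""

    # Copy the parent's map so the base class is left unchanged.
    ADDER_MAP = {
        **BaseSerializer.ADDER_MAP,
        "custom::my_op": lambda self, node: self.add_my_op(node),
    }

    def add_my_op(self, node):
        return f"MY_OP({node})"


serializer = CustomSerializer()
print(serializer.add_node("aten::relu", "x"))
print(serializer.add_node("custom::my_op", "x"))
```

Building a new dict in the subclass, rather than mutating the inherited one, keeps the custom op out of the default serializer.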
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61025
Test Plan:
* pytest test/test_nnapi.py::TestNNAPI for current serializer
* Custom serializers to be tested with custom ops
Imported from OSS
Reviewed By: anshuljain1
Differential Revision: D29480745
fbshipit-source-id: 37e3f8de3c97f6c8a486f9879ce11430ea89af34
Summary: As title
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_cat
Reviewed By: anshuljain1
Differential Revision: D29480747
fbshipit-source-id: 161803054ff1a4c2c750fc30a5f0fc6d8a24b2c9
Summary:
Same as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61021
Test Plan: pytest test/test_nnapi.py::TestNNAPI
Reviewed By: anshuljain1
Differential Revision: D29480746
fbshipit-source-id: 7217c8f3a811db8c3c373f3e7ca31caf9502ef22
Summary:
Add support for aten::slice op in the NNAPI model converter
* If start = 0; end = max -> identity
* Flexible shapes can be passed through
* Flexible shapes can't be sliced over
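The three rules above can be sketched as a shape computation. This is an illustrative sketch, not the converter code: `slice_out_shape` is a hypothetical name, a dimension of size 0 is assumed to mark a flexible size, `sys.maxsize` is assumed to be the "max" end value, and only step 1 with non-negative indices is handled.

```python
import sys
from typing import List, Sequence


def slice_out_shape(in_shape: Sequence[int], dim: int, start: int, end: int) -> List[int]:
    """Output shape of a step-1 slice along `dim`; size 0 marks a flexible dimension."""
    out = list(in_shape)
    size = in_shape[dim]
    if size == 0:
        # Flexible dimension: only the identity slice can pass through.
        if start == 0 and end == sys.maxsize:
            return out
        raise ValueError("cannot slice over a flexible dimension")
    end = min(end, size)
    out[dim] = max(end - start, 0)
    return out


# Slice a fixed dimension while a flexible dimension passes through untouched.
print(slice_out_shape([1, 0, 8], dim=2, start=2, end=6))
# Identity slice over the flexible dimension.
print(slice_out_shape([1, 0, 8], dim=1, start=0, end=sys.maxsize))
```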
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59364
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_slice
Reviewed By: anshuljain1
Differential Revision: D28881039
fbshipit-source-id: 3c1c630ff27b5bba6eda403d87570c61d43ae90e
Summary:
* Add support for aten::detach op in the NNAPI model converter as a no-op
* Also add flexible op support for add_pointwise_simple_unary_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58543
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_detatch
Reviewed By: anshuljain1
Differential Revision: D28531942
fbshipit-source-id: 4387dbbbadd8ce6b690841f3a903e68a380b849d
Summary:
Add support for aten::div op in the NNAPI model converter. Startup time
variable size isn't supported, because shapes are passed as inputs to the NNAPI op.
Runtime variable size support is to be added soon
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60885
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_flatten
Reviewed By: anshuljain1
Differential Revision: D29451725
fbshipit-source-id: 8902745f7758c8cc88ad4b4ce02b8301ff894bd4
Summary:
Add support for aten::div op in the NNAPI model converter. Add variable
size input test as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58541
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_div
Reviewed By: anshuljain1
Differential Revision: D28531943
fbshipit-source-id: e96342146f6de216f7b88443618edfc54963747c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58540
Add support for aten::to op in the NNAPI model converter for simple
cases like to("cpu"), to("gpu")
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_to
Reviewed By: anshuljain1
Differential Revision: D28531941
fbshipit-source-id: 0c934f7aceaff2669307c3426efe32046d8c44f3
Summary:
Add support for aten::softmax op in the NNAPI model converter with
flexible size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58539
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_softmax
Reviewed By: anshuljain1
Differential Revision: D28531946
fbshipit-source-id: 8633f3e3f7f52795f9866ff16ad0867ea36a19e8
Summary:
Add support for aten::avgpool2d op in the NNAPI model converter with var
size support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58538
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_avgpool2d
Reviewed By: anshuljain1
Differential Revision: D28531944
fbshipit-source-id: 43ff8c9389365698c282f204042b49c7ec84d824
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57563
Add flexible size support for upsample_nearest2d op in nnapi model conversion
Test Plan:
pytest test/test_nnapi.py
Imported from OSS
Reviewed By: dreiss
Differential Revision: D28200847
fbshipit-source-id: 901fe3f6e68e4c16ece730f3ffa68dc88c6ed6c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57562
Add flexible size support for qadd op in nnapi model conversion
Test Plan:
pytest test/test_nnapi.py
Imported from OSS
Reviewed By: dreiss
Differential Revision: D28200849
fbshipit-source-id: d5b2ea8e9eb8ae405ff2c960f7549cef60bc0991
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57561
Add flexible size support for conv2d op in nnapi model conversion
Test Plan:
pytest test/test_nnapi.py
Imported from OSS
Reviewed By: dreiss
Differential Revision: D28200848
fbshipit-source-id: d94ccf48a3d8453aa8e96c7cac02948c4cd870cc
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48141
~Mypy is complaining about a missing arg in a function call.~
```bash
torch/backends/_nnapi/serializer.py:806: error: Too few arguments for "_do_add_binary" [call-arg]
Found 1 error in 1 file (checked 1140 source files)
```
9392137dbe/torch/backends/_nnapi/serializer.py (L804-L806)
~dreiss, would you mind take a look when you have some cycles to spare and see what would be the appropriated value for `fuse_code` here? Thanks :)~
Edit: https://github.com/pytorch/pytorch/issues/48925 got merged a couple of days ago. The blocking part is now unblocked, and I just pushed the changes to make mypy happy again. This PR is ready for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48142
Reviewed By: ezyang
Differential Revision: D28006249
Pulled By: walterddr
fbshipit-source-id: 5e43eeba7143512a549efaad31541f86718add7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54701
We need NNAPI models to support inputs (and, by extension, intermediate
values and outputs) whose shape is only determined at load time. For
example, a vision model's input shape might be dependent on the aspect
ratio of the device camera. While NNAPI has full support for variable
shapes (by setting components of the operand shape to 0), the guidance
we have received is that vendor-provided drivers for real hardware are
not able to support this efficiently. Therefore, we take a hybrid
approach where shapes are calculated at model load time to
semi-dynamically construct our NNAPI model. While this doesn't let us
have truly dynamic input shapes, it does allow us to ensure that the
vendor driver only sees fixed shapes, so we get maximum performance.
In this initial commit, only PReLU supports dynamic shapes. Additional
operators will be converted in separate diffs.
- In order to convert a flexible-shape model, the user supplies inputs
with shapes containing dimensions of size 0 for the flexible
dimensions.
- During conversion, we generate code to compute the shapes of all
intermediates and outputs as a function of the input shapes.
- We no longer run the input model to produce the output templates.
Instead, we generate code to return properly-sized templates, given
the input shapes.
- All of this generated code goes into a "ShapeComputeModule" that is
used by the NnapiModule during initialization.
- The ShapeComputeModule mutates the serialized model to fill in the
computed sizes for each operand. This requires us to change the dtype
for the serialized model to int32, but this should be fine because
everything in it is already 4-byte aligned.
- NnapiInitWrapper no longer exists. Instead, initialization is
performed on the first run, based on the real arguments. We plan to
provide an API for doing eager initialization.
- Unit test updated to allow separate arguments to be given for trace,
conversion, and inference. A flexible-shape test case was added for
PReLU.
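A minimal sketch of the load-time shape resolution described above, assuming the convention from this summary that a dimension of size 0 marks a flexible dimension. `resolve_shape` and `prelu_out_shape` are hypothetical names; the real ShapeComputeModule is generated code and is not reproduced here.

```python
from typing import List, Sequence


def resolve_shape(template: Sequence[int], actual: Sequence[int]) -> List[int]:
    """Fill the flexible (size-0) dimensions of `template` from the real input shape."""
    if len(template) != len(actual):
        raise ValueError("rank mismatch")
    resolved = []
    for expected, got in zip(template, actual):
        if expected != 0 and expected != got:
            raise ValueError(f"fixed dimension expected {expected}, got {got}")
        resolved.append(got)
    return resolved


def prelu_out_shape(in_shape: Sequence[int]) -> List[int]:
    """PReLU is elementwise, so its output shape equals its input shape."""
    return list(in_shape)


# Converted with flexible H and W; the real shape is known at load time.
in_shape = resolve_shape([1, 3, 0, 0], [1, 3, 480, 640])
print(in_shape, prelu_out_shape(in_shape))
```

Because every shape is resolved before the model is handed to NNAPI, the driver only ever sees fixed shapes.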
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536796
Pulled By: dreiss
fbshipit-source-id: 105585f247987b1e6ec6946a6fe44401237cb0a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54700
This is an internal method just to make it more clear what
len(self.operands) is doing.
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536794
Pulled By: dreiss
fbshipit-source-id: 678cee8a47df6757dd2e6feabf2560fd82d32e26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54699
We'll soon be adding support for flexible-size tensors to the NNAPI
converter, but it won't be added to all ops at once. Create
get_tensor_operand_by_jitval_fixed_size as a wrapper for
get_tensor_operand_by_jitval that verifies that the argument has a fixed
shape. Update all call sites. As flexible size support is added to
each op, the call sites can be converted back and proper size checks
added.
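The wrapper pattern can be sketched as below. `OperandTable` is a hypothetical stand-in for the serializer's bookkeeping, the return value is simplified to an `(operand_id, shape)` pair, and a dimension of size 0 is assumed to mark a flexible size; only the two method names come from this summary.

```python
class OperandTable:
    """Hypothetical stand-in for the serializer's operand bookkeeping."""

    def __init__(self):
        self._by_jitval = {}

    def add(self, jitval, operand_id, shape):
        self._by_jitval[jitval] = (operand_id, tuple(shape))

    def get_tensor_operand_by_jitval(self, jitval):
        """Return (operand_id, shape) without any shape check."""
        return self._by_jitval[jitval]

    def get_tensor_operand_by_jitval_fixed_size(self, jitval):
        """Same lookup, but reject operands with a flexible dimension."""
        operand_id, shape = self.get_tensor_operand_by_jitval(jitval)
        if any(dim == 0 for dim in shape):
            raise ValueError(f"flexible size is not supported for this operand: {shape}")
        return operand_id, shape


table = OperandTable()
table.add("x", 0, [1, 3, 4])
print(table.get_tensor_operand_by_jitval_fixed_size("x"))
```

An op that gains flexible size support would switch its call site back to the unchecked lookup and do its own size checks.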
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536791
Pulled By: dreiss
fbshipit-source-id: 6fb1fea814d767b6ff263fd8b88240a51be74777