Summary: D79674759 tried to fix the expensive prepare and convert steps, where `assert_and_get_unique_device` was called multiple times. This change fixes that issue by caching the result with the `functools.cache` decorator.
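A minimal sketch of the idea, with a hypothetical helper standing in for the real `assert_and_get_unique_device` (this is not the actual pytorch code):
```
import functools
from typing import Optional

import torch

@functools.cache
def _get_model_unique_device(model: torch.nn.Module) -> Optional[torch.device]:
    # Hypothetical stand-in for assert_and_get_unique_device: the device set is
    # computed once per model object; repeated calls during prepare/convert
    # reuse the cached result instead of re-walking all parameters and buffers.
    devices = {p.device for p in model.parameters()} | {
        b.device for b in model.buffers()
    }
    assert len(devices) <= 1, f"expected at most one device, got {devices}"
    return next(iter(devices)) if devices else None
```
Since `functools.cache` keys on the model object itself, the cached entry lives for the lifetime of the process (or until the cache is cleared), which is a reasonable trade-off for a one-shot export flow.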
Test Plan:
Verified on LLM export to QNN.
LLM quantization prepare time was reduced from ~20 min to ~3 min.
Rollback Plan:
Differential Revision: D82073679
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162550
Approved by: https://github.com/andrewor14
**Summary:** Previously, we called `assert_and_get_unique_device` once per node in both prepare and convert. This is expensive and unnecessary since the model device is the same across all nodes, so we should just call it once at the beginning and reuse the same model device across all nodes.
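A rough before/after illustration of the change (`get_unique_device` and `process_node` are hypothetical stand-ins, not the real prepare/convert internals):
```
def _for_each_node(model, get_unique_device, process_node):
    # Hypothetical sketch: the model device is looked up once, before the loop
    # (e.g. via assert_and_get_unique_device), and reused for every node
    # instead of being recomputed per node.
    device = get_unique_device(model)
    for node in model.graph.nodes:
        process_node(node, device)
```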
**Test Plan:**
python test/test_quantization.py -k TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159901
Approved by: https://github.com/jerryzh168
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So these 'type: ignore' comments are landed for pytorch in advance of them actually being needed.
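For illustration, the pre-landed suppressions are ordinary error-code-scoped ignores (this snippet is a made-up example, not a line from the diff):
```
class Config:
    pass

cfg = Config()
# mypy flags assignment to an undeclared attribute under the stricter settings;
# the suppression is landed ahead of the config change so the flip is conflict-free.
cfg.verbose = True  # type: ignore[attr-defined]
```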
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
Summary: Previously, `move_exported_model_to_train/eval` only
switched the CPU versions of the batch norm op. This commit adds
support for the CUDA versions of the op too. Note that this fix
is temporary; we won't have to differentiate between these two
cases once we have batch norm consolidation.
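As a rough, hedged sketch of the kind of matching involved (the ATen overload names below are my assumption about the CPU and CUDA variants, not a quote from the diff):
```
import torch

# Assumed op overloads for the CPU and CUDA batch norm variants.
_BN_OPS = (
    torch.ops.aten._native_batch_norm_legit.default,
    torch.ops.aten.cudnn_batch_norm.default,
)

def _find_batch_norm_nodes(model: torch.fx.GraphModule):
    # The real train/eval switch also rewrites arguments/outputs, since the
    # overload signatures differ; this sketch only locates the candidate nodes.
    return [
        n for n in model.graph.nodes
        if n.op == "call_function" and n.target in _BN_OPS
    ]
```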
Test Plan:
python test/test_quantization.py -k test_move_exported_model_bn
Reviewers: jerryzh168
Subscribers: jerryzh168, leslie-fang-intel, supriyar
Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957
Approved by: https://github.com/jerryzh168
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime exceptions, value errors, type errors, or some other more specific error type. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.
I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed overtime and our exception typing can be improved.
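As an illustration of the pattern the rule targets (the rule code in the noqa comment is shown as a placeholder, not taken from the PR):
```
import json
import os

def load_config(path: str) -> dict:
    if not os.path.exists(path):
        # Discouraged (what the rule flags); existing violations were noqa'd, e.g.:
        #     raise Exception("config not found")  # noqa: TRY002
        # Preferred: raise a specific built-in (or custom) exception type.
        raise FileNotFoundError(f"config not found: {path}")
    with open(path) as f:
        return json.load(f)
```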
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
Summary:
new version of this: https://www.internalfb.com/diff/D49110166?dst_version_fbid=252052334533986
Fix a device-assignment error when a module has multiple devices.
If fc_fp16_quantization is enabled for a CPU model, and the module REMOTE_OTHER has multiple devices: {device(type='meta'), device(type='cpu')},
we fail on this assertion in fbcode/caffe2/torch/ao/quantization/fx/utils.py (line 232):
assert len(devices) <= 1, (
Since CPU models run on CPU devices, this adds a condition before the assertion: if CPU is in the module's list of devices, set the device to CPU.
Please see debug details:
https://docs.google.com/document/d/1pMPCeJyMPA15NhFc2uAyNDkS9azR40uaNyOP0DIgHjU/edit
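A rough sketch of the added condition (variable and helper names are illustrative, not the actual utils.py code):
```
from typing import Optional, Set

import torch

def _pick_device(devices: Set[torch.device]) -> Optional[torch.device]:
    # If a CPU model reports multiple devices (e.g. {meta, cpu}), prefer CPU
    # instead of tripping the `len(devices) <= 1` assertion.
    if len(devices) > 1 and torch.device("cpu") in devices:
        return torch.device("cpu")
    assert len(devices) <= 1, f"expected at most one device, got {devices}"
    return next(iter(devices)) if devices else None
```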
Test Plan:
AIMP_DISAGG_CPU=true buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true lego/scripts:lego_cli -- run-locally --model_entity_id 959168967 --config_version 28 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee /tmp/gmpp_lc.txt
Snapshot:
https://www.internalfb.com/manifold/explorer/ads_storage_fblearner/tree/user/facebook/fblearner/predictor/959168967/47
Differential Revision: D51226114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113509
Approved by: https://github.com/jerryzh168
Summary:
The goal is to remove the need to use backend_config when pt2e flow code calls this function
Test Plan:
python test/test_quantization.py TestQuantizeFx
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98094
Approved by: https://github.com/jcaip
Summary:
Previously, prepare_fx returned an ObservedGraphModule and convert_fx returned a QuantizedGraphModule. These subclasses existed to preserve attributes that torch.fx.GraphModule did not preserve. After https://github.com/pytorch/pytorch/pull/92062
we are preserving `model.meta`, so we can now store the attributes in model.meta to preserve them.
With this, we don't need to create a new type of GraphModule in these functions and can use GraphModule directly. This
is useful for quantization in the PyTorch 2.0 flow: if other transformations also use GraphModule, the quantization passes will be composable with them.
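A small sketch of the mechanism (the meta key used here is illustrative, not the one used internally):
```
import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

gm = torch.fx.symbolic_trace(M())
# Plain GraphModule: quantization state goes into gm.meta (preserved after
# pytorch/pytorch#92062) instead of an ObservedGraphModule subclass attribute.
gm.meta["quantization_attrs"] = {"example_key": "example_value"}
assert gm.meta["quantization_attrs"]["example_key"] == "example_value"
```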
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestQuantizePT2E
Imported from OSS
Differential Revision: D42979722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94412
Approved by: https://github.com/vkuzo
Summary: The existing BackendConfig fusion pattern
uses a "reversed nested tuple" format that is highly
unintuitive. For example,
```
linear-relu -> (nn.ReLU, nn.Linear)
conv-bn-relu -> (nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))
```
This pattern format also complicates the signatures
of the user specified "fuser methods", which needed
to accept arguments in reverse nested order to match
the patterns:
```
def fuse_linear_relu(is_qat, relu, linear):
    ...

def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    ...
```
Instead, this commit introduces a new pattern format that
simply specifies the ops in forward order with no nesting:
```
linear-relu -> (nn.Linear, nn.ReLU)
conv-bn-relu -> (nn.Conv2d, nn.BatchNorm2d, nn.ReLU)
def fuse_linear_relu(is_qat, linear, relu):
    ...

def fuse_conv_bn_relu(is_qat, conv, bn, relu):
    ...
```
Note that the legacy "reversed nested tuple" is still
used internally since it is more general. In the
future, we should replace it with the format used in
the subgraph rewriter in `torch.fx`, and simplify the
existing pattern matching code to handle the new
format added in this commit.
BC-breaking Notes:
Before:
```
import torch.nn as nn
import torch.ao.nn.intrinsic as nni
from torch.ao.quantization.backend_config import BackendPatternConfig

def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)

config = BackendPatternConfig((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d)
```
After:
```
def fuse_conv_bn_relu(is_qat, conv, bn, relu):
    return nni.ConvBnReLU2d(conv, bn, relu)

config = BackendPatternConfig((nn.Conv2d, nn.BatchNorm2d, nn.ReLU)) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d)
```
OR (for backward-compatibility)
```
def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)

config = BackendPatternConfig() \
    ._set_pattern_complex_format((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d) \
    ._set_use_legacy_pattern_format(True)
```
Before:
```
backend_config.configs # returns Dict[Pattern, BackendPatternConfig]
```
After:
```
backend_config.configs # returns List[BackendPatternConfig]
```
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestBackendConfig
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Differential Revision: [D41954553](https://our.internmc.facebook.com/intern/diff/D41954553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90698
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
Summary: This commit renames fx/quantization_patterns.py
to fx/quantize_handler.py, and fx/fusion_patterns.py to
fx/fuse_handler.py. This is because these files contain
only QuantizeHandler and FuseHandler respectively, so the
new names are more descriptive. A future commit will
further break BC by removing all the empty *QuantizeHandler
classes.
BC-breaking notes:
The following classes under the
`torch.ao.quantization.fx.quantization_patterns` namespace
are migrated to the `torch.ao.quantization.fx.quantize_handler`
namespace:
```
QuantizeHandler
BinaryOpQuantizeHandler
CatQuantizeHandler
ConvReluQuantizeHandler
LinearReLUQuantizeHandler
BatchNormQuantizeHandler
EmbeddingQuantizeHandler
RNNDynamicQuantizeHandler
DefaultNodeQuantizeHandler
FixedQParamsOpQuantizeHandler
CopyNodeQuantizeHandler
GeneralTensorShapeOpQuantizeHandler
CustomModuleQuantizeHandler
StandaloneModuleQuantizeHandler
```
The following classes under the
`torch.ao.quantization.fx.fusion_patterns` namespace are
migrated to the `torch.ao.quantization.fx.fuse_handler`
namespace:
```
DefaultFuseHandler
FuseHandler
```
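For downstream code, the change is an import-path update (illustrative example based on the class lists above):
```
# Before (#89872):
# from torch.ao.quantization.fx.quantization_patterns import QuantizeHandler
# from torch.ao.quantization.fx.fusion_patterns import DefaultFuseHandler

# After:
from torch.ao.quantization.fx.quantize_handler import QuantizeHandler
from torch.ao.quantization.fx.fuse_handler import DefaultFuseHandler
```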
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89872
Approved by: https://github.com/jerryzh168
Summary: The same function existed in both observer and quantize; this consolidates
them into a single function. Note the definitions were slightly different, so I've
changed the definition to be maximally inclusive so that the name of the
function is more accurate.
Test Plan: python test/test_public_bindings.py
python test/test_quantization.py
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D40709276](https://our.internmc.facebook.com/intern/diff/D40709276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87520
Approved by: https://github.com/jcaip
Summary: When the BackendConfig was first introduced,
`overwrite_output_observer` and `overwrite_output_fake_quantize`
were added to ensure fixed qparams ops like `torch.nn.Sigmoid`
and `torch.nn.Tanh` used the correct observers and fake quantizes.
However, this is hacky because the BackendConfig should not set
the observer constructors themselves, but should instead specify
only requirements on the observers.
Later, https://github.com/pytorch/pytorch/pull/80184 added the
correct observers to `get_default_qconfig_mapping` along with
validation logic that throws an error if incorrect observers
were specified. With this change, we no longer need to overwrite
the observers from the BackendConfig, since we expect the user to
pass in the correct observers for these ops.
This commit removes these overwrite observer settings in the
BackendConfig. Instead, we represent the observer constraints for
fixed qparams ops through the existing DTypeWithConstraints
mechanism. Note that, however, to be consistent with other
DTypeWithConstraints checks, we no longer throw an error if an
incorrect observer is specified, but simply ignore the offending
QConfig and log a warning instead. This is the BC-breaking part
of the change.
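A hedged sketch of expressing a fixed-qparams output constraint through DTypeWithConstraints (the exact field values and config shown are illustrative, not quoted from the diff):
```
import torch
from torch.ao.quantization.backend_config import DTypeConfig, DTypeWithConstraints

# Constrain the output dtype for a fixed-qparams op such as torch.nn.Sigmoid:
# the scale/zero_point must match exactly rather than being freely observed.
sigmoid_output_dtype = DTypeWithConstraints(
    dtype=torch.quint8,
    scale_exact_match=1.0 / 256.0,
    zero_point_exact_match=0,
)

sigmoid_dtype_config = DTypeConfig(
    input_dtype=torch.quint8,
    output_dtype=sigmoid_output_dtype,
)
```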
BC-breaking notes:
```
from torch.ao.quantization.qconfig import default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping().set_global(default_qconfig)
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
```
Before this commit, running the above leads to an exception
because the wrong observers are used for fixed qparams ops.
After this commit, the above will only encounter a warning,
and the fixed qparams ops will not be quantized. In both cases,
switching to `get_default_qconfig_mapping` will cause the
fixed qparams ops to be quantized.
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88620
Approved by: https://github.com/jerryzh168
Summary:
_convert_to_reference_decomposed is a private convert function in the fx graph mode quantization flow that converts
a calibrated/trained model to a reference quantized model with decomposed quantized tensor representations.
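A minimal usage sketch (the toy module and calibration step are placeholders; the function name assumes the `_fx` suffix used in the test below):
```
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import (
    prepare_fx,
    _convert_to_reference_decomposed_fx,
)

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).eval()
example_inputs = (torch.randn(1, 4),)

prepared = prepare_fx(model, get_default_qconfig_mapping(), example_inputs)
prepared(*example_inputs)  # calibration
reference = _convert_to_reference_decomposed_fx(prepared)
```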
Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87094
Approved by: https://github.com/andrewor14