Motivation: Since `intel_extension_for_pytorch` is added to `THIRDPARTY_SKIPLIST`, when the IPEX optimized model uses `torch.compile`, the functions defined in IPEX will be skipped, these functions will not be able to generate the corresponding FX graph through dynamo, cannot be optimized by the compiler, and unnecessary graph breaks occurred. This PR is to remove `intel_extension_for_pytorch` from `THIRDPARTY_SKIPLIST` so that IPEX and torch.compile can work better together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112840
Approved by: https://github.com/jgong5, https://github.com/jansel
This major change in this PR is to consolidate the skipfiles.check rules, the major thing done is merging the original ```FILE_INLINELIST``` with ```SUBMOD_INLINELIST``` into new ```MOD_INLINELIST``` and a legacy ```LEGACY_MOD_INLINELIST```.
Let's use the following example to illustrate what is the expected behavior for this force inline list:
fa995626a8/torch/_dynamo/skipfiles.py (L344-L369)
The handling logic is:
* If f2 is inlined, we will check both ```MOD_INLINELIST``` and ```LEGACY_MOD_INLINELIST``` to consultant force inline rules for f3.
* If f2 is skipped, we will check ```LEGACY_MOD_INLINELIST``` only for inline rules for f3.
The reason behind this design is: if f2 is skipped, if we always trace all recursively called functions, we will go to the very low level functions (e.g, ```super().__init__```) which caused graph breaks. We treated this as a signal that all functions that f2 recursively called should be skipped as well if f2 is skipped. This is also a feature that many PyTorch developers requested, they just want to skip all recursive functions if they mark the upper level functions as skipped.
For PyTorch developers, we should only use ```MOD_INLINELIST``` going forward. I think most of the modules in the ```LEGACY_MOD_INLINELIST``` are legacy things to workaround when we didn't have a good skip/inline API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111451
Approved by: https://github.com/ezyang
`is_allowed` is a tricky bit of functionality - it sits early up in builder and is used to drive the creation of TorchVariable (more notes here, meta only https://fb.workplace.com/groups/pytorch.dev/permalink/1393563781222098/)
If we are tracing distributed in full, we want to route certain calls in distributed to NOT PASS is_allowed (this does not, confusingly, mean that they are not allowed, lol, but rather that we dont want them to become TorchVariable), others, we are fine with preserving.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110894
Approved by: https://github.com/ezyang
Several improvements for skipfiles:
* Add ```FUNC_INLINELIST``` to support function level skip/inline check.
* Use ```fn.__code__``` to match function since we can't get the function object sometimes.
* Use python module string name for ```FILE_INLINELIST``` and ```SUBMODULE_INLINELIST```.
* Use filename to match file and python module, which can fundamentally resolved the circular import issues introduced by skipfiles.
* Use ```TYPE_CHECKING``` to ensure the python module string name is correct.
* Add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110835
Approved by: https://github.com/ezyang
Several improvements for skipfiles:
* Add ```FUNC_INLINELIST``` to support function level skip/inline check.
* Use ```fn.__code__``` to match function since we can't get the function object sometimes.
* Use python module string name for ```FILE_INLINELIST``` and ```SUBMODULE_INLINELIST```.
* Use filename to match file and python module, which can fundamentally resolved the circular import issues introduced by skipfiles.
* Use ```TYPE_CHECKING``` to ensure the python module string name is correct.
* Add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110835
Approved by: https://github.com/ezyang
Summary:
During convert step observers are first replaced by Q-DQ pair. In some
scenarios like following output DQ has a fan out.
---> OP2 -> Q -> DQ
/
OP -> Q -> DQ -
\
---> OP3 -> Q -> DQ
If either op OP2 or OP3 are configured to be quantized, then the input
is expected to quantized. In this case quantized equivalent of some
pattern, that quantizer asked to be quantized, should look like:
[DQ -> {pattern} -> Q]. However, in scenario like above where DQ node
is shared between multiple "quantized" patterns, boundary of "quantized"
pattern is not clear because DQ now belongs to multiple quantized
patterns.
This poses challenge for:
- Porting metadata: which "quantized" partition this DQ node belongs
- Quantized representation, equivalently, needs to identify
self-contained quantized pattern that is replaced by its equivalent pattern
that captures compute in the quantized precision.
Test Plan:
test_duplicate_dq_pass
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D48663147](https://our.internmc.facebook.com/intern/diff/D48663147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107900
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14, https://github.com/leslie-fang-intel
ghstack dependencies: #107105, #107106, #107899
Some notable changes:
1. `constrain_as_size` allows min value to be less than 2 as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add additional check to make sure max value is always greater than 2.
2. Previously, we used to runtime assert on the unbacked symint's val range which would be always between [2, max]. I modified this logic to assert on [0, max] unless user explicitly specifies the min range.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
Some notable changes:
1. `constrain_as_size` allows min value to be less than 2 as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add additional check to make sure max value is always greater than 2.
2. Previously, we used to runtime assert on the unbacked symint's val range which would be always between [2, max]. I modified this logic to assert on [0, max] unless user explicitly specifies the min range.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
Summary: moving quantizer to torch.ao.quantization to make it a public api, since pt2e is a folder for implementations
Test Plan:
CIs
sanity check: "buck test //executorch/backends/xnnpack/test:test_xnnpack_quantized_models -- test_resnet18"
Differential Revision: D47727838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105885
Approved by: https://github.com/andrewor14
Summary:
QAT convert for mobilenetv2 was previously not working
because we incorrectly applied dropout during eval as well as
training. This is because, for exported models, model.eval() does
not change the behavior of dropout, unlike models with torch ops.
This commit simulates the effects of model.eval() for exported
models as well by replacing the aten dropout pattern before eval.
As of this commit, end-to-end QAT numerics now match for
mobilenetv2 between FX and PT2.
Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_mobilenet_v2
Differential Revision: D46750343
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104110
Approved by: https://github.com/jerryzh168
Summary:
The planned e2e for quantization in pytorch 2.0 export is the following:
float_model -> prepare_pt2e -> calibration -> convert_pt2e -> ...
inside convert_pt2e, we will first produce a q/dq representation of the quantized model, similar to the previous output of
convert_to_reference_fx in fx grah mode quantization:
```
torch.ops.quantized_decomposed.dequantize_per_tensor -> torch.ops.aten.add -> torch.ops.quantized_decomopsed.quantize_per_tensor
torch.ops.quantized_decomposed.dequantize_per_tensor /
```
Then we'll rewrite the above to a more precise representation that express the intention in a more precise manner, since
here we actually want to do int8 addition, instead of simulating the int8 addition with fp32 operations, the representation for
quantized add is:
```
def quantized_add(x_i8, x_scale, x_zero_point, y_i8, y_scale, y_zero_point, out_scale, out_zero_point):
x = (x_scale / out_scale) * x_i8
y = (y_scale / out_scale) * y_i8
out = x + y
out -= (x_zero_point * x_scale - y_zero_point * y_scale) / out_scale
out += out_zero_point
return out
```
Test Plan:
```
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_add (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```
Reviewed By: kimishpatel
Differential Revision: D45628032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104130
Approved by: https://github.com/kimishpatel
Summary: Importing torch.ao.quantization._pt2e from dynamo led to
internal test failures related to memory profiling. For now,
let's express the path using a simple string instead.
Reviewers: jerryzh168, kimishpatel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100194
Approved by: https://github.com/jerryzh168
Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor. Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case.
One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs 32-bit hash five times and cats them together. By the way, this sets the first precedent for kernel in PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen.)
I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum. Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through. xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range.
Thanks to Jeff Johnson for letting me bounce ideas of him on a Saturday morning lol.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809
Approved by: https://github.com/voznesenskym