Make it easier to serialize patterns by adding `pattern_matcher.gen_register_replacement()`, which is like `pattern_matcher.register_replacement()` but also requires the replacement to be precompiled.
To precompile patterns (and save to disk) run:
```
torchgen/fuse_attention_patterns/gen_attention_patterns.py
```
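A minimal sketch of how a pattern might be registered with the new API, assuming a call shape that mirrors `register_replacement()` with a unique pattern name prepended; the argument order and the helpers used here (`fwd_only`, `PatternMatcherPass`) are assumptions based on the description above, not a verified reference:
```
import torch
from torch._inductor.pattern_matcher import (
    PatternMatcherPass,
    fwd_only,
    gen_register_replacement,
)

def mm_search(a, b):
    return torch.mm(a, b)

def mm_replace(a, b):
    return torch.matmul(a, b)

example_inputs = [torch.empty(2, 2), torch.empty(2, 2)]
patterns = PatternMatcherPass()

# Unlike register_replacement, this call is keyed by a unique name and expects
# a precompiled/serialized copy of the pattern (produced by the script above)
# to already exist on disk.
gen_register_replacement(
    "mm_pattern",      # unique key for the serialized pattern (assumed name)
    mm_search,
    mm_replace,
    example_inputs,
    fwd_only,
    patterns,
)
```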
- Updated the sfdp patterns to use `gen_register_replacement`.
- Added serialized patterns for mm_pattern and bmm_pattern (the 'misc' patterns don't serialize cleanly, so they can't be added).
- Updated the testing so it checks that round-tripped patterns match, not just that they serialize to the same text (toy illustration below).
- Checking that the patterns round-trip properly revealed that the `users` field wasn't being serialized correctly.
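A toy illustration in plain Python (not the pattern-matcher serialization code) of why the round-trip check is stronger: a serializer that silently drops a field, like the `users` field mentioned above, still "serializes the same way" but fails the round-trip comparison.
```
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    users: int = 1   # the kind of field that was being lost

def serialize(n: Node) -> str:
    # Buggy toy serializer that forgets `users`.
    return f"Node(name={n.name!r})"

def deserialize(text: str) -> Node:
    return eval(text)  # acceptable only in this toy example

original = Node("mm", users=2)
text = serialize(original)
print(serialize(deserialize(text)) == text)  # True: same serialized text
print(deserialize(text) == original)         # False: `users` was dropped
```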
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121313
Approved by: https://github.com/eellison
There are some remaining issues in the previous PR (https://github.com/pytorch/pytorch/pull/120985), which added support for the int8 WOQ mm pattern matcher. This PR further optimizes it.
1. New patterns are added to match int8 WOQ mm in the gpt-fast model, which uses different input layouts (see the sketch after this list).
2. In constant folding, the `int8_weight -> dq -> bf16_weight` chain should be kept so the pattern can still match.
3. Currently, GPT-Fast enables `coordinate_descent_tuning` for CPU. This flag is only useful for CUDA, but it can change the graph from the non-decomposed fallback path to the decomposed one. We will disable the flag in the GPT-Fast script for CPU in order to keep the patterns clean. @yanbing-j
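A minimal sketch of the computation these patterns target, plus the config flag from item 3. The shapes and scale handling are illustrative assumptions, not the exact gpt-fast layouts; `torch._inductor.config.coordinate_descent_tuning` is an existing Inductor flag.
```
import torch
import torch._inductor.config

def woq_int8_mm(x: torch.Tensor, w_int8: torch.Tensor, scales: torch.Tensor):
    # int8_weight -> dq -> bf16_weight: this chain must survive constant
    # folding so the pattern matcher can still see it.
    w_bf16 = w_int8.to(torch.bfloat16) * scales
    return torch.nn.functional.linear(x, w_bf16)

x = torch.randn(8, 64, dtype=torch.bfloat16)
w = torch.randint(-128, 127, (32, 64), dtype=torch.int8)
scales = torch.randn(32, 1, dtype=torch.bfloat16)
print(woq_int8_mm(x, w, scales).shape)  # torch.Size([8, 32])

# GPT-Fast currently enables this flag; it only helps CUDA and changes the CPU
# graph, so the GPT-Fast script will disable it for CPU to keep patterns stable.
torch._inductor.config.coordinate_descent_tuning = False
```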
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122955
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
Add runtime constant folding for AOTInductor.
This also includes invoking constant folding at load time.
The constant-folding lowering is a two-step process.
First, we split the graph into two modules. One of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for it.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). Compared with a normal lowering, the main module takes in one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create a caller for the main module to consume the results of the constant module.
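An illustrative sketch of the first step in plain torch.fx, not the AOTInductor implementation: walk the graph and mark the nodes that do not depend on any placeholder input; those nodes form the constant module that can be evaluated once and reused.
```
import torch
import torch.fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        folded = self.w + 1.0   # input-independent: constant-foldable
        return x @ folded       # depends on user input: main module

gm = torch.fx.symbolic_trace(M())

depends_on_input = set()
for node in gm.graph.nodes:
    if node.op == "placeholder":
        depends_on_input.add(node)
    elif any(arg in depends_on_input for arg in node.all_input_nodes):
        depends_on_input.add(node)

constant_nodes = [
    n for n in gm.graph.nodes
    if n not in depends_on_input and n.op not in ("placeholder", "output")
]
print([n.name for n in constant_nodes])  # e.g. ['w', 'add'] for this toy graph
```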
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
This PR fixes two bugs:
1) Constant folding a Triton kernel results in the kernel's inputs being returned back without any modification, so constant folding is disabled for Triton kernels. This needs more investigation.
2) NoneLayout buffers should not be deleted, as they do not exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
Summary:
Two nodes can point to the same attribute via node.target.
This makes sure that (see the sketch below):
- we don't try to delete an already-deleted attribute, i.e. we delete each attribute only once
- we do delete all the nodes pointing to the attribute
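An illustrative torch.fx sketch of the situation, not the ExecuTorch pass itself: build a graph where two get_attr nodes share the target "w", then delete the attribute exactly once while erasing every node that points to it. The replacement target used to detach users is just a stand-in; the real pass substitutes the fused value.
```
import torch
import torch.fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.ones(3))

    def forward(self, x):
        return x

# Build a graph with two get_attr nodes that point at the same attribute.
graph = torch.fx.Graph()
x = graph.placeholder("x")
w1 = graph.get_attr("w")
w2 = graph.get_attr("w")
inner = graph.call_function(torch.add, (w1, w2))
out = graph.call_function(torch.add, (x, inner))
graph.output(out)
gm = torch.fx.GraphModule(M(), graph)

deleted_attrs = set()
for node in list(gm.graph.nodes):
    if node.op == "get_attr" and node.target == "w":
        if node.target not in deleted_attrs:
            delattr(gm, node.target)       # delete the attribute only once
            deleted_attrs.add(node.target)
        node.replace_all_uses_with(x)      # stand-in so erase_node is legal
        gm.graph.erase_node(node)          # erase every node with this target
gm.recompile()
print(gm.code)  # no references to "w" remain
```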
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:test_xnnpack_passes -- executorch.backends.xnnpack.test.passes.test_batch_norm_fusion.TestBatchNormFusion.test_q8_batch_norm_fusion
```
Differential Revision: D51419442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113957
Approved by: https://github.com/Skylion007
A couple of changes to make constant folding more efficient.
- Because we are replacing nodes that only have a single value, store just that single value instead of the whole tensor for node replacement (toy sketch below).
- torch.fx.Interpreter preserves a Tensor in the env as long as it has remaining uses. That also applies to output uses, but we are not going to constant fold that use. Instead of using the last use for garbage collection, use the last non-output use.
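A toy sketch of the first change, not the actual constant-folding code: when a folded tensor holds a single repeated value, keep only that scalar plus the metadata needed to rebuild it, rather than holding the whole tensor.
```
import torch

def try_compress(t: torch.Tensor):
    # Return (scalar, shape, dtype) if every element of t equals the first one.
    first = t.flatten()[0].item()
    if torch.equal(t, torch.full_like(t, first)):
        return first, tuple(t.shape), t.dtype
    return None

def materialize(scalar, shape, dtype):
    return torch.full(shape, scalar, dtype=dtype)

big = torch.zeros(1024, 1024)
compressed = try_compress(big)   # (0.0, (1024, 1024), torch.float32)
assert torch.equal(materialize(*compressed), big)
```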
If reviewers would prefer I ghstack this because of the code movement, let me know.
Fix for https://github.com/pytorch/pytorch/issues/108388
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108421
Approved by: https://github.com/jansel