I used a couple of type-ignore comments in ir.py because it constructs
short-lived instances of FixedLayout and GraphModuleSerializer, just to
call a single method on them that doesn't use all their members. Making
those unused members optional would make the rest of the code a lot
messier with sprinkled `assert` statements.
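To illustrate the pattern (hypothetical class, not the actual FixedLayout/GraphModuleSerializer signatures): a short-lived instance is built just to call one method, and the members that method never reads get placeholders under a targeted type-ignore.
```
from typing import Any

class Serializer:
    """Hypothetical stand-in; not the real GraphModuleSerializer."""
    def __init__(self, graph_signature: Any, module_call_graph: list) -> None:
        self.graph_signature = graph_signature
        self.module_call_graph = module_call_graph

    def serialize_operator(self, op: Any) -> str:
        # Only uses its argument; never touches the members above.
        return str(op)

# Short-lived instance: the unused members get placeholders, silenced with a
# targeted ignore instead of making them Optional everywhere else.
name = Serializer(None, None).serialize_operator("aten.add.Tensor")  # type: ignore[arg-type]
```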
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113534
Approved by: https://github.com/albanD
They are used in many contexts that don't actually check if the returned
type is `None`. I have also created `try_get()` for the cases where we
do actually want an Optional type returned.
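A minimal sketch of the get()/try_get() split, using a hypothetical holder class rather than the actual ir.py code: get() asserts and returns a non-Optional value for the many call sites that never check for None, while try_get() returns an Optional for the callers that do.
```
from typing import Optional

class Holder:
    def __init__(self) -> None:
        self._value: Optional[str] = None

    def set(self, value: str) -> None:
        self._value = value

    def get(self) -> str:
        # Callers that assume the value is present get a non-Optional type
        # and a loud failure instead of a silent None.
        assert self._value is not None, "value has not been set"
        return self._value

    def try_get(self) -> Optional[str]:
        # For the few call sites that genuinely handle the missing case.
        return self._value
```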
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113535
Approved by: https://github.com/ezyang
ghstack dependencies: #113412
Summary:
Include constants in the AOTInductor .so file.
Notable differences (see the sketch below):
1) Serialize with ctypes instead of torch.storage's native serialization.
2) Use the underlying for_blob instead of from_blob to construct the Tensor.
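A rough sketch of the ctypes-based serialization (illustrative only, not the exact AOTInductor code; the for_blob construction happens on the C++ side and is not shown):
```
import ctypes
import torch

def tensor_to_bytes(t: torch.Tensor) -> bytes:
    # Read the raw bytes straight out of the tensor's storage via ctypes,
    # instead of going through torch.storage's own serialization.
    t = t.contiguous().cpu()
    nbytes = t.untyped_storage().nbytes()
    raw = (ctypes.c_ubyte * nbytes).from_address(t.untyped_storage().data_ptr())
    return bytes(raw)

payload = tensor_to_bytes(torch.randn(4, 4))  # 16 fp32 elements -> 64 bytes
```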
Test Plan:
Unit tests:
```
test/inductor/test_aot_inductor.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108473
Approved by: https://github.com/angelayi
Summary:
Include the constants in the AOTInductor .so file.
We do not modify existing API signatures; instead, we create the necessary format with the weights lifted out.
Test Plan:
test/inductor/test_aot_inductor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
Batchnorm inference is done in fp32 if the inputs are in fp16/bf16, and the output is cast back down to its original precision. This causes the batchnorm weights to get constant folded to fp32, which prevents Conv-BN folding from firing.
```
def forward(self, arg0_1: bf16[32, 3, 3, 3], arg1_1: bf16[32], arg2_1: bf16[32], ...)
convolution: bf16[3, 32, 15, 15] = aten.convolution.default(arg6_1, arg0_1, None, [2, 2], [0, 0], [1, 1], False, [0, 0], 1); arg6_1 = arg0_1 = None
# weight upcasting
convert_element_type: f32[32] = torch.ops.prims.convert_element_type.default(arg3_1, torch.float32); arg3_1 = None
convert_element_type_1: f32[32] = torch.ops.prims.convert_element_type.default(arg4_1, torch.float32); arg4_1 = None
...
# end of batch norm
add_1: f32[3, 32, 15, 15] = aten.add.Tensor(mul_2, unsqueeze_7); mul_2 = unsqueeze_7 = None
# output downcast
convert_element_type_2: bf16[3, 32, 15, 15] = torch.ops.prims.convert_element_type.default(add_1, torch.bfloat16); add_1 = None
```
I mark the convolutions that are followed by binary foldable ops done in a higher precision whose results are then converted back down to the original conv dtype. We fold the weights in fp32 because it gives slightly better accuracy, then at the end of the pass we convert the weights back to their original dtype.
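For reference, a sketch of the folding arithmetic, assuming the standard Conv-BN fusion formula and a bias-free conv (this is not the literal pass code): fold in fp32, then cast back to the conv's original dtype.
```
import torch

def fold_conv_bn_fp32(w_bf16, bn_weight, bn_bias, bn_mean, bn_var, eps=1e-5):
    # Do the folding math in fp32 for slightly better accuracy.
    scale = bn_weight.float() / torch.sqrt(bn_var.float() + eps)
    fused_w = w_bf16.float() * scale.reshape(-1, 1, 1, 1)
    fused_b = bn_bias.float() - bn_mean.float() * scale
    # At the end of the pass, convert the weights back to the conv's dtype.
    return fused_w.to(w_bf16.dtype), fused_b.to(w_bf16.dtype)
```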
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106576
Approved by: https://github.com/XiaobingSuper, https://github.com/yanboliang
ghstack dependencies: #106471, #106575
This PR handles inference. Will do the same for training later.
Some manual testing results show this can improve inference perf by 2-3% (absolute improvement, not relative):
- convmixer: 4.285x -> 4.309x
- resnet50: 2.170x -> 2.203x
The PR is built on top of freezing: without freezing, the weight input for a conv node may not be a parameter directly but the output of precision-converting ops, so it is much easier to implement this PR after freezing.
Commands:
```
TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103642
Approved by: https://github.com/eellison
Adds Conv-BN folding to inductor freezing. One thing that's a little awkward now is that we'll want different decompositions to run depending on whether we are in the inference compiler. For now, I require that you run with torch.no_grad() so we can detect that no gradients are required before calling aot_autograd.
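A minimal usage sketch under that constraint (model, shapes, and the freezing env var are illustrative): run the compiled model under torch.no_grad() so aot_autograd can tell that no gradients are required.
```
# Run with freezing enabled, e.g. TORCHINDUCTOR_FREEZING=1 python this_script.py
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).eval()
compiled = torch.compile(model)

with torch.no_grad():  # required so we can detect no gradients are needed
    out = compiled(torch.randn(1, 3, 32, 32))
```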
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100653
Approved by: https://github.com/jansel
Adds a constant-folding pass to the joint graph that only targets tensors which can be replaced with a single value, then removes no-ops from the graph. This allows us to match sdpa in BertForMaskedLM, AlbertForMaskedLM, and LayoutLMForMaskedLM.
- BertForMaskedLM: Perf 1.6853x -> 1.933x, Memory 0.9462 -> 1.41
- AlbertForMaskedLM: Perf 1.6620x -> 1.761x, Memory 1.257 -> 1.94
- LayoutLMForMaskedLM: Perf (non-cudagraphs) 1.6991x -> 1.939x, Memory 0.9624 -> 1.50
- MobileBertForMaskedLM: Perf 1.864x -> 1.941x, Memory 0.94 -> 1.03
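An illustrative example of the kind of pattern the pass targets (not the pass implementation itself): a subgraph whose result is a tensor with a single repeated value gets folded to that scalar, after which the surrounding op becomes a removable no-op.
```
import torch

def f(x):
    mask = torch.ones(x.shape[-1])  # every element is the same value (1.0)
    return x * mask                 # after folding mask -> 1.0, the multiply is a no-op

x = torch.randn(4, 8)
assert torch.equal(f(x), x)         # the graph is equivalent to just returning x
```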
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103600
Approved by: https://github.com/jansel
Adds a freezing pass, gated by `config.freezing`, that will constant fold parameters in inductor. It runs post-functionalization in aot_autograd so that dispatching has already been captured and the passes operate on the functionalized graph. A few notes:
- There is an option to discard parameters, `config.freezing_discard_parameters`, which will take the current eager modules and wrap their parameters in a Tensor subclass that errors if used (see the sketch after this list).
- I needed to expose flat_params in aot_autograd in order to discard old references when we constant fold away parameters, like with amp. I also exposed `fw_metadata` to avoid constant folding mutated parameters.
- Caching parameter transformations/constant folding across different inference runs is not yet implemented.
- Checking the version_counter of constant-folded params is not yet implemented.
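A minimal sketch of the "error if used" wrapper mentioned in the notes above (hypothetical; not the actual subclass behind `config.freezing_discard_parameters`):
```
import torch

class ErroringParameter(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Any torch op on a discarded parameter fails loudly.
        raise RuntimeError(f"{func} called on a parameter discarded by freezing")

p = torch.randn(4).as_subclass(ErroringParameter)
# p + 1  # would raise RuntimeError: the eager parameter was discarded
```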
I'm not really sure what the actual naming should be. In jit there was both "freezing", which was platform-agnostic, and "optimize for inference", which made device-specific optimizations. We're doing the latter here, but maybe freezing is a better name.
Differential Revision: [D46244033](https://our.internmc.facebook.com/intern/diff/D46244033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100652
Approved by: https://github.com/jansel