Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505
The handling of `self` in static runtime was previously incorrect. This diff fixes that issue, since `self` is essential to prim::GetAttr/SetAttr: most of the time we are getting and setting attributes on `self`, the TorchScript module.
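To make the role of `self` concrete, here is a minimal illustrative sketch (the module below is hypothetical, not from the diff): attribute accesses on a scripted module are lowered to `prim::GetAttr` nodes whose input is `%self`, the TorchScript module.
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(4))

    def forward(self, x):
        # self.weight is lowered to a prim::GetAttr node that takes %self as input
        return x + self.weight

m = torch.jit.script(M())
print(m.graph)  # contains: %weight = prim::GetAttr[name="weight"](%self)
```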
Reviewed By: ajyu
Differential Revision: D29350173
fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60503
Fixed a few issues in the static_runtime::to_copy impl:
- fix a bug with memory_format
- copy strides when appropriate; this is necessary to make sure the fbgemm path in the copy kernel gets hit
- fix the schema in the `ReplaceWithCopy` pass
- add registration of `static_runtime::to_copy.other`
Add more unit tests (an illustrative sketch of these cases follows the list):
- test dynamic shapes
- test strided input tensor to `aten::to`
- test alias case (same input/output)
- test `to.other`
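A minimal Python sketch of the kinds of `aten::to` inputs these tests exercise (illustrative only; the real tests live in the static runtime test suites):
```
import torch

x = torch.randn(2, 3, 4, 5)

# memory_format: dtype conversion plus channels_last output layout
y = x.to(torch.float64, memory_format=torch.channels_last)

# strided (non-contiguous) input produced by slicing
strided = x[:, :, ::2, :]
z = strided.to(torch.float64)

# alias case: no dtype/device change and copy=False returns the input itself
same = x.to(torch.float32)
assert same is x
```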
Reviewed By: ajyu
Differential Revision: D26838933
fbshipit-source-id: ec0d1a2deebe998fcfe8858e772e1ef429cb4522
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27974999
fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27902824
fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56441
Since aten::to is overloaded, match the node against the schema before replacing it with static_runtime::to_copy.
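As an illustrative sketch of why schema matching is needed (not the pass itself): the two scripted functions below both produce `aten::to` nodes, but with different overloads (`to.dtype` vs. `to.other`), so a replacement pass has to look at the node's schema rather than just its kind.
```
import torch

@torch.jit.script
def to_dtype(x: torch.Tensor):
    return x.to(torch.float64)   # aten::to.dtype overload

@torch.jit.script
def to_other(x: torch.Tensor, y: torch.Tensor):
    return x.to(y)               # aten::to.other overload

# Both graphs contain aten::to nodes; only the schemas distinguish them.
print(to_dtype.graph)
print(to_other.graph)
```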
Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_model=/data/users/ansha/tmp/adfinder/210494966_0.predictor.disagg.remote_request_only --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_net2.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.remote_request_only.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --benchmark_c2_predictor=0 --do_benchmark=0
```
```
Time per node type:
0.623426 ms. 55.337%. quantized::embedding_bag_4bit_rowwise_offsets (82 nodes)
0.331633 ms. 29.4367%. quantized::embedding_bag_byte_rowwise_offsets (71 nodes)
0.123163 ms. 10.9323%. aten::to (155 nodes)
0.038479 ms. 3.4155%. fb::lengths_to_offsets (155 nodes)
0.004169 ms. 0.370052%. aten::embedding_bag (2 nodes)
0.002549 ms. 0.226256%. static_runtime::to_copy (2 nodes)
0.002512 ms. 0.222972%. prim::TupleConstruct (1 nodes)
0.000667 ms. 0.0592048%. prim::dtype (2 nodes)
1.1266 ms. in Total
StaticRuntime setup time: 0.009605 ms
Memory allocation time: 0.001907 ms
Memory deallocation time: 0.032401 ms
Outputs deallocation time: 0.020876 ms
Total memory managed: 256 bytes
Total number of reused tensors: 159
```
I verified that all of the aten::to calls match, for the local, local_ro, and remote_ro nets in both opt and dev mode.
Only 2 of the calls are replaced because the other 155 have either the input or the output of the op returned as an external output. The situation is similar for the other instances of aten::to in the local and local_ro nets.
Reviewed By: hlu1
Differential Revision: D27872350
fbshipit-source-id: b72785ea2768be415faae2afcf9915aef07daec2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337
`static_runtime::permute_copy` is in an fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library there. The workaround is to check at runtime whether the op is registered.
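A hedged sketch of what such a runtime check could look like from the Python test side (the helper below is illustrative, not the code in this diff):
```
import torch

def op_is_registered(qualified_name: str) -> bool:
    # e.g. "static_runtime::permute_copy"; resolving an unregistered op
    # through torch.ops raises, so treat that as "not registered".
    namespace, name = qualified_name.split("::")
    try:
        getattr(getattr(torch.ops, namespace), name)
        return True
    except (AttributeError, RuntimeError):
        return False

if not op_is_registered("static_runtime::permute_copy"):
    print("fb-only op not loaded; skip the permute_copy path")
```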
Test Plan:
This fixed two of the broken tests:
```
✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```
Reviewed By: ajyu
Differential Revision: D27577066
fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54657
The constraint checked in D27145406 (acf03b13f1) is too tight for the adindexer model; as a result, 5 ops (4 aten::narrow + 1 aten::permute) are not replaced with their copy versions, which caused a perf regression. This diff checks for inplace ops explicitly and only applies the input constraint to graphs that contain inplace ops.
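A hedged sketch of one way to detect inplace ops in a TorchScript graph (an illustrative heuristic, not necessarily the check used in this diff): TorchScript in-place variants conventionally end with an underscore.
```
import torch

def graph_has_inplace_ops(graph) -> bool:
    # Heuristic: in-place variants are named with a trailing '_', e.g. aten::add_.
    return any(node.kind().endswith("_") for node in graph.nodes())

@torch.jit.script
def f(x: torch.Tensor):
    x.add_(1)
    return x.narrow(0, 0, 1)

print(graph_has_inplace_ops(f.graph))  # True, because of aten::add_
```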
Test Plan: Contbuild
Reviewed By: ajyu
Differential Revision: D27253145
fbshipit-source-id: 23e2b1a018c84dd0fc2880fddd9c41bc0422b8eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54353
The current implementation of reshape/flatten is problematic because the output is sometimes a tensor view and sometimes not; it depends entirely on the graph IR and the input shapes. Replacing them with the copy versions makes the behavior deterministic: the output is always a new tensor.
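A small example of the nondeterminism being removed; whether `reshape` aliases its input depends on the input's strides:
```
import torch

x = torch.randn(4, 6)

# Contiguous input: reshape returns a view that shares storage with x.
v = x.reshape(6, 4)
print(v.data_ptr() == x.data_ptr())  # True

# Non-contiguous input (transpose): reshape has to copy.
c = x.t().reshape(24)
print(c.data_ptr() == x.data_ptr())  # False
```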
Reviewed By: ajyu, edvgha
Differential Revision: D26358525
fbshipit-source-id: ee7571317b061221a8d50083676cded388ce6f87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54230
The comments in the code explain why this change is needed.
Reviewed By: bwasti
Differential Revision: D27145406
fbshipit-source-id: 2a61a42f22dfadfad59ee6c3be3e9e9d19e90ac3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920
Fusing SigridTransforms + ListUnpack enables the out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.
The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) reducing memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, with 3) we still pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.
Reviewed By: ajyu
Differential Revision: D26220546
fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111
If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.
Reviewed By: edvgha
Differential Revision: D27036077
fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799
Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two-step graph rewrite process. If step 2 doesn't happen after step 1, we're stuck with a graph containing an `fb::placeholder` op that can't run. Step 3 is added to revert step 1 so that the original graph is restored if any `fb::placeholder` op is left.
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph, and fusion can happen right afterwards (a sketch of running these two passes follows).
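A minimal sketch of running these two passes on a scripted graph, assuming the internal `torch._C._jit_pass_*` bindings are available (the function below is illustrative):
```
import torch

@torch.jit.script
def f(x: torch.Tensor):
    return (x + 1.0) * (x + 1.0)

g = f.graph
# Fold constant expressions, then deduplicate Constant nodes and move them
# to the front of the graph so a later fusion pass can match its pattern.
torch._C._jit_pass_constant_propagation(g)
torch._C._jit_pass_constant_pooling(g)
print(g)
```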
Reviewed By: ajyu
Differential Revision: D26920008
fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53634
Make the op signature of `static_runtime::to_copy` consistent with that of native_functions.yaml so it works with 2-5 args:
```
- func: to.dtype(Tensor self, ScalarType dtype, bool non_blocking=False, bool copy=False, MemoryFormat? memory_format=None) -> Tensor
variants: method
device_guard: False
```
(Note: this ignores all push blocking failures!)
Reviewed By: ajyu
Differential Revision: D26906726
fbshipit-source-id: b9203eb23619aba42b1bfed1a077401f9fe2ddf0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53323
While optimizing inline cvr local ro, we found a pattern where gather_ranges is used redundantly. Fuse this pattern to remove the unnecessary gather_ranges.
Reviewed By: hlu1
Differential Revision: D26659824
fbshipit-source-id: 6420afa3a2c3272c57706b70c2e9834014d6c32d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060
Aliasing is currently mishandled in SR.
This diff fixes that issue entirely and allows us to avoid hard-coded "view" registration. I'll remove the macro in a follow-up diff.
However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias." Some care will be needed to make sure this is enforced going forward.
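For a concrete picture of a "sometimes alias" op (illustrative, not one of the SR-registered ops): `contiguous()` returns its input unchanged when the tensor is already contiguous, and a fresh copy otherwise.
```
import torch

x = torch.randn(2, 3)

# Already contiguous: contiguous() returns the same tensor object (an alias).
assert x.contiguous() is x

# Non-contiguous: contiguous() materializes a copy with its own storage.
y = x.t()
z = y.contiguous()
assert z is not y and z.data_ptr() != y.data_ptr()
```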
Benchmark results with this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04
$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: eellison
Differential Revision: D25581156
fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50249
Add a few new patterns for `ConcatAddMulReplaceNanClip`
Reviewed By: houseroad
Differential Revision: D25843126
fbshipit-source-id: d4987c716cf085f2198234651a2214591d8aacc0