Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65971
ghstack-source-id: 141842335
We should be able to load methods into their ClassTypes. Right now the mobile runtime only loads data members into ClassTypes, not methods. To support interface calls, we inject methods into their ClassTypes as the methods are loaded.
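As a rough, self-contained illustration of the idea (toy types only, not the actual mobile runtime API): methods get registered on their owning class type at load time, so an interface call can be resolved by method name against the runtime type of the receiver.
```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

// Toy stand-ins for ClassType / Function; the real runtime uses
// c10::ClassType and mobile::Function, whose exact APIs differ.
struct ToyFunction {
  std::string name;
  std::function<int(int)> body;
};

struct ToyClassType {
  std::string qualified_name;
  std::unordered_map<std::string, ToyFunction> methods;

  // What this diff enables conceptually: attach methods to the class type
  // as they are loaded, instead of only attaching data members.
  void addMethod(ToyFunction fn) {
    methods.emplace(fn.name, std::move(fn));
  }

  // An interface call only knows the method name; it resolves the callee
  // against the runtime class type of the receiver object.
  int callMethod(const std::string& name, int arg) const {
    return methods.at(name).body(arg);
  }
};

int main() {
  ToyClassType impl{"__torch__.MyImpl", {}};
  impl.addMethod({"forward", [](int x) { return x + 1; }});
  // Interface-style dispatch: look up "forward" on the receiver's type.
  std::cout << impl.callMethod("forward", 41) << "\n";  // prints 42
}
```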
Test Plan: existing tests should all pass.
Reviewed By: qihqi
Differential Revision: D31326146
fbshipit-source-id: fb1dbea619910ef1f8fa26146da3ebab348fe902
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64066
I noticed a bunch of time being spent heap-allocating Tuples
in the unpickler. 1-, 2-, and 3-element Tuples are apparently common
enough that they get their own bytecode instructions, so I decided to
try also giving them their own representation. We store up to 3
IValues inline in `Tuple` rather than doing a second heap allocation
for a `std::vector<IValue>`.
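As a rough sketch of the idea (not the actual `TupleElements` implementation), a small-buffer-optimized container can keep up to three elements in an inline array and fall back to a heap-allocated vector only for larger tuples:
```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy small-buffer container: up to 3 elements live inline; larger tuples
// fall back to a heap-allocated std::vector. The real TupleElements stores
// IValues; int keeps the sketch short.
class SmallTupleElements {
 public:
  explicit SmallTupleElements(std::vector<int> elems) {
    if (elems.size() <= 3) {
      inline_size_ = elems.size();
      for (size_t i = 0; i < inline_size_; ++i) inline_[i] = elems[i];
    } else {
      inline_size_ = kUseVector;
      heap_ = std::move(elems);
    }
  }
  size_t size() const {
    return inline_size_ == kUseVector ? heap_.size() : inline_size_;
  }
  int& operator[](size_t i) {
    return inline_size_ == kUseVector ? heap_[i] : inline_[i];
  }

 private:
  static constexpr size_t kUseVector = static_cast<size_t>(-1);
  size_t inline_size_;     // element count, or kUseVector when heap_ is used
  int inline_[3] = {};     // inline storage for 1-, 2-, and 3-element tuples
  std::vector<int> heap_;  // fallback for larger tuples
};

int main() {
  SmallTupleElements t({10, 20});        // no heap allocation for the elements
  assert(t.size() == 2 && t[1] == 20);
  SmallTupleElements big({1, 2, 3, 4});  // falls back to the vector
  assert(big.size() == 4);
}
```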
ghstack-source-id: 140695395
Test Plan:
Added automated tests for TupleElements.
Pixel 3 before: https://www.internalfb.com/intern/aibench/details/761596366576284
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/591414145082422
We went from 347 ms to 302 ms.
Reviewed By: dhruvbird
Differential Revision: D30592622
fbshipit-source-id: 93625c54c9dca5f765ef6d5c191944179cb281a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65865
`operator_str` is not used in `import.cpp`, and it is also defined in `parse_operators.cpp`, so remove it from `import.cpp`.
Test Plan: CI passing
Reviewed By: iseeyuan
Differential Revision: D31293008
fbshipit-source-id: 1c857cbd63c57b8f79c1a068789fc8605605b642
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64065
It is only safe to mutate Tuple elements if you are the sole owner
of the tuple. The most efficient way to do this, then, is
`std::move(*std::move(tupleIValue).toTuple()).elements()` (the
innermost move allows `IValue::toTuple()` to avoid a refcount bump and
the outermost move allows the element vector to be moved out of the
tuple), but many callsites write simply
`tupleIValue.toTuple().elements()`, which incurs many extra refcount
bumps.
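A minimal example of the two patterns, using the expression quoted above (assuming a libtorch build; element types are left as `auto` since the exact return type depends on the Tuple implementation):
```cpp
#include <ATen/core/ivalue.h>

#include <utility>

using c10::IValue;

int main() {
  // A 2-element tuple wrapped in an IValue, as the unpickler would produce.
  IValue tupleIValue(c10::ivalue::Tuple::create(IValue(1), IValue(2)));

  // Refcount-heavy pattern: toTuple() on an lvalue bumps the refcount, and
  // elements() is then copied out of a tuple we don't solely own.
  auto copied = tupleIValue.toTuple()->elements();

  // Cheaper pattern from the summary: the inner move lets toTuple() steal the
  // intrusive_ptr without a refcount bump, and the outer move lets the element
  // storage be moved out of the (now solely owned) tuple.
  auto moved = std::move(*std::move(tupleIValue).toTuple()).elements();
  (void)copied;
  (void)moved;
}
```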
ghstack-source-id: 139468088
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D30592621
fbshipit-source-id: e8312de866de09b9ea2a62e5128cbf403ee16f09
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65710
No need to incur extra refcount bumps, and no need to use a stringstream for what are presumably string keys anyway.
ghstack-source-id: 139325445
Test Plan: CI, reviewers to confirm the keys are supposed to be strings
Reviewed By: dhruvbird
Differential Revision: D31215347
fbshipit-source-id: 82be93cb2e57aefe94edf74d149115cb734112be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65179
This is following up this PR: https://github.com/pytorch/pytorch/pull/61862. The purpose is to modularize operator parsing so that it can be used as needed without pulling the whole `import.cpp` into build.
Test Plan: Added a unit test in `test_lite_predictor.cpp` called `ParseOperators`, similar to `ParseBytecode`.
Reviewed By: iseeyuan
Differential Revision: D31006555
fbshipit-source-id: c38e221800af4cf72963a353c452c5437f56a0ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64269
Revert the changes in D29826210 (693d8f2f07): we no longer need operator lambda caching since there aren't duplicate operators anymore.
This diff stack results in an additional approx. 12% speedup in model loading time (from 229 ms to 200 ms) when run against an 87 MB speech model that jiatongzhou provided.
ghstack-source-id: 138014904
Test Plan:
**Speech Transducer v25 model (as in D29826210 (693d8f2f07))**
| | Before | After |
|---|---|---|
| Load Time | [229ms](https://www.internalfb.com/intern/aibench/details/160889436133243) | [200ms](https://www.internalfb.com/intern/aibench/details/837884532607514) |
| Save File Size | [86.23 MB](https://lookaside.facebook.com/intern/diff/file/data/?number=658544950) | [86.1 MB](https://lookaside.facebook.com/intern/diff/file/data/?number=658554403) |
The "after" flamegraph shows significantly less time is spent on ```append_operator``` than before.
Steps
- Check out desired commit in devserver (base branch or this diff)
- ```buck build bento/kernels:bento_kernel_pytorch```
- Use N1094068 with pytorch_local kernel to save model for lite interpreter
- Edit ```aibench/specifications/models/pytorch/speech_transducer/v25.json ``` to have new model location and md5
- ```buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/speech_transducer/v25.json --framework pytorch --platform android/arm64 --devices "S8US" --force_profile --remote ```
**Test that saving a model with de-dup ops doesn't change its output**
https://www.internalfb.com/intern/anp/view/?id=1137434
Reviewed By: iseeyuan
Differential Revision: D30615710
fbshipit-source-id: bb4052f0f16eccab386585e94411056f94bce43c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61862
Modularize functions of parsing bytecode tables so that they can be used as needed in situations other than mobile lite interpreter.
* The decoupled functions are re-used by current lite interpreter loader.
* The bytecode can be serialized/deserialized from other formats.
* The decoupled functions have minimum dependencies on other PyTorch components.
Next:
Build a driver binary that includes the parser and interpreter, with only the necessary dependencies on other PyTorch components.
ghstack-source-id: 137867287
Test Plan:
As an example, a simple bytecode is parsed to a mobile function, and directly run in the added unit test, `RunTimeTest:ParseBytecode`. It contains basic control flow (if, else) and basic data orchestration (list construction).
CI
Reviewed By: larryliu0820
Differential Revision: D29798382
Pulled By: iseeyuan
fbshipit-source-id: 1c173a5f5d37097e3a97baec3f3e48e1eea1400f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64029
This should save us a separate pass over the data structure to destroy it.
ghstack-source-id: 137566821
Test Plan:
Pixel3
before:
https://www.internalfb.com/intern/aibench/details/503337445067962
after:
https://our.intern.facebook.com/intern/aibench/details/320277034999340
Overall mean time decreased from 373 ms to 358 ms. In the flame graph, we
can see that some of the time spent destroying a vector of IValues moved
into parseMethods, and the new parseMethods time is less than the old
time plus the recursive destruction time.
Reviewed By: dhruvbird
Differential Revision: D30559530
fbshipit-source-id: d080295a846745ea03ac50f08f4f6c95f4eaf3d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64307
Original commit changeset: 0b2aa7c57d08
Restores original changes.
This diff changes the way operator profiling is done in the lite predictor
benchmarking binary.
Instead of using custom callbacks, it uses KinetoEdgeCPUProfiler to profile
events and then generates operator-level metrics from them.
Since KinetoEvents do not contain CPU clock time, we now report only wall-clock
time.
This unifies the various profiling efforts we have for benchmarking purposes. In
production we will still use the observer-based mechanism, but the advantage of
using the Kineto profiler is that we get a few other things for free, such as:
- chrome trace generation.
- operator-level memory profiling (to be added)
- flop counts (to be added)
Furthermore, we can possibly use a Python post-processing script to parse the chrome
trace and generate output similar to torch.profiler. (To be done)
Furthermore, this diff removes some tests from test_lite_interpreter.cpp which were testing module hierarchy in debug info. They should be covered by test_mobile_profiler.cpp.
Test Plan:
aibench run
Model without debug info:
https://www.internalfb.com/intern/aibench/details/219598441154763
Model with debug info and --print_module_info true (note that the operator summary now has module hierarchy information).
https://www.internalfb.com/intern/aibench/details/617154236292985
Reviewed By: raziel
Differential Revision: D30680354
fbshipit-source-id: b6ba0d59c510c13d13d9935b1d8051cc82ffa4e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63367
This diff changes the way operator profiling is done in the lite predictor
benchmarking binary.
Instead of using custom callbacks, it uses KinetoEdgeCPUProfiler to profile
events and then generates operator-level metrics from them.
Since KinetoEvents do not contain CPU clock time, we now report only wall-clock
time.
This unifies the various profiling efforts we have for benchmarking purposes. In
production we will still use the observer-based mechanism, but the advantage of
using the Kineto profiler is that we get a few other things for free, such as:
- chrome trace generation.
- operator-level memory profiling (to be added)
- flop counts (to be added)
Furthermore, we can possibly use a Python post-processing script to parse the chrome
trace and generate output similar to torch.profiler. (To be done)
Test Plan:
aibench run
Model without debug info:
https://www.internalfb.com/intern/aibench/details/219598441154763
Model with debug info and `--print_module_info true` (note that the operator summary now has module hierarchy information).
https://www.internalfb.com/intern/aibench/details/617154236292985
Reviewed By: raziel
Differential Revision: D30327514
fbshipit-source-id: 3bb2f2daaaedfb04bd6f5d9c91292783f9c4344f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64010
Fixing typos in an error message.
Test Plan:
Error message before fix:
Lite Interpreter verson number does not match. The model version must be between 3 and 5But the model version is 6
Error message after fix:
Lite Interpreter version number does not match. The model version must be between 3 and 5 but the model version is 6
Reviewed By: larryliu0820
Differential Revision: D30568367
fbshipit-source-id: 205f3278ee8dcf38579dbb828580a9e986ccacc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61996
A recent post https://fb.workplace.com/groups/pytorch.edge.users/posts/2012215235600341/ about slow model loading, with an accompanying perf report (report.html), caused me to look at the report and find hot spots during model loading. This suggested that we spend quite a bit of time looking up operators from the dispatcher. This means that we can probably just cache the operator handler functions (instead of computing them every time the operator name shows up, since it potentially shows up multiple times in a given model).
This diff results in an approx 7% speedup in model loading time (from [315ms](https://www.internalfb.com/intern/aibench/details/45077128343028) to [293ms](https://www.internalfb.com/intern/aibench/details/600870874797229)) when run against an 87MB speech model that jiatongzhou provided.
See https://fb.workplace.com/groups/pytorch.dev/posts/855724575006024/ for the previous post from jiatongzhou.
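A hedged sketch of the caching idea (not the exact code in this diff): operator handles looked up from the dispatcher can be memoized per qualified name, so repeated occurrences of the same operator in the bytecode reuse the first lookup.
```cpp
#include <ATen/core/dispatch/Dispatcher.h>

#include <string>
#include <unordered_map>

// Illustrative memoization of dispatcher lookups keyed by "name.overload".
// Not thread-safe; the real loader caches the appended operator functions,
// this sketch only shows the lookup-caching idea.
c10::optional<c10::OperatorHandle> findOpCached(
    const std::string& name,
    const std::string& overload_name) {
  static std::unordered_map<std::string, c10::optional<c10::OperatorHandle>>
      cache;
  const std::string key = name + "." + overload_name;
  auto it = cache.find(key);
  if (it == cache.end()) {
    it = cache
             .emplace(
                 key,
                 c10::Dispatcher::singleton().findSchema({name, overload_name}))
             .first;
  }
  return it->second;
}
```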
ghstack-source-id: 134634612
Test Plan:
Run using AI Bench.
### Speech Transducer v25 model (87MiB)
Followed up with jiatongzhou and he gave me his speech model. For posterity, here's how to fetch it (you don't need to, since I uploaded it to NMLML and it now has a permanent Everstore Handle):
```
cd /tmp/
mkdir speech_model
cd speech_model
fbpkg fetch speech.stella.neural_transducer.on_device.en_us:25
cp pytorchmodel.pt ~/speech_transducer_v25_pytorchmodel.ptl
```
Here's how to build and run the benchmark using AI Bench:
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/speech_transducer/v25.json --framework pytorch --platform android/arm64 --devices "S8US" --force_profile --remote
```
Reviewed By: raziel
Differential Revision: D29826210
fbshipit-source-id: 134b67eb466e73f0e43447b9b966278f13c4b56f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61731
In the previous diff, metadata_map contains mobile_info.json and producer_info.json, and we need to parse the JSON each time we log the required information. This diff flattens the content of those files into key/value pairs, which allows the logger to loop directly over metadata_map and log the information.
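A minimal sketch of the flattening step, assuming a generic JSON library (nlohmann/json here purely for illustration; the helper name and choice of library are not from this diff):
```cpp
#include <nlohmann/json.hpp>  // assumption: any JSON library with a similar API works

#include <string>
#include <unordered_map>

// Hypothetical helper: flatten one level of a JSON object such as
// mobile_info.json into string key/value pairs that a logger can iterate
// over directly, without re-parsing JSON at log time.
std::unordered_map<std::string, std::string> flattenJson(
    const std::string& json_text) {
  std::unordered_map<std::string, std::string> flat;
  const auto parsed = nlohmann::json::parse(json_text);
  for (const auto& item : parsed.items()) {
    const auto& value = item.value();
    // Store scalars as their string form; nested values keep their JSON dump.
    flat[item.key()] =
        value.is_string() ? value.get<std::string>() : value.dump();
  }
  return flat;
}
```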
Test Plan:
Since 3D Photo is disabled in the current FB app, testing is only performed on the CC scanner.
# Test On CC Scanner
**Test content with LOG(WARNING)**
{P429123273}
**Scuba Logger Output**
1. MOBILE_MODULE_LOAD_STATS
{F631884673}
2. MOBILE_MODULE_STATS
{F631884787}
Reviewed By: xcheng16
Differential Revision: D29690702
fbshipit-source-id: 1db5a1f5c25e98e5b2f1cc254fd880dfdfa025e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61600
This diff changes the module.h constructor and removes metadata_. It refactors all the constructor call sites and creates a getter & setter for metadata_. MOBILE_MODULE_STATS reads the metadata from mobile::Module and passes it into the logger.
Test Plan:
Since 3D Photo is disabled in the current FB app, testing is only performed on the CC scanner.
# Test On CC Scanner
**Test content with LOG(WARNING)**
{P428930572}
**Scuba Logger Output**
{F631761194}
Reviewed By: xcheng16
Differential Revision: D29673184
fbshipit-source-id: 962e0d7b06a07caaa0c695a4ac58b885fd1505ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61480
Append mobile_info.json and producer_info.json to extra_files and parse the JSON from “model_info.json” in onExitLoadModel.
ghstack-source-id: 133327912
Test Plan:
# Test On CC Scanner
**Test content with LOG(WARNING)**
{P428339274}
**Scuba Logger Output**
{F631024095}
# Test On 3D Photo
**Test content with LOG(WARNING)**
{P428340927}
**Scuba Logger Output**
{F631026739}
Reviewed By: xcheng16, guangy10
Differential Revision: D29608014
fbshipit-source-id: abc39c44b947632fd4349de8a432649e84284a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59472
Previously, the lite interpreter would refuse to load any model
with a version greater than kProducedBytecodeVersion. Now, we're
able to independently advance the loading and saving code, so we
can roll out changes without breaking forward compatibility.
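A minimal sketch of the resulting check (constant names, values, and message are illustrative; the real loader's constants may differ): loading is gated on a supported range rather than on kProducedBytecodeVersion.
```cpp
#include <c10/util/Exception.h>

#include <cstdint>

// Illustrative constants: the range this runtime can load is allowed to run
// ahead of the version the saving code currently produces.
constexpr int64_t kMinSupportedBytecodeVersion = 3;
constexpr int64_t kMaxSupportedBytecodeVersion = 5;
constexpr int64_t kProducedBytecodeVersion = 4;  // saving can lag loading

void checkLoadable(int64_t model_version) {
  // The loader checks the supported range, not kProducedBytecodeVersion,
  // so it can accept newer models than the writer currently emits.
  TORCH_CHECK(
      model_version >= kMinSupportedBytecodeVersion &&
          model_version <= kMaxSupportedBytecodeVersion,
      "Lite Interpreter version number does not match. The model version must be between ",
      kMinSupportedBytecodeVersion,
      " and ",
      kMaxSupportedBytecodeVersion,
      " but the model version is ",
      model_version);
}
```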
Test Plan:
CI.
Loaded a bytecode v5 model even with setting kProducedBytecodeVersion
to v4.
Reviewed By: raziel
Differential Revision: D28904350
fbshipit-source-id: 598c22f0adf47d4ed3e976bcbebdf3959dacb1df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59354
Check whether the model has `bytecode.pkl` and provide a proper error message before loading the model. Test it by loading a model.pt and a model.ptl.
```
>>> from torch.jit.mobile import _load_for_lite_interpreter
>>> _load_for_lite_interpreter("/Users/chenlai/Documents/pytorch/data/mobilenet_v2.pt")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/chenlai/pytorch/torch/jit/mobile/__init__.py", line 48, in _load_for_lite_interpreter
cpp_module = torch._C._load_for_lite_interpreter(f, map_location) # type: ignore[attr-defined]
RuntimeError: The model is not generated from the api _save_for_lite_interpreter. Please regenerate the module by scripted_module._save_for_lite_interpreter('model.ptl'). Refer to https://pytorch.org/tutorials/prototype/lite_interpreter.html for more details.
```
Differential Revision: D28856713
Test Plan: Imported from OSS
Reviewed By: dhruvbird
Pulled By: cccclai
fbshipit-source-id: c3f9a3b64459dda6811d296371c8a2eaf22f8b20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58182
As title, the v5 model format will be
```
(base) chenlai@chenlai-mp reuse_constant % zipinfo /Users/chenlai/Documents/pytorch/reuse_constant/tmp/zip/script_module_v5_unify.ptl
Archive: /Users/chenlai/Documents/pytorch/reuse_constant/tmp/zip/script_module_v5_unify.ptl
Zip file size: 3120 bytes, number of entries: 7
-rw---- 0.0 fat 77 bl stor 80-000-00 00:00 script_module_v4_unify/data.pkl
-rw---- 0.0 fat 240 bl defN 80-000-00 00:00 script_module_v4_unify/code/__torch__/___torch_mangle_5.py
-rw---- 0.0 fat 422 bl defN 80-000-00 00:00 script_module_v4_unify/code/__torch__/___torch_mangle_5.py.debug_pkl
-rw---- 0.0 fat 64 bl stor 80-000-00 00:00 script_module_v4_unify/constants/140245072983168.storage
-rw---- 0.0 fat 172 bl stor 80-000-00 00:00 script_module_v4_unify/constants.pkl
-rw---- 0.0 fat 678 bl stor 80-000-00 00:00 script_module_v4_unify/bytecode.pkl
-rw---- 0.0 fat 2 bl stor 80-000-00 00:00 script_module_v4_unify/version
7 files, 1655 bytes uncompressed, 1453 bytes compressed: 12.2%
```
bytecode.pkl is:
```
(5,
('__torch__.___torch_mangle_5.TestModule.forward',
(('instructions',
(('STOREN', 1, 2),
('DROPR', 1, 0),
('LOADC', 0, 0),
('LOADC', 1, 0),
('MOVE', 2, 0),
('OP', 0, 0),
('LOADC', 1, 0),
('OP', 1, 0),
('RET', 0, 0))),
('operators', (('aten::add', 'int'), ('aten::add', 'Scalar'))),
('constants',
(torch._utils._rebuild_tensor_v2(pers.obj(('storage',
torch.DoubleStorage,
'140245072983168.storage',
'cpu',
8),),
0,
(2, 4),
(4, 1),
False,
collections.OrderedDict()),
1)),
('types', ()),
('register_size', 2)),
(('arguments',
((('name', 'self'),
('type', '__torch__.___torch_mangle_5.TestModule'),
('default_value', None)),
(('name', 'y'), ('type', 'int'), ('default_value', None)))),
('returns',
((('name', ''), ('type', 'Tensor'), ('default_value', None)),)))))
```
constants.pkl is:
```
(torch._utils._rebuild_tensor_v2(pers.obj(('storage', torch.DoubleStorage, '140245072983168.storage', 'cpu', 8),),
0,
(2, 4),
(4, 1),
False,
collections.OrderedDict()),)
```
Both tensors will refer to the tensor at the path `script_module_v4_unify/constants/140245072983168.storage`.
## Note
According to the unified format, all tensors should be written to the folder `.data`; however, torch.jit.load() can't handle the unified format at this moment, so this change writes tensors to the `constants` folder, and mobile will write/read tensors from the `constants` folder as well, such that the model can be interpreted by both JIT and mobile.
ghstack-source-id: 129010347
Test Plan: buck test mode/dev //caffe2/test/cpp/jit:jit
Reviewed By: raziel, iseeyuan
Differential Revision: D28375257
fbshipit-source-id: 6544472db4c957c5ea037e0bb5112b637dd15897
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56845
Handle forward/backward compatibility caused by added default arguments in mobile. As an example,
In older version, operator aten::foo's schema is
```
foo(Tensor a, Tensor b) -> Tensor
```
In the new version, the schema is updated to
```
foo(Tensor a, Tensor b, int groups=1) -> Tensor
```
## Model file
Serialize the number of specified arguments for each operator into the bytecode operator table. Before, the operator table contained only the operator name and overload name:
```
('operators', (('aten::foo', ''),))
```
Now the number of specified arguments is added:
```
# bytecode version 6
('operators', (('aten::foo', '', 2),))
```
where "2" means the number of specified arguments.
Since there's bytecode schema change, the bytecode version number is bumped. This PR is to be landed after #56002 , where the version number is bumped from 4 to 5. This PR bumps the version number from 5 to 6.
## Runtime and backward compatibility
When the operator is found (either jit or c10), we have the OperatorHandle, where the operator schema can be accessed by
```
op.value().schema().arguments()
```
Adaptation is implemented to handle backward compatibility. For the example above, the new runtime holds the updated schema:
```
foo(Tensor a, Tensor b, int groups=1) -> Tensor
```
Whereas the model file carries
```
(('aten::foo', ''), 2)
```
We can implement a wrapper around the original function pointer to push the default arguments onto the stack, as sketched below.
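A hedged sketch of such a wrapper (simplified; the real adapter in the mobile loader handles more details than shown here):
```cpp
#include <ATen/core/function_schema.h>
#include <ATen/core/ivalue.h>
#include <ATen/core/stack.h>

#include <cstddef>
#include <functional>
#include <utility>

using OpFunction = std::function<void(torch::jit::Stack&)>;

// Wrap an operator function so that, when the model only specified
// `num_specified_args` arguments, the remaining arguments' default values
// from the (newer) runtime schema are pushed before calling the op.
OpFunction wrapWithDefaults(
    OpFunction fn,
    c10::FunctionSchema schema,
    size_t num_specified_args) {
  return [fn = std::move(fn), schema = std::move(schema), num_specified_args](
             torch::jit::Stack& stack) {
    const auto& args = schema.arguments();
    for (size_t i = num_specified_args; i < args.size(); ++i) {
      // default_value() is an optional<IValue>; a newly added trailing
      // argument is expected to carry one.
      stack.push_back(*args[i].default_value());
    }
    fn(stack);
  };
}
```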
## Delivery time and forward compatibility
At model delivery time, two checks can be done:
### Operator check
Two APIs to be provided:
* Runtime: An API to get a runtime’s ops and their schemas (i.e. the # of args). D27920185(WIP)
* Model: An API to get a model’s ops and their schema requirements (i.e. the # of args required).
The APIs can be used to check
* runtime.ops() is a superset of model.ops()
* for each op in model.ops() validate their schemas are compatible with those in runtime.ops() -- i.e. the # args required in a model op are <= # args in the runtime op.
Note that only root ops in the model need to be checked here; for transient ops it's not necessary. For example, if a root op, "aten::root", calls "aten::foo", it's "aten::root"'s responsibility to adapt to "aten::foo"'s change, or "aten::root" itself needs to be updated too.
### Bytecode version backport (PR coming)
When delivering a model with bytecode v6, if the runtime only works with bytecode v5 and lower, backport is needed.
* The number of arguments is removed from the operator table
* The bytecode version is changed from 6 to 5
Note that this backport is a pure format change; it does not guarantee that the backported model always runs in an old runtime. The operator check mentioned above should be done first, before the model is backported to v5.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D27986544
Pulled By: iseeyuan
fbshipit-source-id: 143e19d4798cfb96b65095538dd648eead4e3fda
Summary:
Add an API, `_get_bytecode_version`, to get the version number of a bytecode model in both C++ and Python; the input can be either a file path or a buffer.
## Test
CI (the newly added unit test will run as part of `pytorch_core-buck`)
1. run test_lite_interpreter.cpp
2. `python test/mobile/test_bytecode.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56801
ghstack-source-id: 128169647
Test Plan:
CI (the newly added unit test will run as part of `pytorch_core-buck`)
1. run test_lite_interpreter.cpp
2. `python test/mobile/test_bytecode.py`
Reviewed By: iseeyuan
Differential Revision: D27961417
fbshipit-source-id: f786cc9573d855feecff0b4fe8e5363e25f5728c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55711
Currently, there is some complex logic that tries to handle all exceptions but re-throws them as a `c10::Error` so that it can log the error message. I'm looking for context on why this was added. The simpler logic in this diff (arrived at after talking with swolchok) seems equivalent, and also preserves the original stack trace from where the exception was originally thrown. This is useful when viewing the backtrace in logview. Re-throwing an exception using `TORCH_CHECK(false, message)` results in the original exception stack trace getting lost, so we want to avoid that.
ghstack-source-id: 128043281
Test Plan: Build.
Reviewed By: iseeyuan
Differential Revision: D27688352
fbshipit-source-id: b7b1a29b652b31da80d72f16d284e48b8623377b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55252
Earlier, for bytecode serialization, we were saving debug handles only for OP instructions and not for all
instructions. This PR makes changes to add debug handles for all instructions.
Test Plan:
python test/mobile/test_lite_script_module.py TestLiteScriptModule
Imported from OSS
Reviewed By: dreiss
Differential Revision: D27542502
fbshipit-source-id: cff75118c721ce9f0c2f60d2c9471481f05264ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55062
This diff introduces the following changes:
1. An InlinedCallStack pickler/serializer is introduced. It is serialized
as a tuple of {module_instance_info, source range tag, callee:InlinedCallStack}.
Module instance info is serialized as a tuple of {class_type_name,
instance_name}.
Note that the callee of the serialized inlined callstack points to the tuple
of an already serialized callstack. This means the first callstack pointer to
be serialized will serialize the entire path of the tree, where some callee
nodes might be shared with callstack pointers that will be serialized
subsequently. The pickler supports memoization of pickled objects, so if
a tuple has already been serialized, its object id is emitted instead of
serializing the object again. Thus we still serialize the tree, and not every
path from the root separately. Furthermore, InlinedCallStackSerializer
also uses a cache to look up the pointer and return the serialized IValue.
Note that we must also serialize the source range of the
InlinedCallStack. In order to do this, the serializer requires a map of
source-range tags to source ranges. This was done in the previous
diff, where as part of source range serialization we also generate
unique tags. These are the tags that are serialized in the InlinedCallStack.
Thus, during deserialization, we have to deserialize source ranges
before deserializing InlinedCallStacks.
2. Furthermore, each serialized InlinedCallStack is serialized with a
unique debug_handle and source range tag.
BackendDebugHandleManager manages generation of
unique debug handles and saves the map of
debug-handles-to-{source_range_tag, inlined-callstack-ptr}.
This map is then serialized as callstack_debug_map.pkl. Note that
the inlined callstack is not sufficient to get all the source information,
since it only contains source information about the nodes which are inlined.
The top-of-the-stack (or bottom) node, which is the actual op node, is
not part of the inlined callstack pointer, and thus the source range of
this node is serialized separately using source_range_tag. This is
similar to how JIT creates callstacks in
torch/csrc/jit/runtime/interpreter.cpp.
Unique debug handles facilitate exception throwing or profiling using
just the debug handle, without any further qualifications such as which
function or module the inlined callstack belongs to.
Furthermore, this diff refactors the old mobile code for tracking
module hierarchy information per op. Now, bytecode serialization
serializes debug handles corresponding to ops/nodes in the graph and has
callstack_debug_map.pkl help generate:
1. Entire callstack and
2. Module hierarchy information.
Test Plan:
python test/mobile/test_lite_script_module.py TestLiteScriptModule
./build/bin/test_jit --gtest_filter=*ModuleInfo
Imported from OSS
Reviewed By: raziel
Differential Revision: D27468709
fbshipit-source-id: 53e2413e7703ead01c77718b7c333c7c6ff50a23
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54284
In order to bring mobile deployment, via the lite interpreter, to feature
parity with JIT with respect to model-level debug information, we must make
model-level debug information available to the mobile runtime.
At the moment, model-level debug information is stored in SourceRange,
which associates nodes of a graph with where they come from in the original
Python source code.
This information is serialized as part of debug_pkl and deserialized
when JIT loads the model and reads the model code.
In the lite interpreter, we do not have access to all the functionality of
JIT, and hence we cannot load the model in the same way as JIT, by reading
code, constructing the module hierarchy and the graphs corresponding to module
methods, etc. Instead, in the lite interpreter, only the bytecode corresponding to
the compiled graph, Code, is saved.
Thus in order to annotate OPs in the bytecode with equivalent
SourceRange information we do the following:
1. During model serialization, we create a unique tag for each source
range of the model.
2. Create a map of <SourceRange, tag>
3. During debug_pkl serialization we save tag along with SourceRange, on
top of byte offset.
4. During bytecode generation, the methods of the top module are
lowered. During this process methods are inlined. In the inlined graph,
when a node of the graph is lowered to bytecode, we query the node's source
range and look it up against the map.
5. Resulting source range tag is serialized in module_debug_info.
6. During model deserialization, we read all the debug_pkl records in
the archive and create a map of <tag, SourceRange>.
7. This map can be used to find source code information.
During mobile runtime:
1. We read all the debug_pkl records and create <tag=debug_handle,
SourceRange> map.
1.1 This map, MobileDebugInfo, is a member of mobile Module.
2. Interpreter catches appropriate exceptions and sets the thread local
debug handle and rethrows the exception.
3. In Function's run method we catch exception and query current debug
handle where the exception happened.
4. Query MobileDebugInfo with debug handle to retrieve source range and
augment error with source range info.
This information is still incomplete as it does not contain the entire
callstack.
In the following diffs we will serialize InlinedCallStack directly.
Note that compilation is gated by the SYMBOLICATE_MOBILE_DEBUG_HANDLE macro,
so that mobile builds can avoid building MobileDebugInfo, source ranges,
and the source range pickler/unpickler. Later we will add a path where, if
building without debug support, the stack trace will contain only debug
handles. They can be symbolicated later.
Test Plan:
Ported a bunch of source range tests from test_jit.py. Added one more test
in test_lite_interpreter.py.
Imported from OSS
Reviewed By: raziel
Differential Revision: D27174722
fbshipit-source-id: a7b7c6088ce16dec37e823c7fefa4f0b61047e12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57098
1. Separate `readArchiveAndTensors()` from `jit/import.cpp` to a new file `jit/import_read.cpp`.
2. Use `readArchiveAndTensors()` in `mobile/import.cpp`.
3. Add a util function in C++ that can read .pkl files directly instead of loading the entire module.
ghstack-source-id: 127703081
Test Plan: CI
Reviewed By: raziel, iseeyuan
Differential Revision: D28052193
fbshipit-source-id: c8d57f3270bdcf2e52a32f7c111899bd5da7cac2
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56044
We want to be able to drop the dependency on full-jit in the auto-generated unit tests, for 2 reasons:
1. Running bloaty on the auto-generated unit tests should be somewhat representative of the actual size.
2. The runtime environment of the auto-generated unit tests should be as close to the production environment as possible to ensure that we are running the tests in a production-like runtime.
Due to the dependence on full-jit, we aren't there yet. For the auto-generated tests, we probably don't need to depend on `_export_operator_list()` eventually, but for now we do, since it is used to decide whether the model being run is a Metal GPU model or a CPU model, and it gates whether the test runs that model or not.
Eventually, we can stop doing this in the test and do it in the codegen from PTM-CLI instead (by fetching the operators from that tool, and writing out to the BUCK file which backend(s) this model is targeting). However, that will take some time to land, so in the spirit of expediency, this change is being proposed.
Discussed this offline with iseeyuan
ghstack-source-id: 126656877
Test Plan: Build + BSB.
Reviewed By: iseeyuan
Differential Revision: D27694781
fbshipit-source-id: f31a2dfd40803c02f4fd19c45a3cc6fb9bdf9697
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53578
We want to be able to log the loaded module size to the scuba table `qpl_metrics/pytorch`. Hence, adding the `model_size` field to the logged metadata when logging a module load success event.
ghstack-source-id: 123980964
Test Plan: xcheng16 How should this be tested?
Reviewed By: xcheng16, raziel
Differential Revision: D26902971
fbshipit-source-id: a7c2e9120706bd31f76f6572c8503d4acf8a89e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52814
Currently, there is no way to load a model on a devvm (CPU) if that model has operators that the runtime doesn't support. This ends up happening (currently) for Metal GPU models, and potentially in the future for other backends that have backend-specific operators that don't have a registered implementation (even a dummy one) on CPU.
There are at least a couple reasons for why this is needed:
1. We want to extract the operator list directly from the bytecode (instead of looking it up from `mobile_info.json`).
2. We want to be able to trace the quantized operators that are invoked when loading the compressed weights for a model that has prepacked weights. xta0 root-caused this after husthyc discovered that there are untraced operators showing up when loading a Metal GPU model.
If we want to scale out to support different types of models, we absolutely need the ability to load a model on a devvm irrespective of what backend (device/etc...) it is targeted at.
ghstack-source-id: 123284366
Test Plan: The next diff in this stack is using the newly introduced methods.
Reviewed By: iseeyuan
Differential Revision: D26656266
fbshipit-source-id: eed9af2f7b55979e9c18b986b8c3b9a767153297
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52870
Add the missing parts to support to_backend modules in the lite interpreter.
1. Add ISINSTANCE instruction support, which is used in to_backend for output type check.
2. Bypass the lite interpreter's type parser by checking the qualified name: if it starts with "torch.jit", use the same type resolver as for nn modules (whose qualified names start with "__torch__"). See the sketch below.
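A minimal sketch of that check (illustrative helper only, not the actual loader code):
```cpp
#include <string>

// Route both "__torch__.*" (nn module) and "torch.jit.*" (to_backend wrapper)
// qualified names through the class type resolver instead of the lite
// interpreter's limited type parser.
bool shouldUseClassTypeResolver(const std::string& qualified_name) {
  auto starts_with = [&](const char* prefix) {
    return qualified_name.rfind(prefix, 0) == 0;
  };
  return starts_with("__torch__") || starts_with("torch.jit");
}
```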
Tests
The mobile module is serialized and loaded in ```BackendTest.TestCompiler```. The results are compared to those from the original TorchScript module.
Test Plan: Imported from OSS
Reviewed By: raziel
Differential Revision: D26715351
Pulled By: iseeyuan
fbshipit-source-id: ad9d74ee81c6aa692ab9e5dd7a9003bae5d4f01f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51432
ghstack-source-id: 120976584
torchbind is a convenient way to expose a custom class to both Python and TorchScript. CREATE_OBJECT is used to create an object of a custom class.
CREATE_OBJECT was not supported by the lite interpreter. The major reason was that for custom classes defined directly in Python, there's no language parser in the lite interpreter; that is still the case. However, for torchbind classes that are defined in C++, a Python/TorchScript parser is not needed.
This diff is to support the case of torchbind custom classes.
1. The class type can be resolved at import level.
2. If the class is not a supported torchbind class, an error message is provided at the export stage, and a workaround is also suggested.
3. Unit tests. C++: ```LiteInterpreterTest::BuiltinClass``` is added as an end-to-end test on a supported class. Python: ```test_unsupported_createobject``` is changed to ```test_unsupported_classtype``` to test unsupported classes.
Test Plan: CI
Reviewed By: raziel
Differential Revision: D26168913
fbshipit-source-id: 74e8b6a12682ad8e9c39afdfd2b605c5f8e65427
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863
Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).
Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation
buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg
Reviewed By: iseeyuan
Differential Revision: D25896212
fbshipit-source-id: 6d7e7fd5f3244a88bd44889024d81ad2e678ffa5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51236
The problem we currently have with tracing is that GPU models can't load on devvm CPU machines. Here's why:
1. The Metal GPU ops don't exist so the validation that checks for missing ops kicks in and prevents loading
2. Even if the check for missing ops is commented out, the actual model contents can't be successfully loaded (see t83364623 for details)
Hence, to work around these problems and allow tracing to detect GPU models, and skip actual tracing for these (as discussed in the meeting 2 weeks ago and based on recommendations from raziel, iseeyuan, and xta0), we're adding code to detect these GPU models based on the set of operators that show up in the file `extra/mobile_info.json`.
The code then skips tracing, and picks up the root operators from the model itself.
The diff below this one will be removed before landing since we don't want to check in the model - I've kept it here in case anyone wants to patch this diff in and run the command on their devvm locally.
ghstack-source-id: 120638092
Test Plan:
See {P168657729} for a successful run of tracing on a GPU model (person segmentation tier-0, v1001) provided by xta0
Also ran `buck test //xplat/pytorch_models/build/...` successfully.
Reviewed By: ljk53
Differential Revision: D26109526
fbshipit-source-id: 6119b0b59af8aae8b1feca0b8bc29f47a57a1a67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50932
After the change to split `_load_for_mobile()` into multiple methods, one which takes in the `extra_files` map and one which doesn't, we can change the implementation of the `deserialize()` method to use different overloads as well. Suggested by raziel on D25968216 (bb909d27d5).
ghstack-source-id: 120185089
Test Plan: Build/Sandcastle.
Reviewed By: JacobSzwejbka
Differential Revision: D26014084
fbshipit-source-id: 914142137346a6246def1acf38a3204dd4c4f52f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50795
There's [a post](https://fb.workplace.com/groups/2148543255442743/permalink/2583012411995823/) about a customer having to pass in `-Wno-global-constructors` to disable warnings related to calling constructors for global objects. This is related to the initialization of `default_extra_files_mobile` in `import.h`.
It requires end users to pass in the compiler flag, since the definition is now in code (.cpp files) that they will be compiling.
In addition, it makes the API for `_load_for_mobile` non-re-entrant (i.e. can not be safely used concurrently from multiple threads without the caller taking a mutex/lock) if the `extra_files_mobile` argument is not explicitly passed in.
Instead, a better option would be to create different overloads: one which requires all 3 parameters, and one that works with 1-2. This solves the problem without creating a static variable. A rough sketch of the pattern is below.
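A rough sketch of the overload pattern (toy names and simplified signatures, not the real `_load_for_mobile` declarations): the no-extra-files overload forwards to the full one with a local map, so no global default-argument object, no global constructor, and no shared mutable state.
```cpp
#include <istream>
#include <string>
#include <unordered_map>

using ExtraFilesMap = std::unordered_map<std::string, std::string>;

struct MobileModule {};  // stand-in for torch::jit::mobile::Module

// Overload that does the work and fills extra_files.
MobileModule loadForMobile(std::istream& in, ExtraFilesMap& extra_files) {
  // ... real deserialization would go here; the sketch only shows the shape.
  (void)in;
  (void)extra_files;
  return MobileModule{};
}

// Overload without extra_files: uses a local, per-call map instead of a
// global default argument, keeping the API re-entrant.
MobileModule loadForMobile(std::istream& in) {
  ExtraFilesMap unused;
  return loadForMobile(in, unused);
}
```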
ghstack-source-id: 120127083
Test Plan: Build Lite Interpreter and sandcastle.
Reviewed By: raziel
Differential Revision: D25968216
fbshipit-source-id: fbd80dfcafb8ef7231aca301445c4a2ca9a08995
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50521
Original commit changeset: 9731ec1e0c1d
Test Plan:
- run `arc focus2 -b pp-ios //xplat/arfx/tracking/segmentation:segmentationApple -a ModelRunner --force-with-bad-commit `
- build via Xcode, run it on an iOS device
- Click "Person Segmentation"
- Crash observed without the diff patched, and the segmentation image is able to be loaded with this diff patched
Reviewed By: husthyc
Differential Revision: D25908493
fbshipit-source-id: eef072a8a3434b932cfd0646ee78159f72be5536
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49916
Test Plan:
1. Build pytorch locally. `MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ USE_CUDA=0 DEBUG=1 MAX_JOBS=16 python setup.py develop`
2. Run `python save_lite.py`
```
import torch
# ~/Documents/pytorch/data/dog.jpg
model = torch.hub.load('pytorch/vision:v0.6.0', 'shufflenet_v2_x1_0', pretrained=True)
model.eval()
# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
import pathlib
import tempfile
import torch.utils.mobile_optimizer
input_image = Image.open('~/Documents/pytorch/data/dog.jpg')
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model
# move the input and model to GPU for speed if available
if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')
with torch.no_grad():
    output = model(input_batch)
# Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
print(output[0])
# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
print(torch.nn.functional.softmax(output[0], dim=0))
traced = torch.jit.trace(model, input_batch)
sum(p.numel() * p.element_size() for p in traced.parameters())
tf = pathlib.Path('~/Documents/pytorch/data/data/example_debug_map_with_tensorkey.ptl')
torch.jit.save(traced, tf.name)
print(pathlib.Path(tf.name).stat().st_size)
traced._save_for_lite_interpreter(tf.name)
print(pathlib.Path(tf.name).stat().st_size)
print(tf.name)
```
3. Run `python test_lite.py`
```
import torch
from torch.jit.mobile import _load_for_lite_interpreter
# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
input_image = Image.open('~/Documents/pytorch/data/dog.jpg')
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model
reload_lite_model = _load_for_lite_interpreter('~/Documents/pytorch/experiment/example_debug_map_with_tensorkey.ptl')
with torch.no_grad():
    output_lite = reload_lite_model(input_batch)
# Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
print(output_lite[0])
# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
print(torch.nn.functional.softmax(output_lite[0], dim=0))
```
4. Compare the result with pytorch in master and pytorch built locally with this change, and see the same output.
5. The model size was 16.1 MB and becomes 12.9 MB with this change.
Imported from OSS
Reviewed By: kimishpatel, iseeyuan
Differential Revision: D25731596
Pulled By: cccclai
fbshipit-source-id: 9731ec1e0c1d5dc76cfa374d2ad3d5bb10990cf0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49385
Currently, the API to export operator lists accepts a `torch::jit::Module` object and spits out an operator list. The operator list is practically used only for mobile. This is not ideal because the set of root operators may change by the time the model is subsequently optimized and exported for mobile.
What we need to do instead is glean the list of operators from the mobile model itself (`bytecode.pkl` specifically), and expose that.
Also updated the logic in `converter`.
### Before this change:
1. Get operator List from Torch Script Model
2. Convert to bytecode mobile model
### After this change:
1. Convert to bytecode mobile model
2. Use this converted mobile model to get the list of operators for each method on the model
ghstack-source-id: 118796752
Test Plan:
Added a unit test in `test_lite_interpreter.cpp` to ensure that all operators referenced by the model show up in the exported operator list. Also made `test_lite_interpreter.cpp` runnable from `xplat/caffe2/BUCK`, since this is where the production code will be built from.
Verified that the list of operators produced before and after this change for an example model (segmentation) are the same.
{P147863234}
Also verified that the operator lists for BI-Xray model is different (we have been having problems with missing operators for this one): {P154903132}
Reviewed By: iseeyuan
Differential Revision: D24690094
fbshipit-source-id: 0426a6ef90456a811010cfe337c415882ae2deff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863
Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).
Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation
buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg
Reviewed By: raziel, iseeyuan
Differential Revision: D25152559
fbshipit-source-id: bbf52f1fbdbfbc6f8fa8b65ab524b1cd4648f9c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47425
Extra files can be exported in a lite interpreter model, but they could not be loaded. This PR adds the capability to load extra files from a lite interpreter model. Because extra_files is a default argument, it should not affect the existing usage of _load_for_mobile. It's a simple assembly of a generic unordered_map; no additional dependency should be introduced, and the size overhead should be small (to be tested). A minimal usage sketch follows.
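A minimal C++ usage sketch, assuming the overload of `torch::jit::_load_for_mobile` that takes the extra-files map by reference (header path, model filename, and extra-file key are illustrative):
```cpp
#include <torch/csrc/jit/mobile/import.h>

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
  // Keys name the extra files to pull out of the archive; the values are
  // filled in during loading. This is the ExtraFilesMap type (an
  // unordered_map<string, string>).
  std::unordered_map<std::string, std::string> extra_files{{"metadata.json", ""}};
  auto module = torch::jit::_load_for_mobile(
      "model.ptl", /*device=*/c10::nullopt, extra_files);
  std::cout << extra_files["metadata.json"] << "\n";
  (void)module;
}
```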
Test Plan: Imported from OSS
Reviewed By: kwanmacher
Differential Revision: D24770266
Pulled By: iseeyuan
fbshipit-source-id: 7e8bd301ce734dbbf36ae56c9decb045aeb801ce