Summary:
By default, TorchScript execution is single-threaded and uses the caller's thread pool. For distributed inference, we want a way to customize this behavior so that the TorchScript interpreter can be executed elsewhere. This diff allows an explicit taskLauncher for the TorchScript interpreter.
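As a rough illustration of the idea (not the actual TorchScript C++ API), a task launcher can be thought of as a callable that receives a unit of work and decides where it runs. The sketch below is a toy in Python; `TaskLauncher`, `inline_launcher`, `make_pool_launcher`, and `run_steps` are hypothetical names introduced only for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Tuple

# Hypothetical type for this sketch: a launcher takes a zero-argument task
# and schedules it wherever it likes.
TaskLauncher = Callable[[Callable[[], None]], None]


def inline_launcher(task: Callable[[], None]) -> None:
    """Run the task immediately on the calling thread."""
    task()


def make_pool_launcher(pool: ThreadPoolExecutor) -> TaskLauncher:
    """Build a launcher that hands tasks to a caller-chosen executor."""
    def launch(task: Callable[[], None]) -> None:
        pool.submit(task)
    return launch


def run_steps(steps: List[Callable[[], Any]],
              launcher: TaskLauncher) -> List[Tuple[Any, str]]:
    """Toy 'interpreter': dispatch every step through the launcher and
    return (result, thread name) pairs in step order."""
    if not steps:
        return []
    results = {}
    remaining = [len(steps)]
    lock = threading.Lock()
    done = threading.Event()

    def wrap(index: int, step: Callable[[], Any]) -> Callable[[], None]:
        def task() -> None:
            value = step()
            with lock:
                results[index] = (value, threading.current_thread().name)
                remaining[0] -= 1
                if remaining[0] == 0:
                    done.set()
        return task

    for i, step in enumerate(steps):
        launcher(wrap(i, step))
    done.wait()
    return [results[i] for i in range(len(steps))]
```

With `inline_launcher` the steps run on the caller's thread; with `make_pool_launcher(pool)` the same steps run on whichever executor the caller supplies, which is the kind of control the summary asks for.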
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46865
Test Plan:
Unit tests pass.
fbshipit-source-id: 1d7b003926c0d1f8facc53206efb960cff8897ac
Reviewed By: houseroad
Differential Revision: D24616102
Pulled By: garroud
fbshipit-source-id: 79202b62f92d0b0baf72e4bf7aa3f05e0da91d59
Summary:
With https://github.com/pytorch/pytorch/pull/35562, we are running peephole optimization on inlining to reduce the number of nodes that are copied.
The tracer encodes the sizes in the graph like:
```
graph(%0 : Double(7)):
  %1 : Function = prim::Constant[name="tensor_size"]()
  %2 : Tensor = prim::CallFunction(%1, %0)
  return (%2)
```
However, people would like to reuse the graph with different shapes, so running size-based optimizations would invalidate that reuse. Long term, it might be better for the tracer not to include shape information, but there are downstream users of it.
This PR separates FuseAddMM out of the peephole pass so that there is now a single `disable_size_optimizations` parameter, and ONNX explicitly invokes FuseAddMM.
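To make the trade-off concrete, here is a minimal toy sketch of a peephole pass with a flag that turns off shape-dependent rewrites. It is not the PyTorch implementation: the `Node` class, the `peephole` function, and the single "fold a size query into a constant" rule are assumptions made up for illustration. Only the parameter name `disable_size_optimizations` comes from the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical IR node for this sketch."""
    op: str                   # e.g. "size" or "constant"
    inputs: Tuple[str, ...]   # names of input values
    output: str               # name of the produced value
    value: Any = None         # payload for "constant" nodes


def peephole(nodes: List[Node],
             known_shapes: Dict[str, Tuple[int, ...]],
             disable_size_optimizations: bool = False) -> List[Node]:
    """Toy pass: replace a size query with a constant when the input's
    shape is statically known, unless size optimizations are disabled."""
    rewritten = []
    for node in nodes:
        can_fold = (
            node.op == "size"
            and not disable_size_optimizations
            and node.inputs[0] in known_shapes
        )
        if can_fold:
            # Folding bakes the traced shape into the graph, so the graph
            # is no longer valid for inputs with a different shape.
            shape = known_shapes[node.inputs[0]]
            rewritten.append(Node("constant", (), node.output, shape))
        else:
            rewritten.append(node)
    return rewritten
```

Calling `peephole(graph, shapes, disable_size_optimizations=True)` leaves the size query in place, so the same graph can still be run with differently shaped inputs.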
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36404
Differential Revision: D20968974
Pulled By: eellison
fbshipit-source-id: 56f8f1699e3b0adeeccdfd5a67bb975fd41a2913
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/35424, only this time I run optimizations in the right order so the PR description is actually true.
This speeds up the inlining pass of the FairSeq model from 180s to 13s, and of the MaskRCNN model from 5s to 1.5s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35562
Differential Revision: D20738922
Pulled By: eellison
fbshipit-source-id: 1439cf9d1f0bc780e2d64a744694f8b3b7ba4b70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34109
This change adds glue to GraphExecutor to give the RPC server
access to the future-based Interpreter::runAsync() api.
Previously, if a server encountered a TorchScript continuation-based block
with fork/wait, it would simply block in the server thread until the handler
completed, since it uses the synchronous Interpreter::run() api.
With the ivalue::Future returned by the Interpreter, we can run the
TorchScript code asynchronously from C++ simply by connecting its
callback to the server callback.
We add test cases to cover the new logic, both rpc_async and remote.
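The pattern described above, returning a future and wiring its completion callback to the server's response callback instead of blocking, can be sketched with Python's standard `concurrent.futures`. This is only an analogy to the C++ change: `run_async`, `handle_request`, and `respond` are hypothetical names, and the executor stands in for wherever the interpreter actually runs.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Tuple


def run_async(pool: ThreadPoolExecutor,
              fn: Callable[..., Any],
              *args: Any) -> Future:
    """Stand-in for a future-returning interpreter entry point."""
    return pool.submit(fn, *args)


def handle_request(pool: ThreadPoolExecutor,
                   fn: Callable[..., Any],
                   args: Tuple[Any, ...],
                   respond: Callable[[Any], None]) -> Future:
    """Non-blocking handler: start the work, attach the server's response
    callback to the resulting future, and return without waiting."""
    future = run_async(pool, fn, *args)

    def on_done(completed: Future) -> None:
        error = completed.exception()
        # Forward either the error or the result to the server callback.
        respond(error if error is not None else completed.result())

    future.add_done_callback(on_done)
    return future
```

The handler thread is free as soon as `handle_request` returns; the response is sent from the callback once the work completes.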
ghstack-source-id: 101245438
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc/...
Differential Revision: D20194321
fbshipit-source-id: 16785ec5d9ed0b16cb1ffab0a9771a77de30fcb0
Summary:
This speeds up the inlining pass of the FairSeq model from 180s to 13s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35424
Differential Revision: D20657271
Pulled By: eellison
fbshipit-source-id: 7a9006858c2f1b157f5a3f36ed2b3774cc186de8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33921
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.intern.facebook.com/intern/diff/D20153092/)!
Test Plan: Imported from OSS
Differential Revision: D20177227
Pulled By: jamesr66a
fbshipit-source-id: 87f3e484c4f873d60f76f50f6789c1b4a73bdfde