Commit Graph

14 Commits

Author SHA1 Message Date
Hao Lu
0521e420fd [Static Runtime] Temporarily disable fusion tests (#55342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55342

The fusion stuff is pretty hard to debug. Given that we're not shipping this part of the stack any time soon, let's temporarily disable them and re-enable them when somebody has the cycles to debug them.

Test Plan: Verified that the tests are now disabled

Reviewed By: ajyu

Differential Revision: D27578573

fbshipit-source-id: cb8d7c9339f7c1700b7653b0231cf570996995ff
2021-04-05 20:54:02 -07:00
Bram Wasti
56f8379802 [static runtime] Move all heavy constructor logic into InferenceModule (renamed to StaticModule) (#51564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564

Constructor logic was spread throughout InferenceModule and StaticRuntime.  This diff unifies the two.  After a lot of discussion on this diff D25961626 it became apparent that `clone` is uglier than a cheap StaticRuntime.

This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.

```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```

Changelist:
- Rename InferenceModule StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D25592967

fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
2021-03-05 10:15:26 -08:00
Richard Barnes
a4383a69d4 Clean up some type annotations in caffe2/test (#49943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49943

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25717534

fbshipit-source-id: 5aedea4db07efca126ffb6daee79617c30a67146
2021-01-13 10:01:55 -08:00
Bram Wasti
f4226b5c90 [static runtime] add static subgraph fusion pass (#49185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49185

This diff adds a fusion feature that will let us use static runtime for *parts* of the graph.  This will prove useful in cases where fully eliminating control flow is hard etc.

TODO:
[x] factor out into separate fusion file
[x] add python test case
[x] add graph that isn't fully lowered test case
[x] add graph that has weird list/tuple outputs test case

the loop example looks quite good:
```
graph(%a.1 : Tensor,
      %b.1 : Tensor,
      %iters.1 : int):
  %12 : bool = prim::Constant[value=1]() # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
  %c.2 : Tensor = prim::StaticSubgraph_0(%a.1, %b.1)
  %c : Tensor = prim::Loop(%iters.1, %12, %c.2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
    block0(%i : int, %c.12 : Tensor):
      %c.10 : Tensor = prim::StaticSubgraph_1(%a.1, %c.12, %b.1)
      -> (%12, %c.10)
  return (%c)
with prim::StaticSubgraph_0 = graph(%0 : Tensor,
      %4 : Tensor):
  %5 : int = prim::Constant[value=2]()
  %6 : Tensor = aten::mul(%4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:12
  %2 : int = prim::Constant[value=1]()
  %c.2 : Tensor = aten::add(%0, %6, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:8
  return (%c.2)
with prim::StaticSubgraph_1 = graph(%1 : Tensor,
      %7 : Tensor,
      %8 : Tensor):
  %9 : int = prim::Constant[value=1]()
  %c.4 : Tensor = aten::add(%7, %8, %9) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:111:12
  %5 : int = prim::Constant[value=2]()
  %c.7 : Tensor = aten::mul_(%c.4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:112:8
  %2 : int = prim::Constant[value=1]()
  %c.10 : Tensor = aten::sub_(%c.7, %1, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:113:8
  return (%c.10)
```

(Note: this ignores all push blocking failures!)

Test Plan:
buck test mode/no-gpu //caffe2/benchmarks/static_runtime:static_runtime_cpptest

buck test mode/no-gpu caffe2/test:static_runtime

Reviewed By: bertmaher

Differential Revision: D25385702

fbshipit-source-id: 2f24af4f11d92a959167facd03fbd24f464a6098
2020-12-10 14:03:11 -08:00
Katy Voor
fe7d1d7d0e Add LeakyReLU operator to static runtime (#47798)
Summary:
- Add LeakyReLU operator to static runtime
- Add LeakyReLU benchmark
- Add LeakyReLU correctness test case

Static Runtime
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              4092 ns       4092 ns     172331
BM_leaky_relu/8                              4425 ns       4425 ns     158434
BM_leaky_relu/20                             4830 ns       4830 ns     145335
BM_leaky_relu_const/1                        3545 ns       3545 ns     198054
BM_leaky_relu_const/8                        3825 ns       3825 ns     183074
BM_leaky_relu_const/20                       4222 ns       4222 ns     165999
```

Interpreter
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              7183 ns       7182 ns      96377
BM_leaky_relu/8                              7580 ns       7580 ns      91588
BM_leaky_relu/20                             8066 ns       8066 ns      87183
BM_leaky_relu_const/1                        6466 ns       6466 ns     107925
BM_leaky_relu_const/8                        7063 ns       7063 ns      98768
BM_leaky_relu_const/20                       7380 ns       7380 ns      94564
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47798

Reviewed By: ezyang

Differential Revision: D24927043

Pulled By: kavoor

fbshipit-source-id: 69b12cc57f725f1dc8d68635788813710a74dc2b
2020-11-13 22:05:52 -08:00
Hao Lu
996f444c00 [pt][static_runtime] Memory model (#46896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896

The idea of the memory model is quite similar to that of BlackBoxPredictor, however, it's more complicated in pt due to 1) tensor views that share storage with storage refcount bumps but with different TensorImpls, 2) tensors sharing the same TensorImpl and the same storage, but with no refcount bump of the StorageImpl, 3) data types such as TensorList and Tuples that have Tensors in them, 4) need to support non-out/out variant mix while we move the aten ops to out variants.

As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variant, extract StorageImpls from the Tensors, dedup, and remove output tensor StorageImpls, and get the final list of blobs for memory planning;
3) during the clean_up_memory pass, clean up memory held by the StorageImpls as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning to reduce overall memory usage

Risk:
PyTorch team is planning to deprecate the current resize_outout api, which we do rely on. This is a pretty big risk.

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```

|pt_cleanup_activations	|pt_enable_out_variant	|old ms/iter	|new ms/iter	|
|---	|---	|---	|---	|
|0	|0	|0.31873	|0.30228	|
|0	|1	|0.30018	|0.29184	|
|1	|0	|0.35246	|0.31895	|
|1	|1	|0.35742	|0.30417	|

Reviewed By: bwasti, raziel

Differential Revision: D24471854

fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
2020-11-03 23:47:59 -08:00
Hao Lu
e8d8de32b4 [StaticRuntime] Implement StaticRuntime::benchmark (#45639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45639

`StaticRuntime::run_individual` is to mimic the caffe2 operator benchmark `SimpleNet::TEST_Benchmark`, so we can accurate information on the operator breakdown. We found that the PyTorch AutogradProfiler adds a lot of overhead to small models, such as the adindexer precomputation_merge net, 100% for batch_size 1, 33% for batch_size 20. This implementation adds very little overhead, as shown in the test plan.

Test Plan: Test results are fb internal only.

Reviewed By: yinghai, dzhulgakov

Differential Revision: D24012088

fbshipit-source-id: f32eb420aace93e2de421a15e4209fce6a3d90f0
2020-10-06 20:54:43 -07:00
Bram Wasti
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
Bram Wasti
d1a11618f5 [static runtime] Add _out variants and reuse memory (#44128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44128

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604304

Pulled By: bwasti

fbshipit-source-id: 06a23cb75700a0fc733069071843b7b498e7b9e9
2020-09-25 11:03:06 -07:00
Bram Wasti
a475613d1d [static runtime] Swap to out-variant compatible nodes (#44127)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44127

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604306

Pulled By: bwasti

fbshipit-source-id: 18ccfb9b466b822e28130be3d5c4fae36c76820b
2020-09-14 12:38:25 -07:00
Hao Lu
8538a79bfe [jit][static] Basic executor (#43647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647

Nothing fancy, just a basic implementation of the graph executor without using stack machine.

Reviewed By: bwasti

Differential Revision: D23208413

fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
2020-08-28 23:20:07 -07:00
Hao Lu
8864148823 [jit] DeepAndWide benchmark (#43096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43096

Add benchmark script for deep and wide model.

Reviewed By: bwasti, yinghai

Differential Revision: D23099925

fbshipit-source-id: aef09d8606eba1eccc0ed674dfea59b890d3648b
2020-08-15 01:27:12 -07:00
Bram Wasti
523b2ce9c6 [jit][static runtime] Simplify the graph and add operator whitelist (#43024)
Summary:
This PR whitelists and simplifies graphs to help with development later on.  Key to note in this PR is the use of both a pattern substitution and the registration of custom operators.  This will likely be one of the main optimization types done in this folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43024

Reviewed By: hlu1

Differential Revision: D23114262

Pulled By: bwasti

fbshipit-source-id: e25aa3564dcc8a2b48cfd1561b3ee2a4780ae462
2020-08-13 20:19:55 -07:00
Bram Wasti
ada8404f2d [jit] Scaffold a static runtime (#42753)
Summary:
The premise of this approach is that a small subset of neural networks are well represented by a data flow graph.  The README contains more information.

The name is subject to change, but I thought it was a cute reference to fire.

suo let me know if you'd prefer this in a different spot.  Since it lowers a JIT'd module directly I assumed the JIT folder would be appropriate.  There is no exposed Python interface yet (but is mocked up in `test_accelerant.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42753

Reviewed By: zou3519

Differential Revision: D23043771

Pulled By: bwasti

fbshipit-source-id: 5353731e3aae31c08b5b49820815da98113eb551
2020-08-12 13:05:27 -07:00