Commit Graph

6 Commits

Author SHA1 Message Date
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Raghavan Raman
31584d065e [Static Runtime] Added NNC implementation for signed log1p kernel. (#65387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387

Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.

Also, added a SR microbenchmark for this kernel which shows the performance improvement.

Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615
```

With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186
```

This clearly shows >15% improvement in performance of this kernel with NNC fusion.

On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
  without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
  with NNC fusion: `0.55%`

Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)

```
get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0
```

Reviewed By: hlu1

Differential Revision: D30609492

fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
2021-09-22 15:53:33 -07:00
Bram Wasti
cb046f7bd2 [static runtime] Initial memonger (#47759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47759

Parity reached :)

*/0 -> no memonger
*/1 -> memonger on
We can see that the impact is large when activations don't all fit in cache (6x speed up on this micro bench)
```
BM_long_static_memory_optimization/2/0         8563 ns       8559 ns      86370
BM_long_static_memory_optimization/8/0         8326 ns       8322 ns      84099
BM_long_static_memory_optimization/32/0       11446 ns      11440 ns      56107
BM_long_static_memory_optimization/512/0    6116629 ns    6113108 ns        128
BM_long_static_memory_optimization/2/1         8151 ns       8149 ns      87000
BM_long_static_memory_optimization/8/1         7905 ns       7902 ns      85124
BM_long_static_memory_optimization/32/1       10652 ns      10639 ns      66055
BM_long_static_memory_optimization/512/1    1101415 ns    1100673 ns        641
```

TODO:
[x] implementation
[x] enable/disable flag
[x] statistics about memory saved
[x] additional models

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```

Reviewed By: yinghai

Differential Revision: D24824445

fbshipit-source-id: db1f5239f72cbd1a9444017e20d5a107c3b3f043
2020-11-17 13:55:49 -08:00
Katy Voor
fe7d1d7d0e Add LeakyReLU operator to static runtime (#47798)
Summary:
- Add LeakyReLU operator to static runtime
- Add LeakyReLU benchmark
- Add LeakyReLU correctness test case

Static Runtime
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              4092 ns       4092 ns     172331
BM_leaky_relu/8                              4425 ns       4425 ns     158434
BM_leaky_relu/20                             4830 ns       4830 ns     145335
BM_leaky_relu_const/1                        3545 ns       3545 ns     198054
BM_leaky_relu_const/8                        3825 ns       3825 ns     183074
BM_leaky_relu_const/20                       4222 ns       4222 ns     165999
```

Interpreter
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              7183 ns       7182 ns      96377
BM_leaky_relu/8                              7580 ns       7580 ns      91588
BM_leaky_relu/20                             8066 ns       8066 ns      87183
BM_leaky_relu_const/1                        6466 ns       6466 ns     107925
BM_leaky_relu_const/8                        7063 ns       7063 ns      98768
BM_leaky_relu_const/20                       7380 ns       7380 ns      94564
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47798

Reviewed By: ezyang

Differential Revision: D24927043

Pulled By: kavoor

fbshipit-source-id: 69b12cc57f725f1dc8d68635788813710a74dc2b
2020-11-13 22:05:52 -08:00
Hao Lu
996f444c00 [pt][static_runtime] Memory model (#46896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896

The idea of the memory model is quite similar to that of BlackBoxPredictor, however, it's more complicated in pt due to 1) tensor views that share storage with storage refcount bumps but with different TensorImpls, 2) tensors sharing the same TensorImpl and the same storage, but with no refcount bump of the StorageImpl, 3) data types such as TensorList and Tuples that have Tensors in them, 4) need to support non-out/out variant mix while we move the aten ops to out variants.

As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variant, extract StorageImpls from the Tensors, dedup, and remove output tensor StorageImpls, and get the final list of blobs for memory planning;
3) during the clean_up_memory pass, clean up memory held by the StorageImpls as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning to reduce overall memory usage

Risk:
PyTorch team is planning to deprecate the current resize_outout api, which we do rely on. This is a pretty big risk.

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```

|pt_cleanup_activations	|pt_enable_out_variant	|old ms/iter	|new ms/iter	|
|---	|---	|---	|---	|
|0	|0	|0.31873	|0.30228	|
|0	|1	|0.30018	|0.29184	|
|1	|0	|0.35246	|0.31895	|
|1	|1	|0.35742	|0.30417	|

Reviewed By: bwasti, raziel

Differential Revision: D24471854

fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
2020-11-03 23:47:59 -08:00
Hao Lu
8538a79bfe [jit][static] Basic executor (#43647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647

Nothing fancy, just a basic implementation of the graph executor without using stack machine.

Reviewed By: bwasti

Differential Revision: D23208413

fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
2020-08-28 23:20:07 -07:00