Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694
For ops with parameters (e.g. conv), running in JIT mode currently raises
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, we decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work with backward in JIT mode, we probably need to make `TorchBenchmarkBase` a subclass of `nn.Module`.
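A minimal sketch of the failure and the workaround, assuming a hypothetical stand-alone repro (the actual change lives in the benchmark framework's JIT path):
```python
import torch
import torch.nn.functional as F

weight = torch.nn.Parameter(torch.randn(16, 3, 3, 3))

def conv_forward(x):
    # `weight` is a free variable, so tracing bakes it into the graph as a
    # constant; if it still requires grad, this raises the RuntimeError above.
    return F.conv2d(x, weight)

# Workaround: the benchmark only measures the forward pass, so gradients for
# the parameter can be turned off before compiling.
weight.requires_grad_(False)
traced = torch.jit.trace(conv_forward, torch.randn(1, 3, 32, 32))
```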
Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par --use_jit
Reviewed By: mingzhe09088
Differential Revision: D24451206
fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
Summary: benchmark_caffe2 is broken due to a refactoring that changed test generation from eager to registration-only.
Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748
# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570
# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```
Reviewed By: qizzzh
Differential Revision: D24448374
fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679
The current way of importing configs causes a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
ghstack-source-id: 114857978
Test Plan: waitforsandcastle
Reviewed By: vkuzo
Differential Revision: D24459631
fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219
- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);

  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;

  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs;  // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor as well as in the static runtime, and is shared across threads. This is what's left inside the StaticRuntime:
```
mutable std::vector<IValue> reg_;
// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations and differs across threads at runtime. `nodes_` holds the op nodes and input/output registers and is, for now, the same across threads. We could potentially put other stateful data structures into it, so I kept it inside the static runtime. It could easily be moved into `InferenceModule` if we decide not to put anything else into `ProcessedNode`.
- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.
- Integration with PyTorchPredictor. Added a lockfree stack in the PyTorchPredictor to hold all the static runtime instances. Benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.
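A rough Python sketch of the pooling pattern described above (illustrative only: the real implementation is a lock-free C++ stack inside PyTorchPredictor, while Python's `LifoQueue` uses a lock):
```python
import queue

class RuntimePool:
    """Keep one runtime instance per concurrent request; all instances share
    the same immutable InferenceModule, only their registers differ."""

    def __init__(self, make_runtime, size):
        self._stack = queue.LifoQueue()
        for _ in range(size):
            self._stack.put(make_runtime())

    def run(self, *args):
        rt = self._stack.get()      # pop a runtime (thread-local reg_ state)
        try:
            return rt(*args)
        finally:
            self._stack.put(rt)     # push it back for the next request
```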
This diff focuses on the threading model only. Benchmarks will be separate.
Reviewed By: bwasti
Differential Revision: D24237078
fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308
This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overhead of the static runtime. While the static runtime is
currently much faster than the existing JIT interpreter, it would be
useful to understand how close we are to an absolutely zero-overhead
system. Currently, this "ideal" implementation is 2x faster than the
static runtime at batch size 1.
Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 4096K (x24)
L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1                         59518 ns      59500 ns      10909
BM_deep_wide_base/8                         74635 ns      74632 ns       9317
BM_deep_wide_base/20                        82186 ns      82147 ns       9119
BM_deep_wide_fast/1                         13851 ns      13851 ns      49825  << new
BM_deep_wide_fast/8                         22497 ns      22497 ns      32089  << new
BM_deep_wide_fast/20                        23868 ns      23841 ns      31184  << new
BM_deep_wide_jit_graph_executor/1           62786 ns      62786 ns      10835
BM_deep_wide_jit_graph_executor/8           76730 ns      76718 ns       7529
BM_deep_wide_jit_graph_executor/20          78886 ns      78883 ns       8769
BM_deep_wide_jit_profiling_executor/1       69504 ns      69490 ns      10309
BM_deep_wide_jit_profiling_executor/8       75718 ns      75715 ns       9199
BM_deep_wide_jit_profiling_executor/20      75364 ns      75364 ns       9010
BM_deep_wide_static/1                       40324 ns      40318 ns      17232
BM_deep_wide_static/8                       50327 ns      50319 ns      13335
BM_deep_wide_static/20                      53075 ns      53071 ns      12855
BM_deep_wide_static_threaded/threads:8       6258 ns      49873 ns      14008
```
PS: The implementation could probably be optimized even more.
Differential Revision: D24300702
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Pulled By: ZolotukhinM
fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
Summary: Add an operator benchmark for 4-bit/8-bit embedding lookups in `aibench`.
Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```
The run was successful in aibench: https://www.internalfb.com/intern/aibench/details/738300474 and https://www.internalfb.com/intern/aibench/details/346463246
Reviewed By: radkris-git
Differential Revision: D24268413
fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124
We want to make sure we can actually fuse kernels within a fairly
tight time budget. So here's a quick benchmark of codegen for a simple
pointwise activation function (swish). I kept all the intermediate tensors
separate to force TE to actually do inlining.
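For reference, the computation being compiled looks roughly like the following Python sketch of swish with explicit intermediates (the benchmark itself builds the equivalent tensorexpr expressions in C++):
```python
import torch

def swish_with_intermediates(x):
    # Each step is a separate temporary, so the codegen has several
    # intermediates it must inline rather than one pre-fused expression.
    neg = -x
    e = torch.exp(neg)
    denom = e + 1.0
    sig = 1.0 / denom      # sigmoid(x)
    return x * sig         # swish(x) = x * sigmoid(x)
```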
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```
I've only run this in debug mode, so the results aren't super meaningful, but even in
that mode compilation takes 18 ms, 15 ms of which are in LLVM.
Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
----------------------------------------------------------------------------
BM_CompileSwish             5123276 ns      5119846 ns          148
BM_CompileSwishLLVMOnly     4754361 ns      4753701 ns          160
```
Reviewed By: asuhan
Differential Revision: D24232801
fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
Summary:
This PR modifies `benchmarks/tensorexpr`. It follows up on https://github.com/pytorch/pytorch/pull/44101 and further supports characterizing fusers with dynamic-shape benchmarks. The dynamic-shape condition models the use case where the input tensor shapes change on each call to the graph.
Changes include:
* Added an auxiliary class `DynamicShape` that provides a simple API for enabling dynamic shapes in existing test cases; an example can be found with `DynamicSimpleElementBench` (see the sketch after this list).
* Created new bench_cls: `DynamicSimpleElementBench`, `DynamicReduce2DInnerBench`, `DynamicReduce2DOuterBench`, and `DynamicLSTM`. They are all dynamically shaped versions of existing benchmarks and examples of enabling dynamic shapes with `DynamicShape`.
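As a rough illustration of the pattern (not the actual `DynamicShape` API), a dynamically shaped element-wise benchmark draws a fresh input shape before every call:
```python
import random
import torch

class DynamicSimpleElementSketch:
    """Illustrative only: regenerate the input shapes on every call so each
    iteration exercises the fuser with a different tensor size."""

    def __init__(self, max_dim=1 << 12):
        self.max_dim = max_dim

    def instantiate_input(self):
        m = random.randint(1, self.max_dim)
        n = random.randint(1, self.max_dim)
        return torch.rand(m, n), torch.rand(m, n)

    def forward(self):
        a, b = self.instantiate_input()   # new shape on each call
        return a + b                      # simple element-wise op
```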
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46107
Reviewed By: glaringlee
Differential Revision: D24229400
Pulled By: bertmaher
fbshipit-source-id: 889fece5ea87d0f6f6374d31dbe11b1cd1380683
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
`sparse` is confusing because it is used in training for sparse gradients
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853
The method name in the README is not consistent with the actual implementation.
Reviewed By: qizzzh
Differential Revision: D24114849
fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875
Adds a google benchmark harness for perf-testing programs generated by
tensorexpr, without any PyTorch wrappers (for Python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).
Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm call to give a baseline).
Right now there's just an unoptimized implementation that is expected to be
fairly slow. More optimized versions are coming.
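The baseline measurement is conceptually equivalent to this Python sketch (the actual harness is C++ on google benchmark; the numbers below come from that harness, not from this sketch):
```python
import time
import torch

M = N = K = 128
a = torch.randn(M, K)
b = torch.randn(K, N)

iters = 1000
start = time.perf_counter()
for _ in range(iters):
    torch.mm(a, b)
elapsed = (time.perf_counter() - start) / iters

# A gemm performs 2*M*N*K floating-point operations (multiply + add).
gflops = 2 * M * N * K / elapsed / 1e9
print(f"torch.mm baseline: {elapsed * 1e9:.0f} ns/iter, {gflops:.2f} GFLOP/s")
```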
Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 256K (x24)
L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                              Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128              73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128  3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x == 0`.
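For example:
```python
import torch

z = torch.tensor([3 + 4j, 0j, -2 + 0j])
s = torch.sgn(z)
# 3+4j has magnitude 5, so its sgn is 0.6+0.8j; 0 maps to 0+0j; -2 maps to -1+0j.
```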
This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide the autograd behavior (JAX vs TF) and add gradcheck.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler.
Besides adding an option to select the NVIDIA fuser, this support includes a few other modifications:
* Adds FP16 Datatype support
* Fixes the SOL/algorithmic calculations to use the actual data-type size instead of a fixed 4 bytes (see the sketch after this list)
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs currently only for reductions
* Adds further reduction support for Inner and Outer dimension reductions that are compatible with the `input_iter` knob.
* Adds `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion.
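A minimal sketch of the byte accounting mentioned above, assuming hypothetical helper names:
```python
import torch

def algorithmic_bytes(tensors):
    # Hypothetical helper: count memory traffic using the actual element size
    # (2 bytes for FP16, 4 for FP32, ...) instead of assuming 4 bytes.
    return sum(t.numel() * t.element_size() for t in tensors)

def achieved_bandwidth_gb_s(tensors, seconds):
    return algorithmic_bytes(tensors) / seconds / 1e9
```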
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101
Reviewed By: ngimel
Differential Revision: D23713658
Pulled By: bertmaher
fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270
The previous PR (#44212) was reverted because I didn't update the
`upload_scribe.py` script: it was still looking for the 'executor_and_fuser'
field in the JSON, which is now replaced with two separate fields,
'executor' and 'fuser'.
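A hypothetical sketch of the consumer-side change (the actual `upload_scribe.py` schema may differ):
```python
def executor_and_fuser(record):
    # The combined 'executor_and_fuser' field is gone; read the two separate
    # fields instead, with defensive defaults.
    return record.get("executor", "unknown"), record.get("fuser", "unknown")
```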
Differential Revision: D23561500
Test Plan: Imported from OSS
Reviewed By: ngimel
Pulled By: ZolotukhinM
fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647
Nothing fancy, just a basic implementation of the graph executor without using a stack machine.
Reviewed By: bwasti
Differential Revision: D23208413
fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43603
We are in the midst of landing a big rework of the profiling executor, and
benchmarks are expected to fail while we are in this transitional state.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23334818
Pulled By: ZolotukhinM
fbshipit-source-id: 99ff17c6f8ee18d003f6ee76ff0e719cea68c170
Summary:
Reland of the benchmark code that broke the slow tests because the GPUs were running out of memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43428
Reviewed By: ngimel
Differential Revision: D23296136
Pulled By: albanD
fbshipit-source-id: 0002ae23dc82f401604e33d0905d6b9eedebc851
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956
In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for a better representation of real workloads (see the sketch below)
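A rough sketch of what such a config might look like in the operator_benchmark harness (shapes and names are illustrative, not the committed values):
```python
import operator_benchmark as op_bench

# Illustrative config: histogram observers are restricted to CPU (CUDA is too
# slow to benchmark), and larger shapes better represent real workloads.
qobserver_configs = op_bench.cross_product_configs(
    C=[3, 64],
    M=[256, 1024],
    N=[256, 1024],
    device=['cpu'],   # no 'cuda' for histogram observers
    tags=['short'],
)
```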
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23093996
fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018
This diff fixes an issue where the original non-learnable fake quantize was provided with a trainable scale and zero point; `requires_grad` should be completely disabled for both parameters.
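A minimal sketch of the intended behavior (hypothetical tensors, not the actual FakeQuantize module code):
```python
import torch

# Non-learnable fake quantize: scale and zero_point are fixed tensors with
# gradients disabled, rather than trainable parameters.
scale = torch.tensor(0.1, requires_grad=False)
zero_point = torch.tensor(0.0, requires_grad=False)

x = torch.randn(4, 4, requires_grad=True)
y = torch.fake_quantize_per_tensor_affine(x, scale.item(), int(zero_point.item()), 0, 255)
y.sum().backward()   # gradients flow to x only; scale and zero_point stay fixed
```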
Test Plan:
Use the following command to execute the benchmark test:
`buck test mode/dev-nosan pt:quantization_test`
Reviewed By: vkuzo
Differential Revision: D23107846
fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42946
There are 3 options for the executor and fuser, and some of them aren't
super interesting, so I've combined the options into a single parameter, but
made it fairly easy to expand the set if there are other configs we might care
about.
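A rough sketch of expanding one combined setting into the underlying JIT switches (the mode names here are illustrative; the `torch._C` toggles are real but private APIs):
```python
import torch

def set_execution_mode(mode):
    # Illustrative mapping from a single combined benchmark parameter to the
    # separate executor/fuser switches.
    if mode == "legacy-old":          # legacy executor + old fuser
        torch._C._jit_set_profiling_executor(False)
        torch._C._jit_set_texpr_fuser_enabled(False)
    elif mode == "profiling-te":      # profiling executor + tensorexpr fuser
        torch._C._jit_set_profiling_executor(True)
        torch._C._jit_set_texpr_fuser_enabled(True)
    else:
        raise ValueError(f"unknown mode: {mode}")
```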
Test Plan:
Benchmark it
Imported from OSS
Reviewed By: zheng-xq
Differential Revision: D23090177
fbshipit-source-id: bd93a93c3fc64e5a4a847d1ce7f42ce0600a586e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767
Same as the previous PR: forces the qlinear benchmark to follow the fp one.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23013937
fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761
Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23012533
fbshipit-source-id: af30ee585389395569a6322f5210828432963077